PCA is a dimensionality-reduction algorithm. Reducing dimensionality matters because of the “curse of dimensionality.” Boiled down, the “curse of dimensionality” means that too many features can be a bad thing. The two main problems that come with high dimensionality are prolonged training times and overfitting.
It is easy to see how increasing the number of features would increase training times. More things for the computer to digest = more time. Overfitting, on the other hand, can be less straightforward. Overfitting means your model has fit the training data so closely that it makes poor predictions on new observations. I think the way Aurélien Géron explains this idea in his book Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow is insightful.
“[There’s] just plenty of space in high dimensions. As a result, high-dimensional datasets are at risk of being very sparse: most training instances are likely to be far away from each other. This also means that a new instance will likely be far away from any training instance, making predictions much less reliable than in lower dimensions, since they will be based on much larger extrapolations.”
In summary, high dimensionality increases training time and makes finding a good solution harder. PCA counteracts this by reducing a dataset’s dimensionality.
Now that we have the background covered, let’s talk PCA. PCA can be hard to wrap your mind around, so I’m going to start with a thought experiment. Think about yourself for a minute. There are so many little things that have made you who you are. Consequently, if someone has ever asked you to tell them about yourself, it might have been hard to answer. I find it difficult to simplify all the little things into an accurate picture of myself. However, that’s what you end up doing. You generalize. Instead of telling someone you’ve been running 3 miles every other day for 20 years, you would most likely say you enjoy running. Or maybe you’ve seen over 100 Broadway musicals. So you tell the person you like Broadway rather than naming every show you have seen.
PCA operates similarly. It helps us generalize data. Think of it as a BuzzFeed quiz. When you take a BuzzFeed quiz, you answer questions and get a result. For example, the quiz could tell you which Disney character you are most like.
Now, let’s take this idea and apply it to an arbitrary dataset. Think of every observation in that dataset as a person and the features as different attributes describing that person. If we have too many attributes describing each person, we can get lost in every individual’s uniqueness. Consequently, we will have trouble accurately predicting new people’s behavior since we are unable to generalize. Therefore, using BuzzFeed quizzes to generalize each person is helpful.
A big idea of PCA is this: PCA creates as many principal components as the smaller of the number of features and the number of observations. Think of each principal component as a different BuzzFeed quiz. Each quiz can then be used to generalize the person. We will lose information about the person by doing this, but we gain simplicity. Finally, we need to decide how many quizzes we should use.
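You can see this cap on the number of components directly in scikit-learn. A minimal sketch with made-up numbers (10 observations, 5 features, neither from the article):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(10, 5))  # 10 people, 5 attributes each

pca = PCA()   # keep every component for now
pca.fit(X)

# PCA can produce at most min(n_observations, n_features) components
print(pca.n_components_)  # 5, since min(10, 5) == 5
```

If you flipped the shape to 5 observations of 10 features, you would get 5 components instead: the number of observations becomes the binding limit.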
Unsurprisingly, some quizzes are more useful than others at describing a person. The best quizzes are insightful for large populations because they accurately describe most people who take them. In machine learning terms, this is equivalent to the variance in a dataset that a principal component explains. Therefore, we can decide how many principal components (quizzes) to use by determining how much of the variance in the dataset we want to explain.
For example, if the first four principal components account for 90% of the variance, but you want to account for 95% of the variance, you will need to utilize more principal components. So now that you have your principal components, what next?
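Here is a hedged sketch of how you might pick the number of components from the explained variance, again with made-up data (the 95% target comes from the example above; everything else is illustrative). scikit-learn exposes the per-component variance shares as `explained_variance_ratio_`, and it can even pick the component count for you if you pass a float to `n_components`:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))  # illustrative data: 100 observations, 10 features

pca = PCA().fit(X)

# Running total of variance explained by the first k components
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_needed = int(np.argmax(cumulative >= 0.95)) + 1
print("components needed for 95% of the variance:", n_needed)

# Shortcut: a float between 0 and 1 tells scikit-learn to keep just
# enough components to explain that fraction of the variance
pca_95 = PCA(n_components=0.95).fit(X)
print("scikit-learn chose:", pca_95.n_components_)
```

The ratios always sum to 1 across all components, so the cumulative total lets you read off exactly where any variance target is crossed.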
The final step is to put these quiz results to work! As we just learned, PCA takes a bunch of features and explains them with fewer features. This means we don’t have to use all the features in our model. Instead, we use the new features we just made (quiz results). AND VOILÀ! Faster model and less risk of overfitting. Consequently, you might finally have that free time you wanted to take that new BuzzFeed quiz.
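To make that final step concrete, here is a minimal end-to-end sketch: fit PCA on the training data, transform both splits, and feed the reduced features to a model. The dataset (scikit-learn’s built-in digits) and the logistic-regression classifier are my illustrative choices, not from the article:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)  # 64 pixel-intensity features per image
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit PCA on the training set only, keeping 95% of the variance
pca = PCA(n_components=0.95).fit(X_train)
X_train_r = pca.transform(X_train)  # the "quiz results"
X_test_r = pca.transform(X_test)

clf = LogisticRegression(max_iter=5000).fit(X_train_r, y_train)
print(X_train.shape[1], "features reduced to", X_train_r.shape[1])
print("test accuracy:", round(clf.score(X_test_r, y_test), 3))
```

Note that PCA is fit on the training split alone and then applied to the test split, so no information from the test set leaks into the transformation.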