Using BuzzFeed Quizzes to Explain Principal Component Analysis (PCA)

In this article I will discuss an analogy that I find very helpful when trying to comprehend what PCA is doing.

BACKGROUND

Datasets with many features (high dimensionality) cause two problems: longer training times and a higher risk of overfitting. The first is easy to grasp. More things for the computer to digest = more time. Overfitting is less straightforward. It means your model fits the training data so closely that it predicts poorly on new observations. I find the way Aurélien Géron explains this idea in his book Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow insightful.

“[There’s] just plenty of space in high dimensions. As a result, high-dimensional datasets are at risk of being very sparse: most training instances are likely to be far away from each other. This also means that a new instance will likely be far away from any training instance, making predictions much less reliable than in lower dimensions, since they will be based on much larger extrapolations.”
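To make the sparsity point concrete, here is a small simulation using only Python's standard library. It is a sketch rather than anything from Géron's book: the function name `mean_pairwise_distance`, the choice of 100 random points, and the dimensions tested are all assumptions made for illustration. It scatters points uniformly in a unit hypercube and measures how far apart they are on average.

```python
import math
import random


def mean_pairwise_distance(n_points, n_dims, rng):
    """Average Euclidean distance over all pairs of random points
    drawn uniformly from the unit hypercube with n_dims dimensions."""
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    total = 0.0
    pairs = 0
    for i in range(n_points):
        for j in range(i + 1, n_points):
            total += math.dist(points[i], points[j])
            pairs += 1
    return total / pairs


# Illustrative settings (assumptions): 100 points, a fixed seed for repeatability.
rng = random.Random(42)
for n_dims in (2, 10, 100, 1000):
    distance = mean_pairwise_distance(100, n_dims, rng)
    print(f"{n_dims:>5} dimensions: mean distance = {distance:.2f}")
```

The number of points stays the same while the average distance between them keeps growing with the number of dimensions, which is the sparsity the quote describes.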

In summary, high dimensionality increases training time and makes finding a good solution harder. PCA counteracts this by reducing a dataset’s dimensionality.

PCA EXPLAINED

PCA helps us generalize data. Think of it as a BuzzFeed quiz. When you take a BuzzFeed quiz, you answer a handful of questions and get a single result that summarizes you. For example, the quiz could tell you which Disney character you are most like.

Now, let’s take this idea and apply it to an arbitrary dataset. Think of every observation in that dataset as a person and the features as different attributes describing that person. If we have too many attributes describing each person, we can get lost in every individual’s uniqueness. Consequently, we will have trouble accurately predicting new people’s behavior since we are unable to generalize. Therefore, using BuzzFeed quizzes to generalize each person is helpful.

A big idea of PCA is this: PCA creates a number of principal components equal to the number of features or the number of observations, whichever is smaller. Think of each principal component as a different BuzzFeed quiz. Just as a quiz result combines your answers to several questions, each principal component is a weighted combination of the original features. Each quiz can then be used to generalize the person. We lose some information about the person by doing this, but we gain simplicity. Finally, we need to decide how many quizzes to use.
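The quiz idea can be sketched in code for the simplest possible case: two features. The example below uses only the standard library and the closed-form eigen-decomposition of a 2x2 covariance matrix, which is one way to compute PCA but not necessarily how any particular library does it. The function name `pca_2d` and the five "people" (hours of exercise and hours spent outdoors per week) are made up for illustration. Each component's weights act like a quiz's scoring key, and each person's score is their quiz result.

```python
import math


def pca_2d(rows):
    """PCA for exactly two features, using the closed-form eigenvalues
    and eigenvectors of the 2x2 sample covariance matrix."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    centered = [(x - mean_x, y - mean_y) for x, y in rows]

    # Sample covariance matrix [[a, b], [b, c]].
    a = sum(dx * dx for dx, _ in centered) / (n - 1)
    c = sum(dy * dy for _, dy in centered) / (n - 1)
    b = sum(dx * dy for dx, dy in centered) / (n - 1)

    # Eigenvalues of a symmetric 2x2 matrix, largest first.
    mid = (a + c) / 2
    spread = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    eigenvalues = [mid + spread, mid - spread]

    # Matching unit eigenvectors: these are the principal components.
    if b == 0:
        # Features are uncorrelated, so the components are the original axes.
        components = [(1.0, 0.0), (0.0, 1.0)] if a >= c else [(0.0, 1.0), (1.0, 0.0)]
    else:
        components = []
        for lam in eigenvalues:
            vx, vy = lam - c, b
            norm = math.hypot(vx, vy)
            components.append((vx / norm, vy / norm))

    # Each person's "quiz results": projections onto each component.
    scores = [[dx * vx + dy * vy for vx, vy in components] for dx, dy in centered]

    # Share of the total variance that each component explains.
    total = sum(eigenvalues)
    ratios = [lam / total for lam in eigenvalues]
    return components, ratios, scores


# Hypothetical people: (hours of exercise per week, hours outdoors per week).
people = [(2.0, 1.9), (4.0, 4.2), (6.0, 5.8), (8.0, 8.1), (10.0, 9.9)]
components, ratios, scores = pca_2d(people)
print("scoring keys (components):", components)
print("share of variance explained:", ratios)
print("first quiz result per person:", [round(s[0], 2) for s in scores])
```

With two features and five observations there are two components, matching the "whichever is smaller" rule. Because the two made-up attributes rise and fall together, the first component captures nearly all of the variance, so the first quiz result alone describes each person almost as well as both original attributes.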

Unsurprisingly, some quizzes are more useful than others at describing a person. The best quizzes are insightful for large populations because they accurately describe most people who take them. In machine learning terms, a quiz's usefulness corresponds to the share of the dataset's variance that a principal component explains. Therefore, we can decide how many principal components (quizzes) to use by deciding how much of the variance in the dataset we want to explain.

For example, if the first four principal components account for 90% of the variance, but you want to account for 95%, you will need to use more principal components. Once you have your principal components, what comes next?
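The selection rule above can be sketched as a short helper. The function name `components_needed` and the list of explained-variance ratios are hypothetical, chosen so that the first four components account for 90% of the variance as in the example.

```python
def components_needed(explained_variance_ratios, target):
    """Return the smallest number of leading components whose cumulative
    explained variance reaches `target` (a fraction between 0 and 1).

    Assumes the ratios are sorted from largest to smallest.
    """
    cumulative = 0.0
    for count, ratio in enumerate(explained_variance_ratios, start=1):
        cumulative += ratio
        # Small tolerance so floating-point rounding does not push us
        # one component past the target.
        if cumulative >= target - 1e-12:
            return count
    raise ValueError("target exceeds the total variance explained")


# Hypothetical ratios: the first four sum to 0.90.
ratios = [0.50, 0.20, 0.12, 0.08, 0.04, 0.03, 0.02, 0.01]
print(components_needed(ratios, 0.90))  # -> 4
print(components_needed(ratios, 0.95))  # -> 6
```

With these made-up numbers, moving the target from 90% to 95% means going from four components to six. If you use scikit-learn, passing a fraction such as `PCA(n_components=0.95)` asks the library to make this choice for you, and the fitted model's `explained_variance_ratio_` attribute holds the per-component ratios.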

CONCLUSION
