Maximize Feature Selection: Harnessing Pairwise Correlation
Understanding the Correlation Coefficient
In the previous installment about Feature Selection, we examined a method for eliminating features based solely on their individual characteristics. In this article, we will delve into a more effective and robust approach that enables us to analyze the relationships between features, helping us determine their significance. This method revolves around the concept of Pairwise Correlation.
To start, let's discuss Pearson's correlation coefficient, commonly represented as r. This coefficient quantifies the linear relationship between two distributions (or features) into a single value, ranging from -1 to 1. A score of -1 indicates a perfect negative correlation, while +1 indicates a perfect positive correlation.
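As a quick, self-contained illustration (a toy example of my own, not from the article), NumPy's corrcoef function computes this coefficient directly:

```python
import numpy as np

# Two made-up "features": height and arm span in centimeters
height = np.array([160, 165, 170, 175, 180, 185])
arm_span = np.array([158, 166, 171, 174, 182, 186])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r
r = np.corrcoef(height, arm_span)[0, 1]
print(f"Pearson's r: {r:.2f}")  # close to +1, i.e. a strong positive linear relationship
```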
For example, body measurements typically exhibit strong positive correlations; a taller individual is likely to have longer limbs, and broader shoulders often correlate with higher weight. Conversely, sales of hot drinks tend to rise as temperatures drop: when one variable increases, the other decreases, illustrating a negative correlation.
For a comprehensive guide on interpreting and utilizing the correlation coefficient, refer to my previous article.
So, how does this correlation coefficient relate to Machine Learning or feature selection? If the correlation coefficient between two features is 0.85, the relationship between them is strongly linear: squaring the coefficient gives r² ≈ 0.72, meaning roughly 72% of the variance in feature 2 can be explained by a linear function of feature 1. Hence, if feature 1 is present in our dataset, feature 2 offers minimal additional information, and retaining it complicates model training without adding value.
The essence of using pairwise correlation for feature selection lies in identifying clusters of highly correlated features, retaining only one to maximize predictive power while minimizing the feature set.
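To make this concrete, here is a minimal sketch of that idea (my own illustration, not the article's code): compute the absolute correlation matrix, look at each pair once via the upper triangle, and drop one feature from every pair above a threshold. The 0.9 cutoff and the function name are assumptions.

```python
import numpy as np
import pandas as pd

def drop_correlated_features(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one feature from every pair whose absolute correlation exceeds `threshold`."""
    # Absolute pairwise correlations of the numeric columns only
    corr = df.select_dtypes("number").corr().abs()
    # Keep only the upper triangle (excluding the diagonal) so each pair is counted once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # A column is dropped if it is highly correlated with any column that comes before it
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)
```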
Visualizing the Correlation Matrix
The most straightforward and effective way to identify highly correlated features is through a correlation matrix. This matrix illustrates the correlation between every numeric feature pair in the dataset.
Consider the Melbourne Housing dataset, which consists of 13 numeric features. We can easily calculate the correlation matrix by invoking the .corr() method on the DataFrame, followed by Seaborn's heatmap function to create an appealing visual representation.
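A minimal sketch of that step (the CSV file name is a placeholder for wherever your copy of the dataset lives):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Melbourne Housing data (path is an assumption; adjust as needed)
melb = pd.read_csv("melb_data.csv")

# Pairwise correlations of the numeric columns
corr = melb.select_dtypes("number").corr()

sns.heatmap(corr)
plt.show()
```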
However, the default heatmap can be improved. We can create a customized diverging color palette (blue -> white -> red), center the color bar around 0, enable annotations to display each correlation value, and format it to two decimal points.
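Continuing from the snippet above, one way to apply those tweaks with Seaborn (the exact palette hues are my choice, not necessarily the author's):

```python
# Diverging blue -> white -> red palette
cmap = sns.diverging_palette(250, 15, as_cmap=True)

plt.figure(figsize=(10, 8))
sns.heatmap(corr, cmap=cmap, center=0,      # center the colormap at 0
            annot=True, fmt=".2f",          # print each coefficient with two decimals
            vmin=-1, vmax=1, square=True)
plt.show()
```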
As shown, the matrix displays each correlation twice, since the correlation between feature A and B is identical to that of feature B and A. Additionally, the diagonal consists of 1s, representing self-correlations. This excess information can be overwhelming; therefore, we can eliminate the upper half of the matrix.
The trick is a 2D boolean mask, as shown in the sketch below. The np.ones_like function creates a NumPy array of True values with the same shape as the correlation matrix, and np.triu keeps only its upper triangle. Passing this mask to the heatmap's mask argument hides the redundant half, producing a clean plot.
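Continuing from the customized heatmap above, a sketch of that masking step:

```python
import numpy as np

# All-True array with the same shape as the correlation matrix,
# trimmed to its upper triangle (including the diagonal)
mask = np.triu(np.ones_like(corr, dtype=bool))

# `mask=` hides the redundant upper half and the diagonal of 1s
sns.heatmap(corr, mask=mask, cmap=cmap, center=0,
            annot=True, fmt=".2f", vmin=-1, vmax=1, square=True)
plt.show()
```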
We can wrap all of this into a small function that creates these tidy correlation matrices on demand.
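The article's original helper is not reproduced here, but a function along these lines (its name and defaults are my own assumptions) produces the same kind of plot:

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

def plot_correlation_matrix(df, figsize=(10, 8)):
    """Plot a lower-triangle correlation heatmap of the numeric columns of `df`."""
    corr = df.select_dtypes("number").corr()
    mask = np.triu(np.ones_like(corr, dtype=bool))
    cmap = sns.diverging_palette(250, 15, as_cmap=True)

    fig, ax = plt.subplots(figsize=figsize)
    sns.heatmap(corr, mask=mask, cmap=cmap, center=0, annot=True, fmt=".2f",
                vmin=-1, vmax=1, square=True, ax=ax)
    return ax

# Usage: plot_correlation_matrix(melb)
```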
Even though we could programmatically identify all highly correlated features, visual exploration remains essential. A heatmap can reveal whether correlations are logical. For example, a high correlation between the number of newborns and the number of storks might not be meaningful. Thus, engaging in visual exploration helps ascertain whether feature groups are genuinely related.
Conclusion
In conclusion, employing pairwise correlation enables the identification of highly correlated features that contribute no new information to the dataset. Since these features merely complicate the model, increase the risk of overfitting, and demand greater computational resources, they should be eliminated.
However, it is crucial to gain a thorough understanding of your dataset through proper exploratory data analysis (EDA) before applying this technique. Always be vigilant for correlations that appear nonsensical, such as spurious correlations between otherwise unrelated features.
Enjoyed this article? Imagine accessing a plethora of similar content, crafted by a brilliant and engaging author (that's me!). For just $4.99, you can unlock not only my writings but a wealth of knowledge from the brightest minds on Medium. If you use my referral link, you will receive my heartfelt gratitude and a virtual high-five for supporting my work.