Maximize Feature Selection: Harnessing Pairwise Correlation

Understanding the Correlation Coefficient

In the previous installment about Feature Selection, we examined a method for eliminating features based solely on their individual characteristics. In this article, we will delve into a more effective and robust approach that enables us to analyze the relationships between features, helping us determine their significance. This method revolves around the concept of Pairwise Correlation.

To start, let's discuss Pearson's correlation coefficient, commonly represented as r. This coefficient quantifies the linear relationship between two distributions (or features) into a single value, ranging from -1 to 1. A score of -1 indicates a perfect negative correlation, while +1 indicates a perfect positive correlation.

For example, body measurements typically exhibit strong positive correlations; a taller individual is likely to have longer limbs, and broader shoulders often correlate with higher weight. Conversely, ice cream sales tend to decrease as temperatures drop, illustrating a negative correlation.

For a comprehensive guide on interpreting and utilizing the correlation coefficient, refer to my previous article.

So, how does this correlation coefficient relate to Machine Learning or feature selection? If the correlation coefficient between two features is 0.85, it implies that 85% of the time, we can predict feature 2 based solely on feature 1's values. Hence, if feature 1 is present in our dataset, feature 2 offers minimal additional information. Thus, retaining feature 2 complicates model training without adding value.

The essence of using pairwise correlation for feature selection lies in identifying clusters of highly correlated features, retaining only one to maximize predictive power while minimizing the feature set.

Visualizing the Correlation Matrix

The most straightforward and effective way to identify highly correlated features is through a correlation matrix. This matrix illustrates the correlation between every numeric feature pair in the dataset.

Consider the Melbourne Housing dataset, which consists of 13 numeric features. We can easily calculate the correlation matrix by invoking the .corr() method on the DataFrame, followed by Seaborn's heatmap function to create an appealing visual representation.

However, the default heatmap can be improved. We can create a customized diverging color palette (blue -> white -> red), center the color bar around 0, enable annotations to display each correlation value, and format it to two decimal points.

As shown, the matrix displays each correlation twice, since the correlation between feature A and B is identical to that of feature B and A. Additionally, the diagonal consists of 1s, representing self-correlations. This excess information can be overwhelming; therefore, we can eliminate the upper half of the matrix.

In line 4 of the code snippet, a 2D boolean mask is generated. The np.ones_like function creates a 2D numpy array with True values, which is then transformed into a boolean mask using np.triu. We can then apply this mask to the heatmap's masking argument, producing a clean plot.

Here is a small function to create these ideal correlation matrices:

Even though we could programmatically identify all highly correlated features, visual exploration remains essential. A heatmap can reveal whether correlations are logical. For example, a high correlation between the number of newborns and the number of storks might not be meaningful. Thus, engaging in visual exploration helps ascertain whether feature groups are genuinely related.

Removing Highly Correlated Features

Next, we will explore how to eliminate highly correlated features. We will utilize the Ansur Dataset, which includes extensive body measurements of 6000 US Army personnel and contains 98 numeric features. This dataset is perfect for honing your feature selection skills, as it features numerous correlated attributes.

Focusing solely on numeric features, we will isolate them for analysis. Notably, attempting to visualize the correlation matrix for such a large dataset would be impractical.

Our goal is to eliminate strong correlations, both positive and negative. Therefore, we will construct the matrix using the absolute values of correlations by chaining the .abs() method with .corr().

Subsequently, we will create a boolean mask to subset the matrix. The mask function within Pandas DataFrames allows us to assign NaN values to the upper half and diagonal of the matrix.

Next, we need to determine a threshold to decide whether to remove a feature. This threshold must be chosen with care and a thorough understanding of the dataset. In this case, we will set a threshold of 0.9:

The list comprehension identifies all columns that meet the 0.9 threshold. Using "any" enables us to avoid a nested loop, resulting in a more elegant solution. If we hadn't applied NaNs to the upper half and diagonal, we would have inadvertently discarded all correlated features instead of retaining one.

Here is a function that performs these operations based on a specified threshold and returns a list of columns to be removed:

Now, let’s proceed to drop those columns:

To evaluate whether this decision was justified, we will compare the performance of two RandomForestRegressors: one using the complete dataset and another utilizing the selected features. The target variable will be weight in pounds (Weightlbs):

Impressively, we achieved excellent results even with a significant number of features. Now, let's run the same analysis on the feature-selected DataFrame:

Surprisingly, we observed only a 2% decrease in test score while achieving a twofold reduction in runtime.

Conclusion

In conclusion, employing pairwise correlation enables the identification of highly correlated features that contribute no new information to the dataset. Since these features merely complicate the model, increase the risk of overfitting, and demand greater computational resources, they should be eliminated.

However, it is crucial to gain a thorough understanding of your dataset through proper exploratory data analysis (EDA) before applying this technique. Always be vigilant for correlations that appear nonsensical, such as random correlations between unrelated features.

Enjoyed this article? Imagine accessing a plethora of similar content, crafted by a brilliant and engaging author (that's me!). For just $4.99, you can unlock not only my writings but a wealth of knowledge from the brightest minds on Medium. If you use my referral link, you will receive my heartfelt gratitude and a virtual high-five for supporting my work.

Explore how to drop features using Pearson correlation in this comprehensive tutorial.

Learn about constructing a correlation matrix for feature selection in Python.

robertbearclaw.com

Maximize Feature Selection: Harnessing Pairwise Correlation

Understanding the Correlation Coefficient

Visualizing the Correlation Matrix

Removing Highly Correlated Features

Conclusion

Share the page:

Recent Post:

# Unveiling 13 Enduring Insights from

Exploring Entropy: The Five-Minute Plank Challenge

Unraveling the Mysteries of Matter and Anti-Matter in the Universe

Transform Your Life with One Powerful Sentence

Unlocking Your Mind's Potential: A Guide to Achieving Success

Don't Worry: Essential Insights I Would Share with My Younger Self

The Fascinating Origins of the Thyroid Gland's Name

Exploring the Exciting Features of the iPhone 14 Pro