This post is going to delve into the mechanics of feature selection to help choose between the many variations of features created in the feature engineering stage. If you'd like to replicate and experiment with the below code, you can download the source notebook for this post by right-clicking on the below button and choosing "save link as"
By design, many of the features you've created will be very similar to each other (aka "collinear") because you've derived them from the same underlying dataset. If we were to keep many highly collinear features in our dataset used to train models, it would likely cause the model to "learn" some very funky and dangerous patterns. I will discuss this in greater depth in a future post.
The goal of feature selection is to reduce our possible features into the best set of features to learn from data. This will lead to models which generalize better (i.e., work well on data they haven't seen). They will also be much more interpretable.
In feature selection, we strive to meet two goals:
- Strength: Choose the features with the strongest, most persistent relationships to the target outcome variable. The reasons for this are obvious.
- Orthogonality: Minimize the amount of overlap or collinearity in your selected features. The importance of orthogonality (non-overlap) of features is much greater than you might guess.
I am biased towards making feature selection a relatively mechanical process. The "art" should mainly be encapsulated within the prior step (feature engineering) and the subsequent step (modeling).
Feature selection should, in my view, follow a heuristic and can be encoded into an algorithm if desired. For purposes of this tutorial, I'll keep things relatively manual.
## Replace this section of imports with your preferred ## data download/access interface. This calls a ## proprietary set of methods (ie they won't work for you) from IPython.core.display import Image import numpy as np import pandas as pd pd.core.common.is_list_like = pd.api.types.is_list_like # remove once updated pandas-datareader issue is fixed # https://github.com/pydata/pandas-datareader/issues/534 import pandas_datareader.data as web %matplotlib inline def get_symbols(symbols,data_source, begin_date=None,end_date=None): out = pd.DataFrame() for symbol in symbols: df = web.DataReader(symbol, data_source,begin_date, end_date)\ [['AdjOpen','AdjHigh','AdjLow','AdjClose','AdjVolume']].reset_index() df.columns = ['date','open','high','low','close','volume'] #my convention: always lowercase df['symbol'] = symbol # add a new column which contains the symbol so we can keep multiple symbols in the same dataframe df = df.set_index(['date','symbol']) out = pd.concat([out,df],axis=0) #stacks on top of previously collected data return out.sort_index() prices = get_symbols(['AAPL','CSCO','AMZN','YHOO','MSFT'],data_source='quandl',begin_date='2012-01-01',end_date=None) print(prices.sort_index().tail())
open high low close volume date symbol 2018-03-26 MSFT 90.61 94.000 90.40 93.78 55031149.0 2018-03-27 AAPL 173.68 175.150 166.92 168.34 38962839.0 AMZN 1572.40 1575.960 1482.32 1497.05 6793279.0 CSCO 44.49 44.520 42.24 42.68 30088447.0 MSFT 94.94 95.139 88.51 89.47 53704562.0
outcomes = pd.DataFrame(index=prices.index) # next day's opening change outcomes['close_1'] = prices.groupby(level='symbol').close.pct_change(-1) # next day's returns outcomes['close_5'] = prices.groupby(level='symbol').close.pct_change(-5) # next week's returns outcomes['close_10'] = prices.groupby(level='symbol').close.pct_change(-10) # next two weeks' returns outcomes['close_20'] = prices.groupby(level='symbol').close.pct_change(-20) # next month's (approx) returns print(outcomes.tail())
close_1 close_5 close_10 close_20 date symbol 2018-03-26 MSFT 0.048173 NaN NaN NaN 2018-03-27 AAPL NaN NaN NaN NaN AMZN NaN NaN NaN NaN CSCO NaN NaN NaN NaN MSFT NaN NaN NaN NaN
However, unlike the prior tutorials, we're going to engineer some features which are constructed to contain a relationship to the outcomes along with quite a bit of random noise. Clearly, this is not something we'd do in real usage but will help to demonstrate the concept more clearly.
Assume we have a target variable called
outcome which can be (partially) predicted with three factors,
factor_3. There's also an unpredictble noise component.
numpy.random to graft dummy values mapped onto the indices of real price data.
num_obs = prices.close.count() factor_1 = pd.Series(np.random.randn(num_obs),index=prices.index) factor_2 = pd.Series(np.random.randn(num_obs),index=prices.index) factor_3 = pd.Series(np.random.randn(num_obs),index=prices.index) outcome = 1.*factor_1 + 2.*factor_2 + 3.*factor_3 + 5.*np.random.randn(num_obs) outcome.name = 'outcome' outcome.tail()
date symbol 2018-03-26 MSFT 1.619244 2018-03-27 AAPL 0.669392 AMZN -7.372092 CSCO -0.209091 MSFT -8.924626 Name: outcome, dtype: float64
Now, we will engineer several variations on features which each contain some information about the three factors, plus a few which contain some interaction effects, and some which do not contain any useful data.
Note that we are, again, "cheating" here for illustration purposes.
features = pd.DataFrame(index=outcome.index) features['f11'] = 0.2*factor_1 + 0.8*np.random.randn(num_obs) features['f12'] = 0.4*factor_1 + 0.6*np.random.randn(num_obs) features['f13'] = 0.6*factor_1 + 0.4*np.random.randn(num_obs) features['f21'] = 0.2*factor_2 + 0.8*np.random.randn(num_obs) features['f22'] = 0.4*factor_2 + 0.8*np.random.randn(num_obs) features['f23'] = 0.6*factor_2 + 0.4*np.random.randn(num_obs) features['f31'] = 0.2*factor_3 + 0.8*np.random.randn(num_obs) features['f32'] = 0.4*factor_3 + 0.6*np.random.randn(num_obs) features['f33'] = 0.6*factor_3 + 0.4*np.random.randn(num_obs) features['f41'] = 0.2*factor_1+0.2*factor_2 + 0.6*np.random.randn(num_obs) features['f42'] = 0.2*factor_2+0.2*factor_3 + 0.6*np.random.randn(num_obs) features['f43'] = 0.2*factor_3+0.2*factor_1 + 0.6*np.random.randn(num_obs) features['f51'] = np.random.randn(num_obs) features['f52'] = np.random.randn(num_obs) features['f53'] = np.random.randn(num_obs) print(features.tail())
f11 f12 f13 f21 f22 f23 \ date symbol 2018-03-26 MSFT -0.041104 0.238372 0.413337 -0.478707 -0.248044 0.135140 2018-03-27 AAPL 0.569879 -0.333113 -1.205543 1.506631 0.641764 0.047848 AMZN -0.032226 -0.001385 0.411697 -0.213998 -1.078521 -0.745223 CSCO -0.954817 -0.455529 0.759212 0.108211 -0.255423 0.865484 MSFT -0.184980 0.028665 -0.363028 -0.701258 1.567036 0.677576 f31 f32 f33 f41 f42 f43 \ date symbol 2018-03-26 MSFT 0.308846 0.059778 0.269863 0.297112 0.211582 -0.375905 2018-03-27 AAPL -1.021095 -1.187441 1.198066 -0.227440 -0.051018 0.147372 AMZN -1.366262 -0.869260 -1.096526 0.200837 0.097775 -0.100074 CSCO -0.853482 -0.102179 -0.785770 0.165970 -0.259788 0.192491 MSFT 0.940413 -0.476567 -0.277618 0.218573 -0.052515 0.048862 f51 f52 f53 date symbol 2018-03-26 MSFT -0.160892 0.963626 0.039153 2018-03-27 AAPL 0.549437 -0.418202 0.124258 AMZN -0.120544 -0.996801 0.266643 CSCO 0.244743 -1.368344 -0.660723 MSFT 0.644907 0.176948 0.426277
Before evaluating the features for predictive strength and orthogonality, we'll do a quick data preparation stage. It is sometimes vital to "standardize" or "normalize" data so that we get fair comparisons between features of differing scale. Strictly speaking, since all of the doctored outcome and feature data is already drawn from normal distribution (using the numpy function
random.rnorm()) we don't really need this step, but good practice to include.
Here, I'll use the scikit-learn
StandardScaler() method and some pandas magic to transform the data.
import seaborn as sns import matplotlib.pyplot as plt %matplotlib inline from IPython.display import display from scipy.cluster import hierarchy from scipy.spatial import distance from sklearn.preprocessing import StandardScaler,Normalizer #f = features.dropna() #optional - to compare apples to apples # standardize or normalize data std_scaler = StandardScaler() features_scaled = std_scaler.fit_transform(features.dropna()) print (features_scaled.shape) df = pd.DataFrame(features_scaled,index=features.dropna().index) df.columns = features.dropna().columns df.tail() # standardize outcome as well outcome_df = outcome.to_frame() outcome_scaled = std_scaler.fit_transform(outcome_df.dropna()) outcome_scaled = pd.DataFrame(outcome_scaled,index=outcome_df.dropna().index) outcome_scaled.columns = outcome_df.columns print(outcome_scaled.tail())
(7639, 15) outcome date symbol 2018-03-26 MSFT 0.252853 2018-03-27 AAPL 0.101434 AMZN -1.180483 CSCO -0.038608 MSFT -1.427977
corr = df.corrwith(outcome) corr.sort_values().plot.barh(color = 'blue',title = 'Strength of Correlation')
<matplotlib.axes._subplots.AxesSubplot at 0xaa89908>
Pretend for a minute that we don't know which features are going to be stronger and weaker, and which are going to tend to cluster together. We've got an idea that there are some quite strong features, some weaker, and some useless.
While correlation is not the perfect metric, it gives us a reasonable sense of strength of each feature's historical relationship to the outcome variable.
However, it says nothing about orthogonality. To get an idea about this, we'll take advantage of the very handy seaborn
clustermap chart type which plots a heatmap representation of a covariance matrix and runs a hierarchical clustering algorithm to group together the most closely related features.
corr_matrix = df.corr() correlations_array = np.asarray(corr_matrix) linkage = hierarchy.linkage(distance.pdist(correlations_array), \ method='average') g = sns.clustermap(corr_matrix,row_linkage=linkage,col_linkage=linkage,\ row_cluster=True,col_cluster=True,figsize=(10,10),cmap='Greens') plt.setp(g.ax_heatmap.yaxis.get_majorticklabels(), rotation=0) plt.show() label_order = corr_matrix.iloc[:,g.dendrogram_row.reordered_ind].columns
The algorithm has done a good job of finding the groupings of features. Of course, the diagonal of dark green represents each feature being perfectly correlated with itself, but we also see certain clusters of features which are similar to one another.
The cluster in the upper left captures
factor_1 (including some of the interaction effects).
factor_3 is fairly well isolated in the lower right corner, and in the middle we can see
factor_2 as well as some of the noise features.
Let's next focus in only on those features with correlations of greater than 0.1 to exclude the noise and weak features.
correlated_features = corr[corr>0.1].index.tolist() corr_matrix = df[correlated_features].corr() correlations_array = np.asarray(corr_matrix) linkage = hierarchy.linkage(distance.pdist(correlations_array), \ method='average') g = sns.clustermap(corr_matrix,row_linkage=linkage,col_linkage=linkage,\ row_cluster=True,col_cluster=True,figsize=(6,6),cmap='Greens') plt.setp(g.ax_heatmap.yaxis.get_majorticklabels(), rotation=0) plt.show() label_order = corr_matrix.iloc[:,g.dendrogram_row.reordered_ind].columns print("Correlation Strength:") print(corr[corr>0.1].sort_values(ascending=False))
Correlation Strength: f33 0.421067 f32 0.270004 f23 0.258847 f42 0.241708 f43 0.191224 f41 0.145155 f22 0.136999 f13 0.125092 f31 0.104860 dtype: float64
Ah, now the clusters look a bit sharper. We'll follow a simple heuristic to manually select the features.
Step 1: Take the most strongly correlated feature (f33) and add it to our list of selected features. Step 2: Take the second correlated feature (f23) and check to see if it's closely correlated (neighboring in the clustermap) to any features already chosen. If no, add to the list. If yes, discard. Step 3: Repeat this process until either (1) we've reached the target feature count, or (2) we've run out strongly correlated features.
Those interested could encode this heuristic into an algorithm without too much difficulty.
Following this heuristic, I get the below features:
selected_features = ['f33','f23','f42','f41','f31'] import seaborn as sns sns.pairplot(df[selected_features],size=1.5)
<seaborn.axisgrid.PairGrid at 0xc714470>
Note that this list of features is not simply the highest correlated features. Let's run the clustermap one more time to see if we've missed any major clusters.
corr_matrix = df[selected_features].corr() correlations_array = np.asarray(corr_matrix) linkage = hierarchy.linkage(distance.pdist(correlations_array), method='average') g = sns.clustermap(corr_matrix,row_linkage=linkage,col_linkage=linkage,\ row_cluster=True,col_cluster=True,figsize=(6,6),cmap='Greens') plt.setp(g.ax_heatmap.yaxis.get_majorticklabels(), rotation=0) plt.show() label_order = corr_matrix.iloc[:,g.dendrogram_row.reordered_ind].columns
Looks generally pretty good. There is some residual covariance between features, so we haven't achieved orthogonal nirvana, but we're pretty close.
Thus far, we've only used a simple correlation statistic across the full time period. This is a good place to start but, in my opinion, is a dangerous place to stop. Financial time series data suffers from non-stationarity and regime change, so a relationship which on average has existed may have been wildly unstable over time.
To check, we'll plot the rolling correlation of these selected features.
tmp = df[selected_features].join(outcome_scaled).reset_index().set_index('date') tmp.dropna().resample('Q').apply(lambda x: x.corr()).iloc[:,-1].unstack()\ .iloc[:,:-1].plot(title='Correlation of Features to Outcome\n (by quarter)') # shows time stability
<matplotlib.axes._subplots.AxesSubplot at 0xe67c1d0>
As expected, since the data wasn't modeled with any non-stationarity, our features all appear to be robust over time. This gives increased confidence that the relationship we've found is likely to persist.
This installment of the tutorial series has walked through a systematic approach for selecting a subset of features from a universe of many overlapping (collinear) features. At this point, we're ready to model!
In the next post, I'll walk through an approach for training models in a "walk forward" basis - highly useful when working with ordered (e.g., time series) datasets.
One last thing...¶
If you've found this post useful, please follow @data2alpha on twitter and forward to a friend or colleague who may also find this topic interesting.
Finally, take a minute to leave a comment below - either to discuss this post or to offer an idea for future posts. Thanks for reading!