### Introduction¶

This is the third post in my series on transforming data into alpha. If you haven't yet see the data management and guide to feature engineering, please take a minute to read those first...

This post is going to delve into the mechanics of *feature selection* to help choose between the many variations of features created in the feature engineering stage. If you'd like to replicate and experiment with the below code, *you can download the source notebook for this post by right-clicking on the below button and choosing "save link as"*

By design, many of the features you've created will be very similar to each other (aka "collinear") because you've derived them from the same underlying dataset. If we were to keep many highly collinear features in our dataset used to train models, it would likely cause the model to "learn" some very funky and dangerous patterns. I will discuss this in greater depth in a future post.

The goal of feature selection is to reduce our *possible* features into *the best* set of features to learn from data. This will lead to models which *generalize* better (i.e., work well on data they haven't seen). They will also be much more interpretable.

### Philosophy¶

In feature selection, we strive to meet two goals:

**Strength**: Choose the features with the strongest, most persistent relationships to the target outcome variable. The reasons for this are obvious.**Orthogonality**: Minimize the amount of overlap or collinearity in your selected features. The importance of orthogonality (non-overlap) of features is much greater than you might guess.

I am biased towards making feature selection a relatively mechanical process. The "art" should mainly be encapsulated within the prior step (feature engineering) and the subsequent step (modeling).

Feature selection should, in my view, follow a heuristic and can be encoded into an algorithm if desired. For purposes of this tutorial, I'll keep things relatively manual.

### Preparing the data¶

Let's dive in. I will begin by loading prices and creating *outcomes* `DataFrame`

as done in the post on data management.

```
## Replace this section of imports with your preferred
## data download/access interface. This calls a
## proprietary set of methods (ie they won't work for you)
from IPython.core.display import Image
import numpy as np
import pandas as pd
pd.core.common.is_list_like = pd.api.types.is_list_like # remove once updated pandas-datareader issue is fixed
# https://github.com/pydata/pandas-datareader/issues/534
import pandas_datareader.data as web
%matplotlib inline
def get_symbols(symbols,data_source, begin_date=None,end_date=None):
out = pd.DataFrame()
for symbol in symbols:
df = web.DataReader(symbol, data_source,begin_date, end_date)\
[['AdjOpen','AdjHigh','AdjLow','AdjClose','AdjVolume']].reset_index()
df.columns = ['date','open','high','low','close','volume'] #my convention: always lowercase
df['symbol'] = symbol # add a new column which contains the symbol so we can keep multiple symbols in the same dataframe
df = df.set_index(['date','symbol'])
out = pd.concat([out,df],axis=0) #stacks on top of previously collected data
return out.sort_index()
prices = get_symbols(['AAPL','CSCO','AMZN','YHOO','MSFT'],data_source='quandl',begin_date='2012-01-01',end_date=None)
print(prices.sort_index().tail())
```

```
outcomes = pd.DataFrame(index=prices.index)
# next day's opening change
outcomes['close_1'] = prices.groupby(level='symbol').close.pct_change(-1) # next day's returns
outcomes['close_5'] = prices.groupby(level='symbol').close.pct_change(-5) # next week's returns
outcomes['close_10'] = prices.groupby(level='symbol').close.pct_change(-10) # next two weeks' returns
outcomes['close_20'] = prices.groupby(level='symbol').close.pct_change(-20) # next month's (approx) returns
print(outcomes.tail())
```

However, unlike the prior tutorials, we're going to engineer some features which are constructed to contain a relationship to the outcomes along with quite a bit of random noise. Clearly, this is not something we'd do in real usage but will help to demonstrate the concept more clearly.

Assume we have a target variable called `outcome`

which can be (partially) predicted with three factors, `factor_1`

, `factor_2`

and `factor_3`

. There's also an unpredictble noise component.

We'll use `numpy.random`

to graft dummy values mapped onto the indices of real price data.

```
num_obs = prices.close.count()
factor_1 = pd.Series(np.random.randn(num_obs),index=prices.index)
factor_2 = pd.Series(np.random.randn(num_obs),index=prices.index)
factor_3 = pd.Series(np.random.randn(num_obs),index=prices.index)
outcome = 1.*factor_1 + 2.*factor_2 + 3.*factor_3 + 5.*np.random.randn(num_obs)
outcome.name = 'outcome'
print(outcome.tail())
```

Now, we will engineer several variations on features which each contain some information about the three factors, plus a few which contain some interaction effects, and some which do not contain any useful data.

Note that we are, again, "cheating" here for illustration purposes.

```
features = pd.DataFrame(index=outcome.index)
features['f11'] = 0.2*factor_1 + 0.8*np.random.randn(num_obs)
features['f12'] = 0.4*factor_1 + 0.6*np.random.randn(num_obs)
features['f13'] = 0.6*factor_1 + 0.4*np.random.randn(num_obs)
features['f21'] = 0.2*factor_2 + 0.8*np.random.randn(num_obs)
features['f22'] = 0.4*factor_2 + 0.8*np.random.randn(num_obs)
features['f23'] = 0.6*factor_2 + 0.4*np.random.randn(num_obs)
features['f31'] = 0.2*factor_3 + 0.8*np.random.randn(num_obs)
features['f32'] = 0.4*factor_3 + 0.6*np.random.randn(num_obs)
features['f33'] = 0.6*factor_3 + 0.4*np.random.randn(num_obs)
features['f41'] = 0.2*factor_1+0.2*factor_2 + 0.6*np.random.randn(num_obs)
features['f42'] = 0.2*factor_2+0.2*factor_3 + 0.6*np.random.randn(num_obs)
features['f43'] = 0.2*factor_3+0.2*factor_1 + 0.6*np.random.randn(num_obs)
features['f51'] = np.random.randn(num_obs)
features['f52'] = np.random.randn(num_obs)
features['f53'] = np.random.randn(num_obs)
print(features.tail())
```

Before evaluating the features for predictive strength and orthogonality, we'll do a quick data preparation stage. It is sometimes vital to "standardize" or "normalize" data so that we get fair comparisons between features of differing scale. Strictly speaking, since all of the doctored outcome and feature data is already drawn from normal distribution (using the numpy function `random.rnorm()`

) we don't really need this step, but good practice to include.

Here, I'll use the scikit-learn `StandardScaler()`

method and some pandas magic to transform the data.

```
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import display
from scipy.cluster import hierarchy
from scipy.spatial import distance
from sklearn.preprocessing import StandardScaler,Normalizer
#f = features.dropna() #optional - to compare apples to apples
# standardize or normalize data
std_scaler = StandardScaler()
features_scaled = std_scaler.fit_transform(features.dropna())
print (features_scaled.shape)
df = pd.DataFrame(features_scaled,index=features.dropna().index)
df.columns = features.dropna().columns
df.tail()
# standardize outcome as well
outcome_df = outcome.to_frame()
outcome_scaled = std_scaler.fit_transform(outcome_df.dropna())
outcome_scaled = pd.DataFrame(outcome_scaled,index=outcome_df.dropna().index)
outcome_scaled.columns = outcome_df.columns
print(outcome_scaled.tail())
```

```
corr = df.corrwith(outcome)
corr.sort_values().plot.barh(color = 'blue',title = 'Strength of Correlation')
```

Pretend for a minute that we don't know which features are going to be stronger and weaker, and which are going to tend to cluster together. We've got an idea that there are some quite strong features, some weaker, and some useless.

While correlation is not the perfect metric, it gives us a reasonable sense of **strength** of each feature's historical relationship to the outcome variable.

However, it says nothing about **orthogonality**. To get an idea about this, we'll take advantage of the very handy seaborn `clustermap`

chart type which plots a heatmap representation of a covariance matrix and runs a hierarchical clustering algorithm to group together the most closely related features.

```
corr_matrix = df.corr()
correlations_array = np.asarray(corr_matrix)
linkage = hierarchy.linkage(distance.pdist(correlations_array), \
method='average')
g = sns.clustermap(corr_matrix,row_linkage=linkage,col_linkage=linkage,\
row_cluster=True,col_cluster=True,figsize=(10,10),cmap='Greens')
plt.setp(g.ax_heatmap.yaxis.get_majorticklabels(), rotation=0)
plt.show()
label_order = corr_matrix.iloc[:,g.dendrogram_row.reordered_ind].columns
```

The algorithm has done a good job of finding the groupings of features. Of course, the diagonal of dark green represents each feature being perfectly correlated with itself, but we also see certain clusters of features which are similar to one another.

The cluster in the upper left captures `factor_1`

(including some of the interaction effects). `factor_3`

is fairly well isolated in the lower right corner, and in the middle we can see `factor_2`

as well as some of the noise features.

Let's next focus in only on those features with correlations of greater than 0.1 to exclude the noise and weak features.

```
corr
```

```
correlated_features = corr[corr>0.1].index.tolist()
corr_matrix = df[correlated_features].corr()
correlations_array = np.asarray(corr_matrix)
linkage = hierarchy.linkage(distance.pdist(correlations_array), \
method='average')
g = sns.clustermap(corr_matrix,row_linkage=linkage,col_linkage=linkage,\
row_cluster=True,col_cluster=True,figsize=(6,6),cmap='Greens')
plt.setp(g.ax_heatmap.yaxis.get_majorticklabels(), rotation=0)
plt.show()
label_order = corr_matrix.iloc[:,g.dendrogram_row.reordered_ind].columns
print("Correlation Strength:")
print(corr[corr>0.1].sort_values(ascending=False))
```

Ah, now the clusters look a bit sharper. We'll follow a simple heuristic to manually select the features.

**Step 1:** Take the most strongly correlated feature (f33) and add it to our list of selected features.
**Step 2:** Take the second correlated feature (f23) and check to see if it's closely correlated (neighboring in the clustermap) to any features already chosen. If no, add to the list. If yes, discard.
**Step 3:** Repeat this process until either (1) we've reached the target feature count, or (2) we've run out strongly correlated features.

*Those interested could encode this heuristic into an algorithm without too much difficulty.*

Following this heuristic, I get the below features:

```
selected_features = ['f33','f23','f42','f41','f31']
import seaborn as sns
sns.pairplot(df[selected_features],size=1.5)
```

Note that this list of features is not simply the highest correlated features. Let's run the clustermap one more time to see if we've missed any major clusters.

```
corr_matrix = df[selected_features].corr()
correlations_array = np.asarray(corr_matrix)
linkage = hierarchy.linkage(distance.pdist(correlations_array), method='average')
g = sns.clustermap(corr_matrix,row_linkage=linkage,col_linkage=linkage,\
row_cluster=True,col_cluster=True,figsize=(6,6),cmap='Greens')
plt.setp(g.ax_heatmap.yaxis.get_majorticklabels(), rotation=0)
plt.show()
label_order = corr_matrix.iloc[:,g.dendrogram_row.reordered_ind].columns
```

Looks generally pretty good. There is some residual covariance between features, so we haven't achieved orthogonal nirvana, but we're pretty close.

Thus far, we've only used a simple correlation statistic across the full time period. This is a good place to start but, in my opinion, is a dangerous place to stop. Financial time series data suffers from non-stationarity and regime change, so a relationship which *on average* has existed may have been wildly unstable over time.

To check, we'll plot the rolling correlation of these selected features.

```
tmp = df[selected_features].join(outcome_scaled).reset_index().set_index('date')
tmp.dropna().resample('Q').apply(lambda x: x.corr()).iloc[:,-1].unstack()\
.iloc[:,:-1].plot(title='Correlation of Features to Outcome\n (by quarter)')
# shows time stability
```

As expected, since the data wasn't modeled with any non-stationarity, our features all appear to be robust over time. This gives increased confidence that the relationship we've found is likely to persist.

### Summary¶

This installment of the tutorial series has walked through a systematic approach for selecting a subset of features from a universe of many overlapping (collinear) features. At this point, we're ready to model!

In the next post, I'll walk through an approach for training models in a "walk forward" basis - highly useful when working with ordered (e.g., time series) datasets.

### One last thing...¶

If you've found this post useful, please follow @data2alpha on twitter and forward to a friend or colleague who may also find this topic interesting.

Finally, take a minute to leave a comment below - either to discuss this post or to offer an idea for future posts. Thanks for reading!