The Alpha Scientist

Generating stock market alpha using data science

ML for Stock Market Prediction (part 1): Data Management

In [2]:
from IPython.core.display import HTML,Image
import sys
import config

About this tutorial

This is the first in a six-part series on the mechanics of applying machine learning techniques to the unique domain of stock market price prediction.

Use of machine learning in the quantitative investment field is, by all indications, skyrocketing. The proliferation of easily accessible data - both traditional and alternative - along with some very approachable frameworks for machine learning models - is encouraging many to explore the arena.

However, usage of machine learning in stock market prediction requires much more than a good grasp of the concepts and techniques for supervised machine learning. As I describe further in [this post], stock prediction is a challenging domain which requires some special techniques to overcome issues like non-stationarity, collinearity, and low signal-to-noise.

In this and following posts, I'll present the design of an end-to-end system for developing, testing, and applying machine learning models in a way which addresses each of these problems in a very practical way.

These tutorials are not intended offer any "secret sauce" or strategies which I use in live trading, but will offer a more valuable and more generalized set of techniques which will allow you to create your own strategies in a robust manner.

Within other posts in this series, I plan to cover:

In future, I also plan to make tutorials on:

  • Using Pandas, scikit-learn, and pandas plus scikit-learn
  • Techniques for improving model predictive power
  • Techniques for improving model robustness out-of-sample ... and probably others (please feel free to suggest in below comments)

For this, I will assume readers have a good working knowledge of python and pandas as well as basic supervised machine learning concepts.

About this post

In this first post, I will present a framework for organizing and working with data. Perhaps not the most electrifying of topics, but it's a precondition for comprehending later modeling tutorials.

It's also, as any practitioner of the field will agree, of critical significance and importance. I've heard it said that 90% of time in real-world quant finance is spent on data rather than models. That may be a bit of an exagggeration, but not far off.

Following a coherent data management schema (mine or otherwise) will save countless hours of frustration and will allow you to scale your projects to teams of contributors

Types of Data Structures

My system for handling data makes heavy use of three main types of data collections:

  • features: This is a dataframe which contains all features or values which we will allow models to use in the course of learning relationships - and later making predictions. All features must be values which would have been known at the point in time when the model needed to make predictions.

    In other words, next_12_months_returns would be a bad feature since it would not become known at the time needed. The features dataframe has a multi-index of date/symbol and column names unique to each feature. More on this later.

  • outcomes: This is a dataframe of all possible future outcomes which we may be interested in predicting, magically shifted back in time to T-zero. For instance, we may want to predict the total_return for a symbol over the year following T=0 (the time of prediction). We would look ahead into the future, calculate what ultimately did happen to this metric, and log it onto time T=0. I'll explain why in a minute.

    Just like features, this dataframe has rows indexed by date/symbol and columns named with a convention which describes the feature.

  • master: The final data structure type is the master dataframe. This contains any static information about each symbol in the universe, such as the SIC code, the number of shares outstanding, beta factors, etc...

    In practice, things in the master may change over time (SIC codes and shares out can both change...) but I've found it sufficient for my purposes to take the current static values for the current point in time.

    This dataframe uses row index of symbol only. You could, of course, add a date/symbol index if you wanted to reflect changing values over time.

Why this data scheme?

It may seem odd to split the features and outcomes into distinct dataframes, and odd to create a dataframe of several different possible "outcomes". Most important, it may seem odd to record on t=0 what will happen in the next day, week, month, etc...

There are several reasons for this approach:

  1. This makes it trivial to extract the X's and y's when training models. Just slice some columns from features for the X and slice one column of outcomes in y. They're already aligned and ready for fitting.
  2. This makes it trivial to toggle between various time horizons - just change the column of outcomes used for y.
  3. This helps us guard against inadvertent "peeking" at the future. We only ever use features columns in X.
  4. This allows us to use the incredibly efficient pandas join, merge, and concat methods to quickly align data for purposes of training models.

Trust me. This will save you many, many hours of debugging and brute force coding down the road.

An example

Let's create simple toy examples of each dataframe using free data from quandl:

First, we'll make a utility function which downloads one or more symbols from quandl and returns the adjusted OHLC data (I generally find adjusted data to be best).

In [2]:
import pandas as pd
pd.core.common.is_list_like = pd.api.types.is_list_like # may be necessary in some versions of pandas
import as web

def get_symbols(symbols,data_source, begin_date=None,end_date=None):
    out = pd.DataFrame()
    for symbol in symbols:
        df = web.DataReader(symbol, data_source,begin_date, end_date)\
        df.columns = ['date','open','high','low','close','volume'] #my convention: always lowercase
        df['symbol'] = symbol # add a new column which contains the symbol so we can keep multiple symbols in the same dataframe
        df = df.set_index(['date','symbol'])
        out = pd.concat([out,df],axis=0) #stacks on top of previously collected data
    return out.sort_index()
prices = get_symbols(['AAPL','CSCO'],data_source='quandl',\

Now, we will create some toy features. If the syntax is unclear, I'll cover that in more depth in the next post. For now, just note that we've created five features for both symbols using only data that would be available as of the end of day T.

Also note that I've dropped any rows which contain any nulls for simplicity, since scikit-learn can't handle those out of the box.

In [3]:
features = pd.DataFrame(index=prices.index)
features['volume_change_ratio'] = prices.groupby(level='symbol').volume\
.diff(1) / prices.groupby(level='symbol').shift(1).volume
features['momentum_5_day'] = prices.groupby(level='symbol').close\

features['intraday_chg'] = (prices.groupby(level='symbol').close\
                            .shift(0) - prices.groupby(level='symbol').open\

features['day_of_week'] = features.index.get_level_values('date').weekday

features['day_of_month'] = features.index.get_level_values('date').day

volume_change_ratio momentum_5_day intraday_chg day_of_week day_of_month
date symbol
2016-12-23 AAPL -0.453747 0.004743 0.008046 4 23
CSCO -0.291298 -0.001961 -0.000327 4 23
2016-12-27 AAPL 0.284036 0.005316 0.006351 1 27
CSCO 0.546260 -0.002276 0.001305 1 27
2016-12-28 AAPL 0.142595 -0.001625 -0.006467 2 28
CSCO -0.151900 -0.004581 -0.009121 2 28
2016-12-29 AAPL -0.280609 -0.002819 0.002404 3 29
CSCO -0.085396 0.001315 0.002963 3 29
2016-12-30 AAPL 1.033726 -0.004042 -0.007115 4 30
CSCO 0.836194 -0.007879 -0.011126 4 30

Next, we'll create outcomes. Note that the seemingly unnecessary lambda function is needed because of this issue with pandas

In [4]:
outcomes = pd.DataFrame(index=prices.index)
# next day's opening change
outcomes['open_1'] = prices.groupby(level='symbol').open.shift(-1)\
# next day's closing change
func_one_day_ahead = lambda x: x.pct_change(-1)
outcomes['close_1'] = prices.groupby(level='symbol').close\
func_five_day_ahead = lambda x: x.pct_change(-5)
outcomes['close_5'] = prices.groupby(level='symbol').close\

open_1 close_1 close_5
date symbol
2016-12-20 CSCO 0.004254 0.004602 0.004602
2016-12-21 AAPL -0.006065 0.006621 0.002827
CSCO -0.000657 -0.001313 -0.001313
2016-12-22 AAPL -0.006019 -0.001974 0.004058
CSCO 0.002626 -0.002293 0.007942
2016-12-23 AAPL 0.000000 -0.006311 NaN
CSCO 0.003603 -0.004889 NaN
2016-12-27 AAPL 0.002217 0.004282 NaN
CSCO 0.000652 0.008547 NaN
2016-12-28 AAPL -0.002655 0.000257 NaN
CSCO -0.001644 -0.001313 NaN
2016-12-29 AAPL -0.000685 0.007857 NaN
CSCO 0.003283 0.007942 NaN
2016-12-30 AAPL NaN NaN NaN

Note that the shifted periods are negative, which in pandas convention looks ahead in time. This means that at the ending of our time period we will have nulls - and more nulls in the outcome colums that need to look further into the future. We don't dropna() here since we may want to use open_1 and there's no reason to throw away data from that column just because a different outcome didn't have data. But I digress.

Now, to put it together, we'll train a simple linear model in scikit-learn, using all features to predict close_1

In [5]:
# first, create y (a series) and X (a dataframe), with only rows where 
# a valid value exists for both y and X
y = outcomes.close_1
X = features
Xy = X.join(y).dropna()
y = Xy[]
X = Xy[X.columns]
(996, 5)

Note that all of these slightly tedious steps have left us with properly sized, identically indexed data objects. At this point, the modeling is dead simple:

In [6]:
from sklearn.linear_model import LinearRegression

model = LinearRegression(),y)
print("Model RSQ: "+ str(model.score(X,y)))

print("Coefficients: ")
Model RSQ: 0.01598347165537528
intraday_chg           0.150482
volume_change_ratio    0.000976
day_of_month           0.000036
day_of_week           -0.000427
momentum_5_day        -0.005543
dtype: float64

Clearly, this model isn't very useful but illustrates the workflow.

If we wanted to instead try a random forest to predict tomorrow's open, it'd be mostly copy-paste:

In [7]:
from sklearn.ensemble import RandomForestRegressor

y = outcomes.open_1
X = features
Xy = X.join(y).dropna()
y = Xy[]
X = Xy[X.columns]

model = RandomForestRegressor(max_features=3),y)
print("Model Score: "+ str(model.score(X,y)))

print("Feature Importance: ")
(996, 5)
Model Score: 0.7864834616064446
Feature Importance: 
intraday_chg           0.314371
momentum_5_day         0.274028
volume_change_ratio    0.252230
day_of_month           0.101123
day_of_week            0.058248
dtype: float64

This yields a vastly improved RSQ but note that it is almost certainly ridiculously overfitted, as random forests are prone to do.

We'll cover ways to systematically avoid allowing the model to overfit in future posts, but that requires going a bit further down the rabbit hole.

Side point: in this example (and often, in real life) we've mixed together all observations from AAPL and CSCO into one dataset. We could have alternatively trained two different models for the two symbols, which may have achieved better fit, but almost certainly at the cost of worse generalization out of sample. The bias-variance trade-off in action!


Once the model is trained, it becomes a one-liner to make predictions from a set of feature values. In this case, we'll simply feed the same X values used to train the model, but in live usage, of course, we'd want to apply the trained model to new X values.

In [8]:
date        symbol
2016-12-22  AAPL     -0.006029
            CSCO      0.001444
2016-12-23  AAPL      0.000805
            CSCO      0.002617
2016-12-27  AAPL      0.007293
            CSCO      0.001129
2016-12-28  AAPL      0.001042
            CSCO     -0.001178
2016-12-29  AAPL     -0.001244
            CSCO      0.002477
dtype: float64

Let me pause here to emphasize the most critical point to understand about this framework. Read this twice!

The date of a feature row represents the day when a value would be known after that day's trading, using the feature value date as T=0. The date of an outcome row represents what will happen in the n days following that date.

Predictions are indexed to the date of the evening when the model could have been run. In other words, the prediction indexed to 2016-12-23 represents what the model believes will happen in some time period after 12/23. In practical usage, we can't start using the trading signal until T+1 (since predictions are generated after markets are closed on T+0).


This post presented the concept of organizing data into a features dataframe and outcome dataframe, and then showed how simple it is to join these two dataframes together to train a model.

True, the convention may take a few examples to get used to. However, after trial and error, I've found this to be the most error-resistant, flexible, and high-performance way to go.

In the next post, I will share some methods of feature engineering and feature selection.