About this tutorial
This is the first in a six-part series on the mechanics of applying machine learning techniques to the unique domain of stock market price prediction.
Use of machine learning in the quantitative investment field is, by all indications, skyrocketing. The proliferation of easily accessible data, both traditional and alternative, along with some very approachable frameworks for machine learning models, is encouraging many to explore the arena.
However, usage of machine learning in stock market prediction requires much more than a good grasp of the concepts and techniques for supervised machine learning. As I describe further in [this post], stock prediction is a challenging domain which requires some special techniques to overcome issues like non-stationarity, collinearity, and low signal-to-noise.
In this and following posts, I'll present the design of an end-to-end system for developing, testing, and applying machine learning models in a practical way which addresses each of these problems.
These tutorials are not intended to offer any "secret sauce" or strategies which I use in live trading, but will offer a more valuable and more generalized set of techniques which will allow you to create your own strategies in a robust manner.
Within other posts in this series, I plan to cover:
- Part 2: Feature engineering
- Part 3: Feature selection
- Part 4: Walk-forward modeling and out-of-sample testing
- Part 5: Evaluating model performance
- Part 6: Building ensemble models to combine many models and factors into an overall prediction
In future, I also plan to make tutorials on:
- Using Pandas, scikit-learn, and pandas plus scikit-learn
- Techniques for improving model predictive power
- Techniques for improving model robustness out-of-sample
... and probably others (please feel free to suggest topics in the comments below).
For this, I will assume readers have a good working knowledge of Python and pandas as well as basic supervised machine learning concepts.
About this post
In this first post, I will present a framework for organizing and working with data. Perhaps not the most electrifying of topics, but it's a precondition for comprehending later modeling tutorials.
It's also, as any practitioner in the field will agree, of critical importance. I've heard it said that 90% of time in real-world quant finance is spent on data rather than models. That may be a bit of an exaggeration, but it's not far off.
Following a coherent data management schema (mine or otherwise) will save countless hours of frustration and will allow you to scale your projects to teams of contributors.
Types of Data Structures
My system for handling data makes heavy use of three main types of data collections:
features: This is a dataframe which contains all features or values which we will allow models to use in the course of learning relationships, and later making predictions. All features must be values which would have been known at the point in time when the model needed to make predictions.
In other words, next_12_months_returns would be a bad feature since it would not be known at the time the prediction is made. The features dataframe has a multi-index of date/symbol and column names unique to each feature. More on this later.
outcomes: This is a dataframe of all possible future outcomes which we may be interested in predicting, magically shifted back in time to T-zero. For instance, we may want to predict the total_return for a symbol over the year following T=0 (the time of prediction). We would look ahead into the future, calculate what ultimately did happen to this metric, and log it onto time T=0. I'll explain why in a minute.
Like features, this dataframe has rows indexed by date/symbol and columns named with a convention which describes the outcome.
master: The final data structure type is the master dataframe. This contains any static information about each symbol in the universe, such as the SIC code, the number of shares outstanding, beta factors, etc.
In practice, things in the master may change over time (SIC codes and shares outstanding can both change) but I've found it sufficient for my purposes to use the values as of the current point in time.
This dataframe uses a row index of symbol only. You could, of course, add a date/symbol index if you wanted to reflect changing values over time.
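To make the three structures concrete, here is a minimal sketch on hand-written toy prices. The price values, the momentum_1_day feature, and the sector column are placeholders invented for illustration, and the outcome is computed as a plain forward return (next close divided by today's close, minus one), which is an assumption about the intended definition.

```python
import pandas as pd

# Toy closes for two symbols over four business days (placeholder values).
dates = pd.bdate_range("2016-01-04", periods=4)
index = pd.MultiIndex.from_product(
    [dates, ["AAPL", "CSCO"]], names=["date", "symbol"]
)
prices = pd.DataFrame(
    {"close": [100.0, 25.0, 101.0, 25.5, 99.0, 25.25, 102.0, 26.0]},
    index=index,
)
by_symbol = prices.groupby(level="symbol")["close"]

# features: backward-looking only, so each value is known at the close of T=0.
features = pd.DataFrame(index=prices.index)
features["momentum_1_day"] = prices["close"] / by_symbol.shift(1) - 1

# outcomes: forward-looking, but logged on the row for T=0.
outcomes = pd.DataFrame(index=prices.index)
outcomes["close_1"] = by_symbol.shift(-1) / prices["close"] - 1

# master: one row per symbol holding static attributes (placeholder column).
master = pd.DataFrame(
    {"sector": ["hardware", "networking"]},
    index=pd.Index(["AAPL", "CSCO"], name="symbol"),
)

print(features.join(outcomes))
print(master)
```

The first date has no feature value and the last date has no outcome value for each symbol, which is the pattern of nulls you should expect at the two ends of any sample.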
Why this data scheme?
It may seem odd to split the features and outcomes into distinct dataframes, and odd to create a dataframe of several different possible "outcomes". Most important, it may seem odd to record on t=0 what will happen in the next day, week, month, etc...
There are several reasons for this approach:
- This makes it trivial to extract the X's and y's when training models. Just slice some columns from features for the X and slice one column of outcomes for the y. They're already aligned and ready for fitting.
- This makes it trivial to toggle between various time horizons: just change the column of outcomes used for y.
- This helps us guard against inadvertent "peeking" at the future. We only ever use features columns in X.
- This allows us to use the incredibly efficient pandas concat methods to quickly align data for purposes of training models.
Trust me. This will save you many, many hours of debugging and brute force coding down the road.
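The points above can be sketched with a small helper. The name make_xy, the feature names f1 and f2, and the random values are all hypothetical; the only thing the sketch relies on is that features and outcomes share the same date/symbol index.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.bdate_range("2016-01-04", periods=6)
index = pd.MultiIndex.from_product(
    [dates, ["AAPL", "CSCO"]], names=["date", "symbol"]
)

# Random stand-ins for real features and outcomes (illustration only).
features = pd.DataFrame(rng.normal(size=(12, 2)), index=index, columns=["f1", "f2"])
outcomes = pd.DataFrame(
    rng.normal(size=(12, 2)), index=index, columns=["close_1", "close_5"]
)
# Mimic the nulls a longer horizon leaves at the end of the sample.
outcomes.iloc[-4:, outcomes.columns.get_loc("close_5")] = np.nan


def make_xy(features, outcomes, feature_cols, outcome_col):
    """Align the chosen feature columns with one outcome column.

    Rows with a null in any selected column are dropped, so X and y
    come back with identical indexes.
    """
    xy = pd.concat([features[feature_cols], outcomes[outcome_col]], axis=1).dropna()
    return xy[feature_cols], xy[outcome_col]


# Changing the horizon is just a matter of naming a different outcome column.
X1, y1 = make_xy(features, outcomes, ["f1", "f2"], "close_1")
X5, y5 = make_xy(features, outcomes, ["f1"], "close_5")
print(X1.shape, y1.shape)
print(X5.shape, y5.shape)
```

Because only columns of features ever reach X, a forward-looking value cannot slip into the inputs unless it was placed in the wrong dataframe to begin with.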
Let's create simple toy examples of each dataframe using free data from quandl:
First, we'll make a utility function which downloads one or more symbols from quandl and returns the adjusted OHLC data (I generally find adjusted data to be best).
import pandas as pd
pd.core.common.is_list_like = pd.api.types.is_list_like # may be necessary in some versions of pandas
import pandas_datareader.data as web

def get_symbols(symbols, data_source, begin_date=None, end_date=None):
    out = pd.DataFrame()
    for symbol in symbols:
        df = web.DataReader(symbol, data_source, begin_date, end_date)\
            [['AdjOpen','AdjHigh','AdjLow','AdjClose','AdjVolume']].reset_index()
        df.columns = ['date','open','high','low','close','volume'] # my convention: always lowercase
        df['symbol'] = symbol # add a new column which contains the symbol so we can keep multiple symbols in the same dataframe
        df = df.set_index(['date','symbol'])
        out = pd.concat([out,df], axis=0) # stacks on top of previously collected data
    return out.sort_index()

prices = get_symbols(['AAPL','CSCO'], data_source='quandl',\
                     begin_date='2015-01-01', end_date='2017-01-01')
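If the quandl download is not available in your environment, the rest of the workflow can still be followed with synthetic prices in the same layout. The sketch below is an assumption-laden stand-in, not real market data: make_fake_prices is a hypothetical helper, and the random-walk parameters are arbitrary. It only reproduces the shape of the prices dataframe (date/symbol index, lowercase OHLCV columns).

```python
import numpy as np
import pandas as pd


def make_fake_prices(symbols, begin_date, end_date, seed=0):
    """Return random-walk OHLCV data indexed by date/symbol (not real prices)."""
    rng = np.random.default_rng(seed)
    dates = pd.bdate_range(begin_date, end_date, name="date")
    n = len(dates)
    frames = []
    for symbol in symbols:
        close = 100 * np.exp(np.cumsum(rng.normal(0, 0.01, n)))
        open_ = close * (1 + rng.normal(0, 0.005, n))
        # Keep high >= max(open, close) and low <= min(open, close).
        high = np.maximum(open_, close) * (1 + np.abs(rng.normal(0, 0.003, n)))
        low = np.minimum(open_, close) * (1 - np.abs(rng.normal(0, 0.003, n)))
        volume = rng.integers(1_000_000, 5_000_000, n).astype(float)
        df = pd.DataFrame(
            {"open": open_, "high": high, "low": low, "close": close, "volume": volume},
            index=dates,
        )
        df["symbol"] = symbol
        frames.append(df.set_index("symbol", append=True))
    return pd.concat(frames).sort_index()


prices = make_fake_prices(["AAPL", "CSCO"], "2015-01-01", "2017-01-01")
print(prices.head())
```

Any numbers produced from these synthetic prices will differ from the outputs shown in this post, which came from the downloaded data.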
Now, we will create some toy features. If the syntax is unclear, I'll cover that in more depth in the next post. For now, just note that we've created five features for both symbols using only data that would be available as of the end of day T.
Also note that I've dropped any rows which contain any nulls for simplicity, since scikit-learn can't handle those out of the box.
features = pd.DataFrame(index=prices.index)
features['volume_change_ratio'] = prices.groupby(level='symbol').volume\
    .diff(1) / prices.groupby(level='symbol').shift(1).volume
features['momentum_5_day'] = prices.groupby(level='symbol').close\
    .pct_change(5)
features['intraday_chg'] = (prices.groupby(level='symbol').close\
    .shift(0) - prices.groupby(level='symbol').open\
    .shift(0))/prices.groupby(level='symbol').open.shift(0)
features['day_of_week'] = features.index.get_level_values('date').weekday
features['day_of_month'] = features.index.get_level_values('date').day
features.dropna(inplace=True)
features.tail(10)
Next, we'll create outcomes. Note that the seemingly unnecessary lambda function is needed because of this issue with pandas.
outcomes = pd.DataFrame(index=prices.index)

# next day's opening change
outcomes['open_1'] = prices.groupby(level='symbol').open.shift(-1)\
    /prices.groupby(level='symbol').close.shift(0)-1

# next day's closing change
func_one_day_ahead = lambda x: x.pct_change(-1)
outcomes['close_1'] = prices.groupby(level='symbol').close\
    .apply(func_one_day_ahead)

func_five_day_ahead = lambda x: x.pct_change(-5)
outcomes['close_5'] = prices.groupby(level='symbol').close\
    .apply(func_five_day_ahead)

(outcomes.tail(15))
Note that the shifted periods are negative, which in pandas convention looks ahead in time. This means that at the end of our time period we will have nulls, and more nulls in the outcome columns that need to look further into the future. We don't dropna() here since we may want to use open_1 and there's no reason to throw away data from that column just because a different outcome didn't have data. But I digress.
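One detail worth checking when building outcomes this way: pct_change with a negative period divides today's value by the future value, so it is not the same number as the forward return measured from today's price. The sketch below compares the two on a three-point toy series (values invented for illustration); which definition you want is a modeling choice, and the outputs shown in this post come from the pct_change version.

```python
import pandas as pd

close = pd.Series(
    [100.0, 110.0, 99.0], index=pd.bdate_range("2016-01-04", periods=3)
)

# pct_change(-1) computes close[t] / close[t+1] - 1 (base is the FUTURE price).
backward_base = close.pct_change(-1)

# Forward return computes close[t+1] / close[t] - 1 (base is TODAY's price).
forward_return = close.shift(-1) / close - 1

comparison = pd.DataFrame(
    {"pct_change(-1)": backward_base, "forward_return": forward_return}
)
print(comparison)
```

The two are related by 1 + forward_return = 1 / (1 + pct_change(-1)), so for small moves they are close in magnitude but opposite in sign.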
Now, to put it together, we'll train a simple linear model in scikit-learn, using all features to predict close_1.
# first, create y (a series) and X (a dataframe), with only rows where
# a valid value exists for both y and X
y = outcomes.close_1
X = features
Xy = X.join(y).dropna()
y = Xy[y.name]
X = Xy[X.columns]
print(y.shape)
print(X.shape)

(996,)
(996, 5)
Note that all of these slightly tedious steps have left us with properly sized, identically indexed data objects. At this point, the modeling is dead simple:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X,y)
print("Model RSQ: "+ str(model.score(X,y)))

print("Coefficients: ")
pd.Series(model.coef_,index=X.columns).sort_values(ascending=False)

Model RSQ: 0.01598347165537528
Coefficients:

intraday_chg           0.150482
volume_change_ratio    0.000976
day_of_month           0.000036
day_of_week           -0.000427
momentum_5_day        -0.005543
dtype: float64
Clearly, this model isn't very useful but illustrates the workflow.
If we wanted to instead try a random forest to predict tomorrow's open, it'd be mostly copy-paste:
from sklearn.ensemble import RandomForestRegressor

y = outcomes.open_1
X = features
Xy = X.join(y).dropna()
y = Xy[y.name]
X = Xy[X.columns]
print(y.shape)
print(X.shape)

model = RandomForestRegressor(max_features=3)
model.fit(X,y)
print("Model Score: "+ str(model.score(X,y)))

print("Feature Importance: ")
pd.Series(model.feature_importances_,index=X.columns)\
    .sort_values(ascending=False)

(996,)
(996, 5)
Model Score: 0.7963938902121673
Feature Importance:

intraday_chg           0.285079
momentum_5_day         0.274290
volume_change_ratio    0.265878
day_of_month           0.110968
day_of_week            0.063785
dtype: float64
This yields a vastly improved RSQ, but note that the score is measured on the same data used for training, so the model is almost certainly badly overfitted, as random forests are prone to be.
We'll cover ways to systematically avoid allowing the model to overfit in future posts, but that requires going a bit further down the rabbit hole.
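In the meantime, a quick way to see the overfitting is to score the model on dates it never saw. The sketch below is only a sanity check under stated assumptions, not the walk-forward method covered later in the series: the features and the target are pure random noise (so there is nothing real to learn), the column names f0 to f4 are placeholders, and the 70/30 chronological split is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
dates = pd.bdate_range("2015-01-01", periods=250)
index = pd.MultiIndex.from_product(
    [dates, ["AAPL", "CSCO"]], names=["date", "symbol"]
)

# Pure noise: by construction the features carry no information about y.
X = pd.DataFrame(
    rng.normal(size=(500, 5)), index=index, columns=[f"f{i}" for i in range(5)]
)
y = pd.Series(rng.normal(scale=0.01, size=500), index=index, name="open_1")

# Chronological split: train on the first 70% of dates, test on the rest.
split_date = dates[int(len(dates) * 0.7) - 1]
is_train = X.index.get_level_values("date") <= split_date

model = RandomForestRegressor(max_features=3, random_state=0)
model.fit(X[is_train], y[is_train])

in_sample = model.score(X[is_train], y[is_train])
out_of_sample = model.score(X[~is_train], y[~is_train])
print("In-sample score:     ", in_sample)
print("Out-of-sample score: ", out_of_sample)
```

A high in-sample score alongside a much lower out-of-sample score on data like this shows that the in-sample number reflects memorization rather than predictive power.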
Side note: in this example (and often, in real life) we've mixed together all observations from AAPL and CSCO into one dataset. We could have alternatively trained two different models for the two symbols, which may have achieved better fit, but almost certainly at the cost of worse generalization out of sample. The bias-variance trade-off in action!
Once the model is trained, it becomes a one-liner to make predictions from a set of feature values. In this case, we'll simply feed the same X values used to train the model, but in live usage, of course, we'd want to apply the trained model to new X values.
date        symbol
2016-12-22  AAPL     -0.003580
            CSCO      0.002147
2016-12-23  AAPL      0.001546
            CSCO      0.003360
2016-12-27  AAPL     -0.001172
            CSCO     -0.003555
2016-12-28  AAPL     -0.002217
            CSCO     -0.001567
2016-12-29  AAPL      0.003620
            CSCO      0.000772
dtype: float64
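The call that produces a series like the one above is not shown in the text. A minimal sketch, assuming a fitted scikit-learn model and a feature dataframe X, is to wrap model.predict in a pandas Series that reuses X's date/symbol index. The data, the feature names f1 and f2, and the fitted model here are synthetic placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
dates = pd.bdate_range("2016-12-01", periods=10)
index = pd.MultiIndex.from_product(
    [dates, ["AAPL", "CSCO"]], names=["date", "symbol"]
)

# Synthetic features and target, used only so the example runs on its own.
X = pd.DataFrame(rng.normal(size=(20, 2)), index=index, columns=["f1", "f2"])
y = pd.Series(0.5 * X["f1"] + rng.normal(scale=0.1, size=20), name="close_1")

model = LinearRegression().fit(X, y)

# The one-liner: predictions inherit the date/symbol index of the inputs.
predictions = pd.Series(model.predict(X), index=X.index)
print(predictions.tail(4))
```

Keeping the index on the predictions is what lets them be joined back to outcomes later for evaluation.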
Let me pause here to emphasize the most critical point to understand about this framework. Read this twice!
The date of a feature row represents the day when a value would be known after that day's trading, using the feature value date as T=0. The date of an outcome row represents what will happen in the n days following that date.
Predictions are indexed to the date of the evening when the model could have been run. In other words, the prediction indexed to 2016-12-23 represents what the model believes will happen in some time period after 12/23. In practical usage, we can't start using the trading signal until T+1 (since predictions are generated after markets are closed on T+0).
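One way to make the T+1 rule explicit is to shift each symbol's predictions forward by one row, so that every value is indexed by the first session on which it could be acted upon. This is a sketch under an assumption: the data contains a row for every trading session, so "next row" means "next session". The values are the first three dates of the predictions shown earlier, and tradable_signal is a hypothetical name.

```python
import pandas as pd

dates = pd.to_datetime(["2016-12-22", "2016-12-23", "2016-12-27"])
index = pd.MultiIndex.from_product(
    [dates, ["AAPL", "CSCO"]], names=["date", "symbol"]
)
predictions = pd.Series(
    [-0.003580, 0.002147, 0.001546, 0.003360, -0.001172, -0.003555],
    index=index,
)

# Shift within each symbol so a prediction made after the close on T
# is indexed to the next session in the data (T+1).
tradable_signal = predictions.groupby(level="symbol").shift(1)
print(tradable_signal)
```

The first date for each symbol becomes null, since no earlier prediction exists to trade on.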
This post presented the concept of organizing data into a features dataframe and an outcomes dataframe, and then showed how simple it is to join these two dataframes together to train a model.
True, the convention may take a few examples to get used to. However, after trial and error, I've found this to be the most error-resistant, flexible, and high-performance way to go.
In the next post, I will share some methods of feature engineering and feature selection.
One last thing...
If you've found this post useful, please follow @data2alpha on twitter and forward to a friend or colleague who may also find this topic interesting.
Finally, take a minute to leave a comment below - either to discuss this post or to offer an idea for future posts. Thanks for reading!