About this tutorial
This is the first in a six-part series on the mechanics of applying machine learning techniques to the unique domain of stock market price prediction.
Use of machine learning in the quantitative investment field is, by all indications, skyrocketing. The proliferation of easily accessible data - both traditional and alternative - along with some very approachable frameworks for machine learning models - is encouraging many to explore the arena.
However, usage of machine learning in stock market prediction requires much more than a good grasp of the concepts and techniques for supervised machine learning. As I describe further in [this post], stock prediction is a challenging domain which requires some special techniques to overcome issues like non-stationarity, collinearity, and low signal-to-noise.
In this and following posts, I'll present the design of an end-to-end system for developing, testing, and applying machine learning models in a way which addresses each of these problems in a very practical way.
These tutorials are not intended to offer any "secret sauce" or strategies which I use in live trading, but they will offer a more valuable and more generalized set of techniques which will allow you to create your own strategies in a robust manner.
Within other posts in this series, I plan to cover:
- Part 2: Feature engineering
- Part 3: Feature selection
- Part 4: Walk-forward modeling
- Part 5: Evaluating model performance
- Part 6: Building ensemble models to combine many models and factors into an overall prediction
In future, I also plan to make tutorials on:
- Using Pandas, scikit-learn, and pandas plus scikit-learn
- Techniques for improving model predictive power
- Techniques for improving model robustness out-of-sample
... and probably others (please feel free to suggest in the comments below).
For this, I will assume readers have a good working knowledge of Python and pandas, as well as basic supervised machine learning concepts.
About this post
In this first post, I will present a framework for organizing and working with data. Perhaps not the most electrifying of topics, but it's a precondition for comprehending later modeling tutorials.
It's also, as any practitioner in the field will agree, critically important. I've heard it said that 90% of the time in real-world quant finance is spent on data rather than models. That may be a bit of an exaggeration, but it's not far off.
Following a coherent data management schema (mine or otherwise) will save countless hours of frustration and will allow you to scale your projects to teams of contributors.
Types of Data Structures
My system for handling data makes heavy use of three main types of data collections:
features: This is a dataframe which contains all features or values which we will allow models to use in the course of learning relationships - and later making predictions. All features must be values which would have been known at the point in time when the model needed to make predictions.
In other words, next_12_months_returns would be a bad feature since it would not be known at the time needed. The features dataframe has a multi-index of date/symbol and column names unique to each feature. More on this later.
outcomes: This is a dataframe of all possible future outcomes which we may be interested in predicting, magically shifted back in time to T-zero. For instance, we may want to predict the total_return for a symbol over the year following T=0 (the time of prediction). We would look ahead into the future, calculate what ultimately did happen to this metric, and log it onto time T=0. I'll explain why in a minute.
Like features, this dataframe has rows indexed by date/symbol and columns named with a convention which describes the outcome.
master: The final data structure type is the master dataframe. This contains any static information about each symbol in the universe, such as the SIC code, the number of shares outstanding, beta factors, etc...
In practice, things in the master may change over time (SIC codes and shares out can both change...) but I've found it sufficient for my purposes to take the current static values for the current point in time.
This dataframe uses row index of symbol only. You could, of course, add a date/symbol index if you wanted to reflect changing values over time.
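To make the three structures concrete, here is a minimal sketch built from made-up numbers rather than real market data. The column names (momentum_5_day, intraday_chg, close_1, sic_code) and every value are placeholders chosen for illustration; only the index layout follows the description above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, purely for illustration (not real market data).
dates = pd.to_datetime(["2016-12-28", "2016-12-29", "2016-12-30"])
index = pd.MultiIndex.from_product(
    [dates, ["AAPL", "CSCO"]], names=["date", "symbol"]
)

# features: values that would be known as of the close on each date
features = pd.DataFrame(
    {
        "momentum_5_day": [0.01, -0.02, 0.00, 0.01, -0.01, 0.02],
        "intraday_chg": [0.002, 0.001, -0.003, 0.004, -0.001, 0.000],
    },
    index=index,
)

# outcomes: future results logged back onto the prediction date (T=0).
# The last date has no known future yet, so its values are NaN.
outcomes = pd.DataFrame(
    {"close_1": [0.002, -0.001, -0.008, 0.008, np.nan, np.nan]},
    index=index,
)

# master: one row per symbol holding static attributes (placeholder values)
master = pd.DataFrame(
    {"sic_code": [3571, 3576]},
    index=pd.Index(["AAPL", "CSCO"], name="symbol"),
)

print(features.index.names)                    # date/symbol multi-index
print(outcomes.index.equals(features.index))   # same rows in both frames
print(master.loc["AAPL"])                      # lookup by symbol only
```

Because features and outcomes share an identical index, any column from one can be lined up against any column from the other without extra bookkeeping.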
Why this data scheme?
It may seem odd to split the features and outcomes into distinct dataframes, and odd to create a dataframe of several different possible "outcomes". Most important, it may seem odd to record on t=0 what will happen in the next day, week, month, etc...
There are several reasons for this approach:
- This makes it trivial to extract the X's and y's when training models. Just slice some columns from features for the X and slice one column of outcomes for the y. They're already aligned and ready for fitting.
- This makes it trivial to toggle between various time horizons - just change the column of outcomes used for y.
- This helps us guard against inadvertent "peeking" at the future. We only ever use features columns in X.
- This allows us to use the incredibly efficient pandas concat methods to quickly align data for purposes of training models.
Trust me. This will save you many, many hours of debugging and brute force coding down the road.
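As a rough sketch of the first two points, the snippet below wraps the slice-and-align step in a small helper and switches the prediction horizon by changing one argument. The helper name make_xy, the column names, and the random data are all assumptions made for this illustration; they are not part of the framework itself.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: 10 business days x 2 symbols of random numbers.
rng = np.random.default_rng(0)
dates = pd.bdate_range("2016-12-01", periods=10)
index = pd.MultiIndex.from_product(
    [dates, ["AAPL", "CSCO"]], names=["date", "symbol"]
)
features = pd.DataFrame(
    rng.normal(size=(len(index), 2)), index=index, columns=["feat_a", "feat_b"]
)
outcomes = pd.DataFrame(
    rng.normal(size=(len(index), 2)), index=index, columns=["close_1", "close_5"]
)

# Mimic the nulls at the end of the sample: a 1-day outcome is unknown on
# the last date, a 5-day outcome is unknown on the last five dates.
row_dates = index.get_level_values("date")
outcomes.loc[row_dates >= dates[-1], "close_1"] = np.nan
outcomes.loc[row_dates >= dates[-5], "close_5"] = np.nan


def make_xy(features, outcomes, target, feature_cols=None):
    """Hypothetical helper: align chosen feature columns with one outcome."""
    cols = list(feature_cols) if feature_cols is not None else list(features.columns)
    xy = features[cols].join(outcomes[target]).dropna()
    return xy[cols], xy[target]


# Toggling the horizon is just a different outcome column name.
X1, y1 = make_xy(features, outcomes, "close_1")
X5, y5 = make_xy(features, outcomes, "close_5")
print(X1.shape, y1.shape)
print(X5.shape, y5.shape)
```

Rows are dropped per outcome, so the shorter horizon keeps more usable observations than the longer one.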
Let's create simple toy examples of each dataframe using free data from quandl:
First, we'll make a utility function which downloads one or more symbols from quandl and returns the adjusted OHLC data (I generally find adjusted data to be best).
import pandas as pd
pd.core.common.is_list_like = pd.api.types.is_list_like # may be necessary in some versions of pandas
import pandas_datareader.data as web

def get_symbols(symbols,data_source, begin_date=None,end_date=None):
    out = pd.DataFrame()
    for symbol in symbols:
        df = web.DataReader(symbol, data_source,begin_date, end_date)\
            [['AdjOpen','AdjHigh','AdjLow','AdjClose','AdjVolume']].reset_index()
        df.columns = ['date','open','high','low','close','volume'] #my convention: always lowercase
        df['symbol'] = symbol # add a new column which contains the symbol so we can keep multiple symbols in the same dataframe
        df = df.set_index(['date','symbol'])
        out = pd.concat([out,df],axis=0) #stacks on top of previously collected data
    return out.sort_index()

prices = get_symbols(['AAPL','CSCO'],data_source='quandl',\
                     begin_date='2015-01-01',end_date='2017-01-01')
Now, we will create some toy features. If the syntax is unclear, I'll cover that in more depth in the next post. For now, just note that we've created five features for both symbols using only data that would be available as of the end of day T.
Also note that I've dropped any rows which contain any nulls for simplicity, since scikit-learn can't handle those out of the box.
features = pd.DataFrame(index=prices.index)
features['volume_change_ratio'] = prices.groupby(level='symbol').volume\
    .diff(1) / prices.groupby(level='symbol').shift(1).volume
features['momentum_5_day'] = prices.groupby(level='symbol').close\
    .pct_change(5)
features['intraday_chg'] = (prices.groupby(level='symbol').close\
    .shift(0) - prices.groupby(level='symbol').open\
    .shift(0))/prices.groupby(level='symbol').open.shift(0)
features['day_of_week'] = features.index.get_level_values('date').weekday
features['day_of_month'] = features.index.get_level_values('date').day
features.dropna(inplace=True)
print(features.tail(10))
                   volume_change_ratio  momentum_5_day  intraday_chg  \
date       symbol
2016-12-23 AAPL              -0.453747        0.004743      0.008046
           CSCO              -0.291298       -0.001961     -0.000327
2016-12-27 AAPL               0.284036        0.005316      0.006351
           CSCO               0.546260       -0.002276      0.001305
2016-12-28 AAPL               0.142595       -0.001625     -0.006467
           CSCO              -0.151900       -0.004581     -0.009121
2016-12-29 AAPL              -0.280609       -0.002819      0.002404
           CSCO              -0.085396        0.001315      0.002963
2016-12-30 AAPL               1.033726       -0.004042     -0.007115
           CSCO               0.836194       -0.007879     -0.011126

                   day_of_week  day_of_month
date       symbol
2016-12-23 AAPL              4            23
           CSCO              4            23
2016-12-27 AAPL              1            27
           CSCO              1            27
2016-12-28 AAPL              2            28
           CSCO              2            28
2016-12-29 AAPL              3            29
           CSCO              3            29
2016-12-30 AAPL              4            30
           CSCO              4            30
Next, we'll create outcomes. Note that the seemingly unnecessary lambda function is needed because of this issue with pandas.
outcomes = pd.DataFrame(index=prices.index)
# next day's opening change
outcomes['open_1'] = prices.groupby(level='symbol').open.shift(-1)\
    /prices.groupby(level='symbol').close.shift(0)-1
# next day's closing change
func_one_day_ahead = lambda x: x.pct_change(-1)
outcomes['close_1'] = prices.groupby(level='symbol').close\
    .apply(func_one_day_ahead)
func_five_day_ahead = lambda x: x.pct_change(-5)
outcomes['close_5'] = prices.groupby(level='symbol').close\
    .apply(func_five_day_ahead)
print((outcomes.tail(15)))
                     open_1   close_1   close_5
date       symbol
2016-12-20 CSCO    0.004254  0.004602  0.004602
2016-12-21 AAPL   -0.006065  0.006621  0.002827
           CSCO   -0.000657 -0.001313 -0.001313
2016-12-22 AAPL   -0.006019 -0.001974  0.004058
           CSCO    0.002626 -0.002293  0.007942
2016-12-23 AAPL    0.000000 -0.006311       NaN
           CSCO    0.003603 -0.004889       NaN
2016-12-27 AAPL    0.002217  0.004282       NaN
           CSCO    0.000652  0.008547       NaN
2016-12-28 AAPL   -0.002655  0.000257       NaN
           CSCO   -0.001644 -0.001313       NaN
2016-12-29 AAPL   -0.000685  0.007857       NaN
           CSCO    0.003283  0.007942       NaN
2016-12-30 AAPL         NaN       NaN       NaN
           CSCO         NaN       NaN       NaN
Note that the shifted periods are negative, which in pandas convention looks ahead in time. This means that at the end of our time period we will have nulls - and more nulls in the outcome columns that need to look further into the future. We don't dropna() here since we may want to use open_1 and there's no reason to throw away data from that column just because a different outcome didn't have data. But I digress.
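One detail worth being aware of when looking ahead with negative periods: pandas computes pct_change(periods) as x / x.shift(periods) - 1, so pct_change(-1) divides by the future value (x[t] / x[t+1] - 1) rather than by today's value. A forward return in the usual sense is x[t+1] / x[t] - 1, which has the opposite sign and a slightly different magnitude. The sketch below compares the two on a made-up price series; the prices and variable names are assumptions for illustration only.

```python
import pandas as pd

# Made-up closing prices that rise 10% per day.
close = pd.Series(
    [100.0, 110.0, 121.0],
    index=pd.bdate_range("2016-12-28", periods=3),
    name="close",
)

# pct_change(-1) uses the *future* value as the base: x[t] / x[t+1] - 1
future_base = close.pct_change(-1)

# A forward return uses today's value as the base: x[t+1] / x[t] - 1
forward_return = close.shift(-1) / close - 1

print(pd.DataFrame({"pct_change(-1)": future_base,
                    "forward_return": forward_return}))
```

With a date/symbol multi-index, the same forward-return idea would be applied per symbol, for example by dividing a grouped shift(-1) of the close by the unshifted close, much like the open_1 calculation above.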
Now, to put it together, we'll train a simple linear model in scikit-learn, using all features to predict close_1.
# first, create y (a series) and X (a dataframe), with only rows where
# a valid value exists for both y and X
y = outcomes.close_1
X = features
Xy = X.join(y).dropna()
y = Xy[y.name]
X = Xy[X.columns]
print(y.shape)
print(X.shape)
(996,)
(996, 5)
Note that all of these slightly tedious steps have left us with properly sized, identically indexed data objects. At this point, the modeling is dead simple:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X,y)
print("Model RSQ: "+ str(model.score(X,y)))
print("Coefficients: ")
print(pd.Series(model.coef_,index=X.columns).sort_values(ascending=False))
Model RSQ: 0.01598347165537528
Coefficients:
intraday_chg           0.150482
volume_change_ratio    0.000976
day_of_month           0.000036
day_of_week           -0.000427
momentum_5_day        -0.005543
dtype: float64
Clearly, this model isn't very useful but illustrates the workflow.
If we wanted to instead try a random forest to predict tomorrow's open, it'd be mostly copy-paste:
from sklearn.ensemble import RandomForestRegressor
y = outcomes.open_1
X = features
Xy = X.join(y).dropna()
y = Xy[y.name]
X = Xy[X.columns]
print(y.shape)
print(X.shape)
model = RandomForestRegressor(max_features=3)
model.fit(X,y)
print("Model Score: "+ str(model.score(X,y)))
print("Feature Importance: ")
print(pd.Series(model.feature_importances_,index=X.columns)\
    .sort_values(ascending=False))
(996,)
(996, 5)
Model Score: 0.7640516995438571
Feature Importance:
intraday_chg           0.347469
volume_change_ratio    0.237988
momentum_5_day         0.211114
day_of_month           0.146615
day_of_week            0.056814
dtype: float64
C:\Anaconda2\envs\blog\lib\site-packages\sklearn\ensemble\forest.py:248: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22. "10 in version 0.20 to 100 in 0.22.", FutureWarning)
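This yields a vastly improved RSQ, but note that it is almost certainly ridiculously overfitted, as random forests are prone to be. The score is computed on the same rows used to fit the model, so it says very little about how the model would perform out of sample.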
We'll cover ways to systematically avoid allowing the model to overfit in future posts, but that requires going a bit further down the rabbit hole.
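In the meantime, a quick way to see the overfitting problem is to fit a random forest on pure noise and compare the in-sample score with the score on later, held-out rows. This is only a rough sketch on synthetic data, with an arbitrary chronological split and arbitrary model settings (n_estimators=50, random_state=0); it is not the walk-forward approach covered later in the series.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: features and outcome are independent random noise,
# so there is nothing real for a model to learn.
rng = np.random.default_rng(42)
dates = pd.bdate_range("2015-01-01", periods=300)
X = pd.DataFrame(
    rng.normal(size=(300, 5)),
    index=dates,
    columns=[f"feat_{i}" for i in range(5)],
)
y = pd.Series(rng.normal(size=300), index=dates, name="outcome")

# Chronological split: fit on the first 200 days, evaluate on the last 100.
split = 200
X_train, y_train = X.iloc[:split], y.iloc[:split]
X_test, y_test = X.iloc[split:], y.iloc[split:]

model = RandomForestRegressor(n_estimators=50, max_features=3, random_state=0)
model.fit(X_train, y_train)

in_sample = model.score(X_train, y_train)
out_of_sample = model.score(X_test, y_test)
print("in-sample R^2:     ", round(in_sample, 3))
print("out-of-sample R^2: ", round(out_of_sample, 3))
```

A high in-sample score on data with no signal at all is a reminder that a score measured on the training rows should not be read as predictive power.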
Side note: in this example (and often, in real life) we've mixed together all observations from AAPL and CSCO into one dataset. We could have alternatively trained two different models for the two symbols, which may have achieved better fit, but almost certainly at the cost of worse generalization out of sample. The bias-variance trade-off in action!
Once the model is trained, it becomes a one-liner to make predictions from a set of feature values. In this case, we'll simply feed the same X values used to train the model, but in live usage, of course, we'd want to apply the trained model to new X values.
date        symbol
2016-12-22  AAPL     -0.005515
            CSCO      0.001538
2016-12-23  AAPL     -0.000649
            CSCO      0.003587
2016-12-27  AAPL      0.003226
            CSCO      0.005280
2016-12-28  AAPL     -0.003042
            CSCO     -0.001504
2016-12-29  AAPL     -0.000526
            CSCO      0.002091
dtype: float64
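The prediction step described above might look something like the following. This is a self-contained sketch on synthetic stand-in data (random features and outcomes with hypothetical column names), so its numbers will not match the output shown; the point is only that wrapping model.predict(X) in a Series with X's index keeps predictions aligned by date and symbol.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-ins for the aligned X and y built earlier.
rng = np.random.default_rng(1)
dates = pd.bdate_range("2016-12-01", periods=20)
index = pd.MultiIndex.from_product(
    [dates, ["AAPL", "CSCO"]], names=["date", "symbol"]
)
X = pd.DataFrame(
    rng.normal(size=(len(index), 3)),
    index=index,
    columns=["feat_a", "feat_b", "feat_c"],
)
y = pd.Series(rng.normal(size=len(index)), index=index, name="close_1")

model = LinearRegression().fit(X, y)

# The one-liner: predictions inherit the date/symbol index of X.
predictions = pd.Series(model.predict(X), index=X.index)
print(predictions.tail(10))
```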
Let me pause here to emphasize the most critical point to understand about this framework. Read this twice!
The date of a feature row represents the day when a value would be known after that day's trading, using the feature value date as T=0. The date of an outcome row represents what will happen in the n days following that date.
Predictions are indexed to the date of the evening when the model could have been run. In other words, the prediction indexed to 2016-12-23 represents what the model believes will happen in some time period after 12/23. In practical usage, we can't start using the trading signal until T+1 (since predictions are generated after markets are closed on T+0).
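One way to make the T+1 point explicit in code is to shift each symbol's predictions forward by one row, so that every value sits on the first date it could actually be traded. The sketch below uses made-up prediction values and assumes each symbol has one row per trading day; the names predictions and tradable_signal are hypothetical.

```python
import pandas as pd

# Made-up predictions indexed by the evening on which they were generated.
dates = pd.bdate_range("2016-12-27", periods=4)
index = pd.MultiIndex.from_product(
    [dates, ["AAPL", "CSCO"]], names=["date", "symbol"]
)
predictions = pd.Series(
    [0.003, 0.005, -0.003, -0.002, -0.001, 0.002, 0.004, 0.001],
    index=index,
    name="prediction",
)

# Shift within each symbol: the prediction made after the close on T
# becomes the signal available for trading on T+1.
tradable_signal = predictions.groupby(level="symbol").shift(1).rename("signal")

print(pd.concat([predictions, tradable_signal], axis=1))
```

The first date for each symbol has no signal, since no prediction existed the evening before.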
This post presented the concept of organizing data into a features dataframe and an outcomes dataframe, and then showed how simple it is to join these two dataframes together to train a model.
True, the convention may take a few examples to get used to. However, after trial and error, I've found this to be the most error-resistant, flexible, and high-performance way to go.
In the next post, I will share some methods of feature engineering and feature selection.
One last thing...
If you've found this post useful, please follow @data2alpha on twitter and forward to a friend or colleague who may also find this topic interesting.
Finally, take a minute to leave a comment below - either to discuss this post or to offer an idea for future posts. Thanks for reading!