Use of machine learning in the quantitative investment field is, by all indications, skyrocketing. The proliferation of easily accessible data (both traditional and alternative), along with some very approachable frameworks for machine learning models, is encouraging many to explore the arena.
However, these financial ML explorers are learning that there are many ways in which using ML to predict financial time series differs greatly from labeling cat pictures or flagging spam. Among these differences is that traditional model performance metrics (RSQ, MSE, accuracy, F1, etc...) can be misleading and incomplete.
Over the past several years, I've developed a set of metrics which have proved useful for comparing and optimizing financial time series models. These metrics attempt to measure models' predictive power but also their trade-ability, critically important for those who actually intend to use their models in the real world.
In this post, I will present a general outline of my approach and will demonstrate a few of the most useful metrics I've added to my standard "scorecard". I look forward to hearing how others may think to extend the concept. If you'd like to replicate and experiment with the below code, you can download the source notebook for this post by right-clicking on the below button and choosing "save link as".
If you haven't already checked out the previous four installments in this tutorial, you may want to review those first. Many of the coding patterns used below are discussed at length there.
Preparing sample data¶
I will illustrate this metrics methodology using a simple example of synthetically generated data (see previous posts in this tutorial for explanations of the below method of creating data).
    ## Remove / Replace this code with a link to your quandl key, if you have one
    import sys
    sys.path.append('/anaconda/')
    import config
    quandl_key = config.quandl_key

    from IPython.core.display import Image
    import numpy as np
    import pandas as pd
    pd.core.common.is_list_like = pd.api.types.is_list_like # May be necessary to fix below issue
    # https://github.com/pydata/pandas-datareader/issues/534
    import pandas_datareader.data as web
    %matplotlib inline

    def get_symbols(symbols,data_source, quandl_key=None, begin_date=None,end_date=None):
        out = pd.DataFrame()
        for symbol in symbols:
            df = web.DataReader(symbol, data_source,begin_date, end_date)\
            [['AdjOpen','AdjHigh','AdjLow','AdjClose','AdjVolume']].reset_index()
            df.columns = ['date','open','high','low','close','volume']
            df['symbol'] = symbol # add symbol col so we can keep all in the same dataframe
            df = df.set_index(['date','symbol'])
            out = pd.concat([out,df],axis=0) #stacks on top of previously collected data
        return out.sort_index()

    prices = get_symbols(['AAPL','CSCO','AMZN','YHOO','MSFT'],\
                         data_source='quandl',quandl_key=quandl_key,begin_date='2012-01-01',end_date=None)
    # Note: we're only using real price data to generate an index set.
    # We will make synthetic features and outcomes below instead of deriving from price
The below code generates several features then synthetically generates an outcome series from them (along with noise). This guarantees that the features will be informative, since the outcome has been constructed to ensure a relationship.
    num_obs = prices.close.count()

    def add_memory(s,n_days=50,mem_strength=0.1):
        ''' adds autoregressive behavior to series of data'''
        add_ewm = lambda x: (1-mem_strength)*x + mem_strength*x.ewm(n_days).mean()
        out = s.groupby(level='symbol').apply(add_ewm)
        return out

    # generate feature data
    f01 = pd.Series(np.random.randn(num_obs),index=prices.index)
    f01 = add_memory(f01,10,0.1)
    f02 = pd.Series(np.random.randn(num_obs),index=prices.index)
    f02 = add_memory(f02,10,0.1)
    f03 = pd.Series(np.random.randn(num_obs),index=prices.index)
    f03 = add_memory(f03,10,0.1)
    f04 = pd.Series(np.random.randn(num_obs),index=prices.index)
    f04 = f04 # no memory

    features = pd.concat([f01,f02,f03,f04],axis=1)

    ## now, create response variable such that it is related to features
    # f01 becomes increasingly important, f02 becomes decreasingly important,
    # f03 oscillates in importance, f04 is stationary,
    # and finally a noise component is added

    outcome = f01 * np.linspace(0.5,1.5,num_obs) + \
              f02 * np.linspace(1.5,0.5,num_obs) + \
              f03 * pd.Series(np.sin(2*np.pi*np.linspace(0,1,num_obs)*2)+1,index=f03.index) + \
              f04 + \
              np.random.randn(num_obs) * 3
    outcome.name = 'outcome'
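Since the price download is only used to obtain a (date, symbol) index, readers without a Quandl key could probably substitute a synthetic index. The sketch below is one way to do that; `make_synthetic_index`, the date range, and the fixed random seed are my own assumptions rather than part of the original notebook, and business days will not match actual exchange holidays.

```python
import numpy as np
import pandas as pd

def make_synthetic_index(symbols, start="2012-01-01", end="2018-03-27"):
    """Build a sorted (date, symbol) MultiIndex of business days.

    Hypothetical helper, not part of the original post: it stands in for
    the index that the downloaded `prices` frame would otherwise provide.
    """
    dates = pd.bdate_range(start, end)
    return pd.MultiIndex.from_product([dates, sorted(symbols)],
                                      names=["date", "symbol"])

index = make_synthetic_index(["AAPL", "CSCO", "AMZN", "YHOO", "MSFT"])
num_obs = len(index)

# Fixed seed (an assumption) so that repeated runs give the same features.
rng = np.random.default_rng(0)
f01 = pd.Series(rng.standard_normal(num_obs), index=index, name="f01")
```

With this in place, `index` would take the role of `prices.index` and `num_obs` the role of `prices.close.count()` in the feature-generation code.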
Generating models and predictions¶
Imagine that we created a simple linear model (such as below) and wanted to measure its effectiveness at prediction.
Note: we'll follow the walk-forward modeling process described in the previous post. If you don't understand the below code snippet (and want to...) please check out that post.
    from sklearn.linear_model import LinearRegression

    ## fit models for each timestep on a walk-forward basis
    recalc_dates = features.resample('Q',level='date').mean().index.values[:-1]
    models = pd.Series(index=recalc_dates)
    for date in recalc_dates:
        X_train = features.xs(slice(None,date),level='date',drop_level=False)
        y_train = outcome.xs(slice(None,date),level='date',drop_level=False)
        model = LinearRegression()
        model.fit(X_train,y_train)
        models.loc[date] = model

    ## predict values walk-forward (all predictions out of sample)
    begin_dates = models.index
    end_dates = models.index[1:].append(pd.to_datetime(['2099-12-31']))

    predictions = pd.Series(index=features.index)

    for i,model in enumerate(models): #loop thru each models object in collection
        X = features.xs(slice(begin_dates[i],end_dates[i]),level='date',drop_level=False)
        p = pd.Series(model.predict(X),index=X.index)
        predictions.loc[X.index] = p
Traditional model evaluation¶
So we've got a model, we've got a sizeable set of (out of sample) predictions. Is the model any good? Should we junk it, tune it, or trade it? Since this is a regression model, I'll throw our data into scikit-learn's metrics package.
    import sklearn.metrics as metrics

    # make sure we have 1-for-1 mapping between pred and true
    common_idx = outcome.dropna().index.intersection(predictions.dropna().index)
    y_true = outcome[common_idx]
    y_true.name = 'y_true'
    y_pred = predictions[common_idx]
    y_pred.name = 'y_pred'

    standard_metrics = pd.Series()

    standard_metrics.loc['explained variance'] = metrics.explained_variance_score(y_true, y_pred)
    standard_metrics.loc['MAE'] = metrics.mean_absolute_error(y_true, y_pred)
    standard_metrics.loc['MSE'] = metrics.mean_squared_error(y_true, y_pred)
    standard_metrics.loc['MedAE'] = metrics.median_absolute_error(y_true, y_pred)
    standard_metrics.loc['RSQ'] = metrics.r2_score(y_true, y_pred)

    print(standard_metrics)
    explained variance    0.251057
    MAE                   2.491337
    MSE                   9.784733
    MedAE                 2.098055
    RSQ                   0.251051
    dtype: float64
These stats don't really tell us much by themselves. You may have an intuition for r-squared, so that may give you a level of confidence in the models. However, even this metric has problems, and it tells us little about the practicality of this signal from a trading point of view.
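To make the r-squared concern concrete, here is a small constructed example (the numbers and both "models" are hypothetical, chosen only for illustration). Model A always gets the direction right but overshoots the magnitude threefold; model B makes tiny predictions and gets the direction wrong a third of the time. R-squared ranks B above A, even though A would be the more useful directional signal.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def directional_accuracy(y_true, y_pred):
    """Share of observations where prediction and outcome share a sign."""
    return np.mean(np.sign(y_true) == np.sign(y_pred))

y_true = np.array([1.0, -2.0, 0.5, -1.5, 2.0, -0.5])

# Hypothetical model A: always right on direction, but three times too large.
pred_a = 3.0 * y_true
# Hypothetical model B: tiny predictions, wrong on direction twice.
pred_b = np.array([0.1, -0.1, -0.1, -0.1, 0.1, 0.1])

rsq_a, acc_a = r_squared(y_true, pred_a), directional_accuracy(y_true, pred_a)
rsq_b, acc_b = r_squared(y_true, pred_b), directional_accuracy(y_true, pred_b)

print(f"model A: RSQ={rsq_a:.2f}, directional accuracy={acc_a:.0%}")
print(f"model B: RSQ={rsq_b:.2f}, directional accuracy={acc_b:.0%}")
```

Model A's r-squared is negative because squared error penalizes its badly scaled magnitudes, which is a separate question from whether its sign can be traded.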
True, we could construct some trading rules around this series of predictions and perform a formal backtest on that. However, that is quite time consuming and introduces a number of extraneous variables into the equation.
A better way... Creating custom metrics¶
Instead of relying on generic ML metrics, we will create several custom metrics that will hopefully give a more complete picture of strength, reliability, and practicality of these models.
I'll work through an example of creating an extensible scorecard with about a half dozen custom-defined metrics as a starting point. You can feel free to extend this into a longer scorecard which is suited to your needs and beliefs. In my own trading, I use about 25 metrics in a standard "scorecard" each time I evaluate a model. You may prefer to use more, fewer, or different metrics but the process should be applicable.
I'll focus only on regression-oriented metrics (i.e., those which use a continuous prediction rather than a binary or classification prediction). It's trivial to re-purpose the same framework to a classification-oriented environment.
Step 1: Preprocess data primitives¶
Before implementing specific metrics we need to do some data pre-processing. It'll become clear why doing this first will save considerable time later when calculating aggregate metrics.
To create these intermediate values, you'll need the following inputs:
- y_pred: the continuous variable prediction made by your model for each timestep, for each symbol
- y_true: the continuous variable actual outcome for each timestep, for each symbol.
- index: this is the unique identifier for each prediction or actual result. If working with a single instrument, then you can simply use date (or time or whatever). If you're using multiple instruments, a multi-index with (date/symbol) is necessary.
In other words, if your model is predicting one-day price changes, you'd want your y_pred to be the model's predictions made as of March 9th (for the coming day), indexed as 2017-03-09, and you'd want the actual future outcome which will play out in the next day also aligned to March 9th. This "peeking" convention is very useful for working with large sets of data across different time horizons. It is described ad nauseam in Part 1: Data Management.
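As a rough sketch of that alignment, the snippet below builds a next-day outcome and stores it on the date the prediction would be made. The prices and the symbols `AAA` and `BBB` are made up for illustration; the key step is shifting within each symbol so that one instrument's prices never leak into another's.

```python
import pandas as pd

# Hypothetical closing prices for two symbols over four days.
idx = pd.MultiIndex.from_product(
    [pd.to_datetime(["2017-03-08", "2017-03-09", "2017-03-10", "2017-03-13"]),
     ["AAA", "BBB"]],
    names=["date", "symbol"])
close = pd.Series([10.0, 50.0, 11.0, 49.0, 12.1, 49.0, 12.1, 51.45],
                  index=idx, name="close")

# Next-day percent change, stored on the day the prediction is made.
# shift(-1) pulls the following day's price back to the current row, per symbol.
next_close = close.groupby(level="symbol").shift(-1)
y_true = (next_close / close - 1.0).rename("y_true")

print(y_true)
```

The final date has no following observation, so its `y_true` is NaN and would drop out when predictions and outcomes are intersected.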
The raw input data we need to provide might look something like this:
                         y_pred    y_true
    date       symbol
    2018-03-26 MSFT    0.707500  1.693673
    2018-03-27 AAPL    1.744680 -1.830242
               AMZN    0.594976 -2.302375
               CSCO   -2.838380 -2.462017
               MSFT   -0.417073 -1.586291
We will feed this data into a simple function which will return a dataframe with the y_pred and y_true values, along with several other useful derivative values. These derivative values include:
- sign_pred: positive or negative sign of prediction
- sign_true: positive or negative sign of true outcome
- is_correct: 1 if sign_pred == sign_true, else 0
- is_incorrect: opposite
- is_predicted: 1 if the model has made a valid prediction, 0 if not. This is important if models only emit predictions when they have a certain level of confidence
- result: the profit (loss) resulting from betting one unit in the direction of the sign_pred. This is the continuous variable result of following the model
    def make_df(y_pred,y_true):
        y_pred.name = 'y_pred'
        y_true.name = 'y_true'

        df = pd.concat([y_pred,y_true],axis=1)

        df['sign_pred'] = df.y_pred.apply(np.sign)
        df['sign_true'] = df.y_true.apply(np.sign)
        df['is_correct'] = 0
        df.loc[df.sign_pred * df.sign_true > 0 ,'is_correct'] = 1 # only registers 1 when prediction was made AND it was correct
        df['is_incorrect'] = 0
        df.loc[df.sign_pred * df.sign_true < 0,'is_incorrect'] = 1 # only registers 1 when prediction was made AND it was wrong
        df['is_predicted'] = df.is_correct + df.is_incorrect
        df['result'] = df.sign_pred * df.y_true
        return df

    df = make_df(y_pred,y_true)
    print(df.dropna().tail())
                         y_pred    y_true  sign_pred  sign_true  is_correct  is_incorrect  is_predicted    result
    date       symbol
    2018-03-26 MSFT    0.707500  1.693673        1.0        1.0           1             0             1  1.693673
    2018-03-27 AAPL    1.744680 -1.830242        1.0       -1.0           0             1             1 -1.830242
               AMZN    0.594976 -2.302375        1.0       -1.0           0             1             1 -2.302375
               CSCO   -2.838380 -2.462017       -1.0       -1.0           1             0             1  2.462017
               MSFT   -0.417073 -1.586291       -1.0       -1.0           1             0             1  1.586291
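To see why is_predicted matters, consider a model that sometimes abstains. The sketch below uses a compact restatement of make_df (written so the snippet runs on its own, and intended to be equivalent in behavior) on a hypothetical five-row series where two predictions are missing. Accuracy is then measured only over the rows where a prediction was actually made.

```python
import numpy as np
import pandas as pd

def make_df(y_pred, y_true):
    """Compact restatement of the post's make_df, for a self-contained demo."""
    df = pd.concat([y_pred.rename("y_pred"), y_true.rename("y_true")], axis=1)
    df["sign_pred"] = np.sign(df.y_pred)
    df["sign_true"] = np.sign(df.y_true)
    # Comparisons involving NaN are False, so abstentions count as neither.
    df["is_correct"] = (df.sign_pred * df.sign_true > 0).astype(int)
    df["is_incorrect"] = (df.sign_pred * df.sign_true < 0).astype(int)
    df["is_predicted"] = df.is_correct + df.is_incorrect
    df["result"] = df.sign_pred * df.y_true
    return df

# Hypothetical model that abstains (NaN) on the second and fourth rows.
y_pred = pd.Series([0.8, np.nan, -0.3, np.nan, 0.5])
y_true = pd.Series([1.2, -0.7, 0.4, 0.9, 2.0])
demo = make_df(y_pred, y_true)

accuracy = demo.is_correct.sum() / demo.is_predicted.sum() * 100
edge = demo.result.mean()  # mean() skips the NaN rows by default

print(demo[["is_correct", "is_incorrect", "is_predicted", "result"]])
print(f"accuracy={accuracy:.1f}%, edge={edge:.3f}")
```

Dividing by the count of all rows instead would treat abstentions as misses and understate the accuracy of the predictions that were made.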
Defining our metrics¶
With this set of intermediate variables pre-processed, we can more easily calculate metrics. The metrics we'll start with here include things like:
- Accuracy: Just as the name suggests, this measures the percent of predictions that were directionally correct vs. incorrect.
- Edge: perhaps the most useful of all metrics, this is the expected value of the prediction over a sufficiently large set of draws. Think of this like a blackjack card counter who knows the expected profit on each dollar bet when the odds are at a level of favorability
- Noise: critically important but often ignored, the noise metric estimates how dramatically the model's predictions vary from one day to the next. As you might imagine, a model which abruptly changes its mind every few days is much harder to follow (and much more expensive to trade) than one which is a bit more steady.
The below function takes in our pre-processed data primitives and returns a scorecard with these metrics:
    def calc_scorecard(df):
        scorecard = pd.Series()
        # building block metrics
        scorecard.loc['accuracy'] = df.is_correct.sum()*1. / (df.is_predicted.sum()*1.)*100
        scorecard.loc['edge'] = df.result.mean()
        scorecard.loc['noise'] = df.y_pred.diff().abs().mean()

        return scorecard

    calc_scorecard(df)
    accuracy    67.075999
    edge         1.438081
    noise        2.387606
    dtype: float64
Much better. I now know that we've been directionally correct about two-thirds of the time, and that following this signal would create an edge of ~1.4 units per time period.
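One detail worth checking on your own data: on a (date, symbol) index sorted by date, a plain `diff()` compares adjacent rows, and adjacent rows can belong to different symbols. If the intent is day-to-day change for each instrument, a per-symbol variant may be closer to it. The sketch below contrasts the two on hypothetical predictions; the symbols and values are invented for illustration.

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [pd.to_datetime(["2018-03-26", "2018-03-27", "2018-03-28"]), ["AAA", "BBB"]],
    names=["date", "symbol"])
# Hypothetical predictions: AAA is steady near +1, BBB is steady near -1.
y_pred = pd.Series([1.0, -1.0, 1.1, -1.1, 1.0, -1.0], index=idx)

# Row-to-row differences mix symbols on a date-major index.
noise_rowwise = y_pred.diff().abs().mean()

# Differences taken within each symbol compare one day to the next.
noise_by_symbol = y_pred.groupby(level="symbol").diff().abs().mean()

print(f"row-wise noise:   {noise_rowwise:.2f}")
print(f"per-symbol noise: {noise_by_symbol:.2f}")
```

Here both series barely move from day to day, yet the row-wise figure is large because it mostly measures the gap between the two symbols' predictions.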
Let's keep going. We can now easily combine and transform things to derive new metrics. The below function shows several examples, including:
- y_true_chg and y_pred_chg: The average magnitude of change (per period) in y_true and y_pred.
- prediction_calibration: A simple ratio of the magnitude of our predictions vs. magnitude of truth. This gives some indication of whether our model is properly tuned to the size of movement in addition to the direction of it.
- capture_ratio: Ratio of the "edge" we gain by following our predictions vs. the actual daily change. 100 would indicate that we were perfectly capturing the true movement of the target variable.
    def calc_scorecard(df):
        scorecard = pd.Series()
        # building block metrics
        scorecard.loc['accuracy'] = df.is_correct.sum()*1. / (df.is_predicted.sum()*1.)*100
        scorecard.loc['edge'] = df.result.mean()
        scorecard.loc['noise'] = df.y_pred.diff().abs().mean()

        # derived metrics
        scorecard.loc['y_true_chg'] = df.y_true.abs().mean()
        scorecard.loc['y_pred_chg'] = df.y_pred.abs().mean()
        scorecard.loc['prediction_calibration'] = scorecard.loc['y_pred_chg']/scorecard.loc['y_true_chg']
        scorecard.loc['capture_ratio'] = scorecard.loc['edge']/scorecard.loc['y_true_chg']*100

        return scorecard

    calc_scorecard(df)
    accuracy                  67.075999
    edge                       1.438081
    noise                      2.387606
    y_true_chg                 2.888443
    y_pred_chg                 1.689327
    prediction_calibration     0.584857
    capture_ratio             49.787427
    dtype: float64
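The two ratio metrics can be checked by hand from the building blocks in the output above. This short snippet simply repeats the arithmetic using those printed values:

```python
# Values copied from the scorecard output above.
edge = 1.438081
y_true_chg = 2.888443
y_pred_chg = 1.689327

# Average predicted move relative to the average actual move.
prediction_calibration = y_pred_chg / y_true_chg
# Edge as a percentage of the average actual move.
capture_ratio = edge / y_true_chg * 100

print(round(prediction_calibration, 4))  # 0.5849
print(round(capture_ratio, 2))           # 49.79
```

Read together, the predictions are on average a bit under 60% the size of the realized moves, and following their sign captures roughly half of the average absolute move.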
Additionally, metrics can be easily calculated for only long or short predictions (for a two-sided model) or separately for positions which ended up being winners and losers.
- edge_long and edge_short: The "edge" for only long signals or for short signals.
- edge_win and edge_lose: The "edge" for only winners or for only losers.
    def calc_scorecard(df):
        scorecard = pd.Series()
        # building block metrics
        scorecard.loc['accuracy'] = df.is_correct.sum()*1. / (df.is_predicted.sum()*1.)*100
        scorecard.loc['edge'] = df.result.mean()
        scorecard.loc['noise'] = df.y_pred.diff().abs().mean()

        # derived metrics
        scorecard.loc['y_true_chg'] = df.y_true.abs().mean()
        scorecard.loc['y_pred_chg'] = df.y_pred.abs().mean()
        scorecard.loc['prediction_calibration'] = scorecard.loc['y_pred_chg']/scorecard.loc['y_true_chg']
        scorecard.loc['capture_ratio'] = scorecard.loc['edge']/scorecard.loc['y_true_chg']*100

        # metrics for a subset of predictions
        scorecard.loc['edge_long'] = df[df.sign_pred == 1].result.mean() - df.y_true.mean()
        scorecard.loc['edge_short'] = df[df.sign_pred == -1].result.mean() - df.y_true.mean()

        scorecard.loc['edge_win'] = df[df.is_correct == 1].result.mean() - df.y_true.mean()
        scorecard.loc['edge_lose'] = df[df.is_incorrect == 1].result.mean() - df.y_true.mean()

        return scorecard

    calc_scorecard(df)
    accuracy                  67.075999
    edge                       1.438081
    noise                      2.387606
    y_true_chg                 2.888443
    y_pred_chg                 1.689327
    prediction_calibration     0.584857
    capture_ratio             49.787427
    edge_long                  1.409962
    edge_short                 1.361946
    edge_win                   3.173426
    edge_lose                 -2.254256
    dtype: float64
From this slate of metrics, we've gained much more insight than we got from MSE, R-squared, etc...
- The model is predicting with a strong directional accuracy
- We are generating about 1.4 units of "edge" (expected profit) each prediction, which is about half of the total theoretical profit
- The model makes more on winners than it loses on losers
- The model is equally valid on both long and short predictions
If this were real data, I would be rushing to put this model into production!
Metrics over time¶
Critically important when considering using a model in live trading is to understand (a) how consistent the model's performance has been, and (b) whether its current performance has degraded from its past. Markets have a way of discovering and eliminating past sources of edge.
Here, a two-line function will calculate each metric by year:
    def scorecard_by_year(df):
        df['year'] = df.index.get_level_values('date').year
        return df.groupby('year').apply(calc_scorecard).T

    print(scorecard_by_year(df))
    year                         2012       2013       2014       2015       2016       2017       2018
    accuracy                73.723404  67.539683  64.761905  69.761905  65.873016  61.814915  67.372881
    edge                     2.059936   1.461567   1.231334   1.543369   1.426360   1.021169   1.406294
    noise                    2.719877   2.821616   2.399404   2.120782   2.267517   2.087404   2.153034
    y_true_chg               3.084964   2.936447   2.745874   2.979032   2.913586   2.730181   2.739057
    y_pred_chg               1.936173   1.941289   1.703139   1.557748   1.587378   1.466882   1.583027
    prediction_calibration   0.627616   0.661102   0.620254   0.522904   0.544820   0.537284   0.577946
    capture_ratio           66.773413  49.773313  44.843048  51.807720  48.955474  37.402987  51.342271
    edge_long                1.967572   1.430635   1.233274   1.538225   1.364924   1.025613   1.266355
    edge_short               1.952401   1.763385   1.250856   1.383851   0.832610   1.089456   1.572209
    edge_win                 3.389715   3.387355   3.081384   3.159226   2.976300   3.070849   3.072564
    edge_lose               -2.050069  -2.140338  -2.138265  -2.456018  -2.496835  -2.201292  -2.046276
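One simple way to look for degradation in a table like this is to compare each year's value with the average of the years before it. The sketch below does that for the edge row, using the values printed above; the expanding-mean comparison is my own illustrative choice, not a method from the original post.

```python
import pandas as pd

# Yearly "edge" values copied from the table above.
edge_by_year = pd.Series(
    [2.059936, 1.461567, 1.231334, 1.543369, 1.426360, 1.021169, 1.406294],
    index=[2012, 2013, 2014, 2015, 2016, 2017, 2018], name="edge")

# Average edge over all earlier years (NaN for the first year).
prior_mean = edge_by_year.expanding().mean().shift(1)

# Ratio below 1.0 means the year underperformed its own history.
relative_to_history = (edge_by_year / prior_mean).round(2)

print(relative_to_history)
```

On these numbers, 2017 stands out as the weakest year relative to its history, at roughly two-thirds of the prior average. Bear in mind that the first and last years in a sample may be partial, so their figures rest on fewer observations.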
It's just as simple to compare performance across symbols (or symbol groups, if you've defined those):
    def scorecard_by_symbol(df):
        return df.groupby(level='symbol').apply(calc_scorecard).T

    print(scorecard_by_symbol(df))
    symbol                       AAPL       AMZN       CSCO       MSFT       YHOO
    accuracy                66.688830  65.359043  66.312292  68.504983  68.726163
    edge                     1.416872   1.346539   1.324442   1.538295   1.582845
    noise                    2.345282   2.325754   2.264191   2.469043   2.395625
    y_true_chg               2.898475   2.829595   2.807877   2.953369   2.962400
    y_pred_chg               1.698427   1.671211   1.632399   1.741992   1.704562
    prediction_calibration   0.585973   0.590618   0.581364   0.589832   0.575399
    capture_ratio           48.883363  47.587700  47.168825  52.086091  53.431158
    edge_long                1.498012   1.273866   1.292711   1.441065   1.574297
    edge_short               1.291853   1.232207   1.210812   1.569954   1.538143
    edge_win                 3.208012   3.102369   3.044073   3.241324   3.280251
    edge_lose               -2.251305  -2.233005  -2.273472  -2.283530  -2.232134
The added insight we get from this methodology comes when wanting to make comparisons between models, periods, segments, etc...
To illustrate, let's say that we're comparing two models, a linear regression vs. a random forest, for performance on a training set and a testing set (pretend for a moment that we didn't adhere to Walk-forward model building practices...).
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import ElasticNetCV,Lasso,Ridge
    from sklearn.ensemble import RandomForestRegressor

    X_train,X_test,y_train,y_test = train_test_split(features,outcome,test_size=0.20,shuffle=False)

    # linear regression
    model1 = LinearRegression().fit(X_train,y_train)
    model1_train = pd.Series(model1.predict(X_train),index=X_train.index)
    model1_test = pd.Series(model1.predict(X_test),index=X_test.index)

    model2 = RandomForestRegressor().fit(X_train,y_train)
    model2_train = pd.Series(model2.predict(X_train),index=X_train.index)
    model2_test = pd.Series(model2.predict(X_test),index=X_test.index)

    # create dataframes for each
    model1_train_df = make_df(model1_train,y_train)
    model1_test_df = make_df(model1_test,y_test)
    model2_train_df = make_df(model2_train,y_train)
    model2_test_df = make_df(model2_test,y_test)

    s1 = calc_scorecard(model1_train_df)
    s1.name = 'model1_train'
    s2 = calc_scorecard(model1_test_df)
    s2.name = 'model1_test'
    s3 = calc_scorecard(model2_train_df)
    s3.name = 'model2_train'
    s4 = calc_scorecard(model2_test_df)
    s4.name = 'model2_test'

    print(pd.concat([s1,s2,s3,s4],axis=1))
                            model1_train   model1_test  model2_train   model2_test
    accuracy                   68.417608     62.434555     89.461627     60.732984
    edge                        1.545181      1.071990      2.730365      0.913076
    noise                       2.200392      2.123827      3.259833      2.551081
    y_true_chg                  2.926529      2.739687      2.926529      2.739687
    y_pred_chg                  1.563175      1.509218      2.291515      1.796541
    prediction_calibration      0.534139      0.550873      0.783015      0.655747
    capture_ratio              52.799120     39.128174     93.297057     33.327736
    edge_long                   1.496996      1.063473      2.664405      0.910637
    edge_short                  1.436163      1.038296      2.640406      0.873405
    edge_win                    3.190281      3.031532      3.083960      2.986226
    edge_lose                  -2.264566     -2.240729     -1.008382     -2.346892
This quick and dirty scorecard comparison gives us a great deal of useful information. We learn that:
- The relatively simple linear regression (model1) does a very good job of prediction, correct about 68% of the time, capturing >50% of available price movement (this is very good) during training
- Model1 holds up reasonably well out of sample: accuracy slips from about 68% on train to about 62% on test, and the capture ratio from about 53 to about 39
- Model2, a more complex random forest ensemble model, appears far superior on the training data, capturing 90%+ of available price action, but appears quite overfit and does not perform nearly as well on the test set.
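The overfitting comparison can be made more explicit by asking what share of each training-set metric survives on the test set. The sketch below computes that "retention" ratio from three rows of the table above; the retention measure itself is my own illustrative device, not something from the original scorecard.

```python
import pandas as pd

# Selected rows copied from the comparison table above.
scores = pd.DataFrame(
    {"model1_train": [68.417608, 1.545181, 52.799120],
     "model1_test":  [62.434555, 1.071990, 39.128174],
     "model2_train": [89.461627, 2.730365, 93.297057],
     "model2_test":  [60.732984, 0.913076, 33.327736]},
    index=["accuracy", "edge", "capture_ratio"])

# Share of each training-set metric that survives on the test set.
retention = pd.DataFrame({
    "model1": scores["model1_test"] / scores["model1_train"],
    "model2": scores["model2_test"] / scores["model2_train"],
}).round(2)

print(retention)
```

On these figures the linear model keeps roughly 70% of its training edge, while the random forest keeps about a third, which is consistent with the overfitting diagnosis.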
In this tutorial, we've covered a framework for evaluating models in a market prediction context and have demonstrated a few useful metrics. However, the approach can be extended much further to suit your needs. You can consider:
- Adding new metrics to the standard scorecard
- Comparing scorecard metrics for subsets of the universe. For instance, each symbol or grouping of symbols
- Calculating and plotting performance metrics across time to validate robustness or to identify trends
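As one possible way to make the scorecard easier to extend, the metrics could be kept in a dictionary of functions so that adding a metric is a one-line change. This is a sketch under my own assumptions: the `METRICS` registry and the `pct_predicted` metric are hypothetical and not part of the scorecard shown earlier.

```python
import numpy as np
import pandas as pd

# Registry mapping each metric name to a function of the primitives dataframe.
# The first two entries restate metrics from the post; the third is a
# hypothetical addition used only to show the extension point.
METRICS = {
    "accuracy": lambda df: df.is_correct.sum() / df.is_predicted.sum() * 100,
    "edge": lambda df: df.result.mean(),
    "pct_predicted": lambda df: df.is_predicted.mean() * 100,
}

def calc_scorecard(df, metrics=METRICS):
    """Evaluate every registered metric and return them as one Series."""
    return pd.Series({name: fn(df) for name, fn in metrics.items()}, dtype=float)

# Tiny hand-made primitives frame (the last row is an abstention).
demo = pd.DataFrame({
    "is_correct":   [1, 0, 1, 0],
    "is_predicted": [1, 1, 1, 0],
    "result":       [1.0, -0.5, 2.0, np.nan],
})
scorecard = calc_scorecard(demo)

print(scorecard)
```

Because the function still takes a dataframe and returns a Series, it should remain usable with the same `groupby(...).apply(...)` pattern for per-year or per-symbol breakdowns.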
In the final post of this series, I'll present a unique framework for creating an ensemble model to blend together the results of your many different forecasting models.
Please feel free to add to the comment section with your good ideas for useful metrics, with questions/comments on this post, and topic ideas for future posts.
One last thing...¶
If you've found this post useful, please follow @data2alpha on twitter and forward to a friend or colleague who may also find this topic interesting.
Finally, take a minute to leave a comment below - either to discuss this post or to offer an idea for future posts. Thanks for reading!