The Alpha Scientist

Discovering alpha in the stock market using data science

Stock Prediction with ML: Model Evaluation

Introduction

Use of machine learning in the quantitative investment field is, by all indications, skyrocketing. The proliferation of easily accessible data - both traditional and alternative - along with some very approachable frameworks for machine learning models - is encouraging many to explore the arena.

However, these financial ML explorers are learning that using ML to predict financial time series differs greatly from labeling cat pictures or flagging spam. One of these differences is that traditional model performance metrics (RSQ, MSE, accuracy, F1, etc.) can be misleading and incomplete.

Over the past several years, I've developed a set of metrics which have proved useful for comparing and optimizing financial time series models. These metrics attempt to measure models' predictive power but also their trade-ability, critically important for those who actually intend to use their models in the real world.

In this post, I will present a general outline of my approach and will demonstrate a few of the most useful metrics I've added to my standard "scorecard". I look forward to hearing how others may think to extend the concept. If you'd like to replicate and experiment with the code below, you can download the source notebook for this post by right-clicking on the button below and choosing "save link as".

If you haven't already checked out the previous four installments in this tutorial, you may want to review those first. Many of the coding patterns used below are discussed at length there.

Preparing sample data

I will illustrate this metrics methodology using a simple example of synthetically generated data (see previous posts in this tutorial for explanations of the below method of creating data).

In [2]:
## Remove / Replace this code with a link to your quandl key, if you have one
import sys
sys.path.append('/anaconda/')
import config
quandl_key = config.quandl_key


from IPython.core.display import Image
import numpy as np
import pandas as pd
pd.core.common.is_list_like = pd.api.types.is_list_like 
# May be necessary to fix below issue
# https://github.com/pydata/pandas-datareader/issues/534
import pandas_datareader.data as web
%matplotlib inline


def get_symbols(symbols,data_source, quandl_key=None, begin_date=None,end_date=None):
    out = pd.DataFrame()
    for symbol in symbols:
        df = web.DataReader(symbol, data_source,begin_date, end_date)\
        [['AdjOpen','AdjHigh','AdjLow','AdjClose','AdjVolume']].reset_index()
        df.columns = ['date','open','high','low','close','volume'] 
        df['symbol'] = symbol # add symbol col so we can keep all in the same dataframe
        df = df.set_index(['date','symbol'])
        out = pd.concat([out,df],axis=0) #stacks on top of previously collected data
    return out.sort_index()
        
prices = get_symbols(['AAPL','CSCO','AMZN','YHOO','MSFT'],\
                     data_source='quandl',quandl_key=quandl_key,begin_date='2012-01-01',end_date=None)
# Note: we're only using real price data to generate an index set.  
# We will make synthetic features and outcomes below instead of deriving from price

The below code generates several features then synthetically generates an outcome series from them (along with noise). This guarantees that the features will be informative, since the outcome has been constructed to ensure a relationship.

In [3]:
num_obs = prices.close.count()

def add_memory(s,n_days=50,mem_strength=0.1):
    ''' adds autoregressive behavior to series of data'''
    add_ewm = lambda x: (1-mem_strength)*x + mem_strength*x.ewm(n_days).mean()
    out = s.groupby(level='symbol').apply(add_ewm)
    return out

# generate feature data
f01 = pd.Series(np.random.randn(num_obs),index=prices.index)
f01 = add_memory(f01,10,0.1)
f02 = pd.Series(np.random.randn(num_obs),index=prices.index)
f02 = add_memory(f02,10,0.1)
f03 = pd.Series(np.random.randn(num_obs),index=prices.index)
f03 = add_memory(f03,10,0.1)
f04 = pd.Series(np.random.randn(num_obs),index=prices.index)
f04 = f04 # no memory

features = pd.concat([f01,f02,f03,f04],axis=1)

## now, create response variable such that it is related to features
# f01 becomes increasingly important, f02 becomes decreasingly important,
# f03 oscillates in importance, f04 is stationary, 
# and finally a noise component is added

outcome =   f01 * np.linspace(0.5,1.5,num_obs) + \
            f02 * np.linspace(1.5,0.5,num_obs) + \
            f03 * pd.Series(np.sin(2*np.pi*np.linspace(0,1,num_obs)*2)+1,index=f03.index) + \
            f04 + \
            np.random.randn(num_obs) * 3 
outcome.name = 'outcome'
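
To see what add_memory does in isolation, here is a minimal sketch on a tiny hand-built (date, symbol) index. The toy index, the symbols AAA and BBB, and the use of transform in place of apply (chosen so the output index matches the input across pandas versions) are assumptions for illustration, not part of the code above.

```python
import numpy as np
import pandas as pd

def add_memory(s, n_days=50, mem_strength=0.1):
    """Blend each symbol's series with its own exponentially weighted mean."""
    # ewm(com=n_days) is what ewm(n_days) means positionally
    add_ewm = lambda x: (1 - mem_strength) * x + mem_strength * x.ewm(com=n_days).mean()
    # transform returns a result indexed exactly like the input
    return s.groupby(level='symbol').transform(add_ewm)

# Toy (date, symbol) index standing in for prices.index
dates = pd.date_range('2020-01-01', periods=6, freq='D')
idx = pd.MultiIndex.from_product([dates, ['AAA', 'BBB']], names=['date', 'symbol'])
raw = pd.Series(np.random.RandomState(0).randn(len(idx)), index=idx)

no_memory = add_memory(raw, n_days=3, mem_strength=0.0)   # same values as the input
all_memory = add_memory(raw, n_days=3, mem_strength=1.0)  # pure per-symbol EWM mean
blended = add_memory(raw, n_days=3, mem_strength=0.1)     # 90% raw, 10% EWM mean

print(pd.concat([raw, blended, all_memory], axis=1, keys=['raw', 'blended', 'ewm']))
```

With mem_strength=0.1, as used for f01 through f03, each value stays mostly random and is nudged only slightly toward the symbol's recent average.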

Generating models and predictions

Imagine that we created a simple linear model (such as below) and wanted to measure its effectiveness at prediction.

Note: we'll follow the walk-forward modeling process described in the previous post. If you don't understand the below code snippet (and want to...) please check out that post.

In [4]:
from sklearn.linear_model import LinearRegression

## fit models for each timestep on a walk-forward basis
recalc_dates = features.resample('Q',level='date').mean().index.values[:-1]
models = pd.Series(index=recalc_dates, dtype=object)  # holds fitted model objects
for date in recalc_dates:
    X_train = features.xs(slice(None,date),level='date',drop_level=False)
    y_train = outcome.xs(slice(None,date),level='date',drop_level=False)
    model = LinearRegression()
    model.fit(X_train,y_train)
    models.loc[date] = model

## predict values walk-forward (all predictions out of sample)
begin_dates = models.index
end_dates = models.index[1:].append(pd.to_datetime(['2099-12-31']))

predictions = pd.Series(index=features.index, dtype=float)

for i,model in enumerate(models): #loop thru each models object in collection
    X = features.xs(slice(begin_dates[i],end_dates[i]),level='date',drop_level=False)
    p = pd.Series(model.predict(X),index=X.index)
    predictions.loc[X.index] = p

Traditional model evaluation

So we've got a model, we've got a sizeable set of (out of sample) predictions. Is the model any good? Should we junk it, tune it, or trade it? Since this is a regression model, I'll throw our data into scikit-learn's metrics package.

In [5]:
import sklearn.metrics as metrics

# make sure we have 1-for-1 mapping between pred and true
common_idx = outcome.dropna().index.intersection(predictions.dropna().index)
y_true = outcome[common_idx]
y_true.name = 'y_true'
y_pred = predictions[common_idx]
y_pred.name = 'y_pred'

standard_metrics = pd.Series(dtype=float)

standard_metrics.loc['explained variance'] = metrics.explained_variance_score(y_true, y_pred)
standard_metrics.loc['MAE'] = metrics.mean_absolute_error(y_true, y_pred)
standard_metrics.loc['MSE'] = metrics.mean_squared_error(y_true, y_pred)
standard_metrics.loc['MedAE'] = metrics.median_absolute_error(y_true, y_pred)
standard_metrics.loc['RSQ'] = metrics.r2_score(y_true, y_pred)

print(standard_metrics)
explained variance    0.251057
MAE                   2.491337
MSE                   9.784733
MedAE                 2.098055
RSQ                   0.251051
dtype: float64

These stats don't really tell us much by themselves. You may have an intuition for r-squared, so that may give you a level of confidence in the models. However, even this metric has problems, and it tells us little about the practicality of this signal from a trading point of view.

True, we could construct some trading rules around this series of predictions and perform a formal backtest on that. However, that is quite time-consuming and introduces a number of extraneous variables into the equation.
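
To make the point concrete, here is a small constructed example (the numbers are invented for illustration, not taken from the model above). Two sets of predictions miss the truth by exactly the same amount on every observation, so their MSE is identical, yet one is always on the right side of zero and the other is always on the wrong side.

```python
import numpy as np

y_true = np.array([0.2, -0.2, 0.2, -0.2])

# Two hypothetical prediction sets, each off by exactly 0.5 on every observation
pred_a = y_true + 0.5 * np.sign(y_true)   # overshoots, but always has the right sign
pred_b = y_true - 0.5 * np.sign(y_true)   # same-sized miss, always has the wrong sign

def mse(y, p):
    return np.mean((y - p) ** 2)

def hit_rate(y, p):
    # share of observations where predicted and actual signs agree
    return np.mean(np.sign(y) == np.sign(p))

def edge(y, p):
    # average profit from betting one unit in the predicted direction
    return np.mean(np.sign(p) * y)

for name, p in [('pred_a', pred_a), ('pred_b', pred_b)]:
    print(name, 'MSE:', mse(y_true, p), 'hit rate:', hit_rate(y_true, p), 'edge:', edge(y_true, p))
```

An error metric alone cannot separate these two, while anyone trading them would see opposite results.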

A better way... Creating custom metrics

Instead of relying on generic ML metrics, we will create several custom metrics that will hopefully give a more complete picture of strength, reliability, and practicality of these models.

I'll work through an example of creating an extensible scorecard with about a half dozen custom-defined metrics as a starting point. You can feel free to extend this into a longer scorecard which is suited to your needs and beliefs. In my own trading, I use about 25 metrics in a standard "scorecard" each time I evaluate a model. You may prefer to use more, fewer, or different metrics but the process should be applicable.

I'll focus only on regression-oriented metrics (i.e., those which use a continuous prediction rather than a binary or classification prediction). It's trivial to re-purpose the same framework to a classification-oriented environment.
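
As one possible way to do that re-purposing, here is a minimal sketch that assumes a classifier which emits a probability of a positive outcome. The name p_up and the 2*p - 1 mapping are my assumptions for illustration; other mappings would work as well.

```python
import numpy as np
import pandas as pd

# Hypothetical classifier output: probability that the outcome is positive
p_up = pd.Series([0.9, 0.55, 0.5, 0.2], name='p_up')

# Center on zero so the sign carries direction and the size carries conviction
y_pred = (2 * p_up - 1).rename('y_pred')

sign_pred = np.sign(y_pred).rename('sign_pred')
print(pd.concat([p_up, y_pred, sign_pred], axis=1))
```

A probability of exactly 0.5 maps to zero, which the sign-based logic used later in this post treats as "no prediction".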

Step 1: Preprocess data primitives

Before implementing specific metrics we need to do some data pre-processing. It'll become clear why doing this first will save considerable time later when calculating aggregate metrics.

To create these intermediate values, you'll need the following inputs:

  • y_pred: the continuous variable prediction made by your model for each timestep, for each symbol
  • y_true: the continuous variable actual outcome for each timestep, for each symbol.
  • index: this is the unique identifier for each prediction or actual result. If working with a single instrument, then you can simply use date (or time or whatever). If you're using multiple instruments, a multi-index with (date/symbol) is necessary.

In other words, if your model is predicting one-day price changes, you'd want your y_pred to be the model's predictions made as of March 9th (for the coming day), indexed as 2017-03-09, and you'd want the actual future outcome, which will play out over the next day, also aligned to March 9th. This "peeking" convention is very useful for working with large sets of data across different time horizons. It is described ad nauseam in Part 1: Data Management.
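
A minimal sketch of that alignment, assuming a close price series on a (date, symbol) MultiIndex. The symbols and prices here are made up for illustration.

```python
import pandas as pd

dates = pd.to_datetime(['2017-03-08', '2017-03-09', '2017-03-10'])
idx = pd.MultiIndex.from_product([dates, ['AAA', 'BBB']], names=['date', 'symbol'])
close = pd.Series([100.0, 50.0, 101.0, 49.0, 103.02, 49.49], index=idx, name='close')

# Next day's close for the same symbol, pulled back onto today's row
next_close = close.groupby(level='symbol').shift(-1)

# One-day-ahead return, indexed by the date on which the prediction is made
y_true = (next_close / close - 1).rename('y_true')
print(y_true)
```

The row for 2017-03-09 holds the return realized from March 9th to March 10th, and the final date is NaN because its outcome is not yet known. Grouping by symbol before shifting keeps one instrument's future from leaking into another's row.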

The raw input data we need to provide might look something like this:

In [6]:
print(pd.concat([y_pred,y_true],axis=1).tail())
                     y_pred    y_true
date       symbol                    
2018-03-26 MSFT    0.707500  1.693673
2018-03-27 AAPL    1.744680 -1.830242
           AMZN    0.594976 -2.302375
           CSCO   -2.838380 -2.462017
           MSFT   -0.417073 -1.586291

We will feed this data into a simple function which will return a dataframe with the y_pred and y_true values, along with several other useful derivative values. These derivative values include:

  • sign_pred: positive or negative sign of prediction
  • sign_true: positive or negative sign of true outcome
  • is_correct: 1 if sign_pred and sign_true agree (both positive or both negative), else 0
  • is_incorrect: 1 if sign_pred and sign_true are opposite, else 0
  • is_predicted: 1 if the model has made a valid prediction, 0 if not. This is important if models only emit predictions when they have a certain level of confidence
  • result: the profit (loss) resulting from betting one unit in the direction of the sign_pred. This is the continuous variable result of following the model

In [7]:
def make_df(y_pred,y_true):
    y_pred.name = 'y_pred'
    y_true.name = 'y_true'
    
    df = pd.concat([y_pred,y_true],axis=1)

    df['sign_pred'] = df.y_pred.apply(np.sign)
    df['sign_true'] = df.y_true.apply(np.sign)
    df['is_correct'] = 0
    df.loc[df.sign_pred * df.sign_true > 0 ,'is_correct'] = 1 # only registers 1 when prediction was made AND it was correct
    df['is_incorrect'] = 0
    df.loc[df.sign_pred * df.sign_true < 0,'is_incorrect'] = 1 # only registers 1 when prediction was made AND it was wrong
    df['is_predicted'] = df.is_correct + df.is_incorrect
    df['result'] = df.sign_pred * df.y_true 
    return df

df = make_df(y_pred,y_true)
print(df.dropna().tail())
                     y_pred    y_true  sign_pred  sign_true  is_correct  \
date       symbol                                                         
2018-03-26 MSFT    0.707500  1.693673        1.0        1.0           1   
2018-03-27 AAPL    1.744680 -1.830242        1.0       -1.0           0   
           AMZN    0.594976 -2.302375        1.0       -1.0           0   
           CSCO   -2.838380 -2.462017       -1.0       -1.0           1   
           MSFT   -0.417073 -1.586291       -1.0       -1.0           1   

                   is_incorrect  is_predicted    result  
date       symbol                                        
2018-03-26 MSFT               0             1  1.693673  
2018-03-27 AAPL               1             1 -1.830242  
           AMZN               1             1 -2.302375  
           CSCO               0             1  2.462017  
           MSFT               0             1  1.586291  
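
To see how is_predicted behaves when a model abstains, here is a small sketch with one zero prediction and one missing prediction. The make_df below is a compact restatement of the function above so that the example runs on its own, and the four observations are invented.

```python
import numpy as np
import pandas as pd

def make_df(y_pred, y_true):
    # compact restatement of make_df above, for a self-contained example
    df = pd.concat([y_pred.rename('y_pred'), y_true.rename('y_true')], axis=1)
    df['sign_pred'] = np.sign(df.y_pred)
    df['sign_true'] = np.sign(df.y_true)
    df['is_correct'] = (df.sign_pred * df.sign_true > 0).astype(int)
    df['is_incorrect'] = (df.sign_pred * df.sign_true < 0).astype(int)
    df['is_predicted'] = df.is_correct + df.is_incorrect
    df['result'] = df.sign_pred * df.y_true
    return df

y_pred = pd.Series([0.5, 0.0, np.nan, -0.4])   # second row abstains, third is missing
y_true = pd.Series([1.0, 2.0, -1.0, 0.3])
df = make_df(y_pred, y_true)

accuracy = df.is_correct.sum() / df.is_predicted.sum() * 100
edge_all = df.result.mean()                                   # NaN skipped, zero kept
edge_predicted = df.loc[df.is_predicted == 1, 'result'].mean()
print(df[['y_pred', 'y_true', 'is_predicted', 'result']])
print(accuracy, edge_all, edge_predicted)
```

Both abstentions drop out of the accuracy denominator. In this sketch the zero prediction still contributes a result of 0 to the plain mean, while the missing one does not, so it may be worth deciding which behavior you want if your model abstains often.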

Defining our metrics

With this set of intermediate variables pre-processed, we can more easily calculate metrics. The metrics we'll start with here include things like:

  • Accuracy: Just as the name suggests, this measures the percent of predictions that were directionally correct vs. incorrect.
  • Edge: perhaps the most useful of all metrics, this is the expected value of the prediction over a sufficiently large set of draws. Think of this like a blackjack card counter who knows the expected profit on each dollar bet at a given level of favorability in the odds.
  • Noise: critically important but often ignored, the noise metric estimates how dramatically the model's predictions vary from one day to the next. As you might imagine, a model which abruptly changes its mind every few days is much harder to follow (and much more expensive to trade) than one which is a bit more steady.

The below function takes in our pre-processed data primitives and returns a scorecard with accuracy, edge, and noise.

In [8]:
def calc_scorecard(df):
    scorecard = pd.Series(dtype=float)
    # building block metrics
    scorecard.loc['accuracy'] = df.is_correct.sum()*1. / (df.is_predicted.sum()*1.)*100
    scorecard.loc['edge'] = df.result.mean()
    scorecard.loc['noise'] = df.y_pred.diff().abs().mean()
    
    return scorecard    

calc_scorecard(df)
Out[8]:
accuracy    67.075999
edge         1.438081
noise        2.387606
dtype: float64

Much better. I now know that we've been directionally correct about two-thirds of the time, and that following this signal would create an edge of ~1.4 units per time period.
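
One caveat on the noise metric: on a frame sorted by (date, symbol), a plain diff() compares adjacent rows, which are usually different symbols on the same date rather than the same symbol on consecutive dates. If the intent is day-to-day change per instrument, a grouped diff is one option. The sketch below uses invented predictions to show how far the two can diverge; it is an alternative to consider, not a replacement for the numbers reported above.

```python
import pandas as pd

dates = pd.date_range('2018-01-01', periods=3, freq='D')
idx = pd.MultiIndex.from_product([dates, ['AAA', 'BBB']], names=['date', 'symbol'])

# Each symbol's prediction never changes, but the two symbols sit at different levels
y_pred = pd.Series([1.0, -1.0, 1.0, -1.0, 1.0, -1.0], index=idx, name='y_pred')

# Row-to-row differences, mixing symbols
pooled_noise = y_pred.diff().abs().mean()

# Day-to-day differences within each symbol
per_symbol_noise = y_pred.groupby(level='symbol').diff().abs().mean()

print(pooled_noise, per_symbol_noise)
```

Here the pooled figure is large even though neither symbol's prediction ever moves, while the per-symbol figure is zero.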

Let's keep going. We can now easily combine and transform things to derive new metrics. The below function shows several examples, including:

  • y_true_chg and y_pred_chg: The average magnitude of change (per period) in y_true and y_pred.
  • prediction_calibration: A simple ratio of the magnitude of our predictions vs. magnitude of truth. This gives some indication of whether our model is properly tuned to the size of movement in addition to the direction of it.
  • capture_ratio: Ratio of the "edge" we gain by following our predictions vs. the actual daily change. 100 would indicate that we were perfectly capturing the true movement of the target variable.

In [9]:
def calc_scorecard(df):
    scorecard = pd.Series(dtype=float)
    # building block metrics
    scorecard.loc['accuracy'] = df.is_correct.sum()*1. / (df.is_predicted.sum()*1.)*100
    scorecard.loc['edge'] = df.result.mean()
    scorecard.loc['noise'] = df.y_pred.diff().abs().mean()

    # derived metrics
    scorecard.loc['y_true_chg'] = df.y_true.abs().mean()
    scorecard.loc['y_pred_chg'] = df.y_pred.abs().mean()
    scorecard.loc['prediction_calibration'] = scorecard.loc['y_pred_chg']/scorecard.loc['y_true_chg']
    scorecard.loc['capture_ratio'] = scorecard.loc['edge']/scorecard.loc['y_true_chg']*100

    return scorecard    

calc_scorecard(df)
Out[9]:
accuracy                  67.075999
edge                       1.438081
noise                      2.387606
y_true_chg                 2.888443
y_pred_chg                 1.689327
prediction_calibration     0.584857
capture_ratio             49.787427
dtype: float64
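
As a quick check of the arithmetic behind the two derived ratios, the values printed above can be plugged in by hand. The inputs are rounded to six decimals, so the results match the scorecard only approximately.

```python
# Values copied from the scorecard output above
edge = 1.438081
y_true_chg = 2.888443
y_pred_chg = 1.689327

# Size of a typical prediction relative to the size of a typical outcome
prediction_calibration = y_pred_chg / y_true_chg

# Share of the average absolute move captured by following the signal, in percent
capture_ratio = edge / y_true_chg * 100

print(prediction_calibration, capture_ratio)
```

A calibration below 1 says the model's predictions are, on average, smaller in magnitude than the moves they are trying to predict.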

Additionally, metrics can be easily calculated for only long or short predictions (for a two-sided model) or separately for positions which ended up being winners and losers.

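  • edge_long and edge_short: The "edge" for only long signals or for only short signals. In the code below, each is reported net of the average outcome (df.y_true.mean()).
  • edge_win and edge_lose: The "edge" for only winners or for only losers, net of the same average outcome.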

In [10]:
def calc_scorecard(df):
    scorecard = pd.Series(dtype=float)
    # building block metrics
    scorecard.loc['accuracy'] = df.is_correct.sum()*1. / (df.is_predicted.sum()*1.)*100
    scorecard.loc['edge'] = df.result.mean()
    scorecard.loc['noise'] = df.y_pred.diff().abs().mean()

    # derived metrics
    scorecard.loc['y_true_chg'] = df.y_true.abs().mean()
    scorecard.loc['y_pred_chg'] = df.y_pred.abs().mean()
    scorecard.loc['prediction_calibration'] = scorecard.loc['y_pred_chg']/scorecard.loc['y_true_chg']
    scorecard.loc['capture_ratio'] = scorecard.loc['edge']/scorecard.loc['y_true_chg']*100

    # metrics for a subset of predictions
    scorecard.loc['edge_long'] = df[df.sign_pred == 1].result.mean()  - df.y_true.mean()
    scorecard.loc['edge_short'] = df[df.sign_pred == -1].result.mean()  - df.y_true.mean()

    scorecard.loc['edge_win'] = df[df.is_correct == 1].result.mean()  - df.y_true.mean()
    scorecard.loc['edge_lose'] = df[df.is_incorrect == 1].result.mean()  - df.y_true.mean()

    return scorecard    

calc_scorecard(df)
Out[10]:
accuracy                  67.075999
edge                       1.438081
noise                      2.387606
y_true_chg                 2.888443
y_pred_chg                 1.689327
prediction_calibration     0.584857
capture_ratio             49.787427
edge_long                  1.409962
edge_short                 1.361946
edge_win                   3.173426
edge_lose                 -2.254256
dtype: float64

From this slate of metrics, we've gained much more insight than we got from MSE, R-squared, etc.

  • The model is predicting with a strong directional accuracy
  • We are generating about 1.4 units of "edge" (expected profit) per prediction, which is about half of the total theoretical profit
  • The model makes more on winners than it loses on losers
  • The model is equally valid on both long and short predictions

If this were real data, I would be rushing to put this model into production!

Metrics over time

Critically important when considering using a model in live trading is to understand (a) how consistent the model's performance has been, and (b) whether its current performance has degraded from its past. Markets have a way of discovering and eliminating past sources of edge.

Here, a two-line function will calculate each metric by year:

In [15]:
def scorecard_by_year(df):
    df['year'] = df.index.get_level_values('date').year
    return df.groupby('year').apply(calc_scorecard).T

print(scorecard_by_year(df))
year                         2012       2013       2014       2015       2016  \
accuracy                73.723404  67.539683  64.761905  69.761905  65.873016   
edge                     2.059936   1.461567   1.231334   1.543369   1.426360   
noise                    2.719877   2.821616   2.399404   2.120782   2.267517   
y_true_chg               3.084964   2.936447   2.745874   2.979032   2.913586   
y_pred_chg               1.936173   1.941289   1.703139   1.557748   1.587378   
prediction_calibration   0.627616   0.661102   0.620254   0.522904   0.544820   
capture_ratio           66.773413  49.773313  44.843048  51.807720  48.955474   
edge_long                1.967572   1.430635   1.233274   1.538225   1.364924   
edge_short               1.952401   1.763385   1.250856   1.383851   0.832610   
edge_win                 3.389715   3.387355   3.081384   3.159226   2.976300   
edge_lose               -2.050069  -2.140338  -2.138265  -2.456018  -2.496835   

year                         2017       2018  
accuracy                61.814915  67.372881  
edge                     1.021169   1.406294  
noise                    2.087404   2.153034  
y_true_chg               2.730181   2.739057  
y_pred_chg               1.466882   1.583027  
prediction_calibration   0.537284   0.577946  
capture_ratio           37.402987  51.342271  
edge_long                1.025613   1.266355  
edge_short               1.089456   1.572209  
edge_win                 3.070849   3.072564  
edge_lose               -2.201292  -2.046276  
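
Calendar years are a coarse lens. For a finer view of whether edge is decaying, a trailing window is one option. The sketch below is an assumption-laden toy: the result values are invented (positive in the first half, negative in the second), and the 4-day window is arbitrary.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2018-01-01', periods=8, freq='D')
idx = pd.MultiIndex.from_product([dates, ['AAA', 'BBB']], names=['date', 'symbol'])

# Hypothetical per-prediction results: +1 on the first four dates, -1 on the last four
result = pd.Series(np.repeat([1.0, -1.0], 8), index=idx, name='result')

# Average result across symbols on each date
daily_edge = result.groupby(level='date').mean()

# Trailing 4-day average of the daily edge
rolling_edge = daily_edge.rolling(window=4).mean()
print(rolling_edge)
```

The trailing average starts at +1, passes through zero, and ends at -1, which is the kind of drift a by-year table can hide inside a single column.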

It's just as simple to compare performance across symbols (or symbol groups, if you've defined those):

In [14]:
def scorecard_by_symbol(df):
    return df.groupby(level='symbol').apply(calc_scorecard).T

print(scorecard_by_symbol(df))
symbol                       AAPL       AMZN       CSCO       MSFT       YHOO
accuracy                66.688830  65.359043  66.312292  68.504983  68.726163
edge                     1.416872   1.346539   1.324442   1.538295   1.582845
noise                    2.345282   2.325754   2.264191   2.469043   2.395625
y_true_chg               2.898475   2.829595   2.807877   2.953369   2.962400
y_pred_chg               1.698427   1.671211   1.632399   1.741992   1.704562
prediction_calibration   0.585973   0.590618   0.581364   0.589832   0.575399
capture_ratio           48.883363  47.587700  47.168825  52.086091  53.431158
edge_long                1.498012   1.273866   1.292711   1.441065   1.574297
edge_short               1.291853   1.232207   1.210812   1.569954   1.538143
edge_win                 3.208012   3.102369   3.044073   3.241324   3.280251
edge_lose               -2.251305  -2.233005  -2.273472  -2.283530  -2.232134
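
For symbol groups, one approach is to map each symbol to a group label and group on that instead. The grouping dictionary and the result values below are hypothetical, and a simple mean stands in for the full scorecard to keep the sketch short.

```python
import pandas as pd

dates = pd.date_range('2018-01-01', periods=2, freq='D')
symbols = ['AAPL', 'MSFT', 'AMZN']
idx = pd.MultiIndex.from_product([dates, symbols], names=['date', 'symbol'])
df = pd.DataFrame({'result': [1.0, 3.0, -2.0, 1.0, 3.0, -2.0]}, index=idx)

# Hypothetical grouping, defined by hand for illustration
symbol_groups = {'AAPL': 'tech', 'MSFT': 'tech', 'AMZN': 'retail'}

# One group label per row, aligned to the frame's index
group = pd.Series(
    df.index.get_level_values('symbol').map(symbol_groups),
    index=df.index,
    name='group',
)

edge_by_group = df.groupby(group).result.mean()
print(edge_by_group)
```

The same groupby could be followed by an apply of a scorecard function in place of the mean.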

Comparing models

The added insight from this methodology comes when making comparisons between models, periods, segments, etc.

To illustrate, let's say that we're comparing two models, a linear regression vs. a random forest, for performance on a training set and a testing set (pretend for a moment that we didn't adhere to walk-forward model building practices...).

In [13]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNetCV,Lasso,Ridge
from sklearn.ensemble import RandomForestRegressor

X_train,X_test,y_train,y_test = train_test_split(features,outcome,test_size=0.20,shuffle=False)

# linear regression
model1 = LinearRegression().fit(X_train,y_train)
model1_train = pd.Series(model1.predict(X_train),index=X_train.index)
model1_test = pd.Series(model1.predict(X_test),index=X_test.index)

model2 = RandomForestRegressor().fit(X_train,y_train)
model2_train = pd.Series(model2.predict(X_train),index=X_train.index)
model2_test = pd.Series(model2.predict(X_test),index=X_test.index)

# create dataframes for each 
model1_train_df = make_df(model1_train,y_train)
model1_test_df = make_df(model1_test,y_test)
model2_train_df = make_df(model2_train,y_train)
model2_test_df = make_df(model2_test,y_test)

s1 = calc_scorecard(model1_train_df)
s1.name = 'model1_train'
s2 = calc_scorecard(model1_test_df)
s2.name = 'model1_test'
s3 = calc_scorecard(model2_train_df)
s3.name = 'model2_train'
s4 = calc_scorecard(model2_test_df)
s4.name = 'model2_test'

print(pd.concat([s1,s2,s3,s4],axis=1))
                        model1_train  model1_test  model2_train  model2_test
accuracy                   68.417608    62.434555     89.461627    60.732984
edge                        1.545181     1.071990      2.730365     0.913076
noise                       2.200392     2.123827      3.259833     2.551081
y_true_chg                  2.926529     2.739687      2.926529     2.739687
y_pred_chg                  1.563175     1.509218      2.291515     1.796541
prediction_calibration      0.534139     0.550873      0.783015     0.655747
capture_ratio              52.799120    39.128174     93.297057    33.327736
edge_long                   1.496996     1.063473      2.664405     0.910637
edge_short                  1.436163     1.038296      2.640406     0.873405
edge_win                    3.190281     3.031532      3.083960     2.986226
edge_lose                  -2.264566    -2.240729     -1.008382    -2.346892
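
When the number of models grows, the name-each-scorecard pattern gets repetitive. A small helper can build the same table from a dictionary. This is a sketch under assumptions: compare and mini_scorecard are hypothetical names, and mini_scorecard is a two-metric stand-in for make_df plus calc_scorecard so the example runs on its own.

```python
import numpy as np
import pandas as pd

def mini_scorecard(y_pred, y_true):
    # Stand-in scorer with only two metrics; assumes every row has a nonzero prediction
    result = np.sign(y_pred) * y_true
    return pd.Series({'accuracy': (result > 0).mean() * 100, 'edge': result.mean()})

def compare(named_pairs, scorer=mini_scorecard):
    """Build one column of metrics per named (y_pred, y_true) pair."""
    return pd.DataFrame({name: scorer(p, t) for name, (p, t) in named_pairs.items()})

# Invented outcomes and two invented prediction series
y_true = pd.Series([1.0, -2.0, 0.5, -1.5])
named_pairs = {
    'always_long': (pd.Series([1.0, 1.0, 1.0, 1.0]), y_true),
    'perfect_sign': (pd.Series([0.3, -0.1, 0.2, -0.4]), y_true),
}

table = compare(named_pairs)
print(table)
```

Passing pairs rather than a single shared y_true lets train and test sets sit side by side, as in the table above.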

This quick and dirty scorecard comparison gives us a great deal of useful information. We learn that:

  • The relatively simple linear regression (model1) does a very good job of prediction during training, correct about 68% of the time and capturing >50% of available price movement (this is very good)
  • Model1 holds up reasonably well out of sample: accuracy slips from about 68% to 62% and the capture ratio from about 53 to 39
  • Model2, a more complex random forest ensemble model, appears far superior on the training data, capturing 90%+ of available price action, but it is clearly overfit: on the test set its accuracy (about 61%) and capture ratio (about 33) fall below those of model1.
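
One way to put a number on the overfitting is to ask how much of the training capture ratio survives on the test set. The figures below are copied from the table above; the "retention" label is my own shorthand rather than a standard metric.

```python
# Capture ratios copied from the comparison table above
capture = {
    'model1': {'train': 52.799120, 'test': 39.128174},
    'model2': {'train': 93.297057, 'test': 33.327736},
}

# Fraction of in-sample capture ratio that persists out of sample
retention = {name: c['test'] / c['train'] for name, c in capture.items()}

for name, r in retention.items():
    print(name, round(r * 100, 1), '% of train capture retained on test')
```

By this measure the linear model keeps roughly three-quarters of its in-sample capture, while the random forest keeps closer to a third.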

Summary

In this tutorial, we've covered a framework for evaluating models in a market prediction context and have demonstrated a few useful metrics. However, the approach can be extended much further to suit your needs. You can consider:

  • Adding new metrics to the standard scorecard
  • Comparing scorecard metrics for subsets of the universe, for instance each symbol or grouping of symbols
  • Calculating and plotting performance metrics across time to validate robustness or to identify trends

In the final post of this series, I'll present a unique framework for creating an ensemble model to blend together the results of your many different forecasting models.

Please feel free to add to the comment section with your good ideas for useful metrics, with questions/comments on this post, and topic ideas for future posts.

One last thing...

If you've found this post useful, please follow @data2alpha on twitter and forward to a friend or colleague who may also find this topic interesting.

Finally, take a minute to leave a comment below - either to discuss this post or to offer an idea for future posts. Thanks for reading!
