### Introduction

Use of machine learning in the quantitative investment field is, by all indications, skyrocketing. The proliferation of easily accessible data - both traditional and alternative - along with some very approachable frameworks for machine learning models - is encouraging many to explore the arena.

However, these financial ML explorers are learning that using ML to predict financial time series differs greatly from labeling cat pictures or flagging spam. One of these differences is that traditional model performance metrics (RSQ, MSE, accuracy, F1, etc.) can be misleading and incomplete.

Over the past several years, I've developed a set of metrics which have proved useful for comparing and optimizing financial time series models. These metrics attempt to measure models' *predictive power* but also their *trade-ability*, critically important for those who actually intend to *use* their models in the real world.

In this post, I will present a general outline of my approach and will demonstrate a few of the most useful metrics I've added to my standard "scorecard". I look forward to hearing how others may think to extend the concept.

If you haven't already checked out the previous four installments in this tutorial, you may want to review those first; many of the coding patterns used below are discussed at length there.

### Preparing sample data

I will illustrate this metrics methodology using a simple example of synthetically generated data (see previous posts in this tutorial for explanations of the below method of creating data).

```
## Remove / Replace this code with a link to your quandl key, if you have one
import sys
sys.path.append('/anaconda/')
import config
quandl_key = config.quandl_key

from IPython.core.display import Image
import numpy as np
import pandas as pd
# The next line may be necessary to work around this issue:
# https://github.com/pydata/pandas-datareader/issues/534
pd.core.common.is_list_like = pd.api.types.is_list_like
import pandas_datareader.data as web
%matplotlib inline

def get_symbols(symbols,data_source, quandl_key=None, begin_date=None,end_date=None):
    # NOTE: quandl_key is accepted here but is not forwarded to DataReader
    out = pd.DataFrame()
    for symbol in symbols:
        df = web.DataReader(symbol, data_source,begin_date, end_date)\
            [['AdjOpen','AdjHigh','AdjLow','AdjClose','AdjVolume']].reset_index()
        df.columns = ['date','open','high','low','close','volume']
        df['symbol'] = symbol # add symbol col so we can keep all in the same dataframe
        df = df.set_index(['date','symbol'])
        out = pd.concat([out,df],axis=0) #stacks on top of previously collected data
    return out.sort_index()

prices = get_symbols(['AAPL','CSCO','AMZN','YHOO','MSFT'],\
    data_source='quandl',quandl_key=quandl_key,begin_date='2012-01-01',end_date=None)
# Note: we're only using real price data to generate an index set.
# We will make synthetic features and outcomes below instead of deriving from price
```

The below code generates several features then *synthetically generates* an outcome series from them (along with noise). This guarantees that the features will be informative, since the outcome has been constructed to ensure a relationship.

```
num_obs = prices.close.count()

def add_memory(s,n_days=50,mem_strength=0.1):
    ''' adds autoregressive behavior to series of data'''
    add_ewm = lambda x: (1-mem_strength)*x + mem_strength*x.ewm(n_days).mean()
    # group_keys=False keeps the original (date, symbol) index on newer pandas versions
    out = s.groupby(level='symbol', group_keys=False).apply(add_ewm)
    return out

# generate feature data
f01 = pd.Series(np.random.randn(num_obs),index=prices.index)
f01 = add_memory(f01,10,0.1)
f02 = pd.Series(np.random.randn(num_obs),index=prices.index)
f02 = add_memory(f02,10,0.1)
f03 = pd.Series(np.random.randn(num_obs),index=prices.index)
f03 = add_memory(f03,10,0.1)
f04 = pd.Series(np.random.randn(num_obs),index=prices.index)
f04 = f04 # no memory

features = pd.concat([f01,f02,f03,f04],axis=1)

## now, create response variable such that it is related to features
# f01 becomes increasingly important, f02 becomes decreasingly important,
# f03 oscillates in importance, f04 is stationary,
# and finally a noise component is added
outcome = f01 * np.linspace(0.5,1.5,num_obs) + \
    f02 * np.linspace(1.5,0.5,num_obs) + \
    f03 * pd.Series(np.sin(2*np.pi*np.linspace(0,1,num_obs)*2)+1,index=f03.index) + \
    f04 + \
    np.random.randn(num_obs) * 3
outcome.name = 'outcome'
```

### Generating models and predictions

Imagine that we created a simple linear model (such as below) and wanted to measure its effectiveness at prediction.

Note: we'll follow the walk-forward modeling process described in the previous post. If you don't understand the below code snippet (and want to...) please check out that post.

```
from sklearn.linear_model import LinearRegression

## fit models for each timestep on a walk-forward basis
recalc_dates = features.resample('Q',level='date').mean().index.values[:-1]
models = pd.Series(index=recalc_dates, dtype=object) # object dtype so it can hold fitted models
for date in recalc_dates:
    X_train = features.xs(slice(None,date),level='date',drop_level=False)
    y_train = outcome.xs(slice(None,date),level='date',drop_level=False)
    model = LinearRegression()
    model.fit(X_train,y_train)
    models.loc[date] = model

## predict values walk-forward (all predictions out of sample)
begin_dates = models.index
end_dates = models.index[1:].append(pd.to_datetime(['2099-12-31']))

predictions = pd.Series(index=features.index, dtype=float)
for i,model in enumerate(models): #loop thru each models object in collection
    X = features.xs(slice(begin_dates[i],end_dates[i]),level='date',drop_level=False)
    p = pd.Series(model.predict(X),index=X.index)
    predictions.loc[X.index] = p
```

### Traditional model evaluation

So we've got a model and a sizeable set of (out-of-sample) predictions. Is the model any good? Should we junk it, tune it, or trade it? Since this is a regression model, I'll throw our data into `scikit-learn`'s metrics package.

```
import sklearn.metrics as metrics
# make sure we have 1-for-1 mapping between pred and true
common_idx = outcome.dropna().index.intersection(predictions.dropna().index)
y_true = outcome[common_idx]
y_true.name = 'y_true'
y_pred = predictions[common_idx]
y_pred.name = 'y_pred'
standard_metrics = pd.Series(dtype=float)
standard_metrics.loc['explained variance'] = metrics.explained_variance_score(y_true, y_pred)
standard_metrics.loc['MAE'] = metrics.mean_absolute_error(y_true, y_pred)
standard_metrics.loc['MSE'] = metrics.mean_squared_error(y_true, y_pred)
standard_metrics.loc['MedAE'] = metrics.median_absolute_error(y_true, y_pred)
standard_metrics.loc['RSQ'] = metrics.r2_score(y_true, y_pred)
print(standard_metrics)
```

These stats don't really tell us much by themselves. You may have an intuition for r-squared, so that may give you a level of confidence in the models. However, even this metric has problems, and it tells us little about the practicality of this signal from a trading point of view.

True, we could construct some trading rules around this series of predictions and perform a formal backtest on that. However, that is quite time consuming and introduces a number of extraneous variables into the equation.
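To make the limitation of r-squared concrete, here is a minimal sketch on made-up data (not the series generated above). The hypothetical prediction equals a hidden signal, and the outcome is that signal buried under noise with three times the standard deviation. The variable names and the noise level are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

signal = rng.standard_normal(n)          # what the hypothetical model predicts
noise = 3.0 * rng.standard_normal(n)     # unpredictable part of the outcome
y_true_toy = signal + noise
y_pred_toy = signal

# R-squared computed by hand: 1 - SS_res / SS_tot
ss_res = np.sum((y_true_toy - y_pred_toy) ** 2)
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# Directional hit rate and average profit per unit bet when following the sign
hit_rate = np.mean(np.sign(y_pred_toy) == np.sign(y_true_toy))
edge = np.mean(np.sign(y_pred_toy) * y_true_toy)

print(f"R-squared: {r2:.3f}  hit rate: {hit_rate:.3f}  edge: {edge:.3f}")
```

Under these assumptions r-squared should land near 0.1 (one part signal variance to nine parts noise variance), which looks unimpressive, yet the sign of the prediction is right noticeably more often than not and each unit bet has a positive average payoff.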

### A better way... Creating custom metrics

Instead of relying on generic ML metrics, we will create several custom metrics that will hopefully give a more complete picture of strength, reliability, and practicality of these models.

I'll work through an example of creating an extensible *scorecard* with about a half dozen custom-defined *metrics* as a starting point. You can feel free to extend this into a longer scorecard which is suited to your needs and beliefs. In my own trading, I use about 25 metrics in a standard "scorecard" each time I evaluate a model. You may prefer to use more, fewer, or different metrics but the process should be applicable.

I'll focus only on regression-oriented metrics (i.e., those which use a continuous prediction rather than a binary or classification prediction). It's trivial to re-purpose the same framework to a classification-oriented environment.
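As a rough sketch of that re-purposing (one possible mapping, offered as an assumption rather than a prescribed method): if a classifier emits a probability that the outcome will be positive, centering it at 0.5 yields a signed score that can stand in for a continuous prediction. The names `proba_to_signed_pred`, `prob_up`, and `min_confidence` are hypothetical.

```python
import numpy as np
import pandas as pd

def proba_to_signed_pred(prob_up, min_confidence=0.0):
    """Convert P(outcome > 0) into a signed score in [-0.5, 0.5].

    Scores whose absolute value is below `min_confidence` are set to 0,
    which a sign-based scorecard treats as "no prediction made".
    """
    score = prob_up - 0.5
    return score.where(score.abs() >= min_confidence, 0.0)

prob_up = pd.Series([0.80, 0.52, 0.30, 0.50])
signed = proba_to_signed_pred(prob_up, min_confidence=0.05)
print(signed.tolist())
```

The second and fourth probabilities sit too close to 0.5 to clear the threshold, so they become zeros and would be counted as abstentions rather than as right or wrong calls.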

### Step 1: Preprocess data primitives

Before implementing specific metrics we need to do some data pre-processing. It'll become clear why doing this first will save considerable time later when calculating aggregate metrics.

To create these intermediate values, you'll need the following inputs:

- **y_pred:** the *continuous variable* prediction made by your model for each timestep, for each symbol.
- **y_true:** the *continuous variable* actual outcome for each timestep, for each symbol.
- **index:** the unique identifier for each prediction or actual result. If working with a single instrument, then you can simply use date (or time or whatever). If you're using multiple instruments, a multi-index with (date/symbol) is necessary.

In other words, if your model is predicting one-day price changes, you'd want your y_pred to be the model's predictions made as of March 9th (for the coming day), indexed as `2017-03-09`, and you'd want the actual *future* outcome which will play out in the next day also aligned to March 9th. This "peeking" convention is very useful for working with large sets of data across different time horizons. It is described ad nauseam in Part 1: Data Management.
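A minimal sketch of this alignment for a single instrument, assuming a toy `close` series (the prices are invented for illustration): shifting the forward return back by one row stamps each outcome with the date on which the prediction would have been made.

```python
import pandas as pd

close = pd.Series(
    [100.0, 110.0, 99.0],
    index=pd.to_datetime(['2017-03-09', '2017-03-10', '2017-03-13']),
    name='close',
)

# Return realized over the NEXT period, indexed at the date the prediction is made
y_true_toy = close.pct_change().shift(-1)
y_true_toy.name = 'y_true'
print(y_true_toy)
```

The last row is NaN because its outcome has not happened yet. With multiple symbols, the same shift would need to be applied within each symbol rather than across the whole column.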

The raw input data we need to provide might look something like this:

```
print(pd.concat([y_pred,y_true],axis=1).tail())
```

We will feed this data into a simple function which will return a dataframe with the y_pred and y_true values, along with several other useful derivative values. These derivative values include:

- **sign_pred:** positive or negative sign of prediction
- **sign_true:** positive or negative sign of true outcome
- **is_correct:** 1 if sign_pred == sign_true, else 0
- **is_incorrect:** opposite
- **is_predicted:** 1 if the model has made a valid prediction, 0 if not. This is important if models only emit predictions when they have a certain level of confidence
- **result:** the profit (loss) resulting from betting one unit in the direction of the sign_pred. This is the continuous variable result of following the model

```
def make_df(y_pred,y_true):
    y_pred.name = 'y_pred'
    y_true.name = 'y_true'

    df = pd.concat([y_pred,y_true],axis=1)

    df['sign_pred'] = df.y_pred.apply(np.sign)
    df['sign_true'] = df.y_true.apply(np.sign)
    df['is_correct'] = 0
    df.loc[df.sign_pred * df.sign_true > 0 ,'is_correct'] = 1 # only registers 1 when prediction was made AND it was correct
    df['is_incorrect'] = 0
    df.loc[df.sign_pred * df.sign_true < 0,'is_incorrect'] = 1 # only registers 1 when prediction was made AND it was wrong
    df['is_predicted'] = df.is_correct + df.is_incorrect
    df['result'] = df.sign_pred * df.y_true
    return df

df = make_df(y_pred,y_true)
print(df.dropna().tail())
```
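To see how these primitives behave on edge cases, here is a small self-contained sketch with hand-made numbers. It uses a compact, vectorized stand-in for the function above (named `make_df_sketch` to keep it distinct), so treat it as an illustration rather than a drop-in replacement.

```python
import numpy as np
import pandas as pd

def make_df_sketch(y_pred, y_true):
    """Vectorized stand-in for make_df, for illustration only."""
    df = pd.concat([y_pred.rename('y_pred'), y_true.rename('y_true')], axis=1)
    df['sign_pred'] = np.sign(df.y_pred)
    df['sign_true'] = np.sign(df.y_true)
    agree = df.sign_pred * df.sign_true
    df['is_correct'] = (agree > 0).astype(int)      # prediction made and right
    df['is_incorrect'] = (agree < 0).astype(int)    # prediction made and wrong
    df['is_predicted'] = df.is_correct + df.is_incorrect
    df['result'] = df.sign_pred * df.y_true
    return df

toy = make_df_sketch(
    pd.Series([0.5, -0.2, 0.0, 0.3]),    # third row: model abstains (prediction of 0)
    pd.Series([1.0, 0.4, -0.7, -0.1]),
)
print(toy)
```

A prediction of exactly zero counts as neither correct nor incorrect, so it drops out of the accuracy denominator while contributing a `result` of zero.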

### Defining our metrics

With this set of intermediate variables pre-processed, we can more easily calculate metrics. The metrics we'll start with here include things like:

- **Accuracy:** Just as the name suggests, this measures the percent of predictions that were *directionally* correct vs. incorrect.
- **Edge:** perhaps the most useful of all metrics, this is the expected value of the prediction over a sufficiently large set of draws. Think of this like a blackjack card counter who knows the expected profit on each dollar bet when the odds are at a given level of favorability.
- **Noise:** critically important but often ignored, the noise metric estimates how dramatically the model's predictions vary from one day to the next. As you might imagine, a model which abruptly changes its mind every few days is much harder to follow (and much more expensive to trade) than one which is a bit more steady.

The below function takes in our pre-processed data primitives and returns a scorecard with `accuracy`, `edge`, and `noise`.

```
def calc_scorecard(df):
    scorecard = pd.Series(dtype=float)
    # building block metrics
    scorecard.loc['accuracy'] = df.is_correct.sum()*1. / (df.is_predicted.sum()*1.)*100
    scorecard.loc['edge'] = df.result.mean()
    # diff within each symbol so that changes are day-over-day, not across symbols
    scorecard.loc['noise'] = df.y_pred.groupby(level='symbol').diff().abs().mean()
    return scorecard

calc_scorecard(df)
```

Much better. I now know that we've been directionally correct about two-thirds of the time, and that following this signal would create an edge of ~1.5 units per time period.
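One caveat about `noise`: with a (date, symbol) index sorted by date, adjacent rows can belong to different symbols, so differencing the whole column at once would mix cross-symbol gaps into the metric. Differencing within each symbol, as sketched below on toy numbers, is closer to the day-to-day notion described above. The tickers and the helper name `noise_by_symbol` are hypothetical.

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [pd.to_datetime(['2017-03-09', '2017-03-10', '2017-03-13']), ['AAA', 'BBB']],
    names=['date', 'symbol'],
)
# AAA predictions: 1.0, 1.1, 1.2 ; BBB predictions: -1.0, -1.1, -1.2
y_pred_toy = pd.Series([1.0, -1.0, 1.1, -1.1, 1.2, -1.2], index=idx)

def noise_by_symbol(y_pred):
    """Mean absolute change in prediction, differenced within each symbol."""
    return y_pred.groupby(level='symbol').diff().abs().mean()

naive_noise = y_pred_toy.diff().abs().mean()    # also differences across symbols
grouped_noise = noise_by_symbol(y_pred_toy)     # day-over-day change per symbol
print(naive_noise, grouped_noise)
```

In this toy case each symbol's prediction drifts by only 0.1 per day, but the ungrouped calculation reports a far larger figure because it is mostly measuring the gap between the two symbols.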

Let's keep going. We can now easily combine and transform things to derive new metrics. The below function shows several examples, including:

- **y_true_chg** and **y_pred_chg:** The average magnitude of change (per period) in y_true and y_pred.
- **prediction_calibration:** A simple ratio of the magnitude of our predictions vs. magnitude of truth. This gives some indication of whether our model is properly tuned to the size of movement in addition to the direction of it.
- **capture_ratio:** Ratio of the "edge" we gain by following our predictions vs. the actual daily change. 100 would indicate that we were *perfectly* capturing the true movement of the target variable.

```
def calc_scorecard(df):
    scorecard = pd.Series(dtype=float)
    # building block metrics
    scorecard.loc['accuracy'] = df.is_correct.sum()*1. / (df.is_predicted.sum()*1.)*100
    scorecard.loc['edge'] = df.result.mean()
    # diff within each symbol so that changes are day-over-day, not across symbols
    scorecard.loc['noise'] = df.y_pred.groupby(level='symbol').diff().abs().mean()
    # derived metrics
    scorecard.loc['y_true_chg'] = df.y_true.abs().mean()
    scorecard.loc['y_pred_chg'] = df.y_pred.abs().mean()
    scorecard.loc['prediction_calibration'] = scorecard.loc['y_pred_chg']/scorecard.loc['y_true_chg']
    scorecard.loc['capture_ratio'] = scorecard.loc['edge']/scorecard.loc['y_true_chg']*100
    return scorecard

calc_scorecard(df)
```

Additionally, metrics can be easily calculated for only long or short predictions (for a two-sided model) or separately for positions which ended up being winners and losers.

- **edge_long** and **edge_short:** The "edge" for only long signals or for short signals.
- **edge_win** and **edge_lose:** The "edge" for only winners or for only losers.

```
def calc_scorecard(df):
    scorecard = pd.Series(dtype=float)
    # building block metrics
    scorecard.loc['accuracy'] = df.is_correct.sum()*1. / (df.is_predicted.sum()*1.)*100
    scorecard.loc['edge'] = df.result.mean()
    # diff within each symbol so that changes are day-over-day, not across symbols
    scorecard.loc['noise'] = df.y_pred.groupby(level='symbol').diff().abs().mean()
    # derived metrics
    scorecard.loc['y_true_chg'] = df.y_true.abs().mean()
    scorecard.loc['y_pred_chg'] = df.y_pred.abs().mean()
    scorecard.loc['prediction_calibration'] = scorecard.loc['y_pred_chg']/scorecard.loc['y_true_chg']
    scorecard.loc['capture_ratio'] = scorecard.loc['edge']/scorecard.loc['y_true_chg']*100
    # metrics for a subset of predictions
    scorecard.loc['edge_long'] = df[df.sign_pred == 1].result.mean() - df.y_true.mean()
    scorecard.loc['edge_short'] = df[df.sign_pred == -1].result.mean() - df.y_true.mean()
    scorecard.loc['edge_win'] = df[df.is_correct == 1].result.mean() - df.y_true.mean()
    scorecard.loc['edge_lose'] = df[df.is_incorrect == 1].result.mean() - df.y_true.mean()
    return scorecard

calc_scorecard(df)
```

From this slate of metrics, we've gained much more insight than we got from MSE, R-squared, etc...

- The model is predicting with a strong directional accuracy
- We are generating about 1.4 units of "edge" (expected profit) each prediction, which is about half of the total theoretical profit
- The model makes more on winners than it loses on losers
- The model is equally valid on both long and short predictions

If this were real data, I would be rushing to put this model into production!

### Metrics over time

Critically important when considering using a model in live trading is to understand (a) how consistent the model's performance has been, and (b) whether its current performance has degraded from its past. Markets have a way of discovering and eliminating past sources of edge.

Here, a two-line function will calculate each metric by year:

```
def scorecard_by_year(df):
    df['year'] = df.index.get_level_values('date').year
    return df.groupby('year').apply(calc_scorecard).T

print(scorecard_by_year(df))
```

It's just as simple to compare performance across symbols (or symbol groups, if you've defined those):

```
def scorecard_by_symbol(df):
    return df.groupby(level='symbol').apply(calc_scorecard).T

print(scorecard_by_symbol(df))
```

### Comparing models

The real payoff from this methodology comes when making comparisons between models, periods, segments, and so on.

To illustrate, let's say that we're comparing two models, a linear regression vs. a random forest, for performance on a training set and a testing set (pretend for a moment that we didn't adhere to walk-forward model building practices...).

```
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNetCV,Lasso,Ridge
from sklearn.ensemble import RandomForestRegressor
X_train,X_test,y_train,y_test = train_test_split(features,outcome,test_size=0.20,shuffle=False)
# linear regression
model1 = LinearRegression().fit(X_train,y_train)
model1_train = pd.Series(model1.predict(X_train),index=X_train.index)
model1_test = pd.Series(model1.predict(X_test),index=X_test.index)
# random forest
model2 = RandomForestRegressor().fit(X_train,y_train)
model2_train = pd.Series(model2.predict(X_train),index=X_train.index)
model2_test = pd.Series(model2.predict(X_test),index=X_test.index)
# create dataframes for each
model1_train_df = make_df(model1_train,y_train)
model1_test_df = make_df(model1_test,y_test)
model2_train_df = make_df(model2_train,y_train)
model2_test_df = make_df(model2_test,y_test)
s1 = calc_scorecard(model1_train_df)
s1.name = 'model1_train'
s2 = calc_scorecard(model1_test_df)
s2.name = 'model1_test'
s3 = calc_scorecard(model2_train_df)
s3.name = 'model2_train'
s4 = calc_scorecard(model2_test_df)
s4.name = 'model2_test'
print(pd.concat([s1,s2,s3,s4],axis=1))
```

This quick and dirty scorecard comparison gives us a great deal of useful information. We learn that:

- The relatively simple linear regression (model1) does a very good job of prediction, correct about 68% of the time, capturing >50% of available price movement (this is very good) during training
- Model1 holds up very well out of sample, performing nearly as well on test as train
- Model2, a more complex random forest ensemble model, appears *far* superior on the training data, capturing 90%+ of available price action, but appears quite overfit and does not perform nearly as well on the test set.
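A quick way to quantify that kind of degradation is to express each test metric as a fraction of its training value. The sketch below uses invented scorecard numbers purely for illustration (they are not the output of the models above), and `retention` is a hypothetical helper name.

```python
import pandas as pd

# Illustrative numbers only, NOT results from the models in this post
scorecards = pd.DataFrame(
    {
        'modelA_train': {'accuracy': 68.0, 'edge': 1.5},
        'modelA_test': {'accuracy': 66.0, 'edge': 1.35},
        'modelB_train': {'accuracy': 90.0, 'edge': 3.0},
        'modelB_test': {'accuracy': 63.0, 'edge': 1.2},
    }
)

def retention(scorecards, model):
    """Test metric divided by train metric; values near 1 suggest little degradation."""
    return scorecards[f'{model}_test'] / scorecards[f'{model}_train']

summary = pd.DataFrame({m: retention(scorecards, m) for m in ['modelA', 'modelB']})
print(summary.round(2))
```

A model that keeps most of its edge out of sample shows ratios close to 1, while an overfit one falls well below. For accuracy, comparing against a 50% baseline may be more informative than a raw ratio.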

### Summary

In this tutorial, we've covered a framework for evaluating models in a market prediction context and have demonstrated a few useful metrics. However, the approach can be extended much further to suit your needs. You can consider:

- Adding new metrics to the standard scorecard
- Comparing scorecard metrics for subsets of the universe. For instance, each symbol or grouping of symbols
- Calculating and plotting performance metrics across time to validate robustness or to identify trends
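As one sketch of the last idea, the snippet below computes a rolling average of the per-date `result` on synthetic data, which could then be plotted. The column name follows the `result` primitive defined earlier; the tickers, window length, and random data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
dates = pd.bdate_range('2017-01-02', periods=250)
idx = pd.MultiIndex.from_product([dates, ['AAA', 'BBB']], names=['date', 'symbol'])
toy = pd.DataFrame({'result': rng.standard_normal(len(idx)) + 0.1}, index=idx)

def rolling_edge(df, window=60):
    """Average `result` across symbols per date, then smooth over `window` dates."""
    daily = df['result'].groupby(level='date').mean()
    return daily.rolling(window).mean()

edge_ts = rolling_edge(toy, window=60)
print(edge_ts.dropna().tail())
# In a notebook with matplotlib available, edge_ts.plot() would chart the series
```

A rolling edge that trends steadily toward zero would be a hint that the signal is decaying, which a single full-sample number cannot show.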

In the final post of this series, I'll present a unique framework for creating an *ensemble model* to blend together the results of your many different forecasting models.

Please feel free to add to the comment section with your good ideas for useful metrics, with questions/comments on this post, and topic ideas for future posts.

### One last thing...

If you've found this post useful, please follow @data2alpha on twitter and forward to a friend or colleague who may also find this topic interesting.
