The Alpha Scientist

Discovering alpha in the stock market using data science

COVID By The Numbers

Preface

This post is something of a departure from the usual focus of this blog: exploring the use of data science and machine learning to improve investment decision-making. However, everyone reading this has been living through a (hopefully) once-in-a-lifetime pandemic, so I think the departure is warranted.

Motivation

As a data geek, I've tried to make sense of this frightening, fascinating, frustrating era in our lives at least partially through data. Few days have passed in the past 18 months without my checking some sort of COVID statistics - case rates, positivity rates, transmission rates, vaccinations, and so forth. Early in the pandemic, many great data dashboards popped up - from the Washington Post, NY Times, and others.

However, innovation in this department seems to have stagnated, and I often can't find empirical answers to important questions. Like many a data geek, I've become an armchair epidemiologist (epidemiology being a subject I had barely given a thought to prior to Feb 2020), so I'd like to roll my own dashboard of state-by-state COVID data to explore some of those questions.

In this post, I will use familiar Python tools to create a Jupyter-notebook-based dashboard. Along the way, I'll cover some basic patterns for interfacing with REST APIs, plotting, and fitting simple linear regressions. All of these are very relevant to data-science-driven finance, so it's actually a very small departure for the blog. If you find this worthwhile, please take a moment to post a comment below with what you found interesting and what you'd like to see covered in the future.

I'm making a version of this notebook available on Google's great Colab platform so you can get your hands on the examples with an absolute minimum of setup. Scroll to the bottom for the access link.

COVID Data API

I'm going to make use of the excellent API maintained by COVID Act Now, a great non-profit that came into existence in March 2020 to gather and distribute COVID data from states, counties, and municipalities in a consistent and convenient manner. I suspect that many of the websites you've used to monitor COVID statistics make use of this API to do what they do. Please take a minute to donate to this group if you have a few coins to spare...

The API is free to use, but requires that you register for an API key and jot down a few words about what you're doing with the data. Out of respect for the organization, I won't share my API key here, so please take a second to register with COVID Act Now for a key of your own.

Below, we'll go through the data in much more detail, but at a high level this is the sort of data offered by the API. Current and historical data of these types are available for different geographies (United States overall, states, counties, and CBSAs).

Let's get to it.

Setup and Imports

Below are the imports which will be required. You will also need to obtain the aforementioned API key and assign it to the API_KEY variable.

import requests
import pandas as pd
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
import matplotlib.pyplot as plt
import plotly.express as px
from pprint import pprint
import json

API_KEY = ''  # paste your own COVID Act Now API key between the quotes
import plotly.offline as pyo
# May be necessary for Plotly charts to render inside Jupyter; see
# https://stackoverflow.com/questions/52771328/plotly-chart-not-showing-in-jupyter-notebook
pyo.init_notebook_mode()
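
If you'd rather not paste the key into a notebook you might share, one option is to read it from an environment variable and build request URLs from it. The snippet below is only a minimal sketch under a few assumptions: the environment variable name `COVID_ACT_NOW_API_KEY` and the helpers `load_api_key` and `build_url` are hypothetical names invented for illustration, and the base URL and `apiKey` query parameter reflect my understanding of the COVID Act Now v2 API, so check them against the official documentation before relying on them.

```python
import os
from urllib.parse import urlencode

# Assumed base URL for the COVID Act Now v2 API; verify against the official docs.
BASE_URL = "https://api.covidactnow.org/v2"

# Hypothetical environment variable name, chosen for this sketch only.
ENV_VAR = "COVID_ACT_NOW_API_KEY"


def load_api_key(environ=None):
    """Return the API key found in the environment, or raise if it is missing."""
    environ = os.environ if environ is None else environ
    key = environ.get(ENV_VAR, "").strip()
    if not key:
        raise RuntimeError(f"Set {ENV_VAR} to your COVID Act Now API key")
    return key


def build_url(resource, api_key):
    """Build a URL of the form <BASE_URL>/<resource>.json?apiKey=<key>."""
    query = urlencode({"apiKey": api_key})  # URL-encodes the key safely
    return f"{BASE_URL}/{resource}.json?{query}"
```

With the key exported in your shell before launching Jupyter, something like `requests.get(build_url("states", load_api_key()))` would then issue the request without the key ever appearing in the notebook itself.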