The Alpha Scientist

Discovering alpha in the stock market using data science

Reddit for Fun and Profit [part 1]

Author: Chad Gray

The news story that captured the full attention of the financial press in 2021 was the GameStop / WallStreetBets / RoaringKitty episode of late January. A group of presumably small retail traders banded together on Reddit's r/wallstreetbets forum to drive the prices of $GME, $AMC, and other "meme stocks" to unimaginable heights, wreaking havoc on the crowd of hedge funds who had shorted those stocks.

In the wake of that headline-grabbing incident, many a hedge fund has begun to treat social media buzz - especially around "meme stocks" - as a risk factor when taking large positions, particularly short ones. The smartest funds are going beyond hand-wringing and are starting to monitor forums like r/wallstreetbets to identify potential risks in their portfolios.

Below, I'm going to walk through an example of collecting r/wallstreetbets activity on a handful of stocks using Reddit's semi-unofficial PushShift API and related packages. In a following post, I'll walk through a simple example of sentiment analysis using VADER and other assorted Python packages.

If you'd like to experiment with the below code without tedious copy-pasting, I've made it available at the Google Colab link below.

Setup and Download

We will access the PushShift API through a Python package named psaw (an acronym for "PushShift API Wrapper"), so first we'll need to pip install that. If you don't already have the fantastically useful jsonlines package installed, it'd be a good idea to install that too.

In [1]:
!pip install psaw
!pip install jsonlines
Requirement already satisfied: psaw in /Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages (0.1.0)
Requirement already satisfied: Click in /Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages (from psaw) (8.0.1)
Requirement already satisfied: requests in /Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages (from psaw) (2.25.1)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages (from requests->psaw) (1.26.4)
Requirement already satisfied: chardet<5,>=3.0.2 in /Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages (from requests->psaw) (4.0.0)
Requirement already satisfied: certifi>=2017.4.17 in /Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages (from requests->psaw) (2020.12.5)
Requirement already satisfied: idna<3,>=2.5 in /Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages (from requests->psaw) (2.10)
WARNING: You are using pip version 21.2.2; however, version 21.3.1 is available.
You should consider upgrading via the '/Library/Frameworks/Python.framework/Versions/3.8/bin/python3.8 -m pip install --upgrade pip' command.
Requirement already satisfied: jsonlines in /Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages (2.0.0)
WARNING: You are using pip version 21.2.2; however, version 21.3.1 is available.
You should consider upgrading via the '/Library/Frameworks/Python.framework/Versions/3.8/bin/python3.8 -m pip install --upgrade pip' command.

The imports are dead-simple. I'll import pandas as well since that's my Swiss Army knife of choice. I'm also going to define a data root path pointing to where on my system I want to store the downloaded data.

In [1]:
import os
DATA_ROOT = '../data/'

import pandas as pd
from datetime import datetime

from psaw import PushshiftAPI
api = PushshiftAPI()

PushShift and psaw Overview

I'll start with a quick example of how to use the psaw wrapper. You'll want to refer to the psaw and PushShift GitHub pages for more complete documentation.

First, we will use the search_submissions API method, which searches submissions (the initial post in a new thread) for the given ticker. We need to pass in Unix-style integer timestamps rather than human-readable ones, so here we're using pandas to do the conversion.

You'll also notice the filter parameter, which allows you to return only a subset of the many fields available. If you want to see the full list of available fields, read the docs or run the below code snippet.

gen = api.search_submissions(q='GME', limit=1)
list(gen)[0].d_.keys()

In [2]:
start_epoch = int(pd.to_datetime('2021-01-01').timestamp())
end_epoch = int(pd.to_datetime('2021-01-02').timestamp())

gen = api.search_submissions(q='GME', # this is the keyword (ticker symbol) for which we're searching
                               after=start_epoch, before=end_epoch, # these are the unix-based timestamps to search between
                               subreddit=['wallstreetbets','stocks'], # one or more subreddits to include in the search
                               filter=['id','url','author', 'title', 'score',
                                       'subreddit','selftext','num_comments'], # list of fields to return
                               limit = 2 # limit on the number of records returned
                              ) 

You'll notice that this ran awfully quickly. In part, that's because it returns a lazy generator object which doesn't (yet) contain the data we want. One simple way to make the generator actually pull the data is to wrap it in a list() call. Below is an example of what that returns.

Side note: if you don't catch the resulting list in a variable the first time you run this, you'll notice that it won't work a second time. The generator has been "consumed" and emptied of objects. So we will catch the returned value in a variable called lst and view that...

In [3]:
lst = list(gen)
lst
Out[3]:
[submission(author='Alexbuildit', created_utc=1609541557, id='kol20h', num_comments=2, score=1, selftext="Brand new investor here. Saw all the hype surrounding GME, and bought in with this months paycheck. Had a good laugh when I saw the reddit award &lt; GME post. Already down over a hundred dollars on GME, but not gonna sell! Let's send GME to the moon! I'll keep picking up GME whenever I can. 🚀🚀🚀🚀🚀🚀🚀🚀🚀🚀🚀🚀🚀🚀🚀🚀\n\n[Yep. Checks out.](https://preview.redd.it/i6wj8ynnvs861.png?width=804&amp;format=png&amp;auto=webp&amp;s=69a0688bddce3cf409caea5adde58421e60296fc)", subreddit='wallstreetbets', title='WSB In A Nutshell. Send GME To The Moon! 🚀', url='https://www.reddit.com/r/wallstreetbets/comments/kol20h/wsb_in_a_nutshell_send_gme_to_the_moon/', created=1609566757.0, d_={'author': 'Alexbuildit', 'created_utc': 1609541557, 'id': 'kol20h', 'num_comments': 2, 'score': 1, 'selftext': "Brand new investor here. Saw all the hype surrounding GME, and bought in with this months paycheck. Had a good laugh when I saw the reddit award &lt; GME post. Already down over a hundred dollars on GME, but not gonna sell! Let's send GME to the moon! I'll keep picking up GME whenever I can. 🚀🚀🚀🚀🚀🚀🚀🚀🚀🚀🚀🚀🚀🚀🚀🚀\n\n[Yep. Checks out.](https://preview.redd.it/i6wj8ynnvs861.png?width=804&amp;format=png&amp;auto=webp&amp;s=69a0688bddce3cf409caea5adde58421e60296fc)", 'subreddit': 'wallstreetbets', 'title': 'WSB In A Nutshell. Send GME To The Moon! 🚀', 'url': 'https://www.reddit.com/r/wallstreetbets/comments/kol20h/wsb_in_a_nutshell_send_gme_to_the_moon/', 'created': 1609566757.0}),
 submission(author='luncheonmeat79', created_utc=1609538748, id='kok7my', num_comments=57, score=1, selftext='', subreddit='wallstreetbets', title='Ryan Cohen-GME confirmation bias', url='https://i.redd.it/xv2ie5t8ns861.jpg', created=1609563948.0, d_={'author': 'luncheonmeat79', 'created_utc': 1609538748, 'id': 'kok7my', 'num_comments': 57, 'score': 1, 'selftext': '', 'subreddit': 'wallstreetbets', 'title': 'Ryan Cohen-GME confirmation bias', 'url': 'https://i.redd.it/xv2ie5t8ns861.jpg', 'created': 1609563948.0})]

Each element of the returned list is a submission object which, as far as I can tell, simply provides easier access to the fields.

In [4]:
print("id:",lst[0].id) # this is Reddit's unique ID for this post
print("url:",lst[0].url) 
print("author:",lst[0].author) 
print("title:",lst[0].title)
print("score:",lst[0].score) # upvote/downvote-based score, doesn't seem 100% reliable
print("subreddit:",lst[0].subreddit)
print("num_comments:",lst[0].num_comments) # number of comments in the thread (which we can get later if we choose)
print("selftext:",lst[0].selftext) # This is the body of the post
id: kol20h
url: https://www.reddit.com/r/wallstreetbets/comments/kol20h/wsb_in_a_nutshell_send_gme_to_the_moon/
author: Alexbuildit
title: WSB In A Nutshell. Send GME To The Moon! 🚀
score: 1
subreddit: wallstreetbets
num_comments: 2
selftext: Brand new investor here. Saw all the hype surrounding GME, and bought in with this months paycheck. Had a good laugh when I saw the reddit award &lt; GME post. Already down over a hundred dollars on GME, but not gonna sell! Let's send GME to the moon! I'll keep picking up GME whenever I can. 🚀🚀🚀🚀🚀🚀🚀🚀🚀🚀🚀🚀🚀🚀🚀🚀

[Yep. Checks out.](https://preview.redd.it/i6wj8ynnvs861.png?width=804&amp;format=png&amp;auto=webp&amp;s=69a0688bddce3cf409caea5adde58421e60296fc)

Perhaps a more familiar way to interact with each item of this list is as a dict. Luckily, the wrapper makes it easy to get all of the available info as a dict - like this:

In [5]:
lst[0].d_
Out[5]:
{'author': 'Alexbuildit',
 'created_utc': 1609541557,
 'id': 'kol20h',
 'num_comments': 2,
 'score': 1,
 'selftext': "Brand new investor here. Saw all the hype surrounding GME, and bought in with this months paycheck. Had a good laugh when I saw the reddit award &lt; GME post. Already down over a hundred dollars on GME, but not gonna sell! Let's send GME to the moon! I'll keep picking up GME whenever I can. 🚀🚀🚀🚀🚀🚀🚀🚀🚀🚀🚀🚀🚀🚀🚀🚀\n\n[Yep. Checks out.](https://preview.redd.it/i6wj8ynnvs861.png?width=804&amp;format=png&amp;auto=webp&amp;s=69a0688bddce3cf409caea5adde58421e60296fc)",
 'subreddit': 'wallstreetbets',
 'title': 'WSB In A Nutshell. Send GME To The Moon! 🚀',
 'url': 'https://www.reddit.com/r/wallstreetbets/comments/kol20h/wsb_in_a_nutshell_send_gme_to_the_moon/',
 'created': 1609566757.0}

That's much better!

However, you'll notice that the returned values for created and created_utc aren't particularly user-friendly. They're in the same UNIX-style epoch integer format we had to specify in the query. A quick way to add a human-readable version is a function like the one below; note the human-readable timestamp appended to the end of the dict.

In [6]:
def convert_date(timestamp):
    return datetime.fromtimestamp(timestamp).strftime('%Y-%m-%dT%H:%M:%S')
lst[0].d_['datetime_utc'] = convert_date( lst[0].d_['created_utc'] )
lst[0].d_
Out[6]:
{'author': 'Alexbuildit',
 'created_utc': 1609541557,
 'id': 'kol20h',
 'num_comments': 2,
 'score': 1,
 'selftext': "Brand new investor here. Saw all the hype surrounding GME, and bought in with this months paycheck. Had a good laugh when I saw the reddit award &lt; GME post. Already down over a hundred dollars on GME, but not gonna sell! Let's send GME to the moon! I'll keep picking up GME whenever I can. 🚀🚀🚀🚀🚀🚀🚀🚀🚀🚀🚀🚀🚀🚀🚀🚀\n\n[Yep. Checks out.](https://preview.redd.it/i6wj8ynnvs861.png?width=804&amp;format=png&amp;auto=webp&amp;s=69a0688bddce3cf409caea5adde58421e60296fc)",
 'subreddit': 'wallstreetbets',
 'title': 'WSB In A Nutshell. Send GME To The Moon! 🚀',
 'url': 'https://www.reddit.com/r/wallstreetbets/comments/kol20h/wsb_in_a_nutshell_send_gme_to_the_moon/',
 'created': 1609566757.0,
 'datetime_utc': '2021-01-01T14:52:37'}
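
As an aside, datetime.fromtimestamp() converts to the machine's local timezone, which is why the string above reads 14:52 even though the field is named datetime_utc (the epoch value corresponds to 22:52 UTC). If you prefer true UTC and a vectorized conversion, a minimal pandas-based alternative is sketched below; it's just an option, not part of the original workflow.

ts = pd.to_datetime(lst[0].d_['created_utc'], unit='s', utc=True)  # interpret epoch seconds as UTC
print(ts.strftime('%Y-%m-%dT%H:%M:%S'))  # 2021-01-01T22:52:37 for the post above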

Depending on the ticker, you may find A LOT of posts (if you don't assign a limit value, of course). One handy capability of the API is filtering based on field values, so we can search only for submissions with more than N comments. Notice that we need to express the "greater than" as a string (">100"), which isn't totally obvious from the documentation.

In [7]:
gen = api.search_submissions(q='GME', after=start_epoch, before=end_epoch, # these are the unix-based timestamps to search between
                             subreddit=['wallstreetbets','stocks'], 
                             filter=['id','url','author', 'title', 'score','subreddit','selftext','num_comments'], # list of fields to return
                             num_comments=">100",
                             limit = 2 # limit on the number of records returned
                              ) 
lst = list(gen)
item = lst[0]
item.d_
Out[7]:
{'author': 'redcedar53',
 'created_utc': 1609535773,
 'id': 'kojagn',
 'num_comments': 157,
 'score': 1,
 'selftext': 'January 4/5: Cohen discloses his additional 7% purchase of GME last Friday, brining his total ownership of GME to 20%. \n\nJanuary 6/7: GME announces Cohen’s seat at the BoD as a special advisor to the modernization of GME. \n\nJanuary 9: GME releases December sales numbers.\n\nJanuary 11: The Conference where Papa Cohen himself presents his vision and roadmap for GME and gathers institutional buyers.\n\nTLDR: Next week pops to build up the momentum for the eventual rocket squeeze on the week of 11th.\n\n#NEW YEAR, NEW (G)ME 🚀🚀🚀🚀🚀🚀🚀',
 'subreddit': 'wallstreetbets',
 'title': 'GME’s Game Plan Next Week (Probably)',
 'url': 'https://www.reddit.com/r/wallstreetbets/comments/kojagn/gmes_game_plan_next_week_probably/',
 'created': 1609560973.0}

Sidebar: Getting Comments

For our purposes, the submissions alone offer ample material to analyze, so I'm generally ignoring the comments underneath them, other than tracking the num_comments value. However, if you wanted to pull the comments for a given submission, you could do it as below.

Note a few things:

  1. Pass in the id property of the submission item as link_id. This also isn't totally clearly documented, IMO.
  2. The filter values are a little different because the fields available on a comment are not exactly the same as on a submission. The main changes to note are that url -> permalink and selftext -> body. Otherwise, they seem similar.
In [8]:
comments_lst = list(api.search_comments(link_id=item.id,
                                        filter=['id','parent_id','permalink','author', 'title', 
                                                'subreddit','body','num_comments','score'],
                                        limit=5))
pd.DataFrame(comments_lst)
Out[8]:
author body created_utc id parent_id permalink score subreddit created d_
0 [deleted] [removed] 1617413524 gt7atx5 t1_ghtky8c /r/wallstreetbets/comments/kojagn/gmes_game_pl... 1 wallstreetbets 1.617439e+09 {'author': '[deleted]', 'body': '[removed]', '...
1 [deleted] [removed] 1611883171 gl6b32t t1_ghriokl /r/wallstreetbets/comments/kojagn/gmes_game_pl... 1 wallstreetbets 1.611908e+09 {'author': '[deleted]', 'body': '[removed]', '...
2 LemniscateSideEight No. This is wrong. He has to disclose options:... 1609724561 gi0rulb t1_ghsugzo /r/wallstreetbets/comments/kojagn/gmes_game_pl... 1 wallstreetbets 1.609750e+09 {'author': 'LemniscateSideEight', 'body': 'No....
3 LemniscateSideEight He does not. He does not care about peasants. 1609697542 ghz689z t1_ghrhevd /r/wallstreetbets/comments/kojagn/gmes_game_pl... 1 wallstreetbets 1.609723e+09 {'author': 'LemniscateSideEight', 'body': 'He ...
4 possibly6 I remember seeing massive $1m plus orders for ... 1609654197 ghwq0cf t1_ghrfiim /r/wallstreetbets/comments/kojagn/gmes_game_pl... 1 wallstreetbets 1.609679e+09 {'author': 'possibly6', 'body': 'I remember se...

Building a Downloader

With a basic understanding of the API and psaw wrapper, we can construct a simple downloader which grabs all submissions (with more than n comments) in a one-week time window for any stock ticker. Then, since we'll want to avoid calling the API repeatedly for the same data, we'll save the results to a jsonlines file.

If you're not familiar with jsonlines, it's well worth checking out. Note that, because we open the file in append mode, jsonlines will append to the end of an existing file if one exists, or create the file if one doesn't. Keep this in mind if running the same code on the same date/ticker repeatedly. It's probably easiest to assume the jl files have duplicates in them and to simply dedupe when reading back from disk.

In [9]:
import jsonlines
from tqdm.notebook import tqdm
import time
import random

def get_submissions(symbol, end_date):
    end_date = pd.to_datetime(end_date) #ensure it's a datetime object not string
    end_epoch = int(end_date.timestamp())
    start_epoch = int((end_date-pd.offsets.Week(1)).timestamp())
    gen = api.search_submissions(q=f'${symbol}', after=start_epoch, before=end_epoch,
                                subreddit=['wallstreetbets','stocks'], num_comments = ">10",
                                filter=['id','url','author', 'title', 'subreddit',
                                        'num_comments','score','selftext'] ) 

    path = os.path.join(DATA_ROOT,f'{symbol}.jl')
    with jsonlines.open(path, mode='a') as writer:
        for item in gen:
            item.d_['date_utc'] = convert_date(item.d_['created_utc'])
            writer.write(item.d_)
    return


get_submissions('GME','2021-07-19')
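
Since append mode means repeated runs can leave duplicate records in the .jl file, here's a minimal sketch of reading a file back into a DataFrame and deduplicating on Reddit's id field. The load_submissions helper is my own illustration, not part of psaw or the downloader above.

def load_submissions(symbol):
    # read every JSON line back in, drop duplicate posts, and sort chronologically
    path = os.path.join(DATA_ROOT, f'{symbol}.jl')
    with jsonlines.open(path) as reader:   # default mode is read
        records = list(reader)             # each line parses to a dict
    df = pd.DataFrame(records)
    df = df.drop_duplicates(subset='id')   # same post may have been written more than once
    return df.sort_values('created_utc').reset_index(drop=True)

df = load_submissions('GME')
df[['date_utc', 'title', 'num_comments']].head()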

If we had a list of tickers that we wanted to collect across a longer date range, we could use some nested for loops like the below to iterate through symbols and weeks. Running the below should take 15-20 minutes to complete, so feel free to narrow the scope of tickers or dates if needed.

In [10]:
import traceback

symbols = ['GME']#,'AMC','SPCE','TSLA']

for symbol in tqdm(symbols):
    print(symbol)
    for date in tqdm(pd.date_range('2021-01-01','2021-10-31', freq='W')):
        try:
            get_submissions(symbol,date)
        except Exception:  # log the error and back off, but let KeyboardInterrupt through
            traceback.print_exc()
            time.sleep(5)
GME
/Users/Chad/opt/anaconda3/envs/reddit/lib/python3.6/site-packages/psaw/PushshiftAPI.py:192: UserWarning: Got non 200 code 429
  warnings.warn("Got non 200 code %s" % response.status_code)
/Users/Chad/opt/anaconda3/envs/reddit/lib/python3.6/site-packages/psaw/PushshiftAPI.py:180: UserWarning: Unable to connect to pushshift.io. Retrying after backoff.
  warnings.warn("Unable to connect to pushshift.io. Retrying after backoff.")

Try it Out!

Enough reading, already! The above code is available on Colab at the link below. Feel free to try it out yourself.

You can modify the notebook however you'd like without risk of breaking it. I really hope that those interested will "fork" from my notebook (all you'll need is a Google Drive account to save a copy of the file...) and extend it to answer your own questions through data.

Summary

In this first post, we've made it through the heavy lifting of downloading data from the API and storing it in a usable format on disk. In the next segment, we will do some basic analysis on how spikes in Reddit traffic may signal risk of increased volatility in a given stock.

One last thing...

If you've found this post useful or enlightening, please consider subscribing to the email list to be notified of future posts (email addresses will only be used for this purpose...). To subscribe, scroll to the top of this page and look at the right sidebar.

You can also follow me on Twitter (@data2alpha) and forward this post to a friend or colleague who may find the topic interesting.
