The news story of 2021 that captured the financial press's complete attention was the GameStop / WallStreetBets / RoaringKitty episode of late January. A group of presumably small retail traders banded together on Reddit's r/wallstreetbets forum to drive the prices of $GME, $AMC, and other "meme stocks" to unimaginable heights, wreaking havoc on the crowd of hedge funds that had shorted them.
In the wake of that headline-grabbing incident, many a hedge fund has begun to treat social media buzz - especially around "meme stocks" - as a risk factor to weigh when taking large positions, especially short ones. The smartest funds are going beyond hand-wringing and have started monitoring forums like r/wallstreetbets to identify potential risks in their portfolios.
Below, I'm going to walk through an example of collecting r/wallstreetbets activity on a handful of example stocks using the semi-unofficial PushShift API (a third-party archive of Reddit data) and related packages. In a following post, I'll walk through a simple example of sentiment analysis using VADER and other assorted Python packages.
If you'd like to experiment with the code without tedious copy-pasting, I've made it available at the Google Colab link below.
!pip install psaw
!pip install jsonlines
The imports are dead simple. I'll import pandas as well since that's my Swiss Army knife of choice. I'm also going to define a data root path pointing to where on my system I want to store the downloaded data.
import os
from datetime import datetime

import pandas as pd
from psaw import PushshiftAPI

DATA_ROOT = '../data/'  # where the downloaded data will be stored

api = PushshiftAPI()
PushShift and psaw Overview¶
I'll start with a quick example of how to use the psaw wrapper. You'll want to refer to the psaw and PushShift GitHub pages for more complete documentation.
First, we will use the search_submissions API method, which searches submissions (the initial post in a new thread) for the given ticker. We need to pass in Unix-style integer timestamps rather than human-readable ones, so here we're using pandas to do that.
You'll also notice the filter parameter, which allows you to return only a subset of the many fields available. If you want to see the full list of available fields, read the docs or run the code snippet below.
gen = api.search_submissions(q='GME',limit=1)
list(gen)[0].d_.keys()
start_epoch = int(pd.to_datetime('2021-01-01').timestamp())
end_epoch = int(pd.to_datetime('2021-01-02').timestamp())
gen = api.search_submissions(q='GME', # this is the keyword (ticker symbol) for which we're searching
after=start_epoch, before=end_epoch, # these are the unix-based timestamps to search between
subreddit=['wallstreetbets','stocks'], # one or more subreddits to include in the search
filter=['id','url','author', 'title', 'score',
'subreddit','selftext','num_comments'], # list of fields to return
limit = 2 # limit on the number of records returned
)
You'll notice that this ran awfully quickly. In part, that's because it has returned a lazy generator object which doesn't (yet) contain the data we want. One simple way to make the generator object actually pull the data is to wrap it in a list() call. Below is an example of what that returns.
Side note: if you don't catch the resulting list in a variable the first time you run this, you'll notice that it won't work a second time. The generator has been "consumed" and emptied of objects. So we will catch the returned value in a variable called lst and view that...
lst = list(gen)
lst
Each element of the returned list is a submission object which, as far as I can tell, simply provides easier access to the fields.
print("id:",lst[0].id) # this is Reddit's unique ID for this post
print("url:",lst[0].url)
print("author:",lst[0].author)
print("title:",lst[0].title)
print("score:",lst[0].score) # upvote/downvote-based score, doesn't seem 100% reliable
print("subreddit:",lst[0].subreddit)
print("num_comments:",lst[0].num_comments) # number of comments in the thread (which we can get later if we choose)
print("selftext:",lst[0].selftext) # This is the body of the post
Perhaps a more familiar way to interact with each item of this list is as a dict. Luckily, the API includes an easy way to get all of the available info as a dict without any effort - like this:
lst[0].d_
That's much better!
However, you'll notice that the returned values for created and created_utc aren't particularly user-friendly. They're in the same Unix-style epoch integer format we had to specify in the query. A quick way to add a human-readable version is a function like the one below. You'll notice the human-readable timestamp added onto the end.
def convert_date(timestamp):
    # use utcfromtimestamp so the result matches the *_utc fields (fromtimestamp would give local time)
    return datetime.utcfromtimestamp(timestamp).strftime('%Y-%m-%dT%H:%M:%S')
lst[0].d_['datetime_utc'] = convert_date( lst[0].d_['created_utc'] )
lst[0].d_
Depending on the ticker, you may find A LOT of posts (if you don't assign a limit value, of course). One handy capability of the API is filtering on field values, so we can search only for submissions with at least N comments. Notice that we need to express the greater-than condition as a string (">100"), which isn't totally obvious from the documentation.
gen = api.search_submissions(q='GME', after=start_epoch, before=end_epoch, # these are the unix-based timestamps to search between
subreddit=['wallstreetbets','stocks'],
filter=['id','url','author', 'title', 'score','subreddit','selftext','num_comments'], # list of fields to return
num_comments=">100",
limit = 2 # limit on the number of records returned
)
lst = list(gen)
item = lst[0]
item.d_
Sidebar: Getting Comments¶
For our purposes, the submissions alone offer ample material to analyze, so I'm generally ignoring the comments underneath them, other than tracking the num_comments value. However, if you wanted to pull the comments for a given submission, you could do it like below.
Note a few things:
- Pass in the id property of the submission item as link_id. This is also not totally clearly documented, IMO.
- The filter values are a little different because the fields available on a comment are not exactly the same as on a submission. The main changes to note are that url -> permalink and selftext -> body. Otherwise, they seem similar.
comments_lst = list(api.search_comments(link_id=item.id,
filter=['id','parent_id','permalink','author', 'title',
'subreddit','body','num_comments','score'],
limit=5))
pd.DataFrame(comments_lst)
Building a Downloader¶
With a basic understanding of the API and the psaw wrapper, we can construct a simple downloader which grabs all submissions (with more than n comments) for a one-week time window on any stock ticker. Then, since we will probably want to avoid calling the API repeatedly for the same data, we will save the results as a jsonlines file.
If you're not familiar with jsonlines, it's well worth checking out. Note that opening a file in append mode ('a'), as we do below, will add to the end of an existing file if one exists, or create the file if one doesn't. Keep this in mind if you run the same code on the same date/ticker repeatedly. It's probably easiest to assume the .jl files have duplicates in them and to simply dedupe when reading back from disk (there's a short read-back sketch after the first download below).
import jsonlines
from tqdm.notebook import tqdm
import time
import random
def get_submissions(symbol, end_date):
    end_date = pd.to_datetime(end_date)  # ensure it's a datetime object, not a string
    end_epoch = int(end_date.timestamp())
    start_epoch = int((end_date - pd.offsets.Week(1)).timestamp())  # one-week window ending at end_date

    gen = api.search_submissions(q=f'${symbol}', after=start_epoch, before=end_epoch,
                                 subreddit=['wallstreetbets', 'stocks'], num_comments=">10",
                                 filter=['id', 'url', 'author', 'title', 'subreddit',
                                         'num_comments', 'score', 'selftext'])

    # append each submission (plus a human-readable timestamp) to a per-ticker jsonlines file
    path = os.path.join(DATA_ROOT, f'{symbol}.jl')
    with jsonlines.open(path, mode='a') as writer:
        for item in gen:
            item.d_['date_utc'] = convert_date(item.d_['created_utc'])
            writer.write(item.d_)
    return
get_submissions('GME','2021-07-19')
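Since the call above will have written (or appended to) a GME.jl file under DATA_ROOT, here's a minimal sketch of reading a saved file back into pandas and dropping duplicates. The load_submissions helper name is mine, and deduping on Reddit's id field is an assumption; any unique submission field would do.
# read a saved .jl file back into a DataFrame and drop duplicate submissions
def load_submissions(symbol):
    path = os.path.join(DATA_ROOT, f'{symbol}.jl')
    with jsonlines.open(path) as reader:  # default mode is read
        records = [rec for rec in reader]
    df = pd.DataFrame(records)
    # repeated runs append duplicates, so dedupe on Reddit's unique submission id
    return df.drop_duplicates(subset='id').reset_index(drop=True)

gme_df = load_submissions('GME')
gme_df.head()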
If we had a list of tickers that we wanted to collect across a longer date range, we could use some nested for loops like below to iterate through symbols and weeks. Running the below should take 15-20 minutes to complete, so feel free to narrow the scope of tickers or dates if needed.
import traceback
symbols = ['GME']#,'AMC','SPCE','TSLA']
for symbol in tqdm(symbols):
    print(symbol)
    for date in tqdm(pd.date_range('2021-01-01', '2021-10-31', freq='W')):
        try:
            get_submissions(symbol, date)
        except Exception:
            # log the error and keep going rather than aborting the whole download
            traceback.print_exc()
        time.sleep(5)  # pause between weekly requests to go easy on the API
Try it Out!¶
Enough reading, already! The above code is available on Colab at the link below. Feel free to try it out yourself.
You can modify the notebook however you'd like without risk of breaking it. I really hope that those interested will "fork" from my notebook (all you'll need is a Google Drive to save a copy of the file...) and extend it to answer your own questions through data.
Summary¶
In this first post, we've made it through the heavy lifting of downloading data from the API and storing it in a usable format on disk. In the next segment, we will do some basic analysis on how spikes in Reddit traffic may signal risk of increased volatility in a given stock.
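As a small teaser of that analysis, and assuming the load_submissions helper sketched earlier, a rough way to eyeball weekly submission volume for a ticker might look something like this:
# count GME submissions per week to spot spikes in Reddit activity
gme_df = load_submissions('GME')
gme_df['date_utc'] = pd.to_datetime(gme_df['date_utc'])
weekly_counts = gme_df.set_index('date_utc').resample('W')['id'].count()
print(weekly_counts.sort_values(ascending=False).head())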
One last thing...¶
If you've found this post useful or enlightening, please consider subscribing to the email list to be notified of future posts (email addresses will only be used for this purpose...). To subscribe, scroll to the top of this page and look at the right sidebar.
You can also follow me on Twitter (@data2alpha) and forward this post to a friend or colleague who may find this topic interesting.