This is an introduction to working with nflscrapR data in Python. This is inspired by this guide by Ben Baldwin.
Using Jupyter Notebooks, which come pre-installed with Anaconda, is typically the best way to work with data in Python. This guide assumes you are using the Anaconda distribution and therefore already have the required packages installed. If you are not using the Anaconda distribution, install numpy, pandas, and matplotlib.
Once Anaconda has been downloaded and installed, open the Anaconda Navigator. Click launch on the Jupyter Notebook section which will open in your browser.
There are a couple ways to get nflscrapR data. While you don't necessarily need R for historical data, it is necessary for getting data that has not been uploaded to github. My preferred process is to get data using R, clean it, then export it to a CSV for use in Python. I'll introduce that process here and talk about another process afterward.
First, download R and RStudio. Open up RStudio and enter the following commands into the "Console" box, one line at a time.
install.packages("devtools")
devtools::install_github(repo = "maksimhorowitz/nflscrapR")
This installs nflscrapR so you can get data directly from the package instead of on github. I've posted two R scripts here for collecting and cleaning data. One is for collecting data for the current season by getting the most recent data on github and adding the most recent week(s) to it. The other is for getting a whole season's worth of data and not adding to it. Read the comments of the script for what lines to change.
If you use this method and run the whole script, the data will already be cleaned. Changes include:
- Limit play types to dropbacks, runs, or no_play. This removes punts, kicks, kneels, and spikes.
- Add success field. If EPA of a play is greater than 0, the play is a success.
- Add 'rush' and 'pass' field as binary (0 or 1) variables.
- Add missing player names for who passed/rushed/received the ball.
- Change play types to match what the playcall was, even if there was a penalty. QB scrambles counted as passes.
- Update team abbreviations for uniformity over multiple years: JAC becomes JAX, STL becomes LA, and SD becomes LAC.
These changes are made clear in the comments of the script. Feel free to remove or comment out any part of the cleaning you do not want to happen. Simply add a # symbol at the start of a line to comment it out and it won't execute.
If you only want to use Python you can use the Jupyter Notebook included here. Note that you cannot get data directly from nflscrapR using this method, so the data is only as up to date as the data on Ron's GitHub.
The notebook uses the same cleaning process as mentioned in the list above (in the R section). I have not found any discrepancies (yet) between the two methods but they may be there. Please feel free to comment below if you find a discrepancy.
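To give a feel for what a few of the cleaning steps in the list above look like in pandas, here is a rough sketch on a made-up five-row dataframe. The column names (play_type, posteam, epa) follow the ones used in this guide, but the logic is my simplified assumption of the steps, not a copy of the R script or the notebook. In particular, the rush and pass flags here are based on play_type alone and ignore penalties and scrambles.

```python
import pandas as pd

# Toy play-by-play rows; column names follow the guide, values are made up.
plays = pd.DataFrame({
    'play_type': ['pass', 'run', 'punt', 'no_play', 'kickoff'],
    'posteam':   ['JAC',  'SD',  'STL',  'CHI',     'NE'],
    'epa':       [0.45,   -0.30, 0.10,   0.00,      0.20],
})

# Keep dropbacks, runs, and no_play; this drops punts and kicks.
# .copy() gives an independent dataframe so later edits raise no warnings.
cleaned = plays.loc[plays['play_type'].isin(['pass', 'run', 'no_play'])].copy()

# A play is a success when its EPA is greater than 0.
cleaned['success'] = (cleaned['epa'] > 0).astype(int)

# Binary rush/pass flags (simplified: based on play_type alone).
cleaned['rush'] = (cleaned['play_type'] == 'run').astype(int)
cleaned['pass'] = (cleaned['play_type'] == 'pass').astype(int)

# Make team abbreviations uniform across seasons.
cleaned['posteam'] = cleaned['posteam'].replace({'JAC': 'JAX', 'STL': 'LA', 'SD': 'LAC'})

print(cleaned)
```

Three rows survive the play type filter, and the old abbreviations on those rows are replaced with the current ones.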
Now that you have play-by-play data, open up a new Jupyter Notebook and make these imports.
import pandas as pd
import numpy as np
import os
To execute code in a cell, press Shift+Enter. This will execute the code and provide a new cell for you.
Pandas is now known as pd so we don't have to type out pandas every time; the same is true for numpy (np).
Pandas can be finicky about making changes to a dataframe and still using the same variable name. If you don't want to see warnings about this, enter the following or make sure to create a new variable/dataframe when slicing the data and making changes. The warnings will not affect changes but can be somewhat annoying. Additionally, I like to increase the number of rows and columns Pandas will show so things don't get collapsed.
pd.options.mode.chained_assignment = None
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 300)
Pandas has a read_csv() function to turn CSV files into a dataframe. Find the CSV file path and paste it in quotes where it says PATH below. Don't forget the file extension (.csv).
data = pd.read_csv(PATH)
Our play-by-play data is now in the dataframe named data (use whatever name you like). You may see a DtypeWarning due to different data types being used in the same column. Don't worry, the data is all there. Depending on the size of the dataframe, you may also specify low_memory=False inside read_csv() to avoid that warning.
Reading in multiple years is fairly straightforward. To add another year, use pd.concat(). (The dataframe .append() method that used to do this job was removed in pandas 2.0.)
data = pd.concat([data, pd.read_csv(NEW_DATA_PATH, low_memory=False)], sort=True)
Breaking that statement down, the dataframe named data is being combined with another CSV of play-by-play data. Notice the low_memory=False to avoid a warning. Also, sort=True sorts the columns alphabetically when the two files' columns don't already line up. One thing to be aware of is that indexes are no longer unique. There will be an index 0 for each year of data, an index 1 for each year, etc. This has the potential to cause problems later on, so it's a good idea to fix that now.
data.reset_index(drop=True,inplace=True)
This will reset the index, assigning a unique number to each row. drop=True will drop (remove) the current index and replace it with the new one. inplace=True makes it so the change stays.
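Here is a small sketch of the duplicate index problem and the fix, using two tiny made-up dataframes in place of two seasons of data. The frames are stacked with pd.concat() and the game_id values are only placeholders.

```python
import pandas as pd

# Two tiny stand-ins for two seasons of play-by-play data (values are made up).
season_a = pd.DataFrame({'game_id': [2018090600, 2018090600], 'epa': [0.5, -0.2]})
season_b = pd.DataFrame({'game_id': [2019090500, 2019090500], 'epa': [1.1, 0.3]})

# Stack the seasons; each frame keeps its own 0, 1, ... index labels.
combined = pd.concat([season_a, season_b], sort=True)
index_before = combined.index.tolist()
print(index_before)              # [0, 1, 0, 1]
print(combined.index.is_unique)  # False

# Give every row its own label again.
combined.reset_index(drop=True, inplace=True)
index_after = combined.index.tolist()
print(index_after)               # [0, 1, 2, 3]
```

If you prefer to do both steps at once, pd.concat() also accepts ignore_index=True, which numbers the rows from 0 as the frames are stacked.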
To understand what the dataframe looks like, enter the name of your dataframe ('data' in this case) into a cell and execute. The dataframe will be displayed for you. All 250+ columns can be scrolled through. Also notice on the left an unnamed column with a number. This is the index (or key if you are familiar with databases), and a play or range of plays can be specified using the .loc[] indexer.
In Python, counters start at 0 so the first row is actually index 0, not 1. To get just the first row enter:
data.loc[0]
To get a range of rows, use a colon (:) to specify the range. When using .loc[], Pandas follows the rule of [inclusive:inclusive], meaning both the first and last index are included in the range. For example, data.loc[0:5] would return the first six rows (index 0,1,2,3,4,5).
This is not the case for Python lists or arrays. Even the Pandas indexer .iloc[] uses the more traditional [inclusive:exclusive] syntax, meaning data.iloc[0:5] would return only five rows (index 0,1,2,3,4). The difference is that .loc[] looks rows up by index label while .iloc[] looks them up by position. Since the play-by-play data's index (key) is already numerical, using .loc makes sense. If the key were a word or not in numerical order, you could use .iloc[] to return a row by its position.
To clarify, if our index were a random number and the first row's index was 245, we would use data.loc[245] or data.iloc[0] to return the first row. Using just data.loc[0] would search for the row where the index is 0. Hope this makes sense.
Now that you know how to use .loc[] for getting specific rows, we can also use it to filter data.
There are many filters you can use and I'll list the most common ones below:
- Greater than > or greater than or equal to >=
- Less than < or less than or equal to <=
- Equal to ==
- String contains .str.contains('word')
Multiple filters can also be chained together using parentheses () around each filter and an & as an AND to make all conditions necessary or a | to indicate OR.
Each filter needs to state the name of the dataframe and the field/column to condition on. Fields can be specified using data.field_name or data['field_name']; they work the same way. Just note that if a field shares its name with a reserved word in Python (such as pass), you must use the data['field_name'] form. Here's an example that filters to run plays on first and second down.
data.loc[(data.down<3) & (data.rush==1)]
That will return a filtered dataframe with plays matching those criteria. You can save that smaller dataframe to a new variable like this: early_down_runs = data.loc[(data.down<3) & (data.rush==1)]
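The sketch below runs an AND filter, an OR filter, and a string filter on a few made-up rows so you can see what each one returns. I'm assuming a desc column holding the play description; the rows themselves are invented.

```python
import pandas as pd

# Made-up plays; 'desc' is assumed to hold the text description of each play.
plays = pd.DataFrame({
    'down': [1, 2, 3, 1, 4],
    'rush': [1, 0, 1, 1, 0],
    'epa':  [0.4, -0.1, 1.2, -0.6, 0.0],
    'desc': ['T.Gurley left end', 'J.Goff pass short right', 'T.Gurley up the middle',
             'M.Brown right tackle', 'J.Hekker punts'],
})

# AND: run plays on first or second down
early_down_runs = plays.loc[(plays['down'] < 3) & (plays['rush'] == 1)]

# OR: plays on third or fourth down
late_downs = plays.loc[(plays['down'] == 3) | (plays['down'] == 4)]

# String contains: plays whose description mentions Gurley
gurley_plays = plays.loc[plays['desc'].str.contains('Gurley')]

print(len(early_down_runs), len(late_downs), len(gurley_plays))  # 2 2 2
```

The parentheses around each condition matter. Without them Python evaluates & and | before the comparisons and the filter fails.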
Data can be grouped by any category such as player, team, game, or down, and can also have multiple groupings. Use the .groupby() method to do this. The columns to aggregate are then specified in double brackets and a function such as .mean(), .sum(), or .count() is applied. Here's an example of finding the average expected points added (EPA) per play by offense (posteam).
data.groupby('posteam')[['epa']].mean()
This will return the following dataframe (this is cropped to save space)
Note that the index is now the team names and no longer a number. If you would like to keep a numerical index, simply put , as_index=False next to 'posteam'.
Groupby can also group on multiple columns, for example when counting receiver targets. The naming convention in nflscrapR is FirstInitial.LastName, which becomes an issue when two players in the league have the same first initial and last name. For this reason it is best to add the team as a secondary groupby column to eliminate most conflicts.
data.groupby(['receiver_player_name','posteam'])[['play_id']].count()
When using multiple groupby conditions they need to be put in brackets and separated by commas. Again, the index is now players and teams, not a numerical index. This can make further filtering an issue so it can be fixed in two ways. First, you can change the groupby to this:
data.groupby(['receiver_player_name','posteam'], as_index=False)[['play_id']].count()
The as_index=False means do not make the name and team the index.
The other way is to reset the index after the data is grouped by calling reset_index(inplace=True) on the result.
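Here is a small sketch comparing the two approaches on made-up targets. The player names and team codes (AAA, BBB) are placeholders, chosen so that two different players share the name A.Smith.

```python
import pandas as pd

# Made-up targets; two different players named A.Smith on two teams.
plays = pd.DataFrame({
    'receiver_player_name': ['A.Smith', 'A.Smith', 'A.Smith', 'B.Jones'],
    'posteam': ['AAA', 'AAA', 'BBB', 'AAA'],
    'play_id': [1, 2, 3, 4],
})

group_cols = ['receiver_player_name', 'posteam']

# Default: the group columns become a two-level index.
indexed = plays.groupby(group_cols)[['play_id']].count()

# Option 1: keep the group columns as ordinary columns.
flat = plays.groupby(group_cols, as_index=False)[['play_id']].count()

# Option 2: group first, then move the index back into columns.
also_flat = indexed.reset_index()

print(flat)
```

Both flat and also_flat end up with a plain numerical index and one row per player and team, so the A.Smith on AAA (2 targets) stays separate from the A.Smith on BBB (1 target).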
To see a full list of columns included in nflscrapR head to the documentation. Below I'll list a few of the most helpful:
- posteam - the offensive team (possession team)
- defteam - the defensive team
- game_id - a unique id given to each NFL game
- epa - expected points added
- wp - current win probability of the posteam
- def_wp - current win probability of the defteam
- yardline_100 - number of yards from the opponent's endzone
There are a couple of ways to find specific games. The most straightforward is to find a game using the home and away team. Note that with multiple seasons of data the same matchup can appear more than once, so in that case skip to the next method. When working with a single season you can filter for that game with data.loc[(data.home_team=='ABR') & (data.away_team=='ABR')], with each 'ABR' being a team abbreviation.
Another way to find a specific game is to find it on NFL's website. For example, here is the link for the 2019 game between the Bears and Chargers: https://www.nfl.com/gamecenter/2019102702/2019/REG8/chargers@bears
The number after gamecenter/ is the game_id. To get that game use data.loc[data.game_id==2019102702]
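To show why the team filter gets ambiguous across seasons, here is a sketch with a made-up dataframe holding the same matchup twice. The 2019102702 id is the one from the link above; the 2018 id is a placeholder I invented for the example.

```python
import pandas as pd

# Made-up rows: the same home/away matchup appears in two seasons.
# 2018110400 is a placeholder id, not a real game.
plays = pd.DataFrame({
    'game_id':   [2018110400, 2018110400, 2019102702, 2019102702],
    'home_team': ['CHI', 'CHI', 'CHI', 'CHI'],
    'away_team': ['LAC', 'LAC', 'LAC', 'LAC'],
    'epa':       [0.2, -0.4, 0.7, 0.1],
})

# Filtering on teams returns plays from both games.
by_teams = plays.loc[(plays['home_team'] == 'CHI') & (plays['away_team'] == 'LAC')]
print(by_teams['game_id'].nunique())  # 2

# Filtering on game_id returns exactly one game.
by_id = plays.loc[plays['game_id'] == 2019102702]
print(by_id['game_id'].nunique())     # 1
```

Checking nunique() on game_id after a team filter is a quick way to confirm you really have a single game.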
Note: This section uses 2018 data. If you are using newer data, your results will be different.
The following is similar to part 1 in Ben's guide, but in Python. We'll look at how the Rams' running backs performed in 2018. We'll look at the EPA per rush for each Rams player:
rams_rbs = data.loc[(data.posteam=='LA') & (data.play_type=='run') & (data.down<=4)].groupby(by='rusher_player_name')[['epa', 'success','yards_gained']].mean()
First the team is filtered, then the play type, and lastly the down to eliminate 2-pt conversion attempts. Then the plays are grouped by the player that ran the ball and include the average EPA, success rate, and yards per attempt.
Now add in a new column for attempts. When adding a new column, specify the dataframe and the new column name in brackets.
rams_rbs['attempts'] = data.loc[(data['posteam']=='LA') & (data['play_type']=='run') & (data['down']<=4)].groupby(by='rusher_player_name')['play_id'].count()
Note that there's only one set of brackets around 'play_id' instead of two. Single brackets return a Series rather than a one-column dataframe, and a Series is what a new column assignment expects. Pandas lines the values up with rams_rbs by index, which here is the rusher name.
Now we'll condition on a minimum number of attempts and sort by the highest EPA per rush attempt.
rams_rbs = rams_rbs.loc[rams_rbs.attempts >= 40]
rams_rbs.sort_values('epa', ascending=False, inplace=True)
By default sorting happens in ascending order (smallest to largest), so ascending=False makes it descending order. Also by default, calling sort_values() returns the sorted dataframe for viewing but does not actually change the dataframe. To just view the dataframe, you can leave out inplace=True. If you want to sort the dataframe and keep it that way, include inplace=True.
To round numbers to a specific number of decimals:
rams_rbs = rams_rbs.round({'epa':3, 'success':2, 'yards_gained':1})
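As an alternative sketch, the averages and the attempt count can be built in a single groupby using named aggregation with .agg(). This is not how the steps above do it, just another route to the same table. The rows below are made up, and I skip the minimum attempts filter because the toy data is so small.

```python
import pandas as pd

# Made-up plays; real data would come from the play-by-play dataframe.
plays = pd.DataFrame({
    'posteam': ['LA', 'LA', 'LA', 'LA', 'LA', 'CHI'],
    'play_type': ['run', 'run', 'run', 'pass', 'run', 'run'],
    'down': [1, 2, 1, 3, 2, 1],
    'rusher_player_name': ['T.Gurley', 'T.Gurley', 'M.Brown', None, 'M.Brown', 'J.Howard'],
    'play_id': [10, 11, 12, 13, 14, 15],
    'epa': [0.5, -0.1, 0.2, 0.9, -0.4, 0.3],
    'success': [1, 0, 1, 1, 0, 1],
    'yards_gained': [7, 2, 4, 15, 0, 5],
})

# Same filter as in the guide: Rams, run plays, downs 1-4.
rams_runs = plays.loc[(plays['posteam'] == 'LA')
                      & (plays['play_type'] == 'run')
                      & (plays['down'] <= 4)]

# Each keyword is new_column=(source_column, function).
rams_rbs = (
    rams_runs.groupby('rusher_player_name')
    .agg(epa=('epa', 'mean'),
         success=('success', 'mean'),
         yards_gained=('yards_gained', 'mean'),
         attempts=('play_id', 'count'))
    .sort_values('epa', ascending=False)
    .round({'epa': 3, 'success': 2, 'yards_gained': 1})
)

print(rams_rbs)
```

Because the filter is written once and stored in rams_runs, there is no risk of the averages and the attempt count being computed on slightly different sets of plays.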
Python has many libraries to create graphs, including one built into Pandas.
data['epa'].plot.hist(bins=50)
This gives us a histogram of EPA for all plays in 2018. The bins=50 specifies how many buckets there are in the histogram. Feel free to change that number.
The built in library is a little barebones so this guide is going to use Matplotlib.
Import Matplotlib
import matplotlib.pyplot as plt
Like with Pandas and Numpy we'll shorten the full name to save time.
There are several ways to construct charts using Matplotlib and the simplest way will be shown first.
This figure will show separate histograms for the EPA on running plays and passing plays.
#Create figure and enter in a figsize
plt.figure(figsize=(10,6))
#Place a histogram on the figure with the EPA of all pass plays, assign a label, choose a color
plt.hist(data.epa.loc[data.play_type=='pass'], bins=50, label='Pass', color='slategrey')
#Place a second histogram this time for rush plays, the alpha < 1 will make this somewhat transparent
plt.hist(data.epa.loc[data.play_type=='run'], bins=50, label='Run', alpha=.7, color='dodgerblue')
#Add labels
plt.xlabel('Expected Points Added',fontsize=12)
plt.ylabel('Number of Plays',fontsize=12)
plt.title('EPA Distribution Based on Play Type',fontsize=14)
#plt.text places the note at x=6, y=50 in data coordinates (plt.figtext expects 0-1 figure fractions)
plt.text(6,50,'Data from nflscrapR', fontsize=10)
#Add a legend
plt.legend()
#Save the figure as a png
plt.savefig('epa_dist.png', dpi=400)
Figures can be saved in several formats including PDF, PNG, and JPG. Also note that the data was selected on the same line as inserting the histogram, this works with simple selections but more complex selections should be assigned to new variables and then used in making the graph.
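Here is a sketch of the "assign the selection to a variable first" approach, with one extra tweak that is my own suggestion rather than part of the chart above: both histograms share the same bin edges so their bars line up. The EPA values are randomly generated stand-ins, not real nflscrapR data, and the output file name is just an example.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins for the EPA of pass and run plays (not real data).
rng = np.random.default_rng(0)
pass_plays_epa = rng.normal(0.05, 1.5, size=1000)
run_plays_epa = rng.normal(-0.05, 1.0, size=800)

# One set of bin edges for both series so the bars line up.
bin_edges = np.linspace(-8, 8, 51)   # 51 edges make 50 bins

fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(pass_plays_epa, bins=bin_edges, label='Pass', color='slategrey')
ax.hist(run_plays_epa, bins=bin_edges, label='Run', alpha=.7, color='dodgerblue')
ax.set_xlabel('Expected Points Added', fontsize=12)
ax.set_ylabel('Number of Plays', fontsize=12)
ax.legend()

# Save the figure as a png
fig.savefig('epa_dist_shared_bins.png', dpi=100)
```

With real data you would replace the two synthetic arrays with selections such as data.epa.loc[data.play_type=='pass'].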
Before going any further, a few more imports are required for gathering and using team logos:
import os
import urllib.request
from matplotlib.offsetbox import OffsetImage, AnnotationBbox
Both os and urllib are part of Python's standard library, so neither needs to be installed separately. There is no need to run pip install or conda install for them.
With that out of the way we can now download the logos. These few lines will get each team's logo downloaded for you to use on charts.
Note where it says FOLDER in the last line. Make a new folder in your current working directory with whatever name you choose and replace FOLDER with the name of your new folder.
urls = pd.read_csv('https://raw.githubusercontent.com/statsbylopez/BlogPosts/master/nfl_teamlogos.csv')
for i in range(0,len(urls)):
    #os.path.join builds the file path so it works on Windows, macOS, and Linux
    urllib.request.urlretrieve(urls['url'].iloc[i], os.path.join(os.getcwd(), 'FOLDER', urls['team'].iloc[i] + '.png'))
IMPORTANT
The logo names are the full names of teams, not abbreviations like in nflscrapR data. When alphabetized, the logos and nflscrapR abbreviations do not match! I recommend changing the logo names to the abbreviations to avoid this error in charting.
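One way to do the renaming without clicking through 32 files is a small helper like the sketch below. The rename_logos function and the NAME_TO_ABBR mapping are hypothetical: I don't know the exact team names used in the logo CSV, so check your downloaded file names and fill in the mapping for all teams yourself. The demo runs on a temporary folder with empty placeholder files so nothing real is touched.

```python
import os
import tempfile

# Hypothetical mapping from logo file names to nflscrapR abbreviations.
# The real file names may differ; extend this to every team you downloaded.
NAME_TO_ABBR = {'Bears': 'CHI', 'Chargers': 'LAC', 'Rams': 'LA'}

def rename_logos(folder, mapping):
    """Rename '<team name>.png' to '<abbreviation>.png' for each name in mapping."""
    renamed = []
    # Take the listing once, before any file is renamed.
    for file_name in sorted(os.listdir(folder)):
        stem, ext = os.path.splitext(file_name)
        if stem in mapping:
            new_name = mapping[stem] + ext
            os.rename(os.path.join(folder, file_name), os.path.join(folder, new_name))
            renamed.append(new_name)
    return renamed

# Demo on a throwaway folder with empty placeholder files.
with tempfile.TemporaryDirectory() as demo_folder:
    for name in NAME_TO_ABBR:
        open(os.path.join(demo_folder, name + '.png'), 'w').close()
    result = sorted(rename_logos(demo_folder, NAME_TO_ABBR))
    remaining = sorted(os.listdir(demo_folder))

print(result)  # ['CHI.png', 'LA.png', 'LAC.png']
```

To use it on your own logos, call rename_logos() with the path to your logo folder and a complete mapping.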
Python doesn't easily allow for images to be used on charts as you'll see below, but luckily jezlax has us covered.
Create this function to be able to put images onto a chart. Feel free to change the zoom level as you see fit.
def getImage(path):
    return OffsetImage(plt.imread(path), zoom=.5)
Now we'll store the required information to use the logos. Replace FOLDER with the name of your folder with the logos in it.
#List the logo files alphabetically so their order is predictable
logos = sorted(os.listdir(os.path.join(os.getcwd(), 'FOLDER')))
logo_paths = []
for i in logos:
    logo_paths.append(os.path.join(os.getcwd(), 'FOLDER', str(i)))
Now we can finally do some more plotting. This will show you the second way to build charts in Matplotlib. It takes some getting used to but ultimately allows for more flexibility.
Start by getting the data to be plotted.
Notice the use of data['pass'] and data.rush. data['pass'] needs to be in brackets because pass is a reserved word in Python, so we're telling it to use the pass column in the dataframe. You could also use the play_type column; either way works.
#Make a new dataframe that contains average pass play EPA, grouped by team
pass_epa = data.loc[data['pass']==1].groupby(by='posteam')[['epa']].mean()
#Do the same for average rush EPA, grouped by team
rush_epa = data.loc[data.rush==1].groupby(by='posteam')[['epa']].mean()
Those are now two dataframes with a row for each team in both. Now let's plot the data. For ease of use, put the data into x and y variables that will contain the rush EPA and pass EPA data respectively.
x = rush_epa.epa
y = pass_epa.epa
#Create a figure with size 12x12
fig, ax = plt.subplots(figsize=(12,12))
#Make a scatter plot of rush EPA vs. pass EPA; the tiny marker size (s) hides the dots so only the logos show
ax.scatter(x, y, s=.001)
#Adding logos to the chart
for x0, y0, path in zip(x, y, logo_paths):
    ab = AnnotationBbox(getImage(path), (x0, y0), frameon=False, fontsize=4)
    ax.add_artist(ab)
#Add a grid
ax.grid(zorder=0,alpha=.4)
ax.set_axisbelow(True)
#Adding labels and text
ax.set_xlabel('EPA per Rush', fontsize=14)
ax.set_ylabel('EPA per Dropback', fontsize=14)
ax.set_title('Avg. EPA by Team & Play Type - 2018', fontsize=18)
plt.figtext(.78, .06, 'Data from nflscrapR', fontsize=10)
#Save the figure as a png
plt.savefig('team_epas.png', dpi=400)
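The chart relies on the rush EPA values, the pass EPA values, and the logo paths all being in the same team order. A sketch of a more defensive setup is below: it joins the two dataframes on the team index and builds each logo path from the team abbreviation, so every row carries its own logo. It assumes the logo files are named after the abbreviations (for example CHI.png) and the EPA numbers are made up.

```python
import os
import pandas as pd

# Made-up team averages; note the two frames list the teams in different orders.
rush_epa = pd.DataFrame({'epa': [0.02, -0.05, 0.01]},
                        index=pd.Index(['CHI', 'LA', 'LAC'], name='posteam'))
pass_epa = pd.DataFrame({'epa': [0.10, 0.25, 0.18]},
                        index=pd.Index(['LA', 'CHI', 'LAC'], name='posteam'))

# join() matches rows on the team index, not on row order.
team_epa = rush_epa.rename(columns={'epa': 'rush_epa'}).join(
    pass_epa.rename(columns={'epa': 'pass_epa'}))

# Assumes logos are saved as '<abbreviation>.png' inside FOLDER.
team_epa['logo_path'] = [os.path.join('FOLDER', team + '.png') for team in team_epa.index]

print(team_epa)
```

When plotting, you could then loop over zip(team_epa.rush_epa, team_epa.pass_epa, team_epa.logo_path) in place of the separate x, y, and logo_paths variables.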
I've created a dictionary object containing NFL team colors. Feel free to copy and paste this into your Jupyter Notebook.
COLORS = {'ARI':'#97233F','ATL':'#A71930','BAL':'#241773','BUF':'#00338D','CAR':'#0085CA','CHI':'#00143F',
'CIN':'#FB4F14','CLE':'#FB4F14','DAL':'#B0B7BC','DEN':'#002244','DET':'#046EB4','GB':'#24423C',
'HOU':'#C9243F','IND':'#003D79','JAX':'#136677','KC':'#CA2430','LA':'#002147','LAC':'#2072BA',
'MIA':'#0091A0','MIN':'#4F2E84','NE':'#0A2342','NO':'#A08A58','NYG':'#192E6C','NYJ':'#203731',
'OAK':'#C4C9CC','PHI':'#014A53','PIT':'#FFC20E','SEA':'#7AC142','SF':'#C9243F','TB':'#D40909',
'TEN':'#4095D1','WAS':'#FFC20F'}
To get a color, specify a team abbreviation and the color is returned. Typing in COLORS['ARI'] and executing the cell would return '#97233F', which can be used as a color in a plot.
Some other useful matplotlib functions or additions:
- ax.axhline(y=) - specify a y-axis value to create a horizontal line
- ax.axvline(x=) - specify an x-axis value to create a vertical line
- Adding annotations to scatter plot dots can be done using AdjustText
- A list of named matplotlib colors can be found here.
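To tie the team colors and the reference lines together, here is a short sketch that colors three scatter points from the COLORS dictionary and draws dashed lines at the averages. Only three teams are included to keep it short, the EPA values are made up, and the output file name is just an example.

```python
import matplotlib.pyplot as plt

# Three entries copied from the COLORS dictionary above.
COLORS = {'ARI': '#97233F', 'ATL': '#A71930', 'BAL': '#241773'}

# Made-up EPA values, for illustration only.
rush_epa = {'ARI': -0.12, 'ATL': 0.02, 'BAL': 0.04}
pass_epa = {'ARI': -0.10, 'ATL': 0.15, 'BAL': 0.07}

avg_rush = sum(rush_epa.values()) / len(rush_epa)
avg_pass = sum(pass_epa.values()) / len(pass_epa)

fig, ax = plt.subplots(figsize=(6, 6))

# One point per team, colored with that team's entry in COLORS.
for team, color in COLORS.items():
    ax.scatter(rush_epa[team], pass_epa[team], color=color, s=80, label=team)

# Dashed reference lines at the average of each axis.
hline = ax.axhline(y=avg_pass, color='grey', linestyle='--', linewidth=1)
vline = ax.axvline(x=avg_rush, color='grey', linestyle='--', linewidth=1)

ax.set_xlabel('EPA per Rush')
ax.set_ylabel('EPA per Dropback')
ax.legend()
fig.savefig('team_colors_example.png', dpi=100)
```

The two lines split the chart into four quadrants, which makes it easy to see which teams are above average at both rushing and passing.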
Thanks for reading, please leave feedback, and follow me on twitter @Deryck_SG.
I downloaded R and RStudio. I then input the command you listed above but I'm getting an error that reads:
Error in install.packages : Unrecognized response “devtools::install_github(repo = "maksimhorowitz/nflscrapR")”