@guga31bb
Last active August 18, 2023 07:45
Simple guide for using nflscrapR

THIS IS OUTDATED. PLEASE FOLLOW THE FOLLOWING LINK

--> A beginner's guide to nflfastR <--

Basic nflscrapR tutorial

I get a lot of questions about how to get nflscrapR up and running. This guide is intended to help new users build interesting tables or charts from the ground up, taking the raw nflscrapR data.

Quick word if you're new to programming: all of this is happening in R. Obviously, you need to install R on your computer to do any of this. Make sure you save what you're doing in a script (in R, File --> New script) so you can save your work and run multiple lines of code at once. To run code from a script, highlight what you want, right click, and select Run line. As you go through your R journey, you might get stuck and have to google a bunch of things, but that's totally okay and normal. That's how I wrote this thing!

A huge thank you to Josh Hornsby (@Josh_ADHD) and Zach Feldman (@ZachFeldman3) for sharing code snippets for me to work with, and of course to the nflscrapR team (Maksim Horowitz, Ron Yurko, and Sam Ventura) for providing this resource. An additional huge thanks to Keegan Abdoo (@KeeganAbdoo) for making the string search parts of this look much better.

Final disclaimer: this will work but I'm not great at R so there might be better ways to do certain things.

-- Ben Baldwin, @benbbaldwin

Install packages

First, you need to install the magic packages. You only need to run this step once on a given computer. We aren't going to bother with the actual nflscrapR package since it's a million times faster to just download the .csvs directly from Ron's github (thanks Ron!).

install.packages("tidyverse")
install.packages("dplyr")
install.packages("na.tools")
install.packages("ggimage")

Load packages

You'll need to do this every time you open an instance of R.

library(tidyverse)
library(dplyr)
library(na.tools)
library(ggimage)

Load the data

pbp <- read_csv(url("https://github.com/ryurko/nflscrapR-data/raw/master/play_by_play_data/regular_season/reg_pbp_2018.csv"))

This reads in the 2018 play-by-play data from github and saves it as "pbp". Different seasons could be read in by changing the year at the end in the above.
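As a small sketch of that last point, the season can be pulled out into an argument so the URL gets built rather than edited by hand. The helper name season_url is hypothetical (it is not part of nflscrapR); the URL pattern is just the one used above.

```r
# Hypothetical helper (not part of nflscrapR): build the CSV URL for one season.
season_url <- function(year) {
  paste0(
    "https://github.com/ryurko/nflscrapR-data/raw/master/",
    "play_by_play_data/regular_season/reg_pbp_", year, ".csv"
  )
}

season_url(2017)

# To actually download (needs the tidyverse loaded and an internet connection):
# pbp_2017 <- read_csv(url(season_url(2017)))
```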

Do some cleanup

To start, let's just look at the first few rows (the "head") of the data.

pbp %>% select(posteam, defteam, desc, play_type) %>% head

  posteam defteam                                                                                     desc play_type
1     ATL     PHI                             J.Elliott kicks 65 yards from PHI 35 to end zone, Touchback.   kickoff
2     ATL     PHI    (15:00) PENALTY on ATL-L.Paulsen, False Start, 5 yards, enforced at ATL 25 - No Play.   no_play
3     ATL     PHI (15:00) M.Ryan pass short right to J.Jones pushed ob at ATL 30 for 10 yards (M.Jenkins).      pass
4     ATL     PHI                   (14:22) J.Jones left end pushed ob at ATL 41 for 11 yards (D.Barnett).       run
5     ATL     PHI                          (13:46) D.Freeman right end to PHI 39 for 20 yards (M.Jenkins).       run
6     ATL     PHI               (13:10) M.Ryan pass incomplete short right to C.Ridley (J.Mills, J.Hicks).      pass

A couple things. The %>% thing lets you pipe together a bunch of different commands. So we're taking pbp, our data, "select"ing a few variables we want to look at ("desc" is the important variable that lists the description of what happened on the play), and then saying to show the first few rows. Since this is already sorted by time, these are the first 6 rows from the season opener, ATL @ PHI.
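If the pipe is new to you, here is a minimal sketch on a made-up three-row data frame (the rows are invented for illustration, not real play-by-play). It shows that piping is just another way of writing nested function calls.

```r
library(dplyr)

# Toy data frame standing in for pbp (made-up rows, for illustration only)
toy <- data.frame(
  posteam = c("ATL", "ATL", "PHI"),
  play_type = c("kickoff", "pass", "run"),
  yards_gained = c(0, 10, 4),
  stringsAsFactors = FALSE
)

# Nested calls: read from the inside out
nested <- head(select(toy, posteam, play_type), 2)

# Piped: read left to right; each result is passed on as the first argument
piped <- toy %>% select(posteam, play_type) %>% head(2)

identical(nested, piped)
```

Both versions do the same thing; the piped one just reads in the order the steps happen.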

There are a few important things we want to handle. First, the play-by-play includes a bunch of events that aren't actually plays (timeouts, quarters ending, and stuff like that). Second, we're going to fix some plays that are mis-classified (QB scrambles should count as pass plays; spikes and kneeldowns shouldn't count as real plays).

For this illustration, I'm going to focus on run plays and pass plays, so I'll throw out punts, kickoffs, field goals, and dead ball penalties (e.g. false starts) where we don't know what the attempted play was.

I'm also going to focus on 2018 only to keep things simple to start with. Adding other seasons is not super complicated and is covered later in this guide.

pbp_rp <- pbp %>% 
	filter(!is_na(epa), play_type=="no_play" | play_type=="pass" | play_type=="run")

We've saved a new dataset (the fancy name in R is data frame) that is "pbp" but removing plays that don't have values for epa, and that are either pass plays, run plays, or penalties ("no play"). Let's look at what the first few rows of this new data frame look like:

pbp_rp %>% select(posteam, desc, play_type) %>% head

  posteam                                                                                     desc play_type
1     ATL    (15:00) PENALTY on ATL-L.Paulsen, False Start, 5 yards, enforced at ATL 25 - No Play.   no_play
2     ATL (15:00) M.Ryan pass short right to J.Jones pushed ob at ATL 30 for 10 yards (M.Jenkins).      pass
3     ATL                   (14:22) J.Jones left end pushed ob at ATL 41 for 11 yards (D.Barnett).       run
4     ATL                          (13:46) D.Freeman right end to PHI 39 for 20 yards (M.Jenkins).       run
5     ATL               (13:10) M.Ryan pass incomplete short right to C.Ridley (J.Mills, J.Hicks).      pass
6     ATL                        (13:05) (Shotgun) M.Ryan pass incomplete short left to D.Freeman.      pass

Compared to the first time we did this, the kickoff is now gone.
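To see what that filter step does in isolation, here is a sketch on a toy data frame with invented values. It uses base R's is.na() in place of is_na() from na.tools, and %in% as a shorter way of writing the chain of == comparisons; both substitutions are my assumptions about what reads most easily, not what the code above uses.

```r
library(dplyr)

# Made-up plays: one kickoff, one penalty, and one run with a missing epa
toy <- data.frame(
  play_type = c("kickoff", "no_play", "pass", "run", "pass"),
  epa = c(0.1, -0.4, 0.8, NA, 0.2),
  stringsAsFactors = FALSE
)

# Conditions separated by commas are combined with AND; | means OR.
kept <- toy %>%
  filter(!is.na(epa), play_type %in% c("no_play", "pass", "run"))

kept
```

The kickoff is dropped because of its play type, and the run is dropped because its epa is missing.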

Now let's take a look at some of the "no play" plays.

pbp_rp %>% filter(play_type=="no_play") %>% select(rush_attempt, pass_attempt, desc) %>% head

rush_attempt pass_attempt desc
0            0 (15:00) PENALTY on ATL-L.Paulsen, False Start, 5 yards, enforced at ATL 25 - No Play.
0            0 (5:32) (Shotgun) M.Ryan sacked at PHI 11 for -4 yards (C.Long). PENALTY on PHI-D.Barnett, Defensive Offside, 4 yards, enforced at PHI 7 - No Play.
0            0 (4:13) (Shotgun) C.Clement up the middle to PHI 32 for no gain (D.Jones). PENALTY on ATL-T.McKinley, Defensive Offside, 5 yards, enforced at PHI 32 - No Play.
0            0 (1:43) (Shotgun) PENALTY on ATL-T.McKinley, Neutral Zone Infraction, 5 yards, enforced at PHI 36 - No Play.
0            0 (8:23) (Shotgun) N.Foles pass incomplete short right to M.Wallace (D.Trufant). ATL-K.Neal was injured during the play.  PENALTY on ATL-D.Trufant, Defensive Pass Interference, 8 yards, enforced at ATL 17 - No Play.
0            0 (8:17) (Shotgun) D.Sproles left tackle to ATL 4 for 5 yards (D.Kazee). PENALTY on PHI-J.Kelce, Offensive Holding, 10 yards, enforced at ATL 9 - No Play.

The false start to open the game is still there, but the rest are attempted rush or pass plays that (in my opinion) should count when we're computing EPA or something like that. But importantly, rush_attempt and pass_attempt are 0 for all these plays. We need to fix that.

There might be a better way to do this, but I search the "desc" variable to look for indicators of passing or rushing:

pbp_rp <- pbp_rp %>%
	mutate(
	pass = if_else(str_detect(desc, "( pass)|(sacked)|(scramble)"), 1, 0),
	rush = if_else(str_detect(desc, "(left end)|(left tackle)|(left guard)|(up the middle)|(right guard)|(right tackle)|(right end)") & pass == 0, 1, 0),
	success = ifelse(epa>0, 1 , 0)
	) 

Mutate is R's weird word for creating new variables. I've created "pass", which searches the "desc" variable for plays with "pass", "sacked", or "scramble", along with a variable for rush and for a successful play (using the simple definition for success of positive EPA). Let's take a look:

pbp_rp %>% filter(play_type=="no_play") %>% select(pass, rush, desc)  %>% head

  pass  rush desc
1     0     0 (15:00) PENALTY on ATL-L.Paulsen, False Start, 5 yards, enforced at ATL 25 - No Play.
2     1     0 (5:32) (Shotgun) M.Ryan sacked at PHI 11 for -4 yards (C.Long). PENALTY on PHI-D.Barnett, Defensive Offside, 4 yards, enforced at PHI 7 - No Play.                      
3     0     1 (4:13) (Shotgun) C.Clement up the middle to PHI 32 for no gain (D.Jones). PENALTY on ATL-T.McKinley, Defensive Offside, 5 yards, enforced at PHI 32 - No Play.          
4     0     0 (1:43) (Shotgun) PENALTY on ATL-T.McKinley, Neutral Zone Infraction, 5 yards, enforced at PHI 36 - No Play.                                                             
5     1     0 (8:23) (Shotgun) N.Foles pass incomplete short right to M.Wallace (D.Trufant). ATL-K.Neal was injured during the play.  PENALTY on ATL-D.Trufant, Defensive Pass Interf~
6     0     1 (8:17) (Shotgun) D.Sproles left tackle to ATL 4 for 5 yards (D.Kazee). PENALTY on PHI-J.Kelce, Offensive Holding, 10 yards, enforced at ATL 9 - No Play.              

Now we can finally have variables for "pass" and "rush" that include plays with penalties!

Hooray! Let's now save, keeping only the run or pass plays.

pbp_rp <- pbp_rp %>% filter(pass==1 | rush==1)
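One detail worth spelling out is the "& pass == 0" part of the rush definition: a scramble description can also contain a run direction like "right end", and that condition keeps it from being counted twice. A minimal sketch with invented descriptions (the third one is made up to show the scramble case, and the run-direction list is shortened to two entries):

```r
library(dplyr)
library(stringr)

toy <- data.frame(
  desc = c(
    "(15:00) M.Ryan pass short right to J.Jones for 10 yards.",
    "(14:22) J.Jones left end for 11 yards.",
    "(9:10) R.Wilson scrambles right end for 6 yards.",
    "(15:00) PENALTY on ATL-L.Paulsen, False Start, 5 yards - No Play."
  ),
  stringsAsFactors = FALSE
)

flagged <- toy %>%
  mutate(
    # a dropback: pass attempt, sack, or scramble
    pass = if_else(str_detect(desc, "( pass)|(sacked)|(scramble)"), 1, 0),
    # a run direction only counts as a rush if the play is not already a dropback
    rush = if_else(str_detect(desc, "(left end)|(right end)") & pass == 0, 1, 0)
  )

flagged %>% select(pass, rush)
```

The scramble ends up with pass = 1 and rush = 0, and the false start gets 0 for both, so the final filter drops it.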

Some basic stuff: Part 1

Okay, we have a big dataset where we call dropbacks pass plays and non-dropbacks rush plays. Now we actually want to, like, do stuff.

Let's take a look at how various Rams' running backs fared on run plays in 2018:

pbp_rp %>%
	filter(posteam == "LA", rush == 1, down<=4) %>%
	group_by(rusher_player_name) %>%
	summarize(mean_epa = mean(epa), success_rate = mean(success), ypc=mean(yards_gained), plays=n()) %>%
	arrange(desc(mean_epa)) %>%
	filter(plays>40)
 
  rusher_player_name mean_epa success_rate   ypc plays
1 C.Anderson           0.333         0.628  6.95    43
2 T.Gurley             0.155         0.473  4.89   256
3 M.Brown             -0.0486        0.465  4.93    43

There's a lot going on here. The "group_by" function, well, groups by what you tell it -- in this case player name. Summarize is useful for getting summaries of what you're looking at, and here, while grouping by player, we're summarizing the mean of EPA, success, yardage (a bad rushing stat, but since we're here), and getting the number of plays.
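The same group_by and summarize pattern on a tiny invented data frame, in case it helps to see it with numbers small enough to check by hand (rushers "A" and "B" and all values are made up):

```r
library(dplyr)

toy <- data.frame(
  rusher = c("A", "A", "B", "B", "B"),
  epa = c(0.5, -0.1, 0.2, 0.4, -0.3),
  yards_gained = c(6, 2, 3, 8, 1),
  stringsAsFactors = FALSE
)

by_rusher <- toy %>%
  group_by(rusher) %>%                 # one group per rusher
  summarize(
    mean_epa = mean(epa),              # average EPA per carry
    success_rate = mean(epa > 0),      # share of carries with positive EPA
    ypc = mean(yards_gained),          # yards per carry
    plays = n()                        # number of rows in the group
  ) %>%
  arrange(desc(mean_epa))              # best mean EPA first

by_rusher
```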

Looking at the results, C.J. Anderson is truly a generational talent.

Some notes on this. The "plays" and "ypc" columns exactly match the official stats you'd see on PFR. There are two reasons for this. First, by specifying down<=4, I've excluded carries on two-point conversions (those have a missing down, NA, so the comparison isn't TRUE and filter drops them), which are present in the data but don't count in the official stats. Second, all of the work we did above to define rush plays and pass plays doesn't fix that rusher names are missing on the plays with penalties, so those plays aren't credited to any player here. So why did we bother? It'll be useful for the rest of this guide when we're doing some team-level stuff.
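The reason down<=4 quietly removes the two-point tries is that comparing a missing value to anything gives a missing result, and filter() only keeps rows where the condition is TRUE. A quick sketch with made-up rows:

```r
library(dplyr)

# Third row stands in for a two-point try, where down is missing
toy <- data.frame(down = c(1, 3, NA), yards_gained = c(4, 2, 2))

na_test <- NA <= 4     # a comparison with a missing value is itself missing
na_test

kept <- toy %>% filter(down <= 4)   # rows where the test is NA are dropped
nrow(kept)
```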

Some basic stuff: Part 2

Let's look at which teams were the most pass-heavy in the first half on early downs with win probability between 20% and 80%, excluding the final 2 minutes of the half when everyone is pass-happy:

schotty <- pbp_rp %>%
	filter(wp>.20 & wp<.80 & down<=2 & qtr<=2 & half_seconds_remaining>120) %>%
	group_by(posteam) %>%
	summarize(mean_pass=mean(pass), plays=n()) %>%
	arrange(mean_pass)
schotty

> schotty
   posteam mean_pass plays
   <chr>       <dbl> <int>
 1 SEA         0.369   320
 2 JAX         0.435   276
 3 TEN         0.441   263
 4 BUF         0.452   219
 5 BAL         0.458   299
 6 ARI         0.466   236
 7 NYJ         0.473   256
 8 DET         0.482   299
 9 WAS         0.485   239
10 CAR         0.491   281
# ... with 22 more rows

The Seahawks were playing a different sport in 2018. Fun! Let's see what that looks like:

ggplot(schotty, aes(x=reorder(posteam,-mean_pass), y=mean_pass)) +
	    geom_text(aes(label=posteam))

ggsave('FILENAME.png', dpi=1000)

[Figure: Schotty]

This image is kind of a mess -- we still need a title, axis labels, etc -- but gets the point across. We'll get to that other stuff later. But more importantly, we made something interesting using nflscrapR data! And we used the "pass" variable that we created above to do it, so we're counting plays with penalties. In the above, be sure to change FILENAME to where you want to save.

The "reorder" sorts the teams according to pass rate, with the "-" saying to do it in descending order. "aes" is short for "aesthetic", which is ggplot2's weird way of asking which variables should go on the x and y axes.
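Here is what reorder does on its own, outside of a plot, using three teams with invented pass rates. It returns a factor whose levels are sorted by the second argument, smallest first, which is why negating the pass rate flips the order.

```r
teams <- factor(c("SEA", "KC", "GB"))
pass_rate <- c(0.37, 0.62, 0.55)     # made-up pass rates, for illustration

# Levels are sorted by -pass_rate from smallest to largest,
# so the most pass-heavy team comes first.
ordered_teams <- reorder(teams, -pass_rate)
levels(ordered_teams)
```

ggplot lays out a factor on the x axis in level order, so the first level ends up on the left.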

Some basic stuff: Part 3

Let's make one of those dropback success rate vs EPA plots that have been floating around.

chart_data <- pbp_rp %>%
	filter(pass==1) %>%
	group_by(posteam) %>%
	summarise(
	num_db = n(),
	epa_per_db = sum(epa) / num_db,
	success_rate = sum(epa > 0) / num_db
	)

nfl_logos_df <- read_csv("https://raw.githubusercontent.com/statsbylopez/BlogPosts/master/nfl_teamlogos.csv")
chart <- chart_data %>% left_join(nfl_logos_df, by = c("posteam" = "team_code"))

First, we're taking all plays with pass = 1 and grouping by team to get EPA per dropback and success rate. To get fancy, we're going to take the team logos provided by President Lopez and join them to the chart data we created. (What does left_join do? It keeps every row of chart_data and attaches the columns of the matching row from the logo table. I copied this from somewhere else and it works.)
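A toy version of that join, with invented team stats and placeholder file names instead of real logo URLs. The "XYZ" team is made up to show what happens when there is no match.

```r
library(dplyr)

stats <- data.frame(
  posteam = c("KC", "NO", "XYZ"),
  epa_per_db = c(0.35, 0.30, 0.00),
  stringsAsFactors = FALSE
)
logos <- data.frame(
  team_code = c("KC", "NO", "SEA"),
  url = c("kc.png", "no.png", "sea.png"),   # placeholder names, not real URLs
  stringsAsFactors = FALSE
)

# Keep every row of stats; match posteam in stats to team_code in logos
joined <- stats %>% left_join(logos, by = c("posteam" = "team_code"))
joined
```

Every row of the left table survives: "XYZ" has no logo, so its url is NA, and "SEA" is dropped because it only appears in the right table.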

chart %>%
ggplot(aes(x = success_rate, y = epa_per_db)) +
	geom_image(aes(image = url), size = 0.05) +
	labs(x = "Success rate",
	y = "EPA per play",
	caption = "Data from nflscrapR",
	title = "Dropback success rate & EPA/play",
	subtitle = "2018") +
	theme_bw() +
	theme(axis.title = element_text(size = 12),
	axis.text = element_text(size = 10),
	plot.title = element_text(size = 16),
	plot.subtitle = element_text(size = 14),
        plot.caption = element_text(size = 12))

ggsave('FILENAME.png', dpi=1000)

[Figure: Passing EPA/play & success rate]

Boom! We created a fancy graph. We can see here a bunch of storylines that dominated 2018, including the Chiefs' and Saints' aerial dominance, Aaron Rodgers being about as efficient as Andy Dalton and Matt Stafford, and the struggles of the rookie Joshes.

Some basic stuff: Part 4

Let's look at rushing and passing EPA/play at the team level on 1st and 2nd down. Are there any teams that were better at rushing than passing?

chart_data <- pbp_rp %>%
	group_by(posteam) %>%
	filter(down<=2) %>%
	summarise(
	n_dropbacks = sum(pass),
	n_rush = sum(rush),
	epa_per_db = sum(epa * pass) / n_dropbacks,
	epa_per_rush = sum(epa * rush) / n_rush,
	success_per_db = sum(success * pass) / n_dropbacks,
	success_per_rush = sum(success * rush) / n_rush
	)

chart <- chart_data %>% left_join(nfl_logos_df, by = c("posteam" = "team_code"))

chart %>%
ggplot(aes(x = epa_per_rush, y = epa_per_db)) +
	geom_image(aes(image = url), size = 0.05) +
	labs(x = "Rush EPA/play",
	y = "Pass EPA/play",
	caption = "Data from nflscrapR",
	title = "Early-down rush and pass EPA/play",
	subtitle = "2018") +
	theme_bw() +
	geom_abline(slope=1, intercept=0, alpha=.2) +
	theme(axis.title = element_text(size = 12),
        axis.text = element_text(size = 10),
        plot.title = element_text(size = 16),
        plot.subtitle = element_text(size = 14),
        plot.caption = element_text(size = 12))

ggsave('FILENAME.png', dpi=1000)

[Figure: Early-down rush and pass EPA/play, 2018]

I've added a line with slope 1. Teams above that diagonal line are more efficient passing than rushing on early downs, as measured by EPA per play. In other words, every single team in the NFL in 2018, with the possible exception of the Panthers.

Is this driven by explosive plays? Let's look at the same graph but with success rate instead.

chart %>%
	ggplot(aes(x = success_per_rush, y = success_per_db)) +
	geom_image(aes(image = url), size = 0.05) +
	labs(x = "Rush success rate",
        y = "Pass success rate",
        caption = "Data from nflscrapR",
        title = "Early-down rush and pass success rate",
	subtitle = "2018") +
	theme_bw() +
	geom_abline(slope=1, intercept=0, alpha=.2) +
	theme(axis.title = element_text(size = 12),
        axis.text = element_text(size = 10),
        plot.title = element_text(size = 16),
        plot.subtitle = element_text(size = 14),
        plot.caption = element_text(size = 12))

ggsave('FILENAME.png', dpi=1000)

[Figure: Early-down rush and pass success rate, 2018]

Wow! As measured by having positive-value plays, early-down pass plays are much better at helping teams stay on schedule for every single team in the league.

Advanced: Fix player names on plays with penalties

Let's check player names on plays with penalties:

pbp_rp %>% filter(play_type=="no_play") %>%
	select(desc, passer_player_name, rusher_player_name, receiver_player_name) %>% head()

  desc                                                                                                                                                           passer_player_name rusher_player_na~ receiver_player_n~
  <chr>                                                                                                                                                          <chr>              <chr>             <chr>             
1 (5:32) (Shotgun) M.Ryan sacked at PHI 11 for -4 yards (C.Long). PENALTY on PHI-D.Barnett, Defensive Offside, 4 yards, enforced at PHI 7 - No Play.             <NA>               <NA>              <NA>              
2 (4:13) (Shotgun) C.Clement up the middle to PHI 32 for no gain (D.Jones). PENALTY on ATL-T.McKinley, Defensive Offside, 5 yards, enforced at PHI 32 - No Play. <NA>               <NA>              <NA>              
3 (8:23) (Shotgun) N.Foles pass incomplete short right to M.Wallace (D.Trufant). ATL-K.Neal was injured during the play.  PENALTY on ATL-D.Trufant, Defensive P~ <NA>               <NA>              <NA>              
4 (8:17) (Shotgun) D.Sproles left tackle to ATL 4 for 5 yards (D.Kazee). PENALTY on PHI-J.Kelce, Offensive Holding, 10 yards, enforced at ATL 9 - No Play.       <NA>               <NA>              <NA>              
5 (1:50) (Shotgun) N.Foles pass short right to N.Agholor ran ob at PHI 31 for 7 yards. PENALTY on ATL-D.Trufant, Defensive Holding, 5 yards, enforced at PHI 24~ <NA>               <NA>              <NA>              
6 (:33) N.Foles FUMBLES (Aborted) at PHI 7, and recovers at PHI 7. N.Foles pass incomplete short right to N.Agholor. PENALTY on ATL-B.Reed, Defensive Offside, ~ <NA>               <NA>              <NA>

All those NAs are where the passer, rusher, and receiver names should be. Let's clean those up. A gigantic thanks to Keegan Abdoo (@KeeganAbdoo) for providing the code to do this.

pbp_players <- pbp_rp %>% 
    mutate(
	passer_player_name = ifelse(play_type == "no_play" & pass == 1, 
                str_extract(desc, "(?<=\\s)[A-Z][a-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?(?=\\s((pass)|(sack)|(scramble)))"),
                passer_player_name),
        receiver_player_name = ifelse(play_type == "no_play" & str_detect(desc, "pass"), 
        	str_extract(desc, 
                "(?<=to\\s)[A-Z][a-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?"),
                receiver_player_name),
        rusher_player_name = ifelse(play_type == "no_play" & rush == 1, 
        	str_extract(desc, "(?<=\\s)[A-Z][a-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?(?=\\s((left end)|(left tackle)|(left guard)|(up the middle)|(right guard)|(right tackle)|(right end)))"),
        	rusher_player_name)
	)

By now you know how to check the head() of this to verify that it works. This should not be considered part of a basic R tutorial. If you just skim this part and copy & paste the code, that's totally fine. You might not ever need to perform a string search this complicated for the rest of your life.

With that being said, the notes on what happened above, again with thanks to Keegan:

* (?<=\\s) is a lookbehind: the name must be preceded by a space
* [A-Z] matches a single capital letter (the start of the first name)
* [a-z]* matches zero or more lowercase letters (accounts for players like Damien Williams - "Dam. Williams")
* \\. matches the period after a player's first initial (or abbreviated first name)
* \\s? allows an optional space between the abbreviated first name and the last name (e.g. "Dam. Williams")
* [A-Z][A-z]+ matches a capital letter followed by one or more letters of either case (to account for names like McCown)
* (\\s(I{2,3})|(IV))? accounts for a possible suffix such as II, III, or IV (? = zero or one)
* (?=\\s((pass)|(sack)|(scramble))) is a lookahead: the name must be followed by a space and then pass, sack, or scramble (to account for plays that start out with an eligible OL or other irregularities)
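To try the patterns without the full data set, here is a sketch that runs the passer and receiver patterns from above on two of the play descriptions shown earlier. The variable names are mine; the patterns and descriptions are copied from this guide.

```r
library(stringr)

# Passer and receiver patterns, as used in the mutate() above
passer_pattern <- "(?<=\\s)[A-Z][a-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?(?=\\s((pass)|(sack)|(scramble)))"
receiver_pattern <- "(?<=to\\s)[A-Z][a-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?"

sack_desc <- "(5:32) (Shotgun) M.Ryan sacked at PHI 11 for -4 yards (C.Long)."
throw_desc <- "(8:23) (Shotgun) N.Foles pass incomplete short right to M.Wallace (D.Trufant)."

sacked_qb <- str_extract(sack_desc, passer_pattern)     # name right before " sacked"
thrower   <- str_extract(throw_desc, passer_pattern)    # name right before " pass"
target    <- str_extract(throw_desc, receiver_pattern)  # name right after "to "

c(sacked_qb, thrower, target)
```

Note that the defenders in parentheses are not picked up, because they are preceded by "(" rather than a space.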

Now let's do something!

qbs <- pbp_players %>% 
	mutate(
		name = ifelse(!is_na(passer_player_name), passer_player_name, rusher_player_name),
		rusher = rusher_player_name,
		receiver = receiver_player_name,
		play = 1
	) %>%
	group_by(name, posteam) %>%
	summarize (
		n_dropbacks = sum(pass),
		n_rush = sum(rush),
		n_plays = sum(play),
		epa_per_play = sum(epa)/n_plays,
		success_per_play =sum(success)/n_plays
		) %>%
	filter(n_dropbacks>=100)

We're going to make a plot for QBs, so to simplify things, I've created a "name" variable that's the passer name if it's a pass, and the rusher name if it's a rush. To make sure we're only getting QBs, I filtered to have at least 100 dropbacks. But EPA and success rate are calculated using all plays: both passes and rushes.
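The "name" logic on its own, using three invented rows. As an aside, dplyr's coalesce() does the same "first non-missing value" job; that alternative is my suggestion and not what the code above uses.

```r
library(dplyr)

# Made-up rows: one dropback, then two runs
toy <- data.frame(
  passer_player_name = c("R.Wilson", NA, NA),
  rusher_player_name = c(NA, "R.Wilson", "C.Carson"),
  stringsAsFactors = FALSE
)

named <- toy %>%
  mutate(
    # passer name if there is one, otherwise the rusher name
    name = ifelse(!is.na(passer_player_name), passer_player_name, rusher_player_name),
    # same result: take the first non-missing value of the two columns
    name_alt = coalesce(passer_player_name, rusher_player_name)
  )

named$name
```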

Now let's plot the results. Make sure you have ggrepel installed, which is used below to make it so the player name labels don't overlap.

library(ggrepel)

qbs %>%
  ggplot(aes(x = success_per_play, y = epa_per_play)) +
  geom_hline(yintercept = mean(qbs$epa_per_play), color = "red", linetype = "dashed") +
  geom_vline(xintercept =  mean(qbs$success_per_play), color = "red", linetype = "dashed") +
  geom_point(color = ifelse(qbs$posteam == "SF", "red", "black"), cex=qbs$n_plays/60, alpha=1/4) +
  geom_text_repel(aes(label=name),
      force=1, point.padding=0,
      segment.size=0.1) +
  labs(x = "Success rate",
       y = "EPA per play",
       caption = "Data from nflscrapR",
       title = "QB success rate and EPA/play",
       subtitle = "2018, min 100 dropbacks, includes all QB's rush and pass plays") +
  theme_bw() +
  theme(axis.title = element_text(size = 12),
        axis.text = element_text(size = 10),
        plot.title = element_text(size = 16, hjust = 0.5),
        plot.subtitle = element_text(size = 14, hjust = 0.5),
        plot.caption = element_text(size = 12))

ggsave('FILENAME.png', dpi=1000)

[Figure: QB success rate and EPA/play, 2018]

I've highlighted the 49ers' quarterbacks to show how similar the efficiency of Jimmy G and Mullens was in 2018, as well as the failed Beathard experiment. The "cex" part makes dot size proportional to plays, and alpha makes the dots semi-transparent.

Pulling together multiple seasons

To this point we've focused on 2018, but NFL history did not begin in 2018 (it began in 2012 when Russell Wilson was drafted). The following grabs the years of available data and also adds on the results from each game (e.g., home team, away team, week number of the season, and how many points were scored by each team), which are stored in a different file.

first <- 2009 #first season to grab. min available=2009
last <- 2018 # most recent season

datalist <- list()
for (yr in first:last) {
    pbp <- read_csv(url(paste0("https://github.com/ryurko/nflscrapR-data/raw/master/play_by_play_data/regular_season/reg_pbp_", yr, ".csv")))
    games <- read_csv(url(paste0("https://raw.githubusercontent.com/ryurko/nflscrapR-data/master/games_data/regular_season/reg_games_", yr, ".csv")))
    pbp <- pbp %>% inner_join(games %>% distinct(game_id, week, season)) %>% select(-fumble_recovery_2_yards)
    datalist[[as.character(yr)]] <- pbp # add it to your list, keyed by season
}

pbp_all <- dplyr::bind_rows(datalist)
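If bind_rows stops with a "can't be converted" error on some column, the likely cause (as worked out in the comments below for blocked_player_id) is that the column was read as logical in seasons where it only holds NA and FALSE, and as character in the others. A hedged sketch of one workaround on made-up data: coerce the column to the same type in every season before binding. The two data frames and the ID value here are invented.

```r
library(dplyr)

# Made-up stand-ins for two seasons of data
season_a <- data.frame(play_id = 1:2,
                       blocked_player_id = c(NA, FALSE))          # read as logical
season_b <- data.frame(play_id = 3:4,
                       blocked_player_id = c(NA, "00-0012345"),   # read as character
                       stringsAsFactors = FALSE)

# Binding these two directly can fail because the column types differ,
# so convert the column to character in every season first.
seasons <- list(season_a, season_b)
seasons <- lapply(seasons, function(df) {
  mutate(df, blocked_player_id = as.character(blocked_player_id))
})

combined <- bind_rows(seasons)
str(combined)
```

Simply dropping the column with select(), as done in the loop for fumble_recovery_2_yards, is the lazier option and works just as well if you never use it.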

Note that above I'm dropping the "fumble_recovery_2_yards" variable since it was causing problems with stitching the seasons together and it was easier to remove it than figure out the problem. Since we're working with multiple seasons now, we want to make sure team names are coded consistently over time. Let's look for the problem teams:

pbp_all %>% group_by(home_team) %>%summarize(n=n(), seasons=n_distinct(season), minyr=min(season), maxyr=max(season)) %>% 
	arrange(seasons)

   home_team     n seasons minyr maxyr
   <chr>     <int>   <int> <dbl> <dbl>
 1 LAC        2695       2  2017  2018
 2 JAX        4084       3  2016  2018
 3 LA         4241       3  2016  2018
 4 STL        9842       7  2009  2015
 5 JAC       10035       8  2009  2016
 6 SD        11007       8  2009  2016
 7 ARI       14251      10  2009  2018
 8 ATL       13871      10  2009  2018
 9 BAL       14342      10  2009  2018

So we can see that we need to deal with the Chargers, Jags, and Rams, since these are the cases with fewer than the 10 full seasons (for example, the Rams changed from "STL" to "LA"). One way to do that:

pbp_all <- pbp_all %>% 
	mutate_at(vars(home_team, away_team, posteam, defteam), ~ case_when(
            . %in% "JAX" ~ "JAC",
            . %in% "STL" ~ "LA",
            . %in% "SD" ~ "LAC",
            TRUE ~ .
        ))

Re-running the command above verifies that this worked: every team now has 10 seasons. What we did: fix all the home team, away team, possession team, and defensive team names so they're consistent over time.
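To see the recoding logic by itself, here is case_when applied to a plain vector of team codes (the vector is made up for illustration). Conditions are checked top to bottom, and the final TRUE line is the catch-all that leaves every other team unchanged.

```r
library(dplyr)

old_codes <- c("STL", "LA", "SD", "JAX", "SEA")

new_codes <- case_when(
  old_codes == "JAX" ~ "JAC",
  old_codes == "STL" ~ "LA",
  old_codes == "SD"  ~ "LAC",
  TRUE ~ old_codes          # everything else is left as it was
)

new_codes
```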

At this point, we have a lot of data sitting in memory and we don't want to have to re-download it every time we open R. So let's save the data and then load it again, so we know how.

saveRDS(pbp_all, file="FILENAME.rds")
pbp_all <- readRDS("FILENAME.rds")
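A quick way to convince yourself that nothing is lost on the way to disk and back is a round trip on a small made-up data frame. The temporary file path here is only for the demonstration; in practice you'd use your own FILENAME as above.

```r
toy <- data.frame(posteam = c("SEA", "KC"), epa = c(0.1, 0.3))

path <- tempfile(fileext = ".rds")   # temporary file, just for this demonstration
saveRDS(toy, file = path)            # write the object to disk
restored <- readRDS(path)            # read it back in

all.equal(toy, restored)
```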

Since we're back to working with the raw data from the seasons we downloaded, we need to repeat all our cleaning above on this new big data set. Here's how to do it all in one command:

pbp_all_rp <- pbp_all %>%
      filter(!is_na(epa), !is_na(posteam), play_type=="no_play" | play_type=="pass" | play_type=="run") %>%
	mutate(
	pass = if_else(str_detect(desc, "( pass)|(sacked)|(scramble)"), 1, 0),
	rush = if_else(str_detect(desc, "(left end)|(left tackle)|(left guard)|(up the middle)|(right guard)|(right tackle)|(right end)") & pass == 0, 1, 0),
	success = ifelse(epa>0, 1 , 0),
	passer_player_name = ifelse(play_type == "no_play" & pass == 1, 
              str_extract(desc, "(?<=\\s)[A-Z][a-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?(?=\\s((pass)|(sack)|(scramble)))"),
              passer_player_name),
        receiver_player_name = ifelse(play_type == "no_play" & str_detect(desc, "pass"), 
              str_extract(desc, "(?<=to\\s)[A-Z][a-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?"),
              receiver_player_name),
        rusher_player_name = ifelse(play_type == "no_play" & rush == 1, 
              str_extract(desc, "(?<=\\s)[A-Z][a-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?(?=\\s((left end)|(left tackle)|(left guard)|(up the middle)|(right guard)|(right tackle)|(right end)))"),
              rusher_player_name),
	name = ifelse(!is_na(passer_player_name), passer_player_name, rusher_player_name),
	yards_gained=ifelse(play_type=="no_play",NA,yards_gained),
	play=1
	) %>%
	filter(pass==1 | rush==1)

I've added a line replacing the yards_gained variable with NA on plays with penalties because the value is a misleading 0 yards on all of these plays and making it NA makes it easier to exclude them. Now we can do stuff with our multiple years of data.
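A tiny arithmetic sketch of why the NA matters, using three made-up plays where the first one was wiped out by a penalty:

```r
yards_as_recorded <- c(0, 5, 7)    # penalty play recorded as 0 yards
yards_with_na     <- c(NA, 5, 7)   # same plays, penalty play marked as missing

avg_recorded <- mean(yards_as_recorded)            # the 0 drags the average down
avg_skipped  <- mean(yards_with_na, na.rm = TRUE)  # the penalty play is skipped

c(avg_recorded, avg_skipped)
```

With na.rm = TRUE the average is taken over the two real plays only, which is what we want for a yards-per-play number.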

Let's see how Aaron Rodgers' efficiency has changed from 2009-2014 to 2015-2018:

pbp_all_rp %>% filter(pass==1 & !is.na(passer_player_name))%>% mutate(
	arod = if_else(posteam=="GB"&passer_player_name=="A.Rodgers",1,0),
	early = if_else(season<=2014,1,0)
	) %>%
	group_by(arod,early) %>%
	summarize(mean_epa=mean(epa), ypp=mean(yards_gained, na.rm = TRUE)) %>% arrange(-early)

   arod early mean_epa   ypp
  <dbl> <dbl>    <dbl> <dbl>
1     0     1   0.0280  6.25
2     1     1   0.255   7.41
3     0     0   0.0605  6.37
4     1     0   0.123   6.16

In the above, we divided the sample into two periods (2009-2014 and 2015-2018), and into Rodgers and non-Rodgers dropbacks. We see that Rodgers had a 0.26 to 0.03 EPA/play advantage and a 7.41 to 6.25 yards/dropback advantage over the other QBs in the league from 2009-2014, but only a 0.12 to 0.06 EPA/play advantage and a 6.2 to 6.4 yards/dropback disadvantage in the most recent four seasons.

Next steps

By now, you have the tools to do a bunch of interesting things with nflscrapR data. If you're wondering what to do next, either pick a question that's interesting to you and try to answer it (how well does passing and rushing EPA/play in the first half of a season predict a team's passing and rushing EPA/play in the second half?), or try to replicate things you come across on twitter.

For other ideas, you can view the list of variables using

names(pbp_all_rp)

  [1] "play_id"                             
  [2] "game_id"                             
  [3] "home_team"                           
  [4] "away_team"                           
  [5] "posteam"                             
  [6] "posteam_type"                        
  [7] "defteam"                             
  [8] "side_of_field"                       
  [9] "yardline_100"                        
 [10] "game_date"                           
 ...

I've truncated the output since there are more than 200 variables, but this gives you an idea of the things included.

Happy coding!

Ben's other R code

More data sources

Other code examples: R

Other code examples: Python

awgymer commented Jul 29, 2019

I can't get the same output you have for schotty at the start of Part 2. It seems that it is using the dataframe of all plays rather than the one from this step: pbp_rp <- pbp_rp %>% filter(pass==1 | rush==1)

@guga31bb
Author

I can't get the same output you have for schotty at the start of Part 2. It seems that it is using the dataframe of all plays rather than the one from this step: pbp_rp <- pbp_rp %>% filter(pass==1 | rush==1)

Does it work if you explicitly filter in the schotty command?

schotty <- pbp_rp %>%
	filter((pass==1 | rush==1) & wp>.20 & wp<.80 & down<=2 & qtr<=2 & half_seconds_remaining>120) %>%
	group_by(posteam) %>%
	summarize(mean_pass=mean(pass), plays=n()) %>%
	arrange(mean_pass)
schotty

awgymer commented Jul 29, 2019

I meant it the other way around sorry. So your gist has the output as:

posteam mean_pass plays
   <fct>       <dbl> <int>
 1 SEA         0.356   309
 2 TEN         0.427   253
 3 JAX         0.428   269
 4 NYJ         0.442   251
 5 BUF         0.445   209

But if you use the table filtered by (pass==1 | rush==1) then you get:

 posteam mean_pass plays
   <chr>       <dbl> <int>
 1 SEA         0.367   300
 2 TEN         0.437   247
 3 JAX         0.439   262
 4 BUF         0.451   206
 5 NYJ         0.464   239

I also couldn't work out why you were dropping fumble_recovery_2_yards in the multiple season data section?

@guga31bb
Author

Oops, I probably ran it on the wrong dataframe, I'll check when I get home, thanks! I dropped fumble_recovery_2_yards because it was throwing an error (I think because in most seasons it was NA for every line, and the seasons that had a value were messing things up) and it's not something I'd ever use. Appreciate the comments!

awgymer commented Jul 29, 2019

Interesting, I've been getting an error with blocked_player_id so I wonder if it's the same thing.
It's a great guide, I've been meaning to actually try and do stuff with nflscrapR and this finally got me going.

@guga31bb
Author

You were right on the schotty thing and I have updated the results. Thanks again. And yes, I'd guess it's the same error with blocked_player_id, and that variable is only used for very specific things so safe to drop.

I actually thought about adding a huge select() line that kept only the meaningful variables but in the end didn't bother.

awgymer commented Jul 30, 2019

So I found out what the issue was. For some reason blocked_player_id and fumble_recovery_2_player_id had only NA and FALSE values in some years, which meant the column type was logical and bind_rows could not convert them to character, which is the type those columns have in the rest of the years. I suspect you might be able to avoid this by supplying default column types to read_csv, but since it only affects a few obscure columns and there are over 200 to supply defaults for, it probably isn't worth the stress. I'm curious why you had a problem with columns I didn't, and vice versa, though.
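A minimal sketch of the read_csv idea, assuming it is enough to pin down only the problem column and let readr guess all the others (so there is no need to list 200+ defaults). The two tiny CSV files and the ID value are made up for illustration; only blocked_player_id is a real column name from this thread.

```r
library(readr)
library(dplyr)

# Two toy "seasons" (made-up data). In the first one blocked_player_id is
# empty on every row, which is the situation where the type guess goes wrong.
f1 <- tempfile(fileext = ".csv")
f2 <- tempfile(fileext = ".csv")
writeLines(c("play_id,blocked_player_id", "1,", "2,"), f1)
writeLines(c("play_id,blocked_player_id", "3,00-0012345", "4,"), f2)

# Force only the troublesome column to character.
# Columns not named in cols() are still guessed by readr.
types <- cols(blocked_player_id = col_character())

seasons <- lapply(c(f1, f2), read_csv, col_types = types)

# Both pieces now agree on the column type, so they stack cleanly.
pbp_toy <- bind_rows(seasons)
```

In the real loop the same col_types argument would go into the read_csv call for the play-by-play file, with one entry per column that causes trouble.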

@gigantesdr

When attempting to pull multiple seasons together after pbp_all <- dplyr::bind_rows(datalist) I keep getting the Error: Column blocked_player_id can't be converted from character to logical

Anyone know how to fix?

@awgymer

awgymer commented Aug 14, 2019

You need to exclude it using the select clause that already excludes fumble_recovery_2_yards: select(-fumble_recovery_2_yards)
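As a sketch, the extended select call could look like the one below. It is shown on a made-up three-column table rather than the real play-by-play data, and it assumes both columns are present in the data frame.

```r
library(dplyr)

# Toy stand-in for one season of pbp (made-up values)
pbp <- tibble(
  play_id = 1:2,
  fumble_recovery_2_yards = c(NA, NA),
  blocked_player_id = c(NA, NA)
)

# Drop both problem columns in the same select() call
pbp <- pbp %>%
  select(-fumble_recovery_2_yards, -blocked_player_id)

names(pbp)
```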

@mrcaseb

mrcaseb commented Dec 26, 2019

I've been struggling for some time with the problem that the team logos are distorted. The line geom_image(aes(image = url), size = 0.05) in basic stuff part 4 causes the logos to be compressed or stretched in width (see screenshot below). I use the logos introduced in part 3 with nfl_logos_df <- read_csv("https://raw.githubusercontent.com/statsbylopez/BlogPosts/master/nfl_teamlogos.csv")

Does anyone else have such problems? I use ggimage version 0.2.5, ggplot2 (3.2.1), R version 3.6.1 (2019-07-05)
[Screenshot: distorted team logos]

@guga31bb
Author

I've been struggling for some time with the problem that the team logos are distorted. The line geom_image(aes(image = url), size = 0.05) in basic stuff part 4 causes the logos to be compressed or stretched in width (see screenshot below). I use the logos introduced in part 3 with nfl_logos_df <- read_csv("https://raw.githubusercontent.com/statsbylopez/BlogPosts/master/nfl_teamlogos.csv")

Try adding asp to the ggimage line:

geom_image(aes(image = url), size = 0.05, asp = 16/9) +

And saving in a height + width consistent with that

ggsave('PATH/2_offense_rp.png', dpi=800, height=9, width=16)

@TimmyG43

TimmyG43 commented Jan 8, 2020

This worked by simply removing the (-fumble_recovery_2_yards) and keeping select

first <- 2009 #first season to grab. min available=2009
last <- 2018 # most recent season

datalist = list()
for (yr in first:last) {
pbp <- read_csv(url(paste0("https://github.com/ryurko/nflscrapR-data/raw/master/play_by_play_data/regular_season/reg_pbp_", yr, ".csv")))
games <- read_csv(url(paste0("https://raw.githubusercontent.com/ryurko/nflscrapR-data/master/games_data/regular_season/reg_games_", yr, ".csv")))
pbp <- pbp %>% inner_join(games %>% distinct(game_id, week, season)) %>% select
datalist[[yr]] <- pbp # add it to your list
}

pbp_all <- dplyr::bind_rows(datalist)

@mrcaseb

mrcaseb commented Jan 9, 2020

I've been struggling for some time with the problem that the team logos are distorted. The line geom_image(aes(image = url), size = 0.05) in basic stuff part 4 causes the logos to be compressed or stretched in width (see screenshot below). I use the logos introduced in part 3 with nfl_logos_df <- read_csv("https://raw.githubusercontent.com/statsbylopez/BlogPosts/master/nfl_teamlogos.csv")

Try adding asp to the ggimage line:

geom_image(aes(image = url), size = 0.05, asp = 16/9) +

And saving in a height + width consistent with that

ggsave('PATH/2_offense_rp.png', dpi=800, height=9, width=16)

Thanks for your help, Mr. Baldwin. I now know that this is a ggimage problem introduced in version 0.2.4, which is why your solution doesn't work. I have contacted the developer and hope for a fix.

@keeron-rahman

keeron-rahman commented Feb 20, 2020

I used your code as a basis to get 2019 passing stats, but I'm getting different data than other sources.
Does using this code block include plays with penalties or something?

pbp_rp <- pbp_rp %>%
	mutate(
	pass = if_else(str_detect(desc, "( pass)|(sacked)|(scramble)"), 1, 0),
	rush = if_else(str_detect(desc, "(left end)|(left tackle)|(left guard)|(up the middle)|(right guard)|(right tackle)|(right end)") & pass == 0, 1, 0),
	success = ifelse(epa>0, 1 , 0)
	)

...

summarize(mean_epa = mean(epa), success_rate = mean(success), ypa=mean(yards_gained), plays=n(), yards=sum(yards_gained)) %>%
arrange(desc(mean_epa)) %>%
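
One way to check the penalty question is to run the same string search on a few descriptions by hand. The sketch below uses made-up play descriptions (not rows from the real data) and a hypothetical helper column, penalty_text, that is not part of the guide. It only shows that str_detect looks at the text of desc, so a play whose description contains " pass" is flagged as a pass whether or not a penalty is mentioned on it.

```r
library(dplyr)
library(stringr)

# Made-up descriptions written in the style of the desc column
toy <- tibble(desc = c(
  "(12:40) R.Wilson pass short left to T.Lockett for 8 yards.",
  "(9:15) C.Carson up the middle for 3 yards.",
  "(3:02) R.Wilson pass incomplete deep right. PENALTY on SEA, Offensive Holding, 10 yards - No Play.",
  "(1:10) R.Wilson sacked at SEA 25 for -7 yards."
))

toy <- toy %>%
  mutate(
    # same patterns as in the guide
    pass = if_else(str_detect(desc, "( pass)|(sacked)|(scramble)"), 1, 0),
    rush = if_else(str_detect(desc, "(left end)|(left tackle)|(left guard)|(up the middle)|(right guard)|(right tackle)|(right end)") & pass == 0, 1, 0),
    # hypothetical extra flag: does the description mention a penalty?
    penalty_text = str_detect(desc, "PENALTY")
  )

# The third play is flagged as a pass even though it mentions a penalty
toy %>% select(pass, rush, penalty_text)
```

Whether such plays end up in the summary depends on how pbp_rp was filtered before this step, so counting rows where both pass and the penalty flag are set would show how much they matter.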

@NerdMikeV

Love the guide. I'd been wanting to do this for a while and finally committed the time to it - you made it easy.

I'm getting a slightly different mean epa value for M. Brown in 2018. Mine's .015 or something. I'm not worried about it though.

I also tweeted about this but for anyone else checking, the logo file isn't (I think) working properly because it's grabbing the logos off of a Wikipedia page and the Raiders are now on Wikipedia as Las Vegas and not Oakland.

@rsasso

rsasso commented Jun 8, 2020

This worked by simply removing the (-fumble_recovery_2_yards) and keeping select

first <- 2009 #first season to grab. min available=2009
last <- 2018 # most recent season

datalist = list()
for (yr in first:last) {
pbp <- read_csv(url(paste0("https://github.com/ryurko/nflscrapR-data/raw/master/play_by_play_data/regular_season/reg_pbp_", yr, ".csv")))
games <- read_csv(url(paste0("https://raw.githubusercontent.com/ryurko/nflscrapR-data/master/games_data/regular_season/reg_games_", yr, ".csv")))
pbp <- pbp %>% inner_join(games %>% distinct(game_id, week, season)) %>% select
datalist[[yr]] <- pbp # add it to your list
}

pbp_all <- dplyr::bind_rows(datalist)

I have recently started learning R so forgive my lack of knowledge, but I am having issues getting the data in the datalist to appear. After I run this line of code the datalist appears completely full of NULL lines. Is there a step I am missing?

@awgymer

awgymer commented Jun 9, 2020

@rsasso Are you sure that your datalist is in fact empty? If you index a list using integer values (as you do using years) then it actually creates empty list entries at all the integers before/between your first/last values.

datalist = list()
datalist[[10]] <- 'mystr'

Then look at that datalist. You'll see that datalist[[1]] through datalist[[9]] are all NULL but that datalist[[10]] does in fact contain mystr.

So try running the code and then doing datalist[[2009]] (or any year you've scraped) and check whether those values are in fact NULL
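
A small sketch of the same effect with years as the index, using made-up one-row data frames in place of the real seasons. It also shows one possible alternative, keying the list by the year as a character string, which avoids the empty slots (bynames is a name invented here, not something from the guide).

```r
# Indexing by the integer year pads the list with NULLs up to that position
datalist <- list()
for (yr in 2009:2011) {
  datalist[[yr]] <- data.frame(season = yr)  # stand-in for one season of pbp
}

length(datalist)                  # total number of slots, NULL ones included
sum(!sapply(datalist, is.null))   # number of slots that actually hold data
datalist[[2009]]                  # the first real entry

# Alternative: use the year as a name instead of a position
bynames <- list()
for (yr in 2009:2011) {
  bynames[[as.character(yr)]] <- data.frame(season = yr)
}

length(bynames)
names(bynames)
```

With integer indexing the list has 2011 slots and only three of them hold data, so a viewer that shows the first entries will show nothing but NULL.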

@williambjames

I'm only about a week removed from my introduction to R, but I've been trying to learn the ropes of this scraper. I've come across an issue when trying to load all seasons into RStudio using the code from the "Pulling together all seasons" bit of the guide, except I changed the last value to 2019 instead of 2018. Since the URL is valid and connects to the .csv file on GitHub, I thought it would work. The code runs well and compiles into one dataset fine; however, any time I try to do anything including 2019, no values appear. Has anyone else had this problem?

@casmacdo

I think "install.packages("dplyr")" and "library(dplyr)" might be redundant because tidyverse includes dplyr

@mrcaseb

mrcaseb commented Sep 16, 2020

This guide is quite outdated. Ben redid it on the nflfastR website, and I recommend that everyone work with the new guide.
