--> A beginner's guide to nflscrapR <--
I get a lot of questions about how to get nflscrapR up and running. This guide is intended to help new users build interesting tables or charts from the ground up, starting from the raw nflscrapR data.
Quick word if you're new to programming: all of this is happening in R. Obviously, you need to install R on your computer to do any of this. Make sure you save what you're doing in a script (in R, File --> New script) so you can save your work and run multiple lines of code at once. To run code from a script, highlight what you want, right click, and select Run line. As you go through your R journey, you might get stuck and have to google a bunch of things, but that's totally okay and normal. That's how I wrote this thing!
A huge thank you to Josh Hornsby (@Josh_ADHD) and Zach Feldman (@ZachFeldman3) for sharing code snippets for me to work with, and of course to the nflscrapR team (Maksim Horowitz, Ron Yurko, and Sam Ventura) for providing this resource. An additional huge thanks to Keegan Abdoo (@KeeganAbdoo) for making the string search parts of this look much better.
Final disclaimer: this will work but I'm not great at R so there might be better ways to do certain things.
-- Ben Baldwin, @benbbaldwin
First, you need to install the magic packages. You only need to run this step once on a given computer. We aren't going to bother with the actual nflscrapR package since it's a million times faster to just download the .csvs directly from Ron's github (thanks Ron!).
install.packages("tidyverse")
install.packages("dplyr")
install.packages("na.tools")
install.packages("ggimage")
You'll need to load the packages every time you open a new instance of R.
library(tidyverse)
library(dplyr)
library(na.tools)
library(ggimage)
pbp <- read_csv(url("https://github.com/ryurko/nflscrapR-data/raw/master/play_by_play_data/regular_season/reg_pbp_2018.csv"))
This reads in the 2018 play-by-play data from github and saves it as "pbp". To read in a different season, change the year at the end of the URL above.
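As a small sketch of the "change the year" idea, the URL can be built from a season number. The helper name pbp_url is made up for this illustration and is not part of nflscrapR; the URL pattern is the one used above.

```r
# Hypothetical helper (not part of nflscrapR): build the download URL for one season
pbp_url <- function(season) {
  paste0(
    "https://github.com/ryurko/nflscrapR-data/raw/master/",
    "play_by_play_data/regular_season/reg_pbp_", season, ".csv"
  )
}

pbp_url(2017)

# To actually download that season, pass the URL to read_csv as before:
# pbp_2017 <- read_csv(url(pbp_url(2017)))
```

The download line is commented out so the sketch runs without an internet connection.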
To start, let's just look at the first few rows (the "head") of the data.
pbp %>% select(posteam, defteam, desc, play_type) %>% head
posteam defteam desc play_type
1 ATL PHI J.Elliott kicks 65 yards from PHI 35 to end zone, Touchback. kickoff
2 ATL PHI (15:00) PENALTY on ATL-L.Paulsen, False Start, 5 yards, enforced at ATL 25 - No Play. no_play
3 ATL PHI (15:00) M.Ryan pass short right to J.Jones pushed ob at ATL 30 for 10 yards (M.Jenkins). pass
4 ATL PHI (14:22) J.Jones left end pushed ob at ATL 41 for 11 yards (D.Barnett). run
5 ATL PHI (13:46) D.Freeman right end to PHI 39 for 20 yards (M.Jenkins). run
6 ATL PHI (13:10) M.Ryan pass incomplete short right to C.Ridley (J.Mills, J.Hicks). pass
A couple things. The %>% thing lets you pipe together a bunch of different commands. So we're taking pbp, our data, "select"ing a few variables we want to look at ("desc" is the important variable that lists the description of what happened on the play), and then saying to show the first few rows. Since this is already sorted by time, these are the first 6 rows from the season opener, ATL @ PHI.
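If the pipe still feels mysterious, here is a minimal sketch on a tiny made-up data frame (the rows are invented, not real plays). It just shows that piping and nesting the function calls give the same answer; the pipe version is only easier to read.

```r
library(dplyr)

# A tiny made-up data frame, just to show what the pipe does
toy <- data.frame(
  posteam   = c("ATL", "ATL", "PHI"),
  play_type = c("pass", "run", "pass"),
  stringsAsFactors = FALSE
)

# Piped: read left to right (take toy, select a column, show 2 rows)
piped <- toy %>% select(posteam) %>% head(2)

# Nested: the same steps, read inside out
nested <- head(select(toy, posteam), 2)

identical(piped, nested)
```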
There are a few important things we want to handle. First, the play-by-play includes a bunch of events that aren't actually plays (timeouts, quarters ending, and stuff like that). Second, we're going to fix some plays that are mis-classified (QB scrambles should count as pass plays; spikes and kneeldowns shouldn't count as real plays).
For this illustration, I'm going to focus on run plays and pass plays, so I'll throw out punts, kickoffs, field goals, and dead ball penalties (e.g. false starts) where we don't know what the attempted play was.
I'm also going to focus on 2018 only to keep things simple to start with. Adding other seasons is not super complicated and is covered later in this guide.
pbp_rp <- pbp %>%
filter(!is_na(epa), play_type=="no_play" | play_type=="pass" | play_type=="run")
We've saved a new dataset (the fancy name in R is data frame) called "pbp_rp". It is "pbp", keeping only the plays that have a value for epa and that are either pass plays, run plays, or penalties ("no play"). Let's look at what the first few rows of this new data frame look like:
pbp_rp %>% select(posteam, desc, play_type) %>% head
posteam desc play_type
1 ATL (15:00) PENALTY on ATL-L.Paulsen, False Start, 5 yards, enforced at ATL 25 - No Play. no_play
2 ATL (15:00) M.Ryan pass short right to J.Jones pushed ob at ATL 30 for 10 yards (M.Jenkins). pass
3 ATL (14:22) J.Jones left end pushed ob at ATL 41 for 11 yards (D.Barnett). run
4 ATL (13:46) D.Freeman right end to PHI 39 for 20 yards (M.Jenkins). run
5 ATL (13:10) M.Ryan pass incomplete short right to C.Ridley (J.Mills, J.Hicks). pass
6 ATL (13:05) (Shotgun) M.Ryan pass incomplete short left to D.Freeman. pass
Compared to the first time we did this, the kickoff is now gone.
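A side note on the filter used to build pbp_rp: the chain of "|" conditions can also be written with %in%. This is only a sketch on made-up play types, not a change to the code above; both versions keep the same rows.

```r
library(dplyr)

# Made-up play types, standing in for the real pbp data
toy <- data.frame(
  play_type = c("kickoff", "no_play", "pass", "run", "punt"),
  stringsAsFactors = FALSE
)

# The chained version, as used above
a <- toy %>% filter(play_type == "no_play" | play_type == "pass" | play_type == "run")

# An equivalent, shorter way to write it
b <- toy %>% filter(play_type %in% c("no_play", "pass", "run"))

identical(a, b)
```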
Now let's take a look at some of the "no play" plays.
pbp_rp %>% filter(play_type=="no_play") %>% select(rush_attempt, pass_attempt, desc) %>% head
rush_attempt pass_attempt desc
0 0 (15:00) PENALTY on ATL-L.Paulsen, False Start, 5 yards, enforced at ATL 25 - No Play.
0 0 (5:32) (Shotgun) M.Ryan sacked at PHI 11 for -4 yards (C.Long). PENALTY on PHI-D.Barnett, Defensive Offside, 4 yards, enforced at PHI 7 - No Play.
0 0 (4:13) (Shotgun) C.Clement up the middle to PHI 32 for no gain (D.Jones). PENALTY on ATL-T.McKinley, Defensive Offside, 5 yards, enforced at PHI 32 - No Play.
0 0 (1:43) (Shotgun) PENALTY on ATL-T.McKinley, Neutral Zone Infraction, 5 yards, enforced at PHI 36 - No Play.
0 0 (8:23) (Shotgun) N.Foles pass incomplete short right to M.Wallace (D.Trufant). ATL-K.Neal was injured during the play. PENALTY on ATL-D.Trufant, Defensive Pass Interference, 8 yards, enforced at ATL 17 - No Play.
0 0 (8:17) (Shotgun) D.Sproles left tackle to ATL 4 for 5 yards (D.Kazee). PENALTY on PHI-J.Kelce, Offensive Holding, 10 yards, enforced at ATL 9 - No Play.
The false start to open the game is still there, but the rest are attempted rush or pass plays that (in my opinion) should count when we're computing EPA or something like that. But importantly, rush_attempt and pass_attempt are 0 for all these plays. We need to fix that.
There might be a better way to do this, but I search the "desc" variable to look for indicators of passing or rushing:
pbp_rp <- pbp_rp %>%
mutate(
pass = if_else(str_detect(desc, "( pass)|(sacked)|(scramble)"), 1, 0),
rush = if_else(str_detect(desc, "(left end)|(left tackle)|(left guard)|(up the middle)|(right guard)|(right tackle)|(right end)") & pass == 0, 1, 0),
success = ifelse(epa>0, 1 , 0)
)
Mutate is R's weird word for creating new variables. I've created "pass", which searches the "desc" variable for plays with "pass", "sacked", or "scramble", along with a variable for rush and for a successful play (using the simple definition for success of positive EPA). Let's take a look:
pbp_rp %>% filter(play_type=="no_play") %>% select(pass, rush, desc) %>% head
pass rush desc
1 0 0 (15:00) PENALTY on ATL-L.Paulsen, False Start, 5 yards, enforced at ATL 25 - No Play.
2 1 0 (5:32) (Shotgun) M.Ryan sacked at PHI 11 for -4 yards (C.Long). PENALTY on PHI-D.Barnett, Defensive Offside, 4 yards, enforced at PHI 7 - No Play.
3 0 1 (4:13) (Shotgun) C.Clement up the middle to PHI 32 for no gain (D.Jones). PENALTY on ATL-T.McKinley, Defensive Offside, 5 yards, enforced at PHI 32 - No Play.
4 0 0 (1:43) (Shotgun) PENALTY on ATL-T.McKinley, Neutral Zone Infraction, 5 yards, enforced at PHI 36 - No Play.
5 1 0 (8:23) (Shotgun) N.Foles pass incomplete short right to M.Wallace (D.Trufant). ATL-K.Neal was injured during the play. PENALTY on ATL-D.Trufant, Defensive Pass Interf~
6 0 1 (8:17) (Shotgun) D.Sproles left tackle to ATL 4 for 5 yards (D.Kazee). PENALTY on PHI-J.Kelce, Offensive Holding, 10 yards, enforced at ATL 9 - No Play.
Now we can finally have variables for "pass" and "rush" that include plays with penalties!
Hooray! Let's now save, keeping only the run or pass plays.
pbp_rp <- pbp_rp %>% filter(pass==1 | rush==1)
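To see the string search in isolation, here is a small sketch that runs the same two patterns on three descriptions copied from the output above (trimmed to the part before the penalty). Nothing here touches the real data.

```r
library(dplyr)
library(stringr)

# Three descriptions taken from the "no play" output above
desc <- c(
  "(5:32) (Shotgun) M.Ryan sacked at PHI 11 for -4 yards (C.Long).",
  "(4:13) (Shotgun) C.Clement up the middle to PHI 32 for no gain (D.Jones).",
  "(15:00) PENALTY on ATL-L.Paulsen, False Start, 5 yards, enforced at ATL 25 - No Play."
)

# Same logic as the mutate() above: a dropback if the text mentions a pass, sack or scramble
pass <- if_else(str_detect(desc, "( pass)|(sacked)|(scramble)"), 1, 0)

# A rush if the text mentions a run direction and it was not already called a pass
rush <- if_else(
  str_detect(desc, "(left end)|(left tackle)|(left guard)|(up the middle)|(right guard)|(right tackle)|(right end)") & pass == 0,
  1, 0
)

data.frame(pass, rush)
```

The sack comes out as a pass, the run up the middle as a rush, and the false start as neither, which is why the final filter drops it.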
Okay, we have a big dataset where we call dropbacks pass plays and non-dropbacks rush plays. Now we actually want to, like, do stuff.
Let's take a look at how various Rams' running backs fared on run plays in 2018:
pbp_rp %>%
filter(posteam == "LA", rush == 1, down<=4) %>%
group_by(rusher_player_name) %>%
summarize(mean_epa = mean(epa), success_rate = mean(success), ypc=mean(yards_gained), plays=n()) %>%
arrange(desc(mean_epa)) %>%
filter(plays>40)
rusher_player_name mean_epa success_rate ypc plays
1 C.Anderson 0.333 0.628 6.95 43
2 T.Gurley 0.155 0.473 4.89 256
3 M.Brown -0.0486 0.465 4.93 43
There's a lot going on here. The "group_by" function, well, groups by what you tell it -- in this case player name. Summarize is useful for getting summaries of what you're looking at, and here, while grouping by player, we're summarizing the mean of EPA, success, yardage (a bad rushing stat, but since we're here), and getting the number of plays.
Looking at the results, C.J. Anderson is truly a generational talent.
Some notes on this. The "plays" and "ypc" columns exactly match the official stats you'd see on PFR. There are two reasons for this. First, by specifying down<=4, I've excluded carries on two-point conversions (those have a missing down, shown as NA), which are present in the data but don't count in the official stats. And second, all of the work we did above to define rush plays and pass plays doesn't fix the fact that rusher names are missing on plays with penalties. So why did we bother? That'll be useful for the rest of this when we're doing some team-level stuff.
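Here is a minimal sketch of the group_by / summarize pattern, including the down<=4 trick, on invented carries. The rusher names "A" and "B" and all the numbers are made up; the NA in the down column stands in for a two-point conversion.

```r
library(dplyr)

# Made-up carries; the NA in `down` plays the role of a two-point try
toy <- data.frame(
  rusher       = c("A", "A", "A", "B", "B"),
  down         = c(1, 2, NA, 1, 3),
  yards_gained = c(4, 6, 2, 10, 2),
  stringsAsFactors = FALSE
)

out <- toy %>%
  filter(down <= 4) %>%     # NA <= 4 is NA, and filter() drops rows that are not TRUE
  group_by(rusher) %>%      # one group per rusher
  summarize(ypc = mean(yards_gained), plays = n())

out
```

Rusher "A" ends up with 2 plays rather than 3, because the carry with a missing down never makes it past the filter.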
Let's look at which teams were the most pass-heavy in the first half on early downs with win probability between 20% and 80%, excluding the final 2 minutes of the half when everyone is pass-happy:
schotty <- pbp_rp %>%
filter(wp>.20 & wp<.80 & down<=2 & qtr<=2 & half_seconds_remaining>120) %>%
group_by(posteam) %>%
summarize(mean_pass=mean(pass), plays=n()) %>%
arrange(mean_pass)
schotty
posteam mean_pass plays
<chr> <dbl> <int>
1 SEA 0.369 320
2 JAX 0.435 276
3 TEN 0.441 263
4 BUF 0.452 219
5 BAL 0.458 299
6 ARI 0.466 236
7 NYJ 0.473 256
8 DET 0.482 299
9 WAS 0.485 239
10 CAR 0.491 281
# ... with 22 more rows
The Seahawks were playing a different sport in 2018. Fun! Let's see what that looks like:
ggplot(schotty, aes(x=reorder(posteam,-mean_pass), y=mean_pass)) +
geom_text(aes(label=posteam))
ggsave('FILENAME.png', dpi=1000)
This image is kind of a mess -- we still need a title, axis labels, etc -- but gets the point across. We'll get to that other stuff later. But more importantly, we made something interesting using nflscrapR data! And we used the "pass" variable that we created above to do it, so we're counting plays with penalties. In the above, be sure to change FILENAME to where you want to save.
The "reorder" sorts the teams according to pass rate, with the "-" saying to do it in descending order. "aes" is short for "aesthetic", which is R's weird way of asking which variables should go on the x and y axes.
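To see what reorder does on its own, here is a tiny sketch with three made-up pass rates. It uses only base R, and the numbers are invented for illustration.

```r
# Made-up pass rates for three teams
teams     <- c("SEA", "KC", "NE")
pass_rate <- c(0.37, 0.60, 0.50)

# reorder() returns a factor whose levels are sorted by the second argument;
# putting a minus sign in front flips the sort so the highest rate comes first
ordered_teams <- reorder(teams, -pass_rate)

levels(ordered_teams)
```

ggplot draws the x axis in the order of those levels, which is why the most pass-heavy team lands on the left.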
Let's make one of those dropback success rate vs EPA plots that have been floating around.
chart_data <- pbp_rp %>%
filter(pass==1) %>%
group_by(posteam) %>%
summarise(
num_db = n(),
epa_per_db = sum(epa) / num_db,
success_rate = sum(epa > 0) / num_db
)
nfl_logos_df <- read_csv("https://raw.githubusercontent.com/statsbylopez/BlogPosts/master/nfl_teamlogos.csv")
chart <- chart_data %>% left_join(nfl_logos_df, by = c("posteam" = "team_code"))
First, we're taking all plays with pass = 1, and grouping by team to get the mean of EPA/play and success rate. To get fancy, we're going to take the team logos provided by President Lopez and join that to the chart data we created. (what does left_join mean? I don't know, I copied this from somewhere else and it works)
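For what it's worth, left_join keeps every row of the table on the left and attaches the matching columns from the table on the right. A minimal sketch with made-up numbers and made-up file names (these are not the real logo URLs, and "XX" is an invented team code):

```r
library(dplyr)

# Made-up stats; "XX" is an invented team code with no logo
stats <- data.frame(
  posteam    = c("KC", "NO", "XX"),
  epa_per_db = c(0.35, 0.28, 0.00),
  stringsAsFactors = FALSE
)

# Made-up logo table; note the key column has a different name
logos <- data.frame(
  team_code = c("KC", "NO", "SEA"),
  url       = c("kc.png", "no.png", "sea.png"),
  stringsAsFactors = FALSE
)

# by = c("posteam" = "team_code") says: match posteam on the left to team_code on the right
joined <- stats %>% left_join(logos, by = c("posteam" = "team_code"))

joined
```

All three rows of stats survive. "XX" gets an NA for url because nothing matched, and SEA is dropped because it only exists in the right-hand table.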
chart %>%
ggplot(aes(x = success_rate, y = epa_per_db)) +
geom_image(aes(image = url), size = 0.05) +
labs(x = "Success rate",
y = "EPA per play",
caption = "Data from nflscrapR",
title = "Dropback success rate & EPA/play",
subtitle = "2018") +
theme_bw() +
theme(axis.title = element_text(size = 12),
axis.text = element_text(size = 10),
plot.title = element_text(size = 16),
plot.subtitle = element_text(size = 14),
plot.caption = element_text(size = 12))
ggsave('FILENAME.png', dpi=1000)
Boom! We created a fancy graph. We can see here a bunch of storylines that dominated 2018, including the Chiefs' and Saints' aerial dominance, Aaron Rodgers being about as efficient as Andy Dalton and Matt Stafford, and the struggles of the rookie Joshes.
Let's look at rushing and passing EPA/play at the team level on 1st and 2nd down. Are there any teams that were better at rushing than passing?
chart_data <- pbp_rp %>%
group_by(posteam) %>%
filter(down<=2) %>%
summarise(
n_dropbacks = sum(pass),
n_rush = sum(rush),
epa_per_db = sum(epa * pass) / n_dropbacks,
epa_per_rush = sum(epa * rush) / n_rush,
success_per_db = sum(success * pass) / n_dropbacks,
success_per_rush = sum(success * rush) / n_rush
)
chart <- chart_data %>% left_join(nfl_logos_df, by = c("posteam" = "team_code"))
chart %>%
ggplot(aes(x = epa_per_rush, y = epa_per_db)) +
geom_image(aes(image = url), size = 0.05) +
labs(x = "Rush EPA/play",
y = "Pass EPA/play",
caption = "Data from nflscrapR",
title = "Early-down rush and pass EPA/play",
subtitle = "2018") +
theme_bw() +
geom_abline(slope=1, intercept=0, alpha=.2) +
theme(axis.title = element_text(size = 12),
axis.text = element_text(size = 10),
plot.title = element_text(size = 16),
plot.subtitle = element_text(size = 14),
plot.caption = element_text(size = 12))
ggsave('FILENAME.png', dpi=1000)
I've added a line with slope 1. Teams above that diagonal line are more efficient passing than rushing on early downs, as measured by EPA per play. In other words, every single team in the NFL in 2018, with the possible exception of the Panthers.
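The summarise step behind this chart leans on a small trick: since pass and rush are 0/1 variables, sum(epa * pass) / sum(pass) is just the average EPA over the pass plays. A sketch with four made-up plays:

```r
# Four made-up plays: two dropbacks and two rushes
epa  <- c(0.5, -0.2, 1.0, -0.4)
pass <- c(1, 0, 1, 0)
rush <- 1 - pass

# Multiplying by the 0/1 indicator zeroes out the plays we don't want
epa_per_db   <- sum(epa * pass) / sum(pass)
epa_per_rush <- sum(epa * rush) / sum(rush)

# Same thing as taking the mean over the matching plays
all.equal(epa_per_db, mean(epa[pass == 1]))
all.equal(epa_per_rush, mean(epa[rush == 1]))

# TRUE here means the team would sit above the diagonal line
epa_per_db > epa_per_rush
```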
Is this driven by explosive plays? Let's look at the same graph but with success rate instead.
chart %>%
ggplot(aes(x = success_per_rush, y = success_per_db)) +
geom_image(aes(image = url), size = 0.05) +
labs(x = "Rush success rate",
y = "Pass success rate",
caption = "Data from nflscrapR",
title = "Early-down rush and pass success rate",
subtitle = "2018") +
theme_bw() +
geom_abline(slope=1, intercept=0, alpha=.2) +
theme(axis.title = element_text(size = 12),
axis.text = element_text(size = 10),
plot.title = element_text(size = 16),
plot.subtitle = element_text(size = 14),
plot.caption = element_text(size = 12))
ggsave('FILENAME.png', dpi=1000)
Wow! As measured by having positive-value plays, early-down pass plays are much better at helping teams stay on schedule for every single team in the league.
Let's check player names on plays with penalties:
pbp_rp %>% filter(play_type=="no_play") %>%
select(desc, pass, passer_player_name, rusher_player_name, receiver_player_name) %>% head()
desc passer_player_name rusher_player_na~ receiver_player_n~
<chr> <chr> <chr> <chr>
1 (5:32) (Shotgun) M.Ryan sacked at PHI 11 for -4 yards (C.Long). PENALTY on PHI-D.Barnett, Defensive Offside, 4 yards, enforced at PHI 7 - No Play. <NA> <NA> <NA>
2 (4:13) (Shotgun) C.Clement up the middle to PHI 32 for no gain (D.Jones). PENALTY on ATL-T.McKinley, Defensive Offside, 5 yards, enforced at PHI 32 - No Play. <NA> <NA> <NA>
3 (8:23) (Shotgun) N.Foles pass incomplete short right to M.Wallace (D.Trufant). ATL-K.Neal was injured during the play. PENALTY on ATL-D.Trufant, Defensive P~ <NA> <NA> <NA>
4 (8:17) (Shotgun) D.Sproles left tackle to ATL 4 for 5 yards (D.Kazee). PENALTY on PHI-J.Kelce, Offensive Holding, 10 yards, enforced at ATL 9 - No Play. <NA> <NA> <NA>
5 (1:50) (Shotgun) N.Foles pass short right to N.Agholor ran ob at PHI 31 for 7 yards. PENALTY on ATL-D.Trufant, Defensive Holding, 5 yards, enforced at PHI 24~ <NA> <NA> <NA>
6 (:33) N.Foles FUMBLES (Aborted) at PHI 7, and recovers at PHI 7. N.Foles pass incomplete short right to N.Agholor. PENALTY on ATL-B.Reed, Defensive Offside, ~ <NA> <NA> <NA>
All those NAs sit where the player names for passer, runner, and receiver should be. Let's clean those up. A gigantic thanks to Keegan Abdoo (@KeeganAbdoo) for providing the code to do this.
pbp_players <- pbp_rp %>%
mutate(
passer_player_name = ifelse(play_type == "no_play" & pass == 1,
str_extract(desc, "(?<=\\s)[A-Z][a-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?(?=\\s((pass)|(sack)|(scramble)))"),
passer_player_name),
receiver_player_name = ifelse(play_type == "no_play" & str_detect(desc, "pass"),
str_extract(desc,
"(?<=to\\s)[A-Z][a-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?"),
receiver_player_name),
rusher_player_name = ifelse(play_type == "no_play" & rush == 1,
str_extract(desc, "(?<=\\s)[A-Z][a-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?(?=\\s((left end)|(left tackle)|(left guard)|(up the middle)|(right guard)|(right tackle)|(right end)))"),
rusher_player_name)
)
By now you know how to check the head() of this to verify that it works. This should not be considered part of a basic R tutorial. If you just skim this part and copy & paste the code, that's totally fine. You might not ever need to perform a string search this complicated for the rest of your life.
With that being said, the notes on what happened above, again with thanks to Keegan:
* [A-Z] means starting with a capital letter
* [a-z]* means zero or more lowercase letters (accounts for players like Damien Williams - "Dam. Williams")
* \\. represents the period after a player's first initial (or abbreviated first name)
* \\s? accounts for a potential space in between the abbreviated first name and last name (e.g. "Dam. Williams")
* [A-Z][A-z]+ means a capital letter followed by one or more letters (indiscriminate case to account for names like McCown)
* (\\s(I{2,3})|(IV))? accounts for any possible suffix (? = zero or one)
* (?=\\s((pass)|(sack)|(scramble))) means the name must be followed by a space and then pass, sack or scramble (to account for plays that start out with an eligible OL or other irregularities)
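To watch those pieces work together, here is a sketch that runs the passer and receiver patterns from above on two descriptions copied from the earlier output (trimmed to the part before the penalty). It assumes the stringr package is installed, which it is if you installed the tidyverse.

```r
library(stringr)

# Two descriptions taken from the "no play" output earlier in this guide
desc <- c(
  "(5:32) (Shotgun) M.Ryan sacked at PHI 11 for -4 yards (C.Long).",
  "(8:23) (Shotgun) N.Foles pass incomplete short right to M.Wallace (D.Trufant)."
)

# A name that comes right before " pass", " sack" or " scramble"
passer_pattern <- "(?<=\\s)[A-Z][a-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?(?=\\s((pass)|(sack)|(scramble)))"

# A name that comes right after "to "
receiver_pattern <- "(?<=to\\s)[A-Z][a-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?"

passer   <- str_extract(desc, passer_pattern)
receiver <- str_extract(desc, receiver_pattern)

data.frame(passer, receiver)
```

The sack has a passer but no receiver (there is no "to " in the text), so str_extract returns NA for it. The defenders in parentheses are skipped because they follow "(" rather than a space.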
Now let's do something!
qbs <- pbp_players %>%
mutate(
name = ifelse(!is_na(passer_player_name), passer_player_name, rusher_player_name),
rusher = rusher_player_name,
receiver = receiver_player_name,
play = 1
) %>%
group_by(name, posteam) %>%
summarize (
n_dropbacks = sum(pass),
n_rush = sum(rush),
n_plays = sum(play),
epa_per_play = sum(epa)/n_plays,
success_per_play =sum(success)/n_plays
) %>%
filter(n_dropbacks>=100)
We're going to make a plot for QBs, so to simplify things, I've created a "name" variable that's the passer name if it's a pass, and the rusher name if it's a rush. To make sure we're only getting QBs, I filtered to have at least 100 dropbacks. But EPA and success rate are calculated using all plays: both passes and rushes.
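The "take the passer name, otherwise fall back to the rusher name" step can be tried on its own. This sketch uses made-up rows and also shows dplyr's coalesce, which is an alternative I'm suggesting, not what the code above uses; both give the same result here.

```r
library(dplyr)

# Made-up rows: a pass, a handoff, and another pass
passer <- c("R.Wilson", NA, "R.Wilson")
rusher <- c(NA, "C.Carson", NA)

# The approach used above: passer if we have one, otherwise rusher
name_ifelse <- ifelse(!is.na(passer), passer, rusher)

# coalesce() returns the first non-missing value across its arguments
name_coalesce <- coalesce(passer, rusher)

identical(name_ifelse, name_coalesce)
```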
Now let's plot the results. Make sure you have ggrepel installed, which is used below to make it so the player name labels don't overlap.
library(ggrepel)
qbs %>%
ggplot(aes(x = success_per_play, y = epa_per_play)) +
geom_hline(yintercept = mean(qbs$epa_per_play), color = "red", linetype = "dashed") +
geom_vline(xintercept = mean(qbs$success_per_play), color = "red", linetype = "dashed") +
geom_point(color = ifelse(qbs$posteam == "SF", "red", "black"), cex=qbs$n_plays/60, alpha=1/4) +
geom_text_repel(aes(label=name),
force=1, point.padding=0,
segment.size=0.1) +
labs(x = "Success rate",
y = "EPA per play",
caption = "Data from nflscrapR",
title = "QB success rate and EPA/play",
subtitle = "2018, min 100 dropbacks, includes all QB's rush and pass plays") +
theme_bw() +
theme(axis.title = element_text(size = 12),
axis.text = element_text(size = 10),
plot.title = element_text(size = 16, hjust = 0.5),
plot.subtitle = element_text(size = 14, hjust = 0.5),
plot.caption = element_text(size = 12))
ggsave('FILENAME.png', dpi=1000)
I've highlighted the 49ers' quarterbacks to show how similar the efficiency of Jimmy G and Mullens was in 2018, as well as the failed Beathard experiment. The "cex" part makes dot size proportional to plays, and alpha makes the dots semi-transparent.
To this point we've focused on 2018, but NFL history did not begin in 2018 (it began in 2012 when Russell Wilson was drafted). The following grabs the years of available data and also adds on the results from each game (e.g., home team, away team, week number of the season, and how many points were scored by each team), which are stored in a different file.
first <- 2009 #first season to grab. min available=2009
last <- 2018 # most recent season
datalist = list()
for (yr in first:last) {
pbp <- read_csv(url(paste0("https://github.com/ryurko/nflscrapR-data/raw/master/play_by_play_data/regular_season/reg_pbp_", yr, ".csv")))
games <- read_csv(url(paste0("https://raw.githubusercontent.com/ryurko/nflscrapR-data/master/games_data/regular_season/reg_games_", yr, ".csv")))
pbp <- pbp %>% inner_join(games %>% distinct(game_id, week, season)) %>% select(-fumble_recovery_2_yards)
datalist[[yr]] <- pbp # add it to your list
}
pbp_all <- dplyr::bind_rows(datalist)
Note that above I'm dropping the "fumble_recovery_2_yards" variable since it was causing problems with stitching the seasons together and it was easier to remove it than figure out the problem. Since we're working with multiple seasons now, we want to make sure team names are coded consistently over time. Let's look for the problem teams:
pbp_all %>% group_by(home_team) %>%summarize(n=n(), seasons=n_distinct(season), minyr=min(season), maxyr=max(season)) %>%
arrange(seasons)
home_team n seasons minyr maxyr
<chr> <int> <int> <dbl> <dbl>
1 LAC 2695 2 2017 2018
2 JAX 4084 3 2016 2018
3 LA 4241 3 2016 2018
4 STL 9842 7 2009 2015
5 JAC 10035 8 2009 2016
6 SD 11007 8 2009 2016
7 ARI 14251 10 2009 2018
8 ATL 13871 10 2009 2018
9 BAL 14342 10 2009 2018
So we can see that we need to deal with the Chargers, Jags, and Rams, since these are the cases with fewer than the 10 full seasons (for example, the Rams changed from "STL" to "LA"). One way to do that:
pbp_all <- pbp_all %>%
mutate_at(vars(home_team, away_team, posteam, defteam), funs(case_when(
. %in% "JAX" ~ "JAC",
. %in% "STL" ~ "LA",
. %in% "SD" ~ "LAC",
TRUE ~ .
)))
Re-running the command above verifies that this worked: every team now has 10 seasons. What we did: fix all the home team, away team, possession team, and defensive team names so they're consistent over time.
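One caveat: funs() has been deprecated since dplyr 0.8.0, so newer versions will warn about the code above. Below is a sketch of the same recoding written with across(), assuming dplyr 1.0.0 or later. It runs on a tiny made-up table with only two of the four team columns, so treat it as an illustration rather than a drop-in replacement.

```r
library(dplyr)

# Made-up games using a mix of old and new team codes
toy <- data.frame(
  home_team = c("STL", "JAX", "SEA"),
  away_team = c("SD", "LA", "JAC"),
  stringsAsFactors = FALSE
)

fixed <- toy %>%
  mutate(across(c(home_team, away_team), ~ case_when(
    .x == "JAX" ~ "JAC",   # same mapping as above
    .x == "STL" ~ "LA",
    .x == "SD"  ~ "LAC",
    TRUE        ~ .x       # everything else stays as it is
  )))

fixed
```

case_when checks the conditions top to bottom and uses the first one that matches, so the TRUE line at the end acts as the "otherwise" case.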
At this point, we have a lot of data sitting in memory and we don't want to have to re-download it every time we open R. So let's save the data and then load it again, so we know how.
saveRDS(pbp_all, file="FILENAME.rds")
pbp_all <- readRDS("FILENAME.rds")
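If you want to convince yourself that nothing gets lost along the way, here is a small round trip on a made-up data frame. It writes to a temporary file (via tempfile) so it won't clutter your folders.

```r
# A tiny made-up data frame
toy <- data.frame(posteam = c("SEA", "KC"), epa = c(0.1, 0.3))

# Save to a temporary .rds file, then read it straight back
path <- tempfile(fileext = ".rds")
saveRDS(toy, file = path)
toy_again <- readRDS(path)

# The copy we read back is the same as the original
identical(toy, toy_again)
```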
Since we're back to working with the raw data from the seasons we downloaded, we need to repeat all our cleaning above on this new big data set. Here's how to do it all in one command:
pbp_all_rp <- pbp_all %>%
filter(!is_na(epa), !is_na(posteam), play_type=="no_play" | play_type=="pass" | play_type=="run") %>%
mutate(
pass = if_else(str_detect(desc, "( pass)|(sacked)|(scramble)"), 1, 0),
rush = if_else(str_detect(desc, "(left end)|(left tackle)|(left guard)|(up the middle)|(right guard)|(right tackle)|(right end)") & pass == 0, 1, 0),
success = ifelse(epa>0, 1 , 0),
passer_player_name = ifelse(play_type == "no_play" & pass == 1,
str_extract(desc, "(?<=\\s)[A-Z][a-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?(?=\\s((pass)|(sack)|(scramble)))"),
passer_player_name),
receiver_player_name = ifelse(play_type == "no_play" & str_detect(desc, "pass"),
str_extract(desc, "(?<=to\\s)[A-Z][a-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?"),
receiver_player_name),
rusher_player_name = ifelse(play_type == "no_play" & rush == 1,
str_extract(desc, "(?<=\\s)[A-Z][a-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?(?=\\s((left end)|(left tackle)|(left guard)|(up the middle)|(right guard)|(right tackle)|(right end)))"),
rusher_player_name),
name = ifelse(!is_na(passer_player_name), passer_player_name, rusher_player_name),
yards_gained=ifelse(play_type=="no_play",NA,yards_gained),
play=1
) %>%
filter(pass==1 | rush==1)
I've added a line replacing the yards_gained variable with NA on plays with penalties because the value is a misleading 0 yards on all of these plays and making it NA makes it easier to exclude them. Now we can do stuff with our multiple years of data.
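Here is a quick sketch of why NA is handier than a fake 0, using three made-up yardage values where the middle play was wiped out by a penalty:

```r
# Made-up yardage; the middle play was nullified by a penalty
yards_with_zero <- c(5, 0, 7)
yards_with_na   <- c(5, NA, 7)

# The fake 0 silently drags the average down
mean(yards_with_zero)

# With NA, a plain mean() refuses to answer, which at least flags the problem
mean(yards_with_na)

# na.rm = TRUE skips the missing play and averages the other two
mean(yards_with_na, na.rm = TRUE)
```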
Let's see how Aaron Rodgers' efficiency has changed from 2009-2014 to 2015-2018:
pbp_all_rp %>% filter(pass==1 & !is.na(passer_player_name))%>% mutate(
arod = if_else(posteam=="GB"&passer_player_name=="A.Rodgers",1,0),
early = if_else(season<=2014,1,0)
) %>%
group_by(arod,early) %>%
summarize(mean_epa=mean(epa), ypp=mean(yards_gained, na.rm = TRUE)) %>% arrange(-early)
arod early mean_epa ypp
<dbl> <dbl> <dbl> <dbl>
1 0 1 0.0280 6.25
2 1 1 0.255 7.41
3 0 0 0.0605 6.37
4 1 0 0.123 6.16
In the above, we divided the sample into two periods (2009-2014 and 2015-2018), and into Rodgers and non-Rodgers dropbacks. We see that Rodgers had a 0.26 to 0.03 EPA/play advantage and a 7.41 to 6.25 yards/dropback advantage over the other QBs in the league from 2009-2014, but only a 0.12 to 0.06 EPA/play advantage and a 6.16 to 6.37 yards/dropback disadvantage in the most recent four seasons.
By now, you have the tools to do a bunch of interesting things with nflscrapR data. If you're wondering what to do next, either pick a question that's interesting to you and try to answer it (how well does passing and rushing EPA/play in the first half of a season predict a team's passing and rushing EPA/play in the second half?), or try to replicate things you come across on twitter.
For other ideas, you can view the list of variables using
names(pbp_all_rp)
[1] "play_id"
[2] "game_id"
[3] "home_team"
[4] "away_team"
[5] "posteam"
[6] "posteam_type"
[7] "defteam"
[8] "side_of_field"
[9] "yardline_100"
[10] "game_date"
...
I've truncated the output since there are more than 200 variables, but this gives you an idea of the things included.
Happy coding!
- To get the cleaned data: All the cleaning here in one place
- Investigation of QB hits
- Making team-specific EPA/play timelines
- Lee Sharpe: Draft Picks, Draft Values, Games, Logos, Rosters, Standings
- greerre: how to get .csv file of weather & stadium data from PFR in python
- Parker Fleming: Introduction to College Football Data with R and cfbscrapR
- Introduction to R (recommended)
- Lee Sharpe: basic intro to R and RStudio
- Lee Sharpe: lots of useful NFL / nflscrapR code
- Lee Sharpe: how to update current season games
- Thomas Mock: pretty #viz with nflscrapR
- Thomas Mock: biggest comebacks
- Josh Hermsmeyer: Getting Started with R for NFL Analysis
- Slavin: visualizing positional tiers in SFB9
- Ron Yurko: assorted examples
- CowboysStats: defensive playmaking EPA
- Michael Lopez: function to sample plays
- Michael Lopez: R for NFL analysis (presentation to club staffers)
- Mitchell Wesson: QB hits investigation
- Mitchell Wesson: Investigation of the nflscrapR EP model
- WHoffman: graphs for receivers (aDoT, success rate, and more)
- ChiBearsStats: investigation of 3rd downs vs offensive efficiency
- ChiBearsStats: the insignificance of field goal kicking
- Deryck97: nflscrapR Python Guide
- Cory Jez: animated plot
- 903124S: Sampling EP
- 903124S: estimating EPA using nfldb
- 903124S: estimate EPA for college football
- Blake Atkinson: explosiveness blog post and python code
- Blake Atkinson: player type visualizations blog post and python code
I think "install.packages("dplyr")" and "library(dplyr)" might be redundant because tidyverse includes dplyr