10 MLB Pitchers 2015-2019
Motivation
Baseball is an extremely data-rich sport. Because play proceeds in discrete events (pitch by pitch) rather than continuously, data can be collected at nearly every point in a game, which makes the sport well suited to analysis.
Baseball Background
A baseball game is split into innings. There are usually nine innings, each of which has a top and a bottom half. In the top of the inning, the home team pitches (throws the baseball) and the away team bats (attempts to hit the thrown baseball with a bat); the teams switch roles for the bottom of the inning. A team can score runs only during its turn at bat. The goal is to get players around the three bases and back to home plate (where they bat from). At the most basic level, players fall into two categories: pitchers and batters. This data set deals only with pitchers, the players responsible for throwing the pitches that the batter must hit to advance around the field.
Pitchers
The pitchers included in this data set are 10 standout pitchers from the last decade: Clayton Kershaw, Justin Verlander, Max Scherzer, Chris Sale, Madison Bumgarner, Cole Hamels, Zack Greinke, David Price, Jacob deGrom, and Jon Lester. To avoid problems caused by the shortened 2020 season (due to COVID-19) and by seasons that certain pitchers missed due to injury (for example, Justin Verlander in 2021), this data set covers the 2015-2019 seasons. In each of these years, every pitcher threw more than 2,000 pitches, providing ample data.
Data
The pitchers data set contains 33,289 rows and 7 columns.
Each row in the pitchers data set represents a pitch thrown by one of the pitchers during the 2015-2019 seasons. Pitch types are identified by abbreviation. The most common types are four-seam fastballs (FF), sinkers (SI), changeups (CH), sliders (SL), and curveballs (CU). Most of the remaining abbreviations can be found here. The result column is also in shorthand: B means ball, S means strike, and X means in play.
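The abbreviations above can be expanded into readable labels with a named lookup vector. The following is a minimal sketch in base R, not part of the original preparation script; the `toy` data frame and its values are invented for illustration, and only the codes listed above are included in the lookups.

```r
# Toy rows standing in for pitchers.csv (values invented for illustration)
toy <- data.frame(
  pitch_type = c("FF", "SL", "CH", "CU", "SI"),
  result     = c("S", "B", "X", "S", "B"),
  stringsAsFactors = FALSE
)

# Lookup vectors built from the abbreviations described in the text
pitch_labels <- c(
  FF = "four-seam fastball",
  SI = "sinker",
  CH = "changeup",
  SL = "slider",
  CU = "curveball"
)
result_labels <- c(B = "ball", S = "strike", X = "in play")

# Index the named vectors by code; codes missing from a lookup become NA
toy$pitch_label  <- unname(pitch_labels[toy$pitch_type])
toy$result_label <- unname(result_labels[toy$result])

print(toy)
```

Any pitch type not covered by the lookup shows up as `NA`, which makes it easy to spot the less common abbreviations that still need a label.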
Variable Descriptions
Variable | Description |
---|---|
player_name | Pitcher name |
pitch_type | Type of pitch in shorthand |
result | Shorthand for result of pitch |
release_speed | Out-of-hand pitch velocity (mph) |
description | Short description of pitch outcome |
count | Ball-strike count at time of pitch, written balls-strikes (e.g. 3-2) |
outs_when_up | The out count at time of pitch |
Download Pitching Data: pitchers.csv
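Because the count column stores balls and strikes as a single string such as 3-2, it can be useful to split it back into two numeric columns before analysis. This is a minimal base R sketch under that assumption; the `toy` data frame and the `two_strikes` flag are hypothetical names introduced here for illustration.

```r
# Toy values in the same "balls-strikes" format as the count column
toy <- data.frame(count = c("0-0", "3-2", "1-2"), stringsAsFactors = FALSE)

# Split each count on the hyphen, then pull out and convert each piece
parts <- strsplit(toy$count, "-", fixed = TRUE)
toy$balls   <- as.integer(vapply(parts, `[`, character(1), 1))
toy$strikes <- as.integer(vapply(parts, `[`, character(1), 2))

# Example derived variable: is the batter one strike from striking out?
toy$two_strikes <- toy$strikes == 2L

print(toy)
```

With the real file, the same steps would apply to the data frame returned by `read.csv("pitchers.csv")`, assuming the file has been downloaded to the working directory.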
Questions
- Are some pitchers more likely to throw certain types of pitches?
- Which player throws the fastest, on average?
- Does the out count have an impact on the type of pitch thrown?
- Can you predict how fast a pitch will be based on the pitcher, if you account for pitch type?
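One possible starting point for each of these questions is sketched below in base R. It is only an illustration under stated assumptions: the `toy` data frame uses made-up pitchers ("Pitcher A", "Pitcher B") and invented speeds rather than the real data, and the additive linear model is just one of several reasonable ways to account for pitch type.

```r
# Toy data with the same column names as the pitchers data set
# (pitchers and values are invented for illustration)
toy <- data.frame(
  player_name   = c("Pitcher A", "Pitcher A", "Pitcher A",
                    "Pitcher B", "Pitcher B", "Pitcher B"),
  pitch_type    = c("FF", "FF", "SL", "FF", "SL", "SL"),
  release_speed = c(95, 97, 88, 92, 85, 87),
  outs_when_up  = c(0, 1, 2, 0, 1, 2),
  stringsAsFactors = FALSE
)

# Question 1: share of each pitch type within each pitcher (rows sum to 1)
pitch_mix <- prop.table(table(toy$player_name, toy$pitch_type), margin = 1)

# Question 2: mean release speed per pitcher
avg_speed <- tapply(toy$release_speed, toy$player_name, mean)

# Question 3: counts of each pitch type at each out count
by_outs <- table(toy$outs_when_up, toy$pitch_type)

# Question 4: model speed by pitcher while adjusting for pitch type
fit <- lm(release_speed ~ player_name + pitch_type, data = toy)

print(pitch_mix)
print(avg_speed)
print(by_outs)
print(coef(fit))
```

On the full data set, a chi-squared test on the tables from questions 1 and 3, or a comparison of models with and without `player_name` for question 4, would be natural next steps.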
References
Petti B, Gilani S (2025). baseballr: Acquiring and Analyzing Baseball Data. R package version 1.6.0, commit af84f6deaf5115490791936abcbf11f3586b4597, https://github.com/mlascaleia/baseballr.
Code
# Install baseballr if needed
# if (!requireNamespace('devtools', quietly = TRUE)){
# install.packages('devtools')
# }
# devtools::install_github(repo = "BillPetti/baseballr")
library(tidyverse)
library(baseballr)
# download each pitcher's data
# See https://billpetti.github.io/baseballr/articles/using_statcast_pitch_data.html#find-corbin-burnes-mlbam-id
# for an example on how to get a specific player's id
CK <- statcast_search_pitchers(start_date = "2015-04-05", end_date = "2019-10-30", pitcherid = 477132)
JV <- statcast_search_pitchers(start_date = "2015-04-05", end_date = "2019-10-30", pitcherid = 434378)
MS <- statcast_search_pitchers(start_date = "2015-04-05", end_date = "2019-10-30", pitcherid = 453286)
CS <- statcast_search_pitchers(start_date = "2015-04-05", end_date = "2019-10-30", pitcherid = 519242)
MB <- statcast_search_pitchers(start_date = "2015-04-05", end_date = "2019-10-30", pitcherid = 518516)
CH <- statcast_search_pitchers(start_date = "2015-04-05", end_date = "2019-10-30", pitcherid = 430935)
ZG <- statcast_search_pitchers(start_date = "2015-04-05", end_date = "2019-10-30", pitcherid = 425844)
DP <- statcast_search_pitchers(start_date = "2015-04-05", end_date = "2019-10-30", pitcherid = 456034)
JD <- statcast_search_pitchers(start_date = "2015-04-05", end_date = "2019-10-30", pitcherid = 594798)
JL <- statcast_search_pitchers(start_date = "2015-04-05", end_date = "2019-10-30", pitcherid = 452657)
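# Combine all ten pitchers and keep only the columns used in the data set
pitchers <-
  bind_rows(CK, JV, MS, CS, MB, CH, ZG, DP, JD, JL) |>
  # Keep the seven variables of interest, renaming `type` to `result`
  select(player_name, pitch_type, result = type,
         release_speed, description,
         balls, strikes, outs_when_up) |>
  # Merge balls and strikes into a single "balls-strikes" count, e.g. "3-2"
  unite(count, balls, strikes, sep = "-")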