IPND - Data Analysis Project - Baseball Statistics

An analysis of team ERA, and its relation (if any) to regular season success

Hypothesis:

As Team ERA goes down, an average team should expect to win more games. And the opposite is also true, as Team ERA goes up, an average team should expect to lose more games.

I am specifically only looking at regular season win percentage because of its unique properties compared to the playoffs. In the playoffs a team can rely on their best pitchers to pitch the majority of innings, during the regular season this is not feasible because of the extremely long, grueling schedule MLB has.

The data will only be analyzed from 1962 to the present. In 1962, MLB expanded the regular season length to the (still used) 162 games. This approach will simplify the analysis, and reduce errors.

Team ERA will be the primary variable examined because it is a summary statistic for the performance of a team's entire pitching staff. This one number succinctly describes how a team's pitchers did during any given season. I based the above hypothesis on the idea that if a team's pitchers do well, then the team should expect to win more games.

I will divide my notes below into two sections Process and Analysis.

Process

I'll start with importing data analysis packages, importing the actual data set csv to a pandas dataframe, and calling .info to get a quick glance at the default data formatting and layout.

In [16]:
#import packages, and the csv data file. 'inline' means the notebook will render plots in this window, not open a new one
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

teams_df = pd.read_csv(r"C:\Users\Mike\Desktop\udacity\baseball_stats\data\Teams.csv")
#I'll use .info to take a quick look at what the data contains
teams_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2805 entries, 0 to 2804
Data columns (total 48 columns):
yearID            2805 non-null int64
lgID              2755 non-null object
teamID            2805 non-null object
franchID          2805 non-null object
divID             1288 non-null object
Rank              2805 non-null int64
G                 2805 non-null int64
Ghome             2406 non-null float64
W                 2805 non-null int64
L                 2805 non-null int64
DivWin            1260 non-null object
WCWin             624 non-null object
LgWin             2777 non-null object
WSWin             2448 non-null object
R                 2805 non-null int64
AB                2805 non-null int64
H                 2805 non-null int64
2B                2805 non-null int64
3B                2805 non-null int64
HR                2805 non-null int64
BB                2805 non-null int64
SO                2685 non-null float64
SB                2661 non-null float64
CS                1946 non-null float64
HBP               480 non-null float64
SF                480 non-null float64
RA                2805 non-null int64
ER                2805 non-null int64
ERA               2805 non-null float64
CG                2805 non-null int64
SHO               2805 non-null int64
SV                2805 non-null int64
IPouts            2805 non-null int64
HA                2805 non-null int64
HRA               2805 non-null int64
BBA               2805 non-null int64
SOA               2805 non-null int64
E                 2805 non-null int64
DP                2488 non-null float64
FP                2805 non-null float64
name              2805 non-null object
park              2771 non-null object
attendance        2526 non-null float64
BPF               2805 non-null int64
PPF               2805 non-null int64
teamIDBR          2805 non-null object
teamIDlahman45    2805 non-null object
teamIDretro       2805 non-null object
dtypes: float64(10), int64(25), object(13)
memory usage: 1.0+ MB

Process

I will create a quick view of the table to be familiar with the layout and to make sure it has the data I want.

In [17]:
#First step is to reduce the full table down to only the columns I am concerned with
sorted_teams_df = teams_df[['name', 'yearID', 'W', 'ERA']].copy()
sorted_teams_df.head()
Out[17]:
name yearID W ERA
0 Boston Red Stockings 1871 20 3.55
1 Chicago White Stockings 1871 19 2.76
2 Cleveland Forest Citys 1871 10 4.11
3 Fort Wayne Kekiongas 1871 7 5.17
4 New York Mutuals 1871 16 3.72

Process

I want to check the data types to make sure they are the ones I am expecting.

In [18]:
#Check the datatypes
sorted_teams_df.dtypes
Out[18]:
name       object
yearID      int64
W           int64
ERA       float64
dtype: object

Process

Columns look good and the values are the correct types.

Next, I need to filter the 'yearID' from 1962 up to the present (or at least the more recent year present in the data). I will exclude 1981, 1994, and 1995 as these were strike-shortened seasons and will unnecessarily complicate the data.

In [19]:
#Filter the data to 1962 - present, with 1981, 1994, 1995 excluded due to being strike seasons
filtered_teams_df = sorted_teams_df[(sorted_teams_df['yearID'] >= 1962) & (sorted_teams_df['yearID'] != 1981) & (sorted_teams_df['yearID'] != 1994) & (sorted_teams_df['yearID'] != 1995)]
filtered_teams_df.head()
Out[19]:
name yearID W ERA
1377 Baltimore Orioles 1962 77 3.69
1378 Boston Red Sox 1962 76 4.22
1379 Chicago White Sox 1962 85 3.73
1380 Chicago Cubs 1962 59 4.54
1381 Cincinnati Reds 1962 98 3.75

Process

There were some teams that changed their mascot name, or moved cities during the survey period. In order to do a team-by-team comparison I need to consolidate the team names to one representative franchise name. I arbitrarily chose to use the current name.

In [20]:
#Replace the old team name with the currently used team name so that I have one data set for each team
filtered_teams_df.replace({'name' : { "Milwaukee Braves" : "Atlanta Braves", "Florida Marlins" : "Miami Marlins", "Seattle Pilots" : "Seattle Mariners", "Washington Senators" : "Washington Nationals", "Tampa Bay Devil Rays" : "Tampa Bay Rays", "Houston Colt .45's" : "Houston Astros", "California Angels" : "Los Angeles Angels", "Los Angeles Angels of Anaheim" : "Los Angeles Angels", "Kansas City Athletics" : "Oakland Athletics"}}, inplace=True)
filtered_teams_df.head(10)
Out[20]:
name yearID W ERA
1377 Baltimore Orioles 1962 77 3.69
1378 Boston Red Sox 1962 76 4.22
1379 Chicago White Sox 1962 85 3.73
1380 Chicago Cubs 1962 59 4.54
1381 Cincinnati Reds 1962 98 3.75
1382 Cleveland Indians 1962 80 4.14
1383 Detroit Tigers 1962 85 3.81
1384 Houston Astros 1962 64 3.83
1385 Oakland Athletics 1962 72 4.79
1386 Los Angeles Angels 1962 86 3.70

Analysis

I can see 'Oakland Athletics' and 'Houston Astros' in the above slice so I know the replace worked (in 1962, the year present in the slice, both teams used different names).

Process

I don't expect any missing values in this dataset, but I'll check. A return of 'False' means that no null values were found in the data.

In [21]:
#return a false means no null values, true means a return of at least 1 null value
filtered_teams_df.isnull().any().any()
Out[21]:
False

Process

I want to take a look at some summary statistics for the data so I will call .describe to see some common descriptive statistics.

In [22]:
#next 3 lines are for later use in a different test below
era_data = filtered_teams_df.copy()
del era_data['W']
del era_data['name']

#Check out some summary statistics for the data as a whole
filtered_teams_df.describe()
Out[22]:
yearID W ERA
count 1346.000000 1346.000000 1346.000000
mean 1990.329866 80.850669 3.985386
std 15.692717 11.409153 0.585281
min 1962.000000 40.000000 2.450000
25% 1977.000000 73.000000 3.570000
50% 1991.000000 81.000000 3.940000
75% 2004.000000 89.000000 4.360000
max 2015.000000 116.000000 6.380000

Analysis

I see that mean Wins is about 81, which is the expected value. Based on my prior knowledge of Major League Baseball, the min and max values for Wins and ERA are certainly within expected boundaries.

Process

I would like to get an idea of how the ERA values are distributed across the data. I will use a histogram to model the frequency of Team ERA values in the data set.

In [23]:
#Create a pandas series to hold the era data, then plot it in a histogram
era_series = filtered_teams_df['ERA']
plt.hist(era_series)
plt.suptitle('ERA Frequencies from Sample Data', fontsize = 14)
plt.xlabel('Team ERA', fontsize = 12)
plt.ylabel('Frequency in Data Distribution', fontsize = 12)
Out[23]:
<matplotlib.text.Text at 0xc627b70>

Analysis

The above histogram shows the frequency of the Team ERA value (tabulated at the end of each season, for each team during the survey period of 1962-2015). I can see from this histogram the most common ERA's in the data set fall between 3.5 and 4.0 earned runs allowed per game and the distribution is roughly symmetrical with ERA values falling both above and below the calculated mean of 3.99 ERA.

Process

Next, I want to standardize the Wins and ERA columns so that the variables can be compared along the same scale. 'Wins' are not generally comparable to 'ERA', they are measures of different 'units' (earned run average, and wins). Additionally, the standard deviation for 'Wins' is 11.41, the standard deviation for 'ERA' is .59. By standardizing these values, both variables will have the same weight in a comparison.

In [24]:
#Only interested in standardizing 'W' and 'ERA', 2nd line performs the actual standardizing
columns_to_norm = ['W', 'ERA']
filtered_teams_df[columns_to_norm] = filtered_teams_df[columns_to_norm].apply(lambda x: (x - x.mean()) / (x.std()))
filtered_teams_df.head()
Out[24]:
name yearID W ERA
1377 Baltimore Orioles 1962 -0.337507 -0.504692
1378 Boston Red Sox 1962 -0.425156 0.400857
1379 Chicago White Sox 1962 0.363684 -0.436348
1380 Chicago Cubs 1962 -1.915188 0.947603
1381 Cincinnati Reds 1962 1.503121 -0.402177

Process

Below I will complete a preliminary scatter plot using the standardized values to illustrate any general trends in the data set. The scatter plot will clearly demonstrate the correlation between the variables (positive, negative, or none).

In [25]:
#code to product a scatterplot, scatter(x values, y values) is the formatting. the other lines add axis titles
plt.scatter(filtered_teams_df['W'], filtered_teams_df['ERA'])
plt.suptitle('Prelim Scatter Plot', fontsize = 14)
plt.xlabel('Number of wins', fontsize = 12)
plt.ylabel('Team ERA', fontsize = 12)
Out[25]:
<matplotlib.text.Text at 0xa288908>

Analysis

The above scatterplot shows the distribution of wins compared to Team ERA throughout the sample data. Each plot point represents a snapshot of one team during one season, comparing ERA value along the Y-axis to number of wins along the X-axis.

The general distribution (clustering around 0,0) suggests that with an average Team ERA (~3.99), a sample team can expect an average number of wins (~81 wins). The plot can further be intrepreted to suggest that there is a negative correlation between Team ERA and Number of Wins. As Team ERA goes up, number of wins goes down. This makes sense and agrees with my original hypothesis because the more runs your team is allowing the other team to score the harder it will be for your team to win the game.

The opposite can also be inferred - that number of wins should go up as Team ERA goes down. The above scatterplot essentially means any number of wins above 0 correlates to a season where the team finished above .500 (I know this because the mean number of wins is ~81, which is half the games in a season), and had a decent chance of making the playoffs (ie. what most teams would consider a successful season). Number of wins below 0 means the team finished under .500 and most likely did not make the playoffs.

Process

Now I want to plot the top 5 teams by Team ERA, and the bottom 5 by Team ERA, and see how their win totals compare. This analysis should further demonstrate the correlation between the two tested variables, Team ERA, and number of wins/season. I expect the top 5 teams will average more wins/year, than the bottom 5.

First, I need to make a new dataframe with the relevant data.

In [26]:
#pivot the data to organize it by team, the mean values for era and W are computed. Sort the data, and take top and bottom 5 teams to produce plotting data
pivot_teams = pd.pivot_table(filtered_teams_df, index=['name'], aggfunc=np.mean)
pivot_teams_sorted = pivot_teams.sort_values(by=('ERA'))
top_five_era = pivot_teams_sorted.head()
bottom_five_era = pivot_teams_sorted.tail()
plotting_data = pd.concat([top_five_era, bottom_five_era])
del plotting_data['yearID']
plotting_data = plotting_data.reset_index()
plotting_data
Out[26]:
name ERA W
0 Los Angeles Dodgers -0.979409 0.523515
1 St. Louis Cardinals -0.523788 0.459926
2 New York Mets -0.394471 -0.275637
3 San Francisco Giants -0.391456 0.289784
4 Atlanta Braves -0.348909 0.203854
5 Seattle Mariners 0.568944 -0.477271
6 Texas Rangers 0.593385 -0.119454
7 Anaheim Angels 0.719080 0.188387
8 Tampa Bay Rays 0.771049 -0.503066
9 Colorado Rockies 1.731923 -0.496110

Process

Then I can plot it.

In [27]:
#plot the data using a scatter plot. The for loop adds the team name annotation based on the index location of the plot point
plotting_x = plotting_data['W']
plotting_y = plotting_data['ERA']
plotting_data_label = ['Dodgers', 'Cardinals', 'Mets', 'Giants', 'Braves', 'Mariners', 'Rangers', 'Angels', 'Rays', 'Rockies']

fix, ax = plt.subplots()
ax.scatter(plotting_x, plotting_y)
plt.suptitle('Team ERA vs. Number of Wins', fontsize = 14)
plt.xlabel('Standardized Number of wins', fontsize = 12)
plt.ylabel('Standardized Team ERA', fontsize = 12)

for i, txt in enumerate(plotting_data_label):
    ax.annotate(txt, (plotting_x[i], plotting_y[i]))

Analysis

The above plot shows Standardized Team ERA along the Y-axis, and standardized number of wins along the X-axis for the top 5 and bottom 5 Team ERA averages over the sample period of 1965-2015.

The trend here is pretty clear: generally speaking, lowering your team's ERA will result in winning more regular season games, which strengthens my hypothesis and leads me to belive further, more concrete analysis is warranted to investigate and confirm the possible correlation.

However, the Mets and Angels are clear outliers amongst this grouping. The Mets' team ERA was pretty decent (third best average actually out of any team), but their average win rate was below .500. This illustrates one of the key points to keep in mind when reviewing baseball statistics. I believe that no stat can be taken in isolation. There are so many factors (front-office personnel, weather at the game, hot and cold streaks for players, and on and on and on) that affect which team wins. Making general recommendations, or conclusions based on most baseball stats is fine, but I don't think any definite conclusions should be drawn.

The Angels are also an outlier, but in the opposite direction as the Mets. For the Angels, their team ERA was the third worst of any team over the survey period but their number of wins/year was right in line with the teams with the best average ERA's. Again, outliers are to be expected in baseball, especially when the statistics are being compared only to each other and not factoring in any outside elements. One potential explanation for the Angels plot, is that their batting/offensive statistics were so good that they were able to overcome their deficiencies in pitching. Overall batting averages per team, or on-base percetage totals per team would be a good area to expand the scope of this analysis to find further correlations to expect win totals per season.

Process

If a team decides that they want to focus on lowering their Team ERA, how will these efforts fit into the overall Team ERA trends in MLB over the years? To get a sense of this metric I will plot the average team ERA/year since 1962 and see what the trend is. My hypothesis is that the average ERA will be trending downward over the years due to better practice/training/exercise/and medical processes developed over the last 50+ years.

In [28]:
#creates a bar graph of the mean ERA value per year, format for creating a bar graph is .bar(xvalue, yvalue)
pivot_era = pd.pivot_table(era_data, index=['yearID'], aggfunc=np.mean)
pivotera = pivot_era.reset_index()

plt.bar(pivotera['yearID'], pivotera['ERA'])
plt.suptitle('Graph of Average Team ERA by Year', fontsize = 14)
plt.xlabel('Year', fontsize = 12)

plt.ylabel('Team ERA', fontsize = 12)
Out[28]:
<matplotlib.text.Text at 0xc5526a0>

Analysis

The above bar graph illustrates average team ERA calculated from every team per year. Essentially, all teams had their Team ERA averaged into one overall average for all teams. Each bar on the graph represents the average Team ERA during a different year, with gaps where the data was removed to account for seasons where a strike disrupted play.

The trend illustrated by the graph was not what I expected at all! From near the beginning of the survey period (1962) until the early 2000's the overall trend is upward, until it peaks near ~2000. This trend very clearly mirrors the incresing use of performance-enhancing drugs, or steriods, in MLB through the 70's, and especially during the 80's and 90's.

After the peak the overall trend is slightly downwards which corresponds to an increase in steroid testing in MLB and more severe penalties for steroid users. This presumably had the desired effect of reducing usage of steriods in MLB, which would curb some of the frankly outrageous offensive batting statistics put forth by certain players during the 80's and 90's.

The observed trend in the data does not necessarily agree with my original hypothesis, but I did not consider the balooning effect of steriod usage on Team ERA would be inferred here.

Conclusion

If an MLB team is looking to improve their win percentage, would lowering their Team ERA help them win more games?

I think the data has clearly shown that, yes, teams that win more games have lower team ERA's, on average. This is not to say that lowering ERA CAUSES a team to win more games. There are many factors that go into which team wins a MLB game, factors such as fielding percentage, batting stats, and countless others. Team ERA is just one measure of a team's performance.

This analysis was purposely limited to only team ERA/number of wins and only to data going back to 1962. I did this to get a sense of the correlation during the modern baseball period. This, of course, does not take into account the almost 100 years of games played before I began my analysis. For a more complete picture, I could re-run the study for the entire dataset streching back to the 1800's.

The advantage of my approach is that the length of each season during my sample period was the same (162 games), this simplified the dataset significantly, reducing the liklihood of introducing errors. I also chose to ignore data during strike years. I was not interested in the data for these years because they are not representative of a 'normal' baseball season. A baseball General Manager would never make special 'plans' for a strike season because they are never predictable. The hypothetical General Manager would instead, plan the way his/her team would run based on 'normal', predictable seasons, which last 162 games in the modern era.

As mentioned throughout the analysis there are many other variables that may affect who wins or loses a baseball game, including factors outside the game, like steroids! There are many other variables not included in this dataset that could have been useful in drawing stronger conclusions. Some examples of these other variables are average runs scored per game, team batting average, team fielding percentage, and/or team on-base percentage.

A good follow-up to this analysis might be seeing how some or all of those other variables relate to win percentage over the regular season and then finding out which one of the examined variables correlates the strongest with winning more games.

That one stat would be the area of the team I would spend the most time/energy/and money to improve in the future.

List of resources used to compile this analysis:

https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook#gs.fSMrzAY

Python for Data Analysis - Wes McKinney

https://www.cheatography.com/weidadeyue/cheat-sheets/jupyter-notebook/

http://stackoverflow.com/questions/17661136/csv-does-not-exist-pandas-dataframe

http://stackoverflow.com/questions/34682828/pandas-extracting-specific-selected-columns-from-a-dataframe-to-new-dataframe

http://pbpython.com/excel-pandas-comp-2.html

http://stackoverflow.com/questions/29530232/python-pandas-check-if-any-value-is-nan-in-dataframe

http://pandas.pydata.org/pandas-docs/stable/basics.html

http://stackoverflow.com/questions/12444716/how-do-i-set-the-figure-title-and-axes-labels-font-size-in-matplotlib

https://en.wikipedia.org/wiki/MLB_lockout

http://stackoverflow.com/questions/12525722/normalize-data-in-pandas

http://pbpython.com/pandas-pivot-table-explained.html

http://stackoverflow.com/questions/23307301/pandas-replacing-column-values-in-dataframe

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html#pandas.DataFrame.sort_values

http://chrisalbon.com/python/pandas_join_merge_dataframe.html

http://stackoverflow.com/questions/13411544/delete-column-from-pandas-dataframe

http://stackoverflow.com/questions/16167829/in-pandas-how-can-i-reset-index-without-adding-a-new-column

http://stackoverflow.com/questions/14432557/matplotlib-scatter-plot-with-different-text-at-each-data-point

https://www.biomedware.com/files/documentation/Preparing_data/Why_standardize_variables.htm