I am specifically only looking at regular season win percentage because of its unique properties compared to the playoffs. In the playoffs a team can rely on their best pitchers to pitch the majority of innings, during the regular season this is not feasible because of the extremely long, grueling schedule MLB has.
The data will only be analyzed from 1962 to the present. In 1962, MLB expanded the regular season length to the (still used) 162 games. This approach will simplify the analysis, and reduce errors.
Team ERA will be the primary variable examined because it is a summary statistic for the performance of a team's entire pitching staff. This one number succinctly describes how a team's pitchers did during any given season. I based the above hypothesis on the idea that if a team's pitchers do well, then the team should expect to win more games.
I will divide my notes below into two sections Process and Analysis.
Process
I'll start with importing data analysis packages, importing the actual data set csv to a pandas dataframe, and calling .info to get a quick glance at the default data formatting and layout.
#import packages, and the csv data file. 'inline' means the notebook will render plots in this window, not open a new one
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
teams_df = pd.read_csv(r"C:\Users\Mike\Desktop\udacity\baseball_stats\data\Teams.csv")
#I'll use .info to take a quick look at what the data contains
teams_df.info()
Process
I will create a quick view of the table to be familiar with the layout and to make sure it has the data I want.
#First step is to reduce the full table down to only the columns I am concerned with
sorted_teams_df = teams_df[['name', 'yearID', 'W', 'ERA']].copy()
sorted_teams_df.head()
Process
I want to check the data types to make sure they are the ones I am expecting.
#Check the datatypes
sorted_teams_df.dtypes
Process
Columns look good and the values are the correct types.
Next, I need to filter the 'yearID' from 1962 up to the present (or at least the more recent year present in the data). I will exclude 1981, 1994, and 1995 as these were strike-shortened seasons and will unnecessarily complicate the data.
#Filter the data to 1962 - present, with 1981, 1994, 1995 excluded due to being strike seasons
filtered_teams_df = sorted_teams_df[(sorted_teams_df['yearID'] >= 1962) & (sorted_teams_df['yearID'] != 1981) & (sorted_teams_df['yearID'] != 1994) & (sorted_teams_df['yearID'] != 1995)]
filtered_teams_df.head()
Process
There were some teams that changed their mascot name, or moved cities during the survey period. In order to do a team-by-team comparison I need to consolidate the team names to one representative franchise name. I arbitrarily chose to use the current name.
#Replace the old team name with the currently used team name so that I have one data set for each team
filtered_teams_df.replace({'name' : { "Milwaukee Braves" : "Atlanta Braves", "Florida Marlins" : "Miami Marlins", "Seattle Pilots" : "Seattle Mariners", "Washington Senators" : "Washington Nationals", "Tampa Bay Devil Rays" : "Tampa Bay Rays", "Houston Colt .45's" : "Houston Astros", "California Angels" : "Los Angeles Angels", "Los Angeles Angels of Anaheim" : "Los Angeles Angels", "Kansas City Athletics" : "Oakland Athletics"}}, inplace=True)
filtered_teams_df.head(10)
Analysis
I can see 'Oakland Athletics' and 'Houston Astros' in the above slice so I know the replace worked (in 1962, the year present in the slice, both teams used different names).
Process
I don't expect any missing values in this dataset, but I'll check. A return of 'False' means that no null values were found in the data.
#return a false means no null values, true means a return of at least 1 null value
filtered_teams_df.isnull().any().any()
Process
I want to take a look at some summary statistics for the data so I will call .describe to see some common descriptive statistics.
#next 3 lines are for later use in a different test below
era_data = filtered_teams_df.copy()
del era_data['W']
del era_data['name']
#Check out some summary statistics for the data as a whole
filtered_teams_df.describe()
Analysis
I see that mean Wins is about 81, which is the expected value. Based on my prior knowledge of Major League Baseball, the min and max values for Wins and ERA are certainly within expected boundaries.
Process
I would like to get an idea of how the ERA values are distributed across the data. I will use a histogram to model the frequency of Team ERA values in the data set.
#Create a pandas series to hold the era data, then plot it in a histogram
era_series = filtered_teams_df['ERA']
plt.hist(era_series)
plt.suptitle('ERA Frequencies from Sample Data', fontsize = 14)
plt.xlabel('Team ERA', fontsize = 12)
plt.ylabel('Frequency in Data Distribution', fontsize = 12)
Analysis
The above histogram shows the frequency of the Team ERA value (tabulated at the end of each season, for each team during the survey period of 1962-2015). I can see from this histogram the most common ERA's in the data set fall between 3.5 and 4.0 earned runs allowed per game and the distribution is roughly symmetrical with ERA values falling both above and below the calculated mean of 3.99 ERA.
Process
Next, I want to standardize the Wins and ERA columns so that the variables can be compared along the same scale. 'Wins' are not generally comparable to 'ERA', they are measures of different 'units' (earned run average, and wins). Additionally, the standard deviation for 'Wins' is 11.41, the standard deviation for 'ERA' is .59. By standardizing these values, both variables will have the same weight in a comparison.
#Only interested in standardizing 'W' and 'ERA', 2nd line performs the actual standardizing
columns_to_norm = ['W', 'ERA']
filtered_teams_df[columns_to_norm] = filtered_teams_df[columns_to_norm].apply(lambda x: (x - x.mean()) / (x.std()))
filtered_teams_df.head()
Process
Below I will complete a preliminary scatter plot using the standardized values to illustrate any general trends in the data set. The scatter plot will clearly demonstrate the correlation between the variables (positive, negative, or none).
#code to product a scatterplot, scatter(x values, y values) is the formatting. the other lines add axis titles
plt.scatter(filtered_teams_df['W'], filtered_teams_df['ERA'])
plt.suptitle('Prelim Scatter Plot', fontsize = 14)
plt.xlabel('Number of wins', fontsize = 12)
plt.ylabel('Team ERA', fontsize = 12)
Analysis
The above scatterplot shows the distribution of wins compared to Team ERA throughout the sample data. Each plot point represents a snapshot of one team during one season, comparing ERA value along the Y-axis to number of wins along the X-axis.
The general distribution (clustering around 0,0) suggests that with an average Team ERA (~3.99), a sample team can expect an average number of wins (~81 wins). The plot can further be intrepreted to suggest that there is a negative correlation between Team ERA and Number of Wins. As Team ERA goes up, number of wins goes down. This makes sense and agrees with my original hypothesis because the more runs your team is allowing the other team to score the harder it will be for your team to win the game.
The opposite can also be inferred - that number of wins should go up as Team ERA goes down. The above scatterplot essentially means any number of wins above 0 correlates to a season where the team finished above .500 (I know this because the mean number of wins is ~81, which is half the games in a season), and had a decent chance of making the playoffs (ie. what most teams would consider a successful season). Number of wins below 0 means the team finished under .500 and most likely did not make the playoffs.
Process
Now I want to plot the top 5 teams by Team ERA, and the bottom 5 by Team ERA, and see how their win totals compare. This analysis should further demonstrate the correlation between the two tested variables, Team ERA, and number of wins/season. I expect the top 5 teams will average more wins/year, than the bottom 5.
First, I need to make a new dataframe with the relevant data.
#pivot the data to organize it by team, the mean values for era and W are computed. Sort the data, and take top and bottom 5 teams to produce plotting data
pivot_teams = pd.pivot_table(filtered_teams_df, index=['name'], aggfunc=np.mean)
pivot_teams_sorted = pivot_teams.sort_values(by=('ERA'))
top_five_era = pivot_teams_sorted.head()
bottom_five_era = pivot_teams_sorted.tail()
plotting_data = pd.concat([top_five_era, bottom_five_era])
del plotting_data['yearID']
plotting_data = plotting_data.reset_index()
plotting_data
Process
Then I can plot it.
#plot the data using a scatter plot. The for loop adds the team name annotation based on the index location of the plot point
plotting_x = plotting_data['W']
plotting_y = plotting_data['ERA']
plotting_data_label = ['Dodgers', 'Cardinals', 'Mets', 'Giants', 'Braves', 'Mariners', 'Rangers', 'Angels', 'Rays', 'Rockies']
fix, ax = plt.subplots()
ax.scatter(plotting_x, plotting_y)
plt.suptitle('Team ERA vs. Number of Wins', fontsize = 14)
plt.xlabel('Standardized Number of wins', fontsize = 12)
plt.ylabel('Standardized Team ERA', fontsize = 12)
for i, txt in enumerate(plotting_data_label):
ax.annotate(txt, (plotting_x[i], plotting_y[i]))
Analysis
The above plot shows Standardized Team ERA along the Y-axis, and standardized number of wins along the X-axis for the top 5 and bottom 5 Team ERA averages over the sample period of 1965-2015.
The trend here is pretty clear: generally speaking, lowering your team's ERA will result in winning more regular season games, which strengthens my hypothesis and leads me to belive further, more concrete analysis is warranted to investigate and confirm the possible correlation.
However, the Mets and Angels are clear outliers amongst this grouping. The Mets' team ERA was pretty decent (third best average actually out of any team), but their average win rate was below .500. This illustrates one of the key points to keep in mind when reviewing baseball statistics. I believe that no stat can be taken in isolation. There are so many factors (front-office personnel, weather at the game, hot and cold streaks for players, and on and on and on) that affect which team wins. Making general recommendations, or conclusions based on most baseball stats is fine, but I don't think any definite conclusions should be drawn.
The Angels are also an outlier, but in the opposite direction as the Mets. For the Angels, their team ERA was the third worst of any team over the survey period but their number of wins/year was right in line with the teams with the best average ERA's. Again, outliers are to be expected in baseball, especially when the statistics are being compared only to each other and not factoring in any outside elements. One potential explanation for the Angels plot, is that their batting/offensive statistics were so good that they were able to overcome their deficiencies in pitching. Overall batting averages per team, or on-base percetage totals per team would be a good area to expand the scope of this analysis to find further correlations to expect win totals per season.
Process
If a team decides that they want to focus on lowering their Team ERA, how will these efforts fit into the overall Team ERA trends in MLB over the years? To get a sense of this metric I will plot the average team ERA/year since 1962 and see what the trend is. My hypothesis is that the average ERA will be trending downward over the years due to better practice/training/exercise/and medical processes developed over the last 50+ years.
#creates a bar graph of the mean ERA value per year, format for creating a bar graph is .bar(xvalue, yvalue)
pivot_era = pd.pivot_table(era_data, index=['yearID'], aggfunc=np.mean)
pivotera = pivot_era.reset_index()
plt.bar(pivotera['yearID'], pivotera['ERA'])
plt.suptitle('Graph of Average Team ERA by Year', fontsize = 14)
plt.xlabel('Year', fontsize = 12)
plt.ylabel('Team ERA', fontsize = 12)
Analysis
The above bar graph illustrates average team ERA calculated from every team per year. Essentially, all teams had their Team ERA averaged into one overall average for all teams. Each bar on the graph represents the average Team ERA during a different year, with gaps where the data was removed to account for seasons where a strike disrupted play.
The trend illustrated by the graph was not what I expected at all! From near the beginning of the survey period (1962) until the early 2000's the overall trend is upward, until it peaks near ~2000. This trend very clearly mirrors the incresing use of performance-enhancing drugs, or steriods, in MLB through the 70's, and especially during the 80's and 90's.
After the peak the overall trend is slightly downwards which corresponds to an increase in steroid testing in MLB and more severe penalties for steroid users. This presumably had the desired effect of reducing usage of steriods in MLB, which would curb some of the frankly outrageous offensive batting statistics put forth by certain players during the 80's and 90's.
The observed trend in the data does not necessarily agree with my original hypothesis, but I did not consider the balooning effect of steriod usage on Team ERA would be inferred here.
I think the data has clearly shown that, yes, teams that win more games have lower team ERA's, on average. This is not to say that lowering ERA CAUSES a team to win more games. There are many factors that go into which team wins a MLB game, factors such as fielding percentage, batting stats, and countless others. Team ERA is just one measure of a team's performance.
This analysis was purposely limited to only team ERA/number of wins and only to data going back to 1962. I did this to get a sense of the correlation during the modern baseball period. This, of course, does not take into account the almost 100 years of games played before I began my analysis. For a more complete picture, I could re-run the study for the entire dataset streching back to the 1800's.
The advantage of my approach is that the length of each season during my sample period was the same (162 games), this simplified the dataset significantly, reducing the liklihood of introducing errors. I also chose to ignore data during strike years. I was not interested in the data for these years because they are not representative of a 'normal' baseball season. A baseball General Manager would never make special 'plans' for a strike season because they are never predictable. The hypothetical General Manager would instead, plan the way his/her team would run based on 'normal', predictable seasons, which last 162 games in the modern era.
As mentioned throughout the analysis there are many other variables that may affect who wins or loses a baseball game, including factors outside the game, like steroids! There are many other variables not included in this dataset that could have been useful in drawing stronger conclusions. Some examples of these other variables are average runs scored per game, team batting average, team fielding percentage, and/or team on-base percentage.
A good follow-up to this analysis might be seeing how some or all of those other variables relate to win percentage over the regular season and then finding out which one of the examined variables correlates the strongest with winning more games.
That one stat would be the area of the team I would spend the most time/energy/and money to improve in the future.
https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook#gs.fSMrzAY
Python for Data Analysis - Wes McKinney
https://www.cheatography.com/weidadeyue/cheat-sheets/jupyter-notebook/
http://stackoverflow.com/questions/17661136/csv-does-not-exist-pandas-dataframe
http://pbpython.com/excel-pandas-comp-2.html
http://stackoverflow.com/questions/29530232/python-pandas-check-if-any-value-is-nan-in-dataframe
http://pandas.pydata.org/pandas-docs/stable/basics.html
https://en.wikipedia.org/wiki/MLB_lockout
http://stackoverflow.com/questions/12525722/normalize-data-in-pandas
http://pbpython.com/pandas-pivot-table-explained.html
http://stackoverflow.com/questions/23307301/pandas-replacing-column-values-in-dataframe
http://chrisalbon.com/python/pandas_join_merge_dataframe.html
http://stackoverflow.com/questions/13411544/delete-column-from-pandas-dataframe
https://www.biomedware.com/files/documentation/Preparing_data/Why_standardize_variables.htm