Initial Loading¶
The previously wrangled data and the required modules are loaded.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import scikits.bootstrap as bootstrap
import scipy.stats
from typing import List
regular_season_totals = pd.read_csv("regular_season_totals.csv")
play_off_totals = pd.read_csv("play_off_totals.csv")
regular_season_team_summary_stats = pd.read_csv("regular_season_team_summary_stats.csv")
Visualization Analysis (Team-Level)¶
The purpose of this component of the project is to analyze basketball performance variables and determine if they may have associations with winning and team performance in basketball. The results from this component will be applied to the statistical modelling component. Specifically, the top 6 variables interpreted to have the strongest association with wins will be selected as the model predictors in the next element in this project.
Boxplot, Histogram, Bar Chart, and Confidence Interval Analysis¶
Boxplots are used to compare the total values of all variables among all teams during the regular season. Since there are 30 teams, 3 side-by-side boxplots of 10 teams each are used to keep the visualizations readable. The y-scales are monitored closely, since the 3 boxplots may have different scales. For each variable explored in the boxplots, one team that performs relatively better in the variable and one that performs relatively worse are selected by inspecting the boxplots, so that meaningful differences can be observed. There is no specific criterion for the selection, though teams with more extreme means and narrower ranges are favoured. For simplicity, play-offs data is not used.
Then, a histogram showing the distribution of the variable across all games played by each team during the regular season is plotted, along with a bar chart summarizing the team's wins and losses. Play-offs data is not used for the histograms and bar charts, as each team has far fewer play-offs observations than regular season observations. The two teams are then compared to see whether the higher-performing team in the variable also has more wins. In this project, "higher-performing" means a greater value of the variable, not necessarily a more favourable one. The purpose of this part is to get a broader view of the relationship between the variable and winning for each team, and of whether the variable may be essential to winning. If the lower-performing team has more wins than the higher-performing team, then the variable may not have as strong a relationship with winning as other variables.
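The comparison described above can also be checked numerically instead of being read off the charts. The following is a minimal sketch on an invented six-row data frame; the team names and values are hypothetical, and only the column names (`teamName`, `WL`, `fieldGoalsMade`) mirror those used in this notebook.

```python
import pandas as pd

# Toy stand-in for regular_season_totals; the rows are invented for illustration.
games = pd.DataFrame({
    "teamName": ["Team A", "Team A", "Team A", "Team B", "Team B", "Team B"],
    "WL": ["W", "W", "L", "L", "L", "W"],
    "fieldGoalsMade": [42, 40, 37, 35, 36, 41],
})

# Wins and losses per team (rows: teams, columns: L and W).
win_loss_counts = pd.crosstab(games["teamName"], games["WL"])

# Mean of the variable per team, to identify the higher-performing team.
mean_by_team = games.groupby("teamName")["fieldGoalsMade"].mean()

higher_team = mean_by_team.idxmax()
most_wins_team = win_loss_counts["W"].idxmax()

print(win_loss_counts)
print("Higher-performing team:", higher_team)
print("Team with more wins:", most_wins_team)
```

If the two printed teams differ, the variable would be a weaker candidate under the reasoning above.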
Focusing only on games where the higher-performing team beat the lower-performing team, a 95% confidence interval for the true mean difference in the variable between the two teams is computed. This is done by drawing 1000 bootstrap resamples from all games in the sample where that specific winning team beat that specific losing team. Play-offs data is again excluded here because very few such games exist. If the interval contains values between -1 and 1, winning is interpreted as unlikely to be associated with higher values of the variable, since the two teams appear to perform similarly. The reasoning is that differences smaller than 1 in magnitude are minimal for the basketball performance variables analyzed and do not typically indicate a meaningful association with winning. If instead the interval contains values greater than 1, there may be a positive relationship between the variable and winning, as one team does substantially better than the other.
Afterwards, a 95% confidence interval for the true mean difference in the variable between the winning and losing teams is constructed from all games in the regular season sample, and separately from all games in the play-offs sample, to either reinforce the findings from the two-team comparison or reveal deviations from them. The regular season interval should be the more reliable of the two, as its larger sample is more representative of the population. Both samples are also much larger than the two-team sample, so their intervals are more precise. Comparing two specific teams alongside the full regular season and play-offs samples makes it possible to explore the data further and to identify outliers or deviations from the overall trends.
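The percentile bootstrap and the reading rule described above can be sketched as follows. This is an illustrative NumPy version, not the notebook's own implementation: the function names `bootstrap_mean_ci` and `interpret_interval` are hypothetical, the ten differences are invented, and the three-way labelling is only one possible reading of the rule, since the text does not say how to treat an interval that straddles 1.

```python
import numpy as np

def bootstrap_mean_ci(values, n_resamples=1000, level=0.95, seed=100):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement, keeping the original sample size each time.
    resamples = rng.choice(values, size=(n_resamples, values.size), replace=True)
    means = resamples.mean(axis=1)
    tail = (1 - level) / 2
    lower, upper = np.quantile(means, [tail, 1 - tail])
    return float(lower), float(upper)

def interpret_interval(lower, upper, threshold=1.0):
    """Label an interval relative to the band [-threshold, threshold]."""
    if lower > threshold:
        return "entirely above threshold"
    if upper < -threshold:
        return "entirely below negative threshold"
    return "overlaps minimal-difference band"

# Invented winner-minus-loser differences for ten games.
differences = [3, 5, 2, 6, 4, 1, 7, 3, 4, 5]
lower, upper = bootstrap_mean_ci(differences)
label = interpret_interval(lower, upper)
print(lower, upper, label)
```

Under this reading, only an interval lying entirely above 1 is treated as evidence of a positive association.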
#Team names to create boxplots of variables for
teamNames1 = ["Atlanta Hawks","Boston Celtics","Charlotte Hornets","Chicago Bulls","Cleveland Cavaliers",
"Dallas Mavericks","Denver Nuggets","Detroit Pistons","Golden State Warriors",
"Houston Rockets"]
teamNames2 = ["Indiana Pacers","Los Angeles Clippers","Los Angeles Lakers","Memphis Grizzlies","Miami Heat",
"Milwaukee Bucks","Minnesota Timberwolves","Brooklyn Nets","New Orleans Pelicans",
"New York Knicks"]
teamNames3 = ["Oklahoma City Thunder","Orlando Magic","Phoenix Suns","Portland Trail Blazers","Sacramento Kings",
"San Antonio Spurs","Toronto Raptors","Washington Wizards","Philadelphia 76ers","Utah Jazz"]
regular_season_team_summary_stats_1 = regular_season_team_summary_stats[regular_season_team_summary_stats["teamName"].isin(teamNames1)]
regular_season_team_summary_stats_2 = regular_season_team_summary_stats[regular_season_team_summary_stats["teamName"].isin(teamNames2)]
regular_season_team_summary_stats_3 = regular_season_team_summary_stats[regular_season_team_summary_stats["teamName"].isin(teamNames3)]
#Regular season totals data split between winning games and losing games
winning_games_regular_season = regular_season_totals[regular_season_totals["WL"] == "W"]
losing_games_regular_season = regular_season_totals[regular_season_totals["WL"] == "L"]
#Play-offs totals data split between winning games and losing games
winning_games_play_offs = play_off_totals[play_off_totals["WL"] == "W"]
losing_games_play_offs = play_off_totals[play_off_totals["WL"] == "L"]
#Function to create boxplots
def plot_boxplot(variable: str, df: pd.DataFrame) -> None:
    """
    Plots boxplot of given variable for given data frame df for group of teams,
    assuming the variable exists within df
    """
    plt.figure(figsize=(20,4.8))
    sns.boxplot(x="teamName",y=variable,data=df,color='grey')
    plt.show()
#Function to create histogram
def plot_histogram(variable: str, df: pd.DataFrame) -> None:
    """
    Plots histogram of given variable for given data frame df for a team,
    assuming the variable exists within df
    """
    plt.figure(figsize=(20,4.8))
    sns.histplot(x=variable,data=df,bins='auto',color='grey')
    plt.show()
#Function to create bar chart
def plot_barchart(variable: str, df: pd.DataFrame) -> None:
    """
    Plots bar chart of given variable for given data frame df for a team,
    assuming the variable exists within df
    """
    plt.figure(figsize=(20,4.8))
    sns.countplot(x=variable,data=df,color='grey')
    plt.show()
#Function to build the bootstrap distribution of the sample mean
def bootstrap_distribution(variable: str, original_sample: pd.DataFrame) -> List[float]:
    """
    Performs bootstrap resampling for 1000 samples to explore true mean for given variable of the given original sample,
    assuming the variable exists in given data
    """
    bootstrap_means = []
    for _ in range(1000):
        #Resample with replacement, keeping the original sample size
        bootstrap_resample = original_sample[variable].sample(n=len(original_sample),replace=True)
        mean = bootstrap_resample.mean()
        bootstrap_means.append(mean)
    return bootstrap_means
#Function to compute percentile confidence interval
def confidence_interval(data: List[float]) -> List[float]:
    """
    Computes lower and upper bounds of 95% confidence interval for given numeric data
    """
    data = pd.Series(data)
    #2.5th and 97.5th percentiles bound the middle 95% of the data
    lower_bound = float(data.quantile(0.025))
    upper_bound = float(data.quantile(0.975))
    return [lower_bound,upper_bound]
#Function to plot a distribution with its confidence interval shaded
def visualize_confidence_interval(distribution: List[float], bounds: List[float]) -> None:
    """
    Shades in given 95% confidence interval for given distribution
    """
    sns.histplot(x=distribution,bins=30,color='gray')
    plt.axvspan(bounds[0],bounds[1],alpha=0.3,facecolor="turquoise",edgecolor="red",linewidth=4)
    plt.show()
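#Function to score a variable from two confidence intervals
def score(ci1: List[float], ci2: List[float]) -> float:
    """
    Computes score for variable given the method described under "Final Selections"
    """
    #Each term is the interval midpoint divided by the interval width
    term1 = ((ci1[0] + ci1[1])/2)/(ci1[1] - ci1[0])
    term2 = ((ci2[0] + ci2[1])/2)/(ci2[1] - ci2[0])
    return 0.93 * term1 + 0.07 * term2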
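As a worked instance of the arithmetic in `score`, the sketch below evaluates the same formula on two made-up intervals. The intervals are hypothetical and chosen only so the result is easy to check by hand; what `ci1` and `ci2` represent is defined under "Final Selections" and is not assumed here.

```python
from typing import List

def score(ci1: List[float], ci2: List[float]) -> float:
    """Weighted sum of midpoint-to-width ratios of two intervals."""
    term1 = ((ci1[0] + ci1[1]) / 2) / (ci1[1] - ci1[0])
    term2 = ((ci2[0] + ci2[1]) / 2) / (ci2[1] - ci2[0])
    return 0.93 * term1 + 0.07 * term2

# Hypothetical intervals: midpoints 3 and 2, both of width 2.
# term1 = 3 / 2 = 1.5, term2 = 2 / 2 = 1.0
# score = 0.93 * 1.5 + 0.07 * 1.0 = 1.465
wide = score([2.0, 4.0], [1.0, 3.0])

# Same midpoint for the first interval, but half the width:
# term1 = 3 / 1 = 3.0, so score = 0.93 * 3.0 + 0.07 * 1.0 = 2.86
narrow = score([2.5, 3.5], [1.0, 3.0])

print(round(wide, 3), round(narrow, 3))
```

The formula rewards intervals that sit far from zero and are narrow, and it is undefined for an interval of zero width.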
Field Goals Made¶
plot_boxplot("totalFieldGoalsMade",regular_season_team_summary_stats_1)
plot_boxplot("totalFieldGoalsMade",regular_season_team_summary_stats_2)
plot_boxplot("totalFieldGoalsMade",regular_season_team_summary_stats_3)
Based on the boxplots above, the San Antonio Spurs appear to make more field goals than most teams, while the Utah Jazz tend to make fewer. To explore the association between field goals made and game outcomes, a histogram of the field goals made and a bar chart of wins and losses are plotted for each team.
sas_regular_season_totals = regular_season_totals[regular_season_totals["teamName"]=="San Antonio Spurs"]
uta_regular_season_totals = regular_season_totals[regular_season_totals["teamName"]=="Utah Jazz"]
plot_histogram("fieldGoalsMade",sas_regular_season_totals)
plot_histogram("fieldGoalsMade",uta_regular_season_totals)
sas_wl_bar_chart = plot_barchart("WL",sas_regular_season_totals)
uta_wl_bar_chart = plot_barchart("WL",uta_regular_season_totals)
The histograms and bar charts above show that the San Antonio Spurs have a higher average number of field goals made and also more wins than the Utah Jazz. To further investigate the association between field goals and winning, a 95% confidence interval for the true mean difference in field goals made is computed by bootstrap resampling with 1000 iterations from all NBA regular season games during 2010-2024 in which the San Antonio Spurs beat the Utah Jazz.
np.random.seed(100)
combined_games_regular_season = pd.merge(winning_games_regular_season,losing_games_regular_season,on=["gameID"])
combined_games_regular_season["diffFieldGoals"] = combined_games_regular_season["fieldGoalsMade_x"] - combined_games_regular_season["fieldGoalsMade_y"]
combined_games_play_offs = pd.merge(winning_games_play_offs,losing_games_play_offs,on=["gameID"])
combined_games_play_offs["diffFieldGoals"] = combined_games_play_offs["fieldGoalsMade_x"] - combined_games_play_offs["fieldGoalsMade_y"]
sas_uta_games = combined_games_regular_season[(combined_games_regular_season["teamName_x"]=="San Antonio Spurs") & (combined_games_regular_season["teamName_y"]=="Utah Jazz")]
sas_uta_field_goals_bootstrap_distribution = bootstrap_distribution("diffFieldGoals",sas_uta_games)
sas_uta_percentile_confidence_interval = confidence_interval(sas_uta_field_goals_bootstrap_distribution)
plt.xlabel("Mean Difference in Field Goals")
plt.title("Bootstrap Sampling Distribution for Mean Difference in Field Goals for \nSan Antonio Spurs Beating Utah Jazz (n=1000)")
sas_uta_field_goals_confidence_interval = visualize_confidence_interval(sas_uta_field_goals_bootstrap_distribution,sas_uta_percentile_confidence_interval)
sas_uta_percentile_confidence_interval
[1.9301724137931036, 5.414655172413792]
Based on the interval, if many samples of 49 such games were drawn, about 95% of the resulting 95% confidence intervals would contain the true mean difference in field goals made between the San Antonio Spurs and the Utah Jazz. In other words, there is 95% confidence that the true mean difference in field goals made, in games where the San Antonio Spurs beat the Utah Jazz, lies between approximately 1.93 and 5.41 field goals. Since the entire interval lies above a 1.5 difference in field goals made, this provides further evidence that the number of field goals made is strongly associated with winning basketball games. To further investigate this relationship, a 95% confidence interval is constructed for the full sample of NBA games.
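The winner-minus-loser differences behind these intervals come from pairing each game's winning row with its losing row. The sketch below reproduces that pairing on an invented two-game sample; the team names and values are hypothetical, and the `validate="one_to_one"` argument is an added safeguard that is not part of the original code.

```python
import pandas as pd

# Invented two-game sample; column names mirror those used in this notebook.
totals = pd.DataFrame({
    "gameID": [1, 1, 2, 2],
    "teamName": ["Team A", "Team B", "Team B", "Team A"],
    "WL": ["W", "L", "W", "L"],
    "fieldGoalsMade": [41, 36, 39, 38],
})

winners = totals[totals["WL"] == "W"]
losers = totals[totals["WL"] == "L"]

# One row per game: *_x columns come from the winner, *_y from the loser.
# validate raises an error if a gameID appears more than once on either side.
paired = pd.merge(winners, losers, on="gameID", validate="one_to_one")
paired["diffFieldGoals"] = paired["fieldGoalsMade_x"] - paired["fieldGoalsMade_y"]

# Games in which Team A beat Team B.
a_beat_b = paired[(paired["teamName_x"] == "Team A") & (paired["teamName_y"] == "Team B")]

print(paired[["gameID", "teamName_x", "teamName_y", "diffFieldGoals"]])
```

Because the left frame holds the winners, a positive difference always means the winning team recorded more of the variable.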
np.random.seed(100)
field_goals_regular_season_bootstrap_distribution = bootstrap_distribution("diffFieldGoals",combined_games_regular_season)
field_goals_regular_season_percentile_confidence_interval = confidence_interval(field_goals_regular_season_bootstrap_distribution)
plt.xlabel("Mean Difference in Field Goals")
plt.title("Bootstrap Sampling Distribution for Difference in Mean Field Goals \n(Regular Season, n=1000)")
field_goals_regular_season_confidence_interval = visualize_confidence_interval(field_goals_regular_season_bootstrap_distribution,field_goals_regular_season_percentile_confidence_interval)
print(field_goals_regular_season_percentile_confidence_interval)
field_goals_play_offs_bootstrap_distribution = bootstrap_distribution("diffFieldGoals",combined_games_play_offs)
field_goals_play_offs_percentile_confidence_interval = confidence_interval(field_goals_play_offs_bootstrap_distribution)
plt.xlabel("Mean Difference in Field Goals")
plt.title("Bootstrap Sampling Distribution for Difference in Mean Field Goals \n(Play-offs, n=1000)")
field_goals_play_offs_confidence_interval = visualize_confidence_interval(field_goals_play_offs_bootstrap_distribution,field_goals_play_offs_percentile_confidence_interval)
print(field_goals_play_offs_percentile_confidence_interval)
[3.883704526353704, 4.0328415776203625]
[3.925124688279302, 4.479093931837074]
The 95% confidence intervals for the true mean difference in field goals made, constructed from both the regular season and play-offs samples, further support the suggestion that field goals made are essential to success in basketball games: the regular season interval ranges from approximately 3.88 to 4.03 field goals. The play-offs interval covers similar values (approximately 3.93 to 4.48) but is a little wider, which is expected as the sample size is smaller than the regular season sample. Both lower bounds are higher than the lower bound of the interval for the San Antonio Spurs and Utah Jazz, and both intervals are narrower, indicating more precision. The increased precision is expected, given the larger sample sizes.
Three Pointers¶
plot_boxplot("totalThreePointersMade",regular_season_team_summary_stats_1)
plot_boxplot("totalThreePointersMade",regular_season_team_summary_stats_2)
plot_boxplot("totalThreePointersMade",regular_season_team_summary_stats_3)
Though the San Antonio Spurs perform strongly in field goals, the boxplots suggest that they do not perform as well in three pointers. In contrast, the Houston Rockets seem to perform better than the other teams in three pointers.
hou_regular_season_totals = regular_season_totals[regular_season_totals["teamName"]=="Houston Rockets"]
plot_histogram("threePointersMade",hou_regular_season_totals)
plot_histogram("threePointersMade",sas_regular_season_totals)
plot_barchart("WL",hou_regular_season_totals)
plot_barchart("WL",sas_regular_season_totals)
It is observed above that though Houston Rockets tend to make more three pointers than San Antonio Spurs, they do not have more wins.
np.random.seed(100)
combined_games_regular_season["diffThreePointers"] = combined_games_regular_season["threePointersMade_x"] - combined_games_regular_season["threePointersMade_y"]
combined_games_play_offs["diffThreePointers"] = combined_games_play_offs["threePointersMade_x"] - combined_games_play_offs["threePointersMade_y"]
sas_hou_games = combined_games_regular_season[(combined_games_regular_season["teamName_x"] == "Houston Rockets") & (combined_games_regular_season["teamName_y"] == "San Antonio Spurs")]
sas_hou_field_goals_bootstrap_distribution = bootstrap_distribution("diffThreePointers",sas_hou_games)
sas_hou_percentile_confidence_interval = confidence_interval(sas_hou_field_goals_bootstrap_distribution)
plt.xlabel("Mean Difference in Three Pointers")
plt.title("Bootstrap Sampling Distribution for Mean Difference in Three Pointers for \nHouston Rockets Beating San Antonio Spurs (n=1000)")
sas_hou_field_goals_confidence_interval = visualize_confidence_interval(sas_hou_field_goals_bootstrap_distribution,sas_hou_percentile_confidence_interval)
plt.show()
sas_hou_percentile_confidence_interval
[1.32, 4.88]
Despite the Houston Rockets winning fewer games than the San Antonio Spurs, when they beat the San Antonio Spurs, they still tend to make more three pointers, as the lower bound of the confidence interval is approximately 1.32 three pointers.
np.random.seed(100)
three_pointers_regular_season_bootstrap_distribution = bootstrap_distribution("diffThreePointers",combined_games_regular_season)
three_pointers_regular_season_percentile_confidence_interval = confidence_interval(three_pointers_regular_season_bootstrap_distribution)
plt.xlabel("Mean Difference in Three Pointers")
plt.title("Bootstrap Sampling Distribution for Mean Difference in Three Pointers \n(Regular season, n=1000)")
three_pointers_regular_season_confidence_interval = visualize_confidence_interval(three_pointers_regular_season_bootstrap_distribution,three_pointers_regular_season_percentile_confidence_interval)
print(three_pointers_regular_season_percentile_confidence_interval)
three_pointers_play_offs_bootstrap_distribution = bootstrap_distribution("diffThreePointers",combined_games_play_offs)
three_pointers_play_offs_percentile_confidence_interval = confidence_interval(three_pointers_play_offs_bootstrap_distribution)
plt.xlabel("Mean Difference in Three Pointers")
plt.title("Bootstrap Sampling Distribution for Mean Difference in Three Pointers \n(Play-offs, n=1000)")
three_pointers_play_offs_confidence_interval = visualize_confidence_interval(three_pointers_play_offs_bootstrap_distribution,three_pointers_play_offs_percentile_confidence_interval)
print(three_pointers_play_offs_percentile_confidence_interval)
[1.7015848241085365, 1.8355249729859529]
[1.693246051537822, 2.2103075644222776]
Using the regular season sample, the 95% confidence interval ranges from approximately 1.70 to 1.84 three-pointers. The play-offs sample yields an interval covering similar values (approximately 1.69 to 2.21), though it is once again a little wider. Compared with field goals, three-pointers show a positive association with winning, but a weaker one: the mean difference is roughly 1.8 three-pointers against roughly 4 field goals. This finding supports the idea that field goals are more critical than three-pointers in determining game outcomes.
Free Throws¶
plot_boxplot("totalFreeThrowsMade",regular_season_team_summary_stats_1)
plot_boxplot("totalFreeThrowsMade",regular_season_team_summary_stats_2)
plot_boxplot("totalFreeThrowsMade",regular_season_team_summary_stats_3)
For free throws made, the Houston Rockets once again seem to perform relatively better than the other teams, while the Orlando Magic seem to perform relatively worse.
orl_regular_season_totals = regular_season_totals[regular_season_totals["teamName"]=="Orlando Magic"]
plot_histogram("freeThrowsMade",hou_regular_season_totals)
plot_histogram("freeThrowsMade",orl_regular_season_totals)
plot_barchart("WL",hou_regular_season_totals)
plot_barchart("WL",orl_regular_season_totals)
Not only do the Orlando Magic make fewer free throws, but they also lose more games than they win.
np.random.seed(100)
combined_games_regular_season["diffFreeThrows"] = combined_games_regular_season["freeThrowsMade_x"] - combined_games_regular_season["freeThrowsMade_y"]
combined_games_play_offs["diffFreeThrows"] = combined_games_play_offs["freeThrowsMade_x"] - combined_games_play_offs["freeThrowsMade_y"]
hou_orl_games = combined_games_regular_season[(combined_games_regular_season["teamName_x"]=="Houston Rockets") & (combined_games_regular_season["teamName_y"]=="Orlando Magic")]
hou_orl_free_throws_bootstrap_distribution = bootstrap_distribution("diffFreeThrows",hou_orl_games)
hou_orl_free_throws_percentile_confidence_interval = confidence_interval(hou_orl_free_throws_bootstrap_distribution)
plt.xlabel("Mean Difference in Free Throws")
plt.title("Bootstrap Sampling Distribution for Mean Difference in Free Throws for \nHouston Rockets Beating Orlando Magic (n=1000)")
hou_orl_free_throws_confidence_interval = visualize_confidence_interval(hou_orl_free_throws_bootstrap_distribution,hou_orl_free_throws_percentile_confidence_interval)
hou_orl_free_throws_percentile_confidence_interval
[-2.0588235294117645, 4.529411764705882]
Despite the Houston Rockets making more free throws and winning more games than the Orlando Magic, the 95% confidence interval contains values below both 1 and 0. This implies that the Orlando Magic may have made more free throws than the Houston Rockets in a substantial number of these games, and that there is limited confidence that the true mean difference in free throws made is above 0. This suggests a weak relationship between making free throws and winning.
np.random.seed(100)
free_throws_regular_season_bootstrap_distribution = bootstrap_distribution("diffFreeThrows",combined_games_regular_season)
free_throws_regular_season_percentile_confidence_interval = confidence_interval(free_throws_regular_season_bootstrap_distribution)
plt.xlabel("Mean Difference in Free Throws")
plt.title("Bootstrap Sampling Distribution for Mean Difference in Free Throws \n(Regular season, n=1000)")
free_throws_regular_season_confidence_interval = visualize_confidence_interval(free_throws_regular_season_bootstrap_distribution,free_throws_regular_season_percentile_confidence_interval)
print(free_throws_regular_season_percentile_confidence_interval)
free_throws_play_offs_bootstrap_distribution = bootstrap_distribution("diffFreeThrows",combined_games_play_offs)
free_throws_play_offs_percentile_confidence_interval = confidence_interval(free_throws_play_offs_bootstrap_distribution)
plt.xlabel("Mean Difference in Free Throws")
plt.title("Bootstrap Sampling Distribution for Mean Difference in Free Throws \n(Play-offs, n=1000)")
free_throws_play_offs_confidence_interval = visualize_confidence_interval(free_throws_play_offs_bootstrap_distribution,free_throws_play_offs_percentile_confidence_interval)
print(free_throws_play_offs_percentile_confidence_interval)
[1.6124894945371593, 1.8389317445071438]
[1.1953241895261846, 2.0034081463009143]
The 95% confidence intervals from bootstrapping the regular season and play-offs samples are narrower and contain only values above 1, in contrast to the interval from bootstrapping only games where the Houston Rockets beat the Orlando Magic. That two-team sample may be an outlier relative to the entire sample, but more importantly, it was much smaller, with only 17 games, well below the commonly recommended minimum of 30. With a smaller sample, the mean is estimated less precisely, so the confidence interval is wider. The 95% confidence interval for the Houston Rockets vs. Orlando Magic had a width of approximately 6.59, almost twice as wide as the other team-to-team comparisons so far, indicating less precision. Given that the interval contained mostly positive values, it still leans in favour of a positive mean difference in free throws made by the Houston Rockets compared to the Orlando Magic.
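The effect of sample size on interval width can be illustrated with synthetic data. The sketch below draws two samples from the same normal distribution, one with 17 observations and one with 1700, and compares the widths of their percentile bootstrap intervals. The distribution parameters, the larger sample size, and the helper name `bootstrap_ci_width` are assumptions made for illustration, not values taken from the NBA data.

```python
import numpy as np

def bootstrap_ci_width(sample, n_resamples=1000, seed=100):
    """Width of the 95% percentile bootstrap interval for the mean."""
    rng = np.random.default_rng(seed)
    # Resample with replacement, keeping the original sample size each time.
    resamples = rng.choice(sample, size=(n_resamples, sample.size), replace=True)
    means = resamples.mean(axis=1)
    lower, upper = np.quantile(means, [0.025, 0.975])
    return float(upper - lower)

rng = np.random.default_rng(0)
# Synthetic differences drawn from the same distribution at two sample sizes.
small_sample = rng.normal(loc=1.5, scale=7.0, size=17)
large_sample = rng.normal(loc=1.5, scale=7.0, size=1700)

width_small = bootstrap_ci_width(small_sample)
width_large = bootstrap_ci_width(large_sample)

print("Width with 17 games:  ", width_small)
print("Width with 1700 games:", width_large)
```

Since the standard error of a mean shrinks roughly with the square root of the sample size, a hundredfold increase in games would be expected to narrow the interval by about a factor of ten.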
Rebounds¶
plot_boxplot("totalRebounds",regular_season_team_summary_stats_1)
plot_boxplot("totalRebounds",regular_season_team_summary_stats_2)
plot_boxplot("totalRebounds",regular_season_team_summary_stats_3)
For rebounds, Denver Nuggets was selected as the higher-performing team and Miami Heat as the lower-performing team.
den_regular_season_totals = regular_season_totals[regular_season_totals["teamName"] == "Denver Nuggets"]
mia_regular_season_totals = regular_season_totals[regular_season_totals["teamName"] == "Miami Heat"]
plot_histogram("reboundsTotal",den_regular_season_totals)
plot_histogram("reboundsTotal",mia_regular_season_totals)
plot_barchart("WL",den_regular_season_totals)
plot_barchart("WL",mia_regular_season_totals)
As observed when analysing three pointers made, Denver Nuggets do not have more wins than Miami Heat despite having more rebounds.
np.random.seed(100)
combined_games_regular_season["diffRebounds"] = combined_games_regular_season["reboundsTotal_x"] - combined_games_regular_season["reboundsTotal_y"]
combined_games_play_offs["diffRebounds"] = combined_games_play_offs["reboundsTotal_x"] - combined_games_play_offs["reboundsTotal_y"]
den_mia_games = combined_games_regular_season[(combined_games_regular_season["teamName_x"] == "Denver Nuggets") & (combined_games_regular_season["teamName_y"] == "Miami Heat")]
den_mia_rebounds_bootstrap_distribution = bootstrap_distribution("diffRebounds",den_mia_games)
den_mia_rebounds_percentile_confidence_interval = confidence_interval(den_mia_rebounds_bootstrap_distribution)
plt.xlabel("Mean Difference in Rebounds")
plt.title("Bootstrap Sampling Distribution for Mean Difference in Rebounds for \nDenver Nuggets Beating Miami Heat (n=1000)")
den_mia_rebounds_confidence_interval = visualize_confidence_interval(den_mia_rebounds_bootstrap_distribution,den_mia_rebounds_percentile_confidence_interval)
den_mia_rebounds_percentile_confidence_interval
[0.9970588235294119, 6.354411764705881]
The confidence interval is once again relatively wide, likely due to the limited sample size of only 17 games in which the Denver Nuggets beat the Miami Heat.
np.random.seed(100)
rebounds_regular_season_bootstrap_distribution = bootstrap_distribution("diffRebounds",combined_games_regular_season)
rebounds_regular_season_percentile_confidence_interval = confidence_interval(rebounds_regular_season_bootstrap_distribution)
plt.xlabel("Mean Difference in Rebounds")
plt.title("Bootstrap Sampling Distribution for Mean Difference in Rebounds \n(Regular season, n=1000)")
rebounds_regular_season_confidence_interval = visualize_confidence_interval(rebounds_regular_season_bootstrap_distribution,rebounds_regular_season_percentile_confidence_interval)
print(rebounds_regular_season_percentile_confidence_interval)
rebounds_play_offs_bootstrap_distribution = bootstrap_distribution("diffRebounds",combined_games_play_offs)
rebounds_play_offs_percentile_confidence_interval = confidence_interval(rebounds_play_offs_bootstrap_distribution)
plt.xlabel("Mean Difference in Rebounds")
plt.title("Bootstrap Sampling Distribution for Mean Difference in Rebounds \n(Play-offs, n=1000)")
rebounds_play_offs_confidence_interval = visualize_confidence_interval(rebounds_play_offs_bootstrap_distribution,rebounds_play_offs_percentile_confidence_interval)
print(rebounds_play_offs_percentile_confidence_interval)
[3.480279745467643, 3.7330246728298713]
[3.329093931837074, 4.289318370739817]
As previously noted, the confidence intervals for both samples are narrower due to the larger sample sizes. Given that both lower bounds are above a mean difference of 3 rebounds, the data support the conclusion that securing more rebounds is positively associated with winning. As usual, the play-offs sample yields a wider interval than the regular season sample because its sample size is smaller, which results in less precision.
Assists¶
plot_boxplot("totalAssists",regular_season_team_summary_stats_1)
plot_boxplot("totalAssists",regular_season_team_summary_stats_2)
plot_boxplot("totalAssists",regular_season_team_summary_stats_3)
The Golden State Warriors were selected as the higher-performing team for assists, while the New York Knicks were selected as the lower-performing team.
gsw_regular_season_totals = regular_season_totals[regular_season_totals["teamName"] == "Golden State Warriors"]
nyk_regular_season_totals = regular_season_totals[regular_season_totals["teamName"] == "New York Knicks"]
plot_histogram("assists",gsw_regular_season_totals)
plot_histogram("assists",nyk_regular_season_totals)
plot_barchart("WL",gsw_regular_season_totals)
plot_barchart("WL",nyk_regular_season_totals)
There is a sharp contrast between the two teams in both assists and games won. The Golden State Warriors have more assists and more wins, while the New York Knicks have fewer assists and fewer wins.
np.random.seed(100)
combined_games_regular_season["diffAssists"] = combined_games_regular_season["assists_x"] - combined_games_regular_season["assists_y"]
combined_games_play_offs["diffAssists"] = combined_games_play_offs["assists_x"] - combined_games_play_offs["assists_y"]
gsw_nyk_games = combined_games_regular_season[(combined_games_regular_season["teamName_x"]=="Golden State Warriors") & (combined_games_regular_season["teamName_y"]=="New York Knicks")]
gsw_nyk_assists_bootstrap_distribution = bootstrap_distribution("diffAssists",gsw_nyk_games)
gsw_nyk_assists_percentile_confidence_interval = confidence_interval(gsw_nyk_assists_bootstrap_distribution)
plt.xlabel("Mean Difference in Assists")
plt.title("Bootstrap Sampling Distribution for Mean Difference in Assists for \nGolden State Warriors Beating New York Knicks (n=1000)")
gsw_nyk_assists_confidence_interval = visualize_confidence_interval(gsw_nyk_assists_bootstrap_distribution,gsw_nyk_assists_percentile_confidence_interval)
print(gsw_nyk_assists_percentile_confidence_interval)
[5.888888888888889, 12.001388888888888]
The confidence interval exhibits the largest difference observed thus far in this project, with a lower bound of approximately 5.89 assists.
np.random.seed(100)
assists_regular_season_bootstrap_distribution = bootstrap_distribution("diffAssists",combined_games_regular_season)
assists_regular_season_percentile_confidence_interval = confidence_interval(assists_regular_season_bootstrap_distribution)
plt.xlabel("Mean Difference in Assists")
plt.title("Bootstrap Sampling Distribution for Mean Difference in Assists \n(Regular season, n=1000)")
assists_regular_season_confidence_interval = visualize_confidence_interval(assists_regular_season_bootstrap_distribution,assists_regular_season_percentile_confidence_interval)
print(assists_regular_season_percentile_confidence_interval)
assists_play_offs_bootstrap_distribution = bootstrap_distribution("diffAssists",combined_games_play_offs)
assists_play_offs_percentile_confidence_interval = confidence_interval(assists_play_offs_bootstrap_distribution)
plt.xlabel("Mean Difference in Assists")
plt.title("Bootstrap Sampling Distribution for Mean Difference in Assists \n(Play-offs, n=1000)")
assists_play_offs_confidence_interval = visualize_confidence_interval(assists_play_offs_bootstrap_distribution,assists_play_offs_percentile_confidence_interval)
print(assists_play_offs_percentile_confidence_interval)
[3.120283047184536, 3.2888867210949693]
[2.599334995843724, 3.2319825436408975]
The confidence intervals align with the results observed for the mean difference in assists when the Golden State Warriors beat the New York Knicks, with lower bounds above approximately 2.5 assists. This again suggests that recording more assists has a positive relationship with winning games.
Steals¶
plot_boxplot("totalSteals",regular_season_team_summary_stats_1)
plot_boxplot("totalSteals",regular_season_team_summary_stats_2)
plot_boxplot("totalSteals",regular_season_team_summary_stats_3)
Memphis Grizzlies perform better in steals than Portland Trail Blazers while having more wins.
np.random.seed(100)
combined_games_regular_season["diffSteals"] = combined_games_regular_season["steals_x"] - combined_games_regular_season["steals_y"]
combined_games_play_offs["diffSteals"] = combined_games_play_offs["steals_x"] - combined_games_play_offs["steals_y"]
mem_por_games = combined_games_regular_season[(combined_games_regular_season["teamName_x"]=="Memphis Grizzlies") & (combined_games_regular_season["teamName_y"]=="Portland Trail Blazers")]
mem_por_steals_bootstrap_distribution = bootstrap_distribution("diffSteals",mem_por_games)
mem_por_steals_percentile_confidence_interval = confidence_interval(mem_por_steals_bootstrap_distribution)
plt.xlabel("Mean Difference in Steals")
plt.title("Bootstrap Sampling Distribution for Mean Difference in Steals for \nMemphis Grizzlies Beating Portland Trail Blazers (n=1000)")
mem_por_steals_confidence_interval = visualize_confidence_interval(mem_por_steals_bootstrap_distribution,mem_por_steals_percentile_confidence_interval)
print(mem_por_steals_percentile_confidence_interval)
[0.2790000000000001, 2.720999999999999]
The confidence interval covers relatively low mean difference values, with a lower bound of approximately 0.28 and an upper bound of approximately 2.72. The sample size was 25, yielding an interval a little more precise than previous intervals where the sample size was smaller.
np.random.seed(100)
steals_regular_season_bootstrap_distribution = bootstrap_distribution("diffSteals",combined_games_regular_season)
steals_regular_season_percentile_confidence_interval = confidence_interval(steals_regular_season_bootstrap_distribution)
plt.xlabel("Mean Difference in Steals")
plt.title("Bootstrap Sampling Distribution for Mean Difference in Steals \n(Regular season, n=1000)")
steals_regular_season_confidence_interval = visualize_confidence_interval(steals_regular_season_bootstrap_distribution,steals_regular_season_percentile_confidence_interval)
print(steals_regular_season_percentile_confidence_interval)
steals_play_offs_bootstrap_distribution = bootstrap_distribution("diffSteals",combined_games_play_offs)
steals_play_offs_percentile_confidence_interval = confidence_interval(steals_play_offs_bootstrap_distribution)
plt.xlabel("Mean Difference in Steals")
plt.title("Bootstrap Sampling Distribution for Mean Difference in Steals \n(Play-offs, n=1000)")
steals_play_offs_confidence_interval = visualize_confidence_interval(steals_play_offs_bootstrap_distribution,steals_play_offs_percentile_confidence_interval)
print(steals_play_offs_percentile_confidence_interval)
[0.7036844158962661, 0.8254397286589026]
[0.7024106400665004, 1.1355153782211138]
Because both 95% confidence intervals mostly span values below 1, we can be 95% confident that the true mean difference in steals between winning teams and losing teams is small. This aligns with the results for games in which the Memphis Grizzlies beat the Portland Trail Blazers. Although the play-offs sample produced a 95% confidence interval with an upper bound of approximately 1.14 steals, most of the values in the interval are still below 1. Additionally, it is less precise than the confidence interval constructed from the regular season sample.
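The `bootstrap_distribution` and `confidence_interval` helpers used above are defined earlier in the notebook and are not reproduced here. As a rough illustration of the general technique only, the sketch below builds a percentile bootstrap interval for a mean with NumPy. The function name `percentile_bootstrap_ci`, the use of `np.random.default_rng`, and the sample values are assumptions made for this example, and it is assumed that the n=1000 in the plot titles refers to the number of resamples.

```python
import numpy as np

def percentile_bootstrap_ci(values, n_resamples=1000, level=0.95, seed=100):
    """Return [lower, upper] of a percentile bootstrap interval for the mean."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Draw n_resamples samples with replacement, each the size of the original sample
    idx = rng.integers(0, len(values), size=(n_resamples, len(values)))
    resampled_means = values[idx].mean(axis=1)
    # The interval bounds are the matching percentiles of the resampled means
    alpha = (1 - level) / 2
    lower, upper = np.percentile(resampled_means, [100 * alpha, 100 * (1 - alpha)])
    return [float(lower), float(upper)]

# Made-up steal differences (winner minus loser), for illustration only
example_diffs = [2, -1, 3, 0, 1, 4, -2, 1, 2, 0]
ci = percentile_bootstrap_ci(example_diffs)
print(ci)
```

A larger sample narrows the spread of the resampled means, which is why the full regular season interval is tighter than a single-matchup interval.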
Blocks¶
plot_boxplot("totalBlocks",regular_season_team_summary_stats_1)
plot_boxplot("totalBlocks",regular_season_team_summary_stats_2)
plot_boxplot("totalBlocks",regular_season_team_summary_stats_3)
For blocks, Oklahoma City Thunder was selected as the higher-performing team and Cleveland Cavaliers was selected as the lower-performing team.
okc_regular_season_totals = regular_season_totals[regular_season_totals["teamName"]=="Oklahoma City Thunder"]
cle_regular_season_totals = regular_season_totals[regular_season_totals["teamName"]=="Cleveland Cavaliers"]
plot_histogram("blocks",okc_regular_season_totals)
plot_histogram("blocks",cle_regular_season_totals)
plot_barchart("WL",okc_regular_season_totals)
plot_barchart("WL",cle_regular_season_totals)
The results show a stark difference between the two teams once again. Oklahoma City Thunder tends to have more blocks while winning more than Cleveland Cavaliers.
np.random.seed(100)
combined_games_regular_season["diffBlocks"] = combined_games_regular_season["blocks_x"] - combined_games_regular_season["blocks_y"]
combined_games_play_offs["diffBlocks"] = combined_games_play_offs["blocks_x"] - combined_games_play_offs["blocks_y"]
okc_cle_games = combined_games_regular_season[(combined_games_regular_season["teamName_x"] == "Oklahoma City Thunder") & (combined_games_regular_season["teamName_y"] == "Cleveland Cavaliers")]
okc_cle_blocks_bootstrap_distribution = bootstrap_distribution("diffBlocks",okc_cle_games)
okc_cle_blocks_percentile_confidence_interval = confidence_interval(okc_cle_blocks_bootstrap_distribution)
plt.xlabel("Mean Difference in Blocks")
plt.title("Bootstrap Sampling Distribution for Mean Difference in Blocks for \nOklahoma City Thunder Winning Cleveland Cavaliers (n=1000)")
okc_cle_blocks_confidence_interval = visualize_confidence_interval(okc_cle_blocks_bootstrap_distribution,okc_cle_blocks_percentile_confidence_interval)
print(okc_cle_blocks_percentile_confidence_interval)
[1.6666666666666667, 5.668333333333332]
The confidence interval has a lower bound of approximately 1.67 blocks, which may suggest a positive influence on winning. However, as expected, just like assists and steals, it does not seem to have as much influence as field goals.
np.random.seed(100)
blocks_regular_season_bootstrap_distribution = bootstrap_distribution("diffBlocks",combined_games_regular_season)
blocks_regular_season_percentile_confidence_interval = confidence_interval(blocks_regular_season_bootstrap_distribution)
plt.xlabel("Mean Difference in Blocks")
plt.title("Bootstrap Sampling Distribution for Mean Difference in Blocks \n(Regular season, n=1000)")
blocks_regular_season_confidence_interval = visualize_confidence_interval(blocks_regular_season_bootstrap_distribution,blocks_regular_season_percentile_confidence_interval)
print(blocks_regular_season_percentile_confidence_interval)
blocks_play_offs_bootstrap_distribution = bootstrap_distribution("diffBlocks",combined_games_play_offs)
blocks_play_offs_percentile_confidence_interval = confidence_interval(blocks_play_offs_bootstrap_distribution)
plt.xlabel("Mean Difference in Blocks")
plt.title("Bootstrap Sampling Distribution for Mean Difference in Blocks \n(Play-offs, n=1000)")
blocks_play_offs_confidence_interval = visualize_confidence_interval(blocks_play_offs_bootstrap_distribution,blocks_play_offs_percentile_confidence_interval)
print(blocks_play_offs_percentile_confidence_interval)
[0.7778079601392724, 0.8788299915956297]
[0.6707605985037406, 1.0623649210307564]
Although the 95% confidence interval for games in which the Oklahoma City Thunder beat the Cleveland Cavaliers lies above a mean difference of approximately 1.67 blocks, the confidence intervals based on the regular season and play-offs samples mostly span values below 1 and do not indicate a meaningful difference. Since both samples are more representative of the population of interest and produce more precise intervals, it is more accurate to conclude that blocks are weakly associated with winning.
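To make the precision comparison concrete, the short sketch below computes the width of each of the three block intervals printed above. The `width` helper and the dictionary labels are introduced here for illustration only.

```python
# Interval bounds copied from the printed outputs above
block_intervals = {
    "OKC over CLE": [1.6666666666666667, 5.668333333333332],
    "regular season": [0.7778079601392724, 0.8788299915956297],
    "play-offs": [0.6707605985037406, 1.0623649210307564],
}

def width(interval):
    """Upper bound minus lower bound; a narrower interval is more precise."""
    return interval[1] - interval[0]

widths = {name: width(ci) for name, ci in block_intervals.items()}
for name, w in widths.items():
    print(f"{name}: width = {w:.3f}")
```

The single-matchup interval is by far the widest of the three, which supports giving more weight to the full-sample intervals.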
Fouls¶
plot_boxplot("totalFouls",regular_season_team_summary_stats_1)
plot_boxplot("totalFouls",regular_season_team_summary_stats_2)
plot_boxplot("totalFouls",regular_season_team_summary_stats_3)
For fouls, Phoenix Suns was selected as the higher-performing team and San Antonio Spurs was selected as the lower-performing team.
phx_regular_season_totals = regular_season_totals[regular_season_totals["teamName"]=="Phoenix Suns"]
plot_histogram("foulsPersonal",phx_regular_season_totals)
plot_histogram("foulsPersonal",sas_regular_season_totals)
plot_barchart("WL",phx_regular_season_totals)
plot_barchart("WL",sas_regular_season_totals)
The Phoenix Suns tend to have more fouls while having fewer wins, and the San Antonio Spurs tend to have fewer fouls with more wins. Unlike the other variables examined, it is important to note that in basketball, fouls are negative metrics of a team or player, so having fewer fouls is considered favourable. Therefore, for the next comparison, games where the San Antonio Spurs beat the Phoenix Suns will be focused on.
np.random.seed(100)
combined_games_regular_season["diffFouls"] = combined_games_regular_season["foulsPersonal_x"] - combined_games_regular_season["foulsPersonal_y"]
combined_games_play_offs["diffFouls"] = combined_games_play_offs["foulsPersonal_x"] - combined_games_play_offs["foulsPersonal_y"]
phx_sas_games = combined_games_regular_season[(combined_games_regular_season["teamName_x"] == "San Antonio Spurs") & (combined_games_regular_season["teamName_y"] == "Phoenix Suns")]
# Variables renamed from *_blocks_* to *_fouls_* to match the metric being analysed
phx_sas_fouls_bootstrap_distribution = bootstrap_distribution("diffFouls",phx_sas_games)
phx_sas_fouls_percentile_confidence_interval = confidence_interval(phx_sas_fouls_bootstrap_distribution)
plt.xlabel("Mean Difference in Fouls")
plt.title("Bootstrap Sampling Distribution for Mean Difference in Fouls for \nSan Antonio Spurs Winning Phoenix Suns (n=1000)")
phx_sas_fouls_confidence_interval = visualize_confidence_interval(phx_sas_fouls_bootstrap_distribution,phx_sas_fouls_percentile_confidence_interval)
print(phx_sas_fouls_percentile_confidence_interval)
[-3.6470588235294117, -0.8235294117647058]
The confidence interval contains only negative values, indicating that, on average, the San Antonio Spurs commit fewer fouls than the Phoenix Suns in the games they win against them. As previously noted, committing fewer fouls is favourable in basketball, so this suggests that having fewer fouls is positively associated with winning.
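The sign convention behind this reading can be shown on a toy table. The sketch below assumes, as the plot titles suggest, that the `_x` columns belong to the winning team and the `_y` columns to the losing team; the three games and their foul counts are invented for illustration.

```python
import pandas as pd

# Invented games: *_x is assumed to be the winner, *_y the loser
games = pd.DataFrame({
    "teamName_x": ["San Antonio Spurs", "Phoenix Suns", "San Antonio Spurs"],
    "teamName_y": ["Phoenix Suns", "San Antonio Spurs", "Phoenix Suns"],
    "foulsPersonal_x": [18, 22, 17],
    "foulsPersonal_y": [21, 20, 23],
})

# Winner minus loser: a negative value means the winner committed fewer fouls
games["diffFouls"] = games["foulsPersonal_x"] - games["foulsPersonal_y"]

# Keep only the games the Spurs won against the Suns
sas_over_phx = games[(games["teamName_x"] == "San Antonio Spurs")
                     & (games["teamName_y"] == "Phoenix Suns")]
mean_diff = sas_over_phx["diffFouls"].mean()
print(mean_diff)
```

Because the difference column is created before filtering, the subset inherits it and no assignment on a slice is needed.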
np.random.seed(100)
fouls_regular_season_bootstrap_distribution = bootstrap_distribution("diffFouls",combined_games_regular_season)
fouls_regular_season_percentile_confidence_interval = confidence_interval(fouls_regular_season_bootstrap_distribution)
plt.xlabel("Mean Difference in Fouls")
plt.title("Bootstrap Sampling Distribution for Mean Difference in Fouls \n(Regular season, n=1000)")
fouls_regular_season_confidence_interval = visualize_confidence_interval(fouls_regular_season_bootstrap_distribution,fouls_regular_season_percentile_confidence_interval)
print(fouls_regular_season_percentile_confidence_interval)
fouls_play_offs_bootstrap_distribution = bootstrap_distribution("diffFouls",combined_games_play_offs)
fouls_play_offs_percentile_confidence_interval = confidence_interval(fouls_play_offs_bootstrap_distribution)
plt.xlabel("Mean Difference in Fouls")
plt.title("Bootstrap Sampling Distribution for Mean Difference in Fouls \n(Play-offs, n=1000)")
fouls_play_offs_confidence_interval = visualize_confidence_interval(fouls_play_offs_bootstrap_distribution,fouls_play_offs_percentile_confidence_interval)
print(fouls_play_offs_percentile_confidence_interval)
[-0.9539620602713411, -0.7976197622763838]
[-1.1264546965918538, -0.5860141313383209]
Although the games in which the San Antonio Spurs beat the Phoenix Suns show a meaningful relationship between winning and committing fewer fouls, the 95% confidence intervals constructed above contain mostly values with magnitude less than 1. This suggests that overall, even though committing fewer fouls is considered a positive attribute in basketball, fouls have a less meaningful relationship with winning.
Turnovers¶
plot_boxplot("totalTurnovers",regular_season_team_summary_stats_1)
plot_boxplot("totalTurnovers",regular_season_team_summary_stats_2)
plot_boxplot("totalTurnovers",regular_season_team_summary_stats_3)
For turnovers, Golden State Warriors was selected as the higher-performing team while Dallas Mavericks was selected as the lower-performing team. As with fouls, turnovers are considered a negative metric in basketball, so having fewer turnovers is favourable.
dal_regular_season_totals = regular_season_totals[regular_season_totals["teamName"]=="Dallas Mavericks"]
plot_histogram("turnovers",gsw_regular_season_totals)
plot_histogram("turnovers",dal_regular_season_totals)
plot_barchart("WL",gsw_regular_season_totals)
plot_barchart("WL",dal_regular_season_totals)
Despite having more turnovers, the Golden State Warriors won more games than the Dallas Mavericks. As with fouls, the confidence interval analysis will focus on the mean difference in turnovers when the Dallas Mavericks beat the Golden State Warriors, since the relationship between favourable (lower) turnover values and winning is being analyzed.
np.random.seed(100)
combined_games_regular_season["diffTurnovers"] = combined_games_regular_season["turnovers_x"] - combined_games_regular_season["turnovers_y"]
combined_games_play_offs["diffTurnovers"] = combined_games_play_offs["turnovers_x"] - combined_games_play_offs["turnovers_y"]
gsw_dal_games = combined_games_regular_season[(combined_games_regular_season["teamName_x"]=="Dallas Mavericks") & (combined_games_regular_season["teamName_y"]=="Golden State Warriors")]
# Variables renamed from *_blocks_* to *_turnovers_* to match the metric being analysed
gsw_dal_turnovers_bootstrap_distribution = bootstrap_distribution("diffTurnovers",gsw_dal_games)
gsw_dal_turnovers_percentile_confidence_interval = confidence_interval(gsw_dal_turnovers_bootstrap_distribution)
plt.xlabel("Mean Difference in Turnovers")
plt.title("Bootstrap Sampling Distribution for Mean Difference in Turnovers for \nDallas Mavericks Winning Golden State Warriors (n=1000)")
gsw_dal_turnovers_confidence_interval = visualize_confidence_interval(gsw_dal_turnovers_bootstrap_distribution,gsw_dal_turnovers_percentile_confidence_interval)
print(gsw_dal_turnovers_percentile_confidence_interval)
[-3.5920454545454543, 0.2727272727272727]
The 95% confidence interval contains mostly negative values and has an upper bound of approximately 0.27, close to 0. Because the interval includes 0, the evidence is not conclusive, but the data may suggest that having fewer turnovers is associated with winning.
np.random.seed(100)
turnovers_regular_season_bootstrap_distribution = bootstrap_distribution("diffTurnovers",combined_games_regular_season)
turnovers_regular_season_percentile_confidence_interval = confidence_interval(turnovers_regular_season_bootstrap_distribution)
plt.xlabel("Mean Difference in Turnovers")
plt.title("Bootstrap Sampling Distribution for Mean Difference in Turnovers \n(Regular Season, n=1000)")
turnovers_regular_season_confidence_interval = visualize_confidence_interval(turnovers_regular_season_bootstrap_distribution,turnovers_regular_season_percentile_confidence_interval)
print(turnovers_regular_season_percentile_confidence_interval)
turnovers_play_offs_bootstrap_distribution = bootstrap_distribution("diffTurnovers",combined_games_play_offs)
turnovers_play_offs_percentile_confidence_interval = confidence_interval(turnovers_play_offs_bootstrap_distribution)
plt.xlabel("Mean Difference in Turnovers")
plt.title("Bootstrap Sampling Distribution for Mean Difference in Turnovers \n(Play-offs, n=1000)")
turnovers_play_offs_confidence_interval = visualize_confidence_interval(turnovers_play_offs_bootstrap_distribution,turnovers_play_offs_percentile_confidence_interval)
print(turnovers_play_offs_percentile_confidence_interval)
[-0.8297679793492616, -0.6694636210829632]
[-1.2718412302576891, -0.7115544472152951]
Though the 95% confidence intervals span only negative values, the magnitudes are mostly below 1, showing minimal association between committing fewer turnovers and winning. As the entire sample is larger and produces a more precise interval, the results here are more reliable than those from the single matchup.
Field Goal Attempts¶
plot_boxplot("totalFieldGoalsAttempted",regular_season_team_summary_stats_1)
plot_boxplot("totalFieldGoalsAttempted",regular_season_team_summary_stats_2)
plot_boxplot("totalFieldGoalsAttempted",regular_season_team_summary_stats_3)
Despite the Golden State Warriors typically having more field goal attempts than the Miami Heat, they have a similar number of wins to Miami Heat.
np.random.seed(100)
combined_games_regular_season["diffFieldGoalAttempts"] = combined_games_regular_season["fieldGoalsAttempted_x"] - combined_games_regular_season["fieldGoalsAttempted_y"]
combined_games_play_offs["diffFieldGoalAttempts"] = combined_games_play_offs["fieldGoalsAttempted_x"] - combined_games_play_offs["fieldGoalsAttempted_y"]
gsw_mia_games = combined_games_regular_season[(combined_games_regular_season["teamName_x"]=="Golden State Warriors") & (combined_games_regular_season["teamName_y"]=="Miami Heat")]
gsw_mia_field_goal_attempts_bootstrap_distribution = bootstrap_distribution("diffFieldGoalAttempts",gsw_mia_games)
gsw_mia_field_goal_attempts_percentile_confidence_interval = confidence_interval(gsw_mia_field_goal_attempts_bootstrap_distribution)
plt.xlabel("Mean Difference in Field Goal Attempts")
plt.title("Bootstrap Sampling Distribution for Mean Difference in Field Goal Attempts for \nGolden State Warriors Winning Miami Heat (n=1000)")
gsw_mia_field_goal_attempts_confidence_interval = visualize_confidence_interval(gsw_mia_field_goal_attempts_bootstrap_distribution,gsw_mia_field_goal_attempts_percentile_confidence_interval)
print(gsw_mia_field_goal_attempts_percentile_confidence_interval)
[-5.3140625, 2.8125]
The 95% confidence interval is wide, indicating less precision. Because it spans both negative and positive values, including 0 and every value with magnitude less than 1, attempting more field goals may have minimal association with winning.
np.random.seed(100)
field_goal_attempts_regular_season_bootstrap_distribution = bootstrap_distribution("diffFieldGoalAttempts",combined_games_regular_season)
field_goal_attempts_regular_season_percentile_confidence_interval = confidence_interval(field_goal_attempts_regular_season_bootstrap_distribution)
plt.xlabel("Mean Difference in Field Goal Attempts")
plt.title("Bootstrap Sampling Distribution for Mean Difference in Field Goal Attempts \n(Regular Season, n=1000)")
field_goal_attempts_regular_season_confidence_interval = visualize_confidence_interval(field_goal_attempts_regular_season_bootstrap_distribution,field_goal_attempts_regular_season_percentile_confidence_interval)
print(field_goal_attempts_regular_season_percentile_confidence_interval)
field_goal_attempts_play_offs_bootstrap_distribution = bootstrap_distribution("diffFieldGoalAttempts",combined_games_play_offs)
field_goal_attempts_play_offs_percentile_confidence_interval = confidence_interval(field_goal_attempts_play_offs_bootstrap_distribution)
plt.xlabel("Mean Difference in Field Goal Attempts")
plt.title("Bootstrap Sampling Distribution for Mean Difference in Field Goal Attempts \n(Play-offs, n=1000)")
field_goal_attempts_play_offs_confidence_interval = visualize_confidence_interval(field_goal_attempts_play_offs_bootstrap_distribution,field_goal_attempts_play_offs_percentile_confidence_interval)
print(field_goal_attempts_play_offs_percentile_confidence_interval)
[-0.7874414695641734, -0.5167156921599232]
[-0.37578969243557775, 0.6169576059850372]
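As anticipated, the intervals from the regular season and play-offs samples suggest that there may be minimal association between attempting more field goals and winning games. The regular season interval lies entirely below 0 but has a magnitude under 1, while the play-offs interval includes 0.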
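A quick way to read these intervals is to check whether each one includes 0 and how far its bounds sit from 0. The sketch below applies that check to the three field goal attempt intervals printed above; the helper name `contains_zero` and the dictionary labels are introduced here for illustration.

```python
# Interval bounds copied from the printed outputs above
fga_intervals = {
    "GSW over MIA": [-5.3140625, 2.8125],
    "regular season": [-0.7874414695641734, -0.5167156921599232],
    "play-offs": [-0.37578969243557775, 0.6169576059850372],
}

def contains_zero(interval):
    """True when the interval includes a mean difference of 0."""
    return interval[0] <= 0 <= interval[1]

summary = {
    name: {
        "contains_zero": contains_zero(ci),
        # Largest absolute bound: how far from 0 the interval can reach
        "max_magnitude": max(abs(ci[0]), abs(ci[1])),
    }
    for name, ci in fga_intervals.items()
}
for name, row in summary.items():
    print(name, row)
```

An interval that excludes 0 but stays within magnitude 1, as the regular season one does, points to a consistent but small difference.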
Three Pointer Attempts¶
plot_boxplot("totalThreePointersAttempted",regular_season_team_summary_stats_1)
plot_boxplot("totalThreePointersAttempted",regular_season_team_summary_stats_2)
plot_boxplot("totalThreePointersAttempted",regular_season_team_summary_stats_3)
For three pointer attempts, Houston Rockets was selected as the higher-performing team while San Antonio Spurs was selected as the lower-performing team.
plot_histogram("threePointersAttempted",hou_regular_season_totals)
plot_histogram("threePointersAttempted",sas_regular_season_totals)
plot_barchart("WL",hou_regular_season_totals)
plot_barchart("WL",sas_regular_season_totals)
San Antonio Spurs have more wins than Houston Rockets even though they tend to have fewer three pointer attempts.
np.random.seed(100)
combined_games_regular_season["diffThreePointerAttempts"] = combined_games_regular_season["threePointersAttempted_x"] - combined_games_regular_season["threePointersAttempted_y"]
combined_games_play_offs["diffThreePointerAttempts"] = combined_games_play_offs["threePointersAttempted_x"] - combined_games_play_offs["threePointersAttempted_y"]
hou_sas_games = combined_games_regular_season[(combined_games_regular_season["teamName_x"]=="Houston Rockets") & (combined_games_regular_season["teamName_y"]=="San Antonio Spurs")]
hou_sas_three_pointer_attempts_bootstrap_distribution = bootstrap_distribution("diffThreePointerAttempts",hou_sas_games)
hou_sas_three_pointer_attempts_percentile_confidence_interval = confidence_interval(hou_sas_three_pointer_attempts_bootstrap_distribution)
plt.xlabel("Mean Difference in Three Pointer Attempts")
plt.title("Bootstrap Sampling Distribution for Mean Difference in Three Pointer Attempts for \nHouston Rockets Winning San Antonio Spurs (n=1000)")
hou_sas_three_pointer_attempts_confidence_interval = visualize_confidence_interval(hou_sas_three_pointer_attempts_bootstrap_distribution,hou_sas_three_pointer_attempts_percentile_confidence_interval)
print(hou_sas_three_pointer_attempts_percentile_confidence_interval)
[1.197, 9.682999999999996]
Unlike attempting field goals, the interval suggests a potential association between attempting more three-pointers and winning. However, as usual, results from comparing only the games between two specific teams could deviate from those for the entire sample of games. Additionally, the interval is relatively wide and therefore less precise.
np.random.seed(100)
three_pointer_attempts_regular_season_bootstrap_distribution = bootstrap_distribution("diffThreePointerAttempts",combined_games_regular_season)
three_pointer_attempts_regular_season_percentile_confidence_interval = confidence_interval(three_pointer_attempts_regular_season_bootstrap_distribution)
plt.xlabel("Mean Difference in Three Pointer Attempts")
plt.title("Bootstrap Sampling Distribution for Mean Difference in Three Pointer Attempts \n(Regular Season, n=1000)")
three_pointer_attempts_regular_season_confidence_interval = visualize_confidence_interval(three_pointer_attempts_regular_season_bootstrap_distribution,three_pointer_attempts_regular_season_percentile_confidence_interval)
print(three_pointer_attempts_regular_season_percentile_confidence_interval)
three_pointer_attempts_play_offs_bootstrap_distribution = bootstrap_distribution("diffThreePointerAttempts",combined_games_play_offs)
three_pointer_attempts_play_offs_percentile_confidence_interval = confidence_interval(three_pointer_attempts_play_offs_bootstrap_distribution)
plt.xlabel("Mean Difference in Three Pointer Attempts")
plt.title("Bootstrap Sampling Distribution for Mean Difference in Three Pointer Attempts \n(Play-offs, n=1000)")
three_pointer_attempts_play_offs_confidence_interval = visualize_confidence_interval(three_pointer_attempts_play_offs_bootstrap_distribution,three_pointer_attempts_play_offs_percentile_confidence_interval)
print(three_pointer_attempts_play_offs_percentile_confidence_interval)
[0.017872793852803468, 0.3045053427782447]
[-0.09981296758104738, 0.8305070656691603]
As suggested, the previous analysis of games in which the Houston Rockets beat the San Antonio Spurs deviates from the results using the regular season and play-offs samples of games. The results here suggest that there is a weak relationship between winning games and attempting more three-pointers.
Free Throw Attempts¶
plot_boxplot("totalFreeThrowsAttempted",regular_season_team_summary_stats_1)
plot_boxplot("totalFreeThrowsAttempted",regular_season_team_summary_stats_2)
plot_boxplot("totalFreeThrowsAttempted",regular_season_team_summary_stats_3)
For free throw attempts, Houston Rockets was selected as the higher-performing team while Orlando Magic was selected as the lower-performing team.
plot_histogram("freeThrowsAttempted",hou_regular_season_totals)
plot_histogram("freeThrowsAttempted",orl_regular_season_totals)
plot_barchart("WL",hou_regular_season_totals)
plot_barchart("WL",orl_regular_season_totals)
Orlando Magic has fewer wins while also tending to have fewer free throw attempts.
np.random.seed(100)
combined_games_regular_season["diffFreeThrowAttempts"] = combined_games_regular_season["freeThrowsAttempted_x"] - combined_games_regular_season["freeThrowsAttempted_y"]
combined_games_play_offs["diffFreeThrowAttempts"] = combined_games_play_offs["freeThrowsAttempted_x"] - combined_games_play_offs["freeThrowsAttempted_y"]
# Filter after diffFreeThrowAttempts has been added, so the subset already carries the column and no assignment on a slice is needed
hou_orl_games = combined_games_regular_season[(combined_games_regular_season["teamName_x"]=="Houston Rockets") & (combined_games_regular_season["teamName_y"]=="Orlando Magic")]
hou_orl_free_throw_attempts_bootstrap_distribution = bootstrap_distribution("diffFreeThrowAttempts",hou_orl_games)
hou_orl_free_throw_attempts_percentile_confidence_interval = confidence_interval(hou_orl_free_throw_attempts_bootstrap_distribution)
plt.xlabel("Mean Difference in Free Throw Attempts")
plt.title("Bootstrap Sampling Distribution for Mean Difference in Free Throw Attempts for \nHouston Rockets Winning Orlando Magic (n=1000)")
hou_orl_free_throw_attempts_confidence_interval = visualize_confidence_interval(hou_orl_free_throw_attempts_bootstrap_distribution,hou_orl_free_throw_attempts_percentile_confidence_interval)
print(hou_orl_free_throw_attempts_percentile_confidence_interval)
[0.2338235294117648, 8.294117647058824]
As with the previous team comparison, the interval is wide and therefore less precise.
np.random.seed(100)
free_throw_attempts_regular_season_bootstrap_distribution = bootstrap_distribution("diffFreeThrowAttempts",combined_games_regular_season)
free_throw_attempts_regular_season_percentile_confidence_interval = confidence_interval(free_throw_attempts_regular_season_bootstrap_distribution)
plt.xlabel("Mean Difference in Free Throw Attempts")
plt.title("Bootstrap Sampling Distribution for Mean Difference in Free Throw Attempts \n(Regular Season, n=1000)")
free_throw_attempts_regular_season_confidence_interval = visualize_confidence_interval(free_throw_attempts_regular_season_bootstrap_distribution,free_throw_attempts_regular_season_percentile_confidence_interval)
print(free_throw_attempts_regular_season_percentile_confidence_interval)
free_throw_attempts_play_offs_bootstrap_distribution = bootstrap_distribution("diffFreeThrowAttempts",combined_games_play_offs)
free_throw_attempts_play_offs_percentile_confidence_interval = confidence_interval(free_throw_attempts_play_offs_bootstrap_distribution)
plt.xlabel("Mean Difference in Free Throw Attempts")
plt.title("Bootstrap Sampling Distribution for Mean Difference in Free Throw Attempts \n(Play-offs, n=1000)")
free_throw_attempts_play_offs_confidence_interval = visualize_confidence_interval(free_throw_attempts_play_offs_bootstrap_distribution,free_throw_attempts_play_offs_percentile_confidence_interval)
print(free_throw_attempts_play_offs_percentile_confidence_interval)
[1.5190884259815105, 1.792971845359587]
[0.857024106400665, 1.8146716541978387]
Surprisingly, the intervals suggest there may be a stronger relationship between winning and attempting more free throws than between winning and attempting more field goals or three-pointers. It is also interesting that the interval constructed from the regular season data spans values similar to those of the corresponding interval for free throws made.
Final Selections¶
The regular season data contains approximately 93% of all games explored, while the play-offs data contains approximately 7%. For each variable, the absolute value of the centre of each 95% confidence interval (one from the regular season data and one from the play-offs data) is divided by that interval's range and multiplied by the proportion of games in the corresponding data set. This method favours variables whose mean differences are larger while penalizing those with wider (less precise) intervals. The 6 variables with the greatest computed values will be selected for statistical modelling.
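The notebook's `score` helper is defined outside this section, so the following is only a sketch of the computation described above. It assumes that the two weighted ratios are summed and that the proportions are exactly 0.93 and 0.07 (the text gives them only approximately). The names `interval_score` and `combined_score` and the `use_abs` flag are hypothetical.

```python
def interval_score(ci, weight, use_abs=True):
    """Weighted ratio of an interval's centre to its range."""
    lower, upper = ci
    centre = (lower + upper) / 2
    interval_range = upper - lower
    if use_abs:
        centre = abs(centre)
    return weight * centre / interval_range

def combined_score(regular_ci, play_offs_ci,
                   w_regular=0.93, w_play_offs=0.07, use_abs=True):
    """Sum of the weighted ratios for the two data sets (summation is an assumption)."""
    return (interval_score(regular_ci, w_regular, use_abs)
            + interval_score(play_offs_ci, w_play_offs, use_abs))

# Interval bounds copied from the steals and fouls outputs earlier in this section
steals_regular = [0.7036844158962661, 0.8254397286589026]
steals_play_offs = [0.7024106400665004, 1.1355153782211138]
fouls_regular = [-0.9539620602713411, -0.7976197622763838]
fouls_play_offs = [-1.1264546965918538, -0.5860141313383209]

steals_sketch = combined_score(steals_regular, steals_play_offs)
fouls_signed = combined_score(fouls_regular, fouls_play_offs, use_abs=False)
fouls_abs = combined_score(fouls_regular, fouls_play_offs, use_abs=True)
print(steals_sketch, fouls_signed, fouls_abs)
```

Under these assumptions the steals intervals give a value of roughly 5.99, and the fouls intervals give roughly -5.32 when the sign of the centre is kept and roughly 5.32 when the absolute value is taken.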
field_goals_score = score(field_goals_regular_season_percentile_confidence_interval,field_goals_play_offs_percentile_confidence_interval)
print(field_goals_score)
three_pointers_score = score(three_pointers_regular_season_percentile_confidence_interval,three_pointers_play_offs_percentile_confidence_interval)
print(three_pointers_score)
free_throws_score = score(free_throws_regular_season_percentile_confidence_interval,free_throws_play_offs_percentile_confidence_interval)
print(free_throws_score)
rebounds_score = score(rebounds_regular_season_percentile_confidence_interval,rebounds_play_offs_percentile_confidence_interval)
print(rebounds_score)
assists_score = score(assists_regular_season_percentile_confidence_interval,assists_play_offs_percentile_confidence_interval)
print(assists_score)
steals_score = score(steals_regular_season_percentile_confidence_interval,steals_play_offs_percentile_confidence_interval)
print(steals_score)
blocks_score = score(blocks_regular_season_percentile_confidence_interval,blocks_play_offs_percentile_confidence_interval)
print(blocks_score)
fouls_score = score(fouls_regular_season_percentile_confidence_interval,fouls_play_offs_percentile_confidence_interval)
print(fouls_score)
turnovers_score= score(turnovers_regular_season_percentile_confidence_interval,turnovers_play_offs_percentile_confidence_interval)
print(turnovers_score)
field_goal_attempts_score = score(field_goal_attempts_regular_season_percentile_confidence_interval,field_goal_attempts_play_offs_percentile_confidence_interval)
print(field_goal_attempts_score)
three_pointer_attempts_score = score(three_pointer_attempts_regular_season_percentile_confidence_interval,three_pointer_attempts_play_offs_percentile_confidence_interval)
print(three_pointer_attempts_score)
free_throw_attempts_score = score(free_throw_attempts_regular_season_percentile_confidence_interval,free_throw_attempts_play_offs_percentile_confidence_interval)
print(free_throw_attempts_score)
25.21427763257304
12.544016036174868
7.226050867293867
13.548723813295808
17.99875700853817
5.988458058461015
7.780331863029979
-5.320533006607075
-4.472768029068193
-2.231524564548751
0.5504793504985643
5.720869614037792
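Based on the computations, field goals, three-pointers, free throws, rebounds, assists, and blocks have the six greatest scores and will be selected as the predictors of wins and losses in statistical modelling. The printed scores for fouls, turnovers, and field goal attempts are negative, reflecting regular season intervals that lie below 0; ranking the scores by magnitude instead would leave the same six variables on top.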
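The selection can be checked by ranking the printed scores. The sketch below pairs each printed value with its variable in the order of the `print` calls above; the dictionary labels and the names `top_six` and `top_six_by_magnitude` are introduced here for illustration.

```python
# Scores copied from the printed output, in the order of the print calls
scores = {
    "field goals": 25.21427763257304,
    "three-pointers": 12.544016036174868,
    "free throws": 7.226050867293867,
    "rebounds": 13.548723813295808,
    "assists": 17.99875700853817,
    "steals": 5.988458058461015,
    "blocks": 7.780331863029979,
    "fouls": -5.320533006607075,
    "turnovers": -4.472768029068193,
    "field goal attempts": -2.231524564548751,
    "three pointer attempts": 0.5504793504985643,
    "free throw attempts": 5.720869614037792,
}

# Rank by signed value, then by magnitude, and keep the first six of each
top_six = sorted(scores, key=scores.get, reverse=True)[:6]
top_six_by_magnitude = sorted(scores, key=lambda k: abs(scores[k]), reverse=True)[:6]
print(top_six)
print(top_six_by_magnitude)
```

Both rankings return the same six variables, so the choice does not depend on whether the sign of the score is kept.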