Statistical Modelling
Initial Loading
The required libraries and the wrangled regular season and play-off totals data are loaded.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,confusion_matrix,ConfusionMatrixDisplay
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
regular_season_totals = pd.read_csv("regular_season_totals.csv")
play_off_totals = pd.read_csv("play_off_totals.csv")
Statistical Modelling (Team-Level)
In this component of the project, the top 6 variables representing team performance, selected in the Visualization Analysis component, are used in a logistic regression model to predict whether a team wins or loses a basketball game. Separate models are fitted to the regular season totals and the play-off totals. Each dataset is split into training and testing sets: 80% of the data is used to build the model and the remaining 20% is used to assess its performance. The predictors are standardized using statistics computed from the training set only. For each model, the test accuracy is reported and a confusion matrix is constructed to show how the correct and incorrect predictions are distributed across wins and losses.
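The link between a confusion matrix and accuracy can be illustrated with a small sketch: accuracy is the sum of the diagonal (correct predictions) divided by the total number of predictions. The counts below are hypothetical and are not taken from this project's data; the row and column order (losses first, then wins) assumes the outcome is coded as "L" and "W", and the helper name `accuracy_from_confusion_matrix` is introduced here for illustration only.

```python
import numpy as np

def accuracy_from_confusion_matrix(matrix):
    """Return the fraction of correct predictions: diagonal total / grand total."""
    matrix = np.asarray(matrix)
    return np.trace(matrix) / matrix.sum()

# Hypothetical counts (not project results).
# Rows are actual outcomes, columns are predicted outcomes, in the order (L, W).
example_matrix = np.array([[70, 30],
                           [26, 74]])

example_accuracy = accuracy_from_confusion_matrix(example_matrix)
print(example_accuracy)
```

The off-diagonal cells separate the two kinds of error (actual losses predicted as wins, and actual wins predicted as losses), which a single accuracy figure hides.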
Main Classification Models
The main classification models are built from the top 6 variables identified in the visualization analysis.
np.random.seed(100)
regular_season_main_predictors_training,regular_season_main_predictors_testing,regular_season_main_WL_training,regular_season_main_WL_testing = train_test_split(regular_season_totals[["fieldGoalsMade","threePointersMade","freeThrowsMade","reboundsTotal","assists","blocks"]],
regular_season_totals["WL"],test_size=0.2)
standard_scaler = StandardScaler()
regular_season_main_predictors_training = standard_scaler.fit_transform(regular_season_main_predictors_training)
regular_season_main_predictors_testing = standard_scaler.transform(regular_season_main_predictors_testing)
regular_season_main_classification_model = LogisticRegression()
regular_season_main_classification_model.fit(regular_season_main_predictors_training,regular_season_main_WL_training)
regular_season_main_classification_model_predictions = regular_season_main_classification_model.predict(regular_season_main_predictors_testing)
regular_season_main_classification_model_accuracy = accuracy_score(regular_season_main_WL_testing,regular_season_main_classification_model_predictions)
regular_season_main_classification_model_confusion_matrix = confusion_matrix(regular_season_main_WL_testing,regular_season_main_classification_model_predictions)
regular_season_main_classification_model_accuracy
0.7225390156062425
regular_season_main_classification_model_confusion_matrix_display = ConfusionMatrixDisplay(confusion_matrix=regular_season_main_classification_model_confusion_matrix,display_labels=regular_season_main_classification_model.classes_)
regular_season_main_classification_model_confusion_matrix_display.plot()
plt.show()
The model using regular season totals provides approximately 72.25% accuracy using the top 6 variables selected.
np.random.seed(100)
play_offs_main_predictors_training,play_offs_main_predictors_testing,play_offs_main_WL_training,play_offs_main_WL_testing = train_test_split(play_off_totals[["fieldGoalsMade","threePointersMade","freeThrowsMade","reboundsTotal","assists","blocks"]],
play_off_totals["WL"],test_size=0.2)
standard_scaler = StandardScaler()
play_offs_main_predictors_training = standard_scaler.fit_transform(play_offs_main_predictors_training)
play_offs_main_predictors_testing = standard_scaler.transform(play_offs_main_predictors_testing)
play_offs_main_classification_model = LogisticRegression()
play_offs_main_classification_model.fit(play_offs_main_predictors_training,play_offs_main_WL_training)
play_offs_main_classification_model_predictions = play_offs_main_classification_model.predict(play_offs_main_predictors_testing)
play_offs_main_classification_model_accuracy = accuracy_score(play_offs_main_WL_testing,play_offs_main_classification_model_predictions)
play_offs_main_classification_model_confusion_matrix = confusion_matrix(play_offs_main_WL_testing,play_offs_main_classification_model_predictions)
play_offs_main_classification_model_accuracy
0.718816067653277
play_offs_main_classification_model_confusion_matrix_display = ConfusionMatrixDisplay(confusion_matrix=play_offs_main_classification_model_confusion_matrix,display_labels=play_offs_main_classification_model.classes_)
play_offs_main_classification_model_confusion_matrix_display.plot()
plt.show()
The model using play-off totals provides approximately 71.88% accuracy using the top 6 variables selected, similar to the model using regular season totals.
Comparison with Other Models
To assess how well the model using the top 6 variables performs, it is compared with a model using the bottom 6 variables and a model using all 12 variables.
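Each comparison repeats the same steps (split, scale, fit, score) with a different list of predictor columns. A minimal sketch of how those steps could be wrapped in one helper is given here. The function name `fit_and_score`, the synthetic data frame, and its column names are assumptions made for illustration. The sketch also passes `random_state` to `train_test_split` rather than calling `np.random.seed`, so it would not reproduce the exact splits used in the cells of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def fit_and_score(totals, predictor_columns, outcome_column="WL", seed=100):
    """Split 80/20, standardize on the training set, fit, and return test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        totals[predictor_columns], totals[outcome_column],
        test_size=0.2, random_state=seed)
    # The pipeline fits the scaler on the training data only,
    # then applies the same scaling when predicting on the test data.
    model = make_pipeline(StandardScaler(), LogisticRegression())
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))

# Synthetic stand-in data (hypothetical, not the project's totals):
# one column related to the outcome and one that is pure noise.
rng = np.random.default_rng(0)
n_games = 500
synthetic = pd.DataFrame({
    "informative": rng.normal(size=n_games),
    "noise": rng.normal(size=n_games),
})
signal = synthetic["informative"] + 0.5 * rng.normal(size=n_games)
synthetic["WL"] = np.where(signal > 0, "W", "L")

informative_accuracy = fit_and_score(synthetic, ["informative"])
noise_accuracy = fit_and_score(synthetic, ["noise"])
print(informative_accuracy, noise_accuracy)
```

With the real data, the same helper could be called once per column list (top 6, bottom 6, all 12) and per dataset, which keeps the six comparisons on an identical footing.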
np.random.seed(100)
regular_season_bottom_predictors_training,regular_season_bottom_predictors_testing,regular_season_bottom_WL_training,regular_season_bottom_WL_testing = train_test_split(regular_season_totals[["steals","foulsPersonal","turnovers","fieldGoalsAttempted","threePointersAttempted","freeThrowsAttempted"]],
regular_season_totals["WL"],test_size=0.2)
standard_scaler = StandardScaler()
regular_season_bottom_predictors_training = standard_scaler.fit_transform(regular_season_bottom_predictors_training)
regular_season_bottom_predictors_testing = standard_scaler.transform(regular_season_bottom_predictors_testing)
regular_season_bottom_classification_model = LogisticRegression()
regular_season_bottom_classification_model.fit(regular_season_bottom_predictors_training,regular_season_bottom_WL_training)
regular_season_bottom_classification_model_predictions = regular_season_bottom_classification_model.predict(regular_season_bottom_predictors_testing)
regular_season_bottom_classification_model_accuracy = accuracy_score(regular_season_bottom_WL_testing,regular_season_bottom_classification_model_predictions)
regular_season_bottom_classification_model_confusion_matrix = confusion_matrix(regular_season_bottom_WL_testing,regular_season_bottom_classification_model_predictions)
regular_season_bottom_classification_model_accuracy
0.5975390156062425
regular_season_bottom_classification_model_confusion_matrix_display = ConfusionMatrixDisplay(confusion_matrix=regular_season_bottom_classification_model_confusion_matrix,display_labels=regular_season_bottom_classification_model.classes_)
regular_season_bottom_classification_model_confusion_matrix_display.plot()
plt.show()
Using the bottom 6 variables, an accuracy of only approximately 59.75% is achieved using regular season totals.
np.random.seed(100)
play_offs_bottom_predictors_training,play_offs_bottom_predictors_testing,play_offs_bottom_WL_training,play_offs_bottom_WL_testing = train_test_split(play_off_totals[["steals","foulsPersonal","turnovers","fieldGoalsAttempted","threePointersAttempted","freeThrowsAttempted"]],
play_off_totals["WL"],test_size=0.2)
standard_scaler = StandardScaler()
play_offs_bottom_predictors_training = standard_scaler.fit_transform(play_offs_bottom_predictors_training)
play_offs_bottom_predictors_testing = standard_scaler.transform(play_offs_bottom_predictors_testing)
play_offs_bottom_classification_model = LogisticRegression()
play_offs_bottom_classification_model.fit(play_offs_bottom_predictors_training,play_offs_bottom_WL_training)
play_offs_bottom_classification_model_predictions = play_offs_bottom_classification_model.predict(play_offs_bottom_predictors_testing)
play_offs_bottom_classification_model_accuracy = accuracy_score(play_offs_bottom_WL_testing,play_offs_bottom_classification_model_predictions)
play_offs_bottom_classification_model_confusion_matrix = confusion_matrix(play_offs_bottom_WL_testing,play_offs_bottom_classification_model_predictions)
play_offs_bottom_classification_model_accuracy
0.5813953488372093
play_offs_bottom_classification_model_confusion_matrix_display = ConfusionMatrixDisplay(confusion_matrix=play_offs_bottom_classification_model_confusion_matrix,display_labels=play_offs_bottom_classification_model.classes_)
play_offs_bottom_classification_model_confusion_matrix_display.plot()
plt.show()
Using the bottom 6 variables, an accuracy of only approximately 58.14% is achieved using play-off totals.
np.random.seed(100)
regular_season_all_predictors_training,regular_season_all_predictors_testing,regular_season_all_WL_training,regular_season_all_WL_testing = train_test_split(regular_season_totals[["fieldGoalsMade","threePointersMade","freeThrowsMade","reboundsTotal","assists","blocks","steals","foulsPersonal","turnovers","fieldGoalsAttempted","threePointersAttempted","freeThrowsAttempted"]],
regular_season_totals["WL"],test_size=0.2)
standard_scaler = StandardScaler()
regular_season_all_predictors_training = standard_scaler.fit_transform(regular_season_all_predictors_training)
regular_season_all_predictors_testing = standard_scaler.transform(regular_season_all_predictors_testing)
regular_season_all_classification_model = LogisticRegression()
regular_season_all_classification_model.fit(regular_season_all_predictors_training,regular_season_all_WL_training)
regular_season_all_classification_model_predictions = regular_season_all_classification_model.predict(regular_season_all_predictors_testing)
regular_season_all_classification_model_accuracy = accuracy_score(regular_season_all_WL_testing,regular_season_all_classification_model_predictions)
regular_season_all_classification_model_confusion_matrix = confusion_matrix(regular_season_all_WL_testing,regular_season_all_classification_model_predictions)
regular_season_all_classification_model_accuracy
0.8463385354141657
regular_season_all_classification_model_confusion_matrix_display = ConfusionMatrixDisplay(confusion_matrix=regular_season_all_classification_model_confusion_matrix,display_labels=regular_season_all_classification_model.classes_)
regular_season_all_classification_model_confusion_matrix_display.plot()
plt.show()
Using all variables analyzed, an accuracy of approximately 84.63% is achieved using regular season totals.
np.random.seed(100)
play_offs_all_predictors_training,play_offs_all_predictors_testing,play_offs_all_WL_training,play_offs_all_WL_testing = train_test_split(play_off_totals[["fieldGoalsMade","threePointersMade","freeThrowsMade","reboundsTotal","assists","blocks","steals","foulsPersonal","turnovers","fieldGoalsAttempted","threePointersAttempted","freeThrowsAttempted"]],
play_off_totals["WL"],test_size=0.2)
standard_scaler = StandardScaler()
play_offs_all_predictors_training = standard_scaler.fit_transform(play_offs_all_predictors_training)
play_offs_all_predictors_testing = standard_scaler.transform(play_offs_all_predictors_testing)
play_offs_all_classification_model = LogisticRegression()
play_offs_all_classification_model.fit(play_offs_all_predictors_training,play_offs_all_WL_training)
play_offs_all_classification_model_predictions = play_offs_all_classification_model.predict(play_offs_all_predictors_testing)
play_offs_all_classification_model_accuracy = accuracy_score(play_offs_all_WL_testing,play_offs_all_classification_model_predictions)
play_offs_all_classification_model_confusion_matrix = confusion_matrix(play_offs_all_WL_testing,play_offs_all_classification_model_predictions)
play_offs_all_classification_model_accuracy
0.8414376321353065
play_offs_all_classification_model_confusion_matrix_display = ConfusionMatrixDisplay(confusion_matrix=play_offs_all_classification_model_confusion_matrix,display_labels=play_offs_all_classification_model.classes_)
play_offs_all_classification_model_confusion_matrix_display.plot()
plt.show()
Using all variables analyzed, an accuracy of approximately 84.14% is achieved using play-off totals, similar to the model using regular season totals.
Conclusion
In conclusion, the models using the top 6 variables as predictors of game outcomes achieved reasonably good accuracy, about 72% on both datasets. The higher accuracy of the all-variable models (about 84 to 85%) is expected, because those models have access to more information about team performance when making predictions. Compared with the all-variable models, the top 6 variable models are approximately 12.4 (regular season) and 12.3 (play-offs) percentage points less accurate, whereas the bottom 6 variable models are approximately 24.9 and 26.0 percentage points less accurate, roughly double the loss in accuracy. As such, the classification model built from the top 6 variables performs well, especially compared with the model using the bottom 6 variables. These results also suggest that the earlier visualization analysis was effective at identifying the more informative variables.
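The accuracy gaps relative to the all-variable models can be recomputed directly from the test accuracies reported in the cells above. The sketch below copies those six values by hand; the dictionary layout and the names `accuracies` and `gaps` are illustrative choices, and the gaps are expressed in percentage points.

```python
# Test accuracies copied from the outputs reported earlier in this notebook.
accuracies = {
    "regular season": {"top": 0.7225390156062425,
                       "bottom": 0.5975390156062425,
                       "all": 0.8463385354141657},
    "play-offs": {"top": 0.718816067653277,
                  "bottom": 0.5813953488372093,
                  "all": 0.8414376321353065},
}

# Gap to the all-variable model, in percentage points, rounded to 2 decimals.
gaps = {}
for dataset, scores in accuracies.items():
    gaps[dataset] = {
        subset: round((scores["all"] - scores[subset]) * 100, 2)
        for subset in ("top", "bottom")
    }

for dataset, dataset_gaps in gaps.items():
    ratio = dataset_gaps["bottom"] / dataset_gaps["top"]
    print(dataset, dataset_gaps, "bottom/top gap ratio:", round(ratio, 2))
```

Because each figure comes from a single 80/20 split, small differences between the two datasets should be read with some caution.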