We are running a discriminant analysis to predict whether or not a Major League Baseball team will make the playoffs, using the 2005-2007 MLB seasons. We also want to see how well the discriminant function can predict playoff teams from offensive statistics and from defensive statistics. The offensive statistics serving as independent variables are Runs Scored, Batting Average, On Base Percentage, Average Batters Age, and Home Runs. The defensive independent variables are Hits Allowed, Runs Allowed, Total Team Fielding Percentage, Saves, and Average Pitchers Age.
Our dependent variable, the outcome we are trying to predict, is Playoffs, which indicates whether or not a team made the playoffs that year. The discriminant analysis predicts which teams should have made the playoffs based on the statistics we specify, and we compare those predictions to the actual results to see how accurate the model is. The analysis did not always select exactly 8 playoff teams (4 from the American League and 4 from the National League). With the data provided, the model cannot be made to predict exactly 8 teams every time, so in those cases the accuracy may be slightly off, but the results are still interesting and important to our analysis.
2007 Descriptive Statistics
For the Offense and Defense independent variables there are no problems with skewness or kurtosis, so the 2007 data has no normality problems and is suitable for use.
Offense
Level of measurement and sample size issues
In a discriminant analysis the dependent variable should be non-metric and the independent variables should be metric. This holds in this analysis and in the analyses that follow, so the level of measurement requirement is satisfied.
The minimum ratio of valid cases to independent variables for discriminant analysis is 5 to 1, with a preferred ratio of 20 to 1. In this analysis there are 30 valid cases and 5 independent variables, a ratio of 6 to 1, which exceeds the minimum although it falls short of the preferred ratio. The sample size requirement for discriminant analysis is therefore satisfied. The assumption of homogeneity of variance is also satisfied. Since the significance value .202 > .05, we fail to reject the null hypothesis that the group variance-covariance matrices are equal. We therefore treat the group variance-covariance matrices as equal and conclude that homogeneity is satisfied.
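The sample size arithmetic and the decision rule for the homogeneity test can be written out as a short check. This is only a minimal sketch: the counts (30 cases, 5 independent variables), the 5-to-1 and 20-to-1 ratios, the .05 level, and the significance value .202 come from the text above, while the variable names are illustrative.

```python
# Values taken from the text above; the names are illustrative only.
valid_cases = 30
independent_variables = 5
minimum_ratio = 5      # minimum cases per independent variable
preferred_ratio = 20   # preferred cases per independent variable

ratio = valid_cases / independent_variables
meets_minimum = ratio >= minimum_ratio
meets_preferred = ratio >= preferred_ratio

alpha = 0.05
homogeneity_sig = 0.202  # significance of the covariance-equality test
# Failing to reject the null means equal covariance matrices remain plausible.
homogeneity_satisfied = homogeneity_sig > alpha

print(f"cases per variable: {ratio:.0f} to 1")
print(f"meets minimum ratio: {meets_minimum}")
print(f"meets preferred ratio: {meets_preferred}")
print(f"homogeneity assumption satisfied: {homogeneity_satisfied}")
```

The check makes explicit that the sample clears the minimum ratio but not the preferred one.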
Overall Relationship
The Wilks’ lambda statistic for the test of the function (Wilks’ lambda = .548) had a probability of p = 0.009, which is less than the level of significance of 0.05. This indicates that there is a statistically significant overall relationship.
Role of independent variables in predicting group membership
The discriminant function separates the two subgroups, making the playoffs and not making the playoffs. Here, variables with negative values relate to teams that made the playoffs, and variables with positive values relate to teams that did not.
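As a rough cross-check on the overall relationship test, the reported Wilks’ lambda can be converted to a chi-square statistic with Bartlett's approximation. This is a sketch under stated assumptions, not necessarily the computation the statistics package performed: it assumes 30 cases, 5 independent variables, and 2 groups, and the helper `chi2_sf` is a hypothetical name for a small standard-library routine written here for illustration.

```python
import math

def chi2_sf(x, df):
    """Upper-tail chi-square probability, built up by the recurrence on df."""
    half = x / 2.0
    if df % 2 == 1:
        q, k = math.erfc(math.sqrt(half)), 1   # exact tail for df = 1
    else:
        q, k = math.exp(-half), 2              # exact tail for df = 2
    while k < df:
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q

wilks_lambda = 0.548   # reported in the text
n_cases = 30           # assumption: 30 teams in the season
n_predictors = 5       # assumption: 5 independent variables
n_groups = 2           # playoffs vs. no playoffs

# Bartlett's chi-square approximation for Wilks' lambda
chi_square = -(n_cases - 1 - (n_predictors + n_groups) / 2.0) * math.log(wilks_lambda)
dof = n_predictors * (n_groups - 1)
p_value = chi2_sf(chi_square, dof)

print(f"chi-square = {chi_square:.2f}, df = {dof}, p = {p_value:.3f}")
```

Under these assumptions the p-value comes out at roughly 0.009, in line with the reported probability and below the 0.05 level.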
In the discriminant model, 3 of the 5 statistics relate to teams making the playoffs. Among these, the independent variable most strongly associated with making the playoffs is Saves (-.474), followed by Average Pitchers Age (-.400) and Total Team Fielding % (-.227). For 2006, Runs Allowed (.870) and Hits Allowed (.440) correlate with teams not making the playoffs.
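The sign-based reading of the variables can be illustrated by splitting the reported values by sign and ordering each side by magnitude. This is a minimal sketch: the numbers and the sign convention (negative for playoff teams) are the ones given in the text, while the dictionary and the helper `rank_by_magnitude` are hypothetical names introduced for illustration.

```python
# Values as reported in the text; negative values relate to playoff teams.
reported_values = {
    "Saves": -0.474,
    "Average Pitchers Age": -0.400,
    "Total Team Fielding %": -0.227,
    "Runs Allowed": 0.870,
    "Hits Allowed": 0.440,
}

def rank_by_magnitude(items):
    """Order (name, value) pairs from largest to smallest absolute value."""
    return sorted(items, key=lambda pair: abs(pair[1]), reverse=True)

playoff_side = rank_by_magnitude(
    [pair for pair in reported_values.items() if pair[1] < 0]
)
non_playoff_side = rank_by_magnitude(
    [pair for pair in reported_values.items() if pair[1] > 0]
)

print("Related to making the playoffs:")
for name, value in playoff_side:
    print(f"  {name}: {value:+.3f}")
print("Related to not making the playoffs:")
for name, value in non_playoff_side:
    print(f"  {name}: {value:+.3f}")
```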
Classification using the discriminant model
The discriminant model correctly classified 90% of the original grouped cases, which indicates that the model classifies teams very accurately. The model did not predict STL and OAK to make the playoffs when they actually did, and it predicted HOU to make the playoffs when they actually did not. Based on this model, the independent variables used can therefore be used to predict whether or not a team will make the playoffs.
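The 90% figure can be reconstructed from the three misclassified teams. This is a sketch under assumptions: it takes 30 teams in the season and 8 actual playoff teams (4 per league), both figures stated earlier in the text, and the variable names are illustrative.

```python
total_teams = 30           # valid cases per season, as stated in the text
actual_playoff_teams = 8   # 4 American League + 4 National League

missed_playoff_teams = ["STL", "OAK"]  # made the playoffs, predicted to miss
wrongly_predicted_teams = ["HOU"]      # missed the playoffs, predicted to make them

false_negatives = len(missed_playoff_teams)
false_positives = len(wrongly_predicted_teams)
true_positives = actual_playoff_teams - false_negatives
true_negatives = total_teams - actual_playoff_teams - false_positives

accuracy = (true_positives + true_negatives) / total_teams

print(f"correct playoff predictions: {true_positives}")
print(f"correct non-playoff predictions: {true_negatives}")
print(f"overall accuracy: {accuracy:.0%}")
```

With 3 of 30 teams misclassified, 27 are classified correctly, which matches the reported 90%. Here the model predicted 7 playoff teams rather than 8, an instance of the prediction-count issue noted earlier.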