
Sports Analytics

IPL Auctions



About the data: Ball-by-ball data is available in YAML format. I used Python 3 to read the data and combine it into a single CSV file, from which I built the training and test data sets using only the variables needed for the analysis.
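As a rough sketch of this step, the snippet below flattens one already-parsed match into per-ball rows and writes them out as CSV. The nested layout (an `innings` list whose entries hold a `deliveries` list, each delivery being a one-item mapping from ball label to details) and the field names `runs`, `total` and `wicket` are assumptions made for illustration, not necessarily the layout of the files used here. The function names are hypothetical as well. Parsing the file itself would be done beforehand with a YAML library, for example `yaml.safe_load` from PyYAML.

```python
import csv
import io

FIELDS = ["innings", "ball", "runs", "wicket"]

def flatten_match(match):
    """Turn one parsed match (nested dicts and lists) into flat per-ball rows."""
    rows = []
    for innings_no, innings in enumerate(match["innings"], start=1):
        for delivery in innings["deliveries"]:
            # Assumed layout: each delivery is a one-item dict {ball_label: details}.
            ball_label, details = next(iter(delivery.items()))
            rows.append({
                "innings": innings_no,
                "ball": ball_label,
                "runs": details["runs"]["total"],
                "wicket": 1 if "wicket" in details else 0,
            })
    return rows

def write_csv(rows, handle):
    """Write the flat rows to an open text handle with a header line."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)

# Tiny hand-made match standing in for a parsed YAML file (not real IPL data).
sample_match = {
    "innings": [
        {"deliveries": [
            {0.1: {"runs": {"total": 1}}},
            {0.2: {"runs": {"total": 0}, "wicket": {"kind": "bowled"}}},
        ]},
        {"deliveries": [
            {0.1: {"runs": {"total": 4}}},
        ]},
    ]
}

buffer = io.StringIO()
write_csv(flatten_match(sample_match), buffer)
csv_text = buffer.getvalue()
```

In practice `write_csv` would be called once with the rows of all matches and a file opened with `newline=""`, so that a single CSV file holds every delivery.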

Here is a summary of the data from 178 IPL matches played during 2008-2010. The aim of the model is to predict the outcome of a game in the early stages of the second innings and, later, to propose a measure of a player's value based on his contribution to winning a match. Some preliminary statistics of the data are presented here:


A mean of 0.5147 for victory means that in roughly 51% of the data points the chasing team (the team batting second) won. The median target score chased by teams batting second is around 162.

The variable lost_wickets denotes the number of wickets the chasing team has lost at the time the ball is delivered, scored is the chasing team's score after the ball is delivered, and overs is the number of balls bowled so far divided by 6.
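A minimal sketch of how these three variables could be built up ball by ball is given below. The function name `chase_states` and the input format (a list of `(runs, wicket_fell)` pairs) are hypothetical. Two details are assumptions on my part: a wicket that falls on a ball is counted in that ball's lost_wickets, and every delivery counts as a ball bowled, since the treatment of wides and no-balls is not specified above.

```python
def chase_states(deliveries):
    """Compute the chasing team's state after each ball.

    deliveries: list of (runs_off_ball, wicket_fell) pairs for the second
    innings, in the order they were bowled.
    """
    states = []
    scored = 0
    lost_wickets = 0
    for balls_bowled, (runs, wicket_fell) in enumerate(deliveries, start=1):
        scored += runs                 # score after this ball
        if wicket_fell:
            lost_wickets += 1          # assumption: include a wicket on this ball
        states.append({
            "lost_wickets": lost_wickets,
            "scored": scored,
            "overs": balls_bowled / 6,  # balls bowled so far divided by 6
        })
    return states

# Made-up over for illustration only.
example_over = [(1, False), (4, False), (0, True), (6, False), (0, False), (2, False)]
states = chase_states(example_over)
```

Each resulting dictionary corresponds to one data point; adding the match target and the final outcome to it would give one row of the training data.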

To summarize, lost_wickets, scored and overs together determine the resources available to the team chasing a target.
Running a logistic regression of the variable victory on the available resources and the target yields the following results:


All the variables turned out to be strong determinants. The model correctly predicts the outcome close to 80% of the time at various stages of the second innings, and its performance is the same on the test data set. More interestingly, its accuracy is close to 75% on the subset of the test data that consists only of the early stages of the second innings. This shows that the model predicts outcomes well even in the early stages of the chase.
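For reference, a logistic regression with these four regressors has the form below. The coefficient symbols are notation introduced here, and the absence of interaction terms is an assumption. A win for the chasing team is predicted when the fitted probability exceeds a cutoff, commonly 0.5; the cutoff used for the accuracy figures is not stated.

```latex
\[
P(\text{victory} = 1) =
\frac{1}{1 + \exp\!\bigl(-(\beta_0 + \beta_1\,\text{target}
  + \beta_2\,\text{lost\_wickets} + \beta_3\,\text{scored}
  + \beta_4\,\text{overs})\bigr)}
\]
```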

Confusion Matrix for Train Data

Confusion Matrix for Test Data


Confusion Matrix for Early Stages on Test Data
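The accuracy figures quoted above can be read off a confusion matrix as the share of points on its diagonal. The sketch below shows that calculation for the full set and for an early-stage subset. The labels are made up for illustration and are not the IPL results, the helper names are hypothetical, and the cutoff of 6 overs for "early stages" is an assumption, since no exact definition is given.

```python
def confusion_matrix(actual, predicted):
    """Count outcomes as {(actual, predicted): n} for binary labels 0/1."""
    counts = {(a, p): 0 for a in (0, 1) for p in (0, 1)}
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    return counts

def accuracy(counts):
    """Share of points where the predicted label equals the actual label."""
    correct = counts[(0, 0)] + counts[(1, 1)]
    return correct / sum(counts.values())

# Made-up data points: overs bowled, actual outcome, model prediction.
rows = [
    {"overs": 1.5, "victory": 1, "predicted": 1},
    {"overs": 3.0, "victory": 0, "predicted": 1},
    {"overs": 5.5, "victory": 0, "predicted": 0},
    {"overs": 6.0, "victory": 1, "predicted": 0},
    {"overs": 10.0, "victory": 1, "predicted": 1},
    {"overs": 14.5, "victory": 0, "predicted": 0},
    {"overs": 18.0, "victory": 1, "predicted": 1},
    {"overs": 19.5, "victory": 0, "predicted": 0},
]

EARLY_OVERS_CUTOFF = 6.0  # hypothetical definition of "early stages"

def matrix_for(subset):
    return confusion_matrix([r["victory"] for r in subset],
                            [r["predicted"] for r in subset])

all_counts = matrix_for(rows)
early_counts = matrix_for([r for r in rows if r["overs"] <= EARLY_OVERS_CUTOFF])

overall_accuracy = accuracy(all_counts)
early_accuracy = accuracy(early_counts)
```

Comparing `overall_accuracy` with `early_accuracy` mirrors the comparison made above between the full test set and its early-stage subset.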