In this article, we provide an overview of how the Amazon Machine Learning (ML) Solutions Lab team collaborated with the NFL Next Gen Stats team to create the new Expected Return Yards statistic.
What is the core concept behind Expected Return Yards, and how was it developed?
Over the past five years, the NFL Next Gen Stats (NGS) team has partnered with AWS to introduce a variety of analytical statistics that enhance our understanding of the game. Traditionally, these metrics have focused on offensive and defensive play (for more detail, see these blog posts on defense coverage classification and predicting fourth-down conversion). This season, we’ve applied our learnings to special teams and return plays. Specifically, we developed two distinct models to forecast the expected yards gained on punt and kickoff returns. The Expected Punt Return Yards model estimates the yards a punt returner is likely to gain upon fielding a punt, while the Expected Kickoff Return Yards model forecasts the yards a kick returner is expected to gain once they receive the kickoff.
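Models of this kind are typically distributional predictors: rather than a single number, the model outputs a probability over possible yardage outcomes, and the published statistic is the expectation of that distribution. A minimal sketch of that final step, using made-up bins and a toy probability curve rather than NFL data:

```python
import numpy as np

# Hypothetical sketch: the model assigns a probability to each possible
# yardage outcome; "expected return yards" is the mean of that distribution.
# The bins and probabilities below are illustrative, not NFL data.
yard_bins = np.arange(-10, 61)  # possible return yardage outcomes
probs = np.exp(-0.5 * ((yard_bins - 12) / 8.0) ** 2)  # toy bell curve around 12 yards
probs /= probs.sum()  # normalize to a valid probability distribution

expected_return_yards = float(np.dot(yard_bins, probs))
print(round(expected_return_yards, 2))
```

Keeping the full distribution (rather than only its mean) is what later lets the statistic express how unusual a long return was.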
When creating these advanced statistics, AWS and the NGS teams utilized a range of artificial intelligence (AI) and machine learning techniques. We also drew on existing model frameworks to devise new statistics. For example, the 2020 Expected Rushing Yards model was crafted by Austrian data scientists Sarah Miller and Eric Brown (2019 Big Data Bowl winners), using raw player tracking data and deep learning methodologies. Two years later, NGS and AWS adapted this architecture for similar applications, such as the 2021 Expected Points Added (EPA) model, which is fundamental to the NGS Passing Score (2021). Using a similar modeling structure, we created the expected yards models for the return game—Expected Punt and Kickoff Return Yards.
Why create separate models for punts and kickoffs?
Initially, we considered merging the punt and kickoff data to train a single model, since conventional machine learning wisdom suggests that more training data improves accuracy. However, when we combined the datasets, the model did not perform as anticipated, yielding worse results than models trained on each dataset separately.
One contributing factor was the differing distributions of yardage gained from punts versus kickoffs. NFL datasets reveal that average yardage gained from kickoffs is generally higher than that from punts. There are also significant differences in player positioning, defender proximity at the time of the catch, returner speed and acceleration, and other dynamics. This complexity made it difficult for a combined model to effectively differentiate between the two types of returns. In fact, our analysis showed that the Root Mean Squared Error (RMSE) nearly doubled when using the merged data compared to the individual models.
As a result, we opted to develop separate models for punts and kickoffs. This yielded several benefits, notably the ability to fine-tune models based on the data specific to each return type. This allowed for independent experiments tailored to each return scenario, enhancing model performance. We assessed model efficacy using multiple metrics, including RMSE and the continuous ranked probability score (CRPS), an alternative to log likelihood that is more resilient to outliers.
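The two evaluation metrics mentioned above can be sketched as follows (the data is illustrative, not NFL results). For a sample-based forecast, CRPS can be estimated with the standard identity CRPS = E|X - y| - 0.5 * E|X - X'|, where X and X' are independent draws from the predicted distribution and y is the observed value:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error of a point forecast."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def crps_samples(samples, y_obs):
    """Sample-based CRPS: E|X - y| - 0.5 * E|X - X'|."""
    s = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(s - y_obs))
    term2 = 0.5 * np.mean(np.abs(s[:, None] - s[None, :]))
    return float(term1 - term2)

rng = np.random.default_rng(0)
observed = 9.0                              # actual return yards (toy value)
forecast = rng.normal(8.0, 5.0, size=2000)  # samples from a predictive distribution

print(rmse([5, 10, 20], [7, 9, 18]))   # sqrt(3) ~= 1.73
print(crps_samples(forecast, observed))
```

Unlike RMSE, CRPS scores the entire predicted distribution, which is why it pairs naturally with a model that outputs yardage probabilities rather than a single number.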
The significance of leveraging existing models and techniques
Collaborating with the Next Gen Stats team allowed us to build on previous analytics rather than starting from scratch. This approach significantly expedited the project timeline; we completed training and experimentation within just six weeks. Typically, a machine learning project involves extensive time spent on problem comprehension, literature review, data exploration, and experimentation. However, by utilizing established techniques and models, we streamlined the process.
Moreover, leveraging existing frameworks enabled us to focus on the most pressing challenge: addressing the fat-tailed problem present in the datasets. As noted, touchdowns are infrequent in our test data: only two of the 865 punt returns and nine of the 1,130 kickoff returns. These rare events can dramatically influence game dynamics, necessitating a method capable of accurately modeling these occurrences alongside more common returns. This led us to explore the Spliced Binned-Pareto (SBP) distribution, which proved effective in our ML pipeline. Developed by our colleagues, the SBP distribution adeptly handles time-series data characterized by heavy-tailed noise, ensuring accurate modeling of extreme events.
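The splicing idea can be illustrated with a simplified sketch (not the production SBP implementation): model the body of the return-yardage distribution with histogram bins, and splice a generalized Pareto tail above a threshold to capture rare long returns. The data, threshold, and bin count below are assumptions chosen for illustration:

```python
import numpy as np
from scipy.stats import genpareto

# Illustrative data: mostly typical returns, plus a few rare long ones.
rng = np.random.default_rng(42)
yards = np.concatenate([
    rng.normal(10, 6, size=5000),                                    # body: typical returns
    genpareto.rvs(0.3, loc=40, scale=10, size=50, random_state=1),   # rare long returns
])

threshold = 40.0
tail_prob = float(np.mean(yards > threshold))  # mass assigned to the Pareto tail

# Body: empirical histogram (binned) at or below the threshold
body = yards[yards <= threshold]
bin_edges = np.linspace(body.min(), threshold, 21)
bin_counts, _ = np.histogram(body, bins=bin_edges)
bin_probs = bin_counts / bin_counts.sum() * (1 - tail_prob)

# Tail: generalized Pareto fitted to exceedances over the threshold
exceed = yards[yards > threshold] - threshold
shape, loc, scale = genpareto.fit(exceed, floc=0)

# The spliced pieces should account for all probability mass
print(bin_probs.sum() + tail_prob)
```

The benefit of the splice is that the binned body stays flexible for common outcomes while the Pareto tail extrapolates sensibly beyond the longest returns seen in training.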
Additionally, transfer learning played a vital role in our methodology. In scenarios with limited datasets, it can be challenging to develop high-performing models. Transfer learning allows us to repurpose models trained on one task to enhance performance on a similar, related task.
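As a toy illustration of that principle (a sketch, not our actual pipeline), the following pretrains a linear model on a large hypothetical "source" task, then fine-tunes from those learned weights on a small, related "target" task, and compares against training from scratch with the same fine-tuning budget:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_linear(X, y, w_init, lr=0.01, steps=500):
    """Least-squares regression by gradient descent, starting from w_init."""
    w = w_init.copy()
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

true_w = np.array([2.0, -1.0, 0.5])

# Large source dataset; small target dataset with slightly shifted true weights
X_src = rng.normal(size=(1000, 3))
y_src = X_src @ true_w + rng.normal(0, 0.1, 1000)
X_tgt = rng.normal(size=(20, 3))
y_tgt = X_tgt @ (true_w + 0.1) + rng.normal(0, 0.1, 20)

w_src = fit_linear(X_src, y_src, np.zeros(3))                 # pretraining
w_transfer = fit_linear(X_tgt, y_tgt, w_src, steps=20)        # warm start from source
w_scratch = fit_linear(X_tgt, y_tgt, np.zeros(3), steps=20)   # same budget, cold start

err = lambda w: float(np.mean((X_tgt @ w - y_tgt) ** 2))
print(err(w_transfer), err(w_scratch))
```

With only 20 target examples and a short training budget, the warm-started model benefits from the source task's weights, which is the same intuition behind reusing a model trained on one return type for a related one.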