Statcast: A Curveball to Baseball

By Brian Salerno, @briansalerno (SMU)

Baseball has had a lifelong fascination with statistics. Before some of the earliest recognized baseball leagues in the United States had even formed, Henry Chadwick created the baseball box score and produced the first annual baseball guide. This allowed the game to be followed by those who were not at the ballpark. People could now build an objective narrative around the performance of teams and players through numbers.

Baseball researchers have used this wealth of information to answer questions on a number of topics, such as how players from different eras compare, how rule changes affected baseball in its infancy, and how old stadiums affected certain statistics. Most influential to the modern game, however, was Bill James’ research in the late 1970s and 1980s. James is credited with sparking the sabermetric movement, whose roots are present in most of today’s analytic efforts. All of James’ work was made possible through publicly accessible data or data gathered by the public, since Major League Baseball refused to release play-by-play accounts of games.

In 2007, Major League Baseball unveiled pitchf/x, a pitch tracking system that could quantify the velocity, spin rate, and location of all pitches thrown over the course of a season. By then, “Moneyball” had created an insatiable appetite for new data among baseball clubs and the public alike. MLB chose to make this new data public, providing it in an accessible format, XML. The same league that had refused to share data with the public three decades earlier was now releasing it, prompted by a movement started by one of the very people it had turned away.

If MLB thought that pitchf/x would satisfy baseball fans’ desire for data, it was sorely mistaken. Prior to the 2016 season, all thirty ballparks were outfitted with Statcast, a system that gives even more detailed information through player tracking technology. While some data has been released to the public, the vast majority of the information generated by the system has not. Part of this decision could be logistical, since Statcast data for one game is 80 gigabytes compressed and 7 terabytes uncompressed. Another reason for keeping the data under wraps may be that clubs want to fully realize its potential before turning it loose.

Given how detailed and large the dataset appears to be, the clubs’ desire for a head start is a likely reason. Because pitchf/x was released to the public immediately, teams had little chance to gain a competitive advantage from it.

However, there is enormous public interest in the data. As seen before, analysis by the public can influence the game on a massive scale. Public access to pitchf/x spawned a number of useful, informative, and free-to-use resources such as Baseball Savant and Brooks Baseball. While a comparable Statcast resource would take some time (and a lot of server space) to develop, it would allow the public to conduct analyses that could be game-changing.

With the limited data already released publicly, baseball has learned about the correlation between batted-ball exit velocity and how often those balls fall for hits. Imagine what could be discovered if the public were given access to the treasure trove of data that is Statcast. After all, baseball has the most statistically inclined fans, and giving them more data could only benefit the game.

About the Author: Brian Salerno is a junior at Southern Methodist University double majoring in Sport Management and Statistics. He is interested in working in football and baseball data and analytics. You can connect with Brian on LinkedIn here.

