Sunday, April 15, 2007

Numbers Numbers Numbers (Part 2)

Okay, let’s just get this out of the way right off: I’m at a loss right now. Anyway...

In case you didn’t read Part 1 of what will hopefully become a pretty lengthy series of Football Numbers posts (assuming I figure out where to go with this), here’s a quick summary of Part 1: last week I got things started by entering the 2006 box scores for Big XII teams and running some simple correlations to see what statistical categories had the biggest impact on wins and losses for each team. Each team had a completely different list, and I was curious if these correlations tended to change drastically from year to year or if each team had something of a blueprint.

(Also, I changed from running the basic Excel correlation, which is a Pearson’s correlation, to running a Spearman correlation, which works on ranks and so doesn’t require the data to be “bivariate normal” the way Pearson’s does. Like the term ‘1080i’ when it comes to new big screens, you don’t have to know what “bivariate normal” is, you should just know that not needing it is cool.)
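For anyone who does want to peek under the hood: Spearman is just Pearson run on ranks instead of raw values. Here's a minimal sketch in plain Python. The function names and the eight-game sample are made up for illustration (they are not from my spreadsheet), and a stats package would do the same job.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """Rank values from 1..n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))

# HYPOTHETICAL eight-game sample, for illustration only
rushing_yards = [88, 142, 201, 95, 176, 230, 110, 64]
won = [0, 1, 1, 0, 1, 1, 0, 0]  # 1 = win, 0 = loss

print("Pearson: ", round(pearson(rushing_yards, won), 2))
print("Spearman:", round(spearman(rushing_yards, won), 2))
```

Since wins and losses are just 1s and 0s, every win ties with every other win, which is why the rank function has to handle ties.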

So what’s happened since then? Well, this week I entered the box scores from 2005 and 2004 with three themes in mind: 1) Would each team have similar key variables every year...in other words, is there something of a blueprint for each team? 2) Was a high/low correlation in any one variable a sign of a good/bad team? (In other words, if I listed out all the correlations and tied them to win %, would there be a high correlation in any one area?) 3) Was a high/low correlation in any one variable a predictor of success/failure the next season?

In the methods I used, I really didn’t find much. So I’ll be asking for some help. But first, I’ll summarize what I found...in all its non-glory...

Does each team have a blueprint?

In a word...no.

Well, that’s not quite right. In all, if you look at the average correlations for all the categories I entered, you’ll see that each team still has something of a blueprint: each team has different correlations and categories which most directly lead to their success/failure. However, correlations for most variables vary significantly from season to season. For instance, Baylor’s #1 most important overall category is Opponents’ Yards Per Carry (with an average correlation of -0.73). The correlation for this variable was -0.81 in 2004, -0.88 in 2005, and -0.49 in 2006. For Missouri, Team Yards Per Carry was the #3 most important variable overall (an average of 0.66). Its value was 0.62 in 2004, 0.92 in 2005, and 0.43 in 2006. Even the most important overall variables vary wildly from season to season.
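To put numbers on "vary wildly," here's a quick sketch that recomputes the two averages quoted above and adds the gap between each variable's best and worst season. The season values are the ones quoted above; the `summarize` helper and the "spread" measure are just my illustration, and I'm assuming the averages are simple means (which the quoted numbers are consistent with).

```python
# Season-by-season correlations quoted in the post
seasons = {
    "Baylor, Opponents' Yards Per Carry": {2004: -0.81, 2005: -0.88, 2006: -0.49},
    "Missouri, Team Yards Per Carry": {2004: 0.62, 2005: 0.92, 2006: 0.43},
}

def summarize(by_year):
    """Return (simple mean, max-minus-min spread), each rounded to 2 places."""
    values = list(by_year.values())
    average = sum(values) / len(values)
    spread = max(values) - min(values)
    return round(average, 2), round(spread, 2)

for label, by_year in seasons.items():
    average, spread = summarize(by_year)
    print(f"{label}: average {average:+.2f}, spread {spread:.2f}")
```

A spread of 0.39 for Baylor and 0.49 for Missouri is a lot of movement for a number that can only run from -1 to 1.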

This makes sense when you think about it, though. There’s always a change in personnel, and football, like other games, is a game of inches. So even if a team is perfectly consistent in its play-calling and relatively consistent in its execution, there are still all sorts of variables that factor into their success. This isn’t surprising.

It was disappointing, though. I’d love to stumble across a magic bullet, after all.

Anyway...with that in mind, here are the strongest correlations for each team (again, I ranked them according to absolute value—some high correlations were positive, some were negative).

Baylor

1. Opponents’ Yards Per Carry (-0.73)
2. Opponents’ Rushing Yards (-0.66)
3. First Down Ratio (0.51)
4. Opponents’ Turnovers (0.51)
5. 3rd Down Ratio (0.47)

Colorado

1. Team Rushing Attempts (0.69)
2. Opponents’ Rushing Attempts (-0.65)
3. First Down Ratio (0.65)
4. Team Rushing Yards (0.64)
5. Opponents’ Rushing Yards (-0.58)

Iowa State

1. 3rd Down Ratio (0.65)
2. Opponents’ Yards Per Passing Attempt (-0.59)
3. Team Rushing Yards (0.54)
4. Team Rushing Attempts (0.54)
5. Yards Per Carry (0.53)

Kansas

1. 3rd Down Ratio (0.62)
2. Rushing Yards (0.52)
3. 3rd Down Conversion % (0.52)
4. Opponents’ Rushing Yards (-0.49)
5. Pass Completion % (0.49)

Kansas State

1. 3rd Down Ratio (0.73)
2. First Down Ratio (0.68)
3. Rushing Attempts (0.62)
4. Time of Possession (0.59)
5. Opponents’ Completion % (-0.58)

Missouri

1. Rushing Yards (0.72)
2. Opponents’ Rushing Yards (-0.68)
3. Yards Per Carry (0.66)
4. First Down Ratio (0.56)
5. First Downs (0.53)

Nebraska

1. Rushing Yards (0.60)
2. Turnover Margin (-0.60)
3. Opponents’ Yards Per Carry (-0.55)
4. Opponents’ Yards Per Passing Attempt (-0.55)
5. Team Rushes (0.55)

Oklahoma

1. First Down Ratio (0.64)
2. Opponents’ First Downs (-0.60)
3. Opponents’ Yards Per Passing Attempt (-0.56)
4. Opponents’ Yards Per Carry (-0.54)
5. Opponents’ Turnovers (0.51)

Oklahoma State

1. Rushing Yards (0.64)
2. Opponents’ Yards Per Passing Attempt (-0.60)
3. Passing Attempts (-0.59)
4. 3rd Down Ratio (0.59)
5. Opponents’ Rushing Yards (-0.55)

Texas

1. First Down Ratio (0.73)
2. Opponents’ First Downs (-0.72)
3. 3rd Down Ratio (0.70)
4. Opponents’ Total Plays (-0.63)
5. Opponents’ 3rd Down % (-0.61)

Texas A&M

1. Opponents’ First Downs (-0.65)
2. First Down Ratio (0.64)
3. Opponents’ Yards Per Carry (-0.59)
4. Opponents’ Rushing Yards (-0.58)
5. Opponents’ Turnovers (0.57)

Texas Tech

1. 3rd Down Ratio (0.71)
2. Yards Per Carry (0.66)
3. Opponents’ Yards Per Passing Attempt (-0.59)
4. Opponents’ 3rd Down % (-0.58)
5. Rushing Yards (0.55)

Predictably, here are the Top 5 Variables:

1. 3rd Down Ratio (0.62)
2. First Down Ratio (0.61)
3. Team Rushing Yards (0.53)
4. Opponents’ Yards Per Passing Attempt (-0.53)
5. Opponents’ Rushing Yards (-0.51)

So after all of this, we can come to one specific conclusion...yards matter.

I know, what an amazing thought. I could have come up with that without looking at a single box score. Just as on-base percentage is the single most important variable in baseball (if you’re not making outs, you’re more likely to score runs), yards are the single most important variable in football (if you’re advancing the ball, you’re more likely to score points). Brilliant. Moving on...

Was a high/low correlation in any one variable a sign of a good/bad team?

Okay, so I correlated statistics from individual games to the results of those games and found that the team that controls the ball, converts on 3rd downs, and ends up with more yards, probably ends up with more points. What happens if we take a step back and look at a season’s success instead of individual games? What happens if I take those individual game correlations and tie them to a team’s winning percentage for any given year?

In other words, instead of looking at what’s most important for winning a game, what’s most important for a winning season?

(I’m having trouble wording this right. I’m basically looking at the correlation between correlations and win %, but “correlations between correlations” doesn’t really sound all that clear, does it?)
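If the wording is confusing, maybe a sketch helps. The idea: each team-season produces one game-level correlation for a given variable, and that list of correlations then gets correlated against each team-season's win %. Everything below is hypothetical (the team labels, the numbers, and the choice to use the absolute value as "importance"); it only shows the shape of the calculation.

```python
from math import sqrt

def ranks(values):
    """Rank values from 1..n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson applied to the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# HYPOTHETICAL team-seasons: (label, game-level correlation between
# Opponents' Yards Per Pass Attempt and winning, season win %)
team_seasons = [
    ("Team A 2004", -0.20, 0.750),
    ("Team B 2004", -0.55, 0.364),
    ("Team C 2004", -0.35, 0.583),
    ("Team A 2005", -0.10, 0.846),
    ("Team B 2005", -0.62, 0.273),
    ("Team C 2005", -0.48, 0.615),
]

# Assumption: "importance" means the size of the correlation, ignoring sign
importance = [abs(corr) for _, corr, _ in team_seasons]
win_pct = [wins for _, _, wins in team_seasons]

second_level = spearman(importance, win_pct)
print("Correlation between importance and win %:", round(second_level, 2))
```

A negative result here would read as "the more this variable matters to you, the worse your record," which is the kind of statement I'm after.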

Anyway, I didn’t find much. First of all, none of the resulting correlations were all that strong (the highest was 0.41), but here was the main conclusion I could draw: the more important Opponents’ Yards Per Pass Completion, Opponents’ Yards Per Pass Attempt, and overall Opponents’ Passing Yards are to you (i.e. if they have a higher-than-normal correlation), the worse your record is.

What does that mean? I’m honestly not sure. Does it have to do with big plays? In other words, if you have a high correlation in these categories, I guess that means you win or lose games depending on how many big passing plays you give up. That suggests you give up quite a few big plays, doesn’t it? That’s all I could come up with.

Of course, I don’t know how much thought this is worth, since a 0.41 correlation with a relatively small sample size isn’t all that telling.
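For a rough feel of how shaky a 0.41 is, here's a back-of-the-envelope sketch using the Fisher z-transformation to put an approximate 95% interval around a correlation. Two loud assumptions: I'm plugging in n = 36 (12 teams times 3 seasons), and the Fisher method is built for Pearson correlations, so for a Spearman value it's only a ballpark. The function name is mine.

```python
from math import atanh, tanh, sqrt

def fisher_interval(r, n, z_crit=1.96):
    """Approximate confidence interval for a correlation r from n pairs.

    Uses the Fisher z-transformation; z_crit=1.96 gives roughly 95%.
    """
    z = atanh(r)
    half_width = z_crit / sqrt(n - 3)
    return tanh(z - half_width), tanh(z + half_width)

# ASSUMPTION: 36 team-seasons (12 teams x 3 seasons)
low, high = fisher_interval(0.41, 36)
print(f"0.41 with n=36: roughly {low:.2f} to {high:.2f}")
```

Under those assumptions the interval runs from close to zero up past 0.6, and that's before accounting for the fact that 0.41 was the highest of many variables I checked.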

Was a high/low correlation in any one variable a predictor of success/failure the next season?

I looked at this one the same way I looked at the last one, only instead of tying individual game correlations to a team’s win %, I tied them to the next season’s win %. I’m not working with a huge sample size here (2005’s win % with 2004 correlations and 2006’s win % with 2005 correlations...2 years for 12 teams), but here’s what I’ve come up with so far.
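The only real change from the previous question is the bookkeeping: each season's correlation gets paired with the same team's win % one year later. A minimal sketch of that pairing, with hypothetical teams and numbers (the `lagged_pairs` helper is just for illustration):

```python
def lagged_pairs(correlations, win_pct):
    """Pair each (team, season) correlation with that team's next-season win %."""
    pairs = []
    for (team, season), corr in sorted(correlations.items()):
        following = (team, season + 1)
        if following in win_pct:  # the last season has no "next year" yet
            pairs.append((team, season, corr, win_pct[following]))
    return pairs

# HYPOTHETICAL values, keyed by (team, season)
correlations = {
    ("Team A", 2004): 0.10, ("Team A", 2005): 0.45, ("Team A", 2006): 0.30,
    ("Team B", 2004): 0.60, ("Team B", 2005): 0.05, ("Team B", 2006): 0.25,
}
win_pct = {
    ("Team A", 2004): 0.583, ("Team A", 2005): 0.667, ("Team A", 2006): 0.500,
    ("Team B", 2004): 0.750, ("Team B", 2005): 0.417, ("Team B", 2006): 0.615,
}

pairs = lagged_pairs(correlations, win_pct)
for team, season, corr, next_win in pairs:
    print(f"{team}: {season} correlation {corr:.2f} -> {season + 1} win % {next_win:.3f}")
```

Two teams over three seasons give four pairs; twelve teams give 24, which is the sample I'm working with.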

(And it should be noted the best predictor of next year’s win % is...this year’s win %. That doesn’t fill me with confidence in these weak correlations. Still no magic bullet.)

But for the sake of sharing, I did find something interesting. You know the ‘big play’ rule from above? “The more important Opponents’ Yards Per Pass Completion, Opponents’ Yards Per Pass Attempt, and overall Opponents’ Passing Yards are to you (i.e. if they have a higher-than-normal correlation), the worse your record is”? Well, looking at the next year’s numbers, I can say that the more important your own yards per completion are (i.e. the higher their correlation) in any given season, the worse your record is going to be the next season. This makes a little bit of sense, too.

And really, going back to the baseball stats analogy earlier, this is a lot like a team’s batting average with runners in scoring position (RISP). The ’03 Royals had a strangely high average with RISP (and won 83 games), and it gave fans (and the front office) an artificially inflated view of where the organization was as a whole. Well, in ’04 they were quite below average in that category, and they ended up losing 104 games. Their average with RISP balanced out in the end. They weren’t especially good at getting hits with runners in scoring position; they just got on an extended hot streak. Which was followed by an extended cold streak.

Well, it appears that big pass plays are somewhat the same. If you’re giving up a lot of bombs one year, your record will probably suffer, but it will likely even out the next season, and those low-percentage passes won’t find the hands of opposing WRs (significantly raising your opponents’ yards per catch and affecting the outcome of a game) quite as often.

Now, again...these correlations aren’t very high—there are other factors involved, like some teams being better than others at the fundamentals of tackling and/or pass coverage, and some teams just having more talent—but I don’t find it a total coincidence that the same categories emerged, inversely, when looking at a team’s record from year to year.

And just in case you’re wondering, here were the teams with the highest 2006 correlations in the category of Team Yards Per Completion: 1) Texas, 0.70, 2) Texas Tech, 0.69, 3) Oklahoma State, 0.30, 4) Kansas State, 0.29. For the record, Missouri was at just 0.03, which hopefully suggests that those bombs that fell just out of the reach of WRs last year (until the bowl game, anyway) will find a little more success. As for the team at the top of the list, I guess don’t be surprised if Colt McCoy’s reputation for throwing a great deep ball loses a smidge of its luster this year, even though most of his major receiving targets return for 2007.

Summary

So...what have I learned in the process of entering all this data? Not nearly as much as I would have hoped. But I do have all of this data, and I plan on compiling more. My question to you is, what should I do with it? And what data would be the best to compile? After I get back to about 2001, I figure I’ll move on to play-by-play data, but I don’t want to dig in that deep without some idea of what I’m looking for. So I’m asking for help from any burgeoning data or sabermetrics nerd reading this...let’s make this a community project. Let me know where you think I should go with this.

Feel free to share any thoughts in the comments section.