Introducing SIERA

by **TenuredVulture** » Sun Feb 14, 2010 14:39:51

tangotiger wrote:
TenuredVulture wrote: Maybe I'm misunderstanding something here. The point of SIERA and FIP is to figure out how good a pitcher is. And the argument here, if I'm following it correctly, is that these metrics are evaluated in terms of how well they predict the next season's ERA.

Insofar as FIP is concerned, you are misunderstanding.

The way I always describe it is to liken it to OBP: is there anyone that objects to weighting the HR and walk as "1" in the numerator of OBP? No. Because OBP is what it is. Do we care if HR show more persistence year-to-year than doubles (or not)? No. OBP is what it is.

And that's what FIP is: a description of a PART of a pitcher's performance (the 25% of the time that a ball is not put in play), cast on an ERA scale.

Presenting (reframing?) the argument on that basis, do you have objections?

Not really. I do wonder, though: if you changed the dependent variable, would the weights change considerably? I mean, it's close to self-evident that a pitcher controls SO, BB, and HR, and SIERA adds GB and FB. And as I noted, SO were clearly important in SIERA (the SO beta is large).

But if we want to argue whether FIP is a better stat than SIERA, we would have reason to discuss which dependent variable we want.

On some level, if you simply go with "it is what it is" as a defense of a stat, then why not just go back to using batting average, and getting after hitters who "strike out too much?"

What I'm getting at here is measurement validity.

by **tangotiger** » Sun Feb 14, 2010 16:19:01

TenuredVulture wrote: But if we want to argue whether FIP is a better stat than SIERA, we would have reason to discuss which dependent variable we want.

On some level, if you simply go with "it is what it is" as a defense of a stat, then why not just go back to using batting average, and getting after hitters who "strike out too much?"

What I'm getting at here is measurement validity.

Since FIP expressly cares about HR and not at all about batted balls, while SIERA takes the opposite point of view, FIP will, by definition, correlate (validate) better with same-year runs allowed. As for next-year runs allowed, SIERA would have a (slight, if any) advantage.

But, they each do what they purport to do and therefore, choosing between the two is somewhat like choosing between OBP and SLG.
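For readers who have not seen it written out, the commonly published form of FIP can be sketched as below. This is a minimal illustration rather than anything posted in this thread: the function name, the hit-by-pitch term, and the `constant` default are assumptions, and the constant in particular is a league- and season-dependent scaling term whose only job is to put the result on an ERA scale.

```python
def fip(hr, bb, hbp, k, ip, constant=3.10):
    """Commonly published FIP form.

    Only home runs, walks, hit batters and strikeouts enter the numerator;
    batted balls that stay in the park are ignored entirely.
    `constant` is a placeholder assumption, not a value from this thread.
    """
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant


# Hypothetical season line, not a real pitcher.
example = fip(hr=20, bb=50, hbp=5, k=180, ip=200.0)
print(example)
```

Nothing about ground balls or fly balls appears in the inputs, which is the contrast with SIERA drawn above.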

by **TheBrig** » Sun Feb 14, 2010 22:36:13

I'm going to try and take a shot at explaining why you might want to use ERA as a dependent variable in this regression, for the benefit of Vulture and anybody else who isn't quite getting it. Let me offer up a simplified example:

Say you're the researcher building the regression model, and for the sake of argument let's say you've managed to isolate a pitcher's strikeout rate as the most important run-stopping factor a pitcher can have within his own control. That is to say, let's pretend walk rates and ground ball rates and so forth have negligible measurable impact on run prevention and you think you can come up with a reasonable model for a pitcher's unique run stopping ability using only strikeout rates.

You take the data, which has every pitcher from the last decade, and you want to find a relationship between run-preventing ability and strikeout rate. Say for the sake of argument you choose ERA as your dependent variable (after all, ERA is literally the observed rate of earned runs allowed). You regress ERA on strikeout rate using the classic least squares regression model and you should get an estimated linear relationship

estimated ERA = a + b * (K/9).

Now when you look at your data plot with the regression line, you should see that the fitted regression line indicates a general trend in the data (which should be downwards as K rate increases). But the data points themselves might still deviate from your fitted line by a large amount. Mark Buehrle (low ERA, low K rate) is going to show up in the bottom left, well below your fitted line, and, say, Brad Lidge '09 (high K rate, high ERA) will show up in the top right, well above the line. But even though there are outliers, you're still convinced that K rate is the only factor within a pitcher's control that has any non-zero correlation with his ERA. Then for a given pitcher's K rate, the corresponding point on the fitted line is his "luck independent ERA", and you conclude the difference between that fitted value and the pitcher's actual ERA is the effect of factors beyond the pitcher's control (i.e., "luck").
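The one-variable version of this can be sketched in a few lines of plain Python. The (K/9, ERA) pairs below are made up purely for illustration, and `fit_line` is a hypothetical helper name; the point is only to show the fitted line and the residual that the example above labels "luck".

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # slope: change in ERA per extra K/9
    a = mean_y - b * mean_x  # intercept
    return a, b


# Made-up data: the first point is a low-K, low-ERA pitcher and the last
# is a high-K, high-ERA pitcher, mimicking the two outliers described above.
k_per_9 = [4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 10.5]
era = [3.8, 4.9, 4.3, 4.0, 3.9, 3.2, 4.6]

a, b = fit_line(k_per_9, era)
fitted = [a + b * k for k in k_per_9]               # "luck independent ERA"
residuals = [y - f for y, f in zip(era, fitted)]    # actual minus fitted

for k, y, f, r in zip(k_per_9, era, fitted, residuals):
    print(f"K/9={k:4.1f}  ERA={y:.2f}  fitted={f:.2f}  residual={r:+.2f}")
```

With these numbers the slope comes out negative, the low-K pitcher sits below the line (negative residual) and the high-K pitcher sits above it (positive residual).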

The SIERA model is the same idea, only with more independent variables and with second-order interaction terms thrown in. The ERA the SIERA model predicts is the ERA you would expect based on a pitcher's K rate, BB rate, and groundball rate, all other things held equal. The difference between that estimated ERA and actual ERA is again assumed to be luck (or perhaps other factors the researchers haven't been able to pin down yet).
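To make "second-order interaction terms" concrete, here is a rough sketch of how a regression's inputs might be expanded before fitting. It is a generic illustration, not the actual SIERA specification: the variable names, the sample values, and the choice to include every squared and cross term are all assumptions.

```python
from itertools import combinations_with_replacement


def second_order_features(row):
    """Expand {name: value} into intercept, linear, squared and cross terms."""
    names = sorted(row)
    features = {"intercept": 1.0}
    for name in names:
        features[name] = row[name]
    # Every pair (including a variable with itself) yields one product term.
    for first, second in combinations_with_replacement(names, 2):
        features[f"{first}*{second}"] = row[first] * row[second]
    return features


# Hypothetical per-plate-appearance rates for one pitcher.
pitcher = {"k_rate": 0.22, "bb_rate": 0.08, "gb_rate": 0.45}
expanded = second_order_features(pitcher)
for name, value in expanded.items():
    print(f"{name:>18} = {value:.4f}")
```

Three inputs become ten regressors (one intercept, three linear terms, six products), and the regression then estimates one coefficient for each.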

The main benefit of using ERA as the dependent variable then is that you can come up with a luck-adjusted statistic which is on the same scale as a pitcher's actual ERA so that it is easy to make an apples-to-apples comparison. You could argue it might be better to use runs allowed instead of earned runs allowed, but using ERA makes it more tractable.

by **tangotiger** » Sun Feb 14, 2010 23:27:40

TheBrig wrote: You could argue it might be better to use runs allowed instead of earned runs allowed, but using ERA makes it more tractable.

Excellent description preceding this, and absolutely correct that runs allowed, not ER, is what is needed.

by **Phan In Phlorida** » Mon Feb 15, 2010 04:01:07

tangotiger wrote:
Phan In Phlorida wrote: TheAAGuy's method is more in line with computer programming, where you have to be that explicit (with the extra parens et al) to ensure the compiler handles it the way you want it to.

Seeing that I am a computer programmer, I can assure you I have never added extraneous parens, and I have never had a problem.

A compiler follows rules, explicitly coded rules. And one of those rules is that multiplication precedes addition.

Now, maybe in assembly language, it's different. And (some) calculators simply follow the rule left-to-right.

Otherwise, as I said, it's fine to have the objection, but the objection runs counter to our Grade 7 education.

Well, let's just say I learned the lesson the hard way over 20 years ago. Perhaps such explicitness is more of an old-school habit (be it CYA, source readability, paranoia, whatever).

by **TheAAGuy** » Mon Feb 15, 2010 09:32:32

Phan In Phlorida wrote:
tangotiger wrote:
Phan In Phlorida wrote: TheAAGuy's method is more in line with computer programming, where you have to be that explicit (with the extra parens et al) to ensure the compiler handles it the way you want it to.

Seeing that I am a computer programmer, I can assure you I have never added extraneous parens, and I have never had a problem.

A compiler follows rules, explicitly coded rules. And one of those rules is that multiplication precedes addition.

Now, maybe in assembly language, it's different. And (some) calculators simply follow the rule left-to-right.

Otherwise, as I said, it's fine to have the objection, but the objection runs counter to our Grade 7 education.

Well, let's just say I learned the lesson the hard way over 20 years ago. Perhaps such explicitness is more of an old-school habit (be it CYA, source readability, paranoia, whatever).

I don't consider it to be 'CYA programming'. I consider it to be robust programming. Where I used to work there was a big push to get our company through the CMM/CMMI certification, and this required the programmers to use a more robust programming style. It became ingrained in my own style, to the point where I would do things instinctively. All I know is, if I wrote code the way tangotiger wrote that formula, it would never pass code review. It would get kicked back to me with the note: ambiguous calculation.
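For anyone who wants to check the two positions directly, here is a small Python sketch with made-up values. It only demonstrates Python's behavior; other languages, and the simple left-to-right calculators mentioned above, may differ.

```python
a, b, x = 2.0, 3.0, 4.0  # arbitrary illustrative values

implicit = a + b * x        # relies on multiplication binding tighter than addition
explicit = a + (b * x)      # same calculation with the "extra" parentheses
left_to_right = (a + b) * x # what a strictly left-to-right calculator would compute

print(implicit, explicit, left_to_right)
```

The first two expressions evaluate identically, so the parentheses in `explicit` change readability rather than the result; only the left-to-right reading gives a different number.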
