Resolving confusions over that study of “Teacher Effects on Student Achievement and Height”

Someone pointed me to this article by Marianne Bitler, Sean Corcoran, Thurston Domina, and Emily Penner, “Teacher Effects on Student Achievement and Height: A Cautionary Tale,” which begins:

Estimates of teacher “value-added” suggest teachers vary substantially in their ability to promote student learning. Prompted by this finding, many states and school districts have adopted value-added measures as indicators of teacher job performance. In this paper, we conduct a new test of the validity of value-added models. Using administrative student data from New York City, we apply commonly estimated value-added models to an outcome teachers cannot plausibly affect: student height. We find the standard deviation of teacher effects on height is nearly as large as that for math and reading achievement, raising obvious questions about validity. Subsequent analysis finds these “effects” are largely spurious variation (noise), rather than bias resulting from sorting on unobserved factors related to achievement. Given the difficulty of differentiating signal from noise in real-world teacher effect estimates, this paper serves as a cautionary tale for their use in practice.

This was, unsurprisingly, reported as evidence that value-added assessment doesn’t work. For example, a Washington Post article says “The paper is critiquing the idea that teacher quality can be measured by looking at their students’ test scores” and quotes an economist who says that “the effect of teachers on achievement may also be spurious.”

I asked some colleagues who work in education research, and they argued that the above interpretation of the empirical findings was mistaken:

1. The correlation of class-mean test-score residuals across classes taught by the same teacher is large and positive, while the correlation of class-mean height residuals across classes taught by the same teacher is zero. In their case, the method indicates sizeable teacher effects on test scores and zero teacher effects on height. Thus, this is a non-cautionary tale. The authors sort of say this in their abstract, noting that the “effects” on height are spurious (noise).

2. They find large variance of class-mean residuals (classroom “effects”) on height when they randomly assign students to teachers. This doesn’t seem right. If we randomly assign observations to groups and still find significant variation in “group effects,” there must be something else going on.

My colleagues’ main point was that a key piece of evidence for teacher effects on test scores is that these effects persist within teachers from year to year. If there appear to be large effects of teachers on height, but these effects do not persist from year to year, then we should not think of them as effects of teachers on height but rather as unexplained classroom-level residuals in our model.
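To make this persistence check concrete, here is a minimal simulation. This is entirely my own sketch with made-up variances, not anything from the paper: each teacher teaches one class per year for two years, test scores carry a persistent teacher effect, and height carries none.

```python
import numpy as np

rng = np.random.default_rng(0)
n_teachers, class_size = 500, 25

# Persistent teacher effect on test scores; no teacher effect on height.
teacher_effect = rng.normal(0, 0.2, n_teachers)

def class_mean_residual(effect):
    # Class-mean residual = teacher effect + average of student-level noise.
    return effect + rng.normal(0, 1, n_teachers) / np.sqrt(class_size)

# Each teacher's class-mean residual in year 1 and year 2.
score_y1 = class_mean_residual(teacher_effect)
score_y2 = class_mean_residual(teacher_effect)
height_y1 = class_mean_residual(0.0)
height_y2 = class_mean_residual(0.0)

print(np.corrcoef(score_y1, score_y2)[0, 1])    # clearly positive (~0.5 here)
print(np.corrcoef(height_y1, height_y2)[0, 1])  # approximately zero
```

With these illustrative numbers the year-to-year correlation comes out around 0.5 for scores and around zero for height, which is exactly the pattern my colleagues describe.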

Bitler et al. were informed of this discussion and contacted me with further information. Here’s Bitler:

We have noticed that our work has gotten some attention that mischaracterizes what we find. We have submitted the attached blog post to the Brookings education Brown Center blog to make the point that first, more sophisticated models which account for class-year shocks no longer produce value added of teachers on height and second, another, more policy relevant point, that a measure of teacher quality that adjusts slowly (because it extracts the persistent component) may not be the best for motivating teachers.

From their post:

We [Bitler et al.] have seen a number of claims that overgeneralize our findings, suggesting that all uses of all types of value-added models (VAMs) are invalid. This conclusion is not supported by our work. We believe that our study has some good news about some value-added models, particularly those used in research. At the same time, however, we see important cautions in our findings for the use of value-added models in policy and practice.

They continue with some good news about the use of VAMs in research:

That we find teacher effects on height does not . . . invalidate the consistent finding emerging from VAM research that teachers matter for students’ achievement and other life outcomes. The most sophisticated value-added models currently used in research concentrate on persistent contributions to student outcomes. . . . [and] appear to demonstrate that teachers vary substantially in their contribution to achievement growth and that exposure to high value-added teachers has measurable positive effects on students’ educational attainment, employment and other long-term outcomes. Importantly, we find no teacher effects on height using the more sophisticated models these researchers use.

And some cautions:

VAMs are hard to use well in practice. . . . the models behind many of these real world applications are much less sophisticated than the ones that researchers use, and our results suggest that they are more likely to result in misleading conclusions.

In many cases, practical applications use single-year models like the models that yield implausible teacher effects on height in our analyses. Our findings reinforce previous work identifying problems with these models, demonstrating that random error can lead observers to draw mistaken conclusions about teacher quality with striking regularity.

Again:

The more sophisticated models used in research draw on several years of a teacher’s students to estimate their persistent effect. When we use these multi-year models, we do not find effects on height, suggesting that the effects we see in simpler VAMs reflect year-to-year variation that should be seen as random errors rather than systematic factors.
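To spell out the statistics here (my gloss, not the authors’): a classroom-year shock with variance σ² contributes only σ²/T to the variance of a T-year average, so pure-noise “effects” shrink toward zero as years are added, while a persistent teacher effect does not shrink at all. A quick illustration, with invented numbers:

```python
import numpy as np

rng = np.random.default_rng(1)
n_teachers, n_years = 500, 5

# Classroom-year shocks only: no persistent teacher effect at all.
shocks = rng.normal(0, 0.2, (n_teachers, n_years))

print(shocks[:, 0].var())         # single-year "effects": variance ~0.04
print(shocks.mean(axis=1).var())  # five-year averages: variance ~0.008
```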

When it comes to policy:

Multi-year models adjust for these random year-to-year errors, but because they identify teachers’ persistent contributions to student learning using multiple years of data, they may not pick up on short term changes in teacher effectiveness. As such, these models are of limited usefulness for motivating annual performance pay goals. However, multi-year models may work well for identifying persistently very good or very bad teachers, but only after several years of teaching.

This subtlety – that our results support the validity of multi-year VAMs while indicating that single-year VAMs are not valid – has been overlooked in much of the discussion of our study.

And, in summary:

VAMs have underscored the importance of teachers, and we believe that they have a role to play in future educational research and policy. We just need to be more modest in our expectations for these models and make sure that the empirical tool fits the job at hand.

I sent the above to my colleagues, who had two comments:

1. The only quibble we have with their blog post is when they say “our results support the validity of multi-year VAMs while indicating that single-year VAMs are not valid.” This is very poor wording. The persistent component of teacher value-added is contained in the single-year estimate; it just comes with a bunch of noise. We can reduce the noise (though not eliminate it completely in finite samples) by observing more classrooms taught by the same teacher. We think it’s wrong to refer to an unbiased but noisy estimate as “not valid” when we know that the signal variance is meaningfully large. If zero noise were the bar, then nothing would be valid, and we have known about the magnitude of signal and noise in value-added estimates for well over a decade.

2. It is still quite unclear what is driving their finding of significant class-level residuals on height. Assuming no true peer effects on height, one possible explanation is correlated measurement error. But they find classroom height “effects” even when they randomly assign students to classrooms, which should not be possible if the variance of the random effects is being estimated properly (see the sketch below). The authors offer very little explanation of what might be going on, so it’s hard to know whether they have identified a pervasive problem in the estimation of group effects, whether there is an error in their code, or whether it’s something else entirely.
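As a sanity check on point 2, here is the textbook one-way ANOVA variance-component calculation, again my own sketch under the assumption of iid heights and random assignment of students to classrooms. The estimated between-classroom variance should hover around zero (and can even come out slightly negative):

```python
import numpy as np

rng = np.random.default_rng(2)
n_classes, class_size = 200, 25

# Heights are iid across students, so there is no true classroom effect.
heights = rng.normal(0, 1, (n_classes, class_size))

class_means = heights.mean(axis=1)
grand_mean = heights.mean()

# One-way ANOVA mean squares: between-class and within-class.
msb = class_size * ((class_means - grand_mean) ** 2).sum() / (n_classes - 1)
msw = ((heights - class_means[:, None]) ** 2).sum() / (n_classes * (class_size - 1))

# Method-of-moments estimate of the between-classroom variance component.
print((msb - msw) / class_size)  # ~0 under random assignment
```

If a calculation along these lines returns a variance component far from zero under random assignment, something is off in the model, the code, or the data, which is exactly the colleagues’ puzzlement.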