“Does Not Compute”: Teach For America Mathematica Study is Deceptive?

I’ll admit it. I am a sci-fi junkie. I am that lone person in the theater at a midnight showing on the release night of a new sci-fi film. Sometimes I am one of only 3 or 4 people in the theater. I should blame this probably on my parents who raised me on Star Trek, Star Wars, Twilight Zone, B-movies with large radioactive ants attacking Los Angeles etc. One of my favorite shows as a kid was Lost In Space. Who doesn’t remember the robot declaring:

Danger Will Robinson! Danger!

(Millennials— Wikipedia it)

Is it possible that Mathematica, a “reputable” multi-million dollar research-for-hire evaluation shop, could deceptively report its statistical findings to favor Teach For America? Could it also have made fundamental errors in its assumptions about combining and comparing middle and high school test-score differences (albeit minuscule) attributed to Teach For America teachers, essentially invalidating its (albeit minuscule) results?

“Affirmative!!” (Robot Voice)

Soon after the release of the study, I posted my first impressions in New Mathematica TFA Study is Irrational Exuberance. I have also asked scholars across the nation to take a peek at the Mathematica study, perhaps because some may believe I have a bias against TFA or Mathematica, which I don’t. (I actually have several friends that work in high places for TFA, including one of my best friends. Shout out to TLG and LET. Don’t do it EP. For the record, when they ask, I do let them know about the peer-reviewed research on TFA. See Teach For America: A review of the evidence.)

Recently, Professor Barbara Veltri discussed here on Cloaking Inequity the huggy, cuddly, snuggly interactions between Mathematica and the TFA network over the past decade. I also asked Dr. Francesca A. Lopez, a tenured Associate Professor of Educational Psychology at the University of Arizona, to discuss her thoughts on the two most problematic statistical issues (average impact and outcomes) in the Mathematica TFA study. Now let me say this: her important critiques may not be for the faint of heart when it comes to statistics, but they are necessary considering the cacophony directed at any methodological critique of the Mathematica study and the mileage that TFA is trying to get from it… Without further ado, Dr. Francesca A. Lopez’s critique:

Average Impact

On average, TFA teachers in the study were more effective than comparison teachers.

The point that TFA teachers are “more effective” is made close to fifty times throughout the report. Statistically speaking, TFA teachers were significantly more effective than comparison teachers, but what does this mean? All one really needs for statistical significance is a large sample size. I could have a minuscule difference between groups, but if my sample is large enough, it will be “significant.” And I’ll get to use * or ** next to my coefficient. But, just how much more effective are TFA teachers?

Students assigned to TFA teachers scored 0.07 standard deviations higher on end-of-year math assessments than students assigned to comparison teachers.
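To see how sample size alone can turn a difference this small into a “significant” one, here is a minimal sketch. It assumes two equal-sized groups, unit variance in each group, and a simple normal (z-test) approximation. The function name and the sample sizes are hypothetical choices for illustration; only the 0.07 figure comes from the report, and this is not the model Mathematica estimated.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p_value(effect_size, n_per_group):
    """Approximate two-sided p-value for a standardized mean difference.

    Assumes two equal-sized groups, unit variance in each, and a
    normal (z-test) approximation. These are simplifying assumptions
    for illustration, not the model used in the report.
    """
    standard_error = sqrt(2.0 / n_per_group)
    z = effect_size / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Hypothetical sample sizes; only the 0.07 effect size comes from the report.
for n in (100, 1000, 5000):
    print(n, round(two_sided_p_value(0.07, n), 4))
```

Under these assumptions the same 0.07 difference is nowhere near significant with 100 students per group and clears the usual thresholds easily with 5,000 per group. The difference itself never changes; only the sample size does.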

The “.07” can be referred to as the effect size: it tells us how large the difference is (here, .07 standard deviations). An effect size can also tell us how much of the variability in scores is attributable to the kind of teachers students had. Mathematica decided to approach the explanation this way:

First, the effect size can be expressed as a change in percentiles of achievement within the reference populations used in the study.

Although one could explain it this way, it is problematic because the percentiles are not on an interval scale (the distance between the 27th and 30th percentile is not the same as the distance between the 20th and 23rd). The reason for this somewhat unconventional use of percentiles can be uncovered in their explanation:

If assigned to a comparison teacher, the average student in the study would have had a z-score of -0.60, equivalent to the 27th percentile of achievement in his or her reference population based on a normal distribution for test scores. If assigned to a TFA teacher, this student would, instead, have had a z-score of -0.52—equivalent to the 30th percentile. Thus, the average student in the study would gain three percentile points from being assigned to a TFA teacher rather than a comparison teacher.

In this explanation, Mathematica attempts to aggrandize the effects of a TFA teacher by focusing on percentiles toward the middle of the distribution. If Mathematica had explained it another way, the minuscule advantage pretty much disappears for students most in need (those in the lower percentiles). If we examine the effect of having a TFA teacher another way, it explains students’ scores by less than .001%.
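The percentile arithmetic is easy to check. The sketch below assumes normally distributed scores, as the report’s own explanation does. The z-scores of -0.60 and -0.52 come from the quoted passage; the starting point of z = -2.0 is a hypothetical low-scoring student chosen for illustration, not a figure from the study.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def percentile(z):
    """Percentile rank of a z-score under a standard normal distribution."""
    return 100.0 * STANDARD_NORMAL.cdf(z)


def percentile_gain(z_start, shift):
    """Percentile points gained when a z-score moves up by `shift`."""
    return percentile(z_start + shift) - percentile(z_start)


# z-scores quoted from the report: comparison teacher vs. TFA teacher.
print(round(percentile(-0.60)), round(percentile(-0.52)))

# The same 0.08 shift applied near the report's average student and
# to a hypothetical student far down in the lower tail.
print(percentile_gain(-0.60, 0.08), percentile_gain(-2.0, 0.08))
```

The same shift in standard deviation units buys fewer percentile points the further a student sits from the middle of the distribution, which is why percentiles are a slippery way to describe an effect.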

Comparison Group

Still another problem lies in the very group Mathematica designed as the comparison group, the group corresponding to the -.60 referenced by Mathematica. This group was created by aggregating traditionally certified teachers and “less selective” alternatively certified teachers. Mathematica explains at length the lack of comparability between TFA and Teaching Fellows:

…For all of these reasons, the study findings cannot be used to compare the effectiveness of TFA and Teaching…pp. xxii-xxiii

Yet they saw no issue in using an aggregate comprising distinctly different programs as a comparison group. They not only aggregated two disparate programs, but they also made sure to exclude alternatively certified candidates from rigorous programs.

If we want to figure out what the effect of TFA was when compared to a traditionally certified group of teachers, then we need to use a z-score of -.58, not -.60. This means that the corresponding percentile is 28, and that the difference would be two percentile points, an effect size of .06. This still explains students’ scores by less than .001%.

There are more problems with the way the statistics were presented in the report. Here’s another one:

Although TFA teachers had a positive average impact on student math achievement relative to comparison teachers, impacts from individual classroom matches varied in both sign and magnitude (Figure V.1). Notably, not all TFA teachers were more effective than their counterparts; without regard to statistical significance, the estimated difference in effectiveness between TFA and comparison teachers was positive in 60 percent of classroom matches (67 out of 111) and negative in the remaining 40 percent. Because each match-specific estimate was based on a small number of students, random statistical error contributed to the variation in impact estimates across classroom matches. Nevertheless, on the basis of an F-test, we found that the observed variation in impact estimates across classroom matches exceeded the variation that would be expected from pure statistical chance.

This time, Mathematica did not attempt to report effect sizes. Using Cohen’s calculations for percentages, the difference between effective TFA teachers and their comparison teachers was also small, with an h (interpreted like the standard deviation effect sizes) of .17.

Outcome Measures

Students’ math assessment scores constituted the outcome measure for the analysis. The scales of the test scores differed between the state assessments and NWEA assessments and, among state assessments, differed across states and grade levels. To express test scores in a common unit, we converted each score into a z-score by subtracting the mean score of a reference population and dividing the difference by the standard deviation of scores in that reference population. For a student’s score on a state assessment, the reference population was the full population of students in the same state, year, and grade who took the same assessment; for a student’s score on an NWEA assessment in a given course, the reference population was the NWEA’s nationwide norming sample for that assessment. Thus, impacts on z-scores in this analysis represented increments to math achievement expressed in standard deviations within a statewide or national student population.

Mathematica committed a fundamental error in combining different tests (the various state tests and NWEA). Each state is very likely to have had a distinct test framework, which was aligned to each respective state’s standards. Others have provided details about the lack of cohesion among state standards,[1] making the act of combining state-level scores (by using state-level norms) with national norms particularly egregious. No amount of transformation can change the fact that different tests were converted to a similar metric, which would remain interpretable only against the metric from which they were obtained.
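A small sketch makes the point concrete. The conversion is the one described in the quoted passage (subtract the reference mean, divide by the reference standard deviation), but every number below is made up: the two reference populations and the two raw scores are hypothetical and do not come from any state test or from NWEA.

```python
def z_score(raw_score, reference_mean, reference_sd):
    """Standardize a raw score against a reference population."""
    return (raw_score - reference_mean) / reference_sd


# Hypothetical reference populations (made-up numbers for illustration).
state_test = {"mean": 300.0, "sd": 50.0}     # one state's assessment
national_norm = {"mean": 230.0, "sd": 15.0}  # a national norming sample

# Two hypothetical students, each scored on a different test.
state_z = z_score(270.0, state_test["mean"], state_test["sd"])
national_z = z_score(221.0, national_norm["mean"], national_norm["sd"])

print(state_z, national_z)
```

Both students land at the same z-score of about -0.6, yet nothing in the arithmetic says the two tests covered the same content or that the two reference populations are comparable. The common scale is a product of the formula, not of the tests.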

FRANCESCA A. LOPEZ is an Associate Professor of Educational Psychology at the University of Arizona. Her current research interests include examining the ways teacher-student dynamics inform the development of identity and achievement among Latino English language learners.

If this review is too technical go here instead: New Mathematica TFA Study is Irrational Exuberance


“Does Not Compute” (Robot Voice)

[1] Schmidt, W. H., Cogan, L. S., Houang, R. T., & McKnight, C. C. (2011). Content coverage differences across districts/states: A persistent challenge for U.S. education policy. American Journal of Education, 117, 399-427.


Please blame Siri for any typos.

p.s. In my initial post on the TFA Mathematica study, I used the fact that in Houston the majority of TFA teachers are elementary as my reference point. However, I am hearing from my sources that nationwide, secondary is the majority of placements for TFA. No one knows that except TFA because they took that data off of their national website recently. If that is the case, on that particular point (whether corps members are majority elementary or secondary across the US), I will stand corrected. This of course won’t change the fact that secondary math teachers are still a comparatively small minority of placements.

Twitter: @ProfessorJVH


See my piece on TFA in the New York Times.

See all of Cloaking Inequity’s posts about Teach For America.