Saturday, December 14, 2013

VAM: Size matters

What if I told you that two different teachers could get the exact same growth scores on the exact same test and still receive completely different Value Added scores?  Possible?  As it turns out, yes.  Fair?  I'll let you decide.

As a Value Added Leader (VAL) and educational consultant in Ohio, I have the opportunity to work with many teachers in several districts to look at their teacher-level value added reports.  In case you have missed the news, Ohio determines educator effectiveness by measuring how much the students in a teacher's classroom "grow" on mandated standardized tests.  Simply put, student scores are placed on a bell curve and then compared with where they place on the bell curve the following year.  Those changes in placement on the bell curve (Normal Curve Equivalent scores, or NCE scores) are averaged across a classroom to get a mean NCE gain.  In order to be rated "most effective," a teacher's students must have a mean NCE change at least 2 standard errors above the mean growth score.
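As a rough illustration of the arithmetic (with made-up numbers, not real student data), the mean NCE gain is just the average of each student's year-over-year change in NCE score:

```python
# Hypothetical sketch of how a classroom's mean NCE gain is computed,
# assuming we already have each student's NCE score for two consecutive
# years.  All numbers are illustrative.

year1_nce = [42.0, 55.0, 38.0, 61.0, 47.0]   # placement on last year's bell curve
year2_nce = [48.0, 59.0, 45.0, 64.0, 52.0]   # placement this year

# Each student's change in bell-curve placement
gains = [y2 - y1 for y1, y2 in zip(year1_nce, year2_nce)]

# Averaged across the classroom
mean_nce_gain = sum(gains) / len(gains)
print(mean_nce_gain)  # 5.0
```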

That's a lot of math talk, I know, but let me explain a little about "standard error" for those not familiar with statistical concepts.  Standard error is basically a measure of the confidence I have in the data.  If I have a LOT of data, my standard error is small, since I have more confidence in the data.  When I have fewer data points, I'm not so confident, so the standard error is larger.  Two factors have a direct impact on the size of the standard error - the size of the population and the range of the scores.
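To make that concrete, here is a simplified sketch.  For a plain mean, the standard error is the standard deviation divided by the square root of the group size; the actual VAM model's standard error is more elaborate, but it shrinks with group size in the same way.  With an assumed spread of 16 NCE points (my number, chosen only for illustration), a class of 10 and a group of 70 produce very different standard errors:

```python
import math

def standard_error(sd, n):
    # Standard error of a simple mean: spread divided by sqrt(group size).
    # The state's model is more complex, but the size effect is the same.
    return sd / math.sqrt(n)

sd = 16.0  # assumed spread of student NCE gains (illustrative)

print(standard_error(sd, 10))   # small class   -> larger SE (about 5.06)
print(standard_error(sd, 70))   # larger group  -> smaller SE (about 1.91)
```

Notice that nothing about the teaching changed between the two calls - only the number of students.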

To put this in terms of a classroom teacher's rating: teachers in middle schools typically have 120 or so students, while elementary teachers have maybe 25.  Special education teachers or gifted intervention specialists have even fewer, and if two or more teachers work with the same students, their numbers shrink further still, since the students are "linked" to all of the teachers who contribute to instruction.  Teachers with more students will have a small standard error; those with fewer students will have a large one.  So what?  This becomes a big deal when determining a teacher's effectiveness rating.  Let's look at an example I encountered at a school just yesterday.

Two middle school math teachers, one general ed and one special ed, co-teach a class of sixth grade math.  We will call them Mrs. A and Mrs. B.  They did an outstanding job, and their students, all low-performing students in the past, did quite well on standardized tests.  Their mean NCE change was about 5.  Another teacher in the same building, Mr. C, teaches three classes of the same subject each day and had similar results - a mean NCE gain of 5.  In other words, their students grew the same amount on their standardized tests; the teachers all produced equal "growth" in terms of how our legislature defines growth.  The standard error for Mrs. A and Mrs. B was 4.9.  Mr. C, with similar results, has around 70 students and a standard error of 1.9.  Remember: more students, more confidence in the data.  Same growth, different standard errors, all because of a difference in the size of the teachers' classes.  Mrs. A and Mrs. B have the smaller class, and because they both "link" to all of their students, they each get credit for only 50% of their students' results.
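The gain index the state uses is simply the mean NCE gain divided by the standard error.  Plugging in the numbers from this example (the same gain of 5, standard errors of 4.9 and 1.9), a quick sketch shows how far apart the two results land:

```python
def gain_index(mean_gain, std_error):
    # Number of standard errors the mean NCE gain sits above zero growth
    return mean_gain / std_error

# Mrs. A and Mrs. B: small co-taught class, large standard error
print(round(gain_index(5.0, 4.9), 2))  # 1.02

# Mr. C: around 70 students, small standard error
print(round(gain_index(5.0, 1.9), 2))  # 2.63
```

Mrs. A and Mrs. B land right around 1 standard error above zero growth, while Mr. C clears 2.6 - comfortably past the "most effective" cutoff - on the exact same gain.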

In Ohio, the "most effective" teachers are those with a gain index of 2.0 or more; that is, their mean NCE change is 2 or more standard errors above the mean.  Teachers with a gain index between 1 and 2 are "above average" - their mean student gain is between 1 and 2 standard errors above the mean.  "Average" teachers are within 1 standard error of the mean, in either direction.  Teachers between 1 and 2 standard errors below the mean are "approaching average," and those whose mean student change in NCE scores is more than 2 standard errors below the mean are "least effective."  It's all about the standard error, but, as I've explained, standard error depends on the size of a teacher's classroom.
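Those bands can be sketched as a simple lookup on the gain index.  The exact boundary handling below (whether an index of exactly 1 or 2 rounds up or down) is my assumption, since the rules above don't spell it out:

```python
def rating(gain_index):
    # Ohio's effectiveness bands, keyed on the number of standard errors
    # the mean NCE gain sits from zero growth.  Boundary behavior at
    # exactly 1 and 2 is assumed, not confirmed.
    if gain_index >= 2:
        return "Most Effective"
    elif gain_index >= 1:
        return "Above Average"
    elif gain_index > -1:
        return "Average"
    elif gain_index > -2:
        return "Approaching Average"
    else:
        return "Least Effective"

print(rating(2.6))   # Most Effective
print(rating(0.5))   # Average
print(rating(-1.5))  # Approaching Average
```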

The exact same growth in student achievement resulted in a gain of over two standard errors for Mr. C in our example above, and so he is lauded as one of the "most effective" teachers in the state.  Mrs. A and Mrs. B, however, with the exact same gain but a standard error of 4.9, are rated merely average.  Same results - same gain in student achievement - but the teachers are evaluated very differently because of the size of their classes.

If this sounds unfair to you, you would be correct.  There are MANY problems with using standardized test results and norm-referenced testing for accountability, which I have addressed before here, here, and here.  But for teachers looking at "average" ratings, this problem is significant.  My effectiveness should not be determined by the size of my class.

1 comment:

  1. Note: While this explanation is overly simplistic, it includes both the mean gain (NCE change) and the confidence interval (standard error). The gain index is simply the "estimate" (basically the mean NCE change) divided by the standard error, which tells us how many standard errors the mean NCE change sits from the growth standard (0, i.e., staying at the same place on the bell curve). Hence, any gain index of 2 or greater is "Most Effective," since the mean NCE change (estimate) is 2 or more standard errors above the growth standard. It's all about how many standard errors from zero change the mean gain is.