A primer on standardized tests

According to his sign-off, Peter Greene “spent 39 years as a high school English teacher, looking at how hot new reform policies affect the classroom.” Recently he wrote about standardized tests and education accountability for Forbes.com.

https://www.forbes.com/sites/petergreene/2018/09/20/is-the-big-standardized-test-a-big-standardized-flop/?fbclid=IwAR0rZSrLUCmkvCryOpYsXT2Iwbp8EZejJ2eOQm5Kp7V5lyHpk1px0X9oP8I#6839ec004937

He argued that standardized testing and the accountability movement have not produced a better educated citizenry. This paragraph captures his main point:

But there is one critical lesson that ed reform testing apostates should keep in mind. The idea that the Big Standardized Test does not measure what it claims to measure, the idea that it actually does damage to schools, the idea that it simply isn’t what it claims to be–while these ideas are presented as new notions for ed reformers, classroom teachers have been raising these concerns for about 20 years.

Of course Greene is right. The purpose of this blog is to explain why, something that space limitations at Forbes.com did not allow him to do.

Standardized tests are of two types. As their names suggest, criterion-referenced tests match student responses against a criterion, and norm-referenced tests match student responses against a reference group or a norm.

The written driver’s license test is a criterion-referenced test. The first step in creating this kind of test is to define the body of knowledge to be assessed. The driver’s license written test is the state’s attempt to assess applicants’ knowledge of traffic regulations. The knowledge needed to pass the test is described in the driver’s manual, so the manual defines the knowledge required to take the road test.

The second step for administering a criterion-referenced, standardized test is to set the cut-off point for passing. Driver’s license applicants in North Carolina must correctly answer at least 80% of the questions to qualify for the road test.
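The two steps above can be sketched in a few lines of Python. This is only an illustration: the function name `passes_criterion` and the 25-question test length are assumptions made for the example, and only the 80% cut-off comes from the North Carolina rule.

```python
def passes_criterion(correct: int, total: int, cutoff_percent: int = 80) -> bool:
    """Return True when the share of correct answers meets the cut-off.

    The comparison uses whole numbers so that a score exactly at the
    cut-off is never lost to floating-point rounding.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return correct * 100 >= cutoff_percent * total


# Hypothetical 25-question test: 20 correct is exactly 80%, 19 is 76%.
at_cutoff = passes_criterion(20, 25)
just_below = passes_criterion(19, 25)
```

The result depends only on the applicant's own answers and the fixed criterion. How other applicants scored does not enter into it.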

When I was an assistant principal at Stoughton High School (Wisconsin), I was on a district committee assigned to develop grade level, criterion-referenced tests for every academic subject. We thought, how hard can that be? The driver’s license people do it. Why can’t we?

Our first stumbling block was that students of the Stoughton Area School District needed to have both knowledge and skills assessed. (The driver’s test assesses only knowledge because skills are assessed in the road test.) Third grade math students, for example, need to understand mathematical principles and use them to solve mathematical problems. Our committee’s task was to define all the desired knowledge and skills, as the first step toward creating tests that would assess students against those criteria. Still, how hard can that be?

Our work came to an abrupt halt after we hired a statistician to assist us. During his first meeting with us, we described what we wanted to accomplish. He responded by explaining that each specific knowledge or skill proficiency would require students to answer more than one multiple-choice question. In many cases, we needed at least five questions to test for a single skill. That meant the number of tests and questions would have to be many times greater than we assumed. We concluded that it was a good idea to develop criterion-referenced tests, but creating and administering them would take too much time.

Standardized, norm-referenced tests have limitations, too. Greene mentioned some of them, but the main one is that they are not designed as improvement tools. Instead, their main purpose is to tell students how they scored in relation to other test takers. Results are reported as percentiles, not as percentages.

This type of test is like a machine that is balanced for proper operation. Easy-to-answer items are balanced with those that are slightly more difficult and others that are very difficult. Balancing items this way enables the results to discriminate across the knowledge and skill levels of test takers. Some students will be high scorers, some will score in the middle, and some will score toward the bottom.

A perceptive reader sees that a test designed for these results means that students scoring at the lower percentiles make it possible for others to score at higher percentiles. That reader also sees that percentiles tell us little about what a specific student knows or does not know. It’s as if teachers prepare students for a test that requires their lowest ability students to score low, and then those students are scolded for getting low scores.
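The difference between a percentile and a percentage can be shown with a small Python sketch. Everything in it is hypothetical: the five scores, the 50-item test length, and the helper `percentile_rank`. It also simplifies by ranking students only against each other, whereas published norm-referenced tests rank them against a separate norming sample.

```python
def percentile_rank(score: int, all_scores: list[int]) -> float:
    """Percentage of test takers who scored strictly below `score`."""
    below = sum(1 for s in all_scores if s < score)
    return 100 * below / len(all_scores)


# Hypothetical class of five students: items correct out of 50.
raw = [48, 45, 40, 38, 35]
ranks = [percentile_rank(s, raw) for s in raw]

# The lowest scorer answered 70% of the items correctly ...
lowest_percent_correct = 100 * min(raw) / 50
# ... yet ranks at the bottom of this group.
lowest_rank = percentile_rank(min(raw), raw)

# Suppose every student answers five more items correctly.
improved = [s + 5 for s in raw]
improved_ranks = [percentile_rank(s, improved) for s in improved]
```

In this sketch the ranks come out as 80, 60, 40, 20 and 0, and they are the same after every student improves by five items. The ranking records each student's position in the group, not what that student knows.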

Educators do this because they are ignorant of the way standardized tests operate. For example, some schools develop annual school improvement goals that focus on improving students’ scores on end-of-grade exams. They must not realize that their annual goal (the idea that focuses their efforts for 5 hours per day, 180 days per year) will be achieved if students average 1 more correct answer on an end-of-grade exam. A shallower, more meaningless goal might be possible, but I can’t think of one.

– – – – – – – – – – – – – – – – – – –  –

Those are the explanations Greene could not give in his limited space. And those are the reasons why standardized tests fail as school improvement tools. They do not tell us what students know, and they distract schools from their most important goal, which is modeling and teaching the six virtues of the educated person. But educators don’t know that, and space limitations prevent me from explaining the reasons. You have to read the book. Order it at

https://rowman.com/ISBN/978-1-60709-274-2

 

 
