Benchmark Item Difficulty – It Means Something


I want to thank one of our Florida educators, whose students take our Benchmark exam, for asking a number of wonderful questions this spring. We’ll take some time this summer to answer them on the Assessment blog. We hope the questions and answers clarify things you may have been wondering about, or maybe hadn’t even considered. So here goes:


1. What level of questions saw the greatest success and posed the greatest difficulty?


Answer from Chris Meador – Our Manager of Quality Assurance

We take a number of factors into consideration when selecting the difficulty level of items for our benchmark tests.

  • Our primary concern is always accurately assessing the benchmark standard.  For example, Florida’s DOE is very explicit in its published FCAT Item Specifications, which provide clarifications of the benchmarks as well as content limits for assessment.  Since there is no FCAT in grade 2, we don’t have quite as much assessment information from FL DOE on the grade 2 standards, but our approach to assessing them is essentially the same as in grades 3-5.
  • So the benchmark standards themselves determine (to a large degree) the difficulty level of the items.  Item difficulty is something we measure and monitor carefully, because in building the test itself we want to make sure that each reporting category is measuring student performance with a balance of easy, moderate, and hard items.  This is essential for (1) having a test that is short enough to administer in a single class but long enough to predict future state test performance, and (2) providing the classroom teacher with diagnostic information about where the students are.  (To the second point – a test that is either too easy or too hard fails to provide meaningful information about the student’s knowledge and skills.)
  • As a consequence of measuring these end-of-year learning goals and maintaining a balance of difficulty levels, our tests tend to be quite challenging compared to a unit test from a textbook.  We design the tests so that the average test score is between 50% and 60% correct, and so that each reporting category has approximately 25% easy, 50% moderate, and 25% hard items.  Some of the items measure standards that haven’t been taught yet, which is completely normal.  This is why it is so important for teachers to study the reports carefully to understand what their students answered correctly and incorrectly, and why.
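For readers who like to see the numbers, here is a minimal sketch of how the 25% easy, 50% moderate, 25% hard balance described above could be checked for each reporting category. It is only an illustration, not our actual test-building process: the category names, the item list, the function names, and the 10-point tolerance are all assumptions made up for this example.

```python
from collections import Counter, defaultdict

# Target mix taken from the post; everything else here is hypothetical.
TARGET_MIX = {"easy": 0.25, "moderate": 0.50, "hard": 0.25}


def difficulty_mix(items):
    """Return {category: {level: share}} for a list of (category, level) pairs."""
    counts = defaultdict(Counter)
    for category, level in items:
        counts[category][level] += 1
    mix = {}
    for category, counter in counts.items():
        total = sum(counter.values())
        # Levels with no items get a share of 0.
        mix[category] = {level: counter[level] / total for level in TARGET_MIX}
    return mix


def is_balanced(shares, tolerance=0.10):
    """True if every level's share is within `tolerance` of the target mix.

    The tolerance is an assumed value, chosen only for illustration.
    """
    return all(
        abs(shares[level] - TARGET_MIX[level]) <= tolerance
        for level in TARGET_MIX
    )


# Hypothetical items: (reporting category, difficulty label).
items = (
    [("Number Sense", "easy")] * 2
    + [("Number Sense", "moderate")] * 4
    + [("Number Sense", "hard")] * 2
    + [("Geometry", "easy")] * 5
    + [("Geometry", "moderate")] * 2
    + [("Geometry", "hard")] * 1
)

for category, shares in difficulty_mix(items).items():
    status = "balanced" if is_balanced(shares) else "unbalanced"
    print(category, shares, status)
```

In this made-up data, the "Number Sense" category matches the target mix exactly, while "Geometry" is weighted too heavily toward easy items and would be flagged for rebalancing.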
