Understanding Assessment Quality: Validity, Reliability and More

I find it hard to imagine a system that is completely perfect and devoid of human influence or error.

However, there are ways to increase the confidence we can have in the system.

1) Start with valid assessments.  What is a valid assessment?  It is an assessment that accurately measures what it is intended to measure.  That means the questions are written so that students answer correctly or incorrectly because of their knowledge of the content and skill, NOT because of cultural bias, excessive wordiness, etc.  For example, a question containing a very difficult vocabulary word might become invalid if students get it wrong because they don’t know that word, not because they don’t understand the concept.  The data then measures who knows the word, not what you intended to measure.  You can’t demonstrate that an assessment is valid until you have historical data, so it is important that we give the assessment to students and examine the scores.  “…Validity is concerned with the confidence with which we may draw inferences about student learning from an assessment.  Furthermore, validity is not an either/or proposition, instead it is a matter of degree” (Gareis & Grant, 2008, p. 35).

Thus, increasing validity will be an ongoing district process.
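
One simple way to "examine the scores" is to look at how many students answered each question correctly. The sketch below is only an illustration: the response data, the function names, and the 30% cutoff are all assumptions made for the example, not district rules. A low number does not prove a question is invalid; it only signals that the question deserves a closer look (for instance, for a hard vocabulary word).

```python
# Hypothetical data: each row is one student's results (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [1, 1, 1, 1],
]

def proportion_correct(responses):
    """Return the share of students who answered each question correctly."""
    n_students = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n_students for i in range(n_items)]

def flag_for_review(proportions, threshold=0.3):
    """Return 1-based question numbers whose proportion correct is below the threshold.

    The threshold is an assumed cutoff for illustration only.
    """
    return [i + 1 for i, p in enumerate(proportions) if p < threshold]

proportions = proportion_correct(responses)
flagged = flag_for_review(proportions)
print(proportions)  # share correct for questions 1 through 4
print(flagged)      # question numbers a team might want to reread
```

In this made-up data set only one student answered question 3 correctly, so it is the question a team would pull out and reread together.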

2) Start with reliable assessments.  What is a reliable assessment?  It is an assessment that will yield repeatable results.  If you give an assessment to a group of students one year, then give it to a similar group of students the next year, you should get similar results.  Reliability is easier to achieve in selected-response (multiple choice) tests.  However, if a teacher gives an assessment to period 2, notices some trouble the students had, and therefore gives different directions to period 4, then the teacher is interfering with the reliability.  Thus, reliability requires consistency.

Rubrics are a viable and valuable tool for measuring growth over multiple data points.  It is important to recognize, however, that ensuring reliability (repeatability of results) can be a little more difficult with a rubric-graded open-ended task than with a multiple choice test.  Rubrics must be created and implemented so that the graders have a very specific understanding of what each level of the rubric means.  Inter-rater reliability, or consistency between multiple graders, is important so that the whole team would agree on the same score for the same work.  Even more essential is that each individual grader is internally consistent.  That means when you grade a batch of student work, and then another batch several months or even a year later, your scoring methods are the same: student work earning a 3 looks the same as every other sample earning a 3, every time.  Ensuring repeatable results with the same rubric is ESSENTIAL to getting data that is reliable and useful for talking about students and how they are growing.
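
Consistency between graders can be put into numbers. The sketch below assumes two graders have scored the same eight work samples on a 1 to 4 rubric; the scores and function names are invented for illustration. It computes the exact agreement rate and Cohen's kappa, a common statistic that corrects agreement for what would be expected by chance. Kappa is one reasonable choice here, not a method prescribed by any particular district.

```python
from collections import Counter

def exact_agreement(scores_a, scores_b):
    """Share of work samples that received the same rubric level from both graders."""
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)

def cohens_kappa(scores_a, scores_b):
    """Agreement between two sets of scores, corrected for chance agreement."""
    n = len(scores_a)
    observed = exact_agreement(scores_a, scores_b)
    counts_a = Counter(scores_a)
    counts_b = Counter(scores_b)
    # Chance agreement: probability both graders pick the same level independently.
    expected = sum(counts_a[level] * counts_b[level] for level in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # every sample received the same single level from both graders
    return (observed - expected) / (1 - expected)

# Hypothetical rubric scores for the same eight work samples.
grader_1 = [3, 2, 4, 3, 1, 2, 3, 4]
grader_2 = [3, 2, 3, 3, 1, 2, 2, 4]

agreement = exact_agreement(grader_1, grader_2)
kappa = cohens_kappa(grader_1, grader_2)
print(f"Exact agreement: {agreement:.2f}")
print(f"Cohen's kappa:   {kappa:.2f}")
```

The same functions can check one grader's consistency over time: pass in the scores that grader gave a set of samples in the fall and the scores the same grader gives those samples again months later.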

3) Assessments Should Be High Quality: Teachers should be able to explain why the assessment set accurately measures student growth in the key areas of their curriculum, and administrators should be able to understand the common elements of quality growth assessment design.  Using a district rubric or checklist that looks at alignment, use of distractors and wrong answers, growth design, cognitive demand, validity and reliability will help indicate the quality of the assessments.  “Gaming the system” with unaligned, easy, or schemed assessments can then be identified and prevented.

4) Make Data Collection Simple: Asking teachers to fill out complex spreadsheets opens us up to unintentional data-entry errors.  Scantron tools and other ways of collecting data consistently and automatically will minimize errors, both intentional and unintentional.

Multiple Teachers Grading:

Validity should not be affected by the fact that there is more than one grader.  In fact, the multiple minds at the table during the creation process should help increase the validity of the assessment and the likelihood that it truly measures what it is intended to measure.

Reliability is another issue altogether.  When multiple teachers are grading assessments, we need to make sure the results are repeatable no matter who grades the test.  Start increasing inter-rater reliability by holding a “trade and grade” professional development event, where teachers can learn how the others would have graded the same questions.  Some districts have two graders score the same assessment and use the averaged score.  Other districts don’t allow teachers to grade their own students’ work.  Ultimately, if multiple teachers using the same assessment have worked to ensure a high degree of comparability in the way they give scores, the data produced will be reliable.
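
The two-grader approach can be sketched in a few lines. The averaging comes from the practice described above; the extra step of flagging work when the two scores are far apart is an added assumption for illustration, as are the student labels, the scores, and the `max_gap` setting.

```python
def combine_scores(first, second, max_gap=1):
    """Average two graders' scores for one piece of student work.

    Also reports whether the scores differ by more than max_gap rubric
    levels. The flag and the default gap of 1 are assumptions added for
    this example.
    """
    average = (first + second) / 2
    needs_review = abs(first - second) > max_gap
    return average, needs_review

# Hypothetical scores: (first grader, second grader) for each student.
paired = {"Student A": (3, 3), "Student B": (2, 4), "Student C": (4, 3)}

results = {name: combine_scores(a, b) for name, (a, b) in paired.items()}
for name, (average, needs_review) in results.items():
    note = " (scores far apart: discuss)" if needs_review else ""
    print(f"{name}: {average}{note}")
```

Work that gets flagged is a natural conversation starter at a “trade and grade” session, since it shows exactly where graders read the rubric differently.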


Anne Weerda

This article was written by Kids at the Core founder, Anne Weerda.

Anne is an assessment and curriculum specialist best known for her work in assessment design, data analysis and instructional effectiveness. Anne is a sought after speaker in the area of assessment design, curriculum and instruction.