Date Published: November 2019

Stefanie A. Wind and Eli Jones. Educational Researcher, 48(8), 521–533.


Purpose of the Study

When classroom observations are used to evaluate teachers, how do we know that the ratings teachers receive are trustworthy? The purpose of this study was to illustrate a method, based on Rasch measurement theory, for evaluating the quality of classroom observation ratings.


Classroom observations are the key component of most teacher evaluation systems. Typically, principals observe a teacher’s classroom and provide a rating, or “score,” based on what they observe during that snippet of time. The reliability of principals’ ratings is therefore of particular interest. But what is the best way to determine reliability in a teacher evaluation system? This study makes a case for the use of a particular statistical approach: the Many-Facet Rasch (MFR) model.

About MFR Models

The MFR approach is particularly useful because it identifies sources of variance within the data, such as differences among individual raters (e.g., whether some principals are more severe or lenient than others) or across observation occasions. MFR models reveal whether raters have used the rating scale appropriately, allow statistical adjustment for rater effects (e.g., severity or leniency), and identify areas where the rating process can be improved.
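To make the idea concrete, a common rating-scale formulation of an MFR model (following Linacre's many-facet Rasch work; the exact model specification in this study may differ) expresses the log-odds of teacher n receiving category k rather than k−1 from rater j as:

```latex
\log\!\left(\frac{P_{njk}}{P_{nj(k-1)}}\right) = \theta_n - \lambda_j - \tau_k
```

Here \(\theta_n\) is the teacher's effectiveness, \(\lambda_j\) is the rater's severity, and \(\tau_k\) is the threshold for rating category k. Because severity enters as its own term, the model can separate "this principal rates harshly" from "this teacher is less effective."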

While MFR models have a long history of use in other areas, such as writing assessments in which raters score students’ compositions, their application to classroom observations is relatively new.
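As a rough illustration of how rater severity shows up in rating data (this is a simulation sketch, not the authors' analysis; all parameter names and values below are hypothetical), the following generates ratings under a rating-scale MFR model and shows that more severe raters produce lower average ratings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical facet parameters, in logits.
n_teachers, n_raters = 200, 10
theta = rng.normal(0.0, 1.0, n_teachers)   # teacher effectiveness
lam = np.linspace(-1.0, 1.0, n_raters)     # rater severity, lenient -> severe
tau = np.array([-1.5, -0.5, 0.5, 1.5])     # category thresholds for a 0-4 scale

def rating_probs(theta_n, lam_j, tau):
    """Category probabilities under a rating-scale MFR model."""
    # P(k) is proportional to exp of the cumulative sum of (theta - lambda - tau_h).
    steps = np.concatenate(([0.0], np.cumsum(theta_n - lam_j - tau)))
    p = np.exp(steps - steps.max())
    return p / p.sum()

# Simulate one rating of every teacher by every rater.
ratings = np.empty((n_teachers, n_raters), dtype=int)
for n in range(n_teachers):
    for j in range(n_raters):
        ratings[n, j] = rng.choice(5, p=rating_probs(theta[n], lam[j], tau))

# Severe raters (high lambda) should give lower ratings on average,
# so severity and mean rating are strongly negatively correlated.
mean_by_rater = ratings.mean(axis=0)
print(np.corrcoef(lam, mean_by_rater)[0, 1])
```

In real applications the facet parameters are estimated from the observed ratings (e.g., with specialized software such as FACETS or the R package TAM) rather than assumed, but the simulation shows why separating severity from teacher effectiveness matters.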


Data

The data in this study come from the Network for Educator Effectiveness (NEE). The sample consisted of 114 principals in the NEE system during the 2016-17 school year. Ratings of three teaching practices were analyzed for this study:

  • Cognitive engagement (NEE Indicator 1.2)
  • Critical thinking (NEE Indicator 4.1)
  • Formative assessment (NEE Indicator 7.4)


Key Findings

The MFR analysis of the NEE data yielded three key findings:

  1. There are substantial severity differences among the principals (i.e., some principals consistently give higher or lower ratings). However, these differences are not considered evidence of measurement error. The researchers note that differences in rater severity can exist even with high reliability.
  2. Principals were internally consistent in their ratings. Only nine of the 114 principals were flagged as “misfitting” by the analysis. That is, they gave ratings that were statistically unexpected (e.g., a lenient principal gave a high-scoring teacher an unexpectedly low rating). The researchers showed examples from four principals. Two stayed within a 95% confidence interval of expected ratings, and two gave unexpected ratings.
  3. The principals used the rating scale appropriately (i.e., the scores describe distinct levels of teaching effectiveness and higher scores correspond to more effective teaching).
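The misfit flagging in finding 2 rests on comparing each observed rating with its model-expected value. As a hedged sketch (illustrative parameter values, not the study's estimates), the expected rating and its standard deviation under the rating-scale MFR model can be computed directly from the category probabilities, and a standardized residual outside roughly ±1.96 corresponds to a rating outside the 95% interval:

```python
import numpy as np

tau = np.array([-1.5, -0.5, 0.5, 1.5])  # illustrative category thresholds, 0-4 scale

def category_probs(theta, lam, tau):
    """Category probabilities for teacher theta rated by a rater with severity lam."""
    steps = np.concatenate(([0.0], np.cumsum(theta - lam - tau)))
    p = np.exp(steps - steps.max())
    return p / p.sum()

def expected_and_sd(theta, lam, tau):
    """Model-expected rating and its standard deviation."""
    p = category_probs(theta, lam, tau)
    k = np.arange(len(p))
    e = (k * p).sum()
    sd = np.sqrt(((k - e) ** 2 * p).sum())
    return e, sd

# A lenient rater (lam = -1.0) observing a strong teacher (theta = +1.5)
# is expected to give a high rating...
e, sd = expected_and_sd(1.5, -1.0, tau)

# ...so an actual rating of 1 is statistically unexpected:
z = (1 - e) / sd
print(e, z)  # |z| > 1.96 flags this rating as falling outside the 95% interval
```

Aggregating such residuals across all of a rater's ratings gives the fit statistics used to flag "misfitting" raters, which is how a pattern like finding 2's example (a lenient principal giving a high-scoring teacher an unexpectedly low rating) is detected.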


Conclusions

This study illustrates that evaluating the quality of principals’ ratings involves more than describing interrater reliability (i.e., the extent to which different principals agree when rating teachers).

The Many-Facet Rasch (MFR) model is useful for more fully exploring the quality of ratings in classroom observations, such as severity, rater fit, and raters’ use of the rating scale categories.

Findings of this study underscore the importance of rater training, which happens annually in the NEE system. Training and implementation of any observation system, including NEE, should examine the factors that affect rater quality and make adjustments in order to increase the reliability of the system.