Date Published: May 26, 2017

Christi Bergin, Stefanie A. Wind, Sara B. Grajeda, Chia-Lin Tsai Published in Studies in Educational Evaluation, Volume 55 (2017), pages 19-26


Purpose of the study

The Network for Educator Effectiveness informs educator evaluation practices through our work with schools and through our research. Every summer, NEE trains more than 2,000 administrators on effective evaluation practices, with the goal of helping them become more consistent, reliable, and accurate evaluators of teachers in the classroom. This article examines principal accuracy after one of the summer training sessions that NEE requires of all evaluators. The study sought to answer three questions:

  1. How accurate are principals’ ratings when conducting a classroom observation?
  2. Does rating accuracy vary by which classroom is being observed?
  3. Does rating accuracy vary by which teaching practice is being observed?

Background of the study

Research on the accuracy of evaluators after a training session is important for many reasons. As the authors state, “inaccurate ratings are unfair to teachers and provide misinformation on teachers’ effectiveness.” There is also consensus in the research community that inaccurate ratings are ethically unacceptable when they inform high-stakes personnel decisions and evaluation.

Unfortunately, most evaluation systems currently available to school districts have few resources in place to investigate the accuracy of observations after training or in actual practice within schools. The NEE Research Team is in the rare position to do so, and it investigated rating accuracy during training because of the foundational importance of evaluators being accurate after training and before observing in their own schools.

The research was conducted during the summer of 2015. All evaluators who completed the qualifying exam during that training season (1,324 in total) were included in the data. The exam was based on four ten-minute videos of authentic classrooms, and each evaluator was asked to score each video on six teaching practices: (1) use of academic language, (2) cognitive engagement, (3) critical thinking, (4) motivation practices, (5) student-teacher relationship, and (6) formative assessment.

Findings of the study

The study was conducted using a Many-Facet Rasch model, which examines three facets: evaluators (the 1,324 who completed the exam), teaching episodes (the four videos), and teaching practices (the six practices observed). An evaluator’s rating was classified as accurate if it fell within one point of the criterion rating; otherwise it was marked as inaccurate.
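The within-one-point accuracy rule described above can be sketched as a simple comparison against the criterion ratings. This is a minimal illustration, not the authors' analysis code; the rating values and criterion scores below are invented for the example.

```python
def is_accurate(rating: int, criterion: int, tolerance: int = 1) -> bool:
    """A rating counts as accurate if it falls within `tolerance`
    points of the expert criterion rating for that video and practice."""
    return abs(rating - criterion) <= tolerance

# One evaluator's ratings of a single video on the six teaching practices,
# compared against hypothetical criterion ratings for that video.
criterion_ratings = [5, 4, 3, 6, 5, 4]
evaluator_ratings = [5, 6, 3, 5, 5, 2]

flags = [is_accurate(r, c) for r, c in zip(evaluator_ratings, criterion_ratings)]
print(flags)                      # per-practice accuracy flags
print(sum(flags) / len(flags))    # proportion of practices rated accurately
```

The resulting accurate/inaccurate classifications, accumulated across all evaluators, videos, and practices, are what the Many-Facet Rasch model then analyzes.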

How accurate are principals’ ratings when conducting a classroom observation?

The authors found that evaluators “had high overall accuracy.” The authors pointed out possible reasons for that accuracy, including the use of the innovative research approach. They also noted features of the NEE evaluation system that could help evaluators become more accurate in classroom observations, including:

  • The annual training that principals participate in
  • The quantitative rubric with an elongated scale with behavioral descriptors
  • The lack of a cut score for a teacher to be classified as “proficient”

Does rating accuracy vary by which classroom is being observed?

The findings also suggested that principals “tended to cluster” in their accuracy levels. The authors found several groups of evaluators with similar accuracy, and the individual differences within each of those groups were not distinct. This suggests that some evaluators demonstrated high accuracy at the conclusion of training, while others demonstrated low accuracy. This finding opens up the possibility for further research on the characteristics of principals and how those characteristics may influence accuracy, such as previous familiarity with subject areas, grade levels, or particular teaching practices.

Does rating accuracy vary by which teaching practice is being observed?

The authors also found that “teaching practices varied in how difficult they were to rate accurately.” Notably, formative assessment was the most difficult teaching practice to rate, while critical thinking was the easiest.

In conclusion, it is not appropriate to expect an evaluator to be consistently accurate. Observations, and the ability to be accurate when rating those observations, vary based on what type of classroom is observed or what type of teaching practice is observed.

Reflection on the study

This study provides important justification for NEE’s requirement of continuing training, especially in rating classroom observations. The ability to be accurate across classrooms and teaching practices is a skill that needs to be continuously practiced and examined. Through practice and skill-building, our hope is that our evaluators continuously increase their accuracy not only after training but in their school buildings and districts.

To help with that accuracy, it is important to remember the following recommendations from the Network for Educator Effectiveness: focus on fewer teaching practices (home in on 3-5 indicators), observe all classrooms on those teaching practices as often as possible (6 to 10 times per year), and align those teaching practices with the improvement initiatives and professional development of the school and/or district.