In many measurement settings, it is important to assess the reliability and validity of measurements. For example, forensic examiners are called upon to assess the quality of forensic evidence and draw conclusions about it (e.g., whether two fingerprints came from the same source). Reliability and validity are often assessed through “black box” studies, in which examiners make judgments regarding evidence of known origin under conditions meant to imitate real investigations. An open question is whether examiners differ in their ability to assess different items of evidence, i.e., whether there are examiner-by-evidence interactions. Estimating such interactions ordinarily requires replicate measurements, but for logistical and cost reasons it is not practical to obtain a full set of them. We use a hierarchical Bayesian analysis of variance model to address this limitation and simultaneously explain the variation in decisions both between examiners (reproducibility) and within an examiner (repeatability). The model can be applied to continuous, binary, or ordinal data. Simulation studies demonstrate the approach, and the methods are applied to data from handwriting and latent print examinations.