Insights: Error Rates, Likelihood Ratios, and Jury Evaluation of Forensic Evidence


Error Rates, Likelihood Ratios, and Jury Evaluation of Forensic Evidence


Forensic examiner testimony regularly plays a role in criminal cases, yet little is known about how much weight such testimony carries with jurors.

Researchers set out to learn more: What impact does testimony that is further qualified by error rates and likelihood ratios have on jurors’ conclusions concerning fingerprint comparison evidence and a novel technique involving voice comparison evidence?

Lead Researchers

Brandon L. Garrett J.D.
William E. Crozier, Ph.D. 
Rebecca Grady, Ph.D.


Journal of Forensic Sciences

Publication Date

22 April 2020


Participants would place less weight on voice comparison testimony than on fingerprint testimony, due to cultural familiarity with fingerprinting and perceptions of its reliability.

Participants who heard error rate information would put less weight on forensic evidence — voting guilty less often — than participants who heard traditional and generic instructions lacking error rates.

Participants who heard likelihood ratios would place less weight on forensic expert testimony compared to testimony offering an unequivocal and categorical conclusion of an ID or match.
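For context, a likelihood ratio expresses how much more probable the observed evidence is under the hypothesis that two samples share a source than under the hypothesis that they do not. The sketch below uses hypothetical probabilities for illustration only; they are not values from the study.

```python
def likelihood_ratio(p_given_same_source: float, p_given_diff_source: float) -> float:
    """Ratio of how probable the evidence is under the two competing hypotheses.

    Values far above 1 support the same-source hypothesis; values far
    below 1 support the different-source hypothesis.
    """
    return p_given_same_source / p_given_diff_source

# Hypothetical example: the observed similarity is about 100x more
# probable if the samples came from the same source.
lr = likelihood_ratio(0.20, 0.002)
print(round(lr))  # 100
```

An expert reporting such a ratio conveys graded support for a hypothesis rather than the unequivocal categorical "match" conclusion the study contrasts it with.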



900 participants read a mock trial about a convenience store robbery with a single piece of forensic evidence linking the defendant to the crime


2 (Evidence: Fingerprint vs. Voice Comparison)
x 2 (Identification: Categorical or Likelihood Ratio)
x 2 (Instructions: Generic vs. Error Rate) design


Participants were randomly assigned to 1 of the 8 different conditions
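The 2 × 2 × 2 between-subjects design above can be sketched by fully crossing the three factors; the labels below are shorthand for the conditions described in the text, not the study's own variable names.

```python
from itertools import product

# The three crossed factors from the study design.
evidence = ["fingerprint", "voice"]
identification = ["categorical", "likelihood ratio"]
instructions = ["generic", "error rate"]

# The full crossing yields the 8 conditions participants were randomized into.
conditions = list(product(evidence, identification, instructions))
print(len(conditions))  # 8
```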

After reading the trial materials and jury instructions, participants decided whether they would vote that the defendant was guilty beyond a reasonable doubt


Laypeople gave more weight to fingerprint evidence than voice comparison evidence.

Fewer guilty verdicts arose from voice evidence, suggesting that novel forensic methods may not provide equally powerful evidence of guilt.

Jurors placed less weight on fingerprint evidence when they learned about its error rates.

Error rate information appears particularly important for types of forensic evidence that people may already assume to be highly reliable.

Participants considering fingerprint evidence were more likely to find the defendant not guilty when provided instruction on error rates. When the fingerprint expert offered a likelihood ratio, the error rate instructions did not decrease guilty verdicts.

When asked which error is worse, wrongly convicting an innocent person or failing to convict a guilty person, the study found that the majority of participants were more concerned with convicting an innocent person.

Participants who believed convicting an innocent person was the worse error were less likely to vote guilty, reflecting greater doubt in the evidence.

Those more concerned about releasing a guilty person were more likely to vote guilty.

Other participants believed the two errors were equally bad.

Researchers found, overall, that presenting an error rate moderated the weight of evidence only when paired with a fingerprint identification.


To produce better judicial outcomes when juries are formed with laypeople:

Direct efforts toward offering more explicit judicial instructions.

Craft a better explanation of evidence limitations.

Consider these findings when developing new forensic techniques: juries may trust new techniques less, even when those techniques are more reliable and have lower error rates.

Pay attention to juror preconceptions about the reliability of evidence.

Laboratories Learn About Accuracy of Forensic Software Tools through NIST Study

hand holding mobile phone showing multiple apps

In a new article, NIST researcher Jenise Reyes-Rodriguez shares an inside look at her work testing mobile forensic software tools. She and her team explore the validity of different methods for extracting data from mobile devices, even from damaged phones. The researchers subject a wide array of digital forensic tools to rigorous, systematic evaluation, determining how accurately each tool retrieves crucial information from a device.

She explains that unlike what you might see on television, forensic labs often work with limited budgets and may not have access to multiple tools; they typically need to work with what they already have or can afford. Reyes-Rodriguez and her research team test these mobile tools on the most popular devices on the market and create reports for labs listing any anomalies, such as incomplete text messages or contact names. These reports help labs know whether the tool they have is appropriate for their case, and provide a guide to alternative options or an ideal tool to buy.

This research was funded by NIST and the Department of Homeland Security’s Cyber Forensics Project. Read the full story on the NIST website, and learn more about the research group’s work on the CFTT website. Here you can also access NIST’s testing methodology and forensic tool testing reports.

NIST and Noblis Seek Participants for Bullet Black Box Study

Are you a US firearms examiner who has conducted operational casework in the past year? NIST and Noblis are seeking participants for a bullet black-box study to evaluate the accuracy, repeatability, and reproducibility of bullet comparisons by firearms examiners.

Study Overview

Participants will conduct 100 comparisons over a period of approximately 6-7 months. The test will be administered by sending physical samples to participants in 10 packets, each containing 10 bullet sets for comparison. The test samples will span a range of bullets collected under ground-truth controlled conditions, selected to be as representative of casework as practical. Firearms, calibers, and ammunition frequently encountered in casework will be used. Custom web-based software will record examiner responses and transmit them back to the test administrators.

Interested in participating? Additional details can be found on the NIST flyer.

Insights: Comparison of three similarity scores for bullet LEA matching


Comparison of three similarity scores for bullet LEA matching


As technology advances in the forensic sciences, it is important to evaluate the performance of recent innovations. Researchers funded by CSAFE judged the efficacy of different scoring methods for comparing land engraved areas (LEAs) found on bullets.

Lead Researchers

Susan Vanderplas
Melissa Nally
Tylor Klep
Christina Cadevall
Heike Hofmann


Forensic Science International

Publication Date

March 2020


Evaluate the performance of scoring measures at a land-to-land level, using random forest scoring, cross-correlation, and consecutive matching striae (CMS).

Consider the efficacy of these scoring measures on a bullet-to-bullet level.

The Study

  • Data was taken from three separate studies, each using similar firearms from the same manufacturer, Ruger, to compare land engraved areas (LEAs): areas on a bullet marked by the gun barrel's lands, the raised sections between the grooves of the barrel's rifling.
  • Examiners processed the LEA data through a matching algorithm and scored it using these three methods:


Random Forest (RF):

A machine-learning method that combines the votes of many decision trees to reach a single result.


Cross-Correlation (CC):

A measure of similarity between two series of data.


Consecutive Matching Striae (CMS):

Counting runs of consecutive matching peaks and valleys (striae) between two LEAs.
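Of the three scores, cross-correlation is the simplest to illustrate. Below is a minimal sketch of a normalized (Pearson-style) cross-correlation between two equal-length LEA profiles; the profile values are made up for illustration, not CSAFE data.

```python
from math import sqrt

def normalized_cross_correlation(a, b):
    """Similarity between two equal-length LEA height profiles.

    Returns a value in [-1, 1]; values near 1 indicate closely
    matching striation patterns.
    """
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    num = sum(x * y for x, y in zip(da, db))
    den = sqrt(sum(x * x for x in da)) * sqrt(sum(y * y for y in db))
    return num / den

# Hypothetical profiles: the same pattern, shifted up by a constant.
profile_1 = [0.1, 0.4, 0.3, 0.8, 0.2]
profile_2 = [0.2, 0.5, 0.4, 0.9, 0.3]
print(round(normalized_cross_correlation(profile_1, profile_2), 3))  # 1.0
```

Because the score is mean-centered and scale-normalized, a uniform offset between two profiles (as above) does not reduce the similarity.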


The equal error rate of each scoring method across multiple studies

  • On a bullet-to-bullet level, the Random Forest and Cross-Correlation scoring methods made no errors.
  • On a land-to-land level, the RF and CC methods outperformed the CMS method.
  • When comparing equal error rates, the CMS method had an error rate of over 20%, while both the RF and CC methods’ error rates were roughly 5%. The RF method performed slightly better.
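An equal error rate is the error level at the decision threshold where the false-match rate equals the false-non-match rate. The sketch below computes it by brute force over hypothetical similarity scores; these numbers are invented for illustration and are not the study's data.

```python
def error_rates(matches, non_matches, threshold):
    # False non-match rate: true matches scoring below the threshold.
    fnmr = sum(s < threshold for s in matches) / len(matches)
    # False match rate: true non-matches scoring at or above the threshold.
    fmr = sum(s >= threshold for s in non_matches) / len(non_matches)
    return fnmr, fmr

def equal_error_rate(matches, non_matches):
    # Scan every observed score as a candidate threshold and keep the
    # one where the two error rates are closest; report their average.
    best = None
    for t in sorted(set(matches + non_matches)):
        fnmr, fmr = error_rates(matches, non_matches, t)
        gap = abs(fnmr - fmr)
        if best is None or gap < best[0]:
            best = (gap, (fnmr + fmr) / 2)
    return best[1]

# Hypothetical similarity scores for known same-source and
# known different-source bullet comparisons.
same_source = [0.9, 0.8, 0.75, 0.6, 0.3]
diff_source = [0.4, 0.35, 0.2, 0.15, 0.1]
print(equal_error_rate(same_source, diff_source))  # 0.2
```

A lower equal error rate, as the study reports for the RF and CC methods, means the score separates same-source from different-source comparisons more cleanly.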



The random forest algorithm struggled to identify damage that obscured LEAs, whether caused by deficiencies in the gun barrel, such as pitting from gunpowder, or by “tank rash” on expended bullets.

  • In future studies, examiners could pair the RF algorithm with another algorithm to assess the quality of the data and determine which portions can be used for comparison.

All the studies used firearms from Ruger, a manufacturer chosen because its firearms mark bullets very well. Future studies could assess the performance of these scoring methods on firearms from other manufacturers that produce marks of differing quality.