Statistical methods for digital image forensics: Algorithm mismatch for blind spatial steganalysis and score-based likelihood ratios for camera device identification

Forensic science currently faces a variety of challenges. Statistically suitable reference databases need to be developed and maintained. Subjective methods that can introduce bias need to be replaced by objective methods. Highly technical forensic methods need to be clearly and accurately communicated to juries. Juries should also be given information about the strength of the forensic evidence.

Many traditional blind steganalysis frameworks require training examples from all potential steganographic embedding algorithms, but creating a representative stego image database becomes increasingly difficult as the number and availability of embedding algorithms grow. We introduce a straightforward, non-data-intensive framework for blind steganalysis that requires only examples of cover images and a single embedding algorithm for training. Our framework addresses the case of algorithm mismatch, where a classifier is trained on one algorithm and tested on another. Our experiments use RAW image data from the BOSSbase database and six iPhone devices from the StegoAppDB project. We use four spatial embedding algorithms: LSB matching, MiPOD, S-UNIWARD, and WOW. We train Ensemble Classifiers with Spatial Rich Model features on a single embedding algorithm and test on each of the four algorithms. Classifiers trained on the MiPOD, S-UNIWARD, and WOW embedding algorithms achieve decent detection rates when tested on all four spatial embedding algorithms. Most notably, an Ensemble Classifier with an adjusted decision threshold trained on LSB matching data achieves decent detection rates on the three more advanced, content-adaptive algorithms: MiPOD, S-UNIWARD, and WOW.
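As an illustration of the simplest of the four algorithms, below is a minimal sketch of LSB matching (±1 embedding). It is illustrative only: payload placement, boundary handling, and the distortion-minimizing coding used by real embedders are simplified away, and the content-adaptive algorithms (MiPOD, S-UNIWARD, WOW) make the same ±1 changes but choose *where* to change pixels by minimizing a distortion cost.

```python
# Minimal sketch of LSB matching (+/-1 embedding) on a grayscale image.
# Simplifications: the message goes into the first len(bits) pixels in row
# order; real embedders scatter and code the payload.
import numpy as np

def lsb_match(cover: np.ndarray, message_bits: np.ndarray, seed: int = 0) -> np.ndarray:
    """Embed message_bits into the first len(message_bits) pixels of cover."""
    rng = np.random.default_rng(seed)
    stego = cover.astype(np.int16).ravel().copy()   # int16 avoids overflow at 0/255
    for i, bit in enumerate(message_bits):
        if stego[i] % 2 != bit:                     # LSB already matches -> no change
            step = rng.choice((-1, 1))              # otherwise randomly add or subtract 1
            if stego[i] == 0:                       # stay inside [0, 255]
                step = 1
            elif stego[i] == 255:
                step = -1
            stego[i] += step
    return stego.reshape(cover.shape).astype(np.uint8)

cover = np.random.default_rng(1).integers(0, 256, size=(64, 64), dtype=np.uint8)
bits = np.random.default_rng(2).integers(0, 2, size=1024)
stego = lsb_match(cover, bits)
assert np.all((stego.ravel()[:1024] % 2) == bits)   # message recoverable from LSBs
```

Because a ±1 change flips the pixel's parity either way, the random sign avoids the pairs-of-values artifact of classic LSB replacement, which is what makes LSB matching the harder baseline to detect.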

Score-based likelihood ratios (SLRs) have been employed in various areas of forensics when likelihood ratios (LRs) are unavailable. SLRs, like LRs, quantify the strength of evidence in support of two mutually exclusive but non-exhaustive hypotheses. LRs and SLRs have both been used for camera device identification, but the full framework of source-anchored, trace-anchored, and general match SLRs has not been investigated. In this work, we present a framework for all three types of SLRs for camera device identification. We use photo-response non-uniformity (PRNU) estimates as camera fingerprints and correlation distance as a similarity score. We calculate source-anchored, trace-anchored, and general match SLRs for 48 camera devices from four publicly available image databases: ALASKA, BOSSbase, Dresden, and StegoAppDB. Our experiments establish that all three types of SLRs can distinguish between devices of the same model and between devices of different models. The false positive and false negative rates for all three types of SLRs are low, and they can be lowered further by adding an inconclusive class.
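The pipeline can be sketched in a few functions. The sketch below is illustrative, not the dissertation's exact implementation: a Gaussian filter stands in for the denoiser used to extract PRNU residuals, and Gaussian kernel density estimates stand in for the same-source and different-source score distributions. The choice of anchoring determines only which comparisons populate the two score pools: for a source-anchored SLR, same-source scores compare the anchored camera's fingerprint to its own held-out images, while different-source scores compare it to images from other devices.

```python
# Minimal sketch of the PRNU / score-based likelihood ratio pipeline.
# Assumptions: images are float arrays; gaussian_filter is a stand-in denoiser;
# gaussian_kde is a stand-in for the fitted score distributions.
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.stats import gaussian_kde

def noise_residual(img: np.ndarray) -> np.ndarray:
    """Image minus its denoised version; approximates the PRNU-bearing noise."""
    return img - gaussian_filter(img, sigma=1.0)

def fingerprint(images: list[np.ndarray]) -> np.ndarray:
    """Camera fingerprint estimate: average residual over many images."""
    return np.mean([noise_residual(im) for im in images], axis=0)

def correlation_distance(fp: np.ndarray, residual: np.ndarray) -> float:
    """1 - Pearson correlation between a fingerprint and a single residual."""
    a = fp.ravel() - fp.mean()
    b = residual.ravel() - residual.mean()
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def slr(score: float, same_source_scores, diff_source_scores) -> float:
    """Score-based LR: density ratio of the observed score under the
    same-source vs. different-source hypotheses."""
    f_same = gaussian_kde(same_source_scores)
    f_diff = gaussian_kde(diff_source_scores)
    return float(f_same(score) / f_diff(score))
```

An SLR well above 1 supports the same-source hypothesis, well below 1 the different-source hypothesis, and an inconclusive class corresponds to declining to decide when the SLR falls in a band around 1.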

Bayesian hierarchical modeling for the forensic evaluation of handwritten documents

The analysis of handwritten evidence has been used widely in courts in the United States since the 1930s (Osborn, 1946). Traditional evaluations are conducted by trained forensic examiners. More recently, there has been a movement toward objective and probability-based evaluation of evidence, and a variety of governing bodies have made explicit calls for research to support the scientific underpinnings of the field (National Research Council, 2009; President’s Council of Advisors on Science and Technology (US), 2016; National Institute of Standards and Technology). This body of work makes contributions to help satisfy those needs for the evaluation of handwritten documents.

We develop a framework to evaluate a questioned writing sample against a finite set of genuine writing samples from known sources. Our approach is fully automated, removing routine examiner intervention from the analysis pipeline and thereby reducing the opportunity for cognitive bias. Our methods handle all writing styles together and produce estimated probabilities of writership based on parametric modeling. We contribute open-source datasets, code, and algorithms.
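Because the set of known sources is finite, the estimated probability of writership takes the usual closed-set Bayes form. A sketch, with illustrative notation that is not necessarily the dissertation's:

```latex
% Posterior probability that writer w_k produced questioned document y,
% given a closed set of K known writers.
\[
  P(W = w_k \mid y) \;=\;
  \frac{P(y \mid W = w_k)\, P(W = w_k)}
       {\sum_{j=1}^{K} P(y \mid W = w_j)\, P(W = w_j)},
\]
% where P(y | W = w_j) is the fitted hierarchical model's likelihood of the
% questioned sample's features under writer j.
```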

A document is prepared for the evaluation process by first being scanned and stored as an image file. The image is processed and the text within it is decomposed into a sequence of disjoint graphical structures. These graphs serve as the smallest unit of writing we consider, and features extracted from them are used as data for modeling. Chapter 2 describes the image processing steps and introduces a distance measure for the graphs. The distance measure is used in a K-means clustering algorithm (Forgy, 1965; Lloyd, 1982; Gan and Ng, 2017), which results in a clustering template with 40 exemplar structures. The primary feature we extract from each graph is a cluster assignment: we compare each graph to the template and assign it to the exemplar it most resembles in structure. The cluster assignment feature is used in a writer identification exercise with a Bayesian hierarchical model on a small set of 27 writers. In Chapter 3 we incorporate new data sources and a larger number of writers into the clustering algorithm to produce an updated template. A mixture component is added to the hierarchical model, and we explore the relationship between a writer’s estimated mixing parameter and their writing style. In Chapter 4 we expand the hierarchical model to include other graph-based features in addition to cluster assignments. We incorporate an angular feature, with support on the polar coordinate system, into the hierarchical modeling framework using a circular probability density function. The new model is applied and tested in three applications.
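The cluster-assignment step can be illustrated compactly. In the sketch below, each graph is assumed to be summarized by a fixed-length feature vector and compared to the 40 template exemplars with Euclidean distance; the dissertation uses a custom graph distance, so this stand-in only shows the shape of the assignment step and of the per-writer counts fed to the hierarchical model.

```python
# Minimal sketch: assign each graph to its nearest template exemplar, then
# tally per-cluster counts as the writer's data for the hierarchical model.
import numpy as np

def assign_clusters(graph_features: np.ndarray, template: np.ndarray) -> np.ndarray:
    """Return, for each graph, the index of the most similar exemplar.

    graph_features: (n_graphs, d) array, one row per graph.
    template:       (40, d) array, one row per exemplar.
    """
    # Pairwise distances, shape (n_graphs, 40); Euclidean is a stand-in
    # for the graph distance measure of Chapter 2.
    d = np.linalg.norm(graph_features[:, None, :] - template[None, :, :], axis=-1)
    return d.argmin(axis=1)

def cluster_counts(assignments: np.ndarray, n_clusters: int = 40) -> np.ndarray:
    """Counts of a writer's graphs falling in each of the 40 clusters."""
    return np.bincount(assignments, minlength=n_clusters)
```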

Mock Jurors’ Evaluation of Firearm Examiner Testimony

Objectives: Firearms experts traditionally have testified that a weapon leaves “unique” toolmarks, so bullets or cartridge casings can be visually examined and conclusively matched to a particular firearm. Recently, due to scientific critiques, Department of Justice policy, and judges’ rulings, firearms experts have tempered their conclusions. In two experiments, we tested whether this ostensibly more cautious language has its intended effect on jurors (Experiment 1) and whether cross-examination affects jurors’ perceptions of firearm testimony (Experiment 2). Hypotheses: Four hypotheses were tested. First, jurors would accord significant weight to firearm testimony that declares a “match” compared to testimony that does not (Experiments 1 and 2). Second, variations to “match” language would not affect guilty verdicts (Experiment 1). Third, only the most cautious language (“cannot exclude the gun”) would lower guilty verdicts (Experiment 1). Fourth, cross-examination would reduce guilty verdicts depending on the specific language used (Experiment 2). Method: In two preregistered, high-powered experiments with 200 mock jurors per cell, participants recruited from Qualtrics Panels were presented with a criminal case containing firearms evidence; the wording of the examiner’s conclusion and the presence of cross-examination were varied. The conclusion wordings included language used by practitioners, language advised by government organizations, and language required by judges in several cases. In all conditions, participants gave a verdict and rated the evidence and the expert. Results: Guilty verdicts significantly increased when a match was declared compared to when one was not. Variation in conclusion language affected neither guilty verdicts nor jurors’ estimates of the likelihood that the defendant’s gun fired the bullet recovered at the crime scene. However, the most cautious conclusion, that an examiner “cannot exclude the defendant’s gun,” did significantly reduce both guilty verdicts and likelihood estimates. The presence of cross-examination did not affect these findings. Conclusion: Apart from the most limited language (“cannot exclude the defendant’s gun”), judicial intervention to limit firearms conclusion language is not likely to produce its intended effect. Moreover, cross-examination does not appear to affect perceptions or individual juror verdicts.