
Score-based Likelihood Ratios Using Stylometric Text Embeddings

Conference/Workshop: 2024 Joint Statistical Meetings
Published: 2024
Primary Author: Rachel Longjohn
Secondary Authors: Padhraic Smyth
Type: Poster

We consider the setting in which we have two sets of digital texts and would like to quantify our belief that both sets were written by the same author rather than by two different authors. Motivated by problems in digital forensics, we allow the sets to consist primarily of short-form messages, and texts by the same author may cover vastly different topics. We therefore focus on author-specific stylometric aspects of the texts that are consistent across an author’s writings and invariant to topic. Recent work in machine learning has sought to learn a mapping from input texts to vector representations intended to capture such stylometric features. In this work, we investigate the use of such stylometric text embeddings to construct a score-based likelihood ratio (SLR), an increasingly popular way of quantifying evidence in forensics. We present the results of SLR experiments using recently proposed stylometric embeddings from machine learning applied to real-world datasets relevant to digital forensics.
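
To make the idea of a score-based likelihood ratio concrete, the sketch below shows one common way such a ratio can be built: reduce the two sets of texts to a single similarity score s, then compare how plausible that score is under the same-author and different-author hypotheses, SLR(s) = f(s | same author) / f(s | different authors). Everything specific here is an assumption made for illustration and is not taken from the poster: the score (cosine similarity between the mean embeddings of each set), the density estimator (a Gaussian kernel density estimate with a normal-reference bandwidth), and the embeddings themselves, which are simulated random vectors standing in for the output of a stylometric embedding model.

```python
import math
import random
from statistics import stdev


def mean_vector(vectors):
    """Average a set of equal-length embedding vectors component-wise."""
    return [sum(component) / len(component) for component in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def score(set_a, set_b):
    """Assumed score: cosine similarity between the two sets' mean embeddings."""
    return cosine_similarity(mean_vector(set_a), mean_vector(set_b))


def gaussian_kde(samples):
    """Gaussian kernel density estimate with a normal-reference bandwidth."""
    n = len(samples)
    h = 1.06 * stdev(samples) * n ** (-1 / 5)

    def density(x):
        total = sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)
        return total / (n * h * math.sqrt(2 * math.pi))

    return density


def make_slr(same_scores, diff_scores, floor=1e-300):
    """Build SLR(s) = f_same(s) / f_diff(s) from reference scores.

    The floor on the denominator is an illustrative guard against
    numerical underflow, not part of the SLR definition.
    """
    f_same = gaussian_kde(same_scores)
    f_diff = gaussian_kde(diff_scores)

    def slr(s):
        return f_same(s) / max(f_diff(s), floor)

    return slr


# Simulated stand-in for stylometric embeddings (hypothetical data):
# each author has a random "style" vector, and each text embedding is
# that style vector plus independent Gaussian noise.
rng = random.Random(0)
DIM, TEXTS_PER_SET, N_PAIRS = 32, 5, 200


def simulate_author():
    return [rng.gauss(0, 1) for _ in range(DIM)]


def simulate_texts(style):
    return [[c + rng.gauss(0, 1) for c in style] for _ in range(TEXTS_PER_SET)]


same_scores, diff_scores = [], []
for _ in range(N_PAIRS):
    author_a, author_b = simulate_author(), simulate_author()
    same_scores.append(score(simulate_texts(author_a), simulate_texts(author_a)))
    diff_scores.append(score(simulate_texts(author_a), simulate_texts(author_b)))

slr = make_slr(same_scores, diff_scores)

# Evaluate the SLR for a new pair of text sets from one simulated author.
questioned = simulate_author()
observed = score(simulate_texts(questioned), simulate_texts(questioned))
print(f"observed score = {observed:.3f}, SLR = {slr(observed):.3g}")
```

An SLR above 1 indicates that the observed score is more typical of same-author pairs in the reference data, and an SLR below 1 indicates the opposite. In practice the reference scores would come from real text collections with known authorship, and both the choice of score and the choice of density estimator can materially change the resulting values.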

Related Resources

Statistics and its Applications in Forensic Science and the Criminal Justice System

This presentation is from the 2024 Joint Statistical Meetings (JSM), Portland, Oregon, August 3-8, 2024.
Algorithmic matching of striated tool marks

Automatic matching algorithms for assessing the similarity between striation marks have been investigated for bullet lands and some tool marks, such as screwdrivers. We are interested in the investigation of…
Silencing the Defense Expert

In the wake of the 2009 NRC and 2016 PCAST Reports, the Firearms and Toolmark (FATM) discipline has come under increasing scrutiny. Validation studies like AMES I, Keisler, AMES II,…
A reproducible pipeline for extracting representative signals from wire cuts

We propose a reproducible pipeline for extracting representative signals from 2D topographic scans of the tips of cut wires. The process fully addresses many potential problems in the quality of…