Official crime data is often plagued by underreporting: many crimes go undiscovered, or are not registered even when reported to the police. To shed light on the dark figure of crime, I harness data on over 9 million accident reports registered by Russian law enforcement authorities. Any accident reported to the police or other agencies should be recorded in writing by the authorities, regardless of whether it constitutes a crime. Since 2013, the Office of the Prosecutor-General of Russia has maintained a computerised system to gather and store such accident reports in selected regions. I build on the rich accident metadata and textual summaries therein to (a) estimate word embeddings of accident reports, (b) average those embeddings at the accident level, (c) discern unsupervised clusters of accidents given their texts, and (d) refine the obtained clusters with domain experts. Steps (a)-(c) of this procedure are cross-validated against the observable accident metadata to ensure accurate semantic representation. Results suggest 40 interpretable clusters, the leading one being domestic accidents (22.5% of reports), followed by thefts and lost items (10.3%). The proposed procedure paves the way for automated near real-time analysis of accident reports before they become crimes.
Keywords: criminology, dark figure of crime, word embeddings, text-as-data
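
As a rough illustration of steps (b) and (c) of the procedure summarised in the abstract, the sketch below averages word vectors per report and groups the resulting report vectors with a plain k-means. It is not the pipeline used in the paper: the two-dimensional vectors in `TOY_EMBEDDINGS`, the tokenised example reports, the helper names `report_vector` and `kmeans`, and the choice of k-means as the clustering algorithm are all assumptions made for the example. In practice, step (a) would estimate the embeddings from the report texts themselves.

```python
from math import dist

# Hypothetical 2-d word vectors standing in for step (a);
# real embeddings would be estimated from the report corpus.
TOY_EMBEDDINGS = {
    "husband": (0.9, 0.1),
    "quarrel": (0.8, 0.2),
    "neighbour": (0.7, 0.3),
    "stolen": (0.1, 0.9),
    "phone": (0.2, 0.8),
    "wallet": (0.1, 0.8),
}


def report_vector(tokens, embeddings):
    """Step (b): average the vectors of the known tokens in one report."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None  # no known tokens, so no representation
    dim = len(vectors[0])
    return tuple(sum(v[i] for v in vectors) / len(vectors) for i in range(dim))


def kmeans(points, k, iterations=20):
    """Step (c), assumed here to be plain k-means.

    Centroids are seeded with the first k points so the run is deterministic.
    """
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


# Hypothetical, already tokenised report summaries.
reports = [
    ["husband", "quarrel"],
    ["stolen", "phone"],
    ["quarrel", "neighbour"],
    ["wallet", "stolen"],
]

vectors = [report_vector(r, TOY_EMBEDDINGS) for r in reports]
labels = kmeans(vectors, k=2)
print(labels)
```

With real data, the cluster assignments could then be compared against the observable report metadata, and the clusters handed to domain experts for refinement as in step (d).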