Ambiguity-Robust Dictionaries using Contextualized Embeddings
Working Paper
Abstract
Keyword dictionaries remain widely used in text analysis for their simplicity and transparency, but their sensitivity to keyword selection and inability to account for context introduce measurement error that attenuates estimates and obscures real relationships. We introduce ambiguity-robust dictionaries, a text-as-data method that integrates dictionary interpretability with contextual word embeddings to produce more precise measures. Our method requires only a few researcher-identified anchor words tied to the target concept. We generate contextualized embeddings for all keyword instances, apply constrained fuzzy clustering to define a target cluster anchored to these words, and weight each instance by its cluster membership. We demonstrate the utility and precision of our method across three applications: UN environmental discourse, ethnic bias in Kenyan judicial politics, and negative moral language in US congressional speech.
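The pipeline summarized in the abstract (embed each keyword instance, cluster with anchors, weight by membership) can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a two-cluster fuzzy c-means in which anchor instances are pinned to the target cluster, and it uses toy 2-D vectors in place of real contextualized embeddings. All names (`anchored_fuzzy_cluster`, `anchor_idx`, `doc_ids`) and the fuzzifier `m = 2` are hypothetical choices made for illustration.

```python
import math

def fuzzy_memberships(points, centers, m=2.0):
    """Standard fuzzy c-means membership update for fixed centers."""
    memberships = []
    for p in points:
        # Clamp distances to avoid division by zero when a point sits on a center.
        d = [max(math.dist(p, c), 1e-12) for c in centers]
        row = [1.0 / sum((d[j] / d[k]) ** (2.0 / (m - 1.0)) for k in range(len(d)))
               for j in range(len(d))]
        memberships.append(row)
    return memberships

def update_centers(points, u, m=2.0):
    """Membership-weighted mean of the points for each cluster."""
    dim = len(points[0])
    centers = []
    for j in range(len(u[0])):
        w = [u[i][j] ** m for i in range(len(points))]
        total = sum(w)
        centers.append([sum(w[i] * points[i][k] for i in range(len(points))) / total
                        for k in range(dim)])
    return centers

def mean(vectors):
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(len(vectors[0]))]

def anchored_fuzzy_cluster(points, anchor_idx, n_iter=50, m=2.0):
    """Assumed constraint: anchors always have membership 1 in cluster 0 (target)."""
    anchors = set(anchor_idx)
    centers = [mean([points[i] for i in anchors]),
               mean([p for i, p in enumerate(points) if i not in anchors])]
    for _ in range(n_iter):
        u = fuzzy_memberships(points, centers, m)
        for i in anchors:
            u[i] = [1.0, 0.0]
        centers = update_centers(points, u, m)
    u = fuzzy_memberships(points, centers, m)
    for i in anchors:
        u[i] = [1.0, 0.0]
    return u

# Toy stand-ins for contextual embeddings of six keyword instances.
points = [(1.0, 0.1), (0.9, 0.0), (0.8, 0.2), (0.1, 1.0), (0.0, 0.9), (0.2, 0.8)]
anchor_idx = [0]                        # instance 0 is a researcher-chosen anchor
doc_ids = ["a", "a", "b", "b", "c", "c"]

u = anchored_fuzzy_cluster(points, anchor_idx)
weights = [row[0] for row in u]         # target-cluster membership per instance

# Document score: sum of membership weights rather than a raw keyword count.
scores = {}
for doc, w in zip(doc_ids, weights):
    scores[doc] = scores.get(doc, 0.0) + w
```

In this sketch a keyword instance used in an off-target sense still appears in the data but contributes little to its document's score, which is the intuition behind weighting by cluster membership instead of counting every match equally.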