Researcher choice and overlooked variables: A "bottom-up" reanalysis of Villarreal (2018)
Dan Villarreal (Pitt Linguistics)
An immense body of sociolinguistic research has demonstrated how social processes (from macrosocial structures to momentary identity performance) are reflected and reproduced in speakers' production and perception of individual linguistic variables. The approach of isolating individual variables for investigation, despite being clearly fruitful, misses two critical facts about language variation in actual use. First is variable co-occurrence: variables do not exist in isolation. In between tokens of the individual variable under investigation are numerous socially meaningful variables (some in structurally related changes) that may mediate or change the social meanings of the studied variable. Second is researcher choice: the process by which we choose what to investigate may lead us to miss meaningful variation. The present study attempts to address these shortcomings using "bottom-up" methods to investigate Californian listeners' attitudes toward a multiplicity of co-occurring vowel variables, comparing these variables' influence on social meanings to previous research on California vowels.
To investigate this question, my co-author, James Grama, and I reanalyzed the results of my earlier matched-guise research on California English perceptions (Villarreal 2018). In that study, 97 Californian listeners rated excerpts from a cartoon-retell task (produced by 12 Californian speakers) on 12 attribute scales. Because stimuli were spontaneously produced (albeit all on the same topic), they all contained slightly different content and thus different vowel variables. Aside from the two vowels acoustically manipulated into guises (TRAP and GOOSE), all other vowel phonemes were left to vary naturally. The original analysis found that, despite substantial variance in attribute ratings overall, guise significantly affected only three of the 12 scales, suggesting that stimuli contained additional socially meaningful variation that guise failed to capture.
To model this variation, we treated each stimulus as a "bag of features", mirroring "bag-of-words" approaches to text corpora (Jurafsky & Martin 2022). The features comprised vowel changes that are well-attested in California English (TRAP, DRESS, KIT, GOOSE, GOAT, LOT/THOUGHT), marginally attested (FOOT, STRUT), and largely unattested (FLEECE, FACE, PRICE). Vowels' F1 and F2 measurements were normalized and translated to discrete features using Atlas of North American English benchmarks (Labov, Ash & Boberg 2006). For each attribute, we used the Boruta algorithm (Kursa, Jankowski & Rudnicki 2010), via the Boruta package in R (Kursa & Rudnicki 2010), to determine which features influenced scale ratings.
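The bag-of-features representation can be illustrated with a minimal sketch in Python. This is not the study's code: the cutoff values, the feature names (e.g., `GOOSE_fronted`), the `HYPOTHETICAL_RULES` table, and the choice to count tokens rather than record presence or absence are all assumptions made for illustration, and the cutoffs are placeholders rather than the Atlas of North American English benchmarks.

```python
from collections import Counter

# Hypothetical cutoffs on normalized formant values (Hz). These are
# placeholders for illustration, NOT the Atlas of North American English
# benchmarks used in the study.
HYPOTHETICAL_RULES = {
    # vowel class: (formant, cutoff, direction, feature name)
    "GOOSE": ("F2", 1500.0, "above", "GOOSE_fronted"),
    "GOAT": ("F2", 1300.0, "above", "GOAT_fronted"),
    "TRAP": ("F2", 1700.0, "below", "TRAP_backed"),
}


def token_feature(token):
    """Return the discrete feature a vowel token realizes, or None."""
    rule = HYPOTHETICAL_RULES.get(token["vowel"])
    if rule is None:
        return None  # vowel class not tracked in this sketch
    formant, cutoff, direction, feature = rule
    value = token[formant]
    crossed = value > cutoff if direction == "above" else value < cutoff
    return feature if crossed else None


def bag_of_features(tokens):
    """Count discrete features in one stimulus, ignoring token order."""
    features = (token_feature(t) for t in tokens)
    return Counter(f for f in features if f is not None)


# One invented stimulus: a list of measured vowel tokens.
stimulus = [
    {"vowel": "GOOSE", "F1": 350.0, "F2": 1800.0},
    {"vowel": "GOOSE", "F1": 360.0, "F2": 1100.0},
    {"vowel": "TRAP", "F1": 800.0, "F2": 1600.0},
    {"vowel": "GOOSE", "F1": 340.0, "F2": 1750.0},
    {"vowel": "FLEECE", "F1": 300.0, "F2": 2400.0},
]
bag = bag_of_features(stimulus)
print(dict(bag))
```

Under these assumptions, each stimulus reduces to an unordered tally of features, and the tallies for all stimuli could serve as the columns of a predictor matrix for a feature-selection step such as Boruta.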
This reanalysis revealed that vowels that are changing (or have changed) in California English did not necessarily impact social meanings more than those with marginal or no evidence of change. The most impactful variable, FOOT, is rarely investigated in California despite being structurally related to well-attested GOOSE and GOAT fronting; FLEECE, which is almost completely unattested as undergoing change, outranked several well-attested California English variables. In addition, despite the historical ordering of Low-Back-Merger Shift (Becker 2019) sound changes (TRAP then DRESS then KIT), TRAP and KIT impacted social meanings more than DRESS.
I argue that these findings reveal a need for greater attention to variable co-occurrence in modeling language variation. While variable co-occurrence is a known problem in sociolinguistics, accounting for it in practice is challenging. I suggest that bottom-up approaches like the one described here can account for variable co-occurrence while mitigating the potential bias introduced by researcher choice.