Phonetic cues to depression: A sociolinguistic perspective
Abstract
Phonetic data are used in several ways outside of the core field of phonetics. This paper offers the perspective of one such field, sociophonetics, towards another, the study of acoustic cues to clinical depression. While sociophonetics is interested in how, when, and why phonetic variables cue information about the world, the study of acoustic cues to depression is focused on how phonetic variables can be used by medical professionals as tools for diagnosis. The latter is only interested in identifying phonetic cues to depression, while the former is interested in how phonetic variation cues anything at all. While the two fields fundamentally differ with respect to ontology, epistemology, and methodology, I argue that there are, nonetheless, possible avenues for future engagement, collaboration, and investigation. Ultimately, both fields need to engage with Crip Linguistics for any successful intervention on the relationship between speech and depression.
1 INTRODUCTION
There have been decades of work on phonetic cues to mental illness and neurological pathology. Work on the acoustic cues to clinical depression is a particularly productive area of study, with around 3000 hits on both Linguistics and Language Behaviour Abstracts and Google Scholar. From a sociophonetic perspective, these studies ask how particular phonetic cues can index mental illnesses: that is, how they co-occur with other independent measures of mental pathology. Indexicality ‘is a process of association, where a linguistic form points to some dimension of its conventional context of use’ and is important to sociophonetic work because ‘it represents one of the primary means through which the connection between a linguistic form and its social interpretation arises’ (Hall-Lew, Moore, & Podesva, 2021, p. 5). However, none of the clinical literature engages with indexicality theory (Silverstein, 1976, 2003), which is otherwise growing in influence in phonetics via sociolinguistic engagement with linguistic anthropology. There has also been little engagement in the clinical work with those factors that have been found to predict phonetic variation in sociolinguistics. While some papers describe the basic demographics of the participants, such as gender and age, none hypothesise about those factors based on known studies of phonetic variation, gender, and age. And while some also elicit speech from varying speech tasks, no studies consider the roles of identity, style, or social meaning. Even the very idea of the linguistic variable (Labov, 2006 [1966]), which is central to sociolinguistics, might be unfamiliar to those conducting clinical studies of speech variability, including contemporary work that depends on large-scale computational modelling.
My anxiety’s been triggered from the beginning of this pandemic, but I felt like I was coping, em, for the first while anyway, em, but tha- As we’ve gone into the winter it's been harder. I just became really-- just feel like the stress of everything has got on top of me, em, over the winter. Just really depressed really, and really anxious about the virus. And we’re now in this weird place where we're somehow supposed to shift from, heh, being anxious about the virus and trying to avoid it at all costs to like, just living with it, and it’s just around, but we're not all vaccinated yet. Em, the cases are higher than they’ve ever been in Scotland.
Our sociophonetic analysis of these data, so far, shows some surprising results: that working-class men were producing a vowel quality that otherwise indexes a middle-class identity, and to a greater extent than the middle-class men (Hall-Lew et al., 2023). We wondered if one of the reasons for this unexpected pattern might be due to the somber and serious style of the recordings; perhaps the phonetic variant in question indexes negative affect, and maybe negative affect was a more relevant indexical meaning at the time than social class identity, especially for people most negatively impacted by COVID-19. This idea follows a growing body of work on phonetic cues to emotion and affect (e.g., D’Onofrio and Eckert, 2021; Podesva, 2021; Wan et al., 2022; Pratt, 2021; Pratt, 2023). But sociophonetic research has never considered the links between affect and mental illness or engaged with the research on acoustic cues to depression. This, even though one of variationism's classic field techniques involves asking a participant to revisit personal trauma in hopes that this will make them less self-conscious of their speech (the ‘Danger of Death’ question; Labov, 2013).
The current paper reviews research from both fields and considers the implications they have for each other, with suggestions for new avenues of investigation. In line with an epistemology that recognises the value of reflexivity, I also draw here on my lived experience as a person living with dysthymia (or ‘persistent depressive disorder’, PDD), and who has benefitted from diagnosis and pharmacological treatments. In other words, I have benefitted from a ‘medical model’ of disability. I am interested in how Critical Disability Studies retains the value of diagnosis and treatment while fighting the harmful and dehumanising effects of pathologisation, and how this balance can inform a future study of the acoustic cues to depression, grounded in Crip Linguistics.
2 HOW TO STUDY DISABILITY
The ‘medical model’ of disability (Byrnes & Muller, 2017), or the ‘normalisation model’ of disability (Zaks, 2023), is the paradigm underlying all research on the acoustic cues to depression. Under this model, depression (for example) is a pathology of the individual, and pathologies are problems to be eliminated so that the individual can be made ‘normal’. The path to elimination is only through medicalised knowledge, and knowledge gained from the individual's lived experience, and an understanding of their sociocultural context, is considered irrelevant to the main analysis (although central to most studies in sociophonetics). The medical model follows from ‘the ableist juxtaposition of disability and deficiency (Ben-Moshe & Magaña, 2014)’, such that a disabled body is framed as deficit and without access to medical knowledge (Fuller Medina, 2024, p. 86). In the literature review that follows, we see how phonetic cues are analysed only as potential tools to support medicalised knowledge and diagnosis.
In contrast to the ‘medical model’ is the ‘social model’, which ‘suggests that it is actually the way our society is set up that creates disability and inaccessibility’ (Bailey & Mobley, 2019, p. 28). What counts as ‘depression’, and the way that it affects a person's life, depends entirely on the time and place, and the barriers to (mental) health created in that time and place. Since the social model is not seen in any studies of acoustic cues to depression, adopting such a paradigm would be one step forward towards synergy with sociophonetic epistemologies. The social context of an individual, and the lived experience of the individual, are fundamental to sociolinguistic work. However, the social model was criticised for not providing a ‘basis for both understanding the origin and nature of distress and providing enabling and empowering assistance to those experiencing such distress’ (Barnes & Shardlow, 1996, p. 130).
Critical Disability Studies (CDS; Meekosha & Shuttleworth, 2009; Ben-Moshe & Magaña, 2014) or the pursuit of Disability Justice (Berne et al., 2015) developed in response to both models. CDS challenges assumptions inherent to both models, such as the framing of disability as impairment or deficiency. A critical approach grants disabled persons agency, rather than representing them as passive providers of symptoms. CDS also offers synergies between the medical and social models, namely by arguing that there is a need for medical diagnosis, treatment, and cure, and that this need depends on the context. CDS recognises the risk that de-pathologisation could result in an unwanted reduction of health care and rights. It was forged by scholars working from queer (McRuer, 2006), trans (Krieg, 2013), and Black feminist perspectives, that is, groups who have systematically been denied healthcare. Meekosha (2011, p. 670) and others working in CDS critique the disability research that ‘ignores the lived experience of disabled people in much of the global South’. CDS strives to be anti-racist and decolonial, recognising that the source and severity of so-called ‘mental illnesses’ (reclaimed as Madness) are fundamentally shaped by legacies such as slavery and coloniality (Barclay, 2017). As Barker and Murray (2010, p. 230) note, ‘the history of colonialism …[is]… a history of mass disablement’. Curry (2017, inter alia), for example, shows how the pathologising practices inherent to the medical model entail gross dehumanisation and violence towards Black male bodies. Within the radical framing of mental illness in Madness studies, Bruce (2017, p. 304) writes, ‘any critical investigation of madness and modernity must confront the matters of blackness and antiblackness’.
CDS and Madness studies argue that there is an ideology of ableism in mental health discourses: ‘sanism (or mental ableism/mentalism) mobilises arguments about ‘mental stability/capacity’ to revoke people's voices and agency’ (Baril, 2020, p. 4), which delegitimises lived experiences with oppressive social structures that may contribute to the onset of mental illness.
…no way of using language should be described as atypical, disordered, or defective. …Crip Linguistics means to critique language and language scholarship through the lens of disability, include disabled perspectives, elevate disabled scholars, center disabled voices in conversations about disabled languaging, dismantle the use of disorder and deficit rhetorics, and finally, welcome disabled languaging as a celebration of the infinite potential of the bodymind.
The fundamental argument that ‘language use cannot be disordered’ (Henner & Robinson, 2023, p. 17) represents a radical departure from the perspective of all studies of acoustic cues to depression. Orienting to such a paradigm would be a huge step forward towards synergy with sociophonetics, which is itself working towards engagement with disability studies (e.g., Wan et al., 2014). A sociophonetic approach to depression would be grounded in indexicality theory, seeing language users as agents who use language to navigate and construct meaning. The research question would be how ‘depression’ is or is not part of that meaning-making process.
3 PHONETIC CUES TO DEPRESSION
Depression is said to characterise many psychological disorders. In one study of speech variation the authors stated that ‘at least 1497 unique profiles’ are logically possible (Cummins, Scherer et al., 2015, citing Østergaard et al., 2011). The various profiles are united by ‘the presence of sad, empty, or irritable mood, accompanied by somatic and cognitive changes that significantly affect the individual's capacity to function’ (American Psychiatric Association, 2013, p. 155). Speakers included in phonetic studies of depression are classified as ‘depressed’ based on a range of criteria, but most have an official medical diagnosis prior to their participation. The medical facility often serves as the location of data collection, and sometimes portions of the diagnosis interview or talk therapy serve as speech data for the phonetic analysis. Participants are diagnosed using one or more evaluative tools for depression, such as the Beck Depression Inventory (Beck et al., 1961, 1988; Ozdas et al., 2004; Tasnim et al., 2023) or the ‘Ham-D score’ (Hamilton, 1960; Helfer et al., 2013; Stassen et al., 1998; Tasnim et al., 2023). From a sociolinguistic perspective, this kind of speech data reflects the stylistic context of a speaker who is positioned as a ‘disordered’ patient (low power), with an interlocutor positioned as a ‘healthy’ evaluator (high power), in a speech activity centred on the quantification of the speaker's pathology. The literature has not yet considered these factors.
A marked laxity of articulatory movements characterised the speech of these patients. With sparing use of pitch and accent, their voice had a dead, listless quality; changes of pitch covered a narrow tonal range and were predominantly step-wise rather than gliding; hovering tones appeared at the end of sentences, where speakers of English usually employ the broadest pitch changes; intonations tended to recur in the same stereotyped patterns; and emphatic accents were either rare or absent entirely.
Their speech gave an impression of being slow and halting; because of the frequent appearance of hesitation pauses interrupting the flow of their phrases. In its resonance, their voice was pharyngeal and sometimes nasal; glottal rasping was present, and this, added to the pharyngeal resonance, gave their speech a ‘throaty’ quality.
The context of this quote is important for understanding the rest of the literature that follows. Newman and Mather's analysis was based on 40 individuals made available by two clinicians at a hospital for the treatment of mental disorders in Middletown, New York. The authors thank these doctors by name but do not mention the participants (and informed consent was not practiced at the time). Mather was a psychiatric nurse. Newman was a linguistic anthropologist who studied under Sapir and had an interest in sound symbolism informed by his work with indigenous American languages (Newman, 1933). Like many of the white American anthropologists of his time, his linguistic work presents ‘a deficit image’ of indigenous languages and cultures, with descriptions that they had ‘lexical deficiency … simplicity, redundancy, … and lack of formal structure’ (Kroskrity, 2020, pp. 76; 77). Kroskrity notes how Newman's descriptions treated data as ‘decontextualised objects rather than as cultural practices’ (2020, p. 78). Although Yokuts and Mono narrative practices and the phonetic productions of white American psychiatric patients might seem unrelated, Newman approached both with the same epistemology and ‘colonial extractivist practices’ (Fuller Medina, 2024, p. 85) typical of clinical research of the time and of the current day. It is the epistemology that characterises nearly all the work on phonetic cues to depression.
Psychomotor changes include agitation (e.g., the inability to sit still, pacing, handwringing; or pulling or rubbing of the skin, clothing, or other objects) or retardation (e.g., slowed speech, thinking, and body movements; increased pauses before answering; speech that is decreased in volume, inflection, amount, or variety of content, or muteness) (Criterion A5). The psychomotor agitation or retardation must be severe enough to be observable by others and not represent merely subjective feelings.
Like Newman and Mather (1938), DSM's description mentions an increased production of pauses and a reduced variation in pitch. In general, prosodic features have been studied more in this literature than segmental and voice quality features. In the interest of space, here I focus on features that appear in both this literature and sociophonetics. I first consider suprasegmental features and then I review the work on segmental features.
3.1 Phonetic cues to depression: Suprasegmental
Alpert et al. (2001, p. 59) say, succinctly: ‘Depressed patients showed less prosody than the normal subjects’. At the time of writing, longer and more frequent pauses are considered the single most robust speech cue to differentiate depression from a ‘normal’ mental state. Szabadi et al. (1976) analyse four women with depression counting from 1 to 10 (framed as ‘automatic speech’, a stylistic concept and speech elicitation task commonly used in this literature), and find that ‘the pause times were significantly elongated while the patients were depressed compared to pause times measured after recovery’ (Szabadi et al., 1976, p. 592). Nilsonne et al. (1987, p. 717) observe that ‘patients who respond well to antidepressant medications are those that show long pauses in their spontaneous speech before the onset of treatment’. Stassen et al. (1998) find that mean pause duration for a sample of 43 patients correlates with the patient's Ham-D (depression; Hamilton, 1960) scores, with both measures taken every 2 days of a 2-week therapy. Many studies have corroborated these findings (Alpert et al., 2001; Mundt et al., 2007, 2012; Tasnim et al., 2023). Trevino et al. (2011) identify pause length correlations with specific questions on the Ham-D scale, including, for example, ‘psychomotor retardation’, but also ‘depressed mood’, ‘hypochondriasis’, and ‘thoughts of suicide’. Pause length is often discussed in terms of speech rate in general, such as in Ellgring and Scherer's (1996, p. 83) observation that ‘an increase in speech rate and a decrease in pause duration are powerful indicators of mood improvement in the course of therapy’. However, it is pause length that matters most to understanding the speech rate result, as phonation rate has been found to show a weaker correlation (e.g., Alpert et al., 2001; Godfrey and Knight, 1984; Szabadi et al., 1976).
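The distinction between pause length and rate measures can be illustrated with a minimal sketch. It is not drawn from any of the studies cited above: the interval format, the function name `timing_measures`, the syllable counts, and all durations are hypothetical, and the terms 'speech rate' and 'articulation rate' are used here in one common sense (syllables per second including and excluding pauses, respectively), which may differ from the definitions in individual papers.

```python
def timing_measures(intervals, n_syllables):
    """Compute simple timing measures from (start, end, label) intervals.

    `intervals` is a hypothetical annotation format in which each stretch
    of the recording is labelled either "speech" or "pause" (times in s).
    """
    pauses = [end - start for start, end, label in intervals if label == "pause"]
    speech_time = sum(end - start for start, end, label in intervals if label == "speech")
    total_time = speech_time + sum(pauses)
    return {
        # Mean duration of silent pauses, in seconds
        "mean_pause": sum(pauses) / len(pauses) if pauses else 0.0,
        # Syllables per second over the whole recording, pauses included
        "speech_rate": n_syllables / total_time,
        # Syllables per second over the speech portions only
        "articulation_rate": n_syllables / speech_time,
    }


# Two invented renditions of the same 24 syllables: identical speech
# portions, but the second has pauses three times as long.
short_pauses = [(0.0, 2.0, "speech"), (2.0, 2.5, "pause"), (2.5, 4.5, "speech"),
                (4.5, 5.0, "pause"), (5.0, 7.0, "speech")]
long_pauses = [(0.0, 2.0, "speech"), (2.0, 3.5, "pause"), (3.5, 5.5, "speech"),
               (5.5, 7.0, "pause"), (7.0, 9.0, "speech")]

a = timing_measures(short_pauses, n_syllables=24)
b = timing_measures(long_pauses, n_syllables=24)
print(a)
print(b)
```

In this invented example the overall speech rate drops in the second rendition even though the articulation rate is unchanged, which is one way that a 'slower speech' result can be driven entirely by pausing.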
Pause durations are thought to be longer for individuals with depression for cognitive reasons: they are ‘signs of word finding difficulty, which result in less fluid or fluent speech’ (Tasnim et al., 2023, p. 6, citing Pope et al., 1970). Researchers working under the medical model view longer pauses as the automatic result of a psycho-physiological state, whereas a sociophonetic approach would consider any of the likely additional correlates for longer pause duration, such as the interactional dynamics between the speaker and their interlocutor. Critical Disability Studies would additionally avoid pathologising slower rates of speech in ways that deny the agency and experience of speakers. In arguing for Crip Linguistics, Henner and Robinson (2023, p. 25) say that ‘time [is] a factor that generates deficit perspectives about language and contributes to the disordering of language through attitudes and expectations’. In contrast, ‘Crip language insists that crip time in languaging is vital for a person's agency, be it through interpretation, translation, delayed speech, repetition, gesture, movements in gaze, and prosodic changes’ (Henner and Robinson, 2023, p. 26). An indexical approach to pause duration would grant individual agency while also allowing for the consideration of cognitive factors.
Pitch variation is another prosodic factor that has been found to correlate with measures of depression. Overall, more severe depressive conditions may correspond to more monotonic speech. This was described by Newman and Mather (1938) and supported by later work by neuroscientists (e.g., Darby et al., 1984; Kuny and Stassen, 1993; Mundt et al., 2007), but not in work by phoneticians (Cannizzaro et al., 2004). Pitch range is typically studied alongside variation in stress and amplitude, again with the expectation that depression will cause a reduction in variance (see, e.g., Darby et al., 1984). Prosodic variation is also sometimes described in conjunction with voice quality variation. Depressed voices have been described by neurologists as ‘harsh’ (Darby et al., 1984; Hargreaves et al., 1965) or by neurologists (Darby et al., 1984; Quatieri and Malyska, 2012) and computational linguists (Hönig et al., 2014; Scherer et al., 2013) as ‘breathy’. But as phoneticians Gobl and Ní Chasaide (2003, p. 192) note, ‘[t]he problem with impressionistic labels such as ‘harsh voice’ is that they can mean different things to different researchers’. Overall, the results are mixed, perhaps in part because of different methods in different subdisciplines, for example, neuroscience versus phonetics.
Sociophonetics is probably more amenable to the study of depression than clinical studies are to Crip Linguistics. Sociophoneticians who work on pitch variation (e.g., Esposito & Gratton, 2022; Levon, 2016; Podesva, 2007; Zimman, 2017) and voice quality variation (e.g., Esling, 1978; Podesva, 2007; Pratt, 2021; Starr, 2015) could integrate a speaker's level of depression as an additional factor of a quantitative analysis, or their lived experience of depression as fundamental to a qualitative analysis. Both pitch variation and voice quality are already part of the sociophonetic canon, although both are understudied relative to research on segmental variation. Pause duration, in contrast, is understudied in sociophonetics, but it is already well-worn territory in interactional sociolinguistics, where pause duration is recognised as a shared interactional achievement between interlocutors, and a feature that co-varies with intonation and syntactic structure (e.g., Wennerstrom & Siegel, 2003). In sociophonetics, Clopper and Smiljanic (2015) and Kendall (2009, 2013, 2023) show how variation in pause duration can index some of the more traditionally studied social factors in studies of US English, such as gender and region. Pratt's (2021) ethnographic analysis in a California high school shows that while speakers who produce more pauses and a slower speech rate produce more creaky voice, with the indexical qualities of all three features aligning with a ‘chill’ affect, longer pause durations correlate with less creaky voice than shorter pause durations. This demonstrates the social semiotic potential for investigations of pause length, and the need for studies of depression to understand the feature's indexical complexity: cognitive or neurophysiological explanations are only a subset of the potential reasons that one speaker may produce longer pauses than another.
3.2 Phonetic cues to depression: Segmental
Vowels and consonants have been examined in a small number of studies on phonetic cues to depression. Flint et al. (1992) examine variation in voice onset time (VOT) in (presumably Canadian) English and find that VOT was shorter among patients with depression (and Parkinson's Disease) than speakers in a control group. Trevino et al. (2011) examine the length of both vowels and consonants as produced by US English speakers from various locations across the country (Mundt et al., 2012), comparing segment length with speakers' Ham-D scores for specific depressive traits (Hamilton, 1960). Based on a large computational model, they find that some segments tend to be longer when speakers have higher scores on the ‘psychomotor retardation’ (slow thoughts and movement) trait, which they take to be reflective of general psychomotor slowing. However, this correlation is only observed for open or open-mid vowels, coronal fricatives, velar stops, /t/, /w/, and /r/. They argue that the global speech rate differences found in previous studies may reflect ‘distinct phone-specific relationships’ (Trevino et al., 2011, p. 2), but they provide no analysis for why these particular segments might be expected to co-vary with depression. These results also potentially conflict with the previously mentioned finding that it is pause duration, and not articulation rate, that correlates more often with depression (e.g., Cannizzaro et al., 2004).
Several studies test for correlations between depression and vowel quality, again with varying results. Vowel quality is one of the staples of sociophonetic research, from Labov's (1963) foundational study of diphthong height on Martha's Vineyard, to contemporary work on indexicality and affect (D’Onofrio and Eckert, 2021; Podesva, 2021; Wan et al., 2022; Pratt, 2023). This feature will therefore be considered here in detail, although much of the research is inconclusive.
The hypothesis about the effect of depression on vowel quality is related to the more general hypothesis that depression causes more lax articulations (Newman & Mather, 1938). For vowels, it has been proposed that higher scores of psychomotor retardation result in ‘less articulatory effort’, and therefore vowels closer to ‘resting position’, that is, more centralised, schwa-like vowels (Tolkmitt et al., 1982, pp. 221, 220). Stasak et al. (2019) frame this as hypoarticulation (Lindblom, 1990). The earliest study, Tolkmitt et al. (1982, p. 221), found significant differences in the formants of /ʌ/ and /ei/ before and after depression treatment, concluding that ‘as a result of therapy the patients exert greater articulatory efforts’. However, they found no similar differences for /i/, /e/, or /æ/, which is odd given that the corner vowels would be expected to be most affected by a general articulatory laxing.
A paper often cited as evidence for vowel quality as a cue to depression is Flint et al. (1992, p. 386), who measure the ‘second formant transition rate’ of the English /ai/ vowel, specifically in the two lexical items light and dial, as produced by 30 patients with Major Depressive Disorder (MDD) in read speech. The English /ai/ vowel is known to vary systematically across regional and ethnoracial varieties of North American English (e.g., Labov, 1963), particularly when followed by a voiceless consonant (as in light) versus other environments (Labov, Ash, & Boberg, 2006; Thomas, 1989), and particularly in Canada (where Canadian Raising might be relevant; Chambers, 1973). Flint et al.'s (1992) study took place in Toronto, but no mention is made of speaker accent. Syllable-final /l/ (as in dial) is also known to affect the formants of the preceding vowel (Labov, Ash, & Boberg, 2006; Veatch, 1991), but the authors do not mention this feature. Instead, Flint et al. (1992) hypothesise that depressed speech will result in a more monophthongal quality (a smaller ‘second formant transition rate’) than non-depressed speech, again for psychomotoric reasons. Their results showed that depressed speakers did produce the word light in a more monophthongal way than did the control group, but there was no difference for the word dial.
France et al. (2000) compare global differences in the first three formants, and their bandwidths, in two different studies with different genders and depression characteristics. They find, for example, that men at high risk for suicide have lower and backer vowels (larger F1 and F2 values) than the control group, with men with MDD falling in between. However, the data were ‘[a]pproximately 2 min and 30 s of unedited speech…randomly extracted from either a therapy session or a post-session dictation’ (France et al., 2000, p. 832), and no attention was given to balancing for vowel type or any other linguistic factor. Moore et al. (2008, p. 103) show, unsurprisingly, that formant measures are ‘highly correlated to the sentence content’, and so they ‘limit the effectiveness of the vocal tract features under investigation’. Stasak et al. (2019, p. 151) later ‘show performance advantages to utilising … phoneme-specific parameters’, a finding that will be of no surprise to (socio)phonetic researchers.
Helfer et al. (2013, p. 2127), using the same data as in Trevino et al. (2011; that is, Mundt et al., 2012), predicted ‘(1) modifications of the average formant space (e.g., slurring may compress this space) and (2) modifications of the dynamics of the formants (e.g., agitation may introduce an erratic behaviour in a formant track and monotony may reduce the rate of frequency transitions)’. They extracted nine formant features from spontaneous speech (Ham-D interviews; Hamilton, 1960) and from sustained productions of /i/, /a/, /u/, and /æ/. However, the main purpose of the paper is to compare two different classifier models, and they conclude that one performs better on spontaneous speech, and the other on sustained vowels, but they report no findings about what happens to the individual vowels or the overall vowel space, so it is difficult to interpret the findings from a sociophonetic perspective. Hönig et al. (2014), another large computational model including vowel formant data, found no significant results indicating centralisation of vowels, or any difference in the overall Vowel Space Area (VSA), between depressed and non-depressed speakers. This study was based on a dataset of ‘1122 recordings from 219 German subjects’ and the computation of ‘3805 acoustic features’ (Hönig et al., 2014, pp. 1248, 1249).
Scherer et al. (2015, p. 4791), in contrast, find a significant difference in VSA, calculated as the F1/F2 area between /i/, /a/, and /u/, between depressed and non-depressed speakers. The result is based on a corpus of ‘semi-structured clinical interviews’ with a ‘virtual human interviewer’ (p. 4791) recorded with ‘veterans of the U.S. armed forces and from the general public’ in the ‘Greater Los Angeles metropolitan area’ (Gratch et al., 2014, p. 3123). Comparable results from a corpus of read speech showed a similar but non-significant trend. The authors suggest that ‘reading proficiency might be a confounding factor’, but they do not consider the possibility that it might matter that the read speech is produced by completely different speakers and is in German. Although this is the first robust finding that vowel space hypoarticulation and depression may be correlated, no mention is made of the linguistic varieties spoken, much less their potential sociophonetic confounds (e.g., variable /u/ production in Los Angeles; Fought, 1999). Cummins et al. (2017, pp. 212–213) used the same spontaneous speech corpus (DAIC-WOZ; Gratch et al., 2014) and added a comparison of binary gender, finding that ‘depression may manifest differently in formant measures for male and females’, namely for F1. Their data show that the ‘depressed male’ sample produced a higher or more closed VSA than the ‘non-depressed male’ sample, in line with a hypoarticulation analysis, but that the opposite was true for the female samples. Vlasenko et al. (2017), using the same data and analysis, further show that the depressed and non-depressed males are distinguished by F1 variation in 9 vowels, including /i/, /a/, and /u/, but in F2 only for one vowel (/ʌ/). The females are distinguished by F1 variation in 6 vowels, including /i/, and F2 variation in 4 vowels, including /a/.
It is notable that Scherer et al.’s (2015) presentation of results from this corpus showed two male speakers as exemplars of the VSA effect, whereas the results from the females suggest that VSA reduction is not a universal feature. This is expected from a sociophonetic perspective (e.g., Pratt, 2023), but no explanation is offered for the gendered difference among the Los Angeles English speakers analysed here.
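For readers unfamiliar with the measure, a Vowel Space Area of the kind described above (the F1/F2 area between /i/, /a/, and /u/) can be sketched as the area of a triangle in formant space. The following is a minimal illustration only, not the implementation used in any of the cited studies: the function name is hypothetical, the formant values are invented, and real analyses typically involve choices not shown here (e.g., how formants are extracted, averaged, and normalised).

```python
def vowel_space_area(corners):
    """Area of the polygon whose vertices are (F1, F2) points in Hz.

    Uses the shoelace formula; with three corner vowels this is the
    area of a triangle, in Hz squared.
    """
    n = len(corners)
    twice_area = 0.0
    for k in range(n):
        f1_a, f2_a = corners[k]
        f1_b, f2_b = corners[(k + 1) % n]
        twice_area += f1_a * f2_b - f1_b * f2_a
    return abs(twice_area) / 2.0


# Invented mean (F1, F2) values in Hz for two hypothetical speakers.
# The second speaker's corner vowels sit closer to the centre of the space.
peripheral = {"i": (300.0, 2300.0), "a": (750.0, 1300.0), "u": (350.0, 900.0)}
centralised = {"i": (350.0, 2100.0), "a": (650.0, 1300.0), "u": (400.0, 1000.0)}

area_peripheral = vowel_space_area([peripheral[v] for v in ("i", "a", "u")])
area_centralised = vowel_space_area([centralised[v] for v in ("i", "a", "u")])
print(area_peripheral, area_centralised)
```

Because the measure reduces each speaker to three points, anything that shifts a single corner vowel, such as a regional or stylistic pattern affecting /u/, changes the area regardless of why the shift occurred.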
Miley (2020) examined nine depressed and nine healthy adults in northeastern England, collecting data on dialect, gender, age, ethnicity, and education, but unfortunately not analysing these factors. Participants produced /i/, /æ/, /ɒ/, /ɔ/, /o/, and /u/ based on readings of embedded phonetically balanced sentences. In measuring the individual vowel formants and overall VSA, Miley found that ‘…low vowels /æ/ and /ɒ/ are disproportionately affected by depressive status. Global VSA measures were not significantly correlated with depression’ (2020, p. 10). Specifically, F1 was smaller (i.e., a more closed vowel) for depressed productions of /æ/ and /ɒ/ than healthy /æ/ and /ɒ/.
The mixed results around vowel quality and depression detection are not surprising given the vast range of social, stylistic, and contextual meanings that we know can be indexed by segmental variation. While prosodic variation is also indexical, it is arguably less salient than segmental variation as a marker for features such as region, class, and ethnicity. These strong indexical relationships mean that the role of context, broadly construed, will always have to be considered before vowel quality or other segmental features can be tested as cues to mental illness. This again suggests that a synergy between fields lies more in a general investigation of the indexical qualities of phonetic cues, with attention to meanings like ‘depression’ emerging, or not, through a more holistic understanding of sociocultural context.
4 EXPLAINING THE CAUSES OF PHONETIC VARIATION
In the mid-20th century, phoneticians often discussed social and psychological indicators in the speech signal in tandem. Laver and Trudgill (1979, p. 3) refer to three types of information indexed by variation in speech: ‘social markers, physical markers and psychological markers’. The concept of indexicality (Peirce, 1868; Silverstein, 1976) and the attention to both ‘social’ and ‘physical’ attributes (and the blurring between the two, e.g., age and gender) have long been of interest in the field of sociophonetics, while attention to ‘psychological’ attributes (e.g., cognitive processing styles7) is more of a recent focus. Earlier studies in sociophonetics (e.g., Esling, 1978; Labov, 1963) tended to pursue empirical evidence for correlations between acoustic features and social information about speakers, a tradition that continues to this day (e.g., Kendall, 2023). Other subfields of sociophonetics focus specifically on the indexical relationship itself, a difference between subfields that Eckert (2012) describes in terms of ‘waves’ of research (see also Hall-Lew, Moore, & Podesva, 2021). For ‘third wave’ researchers, the signalling potential of the phonetic features means that all possible indexicalities, including ‘psychological’ ones like depression, are the focus of study (D’Onofrio & Eckert, 2021; Podesva, 2021; Pratt, 2021, 2023; Wan et al., 2022). ‘The meanings of [linguistic] variables are not precise or fixed but rather constitute a field of potential meanings—an indexical field, or constellation of ideologically related meanings, any one of which can be activated in the situated use of the variable’ (Eckert, 2008, p. 453). From this perspective, it is not possible to say definitively that this or that acoustic cue will differentiate depressed from non-depressed speakers.
Rather, such an indexical relationship depends on the social and linguistic context, because when we quantify measures of the speech signal, what we are observing is the output of social practice, where the speaker has agency.
In contrast, work on depression detection in speech focuses on the goal of finding an optimal, best-fit correlational model between the features of the acoustic signal and depression: clinical utility is the goal, rather than semiotic analysis. Depression detection research does not engage with insights even from the most conservative, ‘first wave’ of sociolinguistics (considering social factors, but with little indexical analysis; Eckert, 2012). Social information about the speaker or the context has never been subject to analysis and social information has rarely been included, despite being ‘important factors in explaining the variability in depression prevalence rates’ (Akhtar-Danesh & Landeen, 2007, p. 1). Newman and Mather (1938) provide biographical details of each of four case studies (e.g., ‘a 30-year-old American survey engineer’; 918; ‘a 60-year-old German housewife’, 924), reminiscent of the style of dialect surveys of the time, but these descriptions do not impact the analysis. Teasdale, Fogarty, and Williams (1980) note ‘heterogeneity of the state of depression’, and that its acoustic correlates are ‘likely to be affected by a number of factors’ (277), including the nature of the speech elicitation task (something that is at the core of sociophonetic research; see Hall-Lew and Boyd, 2020), but they do not investigate the role of the speech task, empirically. Kuny and Stassen (1993) do consider speaker education level, along with age and gender (all operationalised in a binary way) and find no significant correlations with the features they go on to correlate with depression. Lee et al. (2021) found that prosodic features distinguished older Korean females with MDD from older Korean females without that diagnosis, but that no difference was found between males; however, the only explanation the authors entertain is one based on hormonal and physiological causes.
The ontological and epistemological difference between depression detection research and sociophonetics is most clearly seen in the fields' orientations to the notion of objectivity. Contemporary sociophonetics generally takes a critical approach to objectivity, both in general (that objectivity is an ideological concept) and in specific (that every acoustic measure is influenced by the subjectivity of the researcher and the choices they make). In contrast, studies on the acoustic cues to depression all demonstrate a positivist orientation, both in general (that objectivity is possible and desired) and in specific (that speech signals can be measured objectively). The research goal in clinical work is clearly the pursuit of an ‘[o]bjective characterisation of the voice’ (Moore et al., 2008, p. 96). Speech is framed as an observable human behaviour that is less vulnerable to subjective bias (either the patient's or the evaluator's) than other quantifications of level of depression. Depression is notoriously difficult to diagnose, for example, being difficult to distinguish from Parkinson's Disease (Flint et al., 1992), or difficult to detect if a patient is intentionally repressing symptoms (Lee et al., 2021). Speech cues are therefore framed as providing a direct window into (1) a diagnosis of depression, and (2) the extent of an individual's level of depression.
Alignment with the medical model can be seen in the field's understanding of speech variation as resulting from cognitive or psychological factors, rather than social factors. Cummins et al. (2015) explain that ‘[c]ognitive impairments slow speech planning and preparation of neuromuscular commands needed to produce speech, whilst changes in affective state, fatigue and psychomotor retardation affect muscle tension, creating articulatory errors and altering vocal tract properties’. Cannizzaro et al. (2004, p. 31) say, ‘by objectively measuring the speech acoustic signal, we are quantifying the observed output of the neurological and physiological subsystems as they coordinate to create speech’. This view of speech variation is at odds with theories of indexicality as used in sociophonetics (see Hall-Lew, Moore, & Podesva, 2021). Sociophonetic explanations for a slow rate of speech might include regional dialect (Jacewicz et al., 2010), or the construction of a locally salient social style (Pratt, 2021). To rephrase Cannizzaro et al. (2004), sociophoneticians might say that they are quantifying the observed output of speakers' coordination of speech production and social signalling. This coordination necessarily draws on neurological and physiological subsystems, but these are also inherently ‘social’; social information is represented cognitively, and therefore neurologically, and affects physiological factors such as articulatory position (e.g., jaw opening; Pratt & D’Onofrio, 2017). The clinical understanding of ‘affective state’ is also different from a sociophonetic understanding. In social theory, ‘affect’ is understood as a social practice, such that ‘[a]ffective activity is an ongoing flow…of forming and changing body-scapes, qualia (subjective states), and actions constantly shifting in response to the changing context’ (Wetherell, 2015, p. 147).
From this perspective, social factors complicate the goals of an objectivist ontology. For example, if the effects of depression on speech are purely mechanical, then there would be no expectation for cross-cultural or cross-linguistic differences in their manifestation. In studies of phonetic markers of depression, the language under analysis is often not even mentioned, especially when that language appears to be US English (e.g., Pope et al., 1970; Dumpala et al., 2023; see Bender, 2011 on this issue more generally), but not always (e.g., Nilsonne, 1987, which appears to use Swedish). Studies on non-European languages do mention the language being studied, but this is never a factor in the analysis (e.g., Hebrew; Wasserzug et al., 2023). Both Lee et al. (2021, on Korean) and Taguchi et al. (2018, on Japanese) mention culturally specific methodological tools, although the analysis makes no mention of cultural or linguistic matters. The assumption across the literature is that there is no reason to expect that culture or language influences how depression affects or does not affect speech, or that depression manifests in different ways within a nation, language group, or culture. In the US context, it is ‘as if there are no critical exigencies involved in being people of colour that might necessitate these individuals understanding and negotiating disability in a different way from their white counterparts’ (Bell, 2006, p. 282, cited in Curry, 2017, p. 322). In contrast, sociolinguistics (in theory) always assumes that the potential role of each of these factors is an empirical question.
Cummins, Scherer, et al. (2015, p. 34) say that ‘acoustic variability is diluted by linguistic information and speaker characteristics’, and that linguistic and social considerations introduce ‘unwanted forms of variability, herein referred to as nuisance factors’ (Cummins, Scherer, et al., 2015, p. 38). Other ‘nuisance factors’ include features such as speaker height and weight, speech disorders, and intoxication. Rather than considering the range of likely phonetic indexes in a given context, and then determining cases where it is depression that is being indexed, the goal of the clinical literature is to mitigate the effect of such factors by continually fine-tuning statistical and acoustic modelling methods.
The outputs of this field are many. An example is Lee et al. (2021, p. 17), who consider in their model both the AVEC 2013 baseline feature set, including ‘2268 features that include 32 energy- and spectral-related low-level descriptors (LLD) and 6 voicing-related fundamental LLD, delta coefficients of each of these LLD, and 10 voiced/unvoiced durational features…’ and a second feature set which includes ‘62 features including a compressed set of 25 LLD (frequency-related, energy/amplitude-related, and spectral parameters) and percentile-related functionals’. The outcome, currently, is less about ‘the second sub-challenge’, that is, how any specific feature indexes depression; rather, the goal seems to be to first build the best classification model possible, for purposes of prediction and diagnosis. The process necessitates a ‘neurobiologically essentialist model of mental illness with a model of language drawn from speech signal processing’ (Semel, 2022, p. 271). Speech is understood as an object which can be measured as an objective cue to cognitive or physiological states.
The challenge has two goals logically organised as sub-challenges: the first is to predict the continuous values of the affective dimensions valence and arousal at each moment in time. The second sub-challenge is to predict the value of a single depression indicator for each recording in the dataset.
While technological developments in computation and phonetic analysis are a crucial aspect of this methodological shift, those developments were likewise available to sociophonetics researchers, and yet the latter did not develop similarly. Early variationist approaches took a relatively more objective stance towards speech variation, seeking large datasets and the refinement of quantitative modelling (e.g., Labov, 2006 [1966]; Trudgill, 1972; Sankoff and Labov, 1979), while current research into the social semiotics of phonetic variation (Eckert, 2012; Silverstein, 1976) typically draws on mixed-method analyses of a handful of phonetic features coded from a small, ethnographically defined speaker set (see Hall-Lew, Moore, & Podesva, 2021). The reason is that the range of potential ‘meanings’ that any phonetic variant can index—be they about the speaker's demographics, identity, persona, style, stance, mood, mental health, or something else—is vast, and inherently unspecified until realised in a particular sociolinguistic context. A long pause, for example, might index depression, but it might equally index membership in a particular Community of Practice (see Pratt, 2021). From this perspective, the pursuit of an ideal classification model depends on the sociocultural context of the data feeding the model as well as the sociocultural context around the model's implementation.
5 TOWARDS A SOCIOPHONETICS OF DEPRESSION
Depression detection research from the 1980s and earlier often debated whether between-speaker comparisons should be pursued at all. Hargreaves et al. (1965, p. 219) write: ‘We did not expect the same kind of depressed voice quality in every subject. Therefore it was appropriate to study each patient's voice to discover his own particular pattern of change’. Similarly, Teasdale et al. (1980, p. 277) write, ‘Speech rate is likely to be affected by a number of factors in addition to a person's state of general behavioural activation-deactivation. It is for this reason probably more appropriate to use [pause time] as a measure of change within a subject than to use it to make comparisons between subjects’. Tolkmitt et al. (1982, p. 210) argue that ‘the diagnostic validity’ of comparison between subjects is questionable, and that comparisons are ‘more appropriately studied in patients during treatment from ‘abnormal’ to ‘normal’ behaviour, using each patient with his or her idiosyncratic speech patterns as his or her own control’. Finally, Nilsonne et al. (1987, p. 727) observe that, ‘[t]he variation in speech behaviour within a nondepressed population is wide enough to encompass the speech behaviour of some depressed patients’, and therefore that, ‘the usefulness of the F0 changeability measures will lie in tracking within-patient changes’. However, the development of the field towards large computational models has also moved the field away from these observations and warnings, and most studies are now focused on interspeaker modelling.
The focus on a singular index of phonetic variation—depression—is extremely limiting from the perspective of indexicality theory. Within-speaker studies at least offer the promise of controlling for many of those possible indexed meanings related to durative aspects of an individual's identity. I therefore argue that the use of acoustic cues to track changes in mental health over time, within the same individual, is the most promising area of clinical work for engaging with sociophonetics. I also think that it is the method most amenable to a social justice approach to disability, where participants and researchers could collaborate and co-produce knowledge. Ann Cvetkovich asks, ‘what happens if we don't think about depression as bio-medical but instead think about it as the cumulative result of histories of racism, capitalism, colonialism, sexism, ableism, and more that we carry with us in our bodies, minds, and spirits’ (Cvetkovich and Wilkerson, 2016, p. 499). Analysing variation within the individual means studying how a person navigates the world from moment to moment, and how different aspects of the world are made relevant at any given time. In contrast, analysing variation between individuals inevitably reduces a person to the moment of data collection, which could be a profound misrepresentation of that person, even when working within a non-pathologising, social model of disability.
Whilst exploring language, the influence of socio-cultural background, social status, comorbidity and causative factors of depression is likely to impact the accuracy of detecting and debilitating the conditions of depression. Generalisations across studies on linguistic features of depression may be sustainable once the socio-demographic factors are included in the analysis. … Although language mirrors and reflects the mental and emotional state of depressed individuals, it is relevant to comprehend the causal and contextual aspects of depression i.e., economic, relational, medical, environmental, and social stigma for a more prominent interpretation and generalisation of sociolinguistic markers.
The closest the field of depression detection has come to considering sociolinguistic factors is in the work of Klaus R. Scherer (e.g., Ellgring & Scherer, 1996; Tolkmitt et al., 1982). Scherer, who was influenced by the sociologist Erving Goffman (Ladd, p.c.), argues that ‘social presentation’ is a variable that should be ‘taken into account’ in speech depression research (Ellgring & Scherer, 1996, p. 87). Ellgring and Scherer (1996, p. 85) also note that ‘situational context, type of content, and type of speech sample, may also contribute’. Their main argument is that ‘[t]hree types of hypothesise mechanisms’ account for depression being detectable in speech: ‘persistent psychophysiological changes, cognitive impairment, and socio-emotional change’ (85), and that the latter is due to the ‘social interaction consequences of a depressed state’ (87). Among the ‘large number of factors’ they put forward for future research, they include ‘position in social networks, as well as interaction and communication strategies’ (105). To my knowledge, there has not been any follow-up work taking up this call, but the computational power of new approaches makes it more feasible than ever.
From the other direction,8 the vast literature on acoustic cues to mental illness suggests that sociophonetic studies should at least consider collecting data on speakers' mental health. While this might seem like an ethically challenging prospect, the alternative means not only that our phonetic models might be inaccurate but that we are potentially ignoring an important dimension of lived experience. How many sociolinguistic community studies in the past have included participants with mental illnesses that not only influenced their speech patterns but also influenced their social identities and the ways they moved through their social world?
The possibilities for synergistic research are many,9 but in closing I propose one that draws on the fundamental similarity between fields: attention to the phonetic indexicality of speech (Laver and Trudgill, 1979). Specifically, the point of contact lies in an interest in phonetic cues to negative affect. Two example papers on this topic in sociophonetics are D’Onofrio and Eckert (2021), a perception study on US English, and Wan, Hall-Lew, and Cowie (2022), a production study on Taiwanese Mandarin. Both studies argue that negative affect is indexed by a relatively backed vocalic variant (indicated by a lower, normalised second formant), /a/ in the former case and /i/ in the latter case. Both build on previous observations of this correlation between backing and negative emotional valence: Eckert (2008), for example, in US English, and Erickson et al. (2016) in Mandarin. These and other sociophonetic studies of affect all draw on the concept of embodiment, where the semiotics of language and the semiotics of the body are intertwined (Bucholtz & Hall, 2016; Esposito & Gratton, 2022; Pratt, 2021). It would be interesting to see how sociophonetics might engage with the over-emphasis in the clinical literature on physiological explanations for mood-based variation in the speech signal. A sociophonetics of depression might, for example, posit new and holistic understandings of features like ‘greater muscle tension and respiratory rate’ and overall ‘psychomotor retardation’ (Stasak et al., 2019), and the ways in which these embodied positions impact phonetic variation.
6 CONCLUSION
Both sociophonetics and the study of acoustic cues to depression draw on phonetic variation as data. Both have explored the signalling potential of prosodic and segmental features. Both fields are interested in how acoustic cues can signal affective meanings. However, the way they view a concept like ‘affect’ is fundamentally different, and this reflects a more general ontological and epistemological divide that is a challenge to reconcile. In this paper I have briefly reviewed the vast clinical literature and considered how it might be expanded and improved with reference to the social theories that drive contemporary sociophonetics, such as indexicality, queer and trans theory, anti-racism, and decoloniality, and how both fields can benefit from more engagement with Critical Disability Studies and Crip Linguistics.
ACKNOWLEDGEMENTS
My unending thanks go to Teresa Pratt, whose deep engagement with this manuscript improved it beyond measure. Many thanks also to the journal editors Gabriela Alfaraz and Rebecca Roeder, and one anonymous reviewer. Shermaine Ang, Thomas Bak, and Gabrielle Hodge gave additional feedback that further strengthened the paper. This article is the result of conversations with students enrolled in the 2023 seminar, taught as part of Guided Research in Linguistics and English Language, called ‘Sociophonetics and phonetic cues to mental illness.’ Many of the ideas presented here were dialogically produced with them: Shermaine Ang, Charlotte Baskerville, Alexandra Burgess, Carrie Chow, Shutong Han, Yating He, Jiayi Liang, and Tia Sadlon. I am also grateful to the audiences of the University of Edinburgh's Language Variation and Change Research Group, the University of Lancaster's Phonetics Lab, and the University of Kent's Language and Linguistics Network for feedback on earlier versions. All errors are my own.
ENDNOTES
- 1 For the purposes of this paper I use ‘clinical’ to refer to work on psychiatric conditions. This differs from work in clinical phonetics/linguistics, which largely concentrates on pathologies specific to speech or language (see Müller and Ball, 2013). The two areas are certainly connected, but that connection is beyond the scope of this paper.
- 2 Due to space constraints, I was not able to review all the work on acoustic cues to emotion, more generally, even though this is obviously also a highly relevant subfield. For more information, see, for example, Scherer et al. (1972), Gobl and Ní Chasaide (2003), and Ververidis and Kotropoulos (2006), among many others.
- 3 The study of acoustic cues to depression could also benefit from a deeper engagement with the psychological literature on types of depression, especially since the expected phonetic effects of, for example, agitated depression with anxiety (Koukopoulos & Koukopoulos, 1999) are likely to differ. There's also little engagement with the crucial distinction between depression and apathy-without-depression (Levy et al., 1998).
- 4 I also have not considered here the literature on how depression medication affects phonetic variation, but see, for example, Ellgring and Scherer (1996), Alpert et al. (2001).
- 5 This claim was also made by Moses (1954), but his work is not considered a reliable source to cite here. Paul J. Moses was a speech pathologist invested in the idea that voice can cue personality, but his work was criticised as making overblown claims based on scant empirical evidence. Of course, sociophonetic work explores the ways in which linguistic cues do cue personality traits, but the epistemological assumptions differ.
- 6 From the source paper of the AVEC 2013 corpus, Valstar et al. (2013): ‘Read speech: excerpts of the novel “Homo Faber” by Max Frisch and the fable “Die Sonne und der Wind” (The North Wind and the Sun)’.
- 7 See, for example, Yu (2013), Yu & Zellou (2019). But the sociolinguistic orientation to theories of embodiment means that all ‘psychological’ attributes are ultimately also realised and interpreted in the world as things that are achieved through social practice (hence why ‘psychological’ is in quotation marks), and so from a sociophonetic perspective these categories are all necessarily blurred or non-existent. See Hall-Lew, Honeybone, and Kirby (2021) for a sociolinguistic critique of the notion of ‘individual differences’.
- 8 There's another insight from the depression detection literature that's relevant to sociophonetics, but it doesn't fit anywhere in this paper. Many studies of depression control for the time of day when data is collected. Kuny and Stassen (1993: 291), for example, say that ‘variations were excluded by always carrying out recordings at a fixed time in the morning’. Sociophoneticians do not consider ‘circadian variations’, but perhaps we should! Hoffman et al. (1985: 538), for example, found that ‘diurnal variation in SPT [speech pause time] was found in control subjects, but not in depressed patients’. Nothing is stated about the reasons, or about the speakers' experiences of this aspect of variation.
- 9 Another area of synergy concerns the role of speech task. Although speech elicitation methods differ between depression research (e.g., ‘automatic speech’) and sociophonetics (e.g., ethnographic interviews), they are sometimes exactly the same (e.g., ‘the Rainbow Passage’, Fairbanks, 1940). How might the stylistic expectations in sociophonetics influence how the tasks are analysed from a clinical perspective? We could test, for example, Alpert et al.’s (2001: 67) claim that ‘[t]he speech pause-time differences [in depression] seem less a reflection of the motor requirements of speech than with the subject's motivation and interpretation of task demands’ and that ‘[i]t appears that depressed subjects set lower standards for their performance of the task’.
Biography
Lauren Hall-Lew is Professor and Personal Chair of Sociolinguistics, in Linguistics and English Language, in the School of Philosophy, Psychology, and Language Sciences, at the University of Edinburgh. She holds a B.A. in Linguistics from the University of Arizona and a Ph.D. in Linguistics from Stanford University. Her research focuses on phonetic variation, social meaning, and sound change.