Adaptation to Social-Linguistic Associations in Audio-Visual Speech: Q&A with Dr. Molly Babel

August 25, 2022

In the study, 'Adaptation to Social-Linguistic Associations in Audio-Visual Speech', the research team examined how listeners adapt to speech in noise from four speakers who are representative of selected accent-ethnicity associations in the local speech community: an Asian English-L1 speaker, a white English-L1 speaker, an Asian English-L2 speaker, and a white English-L2 speaker.

In this Q&A with Language Sciences, Dr. Molly Babel discusses how information can be extracted from moving faces, how spoken language is an element of social categorization, and the many layers of linguistic stereotyping.

Can you explain how spoken language can be an element of social categorization?

Spoken language is an incredibly rich social signal. There are two paths that make it so rich, one that makes us humans special and one that grounds us a bit more amongst our fellow animals. Starting with the more humble path we share with other critters: when we make any vocalization (just say “aaaaaaa”!), we are providing a bio-social signal that allows a perceiver to make (imperfect) inferences about our size, age, and health because the quality of our vocalizations is a function of the size and shape of our vocal folds and vocal tract.

The second path foregrounds language. An individual’s speech patterns can often be associated with their social communities and identities. At the most macro-level, me saying anything will label me as a speaker of North American English. If what I say contains particular vowels, many listeners will quickly be able to label me as not being originally from Vancouver, and those with the requisite experience might be able to pinpoint my region of origin more directly. The exact way in which I speak my variety of English will also carry associations that will allow listeners to categorize my self-identified gender, ethnicity, age, and socio-economic class. These social categorizations are imperfect, by which I mean listeners are not wholly accurate at identifying these social categories from a speech sample. But, often, listeners' ability to categorize a speaker in a way that aligns with that speaker’s self-identified categories is well above chance. A listener’s accuracy depends on the consistency of the speech and social category associations in the real world and on the listener having the linguistic experiences to form these associations.

What phonetic information might someone extract from viewing a moving face, as compared to viewing a static face?

When presented with a static face, all one has at their disposal are expectations and stereotypes about what a particular talker is expected to sound like. A moving face, however, provides indirect information about speech patterns. The movements and deformations in a talking face are coordinated with articulatory actions going on under the hood (i.e., at the vocal folds and in the vocal tract) and their acoustic consequences. A perceiver no longer needs to rely on expectations, but can actively use the facial movements as they parse the acoustic-auditory signal. One of the most prominent pieces of information in a moving face is eyebrow movement, which is correlated with the pitch of our voices. Our late colleague Eric Vatikiotis-Bateson produced some of the most seminal work in this area. Thank you for giving me the opportunity to give his work a posthumous shout-out.

What are some characteristics of speech that people may use to develop expectations about a speaker?

Let’s start with the most ironclad of expectations. We expect tall individuals to have lower fundamental and resonant frequencies and short individuals to have higher fundamental and resonant frequencies; we have ample experiences that support these expectations. Listeners develop expectations based around physiology, such as size, age, and sex, and these can be rather solid generalizations. (Imagine a 2-year-old. You know that she will not sound like a 60-year-old, let alone a 10-year-old.) Expectations that are based more on acquired socio-cultural habits or social-demographic associations are more tenuous. For example, listeners can and do form speaker expectations about gender, race, and ethnicity (categories that are fluid social constructs), but these are much more likely to be accent expectations that are based on overt stereotypes, small sample sizes, and biases. Such accent expectations will not always be wrong, but they might not be accurate in the majority or even plurality of cases.

Can you explain why not all accents are equal in their ability to elicit ethnicity and accent associations in speech processing?

There are many layers in this question. To me, the most transparent reading of this question offers a really simple response: because one hasn’t accumulated sufficient experiences. If one hasn’t met a speaker of, for example, Oroko-accented Canadian English, how is one to develop any expectations about that particular accent?

Let’s take a deeper dive into this question. A large player in that deeper answer is going to be colonialism, but a step back provides some important framing. Many tend to think of language borders and nation-state borders as being well-aligned. Many also imagine that language is inherently intertwined with ethnic identity. This is far from the case. Language borders bleed across geo-political borders. Language contact and language shift create new varieties as ethnic groups immigrate and emigrate. There are no neat boundaries in this space-time-language-people continuum! But, back to the role of colonialism here. Countries, like Canada, with histories as colonizers have ethnically and racially diverse populations, which means that speakers of Canadian English are ethnically and racially diverse. There is no single ethnicity-to-accent mapping in Canada. While it is true that many Canadians, non-white and white, speak languages other than English and French at home, this does not necessarily mean that their English or French accents will be revealing of their multilingual status. When it comes to speech (and much else), individuals contain multitudes.

What are some ways that people can recognize signs of linguistic stereotyping?

While I do not say this to excuse linguistic stereotyping, it is important to acknowledge that speech processing relies on expectations. Our phonetic categories are distributions formed around our experiences, which feed future expectations. For example, an English speaker’s expectations about what a /b/ will sound like are different from a Cantonese speaker’s expectations about a /b/ because there are language-specific distributions for that linguistic category. These sound-level expectations are the foundation. We use these expectations to actively predict the distributions of familiar talkers and to anticipate the phonetic distribution of unfamiliar talkers.

This, however, can get us into trouble when interacting with new talkers. A personal anecdote conveys this point. There was a UBC administrator with whom I had exchanged many emails, but when I first spoke with them on the phone, I missed the first few sentences because my expectations did not match their speech. This individual had a British English accent, which I was not anticipating. Of course, I was able to quickly rectify this situation, implicitly updating my system to have a successful conversation. Where we want to be careful is if we habitually find ourselves struggling to understand individuals from particular demographic backgrounds. This is an opportunity to broaden your own linguistic categories. I love reminding people that it is far easier to perceptually adapt to an accent than it is for an individual to change their accent. That is, while I can learn to understand any accent rather quickly, at my age it would take decades (if it is possible at all!) for me to authentically produce a particular accent.

