The dynamics of indexical information in speech and its role in speech communication and speaker recognition

Speaking the language of voice recognition

Voice recognition processes are fundamental to human social interaction, enabling us to rapidly identify a speaker and participate in discourse with other people. While it was previously commonly assumed that we recognise speakers on the basis of static features in their voice, Professor Volker Dellwo and his colleagues in an SNSF-funded project are now exploring a different hypothesis.

Indexical properties

Each voice has an individual acoustic make-up, its timbre, by which the speaker can be identified, a phenomenon common across all languages. While there are many other ways to identify people, voice recognition is fundamental to human social interaction. “If you are in a social situation with six or seven people, where multiple people are talking at once, then you are entirely lost in the discourse if you can’t attribute voices to the individuals. There’s no way you can participate in the dialogue,” stresses Volker Dellwo, Professor in the Phonetics and Speech Sciences Group at the University of Zurich. As the Principal Investigator of an SNSF-funded research project based at the University, Professor Dellwo and his group are now investigating the importance of voice recognition in communication. “How do speakers construct their voice to help individuals recognise them?” he outlines.

Professor Dellwo and his team in the project are exploring a novel viewpoint. “Our argument is that voice-specific properties are not static. Rather, humans control them deliberately to fulfil certain communicative needs,” he says. “There are certain situations where speakers want to make sure they are recognisable. For example, politicians need to be iconic; thus their voice – amongst other properties – is part of their identity that stands for a certain political programme or view that they hold.”

This recognisability does not happen by chance; rather, speakers adapt their voice to ensure they can be easily identified. There are also circumstances where a speaker may want to adopt a style that makes them less easily recognisable, perhaps if they are trying to deceive their audience. “Our contention is that these speaking styles are not more – or less – recognisable by chance. There is a strategy behind them,” says Professor Dellwo. Two PhD candidates, Valeriia Perepelytsia and Leah Bradshaw, are now looking to gain fresh insights in this area by using AI, machine learning, and behavioural as well as neuroscientific methods on a large sample of people who were asked to speak in different styles during highly controlled voice-recording sessions. “We gave them different speaking tasks, and the study participants were asked to speak in a style that we thought would make them highly recognisable. We also asked them to speak in a way that would make their speech easily intelligible, a speaking style that is referred to as clear speech,” continues Professor Dellwo. “When you are trying to be easily intelligible, you want to produce signals that carry canonical linguistic markers and you want to remove all speaker-specific information.” Machine-learning models developed by Dr. Thayabaran Kathiresan delivered the first strong evidence for this assumption.

Part of the project team in the speech & voice laboratory of the Linguistic Research Infrastructure (LiRI) at the University of Zurich.

A further topic of interest in the project is infant-directed speech (IDS), a speaking style that adults tend to adopt when speaking to young children. While IDS has commonly been thought of as supporting language acquisition, Professor Dellwo is investigating a novel hypothesis according to which IDS serves to make caregivers more easily recognisable to their infant, which he says is particularly important once an infant starts walking and moving away on their own. “At that point it’s crucial to be able to recognise the key people around you, and voice plays a central role,” he stresses. One of the speaking tasks, carried out by Dr. Elisa Pellegrino, involves mothers communicating with their child.

“The mothers sit in the lab and are asked to communicate with their infant, who is in an adjacent booth. They are able to see and hear each other while they communicate, and the mothers naturally fall into IDS,” explains Dr. Pellegrino. “With similar methods we also elicit non-native-directed speech, where a person struggles to understand and you often have to repeat yourself. People tend to fall into clear speech in this situation and focus on their intelligibility.”

Computational modelling

Researchers in the project are analysing the data gathered from this work, then conducting computational modelling and human perception experiments to find out which forms of speech are more intelligible, and which make a speaker more recognisable, both through voice alone and across the visual and auditory modalities. Speech is highly multi-dimensional, and a wide variety of factors are involved in making it recognisable.

“Pitch plays a role, as do resonances in the vocal tract and individual voice quality, for example whether you speak in a creaky or breathy way. These are just a few of a seemingly endless number of dimensions,” says Professor Dellwo. The question then is how to deal with this vast number of dimensions; Alessandro De Luca, a PhD candidate in the project, uses computational models to identify those that best describe differences between speakers. “You essentially throw different dimensions into a model, then the model can identify which are dominant,” Professor Dellwo explains. “That can be done with quite complex methods, including deep-learning neural networks, for example. The crux is that different speakers might adopt different dimensions.”
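The dimension-ranking idea can be illustrated with a simple statistical sketch. The feature names and data below are hypothetical and the F-ratio used here is a deliberately simple stand-in for the project's more complex models, such as deep neural networks: it scores each acoustic dimension by how much speakers differ on it relative to their own within-speaker variability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative acoustic dimensions (hypothetical names, synthetic data).
features = ["pitch_mean", "f1", "f2", "breathiness", "speech_rate"]
speakers = 3
n = 60  # utterances per speaker

# Each synthetic speaker differs mainly along pitch_mean and, more weakly, f2.
X = []
for s in range(speakers):
    centre = np.zeros(len(features))
    centre[0] = 2.0 * s   # strong speaker separation in pitch_mean
    centre[2] = 1.0 * s   # weaker separation in f2
    X.append(rng.normal(centre, 1.0, size=(n, len(features))))
X = np.stack(X)  # shape: (speakers, utterances, dimensions)

# F-ratio per dimension: variance of the speaker means (between-speaker)
# divided by the average within-speaker variance. High values mark the
# dimensions that dominate the differences between speakers.
means = X.mean(axis=1)               # (speakers, dims)
between = means.var(axis=0)          # (dims,)
within = X.var(axis=1).mean(axis=0)  # (dims,)
f_ratio = between / within

for name, f in sorted(zip(features, f_ratio), key=lambda p: -p[1]):
    print(f"{name}: F = {f:.2f}")
```

Running this ranks pitch_mean first and f2 second, mirroring how the data were generated; on real recordings the dominant dimensions, as the project emphasises, may differ from speaker to speaker.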

The fundamental hypothesis underlying this research is the idea that humans can control their voice timbre properties. This idea leads into several sub-hypotheses, for example that voices in IDS are better suited to acquiring speaker identity than voices in adult-directed speech, or that voices are less identifiable in deceptive speech, or in communicative situations where signalling cooperation through vocal convergence has priority over individualisation. “It’s entirely possible that some of these sub-hypotheses are correct and others aren’t. That then affects the way we will understand the overall picture of how speakers can control their recognisability,” outlines Professor Dellwo. This work holds relevance not only for understanding the role of voice timbre information in communication between humans, but also between humans and machines, and so for the development of voice recognition and synthesis software, with Professor Dellwo and his research group developing systems in this area. “We build these systems from state-of-the-art procedures that are widely known,” he says.

Voice recognition is a highly interdisciplinary area of investigation, bringing together elements of phonetics, engineering, neuroscience and physics, to name just four disciplines, so it is important for researchers to be exposed to ideas and techniques from different areas. Concrete initiatives promoting cross-fertilisation between scholars and disciplines have been taken at the research and educational level since 2022: a new conference series on voice identity was launched at the University of Zurich in 2022 (www.voice-id.org; this year at the University of Marburg in Germany), and in 2023 the University of Zurich hosted the first interdisciplinary Summer School on Voice Identity, open to young researchers working in the multi-faceted domain of voice identity and recognition. As a result of these actions, a consortium of more than 20 European academic and industrial partners – including Professor Dellwo’s group at the University of Zurich – was established and was successful in receiving funding for a Marie Curie Doctoral Network to train the next generation of researchers in the domain of Voice Communication Sciences (www.vocs.eu.com). “The idea is really to bring people from different disciplines together to work on voice communication,” says Professor Dellwo.

This will also open up new opportunities to collaborate, with Professor Dellwo planning to conduct further research in this area. “The next step is to look more closely into how voice recognition interacts with the linguistic processing of speech in human listeners,” he says. “In situations where you are conversing with more than two people, you need to identify the speakers to process the messages they convey. As such, voice recognition is integrated with speech recognition, and not a separate process as it has often been viewed in the past.”

Speaker identification must happen very quickly during speech communication, believes Professor Dellwo, which has implications for the time-course of the voice recognition mechanism in the brain. The next step then is to study when voice recognition during dialogue actually happens; Professor Dellwo’s hypothesis is that it must happen near the onset of an utterance. “We probably settle on the identity of a speaker by the time they have uttered half a syllable; you probably don’t even need an entire syllable. Then you use the rest of the signal to process what the person is saying,” he outlines. This then affects our understanding of the brain’s workload with respect to some processes.

“One question is whether voice recognition is a different mechanism from speech recognition. What precisely happens there?” continues Professor Dellwo. “We’re also interested in the attribution of resources. Voice recognition costs the brain resources – so how much attention does the brain pay to recognising a speaker during the time that they are speaking? These are potential questions for follow-up projects.”

The knowledge obtained from the project feeds into forensic phonetic applications, in which phonetic experts compare two voice samples, typically one from a crime scene, whose speaker’s identity is disputed, and one from a known suspect, to decide whether they stem from one and the same speaker or from two different speakers. Together with his colleague Professor Peter French, Professor Dellwo and his group founded the Centre for Forensic Phonetics and Acoustics (CFPA) at the University of Zurich, which provides forensic expertise for courts, including prosecution and defence lawyers. Both Professor Dellwo and Professor French appear in court regularly to provide evidence. Recently the casework branch of the CFPA merged with the company JP French to form JP French International, a company supporting law enforcement agencies worldwide.

The Dynamics of Indexical Information in Speech and its Role in Speech Communication and Speaker Recognition

Project objectives

Recognising individuals by their voice is an important trait in human social interaction. To interact successfully in social situations, it is essential for humans to reliably attribute vocal utterances to individuals. In this project, Professor Dellwo researches how humans shape their voice to be maximally identifiable when voice identity is at stake.

Project funding

This project is funded by the Swiss National Science Foundation SNSF (CHF 896,518). https://data.snf.ch/grants/grant/185399

Project Partners

• Linguistic Research Infrastructure (LiRI): https://www.liri.uzh.ch

• VoCS network: https://www.vocs.eu.com

• JPFrench International: https://www.jpfrench.com

• CFPA: https://www.cl.uzh.ch/en/researchgroups/phonetics/cfpa.html

• Evolving Language network: https://evolvinglanguage.ch/

Contact Details

Project Coordinator
Prof. Dr. Volker Dellwo
Phonetics & Speech Sciences Group
Department of Computational Linguistics
Universität Zürich
Andreasstrasse 15, 8050 Zurich, Switzerland
T: +41 (0)78 807 75 50
E: volker.dellwo@uzh.ch
W: https://www.cl.uzh.ch

Prof. Dr. Dellwo studied Phonetics at the University of Trier and holds a PhD in Phonetics from the University of Bonn. He worked at University College London before coming to Zurich, and has been a guest professor at UIMP Madrid, London City University, Greenwich University and the University of Chiang Mai.

PD Dr. Elisa Pellegrino is a senior research associate in the Phonetics & Speech Sciences Group and collaborates with Professor Dellwo on the Indexical Dynamics project, amongst others. Alessandro De Luca is a PhD candidate with expertise in machine learning and computational modelling of voice data.

Leah Bradshaw is a PhD candidate with a particular focus on forensic applications of voice analysis.

Valeriia Perepelytsia is a PhD candidate working on neuro-cognitive processing of voice identities. Dr. Thayabaran Kathiresan was a postdoctoral researcher for automatic voice analysis in the project for two years.

Right to Left: Ms Perepelytsia, Prof. Dr. Dellwo, Dr Pellegrino, Ms Bradshaw, and Mr De Luca
Computational modelling of voices using UMAP. The clusters in colour show different utterances by the same speaker. Distances between clusters indicate the acoustic similarity between voices (closer distance = higher similarity).
A unique recording system in the laboratory allows speakers to be recorded in isolation while they interact with each other audiovisually via headphones and cameras.
An example of the centralised control computer, monitoring the recording sessions and presenting stimuli to the study subjects during multi-player game-based experiments.
