Speech prosody is a multidimensional phenomenon comprising intonation, energy, and rhythm.
It carries both linguistic information, e.g. sentence structure, focus and contrast, lexical stress; and paralinguistic information, e.g. gender, age, personality, and emotions.
In recent years, prosodic research has considerably enlarged the known spectrum of prosody's properties and functions.
In contrast, prosodic models able to map signals to functions or vice versa are rare: comprehensive models of rhythm and intonation struggle to cope with this expanding dimensionality, and machine learning techniques still fall short of offering structuring principles.
This issue is of increasing importance to society as speech-enabled applications see widespread deployment in our everyday lives.
Specifically, we have seen a proliferation of virtual assistants and virtual call operators whose primary mode of communication with the user is speech.
Even though these systems handle the linguistic content of the speech signal well, i.e. the spoken words, they struggle to understand the information embedded in its prosody, i.e. the meaning behind these words.
Thus, a model is needed that will enable these systems to disentangle and decode the various kinds of information embedded in speech prosody.
The prime objective of the ProsoDeep project was to develop a state-of-the-art Deep Prosody Model that would provide a deeper understanding of the hierarchical encoding of information through the language of prosody.
A secondary objective was the recording of a prosodically rich Database that can be used to analyse how multiple interacting linguistic functions are communicated through prosody.
The ProsoDeep project has achieved both of these objectives.
The main result of the Project is the creation of a state-of-the-art prosody model, the Variational Prosody Model (VPM), which bridges the gap between traditional prosody modelling approaches and modern black-box Deep Learning based systems. The VPM combines the structured modelling paradigm of the top-down Superposition of Functional Contours (SFC) model with Deep Learning techniques such as Recurrent Neural Networks and Variational Autoencoders.
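To make this combination concrete, the sketch below pairs SFC-style contour generators with a variational encoder, in the spirit of the VPM. It is a minimal illustration in PyTorch, not the project's actual implementation: all class names, dimensions, and architectural choices are hypothetical simplifications.

```python
import torch
import torch.nn as nn

class ContourGenerator(nn.Module):
    """Decodes the pitch contour contributed by one linguistic function
    (e.g. an attitude or a syntactic unit), conditioned on a latent
    code z that captures context-specific variation."""
    def __init__(self, latent_dim=2, hidden_dim=32):
        super().__init__()
        self.rnn = nn.GRU(input_size=latent_dim + 1,
                          hidden_size=hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)  # one f0 value per frame

    def forward(self, z, n_frames):
        # Repeat the latent code along time and append a position ramp,
        # so the RNN knows where it is within the function's scope.
        pos = torch.linspace(0, 1, n_frames).view(1, n_frames, 1)
        pos = pos.expand(z.size(0), -1, -1)
        z_seq = z.unsqueeze(1).expand(-1, n_frames, -1)
        h, _ = self.rnn(torch.cat([z_seq, pos], dim=-1))
        return self.out(h).squeeze(-1)  # (batch, n_frames)

class VariationalEncoder(nn.Module):
    """VAE recognition network: encodes an observed contour into the
    parameters of a Gaussian posterior over the latent code."""
    def __init__(self, n_frames, latent_dim=2, hidden_dim=32):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_frames, hidden_dim), nn.Tanh())
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.logvar = nn.Linear(hidden_dim, latent_dim)

    def forward(self, contour):
        h = self.body(contour)
        return self.mu(h), self.logvar(h)

def reparameterise(mu, logvar):
    """VAE reparameterisation trick: z = mu + sigma * eps."""
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

def superpose(contours):
    """SFC principle: the final contour is the sum of the overlapping
    contours contributed by each active linguistic function."""
    return torch.stack(contours, dim=0).sum(dim=0)
```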
Thanks to this, the VPM can capture and reveal variation in prosodic structure within the imposed constraints of the modelling framework. Specifically, the VPM can map out a prosodic latent space representation of the manifold of prosodic variety at different linguistic levels of organisation. An example of this structure is shown in the attached figures for a prosodic latent space captured at three levels of the linguistic hierarchy: the phrase level, the syntax level, and the word level. This representation gives insight into the distribution of prototype shapes within the corresponding three levels of prosodic encoding.
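As a minimal illustration of how such a latent space can be inspected, the snippet below (reusing the hypothetical ContourGenerator from the sketch above) sweeps a grid of 2-D latent codes and decodes the prototype contour at each point, mapping how contour shape varies across the space for one linguistic level:

```python
import numpy as np
import torch
import matplotlib.pyplot as plt

gen = ContourGenerator(latent_dim=2)  # hypothetical, trained generator
gen.eval()
grid = np.linspace(-2.0, 2.0, 5)
fig, axes = plt.subplots(len(grid), len(grid), sharex=True, sharey=True)
with torch.no_grad():
    for i, zy in enumerate(grid):
        for j, zx in enumerate(grid):
            z = torch.tensor([[zx, zy]], dtype=torch.float32)
            contour = gen(z, n_frames=30)[0]
            axes[i, j].plot(contour.numpy())
fig.suptitle("Decoded prototype contours across the latent grid")
plt.show()
```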
The VPM can be used to analyse and to synthesise prosody that corresponds to a variety of interacting communicative functions, including but not limited to attitudes, syntax, and focus. Specifically, in an automatic speech recognition (ASR) system the VPM can potentially be used to detect linguistic functions in the prosody of the input speech, while in a text-to-speech synthesis (TTS) system it can be used to embed these communicative functions in the output speech. Finally, the VPM can be integrated in speech-to-speech (STS) systems, where it can transfer the communicative functions from the input speech to the output speech in a different language.
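As a hedged illustration of the synthesis-side use, the snippet below (again reusing the hypothetical classes from the sketch above) decodes the contour of each active function and superposes them into the final f0 contour of an utterance; setting z = 0 yields the prototype shape, while other values of z select context-specific variants:

```python
import torch

attitude_gen = ContourGenerator(latent_dim=2)  # spans the whole phrase
focus_gen = ContourGenerator(latent_dim=2)     # spans the focused word

n_frames = 120                  # frames in the whole utterance
f0 = torch.zeros(1, n_frames)
with torch.no_grad():
    f0 += attitude_gen(torch.zeros(1, 2), n_frames)
    f0[:, 40:70] += focus_gen(torch.zeros(1, 2), 30)  # focused word span
```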
A secondary result of the Project is the recording of a prosodically rich Database, which will be made available as open access on Zenodo after segmentation and annotation. The Database has been recorded in 6 languages: English, German, Chinese, Vietnamese, Macedonian, and French, each recorded with a single native speaker. Each language comprises 70 utterances spoken in 11 different attitudes with varied word focus, giving a total of 800 utterances; these numbers vary between the languages. The Database was recorded in studio conditions in an anechoic chamber using high-quality audio recording equipment. Most recordings were also augmented with breathing measurements and video. The Database will facilitate prosody research across these languages, specifically with regard to how linguistic functions at different levels of the linguistic hierarchy interact in their encoding into prototype prosodic contours.
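For illustration, a hypothetical record layout for one utterance in the Database is sketched below; the actual release format on Zenodo may differ:

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    language: str             # one of the 6 recorded languages
    speaker: str              # single native speaker per language
    attitude: str             # one of the 11 recorded attitudes
    focus_word: int           # index of the focused word, or -1 for none
    text: str                 # the spoken utterance
    wav_path: str             # studio-quality audio recording
    breathing_path: str = ""  # present for most recordings
    video_path: str = ""      # present for most recordings
```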
Four scientific publications have been published at two flagship speech conferences, the International Conference on Speech Prosody 2018 and Interspeech 2018, and at two linguistic conferences focused on prosody, the International Symposium on Tonal Aspects of Languages 2018 and the Workshop on Prosody and Meaning 2018. We are finalising a journal manuscript that will summarise our work on the VPM in detail. We also plan to write and submit a journal article describing the recorded Database once its segmentation and annotation are complete.
The prosodic latent space representation of context-specific prosodic prototype variation is a unique contribution of the VPM that goes beyond the state of the art. These results can be used by engineers who work on prosody analysis in the context of speech recognition, or on prosody generation in the context of speech synthesis, especially in systems that rely on expressive and nuanced natural speech interaction, such as virtual assistants and virtual call operators. Additional potential users of the VPM are linguists, for whom the system can be adapted to investigate prosodic phenomena marked by an interaction of linguistic functions.
When integrated in a speech synthesis system, the VPM can also serve assistive technology, e.g. screen readers, addressing the societal needs of the blind and people with impaired vision.
More info: https://gerazov.github.io/prosodeep/.