Coordinator | TECHNISCHE UNIVERSITAT BERLIN
Organization address | STRASSE DES 17 JUNI 135
Coordinator nationality | Germany [DE]
Total cost | 155 542 €
EC contribution | 155 542 €
Programme | FP7-PEOPLE
Specific programme "People" implementing the Seventh Framework Programme of the European Community for research, technological development and demonstration activities (2007 to 2013)
Call code | FP7-PEOPLE-2010-IEF
Funding scheme | MC-IEF
Start year | 2011
Period (year-month-day) | 2011-05-01 to 2013-04-30
# | Participant | Address | Country (city) | Role | EC contribution
---|---|---|---|---|---
1 | TECHNISCHE UNIVERSITAT BERLIN | STRASSE DES 17 JUNI 135 | DE (BERLIN) | coordinator | 155 542.40 €
'In recent years, there has been a marked increase in communication technologies and computer interfaces that operate within the audio-visual speech domain (e.g. video-telephony, synthesised avatars). Faithful synchrony between the visual and acoustic speech elements of such technologies is essential if end-users are to perceive them as operating at high quality. The effect of intermodal asynchrony on user-perceived quality is typically assessed using subjective evaluation techniques. A system for automatically assessing asynchrony levels, and predicting quality degradation on that basis, would therefore be both desirable and useful, with direct application to techniques for automatic synchrony adjustment.

The proposed project will examine audio-visual speech both as spoken naturally by humans and as artificially synthesised by machines, and will employ subjective assessment techniques and machine learning in a combined, iterative, semi-automatic strategy for producing a Quality Prediction Model. Different levels of intermodal asynchrony will first be assessed by human subjects, who will score the effect of each asynchrony level on perceived speech quality using standardised techniques modified for use with multimodal speech. Asynchrony patterns and their corresponding subjective assessment scores will then be learned automatically by machines, resulting in an initial Quality Prediction Model.

The initial model will be tested on data that is simultaneously assessed by humans using the subjective assessment techniques above. The output from the prediction model will be compared directly with the subjective scores, providing an initial evaluation of the model's performance. The model will be adjusted on this basis and re-trained using new data. This re-train, re-test, re-score process will be repeated iteratively, leading to a more robust quality prediction model.'
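The objective does not name a specific learning algorithm or feature set, so the following is only a minimal sketch of the re-train, re-test, re-score loop. It assumes, purely for illustration, that each stimulus is summarised by a few numeric asynchrony features, that subjective quality is a 1-5 mean opinion score (MOS), and that the Quality Prediction Model is a scikit-learn random-forest regressor trained on synthetic stand-in data.

```python
# Sketch of the iterative semi-automatic strategy: train a quality model,
# test it against fresh subjectively scored data, then re-train on the
# enlarged data set. All data here are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)

def collect_batch(n=200):
    """Stand-in for one round of subjective testing: asynchrony features
    (audio-visual offset in ms plus two toy signal statistics) and the
    MOS each stimulus received from human subjects."""
    offset = rng.uniform(-400, 400, n)                  # audio-visual lag in ms
    features = np.column_stack([offset,
                                rng.uniform(0, 1, n),   # audio energy (toy)
                                rng.uniform(0, 1, n)])  # lip-motion energy (toy)
    mos = np.clip(5.0 - np.abs(offset) / 120.0 + rng.normal(0, 0.3, n), 1, 5)
    return features, mos

model = RandomForestRegressor(n_estimators=200, random_state=0)
X, y = collect_batch()

for round_no in range(3):                       # re-train / re-test / re-score
    model.fit(X, y)
    X_new, y_new = collect_batch()              # fresh, simultaneously scored data
    err = mean_absolute_error(y_new, model.predict(X_new))
    print(f"round {round_no}: MAE against subjective scores = {err:.2f}")
    X, y = np.vstack([X, X_new]), np.concatenate([y, y_new])
```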
When video lags behind speech, the effect can be off-putting for technology users. New research into measuring asynchrony between audio and visual signals could help address this problem.
High-tech audiovisual communication is becoming a very common form of exchange, from satellite teleconferencing to live chatting on smartphones. As simple as the idea seems, keeping picture and voice synchronised is challenging yet crucial to the success of such complex applications. If users perceive that the communication experience is out of sync, they may switch to other means of communication.
Against this backdrop, the EU-funded project PERCQUALAVS aimed to measure synchrony between the visual and acoustic speech elements of such technologies. Building on fields such as computer vision, cognitive science, machine learning and speech processing, it conceived a model to predict the perceived quality of audiovisual speech.
To achieve its aims, the project was divided into four parts. The first focused on extracting key audiovisual features from an input signal for automatic asynchrony detection. The second involved gathering subjective response data through several perceptual experiments.
The third component analysed the perceptual responses gathered, while the fourth was a machine learning component that predicts human perception of asynchronous input. The project team successfully developed computer vision-based feature extractors that track the lips in real time and extract valuable data, as well as speech processing toolkits to assist in analysing those data.
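The project's own lip-tracking implementation is not detailed here; the sketch below shows one simplified way such a real-time feature extractor could be realised, assuming OpenCV's stock Haar face detector and using frame-to-frame pixel change in the lower third of the face box as a coarse lip-motion signal.

```python
# Simplified stand-in for a real-time lip-motion feature extractor.
# Assumptions (not from the project): Haar cascade face detection and
# mean absolute frame difference over the lower third of the face box.
import cv2
import numpy as np

def lip_motion_signal(video_path):
    cap = cv2.VideoCapture(video_path)
    face_det = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    prev_mouth, signal = None, []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_det.detectMultiScale(gray, 1.2, 5)
        if len(faces) == 0:
            signal.append(0.0)              # no face: no motion estimate
            prev_mouth = None
            continue
        x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # largest face
        mouth = gray[y + 2 * h // 3: y + h, x: x + w]         # lower third
        mouth = cv2.resize(mouth, (64, 32)).astype(np.float32)
        if prev_mouth is not None:
            signal.append(float(np.mean(np.abs(mouth - prev_mouth))))
        else:
            signal.append(0.0)
        prev_mouth = mouth
    cap.release()
    return np.array(signal)                 # one lip-motion value per frame
```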
Another key project achievement was the development of software to process the extracted features in order to measure synchrony and map the results. This enabled comparison between perceptual responses of users and automatically generated results.
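One common way to turn such features into a synchrony measure, assumed here only for illustration rather than taken from the project itself, is to cross-correlate the audio energy envelope with the lip-motion signal and read the estimated audio-visual offset from the lag of the correlation peak, which can then be mapped against the perceptual scores.

```python
# Hypothetical synchrony measurement: normalised cross-correlation between
# an audio energy envelope and a lip-motion signal, both sampled at the
# video frame rate.
import numpy as np

def estimate_offset_frames(audio_envelope, lip_motion, max_lag=25):
    """Return the lag (in video frames) at which the two signals align best.
    Positive lag means the audio leads the video."""
    n = min(len(audio_envelope), len(lip_motion))
    a = np.asarray(audio_envelope[:n], dtype=float)
    v = np.asarray(lip_motion[:n], dtype=float)
    a = (a - a.mean()) / (a.std() + 1e-9)   # zero-mean, unit-variance
    v = (v - v.mean()) / (v.std() + 1e-9)
    lags = list(range(-max_lag, max_lag + 1))
    scores = [np.mean(a[max(0, -k): n - max(0, k)] *
                      v[max(0, k): n - max(0, -k)]) for k in lags]
    return lags[int(np.argmax(scores))]

# Example: at 25 frames per second, an estimated offset of +5 frames
# corresponds to the audio leading the video by roughly 200 ms.
```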
While the project's results were not disseminated as widely as planned due to technical and time constraints, they have laid the groundwork for further research in the field. This is a step forward for assessing and improving audiovisual technology, which is growing rapidly worldwide.
"Trans-Atlantic Micromechanics Evolving Research ""Materials containing inhomogeneities of diverse physical properties, shapes and orientations"""
Read More