Periodic Reporting for period 1 - SPEAKER DICE (Robust SPEAKER DIariazation systems using Bayesian inferenCE and deep learning methods)


Summary

The SPEAKER DICE project dealt with the Speaker Diarization (SD) task. SD consists of automatically finding speaker turns in an audio utterance or, as it is commonly stated, finding “who spoke when?”

Although apparently easy for humans, diarization is highly challenging for machines. It builds on the complex task of Speaker Recognition, must determine the (unknown) number of speakers in the utterance, must segment speech into speaker turns (finding the boundaries between speakers), and must handle overlapped speech (cross-talk).
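
As an illustration only, the "who spoke when?" output can be pictured as a list of (start, end, speaker) segments. The sketch below is not the project's system: it assumes that some upstream model has already produced, for every frame, the set of active speakers, and it merely merges those hypothetical frame labels into speaker turns, keeping overlapped speech as overlapping segments. The function name, the labels and the frame duration are assumptions made for this example.

```python
from typing import List, Set, Tuple

# (start in seconds, end in seconds, speaker label)
Segment = Tuple[float, float, str]


def frames_to_segments(frames: List[Set[str]], frame_s: float = 0.01) -> List[Segment]:
    """Merge per-frame sets of active speakers into speaker turns.

    `frames[i]` is the (hypothetical) set of speakers active in frame i;
    an empty set means silence. Overlapped speech simply shows up as
    segments of different speakers that intersect in time.
    """
    segments: List[Segment] = []
    open_since = {}  # speaker -> index of the frame where the current turn began
    # A trailing empty frame acts as a sentinel that closes any open turn.
    for i, active in enumerate(frames + [set()]):
        # Close the turns of speakers who stopped talking at frame i.
        for spk in sorted(set(open_since) - active):
            segments.append((open_since.pop(spk) * frame_s, i * frame_s, spk))
        # Open a turn for speakers who started talking at frame i.
        for spk in active - set(open_since):
            open_since[spk] = i
    return sorted(segments)


# Toy input with 1-second frames: A talks, B overlaps A, silence, A again.
toy_frames = [{"A"}, {"A"}, {"A", "B"}, {"B"}, set(), {"A"}]
for start, end, spk in frames_to_segments(toy_frames, frame_s=1.0):
    print(f"{start:5.1f} {end:5.1f} {spk}")
```

The hard parts named above (recognizing speakers, estimating how many there are, and detecting overlap) all sit in whatever produces the frame labels; this sketch only shows the shape of the final result.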

One of the main applications of Speaker Diarization is the indexing of audiovisual resources by speaker. Such indexing allows structured search of and access to resources according to the speaker of interest, which is useful in a wide range of scenarios. First of all, it would be very valuable for public institutions, allowing the indexing of sessions of parliaments, courts, etc. It can also help companies, for example by giving access to specific parts of meetings or seminars.
In addition, TV, internet and radio broadcasters would benefit from such a system, as they could provide more versatile access to their content. The indexing of TV broadcasts is of special interest, as it would allow subtitles to be coloured automatically according to the speaker, making the media more accessible to hearing-impaired people.
Beyond these direct applications, diarization systems are also relevant to other related tasks. To list a few: diarization can be used for speaker adaptation in Automatic Speech Recognition (ASR). It is also a very important part of the system pipeline for Speaker Recognition (SR) in wild scenarios, in which several speakers are present but only one is of interest. Moreover, it is relevant for the production of linguistic resources, as it allows language utterances to be collected while avoiding speaker repetitions.
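
To make the indexing application concrete, here is a minimal sketch of a speaker index built from diarization output. It is an assumption-laden illustration, not part of the project: the recording identifiers, speaker labels and the `build_speaker_index` helper are hypothetical, and the sketch assumes that speaker labels have already been made consistent across recordings, which diarization alone does not guarantee.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Segment = Tuple[float, float, str]       # (start s, end s, speaker label)
Occurrence = Tuple[str, float, float]    # (recording id, start s, end s)


def build_speaker_index(diarized: Dict[str, List[Segment]]) -> Dict[str, List[Occurrence]]:
    """Invert per-recording segments into a speaker -> occurrences index."""
    index = defaultdict(list)
    for rec_id, segments in diarized.items():
        for start, end, spk in segments:
            index[spk].append((rec_id, start, end))
    return dict(index)


# Hypothetical diarization output for two recordings.
diarized = {
    "session_01": [(0.0, 12.5, "spk_A"), (12.5, 30.0, "spk_B")],
    "session_02": [(3.0, 8.0, "spk_A")],
}
index = build_speaker_index(diarized)

# Structured access: jump directly to every passage of one speaker.
for rec_id, start, end in index["spk_A"]:
    print(rec_id, start, end)
```

The same per-speaker lookup could, for instance, drive the choice of subtitle colour for each segment.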

This project focused on improving and extending current approaches, and on developing new ones, to enhance the performance of Speaker Diarization systems. For that purpose, we set three main objectives. First, to optimize the current Bayesian models, which have a strong mathematical foundation, to achieve better performance. Second, driven by the success of artificial Neural Network (NN) based techniques in the related speaker recognition task, to integrate NN modules into the diarization pipeline. Third, to make the system applicable to the general case, so that it generalizes to any kind of speech and environment.

Work performed

The SPEAKER DICE project has successfully achieved the objectives set at its beginning. The optimization of the Bayesian inference and of the speaker modeling in the system has led to significant improvements in performance, as reflected in the related conference publication, which was recognized with a best paper award, and in the related journal publication (currently under review).
Two different neural network based modules have been integrated into the pipeline: one NN for the extraction of robust and discriminative features (embeddings), and another NN based module that detects and handles overlapped speech (segments in which two or more speakers talk at the same time). The integration of these modules proved to be very successful. As a result, the BUT team led by the main researcher of the project achieved the third and first positions in the two tracks of the last DIHARD challenge, organized to foster research on hard diarization conditions. Results are available at: https://coml.lscp.ens.fr/dihard/2018/results.php. This research also led to several publications.
Finally, the diarization modules were successfully integrated into ASR and SR systems. The project also raised the interest of industry: a collaboration has started with Ericsson to optimize the technology for application to real broadcast data.

Final results

The work performed advanced the state of the art in speaker diarization, and the Bayesian HMM system is now a de facto standard that is also used by other international laboratories (as shown in the recent DIHARD evaluation). The project had a significant impact in the area of speech data mining, where reliable diarization is considered a necessary pre-processing block in many applications, from commercial contact center voice traffic analysis, through media indexing, to investigative and intelligence work in law enforcement. The project thus contributed to (1) advancing European industry, with cooperation already running with Phonexia (Czechia) and Ericsson (Sweden), and (2) increasing the security of European citizens by providing law enforcement with more powerful speech analytics tools. On the social and gender front, the project helped to further advance a successful international research group, BUT Speech@FIT, and to promote scientific work in artificial intelligence and speech data mining among female researchers and students.

Website & more info

More info: http://www.fit.vutbr.cz/.