Talk from Prof. Simon King on Friday 18th January at 11.30am
Location: Thomas Davis Lecture Theatre (Room 2043), Arts Block, Trinity College Dublin

Simon King is Professor of Speech Processing at the University of Edinburgh, where he directs the Centre for Speech Technology Research and a long-running Masters programme in Speech and Language Processing. His research interests include speech synthesis, speech recognition, speaker verification spoofing, and signal processing, with around 200 publications. He co-authored the Festival speech synthesis toolkit and has contributed to Merlin. He is a Fellow of the IEEE and of ISCA. He is currently an associate editor of Computer Speech and Language, and has previously served as an associate editor of IEEE Transactions on Audio, Speech and Language Processing, a member of the IEEE Spoken Language Technical Committee, a board member of the ISCA Speech Synthesis SIG, and coordinator of the Blizzard Challenges from 2007 to 2018.

ABSTRACT
Almost every text-to-speech synthesiser contains three components. A front-end text processor normalises the input text and extracts useful features from it. An acoustic model performs regression from these features to an acoustic representation, such as a spectrogram. A waveform generator then creates the corresponding waveform.
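(As a purely illustrative aside: this decomposition can be sketched as three functions wired together. The names and placeholder internals below are hypothetical, not the API of any particular toolkit.)

```python
# Illustrative three-stage text-to-speech pipeline (hypothetical interfaces).
import numpy as np


def front_end(text: str) -> list[dict]:
    """Normalise the input text and extract simple features from it."""
    # A real front-end would do text normalisation, G2P, prosody prediction, etc.
    normalised = text.lower().strip()
    return [{"symbol": ch, "position": i} for i, ch in enumerate(normalised)]


def acoustic_model(features: list[dict]) -> np.ndarray:
    """Regress from front-end features to an acoustic representation,
    e.g. a mel spectrogram of shape (frames, mel bins)."""
    n_frames, n_mels = 10 * len(features), 80
    return np.zeros((n_frames, n_mels))  # placeholder for the model's output


def waveform_generator(spectrogram: np.ndarray, hop_length: int = 256) -> np.ndarray:
    """Create the waveform corresponding to the acoustic representation."""
    return np.zeros(spectrogram.shape[0] * hop_length)  # placeholder waveform


def synthesise(text: str) -> np.ndarray:
    features = front_end(text)
    spectrogram = acoustic_model(features)
    return waveform_generator(spectrogram)
```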

In many commercially deployed speech synthesisers, the waveform generator still constructs the output signal by concatenating pre-recorded fragments of natural speech. But we expect that to be replaced very soon by a neural vocoder that directly outputs a waveform. Neural approaches are already the dominant choice for acoustic modelling, starting with simple Deep Neural Networks guiding waveform concatenation, and progressing to sequence-to-sequence models driving a vocoder. Completely replacing the traditional front-end pipeline with an entirely neural approach is trickier, although there are some impressive so-called “end-to-end” systems.
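(Again purely as an illustration, the concatenative approach mentioned above amounts to joining selected pre-recorded fragments, typically with a short cross-fade at each boundary. The toy sketch below shows that idea only; it is not the unit-selection algorithm of any specific deployed system.)

```python
import numpy as np


def concatenate_units(units: list[np.ndarray], crossfade: int = 128) -> np.ndarray:
    """Join pre-recorded speech fragments, cross-fading over `crossfade` samples
    at each boundary to soften the joins (toy illustration)."""
    output = units[0].astype(float)
    fade_out = np.linspace(1.0, 0.0, crossfade)
    fade_in = 1.0 - fade_out
    for unit in units[1:]:
        unit = unit.astype(float)
        overlap = output[-crossfade:] * fade_out + unit[:crossfade] * fade_in
        output = np.concatenate([output[:-crossfade], overlap, unit[crossfade:]])
    return output
```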

In this rush to use end-to-end neural models to directly generate waveforms given raw text input, much of what we know about text and speech signal processing appears to have been cast aside. Maybe this is a good thing: the new methods are a long-overdue breath of fresh air. Or, perhaps there is still some value in the knowledge accumulated from 50+ years of speech processing. If there is, how do we decide what to keep and what to discard – for example, is source-filter modelling still a good idea?
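(For readers unfamiliar with the term, source-filter modelling treats voiced speech as a source signal, such as a train of glottal pulses, shaped by a resonant vocal-tract filter. The sketch below uses made-up formant frequencies and bandwidths purely to illustrate the idea.)

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000        # sample rate in Hz
f0 = 120          # fundamental frequency of the source (Hz)
duration = 0.5    # seconds

# Source: an impulse train at the pitch period (a crude stand-in for glottal pulses).
n = int(fs * duration)
source = np.zeros(n)
source[:: int(fs / f0)] = 1.0

# Filter: an all-pole vocal-tract model built from a few resonances (formants).
# The formant frequencies/bandwidths here are illustrative, roughly vowel-like.
a = np.array([1.0])
for freq, bw in [(700, 130), (1220, 70), (2600, 160)]:
    r = np.exp(-np.pi * bw / fs)            # pole radius from bandwidth
    theta = 2 * np.pi * freq / fs           # pole angle from formant frequency
    a = np.convolve(a, [1.0, -2 * r * np.cos(theta), r * r])

# Pass the source through the filter to obtain a speech-like signal.
speech_like = lfilter([1.0], a, source)
```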