Workshop on Machine Learning in Speech and Language Processing

September 13, 2016
San Francisco, CA, USA
Speaker: Herman Kamper (University of Edinburgh)

Title: Unsupervised speech recognition using acoustic word embeddings

Abstract:
In settings where only unlabelled speech data is available, unsupervised techniques are required to discover linguistic structure directly from the raw audio without transcriptions, pronunciation dictionaries, or language modelling text. In this talk, I will motivate why unsupervised, or *zero-resource*, speech processing is important. I will focus on the problems of segmenting and clustering unlabelled speech into word-like units, and will introduce our solution based on segmental modelling. In contrast to previous work, our Bayesian model represents whole words directly using a segmental representation referred to as *acoustic word embeddings*. The idea behind these embeddings is that an acoustic word segment (which can be of any duration) is mapped to a fixed-dimensional space in which distinct tokens of the same word lie close to each other. Such an approach can be useful not only for unsupervised discovery but also for other query-type speech tasks. I will describe our Bayesian model for unsupervised speech recognition, and review some methods for creating acoustic word embeddings, including our own work, which uses Siamese convolutional neural networks trained using weak pairwise supervision.
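
To make the idea of a fixed-dimensional embedding concrete, the following is a minimal sketch of one very simple way to build one: resample a variable-length sequence of feature frames at a fixed number of equally spaced points and flatten the result. This is an illustrative assumption only, not the Siamese network or the Bayesian model described in the talk. The function names (`downsample_embed`, `cosine_distance`, `toy_segment`), the choice of 10 sample points, and the synthetic 3-dimensional features are all hypothetical.

```python
import numpy as np


def downsample_embed(frames, n_samples=10):
    """Map a (T, D) sequence of frames to a fixed (n_samples * D,) vector.

    Assumption: plain linear interpolation at equally spaced points,
    used here only as a simple stand-in for a learned embedding.
    """
    frames = np.asarray(frames, dtype=float)
    n_frames, n_dims = frames.shape
    # Fractional frame positions of the equally spaced sample points.
    positions = np.linspace(0, n_frames - 1, n_samples)
    original = np.arange(n_frames)
    # Interpolate each feature dimension independently at those positions.
    resampled = np.stack(
        [np.interp(positions, original, frames[:, d]) for d in range(n_dims)],
        axis=1,
    )
    return resampled.flatten()


def cosine_distance(a, b):
    """Cosine distance between two embedding vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def toy_segment(n_frames):
    """Hypothetical 'word': one smooth 3-D trajectory sampled with n_frames frames."""
    t = np.linspace(0.0, 1.0, n_frames)
    return np.stack([np.sin(2 * np.pi * k * t) for k in (1, 2, 3)], axis=1)


# Two "tokens" of the same toy word, spoken at different rates.
slow = downsample_embed(toy_segment(75))
fast = downsample_embed(toy_segment(40))
print(slow.shape, fast.shape)
print(cosine_distance(slow, fast))
```

Both segments end up as vectors of the same length even though they contain different numbers of frames, so they can be compared directly with a standard vector distance. A learned embedding would aim for the same property while being more robust to speaker and pronunciation variation.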