08:30 -- 09:00
Two robust generalizations about human language are its arbitrariness (de Saussure, 1916), the observation that the relationship between the form of a word and its meaning is largely arbitrary, and the distributional hypothesis (Harris, 1954), according to which the meaning of a word is determined by the contexts in which it is used. Taken at face value, these suggest that multi-view learning must look for other sets of observations to correlate, since word forms carry no information about their content, and their content is merely a summary of the contexts they occur in. However, linguistic behavior suggests that neither of these is the whole story: we readily assign semantic intuitions and grammatical roles to words we have never seen before, based on, e.g., robust cross-lingual phono-semantic correspondences. In this talk, I provide empirical evidence that this relationship is not as arbitrary as the linguists would have it: modeling the character-word relationship yields useful word representations in a variety of tasks, and the representations found have qualitatively appealing properties.
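To make the modeling idea concrete, here is a minimal sketch of one way to compose a word representation from its characters, using a bidirectional LSTM in PyTorch. The architecture, dimensions, and class names are illustrative assumptions, not the model from the talk.

```python
import torch
import torch.nn as nn

class CharWordEncoder(nn.Module):
    """Compose a word vector from its character sequence (illustrative sketch)."""
    def __init__(self, n_chars=128, char_dim=16, word_dim=64):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        # Bidirectional LSTM over the characters of a single word
        self.lstm = nn.LSTM(char_dim, word_dim // 2,
                            bidirectional=True, batch_first=True)

    def forward(self, char_ids):  # char_ids: (batch, word_len)
        x = self.char_emb(char_ids)
        _, (h, _) = self.lstm(x)
        # Concatenate the final forward and backward hidden states
        return torch.cat([h[0], h[1]], dim=-1)  # (batch, word_dim)

# Usage: embed a word never seen in training, from its ASCII codepoints
word = torch.tensor([[ord(c) for c in "blork"]])
vec = CharWordEncoder()(word)
print(vec.shape)  # torch.Size([1, 64])
```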
09:00 -- 09:30
In multimodal representation learning, it is important to capture high-level associations between multiple data modalities with a compact set of latent variables. Deep learning has been successfully applied to this problem; a common strategy is to learn joint representations that are shared between multiple modalities at the higher layers, after learning several layers of modality-specific features in the lower layers. Nonetheless, an important question remains: how to learn good associations between multiple data modalities, in particular so as to reason about or predict missing data modalities effectively at test time. This talk will present my perspectives on advances in multimodal deep learning, with applications to challenging problems with heterogeneous input modalities such as audio, video, image, text, and other sensors. I will present multimodal deep learning methods that explicitly encourage cross-modal associations. In particular, instead of conventional maximum likelihood learning, we formulate the learning objective as bi-directional conditional prediction and theoretically show that the learned model converges to the generative modeling of joint probabilities under reasonable assumptions. Further, I will talk about learning joint embeddings for multimodal data (image, text, attributes, etc.), with applications to fine-grained classification. Finally, I will describe my recent work on conditional image generation from semantic attributes and text descriptions.
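As a rough illustration of a bi-directional conditional prediction objective, the sketch below trains two networks to predict each modality from the other; under unit-variance Gaussian likelihoods, the negative conditional log-likelihoods reduce to two MSE terms. The architectures and dimensions are assumptions for illustration, not the model from the talk.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x_dim, y_dim, h = 784, 128, 256
f_xy = nn.Sequential(nn.Linear(x_dim, h), nn.ReLU(), nn.Linear(h, y_dim))  # predicts y from x
f_yx = nn.Sequential(nn.Linear(y_dim, h), nn.ReLU(), nn.Linear(h, x_dim))  # predicts x from y

def bidirectional_loss(x, y):
    # -log p(y|x) - log p(x|y) under unit-variance Gaussians = two MSEs
    return F.mse_loss(f_xy(x), y) + F.mse_loss(f_yx(y), x)

x, y = torch.randn(32, x_dim), torch.randn(32, y_dim)
loss = bidirectional_loss(x, y)
loss.backward()
```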
09:30 -- 09:45
09:45 -- 10:00
10:00 -- 10:30
10:30 -- 11:00
We will introduce a new large-scale music dataset, MusicNet, which consists of freely-licensed classical music recordings together with instrument/note labels: 40 hours of music covering 10 instruments and 10 composers, yielding over 1 million temporal labels with an average of 51 distinct notes per instrument. Audio recordings in MusicNet span a variety of musical periods, from Baroque to late Romantic, performed by a wide range of performers under various studio and microphone conditions.
Hand-transcribing a dataset of this size is infeasible, even for expert musicians. This work provides a remarkably simple and robust algorithm for automatically labeling recordings by aligning them to musical scores.
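For intuition, here is a minimal dynamic-time-warping alignment between two feature sequences, which is the general idea behind aligning a recording to its score. MusicNet's actual alignment pipeline and features are not specified here, so everything in this sketch (the cost function, the feature choice) is an assumption.

```python
import numpy as np

def dtw_align(recording, score):
    """Dynamic-time-warping alignment of two feature sequences (sketch)."""
    n, m = len(recording), len(score)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(recording[i - 1] - score[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack to recover the frame-to-frame correspondence
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        i, j = min([(i - 1, j), (i, j - 1), (i - 1, j - 1)], key=lambda p: cost[p])
    return path[::-1]

rec = np.random.randn(100, 12)  # e.g., chroma features of the audio
sco = np.random.randn(80, 12)   # e.g., chroma features synthesized from the score
print(len(dtw_align(rec, sco)))
```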
11:00 -- 11:30
The ability to act in multiple environments and transfer previous knowledge to new situations can be considered a critical aspect of any intelligent agent. In this talk I will first introduce a novel method, termed "Actor-Mimic", that exploits the use of deep reinforcement learning and model compression techniques to train a single policy network that learns how to act in a set of distinct tasks by using the guidance of several expert teachers. I will show that the representations learnt by the deep policy network are capable of generalizing to new tasks with no prior expert guidance, speeding up learning in novel environments. Next, I will introduce deep models that are capable of extracting a unified representation that fuses together multiple data modalities. In particular, I will discuss models that can generate images from natural language descriptions using attention.
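A rough sketch of the distillation idea behind Actor-Mimic: train a single student policy network to match each expert teacher's action distribution via cross-entropy. The temperature, tensor shapes, and use of raw logits here are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def mimic_loss(student_logits, teacher_logits, tau=1.0):
    # Cross-entropy between the (temperature-softened) teacher policy
    # and the student policy, averaged over the batch
    teacher_probs = F.softmax(teacher_logits / tau, dim=-1)
    log_student = F.log_softmax(student_logits, dim=-1)
    return -(teacher_probs * log_student).sum(dim=-1).mean()

student = torch.randn(32, 6, requires_grad=True)  # student logits over 6 actions
teacher = torch.randn(32, 6)                      # frozen expert's logits
mimic_loss(student, teacher).backward()
```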
11:30 -- 11:45
11:45 -- 12:00
12:00 -- 01:30
01:30 -- 02:00
We compare different nonlinear extensions of CCA, and find that the deep neural network extension of CCA, termed deep CCA (DCCA), performs consistently well while remaining computationally efficient for large datasets. We further compare DCCA with deep autoencoder-based approaches, as well as new variants. We find an advantage for correlation-based representation learning, while the best results on most tasks are obtained with a new variant, deep canonically correlated autoencoders (DCCAE). This talk summarizes our ICML 2015 paper "On Deep Multi-view Representation Learning" and concludes with some new findings.
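As a reference point for the objective DCCA maximizes, the sketch below computes canonical correlations between two views by whitening each view and taking the singular values of the cross-covariance; DCCA applies the same objective to the outputs of two deep networks. This linear NumPy version is a sketch of the objective only, not the paper's implementation.

```python
import numpy as np

def canonical_correlations(X, Y, eps=1e-8):
    """Canonical correlations between views X (n x dx) and Y (n x dy)."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    n = X.shape[0]
    Sxx = X.T @ X / (n - 1) + eps * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + eps * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)
    # Whiten each view via Cholesky factors, then take the singular
    # values of the whitened cross-covariance
    Wx = np.linalg.inv(np.linalg.cholesky(Sxx))
    Wy = np.linalg.inv(np.linalg.cholesky(Syy))
    T = Wx @ Sxy @ Wy.T
    return np.linalg.svd(T, compute_uv=False)

X, Y = np.random.randn(500, 10), np.random.randn(500, 8)
print(canonical_correlations(X, Y)[:3])  # near zero for independent noise
```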
02:00 -- 02:30
I will present a simple novel probabilistic extension of the standard CCA model for multi-view data, inspired by the ICA framework. This "non-Gaussian CCA" model can be used with discrete or continuous data, and has the appealing property that its parameters are identifiable, unlike those of standard CCA. I will discuss its relationship to PCA, ICA, and related models, and show how we can estimate its parameters efficiently using the method of moments. Along the way, I will highlight a useful tool from the ICA literature, which we call "generalized covariance matrices", that can replace the more complicated tensor approach in the method of moments. In the second part, I will briefly mention a computer vision dataset of narrated instruction videos that may be of interest to the multi-view modeling community. [The first part is joint work with Anastasia Podosinnikova and Francis Bach]
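A minimal sketch of a generalized covariance matrix, understood here as the Hessian of the empirical cumulant-generating function log E[exp(t'x)] at a nonzero point t, which is equivalently the covariance of x under an exponentially tilted empirical distribution. The exact construction used in the talk may differ; this is an assumption for illustration.

```python
import numpy as np

def generalized_covariance(X, t):
    """Covariance of x under the exponentially tilted empirical
    distribution with tilt direction t; at t = 0 this is the ordinary
    covariance, and nonzero t supplies the extra moment information
    that can replace higher-order tensors (sketch)."""
    w = np.exp(X @ t)
    w = w / w.sum()           # tilted weights, summing to 1
    mean_t = w @ X            # tilted mean
    Xc = X - mean_t
    return (Xc * w[:, None]).T @ Xc  # tilted covariance

X = np.random.randn(1000, 5)
t = 0.1 * np.random.randn(5)
print(generalized_covariance(X, t).shape)  # (5, 5)
# Sanity check: t = 0 recovers the usual (biased) covariance
print(np.allclose(generalized_covariance(X, np.zeros(5)), np.cov(X.T, bias=True)))
```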
02:30 -- 02:45
02:45 -- 03:00
03:00 -- 05:00