08:30 -- 09:00
Two robust generalizations about human language are its arbitrariness (de Saussure, 1916), the observation that the relationship between the form of a word and its meaning is largely arbitrary, and the distributional hypothesis (Harris, 1954), according to which the meaning of a word is determined by the contexts in which it is used. Taken at face value, these suggest that multi-view learning must look for other sets of observations to correlate, since word forms carry no information about their content, and their content is merely a summary of the contexts they occur in. However, linguistic behavior suggests that neither of these is the whole story: we readily assign semantic intuitions and grammatical roles to words we have never seen before, based on, e.g., robust cross-lingual phono-semantic correspondences. In this talk, I provide empirical evidence that this relationship is not as arbitrary as the linguists would have it: modeling the character-word relationship yields useful word representations in a variety of tasks, and the representations found have qualitatively appealing properties.
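To make the modeling idea concrete, here is a minimal sketch of one way to compose a word representation from its characters, using a bidirectional LSTM in PyTorch. The architecture, dimensions, and class names are illustrative assumptions, not the model from the talk.

```python
import torch
import torch.nn as nn

class CharWordEncoder(nn.Module):
    """Compose a word vector from its character sequence (illustrative sketch)."""
    def __init__(self, n_chars=128, char_dim=16, word_dim=64):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        # Bidirectional LSTM over the characters of a single word
        self.lstm = nn.LSTM(char_dim, word_dim // 2,
                            bidirectional=True, batch_first=True)

    def forward(self, char_ids):  # char_ids: (batch, word_len)
        x = self.char_emb(char_ids)
        _, (h, _) = self.lstm(x)
        # Concatenate the final forward and backward hidden states
        return torch.cat([h[0], h[1]], dim=-1)  # (batch, word_dim)

# Usage: embed a word never seen in training, from its ASCII codepoints
word = torch.tensor([[ord(c) for c in "blork"]])
vec = CharWordEncoder()(word)
print(vec.shape)  # torch.Size([1, 64])
```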
09:00 -- 09:30
In multimodal representation learning, it is important to capture high-level associations between multiple data modalities with a compact set of latent variables. Deep learning has been successfully applied to this problem; a common strategy is to learn joint representations that are shared between multiple modalities at the higher layers, after learning several layers of modality-specific features in the lower layers. Nonetheless, an important question remains: how to learn good associations between multiple data modalities, in particular so as to reason about or predict missing data modalities effectively at test time. This talk will present my perspectives on advances in multimodal deep learning, with applications to challenging problems with heterogeneous input modalities such as audio, video, image, text, and other sensors. I will present multimodal deep learning methods that explicitly encourage cross-modal associations. In particular, instead of conventional maximum likelihood learning, we formulate the learning objective as bi-directional conditional prediction and theoretically show that the learned model converges to the generative modeling of joint probabilities under reasonable assumptions. Further, I will talk about learning joint embeddings for multimodal data (image, text, attributes, etc.), with applications to fine-grained classification. Finally, I will describe my recent work on conditional image generation from semantic attributes and text descriptions.
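As a rough illustration of a bi-directional conditional prediction objective, the sketch below trains two networks to predict each modality from the other; under unit-variance Gaussian likelihoods, the negative conditional log-likelihoods reduce to two MSE terms. The architectures and dimensions are assumptions for illustration, not the model from the talk.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x_dim, y_dim, h = 784, 128, 256
f_xy = nn.Sequential(nn.Linear(x_dim, h), nn.ReLU(), nn.Linear(h, y_dim))  # predicts y from x
f_yx = nn.Sequential(nn.Linear(y_dim, h), nn.ReLU(), nn.Linear(h, x_dim))  # predicts x from y

def bidirectional_loss(x, y):
    # -log p(y|x) - log p(x|y) under unit-variance Gaussians = two MSEs
    return F.mse_loss(f_xy(x), y) + F.mse_loss(f_yx(y), x)

x, y = torch.randn(32, x_dim), torch.randn(32, y_dim)
loss = bidirectional_loss(x, y)
loss.backward()
```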
09:30 -- 09:45
09:45 -- 10:00
10:00 -- 10:30
10:30 -- 11:00
We will introduce a new large-scale music dataset, MusicNet, which consists of freely-licensed classical music recordings together with instrument/note labels: 40 hours of music covering 10 instruments and 10 composers, yielding over 1 million temporal labels with an average of 51 distinct notes per instrument. Audio recordings in MusicNet span a variety of musical periods, from Baroque to late Romantic, performed by a wide range of performers under various studio and microphone conditions.
Hand-transcribing a dataset of this size is infeasible, even for expert musicians. This work provides a remarkably simple and robust algorithm for automatically labeling recordings by aligning them to musical scores.
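For intuition, here is a minimal dynamic-time-warping alignment between two feature sequences, which is the general idea behind aligning a recording to its score. MusicNet's actual alignment pipeline and features are not specified here, so everything in this sketch (the cost function, the feature choice) is an assumption.

```python
import numpy as np

def dtw_align(recording, score):
    """Dynamic-time-warping alignment of two feature sequences (sketch)."""
    n, m = len(recording), len(score)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(recording[i - 1] - score[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack to recover the frame-to-frame correspondence
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        i, j = min([(i - 1, j), (i, j - 1), (i - 1, j - 1)], key=lambda p: cost[p])
    return path[::-1]

rec = np.random.randn(100, 12)  # e.g., chroma features of the audio
sco = np.random.randn(80, 12)   # e.g., chroma features synthesized from the score
print(len(dtw_align(rec, sco)))
```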
11:00 -- 11:30
The ability to act in multiple environments and transfer previous knowledge to new situations can be considered a critical aspect of any intelligent agent. In this talk I will first introduce a novel method, termed "Actor-Mimic", that exploits the use of deep reinforcement learning and model compression techniques to train a single policy network that learns how to act in a set of distinct tasks by using the guidance of several expert teachers. I will show that the representations learnt by the deep policy network are capable of generalizing to new tasks with no prior expert guidance, speeding up learning in novel environments. Next, I will introduce deep models that are capable of extracting a unified representation that fuses together multiple data modalities. In particular, I will discuss models that can generate images from natural language descriptions using attention.
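A rough sketch of the distillation idea behind Actor-Mimic: train a single student policy network to match each expert teacher's action distribution via cross-entropy. The temperature, tensor shapes, and use of raw logits here are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def mimic_loss(student_logits, teacher_logits, tau=1.0):
    # Cross-entropy between the (temperature-softened) teacher policy
    # and the student policy, averaged over the batch
    teacher_probs = F.softmax(teacher_logits / tau, dim=-1)
    log_student = F.log_softmax(student_logits, dim=-1)
    return -(teacher_probs * log_student).sum(dim=-1).mean()

student = torch.randn(32, 6, requires_grad=True)  # student logits over 6 actions
teacher = torch.randn(32, 6)                      # frozen expert's logits
mimic_loss(student, teacher).backward()
```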
11:30 -- 11:45
11:45 -- 12:00
12:00 -- 01:30
01:30 -- 02:00
We compare different nonlinear extensions of CCA, and find that the deep neural network extension of CCA, termed deep CCA (DCCA), performs consistently well while remaining computationally efficient for large datasets. We further compare DCCA with deep autoencoder-based approaches, as well as new variants. We find an advantage for correlation-based representation learning, while the best results on most tasks are obtained with a new variant, deep canonically correlated autoencoders (DCCAE). This talk summarizes our ICML 2015 paper "On Deep Multi-view Representation Learning" and concludes with some new findings.
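As a reference point for the objective DCCA maximizes, the sketch below computes canonical correlations between two views by whitening each view and taking the singular values of the cross-covariance; DCCA applies the same objective to the outputs of two deep networks. This linear NumPy version is a sketch of the objective only, not the paper's implementation.

```python
import numpy as np

def canonical_correlations(X, Y, eps=1e-8):
    """Canonical correlations between views X (n x dx) and Y (n x dy)."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    n = X.shape[0]
    Sxx = X.T @ X / (n - 1) + eps * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + eps * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)
    # Whiten each view via Cholesky factors, then take the singular
    # values of the whitened cross-covariance
    Wx = np.linalg.inv(np.linalg.cholesky(Sxx))
    Wy = np.linalg.inv(np.linalg.cholesky(Syy))
    T = Wx @ Sxy @ Wy.T
    return np.linalg.svd(T, compute_uv=False)

X, Y = np.random.randn(500, 10), np.random.randn(500, 8)
print(canonical_correlations(X, Y)[:3])  # near zero for independent noise
```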
02:00 -- 02:30
I will present a simple novel probabilistic extension of the standard CCA model for multi-view data, inspired by the ICA framework. This "non-Gaussian CCA" model can be used with discrete or continuous data, and has the appealing property that its parameters are identifiable, unlike those of standard CCA. I will discuss its relationship to PCA, ICA, and related models, and show how we can estimate its parameters efficiently using the method of moments. Along the way, I will highlight a useful tool from the ICA literature, which we call "generalized covariance matrices", that can replace the more complicated tensor approach in the method of moments. In the second part, I will briefly mention a computer vision dataset of narrated instruction videos that may be of interest to the multi-view modeling community. [The first part is joint work with Anastasia Podosinnikova and Francis Bach]
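A minimal sketch of a generalized covariance matrix, understood here as the Hessian of the empirical cumulant-generating function log E[exp(t'x)] at a nonzero point t, which is equivalently the covariance of x under an exponentially tilted empirical distribution. The exact construction used in the talk may differ; this is an assumption for illustration.

```python
import numpy as np

def generalized_covariance(X, t):
    """Covariance of x under the exponentially tilted empirical
    distribution with tilt direction t; at t = 0 this is the ordinary
    covariance, and nonzero t supplies the extra moment information
    that can replace higher-order tensors (sketch)."""
    w = np.exp(X @ t)
    w = w / w.sum()           # tilted weights, summing to 1
    mean_t = w @ X            # tilted mean
    Xc = X - mean_t
    return (Xc * w[:, None]).T @ Xc  # tilted covariance

X = np.random.randn(1000, 5)
t = 0.1 * np.random.randn(5)
print(generalized_covariance(X, t).shape)  # (5, 5)
# Sanity check: t = 0 recovers the usual (biased) covariance
print(np.allclose(generalized_covariance(X, np.zeros(5)), np.cov(X.T, bias=True)))
```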
02:30 -- 02:45
02:45 -- 03:00
03:00 -- 05:00