Thesis Advisor: | Trevor Darrell, MIT |
Committee: | Bill Freeman, MIT |
Michael Collins, MIT | |
Shimon Ullman, Weizmann Institute of Science |
The right measure of similarity between examples is important in many areas of computer science. In particular it is a critical component in example- based learning methods. Similarity is commonly defined in terms of a conventional distance function, but such a definition does not necessarily capture the inherent meaning of similarity, which tends to depend on the underlying task. We develop an algorithmic approach to learning similarity from examples of what objects are deemed similar according to the task-specific notion of similarity at hand, as well as optional negative examples. Our learning algorithm constructs, in a greedy fashion, an encoding of the data. This encoding can be seen as an embedding into a space, where a weighted Hamming distance is correlated with the unknown similarity. This allows us to predict when two previously unseen examples are similar and, importantly, to efficiently search a very large database for examples similar to a query.
This approach is tested on a set of standard machine learning benchmark problems. The model of similarity learned with our algorithm provides and improvement over standard example-based classification and regression. We also apply this framework to problems in computer vision: articulated pose estimation of humans from single images, articulated tracking in video, and matching image regions subject to generic visual similarity.
This chapter defines some technical concepts, most importantly the notion of similarity we want to model, and provides a brief overview of the contributions of the thesis.
Among the topics covered in this chapter: example-based classification and regression, previous work on learning distances and (dis)similarities (such as MDS), and algorithms for fast search and retrieval, with emphasis on locality sensitive hashing (LSH).
This is the main algorithmic "meat" of the thesis: this chapter describes three algorithms for learning an embedding of the data into a weighted Hamming space, with the objective for L1 distance there to reflect the underlying similarity. The algorithms are:
The underlying idea of all three algorithms is the same: build an embedding that, based on training examples of similar pairs, maps two similar objects close to each other (with high probability). At the same time, there is an objective to control for "spread": the probability of arbitrary two objects (in particular of dissimilar pairs of objects, if examples of such pairs are available) to be close in the embedding space should be low.
This chapter also describes results of an evaluation of the proposed algorithms on seven benchmarks data sets from UCI and Delve repositories.
An application of the ideas developed in previous chapters to the problem of pose estimation: inferring the articulated body pose (e.g. the 3D positions of key joints, or values of joint angles) from a single, monocular image containing a person.
In a tracking scenario, a sequence of views, rather than a single view of a person, is available. The motion provides additional cues, which are typically used in a probabilistic framework. In this chapter we show how similarity-based algorithms have been used to improve accuracy and speed of two articulated tracking systems: a general motion tracker and a motion-driven animation system focusing on swing dancing.
An important notion of similarity that is naturally conveyed by examples is the visual similarity of image regions. In this chapter we focus on a particular definition of such similarity, namely invariance under rotation and slight shift. We show how the machinery developed in Chapter 3 allows us to improve matching performance for two popular representations of image patches.