Learning Visual Representations Via Language

Project Abstract/Statement of Work:

Currently, most computer vision systems rely on large labeled datasets such as ImageNet or COCO for training. This paradigm has been hugely successful, but the necessity of employing human annotators limits the scale of data on which we can train.

Recent developments in natural language processing such as BERT have shown that large quantities of raw textual data downloaded from the web can be used to learn high-quality representations of text that transfer to many downstream language tasks.

We aim to build on these recent advances in NLP and use large quantities of image and text data to jointly learn representations of images and text that transfer to many downstream vision and vision+language tasks.

We believe that the supervisory signal provided by language will be particularly useful for learning visual representations that generalize to the long tail of objects and that capture the compositional structure of scenes. If rare or unusual objects appear in images, they are likely to be mentioned in the corresponding text; this provides natural supervision for objects in the long tail of the category distribution, supervision that is not available when training on datasets labeled with fixed categories, such as ImageNet or COCO. Text associated with images often describes properties of objects or relationships between objects; as such, it provides natural supervision for the compositional structure of complex images, without resorting to expensive and explicit manual annotation of compositional structure as in datasets such as Visual Genome.

We imagine that language supervision will be useful in both weakly and strongly supervised regimes. Compared to traditional human annotation pipelines that rely on complex category hierarchies and marking exact segmentation hierarchies, we believe that asking people to annotate images with language will require orders of magnitude less annotation time per image, resulting in cost savings.