LifeQA: Holistic Visual-Linguistic Scene Understanding for Real-Life Question Answering
Project Abstract/Statement of Work:
The main goal of our project is to lay the foundations for a new generation of in-home multimodal question answering systems, which can answer day-to-day questions by jointly leveraging language and vision. At the core of our approach are scene semantic graphs, a novel holistic representation of scenes that we propose, which combines pixels and words into a common symbolic space. Semantic graphs are abstractions whose nodes represent instances or entities, while the edges represent relationships. For example, “person cutting cake with knife” can be represented by a graph that has four vertices (“person”, “cutting”, “knife”, “cake”) and three edges (“person-cutting”, “cutting-knife”, “cutting-cake”). The edges connect “cutting” with its components—“person” is the subject, “knife” is the instrument, and “cake” is the patient. We note that because the edges can represent arbitrary relations, such graphs are expressive enough to capture a wide range of semantics, be they causal, spatial, or temporal. We will develop algorithms to convert video scenes into semantic graphs that represent each scene at a semantic and cognitive level.
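As a minimal sketch (our illustration, not the project's implementation), the example above can be encoded as a small labeled graph: the action node “cutting” is linked to its components by edges labeled with their semantic roles.

```python
# Minimal sketch of a scene semantic graph for "person cutting cake with knife".
# The SemanticGraph class and role labels below are illustrative assumptions,
# not the project's actual data structures.
from dataclasses import dataclass, field


@dataclass
class SemanticGraph:
    nodes: set = field(default_factory=set)
    # (head, tail) -> relation label, e.g. ("cutting", "knife") -> "instrument"
    edges: dict = field(default_factory=dict)

    def add_edge(self, head: str, tail: str, relation: str) -> None:
        """Add an edge and register both endpoints as nodes."""
        self.nodes.update((head, tail))
        self.edges[(head, tail)] = relation


g = SemanticGraph()
g.add_edge("cutting", "person", "subject")    # who performs the action
g.add_edge("cutting", "knife", "instrument")  # what the action is done with
g.add_edge("cutting", "cake", "patient")      # what the action is done to

# Four vertices and three edges, matching the example in the text.
print(len(g.nodes), len(g.edges))
```

Because the edge labels are arbitrary strings, the same structure can carry causal, spatial, or temporal relations without changing the representation.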
This work will also produce a large dataset of real-life multiple-choice questions for training and evaluating our methods.