Spatially Coherent Interpretations of Videos Using Pattern Theory

Fillipe D.M. de Souza; Sudeep Sarkar; Anuj Srivastava; Jingyong Su

doi:10.1007/s11263-016-0913-6

Mathematics and Statistics

Research output: Contribution to journal › Article › peer-review

7 Scopus citations

Abstract

Activity interpretation in videos results not only in recognition or labeling of dominant activities, but also in semantic descriptions of scenes. Towards this broader goal, we present a combinatorial approach that assumes availability of algorithms for detecting and labeling objects and basic actions in videos, albeit with some errors. Given these uncertain labels and detected objects, we link them into interpretable structures using the domain knowledge, under the framework of Grenander’s general pattern theory. Here a semantic description is built using basic units, termed generators, that represent either objects or actions. These generators have multiple out-bonds, each associated with different types of domain semantics, spatial constraints, and image evidence. The generators combine, according to a set of pre-defined combination rules that capture domain semantics, to form larger configurations that represent video interpretations. This framework derives its representational power from flexibility in size and structure of configurations. We impose a probability distribution on the configuration space, with inferences generated using a Markov chain Monte Carlo-based simulated annealing process. The primary advantage of the approach is that it handles known challenges—appearance variabilities, errors in object labels, object clutter, simultaneous events, etc—without the need for exponentially-large (labeled) training data. Experimental results demonstrate its ability to successfully provide interpretations under clutter and the simultaneity of events. They show: (1) a performance increase of more than 30 % over other state-of-the-art approaches using more than 5000 video units from the Breakfast Actions dataset, and (2) an overall recall and precision improvement of more than 50 and 100 %, respectively, on the YouCook data set.

Original language	English
Pages (from-to)	5-25
Number of pages	21
Journal	International Journal of Computer Vision
Volume	121
Issue number	1
DOIs	https://doi.org/10.1007/s11263-016-0913-6
State	Published - Jan 1 2017

Keywords

Activity detection
Compositional approach
Graphical methods
Pattern theory

Cite this

@article{37f38c1760db4a8c86d6f39fe89edcca,
  title = "Spatially Coherent Interpretations of Videos Using Pattern Theory",
  abstract = "Activity interpretation in videos results not only in recognition or labeling of dominant activities, but also in semantic descriptions of scenes. Towards this broader goal, we present a combinatorial approach that assumes availability of algorithms for detecting and labeling objects and basic actions in videos, albeit with some errors. Given these uncertain labels and detected objects, we link them into interpretable structures using the domain knowledge, under the framework of Grenander{\textquoteright}s general pattern theory. Here a semantic description is built using basic units, termed generators, that represent either objects or actions. These generators have multiple out-bonds, each associated with different types of domain semantics, spatial constraints, and image evidence. The generators combine, according to a set of pre-defined combination rules that capture domain semantics, to form larger configurations that represent video interpretations. This framework derives its representational power from flexibility in size and structure of configurations. We impose a probability distribution on the configuration space, with inferences generated using a Markov chain Monte Carlo-based simulated annealing process. The primary advantage of the approach is that it handles known challenges---appearance variabilities, errors in object labels, object clutter, simultaneous events, etc---without the need for exponentially-large (labeled) training data. Experimental results demonstrate its ability to successfully provide interpretations under clutter and the simultaneity of events. They show: (1) a performance increase of more than 30 \% over other state-of-the-art approaches using more than 5000 video units from the Breakfast Actions dataset, and (2) an overall recall and precision improvement of more than 50 and 100 \%, respectively, on the YouCook data set.",
  keywords = "Activity detection, Compositional approach, Graphical methods, Pattern theory",
  author = "{de Souza}, {Fillipe D.M.} and Sudeep Sarkar and Anuj Srivastava and Jingyong Su",
  note = "Publisher Copyright: {\textcopyright} 2016, Springer Science+Business Media New York.",
  year = "2017",
  month = jan,
  day = "1",
  doi = "10.1007/s11263-016-0913-6",
  language = "English",
  volume = "121",
  pages = "5--25",
  journal = "International Journal of Computer Vision",
  issn = "0920-5691",
  number = "1",
}

TY - JOUR

T1 - Spatially Coherent Interpretations of Videos Using Pattern Theory

AU - de Souza, Fillipe D.M.

AU - Sarkar, Sudeep

AU - Srivastava, Anuj

AU - Su, Jingyong

PY - 2017/1/1

Y1 - 2017/1/1

AB - Activity interpretation in videos results not only in recognition or labeling of dominant activities, but also in semantic descriptions of scenes. Towards this broader goal, we present a combinatorial approach that assumes availability of algorithms for detecting and labeling objects and basic actions in videos, albeit with some errors. Given these uncertain labels and detected objects, we link them into interpretable structures using the domain knowledge, under the framework of Grenander’s general pattern theory. Here a semantic description is built using basic units, termed generators, that represent either objects or actions. These generators have multiple out-bonds, each associated with different types of domain semantics, spatial constraints, and image evidence. The generators combine, according to a set of pre-defined combination rules that capture domain semantics, to form larger configurations that represent video interpretations. This framework derives its representational power from flexibility in size and structure of configurations. We impose a probability distribution on the configuration space, with inferences generated using a Markov chain Monte Carlo-based simulated annealing process. The primary advantage of the approach is that it handles known challenges—appearance variabilities, errors in object labels, object clutter, simultaneous events, etc—without the need for exponentially-large (labeled) training data. Experimental results demonstrate its ability to successfully provide interpretations under clutter and the simultaneity of events. They show: (1) a performance increase of more than 30 % over other state-of-the-art approaches using more than 5000 video units from the Breakfast Actions dataset, and (2) an overall recall and precision improvement of more than 50 and 100 %, respectively, on the YouCook data set.

KW - Activity detection

KW - Compositional approach

KW - Graphical methods

KW - Pattern theory

UR - http://www.scopus.com/inward/record.url?scp=84973138787&partnerID=8YFLogxK

U2 - 10.1007/s11263-016-0913-6

DO - 10.1007/s11263-016-0913-6

M3 - Article

AN - SCOPUS:84973138787

SN - 0920-5691

VL - 121

SP - 5

EP - 25

JO - International Journal of Computer Vision

JF - International Journal of Computer Vision

IS - 1

ER -
