TY - JOUR
T1 - Pattern theory for representation and inference of semantic structures in videos
AU - De Souza, Fillipe D.M.
AU - Sarkar, Sudeep
AU - Srivastava, Anuj
AU - Su, Jingyong
N1 - Publisher Copyright:
© 2016 Elsevier B.V. All rights reserved.
PY - 2016/3/1
Y1 - 2016/3/1
N2 - We develop a combinatorial approach to represent and infer semantic interpretations of video contents using tools from Grenander's pattern theory. Semantic structures for video interpretation are formed using generators and bonds, the fundamental units of representation in pattern theory. Generators represent features and ontological items, such as actions and objects, whereas bonds are threads used to connect generators while respecting appropriate constraints. The resulting configurations of partially connected generators are termed scene interpretations. Our goal is to parse a given video data set into high-probability configurations. The probabilistic models are imposed using energies that have contributions from both data (classification scores) and prior information (ontological constraints, co-occurrence frequencies, etc.). The search for optimal configurations is based on an MCMC simulated-annealing algorithm that uses simple moves to propose configuration changes and to accept/reject them according to the posterior energy. In contrast to current graphical methods, this framework does not preselect a neighborhood structure but instead infers it from the data. The proposed framework obtains 20% higher classification rates than a purely machine-learning baseline, despite artificial insertion of low-level processing errors. In an uncontrolled scenario, video interpretation performance rates are found to be double those of the baseline.
AB - We develop a combinatorial approach to represent and infer semantic interpretations of video contents using tools from Grenander's pattern theory. Semantic structures for video interpretation are formed using generators and bonds, the fundamental units of representation in pattern theory. Generators represent features and ontological items, such as actions and objects, whereas bonds are threads used to connect generators while respecting appropriate constraints. The resulting configurations of partially connected generators are termed scene interpretations. Our goal is to parse a given video data set into high-probability configurations. The probabilistic models are imposed using energies that have contributions from both data (classification scores) and prior information (ontological constraints, co-occurrence frequencies, etc.). The search for optimal configurations is based on an MCMC simulated-annealing algorithm that uses simple moves to propose configuration changes and to accept/reject them according to the posterior energy. In contrast to current graphical methods, this framework does not preselect a neighborhood structure but instead infers it from the data. The proposed framework obtains 20% higher classification rates than a purely machine-learning baseline, despite artificial insertion of low-level processing errors. In an uncontrolled scenario, video interpretation performance rates are found to be double those of the baseline.
KW - Activity recognition
KW - Compositional approach
KW - Graphical methods
KW - Pattern theory
KW - Video interpretation
UR - http://www.scopus.com/inward/record.url?scp=84977951503&partnerID=8YFLogxK
U2 - 10.1016/j.patrec.2016.01.028
DO - 10.1016/j.patrec.2016.01.028
M3 - Article
AN - SCOPUS:84977951503
SN - 0167-8655
VL - 72
SP - 41
EP - 51
JO - Pattern Recognition Letters
JF - Pattern Recognition Letters
ER -