Document Type
Article
Publication Date
July 2024
Abstract
In the realm of computer vision, Group Activity Recognition (GAR) plays a vital role, with applications in sports video analysis, surveillance, and social scene understanding. This paper introduces Recognize Every Action Everywhere All At Once (REACT), a novel architecture designed to model complex contextual relationships within videos. REACT leverages transformer-based models to encode intricate contextual relationships, enhancing understanding of group dynamics. Its integrated Vision-Language Encoding efficiently captures spatiotemporal interactions and multi-modal information, enabling comprehensive scene understanding. An Action Localization module refines the joint understanding of text and video data, enabling precise bounding-box retrieval and strengthening the semantic link between textual descriptions and visual content. Actor-Specific Fusion strikes a balance between actor-specific details and contextual information, improving the model's specificity and robustness in recognizing group activities. Experimental results demonstrate REACT's superiority over state-of-the-art GAR approaches, achieving higher accuracy in recognizing and understanding group activities across diverse datasets. This work significantly advances group activity recognition, offering a robust framework for nuanced scene comprehension.
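The components named in the abstract (vision-language encoding, action localization, actor-specific fusion) can be pictured as a single pipeline. Below is a minimal PyTorch sketch of that structure; all module names, dimensions, and heads are illustrative assumptions for exposition, not the authors' implementation.

```python
# Conceptual sketch of a REACT-style pipeline (hypothetical names and shapes;
# not the authors' released code). Requires only PyTorch.
import torch
import torch.nn as nn

class VisionLanguageEncoder(nn.Module):
    """Fuses video-patch tokens and text tokens with a shared transformer."""
    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, video_tokens, text_tokens):
        # Concatenating modalities lets self-attention model cross-modal
        # spatiotemporal interactions in a single pass.
        fused = torch.cat([video_tokens, text_tokens], dim=1)
        return self.encoder(fused)

class ActorSpecificFusion(nn.Module):
    """Lets per-actor features attend to the fused scene context."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, actor_feats, context):
        attended, _ = self.cross_attn(actor_feats, context, context)
        # Residual connection balances actor-specific detail against context.
        return self.norm(actor_feats + attended)

class REACTSketch(nn.Module):
    def __init__(self, dim=256, num_activities=8):
        super().__init__()
        self.vl_encoder = VisionLanguageEncoder(dim)
        self.actor_fusion = ActorSpecificFusion(dim)
        self.box_head = nn.Linear(dim, 4)                     # per-actor box (action localization)
        self.activity_head = nn.Linear(dim, num_activities)   # group-activity logits

    def forward(self, video_tokens, text_tokens, actor_feats):
        context = self.vl_encoder(video_tokens, text_tokens)
        actors = self.actor_fusion(actor_feats, context)
        boxes = self.box_head(actors)                          # (B, actors, 4)
        activity = self.activity_head(actors.mean(dim=1))      # pooled group prediction
        return boxes, activity

# Toy shapes: batch of 2 clips, 16 video tokens, 8 text tokens, 6 actors.
model = REACTSketch()
boxes, activity = model(torch.randn(2, 16, 256),
                        torch.randn(2, 8, 256),
                        torch.randn(2, 6, 256))
print(boxes.shape, activity.shape)  # torch.Size([2, 6, 4]) torch.Size([2, 8])
```

The design point the sketch illustrates is the abstract's fusion trade-off: actor queries retain identity through the residual path while cross-attention injects shared scene and language context.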
Citation
Chappa, N., Nguyen, P., Dobbs, P. D., & Luu, K. (2024). REACT: Recognize Every Action Everywhere All At Once. Machine Vision and Applications, 35(4), 102. https://doi.org/10.1007/s00138-024-01561-z
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.
Keywords
Group activity recognition (GAR); Action retrieval; Vision-language modeling
Included in
Artificial Intelligence and Robotics Commons, Electrical and Computer Engineering Commons