Date of Graduation
5-2025
Document Type
Dissertation
Degree Name
Doctor of Philosophy in Engineering (PhD)
Degree Level
Graduate
Department
Electrical Engineering and Computer Science
Advisor/Mentor
Luu, Khoa
Committee Member
Nelson, Alexander H.
Second Committee Member
Andrews, David L.
Third Committee Member
Seo, Han-Seok
Fourth Committee Member
Djuric, Nemanja
Fifth Committee Member
Dobbs, Page D.
Keywords
foundational models; Group Activity Recognition; Multimodal Analysis; Self-supervised learning; vision-language modeling
Abstract
Group Activity Recognition (GAR) has emerged as a crucial problem in computer vision, with wide-ranging applications in sports analysis, video surveillance, and social scene understanding. Unlike traditional action recognition, which focuses on individuals, GAR requires understanding complex spatiotemporal relationships among multiple actors, their interactions, and the broader context in which these activities occur. This complexity introduces unique challenges, including the need for accurate actor localization, modeling of inter-actor dependencies, and understanding of the temporal evolution of group behaviors. While recent advances have shown promise, existing approaches often rely heavily on extensive annotations such as ground-truth bounding boxes and action labels, creating significant barriers to practical deployment and scalability. In addition, current methods struggle to capture the full spectrum of contextual factors that give meaning to group activities, particularly in real-world applications where multiple modalities of information are available.
We first introduce the Self-supervised Spatiotemporal Transformers Approach to Group Activity Recognition (SPARTAN) and Self-supervised Spatiotemporal Attention-based Social Group Activity Recognition (SoGAR), two novel self-supervised frameworks that significantly reduce annotation requirements while maintaining high recognition accuracy. SPARTAN leverages multi-resolution temporal views to capture varied motion characteristics, while SoGAR implements temporal collaborative learning and spatiotemporal cooperative learning strategies. These approaches achieve state-of-the-art performance on multiple benchmark datasets, including JRDB-PAR, NBA, and Volleyball, without requiring person-level annotations.
In the multimodal domain, we present three frameworks: Recognize Every Action Everywhere All At Once (REACT), which employs a Vision-Language Encoder for sparse spatial interactions; Hierarchical Attention-Flow Mechanism for Group-Activity Scene Graph Generation in Videos (HAtt-Flow), which introduces flow conservation principles into attention mechanisms; and LiDAR-Guided Hierarchical Transformer for Multi-Modal Group Activity Recognition (LiGAR), which utilizes LiDAR data as a structural backbone for processing visual and textual information. These frameworks demonstrate significant improvements in capturing cross-modal dependencies and spatial-temporal relationships.
Finally, we extend our research to healthcare applications, particularly the analysis of tobacco-related content on social media platforms. We develop the Public Health Advocacy Dataset: A Dataset of Tobacco Usage Videos from Social Media (PHAD), Flow-Attention Adaptive Semantic Hierarchical Fusion for Multimodal Tobacco Content Analysis (FLAASH), and A Large-scale 1M Dataset and Foundation Model for Tobacco Addiction Prevention (DEFEND) to address the limitations of current Large Language Models in processing video content.
Our experimental results show substantial improvements over existing methods, achieving gains of up to 10.6% in F1-score on JRDB-PAR and a 5.9% improvement in Mean Per Class Accuracy on the NBA dataset. This thesis advances the field of group activity recognition by reducing reliance on extensive annotations, improving multimodal integration, and demonstrating practical applications in public health monitoring. The proposed frameworks provide a foundation for future research in the automated understanding of complex group behaviors while addressing real-world challenges in data annotation and multimodal analysis.
Citation
Chappa, N. (2025). Vision-Based Multimodal Frameworks for Human Behavioral Analysis: Applications in Group Activity Understanding and Public Health. Graduate Theses and Dissertations. Retrieved from https://scholarworks.uark.edu/etd/5622