Date of Graduation

5-2025

Document Type

Dissertation

Degree Name

Doctor of Philosophy in Engineering (PhD)

Degree Level

Graduate

Department

Electrical Engineering and Computer Science

Advisor/Mentor

Luu, Khoa

Committee Member

Nelson, Alexander H.

Second Committee Member

Andrews, David L.

Third Committee Member

Seo, Han-Seok

Fourth Committee Member

Djuric, Nemanja

Fifth Committee Member

Dobbs, Page D.

Keywords

Foundational models; Group activity recognition; Multimodal analysis; Self-supervised learning; Vision-language modeling

Abstract

Group Activity Recognition (GAR) has emerged as a crucial problem in computer vision, with wide-ranging applications in sports analysis, video surveillance, and social scene understanding. Unlike traditional action recognition, which focuses on individual actors, GAR requires understanding complex spatiotemporal relationships among multiple actors, their interactions, and the broader context in which these activities occur. This complexity introduces unique challenges, including the need for accurate actor localization, modeling of inter-actor dependencies, and understanding of the temporal evolution of group behaviors. While recent advances have shown promise, existing approaches often rely heavily on extensive annotations such as ground-truth bounding boxes and action labels, creating significant barriers to practical deployment and scalability. Additionally, current methods struggle to capture the full spectrum of contextual factors that give meaning to group activities, particularly in real-world applications where multiple modalities of information are available.

We first introduce the Self-supervised Spatiotemporal Transformers Approach to Group Activity Recognition (SPARTAN) and Self-supervised Spatiotemporal Attention-based Social Group Activity Recognition (SoGAR), novel self-supervised frameworks that significantly reduce annotation requirements while maintaining high recognition accuracy. SPARTAN leverages multi-resolution temporal views to capture varied motion characteristics, while SoGAR implements temporal collaborative learning and spatiotemporal cooperative learning strategies. These approaches achieve state-of-the-art performance on multiple benchmark datasets, including JRDB-PAR, NBA, and Volleyball, without requiring person-level annotations.

In the multimodal domain, we present three frameworks: Recognize Every Action Everywhere All At Once (REACT), which employs a Vision-Language Encoder for sparse spatial interactions; Hierarchical Attention-Flow Mechanism for Group-Activity Scene Graph Generation in Videos (HAtt-Flow), which introduces flow conservation principles in attention mechanisms; and LiDAR-Guided Hierarchical Transformer for Multi-Modal Group Activity Recognition (LiGAR), which utilizes LiDAR data as a structural backbone for processing visual and textual information. These frameworks demonstrate significant improvements in capturing cross-modal dependencies and spatiotemporal relationships.

Finally, we extend our research to healthcare applications, particularly the analysis of tobacco-related content on social media platforms. To address the limitations of current Large Language Models in processing video content, we develop the Public Health Advocacy Dataset: A Dataset of Tobacco Usage Videos from Social Media (PHAD), Flow-Attention Adaptive Semantic Hierarchical Fusion for Multimodal Tobacco Content Analysis (FLAASH), and A Large-scale 1M Dataset and Foundation Model for Tobacco Addiction Prevention (DEFEND). Our experimental results show substantial improvements over existing methods, achieving gains of up to 10.6% in F1-score on JRDB-PAR and 5.9% in Mean Per Class Accuracy on the NBA dataset. This dissertation advances the field of group activity recognition by reducing reliance on extensive annotations, improving multimodal integration, and demonstrating practical applications in public health monitoring. The proposed frameworks provide a foundation for future research in the automated understanding of complex group behaviors while addressing real-world challenges in data annotation and multimodal analysis.
