Date of Graduation
12-2024
Document Type
Dissertation
Degree Name
Doctor of Philosophy in Engineering (PhD)
Degree Level
Graduate
Department
Electrical Engineering and Computer Science
Advisor/Mentor
Le, Ngan
Committee Member
Rainwater, Chase E.
Second Committee Member
Raj, Bhiksha
Third Committee Member
Gauch, John M.
Fourth Committee Member
Luu, Khoa
Keywords
Artificial Intelligence; Computer Science and Engineering; Computer Vision; Video Language Modeling; Video Understanding
Abstract
Video understanding is a critical domain in computer vision, focusing on the analysis of sequential visual data to extract meaningful spatio-temporal information for tasks such as action recognition, video captioning, video retrieval, and temporal action localization. Despite significant advances with spatio-temporal convolutional neural networks and attention-based video models, current methods face limitations, including inadequate representation of main actors, a lack of fine-grained modeling of relevant objects, and limited interpretability.
This thesis addresses these challenges by proposing novel approaches that enhance video understanding through modeling interactions among entities (actors and objects) and between entities and the environment, while improving interpretability in the decision-making process. We introduce the Actor-Aware Boundary Network (ABN) for temporal action proposal generation and detection, which explicitly factorizes scenes into actors and the environment and models their interactions via self-attention to improve action proposal precision. Building upon this, the Actor-Environment Interaction (AEI) network incorporates an Adaptive Attention Mechanism to select the main actors, improving performance by filtering out inessential actors. We further extend this approach with the Actors-Objects-Environment Interaction Network (AOE-Net), which integrates relevant objects by representing them through linguistic features from a vision-language model, achieving state-of-the-art results on multiple temporal action detection datasets spanning both exocentric and egocentric videos.
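
To make the interaction-modeling idea above concrete, the following minimal PyTorch sketch applies joint self-attention over a token set formed by per-actor features and a global environment feature. It is an illustrative approximation under assumed shapes and names (ActorEnvInteraction, feat_dim, num_actors), not the exact ABN/AEI architecture.

# Minimal sketch (assumed shapes/names) of modeling actor-environment
# interactions with self-attention; not the exact ABN/AEI implementation.
import torch
import torch.nn as nn

class ActorEnvInteraction(nn.Module):
    def __init__(self, feat_dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, actor_feats, env_feat):
        # actor_feats: (B, num_actors, feat_dim) pooled per-actor features
        # env_feat:    (B, feat_dim) global environment feature
        tokens = torch.cat([env_feat.unsqueeze(1), actor_feats], dim=1)
        out, _ = self.attn(tokens, tokens, tokens)   # joint self-attention
        tokens = self.norm(tokens + out)             # residual + norm
        # fused representation used downstream, e.g., for proposal scoring
        return tokens.mean(dim=1)                    # (B, feat_dim)

# Usage: fused = ActorEnvInteraction()(torch.randn(2, 4, 512), torch.randn(2, 512))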
Finally, we present the Hierarchical Entities Assembly (HENASY) framework, an interpretable video-language model that assembles dynamic scene entities from video frames in an end-to-end fashion, following a compositional perception approach inspired by human cognition. Trained with multi-grained contrastive losses that optimize both entity-level and video-level representations, HENASY enhances the alignment between visual entities generated from the input video and textual elements in the associated description. Experiments demonstrate that HENASY outperforms existing models on various benchmarks, including video retrieval and activity recognition via zero-shot transfer, while providing strong interpretability without relying on external detectors.
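
As a hedged illustration of the multi-grained contrastive objective mentioned above, the sketch below combines a symmetric video-caption InfoNCE term with a pooled entity-word term. Tensor names, the mean pooling, and the weighting factor alpha are assumptions for exposition and may differ from the exact losses used by HENASY.

# Sketch of a multi-grained contrastive objective: a video<->caption InfoNCE
# term plus a pooled entity<->word term. Shapes and pooling are assumptions,
# not the exact HENASY losses.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    # a, b: (B, D) L2-normalized embeddings; symmetric cross-entropy over the batch
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def multi_grained_loss(video_emb, text_emb, entity_emb, word_emb, alpha=0.5):
    # video_emb, text_emb:  (B, D) global video / caption embeddings
    # entity_emb, word_emb: (B, N, D) / (B, M, D) fine-grained embeddings
    video_emb, text_emb = F.normalize(video_emb, dim=-1), F.normalize(text_emb, dim=-1)
    video_level = info_nce(video_emb, text_emb)
    # pool fine-grained tokens to one vector per sample for a simple entity-level term
    ent = F.normalize(entity_emb.mean(dim=1), dim=-1)
    wrd = F.normalize(word_emb.mean(dim=1), dim=-1)
    entity_level = info_nce(ent, wrd)
    return video_level + alpha * entity_level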
Overall, this thesis advances the field of video understanding by developing more accurate, robust, and interpretable models, contributing to a deeper understanding of dynamic visual content and setting the stage for future exploration in modeling complex dynamics in videos.
Citation
Vo, K. (2024). Towards Comprehensive and Interpretable Video Understanding. Graduate Theses and Dissertations. Retrieved from https://scholarworks.uark.edu/etd/5557