Date of Graduation

12-2024

Document Type

Dissertation

Degree Name

Doctor of Philosophy in Engineering (PhD)

Degree Level

Graduate

Department

Electrical Engineering and Computer Science

Advisor/Mentor

Le, Ngan

Committee Member

Rainwater, Chase E.

Second Committee Member

Raj, Bhiksha

Third Committee Member

Gauch, John M.

Fourth Committee Member

Luu, Khoa

Keywords

Artificial Intelligence; Computer Science and Engineering; Computer Vision; Video Language Modeling; Video Understanding

Abstract

Video understanding is a critical domain in computer vision, focusing on the analysis of sequential visual data to extract meaningful spatio-temporal information for tasks such as action recognition, video captioning, video retrieval, and temporal action localization. Despite significant advances with spatio-temporal convolutional neural networks and attention-based video models, current methods face limitations, including inadequate representation of main actors, a lack of fine-grained modeling of relevant objects, and limited interpretability.
This thesis addresses these challenges by proposing novel approaches that enhance video understanding through modeling interactions among entities (actors and objects) and between entities and the environment, while improving the interpretability of the decision-making process. We introduce the Actor-Aware Boundary Network (ABN) for temporal action proposal generation and detection, which explicitly factorizes scenes into actors and the environment and models their interaction via self-attention to improve action proposal precision. Building upon this, the Actor-Environment Interaction (AEI) network incorporates an Adaptive Attention Mechanism to select the main actors, improving performance by discarding inessential actors. We further extend this approach with the Actors-Objects-Environment Interaction Network (AOE-Net), which integrates relevant objects, using a vision-language model to represent objects through linguistic features, and achieves state-of-the-art results on temporal action detection across multiple datasets, including both exocentric and egocentric videos.
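The entity-interaction idea can be illustrated with a minimal sketch, assuming PyTorch and hypothetical names and shapes (EntityInteraction, actor_feats, env_feat, a 256-dimensional feature space); it shows generic self-attention over actor and environment features, not the actual ABN/AEI/AOE-Net implementation.

```python
# Illustrative sketch only (not the thesis code): fusing actor and environment
# features with self-attention so each entity can attend to every other entity.
# All module names, shapes, and hyperparameters here are assumptions.
import torch
import torch.nn as nn

class EntityInteraction(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, actor_feats, env_feat):
        # actor_feats: (B, num_actors, dim) per-actor features (e.g., RoI-pooled)
        # env_feat:    (B, dim) global environment feature for the video snippet
        tokens = torch.cat([env_feat.unsqueeze(1), actor_feats], dim=1)
        # Self-attention captures actor-actor and actor-environment relations.
        fused, _ = self.attn(tokens, tokens, tokens)
        return self.proj(fused.mean(dim=1))  # (B, dim) interaction feature

# Usage: 2 clips, 5 detected actors each, 256-d features.
out = EntityInteraction()(torch.randn(2, 5, 256), torch.randn(2, 256))
print(out.shape)  # torch.Size([2, 256])
```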
Finally, we present the Hierarchical Entities Assembly (HENASY) framework, an interpretable video-language model that assembles dynamic scene entities from video frames in an end-to-end fashion, following a compositional perception approach inspired by human cognition. Trained with multi-grained contrastive losses that optimize both entity-level and video-level representations, HENASY strengthens the alignment between visual entities extracted from the input video and textual elements in the associated description. Experiments demonstrate that HENASY outperforms existing models on various benchmarks, including video retrieval and activity recognition via zero-shot transfer, while providing strong interpretability without relying on external detectors.
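As a rough illustration of multi-grained contrastive training, the sketch below combines a video-level and an entity-level InfoNCE term. It assumes PyTorch, hypothetical embeddings (video_emb, text_emb, entity_emb, phrase_emb), and an arbitrary loss weight; it is a generic two-level contrastive objective, not HENASY's actual loss.

```python
# Illustrative sketch only: a video-level plus entity-level contrastive loss.
# Shapes, pairing scheme, and the weight w_entity are assumptions.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    # a, b: (N, dim) embeddings; matched pairs share the same row index.
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def multi_grained_loss(video_emb, text_emb, entity_emb, phrase_emb, w_entity=0.5):
    # Video-level term aligns whole-video and caption embeddings; the
    # entity-level term aligns scene-entity embeddings with caption phrases.
    video_emb, text_emb = F.normalize(video_emb, dim=-1), F.normalize(text_emb, dim=-1)
    entity_emb, phrase_emb = F.normalize(entity_emb, dim=-1), F.normalize(phrase_emb, dim=-1)
    return info_nce(video_emb, text_emb) + w_entity * info_nce(entity_emb, phrase_emb)

loss = multi_grained_loss(torch.randn(8, 512), torch.randn(8, 512),
                          torch.randn(8, 512), torch.randn(8, 512))
```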
Overall, this thesis advances the field of video understanding by developing more accurate, robust, and interpretable models, contributing to a deeper understanding of dynamic visual content and setting the stage for future exploration in modeling complex dynamics in videos.
