Date of Graduation


Document Type


Degree Name

Master of Science in Computer Engineering (MSCmpE)

Degree Level



Computer Science & Computer Engineering


Ngan Le

Committee Member

El-Shenawee, Magda

Second Committee Member

Gauch, John


Computer Vision, Video Analysis


This thesis introduces an approach to video comprehension that simulates human perceptual mechanisms and builds a coherent, interpretable narrative representation of video content. At its core are a Visual-Linguistic (VL) feature for interpretable video representation and an adaptive attention mechanism (AAM) that focuses only on the main actors or relevant objects while modeling their interactions. Taking cues from the way humans decompose scenes into visual and non-visual constituents, the proposed VL feature characterizes a scene through three modalities: (i) the global visual environment, which provides broad context for the scene; (ii) local visual key entities, which capture the pivotal elements within the video; and (iii) linguistic scene elements, which add semantically relevant language-based information. By integrating these modalities, the VL representation offers a comprehensive, diverse, and explainable view of video content, bridging visual perception and linguistic description. We model these interactions with a multimodal representation network that has two main components: a perception-based multimodal representation (PMR) module and a boundary-matching module (BMM). The AAM operates inside the PMR, where it selects the main actors or relevant objects and models the relations among them. The PMR represents each video segment by combining visual and linguistic features: visual features describe the main actors and their immediate surroundings, while language attributes obtained from an image-text model convey information about relevant objects. The BMM takes a sequence of these visual-linguistic features as input and generates temporal action proposals.
Extensive experiments on the ActivityNet-1.3 and THUMOS-14 datasets show that the proposed network outperforms previous state-of-the-art methods. It performs strongly and adapts well to both Temporal Action Proposal Generation (TAPG) and temporal action detection, which supports the effectiveness of the approach. To assess the robustness and efficiency of the network, we conducted an additional ablation study on egocentric videos using the EPIC-KITCHENS 100 dataset. This underscores the network's potential to advance the field of video comprehension. In conclusion, this thesis outlines a promising path toward interpretable video comprehension models. By emulating human perceptual processes and combining multimodal features, it contributes a fresh perspective to the discipline and opens the door to more advanced and intuitive video comprehension systems.
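As a rough illustration of the three-part VL feature described in the abstract, the sketch below fuses a global environment vector, an attention-pooled set of entity vectors, and a linguistic vector for one video segment. It is not the thesis implementation: the function names (`attend`, `fuse_snippet`), the use of scaled dot-product attention with the global feature as the query, and fusion by concatenation are all assumptions made for this example, and the abstract does not specify how the AAM is actually computed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(query, entities):
    """Pool entity vectors, weighting each by its similarity to the query.

    Assumption: scaled dot-product attention stands in for the AAM.
    Entity vectors must have the same dimension as the query.
    """
    if not entities:
        # No detected entities: contribute a zero vector and no weights.
        return [0.0] * len(query), []
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, e) / scale for e in entities])
    pooled = [
        sum(w * e[i] for w, e in zip(weights, entities))
        for i in range(len(query))
    ]
    return pooled, weights


def fuse_snippet(global_feat, entity_feats, text_feat):
    """Build one segment-level feature by concatenating the three modalities.

    Assumption: plain concatenation is used as the fusion step.
    """
    pooled, weights = attend(global_feat, entity_feats)
    return global_feat + pooled + text_feat, weights


# Toy example: a 2-d global feature, two 2-d entity features, a 1-d text feature.
fused, weights = fuse_snippet([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [0.5])
print(len(fused), weights)
```

In this toy setup the entity that aligns with the global feature receives the larger attention weight, which mirrors the idea of concentrating on the most relevant actors or objects. A sequence of such fused vectors, one per segment, is the kind of input the abstract says the BMM consumes.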