Date of Graduation


Document Type


Degree Name

Master of Science in Computer Science (MS)

Degree Level



Computer Science & Computer Engineering


Ngan Le

Committee Member

Khoa Luu

Second Committee Member

Lu Zhang


Computer Vision, Video Understanding


This thesis presents a novel approach to video understanding by emulating human perceptual processes and creating an explainable and coherent storytelling representation of video content. Central to this approach is the development of a Visual-Linguistic (VL) feature for an interpretable video representation and the creation of a Transformer-in-Transformer (TinT) decoder for modeling intra- and inter-event coherence in a video. Drawing inspiration from the way humans comprehend scenes by breaking them down into visual and non-visual components, the proposed VL feature models a scene through three distinct modalities. These include: (i) a global visual environment, providing a broad contextual understanding of the scene; (ii) local visual main agents, focusing on key elements or entities in the video; and (iii) linguistic scene elements, incorporating semantically relevant language-based information for a comprehensive understanding of the scene. By integrating these multimodal features, the VL representation offers a rich, diverse, and interpretable view of video content, effectively bridging the gap between visual perception and linguistic description. To ensure the temporal coherence and narrative structure of the video content, we introduce an autoregressive Transformer-in-Transformer (TinT) decoder. The TinT design consists of a nested architecture where the inner transformer models the intra-event coherency, capturing the semantic connections within individual events, while the outer transformer models the inter-event coherency, identifying the relationships and transitions between different events. This dual-layer transformer structure facilitates the generation of accurate and meaningful video descriptions that reflect the chronological and causal links in the video content. Another crucial aspect of this work is the introduction of a novel VL contrastive loss function. This function plays an essential role in ensuring that the learned embedding features are semantically consistent with the video captions. By aligning the embeddings with the ground truth captions, the VL contrastive loss function enhances the model's performance and contributes to the quality of the generated descriptions. The efficacy of our proposed methods is validated through comprehensive experiments on popular video understanding benchmarks. The results demonstrate superior performance in terms of both the accuracy and diversity of the generated captions, highlighting the potential of our approach in advancing the field of video understanding. In conclusion, this thesis provides a promising pathway toward building explainable video understanding models. By emulating human perception processes, leveraging multimodal features, and incorporating a nested transformer design, we contribute a new perspective to the field, paving the way for more advanced and intuitive video understanding systems in the future.