Date of Graduation

12-2024

Document Type

Thesis

Degree Name

Bachelor of Science in Computer Science

Degree Level

Undergraduate

Department

Computer Science and Computer Engineering

Advisor/Mentor

Luu, Khoa

Committee Member

Gauch, John

Second Committee Member

Gauch, Susan

Abstract

Video Question Answering (VideoQA) focuses on developing models capable of engaging in natural language conversations about video content. Current state-of-the-art models typically analyze videos frame by frame, a process that is both computationally and memory intensive. Integrating the Atkinson-Shiffrin memory model with Video Language Models has demonstrated potential for enhancing video understanding capabilities. Reducing the number of frames processed by the model is a crucial operation in this approach, and it is achieved by a memory consolidation algorithm. This algorithm condenses a video sequence into a small set of representative frames that capture the essence of the video content. However, due to the complexity of events in videos, selecting keyframes efficiently and effectively remains a challenge. This work aims to address this challenge by comparing video understanding capabilities across different memory consolidation algorithms. Specifically, we present experiments evaluating simple but effective memory consolidation algorithms on the ActivityNet-QA dataset. Through this analysis, we aim to construct an optimal memory consolidation algorithm to improve model performance in VideoQA tasks.
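
To make the idea of memory consolidation concrete, the following is a minimal illustrative sketch, not the algorithm evaluated in this thesis. It assumes each frame is already represented by a feature vector and repeatedly averages the most similar pair of adjacent frames until a target number of frames remains. The names `cosine` and `consolidate`, and the choice of cosine similarity with simple averaging, are assumptions made for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def consolidate(frames, target):
    """Greedily merge the most similar adjacent frames until `target` remain.

    `frames` is a list of per-frame feature vectors (hypothetical input;
    a real system would take them from a visual encoder).
    """
    if target < 1:
        raise ValueError("target must be at least 1")
    memory = [list(f) for f in frames]  # copy so the input is not modified
    while len(memory) > target:
        # Similarity of every adjacent pair (i, i + 1).
        sims = [cosine(memory[i], memory[i + 1]) for i in range(len(memory) - 1)]
        # Index of the most similar pair; ties resolve to the earliest pair.
        i = max(range(len(sims)), key=sims.__getitem__)
        # Replace the pair with its element-wise average.
        merged = [(x + y) / 2 for x, y in zip(memory[i], memory[i + 1])]
        memory[i:i + 2] = [merged]
    return memory


if __name__ == "__main__":
    frames = [[1, 0], [1, 0], [0, 1], [0, 1]]
    print(consolidate(frames, 2))
```

Merging only adjacent frames preserves temporal order in the condensed sequence; other consolidation strategies, such as uniform sampling or clustering, trade this property off differently.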

Keywords

Video Understanding; Multimodal Large Language Models; Video Question Answering; Atkinson-Shiffrin Memory Model
