Mentor

Khoa Luu

Keywords

Artificial Intelligence, Machine Learning, Large Language Model

Abstract

Video Question Answering (VideoQA) is a field of research focused on developing models that can hold natural conversations with humans about the content of videos. Currently, the most successful approaches analyze videos frame by frame, which is computationally and memory-intensive. To imitate human memory, the Atkinson-Shiffrin memory model can be used to structure a Vision-Language Model's video understanding: incoming frames populate a short-term memory, and a memory consolidation algorithm reduces the number of frames the model must process by determining which keyframes to transfer from short-term to long-term memory. However, because events in videos are complex, this consolidation step must preserve critical information through efficient and well-chosen operations. This paper compares video understanding capabilities by analyzing memory consolidation algorithms. Specifically, we present experiments evaluating simple but effective memory consolidation operations on the ActivityNet-QA dataset in order to construct an optimal memory consolidation process.
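For concreteness, one simple consolidation operation of the kind the abstract describes is to repeatedly merge the most similar adjacent frames until short-term memory fits the long-term capacity. The sketch below is illustrative only: the function name `consolidate_memory`, the feature shapes, and the mean-merge rule are assumptions for exposition, not the paper's exact method.

```python
import numpy as np

def consolidate_memory(frames: np.ndarray, capacity: int) -> np.ndarray:
    """Greedily merge the most similar adjacent frame features until the
    buffer fits within the long-term memory capacity.

    frames: (N, D) array of per-frame feature vectors (hypothetical input).
    capacity: maximum number of consolidated frames to retain.
    """
    memory = [f for f in frames]
    while len(memory) > capacity:
        # Cosine similarity between each adjacent pair of frames.
        sims = [
            np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
            for a, b in zip(memory[:-1], memory[1:])
        ]
        # The most similar adjacent pair is the most redundant one.
        i = int(np.argmax(sims))
        # Merge the pair into its mean, freeing one memory slot.
        memory[i] = (memory[i] + memory[i + 1]) / 2.0
        del memory[i + 1]
    return np.stack(memory)

# Example: consolidate 64 short-term frame features into 8 long-term slots.
short_term = np.random.rand(64, 512).astype(np.float32)
long_term = consolidate_memory(short_term, capacity=8)
print(long_term.shape)  # (8, 512)
```

Merging adjacent frames (rather than simply dropping low-scoring ones) is one plausible design choice because it preserves temporal order while discarding redundancy; the experiments described in the abstract compare operations of this kind.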
