Author ORCID Identifier

https://orcid.org/0000-0003-1517-1382

Date of Graduation

12-2025

Document Type

Dissertation

Degree Name

Doctor of Philosophy in Computer Science (PhD)

Degree Level

Graduate

Department

Computer Science & Computer Engineering

Advisor/Mentor

Luu, Khoa

Committee Member

Yilmaz, Alper

Second Committee Member

Gauch, John

Third Committee Member

Gauch, Susan

Keywords

Video Modeling; Discriminative Modeling; Multimodal Alignment

Abstract

Video modeling stands at the core of modern computer vision, enabling progress in domains such as surveillance, autonomous driving, and instructional assistance. Yet the complexity of spatiotemporal dynamics, multimodal integration, and the need for scalable and generalizable models present significant challenges. This dissertation addresses these issues from three complementary perspectives: discriminative modeling, multimodal (vision and language) alignment, and generative approaches, contributing new methods, datasets, and paradigms for advancing video understanding. In the discriminative setting, we propose a domain-adaptive framework for crowd counting that employs entropy minimization and adversarial learning to improve cross-domain generalization, and we introduce a single-stage global association method for multi-camera multi-object tracking in autonomous driving, formulated via Fractional Optimal Transport Assignment (FOTA), which significantly reduces identity-switch (IDSwitch) errors and improves tracking accuracy. In the multimodal setting, we present the GroOT dataset, a large-scale benchmark for grounded multiple object tracking with diverse prompts, and propose Type-to-Track, an intuitive language-guided tracking paradigm supported by the efficient, class-agnostic MENDER framework. In the generative setting, we advance video modeling with Tracking-by-Diffusion, which reformulates object tracking as next-frame reconstruction in latent diffusion space to unify prior tracking paradigms; DINTR, a diffusion-based interpolation operator that improves temporal modeling efficiency by avoiding unnecessary noise mappings; and an Autoregressive Visual Action Hypergraph, which captures multi-entity interactions through directed hypergraphs, enabling structured anticipation in procedural understanding tasks. Collectively, these contributions enhance robustness, interpretability, and versatility across video understanding problems, bridging discriminative, multimodal, and generative perspectives for more reliable deployment in real-world scenarios.
