Author ORCID Identifier:
Date of Graduation
12-2025
Document Type
Dissertation
Degree Name
Doctor of Philosophy in Computer Science (PhD)
Degree Level
Graduate
Department
Computer Science & Computer Engineering
Advisor/Mentor
Luu, Khoa
Committee Member
Yilmaz, Alper
Second Committee Member
Gauch, John
Third Committee Member
Gauch, Susan
Keywords
Video Modeling; Discriminative Modeling; Multimodal Alignment
Abstract
Video modeling stands at the core of modern computer vision, enabling progress in domains such as surveillance, autonomous driving, and instructional assistance. Yet the complexity of spatiotemporal dynamics, the demands of multimodal integration, and the need for scalable, generalizable models present significant challenges. This dissertation addresses these issues from three complementary perspectives: discriminative modeling, multimodal (vision + language) alignment, and generative approaches, contributing new methods, datasets, and paradigms for advancing video understanding. In the discriminative setting, we propose a domain-adaptive framework for crowd counting that employs entropy minimization and adversarial learning to improve cross-domain generalization, and we introduce a single-stage global association method for multi-camera multi-object tracking in autonomous driving, formulated via Fractional Optimal Transport Assignment (FOTA), that significantly reduces ID-switch errors and improves tracking accuracy. In the multimodal setting, we present GroOT, a large-scale benchmark dataset for grounded multiple object tracking with diverse prompts, and propose Type-to-Track, an intuitive language-guided tracking paradigm supported by the efficient, class-agnostic MENDER framework. In the generative setting, we advance video modeling with Tracking-by-Diffusion, which reformulates object tracking as next-frame reconstruction in latent diffusion space to unify prior tracking paradigms; DINTR, a diffusion-based interpolation operator that improves temporal modeling efficiency by avoiding unnecessary noise mappings; and an Autoregressive Visual Action Hypergraph, which captures multi-entity interactions through directed hypergraphs, enabling structured anticipation in procedural understanding tasks.
Collectively, these contributions enhance robustness, interpretability, and versatility across video understanding problems, bridging discriminative, multimodal, and generative perspectives for more reliable deployment in real-world scenarios.
Citation
Nguyen, A. (2025). Discriminative and Generative Video Modeling. Graduate Theses and Dissertations. Retrieved from https://scholarworks.uark.edu/etd/5993