Author ORCID Identifier

https://orcid.org/0000-0003-1517-1382

Date of Graduation

12-2025

Document Type

Dissertation

Degree Name

Doctor of Philosophy in Computer Science (PhD)

Degree Level

Graduate

Department

Computer Science & Computer Engineering

Advisor/Mentor

Luu, Khoa

Committee Member

Yilmaz, Alper

Second Committee Member

Gauch, John

Third Committee Member

Gauch, Susan

Keywords

Video Modeling; Discriminative Modeling; Multimodal Alignment

Abstract

Video modeling stands at the core of modern computer vision, enabling progress in domains such as surveillance, autonomous driving, and instructional assistance. Yet the complexity of spatiotemporal dynamics, multimodal integration, and the need for scalable and generalizable models present significant challenges. This dissertation addresses these issues from three complementary perspectives: discriminative modeling, multimodal (vision and language) alignment, and generative approaches, contributing new methods, datasets, and paradigms for advancing video understanding. In the discriminative setting, we propose a domain-adaptive framework for crowd counting that employs entropy minimization and adversarial learning to improve cross-domain generalization, and we introduce a single-stage global association method for multi-camera multi-object tracking in autonomous driving, formulated via Fractional Optimal Transport Assignment (FOTA), which significantly reduces identity-switch (IDSwitch) errors and improves tracking accuracy. In the multimodal setting, we present the GroOT dataset, a large-scale benchmark for grounded multiple object tracking with diverse prompts, and propose Type-to-Track, an intuitive language-guided tracking paradigm supported by the efficient, class-agnostic MENDER framework. In the generative setting, we advance video modeling with Tracking-by-Diffusion, which reformulates object tracking as next-frame reconstruction in latent diffusion space to unify prior tracking paradigms; DINTR, a diffusion-based interpolation operator that improves temporal modeling efficiency by avoiding unnecessary noise mappings; and an Autoregressive Visual Action Hypergraph, which captures multi-entity interactions through directed hypergraphs, enabling structured anticipation in procedural understanding tasks. Collectively, these contributions enhance robustness, interpretability, and versatility across video understanding problems, bridging discriminative, multimodal, and generative perspectives for more reliable deployment in real-world scenarios.
