Date of Graduation

5-2025

Document Type

Thesis

Degree Name

Bachelor of Science in Computer Science

Degree Level

Undergraduate

Department

Computer Science and Computer Engineering

Advisor/Mentor

Le, Ngan

Committee Member

Rainwater, Chase

Second Committee Member

McCann, Roy A.

Third Committee Member

Gunderman, Anthony

Fourth Committee Member

Doretto, Gianfranco

Abstract

Multimodal learning aims to weave information from images, language, depth, and other sensors into one coherent representation, much as people naturally combine sight, speech, and sound. Progress toward that goal is slowed by three gaps: vision encoders that cannot balance crisp object boundaries with global context, 3D semantic maps that are computationally prohibitive for real-time, open-vocabulary queries, and vision-language-action pipelines that depend on large token pools with weak relational grounding.

We first introduce AerialFormer, a lightweight hybrid of convolutional and Transformer layers that captures long-range structure without sacrificing fine detail. On the large-scale iSAID benchmark it reaches 69.3% mean IoU, improving on the previous best by 2.1 points, and it also surpasses recent methods on Potsdam and LoveDA without extra computation.

We then introduce Open-Fusion, a real-time 3D semantic mapping system that incrementally builds a truncated signed distance function (TSDF) volume using region-level features extracted from a vision-language model. By storing open-vocabulary semantic embeddings in spatial memory, it enables interactive, language-driven queries such as locating objects directly from the 3D map, providing a practical foundation for semantic understanding in robotic environments.

Finally, we propose SlotVLA, a relation-centric visual tokenizer and policy that compresses each observation into a compact set of four interaction-focused slots, explicitly capturing functional object relationships. On ten LIBERO-Goal manipulation tasks, SlotVLA achieves 63% success with a single camera and 75% with an added wrist camera, a gain of 4 to 11 points over object-centric and pooled-token baselines, while sustaining 12–15 fps inference.

These three contributions show that explicit structural bias, language-aligned 3D semantics, and compact relational tokens can make multimodal perception and reasoning both faster and more accurate, offering a solid foundation for future work on understanding complex environments across space, time, and modality.

Keywords

Artificial Intelligence; Computer Science and Engineering; Computer Vision; Video Language Modeling; Vision Language Action Modeling
