Date of Graduation
5-2025
Document Type
Thesis
Degree Name
Bachelor of Science in Computer Science
Degree Level
Undergraduate
Department
Computer Science and Computer Engineering
Advisor/Mentor
Le, Ngan
Committee Member
Rainwater, Chase
Second Committee Member
McCann, Roy A.
Third Committee Member
Gunderman, Anthony
Fourth Committee Member
Doretto, Gianfranco
Abstract
Multimodal learning aims to weave information from images, language, depth, and other sensors into one coherent representation, much as people naturally combine sight, speech, and sound. Progress toward that goal is slowed by three gaps: vision encoders that cannot balance crisp object boundaries with global context; 3D semantic maps that are computationally prohibitive for real-time, open-vocabulary queries; and vision-language-action pipelines that depend on large token pools with weak relational grounding.
We first introduce AerialFormer, a lightweight hybrid of convolutional and Transformer layers that captures long-range structure without sacrificing fine detail. On the large-scale iSAID benchmark it reaches 69.3% mean IoU, improving on the previous best by 2.1 points, and it also surpasses recent methods on Potsdam and LoveDA without extra computation.
We then introduce Open-Fusion, a real-time 3D semantic mapping system that incrementally builds a TSDF volume using region-level features extracted from a vision-language model. By storing open-vocabulary semantic embeddings in spatial memory, it enables interactive and language-driven queries such as locating objects directly from the 3D map, providing a practical foundation for semantic understanding in robotic environments.
Finally, we propose SlotVLA, a relation-centric visual tokenizer and policy that compresses each observation into a compact set of four interaction-focused slots, explicitly capturing functional object relationships. On ten LIBERO-Goal manipulation tasks, SlotVLA achieves 63% success with a single camera and 75% when a wrist camera is added, an improvement of 4 to 11 points over object-centric or pooled-token baselines while sustaining 12–15 fps inference.
These three contributions show that explicit structural bias, language-aligned 3D semantics, and compact relational tokens can make multimodal perception and reasoning both faster and more accurate, offering a solid foundation for future work on understanding complex environments across space, time, and modality.
Keywords
Artificial Intelligence; Computer Science and Engineering; Computer Vision; Video Language Modeling; Vision Language Action Modeling
Citation
Hanyu, T. (2025). Multimodal Learning for Visual Perception and Robotic Action. Electrical Engineering and Computer Science Undergraduate Honors Theses. Retrieved from https://scholarworks.uark.edu/elcsuht/19