Date of Graduation

5-2025

Document Type

Thesis

Degree Name

Bachelor of Science in Computer Science

Degree Level

Undergraduate

Department

Computer Science and Computer Engineering

Advisor/Mentor

Le, Ngan

Committee Member

Rainwater, Chase

Second Committee Member

McCann, Roy A.

Third Committee Member

Gunderman, Anthony

Fourth Committee Member

Doretto, Gianfranco

Abstract

Multimodal learning aims to weave information from images, language, depth, and other sensors into one coherent representation, much as people naturally combine sight, speech, and sound. Progress toward that goal is slowed by three gaps: vision encoders that cannot balance crisp object boundaries with global context, 3D semantic maps that are computationally prohibitive for real-time, open-vocabulary queries, and vision-language-action pipelines that depend on large token pools with weak relational grounding.

We first introduce AerialFormer, a lightweight hybrid of convolutional and Transformer layers that captures long-range structure without sacrificing fine detail. On the large-scale iSAID benchmark it reaches 69.3% mean IoU, improving on the previous best by 2.1 points, and it also surpasses recent methods on Potsdam and LoveDA without extra computation.

We then introduce Open-Fusion, a real-time 3D semantic mapping system that incrementally builds a truncated signed distance function (TSDF) volume using region-level features extracted from a vision-language model. By storing open-vocabulary semantic embeddings in spatial memory, it enables interactive, language-driven queries such as locating objects directly from the 3D map, providing a practical foundation for semantic understanding in robotic environments.

Finally, we propose SlotVLA, a relation-centric visual tokenizer and policy that compresses each observation into a compact set of four interaction-focused slots, explicitly capturing functional object relationships. On ten LIBERO-Goal manipulation tasks, SlotVLA achieves 63% success with a single camera and 75% with an added wrist camera, a gain of 4 to 11 points over object-centric and pooled-token baselines, while sustaining 12–15 fps inference.

These three contributions show that explicit structural bias, language-aligned 3D semantics, and compact relational tokens can make multimodal perception and reasoning both faster and more accurate, offering a solid foundation for future work on understanding complex environments across space, time, and modality.

Keywords

Artificial Intelligence; Computer Science and Engineering; Computer Vision; Video Language Modeling; Vision Language Action Modeling
