Date of Graduation

12-2024

Document Type

Thesis

Degree Name

Bachelor of Science in Computer Science

Degree Level

Undergraduate

Department

Computer Science and Computer Engineering

Advisor/Mentor

Zhang, Lu

Committee Member

Gauch, Susan

Second Committee Member

Wu, Xintao

Abstract

Large language models (LLMs), including Google’s Gemini, OpenAI’s GPT series, and Meta’s Llama, have driven remarkable advances in artificial intelligence, achieving human-like performance on complex tasks across many fields. These transformer-based models are adept at processing and generating many kinds of textual information, enabling them to perform a wide variety of tasks. However, an important question remains about their actual capacity to grasp causal relationships: whether these models can truly differentiate between causal directions or simply respond based on learned patterns. This thesis evaluates that ability by testing LLMs on tasks designed to probe their handling of causal, anti-causal, and third-party reasoning. We conduct experiments with GPT-3.5, Llama 3, and Gemini Pro, comparing their performance on prompts that reflect different causal structures. Our findings reveal that, across all models, causal prompts yielded the lowest performance. For GPT-3.5 and Llama 3, third-party prompts achieved the best results, while Gemini Pro performed best on anti-causal prompts. These patterns suggest that the models may favor a “review-to-rating” strategy, summarizing a review’s content before inferring a rating. We hypothesize that this behavior stems from reinforcement learning from human feedback (RLHF), particularly the reward model’s alignment with human preference data. This RLHF phase likely guides the actor model, shaping responses that closely match human expectations rather than reflecting a genuine understanding of causal structure. This thesis thus highlights both the capabilities and the limitations of LLMs, suggesting that although these models appear proficient at human-aligned tasks, they may be reproducing patterns learned during training rather than understanding causal relationships.
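
For illustration, the following is a minimal sketch of how prompts reflecting the three causal structures might be framed for a review/rating task. The templates, the example review, and the rating are hypothetical stand-ins, not the thesis’s actual experimental prompts.

# Hypothetical prompt framings for a review/rating task, one per
# causal structure evaluated in the thesis (illustrative only).

REVIEW = "The battery died within a week and support never replied."
RATING = 1  # stars on a 1-5 scale

prompts = {
    # Causal: reason from the presumed cause (rating) to the effect (review).
    "causal": f"A customer gave this product {RATING} star(s). "
              "Write the review they most likely left.",
    # Anti-causal: reason from the effect (review) back to the cause (rating).
    "anti_causal": f"Here is a customer review: \"{REVIEW}\" "
                   "What star rating (1-5) did this customer most likely give?",
    # Third-party: reason about the review/rating pair as an outside observer.
    "third_party": f"A moderator reads this review: \"{REVIEW}\" "
                   "As a neutral third party, estimate the rating (1-5) it accompanies.",
}

for structure, prompt in prompts.items():
    print(f"[{structure}]\n{prompt}\n")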

Keywords

Large Language Models; GPT; Gemini; Causality
