Date of Graduation
12-2024
Document Type
Thesis
Degree Name
Bachelor of Science in Computer Science
Degree Level
Undergraduate
Department
Computer Science and Computer Engineering
Advisor/Mentor
Zhang, Lu
Committee Member
Gauch, Susan
Second Committee Member
Wu, Xintao
Abstract
Large language models (LLMs), including Google’s Gemini, OpenAI’s GPT series, and Meta’s Llama, have driven remarkable advances in artificial intelligence, achieving complex, human-like performance across many fields. These transformer-based models excel at processing and generating many kinds of textual information, enabling them to perform a wide variety of tasks. However, an important question remains about their actual capacity to grasp causal relationships: whether these models can truly differentiate between causal directions or simply respond based on learned patterns. This thesis examines that ability by evaluating LLMs on tasks designed to probe their understanding of causal, anti-causal, and third-party reasoning. We conduct experiments with GPT-3.5, Llama 3, and Gemini Pro, comparing their performance on prompts that reflect different causal structures. Our findings reveal that, across all models, causal prompts yielded the lowest performance. For GPT-3.5 and Llama 3, third-party prompts achieved the best results, while Gemini Pro performed best with anti-causal prompts. These patterns suggest that the models may favor a “review-to-rating” approach, summarizing a review’s content before inferring a rating. We conjecture that this behavior stems from reinforcement learning from human feedback (RLHF), particularly the reward model’s alignment with human preference data. This RLHF phase likely guides the Actor model, shaping responses that closely match human expectations rather than reflecting a genuine understanding of causal structure. The thesis thus highlights both the capabilities and the limitations of LLMs, suggesting that although these models appear proficient at human-aligned tasks, they may be responding to patterns learned during training rather than understanding causal relationships.
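A minimal sketch of the kind of evaluation the abstract describes, assuming a review-to-rating task: the same review is posed to a model under three framings loosely corresponding to the causal, anti-causal, and third-party structures compared in the study. The prompt wordings, the rating scale, and the example review are illustrative assumptions, not the thesis’s actual templates.

```python
# Hypothetical illustration of the three prompt framings; only the
# GPT-3.5 path is shown (the study also used Llama 3 and Gemini Pro).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

review = "The battery died after two days and customer support never replied."

# Three framings of the same query (wording is assumed, not from the thesis):
prompts = {
    # causal: reason from the hypothesized cause (the rating) to the review
    "causal": (f"A customer's star rating (1-5) determines the review they write. "
               f"Which rating would produce this review? Review: {review}"),
    # anti-causal: reason from the effect (the review) back to the rating
    "anti_causal": (f"Read the following review, then infer the star rating (1-5) "
                    f"the customer most likely gave. Review: {review}"),
    # third-party: an outside observer estimates the rating
    "third_party": (f"You are a third party shown a review written by someone else. "
                    f"Estimate the star rating (1-5) it corresponds to. Review: {review}"),
}

for name, prompt in prompts.items():
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reduce sampling noise so framings are comparable
    )
    print(name, "->", resp.choices[0].message.content.strip())
```

Accuracy under each framing could then be compared against gold ratings to reproduce the kind of causal-versus-anti-causal gap the abstract reports.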
Keywords
Large Language Models; GPT; Gemini; Causality; LLM
Citation
Bergin, J. (2024). An Empirical Study on the Capability of Large Language Models in Learning Causality. Electrical Engineering and Computer Science Undergraduate Honors Theses. Retrieved from https://scholarworks.uark.edu/elcsuht/2