Author ORCID Identifier:

https://orcid.org/0009-0008-5219-507X

Date of Graduation

5-2026

Document Type

Dissertation

Degree Name

Doctor of Philosophy in Computer Science (PhD)

Degree Level

Graduate

Department

Computer Science & Computer Engineering

Advisor/Mentor

Wu, Xintao

Committee Member

Zhang, Lu

Second Committee Member

Zhang, Shengfan

Third Committee Member

Gauch, Susan

Keywords

Differential Privacy; Machine Unlearning; Privacy; Trustworthy Machine Learning

Abstract

As machine learning models become increasingly integrated into data-driven decision-making, the protection of sensitive information throughout the model lifecycle is a paramount concern. As these models process and memorize sensitive, proprietary, or personal data, they risk leaking information through their outputs or internal states, especially in domains such as healthcare and finance. The protection of data in machine learning has thus become a crucial field of study. Within this paradigm, researchers have studied theoretical and application-oriented mechanisms for realizing privacy protections for various data formats. Nonetheless, privacy in machine learning still has many open problems, especially with the emergence of novel and complex algorithms. This dissertation aims to tackle such open problems, focusing on differential privacy and machine unlearning.

Primarily, we ask how rigorous privacy guarantees, such as differential privacy, can be applied to complex structured data, specifically graphs and tabular records, without the catastrophic loss of utility typically observed in decentralized or high-dimensional settings. We then transition to the challenge of data erasure, exploring whether regulations such as the right to be forgotten can be efficiently realized in large foundation models without the prohibitive computational costs of retraining. In this context, we also investigate whether white-box evaluation frameworks can better distinguish between a model's output shifts toward desirable responses and the actual structural erasure of memorized data. Furthermore, we examine the unique challenges of unlearning in multi-modal environments, where sensitive information may be embedded within the cross-modal interactions between text and vision.

To address these questions, we propose a comprehensive suite of methodologies to safeguard sensitive information across diverse data modalities:

1. We develop RGNN, a novel reconstruction-based privacy-aware graph neural network framework that allows each user to protect their data locally. Based on frequency estimation from randomized data, we develop reconstruction methods to approximate features and labels from perturbed data to enable effective model training while providing differential privacy guarantees.

2. We present and evaluate LDP-TabICL and GDP-TabICL, two frameworks to produce differentially private demonstrations for In-Context Learning (ICL) on tabular data with Large Language Models (LLMs). We utilize and evaluate standard differentially private mechanisms to protect tabular data used for ICL under strict privacy requirements.

3. We develop SPUL, a lightweight and resource-efficient framework for unlearning textual data in LLMs. With losses designed to enforce forgetting as well as utility preservation, SPUL learns prompt tokens that are prepended to a query to induce unlearning of specific training examples at inference time without updating pre-trained parameters.

4. We formulate an evaluation framework for textual unlearning methods to quantify true forgetting. Our white-box evaluation implements a membership inference attack by constructing meaningful features based on attention patterns to train a classifier that can detect traces of removed knowledge for multiple benchmark unlearning methods.

5. We develop CAGUL, a resource-efficient unlearning framework for combined textual and visual unlearning in Vision-Language Models (VLMs). We leverage cross-modal attention scores computed between the textual and visual modalities to identify the least important visual tokens, which are then used to encode unlearning signals while keeping pre-trained parameters frozen.
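To make the idea of frequency estimation from randomized data (item 1) concrete, the following is a minimal sketch of standard k-ary randomized response with an unbiased frequency estimator. It is a generic textbook mechanism shown only for illustration, not the RGNN reconstruction method itself; the function names `randomize` and `estimate_frequencies` and all parameter values are hypothetical.

```python
import math
import random
from collections import Counter


def randomize(value, domain, epsilon, rng):
    """Perturb one categorical value locally with k-ary randomized response.

    The true value is kept with probability p = e^eps / (e^eps + k - 1);
    otherwise a different value is reported uniformly at random.
    """
    k = len(domain)
    p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if rng.random() < p:
        return value
    others = [v for v in domain if v != value]
    return rng.choice(others)


def estimate_frequencies(reports, domain, epsilon):
    """Recover unbiased frequency estimates from the perturbed reports.

    Since E[observed_v] = f_v * p + (1 - f_v) * q, inverting gives
    f_v = (observed_v - q) / (p - q).
    """
    k = len(domain)
    e = math.exp(epsilon)
    p = e / (e + k - 1)   # probability of reporting the true value
    q = 1.0 / (e + k - 1)  # probability of reporting any specific other value
    n = len(reports)
    counts = Counter(reports)
    return {v: (counts[v] / n - q) / (p - q) for v in domain}
```

Each user would call `randomize` on their own value before sharing it, and the server would call `estimate_frequencies` on the collected reports. Individual estimates can fall slightly outside [0, 1] for small samples, so a practical pipeline would typically clip and renormalize them.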
