University of Arkansas, Fayetteville


Latent Semantic Analysis (LSA) is a matching technique that can recognize semantic relationships in data that ordinary techniques such as string matching cannot. This is especially valuable for data integration applications, like those of Acxiom, where data items are usually related by context rather than by a literal match. Although LSA has been shown to be 30% more effective at finding and ranking relevant pieces of information than existing string-by-string matching techniques (Deerwester et al., 1990; Dumais, 1995), its performance appears to be affected by the presence of shared words, or “noise”, in the data. The objective of this research is to study the influence of noise on LSA performance quantitatively and analytically, providing a basis for subsequent research to develop a noise-filtering method that improves LSA performance. Our research shows that shared terms degrade the performance of LSA in matching queries to documents from the same category and result in increased misclassification. In addition, shared terms change which document best matches the query.