Date of Graduation

5-2016

Document Type

Thesis

Degree Name

Bachelor of Science

Degree Level

Undergraduate

Department

Computer Science and Computer Engineering

Advisor

Li, Wing

Reader

Beavers, Merwin

Second Reader

Patitz, Matthew

Abstract

Heparin is a highly sulphated, negatively charged polysaccharide belonging to the glycosaminoglycan (GAG) family. It is widely used in medical treatment as an injectable anticoagulant. Although many heparin-binding proteins have been identified experimentally, many proteins have yet to be classified as heparin-binding or not. Many studies have aimed to predict heparin-binding patterns, or motifs, in the primary structure of proteins; XBBXBX and XBBBXXBX are two well-known examples. Despite intensive study, no model has yet emerged that reliably predicts whether proteins in the protein database are heparin-binding. The main objective of this study is to predict heparin-binding proteins from their amino acid sequence information. A supervised learning algorithm based on the support vector machine (SVM) is applied to two data sets of 70 proteins each, one of known heparin-binding proteins and one of known non-heparin-binding proteins. By adjusting the SVM parameters, several models are produced. These models are then used to classify proteins that were not used in training. The testing set contains 137 proteins, of which 104 are known to be heparin-binding and the remaining 33 are known to be non-heparin-binding. On the testing set, the models achieve ~75% accuracy in predicting heparin-binding proteins. On the complete data set, the model achieves ~87% accuracy. The current models use different combinations of XB patterns and biological metrics as features in a higher-dimensional vector space.
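As an illustration of how the two motifs named in the abstract could be turned into numeric features for a classifier, the following is a minimal sketch and not the method used in the thesis. It assumes that B stands for a basic residue (taken here as K, R, or H) and X for any non-basic residue; the abstract does not define these symbols. The names `BASIC`, `MOTIFS`, `motif_to_regex`, and `motif_features` are hypothetical.

```python
import re

# Assumption: B = basic residue (K, R, H), X = any non-basic residue.
# The abstract does not define these symbols; adjust BASIC as needed.
BASIC = "KRH"
MOTIFS = ("XBBXBX", "XBBBXXBX")


def motif_to_regex(motif, basic=BASIC):
    """Translate an XB pattern into a compiled regular expression."""
    classes = {"B": "[" + basic + "]", "X": "[^" + basic + "]"}
    body = "".join(classes[symbol] for symbol in motif)
    # Wrap in a lookahead so that overlapping occurrences are all counted.
    return re.compile("(?=(" + body + "))")


def motif_features(sequence):
    """Return one occurrence count per motif, in the order of MOTIFS."""
    seq = sequence.upper()
    return [len(motif_to_regex(motif).findall(seq)) for motif in MOTIFS]


if __name__ == "__main__":
    # Toy sequences for demonstration only, not real proteins.
    for toy in ("AKKARA", "AKKKAAKA", "GGGG"):
        print(toy, motif_features(toy))
```

Count vectors of this kind could be combined with other per-protein metrics and passed to an SVM implementation of choice. Which features and parameters the thesis actually uses is not stated in the abstract.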
