Scientific Journal of Information Engineering April 2015, Volume 5, Issue 2, PP.33-38
Research on Gene Expression Profiles Based on Principal Component and Cluster Analysis
Yunfei Guo 1#, Zhe Yin 1, 2
1. Mathematics Department, Yanbian University, 133002, China
2. Department of Information Management, Peking University, China
#Email: guoyunfei0413@sina.com
Abstract
In this paper, we removed a total of 1931 irrelevant genes and obtained 69 valid ones using the Bhattacharyya distance. By combining principal component analysis and cluster analysis, better classification factors were chosen, and gene labels were then determined. In addition, quasi-cancer groups were discovered during the clustering process, which may be of considerable significance for actual cancer diagnosis.
Keywords: Gene Labels; Bhattacharyya Distance; Principal Component Analysis; Cluster
1 INTRODUCTION
The advent of DNA micro-array technology has made broad patterns of gene expression, recorded simultaneously in a single experiment, available for data analysis (Fodor, 1997). In the past few months, several data sets have become publicly available on the Internet. These data sets present multiple challenges, including a large number of gene expression values per experiment (several thousands to tens of thousands) and a relatively small number of experiments (a few dozen). A recent paper on leukemia classification presents a feasibility study of diagnosis based solely on gene expression monitoring. In the present paper, we go further in this direction by applying principal component analysis and cluster analysis in combination.
2 EXTRACTION OF GENE EXPRESSION PROFILES INFORMATION
2.1 Removal of irrelevant genes
Our initial colon cancer data set consisted of 22 normal samples and 40 cancer samples, and each sample contained 2000 related genes. For the purposes of this research, the data were divided into 38 training samples and 25 test samples, as displayed in Figure 1.
[FIG. 1 GENE GROUP: normal samples (10 and 12) and cancer samples (25 and 15), divided into training data and test data]
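As a rough illustration of how irrelevant genes might be screened out with the Bhattacharyya distance, the sketch below scores each gene by the distance between its normal and cancer expression distributions. It assumes that each gene is approximately Gaussian within each class, and it runs on synthetic data with the same shape as the data set described above (22 normal and 40 cancer samples, 2000 genes). The function name `bhattacharyya_per_gene`, the synthetic data, and the choice to keep the 69 highest-scoring genes are assumptions made for illustration only; this is not necessarily the exact procedure used in this paper.

```python
import numpy as np

def bhattacharyya_per_gene(normal, cancer, eps=1e-12):
    """Univariate Gaussian Bhattacharyya distance for each gene (column).

    B = (mu1 - mu2)^2 / (4 (var1 + var2))
        + 0.5 * ln((var1 + var2) / (2 * sqrt(var1 * var2)))
    """
    mu1, mu2 = normal.mean(axis=0), cancer.mean(axis=0)
    # eps guards against a zero variance (a gene constant within a class)
    var1 = normal.var(axis=0, ddof=1) + eps
    var2 = cancer.var(axis=0, ddof=1) + eps
    mean_term = 0.25 * (mu1 - mu2) ** 2 / (var1 + var2)
    var_term = 0.5 * np.log((var1 + var2) / (2.0 * np.sqrt(var1 * var2)))
    return mean_term + var_term

# Synthetic stand-in for the real data (assumption): rows are samples,
# columns are genes.
rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(22, 2000))
cancer = rng.normal(0.0, 1.0, size=(40, 2000))
cancer[:, :69] += 3.0  # hypothetical: shift 69 genes so they differ by class

scores = bhattacharyya_per_gene(normal, cancer)
# Keep the 69 genes with the largest distance (illustrative cutoff).
keep = np.argsort(scores)[::-1][:69]
```

Genes whose score is close to zero behave almost identically in the two classes and are therefore candidates for removal, while genes with a large score separate normal from cancer samples more clearly.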
Some genes had very similar expression levels across all samples in the gene expression profiles, and these genes would not