Research on gene expression profiles based on principal component and cluster analysis


Scientific Journal of Information Engineering, April 2015, Volume 5, Issue 2, pp. 33-38, http://www.sjie.org

Research on Gene Expression Profiles Based on Principal Component and Cluster Analysis

Yunfei Guo 1#, Zhe Yin 1, 2

1. Mathematics Department, Yanbian University, 133002, China
2. Department of Information Management, Peking University, China

# Email: guoyunfei0413@sina.com

Abstract

In this paper, we removed a total of 1931 irrelevant genes and obtained 69 valid ones using the Bhattacharyya distance. By applying principal component analysis and cluster analysis together, better classification factors are chosen, and gene labels are then determined. In addition, quasi-cancer groups are discovered in the clustering process, which may be of considerable significance for actual cancer diagnosis.

Keywords: Gene Labels; Bhattacharyya Distance; Principal Component Analysis; Cluster

1 INTRODUCTION

The advent of DNA micro-array technology has brought to data analysis broad patterns of gene expression simultaneously recorded in a single experiment (Fodor, 1997). In the past few months, several data sets have become publicly available on the Internet. These data sets present multiple challenges, including a large number of gene expression values per experiment (several thousands to tens of thousands) and a relatively small number of experiments (a few dozen). A recent paper on leukemia classification presents a feasibility study of diagnosis based solely on gene expression monitoring. In the present paper, we go further in this direction and demonstrate that, by applying principal component analysis and cluster analysis together, better classification factors can be chosen and gene labels determined.

2 EXTRACTION OF GENE EXPRESSION PROFILES INFORMATION

2.1 Removing of irrelevant genes

Our initial colon cancer data set consisted of 22 normal samples and 40 cancer samples, and each sample contained 2000 related genes. For research purposes, the data were divided into 38 training samples and 25 test samples, as displayed in Figure 1.

[FIG. 1 GENE GROUP. Training data: Normal 12 + Cancer 25. Test data: Normal 10 + Cancer 15.]

For many genes in the gene expression profiles, the expression levels were very close across all samples. Such genes would not provide useful information to the research; on the other hand, they would increase the complexity of calculation. Therefore, these irrelevant genes must be removed.

1  2 to remove the irrelevant genes. However, variances were not 1   2 considered in this index. For this reason, we applied the Bhattacharyya distance (Duta, 2001) which is considered 2 1  1  2  1   12   2 2  both means and variance to remove irrelevant genes. B   ln   Bhattacharyya distance of 4  12   2 2  2  2 1 2  Golub used Signal-to-noise-ratio, that is: d 

each gene can be calculated with the Matlab7.0 software, as shown in Table 1. TABLE 1 DISTRIBUTION OF BHATTACHARYYA DISTANCE OF GENES Bhattacharyya distance [0,0.2] (0.2,0.3] (0.3,0.4) (0.4,0.6] (0.6,0.8]

Number of genes 1931 45 16 7 1

Percentage 96.55% 2.25% 0.8% 0.35% 0.05%

From Table 1, with the threshold set to 0.2, we can remove a total of 1931 irrelevant genes and obtain 69 valid ones.
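The filtering step can be illustrated with a minimal Python sketch. It is not the authors' Matlab code: the function names, the toy data and the use of sample variances are assumptions made here for illustration. It computes both the signal-to-noise index and the univariate Gaussian Bhattacharyya distance for each gene and keeps the genes whose distance exceeds the threshold.

```python
import math
import statistics


def signal_to_noise(x1, x2):
    """Golub-style index d = (mu1 - mu2) / (sigma1 + sigma2)."""
    mu1, mu2 = statistics.mean(x1), statistics.mean(x2)
    s1, s2 = statistics.stdev(x1), statistics.stdev(x2)
    return (mu1 - mu2) / (s1 + s2)


def bhattacharyya(x1, x2):
    """Bhattacharyya distance between two univariate Gaussian fits."""
    mu1, mu2 = statistics.mean(x1), statistics.mean(x2)
    v1, v2 = statistics.variance(x1), statistics.variance(x2)
    mean_term = 0.25 * (mu1 - mu2) ** 2 / (v1 + v2)
    var_term = 0.5 * math.log((v1 + v2) / (2.0 * math.sqrt(v1 * v2)))
    return mean_term + var_term


def select_genes(normal, cancer, threshold=0.2):
    """Return indices of genes whose distance exceeds the threshold.

    normal, cancer: lists of samples, each sample a list of gene values.
    """
    kept = []
    for g in range(len(normal[0])):
        x1 = [sample[g] for sample in normal]
        x2 = [sample[g] for sample in cancer]
        if bhattacharyya(x1, x2) > threshold:
            kept.append(g)
    return kept


# Toy data (hypothetical): 4 normal and 4 cancer samples, 3 genes.
# Gene 0 is identical across classes, gene 1 differs in mean,
# gene 2 has equal means but very different spread.
normal = [[1.0, 2.0, 5.0], [1.1, 2.2, 5.1], [0.9, 1.8, 4.9], [1.0, 2.0, 5.0]]
cancer = [[1.0, 4.0, 3.0], [1.1, 4.2, 7.0], [0.9, 3.8, 2.0], [1.0, 4.0, 8.0]]

kept = select_genes(normal, cancer, threshold=0.2)
gene2_snr = signal_to_noise([s[2] for s in normal], [s[2] for s in cancer])
print(kept, gene2_snr)
```

In this toy example gene 2 has a signal-to-noise index of essentially zero because the class means coincide, yet its Bhattacharyya distance is large because the variances differ, which is the motivation given above for preferring the distance.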

2.2 Principal component analysis

Principal component analysis of the 69 valid genes above was carried out with SPSS software, and the result is shown in Table 2.

TABLE 2 VARIANCE TABLE (Total Variance Explained)

| Component | Initial Eigenvalues: Total | Initial Eigenvalues: % of Variance | Initial Eigenvalues: Cumulative % | Extraction Sums of Squared Loadings: Total | Extraction Sums of Squared Loadings: % of Variance | Extraction Sums of Squared Loadings: Cumulative % |
|---|---|---|---|---|---|---|
| 1 | 36.286 | 52.589 | 52.589 | 36.286 | 52.589 | 52.589 |
| 2 | 7.222 | 10.466 | 63.055 | 7.222 | 10.466 | 63.055 |
| 3 | 5.016 | 7.270 | 70.325 | 5.016 | 7.270 | 70.325 |
| 4 | 4.001 | 5.799 | 76.123 | 4.001 | 5.799 | 76.123 |
| 5 | 2.918 | 4.229 | 80.353 | 2.918 | 4.229 | 80.353 |
| 6 | 2.490 | 3.609 | 83.962 | 2.490 | 3.609 | 83.962 |
| 7 | 1.594 | 2.310 | 86.272 | 1.594 | 2.310 | 86.272 |
| 8 | 1.299 | 1.882 | 88.154 | 1.299 | 1.882 | 88.154 |
| 9 | .964 | 1.397 | 89.551 | | | |
| 10 | .842 | 1.220 | 90.771 | | | |
| 11 | .785 | 1.138 | 91.909 | | | |
| 12 | .680 | .986 | 92.895 | | | |
| 13 | .630 | .913 | 93.808 | | | |
| 14 | .527 | .764 | 94.572 | | | |
| 15 | .446 | .646 | 95.218 | | | |
| 16 | .406 | .589 | 95.807 | | | |
| 17 | .368 | .533 | 96.340 | | | |
| 18 | .302 | .437 | 96.777 | | | |
| 19 | .280 | .406 | 97.183 | | | |
| 20 | .249 | .361 | 97.544 | | | |
| 21 | .234 | .339 | 97.883 | | | |
| 22 | .211 | .306 | 98.189 | | | |
| 23 | .190 | .275 | 98.464 | | | |
| 24 | .172 | .250 | 98.713 | | | |
| 25 | .138 | .201 | 98.914 | | | |
| 26 | .118 | .171 | 99.085 | | | |
| 27 | .100 | .145 | 99.230 | | | |
| 28 | .098 | .142 | 99.372 | | | |
| 29 | .090 | .131 | 99.503 | | | |
| 30 | .084 | .121 | 99.624 | | | |
| 31 | .070 | .102 | 99.726 | | | |
| 32 | .059 | .086 | 99.813 | | | |
| 33 | .044 | .063 | 99.876 | | | |
| 34 | .038 | .055 | 99.931 | | | |
| 35 | .027 | .039 | 99.970 | | | |
| 36 | .021 | .030 | 100.000 | | | |
| 37 | 1.157E-15 | 1.677E-15 | 100.000 | | | |
| 38 | 1.075E-15 | 1.559E-15 | 100.000 | | | |

Extraction Method: Principal Component Analysis.
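The percentages in Table 2 can be recomputed from the eigenvalues. The following is a minimal sketch, assuming the analysis was run on the correlation matrix of the 69 genes so that the total variance equals 69; the source does not state this, but the assumption reproduces the first entry (36.286 / 69 is about 52.589%).

```python
# Initial eigenvalues of the first eight components, copied from Table 2.
eigenvalues = [36.286, 7.222, 5.016, 4.001, 2.918, 2.490, 1.594, 1.299]

# Assumption: PCA on the correlation matrix of 69 genes, so the total
# variance equals the number of variables.
n_variables = 69

cumulative = []
running = 0.0
for ev in eigenvalues:
    running += 100.0 * ev / n_variables  # percentage of variance for this component
    cumulative.append(round(running, 2))

print(cumulative[0])   # share explained by the first component
print(cumulative[1])   # share explained by the first two components
print(cumulative[-1])  # share explained by the first eight components
```

The three printed values correspond to the cumulative percentages listed for components 1, 2 and 8 in Table 2.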

We can see from Table 2 that about 88% of the information is retained when the first 8 principal components are considered. In addition, the contribution coefficient of each variable to each principal component is given in Table 3.

TABLE 3 COMPONENT MATRIX (Component Matrix; columns are components 1 to 8)

| Variable | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| VAR00023 | .916 | -.122 | -.200 | -.138 | -.096 | -.142 | -.032 | .056 |
| VAR00055 | .916 | -.138 | .123 | -.074 | -.147 | -.022 | -.158 | -.163 |
| VAR00028 | .916 | -.178 | .086 | -.143 | -.160 | -.088 | -.111 | -.135 |
| VAR00069 | .909 | -.118 | .066 | -.140 | -.219 | -.118 | -.115 | -.162 |
| VAR00064 | .902 | -.187 | .165 | -.035 | -.049 | .109 | -.093 | -.057 |
| VAR00066 | .900 | -.028 | -.241 | -.127 | -.115 | -.135 | .047 | .020 |
| VAR00033 | .897 | -.220 | -.106 | -.156 | -.113 | .036 | .022 | -.122 |
| VAR00013 | .882 | .006 | .250 | -.110 | -.121 | -.003 | -.237 | -.154 |
| VAR00050 | .868 | -.126 | .181 | -.356 | -.023 | -.021 | -.137 | -.078 |
| VAR00038 | .865 | -.137 | .059 | -.043 | .045 | -.054 | .012 | -.013 |
| VAR00058 | .859 | -.163 | -.023 | -.311 | -.008 | -.224 | .017 | -.169 |
| VAR00012 | .847 | -.115 | -.257 | .009 | .031 | -.314 | .076 | -.065 |
| VAR00035 | .847 | -.287 | .006 | -.211 | .002 | .180 | -.213 | -.137 |
| VAR00047 | .833 | -.175 | .190 | .383 | -.133 | .138 | -.103 | -.050 |
| VAR00039 | .830 | -.171 | .051 | -.229 | -.121 | .155 | -.148 | -.009 |
| VAR00020 | .808 | -.213 | -.116 | -.363 | -.038 | -.053 | .188 | .001 |
| VAR00034 | .807 | .496 | .096 | -.063 | -.106 | -.083 | .004 | .160 |
| VAR00007 | .807 | -.153 | -.048 | .280 | .100 | .302 | -.121 | .041 |
| VAR00032 | .806 | .015 | -.266 | -.023 | .099 | -.244 | -.208 | .004 |
| VAR00030 | .805 | .447 | .205 | .033 | .001 | -.115 | -.086 | .101 |
| VAR00025 | .797 | .012 | -.502 | .021 | .074 | -.200 | -.057 | .098 |
| VAR00002 | .796 | .328 | .110 | -.164 | .327 | .073 | .051 | .065 |
| VAR00017 | .795 | -.124 | -.079 | -.313 | -.245 | .097 | .145 | -.047 |
| VAR00046 | .795 | -.052 | -.476 | -.177 | -.087 | -.118 | .059 | .039 |
| VAR00045 | .786 | -.126 | -.408 | -.121 | .229 | -.025 | -.142 | .039 |
| VAR00024 | .778 | -.029 | -.321 | -.108 | .130 | -.331 | -.048 | .099 |
| VAR00006 | .773 | .066 | .180 | -.059 | -.511 | -.024 | -.132 | -.010 |
| VAR00010 | .767 | .503 | -.056 | .007 | .054 | -.090 | -.095 | .171 |
| VAR00053 | .767 | .471 | .112 | -.103 | .119 | .094 | -.088 | .073 |
| VAR00065 | .761 | -.239 | -.011 | .240 | -.135 | .011 | .157 | -.383 |
| VAR00044 | .761 | -.087 | .226 | .362 | -.305 | .080 | .072 | .040 |
| VAR00051 | .752 | -.084 | -.196 | .336 | -.254 | .334 | -.084 | .097 |
| VAR00061 | .727 | -.079 | -.371 | .390 | .129 | .053 | .051 | -.222 |
| VAR00008 | .724 | -.349 | .417 | .210 | -.093 | -.169 | .026 | .212 |
| VAR00043 | .718 | -.116 | -.141 | .298 | -.190 | .373 | -.110 | .153 |
| VAR00018 | .717 | -.155 | -.198 | .466 | -.075 | .055 | .250 | -.062 |
| VAR00067 | .715 | .118 | .330 | .008 | .025 | .192 | .213 | .181 |
| VAR00042 | .711 | .330 | .215 | .034 | .004 | .010 | .448 | -.105 |
| VAR00052 | .710 | .563 | .136 | .176 | -.119 | -.088 | .114 | .142 |
| VAR00048 | .697 | -.119 | -.199 | -.446 | -.172 | .207 | .260 | .222 |
| VAR00011 | .694 | -.168 | -.407 | .426 | .094 | .139 | .063 | -.014 |
| VAR00057 | .694 | -.043 | -.294 | .429 | .124 | .323 | .046 | -.033 |
| VAR00040 | .693 | -.308 | .061 | -.480 | -.045 | .017 | -.055 | .129 |
| VAR00021 | .690 | -.265 | .460 | .275 | -.250 | -.122 | .055 | .194 |
| VAR00022 | .690 | -.212 | -.024 | .415 | .304 | -.269 | .160 | .180 |
| VAR00026 | .683 | .597 | -.102 | .042 | .036 | .113 | -.230 | .127 |
| VAR00004 | .674 | -.407 | -.080 | .148 | -.126 | .174 | -.016 | .353 |
| VAR00041 | .672 | -.218 | -.218 | .290 | .059 | .233 | -.046 | -.265 |
| VAR00029 | .671 | .569 | .053 | .078 | -.291 | -.198 | .036 | -.019 |
| VAR00056 | .666 | -.043 | -.375 | .048 | .257 | -.493 | -.044 | -.030 |
| VAR00009 | .646 | -.143 | .616 | .155 | -.262 | -.141 | -.033 | .016 |
| VAR00027 | .646 | -.389 | .227 | -.072 | .204 | -.009 | .393 | -.035 |
| VAR00060 | .645 | .512 | -.286 | .120 | .146 | -.032 | .313 | -.017 |
| VAR00014 | .633 | -.123 | -.475 | -.428 | .073 | -.005 | -.094 | .243 |
| VAR00016 | .627 | .317 | .240 | -.219 | .369 | .274 | -.203 | .076 |
| VAR00062 | .596 | -.342 | .458 | .099 | .242 | -.015 | -.093 | .083 |
| VAR00015 | .587 | -.299 | -.223 | -.212 | .310 | .530 | .087 | -.120 |
| VAR00063 | .552 | .536 | .248 | .307 | .287 | -.103 | -.158 | -.124 |
| VAR00068 | .548 | .474 | -.050 | -.261 | -.059 | .040 | .445 | .065 |
| VAR00031 | .341 | .822 | .008 | .108 | .093 | -.082 | -.053 | -.163 |
| VAR00001 | .530 | .646 | -.077 | -.049 | -.167 | .097 | .192 | -.100 |
| VAR00054 | .565 | .625 | .099 | .132 | .073 | -.184 | -.144 | -.174 |
| VAR00003 | .542 | .565 | .258 | -.075 | .194 | .166 | .003 | -.124 |
| VAR00059 | .375 | .455 | .318 | -.137 | .331 | .355 | -.061 | .185 |
| VAR00049 | .407 | -.438 | .379 | -.039 | .287 | -.418 | .071 | -.039 |
| VAR00036 | .242 | -.488 | .683 | -.071 | .362 | -.085 | .123 | .106 |
| VAR00037 | .464 | -.044 | .636 | -.311 | .067 | .156 | .116 | -.263 |
| VAR00005 | .542 | -.314 | .166 | .565 | .368 | -.181 | -.070 | .111 |
| VAR00019 | .436 | -.260 | -.052 | -.193 | .691 | .131 | .011 | -.095 |

a. 8 components extracted.
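For orientation, and assuming Table 3 is the usual unrotated component matrix of a principal component analysis on the correlation matrix (the source does not state the SPSS settings), the loadings relate to the eigenvalues in Table 2 as follows. The symbols $a_{ij}$, $e_{ij}$ and $\lambda_j$ are notation introduced here, not taken from the paper.

```latex
a_{ij} = \sqrt{\lambda_j}\, e_{ij},
\qquad
\sum_{i=1}^{69} a_{ij}^{2} = \lambda_j,
\qquad j = 1, \dots, 8
```

Here $e_{ij}$ is the entry for variable $i$ in the $j$-th unit-length eigenvector and $\lambda_j$ is the $j$-th eigenvalue. Under this reading, the squared entries in column 1 of Table 3 would sum to 36.286, the first value listed under Extraction Sums of Squared Loadings in Table 2.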

2.3 Cluster test

Since the first principal component provides more than half (about 53%) of the information, and the first two principal components together provide 63%, we take the first and second principal components as the main criteria and use the other principal components as a reference. Finally, the accuracy rate was observed using the independent test samples. Classification results were as follows:

TABLE 4 RESULTS OF CLUSTER (Cluster Membership)

| Case Number | Cluster | Distance |
|---|---|---|
| 1 | 1 | 1.189 |
| 2 | 2 | 1.352 |
| 3 | 2 | 1.703 |
| 4 | 2 | 1.194 |
| 5 | 2 | 1.550 |
| 6 | 2 | 1.158 |
| 7 | 2 | .641 |
| 8 | 2 | 1.656 |
| 9 | 2 | .501 |
| 10 | 2 | 1.464 |
| 11 | 1 | 1.379 |
| 12 | 1 | 1.762 |
| 13 | 1 | 2.015 |
| 14 | 2 | .784 |
| 15 | 2 | 2.229 |
| 16 | 2 | .958 |
| 17 | 2 | 1.810 |
| 18 | 2 | .762 |
| 19 | 2 | .672 |
| 20 | 1 | 1.152 |
| 21 | 1 | 1.427 |
| 22 | 1 | 1.725 |
| 23 | 1 | 2.538 |
| 24 | 1 | 1.564 |
| 25 | 1 | .955 |
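The source does not name the clustering algorithm. As a hedged illustration only, the sketch below assumes a two-cluster k-means on the first two principal component scores, which is one procedure that yields a cluster label and a distance to the cluster center for each case, as in Table 4. The scores and starting centers are hypothetical.

```python
import math


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points]


def kmeans(points, centers, n_iter=20):
    """Plain k-means: returns (labels, distances to own center, centers)."""
    centers = [tuple(c) for c in centers]
    for _ in range(n_iter):
        labels = assign(points, centers)
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            # Move the center to the mean of its members (keep it if empty).
            new_centers.append(
                tuple(sum(c) / len(members) for c in zip(*members)) if members else old
            )
        if new_centers == centers:
            break
        centers = new_centers
    labels = assign(points, centers)
    distances = [math.dist(p, centers[lab]) for p, lab in zip(points, labels)]
    return labels, distances, centers


# Hypothetical (PC1, PC2) scores: two tight groups and one in-between sample.
scores = [(-2.0, 0.1), (-1.8, -0.2), (-2.2, 0.0),
          (2.0, 0.3), (1.9, -0.1), (2.1, 0.2),
          (0.3, 0.0)]
labels, distances, centers = kmeans(scores, centers=[scores[0], scores[3]])
print(labels)
print([round(d, 3) for d in distances])
```

In this toy run the in-between sample is assigned to a cluster but sits farthest from its cluster center, so the distance column carries information beyond the hard label.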

Given that the first 10 samples should have been normal and the last 15 should have been cancer, these results give an accuracy of 18/25 = 72%. The results also show that the misclassified samples are mainly samples 14 to 19, and this is the surprising finding of this article: we attribute it to these samples not having completely turned cancerous, and call them quasi-cancer samples. These quasi-cancer samples are of even greater significance for doctors as well as patients.

In previous studies, samples were simply divided into normal and cancer ones, that is, treated as a binary variable: a sample was either completely normal or completely cancerous, which is not realistic. As fuzzy mathematics points out, many things are not clear-cut but rather fuzzy. In the samples used, although there were perfectly healthy groups, there were also sub-health ones; although there were fully developed cancer patients, there were early cancer patients as well. Such samples may lie in the interval (0, 1), and we call the early-cancer and sub-health samples quasi-cancer; they are more meaningful for doctors and patients. Through early detection of the quasi-cancer population, doctors can give timely treatment before these patients progress to the cancer group; patients who promptly learn that they are in the quasi-cancer group can pay more attention to their health, so that the condition does not develop into full cancer. In short, the finding of quasi-cancer samples has important practical significance.
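The 18/25 figure can be checked directly against the cluster memberships in Table 4. The sketch below assumes that cluster 2 is read as normal and cluster 1 as cancer; this mapping is not stated in the text but is the one that reproduces the reported accuracy.

```python
# Cluster membership of the 25 test samples, copied from Table 4.
clusters = [1, 2, 2, 2, 2, 2, 2, 2, 2, 2,   # cases 1-10
            1, 1, 1, 2, 2, 2, 2, 2, 2, 1,   # cases 11-20
            1, 1, 1, 1, 1]                  # cases 21-25

# True labels as stated in the text: first 10 normal, last 15 cancer.
truth = ["normal"] * 10 + ["cancer"] * 15

# Assumption: cluster 2 corresponds to normal and cluster 1 to cancer.
cluster_to_label = {1: "cancer", 2: "normal"}

predicted = [cluster_to_label[c] for c in clusters]
misclassified = [i + 1 for i, (p, t) in enumerate(zip(predicted, truth)) if p != t]
accuracy = (len(truth) - len(misclassified)) / len(truth)

print(misclassified)  # case numbers whose cluster disagrees with the true label
print(accuracy)
```

Under this mapping the misclassified cases are case 1 and cases 14 to 19, seven in total, which leaves 18 of 25 samples correctly grouped.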

3 CONCLUSIONS

We chose better classification factors and then determined the gene labels by applying principal component analysis and cluster analysis together. In addition, quasi-cancer groups were discovered in the clustering process, which may be of considerable significance for actual cancer diagnosis. In future studies, our research will focus on how to identify quasi-cancer samples more accurately.

REFERENCES

[1] T. R. Golub, et al. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science, 286: 531-537, 1999
[2] I. Guyon, J. Weston, S. Barnhill, et al. Gene Selection for Cancer Classification Using Support Vector Machines. Machine Learning, 46(1-3): 389-422, 2002
[3] Z. Sun, P. Yang. Gene Expression Profiling on Lung Cancer Outcome Prediction: Present Clinical Value and Future Premise. Cancer Epidemiology Biomarkers & Prevention, 15(11): 2063-2068, 2006
[4] R. O. Duda, P. E. Hart, D. G. Stork. Pattern Classification. Second Edition. New York: John Wiley & Sons, 46-48, 2001
[5] S. Theodoridis, K. Koutroumbas. Pattern Recognition. Second Edition. New York: Academic Press, 177-179, 2003

AUTHORS

1 Yunfei Guo was born on April 13th, 1983 in Jilin province, and received his M.S. degree from Yanbian University, China in 2010. He is a lecturer at Yanbian University. His research interests are reliability and statistical analysis.
