INTERNATIONAL JOURNAL FOR TRENDS IN ENGINEERING & TECHNOLOGY
VOLUME 3 ISSUE 1 – JANUARY 2015 – ISSN: 2349-9303
A Classification of Cancer Diagnostics Based on Microarray Gene Expression Profiling

V.S. Gokkul¹, Dr. J. Vijay Franklin²
¹,² Department of Computer Science & Engineering, Bannari Amman Institute of Technology, Anna University
gokkul.vs@gmail.com, vijayfranklinj@bitsathy.ac.in

Abstract— Pattern Recognition (PR) plays an important role in the field of Bioinformatics. PR is concerned with processing raw measurement data by a computer to arrive at a prediction that can be used to formulate a decision. The problems to which pattern recognition is applied have in common that they are too complex to model explicitly. Diverse PR methods are used to analyze, segment and manage high-dimensional microarray gene data for classification. PR is concerned with the development of systems that learn to solve a given problem from a set of instances, each represented by a number of features. Microarray expression technologies make it possible to monitor the expression levels of thousands of genes simultaneously, and the large amount of data generated by microarrays has stimulated the development of various computational methods for studying biological processes through gene expression profiling. Microarray Gene Expression Profiling (MGEP) is important in Bioinformatics; it yields high-dimensional data used in clinical applications such as cancer diagnostics and drug design. In this work a new scheme is developed for the classification of unknown malignant tumors into known classes. The proposed classification scheme includes the transformation of very high-dimensional microarray data into the Mahalanobis space before classification. The suitability of the proposed scheme is demonstrated on ten commonly available cancer gene datasets, containing both binary and multiclass problems. To improve classification performance, a gene selection method is applied to the datasets as a preprocessing and feature extraction step.

Index Terms— Pattern Recognition, Microarray Gene Expression Profiling, Mahalanobis, Classifier, Gene Selection
1 INTRODUCTION
The term cancer does not refer to one disease, but rather to many diseases that can occur in various regions of the body. Every type of cancer is characterized by uncontrolled cell growth. Cancer is the third most common disease and the second leading cause of death in the world, so its detection is a research topic of significant importance. Gene array techniques have been shown to provide insight into cancer study, and molecular profiling based on gene expression array technology holds the promise of precise cancer detection and classification. One of the most important problems in the treatment of cancer is the early detection of the disease: if cancer is detected in later stages, it has compromised the function of one or more organ systems and has spread throughout the body. Methods for the early detection of cancer are therefore of utmost importance and are an active area of current research. An important step in the diagnosis of cancer is the classification of unknown malignant tumors into known classes.
After the initial detection of a cancerous growth, valid diagnosis and staging of the disease are essential for the design of a treatment plan. Microarray (gene chip) technology is a powerful tool for genomic analysis: it gives a global view of the genome in a single experiment. Data analysis is a vital part of a microarray experiment. Each microarray study comprises multiple microarrays, each giving tens of thousands of data points, so the volume of data has been growing exponentially; as microarrays grow larger, the analysis becomes highly challenging. During scanning of the chip, image processing techniques are applied. Image processing is any form of signal processing for which the input is an image, such as a photograph or video frame, and the output is a set of characteristics or parameters related to the input image. After applying image processing techniques, the image scanned from the microarray gene chip is transformed into a data matrix.
This data matrix is then pre-processed and analyzed to obtain the information required for the design of the study.
Figure 1: Pattern recognition solution for microarray data

Preprocessing of gene expression data before carrying out supervised or unsupervised classification may be needed for several reasons. Owing to technological problems of microarray experiments, expression values for some genes cannot be measured accurately, and missing data arise. Most classification methods do not have the capability to handle missing data; one exception is the classification tree. Feature selection is another important issue in cancer classification. By removing features that are irrelevant to the cancers under study, the accuracy of prediction can usually be improved. With a subset of only a few genes, the contribution of each gene becomes more prominent, and the interaction of these genes can be revealed by other techniques such as genetic networks. We use a method called Recursive Feature Elimination (RFE) to iteratively select a series of nested gene subsets, and find that some of these subsets can achieve almost perfect classification with far fewer genes than the original set. For the classification model, three well-known classifiers are considered, namely the decision tree learner C4.5, the simple Bayesian classifier naïve Bayes, and the Support Vector Machine (SVM), together with their corresponding feature selection methods.
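As an illustration of this step, the sketch below selects a nested gene subset with RFE wrapped around a linear SVM. The use of scikit-learn, the random placeholder data and the parameter values (50 retained genes, 10% elimination per step) are assumptions for illustration, not part of the original study.

import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

# X: (n_samples, n_genes) expression matrix, y: class labels.
# Random placeholder data stands in for a real microarray dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2000))
y = rng.integers(0, 2, size=60)

# A linear SVM ranks genes by the magnitude of their weights;
# RFE repeatedly drops the lowest-ranked fraction of genes.
selector = RFE(estimator=SVC(kernel="linear", C=1.0),
               n_features_to_select=50, step=0.1)
selector.fit(X, y)

selected_genes = np.where(selector.support_)[0]
print("genes kept:", selected_genes[:10], "...")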
2 RELATED WORKS
Wang et al. (2005) demonstrate various gene selection methods to improve classification performance. Because microarray data contain a small number of samples and a large number of genes/features, it is very challenging to choose the features that are relevant to the various types of cancer. In that study, several Feature Selection (FS) algorithms, namely wrappers, filters and Correlation-based Feature Selection (CFS), are statistically analyzed to extract useful information about genes and to reduce the dimensionality.
Wrappers choose the best genes for building the classifier, while filters rank the genes for the specific problem. CFS can choose genes that are highly correlated with the cancers but uncorrelated with each other. The datasets used in that paper are acute leukemia and lymphoma microarrays. Results show that the above-mentioned FS approaches have a similar impact on classification; for fast analysis of the data the filters and CFS are suggested, while the wrapper approaches are recommended when only a very few genes are to be selected and the results validated.

F. Chu and L. Wang (2005) used four dimension reduction methods, namely principal component analysis (PCA), a class separability measure, the Fisher ratio and the t-test, for the selection of genes, with an SVM used for classification. Publicly available microarray gene expression datasets, i.e., SRBCT, lymphoma and leukemia, are used in the study. The accuracy is good and matches previously published results, but with a smaller number of features.

Huilin Xiong and Xue-Wen Chen (2005) presented a new approach for classification based on optimizing the kernel function for classification performance. To increase the class separability of the data, a more flexible kernel model is used. The datasets used in the study are the ALL-AML Leukemia, Breast-ER, Breast-LN, Colon, Lung and Prostate microarray data. For performance analysis, K-nearest-neighbor (KNN) and support vector machine (SVM) classifiers were compared with the optimized kernel model; the optimized kernel provides better accuracy.

L. Shen and E. C. Tan (2005) presented Penalized Logistic Regression (PLR) for the classification of cancer. PLR, combined with a dimension reduction technique, is used to improve classification performance. A support vector machine (SVM) and least-squares regression are used for classification, and Recursive Feature Elimination (RFE) is used for iterative gene selection; RFE tries to find the gene subset that is most related to the cancers. Seven publicly available datasets, i.e., breast cancer, central nervous system (CNS), colon, acute leukemia, lung, ovarian and prostate microarray datasets, are used, and a linear SVM serves as the baseline for comparison with the regression methods. Performance can be improved by combining penalized logistic regression with PLS.

For cancer classification, Feng Chu and Lipo Wang (2006) proposed a novel radial basis function (RBF) neural network that uses very few genes. They applied their technique to publicly available datasets, namely the lymphoma dataset and the small round blue cell tumors (SRBCT) dataset. To measure the discriminative ability of genes, a t-test scoring method is used for gene ranking. Compared with the earlier nearest shrunken centroids approach, the RBF neural network uses very few genes and also reduces gene redundancy in the classification of tumors from microarray data.
Xiyi Hang (2008) presented a new approach for cancer classification using microarray data, called sparse representation. Two datasets, Tumors and Brain Tumor, are used to check the classifier's performance. The experimental results of sparse representation on the gene expression data are compared with those of SVMs. The proposed sparse representation method is implemented in MATLAB R14, and the SVM results are obtained with GEMS (Gene Expression Model Selector), a graphical user interface for classification available at http://www.gemssystem.org/. The accuracies are compared with the SVM results and no significant differences are found apart from a small improvement; hence the classification performance of sparse representation is quite similar to that of the SVM.

Osareh et al. (2010) developed an automated system for consistent cancer analysis based on gene microarray expression data. The system contains a number of well-known classifiers such as K-nearest neighbors, naive Bayes, neural networks, decision trees and support vector machines. The microarray datasets used contain both binary and multiclass problems, and the best experimental results are obtained using the support vector machine classifier.
M. Rangasamy and S. Venketraman (2009) presented a new classification scheme based on statistical methods, called efficient Statistical Model Based Classification, which uses very few genes from the microarray gene data for classification. A classical statistical technique is used for gene ranking, and two classifiers are used for prediction. The methodology was applied to three publicly available cancer datasets; the results were compared with earlier approaches and showed better prediction strength with lower computation cost.

A. Bharathi and A. M. Natarajan (2010) give a gene selection method based on Analysis of Variance (ANOVA), used to find the minimum number of genes from a microarray gene expression dataset, with a support vector machine used for classification. The effectiveness of the classifier is checked on the lymphoma dataset using a two-step procedure. In the first step, important genes are selected using the ANOVA method and the top-ranked genes with the highest scores are retained. In the second step, gene combinations are formed and the classification strength of all gene combinations is checked using support vector machines.

For early analysis of cancer patients, Zainuddin and Pauline (2009) worked on the wavelet neural network (WNN) and enhanced the WNN using clustering algorithms. Different clustering algorithms, such as K-means (KM), Fuzzy C-means (FCM), symmetry-based K-means (SBKM), symmetry-based Fuzzy C-means (SBFCM) and modified point-symmetry-based K-means (MPKM), are used. The study uses four datasets: LEU, SRBCT, GLO and CNS. The performance of the proposed classifier is compared with other classifiers, and its accuracy ranges from 86% to 100%.
3 PROPOSED SYSTEM
Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. The datasets used in this work have a large number of features m and a small number of samples n, so it is computationally very intensive to calculate the covariance of such data. To overcome this issue, the covariance is computed through the idea of eigenfaces: the eigenvectors of the covariance matrix are obtained by projecting the data into a feature space that spans the variations among the known classes, which avoids the memory problems caused by the large number of features. Projecting the data into the feature space makes some mathematical assumptions. The eigenvectors of the original covariance matrix are used to transform the data into a low-dimensional space, so in the case of large m and small n the calculations are reduced from the number of features to the number of samples.

PCA is one of the most popular dimensionality reduction techniques and is used for datasets that contain redundant information. In PCA no class information is used, whereas in the proposed scheme class information is given and taken into consideration for the projection of the data. It is therefore not exactly PCA but a variation of it: although it reduces the dimension of the data, it preserves the relevant information. The term eigenfaces is used when dealing with face images for recognition purposes; in this study, however, the transformed genes are used for classification, so the term eigengenes is used instead of eigenfaces.

It is difficult to compute the inverse of the covariance matrix because it is not a diagonal matrix. The Karhunen-Loève (KL) theorem is used to make the data approximately diagonal; the KL transformation is mostly applied to datasets that are highly correlated in order to make them uncorrelated. In the proposed scheme, the data are KL-transformed by multiplying each sample with the transformation matrix. The computations are greatly reduced because there is little correlation between the transformed features.
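A minimal sketch of this step is given below, assuming a centered expression matrix X of shape (n samples x m genes). It uses the small-sample (eigengene) trick of decomposing the small n x n Gram matrix instead of the huge m x m covariance matrix, and then applies the KL transformation by projecting the samples onto the eigengenes. The function and variable names and the use of NumPy are assumptions for illustration.

import numpy as np

def eigengenes(X, k):
    """Top-k eigengenes of a centered (n x m) expression matrix X,
    computed with the small-sample trick (n << m)."""
    gram = X @ X.T                              # small (n, n) matrix
    eigvals, eigvecs = np.linalg.eigh(gram)     # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]       # keep the top-k components
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    U = X.T @ eigvecs                           # map back to gene space: (m, k)
    U /= np.linalg.norm(U, axis=0)              # unit-length eigengenes
    return U, eigvals

def kl_transform(X, U):
    """Project (centered) samples onto the eigengenes: (n, k) coefficients."""
    return X @ U

The eigenvalues returned here are reused in the next step, where the KL coefficients are scaled to unit variance.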
Transformation to the Mahalanobis space: the samples of each class are transformed to the Mahalanobis space defined by the means and variances of that class. The Mahalanobis space is a space in which the sample variance along each dimension is unity. This transformation is achieved by dividing each KL-transformed coefficient of a vector by its corresponding standard deviation. Since the eigengenes of the original covariance matrix are used for the transformation, the square root of the eigenvalue is used instead of the standard deviation, because the eigenvalues are proportional to the variances in the directions of the principal components (PCs). For the final classification, the Euclidean distance is calculated between the transformed samples and the class means. This is the strength of the proposed classification scheme: the complex classification problem reduces to a simple Euclidean classifier.
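The sketch below continues the previous one under the same assumptions: the KL coefficients are divided by the square roots of the corresponding eigenvalues (a simplified, global version of the per-class scaling described above), and each sample is then assigned to the class whose mean is nearest in Euclidean distance. The function names are placeholders.

import numpy as np

def to_mahalanobis_space(Z, eigvals, eps=1e-12):
    """Scale KL coefficients Z (n x k) so the variance along each
    principal direction is approximately unity."""
    return Z / np.sqrt(eigvals + eps)

def fit_class_means(Z_scaled, y):
    """Mean of the scaled training samples for every class label."""
    return {c: Z_scaled[y == c].mean(axis=0) for c in np.unique(y)}

def predict(Z_scaled, class_means):
    """Assign each sample to the class with the nearest mean
    (Euclidean distance in the Mahalanobis space)."""
    labels = list(class_means)
    means = np.stack([class_means[c] for c in labels])          # (C, k)
    d = np.linalg.norm(Z_scaled[:, None, :] - means[None], axis=2)
    return np.array(labels)[np.argmin(d, axis=1)]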
4 PERFORMANCE ANALYSIS
The performance of the proposed classification scheme is plotted as a graph of the experimental results; simulation results for the performance indices are shown below. The error rates of the classifier are computed for each dataset using the bootstrap method. This method estimates the error rate as a weighted average of the resubstitution error (the error on the training set) and the error on the samples not used to train the classifier, in the spirit of leave-one-out cross-validation. For the Adeno, Colon and SRBCT datasets the error rate is higher than for the other datasets considered for classification; the errors on these datasets indicate the limitations of the approach. The classification boundary is nonlinear, as it incorporates different data spreads between classes as well as gene types.
Figure: Classifier error rate (0 to 1.5) for each dataset.
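As a rough illustration of the error estimate described above, the sketch below computes a bootstrap error as a weighted average of the resubstitution error and the error on samples left out of each bootstrap training set. The 0.632/0.368 weights and the nearest-centroid stand-in classifier are assumptions, since the paper does not state the exact weighting or classifier interface.

import numpy as np
from sklearn.neighbors import NearestCentroid   # stand-in for the proposed classifier

def bootstrap_error(X, y, n_boot=100, seed=0):
    """Weighted average of resubstitution error and out-of-bag error."""
    rng = np.random.default_rng(seed)
    n = len(y)
    oob_errors = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # sample with replacement
        oob = np.setdiff1d(np.arange(n), idx)     # samples not used for training
        if oob.size == 0:
            continue
        clf = NearestCentroid().fit(X[idx], y[idx])
        oob_errors.append(np.mean(clf.predict(X[oob]) != y[oob]))
    resub = np.mean(NearestCentroid().fit(X, y).predict(X) != y)
    return 0.368 * resub + 0.632 * float(np.mean(oob_errors))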
5 CONCLUSION
The classification scheme for both binary and multiclass cancer diagnosis involves the transformation of microarray data to the Mahalanobis space after performing the KL transformation. The strength of the proposed scheme is that the final classification reduces to a simple Euclidean classifier, while the classification boundary remains nonlinear because it incorporates different data spreads between classes as well as gene types. Gene selection is applied for better performance of the algorithm, and results on the other datasets also improve after gene selection is performed. For both gene selection and cancer classification, researchers are interested in the main causes that lie behind the occurrence of a disease, so a great deal of research has been done on the analysis of such data, and a number of issues were encountered during the analysis. The accuracy and speed of the proposed technique are checked on different published datasets; the results show that the proposed technique is highly efficient in both performance and computation.
6 FUTURE WORKS
In future, this work can be extended by applying ensembling techniques. Ensemble methods are sets of machine learning techniques whose decisions are combined in some way to improve the performance of the overall system. Two important aspects have to be addressed in ensemble approaches. The first is how to generate diverse base classifiers; traditionally, re-sampling has been widely used to generate training datasets for base classifier learning, but this method is rather random and, due to the small number of samples, the generated datasets may be very similar. The second aspect is how to combine the base classifiers. An intelligent approach for constructing ensemble classifiers is proposed: the method first trains the base classifiers with a Particle Swarm Optimization (PSO) algorithm, and then selects the appropriate classifiers to construct a high-performance classification committee with an Estimation of Distribution Algorithm (EDA). A simple sketch of the re-sampling and combination steps is given below.
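As a minimal sketch of the two aspects above, the code below trains base classifiers on bootstrap re-samples and combines their decisions by majority voting. The PSO training and EDA selection steps are not implemented here, and the choice of a linear SVM as the base learner is an assumption.

import numpy as np
from sklearn.svm import SVC

def train_ensemble(X, y, n_members=15, seed=0):
    """Train base classifiers on bootstrap re-samples of the training data."""
    rng = np.random.default_rng(seed)
    n = len(y)
    members = []
    for _ in range(n_members):
        idx = rng.integers(0, n, size=n)            # bootstrap re-sample
        members.append(SVC(kernel="linear").fit(X[idx], y[idx]))
    return members

def majority_vote(members, X):
    """Combine base classifier decisions by majority voting
    (assumes small non-negative integer class labels)."""
    votes = np.stack([m.predict(X) for m in members])           # (M, n)
    return np.array([np.bincount(col).argmax() for col in votes.T])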
REFERENCES
[1] A. Bharathi and A. M. Natarajan, "Cancer Classification of Bioinformatics Data Using ANOVA," International Journal of Computer Theory and Engineering, vol. 2, pp. 1793-8201, 2010.
[2] A. Osareh and B. Shadgar, "Microarray data analysis for cancer classification," 5th International Symposium on Health Informatics and Bioinformatics (HIBIT), vol. 9, pp. 459-468, 2010.
[3] A. Osareh and B. Shadgar, "Microarray data analysis for cancer classification," 5th International Symposium on Health Informatics and Bioinformatics (HIBIT), vol. 9, pp. 459-468, 2010.
[4] D. Singh, P. G. Febbo, K. Ross, D. G. Jackson, J. Manola and C. Ladd, "Gene expression correlates of clinical prostate cancer behavior," Cancer Cell, pp. 203-209, 2002.
[5] F. Chu and L. Wang, "Applications of Support Vector Machines to Cancer Classification with the Microarray Data," International Journal of Neural Systems, vol. 15, pp. 475-484, 2005.
[6] L. Shen and E. C. Tan, "Dimension Reduction-Based Penalized Logistic Regression for Cancer Classification Using Microarray Data," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 2, pp. 166-175, Apr.-June 2005.
[7] M. H. Asyali, D. Colak, O. Demirkaya and M. S. Inan, "Gene expression profile classification: a review," Current Bioinformatics, vol. 1, pp. 55-73, 2006.
[8] M. Rangasamy and S. Venketraman, "An Efficient Statistical Model Based Classification Algorithm for Classifying Cancer Gene Expression Data with Minimal Gene Subsets," International Journal of Cyber Society and Education, vol. 2, pp. 51-66, 2009.
[9] P. Qiu, Z. J. Wang and K. J. R. Liu, "Ensemble dependence model for classification and prediction of cancer and normal gene expression data," Bioinformatics, vol. 21, pp. 3114-3121, 2005.
[10] F. Chu and L. Wang, "Applying RBF Neural Networks to Cancer Classification Based on Gene Expressions," International Joint Conference on Neural Networks, pp. 37-46, July 2006.
[11] H. Xiong and X.-W. Chen, "Optimized Kernel Machines for Cancer Classification Using Gene Expression Data," Proceedings of the 2005 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, pp. 1-7, 2005.
[12] S. Dudoit, J. Fridlyand and T. P. Speed, "Comparison of discrimination methods for the classification of tumors using gene expression data," Journal of the American Statistical Association, vol. 97, pp. 77-87, 2002.
[13] U. I. Bajwa, I. A. Taj and Z. E. Bhatti, "A Comprehensive Comparative Analysis on Performance of Laplacianfaces and Eigenfaces for Face Recognition," The Imaging Science Journal, vol. 59, pp. 32-40, 2011.
[14] X. Hang, "Cancer Classification by Sparse Representation Using Microarray Gene Expression Data," IEEE International Conference on Bioinformatics and Biomedicine Workshops, pp. 174-177, 2008.
[15] Y. Wang, I. V. Tetko, M. A. Hall and E. Frank, "Gene Selection from Microarray Data for Cancer Classification: A Machine Learning Approach," Computational Biology and Chemistry, vol. 29, pp. 37-46, 2005.