INTERNATIONAL JOURNAL FOR TRENDS IN ENGINEERING & TECHNOLOGY VOLUME 3 ISSUE 3 – MARCH 2015 – ISSN: 2349 – 9303
Dimensionality Reduction Techniques for Document Clustering - A Survey

K.Aravi
Thiyagarajar College of Engineering, Department of Computer Science and Engineering
vanigk4@gmail.com

G.Kalaivani
Thiyagarajar College of Engineering, Department of Computer Science and Engineering
kalaitce22@gmail.com
Abstract— Dimensionality reduction techniques are applied to get rid of inessential terms, such as redundant and noisy terms, in documents. In this paper a systematic study is conducted of seven dimensionality reduction methods: Latent Semantic Indexing (LSI), Random Projection (RP), Principal Component Analysis (PCA), CUR decomposition, Latent Dirichlet Allocation (LDA), Singular Value Decomposition (SVD), and Linear Discriminant Analysis (LDA).

Index Terms— Document clustering, CUR decomposition, Latent Dirichlet Allocation, Latent Semantic Indexing, Linear Discriminant Analysis, Principal Component Analysis, Random Projection, Singular Value Decomposition.
1 INTRODUCTION

Text documents are stored in large databases, and if a user needs to search for or retrieve a specific document, the task is difficult. Clustering the text documents makes searching and retrieving the data easier. However, high dimensional datasets are not efficient for clustering and information retrieval. High dimensionality here means more than ten thousand terms in a document collection. It slows down the computation of distances between documents, which in turn slows down the clustering process. Each term in the documents is represented in a vector space. Preprocessing steps such as stop word removal and stemming are performed before dimensionality reduction. Redundant and noisy data must be reduced for efficient document clustering; this improves the performance of clustering.
2. DIMENSIONALITY REDUCTION TECHNIQUES

The curse of dimensionality refers to the fact that as the number of dimensions increases, the size of the space grows with it, which leads to sparseness and increases running time. Dimensionality reduction is the transformation of high dimensional data into a low dimensional space. The main idea of these techniques is to reduce the number of dimensions without much loss of information.
2.1 Principal Component Analysis

PCA aims to find the correlations between the documents; correlation values range from -1 to 1. It reduces the sparseness in documents. The objective is to maximize the variance of the projected data.

Step 1: The original matrix contains the documents projected in vector space. Taking X_1, ..., X_N as column vectors, each of which has M rows, place the column vectors into a single matrix X of dimensions M × N.
Step 2: Calculate the mean of the documents:

X̄ = (1/N) Σ_{i=1..N} X_i ..............(1)

Step 3: Calculate the covariance matrix:

cov(X, Y) = Σ_{i=1..N} (X_i − X̄)(Y_i − Ȳ) / (N − 1) ..............(2)

where X̄ = mean of the 1st data set, X_i = 1st dataset raw score, Ȳ = mean of the 2nd dataset, Y_i = 2nd dataset raw score.

Step 4: Calculate the eigenvectors and eigenvalues of the covariance matrix (2).

Step 5: Finally, dimensions are reduced by multiplying the matrix X with the matrix E_k of eigenvectors:

X_PCA = E_k^T X ..............(3)

The number of terms of the original data matrix X is reduced by multiplying it with a d × k matrix E_k whose columns are the k eigenvectors corresponding to the k largest eigenvalues. The resulting matrix is X_PCA.
PROS and CONS
(+) It works well for sparse data.
(-) It can be applied only to linear datasets.
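As a concrete illustration of Steps 1-5, the following Python sketch (not the authors' implementation; the toy matrix and the choice k = 2 are assumptions for the example) computes PCA with NumPy:

import numpy as np

# Toy term-document matrix X: M = 4 terms (rows) x N = 5 documents (columns).
X = np.array([[2.0, 0.0, 1.0, 4.0, 3.0],
              [1.0, 3.0, 0.0, 2.0, 1.0],
              [0.0, 1.0, 5.0, 1.0, 2.0],
              [3.0, 2.0, 2.0, 0.0, 1.0]])

# Step 2: mean across the documents, then centre the data.
X_centered = X - X.mean(axis=1, keepdims=True)

# Step 3: covariance matrix of the terms (rows are treated as variables).
C = np.cov(X_centered)

# Step 4: eigenvalues and eigenvectors (eigh, since C is symmetric).
eigvals, eigvecs = np.linalg.eigh(C)

# Step 5: keep the k eigenvectors with the largest eigenvalues and project (equation 3).
k = 2                                      # assumed target dimension
top = np.argsort(eigvals)[::-1][:k]
Ek = eigvecs[:, top]                       # d x k matrix of top eigenvectors
X_pca = Ek.T @ X_centered                  # k x N reduced representation
print(X_pca.shape)                         # (2, 5)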
2.2 Random Projection

Random projection is a technique that projects the documents onto a randomly chosen low-dimensional space. Its equation is as follows:
X_RP (k × N) = R (k × d) · X (d × N) ..............(4)

N = total number of documents
d = original dimension of the documents
k = desired dimension
X = matrix of the documents projected in vector space

We reduce the number of features by multiplying the random matrix R with the original document matrix X.
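A minimal sketch of equation (4), assuming a Gaussian random matrix R (one common choice among several); the sizes d, N, and k are illustrative:

import numpy as np

rng = np.random.default_rng(0)

d, N, k = 1000, 50, 20                 # original dimension, documents, target dimension
X = rng.random((d, N))                 # assumed d x N term-document matrix

# k x d Gaussian random matrix; the 1/sqrt(k) scaling roughly preserves distances.
R = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, d))

X_rp = R @ X                           # equation (4): k x N reduced matrix
print(X_rp.shape)                      # (20, 50)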
2.3 Singular Value Decomposition

SVD is a technique for dimensionality reduction of matrices. It is based on a theorem of linear algebra which states that a rectangular matrix M can be decomposed into the product of three matrices: an orthogonal matrix U, a diagonal matrix S, and the transpose of an orthogonal matrix V.

M = U S V^T ..............(5)

M is an m × n matrix whose rows and columns represent the documents projected in vector space. U, the left singular matrix, is obtained from the eigenvalues and eigenvectors of M · M^T after an orthonormalization process. S is a diagonal matrix of singular values. V, the right singular matrix, is obtained from M^T · M, and its transpose V^T is taken. Finally, dimensions are reduced by discarding the singular values that are zero or close to zero.
PROS and CONS
(+) Reduces sparseness in the data.
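A small truncated-SVD sketch of equation (5) with NumPy (the matrix and the rank k = 2 are assumptions for the example); keeping only the k largest singular values is the practical form of the reduction step:

import numpy as np

# Assumed m x n term-document matrix.
M = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0],
              [4.0, 1.0, 0.0],
              [2.0, 2.0, 2.0]])

U, s, Vt = np.linalg.svd(M, full_matrices=False)   # M = U S V^T (equation 5)

k = 2                                              # keep the k largest singular values
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # rank-k approximation of M
print(np.linalg.norm(M - M_k))                     # reconstruction error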
2.4 Latent Semantic Indexing

LSI is one of the standard dimensionality reduction techniques in information retrieval. A collection of documents can be represented as a huge term-document matrix, from which various quantities can be computed, such as how close two documents are to a user-issued query. In this matrix, each word is a row and each document is a column, and each cell contains the number of times that word occurs in that document. LSI transforms the original data into a different space so that two documents/words about the same concept are mapped close together (so that they have higher cosine similarity). LSI achieves this through the Singular Value Decomposition (SVD) of the term-document matrix, which embeds the original high dimensional space into a lower dimensional space with minimal distance distortion. A query vector q is projected into the reduced space as:
Q = q^T U S^{-1} ..............(6)
PROS and CONS
(+) Finds relationships between the terms in documents.
(-) Every user query must be projected into the vector space.
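A toy LSI sketch, building on the SVD above and including the query projection of equation (6); the term-document counts and the query are assumptions for illustration:

import numpy as np

# Toy term-document count matrix: 5 terms x 4 documents.
A = np.array([[1, 0, 0, 1],
              [1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1],
              [2, 0, 0, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, Sk = U[:, :k], np.diag(s[:k])
docs = Vt[:k, :].T                    # each row: a document in the k-dim latent space

q = np.array([1, 0, 0, 0, 1], dtype=float)        # query using terms 1 and 5
q_hat = q @ Uk @ np.linalg.inv(Sk)                 # equation (6): project the query

# Cosine similarity between the query and every document, then rank.
sims = docs @ q_hat / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q_hat))
print(sims.argsort()[::-1])                        # documents ranked by similarity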
2.5 Latent Dirichlet Allocation

LDA is a way of automatically discovering the topics in documents. It represents documents as mixtures of topics that emit words with certain probabilities. Topic modeling is a classic problem in information retrieval. LDA discovers the different topics used, discovers the mixture of topics in each document, discovers which words belong to each topic, and can classify unseen documents into those topics. Topic modeling associates with each document a probability distribution over "topics", which are in turn distributions over words. The task of parameter estimation in these models is to learn both what the topics are and which documents use them in what proportions.
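As one way to make this concrete (a sketch assuming scikit-learn is available; the four-document corpus and the choice of two topics are toy assumptions), the following fits an LDA topic model and prints each document's topic mixture:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the cat sat on the mat",
        "dogs and cats are friendly pets",
        "stock markets fell sharply today",
        "investors sold shares in the falling market"]

# Word-count representation of the documents.
counts = CountVectorizer(stop_words="english").fit_transform(docs)

# Fit a two-topic model; each document becomes a distribution over topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
mixtures = lda.fit_transform(counts)
print(mixtures.round(2))              # rows: documents, columns: topic proportions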
2.6 Linear Discriminant Analysis

The objective of Linear Discriminant Analysis is to perform dimensionality reduction while preserving as much of the discriminatory information between document classes as possible. It seeks the directions along which the classes are best separated, and it does so by taking into consideration not only the scatter within documents but also the scatter between documents. The methodology is as follows. Suppose there are n classes of documents. Let µ_i be the mean vector of class i, i = 1, 2, ..., n; let M_i be the number of terms in class i, i = 1, 2, ..., n; and let M = Σ_{i=1..n} M_i be the total number of terms.

Within-class scatter matrix:

S_w = Σ_{i=1..n} Σ_{x in class i} (x − µ_i)(x − µ_i)^T

Between-class scatter matrix:

S_b = Σ_{i=1..n} M_i (µ_i − µ)(µ_i − µ)^T

where µ is the overall mean vector. LDA computes a transformation that maximizes the between-class scatter while minimizing the within-class scatter, i.e. it maximizes the ratio of between-class scatter to within-class scatter in the projected space. Such a transformation should retain class separability while reducing the variation due to sources other than identity.

PROS and CONS
(+) Maximizes class separability.
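A from-scratch sketch of the scatter matrices and the projection (the labeled vectors are toy assumptions; pinv is used for numerical safety):

import numpy as np

# Toy labeled document vectors: rows are documents, columns are features.
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 8.5], [5.5, 7.5]])
y = np.array([0, 0, 1, 1, 1])        # class label of each document

mu = X.mean(axis=0)                  # overall mean vector
d = X.shape[1]
Sw = np.zeros((d, d))                # within-class scatter
Sb = np.zeros((d, d))                # between-class scatter
for c in np.unique(y):
    Xc = X[y == c]
    mu_c = Xc.mean(axis=0)
    Sw += (Xc - mu_c).T @ (Xc - mu_c)
    Sb += len(Xc) * np.outer(mu_c - mu, mu_c - mu)

# Directions maximizing between-class relative to within-class scatter:
# the leading eigenvectors of Sw^{-1} Sb (at most n_classes - 1 of them).
eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
W = eigvecs[:, np.argsort(eigvals.real)[::-1][:1]].real
X_lda = X @ W                        # documents projected onto the top direction
print(X_lda.ravel())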
2.7 CUR Decomposition

A CUR decomposition approximates a matrix by the product of three matrices. A CUR approximation can be used in the same way as the low-rank approximation given by the Singular Value Decomposition (SVD). CUR approximations are less accurate than the SVD, but the rows and columns come from the original matrix.

A = CUR ..............(7)

Matrix A contains the documents projected in vector space. Matrix C consists of columns of matrix A, matrix R consists of rows of matrix A, and matrix U is the pseudo-inverse of the intersection of C and R.
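A minimal sketch of equation (7) with uniform random sampling of rows and columns (practical CUR methods usually sample by leverage scores; the matrix and sample sizes here are assumptions):

import numpy as np

rng = np.random.default_rng(0)
A = rng.random((8, 6))               # assumed document matrix

cols = rng.choice(A.shape[1], size=3, replace=False)   # sampled column indices
rows = rng.choice(A.shape[0], size=3, replace=False)   # sampled row indices

C = A[:, cols]                       # actual columns of A
R = A[rows, :]                       # actual rows of A
W = A[np.ix_(rows, cols)]            # intersection of C and R
U = np.linalg.pinv(W)                # pseudo-inverse of the intersection

A_cur = C @ U @ R                    # equation (7): CUR approximation of A
print(np.linalg.norm(A - A_cur))     # approximation error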
PROS and CONS
(+) The vectors are original columns and rows of the matrix.
(-) Columns and rows may be duplicated, so the matrix can be very large and take more time to compute.
3. CONCLUSION

We conclude that dimensionality reduction is an efficient technique for cutting back the inessential terms in documents. In this paper we surveyed the major dimension reduction approaches that can be applied to document clustering. These techniques can be applied to linear datasets, and they mainly reduce sparseness in the data.
4. ACKNOWLEDGMENTS

We would like to thank the UCI repository of machine learning databases for the use of several of their public datasets.