SCALABLE AND EFFICIENT CLUSTER-BASED FRAMEWORK FOR MULTIDIMENSIONAL INDEXING

Aretty Narayana(1), M. Srinivasa Rao(2), R. V. Krishnaiah(3)
(1) M.Tech Student, (2) Associate Professor, (3) Principal, Dept. of CSE, DRKCET, Hyderabad, AP, India
arettynarayan4u@gmail.com, msrao@drkgroup.org, r.v.krishnaiah@gmail.com
Abstract
Indexing high-dimensional data is useful in many real-world applications; in particular, it can dramatically improve the information retrieval process. Existing techniques attempt to overcome the "curse of dimensionality" of high-dimensional data sets by using the Vector Approximation File (VA-File), which results in sub-optimal performance. Compared with the VA-File, clustering yields a more compact representation of the data set because it exploits inter-dimensional correlations. However, pruning of irrelevant clusters is important, and the existing pruning techniques, based on bounding rectangles or bounding hyperspheres, have problems in nearest-neighbor (NN) search. To overcome this problem, Ramaswamy and Rose proposed adaptive cluster distance bounding for high-dimensional indexing, which also includes efficient spatial filtering. In this paper, we implement this high-dimensional indexing approach and build a prototype application as a proof of concept. Experimental results are encouraging, and the prototype can be used in real-time applications.
Index Terms: Clustering, high-dimensional indexing, similarity measures, and multimedia databases

1. INTRODUCTION
Data mining has been used for many years to programmatically analyze huge amounts of historical data. Clustering is one of the data mining techniques widely used to group similar objects into clusters. However, clustering is not easy with high-dimensional data, which is produced in many domains such as digital multimedia, CAD/CAM systems, Geographical Information Systems (GIS), stock markets, and medical imaging. Such high-dimensional data sets are very large, ranging from 100 GB to 100 TB or more, and enterprises have to deal with them routinely. The data are accessed using nearest-neighbor (NN) queries and spatial queries, and many techniques have been proposed for querying high-dimensional data [1], [2]. These techniques use similarity measures such as the Euclidean distance (ED) when clustering the data. In [3], search performance is improved using R-tree-like structures. Existing analyses of the "curse of dimensionality" assume data sets with a uniform distribution of correlations; since nearest and distant neighbors then become indistinguishable, indexing such data sets is not possible. Real-world data sets, on the other hand, exhibit non-uniform distributions of correlations and can therefore be indexed effectively using measures such as the Euclidean distance. The use of ED is a significant research activity with many applications, including content-based image retrieval [4], [5]. In this paper, we focus on indexing real-world data sets that are high-dimensional in nature.
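Since the Euclidean distance underpins both the clustering and the NN queries discussed above, the following minimal sketch (not taken from the paper; the function name nearest_neighbors and the toy data are assumptions for illustration) shows a brute-force sequential-scan NN search, the baseline that the indexing techniques discussed in the next section try to beat.

```python
# Minimal illustrative sketch: sequential-scan nearest-neighbor search
# using the Euclidean distance over high-dimensional feature vectors.
import numpy as np

def nearest_neighbors(data: np.ndarray, query: np.ndarray, k: int = 5) -> np.ndarray:
    """Return the indices of the k rows of `data` closest to `query` in Euclidean distance."""
    # data: (n, d) array of d-dimensional feature vectors; query: (d,) vector.
    dists = np.linalg.norm(data - query, axis=1)  # distance from the query to every vector
    return np.argsort(dists)[:k]                  # indices of the k smallest distances

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vectors = rng.random((10_000, 64))            # 10,000 vectors in a 64-dimensional space
    query = rng.random(64)
    print(nearest_neighbors(vectors, query, k=3))
```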
2. PRIOR WORKS

For low-dimensional data, techniques based on hyper-rectangles [6], [7], hyperspheres [8], or a combination of both [9] were used for NN searches and effective data retrieval. Many approaches have also been proposed in the literature to deal with high-dimensional data. Multidimensional index structures are expected to outperform sequential scan because they can access the relevant data quickly. However, Weber et al. [10] showed that when the dimensionality exceeds about 10, sequential scan performs better than such indexing. This degradation in performance is due to the "curse of dimensionality" [11]. The VA-File became popular as a way to overcome this problem: it divides the space into many hyper-rectangular cells, and the approximation file holds bit strings that encode the non-empty cell locations. The vector approximation ultimately amounts to a scalar quantization of the data, which mitigates the curse of dimensionality.

LDR [12] uses a non-linear approximation, achieved through dimensionality reduction combined with clustering, before performing a sequential scan. Hybrid methods such as the IQ-Tree [13] and the A-Tree [14] make use of both vector approximations and a tree-based index. Other methods, such as the Pyramid Tree [15], iDistance [16], and LDC [17], apply dimensionality-reducing transformations. A reference point is used for each partition of the data set, and the feature vectors in a partition are indexed by their distance to the partition centroid. When a query is issued, the query sphere is grown gradually until it intersects a cluster sphere. For quality reasons, the query processing
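As a rough illustration of the approximation step described above, the sketch below performs a VA-File-style scalar quantization: each coordinate is mapped onto a small grid cell index, so the whole data set is summarized by a compact approximation file. The uniform per-dimension grid, the bit budget, and the function name are assumptions made for this example, not the exact construction of the cited VA-File work.

```python
# Rough sketch (an illustrative assumption) of VA-File-style scalar quantization:
# each dimension is split into 2**bits_per_dim uniform intervals and every vector
# is stored as a short string of cell indices, forming a compact approximation file.
import numpy as np

def build_va_approximation(data: np.ndarray, bits_per_dim: int = 4) -> np.ndarray:
    """Map each d-dimensional vector to d small integers, one cell index per dimension."""
    n_cells = 2 ** bits_per_dim
    lo, hi = data.min(axis=0), data.max(axis=0)
    # Uniform grid per dimension; the epsilon keeps coordinates equal to `hi` in range.
    cells = np.floor((data - lo) / (hi - lo + 1e-12) * n_cells).astype(np.uint8)
    return np.clip(cells, 0, n_cells - 1)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    vectors = rng.random((1_000, 32))            # 1,000 vectors in 32 dimensions
    approx = build_va_approximation(vectors)
    print(approx.shape, approx.dtype)            # (1000, 32) uint8
```

At query time, such approximations are scanned first to rule out cells that cannot contain a nearest neighbor, and only the surviving vectors are fetched in full.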