Outlier Detection based on Distance Concentration: Reverse Nearest Neighbors Approach

Page 1

IJIRST –International Journal for Innovative Research in Science & Technology| Volume 4 | Issue 1 | June 2017 ISSN (online): 2349-6010

Outlier Detection based on Distance: Reverse Nearest Neighbors Pranita S Jawale ME Student Department of Computer Engineering P.V.P.I.T Bavdhan, Pune, Maharashtra, India

Prof. Yogesh B. Gurav Associate Professor Department of Computer Engineering P.V.P.I.T Bavdhan, Pune, Maharashtra, India

Abstract Outlier identification previously, high-dimensional information displays Different tests coming about because of the “curse of dimensionality”. An prevailing perspective may be that separation concentration, i.e., the propensity from claiming distances over high-dimensional information on turned into indiscernible, Hinders those identification about outliers by making distance-based strategies name at focuses as very nearly just as beneficial outliers. In this paper, we gatherings give proof supporting the assumption that such a see will be a really simple, toward demonstrating that distance-based strategies could. Handle All the more differentiating outlier scores previously, high-dimensional settings. Furthermore, we demonstrate that helter skelter dimensionality could have. An alternate impact, toward re-examining those ideas about opposite closest neighbours in the unsupervised outlierdetection connection. Namely, it might have been as of late watched that those dissemination about points’ reverse-neighbour tallies turns into skewed on secondary dimensions, bringing about. The wonder known as hubness. We furnish knowledge under how a few focuses (antihubs) show up extremely rarely to k-NN records about. Different points, What's more clarify those association between antihubs, outliers, and existing unsupervised outlier-detection systems .Toward. Assessing that excellent kNN method that angle-based method outlined to high-dimensional data, those density-based nearby. Outlier component What's more impacted outlierness methods, also antihubs-based strategies on Different manufactured Also real-world information sets, we offer novel knowledge under the convenience about opposite neighbour tallies previously, unsupervised outlier identification. Keywords: Outlier Detection, Reverse nearest Neighbors, High-Dimensional Data, Distance Concentration _______________________________________________________________________________________________________ I.

INTRODUCTION

Identification for outliers over information characterized Concerning illustration finding designs to information that don't fit in with ordinary conduct technique or information that don't conformed to expected behavior, such an information would know as Similarly as outliers, anomalies, exceptions [1]. Aberrance and outlier need comparable intending. The examiners have solid enthusiasm toward outliers since they might represent able basic Furthermore significant majority of the data for Different domains, for example, such that interruption detection, duplicity detection, Also medicinal what’s more wellbeing analysis. An outlier may be a perception to information instances which will be unique in relation to the others in dataset [2]. There are a lot of people motivations because of outliers emerge like poor information quality, breaking down for equipment, ex: MasterCard duplicity. Information Labels connected with information instances indicates if that example belongs with ordinary information or bizarre [3]. In light of the accessibility about labels to information instance, the aberrance identification systems work done a standout amongst those three models are 1) regulated aberrance Detection, strategies prepared over managed mode Think as of that those accessibility about marked instances for ordinary and also aberrance classes to a An preparing dataset. 2) Semi-supervised aberrance Detection, strategies prepared clinched alongside regulated mode think about that the accessibility from claiming marked instances to normal; don't oblige labels to that aberrance population. 3) Unsupervised aberrance Detection, systems that work to unsupervised mode don't require preparing information [1]. There would Different techniques to outlier identification dependent upon closest neighbors, which Think as of that outliers show up a wide margin from their closest neighbors. Such routines base looking into a separation or comparability measure will quest the neighbors, with Euclidean separation. A significant number neighbor-based strategies incorporate characterizing the outlier score of a purpose as the separation should its k-th closest neighbor (k-NN method), some techniques that focus the score of a side of the point as stated by its relative thickness [4]. Those separation of the k-th closest neighbor to a provided for information point could make seen as a evaluate of the opposite thickness around it. II. RELATED WORK For Gave An binding together perspective of the part of reverse closest neighbor tallies done issues concerning unsupervised outlier detection, centering on the impacts from claiming secondary dimensionality once unsupervised outlier-detection routines and the hubness phenomenon, extending those past examinations from claiming (anti)hubness on substantial values from claiming k, and exploring those association the middle of hubness Also information sparsity [1]. In light of those analysis, figured the AntiHub strategy for unsupervised outlier detection, talked about its properties, Also recommended An inferred strategy which enhances separation between scores the picture of the transaction between the sorts about

All rights reserved by www.ijirst.org

208


Outlier Detection based on Distance: Reverse Nearest Neighbors (IJIRST/ Volume 4 / Issue 1/ 035)

outliers and properties from claiming data, filling An hole done seeing which might need in this way hindered those across the board utilization of reverse-neighbor strategies to unsupervised outlier identification. In the survey are an attempt should provide for a sorted out Furthermore An totally outline about wide Scrutinize looking into conflict ID number methods spreading again various examination zones Also requisition spaces [2]. An extensive parcel of the present reviews once eccentricity ID number possibly focus on a particular provision space alternately around a lone exploration zone. [Agyemang et al. 2006] and [Hodge Also Austin 2004] need aid two related meets expectations that gathering eccentricity distinguishment under various characterizations Also analyze methodologies under each order. This contemplates stretches upon these two meets expectations by inside and out developing those talk done a couple directions. , two that's only the tip of the iceberg characterizations of eccentricity ID number strategies, viz. , information theoretic What's more extraordinary procedures, of the four classes inspected to [Agyemang et al. 2006] What's more [Hodge Furthermore austin 2004]. To each of the six classes, recognize momentous suppositions for admiration to those method for peculiarities settled on by the methodologies in that order. These suppositions would fundamental for choosing the point when those frameworks in that arrangement might need those limit with recognize oddities, and the point when they might come up short. To each classification, a fundamental conflict finding procedure, also after that demonstrate how the different existing frameworks in that order would varieties of the essential methodology. This arrangement provides for a lesquerella requesting Furthermore short cognizance of the techniques hosting a put for each class. Promote, for each order distinguish the focuses of interest and drawbacks of the frameworks in that population provide for a talk on the computational many-sided nature of the methodologies since it may be a basic issue clinched alongside honest to goodness provision spaces. It manages discovering anomalies (exemptions) for expansive, multidimensional datasets [3]. The identication from claiming exceptions could prompt of the revelation about by any means startling Taking in Previously, ranges, to example, electronic business, Visa misrepresentation, Furthermore actually those examinations of execution insights for master rivals. Existing methodologies that we have seen to finding exceptions might simply plan productively with two measurements/properties of a dataset. In this paper, we focus the considered perfect DB-(Distance Based) exceptions. Over particular, we exhibit that: (i) exemption area ought a chance to be time permits viably to broad datasets, What's more for k-dimensional datasets with generous estimations about k (e. G. , k 5); What's more (ii), aberrance distinguishment may be a noteworthy Also basic data revelation chore. III. PROPOSED SCHEME It may be basic will perceive how the development of dimensionality impacts aberrance area. Concerning illustration elucidated in the certified challenges postured Eventually Tom's perusing those "scourge from claiming dimensionality “contrasts from those for the most part recognized perspective that every purpose turns under a Likewise OK aberrance for high-dimensional space. We will present extra affirmation which tests this view, impelling the (re)examination of systems. Reverse closest neighbor include have been proposed the past as a strategy It may be basic will perceive how the development of dimensionality impacts aberrance area. Concerning illustration elucidated in the certified challenges postured Eventually Tom's perusing those "scourge from claiming dimensionality “contrasts from those for the most part recognized perspective that every purpose turns under a Likewise OK aberrance for high-dimensional space. We will present extra affirmation which tests this view, impelling the (re)examination of systems. Reverse closest neighbor include have been proposed the past as strategy.

Fig. 1: Proposed System Architecture

Outlierness of information focuses however no understanding separated from fundamental instinct was offered with respect to why these numbers ought to speak to important exception scores. Late recognitions that rearrange neighbor checks are impacted by extended dimensionality of data warrant their reconsideration for the aberrance distinguishment errand. In this light, we will profit of the method.

All rights reserved by www.ijirst.org

209


Outlier Detection based on Distance: Reverse Nearest Neighbors (IJIRST/ Volume 4 / Issue 1/ 035)

IV. IMPLEMENTATION PLAN System Implementation Plan Detection of outliers in data defined as finding patterns in data that do not conform to normal behavior or data that do not conformed to expected behavior, such a data are called as outliers, anomalies, exceptions. Anomaly and Outlier have similar meaning. The analysts have strong interest in outliers because they may represent critical and actionable information in various domains, such as intrusion detection, fraud detection, and medical and health diagnosis.

Fig. 2: Software Development Planning Activity

Mathematical Model S is the system S={I, O, F ,Success, Failure } Where, I = Set of Input I={I1, I2, I3,I4} Where, I1=Login user ID I2=Login password I3=File I4=Outlier Keyword O=Set of Outputs O= {O1, O2, O3 } Where,O1=Authentication Message O2=Outlier Detection O3= AntiHub F=Set of Functions F={F1, F2, F3 } Where,F1=Authentication O1ßF1(I1, I2) F2=Outlier Result O2ßF2(I3,I4) F3=AntiHub Score O3ßF3(I3,I4) Algorithm Used AntiHub2dist(x, k, p, step) Input  Distance measure dist  Ordered data set D = (x1,x2 , xn), where xi ∈ Rd, for i∈{1,2, ...,n}

All rights reserved by www.ijirst.org

210


Outlier Detection based on Distance: Reverse Nearest Neighbors (IJIRST/ Volume 4 / Issue 1/ 035)

  

No. of neighbors k ∈ {1,2, ...} Ratio of outliers to maximize discrimination p ∈ (0,1) Search parameter step ∈ (0, 1) Output  Vector s = (s1,s2, ‌ , sn) ∈ Rn, where si is the outlier score of xi, for i∈{1,2, ... ,n} a) Temporary variables  AntiHub scores a ∈ Rn  Sums of nearest neighbors' AntiHub scores ann∈ Rn  Proportion Îą ∈ [0,1]  (Current) discrimination score cdisc, disc∈ Rn  (Current) raw outlier scores ct, t ∈ Rn b) Local functions  discScore(y,p): for y ∈ Rn and p ∈(0,1] outputs the number of unique items among [np] smallest members of y, divided by [np]  Steps 1) a := AntiHubdia(D, k) 2) For each I∈ (1, 2, ... , n) 3) ann:= ∑đ?‘—∈đ?‘ đ?‘ đ?‘‘đ?‘–đ?‘ đ?‘Ą(đ?‘˜,đ?‘–) aj , where NNdist (k,i) is the set of indices of k nearest neighbors of xi 4) disc := 0 5) For each Îą ∈ (0, step,2 • step, . ,1) 6) For each i∈ (1, 2‌ , n) 7) cti := (1 - Îą) • Îąi + Îą • anni 8) cdisc:= disoScore(ct, p) 9) If cdisc>disc 10) t:= ct, disc := cdisc 11) For each i∈ (1,2,... ,n) 12) si:= f(ti), where f : R→R is a monotone function V. RESULT TABLES Above table shows the result and efficiency of our proposed system it gives the time which is required to Sr. No 1 2 3 4 5

Table - 1 Result Table Time Number Of words in file 2 1000 4 1700 5 2000 6 2800 8 3800

As shown in figure 3the graph represents ratio to find out outlier graphical Analysis. We are processing this data of various people and finding the outlier. Here the outlier meaning the unmatched entry. Or the dataset for which some entry is missing

Category v/s Dataset Fig. 3: Data Table Description

All rights reserved by www.ijirst.org

211


Outlier Detection based on Distance: Reverse Nearest Neighbors (IJIRST/ Volume 4 / Issue 1/ 035)

As shown in figure 4 the graph represents time require to find out outlier score. If file size is increasing rapidly then the time is also increased to find out outlier score.

Fig. 4: Result graph

Outcome We are getting result from this dissertation, its just a expecting result. It’s part of knowledge about outlier in the data. The expected output of modules is as follows: a) User  User login to the the system and upload the file in which outlier to be find out.  User can preprocess the file through system to examine each attribute for outlier detection. b) System  Detect the outlier and antihubs score.  Shows the relevant documents to user enter keyword. VI. CONTRIBUTION IMPLEMENTATION We have created graphical view for outlier detection. Graphical analysis makes it simple for the viewer to analyze the data properly, We have created separate module for this with following fields which are used to generate the graph.  Rules  direct_rule_protection  Indirect_rule_protection  rulegen

Fig. 5: Graph for Outlier Detection Analysis

All rights reserved by www.ijirst.org

212


Outlier Detection based on Distance: Reverse Nearest Neighbors (IJIRST/ Volume 4 / Issue 1/ 035)

VII. CONCLUSION AND FUTURE WORK In this Paper, we provided a unifying view of the role of reverse nearest neighbor counts in problems concerning unsupervised outlier detection, focusing on the effects of high dimensionality on unsupervised outlier-detection methods. Based on the analysis, the formulated the Anti Hub method for detection of outliers, discussed its properties, and proposed a derived method which improves discrimination between scores. The existence of hubs and anti-hubs in high-dimensional data is relevant to machinelearning techniques from various families: supervised, semi-supervised, as well as unsupervised. The mainly focused on only unsupervised methods, but in future work it can be extended to supervised and semi-supervised methods as well. Another relevant topic is the development of approximate versions of Anti Hub methods that may sacrifice accuracy to improve execution speed. Finally, secondary measures of distance/similarity, such as shared-neighbor distances warrant further exploration in the outlierdetection context. Graphical view for analyzing outlier detection makes it easy to analyze the detection. In future, enhance the proposed system to handle high dimensional data and high computation complexity for better experimental results, to make an efficient intrusion and anomaly detection system. In Future System used in as:  For Multi type data website, protocol will play important role.  Used to find fraud detection in net banking or credit card. REFERENCES Milos Radovanovic, Alexandros Nanopoulos, and Mirjana Ivanovic, “Reverse Nearest Neighbors in UnsupervisedDistance-Based Outlier Detection” ,IEEE Transaction May 2015 [2] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,” ACM Comput. Survey, vol. 41, no. 3, p. 15, 2009. [3] P. J. Rousseeuw and A. M. Leroy, Robust Regression and Outlier Detection. Hoboken, NJ, USA: Wiley, 1987. [4] S. Ramaswamy, R. Rastogi, and K. Shim, “Efficient algorithms for mining outliers from large data sets,” SIGMOD Rec., vol. 29, no. 2, pp. 427–438, 2000. [5] E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo, “A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data,” in Proc. Conf. Appl. Data Mining Comput. Security, 2002, pp. 78–100. [6] E. M. Knorr, R. T. Ng, and V. Tucakov, “Distance-based outliers: Algorithms and applications,” VLDB J., vol. 8, nos. 3–4, pp. 237– 253, 2000. [7] K. S. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, “When is “nearest neighbor” meaningful?” in Proc. 7th Int. Conf. Database Theory, 1999, pp. 217– 235. [8] C. C. Aggarwal, A. Hinneburg, and D. A. Keim, “On the surprising behavior of distance metrics in high dimensional spaces,” in Proc. 8th Int. Conf. Database Theory, 2001, pp. 420–434. [9] D. Franc¸ois, V. Wertz, and M. Verleysen, “The concentration of fractional distances,” IEEE Trans. Knowl. Data. Eng., vol. 19, no. 7, pp. 873–886, Jul. 2007. [10] C. C. Aggarwal and P. S. Yu, “Outlier detection for high dimensional data,” in Proc. 27th ACM SIGMOD Int. Conf. Manage. Data, 2001, pp. 37–46. [11] A. Zimek, E. Schubert, and H.-P. Kriegel, “A survey on unsupervised outlier detection in high-dimensional numerical data,” Statist. Anal. Data Mining, vol. 5, no. 5, pp. 363–387, 2012. [1]

All rights reserved by www.ijirst.org

213


Turn static files into dynamic content formats.

Create a flipbook
Issuu converts static files into: digital portfolios, online yearbooks, online catalogs, digital photo albums and more. Sign up and create your flipbook.