November 2015
ISSN (ONLINE): 2045-8711    ISSN (PRINT): 2045-869X
INTERNATIONAL JOURNAL OF INNOVATIVE TECHNOLOGY & CREATIVE ENGINEERING
NOVEMBER 2015, VOL. 5, NO. 11
@IJITCE Publication



UK: Managing Editor, International Journal of Innovative Technology and Creative Engineering, 1a Park Lane, Cranford, London TW5 9WA, UK. E-Mail: editor@ijitce.co.uk, Phone: +44-773-043-0249
USA: Editor, International Journal of Innovative Technology and Creative Engineering, Dr. Arumugam, Department of Chemistry, University of Georgia, GA-30602, USA. Phone: 001-706-206-0812, Fax: 001-706-542-2626
India: Editor, International Journal of Innovative Technology & Creative Engineering, Dr. Arthanariee. A. M, Finance Tracking Center India, 66/2 East Mada St, Thiruvanmiyur, Chennai-600041. Mobile: 91-7598208700

www.ijitce.co.uk




IJITCE PUBLICATION

INTERNATIONAL JOURNAL OF INNOVATIVE TECHNOLOGY & CREATIVE ENGINEERING Vol.5 No.11 November 2015





From the Editor's Desk
Dear Researcher,
Greetings! A research article in this issue discusses motivational factor analysis. Let us review research around the world this month.
Engineers are developing robotic spacecraft to assist satellite repairs in orbit. NASA engineers are developing and demonstrating technologies to service and repair satellites in distant orbits. Robotic spacecraft, likely operated with joysticks by technicians on the ground, would carry out the hands-on maneuvers, not human beings using robotic and other specialized tools as was the case for spacecraft like the low-Earth-orbiting Hubble Space Telescope.
Web usage mining is the application of data mining techniques to discover interesting usage patterns from web data in order to understand and better serve the needs of web-based applications. Usage data captures the identity or origin of web users along with their browsing behaviour at a website. The user logs are collected by the web server; typical data includes the IP address, page reference and access time. Commercial application servers have significant features that enable e-commerce applications to be built on top of them with little effort. A key feature is the ability to track various kinds of business events and log them in application server logs. New kinds of events can be defined in an application, and logging can be turned on for them, thus generating histories of these specially defined events. It must be noted that many end applications require a combination of one or more of the techniques in the categories above. Web usage mining by itself does not create issues, but this technology, when used on data of a personal nature, might cause concerns.
It has been an absolute pleasure to present you with articles that you wish to read. We look forward to many more new technology-related research articles from you and your friends, and we eagerly await the rich and thorough research papers that our authors have prepared for the next issue.
Thanks,
Editorial Team, IJITCE




Editorial Members
Dr. Chee Kyun Ng, Ph.D, Department of Computer and Communication Systems, Faculty of Engineering, Universiti Putra Malaysia, UPM Serdang, 43400 Selangor, Malaysia.
Dr. Simon SEE, Ph.D, Chief Technologist and Technical Director at Oracle Corporation, Associate Professor (Adjunct) at Nanyang Technological University, Professor (Adjunct) at Shanghai Jiaotong University, 27 West Coast Rise #08-12, Singapore 127470
Dr. sc. agr. Horst Juergen SCHWARTZ, Ph.D, Humboldt-University of Berlin, Faculty of Agriculture and Horticulture, Asternplatz 2a, D-12203 Berlin, Germany
Dr. Marco L. Bianchini, Ph.D, Italian National Research Council; IBAF-CNR, Via Salaria km 29.300, 00015 Monterotondo Scalo (RM), Italy
Dr. Nijad Kabbara, Ph.D, Marine Research Centre / Remote Sensing Centre / National Council for Scientific Research, P.O. Box 189, Jounieh, Lebanon
Dr. Aaron Solomon, Ph.D, Department of Computer Science, National Chi Nan University, No. 303, University Road, Puli Town, Nantou County 54561, Taiwan
Dr. Arthanariee. A. M, M.Sc., M.Phil., M.S., Ph.D, Director - Bharathidasan School of Computer Applications, Ellispettai, Erode, Tamil Nadu, India
Dr. Takaharu KAMEOKA, Ph.D, Professor, Laboratory of Food, Environmental & Cultural Informatics, Division of Sustainable Resource Sciences, Graduate School of Bioresources, Mie University, 1577 Kurimamachiya-cho, Tsu, Mie, 514-8507, Japan
Dr. M. Sivakumar, M.C.A., ITIL., PRINCE2., ISTQB., OCP., ICP., Ph.D., Project Manager - Software, Applied Materials, 1a Park Lane, Cranford, UK
Dr. Bulent Acma, Ph.D, Anadolu University, Department of Economics, Unit of Southeastern Anatolia Project (GAP), 26470 Eskisehir, TURKEY
Dr. Selvanathan Arumugam, Ph.D, Research Scientist, Department of Chemistry, University of Georgia, GA-30602, USA.

Review Board Members
Dr. Paul Koltun, Senior Research Scientist, LCA and Industrial Ecology Group, Metallic & Ceramic Materials, CSIRO Process Science & Engineering, Private Bag 33, Clayton South MDC 3169, Gate 5 Normanby Rd., Clayton Vic. 3168, Australia
Dr. Zhiming Yang, MD., Ph.D., Department of Radiation Oncology and Molecular Radiation Science, 1550 Orleans Street Rm 441, Baltimore MD, 21231, USA
Dr. Jifeng Wang, Department of Mechanical Science and Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois, 61801, USA
Dr. Giuseppe Baldacchini, ENEA - Frascati Research Center, Via Enrico Fermi 45 - P.O. Box 65, 00044 Frascati, Roma, ITALY.
Dr. Mutamed Turki Nayef Khatib, Assistant Professor of Telecommunication Engineering, Head of Telecommunication Engineering Department, Palestine Technical University (Kadoorie), Tulkarm, PALESTINE.



Dr. P. Uma Maheswari, Prof & Head, Department of CSE/IT, INFO Institute of Engineering, Coimbatore.
Dr. T. Christopher, Ph.D., Assistant Professor & Head, Department of Computer Science, Government Arts College (Autonomous), Udumalpet, India.
Dr. T. DEVI, Ph.D. Engg. (Warwick, UK), Head, Department of Computer Applications, Bharathiar University, Coimbatore-641 046, India.
Dr. Renato J. Orsato, Professor at FGV-EAESP, Getulio Vargas Foundation, São Paulo Business School, Rua Itapeva, 474 (8° andar), 01332-000, São Paulo (SP), Brazil; Visiting Scholar at INSEAD, INSEAD Social Innovation Centre, Boulevard de Constance, 77305 Fontainebleau, France
Y. Benal Yurtlu, Assist. Prof., Ondokuz Mayis University
Dr. Sumeer Gul, Assistant Professor, Department of Library and Information Science, University of Kashmir, India
Dr. Chutima Boonthum-Denecke, Ph.D, Department of Computer Science, Science & Technology Bldg., Rm 120, Hampton University, Hampton, VA 23688
Dr. Lucy M. Brown, Ph.D., Texas State University, 601 University Drive, School of Journalism and Mass Communication, OM330B, San Marcos, TX 78666
Javad Robati, Crop Production Department, University of Maragheh, Golshahr, Maragheh, Iran
Vinesh Sukumar (PhD, MBA), Product Engineering Segment Manager, Imaging Products, Aptina Imaging Inc.
Dr. Binod Kumar, PhD(CS), M.Phil.(CS), MIAENG, MIEEE, HOD & Associate Professor, IT Dept, Medi-Caps Inst. of Science & Tech. (MIST), Indore, India
Dr. S. B. Warkad, Associate Professor, Department of Electrical Engineering, Priyadarshini College of Engineering, Nagpur, India
Dr. doc. Ing. Rostislav Chotěborský, Ph.D., Katedra materiálu a strojírenské technologie, Technická fakulta, Česká zemědělská univerzita v Praze, Kamýcká 129, Praha 6, 165 21
Mr. Abhishek Taneja, B.Sc (Electronics), M.B.E, M.C.A., M.Phil., Assistant Professor, Department of Computer Science & Applications, Dronacharya Institute of Management and Technology, Kurukshetra, India.



Dr. Amala Vijaya Selvi Rajan, B.Sc, Ph.D, Faculty - Information Technology, Dubai Women's College - Higher Colleges of Technology, P.O. Box 16062, Dubai, UAE
Naik Nitin Ashokrao, B.Sc, M.Sc, Lecturer in Yeshwant Mahavidyalaya, Nanded University
Dr. A. Kathirvell, B.E, M.E, Ph.D, MISTE, MIACSIT, MENGG, Professor - Department of Computer Science and Engineering, Tagore Engineering College, Chennai
Dr. H. S. Fadewar, B.Sc, M.Sc, M.Phil., Ph.D, PGDBM, B.Ed., Associate Professor - Sinhgad Institute of Management & Computer Application, Mumbai-Bangalore Westernly Express Way, Narhe, Pune - 41
Dr. David Batten, Leader, Algal Pre-Feasibility Study, Transport Technologies and Sustainable Fuels, CSIRO Energy Transformed Flagship, Private Bag 1, Aspendale, Vic. 3195, AUSTRALIA
Dr. R C Panda (M.Tech & PhD (IITM); Ex-Faculty (Curtin Univ Tech, Perth, Australia)), Scientist, CLRI (CSIR), Adyar, Chennai - 600 020, India
Miss Jing He, Ph.D. Candidate of Georgia State University, 1450 Willow Lake Dr. NE, Atlanta, GA, 30329
Jeremiah Neubert, Assistant Professor, Mechanical Engineering, University of North Dakota
Hui Shen, Mechanical Engineering Dept, Ohio Northern Univ.
Dr. Xiangfa Wu, Ph.D., Assistant Professor / Mechanical Engineering, NORTH DAKOTA STATE UNIVERSITY
Seraphin Chally Abou, Professor, Mechanical & Industrial Engineering Depart., MEHS Program, 235 Voss-Kovach Hall, 1305 Ordean Court, Duluth, Minnesota 55812-3042
Dr. Qiang Cheng, Ph.D., Assistant Professor, Computer Science Department, Southern Illinois University Carbondale, Faner Hall, Room 2140 - Mail Code 4511, 1000 Faner Drive, Carbondale, IL 62901
Dr. Carlos Barrios, PhD, Assistant Professor of Architecture, School of Architecture and Planning, The Catholic University of America



Dr. Wael M. G. Ibrahim, Department Head - Electronics Engineering Technology Dept., School of Engineering Technology, ECPI College of Technology, 5501 Greenwich Road Suite 100, Virginia Beach, VA 23462
Dr. Messaoud Jake Bahoura, Associate Professor - Engineering Department and Center for Materials Research, Norfolk State University, 700 Park Avenue, Norfolk, VA 23504
Dr. V. P. Eswaramurthy, M.C.A., M.Phil., Ph.D., Assistant Professor of Computer Science, Government Arts College (Autonomous), Salem-636 007, India.
Dr. P. Kamakkannan, M.C.A., Ph.D., Assistant Professor of Computer Science, Government Arts College (Autonomous), Salem-636 007, India.
Dr. V. Karthikeyani, Ph.D., Assistant Professor of Computer Science, Government Arts College (Autonomous), Salem-636 008, India.
Dr. K. Thangadurai, Ph.D., Assistant Professor, Department of Computer Science, Government Arts College (Autonomous), Karur - 639 005, India.
Dr. N. Maheswari, Ph.D., Assistant Professor, Department of MCA, Faculty of Engineering and Technology, SRM University, Kattangulathur, Kanchipuram Dt - 603 203, India.
Mr. Md. Musfique Anwar, B.Sc (Engg.), Lecturer, Computer Science & Engineering Department, Jahangirnagar University, Savar, Dhaka, Bangladesh.
Mrs. Smitha Ramachandran, M.Sc (CS), SAP Analyst, Akzonobel, Slough, United Kingdom.
Dr. V. Vallimayil, Ph.D., Director, Department of MCA, Vivekanandha Business School For Women, Elayampalayam, Tiruchengode - 637 205, India.
Mr. M. Moorthi, M.C.A., M.Phil., Assistant Professor, Department of Computer Applications, Kongu Arts and Science College, India
Prema Selvaraj, B.Sc, M.C.A, M.Phil, Assistant Professor, Department of Computer Science, KSR College of Arts and Science, Tiruchengode
Mr. G. Rajendran, M.C.A., M.Phil., N.E.T., PGDBM., PGDBF., Assistant Professor, Department of Computer Science, Government Arts College, Salem, India.
Dr. Pradeep H Pendse, B.E., M.M.S., Ph.D, Dean - IT, Welingkar Institute of Management Development and Research, Mumbai, India
Muhammad Javed, Centre for Next Generation Localisation, School of Computing, Dublin City University, Dublin 9, Ireland
Dr. G. GOBI, Assistant Professor - Department of Physics, Government Arts College, Salem - 636 007
Dr. S. Senthilkumar, Post Doctoral Research Fellow (Mathematics and Computer Science & Applications), Universiti Sains Malaysia, School of Mathematical Sciences, Pulau Pinang-11800, PENANG, MALAYSIA.
Manoj Sharma, Associate Professor, Dept. of ECE, Prannath Parnami Institute of Management & Technology, Hissar, Haryana, India
Ramkumar Jaganathan, Asst. Professor, Dept of Computer Science, V.L.B Janakiammal College of Arts & Science, Coimbatore, Tamilnadu, India



Dr. S. B. Warkad, Assoc. Professor, Priyadarshini College of Engineering, Nagpur, Maharashtra State, India
Dr. Saurabh Pal, Associate Professor, UNS Institute of Engg. & Tech., VBS Purvanchal University, Jaunpur, India
Manimala, Assistant Professor, Department of Applied Electronics and Instrumentation, St Joseph's College of Engineering & Technology, Choondacherry Post, Kottayam Dt., Kerala - 686579
Dr. Qazi S. M. Zia-ul-Haque, Control Engineer, Synchrotron-light for Experimental Sciences and Applications in the Middle East (SESAME), P.O. Box 7, Allan 19252, Jordan
Dr. A. Subramani, M.C.A., M.Phil., Ph.D., Professor, Department of Computer Applications, K.S.R. College of Engineering, Tiruchengode - 637215
Dr. Seraphin Chally Abou, Professor, Mechanical & Industrial Engineering Depart., MEHS Program, 235 Voss-Kovach Hall, 1305 Ordean Court, Duluth, Minnesota 55812-3042
Dr. K. Kousalya, Professor, Department of CSE, Kongu Engineering College, Perundurai - 638 052
Dr. (Mrs.) R. Uma Rani, Asso. Prof., Department of Computer Science, Sri Sarada College For Women, Salem-16, Tamil Nadu, India.
Mohammad Yazdani-Asrami, Electrical and Computer Engineering Department, Babol "Noshirvani" University of Technology, Iran.
Dr. Kulasekharan, N, Ph.D, Technical Lead - CFD, GE Appliances and Lighting, GE India, John F Welch Technology Center, Plot # 122, EPIP, Phase 2, Whitefield Road, Bangalore - 560066, India.
Dr. Manjeet Bansal, Dean (Post Graduate), Department of Civil Engineering, Punjab Technical University, Giani Zail Singh Campus, Bathinda - 151001 (Punjab), INDIA
Dr. Oliver Jukić, Vice Dean for Education, Virovitica College, Matije Gupca 78, 33000 Virovitica, Croatia
Dr. Lori A. Wolff, Ph.D., J.D., Professor of Leadership and Counselor Education, The University of Mississippi, Department of Leadership and Counselor Education, 139 Guyton, University, MS 38677




Contents
A Novel Multimodel Feature Extraction for Authentication Approach Using Iris Image, R. Ashwinkumar & Dr. S. Pannirselvam ................................ [301]
High Dimensional Feature Based Word Pair Similarity Measuring For Web Database Using Skip-Pattern Clustering Algorithm, Dr. C. Senthilkumar & R. Navin Kumar ................................ [306]




A Novel Multimodel Feature Extraction for Authentication Approach Using Iris Image
R. Ashwinkumar, Ph.D Research Scholar, Department of Computer Science, Erode Arts & Science College (Autonomous), Erode, Tamil Nadu, India. Email: ahaashwin@gmail.com
Dr. S. Pannirselvam, Research Supervisor & Head, Department of Computer Science, Erode Arts & Science College (Autonomous), Erode, Tamil Nadu, India. Email: pannirselvam08@gmail.com
Abstract— Identity authentication is an essential task in real-world environments that use biometric models to authenticate persons in real time. Authenticating a system with biometrics becomes burdensome when fake images exist. In the existing work, this problem is overcome by introducing a multimodel authentication system in which wrong authentication due to fake images is reduced considerably. In that work, score-level fusion is performed with a triangulation-based method that combines multiple biometrics (dual iris, thermal face and normal face) to authenticate a single person. However, authentication based on identity information can become more complex when noise is present in the input images; because of this noise, efficient and accurate matching of the test image against the database images might fail. The performance of the combined multi-biometric authentication is improved by replacing the triangulation-based method with a min-max score-level fusion approach, which leads to more efficient processing. Thermal image enhancement is performed before feature extraction to support reliable detection of fake images. The experimental tests have been conducted in the MATLAB simulation environment, which provides a flexible and convenient environment for executing the system. The performance evaluation proves that the proposed methodology provides better results than the existing system in terms of improved accuracy and successful authentication.
Keywords— DCT, WALSH, HAAR, RCF.

1. INTRODUCTION
A biometric system is essentially a pattern recognition system that makes a personal identification by determining the authenticity of a specific physiological or behavioural characteristic possessed by the user. Biometric technologies are thus defined as the "automated methods of identifying or authenticating the identity of a living person based on a physiological or behavioural characteristic". Multimodal biometric systems are those that utilize more than one physiological or behavioural characteristic for enrolment, verification, or identification. In applications such as border entry/exit, access control, civil identification, and network security, multimodal biometric systems are looked to as a means of reducing false non-match and false match rates, providing a secondary means of enrolment, verification and identification if sufficient data cannot be acquired from a given biometric sample, and combating attempts to fool biometric systems through fraudulent data sources such as fake fingers.
A multimodal biometric verification system can be considered a classical information fusion problem; that is, it combines evidence provided by different biometrics to improve the overall decision accuracy. Fusion can take place at three levels. Abstract level: the output from each module is only a set of possible labels without any confidence values associated with the labels; in this case a simple majority rule may be used to reach a more reliable decision. Rank level: the output from each module is a set of possible labels ranked by decreasing confidence values, but the confidence values themselves are not specified. Measurement level: the output from each module is a set of possible labels with associated confidence values; in this case, more accurate decisions can be made by integrating the different confidence values.
2. RELATED WORKS
Ismail [1] comprehensively categorizes image quality measures, extends measures defined for gray-scale images to the multispectral case, and proposes novel image quality measures. Subjective tests are tedious, time consuming and expensive, and the results depend on various factors such as the observer's background and motivation. The statistical behavior of the measures and their sensitivity to coding artifacts are investigated via analysis-of-variance techniques, and their similarities or differences are illustrated by plotting their Kohonen maps. Samir [2] proposes a new technique for generating synthetic iris images. The background texture is first generated using a texture synthesis scheme based on a single primitive element; then, features of the iris such as the radial and concentric furrows, collarette and crypts are added to the synthetic images. Line integral convolution is used to impart texture to the radial furrows. Anil [3] observes that a reliable identity management system is urgently needed in order to combat the epidemic growth in identity theft and to meet the increased security requirements in a variety of applications, ranging from international border crossings to securing information in databases. Establishing the identity of a person is a critical task in any identity management system. Surrogate representations of identity such as passwords and ID cards are not sufficient for reliable identity determination because they can be easily misplaced,



shared, or stolen. Biometric recognition is the science of establishing the identity of a person using his/her anatomical and behavioural traits. Matthew [4] shows that, with the exception of the identity mapping, pixel-value mappings leave behind statistical artifacts which are visible in an image's pixel-value histogram. These artifacts are regarded as the intrinsic fingerprint of a pixel-value mapping. By observing the common properties of the histograms of unaltered images, the work builds a model of an unaltered image's pixel-value histogram and uses this model to identify diagnostic features of a pixel-value mapping's intrinsic fingerprint. Because a number of image processing operations are in essence pixel-value mappings, the work proposes a set of image forgery detection techniques which operate by detecting the intrinsic fingerprint of each operation. At lower quality factors, however, noise detection appears to become more difficult.
3. DRAWBACKS IN THE EXISTING SYSTEM
The above research methodologies concentrate on finding the similarity of the users based on gait sequences and biometric sequences. A single biometric is used for authentication of users, and feature extraction and analysis are done in a synthetic manner. From this literature it is clear that authenticating persons is difficult in a real-world environment where many security threats are present. Gait-sequence-based authentication is a complex process, and a user with similar mannerisms can hack the system easily. A statistical-measure-based authentication system cannot support authentication decisions in a real-time dynamic environment.
4. PROPOSED METHODOLOGY
The problems that reside in the existing methodology are that the noisy elements present in the images might lead to failure of the biometric authentication process, and that triangulation-based score-level fusion does not provide an optimal decision; it is a complex process that leads to computational overhead. In the existing system, multi-biometric authentication is used. The biometrics used in this work are the dual iris, the normal face and the thermal face. The features from these images are extracted and feature fusion is performed on the iris, normal-face and thermal-face features. Score-level fusion then combines the feature details of the images, and finally the fake image is identified from the output of the score-level fusion approach.
4.1 MULTI BIOMETRIC BASED AUTHENTICATION
In the existing work, multi-biometric authentication is performed to prevent authentication of malicious users. The biometrics considered in this work are the dual iris and the thermal face image.
4.2 FEATURE EXTRACTION IN THE IRIS IMAGES
In the existing research work, feature extraction from the iris-code images is done using 1D log-Gabor filter based feature extraction. Relations between activations for a specific spatial location are very distinctive between objects in an image. Furthermore, important activations can be extracted

from the Gabor space in order to create a sparse object representation. Gabor filters have been used extensively in a variety of image processing problems, such as fingerprint enhancement and iris recognition.
4.3 FEATURE EXTRACTION IN THERMAL FACE IMAGES
The thermal feature extraction is done using the (2D)2FPCA feature extraction approach. The existing (2D)2LDA-LPP method effectively combines alternative 2DLDA with alternative 2DLPP. The feature extraction is split into two steps: first, the column-directional information is extracted by applying alternative 2DLDA; second, the feature matrix is transposed and alternative 2DLPP is used to extract the row-directional information.
4.4 FEATURE EXTRACTION IN VISIBLE FACE IMAGES
Multilinear principal component analysis (MPCA) is a mathematical procedure that uses multiple orthogonal transformations to convert a set of multidimensional objects into another set of multidimensional objects of lower dimensions. There is one orthogonal (linear) transformation for each dimension (mode), hence "multilinear". This transformation aims to capture as high a variance as possible, accounting for as much of the variability in the data as possible, subject to the constraint of mode-wise orthogonality.
4.5 FEATURE LEVEL FUSION FOR BOTH VISIBLE AND THERMAL FACES
Almost every biometric system consists of four major parts: sensor, feature extraction, comparison, and the final decision. In order to address specific parts of the process, we use a slightly more complex pipeline structure; the numbered items below refer to its individual stages.
1. The detection of facial features involves the localization of important facial landmarks, which are used in subsequent steps. In this paper we use manually annotated data, because precise detection of facial features is still a challenge; moreover, we focus on algorithm performance rather than detection accuracy.
2. The face can be normalized into a predefined position by an affine transformation based on the detected points (2D warping). The other solution (3D projection) involves a 3D model that is adapted to the input image.
3. Some of our testing databases contain IR images captured in dynamic-range mode; because of this, intensity normalization is needed.
4. The feature vector extraction is a simple vectorization of the normalized image. The other possibility is the application of a filter bank; either Gabor filters or Laguerre-Gaussian filter banks may be used.
5. In order to reduce the space dimensionality, as well as to increase inter-class variability and/or reduce redundancy, the feature vector may optionally be passed onto some statistical projection technique.



6. Feature-vector post-processing involves additional processing of the feature vector. For example, a selection of the best components (in terms of recognition performance) may be used, individual components may be weighted, or the components may be normalized.
7. Two processed faces (feature vectors) are compared by calculating the distance between them. Any distance metric may accomplish this task; in this paper we use the Euclidean, cosine, city-block (Manhattan, sum of absolute differences) and correlation metrics, as sketched below.
8. The final decision is a simple thresholding of the achieved comparison score.
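To make step 7 concrete, the sketch below computes the four distance measures named above between two feature vectors. It is a minimal Python/NumPy illustration, not the authors' MATLAB implementation; the example vectors are hypothetical fused feature vectors.

```python
import numpy as np

def euclidean(u, v):
    # Square root of the sum of squared coordinate differences
    return float(np.linalg.norm(u - v))

def city_block(u, v):
    # Manhattan distance: sum of absolute differences
    return float(np.sum(np.abs(u - v)))

def cosine_distance(u, v):
    # 1 - cosine of the angle between the two vectors
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def correlation_distance(u, v):
    # 1 - Pearson correlation of the mean-centred vectors
    uc, vc = u - u.mean(), v - v.mean()
    return 1.0 - float(np.dot(uc, vc) / (np.linalg.norm(uc) * np.linalg.norm(vc)))

if __name__ == "__main__":
    a = np.array([0.2, 0.8, 0.5, 0.1])   # hypothetical probe feature vector
    b = np.array([0.3, 0.7, 0.6, 0.2])   # hypothetical gallery feature vector
    for name, fn in [("euclidean", euclidean), ("city-block", city_block),
                     ("cosine", cosine_distance), ("correlation", correlation_distance)]:
        print(name, round(fn(a, b), 4))
```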

4.6 Algorithm
Input: thermal face image, normal face image, dual iris images
Output: authentication result
Step 1: Retrieve the thermal face image from the database.
Step 2: Enhance the thermal image by eliminating noise and improving the contrast.
Step 3: Extract the features from the image using (2D)2LDA-LPP.
Step 4: Calculate the feature fusion score of the thermal face image.
Step 5: Retrieve the normal face image from the database.
Step 6: Extract the features from the image using MPCA.
Step 7: Calculate the feature fusion score of the normal face image.
Step 8: Retrieve the dual iris images from the database.
Step 9: Extract the features using the 1D Gabor filter approach.
Step 10: Calculate the fusion score of the iris features.
Step 11: Compute the score-fusion value by integrating the feature score values of the dual iris and the thermal face image.
Step 12: Output whether the person is authenticated or not.
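The abstract and Section 5.2 name min-max score-level fusion as the technique that combines the matcher outputs, but the computation is not spelled out. The sketch below is one plausible reading: each matcher's score is min-max normalised to [0, 1] using the score range observed on training data, the normalised scores are combined by a weighted sum, and the fused score is thresholded. The score ranges, weights and threshold below are illustrative assumptions, not values from the paper.

```python
def min_max_normalise(score, s_min, s_max):
    # Map a raw matcher score into [0, 1] using the training-set score range
    return (score - s_min) / (s_max - s_min)

def fuse_scores(raw_scores, score_ranges, weights):
    # raw_scores:   {"iris": s1, "visible_face": s2, "thermal_face": s3}
    # score_ranges: per-matcher (min, max) observed on training data (assumed)
    # weights:      per-matcher contribution, summing to 1 (assumed)
    fused = 0.0
    for name, score in raw_scores.items():
        lo, hi = score_ranges[name]
        fused += weights[name] * min_max_normalise(score, lo, hi)
    return fused

def authenticate(raw_scores, score_ranges, weights, threshold=0.5):
    # Accept the claimed identity when the fused score clears the threshold
    return fuse_scores(raw_scores, score_ranges, weights) >= threshold

if __name__ == "__main__":
    ranges = {"iris": (0.0, 1.0), "visible_face": (10.0, 90.0), "thermal_face": (5.0, 60.0)}
    weights = {"iris": 0.4, "visible_face": 0.3, "thermal_face": 0.3}
    probe = {"iris": 0.82, "visible_face": 71.0, "thermal_face": 44.0}
    print("authenticated:", authenticate(probe, ranges, weights))
```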

5. EXPERIMENTAL RESULTS
MATLAB is an interactive software package which was developed to perform numerical calculations on vectors and matrices. It is a high-performance language for technical computing that integrates computation, visualization, and programming in an easy-to-use environment where problems and solutions are expressed in familiar mathematical notation. It is an interactive system whose basic data element is an array that does not require dimensioning. This allows solutions to many technical computing problems, especially those involving matrix representations, to be formulated in a fraction of the time it would take to write a program in a scalar, non-interactive language. MATLAB stands for Matrix Laboratory; it was written originally to provide easy access to matrix software developed by the linear system and eigensystem packages. It is complemented by a family of application-specific solutions called toolboxes, contains a high-level programming language which makes it easy to code complicated algorithms involving vectors and matrices, and can produce sophisticated graphics in two and three dimensions. The Image Processing Toolbox is a collection of MATLAB functions that extend the capability of the MATLAB environment for the solution of digital image processing problems.
5.1 MULTI BIOMETRIC SAMPLE DATABASE
The sample biometric images considered in this work for secure authentication of persons trying to access information are given below; these biometrics improve authentication security through the differences that exist between the proposed modalities.
[Figures: Iris Right Image, Iris Left Image]
5.2 PROCEDURE FOR IDENTITY FUSION BASED AUTHENTICATION
The complete procedure for authenticating a person from the multiple biometrics gathered from the users is as follows: feature extraction from the biometrics, feature-level fusion creation, score-level fusion calculation, and authentication of the person.
6. PERFORMANCE EVALUATION
6.1 Accuracy
Accuracy is defined as the correctness of authentication of the persons entering the system, without error. The accuracy of the authentication system should be high, so that retrieval can be done without wrong authentication. With a, b, c and d defined below, accuracy is calculated as follows:
Accuracy = (a + d) / (a + b + c + d)   ------------(1)
6.2 True positive rate (TP)
The recall or true positive rate (TP) is the proportion of positive cases that were correctly identified, as calculated using the equation:
TP = d / (c + d)   ---------(2)
6.3 False positive rate (FP)
The false positive rate (FP) is the proportion of negative cases that were incorrectly classified as positive, as calculated using the equation:
FP = b / (a + b)   ---------(3)
6.4 False negative rate (FN)
The false negative rate (FN) is the proportion of positive cases that were incorrectly classified as negative, as calculated using the equation:
FN = c / (c + d)   ---------(4)
a is the number of correct predictions that an instance is negative



b is the number of incorrect predictions that an instance is positive
c is the number of incorrect predictions that an instance is negative
d is the number of correct predictions that an instance is positive
6.5 Precision
Precision is the percentage ability of the system to authenticate a person correctly; it indicates the proportion of correct predictions among the total number of predictions made. The precision value is calculated as follows:
Precision = True Positive / (True Positive + False Positive)   -------(5)
6.6 Recall
Recall is determined from the true positive and false negative predictions; in this context it is also referred to as the true positive rate, i.e., the fraction of relevant instances that are retrieved.
Recall = TP / (TP + FN)   --------------(6)
6.7 F-Measure
The F-Measure computes the harmonic mean of the information retrieval precision and recall metrics:
F-measure = 2 x Precision x Recall / (Precision + Recall)   ---------------(7)
To evaluate the overall performance of the proposed system in terms of accurate authentication of persons based on their biometrics, experiments were conducted in the MATLAB simulation environment. Different performance parameters are considered for comparing the proposed approach with the existing approaches in order to demonstrate the performance improvement in terms of efficient authentication. Table 6.1 lists the performance metric values obtained in the MATLAB simulation environment for the proposed methodology and the various existing approaches.
Figure 6.1 Comparison of Accuracy parameter of different research methods
Figure 6.2 Comparison of Precision parameter of different research methods
Figure 6.3 Comparison of Recall parameter of different research methods
Figure 6.4 Comparison of F-Measure parameter of different research methods
Table 6.1 Performance measure values of proposed and various existing researches
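As a quick illustration of equations (1)-(7), the snippet below computes the reported metrics from the four confusion-matrix counts a, b, c and d defined above. It is a generic helper written for this description, not part of the paper's MATLAB code, and the example counts are made up.

```python
def evaluate(a, b, c, d):
    # a: true negatives, b: false positives, c: false negatives, d: true positives
    accuracy  = (a + d) / (a + b + c + d)          # equation (1)
    tp_rate   = d / (c + d)                        # equation (2)
    fp_rate   = b / (a + b)                        # equation (3)
    fn_rate   = c / (c + d)                        # equation (4)
    precision = d / (d + b)                        # equation (5)
    recall    = tp_rate                            # equation (6)
    f_measure = 2 * precision * recall / (precision + recall)   # equation (7)
    return {"accuracy": accuracy, "TP rate": tp_rate, "FP rate": fp_rate,
            "FN rate": fn_rate, "precision": precision, "recall": recall,
            "F-measure": f_measure}

if __name__ == "__main__":
    # hypothetical counts from an authentication experiment
    print(evaluate(a=90, b=5, c=8, d=97))
```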


7. CONCLUSION
Identity-based authentication is an essential process in most real-world applications, which have started to use biometric authentication to secure their environments. In this work, a secured multi-biometric system is presented in which the biometrics are authorized in a secure manner. The biometrics considered are the dual iris, the normal face image and the thermal face. The thermal face is enhanced before feature extraction to improve the matching rate. The fusion of features is done with consideration of the various segmentation details, and the min-max based approach is used to fuse the multi-biometric features, from which the security is enhanced considerably. Experimental results prove that the proposed methodology provides better results than the existing work.

8. REFERENCES
[1] Matthew C. Stamm and K. J. Ray Liu, "Forensic Detection of Image Manipulation Using Statistical Intrinsic Fingerprints", IEEE Transactions on Information Forensics and Security, Vol. 5, No. 3, September 2010.
[2] Samir Shah and Arun Ross, "Generating Synthetic Irises by Feature Agglomeration", Proceedings of the International Conference on Image Processing (ICIP), Atlanta, USA, October 2006.
[3] Ismail Avcibas, Bulent Sankur, Khalid Sayood, "Statistical Evaluation of Image Quality Measures", Journal of Electronic Imaging, 11(2), 206-223, April 2002.
[4] Raffaele Cappelli, Alessandra Lumini, Dario Maio and Davide Maltoni, "Fingerprint Image Reconstruction from Standard Templates", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 29, No. 9, September 2007.
[5] Siwei Lyu and Hany Farid, "Steganalysis Using Higher-Order Image Statistics", IEEE Transactions on Information Forensics and Security, Vol. 1, No. 1, March 2006.
[6] Gian Luca Marcialis, Aaron Lewicke, Bozhao Tan, Pietro Coli, Fabio Roli, "First International Fingerprint Liveness Detection Competition—LivDet 2009", Image Analysis and Processing, Volume 5716, pp. 12-23, 2009.
[7] Anmin Liu, Weisi Lin, and Manish Narwaria, "Image Quality Assessment Based on Gradient Similarity", IEEE Transactions on Image Processing, Vol. 21, No. 4, April 2012.
[8] Kristin Adair Nixon, Valerio Aimale, Robert K. Rowe, "Spoof Detection Techniques", in Handbook of Biometrics.
[9] A. Hadid, M. Ghahramani, V. Kellokumpu, M. Pietikainen, "Can Gait Biometrics be Spoofed?", Pattern Recognition (ICPR), pp. 3280-3283, 11-15 Nov. 2012.
[10] Anil K. Jain, Karthik Nandakumar, and Abhishek Nagar, "Biometric Template Security", EURASIP Journal on Advances in Signal Processing, Hindawi Publishing Corporation, Volume 2008, Article ID 579416.




High Dimensional Feature Based Word Pair Similarity Measuring For Web Database Using Skip-Pattern Clustering Algorithm
Dr. C. Senthilkumar, Associate Professor, Department of Computer Science, Erode Arts & Science College (Autonomous), Erode, Tamil Nadu, India. Email: csincseasc@gmail.com
R. Navin Kumar, M.Phil Scholar (Part Time), Department of Computer Science, Erode Arts & Science College (Autonomous), Erode, Tamil Nadu, India. Email: navinsoccer07@gmail.com
Abstract— Measuring the semantic similarity between words is an important component in various tasks on the web such as relation extraction, community mining, document clustering, and automatic metadata extraction. Accurately measuring the semantic similarity between words is an important problem in web mining, information retrieval, and natural language processing. In information retrieval, one of the main problems is to retrieve a set of documents that is semantically related to a given user query. Text processing plays an important role in information retrieval, data mining, and web search, and the bag-of-words model is commonly used for it. In this paper, a new scheme proposes an empirical method to estimate the semantic similarity between two words using page counts and text snippets retrieved from a web database. Specifically, it defines various word co-occurrence measures using page counts and integrates them with lexical patterns extracted from text snippets.
Keywords— Web mining, Text processing, Semantic Similarity, Pattern matching, Skip Lexical Pattern.

1. INTRODUCTION
Text mining, also referred to as text data mining and roughly equivalent to text analytics, refers to the process of deriving high-quality information from text [1][2]. High-quality information is typically derived through the devising of patterns and trends by means such as statistical pattern learning. Text mining usually involves the process of structuring the input, deriving patterns within the structured data, and finally evaluating and interpreting the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness [3][4]. Typical text mining tasks include text categorization, text clustering, entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling. Text analysis involves information retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization, and predictive analytics. The overarching goal is, essentially, to turn text into data for analysis via the application of natural language processing (NLP) and analytical methods [5][6][7].


Information retrieval or identification of a corpus is a preparatory step: collecting or identifying a set of textual materials, on the web or held in a file system, database, or content management system, for analysis. Although some text analytics systems apply exclusively advanced statistical methods, many others apply more extensive natural language processing, such as part-of-speech tagging, syntactic parsing, and other types of linguistic analysis [8][9].
The main contributions of the proposed system are as follows. The new system integrates different web-based similarity measures using a machine learning approach. It extracts synonymous word pairs from WordNet synsets as positive training instances and automatically generates negative training instances. Data can be retrieved easily, retrieval is based on similarity and page ranking, the matched word is highlighted, and manual effort is reduced.
The remainder of this paper is organized as follows. Section II reviews similarity measures and related work in web mining. Section III briefly discusses the similarity mechanism, the proposed lexical pattern techniques are presented in Section IV, Section V gives the performance analysis, and Section VI concludes the paper.
2. RELATED WORKS
In information retrieval, one of the main problems is to retrieve a set of documents that is semantically related to a given user query. Efficient estimation of semantic similarity between words is critical for various natural language processing tasks such as word sense disambiguation (WSD) and automatic text summarization [10][11][12]. Semantically related words of a particular word are listed in manually created general-purpose lexical ontologies such as WordNet. In WordNet, a synset contains a set of synonymous words for a particular sense of a word. However, semantic similarity between entities changes over time and across domains. For example, apple is frequently associated with computers on the web, yet this sense of apple is not listed in most general-purpose thesauri or dictionaries [13][14][15]. A user who searches for apple on the web might be interested in this sense of apple and not apple as a fruit. New



words are constantly being created, and new senses are assigned to existing words. Some measures which have been popularly adopted for computing the similarity between two documents are presented below. Let d1 and d2 be two documents represented as vectors, and let A · B denote the inner product of two vectors A and B. The Euclidean distance measure is defined as the root of the squared differences between the respective coordinates of d1 and d2:
D_EUC(d1, d2) = [(d1 - d2) · (d1 - d2)]^(1/2)
Cosine similarity [25] measures the cosine of the angle between d1 and d2 as follows:
S_Cos(d1, d2) = d1 · d2 / [(d1 · d1)^(1/2) (d2 · d2)^(1/2)]
The Jaccard coefficient for data processing is:
S_EJ(d1, d2) = d1 · d2 / (d1 · d1 + d2 · d2 - d1 · d2)
while the Dice coefficient looks similar and is defined as follows:
S_Dic(d1, d2) = 2 d1 · d2 / (d1 · d1 + d2 · d2)
IT-Sim, an information-theoretic measure for document similarity:

where wi represents feature i, pji indicates the normalized value of wi in document dj for j = 1 or j = 2, and π(wi) is the proportion of documents in which wi occurs. In a novel similarity measure between two documents, the similarity degree increases as the number of presence-absence feature pairs decreases: two documents are least similar to each other if none of the features have non-zero values in both documents. Besides, it is desirable to consider the value distribution of a feature for its contribution to the similarity between two documents. The proposed scheme has also been extended to measure the similarity between two sets of documents, and to improve the efficiency an approximation has been provided to reduce the complexity involved in the computation [16][17][18].
3. EXISTING METHODOLOGY
It is important to emphasize that getting from a collection of documents to a clustering of the collection is not merely a single operation but a process in multiple stages. These stages include more traditional information retrieval operations such as crawling, indexing, weighting, and filtering. Some of these processes are central to the quality and performance of most clustering algorithms, and it is thus necessary to consider these stages together with a given clustering algorithm to harness its true potential [19][20]. We give a brief overview of the similarity and clustering process before beginning our literature study and analysis.
A. Clustering
A general definition of clustering is stated by Brian Everitt et al. [19]: given a number of objects or individuals, each of which is described by a set of numerical measures, devise a classification scheme for grouping the objects into a number of classes such that objects within classes are similar in some respect and unlike those from other classes. The number of classes and the characteristics of each class are to be determined.
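To make the measures of Section 2 concrete, the sketch below computes the Euclidean distance and the cosine, extended-Jaccard and Dice coefficients for two documents represented as term-frequency vectors. It is a minimal Python illustration written for this review, not code from the cited systems.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def euclidean(u, v):
    # Root of the squared coordinate differences
    diff = [a - b for a, b in zip(u, v)]
    return math.sqrt(dot(diff, diff))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def jaccard(u, v):
    # Extended Jaccard coefficient on real-valued vectors
    return dot(u, v) / (dot(u, u) + dot(v, v) - dot(u, v))

def dice(u, v):
    return 2 * dot(u, v) / (dot(u, u) + dot(v, v))

if __name__ == "__main__":
    d1 = [2, 0, 1, 3]   # toy term-frequency vector for document 1
    d2 = [1, 1, 0, 2]   # toy term-frequency vector for document 2
    for name, fn in [("euclidean", euclidean), ("cosine", cosine),
                     ("jaccard", jaccard), ("dice", dice)]:
        print(name, round(fn(d1, d2), 4))
```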

B. Feature Base Clustering
Two types of clustering have been studied: clustering the documents on the basis of the distributions of words that co-occur in the documents, and clustering the words using the distributions of the documents in which they occur. In this algorithm we use a double-clustering approach in which we first cluster the words and then use these word clusters to cluster the documents [20][21]. The clustering of words reduces the feature space, thereby reducing the noise and increasing the time efficiency. In general, this algorithm can be used for clustering objects based on their features. A recently introduced principle, termed the information bottleneck method, is based on the following simple idea: given the empirical joint distribution of two variables, one variable is compressed so that the mutual information about the other is preserved as much as possible. Here the two variables are the objects and the features. First, the features are clustered to preserve the information about the objects, and then these clusters are used to reduce the noise in the object graph [22][23]. The main advantage of this procedure lies in a significant reduction of the inevitable noise of the original co-occurrence matrix, due to its very high dimension. The reduced matrix, based on the word clusters, is denser and more robust, providing a better reflection of the inherent structure of the document corpus.
C. K-Means Clustering
K-Means clustering intends to partition n objects into k clusters in which each object belongs to the cluster with the nearest mean [25][26]. This method produces exactly k different clusters of greatest possible distinction. The best number of clusters k leading to the greatest separation (distance) is not known a priori and must be computed from the data. The objective of K-Means clustering is to minimize the total intra-cluster variance, i.e., the squared error function:
J = Σ_{j=1..k} Σ_{x ∈ Cj} ||x - μj||²
where Cj is the j-th cluster and μj is its centroid.

1. Cluster the data into k groups, where k is predefined.
2. Select k points at random as cluster centers.
3. Assign objects to their closest cluster center according to the Euclidean distance function.
4. Calculate the centroid or mean of all objects in each cluster.
5. Repeat steps 2, 3 and 4 until the same points are assigned to each cluster in consecutive rounds.
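The short Python sketch below implements the steps above for points in a d-dimensional space. It is a generic illustration of the algorithm as described here, not the authors' implementation; it uses random initial centers as in step 2 and stops when the assignments no longer change.

```python
import random

def kmeans(points, k, max_iter=100):
    # Step 2: pick k points at random as the initial cluster centers
    centers = random.sample(points, k)
    assignment = [None] * len(points)
    for _ in range(max_iter):
        # Step 3: assign each object to its closest center (squared Euclidean distance)
        new_assignment = [
            min(range(k), key=lambda j: sum((p - c) ** 2 for p, c in zip(x, centers[j])))
            for x in points
        ]
        if new_assignment == assignment:   # step 5: stop when assignments are unchanged
            break
        assignment = new_assignment
        # Step 4: recompute each center as the mean of its assigned objects
        for j in range(k):
            members = [x for x, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centers

if __name__ == "__main__":
    data = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.1), (4.8, 5.2), (9.0, 0.5)]
    print(kmeans(data, k=2))
```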


Fig 3.1 K-Means Clustering



K-Means is a relatively efficient method. However, we need to specify the number of clusters in advance, the final results are sensitive to initialization, and the algorithm often terminates at a local optimum. Unfortunately, there is no global theoretical method to find the optimal number of clusters. A practical approach is to compare the outcomes of multiple runs with different k and choose the best one based on a predefined criterion. In general, a large k probably decreases the error but increases the risk of overfitting [24][25].
D. CLUSTERING PROCESS
We have divided the offline clustering process into the five stages outlined below (Fig 3.2). Collection of Data includes processes like crawling, indexing and filtering, which are used to collect the documents that need to be clustered, index them so that they can be stored and retrieved in a better way, and filter them to remove extra data, for example stopwords.
Fig 3.2 Clustering Process (stages: Collection of Web Data Set, Preprocessing, Similarity Measurement with Jaccard, Dice and IT-Sim, Text Document Clustering, Proposed Document Clustering)
Preprocessing is done to represent the data in a form that can be used for clustering. There are many ways of representing the documents, such as the vector model and the graphical model, and many measures are used for weighting the documents and their similarities. A similarity measure called SMTP (Similarity Measure for Text Processing) is implemented in this paper for two documents, together with Jaccard, Dice and IT-Sim. For document clustering, the proposed algorithm is a generalization of the K-Means algorithm in which the set of K centroids is the model that generates the data; it alternates between an expectation step, corresponding to reassignment, and a maximization step, corresponding to recomputation of the parameters of the model. In the proposed algorithm the patterns can be clustered using the lexical pattern clustering algorithm: the patterns are clustered, the count and co-occurrence of each word are then considered, and based on this the word is extracted. The clusters are grouped based on a threshold value.

The proposed scheme, along with the existing approaches, has also been extended to measure the similarity between n sets of documents. To improve efficiency, an approximation is provided to reduce the complexity involved in the computation, and sequential clustering is provided to group the documents into a cluster set. Stemming is applied before the text documents are taken, stop-word removal is applied to reduce the content size, and synonym replacement is applied so that related documents with varying contents also coincide. Two web documents may have a certain value of cosine similarity, but if neither of them is in the other one's neighborhood, they have no connection between them; in such a case, the proposed system applies some context-based knowledge or relativeness property by modifying the text content. N groups of documents can be prepared as the cluster result.
4. PROPOSED METHODOLOGY
Clustering has been studied for a long time, and many state-of-the-art algorithms have been applied to date, but the results are still not very satisfactory, which motivates looking for better algorithms. This gave us the motivation to think out of the box and try something simple but different. While calculating the similarity between documents we first tried to treat synonyms of words as the same words, but in the case of documents like news articles it is not synonymy but the co-occurrence of words that plays the important role in similarity [26][27].
A. Feature Extraction
This phase is used for the extraction of features (important words and phrases in this case) from the documents. We have used named entities and the frequency of unigrams and bigrams to extract the important words from the document [28]; a short sketch of this step is given below.
B. Feature Clustering
This is the most important phase, in which the extracted features are clustered based on their co-occurrence. For this we tried many algorithms and found sequential clustering algorithms to be best for large data sets, as they reduce the time taken to a large extent.
C. Document Clustering
This is the final phase, in which documents are clustered using the feature clusters. For this we have used a simple approach in which a document is assigned to the cluster of words with which it shares the maximum number of words.
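As an illustration of the feature extraction phase (A), the sketch below counts unigram and bigram frequencies in a document and keeps the most frequent ones as features. It is a simplified stand-in written for this description: named-entity recognition is omitted, and the tokenizer and cut-off are assumptions rather than choices stated in the paper.

```python
from collections import Counter
import re

def extract_features(text, top_k=10):
    # Lowercase word tokenizer (assumption; the paper does not fix a tokenizer)
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    # Keep the most frequent unigrams and bigrams as the document's features
    features = [w for w, _ in unigrams.most_common(top_k)]
    features += [" ".join(b) for b, _ in bigrams.most_common(top_k)]
    return features

if __name__ == "__main__":
    doc = "Apple released a new Apple computer. The new computer was reviewed by the press."
    print(extract_features(doc, top_k=5))
```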

Fig 4.1 Proposed Similarity Clustering (components: Text Document, Feature Extraction, Feature Clustering, Skip Lexical Pattern)




D. Skip Lexical Pattern Clustering
The document retrieval problem in information retrieval [31][32] is as follows: given a query, typically represented as a set of query terms, return a ranked list of documents from some set, ordered by relevance to the query. Terms here can be words or lemmas, or multi-word units, depending on the lexical pre-processing being used. The complete set of documents depends on the application; in the internet-search case it could be the whole web. One of the features of most solutions to the document retrieval problem, and indeed of information retrieval problems in general, is the lack of sophistication of the linguistic modeling employed: both the query and the documents are considered to be "bags of words", i.e., multi-sets in which the frequency of words is accounted for but the order of words is not. From a linguistic perspective this is a crude assumption (to say the least), since much of the meaning of a document is mediated by the order of the words and the syntactic structures of the sentences. However, this simplifying assumption has worked surprisingly well, and attempts to exploit linguistic structure beyond the word level have not usually improved performance. For the document retrieval problem this is perhaps not too surprising, since queries, particularly on the web, tend to be short (a few words), and so describing the problem as one of simple word matching between query and document is arguably appropriate. Once the task of document retrieval is described as one of word overlap between query and document, a vector space model is a natural approach [33][34]: the basis vectors of the space are words, and both queries and documents are vectors in that space. The coefficient of a document vector for a particular basis vector, in the simplest case, is just the number of times that the word corresponding to the basis appears in the document. Queries are represented in the same way, essentially treating a query as a "pseudo-document". Measuring word overlap, or the similarity of a document vector d and a query vector q, can be achieved using the dot product:

Sim(d, q) = d · q = Σi di × qi   ------------ (1)
where vi is the i-th coefficient of vector v.
Term Definition
The first order of business is to define a few important terms: term, term weight, recall, precision, and tolerance. A term, to begin with, is basically the keyword or important concept associated with a given document. The importance of a term in representing the semantics of the document is the term weight; one way to calculate it is to count the frequency with which the term occurs in the document. As for recall and precision, they are measures of performance: the higher the precision and recall, the better the information retrieval system is.
0 < Recall = #relevant documents retrieved / #relevant documents in the collection
Finally, there is the tolerance. It is the benchmark such that only documents with a relevance score higher than it are retrieved.
0 < Precision = #relevant documents retrieved / #documents retrieved
Term-Document Matrix
After getting familiar with these term definitions, the second order of business is to learn how the vector space model is constructed. Basically, the documents are stored in a term-document matrix. A total of d documents with t terms, for instance, are stored in a t x d matrix. Each column vector of the matrix represents an individual document, and each element in a column represents the frequency with which a term occurs in that document. Thus, aij represents the frequency with which term i is present in document j.
The term-document matrix introduced in the previous section gives us the basic structure for determining word similarity. There the intuition was that words or terms are similar if they tend to occur in the same documents. However, this is a very broad notion of word similarity, producing what we might call topical similarity, based on a coarse notion of context. The trick in arriving at a more refined notion of similarity is to think of the term-document matrix as a term-context matrix where, in the IR case, context was thought of as a whole document; but we can narrow the context down to a sentence, or perhaps even a few words on either side of the target word [35].
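The sketch below builds a small term-document matrix and ranks documents against a query with the dot product of equation (1). It is a toy Python illustration of the vector space model described above, written for this paper's description rather than taken from it.

```python
from collections import Counter

def build_matrix(docs):
    # Vocabulary = every term seen in the collection; one term-count vector per document
    vocab = sorted({t for d in docs for t in d.lower().split()})
    index = {t: i for i, t in enumerate(vocab)}
    matrix = []
    for d in docs:
        counts = Counter(d.lower().split())
        matrix.append([counts.get(t, 0) for t in vocab])  # a_ij = frequency of term i in doc j
    return vocab, index, matrix

def rank(query, docs):
    vocab, index, matrix = build_matrix(docs)
    q = [0] * len(vocab)
    for t in query.lower().split():          # treat the query as a pseudo-document
        if t in index:
            q[index[t]] += 1
    # Sim(d, q) = sum_i d_i * q_i, then sort documents by decreasing score
    scores = [sum(qi * di for qi, di in zip(q, doc_vec)) for doc_vec in matrix]
    return sorted(enumerate(scores), key=lambda s: -s[1])

if __name__ == "__main__":
    docs = ["apple computer software", "apple fruit juice", "web mining and retrieval"]
    print(rank("apple computer", docs))
```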

Lexical Pattern Extraction Algorithm
Input: set of word pairs W
Output: extraction patterns Pats
Given the set W of word pairs, extract skip patterns:
for each word pair (P, Q) ∈ W do
    A ← Get-Snippets("P Q")
    N ← empty list
    for each snippet a ∈ A do
        N ← N + Get-N-grams(a, P, Q)
    end for
end for
Pats ← Count-Freq(N)
return Pats

Lexical Skip Pattern Clustering
Input: patterns A = (a1, ..., an), threshold θ
Output: clusters C
SORT(A)
C ← {}
for each pattern ai ∈ A do
    max ← -∞
    c* ← null
    for each cluster cj ∈ C do
        sim ← cosine(ai, cj)
        if sim > max then
            max ← sim
            c* ← cj
        end if
    end for
    if max > θ then
        c* ← c* ⊕ ai
    else
        C ← C ∪ {ai}
    end if
end for
return C
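A minimal Python sketch of the clustering loop above is given below. It assumes that each extracted pattern has already been turned into a frequency vector over word pairs, that SORT orders patterns by total frequency, and that similarity to a cluster is the cosine against the cluster centroid. The Get-Snippets and Get-N-grams steps of the extraction algorithm are not implemented here, and the toy patterns are placeholders rather than data from the paper.

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    # element-wise mean of the pattern vectors already in the cluster
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def skip_pattern_clustering(patterns, theta):
    # patterns: list of (pattern_string, frequency_vector)
    patterns = sorted(patterns, key=lambda p: sum(p[1]), reverse=True)  # SORT(A)
    clusters = []                       # each cluster is a list of (pattern, vector)
    for pat, vec in patterns:
        best_sim, best_cluster = -float("inf"), None
        for cluster in clusters:
            sim = cosine(vec, centroid([v for _, v in cluster]))
            if sim > best_sim:
                best_sim, best_cluster = sim, cluster
        if best_cluster is not None and best_sim > theta:
            best_cluster.append((pat, vec))   # merge into the most similar cluster
        else:
            clusters.append([(pat, vec)])     # start a new cluster
    return clusters

# illustrative patterns only; real input would come from Get-Snippets / Get-N-grams
toy = [("X is a Y", [3, 1, 0]), ("X such as Y", [2, 1, 0]), ("X acquired Y", [0, 0, 4])]
for c in skip_pattern_clustering(toy, theta=0.7):
    print([p for p, _ in c])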



The patterns can be clustered using the Skip Lexical Pattern Clustering Algorithm (SLPCA). Once the patterns are clustered, the counts and co-occurrences of the words can be considered and the words extracted on that basis. The clusters are grouped according to the threshold value, the words are clustered, and the results are then produced.

5. PERFORMANCE ANALYSIS

Table 5.1 shows the Spearman's Rank Correlation Coefficient experimental results for the existing system. The table contains, for each word pair, the first value [W1] and its rank [R1], the second value [W2] and its rank [R2], and the word pair count [n]. Table 5.2 extends it with the rank difference R2-R1, its square d2, the cube n3, and the per-pair coefficient rs.

Table 5.3 shows the experimental results of the existing system analysis. The table lists, for each word pair, the Jaccard, Overlap, Dice, PMI and PMI max values. The word pair counts are used to measure the correlation coefficient score of each word pair together with the precision and recall measures. Over all word pairs, the coefficient values are: Jaccard 0.584, Overlap 0.875, Dice 0.695, PMI 2.846 and PMI max 4.025.

Table 5.1 Word Pair Relation / Table 5.2 Spearman's Rank Correlation Coefficient

S.No  Word-pair          W1  R1  W2  R2  n    R2-R1  d2    n3     rs
1     cord-smile         38   2  14  12  13    10    100   2197  0.0842
2     monk-oracle        15  13  10  13   8     0      0    512  1.0432
3     noon-string        22   8  16  11  13     3      9   2197  0.9175
4     glass-magician     25   7  17  10  13     3      9   2197  0.9175
5     monk-slave         15  13  18   9   8    -4     16    512  0.3650
6     coast-forest       27   5  17  10  15     5     25   3375  0.8511
7     crane-implement    18  11  21   7  11    -4     16   1331  0.7575
8     car-automobile     30   3  18   9  17     6     36   4913  0.8529
9     brother-lad        19  10  38   1  19    -9     81   6859  0.7631
10    bird-crane         29   4  18   9  17     5     25   4913  0.8978
11    bird-cock          29   4  25   4  23     0      0  12167  1.0567
12    coast-hill         27   5  30   2  22    -3      9  10648  0.9830
13    car-journey        44   1  23   5  23     4     16  12167  0.9736
14    implement-tool     21   9  26   3  16    -6     36   4096  0.8235
15    boy-lad            26   6  38   1  24    -5     25  13824  0.9637
16    forest-graveyard   17  12   9  14   8     2      4    512  0.8412
17    midday-noon        15  13  22   6  15    -7     49   3375  0.7083
18    furnace-stove      17  12  20   8  17    -4     16   4913  0.9346
19    magician-wizard    17  12  21   7  17    -5     25   4913  0.8978
20    lad-wizard         38   2  21   7  21     5     25   9261  0.9458
      Spearman's Rank Correlation Coefficient (all word pairs)   0.8239

Spearman's rank correlation coefficient (rs) is a reliable and fairly simple method of testing both the strength and direction (positive or negative) of any correlation between two word pairs or documents.

rs = 1 - [N Σd2 / (n3 - n)]

where d2 = (R2 - R1)^2 is the squared rank difference, n is the word pair count, n3 = n^3, and N is the total number of word pairs. Table 5.2 shows the Spearman's rank correlation coefficient between the two rankings of each word pair for the existing system. The table contains the difference between the rank values, its square, the word pair count, the cube of the word pair count, and the Spearman's rank correlation coefficient (rs) of each word pair. The overall Spearman's Rank Correlation Coefficient across all word pairs is 0.8239.
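For comparison, the textbook form of Spearman's coefficient uses the constant 6, rs = 1 - 6 Σd2 / (n(n^2 - 1)); the short Python sketch below implements that textbook form over two rank lists. The rank lists shown are illustrative values, not the ranks of Table 5.2.

def spearman_rank_correlation(ranks1, ranks2):
    # textbook form: rs = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), d = rank difference
    assert len(ranks1) == len(ranks2)
    n = len(ranks1)
    d_squared = sum((r1 - r2) ** 2 for r1, r2 in zip(ranks1, ranks2))
    return 1 - (6 * d_squared) / (n * (n ** 2 - 1))

# illustrative ranks for a handful of word pairs (not the values from Table 5.2)
human_ranks = [1, 2, 3, 4, 5]
system_ranks = [2, 1, 3, 5, 4]
print(round(spearman_rank_correlation(human_ranks, system_ranks), 4))  # prints 0.8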

Fig 5.1 shows the experimental results of the existing system analysis. The figure plots, for each word pair, the Jaccard, Overlap, Dice, PMI and PMI max values. The word pair counts are used to measure the correlation coefficient score of each word pair together with the precision and recall measures. Over all word pairs, the coefficient values are: Jaccard 0.584, Overlap 0.875, Dice 0.695, PMI 2.846 and PMI max 4.025. Table 5.4 lists the example word pairs used to describe the proposed system methodology in the relation-independent and relation-specific word pair relation measurement process. Table 5.5 gives the word pair details of the proposed system analysis; the proposed system measures the relation-independent and relation-specific coefficient values.
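The four page-count-based co-occurrence coefficients reported above are commonly defined as in the Python sketch below, where h_p, h_q and h_pq stand for the page counts of the two words and of their conjunctive query. The exact normalisation and the total page count used in the paper are not stated, so these definitions and the example counts are assumptions for illustration only.

import math

def cooccurrence_measures(h_p, h_q, h_pq, n_total):
    # h_p, h_q: page counts of the single words; h_pq: page count of the query "P AND Q"
    jaccard = h_pq / (h_p + h_q - h_pq)
    overlap = h_pq / min(h_p, h_q)
    dice = 2 * h_pq / (h_p + h_q)
    # PMI compares the joint probability with the product of the marginal probabilities
    pmi = math.log2((h_pq / n_total) / ((h_p / n_total) * (h_q / n_total)))
    return {"jaccard": jaccard, "overlap": overlap, "dice": dice, "pmi": pmi}

# purely illustrative counts, not the page counts behind Table 5.3
print(cooccurrence_measures(h_p=120000, h_q=90000, h_pq=30000, n_total=10000000))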




Table 5.4 Word Pair Example
S.NO  WORD PAIR
1     Pain
2     Cord
3     Smile
4     String
5     Noon
6     Blue
7     Car
8     Automobile
9     Forest
10    Coast

Table 5.3 Comparisons of Jaccard, Overlap, Dice, PMI and PMI max and Previous Measures on the MC Data Set

S.NO  WORD-PAIR           JACCARD  OVERLAP  DICE   PMI    PMI Max
1     cord-smile          0.33     0.93     0.50   2.69   3.93
2     monk-oracle         0.47     0.80     0.64   3.47   4.27
3     noon-string         0.52     0.81     0.68   3.10   4.23
4     glass-magician      0.45     0.76     0.62   2.91   4.11
5     monk-slave          0.32     0.53     0.48   2.88   3.91
6     coast-forest        0.52     0.88     0.68   2.88   4.20
7     crane-implement     0.39     0.61     0.56   2.86   4.02
8     car-automobile      0.55     0.94     0.71   2.94   4.22
9     brother-lad         0.50     1.00     0.67   2.76   4.13
10    bird-crane          0.57     0.94     0.72   2.98   4.24
11    bird-cock           0.74     0.92     0.85   2.95   4.34
12    coast-hill          0.63     0.81     0.77   2.80   4.21
13    car-journey         0.52     1.00     0.69   2.62   4.09
14    implement-tool      0.52     0.76     0.68   2.87   4.15
15    boy-lad             0.60     0.92     0.75   2.68   4.16
16    forest-graveyard    0.44     0.89     0.62   3.45   4.25
17    midday-noon         0.68     1.00     0.81   3.31   4.42
18    furnace-stove       0.85     1.00     0.92   3.41   4.54
19    magician-wizard     0.81     1.00     0.89   3.36   4.50
20    lad-wizard          0.55     1.00     0.71   2.76   4.17
      Standard Deviation  0.584    0.875    0.695  2.846  4.025

Table 5.5 Word Pair Co-efficient Values

S.No  Word Pair Dataset   Jaccard  Overlap  Dice   PMI    PMI max
1     cord-smile          0.33     0.93     0.50   2.69   3.93
2     noon-string         0.52     0.81     0.68   3.10   4.23
3     journey-voyage      0.67     0.94     0.80   3.21   4.38
4     lad-wizard          0.55     1.00     0.71   2.76   4.17
5     coast-forest        0.52     0.88     0.68   2.88   4.20
6     midday-noon         0.68     1.00     0.81   3.31   4.42
7     coast-hill          0.63     0.81     0.77   2.80   4.21
8     monk-oracle         0.47     0.80     0.64   3.47   4.27
9     magician-wizard     0.81     1.00     0.89   3.36   4.50
10    car-automobile      0.55     0.94     0.71   2.94   4.22
      Average             0.573    0.911    0.719  3.052  4.253

Table 5.6 takes the word pair details for cord-smile and noon-string. The proposed system measures the relation-independent and relation-specific coefficient values; the joint probability values are given below.

Table 5.6 Joint probability values - Word Pair [Relation Dependent and Independent Values]

S.NO  Word pair relation        Joint probability values
1     [pain, cord, smile]       0.32
2     [pain, cord, String]      0.32
3     [pain, Noon, smile]       0.36
4     [pain, Noon, String]      0.36
5     [Blue, cord, smile]       0.18
6     [Blue, cord, String]      0.18
7     [Blue, Noon, smile]       0.07
8     [Blue, Noon, String]      0.15
9     [car, automobile, car]    0.16
10    [forest, coast, forest]   0.77

Fig 5.1 Comparisons of Jaccard, Overlap, Dice, PMI and PMI max and Previous Measures on the MC Data Set (correlation coefficient values per word pair)



Table 5.7 shows the joint probability values of the word pairs for the relation-dependent and relation-independent coefficient values. The joint probability values are measured from the pattern frequencies: the total frequency of a pattern ρ in a particular relation type R is defined as the sum of the frequencies of ρ in all entity pairs. Table 5.8 shows the mutual information values of the word pairs for the relation-dependent and relation-independent coefficient values.

Fig 5.2 Lexical Skip Pattern Cluster Algorithm [Joint Probability Value]

Fig 5.3 shows the mutual information values of the word pairs for the relation-dependent and relation-independent coefficient values. Table 5.10 describes the comparison between the Spearman's rank and the existing system; it contains the Spearman's rank over all word pairs and the word pair standard deviations.

Table 5.7 & 5.8 Mutual Information values - Word Pair [Relation Dependent and Independent Values]
S.NO  Word pair relation    Mutual Information values
1     [pain, pain]          0.654
2     [pain, Blue]          0.663
3     [Blue, pain]          0.351
4     [Blue, Blue]          0.320
5     [cord, smile]         0.234
6     [Noon, String]        0.765
7     [car, Automobile]     0.457
8     [forest, coast]       0.556

Table 5.9 shows the entropy values of the words for the relation-dependent and relation-independent coefficient values.

Table 5.9 Entropy values - Word Pair [Relation Dependent and Independent Values]
S.NO  Word pair relation  Entropy values
1     Pain                14873.942
2     Cord                68385.919
3     Smile               87654.332
4     String              65435.221
5     Noon                12666.223
6     Blue                76543.222
7     Car                 34562.115
8     Automobile          23457.223
9     Forest              34593.104
10    Coast               34925.112

Fig 5.3 Lexical Prefix Cluster Algorithm [Mutual Information Value]

Table 5.11 shows the comparison between the Spearman's rank and the prefix span clustering algorithm; it contains the Spearman's rank for the word pairs and the proposed word pair relation values.

Table 5.10 Comparison of Spearman's Rank and Existing System
EXISTING SYSTEM COEFFICIENT, Standard Deviation [Word Pair]: Jaccard 0.584, Overlap 0.875, Dice 0.695, PMI 2.846, PMI Max 4.025
SPEARMAN'S RANK [All Word Pair]: 0.8235

Fig 5.2 takes the word pair details of the proposed word pairs. The proposed system measures the relation-independent and relation-specific coefficient values.
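Since the paper's own joint probability and mutual information equations are not reproduced in the extracted text, the Python sketch below uses the standard maximum-likelihood estimates over toy co-occurrence counts, purely as an assumption about how values of the kind shown in Tables 5.6-5.8 can be obtained.

import math
from collections import Counter

# toy (word, context word) co-occurrence observations; illustrative only
events = [("pain", "blue"), ("pain", "blue"), ("pain", "cord"),
          ("blue", "pain"), ("noon", "string"), ("noon", "string")]

pair_counts = Counter(events)
left_counts = Counter(w for w, _ in events)
right_counts = Counter(c for _, c in events)
total = len(events)

def joint_probability(w, c):
    # maximum-likelihood estimate of P(w, c)
    return pair_counts[(w, c)] / total

def pointwise_mutual_information(w, c):
    p_wc = joint_probability(w, c)
    p_w = left_counts[w] / total
    p_c = right_counts[c] / total
    return math.log2(p_wc / (p_w * p_c)) if p_wc > 0 else float("-inf")

print(round(joint_probability("noon", "string"), 3))
print(round(pointwise_mutual_information("noon", "string"), 3))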



Table 5.11 Comparison of Spearman's Rank and Lexical Pattern Clustering Algorithm
SPEARMAN'S RANK [All Word Pair]: 0.8235
PROPOSED LEXICAL PATTERN CLUSTERING ALGORITHM [Word Pair Relation]: 0.654, 0.663, 0.351, 0.320

Table 5.12 describes the comparison between the existing coefficient and the prefix span clustering algorithm. The table contains the existing word pair coefficient for each word pair and the proposed word pair relation values.

Table 5.12 Comparison of Existing Word Pair Coefficient and Prefix Span Clustering Algorithm

Word pair relation  EXISTING WORD PAIR COEFFICIENT  PROPOSED LEXICAL PATTERN CLUSTERING ALGORITHM [Word Pair Relation]
Pain                0.8235                          0.654
Cord                0.8153                          0.663
Smile               0.8003                          0.351
String              0.8213                          0.320
Noon                0.8441                          0.322
Blue                0.8521                          0.567
Car                 0.8457                          0.775
Automobile          0.8235                          0.234
Forest              0.8235                          0.789
Coast               0.8235                          0.586

Table 5.13 shows the bug word count term matrix of the existing system process. The table contains the bug words from the bug reports and the source code entity files, and is used to find the co-occurring bug word counts.

Table 5.13 Term Matrix (rows: BUGS REPORT words; columns: SOURCE CODE ENTITY files)
BUGS REPORT    1.html  1_2.html  1_3.html  1_4.html
ANTICMOS       3       2         5         4
ELIZA          2       3         5         4
GRAYBRID       3       3         5         4
COMMWARRIOR    3       0         5         4
ACMS           3       3         5         4
ABRAXAS        3       3         5         0
ACTIFED        3       3         5         4
BOMBER         2       3         1         4
AGENA          1       0         0         1
BUG            2       0         0         1

Table 5.14 shows the cosine similarity between the bug report entities and the source code entities in the existing system process, for the vector space model with a threshold value < 0.8. Table 5.15 shows the corresponding cosine similarities for a threshold value < 0.5.

Table 5.14 cosine similarity value < 0.8

Fig 5.4 shows the comparison between the existing coefficient and the prefix span clustering algorithm. The figure plots the existing word pair coefficient and the proposed word pair relation value for each word pair [word pair Cord-Smile] [Pain and Blue].

Entity     1.html  1_2.html  1_3.html  1_4.html  1_4_2.html
1.html     -       0.86      0.86      0.85      -
1_2.html   -       0.87      0.87      0.89      -
1_3.html   -       0.95      0.95      0.89      -
1_4.html   -       0.89      0.89      0.84      -

Table 5.15 cosine similarity value < 0.5
Entity     1.html  1_2.html  1_3.html  1_4.html  1_4_2.html
1.html     0.76    0.86      0.86      0.85      0.72
1_2.html   0.75    0.87      0.87      0.89      0.72
1_3.html   0.73    0.95      0.95      0.89      0.73
1_4.html   0.71    0.89      0.89      0.84      0.65
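In outline, Tables 5.14 and 5.15 can be produced by treating each source code entity as a column of the bug-word term matrix, computing pairwise cosine similarities, and keeping only the values above a chosen threshold. The Python sketch below uses the counts of Table 5.13 as input; the 1_4_2.html column is absent there, so the numbers it prints are not expected to match Tables 5.14 and 5.15 exactly.

import math

# term-matrix columns from Table 5.13: bug-word counts per source code entity
term_matrix = {
    "1.html":   [3, 2, 3, 3, 3, 3, 3, 2, 1, 2],
    "1_2.html": [2, 3, 3, 0, 3, 3, 3, 3, 0, 0],
    "1_3.html": [5, 5, 5, 5, 5, 5, 5, 1, 0, 0],
    "1_4.html": [4, 4, 4, 4, 4, 0, 4, 4, 1, 1],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similar_entities(matrix, threshold):
    # keep only entity pairs whose cosine similarity exceeds the threshold
    names = list(matrix)
    pairs = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            sim = cosine(matrix[a], matrix[b])
            if sim > threshold:
                pairs[(a, b)] = round(sim, 2)
    return pairs

print(similar_entities(term_matrix, threshold=0.8))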


Fig 5.5 shows the cosine similarity between the bug report entities and the source code entities in the existing system process, for the vector space model with a threshold value < 0.8.

Fig 5.4 Comparison of Existing Word Pair Coefficient and Lexical Pattern Clustering Algorithm (average co-efficient values plotted against the word pairs)




Table 5.17 Cluster size in Lexical Pattern Clustering
S.NO  Threshold values  Cluster Size
1     0.5               1
2     0.6               2
3     0.8               2
4     0.9               4
5     1                 5

Fig 5.5 Cosine similarity value < 0.8

Fig 5.6 shows the cosine similarity between the bug report entities and the source code entities in the existing system process, for the vector space model with a threshold value < 0.5. Table 5.16 shows the bug word skip count in the lexical pattern clustering process of the proposed system. The table contains the source code entity file, the bug word skip count and the group-of-phrase bug word count. For example: Bugs Word: ANTICMOS ****** ABRAXAS.

Fig 5.7 shows the bug word skip count in the lexical pattern clustering process of the proposed system; the figure plots the source code entities against the word skip count values and the group-of-phrase bug word counts. Fig 5.8 shows the bug word cluster size of the lexical pattern clustering algorithm in the proposed system; the figure plots the threshold values against the cluster sizes. HTML tags are removed from the bug report and source code entity files before they are taken for similarity identification, and language syntax words are removed from the source code entity files as well (a sketch of this preprocessing is given below). The comparison table represents the status of pattern analysis and pattern clustering in both system analyses, the existing and the proposed, together with the processing of stop words, stem words and synonym words. In the existing system, patterns are not used and pattern clustering is not applied; in the proposed system, patterns are used and pattern clustering is applied.
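The preprocessing step described above, removing HTML tags from the bug report and source code entity files and dropping language syntax words before similarity identification, might look like the following Python sketch; the tag-stripping regular expression and the small syntax-word list are illustrative assumptions rather than the paper's actual implementation.

import re

# small illustrative list of language syntax / stop words; the paper's full list is not given
SYNTAX_WORDS = {"if", "else", "for", "while", "return", "int", "void", "public", "class"}

def strip_html(text):
    # remove HTML tags and collapse the remaining whitespace
    no_tags = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", no_tags).strip()

def clean_tokens(text):
    # tokens that remain after tag stripping and syntax-word removal
    tokens = strip_html(text).lower().split()
    return [t for t in tokens if t not in SYNTAX_WORDS]

bug_report = "<html><body>ANTICMOS crash while parsing input</body></html>"
source_entity = "<pre>public void parse(int input) { return; } // ANTICMOS handler</pre>"

print(clean_tokens(bug_report))
print(clean_tokens(source_entity))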


Fig 5.7 Skip Count in Lexical Pattern Clustering (bug skip count and bug phrase count per source entity file)

The results are discussed under the performance of the novel cluster-based lexical pattern scheme. The bug-finding results of the proposed model are discussed and compared with the bug finding of the existing system. To measure the performance of the proposed work, the throughput and threshold values are evaluated.

Fig 5.6 Cosine similarity value < 0.5

Table 5.16 Skip Count in Lexical Pattern Clustering
S.No  Source Entity File  Bugs Skip Count  Bugs Phrase Count
1     10.html             1                24
2     10.html             2                24
3     2.html              3                2
4     2.html              5                3
5     2.html              7                3

Table 5.17 shows the bug word cluster size of the lexical pattern clustering algorithm in the proposed system. The table contains the threshold values and the corresponding cluster sizes.


Fig 5.8 Cluster size in Lexical Pattern Clustering




6. CONCLUSION
This paper presents a novel similarity measure between two documents. Several desirable properties are embedded in this measure. For example, the similarity measure is symmetric. The presence or absence of a feature is considered more essential than the difference between the values associated with a present feature. The similarity degree increases as the number of presence-absence feature pairs decreases. Two documents are least similar to each other if none of the features have non-zero values in both documents. In addition, it is desirable to consider the value distribution of a feature for its contribution to the similarity between two documents. The proposed scheme has also been extended to measure the similarity between two sets of documents. The paper further proposed a semantic similarity measure that uses both page counts and snippets retrieved from a web search engine for two words. Four word co-occurrence measures were computed using page counts. A skip lexical pattern extraction algorithm was proposed to extract the numerous semantic relations that exist between two words, and a sequential pattern clustering algorithm was proposed to identify the different skip lexical patterns that describe the same semantic relation. The paper thus proposes an empirical method to estimate semantic similarity using page counts and text snippets retrieved from a web search engine for two words. Specifically, it defines various word co-occurrence measures using page counts and integrates them with the skip lexical patterns extracted from text snippets. Both the page-count-based co-occurrence measures and the lexical pattern clusters were used to define the features of a word pair.

7. FUTURE ENHANCEMENTS
The following enhancements could be implemented in the future so that the application works with full efficiency:
- Ensure reliability for an ever-growing client count.
- Search the content from more than one search engine.
- Search and compare video content based on the exact semantic words.
- Perform multitasking.
- Develop the system as a web service, so that it can be accessed from anywhere.
- Develop the proposed system further so that it works independently on different operating systems.

8. REFERENCES
[1] G. Amati and C. J. V. Rijsbergen, "Probabilistic models of information retrieval based on measuring the divergence from randomness," ACM Trans. Inform. Syst., vol. 20, no. 4, pp. 357-389, 2002.
[2] J. A. Aslam and M. Frost, "An information-theoretic measure for document similarity," in Proc. 26th SIGIR, Toronto, ON, Canada, 2003, pp. 449-450.
[3] D. Cai, X. He, and J. Han, "Document clustering using locality preserving indexing," IEEE Trans. Knowl. Data Eng., vol. 17, no. 12, pp. 1624-1637, Dec. 2005.

[4] S. Clinchant and E. Gaussier, "Information-based models for ad hoc IR," in Proc. 33rd SIGIR, Geneva, Switzerland, 2010, pp. 234-241.
[5] I. S. Dhillon, J. Kogan, and C. Nicholas, "Feature selection and document clustering," in A Comprehensive Survey of Text Mining, M. W. Berry, Ed. Heidelberg, Germany: Springer, 2003.
[6] I. S. Dhillon and D. S. Modha, "Concept decompositions for large sparse text data using clustering," Machine Learning, vol. 42, no. 1, pp. 143-175, Jan. 2001. Also appears as IBM Research Report RJ 10147, July 1999.
[7] I. S. Dhillon and D. S. Modha, "Concept decompositions for large sparse text data using clustering," Mach. Learn., vol. 42, no. 1, pp. 143-175, 2001.
[8] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig, "Syntactic clustering of the web," Technical Report 1997-015, Digital Systems Research Center, 1997.
[9] I. S. Dhillon, D. S. Modha, and W. S. Spangler, "Visualizing class structure of multidimensional data," in S. Weisberg (Ed.), Proc. 30th Symp. on the Interface: Computing Science and Statistics, vol. 30, Minneapolis, MN, 1998, pp. 488-493.
[10] H. Fang, T. Tao, and C. Zhai, "A formal study of heuristic retrieval constraints," in Proc. 27th SIGIR, Sheffield, South Yorkshire, U.K., 2004, pp. 49-56.
[11] G. Salton, Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer. Addison-Wesley, 1989.
[12] J. Lafferty and C. Zhai, "Probabilistic relevance models based on document and query generation," in W. B. Croft and J. Lafferty, Eds., Language Modeling and Information Retrieval. Kluwer Academic Publishers, 2003.
[13] N. Fuhr, "Language models and uncertain inference in information retrieval," in Proc. Language Modeling and IR Workshop.
[14] C. Zhai and J. Lafferty, "A study of smoothing methods for language models applied to ad hoc information retrieval," in Proc. SIGIR'01, pp. 334-342, Sept. 2001.
[15] C. Bird, A. Bachmann, E. Aune, J. Duffy, A. Bernstein, V. Filkov, and P. Devanbu, "Fair and Balanced?: Bias in Bug-Fix Data Sets," Proc. Seventh European Software Eng. Conf. and Symp. Foundations of Software Eng., pp. 121-130, 2009.
[16] J. Chang and D. M. Blei, "Relational Topic Models for Document Networks," Proc. 12th Int'l Conf. Artificial Intelligence and Statistics, pp. 81-88, 2009.
[17] G. V. Cormack, C. L. Clarke, and S. Buettcher, "Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods," Proc. 32nd Int'l Conf. Research and Development in Information Retrieval, pp. 758-759, 2009.
[18] R. Moser, W. Pedrycz, and G. Succi, "A Comparative Analysis of the Efficiency of Change Metrics and Static Code Attributes for Defect Prediction," Proc. 30th Int'l Conf. Software Eng., pp. 181-190, 2008.



[19] F. Rahman, D. Posnett, A. Hindle, E. Barr, and P. Devanbu, "BugCache for Inspections: Hit or Miss?" Proc. 19th Symp. and 13th European Conf. Foundations of Software Eng., pp. 322-331, 2011.
[20] S. Rao and A. Kak, "Retrieval from Software Libraries for Bug Localization: A Comparative Study of Generic and Composite Text Models," Proc. Eighth Working Conf. Mining Software Repositories, pp. 43-52, 2011.
[21] Y. S. Maarek, D. M. Berry, and G. E. Kaiser, "An Information Retrieval Approach for Automatically Constructing Software Libraries," IEEE Trans. Softw. Eng., vol. 17, no. 8, pp. 800-813, 2006.
[22] E. Kocaguneli, T. Menzies, and J. W. Keung, "On the Value of Ensemble Effort Estimation," IEEE Trans. Software Eng., vol. 38, no. 6, pp. 1403-1416, Nov./Dec. 2012.
[23] S. K. Lukins, N. A. Kraft, and L. H. Etzkorn, "Bug Localization Using Latent Dirichlet Allocation," Information and Software Technology, vol. 52, no. 9, pp. 972-990, 2010.
[24] A. T. Nguyen, T. T. Nguyen, J. Al-Kofahi, H. V. Nguyen, and T. N. Nguyen, "A Topic-Based Approach for Narrowing the Search Space of Buggy Files from a Bug Report," Proc. 26th Int'l Conf. Automated Software Eng., pp. 263-272, 2011.
[25] S. Rao and A. Kak, "Retrieval from Software Libraries for Bug Localization: A Comparative Study of Generic and Composite Text Models," Proc. Eighth Working Conf. Mining Software Repositories, pp. 43-52, 2011.
[26] R. L. Glass, Facts and Fallacies of Software Engineering. Addison-Wesley Professional, 2003.
[27] R. W. Selby, "Enabling Reuse-Based Software Development of Large-Scale Systems," IEEE Trans. Software Eng., vol. 31, no. 6, pp. 495-510, June 2005.
[28] S. K. Lukins, N. A. Kraft, and L. H. Etzkorn, "Bug Localization Using Latent Dirichlet Allocation," Information and Software Technology, vol. 52, no. 9, pp. 972-990, 2010.
[29] http://web.ist.utl.pt/~acardoso/datasets/ [Online].
[30] http://www.cs.technion.ac.il/~ronb/thesis.html
[31] http://www.daviddlewis.com/resources/testcollections/reuters21578/
[32] C. G. González, W. Bonventi, Jr., and A. L. V. Rodrigues, "Density of closed balls in real-valued and autometrized boolean spaces for clustering applications," in Proc. 19th Brazilian Symp. Artif. Intell., Salvador, Brazil, 2008, pp. 8-22.
[33] R. W. Hamming, "Error detecting and error correcting codes," Bell Syst. Tech. J., vol. 29, no. 2, pp. 147-160, 1950.
[34] K. M. Hammouda and M. S. Kamel, "Efficient phrase-based document indexing for web document clustering," IEEE Trans. Knowl. Data Eng., vol. 16, no. 10, pp. 1279-1296, Oct. 2004.
[35] K. M. Hammouda and M. S. Kamel, "Hierarchically distributed peer-to-peer document clustering and cluster summarization," IEEE Trans. Knowl. Data Eng., vol. 21, no. 5, pp. 681-698, May 2009.

[36] J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd ed. San Francisco, CA, USA: Morgan Kaufmann; Boston, MA, USA: Elsevier, 2006.



