ISSN NO (ONLINE) : 2045-8711 ISSN NO (PRINT) : 2045-869X
INTERNATIONAL JOURNAL OF INNOVATIVE TECHNOLOGY & CREATIVE ENGINEERING
APRIL 2016 VOL-6 NO-4
@ IJITCE Publication
INTERNATIONAL JOURNAL OF INNOVATIVE TECHNOLOGY AND CREATIVE ENGINEERING (ISSN:2045-8711) VOL. 6 NO.4 APRIL 2016
UK: Managing Editor International Journal of Innovative Technology and Creative Engineering 1a park lane, Cranford London TW59WA UK E-Mail: editor@ijitce.co.uk Phone: +44-773-043-0249 USA: Editor International Journal of Innovative Technology and Creative Engineering Dr. Arumugam Department of Chemistry University of Georgia GA-30602, USA. Phone: 001-706-206-0812 Fax:001-706-542-2626 India: Editor International Journal of Innovative Technology & Creative Engineering Dr. Arthanariee. A. M Finance Tracking Center India 66/2 East mada st, Thiruvanmiyur, Chennai -600041 Mobile: 91-7598208700
www.ijitce.co.uk
IJITCE PUBLICATION
International Journal of Innovative Technology & Creative Engineering Vol.6 No.4 April 2016
From Editor's Desk

Dear Researcher,

Greetings! The research article in this issue discusses motivational factor analysis. Let us review research around the world this month.

Monster black holes in the early universe may have taken an unusual route to becoming so massive. Giant gas clouds in some of the universe’s first galaxies collapsed under their own gravity to form supermassive black holes, theoretical astrophysicist Lucio Mayer of the University of Zurich suggested. The postulated process offers a major shortcut to supermassive status, as black holes are generally thought to start small and gradually grow by merging with each other and gobbling up matter. The mechanism also doesn’t rely on stars to spawn black holes in the first place. Mayer’s proposal still has hurdles to clear before other astrophysicists accept it as viable. But if confirmed, it would solve the mystery of why astronomers keep spotting gargantuan black holes from a time when the universe was less than a billion years old.

This supermassive conundrum boils down to timing. The first stars, some of them 100 times or more the mass of the sun, took shape a few hundred million years after the Big Bang. The largest ones exploded soon after and left behind black holes of roughly the same mass. Yet recent telescope observations reveal that by about 500 million years later, not very long on cosmic timescales, some black holes weighed in at 10 billion solar masses. No matter how often ancient black holes feasted and combined forces, they would have had trouble growing by a factor of 100 million so quickly. Mayer has tried to devise mechanisms that would birth jumbo black holes. The recipe requires getting huge amounts of matter to fall together until the collective gravity is strong enough to prevent light from escaping.

It has been an absolute pleasure to present you with articles that you wish to read. We look forward to many more research articles on new technologies from you and your friends. We are eagerly awaiting the rich and thorough research papers that our authors have prepared for the next issue.
Thanks, Editorial Team IJITCE
Editorial Members Dr. Chee Kyun Ng Ph.D Department of Computer and Communication Systems, Faculty of Engineering,Universiti Putra Malaysia,UPMSerdang, 43400 Selangor,Malaysia. Dr. Simon SEE Ph.D Chief Technologist and Technical Director at Oracle Corporation, Associate Professor (Adjunct) at Nanyang Technological University Professor (Adjunct) at ShangaiJiaotong University, 27 West Coast Rise #08-12,Singapore 127470 Dr. sc.agr. Horst Juergen SCHWARTZ Ph.D, Humboldt-University of Berlin,Faculty of Agriculture and Horticulture,Asternplatz 2a, D-12203 Berlin,Germany Dr. Marco L. BianchiniPh.D Italian National Research Council; IBAF-CNR,Via Salaria km 29.300, 00015 MonterotondoScalo (RM),Italy Dr. NijadKabbaraPh.D Marine Research Centre / Remote Sensing Centre/ National Council for Scientific Research, P. O. Box: 189 Jounieh,Lebanon Dr. Aaron Solomon Ph.D Department of Computer Science, National Chi Nan University,No. 303, University Road,Puli Town, Nantou County 54561,Taiwan Dr. Arthanariee. A. M M.Sc.,M.Phil.,M.S.,Ph.D Director - Bharathidasan School of Computer Applications, Ellispettai, Erode, Tamil Nadu,India Dr. Takaharu KAMEOKA, Ph.D Professor, Laboratory of Food, Environmental & Cultural Informatics Division of Sustainable Resource Sciences, Graduate School of Bioresources,Mie University, 1577 Kurimamachiya-cho, Tsu, Mie, 514-8507, Japan Dr. M. Sivakumar M.C.A.,ITIL.,PRINCE2.,ISTQB.,OCP.,ICP. Ph.D. Project Manager - Software,Applied Materials,1a park lane,cranford,UK Dr. Bulent AcmaPh.D Anadolu University, Department of Economics,Unit of Southeastern Anatolia Project(GAP),26470 Eskisehir,TURKEY Dr. SelvanathanArumugamPh.D Research Scientist, Department of Chemistry, University of Georgia, GA-30602,USA.
Review Board Members Dr. Paul Koltun Senior Research ScientistLCA and Industrial Ecology Group,Metallic& Ceramic Materials,CSIRO Process Science & Engineering Private Bag 33, Clayton South MDC 3169,Gate 5 Normanby Rd., Clayton Vic. 3168, Australia Dr. Zhiming Yang MD., Ph. D. Department of Radiation Oncology and Molecular Radiation Science,1550 Orleans Street Rm 441, Baltimore MD, 21231,USA Dr. Jifeng Wang Department of Mechanical Science and Engineering, University of Illinois at Urbana-Champaign Urbana, Illinois, 61801, USA Dr. Giuseppe Baldacchini ENEA - Frascati Research Center, Via Enrico Fermi 45 - P.O. Box 65,00044 Frascati, Roma, ITALY.
Dr. MutamedTurkiNayefKhatib Assistant Professor of Telecommunication Engineering,Head of Telecommunication Engineering Department,Palestine Technical University (Kadoorie), TulKarm, PALESTINE. Dr.P.UmaMaheswari Prof &Head,Depaartment of CSE/IT, INFO Institute of Engineering,Coimbatore. Dr. T. Christopher, Ph.D., Assistant Professor &Head,Department of Computer Science,Government Arts College(Autonomous),Udumalpet, India. Dr. T. DEVI Ph.D. Engg. (Warwick, UK), Head,Department of Computer Applications,Bharathiar University,Coimbatore-641 046, India. Dr. Renato J. orsato Professor at FGV-EAESP,Getulio Vargas Foundation,São Paulo Business School,RuaItapeva, 474 (8° andar),01332-000, São Paulo (SP), Brazil Visiting Scholar at INSEAD,INSEAD Social Innovation Centre,Boulevard de Constance,77305 Fontainebleau - France Y. BenalYurtlu Assist. Prof. OndokuzMayis University Dr.Sumeer Gul Assistant Professor,Department of Library and Information Science,University of Kashmir,India Dr. ChutimaBoonthum-Denecke, Ph.D Department of Computer Science,Science& Technology Bldg., Rm 120,Hampton University,Hampton, VA 23688 Dr. Renato J. Orsato Professor at FGV-EAESP,Getulio Vargas Foundation,São Paulo Business SchoolRuaItapeva, 474 (8° andar),01332-000, São Paulo (SP), Brazil Dr. Lucy M. Brown, Ph.D. Texas State University,601 University Drive,School of Journalism and Mass Communication,OM330B,San Marcos, TX 78666 JavadRobati Crop Production Departement,University of Maragheh,Golshahr,Maragheh,Iran VineshSukumar (PhD, MBA) Product Engineering Segment Manager, Imaging Products, Aptina Imaging Inc. Dr. Binod Kumar PhD(CS), M.Phil.(CS), MIAENG,MIEEE HOD & Associate Professor, IT Dept, Medi-Caps Inst. of Science & Tech.(MIST),Indore, India Dr. S. B. Warkad Associate Professor, Department of Electrical Engineering, Priyadarshini College of Engineering, Nagpur, India Dr. doc. Ing. RostislavChoteborský, Ph.D. 
Katedramateriálu a strojírenskétechnologieTechnickáfakulta,Ceskázemedelskáuniverzita v Praze,Kamýcká 129, Praha 6, 165 21 Dr. Paul Koltun Senior Research ScientistLCA and Industrial Ecology Group,Metallic& Ceramic Materials,CSIRO Process Science & Engineering Private Bag 33, Clayton South MDC 3169,Gate 5 Normanby Rd., Clayton Vic. 3168 DR.ChutimaBoonthum-Denecke, Ph.D Department of Computer Science,Science& Technology Bldg.,HamptonUniversity,Hampton, VA 23688 Mr. Abhishek Taneja B.sc(Electronics),M.B.E,M.C.A.,M.Phil., Assistant Professor in the Department of Computer Science & Applications, at Dronacharya Institute of Management and Technology, Kurukshetra. (India).
Dr. Ing. RostislavChotěborský,ph.d, Katedramateriálu a strojírenskétechnologie, Technickáfakulta,Českázemědělskáuniverzita v Praze,Kamýcká 129, Praha 6, 165 21
Dr. AmalaVijayaSelvi Rajan, B.sc,Ph.d, Faculty – Information Technology Dubai Women’s College – Higher Colleges of Technology,P.O. Box – 16062, Dubai, UAE Naik Nitin AshokraoB.sc,M.Sc Lecturer in YeshwantMahavidyalayaNanded University Dr.A.Kathirvell, B.E, M.E, Ph.D,MISTE, MIACSIT, MENGG Professor - Department of Computer Science and Engineering,Tagore Engineering College, Chennai Dr. H. S. Fadewar B.sc,M.sc,M.Phil.,ph.d,PGDBM,B.Ed. Associate Professor - Sinhgad Institute of Management & Computer Application, Mumbai-BangloreWesternly Express Way Narhe, Pune - 41 Dr. David Batten Leader, Algal Pre-Feasibility Study,Transport Technologies and Sustainable Fuels,CSIRO Energy Transformed Flagship Private Bag 1,Aspendale, Vic. 3195,AUSTRALIA Dr R C Panda (MTech& PhD(IITM);Ex-Faculty (Curtin Univ Tech, Perth, Australia))Scientist CLRI (CSIR), Adyar, Chennai - 600 020,India Miss Jing He PH.D. Candidate of Georgia State University,1450 Willow Lake Dr. NE,Atlanta, GA, 30329 Jeremiah Neubert Assistant Professor,MechanicalEngineering,University of North Dakota Hui Shen Mechanical Engineering Dept,Ohio Northern Univ. Dr. Xiangfa Wu, Ph.D. Assistant Professor / Mechanical Engineering,NORTH DAKOTA STATE UNIVERSITY SeraphinChallyAbou Professor,Mechanical& Industrial Engineering Depart,MEHS Program, 235 Voss-Kovach Hall,1305 OrdeanCourt,Duluth, Minnesota 55812-3042 Dr. Qiang Cheng, Ph.D. Assistant Professor,Computer Science Department Southern Illinois University CarbondaleFaner Hall, Room 2140-Mail Code 45111000 Faner Drive, Carbondale, IL 62901 Dr. Carlos Barrios, PhD Assistant Professor of Architecture,School of Architecture and Planning,The Catholic University of America Y. BenalYurtlu Assist. Prof. OndokuzMayis University Dr. Lucy M. Brown, Ph.D. Texas State University,601 University Drive,School of Journalism and Mass Communication,OM330B,San Marcos, TX 78666 Dr. 
Paul Koltun Senior Research ScientistLCA and Industrial Ecology Group,Metallic& Ceramic Materials CSIRO Process Science & Engineering Dr.Sumeer Gul Assistant Professor,Department of Library and Information Science,University of Kashmir,India
Dr. Chutima Boonthum-Denecke, Ph.D Department of Computer Science, Science & Technology Bldg., Rm 120, Hampton University, Hampton, VA 23688
Dr. Renato J. Orsato Professor at FGV-EAESP,Getulio Vargas Foundation,São Paulo Business School,RuaItapeva, 474 (8° andar)01332-000, São Paulo (SP), Brazil Dr. Wael M. G. Ibrahim Department Head-Electronics Engineering Technology Dept.School of Engineering Technology ECPI College of Technology 5501 Greenwich Road Suite 100,Virginia Beach, VA 23462 Dr. Messaoud Jake Bahoura Associate Professor-Engineering Department and Center for Materials Research Norfolk State University,700 Park avenue,Norfolk, VA 23504 Dr. V. P. Eswaramurthy M.C.A., M.Phil., Ph.D., Assistant Professor of Computer Science, Government Arts College(Autonomous), Salem-636 007, India. Dr. P. Kamakkannan,M.C.A., Ph.D ., Assistant Professor of Computer Science, Government Arts College(Autonomous), Salem-636 007, India. Dr. V. Karthikeyani Ph.D., Assistant Professor of Computer Science, Government Arts College(Autonomous), Salem-636 008, India. Dr. K. Thangadurai Ph.D., Assistant Professor, Department of Computer Science, Government Arts College ( Autonomous ), Karur - 639 005,India. Dr. N. Maheswari Ph.D., Assistant Professor, Department of MCA, Faculty of Engineering and Technology, SRM University, Kattangulathur, Kanchipiram Dt - 603 203, India. Mr. Md. Musfique Anwar B.Sc(Engg.) Lecturer, Computer Science & Engineering Department, Jahangirnagar University, Savar, Dhaka, Bangladesh. Mrs. Smitha Ramachandran M.Sc(CS)., SAP Analyst, Akzonobel, Slough, United Kingdom. Dr. V. Vallimayil Ph.D., Director, Department of MCA, Vivekanandha Business School For Women, Elayampalayam, Tiruchengode - 637 205, India. Mr. M. Moorthi M.C.A., M.Phil., Assistant Professor, Department of computer Applications, Kongu Arts and Science College, India PremaSelvarajBsc,M.C.A,M.Phil Assistant Professor,Department of Computer Science,KSR College of Arts and Science, Tiruchengode Mr. G. Rajendran M.C.A., M.Phil., N.E.T., PGDBM., PGDBF., Assistant Professor, Department of Computer Science, Government Arts College, Salem, India. 
Dr. Pradeep H Pendse B.E.,M.M.S.,Ph.d Dean - IT,Welingkar Institute of Management Development and Research, Mumbai, India Muhammad Javed Centre for Next Generation Localisation, School of Computing, Dublin City University, Dublin 9, Ireland Dr. G. GOBI Assistant Professor-Department of Physics,Government Arts College,Salem - 636 007 Dr.S.Senthilkumar Post Doctoral Research Fellow, (Mathematics and Computer Science & Applications),UniversitiSainsMalaysia,School of Mathematical Sciences, Pulau Pinang-11800,[PENANG],MALAYSIA. Manoj Sharma Associate Professor Deptt. of ECE, PrannathParnami Institute of Management & Technology, Hissar, Haryana, India
RAMKUMAR JAGANATHAN Asst-Professor,Dept of Computer Science, V.L.B Janakiammal college of Arts & Science, Coimbatore,Tamilnadu, India Dr. S. B. Warkad Assoc. Professor, Priyadarshini College of Engineering, Nagpur, Maharashtra State, India Dr. Saurabh Pal Associate Professor, UNS Institute of Engg. & Tech., VBS Purvanchal University, Jaunpur, India Manimala Assistant Professor, Department of Applied Electronics and Instrumentation, St Joseph’s College of Engineering & Technology, Choondacherry Post, Kottayam Dt. Kerala -686579 Dr. Qazi S. M. Zia-ul-Haque Control Engineer Synchrotron-light for Experimental Sciences and Applications in the Middle East (SESAME),P. O. Box 7, Allan 19252, Jordan Dr. A. Subramani, M.C.A.,M.Phil.,Ph.D. Professor,Department of Computer Applications, K.S.R. College of Engineering, Tiruchengode - 637215 Dr. SeraphinChallyAbou Professor, Mechanical & Industrial Engineering Depart. MEHS Program, 235 Voss-Kovach Hall, 1305 Ordean Court Duluth, Minnesota 55812-3042 Dr. K. Kousalya Professor, Department of CSE,Kongu Engineering College,Perundurai-638 052 Dr. (Mrs.) R. Uma Rani Asso.Prof., Department of Computer Science, Sri Sarada College For Women, Salem-16, Tamil Nadu, India. MOHAMMAD YAZDANI-ASRAMI Electrical and Computer Engineering Department, Babol"Noshirvani" University of Technology, Iran. Dr. Kulasekharan, N, Ph.D Technical Lead - CFD,GE Appliances and Lighting, GE India,John F Welch Technology Center,Plot # 122, EPIP, Phase 2,Whitefield Road,Bangalore – 560066, India. Dr. Manjeet Bansal Dean (Post Graduate),Department of Civil Engineering,Punjab Technical University,GianiZail Singh Campus,Bathinda -151001 (Punjab),INDIA Dr. Oliver Jukić Vice Dean for education,Virovitica College,MatijeGupca 78,33000 Virovitica, Croatia Dr. Lori A. Wolff, Ph.D., J.D. Professor of Leadership and Counselor Education,The University of Mississippi,Department of Leadership and Counselor Education, 139 Guyton University, MS 38677
Contents A Novel Text Stream Clustering Technique for Web Pages using Sliding Window V.Kumuthavalli & Dr.R.Vallimayil………………………….…………………………………….[344]
A Novel Text Stream Clustering Technique for Web Pages using Sliding Window V.Kumuthavalli Associate Professor, Department of Computer Science, Sri Parasakthi College for Women, Courtallam, Tirunelveli, Tamil Nadu, India. Email: saikumuthavalli@gmail.com Dr.V.Vallimayil Associate Professor & Head, Department of Computer Science & Applications, Periyar Maniyammai University, Vallam, Thanjavur, Tamil Nadu, India. Email: vallimayilv@gmail.com
Abstract- Text mining has gained importance recently because of the growing number of electronic documents available from a variety of sources. In the current scenario, the processing of text data streams has become particularly significant: with the rapid development of information technology, large numbers of electronic documents are available on the internet instead of as hard copies. To provide early guidance for decision making from information in social networks, a text stream clustering algorithm is proposed. The text stream is formed by a web crawler that continuously grabs web pages. A sliding time window splits the text stream into continuous segments of web page news, determined by the velocity of the stream and the size of the sliding window. A multilevel clustering method is then used to merge the clusters produced in each sliding window. In the experiments, 2750 web page news items collected by a web crawler were used to simulate a text stream, and the algorithm was evaluated for execution efficiency and for clustering quality in terms of precision and recall rate. Compared with existing methods on various documents, the proposed method provides better results.
Keywords- Text Categorization, sliding window, data stream, text mining, clustering.
1. INTRODUCTION
Today, information has great value, and the amount of information has grown extensively in recent years. Text databases in particular are growing rapidly because of the increasing amount of information available in electronic form, such as electronic publications, e-mail and the World Wide Web. With so much information around us, finding what is relevant to our needs becomes a problem. For this reason, there are many databases and catalogues of information classified into many categories, helping the viewer to navigate easily to the information. Most information in the world is text, and this is where text streaming comes onto the scene. Data mining has attracted a great deal of attention in the information industry and in society as a whole in recent years, due to the wide availability of huge amounts of data and the imminent need to turn such data into useful information and knowledge. Generally, data mining is the process of analyzing data from different perspectives and summarizing it into useful information. Text mining has been defined as the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources. Text mining, which is sometimes referred to as text analytics, is one way to make qualitative or unstructured data usable by a computer. Text mining can help an organization derive potentially valuable business insights from text-based content.

1.1 Text Categorization
Text categorization is one of the well-studied problems in data mining and information retrieval. Categorization is the process in which ideas and objects are recognized, differentiated and understood. Categorization implies that objects are grouped into categories, usually for some specific purpose. A category illuminates a relationship between the subjects and objects of knowledge. Data categorization includes the categorization of text, image, object, voice, etc. With the rapid development of the web, large numbers of electronic documents are available on the Internet. Text categorization becomes a key technology for dealing with and organizing large numbers of documents. Text categorization is the assignment of natural language documents to one or more predefined categories based on their semantic content, and it is an important component in many information organization and management tasks [1][2]. The techniques of text categorization are necessary for improving the quality of any information system dealing with textual information, although they cover only a fraction of document management.

2. REVIEW OF LITERATURE
This research work discusses a methodology for feature extraction and document classification. In preparing this work, we analyzed various pieces of literature that are relevant and helpful to it. The literature retrieved and analyzed is presented in this section. With the development of mobile cloud computing and social network applications, the value of data is changing users' consumption and decision habits in the big data era.
In particular, rapidly increasing, high-velocity data forms a data stream, which is a continuous, unbounded and dynamically varying data set [3]. To help web users make decisions in real time, new challenges arise: how to obtain and extract valuable information from the data stream and form different topic clusters in time and, especially, how to extract multi-dimensional text feature vectors and design a clustering algorithm for text streams with lower time and space complexity. Therefore, the evaluation of the quality of the clustering algorithm, in terms of precision rate, recall rate, efficiency and robustness, becomes the key point when a single-pass scanning method [4] is used in this process. To solve the above problems, this paper proposes a Topic-Based Dynamic Clustering Algorithm for Text Stream (TBDC4TS), which uses a sliding time window to split the text stream into continuous segments and to transform text stream clustering from flow data processing into continuous batch data processing.

There are three kinds of models for processing a data stream: the time-limited model, the sliding window model and the snapshot model [4, 5]. The data scale of all three models depends on the selection of the time interval, which is defined by the interval from an initial time to the current time, by a certain time window size, and by a certain time interval between snapshot operations, respectively. Moreover, some researchers focus on the structure of data stream clustering, which includes algorithms based on the Single-Pass and Clu-Stream algorithms [6].

Single-Pass is a classic incremental clustering algorithm that scans the whole data set only once. Each incoming data item in the stream, as captured by the system, is compared with the existing clusters one by one. If the cluster with the highest similarity to the new item has a similarity larger than the threshold, the new item is merged into this cluster and the average feature of the cluster is recalculated; otherwise a new cluster is created from the new data point. This algorithm is suitable for large data with a certain number of clusters, but not for the situation in which the number of clusters varies while the data volume increases constantly with the data flow.
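The Single-Pass procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the names `cosine` and `single_pass`, the whitespace tokenizer, the raw term-count weighting and the use of cosine similarity here are all assumptions made for the example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def single_pass(docs, threshold):
    """Cluster documents in one scan; returns (centroids, labels).

    Assumption: documents are plain strings, tokenized on whitespace
    and weighted by raw term counts.
    """
    centroids, sizes, labels = [], [], []
    for doc in docs:
        vec = Counter(doc.lower().split())
        sims = [cosine(vec, c) for c in centroids]
        # Index of the most similar existing cluster (None if there is none yet).
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] > threshold:
            # Merge: recompute the running average feature of the cluster.
            n, c = sizes[best], centroids[best]
            for term in set(c) | set(vec):
                c[term] = (c.get(term, 0.0) * n + vec.get(term, 0)) / (n + 1)
            sizes[best] = n + 1
            labels.append(best)
        else:
            # No cluster is similar enough: the document starts a new one.
            centroids.append({t: float(w) for t, w in vec.items()})
            sizes.append(1)
            labels.append(len(centroids) - 1)
    return centroids, labels
```

Because each document is compared only with the clusters that exist when it arrives, the result depends on arrival order, which is the sensitivity to input sequence that the sliding-window batching is meant to reduce.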
Based on the Single-Pass strategy, Zhu[7] analyzed the factors that influence the efficiency and quality of clustering through the weighted coefficients of the dimensions of a feature vector. Yi[8] proposed a method of periodic incremental clustering to obtain a new centre point for a cluster. Yin[8] also put forward a method that splits the data stream into a series of data chunks to reduce the effect of the order of the data stream. Clu-Stream is a clustering framework for data streams with two phases: real-time online clustering and offline clustering. In the online phase, micro-clustering clusters the data sets in different segments of the data stream. In the offline phase, macro-clustering puts the new clusters created online into the whole cluster set and merges them into existing clusters by similarity measurement. In addition, a pyramid time framework is designed to store the clustering results at different granularities and phases. Li[9] proposed a sliding time window, based on the Clu-Stream algorithm, to increase the efficiency of online micro-clustering. However, Clu-Stream adopts the hierarchical clustering method of the BIRCH algorithm, which is suitable only for data sets whose feature vectors all have the same number of dimensions, and not for text data sets whose feature vectors have variable dimensions. Moreover, all these methods have lower efficiency, with larger computation needed to index the high-frequency words in text.

3. METHODOLOGY
3.1 Single Pass Clustering
Single-pass clustering is an incremental clustering algorithm. It requires only one pass over the input dataset. This work uses a specialization of the single-pass clustering algorithm based on a centroid list and cosine similarity[11]; its inputs are a dataset and a threshold. The algorithm starts by setting the first document vector as the initial centroid, then iterates over all document vectors in the dataset, computing the cosine similarity between each document vector and the centroids in the list to find the nearest centroid. It compares the distance to the nearest centroid with the given threshold; if the distance is less than the threshold, the document vector is assigned to that centroid, which is then recalculated, and otherwise the document vector becomes a new centroid. The output of the single-pass clustering algorithm is the centroid list.

3.2 Proposed Methodology
A major problem of the traditional approach is the high dimensionality of the feature vector. A feature vector with a large number of key terms is not only unsuitable but also easily causes overfitting. The goal of a classifier is to assign a category to given unseen documents. In general, automatic processing of text streams has two steps: first, the extraction of feature terms that become effective keywords, and second, the classification of the document using these features. The data stream is sampled at random, and the sliding window model can be used to analyze the stream data. Rather than running computations on all of the data seen so far, the sliding-window model makes decisions based only on recent data. At every time t, a new data element arrives. This element expires at time t + w, where w is the window size. This model is useful when only recent events are important, and it serves as a technique for producing approximate answers to a data stream query.
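The arrival and expiry rule of the sliding-window model (an element arriving at time t expires at time t + w) can be illustrated with a small buffer. This is a minimal sketch; the class name `SlidingWindow` and its methods are hypothetical, and it assumes elements arrive in non-decreasing time order.

```python
from collections import deque


class SlidingWindow:
    """Keeps only the elements that arrived within the last w time units.

    Assumption: add() is called with non-decreasing timestamps.
    """

    def __init__(self, w):
        self.w = w
        self._items = deque()  # (arrival_time, element), oldest first

    def add(self, t, element):
        """Insert an element arriving at time t, then drop expired ones."""
        self._items.append((t, element))
        self.expire(t)

    def expire(self, now):
        """Remove every element whose expiry time (arrival + w) has been reached."""
        while self._items and self._items[0][0] + self.w <= now:
            self._items.popleft()

    def contents(self):
        """Return the live elements, oldest first."""
        return [element for _, element in self._items]
```

Only the live elements are held in memory, so the storage cost is bounded by the number of arrivals within one window rather than by the length of the stream.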
To reduce the memory requirements, only a small window of data is stored. At present, many online analysis software tools use this model to generate summaries of the data in a stream. Two types of windows can be used: count-based and time-based. In the former, the most recent n items are stored; in the latter, only those items generated within the most recent time interval are stored. Micro-clustering is used in the construction of data stream summaries. The Single-Pass clustering algorithm has lower efficiency because it calculates the similarity (by distance) between each new data item and every existing cluster serially in time; its result can also be affected by the order in which the data arrive. The Sliding Time Window used in this work provides batch processing over continuous windows. It can improve clustering efficiency and control the sequence of data added to the stream.

The clustering mechanism for the text stream comprises a sliding window scheduler and text clustering. The Web Crawler captures web page text as the data source to form a continuous text stream. To improve the efficiency of keyword extraction, the text is preprocessed to filter out items such as duplicate URL addresses, embedded images, video and advertisements, which would otherwise negatively influence text feature extraction. The sliding window scheduler controls the basic time window (BWt)k with a single procedure and turns the data stream into a stream of batches. Text features of the document set in (BWt)k are extracted and indexed using a word segmentation tool and a bag-of-words model. The text feature vector is represented by <Person, Address, Time, Frequent Terms>, where
i. Person: the set of person names appearing in the document, {person1, person2, …, personn}
ii. Address: the set of geographical place features appearing in the document, {addr1, addr2, …, addrm}
iii. Time: the set of time-related feature information appearing in the document, {time1, time2, …, timek}
iv. Frequent terms: the set of high-frequency words in the document, {term1, term2, …, termj}

After the text features are extracted and the feature vector of each document is formed, Text Clustering is triggered to cluster the feature vectors in the k-th BWt with a density-based algorithm, and a set of clusters {Clusterk1, Clusterk2, …, Clusterkn} is output. Each cluster is treated as a data point whose feature is the average of the features of the documents in the cluster. Hence, for each micro-cluster in one BWt, the distance to each of the present macro-clusters is calculated, and the micro-cluster is merged into the macro-cluster with the shortest distance, provided that distance is less than the threshold. In addition, the process and result of clustering can be visualized to provide decision support. Together, the sliding time window framework and the clustering process provide a multilevel strategy for text streams.
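The rule for merging micro-clusters into macro-clusters (nearest macro-cluster, accepted only if its distance is below the threshold) can be sketched as follows. The representation is an assumption made for illustration: clusters are dicts holding a numeric centroid and a size, and distance is Euclidean. The function name `merge_micro_into_macro` is hypothetical, and starting a new macro-cluster when no existing one is close enough is also an assumption, since the text does not state that case.

```python
import math


def merge_micro_into_macro(macros, micros, threshold):
    """Merge each micro-cluster into its nearest macro-cluster.

    Clusters are dicts of the form {"centroid": [float, ...], "size": int}.
    Assumption: a micro-cluster with no macro-cluster closer than the
    threshold becomes a new macro-cluster.
    """
    for micro in micros:
        best, best_dist = None, math.inf
        for macro in macros:
            d = math.dist(macro["centroid"], micro["centroid"])
            if d < best_dist:
                best, best_dist = macro, d
        if best is not None and best_dist < threshold:
            # Size-weighted average keeps the centroid equal to the mean
            # feature of all documents merged so far.
            total = best["size"] + micro["size"]
            best["centroid"] = [
                (m * best["size"] + c * micro["size"]) / total
                for m, c in zip(best["centroid"], micro["centroid"])
            ]
            best["size"] = total
        else:
            macros.append(
                {"centroid": list(micro["centroid"]), "size": micro["size"]}
            )
    return macros
```

Treating each micro-cluster as a single weighted point means the macro step compares clusters rather than individual documents, which is where the batch approach saves work relative to per-document Single-Pass.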
Basic Time Window (BWt):
Let t and p represent the time and the time interval, respectively. The document set is obtained from the system as a time series in the time interval (t, t+p). Then the Basic Time Window (BWt) is the document set in that time interval:
BWt = {Doc_{i,j}, 0 < i ≤ n, t ≤ j ≤ (t+p)} …… (3.1)
where the length of BWt is the time interval p.

Sliding Time Window (SWt):
A special BWt that moves forward and operates in cycles of p, forming a control flow for a series of BWt:
SWt = {(BWt)k, 0 < k < m} …… (3.2)
where k corresponds to moment t and k+1 to moment t + p. The size of SWt is the volume of data captured in the time interval p, the length of BWt, and comprises a set of documents. In addition, the process of SWt can become a continuous and infinite window stream.

Size of Basic Time Window (SBWt):
Let v and p denote the velocity of the data stream and the length of the Basic Time Window, respectively. SBWt is the total amount of data in one BWt, which can be calculated as
SBWt = p * v …… (3.3)
According to the above definition, let p = 1 (unit time) and v = 1 (doc per unit time); then SBWt = 1, which means that just one document flows into the window per unit time interval. In that case the process becomes the classical real-time, continuous Single-Pass process. Moreover, if p is fixed as a constant, the SWt is a stable Sliding Time Window, but if p is variable, the SWt is a sliding window with more complexity and flexibility.

Table.3.1 Transaction view Data Stream Sliding Window

3.3 ALGORITHM
Input : Text Stream
Output : Sets of Cluster

Procedure_ Sliding Time Window ( )
{
Step1: Startup the Web Crawler to form a text stream.
Step2: To slide the k-th (0<k≤m) time window (SWt)k with the length p by the module of SWS.
For (k=1, k≤m, k++)
{
Step3: Procedure_ Micro clustering ( )
Step4: Procedure_ Macro-clustering processing ( )
Step5: Repeat Next (SWt) k+1
}
Step6: Stop the process
}

Procedure_ Micro clustering ( )
{
Step1: Start a process with text stream.
Step2: For (i=1, i≤n, i++)
{
doc Feature ik(Person, Address, Time, FT); //Mining i-th document and extracted feature of text, such as person’s name, address, time//
Step3: build Index By Feature (Person, Address, Time);
Step4: DBSCAN({doc Feature ik} );
Step5: Return( {Micro-clusters fk ,0<f ≤ n} )
}
}
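Definitions (3.1) to (3.3) and Step2 of the Sliding Time Window procedure can be illustrated by grouping a timestamped stream into consecutive basic time windows. This is a sketch under stated assumptions: the function names are hypothetical, the stream is a list of (timestamp, document) pairs, and each window is taken as the half-open interval [t, t+p) so that a document on a boundary falls into exactly one window, whereas (3.1) writes the interval as closed.

```python
from collections import defaultdict


def basic_time_windows(stream, t0, p):
    """Group (timestamp, doc) pairs into consecutive windows of length p.

    Window k (k = 1, 2, ...) covers [t0 + (k - 1) * p, t0 + k * p).
    Assumption: half-open intervals, so no document is counted twice.
    """
    windows = defaultdict(list)
    for ts, doc in stream:
        if ts < t0:
            continue  # arrived before the first window opened
        k = int((ts - t0) // p) + 1
        windows[k].append(doc)
    return dict(windows)


def expected_window_size(p, v):
    """Equation (3.3): SBWt = p * v."""
    return p * v
```

With p = 1 and v = 1, `expected_window_size` returns 1, which is the degenerate case noted in the text where each window holds a single document and the procedure reduces to plain Single-Pass.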
Procedure_ Macro clustering ( )
{
  Step1: Start a process with the micro-clusters.
  Step2: For (f = 1; f ≤ n; f++)
         {
           Step3: Single-pass ({Micro-clusters fk});
           Step4: Return ({ClusterTn, 0 < Tg ≤ k*n});
           Step5: Update the database with the new cluster features;
         }
}

4. EXPERIMENTATION & RESULTS
The proposed methodology is tested on content copied manually from multiple websites into text files. This is unstructured text data, and the files represent the number of documents in each category. To test efficiency and precision, web page news is taken from websites by a web crawler to simulate a continuous text stream. The test set contains 7 topics and 2750 web pages, which arrive as test documents at a certain velocity. In the initial phase, the system parameters for the length, velocity and size of SWt are p, v and SBWt, respectively. The execution time of the proposed algorithm is compared with that of the existing methods as SBWt varies from 30 to 450 in fixed steps of 30. Fig.4.1 shows that the proposed algorithm has better execution time efficiency than the existing methods.
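Single-pass clustering, which is used in the macro-clustering procedure and also serves as a baseline in the experiments, can be sketched as follows. This is a generic version of the technique under stated assumptions: items are numeric vectors (for macro clustering these would be micro-cluster centroids), similarity is cosine similarity, and `threshold` is a user-chosen parameter. None of these details is given in the paper, and the names `cosine` and `single_pass` are hypothetical.

```python
from math import sqrt
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass(vectors: List[Vector], threshold: float) -> List[List[int]]:
    """Assign each vector, in arrival order, to the most similar cluster.

    A vector joins the cluster whose centroid is most similar to it if that
    similarity reaches `threshold`; otherwise it starts a new cluster.
    Returns the member indices of each cluster.
    """
    centroids: List[List[float]] = []
    members: List[List[int]] = []
    for idx, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for c, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            centroids.append(list(vec))
            members.append([idx])
        else:
            n = len(members[best])
            # Update the centroid as the running mean of its members.
            centroids[best] = [(cv * n + x) / (n + 1) for cv, x in zip(centroids[best], vec)]
            members[best].append(idx)
    return members
```

Because every item is compared only with the clusters that exist when it arrives, the outcome can depend on the arrival order of the stream.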
Fig.4.1 Comparison of execution time

The precision results of the proposed method and the existing methods are shown in Fig.4.2. As the time window slides, the precision of both algorithms, which use four text feature vectors to describe each document, decreases to a certain degree, but the precision of the proposed method remains higher than that of TBDC4TS and Single-pass. In particular, owing to batch processing in each window and multi-phase clustering, the influence of the order of the text stream is avoided in the TBDC4TS algorithm, whereas the Single-pass algorithm is still affected by it.

Fig.4.2 Comparison of Precision

The recall results of the proposed method and the existing methods are shown in Fig.4.3; the recall of the proposed method is higher than that of TBDC4TS and Single-pass. As the number of text documents increases, the new clusters in each sliding time window (SWt) have to be merged into the existing clusters. If the existing clusters are sparse, which means that the topics of the texts are weakly related and widely scattered, the recall rate of clustering also decreases. The experimental results above show that the proposed method, with its multilevel clustering, achieves higher efficiency and performance than TBDC4TS and Single-pass.

Fig.4.3 Comparison of Recall
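Precision and recall are reported here without a stated definition. One common way to compute them for a clustering result is the pairwise definition sketched below, which is an assumption and may differ from the measure used in the experiments. The function name `pairwise_precision_recall` is hypothetical; `predicted` holds the cluster label of each document and `truth` holds its true topic.

```python
from itertools import combinations
from typing import Hashable, Sequence, Tuple


def pairwise_precision_recall(
    predicted: Sequence[Hashable], truth: Sequence[Hashable]
) -> Tuple[float, float]:
    """Pairwise precision and recall of a clustering against true topics.

    A pair of documents is a true positive if it shares both a predicted
    cluster and a true topic, a false positive if it shares only the
    predicted cluster, and a false negative if it shares only the topic.
    """
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    tp = fp = fn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_pred = predicted[i] == predicted[j]
        same_true = truth[i] == truth[j]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Under this definition, merging unrelated documents into one cluster lowers precision, while splitting one topic across several clusters lowers recall, which is consistent with the effect of sparse clusters on recall described above.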
5. CONCLUSION
In this paper, a novel algorithm is proposed that processes a text stream by means of a sliding time window. Web page news is captured continuously by a web crawler and forms a text stream. The proposed sliding window algorithm splits the text stream into continuous segments, which are then merged through text data stream clustering. Each sliding time window contains a set of web page news items whose number depends on the velocity of the stream and the size of the sliding window. In addition, the framework algorithm is designed to adopt a multilevel clustering method with the sliding time window. The experiments use 2750 web page news items collected from websites by a web crawler to simulate a text stream. The proposed method provides better execution time efficiency, cluster quality, precision and recall than the existing methods.