CLEAR June 2016
CLEAR Journal (Computational Linguistics in Engineering And Research)
M.Tech Computational Linguistics, Dept. of Computer Science and Engineering,
Govt. Engineering College, Sreekrishnapuram, Palakkad 678633
www.simplegroups.in | simplequest.in@gmail.com

Chief Editor: Dr. Ajeesh Ramanujan, Assistant Professor, Dept. of Computer Science and Engineering, Govt. Engineering College, Sreekrishnapuram, Palakkad 678633

Editors: Deepthi Kavya V J, Reshma M M, Swedha B

Cover page and Layout: Anju Balachandran, Faseela M

Contents

Editorial
News & Updates
Speech to Sign Language through an Android Application (Abhinav Das M J, Ajil Ghosh V C, Manu Prasad P, Nandakumar K, Sachin Mathew)
Insurance Analyst: A Data Mining Approach (Bhavya V R, Nikhitha Chandran, Roshni R, Saranya P S)
A Survey on Data Mining Tools (Swedha B)
The Never Ending WSD (Gayathri E S)
Subgroup Detection and Summarization in Ideological Discussion (Anju R C)
CLEAR September 2016 Invitation
Last Word
Dear Readers!

Greetings! This edition of CLEAR is an issue consisting of papers and articles covering major research done in the areas of data mining, speech processing and natural language processing. In our last edition we concentrated mainly on soft computing techniques and their research scope, while this issue contributes different research aspects: insurance analysis as an application of data mining, conversion of speech to sign language, and articles on word sense disambiguation, a survey of data mining tools, and subgroup detection and summarization in ideological discussion. The good response from our readers reflects the wider acceptability and growing awareness of language processing among the technical community, which is definitely gratifying for the entire CLEAR team. On this positive note, I proudly present this issue of CLEAR to the readers, for critical evaluation and feedback.

Best Regards,
Dr. Ajeesh Ramanujan
(Chief Editor)
TALK BY Dr. T. V. Geetha

Dr. Geetha T V is a professor in the Department of Computer Science & Engineering at Anna University, Chennai. She completed her M.Tech at Anna University in 1986 and her PhD, also at Anna University, in 1992. Her areas of specialisation are NLP, intelligent databases, distributed artificial intelligence and data mining. The talk was about text categorization, word sense disambiguation and the various issues associated with them. The discussions that followed included hands-on sessions on text processing, information retrieval techniques and different algorithms. We got a wonderful opportunity to have such a session on text categorization and word sense disambiguation, and we, the students of M.Tech CL, are sincerely thankful to the Head of the Department and Kala miss for providing us such an opportunity.
TALK BY Dr. Vaskar RayChoudhary

Dr. Vaskar RayChoudhary, one of the most reputed professors of IIT Roorkee, visited our college on 31st March and delivered expert talks on an introduction to pervasive computing and networking, and on the synergy between the Internet of Things, Big Data and Cloud Computing towards a Smart City initiative. It was a two-day session organized by the Department of Computer Science and Engineering. On 31st March, the interaction with PG students began in the afternoon as per the schedule; both the first and second year M.Tech Computational Linguistics batches attended the class. On 1st April the talk was about research guidance: how to read, write and present a research paper. The second day's session was very useful to both students and faculty. We, the students of M.Tech CL, are very thankful to the Head of the Department and Dr. Ajeesh Ramanujan sir for providing us such an opportunity.
Speech to Sign Language through an Android Application

Abhinav Das M J, Ajil Ghosh V C, Manu Prasad P, Nandakumar K, Sachin Mathew
Vidya Academy of Science and Technology, Thalakkottukara, Thrissur
abhin111nattika@gmail.com, ajilghoshvc@gmail.com, manumag123@gmail.com, nandakumar6693@gmail.com, sachinmathew609@gmail.com
ABSTRACT: There are about 70 million deaf people who use sign language as their first language or mother tongue, and sign language is the best way to communicate with the hearing impaired. This paper introduces a mobile Android application that can convert spoken and written words into sign language. The application will be of great use to the deaf community as well as others, as it can easily convert speech to sign language while they are in a conversation, watching television, and so on; it thus improves their access to information meant to be delivered to the world, which was earlier denied to them because of their physical restrictions. Before the Short Messaging Service/Multimedia Messaging Service, deaf people rarely used mobile phones; texting now allows deaf people to communicate remotely with both deaf and hearing parties. Communication between a deaf-mute and a hearing person has always been a challenging task, and this paper describes a way to reduce that barrier by developing an assistive tool for deaf-mute persons. The mobile application takes speech, converts it to text, and then displays the sign representation matching that text; the accuracy obtained by implementing the above ideas is high compared to other means. The translation system implementation is made up of three modules: a speech recognizer decodes the spoken utterance into a word sequence; a natural language translation module converts the word sequence into a sign sequence; and in the last module an avatar plays the sign sequence. It is difficult to estimate the effect of such a tool in terms of integrating deaf and hearing people into one society.
I. INTRODUCTION
Various studies reveal that more than 8 percent of the world's population suffers from hearing loss, and many researchers have found that hearing impaired people have very low reading skill compared to hearing people, which causes a serious deficiency in their learning. They are therefore more comfortable using another language for communication, which paved the way for sign language. Sign language uses a lot of gestures, so it looks like a movement language consisting of series of hand and arm motions, facial expressions and
head/body postures. Interpreting sign language involves the tasks of detecting hand movements and facial expressions. There are various standards of sign language, among which American Sign Language is the most used; each standard has its own specific keywords and hand movements. Some English words, like "hello", have their own defined gestures made of facial expression, hand movement and placement of the arms, while unknown words are translated by simply showing the gesture for each letter in the word. In addition, sign language includes specific gestures for each letter of the English alphabet and for each number between 0 and 9. Sign language gestures fall into two groups: static gestures and dynamic gestures. Static gestures are used for alphabet and number representation, whereas dynamic gestures are used for specific concepts, including words and whole sentences; the former consist of a pose of the hand, whereas the latter include motion of the hands, the head or both. Sign language is a visual language and consists of three major components: (i) finger-spelling, (ii) word-level sign vocabulary and (iii) non-manual features. This paper is based on finger-spelling and the word level: finger-spelling is used to spell words letter by letter, whereas the word level is keyword based. This speech to sign language converter reduces the gap between hearing people and people with hearing impairments. The converter is simply an Android application which recognizes speech and displays the sign language for the spoken words; it also includes a dictionary, so the user can choose a word and get the corresponding sign. There are some online converters, but in this paper we introduce an off-line version; since the mobile phone is the most commonly used device, developing such an application is very useful. The output of this application is a 2D or 3D avatar that shows the sign language representation for the corresponding input speech. The avatar may look like a human being, and the creation of the avatar is the main issue in the development of this application.

II. RELATED WORK
In [1], Rekha K and Latha B implement speech to sign language conversion on a mobile device with the help of cloud computing. It makes use of cloud computing technology, as it allows users to use infrastructure, platforms and software at low cost and in an on-demand fashion, utilizing resources elastically. The paper also discusses the layer concept used in cloud services, such as Infrastructure as a Service, Platform as a Service and Software as a Service, as well as mobile cloud computing: with the explosive growth of mobile applications and the budding trend of the cloud computing concept, mobile cloud computing (MCC) has been introduced as a potential technology for mobile services. Mobile cloud computing refers to an infrastructure where both data storage and data processing happen outside of the mobile device.
Mobile cloud applications move the computing power and data storage away from mobile phones and into the cloud; MCC can be defined as a combination of the mobile web and cloud computing, which is the most popular tool for mobile users to access applications and services on the Internet. The paper mentions that mobile cloud computing provides extended battery lifetime, improved data storage capacity and processing power, improved reliability, dynamic provisioning, scalability and multitenancy. The important thing the paper does not mention is that most people with hearing problems come from poor countries with slow internet connections; in our proposed application, the conversion module is integrated as an in-built converter. In [2], M. Jerin Jose, V. Priyadharshni et al. propose a system to help hearing people communicate easily with hard-of-hearing people, using a camera and microphone as devices to implement an Indian Sign Language (ISL) system. The ISL translation system translates voice into Indian Sign Language: it uses a microphone or USB camera to get images or continuous video (from the hearing person), which can be interpreted by the application. Acquired voices are assumed to be translation, scale and rotation invariant. The GUI application displays and sends the message to the receiver. This system lets hearing people communicate easily with deaf/dumb persons, and in video calling or chatting the application also helps the hard-of-speaking and hard-of-hearing. The paper also discusses the conversion of hand gestures into speech using image processing technology. The system offers two modes of translation: in the first, a deaf-dumb person communicates with a hearing person who does not know sign language; the second is the reverse, where a hearing person speaks, the speech is converted to text, and then the sign matching that text is displayed. In the speech to sign translation, speech from the hearing person is taken via the microphone of a cellular phone or computer. As a good quality voice signal is needed, it is first sent for noise removal; the voice data are then converted to text by a speech-recognition module using a trained voice database, the converted text is compared with the database to find its meaning and symbol (sign), and the sign symbol is displayed with the text to the receiver (the hard-of-speaking and hard-of-hearing person). The paper discusses noise removal techniques like filtering (spectral subtraction, Wiener filter, least mean square filter and Kalman filter), spectral restoration (minimum mean square error short-time spectral amplitude) and many more. There are also two further types of noise removal methods, modulation detection and synchrony detection: the speech detector analyses changes in signal amplitude, since speech modulations are slow and have big amplitude fluctuations.
After noise removal, the voice is sent to the speech recognition module, where the speech recognizer converts voice into single letters (and later words and sentences). The system uses context-dependent continuous Hidden Markov Models (HMMs) built using decision-tree state clustering. In this context, a speech to sign language translation system is very useful, since most hearing persons do not know sign language and have difficulties when interacting with deaf people. Once the ISL encoding was established, a professional translator translated the original set into ISL using more than 320 different signs. Then, the 416 pairs were randomly divided into two disjoint sets: 266 for training and 150 for testing purposes. The same test set was used for the speech-to-sign translation; the test sentences were recorded by two speakers (one male and one female). The recognized voice is now in the form of text, i.e., the voice to text conversion module gives a text output. Text to sign matching is then done by a rule-based technique, in which the relationship between text and signs is defined. The natural language translation module follows a bottom-up strategy, in which the relationships between signs and words are defined by an expert hand. In a bottom-up strategy, the translation analysis is carried out by starting from each word individually and extending the analysis to neighbouring context words or already-formed signs (generally named blocks). This extension is made to find specific combinations of words and/or signs (blocks) that generate another sign.
The sign confidence measure is provided by an internal procedure coded inside the proprietary language interpreter that executes the rules of the translation module. The confidence for a generated sign may depend on a weighted combination of confidences from a mixture of words and/or internal or final signs; this combination can assign different weights to the words or concepts considered in the rule, and these weights are defined by the expert at the same time the rule is coded. The paper provides a fairly efficient speech to sign language converter, but the main problem with this implementation technique is the delay between the spoken utterance and the animation of the sign sequence, which slows down interactions, and the paper is not able to reduce this delay. In our paper we propose that the implemented speech to sign language converter can also be used for a video relay service. In [3], S. M. Halawani and Zaitun A. B. describe the display of sign language in the form of a 3D or 2D avatar. The design and development consists of a speech recognition engine: speech recognition is the process of converting an acoustic signal, captured by a microphone or a telephone, to a set of words, starting from phonemes, then syllables, words and finally sentences as input for the recognition system. They chose Sphinx-4, a state-of-the-art speech recognition system written entirely in the Java programming language; Sphinx-4 is a very flexible system capable of performing many different types of recognition tasks. The sign database this paper uses is the Arabic alphabet.
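For context, a rough sketch of the kind of Sphinx-4 recognition loop described in [3]; the stock English model paths shipped with Sphinx-4 are used here for illustration, where [3] would substitute Arabic models:

```java
import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.LiveSpeechRecognizer;
import edu.cmu.sphinx.api.SpeechResult;

public class RecognizerDemo {
    public static void main(String[] args) throws Exception {
        Configuration config = new Configuration();
        // Stock English models for illustration; an ArSL system would plug in Arabic models.
        config.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
        config.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
        config.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");

        LiveSpeechRecognizer recognizer = new LiveSpeechRecognizer(config);
        recognizer.startRecognition(true);            // true clears cached microphone data
        SpeechResult result = recognizer.getResult(); // blocks until an utterance is decoded
        recognizer.stopRecognition();

        System.out.println("Recognized: " + result.getHypothesis());
    }
}
```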
An avatar here is simply the graphical representation of the word to be translated to ArSL; therefore the avatar database is developed to contain as many symbols or graphical icons as possible to represent Arabic letters and words. The paper points out methods to decrease errors and disruption in speech recognition using Sphinx-4, but it builds a database of only the Arabic language, which comparatively few people use, does not mention how the sign animation is searched in the database, and an implementation using Sphinx is not ideal for building an Android application; in our system, we make use of the in-built translator. In [4], A. Sujith Kumar and Shahabaz Begum consider the sign language interpreter, who is responsible for helping deaf or hearing impaired individuals understand what is being said in a variety of situations. An interpreter must understand the subject matter so that he or she can accurately translate what is being spoken into sign language. Whenever an audience needs sign language interpretation an interpreter is required, such as during an office meeting, in a court room or at a presidential speech; interpreters may also be used in one-on-one situations, where they might use technology to provide services from a remote location. The paper makes use of a speech recognition engine, a database and recognized text, and also mentions the use of a video relay service, which allows deaf, hard-of-hearing and speech impaired individuals to communicate over video or
other technology with hearing people in real time, via a sign language interpreter. In the case of sign language interpretation, the interpreter hears the voices of the hearing people through the telephone and renders the message into sign language via a video camera, which the deaf person views on his or her video display. The speech to text translation is done using Google Translate. A mobile search system may sometimes require a database larger than the capacity of a given mobile device, so it may at times be preferable to go to the cloud for image search, analysis and translation into text/voice, depending on the processing power of the mobile devices, the resolution of the images and the size of the vocabulary database. The paper proposes a system using technologies like 3D character imitation, and combines the existing converter with a video relay service, which is helpful to deaf people: when a person says "hello", the VRS converts it into speech-to-sign format, and an Outfit7-style animated character renders the sign without a connection; instead of the cartoon cat we take a human animation. This approach has disadvantages: it requires a large database, the creation of such an animated application is a very difficult task, and it may not work as expected, so it is better to replace it with simpler animations. In [5], R. San-Segundo, R. Barra et al. mention the major issues in translating Spanish into sign language, with the concept of converting speech into sign language where one semantic concept corresponds to a specific sign: it involves assigning one sign to each semantic concept extracted
from the text; also, several concepts may generate a unique sign, and in a third situation it is necessary to generate several signs from a unique concept, depending on the concept and the value, like verbs, general and specific nouns, and lexical visual paraphrases. The translation system is divided into three modules: the first module, the speech recognizer, converts natural speech into a sequence of words (text); the natural language translation module converts the word sequence into a sign sequence, for which the paper presents two proposals, the first being a rule-based translation strategy where a set of translation rules (defined by an expert) guides the translation process; and finally the sign animation is displayed. The database and domain make use of the most frequently used sentences from a typical dialogue. The natural language translation uses rule-based translation, reducing the delay between the spoken utterance and the sign animation by using partial recognition results. The disadvantage of this approach is that translation is not a linear alignment process between spoken words and signs: words spoken in the middle of the utterance can carry information about the first signs, so until these words are spoken, the first sign is not completely defined and cannot be represented. In order to reduce the delay, the most commonly used phrases are placed first. In [6], Suganya R and Dr. T. Meeradevi describe a way to reduce the communication barrier by developing an assistive device for deaf-mute persons.
The advancement in embedded systems provides space to design and develop a sign language translator system to assist mute people, and a number of assistant tools exist. The main objective is to develop a real-time embedded device for the physically challenged to aid their communication by effective means. The work proposed in that paper implements a system without hand-held gloves and sensors, by capturing the gestures continuously and converting them to voice and vice versa, thus making communication simpler for deaf and dumb people through a hand-held embedded device along with the hardware setup. The gesture is matched by pattern recognition with a neural network: the set of all input values is called the input pattern and the set of output values the output pattern, and the neural network learns the relation between different input and output patterns, mapping input patterns to output categories; the text output of the corresponding recognized gesture is then displayed. The use of a neural network makes the initial training phase somewhat cumbersome and time consuming: initial training requires a lot of time, and the network has to be trained with a large number of training examples. The conversion of speech to sign language requires extensive training and sophisticated hardware resources, and since there is no implementation of such a system, its accuracy cannot be predicted. In [7], Dhananjai Bajpai, Uddaish Porov et al. focus on providing an applicative architecture of a hand glove that records the gestures made by a speech- and hearing-disabled person, converts them into meaningful text and transmits it to remote areas with the help of Bluetooth; a microcontroller is used for the gesture processing algorithm, transceiver modules for transmitting and receiving data, and a graphical user interface displays all the information sent and received between two users. It does not mention the delay in wireless transmission, and the paper makes use of hardware components to implement the conversion, which are very costly and cumbersome to carry. In our system, extra hardware is completely eliminated.

III. DESIGN AND IMPLEMENTATION

A. PROPOSED SYSTEM
Currently some applications are available, but most of them are online versions and have many issues, such as not responding to input speech. Other issues are storage and speed: current applications need a database larger than the capacity of a given mobile device, and since most of them are cloud based, database search takes a considerable amount of time, along with problems in translation into text/voice. Some of these applications are also open to thrashing issues. Currently marketed speech to sign language applications need an internet connection, as they are cloud based, and the problem is that most remote areas have poor internet facilities. The proposed system consists of, first, a gallery section, where a hearing person who does not know sign language can learn it, and which can also show unknown words such as names letter by letter; in the avatar section, the spoken language is converted to text and then the sign matched to that text is displayed. The objective is to develop a mobile based interpreter of sign language that enables people who do not know sign language to communicate with deaf individuals. The system starts with speech being translated into a text message, followed by the display of the sign language for that text; here we try to use the idea of offline voice to text conversion. A lot of database retrieval and search problems still exist; these can be reduced by embedding the animation resource files in the XML part of the Android application. The proposed system is an interactive application that helps the user identify the appropriate sign representation for commonly used phrases, thus providing an efficient means to communicate with the deaf-mute. The home page includes a menu to choose between the two main parts, the sign gallery and the avatar gallery, and the system can be divided into the following subsections:
Sign gallery: This section provides the user a way to become familiar with the sign representation of each English letter, according to the American standard. The experimental framework is limited to sentences spoken by hearing people. A speech to sign language translation system is very useful, since most hearing people do not know sign language; some words, like names and places, do not have a specific or unique representation, so they must be displayed in finger-spelled form, and this section supports that.
Speech recognizer: The speech recognizer is present on the home page of the application. The user can give input vocally and the avatar will perform its task accordingly; this is accomplished by the Speech API in Android, with the speech taken via the microphone of the mobile phone. Google Translate additionally allows users to type text in their native tongues and receive textual and audible translations in several vernaculars. The natural language converter module converts the word sequence into a sign sequence that is animated by the avatar.
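As a sketch, the voice input described above can be captured on Android through the platform's built-in recognizer roughly as follows; the activity and request code are illustrative, not the authors' exact code:

```java
import android.app.Activity;
import android.content.Intent;
import android.speech.RecognizerIntent;
import java.util.ArrayList;

public class SpeechInputActivity extends Activity {
    private static final int REQ_SPEECH = 100; // arbitrary request code

    private void startListening() {
        // Fires Google's built-in speech recognizer and returns text candidates.
        Intent intent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
        intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL,
                        RecognizerIntent.LANGUAGE_MODEL_FREE_FORM);
        startActivityForResult(intent, REQ_SPEECH);
    }

    @Override
    protected void onActivityResult(int requestCode, int resultCode, Intent data) {
        super.onActivityResult(requestCode, resultCode, data);
        if (requestCode == REQ_SPEECH && resultCode == RESULT_OK && data != null) {
            ArrayList<String> matches =
                data.getStringArrayListExtra(RecognizerIntent.EXTRA_RESULTS);
            if (matches != null && !matches.isEmpty()) {
                String text = matches.get(0); // best hypothesis
                // showSigns(text) would look up and play the matching sign animation.
            }
        }
    }
}
```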
Rule-based translation: One available approach is rule-based translation, where the translation is carried out by starting from each word individually and extending the analysis to neighbouring context words or already formed signs. This extension is made to find specific combinations of words and/or signs that generate another sign.
Text inputer: This section provides direct conversion of text into the sign representation, allowing the user to interact with hearing impaired people directly. The text inputer is a simple text input line; based on the text input, the sign representation is displayed. The text input is decoded word by word, because this gives a wider view of the sign representation than the complete sentence form, and the word-by-word sign form provides a better design for displaying all texts.
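A minimal sketch of this word-by-word decoding, with the finger-spelling fallback described earlier; the resource naming scheme (sign_/letter_ prefixes) is an assumption for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SignLookup {
    private final Set<String> knownSigns; // words that have a dedicated animation

    public SignLookup(Set<String> knownSigns) {
        this.knownSigns = knownSigns;
    }

    /** Maps an input sentence to an ordered list of animation resource names. */
    public List<String> decode(String sentence) {
        List<String> clips = new ArrayList<>();
        for (String word : sentence.toLowerCase().split("\\s+")) {
            if (knownSigns.contains(word)) {
                clips.add("sign_" + word);          // e.g. sign_hello
            } else {
                for (char c : word.toCharArray()) { // finger-spell unknown words
                    if (Character.isLetterOrDigit(c)) clips.add("letter_" + c);
                }
            }
        }
        return clips;
    }

    public static void main(String[] args) {
        SignLookup lookup = new SignLookup(new HashSet<>(Arrays.asList("hello")));
        System.out.println(lookup.decode("hello anu")); // [sign_hello, letter_a, letter_n, letter_u]
    }
}
```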
Broadcast: Broadcasting is the process of sending messages to multiple clients from a server. It is helpful because a message can be delivered to multiple clients at a time, and in this project broadcasting has an important scope for the education of hearing impaired students. Education is given to deaf students to make them fit for modern society, and for educating them we design a server-client model of broadcasting to the impaired students.
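A minimal sketch of the server side of such a broadcast, using plain Java sockets on the shared network; the port number and message format are assumptions:

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class BroadcastServer {
    private static final int PORT = 5000; // assumed port on the classroom Wi-Fi
    private final List<PrintWriter> clients = new CopyOnWriteArrayList<>();

    public static void main(String[] args) throws IOException {
        BroadcastServer srv = new BroadcastServer();
        ServerSocket server = new ServerSocket(PORT);
        // Accept student devices in the background; each client gets a writer.
        new Thread(() -> {
            while (true) {
                try {
                    Socket s = server.accept();
                    srv.clients.add(new PrintWriter(s.getOutputStream(), true));
                } catch (IOException e) { return; }
            }
        }).start();
        // In the real application each recognized sentence would be passed in here.
        srv.broadcast("hello");
    }

    /** Sends one recognized sentence to every connected client for sign display. */
    public void broadcast(String sentence) {
        for (PrintWriter out : clients) out.println(sentence);
    }
}
```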
B. HOME PAGE
The home page includes a menu to choose the different activities.
The proposed system is an Android application that requires the speech recognizer, the text input editor and the broadcast section. The speech recognizer section lets the user of the application give voice input; the voice input is captured by invoking Google's inbuilt API, which recognizes the speech input and converts it into text form, and this string-based input is given to the application. There are two methods to display the sign: by the sentence itself, or by dividing the sentence into individual words. Displaying the sign form requires a data collection of sign representations; the sign form here is an animated video of a human figure displaying the sign. If there is no corresponding sign language for a given word, finger spelling is used: the word is divided into its letters, and the sign form for each letter is displayed from the resource file of the application. The application also has the feature of taking text input directly; in that case too, the data collection in the resource file of the application has to be searched. Another important feature is the broadcast section; since broadcasting uses the client-server model, a server is needed.

Fig. 1. Dynamic gestures

Fig. 2. Static gestures
Placing the server on a web host would introduce delay in access, so the server is a separate program built with simple Java socket programming, and the condition is that the server and clients are connected to a common Wi-Fi network. The aim of the broadcast is to extend the application to classroom teaching: the speaker gives speech or text input, and the corresponding sign language is displayed on the client Android phones. The speech recognizer is present in the avatar gallery; the user can give input vocally and the avatar performs its task accordingly, accomplished by the Speech API in Android, with the speech taken via the microphone of the mobile phone. Google Translate allows users to type text in their native tongues and receive textual and audible translations in several vernaculars. The natural language converter module converts the word sequence into a sign sequence that is animated by the avatar. Two approaches are available, rule-based translation and a statistical approach: rule-based translation is carried out by starting from each word individually and extending the analysis to neighbouring context words or already formed signs, in order to find specific combinations of words and/or signs that generate another sign, whereas the statistical approach works on the continuous sentence. The avatar is proposed as a humanoid form with comprehensible body parts sufficient for our application; its movements are controlled using the animation facility in Android.

Fig. 3. Data flow diagram, level 0

Fig. 4. Data flow diagram, level 1
C. IMPLEMENTATION ISSUES
Translation involves the analysis of both languages. This analysis is carried out with semantic concepts and signs, instead of considering the relations between words and signs directly; a semantic concept is directly mapped onto a specific sign. The translation is simple, as it is just a mapping of a spoken sentence to the corresponding animated one. The translation module generates one confidence value for every sign, a value between 0.0 (lowest confidence) and 1.0 (highest confidence); the sign language that is displayed corresponds to a value very near the highest confidence, since the resource displayed from the XML is specific to each sentence.
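The weighted combination of confidences can be pictured as in the following sketch; the weights and word confidences are illustrative values that an expert would set in the translation rules:

```java
public class SignConfidence {
    /**
     * Combines word-level confidences into one sign confidence in [0.0, 1.0]
     * using expert-assigned weights (weights are assumed to sum to 1).
     */
    public static double combine(double[] wordConfidences, double[] weights) {
        double confidence = 0.0;
        for (int i = 0; i < wordConfidences.length; i++) {
            confidence += weights[i] * wordConfidences[i];
        }
        return Math.min(1.0, Math.max(0.0, confidence));
    }

    public static void main(String[] args) {
        // Two recognized words feed one sign; the first word matters more to the rule.
        double conf = combine(new double[]{0.9, 0.6}, new double[]{0.7, 0.3});
        System.out.println(conf); // 0.81
    }
}
```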
Fig. 5. Data flow diagram, level 2

One of the major issues is the delay between the spoken utterance and the animation of the sign language, which slows down the real-time interaction between hearing and deaf people. The major delay comes from matching the recognized speech to sign language by database search; in this paper the use of a separate dedicated database is avoided by embedding the sign language resources in the XML part of the Android application, and based on the input text the corresponding animation is displayed. When implementing this, an important problem is that translation is not a linear alignment process between spoken words and signs: until certain words are spoken, the first sign is not completely defined and cannot be represented. This problem appears mainly with verb tense signs, FUTURE and PAST (there is no present sign), which appear at the beginning of the sign sequence independently of the verb position. In sign recognition, a gestural form of human communication exists for the deaf in the form of sign language; gestures may be static or dynamic. Static gestures represent the alphabet or numerals of a natural language, which are arbitrary signs representing specific concepts, while dynamic gestures are used for individual phrases or complete sentences, like the word "hello".

IV. CONCLUSION AND FUTURE WORK
This paper helps hearing people communicate with hard-of-hearing people. The mobile application takes speech as input, converts it to text, and then displays the sign representation for that text; the accuracy obtained by implementing the above ideas is high compared to other means. The translation system implementation is made up of three modules: a speech recognizer decodes the spoken utterance into a word sequence; a natural language translation module then converts the text into a sign sequence; and in the last module the avatar plays the sign sequence. It is difficult to estimate the effect of such a tool in terms of integrating deaf and hearing people into one society. Currently the mobile application works for only a few specific English sentences as input; complexity grows as we add more animations, and the size of the application may increase a lot, so there is a need to look at technologies such as neural networks. The application currently displays a 2D avatar, which can be upgraded to 3D animation and integrated with a video relay service. Building the sign database can be time consuming, but it is nevertheless achievable with the right resources, and the same can be said for populating the avatar database with avatars. The next biggest challenge will be to select and use the best technology available: at the moment we are presented with many possible technologies, such as robotics, virtual reality, computer vision, neural networks, Hidden Markov Models, 3D animation and natural language processing. They can be used to complement each other, and this has yet to be explored in sign language translation systems research.

V. ACKNOWLEDGMENT
We would like to thank the Lord Almighty, the foundation of all wisdom, who has been guiding us in every step. Our sincere thanks to Dr. Sudha Balagopalan PhD, MISTE, MIEEE, MIET, our Principal, for providing us all the necessary facilities. We take this opportunity to extend our thanks to Col. M. P. Induchudan (Retd) ME, FITE, FIE, SMCSI, the Head of Computer Science & Engineering, for valuable guidance in developing this project. We are extremely thankful to our guides and supervisors, Mrs. Beena M V, M.Tech, Asst. Professor, and Mrs. Jucy Vareed, B.Tech, M.Tech, Asst. Professor, in the Department of Computer Science & Engineering, for giving us valuable suggestions and critical inputs in the preparation of this report. We express our heartfelt thanks to our lab instructors for their valuable support and assistance in our work. Our sincere thanks to our parents and friends who have helped us during the course of the project work and have made it a great success.
REFERENCES

[1] K. Rekha and B. Latha, Mobile Translation System from Speech Language to Hand Motion Language, International Conference, IEEE, 2014.
[2] M. Jerin Jose and V. Priyadharshni, Indian Sign Language Translation System for Sign Language Learning, International Journal of Innovative Research and Development, Vol. 2, Issue 5, May 2013.
[3] S. M. Halawani and Zaitun A. B., An Avatar Based Translation System from Arabic Speech to Arabic Sign Language for Deaf People, International Journal of Information Science and Education, Vol. 2, November 2012.
[4] A. Sujith Kumar and Shahabaz Begum, Sign Mobiles: An Android App for Specially Able People, International Journal of Advanced Research in Computer Science and Software Engineering, Vol. 4, Issue 9, September 2014.
[5] R. San-Segundo and R. Barra, Speech to Sign Language Translation System for Spanish, Vol. 4, Issue 9, Department of Electronics, University of Alcala, Spain.
[6] Suganya R and Dr. T. Meeradevi, Design of a Communication Aid for Physically Challenged, IEEE-sponsored 2nd International Conference on Electronics and Communication Systems.
[7] Dhananjai Bajpai and Uddaish Porov, Two Way Wireless Data Communication and American Sign Language Translator Glove for Images Text and Speech Display on Mobile Phone, Fifth International Conference on Communication Systems and Network Technologies, IEEE, 2015.
[8] W. Stokoe, D. Casterline, and C. Croneberg, A Dictionary of American Sign Language on Linguistic Principles, Gallaudet College Press, Washington D.C., USA, 1965.
[9] O. Aran, I. Ari, L. Akarun, B. Sankur, A. Benoit, A. Caplier, P. Campr, A. H. Carrillo, and F.-X. Fanard, SignTutor: An Interactive System for Sign Language Tutoring, IEEE Multimedia, Vol. 16, pp. 81-93, 2009.

HCL wins best NLP award at Artificial Intelligence Summit in London

Leading global IT services company HCL Technologies has been named winner of the 'Best Innovation in Natural Language Processing (NLP)' award at the AIconics Awards during the "AI Summit" held recently in London. The AIconics are the world's only independently judged awards celebrating the drive, innovation and hard work of the international Artificial Intelligence (AI) community. "This award gives testimony to the innovation and pragmatic implementation of Natural Language Processing in the 21st century enterprise," said Kalyan Kumar, executive vice president at HCL Technologies, in a statement. HCL was also named a finalist in the "Best Intelligent Assistant" category, which showcases companies making ground-breaking advancements in virtual assistants and advanced voice/text recognition capabilities.

For more details visit: http://www.hindustantimes.com/tech/hclwins-best-nlp-at-artificial-intelligencesummit-in-london/story8jWAzmvSN07ZLfA1nenW4K.html
Insurance Analyst: A Data Mining Approach

Bhavya V R, Nikhitha Chandran, Roshni R, Saranya P S
Vidya Academy of Science and Technology, Thalakkottukara, Thrissur
bhavyaradhakrishnan555@gmail.com, nikhithachandran94@gmail.com, roshnita93@gmail.com, saranyapsankar@gmail.com

ABSTRACT: Life insurance is basically a tool for protection of life, against the death of individuals or any unforeseen event; it provides financial protection against such risks. The purpose of investing in life insurance may differ for every individual. According to a survey, there are at present 23 private life insurance companies and one public life insurance company in India. With this large number of life insurance companies and their wide range of products, confusion arises among investors as to which product, and of which company, to purchase. So, in order to form updated marketing strategies, marketers are interested in knowing the investment patterns of life insurance investors. Using data mining techniques, it is possible to predict which types of customers are interested in a certain kind of life insurance product, and to develop new products based on company profitability (different products can be developed for different customers, and marketing strategies can be made for them). Generally, data mining techniques are used to extract hidden patterns from large amounts of data in the form of a data warehouse. This research paper emphasizes investors' investment behavior in the life insurance sector of India, using data mining techniques, and implements big data with the data mining concept. We consider two objectives: to find the list of potential customers, and to find a reference plan for developing a new policy product.
I. INTRODUCTION
This paper implements big data with the data mining concept. Big data is a collection of data sets so large and complex that it becomes difficult to process them using traditional data processing applications; big data is extracted using data mining techniques. Data mining is the process of extracting knowledge from large amounts of data. This paper aims to present how data mining with big data is useful in the insurance industry, how it produces good results, and how data mining enhances decision making using insurance data. The rapid growth in the accumulation of digital data by various life insurance companies is attributed to the growth of information technology, yet the valuable information hidden in these data is hardly exploited. The primary goal of data mining is to extract knowledge from data to support decision making; it is also known as the technique of exploring data in order to discover previously unknown patterns. By performing data mining, interesting regularities or high-level information can be extracted from databases and viewed or browsed from different angles. The discovered knowledge can be applied to decision making, process control, information management and query processing. In this paper, using the big data concept, we develop a new data mining algorithm and apply it to find our two objectives:

Potential customers, i.e., which types of customers are interested in a certain kind of life insurance product.
A new policy product, i.e., with the help of data mining techniques and the available data from different customers, different products can be developed for different customers and marketing strategies can be made for them; the objective here is to find a reference plan for developing a maximum-profit new policy product.
The knowledge discovery process behind this consists of the following steps:

Selection: selecting data relevant to the analysis task from the database.
Transformation: transforming the data into forms appropriate for mining.
Data mining: choosing a data mining algorithm appropriate to the patterns in the data and extracting data patterns.
Interpretation/Evaluation: interpreting the patterns into knowledge by removing redundant or irrelevant patterns, and translating the useful patterns into terms understandable to humans.

Fig. 1. Overview of the proposed system

II. ROLE OF DATA MINING IN INSURANCE INDUSTRY

Data mining is a powerful new technology with great potential to help insurance firms focus on the most important information in the data they have collected about the behaviour of their customers and potential customers. Data mining assists the insurance sector in predicting fraudulent claims and medical coverage, and in predicting customer patterns, i.e., which customers will buy new policies.

Data mining is applied in the insurance industry and gives tremendous competitive advantage to the companies that have implemented it successfully. Some of the areas data mining can be applied to in the insurance industry are:

Identifying risk factors that predict profits, claims and losses
Customer level analysis
Marketing and sales analysis
Developing new product lines
Reinsurance
Financial analysis
Estimating outstanding claims provision
Detecting fraud

In the insurance industry, decision support systems play an important role. Data mining is used to support the control of policies, administrative and management tasks, and the efficient management of organizational and financial data.

III. DATA MINING TECHNIQUES

Data mining techniques have been applied to various insurance domains to improve decision making. Data mining uses predictive modeling, market segmentation and market basket analysis to answer business questions with greater accuracy. The main techniques used for knowledge discovery from databases in the insurance industry are:

Classification
Clustering
Regression
Association rules
Summarization

A. Classification

This is one of the most commonly used techniques; it develops models that can describe population records at large. The classifier training algorithm uses this technique for business development. Various classifiers are used for the classification algorithms, such as decision trees, Bayesian classifiers, neural networks and support vector machines. The customer database can be segmented into homogeneous groups: classification maps data into predefined groups or segments. Data mining classification algorithms are applied to insurance benchmark data sets. The types of classification are:

Supervised classification
Unsupervised classification

B. Clustering

Clustering is used for the identification of similar classes of objects and for grouping based on customer behavior. It is applicable to customer segmentation and targeted marketing. The types of clustering are:

Hierarchical agglomerative methods
Partitioning methods
Density based methods
Grid based methods
Model based methods

C. Regression

Regression can be used for prediction. Regression analysis is used to model the relationship between one or more independent and dependent variables. In insurance firms, more complex techniques are needed to predict future values. Types of regression include:

Linear regression
Non-linear regression
Multivariate linear regression
Multivariate non-linear regression
D. Association

Insurance companies face a lot of problems with customer retention. Association is used for this task because it finds all the associations where customers bought a frequent item set, and it helps business firms make certain decisions. Market basket analysis and cross-selling programs are typical examples for which association modeling is usually adopted. When customers want to insure some policy, this technique helps in finding the associations between different items: classifying and predicting consumer behavior, predicting the likelihood of success of policies, classifying historical customer records, predicting what type of policy is most likely to be retained and most likely to be dropped, and predicting insurance product behavior and attitude. It also helps in predicting the performance progress of segments throughout the performance period, and in finding what factors will attract new avenues in the insurance sector, to classify trends of movement through the organization from successful/unsuccessful customer historical records.

E. Summarization

This technique, used for report generation, provides better decision making for large volumes of customer data with the help of visualization tools, and provides more functionality in business decision making. These data mining techniques can help an organization solve business problems and make decisions, but selecting the appropriate technique is important for the organization.

IV. DATA MINING TASKS
Data mining is becoming common in both the private and public insurance sectors. Customer data are one of the most valuable assets of any firm. The traditional methods used for handling the huge amounts of data generated by insurance transactions are too complex; data mining provides the methodology for turning these huge amounts of data into decisions. Insurance firms use data mining methodologies to enhance research and increase sales among customers. Data mining is used for various tasks in the insurance sector, as follows:
Acquiring new customers
Customer level analysis
Customer segmentation
Policy designing and policy selection
Prediction
Claims management
Developing new product lines
Underwriting and policy management
Risk management
Reinsurance
Fraud detection
Trend analysis
V. TECHNOLOGIES USED
A. Hadoop

The Hadoop framework is designed to provide a reliable, shared storage and analysis infrastructure to the user community. There is a storage portion and an analysis portion in this framework: the storage portion is provided by a distributed file system solution such as HDFS, while the analysis functionality is provided by MapReduce. Several other components are part of the overall Hadoop solution suite. MapReduce is a tool for deep data analysis and for the transformation of very large data sets. Hadoop allows users to explore and analyze complex data sets with customized analysis scripts and commands; in other words, through customized MapReduce routines, unstructured data sets can be distributed, explored and analyzed across thousands of shared-nothing processing systems, clusters and nodes. Hadoop's HDFS uses replication to safeguard the environment from potential data loss by saving the data onto multiple nodes. HDFS (storage) and MapReduce (processing) are the two core components of Apache Hadoop. The most important aspect of Hadoop is that HDFS and MapReduce are designed with each other in mind, and each is co-deployed such that there is a single cluster that provides the ability to move computation to the data.
Thus the storage system is not physically separate from the processing system. HDFS is a distributed file system that provides high-throughput access to data, while providing only a limited interface for managing the file system. HDFS creates multiple replicas of each data block and distributes them on computers throughout a cluster to enable rapid and reliable access. The two main components of HDFS are the NameNode and the DataNodes. The NameNode is called the master of the system: it maintains the name system (files and directories) and manages the blocks kept on the DataNodes. The DataNodes are the slaves, deployed on each machine; they provide the actual storage and are responsible for serving read and write requests from clients. Any set of loosely or tightly connected computers that work together as a single system is called a cluster, and a computer cluster used for Hadoop is simply called a Hadoop cluster. A Hadoop cluster is a special type of computational cluster designed for storing and analyzing vast amounts of unstructured data in a distributed computing environment, running on low-cost commodity computers. Hadoop clusters are well known for boosting the speed of data analysis applications, and they are highly scalable: if a cluster's processing power is overwhelmed by increased volumes of data, additional cluster nodes can be added to increase throughput.
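As an illustration of how a client reads a file whose blocks the NameNode resolves and the DataNodes serve, here is a minimal sketch using the HDFS Java API; the NameNode address and input path are assumptions about a running cluster:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // NameNode address is an assumption; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        // The NameNode resolves the path; DataNodes serve the replicated blocks.
        Path input = new Path("/insurance/data.csv"); // hypothetical input file
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(input)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}
```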
Hadoop clusters are also highly resistant to failure, because each piece of data is replicated on other cluster nodes, which ensures that data will not be lost if one node fails.

B. MapReduce

MapReduce is a framework for performing distributed data processing using the MapReduce programming paradigm. In this paradigm, each job has two user-defined phases: a map phase followed by a reduce phase. The map phase is a parallel, shared-nothing processing of the input; in the reduce phase, the output of the map phase is aggregated. For these phases there are two algorithms, called the mapper and the reducer. The mapper takes the actual data as input and produces a sorted form of it as output, which in turn is the input for the reducer. HDFS is the storage system for both the input and the output of MapReduce jobs. There are two main components in MapReduce: JobTrackers and TaskTrackers. The JobTracker can be termed the master of the system; it manages the jobs and resources in the cluster and tries to schedule each map task as close as possible to the actual data being processed. The TaskTrackers, termed slaves, are deployed on each machine; they make sure that the map and reduce tasks take place as instructed by the JobTracker. Writing raw MapReduce algorithms can be complicated, so to make it simpler, languages like Pig and Hive were invented.
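As a sketch of the two phases on this paper's kind of data, the classic counting pattern: the mapper emits one (plan, 1) pair per policy record and the reducer aggregates the counts; the CSV column layout is hypothetical:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PlanCount {
    /** Map phase: parallel, shared-nothing pass over each input split. */
    public static class PlanMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] cols = value.toString().split(",");
            if (cols.length > 1) {
                // Column 1 is assumed to hold the plan name.
                ctx.write(new Text(cols[1].trim()), ONE);
            }
        }
    }

    /** Reduce phase: aggregates the mapper output per plan. */
    public static class PlanReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text plan, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(plan, new IntWritable(sum));
        }
    }
}
```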
These languages convert their scripts into the MapReduce format.

C. Pig

Pig is a dataflow system which runs on top of Hadoop. It offers an SQL-like language, Pig Latin, to express the dataflow, and transfers the dataflow to MapReduce jobs. It also supports user-defined functions, which can currently be implemented in Java, Python, JavaScript and Ruby. Though it was developed to run on top of Hadoop, it can also be extended to run on top of other systems. Pig allows three execution modes: interactive mode, batch mode and embedded mode. In interactive mode, the user issues commands through an interactive Grunt shell, and Pig executes the commands only when a STORE command is given. In batch mode, the user runs a pre-written query script, which typically ends with STORE. Embedded mode allows the user to embed Pig queries in Java or other programs.
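A minimal sketch of the embedded mode just mentioned, issuing Pig Latin from Java through PigServer; the input file and field layout are hypothetical:

```java
import java.util.Iterator;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class EmbeddedPigDemo {
    public static void main(String[] args) throws Exception {
        // Local mode for illustration; ExecType.MAPREDUCE would run on the cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Hypothetical customer file with name and annual income fields.
        pig.registerQuery("data = LOAD 'data.csv' USING PigStorage(',')"
                + " AS (name:chararray, annual_income:double);");
        pig.registerQuery("rich = FILTER data BY annual_income > 500000.0;");

        Iterator<Tuple> it = pig.openIterator("rich"); // triggers execution
        while (it.hasNext()) {
            System.out.println(it.next());
        }
        pig.shutdown();
    }
}
```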
D. CentOS

CentOS (abbreviated from Community Enterprise Operating System) is a Linux distribution that attempts to provide a free, enterprise-class computing platform. It is supported by the community and aims to be functionally compatible with its upstream source, Red Hat Enterprise Linux (RHEL). In January 2014, CentOS announced its official joining with Red Hat, while staying independent from RHEL under a new CentOS governing board.
VI. DEVELOPMENT
This project includes five phases: data collection, selection, transformation, data mining and, finally, the evaluation phase. In the first phase, the data collection process is carried out on the insurance data, both structured and unstructured. The second phase is the selection phase, in which the required data are selected from the whole data. The third phase is the transformation phase, in which the selected data are transformed into an appropriate form and sorted. The fourth phase is the data mining phase, in which the data are analyzed. In the final phase, the analyzed data are evaluated into the final result.

1st PHASE: Here, data are collected from the unstructured data and pre-processed into CSV files. A CSV file holds comma-separated values stored in tabular form. Preprocessing is required to improve the quality of the input data and to generate enhanced data for further processing.

2nd PHASE: In this phase, the required data are selected from the CSV files. For example, we can select policy details, plans, customers, and so on, as required by the query.

3rd PHASE: This phase gives the data in table format; that is, the selected data are transformed into tables.

4th PHASE: Here, we analyze the transformed data based on a query. Our first query is finding the potential customers: we get the list of customers who are fit to join or choose the policy that is to be introduced. The next query is the new policy product: this gives us the maximum-profit policy in the company, so that it can be used as a reference for designing a new policy.

FINAL PHASE: This phase gives the visualization of the results. The analyzed results are stored and published to a portal, where they can be visualized as charts, bar diagrams or tables. For implementing this, we first have to install CentOS.
A. Installation of CentOS
Disable UEFI boot on new systems
Disable Secure Boot if required (on Windows 8 systems)
Install CentOS (minimal installation)
Enable LAN in /etc/sysconfig/network-scripts/ifcfg-eth0 (ONBOOT=yes, NM_CONTROLLED=no)
Install wget and nano (yum install wget nano -y)
Download Java (wget -c http://www.xeoscript.com/java.rpm)
Install Java (rpm -ivh java.rpm)
Download the Cloudera repository (wget -c http://www.xeoscript.com/cloudera-repo.rpm)
Install the Cloudera repository (rpm -ivh cloudera-repo.rpm)
Install Pig (yum install pig -y)
Install Cloudera Manager
Download parcels
Set up the cluster

B. Algorithm

Here we have two objectives, as described above, and hence we developed two algorithms to find the result for each of them.

a. Algorithm to find the potential customers:
Inputs - Income of customers and plan details. data.csv contains all data (in CSV format) and the plan details are in plan.txt.
Steps:
1. Start
2. LOAD all PLAN_DETAILS from plan.txt
3. LOAD all WHOLE_DATA from data.csv
4. Group the WHOLE_DATA by contact number using the FOREACH operator and enclose the schema in parentheses using FLATTEN
5. For all the unique data, generate annual_income, name, contact_number, id, expiry and monthly_income as WITH_MONTHLY_INCOME_VALIDITY (expiry = start_date + ((plan_validity + 1) * 60*60*24*30) and monthly_income = annual_income / 12)
6. Find the cross product of WITH_MONTHLY_INCOME_VALIDITY and PLAN_DETAILS as WITH_PLAN_DATA
7. For all WITH_PLAN_DATA, generate plan_name, name, contact_number, expiry, plan_sum_assured, plan_amount, plan_policy_profit, plan_validity, has_expired, bal_amount and monthly_income as WITH_HAS_EXPIRED_BAL_AMOUNT (has_expired = expiry - (plan_start * 60*60*24*30) and bal_amount = (monthly_income - 1200) - (plan_amount * 2))
8. Filter WITH_HAS_EXPIRED_BAL_AMOUNT by the conditions has_expired > 0 && bal_amount > 0 as EXPIRED_USERS
9. For all EXPIRED_USERS, generate plan_name, name, bal_amount, monthly_income and contact_number as POTENTIAL_USERS
10. Display POTENTIAL_USERS

(NOTE: The algorithm takes the input plan_start in seconds; hence, in order to do further calculations, convert it by multiplying by (60*60*24*30).)

b. Algorithm for the new policy product:
Inputs - Profit details and number of customers.
Steps:
1. Start
2. LOAD WHOLE_DATA from newplan.txt using PigStorage as the default loader
3. Generate the new item set GENERATED_DATA from WHOLE_DATA using the FOREACH operator
4. Calculate income as net (premium * number) of the customers
5. Generate GENERATED_DATA_WITH_EXPENSE from GENERATED_DATA using the FOREACH operator
6. Calculate each expense value as (income * expense) divided by 100
7. Calculate the total expenditure from GENERATED_DATA_WITH_EXPENSE using the FOREACH operator
8. Calculate profit as income - total_expense
9. DUMP profit
10. Stop
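Outside Pig, the same dataflow can be sketched in a few lines of Python. The sketch below renders algorithm (a) under the stated assumptions; the CSV field names (start_date, plan_start) and the single sample plan are illustrative, not taken from the paper:

```python
import csv
from itertools import product

MONTH = 60 * 60 * 24 * 30  # the algorithms treat a "month" as 30 days, in seconds

def potential_users(data_path="data.csv", plans=None):
    """Python rendering of algorithm (a): cross every customer with every plan,
    keep pairs where the old plan has expired and the new one is affordable."""
    if plans is None:  # illustrative plan details, normally read from plan.txt
        plans = [{"plan_name": "P1", "plan_amount": 900, "plan_validity": 12}]
    with open(data_path) as f:
        customers = list(csv.DictReader(f))

    results = []
    for cust, plan in product(customers, plans):        # step 6: cross product
        monthly_income = int(cust["annual_income"]) / 12           # step 5
        expiry = int(cust["start_date"]) + (plan["plan_validity"] + 1) * MONTH
        has_expired = expiry - int(cust["plan_start"]) * MONTH     # step 7
        bal_amount = (monthly_income - 1200) - plan["plan_amount"] * 2
        if has_expired > 0 and bal_amount > 0:          # step 8: filter
            results.append((plan["plan_name"], cust["name"],
                            bal_amount, monthly_income, cust["contact_number"]))
    return results                                      # step 9: POTENTIAL_USERS
```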
VII. CHALLENGES
In insurance organizations, processing large quantities of data during data mining poses several challenges. A data mining system faces many problems and pitfalls when handling customer data, such as:
- noisy data;
- high volume and high complexity of different kinds of data;
- hybridizing one or more techniques;
- corrupted values;
- missing attribute values.
One of the biggest challenges the insurance industry faces is improving customer retention while achieving higher revenue.
Apple brings Google-style machine learning to 'Photos'
Apple brought machine learning to Photos to help you find, discover and share your images in a more intuitive way than ever before. The features borrow some of the best AI features from Google Photos, like resurfacing memorable events, creating albums based on events, people and places, and using deep learning to help find images in a more intuitive way. The new algorithm uses advanced computer vision, a group of deep learning techniques that brings facial recognition to the iPhone. Now, you can find all of the most important people, places and things in your life in automatically sorted albums. It's essentially facial recognition that works on places and objects as well.
For more details visit: http://thenextweb.com/apple/2016/06/13/apple-brings-google-stylemachine-learning-to-photos/
A Survey on Data mining tools Swedha B M.Tech Computational Linguistics Government Engineering College, Sreekrishnapuram swedhathrisala93@gmail.com
Data mining means discovery of knowledge. In short, we can say that it is a process of mining interesting patterns from different data sources. The main aim of this article is to give some idea about different data mining tools. These tools are meant to minimize the effort involved in data mining processes, and they are used for different purposes such as data preprocessing, visualization, prediction, classification, clustering, feature selection and regression. Popular open source data mining tools include RapidMiner, WEKA, R, Orange, KNIME and NLTK. I am sure that after reading this article you will have a definite idea about the functionalities of these tools.

RapidMiner (formerly known as YALE) is written in the Java programming language and offers advanced analytics through template-based frameworks. Developed as a service rather than a piece of local software, this tool holds a top position on the list of data mining tools. In addition to data mining, RapidMiner also provides functionality such as data preprocessing and visualization, predictive analytics and statistical modeling, evaluation, and deployment.
WEKA was originally developed, in a non-Java version, for analyzing data from the agricultural domain. With the later Java-based version, the tool became very sophisticated and is now used in many different applications, including visualization and algorithms for data analysis and predictive modeling. Its main advantage compared to RapidMiner is that it is free under the GNU General Public License, so users can customize it however they please. WEKA supports several standard data mining tasks, including data preprocessing, clustering, classification, regression, visualization and feature selection.

R is another commonly used data mining tool, primarily written in C and Fortran, with many of its modules written in R itself. It is a free software programming language and environment for statistical computing and graphics. The R language is widely used among data miners for developing statistical software and performing data analysis, and its ease of use and extensibility have raised its popularity substantially in recent years. Besides data mining, it provides statistical and graphical techniques, including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering and others.

Orange is written in Python, a language that is simple and easy to learn yet powerful, and commonly used in text processing. Hence, if you are a Python developer looking for a tool for your work, look no further than Orange: a Python-based, powerful, open source tool for both novices and experts.

Data preprocessing has three main components, extraction, transformation and loading, and KNIME does all three. It gives you a graphical user interface to allow for the assembly of nodes for data processing. KNIME is an open source data analytics, reporting and integration platform; it also integrates various components for machine learning and data mining through its modular data-pipelining concept, and it has caught the eye of business intelligence and financial data analysis. Written in Java and based on Eclipse, KNIME is easy to extend and to add plugins to, so additional functionality can be added on the go. Plenty of data integration modules are already included in the core version.
When it comes to language processing tasks, nothing beats NLTK. NLTK provides a pool of language processing tools covering data mining, machine learning, data scraping, sentiment analysis and various other language processing tasks. All you need to do is install NLTK, pull in a package for your favourite task, and you are ready to go. Because it is written in Python, you can build applications on top of it, customizing it for small tasks.
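For instance, the sentiment analysis piece mentioned above takes only a few lines with NLTK's built-in VADER analyzer; the sample text and the printed scores below are illustrative:

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)   # lexicon used by VADER

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("The new policy is surprisingly good value."))
# e.g. {'neg': 0.0, 'neu': 0.58, 'pos': 0.42, 'compound': 0.44}
```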
Fig. 1: Information extraction architecture [2]
Figure 1 shows an information extraction architecture; the same type of functionality and components is present in NLTK. Processing a document begins with several standard procedures. First, the raw text of the document is split into sentences using a sentence segmenter, and each sentence is further subdivided into words using a tokenizer. Next, each sentence is tagged with part-of-speech tags, which prove very helpful in the following step, named entity detection, where we segment and label the entities that might participate in interesting relations with one another. Finally, in relation extraction, we search for specific patterns between pairs of entities that occur near one another in the text, and use those patterns to build tuples recording the relationships between the entities.
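The corresponding NLTK calls, as a minimal runnable sketch (the example document is my own, not from the source):

```python
import nltk

# One-time model downloads for the pipeline below.
for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(pkg, quiet=True)

document = "Google opened an office in Bangalore. Sundar Pichai visited Kerala."

for sentence in nltk.sent_tokenize(document):      # sentence segmentation
    tokens = nltk.word_tokenize(sentence)          # tokenization
    tagged = nltk.pos_tag(tokens)                  # part-of-speech tagging
    tree = nltk.ne_chunk(tagged)                   # named entity detection
    print(tree)
# Relation extraction would then search for patterns (e.g. "X ... in ... Y")
# between pairs of the entities found above.
```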
REFERENCES
[1] http://thenewstack.io/six-of-the-best-open-source-data-mining-tools/ [Accessed on: May 29, 2016]
[2] http://www.nltk.org/book/ch07.html [Accessed on: May 29, 2016]
Will Google choose Neyyappam as next Android?
If Malayali netizens (and likeminded others) have their way, the upcoming version of Google's Android Operating System could be codenamed Neyyappam! People by the dozen have been logging on to the #NameAndroidN microsite, through which Google is crowd-sourcing names for Android N, and casting their votes for the traditional Kerala snack made of rice and jaggery. Techies, individually and collectively, are having a field day promoting the campaign among friends, colleagues and family through SMS forwards and emails and on social media. It's become a matter of Malayali pride to vote for Neyyappam. Each major release version of Android is traditionally named after something sweet, and that too in alphabetical order. So, let us hope for our delicious sweet to be the name of the next Android OS. (Online: Hindu news)
The Never ending WSD Gayathri E S M.Tech Computational Linguistics Government Engineering College, Sreekrishnapuram gayathryes@gmail.com
Word sense disambiguation (WSD) is the process of determining which sense of a word is used in a given context. It is a problem that natural language processing (NLP) started digging into long ago and is still working on. Beyond NLP, it is a long-standing problem in Computational Linguistics (CL) as well, and it has a broad impact on many important NLP applications, such as Machine Translation, Information Extraction, and Information Retrieval. Major research on WSD is done on English, since it is the most used and widely accepted language. New usages of existing words emerge, which create new senses; new words are created, and some words may "die" over time. It is estimated that every year around 2,500 new words appear in English, which makes accuracy an even harder objective. Knowledge used in a practical WSD system needs to satisfy several criteria: it should be disambiguation-enabling, that is, capable of disambiguating senses; comprehensive and automatically acquirable, that is, covering a large number of words and their various usages; and dynamic and up to date, requiring constant and timely maintenance and updating of the WSD knowledge base.
The systems developed for this task fall under one of four basic approaches: supervised learning, unsupervised learning, semi-supervised learning and dictionary-based approaches. Supervised techniques include a training phase and a testing phase. In the training phase, a sense-annotated training corpus is required, from which syntactic and semantic features are extracted to build a classifier using machine learning techniques; in the testing phase, the classifier picks the best sense for a word based on its surrounding words. Unsupervised methods acquire knowledge from unannotated raw text and disambiguate senses using similarity measures. Semi-supervised methods make use of a small annotated corpus as seed data in a bootstrapping process to overcome the knowledge acquisition bottleneck faced by supervised methods. Finally, dictionary-based methods use lexical knowledge bases (LKBs) such as dictionaries and thesauri, and extract knowledge from word definitions and relations among words and senses. Some of the existing systems are the Naïve Bayes classifier, decision lists, label-propagation-based semi-supervised learning, HyperLex, the Random Walk algorithm, etc. Broad coverage and disambiguation quality are critical for WSD techniques to be adopted in real-world applications.
To develop a WSD method that overcomes the knowledge acquisition bottleneck faced by many current WSD systems remains a challenge for both the NLP and CL communities.
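The dictionary-based approach is easy to try out. The snippet below is a minimal sketch using NLTK's implementation of the simplified Lesk algorithm with WordNet as the lexical knowledge base; the example sentence is my own:

```python
import nltk
from nltk.wsd import lesk

nltk.download("wordnet", quiet=True)  # WordNet serves as the LKB here

sentence = "I went to the bank to deposit my money".split()
sense = lesk(sentence, "bank")        # gloss/context word-overlap heuristic
if sense is not None:                 # lesk returns None when nothing overlaps
    print(sense, "-", sense.definition())
```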
DRONE: UNMANNED AERIAL VEHICLE An unmanned aerial vehicle (UAV), commonly known as a drone, as an unmanned aircraft system (UAS), or by several other names, is an aircraft without a human pilot aboard. Essentially, a drone is a flying robot. The aircraft may be remotely controlled or can fly autonomously through software-controlled flight plans in their embedded systems working in conjunction with GPS. UAVs have most often been associated with the military but they are also used for search and rescue, surveillance, traffic monitoring, weather monitoring and fire fighting, among other things.
Personal drones are also becoming increasingly popular, often for drone-based photography. Other applications include drone surveillance and drone journalism, because the unmanned flying vehicles can often access locations that would be impossible for a human to get to. For more details visit: http://internetofthingsagenda.techtarget.com/definition/drone
SUBGROUP DETECTION AND SUMMARIZATION IN IDEOLOGICAL DISCUSSION Anju R C M Tech Computational Linguistics Government Engineering College, Sreekrishnapuram anju.malu50000@gmail.com
ABSTRACT: The fast growth of social networking sites has led to the emergence of communication groups, many of which discuss ideological and political topics. The participants of such a discussion split into two or more subgroups, and the members of each subgroup share the same opinion towards the target. Here the authors present an unsupervised approach for automatically detecting subgroup membership in online communities. They analyze the posts exchanged between the participants of a discussion to identify the attitude of each participant towards the topic. For this they use attitude prediction to construct an attitude vector for each discussant in the discussion thread, and they use clustering techniques to cluster these vectors and determine the subgroup membership. They compare the proposed system with baseline systems such as text clustering and interaction-graph clustering, and show that this method achieves promising results. Also, a user interested in the topic of a discussion, or having a problem similar to the one being discussed in a thread, may not want to read all the previous posts but only a few selected posts that provide a concise summary of the ongoing discussion; the approach therefore also finds an extractive summary of the discussion thread.
I. INTRODUCTION
On social networking sites such as Facebook and Twitter, discussions about ideological and political topics are quite common. When the participants of a discussion debate a disputed topic, they split into two or more subgroups, and the members of each subgroup carry the same attitude towards the topic: members with a positive attitude go into one group, and members with a negative attitude go into the opposing group.
For example, take the debate about the enforcement of a new immigration law in Arizona State in the US.
Discussant 1: Arizona immigration law is good. Illegal immigration law is bad.
Discussant 2: I totally disagree with you. Arizona immigration law is blatant racism and quite unconstitutional.
Discussant 1 is expressing a positive attitude towards the Arizona immigration law and a negative attitude regarding illegal immigration. Discussant 2 has a negative attitude towards the Arizona immigration law. So it is clear from this discussion thread that discussant 1 and discussant 2 are in different groups: discussant 1 is supporting the law and discussant 2 is against it.

Here the authors present an unsupervised approach for determining the subgroup membership of each participant in the discussion. For this they use linguistic techniques to identify attitude expressions, their polarities and their targets; a target may be another discussant or an entity mentioned in the discussion. Sentiment analysis techniques are used to identify the opinion expressions, and named entity recognition and noun phrase chunking are used to identify the entities mentioned in the discussion. Syntactic and semantic rules are used to identify the opinion-target pairs. They construct an attitude vector for each participant, which contains the attitude features; it is called a Discussant Attitude Profile. They use clustering techniques on these attitude vectors to find the subgroup structure of the discussion group and the subgroup membership of each participant.

In an online discussion thread, each individual post serves a different purpose in the discussion, and identifying the purpose of each such post is essential for creating effective summaries of the discussions. The role of an individual message in a discussion is typically specified in terms of dialog acts, and there have been efforts to automatically assign dialog acts to messages in online forum discussions and also to use dialog acts for linguistic analysis of forum data, such as in subjectivity analysis of forum threads.

II. APPROACH FOR SUBGROUP DETECTION
A. Thread Parsing

The proposed system starts by parsing the thread to identify the posts, the participants, and the reply structure of the thread. In this step, the text of each post is tokenized and split into sentences using the CLAIRlib toolkit (proposed by Abu-Jbara and Radev in 2011) [2].

B. Opinion Word Identification

The next step is to identify the opinion words in the posts, which requires determining the polarity (positive or negative) of individual words; a word that expresses an opinion is known as an opinion word. In 1974, Lehrer defined word polarity as the direction in which a word deviates from the norm. For this they use the OpinionFinder tool (proposed by Wilson et al. in 2005), which analyzes the words in a sentence and finds their polarities. Word polarity is strongly affected by the context in which the word appears; for example, "fine" is positive when used as an adjective and negative when used as a noun. OpinionFinder therefore uses a large set of features to identify the contextual polarity of a given polarized word, and it tags each word: O means neutral, POS means positive and NEG means negative.
Arizona/O Immigration/O law/O good/POS ./O Illegal/O immigration/O bad/NEG ./O
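The notation can be sketched with a tiny lexicon-based tagger; this is a toy illustration only, since OpinionFinder's actual classifier uses many contextual features rather than a bare lexicon:

```python
# Toy polarity tagger mimicking the O/POS/NEG notation above.
LEXICON = {"good": "POS", "great": "POS", "bad": "NEG", "racism": "NEG"}

def tag_polarity(sentence):
    return " ".join(f"{w}/{LEXICON.get(w.lower().strip('.'), 'O')}"
                    for w in sentence.split())

print(tag_polarity("Arizona Immigration law good . Illegal immigration bad ."))
# Arizona/O Immigration/O law/O good/POS ./O Illegal/O immigration/O bad/NEG ./O
```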
C. Target Identification

[Fig. 1: An overview of the subgroup detection system]

The aim of this step is to identify the possible targets of the opinions. A target may be another discussant or an entity mentioned in the discussion. If the target is another discussant, then either the discussant's name or a second person pronoun is mentioned explicitly in the post; for example, in "I totally disagree with you", "you" refers to the discussant of the post just above. For entities mentioned in the discussion, the proposed system uses two methods. The first uses shallow parsing techniques to identify noun groups (NGs); for shallow parsing they use the Edinburgh Language Technology Text Tokenization Toolkit (LT-TTT, proposed by Grover et al. in 2000). Only noun groups that contain two or more words are considered entities, and each distinct entity is replaced with a unique placeholder (ENTITYID). In the example above, the Arizona immigration law is mentioned by the two participants, Discussant 1 and Discussant 2, so the posts are rewritten as follows:

Discussant 1: ENTITY1 is good. Illegal immigration is bad.
Discussant 2: I totally disagree with you. ENTITY1 is blatant racism, and quite unconstitutional.

The second method is named entity recognition (NER), which is used to identify more entities; here they use the well-known Stanford Named Entity Recognizer (proposed by Finkel et al. in 2005), which recognizes three types of entities: person, location and organization. There is no restriction on the entities identified using this method. The above procedure is then repeated: each distinct entity is replaced with a placeholder, and the final set of entities is identified.

A challenge that always appears in text mining tasks at this level of granularity is that entities are often expressed by anaphoric pronouns. For example: "It doesn't matter whether you vote for Obama. He is unbeatable." Here the entity Obama appears in the first sentence, but the second sentence uses a pronoun to refer to the same entity, and "unbeatable" is the opinion word in the second sentence. To get the opinion-target pairing right, anaphoric pronouns have to be resolved; for this they use the system proposed by Jakob and Gurevych in 2010, which improves opinion target extraction. After applying the system, the result is: "It doesn't matter whether you vote for Obama. Obama is unbeatable." Now both sentences mention the same entity, and the NER system will identify them as one entity.
D. Opinion-Target Pairing

In the sections above, opinion words and targets were identified separately; this step determines which opinion word is targeting which target. A rule-based approach is used for opinion-target pairing. The rules are based on the dependency relations that connect the words in a sentence, so the Stanford Parser (proposed by Klein and Manning in 2003) is used to generate the dependency parse tree of each sentence in the thread. An opinion word and a target form a pair if they satisfy at least one of the dependency rules; Table 1 contains some of these rules. The rules basically examine the types of the dependencies on the shortest path that connects the opinion word and the target in the dependency tree. If a sentence S in a post written by participant Pi contains an opinion word OPj and a target TRk, and if the opinion-target pair satisfies one of the dependency rules, we say that Pi expresses an attitude towards TRk. The polarity of the attitude is determined by the polarity of OPj: we write Pi + TRk if OPj is positive and Pi - TRk if OPj is negative. It is likely that the same participant Pi expresses sentiment toward the same target TRk multiple times in different sentences in different posts, so we keep track of the counts of all the instances of positive/negative attitude that Pi expresses toward TRk; we write this as Pi (+m, -n) TRk, where m (n) is the number of times Pi expressed a positive (negative) attitude toward TRk.

E. Discussant Attitude Profile

This section describes the representation of discussant attitudes towards the identified targets in the discussion thread. As mentioned earlier, a target may be a discussant or an entity mentioned in the discussion. In this representation, a discussant's attitude profile is a vector of numerical values, the counts of the positive and negative attitudes the participant expresses towards each target; it is called a Discussant Attitude Profile (DAP), and one is constructed for every participant in the discussion. The dimension of the DAP is n = (d + e) * 3, where d is the number of discussants and e is the number of entities: for each target, the DAP holds three values, namely the number of times the discussant expressed a positive attitude towards the target, the number of times the discussant expressed a negative attitude towards the target, and the number of times the discussant interacted with or mentioned the target. It has to be noted that these values are not symmetric, since the discussion explicitly denotes the source and the target of each post.

F. Clustering

The previous section produced an attitude profile (or vector) for every discussant; this section analyzes the attitude profiles to determine the subgroup membership of each discussant. The key observation is that the attitude profiles of discussants who share the same opinion are more likely to be similar to each other than to the attitude profiles of discussants holding the opposing opinion, so clustering the profiles splits the discussants into subgroups according to their opinions.
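A minimal sketch of the DAP construction and clustering in Python follows; the attitude counts and the choice of k-means with k=2 are illustrative assumptions, not the paper's fixed clustering method:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical counts of (positive, negative, interaction) per target,
# for each discussant. Targets cover both discussants and ENTITY1.
attitudes = {
    "Discussant1": {"Discussant2": (0, 1, 2), "ENTITY1": (2, 0, 2)},
    "Discussant2": {"Discussant1": (0, 1, 2), "ENTITY1": (0, 2, 2)},
}
targets = ["Discussant1", "Discussant2", "ENTITY1"]

def dap_vector(name):
    """Flatten the three counts per target into one profile of length 3*(d+e)."""
    vec = []
    for t in targets:
        vec.extend(attitudes[name].get(t, (0, 0, 0)))
    return vec

profiles = np.array([dap_vector(n) for n in attitudes])
labels = KMeans(n_clusters=2, n_init=10).fit_predict(profiles)
print(dict(zip(attitudes, labels)))   # subgroup id per discussant
```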
III. APPROACH FOR THREAD SUMMARISATION
Text summarization techniques can be classified into two categories, namely extractive summarization and abstractive summarization. Extractive summarization involves extracting salient units of text (e.g., sentences) from the document and then concatenating them to form a shorter version of the document. Abstractive summarization, on the other hand, involves generating new sentences by utilizing the information extracted from the document corpus, and often involves advanced natural language processing tools such as parsers, lexicons and grammars, and domain-specific knowledge bases. Owing to their simplicity and good performance, extractive summarization techniques are often the preferred tools of choice for various summarization tasks.

A. Summarization Unit

Before we can perform extractive summarization on discussion threads, we need to define an appropriate text unit that will be used to construct the desired summaries. For typical summarization tasks, a sentence is usually treated as the unit of text, and summaries are constructed by extracting the most relevant sentences from a document. However, a typical discussion thread is different from a generic document in that the text of a discussion thread is created by multiple authors (the users participating in the thread). Further, the text of a discussion can be divided into individual user messages, each message serving a specific role in the whole discussion. In that sense, summarizing a discussion thread is similar to the task of multi-document summarization, where the content of multiple topically related documents is summarized simultaneously to construct an inclusive, coherent summary.

[Table 1: Examples of dependency rules used for opinion-target pairing [1]; the table itself is not reproduced here.]

B. Framing Thread Summarization as Post Classification

We consider the problem of extracting relevant posts from a discussion thread as a binary classification problem, where the task is to classify a given post as either belonging to the summary or not. We perform classification in a supervised fashion by employing the following features (a sketch of their computation follows the list):

Similarity with Title (TitleSim): the cosine similarity score between the post and the title of the thread.

Length of Post (Length): the number of unique words in the post.

Post Position (Position): the normalized position of the post in the discussion thread, defined as the position of the post in the thread divided by the total number of posts in the thread.

Centroid Similarity (Centroid): the cosine similarity score between the post's document vector and the vector obtained as the centroid of all the post vectors of the thread. Similarity with the centroid measures the relatedness of each post to the underlying discussion topic; a post with a higher similarity score to the thread centroid vector better represents the basic ideas of the thread.

Inter Post Similarity: the mean of the post's cosine similarity scores with all the other posts in the thread.

Dialog Act Label (Class): a set of binary features indicating the dialog act class label of the post, with one binary feature corresponding to each dialog act.
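A minimal sketch of the non-dialog-act features, using scikit-learn's TF-IDF vectors for the cosine similarities (an assumption; the paper does not fix the vector representation here):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def post_features(title, posts):
    """Per-post features: TitleSim, Length, Position, Centroid, InterPost."""
    vec = TfidfVectorizer()
    X = vec.fit_transform([title] + posts)       # row 0 is the thread title
    title_v, post_vs = X[0], X[1:]
    centroid = np.asarray(post_vs.mean(axis=0))  # centroid of all post vectors
    sims = cosine_similarity(post_vs)            # post-to-post similarities
    n = len(posts)
    feats = []
    for i, post in enumerate(posts):
        feats.append({
            "TitleSim": float(cosine_similarity(post_vs[i], title_v)[0, 0]),
            "Length": len(set(post.split())),
            "Position": (i + 1) / n,
            "Centroid": float(cosine_similarity(post_vs[i], centroid)[0, 0]),
            # mean similarity with the other posts (drop the self-similarity of 1.0)
            "InterPost": float((sims[i].sum() - 1.0) / (n - 1)) if n > 1 else 0.0,
        })
    return feats
```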
IV. CONCLUSION

This article presented an approach for subgroup detection in ideological discussions. Linguistic techniques are used to identify the attitudes the participants of an online discussion carry towards each other and towards the aspects of the discussion topic. Attitude prediction and interaction frequency are used to construct an attitude vector for each participant, and these vectors are then clustered to form the subgroups. The proposed system outperforms both text clustering and graph clustering, and the contribution of each component of the system to the overall performance was also examined.

The authors framed discussion thread summarization as a binary classification problem and tested the hypothesis on two different datasets. They found that, for both datasets, incorporating dialog act information as features improves classification performance as measured in terms of precision and F-1 measure.

REFERENCES
[1] Amjad Abu-Jbara, Pradeep Dasigi, Mona Diab, and Dragomir Radev. 2012. "Subgroup Detection in Ideological Discussions". In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 399-409.
[2] Amjad Abu-Jbara and Dragomir Radev. 2011. "Clairlib: A Toolkit for Natural Language Processing, Information Retrieval, and Network Analysis". In Proceedings of the ACL-HLT 2011 System Demonstrations, pages 121-126.
[3] Ahmed Hassan and Dragomir Radev. 2010. "Identifying Text Polarity Using Random Walks". In ACL 2010, pages 1-9.
[4] Sumit Bhatia, Prakhar Biyani, and Prasenjit Mitra. 2012. "Summarizing Online Forum Discussions - Can Dialog Acts of Individual Messages Help?". In Proceedings of the 15th International Workshop on the Web and Databases 2012 (WebDB 2012), Scottsdale, AZ, USA, May 20, 2012, pages 13-18.
[5] Sumit Bhatia, Prakhar Biyani, and Prasenjit Mitra. 2012. "Classifying User Messages for Managing Web Forum Data". In Proceedings of the 15th International Workshop on the Web and Databases 2012 (WebDB 2012), Scottsdale, AZ, USA, May 20, 2012, pages 13-18.
M.Tech Computational Linguistics Dept. of Computer Science and Engg, Govt. Engg. College, Sreekrishnapuram Palakkad www.simplegroups.in simplequest.in@gmail.com
SIMPLE Groups Students Innovations in Morphology Phonology and Language Engineering
Article Invitation for CLEAR- September-2016 We are inviting thought-provoking articles, interesting dialogues and healthy debates on multifaceted aspects of Computational Linguistics, for the forthcoming issue of CLEAR (Computational Linguistics in Engineering And Research) Journal, publishing on September 2016. The suggested areas of discussion are:
The articles may be sent to the Editor on or before 10th September, 2016 through the email simplequest.in@gmail.com. For more details visit: www.simplegroups.in

Editor, CLEAR Journal
Representative, SIMPLE Groups
Hello world,
Nowadays, information is available in many different forms, and there is a huge amount of it. Work is ongoing on its extraction, and different techniques, the so-called data mining technologies, are being developed for analysing the human behaviour patterns we need. One of the notable applications of data mining is financial analysis, which includes insurance analysis, and many data mining tools exist to make the data mining activity feasible and efficient. Speech processing is another area that has drawn wide interest among researchers, and applying it through Android applications is a current trend. Word sense disambiguation and opinion mining, as we know, both have their own importance in Natural Language Processing. This issue of CLEAR focuses on the research done in the fields of data mining, speech processing and Natural Language Processing. The articles have been written in the hope of giving an insight into this fast emerging field. CLEAR always welcomes ideas that are refreshing yet purposive and is thankful to everyone who helps realise this target. The SIMPLE group welcomes more aspirants in this area. Wish you all the best!!!
Swedha
CLEAR JUNE 2016
a SIMPLE initiative
Department of Computer Science & Engineering Government Engineering College, Sreekrishnapuram Palakkad, Kerala