COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE

Page 1

Journal for Research | Volume 03 | Issue 10 | December 2017 ISSN: 2395-7549

Comprehensive Analysis of Natural Language Processing Technique Utkarsha Singh Research Scholar Department of Computer Science and Engineering Thapar Institute of Engineering and Technology (Deemed to be University) Patiala, India

Harkiran Kaur Faculty Department of Computer Science and Engineering Thapar Institute of Engineering and Technology (Deemed to be University) Patiala, India

Abstract Natural Language Processing (NLP) techniques are one of the most used techniques in the field of computer applications. It has become one of the vast and advanced techniques. Language is the means of communication or interaction among humans and in present scenario when everything is dependent on machine or everything is computerized, communication between computer and human has become a necessity. To fulfill this necessity NLP has been emerged as the means of interaction which narrows the gap between machines (computers) and humans. It was evolved from the study of linguistics which was passed through the Turing test to check the similarity between data but it was limited to small set of data. Later on various algorithms were developed along with the concept of AI (Artificial Intelligence) for the successful execution of NLP. In this paper, the main emphasis is on the different techniques of NLP which have been developed till now, their applications and the comparison of all those techniques on different parameters. Keywords: Disclosure Integration, Natural Language Processing, Parts Of Speech, Turing Test, Semantic Analysis, Syntactic Analysis _______________________________________________________________________________________________________ I.

INTRODUCTION

NLP is an area of application which focuses on how computers can be utilized to understand natural language texts or speech to perform practical tasks. It takes some input either in the form of text, spoken language or keyboard input and performs task (translation of natural language to another language) in order to grasp easily and represent the context of text which is further used for various applications. NLP involves following basic steps for its working: 1) Lexical Analysis: Analysis of structure of text (input) and dividing into paragraphs, s entences and words. 2) Syntactic Analysis: Analysis of sentences and arranging words of sentences in such a manner so that it shows relation between words. 3) Semantic Analysis: Extraction of meaningful text from sentences (output of Syntactic Analysis). 4) Disclosure Integration: Extraction of meaningful sentence in such a way that each and every preceding sentence has meaning which is connected to next sentence. 5) Pragmatic Analysis: Derives those aspects of language which requires practical knowledge. All the above mentioned steps are the base of NLP and they are applied in different fields to achieve useful applications. In this paper, different works and applications based on NLP is represented. II. LITERATURE REVIEW This literature review is based on the different applications of NLP and its different techniques. The literature review helps in understanding the NLP and also helps to know to what extent work has been done in the field and what would be the future aspect. Agung Fatwanto in [1] worked on designing of SRS using the concept of NLP as designing of SRS using NLP is common, however, his research work was on the dynamic feature of software system. His proposed method offers a conversion of dynamic requirements specification (written in scenario like format) to formal specification (object oriented model). Methods involve in his approach are: translation using AI theory and translation based on grammatical analysis. Requirements specified in natural language will be parsed using Reed -Kellogg sentence diagramming system [1] which follows following schemata:

All rights reserved by www.journal4research.org

11


Comprehensive Analysis of Natural Language Processing Technique (J4R/ Volume 03 / Issue 10 / 003)

Sentence-Subject + Predicate Sentence-Subject | Verb | Direct Object; or Sentence-Subject | Copula | Predicate Requirement-Subject + Verb + Target + [Way] Fig. 1. Schemata for parsing requirements in natural language [1]

This schemata helps in incorporating dynamic aspect of software system. M. P. di Buono, P. Ronzino in [2] worked in the field of Archaeology to help in decision making and enhance the data retrieval process related to problem solving procedure s. Two steps were developed: creation of ontology through input – archaeological texts (booklets, appendices, excavation diaries etc.) which helps in determining archaeologists decisions and interpretations and to describe archaeological processes, which is the conclusion process of an archaeologist in the primary assessment, the decisions made during the excavation itself, the techniques and approaches involved, and second step is the ratification of the concluding decisions. Their framework was based on a defined language which is robust, which creates grammars and lexicons. Their architecture ensures better process for decision making by processing highly complicated sentences. Ravi Prakash Verma, Md. Rizwan Beg in [3] proposed a way on the concept of knowledge representation graph of parsed tree which is generated through parser which helps in creating test cases. The proposed method involves “Preprocessing” (removing unwanted words from the input text, “POS Tagging” ( assignment of parts of speech to each word to retrieve useful information), “Parsing” (determines syntactic tree structure, “Knowledge Representation” (representation of data written in natural language in the form of graph) and “Merging Graphs” ( all the graphs are combined which are generated through each parsed tree into one graph which represents knowledge of requirements in detail which are expressed in natural languages). Test cases are automatically generated through boundary value analysis. Takayuki Suzuki, Kiminori Gemba et al. in [4] analysed communication among customers regarding their experiences related to different hotels using NLP technique. An automated system is used to collect data from various user review sites and on the basis of that data co-occurrence words are identified using NLP technique. By co-occurrence words pattern is observed in terms of most frequently used words which helps hotels in analysing and offering better services to their customers. Veronika Vincze, Rich´ard Farkas in [5] research work was on “Named-Entity Recognition”, “De-identification” in NLP for medical as well as social media. They proposed the use of Named-Entity Recognition by using CRF (Conditional Random Field) approach which is closely related to de-identification and anonymization as most of the categories to be de-identified are some kind of named entities. Rodica Potolea, Mihaela Dinsoreanu in [6] proposed an architecture for spam detection which includes following components: 1) Feature Extraction Module: Features are extracted from the comment in form of links, whitespaces, and punctuation marks to differentiate the spam comment from the legitimate comment. 2) Topic Extraction Module: It extracts contents of post and comment to determine if there is similarity or common topics between post and comment. 3) Post-Comment Similarity Module: This module checks the similarity between the post and comment as spam comments mostly contain irrelevant data which has no connection with the post. Once all the above modules are executed, Decision Tree Classifier is implemented to provide final results. Abinash Tripathy, Santanu Ku. Rath in [7] developed a methodology for object oriented systems to identify class names and their relationships from SRS automatically. For analysis of SRS, different tables are constructed on the basis of adjectives, prepositions, adverbs, articles, pronouns and so on. The words of SRS are tokenized and are searched in tables. If word is founded in table then it is identified as output, otherwise, it is considered as noun and are stored in NON array including the words found in verb table. This NON array containing verbs and nouns helps in creating class diagrams (class-association-class). K. Kamalabalan, T. Uruththirakodeeswaran et al. in [8] presented a sample tool for traceability of software artefacts (SRS, test cases, designs) which are discovered in SDLC. Their tool is named as SATAnalyzer(Software Artefact Traceability Analyzer). File Server is used to store artefacts that are discovered during SDLC and are converted into XML files. Standard NLP tool is used to extract information from requirement specifications. Classes and their sub classes are derived from XML files which helps in generating traceability links. Dean J. Jones, Gunjan Mansingh in [9] presented work in progress on an algorithm that accounts for the similarity or oppositeness of given words. They have made the use of WordNet which is a publically available lexical database which consists of lakhs of nouns, verbs, adjectives and adverbs which are grouped into some synsets. Word Comparison is done through Comparison Metric. Yuki Sawa, Ram Bhakta et al. in [10] proposed an approach to detect suspicious comments on social sites using NLP. Social Engineering Attack is very common these days. Parse tree is generated by parsing sentence using context free grammar of English and it is analyzed to find whether the sentence is carrying question (request private questions) or command (to do unauthorized actions). If question or command is detected, then by Topic Extraction analysis is done to fine potential verb/noun topic pairs which describes topic of question/command. Then it is compared to topic backlist (prepared on the basis of violation of security protocols) to detect whether question or command is suspicious or not.

All rights reserved by www.journal4research.org

12


Comprehensive Analysis of Natural Language Processing Technique (J4R/ Volume 03 / Issue 10 / 003)

1) Staircase Pattern Learning: To find brand and product named varied in main content. 2) Staircase Pattern Matching: Comparison is done to find match. The proposed algorithm can be used for location based m commerce. Once the customer connects to wireless access point in shopping mall, algorithm recognizes customer’s interest by extracting product or band name from log files which helps in increasing sales of products. Sweta P. Lende, Dr.M.M. Raghuwanshi in [11] concluded that closed domain QA (Question Answering) system gives accurate answer in comparison to open domain QA system. Proposed method includes following steps: 1) Collection of education acts data sets from different websites. 2) Creation of Corpus(Datasets) 3) Preprocessing through Stop Words Removal and Stemming 4) Storing of extracted keywords into index term dictionary 5) Preprocessing of input query through POS tagging 6) Document is retrieved by doing comparison between indexed dictionary and extracted keywords from preprocessing of query 7) Ranking of keyword is done by finding out score between query keywords and document retrieval 8) Application of POS tagging on all keywords ranking for Answer Extraction 9) Answer is selected from Answer Extraction for which user is looking for. The proposed system gives more accurate results. Akkamahadevi R Hanni, Mayur M Patil et al. in [12] developed an application which helps in providing efficient summary of product reviews which the buyers have given for that product and also helps other buyers to decide whether to go for that product or not by reading those reviews. Feature Extraction and opinion mining extraction are the main modules which are used to give efficient reviews. Classifiers classifies reviews into positive and negative which helps customers in their decision making. Yilin Yang, Xinhai Huang et al. in [13] studied about the test case prioritization from different projects which results in effective software testing. Keywords are extracted from training test cases through NLP techniques and keyword dictionary is constructed. For test case prioritization three strategies are used: Risk Strategy, Diversity Strategy and Hybrid Strategy. On the basis of these strategies, test cases are prioritized so that defects can be detected in earlier stages. Shivaprasad T K, Jyothi Shetty in [14] presented the taxonomy in Fig. 1 [14] of various Sentiment Analysis methods using NLP to understand various reviews of product.

Fig. 2: Sentiment Analysis Process Model [14]

Reviews are taken as data set which are analyzed and classified into sentiments to provide result. Liu Shufengi, Shen Shaohong et al. in [15] developed a method to recognise Chinese characters in static background. NLP technique is used for feature extraction to get Chinese characters area and then, K means clustering algorithm is used to differentiate between character and background area. After identification of Chinese character area, two parameters are generated namely stroke across and energy density which identify Chinese characters in a very accurate manner. Georgi M. Kanchev, Pradeep K. Murukannaiah et al. in [16] proposed a tool based assistance named as CANARY to extract requirements from online forums to enhance interactions between stakeholders. The proposed tool extracts raw data from online forums, apply annotations to those raw data and runs query which results in social data that helps in interaction between stakeholders. Glen Coppersmith, Casey Hilland et al. in [17] introduced some recent advancements in the field of mental health by using NLP and machine learning. Whitespace information helps in analysing mental illness of a person. Linguistic signals are extracted from whitespaces which are processed through NLP technique and are compare with the already stored signals which helps in proper treatment and diagnosis of patient.

All rights reserved by www.journal4research.org

13


Comprehensive Analysis of Natural Language Processing Technique (J4R/ Volume 03 / Issue 10 / 003)

III. COMPARATIVE ANALYSIS OF NLP TECHNIQUES A comparative study is done after going through distinctive research papers of NLP’s applications and its techniques. A correlation table is drawn below on the basis of different parameters such as accuracy, performance, decision making, application and approaches/techniques. Table - 1 NLP Techniques with Their Influencing Factors

Parameters Grammatical Analysis

Approaches/Techniques Representation Feature Extraction Graph Spam Detection

Application

Translation of SRS written in natural language in formal language covering dynamic aspect of software system

Creation of Test Cases

Accuracy

More

More

More

Performance

High

Medium

High

Proposed Method/Algorithm/Schemata to carry out the techniques of NLP or Implementation

Reed Kellog Sentence Diagramming System

Decision Making

Yes

Parser/Parsed Tree POS Tagging Boundary Value Analysis No

Sentiment Analysis

Detection of characters from background Supports interaction between different stakeholders by extracting requirements from online forums

Product Reviews

Less in comparison to others Medium

K means Clustering Algorithm CANARY tool Yes

Yes

IV. CONCLUSION This paper is a comparative analysis of different NLP techniques which provides an overview of NLP’s proposed techniques and methodologies. On the basis of different parameters efficiency of different proposed methodologies are defined. This analysis helps in understanding which technique is better for what kind of application and what future work can be done or the scope in the field of NLP. It concludes that NLP has wide range of applications in the field of medical, security, social media, software development, robotics and so on and can be used with other techniques to provide a platform where interaction between human and computer becomes easy. REFERENCES [1]

A. Fatwanto, "Software requirements specification analysis using natural language processing technique," 2013 International Conference on QiR, pp. 105110, 2013. [2] M. P. Di Buono, M. Monteleone, P. Ronzino, V. Vassallo and S. Hermon, "Decision making support systems for the Archaeological domain: A Natural Language Processing proposal," Proceedings of the DigitalHeritage 2013 - Federating the 19th Int'l VSMM, 10th Eurographics GCH, and 2nd UNESCO Memory of the World Conferences, Plus Special Sessions fromCAA, Arqueologica 2.0 et al., vol. 2, pp. 397-400, 2013. [3] R. P. Verma and B. Md. Rizwan, "Generation of Test Cases from Software Requirements Using Natural Language Processing," 2013 6th International Conference on Emerging Trends in Engineering and Technology, pp. 140-147, 2013. [4] T. Suzuki, K. Gemba and A. Aoyama, "Hotel Classification Visualization Using Natural Language Processing of User Reviews," pp. 892-895, 2013. [5] V. Vincze and R. Farkas, "De-identification in Natural Language Processing," no. May, pp. 1300-1303, 2014. [6] C. Rădulescu, R. Potolea and M. Dinsoreanu, "Identification of Spam Comments using Natural Language Processing Techniques," pp. 29-53, 2014. [7] A. Tripathy and S. K. Rath, "Application of Natural Language Processing in Object Oriented Software Development," 2014 International Conference on Recent Trends in Information Technology, pp. 1-7, 2014. [8] K. Kamalabalan, T. Uruththirakodeeswaran, G. Thiyagalingam, D. B. Wijesinghe, I. Perera, D. Meedeniya and D. Balasubramaniam, "Tool Support for Traceability of Software Artefacts," MERCon 2015 - Moratuwa Engineering Research Conference, pp. 318-323, 2015. [9] D. J. Jones and G. Mansingh, "Not Just Dissimilar, but Opposite An Algorithm for Measuring Similarity and Oppositeness between Words," 2016. [10] Y. Sawa, R. Bhakta, I. G. Harris and C. Hadnagy, "Detection of Social Engineering Attacks Through Natural Language Processing of Conversations," 2016 IEEE Tenth International Conference on Semantic Computing (ICSC), pp. 262-265, 2016.

All rights reserved by www.journal4research.org

14


Comprehensive Analysis of Natural Language Processing Technique (J4R/ Volume 03 / Issue 10 / 003) [11] S. P. Lende and M. M. Raghuwanshi, "Question answering system on education acts using NLP techniques," IEEE WCTFTR 2016 - Proceedings of 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare, 2016. [12] A. R. Hanni, M. M. Patil and P. M. Patil, "Summarization of customer reviews for a product on a website using natural language processing," 2016 International Conference on Advances in Computing, Communications and Informatics, ICACCI 2016, pp. 2281-2285, 2016. [13] Y. Yang, X. Huang, X. Hao, Z. Liu and Z. Chen, "An Industrial Study of Natural Language Processing Based Test Case Prioritization," 2017 IEEE International Conference on Software Testing, Verification and Validation (ICST), pp. 548-549, 2017. [14] S. T.K. and J. Shetty, "Sentiment Analysis of Product Reviews:A Review," 2017. [15] L. Shufeng, S. Shaohong and S. Zhiyuan, "Research on Chinese Characters Recognition in Complex Background Images," pp. 214-217, 2017. [16] G. M. Kanchev, P. K. Murukannaiah, A. K. Chopra and P. Sawyer, "Canary: An Interactive and Query-Based Approach to Extract Requirements from Online Forums," pp. 31-32, 2017. [17] G. Coppersmith, C. Hilland, O. Frieder and R. Leary, "Scalable mental health analysis in the clinical whitespace via natural language processing," 2017 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI), pp. 393-396, 2017.

All rights reserved by www.journal4research.org

15


Turn static files into dynamic content formats.

Create a flipbook
Issuu converts static files into: digital portfolios, online yearbooks, online catalogs, digital photo albums and more. Sign up and create your flipbook.