CLEAR March 2015
CLEAR Journal (Computational Linguistics in Engineering And Research)
M. Tech Computational Linguistics, Dept. of Computer Science and Engineering,
Govt. Engineering College, Sreekrishnapuram, Palakkad-678633
www.simplegroups.in | simplequest.in@gmail.com

Chief Editor
Dr. P. C. Reghu Raj, Professor and Head, Dept. of Computer Science and Engineering, Govt. Engineering College, Sreekrishnapuram, Palakkad-678633

Editors
Dr. Ajeesh Ramanujan, Raseek C, Nisha M, Anagha M

Cover page and Layout
Sarath K S, Manu V. Nair

Contents
Editorial
News & Updates
A Survey on Syntax-Based Translation with Bilingually Lexicalized Grammars (Anisree P G, Radhika K T)
Interactive News Reading Android App for Malayalam (Sooraj R, Jose Stephen, Bhadran V K, Anjali M, Arun Gopi)
A Survey on the Use of Natural Language Processing Techniques in Language Learning (Rugma R, Radhika K T)
Study of Different Methods for Training and Enhancing Dysarthric Speech (Divya Das, Jose Stephen, Bhadran V K)
Scalable Systems for Big Data Analytics in the Context of Sentimental Analysis on Social Networks (Thabshira K Shamsudheen, Viju P Poonthottam)
Computational Prediction of Transcription Factor Binding Site and Affinity based on DNA Features (Sheeba K, Achuthsankar S Nair)
CLEAR Dec 2014 Invitation
Last word
Editorial

Dear Readers!

Greetings! This edition of CLEAR is a special issue consisting mainly of selected papers presented during the National Conference on Computational Linguistics and Information Retrieval (NC-CLAIR 2014) held during 29-31 Dec 2014. The papers cover topics ranging from the use of Natural Language Processing techniques in language learning to training and enhancing dysarthric speech and the computational prediction of binding sites from DNA features. The good response to our call for papers reflects the wider acceptability and growing awareness of language processing among the technical community, which is definitely gratifying for the entire CLEAR team. On this positive note, I proudly present this special issue of CLEAR to the readers, for critical evaluation and feedback.

Best Regards,
P.C. Reghu Raj
(Chief Editor)
National Conference on Computational Linguistics and Information Retrieval (NC-CLAIR 2014)

The National Conference on Computational Linguistics and Information Retrieval (NC-CLAIR 2014) was successfully organized during 29-31 December 2014. It is the first national-level conference organized by the Department of Computer Science and Engineering, Govt. Engineering College, Sreekrishnapuram, and incidentally this happens to be the first conference at any level conducted by the college. The conference brought together researchers working in the areas of Indian language computing, Machine Translation, Speech Processing, Information Retrieval, Big Data Analysis, Machine Learning and other related fields. The conference was sponsored by TEQIP-II. There was active participation from both academia and industry.

The event began with tutorial sessions on 29 December, Monday. The first tutorial was on the topic "Data Science: The Industrial Perspective" by Mr. Samaj George, Innovation Consultant, EY - Global Talent Hub, Trivandrum. The tutorial was a gentle preview of potential industrial applications of Big Data analysis, drawn from his 14 years of experience in the field. The next was a tutorial-cum-workshop on "Big Data and Hadoop" by Mr. Robert Jesuraj, Data Analyst (NLP Engineer) at EY. He demonstrated the internal working of Hadoop on Big Data. The final presentation was by Dr. Govind D, Assistant Professor (Center for Computational Engineering), Amrita Vishwa Vidyapeetham, Coimbatore. The topic was "Issues in Expressive Speech Analysis and Synthesis", which reflected his work experience as the project lead in the UK-India Education Research Initiative project (2007-2011) entitled "Study of source features for speech synthesis and speaker recognition", between IIT Guwahati and the University of Edinburgh. Dr. Govind is currently investigating the ongoing DST-sponsored project titled "Analysis, Processing and Synthesis of Emotions in Speech" at Amrita Vishwa Vidyapeetham, Coimbatore. He has more than 20 research publications in reputed conferences and journals.

On the second day, i.e., 30 Dec, Tuesday, we had the inaugural session in which the keynote address was delivered by Dr. Rajasree M S, Director & Professor, Indian Institute of Information Technology and Management - Kerala (IIITM-K), Technopark, Trivandrum. She shared her experience in Computer Science research. Dr. Rajasree has more than 23 years of rich service in government in several capacities including teaching, research and administration. She is currently serving as a member of several committees constituted by the government as a technical expert. Two parallel sessions of paper presentation followed. There was also a tutorial on "Reinforcement Learning via Least Squares Policy Iteration (LSPI)" by Dr. Mohammed Shahid Abdulla, Assistant Professor, Information Technology and Systems, IIM Kozhikode. His academic positions include Assistant Professor (Avionics), Indian Institute of Space Science and Technology, Trivandrum, India, and Researcher, General Motors R&D, Bangalore, India. With his presentation skills he conveyed the complex algorithm in a simple, understandable way.
On the final day, i.e., 31 Dec, Wednesday, the tutorial session was on the topic "Data Mining Algorithms in Bioinformatics, with special emphasis on Clustering Biological Data" by Dr. K A Abdul Nazeer, Head, Dept. of CSE, NIT Calicut. This was followed by two parallel sessions of paper presentations. Dr. Rajasree, Dr. P. C. Reghu Raj and Dr. Ajeesh Ramanujan were the session chairs. There were papers from industry, academia, and UG and PG students. In the valedictory session, the Best Paper award was given to the paper "CRF-based Unknown Word Tag Prediction for Malayalam" by Amal Babu, Alen Jacob, and C. Naseer. The chairs conveyed their remarks on the papers presented, and announced the decision to publish selected papers from the conference in the coming issue of the CLEAR Journal published by the CSE Department.
Publications
TnT-CRF based Approach for POS Tagging, Amal Babu, Manu V. Nair, Sarath K S, Naseer C, published in NCAIET-2K15, SVNCE Mavelikara (won the Best Paper award in the Department of Computer Science and Engineering).
Semantic Role Labeling, Sreerekha T V, Vidya P V, P C Reghu Raj, published in NCAIET-2K15, SVNCE Mavelikara.
Classifying News Articles Headline in Malayalam using Machine Learning Approach, Rajitha K, Sruthi Sankar K P, Naseer C published in NCAIET-2K15, SVNCE Mavelikara.
Malayalam Morphological Analyser using MBLP Approach, Nisha M, Reji Rahmath K, Rekha Raj C T, P. C Reghu Raj published in IEEE ICSNS, SNS College of Technology, Coimbatore.
A Survey on Syntax-Based Translation With Bilingually Lexicalized Grammars Anisree P G Dept of Computer Science and Engineering, MEA Engineering College, Vengoor, Perinthalmanna, Malappuram, India-679325 anisreepg7@gmail.com
Radhika K T, Assistant Professor, Dept. of Computer Science and Engineering, MEA Engineering College, Vengoor, Perinthalmanna, Malappuram, India-679325 radhikaanil111@gmail.com
ABSTRACT: Syntax-based translation models encourage progress in the improvement of translation quality. These models significantly improve translation performance due to their grammatical modelling on one or both language sides. The STSG (synchronous tree substitution grammar) based syntax translation model with a bilingually lexicalized STSG, inspired by the PCFG (probabilistic context-free grammar), is discussed here. In string-to-tree translation models, an STSG rule consists of two right-hand sides, known as the source side and the target side; the target side follows a TSG rule, while the source side follows a CFG rule. Translation rules such as non-lexicalized rules and mixed rules do not consider any lexicalized information on the source or target side. The lexicalized STSG can provide superior rule selection in decoding and substantially improve the translation quality.
Keywords: Bilingually lexicalized synchronous tree substitution grammars, statistical machine translation, syntax-based Natural Language Processing (NLP).
I. INTRODUCTION
Traditionally, machine translation performs simple substitution of words in one natural language with words in another, but that alone usually cannot produce a good translation of a text, because recognition of whole phrases and their closest counterparts in the target language is needed. Hence a phrase-based approach to machine translation arrived, but it also cannot produce a very high level of translation quality. A syntax-based approach to machine translation was then tested and it produces very good translation results. Here we deal with syntax-based translation of sentences in a lexicalized tree substitution manner. The string in the source language is converted into a tree-like structure depending on the lexical substitutions, and the tree is then converted into the target language. Since it is bilingual, i.e., uses two languages (especially with equal or nearly equal fluency), the authors proposed to upgrade the synchronous tree substitution grammar (STSG), inspired by the Probabilistic Context-Free Grammars (PCFG) used in monolingual parsing. Depending on the input, which is a string or a parse tree, the models are categorized into tree-based models and string-based models. The tree-based model can be further divided into a tree-to-tree model and a tree-to-string model, and the string-based model is classified into a string-to-tree and a string-to-string model depending on the output. In the STSG-based string-to-tree translation model, both source- and target-side lexicalized information chooses the appropriate translation rules during decoding. To evaluate the bilingually lexicalized STSG more thoroughly, a generative and a discriminative model are used.
Fig. 1. Comparison between monolingual parsing and string-to-tree translation

Fig. 1 shows an example in which an English sentence is parsed into a tree structure and a Chinese sentence (both Chinese characters and Chinese Pinyin are provided) is converted into an English tree. It provides a comparison between monolingual parsing and string-to-tree translation. Both methods convert a string into a tree structure. The difference is that monolingual parsing applies PCFG rules to the conversion of an English string into an English parse tree, whereas the string-to-tree translation model parses the Chinese string using the source side of the STSG rules and synchronously generates an English tree using the target side of the STSG rules. Similar to the monolingual PCFG parsing models, the STSG model also has the problem of lacking lexicalization. Here, rule r11 specifies that if the substring x0 preceding the Chinese word jibi (kill) can be translated into an English prepositional phrase PP: x0, then it can be used to translate the Chinese string x0 jibi (kill) into the English verb phrase "were killed". This is correct only when the preposition indicates the passive voice. For instance, in the string jingfang (police) yu (in) lingchen (morning) jibi (kill) qiangshou (gunmen), the Chinese substring preceding jibi can be translated into an English prepositional phrase PP: x0, but the Chinese word yu (in) indicates the active voice. Without considering lexicalization, the following poor translation may result: "the police were killed in the morning gunmen".

II. NOISY MODEL
Charniak [2] first attempted to improve the grammaticality of the output of a string-to-tree system using a lexicalized monolingual PCFG parsing model. The resulting system was used to translate 347 sentences from Chinese to English. The translations were sorted into four groups: good/bad syntax crossed with good/bad meaning. The syntax-based system had 45 percent more translations that also had good syntax. Recently there has been considerable interest in MT systems based not upon words, but rather syntactic phrases, which perform the translation by assuming that during the training phase the target language (but not the source language) specifies not just the words, but the complete parse of the sentence. For Chinese-to-English translation it seems worthwhile to take this as a starting point. One sub-optimal part of this system is its incomplete use of the syntactic information available to it. Internally, the model performs three types of operations on each node of a parse tree. First, it reorders the child nodes, such as changing VP → VB NP PP into VP → NP PP VB. Second, it inserts an optional word at each node. Third, it translates the leaf English words into Chinese words. These operations are stochastic, and their probabilities are assumed to depend only on the node and to be independent of other operations on the node, or of other nodes. An English phrase covered by a node can also be directly translated into a Chinese phrase without regular reordering, insertions and leaf-word translations. The model accepts an English parse tree as input and produces a Chinese sentence as output. It is computed with respect to the simplified, non-lexical PCFG, not with respect to the "true" lexicalized PCFG; thus it is possible that an edge which looks implausible with respect to the first could look quite good with respect to the second. Consideration was also restricted to sentences without punctuation.

Fig. 2. Noisy-Channel Model
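To make these three node-level operations concrete, the toy sketch below (in Python) walks a small parse tree and applies a sampled reordering, an optional word insertion and a leaf translation at each node. The tree, the probability tables and the tiny dictionary are invented for illustration; they are not taken from the actual noisy-channel model discussed above.

import random

# Toy parse tree: (label, children...) for internal nodes, (label, word) for leaves.
TREE = ("VP", ("VB", "saw"), ("NP", ("DT", "the"), ("NN", "dog")))

# Illustrative (hypothetical) parameter tables.
REORDER = {("VB", "NP"): [(("NP", "VB"), 0.7), (("VB", "NP"), 0.3)]}  # child-order choices
INSERT = {"VP": [("le", 0.2), (None, 0.8)]}                           # optional function word
LEXICON = {"saw": "kanju", "the": None, "dog": "patti"}               # toy word translations

def sample(choices):
    # Pick one option according to its probability.
    r, acc = random.random(), 0.0
    for option, p in choices:
        acc += p
        if r <= acc:
            return option
    return choices[-1][0]

def translate(node):
    # Recursively apply reorder, insert and leaf-translate operations.
    label = node[0]
    if len(node) == 2 and isinstance(node[1], str):      # leaf: translate the word
        word = LEXICON.get(node[1], node[1])
        return [word] if word else []
    children = node[1:]
    labels = tuple(c[0] for c in children)
    order = sample(REORDER.get(labels, [(labels, 1.0)]))  # sampled child reordering
    by_label = {c[0]: c for c in children}                # assumes distinct child labels (toy)
    out = []
    inserted = sample(INSERT.get(label, [(None, 1.0)]))   # optionally insert a word
    if inserted:
        out.append(inserted)
    for lab in order:
        out.extend(translate(by_label[lab]))
    return out

print(" ".join(translate(TREE)))

Running it a few times produces different orderings and optional insertions, which is exactly the stochastic behaviour described above.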
III. LOG LINEAR MODEL
Fig. 3. Log-Linear Model
Och [3] concentrated on phrase-based models. In his paper, he attempts to address problems such as grammatical errors, including the lack of a main verb, wrong word order and wrong choice of function words, by exploring a variety of new features for scoring candidate translations. The goal is the translation of a text given in some source language into a target language. The standard criterion for training such a log-linear model is to maximize the probability of the parallel training corpus consisting of S sentence pairs. Word-level feature functions, directly based on the source and target strings of words, are intended to address problems like translation choice, missing content words and incorrect punctuation. Shallow syntactic feature functions can combine the strengths of tag- and chunk-based translation systems. The process of adding features based on Treebank-based syntactic analyses of the source and target sentences addresses grammatical errors in the output of the baseline system. This is the first large-scale integration of syntactic analysis operating on many different levels with a state-of-the-art phrase-based MT system. The methodology of using a log-linear feature combination approach, with discriminative re-ranking of n-best lists computed with a state-of-the-art baseline system, allowed members of a large team to simultaneously experiment with hundreds of syntactic feature functions on a common platform. Unfortunately, none of the implemented syntactic features achieved a statistically significant improvement in the BLEU score.

IV. TREE SUBSTITUTION GRAMMAR MODEL

Post and Gildea [4] demonstrated that tree substitution grammars (TSGs) may be a better choice than context-free grammars for language modelling. Training a more complicated bilexical parsing model across TSG derivations shows further improvement. This suggests that PCFGs in general may not be as poor language models as often thought, and that there are mechanisms for producing context-free grammars that do a much better job of modelling language than the Treebank grammar. Compactness of a grammar is an important property for keeping perplexity low. A number of research groups have shown that PCFGs are not very helpful in improving BLEU scores for machine translation. Furthermore, they do not even appear to be very useful in distinguishing grammatical from ungrammatical text. Just as a PCFG generalizes less well in finding parse structures for unseen (grammatical) sentences, it is also unable to find satisfactory explanations for ungrammatical ones. The overall trend of the perplexity scores here correlates with those of the grammatical text. Finally, note that the plain CFG assigns a perplexity score that is in the neighbourhood of those assigned by the n-gram models, with the CFG's slight lead perhaps explained by the fact that the permutation of the test data was performed at the constituent (and not the word) level. This is significant, because it suggests that PCFGs in general may not be as poor language models as often thought, and have mechanisms for producing context-free grammars that do a much better job of modelling language than the Treebank grammar. The fact remains, however, that the context-free model is quite constrained, ignoring salient elements of the structural history that should surely be conditioned upon. This model is called a lexicalized model because (a) the expansion of a node into all of its children is conditioned on the parent's head word and (b) the heads of the siblings of the head child are also conditioned on the parent's head word. First, TSG subtrees were flattened to equivalent height-one CFG trees; this does not affect the end result, because the internal nodes of TSG subtrees contribute nothing to language modelling under the inference models. Second, the Collins parsing model training procedures expect (tag, word) pairs in the training data, but flattening the lexicalized TSG rules removes many of the tags from the training corpus. To correct for this, dummy preterminals that expand deterministically are introduced above such words. The head-finding rules were also modified to select the same lexical head that would have been selected if the interior nodes were present.

V. TREE SUBSTITUTION GRAMMAR BY XIAO

Xiao et al. [5] studied syntax-based language modelling using the lexicalized monolingual parsing model with TSG.
The model parameters were directly learned from the automatically parsed target trees. They reported that their approach can improve the translation quality of the string-to-tree model, and their experimental results show that these methods are very beneficial to the proposed language model, consequently further speeding up the system and improving the translation accuracy. In this work, syntax-based language modelling approaches for Chinese-English machine translation are investigated, and a Tree Substitution Grammar based language model is presented to improve a state-of-the-art Chinese-English syntax-based MT system. By learning TSGs from the target side of bilingual data, this model can better model the well-formedness of MT output than traditional context-free Treebank grammar based language models that are trained on limited treebank data. On the NIST Chinese-English evaluation corpora, it achieves promising BLEU improvements over the baseline system. Moreover, three methods for efficient language model integration are presented, as well as a simple and effective method for language model adaptation. They also expect that their promising results could encourage more studies on this interesting topic, such as exploring Tree Adjoining Grammars or other alternatives within language modelling.

VI. COLLINS' MODEL

For the lexicalized PCFG, a back-off smoothing mechanism is designed for each sub-model of the proposed bilingually lexicalized STSG. Collins' goals are two-fold. First, he aims to advance the state of the art by reporting improved parsing accuracy over previous results. Second, he aims to increase understanding of the parsing problem through a detailed analysis of parsing models. In common with several other approaches, he adopts a statistical method; the learning problem then becomes a task of estimating parameter values from training data. The paper answers two critical questions: 1) How should trees be broken down into smaller fragments? 2) How can this choice be instantiated in sound probabilistic models?

VII. BILINGUALLY LEXICALIZED SYNCHRONOUS TREE SUBSTITUTION GRAMMAR MODEL

String-to-tree translation aims to construct a more grammatically correct parse tree on the target side. The model accepts a source string, recursively applies STSG rules, and then searches through all possible target trees to identify the tree with the highest score. The STSG translation rules and decoding with bilingually lexicalized parsing are the two major steps.

A. STSG Translation Rules

The GHKM algorithm helps extract (minimal) STSG rules from a triple (f, e, a), where f is the source-language sentence, e is a target-language parse tree whose yield is the translation of f, and a is the set of word alignments between e and f. The minimal string-to-tree rules are extracted in three steps: (1) frontier set computation, (2) fragmentation, and (3) extraction. The frontier set is the set of frontier nodes that meet the following alignment constraint: the target phrase dominated by the frontier node and its corresponding source phrase must be consistent with the word alignment. The graph composed of the triple is then divided into several fragments, and each fragment forms an STSG translation rule. These rules are extracted through a depth-first traversal of e: for each frontier node visited, a rule is extracted using the fragment rooted at this frontier node. Table I shows the minimal and composed rule extraction of Fig. 1.
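As a rough illustration of step (1), the sketch below computes the frontier set for a toy (f, e, a) triple. The tree, the word spans and the alignment are invented, and the consistency test is the usual GHKM-style check (no source word aligned to the node's target span may also align outside that span), not the exact implementation used in the surveyed work.

# Frontier-set computation in the spirit of GHKM (Galley et al. [9]); the example
# triple (source sentence f, target tree e, alignment a) below is purely illustrative.

class Node:
    def __init__(self, label, children=None, span=None):
        self.label = label                      # syntactic label, e.g. "NP"
        self.children = children or []          # child Nodes
        self.span = span                        # (start, end) over target words, end exclusive

    def nodes(self):
        yield self
        for c in self.children:
            yield from c.nodes()

def frontier_set(root, alignment):
    # alignment: set of (src_index, tgt_index) pairs.
    frontier = []
    for node in root.nodes():
        t_lo, t_hi = node.span
        # Source positions aligned to this node's target span.
        src = {i for (i, j) in alignment if t_lo <= j < t_hi}
        # The node is a frontier node iff none of those source positions
        # is also aligned to a target word outside the span.
        consistent = all(t_lo <= j < t_hi for (i, j) in alignment if i in src)
        if src and consistent:
            frontier.append(node)
    return frontier

# Toy target tree "gunmen were killed" with word spans, and a toy alignment
# to a three-word source sentence (source indices 0..2).
tree = Node("S", [
    Node("NP", [Node("NNS", span=(0, 1))], span=(0, 1)),
    Node("VP", [Node("VBD", span=(1, 2)), Node("VBN", span=(2, 3))], span=(1, 3)),
], span=(0, 3))
alignment = {(0, 0), (2, 1), (2, 2)}   # source word 2 aligns to both "were" and "killed"

print([n.label for n in frontier_set(tree, alignment)])   # ['S', 'NP', 'NNS', 'VP']

In this toy example the VBD and VBN nodes are not frontier nodes, because the source word aligned to them also aligns outside their individual spans, so the smallest rule extracted around them is rooted at VP.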
The idea of lexicalized rule extraction is to apply the headword information from the target-side lexicalized training parse tree. Before rule extraction, the head-finding rules are first used to annotate every interior node of the target-side parse tree with the node's headword and its part of speech. For the source-side lexicalized information a heuristic is used, and each node is associated with the target headword of the node; the target headword with the highest probability is chosen.

B. Decoding as Parsing

Fig. 2 illustrates the deductive steps. First, axiom rules r2, r3 and r5 are employed to deduce one-word translations. Then, inference rules r7, r11 and r10 are applied to deduce two-word, three-word and four-word translations. The inference rule r10 is used for the following analysis. The deductive step can be formalized as follows:

    NP[0,1]: (w1, e1)    VP[1,4]: (w2, e2)
    --------------------------------------
              S[0,4]: (w, e1 e2)

where NP is deduced using axiom rule r2, and VP is deduced using inference rule r11. Here, w1 and e1 denote the score and the partial translation.

Fig. 4. Illustration of string-to-tree decoding

The interior non-terminals that contribute nothing to the generation of the target translation are removed as part of the parameterization. For example, the following lexicalized rule is obtained for r11:

r11 → VP(were,null) → (x0 jibi, VBD(were,null) VBN(killed,jibi) PP(by,bei):x0)

Result → VP(were,null) → VBD(were,null) VBN(killed,jibi) PP(by,bei):x0
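The deduction above can be mimicked with a toy chart-style combination of items, sketched below; the labels, spans, scores and partial translations are illustrative only and do not reproduce the actual decoder.

from collections import namedtuple

# A chart item: grammar symbol, covered source span [i, j), model score, partial translation.
Item = namedtuple("Item", "label i j score translation")

def combine(left, right, parent_label, rule_score):
    # Combine two adjacent items under a rule such as S -> NP VP.
    assert left.j == right.i, "items must cover adjacent spans"
    return Item(parent_label, left.i, right.j,
                left.score * right.score * rule_score,
                left.translation + " " + right.translation)

np_item = Item("NP", 0, 1, 0.1, "the gunmen")      # deduced by an axiom rule
vp_item = Item("VP", 1, 4, 0.4, "were killed")     # deduced by an inference rule

s_item = combine(np_item, vp_item, "S", 1.0)
print(s_item)   # Item(label='S', i=0, j=4, score=0.04..., translation='the gunmen were killed')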
Fig. 5. Lexicalized training example

VIII. CONCLUSION

Various translation models are evaluated according to the BLEU score, and it is found that the bilingually lexicalized STSG parsing model achieves improved translation. BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of text which has been machine translated from one natural language to another, where quality is taken to be the correspondence between a machine's output and that of a human. The STSG approach can significantly improve the translation quality on both small-scale and large-scale data sets. Both the target- and source-side lexicalized information is essential to the improvement of the translation quality. The bilingually lexicalized STSG parsing model is used to distinguish good rules from poor ones in the decoding stage. Moreover, the model not only has the merits of the phrase-structure information in the string-to-tree model but also enriches it with source- and target-side lexicalized knowledge. The string-to-tree translation model requires rich lexicalized information for reliable application of the STSG rules during decoding.

REFERENCES

[1] M. Collins, "Head-driven statistical models for natural language parsing," Comput. Linguist., vol. 29, no. 4, pp. 589-637, 2003.
[2] E. Charniak, K. Knight, and K. Yamada, "Syntax-based language models for statistical machine translation," Proc. MT Summit IX, 2003, pp. 40-46.
[3] F. Och, D. Gildea, S. Khudanpur, A. Sarkar, K. Yamada, A. Fraser, S. Kumar, L. Shen, D. Smith, K. Eng, V. Jain, Z. Jin, and D. Radev, "A smorgasbord of features for statistical machine translation," Proc. NAACL 04, 2004, pp. 161-168.
[4] M. Post and D. Gildea, "Language modelling with tree substitution grammars," Proc. NIPS Workshop on Grammar Induction, Representation of Language, and Language Learning, 2009, pp. 1-8.
[5] T. Xiao, J. Zhu, and M. Zhu, "Language modeling for syntax-based machine translation using tree substitution grammars: A case study on Chinese-English," ACM Trans. Asian Lang. Inf. Process., 2011.
[6] M. Galley, M. Hopkins, K. Knight, and D. Marcu, "What's in a translation rule," Proc. NAACL 04, 2004, pp. 273-280.
[7] D. Chiang, "Hierarchical phrase-based translation," Comput. Linguist., vol. 33, no. 2, pp. 201-228, 2007.
[8] L. Shen, J. Xu, and R. Weischedel, "A new string to dependency machine translation algorithm with a target dependency language model," Proc. ACL-08: HLT, 2008, pp. 577-585.
[9] M. Galley, J. Graehl, K. Knight, D. Marcu, S. DeNeefe, W. Wang, and I. Thayer, "Scalable inference and training of context-rich syntactic translation models," Proc. COLING-ACL 06, 2006, pp. 961-968.
[10] D. Marcu, W. Wang, A. Echihabi, and K. Knight, "SPMT: Statistical machine translation with syntactified target language phrases," Proc. EMNLP 06, 2006, pp. 44-52.
Microsoft Officially Launches Azure Machine Learning Platform

Microsoft officially announced at the Strata Conference today the general availability of the Azure Machine Learning service for big data processing in the cloud. It also announced some enhancements to the platform since its beta release in June. The product is built on the machine learning capabilities already available in several Microsoft products, including Xbox and Bing, and, using predefined templates and workflows, has been built to help companies launch predictive applications much more quickly than traditional development methods, even allowing customers to publish APIs and web services on top of the Azure ML platform. In addition, the platform now supports Hadoop and Spark, giving it a fairly comprehensive set of tools for processing big data, regardless of your platform of choice. The real strength of this platform is the ability to create APIs and begin processing data very quickly.
In addition to providing tools for big data processing in the cloud, Microsoft is also offering a marketplace where people can share applications and APIs they have created. In terms of visualizing the data, the platform has some built-in capabilities, but it is also compatible with Microsoft Power BI and IPython Notebook for further plotting and visualization of processed data.
http://techcrunch.com/2015/02/18/microsoft-officially-launches-azure-machinelearning-big-data-platform/
Interactive News Reading Android App for Malayalam Sooraj R1, Jose Stephen2, Bhadran V.K3, Anjali M4, Arun Gopi5 Center for Development of Advanced Computing(C-DAC), Trivandrum, India. soorajr@cdac.in1, jose_stephen@cdac.in2, bhadran@cdac.in3, anjalim@cdac.in4, arungopi@cdac.in5
ABSTRACT: Newspapers are a source of news and information, and moreover a window to the outside world. Today technology has grown to the extent that the newspaper can be brought to handheld devices. The Interactive News Reading Android App is an application which enables people to get the benefit of newspaper reading at their will, without depending on others. Technologies such as speech synthesis, the web and mobile are integrated together in this application. The major highlight of this app is as an assistive technology for the visually challenged and senior citizens. The application also helps busy people to read the newspaper wherever they are, even when engaged in other activities or on the move. The app collects and classifies the news from major newspaper sites and reads out the news in accordance with user inputs.
Keywords: Newspaper Reading System, Text To Speech System, Android
I. INTRODUCTION
The Interactive News Reading System for Android phones is an application which collects Malayalam news from the News Server and extracts news contents based on user inputs. The extracted news is displayed on the user interface and read out for the user with the integrated TTS. The system minimizes the user's task of browsing a web-based news portal or a newspaper site with the help of a web browser or a newspaper-specific app, which is how the daily routine of news reading is accomplished today. Moreover, the wide reach of smart mobile phones, tablets and other handy devices with internet connectivity created a requirement for applications suited to mobile devices to acquire and share information in the form of news, especially in native languages, from the major news publications. To meet this requirement we ported the available desktop version of the Interactive News Reading System to the most popular open-source Android platform, so that users can get categorized news on the go through their Android-based mobile devices. The app will be a helping aid for people who regularly read news, especially for the visually challenged as well as old-age people with deprecated visual ability. An Android mobile with a headset and an internet connection is enough for a person to go through the news topics of interest and their contents, as simply as "select and rest", without disturbing others around. The app incorporates interactive navigation facilities such as switching between headlines and the corresponding detailed news, traversing news contents, changing the topic or newspaper at any time,
pausing and resuming reading, etc. Speech feedback is integrated with navigation to help the user judge what has been done. The overall operation of the app can be carried out with simple screen touches, swipes and taps. The overall app process is designed to utilize minimal system resources and storage space by implementing a client-server concept.
II. SYSTEM OVERVIEW
The system is designed with the objective of providing seamless integration of various technologies for efficient operation. The system is designed in such a way that the modules of the system and the integrated technologies can be reused in other systems to extend their functionalities. The system can be re-engineered to cope with changes in the dependent news sites, extension of the number of newspapers, news topics and the user-level speech feedbacks. The client-server concept makes this flexibility possible for future design changes in its dependencies, for configuration changes in online news sites and for the addition of new assistive technologies to the existing app. Changes in page structure, RSS feeders and hyperlinks can be easily handled. The main modules of the system are:
1) News downloading module
2) User Interface module
3) News Extractor and XML Parser
4) Thread Handler
5) Navigation Controller
6) Speech Feedback
7) Interfacing with Text-to-Speech (TTS)
A. News downloader module

The news downloader module downloads news from the designated News Server, where scheduled downloading of news from the news sites takes place as per the information in the configuration files and news files are created in the prescribed format. The downloader requests the news file according to the user's demand using the HTTP protocol and saves the news files into the smartphone SD card memory with a defined file format. The news file name contains the newspaper and topic information.
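A minimal sketch of this download-and-extract flow is given below in Python for brevity; the real app implements it with Android/Java APIs, and the server URL, query parameters, file-naming scheme and XML element names here are assumptions rather than the system's actual interface.

import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical news server endpoint; not the real C-DAC server address.
SERVER = "http://example.org/news"

def download_news(newspaper, topic, save_dir="."):
    # Fetch the prepared news XML for one paper/topic and save it locally.
    # The file name encodes the newspaper and topic, mirroring the scheme
    # described above (the real format is not specified in the paper).
    url = f"{SERVER}?paper={newspaper}&topic={topic}"
    data = urllib.request.urlopen(url, timeout=10).read()
    path = f"{save_dir}/{newspaper}_{topic}.xml"
    with open(path, "wb") as f:
        f.write(data)
    return path

def extract_headlines(xml_path):
    # Parse the saved XML and return (headline, body) pairs.
    # Assumes a simple <news><item><headline>...<body>... schema.
    root = ET.parse(xml_path).getroot()
    return [(item.findtext("headline", ""), item.findtext("body", ""))
            for item in root.iter("item")]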
Fig. 1. Block diagram for News Reading system
B. News extraction module

The news extraction module extracts news from the downloaded news file in XML format. This is done using standard XML parsing methods. The extracted news data is passed to the Thread Handler module.

C. Thread handler

The Thread Handler handles the synchronized display of news on the user interface as well as the read-out process by the TTS.
Java multi-threading and Android AsyncTask classes and their methods are used for this purpose. Thread states are controlled by the Navigation Controller module.
D. Speech feedback

Speech feedback support is integrated so as to help visually challenged as well as normal users be aware of what is going on at every user input or interrupt. This is implemented using an XML configuration file, the integrated TTS technology and the Android built-in media player. Standard Java parsing technology is used for retrieving information from the XML configuration file.

E. Navigation controller

The navigation controller module controls the execution of threads and their states according to the user's interactive requests, and switches news contents and suitable speech feedback messages. This module provides thread-safe execution, as the system uses multi-threading with a synchronized sequence of execution.

F. User interface module

It accepts input from the user in the form of finger touches. The news file is downloaded using the downloader module, displayed on the display interface in accordance with user input, and read out with the help of the TTS. The interface supports facilities such as navigation from headlines to detailed news, navigating back to headlines, pausing and resuming the news reading activity, stop, restart, etc., all of which can be done using touch and designated buttons. The Android phone's basic properties, such as orientation changes, incoming call states and the screen, are properly handled.

G. Interfacing with text-to-speech system

Fig. 2. TTS interface
A concatenation-based Malayalam TTS built on the Epoch Synchronous Non Overlap and Add (ESNOLA) technique is integrated as the TTS in the system. This TTS uses diphone-like segments (partnemes) as the basic units for concatenation. The database contains 1500 partnemes, which are used for generating speech for unlimited-domain text. The ESNOLA algorithm is used for concatenation and regeneration as well as for pitch and duration modification. The native library for the TTS is created using the native compiler of the Android Native Development Kit (NDK), and the TTS is accessed as a native library from this application. The TTS supports Android versions up to Android 4.2. The TTS module transforms text input into wav files and stores the media file in the Android file system. This file is played by the standard Android media player.

III. RESULT

The current version of the system supports both the Mathrubhumi and Malayala Manorama newspapers. Newspaper-dependent
information such as fonts, hyperlinks of the newspaper and corresponding topics, and news extraction information are stored in the configuration file at the server side. The client-server concept minimizes the processing overhead for the Android app in terms of downloading and creation of the news XML files. This makes the app resilient to changes in the design of newspaper sites to some extent, and it also eases the process of adding newspapers or new topics to the application.
IV. FURTHER ENHANCEMENTS

Although the system is resilient to small design changes of newspaper sites, a major change will adversely affect the system at the server level. The present TTS system is intelligible and has low memory requirements, but it lacks naturalness; the naturalness of the TTS system has to be improved. Another limitation is the presence of multilingual (English) words in the news: currently the system skips such words, which may adversely affect the reading clarity. The system currently depends upon screen-reading software for operations such as app selection and launching, which may result in confusion for a visually challenged person. Currently the app supports only two major newspapers in Malayalam with limited news topics; this is to be enhanced with more papers and topics. The app is limited to Android smartphones with touch-screen facility only, and should be extended to other platforms such as iPhone, Firefox OS, Windows, etc.
Fig. 3. Screenshot 1
V. CONCLUSION

The presented Interactive News Reading System for Malayalam, Android Version 1.0, is the first of its kind in Malayalam that integrates speech synthesis, one of the basic means of human communication. The system is usable within all its present limitations; the naturalness of the TTS is the crucial factor in the usability of the system. A totally hands-free system with ASR technology which can read newspapers in multiple languages will be a great achievement.
A Survey on the Use of Natural Language Processing Techniques in Language Learning Rugma R M. Tech Student, Dept. of Computer Science and Engineering, MEA Engineering College, Vengoor, Perinthalmanna, Malappuram, India-679325 rugmaramanathan@gmail.com
Radhika T Assistant Professor, Dept. of Computer Science and Engineering, MEA Engineering College, Vengoor, Perinthalmanna, Malappuram, India-679325 radhikaanil111@gmail.com
ABSTRACT: Understanding appropriate word combinations is a very important part of learning a language, but most language learners are often reported to have problems with this. To address these problems, automated systems have been developed which provide collocations, formulaic sequences and their usages as required to satisfy learners' demands. This paper attempts to study the use of computers and natural language processing techniques in language learning, focusing on word associations in English.
Keywords: Natural Language Processing (NLP), Computer Assisted Language Learning (CALL), Collocations, Formulaic Expressions.
I. INTRODUCTION
Computers, corpora and computational linguistics play an important role in language learning and teaching. Researchers in the area of Computer Assisted Language Learning (CALL) have been working on exploiting Natural Language Processing (NLP) technologies in traditional language learning. This paper explores the relevance and uses of NLP in the context of language acquisition, especially for English language learners. One of the most problematic areas for language learning is the use of appropriate word combinations. For example, it is proper to use "commit suicide", but "undertake suicide" sounds completely strange. These combinations play a crucial role in language processing,
learning and acquisition. Such recurrent multi-word lexical items which are learned and used as a whole unit are called formulaic sequences. They include idioms, collocations, phrasal vocabulary, etc. Formulaic language [1] is one of the most important components of language overall: approximately one-third to one-half of language is composed of formulaic elements. Formulaic sequences are stored and retrieved as a whole, so they reduce processing effort. Such expressions are understood more quickly than non-formulaic sequences, and learning them helps speakers to be fluent. But many language learners have shown a lack of knowledge of such expressions, so the focus on the co-occurrence of words to make chunks, phrases, collocations and formulaic expressions has increased in recent years. With the development of computer technology, fast searching for such expressions becomes possible. As they are recognized as being beneficial to learners'
language fluency, many tools using natural language processing techniques have also been developed. Several such corpus-based tools are studied and compared in the following sections.

II. SKETCH ENGINE
The Sketch Engine [2] is a corpus query system which provides one-page summaries of a word's grammatical and collocational behaviour. It provides a list of collocates for each grammatical relation the word participates in, i.e. verbs, subjects, objects, adverbs, prepositions and so on. This one-page summary is called a word sketch. Fig. 1 shows the word sketch for the verb "pray". A learner can go through this word sketch and click on the collocation of interest to see the corpus contexts in which the word and its collocate co-occur.
A. Lemmatisation

The first step in the development of a word sketch is lemmatisation, the process of grouping together the different inflected forms of a word so that they can be analysed as a single item. For each word, the Sketch Engine must know the corresponding lemma, i.e. the possible mappings between surface forms and lexical forms in the dictionary.

B. POS Tagging

This is the process of deciding the correct word class (verb, noun, adverb, etc.) for each word in a text or corpus, based on both its definition and its context, i.e. its relationship with adjacent and related words in a phrase, sentence or paragraph. The Sketch Engine assumes tagged input.

C. Input format

The input format is described as follows: each word is on a new line, and for each word there can be a number of fields, such as word form, POS tag and lemma, which specify further information about the word.

D. Grammatical relations

The Sketch Engine supports two possibilities for finding the grammatical relations between words. The first possibility is that the input corpus has been parsed and the information about the grammatical relations between the words is embedded in the corpus. In the second, the input corpus is loaded into the Sketch Engine unparsed; an expert user, ideally a linguist with some experience and familiarity with computational formalisms, will then define each grammatical relation and load it into the Sketch Engine.
Fig. 1. Word Sketch for pray (v)
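As a rough illustration of the lemmatisation and POS-tagging steps described above, the sketch below uses NLTK; the Sketch Engine has its own tools and tagset, so this is only an approximation of the preprocessing, not its actual pipeline, and the example sentence is invented.

import nltk
from nltk.stem import WordNetLemmatizer

# One-time downloads of the required NLTK resources.
for pkg in ("punkt", "averaged_perceptron_tagger", "wordnet"):
    nltk.download(pkg, quiet=True)

lemmatizer = WordNetLemmatizer()

def to_wordnet_pos(treebank_tag):
    # Map a Penn Treebank tag to the coarse POS expected by the lemmatizer.
    if treebank_tag.startswith("V"):
        return "v"
    if treebank_tag.startswith("J"):
        return "a"
    if treebank_tag.startswith("R"):
        return "r"
    return "n"

sentence = "She prayed quietly in the old churches"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
lemmas = [(w, tag, lemmatizer.lemmatize(w.lower(), to_wordnet_pos(tag)))
          for w, tag in tagged]
print(lemmas)   # e.g. ('prayed', 'VBD', 'pray'), ('churches', 'NNS', 'church')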
III. TANGO

A collocation is a sequence of words or terms that co-occur more often than would be expected by chance. In language learning, it is extremely important to know which words pair with each other. TANGO [3] is a concordancer capable of answering users' queries on collocation use. A user can type any English word as a query and select the expected part of speech of the accompanying words. The result of possible collocates is then displayed on the return page. For example, in Figure 2, after a query for the noun collocates of "cooperation" is submitted, the results are displayed on the return page. The user can then browse through the different collocation types and also click to get all the instances of a certain collocation type.
Fig. 2. Result of the query for noun collocates of "cooperation"

A. Chunking

Chunking is the process of dividing a sentence into syntactically correlated groups of words. For example:

Confidence/B-NP in/B-PP the/B-NP pound/I-NP is/B-VP widely/I-VP expected/I-VP to/I-VP take/I-VP another/B-NP sharp/I-NP dive/I-NP if/B-SBAR trade/B-NP figures/I-NP for/B-PP September/B-NP

Here, the words carrying the same chunk tag can be further grouped together [4] (see Table I).

Sentence Chunking              Chunk Type
Confidence                     NP
in                             PP
the pound                      NP
is widely expected to take     VP
another sharp dive             NP
if                             SBAR
trade figures                  NP
for                            PP
September                      NP

Table I. Chunked Sentences

B. Collocation Type Extraction

A large set of collocation candidates can be obtained from the British National Corpus (BNC) via the process of integrating chunk and clause information. TANGO considers three prevalent verb-noun collocation structures in the corpus: VP+NP, VP+PP+NP, and VP+NP+PP. The strength of association between two collocates is calculated using the Logarithmic Likelihood Ratio (LLR).

C. Collocation Instance Identification

With the help of chunk and clause information and the collocation types extracted from the BNC, TANGO finds the valid instances where the expected collocation types are located, so as to build a collocational concordance.

IV. CONCGRAM

ConcGram [5] is a phraseological search engine designed to find all the co-occurrences of words in a text or corpus. It can identify all the potential configurations of between 2 and 5 words in any corpus, based on a window of any size, to include the
associated words even if they occur in different positions relative to one another (i.e. positional variation) and even when one or more words occur in between the associated words (i.e. constituency variation). Figure 3 shows sample concordance lines of the result of an automated 3-word concgram search. The concgram is "Asia/world/city", from a search with "Asia/world" as the double origin. The process of creating the initial 2-word concgram list involves the following steps [6]:

Step 1: All the unique words in a text or corpus are identified and listed.
Step 2: With this list, concordance searches are made, with each unique word acting as the single origin for the search.
Step 3: All co-occurring words in the concordance lines are then listed for each single origin.

From this initial 2-word concgram list, the user can go on to build a 3-word concgram list, then a 4-word list, and finally a 5-word concgram list, all derived from fully automated searches.
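To give a feel for both the LLR scoring used by TANGO and the windowed co-occurrence search behind concgrams, the sketch below uses NLTK's collocation finder on a few invented sentences. The real systems work over the BNC with their own chunk, clause and origin-word machinery, so this is only a loose approximation; the example text and the window size are assumptions.

import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

nltk.download("punkt", quiet=True)

text = ("The police launched an investigation. "
        "They promised to launch a full investigation into the incident. "
        "An investigation was launched by the police.")
tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]

# Pairs are collected within a 5-word window, so the associated words may be
# separated by intervening words (constituency variation); a reversed ordering
# is counted as a separate pair.
finder = BigramCollocationFinder.from_words(tokens, window_size=5)
finder.apply_freq_filter(2)                      # ignore very rare pairs

measures = BigramAssocMeasures()
for pair, llr in finder.score_ngrams(measures.likelihood_ratio)[:5]:
    print(pair, round(llr, 2))                   # pairs ranked by log-likelihood ratio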
Fig. 3. Result of a 3-word concgram search

V. STRINGNET

StringNet [7] is a lexico-grammatical knowledge base, automatically extracted from the BNC. The structure of StringNet can be described in two parts: a special type of n-gram that we refer to as hybrid n-grams, constituting the core content of StringNet, and the inter-relations among these hybrid n-grams, represented by cross-indexing [8].

A. Hybrid n-grams

A hybrid n-gram is an n-gram that admits items of different levels. It consists of a combination of lexemes, word forms and parts of speech (POSs). For example, "saw the light" is indexed to:
1. [verb] the light
2. see [det] light
3. [verb] [det] light
4. saw the [noun]
and so on.

B. Cross-indexing of Hybrid n-grams

There exist inter-relations among these hybrid n-grams. For example, "saw the light" is a child of "[verb] the light"; "see [det] light", "[verb] [det] light", "saw the [noun]", etc. can also be considered parents of "saw the light". StringNet completely cross-indexes all of these thus-related hybrid n-grams. Each hybrid n-gram listed in a search result contains links to examples and to its parent and child hybrid n-grams (see Fig. 4 and Fig. 5).

C. Pruning

Pruning is the process of eliminating redundant hybrid n-grams from searches. Hybrid n-grams in StringNet consist of all possible combinations of word form, lexeme and POS. Thus for every single traditional n-gram consisting of a string of word forms, there are numerous hybrid n-grams that also describe that same string. StringNet introduces pruning to decrease the search space while still keeping most of the useful information. Consider the pair:
1. See the light
2. See [det] light
Here "See [det] light" can be pruned if all cases of [det] in this pattern in the BNC are indeed cases of the determiner "the".
Fig.5. Children of “keep a[adj] eye on”
VI. GRASP Grammar and Syntax-based PatternFinder (GRASP) is a reference aid which provides a usage summary of the query phrase in the form of formulaic structures (i.e., syntactic pat- terns) and frequent formulaic sequences (i.e., lexical phrases). Such rich information is expected to help learners and lexicographers grasp the essence of word usages [9].
A. Lemmatisation GRASP lemmatises a total of approximately 5.6 million sentences in the BNC during corpus pre-processing stage. The goal of lemmatisation is to reduce the impact of inflectional morphology of words on statistical analysis. B. POS Tagging It is the process of generating the most probable POS tag sequence for each sentence. POS tagging provides a way to grammatically describe and generalize the contexts of a query phrase. C. Construction of Inverted Files Then it builds up inverted files of the lemmas in the corpus for quick run-time search. For each lemma, the sentences and positions in which it occurs are also recorded. Additionally, its corresponding surface word and POS tag are kept for run- time pattern grammar generation D. Extraction of Formulaic Structures and Sequences At run time GRASP automatically identifies all the sentences containing the query phrase and groups the contextual words into syntactic patterns based on their assigned POS tags [10]. Referring to Fig. 6, on receiving the request, GRASP initializes three zones on the left: syntactic patterns inbetween, preceding and following the query phrase. In the first row in the in-between zone, the number 2 indicates the distance of make any difference. GRASP generates at most five representative formulaic structures and they are displayed based on their frequency. User can click on any formulaic pattern to see at most five common instances
together with their corresponding frequencies and example sentences.
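A rough sketch of the inverted-file construction in subsection C is given below: each lemma maps to its (sentence id, position, surface form, POS tag) occurrences. The toy corpus and its tags are placeholders, not GRASP's actual pre-processed BNC.

from collections import defaultdict

# Toy pre-processed corpus: each token is (surface form, lemma, POS tag).
# In the real system these come from lemmatising and tagging the BNC.
corpus = [
    [("It", "it", "PP"), ("makes", "make", "VVZ"), ("no", "no", "AT0"),
     ("difference", "difference", "NN1")],
    [("That", "that", "DT0"), ("made", "make", "VVD"), ("a", "a", "AT0"),
     ("difference", "difference", "NN1")],
]

def build_inverted_files(corpus):
    # Map each lemma to its (sentence id, position, surface, POS) occurrences.
    index = defaultdict(list)
    for sent_id, sentence in enumerate(corpus):
        for pos, (surface, lemma, tag) in enumerate(sentence):
            index[lemma].append((sent_id, pos, surface, tag))
    return index

index = build_inverted_files(corpus)
print(index["make"])   # [(0, 1, 'makes', 'VVZ'), (1, 1, 'made', 'VVD')]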
Fig. 6. GRASP information presentation

VII. OBSERVATION AND ANALYSIS

Several corpus-based tools for language learning have been discussed. As shown in Table II, all of them have advantages and limitations. The word sketch provides information on the grammatical relations between words. The Sketch Engine also generates a thesaurus and sketch differences, which specify similarities and differences between near-synonyms, but the software does not yet properly support lexicographic research into multi-word items. TANGO extracts valid instances based on the linguistic information of chunks and clauses. Moreover, using the techniques of bilingual collocation alignment and sentence alignment, the system will display the target collocation and its translation equivalents highlighted in different sentential contexts; multi-word inputs, however, are not supported in this system. ConcGram allows multi-word queries. Another main advantage of ConcGram is its capacity to handle constituency variation (i.e. AB, ACB) and positional variation (i.e. AB, BA), but it fails to provide syntactic patterns and usage information. StringNet also allows multi-word input and represents the hierarchical relations holding between and among different constructions; it also supports various NLP applications such as error detection and correction and pattern detection in web pages, but it lacks an organized display. GRASP is developed specifically to serve the purpose of suggesting formulaic expressions, which are much more than strings of words linked together with collocational ties. It supports multi-word queries, provides syntactic patterns and usage information, and displays all of them in an organized manner.

                      Sketch Engine   TANGO   ConcGram   StringNet   GRASP
Multiword Query            ✖           ✖         ✔          ✔         ✔
Syntactic Patterns         ✖           ✖         ✖          ✔         ✔
Usage Information          ✖           ✖         ✖          ✔         ✔
Organized Display          ✖           ✖         ✖          ✖         ✔

Table II. Comparative Study

VIII. CONCLUSION

It is clear that formulaic language plays a significant role in language development, processing, production and learning. In view of the vital role of formulaic sequences in language development, it is beneficial for teachers as well as language learners to lay emphasis on the use of formulaic expressions. With the advent of computer and natural language processing technologies, learning and searching for formulaic sequences becomes easier. Various tools have been developed for assisting learners in acquiring appropriate word usage in English. The development process of such systems is based on several essential NLP techniques. It has been observed from this survey of corpus-based tools for extracting formulaic sequences that they are much more beneficial than traditional learning resources: they provide collocations, formulaic sequences and their usages that are required to satisfy learners' demands. GRASP, more than a concordancer, addresses the limitations of the other tools and is expected to better meet learners' requirements. Multi-word querying allows users to directly target the usages of the desired phrases, and GRASP displays query results in an organized manner. It promotes learners' knowledge and awareness of word concatenation. These tools are also suitable for incorporation into classroom teaching and activities.
REFERENCES

[1] LIU Wei and HUO Ying, "On the Role of Formulaic Sequences in Second Language Acquisition," US-China Foreign Language, ISSN 1539-8080, Vol. 9, No. 11, pp. 708-713, November 2011.
[2] Adam Kilgarriff, Pavel Rychly, Pavel Smrz and David Tugwell, "The Sketch Engine," Proc. 11th EURALEX Intl Congress, pp. 105-116, 2004.
[3] J.Y. Jian, Y.C. Chang, and J.S. Chang, "TANGO: Bilingual Collocational Concordance," Proc. 42nd Ann. Meeting of the Assoc. for Computational Linguistics (ACL 04), pp. 19-23, 2004.
[4] Jia-Yan Jian, Yu-Chia Chang, and Jason S. Chang, "Collocational Translation Memory Extraction Based on Statistical and Linguistic Information," Proceedings of the 16th Conference on Computational Linguistics and Speech Processing, pp. 257-264, 2004.
[5] Chris Greaves, "ConcGram 1.0 - a phraseological search engine," The University of Siena and the Ministry of Education, 2005.
[6] Winnie Cheng, Chris Greaves and Martin Warren, "From n-gram to skipgram to concgram," International Journal of Corpus Linguistics, pp. 411-433, 2006.
[7] David Wible and Nai-Lung Tsao, "The StringNet Lexico-Grammatical Knowledgebase and its Applications," Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World (MWE 2011), pp. 128-130, June 2011.
[8] D. Wible and N.L. Tsao, "StringNet as a Computational Resource for Discovering and Investigating Linguistic Constructions," Proc. NAACL HLT Workshop on Extracting and Using Constructions in Computational Linguistics, pp. 25-31, 2010.
[9] Chung-Chi Huang, Mei-Hua Chen, Shih-Ting Huang, Hsien-Chin Liou and Jason S. Chang, "GRASP: Grammar- and Syntax-based Pattern-Finder in CALL," Proceedings of the Sixth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 26-31, June 2011.
[10] Mei-Hua Chen, Chung-Chi Huang, Shih-Ting Huang, Jason S. Chang, and Hsien-Chin Liou, "An Automatic Reference Aid for Improving EFL Learners' Formulaic Expressions in Productive Language Use," IEEE Transactions on Learning Technologies, Vol. 7, No. 1, January-March 2014.
Study of Different Methods for Training and Enhancing Dysarthric Speech Divya Das1, Jose Stephen2, Bhadran V.K3 Centre for Development of Advanced Computing Trivandrum, India divyadas@cdac.in1, jose_stephen@cdac.in2, bhadran@cdac.in3
ABSTRACT: Dysarthria is a neuro-motor speech disorder that mainly affects speech intelligibility. In this study we devised an error rating mechanism which gives feedback to dysarthric people and helps improve the intelligibility of their speech. The error rating is done with the help of speech recognition. On evaluation, these automatic ratings showed close resemblance to human ratings, so the rating mechanism can be used for evaluating the quality of dysarthric speech. We tried out various approaches for improving the intelligibility of dysarthric speech. Enhancement of dysarthric speech using formant frequency modifications helped in making it more intelligible, and enhancement by modifying the kepstrum coefficients was also tried out. From this study we inferred that a combination of different speech enhancement methods is more useful for dysarthric speech enhancement than adopting a single method.
Keywords: Dysarthria, Intelligibility, Speech rating
I. INTRODUCTION
Dysarthria is a speech disorder usually resulting in a substantial decrease in speech intelligibility. People with dysarthria cannot properly control the muscles needed for speech production. This causes speech to become slurred or distorted, making it hard to understand. The main problems faced by dysarthric people are the inability to control pitch, timing, tone, strength and breathing while speaking. Dysarthria can be caused by a variety of reasons, including stroke, brain injury, or even certain medications. According to its severity, dysarthria can be classified into mild and severe. Mild dysarthria can be reduced to some extent by providing proper training to the patients. Ronanki Srikanth, Li Bo and James Salsman [2] developed the talk nicer pronunciation
training toolkit for spoken language learning. Their research work was concentrated only on non-native language learners. The same method can also be adopted for training dysarthric persons. Carnegie Mellon University has developed speech software named Fluency for automatic foreign language pronunciation training [8]. In this study, we developed a pronunciation rating mechanism for dysarthric people. The study is conducted on mild dysarthric speech. The dependency on others during communication results in a loss of confidence for dysarthric patients, so it is necessary for them to have a speech therapy tool with which they can train and improve themselves. In case of severe dysarthria, speech training may not work as expected; in this case we must make their speech similar to normal speech. Various enhancement methods are available for improving
intelligibility of dysarthric speech. The intelligibility can be improved by modifying dysarthric speech at the signal level [1]. At the signal level, the characteristics of dysarthric speech are not like those of normal speech. The variation of formant frequencies is lower in the case of dysarthric persons than in normal ones [2]. As a result, their formant trajectories will not show uniform variations. To make them closer to the speech of normal people, these formant trajectory variations must be smoothened. Alexander Kain proposed a method for improving the intelligibility of dysarthric vowels [1]; improvement was obtained by transforming the vowels of a speaker with dysarthria to more closely match the vowel space of a non-dysarthric (target) speaker. Hesham Tolba et al., in their study, also proposed a method for smoothening the formant trajectories [3]. Lalitha, Prema and Lazar Mathew enhanced dysarthric speech using kepstrum coefficients and applying signal processing on the coefficients using the Wiener filter [4]. Woo Kyeong Seong, Ji Hun Park, and Hong Kook Kim used the Kalman and Wiener filters in combination for improving the intelligibility of dysarthric speech [6]. Using the Wiener filter alone on dysarthric speech causes the removal of unvoiced speech; to prevent this loss during enhancement, the Wiener filter and Kalman filter are applied separately to voiced and unvoiced speech [6]. After going through all the above-mentioned methods for speech enhancement and dysarthric speech evaluation, we devised methods for speech training and enhancement. The method for dysarthric speech training is purely based on Ronanki's mispronunciation correction
method. In the case of dysarthric speech enhancement, we tried a combination of methods, which is explained in this paper.

II. ERROR RATING MECHANISM FOR DYSARTHRIC SPEECH
The pronunciation verification is done by comparing acoustic properties of dysarthric speech with normal speech. For this process we used two databases, one that contains healthy speech and another that contains dysarthric speech. Then, using the statistics of the standard speech corpus, error scores are calculated for the given dysarthric speech [Figure 1]. Based on these scores the ratings are done [Figure 2].

A. Database Used
Healthy speech can be collected from standard speech corpora like TIMIT, RM1 or WSJ. CMU developed an audio database named the Resource Management (RM1) database. This corpus is a collection of recordings of discrete words and spelled words pertaining to a naval resource management task. The corpus has been further divided into portions for training, development, testing, and final evaluative testing. From these, 2235 speech files in the training set are used for collecting acoustic properties. For the properties of dysarthric speech, the Nemours database is used. The Nemours database contains speech of 11 different dysarthric persons [7]. Each speaker recorded 74 sentences. The sentences are of the form "The X is Y ing the Z", where the words X, Y, and Z are syntactically correct but semantically vacuous. One such sentence is "The dive is singing the phase". The first 37 sentences
contain the same words as the last 37 sentences, but with the order of the two nouns X and Z reversed.
Fig. 1. Block diagram for Error Rating

B. Calculation of z-score
Z-scores are calculated using the speech recognition engine CMU Sphinx3. For each of the speech files in the RM1 database, forced alignment is done using Sphinx3. During forced alignment, the search engine is given an exact transcription of what is being spoken in the speech data. The system then aligns the transcribed data with the speech data, identifying which time segments in the speech data correspond to particular words in the transcription, together with an acoustic score. Then, from the forced-aligned output, the acoustic scores (1 - log(acs)) of each phone with respect to its position are separated and their means and variances are calculated (Figure 2). For the given input dysarthric speech, the phone acoustic scores based on their positions are calculated, and the z-score is then computed using the equation:

z = (x - μ_i) / σ_i

where x is the acoustic score of the phone in the input dysarthric speech, μ_i is the mean acoustic score of the phone and σ_i is the variance of the acoustic scores of the phone. The phrase score is then calculated by taking the largest z-score of the phones in that phrase.

C. Rating
Fig. 2. Mean and variance of phone
Based on this variation of the z-score, ratings are assigned as shown below:
• 1-2: Good
• 3-5: Intelligible
• 6-8: Worse
These rating bands were selected by analyzing the variance of each phone in the RM1 database. We evaluated the rating method using ten dysarthric speech samples and five normal speech samples. The ratings of the normal speech fall in the range of 1 to 4. The ratings varied according to the severity of dysarthria; ratings of severe dysarthric speech were between six and eight. The automated z-score ratings were compared with manual ratings and showed great resemblance to them. We repeated the experiment with triphones; since scores based on triphones have more contextual dependency, they showed minimal variation among the words, which helps to improve the rating of sentences. From the results we conclude that this rating method can be used for evaluating the quality of dysarthric speech.
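As a rough illustration of the rating pipeline described in this section, the sketch below computes per-phone z-scores from forced-alignment acoustic scores and maps the phrase score onto the 1-2/3-5/6-8 bands above. It assumes the alignment scores have already been extracted from the Sphinx3 output into (phone, position, score) tuples; the function and variable names are illustrative, not part of the authors' implementation.

```python
from collections import defaultdict
from statistics import mean, pvariance

def phone_statistics(reference_alignments):
    """Collect mean and variance of acoustic scores per (phone, position)
    from forced-aligned healthy speech (e.g. the RM1 training files)."""
    scores = defaultdict(list)
    for utterance in reference_alignments:            # each: list of (phone, position, score)
        for phone, position, score in utterance:
            scores[(phone, position)].append(score)
    return {key: (mean(vals), pvariance(vals)) for key, vals in scores.items()}

def phrase_z_score(dysarthric_alignment, stats):
    """Largest per-phone z-score in the phrase, z = (x - mu) / sigma."""
    z_scores = []
    for phone, position, score in dysarthric_alignment:
        mu, var = stats.get((phone, position), (score, 1.0))   # unseen phones: neutral
        sigma = var if var > 0 else 1.0                        # the paper uses the variance directly
        z_scores.append((score - mu) / sigma)
    return max(z_scores) if z_scores else 0.0

def rating(z):
    """Map the phrase score onto the bands used above."""
    if z <= 2:
        return "Good"
    if z <= 5:
        return "Intelligible"
    return "Worse"
```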
The energy of the speech signal is also found to have increased from 0.01367 Pa²sec to 0.08545 Pa²sec.
III. DYSARTHRIC SPEECH ENHANCEMENT
A. Modifying formant trajectories
Speech enhancement is usually done by attenuating background noise, which improves the perception of speech. Dysarthric speech can be considered as noisy speech, and its enhancement means making the speech more intelligible. Dysarthric speech enhancement can be done in different ways. In normal speech we can see continuous variations in the short-term spectral characteristics, but in dysarthric speech, due to the poor articulation rate, the variations of the short-term spectra are slower and scattered over the speech. In our work we tried to improve dysarthric speech by modifying the formant trajectories. Figure 3 shows the block diagram of our method. In this experiment the smoothening of formant trajectories was done [Figure 4] for improving the intelligibility of dysarthric speech. The method is adopted from the work of Hesham Tolba [3]. Smoothened formant trajectories made the speech more intelligible.
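The smoothing step itself can be approximated in a few lines once formant tracks are available as per-frame arrays. The sketch below uses SciPy's Savitzky-Golay filter as a stand-in for the trajectory smoothing adopted from [3]; the window and polynomial order are placeholder values, and formant extraction is assumed to have been done elsewhere.

```python
import numpy as np
from scipy.signal import savgol_filter

def smooth_formant_tracks(formants, window=11, polyorder=2):
    """Smooth formant trajectories (rows = F1..Fn, columns = analysis frames).
    Assumes at least `window` frames; `window` must be odd and > polyorder."""
    return savgol_filter(np.asarray(formants, dtype=float),
                         window_length=window, polyorder=polyorder, axis=1)
```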
Table I. Result of z-score method
B. Enhancement using Wiener filter
Many dysarthric patients exhibit imprecise articulation of initial consonants. In dysarthric speech the consonants occurring at the articulation points are found to be similar to noise. These noise-like features can be suppressed by Wiener filtering. When comparing the range of formant frequencies after enhancement, we saw that the formant frequencies showed continuous variations with the speech signal. On hearing the speech, the perception level is also found to be increased, but unvoiced content was eliminated from the speech. This happens because the Wiener filter works by comparing the signal with a noise power ratio provided by the user. If the user does not specify the noise power ratio, then the noise is estimated as the average of the local variance of the input. In dysarthric speech this variance occurs at the places of articulation of
Fig. 3. Speech enhancement by modifying formant trajectories
consonants. As a result, the consonant sounds at those places are eliminated from the speech. The average energy of the filtered speech (0.01309 Pa²sec) is found to have decreased from that of the original (0.01367 Pa²sec).
Fig. 4. The above figure shows the formant frequencies of dysarthric speech (first) and modified dysarthric speech (second)
C. Our Method for Enhancement
Unvoiced sounds will be absent from the output speech when we apply the above-mentioned method. In order to solve this issue, the dysarthric speech is divided into voiced and unvoiced components and the Wiener filter is applied separately on them [Figure 5]. In that case the unvoiced consonants at the places of articulation are not eliminated. Comparing the energy of the original and the enhanced signal, the energy is increased from 0.01367 Pa²sec to 0.08860 Pa²sec.
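A minimal sketch of this idea is shown below, assuming a crude energy/zero-crossing decision for the voiced/unvoiced split and SciPy's wiener function for the filtering; the thresholds and window sizes are illustrative guesses, not the settings used in these experiments, and the input is assumed to be roughly amplitude-normalized.

```python
import numpy as np
from scipy.signal import wiener

def enhance_dysarthric_speech(signal, rate, frame_ms=25):
    """Apply Wiener filtering separately to voiced and unvoiced frames so that
    unvoiced consonants are not suppressed along with the noise-like segments."""
    frame_len = int(rate * frame_ms / 1000)
    out = np.asarray(signal, dtype=float).copy()
    for start in range(0, len(out) - frame_len, frame_len):
        frame = out[start:start + frame_len]
        energy = np.mean(frame ** 2)
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2
        voiced = energy > 1e-4 and zcr < 0.25           # crude voiced/unvoiced decision
        # Stronger smoothing for voiced frames, gentler for unvoiced ones
        out[start:start + frame_len] = wiener(frame, mysize=29 if voiced else 7)
    return out
```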
Fig. 5. Block diagram showing enhancement of dysarthric speech
IV. CONCLUSION
We devised an error rating mechanism for dysarthric speech after experimenting with different speech training tools. This rating mechanism can be integrated into a speech therapy tool for dysarthric people, with which they can train and improve themselves. Going through the literature we understood that a number of studies have been carried out on dysarthric speech enhancement. From our experiments we can conclude that a combination of methods is more helpful in increasing the intelligibility of dysarthric speech than any single one. In future we are planning to make a hardware device that could help in dysarthric speech classification, training and also dysarthric speech enhancement.

REFERENCES
[1] Alexander B. Kain, John-Paul Hosom, Xiaochuan Niu, Jan P. H. van Santen, Melanie Fried-Oken and Janice Staehely, "Improving the intelligibility of dysarthric speech", Speech Communication 49 (2007) 743-759.
[2] Divya Das, Dr. C. Santhosh Kumar and Dr. P. C. Reghu Raj, "Dysarthric Speech Enhancement using Formant Trajectory Refinement", International Journal of Latest Trends in Engineering and Technology, Vol. 2, Issue 4, July 2014, pages 88-92.
[3] Hesham Tolba and Ahmed S. EL Torgoman, "Towards the Improvement of Automatic Recognition of Dysarthric Speech", IEEE 978-1-4244-4520-2/09, 2009.
[4] Lalitha V, Prema P and Lazar Mathew, "A Kepstrum Based Approach for Enhancement of Dysarthric Speech", 2010 3rd International Congress on Image and Signal Processing (CISP 2010).
[5] Ronanki Srikanth, Li Bo and James Salsman, "Automatic Pronunciation Evaluation And Mispronunciation Detection Using CMU Sphinx", Proceedings of the Workshop on Speech and Language Processing Tools in Education, pages 61-68, COLING 2012.
[6] Woo Kyeong Seong, Ji Hun Park, and Hong Kook Kim, "A Noise Reduction Method Incorporating Consonant and Vowel Characteristics for Dysarthric Speech Recognition", ICCA 2013, ASTL Vol. 24, pp. 60-63, 2013.
[7] Xavier Menéndez-Pidal, James B. Polikoff, Shirley M. Peters, Jennie E. Leonzio and H. T. Bunnell, "The Nemours Database of Dysarthric Speech", Wilmington, DE 19899, USA.
[8] https://www.lti.cs.cmu.edu/research/fluency [visited November 2014].
Scalable machine learning for smarter Big Data predictions Machine learning helps enterprises use all of their data for better real-time predictions, improved decision-making processes and analyzing patterns. Sri Satish Ambati, the CEO of H2O discusses how machine learning enables users to get more value from their existing data and easily create smarter business models. The machine learning product of H2O is an open source, parallel processing system that is developed for high performance and scalability. With this system H2O hopes to woo data scientists and businesses with a powerful yet easy to use data analysis platform. According to Ambati, machine learning is the new SQL. In the past, SQL defined data in the form of databases, but machine learning is driving better data-driven predictions. He mentioned scenarios such as fraud prevention, pattern recognition and faster predictive analytics as all part of the machine learning tool set. http://siliconangle.com/blog/2015/03/11/scalable-machine-learning-for-smarter-big-datapredictions-bigdatasv/
Scalable Systems for Big Data Analytics in the Context of Sentimental Analysis on Social Networks Thabshira K Shamsudheen1, Viju P Poonthottam2 Department of Information Technology, MES College of Engineering Kuttipuram, thabshishams@gmail.com1, vijupoonthottam@yahoo.co.in2 ABSTRACT: An overflow of data has occurred due to recent technological advancements in distinct domains over the past two decades. The term big data was coined to capture the profound meaning of this data-explosion trend. This paper focuses on scalable big-data systems, which include a set of tools and mechanisms to load, extract, and improve disparate data while leveraging massively parallel processing power to perform complex transformations and analysis. Owing to the uniqueness of big data, designing a scalable big-data system faces a series of technical challenges, which we discuss here. We are confident that this will help to provide an overall picture for non-expert readers and instill a do-it-yourself spirit in advanced audiences to customize their own big-data solutions. First, we present the definition of big data and discuss big data challenges.
Keywords: Big data analytics, Cloud Computing, Data acquisition, Data storage, Data analytics, Sentiment analysis, Sentimental influence, Social networks.
I. INTRODUCTION
Big data is now a reality: the volume, variety and velocity of data coming into your organization continue to reach unprecedented levels. This phenomenal growth means that not only must you understand big data in order to decipher the information that truly counts, but you also must understand the possibilities of what you can do with big data analytics [1]. The emerging big-data paradigm has profoundly transformed our society and will continue to attract diverse attention from both technological experts
and the public in general. For instance, an IDC report predicts that, from 2005 to 2020, the global data volume will grow by a factor of 300, from 130 exabytes to 40,000 exabytes, doubling every two years. The term big data was coined to capture the profound meaning of this data-explosion trend. Nature and Science magazines have published special issues to discuss the big-data phenomenon and its challenges, expanding its impact beyond technological domains. As a result, this growing interest in big data from diverse domains demands a clear and intuitive understanding of its definition, evolutionary history, building technologies and potential challenges [2]. Big data analytics is the process of examining big data to uncover hidden patterns, unknown correlations and other useful information that can be used to make better decisions. With big data analytics, data scientists and others can analyze huge volumes of data that conventional analytics and business intelligence solutions can't touch. Consider this: it's possible that your organization could accumulate (if it hasn't already) billions of rows of data with hundreds of millions of data combinations in multiple data stores and abundant formats. High-performance analytics is necessary to process that much data in order to figure out what's important and what isn't [1]. With the rapid development and increasing popularity of social networks, more and more interest has arisen in obtaining information from social networking websites for analyzing people's behaviors. We can judge the sentimental influence of a post through the sentiment of the post and its receptors. Since this involves big data, we can explain the big data value chain in the context of sentiment analysis.
II. BIG DATA: DEFINITION & HISTORY
This section gives the basic ideas of big data through some simple definitions and a glance at the history of big data.

A. Big Data Definition
Big data is a term for massive data sets having a large, varied and complex structure, with difficulties in storing, analyzing and visualizing them for further processes or results. The process of research into massive amounts of data to reveal hidden patterns and secret correlations is named big data analytics [3]. Big data can be characterized by 3Vs: the extreme volume of data, the wide variety of types of data and the velocity at which the data must be processed. Although big data doesn't refer to any specific quantity, the term is often used when speaking about petabytes and exabytes of data, much of which cannot be integrated easily [4].

Fig. 1. Big Data: Depiction

B. Big Data History
The history of big data tells how big data evolved into its current stages. The history of big data is presented here in terms of the data size of interest; based on that, there are different stages of evolution [2].

Fig. 2. Big Data: A brief history of Big Data with major milestones
Megabyte to Gigabyte: In the 1970s and 1980s, historical business data introduced the earliest big data challenge in moving from megabyte to gigabyte sizes. The urgent need at that time was to house that data and run relational queries for business analyses and reporting.
Gigabyte to Terabyte: In the late 1980s, the popularization of digital technology caused data volumes to expand to several gigabytes or even a terabyte, which is beyond the storage and/or processing capabilities of a single large computer system. Data parallelization was proposed to extend storage capabilities and to improve performance by distributing data and related tasks, such as building indexes and evaluating queries, across disparate hardware.
Terabyte to Petabyte: During the late 1990s, when the database community was admiring its finished work on the parallel database, the rapid development of Web 1.0 led the whole world into the Internet era, along with massive semi-structured or unstructured web pages holding terabytes or petabytes (PBs) of data. The resulting need for search companies was to index and query the mushrooming content of the web.
Petabyte to Exabyte: Under current development trends, data stored and analyzed by big companies will undoubtedly reach the PB to exabyte magnitude soon. However, current technology still handles terabyte to PB data; there has been no revolutionary technology developed to cope with larger datasets.
Fig. 3. Big data technology map. It pivots on two axes, i.e., data value chain and timeline. The data value chain divides the data lifecycle into four stages, including data generation, data acquisition, data storage, and data analytics. In each stage, we highlight exemplary technologies over the past 10 years.
III. BIG DATA SYSTEM ARCHITECTURE
In this section, we focus on the value chain for big data analytics. Specifically, we describe a big data value chain that consists of four stages (generation, acquisition, storage, and processing).

A. Big Data System: A Value Chain View
A big-data system is complex, providing functions to deal with different phases in the digital data life cycle, ranging from its birth to its destruction. Here we describe the four modules of the big data value chain [2].
Data generation concerns how data are generated. In this case, the term big data is designated to mean large, diverse, and complex datasets that are generated from various longitudinal and/or distributed data sources, including sensors, video, click streams, and other available digital sources.
Data acquisition refers to the process of obtaining information and is subdivided into data collection, data transmission, and data pre-processing. First, because data may come from a diverse set of sources - websites that host formatted text, images and/or videos - data collection refers to dedicated data collection technology that acquires raw data from a specific data production environment. Second, after collecting raw data, we need a high-speed transmission mechanism to transmit the data into the proper storage sustaining system for various types of analytical applications. Finally, collected datasets might contain much meaningless data, which unnecessarily increases the amount of storage space and affects the consequent data analysis.
Data storage concerns persistently storing and managing large-scale datasets. A data storage system can be divided into two parts: hardware infrastructure and data management. Hardware infrastructure consists of a pool of shared ICT resources organized in an elastic way for various tasks in response to their instantaneous demand. The hardware infrastructure should be able to scale up and out and be able to be dynamically reconfigured to address different types of application environments. Data management software is deployed on top of the hardware infrastructure to maintain large-scale datasets.
Data analysis leverages analytical methods or tools to inspect, transform, and model data to extract value. Emerging analytics research can be classified into six critical technical areas: structured data
analytics, text analytics, multimedia analytics, web analytics, network analytics, and mobile analytics. This classification is intended to highlight the key data characteristics of each area.

B. Big Data Technology Map
In this section, we present a big data technology map, as illustrated in Fig. 3. In this technology map, we associate a list of enabling technologies, both open-source and proprietary, with different stages in the big data value chain. This map reflects the development trends of big data. In the data generation stage, the structure of big data becomes increasingly complex, from structured or unstructured to a mixture of different types, whereas data sources become increasingly diverse. In the data acquisition stage, data collection, data pre-processing, and data transmission research emerge at different times. Most research in the data storage stage began in approximately 2005. The fundamental methods of data analytics were built before 2000, and subsequent research attempts to leverage these methods to solve domain-specific problems. Moreover, qualified technologies or methods associated with different stages can be chosen from this map to customize a big data system.

C. Big Data System: A Layered View
The big data system can be decomposed into a layered structure, as illustrated in Fig. 4. The three layers are as follows.
The infrastructure layer consists of a pool of ICT resources, which can be organized by cloud computing infrastructure and enabled by virtualization technology. Within this model, resources must be allocated to meet the big data demand while achieving resource efficiency by maximizing system utilization, energy awareness, operational simplification, etc.
Fig. 4. Layered architecture of big data system. It can be decomposed into three layers, including infrastructure layer, computing layer, and application layer, from bottom to top.
The computing layer encapsulates various data tools into a middleware layer that runs over raw ICT resources. In the context of big data, typical tools include data integration, data management, and the programming model. Data integration means acquiring data from disparate sources and integrating the dataset into a unified form with the necessary data pre-processing operations. Data management refers to mechanisms and tools that provide persistent data storage and highly efficient management, such as distributed file systems and SQL or NoSQL data stores. The programming model implements the abstraction of application logic and facilitates the data analysis applications. The application layer exploits the interface provided by the programming
models to implement various data analysis functions, including querying, statistical analyses, clustering, and classification; then, it combines basic analytical methods to develop various related applications.

D. Big Data System Challenges
Building a big data analytics system is not a trivial or straightforward task. As one of its definitions suggests, big data is beyond the capability of current hardware and software platforms. The new hardware and software platforms in turn demand new infrastructure and models to address the wide range of challenges of big data. Data collection and management must address massive amounts of heterogeneous and complex data. The following challenges of big data must be met:
Data Representation: Many datasets are heterogeneous in type, structure, semantics, organization, granularity, and accessibility. A competent data representation should be designed to reflect the structure, hierarchy, and diversity of the data, and an integration technique should be designed to enable efficient operations across different datasets.
Redundancy Reduction and Data Compression: Typically, there is a large amount of redundant data in raw datasets. Redundancy reduction and data compression without sacrificing potential value are efficient ways to lessen overall system overhead.
Data Life-Cycle Management: Pervasive sensing and computing are generating data at an unprecedented rate and scale that exceed much smaller advances in storage system technologies. One of the urgent challenges is that current storage systems cannot host the massive data. In general, the value concealed in big data depends on data freshness; therefore, we should set up a data importance principle associated with the analysis value to decide what parts of the data should be archived and what parts should be discarded.
Data Privacy and Security: With the proliferation of online services and mobile phones, privacy and security concerns regarding accessing and analyzing personal information are growing. It is critical to understand what support for privacy must be provided at the platform level to eliminate privacy leakage and to facilitate various analyses.
There are some other challenges due to the massive data which are just listed here: Approximate Analytics, Connecting Social Media, Deep Analytics, Energy Management, Scalability, and Collaboration.

IV. BIG DATA IN SENTIMENTAL ANALYSIS
Large amounts of posts are generated on social networks every day. People are curious about finding the influence among them. However, we cannot be sure whether the influence on other posts is positive or negative if their sentimental information is not considered. Since social networking sites deal with big data, we explain here the various modules of the big data value chain for the successful analysis of the
sentimental influence of various posts on social networks.

A. Data Generation
Large amounts of information from social networking sites have raised the interest of researchers and scholars to make use of the information for analysis and prediction; this in turn leads to big data generation. Profile of Mood States (POMS) is a rating system used to assess current mood states. Data required for the processing include the emotion words and emoticons. SentiWordNet is a popular lexicon resource for sentiment analysis [5]. It gives sentiment scores to English words so that the overall sentiment of a word can be estimated by the one with the highest score. Emoticons and the emotion words are categorized under the different sentiments and stored in a dictionary.
1) Data Attributes: Volume is the sheer volume of datasets. Velocity is the data generation rate and real-time requirement. Variety refers to the data form, i.e., structured, semi-structured and unstructured. Horizontal Scalability is the ability to join multiple datasets. Relational Limitation includes two categories, special forms of data and particular queries. Special forms of data include temporal data and spatial data. Particular queries may be recursive or of another type. The attribute for the sentiment of a post is defined as Si. Its value can be positive (POS), negative (NEG) or neutral (NEU), which will be judged from the post content. Count the number of times positive and negative emoticons or emotional words appear in the post content, and calculate their difference. If the result is positive, Si is considered positive; it is negative if the result is negative, and neutral otherwise.

B. Data Acquisition
The task of the data acquisition phase is to aggregate information in a digital form for further storage and analysis. Intuitively, the acquisition process consists of three sub-steps: data collection, data transmission, and data pre-processing.
1) Data collection: Data collection refers to the process of retrieving raw data from real-world objects. Here the posts on the social networks are collected from various sites. Common methods for this purpose include sensors, log files and web crawlers.
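As a rough illustration of the lexicon-based rule for the post-sentiment attribute Si described under Data Generation above, the sketch below counts positive and negative emotion words or emoticons and labels the post from their difference. The word lists here are tiny placeholders; the paper relies on SentiWordNet and POMS-style dictionaries instead.

```python
POSITIVE = {"happy", "great", "love", ":)", ":-)"}
NEGATIVE = {"sad", "terrible", "hate", ":(", ":-("}

def post_sentiment(post_text):
    """Si for a post: count positive and negative emotion words/emoticons,
    take the difference, and label POS, NEG or NEU accordingly."""
    tokens = post_text.lower().split()
    diff = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if diff > 0:
        return "POS"
    if diff < 0:
        return "NEG"
    return "NEU"

print(post_sentiment("I love this great phone :)"))   # -> POS
```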
Sensors are commonly used to measure a physical quantity and convert it into a readable digital signal for processing (and possibly storing). Sensor types include acoustic, sound, vibration, automotive, chemical, electric current, weather, pressure, thermal, and proximity. Log files, one of the most widely deployed data collection methods, are generated by data source systems to record activities in a specified file format for subsequent analysis. A crawler is a program that downloads and stores webpages for a search engine. As far as sentiment analysis is concerned, the better approach is log files, because we need to compare and analyze the data acquired with the data stored for the decision-making procedure.
2) Data Center Transmission: When big data is transmitted into the data center, it will be transferred within the data center for placement adjustment, processing, and so on. This process is referred to as data center transmission. Here it is used to find the placement of data in the categories of emotions in the dictionary.
3) Data Pre-processing: Data pre-processing techniques designed to improve data quality should be in place in big data systems.
4) Integration: Data integration techniques aim to combine data residing in different sources and provide users with a unified view of the data. Data from various social sites are combined to make the input to the analysis section.
5) Cleansing: The data cleansing technique refers to the process of determining inaccurate, incomplete, or unreasonable data and then amending or removing these data to improve data quality.

C. Data Storage
The data storage subsystem in a big data platform organizes the collected information in a convenient format for analysis and value extraction. For this purpose, the data storage subsystem should provide two sets of features:
The storage infrastructure must accommodate information persistently and reliably. The data storage subsystem must provide a scalable access interface to query and analyze a vast quantity of data.
In the sentimental analysis, the data to be stored include the input posts from social websites and the collections of emotion words and emoticons. Here they are stored using a dictionary, which is a scalable structure that can be extended when needed.

D. Data Analysis
Sentimental analysis addresses the following research questions:
What is the difference in sentiment variance between public topics and personal topics? How much does the sentiment of influential posts affect the receptors' sentiment? Do posts on the same topic have a similar sentiment trend on different social media platforms?
The sentiment on different topics should be different. It is found that in most cases the sentiment variance on public topics is higher than on personal topics.
The sentimental influence has been defined in two types: compliance and opposition. In order to know whether the types of the influencers' sentiment have different effects on their sentimental influence, the average complying rate and average opposing rate are calculated for the influencers with positive and negative sentiment separately. The complying rates are very high for the influencers with negative sentiment on public topics, and for those with positive sentiment on personal topics. As for the opposing rate, it is relatively high for the influencers with negative sentiment on personal topics. There are many more differences across the three platforms. This result is reasonable because the details of post contents from different social networks can be very different from each other, although they belong to the same topic category.

V. CONCLUSION
The era of big data is upon us, bringing with it an urgent need for advanced data acquisition, management, and analysis mechanisms. In this paper, we have presented the concept of big data and highlighted the big data value chain, which covers the entire big data lifecycle. The big data value chain consists of four phases: data generation, data acquisition, data storage, and data analysis. We explained these modules in the context of sentimental analysis, which helps us most to understand the requirement for scalable systems for big data analytics.

REFERENCES
[1] http://searchbusinessanalytics.techtarget.com/, visited in December 2014.
[2] Han Hu, Yonggang Wen, Tat-Seng Chua, Xuelong Li, "Toward scalable systems for big data analytics", IEEE Access, volume 2, pages 652-687, 24 June 2014.
[3] Sagiroglu S, Sinanc D, "Big data: A review", International Conference on Collaboration Technologies and Systems (CTS), pages 42-47, May 2013.
[4] http://blogs.sas.com/, December 2014.
[5] Beiming Sun, Vincent T. Y. Ng, "Analyzing sentimental influence of posts on social networks", Proceedings of the 2014 IEEE 18th International Conference on Computer Supported Cooperative Work in Design (CSCWD), pages 546-551, IEEE, May 2014.
Computational Prediction of Transcription Factor Binding Site and Affinity based on DNA Features Sheeba K1, Achuthsankar S. Nair2 Department of Computational Biology and Bioinformatics, University of Kerala, Thiruvananthapuram sheebaktvm@gmail.com1, sankar.achuth@gmail.com2 ABSTRACT: Transcription factors are critical proteins for sequence-specific control of transcriptional regulation; when such a protein binds to DNA it regulates the biological function of the DNA, mainly the expression of genes. These regulatory proteins, which bind DNA in promoter regions of the genome and either enhance or repress gene expression, are called Transcription Factors (TFs). The initiation of transcription is the essential step for most cellular responses to environmental conditions and for cell and tissue specificity. In this article, we investigate the information relevant to transcription factor binding sites that can be retrieved from local DNA structural properties rather than the nucleotide sequence alone. We present a feature-based model, a probabilistic method for modelling TF-DNA interactions based on sequence, conformational and physicochemical features, to represent TF binding specificities. We developed the mathematical formulation of our model and implemented an algorithm for extracting the structural features from binding site data. This model predicts the binding site and affinity of a protein towards DNA sequences based on a PSSM (Position Specific Scoring Matrix), binding energy and affinity. The above algorithm was used to build a web tool based on the binding energy and affinity of the sequence. Along with this study, in an associated work on the regulation of the Sir2p gene of Saccharomyces cerevisiae, we analysed the 429 base pairs upstream and experimentally validated the result.
Keywords: Gene expression, Gene regulation, Sequence motif, Transcription factor binding site.
I. INTRODUCTION
Gene regulation is essential for organisms to increase the versatility and adaptability, by allowing the cell to express protein when needed. Regulation may occur at any point of time during the expression of a gene, from the start of transcription to the post translational modifications. In eukaryotic cells, the start of transcription is one of the most complicated
parts of gene regulation. Transcription factors have an important role in regulating gene expression. They control the flow of genetic information from DNA to RNA. A transcription factor binding site is the DNA sequence where a transcription factor binds. The DNA binding sites for a given transcription factor are usually all different, with variation in the degree of affinity of the transcription factor for the different binding sites. The affinity-based approach may allow a more quantitative analysis for the prediction of transcription factor binding sites. Transcription factors bind to DNA in a sequence-specific manner to stimulate transcription from the core promoter. The accurate prediction of transcription factor binding sites is useful for getting information about the function and regulation of genes discovered in genome sequencing projects. There are numerous databases that store information on known transcription factors and their binding sites. Different computational methods are currently available for identifying transcription factor binding sites and affinity in the genomes of different species. The most common model used to represent TF binding specificities is a PSSM, which assumes independence between binding positions. The construction of PSSMs has been applied extensively in studies related to regulatory elements in the genome. The major challenge faced by the biological research community is handling the increasing volume of experimental data and unravelling the hidden information embedded in these biological data. Data mining techniques enable biologists to examine biological sequences in a meaningful manner. One mining method is pattern discovery from large data sets involving the application of statistical methods, machine learning and database management. The work presented here is the mining of the biological features of the transcription factor binding sites and using the significant ones for prediction.
II. MATERIALS AND METHODS

A. Materials
The TFs of Saccharomyces cerevisiae are collected from the YEASTRACT database. The database contains 170 transcription factors and seven hundred and two (702) transcription factor binding sites. These data were downloaded and used for the prediction.

B. Features Studied
The DNA features studied after a literature review were classified into three categories, mainly sequence dependent, conformational and physicochemical, as listed in Table I.

Table I: DNA Features studied
| Features type   | Features |
| Sequence based  | Palindrome, Position Specific Scoring Matrix, GC Content |
| Conformational  | Twist, Rise, Bend, Tip, Inclination, Major Groove Width, Major Groove Depth, Major Groove Size, Major Groove Distance, Minor Groove Width, Minor Groove Depth, Minor Groove Size, Minor Groove Distance, Persistence Length, Propeller Twist, Clash Strength |
| Physicochemical | Stacking energy, Energy, Melting Temperature, Probability of contacting nucleosome core, Mobility to bend towards major groove, Mobility to bend towards minor groove, Enthalpy, Entropy |

C. Features Selected
The features studied were optimized and the four selected are listed in Table II.

Table II: Features selected for finding binding site
| Features type   | Feature name |
| Sequence based  | Palindrome, Position Specific Scoring Matrix |
| Conformational  | Energy |
| Physicochemical | Entropy |
D. Palindrome
Palindromes are defined as identical inverted repeats and are generally thought to be biologically significant. A palindrome is a string whose 5' to 3' sequence is identical to the 5' to 3' sequence of its reverse complementary strand. For example, in this 15-mer palindromic site, AATGCCCATCGATAA, the palindromes are underlined. A program was designed to detect the palindrome structure contained within a binding site sequence.
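A small sketch of such a detector is given below: it reports substrings of a binding site that equal their own reverse complement (identical inverted repeats). The minimum length parameter is an assumption for illustration, not a value taken from the paper.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def palindromic_substrings(site, min_len=4):
    """Return substrings of a binding site that equal their own reverse
    complement (identical inverted repeats), e.g. ATCGAT."""
    found = set()
    for i in range(len(site)):
        for j in range(i + min_len, len(site) + 1):
            sub = site[i:j]
            if sub == reverse_complement(sub):
                found.add(sub)
    return sorted(found, key=len, reverse=True)

print(palindromic_substrings("AATGCCCATCGATAA"))   # e.g. ['ATCGAT', ...]
```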
E. Position Specific Scoring Matrix
A position specific scoring matrix represents patterns in biological sequences. It is derived by aligning a set of sequences. The values of the distribution matrix are 0 and 1: when a base is present, the corresponding base value is 1, otherwise 0.

F. Energy
The Position Specific Scoring Matrix used to predict the binding strength of a given transcription factor to a promoter sequence is employed on the basis of the biophysical model [2], which calculates the local affinity by computing the occupancy of a TF at each site in the sequence using the equation:

p_i = R0 e^(-ΔE_i(λ)) / (1 + R0 e^(-ΔE_i(λ)))    (1)

where ΔE_i(λ) is the energy difference or mismatch energy, scaled by the parameter λ, between the binding energy of the factor to site i and the lowest binding energy possible with the factor bound to its consensus site [1]. The second matrix-dependent parameter R0 sets the binding energy of the factor to the consensus site as well as the TF concentration. The mismatch energy of the nucleotide at each site in the promoter sequence is computed by:

ΔE_i(λ) = -(1/λ) Σ_(α=A,C,G,T) ln(v_i,α / v_i,max)    (2)

where v_i,max is the frequency of the consensus base at position i in the PSSM and v_i,α is the frequency of the observed base at position i in the matrix.

G. Entropy
The measure of order or disorder in sequences is Shannon's entropy [5]. For a random variable (random molecule) X with N states, the entropy H of X is given by Shannon's formula with probabilities p_i:

H(X) = -Σ_(i=1..N) p_i log p_i    (3)


order of most frequent content, either nucleotide or amino acid, found at each position in the alignment of the sequences. It represents the results of a multiple sequence alignment in which related sequences are compared to each other and similar sequence motifs are calculated.
• Position Weight Matrix: The position weight matrix (PWM) is not only one of the most widely used bioinformatics methods, but also a key component in more advanced computational algorithms for characterizing and discovering motifs in nucleotide or amino acid sequences.
• Position Specific Scoring Matrix (PSSM): The PSSM is the most widely used method to represent the specificity of transcription factor/DNA interactions. A PSSM can be constructed on the basis of a set of known binding sites for the factor of interest. The information for a PSSM can be obtained from various transcription factor databases.
B. Feature Extraction
The feature value for a given nucleotide sequence is extracted by converting the dinucleotide frequencies into a numeric value for each specific DNA structural characteristic at all positions of the TFBS.
C. Method Implemented for TFBS prediction
In our method, the PSSM was constructed computationally from the binding sites of 25 TFs of length varying from 5 to 11. The total dataset was divided in the ratio 80:20 into training and test sets.
IV. IMPLEMENTATION Initially a database of transcription factors of Saccharomyces cerevisiae (Yeast) is created and then the PSSM which is a numerical representation of the DNA sequences (Fig 1.) for the transcription factors are generated computationally and stored.
Fig1. Position Specific Scoring Matrix of TF Aft2p
Features are selected from the extracted properties of DNA, namely PSSM, energy, entropy and palindrome. Prediction is done for a given sequence based on the energy and affinity from the PSSM stored in the database. The features PSSM and energy were used for prediction using the above-described method, and a new tool for predicting the transcription factor binding site and affinity in Saccharomyces cerevisiae was developed; it is explained in the concluding part of this work.

V. RESULTS AND DISCUSSIONS
A. TFBS prediction
The features extracted were studied in detail and the analysis was done for transcription factor binding site prediction. The palindrome values and the information content (entropy) generated for various TFs, namely Aft2p, Arr1p, Cup9, Dot6, Ert1p, Gcn4 and Sum1 of Saccharomyces cerevisiae, are tabulated in Fig. 2 and Fig. 3:
Fig2. Palindrome for TFBS
Fig3. Entropy of TFBS
Sequence logos are a graphical representation of sequence alignments. Each logo consists of a pile of symbols, one pile for each position in the sequence. The overall height of the pile is proportional to the information content at that position, while the height of the symbols within the pile indicates the relative frequency of each nucleic acid at that position. Fig. 4 shows the sequence logo for Gcn4. This method for predicting the binding site and affinity of a protein towards DNA sequences is based on the PSSM (Position Specific Scoring Matrix), binding energy and affinity. We implemented the above algorithm to devise a web tool based on the binding energy and affinity of the sequence. The details are listed in Table III.
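The per-position letter heights of such a logo come from the information content of each PSSM column, which for DNA is 2 minus the Shannon entropy of Eq. (3). A minimal sketch, assuming a uniform background distribution:

```python
import math

def column_information(freqs):
    """Information content (bits) of one PSSM column: 2 - H, with
    H = -sum(p * log2 p) over the observed base frequencies."""
    entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    return 2.0 - entropy

# A column strongly preferring G carries more information than a uniform one
print(column_information({"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05}))  # ~1.2 bits
print(column_information({"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}))  # 0 bits
```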
Fig4. Sequence logo (Information content) of transcription factor Gcn4p

Table III. Predicted result: transcription factor name, length and energy value for Saccharomyces cerevisiae
| TF Name | TF Length | Energy value | Predicted Result (Binding Site) |
| Aft2p   | 7  | 0.011     | Y |
| Dal80p  | 7  | >1.57     | N |
| Ecm22p  | 7  | 0.0443    | Y |
| Gat3p   | 7  | >1.57     | N |
| Gcn4p   | 7  | 0.0376    | Y |
| Rds1p   | 7  | 0.0978    | Y |
| Ace2p   | 8  | 0.0403    | Y |
| Adr1p   | 8  | 0.0849    | Y |
| Tbf1p   | 8  | 0.0483    | Y |
| Dal81p  | 11 | 0.0003812 | Best site |
| Sip4p   | 13 | 0.0001339 | Best site |
The energy and affinity graphs are generated for all TFs. An example for the TF Gcn4p is shown in Fig. 5 and Fig. 6.

VI. ASSOCIATED COLLABORATIVE WORK
As part of a collaborative work on the bioinformatics prediction of transcription factor binding to the Sir2p gene, we analyzed the 429 base pairs upstream of the Sir2 regulatory region to identify transcription factor binding sites of Saccharomyces cerevisiae.
Fig 5. Affinity graph of Gcn4p
The analysis for finding transcription factors was performed using the statistical method [2] employed in the most widely used Transcription Factor Affinity Prediction (TRAP) tool. This result was verified in the wet lab and experimentally validated [10], which was a collaborative work.

Fig 6. Binding energy graph of Gcn4p

VII. CONCLUSION
Identifying transcription factor binding sites (TFBSs) is helpful for understanding the mechanism of transcriptional regulation. The vast amount and diversity of genomic data provide an excellent opportunity for identifying TFBSs. We provide a systematic procedure of computationally predicted tissue-specific binding targets for 110 yeast TFs and 500 transcription factor binding sites. We predict binding sites with the help of a tool in which we used a statistical method, the PSSM, and biochemical properties such as binding energies and affinity. As per suggestions, machine learning methods can be used to create various predictive models for predicting binding sites. Developing methods to integrate various types of DNA features, giving emphasis to conformational and physicochemical data, has become a major trend in this pursuit. We intend to improve this approach by including more DNA properties as parameters for binding site prediction and by adding more species and predicting their binding sites. Such a comprehensive resource shall become useful for researchers studying gene regulation.

REFERENCES
[1] Roider H. G., Kanhere A., Manke T., and Vingron M., "Predicting transcription factor affinities to DNA from a biophysical model", Bioinformatics (Oxford, England), 23(2), 134-141. doi:10.1093/bioinformatics/btl565, 2007.
[2] Manke T., Roider H. G., Vingron M., "Statistical Modeling of Transcription Factor Binding Affinities Predicts Regulatory Interactions", PLoS Comput Biol 4(3): e1000039. doi:10.1371/journal.pcbi.1000039, 2008.
[3] Teixeira M. C., Monteiro P., Jain P., Tenreiro S., Fernandes A. R., Mira N. P., Alenquer M., et al., "The YEASTRACT database: a tool for the analysis of transcription regulatory associations in Saccharomyces cerevisiae", 34, 446-451. doi:10.1093/nar/gkj013, 2006.
[4] Sandelin A., Alkema W., Engström P., Wasserman W. W., & Lenhard B., "JASPAR: an open-access database for eukaryotic transcription factor binding profiles", Nucleic Acids Research, 32(Database issue), D91-94. doi:10.1093/nar/gkh01, 2004.
[5] C. E. Shannon, "A Mathematical Theory of Communication", reprinted with corrections from The Bell System Technical Journal, Vol. 27, pp. 379-423, 623-656, July, October, 1948.
[6] Hannenhalli S., "Eukaryotic transcription factor binding sites - modeling and integrative search methods", Bioinformatics, 24:1325-1331. DOI:10.1093/bioinformatics/btn198, 2008.
[7] Chollier Morgane T., Hufton A., Heinig M., O'Keeffee S., El Masri N., Roider H. G., Manke T., Vingron M., "Transcription factor binding predictions using TRAP for the analysis of ChIP-Seq data and regulatory SNPs", Nature Protocols, doi:10.1038/nprot.2011.
[8] Che D., Jensen S., Cai L. and Liu S., "BEST: Binding-site Estimation Suite of Tools", Bioinformatics: 2909-2911, DOI:10.1093/bioinformatics/bti425, 2005.
[9] T. D. Schneider, G. D. Stormo, L. Gold, and A. Ehrenfeucht, "Information content of binding sites on nucleotide sequences", Journal of Molecular Biology, 188:415-431, 1986.
[10] Shyamasree Laskar, Sheeba K, Mrinal Bhattacharyya, Achuthsankar Nair, Pawan Dhar and Sunanda Bhattacharyya, "Inheritance of heat stress induced Cup9 dependent transcriptional regulation of Sir2", accepted by Molecular and Cellular Biology, December 2014.
[11] Qian Ziliang, et al., "An efficient method for statistical significance calculation of transcription factor binding sites", Bioinformation, 2(3), 2007.
Natural Language Processing System Accurately Measures Colonoscopy Quality A natural language processing program provided accurate tracking of colonoscopy quality measures and assignment of surveillance intervals, according to recent study data. Imler and colleagues created and validated the performance of an NLP system with 19 measures for quantifying adenoma detection rate and providing surveillance recommendations using data from 13 Veterans Affairs endoscopy units. From 42,569 colonoscopy reports with pathology records, they randomly selected 250 for the training set to refine the NLP and 500 for the test set. Masked, paired, annotated expert manual review was used to create the reference standard, and the remaining 41,819 non annotated records were processed through the NLP system to evaluate consistency. NLP and annotation resulted in similar rates of pathologic findings. Accuracy of the test set was 99.6% for colorectal cancer, 95% for advanced adenoma, 94.6% for non-advanced adenoma, 99.8% for advanced sessile serrated polyps, 99.2% for non-advanced sessile serrated polyps, 96.8% for large hyperplastic polyps and 96% for small hyperplastic polyps. http://www.healio.com/gastroenterology/interventionalendoscopy/news/online/%7B8aab57ff592e-4d7c-a84d-20eb5189b1d0%7D/natural-language-processing-system-accuratelymeasures-colonoscopy-quality
Article Invitation for CLEAR June 2015 We are inviting thought-provoking articles, interesting dialogues and healthy debates on multifaceted aspects of Computational Linguistics, for the forthcoming issue of CLEAR (Computational Linguistics in Engineering And Research) Journal, publishing on June 2015. The suggested areas of discussion are:
The articles may be sent to the Editor on or before 10th June, 2015 through the email simplequest.in@gmail.com. For more details visit: www.simplegroups.in
Editor, CLEAR Journal
Representative, SIMPLE Groups
Hello World, CLEAR Journal is a forum for research on computational linguistics and natural language processing. Computational linguistics is the study of computer processing, understanding, and generation of human languages. Techniques from computational linguistics are used in applications such as machine translation, speech recognition, information retrieval, intelligent Web searching, and intelligent spelling checking.
Last year GEC, Sreekrishnapuram hosted NC-CLAIR, National level conference on Computational Linguistics and Information Retrieval. Simple groups proudly brings you the March edition of CLEAR Journal with the papers published in NC-CLAIR 2014. NC-CLAIR aims to bring together, researchers working in the areas of Indian Language Computing, Machine Translation, Speech Processing, Information Retrieval, Big Data Analytics, Machine Learning and other related fields.
CLEAR welcomes thought-provoking articles, interesting dialogues and healthy debates on multifaceted aspects in all areas covered, which includes audio, speech, and language processing and the sciences and technologies that support them.
Simple group welcomes more aspirants in this area. Wish you all the best!!! Nisha M
nisha.m407@gmail.com