CLEAR December 2015
Editorial ............................ 04
News & Updates ...................... 05
CLEAR Dec 2014 Invitation ........... 48
Last word ........................... 49
CLEAR Journal (Computational Linguistics in Engineering And Research)
M. Tech Computational Linguistics, Dept. of Computer Science and Engineering
Govt. Engineering College Sreekrishnapuram, Palakkad-678633
www.simplegroups.in | simplequest.in@gmail.com

Chief Editor
Dr. Ajeesh Ramanujan, Asst. Professor, Dept. of Computer Science and Engineering, Govt. Engineering College Sreekrishnapuram, Palakkad-678633

Editors
Raseek C, Pelja Paul N, Sini G M, Revathy P

Cover page and Layout
Anoop R
Compositional Morphology based Language Generation for Agglutinative Languages……………………………………….06 Krishnaprasad P, Ajeesh Ramanujan
Real-Time Captioning and Lecture Transcription…….…..15 Pelja Paul N
Pattern Based Bootstrapping for Morphological Analysis of Malayalam……………………………………………………………...24 Kala M T
A Study on Machine Aided Translation Approaches in India…………………………………………………………………………...32 Deepa C A, Sincy V Thambi, Varsha K V
Securing Cloud Data from Data Mining Attacks………..….39 Jyothsna G K
Dear Readers, Greetings from Government Engineering College, Sreekrishnapuram. We are extremely happy to bring out the December edition of the on-line magazine CLEAR Journal, an initiative of SIMPLE, the association of the M.Tech students of the Department of Computer Science and Engineering, Government Engineering College, Sreekrishnapuram. In this issue, we are pleased to present five papers including applications related to Machine Learning, Data Mining and Speech Recognition. The Editorial Board appreciates the time and effort that have been devoted by the different authors and would like to thank them all. Finally, we hope you enjoy this edition too. We hope that our magazine will have a long and successful life with the help of our readers and contributors. As always, suggestions and criticisms towards improving the magazine content are welcome.

With best wishes...
Ajeesh Ramanujan (Chief Editor)
Publications

1. Rekha Raj C. T and Reghu Raj P. C, "Text Chunker for Malayalam using Memory-Based Learning", IEEE International Conference on Control, Communication and Computing India 2015 (IEEE ICCC India 2015), December 11, 2015, pp. 1-8.
2. Archana S M, Naima Vahab, Raseek C and Rekha Thanakappan, "A Rule Based Question Answering System in Malayalam corpus using Vibakthi and POS Tag Analysis", Proceedings of the International Conference on Emerging Trends in Engineering, Science and Technology (ICETEST 2015), December 11, 2015, pp. 1-8.
3. Seena I T, Sini G M, and Binu R, "Malayalam question answering system", Proceedings of the International Conference on Emerging Trends in Engineering, Science and Technology (ICETEST 2015), December 10, 2015, pp. 1-4 (accepted for poster presentation).
4. Pelja Paul N, Revathy P, Sini G M, and Binu R, "Automatic AMR Generation for Simple Sentences Using Dependency Parser", Proceedings of the International Conference on Emerging Trends in Engineering, Science and Technology (ICETEST 2015), December 11, 2015, pp. 1-5.
5. Vidya P V, Reghu Raj P C, and Jayan V, "Web Page Ranking Using Multilingual Information Search Algorithm - A Novel Approach", Proceedings of the International Conference on Emerging Trends in Engineering, Science and Technology (ICETEST 2015), December 9, 2015, pp. 1-8.
6. Nisha M, Reji Rahmath K, Rekha Raj C. T, and P. C. Reghu Raj, "Malayalam Morphological Analysis Using MBLP Approach", IEEE International Conference on Soft-computing and Network Security (ICSNS 2015), February 2015, pp. 1-5.
7. Kavitha Raju, Robert Jesuraj K, Samaj Babu George, and P. C. Reghu Raj, "Information Extraction from Text through Sequence Labelling", International Journal of Advanced Research in Computer Science and Software Engineering (IJARCSSE 2013), 3(4), October 2015, pp. 1-4.
Compositional Morphology based Language Generation for Agglutinative Languages

Krishnaprasad P
Department of Computer Science, GEC Palakkad, University of Calicut
krishnaprasadpgupta@gmail.com

Ajeesh Ramanujan
Department of Computer Science, GEC Palakkad, University of Calicut
ajeeshramanujan@gmail.com
ABSTRACT: This paper presents a novel method for modeling highly agglutinative languages, especially Indian languages with a wide variety of morphological variations, through vector-based probabilistic representations in the context of natural language generation. The method is evaluated on the Indian language Malayalam, whose proliferation of word forms creates challenges for statistical language modeling. The morphological variations of the language are tackled by modeling the morphological forms with additive word representations, so that morphologically related words share statistical strength in spite of differences in surface form. We introduce a factorized feature vector, in which a word is factorized into surface morphemes; this factorized feature vector can further be used to capture possible morphological variations and thereby generate language. Experiments show that the results are promising.
I. INTRODUCTION
Natural Language Generation is one of the most promising areas in Computational Linguistics. In the early days, generation was considered the easy part of natural language processing. After all, it is straightforward to write a generator that produces impressive text by associating a sentence template (or some equivalent general grammatical form) with each representational type and then using a grammar to realize the template into surface form. Since the characteristics of languages vary widely, however, the problem of natural language generation becomes difficult and limits us from obtaining tangible results.
The proliferation of word forms in morphologically rich languages presents challenges to statistical language models (LMs), which limits the quality of machine-generated natural language text. Conventional back-off n-gram language models and the increasingly popular vector-based language models use parameterizations that do not explicitly encode morphological regularities among related forms, such as compute, computer, computerization, etc. Such models suffer from data sparsity arising from morphological processes and lack a coherent method of assigning probabilities or representations to unseen word forms. Most of the Indian languages are highly agglutinative in nature, and hence
natural language generation remains one of the hardest problems for such languages. This paper proposes a general method for modeling the morphological variations of a particular language so that language generation becomes easier and the problem of unseen words can be solved to an extent. The proposed method is based on the theory of log-bilinear models, which attempt to strike a balance between probabilistic language modeling and morphology-based representation learning. We introduce a factorized feature vector for learning morphological representations. The feature vectors are constructed in such a way that vectors are composed as a linear function of arbitrary sub-elements of the word, e.g. surface form, stem, affixes, or other latent information. The effect is to tie together the representations of morphologically related words, directly combating data sparsity. The results are evaluated in the context of the Indian language Malayalam and appear promising for modeling agglutinative languages.
II. RELATED WORKS
The term language model originates from probabilistic models of language generation developed for automatic speech recognition systems in the early 1980s. Speech recognition systems use a language model to complement the results of the acoustic model, which models the relation between words (or parts of words called phonemes) and the acoustic signal. The history of language models, however, begins at the end of the 19th century, when Andrei Markov used language models (Markov models) to model letter sequences in works of Russian literature (Mnih and Hinton, 2007). Another
famous application of language models is Claude Shannon's models of letter sequences and word sequences, which he used to illustrate the implications of coding and information theory (Holger Schwenk, 2007). In the 1990s language models were applied as a general tool for several natural language processing applications, such as part-of-speech tagging, machine translation, and optical character recognition. Language models were applied to information retrieval by a number of research groups in the late 1990s and rapidly became popular in information retrieval research. Statistical language models are the most basic way of modeling languages, while various improvements have been suggested based on the learning technique and on morphological segmentation (Mnih and Hinton, 2007). The improvements on LMs mostly depend on vector-based representations of the feature vectors of the training data and on neural networks for learning functions (Strong et al., 2007). The combination of vector theory and probability theory introduced a new area in language modeling called continuous space language modeling. Vector-based approaches for language modeling show good results. Skip-gram models and recurrent neural networks are based on the foundations of vector theory. Various improvements in the results of language models have been obtained by introducing improvements in the learning methods.

A. Language Modeling as a Problem

A language model assigns a probability to a piece of unseen text, based on some training data. For example, a language model
based on a big English newspaper archive is expected to assign a higher probability to "a bit of text" than to "aw pit tov tags", because the words in the former phrase (or word pairs or word triples if so-called n-gram models are used) occur more frequently in the data than the words in the latter phrase. For information retrieval, the typical usage is to build a language model for each document. At search time, the top-ranked document is the one whose language model assigns the highest probability to the query. Language models are generative models, i.e., models that define a probability mechanism for generating language. Such generative models might be explained by the following probability mechanism: imagine picking a term T at random from this page by pointing at the page with closed eyes. This mechanism defines a probability P(T | D), which could be defined as the relative frequency of the occurrence of the event, i.e., the number of occurrences of a word on the page divided by the total number of terms on the page. Suppose the process is repeated n times, picking the terms T1, T2, …, Tn one at a time. Then, assuming independence between the successive events, the probability of the terms given the document D is

P(T1, T2, …, Tn | D) = ∏_{i=1}^{n} P(Ti | D)

A simple language modeling approach would compute this probability for each document in the collection and rank the documents accordingly. The potential problem is that the equation will assign zero probability to a sequence of terms unless all the terms occur in the document.
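To make the ranking idea concrete, here is a minimal illustrative sketch (not from the paper) of a unigram document language model and the query-likelihood product above; the zero-probability problem shows up directly.

```python
from collections import Counter

def unigram_lm(document_tokens):
    """Return P(term | D) estimated by relative frequency in the document."""
    counts = Counter(document_tokens)
    total = sum(counts.values())
    return lambda term: counts[term] / total   # zero for unseen terms

def query_likelihood(query_tokens, lm):
    """P(T1..Tn | D) under the term-independence assumption."""
    p = 1.0
    for t in query_tokens:
        p *= lm(t)           # a single unseen term drives the whole product to zero
    return p

doc = "a language model assigns a probability to a piece of unseen text".split()
lm = unigram_lm(doc)
print(query_likelihood("unseen text".split(), lm))      # > 0
print(query_likelihood("aw pit tov tags".split(), lm))  # 0.0: the zero-probability problem
```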
Statistical language modeling is concerned with building probabilistic models of word sequences. Such models can be used to discriminate probable sequences from improbable ones, a task important for speech recognition, information retrieval, and machine translation. The vast majority of statistical language models are based on the Markov assumption, which states that the distribution of a word depends only on some fixed number of words that immediately precede it. While this assumption is clearly false, it is very convenient because it reduces the problem of modeling the probability distribution of word sequences of arbitrary length to the problem of modeling the distribution of the next word given some fixed number of preceding words, called the context. This distribution can be denoted by P(w_n | w_{1:n-1}), where w_n is the next word and w_{1:n-1} is the context (w_1, …, w_{n-1}). n-gram language models are the most popular statistical language models due to their simplicity and surprisingly good performance. These models are simply conditional probability tables for P(w_n | w_{1:n-1}), estimated by counting the n-tuples in the training data and normalizing the counts appropriately. Since the number of n-tuples is exponential in n, smoothing the raw counts is essential for achieving good performance. There is a large number of smoothing methods available for n-gram models. In spite of the sophisticated smoothing methods developed for them, n-gram models are unable to take advantage of large contexts since the data sparsity problem becomes extreme. The main reason for this behavior is the fact that classical n-gram models are essentially conditional
probability tables in which different entries are estimated independently of each other. These models do not take advantage of the fact that similar words occur in similar contexts, because they have no concept of similarity.
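The counting-and-normalizing estimation, and the reason smoothing is needed, can be illustrated with a small sketch; add-one smoothing is used purely as an example of the many smoothing methods mentioned, and this is not code from the paper.

```python
from collections import Counter

def train_bigram(tokens):
    """Raw counts for a bigram language model."""
    unigrams = Counter(tokens[:-1])                     # context counts
    bigrams = Counter(zip(tokens[:-1], tokens[1:]))     # (prev, word) counts
    return unigrams, bigrams

def prob(word, prev, unigrams, bigrams, vocab_size, alpha=1.0):
    """P(word | prev) with add-alpha smoothing; unsmoothed counts would give
    zero probability to any bigram unseen in training."""
    return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab_size)

tokens = "the cat sat on the mat".split()
uni, bi = train_bigram(tokens)
V = len(set(tokens))
print(prob("cat", "the", uni, bi, V))   # seen bigram
print(prob("dog", "the", uni, bi, V))   # unseen bigram, still > 0 thanks to smoothing
```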
B. The log-bilinear model
Like virtually all neural language models, the log-bilinear (LBL) model represents each word with a real-valued feature vector (Botha and Blunsom, 2014) (Mikolov et al., 2013). We will denote the feature vector for word w by r_w and refer to the matrix containing all these feature vectors as R. To predict the next word w_n given the context w_{1:n-1}, the model computes the predicted feature vector r̂ for the next word by linearly combining the context word feature vectors:

r̂ = Σ_{i=1}^{n-1} C_i r_{w_i}

where C_i is the weight matrix associated with the context position i. Then the similarity between the predicted feature vector and the feature vector for each word in the vocabulary is computed using the inner product. The similarities are then exponentiated and normalized to obtain the distribution over the next word:

P(w_n = w | w_{1:n-1}) = exp(r̂ · r_w + b_w) / Σ_j exp(r̂ · r_j + b_j)

Here b_w is the bias for word w, which is used to capture the context-independent word frequency. Note that the LBL model can be interpreted as a special kind of feed-forward neural network with one linear hidden layer and a softmax output layer. The inputs to the network are the feature vectors for the context words, while the matrix of weights from the hidden layer to the output layer is simply the feature vector matrix R. The vector of activities of the hidden units corresponds to the predicted feature vector for the next word. The LBL model needs to compute the hidden activities only once per prediction and has no nonlinearities in its hidden layer. In spite of its simplicity the LBL model performs very well, outperforming n-gram models on a fairly large dataset.
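The prediction step can be sketched numerically as follows (an illustrative toy example with made-up shapes and random parameters, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, ctx = 5, 4, 2                   # toy vocabulary size, feature dimension, context length (n-1)

R = rng.normal(size=(V, d))           # feature vector r_w for every word w (rows of R)
C = rng.normal(size=(ctx, d, d))      # one weight matrix C_i per context position
b = np.zeros(V)                       # per-word biases b_w

def lbl_next_word_distribution(context_ids):
    # predicted feature vector: r_hat = sum_i C_i @ r_{w_i}
    r_hat = sum(C[i] @ R[w] for i, w in enumerate(context_ids))
    scores = R @ r_hat + b            # inner product with every word's feature vector, plus bias
    scores -= scores.max()            # numerical stability before exponentiating
    p = np.exp(scores)
    return p / p.sum()                # softmax: distribution over the next word

print(lbl_next_word_distribution([0, 3]))   # probabilities over the 5-word toy vocabulary
```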
C. Recurrent neural network based language model

The word representations we study are learned by a recurrent neural network language model, as illustrated in Fig. 1 (Kun Lu, 2013) (Bengio et al., 2003).

Fig1: The Recurrent Neural Network Language Model Architecture

This architecture consists of an input layer, a hidden layer with recurrent connections, plus the corresponding weight matrices (Mikolov et al., 2013). The input vector w(t) represents the input word at time t encoded using 1-of-N coding, and the output layer y(t) produces a probability distribution over words. The hidden layer s(t) maintains a representation of the sentence history. The
input vector w(t) and the output vector y(t) have the dimensionality of the vocabulary. The values in the hidden and output layers are computed as follows:

s(t) = f(U w(t) + W s(t−1))
y(t) = g(V s(t))

where f(z) = 1 / (1 + e^{−z}) is the sigmoid activation and g(z_m) = e^{z_m} / Σ_k e^{z_k} is the softmax function.

D. The Skip-gram Model

Skip-grams are a technique largely used in the field of speech processing, whereby n-grams are formed (bi-grams, tri-grams, etc.) but, in addition to allowing adjacent sequences of words, tokens may be skipped. While initially applied to phonemes in human speech, the same technique can be applied to words (Holger Schwenk, 2007) (Mikolov et al., 2013). For example, the sentence "I hit the tennis ball" has three word-level trigrams: "I hit the", "hit the tennis" and "the tennis ball". However, one might argue that an equally important trigram implied by the sentence, but not normally captured in that way, is "hit the ball". Using skip-grams allows the word "tennis" to be skipped, enabling this trigram to be formed. The training objective of the Skip-gram model is to find word representations that are useful for predicting the surrounding words in a sentence or a document (Mnih and Hinton, 2007). More formally, given a sequence of training words w_1, w_2, w_3, …, w_T, the objective of the Skip-gram model is to maximize the average log probability

(1/T) Σ_{t=1}^{T} Σ_{−c ≤ j ≤ c, j ≠ 0} log P(w_{t+j} | w_t)

where c is the size of the training context (which can be a function of the center word w_t). Larger c results in more training examples and thus can lead to higher accuracy, at the expense of training time.
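As an illustration (not from the paper), the following sketch enumerates k-skip-n-grams of a tokenized sentence; with k = 2 it reproduces the sets listed in the example that follows.

```python
from itertools import combinations

def k_skip_n_grams(tokens, n, k):
    """All n-grams whose words appear in order, skipping at most k tokens in total."""
    grams = []
    for start in range(len(tokens) - n + 1):
        # the last word of the gram may be at most (n - 1 + k) positions after the first
        window = range(start + 1, min(len(tokens), start + n + k))
        for rest in combinations(window, n - 1):
            grams.append(tuple(tokens[p] for p in (start,) + rest))
    return grams

sent = "insurgents killed in ongoing fighting".split()
print(k_skip_n_grams(sent, 2, 0))   # the 4 ordinary bigrams
print(k_skip_n_grams(sent, 2, 2))   # the 9 two-skip-bigrams listed below
print(k_skip_n_grams(sent, 3, 2))   # the 10 two-skip-trigrams listed below
```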
Fig2: The Skip-gram Model Architecture [10]

Here is an actual sentence example showing 2-skip-bi-grams and tri-grams compared to standard bi-grams and tri-grams consisting of adjacent words, for the sentence "Insurgents killed in ongoing fighting":

Bi-grams = {insurgents killed, killed in, in ongoing, ongoing fighting}.
2-skip-bi-grams = {insurgents killed, insurgents in, insurgents ongoing, killed in, killed ongoing, killed fighting, in ongoing, in fighting, ongoing fighting}.
Tri-grams = {insurgents killed in, killed in ongoing, in ongoing fighting}.
2-skip-tri-grams = {insurgents killed in, insurgents killed ongoing, insurgents killed
fighting, insurgents in ongoing, insurgents in fighting, insurgents ongoing fighting, killed in ongoing, killed in fighting, killed ongoing fighting, in ongoing fighting}.

In conclusion, the task here is "predicting the context given a word". Also, the context is not limited to the immediate context; training instances can be created by skipping a constant number of words in the context. Note that the window size determines how far forward and backward to look for context words to predict.

III. COMPOSITIONAL MORPHOLOGY BASED LEARNING

In order to construct a system capable of understanding and producing natural language, the primary need is the determination of the basic language units and their associations. Practical approaches to language generation therefore make intelligent use of lexical resources at the core of their implementation. For example, in information retrieval, the analysis entails collecting a list of words and detecting their association with topics of discussion. Moreover, a vocabulary is essential for obtaining good results in speech recognition. Words are often thought of as the basic units of representation (Creutz and Lagus, 2007). However, especially in inflecting and compounding languages this view is hardly optimal. For instance, if one treats the English words hand, hands, and left-handed as separate entities, one neglects the close relationships between these words as well as the relationship of the plural +s to other plural word forms, e.g., heads, arms, fingers. In the case of Malayalam, this situation becomes worse. Overlooking these regularities
accentuates data sparsity, which is a serious problem in statistical language modeling. According to linguistic theory, morphemes are the smallest meaning-bearing units of language as well as the smallest units of syntax. Every word consists of one or several morphemes; consider, for instance, the English words hand, hand+s, left+hand+ed, finger+s, un+avail+able. For Malayalam, it looks like: ßƁ+ãė÷, ćđą+ĀđŔ, etc. There exist linguistic methods and automatic tools for retrieving morphological analyses of words, for example based on the two-level morphology formalism (Kun Lu, 2013). However, these systems must be tailored separately for each language, which demands a large amount of manual work by experts. Moreover, specific tasks often require specialized vocabularies which must keep pace with rapidly evolving terminologies.

A. Additive Word Representations

Our approach seeks a compromise that retains the unsupervised nature of CSLM feature vectors, but also incorporates a priori linguistic knowledge in a flexible and efficient manner (Le and Mikolov, 2014). In particular, morphologically related words should share statistical strength in spite of differences in surface form. In a general continuous space language model we represent each word type v in the vocabulary V by a d-dimensional feature vector r_v ∈ R^d. Similarities among the word types are captured in an abstract way by suitably associating weights with each feature vector, in contrast to hand-engineered linguistic features that target very specific
phenomena, as often used in supervised-learning settings. We define a mapping μ : V → F+ of a surface word into a variable-length sequence of factors, i.e. μ(w) = (f_1, …, f_k), where w ∈ V and f_i ∈ F. Each factor f has an associated factor feature vector r_f ∈ R^d, so that a word is factorized into its surface morphemes, although the approach could also incorporate other information, e.g. lemma or part of speech. The vector representation r_v of a word v is computed as a function ω(v) of its factor vectors. Here addition is the composition function:

r_v = ω(v) = Σ_{f ∈ μ(v)} r_f

The vectors of morphologically related words become linked through shared factor vectors (we write vec(·) for these vectors):

vec(unavailability) = vec(un) + vec(available) + vec(ity)
vec(perfectly) = vec(perfect) + vec(ly)

By introducing the additive word representation we can model unseen words, so the system performs well with a limited vocabulary: out-of-vocabulary words are constructed from the available vocabulary of morphemes by simple vector addition. Since we allow the number of factors to vary freely, the non-compositional nature of a particular word can be handled by including the surface form itself as one of the factors. Including the surface form also overcomes the problems created by the commutative property of vector addition, helping us keep order invariance; for example, in English, vec(overcome) ≠ vec(over) + vec(come).
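A minimal sketch of this additive composition (the factor vectors are random stand-ins for learned ones, and the segmentations are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
# learned factor vectors r_f would come from training; random stand-ins are used here
factors = {f: rng.normal(size=d)
           for f in ["un", "avail", "able", "ity", "perfect", "ly", "hand", "s"]}

def compose(word_factors):
    """r_v = sum of the factor vectors for the factors mu(v) of word v."""
    return sum(factors[f] for f in word_factors)

# morphologically related words share statistical strength through shared factors,
# and an unseen surface form is still representable from known morphemes
v_unavailable  = compose(["un", "avail", "able"])
v_availability = compose(["avail", "able", "ity"])
print(np.dot(v_unavailable, v_availability))   # shared factors make the vectors correlated
```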
B. Factorized Log-Bilinear Representations

The heart of the natural language generation algorithm uses a variant of the log-bilinear language model. The method consists of the intelligent use of compositional word representations, associating the composed word vectors r̃ and q̃_j with the target and context words, respectively. The representation matrices Q(f), R(f) ∈ R^{|F| × d} thus contain a vector for each factor type. This model is designated LBL++ and has parameters θ++ = (C_i, Q(f), R(f), b). Words sharing factors are tied together, which is expected to improve performance on rare word forms. The factorization permits an approach to unknown context words that is less harsh than the standard method of replacing them with a global unknown symbol: instead, a vector can be constructed from the known factors of the word (e.g. the observed stem of an unobserved inflected form) (Botha and Blunsom, 2014) (Mikolov et al., 2013). A similar scheme can be used for scoring unknown target words, but requires changing the event space of the probabilistic model. This extended vocabulary is used to stretch the model's capability in word similarity experiments.
C. Estimation of Probability
In order to generate natural language text, we introduce a probabilistic function based on the factorized log-bilinear model. Log-bilinear models make the same Markov assumption as n-gram language models. The probability of a sentence w is decomposed over its words, each conditioned on the n−1 preceding words:
P(w) = ∏_i P(w_i | w_{i−n+1:i−1})

Our model calculates a context-specific parameter s as

s = q · k

where q ∈ R^d is the context vector based on the n preceding words and k is the position-specific parameter of the next word. Once the value of s is calculated, we can calculate the score for the next word as

r(w) = s × r_w + b_w

The calculated score can then be used to generate a probability using the softmax:

P(w_i | w_{i−n+1:i−1}) = exp(r(w_i)) / Σ_{v ∈ V} exp(r(v))
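Putting the composition and the scoring together, a rough sketch of the factorized next-word distribution might look as follows. This is an assumption-laden illustration, not the authors' implementation: the text leaves the exact form of the context term s = q · k open, so an elementwise product is assumed here, and the factor lists and names are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
R_f = {f: rng.normal(size=d) for f in ["stem1", "stem2", "acc_sfx", "pl_sfx"]}  # factor vectors
vocab = {                                # each candidate word as a bag of factors (its morphemes)
    "word1":     ["stem1"],
    "word1_acc": ["stem1", "acc_sfx"],
    "word2_pl":  ["stem2", "pl_sfx"],
}
bias = {w: 0.0 for w in vocab}

def compose(word):
    """r_w composed additively from the factor vectors of the word's morphemes."""
    return sum(R_f[f] for f in vocab[word])

def next_word_distribution(q, k):
    s = q * k                            # context-specific term (elementwise product assumed)
    scores = {w: float(s @ compose(w)) + bias[w] for w in vocab}
    m = max(scores.values())
    exps = {w: np.exp(v - m) for w, v in scores.items()}
    z = sum(exps.values())
    return {w: e / z for w, e in exps.items()}

q = rng.normal(size=d)                   # context vector from the preceding words
k = rng.normal(size=d)                   # position-specific parameter of the next word
print(next_word_distribution(q, k))
```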
D. Training and Installation
Implementation of the system includes a training phase followed by a testing phase. A corpus containing a large number of sentences is used for the training process. The calculation of the probability of the next word is applied to a subset of sentences that share similar properties. Once the sentences are selected, the vector representations of the sentences are computed. In this vector each term is again a vector, which corresponds to the factor feature vector of each word in the sentence. The vector is set to an appropriate length, which is determined by the properties of the language. This set of vectors is used for constructing the context vector q, and each term in k is the coefficient of a linear combination of position-specific properties. We further calculate the vector which best explains the next word. The word is determined from the sentence that best matches the calculated sentence vector. The
probability of the word is calculated using the softmax method.

REFERENCES

[1] Andriy Mnih and Geoffrey Hinton. 2007. A Scalable Hierarchical Distributed Language Model, volume 3. Journal of Machine Learning Research, Department of Computer Science, University of Toronto.
[2] Andriy Mnih and Geoffrey Hinton. 2007. Three New Graphical Models for Statistical Language Modelling, volume 3. Department of Computer Science, University of Toronto. Proceedings of the 24th International Conference on Machine Learning, Corvallis.
[3] Christina R. Strong, Manish Mehta, Kinshuk Mishra, Alistair Jones and Ashwin Ram. 2007. Emotionally Driven Natural Language Generation for Personality Rich Characters in Interactive Games, volume 3. Cognitive Computing Lab (CCL), Georgia Institute of Technology, Atlanta, Georgia. Third Conference on Artificial Intelligence for Interactive Digital Entertainment.
[4] David Guthrie, Ben Allison, Wei Liu, Louise Guthrie, and Yorick Wilks. 2007. A Closer Look at Skip-gram Modelling. Proceedings of the NIPS, NLP Research Group, Department of Computer Science, University of Sheffield.
[5] Holger Schwenk. 2007. Continuous Space Language Models, volume 21. Spoken Language Processing Group, Computer Speech and Language.
[6] Jan A. Botha and Phil Blunsom. 2014. Compositional Morphology for Word Representations and Language Modelling.
Department of Computer Science, University of Oxford. Proceedings of the 31st International Conference on Machine Learning, Beijing, China.
[7] Kun Lu. 2013. Insight Into Vector Space Modeling and Language Modeling. School of Information Management, Wuhan University. iConference 2013 Proceedings.
[8] Mathias Creutz and Krista Lagus. 2007. Unsupervised Models for Morpheme Segmentation and Morphology Learning, volume 4. ACM Transactions on Speech and Language Processing.
[9] Quoc Le, Tomas Mikolov. 2014. Distributed Representations of Sentences and Documents. Proceedings of the 31st International Conference on Machine Learning, Beijing, China.
India is all set to switch to its own navigation system, IRNSS

It's time we move away from the American Global Positioning System (GPS) and make way for our own navigation system – the Indian Regional Navigation Satellite System, or IRNSS, on our mobile phones. IRNSS will provide two types of services, namely Standard Positioning Service, which is provided to all users, and Restricted Service, which is an encrypted service provided only to authorised users. Some applications of IRNSS are: terrestrial, aerial and marine navigation; disaster management; vehicle tracking and fleet management; integration with mobile phones; precise timing; and terrestrial navigation aid for hikers and travellers. All seven satellites of IRNSS are expected to be in orbit by March 2016, according to the Indian Space Research Organisation.
[10] Tomas Mikolov, Ilya Sutskever and Kai Chen. 2013. Distributed Representations of Words and Phrases and their Compositionality. Google.
[11] Tomas Mikolov, Wen-tau Yih, Geoffrey Zweig. 2013. Linguistic Regularities in Continuous Space Word Representations. Google.
[12] Yoshua Bengio, Rejean Ducharme, Pascal Vincent, Christian Jauvin. 2003. A Neural Probabilistic Language Model, volume 3. Journal of Machine Learning Research 3 (2003): 1137-1155.
Real-Time Captioning and Lecture Transcription Using Speech Recognition

Pelja Paul.N
Department of CSE, Govt. Engineering College Sreekrishnapuram, Kerala, India 678633
peljapaul@gmail.com

ABSTRACT—Speech recognition (SR) technologies were evaluated in different classroom environments to assist students by automatically converting oral lectures into text. Two distinct methods of SR-mediated lecture acquisition (SR-mLA), real-time captioning (RTC) and post-lecture transcription (PLT), have been developed to increase word recognition accuracy. Both methods have been compared according to technical feasibility and reliability of classroom implementation, the instructor's experiences, word recognition accuracy, and student class performance. RTC provided a near-instantaneous display of the instructor's speech for students during class. PLT employed a user-independent SR algorithm to generate multimedia class notes with synchronized lecture transcripts and instructor audio for students to access online after class. It was found that PLT provides higher word recognition accuracy than RTC. The potential benefits of SR-mLA are greatest for students who have difficulty taking notes accurately and independently, particularly non-native English speakers and students with disabilities.

I. INTRODUCTION
Speech recognition (SR) technology has a burgeoning range of applications in education, from captioning video and television for the hearing-impaired to voice-controlled computer operation and dictation. Some of the most popular commercial applications of SR are dictation and other hands-free writing tasks with software applications. Most commercial SR software applications were developed for dictation with punctuation, not for transcribing extemporaneous speech,
which is structurally and grammatically different from written prose. Transcripts produced from a continuous unbroken stream of text are additionally difficult to read and interpret without punctuation or formatting. Alternative SR software applications for real-time captioning or speech transcription have been developed to parse speech into individual transcribed statements using verbal cues rather than having to specify punctuation. Line breaks were introduced using pauses and interjections, such as "um", "ah", and "uh".
A comparison has been done on the classroom implementation, reliability, and academic performance impact of two different methods of SR-mediated lecture acquisition (SR-mLA). Both SR-mLA techniques were employed using conventional educational technology. The first method of SR-mLA provided real-time captioning (RTC) of an instructor's lecture speech using a client-server application for instant viewing during class on a projection screen or directly on students' laptop personal computers (PCs). The second SR-mLA method, post-lecture transcription (PLT), employed a digital audio recording of the instructor's lecture to provide transcripts, which were synchronized with the audio recording for students to view online or download after class. Earlier studies were conducted in courses other than science, technology, engineering, and mathematics (STEM) and did not attempt to quantitatively measure the effects of providing SR-based lecture notes on student class performance.
II. RELATED WORKS
Note taking is a fundamental and ubiquitous learning activity that students are expected to perform and master during their educational development. The benefits of lecture note taking include helping students to organize, summarize, and better comprehend lecture information, recording content for later studying, self-regulated learning through the active process of note taking, and simply staying attentive during class.
Speech recognition (SR) technologies have been evaluated in different classroom environments to assist students by automatically converting oral lectures into text. The two distinct methods of SR-mediated lecture acquisition are real-time captioning and post-lecture transcription. Both methods were compared according to technical feasibility and reliability of classroom implementation, the instructor's experiences, word recognition accuracy, and student class performance [1]. In [3], a system was developed that provides an automatic text transcription of multiple speakers using speech recognition (SR), with the names of speakers identified in the transcription and corrections of SR errors made in real time by a human editor. A survey of the potential benefits of SR-mLA for students who have difficulty taking notes accurately and independently, particularly non-native English speakers and students with disabilities, is given in [5].

III. SR METHODS AND TOOLS

Two different SR approaches for SR-mLA were used: 1) the first approach was RTC using IBM ViaScribe, and 2) the second was PLT through the IBM Hosted Transcription Service (HTS). The general steps for performing both methods of SR-mLA are outlined in Figure 1.
Fig1: General SR-mLA methodology

Figure 2 summarizes and compares the major technical functionalities of real-time captioning using ViaScribe and post-lecture transcription using HTS. The functionalities compared were divided according to: (1) the process of recording the instructor's speech, (2) the speaker-dependent or -independent SR engines used, (3) error correction methods, and (4) display options.
Fig2: Comparison of Major Functionalities between ViaScribe and HTS Systems

A. IBM ViaScribe
IBM ViaScribe utilized a SR engine capable of transcribing live or pre-recorded speech developed collaboratively by IBM and the LL Consortium. During class lectures ViaScribe displayed or captioned what the instructor uttered into text as it is
being spoken. ViaScribe was chosen for real-time captioning because it had a proven track record by LL members for reliable captioning and had a client-server platform for streaming live transcription to students' laptop PCs during lectures. As natural spoken language does not explicitly state grammar and punctuation, ViaScribe transcription provided readability by introducing a paragraph break or other markers whenever the speaker paused for a breath. These pauses could be customized according to the speaker's individual speech characteristics. To improve word recognition accuracy, users performed voice profile training. The commercial ViaVoice application was used to create the initial voice profile, which was subsequently updated in the ViaScribe application. Profile training involved recording a minimum of 500 words of dialogue and vocabulary for proper speech recognition. Once the initial voice profile was created, it could be updated by adding lecture transcripts that had been recorded and corrected by us. As more words were inputted, word recognition accuracy improved. However, inputting more than 2,500 words for profile training does not significantly improve word accuracy. Students could voluntarily install client software on their own personal laptops to receive the text during class as it was being streamed by the ViaScribe server. There is an inherent delay between when a word is spoken and when it is transcribed (regardless of whether SR is used or human captionists are employed). ViaScribe used a single-pass decoding technique, which generated very little display lag compared to other SR
systems that use different decoding techniques. A client-server monitoring application on the instructor's machine showed the current client connections, which could be deactivated, and the rate of streaming words from the server.
B. IBM Hosted Transcription Service
IBM HTS was selected for post-lecture transcription primarily because of its higher word recognition accuracy rates compared to other systems. HTS is a speaker independent SR system developed by IBM Research that automatically transcribes a variety of standard audio or video file formats through a cloud service. HTS uses statistically derived acoustic and language models to convert speech to text. As opposed to statistical models designed for creating written language, which would not be ideally suited for recognizing extemporaneous speech, HTS used United States English Broadcast News models built from acoustic data from spoken language. HTS SR engine employs a double-pass decoding technique, which dynamically adjusts to the speaker's voice, without requiring voice profile training or enrollment. For HTS transcription, authenticated users had to visit the HTS service portal, log into their secure accounts, and then upload a media file for automatic transcription. Once HTS has processed the recorded lecture, the transcribed text could be viewed and edited for error corrections online employing a Flash-based interface. A posthoc correction method similar to ViaScribe was performed. The audio recording in synchrony with the transcript could be downloaded. This multimedia content could be viewed in
different predefined layouts and adjusted temporally by authors using post-production tools provided by HTS. The presentation package was downloaded from HTS, which consisted of an XML file with timing data, an audio WAV file, and the lecture transcript to prepare the multimedia transcript.

IV. CLASSROOM EVALUATION PROCEDURES
Both SR-mLA systems were evaluated during four phases of in-class testing in both social science and life science lecture courses. Prior to evaluation, instructor testers were trained to operate the necessary software and provided with initial best practices for improved SR accuracy before starting the real-time captioning or post-lecture transcription.

A. Initial Evaluation with RTC
Phase 1 evaluation occurred during regular courses offered in two graduate-level special education classes taught by a female instructor in the College of Education. The goal of Phase 1 testing during these semester-long courses was to successfully install and run the ViaScribe system, assess this technology with feedback from the instructor and students, and collect data on the accuracy and feasibility of real-time captioning. Prior to use, the instructor underwent initial voice training to develop a voice profile for the ViaScribe system. In Figure 3, a schema of the RTC system used during classes is given. The instructor used a portable, laptop PC running the ViaScribe server program during class. The laptop was then connected to the wireless microphone receiver and digital
classroom projector. During class, the ViaScribe server program streamed textual captions, as the lecture was being spoken and processed by the SR engine, to the classroom projection system and screen or to students' laptops running client software, serving as a closed-captioning window. As clients, students acquired raw, unedited transcripts. Afterward, students were provided with corrected transcripts. The instructor could provide additional information and create multimedia presentations to post on the web with keyword-searchable transcripts synchronized with the digital lecture audio.
Fig3: High-level overview of the real-time captioning method using ViaScribe.

B. Initial Evaluation with PLT

In Phase 2, PLT was assessed during the same education courses from Phase 1 with the same instructor. As shown in Figure 4, PLT was deployed by digitally recording the lecture audio with software installed on the instructor's laptop. The lecture audio was recorded in an SR-compatible format using a wireless microphone system. After class, the lecture audio file was uploaded to the online HTS system to be automatically transcribed through the IBM SR engine. Once the HTS system finished processing, a notification e-mail was sent to the HTS website account holder. This process could take hours to a full day depending on the index of submission in the queue of jobs submitted by other users. The transcribed text was then corrected for errors by a trained graduate student using the HTS Flash interface. The editor would play the lecture audio while editing the transcribed text, which advanced automatically in synchrony. The audio recording and generated transcripts were automatically synchronized through HTS; however, this content could be disaggregated to text only. The multimedia transcripts (synchronized text and audio) were uploaded to the university Blackboard system for students to download, search, print, or play back the audio using third-party applications if desired.
Fig4: High-level overview of the post-lecture transcription method using HTS.
C. PLT Evaluation during a Science Course

PLT using HTS was alternately evaluated during the lecture portion of a graduate neuroanatomy course in the College of Veterinary Medicine rather than a social science course. This course was taught by a male instructor rather than a female instructor, as in the previous phases (Figure 4). The objectives of the Phase 3 evaluation were to 1) compare SR-mLA implementation in a science versus a social science course, 2) compare instructors of different genders, and 3) assess a new method of providing students SR class transcripts by synchronizing them with the lecture audio and class PowerPoint slides to generate comprehensive multimedia class notes.

Microsoft PowerPoint was used to record the lecture audio and class slides simultaneously during class. The speech was automatically saved with each slide according to the time the instructor spent on a particular slide. The PowerPoint file was separated to generate a set of slide images, an audio file, and an XML file, which contained data on slide timings for synchronization. The audio file was subsequently uploaded to the HTS system for transcription. Transcript error correction was performed through the HTS website.

A new multimedia package synchronizing the transcribed text, audio and individual slide images was created and uploaded to Synote for students to view at their convenience. Students had to log on to Synote to view these multimedia class notes. Through Synote, students could select, modify, annotate, or search the multimedia class notes according to their individual preferences.

D. Evaluation of Student Class Performance in a STEM Course

Outcomes for quiz scores, exam grades, and student satisfaction with PLT were assessed in a preliminary study of a small cohort of nine students in a team-taught graduate-level course in systemic mammalian physiology at the College of Veterinary Medicine. A non-mandatory online five-question quiz was posted each week for 12 weeks on the course's Blackboard website covering the previous week's lecture material. Students were invited to voluntarily perform these quizzes and were informed that these quiz scores did not affect their course grades. Multimedia class notes with lecture transcripts were available to students to view on Synote for the first six weeks of the course, or 16 lectures (experimental period). The multimedia class notes were unavailable for the next six weeks (control period). The most obvious incentive for students to voluntarily take the online quizzes was to ensure their understanding of the lecture information. Student questionnaires regarding use of multimedia class notes and note taking were completed by students at the end of the course.

The systemic physiology course was divided into separate topics covering blood, muscles, and the nervous and digestive systems. Throughout the course students were provided hard copies of the lecture PowerPoint slides and lecture notes that the instructors would normally hand out when
teaching this course. During the period when students had access to multimedia class notes, they had two class exams totalling 160 points. When multimedia class notes were unavailable, students had three exams worth 170 points in total. The total exam points, quiz scores, and combined scores were compared between the two time periods to assess changes in class performance.

V. EVALUATION RESULTS
Current applications of SR technology have focused primarily on document dictation, which requires punctuation, capitalization, and syntax to be specified, discrete word entry, and voice user interface control (e.g., call systems, home automation, driver control of vehicle features). SR-mLA provides an ideal model for studying continuous SR, whereby extemporaneous speech by a single speaker (the lecturer) is transcribed for student use in a controlled, noise-limited environment. By investigating the technical challenges and merits of implementing SR-mLA in typical university classes, we can evaluate its current feasibility and develop practical strategies for incorporating this technology in other educational environments. In this pilot study, the two different SR-mLA methods, real-time captioning and post-lecture transcription, were found, using the SR engines employed, to be reliable and accurate. Other SR engines could be substituted within these RTC or PLT frameworks.

A. Implications for Student Learning
During the pilot study, there was a positive correlation among individual students between the availability of multimedia class notes and how frequently they voluntarily took the non-compulsory online quizzes: students took the quizzes more often when the multimedia class notes were available than when they were not provided. Additionally, students received higher quiz scores when SR multimedia transcripts were accessible. Voluntary quiz taking and quiz performance rose most noticeably among those students who stated they were interested in studying material offered outside of class. We believe that having access to multimedia class notes was an added incentive for students to take the optional online quizzes. Once the multimedia class notes were unavailable, students took the online quizzes much less. Past studies have demonstrated that acquiring and studying lecture notes result in a greater learning experience and higher overall academic performance for students. In this study, greater class grade performance was observed when synchronized multimedia class notes were available during the course. Students scored 10.2 percent higher on exams, 15.0 percent higher on non-compulsory online quizzes, and 10.5 percent higher on total scores. However, there are many factors involved in students getting high grades. Therefore, it is difficult to determine a direct correlative effect between having access to multimedia class notes and class performance for individual students. More study is needed to discern the full impact of SR multimedia class notes on student learning. SR-mLA would be especially advantageous for students with special needs, such as non-native English speaking students and students with disabilities, who could obtain class notes without having to rely upon classmates or paid note takers or captionists. Students incapable of or not
confident in their own note taking are able to acquire through PLT accurate and comprehensive multimedia class notes, which they can review at their own convenience and pace. Various formats of these multimedia class notes can enable greater access for the visually impaired, deaf or hard of hearing, mobility impaired, learning disabled, non-native speakers, distance learners, and any student who needs synchronized searching of the lecture material. With RTC, student subjects could view the instructor's extemporaneous speech and actively participate in class discussions during lectures. For instance, students with hearing loss can be engaged in lectures and respond during class with timely questions. In one study, students felt that RTC improved teaching and learning in class as long as word recognition was greater than 85 percent and the transcription and display lag was negligible. SR-mLA is not synonymous with note taking. Note taking practices can vary depending on the student, who may choose to include or omit any lecture content or record this information in a way that helps their understanding. Verbatim note taking was reported as the goal of half of college students, but it was estimated that less than 40 percent of lecture information was actually recorded. SR-mLA enables students to capture all lecture information. During this study, we achieved an average word recognition accuracy of 87 percent for PLT, which is more than twice as effective as verbatim note taking by manual note takers.
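For reference, word recognition accuracy figures such as the 85 and 87 percent quoted here are conventionally computed as one minus the word error rate obtained from a minimum-edit-distance alignment between the SR output and a reference transcript; the sketch below shows that standard computation (the example sentences are invented, not study data).

```python
def word_accuracy(reference, hypothesis):
    """1 - WER, where WER counts substitutions, insertions and deletions
    from a minimum-edit-distance alignment of the two word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    wer = dp[len(ref)][len(hyp)] / len(ref)
    return 1.0 - wer

ref = "the mitral valve separates the left atrium and ventricle"
hyp = "the mitral valve separates left atrium in ventricle"
print(round(word_accuracy(ref, hyp), 2))   # about 0.78 for this invented pair
```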
VI. CONCLUSION AND FUTURE WORK
Speech recognition (SR) technologies were evaluated in different classroom environments to assist students by automatically converting oral lectures into text. Two distinct methods of SR-mediated lecture acquisition (SR-mLA), real-time captioning (RTC) and post-lecture transcription (PLT), have been developed to increase word recognition accuracy. SR-mLA would help students to improve their academic performance. The next step in evaluating SR-mLA will be to test how students with disabilities can best utilize this technology to achieve particular learning outcomes. Although the initial findings regarding student performance are very encouraging, further research with larger class sizes and multiple courses is required to fully understand the impact of SR-mLA on academic performance, particularly for students with special needs. The impact of SR transcripts on class performance should also be evaluated in greater detail, such as the effect and perception of raw unedited transcripts in comparison to edited transcripts. SR systems can be implemented locally or virtually as a service via a cloud computing environment. Future studies will take advantage of cloud technology. Access to SR transcription services would be more efficient through local hosting of an SR instance as a cloud service. The cloud computing model "Software as a Service" would allow users to remotely access SR-mLA using internet web browsers. The main advantages of this model are: on-demand
availability of SR-mLA without any software installation on user systems, access to greater processing power than on local PCs, and automation of the whole process of post-lecture transcription from recording the lecture to delivery of multimedia class notes.

REFERENCES

[1] Rohit Ranchal, Teresa Taber-Doughty, Yiren Guo, Keith Bain, Heather Martin, J. Paul Robinson, Bradley S. Duerstock, Using Speech Recognition for Real-Time Captioning and Lecture Transcription in the Classroom, IEEE Transactions on Learning Technologies, 2013.
[2] K. Bain, S. Basson, M. Wald, Speech Recognition in University Classrooms: Liberated Learning Project, Proc. Fifth Int'l ACM SIGCAPH Conf. Assistive Technologies, 2002.
[3] M. Wald, K. Bain, Enhancing the Usability of Real-Time Speech Recognition Captioning through Personalised Displays and Real-Time Multiple Speaker Editing and Annotation, Proc. HCI International Conf., 2007.
New Wearable Keyboards Could Be Sewn into Clothing

The Apple Watch and Google Glass are some of the most widely known wearable devices, but the ways users can interact with these "smart" gadgets are limited. For instance, it would be pretty difficult to type a message out on the face of a watch. But now, researchers have developed wearable keyboards made of electronics knitted together like fabric that could lead to a new kind of human-machine interface. The prototype keyboard can be worn on a sleeve and has 11 keys, representing the numbers 0 to 9 as well as an asterisk. "A wearable keyboard would provide a more intuitive interface for tactile input than the touch-sensitive face of a smart watch or the hand gestures that control devices such as the Google Glass."

Visit: http://www.livescience.com/53098-wearable-keyboards-sewn-into-clothing.html
[4] S. Repp, A. Grob, C. Meinel, Browsing within Lecture Videos Based on the Chain Index of Speech Transcription, IEEE Trans. Learning Technologies, 2008.
[5] Ashwini B V, Laxmi B Rananavare, Enhancement of Learning using Speech Recognition and Lecture Transcription: A Survey, International Journal of Computer Applications, International Conference on Information and Communication Technologies, 2014.
Pattern Based Bootstrapping for Morphological Analysis of Malayalam

Kala M T
Asst. Professor in CSE, GEC Sreekrishnapuram
kalsmol14@gmail.com

ABSTRACT: The crucial part of natural language processing is morphological analysis. In this paper, pattern based bootstrapping for the morphological analysis of the Malayalam language is proposed. This is a semi-supervised approach for handling Malayalam words. In the proposed system, new patterns are learned by computing a tuple confidence score and a pattern score. This procedure is preceded by morpheme segmentation, which computes a mutual information score to get the morphemes correctly segmented. The morphemes might be segmented incorrectly due to the presence of sandhi and morphological suffixes. The training set contains patterns for different kinds of Malayalam words. If the analyzer is not able to handle a word, new patterns are learned for handling such words. A synthetic training set and test set are used for this purpose.
I. INTRODUCTION
Malayalam is a Dravidian language and the official language of Kerala. It is very close to Tamil, and many features are borrowed from Sanskrit. The main difference between Malayalam and other Dravidian languages is that the verbs do not depend on number, gender, and person. The study of problems in the automatic generation and understanding of natural languages is the main aim of NLP. The primary goal of natural language processing is to build computational models of natural language for its analysis and generation. The first step in natural language processing is morphological analysis. Morphology is the study of the internal structure of words. Morphological analysis is the identification of the parts, or constituents, of words. If we keep all
inflections of all the root words in a table, and if we list the features of all word forms along with it, then we do not need a morphological analyzer. When we are given a word, we need only search the table to retrieve its features. But Malayalam is a highly inflected language, which creates several problems for this method. It leads to wastage of memory space: if we list every inflection of every root word, it will clearly result in a very large number of entries in such a table. The present system stores the same information redundantly even when two root words follow the same rule. It does not show the relationship among different root words that have similar inflections. Thus it is not possible to represent a linguistic generalization. If the system is to have the capability of understanding an unknown
word, linguistic generalization is necessary. In the generation process, the linguistic knowledge can be used if the system needs to coin a new word. In this paper, a pattern based bootstrapping approach for the morphological analysis of Malayalam, which learns new patterns as it receives inputs, is proposed. New patterns are learned according to a tuple confidence score and a pattern confidence score. There are many challenges involved in the implementation of the system. Malayalam is a less-resourced language, so it does not have any standard corpus, and annotating a corpus manually is a time-consuming task. Malayalam is a morphologically rich language, and morphological suffixes convey most of the roles played in a sentence. Malayalam words can also be compound words, meaning that many words can be combined to form a single word. Even though Malayalam has an unmarked subject-object-verb order, it is a relatively free word order language. Malayalam has a rich case marking system, with different types of case suffixes. It does not have capitalization information. The remainder of the paper is organized as follows: Section II gives the concepts and previous work, Section III explains the concepts and implementation details of pattern based bootstrapping for morphological analysis, and Section IV discusses the results obtained and their performance on different kinds of inputs.

II. BASIC CONCEPTS AND PREVIOUS WORK
Natural Language Processing is a subfield of artificial intelligence and linguistics that has been developed since the 1960s. The aim of NLP is to study problems in the automatic generation and understanding of natural languages. The primary goal of natural language processing is to build computational models of natural language for its analysis and generation. The first step in Natural Language Processing is morphological analysis. In linguistics, morphology refers to the mental system involved in word formation, or to the branch of linguistics that deals with words, their internal structure, and how they are formed. A major way in which morphologists investigate words, their internal structure, and how they are formed is through the identification and study of morphemes, often defined as the smallest linguistic pieces with a grammatical function. The term 'morph' is sometimes used to refer specifically to the phonological realization of a morpheme [1]. A morphological analyzer analyzes individual words; non-word tokens such as punctuation are separated from the words. Pattern based bootstrapping is a semi-supervised learning approach. In this approach, the training set contains patterns for some different kinds of words and their inflections. The test data is matched against the training data for possible matches, and POS tags are assigned to the words. If words from the test data cannot be matched with any of the inputs in the training data, the system performs scoring and finds a pattern for them. Several approaches to the morphological analysis of Malayalam and other Dravidian languages have been proposed in recent years. In [2], a paradigm based approach is developed for the morphological analysis of Indian languages. For a given stem, a paradigm defines all word forms and its associated feature structure. It creates
different tables of word forms covering the words in the language. Each word-forms table contains a set of roots, which means that, to generate the word forms, the roots should follow the patterns (paradigm) implicit in the table. For inflectionally rich languages, the paradigm based approach is very effective. The paradigms can be extracted from the word forms of any root word by identifying the number of characters to be deleted from the root and the characters to be added to obtain each word form. In [3], a suffix stripping method for morphological analysis of Malayalam has been proposed. A stem dictionary, a suffix dictionary that contains all possible suffixes of nouns/verbs in the language, and morphophonemic (sandhi) rules are used in the suffix stripping method. The analyzer splits the inflected forms into suffixes and stem. Once the suffix of the word has been identified, that suffix is removed and the proper sandhi rules are applied to obtain the root word. In suffix stripping algorithms, rules are stored to identify the root/stem form. A hybrid method [4] combines both the paradigm approach and the suffix stripping approach. As in the paradigm approach, nouns and verbs are classified into different paradigms. As in suffix stripping methods, suffixes are identified first, and then the stem is identified by applying sandhi rules.

III. PROPOSED MODEL
This paper proposes a morphological analyzer for the Malayalam language. It uses a pattern based bootstrapping approach for building the morphological analyzer. The morphological analysis is performed in two steps.
- Morpheme Segmentation
- Morphological Analysis
A suffix list containing all suffixes of Malayalam words is used for morpheme segmentation. A bootstrapping approach is used here for handling incorrectly segmented morphemes. The training set contains different patterns for different types of Malayalam words, and new patterns are learned by scoring.
Fig1. Overall block diagram

A. Morpheme Segmentation
Morphological analysis is preceded by morpheme segmentation. As Malayalam is a morphologically rich language, proper segmentation of words into morphemes is necessary to identify the constituents of words. The morpheme segmentation step breaks a Malayalam word into its constituent parts: a root word followed by any number of suffixes. A predefined list of suffixes is used to identify the suffixes in a word. Each identified suffix is separated by a suffix marker, and the process continues with the next suffix. After all suffixes have been separated, the word is represented as a root word followed by its
morphological suffixes. There are mainly two problems that occur in morpheme segmentation. The first problem is that the substring that forms a suffix can be part of the root word itself; it need not always be a suffix. For example:

(1) maraththil - 'on the tree'
(2) nilkkunnu - 'standing'

In (1) the substring il is a case marker, but in (2) it is not a case marker; it is part of the root word itself. The basic method nevertheless incorrectly splits (2) at the substring il. This happens because of the incorrect segmentation of the suffix il, which is a case marker elsewhere but here is simply part of the word. The problem of incorrect segments is solved by a bootstrapping approach using a mutual information score. The method computes a mutual information score for each pair of adjacent morphemes of the word:

MI(x, y) = P(x, y) / (P(x) P(y))        (3.1)

where x and y are segmented morphemes. All such pairs (x, y) are considered and their MI is computed. The pair having the highest MI is combined to form the correctly segmented morpheme.

The second problem with morpheme segmentation is the breaking up of the root word. As Malayalam is a highly inflected language, the root word is sometimes mixed up with the suffixes, or the root word itself is broken up. In example (1), the word contains the sandhi ththa and the case marker il. After splitting into constituents, the root word obtained is mara, whereas the exact Malayalam word is maram; this happens because of the presence of the morphological suffixes ththa and il. To form the exact root word, a finite state transducer is used. It holds a set of rules for forming the root word by checking the inflections. The finite state transducer checks the first and second segments of the segmented morphemes; the rules are written according to the last letter of the first segment and the first letter of the second segment. In example (1), the first and second segments are mara-ththa. Because of the ththa in the second segment and the final letter of the first segment, an m is added after mara to give the Malayalam word maram (tree).
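The segmentation-and-merge procedure can be sketched as follows. This is a minimal illustration, assuming a small hard-coded suffix list, pre-computed morpheme counts and a greedy right-to-left stripping strategy; the suffix inventory, the corpus statistics and the romanized forms are illustrative assumptions rather than the paper's actual resources.

import java.util.*;

// Illustrative sketch: strip known suffixes from the end of a word, then
// merge the adjacent pair with the highest mutual information (MI) score
// to undo over-segmentation. Suffixes and counts are made-up examples.
public class MorphemeSegmenter {
    static final List<String> SUFFIXES = Arrays.asList("il", "unnu", "kal", "ine");

    // Greedy right-to-left suffix stripping using the predefined suffix list.
    static List<String> segment(String word) {
        LinkedList<String> parts = new LinkedList<>();
        String rest = word;
        boolean stripped = true;
        while (stripped) {
            stripped = false;
            for (String suf : SUFFIXES) {
                if (rest.length() > suf.length() && rest.endsWith(suf)) {
                    parts.addFirst(suf);
                    rest = rest.substring(0, rest.length() - suf.length());
                    stripped = true;
                    break;
                }
            }
        }
        parts.addFirst(rest);          // remaining prefix is treated as the root
        return parts;
    }

    // MI(x, y) = P(x, y) / (P(x) * P(y)), estimated from corpus counts;
    // unseen items get a count of 1 to avoid division by zero.
    static double mi(String x, String y, Map<String, Integer> uni,
                     Map<String, Integer> bi, int total) {
        double px  = uni.getOrDefault(x, 1) / (double) total;
        double py  = uni.getOrDefault(y, 1) / (double) total;
        double pxy = bi.getOrDefault(x + "+" + y, 1) / (double) total;
        return pxy / (px * py);
    }

    // Merge the adjacent pair with the highest MI, undoing a wrong split
    // such as one caused by a non-suffix "il" inside the root.
    static List<String> mergeBestPair(List<String> parts, Map<String, Integer> uni,
                                      Map<String, Integer> bi, int total) {
        if (parts.size() < 2) return parts;
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i + 1 < parts.size(); i++) {
            double s = mi(parts.get(i), parts.get(i + 1), uni, bi, total);
            if (s > bestScore) { bestScore = s; best = i; }
        }
        List<String> merged = new ArrayList<>(parts.subList(0, best));
        merged.add(parts.get(best) + parts.get(best + 1));
        merged.addAll(parts.subList(best + 2, parts.size()));
        return merged;
    }
}

In a full system the merge step would be applied only to words whose initial segmentation is suspect, and the root-repair rules of the finite state transducer would run afterwards.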
Fig2. Flow diagram of morpheme segmentation
The figure above is the flow diagram of morpheme segmentation. Suffix patterns correspond to the list of suffixes for Malayalam words.

B. Morphological Analysis
The segmented morphemes act as the input to the morphological analysis module. A pattern based bootstrapping approach is used to perform the morphological analysis. The training data consist of Malayalam words, their segmented morphemes and the corresponding POS tags. The test data contain words with their segments. The seed patterns are extracted from the training data and compared with the test data. When a perfect match occurs, the corresponding POS tag is attached to the test word. If only a partial match occurs, new patterns are learned by bootstrapping. To perform this, two scores are used: the pattern confidence score and the tuple confidence score.

- Pattern confidence score: the pattern confidence score is determined by the number of samples matching a pattern Pi among all the available samples, and gives the probability of occurrence of the pattern in the corpus:

Score(Pi) = #(Pi) / sum_{j=1..N} #(Pj)        (3.2)

where N is the total number of patterns and #(Pi) is the number of samples matching Pi. This step selects the pattern Pi with the maximum score.

- Tuple confidence score: the tuple confidence score is determined by the probability of the number of samples carrying a given tuple value under the patterns of a particular relation R, relative to all samples with the same tuple value:

Score(t) = min_i ( #(t, Pi) / #(t) )        (3.3)

where t is a tuple, #(t, Pi) is the number of samples in which t occurs as a tuple of pattern Pi, and #(t) is the total number of samples containing t.

Not all words in the input can be handled by the patterns given in the training set. When words are left without POS tags, they are handled by learning new patterns. The pattern confidence score and the tuple confidence score are computed using equations 3.2 and 3.3 respectively. A pattern with the highest pattern confidence score is the most frequently occurring pattern in the training set, and a tuple with the lowest tuple confidence score is the least frequently occurring tuple within a pattern. The pattern with the maximum pattern confidence score is selected, one of its tuples with the lowest tuple confidence score is masked, and the masked pattern is checked against the seed patterns extracted from the test data. If it matches a seed pattern, the masked tuple is replaced by the corresponding tuple in the seed pattern. This new set of learned patterns is then used for matching the remaining inputs. If no match is found with the learned patterns, the pattern confidence score and tuple confidence score are calculated again, and masking and matching are repeated until all of the inputs are handled or there are no new patterns to learn.
Fig3. Flow diagram of Morphological Analysis

IV. RESULTS & PERFORMANCE EVALUATION
The morphological analyzer is implemented in Java using the NetBeans 8.0.2 IDE. The training data set used for developing this system is composed of Malayalam words, their segmented morphemes and POS tags. The test data considered here are newspaper articles. A sample input and output for each module is given below.

MORPHEME SEGMENTATION
Example: kuttikal pambine kollunnu ('the children are killing the snake')
kuttikal -> kutti-kal (children -> child-s)
pambine -> pambu-ine
kollunnu -> kolluka-unnu (killing -> kill-ing)

MORPHOLOGICAL ANALYSIS
Example:
kutti-kal -> Noun + Plsufx
pambu-ine -> Noun + CM
kolluka-unnu -> Verb + TM

The evaluation metrics used here are precision, recall and F-measure. The F-measure (or F1-score) is a measure of a pattern's accuracy; it considers both the precision and the recall of the pattern to compute the score.

Precision = (Number of words correctly detected) / (Total number of words detected)        (4.1)

Recall = (Number of words correctly detected) / (Total number of words in the document)        (4.2)

F-Measure = (2 x Precision x Recall) / (Precision + Recall)        (4.3)
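The three metrics reduce to a few lines of code; the following is a generic restatement of equations (4.1)-(4.3), not the authors' actual evaluation script.

// Direct restatement of equations (4.1)-(4.3).
public class Metrics {
    static double precision(int correctlyDetected, int totalDetected) {
        return correctlyDetected / (double) totalDetected;
    }
    static double recall(int correctlyDetected, int totalInDocument) {
        return correctlyDetected / (double) totalInDocument;
    }
    static double fMeasure(double p, double r) {
        return (p + r) == 0 ? 0 : 2 * p * r / (p + r);
    }
}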
The outputs of the morphological analyzer for different nouns, verbs, adverbs and adjectives have been collected. The performance of the morphological analyzer is computed for nouns, verbs, adverbs and adjectives, and is shown in charts for precision, recall and F-measure.

Fig4. Precision of Nouns, Verbs, Adverbs, and Adjectives

Figure 4 above shows the precision of the output obtained for sets of nouns, verbs, adverbs and adjectives. In all cases the precision is above 60%. The results improve when more varied inputs are used.

Fig5. Recall of Nouns, Verbs, Adverbs, and Adjectives

Figure 5 above shows the recall of the output obtained for sets of nouns, verbs, adverbs and adjectives. In all cases the recall is above 50%. The results improve when more varied inputs are used.

Fig6. F-Measures of Nouns, Verbs, Adverbs, and Adjectives

Figure 6 above shows the F-measure of the output obtained for sets of nouns, verbs, adverbs and adjectives. In all cases the F-measure is above 60%. The results improve when more varied inputs are used.

V. CONCLUSION AND FUTURE WORK
The current morphological analyzer performs well on nouns and verbs. The system can be further extended to recognize compound words and more adverbs and adjectives. The presence of sandhi letters anywhere inside a word makes it more difficult to handle. The training and test data should be expanded to obtain better accuracy.

REFERENCES
[1] Mark Aronoff and Kirsten Fudeman, Thinking about Morphology and Morphological Analysis.
[2] Akshar Bharati, Vineeth Chaitanya, Rajeev Sangal, Natural Language Processing: A Paninian Approach, pages 36-42.
[3] Rajeev R R, Rajendran N, Elizabeth Sherly, A Suffix Stripping Based Morphological Analyzer For Malayalam Language, Science Congress 2008.
[4] Vinod P M, Jayan V, Bhadran V K, Implementation of Malayalam Morphological Analyzer based on Hybrid Approach, Proceedings of the Twenty-Fourth Conference on Computational Linguistics and Speech Processing (ROCLING 2012).
[5] Jisha P Jayan, Rajeev R R, S Rajendran, Morphological Analyser for Malayalam - A Comparison of Different Approaches, IJCSIT, Vol. 2, No. 2, Dec 2009, pp. 155-160.
[6] Yasunari Maeda, Naoya Ikeda, Hideki Yoshida, Yoshitaka Fujiwara, and Toshiyasu Matsushima, A Note on Morphological Analysis Methods based on Statistical Decision Theory, SICE Annual Conference 2007.
[7] Cheng Juan, Research and Implementation of English Morphological Analysis and Part-of-Speech Tagging, International Conference on E-Health Networking, Digital Ecosystems and Technologies 2010.
[8] Yuguang Wang, Hongcui Wang, Jiaqi Gao, Jianguo Wei and Jianwu Dang, Detailed Morphological Analysis of Mandarin Sustained Steady Vowels, IEEE, 2012.
[9] Meera Subhash, Wilscy M, S A Shanavas, A Rule Based Approach For Root Word Identification In Malayalam Language, International Journal of Computer Science & Information Technology (IJCSIT), Vol 4, No 3, June 2012.
[10] Nimal J Valath, Narsheedha Beegum, Malayalam Noun and Verb Morphological Analyzer: A Simple Approach, International Journal of Science and Research (IJSR), Volume 3, Issue 7, July 2014.
[11] Jisha P Jayan, Rajeev R R, Dr. S Rajendran, Morphological Analyser and Morphological Generator for Malayalam-Tamil Machine Translation, International Journal of Computer Applications (0975-8887), Volume 13, No. 8, January 2011.
[12] Aswani Shaji, Sindhu L, Morphological Analyzer for Malayalam: A Literature Survey, International Journal of Computer Applications (0975-8887), Volume 107, No. 14, December 2014.
An Intelligent, Stand-Alone, Smart Tracker
This is designed for those who need eye tracking embedded in their product. The new AEye technology from EyeTech Digital Systems creates new possibilities where a compact, intelligent tracking module is required. It offers multi-OS support, compact size, low cost, robust tracking with fast acquisition, and can be attached to computing devices. Visit: http://www.eyetechds.com/board-level-oem.html
A Study on Machine Aided Translation Approaches in India Deepa C A Govt Engineering College Sreekrishnapuram Palakkad
Sincy V Thambi Govt Engineering College Sreekrishnapuram Palakkad
Varsha K V Govt Engineering College Sreekrishnapuram Palakkad
ABSTRACT: A survey of the machine translation systems that have been developed in India for translation from English to Indian languages and among Indian languages reveals that these MT systems are either in field testing or available as web translation services. India is a multilingual and multicultural country with a population of over 1.25 billion and 22 constitutionally recognized languages written in 12 different scripts. This necessitates automated machine translation systems for English to Indian languages and among Indian languages, so that people can exchange information in their local languages. Several machine translation systems have been developed in India for translation from English to Indian languages using different approaches.
I. INTRODUCTION
India has a diverse set of spoken languages. At least 30 different languages and around 2000 dialects have been identified. In India, English and Hindi are used as the languages of official communication. Additionally, the constitution of India specifies a set of 22 scheduled languages which can be officially adopted by different states for administrative purposes, as a medium of communication between the national and state governments, and for examinations conducted for national government service. In a large multilingual society like India, there is a great demand for translation of documents from one language to another. Machine translation activities in India are relatively young. The term Machine Translation (MT) is the standard name for the use of computers to automate some or all of the
process of translating from one natural language to another. In India several institutes work on machine translation. The prominent institutes are as follows:
- The research and development projects at the Indian Institute of Technology (IIT), Kanpur
- National Centre for Software Technology (NCST), Mumbai (now Centre for Development of Advanced Computing (CDAC), Mumbai)
- Computer and Information Sciences Department, University of Hyderabad
- Centre for Development of Advanced Computing (CDAC), Pune
- Ministry of Communications and Information Technology
- Government of India, through its Technology Development in Indian Languages (TDIL) Project
Machine translation systems for Indian languages: Natural language processing (NLP) is an area concerned with the interactions between computers and human (natural) languages. Translation, in its full generality, is a difficult, fascinating, and intensely human endeavour, as rich as any other area of human creativity. There are different approaches to machine translation; they can be classified as shown in Fig. 1.

Fig.1: Machine Translation Systems Classifications

II. Different Approaches for Machine Translation

1. Direct translation
As the name suggests, these systems provide direct translation without using any intermediate representation. Translation is done word by word using a bilingual dictionary, usually followed by some syntactic rearrangement. The steps involved are:
1. Identification of root words by removing suffixes from source language words.
2. Dictionary look-up to get the target language words/morphemes.
3. Word order is changed to match the target language. For English to Malayalam, this may mean reordering prepositions to postpositions and changing subject-verb-object structure to subject-object-verb structure.

1.1 Anusaaraka systems among Indian languages (1995)
The first direct-approach MT system in India was developed by Rajeev Sangal at IIT Kanpur in 1995, and the work was later continued at IIIT Hyderabad. ANUSAARAKA is a machine aided translation system among Indian languages built with funding from the TDIL project. It is a language processor rather than a machine translation system. Anusaarakas have been built from Telugu, Kannada, Bengali, Marathi, and Punjabi to Hindi. It is domain free, but the system has mainly been applied to translating children's stories. The focus in Anusaaraka is not mainly on machine translation, but on language access between Indian languages. Using principles of Paninian Grammar (PG), and exploiting the close similarity of Indian languages, an Anusaaraka essentially maps local word groups between the source and target languages.

1.2 Punjabi to Hindi MT System (2007, 2008)
This system comprises modules for pre-processing, word-to-word translation using a Punjabi-Hindi lexicon, morphological analysis, word sense disambiguation,
transliteration and post-processing. The system has additional modules for training the system to generate the lexicon from an already existing corpus, converting input text fonts into Unicode format to make the system free from specific font dependencies, normalizing Hindi text to handle spelling variations for the same word due to dialectal variation, replacing collocations using a collocation lexicon, named entity recognition and replacement, word by word translation using a bilingual dictionary, and transliteration of unknown words.

1.3 Hindi-to-Punjabi MT System
Goyal V and Lehal G S, of Punjabi University, Patiala, developed a system for Hindi to Punjabi that uses the same direct word to word translation approach.

2. Rule Based Translation
Rule based MT systems parse the source text and produce an intermediate representation, which may be a parse tree or some abstract representation. The target language text is generated from the intermediate representation. These systems rely on specification of rules for morphology, syntax, lexical selection and transfer, semantic analysis and generation and hence are called rule based systems. Depending on the intermediate representation used, these systems are further categorized as Transfer based machine translation and Interlingua based machine translation.
2.1 Transfer based machine translation
In this method the structure of the input text is obtained by parsing the input sentence. It consists of three modules: an analysis module, a transfer module and a generation module. The analysis module identifies the structure of the sentence in the source language; language grammar rules can be used for this purpose. The transfer module transfers the source language structure representation to a target language representation. This module needs subtree rearrangement rules by which the source language syntax tree can be transformed into the target language syntax tree. The generation module generates the target language text from the target language structure. Most of the systems use this approach for machine-aided translation.

Mantra MT (1997)
Mantra is an English to Hindi MT system developed by Bharati for information preservation. The text available in one Indian language is made accessible in another Indian language with the help of this system. It uses an XTAG based super tagger and a light dependency analyzer developed at the University of Pennsylvania for performing the analysis of the input English text. It distributes the load between man and machine in novel ways. The system produces several outputs corresponding to a given input.

MANTRA MT (1999)
It translates English text into Hindi in the specific domain of personnel administration, which includes gazette notifications, office orders, office memorandums and circulars. The system was tested for the translation of
administrative documents such as appointment letters, notifications and circulars issued by the central government, from English to Hindi, in the initial stage. It uses the Tree Adjoining Grammar (TAG) formalism developed by the University of Pennsylvania and Lexicalized Tree Adjoining Grammar (LTAG) (Bandyopadhyay, 2004) to represent English and Hindi grammar. The system was developed for the Rajya Sabha Secretariat, the Upper House of the Parliament of India, and is used to translate proceedings of parliament such as papers to be laid on the Table, Bulletin Part-I and Part-II. It can translate from English to Bengali, Telugu, Gujarati and Hindi, and also among other Indian languages.

MAT (2002)
It is a machine aided translation system for translating English texts into Kannada. This system makes use of a morphological analyser and generator for Kannada. The input text is parsed by a Universal Clause Structure Grammar parser, which outputs the number, type and inter-relationships among the various clauses in the sentence and the word groups.

Shakti (2003)
This system translates English to any Indian language with a simple system architecture. It follows a hybrid approach which combines the rule-based approach with the statistical approach. It contains 69 modules, covering the three steps discussed under the transfer based approach.

OMTrans (2004)
This system translates text from English to Oriya based on the grammar and semantics of the source and target languages. Word Sense Disambiguation is also handled by this system.

The MaTra System (2004, 2006)
This system uses a frame-like structured representation. It is a domain specific system which takes news articles as input and uses text categorization to classify them as sports, politics, economics etc., keeping a separate dictionary for each of these categories. It is an English-Hindi translation system which requires considerable human assistance in analysing the input.

Sampark System: Automated Translation among Indian Languages (2009)
A consortium of 11 institutions in India has developed a multipart machine translation system for Indian Language to Indian Language Machine Translation (ILMT), funded by the TDIL program of the Department of Electronics and Information Technology (DeitY), Govt. of India. It uses Computational Paninian Grammar (CPG) for analysing language and combines it with machine learning.

2.2 Interlingua based machine translation
In this approach translation is a two-step process: analysis and synthesis. During analysis, the source language text is converted into a language independent meaning representation called an Interlingua. In the synthesis phase the interlingual representation is translated into any target language. Thus it can be used for multilingual translation.
Angla-Bharti
Angla-Bharti is an MT system for translation from English to Indian languages which uses a pseudo-Interlingua approach. ANGLABHARTI represents a machine-aided translation methodology specifically designed for translating English to Indian languages. Angla-Bharti uses a pattern directed approach using context free grammar like structures. It analyses English only once and creates an intermediate structure called PLIL (Pseudo Lingua for Indian Languages). The PLIL structure is then converted to each Indian language through a process of text generation. There is provision for automatic pre-editing and paraphrasing, recognition of named entities, and an incorporated error-analysis module and statistical language model for automated post-editing. The purpose of the automatic pre-editing module is to transform or paraphrase the input sentence into a form which is more easily translatable. The project has been implemented in consortium mode with four participating institutions. The language pairs being targeted are English to Hindi/Marathi/Bengali/Oriya/Tamil/Urdu. An experimental machine translation system has been made available for the following language pairs as a technology demonstrator:
i) English to Bangla
ii) English to Punjabi
iii) English to Malayalam
iv) English to Urdu

UNL-based English-Hindi MT System
This system uses Universal Networking Language (UNL) as the Interlingua structure. The UNL is an international project aimed at creating an Interlingua for all major human languages. English-Hindi, Hindi-UNL, UNL-Hindi, English-Marathi and English-Bengali systems were also developed using the UNL formalism. The other systems developed are extensions of the ANGLABHARTI system.
3. Corpus Based Machine Translation
Corpus based MT systems have gained much interest in recent years. The advantage of these systems is that they are fully automatic and require less human labour than rule based approaches. The disadvantage is that they need sentence-aligned parallel text for each language pair, and the method cannot be employed where such corpora are not available. Corpus based systems are classified into statistical machine translation (SMT) and example based machine translation (EBMT).

3.1 Statistical Machine Translation
This method uses statistical methods for machine translation. The task involves three steps:
i) Estimating the language model probability P(t)
ii) Estimating the translation model probability P(s|t)
iii) Devising an efficient search for the target text that maximizes their product.
In this model s is the source language sentence and t is the target language sentence. The probabilities are calculated from the parallel corpus.
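These three steps correspond to the standard noisy-channel formulation; the equation below is a textbook restatement added for reference, not a formula taken from any particular surveyed system.

\[
\hat{t} \;=\; \arg\max_{t} P(t \mid s) \;=\; \arg\max_{t} \, P(t)\, P(s \mid t)
\]

where P(t) is the language model probability and P(s|t) is the translation model probability.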
Smoothing techniques are required for handling the data sparsity problem that occurs in any noisy channel model.

Bengali to Hindi MT System (2009)
This hybrid machine translation system uses an integration of SMT with a lexical transfer based system (RBMT), i.e. a multi-engine machine translation approach. The experiments show that the BLEU scores of the SMT and lexical transfer based systems, when evaluated separately, are 0.1745 and 0.0424 respectively.

3.2 Example based Machine Translation (EBMT)
An example based machine translation (EBMT) system maintains a corpus consisting of translation examples between source and target languages. An EBMT system has two modules: a retrieval module and an adaptation module. The retrieval module retrieves a similar sentence and its translation from the corpus for the given source sentence. The adaptation module then adapts the retrieved translation to get the final corrected translation.
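A minimal sketch of these two modules is given below, assuming a word-overlap similarity measure for retrieval and a simple substitution table for adaptation; both choices, as well as the example base and dictionary, are illustrative and not taken from the surveyed systems.

import java.util.*;

// Sketch of the two EBMT modules: a retrieval module that finds the most
// similar stored source sentence, and an adaptation step that patches the
// retrieved translation with a substitution table.
public class EbmtSketch {
    // Retrieval: pick the stored source sentence with the highest word overlap.
    static String retrieve(String input, Map<String, String> exampleBase) {
        Set<String> in = new HashSet<>(Arrays.asList(input.toLowerCase().split("\\s+")));
        String bestSource = null;
        double bestSim = -1.0;
        for (String src : exampleBase.keySet()) {
            Set<String> s = new HashSet<>(Arrays.asList(src.toLowerCase().split("\\s+")));
            Set<String> common = new HashSet<>(in);
            common.retainAll(s);                       // shared words
            double sim = common.size() / (double) Math.max(in.size(), s.size());
            if (sim > bestSim) { bestSim = sim; bestSource = src; }
        }
        return bestSource;                             // most similar stored sentence
    }

    // Adaptation: substitute the parts that differ between the input and the
    // retrieved example, using a bilingual substitution table (placeholder).
    static String adapt(String retrievedTranslation, Map<String, String> substitutions) {
        String out = retrievedTranslation;
        for (Map.Entry<String, String> e : substitutions.entrySet()) {
            out = out.replace(e.getKey(), e.getValue());
        }
        return out;
    }
}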
ANUBAAD (2000, 2004)
This system translates news headlines from English to Bengali using the example based machine translation approach. An English news headline given as input is first searched in the direct example-base for an exact match. If a match is found, the Bengali headline from the example-base is produced as output. If no match is found, the headline is tagged and the tagged headline is searched in the Generalized Tagged example-base; if a match is found there, the Bengali headline is generated after appropriate synthesis. If a match is still not found, the Phrasal example-base is used to generate the target translation. If the headline still cannot be translated, a heuristic translation strategy is applied, in which translations of the individual words or terms are generated in their order of appearance in the input headline.

VAASAANUBAADA (2002)
An automatic machine translation system for Bengali-Assamese news texts using the example based machine translation (EBMT) approach. It performs Bengali-Assamese sentence level machine translation for Bengali text, and includes pre-processing and post-processing tasks. The bilingual corpus has been constructed and aligned manually by feeding in real examples. Longer sentences are fragmented at punctuation marks to obtain better quality translation. When an exact match is not found at the sentence or fragment level in the example-base, backtracking is used and the sentence is fragmented further.

III. CONCLUSION
India has a variety of languages, and machine translation therefore has great relevance. The area is considered an emerging area of research. There are several approaches to performing machine translation. Most of the successful approaches in the earlier stages were rule based systems. Since Indian languages are morphologically rich, linguistic knowledge is essential
for the translation. Recent advances may involve the use of EBMT systems to do the translation. The challenge for such systems is the need for parallel bilingual corpora. The study shows that there are now systems which consider not only English but also other Indian languages as source or target languages, which can be considered a good achievement for MT. In the current scenario, the existing systems require only small modifications in order to give good results.

REFERENCES
[1] Sudeep Naskar, Sivaji Bandyopadhyay, Machine Translation Systems in India, Proceedings of MT SUMMIT X, September 13-15, 2005, Phuket, Thailand.
[2] G V Gharje, G K Kharate, Survey of Machine Translation Systems in India, International Journal of Computer Applications, Number 23, 2015.
[3] D S Rawat, Survey on Machine Translation Approaches used in India, International Journal of Engineering and Technical Research (IJETR), ISSN: 2321-0869, June 2015.
[4] Sanjay Kumar Dwivedi and Pramod Premdas Sukhadeve, Machine Translation System in Indian Perspectives, Journal of Computer Science 6 (10): 1111-1116, 2010, ISSN 1549-3636.
[5] Latha R. Nair, David Peter S, Machine Translation Systems for Indian Languages, International Journal of Computer Applications, February 2012.
[6] Vaishali M. Barkade, Prakash R Devale, Suhas H Patil, English to Sanskrit Machine Translator: Lexical Parser and Semantic Mapper, NCICT-10.
[7] Ruchika A Sinhal, Kapil O Gupta, A Pure EBMT Approach for English to Hindi Sentence Translation System, I.J. Modern Education and Computer Science, 2014.
[8] Nithya B, Shibily Joseph, A Hybrid English to Malayalam Machine Translator for Malayalam Content Creation in Wikis, IOSR-JCE, 2013.
[9] Neeha Ashraf, Manzoor Ahmed, Machine Translation Techniques and their Comparative Study, International Journal of Computer Applications (0975-8887), September 2015.
[10] Amrita Godase, Sharvari Govilkar, Machine Translation Development for Indian Languages and its Approaches, International Journal on Natural Language Computing (IJNLC), April 2015.
[11] Pankaj Upadhyay, Umesh Chandra Jaiswal, Kumar Ashish, TranSish: Translator from Sanskrit to English - A Rule based Machine Translation, International Journal of Current Engineering and Technology, October 2014.
Securing Cloud Data from Data Mining Attacks
Jyothsna G K
M. Tech Computer Science
Jyothi Engineering College, Vettikkatiri, Cheruthuruthy
jyothsu.praveen@gmail.com
ABSTRACT: Cloud security is a topic of active discussion in the IT world. Cloud computing provides several powerful and efficient ways to perform computation-related tasks without owning costly resources, and it also provides a simple procedure for storing data in remote storage. Security and confidentiality of the large amounts of data involved in the use of cloud computing are becoming highly necessary and important. Several types of security or confidentiality attacks are possible on data stored in a cloud environment. One of the security concerns of the cloud is data mining based privacy attacks, which involve analyzing data over a long period to extract valuable information and are responsible for privacy violations of clients. This paper presents a distributed storage cloud architecture, describes the Trusted Computing Group (TCG) and the Trusted Platform Module (TPM), and adds secure cloud storage using the proposed cryptographic solution together with a searchable encryption technique for the files to be accessed. This is a better approach for the user to ensure the security of data.
I. INTRODUCTION
Cloud computing is one of the latest trends that companies highlight in their marketing. It is a store house of various components such as virtual computing, web applications, clusters, application servers, etc. Virtualization is a component of cloud computing. Although cloud computing is a powerful means of achieving high storage and computing services at a low cost, it has not lived up to its reputation: many potential users and companies still lack interest in cloud based services [1]. One of the main reasons behind this lack of interest is security. The cloud has several security issues involving the confidentiality of data [6]. A user may lose access to data temporarily or permanently due to an unlikely event such as a malware attack or a network outage. Another big concern is the confidentiality of user data in the cloud. The cloud gives providers an opportunity to analyse user data over a long time. In addition, outside attackers who gain access to the cloud can also analyse data and violate user privacy. The cloud is a source of massive static data, and also a provider of high processing capacity at low cost. Thus the cloud is more vulnerable, as attackers can use its raw processing power to analyse data [1]. Many data analysis techniques are available nowadays that successfully extract valuable information from a large volume of data. Cloud service providers use these analysis techniques, for example to analyse user behaviour and recommend search results [1]; Google uses such data analysis techniques. Attackers can use these techniques to extract valuable information
from the cloud. Data analysis involves mining, which is closely associated with the statistical analysis of data. Data mining can be a potential threat to cloud security, considering the fact that all the data belonging to a particular user is stored with a single cloud provider. This gives the provider the opportunity to use powerful mining algorithms that can extract private information about the user. As mining algorithms require a reasonable amount of data, the single-provider architecture suits the purpose of the attackers. This approach (a single cloud storage provider) also eases the job of attackers who have unauthorized access to the cloud and use data mining to extract information. Thus the privacy of data in the cloud has become a major concern in recent years. This paper presents an approach to prevent data mining based attacks on the cloud. The system involves distributing user data among multiple cloud providers to make data mining a difficult job for attackers. The key idea of this approach is to categorize user data, divide the data into chunks and provide these chunks to the proper cloud providers. So, even if someone has illegal access to the storage of any one service provider, he cannot retrieve all the data of any particular client. Hence, it would be impossible to extract any important strategic information from the limited available data. This paper is organized as follows: the second section describes data mining risks, the third section is about trustworthy computing, the fourth describes CIA, the fifth presents the distributed architecture, the sixth describes the security framework, and the conclusion is given in the seventh section.
II. DATA MINING: A Potential Threat to Privacy
Data mining is the extraction of information from large amounts of data. Cloud providers use data mining to provide clients a better service [3]. If clients are unaware of the information being collected, values like privacy and individuality are violated [2]. This can be a serious data privacy issue if the cloud providers misuse the information. Moreover, attackers outside the cloud providers who have unauthorized access to the cloud also have the opportunity to mine data from the cloud. In these cases, attackers can use the cheap, raw computing power provided by cloud computing to mine data and thus acquire useful information. Whether information can be successfully extracted via data mining depends on two main factors: one is a proper amount of data and the other is suitable mining algorithms. There are different mining algorithms used for numerous purposes. Some mining algorithms are good enough to extract information to an extent that violates client privacy. For example, clustering algorithms can be used to categorize people or entities and are suitable for finding behavioural patterns [4], and association rule mining can be used to discover association relationships among a large number of business transaction records [4]. Thus analysis of data can reveal private information about a user, and leaking this type of information may do significant harm. As more research is carried out on mining, improved algorithms and tools are being developed [5]. Thus, data mining is
becoming more powerful and posing more of a threat to cloud users.

III. TRUSTWORTHY COMPUTING
Encryption is the main reason for choosing online backup; it is the favourite choice for computer storage. Encryption prevents malicious parties from accessing, modifying or damaging files by storing them in a form that is inaccessible without the key. Trustworthy computing has shaped the hardware root of trust developed by the industry to defend computer infrastructure and millions of end points. Encryption does not allow unwanted persons to access, damage or maliciously change files in ways that are harmful to the client or the client's business. Encryption is a process through which data is converted from plain text to cipher text using a key; it protects the data on the data server. The TCG [5] created the Trusted Platform Module (TPM) cryptographic capability, which implements specific behaviours and protects the system against unauthorized access and attacks such as malware and rootkits. As computing has been extended to heterogeneous devices and infrastructures have grown, the TCG concept has also been extended to trusted computing systems beyond the computer with a TPM, to other devices ranging from mobile phones to hard disk drives. This can secure cloud computing and virtualized systems.

IV. CIA: Confidentiality, Integrity and Authentication
The security concerns of the cloud have high priority. Security for systems, networks and data can be provided by the
Trusted Computing Group technologies, including the Trusted Platform Module (TPM), network security, and self-encrypting drives. TCG also provides security policies for private and public networks and describes how to interface different technical standards to an enterprise solution tailored to fulfil mission and business requirements. The TPM (Trusted Platform Module) is a computer chip that can securely store artifacts used to authenticate the platform (a client PC or laptop). The artifacts include passwords, certificates, or encryption keys. A TPM can also be used to store platform measurements, which help ensure that the platform remains trustworthy. Authentication (ensuring that the platform can prove it is what it claims to be) and attestation (a process helping to prove that a platform is trustworthy and has not been breached) are mandatory steps to ensure secure computing in all environments [6]. The TPM maintains three properties:
- Confidentiality: Hardware based cryptography ensures that information is better protected from software attacks, because some information is stored in the hardware. A number of applications developed around TCG store secret information in the TPM. Accessing information on computing devices requires proper authorization; these applications make it more difficult to access information.
- Authentication: The TPM identifies the authorized devices and users in the network; these are already registered in the network.
- Platform integrity: Integrity requires that alterations be made only by authorized users through an authorized mechanism. Platform integrity covers the BIOS, application software, operating system, boot sector and disk MBR, ensuring that no changes have occurred through unauthorized access. A secure cryptographic integrated circuit (IC) is used in the hardware to manage user authentication, network access and data protection against unauthorized users.

V. DISTRIBUTED SYSTEM ARCHITECTURE
The distributed system architecture presented in this paper prevents data mining based privacy attacks on the cloud. The system consists of two major components: the Cloud Data Distributor and the Cloud Providers. Data is received by the Cloud Data Distributor in the form of files from clients. It then divides each file into chunks or clusters and distributes these chunks among cloud providers. Cloud Providers store these chunks or clusters and respond to chunk requests by providing chunks. Figure 1 shows the system architecture.
Fig1. System Architecture
For security, there is a Trusted Gateway between the client and the cloud provider. The Trusted Gateway encrypts the client's data and sends it to the cloud data distributor. Figure 2 shows the extended architecture.
Fig2. Extended System Architecture

The cloud data distributor fragments the encrypted data and distributes it among the cloud providers. The Trusted Gateway again decrypts the data when it is delivered back to the client. The Trusted Gateway inspects all outgoing requests, identifies sensitive data and encrypts that data using the TPM. Then it forwards the modified request to the cloud data distributor. Similarly, the encrypted data returning from the cloud data distributor is converted into plain text and displayed to the client.

A. Cloud Data Distributor
The cloud data distributor is present inside every node. The cloud data distributor receives the encrypted data file from the Trusted
Gateway; it splits each data file into fragments (chunks) and distributes these fragments among cloud storages or providers. The cloud storages store the fragments and respond to fragment requests by providing the chunks. In this system, clients do not interact with the cloud storages directly; they interact with the cloud providers via the Trusted Gateway and the Cloud Data Distributor [7]. If the client needs to upload data, it delivers the data to the trusted gateway, which encrypts the data and sends it to the cloud data distributor. Each file has a privacy level, chosen by the client, specifying its data mining sensitivity. The system maintains four sensitivity levels of privacy: PL 0, 1, 2, 3. Privacy level 0 indicates low-sensitivity data, while privacy level 3 indicates highly sensitive or private data; the higher the privacy level, the more sensitive the data inside the file. When the encrypted files are received from the trusted gateway, the cloud data distributor divides each file into fragments carrying the same privacy level as the parent file. The client is informed of the total number of chunks for each file, so that any chunk can be requested by stating the filename and the serial number. The serial number corresponds to the position of the chunk within the file.

B. Cloud Storage or Cloud Providers
The cloud storage providers store the fragmented data; the client can retrieve the desired data and delete files from the cloud providers. A cloud provider receives chunks from the cloud data distributor and stores them. The cloud provider responds to the cloud data distributor, and the cloud data distributor responds to the trusted gateway and provides the data. Cloud providers receive requests from the cloud data distributor and perform the operations as desired [7].
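A minimal sketch of the distributor's split-and-scatter behaviour is shown below; the chunk size, the provider interface and the round-robin placement policy are illustrative assumptions rather than details specified in the paper.

import java.util.*;

// Sketch: split an (already encrypted) file into fixed-size chunks, tag each
// chunk with the file's privacy level and a serial number, and scatter the
// chunks across providers round-robin. Placement policy is an assumption.
public class CloudDataDistributor {
    static final int CHUNK_SIZE = 64 * 1024;   // illustrative size

    record Chunk(String fileName, int serialNo, int privacyLevel, byte[] data) {}

    interface CloudProvider { void store(Chunk chunk); }

    static int distribute(String fileName, byte[] encrypted, int privacyLevel,
                          List<CloudProvider> providers) {
        int serial = 0;
        for (int off = 0; off < encrypted.length; off += CHUNK_SIZE) {
            byte[] part = Arrays.copyOfRange(encrypted, off,
                    Math.min(off + CHUNK_SIZE, encrypted.length));
            Chunk c = new Chunk(fileName, serial, privacyLevel, part);
            providers.get(serial % providers.size()).store(c);   // round-robin
            serial++;
        }
        return serial;   // total number of chunks, reported back to the client
    }
}

Because no single provider ever holds all of a file's chunks, an attacker with access to one provider sees only a fragment of the client's data, which is the property the architecture relies on.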
VI. SECURITY FRAMEWORK
Various security risks are involved in cloud computing, and some good solutions have been designed. The design architecture is shown in Figure 3, where a new security layer is designed for the private cloud. The new security framework sits between the session layer and the transport layer, so that it is transparent to the application layer and the other lower layers. When data is transferred by the client, it is first secured by certain authentication protocols and then saved at the server end; thus the data is stored in a secured way at the server end. Those who want to download or view the data must connect through the same framework to view it. This is done at the application user level, so the data is secured and transferred without any need to disturb the lower layers of the network.
Fig3. Design Architecture
A. Security Framework Model
The detailed architecture is shown in Figure 4 below. The nodes which are connected to the server are connected to the security layer. When a user wants to send data to the cloud, he has to select a security algorithm based on the privacy level of the document; for more security, a stronger security algorithm must be selected.

Fig4. Security Framework Model

The security server secures the documents and stores them in the database. All the systems which belong to that network are connected to the same architecture. When any user wishes to select a document from the data centre, he must be connected to the same security server to get the original document. This is very useful for the security and privacy of the documents.

B. Process at the Sender
The data at the initiator end (the client) is handled as follows. The client encrypts the data by selecting the appropriate approach and sends it to the server end. At the client end the data is read and made ready to send, as shown in Figure 5. At the socket layer, the data is encrypted byte by byte before being sent to the remote end, which then receives encrypted data. The data is carried by the protocol, which processes the other commands occurring in the network. Thus the data is secured at the sender end because of the security framework, which helps in secure data transfer.

Fig5. Process at the sender

C. Process at the Retrieval End
When the data is received at the receiver end, it is decrypted and written to disk as shown in Figure 6. The data is decrypted using the same security approach used at the encryption end. This again works just above the transport layer, where the packets arrive at the end application.

Fig6. Process at the receiver
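The encrypt-before-send and decrypt-on-receive steps can be illustrated with the standard Java cryptography API, using AES (one of the algorithms named in the conclusion). Key generation is shown only for completeness; key management and the TPM-backed key storage described earlier are outside this sketch.

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.security.SecureRandom;
import java.util.Arrays;

// Sketch of the sender/receiver encryption step using AES-GCM.
// The random IV is prepended to the ciphertext so the receiver can decrypt.
public class GatewayCrypto {
    private static final int IV_LEN = 12, TAG_BITS = 128;

    static SecretKey newKey() throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(256);
        return kg.generateKey();
    }

    static byte[] encrypt(byte[] plain, SecretKey key) throws Exception {
        byte[] iv = new byte[IV_LEN];
        new SecureRandom().nextBytes(iv);
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
        byte[] ct = c.doFinal(plain);
        byte[] out = new byte[IV_LEN + ct.length];
        System.arraycopy(iv, 0, out, 0, IV_LEN);
        System.arraycopy(ct, 0, out, IV_LEN, ct.length);
        return out;
    }

    static byte[] decrypt(byte[] ivAndCt, SecretKey key) throws Exception {
        byte[] iv = Arrays.copyOfRange(ivAndCt, 0, IV_LEN);
        byte[] ct = Arrays.copyOfRange(ivAndCt, IV_LEN, ivAndCt.length);
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
        return c.doFinal(ct);
    }
}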
VII. CONCLUSION
Cloud service providers and attackers can use various data mining techniques to capture valuable information from a user's data. In this paper, a model is proposed that uses the Trusted Platform Module for security in cloud data storage. To ensure the security of users' data stored in the cloud, an effective and flexible distributed scheme is proposed. Security is applied to the data in the background using encryption algorithms such as AES, Triple DES and DES. The trusted gateway provides the encrypted data to the cloud data distributor; the cloud data distributor fragments the data into chunks and stores them with distributed cloud providers in the same cluster, so the communication cost is reduced. Secure cloud storage can be provided using the cryptographic solution together with a searchable encryption technique for the files to be accessed. Hence it works as a better approach for the user to ensure data security. This model provides an efficient and effective way to protect privacy from data mining attacks.
REFERENCES
[1] R. Chow, P. Golle, M. Jakobsson, E. Shi, J. Staddon, R. Masuoka, and J. Molina, Controlling data in the cloud: Outsourcing computation without outsourcing control, pp. 85-90, 2009.
[2] L. Van Wel and L. Royakkers, Ethical issues in web data mining, Ethics and Inf. Technol., 6: pp. 129-140, 2004.
[3] J. Wang, J. Wan, Z. Liu, and P. Wang, Data mining of mass storage based on cloud computing, IEEE Computer Society, pp. 426-431, 2010.
[4] M. Kantardzic, Data Mining: Concepts, Models, Methods and Algorithms, John Wiley & Sons, Inc., 2002.
[5] Trusted Computing Group. [Online]. Available: https://www.trustedcomputinggroup.org/.
[6] Trusted Platform Module (TPM) Summary, https://www.trustedcomputinggroup.org/resources/trusted_platform_module_tpm_summary.
[7] Dev, H.; Sen, T.; Basak, M.; Ali, M.E., An Approach to Protect the Privacy of Cloud Data from Data Mining Based Attacks, High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion, pp. 1106-1115, Nov. 2012.
Article Invitation for CLEAR March 2016 We are inviting thought-provoking articles, interesting dialogues and healthy debates on multifaceted aspects of Computational Linguistics for the forthcoming issue of CLEAR (Computational Linguistics in Engineering And Research) Journal, to be published in March 2016. The suggested areas of discussion are:
The articles may be sent to the Editor on or before 10th March, 2016 through the email simplequest.in@gmail.com. For more details visit: www.simplegroups.in
Editor, CLEAR Journal
Representative, SIMPLE Groups
Hello World,
As every day passes, we come closer and closer to computers that think and work like humans. Language, with all of its beauty, is manipulated in these systems. But tackling the problem of natural language processing is not a trivial one. With the big players of the field, like Google, working on systems that mimic the human brain, the day AI becomes a common feature of our lives isn't far away.
At CLEAR, we are always on the lookout for papers that address these obstacles and come up with interesting and handy solutions. This issue of CLEAR brings you papers that break new ground in the field of linguistics.
One of the main tasks we have focused on is the processing of Malayalam; we admit to being biased when it comes to this particular language. The agglutinative nature of this exquisite language makes processing it quite challenging, and we are very keen on papers that thrive on overcoming these challenges. The aim of CLEAR is to support those who are dedicated to and excited about questions like these. We thank everyone who helps CLEAR and drives it in this direction.
Simple group welcomes more aspirants in this area. Wish you all the best!!! Deepthi