Clear Journal September 2014


Contents

Editorial .................................. 4
News & Updates ............................. 5
Events ..................................... 6
Content Translation Tool for Translating Wikipedia Contents - Santhosh Thottingal
Speech Recognition - Divya Das
Feature Extraction for Speaker Recognition System - Neethu Johnson
FreeSpeech: Real-Time Speech Recognition - Vidya P V
CMU Sphinx 4: Speech Recognition System - Raveena R Kumar
Deep Learning in Speech Processing - Rajitha K, Alen Jacob
Parallelising NLP Tasks using MapReduce Paradigm - Freny Clara Davis, Shalini M, Nidhin M Mohan, Shibin Mohan
CLEAR Dec 2014 Invitation ................. 38
Last Word ................................. 39

CLEAR Journal (Computational Linguistics in Engineering And Research)
CLEAR September 2014, Volume 3, Issue 3
M.Tech Computational Linguistics, Dept. of Computer Science and Engineering,
Govt. Engineering College, Sreekrishnapuram, Palakkad 678633
www.simplegroups.in | simplequest.in@gmail.com

Chief Editor
Dr. P. C. Reghu Raj, Professor and Head, Dept. of Computer Science and Engineering, Govt. Engineering College, Sreekrishnapuram, Palakkad

Editors
Dr. Ajeesh Ramanujan, Raseek C, Nisha M, Anagha M

Cover page and Layout
Sarath K S, Manu.V.Nair



Dear Readers! Greetings!

In this season of Onam, the September edition of CLEAR comes to you as a journal decorated with flowers in the form of articles, mainly on speech processing, and centred around the special article on the Content Translation tool for Wikipedia by Santhosh Thottingal, Senior Software Engineer in the Language Engineering team of the Wikimedia Foundation. It is heartening to note that the present batch of M.Tech students is also trying to extend the frontiers of knowledge by venturing into unexplored areas. On top of all this, we have glimpses of the visit by the Acharya, Prof. R. Kalyana Krishnan, to our institution, which we consider a blessing. We will try to include articles from other departments as well in future editions, keeping in mind the broad objectives of the journal. We hope you will all enjoy the colours of this edition. Do send in your feedback.

Best Regards, P.C. Reghu Raj (Chief Editor)



Adieu to the Second Batch of SIMPLE Groups

Yet another milestone was reached by SIMPLE Groups when the students of the second batch (2012-14) of Computational Linguistics successfully completed their M.Tech course. The period they spent at the institution was academically eventful. Research projects were taken up and several publications resulted from them. They organized workshops, expert talks and similar programmes, and also attended such programmes at various other institutions, which brought about a sharing of knowledge. Their examination results were spectacular. Besides academics, they also have a story of a radiant life at the institution: a story of friendship sprinkled with love, of helping and taking care of each other.

A farewell party was organized by the junior batch (2013-15) for their outgoing seniors. The forenoon session of the function was a get-together of staff, faculty and students. First, Dr. P. C. Reghu Raj shared his memories and experiences with the batch, gave them a final word of advice and wished them every success in their future endeavours. The other staff members also recalled their time with the batch and offered them their best wishes. The students shared their memories of the college and gave their feedback on the course and the institution. They had suggestions on how to further improve the system here, and they also gave valuable directions to their junior batch regarding various aspects of the course and project work. The forenoon function concluded with a lavish lunch. The students of both batches then gathered for games and other fun activities. The party was lively and filled with excitement, and towards the end, when everyone recollected their past days and took a trip down memory lane, it turned nostalgic and heart-warming. Thus the second batch of Computational Linguistics bid goodbye to the college with a bunch of treasured memories and colourful achievements, and with a promise to stay connected. (Content prepared by Kavitha Raju)



Talk by Prof. Kalyana Krishnan R

Professor R. Kalyana Krishnan, one of the most reputed professors of IIT Madras, visited our college on 16 September 2014. Prof. Kalyana Krishnan retired from IIT Madras in 2012 after a long service lasting nearly four decades. He delivered expert talks on various engineering fields for both postgraduate and undergraduate students.

The interaction with PG students began in the afternoon as per the schedule. Both the first- and second-year M.Tech Computational Linguistics batches attended the class. The talk centred around text processing and the various issues associated with it. After a short introduction to the mathematical basis of text processing, the discussion advanced to the much anticipated topic of multilingual text processing. Prof. Kalyana Krishnan compared English with Malayalam in terms of the size of the alphabet and homophones. The talk then proceeded to issues regarding character encodings in Malayalam, mostly those associated with koottaksharams (conjunct characters). In the discussions that followed, the Professor made us realize the beauty of Malayalam and urged us to admire the expertise with which Indian literature was written. The beauty of Indian literature further drove the discussion towards the Rama Krishna Viloma Kavyam, a sloka built of palindromes, and the excellence of the epic Mahabharata. Opportunities like this, where we can share our thoughts and ideas with experts like Prof. Kalyana Krishnan, come only rarely in a lifetime. We, the students of M.Tech CL, are therefore sincerely thankful to our Head of the Department for providing us such an opportunity. (Content prepared by Amal Babu)



Talk by Dr. T. Asokan

Dr. T. Asokan is a Professor in the Department of Engineering Design at IIT Madras. He completed his B.Tech and M.Tech in mechanical engineering from Calicut University, and received his Ph.D in Mechanical Engineering from the Indian Institute of Technology Madras in the year 2000. His area of specialization was electro-hydraulic controls for robotic applications. He visited our college and delivered an expert talk on "Underwater Robotics" for undergraduate students of the Mechanical department and some students from the Electrical department. Later he interacted with faculty members.



Mangalyaan The Mars Orbiter Mission, Mangalyaan, launched into Earth orbit on 5th November 2013 by Indian Space Research Organisation, was successfully inserted into Mars orbit on 24th September 2014, making India the first nation to send a satellite into Mars orbit on its first attempt, and the first Asian nation to do so. The Mangalyaan robotic probe is one of the cheapest interplanetary missions ever. Only the US, Russia and the European Space Agency have previously sent missions to Mars, and India has succeeded on its first attempt - an achievement that eluded even the Americans and the Soviets. It is India's first interplanetary mission and ISRO has become the fourth space agency to reach Mars, after the Soviet space program, NASA, and the European Space Agency. The specific objectives of the Mars Orbiter Mission are primarily associated with spacecraft construction and mission operations as Mangalyaan serves as a pathfinder, being India’s first mission beyond the Moon which brings its own unique challenges such as the 20-minute average signal delay to Mars. The Indian Space Science Data Center has provided the following Mission Objectives: 1. Develop the technologies required for design, planning, management and operations of an interplanetary mission. 2. Orbit maneuvers to transfer the spacecraft from an elliptical Earth orbit to a heliocentric trajectory and finally insert it into Mars orbit. 3. Development of force models and algorithms for orbit and attitude computations and analyses. 4. Navigation in all mission phases. 5. Maintain the spacecraft in all phases of the Mission meeting Power, Communications, Thermal and Payload requirements. 6. Incorporate autonomous features to handle contingency situations. The following scientific Objectives have been set for the Mars Orbiter Mission: 1. Study climate, geology, origin and evolution of Mars. 2. To study sustainability of life on the planet. MOM will be set on a highly elliptical orbit around Mars, with a period of 3.2 days and a planned periapsis of 423 km (263 mi) and apoapsis of 80,000 km (50,000 mi). Commissioning and checkout operations are planned over the coming weeks to prepare MOM's instruments for scientific operations. (Content prepared by Vidya P V)



Content Translation Tool for Translating Wikipedia contents Santhosh Thottingal

Senior Software Engineer, Language Engineering Team, Wikimedia Foundation
santhosh.thottingal@gmail.com

Wikipedia, one of the most visited websites on the Internet, is active in 287 languages. The largest and best known of these is the English edition. Wikipedia also exists in about 20 Indian languages, but the Indian-language editions are very small compared to the English one; the European-language editions are the next largest after English, both in the number of articles and in the depth of their content. Among the Indian languages, Hindi has the largest Wikipedia, with a little over one lakh articles, whereas the English Wikipedia has around 40 lakh articles and the Malayalam Wikipedia has about 35,000.

One practical way to grow these smaller Wikipedias is for editors to translate articles from one wiki into another. Machine translation support is available for many language pairs, which makes such translation easier, but for the Indian languages machine translation has not yet matured enough to meet practical needs. Even for languages where usable machine translation exists, Wikipedia itself has so far not given editors any facilities for translating articles inside the wiki. Today, editors who want to translate typically take the content outside the wiki, run it through services such as Google Translate or Bing, paste the result back into Wikipedia, and then manually fix the links, references and templates.

The Content Translation tool is an effort to support this workflow within the wiki itself. With its help, an editor can translate an article from one language into another inside Wikipedia: links, references and templates are carried over automatically, fully or partially, using the inter-language connections already collected by the Wikidata project, and machine translation is offered where a suitable engine exists. One reason many readers hesitate to edit Wikipedia articles is the somewhat difficult wiki markup that has to be learned first; in this environment, articles can be prepared in a rich, document-like editor similar to Google Docs, without worrying about markup.

In its initial release the tool supports Spanish and Catalan, using the free and open-source Apertium machine translation system (http://www.apertium.org/). Support for other languages will be added step by step in the coming stages. The Content Translation tool has already been used to translate articles from Spanish into Catalan.

Watson Analytics merges big data with natural language tools IBM has announced the launch of Watson Analytics, a cloud-based natural language service that aims to simplify and streamline predictive data analytics for businesses, creating handy visualizations in the process. It can help companies source and cleanup data, so that the results seen are always relevant. Visit: http://www.wired.co.uk/news/archive/2014-09/16/ibm-watson-analytics



Speech Recognition Divya Das

Project Engineer-II CDAC Thiruvananthapuram divya.das1196@gmail.com

I. Introduction

Speech processing is the study of speech signals and the processing methods of these signals. The signals are usually processed in a digital representation, so speech processing can be regarded as a special case of digital signal processing, applied to speech signals. Aspects of speech processing include the acquisition, manipulation, storage, transfer and output of speech signals. It is also closely tied to natural language processing (NLP), as its input can come from, or its output can go to, NLP applications. The two main research areas of speech processing are:

- Speech recognition (also called voice recognition), which deals with analysis of the linguistic content of a speech signal and its conversion into a computer-readable format.
- Speech synthesis, the artificial synthesis of speech, which usually means computer-generated speech.

This article briefly explains the fundamental steps followed in speech recognition systems.

II. Speech Recognition

Speech recognition is a complex decoding process which translates speech into its corresponding textual representation. Because of the stochastic nature of speech, stochastic models are used for its decoding by modeling relevant acoustic speech features. Speech recognition engines usually require two basic components in order to recognize speech. One component is an acoustic model, created by taking audio recordings of speech and their transcriptions. The other component is called a language model, which gives the probabilities of sequences of words. The following figure shows the important speech recognition modules.

Figure 1: Speech Recognition


III.

Acoustic Modelling

The first step toward building an automated speech recognition system is to create a module for acoustic representation of speech. The main goal of this module is the computation of the acoustic model probability, as it describes the probability of a sequence of acoustic observations conditioned on the word sequence. Two main branches of possible model types have gained popularity, namely neural networks (NNs) and hidden Markov models (HMMs). HMMs are commonly used for stochastic modelling, especially in the field of automated speech recognition. This is because they have been found to be eminently suited to the task of acoustic modelling. The hidden Markov model is a (first order) Markov model whose topology is optimized for the task of speech recognition. It is strictly a left-to-right model consisting of states and transitional edges. It is called hidden because the state sequence is effectively hidden from the resulting sequence of observation vectors. The number of states depends on the speech unit modelled by the HMM. Possible speech units are phones or phone groups (e.g. biphones or tri-phones), syllables, words or even sentences. The link between the speech signal and the corresponding speech units is made by acoustic modelling.

A. Feature Extraction

The main tasks of the acoustic feature extraction procedure are the conversion of the analog speech signal to its discrete representation and the extraction of the relevant acoustic features in terms of best speech recognition capability. Mel Frequency Cepstral Coefficients (MFCCs) are a feature widely used in automatic speech recognition. The mel frequency is used as a perceptual weighting that more closely resembles how we perceive sounds such as music and speech. For example, when listening to a recording of music, most of what we "hear" is below 2000 Hz; we are not particularly aware of higher frequencies, though they also play an important part in audio perception. The cepstrum is the spectrum of a spectrum. A spectrum gives information about the frequency components of a signal; a cepstrum gives information about how those frequencies change. The combination of the two, the mel weighting and the cepstral analysis, makes MFCC particularly useful in audio recognition, such as determining timbre (i.e. the difference between a flute and a trumpet playing the same frequency), which forms the basis of instrument or speech recognition.

B. Training HMMs

The next step is the training of the acoustic model parameters. In this context, training means the computation of model parameters based on appropriate training material in order to emulate the stochastic nature of the speech signal. Therefore, the training material needs to be representative of the speech domain for whose recognition the acoustic models will be used later. Over iterations through the training data, the efficient estimation approaches used by standard training methods converge to a local optimum.


There are several well established training methods such as the maximum likelihood (ML) or maximum a posteriori (MAP) approaches. Baum-Welch training and Viterbi training are commonly used implementations of the ML training approach. One main characteristic of Viterbi training is the direct assignment of speech frames to HMM states. The Baum-Welch training algorithm is more flexible and allows overlaps in the frame-to-state assignment during the training procedure. In Viterbi training, the HMM parameters are estimated based on an initial segmentation of the training data. Each iteration successively improves the estimation of the acoustic model probability. The training procedure is finished when no further significant improvement can be achieved.

IV. Language Modelling

The language model (LM), also known as the grammar, describes the probability of the estimated sequence of words. The LM can be defined as a context-free grammar (CFG), a stochastic model (n-gram) or a combination of the two. Context-free grammars are used by simple speech recognition systems where the input sentences are often modelled by grammars. CFGs allow only utterances which are explicitly covered/defined by the grammar. Since CFGs of reasonable complexity can never foresee all the spontaneous variations of the user's input, n-gram language models are preferred for the task of large-vocabulary spontaneous speech recognition. N-gram language models represent an nth-order stochastic Markov model which describes the probability of word occurrences conditioned on the prior occurrence of n-1 other words. The probabilities are obtained from a large speech corpus, and the resulting models are called unigram, bigram or n-gram language models depending on their complexity. The assumption behind building such an LM is that the probability of a specific n-gram can be estimated from the frequency of its occurrence in a training set. The simplest n-gram is the unigram language model, which means a prior probability attached to each word. Prior probabilities describe the frequency of the specific word normalized by the total number of words.
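As a toy illustration of how such n-gram probabilities are estimated from counts, the Python sketch below builds a bigram model with add-k smoothing; the tiny corpus and the smoothing constant are invented for the example and are not taken from any particular recognizer.

from collections import Counter

# Tiny illustrative corpus; a real LM is trained on a large text/speech corpus.
corpus = [["<s>", "turn", "the", "light", "on", "</s>"],
          ["<s>", "turn", "the", "fan", "on", "</s>"],
          ["<s>", "switch", "the", "light", "off", "</s>"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((sent[i], sent[i + 1])
                  for sent in corpus for i in range(len(sent) - 1))

def bigram_prob(prev, word, k=1.0):
    """P(word | prev) with add-k smoothing so unseen pairs keep a small probability."""
    vocab = len(unigrams)
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * vocab)

print(bigram_prob("turn", "the"))   # frequent pair  -> relatively high probability
print(bigram_prob("turn", "off"))   # unseen pair    -> small smoothed probability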

V. Decoding Process

The search space of the speech decoding process is given by a network of HMM states. The connection rules within this network are defined at different hierarchy levels such as the word, the HMM and the state level. Words are connected based on language model rules, whereas each word is constructed of HMMs defined by the pronunciation dictionary. The primary objective of the search process is to find the optimal state sequence in this network associated with a given speech utterance. The Viterbi algorithm is an application of the dynamic programming principle and it performs maximum likelihood decoding. The Viterbi algorithm provides a solution for finding the optimal word sequence associated with a given sequence of feature vectors by using the acoustic model and the language model.
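The sketch below shows Viterbi decoding on a deliberately tiny HMM in Python; the states, probabilities and observation sequence are invented for illustration and stand in for the much larger sentence-HMM network described above.

import numpy as np

# Toy HMM: 2 hidden states, 3 observation symbols (all numbers are illustrative).
start = np.array([0.6, 0.4])                  # initial state probabilities
trans = np.array([[0.7, 0.3],                 # state transition probabilities
                  [0.4, 0.6]])
emit = np.array([[0.5, 0.4, 0.1],             # P(observation | state)
                 [0.1, 0.3, 0.6]])
obs = [0, 1, 2]                               # observed symbol sequence

# Dynamic programming: best log-score of any path ending in each state.
delta = np.log(start) + np.log(emit[:, obs[0]])
backptr = []
for o in obs[1:]:
    scores = delta[:, None] + np.log(trans)   # scores[i, j]: end in state j via i
    backptr.append(scores.argmax(axis=0))
    delta = scores.max(axis=0) + np.log(emit[:, o])

# Backtrack the most likely state sequence.
state = int(delta.argmax())
path = [state]
for bp in reversed(backptr):
    state = int(bp[state])
    path.append(state)
print(list(reversed(path)))                   # [0, 0, 1] for this toy example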

VI. Speech Recognition Tool

The Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and manipulating hidden Markov models. HTK is primarily used for speech recognition. HTK consists of a set of library modules and tools available in C source form. The tools provide sophisticated facilities for speech analysis, HMM training, testing and results analysis.

VII. References

[1] Mikael Nilsson and Marcus Ejnarsson, "Speech Recognition using Hidden Markov Model", Department of Telecommunications and Signal Processing.
[2] http://www.voxforge.org/, visited September 2014.

Brain-to-brain verbal communication in humans achieved for the first time A team of researchers has successfully achieved brain-to-brain human communication using non-invasive technologies across a distance of 5,000 miles. The team, comprising researchers from Harvard Medical School teaching affiliate Beth Israel Deaconess Medical Center, Starlab Barcelona in Spain, and Axilum Robotics in Strasbourg, France, used a number of technologies that enabled them to send messages from India to France, a distance of 5,000 miles (8046.72km), without performing invasive surgery on the test subjects.

This experiment, the researchers said, represents an important first step in exploring the feasibility of complementing or bypassing traditional means of communication, despite its current limitations; the bit rates were, for example, quite low at two bits per minute. Potential applications, however, include communicating with stroke patients. Visit: http://www.cnet.com/news/brain-to-brain-verbal-communication-in-humans-achieved-for-the-first-time/



Feature extraction for speaker recognition system Neethu Johnson

M. Tech Computational Linguistics GEC, Sreekrishnapuram, Palakkad ajneethu@gmail.com ABSTRACT: Speech processing has emerged as an important application area of digital signal processing. Various fields for research in speech processing are speech recognition, speaker recognition, speech synthesis, speech coding etc. The objective of automatic speaker recognition is to extract, characterize and recognize the information about speaker identity. Feature extraction is the first step for speaker recognition. Many algorithms are developed by the researchers for feature extraction out of which the Mel Frequency Cepstrum Coefficient (MFCC) feature has been widely used for designing a text dependent speaker identification system.

I. Speaker recognition

The anatomical structure of the vocal tract is unique for every person, and hence the voice information available in the speech signal can be used to identify the speaker. Recognizing a person by her/his voice is known as speaker recognition. Speaker recognition systems involve two phases, namely training and testing. Training is the process of familiarizing the system with the voice characteristics of the speakers registering. Testing is the actual recognition task. Feature vectors representing the voice characteristics of the speaker are extracted from the training utterances and are used for building the reference models. During testing, similar feature vectors are extracted from the test utterance, and the degree of their match with the reference is obtained using some matching technique. The level of match is used to arrive at the decision. For speaker recognition it is important to extract features from each frame which can capture the speaker-specific characteristics.

II. Feature extraction

Feature extraction is the process of extracting a limited amount of useful information from the speech signal while discarding redundant information. The extraction and selection of the best parametric representation of acoustic signals is an important task in the design of any speaker recognition system; it significantly affects the recognition performance. The features can be extracted either directly from the time domain signal or from a transformation domain, depending upon the choice of the signal analysis approach. Some of the signal features that have been successfully used for speech processing tasks include Mel-frequency cepstral coefficients (MFCC), Linear predictive coding (LPC) and Local discriminant bases (LDB). A few techniques generate a pattern from the features and use it, while others use the numerical values of the features.

A. LPC

In an LPC system, each sample of the signal is expressed as a linear combination of the previous samples. This equation is called a linear predictor, and hence the method is called linear predictive coding. The coefficients of the difference equation (the prediction coefficients) characterize the formants. LPC analyses the speech signal by estimating the formants, removing their effects from the speech signal, and estimating the intensity and frequency of the remaining buzz. The process of removing the formants is called inverse filtering, and the remaining signal is called the residue.

B. MFCC

MFCC is based on the human peripheral auditory system. The human perception of the frequency content of sounds for speech signals does not follow a linear scale. Thus for each tone with an actual frequency f measured in Hz, a subjective pitch is measured on a scale called the 'Mel scale'. The mel frequency scale is a linear frequency spacing below 1000 Hz and a logarithmic spacing above 1 kHz. As a reference point, the pitch of a 1 kHz tone, 40 dB above the perceptual hearing threshold, is defined as 1000 mels.

C. LDB

LDB is a speech signal feature extraction and multi-group classification scheme that focuses on identifying discriminatory time-frequency subspaces. Two dissimilarity measures are used in the process of selecting the LDB nodes and extracting features from them. The extracted features are then fed to a linear discriminant analysis based classifier for a multi-level hierarchical classification of speech signals.

III. Mel Frequency Cepstral Coefficients

The most widely used acoustic features for speech and speaker recognition are MFCCs. They are the result of a cosine transform of the real logarithm of the short-term energy spectrum expressed on a mel-frequency scale. The MFCCs have proved more efficient. These features take into account the perception characteristics of the human ear. The MFCC features assume that human perception is nearly linear up to 1000 Hz, and after that it is non-linear, with the importance of a frequency signal decreasing as the frequency value increases. As a result, we need a better (constant) resolution up to 1000 Hz and a decreasing resolution as the frequency increases. This means that up to 1000 Hz, mel filter banks will have a constant bandwidth that is smaller than the bandwidth of the filter banks above 1000 Hz; beyond 1000 Hz, the filter banks increase in frequency span. The calculation of the MFCC includes the following steps.

A. Frames

The most fundamental process common to all forms of speaker and speech recognition systems is that of extracting vectors of features uniformly spaced across time from the time-domain sampled acoustic waveform.

a. Pre-emphasis

Pre-emphasis refers to filtering that emphasizes the higher frequencies in the speech signal. Its purpose is to balance the spectrum of voiced sounds that have a steep roll-off in the higher frequency region.

b. Framing

The speech signal is a slowly time-varying or quasi-stationary signal. For stable acoustic characteristics, the speech signal needs to be examined over a sufficiently long duration of time over which it can be considered stationary. Further, samples between adjacent frames are overlapped to ensure continuity in the features extracted, and thus avoid any abrupt changes. The time-domain waveform of the utterance under consideration is divided into overlapping fixed-duration segments called frames. In speaker recognition, a frame size of 20 ms is seen to be the optimum, with 10 ms of overlap between adjacent frames. Advancing the time window every 10 ms enables the temporal characteristics of the individual speech sounds to be tracked, and the 20 ms analysis window is usually sufficient to provide good spectral resolution while at the same time being short enough to resolve significant temporal characteristics.

c. Windowing

Each frame is multiplied by a window function. The window function is needed to smooth the effect of using a finite-sized segment for the subsequent feature extraction by tapering each frame at the beginning and end edges. Any of the window functions can be deployed, with the Hamming window function being the most popular.

B. MFCC features

A Fast Fourier Transform (FFT) operation is applied to each frame to yield complex spectral values. Subsequently, the FFT coefficients are binned into 24 mel filter banks and the spectral energies in these 24 filter banks are calculated. Then, a Discrete Cosine Transform (DCT) is applied to the log of the mel filter bank energies to obtain the MFCC coefficients, and the first 13 MFCC coefficients are selected as the features for the speaker recognition system. The DCT also serves the purpose of decorrelating the mel frequency band energies. It may also be interpreted that the last coefficients discarded correspond to fast variations in the signal spectrum, and it is found that they do not add value to speaker/speech recognition experiments.
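A minimal numpy sketch of the pre-emphasis, framing and windowing steps described in subsection A above; the 16 kHz sampling rate and the random stand-in signal are assumptions made only for the example.

import numpy as np

fs = 16000                                   # assumed sampling rate (Hz)
signal = np.random.randn(fs)                 # stand-in for one second of speech

# Pre-emphasis: boost the higher frequencies.
pre = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

# Framing: 20 ms frames advanced every 10 ms, as suggested above.
frame_len, hop = int(0.020 * fs), int(0.010 * fs)
n_frames = 1 + (len(pre) - frame_len) // hop
frames = np.stack([pre[i * hop: i * hop + frame_len] for i in range(n_frames)])

# Windowing: taper each frame with a Hamming window.
frames *= np.hamming(frame_len)
print(frames.shape)                          # (number of frames, samples per frame)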


Subsequently, the temporal delta and acceleration coefficients are calculated and appended to the 13 baseline features, making the total number of features 39. The mel frequency scale is represented as:

Mel(f) = 2595 * log10(1 + f/700)

Mel frequency is proportional to the logarithm of the linear frequency, reflecting similar effects in the human's subjective and aural perception. As the vocal tract is smooth, the filter bank energies measured in adjacent bands tend to be correlated. DCT is applied to the transformed mel frequency coefficients to produce a set of cepstral coefficients. Prior to computing the DCT, the mel spectrum is usually represented on a log scale. Since most of the signal information is represented by the first few MFCC coefficients, the system can be made robust by extracting only those coefficients, ignoring or truncating the higher-order DCT components. Traditional MFCC systems use only 8-13 cepstral coefficients. The 0th coefficient is often excluded since it represents the average log-energy of the input signal, which carries only little speaker-specific information. The cepstral coefficients are static features that contain information from a given frame, while the information about the temporal dynamics of the signal is represented by the first and second derivatives of the cepstral coefficients. The first-order derivative, called the delta coefficients, represents information about the speech rate (velocity), and the second-order derivative, called the delta-delta coefficients, represents information about the acceleration of the speech signal.

IV. Tool for feature extraction

A. HTK

The Hidden Markov Model Tool Kit (HTK) is a toolkit for building Hidden Markov Models (HMMs), which can be used to model any time series. It is primarily designed for building HMM-based speech processing tools, in particular recognisers. Although all HTK tools can parameterise waveforms (MFCC features) on the fly, in practice it is usually better to parameterise the data just once. The tool HCopy is used for this. As the name suggests, HCopy is used to copy one or more source files to an output file. Normally, HCopy copies the whole file, but a variety of mechanisms are provided for extracting segments of files and concatenating files. By setting the appropriate configuration variables, all input files can be converted to parametric form as they are read in. A sample setting of the configuration file for HCopy is shown below:

# Feature configuration
TARGETKIND = MFCC_E_D_A
TARGETRATE = 100000.0
SAVECOMPRESSED = F
SAVEWITHCRC = T
WINDOWSIZE = 200000.0
USEHAMMING = T
PREEMCOEF = 0.97
NUMCHANS = 24
CEPLIFTER = 22
NUMCEPS = 12
ENORMALISE = T
# Input file format (headerless 8 kHz 16-bit linear PCM)
SOURCEKIND = WAVEFORM
SOURCEFORMAT = NOHEAD
SOURCERATE = 1250

Thus, it is simple to parameterise or extract MFCC features from the speech signal using HTK. These features can then be used for speaker recognition, spoken language identification, speech recognition or any other speech processing task.
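As a rough Python counterpart to the HCopy configuration above, the same 13 MFCCs with delta and acceleration coefficients (39 features per frame) can be computed with the librosa library; the file name and exact window settings here are illustrative assumptions, not part of the original article.

import numpy as np
import librosa

# "speech.wav" is a placeholder; any 16 kHz mono recording will do.
y, sr = librosa.load("speech.wav", sr=16000)

# 13 base MFCCs with a 20 ms window and 10 ms hop (roughly HCopy-like settings).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=int(0.020 * sr), hop_length=int(0.010 * sr))

# Delta and delta-delta (acceleration) coefficients, stacked to 39 features/frame.
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)
features = np.vstack([mfcc, delta, delta2])
print(features.shape)        # (39, number of frames)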

V. CONCLUSION

Speaker recognition is a commonly used biometric for control of access to information services or user accounts as it can be used to replace or augment personal

identification numbers or passwords. The speech signal can be represented as a sequence of feature vectors in order to apply mathematical tools. Such spectral-based features are used for speaker recognition in most systems. Mel Frequency Cepstral Coefficients (MFCCs) are features widely used in automatic speech and speaker recognition. They describe the signal characteristics relative to the speaker-discriminative vocal tract properties. High accuracy and low complexity are the major advantages of MFCCs. A freely available portable toolkit for building and manipulating Hidden Markov Models, the Hidden Markov Model Tool Kit (HTK), is primarily used for speech recognition research. Its tools provide sophisticated facilities for speech analysis, HMM training, testing and results analysis.

Microsoft Unveils Real-Time Speech Translation for Skype At Re/code's inaugural Code Conference, Microsoft unveiled it’s real-time speech translator for Skype-a technology that conjures up references to "Star Trek" and "A Hitchhiker's Guide To The Galaxy" that's been in the works for years. While Speaker A is talking, Speaker B will actually hear their voice, at a lower volume, even as Skype Translator begins to do its work and starts delivering translated, spoken words. Moreover, the system is looking for natural pauses or, "silence detection," in speech to start translating. The length of time it takes to translate is totally dependent on the length of the sentence or phrase. The alternative would have been to have the speaker hold a button while speaking and let it go when they wanted to deliver a sentence or phrase. This approach should be more natural. The "Star Trek"-like translator will become available before the end of 2014.

Visit: http://research.microsoft.com/en-us/news/features/translator-052714.aspx




FreeSpeech: Real-Time Speech Recognition Vidya P V

M. Tech Computational Linguistics GEC, Sreekrishnapuram, Palakkad vidyapv15@gmail.com

ABSTRACT: Speech recognition is the process of translation of spoken words into text. Real-time continuous speech recognition has opened up a wide range of research opportunities in human-computer interactive applications. PocketSphinx is an open-source embedded speech recognition system, developed by Carnegie Mellon University, that is capable of real-time, medium-vocabulary continuous speech recognition. FreeSpeech is a free and open-source real-time speech recognition application that uses PocketSphinx.

I.

Introduction

Speech recognition is the process of converting spoken words to written format. It provides exciting opportunities for creating great user experiences and efficiencies in real-time interactive applications. The advent of hand-held devices has paved the way for a wide variety of speech recognition applications, including voice user interfaces such as voice dialing, call routing, search, simple data entry, speech-to-text processing etc. CMU Sphinx is a popular open-source large-vocabulary continuous speech recognition system developed by Carnegie Mellon University. PocketSphinx is a version of Sphinx that can be used in embedded systems and is capable of real-time, medium-vocabulary continuous speech recognition. FreeSpeech is a free and open-source real-time speech recognition application that provides off-line speaker-independent voice recognition with dynamic language learning capability using the PocketSphinx speech recognition engine.

II.

FreeSpeech

FreeSpeech is a free and open-source dictation, voice transcription and real-time speech recognition application which provides offline speaker-independent voice recognition with dynamic language learning capability, using the PocketSphinx speech recognition engine and the GStreamer open-source multimedia framework. FreeSpeech is truly cross-platform and is written in Python. CMU Sphinx, or simply Sphinx, describes a group of speech recognition systems developed at Carnegie Mellon University. PocketSphinx is a version of Sphinx that can be used in embedded systems. It is a research system and is a lightweight, multi-platform, speaker-independent, large-vocabulary continuous speech recognition engine.

A. Installation

In order to make FreeSpeech work reliably on Linux, the following packages must be installed. These can be installed through the package manager:
- Python 2.7
- pygtk2
- python-xlib
- python-simplejson
- gstreamer, including gstreamer-python
- pocketsphinx and sphinxbase
- CMU-Cambridge Statistical Language Modelling Toolkit v2

The CMU-Cambridge Statistical Language Modelling Toolkit can be downloaded and, after unpacking it, installation can be performed by reading the instructions in the README file and editing the Makefile. Manually copy the tools from the bin directory to somewhere in $PATH, such as /usr/local/bin. Similarly, PocketSphinx and Sphinxbase can also be downloaded, unpacked and installed as per the instructions given in their README files. FreeSpeech can be downloaded from Google Code, and its installation requires only setting an environment variable to a user-writeable location if it isn't already set.

B. Using FreeSpeech

As already said, FreeSpeech is written in Python, and hence launching the program can be done using the Python interpreter. Then the application starts working and recognizes what is being spoken. The following figure indicates the window that shows the spoken text in written format.

Figure 1. FreeSpeech Window

The dictionary available along with the FreeSpeech application indicates the vocabulary and can be referred to for further information regarding different special characters, pronunciations of words, etc. Voice commands are also included in this application. A menu listing various voice commands pops up upon running the FreeSpeech program using the Python interpreter.

Figure 2. Command Preferences

A list of voice commands supported by the FreeSpeech application is given below:
- file quit - quits the program
- file open - open a text file in the editor
- file save (as) - save the file
- show commands - pops up a customizable list of spoken commands
- editor clear - clears all text in the editor and starts over
- delete - delete [text] or erase selected text
- select - select [text]; example: "select the states"
- go to the end - put cursor at end of document
- scratch that - erase last spoken text
- back space - erase one character
- new paragraph - equivalent to pressing Enter twice
- insert - move cursor after word or punctuation; example: "Insert after period"

C. Corpus and Dictionary

The FreeSpeech application contains a very limited language corpus, freespeech.ref.txt. The application can be trained by entering text in the textbox provided and clicking the learn button, which adds the contents of the text box to the language corpus, thereby making the application do better in understanding next time. Similarly, the FreeSpeech dictionary can also be edited if there exists any word that the application refuses to recognize even after teaching it several sentences. Adding new words to the dictionary may be done manually, along with their phonetic representation.
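For readers who want to script the underlying engine directly, the separate pocketsphinx Python package (an assumption here, with its bundled default US English model) exposes a simple live-recognition helper; this is a sketch of that package's interface, not of FreeSpeech's own code.

# Minimal live recognition with the pocketsphinx Python package (assumed installed;
# this uses that package's LiveSpeech helper, not FreeSpeech itself).
from pocketsphinx import LiveSpeech

# LiveSpeech reads from the default microphone and yields recognized utterances
# using the package's bundled acoustic model, dictionary and language model.
for phrase in LiveSpeech():
    print(phrase)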

III. Conclusion and Future Works

The FreeSpeech real-time speech recognition application provides a platform to perform real-time speech-to-text conversion and voice control. The speech recognition engine used, PocketSphinx, is itself still an early research system, and several tools need to be made available to make the pipeline complete. The small size of the language corpus provided by FreeSpeech is also one of its limitations; manual effort may be required to do the learning part initially in order to grow the language corpus to suit our needs. The difficulty in handling pronunciation variations is another major challenge, which can be solved to an extent by editing the FreeSpeech dictionary.

IV. References

[1] David Huggins-Daines, Mohit Kumar, Arthur Chan, Alan W Black, Mosur Ravishankar, and Alex Rudnicky, "PocketSphinx: A Free, Real-Time Continuous Speech Recognition System for Hand-Held Devices", 2006 IEEE.
[2] http://thenerdshow.com/freespeech.html, visited September 2014.



CMU Sphinx 4: Speech Recognition System Raveena R Kumar

M. Tech Computational Linguistics GEC, Sreekrishnapuram, Palakkad veenakalathil@gmail.com

ABSTRACT: Speech is a continuous audio stream where rather stable states mix with dynamically changed states. The common way to recognize speech is to take the waveform, split it on utterances by silences then try to recognize what's being said in each utterance. CMU Sphinx toolkit is a leading speech recognition toolkit with various tools used to build speech applications. The Sphinx-4 speech recognition system is the latest addition to Sphinx speech recognition systems.

I. Introduction

The CMU Sphinx toolkit, also called Sphinx in short, has a number of packages for different tasks and applications. These include a series of speech recognizers (Sphinx 2-4) and an acoustic model trainer (SphinxTrain). Sphinx is a continuous-speech, speaker-independent recognition system making use of hidden Markov acoustic models (HMMs) and an n-gram statistical language model. Sphinx demonstrated the feasibility of continuous-speech, speaker-independent, large-vocabulary recognition. Sphinx 2 focuses on real-time recognition suitable for spoken language applications. As such it incorporates functionality such as end-pointing, partial hypothesis generation, dynamic language model switching and so on. It is used in dialog systems and language learning systems. Sphinx 2 used a semi-continuous representation for acoustic modelling. Sphinx 3 adopted the prevalent continuous HMM representation and has been used primarily for high-accuracy, non-real-time recognition. Sphinx 3 is under active development and, in conjunction with SphinxTrain, provides access to a number of modern modelling techniques, such as LDA/MLLT, MLLR and VTLN, that improve recognition accuracy. PocketSphinx is a version of Sphinx that can be used in embedded systems (e.g., based on an ARM processor). PocketSphinx is under active development and incorporates features such as fixed-point arithmetic and efficient algorithms for GMM computation. Sphinx 4 is a redesign of the earlier Sphinx systems in terms of modularity, flexibility and algorithmic aspects. It uses newer search strategies and is universal in its acceptance of various kinds of grammars and language models, types of acoustic models and feature streams. It has been built entirely in the Java programming language. The Sphinx-4 system is an open source project.



II. Sphinx 4

The Sphinx-4 architecture has been designed for modularity. Any module in the system can be smoothly exchanged for another without requiring any modification of the other modules. One can, for instance, change the language model from a statistical N-gram language model to a context-free grammar (CFG) or a stochastic CFG by modifying only one component of the system, namely the linguist. Similarly, it is possible to run the system using continuous, semi-continuous or discrete state output distributions by appropriate modification of the acoustic scorer. The system permits the use of any level of context in the definition of the basic sound units. Information from multiple information streams can be incorporated and combined at any level, i.e., state, phoneme, word or grammar. The search module can also switch between depth-first and breadth-first search strategies. One by-product of the system's modular design is that it becomes easy to implement it in hardware.

A. Installation

Sphinx-4 is written in Java and therefore requires the JVM to run. To install Java on Ubuntu Linux, run the following:
sudo apt-get install sun-java6-jre
Download the Sphinx-4 1.0beta4 package from SourceForge. Next:
unzip sphinx4-1.0beta4-bin.zip
cd sphinx4/lib
sh jsapi.sh
Now accept the BCL license agreement, which will unpack jsapi.jar. Now test Sphinx-4:
cd ..
java -jar bin/Dialog.jar
Press ctrl-C to exit the Sphinx-4 dialog demo.

B. Basic Usage

There are several high-level recognition interfaces in Sphinx-4:
- Live Speech Recognizer
- Stream Speech Recognizer
- Speech Aligner
Live Speech Recognizer uses the microphone as the speech source. Stream Speech Recognizer uses an audio file as the speech source. Speech Aligner time-aligns text with audio speech. For most speech recognition jobs the high-level interfaces should be enough, and only four attributes need to be set up:
- Acoustic model
- Dictionary
- Grammar/Language model
- Source of speech

III. Architecture of the Sphinx-4 decoder

Figure 1 shows the overall architecture of the Sphinx-4 decoder. The speech signal is parameterized at the front-end module, which communicates the derived features to the decoding block. The decoding block has three components: the search manager, the linguist, and the acoustic
scorer. These work in tandem to perform the decoding.

independent information sources, such as visual features, can be directly fed into the decoder, either in parallel with the features from the speech signal, or bypassing the latter altogether. B. Decoder The decoder block consists of three modules: search manager, linguist, and acoustic scorer. a. Search Manager The primary function of the search manager is to construct and search a tree of possibilities for the best hypothesis. The construction of the search tree is done based on information obtained from the linguist. The search manager makes use of a token tree. Each token contains the overall acoustic and language scores of the path at a given point, a Sentence HMM reference, an input feature frame identification, and a reference to the previous token, thus allowing backtracking. The Sentence HMM reference allows the search manager to fully categorize a token to its senone, contextdependent phonetic unit, pronunciation, word, and grammar state. Search through the token tree and the sentence HMM is performed in two ways: depth-first or breadth-first. Depth-first search is similar to conventional stack decoding. In Sphinx-4, breadth-first search is performed using the standard Viterbi algorithm as well as a new algorithm called Bush-derby.

Figure 1: The overall architecture of Sphinx 4

A. Front end

The module consists of several communicating blocks, each with an input and an output. Each block has its input linked to the output of its predecessor. When a block is ready for more data, it reads data from the predecessor and interprets it to find out whether the incoming information is speech data or a control signal. The control signal might indicate the beginning or end of speech. One of the features of this design is that the output of any of the blocks can be tapped. Similarly, the actual input to the system need not be at the first block, but can be at any of the intermediate blocks. The current implementation permits us to run the system using not only speech signals, but also spectra, etc. In addition, any of the blocks can be replaced. Additional blocks can also be introduced between any two blocks, to permit noise cancellation or compensation on the signal, on its spectrum or on the outputs of any of the intermediate blocks. Features computed using independent information sources, such as visual features, can be directly fed into the decoder, either in parallel with the features from the speech signal, or bypassing the latter altogether.

B. Decoder

The decoder block consists of three modules: the search manager, the linguist, and the acoustic scorer.

a. Search Manager

The primary function of the search manager is to construct and search a tree of possibilities for the best hypothesis. The construction of the search tree is done based on information obtained from the linguist. The search manager makes use of a token tree. Each token contains the overall acoustic and language scores of the path at a given point, a Sentence HMM reference, an input feature frame identification, and a reference to the previous token, thus allowing backtracking. The Sentence HMM reference allows the search manager to fully categorize a token to its senone, context-dependent phonetic unit, pronunciation, word, and grammar state. Search through the token tree and the sentence HMM is performed in two ways: depth-first or breadth-first. Depth-first search is similar to conventional stack decoding. In Sphinx-4, breadth-first search is performed using the standard Viterbi algorithm as well as a new algorithm called Bush-derby.

b. Linguist

The linguist translates linguistic constraints provided to the system into an internal data structure called the grammar, which is usable by the search manager. Linguistic constraints are typically provided in the form of context-free grammars, n-gram language models, finite state machines etc. The grammar is a directed graph, where each node represents a set of words that may be spoken at a particular time. The nodes are connected by arcs which have associated language and acoustic probabilities that are used to predict the likelihood of transiting from one node to another. Sphinx-4 provides several grammar loaders that load various external grammar formats and generate the internal grammar structure. The pluggable nature of Sphinx-4 allows new grammar loaders to be easily added to the system. Grammar nodes are decomposed into a series of word states, one for each word represented by the node. Word states are further decomposed into pronunciation states, based on pronunciations extracted from a dictionary maintained by the linguist. Each pronunciation state is then decomposed into a series of unit states, where units may represent phonemes, diphones, etc. and can be specific to contexts of arbitrary length. Each unit is then further decomposed to its sequence of HMM states. The Sentence HMM thus comprises all of these states. States are connected by arcs that have language, acoustic and insertion probabilities associated with them.

c. Acoustic Scorer

The task of the acoustic scorer is to compute state output probability or density values for the various states, for any given input vector. The acoustic scorer provides these scores on demand to the search module. In order to compute these scores, the scorer must communicate with the front-end module to obtain the features for which the scores must be computed.

C. Knowledge Base

This module consists of the language model and the acoustic model. An acoustic model contains acoustic properties for each senone. There are context-independent models that contain properties (most probable feature vectors for each phone) and context-dependent ones (built from senones with context). A language model is used to restrict word search. It defines which word could follow previously recognized words (remember that matching is a sequential process) and helps to significantly restrict the matching process by stripping out words that are not probable. There are two types of models that describe language: grammars and statistical language models. Grammars describe very simple types of languages for command and control, and they are usually written by hand or generated automatically with plain code.

IV. Training

If one wants to create an acoustic model for a new language or dialect, or needs a specialized model for a small-vocabulary application, training should be done. The trainer learns the parameters of the models of the sound units using a set of sample speech signals. This is called a training database. This information is provided to the trainer through a file called the transcript file, in which the sequence of words and non-speech sounds is written exactly as it occurred in a speech signal, followed by a tag which can be used to associate this sequence with the corresponding speech signal. You have to design the database prompts and post-process the results to ensure that the audio actually corresponds to the prompts.
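For instance, a two-utterance database (speaker names, file names and text invented purely for illustration) would pair a transcript file and a fileids file whose lines correspond one to one:

your_db_train.transcription:
<s> hello world </s> (file_1)
<s> good morning </s> (file_2)

your_db_train.fileids:
speaker_1/file_1
speaker_2/file_2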



The file structure for the database is:

etc
  your_db.dic - Phonetic dictionary
  your_db.phone - Phoneset file
  your_db.lm.DMP - Language model
  your_db.filler - List of fillers
  your_db_train.fileids - List of files for training
  your_db_train.transcription - Transcription for training
  your_db_test.fileids - List of files for testing
  your_db_test.transcription - Transcription for testing
wav
  speaker_1/file_1.wav
  speaker_2/file_2.wav

The following packages are required for training:
- sphinxbase-0.8
- SphinxTrain-0.8
- pocketsphinx-0.8

To start the training, change to the database folder and run the following command:
sphinxtrain -t an4 setup
Replace an4 with your task name. After that, go to the database directory:
cd an4
To train, just run the following command:
sphinxtrain run

V. Conclusion

Sphinx 4 is developed entirely in the Java programming language and is thus highly portable. Sphinx 4 also enables and uses multithreading and permits highly flexible user interfacing. Algorithmic innovations included in the system design enable it to incorporate multiple information sources in a more elegant manner as compared to the other systems in the Sphinx family. It is very flexible in its configuration, and in order to carry out speech recognition jobs it provides a context class that takes away the need to set up each parameter of the object graph separately.

VI. References

[1] Paul Lamere, Philip Kwok, et al., "The CMU Sphinx-4 Speech Recognition System", http://www.cs.cmu.edu/~rsingh/homepage/papers/icassp03-sphinx4_2.pdf
[2] http://cmusphinx.sourceforge.net/, visited September 2014.
27


Deep learning in Speech Processing Alen Jacob

Rajitha K

M. Tech Computational Linguistics GEC, Sreekrishnapuram, Palakkad alenjacob@outlook.com

M. Tech Computational Linguistics GEC, Sreekrishnapuram, Palakkad reji.krishkripa@gmail.com

ABSTRACT: Deep learning is becoming a mainstream technology in speech processing. In this article we present some work in deep learning related to speech processing. The main applications of speech processing are speech recognition and speech synthesis. In this document, Deep Neural Network based speech synthesis and Deep Tensor Neural Network based speech recognition are explained.

I.

Introduction

Deep learning refers to a class of machine learning techniques in which many layers of information processing stages in hierarchical architectures are exploited for pattern classification and for feature or representation learning. It lies at the intersection of the research areas of neural networks, graphical modeling, optimization, pattern recognition, and signal processing. Deep learning algorithms are based on distributed representations, a concept used in machine learning. The underlying assumption behind distributed representations is that observed data is generated by the interactions of many different factors on different levels. Deep learning adds the assumption that these factors are organized into multiple levels, corresponding to different levels of abstraction or composition. Varying numbers of layers and layer sizes can be used to provide different amounts of abstraction.

The automatic conversion of written to spoken language is commonly called text-to-speech or simply TTS. The input is text and the output is a speech waveform. A TTS system is almost always divided into two main parts. The first of these converts text into what we will call a linguistic specification, and the second part uses that specification to generate a waveform. This division of the TTS system into these two parts makes a lot of sense both theoretically and for practical implementation: the front end is typically language-specific, whilst the waveform generation component can be largely independent of the language. Nowadays many speech synthesis systems exist, and the quality of a system is measured based on the naturalness and intelligibility of the speech generated. Statistical parametric speech synthesis based on hidden Markov models (HMMs) has grown in popularity in the last decade. This system



simultaneously models spectrum, excitation, and duration of speech using context-dependent HMMs and generates speech waveforms from the HMMs themselves. This system offers the ability to model different styles without requiring the recording of very large databases. The major limitation of this method is the quality of the synthesized speech. To address the limitations of the context-dependent HMM based speech synthesis method, an alternative scheme based on a deep architecture has been introduced. The decision trees in HMM-based statistical parametric speech synthesis perform a mapping from linguistic contexts extracted from text to probability densities of speech parameters. Here the decision trees are replaced by a deep neural network (DNN). Until recently, neural networks with one hidden layer were popular, as they can represent arbitrary functions if they have enough units in the hidden layer. Although it is known that neural networks with multiple hidden layers can represent some functions more efficiently than those with one hidden layer, learning such networks was impractical due to its computational cost. However, recent progress both in hardware (e.g. GPUs) and software enables us to train a DNN from a large amount of training data. Deep neural networks have achieved large improvements over conventional approaches in various machine learning areas, including speech recognition and acoustic-articulatory inversion mapping. Note that NNs have been used in speech synthesis since the 90s. Automatic speech recognition, the translation of spoken words into text, is still a challenging task due to the high variability in speech signals. Deep learning, sometimes referred to as representation learning or unsupervised feature learning, is a new area of machine learning. Deep learning is becoming a mainstream technology for speech recognition and has successfully replaced Gaussian mixtures for speech recognition and feature coding at an increasingly large scale.

speech signals. Deep learning, sometimes referred as representation learning or unsupervised feature learning, is a new area of machine learning. Deep learning is becoming a mainstream technology for speech recognition and has successfully replaced Gaussian mixtures for speech recognition and feature coding at an increasingly larger scale. II.

Deep Neural Network Based Speech Synthesis

A DNN, which is a neural network with multiple hidden layers, is a typical implementation of a deep architecture. We can have a deep architecture by adding multiple hidden layers to a neural network. The properties of the DNN are contrasted with those of the decision tree as follows: 



29

• Decision trees are inefficient at expressing complicated functions of input features, such as XOR, d-bit parity, or multiplex problems; to represent such cases, decision trees would have to be prohibitively large, whereas such functions can be compactly represented by DNNs.

• Decision trees rely on partitioning the input space and using a separate set of parameters for each region associated with a terminal node. This reduces the amount of data per region and leads to poor generalization. Yu et al. showed that weak input features, such as word-level emphasis in reading speech, were thrown away while building decision trees. DNNs provide better generalization.
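To make the replacement concrete, here is a minimal numpy sketch of the idea: a small feed-forward network maps a linguistic-context feature vector to one frame of acoustic parameters, which is the role the decision trees played in the HMM-based system. The layer sizes, the 300-dimensional context vector, and the 40-dimensional acoustic frame are illustrative assumptions, not values taken from the cited papers, and the random weights stand in for trained parameters.

# Sketch: a DNN in place of decision trees for statistical parametric synthesis.
# Input: linguistic context features; output: acoustic parameters for one frame.
import numpy as np

rng = np.random.default_rng(42)

def tanh_layer(x, W, b):
    return np.tanh(W @ x + b)

n_context = 300    # answers to questions about the linguistic context (assumed size)
n_hidden = 256
n_acoustic = 40    # e.g. spectral coefficients + log F0 + voicing (assumed size)

# Randomly initialised weights stand in for parameters learned from data.
W1, b1 = 0.05 * rng.standard_normal((n_hidden, n_context)), np.zeros(n_hidden)
W2, b2 = 0.05 * rng.standard_normal((n_hidden, n_hidden)), np.zeros(n_hidden)
W3, b3 = 0.05 * rng.standard_normal((n_acoustic, n_hidden)), np.zeros(n_acoustic)

# One frame's linguistic specification, encoded as a binary feature vector.
context = rng.integers(0, 2, size=n_context).astype(float)

h1 = tanh_layer(context, W1, b1)
h2 = tanh_layer(h1, W2, b2)
acoustic_frame = W3 @ h2 + b3      # linear output layer
print(acoustic_frame.shape)        # (40,)

In a full system this forward pass would be run for every frame, and the predicted parameter trajectories would then be passed to a vocoder to generate the waveform.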




III. Deep Tensor Neural Network Based Speech Recognition

A novel deep model called the deep tensor neural network (DTNN), in which one or more layers are double-projection (DP) and tensor layers, has been described for speech recognition tasks. An approach that maps the tensor layers to conventional sigmoid layers has also been shown, so that the former can be treated and trained in the same way as the latter. With this mapping, a DTNN can be considered a DNN augmented with DP layers, and the back-propagation (BP) learning algorithm for DTNNs can be cleanly derived.

Fig. 3(c) is an alternative view of the same DTNN shown in Fig. 3(b). By defining the input to the layer as

v^l = vec(h_1^{l-1} ⊗ h_2^{l-1}) = vec(h_1^{l-1} (h_2^{l-1})^T),

this rewriting allows us to reduce and convert tensor layers into conventional matrix layers and to define the same interface for describing the two different types of layers. For example, in Fig. 3(c) the hidden layer h^l can now be considered a conventional layer, as in Fig. 3(a), and can be learned using the conventional back-propagation algorithm. The rewriting also indicates that the tensor layer can be considered a conventional layer whose input comprises the cross product of the values passed from the previous layer.
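As a small numerical check of this rewriting (a sketch under assumed layer sizes, not code from the cited paper), the snippet below builds the two projections h1 and h2, forms v^l = vec(h1 ⊗ h2), verifies that it equals vec(h1 h2^T), and then treats the tensor layer as an ordinary weight matrix applied to v^l:

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n1, n2, n_out = 8, 8, 32          # assumed sizes of the two projections and the layer output

h1 = rng.standard_normal(n1)      # h1^{l-1}: first projection from the previous layer
h2 = rng.standard_normal(n2)      # h2^{l-1}: second projection from the previous layer

# v^l = vec(h1 ⊗ h2) = vec(h1 h2^T): the cross product of the two projections,
# flattened into an ordinary input vector.
v = np.outer(h1, h2).ravel()
assert np.allclose(v, np.kron(h1, h2))   # same values, same ordering

# With the input rewritten as v^l, the tensor layer is just a conventional
# sigmoid layer, so the standard back-propagation machinery applies unchanged.
W = 0.1 * rng.standard_normal((n_out, n1 * n2))
b = np.zeros(n_out)
h_l = sigmoid(W @ v + b)
print(h_l.shape)                  # (32,)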

IV. Conclusion and Future Works

The DNN-based approach has the potential to address the limitations of the conventional decision-tree-clustered, context-dependent HMM-based approach. Future work includes reducing the computation required by DNN-based systems, adding more input features (including weak features such as emphasis), and exploring a better log F0 modeling scheme.

V. References

[1] Heiga Zen, Andrew Senior, and Mike Schuster, "Statistical Parametric Speech Synthesis Using Deep Neural Networks", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2013.

[2] Li Deng, "Three Classes of Deep Learning Architectures and Their Applications: A Tutorial Survey", Microsoft Research, Redmond, WA 98052, USA.

[3] Dong Yu, Li Deng, and Frank Seide, "Large Vocabulary Speech Recognition Using Deep Tensor Neural Networks", Interspeech, ISCA, September 2012.

[4] G. E. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent pretrained deep neural networks for large vocabulary speech recognition", IEEE Trans. Audio, Speech, and Lang. Proc., vol. 20, no. 1, pp. 33-42, Jan. 2012.


Parallelising NLP Tasks Using MapReduce Paradigm
Freny Clara Davis, Shalini M, Nidhin M Mohan, Shibin Mohan
S7, Computer Science & Engineering, GEC, Sreekrishnapuram, Palakkad

ABSTRACT: Natural Language Processing (NLP) refers to applications that deal with natural language in one way or another. The proposed system tries to parallelise NLP tasks using the MapReduce paradigm. The main NLP tasks the proposed system is intended to perform are POS tagging and stemming. This paper presents an approach to parallelising these tasks using MapReduce.

I. Introduction

POS tagging: Part-of-speech tagging is the process of assigning a part-of-speech or other lexical class marker to each word in a corpus. Taggers play an important role in speech recognition, natural language parsing, and information retrieval. The input to a tagging algorithm is a string of words and a specified tagset; the output is the best tag for each word.

Stemming: Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form.

POS tagging and stemming are skills that humans possess, but we cannot expect such skills from everyone, and tagging and stemming done by hand are extremely limited in quality and quantity. People would therefore benefit from systems that perform these NLP tasks for them. POS tagging is useful for knowing what function a word plays, instead of depending on the word form alone. POS tagging and stemming can be useful in the following areas:

• Speech synthesis - pronunciation
• Speech recognition - class-based N-grams
• Information retrieval - stemming, selection of high-content words
• Word-sense disambiguation
• Corpus analysis of language, lexicography
• Information Extraction
• Question Answering (QA)
• Machine Translation

Stemming can be used in the field of information retrieval, where it is required to find documents relevant to an information need from a large document set. In the proposed system, POS tagging and stemming are performed using the MapReduce paradigm, a programming model and associated implementation for processing large data sets with a parallel, distributed algorithm on a cluster; this is expected to improve the performance of the current NLP system, in which the tasks are done sequentially.
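Before turning to how these tasks are parallelised, here is a small standalone illustration of the two tasks using NLTK. The example sentence and the choice of the Porter stemmer are our own illustrative assumptions, not components of the proposed system, and the NLTK tokenizer and tagger data must already be downloaded (e.g. via nltk.download).

# Illustration of POS tagging and stemming with NLTK (assumes the tokenizer
# and POS tagger model data have been downloaded beforehand).
import nltk
from nltk.stem import PorterStemmer

sentence = "The taggers were running over the parsed corpora"
tokens = nltk.word_tokenize(sentence)

# POS tagging: assign a lexical class marker to each word.
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('taggers', 'NNS'), ('were', 'VBD'), ('running', 'VBG'), ...]

# Stemming: reduce inflected or derived words to a base form.
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])
# e.g. ['the', 'tagger', 'were', 'run', 'over', 'the', 'pars', 'corpora']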


II. Methodology

NLP tasks such as POS tagging and stemming are usually performed sequentially, which can take a lot of time for large inputs. This problem can be addressed by performing these tasks using the MapReduce paradigm, which is implemented in the Hadoop framework. Hadoop scales up linearly to handle larger data sets by adding more nodes to the cluster, and it allows users to quickly write efficient parallel code. A MapReduce program is composed of a Map procedure that performs filtering and sorting and a Reduce procedure that performs a summary operation. The two steps performed in a MapReduce program are:

Map step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. Each worker node processes its smaller problem and passes the answer back to its master node.

Reduce step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output: the answer to the problem it was originally trying to solve.

MapReduce allows for distributed processing of the map and reduce operations. Provided that each mapping operation is independent of the others, all maps can be performed in parallel.
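The sketch below imitates this flow on a single machine: the input text is split into chunks, a map function emits (stem, 1) pairs for each chunk in parallel worker processes, and a reduce function sums the counts per stem. The toy suffix-stripping stemmer and the hard-coded chunks are stand-ins chosen only for illustration; in an actual deployment the same two functions would be written as Hadoop map and reduce tasks.

# A toy, single-machine imitation of the MapReduce flow described above:
# map tasks run in parallel worker processes, then a reduce step merges results.
import re
from collections import defaultdict
from multiprocessing import Pool

def toy_stem(word):
    # Stand-in for a real stemmer: strip a few common English suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def map_task(chunk):
    # Map step: emit a (stem, 1) pair for every word in this chunk of text.
    return [(toy_stem(w), 1) for w in re.findall(r"[a-z]+", chunk.lower())]

def reduce_task(mapped_outputs):
    # Reduce step: combine the per-chunk answers into one summary (stem counts).
    counts = defaultdict(int)
    for pairs in mapped_outputs:
        for stem, n in pairs:
            counts[stem] += n
    return dict(counts)

if __name__ == "__main__":
    chunks = [
        "Tagging and stemming are performed on large corpora",
        "Stemmed words speed up indexing and searching",
    ]
    with Pool(processes=2) as pool:        # the map calls run in parallel
        mapped = pool.map(map_task, chunks)
    print(reduce_task(mapped))             # e.g. {'tagg': 1, 'stemm': 2, ...}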

III. Conclusion

In this paper we presented an approach to parallelising NLP tasks using the MapReduce paradigm, and we showed how the large amount of time required to perform these tasks sequentially can be reduced.

IV. References

[1] Chuck Lam, "Hadoop in Action", Manning Publications, 2012.

[2] Daniel Jurafsky and James H. Martin, "Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition", Pearson Education, 2012.


TEAM INDIA DOES INDIA PROUD AT THE INTERNATIONAL LINGUISTICS OLYMPIAD

At the recently concluded 12th International Linguistics Olympiad held in Beijing, China, the Indian team bagged the following awards:

* Bronze Medal - Anindya Sharma, Bangalore
* Bronze Medal - Rajan Dalal, Ranchi
* Best Solution Award for Problem No. 3 - Anindya Sharma, Bangalore

The International Linguistics Olympiad (IOL) is one of the newest in a group of twelve International Science Olympiads and has been steadily growing in popularity over the last few years. The goal of the Olympiad is to introduce students to linguistics, a subject which, per se, is not taught in high schools across the world. This year, IOL participants had to decipher the grammar rules, kinship terms and word meanings of Benabena, Kiowa, Engenni, Gbaya and Tangut (now extinct), languages spoken in Papua New Guinea, North America, Nigeria, Congo and central China respectively, each of which has fewer than a few thousand speakers and is on the verge of extinction.

India first competed in the IOL in 2009 and has participated in 6 Olympiads till date. Over the years, Team India has brought home 7 medals (3 Silver and 4 Bronze), 4 Best-Solution prizes, and 3 Honorable Mentions. Team India is chosen through the Panini Linguistics Olympiad conducted by the University of Mumbai, and actively supported by Microsoft Research India, as well as several other premier institutes from across the country including JNU, IIT Guwahati, IIT Patna, IIT Kharagpur, SNLTR, EFLU and Chennai Mathematical Institute.

In a country like India with many languages, we need a lot more linguists and computational linguists to drive Indian language technology and research. The Linguistics Olympiad aims to realize this goal by exposing young minds to the concepts of linguistics and computational linguistics, presented in the form of interesting yet challenging puzzles. The highlight of this year's IOL is that India won the bid to host the Olympiad in 2016.

The Linguistics Olympiad is much less known in India than the other science Olympiads, primarily because linguistics is not taught in schools. On the other hand, exposure to many languages makes Indian students naturally adept at this Olympiad. We need more support from the NLPAI community in spreading awareness about this Olympiad and helping us scale up our activities. For more information on the Panini Linguistics Olympiad, check the website https://sites.google.com/site/paninilinguisticsolympiad/ or send an email to plio.mumbai@gmail.com.


M.Tech Computational Linguistics 2012-2014 Batch

Abitha Anto

Ancy Antony

Athira Sivaprasad

Deepa C A

Divya M

Gopalakrishnan G

Indhuja K

Indu Meledathu

Lekshmi T S




Neethu Johnson

Nibeesh K

Prajitha U

Reshma O K

Sincy V T

Sreeja M

Sreejith C

Sruthimol M P

Varsha K V



M.Tech Computational Linguistics
Dept. of Computer Science and Engg, Govt. Engg. College, Sreekrishnapuram, Palakkad
www.simplegroups.in
simplequest.in@gmail.com

SIMPLE Groups: Students Innovations in Morphology, Phonology and Language Engineering

Article Invitation for CLEAR Dec 2014

We are inviting thought-provoking articles, interesting dialogues and healthy debates on multifaceted aspects of Computational Linguistics for the forthcoming issue of CLEAR (Computational Linguistics in Engineering And Research) Journal, to be published in December 2014. The suggested areas of discussion are:

The articles may be sent to the Editor on or before 10th Dec, 2014 through the email simplequest.in@gmail.com. For more details visit: www.simplegroups.in

Editor, CLEAR Journal
Representative, SIMPLE Groups



Hello World,

This is a very proud moment for all of us, as India has become the first Asian nation to reach Mars. This edition marks an important milestone in the history of CLEAR as well: it is the first printed edition of the CLEAR Journal, and I am very glad to be part of the editorial team to witness this precious moment.

With this edition of CLEAR we bring you an issue on speech processing. It will provide a forum for students to strengthen their background and get exposed to intricate research areas in the field of speech and audio signal processing. The exponential growth of audio and speech data, coupled with the increase in computing power, has led to the growing popularity of deep learning for speech processing. This edition also includes some NLP-related work carried out by UG students and events conducted by the IT, EC and CS departments in our college.

I would like to sincerely thank the contributing authors for their effort to bring their insights and perspectives on the latest developments in speech and language engineering. Technical advances in speech processing and synthesis are posing new challenges and opportunities to researchers, and the pace of innovation continues. I wish our college could also fly high beyond the horizon of Speech & Language Processing. SIMPLE Groups welcomes more aspirants in this area.

Wish you all the best!!!

Nisha M

nisha.m407@gmail.com




