
Proceedings of The National Conference on Man Machine Interaction 2014 [NCMMI 2014]


Real Time Characterization of Lip-Movements Using Webcam

Suresh D S, Sandesh C


Director & Head of ECE Dept., CIT, Gubbi, Tumkur, Karnataka
PG Student, ECE Dept., CIT, Gubbi, Tumkur, Karnataka


Abstract: Communication plays an important role in the day-to-day activities of human beings. The main objective of this paper is to help people who are unable to speak. Visual speech plays a major role in lip-reading for deaf listeners. We use local spatiotemporal descriptors to recognize words or phrases from the lip movements of speech-impaired people. Local binary patterns extracted from the lip movements are used to recognize isolated phrases. Experiments were made on ten speakers with 5 phrases, obtaining an accuracy of about 75% in the speaker-dependent case and 65% in the speaker-independent case. Compared with other approaches, such as those evaluated on the AVLetters database, our method outperforms them with an accuracy of 65%. The advantages of our method are its robustness and the possibility of recognition in real time.

Key words: LBP-TOP, Webcam, Visual speech, KNN classifier, Silent speech interface.

I. Introduction


In recent days, the silent speech interface has become an alternative in the field of speech processing. Voice-based speech communication suffers from several challenges arising from the fact that speech must be clearly audible: it cannot be masked, it lacks robustness, it raises privacy issues, and it excludes speech-disabled persons. These challenges may be overcome by the use of a silent speech interface: a device enabling speech communication without the necessity of a voice signal, or when an audible acoustic signal is unavailable. In our approach, lip movements are captured by a webcam placed in front of the lips. Much research focuses on visual information alone to enhance speech recognition [1]. Audio information still plays a larger role than visual information, but in most cases it is very difficult to extract. In our method we concentrate on lip movement representations for speech recognition solely with visual information. Extraction of a set of visual observation vectors is the key element of an AVSR (audio-visual speech recognition) system. Geometric features, combined features, and appearance features are mainly used for representing visual information. Geometric feature methods represent facial animation parameters such as lip movement, shape of the jaw, and width of the mouth. These methods require accurate and reliable facial feature detection, which is difficult in practice and impossible at low image resolution. In this paper, we propose an approach for reading lip movements that significantly improves human-computer interaction and understanding in noisy environments.

II. Local Spatiotemporal Descriptors for Visual Information

In this paper, we concentrate on LBP-TOP in order to extract features from the extracted video frames. The Local Binary Pattern (LBP) operator is a gray-scale invariant, simple texture statistic which has shown the best performance in the classification of different kinds of textures. For an individual pixel in an image, a binary code is generated by thresholding the grey values of its neighborhood against the grey value of the center pixel, as shown in Figure 1(a) [1].
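As an illustration, a minimal NumPy sketch of the basic operator in Figure 1(a) follows; the function name and the convention that a neighbor equal to the center maps to 1 are our own choices, not taken from the paper.

import numpy as np

def lbp_3x3(image):
    """Basic LBP operator: threshold each pixel's 8 neighbors against the
    center pixel and pack the results into an 8-bit code (sketch)."""
    img = image.astype(np.int32)
    h, w = img.shape
    codes = np.zeros((h - 2, w - 2), dtype=np.uint8)
    # Offsets of the 8 neighbors, enumerated clockwise from the top-left.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    center = img[1:h - 1, 1:w - 1]
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= (neighbor >= center).astype(np.uint8) << bit
    return codes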




Figure 1. (a) Basic LBP operator. (b) Circular (8,2) neighborhood.


The LBP code of a pixel is computed as LBP(P,R) = Σ_{p=0..P−1} s(gp − gc) 2^p, with s(x) = 1 if x ≥ 0 and s(x) = 0 otherwise, where gc denotes the grey value of the center pixel (xc, yc) of the local neighborhood and gp corresponds to the grey values of P equally spaced pixels on a circle of radius R. A histogram is created to collect the occurrences of the various binary patterns. The definition of neighbors can be extended to circular neighborhoods with any number of pixels, as shown in Figure 1(b). In this way, we can collect larger-scale texture primitives or micro-patterns, like spots, lines, and corners.
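A sketch of the circular (P, R) variant with the occurrence histogram is given below; the bilinear interpolation used for off-grid sampling points is an assumption matching common LBP implementations, not a detail stated in the paper.

import numpy as np

def circular_lbp_histogram(image, P=8, R=2.0):
    """LBP with P equally spaced neighbors on a circle of radius R,
    collected into a normalized occurrence histogram (sketch).
    The image must be larger than 2 * (ceil(R) + 1) on each side."""
    img = image.astype(np.float64)
    h, w = img.shape
    m = int(np.ceil(R)) + 1  # margin keeps interpolation inside the image
    ys, xs = np.mgrid[m:h - m, m:w - m]
    center = img[m:h - m, m:w - m]
    codes = np.zeros_like(center, dtype=np.int64)
    for p in range(P):
        # Neighbor position (xc - R sin(2*pi*p/P), yc + R cos(2*pi*p/P)),
        # following the convention used in the text.
        xx = xs - R * np.sin(2 * np.pi * p / P)
        yy = ys + R * np.cos(2 * np.pi * p / P)
        y0, x0 = np.floor(yy).astype(int), np.floor(xx).astype(int)
        fy, fx = yy - y0, xx - x0
        # Bilinear interpolation of the sampled grey value gp.
        g = (img[y0, x0] * (1 - fy) * (1 - fx)
             + img[y0, x0 + 1] * (1 - fy) * fx
             + img[y0 + 1, x0] * fy * (1 - fx)
             + img[y0 + 1, x0 + 1] * fy * fx)
        codes += (g >= center).astype(np.int64) << p
    hist = np.bincount(codes.ravel(), minlength=2 ** P)
    return hist / hist.sum()

For P = 8 and R = 2 this yields the (8,2) neighborhood of Figure 1(b) and a 256-bin histogram.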


Local texture descriptors have obtained tremendous attention in the analysis of facial images because of their robustness to challenges such as pose and illumination changes. In our approach, temporal texture recognition uses local binary patterns extracted from three orthogonal planes (LBP-TOP). The LBP-TOP method is more efficient than the ordinary LBP: ordinary LBP extracts information or features in two dimensions, whereas LBP-TOP extracts information in three dimensions, i.e., X, Y, and T. For LBP-TOP, the radii in the spatial and temporal axes X, Y, and T, and the number of neighboring points in the XY, XT, and YT planes, can be different and are denoted RX, RY, RT and PXY, PXT, PYT. The LBP-TOP feature is then denoted LBP-TOP(PXY, PXT, PYT, RX, RY, RT). If the coordinates of the center pixel gtc,c are (xc, yc, tc), then the coordinates of the local neighborhood gXY,p in the XY plane are given by (xc − RX sin(2πp/PXY), yc + RY cos(2πp/PXY), tc), the coordinates of the local neighborhood gXT,p in the XT plane are given by (xc − RX sin(2πp/PXT), yc, tc − RT cos(2πp/PXT)), and the coordinates of the local neighborhood gYT,p in the YT plane are given by (xc, yc − RY cos(2πp/PYT), tc − RT sin(2πp/PYT)). Sometimes the radii in the three axes are the same, and so is the number of neighboring points in the XY, XT, and YT planes; in that case, we use the abbreviation LBP-TOP(P,R), where P = PXY = PXT = PYT and R = RX = RY = RT [1].
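Because the LBP-TOP code of a pixel on, for example, the XT plane only uses neighbors lying within that plane, the uniform LBP-TOP(P,R) case can be sketched by applying the 2-D circular operator from the previous sketch to every XY, XT, and YT slice of the utterance volume and concatenating the three normalized histograms. This is a sketch under that simplification; the general case would sample each plane with its own radii RX, RY, RT and point counts PXY, PXT, PYT.

import numpy as np

def lbp_top_histogram(volume, P=8, R=2.0):
    """Sketch of LBP-TOP(P,R): 2-D circular LBP on every XY, XT and YT
    slice of a (T, H, W) volume; every slice dimension must exceed
    2 * (ceil(R) + 1) for the operator above to apply."""
    T, H, W = volume.shape
    plane_slices = (
        [volume[t] for t in range(T)],         # XY planes: appearance
        [volume[:, y, :] for y in range(H)],   # XT planes: horizontal motion
        [volume[:, :, x] for x in range(W)],   # YT planes: vertical motion
    )
    hists = []
    for slices in plane_slices:
        acc = np.zeros(2 ** P)
        for sl in slices:
            acc += circular_lbp_histogram(sl, P=P, R=R)  # sketch above
        hists.append(acc / acc.sum())
    return np.concatenate(hists)  # length 3 * 2**P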

Figure 2. (a) Volume of utterance sequence. (b) Image in XY plane. (c) Image in XT plane. (d) Image in TY plane.




Figure 2(a) demonstrates the volume of an utterance sequence. Figure 2(b) shows an image in the XY plane. Figure 2(c) is an image in the XT plane, providing a visual impression of one row changing in time, while Figure 2(d) describes the motion of one column in temporal space [1]. An LBP description computed over the whole utterance sequence encodes only the occurrences of the micro-patterns, without any indication of their locations. To overcome this, a representation is introduced which divides the mouth image into several overlapping blocks. Figure 3 also gives some examples of the LBP images. The second, third, and fourth rows show the LBP images drawn using the LBP code of every pixel from the XY (second row), XT (third row), and YT (fourth row) planes, respectively, corresponding to the mouth images in the first row.


Figure 3. Mouth region images (first row), LBP-XY images (second row), LBP-XT images (third row), and LBP-YT images (last row) from one utterance [1].

Figure 4. Features in each block volume. (a) Block volumes. (b) LBP features from three orthogonal planes. (c) Concatenated features for one block volume with the appearance and motion [1].

Figure 5. Mouth movement representation [1].

When a person utters a command phrase, the words are pronounced in order, for instance “you-see” or “see-you”. If we do not consider the time order, these two phrases would generate almost the same features.




To overcome this effect, the whole sequence is divided into block volumes not only according to spatial regions but also in time order, as Figure 4(a) shows. The LBP-TOP histograms in each block volume are computed and concatenated into a single histogram, as shown in Figure 4. All features extracted from each block volume are connected to represent the appearance and motion of the mouth region sequence, as shown in Figure 5.
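A sketch of the block-volume division follows; the grid size is our assumption, and for brevity the blocks here do not overlap, whereas the paper divides the mouth image into overlapping blocks.

import numpy as np

def block_volume_features(volume, grid=(2, 5, 3), P=8, R=2.0):
    """Split a (T, H, W) mouth volume into time x rows x cols block volumes
    and concatenate the per-block LBP-TOP histograms (sketch)."""
    tb, rows, cols = grid
    T, H, W = volume.shape
    feats = []
    for ti in range(tb):
        for ri in range(rows):
            for ci in range(cols):
                block = volume[ti * T // tb:(ti + 1) * T // tb,
                               ri * H // rows:(ri + 1) * H // rows,
                               ci * W // cols:(ci + 1) * W // cols]
                feats.append(lbp_top_histogram(block, P=P, R=R))  # sketch above
    return np.concatenate(feats)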


A histogram of the mouth movements can be defined as follows.
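The formula itself did not survive extraction; following the LBP-TOP formulation of [3], it is presumably the occurrence histogram

H_{i,j} = \sum_{x,y,t} I\{ f_j(x,y,t) = i \}, \qquad i = 0, \ldots, n_j - 1, \quad j = 0, 1, 2,

where f_j(x,y,t) is the LBP code of pixel (x,y,t) computed on plane j (j = 0 for XY, 1 for XT, 2 for YT), n_j is the number of distinct codes produced on plane j, and I\{A\} equals 1 when A holds and 0 otherwise. The histograms are then normalized, N_{i,j} = H_{i,j} / \sum_k H_{k,j}, and concatenated over all block volumes.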

III. Multiresolution Features and Feature Selection


Multiresolution features provide more accurate information and improve the analysis of dynamic events, but using them greatly increases the number of features. If the features from the different resolutions were simply concatenated in series, the feature vector would be very long and the computational complexity would be too high. It is obvious that not all multiresolution features contribute equally, so it is necessary to find out which features matter: their location, their resolution, and, more importantly, their type, such as appearance, horizontal motion, or vertical motion. We need feature selection for this purpose. By changing the parameters, three different types of spatiotemporal resolution are obtained: 1) use of a different number of neighboring points when computing the features in the XY (appearance), XT (horizontal motion), and YT (vertical motion) slices; 2) use of different radii, which can capture the occurrences at different space and time scales; 3) use of blocks of different sizes to create global and local statistical features [4][5].
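The sweep over these three parameter types can be sketched as follows; the specific values and grids are illustrative assumptions, and the selection step itself (the cited works [4][5] use AdaBoost to pick the most discriminative slices) is only indicated by a comment.

import numpy as np

def multiresolution_features(volume):
    """Concatenate LBP-TOP block features computed at several resolutions.
    All parameter settings below are illustrative assumptions."""
    feats = []
    for P in (4, 8):                             # 1) number of neighboring points
        for R in (1.0, 2.0, 3.0):                # 2) radii: space and time scales
            for grid in ((1, 1, 1), (2, 5, 3)):  # 3) global vs. local blocks
                feats.append(block_volume_features(volume, grid=grid, P=P, R=R))
    # Feature selection would follow here: rank the slice histograms by how
    # well they separate the classes and keep only the best; [4] and [5]
    # use AdaBoost for this step.
    return np.concatenate(feats)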

IV. Our System


Our system consists of three stages, as shown in Figure 6. The first stage is the detection of lip movements. The second stage extracts the visual features from the mouth movement sequence. The role of the final stage is to recognize the input utterance using a KNN classifier.

Figure 6. System diagram.




In our approach, a webcam is used to capture the lip movements. We then extract the visual features from the mouth region. These features are fed to the classifier. For speech recognition, a KNN classifier is selected since it is well founded in statistical learning theory and has been successfully applied to various object detection tasks in computer vision. Since the KNN is only used for separating two sets of points, the ten-phrase classification problem is decomposed into two-class problems, and a voting scheme is then used to accomplish recognition. Sometimes more than one class gets the highest number of votes; in this case, 1-NN template matching is applied to these classes to reach the final result. This means that in training, the spatiotemporal LBP histograms of utterance sequences belonging to a given class are averaged to generate a histogram template for that class [1].
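A sketch of this classification stage follows, assuming Euclidean distance between histograms and k = 5 neighbors; neither choice is specified in the paper.

import numpy as np
from itertools import combinations

def predict_phrase(feature, train_feats, train_labels, k=5):
    """Pairwise-voting KNN: one k-NN vote per class pair; ties between
    top-voted classes are broken by 1-NN matching against the averaged
    class histogram templates (sketch)."""
    classes = sorted(set(train_labels))
    votes = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):          # N(N-1)/2 two-class problems
        idx = [i for i, y in enumerate(train_labels) if y in (a, b)]
        dists = [np.linalg.norm(train_feats[i] - feature) for i in idx]
        nearest = [train_labels[idx[j]] for j in np.argsort(dists)[:k]]
        votes[a if nearest.count(a) >= nearest.count(b) else b] += 1
    best = max(votes.values())
    tied = [c for c, v in votes.items() if v == best]
    if len(tied) == 1:
        return tied[0]
    # Tie: 1-NN template matching with per-class averaged histograms.
    templates = {c: np.mean([f for f, y in zip(train_feats, train_labels)
                             if y == c], axis=0) for c in tied}
    return min(tied, key=lambda c: np.linalg.norm(templates[c] - feature))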

V. Experiments Protocol and Results

For a more accurate evaluation of our proposed method, we designed different experiments, including speaker-independent, speaker-dependent, and multiresolution experiments.


1) Speaker-Independent Experiments: For the speaker-independent experiments, leave-one-speaker-out cross validation is utilized. We trained a particular phrase using one speaker, had 10 different speakers say the same phrase, and during testing we successfully recognized the phrase for 6 of the speakers. The overall result is computed as M/N, where M is the total number of correctly recognized sequences and N is the total number of testing sequences. When extracting the local patterns, we take into account not only the locations of the micro-patterns but also the time order of the lip movements, so the whole sequence is divided into block volumes according to not only spatial regions but also time order. Figure 7 demonstrates the performance for every speaker. The results for the second speaker are the worst, mainly because that speaker's big moustache strongly influences the appearance and motion in the mouth region.
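A sketch of the leave-one-speaker-out protocol and the M/N accuracy; predict stands for any classifier with the signature of the predict_phrase sketch above.

def leave_one_speaker_out(feats, labels, speakers, predict):
    """Hold out every sequence of one speaker, train on the rest, and
    report the overall accuracy M/N (sketch of the evaluation protocol)."""
    correct = total = 0
    for held_out in sorted(set(speakers)):
        train = [i for i, s in enumerate(speakers) if s != held_out]
        test = [i for i, s in enumerate(speakers) if s == held_out]
        train_feats = [feats[i] for i in train]
        train_labels = [labels[i] for i in train]
        for i in test:
            correct += int(predict(feats[i], train_feats, train_labels) == labels[i])
            total += 1
    return correct / total  # M is `correct`, N is `total`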

Figure 7. Recognition performance for every speaker.


2) Speaker-Dependent Experiments: For speaker-dependent experiments, leave-one-utterance-out cross validation is utilized. In our approach, when a particular speaker is trained with a list of phrases and testing is then performed with the same speaker, 70% of the phrases were identified correctly.

3) One-One versus One-Rest Recognition: In the previous experiments on our own dataset, the ten-phrase classification problem is decomposed into 45 two-class problems (“Hello”-“See you”, “I am sorry”-“Thank you”, “You are welcome”-“Have a good time”, etc.). But using this multiple two-class strategy, the number of classifiers grows quadratically with the number of classes to be recognized, as in the AVLetters database. When the class number is N, the number of KNN classifiers would be N(N−1)/2.
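A quick check of this quadratic growth, using class indices in place of the phrase names:

from itertools import combinations

# With N = 10 phrase classes, the pairwise strategy needs N*(N-1)/2 = 45
# two-class KNN classifiers.
print(len(list(combinations(range(10), 2))))  # -> 45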

Figure 8. Selected 15 slices for phrases “See you” and “Thank you”. “|” in the blocks means the YT slice (vertical motion) is selected, “_” the XT slice (horizontal motion), and “/” the appearance XY slice [1].




Figure 8 shows the selected slices for the similar phrases “see you” and “thank you”. These phrases were the most difficult to recognize because they are quite similar in the latter part, containing the same word “you”. The selected slices are mainly in the first and second parts of the phrases; just one vertical slice is from the last part. The selected features are consistent with human intuition.

Table I. Results from One-to-One and One-to-Rest Classifiers on Semi-Speaker-Dependent Experiments (results in parentheses are from the One-to-Rest strategy).

VI. Conclusion

A real-time capture of lip movements using a webcam for the recognition of speech was proposed, in order to help disabled persons. Our approach uses local spatiotemporal descriptors for the recognition of the input utterance. LBP-TOP is used to extract the features from the captured images. Experiments were made on ten speakers, and the lip movements were converted into speech very efficiently. Compared to other approaches, our method outperforms them with an accuracy of 70%.

References

1. G. Zhao, M. Barnard, and M. Pietikäinen, “Lipreading with local spatiotemporal descriptors,” IEEE Trans. Multimedia, vol. 11, no. 7, Nov. 2009.
2. T. Ahonen, A. Hadid, and M. Pietikäinen, “Face description with local binary patterns: Application to face recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 12, pp. 2037–2041, 2006.
3. G. Zhao and M. Pietikäinen, “Dynamic texture recognition using local binary patterns with an application to facial expressions,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 6, pp. 915–928, 2007.
4. G. Zhao, M. Pietikäinen, and A. Hadid, “Local spatiotemporal descriptors for visual recognition of spoken phrases,” in Proc. 2nd Int. Workshop Human-Centered Multimedia (HCM2007), 2007, pp. 57–65.
5. G. Zhao and M. Pietikäinen, “Boosted multi-resolution spatiotemporal descriptors for facial expression recognition,” Pattern Recognit. Lett., Special Issue on Image/Video-Based Pattern Analysis and HCI, 2009, accepted for publication.
