Special Centre for Sanskrit Studies, J.N.U., New Delhi
Current Progress in Developing Resources for Sanskrit and other Indian languages
Girish Nath Jha
Associate Professor, Computational Linguistics, Special Center for Sanskrit Studies, J.N.U., New Delhi – 110067
&
Mukesh and Priti Chatter Distinguished Professor of History of Science, University of Massachusetts Dartmouth
Keynote delivered at WAVES2012, UMASSD, July 14, 2012
What is a “resource”?
Language data and corpora in standard formats for computer processing, for direct or indirect use by humans
India is considered a “resource-poor” country, as we do not have enough standard resources.
What does it mean for Sanskrit?
- electronic texts, dictionaries
- digital libraries
- parallel corpora
- search engines
- language processing tools (MT, Speech, OCR, OLHWR etc)
- second Indology revolution in the making?
Why Sanskrit?
Language    Script(s)               Family
Hindi       Devanagari              Indo-Aryan
Sanskrit    Devanagari              Indo-Aryan
Marathi     Devanagari              Indo-Aryan
Konkani     Devanagari              Indo-Aryan
Maithili    Devanagari              Indo-Aryan
Nepali      Devanagari              Indo-Aryan
Sindhi      Devanagari              Indo-Aryan
Bodo        Devanagari              Tibeto-Burman
Dogri       Devanagari              Indo-Aryan
Santhali    Devanagari, Ol Chiki    Austro-Asiatic
Bengali     Bengali                 Indo-Aryan
Assamese    Bengali                 Indo-Aryan
Manipuri    Bengali, Meithei        Tibeto-Burman
Gujarati    Gujarati                Indo-Aryan
Kannada     Kannada                 Dravidian
Malayalam   Malayalam               Dravidian
Oriya       Oriya                   Indo-Aryan
Punjabi     Gurumukhi               Indo-Aryan
Tamil       Tamil                   Dravidian
Telugu      Telugu                  Dravidian
Urdu        Perso-Arabic            Indo-Aryan
Kashmiri    Perso-Arabic            Indo-Aryan
Indian constitution on languages
448 articles, 12 schedules, 107 amendments (so far)
- Part III – Fundamental rights
- Part IV A – Fundamental duties
- Part XVII – Official Language
- Part XVII – Regional Languages
- Part XVII – Language of Supreme Court and High Court
- Part XVII – Special Directives
Sanskrit Commission, 1956
Sanskrit in digital age
- Computer for Sanskrit
- Sanskrit for Computer
major e-contents
Sanskrit wikipedia
- Sanskrit wikipedia (Sanskrit medium wikipedia) http://sa.wikipedia.org
- Sanskrit wikisource (Sanskrit e-texts)
- Sanskrit wiktionary (Sanskrit dictionary)
- Sanskrit wikiBooks (Sanskrit e-library)
major e-contents
Digital libraries
- DLI project (http://dli.iiit.ac.in/): 1022 Sanskrit books (IISc, CMU, NSF, ERNET, MCIT)
- NSF funded, Brown Univ (http://www.sanskritlibrary.org/)
- Clay’s project (http://www.claysanskritlibrary.org): JJC foundation, NYU Press
- INRIA, Paris (technical texts, tools)
- IGNCA (http://ignca.nic.in/sanskrit.htm)
- J-TESS (JNU Text Encoding and Search for Sanskrit)
major e-contents
Sanskrit e-documents
- Maharshi Mahesh Yogi (http://sanskrit.safire.com/Sanskrit.html)
- Avinash Sathaye: Sanskrit documents list (http://sanskritdocuments.org/)
- Srinivas Varkhedi: Sanskrit corpus (http://rsvidyapeetha.ac.in/)
- Oliver Hellwig (Univ of Berlin)
- Anand Mishra (http://sanskrit.sai.uni-heidelberg.de/)
- http://sanskrit.jnu.ac.in
major e-contents
Sanskrit documents
- Tirupati Vidyapeeth
- ASR Melkote
- CDAC heritage computing group
Sanskrit blogs
- JNU students
- Others (http://sanskritlinks.blogspot.com)
Sanskrit corpora and tagset
- JNU, LDC, Univ. of Pennsylvania, U.Hyd
major e-contents: static
- Himanshu Pota (http://learnsanskrit.wordpress.com/) http://www.ee.adfa.edu.au/staff/hrp/personal/sanskrit/
- American Sanskrit Institute (http://www.americansanskrit.com/)
- Acharya, IITM (http://acharya.iitm.ac.in/sanskrit/tutor.php)
- Vasudev Bhatt (http://www.ourkarnataka.com/learnsanskrit/sanskrit_main.htm)
- Sanskrit Bharati (http://www.samskritabharati.org/newsite/index.php)
- http://sanskritbhasha.blogspot.com/
major e-contents: dynamic
Tutorials
- Sudhir Kaicker (http://www.sanskrit-lamp.org/)
- Prof. G.V.Singh (CASTLE project of DoE)
- Peter Scharf
- Avinash Sathaye
- Sanskrit CD (Mahesh Kulkarni, CDAC Pune)
Language processing tools
- Gerard Huet
- Amba Kulkarni
- Peter Scharf
- Girish N Jha
- Anand Mishra
Editor: Girish Nath Jha
Work done at Jawaharlal Nehru University (JNU), New Delhi
Special Center for Sanskrit Studies, JNU
- Linking traditional scholarship with modern methods
- Exploring Science & Technology in Sanskrit
- Developing language technology resources and tools for Sanskrit and other Indian languages
- Collaboration with universities
- Collaboration with industry
SATIAIT: Science And Technology In Ancient Indian Texts
Center for Indic Studies, UMASSD initiative
Thanks to the initiative and efforts of Prof Bal Ram Singh, we are carrying out the following activities:
- Identifying key S&T texts
- Digitizing them, providing computer help
- Translating
- Lab experiments
- Documenting…
Editors: Bal Ram Singh Girish Nath Jha Umesh Kumar Singh Diwakar Mishra
Editors: Girish Nath Jha Bal Ram Singh R P Singh Diwakar Mishra
Editors: Angela Marcantonio Girish Nath Jha
Technology Development for Indian Languages
Building Blocks of Language Technology Development
Language Technology
- Standards
- Software/Tools
- Localization
- Awareness
- Training
- Technologies
- Linguistic Resources
- Certification
Near Future initiatives
- Localization R & D Center (JNU, CDAC, IIT Delhi)
- NME-ICT center at JNU (MHRD, JNU)
Machine Translation
- SHMT (Dept of IT, Govt. of India)
- SaHiT (unfunded)
- Microsoft Translator Hub
  - English-Hindi (Microsoft)
  - English-Urdu (Microsoft)
  - English-Gujarati (Microsoft)
  - Sanskrit-English (unfunded)
  - English-Maithili (unfunded)
SHMT (DIT)
A consortium of 7 universities/institutes
- University of Hyderabad
- JNU
- IIIT Hyderabad
- Tirupati Vidyapeeth
- Sanskrit Academy Hyderabad
- Poornaprajna Vidyapeth Bangalore
- Rajasthan Sanskrit University, Jaipur
Duration: 3 yrs (2008 – 2012)
MT system to be hosted on http://tdil-dc.in very soon
Phase 2 (2012-2015)
Indian Languages Corpora Initiative (ILCI)
- A consortium of Indian universities has been formed under my leadership: 17 languages, with the remaining 6 to join later
- Parallel tagged corpora of 100,000 sentences in all Indian languages in the tourism, health, agriculture, and entertainment domains
- Funded by the TDIL program of the Ministry of C & IT
- Phase 1: 2009-12 (a consortium of 12 languages including English); corpora to be hosted on http://tdil-dc.in very soon
- Phase 2: 2012-2015
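As an illustration only, a sentence-aligned parallel tagged corpus of the kind described above could be modelled as follows. This is a minimal sketch, not the ILCI format: the class names `TaggedSentence` and `ParallelEntry`, the field names, the sample sentences, and the part-of-speech labels are all hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TaggedSentence:
    """One sentence in one language, stored as (word, tag) pairs."""
    lang: str
    tokens: List[Tuple[str, str]]

    def text(self) -> str:
        # Rebuild the plain sentence from its tokens.
        return " ".join(word for word, _ in self.tokens)


@dataclass
class ParallelEntry:
    """The same sentence rendered in several languages, keyed by language code."""
    sentence_id: str
    domain: str  # e.g. "health" or "tourism"
    translations: Dict[str, TaggedSentence] = field(default_factory=dict)

    def add(self, sentence: TaggedSentence) -> None:
        self.translations[sentence.lang] = sentence

    def aligned(self, src: str, tgt: str) -> Tuple[str, str]:
        # Return the plain-text sentence pair for two languages.
        return self.translations[src].text(), self.translations[tgt].text()


# Hypothetical sample entry; the tags are illustrative, not an official tagset.
example = ParallelEntry(sentence_id="health-0001", domain="health")
example.add(TaggedSentence("en", [("This", "PRON"), ("is", "VERB"),
                                  ("a", "DET"), ("hospital", "NOUN")]))
example.add(TaggedSentence("hi", [("यह", "PRON"), ("एक", "NUM"),
                                  ("अस्पताल", "NOUN"), ("है", "VERB")]))
print(example.aligned("en", "hi"))
```

Keying every translation by a shared sentence identifier is what makes the corpus "parallel": any pair of languages can be extracted from the same entry without a separate alignment step.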
Languages & Consortia partners
- Consortium of universities
- Server-based corpora development and management: the server is called “sanskrit”
- Limited crowd sourcing
Shallow parsing tools for Indian languages
- Under a consortium project led by Univ. of Hyderabad
- Morph analyzers for 11 Indian languages
- Duration: 2012-15
Consultancies
Online Handwriting Recognition for Devanagari based languages -Microsoft
Indic languages tagset and annotation -Microsoft Research India
Multimodal data in 8 security sensitive languages (Indian English, Hindi, Urdu, Tamil, Bangla, Punjabi, Pushto, Dari) -LDC, University of Pennsylvania
English-(major) Indian languages Machine Translation (English-Hindi, English-Urdu, English-Gujarati, Sanskrit-English, English-Maithili), started this summer -Microsoft
Some of the recent R&D with the help of research students
- Sanskrit Speech Synthesizer (in collaboration with Microsoft Research India) (prototype by next year)
- Named Entity Recognizer for Sanskrit (prototype finished)
J-TESS : JNU Text Encoding & Search for Sanskrit
Tools
- Server-based corpora creation and annotation application called ILCIANN
- Sanskrit and other Indian languages processing tools
- Multimedia animation, e-learning tools
- Lexical resources and search
- Indian language transliterator
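To give a feel for what a transliterator between Indian scripts can involve, here is a minimal sketch of one well-known approach: the major Indic blocks in Unicode share a common layout, so many characters can be moved between scripts by a fixed code-point shift. This is an assumption-laden illustration, not the method used by the JNU tools; the function name `transliterate` and the table `BLOCK_START` are hypothetical.

```python
# Start of each 128-code-point Unicode block for some Indic scripts.
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80


def transliterate(text: str, src: str, tgt: str) -> str:
    """Shift every character of the source block into the target block.

    Characters outside the source block (spaces, Latin letters, digits)
    are copied unchanged. Caveat: not every position is assigned in every
    block, so some outputs may be unassigned code points that a real
    system would need to handle with script-specific rules.
    """
    src_start = BLOCK_START[src]
    shift = BLOCK_START[tgt] - src_start
    out = []
    for ch in text:
        cp = ord(ch)
        if src_start <= cp < src_start + BLOCK_SIZE:
            out.append(chr(cp + shift))
        else:
            out.append(ch)
    return "".join(out)


# Devanagari KA (U+0915) shifted into the Bengali and Telugu blocks.
print(transliterate("\u0915", "devanagari", "bengali"))
print(transliterate("\u0915", "devanagari", "telugu"))
```

The shift works because these blocks were laid out in parallel, following the earlier ISCII encoding. A production transliterator would add per-script exceptions on top of it, and Perso-Arabic scripts such as Urdu need a different approach altogether.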
Demo http://sanskrit.jnu.ac.in
Thank you! Questions?
girishjha@gmail.com 91-11-26741308