Computational methods to describe languages

Over 7,000 languages are spoken across the world today. They are listed and partially described in linguistic catalogues such as Glottolog and WALS. We spoke to Dr Tanja Samardzic and Dr Ximena Gutierrez about their work in developing a quantifiable and reproducible way of describing languages with the aid of text data, which will help map linguistic diversity.

Over 7,000 different languages are spoken across the world today, some by millions in large countries, while many more are limited to relatively small communities in remote areas. Language processing technology was originally developed for ‘big’ languages with many speakers, especially English, but over recent years interest in other languages has grown. “Many companies want to develop multi-lingual language technology,” says Dr Tanja Samardzic. The question then is: how can we transfer solutions from one language to another? What kinds of solutions work for what kinds of languages? What kinds of languages are there?

As the Head of the Text Group in the Language and Space laboratory at the University of Zurich, Dr Samardzic is working to develop new computational methods, based on analysis of text data across a wide variety of languages.

Language structures

The World Atlas of Language Structures (WALS) is an invaluable source of information about linguistic diversity, providing data on 2,662 languages. Its authors designed a sample of 100 of these languages, which are broadly representative of global diversity. “We’ve taken this list of 100 languages, and are now collecting text samples for those languages,” outlines Dr Samardzic. This sample is structurally, geographically and genealogically varied, providing a solid basis for researchers to capture and map the most important types of linguistic structures. “We can estimate how many languages have short and ‘simple’ words
like English. There are languages with really long words. Short words tend to come with a fixed word order, while long words allow a variable word order,” explains Dr Samardzic. “We are trying to cover various phenomena like that. Almost 100 families are represented within these 100 languages, and we also try to cover a wide geographic area.”

In the project, Dr Samardzic is using mathematical notions from information theory and geometry to describe languages in a quantifiable and reproducible way. “We want to observe a text, collect some data from this text, and then process this in order to categorise this language,” she says. The focus of attention in the project is on the formal side of language rather than the meaning and interpretation. “What is the form of different languages? What units do they consist of? What is the form of these units?” outlines Dr Samardzic. “These are fundamental questions in the scientific study of language.”

Shannon entropy is one of the mathematical notions used in the project as a means of answering such questions, as it provides an insight into the complexity of a language. “We try to measure the entropy of a text as an indicator of language complexity and then see, where we find complex languages, what kind of complexity they show,” explains Dr Samardzic. The aim here is to get a kind of snapshot of linguistic diversity, including insights into languages at risk of extinction. “Languages can die. Several of the languages in our sample are at different degrees of risk of extinction,” says Dr Ximena Gutierrez, a postdoctoral researcher in the Language and Space lab.

This means not only the loss of the cultural heritage associated with a language, but also the loss of our knowledge about the way in which ideas and values are incorporated within it. This is a prime motivation behind Dr Samardzic’s work. “We see language as a code, and the structure of this code can be very different,” she outlines. Different languages have different ways of encoding information, and losing this diversity would have wider consequences. “We would lose awareness of the possibilities that the human mind has come up with to encode content,” says Dr Samardzic. “The kinds of things that can be encoded in language are really fascinating.”

The wider aim in the project is to develop a scalable, computational way of describing languages to facilitate the development of language technology. A further scientific goal is to study the relationship between various extra-linguistic factors and language types. “Why is a certain language more complex than another?” asks Dr Samardzic.

Approximate language complexity (the size of the circle) vs. the endangerment status. Map by C. Bentz and X. Gutierrez-Vasques.

Non-randomness in Morphological Diversity: A Computational Approach Based on Multilingual Corpora

Dr. Tanja Samardžić, PD
URPP Language and Space, University of Zurich
Freiestrasse 16, 8032 Zürich
T: +41 44 63 43945
E: tanja.samardzic@uzh.ch
W: http://www.spur.uzh.ch/samardzic

Olga Sozinova is a PhD student at the Language and Space lab at the University of Zurich. Her research is mainly focused on computational linguistics.

Ximena Gutierrez-Vasques is a postdoctoral researcher at the Language and Space lab at the University of Zurich. Her research interests include natural language processing, quantitative linguistics and low-resource languages.

Dr Christian Bentz is an Assistant Professor at the University of Tübingen. He was previously a postdoc at the Language and Space lab, University of Zurich.

Dr Steven Moran is an Assistant Professor at the University of Neuchâtel. He was previously a postdoc at the Language and Space lab, University of Zurich.
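For readers curious about how the entropy of a text might be measured in practice, here is a minimal sketch in Python. It is an illustration only, not the project's actual method: it assumes that words can be found by splitting on whitespace (which does not hold for every writing system), it treats every word as independent of its neighbours, and the function name unigram_entropy is invented for this example. It computes H = -Σ p(w) log2 p(w), where p(w) is the relative frequency of word w in the text.

```python
import math
from collections import Counter


def unigram_entropy(text: str) -> float:
    """Shannon entropy, in bits per word, of the word distribution in `text`.

    Assumption: words are separated by whitespace and case is ignored.
    """
    tokens = text.lower().split()
    if not tokens:
        return 0.0  # an empty text carries no information
    counts = Counter(tokens)
    total = len(tokens)
    # H = sum over word types of -p * log2(p), with p the relative frequency
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    sample = "the cat saw the dog and the dog saw the cat"
    print(f"{unigram_entropy(sample):.3f} bits per word")
```

A text that repeats a single word has an entropy of zero, while a text in which every word is different reaches the maximum for its length. Under the whitespace assumption, languages with long, richly inflected words tend to produce more distinct word forms, and so higher values on a measure of this kind.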
www.euresearcher.com