TAUS REVIEW of language business and technology
The "Data" Issue
Reviews of Language Business & Technology in Asia, Africa, the Americas and Europe. Plus... columns by Nicholas Ostler, Lane Greene, Jost Zetzsche and Luigi Muzii, and an update on the Japanese Translation Market by Una Softic.
April 2015 - No. III
Magazine with a Mission
How do we communicate in an ever more globalizing world? Will we all learn to speak the same language? A lingua franca, English, Chinese, Spanish? Or will we rely on translators to help us bridge the language divides?

Language business and technology are core to the world economy and to the prevailing trend of globalization of business and governance. And yet, the language sector, its actors and innovations do not get much visibility in the media. Since 2005 TAUS has published numerous articles on translation automation and language business innovation on its web site. Now we are bundling them in TAUS Review, an online quarterly magazine.

TAUS Review is a magazine with a mission. We believe that a vibrant language and translation industry helps the world communicate better, become more prosperous and more peaceful. Communicating across hundreds – if not thousands – of languages requires adoption of technology. In the age of the Internet of Things and the internet of you, translation – in every language – becomes embedded in every app, on every screen, on every web site, in every thing.

In TAUS Review reporters and columnists worldwide monitor how machines and humans work together to help the world communicate better. We tell the stories about the successes and the excitements, but also about the frustrations, the failures and shortcomings of technologies and innovative models. We are conscious of the pressure on the profession, but convinced that language and translation technologies lead to greater opportunities.

TAUS Review follows a simple and straightforward structure. In every issue we publish reports from four different continents – Africa, Americas, Asia and Europe – on new technologies, use cases and developments in language business and technology from these regions. In every issue we also publish perspectives from four different 'personas' – researcher, journalist, translator and language – by well-known writers from the language sector. This is complemented by features and conversations that are different in each issue.

The knowledge we share in TAUS Review is part of the 'shared commons' that TAUS develops as a foundation for the global language and translation market to lift itself to a high-tech sector. TAUS is a think tank and resource center for the global translation industry, offering access to best practices, shared translation data, metrics and tools for quality evaluation, training, research.
Colophon
TAUS Review is a free online magazine, published four times per year. TAUS members and non-members may distribute the magazine through their web sites and online media. Please write to editor@taus.net for the embed code. TAUS Review currently has about 5,000 readers globally.

Publisher & managing editor: Jaap van der Meer
Editor & publication manager: Anne-Maj van der Meer
Distribution and advertisements: Anne-Maj van der Meer
Enquiries about distribution and advertisements: review@taus.net

Editorial contributions and feedback can be sent to:
General: editor@taus.net

Continental reviews of language business and technology:
• Africa review: africa@taus.net
• Americas review: americas@taus.net
• Asia review: asia@taus.net
• Europe review: europe@taus.net

Persona's perspectives of language business and technology:
• Translator: translator@taus.net
• Research: research@taus.net
• Language: language@taus.net
• Journalist: journalist@taus.net
Contents

Leader
Leader by Jaap van der Meer

Reviews of language business & technologies
In Asia by Mike Tian-Jian Jiang
In Africa by Amlaku Eshetie
In the Americas by Brian McConnell
In Europe by Andrew Joscelyne

Columns
The Translator's Perspective by Jost Zetzsche
The Research Perspective by Luigi Muzii
The Language Perspective by Nicholas Ostler
The Journalist's Perspective by Lane Greene

Features
Japan's translation industry is feeling very Olympic today by Una Softic
Contributors
Directory of Distributors
Industry Agenda
Leader by Jaap van der Meer
Treasure and treason in the life of a translator
Was Robin Thicke simply inspired by Marvin Gaye, or did he commit a crime by copying his famous tune from the single "Got to Give It Up" from 1977? The jury's verdict on March 10 was: "Guilty!" This was a clear example of copyright infringement. Robin Thicke and his co-writer have to pay the Gaye family $7.4 million in damages.
This unprecedented court case made a lot of professionals working in the music industry and other creative sectors worried and uncertain. What is original, what isn't? Can I trust my inspiration, or am I – perhaps unconsciously – a copycat? Am I the artist that I've always wanted to be, creating true treasures, or am I – perhaps unknowingly – committing treason against fellow artists before me? Copyright law is going through one of its most challenging times. Legislation on intellectual property is firmly rooted in the last century, when publishing was still... well, like publishing used to be. Roles and functions were clearly defined, and publishing was a capital-intensive industry, requiring recording studios, printing presses, agents, middlemen. Copyright law has not caught up with the times. And the times are changing... so ferociously fast. Legislators at the European Commission are now digging into the matter, and they'd better come to conclusions fast or we will see more legal cases unsettling and disrupting the course of innovating industries.
Like singers and songwriters, translators are bounced between inspiration and plagiarism every day on every job. Just in the last year, the process and tool environment for industrial translators has changed radically. Machine translation technology is now often embedded somehow, leaving little room for a translator to create a sentence from scratch. There is always a suggestion, whether it comes from a translation memory database or from a machine translation engine that is trained with translation data coming from god knows which sources. In this month's issue of TAUS Review – we call it the 'Data issue' – we have collected interesting perspectives on the pervasive trend of the datafication of translation. Does this remind us of the Gartner Hype Cycle, asks Luigi Muzii in his researcher's perspective. Nicholas Ostler – with his language philosopher's hat on – rejects the assumption that all that is needed is the past record of a language. It is we, humans, and the translators in particular, who create and understand new messages by reconfiguring memories and patterns that we recognize. Lane Greene, the journalist, picked up the phone to call Google and question them about the 'unreasonable effectiveness of data'. Really? No, there are limits to data. And Jost Zetzsche, the translator, weighs the balance in his column "Data on Data". Andrew Joscelyne talked with Roberto Navigli, the master brain behind the BabelNet project. BabelNet uses Wikipedia and WordNet to help disambiguate data.
Brian McConnell approached the ambiguity from a different angle and makes a concrete proposal to TAUS to step up and promote the adoption of TML (Translation Mark-up Language), a microformat that is not visible to readers but helps translators (human or machine) choose the right interpretation.

And what to do in the meantime? With these copyright issues, for instance? TAUS did a study back in 2013 and published some recommendations in "Clarifying copyright on translation data". We recommend that European legislators – and anyone who is interested – read this article and join the discussion. And translators? Is there still treasure in their work, or are they quickly becoming the most skilled copyists? "Wait: whether or when the machines will render the work of humans obsolete," says Luigi Muzii.

And more articles of course. We hope you enjoy this quarter's issue.
Review of language business & technologies in Asia by Mike Tian-Jian Jiang
The Circle of Data, Information, Knowledge, Understanding, and Wisdom
Instead of being a comprehensive catalogue of translation-related data, this article will generalize data gathering and curating approaches, especially for Asian translation demands. Before going into details, let's review the famous hierarchy of data, information, knowledge, understanding, and wisdom (DIKUW). We will then morph this into a circle, or a positive feedback loop, in hope of shedding some light on the path of pursuing bigger and smarter data.
In 1989, American organizational theorist Russell Ackoff published the paper "From Data to Wisdom" in the Journal of Applied Systems Analysis, proposing the relationship of DIKUW:
• Data: symbols;
• Information: answers to "who", "what", "where", and "when" questions;
• Knowledge: answers to "how" questions;
• Understanding: answers to "why" questions;
• Wisdom: evaluated understanding.
The hierarchy is usually described as a pyramid, sometimes with an implication of time or value: the earlier/rawer/easier, the less valuable. It is quite intriguing to link the current trend of big data to this particular perception: the data is valued not only by size but also by swiftness. Thanks to the Internet, that now seems possible, if not contradictory. The question, however, is whether the urge for big data really contradicts DIKUW. Perhaps whether two concepts are compatible or not is more a matter of context. Once we drop the desire for a universal theory, it becomes clearer that, at the very least in the context of translation, disambiguation is no doubt a crucial part. Hence the compatibility in doubt can be transformed into a definition problem:
when we talk about data, are you thinking what I am thinking? In the shared field of translation-related research and industry, the data needed is rather more sophisticated than just symbols. Unlike the common big data stories of these days, the data required for translation is not just some search result or server log; hence the terms "corpus", "bi-text", "translation memory", etc., along with the actions "curation", "alignment", "annotation", and so on.
In other words, when I say translation-related data, the "data" alone, without the modifier, is actually just a bunch of raw ingredients, such as texts, images, audio, or even video files, while the whole phrase "translation-related data" is more like information: who is involved with the data? What is the data about? What is the data source? When did the data originate? Here comes a new quest: how do you acquire the above information?
Again, thanks to the Internet and the search engines, the ingredients are almost free. The catch is, there is still no free lunch. At first glance, a simple keyword search may lead us to some nice resources. For example, combining what just popped up in the previous paragraph, one may formulate queries with specific language pairs, like "Japanese English corpus", which happens to return a nice list. The quest of information acquisition for translation-related data requires inter-discipline collaboration. Particularly for the Asian translation business, it is not hard to imagine that, besides the typical prerequisites of domain knowledge and the genre/style of the outcome, it demands a deep understanding of the differences between Asian languages, or of the heterogeneousness between Asian and non-Asian languages. If that sounds exaggerated and intimidating, allow me to provide an almost stupid example, beginning with a naïve question (and please bear with me if you know
the answer already): where can one find data to assist Japanese-to-English place-name translation for online shopping/shipping? As a computational linguist, the answer seemed trivial and tedious to me: just go trawling Wikipedia, or, if you want to be competitive for job security, query DBpedia with SPARQL as a practice of the "Semantic Web." Of course, it was disappointing that neither the coverage nor the quality sufficed. But then it came: one of my colleagues in the sales department – worried about sounding amateurish and overstepping – suggested: how about the address data of Japan Post? Ta-da!
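For readers who want to try the DBpedia route just mentioned, the sketch below (mine, not the author's) queries the public SPARQL endpoint for Japanese/English label pairs of Japanese settlements. The ontology classes and properties used (dbo:Settlement, dbo:country, dbr:Japan) are assumptions that may need adjusting, and, as the story goes on to show, coverage and quality will still disappoint.

# A rough sketch of the DBpedia/SPARQL route described above: pull
# Japanese/English label pairs for Japanese settlements. The endpoint URL is
# real; the ontology classes and properties are assumptions.
from SPARQLWrapper import SPARQLWrapper, JSON  # pip install sparqlwrapper

QUERY = """
SELECT ?ja ?en WHERE {
  ?place a dbo:Settlement ;
         dbo:country dbr:Japan ;
         rdfs:label ?ja, ?en .
  FILTER (lang(?ja) = "ja" && lang(?en) = "en")
}
LIMIT 100
"""

def fetch_place_name_pairs():
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery(QUERY)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    # Each binding becomes one (Japanese, English) candidate translation pair.
    return [(b["ja"]["value"], b["en"]["value"])
            for b in results["results"]["bindings"]]

if __name__ == "__main__":
    for ja, en in fetch_place_name_pairs()[:10]:
        print(ja, "->", en)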
Well, unlike fairy tales, there is always much more after the happy ending. Japan Post's address data turned out to be Romanization in upper case. Normalizing the case is not a big deal, but some Romanized terms proved problematic: basement, floor, ward, and several other typical units are still Roman-script renderings of the Japanese. Luckily, it is still not too hard to search-and-replace them. The really important thing here is to be aware of the situation in the first place, and then talk to customers for a mutual understanding: do you want "ward" to be "ku" or "area", or…? Why? So here we are. For shipment, if the place name will eventually be presented to Japan Post, why not keep it as it is? For other potential customers – say, an online photo-sharing site that wants Japanese-English bilingual geo-locations – plain English is, up to a certain level, better than Romanized Japanese, and floor and basement will probably be useless anyway. Furthermore, now DBpedia is welcome.
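As a minimal sketch of the clean-up step described above – assuming nothing about the author's actual pipeline – one might normalize the upper-case Romanized fields and optionally replace Romanized Japanese unit words with English glosses. The unit list below is purely illustrative, and the "ku" versus "ward" choice is exactly the customer decision discussed in the text.

# Illustrative only: normalize upper-case Romanized address tokens and
# optionally swap Romanized Japanese units for English glosses.
ROMAJI_UNITS = {
    "KU": "ward",
    "SHI": "city",
    "CHO": "town",
    "CHOME": "block",
}

def normalize_address(raw: str, translate_units: bool = True) -> str:
    out = []
    for token in raw.strip().split():
        if translate_units and token.upper() in ROMAJI_UNITS:
            out.append(ROMAJI_UNITS[token.upper()])
        else:
            # Title-case ordinary place-name tokens: "SHIBUYA" -> "Shibuya".
            out.append(token.capitalize())
    return " ".join(out)

print(normalize_address("SHIBUYA KU TOKYO TO"))
# -> "Shibuya ward Tokyo To"; pass translate_units=False to keep "Ku" instead.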
Isn’t it still a long, tiresome and uncertain journey towards the idea of bigger, quicker and smarter data?
Every decision a customer approves will subsequently become evaluated knowledge, hopefully qualified as wisdom, even when it is as small and silly as the story above. Wait, isn't it still a long, tiresome and uncertain journey towards the idea of bigger, quicker and smarter data? I certainly hope not. Imagine that the wisdom of why and how to prepare place-name data soon feeds the next round of data acquisition and inspires more keywords for search engine queries. Even better, if one is willing to invest time and money to semi-automate this positive feedback loop, the pyramid of DIKUW will become the circle of DIKUW for the translation industry.
Once the engine of the circle is started, the collision between big data and DIKUW will ease, and the next post-happy-ending quest shall be revealed.
Review of language business & technologies in Africa by Amlaku Eshetie
Translation in Ethiopia – a long way to catch up with the industry as practiced in the rest of the world
In my previous two articles, I provided readers with a general glimpse of translation and localization contexts and practices in Africa. Those articles, I believe, have helped readers understand where the industry is in other parts of Africa compared to where it is in the world. In this third article, I try to picture the translation industry (I deliberately did not include localization here, because it isn't yet at that stage) in Ethiopia. I write this article from my own experience and from a chat with two veteran translators. It will broadly cover the qualification and professionalism of translators, the quality of translation, the nature/type of translation and the market, as well as future prospects.
The Translators
In Ethiopia, translation businesses are mainly and traditionally located near judiciary courts and mostly owned by non-professionals. The owners often set up the business either because they spot a business opportunity or because they are driven by some kind of push/pull factors. These owners have neither the background nor the know-how for translation and hire anyone who says he/she can translate into a certain language. Questions like 'How many translators are there?' and 'What percentage of the translators have what qualifications or training?' would require a survey or study to answer.

On top of the lack of formal training and experience in translation, which of course contributes to the problem, the translators lack professionalism as well. Some translators show up at the translation companies while they are looking for other jobs, or when they just need some immediate money, and disappear again afterwards. Others take documents to be translated and never return. Still others lack the skills to type their translations themselves and need typists to whom they dictate their translations. The nature of employment for these translators is freelance or temporary, and there is no recruitment procedure. The compensation system is liberal, i.e. it is settled purely through negotiation, and the unit of negotiation isn't defined at all: it can be the number of pages a translator translates, an agreed lump sum per day, or whatever else the owner and the translator agree upon.
The Translation
Translation, more particularly written/document translation, is as old as the history of written language. Evidence of this is the translation of religious manuscripts into Ethiopian languages from Hebrew, Greek, Aramaic, or Arabic. From religion it moved to the legal, medical, technical, and business sectors. Since then
it has remained document translation: translation of documents such as medical prescriptions and certificates; court orders, decisions, and reports; marketing documents, letters, and other technical documents. So far so good, it appears. However, the issues specific to Ethiopian translation services are the rate of development/evolution, quality, and accessibility. By rate of development or evolution I mean the rate of growth or dynamism the companies show. Translation companies that have existed for 30 years or more are still there, doing the same level of translation with nearly the same kind of people and capacity.

In terms of quality, unless the translation companies recruit trained, in-house translators, adopt modern management systems and tools, widen their scope and increase diversity, it is hard to think they would be able to maintain the quality and punctuality of their translation services.

Finally, when we look at accessibility, translation companies in the provinces are nearly non-existent. Even in the capital city, Addis Ababa, which is also the seat of the African Union and headquarters of other international organizations, the services are concentrated in one area, called Stadium, and very few are near or in judiciary courts, as I mentioned earlier.

Market and Pricing
When discussing the translation market in Ethiopia, it is crucial to mention how the public and the general business population perceive translation. As my experience and the experiences of the veteran translators I talked to tell us, most people do not understand translation as a profession and as a business. Therefore, when someone needs a certain paper or document to be translated from one language into another, they use their social network and ask a colleague, a relative, or any other person to do it for them for free, regardless of the quality.

Attributable to this practice and attitude, many individuals as well as professional organizations do not seek professional assistance for their translation needs. Whenever they do, as their expectations are so low, they are shocked when they hear a price quotation of something equivalent to 5 dollars for a small page or 10 dollars for a standard page. In consequence, pricing varies from person to person and document to document, depending on the level of competition among the translation companies: some quote very low, others medium, and others relatively higher for just the same volume of translation.

The Future Prospect
For the translation business in Ethiopia to evolve and become dynamic, there are several drawbacks to overcome: those discussed in this article as well as others, such as ICT infrastructure/facilities and skills. Nevertheless, the prospect for the Ethiopian translation business, as I see it, is bright for several reasons, and change is inevitable.

Universities will sooner or later realize that translation and localization is an important industry and in turn will launch programmes that train qualified translators and localizers. Ethiopia is also attracting foreign investors through its Foreign Direct Investment policy more than ever. This will create more demand for translation of documents, software, websites, etc. from and into several local languages as well as foreign languages. Many countries in the world have a number of translation companies, agencies and translators, and the sector has become a significant player in their economies. Ethiopia is no exception, and there is no way that we will not learn from, or be influenced by, the experiences of other countries.
Review of language business & technologies in the Americas by Brian McConnell
Preemptive Disambiguation
As any professional translator knows, high-quality translation depends on understanding the context of the source material. This article introduces the concept of preemptive disambiguation, along with an example microformat that can be embedded in online documents to make them easier to translate accurately.
The basic idea behind preemptive disambiguation, or "Pre D", is to embed information in a document in a format that is hidden from normal users and does not damage the visible document layout, but is visible to any machine or professional translator working on it. Microformats are an especially attractive way to do this, as they are already part of web standards and are widely supported by web browsers and other tools. For example, let's say that we want to embed geographical information in an article in a way that enables a program to easily extract this information and know its context. To do this we might say:

<p>The birds <span class="species" style="display:none">Passer domesticus</span> roosted at <span class="geo"><span class="latitude">45.5</span><span class="longitude">-122.68</span></span>.</p>

The additional markup is invisible to users, but any program parsing this page will be able to see and extract this geographical information (for example, to display this page in search results overlaid on a map). This general approach is already used to add structure and machine-readable content to
regular web pages, and can easily be applied to assist in translation.

Translation Markup Language (TML)
TML, or something like it, is a microformat that can be embedded within documents wherever additional information is required to disambiguate the meaning of a phrase, or to provide additional context, style-guide hints or glossary entries. Microformats provide a lightweight way to embed semantic information in a web document. They are already in widespread use, and do not break backward compatibility with existing browsers. Microformats are appealing because they are simply ignored by applications that don't
understand them. In the example below, TML is used to embed a comment to explain the usage of the word "pipeline" so that it is not interpreted literally. A user viewing this in a standard web browser would not see the remarks, while a user viewing this with a TML-aware application would see the instructions.

Embedding Comments About Context Using TML
A brief comment about the context or meaning of a phrase or sentence is often all that's needed to assist the translator in reaching a correct interpretation. TML makes it easy to embed otherwise hidden annotations that are only visible to people using translation-aware browsers, editing tools, etc.
Example of TML being used to embed comments about context for translation:

<p>The company's sales <span class="TML">pipeline<span class="comment" style="display:none">In this context, pipeline refers to potential customers, not a literal pipeline</span></span> is nearly full.</p>

Synonyms
The class "synonyms" is used to attach a list of synonyms to a word or phrase. This can be used by both human and machine translators (this approach will enable machine translation engines to automatically determine the correct meaning of a word whose meaning might otherwise not be obvious). Example of TML synonyms:

<p>In <span class="tml">rough<span class="synonyms" style="display:none">approximate</span></span> numbers, there were 100 people at the conference.</p>

Glossary References
If a phrase belongs to a translation glossary, we can use TML to explicitly reference the glossary entry, as shown in the example below:

<p>Insightly is the leading <span class="tml">CRM<span class="glossary" style="display:none">CRM: Customer Relationship Management <a href="https://companyxyz.translationglossary.com/term/crm">more information</a></span></span> service for Google Apps.</p>

Here we embed both basic information from the glossary entry and a hyperlink to more information. As with other TML tags, this is invisible to ordinary users and only visible to people using TML-aware tools.

Implications For Web Authoring Tools
Adding support for TML, both to authoring tools and translation tools, will require minimal effort thanks to the simplicity of the microformat pattern. For authoring tools, the basic goal is to encourage authors to provide additional information whenever there is uncertainty about the usage or meaning of a phrase.
Most authoring tools now have fairly sophisticated grammar and spellchecking features and built-in dictionaries. For these tools, it will be pretty easy to add a pop-up dialog that is triggered
whenever the user types a word or phrase whose meaning is ambiguous, has multiple meanings, etc. When this occurs, the author would see a pop-up that asks for a list of synonyms, glossary entry, and optional freeform comment. If the author enters any of these, it would insert TML as shown in the examples above. This is a trivial modification to make, so this functionality could easily be added to a wide variety of authoring tools if the microformat is adopted.
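As a hedged illustration of such a trigger – the threshold and the use of WordNet sense counts are my assumptions, not part of the proposal – an authoring tool could flag a word as ambiguous and wrap it in the TML synonyms microformat roughly like this:

# Illustrative helper: wrap words with many WordNet senses in TML.
from nltk.corpus import wordnet as wn  # pip install nltk; nltk.download("wordnet")

def tml_wrap_if_ambiguous(word: str, chosen_synonym: str, min_senses: int = 4) -> str:
    if len(wn.synsets(word)) < min_senses:
        return word  # unambiguous enough: leave the text untouched
    hidden = f'<span class="synonyms" style="display:none">{chosen_synonym}</span>'
    return f'<span class="tml">{word}{hidden}</span>'

# In a real editor the author would pick the synonym from a pop-up;
# here it is hard-coded for the "rough numbers" example above.
print(tml_wrap_if_ambiguous("rough", "approximate"))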
More importantly, because the author will be encouraged to embed information in the document as he/she is writing it, the translator will typically have much better information to work from when composing or post-editing translations. Currently this information has to be obtained offline, typically via back-and-forth email conversations. Machine translation engines will also be able to use the synonyms element to better guess the intended meaning of a word or phrase, which may be especially useful for rule-based translation engines when they encounter ambiguous words or phrases.

Implications for TAUS
TML, or a microformat like it, is an example of where TAUS could play a leading role. As a microformat it doesn't require large changes to the existing web toolchain, just small changes to the tools that need to be aware of it. With its relationships throughout the IT industry, TAUS is well positioned to bring a microformat like this to fruition.
Implications For Translation Tools
It will likewise be easy to add TML support to translation editing tools, which simply need to look for <span class="TML">...</span> segments and then extract the hidden information within these regions. This is straightforward HTML/XML parsing, and very easy to add.
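A minimal sketch of that extraction step, assuming Python and BeautifulSoup rather than any particular translation tool, could look like this:

# Find <span class="TML"> regions and pull out the hidden annotations
# (comment / synonyms / glossary) for display in a translation editor.
from bs4 import BeautifulSoup  # pip install beautifulsoup4

HTML = ('<p>The company\'s sales <span class="TML">pipeline'
        '<span class="comment" style="display:none">In this context, pipeline '
        'refers to potential customers, not a literal pipeline</span></span> '
        'is nearly full.</p>')

def extract_tml(html: str):
    soup = BeautifulSoup(html, "html.parser")
    notes = []
    # Accept both the "TML" and "tml" class spellings used in the examples above.
    for span in soup.find_all("span", class_=["TML", "tml"]):
        visible_word = span.find(string=True)  # e.g. "pipeline"
        for hidden in span.find_all("span", class_=["comment", "synonyms", "glossary"]):
            notes.append((visible_word, hidden["class"][0],
                          hidden.get_text(" ", strip=True)))
    return notes

for word, kind, note in extract_tml(HTML):
    print(f"{word} [{kind}]: {note}")
# pipeline [comment]: In this context, pipeline refers to potential customers, not a literal pipeline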
Review of language business & technologies in Europe by Andrew Joscelyne
BabelNet – How the World Can Help Disambiguate Words
In a landmark article well over a decade ago, the US psycholinguist George Miller (initiator of WordNet, the well-known computational thesaurus of English) showed how a fairly banal couplet from the American poet Robert Frost's poem Stopping by Woods on a Snowy Evening that goes:

But I have promises to keep
And miles to go before I sleep

can cumulatively generate a total of 3,616,013,016,000 possible compound meanings if each individual word's different dictionary meanings are broken out and aligned. This way semantic madness lies – at least for a computer. Luckily the outcome of a European project called BabelNet is now making it much easier to think through the classic problem of word ambiguity for the translation industry and others. Let's look at how we got there and what BabelNet can offer by way of a solution.

The Frost poem, in this context, can be conceived as a vast word cloud in which each of the thirteen words interlinks with all possible combinations, generating an average of 9.247 meanings per word. This discovery is of course partly an artefact of the very process of using computers to handle human language. Most humans would probably never have thought of this as a problem in a pre-digital world.

If you add a little augmented reality to this cloud picture by overlaying the words with grammatical information about their parts of speech – to distinguish, for example, 'but' as a conjunction from 'but' as a verb, as in "don't but in" – you can radically reduce the rate of ambiguity to a geometric mean of 2.026 senses per word.

Luckily humans have special access to knowledge about contexts that computers don't have. They also know that we (poets especially) can make words work harder for us by packing in two or three meanings at a time. Remember Lewis Carroll's Humpty Dumpty in Alice through the Looking Glass, who cunningly said, "When I make a word do a lot of work like that, I pay it extra."

Imagine now that you want to translate the phrase into another language. You would, according to the computing scenario, have to choose fairly systematically between the various meanings of each (full) word to find an equivalent in the target language.

The fact is the human brain can engender some 10^100 concepts (more than the total number of particles in the universe), but we only know about 10^6 words. So concepts are forced to share word containers in order to be practically communicable by a
carbon biological system with power, memory and other constraints. This word disambiguation conundrum inevitably frustrated the pioneers of Natural Language Processing 50 years ago. Yehoshua Bar-Hillel, one of the founders of the discipline, famously claimed that he could not see how a computer could be brought to automatically 'understand' the difference between the two rather dreary English expressions "the box is in the pen" and "the pen is in the box". In other words, how could you program a machine to distinguish pen meaning 'enclosure' (first phrase) from pen meaning 'writing instrument'? This resounding academic doubt about the linguistic capacity of computers to access knowledge to disambiguate word meanings has been cited as one of the main reasons for the US government turning off its funding tap for machine translation back in the mid-1960s. So if WordNet can now give us information about the sets of synonyms, antonyms, hyponyms etc. associated with a lexical item in our linguistic repertoire, how far have we come in being able to identify automatically the world-knowledge contexts in which a given synonym/word meaning is appropriate?
The best place to find out today is Rome, where Roberto Navigli from the city's Sapienza University has been working on the BabelNet project for several years now. BabelNet is an online multilingual semantic dictionary and lexical database that provides a powerful resource, via an API, for anyone doing natural language processing, from translation to text analytics and more. It is poised to make a major impact as a multilingual resource for the digital agenda because it has the virtue of combining two crucial components: world knowledge and lexical information. In a nutshell, a seamless merge of WordNet, Wikipedia, Wiktionary and other electronic dictionaries.
More generally, BabelNet puts into practice the strategy that Tim Berners-Lee propounded in 2009 of linked open semantic data as the underlying architecture of the next Web. After first showing us how to link documents, Berners-Lee has long been pleading for a level of linked meanings, not texts, by exploiting various types of openly-accessible linguistic data. But, in addition to the merging of the knowledge about ‘word’ meanings and the knowledge about ‘world’ meanings, BabelNet is – as its name suggests – also dedicated to ensuring that this linkage is multilingual. This therefore constitutes a remarkable step forward towards the Human Language Project so dear to TAUS.
BabelNet began life as a project called MultiJEDI, backed by a multimillion European Commission grant in 2011. Navigli's principal concern at the time was to find effective, wide-coverage and powerful ways of representing lexical-semantic knowledge in appropriate formats and then use it in NLP applications.
He soon realized that one “over-ambitious” consequence was that the project needed to move from a monolingual to multilingual approach, by building on the WordNet model of linguistic principles but transposing it to a multilingual setting, whereby concepts are understood as sets of synonyms in different languages.
The next challenge was to represent WordNet multilingually and then create the huge network of multilingual lexicalizations connected to each other and to the meanings and distinctions so that lexicographic sources (words) could link to encyclopedic resources (named entity mentions). If the lexical knowledge issue could be solved, then lots of applications could be enabled in as many languages as possible. As Wikipedia was constructed as sets of equivalent pages linked in different languages, the idea was to interconnect all these language versions. The trouble is, Navigli realized, you can’t just make translations from Wikipedia in the form of named entities. What you also needed was translations over abstract concepts that are not
actually in the encyclopedia already. So he had the idea of applying statistical machine translation to what they call "sense-annotated" corpora – i.e. corpora whose words are associated with explicit meanings derived from WordNet or Wikipedia. This trick helped increase the translations of abstract concepts. So now BabelNet had a way of linking dictionary and encyclopedic knowledge such that when you process a sentence automatically you don't need to worry about word or world knowledge separately: you just get the appropriate meaning. The word piano in a music context is not the same as Piano in an architectural context (Renzo Piano).

The original purpose of BabelNet, after the fairly complex automatic process of linking word meanings with entity identities was completed, was to support a typical NLP task – e.g. Word Sense Disambiguation – over multiple languages. It can handle any amount of data and create multilingual or even language-agnostic systems that can be applied to text analytics, or to searching large text databases semantically. In the case of search, BabelNet ought to return the "best" results in context, not the most obvious or typically crowdsourced results.
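For developers who want to experiment, a hypothetical sketch of a BabelNet API call is shown below; the endpoint paths, parameter names and version are assumptions based on the public documentation at babelnet.org and may differ, and a registered API key is required.

# Hypothetical sketch, not official BabelNet client code: list the synsets of
# an ambiguous word and print their glosses. Endpoint shape is an assumption.
import requests

BASE = "https://babelnet.io/v9"   # assumed version path
API_KEY = "YOUR_BABELNET_KEY"     # placeholder

def synset_ids(lemma: str, lang: str = "EN"):
    r = requests.get(f"{BASE}/getSynsetIds",
                     params={"lemma": lemma, "searchLang": lang, "key": API_KEY})
    r.raise_for_status()
    return [entry["id"] for entry in r.json()]

def glosses(synset_id: str, langs=("EN", "IT")):
    r = requests.get(f"{BASE}/getSynset",
                     params={"id": synset_id, "targetLang": list(langs), "key": API_KEY})
    r.raise_for_status()
    return [g["gloss"] for g in r.json().get("glosses", [])]

if __name__ == "__main__":
    for sid in synset_ids("piano")[:5]:
        print(sid, glosses(sid)[:1])  # e.g. the instrument vs. Renzo Piano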
Improving machine translation is another key application. Although we are aware of the “unreasonable effectiveness of data”, there are frequent cases of data scarcity, especially in languages other than English, and recourse to the semantic resources of BabelNet should be able to help out. In the emerging area of multilingual text analytics, especially in Europe with its geographical language silos, connecting analytics across languages, even with translations, will almost inevitably confront the problem of similar ideas being expressed in different ways or different words. In such cases, some form of synonym linking across languages will be needed to capture the facts about named entities and smooth the quality of the translation.
So who is using BabelNet these days? Navigli says that computer-assisted translation is probably the biggest use case. People can expand their translation memories by using BabelNet to deliver confidence scores for each translation, for example.

Even more importantly, for each translation where you know which concepts are involved, you can provide a term in any language and the system gives you back an option with translations, definitions, relations, and even pictures for millions of concepts. It has also been suggested (by Andrzej Zydron of XTM Intl. among others) that BabelNet can be used to create a rapid document categorizer by automatically generating a wordlist of the contents (words plus translations), which then provides a quick semantic fingerprint of the document and its domain, plus a checklist of multilingual versions.

Roberto Navigli is most of all impressed by the system's ability to process and connect text across almost any language. And this apparently means dozens and dozens of languages today, from Abkhazian to Zulu. Once you link your text string, he says, to a node in a semantic network such as WordNet and Wikipedia, you move up to a new level of insight and open up a whole new world for your process.

For the future, the key for the resource will be to keep it open for research purposes, with user companies giving funds to continue the development. This means that the language technology industry is finally able to leverage the power of linked linguistic data and linked encyclopedic knowledge for its own purposes. Let's hope it will prompt some interesting disruptive innovation in Europe and elsewhere.
The Translator’s Perspective by Jost Zetzsche
Data on Data
I'm a historian by training, so I'm naturally interested in historical developments in the world of translation – especially in relation to translation technology and the employment of that technology. It's a bit of a trend these days, but historians have always been eager to find data on which they can base their findings. That's why I found it disappointing to realize that it's not easy to come by data on the topic of this column: data. Particularly difficult to find: data on how translators use data.
Why is it so difficult to get to that data? The easiest answer would be that "translation" is so diverse that the requirements and their solutions are by no means homogeneous. (You might have noticed that I referred to the "world of translation" in the first paragraph, an unwieldy term that still seems more suitable than "translation industry" to represent the multiplicity of diverse groups that work with translation.) Another answer might be that most translators who reluctantly adopted technology at first have finally embraced it now, but their gradual and ultimately static adoption has led to a broad spectrum of the kinds of technology in use, with correspondingly different ways of using data.
Yet there are some good indications that overall the use of data has changed in the last five to eight years. Where do I find those indicators? In the development of technology itself.
Aside from glossaries and dictionaries, the relevance of external electronic data entered the translator's consciousness shortly after translation environment tools (CAT tools) became widely available in the late 1990s. Translation memory and termbase data were two of the main pillars of that technology. Many translators accepted a "bigger-is-better" approach, especially to the one translation memory that held all the data they could get their hands on (aka the "Big Mama TM"). And for good reason. The nature of the translation environment of that time meant that it didn't really hurt to have useless data in the TM, because the chance that any given translation unit would be used as a perfect or fuzzy match was really not particularly likely. Everybody agreed that there was a potential GIGO (garbage-in-garbage-out) problem – that recurring poor translations had a detrimental effect on the translation product – but as long as the translator controlled the translation quality by steering the flow of data into the TM one
translation unit at a time, it wasn't necessarily a problem.

Yes, there was a different terminology and register for different clients and subject matters, but that could be controlled with an adequate setup (TM metadata, strong client-specific termbases, etc.). There also were some external resources such as the famed Microsoft "glossaries" – which really were translation memories – and many translators imported these into the Big Mama TM. But even those were typically judged of good enough quality not to spoil the overall outcome too much.

For years as I trained other translators in the use of translation technology, I told the (true) story of that one several-thousand-word software resource file that I had translated early in my career. It sat dormant in my TM until years later when I translated a completely unrelated project that happened to contain the exact same file. That was a good and profitable day, and I encouraged my students to build up similar resources.

I haven't told that story for a good while now. Why? Because new technology has made such a process obsolete.

Eight years ago at an early TAUS meeting in Taos, New Mexico, there was a strong call for "advanced leveraging," i.e., the idea of drilling down more deeply into the content of translation memories by analyzing and extracting partial as well as complete segments. At that point, only some Canadian bi-text tools had anything close to that kind of functionality. Today, however, most translation environment tools offer this feature, typically called "subsegment leveraging." I mention this because it had a tremendous impact on the way translation memories work. If translators make extensive use of the subsegmenting feature, suddenly all the translation memory content becomes relevant. The 0.x% chance that any given translation unit was ever going to be reused as a full match suddenly became exponentially higher by extracting specific parts of it. As a result, the TM's requirements in regard to quality and specificity in the areas of subject matter and client preference also rose exponentially.

And just like that, data went from "bigger-is-better" to "targeted-is-better" – at least for some translators. Others may realize only later how to harness the new technology in a more efficient manner. As I said above, the world of translation is multifaceted, so all of us use technology differently. In addition, there are other fundamental changes in data concepts.
On the one hand, in corporate server-based workflows, translation memory data only reaches the translator as it applies to the current segment being translated. Very often these workflows prevent the translator from adding her own translation memory data to the mix. In those cases, the translator neither “gains” any data nor directly uses any of “her own data.” On the other side of the spectrum, translators who work in situations where there is very little if any pressure for translation memory-driven discounts may prefer to use mono- or bilingual reference corpora rather than traditional translation memories, data of a very different kind.
And then there is of course data-driven statistical machine translation.
While there are always exceptions, it’s probably safe to say that the vast majority of translators have not actively embraced the possibility of building their own translation engines to use in their translation processes. The reasons for that? There are many: a perceived lack of available data, too little technical expertise to tackle something like Moses themselves, very few viable commercial offerings, and insecurity about the legal considerations when using sources like Microsoft’s Translator Hub. Most importantly, however: no real vision for how that additional technology would justify the effort and cost in building such an engine. We may have to wait until the data catches up with our data to take the final data-driven plunge.
The Research Perspective by Luigi Muzii
Breakthroughs from Research
In the last few years, the Internet and data have been the engine of change, affecting global communications in every area, including the translation industry.

Big data and the IoT
A few weeks ago, at the International Consumer Electronics Show in Las Vegas, 2015 was designated the year of connected devices: from toothbrushes that can schedule check-ups with dentists, to yoga mats that can analyze āsanas in real time, to collar-mounted trackers helping owners locate their runaway pets. This is the Internet of Things (IoT) – everyday objects with integrated network connectivity – which Gartner predicts will number over 25 billion by 2020.
These devices will be producing exabytes of data every day, and real-time processing, analysis, and leveraging are becoming a capability requirement. Right now, big data is central to many areas because of the unparalleled amount of data produced every day. Most research projects require a massive data-crunching and machine-learning approach. Recently, the American Association for the Advancement of Science identified a poor fit in traditional university career paths for the experts needed to build the tools to analyze the vast amounts of data now abundant in every field. Big data experts are already sought after by industry and needed in academia, e.g. to process gene sequences or cosmological data. Achievements in statistical machine translation are also due to a change in paradigm made possible by the availability of an unmatched amount of language data.
[Photo caption] The IBM Model 350 Disk File (1956): over a ton, fifty 24-inch disks, $3,200 monthly lease, 3.75 MB of storage capacity.
Kenneth Cukier, co-author of Big Data: A Revolution That Will Transform How We Live, Work, and Think, explained this brilliantly in a TED talk last June.
Computer scientists changed the nature of the problem from trying to explain to the machine how to translate to instructing it to figure out what a translation is from a huge amount of data around it. They call it machine learning. With the emergence of the IoT, data is changing status, from static to dynamic, and is being leveraged for uses never imagined when it was collected, with translation becoming ubiquitous and more and more a big data issue.

The case against machine translation
As Thomas H. Davenport suggests in the Wall Street Journal, the value added to data by analytics comes from the human brain's systematic fallacy in appraising it. Computers can see more in data and do things that humans cannot do, especially with small data. Fundamentally, the hostility against machine translation comes from the anthropomorphization of computers, which makes people assume that computers process data just like humans do, albeit more quickly and more accurately. In reality, a big data system looks at a historical set of behavioral data and statistically infers a probable behavior under similar circumstances. The same happens with statistical machine translation.
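A toy illustration of that statistical inference – mine, not the columnist's, and far simpler than any real system – counts how often a source word co-occurs with each target word in a few aligned sentence pairs and picks the most strongly associated one; real statistical MT adds alignment models, phrase tables and a language model on top of this idea.

# Toy word-translation-by-counting over a tiny, made-up aligned corpus.
from collections import Counter, defaultdict

corpus = [
    ("the house is small", "das haus ist klein"),
    ("the house is big", "das haus ist gross"),
    ("the book is small", "das buch ist klein"),
]

pair_counts = defaultdict(Counter)       # co-occurrence counts per source word
src_counts, tgt_counts = Counter(), Counter()
for src, tgt in corpus:
    s_words, t_words = src.split(), tgt.split()
    src_counts.update(s_words)
    tgt_counts.update(t_words)
    for s in s_words:
        pair_counts[s].update(t_words)

def most_likely_translation(word: str) -> str:
    # Dice-style association: rewards words that appear together and rarely apart,
    # so frequent fillers like "das"/"ist" do not win just by being everywhere.
    def score(t):
        return 2 * pair_counts[word][t] / (src_counts[word] + tgt_counts[t])
    return max(pair_counts[word], key=score)

print(most_likely_translation("house"))  # -> "haus"
print(most_likely_translation("book"))   # -> "buch"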
Many people worry that increasingly smarter machines will disrupt the labor market and threaten humans. Even some technology optimists, such as Vinod Khosla, say that as computers and robots become more proficient at everything, skilled jobs will quickly vanish. Others, like Marc Andreessen, think those worries are nonsense, since technological
advances have always improved productivity and created new jobs. Andreessen also considers machine learning a trend to watch in 2015. The evolution in artificial intelligence and machine learning consists of deep neural networks (DNNs), biologically inspired computing paradigms designed like the human brain, enabling computers to learn through observation. At the beginning of the last decade, building DNN-based systems proved hard, and many researchers turned to other solutions with more near-term promise. Now, thanks to big data, new DNN-based models can learn as they go and build larger and more complex bodies of knowledge from and about the dataset they are trained on. Machine translation is a promising research field for the application of DNNs.
The uberization of work
According to Farhad Manjoo, new technologies have the potential to chop up a broad array of traditional jobs into discrete tasks to be assigned to people when needed. Wages could be set by a dynamic measurement of supply and demand, and a worker's performance could be tracked and held up to the light of customer satisfaction. Manjoo calls this the uberization of work, after what Uber is doing for taxis, with the key perks of an Uber job being flexibility, working 1-15 hours a week, and easy additional income. Also, Uber drivers do not
require any particular ability other than driving. This is exactly what has been happening for decades in the localization industry, where freelancers have been experiencing this kind of 'novelty', called moonlighting. On the other hand, in The Internet is Not the Answer, Andrew Keen uses Uber as an example of the exploitation of the openness of the Internet to take control of existing industries. The prevalence of on-demand jobs in the immediate future is the real novelty, while for the localization industry uberization could represent disintermediation.

The invisibility of translation
Ray Kurzweil predicted many of the most important innovations of the last twenty years. His predictions for the next 25 years could seem mind-boggling, but also obvious. Four years ago, Kurzweil predicted that spoken language translation would be common by the year 2019, and that machines would reach human levels of translation quality by the year 2029.
Kurzweil's predictions also seem realistic because we have been in the second half of the chessboard for a few years now, and the cost of storing and analyzing ever-vaster amounts of information keeps steadily decreasing. Big data and the IoT will make translation even more central than in the past, but definitely invisible.
The problem with exponential growth is that, contrary to previous technological revolutions, the shift has been happening too fast to provide new opportunities to successive generations of workers. Only a quarter of a century ago, young people could comfortably plan their future on a five-year span, and retraining for displaced workers was a viable solution. Today retraining is a viable solution only if it is quick enough, and no competence can be disjoint from data and its
manipulation. This is true even for the language industry, if language is just a technology, as Mark Changizi, Director of Human Cognition at 2AI Lab, suggested. The pervasiveness and centrality of translation in the age of the IoT and big data shall lead to applied knowledge, with the ability to produce, use and manipulate data being essential.

Quantity may be quality
Today, as General Electric CEO Jeff Immelt has noticed and the success of Douglas Hubbard's book How to Measure Anything testifies, almost everything can be measured, making the physical and the analytical worlds no longer separate. Often we do not notice the big data aspect of our daily encounters with technology. And yet, despite the often comical renditions, the autocorrect feature in many a device is not only helpful, it is also daunting: it is the result of an infinite number of combinations, a matter of big data.
The task is challenging and tricky, as frequent errors in data can become viral, but big data and machine learning will improve context-based functions for an improved experience across (connected) devices.

When new technologies make bold promises, discerning the hype from what's commercially viable is a problem. As Alon Halevy, Peter Norvig, and Fernando Pereira from Google noticed six years ago, the promise of gaining more insights from the more data collected could be labeled "the unreasonable effectiveness of data." In reality, the problem is one of validity: with so much data and so many different tools to analyze it, how can one be sure results are correct? Good statistical modeling requires stable input, at least a few cycles of historical data, and a predicted range of outcomes. Following Gartner's Hype Cycle, big data may have just crested the wave of inflated expectations and be barreling towards the trough of disillusionment, but this means it could be approaching the maturity stage, when technologies recover to reach a plateau of productivity. In the end, the question is still the same: whether (translation) machines will render the work of (trained) humans obsolete. Wait: whether or when?
Quality Evaluation Summit, Dublin, Ireland, 28 May 2015
The Language Perspective
by Nicholas Ostler
Language Data Faces the Philosophers
Modern language technology is mostly based – one way or another – on “big data”. Each language has been around for a long time, and some of them over a very wide range of user communities. Modern digital techniques of storage mean that it is easy to bring together the record of a language’s past use, and look for patterns in it.
These patterns may recollect the effects of grammatical rules, the traditional way to understand language structure; but they may reveal other regularities too, and in a downright, “in your face” way. At last, it feels, we have some objective evidence of language structure. We have no choice but to accept the repeating collocations, substitution equivalences, and gaps that emerge. And, reassuringly for those disappointed by twentieth-century linguistics, there is no need for appeal to the personal intuitions of native speakers – or worse still, of theoretical linguists themselves. The revulsion from cognitive methods or elicited text has been so great that some have come to believe in the “unreasonable effectiveness of data”, seeing computer-accessible databases of language records as analogous to the role of mathematics in natural sciences. Electronic pattern recognition, it seems, gives us the means to find structure in the vast hinterland of a language’s back catalogue.
The impression is beginning to creep in that computational techniques can beat humanity at its own game, namely the correct and meaningful deployment of human language. Oh, and if performance is still a little substandard,
as native speakers judge: that is only because the exceptions are too rare, or conditioned at too long a distance, for their causes to show up as yet, in the amount of data that has been collected, and the processing to which they are submitted. If billions of words do not suffice, just wait until the trillions… or the decillions come into play. Have patience, therefore, until the triumph of Moore’s Law and pattern-matching is fulfilled. In a way, this apparent effectiveness of past data in revealing patterns is unsurprising. How, after all, do people learn their languages but through being exposed to others’ usage? Mostly they become comprehending listeners, and active users, without any attempt by parents or teachers actually to instruct them.
Nevertheless, there is something unsatisfying in the radical assumption that all the information needed is in the past record. Language, in use, is not just a matter of recalling what
one has heard. It is a productive skill. We understand new messages by re-configuring memories, actively putting together fragments of past language experience. We produce new utterances likewise. Somehow we are able, not just to mimic past utterances, but to innovate. We apply patterns actively as rules. We speculate about the limits of what is possible, and then we go ahead and explore it. Clearly, memory – fed by teachers who enlarge our experience of past practice – plays a large part in what we call culture, and education. Much of our formal learning is made up essentially of repetitions: nursery rhymes, songs and poems are learnt as complete structures to be repeated, and so are quotations, even large-scale recitations. If we think about the learning of languages, this is the kind of contribution made by dictionaries
and phrase-books. But there are also items which are learnt not for use ready-made, but rather as abstract recipes, patterns which can be applied either to organize other items, or to indicate their role in larger structures. These range from phonotactic principles for the structure of words (e.g. in English, the sound written ng can only occur to end a syllable) and grammar rules (e.g., inverting subject and main verb can indicate a question) to systems of morphology, such as the principles of conjugation and declension in Latin: and even rhymes predicting the gender of nouns: “To nouns that cannot be declined | The neuter gender is assigned…” It is principles like these which may be applied dynamically to produce more of a language. Although there will be such principles in a dead language (such as Old English), they will only exist historically. The corpus of a dead language is now closed – unless it should be revived. But by definition, a living language is open-ended. Its principles apply productively, even innovatively, and are known (usually only implicitly) to all those who are competent in the language. The fault in using language data as a system to implicitly define a language is that it cannot tell which principles are dynamic: and so it misses the distinction between a dead and a living language. At any one time, a corpus contains just the sentences which it does: and so it might as well be representative of a language that will never have any more data.
Of course, it is possible to derive rules, statistically or stochastically, which will be compatible with the set of sentences in a corpus. Such rules might be taken as a simple substitute
for a grammar of the language. These may, or may not, correspond to dynamic principles used by speakers in actually using the language. But as pointed out by Wittgenstein, a series of items, however long, does not determine the choice among the possible rules that may have generated the series; and as pointed out by Quine (in his Thesis of the Indeterminacy of Translation), an equivalence between sentences in two languages, however extended, will never fully determine the principles needed to interpret one language in terms of another. Hence, however useful and practical language data, and rules derived from them, may be as approximations of a living language, they can never ultimately pin it down.
They will not generate rules to interpret sentences (as an actual user of a language must); nor will they produce a theory of rhetoric – of how to get effects with the language. They cannot progress from rules for the incidence of words to the properties of a semantic model: the picture of the world that the language user has in mind. Techniques for the expression of anything outside language, or for communication between language users, remain a mystery.
References:
Kripke, Saul A., Wittgenstein on Rules and Private Language, Harvard University Press, 1982.
Quine, Willard V.O., Word and Object, MIT Press, 1960.
Every other month we host a party at our office on Keizersgracht 74, Amsterdam. Join us on May 21 for a potluck dinner and a pool competition.
The Journalist’s Perspective by Lane Greene
Everything you ever wanted to know about Google Translate, and finally got the chance to ask
Google Translate is the world’s best-known free tool for machine translation. It is made possible by Google’s huge trove of data, and the statistical techniques that match n-grams from one language with plausible n-grams in another. For an outsider to the translation industry like me, Translate seemed to represent a great leap forward in translation quality when it was first introduced. However, since then, its quality improvements seem more incremental, when they are visible at all. How did Google Translate get so good? And how can it avoid plateauing in quality, and get better still?
One of the bright sides of being a journalist is that when you have questions like this, you can just call the people who know the most and ask them. Google’s press team responded to my email with an offer to talk to Macduff Hughes, the engineering director for Google Translate. First, where did Google get all of its data? It crawls and saves text from about a trillion web pages. But how does it know what is human-translated text to run its statistical learning algorithms on? I had thought that perhaps humans cull and code the texts to be fed into the engine. But Hughes explained that the search engine simply looks for pages that look like they might be translations of one another. Perhaps they have identical domains, only one ends in /en and another ends in /fr. Perhaps they have proper names or identical numbers in the same position. The software does not weight a pairing as more or less likely to be a translation—it is an either-or binary decision, in or out. How did it get so good? The initial leap in quality came from sheer mass. A 2009 paper by three Google researchers responded to the “physics
envy” that students of human phenomena feel. A classic 1960 paper had been titled “The Unreasonable Effectiveness of Mathematics in the Natural Sciences”, extolling the power of formulae like f=ma. Linguistics has no such formula. But the Google researchers retorted by calling their 2009 paper “The Unreasonable Effectiveness of Data.”
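For readers who like to see the idea in code, here is a minimal sketch of the kind of parallel-page pairing Hughes describes above. The URL patterns, field names and signals are invented for illustration; Google's actual crawler logic is, of course, not public.

# A minimal sketch of a parallel-page pairing heuristic of the kind described
# above. The URL patterns and signals are illustrative assumptions only.
import re

def looks_like_translation_pair(page_a, page_b):
    """Binary in-or-out decision: do two crawled pages look like mutual translations?

    Each page is a dict with 'url' and 'text' keys (an assumed, simplified format).
    """
    # Signal 1: identical URLs apart from a language segment such as /en vs /fr.
    stem_a = re.sub(r"/(en|fr|de|ja)(/|$)", "/", page_a["url"])
    stem_b = re.sub(r"/(en|fr|de|ja)(/|$)", "/", page_b["url"])
    same_stem = stem_a == stem_b and page_a["url"] != page_b["url"]

    # Signal 2: the same numbers appearing in both texts, in the same order.
    numbers_a = re.findall(r"\d+", page_a["text"])
    numbers_b = re.findall(r"\d+", page_b["text"])
    same_numbers = len(numbers_a) > 0 and numbers_a == numbers_b

    # Either signal is enough; there is no graded weighting, just in or out.
    return same_stem or same_numbers

# Example: two pages that differ only in the language segment of the URL.
en = {"url": "https://example.com/en/product-42", "text": "Model 42 costs 99 euros."}
fr = {"url": "https://example.com/fr/product-42", "text": "Le modèle 42 coûte 99 euros."}
print(looks_like_translation_pair(en, fr))  # True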
The Google view is that a simple approach over a huge trove of data is better than a clever approach over limited data. With so much data, errors will, it is hoped, cancel each other out in the enormous aggregate. In addition to all that unmarked, untagged, messy data, Google does get some specialty data from professional translators: the European Patent Office shares data
with Google, for example, though Hughes says that this EPO data (despite its high quality) does not currently have any special weight in the public-facing Google Translate. He notes, sensibly enough, that many people use Google Translate for slangy or spoken-language purposes, for which giving too much weight to the kind of language in a patent application would be less than ideal. But even Google has limits on what enormous amounts of data can do. There are thousands of potential language pairings across the several dozen languages Google Translate offers. But for the vast majority of those pairings (Finnish-Zulu, say), there is little or no training text available, even on a trillion web pages. So the user hoping to translate Finnish to Zulu on Google Translate will be going through a “bridging” language, almost certainly English. This of course magnifies the possibilities for error. Asya Pereltsvaig, who teaches linguistics at Stanford, caught Google Translate translating a Russian nursery rhyme with “two happy geese” into French and getting deux oies gay—two homosexual geese. The culprit was, of course, the double meaning of “gay” in English, the bridging language between Russian and French. This leads to another problem. Pereltsvaig has translated this phrase with Google Translate, however badly. The dud translation now lives on the web, where it will be crawled by Google—and could be fed back into Google Translate. What if the service is, to put it crudely, consuming its own waste? Hughes acknowledges the problem frankly. Google has tried electronically “watermarking” its translations so the crawler will recognize them and try to avoid feeding mistakes back into the system as input. And then there are web pages that simply have the same text in—suspiciously—all of the languages Google Translate offers. The system can guess that these were translated by Google and avoid feeding them back into the system.
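The bridging problem is easy to picture in code. The toy dictionaries below are invented for the example (they are not Google Translate data), but they show how every sense of the English bridge word comes along for the ride.

# A toy sketch of pivot ("bridging") translation and the kind of error it can
# introduce. The dictionaries are invented placeholders for illustration only.

SRC_TO_EN = {"happy_src": "gay"}   # source word whose closest English gloss is the older sense of "gay"
EN_TO_TGT = {"gay": "gay_tgt"}     # the target language reads "gay" mainly in its modern sense

def pivot_translate(word, src_to_en, en_to_tgt):
    """Source -> English -> target: the ambiguity of the bridge word is inherited."""
    bridge = src_to_en[word]       # sense information is lost at this step
    return en_to_tgt[bridge]

print(pivot_translate("happy_src", SRC_TO_EN, EN_TO_TGT))  # 'gay_tgt': "merry" has become "homosexual"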
Would more data help an organization that already has so much? Would ten trillion pages be noticeably better than one trillion? Hughes is again frank: for the most common language pairings, “we have reached about the limit where more data is helpful.” His efforts have now turned to making Google Translate smarter, playing with rule-based improvements to see if they improve quality. In other words, if Google Translate’s first great leap forward came from huge data and computing power, then for big languages, at least, its next leap forward will rely more on clever software engineering. For example, automatic parsing can improve word order in translations.
And he mentions neural networks as a particularly exciting avenue for research—this, after all, has been particularly helpful in Google’s speech recognition. But there is another avenue: the great software company is asking good old-fashioned human users to chip in their expertise. If you are a frequent user of Google Translate, you will probably have noticed “Help Improve Google Translate” at the bottom of the page. These user-driven efforts pack a particularly heavy punch for those languages for which data is sparse and users are keen volunteers. A titan of data like Google is smart enough to know the limits of data. Hughes hopes that some (undiscussed) radical breakthroughs might yet lead to a sudden leap forward in Google Translate’s quality. But even absent that, cycles of data gathering and incremental innovation should, it is hoped, gradually inch the needle of quality forward. And the wisdom of crowds—Google’s users—could inch it further.
Japan’s translation industry is feeling very Olympic today by Una Softic
Since Tokyo was elected as the Host City of the Olympic Games in September 2013, Japan has been undergoing a transformation in preparation for its leading role in the biggest international sporting event.
If we could call the ’64 Olympic Games an opportunity for Tokyo to reshape its infrastructure and become Kenzo Tange’s visionary metropolis, then this decade marks the transformation to a global capital, the axis of international communication and multilingual dialogue.
Japan is often cited for its “insular culture and mindset”, as not many people show interest in foreign cultures and languages. That is why Tokyo remains one of the rare metropolises in the world where “lost in translation” is more than a saying, a movie title, or a funny metaphor. For people who have visited Tokyo it soon becomes a way of life; the beauty and the pain of bare survival. Roughly 2% of Japan’s population is foreign (and this figure includes large numbers of permanent residents who have been there for generations). Therefore even in urban areas people are rarely exposed to the influence of other cultures and languages. Foreign visitors are often surprised by the lack of information and access to genuine content in their language, which prevents them from having a flawless experience visiting the land of the rising sun. The announcement of the Olympic city brought in a new breeze of awareness that it is time (and the right opportunity) to change that. While the drills are tearing up buildings in the Mita area to make way for the new “Olympic” highway, the government, telcos, tech companies, translators and the ladies preparing bento boxes (we’ll get to that later) drill their skills and
look for opportunities to connect with foreign visitors who don’t speak a word of Japanese. Innovation comes with eccentricity In a recent conversation with Mr. Hiroki Kawano and his team from Honyaku Center Inc., an interesting observation was made about the Japanese translation environment. Mr. Kawano, who is the Deputy General Manager of Honyaku Center, as well as Editor in Chief of the JTF’s (Japan Translation Federation) Japan Translation Journal, explained how Japanese people involved in the translation business possess a certain amount of eccentricity. Combining a Japanese commitment to quality and process with an unconventional interest in foreign cultures and overseas organizations’ workflows gives the local translation industry room for individuality and an obligation to respond to recent global trends. Mr. Kawano pointed out that it is an absolute necessity for strong Japanese translation companies to start utilizing machine translation and other technological solutions. Leaving no room for conservative methods that have been proven to work well so far, the Japanese translation industry continues to attract eccentric individuals ready to disrupt rigid, mainly sales-focused language service providers. Shifting the focus to engineering and tech and introducing automation, while the demand for instant, quality translation increases just before the Olympics, are in his opinion the priorities.
Endorsement of translation technology The Japanese government is endorsing these endeavors by investing in the deployment of multi-language machine translation systems in preparation for the 2020 Olympics. It is supporting the National Institute of Information and Communications Technology (NICT), expecting the release of a flawless multilingual communication infrastructure before the big sporting event. NICT already released a free network-based smartphone translation application, VoiceTra, in 2010. This application showcased its research results in automatic multilingual speech translation technologies: voice input, processed by a statistical speech recognition system, was transmitted to a multilingual speech translation server, which carried out the language translation and provided voice output using speech synthesis. This project served as an incentive for further development of the application and the establishment of the Universal Speech Translation Advanced Research Consortium (U-STAR, part of the Asian Speech Translation Advanced Research), currently comprising 30 institutes from 25 countries/regions. Their publicly released client application, VoiceTra4U, is the result of joint efforts of academia (with the University of Kyoto as the leading institution) and private companies, as well as an independent administrative agency under the Internal Affairs and Communications Ministry.
VoiceTra4U is further developing its solutions that support travel conversations and primarily target foreign visitors, domestic medical institutions, railway operators and retailers – the ones who will need flawless communication during the Tokyo Olympic Games most. NICT, as Japan’s leading natural language processing resource center, plays a great role in development for companies with client-focused products that bridge the Japanese communication gap.
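In outline, the VoiceTra-style flow described above is a three-stage pipeline: speech recognition, machine translation, speech synthesis. The sketch below uses placeholder functions purely to make the sequence concrete; NICT's actual services and interfaces are not reproduced here.

# A minimal sketch of a speech-to-speech translation pipeline of the kind
# described above. The function bodies are placeholders, not real NICT APIs.

def recognize_speech(audio_bytes: bytes, lang: str) -> str:
    """Placeholder for a statistical speech recognizer returning a transcript."""
    return "最寄りの駅はどこですか"  # assumed transcript, for the example only

def translate(text: str, src: str, tgt: str) -> str:
    """Placeholder for the server-side machine translation step."""
    return "Where is the nearest station?"  # assumed output, for the example only

def synthesize(text: str, lang: str) -> bytes:
    """Placeholder for text-to-speech synthesis returning audio."""
    return text.encode("utf-8")  # stand-in for real audio data

def speech_to_speech(audio_bytes: bytes, src: str, tgt: str) -> bytes:
    transcript = recognize_speech(audio_bytes, src)   # stage 1: recognition
    translated = translate(transcript, src, tgt)      # stage 2: translation
    return synthesize(translated, tgt)                # stage 3: synthesis

# A user speaks Japanese; a foreign visitor hears (placeholder) English audio.
output_audio = speech_to_speech(b"...mic capture...", src="ja", tgt="en")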
According to Dr. Ikuo Kitagishi, Manager of the Machine Translation Research Project at Yahoo! JAPAN, NICT’s available corpus and tools represent the most important basic step in any related process, especially for statistical machine translation, which requires a huge amount of parallel corpus material. He described the difficulties of Japanese machine translation: the lack of data and the technical challenges of natural language processing – tokenizers, named entity extractors and parsers – are amplified by shortages of talent and funding, which often result in dropped projects and re-strategizing. In addition to these obstacles, it can be difficult for companies to calculate the return on investment of Japanese MT research and development. Dr. Kitagishi also affirmed that Japanese MT is on the right track, with increasing governmental support and the active leadership of Dr. Isahara (Toyohashi University of Technology), Dr. Kurohashi (Kyoto University) and Dr. Tsujii (Tokyo University).
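One reason tokenization tops that list of challenges is that Japanese is written without spaces, so even finding word boundaries is itself a modelling problem. The sketch below uses the open-source Janome morphological analyzer as a stand-in (assuming the janome package is installed); the article does not say which tools Yahoo! JAPAN or NICT actually use.

# A minimal sketch of Japanese word segmentation: there are no spaces between
# words, so a dedicated morphological analyzer is needed before any MT step.
# Uses the open-source Janome analyzer (pip install janome), chosen here only
# as an illustration.
from janome.tokenizer import Tokenizer

tokenizer = Tokenizer()
sentence = "東京オリンピックの準備が進んでいます"  # "Preparations for the Tokyo Olympics are under way"

# Each token carries a surface form (the segmented word) and part-of-speech data.
tokens = [token.surface for token in tokenizer.tokenize(sentence)]
print(tokens)  # e.g. ['東京', 'オリンピック', 'の', '準備', 'が', '進ん', 'で', 'い', 'ます']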
The latest Japanese initiative providing high-performance voice and task recognition is Mirai Translate, which was formed in October 2014 as a joint venture of NTT DOCOMO, Systran
and FueTrek, utilizing technology from NICT and projects made in conjunction with NTT research labs. Dr. Minoru Etoh, the CEO & President of this high-performance engineering company, shared that if we had asked him about his view of MT a decade ago, he would have been skeptical. But nowadays, with the availability of excellent hybrid systems combining crowd-sourced human translation and MT, we are at the starting point of complete automation of our communication. Mirai Translate’s strong engineering team develops B2B engines for companies who can utilize the technology for B2C clients. He acknowledged the inevitable trade-off between quality and speed, which is improving daily and resetting industry standards. Real-time MT is on Mirai Translate’s roadmap, and they certainly expect results before the 2020 Olympics. About their Olympic incentive, Dr. Etoh elaborated: “Everybody talks about the Olympic Games. But I am skeptical about any ‘hype word’. The Olympics are a good motivation for everyone to contribute their best work, but we need to truly utilize this hype. Positive sentiments towards this event will bring additional motivation, but we do our best to provide great language solutions regardless of the hype.” Dr. Etoh also outlined the high demand for good-quality corpora on the Japanese translation market, as they aim to achieve the world’s highest level of accuracy for machine translation.
NTT DOCOMO currently provides its Jspeak Japanese translator app for both iOS and Android devices. It is based on Hanashite Hon’yaku, which primarily helped Japanese speakers communicate abroad, while the Jspeak application embraces the newer trend,
easing communication between foreign visitors and Japanese locals, suited to the upcoming needs of the Olympics. Start-ups and restart-ups Japanese crowd-sourced and/or machine translation start-ups are well aware of the increasing demand for diverse translations. anydooR’s Conyac, Yaraku’s WorldJumper, Gengo and Wovn are established providers of fast and affordable solutions who are keen on taking an extra step and providing their customers with versatile services suited to their needs. Due to the increasing demand for language-related services, Conyac recently launched “Conyac Market”, a platform that enables users to easily find multilingual help on various projects within Conyac’s worldwide database of over 45K multilingual people. “Conyac Market” was launched last month and is already handling market research, subtitling, copywriting, post-editing and many other services. With an increasing interest in the market and the Olympic “hype”, the Conyac team is ready to accept innovative orders and to answer any language- or culture-related requirements. In the meantime Pijin, a collaborative entity of a number of tech companies that started its
path as a student venture, has provided foreign visitors with a simple and clever solution for a basic understanding of signs and printouts: the QR Translator. By scanning a QR code placed next to Japanese text, visitors can access translations of the content at various airports, department stores, convenience stores, museums, expos and other tourist areas. The CEO of Pijin, Kenji Takaoka, is aware that there are many more areas and venues across Japan that could benefit from such a solution.
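The mechanism is simple enough to sketch: the printed QR code encodes a URL, and the server behind it returns the stored translations for that particular sign. The URL, sign ID and data below are invented for illustration and are not Pijin's actual QR Translator implementation.

# A sketch of the general idea behind a QR-code-to-translation service: the
# code next to a Japanese sign encodes a URL where translations of that sign
# are stored. All names and URLs here are hypothetical.
# Requires the third-party qrcode package (pip install qrcode[pil]).
import qrcode

SIGN_TRANSLATIONS = {
    "sign-0042": {
        "ja": "お手洗いは2階です",
        "en": "Restrooms are on the 2nd floor",
        "fr": "Les toilettes sont au 2e étage",
    }
}

def make_sign_qr(sign_id: str, filename: str) -> None:
    """Generate a QR code image pointing to a (hypothetical) translation page."""
    url = f"https://translate.example.com/signs/{sign_id}"
    img = qrcode.make(url)
    img.save(filename)

def lookup_translation(sign_id: str, lang: str) -> str:
    """What the server would return when a visitor scans the code and picks a language."""
    return SIGN_TRANSLATIONS[sign_id].get(lang, SIGN_TRANSLATIONS[sign_id]["en"])

make_sign_qr("sign-0042", "sign-0042.png")
print(lookup_translation("sign-0042", "fr"))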
What’s with the bento box ladies from the beginning of this article? We finally got to that. After reviewing the companies, it is time to talk about the people who try to make an impact on their own. In the past few years many new, smaller translation agencies have opened their doors, providing specialized services for small businesses with big international dreams. An interesting example is cosmopolite, an idea of Sabrina Olivieri-Tozawa and Miki Sakae, Tokyoites with years of international experience in education, linguistics, cultural studies and interaction with students. They have been welcoming international students, organizing parties and cultural events, and even preparing bento boxes for their host “children”. Knowing the cultural and linguistic barriers, they decided to dedicate their expertise to one of the most important segments – culinary experiences in Japan. In addition to cooking classes, tours and gatherings, they offer menu translation for restaurants in the form of package translation deals that also take care
of restaurants’ webpages, SEO, location on Google Maps and a full social media presence. A full Olympic package – no less than that… “Hello Japan. How do we communicate now?” More important than plans and future predictions are facts and current circumstances. There is probably no better place to review the current state of the Japanese translation industry than at the fifth TAUS Executive Forum in Tokyo, which will take place on April 9-10 at Oracle Japan. Technology, crowdsourced translation models, social platforms, advanced workflow systems, data sharing, evaluation metrics, cloud-based TM systems and the role of MT will be discussed by industry professionals from Lionbridge, Yaraku, Microsoft, Nikon Precision, Moravia, Mirai Translate, Honyaku Center, Spoken Translation, ATR-TREK, NICT, Human Science, Crestec, TOIN, ISE, Gengo and Conyac. Register today to “Discover Tomorrow”.
Una Softic Una is the Chief Global Manager at anydooR Inc., the company behind the Social Translation Service Conyac. She holds a double MA degree in Comparative Linguistics from the University of Ljubljana, Slovenia, and has 10 years of professional experience in business development and marketing across Europe, the United States and Japan.
Contributors
Reviews
Mike Tian-Jian Jiang
Mike was the core developer of GOING (Natural Input Method, http://iasl.iis.sinica.edu.tw/goingime.htm), one of the most famous intelligent Chinese phonetic input method products. He was also one of the core committers of OpenVanilla, one of the most active text input method and processing platforms. He has over 12, 10, and 8 years of experience in C++, Java, and C#, respectively, and is also familiar with Lucene and Lemur/Indri. His most important skill set is natural language processing, especially Chinese word segmentation based on pattern generation/matching, n-gram statistical language modeling with SRILM, and conditional random fields with CRF++ or Wapiti. Specialties: natural language processing, especially pattern analysis and statistical language modeling; information retrieval, especially tuning Lucene and Lemur/Indri; text entry (input methods).
Andrew Joscelyne
Andrew Joscelyne has been reporting on language technology in Europe for well over 20 years now. He has also been a market watcher for European Commission support programs devoted to mapping language technology progress and needs. Andrew has been especially interested in the changing translation industry, and began working with TAUS from its beginnings as part of the communication team. Today he sees language technologies (and languages themselves) as a collection of silos – translation, spoken interaction, text analytics, semantics, NLP and so on. Tomorrow, these will converge and interpenetrate, releasing new energies and possibilities for human communication.
Brian McConnell
Brian McConnell is the Head of Localization for Insightly, the leading small business CRM service for Google Apps. He is also the publisher of Translation Reports, a buyers guide for translation and localization technology and services, as well as a frequent contributor to TAUS Review. Specialties: telecommunications system and software design with emphasis on IVR, wireless and multi-modal communications; translation and localization technology.
Amlaku Eshetie
Amlaku earned a BA degree in Foreign Languages & Literature (English & French) in 1997, and an MA in Teaching English as a Foreign Language (TEFL) in 2005, both at Addis Ababa University, Ethiopia. He had been a teacher of English at various levels until he switched to translation and localisation in 2009. Currently, Amlaku is the founder and manager of KHAABBA International Training and Language Services, at which he has been able to create a big base of clients for services such as localisation, translation, editing & proofreading, interpretation, voiceovers and copywriting.
Perspectives
Jost Zetzsche
Jost Zetzsche is a certified English-to-German technical translator, a translation technology consultant, and a widely published author on various aspects of translation. Originally from Hamburg, Germany, he earned a Ph.D. in the field of Chinese translation history and linguistics. His computer guide for translators, A Translator’s Tool Box for the 21st Century, is now in its eleventh edition and his technical newsletter for translators goes out to more than 10,000 translation professionals. In 2012, Penguin published his co-authored Found in Translation, a book about translation and interpretation for the general public. His Twitter handle is @jeromobot.
Luigi Muzii
Luigi Muzii has been working in the language industry for more than 30 years as a translator, localizer, technical writer, author, trainer, university teacher of terminology and localization, and consultant. He has authored books on technical writing and translation quality systems, and is a regular speaker at conferences.
Nicholas Ostler
Nicholas Ostler is author of three books on language history: Empires of the Word (2005), Ad Infinitum (on Latin, 2007), and The Last Lingua Franca (2010). He is also Chairman of the Foundation for Endangered Languages, a global charitable organization registered in England and Wales. A research associate at the School of Oriental and African Studies, University of London, he has also been a visiting professor at Hitotsubashi University in Tokyo and L.N. Gumilev University in Astana, Kazakhstan. He holds an M.A. from Oxford University in Latin, Greek, philosophy and economics, and a 1979 Ph.D. in linguistics from M.I.T. He is an academician in the Russian Academy of Linguistics.
Lane Greene
Lane Greene is a business and finance correspondent for The Economist based in Berlin, and he also writes frequently about language for the newspaper and online. His book on the politics of language around the world, You Are What You Speak, was published by Random House in Spring 2011. He contributed a chapter on culture to the Economist book “Megachange”, and his writing has also appeared in many other publications. He is an outside advisor to Freedom House, and from 2005 to 2009 was an adjunct assistant professor in the Center for Global Affairs at New York University.
Directory of Distributors
Appen Appen is an award-winning, global leader in language, search and social technology. Appen helps leading technology companies expand into new global markets. BrauerTraining Training a new generation of translators & interpreters for the Digital Age using a web-based platform + cafeteria-style modular workshops. Capita TI Capita TI offers translation and interpreting services in more than 150 languages to ensure that your marketing messages are heard - in any language. Cloudwords Cloudwords accelerates content globalization at scale, dramatically reducing the cost, complexity and turnaround time required for localization. Concorde Concorde is the largest LSP in the Netherlands. We believe in the empowering benefits of technology in multilingual services. Crestec Europe B.V. We provide complete technical documentation services in any language and format in a wide range of subjects. Whatever your needs are, we have the solution for you! Global Textware Expertise in many disciplines. From small quick turnaround jobs to complex translation. All you need to communicate about in any language. Hunnect Ltd. Hunnect Ltd. is an MLV with innovative thinking and a clear approach to translation automation and training post-editors. www.hunnect.hu Iconic Translation Machines Machine Translation with Subject Matter Expertise. We help companies adopt MT technology.
iDisc Established in 1987, iDISC is an ISO-9001 and EN-15038 certified language and software company based in Spain, Argentina, Mexico and Brazil. Jensen Localisation Localization services for the IT, Health Care, Tourism and Automotive industries in European languages (mostly Nordic, Dutch and Spanish).
KantanMT.com KantanMT.com is a leading SaaS based statistical machine translation platform that enables users to develop and manage customized MT engines in the cloud. Kawamura International Based in Tokyo, KI provides language services to companies around the world including MT and PE solutions to accelerate global business growth. KHAABBA International Training and Language Services KHAABBA is an LSP company for African languages based in Ethiopia. Larsen Globalization Ltd Larsen Globalization Ltd is a recruitment company dedicated to the localization industry since 2000 with offices in Europe, the US and Japan. Lingo24 Lingo24 delivers a range of professional language services, using technologies to help our clients & linguists work more effectively. Lionbridge Lionbridge is the largest translation company and #1 localization provider in marketing services in the world, ensuring global success for over 800 leading brands Moravia Flexible thinking. Reliable delivery. Under this motto, Moravia delivers multilingual language services for the world’s brand leaders. Morningside Translation We’re a leading translation services company partnering with the Am Law 100 and Fortune 500 companies around the globe. MorphoLogic Localisation MorphoLogic Localisation is the developer of Globalese, an SMT system that helps increase translation productivity, decrease costs and shorten delivery times. Pactera Pactera is a leading Globalization Services provider, partnering with our clients to offer localization, in-market solutions and speech recognition services. Rockant Consulting & Training We provide consulting, training and managed services that transform your career from “localization guy/girl,” to a strategic adviser to management.
Safaba Translation Solutions, Inc. A technology leader providing automated translation solutions that deliver superior quality and simplify the path to global presence unlike any other solution. Sovee Sovee is a premier provider of translation and video solutions. The Sovee Smart Engine “learns” translation preferences in 6800 languages. sQuid sQuid helps companies integrate and exploit translation technologies in their workflows and maximize the use of their language data. STP Nordic Translation STP is a technology-focused Regional Language Vendor specialising in English, French, German and the Nordic languages. See www.stptrans.com. SYSTRAN SYSTRAN is the market’s historic provider of language translation software solutions for global corporations, public agencies and LSPs. tauyou language technology Machine translation and natural language processing solutions for the translation industry.
TraductaNET Traductanet is a linguistic service company specialising in translation, software and website localisation, terminology management and interpreting. VTMT As of 2013, VTMT sells translations made by man & machine. VTMT uses only PEMT, returning good translations quickly and for a fair price. Welocalize Welocalize offers innovative translation & localization solutions helping global brands grow & reach audiences around the world.
Industry Agenda
Upcoming TAUS Events
Upcoming TAUS Webinars
TAUS Executive Forum 9-10 April, 2015 Tokyo (Japan)
TAUS Translation Technology Showcase Lingo24 and ABBYY 6 May 2015
TAUS QE Summit Dublin 28 May, 2015 Dublin (Ireland) hosted by Microsoft
TAUS Translation Quality Webinar What are you really evaluating? 17 June, 2015
TAUS Industry Leaders Forum 1-2 June, 2015 Berlin (Germany)
Translation Automation Users Call Citrix Use Case 23 April 2015
TAUS Annual Conference 12-13 October, 2015 San Jose, CA (USA)
TAUS QE Summit San Jose 14 October, 2015 San Jose, CA (USA) hosted by eBay
TAUS HAUS
Industry Events
TAUS Office Amsterdam Keizersgracht 74 21 May, 2015
Localization World 13-15 April, 2015 Shanghai (China) 4-6 June, 2015 Berlin (Germany)
Do you want to have your event listed here? Write to editor@taus.net for information.
• The ribbon, particularly for new users and others who are not aware of all the settings in the old menu structure • Much more accurate Concordance search results • Speed improvements • Virtual merge and autosave • Improved display filter • Easier access to the various help resources • New TM fields and field values are immediately available • Very stable Nora Diaz Freelance Translator - Mexico
Join the conversation noradiaz.blogspot.co.uk @NoraDiazB #Studio2014
www.sdl.com/studio2014 www.translationzone.com/studio2014 Take it further, share projects with Studio GroupShare: www.translationzone.com/groupshare2014
Purchase or upgrade to SDL Trados Studio 2014 today /sdltrados