Why Can't Machines Read Arabic? A Google News Initiative Pitch by John Lillywhite


19th Dec 2019

Why Can’t Machines Read Arabic at Scale?

Al Bawaba News and Convz Marketing Analytics Consultancy present a GNI Challenge shortlist pitch for developing a new API to read, semantically tag and categorize large Arabic-language datasets more accurately.



Contents

1. End Product and Deliverables
2. Publisher Applications
3. Consumer Applications
4. Al Bawaba and DISCO Content Marketplace

● 4.1 Our Team
● 4.2 Problem
● 4.3 Solution
● 4.4 Product
● 4.5 Target Market
● 4.6 Competition
● 4.7 Business Model and Opportunities
● 4.8 Appendix (Images)



1. End Product and Deliverables

Background: One of the reasons Google Search, Google Home and applications like Siri are so advanced is that machines can read English. Over the past five years, our team has worked with IBM Watson, Open Calais from Thomson Reuters and Rosette to create a vast, granular digital archive from over 19 years of original and syndicated Arabic news content for a platform called ‘DISCO Content Marketplace.’ The project is very similar to a case study Google Cloud completed with the New York Times print archive in 2018. The problem? Our digital archive is in Arabic. Linguistic complexity and significant variations in the quality of content mean that machines don’t always know what they are looking at.

End Product: You might not notice it, but when you publish an English-language news article, a series of semantic tags is generated (see Fig. 4 in the Appendix). These tags include metadata such as the key individuals, locations, institutions and themes mentioned in the article. Most editors appreciate that these tags are important for SEO. At scale, when we are dealing with terabytes of data and manual tagging is no longer possible, the ability of machines to accurately tag and categorise articles goes from important to critical. Working with Google Cloud and Convz Marketing and Analytics, our end product is a ‘trained AI’ capable of filtering, tagging and ‘reading’ Arabic content much more thoroughly.

Deliverables
● One of the largest Arabic-language archives in the Middle East becomes searchable. This means syndication clients, third-party publishers and researchers can extract more value from the archive.
● An API that can be integrated into other publishing platforms and websites to improve the semantic tagging and filtering of Arabic content.
● Freemium access for a limited time period to historical elements of the DISCO archive.
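As a rough illustration of the semantic tags described above, the sketch below models a tagged article and a simple tag-based filter in Python. The class names, field names and sample records are hypothetical; they are not the output format of DISCO, Google Cloud or any other service named in this pitch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SemanticTags:
    """Hypothetical container for the tag types named in this pitch."""
    people: List[str] = field(default_factory=list)
    locations: List[str] = field(default_factory=list)
    institutions: List[str] = field(default_factory=list)
    themes: List[str] = field(default_factory=list)


@dataclass
class Article:
    title: str
    language: str  # language code such as "ar" or "en"
    tags: SemanticTags


def filter_by_theme(articles: List[Article], theme: str) -> List[Article]:
    """Return the articles whose tags include the given theme."""
    return [a for a in articles if theme in a.tags.themes]


# Invented sample records, for illustration only.
archive = [
    Article("أسعار النفط ترتفع", "ar",
            SemanticTags(locations=["الكويت"], themes=["oil market"])),
    Article("Football final tonight", "en",
            SemanticTags(themes=["sport"])),
]

oil_stories = filter_by_theme(archive, "oil market")
print([a.title for a in oil_stories])
```

Once every article carries tags of this kind, searching an archive becomes a matter of filtering on structured fields rather than scanning raw text, which is why tagging accuracy matters so much at scale.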



2. Publisher Applications

The following three applications are relevant to the business needs of Al Bawaba News, DISCO Content Marketplace and SyndiGate. Other publishers may envisage different or surprising applications.

i) A Dedicated API
● After 6-8 months of testing, a dedicated API would allow DISCO Content Marketplace, Al Bawaba News and SyndiGate to substantially improve the quality of our Arabic-language databases.
● This would transform the quality of existing services, as well as the services we can offer in the future.
● The API would be made available to other Arabic publishers in the region.

ii) More Advanced Syndication
● As the largest syndication network in MENA, SyndiGate pushes thousands of stories in Arabic and English to clients all over the world, every single day. This process is automatic, using RSS and other technologies.
● If we can increase the accuracy of the semantic tagging of our Arabic content, the content we provide to our clients will fundamentally improve. This could also involve new services, such as the delivery of hyper-localised or specific content: oil market results in a single country or region, or comments from a politician anywhere in the world.

iii) Wire Integration in Arabic
● Most publishers use what’s called a ‘wire service’: automatic updates from Reuters or Bloomberg about what’s happening in the world. Increasingly, these feeds are being automated, giving publishers more time to focus on original content and hire human beings.
● At Al Bawaba, between 30% and 40% of our editors are responsible for publishing syndicated content. Editors can spend up to 50% of their time not publishing stories, but sourcing stories from a vast database. What if we could suggest or tailor stories to them?



● Ideally, our Arabic editors would log in to review automated suggestions: ‘Breaking’, ‘Top Business’, ‘Top Entertainment’, ‘Top Sport’, ‘Human Interest’ and ‘Trending Now’.
● Once an API capable of semantically tagging Arabic content is established, this integration is one small but significant step away.
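One way semantic tags could drive the suggestion categories above is sketched below. The mapping from theme tags to categories and the sample stories are hypothetical; this pitch does not specify how suggestions would be generated, and categories such as ‘Breaking’ or ‘Trending Now’ would need signals beyond tags (for example, recency or traffic).

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical mapping from a semantic theme tag to an editor-facing category.
THEME_TO_CATEGORY: Dict[str, str] = {
    "business": "Top Business",
    "entertainment": "Top Entertainment",
    "sport": "Top Sport",
    "human interest": "Human Interest",
}


def suggest(stories: List[Tuple[str, List[str]]]) -> Dict[str, List[str]]:
    """Group story titles under each category matched by their theme tags."""
    suggestions: Dict[str, List[str]] = defaultdict(list)
    for title, themes in stories:
        for theme in themes:
            category = THEME_TO_CATEGORY.get(theme)
            # Skip unmapped themes and avoid listing a story twice per category.
            if category is not None and title not in suggestions[category]:
                suggestions[category].append(title)
    return dict(suggestions)


# Invented sample input: (title, theme tags) pairs.
stories = [
    ("Story A", ["business"]),
    ("Story B", ["sport", "business"]),
    ("Story C", ["weather"]),
]
result = suggest(stories)
print(result)
```

A story with several themes appears under each matching category, and stories whose themes are not in the mapping are simply left out of the suggestions.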

3. Consumer Applications

At the consumer end, much of the application is latent. We don’t know why Google Search is better; it just is.
● Quality: That said, machine reading of Arabic has significant applications for consumers. It means the articles we can push to clients and consumers in Arabic are of higher quality and are delivered faster.
● Future Services: It also means that in future we can integrate new and exciting technologies with Arabic-language websites and services, opening up entirely new possibilities and business solutions.



4. Al Bawaba and DISCO Content Marketplace

4.1. Our Team

Our team is made up of journalists, developers, researchers and curators, all of whom have a deep understanding of both the regional and global digital media landscapes. This includes a licensing, sales and content marketing team in Dubai, an in-house development team in Amman responsible for building and maintaining our digital infrastructure, and a team of project managers and original content creators based between Amman, Dubai, the United States and Europe.

With decades of experience in creating world-class content technology, our development team is well versed in working with content across multiple formats and languages. To date, they have processed and enriched tens of millions of content items, overcoming many of the challenges that come with the complex nuances of the Arabic language.

Our editorial staff come from a broad range of media backgrounds, including major publishing houses, world-leading news publications and global brand content teams. They have watched the digital publishing scene in the Middle East grow from its humble beginnings into the competitive and complex network it is today, and have acted as pioneers in many aspects of this development.

Our team boasts an unrivaled knowledge of the Middle East’s media landscape. They actively identify and make accessible content from publications across every language, area and political viewpoint in the Middle East, whether they be niche blogs or major newspapers and online publications.



4.2 Problem

Machines can’t read Arabic very well yet. That makes searching through vast archives of Arabic content really hard. It also makes curating, packaging and selling that content to clients almost impossible.

4.3 Solution

Through our parent and sister companies, we manage one of the largest datasets of Arabic news and third-party content anywhere. Our solution is to leverage Google Cloud and build a ‘trained AI’ capable of semantically understanding and tagging Arabic content with much higher accuracy. This solution won’t simply help transform our business. It will contribute to building capacity for Arabic-language publishing on the Internet itself.

4.4 Product

As an independently funded news source, Al Bawaba strives to share local news and insights from across the political spectrum, free from influence and censorship. Through a sophisticated network of on-the-ground sources, licensed content providers, and an expert team of in-house editors, Al Bawaba has developed an editorial and content strategy which breaks boundaries in Middle East journalism and storytelling. Watch an introductory video here.

DISCO is an innovative digital content marketplace where buyers can access, search for and acquire an instant license to use or republish content from the Middle East and beyond, all fully rights-cleared. Combining proprietary technology with world-class journalism, our services are essential to both niche and major publishers who are hungry for trustworthy, multilingual content that informs, educates, and entertains audiences across the globe.



4.5 Target Market

DISCO Content Marketplace is an essential tool for publishers looking to increase the scope of their coverage and give their readers news, opinions and analysis from areas not covered in depth by their staff reporters or global news wires. The target market, therefore, is publishers from both the Middle East and the wider world who are keen to expand, enrich and streamline the editorial process with premium licensed content, whether they be newspapers, magazines, or online publications.

4.6 Competition

The two key competitors to DISCO in the Middle East are the availability of cheap original content from low-wage copywriters and the willingness of publishers to steal content and republish it without a license. Our prediction is that as the digital publishing industry grows and matures in the region, copyright will be taken more seriously, and as audiences are inundated with poorly written cheap content, they will turn towards sources providing them with premium, high-quality content. Outside of the Middle East, our competition includes global news wires (AFP, AP, Reuters, PTI, etc.) as well as niche services such as The Interview People.

4.7 Business Model and Opportunities

We are a team of journalists, tech junkies, and content strategists who believe that the dissemination of information is key to an informed, educated and entertained world. Our goal is to power the editorial programs of publishers across the globe with engaging, easy-to-license news, research, features, analysis and opinion from throughout the Middle East and beyond. Our approach is twofold: tell the story from a local perspective by licensing, processing and distributing content from the Middle East and making it available to publishers worldwide, while at the same time enriching the local publishing scene with easy-to-access, affordable content from outside the region.



Streamlining the editorial process and making content more accessible to the global audience will help us achieve that goal.

4.8 Appendix (Images)

Fig. 1 & 2: Images of DISCO Content Marketplace. Note the filtering on the left.

Fig. 1

Fig. 2



Fig. 3: The Al Bawaba News Syndication Suite for our News Editors

Fig. 4: Semantic Tagging in English or ‘How keywords are generated’


