Making the World's Language Data Useful

Data Innovation in Google Translate
Macduff Hughes, Engineering Director, Google Translate

Jan 23, 2014

Translation: from rule-based...

...to statistical / data-driven.

1954: Georgetown-IBM experiment
Russian-English: 6 rules, 250 words.
“Students of language are now for the first time justified in undertaking serious study of language from a mechanical point of view.”

1966: ALPAC Report
“The problem is not to meet some nonexistent need through nonexistent machine translation.”

1994: IBM “Candide” paper
2.2M English-French sentence pairs

1998: Google founded

2006: Google Translate

World Wide Web indexed

Google Confidential and Proprietary

Language Barriers on the Web
● English is about half of the web
● About 20% of world population has some English skill
● Ten languages account for 80% of the web
● Practically no web in Hindi, Arabic (~300M speakers each)
● To most of the world, the web is mostly unreadable


Google Translate: in browser


Google Translate: Web


Google Translate: Mobile


Making a Difference


Google Translate stats
● 80 languages
● 200+ million users
● 92% from outside the USA
● 100+ million Android installs
● 1+ billion translations per day
● Total text translated each day equal to 1+ million books

Applications, large...

News Gathering, Intelligence, Patents, Customer Service, …

and small...

Feb 13, 2012

THE DALLES, Ore. -- A mobile app helped Oregon State Police troopers communicate with a foreign-speaking man who was experiencing a diabetic reaction while driving on I-84 near The Dalles. On Sunday afternoon, OSP troopers responded to several reports of a possible impaired driver.... The driver, a 57-year-old man of Chinese descent, spoke little English so the troopers used a Google language translator app to discover that the man was diabetic. The cell phone app translates more than 50 languages in text or spoken word.

http://www.kgw.com/news/Police-use-translation-app-to-aid-driver-139255188.html

Owner of [Chinese] restaurant accused of assaulting wife
January 8, 2014

“An employee… called police after hearing screaming, arguing and pots and dishes clattering in the kitchen, according to the police report…. When police arrived around 7 p.m., they used two tablets running Google Translate to communicate with [the suspect’s] wife…”

http://www.theeagle.com/news/crime/article_02055545-6416-5728-b0b8-7469ff356200.html

Even “small” languages have global reach

Translate’s new languages, Dec 2013: Hausa, Igbo, Maori, Mongolian, Nepali, Punjabi, Somali, Yoruba, Zulu

(Translations using any of the above)


...often for migrants

Translate’s new languages, Dec 2013: Hausa, Igbo, Maori, Mongolian, Nepali, Punjabi, Somali, Yoruba, Zulu


How it works


[Diagram: how the models are trained]

Data: trillions of web pages; 100s of billions of words; trillions of bytes of data.

Google Web Index → Parallel Data Mining → Parallel Translations → Translation Training → Translation Model

Google Web Index → Language Model Training → Language Model
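
The two training paths in the diagram can be illustrated with a minimal sketch. It assumes toy data, and the function names `count_ngrams` and `estimate_phrase_table` are hypothetical, not anything from Google Translate. The language-model path is reduced to counting n-grams in monolingual text, and the translation-model path to relative-frequency estimates over phrase pairs that are assumed to be already aligned. Real systems must also mine and align the parallel text, which this sketch skips.

```python
from collections import Counter, defaultdict


def count_ngrams(sentences, n=2):
    """Count n-grams over whitespace-tokenized sentences (language model path)."""
    counts = Counter()
    for sentence in sentences:
        # Pad with boundary markers so sentence starts and ends are counted too.
        tokens = ["<s>"] * (n - 1) + sentence.split() + ["</s>"]
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts


def estimate_phrase_table(phrase_pairs):
    """Estimate P(target | source) by relative frequency (translation model path).

    phrase_pairs is an iterable of (source_phrase, target_phrase) tuples that
    are assumed to have been extracted from aligned parallel text already.
    """
    pair_counts = defaultdict(Counter)
    for source, target in phrase_pairs:
        pair_counts[source][target] += 1
    return {
        source: {t: c / sum(targets.values()) for t, c in targets.items()}
        for source, targets in pair_counts.items()
    }


if __name__ == "__main__":
    # Toy inputs, invented for illustration only.
    bigrams = count_ngrams(["el gato se sentó", "el gato duerme"])
    table = estimate_phrase_table([
        ("the cat", "el gato"),
        ("the cat", "el gato"),
        ("the cat", "la gata"),
        ("sat", "se sentó"),
    ])
    print(bigrams[("el", "gato")], table["the cat"])
```

The counts feed a language model that scores fluency in the target language, and the phrase table supplies candidate translations with probabilities.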


[Diagram: translating one sentence]

Source: The cat sat on the mat.

Source phrases and candidate translations:
The: El, La
The cat: El gato, El gato se, La gata, el gato
The cat sat / cat sat: gato se sentó
on the mat.: en el tatami., en el tapete., en la colchoneta., en la estera., sobre la alfombra., sobre la esterilla.

Candidate sentences:
El gato se sentó en la estera.
El gato se sentó en el tapete.
El gato se sentó en el tatami.
...

Scored by: Translation Model, Language Model

Output: El gato se sentó en la estera.
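
The cat example can be sketched in code as a toy version of this scoring step. Everything here is an assumption made for illustration: the phrase probabilities are invented, the three-sentence corpus stands in for web-scale text, the source sentence is pre-split into three fixed phrases, and candidates are enumerated exhaustively in source order. A production decoder would search over segmentations and reorderings instead.

```python
import itertools
import math
from collections import Counter

# Hypothetical phrase table, P(target | source); the probabilities are invented.
PHRASE_TABLE = {
    "the cat": {"el gato": 0.7, "la gata": 0.3},
    "sat": {"se sentó": 1.0},
    "on the mat.": {
        "en el tatami.": 0.15,
        "en el tapete.": 0.15,
        "en la colchoneta.": 0.1,
        "en la estera.": 0.2,
        "sobre la alfombra.": 0.3,
        "sobre la esterilla.": 0.1,
    },
}

# Tiny monolingual corpus standing in for the target-language web text.
CORPUS = [
    "el gato se sentó en la estera.",
    "el perro se sentó en la estera.",
    "la gata duerme sobre la alfombra.",
]


def train_bigram_lm(corpus):
    """Collect context counts, bigram counts, and vocabulary size."""
    contexts, bigrams, vocab = Counter(), Counter(), {"</s>"}
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens[1:-1])
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, len(vocab)


def lm_logprob(sentence, lm):
    """Log-probability of a sentence under the add-one-smoothed bigram model."""
    contexts, bigrams, vocab_size = lm
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size))
        for prev, word in zip(tokens, tokens[1:])
    )


def decode(source_phrases, lm):
    """Return the candidate with the best translation-model plus language-model score."""
    options = [list(PHRASE_TABLE[phrase].items()) for phrase in source_phrases]
    best, best_score = None, -math.inf
    for combo in itertools.product(*options):
        sentence = " ".join(target for target, _ in combo)
        score = sum(math.log(p) for _, p in combo) + lm_logprob(sentence, lm)
        if score > best_score:
            best, best_score = sentence, score
    return best


if __name__ == "__main__":
    lm = train_bigram_lm(CORPUS)
    print(decode(["the cat", "sat", "on the mat."], lm))
```

In this toy setup the translation model alone would prefer "sobre la alfombra." for the last phrase, but the language model has seen "se sentó en la estera." and tips the combined score toward it. This shows how the two models divide the work.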

How well does it work?


Sometimes really well...

Tras estar preso durante más de 27 años cumpliendo cadena perpetua, Nelson Mandela fue liberado, recibió el Premio Nobel de la Paz y fue elegido democráticamente como presidente de su país.3 Antes de estar preso había sido líder de Umkhonto we Sizwe, el brazo armado del Congreso Nacional Africano (CNA), creado a su vez por el Congreso de Sindicatos Sudafricanos y el Partido Comunista Sudafricano. En 1962 fue arrestado y condenado por sabotaje, además de otros cargos, a cadena perpetua. La mayor parte de los más de 27 años que estuvo en la cárcel los pasó en la prisión-isla de Robben Island.

After being imprisoned for over 27 years serving a life sentence, Nelson Mandela was released, received the Nobel Peace Prize and was democratically elected as president of his country. 3 Before being imprisoned had been the leader of Umkhonto we Sizwe , the armed wing the African National Congress (ANC), created in turn by the Congress of South African Trade Unions and the South African Communist Party . In 1962 he was arrested and convicted of sabotage and other charges, to life imprisonment. Most of the more than 27 years he spent in prison were spent in the prison-island Robben Island .

Spanish Wikipedia: https://es.wikipedia.org/wiki/Nelson_Mandela


Sometimes less well

当选總統前,曼德拉是积极的反种族隔离人士,任非洲人國民大會武裝組織民族之矛領袖。當曼德拉領導反種族隔離運動時,南非法院曾判处他「密謀推翻政府」等罪名,曼德拉前后共服刑27年半,其中有約18年在羅本島度過。

Before becoming president, Mandela was an active anti-apartheid party, either the ANC armed organization spear nation leaders. When Mandela leading antiapartheid movement, South African courts had sentenced him to "conspiring to overthrow the government" and other charges, after Mandela served 27 years of a half, of which about 18 years in Robben Island to spend.

Chinese Wikipedia: https://zh.wikipedia.org/wiki/Nelson_Mandela


What determines translation quality?

Ingenuity and elbow grease

Language Complexity (syntax, morphology…)

DATA


Highly inflected languages


Very different syntax compared to English.


Future Directions

Quality in addition to quantity
● Machine Translations on web
● Low quality and mixed content

Volunteer crowdsourcing
● Very small languages (Maori 2013)
● Slang, chat, social network content…
● Wiktionary

Speech

Conclusion

Google Translate demonstrates…
● Power of data
● Repurposing of data
● Potential for still more impact


Thank you!

Hvala! Salamat!

Mun gode! Ua tsaug! Na-ekele unu! Matur nuwun! សូមអរគុណ!

धन्यवाद!

Танд их баярлалаа!

धन्यवाद! ਤੁਹਾਡਾ ਧੰਨਵਾਦ ਹੈ!

Waad ku mahadsan tahay! O ṣeun! Mauruuru!

Ngiyabonga!