Thursday, November 03, 2011

Google Translate "already speaks 57 languages as well as a 10-year-old"

Wow. That's the claim in the headline of a Slate story by Jeremy Kingsley, here. The body of the story uses a somewhat different formulation:
Today, the [Google] algorithm has an understanding of language something like a 10-year-old’s, but its rate of improvement is fast exceeding human language-learning development.
That makes more sense, and I can imagine that a journalist trying to explain how Google Translate works would reach for this kind of comparison. Note that the two claims are different: the body's is about comprehension ("understanding") and the headline's is about production ("speaks").

But whichever claim you take, it's an empirical claim of sorts. So I tried it with a little chunk of Spanish (from here), figuring that lots of our readers know Spanish and that it's probably one of the better-developed languages (compared to Albanian and Azerbaijani, which it also handles). This is the original:
La Historia de la lengua espanola, de Rafael Lapesa, es obra de ejemplaridad casi unica en el campo linguistico y literario. Hace medio siglo que llego al publico por primera vez, y desde entonces ha formado, enriquecido y deleitado a muchas generaciones de estudiosos. Esta edicion recoge la ultima reelaboracion que el maestro Lapesa hizo con exigente entusiasmo, aumentando su volumen original en mas de un tercio. Y de nuevo se impone concluir: nadie como Rafael Lapesa ha descrito la historia de nuestra lengua; nadie ha sabido contarla con tanta eficacia, con tanto encanto. Mediante la vision sucesiva de los distintos estados del espanol, Lapesa logro fundir historia, lenguaje, cultura y vida. Su libro alcanza cohesion superior al concebir como inseparables la lengua y la literatura. Los grandes autores y obras aparecen caracterizados en su estilo de forma inolvidable. La belleza de las creaciones individuales se suma asi a la oscura labor del pueblo. En el dominio de los materiales brillan las cualidades relevantes de Lapesa: saber exacto, equilibrio, serena objetividad, talante generoso, claridad pura (casi sin tecnicismos), compenetracion mental y sensitiva con lo tratado, modestia, sacrificio. Memorable Historia la que (desde el pasado y desde el presente) construyo el maestro. Para todo hispanohablante ha sido, es y ha de seguir siendo obra especialmente querida.
Here's what Google Translate spits out:
The History of the Spanish language, Rafael Lapesa is exemplary work almost unique in the field of languages and literature. Half a century ago who came to the public for the first time, and has since formed, enriched and delighted many generations of scholars. This edition includes the latest reworking the teacher did Lapesa demanding enthusiasm, increasing its original volume in more than one third. And again imposed conclude: anyone like Rafael Lapesa has described the history of our language, no one has been able to tell it so effectively with so much charm. Through the vision of successive various states of Spanish, melt achievement Lapesa history, language, culture and life. His book reaches cohesion conceived as inseparable than language and literature. Major authors and works are characterized in an unforgettable style. The beauty of individual creations joins the dark work of the people. In the domain of materials relevant qualities shine Lapesa: exact knowledge, balance, objectivity, calm, generous spirit, pure brightness (almost non-technical), sensory and mental rapport with the treaty, modesty, sacrifice. That Memorable History (from the past and from the present) built the master. For all speaking has been, is and must remain a work especially dear.
I don't talk to 10-year-olds that often, and this is a really impressive automatic result (to me at least), but how do we judge the program's level of 'understanding' of a language? And if we have a metric, is this 10-year-old-like?
In terms of production, it's not close to the syntactic patterns that a kid of that age would have, right? I'm a little surprised that it's not better on pro-drop, and I don't get why it seems to have simply skipped some words … I could see using an English possessive instead of 'de', but I don't get why "Para todo hispanohablante" comes out as "For all speaking".
But I'm in favor of anything that involves "the dark work of the people", to which I now return …


Dianna said...

Hmm. Well, I use Google Chrome, and any time I'm on a non-English website, I get a bar at the top that says, "This page is in LANGUAGE. Would you like to translate it?" And you can choose "yes," "no," or "Never translate LANGUAGE" where LANGUAGE is the language used on the website (German, Hungarian, French, etc.).

Or it's supposed to be, anyway. Google Chrome routinely tells me "This page is in Danish" when it's most definitely in Norwegian. And I know bokmål and dansk are similar on the surface, but you'd think Google Translate could tell the difference! I can only assume the web address has no bearing, because these are websites ending in .no rather than .dk. Computers are only so smart...

be_slayed said...

My understanding is that the Spanish <-> English algorithm is the star of the show, with the other pairings not faring so well. Certainly, Hindi <-> English tends to produce a lot of gibberish.

Mr. Verb said...

Thanks to both of you. I don't know how it's so hard for a program to distinguish Danish from bokmål. Wouldn't bear THAT much on the results, given what this test shows.

So, I accidentally chose the best case? That's not promising. One of our contributors says that the German-English is shot through with issues.

Joe said...

Just did a parallel test with German, where there was serious misinterpretation of pretty simple clauses.

Monica said...

I too must return to the dark work of the people, but not without admiring "melt achievement Lapesa history, language, culture and life". Pure poetry!

Anonymous said...

I use Google Chrome as well, and though I love that translation feature when planning international travel, it is *extremely* annoying that it translates currencies. I'm currently living in England, which Chrome knows, and I'm about to go to Iceland. When I'm browsing on Icelandic sites, I keep seeing things, like hotel rates, that are 11,000 GBP! Hovering over that bit of text shows "original text: 11,000 ISK." NOT the same. Equally puzzling, it translates Reykjavik to London.

Nevertheless, what translate does accomplish is certainly impressive.