Tuesday, July 20, 2010

Decoding scripts -- automatically

This story is starting to float through the current news cycle. Regina Barzilay and her colleagues are working on automated decipherment of scripts, and just presented the work in Uppsala at the Association for Computational Linguistics meeting. I don't see the paper on her website yet, but according to their abstract (from here):
In this paper we propose a method for the automatic decipherment of lost languages. Given a non-parallel corpus in a known related language, our model produces both alphabetic mappings and translations of words into their corresponding cognates. We employ a non-parametric Bayesian framework to simultaneously capture both low-level character mappings and high-level morphemic correspondences. This formulation enables us to encode some of the linguistic intuitions that have guided human decipherers. When applied to the ancient Semitic language Ugaritic, the model correctly maps nearly all letters to their Hebrew counterparts, and deduces the correct Hebrew cognate for over half of the Ugaritic words which have cognates in Hebrew.
See here for a little background on the scripts.

They apparently have plans to try it next on Etruscan. That's going to be a serious challenge, but this could be cool.

Who knows, maybe the Voynich manuscript is next!

4 comments:

Chris said...

Unfortunately this paper has been getting the Chinese whispers treatment in the blogosphere and MSM. Most claim the algorithm "translates" ancient languages or replaces linguists with computers. It does neither.

My position is that, while interesting, the results are less than meets the eye. Its reliance on using a known related language hamstrings it more than a little. Producing an alphabetic mapping and a cognate set is nice, but "deciphering a dead language" it ain't.

Mr. Verb said...

Yeah, that was a really good post. (Hadn't seen it yet ... been behind on stuff.)

The Etruscan case is very different and vastly more challenging.

James Crippen said...

A good control case might be to run it on Dutch and German to see that it identifies cognate graphemes and so forth. Then on English and French to see where it makes mispredictions.

Mr. Verb said...

Here's where it would be nice to see the full paper: They must have done some kind of pilot work before doing Ugaritic, right?