An IIT Kharagpur 2022 calendar sparked debate regarding the Indus Valley Civilisation. Let’s not get into the related controversies. It does, however, remind us that about a century has passed since the discovery of Harappa and Mohenjodaro, yet the civilisation’s script still remains cryptic. Just like the Indus unicorn, maybe.
Being a Bengali brought up in Kolkata, I had the opportunity to read the adventures of Professor Shonku, an eccentric super-scientist created by Satyajit Ray. The fictional super-scientist could decipher the Indus script! In the story “The Unicorn Expedition”, however, Ray expressed surprise at finding an imaginary animal such as a unicorn on the Indus seals, alongside seals depicting real animals. In his 2015 paper “Ancient civilization: Cracking the Indus script”, published in Nature, Andrew Robinson reminded us about the motifs of a puzzling “unicorn”.
To me, the Indus Valley mystery is thus a unicorn mystery, to some extent. Over the decades, several attempts have been made to decipher the Indus script, with no consensus so far. In 2004, an anonymous donor offered a prize of $10,000 through comparative historian Steve Farmer. I haven’t heard of any real-life “Professor Shonku” claiming it yet!
The famous Rosetta Stone certainly helped decode the ancient Egyptian hieroglyphs, as it carries essentially the same text in three scripts: ancient Greek, Egyptian hieroglyphs, and Egyptian Demotic. Not all lost languages are lucky enough to have a Rosetta Stone. Given the absence of bilingual texts and the extreme brevity of the inscriptions, deciphering the Indus inscriptions is a daunting task. Moreover, the underlying language itself is unknown, as Mr Robinson pointed out in his 2015 Nature paper: “Was this an ancestor of Sanskrit or Dravidian, or of some other Indian language family, such as Munda, or was it a language that has disappeared?”
Iravatham Mahadevan’s 1977 work provides a detailed concordance of the Indus script. It found that the signs were not randomly distributed: certain signs were associated with the unicorn symbol, while others were associated with the gharial and dotted circle symbols.
Well, is it a written language at all? Steve Farmer and Harvard professor Michael Witzel argue that it’s not. Asko Parpola of the University of Helsinki, who has been trying to decipher the script since 1968, and others, however, think it very much encodes a language. Professor Parpola’s 1994 book, Deciphering the Indus Script, is built on his study of the nearly 4,000 surviving samples of the writing, mainly on stamp seals and amulets.
Now comes artificial intelligence (AI). For decades, computational methods have been playing a growing role in the comprehension and decoding of lost languages, in various ways. Linear B, the script the Mycenaeans used to write the oldest known form of Greek, was deciphered by the early 1950s through painstaking manual analysis of sign patterns, the kind of work that algorithms can now automate. But the Indus script is possibly a more complex system. Still, the ever-expanding horizon of AI is broadening what it can tackle, with increasing power of synthesis and self-learning. A 2009 paper in Science, led by University of Washington computer scientist Rajesh Rao, presented evidence for the linguistic hypothesis by showing that the Indus script’s conditional entropy, a measure of how unpredictable the next sign is given the current one, is closer to that of natural languages than to that of various types of non-linguistic systems. There have been other recent attempts by different researchers to use AI to demystify the Indus script, ranging from understanding how Indus inscriptions conveyed meaning to finding the script’s roots.
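To make the idea of conditional entropy concrete, here is a minimal sketch in Python. It is a simplified plug-in estimate computed from adjacent pairs of signs, not the exact procedure of the 2009 Science paper, and the toy sequences are invented for illustration rather than taken from Indus inscriptions.

```python
import math
from collections import Counter


def conditional_entropy(seq):
    """Estimate H(next sign | current sign) in bits from adjacent pairs.

    Simplified plug-in estimator for illustration only; published studies
    use larger corpora and more careful (smoothed) estimates.
    """
    pairs = list(zip(seq, seq[1:]))           # (current, next) sign pairs
    pair_counts = Counter(pairs)              # how often each pair occurs
    first_counts = Counter(a for a, _ in pairs)  # how often each sign leads a pair
    total = len(pairs)
    h = 0.0
    for (a, _b), n in pair_counts.items():
        p_pair = n / total                    # joint probability P(a, b)
        p_next_given_cur = n / first_counts[a]  # conditional probability P(b | a)
        h -= p_pair * math.log2(p_next_given_cur)
    return h


# Toy sequences (invented, NOT real Indus data).
rigid = "ABABABABAB"      # the next sign is fully determined by the current one
flexible = "AABBAABBA"    # after each sign, either sign is equally likely

print(conditional_entropy(rigid))
print(conditional_entropy(flexible))
```

A rigidly ordered sequence scores near zero and a fully random one scores near the maximum; the argument reported above is that the Indus signs, like natural languages, fall in between.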
Implementing AI usually requires training an algorithm by comparing its output with known answers. But, certainly, no such known answers exist for lost languages. Well, recent developments might help. In a 2019 paper, Jiaming Luo and Regina Barzilay of MIT and Yuan Cao of Google proposed a novel neural approach for the automatic decipherment of lost languages. They applied their method to two already decoded scripts: Linear B and the 3,000-year-old Ugaritic. They observed a 5 per cent absolute improvement over state-of-the-art results for Ugaritic, and 67.3 per cent of Linear B cognates were correctly translated into their Greek equivalents.
Then, in a follow-up 2020 study, these authors, along with co-researchers, developed a new algorithm for scripts that are not fully segmented into words and whose closest known language has not been determined. Their model captures the natural phonological geometry by learning character embeddings based on the International Phonetic Alphabet. They evaluated their model on two already deciphered languages, Gothic and Ugaritic, and it worked! They also applied it to an undeciphered one, Iberian. Importantly, there’s no consensus about which known language is closest to Iberian: it could be Greek, Latin, or Basque.
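The intuition behind phonetically grounded character embeddings can be sketched as follows. This is only an illustration: the feature set and the vectors are hand-coded and heavily simplified assumptions of mine, whereas the model described above learns its embeddings from data.

```python
import math

# Hypothetical, simplified phonetic features (an assumption for illustration):
# [voiced, bilabial, alveolar, plosive, fricative, nasal]
FEATURES = {
    "p": [0, 1, 0, 1, 0, 0],  # voiceless bilabial plosive
    "b": [1, 1, 0, 1, 0, 0],  # voiced bilabial plosive
    "t": [0, 0, 1, 1, 0, 0],  # voiceless alveolar plosive
    "s": [0, 0, 1, 0, 1, 0],  # voiceless alveolar fricative
    "m": [1, 1, 0, 0, 0, 1],  # voiced bilabial nasal
}


def similarity(x, y):
    """Cosine similarity between the feature vectors of two sounds."""
    u, v = FEATURES[x], FEATURES[y]
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Sounds that share more features end up closer together.
print(similarity("p", "b"))  # differ only in voicing
print(similarity("p", "t"))  # differ only in place of articulation
print(similarity("p", "s"))  # share none of the listed features
```

Because related languages tend to swap similar sounds for one another, a representation in which “p” sits closer to “b” than to “s” gives a decipherment model a sensible prior about which character mappings are plausible.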
Well, what next? Can such AI, or a suitable improvement on it, decipher the Indus script in the near future, ignoring all previous analyses? Will the Indus unicorns be demystified then? Maybe. Still, one would never know how a lost language was pronounced, for example. A bit of spice would remain buried under time. Always.
The writer is professor of Statistics at Indian Statistical Institute, Kolkata