In the study 'Learning the Character Inventories of Undeciphered Scripts Using Unsupervised Deep Clustering', researchers Logan Born, PhD Student in Simon Fraser University's School of Computing Science; Dr. M. Willis Monroe, Assistant Professor in the Department of Historical Studies at the University of New Brunswick; Dr. Kathryn Kelley, Postdoctoral Researcher at the University of Bologna; and Dr. Anoop Sarkar, Professor at Simon Fraser University's School of Computing Science, explore how to group the characters of undeciphered texts through script clustering. They present novel architectures which exploit varying degrees of contextual and visual information to learn representations for use in script clustering.
In this Q&A with Language Sciences, the research team breaks down approaches to text decipherment, explains what "script clustering" is, outlines what's next for their research, and more!
With regard to decipherment, can you explain what a token-to-type problem is and how you approach solving this type of problem?
In general, "types" are like abstract categories and "tokens" are concrete objects that belong to those categories. In our work, the "types" would be the possible characters in a script, while the tokens would be the actual physical impressions on the clay. Our task is kind of like trying to read messy handwriting: you know that each of these shapes corresponds to some letter, and you need to figure out which of the squiggles on the page correspond to which letter. In our case, however, we don't know the underlying script (i.e., we don't know what the possible letters might be) so we have to combine information from multiple sources to make an educated guess at what the underlying script might be. We ended up combining visual information with contextual features from something called a language model: this approach captures the intuition that every instance of a given character should both share a similar appearance and occur in more-or-less similar textual contexts.
What is script clustering, and how is it applied when deciphering scripts?
We use the phrase "script clustering" to refer to the task of determining how many characters there are in an unknown script, and, at the same time, labeling each token from the original corpus as one of those characters. We used this phrase to set our work apart from similar tasks like optical character recognition, where the underlying types are already known and you just need to label the tokens. This is an iterative process and does not necessarily produce a concrete result. Rather, we need to use the clusters as a step in the process towards eventual decipherment.
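To make the task concrete, here is a minimal sketch of clustering with an unknown number of types. It is not the deep clustering architecture from the study: it swaps in a simple greedy ("leader") clustering, and the feature vectors, the `leader_cluster` function, and the `threshold` value are all hypothetical, chosen only for illustration. Each toy token concatenates two made-up "visual" numbers with two made-up "contextual" numbers, echoing the idea of combining both sources of information.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def leader_cluster(tokens, threshold):
    """Greedy clustering that discovers the number of clusters as it goes.

    Each token joins the nearest existing cluster if that cluster's centre
    is within `threshold`; otherwise it starts a new cluster (a new "type").
    """
    centres = []  # running mean vector of each cluster
    counts = []   # number of tokens assigned to each cluster
    labels = []   # cluster index assigned to each token
    for vec in tokens:
        best, best_d = None, threshold
        for i, centre in enumerate(centres):
            d = distance(vec, centre)
            if d <= best_d:
                best, best_d = i, d
        if best is None:
            centres.append(list(vec))
            counts.append(1)
            labels.append(len(centres) - 1)
        else:
            counts[best] += 1
            n = counts[best]
            # Update the running mean of the chosen cluster.
            centres[best] = [c + (x - c) / n for c, x in zip(centres[best], vec)]
            labels.append(best)
    return labels, len(centres)


# Hypothetical toy data: (visual_1, visual_2, context_1, context_2) per token.
tokens = [
    (0.0, 0.1, 1.0, 0.0),
    (0.1, 0.0, 0.9, 0.1),
    (5.0, 5.1, 0.0, 1.0),
    (5.1, 4.9, 0.1, 0.9),
    (0.05, 0.05, 1.0, 0.05),
]
labels, n_types = leader_cluster(tokens, threshold=1.0)
print(labels, n_types)
```

The sketch returns both a label for every token and the number of clusters found, which mirrors the two halves of the task described above. In practice the result depends heavily on the threshold and the features, which is one reason the clusters are a starting point rather than a final answer.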
How does the complexity and number of characters in a script impact your model’s ability to learn the underlying character inventory?
Surprisingly, the number of characters in a script does not seem to have a very large impact on the difficulty of this task. In one of our experiments, we pretended that Japanese was a lost language, and tried to recover the Japanese script using our model. We were expecting the model to struggle since the Japanese writing systems use so many characters, but in fact even our simplest model was able to solve the task very effectively. Our models struggled a lot more in settings where there were fewer characters, but they could be written many different ways (e.g., think of all the different ways to handwrite some English letters: having all those shapes makes it harder to tell when two tokens are actually the same underlying character).
Why are contextual models more effective for script recovery than contextless models?
There's a game in the newspaper where they encipher a famous quote, and you have to decode what it's supposed to say. Most people will solve those by looking for patterns that they recognize from regular English text: for example, some letters often get doubled up, like "ee", while other letters like "s" are particularly common at the end of words. By looking for these sorts of contextual features, you can usually get a foothold to start figuring out what the puzzle is supposed to say. Our models are very similar: once you learn that a certain character occurs in a particular context, you should get better at recognizing that character in that context the next time you see it. This helps the model to avoid making mistakes when there are two characters that look similar but get used in different ways. A graphical analogy would be linked scripts (like English cursive): trying to understand the individual glyphs without context would be much harder. With the context of neighbouring letters, individual glyphs can be more securely deduced.
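The two newspaper-puzzle cues mentioned above (doubled letters and word-final letters) can be counted in a few lines. This is only an illustrative sketch, not part of the study's models; the `contextual_stats` function and the sample sentence are made up for the example.

```python
from collections import Counter


def contextual_stats(text):
    """Count doubled letters and word-final letters in a piece of text."""
    doubled = Counter()  # e.g. "ee" -> number of occurrences
    final = Counter()    # e.g. "s" -> number of words ending in "s"
    for word in text.lower().split():
        # Keep letters only, so punctuation does not count as a final symbol.
        word = "".join(ch for ch in word if ch.isalpha())
        if not word:
            continue
        final[word[-1]] += 1
        for a, b in zip(word, word[1:]):
            if a == b:
                doubled[a + b] += 1
    return doubled, final


sample = "She sees trees and keeps the bees across the street"
doubled, final = contextual_stats(sample)
print(doubled.most_common(2))
print(final.most_common(2))
```

Because a simple substitution cipher replaces each letter consistently, the same counts computed on the enciphered text show the same patterns under different symbols, which is what gives a solver a foothold.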
What’s next for this research?
Another way to solve decipherment problems is by trying to "pair up" undeciphered data with data from a known language. (This is called finding an "alignment" between the two languages.) For example, if you suspect that a given text is meant to represent English, you can try to match up each of the undeciphered characters with a letter from the English alphabet. This can be a painstaking process, since you usually have to search over many possible alignments to find the right one. We're looking into ways to make existing alignment models more efficient, and to make them better at handling data from scripts like proto-Elamite, which can be "messier" than other datasets owing to our relative lack of understanding of the script. There is a contemporaneous and neighbouring script called Proto-Cuneiform (linked historically to later cuneiform). Some of our future plans involve trying to find an alignment between these two scripts because of their close historical and cultural background.
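A toy version of the alignment search shows why it can be painstaking. The sketch below is an assumption-laden illustration, not the alignment models the team is working on: it tries every one-to-one mapping from four hypothetical symbols to four letters and scores each mapping by how many decoded words appear in a tiny made-up lexicon. The `best_alignment` function and all of the data are invented for the example.

```python
from itertools import permutations


def best_alignment(cipher_words, symbols, letters, lexicon):
    """Exhaustively search one-to-one symbol-to-letter mappings.

    Returns the best mapping, its score (number of decoded words found in
    the lexicon), and how many candidate mappings were tried.
    """
    best_map, best_score, tried = None, -1, 0
    for perm in permutations(letters, len(symbols)):
        mapping = dict(zip(symbols, perm))
        tried += 1
        decoded = ["".join(mapping[s] for s in word) for word in cipher_words]
        score = sum(word in lexicon for word in decoded)
        if score > best_score:
            best_map, best_score = mapping, score
    return best_map, best_score, tried


# Hypothetical undeciphered "text" written with four unknown symbols.
cipher_words = ["XYZ", "ZYX", "WZX", "YWX"]
symbols = ["X", "Y", "Z", "W"]
letters = ["a", "e", "n", "t"]
lexicon = {"ten", "net", "ant", "eat"}

best_map, best_score, tried = best_alignment(cipher_words, symbols, letters, lexicon)
print(best_map, best_score, tried)
```

With four symbols there are only 4! = 24 mappings to check, but the count grows factorially: a full 26-letter alphabet already allows 26! (roughly 4 × 10^26) one-to-one mappings, so exhaustive search stops being practical very quickly.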
Click here to access the full study.
Written by Kelsea Franzke