When I signed up for the class “Writing Across Languages – The Post-Monolingual Anglophone Novel”, I hadn’t really anticipated the extent to which we were going to work in the field of digital humanities. During my bachelor’s in literary studies and philosophy there was very little computational work involved, and it is, to put it mildly, not my field of expertise.
Nevertheless, I am quite thankful for the opportunity to get to know some of the methods and advantages of DH and distant reading and, on the other hand, to learn about some of the problems and biases of more traditional methods like close reading.
Ironically, this blog post is still very concerned with the many problems of English annotation programmes. In the seminar, each of us annotates a post-monolingual anglophone novel, that is, a novel mainly written in English, but with a considerable portion of foreign language(s) incorporated in the text. For this purpose we used pandas (a Python library for working with tabular data) and spaCy equipped with an English model, running in Google Colab (instead of Jupyter Notebook, which seems to be very difficult to install). Although this is only one of probably many programmes (not my field of expertise, remember?), I think it is safe to say that most software trained on the English language will share those problems.
I will now list some of the difficulties I encountered personally while annotating my sentences and then describe some of the general mistakes I noticed in the programme’s annotation – concerning foreign words and phrases, but also (and this I hadn’t expected) just poetic language in general.
The first difficulty I encountered was my limited knowledge of linguistics. I have never studied any language, so my linguistic input lies as far back as elementary school, or maybe consists of whatever I randomly picked up in secondary literature during my bachelor’s. Nor did I understand all the abbreviations spaCy uses for token dependencies. Fortunately, most of the mistakes the programme made were so obvious that even I noticed them. And to be honest, there were already so many of those that I might actually be lucky not to notice the rest.
The second obstacle I had to overcome was the limited number of foreign (meaning not German) languages I know. The novel I chose, “The Dragonfly Sea” by Yvonne Adhiambo Owuor, includes a great variety of languages, among them Swahili, Pate, Arabic, Turkish, Mandarin, French, Portuguese, Hindi and English. Of those, I only know French and English (what a Eurocentric education, right?). So, to judge whether the foreign words and sentences were annotated correctly, I had to translate them in a way that let me actually understand the structure of the language.
Last but not least, I had to deal with Chinese characters and Arabic letters that weren’t necessarily translated in the book. I had to look up transliterations into Latin letters in other parts of the book, or guess the meaning, look up the English translation with DeepL, and then translate the English translation back into Chinese characters. Often I had to look up several versions of a translation in other online dictionaries and translation programs as well, to find the right character and copy it into the programme.
Concerning the programme’s mistakes, the most common was the classification of foreign words as either proper nouns (PROPN) or nouns (NOUN). Take, for example, the following sentence:
„Ayaana asked, ‚Ma-e, mababu wetu walienda wapi?‘ – Where are our people?“
(Owuor 32)
Every last word (even the „e“ as a single token) was labeled a proper noun, except the word „wapi“, which was labeled an adverb. Even if the book hadn’t given me the English translation right away, it is highly unlikely that a sentence would consist of five proper nouns and an adverb. I didn’t even bother to check whether the dependencies in that sentence might be right.
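For readers who want to try this kind of check themselves, here is a minimal sketch, not the notebook we used in the seminar. It assumes spaCy is installed together with the small English model `en_core_web_sm` (I only know that we used "an English model", so the model name is an assumption). The helper names `tag_sentence` and `propn_share` are made up for illustration, and the tags in `reported` are the ones described above, typed in by hand rather than produced by running the model.

```python
def tag_sentence(text, model_name="en_core_web_sm"):
    """Return (token, POS, dependency) triples for a text using spaCy.

    Assumption: spaCy and the named English model are installed.
    """
    import spacy  # imported here so the rest of the sketch runs without spaCy

    nlp = spacy.load(model_name)
    return [(tok.text, tok.pos_, tok.dep_) for tok in nlp(text)]


def propn_share(tagged):
    """Share of word tokens tagged PROPN; punctuation and spaces are ignored."""
    words = [pos for _, pos in tagged if pos not in ("PUNCT", "SPACE")]
    if not words:
        return 0.0
    return sum(pos == "PROPN" for pos in words) / len(words)


# Tags as reported in the post for the Swahili sentence (entered by hand).
reported = [
    ("Ma", "PROPN"), ("e", "PROPN"), ("mababu", "PROPN"),
    ("wetu", "PROPN"), ("walienda", "PROPN"), ("wapi", "ADV"),
]

share = propn_share(reported)
# A span in which most words come out as "proper nouns" hints that the
# tagger is guessing on text outside the language it was trained on.
suspicious = share > 0.5
print(f"PROPN share: {share:.2f}, suspicious: {suspicious}")
```

The 0.5 threshold is an arbitrary choice for this illustration. To tag a real sentence, one could call `tag_sentence("...")` in Colab and pass the (token, POS) pairs to `propn_share`.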
Also, the programme’s classification of the tokens wasn’t consistent in itself. Take the following paragraph:
„Before the child had seen him, she used to twirl in the ocean’s shallows and sing a loud song of children at ease: ‘Ukuti, Ukuti Wa mnazi, wa mnazi Ukipata Upepo Watete…watete…watetemeka…’“ (Owuor 16)
Whereas the first „Wa“ is labeled a proper noun, its repetition „wa“ is labeled an adposition (ADP). In the same manner, the first „Watete“ is tagged as a proper noun, the second „watete“ just as a noun. I suspect this has something to do with capitalization (in fact, if I’m correct, all capitalized foreign words were PROPN). But it shows that foreign words are labeled somewhat arbitrarily.
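This kind of inconsistency can be spotted mechanically by grouping tokens by their lowercased form and listing every word that received more than one tag. The following is a small sketch under that assumption; the function name `inconsistent_tags` is hypothetical, and the input tags are the ones reported above, entered by hand.

```python
from collections import defaultdict


def inconsistent_tags(tagged):
    """Map each lowercased word to its set of tags, keeping only the
    words that received more than one distinct tag."""
    tags_by_word = defaultdict(set)
    for word, pos in tagged:
        tags_by_word[word.lower()].add(pos)
    return {w: tags for w, tags in tags_by_word.items() if len(tags) > 1}


# Tags as reported in the post for the children's song (entered by hand).
reported = [
    ("Wa", "PROPN"), ("wa", "ADP"),
    ("Watete", "PROPN"), ("watete", "NOUN"),
]

conflicts = inconsistent_tags(reported)
for word, tags in sorted(conflicts.items()):
    print(word, sorted(tags))
```

A word with several tags is not automatically an error, since the same form can legitimately play different roles, but for repeated lines of a song it is a useful signal for where to look first.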
This is, by the way, the case for all non-English words, regardless of the language. I thought it might work better with European languages, given that the machine was probably trained on literature from the European canon. But a French sentence, for example, is treated in just the same manner:
„‘Us.’ ‘Us?’ ‘Yes.’ ‘C’est une chose à laquelle je n’avais pas pensé.’“ (Owuor 218)
The French determiner une became a PROPN, as did à (correct: preposition), laquelle (correct: pronoun), je (correct: pronoun), n’avais (correct: negation and verb), pas (correct: negation) and pensé (correct: verb). The French noun chose became a verb and the root of the sentence. This also happens when we’re not talking about a whole sentence, but only a few words, as in the following example:
„Well, he has his character enter a tavern and go up to a tavern keeper and request a solicitud de asilo – lovely word – ‘solicitude,’ it evokes protectiveness.“
Here, spaCy took solicitud to be an adjective (correct: noun), asilo to be an adverb (correct: noun), and marked de with an X for „other“ (correct: preposition).
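Comparing the programme's tags with a manual correction, as I did by eye, can be sketched in a few lines. The function name `compare_tags` is hypothetical; the predicted tags are those reported above, and the gold tags are my manual corrections written with the Universal POS labels (NOUN for noun, ADP for preposition), which is my own mapping.

```python
def compare_tags(predicted, gold):
    """Return (token, predicted_tag, gold_tag) for every disagreement.

    Both lists must contain the same tokens in the same order.
    """
    if [tok for tok, _ in predicted] != [tok for tok, _ in gold]:
        raise ValueError("token sequences differ")
    return [
        (tok, p, g)
        for (tok, p), (_, g) in zip(predicted, gold)
        if p != g
    ]


# Tags as reported in the post (entered by hand) and manual corrections.
predicted = [("solicitud", "ADJ"), ("de", "X"), ("asilo", "ADV")]
gold = [("solicitud", "NOUN"), ("de", "ADP"), ("asilo", "NOUN")]

errors = compare_tags(predicted, gold)
accuracy = 1 - len(errors) / len(gold)
for tok, p, g in errors:
    print(f"{tok}: tagged {p}, expected {g}")
print(f"accuracy on this phrase: {accuracy:.0%}")
```

With a larger set of manually corrected sentences, the same comparison would give a rough per-language error rate instead of a list of anecdotes.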
The French example brings me to another observation: I was surprised to notice that the programme seems to have quite some problems with poetic language. Elliptical sentences especially seemed to confuse it. In the French example above, the first „Us“ is also labeled as a PROPN. The second „Us?“, on the other hand, is correctly recognized as a pronoun (PRON). I suspect this is because it is more common to use only a pronoun in an interrogative clause. Still, it is curious to see that a programme used to analyse literary texts cannot process an elliptical sentence. The same happens in the following example:
„Muhidin told Ayaana to repeat its name, kereng’ende, in four other languages: ‘Matapiojos. Libélula. Naaldekoker. Dragonfly.’ Ayaana intoned, ‘Matapiojos-libélula-naaldekoker-dragonfly.’“ (Owuor 38)
Whereas I was no longer surprised that non-English words were categorized incorrectly, „Dragonfly“ was also labeled a PROPN. Even more alarming was the outcome when I annotated the following paragraph:
„‘Allahu Akbar…’ Another day, night, day. Herald of promise, easing an ancient brooding island into wakefulness.“ (Owuor 15)
The programme actually took „Herald“ to be a PROPN and the ROOT of the sentence. Although this sentence is somewhat poetic, it is not highly unusual in an English novel. I have to wonder whether the programme would annotate a monolingual novel correctly or whether, in this instance, the foreign words confused the machine so much that it made mistakes it normally wouldn’t make (but that seems to be a humanization of the programme, I suppose…).
Last but not least, I noticed that the programme was also confused by unusual uses of capital letters, for example when a poem is cited.
So far, these are my experiences with annotating „The Dragonfly Sea“. I am very curious to see what more we can learn about DH methods and how the discipline will tackle the above-mentioned problems in the future.
Dear Iska, I share the same elementary school linguistic knowledge. Hence, in order to understand the POS tags, I constantly had to search for them or look up the POS abbreviations provided to us on ILIAS. Though it was very exciting to work with Colab and see how it recognizes the words and their POS, the process of understanding and comparing them was exhausting for me. I cannot imagine how difficult it must have been for you (or maybe it was not?) to analyze a novel written in so many different languages. The novel I examined contained only Spanish and, in one case, German words and phrases. I also noticed, as you mentioned, that most non-English words are tagged as PROPN (proper noun). However, in the following example, where "llave" appears four times, it was tagged differently almost every time: verb (the first), noun (the second and the third), and adjective (the last): "Llave maestra. There is no such thing as a llave universal. If we had a llave universal all our troubles would be over. No, señora Weiss is the only one with a llave maestra for Building C". Naturally, Colab recognizes the word as a noun because of the "a" before the phrase, but what surprises me is that it does not recognize "universal" as an adjective. Apparently, as you also noticed, it doesn't get better with other languages. For instance, the German words used in the novel are likewise almost all categorized as PROPN. To be honest, even now, writing this comment and rereading my notes makes me tired :)! But I am also excited to see whether these tiny steps we are taking help improve these digital tools.