Multi-Lingual Annotations in Python with „A Concise Chinese-English Dictionary for Lovers“

This was not my first time using Python to work out the tokens of sentences by computational means. However, my last time working with Python (in a linguistics BA seminar with Prof. Kevin Tang) was some time ago, so I appreciated how easy it was made for us to run our example sentences through it.

Running the three example sentences given to us, I noticed that punctuation marking speech is often interpreted as part of words or compounds and receives a POS tag of its own, which makes it difficult to check which tokens refer to which words.

Additionally, foreign words were always interpreted as proper nouns – even in the Spanish phrase clarita del huevo, in which del is clearly not a proper noun. I would have thought that the Spanish might be easier to interpret, or perhaps to translate, than the Swahili sentence, as it is probably more commonly known, but Python does not do any translating and so struggles with anything that is not English. Dependencies thus cannot be determined correctly: in the sentence Tías called me blanca, palida, clarita del huevo the last three parts (blanca, palida, clarita del huevo) form a list, so all nouns and noun phrases are of equal standing and not dependent on each other, yet Python marks huevo as a dependent of palida.
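
To make it easier to see which token received which tag and which head, the annotation table can be printed directly. The following is only a minimal sketch of how such an inspection might look, assuming the seminar notebook uses spaCy with the small English model (en_core_web_sm); the exact labels may differ with other model versions.

```python
import spacy

# Assumes the small English model used in the seminar notebook;
# a different model may produce slightly different tags.
nlp = spacy.load("en_core_web_sm")

doc = nlp("Tías called me blanca, palida, clarita del huevo.")

# One row per token: text, part-of-speech tag, dependency label, and head.
for token in doc:
    print(f"{token.text:10} {token.pos_:6} {token.dep_:10} head={token.head.text}")
```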

The given example with Swahili words in it cannot be properly tagged at all as – according to Python – ‘Ayaaaana! Haki ya Mungu … aieee!’ The threat-drenched contralto came from the bushes to the left of the mangroves. ‘Aii, mwanangu, mbona wanitesa?’ is made up entirely of nouns or proper nouns. Only the English part could be identified correctly.

The English-Chinese example provides similar issues:

  • In Chinese, it is the same word ‘家’ (jia) for ‘home’ and ‘family’ and sometimes including ‘house’. To us, family is same thing as house, and this house is their only home too. ‘家’, a roof on top, then some legs and arms inside.

In this example, the Chinese hanzi are not tagged as proper nouns. Python interprets 家 as a noun once and as an adjective another time, marking it – quite nonsensically – as a dependent of legs.

Turning now to the novel I have been reading – A Concise Chinese-English Dictionary for Lovers by Xiaolu Guo – I looked for similar multilingual sentences to test in the programme. I could not find many, but tested the ones I did find:

  • ‘知识’ mean knowledge, ‘分子’ mean molecule.

知识 is correctly interpreted as a noun here, although as one single noun rather than a noun phrase consisting of a verb (知 – to know) and a noun (识 – knowledge).

The same goes for 分子, interpreted as one noun rather than a noun phrase consisting of a verb (分 – divide) and a noun (子 – son, child).

The word “mean” in this sentence is the verb “to mean”, but it is not conjugated correctly because the narrator is not yet fluent in English and struggles with English grammar. Python therefore tags it as an adjective. Accordingly, the dependencies turn out incorrect as well, as mean should function as the head of its sentence. Instead, knowledge becomes the head upon which all other words depend.
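
To check claims like this, one can ask spaCy directly which token it treats as the root of each sentence. Again, this is only a sketch assuming the small English model from the seminar; other versions might tag the verb differently.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed seminar model

doc = nlp("‘知识’ mean knowledge, ‘分子’ mean molecule.")

# For every sentence spaCy detects, show its root token and that token's POS tag.
for sent in doc.sents:
    print(f"sentence: {sent.text!r}")
    print(f"  root: {sent.root.text} ({sent.root.pos_})")
```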

  • 屁 is fart in Chinese. It is the word made up from two parts. 尸 is a symbol of a body with tail, and 比 underneath that represent two legs. That means fart, a kind of Chi.

In the third sentence, “represent” is interpreted as a dependent of “is” – again, likely due to the improper grammar the narrator uses. The hanzi are either not processed at all (屁), marked as a proper noun (尸), or tagged as a noun (比).

  • Chi (), everything to do with Chi is very important to us Chinese.

This sentence starts out somewhat elliptically. The phrase very important to us Chinese is tagged and interpreted correctly, but Python struggles with everything that comes before it, marking Chi () as a dependent of is. Evidently, Python cannot interpret the punctuation correctly, which in this case should indicate that the first phrase is separate from the sentence after the comma.

In the next two examples I wanted to test how Python deals with incorrect English grammar, but without the interference of non-English words, hypothesising that problems like the one with mean above would also occur here.

  • I feeling I can die for all kinds of situation in every second.

In this case, judging from previous errors Python made due to incorrect grammar and conjugation, I thought that feeling might be interpreted as a noun because the auxiliary “am” of the progressive form is missing. Surprisingly, Python has no problem recognising feeling as the verb it is indeed supposed to be, marking it correctly as the head/root of the entire sentence.

  • I scared by cars because they seems coming from any possible directing.

As above, I wanted to test how Python deals with these grammatical errors (seems instead of seem; directing instead of direction) – and again, surprisingly, all tokens were tagged and interpreted properly, with all their dependencies. Even directing was correctly identified as a noun instead of a progressive verb.

Evidently, annotating multilingual sentences correctly is not possible – at least not with the Python code we have been given. While the programme has no problem interpreting English sentences with incorrect grammar, it is thrown for a loop as soon as non-English words are introduced, which was a very interesting observation to make.

My experience with annotating multilingual sentences with Google Colab

My first experience with Google Colab was surprisingly positive. While the idea of working with programming software seemed overwhelming at first, when we actually got to try using it during the seminar, it seemed quite intuitive and easy to navigate with the help of Ms. Pardey and the other students. The concept of syntactic trees and dependency relations was something I initially struggled with, considering my Introduction to Linguistics lecture was in 2017 and I have since mostly stayed in the field of Literary and Cultural Studies. Combined with the abbreviations of the different tags within the programme, it was difficult for me to understand the results Google Colab was showing me. However, the class discussion and the glossary, as well as some revision using Google, were very helpful. What I also did was feed the programme simple sentences at first (think “I like apples.” or “What is your favourite colour?”) to see what the results would look like with less complicated sentence structures.
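
For anyone who wants to reproduce this warm-up step, running one of these simple sentences through spaCy might look roughly like the sketch below, assuming the small English model used in class is installed; tag names can vary between model versions.

```python
import spacy

# Load the small English pipeline (assumed to be the one used in the seminar).
nlp = spacy.load("en_core_web_sm")

doc = nlp("I like apples.")

# Print each token with its part-of-speech tag and dependency label.
for token in doc:
    print(token.text, token.pos_, token.dep_)
```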

When I tried to annotate my own sentences from the novel “On Earth We’re Briefly Gorgeous”, I came across some difficulties, as I cannot understand Vietnamese and thus first had to work out the structure of the Vietnamese passages myself in order to verify Google Colab’s results. What I discovered was that the programme struggled with the Vietnamese words, which, to me, seems inevitable because, as I understand it, the language model we are using is only trained for English(?). Because of this, the overall dependency relations were off, since my example sentences combined English and Vietnamese words in the same sentence, but mostly without an indicator like an English preposition or determiner. I could not yet find a way to fix this problem but am very interested in what a solution would look like in such cases.

I am interested to learn about the others’ experiences with Google Colab and am keen to learn more about computer-based analysis of multilingual text.

Conversion and Annotation of Susan Abulhawa’s „The Blue Between Sky and Water“, or A Demonstration of Software Failure through Anglocentrism?

Introduction

Having focused throughout my studies on reading literature through a postcolonial studies lens, as well as on Eurocentric bias in the field of linguistics, I found the idea of using conversion and annotation tools on a postcolonial, post-monolingual Anglophone novel intriguing. I was interested to see how they would deal with the novel I chose – Susan Abulhawa’s The Blue Between Sky and Water.

The novel by the Palestinian-American writer and human rights activist mixes Palestinian Arabic with English in a few different ways, although some patterns can be observed: Food items, terms for relatives, and culture-specific terms are usually written in latinized Arabic. Terms are usually introduced in italics once, and then re-appear throughout the novel un-italicised. As the software works with raw text, these specificities were lost. Apart from nouns, however, the novel includes verbs, adjectives, and phrases here and there in Palestinian Arabic as well, sometimes translated in the next sentences or before that, and sometimes not.

Sentence Choice

I chose seven sentences that show variation in the mixing of languages. For instance, one sentence I picked only includes Arabic nouns which denote food items:

                “One of her brothers arrived and they all shared a late breakfast of eggs, potatoes, za’atar, olive oil, olives, hummus, fuul, pickled vegetables, and warm fresh bread” (174).

Another contains only one adjective in Arabic:

                “‘Who is there?’ a woman’s voice asked in Arabic and Nazmiyeh relaxed upon hearing the Palestinian fallahi accent” (35).

Yet another contains an entire phrase:

                “The woman with wilted breasts began to sob quietly as others consoled her and banished the devil with disapproving eyes at Nazmiyeh – a’ootho billah min al shaytan – when a female soldier wheeled in a large box of clothes, and with a gesture of her hand, gave the naked women permission to get dressed” (114).

In addition, I chose other sentences which include nouns in different ways, to compare how the software deals with them. This was useful, as will become clear in the analysis of the mistakes the software made in the annotation.

Problems and Technical Difficulties

In general, working with the Jupyter-Notebook-style interface via Google Colab and Python worked out without any greater issues. The only problem I noticed was that double quotation marks could not simply be pasted into the code, as they also delimit the strings in the code and thus confuse it. Single quotation marks were acceptable to the software, so I replaced the double quotation marks with single ones.
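
Replacing the quotation marks works, but it does alter the text slightly. An alternative, sketched below, is to escape the double quotation marks or to wrap the sentence in triple quotes; this is standard Python string syntax rather than anything specific to our notebook, and the model name is assumed.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed to be the model from the seminar notebook

# Option 1: escape the double quotation marks inside a double-quoted string.
text_escaped = "\"Who is there?\" a woman's voice asked in Arabic."

# Option 2: triple quotes allow plain double quotation marks without escaping.
text_triple = '''"Who is there?" a woman's voice asked in Arabic.'''

doc = nlp(text_triple)
print([token.text for token in doc])
```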

Google Colab itself, however, was not as user-friendly. I found it rather unintuitive, and it often claimed I had too many tabs open at the same time. As other students recommended, clearing the history and waiting a few minutes seemed to solve that issue. It is, however, time-consuming.

Mistakes in the Annotation

Most of the time, Arabic words were labelled as PROPN, proper nouns. This was especially the case with the sentence including an entire phrase in Arabic. For instance, “a’ootho” should be labelled as a verb but was labelled as a proper noun. “billah” was labelled as one single proper noun, even though it should be labelled as a preposition in combination with a noun or proper noun (Allah). In other sentences, the software labelled Arabic nouns as adverbs. One example is “fuul”, which is a type of bean. Yet other nouns, such as “jomaa”, were incorrectly labelled as adjectives. So, while the most frequent mistake was to label any Arabic word as a proper noun, the software was not consistent in its mislabelling throughout.
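
One way to document this inconsistency is to collect every POS tag the model assigns to the Arabic forms across the chosen sentences. The sketch below is only an illustration: it assumes the same spaCy English model as in the notebook, and the watch list of Arabic forms is a hypothetical selection of mine.

```python
from collections import defaultdict
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed seminar model

sentences = [
    "One of her brothers arrived and they all shared a late breakfast of eggs, potatoes, za'atar, olive oil, olives, hummus, fuul, pickled vegetables, and warm fresh bread",
    "'Who is there?' a woman's voice asked in Arabic and Nazmiyeh relaxed upon hearing the Palestinian fallahi accent",
]

# Hypothetical watch list of the Arabic forms whose tags I wanted to compare.
arabic_words = {"za'atar", "fuul", "hummus", "fallahi"}

observed = defaultdict(set)
for sentence in sentences:
    for token in nlp(sentence):
        if token.text.lower() in arabic_words:
            observed[token.text.lower()].add(token.pos_)

# Print every tag observed for each form; more than one tag signals inconsistency.
for word, tags in sorted(observed.items()):
    print(word, "->", sorted(tags))
```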

The software was still able to locate the ROOT correctly, despite its confusion around the Arabic terms. Arabic words resembling English words were sometimes thought to be English words. For instance, “Um” (mother) was mistakenly labelled as an interjection.


Overall, this was an interesting experience, but I was disappointed that the software was unable to deal with Arabic words to such an extent, even though this was to be expected.

Converting and Annotating Multilingual Sentences & Quotes – My Experience

Initially, I was very unsure about this task because, as someone who has focused on literature during their studies (for good reason), I am neither that good at linguistics nor at programming or coding. While it seemed an intriguing task, there was also some apprehension on my part when looking at the Google Colab file for the first time. However, this was quickly overcome when we went through the steps one by one. I was curious to see how this program would deal with multilingual sentences, given that it is based purely on English.

I chose to examine A Concise Chinese-English Dictionary for Lovers. This book not only contains multilingual sentences as in a mixture of English and Chinese, it also contains grammatically incorrect English, as it is written from the point of view of the protagonist, who is learning English as the storyline progresses.

Because of this, I naturally ran into some issues. One of the sentences I chose to look at was this one: „Chinese we say shi yue huai tai (十月怀胎). It means giving the birth after ten months pregnant.“
When it came to the POS tagging, not only did the program categorise all of the pinyin Chinese words in the sentence as proper nouns, it also counted the Hànzì as one word instead of a phrase of four characters.
In addition to this, the word „Chinese“ was categorised as an adjective, as spaCy cannot recognise that it is meant as a noun here, since the sentence is not grammatically correct.
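
A quick way to see this is to print the tokens spaCy produces for the sentence. The sketch below again assumes the seminar’s small English model; because the English tokenizer splits on whitespace and Western punctuation, the four Hànzì, which are written without spaces, come out as a single token.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed seminar model

doc = nlp("Chinese we say shi yue huai tai (十月怀胎). It means giving the birth after ten months pregnant.")

# The English tokenizer has no rules for splitting Chinese script,
# so 十月怀胎 is expected to surface as one token.
print([(token.text, token.pos_) for token in doc])
```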

It was definitely interesting to see what the program made of the example sentences from A Concise Chinese-English Dictionary for Lovers, and even though I am still in the process of getting the hang of tokenisation and dependencies, I am interested to see what we do next.

Annotating Multilingual Sentences in „Hold“ by Michael Donkor: Twi and English

In his novel „Hold“, which was first published in 2018, Michael Donkor continually weaves Twi words into English sentences, thus constructing a multilingual narrative. When he uses words in Twi, the author highlights them and sets them apart by italicizing them. Though much could be said about the function of this practice, it is a secondary issue with regard to annotating sentences from the novel. Within the context of the seminar „Writing across Languages“, I am mainly interested in the form – italics being a part of it. Moreover, I am interested in how the multilingual sentences can be annotated and what challenges arise in doing so.

Multilingual Sample Sentences in Donkor’s Novel

There are different techniques that Donkor uses in order to establish multilingualism. I tried to choose sentences, words and phrases that show a variety of techniques the author uses. Often, phrases or whole sentences in Twi are used in dialogue.

‚What a polite and best-mannered young lady we have on our grounds this pleasant day. Wa ye adeƐ.

Donkor, Michael: Hold, p. 7

‚Me da ase,‘ Belinda said softly.

Donkor, Michael: Hold, p. 10

Other times, only one word is used in an English sentence. Here, as in almost all cases of multilingualism in Donkor’s novel, the Twi words are italicized.

Belinda worked the pestle in the asanka, using her weight against the ingredients, grinding the slippery onion and pepper.

Donkor, Michael: Hold, p. 28

In the tro tro on the way home from the zoo, Belinda had done her best to enjoy Mary’s sulking silence.

Donkor, Michael: Hold, p. 25

I have chosen eight sample sentences in total and pasted them into „Jupyter Notebook“. Though a few letters differed from the English alphabet and had to be inserted separately, I faced no technical difficulties. A variety of challenges arose, however, regarding the machine annotation of Twi.

Challenges in Annotating Twi in English Sentences

No Italics in Annotations

I have repeatedly mentioned that Donkor uses italics to signal multilingualism. There is no way to indicate italics within „Jupyter Notebook“. Thus, it would be impossible to use these annotations to analyze the use of italics in multilingual texts, whether with a diachronic lens or otherwise. Nor can any differing practices across languages be analyzed, since there is no way to mark italics and therefore to search for them later on.

Twi Words with the Same Form as English Words

There are a few instances within the chosen sample sentences where a Twi phrase includes a word that resembles an English word. „Jupyter Notebook“, being based on an annotation model for the English language, recognises the form of these words and classifies them according to the English POS. In the chosen, annotated sentences this issue applies to the words „me“ and „bone“. Since I lack the language skills in Twi to understand the words, I can neither confirm nor deny whether the classification happens to be correct. It shows, however, that there are challenges in differentiating the Twi words from the English words.

Classifying Twi Words Consistently

In general, there is a lack of consistency with respect to the classification of Twi words. The word „Aboa“ appears twice in two adjacent sentences. Still, the classifications for the word differ despite the identical form: first it is identified as „ADJ“, then as „PROPN“. Because the model has had no input from the Twi language, these parts of the sentences are not labeled correctly (see the sketch below the quotation).

‚Aboa!‘ Mary laughed. Aboa was Mother’s insult of choice too;

Donkor, Michael: Hold, p. 52
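
The inconsistency can be made visible by running the passage through the pipeline once and listing every occurrence of the word together with its tag. This is only a sketch under the assumption that the notebook uses spaCy’s small English model; the exact labels may well differ from the ones I saw.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed seminar model

doc = nlp("'Aboa!' Mary laughed. Aboa was Mother's insult of choice too;")

# List every occurrence of "Aboa" with the tag and dependency label it received.
for token in doc:
    if token.text == "Aboa":
        print(token.i, token.text, token.pos_, token.dep_)
```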

Technical Difficulties and First Results of Annotating Multilingual Sentences in „The Moor’s Account“

First of all, I have been finding this work very interesting, as it gives me a new perspective on literature. However, I encountered some technical difficulties while having Google Colab analyse my sentences. After three sentences, it told me I had reached my free limit and would have to purchase Colab Pro. This is probably because I tried to save my results by opening a new file for every sentence. Thus, I have only worked through a couple of examples so far, taken from the novel The Moor’s Account by Laila Lalami.

The first sentence I had annotated was „When I said, Bawuus ni kwiamoja, one of the women inevitably corrected me, Ni bawuus kwiamoja.“ (Lalami 175). Both times the words are used in the sentence, Colab tagged them as proper nouns, as if they were all names. The phrase is not translated in the novel, but the correction is explained as follows: “in Capoque […] the doer and the done-to were spoken of before the deed itself” (ibid.). This means that, presumably, “Ni” and “bawuus” are the doer and the done-to, while “kwiamoja” is a verb. It is very difficult to research this language, however; I suppose it is not spoken anymore.

Another sentence I let Colab process was “I whispered Ayat al-Kursi to myself” (123). “Ayat”, “al”, and “Kursi” are all tagged as proper nouns, which is acceptable, I believe, as the words refer to a specific verse in the Quran. In the dependency analysis, however, “Kursi” is shown as a dependent of the word “whispered” and as the main part of a compound (labelled dobj).

Hopefully, I will be able to go through more sentences and I am excited for the next steps.

Initial experiences in annotating multilingual text in Ocean Vuong’s „On Earth We’re Briefly Gorgeous“

I have always been fascinated by how literature and language studies have been influenced by digital fields such as coding and AI, so working on this project and studying the interdependencies between digital humanities and comparative studies has been a great learning experience. Having said that, it should be noted that I am not very tech-savvy, so the annotation assignment using Google Colab and Python initially seemed a little daunting. After using it together in class though, it became easier to work with the software and my initial confusion was cleared.

The text I’m working on is Ocean Vuong’s On Earth We’re Briefly Gorgeous, a mostly English text with a few Vietnamese words and phrases. As a novel, it challenges the conventional idea of a mother tongue as a source of identity and stability. Through the protagonist, a Vietnamese American speaker who translates between English and Vietnamese, Vuong portrays the mother tongue as something that is constantly changing and disconnected from its origins, like an orphan. Multilinguality, in the context of the novel, then becomes more of a multicultural discourse in which the experiences of Vietnamese people in America do not get translated.

Finding samples for annotation and analysis in such a text was a bit of a task because there were comparatively fewer multilingual passages. Nevertheless, the passages I did work with give a fairly comprehensive idea of how the software is heavily Anglo-centric and fails to correctly annotate non-English languages.

One of the sentences I used for analysis was: “Đẹp quá!” you once exclaimed, pointing to the hummingbird whirring over the creamy orchid in the neighbor’s yard. Vuong gives us a translation of the Vietnamese phrase in the next sentence of the novel: „It’s beautiful“. SpaCy recognises the words as PROPN and verb, but as is clear from the translation in the source text, neither of the words is a proper noun or a verb; they could roughly be classified as adjective and pronoun if we are to categorise based on POS. “Creamy”, used to describe the orchid, is also recognised as a noun and not an adjective by the software. Not being a student with a linguistics background, I have had difficulties understanding the dependency encoding part of the programme and then applying it to check whether it has been run correctly or not. But, as far as I have comprehended, the dependency relation indexing for English words and POS is accurate, while Vietnamese words are wrongly indexed (I suspect based on the already incorrect POS tags).

I also wanted to work with a sentence where Vietnamese was not used as direct speech, to see if the software recognises the words differently: You wanted to buy oxtail, to make bún bò huế for the cold winter week ahead of us. The novel doesn’t give an exact translation, but it is evident from context that this is a dish made of beef. A little independent research revealed that bún bò huế is a dish made of rice noodles and beef slices. The software almost correctly recognises “bún” and “huế” as nouns, but “bò” is wrongly found to be a verb. Expecting a program to recognise the cultural intricacies of a word might be expecting too much of it, but even in the instances where it did recognise something correctly, I am inclined to believe that it was merely a fluke.

“Tên tôi là Lan.” My name is Lan. was another sentence that I tried to annotate, and because I was having fun with the software by now, I also wanted to try annotating them as two separate sentences: just the Vietnamese and then just the English translation. The results were extremely interesting, and although I cannot fully explain why they varied (again, maybe a linguistics background would have helped), it goes to show that the spaCy software needs to be made more inclusive of languages that are not euro-/anglocentric.

The Vietnamese name is recognised as PROPN correctly, maybe based on the capitalisation. But the POS and hence the dependency indexing are incorrect for the Vietnamese. Comparing the two annotations shows that Python’s consistency in indexing and classifying varies for the same sentence depending on whether or not it is paired with English words. The deprel assigned to Lan also differs between the Vietnamese and the English sentence (npadvmod vs. attr).
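
To reproduce this comparison, one can run the mixed sentence and the two single-language sentences separately and look up the annotation of Lan in each. This is a minimal sketch, assuming the seminar’s spaCy English model; the specific labels (npadvmod, attr) may differ with other model versions.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed seminar model

texts = [
    "Tên tôi là Lan. My name is Lan.",  # Vietnamese plus English translation
    "Tên tôi là Lan.",                  # Vietnamese only
    "My name is Lan.",                  # English only
]

# Compare how "Lan" is tagged and attached in each context.
for text in texts:
    doc = nlp(text)
    for token in doc:
        if token.text == "Lan":
            print(f"{text!r}: Lan -> pos={token.pos_}, dep={token.dep_}, head={token.head.text}")
```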

This only goes to show that almost all software and translation tools are written in English-speaking countries and for the English-dominant market, and that this anglocentrism poses barriers in literary studies when it comes to studying multilingual texts and their correct and faithful translation.

My Experience Converting and Annotating Multilingual Sentences

I have always been intrigued by the potential of using coding and programming software when it comes to languages, but as someone with no background in coding/programming, I have also been intimidated by it. So when we first tried out Google Colab in a session of Writing Across Languages: The Post-Monolingual Anglophone Novel, I was very excited, and a bit skeptical about how it was going to turn out, especially considering the fact that we were going to try out multilingual sentences, and not plain English sentences. Since we tried it out first as a group, it was easier for me to set aside my anxiety associated with using a programming language, as most of us were new to this.

As expected, we did find some anomalies when it came to the POS tagging of multilingual sentences, considering that most of these tools depend on English. Initially, it was difficult for me to understand dependency relations (deprel) and to get the hang of the abbreviations that indicate POS tags and deprels. We discussed these in class, and I also did some independent research to understand the concepts better. What I understood is that getting the hang of it all only comes with practice. I also tried annotating some very common English sentences (such as „The quick brown fox jumps over the lazy dog.“) to get a better insight into how spaCy works with English sentences vs. multilingual sentences.
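
For these practice sentences, the dependency tree can also be drawn directly in the notebook, which I found easier to read than the table of abbreviations. A small sketch, assuming the usual spaCy English model and that the code is run inside Colab or Jupyter.

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")  # assumed seminar model

doc = nlp("The quick brown fox jumps over the lazy dog.")

# Render the dependency tree inline in the notebook.
displacy.render(doc, style="dep", jupyter=True)
```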

The novel I chose to work with is Arundhati Roy’s The Ministry of Utmost Happiness, which features many Indian languages in addition to English; the example sentences I worked with mostly contained Urdu/Hindi words. As we saw in class, spaCy tagged most Urdu/Hindi words as proper nouns, sometimes correctly and sometimes not. It was quite easy for me to spot the mistakes in the POS tagging due to my personal familiarity with these languages, and in some cases the literal translation follows the word in the text itself, through context or an explicit definition.

I have to mention that the initial fog has lifted at this point, and the process of understanding and identifying oddities did get better with every example I tried. But I do believe that the software is also a little confused at this point when it comes to the identification and tagging of non-English words, and there is great scope for improvement in this respect.

Thoughts and Problems while Converting and Indexing Multilingual Sentences of Abdulrazak Gurnah’s novel Afterlives

In the class „Writing across Languages: The Post-Monolingual Anglophone Novel“ we started working with digital tokenization, tagging, indexing and, later, annotating in order to – put very simply – take a look at how digital software reacts to multilingualism in novels. As most software and programmes are made in English-speaking countries for the English-speaking market and are hence almost exclusively in English, we are interested in how they perceive and annotate non-English words and phrases. Does their anglocentrism present us with problems, or will they actually understand non-English and annotate it correctly? (Small spoiler: they don’t.)

In my case I worked with Abdulrazak Gurnah’s novel Afterlives, in which multiple languages are part of the primarily English narration. I had no problems with any of the technical aspects of this step, so after putting my example sentences into the provided Google Colab template (based on Jupyter Notebook, which was simply too difficult to install), these are my main findings:

  • Our assumption that the programme declares all non-English words to be proper nouns proved almost entirely true – I think there were only a couple of examples where it identified them differently.
  • Sometimes the ROOT of a sentence was placed very oddly.
  • Punctuation marks are not treated as separate tokens/entities in the dependency tree.

Here are some examples:

She wrote: Kaniumiza. Nisaidie. Afiya. He has hurt me. Help me.

In this example both kaniumiza and nisaidie are declared proper nouns, while kaniumiza is a direct object and nisaidie a ROOT. Afiya is also a proper noun and a ROOT, which makes sense as it is a name and the only part of this one-word sentence. The others, however, do not make much sense, especially as the direct translation is given afterwards. I could understand all of them being a ROOT, but I just don’t understand why kaniumiza is seen as a direct object. It is also unfortunate that the programme does not seem to see the whole example as an entity in which the sentences correlate with each other on a semantic level, but only processes each sentence individually. If this were different, it might identify He has hurt me. Help me. as the translation of Kaniumiza. Nisaidie.
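
The last point can be seen in how spaCy segments the passage: it splits the text into sentences and parses each one independently, so no relation between a sentence and its English translation is ever considered. A minimal sketch, assuming the seminar’s English model; the segmentation itself may also vary between model versions.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed seminar model

doc = nlp("She wrote: Kaniumiza. Nisaidie. Afiya. He has hurt me. Help me.")

# Each detected sentence is parsed on its own, so there is no mechanism
# linking 'Kaniumiza. Nisaidie.' to 'He has hurt me. Help me.'
for sent in doc.sents:
    print(sent.text, "-> root:", sent.root.text, f"({sent.root.pos_})")
```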


’Jana, leo, kesho,‘ Afiya said, pointing to each word in turn. Yesterday, today, tomorrow.

This one confused me a lot: why is kesho a noun while the others are proper nouns? Also, jana is seen as a nominal subject, for which my only explanation is that the programme thinks it is the name Jana and not a word in another language. But how come leo is a conjunction and kesho a direct object? All of them should be indexed the same. I also do not understand why there is suddenly no ROOT at all among these three words, while other examples did get one. Additionally, this time there was also something I don’t quite understand in the indexing of the English words: why is tomorrow identified as the ROOT and not one of the other words? It is also quite sad that this time the programme – again – did not realise that the direct translation of the non-English words is part of this example.


After the third miscarriage in three years she was persuaded by neighbours to consult a herbalist, a mganga.

This one surprised me a bit, because mganga is not only seen as a noun but also as an appositional modifier, meaning that the programme realised it is another word for herbalist. This is the first – and only – time that an example from the African language (I’m very sorry, but I just could not discern which of the 125 Tanzanian languages it is) is indexed correctly.


I then thought that maybe it would be different with another language and tried this example:

They were proud of their reputation for viciousness, and their officers and the administrators of Deutsch-Ostafrika loved them to be just like that.

However, even with German and the official name of a region (Deutsch-Ostafrika) there were problems. Though both parts of the word are seen as proper nouns – which is correct – only Deutsch is labelled a compound, while Ostafrika is seen as the object of a preposition. This is not necessarily incorrect; however, Deutsch-Ostafrika is one word, even if it is hyphenated. Hence, in my understanding, both parts of the word should be seen as a compound and, together, as the object of a preposition.
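
What is most likely happening under the hood is the tokenisation: the English tokenizer splits the hyphenated name into separate tokens (Deutsch, the hyphen, Ostafrika), which then receive separate dependency labels. A small sketch to check this, again assuming the seminar’s spaCy English model.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed seminar model

doc = nlp("They were proud of their reputation for viciousness, and their officers and the administrators of Deutsch-Ostafrika loved them to be just like that.")

# Show how the hyphenated name is split into tokens and which label each piece receives.
for token in doc:
    if token.text in ("Deutsch", "-", "Ostafrika"):
        print(token.text, token.pos_, token.dep_, "head:", token.head.text)
```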


And lastly, another example with German: 

He looked like a … a schüler, a learned man, a restrained man.

Here, the programme did identify schüler correctly – as a noun and as the object of the preposition like. I was quite impressed with that, and what impressed me even more was the fact that it also identified learned man and restrained man as appositional modifiers of schüler. This is the only example sentence in which not only the POS tagging but also the indexing and dependency relations are correct. My only explanation for this is that schüler is also a word used within the English language, though it is an old and not commonly used one (see OED), and hence known to English dictionaries.


Lastly, I want to say that I actually had kind of fun doing this. Yes, I had to look up some of the linguistic definitions, especially for the dependency relations, but overall it was fun, if a bit infuriating at times when the programme made the same mistakes again and again. So I’m looking forward to the next step of the process.