My experience with annotating multilingual sentences with Google Collab

My first experience with Google Colab was surprisingly positive. While the idea of working with a programming software seemed overwhelming at first, when we actually got to try using it during the seminar, it seemed quite intuitive and easy to navigate with the help of Ms. Pardey and the other students. The concept of syntactic trees and dependency relations was something I initially struggled with, considering my Introduction to Linguistics lecture was in 2017, and I have since mostly stayed in the field of Literary and Cultural Studies. Combined with the abbreviations of the different tags within the programme, it was difficult for me to understand the results Google Colab was showing me. However, the class discussion and the glossary, as well as some revision using Google were very helpful. What I also did was feed the programme simple sentences at first (think “I like apples.” “What is your favourite colour?”) to see what the results would look like with less complicated sentence structures.

When I tried to annotate my own sentences from the novel “On Earth we’re briefly gorgeous”, I came across some difficulties, as I cannot understand Vietnamese and thus had to find out the structure of the Vietnamese passages myself first in order to verify Google Colab’s results. What I discovered was that the programme struggled to determine the Vietnamese words, which, to me, seems to be inevitable because, as I understand it, the programming language is only suitable for the English language(?). Because of this, the overall dependency relations were off, since my example sentences combined English and Vietnamese words in the same sentence but mostly without an indicator like an English preposition or determiner. I could not yet find a way to fix this problem but am very interested in what would be a solution in such cases.

I am interested to learn about the others’ experiences with Google Colab and am keen to learn more about computer-based analysis of multilingual text.

Technical Difficulties and First Results of Annotating Multilingual Sentences in „The Moor’s Account“

First of all, I have been finding this work very interesting, getting a new perspective on literature. However, I encountered some technical difficulties while having Google Collab analyse my sentences. After three sentences, it told me I had reached my free limit and would have to purchase Collab Pro. This is probably because I tried to save my results by opening a new file for every sentence. Thus, I have only worked through a couple of examples so far, taken from the novel The Moor’s Account  by Laila Lalami.

The first sentence I had annotated was, „When I said, Bawuus ni kwiamoja, one of the women inevitably corrected me, Ni bawuus kwiamoja.“ (Lalami 175). Both times the words are used in the sentence, Collab tagged them as proper nouns, as if they were all names. The phrase is not translated in the novel, but the correction is as follows: “in Capoque […] the doer and the done-to were spoken of before the deed itself” (ibid.). This means that, presumably, “Ni” and “bawuus” are the doer and the done-to, while “kwiamoja” is a verb. It is very difficult to research this language, however; I suppose it is not spoken anymore.

Another sentence I let Collab process was, “I whispered Ayat al-Kursi to myself“ (123). “Ayat”, “al”, and “Kursi” are all tagged as proper nouns, which is acceptable, I believe, as the words refer to a specific verse in the Quran. In the dependency analysis, however, it says that “Kursi” is dependent on the word “whispered” and the major part of a compound (dobj).

Hopefully, I will be able to go through more sentences and I am excited for the next steps.

My Experience Converting and Annotating Multilingual Sentences

I have always been intrigued by the potential of using coding and programming software when it comes to languages, but as someone with no background in coding/programming, I have also been intimidated by it. So when we first tried out Google Colab in a session of Writing Across Languages: The Post-Monolingual Anglophone Novel, I was very excited, and a bit skeptical about how it was going to turn out, especially considering the fact that we were going to try out multilingual sentences, and not plain English sentences. Since we tried it out first as a group, it was easier for me to set aside my anxiety associated with using a programming language, as most of us were new to this.

As expected, we did find some anomalies when it came to POS tagging of multilingual sentences, considering that most of these tools depend on English. Initially, it was difficult for me to understand dependency relations (deprel) and get a hang of the abbreviations that indicate POS tagging and deprel. We discussed these in class, and I also did some independent research to understand these concepts better. What I understood is that getting a hang of it all only comes with practice. I also tried annotating some very common English sentences (such as „The quick brown fox jumps over the lazy dog.“) to get a better insight of how SpaCy works with English sentences vs. multilingual sentences.

The novel I chose to work with is Arundhati Roy’s The Ministry of Utmost Happiness, which has the presence of many Indian languages in addition to English, and the example sentences I worked with mostly had Urdu/Hindi words. As we saw in class, SpaCy tagged most Urdu/Hindi words as Proper Noun, sometimes correctly, and sometimes not. It was quite easy for me to figure out the mistakes in POS tagging due to my personal familiarity with these languages, and in some cases, the literal translation follows the word in the text itself, by context or definition.

I have to mention that the initial fog has lifted at this point, and the process of understanding and identifying oddities did get better with more examples I tried. But I do believe that the software also is a little confused at this point when it comes to the identification and tagging of non-English words, and there is a great scope of improvement in this aspect.

Thoughts and Problems while Converting and Indexing Multilingual Sentences of Abdulrazak Gurnah’s novel Afterlives

In the class „Writing across Languages: The Post-Monolingual Anglophone Novel“ we started working with digital tokenization, tagging, indexing and later annotating in order to – very broken down – take a look at how digital softwares react to multilingualism in novels. As most softwares and programmes are made in English-speaking countries for the English-speaking market and are hence almost exclusively in English, we are interested in how they perceive and annotate non-English words and phrases. Does their anglocentricism provide us with problems or will they actually understand non-English and annotate it correctly? (Small spoiler: they don’t.)

In my case I worked with Abdulrazak Gurnah’s novel Afterlives in which multiple languages are part of the primarily English narration. I had no problems with any of the technical aspects of this step, so after putting my example sentences into the provided Google Collab template (on the basis of Jupyter Notebook which just was too difficult to install), these are my main findings:

  • Our assumption that it defines all non-English words as proper nouns did become almost exclusively true – I think there were only a a couple of examples where it identified them differently
  • Sometimes the ROOT of a sentence was very weirdly placed.
  • Punctuations are not seen as separate tokens/entities in the dependency tree.

Here are some examples:

She wrote: Kaniumiza. Nisaidie. Afiya. He has hurt me. Help me.

In this example both kaniumiza and nisaidie are declared as proper nouns while kaniumiza is a direct object and nisaidie a ROOT. Afiya is also a proper noun and a ROOT, which makes sense as it is a name and the only part of this one-word sentence. However, the others do not make much sense, especially as the direct translation is given afterwards. I could understand all of them being a ROOT, but I just don’t understand why kaniumiza is seen as a direct object. It’s also unfortunate that the programme does not seem to see the whole example as an entity in which the sentences correlate with each other on a semantic level, but only sees it per individual sentence. Because if this were different, it would identify He has hurt me. Help me. as the translation of Kaniumiza. Nisaidie.

’Jana, leo, kesho,‘ Afiya said, pointing to each word in turn. Yesterday, today, tomorrow.

This one confused me a lot: Why is kesho a nouns while the others are proper nouns? Also jana is seen as a nominal subject, for which I only have the explanation that the programme thinks it is the name Jana and not a word in another language. However, how come leo is a conjunction and kesho a direct object? All of them should be indexed the same. I also do not understand why we suddenly have no ROOT at all in these three words, while this happened with other examples. Additionally, this time there was also something I don’t quite understand in the indexing of the English words: Why is tomorrow identified as the ROOT and not one of the other words? It is also quite sad that this time, the programme – again – did not realise that the direct translation of the non-English is part of this example.

After the third miscarriage in three years she was persuaded by neighbours to consult a herbalist, a mganga.

This one surprised me a bit. Because not only is mganga seen as a noun, but also as an appositional modifier, meaning that the programme realised it is another word for herbalist. This is the first – and only – time that an example of the African language (I’m very sorry, but I just could not discern which of the 125 Tanzanian languages it is) is indexed correclty.

I then thought that maybe it would be different with another language and tried this example:

They were proud of their reputation for viciousness, and their officers and the administrators of Deutsch-Ostafrika loved them to be just like that.

However, even with German and the official name of a region (Deutsch-Ostafrika) there were problems. Though both parts of the word are seen as proper nouns – which is correct – only Deutsch is seen as a compound, while Ostafrika is seen as the object of a preposition. This is not necessary incorrect, however, Deutsch-Ostafrika is one word, even if it is hyphenated. Hence, in my understanding, both parts of the world should be seen as a compound and they together as the object of a preposition. 

And lastly, another example with German: 

He looked like a … a schüler, a learned man, a restrained man.

Here, the programme did identify schüler correctly – as a noun and as the object of the preposition like. I was quite impressed with that, and what impressed me even more was the fact, that it also identified learned man and restrained man as appositional modifiers of schüler. This is the only example sentence in which not only the POS-tagging but also the indexing and dependency relation is correct. My only explanation for this is, that schüler is also a word used within the English language, though it is an Old-English word and not commonly used (see OED), and hence known to English dictionaries.

Lastly, I want to say that I actually had kind of fun doing this. Yes, I had to look up some of the linguistic definitions, especially with the dependency relations, but overall it was fun. And a bit infuriating at times when the programme made the same mistakes again and again. So I’m looking forward to the next step of the process.

Initial thoughts on converting & annotating multilingual sentences + my experiences with examples from Yvonne Adhiambo Owuor’s „Dust“

From a literary translation point of view, I am very interested in the work with multilingual anglophone texts as it is something that I will probably come across quite a bit in the actual work as a translator of novels from English to German. The seminar „Writing Across Languages: The Post-Monolingual Anglophone Novel“ deals precisely with this topic in the context of the digital humanities. And at first, digital humanities didn’t sound too overwhelming to me. Because like most students, I work with a laptop every day.

When it came to actually converting and annotating some multilingual sentences, and hearing words such as programming language, the visualization of data and code during the preparation though, I have to admit I developed quite a trepidation. After all, one of the reasons I decided to work with texts was because I try to stay away from advanced technology that goes beyond the surface as far as I possibly can and when I have problems with my technical devices, I’d rather ask someone who is familiar with it, than despair on my own. So, I was very relieved that we neither had to write code ourselves nor were we required to have any previous experience in the field. What really helped me was that we worked with an example sentence all together in class. This way, it did not feel as scary, since I only had to follow the instructions I was given with immediate feedback and any issues that occurred could be solved together immediately. So, there was no need to despair. One problem that did occur with the first example sentence was, when copying the sentence into the cell, there need not be any paragraphs, otherwise the code will not work. If I had encountered the problem on my own, I am sure it would have been much more difficult and also time-consuming to find out what I had done wrong (or whether I had broken something – deleted the whole internet, who knows).

After this first shared experience it was much less of an effort to get started on my own. And it was even fun to start experimenting with the sentences from the novel „Dust“ that I had read in preparation and to check, what mistakes the English-based code causes. The errors I came across the most were the categorizations of many of the non-English words (in the case of the novel „Dust“ by Yvonne Adhiambo Owuor these are words from Kiswahili, Latin, Spanish and a local variety of English) as proper nouns and nouns in the cell for the part of speech and as compounds in the cell concerned with the dependency encoding. This tendency is especially strong, if several non-English words follow each other or at least in these cases it becomes most obvious for the observer because the annotation makes it look like there are whole sentences just consisting of proper nouns and nouns without any verbs. The encoding of the dependency also seemed to be quite arbitrary. In some cases, tokens that were far removed from each other and seemingly were not connected to each other (including punctuation) were categorized as dependent. Punctuation was another issue, especially in the visualization of the dependencies. Usually, when the first token was an inverted comma, it was also visualized as a single token in the dependency tree. With the other punctuations this was not the case, they got assigned to another token and these two tokens where then depicted as one. The merging of tokens also seemed quite arbitrary because, especially with non-English words, sometimes the word in front of the punctuation and sometimes the word following it were merged with the punctuation token.

Now that I have dealt with some examples myself, I am very interested to see what mistakes the others found with their example sentences and what the next steps of the analysis of our findings will look like!

Problems (and correct classifications) in annotating training and example sentences in different languages from R. F. Kuang’s „Babel“: My experiences

Within the context of our seminar „Writing Across Languages: The Post-Monolingual Anglophone Novel“, our task was to test, how the software „Jupyter Notebook“, equipped with an English database, classified foreign words in a novel that is mostly written in English. The relevant categories were parts of speech and dependency tags. As „Jupyter Notebook“ was too tiresome to install, we worked with a copy of a Jupyter notebook in Google Collab instead. We had two example sentences which we could use in order to become acquainted with the software. Our main job was to read our novel and to note down examples of multilingual text passages, so that they could be annotated by the software.

This preparation for these sentences to be annotated by the software posed a few problems for me. My first problem was that my book, „Babel“ by R.F. Kuang, uses a lot of Chinese words and these words are sometimes presented as a Chinese character. I had some problems with copying the Chinese characters. The problem was not so much what the character meant or to what it translated in English, as this was indicated most of the time, but I had no idea on how to copy the Chinese characters, especially as the Kindle app does not allow copying words or phrases from the app. My initial idea was to enter the romanized version of the character or its meaning in English into the Google translator and to then just copy the Chinese character from there. However, this didn’t work because it already said in the book that the usage of this character for this meaning was quite unusual and the Google translator only indicates one possible way of writing a Chinese word as a Chinese character. My second idea was to take a photo, to copy the Chinese character from there and to then paste it into my document. This also didn’t work because I couldn’t copy the characters in the apps that I have tested either. After some unsuccessful tries with apps in which the user can draw the Chinese character and during which the Chinese characters could not be recognized by these apps, I ended up on this website: This website also allows the user to draw the Chinese character and then guesses what Chinese character you drew, but it seems to have a much larger database than the apps I tested. Here is an example:

In this example, I wanted to draw the first character in the suggestions below. In the cases of multiradical characters, meaning Chinese characters which consist of more than one radical, I had to choose the option „Multiradical and then chose characters from a large list. Here is also an example:

These two methods take a lot of time, of course, but in the end, I managed to find all the characters that I needed.

My second, more minor problem while copying the text from the app into my document were the horizontal lines above vowels in romanized Chinese writing and also in Latin. In my research, I learned that in Latin, these lines show the stress of a particular vowel. One way or another, I knew that I had to indicate these lines above a vowel somehow. In the end, I found a Q&A page onn which a user indicated how to type these lines above vowels on the keyboard. This is the page I used: Just like the website with the Chinese characters, this isn’t an academic website, but for my purpose, it sufficed.

Annotating the training sentences

As I mentioned earlier, I first went through the example sentences. While going through each of them individually, I will name a few mistakes that the software made along with some correct annotations which are decidedly fewer. The mistakes that I name are usually first about words that are foreign to the English language, and then, if there are any, mistakes which were made concerning words which belong to the English language. Concerning the correct annotations, I only mention those words which are non-English, as it should be the norm that English words are annotated correctly, considering that the data with which the software was trained, was written in English. The first training sentence was:

Tías called me blanca, palida, clarita del huevo and never let me bathe in the sun. While Leandro was tostadito, quemadito como un frijol, I was pale.

Lickorish Quinn 129

The mistakes that the software made which I recognized were that “palida” was categorized as a conjugation (correct: adjective) and that “frijol” was classified as an attribute. Concerning the English words, the only mistake that I recognized was that „let“ was classified as a conjunction (correct: auxiliary). Some correct decisions that the software made were that “Tías” was classified as a nominal subject, that “blanca” was classified as an object predicate, that “clarita del” was classified as a compound, but that huevo was classified as a conjunction and that “tostadito” was classified as an adjectival modifier.

The second training sentence which was annotated was:

In Chinese, it is the same word ‘家’ (jia) for ‘home’ and ‘family’ and sometimes including ‘house’. To us, family is same thing as house, and this house is their only home too. ‘家’, a roof on top, then some legs and arms inside.

Guo 125-26

The mistakes here were the following: the first “家” was classified as an appositional modifier, but also as a noun (which is correct), the second “家” was classified as an unclassified dependent and thus radically differs from the first annotation, and “jia”, which is the romanized version of the Chinese character “家”, was categorized as a proper noun (correct: noun) and an appositional modifier. Concerning the English words, there were also a few mistakes: “family” was considered a conjunction (correct: noun), “is” was classified as a conjunction (correct: auxiliary), “legs” was classified as a noun (which is correct) and as a root and “arms” was classified as a conjunction (correct: noun).

Annotating the quotes from „Babel“

As I was curious, how the software would react to sentences with Latin words, I started with a fairly easy one:

But in Latin, malum means “bad” and mālum,’ he wrote the words out for Robin, emphasizing the macron with force, ‘means “apple”.

Kuang 25

The mistakes in the annotation were that the first “malum” was categorized as a noun (correct: adjective) and a nominal subject (correct: adjectival subject) and that the second „mālum“ was classified as a proper noun (correct: noun) and a conjunction (correct: subject). The English words in this sentence, however, were categorized correctly. To me, this shows that the software does not understand the sentence because otherwise, it would have recognized that „malum“ means „bad“ and is thus an adjective and that „mālum“ means „apple“ and is thus a noun.

Okay, a sentence with Latin was not annotated successfully. Let’s see whether a Chinese word is better. An example would be:

Wúxíng – in Chinese, ‘formless, shapeless, incorporeal’. The closest English translation was ‘invisible’.

Kuang 65

Nope, the annotation to this sentence was even worse. First of all, “Wúxíng” was classified as a proper noun (correct: noun) and as a root (which could be correct, as there is no main verb which could be the root of the sentence). Furthermore, there are a few English words which were not annotated correctly: “Chinese” was classified as a proper noun (correct: noun) and “formless” was classified as an adjectival modifier, while “shapeless” was classified as an adverbial modifier (correct: probably adjectival modifier).

Now, I was invested and wanted to get to the bottom of this weird annotation of foreign words by the Jupyter Notebook. That’s why I chose this as my third sentence from „Babel“:

Que siempre la lengua fue compañera del imperio; y de tal manera lo siguió, que junta mente començaron, crecieron y florecieron, y después junta fue la caida de entrambos.

Kuang 3

Honestly, I expected this sentence to be full of either classifications as proper nouns or full of mistakes. The mistakes in this sentence concerning the foreign words were that “crecieron” and “florecieron” were classified as nouns (correct: verbs), “de” was classified as unknown (correct: preposition) and that the rest of the words were classified as proper nouns. Concerning the clausal level, the words were either classified as compounds or as appositional modifiers. Interestingly, the software correctly recognized “la” as a determiner and “imperio” as a direct object. I don’t know whether the software was just lucky with these two annotations or whether the place of the words in these sentences somehow influenced the correct annotation. One or way or the other, it is clear that this software cannot annotate words in a language other than English very consistantly.

As I recognized that it made little sense to enter entire sentences in another language into the software, I wanted to make the work for the software as easy as possible. Thus, I returned to Chinese and entered the following sentences next:

He will learn. Tā huì xué. Three words in both English and Chinese. In Latin, it takes only one. Disce.

Kuang 26

I was hoping that the software would recognize that the first two sentences were a translation from each other and that they thus would be categorized correctly. I was disappointed because in English, the words “He will learn“ were correctly classified as a pronoun, an auxiliary and as a verb, while in Chinese, “Tā huì xué” were classified as a noun, a noun and a verb, which is, based on the English translation (as I don’t speak Chinese) is not correct. Apart from that, “English”, “Latin”  and, unsurprisingly, “Disce” were categorized as proper nouns, while actually, „English“, and „Latin“ are nouns, while „Disce“ is a verb. One further mistake consisted in the software annotating „Chinese“ as a conjunction (correct: noun). The different annotations of one and the same sentence in different languages confirms my assumption that the software does not actually understand the words that it annotates.

Okay, Chinese didn’t work. I was curious whether another language which is closer to English would help. So, I chose two sentences in French from Babel:

‘Ce sont des idiots,’ she said to Letty. / ‘Je suis tout à fait d’accord,’ Letty murmured back.

Kuang 71

The result from this annotation was also disappointing. “Ce” was classified as a verb (correct: determiner), “sont” (correct: verb) and “des” (correct: article) as an adjective, “Je” (personal pronoun) and “tout à fait d’accord” were classified as proper nouns, “ce” was classified as an adverbial clause, “sont” and “des” was classified as adjectival modifiers, “suis” was classified as a clausal complement, and “à fait” was classified as a compound, while “tout” and “d’accord” were classified as a noun phrase as adverbial modifier and as a direct object, respectively. Interestingly, the words “idiots” as a noun and as a direct object and “suis” as a verb were classified correctly.

Okay, a closer language like French didn’t work. Maybe German could be a better help for the software to annotate the foreign word correctly. That’s the reason why I chose this sentence:

But heimlich means more than just secrets.

Kuang 81

I figured that, as “heimlich”, or rather, its negative counterpart “unheimlich”, is often used in the context of horror literature, maybe the software would be able to recognize this word and thus annotate it correctly. However, I was, again, disappointed, as “Heimlich” was classified as a noun (correct: adjective) and a nominal subject (correct: adjectival subject).  

Next, I was again intrigued by the Chinese language and I wanted to know whether the results of the Chinese character and its translation in the training sentence above were just a coincidence. So, I chose the following sentence, which is similar to the training sentence, with a Chinese character:

Why was the character for ‘woman’ – 女 – also the radical used in the character for ‘slavery’? In the character for ‘good’?

Kuang 110

The answer was: no, the results from the training sentence were not a coincidence: “女” was classified as unidentified (correct: noun) and as an appositional modifier. Furthermore, an English word was also categorized incorrectly: “radical” was wrongly classified as an adjective (correct: noun). The classification of „In“ as the root could be correct, as the sentence „In the character for ‚good‘, there is no verb which could be the root of the sentence.

Okay, now, Chinese was for me out of the question. Maybe the French and German words were not established enough in the English language. So, I decided to choose a sentence with a French word which I have already heard to have been used in other English sentences:

‘It’s not the company, it’s the ennui,’ he was saying.

Kuang 144

Well, “ennui” was, indeed, correctly classified as a noun and as an attribute. However, the particle “’s” was classified as a clausal complement. But nevermind, we’re making progress concerning the annotation of the non-English words.

Next, I was interested in how the software would handle Greek words. As an example sentence, I chose:

The Greek kárabos has a number of different meanings including “boat”, “crab”, or “beetle”.

Kuang 156

Just like the sentence before, the foreign word was classified correctly, in this context as a noun and as a nominal subject. However, now the software seemed confused about the classification of the English words: “including” was considered a preposition, but also to be a verb, “crab” was considered as a noun, but also as a conjunction, while “beetle” was considered to be a verb (correct: noun) and a conjunction.

Okay, nice, the Greek word was also no problem for the software. As I am a native German speaker, I wanted to give a second chance to the annotation of a German word in my last example. I chose this as an example sentence:

‘The Germans have this lovely word, Sitzfleisch,’ Professor Playfair said pleasantly when Ramy protested that they had over forty hours of reading a week.

Kuang 168

Concerning the German word “Sitzfleisch”, I was disappointed because the software classified it as a proper noun (correct: noun) and as an appositional modifier. Concerning the English words, there were also some mistakes: “Professor” was classified as a proper noun (correct: noun) and a compound, while “Playfair” was classified as a nominal subject, “have” was classified as a clausal complement, “when” was classified as an adverbial modifier and “protested” was classified as an adverbial clause modifier.

A more general problem that I encountered while annotating these sentences was that some tags like “oprd” were neither mentioned in the table we received in order to recognize the clauses and the parts of speech, nor were they mentioned in the spaCy article. Instead, I found this website, which helped me with the abbreviations:

Another, very minor technical problem concerned the work with Google Collabs. The program sometimes gave me feedback that I had opened too many notebooks that were not closed yet although I had closed them. To solve this problem, however, I simply had to click “Close all notebooks except the current” or something like that and then I could continue annotating.

On a more positive note, the software consistently succeeded in the indexing of the tokens and in classifying punctuation marks as such. The only exception that I found was the apostrophe in „’s“.

Comprehension questions

I don’t really think that I still have any comprehension questions, I am just not quite sure whether I correctly assessed the parts of speech and the dependency tags because the linguistics class in which I learned these terms was quite a long time ago in about 2017. That is also the reason why I mostly didn’t indicate the correct dependency tags if they were obviously wrong. I googled or try to look up the abbreviations and what they meant, of course, but I am still not quite sure whether there aren’t still some mistakes that I made in this regard. That’s why I also didn’t check whether the tree at the bottom of the interface was correct. There could be some answers as to why the software classifies some words the way it does, but right now, I didn’t see a systematic wrong approach to foreign words except that they are often classified as proper nouns and as compounds. I was also not quite sure how to write this blog entry as I haven’t written a blog entry yet and I am not that into blogs myself.