Multi-Lingual Annotations in Python with „A Concise Chinese-English Dictionary for Lovers“

This was not my first time working with Python to figure out the tokens of sentences via computational means. However, my last time working with Python (in a linguistics BA seminar with Prof. Kevin Tang) was some time ago, so I appreciated how easy it was made to run our example sentences through it.

Running the three example sentences given to us, I noticed that punctuation marks indicating speech are often interpreted as part of words or compounds and receive a POS tag together with them, which makes it difficult to check which tokens refer to which words.

Additionally, foreign words were always interpreted as proper nouns – even in the Spanish phrase clarita del huevo, in which del is clearly not a proper noun. I would have thought that Spanish might be easier to interpret, or perhaps translate, than the Swahili sentence, as it might be more commonly known, but Python does not do any translating and so struggles with anything that is not English. Dependencies can thus not be determined correctly: in the sentence Tías called me blanca, palida, clarita del huevo the last three parts (blanca, palida, clarita del huevo) are a listing; all nouns or noun phrases stand on an equal footing and are not dependent on each other, yet Python marks huevo as a dependent of palida.
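For anyone who wants to retrace these observations, here is a minimal sketch of the kind of spaCy check that produces these POS and dependency labels. It assumes the small English model en_core_web_sm, which may not be the exact model used in the seminar notebook, and it only prints whatever the model outputs, not a claim about the correct analysis.

    import spacy

    # Load the English pipeline; non-English words are still pushed through English tagging.
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("Tías called me blanca, palida, clarita del huevo and never let me bathe in the sun.")

    # Print each token with its part-of-speech tag, dependency label and head.
    for token in doc:
        print(f"{token.text:10} {token.pos_:6} {token.dep_:10} head={token.head.text}")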

The given example with Swahili words in it cannot be properly tagged at all: according to Python, ‘Ayaaaana! Haki ya Mungu … aieee!’ The threat-drenched contralto came from the bushes to the left of the mangroves. ‘Aii, mwanangu, mbona wanitesa?’ is made up entirely of nouns and proper nouns. Only the English part could be identified correctly.

The English-Chinese example provides similar issues:

  • In Chinese, it is the same word ‘家’ (jia) for ‘home’ and ‘family’ and sometimes including ‘house’. To us, family is same thing as house, and this house is their only home too. ‘家’, a roof on top, then some legs and arms inside.

In this example, the Chinese hanzi are not computed as proper nouns. Python interprets 家 as a noun once and an adjective another time, marking it – quite nonsensically – as a dependent of legs.

Now looking at the novel I have been reading – A Concise Chinese-English Dictionary for Lovers by Xiaolu Guo – I looked for similar multilingual sentences to test in the programme. I couldn’t find many such sentences, but tested the ones I did find:

  • ‘知识’ mean knowledge, ‘分子’ mean molecule.

知识 is here correctly interpreted as a noun, although it is interpreted as one singular noun rather than a noun phrase consisting of a verb (知 – to know) and a noun (识 – knowledge).

The same goes for 分子, which is interpreted as one noun rather than a noun phrase consisting of a verb (分 – divide) and a noun (子 – son, child).

The word “mean” in this sentence is the verb “to mean”, but it is not conjugated correctly because the narrator is not yet fluent in English and struggles with English grammar. Python thus tags it as an adjective. Accordingly, the dependencies turn out incorrect as well, since mean should function as the head of the sentence. Instead, knowledge becomes the head upon which all other words depend.

  • 屁 is fart in Chinese. It is the word made up from two parts. 尸 is a symbol of a body with tail, and 比 underneath that represent two legs. That means fart, a kind of Chi.

In the third sentence, “represent” is interpreted as a dependent of “is” – again, likely due to the improper grammar the narrator uses. The hanzi are either not computed at all (屁), marked as a proper noun (尸) or marked as a noun (比).

  • Chi (), everything to do with Chi is very important to us Chinese.

This sentence starts out somewhat elliptically. The phrase very important to us Chinese is tagged and interpreted correctly, but Python struggles with everything that comes before it, marking Chi () as a dependent of is. Evidently, Python cannot interpret the punctuation correctly, which in this case should indicate that the first phrase is separate from the clause after the comma.

In the next two examples I wanted to test how Python deals with incorrect English grammar without the interference of non-English words, hypothesising that situations like the mean example above would also occur here.

  • I feeling I can die for all kinds of situation in every second.

In this case, given the previous errors Python made due to incorrect grammar and conjugation, I thought that feeling might be interpreted as a noun because the auxiliary “am” of the progressive form is missing. Surprisingly, Python has no problem recognising feeling as the verb it is indeed supposed to be, correctly marking it as the head/root of the entire sentence.

  • I scared by cars because they seems coming from any possible directing.

Similarly to the above, I wanted to test how Python deals with these grammatical errors (seems instead of seem; directing instead of direction) – again, surprisingly, all tokens were tagged and interpreted properly with all their dependencies. Even directing was correctly identified as a noun instead of a progressive verb.

Evidently, annotating multi-lingual sentences correctly is not a possibility – at least not with the Python code we have been given. While the programme has no problem interpreting English sentences with incorrect grammar, it is thrown for a loop as soon as non-English words are introduced, which was a very interesting observation to make.

Converting and Annotating Multilingual Sentences & Quotes – My Experience

Initially, I was very unsure about this task, because as someone who has focused on literature during their studies (for good reason), I am neither that good at linguistics nor at programming or coding. While it seemed an intriguing task, there was also some apprehension on my part when looking at the Google Colab file for the first time. However, this was quickly overcome when we went through the steps one by one. I was curious to see how this program would deal with multilingual sentences, given that it is based purely on English.

I chose to examine A Concise Chinese-English Dictionary for Lovers. This book not only contains multilingual sentences as in a mixture of English and Chinese, it also contains grammatically incorrect English, as it is written from the point of view of the protagonist, who is learning English as the storyline progresses.

Because of this, I naturally ran into some issues. One of the sentences I chose to look at is this one: „Chinese we say shi yue huai tai (十月怀胎). It means giving the birth after ten months pregnant.“
When it came to the POS tagging, not only did it categorise all of the pinyin Chinese words in the sentence as proper nouns, it also counted the Hànzì as one word instead of a whole sentence.
In addition to this, the word „Chinese“ was categorised as an adjective: because the sentence is not grammatically correct, spaCy is incapable of recognising that it is meant as a noun.
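To double-check how the tokenizer and tagger treat such a sentence, a quick sketch like the one below can be run in the notebook (again assuming the English model en_core_web_sm; the printed labels are simply whatever the model outputs):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Chinese we say shi yue huai tai (十月怀胎). It means giving the birth after ten months pregnant.")

    # Show how the English tokenizer splits (or fails to split) the pinyin and the Hànzì,
    # and which POS and dependency labels each token receives.
    for token in doc:
        print(token.i, repr(token.text), token.pos_, token.dep_)

    # The sentence segmentation can be inspected as well.
    for sent in doc.sents:
        print("SENTENCE:", sent.text)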

It was definitely interesting to see what the program made of the example sentences from A Concise Chinese-English Dictionary for Lovers, and even though I am still in the process of getting the hang of tokenisation and dependencies, I am interested to see what we do next.

Annotating Multilingual Sentences in „Hold“ by Michael Donkor: Twi and English

In his novel „Hold“, which was first published in 2018, Michael Donkor continually weaves Twi words into English sentences, thus constructing a multilingual narrative. When he uses words in Twi, the author highlights them and sets them apart by italicizing them. Though much could be said about the function of this practice, it is a secondary issue with regard to annotating sentences from the novel. Within the context of the seminar „Writing across Languages“, I am mainly interested in the form – italics being a part of it. Moreover, I am interested in how the multilingual sentences can be annotated and what challenges arise in doing so.

Multilingual Sample Sentences in Donkor’s Novel

There are different techniques that Donkor uses in order to establish multilingualism. I tried to choose sentences, words and phrases that show a variety of techniques the author uses. Often, phrases or whole sentences in Twi are used in dialogue.

‚What a polite and best-mannered young lady we have on our grounds this pleasant day. Wa ye adeƐ.

Donkor, Michael: Hold, p. 7

Me da ase,‘ Belinda said softly.

Donkor, Michael: Hold, p. 10

Other times, only one word is used in an English sentence. Here, as in almost all cases of multilingualism in Donkor’s novel, the Twi words are italicized.

Belinda worked the pestle in the asanka, using her weight against the ingredients, grinding the slippery onion and pepper.

Donkor, Michael: Hold, p. 28

In the tro tro on the way home from the zoo, Belinda had done her best to enjoy Mary’s sulking silence.

Donkor, Michael: Hold, p. 25

I have chosen 8 sample sentences in total and pasted them into „Jupyter Notebook“. Though a few letters differed from the English alphabet and had to be inserted separately, I faced no technical difficulties. A variety of challenges arose, however, regarding the machine annotation of Twi.

Challenges in Annotating Twi in English Sentences

No Italics in Annotations

I have repeatedly mentioned that Donkor uses italics to signal multilingualism. There is no way to indicate italics within „Jupyter Notebook“. Thus, it would be impossible to use these annotations to analyze the use of italics in multilingual texts, whether with a diachronic lens or otherwise. Nor can any differing practices across languages be analyzed, seeing that there is no way to indicate italics and therefore search for them later on.

Twi Words with the same form as English words

There are a few instances within the chosen sample sentences where a Twi phrase includes a word that resembles an English word. „Jupyter Notebook“, being based on an annotation model for the English language, identifies the form of these words and classifies them according to the English POS. In the chosen, annotated sentences this issue applies to the words „me“ and „bone“. Since I lack the language skills in Twi to understand the words, I can neither confirm nor deny whether the classification is generally correct. It shows, however, that there are challenges in differentiating the Twi words from the English words.

Classifying Twi Words Consistently

In general, there is a lack of consistency with respect to the classification of Twi words. The word „Aboa“ appears twice in two adjacent sentences. Still, the classifications for the word differ despite the identical form: first it is identified as „ADJ“, then as „PROPN“. Because the model has had no input from the Twi language, these parts of the sentences are not labeled correctly (see the small checking sketch after the quotation below).

Aboa!‘ Mary laughed. Aboa was Mother’s insult of choice too;

Donkor, Michael: Hold, p. 52
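A minimal way to check this context effect directly, assuming the spaCy English model used in the notebook (en_core_web_sm) – the script only prints whatever tags the model assigns, so the output may well differ from the annotations reported above:

    import spacy

    nlp = spacy.load("en_core_web_sm")

    # The same Twi word in two adjacent contexts, as in the quotation above.
    doc = nlp("Aboa! Mary laughed. Aboa was Mother's insult of choice too.")

    # Compare the tags of every occurrence of the same surface form.
    for token in doc:
        if token.text == "Aboa":
            print(token.i, token.text, token.pos_, token.tag_, token.dep_)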

Initial experiences in annotating multilingual text in Ocean Vuong’s „On Earth We’re Briefly Gorgeous“

I have always been fascinated by how literature and language studies have been influenced by digital fields such as coding and AI, so working on this project and studying the interdependencies between digital humanities and comparative studies has been a great learning experience. Having said that, it should be noted that I am not very tech-savvy, so the annotation assignment using Google Colab and Python initially seemed a little daunting. After using it together in class though, it became easier to work with the software and my initial confusion was cleared.

The text I’m working on is Ocean Vuong’s On Earth We’re Briefly Gorgeous, a predominantly English text with a few Vietnamese words and phrases. As a novel, it challenges the conventional idea of a mother tongue as a source of identity and stability. Through the protagonist, a Vietnamese American speaker who translates between English and Vietnamese, Vuong portrays the mother tongue as something that is constantly changing and disconnected from its origins, like an orphan. Multilinguality, then, in the context of the novel becomes more of a multi-cultural discourse in which the experiences of Vietnamese people in America do not get translated.

Finding samples for annotation and analysis in such a text was a bit of a task because there were comparatively fewer multilingual passages. Nevertheless, the passages I did work with give a fairly comprehensive idea of how the software is heavily Anglo-centric and fails to correctly annotate non-English languages.

One of the sentences I used for analysis was: “Đẹp quá!” you once exclaimed, pointing to the hummingbird whirring over the creamy orchid in the neighbor’s yard. Vuong gives us a translation of the Vietnamese phrase in the next sentence of the novel: „It’s beautiful“. SpaCy recognises the words as PROPN and verb, but as is clear from the translation in the source text, neither of the words is a proper noun or a verb; they could roughly be classified as adjective and pronoun if we are to categorise based on POS. “Creamy”, used to describe the orchid, is also recognised as a noun and not an adjective by the software. Not being a student with a linguistics background, I have had difficulties understanding the dependency encoding part of the programme and then applying it to check whether it has been run correctly or not. But, as far as I have understood, the dependency relation indexing for English words and POS is accurate, while Vietnamese words are wrongly indexed (I suspect based on the already incorrect POS tags).

I also wanted to work with a sentence where Vietnamese was not used as direct speech, to see if the software recognises the words differently: You wanted to buy oxtail, to make bún bò huế for the cold winter week ahead of us. The novel doesn’t give an exact translation, but it is evident from context that it is a dish made of beef. A little bit of independent research revealed that bún bò huế is a dish made of rice noodles and beef slices. The software almost correctly recognises “bún” and “huế” as nouns, but “bò” is wrongly found to be a verb. Expecting a program to recognise the cultural intricacies of a word might be expecting too much of it, but even in the instance that it did recognise a word correctly, I am inclined to believe that it was merely a fluke.

“Tên tôi là Lan.” My name is Lan. was another sentence that I tried to annotate, and because I was having fun with the software by now, I also wanted to try annotating them as two different sentences: just the Vietnamese and then just the English translation. The results were extremely interesting, and although I cannot fully explain why they varied (again, maybe a linguistics background would’ve helped), it goes to show that the spaCy software needs to be made more inclusive of languages that are not euro/anglo-centric.

The Vietnamese name is recognised as PROPN correctly, maybe based on the capitalisation. But the POS, and hence the dependency indexing, is incorrect for the Vietnamese. As seen in the above images, Python’s consistency in indexing and classifying varies for the same sentence depending on whether it is paired with English words or not. The deprel for Lan also differs between the Vietnamese and the English sentence (npadvmod vs. attr).
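A minimal way to reproduce this comparison in the notebook, assuming the en_core_web_sm model (the labels printed are simply the model’s output for each version of the sentence):

    import spacy

    nlp = spacy.load("en_core_web_sm")

    mixed = nlp("Tên tôi là Lan. My name is Lan.")
    vietnamese_only = nlp("Tên tôi là Lan.")
    english_only = nlp("My name is Lan.")

    # Compare POS and dependency label of "Lan" in each version of the sentence.
    for label, doc in [("mixed", mixed), ("vi only", vietnamese_only), ("en only", english_only)]:
        for token in doc:
            if token.text == "Lan":
                print(f"{label:8} Lan -> pos={token.pos_} dep={token.dep_} head={token.head.text}")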

This only goes to show that almost all software and translation tools are written in English-speaking countries and for the English-dominant market, and that this anglo-centrism poses barriers in literary studies when it comes to studying multilingual texts and their correct and factual translation.

My Experience Converting and Annotating Multilingual Sentences

I have always been intrigued by the potential of using coding and programming software when it comes to languages, but as someone with no background in coding/programming, I have also been intimidated by it. So when we first tried out Google Colab in a session of Writing Across Languages: The Post-Monolingual Anglophone Novel, I was very excited, and a bit skeptical about how it was going to turn out, especially considering the fact that we were going to try out multilingual sentences, and not plain English sentences. Since we tried it out first as a group, it was easier for me to set aside my anxiety associated with using a programming language, as most of us were new to this.

As expected, we did find some anomalies when it came to POS tagging of multilingual sentences, considering that most of these tools depend on English. Initially, it was difficult for me to understand dependency relations (deprel) and to get the hang of the abbreviations that indicate POS tagging and deprel. We discussed these in class, and I also did some independent research to understand these concepts better. What I understood is that getting the hang of it all only comes with practice. I also tried annotating some very common English sentences (such as „The quick brown fox jumps over the lazy dog.“) to get a better insight into how spaCy works with English sentences vs. multilingual sentences.
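For comparison, this is roughly what such a baseline check on a plain English sentence looks like (a sketch assuming the en_core_web_sm model, which may differ from the exact one in the Colab notebook):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The quick brown fox jumps over the lazy dog.")

    # On a plain English sentence the tags and dependencies serve as a sanity check
    # before moving on to the multilingual examples.
    for token in doc:
        print(f"{token.text:8} pos={token.pos_:5} dep={token.dep_:8} head={token.head.text}")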

The novel I chose to work with is Arundhati Roy’s The Ministry of Utmost Happiness, which features many Indian languages in addition to English; the example sentences I worked with mostly had Urdu/Hindi words. As we saw in class, spaCy tagged most Urdu/Hindi words as Proper Noun, sometimes correctly, and sometimes not. It was quite easy for me to figure out the mistakes in POS tagging due to my personal familiarity with these languages, and in some cases the literal translation follows the word in the text itself, either through context or an explicit definition.

I have to mention that the initial fog has lifted at this point, and the process of understanding and identifying oddities did get better with every example I tried. But I do believe that the software is also a little confused when it comes to the identification and tagging of non-English words, and there is great scope for improvement in this respect.

Thoughts and Problems while Converting and Indexing Multilingual Sentences of Abdulrazak Gurnah’s novel Afterlives

In the class „Writing across Languages: The Post-Monolingual Anglophone Novel“ we started working with digital tokenization, tagging, indexing and later annotating in order to – put very simply – take a look at how digital software reacts to multilingualism in novels. As most software and programmes are made in English-speaking countries for the English-speaking market and are hence almost exclusively in English, we are interested in how they perceive and annotate non-English words and phrases. Does their anglocentrism pose problems for us, or will they actually understand non-English text and annotate it correctly? (Small spoiler: they don’t.)

In my case I worked with Abdulrazak Gurnah’s novel Afterlives, in which multiple languages are part of the primarily English narration. I had no problems with any of the technical aspects of this step, so after putting my example sentences into the provided Google Colab template (based on Jupyter Notebook, which was just too difficult to install), these are my main findings:

  • Our assumption that it defines all non-English words as proper nouns turned out to be almost exclusively true – I think there were only a couple of examples where it identified them differently.
  • Sometimes the ROOT of a sentence was very weirdly placed.
  • Punctuation marks are not seen as separate tokens/entities in the dependency tree.

Here are some examples:

She wrote: Kaniumiza. Nisaidie. Afiya. He has hurt me. Help me.

In this example both kaniumiza and nisaidie are declared proper nouns, while kaniumiza is a direct object and nisaidie a ROOT. Afiya is also a proper noun and a ROOT, which makes sense, as it is a name and the only part of this one-word sentence. However, the others do not make much sense, especially as the direct translation is given afterwards. I could understand all of them being a ROOT, but I just don’t understand why kaniumiza is seen as a direct object. It’s also unfortunate that the programme does not seem to see the whole example as an entity in which the sentences correlate with each other on a semantic level, but only looks at each sentence individually. If this were different, it would identify He has hurt me. Help me. as the translation of Kaniumiza. Nisaidie.


’Jana, leo, kesho,‘ Afiya said, pointing to each word in turn. Yesterday, today, tomorrow.

This one confused me a lot: why is kesho a noun while the others are proper nouns? Also, jana is seen as a nominal subject, for which I only have the explanation that the programme thinks it is the name Jana and not a word in another language. However, how come leo is a conjunction and kesho a direct object? All of them should be indexed the same. I also do not understand why we suddenly have no ROOT at all among these three words, while other examples did have one. Additionally, this time there was also something I don’t quite understand in the indexing of the English words: why is tomorrow identified as the ROOT and not one of the other words? It is also quite sad that this time, the programme – again – did not realise that the direct translation of the non-English words is part of this example.


After the third miscarriage in three years she was persuaded by neighbours to consult a herbalist, a mganga.

This one surprised me a bit, because not only is mganga seen as a noun, but also as an appositional modifier, meaning that the programme realised it is another word for herbalist. This is the first – and only – time that an example of the African language (I’m very sorry, but I just could not discern which of the 125 Tanzanian languages it is) is indexed correctly.


I then thought that maybe it would be different with another language and tried this example:

They were proud of their reputation for viciousness, and their officers and the administrators of Deutsch-Ostafrika loved them to be just like that.

However, even with German and the official name of a region (Deutsch-Ostafrika) there were problems. Though both parts of the word are seen as proper nouns – which is correct – only Deutsch is seen as a compound, while Ostafrika is seen as the object of a preposition. This is not necessarily incorrect; however, Deutsch-Ostafrika is one word, even if it is hyphenated. Hence, in my understanding, both parts of the word should be seen as a compound and together as the object of a preposition.


And lastly, another example with German: 

He looked like a … a schüler, a learned man, a restrained man.

Here, the programme did identify schüler correctly – as a noun and as the object of the preposition like. I was quite impressed with that, and what impressed me even more was the fact that it also identified learned man and restrained man as appositional modifiers of schüler. This is the only example sentence in which not only the POS tagging but also the indexing and dependency relation is correct. My only explanation for this is that schüler is also a word used within the English language, though it is an Old-English word and not commonly used (see OED), and hence known to English dictionaries.


Lastly, I want to say that I actually had kind of fun doing this. Yes, I had to look up some of the linguistic definitions, especially with the dependency relations, but overall it was fun. And a bit infuriating at times when the programme made the same mistakes again and again. So I’m looking forward to the next step of the process.

Initial thoughts on converting & annotating multilingual sentences + my experiences with examples from Yvonne Adhiambo Owuor’s „Dust“

From a literary translation point of view, I am very interested in working with multilingual anglophone texts, as it is something that I will probably come across quite a bit in my actual work as a translator of novels from English to German. The seminar „Writing Across Languages: The Post-Monolingual Anglophone Novel“ deals precisely with this topic in the context of the digital humanities. And at first, digital humanities didn’t sound too overwhelming to me, because, like most students, I work with a laptop every day.

When it came to actually converting and annotating some multilingual sentences, however, and hearing words such as programming language, visualization of data and code during the preparation, I have to admit I developed quite some trepidation. After all, one of the reasons I decided to work with texts was that I try to stay away from advanced technology that goes beyond the surface as far as I possibly can, and when I have problems with my technical devices, I’d rather ask someone who is familiar with them than despair on my own. So, I was very relieved that we neither had to write code ourselves nor were we required to have any previous experience in the field. What really helped me was that we worked through an example sentence all together in class. This way, it did not feel as scary, since I only had to follow the instructions I was given, with immediate feedback, and any issues that occurred could be solved together immediately. So, there was no need to despair. One problem that did occur with the first example sentence was that, when copying the sentence into the cell, there must not be any paragraph breaks, otherwise the code will not work. If I had encountered the problem on my own, I am sure it would have been much more difficult and also time-consuming to find out what I had done wrong (or whether I had broken something – deleted the whole internet, who knows).

After this first shared experience it was much less of an effort to get started on my own. It was even fun to start experimenting with the sentences from the novel „Dust“ that I had read in preparation and to check what mistakes the English-based code causes.

The errors I came across most often were the categorizations of many of the non-English words (in the case of „Dust“ by Yvonne Adhiambo Owuor these are words from Kiswahili, Latin, Spanish and a local variety of English) as proper nouns and nouns in the cell for the part of speech, and as compounds in the cell concerned with the dependency encoding. This tendency is especially strong if several non-English words follow each other – or at least in these cases it becomes most obvious to the observer, because the annotation makes it look like there are whole sentences consisting only of proper nouns and nouns without any verbs.

The encoding of the dependencies also seemed to be quite arbitrary. In some cases, tokens that were far removed from each other and seemingly not connected (including punctuation) were categorized as dependent on one another. Punctuation was another issue, especially in the visualization of the dependencies. Usually, when the first token was an inverted comma, it was also visualized as a single token in the dependency tree. With the other punctuation marks this was not the case; they got assigned to another token and these two tokens were then depicted as one. The merging of tokens also seemed quite arbitrary, because, especially with non-English words, sometimes the word in front of the punctuation and sometimes the word following it was merged with the punctuation token.
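The dependency trees mentioned here come from spaCy’s built-in visualizer. Below is a minimal sketch of that step, assuming displacy and the en_core_web_sm model as in the Colab template; note that displacy by default collapses punctuation onto a neighbouring token, which is one likely reason why punctuation marks appear merged in the tree, and that the text used here is only a placeholder:

    import spacy
    from spacy import displacy

    nlp = spacy.load("en_core_web_sm")

    # Paste one of the example sentences from „Dust“ here; this is just a placeholder.
    text = "Replace me with an example sentence from the novel."
    doc = nlp(text)

    # Draw the dependency tree inline in the notebook; collapse_punct=False keeps
    # punctuation marks as separate tokens instead of attaching them to a neighbour.
    displacy.render(doc, style="dep", jupyter=True, options={"collapse_punct": False})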

Now that I have dealt with some examples myself, I am very interested to see what mistakes the others found with their example sentences and what the next steps of the analysis of our findings will look like!

Problems (and correct classifications) in annotating training and example sentences in different languages from R. F. Kuang’s „Babel“: My experiences

Within the context of our seminar „Writing Across Languages: The Post-Monolingual Anglophone Novel“, our task was to test how the software „Jupyter Notebook“, equipped with an English database, classifies foreign words in a novel that is mostly written in English. The relevant categories were parts of speech and dependency tags. As „Jupyter Notebook“ was too tiresome to install, we worked with a copy of a Jupyter notebook in Google Colab instead. We had two example sentences which we could use in order to become acquainted with the software. Our main job was to read our novel and to note down examples of multilingual text passages, so that they could be annotated by the software.

Preparing these sentences to be annotated by the software posed a few problems for me. My first problem was that my book, „Babel“ by R.F. Kuang, uses a lot of Chinese words, and these words are sometimes presented as Chinese characters. The problem was not so much what a character meant or how it translated into English, as this was indicated most of the time, but I had no idea how to copy the Chinese characters, especially as the Kindle app does not allow copying words or phrases from the app. My initial idea was to enter the romanized version of the character or its meaning in English into Google Translate and to then just copy the Chinese character from there. However, this didn’t work because the book already said that the usage of this character for this meaning was quite unusual, and Google Translate only indicates one possible way of writing a Chinese word as a Chinese character. My second idea was to take a photo, to copy the Chinese character from there and to then paste it into my document. This also didn’t work because I couldn’t copy the characters in the apps that I tested either. After some unsuccessful tries with apps in which the user can draw the Chinese character, during which the characters could not be recognized, I ended up on this website: https://www.qhanzi.com/index.html. This website also allows the user to draw the Chinese character and then guesses which Chinese character you drew, but it seems to have a much larger database than the apps I tested. Here is an example:

In this example, I wanted to draw the first character in the suggestions below. In the case of multiradical characters, meaning Chinese characters which consist of more than one radical, I had to choose the option „Multiradical“ and then choose characters from a large list. Here is also an example:

These two methods take a lot of time, of course, but in the end, I managed to find all the characters that I needed.

My second, more minor problem while copying the text from the app into my document were the horizontal lines above vowels in romanized Chinese writing and also in Latin. In my research, I learned that these lines, called macrons, mark the length of a vowel in Latin. One way or another, I knew that I had to indicate these lines above a vowel somehow. In the end, I found a Q&A page on which a user explained how to type these lines above vowels on the keyboard. This is the page I used: https://www.gutefrage.net/frage/lateinischer-betonungsstrich-word. Just like the website with the Chinese characters, this isn’t an academic website, but for my purpose, it sufficed.
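Since we were working in a Python notebook anyway, here is a small sketch of a programmatic alternative for producing macron vowels such as the ā in „mālum“ or „Tā“ – the code points are standard Unicode, so no special keyboard layout is needed:

    import unicodedata

    # A precomposed macron vowel can be written as a Unicode escape.
    precomposed = "\u0101"        # LATIN SMALL LETTER A WITH MACRON, i.e. "ā"

    # Alternatively, combine a base letter with the COMBINING MACRON character.
    combining = "a" + "\u0304"

    print(precomposed, unicodedata.name(precomposed))
    # Both spellings normalise to the same character.
    print(unicodedata.normalize("NFC", combining) == precomposed)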

Annotating the training sentences

As I mentioned earlier, I first went through the example sentences. While going through each of them individually, I will name a few mistakes that the software made, along with some correct annotations, which are decidedly fewer. The mistakes I name usually concern, first, words that are foreign to the English language and then, if there are any, words which belong to the English language. Concerning the correct annotations, I only mention those words which are non-English, as it should be the norm that English words are annotated correctly, considering that the data with which the software was trained was written in English. The first training sentence was:

Tías called me blanca, palida, clarita del huevo and never let me bathe in the sun. While Leandro was tostadito, quemadito como un frijol, I was pale.

Lickorish Quinn 129

The mistakes that the software made which I recognized were that “palida” was categorized as a conjunction (correct: adjective) and that “frijol” was classified as an attribute. Concerning the English words, the only mistake that I recognized was that „let“ was classified as a conjunction (correct: auxiliary). Some correct decisions that the software made were that “Tías” was classified as a nominal subject, that “blanca” was classified as an object predicate, that “clarita del” was classified as a compound and that “tostadito” was classified as an adjectival modifier; “huevo”, however, was classified as a conjunction.

The second training sentence which was annotated was:

In Chinese, it is the same word ‘家’ (jia) for ‘home’ and ‘family’ and sometimes including ‘house’. To us, family is same thing as house, and this house is their only home too. ‘家’, a roof on top, then some legs and arms inside.

Guo 125-26

The mistakes here were the following: the first “家” was classified as an appositional modifier, but also as a noun (which is correct), the second “家” was classified as an unclassified dependent and thus radically differs from the first annotation, and “jia”, which is the romanized version of the Chinese character “家”, was categorized as a proper noun (correct: noun) and an appositional modifier. Concerning the English words, there were also a few mistakes: “family” was considered a conjunction (correct: noun), “is” was classified as a conjunction (correct: auxiliary), “legs” was classified as a noun (which is correct) and as a root and “arms” was classified as a conjunction (correct: noun).

Annotating the quotes from „Babel“

As I was curious how the software would react to sentences with Latin words, I started with a fairly easy one:

But in Latin, malum means “bad” and mālum,’ he wrote the words out for Robin, emphasizing the macron with force, ‘means “apple”.

Kuang 25

The mistakes in the annotation were that the first “malum” was categorized as a noun (correct: adjective) and a nominal subject (correct: adjectival subject) and that the second „mālum“ was classified as a proper noun (correct: noun) and a conjunction (correct: subject). The English words in this sentence, however, were categorized correctly. To me, this shows that the software does not understand the sentence because otherwise, it would have recognized that „malum“ means „bad“ and is thus an adjective and that „mālum“ means „apple“ and is thus a noun.

Okay, a sentence with Latin was not annotated successfully. Let’s see whether a Chinese word fares better. An example would be:

Wúxíng – in Chinese, ‘formless, shapeless, incorporeal’. The closest English translation was ‘invisible’.

Kuang 65

Nope, the annotation of this sentence was even worse. First of all, “Wúxíng” was classified as a proper noun (correct: noun) and as a root (which could be correct, as there is no main verb which could be the root of the sentence). Furthermore, there are a few English words which were not annotated correctly: “Chinese” was classified as a proper noun (correct: noun) and “formless” was classified as an adjectival modifier, while “shapeless” was classified as an adverbial modifier (correct: probably adjectival modifier).

Now, I was invested and wanted to get to the bottom of this weird annotation of foreign words by the Jupyter Notebook. That’s why I chose this as my third sentence from „Babel“:

Que siempre la lengua fue compañera del imperio; y de tal manera lo siguió, que junta mente començaron, crecieron y florecieron, y después junta fue la caida de entrambos.

Kuang 3

Honestly, I expected this sentence to be full of either classifications as proper nouns or full of mistakes. The mistakes in this sentence concerning the foreign words were that “crecieron” and “florecieron” were classified as nouns (correct: verbs), that “de” was classified as unknown (correct: preposition) and that the rest of the words were classified as proper nouns. Concerning the clausal level, the words were either classified as compounds or as appositional modifiers. Interestingly, the software correctly recognized “la” as a determiner and “imperio” as a direct object. I don’t know whether the software was just lucky with these two annotations or whether the place of the words in these sentences somehow influenced the correct annotation. One way or the other, it is clear that this software cannot annotate words in a language other than English very consistently.

As I recognized that it made little sense to enter entire sentences in another language into the software, I wanted to make the work for the software as easy as possible. Thus, I returned to Chinese and entered the following sentences next:

He will learn. Tā huì xué. Three words in both English and Chinese. In Latin, it takes only one. Disce.

Kuang 26

I was hoping that the software would recognize that the first two sentences were translations of each other and that they would thus be categorized correctly. I was disappointed, because in English, the words “He will learn” were correctly classified as a pronoun, an auxiliary and a verb, while in Chinese, “Tā huì xué” were classified as a noun, a noun and a verb, which, based on the English translation (as I don’t speak Chinese), is not correct. Apart from that, “English”, “Latin” and, unsurprisingly, “Disce” were categorized as proper nouns, although „English“ and „Latin“ are nouns and „Disce“ is a verb. One further mistake consisted in the software annotating „Chinese“ as a conjunction (correct: noun). The different annotations of one and the same sentence in different languages confirm my assumption that the software does not actually understand the words that it annotates.

Okay, Chinese didn’t work. I was curious whether another language which is closer to English would help. So, I chose two sentences in French from Babel:

‘Ce sont des idiots,’ she said to Letty. / ‘Je suis tout à fait d’accord,’ Letty murmured back.

Kuang 71

The result of this annotation was also disappointing. “Ce” was classified as a verb (correct: determiner), while “sont” (correct: verb) and “des” (correct: article) were classified as adjectives; “Je” (a personal pronoun) and “tout à fait d’accord” were classified as proper nouns. On the dependency level, “ce” was classified as an adverbial clause, “sont” and “des” were classified as adjectival modifiers, “suis” was classified as a clausal complement, and “à fait” was classified as a compound, while “tout” and “d’accord” were classified as a noun phrase used as adverbial modifier and as a direct object, respectively. Interestingly, the words “idiots” (as a noun and as a direct object) and “suis” (as a verb) were classified correctly.

Okay, a closer language like French didn’t work. Maybe German could be a better help for the software to annotate the foreign word correctly. That’s the reason why I chose this sentence:

But heimlich means more than just secrets.

Kuang 81

I figured that, as “heimlich”, or rather its negative counterpart “unheimlich”, is often used in the context of horror literature, maybe the software would be able to recognize this word and thus annotate it correctly. However, I was, again, disappointed, as “heimlich” was classified as a noun (correct: adjective) and a nominal subject (correct: adjectival subject).

Next, I was again intrigued by the Chinese language and I wanted to know whether the results of the Chinese character and its translation in the training sentence above were just a coincidence. So, I chose the following sentence, which is similar to the training sentence, with a Chinese character:

Why was the character for ‘woman’ – 女 – also the radical used in the character for ‘slavery’? In the character for ‘good’?

Kuang 110

The answer was: no, the results from the training sentence were not a coincidence. “女” was classified as unidentified (correct: noun) and as an appositional modifier. Furthermore, an English word was also categorized incorrectly: “radical” was wrongly classified as an adjective (correct: noun). The classification of „In“ as the root could be correct, as the sentence „In the character for ‚good‘?“ contains no verb which could function as the root.

Okay, now Chinese was out of the question for me. Maybe the French and German words were simply not established enough in the English language. So, I decided to choose a sentence with a French word which I had already heard used in other English sentences:

‘It’s not the company, it’s the ennui,’ he was saying.

Kuang 144

Well, “ennui” was, indeed, correctly classified as a noun and as an attribute. However, the particle “’s” was classified as a clausal complement. But never mind, we’re making progress concerning the annotation of the non-English words.

Next, I was interested in how the software would handle Greek words. As an example sentence, I chose:

The Greek kárabos has a number of different meanings including “boat”, “crab”, or “beetle”.

Kuang 156

Just like the sentence before, the foreign word was classified correctly, in this context as a noun and as a nominal subject. However, now the software seemed confused about the classification of the English words: “including” was considered a preposition, but also to be a verb, “crab” was considered as a noun, but also as a conjunction, while “beetle” was considered to be a verb (correct: noun) and a conjunction.

Okay, nice, the Greek word was also no problem for the software. As I am a native German speaker, I wanted to give a second chance to the annotation of a German word in my last example. I chose this as an example sentence:

‘The Germans have this lovely word, Sitzfleisch,’ Professor Playfair said pleasantly when Ramy protested that they had over forty hours of reading a week.

Kuang 168

Concerning the German word “Sitzfleisch”, I was disappointed because the software classified it as a proper noun (correct: noun) and as an appositional modifier. Concerning the English words, there were also some mistakes: “Professor” was classified as a proper noun (correct: noun) and a compound, while “Playfair” was classified as a nominal subject, “have” was classified as a clausal complement, “when” was classified as an adverbial modifier and “protested” was classified as an adverbial clause modifier.

A more general problem that I encountered while annotating these sentences was that some tags like “oprd” were neither mentioned in the table we received in order to recognize the clauses and the parts of speech, nor were they mentioned in the spaCy article. Instead, I found this website, which helped me with the abbreviations: https://stackoverflow.com/questions/40288323/what-do-spacys-part-of-speech-and-dependency-tags-mean.
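In hindsight, spaCy itself can also gloss many of these abbreviations: its explain() helper returns a short description for most POS and dependency labels (a small sketch; labels the function does not know simply return None):

    import spacy

    # Ask spaCy for a human-readable gloss of tag and dependency abbreviations.
    for label in ["oprd", "nsubj", "appos", "npadvmod", "attr", "PROPN"]:
        print(label, "->", spacy.explain(label))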

Another, very minor technical problem concerned the work with Google Colab. The program sometimes gave me feedback that I had opened too many notebooks that were not closed yet, although I had closed them. To solve this problem, however, I simply had to click “Close all notebooks except the current” or something like that, and then I could continue annotating.

On a more positive note, the software consistently succeeded in the indexing of the tokens and in classifying punctuation marks as such. The only exception that I found was the apostrophe in „’s“.

Comprehension questions

I don’t really think that I still have any comprehension questions; I am just not quite sure whether I correctly assessed the parts of speech and the dependency tags, because the linguistics class in which I learned these terms was quite a long time ago, in about 2017. That is also the reason why I mostly didn’t indicate the correct dependency tags, even when the given ones were obviously wrong. I googled or tried to look up the abbreviations and what they meant, of course, but I am still not quite sure whether there aren’t some mistakes that I made in this regard. That’s why I also didn’t check whether the tree at the bottom of the interface was correct. There could be some answers as to why the software classifies some words the way it does, but so far I have not seen a systematically wrong approach to foreign words, except that they are often classified as proper nouns and as compounds. I was also not quite sure how to write this blog entry, as I have never written a blog entry before and am not that into blogs myself.