My experience with annotating multilingual sentences with Google Colab

My first experience with Google Colab was surprisingly positive. While the idea of working with programming software seemed overwhelming at first, when we actually got to try it during the seminar, it turned out to be quite intuitive and easy to navigate with the help of Ms. Pardey and the other students. The concept of syntactic trees and dependency relations was something I initially struggled with, considering my Introduction to Linguistics lecture was in 2017 and I have since mostly stayed in the field of Literary and Cultural Studies. Combined with the abbreviations of the different tags within the programme, it was difficult for me to understand the results Google Colab was showing me. However, the class discussion and the glossary, as well as some revision using Google, were very helpful. What I also did was feed the programme simple sentences at first (think “I like apples.” “What is your favourite colour?”) to see what the results would look like with less complicated sentence structures.
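To give an idea of what those results look like, here is a minimal sketch of the kind of output the notebook produces for such a simple sentence. I am assuming spaCy with the small English model (en_core_web_sm) underneath, as later entries in this blog suggest.

```python
# Minimal sketch, assuming the notebook uses spaCy and the small English model.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I like apples.")

for token in doc:
    # token.text -> the word itself
    # token.pos_ -> coarse part-of-speech tag (e.g. PRON, VERB, NOUN)
    # token.dep_ -> dependency relation to the token's head (e.g. nsubj, ROOT, dobj)
    print(token.text, token.pos_, token.dep_, token.head.text)
```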

When I tried to annotate my own sentences from the novel “On Earth We’re Briefly Gorgeous”, I came across some difficulties: I cannot understand Vietnamese and thus first had to work out the structure of the Vietnamese passages myself in order to verify Google Colab’s results. What I discovered was that the programme struggled to identify the Vietnamese words, which, to me, seems inevitable because, as I understand it, the underlying language model is trained only on English(?). Because of this, the overall dependency relations were off, since my example sentences combined English and Vietnamese words in the same sentence, mostly without an indicator like an English preposition or determiner. I have not yet found a way to fix this problem but am very interested in what a solution in such cases could look like.
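I have not found a real solution either, but one small step would be to at least flag the Vietnamese tokens automatically so that their English tags are not taken at face value. The diacritic heuristic and the example sentence below are my own assumptions, not part of the seminar notebook.

```python
# A rough sketch, not a fix: flag tokens containing Vietnamese-style diacritics
# so their (English) POS and dependency labels can be treated with caution.
import unicodedata
import spacy

nlp = spacy.load("en_core_web_sm")

def has_diacritics(word: str) -> bool:
    # Vietnamese orthography uses many combining marks that plain English words lack.
    decomposed = unicodedata.normalize("NFD", word)
    return any(unicodedata.category(ch) == "Mn" for ch in decomposed)

doc = nlp("I miss my ngoại every day.")  # invented mixed sentence, for illustration only
for token in doc:
    flag = "NON-ENGLISH?" if has_diacritics(token.text) else ""
    print(token.text, token.pos_, token.dep_, flag)
```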

I am interested to learn about the others’ experiences with Google Colab and am keen to learn more about computer-based analysis of multilingual text.

Annotating Multilingual Sentences in Yara Rodrigues Fowler’s „Stubborn Archivist“ – Experience & Observations

Initially, I was intrigued but also a bit worried about working on this project with Python, as previous linguistic research during my bachelor’s taught me that working with programming software can be a bit error-prone and frustrating at times. However, thanks to the prepared script, the annotation process via Google Colab and Python was very intuitive and easy to use, so thankfully, I did not experience any major technical difficulties.

For my research, I annotated multilingual sentences from Stubborn Archivist by Yara Rodrigues Fowler. The novel was published in 2019 and follows the coming-of-age journey of a young British-Brazilian woman in contemporary South London. While mainly written in English, it also includes many Portuguese terms and phrases, highlighting the character’s connection to both British and Brazilian cultures and identities.

In general, the software correctly annotated single Portuguese nouns like tia or empregada when they were used in a standard English sentence structure and were clearly marked as nouns, for instance by a preceding determiner:

You don’t have an empregada at your house in London?

(Rodrigues Fowler 137)

However, some errors arose with compound nouns, such as in leite condensado (condensed milk), where the second part, condensado, was mistakenly identified as the head of the phrase instead of leite.

Vovó Cecília shook the bowl of cocoa powder over the pan and as the sprinkle powder became wet and fat and darkened the baby curved it into the centre of the hot leite condensado.

(Rodrigues Fowler 132)
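A case like this can be checked by printing the head relations directly. The following is only a sketch (assuming spaCy, as in the other entries, and with the quoted sentence shortened for readability):

```python
# Sketch: inspect which word the parser treats as the head of "leite condensado".
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The baby curved it into the centre of the hot leite condensado.")

for token in doc:
    if token.text in ("leite", "condensado"):
        # token.head shows which word the token attaches to in the dependency tree
        print(token.text, token.pos_, token.dep_, "-> head:", token.head.text)
```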

Another recurring problem I encountered in the annotation involved Portuguese words that have English equivalents. For instance, in the passage:

But Vovô I thought Columbus discovered America? No. Christopher Columbus discovered América do Norte. América is a continent—two continents. And Brasil is in América do Sul.

(Rodrigues Fowler 146)

Here, the Portuguese contraction of preposition and determiner, do (de + o), was mistaken for the English (to) do and thus incorrectly annotated, once as a verb and once as an auxiliary.

On a stylistic level, the frequent use of dialogue inserts without proper punctuation stands out prominently in Rodrigues Fowler’s novel. This aspect also posed the most significant challenge during annotation, as illustrated by the following example:

So your father every time we went out he would want to try a new juice
And obviously there are so many juices and fruits that he had never seen before
Açaí
Maracujá
Acerola
Jabuticaba
Cajú
He was always asking—But what is that in English?

(Rodrigues Fowler 117)

In such instances, the program was unable to determine the correct sentence structure of the individual phrases, often interpreting them as one continuous sentence. Moreover, in this particular example, the individual Portuguese names for various fruits and juices were not recognized accurately and were instead marked as two compound nouns. However, this issue likely stems from the novel’s use of line breaks to separate the individual utterances, which cannot be indicated in the annotation program.
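One workaround I can imagine (my own assumption, not something the seminar notebook offers) would be to annotate each line of such a passage separately, so that the unpunctuated lines are not run together into one sentence:

```python
# Sketch: annotate every line of the dialogue passage on its own.
import spacy

nlp = spacy.load("en_core_web_sm")

passage = """So your father every time we went out he would want to try a new juice
And obviously there are so many juices and fruits that he had never seen before
Açaí
Maracujá
Acerola"""

for line in passage.splitlines():
    doc = nlp(line)
    print([(token.text, token.pos_, token.dep_) for token in doc])
```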

Conversion and Annotation of Susan Abulhawa’s „The Blue Between Sky and Water“, or A Demonstration of Software Failure through Anglocentrism?

Introduction

Having focused on reading literature through a postcolonial studies lens as well as on Eurocentric bias in the field of linguistics throughout my studies, I found the prospect of using conversion and annotation tools on a postcolonial, post-monolingual Anglophone novel intriguing. I was interested to see how they would deal with the novel I chose – Susan Abulhawa’s The Blue Between Sky and Water.

The novel by the Palestinian-American writer and human rights activist mixes Palestinian Arabic with English in a few different ways, although some patterns can be observed: food items, terms for relatives, and culture-specific terms are usually written in latinized Arabic. Terms are usually introduced in italics once and then reappear throughout the novel unitalicised. As the software works with raw text, these specificities were lost. Apart from nouns, the novel also includes the occasional verb, adjective, or phrase in Palestinian Arabic, sometimes translated in the surrounding sentences and sometimes not.

Sentence Choice

I chose seven sentences that show variation in the mixing of languages. For instance, one sentence I picked only includes Arabic nouns which denote food items:

                “One of her brothers arrived and they all shared a late breakfast of eggs, potatoes, za’atar, olive oil, olives, hummus, fuul, pickled vegetables, and warm fresh bread” (174).

Another contains only one adjective in Arabic:

                “‘Who is there?’ a woman’s voice asked in Arabic and Nazmiyeh relaxed upon hearing the Palestinian fallahi accent” (35).

Yet another contains an entire phrase:

                “The woman with wilted breasts began to sob quietly as others consoled her and banished the devil with disapproving eyes at Nazmiyeh – a’ootho billah min al shaytan – when a female soldier wheeled in a large box of clothes, and with a gesture of her hand, gave the naked women permission to get dressed” (114).

In addition, I chose other sentences which include nouns in different ways, in order to compare how the software deals with them. This was useful, as will become clear in the analysis of the mistakes the software made in the annotation.

Problems and Technical Difficulties

In general, working with the Jupyter-Notebook-style interface via Google Colab and Python worked out without any greater issues. The only problem I noticed was that double quotation marks could not be used as they are, since they clash with the quotation marks that delimit strings in the code. Single quotation marks were acceptable, so I replaced double quotation marks with single ones.
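Since the clash is with Python’s string delimiters rather than with Colab itself, there are two other common workarounds besides swapping in single quotation marks. This is only a sketch; the seminar notebook may handle the input differently:

```python
# 1) Escape the double quotation marks inside a double-quoted string
sentence = "\"Who is there?\" a woman's voice asked in Arabic."

# 2) Or use triple quotes, which tolerate both kinds of quotation marks
sentence = """ "Who is there?" a woman's voice asked in Arabic. """

print(sentence)
```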

Google Colab itself, however, was not as user-friendly. I found it rather unintuitive, and it often claimed I had too many tabs open at the same time. As other students recommended, clearing the history and waiting a few minutes seemed to solve that issue. It is, however, time-consuming.

Mistakes in the Annotation

Most of the time, Arabic words were labelled as PROPN, proper nouns. This was especially the case with the sentence including an entire phrase in Arabic. For instance, “a’ootho” should be labelled as a verb but was labelled as a proper noun. “billah” was labelled as one single proper noun, even though it should be labelled as a preposition in combination with a noun, or proper noun (Allah). In other sentences, the software labelled Arabic nouns as adverbs. One example is “fuul”, which is a type of bean. Yet other nouns, such as “jomaa”, were incorrectly labelled as adjectives. So, while the most frequent mistake was to label any Arabic word as a proper noun, the software was not consistent in its mislabelling throughout.
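One way to quantify this inconsistency would be to count which POS tags the parser assigns to the Arabic tokens across the chosen sentences. The following is only a sketch (assuming spaCy, as in the other entries; the token list and the shortened sentence are my own selection):

```python
# Sketch: count the POS tags assigned to a hand-made list of Arabic tokens.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

arabic_tokens = {"hummus", "fuul", "fallahi", "shaytan", "billah"}

doc = nlp("They shared a late breakfast of eggs, hummus, fuul, and warm fresh bread.")
tag_counts = Counter(token.pos_ for token in doc if token.text.lower() in arabic_tokens)
print(tag_counts)  # actual counts depend on the model's (mis)labelling
```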

The software was still able to locate the ROOT correctly, despite its confusion around the Arabic terms. Arabic words resembling English words were sometimes thought to be English words. For instance, “Um” (mother) was mistakenly labelled as an interjection.

„Um“ is mislabeled as an interjection.

Overall, this was an interesting experience, but I was disappointed by the extent to which the software was unable to deal with Arabic words, even though this was to be expected.

Converting and Annotating Multilingual Sentences & Quotes – My Experience

Initially, I was very unsure about this task because, as someone who has focused on literature during their studies (for good reason), I am neither that good at linguistics nor at programming or coding. While it seemed an intriguing task, there was also some apprehension on my part when looking at the Google Colab file for the first time. However, this was quickly overcome when we went through the steps one by one. I was curious to see how this program would deal with multilingual sentences, given that it is based purely on English.

I chose to examine A Concise Chinese-English Dictionary for Lovers. This book not only contains multilingual sentences in the form of a mixture of English and Chinese, it also contains grammatically incorrect English, as it is written from the point of view of the protagonist, who is learning English as the storyline progresses.

Because of this, I naturally ran into some issues. One of the sentences I chose to look at was this one: „Chinese we say shi yue huai tai (十月怀胎). It means giving the birth after ten months pregnant.“
When it came to the POS tagging, not only did it categorise all of the pinyin Chinese words in the sentence as proper nouns, it also counted the Hànzì as one word instead of a whole phrase.
In addition to this, the word „Chinese“ was categorised as an adjective, because spaCy is incapable of recognising that it is meant as a noun here, since the sentence is not grammatically correct.
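A quick way to see both issues at once is to print the token list for the sentence. This is only a sketch, assuming spaCy’s English model as mentioned above; the Hànzì come out as a single token because the English tokenizer only splits on spaces and punctuation:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Chinese we say shi yue huai tai (十月怀胎).")

print([(token.text, token.pos_) for token in doc])
# The four characters 十月怀胎 appear as one token; the pinyin words are
# typically tagged PROPN, and "Chinese" ends up tagged as an adjective.
```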

It was definitely interesting to see what the program made of the example sentences from A Concise Chinese-English Dictionary for Lovers, and even though I am still in the process of getting the hang of tokenisation and dependencies, I am interested to see what we do next.

Annotating Multilingual Sentences in „Hold“ by Michael Donkor: Twi and English

In his novel „Hold“, which was first published in 2018, Michael Donkor continually weaves Twi words into English sentences, thus constructing a multilingual narrative. When he uses words in Twi, the author highlights them and sets them apart by italicizing them. Though much could be said about the function of this practice, it is a secondary issue with regard to annotating sentences from the novel. Within the context of the seminar „Writing across Languages“, I am mainly interested in the form – italics being a part of it. Moreover, I am interested in how the multilingual sentences can be annotated and what challenges arise in doing so.

Multilingual Sample Sentences in Donkor’s Novel

There are different techniques that Donkor uses in order to establish multilingualism. I tried to choose sentences, words and phrases that show a variety of techniques the author uses. Often, phrases or whole sentences in Twi are used in dialogue.

‚What a polite and best-mannered young lady we have on our grounds this pleasant day. Wa ye adeƐ.

Donkor, Michael: Hold, p. 7

Me da ase,‘ Belinda said softly.

Donkor, Michael: Hold, p. 10

Other times, only one word is used in an English sentence. Here, as in almost all cases of multilingualism in Donkor’s novel, the Twi words are italicized.

Belinda worked the pestle in the asanka, using her weight against the ingredients, grinding the slippery onion and pepper.

Donkor, Michael: Hold, p. 28

In the tro tro on the way home from the zoo, Belinda had done her best to enjoy Mary’s sulking silence.

Donkor, Michael: Hold, p. 25

I have chosen 8 sample sentences in total and pasted them into „Jupyter Notebook“. Though a few letters are not part of the English alphabet and had to be inserted separately, I faced no technical difficulties. A variety of challenges arose, however, regarding the machine annotation of Twi.

Challenges in Annotating Twi in English Sentences

No Italics in Annotations

I have repeatedly mentioned that Donkor uses italics to signal multilingualism. There is no way to indicate italics within „Jupyter Notebook“. Thus, it would be impossible to use these annotations to analyze the use of italics in multilingual texts, whether with a diachronic lens or otherwise. Nor can any differing practices across languages be analyzed, seeing that there is no way to indicate italics and therefore search for them later on.
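If the italics information were needed later, one workaround (again, my own assumption, not something the seminar notebook offers) would be to record the italicised words separately and attach them to the tokens as a custom attribute:

```python
# Sketch: carry the "was italicised in the novel" information along as a token attribute.
import spacy
from spacy.tokens import Token

Token.set_extension("italic", default=False, force=True)

nlp = spacy.load("en_core_web_sm")
doc = nlp("Belinda worked the pestle in the asanka, grinding the slippery onion and pepper.")

italic_words = {"asanka"}  # would be collected from the novel's formatting beforehand
for token in doc:
    token._.italic = token.text in italic_words
    print(token.text, token.pos_, token.dep_, token._.italic)
```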

Twi Words with the same form as English words

There are a few instances within the chosen sample sentences where a Twi phrase includes a word that resembles an English word. „Jupyter Notebook“, being based on an annotation model for the English language, recognises the form of these words and classifies them according to the English POS. In the chosen, annotated sentences this issue applies to the words „me“ and „bone“. Since I lack the language skills in Twi to understand the words, I can neither confirm nor deny whether the classification is generally correct. It shows, however, that there are challenges in differentiating the Twi words from the English words.

Classifying Twi Words Consistently

In general, there is a lack of consistency with respect to the classification of Twi words. The word „Aboa“ appears twice in two adjacent sentences. Still, the classifications for the word differ despite the identical form: first it is identified as „ADJ“, then as „PROPN“. Because the model has seen no Twi input, these parts of the sentences are not labeled correctly.

Aboa!‘ Mary laughed. Aboa was Mother’s insult of choice too;

Donkor, Michael: Hold, p. 52

Technical Difficulties and First Results of Annotating Multilingual Sentences in „The Moor’s Account“

First of all, I have been finding this work very interesting, as it gives me a new perspective on literature. However, I encountered some technical difficulties while having Google Colab analyse my sentences. After three sentences, it told me I had reached my free limit and would have to purchase Colab Pro. This is probably because I tried to save my results by opening a new file for every sentence. Thus, I have only worked through a couple of examples so far, taken from the novel The Moor’s Account by Laila Lalami.
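If the problem was indeed caused by opening a new file per sentence, one workaround would be to process all sentences in a single notebook and collect the results in one file. The following is only a sketch (the file name is my own example; the sentences are the two discussed below):

```python
import csv
import spacy

nlp = spacy.load("en_core_web_sm")

sentences = [
    "When I said, Bawuus ni kwiamoja, one of the women inevitably corrected me.",
    "I whispered Ayat al-Kursi to myself.",
]

# Write one row per token so the annotations of all sentences end up in one file
with open("annotations.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["sentence_id", "token", "pos", "dep", "head"])
    for i, sentence in enumerate(sentences):
        for token in nlp(sentence):
            writer.writerow([i, token.text, token.pos_, token.dep_, token.head.text])
```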

The first sentence I had annotated was „When I said, Bawuus ni kwiamoja, one of the women inevitably corrected me, Ni bawuus kwiamoja.“ (Lalami 175). Both times the words are used in the sentence, Colab tagged them as proper nouns, as if they were all names. The phrase is not translated in the novel, but the correction is explained as follows: “in Capoque […] the doer and the done-to were spoken of before the deed itself” (ibid.). This means that, presumably, “Ni” and “bawuus” are the doer and the done-to, while “kwiamoja” is a verb. It is very difficult to research this language, however; I suppose it is not spoken anymore.

Another sentence I let Colab process was “I whispered Ayat al-Kursi to myself” (123). “Ayat”, “al”, and “Kursi” are all tagged as proper nouns, which is acceptable, I believe, as the words refer to a specific verse in the Quran. In the dependency analysis, however, it says that “Kursi” depends on the word “whispered” (as dobj) and is the main part of a compound.

Hopefully, I will be able to go through more sentences and I am excited for the next steps.

Thoughts and Problems while Converting and Indexing Multilingual Sentences of Abdulrazak Gurnah’s novel Afterlives

In the class „Writing across Languages: The Post-Monolingual Anglophone Novel“ we started working with digital tokenization, tagging, indexing and later annotating in order to – put very simply – take a look at how digital tools react to multilingualism in novels. As most software and programmes are made in English-speaking countries for the English-speaking market and are hence almost exclusively in English, we are interested in how they perceive and annotate non-English words and phrases. Does their anglocentrism cause problems, or will they actually understand non-English text and annotate it correctly? (Small spoiler: they don’t.)

In my case I worked with Abdulrazak Gurnah’s novel Afterlives, in which multiple languages are part of the primarily English narration. I had no problems with any of the technical aspects of this step, so after putting my example sentences into the provided Google Colab template (based on Jupyter Notebook, which was just too difficult to install locally), these are my main findings:

  • Our assumption that it labels all non-English words as proper nouns turned out to be almost entirely true – I think there were only a couple of examples where it identified them differently.
  • Sometimes the ROOT of a sentence was very weirdly placed.
  • Punctuation marks are not shown as separate tokens/entities in the dependency tree (see the sketch below).
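Regarding the last point, I assume the missing punctuation is an artefact of the displaCy visualiser, which merges punctuation onto the previous token by default; it can be kept as separate tokens like this (a sketch, assuming spaCy/displaCy is what the notebook uses):

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("She wrote: Kaniumiza. Nisaidie. Afiya.")

# collapse_punct=False keeps punctuation marks as their own tokens in the rendered tree
displacy.render(doc, style="dep", options={"collapse_punct": False}, jupyter=True)
```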

Here are some examples:

She wrote: Kaniumiza. Nisaidie. Afiya. He has hurt me. Help me.

In this example both kaniumiza and nisaidie are declared proper nouns, while kaniumiza is a direct object and nisaidie a ROOT. Afiya is also a proper noun and a ROOT, which makes sense as it is a name and the only part of this one-word sentence. However, the others do not make much sense, especially as the direct translation is given afterwards. I could understand all of them being a ROOT, but I just don’t understand why kaniumiza is seen as a direct object. It’s also unfortunate that the programme does not seem to treat the whole example as an entity in which the sentences correlate with each other on a semantic level, but only looks at each sentence individually. If this were different, it might identify He has hurt me. Help me. as the translation of Kaniumiza. Nisaidie.


’Jana, leo, kesho,‘ Afiya said, pointing to each word in turn. Yesterday, today, tomorrow.

This one confused me a lot: why is kesho a noun while the others are proper nouns? Also, jana is seen as a nominal subject, for which my only explanation is that the programme thinks it is the name Jana and not a word in another language. However, how come leo is a conjunction and kesho a direct object? All of them should be indexed the same. I also do not understand why there is suddenly no ROOT at all among these three words, whereas other examples did get one. Additionally, this time there was also something I don’t quite understand in the indexing of the English words: why is tomorrow identified as the ROOT and not one of the other words? It is also quite sad that this time, the programme – again – did not realise that the direct translation of the non-English words is part of this example.


After the third miscarriage in three years she was persuaded by neighbours to consult a herbalist, a mganga.

This one surprised me a bit, because mganga is not only seen as a noun, but also as an appositional modifier, meaning that the programme realised it is another word for herbalist. This is the first – and only – time that an example from the African language (I’m very sorry, but I just could not discern which of the 125 Tanzanian languages it is) is indexed correctly.


I then thought that maybe it would be different with another language and tried this example:

They were proud of their reputation for viciousness, and their officers and the administrators of Deutsch-Ostafrika loved them to be just like that.

However, even with German and the official name of a region (Deutsch-Ostafrika) there were problems. Though both parts of the word are seen as proper nouns – which is correct – only Deutsch is seen as a compound, while Ostafrika is seen as the object of a preposition. This is not necessarily incorrect; however, Deutsch-Ostafrika is one word, even if it is hyphenated. Hence, in my understanding, both parts of the word should be seen as a compound and the two together as the object of a preposition.
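A quick check (again only a sketch, assuming spaCy) shows where this comes from: the English tokenizer splits hyphenated words into separate tokens, so Deutsch-Ostafrika is never treated as one unit in the first place, and each half can receive its own label:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The administrators of Deutsch-Ostafrika loved them.")

print([(token.text, token.pos_, token.dep_) for token in doc])
# "Deutsch", "-" and "Ostafrika" appear as three separate tokens
```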


And lastly, another example with German: 

He looked like a … a schüler, a learned man, a restrained man.

Here, the programme did identify schüler correctly – as a noun and as the object of the preposition like. I was quite impressed with that, and what impressed me even more was the fact that it also identified learned man and restrained man as appositional modifiers of schüler. This is the only example sentence in which not only the POS-tagging but also the indexing and dependency relations are correct. My only explanation for this is that schüler is also a word used within the English language, though it is an old and not commonly used one (see OED), and hence known to English dictionaries.


Lastly, I want to say that I actually had kind of fun doing this. Yes, I had to look up some of the linguistic definitions, especially with the dependency relations, but overall it was fun. And a bit infuriating at times when the programme made the same mistakes again and again. So I’m looking forward to the next step of the process.

Problems (and correct classifications) in annotating training and example sentences in different languages from R. F. Kuang’s „Babel“: My experiences

Within the context of our seminar „Writing Across Languages: The Post-Monolingual Anglophone Novel“, our task was to test how the software, accessed via „Jupyter Notebook“ and equipped with an English database, classifies foreign words in a novel that is mostly written in English. The relevant categories were parts of speech and dependency tags. As „Jupyter Notebook“ was too tiresome to install, we worked with a copy of a Jupyter notebook in Google Colab instead. We had two example sentences which we could use in order to become acquainted with the software. Our main job was to read our novel and to note down examples of multilingual text passages, so that they could be annotated by the software.

Preparing these sentences for annotation posed a few problems for me. My first problem was that my book, „Babel“ by R.F. Kuang, uses a lot of Chinese words, and these words are sometimes presented as Chinese characters. I had some problems with copying the Chinese characters. The problem was not so much what a character meant or how it translated into English, as this was indicated most of the time, but I had no idea how to copy the Chinese characters, especially as the Kindle app does not allow copying words or phrases. My initial idea was to enter the romanized version of the character, or its meaning in English, into Google Translate and to then just copy the Chinese character from there. However, this didn’t work because the book already notes that the usage of this character for this meaning is quite unusual, and Google Translate only suggests one possible way of writing a Chinese word as a character. My second idea was to take a photo, copy the Chinese character from there and then paste it into my document. This also didn’t work because I couldn’t copy the characters in any of the apps I tested either. After some unsuccessful tries with apps in which the user can draw the Chinese character, during which the characters could not be recognized, I ended up on this website: https://www.qhanzi.com/index.html. This website also allows the user to draw the Chinese character and then guesses which character you drew, but it seems to have a much larger database than the apps I tested. Here is an example:

In this example, I wanted to draw the first character in the suggestions below. In the case of multiradical characters, meaning Chinese characters which consist of more than one radical, I had to choose the option „Multiradical“ and then choose characters from a large list. Here is also an example:

These two methods take a lot of time, of course, but in the end, I managed to find all the characters that I needed.

My second, more minor problem while copying the text from the app into my document were the horizontal lines above vowels (macrons) in romanized Chinese writing and also in Latin. In my research, I learned that in Latin, these lines mark a long vowel. One way or another, I knew that I had to indicate these lines above a vowel somehow. In the end, I found a Q&A page on which a user explained how to type these lines above vowels on the keyboard. This is the page I used: https://www.gutefrage.net/frage/lateinischer-betonungsstrich-word. Just like the website with the Chinese characters, this isn’t an academic website, but for my purpose, it sufficed.
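Since the annotation runs in Python anyway, the macrons can also be produced directly from their Unicode code points, which is what I would probably try next time (just a sketch):

```python
# Precomposed vowels with macrons ...
print("\u0101 \u0113 \u012b \u014d \u016b")  # ā ē ī ō ū
# ... or a combining macron added to a plain letter
print("a" + "\u0304")                        # also renders as ā
```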

Annotating the training sentences

As I mentioned earlier, I first went through the example sentences. While going through each of them individually, I will name a few of the mistakes the software made, along with some correct annotations, which are decidedly fewer. I usually list mistakes concerning words that are foreign to English first and then, if there are any, mistakes concerning English words. As for the correct annotations, I only mention non-English words, as it should be the norm that English words are annotated correctly, considering that the data with which the software was trained was written in English. The first training sentence was:

Tías called me blanca, palida, clarita del huevo and never let me bathe in the sun. While Leandro was tostadito, quemadito como un frijol, I was pale.

Lickorish Quinn 129

The mistakes that the software made which I recognized were that “palida” was categorized as a conjunction (correct: adjective) and that “frijol” was classified as an attribute. Concerning the English words, the only mistake that I recognized was that „let“ was classified as a conjunction (correct: auxiliary). Some correct decisions that the software made were that “Tías” was classified as a nominal subject, that “blanca” was classified as an object predicate, that “clarita del” was classified as a compound – but “huevo” as a conjunction – and that “tostadito” was classified as an adjectival modifier.

The second training sentence which was annotated was:

In Chinese, it is the same word ‘家’ (jia) for ‘home’ and ‘family’ and sometimes including ‘house’. To us, family is same thing as house, and this house is their only home too. ‘家’, a roof on top, then some legs and arms inside.

Guo 125-26

The mistakes here were the following: the first “家” was classified as an appositional modifier, but also as a noun (which is correct), the second “家” was classified as an unclassified dependent and thus radically differs from the first annotation, and “jia”, which is the romanized version of the Chinese character “家”, was categorized as a proper noun (correct: noun) and an appositional modifier. Concerning the English words, there were also a few mistakes: “family” was considered a conjunction (correct: noun), “is” was classified as a conjunction (correct: auxiliary), “legs” was classified as a noun (which is correct) and as a root and “arms” was classified as a conjunction (correct: noun).

Annotating the quotes from „Babel“

As I was curious how the software would react to sentences with Latin words, I started with a fairly easy one:

But in Latin, malum means “bad” and mālum,’ he wrote the words out for Robin, emphasizing the macron with force, ‘means “apple”.

Kuang 25

The mistakes in the annotation were that the first “malum” was categorized as a noun (correct: adjective) and a nominal subject (correct: adjectival subject) and that the second „mālum“ was classified as a proper noun (correct: noun) and a conjunction (correct: subject). The English words in this sentence, however, were categorized correctly. To me, this shows that the software does not understand the sentence because otherwise, it would have recognized that „malum“ means „bad“ and is thus an adjective and that „mālum“ means „apple“ and is thus a noun.

Okay, a sentence with Latin was not annotated successfully. Let’s see whether a Chinese word fares better. An example would be:

Wúxíng – in Chinese, ‘formless, shapeless, incorporeal’. The closest English translation was ‘invisible’.

Kuang 65

Nope, the annotation to this sentence was even worse. First of all, “Wúxíng” was classified as a proper noun (correct: noun) and as a root (which could be correct, as there is no main verb which could be the root of the sentence). Furthermore, there are a few English words which were not annotated correctly: “Chinese” was classified as a proper noun (correct: noun) and “formless” was classified as an adjectival modifier, while “shapeless” was classified as an adverbial modifier (correct: probably adjectival modifier).

Now, I was invested and wanted to get to the bottom of this weird annotation of foreign words by the Jupyter Notebook. That’s why I chose this as my third sentence from „Babel“:

Que siempre la lengua fue compañera del imperio; y de tal manera lo siguió, que junta mente començaron, crecieron y florecieron, y después junta fue la caida de entrambos.

Kuang 3

Honestly, I expected this sentence to be full of either classifications as proper nouns or full of mistakes. The mistakes in this sentence concerning the foreign words were that “crecieron” and “florecieron” were classified as nouns (correct: verbs), “de” was classified as unknown (correct: preposition), and the rest of the words were classified as proper nouns. Concerning the clausal level, the words were either classified as compounds or as appositional modifiers. Interestingly, the software correctly recognized “la” as a determiner and “imperio” as a direct object. I don’t know whether the software was just lucky with these two annotations or whether the position of the words in the sentence somehow influenced the correct annotation. One way or the other, it is clear that this software cannot annotate words in a language other than English very consistently.

As I recognized that it made little sense to enter entire sentences in another language into the software, I wanted to make the work for the software as easy as possible. Thus, I returned to Chinese and entered the following sentences next:

He will learn. Tā huì xué. Three words in both English and Chinese. In Latin, it takes only one. Disce.

Kuang 26

I was hoping that the software would recognize that the first two sentences were translations of each other and that they would thus be categorized correctly. I was disappointed: in English, the words “He will learn” were correctly classified as a pronoun, an auxiliary and a verb, while in Chinese, “Tā huì xué” were classified as a noun, a noun and a verb, which, based on the English translation (as I don’t speak Chinese), is not correct. Apart from that, “English”, “Latin” and, unsurprisingly, “Disce” were categorized as proper nouns, while actually “English” and “Latin” are nouns and “Disce” is a verb. One further mistake consisted in the software annotating „Chinese“ as a conjunction (correct: noun). The different annotations of one and the same sentence in different languages confirm my assumption that the software does not actually understand the words that it annotates.

Okay, Chinese didn’t work. I was curious whether another language which is closer to English would help. So, I chose two sentences in French from Babel:

‘Ce sont des idiots,’ she said to Letty. / ‘Je suis tout à fait d’accord,’ Letty murmured back.

Kuang 71

The result of this annotation was also disappointing. “Ce” was classified as a verb (correct: determiner), while “sont” (correct: verb) and “des” (correct: article) were classified as adjectives; “Je” (a personal pronoun) and “tout à fait d’accord” were classified as proper nouns. On the dependency level, “ce” was classified as an adverbial clause, “sont” and “des” were classified as adjectival modifiers, “suis” was classified as a clausal complement, and “à fait” was classified as a compound, while “tout” and “d’accord” were classified as a noun phrase as adverbial modifier and as a direct object, respectively. Interestingly, “idiots” (as a noun and as a direct object) and “suis” (as a verb) were classified correctly.

Okay, a closer language like French didn’t work. Maybe German would help the software annotate the foreign word correctly. That’s the reason why I chose this sentence:

But heimlich means more than just secrets.

Kuang 81

I figured that, as “heimlich”, or rather, its negative counterpart “unheimlich”, is often used in the context of horror literature, maybe the software would be able to recognize this word and thus annotate it correctly. However, I was, again, disappointed, as “Heimlich” was classified as a noun (correct: adjective) and a nominal subject (correct: adjectival subject).  

Next, I was again intrigued by the Chinese language and I wanted to know whether the results of the Chinese character and its translation in the training sentence above were just a coincidence. So, I chose the following sentence, which is similar to the training sentence, with a Chinese character:

Why was the character for ‘woman’ – 女 – also the radical used in the character for ‘slavery’? In the character for ‘good’?

Kuang 110

The answer was: no, the results from the training sentence were not a coincidence. “女” was classified as unidentified (correct: noun) and as an appositional modifier. Furthermore, an English word was also categorized incorrectly: “radical” was wrongly classified as an adjective (correct: noun). The classification of „In“ as the root could be correct, as the sentence „In the character for ‚good‘?“ contains no verb which could serve as its root.

Okay, now Chinese was out of the question for me. Maybe the French and German words were not established enough in the English language. So, I decided to choose a sentence with a French word which I had already heard used in other English sentences:

‘It’s not the company, it’s the ennui,’ he was saying.

Kuang 144

Well, “ennui” was indeed correctly classified as a noun and as an attribute. However, the particle “’s” was classified as a clausal complement. But never mind, we’re making progress concerning the annotation of the non-English words.

Next, I was interested in how the software would handle Greek words. As an example sentence, I chose:

The Greek kárabos has a number of different meanings including “boat”, “crab”, or “beetle”.

Kuang 156

Just like in the sentence before, the foreign word was classified correctly, in this case as a noun and as a nominal subject. However, now the software seemed confused about the classification of the English words: “including” was considered a preposition but also a verb, “crab” a noun but also a conjunction, while “beetle” was considered a verb (correct: noun) and a conjunction.

Okay, nice, the Greek word was also no problem for the software. As I am a native German speaker, I wanted to give a second chance to the annotation of a German word in my last example. I chose this as an example sentence:

‘The Germans have this lovely word, Sitzfleisch,’ Professor Playfair said pleasantly when Ramy protested that they had over forty hours of reading a week.

Kuang 168

Concerning the German word “Sitzfleisch”, I was disappointed because the software classified it as a proper noun (correct: noun) and as an appositional modifier. Concerning the English words, there were also some mistakes: “Professor” was classified as a proper noun (correct: noun) and a compound, while “Playfair” was classified as a nominal subject, “have” was classified as a clausal complement, “when” was classified as an adverbial modifier and “protested” was classified as an adverbial clause modifier.

A more general problem that I encountered while annotating these sentences was that some tags like “oprd” were neither mentioned in the table we received in order to recognize the clauses and the parts of speech, nor were they mentioned in the spaCy article. Instead, I found this website, which helped me with the abbreviations: https://stackoverflow.com/questions/40288323/what-do-spacys-part-of-speech-and-dependency-tags-mean.
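Another option, which I assume would have worked as well: spaCy itself can explain most of its labels. Assuming spaCy is what sits behind the notebook (as the tags suggest), spacy.explain() returns a short description for tags like “oprd”:

```python
import spacy

print(spacy.explain("oprd"))      # 'object predicate'
print(spacy.explain("npadvmod"))  # 'noun phrase as adverbial modifier'
print(spacy.explain("PROPN"))     # 'proper noun'
```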

Another, very minor technical problem concerned working with Google Colab. The program sometimes gave me feedback that I had too many open notebooks, although I had already closed them. To solve this problem, however, I simply had to click “Close all notebooks except the current” or something like that, and then I could continue annotating.

On a more positive note, the software consistently succeeded in the indexing of the tokens and in classifying punctuation marks as such. The only exception that I found was the apostrophe in „’s“.

Comprehension questions

I don’t really think that I still have any comprehension questions; I am just not quite sure whether I correctly assessed the parts of speech and the dependency tags, because the linguistics class in which I learned these terms was quite a long time ago, in about 2017. That is also the reason why I mostly didn’t indicate the correct dependency tags even when the given ones were obviously wrong. I googled or tried to look up the abbreviations and what they meant, of course, but I am still not quite sure whether there aren’t some mistakes that I made in this regard. That’s why I also didn’t check whether the tree at the bottom of the interface was correct. There could be some answers as to why the software classifies some words the way it does, but so far I have not seen a systematically wrong approach to foreign words, except that they are often classified as proper nouns and as compounds. I was also not quite sure how to write this blog entry, as I haven’t written a blog entry before and I am not that into blogs myself.