Multi-Lingual Annotations in Python with „A Concise Chinese-English Dictionary for Lovers“

This was not my first time using Python to work out the tokens of sentences by computational means. However, my last time working with Python (in a linguistics BA seminar with Prof. Kevin Tang) was some time ago, so I appreciated how easy it was to run our example sentences through it.

Running the three example sentences given to us, I noticed that punctuation marks that set off speech are often interpreted as part of words or compounds and receive a POS tag of their own, which makes it difficult to check which tokens refer to which words.

Additionally, foreign words were always interpreted as proper nouns – even in the Spanish phrase clarita del huevo, in which del is clearly not a proper noun. I would have thought that the Spanish might be easier to interpret, or perhaps translate, than the Swahili sentence, as it is probably more commonly known, but Python does not do any translating and so struggles with anything that is not English. Dependencies can thus not be determined correctly: in the sentence Tías called me blanca, palida, clarita del huevo, the last three parts (blanca, palida, clarita del huevo) form a list, so the nouns and noun phrases stand on an equal footing and do not depend on one another, yet Python marks huevo as a dependent of palida.
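
For anyone who wants to reproduce these labels, the tags and dependency heads discussed here come from a spaCy pipeline roughly like the sketch below – my own minimal version, assuming the small English model en_core_web_sm rather than the exact notebook we were given:

```python
# Minimal sketch of the annotation step: print each token with its
# part-of-speech tag, dependency label and the head it depends on.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed English model
doc = nlp("Tías called me blanca, palida, clarita del huevo.")

for token in doc:
    print(f"{token.text:<10} {token.pos_:<6} {token.dep_:<10} head: {token.head.text}")
```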

The given example with Swahili words in it cannot be properly tagged at all as – according to Python – ‘Ayaaaana! Haki ya Mungu … aieee!’ The threat-drenched contralto came from the bushes to the left of the mangroves. ‘Aii, mwanangu, mbona wanitesa?’ is made up entirely of nouns or proper nouns. Only the English part could be identified correctly.

The English-Chinese example presents similar issues:

  • In Chinese, it is the same word ‘家’ (jia) for ‘home’ and ‘family’ and sometimes including ‘house’. To us, family is same thing as house, and this house is their only home too. ‘家’, a roof on top, then some legs and arms inside.

In this example, the Chinese hanzi are not computed as proper nouns. Python interprets 家 as a noun once and an adjective another time, marking it – quite nonsensically – as a dependent of legs.

Turning now to the novel I have been reading – A Concise Chinese-English Dictionary for Lovers by Xiaolu Guo – I looked for similar multilingual sentences to test in the programme. I could not find many, but tested the ones I did find:

  • ‘知识’ mean knowledge, ‘分子’ mean molecule.

知识 is here correctly interpreted as a noun, although it is interpreted as one singular noun rather than a noun phrase consisting of a verb (知 – to know) and a noun (识 – knowledge).

The same goes for 分子, interpreted as one noun rather than a noun phrase consisting of a verb (分 – divide) and a noun (子 – son, child).

The word “mean” in this sentence is the verb “to mean”, but it is not conjugated correctly because the narrator is not yet fluent in English and struggles with English grammar. Python thus tags it as an adjective. Accordingly, the dependencies turn out incorrect as well, since mean should function as the head of its sentence. Instead, knowledge becomes the head upon which all other words depend.

  • 屁 is fart in Chinese. It is the word made up from two parts. 尸 is a symbol of a body with tail, and 比 underneath that represent two legs. That means fart, a kind of Chi.

In the third sentence, “represent” is interpreted as a dependent of “is” – again, likely due to the improper grammar the narrator uses. The hanzi are either not processed at all (屁), marked as a proper noun (尸), or marked as a noun (比).

  • Chi (气), everything to do with Chi is very important to us Chinese.

This sentence starts out somewhat elliptically. The phrase very important to us Chinese is tagged and interpreted correctly, but Python struggles with everything that comes before it, marking Chi (气) as a dependent of is. Evidently, Python cannot process and interpret punctuation correctly, which in this case should indicate that the first word stands apart from the sentence after the comma.
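
To see how the punctuation is handled, one can look at spaCy’s sentence segmentation and at the tags given to the punctuation tokens themselves; a small sketch under the same assumptions as above:

```python
# Inspect sentence boundaries and punctuation tags for the elliptical example.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed English model
doc = nlp("Chi (气), everything to do with Chi is very important to us Chinese.")

for sent in doc.sents:
    print("sentence:", sent.text)

for token in doc:
    if token.is_punct:
        print(token.text, token.pos_, token.dep_, "head:", token.head.text)
```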

In the next two examples I wanted to test how Python deals with incorrect English grammar without the interference of non-English words, hypothesising that cases like the mean example above would also occur here.

  • I feeling I can die for all kinds of situation in every second.

In this case, given the previous errors Python made due to incorrect grammar and conjugation, I thought that feeling might be interpreted as a noun because the auxiliary “am” of the progressive form is missing. Surprisingly, Python has no problem recognising feeling as the verb it is indeed supposed to be, marking it correctly as the head/root of the entire sentence.
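
A quick way to verify such an observation is to ask spaCy directly which token carries the ROOT label; again only a sketch under the same assumptions:

```python
# Check which token spaCy treats as the root/head of the whole sentence.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed English model
doc = nlp("I feeling I can die for all kinds of situation in every second.")

root = next(token for token in doc if token.dep_ == "ROOT")
print("root:", root.text, root.pos_)
```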

  • I scared by cars because they seems coming from any possible directing.

As above, I wanted to test how Python deals with these grammatical errors (seems instead of seem; directing instead of direction) – and again, surprisingly, all tokens were tagged and interpreted properly along with all their dependencies. Even directing was correctly identified as a noun instead of a progressive verb.

Evidently, annotating multilingual sentences correctly is not possible – at least not with the Python code we have been given. While the programme has no problem interpreting English sentences with incorrect grammar, it is thrown for a loop as soon as non-English words are introduced, which was a very interesting observation to make.

Annotating multilingual sentences in „America Is Not The Heart“ by Elaine Castillo

Working with Google Colab to do the sentence annotations wasn’t so scary to me because I was already familiar with some coding/programming and linguistics from my bachelor’s. However, I hadn’t worked with the two in combination like this before, so that was a new and interesting experience for me. I had no real problems working with the program; some difficulties only occurred when trying to run multiple Colab notebooks at the same time (I had to save and close some before moving on to new ones!).

I chose a few sentences from the novel „America Is Not The Heart“ by Elaine Castillo: some are mostly English with a few Tagalog loanwords (all food-related items), some are Tagalog sentences with a few English words, and some are entirely in Tagalog. From my first observations I could tell that the program had no idea how to deal with the Tagalog words or sentences. On a positive note, it at least annotated the English parts correctly, as far as I could see.

As a visualization, I copied the sentence into an Excel table with the annotations and dependency relations that the program provided. The yellow highlighted bit is my attempt at giving a slightly more correct version of the word categories, along with a rough translation. I did not attempt my own dependency relations for lack of grammatical knowledge, but I would guess that they should look very different from the current ones.
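
Instead of copying the results by hand, the token annotations could also be collected in a pandas DataFrame and written to an Excel file; the sketch below shows how that might look (the file name and example sentence are placeholders, and the openpyxl package is needed for the export):

```python
# Collect token, POS tag, dependency label and head in a table and export it.
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed English model
doc = nlp("Replace this with a sentence from America Is Not The Heart.")

df = pd.DataFrame(
    [{"token": t.text, "pos": t.pos_, "dep": t.dep_, "head": t.head.text} for t in doc]
)
df.to_excel("annotations.xlsx", index=False)  # requires the openpyxl package
```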

Either way, I found it very interesting to see how the program dealt with the sentences!

My experience with annotating multilingual sentences with Google Colab

My first experience with Google Colab was surprisingly positive. While the idea of working with programming software seemed overwhelming at first, when we actually got to try using it during the seminar, it turned out to be quite intuitive and easy to navigate with the help of Ms. Pardey and the other students. The concept of syntactic trees and dependency relations was something I initially struggled with, considering my Introduction to Linguistics lecture was in 2017 and I have since mostly stayed in the field of Literary and Cultural Studies. Combined with the abbreviations of the different tags within the programme, this made it difficult for me to understand the results Google Colab was showing me. However, the class discussion and the glossary, as well as some revision using Google, were very helpful. What I also did was feed the programme simple sentences at first (think “I like apples.” “What is your favourite colour?”) to see what the results would look like with less complicated sentence structures.
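
One more thing that might help with the abbreviations: spaCy can spell most of its own tag names out. A minimal sketch:

```python
# spacy.explain() returns a human-readable description of a tag or label.
import spacy

print(spacy.explain("PROPN"))  # proper noun
print(spacy.explain("nsubj"))  # nominal subject
print(spacy.explain("ADP"))    # adposition
```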

When I tried to annotate my own sentences from the novel “On Earth We’re Briefly Gorgeous”, I came across some difficulties, as I cannot understand Vietnamese and thus first had to work out the structure of the Vietnamese passages myself in order to verify Google Colab’s results. What I discovered was that the programme struggled to identify the Vietnamese words, which seems to me to be inevitable because, as I understand it, the language model we used is only trained on English(?). Because of this, the overall dependency relations were off, since my example sentences combined English and Vietnamese words in the same sentence, mostly without an indicator like an English preposition or determiner. I could not yet find a way to fix this problem but am very interested in what a solution would be in such cases.

I am interested to learn about the others’ experiences with Google Colab and am keen to learn more about computer-based analysis of multilingual text.

Annotating Multilingual Sentences in Yara Rodrigues Fowler’s „Stubborn Archivist“ – Experience & Observations

Initially, I was intrigued but also a bit worried about working on this project with Python, as previous linguistic research during my bachelor’s taught me that working with programming software can be a bit error-prone and frustrating at times. However, thanks to the prepared script, the annotation process via Google Colab and Python was very intuitive and easy to use, so thankfully, I did not experience any major technical difficulties.

For my research, I annotated multilingual sentences from Stubborn Archivist by Yara Rodrigues Fowler. The novel was published in 2019 and follows the coming-of-age journey of a young British-Brazilian woman in contemporary South London. While mainly written in English, it also includes many Portuguese terms and phrases, highlighting the character’s connection to both British and Brazilian cultures and identities.

In general, the software correctly annotated single Portuguese nouns like tia or empregada when they were used in a standard English sentence structure and clearly marked as nouns, for instance by a preceding determiner:

You don’t have an empregada at your house in London?

(Rodrigues Fowler 137)

However, some errors arose with compound nouns, such as in leite condensado (condensed milk), where the second part, condensado, was mistakenly identified as the head of the phrase instead of leite.

Vovó Cecília shook the bowl of cocoa powder over the pan and as the sprinkle powder became wet and fat and darkened the baby curved it into the centre of the hot leite condensado.

(Rodrigues Fowler 132)

Another recurring problem I encountered in the annotation involved Portuguese words that have English equivalents. For instance, in the passage:

But Vovô I thought Columbus discovered America? No. Christopher Columbus discovered América do Norte. América is a continent—two continents. And Brasil is in América do Sul.

(Rodrigues Fowler 146)

Here, the Portuguese preposition + determiner do was mistaken for the English (to) do and thus incorrectly annotated, once as a verb and once as an auxiliary.

On a stylistic level, the frequent use of dialogue inserts without proper punctuation stands out prominently in Rodrigues Fowler’s novel. This aspect also posed the most significant challenge during annotation, as illustrated by the following example:

So your father every time we went out he would want to try a new juice
And obviously there are so many juices and fruits that he had never seen before
Açaí
Maracujá
Acerola
Jabuticaba
Cajú
He was always asking—But what is that in English?

(Rodrigues Fowler 117)

In such instances, the program was unable to determine the correct sentence structure of the individual phrases, often interpreting them as one continuous sentence. Moreover, in this particular example, the individual Portuguese names for various fruits and juices were not recognized accurately and were instead marked as two compound nouns. However, this issue likely stems from the novel’s use of line breaks to separate the individual turns of speech, which cannot be conveyed to the annotation program.
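
One possible workaround, sketched below, would be to feed each line of such a passage to the model separately, so that the line breaks are respected rather than flattened into one sentence (my own suggestion, not part of the given notebook):

```python
# Process each line of the dialogue passage as its own text so that the
# fruit names are not merged into one compound noun.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed English model
passage = """Açaí
Maracujá
Acerola
Jabuticaba
Cajú"""

for line in passage.splitlines():
    doc = nlp(line)
    for token in doc:
        print(token.text, token.pos_, token.dep_)
```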

Conversion and Annotation of Susan Abulhawa’s „The Blue Between Sky and Water“, or A Demonstration of Software Failure through Anglocentrism?

Introduction

Having focused throughout my studies on reading literature through a postcolonial lens as well as on Eurocentric bias in the field of linguistics, I found the attempt to use conversion and annotation tools on a postcolonial, post-monolingual Anglophone novel intriguing. I was interested to see how the software would deal with the novel I chose – Susan Abulhawa’s The Blue Between Sky and Water.

The novel by the Palestinian-American writer and human rights activist mixes Palestinian Arabic with English in a few different ways, although some patterns can be observed: Food items, terms for relatives, and culture-specific terms are usually written in latinized Arabic. Terms are usually introduced in italics once, and then re-appear throughout the novel un-italicised. As the software works with raw text, these specificities were lost. Apart from nouns, however, the novel includes verbs, adjectives, and phrases here and there in Palestinian Arabic as well, sometimes translated in the next sentences or before that, and sometimes not.

Sentence Choice

I chose seven sentences that show variation in the mixing of languages. For instance, one sentence I picked only includes Arabic nouns which denote food items:

                “One of her brothers arrived and they all shared a late breakfast of eggs, potatoes, za’atar, olive oil, olives, hummus, fuul, pickled vegetables, and warm fresh bread” (174).

Another contains only one adjective in Arabic:

                “‘Who is there?’ a woman’s voice asked in Arabic and Nazmiyeh relaxed upon hearing the Palestinian fallahi accent” (35).

Yet another contains an entire phrase:

                “The woman with wilted breasts began to sob quietly as others consoled her and banished the devil with disapproving eyes at Nazmiyeh – a’ootho billah min al shaytan – when a female soldier wheeled in a large box of clothes, and with a gesture of her hand, gave the naked women permission to get dressed” (114).

In addition, I chose other sentences that include nouns in different ways, to compare how the software deals with them. This was useful, as will become clear in the analysis of the mistakes the software made in the annotation.

Problems and Technical Difficulties

In general, working with the Jupyter-Notebook-style interface via Google Colab and Python worked out without any major issues. The only problem I noticed was that double quotation marks could not simply be pasted in, as the example strings in the code are themselves delimited by double quotes, so the extra marks confuse the interface. Single quotation marks were acceptable to the software, so I replaced the double quotation marks with single ones.
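
That said, the double quotation marks themselves are not forbidden by Python; they only clash when the string holding the sentence is also delimited by double quotes. A small sketch of alternatives that keep the original punctuation (my own notes, not part of the notebook we were given):

```python
# Three ways to keep double quotation marks inside a Python string.
sentence_a = 'She asked, "Who is there?" in Arabic.'      # single-quoted string
sentence_b = "She asked, \"Who is there?\" in Arabic."    # escaped quotes
sentence_c = """She asked, "Who is there?" in Arabic."""  # triple-quoted string

print(sentence_a)
print(sentence_b)
print(sentence_c)
```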

Google Colab, however, was not as user-friendly. I found it rather unintuitive, and it often claimed I had too many tabs open at the same time. As other students recommended, clearing the history and waiting a few minutes seemed to solve that issue. It is, however, time-consuming.

Mistakes in the Annotation

Most of the time, Arabic words were labelled as PROPN, proper nouns. This was especially the case with the sentence that includes an entire phrase in Arabic. For instance, “a’ootho” should be labelled as a verb but was labelled as a proper noun. “billah” was labelled as one single proper noun, even though it should be labelled as a preposition in combination with a noun, or proper noun (Allah). In other sentences, the software labelled Arabic nouns as adverbs. One example is “fuul”, which is a type of bean. Yet other nouns, such as “jomaa”, were incorrectly labelled as adjectives. So, while the most frequent mistake was to label any Arabic word as a proper noun, the software was not consistent in its mislabelling throughout.

The software was still able to locate the ROOT correctly, despite its confusion around the Arabic terms. Arabic words resembling English words were sometimes thought to be English words. For instance, “Um” (mother) was mistakenly labelled as an interjection.


Overall, this was an interesting experience, but I was disappointed by the extent to which the software was unable to deal with Arabic words, even though this was to be expected.

Converting and Annotating Multilingual Sentences & Quotes – My Experience

Initially, I was very unsure about this task, because as someone who has focused on literature during their studies (for good reason), I am neither that good at linguistics, nor at programming or coding. While it seemed an intriguing task, there was also some apprehension on my part, looking at the Google Colab file for the first time. However, this was quickly overcome when we went through the steps one by one. I was curious to see how this program would deal with multilingual sentences, when it is based purely on English.

I chose to examine A Concise Chinese-English Dictionary for Lovers. This book not only contains multilingual sentences as in a mixture of English and Chinese, it also contains grammatically incorrect English, as it is written from the point of view of the protagonist, who is learning English as the storyline progresses.

Because of this, I naturally ran into some issues. One of the sentences I chose to look at was this one: „Chinese we say shi yue huai tai (十月怀胎). It means giving the birth after ten months pregnant.“
When it came to the POS tagging, not only did the program categorise all of the pinyin Chinese words in the sentence as proper nouns, it also counted the Hànzì as one single word rather than segmenting them.
In addition to this, the word „Chinese“ was categorised as an adjective, as spaCy is incapable of recognising that it is meant as a noun, because the sentence is not grammatically correct.

It was definitely interesting to see what the program made of the example sentences from A Concise Chinese-English Dictionary for Lovers, and even though I am still in the process of getting the hang of tokenisation and dependencies, I am interested to see what we do next.

Annotating Multilingual Sentences in „Hold“ by Michael Donkor: Twi and English

In his novel „Hold“, first published in 2018, Michael Donkor continually weaves Twi words into English sentences, thus constructing a multilingual narrative. When he uses words in Twi, the author highlights them and sets them apart by italicizing them. Though much could be said about the function of this practice, it is a secondary issue with regard to annotating sentences from the novel. Within the context of the seminar „Writing across Languages“, I am mainly interested in the form – italics being a part of it. Moreover, I am interested in how the multilingual sentences can be annotated and what challenges arise in doing so.

Multilingual Sample Sentences in Donkor’s Novel

There are different techniques that Donkor uses in order to establish multilingualism. I tried to choose sentences, words and phrases that show a variety of techniques the author uses. Often, phrases or whole sentences in Twi are used in dialogue.

‚What a polite and best-mannered young lady we have on our grounds this pleasant day. Wa ye adeƐ.

Donkor, Michael: Hold, p. 7

Me da ase,‘ Belinda said softly.

Donkor, Michael: Hold, p. 10

Other times, only one word is used in an English sentence. Here, as in almost all cases of multilingualism in Donkor’s novel, the Twi words are italicized.

Belinda worked the pestle in the asanka, using her weight against the ingredients, grinding the slippery onion and pepper.

Donkor, Michael: Hold, p. 28

In the tro tro on the way home from the zoo, Belinda had done her best to enjoy Mary’s sulking silence.

Donkor, Michael: Hold, p. 25

I have chosen 8 sample sentences in total and pasted them into „Jupyter Notebook“. Though a few letters differ from the English alphabet and had to be inserted separately, I faced no technical difficulties. A variety of challenges arose, however, regarding the machine annotation of Twi.

Challenges in Annotating Twi in English Sentences

No Italics in Annotations

I have repeatedly mentioned that Donkor uses italics to signal multilingualism. There is no way to indicate italics within „Jupyter Notebook“. Thus, it would be impossible to use these annotations to analyze the use of italics in multilingual texts, whether with a diachronic lens or otherwise. Nor can any differing practices across languages be analyzed, seeing that there is no way to indicate italics and therefore search for them later on.

Twi Words with the same form as English words

There are a few instances within the chosen sample sentences where a Twi phrase includes a word that resembles an English word. „Jupyter Notebook“, being based on an annotation model for the English language, identifies the form of these words and classifies them according to the English POS. In the chosen, annotated sentences this issue applies to the words „me“ and „bone“. Since I lack the language skills in Twi to understand the words, I can neither confirm nor deny whether the classification happens to be correct. It shows, however, that there are challenges in differentiating the Twi words from the English words.

Classifying Twi Words Consistently

In general, there is a lack of consistency with respect to the classification of Twi words. The word „Aboa“ appears twice in two adjacent sentences. Still, the classifications for the word differ despite the identical form. First it is identified as „ADJ“, then as „PROPN“. Because the model has no input from the Twi language, these parts of the sentences are not labeled correctly.

‚Aboa!‘ Mary laughed. Aboa was Mother’s insult of choice too;

Donkor, Michael: Hold, p. 52

Technical Difficulties and First Results of Annotating Multilingual Sentences in „The Moor’s Account“

First of all, I have been finding this work very interesting, as it gives me a new perspective on literature. However, I encountered some technical difficulties while having Google Colab analyse my sentences. After three sentences, it told me I had reached my free limit and would have to purchase Colab Pro. This is probably because I tried to save my results by opening a new file for every sentence. Thus, I have only worked through a couple of examples so far, taken from the novel The Moor’s Account by Laila Lalami.

The first sentence I had annotated was, „When I said, Bawuus ni kwiamoja, one of the women inevitably corrected me, Ni bawuus kwiamoja.“ (Lalami 175). Both times the words are used in the sentence, Colab tagged them as proper nouns, as if they were all names. The phrase is not translated in the novel, but the correction is explained as follows: “in Capoque […] the doer and the done-to were spoken of before the deed itself” (ibid.). This means that, presumably, “Ni” and “bawuus” are the doer and the done-to, while “kwiamoja” is a verb. It is very difficult to research this language, however; I suppose it is no longer spoken.

Another sentence I let Colab process was, “I whispered Ayat al-Kursi to myself“ (123). “Ayat”, “al”, and “Kursi” are all tagged as proper nouns, which is acceptable, I believe, as the words refer to a specific verse in the Quran. In the dependency analysis, however, it says that “Kursi” is dependent on the word “whispered” as its direct object (dobj) and is the main part of the compound.

Hopefully, I will be able to go through more sentences and I am excited for the next steps.

Experiences and observations on annotating the multilingual novel „The Dragonfly Sea“ by Yvonne Adhiambo Owuor

When I took up the class “Writing Across Languages – The Post-Monolingual Anglophone Novel”, I hadn’t really anticipated the extent to which we were going to work in the field of digital humanities. During my bachelor’s in literary studies and philosophy there was very little computational work involved, and it is – mildly stated – not my field of expertise.

Nevertheless, I am quite thankful for the opportunity to get to know some of the methods and advantages of DH and distant reading as well as, on the other hand, learning about some of the problems and biases of more traditional methods like close reading.

Ironically, this blog post is still very much concerned with the many problems of English annotation programmes. In the seminar, each one of us annotates a post-monolingual anglophone novel, that is, a novel mainly written in English but with a considerable portion of foreign language(s) incorporated in the text. For this purpose we used pandas (a Python library) and spaCy equipped with an English model, run via Google Colab (instead of Jupyter Notebook, which seems to be very difficult to install). Although this is only one of probably many programmes (not my field of expertise, remember?), I think it is safe to say that most software trained on the English language will share these problems.

I will now list some of the difficulties I encountered personally while annotating my sentences and then describe some of the general mistakes I noticed in the programme’s annotation – concerning foreign words and phrases, but also (and this I hadn’t expected) just poetic language in general.

The first difficulty I encountered was my limited knowledge of linguistics. I have never studied any language, so my last linguistics input dates back to elementary school – or, maybe, whatever I randomly picked up in secondary literature during my bachelor’s. Nor did I understand all the abbreviations spaCy used for token dependencies. Fortunately, most of the mistakes the programme made were so obvious that even I noticed them. And to be honest, there were already so many of those that I might actually be lucky not to notice the rest.

The second obstacle I had to overcome was the limited number of foreign (meaning non-German) languages I know. The novel I chose, “The Dragonfly Sea” by Yvonne Adhiambo Owuor, includes a great variety of languages, among them Swahili, Pate, Arabic, Turkish, Mandarin, French, Portuguese, Hindi and English. Of those, I only know French and English (what a Eurocentric education, right?). So, to actually check whether the foreign words and sentences were annotated correctly, I had to translate them in a way that let me understand the structure of the language.

Last but not least, I had to deal with Chinese characters and Arabic letters that weren’t necessarily translated in the book. I had to look up transliterations into Latin letters in other parts of the book or guess the meaning, then look up the English translation with DeepL, and then translate the English back into Chinese characters. Oftentimes I had to consult several versions of a translation in other online dictionaries and translation programs as well to find the right character and copy it into the programme.

Concerning the programme’s mistakes, the most common was the classification of foreign words as either proper nouns (PROPN) or nouns (NOUN). Take, for example, the following sentence:

„Ayaana asked, ‚Ma-e, mababu wetu walienda wapi?‘ – Where are our people?“

(Owuor 32)


Every last word (even the „e“ as a single token) was labeled a proper noun, except the word „wapi“, which was labeled an adverb. Even if the book didn’t give me the English translation right away, it is highly unlikely that a sentence would consist of five proper nouns and an adverb. I didn’t even bother to check whether the dependencies in that sentence might be right.

Also, the programme’s classification of the tokens wasn’t consistent in itself. Take the following paragraph:


„Before the child had seen him, she used to twirl in the ocean’s shallows and sing a loud song of children at ease: ‘Ukuti, Ukuti Wa mnazi, wa mnazi Ukipata Upepo Watete…watete…watetemeka…’“ (Owuor 16)


Whereas the first „Wa“ is labeled a proper noun, its repetition „wa“ is labeled an adposition (ADP). In the same manner, the first „Watete“ is classified as a proper noun, the second „watete“ just as a noun. I suspect this has something to do with the capitalisation (in fact, if I’m correct, all capitalised foreign words were PROPN). But it shows that all foreign words are labeled somewhat arbitrarily.
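
To test this suspicion about capitalisation, the same word can be fed to the model with and without a capital letter; a quick sketch (bearing in mind that the surrounding context also influences the tags, so the results within the full sentence may differ):

```python
# Compare the POS tag spaCy assigns to the same Swahili word with and
# without an initial capital letter.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed English model

for word in ["Watete", "watete", "Wa", "wa"]:
    doc = nlp(word)
    print(word, "->", doc[0].pos_)
```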

This is, by the way, the case for all non-English words, regardless of the language. I thought it might work better with European languages, given that the machine was probably trained with literature from the European canon. But a French sentence, for example, is treated in just the same manner:


„‘Us.’ ‘Us?’ ‘Yes.’ ‘C’est une chose à laquelle je n’avais pas pensé.’“ (Owuor 218)


The French determiner une became a PROPN, as did à (correct: preposition), laquelle (correct: pronoun), je (correct: pronoun), n’avais (correct: negation and verb), pas (correct: negation), and pensé (correct: verb). The French noun chose became the verb and root of the sentence. This also happens when we are not dealing with a whole sentence but only a few words, as in the following example:


„Well, he has his character enter a tavern and go up to a tavern keeper and request a solicitud de asilo – lovely word – ‘solicitude,’ it evokes protectiveness.“


Here, spaCy took solicitud to be an adjective (correct: noun), asilo to be an adverb (correct: noun), and de was marked with an X for „other“ (correct: preposition).

The French example brings me to another observation: I was surprised to notice that the programme seems to have quite some problems with poetic language. Especially elliptical sentences seemed to confuse it. In the example above, the first „Us“ is also labeled as a PROPN. The second „Us?“, on the other hand, is correctly recognized as a pronoun (PRON). I suspect this is due to the fact that it is more common to use only a pronoun in an interrogative clause. Still, it is curious to see that a programme used to analyse literary texts cannot process an elliptical sentence. The same happens in the following example:


„Muhidin told Ayaana to repeat its name, kereng’ende, in four other languages: ‘Matapiojos. Libélula. Naaldekoker. Dragonfly.’ Ayaana intoned, ‘Matapiojos-libélula-naaldekoker-dragonfly.’ (Owuor 38)


Whereas I was no longer surprised that non-English words were categorized incorrectly, here even „Dragonfly“ was labeled a PROPN. Even more alarming was the outcome when I annotated the following paragraph:


“‘Allahu Akbar…‘ Another day, night, day. Herald of promise, easing an ancient brooding island into wakefulness. (Owuor, 15)


The programme actually took „Herald“ to be the PROPN and ROOT of the sentence. Although this sentence is somewhat poetic, it is nonetheless not highly unusual in an English novel. I must wonder whether the programme would annotate a monolingual novel correctly or whether, in this instance, the foreign words confused the machine so much that it makes mistakes it normally wouldn’t make (but that seems to be a humanization of the programme, I suppose…).
Last but not least, I noticed that the programme was also confused by unusual use of capital letters, for example when citing a poem.

So far, these are my experiences with annotating „The Dragonfly Sea“. I am very curious to see what more we can learn about DH methods and how the discipline will tackle the above-mentioned problems in the future.

Initial experiences in annotating multilingual text in Ocean Vuong’s „On Earth We’re Briefly Gorgeous“

I have always been fascinated by how literature and language studies have been influenced by digital fields such as coding and AI, so working on this project and studying the interdependencies between digital humanities and comparative studies has been a great learning experience. Having said that, it should be noted that I am not very tech-savvy, so the annotation assignment using Google Colab and Python initially seemed a little daunting. After using it together in class though, it became easier to work with the software and my initial confusion was cleared.

The text I’m working on is Ocean Vuong’s On Earth We’re Briefly Gorgeous, a predominantly English text with a few Vietnamese words and phrases. As a novel, it challenges the conventional idea of a mother tongue as a source of identity and stability. Through the protagonist, a Vietnamese American speaker who translates between English and Vietnamese, Vuong portrays the mother tongue as something that is constantly changing and disconnected from its origins, like an orphan. Multilinguality, then, in the context of the novel becomes more of a cross-cultural discourse in which the experiences of Vietnamese people in America do not get translated.

Finding samples for annotation and analysis in such a text was a bit of a task because there were comparatively fewer multilingual passages. Nevertheless, the passages I did work with give a fairly comprehensive idea of how the software is heavily Anglo-centric and fails to correctly annotate non-English languages.

One of the sentences I used for analysis was: “Đẹp quá!” you once exclaimed, pointing to the hummingbird whirring over the creamy orchid in the neighbor’s yard. Vuong gives us a translation of the Vietnamese phrase in the next sentence of the novel: „It’s beautiful“. SpaCy recognises the words as PROPN and verb, but as is clear from the translation in the source text, neither of the words is a proper noun or a verb and could roughly be classified as adjective and pronoun if we are to categorise based on POS. “Creamy”, used to describe the orchid is also recognised as a noun and not an adjective by the software. Not being a student with a linguistics background, I have had difficulties understanding the dependency encoding part of the programme and then applying it to check whether it has been run correctly or not. But, as far as I have comprehended, dependency relation indexing for English words and POS is accurate while Vietnamese words are wrongly indexed (I suspect based on the already incorrect POS tags). 

I also wanted to work with a sentence where Vietnamese was not used as direct speech, to see if the software recognises the words differently: You wanted to buy oxtail, to make bún bò huế for the cold winter week ahead of us. The novel doesn’t give an exact translation, but it is evident from context that it is a dish made with beef. A little bit of independent research revealed that bún bò huế is a dish made of rice noodles and beef slices. The software almost correctly recognises “bún” and “huế” as nouns, but “bò” is wrongly identified as a verb. Expecting a program to recognise the cultural intricacies of a word might be expecting too much of it, but even in the instances where it did tag a word correctly, I am inclined to believe it was merely a fluke.

”Tên tôi là Lan.” My name is Lan. was another sentence that I tried to annotate, and because I was having fun with the software by now, I also wanted to try annotating the two parts separately – just the Vietnamese and then just the English translation. The results were extremely interesting, and although I cannot fully explain why they varied (again, maybe a linguistics background would have helped), it goes to show that the spaCy software needs to be made more inclusive of languages that are not Euro- or Anglocentric.
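
The comparison itself simply ran the three versions through the same pipeline; a sketch of that set-up, again assuming the English model from the notebook:

```python
# Annotate the mixed sentence, the Vietnamese part, and the English
# translation separately and compare the tags.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed English model

versions = [
    "Tên tôi là Lan. My name is Lan.",
    "Tên tôi là Lan.",
    "My name is Lan.",
]

for text in versions:
    doc = nlp(text)
    print(text)
    for token in doc:
        print("  ", token.text, token.pos_, token.dep_)
    print()
```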

The Vietnamese name is recognised as PROPN correctly, maybe based on the capitalisation. But the POS tags, and hence the dependency indexing, are incorrect for the Vietnamese. As seen in the above images, Python’s consistency in indexing and classifying varies for the same sentence depending on whether it is paired with English words or not. The deprel for Lan also differs between the Vietnamese and the English sentence (npadvmod vs. attr).

This only goes to show that almost all software and annotation code is written in English-speaking countries and for the English-dominant market, and that Anglocentrism poses barriers in literary studies when it comes to studying multilingual texts and their correct and faithful translation.