I have always been fascinated by how literature and language studies have been influenced by digital fields such as coding and AI, so working on this project and studying the interdependencies between digital humanities and comparative studies has been a great learning experience. That said, I am not very tech-savvy, so the annotation assignment using Google Colab and Python initially seemed a little daunting. After using it together in class, though, the software became easier to work with and my initial confusion cleared.
The text I’m working on is Ocean Vuong’s On Earth We’re Briefly Gorgeous, a predominantly English text with a few Vietnamese words and phrases. As a novel, it challenges the conventional idea of a mother tongue as a source of identity and stability. Through the protagonist, a Vietnamese American speaker who translates between English and Vietnamese, Vuong portrays the mother tongue as something constantly changing and disconnected from its origins, like an orphan. Multilingualism, in the context of the novel, then becomes more of a multicultural discourse in which the experiences of Vietnamese people in America do not get translated.
Finding samples for annotation and analysis in such a text was a bit of a task because there were relatively few multilingual passages. Nevertheless, the passages I did work with give a fairly comprehensive idea of how the software is heavily Anglo-centric and fails to correctly annotate non-English languages.
One of the sentences I used for analysis was: “Đẹp quá!” you once exclaimed, pointing to the hummingbird whirring over the creamy orchid in the neighbor’s yard. Vuong gives us a translation of the Vietnamese phrase in the next sentence of the novel: “It’s beautiful”. spaCy tags the two Vietnamese words as PROPN and VERB, but as the translation in the source text makes clear, neither word is a proper noun or a verb; roughly, they could be classified as an adjective (“đẹp”, beautiful) and an intensifying adverb (“quá”, so/very). “Creamy”, used to describe the orchid, is also tagged as a noun rather than an adjective. Not being a student with a linguistics background, I had difficulty understanding the dependency-parsing part of the programme and then checking whether it had run correctly. But as far as I have understood, the dependency relations for the English words and their POS tags are accurate, while the Vietnamese words are wrongly indexed (I suspect on the basis of the already incorrect POS tags).
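For readers who want to try this themselves, here is a minimal sketch of the kind of annotation pass we ran in Colab. It assumes the small English model `en_core_web_sm` (which is what a standard English spaCy setup uses; the exact model from our class notebook may differ), and falls back to a blank English tokenizer if the model is not downloaded:

```python
import spacy

# Load the small English pipeline if it is available; otherwise fall
# back to a blank English tokenizer so the sketch still runs (tags
# will then simply be empty).
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    nlp = spacy.blank("en")

sentence = ('"Đẹp quá!" you once exclaimed, pointing to the hummingbird '
            "whirring over the creamy orchid in the neighbor's yard.")

for token in nlp(sentence):
    # pos_ is the coarse part-of-speech tag, dep_ the dependency
    # relation, and head is the token this word attaches to in the parse.
    print(f"{token.text:12} {token.pos_:6} {token.dep_:10} -> {token.head.text}")
```

The printed tags will vary with the model version, which is exactly the point: an English pipeline has no principled way to tag “Đẹp” or “quá”.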
I also wanted to work with a sentence where Vietnamese was not used as direct speech, to see if the software treats it differently: You wanted to buy oxtail, to make bún bò huế for the cold winter week ahead of us. The novel doesn’t give an exact translation, but it is evident from context that this is a dish made of beef. A little independent research revealed that bún bò huế is a dish of rice noodles and beef slices. The software correctly recognises “bún” and “huế” as nouns, but “bò” is wrongly tagged as a verb. Expecting a program to recognise the cultural intricacies of a word might be expecting too much of it, but even where it did tag a word correctly, I am inclined to believe it was merely a fluke.
“Tên tôi là Lan.” My name is Lan. was another sentence I tried to annotate, and because I was having fun with the software by now, I also wanted to annotate it as two separate sentences—just the Vietnamese, and then just the English translation. The results were extremely interesting, and although I cannot fully explain why they varied (again, a linguistics background would probably have helped), it goes to show that spaCy needs to be made more inclusive of languages that are not Euro/Anglo-centric.
The Vietnamese name is correctly recognised as PROPN, maybe based on the capitalisation, but the POS tags, and hence the dependency indexing, are incorrect for the rest of the Vietnamese. As seen in the images above, spaCy’s consistency in indexing and classifying the same words varies depending on whether they are paired with English words. The dependency relation assigned to “Lan” also differs between the Vietnamese and English sentences (npadvmod versus attr).
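The comparison experiment itself is easy to reproduce. The sketch below (again assuming `en_core_web_sm`, with a blank-pipeline fallback; `tags_for` is a hypothetical helper I wrote for illustration) annotates the same name in the mixed, Vietnamese-only, and English-only versions of the sentence:

```python
import spacy

# Load the small English pipeline if available; fall back to a blank
# English tokenizer (which produces empty tags) if it is not downloaded.
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    nlp = spacy.blank("en")

def tags_for(text, word):
    """Collect (POS tag, dependency relation) for each occurrence of `word`."""
    return [(t.pos_, t.dep_) for t in nlp(text) if t.text == word]

mixed = '"Tên tôi là Lan." My name is Lan.'
vietnamese_only = "Tên tôi là Lan."
english_only = "My name is Lan."

for label, text in [("mixed", mixed),
                    ("Vietnamese only", vietnamese_only),
                    ("English only", english_only)]:
    print(f"{label:16}", tags_for(text, "Lan"))
```

Running this side by side makes the inconsistency visible: the labels attached to “Lan” depend on the surrounding sentence, not on the word itself.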
This only goes to show that almost all NLP software and translation tools are written in English-speaking countries for an English-dominant market, and that this Anglo-centrism poses barriers in literary studies when it comes to studying multilingual texts and translating them accurately.
Your blog post on annotating multilingual sentences in Ocean Vuong’s "On Earth We're Briefly Gorgeous" resonated with my own experiences in a similar project. I, too, was initially daunted by the technical aspects of using Google Colab and Python, especially for handling multilingual texts. Your findings about spaCy's limitations in correctly tagging Vietnamese words and phrases mirror my observations with Urdu and Hindi in Arundhati Roy’s "The Ministry of Utmost Happiness." In both cases, the software's Anglo-centric design often misclassifies non-English words, highlighting a critical gap in NLP tools. Like you, I found that understanding POS tagging and dependency relations was challenging but became more manageable with practice and independent research. Your analysis of specific sentences, such as "Đẹp quá!" and "bún bò huế," echoes my own challenges; I encountered similar issues with the sentence "I’m a mehfil, I’m a gathering." spaCy's inconsistency in tagging and indexing non-English words underscores the need for more inclusive NLP models. Overall, our experiences reinforce the importance of developing more culturally and linguistically aware annotation tools in digital humanities. I also find it encouraging that non-tech-savvy, non-linguistics folks like ourselves can navigate and overcome these challenges in our first steps into the field of digital humanities.