Working with ANNIS was much easier than working with spaCy, and it has a fairly user-friendly interface. The query-builder in the software helped not only in identifying the tendencies that machine annotation has towards non-English words but also in quantifying them. For example, while we had already established that POS tagging is biased towards labelling non-English words as proper nouns, through ANNIS we know that of the 681 entries tagged "Foreign" in our corpus, 420 carry the POS tag PROPN and 168 the tag NOUN.
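Counts like these can also be reproduced outside ANNIS. Here is a minimal Python sketch, assuming the corpus annotations have been exported as (token, language tag, POS tag) tuples; the sample tokens and field layout below are illustrative, not ANNIS's actual export format:

```python
from collections import Counter

# Hypothetical export of token-level annotations: (token, language tag, POS tag).
# In the real workflow these would be read from an ANNIS/spaCy export file.
annotations = [
    ("pho", "Foreign", "PROPN"),
    ("ao dai", "Foreign", "NOUN"),
    ("market", "English", "NOUN"),
    ("Lan", "Foreign", "PROPN"),
]

# Count which POS tags the tagger assigned to tokens marked as non-English.
foreign_pos = Counter(pos for _, lang, pos in annotations if lang == "Foreign")
print(foreign_pos.most_common())  # → [('PROPN', 2), ('NOUN', 1)]
```

On the full corpus, the same tally over all 681 "Foreign" entries yields the 420 PROPN and 168 NOUN figures cited above.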
This also reflects how the non-English words occupy noun slots in the English sentence structure of these books, which in turn contributes to their being recognised as nouns or proper nouns.
As discussed in class, a variety of reasons underlie this phenomenon. In an English sentence, replacing an English noun with a non-English one allows the author to preserve the sentence's structure, keeping it understandable for those who do not speak the language; the non-English words therefore tend to be food items, curse words, clothing terms, or words that have no clear English translation. In contrast, using a foreign verb or adjective could change the grammar and potentially confuse readers who are not familiar with the syntactic rules of other languages.
This in turn gives us an insight into the reception of the book, the market that it is written for, and the literary sociology around a text that is multilingual.
Our corpus consisted of several languages, and it was interesting to note that these POS-tagging tendencies applied to all of the non-English ones. This shows that the machine has essentially been coded to homogenise all languages that are not English, annotating them in the same way with no regard for the nuances and complexities that make languages so different from one another.
Machine Translations
While working on self-translations and machine translations of a multilingual text, we noticed that there is a lot of potential to develop software that can recognise context and more than two languages in a sentence. The text I worked on was Ocean Vuong's "On Earth We're Briefly Gorgeous," which mixes English and Vietnamese; I translated its multilingual sentences fully into Hindi both manually and with Google Translate.
The machine translations often fell short in capturing the nuances and cultural context of the languages, resulting in robotic, literal renditions that can distort meaning. For instance, in the translations from Ocean Vuong's "On Earth We're Briefly Gorgeous," the phrase "Đẹp quá!" ("So beautiful!") was inaccurately rendered as "आप ठीक है!", which means "you are alright," completely missing the expression of beauty. Similarly, "cream-colored orchid" was misinterpreted as "creamy," suggesting a texture rather than a colour. Machine translation also struggled with temporal context, as in "the cold winter week ahead of us," where a literal translation turned the sense of time into a spatial reference.
Translating multilingual sentences introduces additional challenges. In the text, Vietnamese phrases were awkwardly blended with English, producing jumbled and incorrect meanings: "có đuôi bò không?" (asking whether oxtail is available) became "क्या आप खुश हैं?", which translates to "are you happy?" instead of inquiring about the oxtail. Machine translation lacks the ability to discern these subtleties, leading to nonsensical and culturally insensitive output. In contrast, manual translation provides a more coherent and culturally aware interpretation, preserving the intended meaning and emotional depth of the original text.
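Part of why code-switched sentences fail is that the whole sentence is translated as a single English unit, so the Vietnamese span is never recognised as a different language. A toy Python sketch of span-level language flagging, the kind of preprocessing step that could precede translation (the tiny hand-made wordlist is purely illustrative, standing in for a real trained language identifier):

```python
# Toy illustration: flag non-English spans in a code-switched sentence
# before translating each span separately. A real system would use a
# trained language identifier; this wordlist is only a stand-in.
ENGLISH_WORDS = {"do", "you", "have", "the", "a", "is", "it"}

def flag_spans(sentence):
    """Label each token 'en' or 'other' by (naive) wordlist lookup."""
    return [
        (tok, "en" if tok.lower().strip("?!.,") in ENGLISH_WORDS else "other")
        for tok in sentence.split()
    ]

print(flag_spans("do you have có đuôi bò không?"))
# → [('do', 'en'), ('you', 'en'), ('have', 'en'),
#    ('có', 'other'), ('đuôi', 'other'), ('bò', 'other'), ('không?', 'other')]
```

With the Vietnamese span flagged, a translation pipeline could at least route it through a Vietnamese model instead of guessing at it as malformed English.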
Thank you for your wonderful analysis, Aishwarya! I can say that our experiences with ANNIS and machine translation mirror each other. Evidently, both ANNIS and spaCy struggle to tag non-English words accurately, often defaulting to proper nouns. This overreliance on English-centric rules hampers the analysis of multilingual texts. Your exploration of machine translation further underscores the challenges in capturing linguistic and cultural nuances: the inability of these systems to handle language mixing and contextual understanding often results in inaccurate translations. I resonate with your observations about the cultural specificity of certain words and the loss of meaning in translation, and I find your observation about the loss of temporal context especially interesting, as it is something I didn't notice when I did the translations. I will keep an eye out for more of these instances in the future. In short, our shared experiences point towards the need for better language models that can annotate, translate, and work with every language, independent of English as the "main" language.