My Experience Converting and Annotating Multilingual Sentences

I have always been intrigued by the potential of using coding and programming software when it comes to languages, but as someone with no background in coding or programming, I have also been intimidated by it. So when we first tried out Google Colab in a session of Writing Across Languages: The Post-Monolingual Anglophone Novel, I was very excited, though a bit skeptical about how it would turn out, especially since we were going to work with multilingual sentences rather than plain English ones. Since we tried it out first as a group, it was easier for me to set aside my anxiety about using a programming language, as most of us were new to this.

As expected, we did find some anomalies in the POS tagging of multilingual sentences, considering that most of these tools are built around English. Initially, it was difficult for me to understand dependency relations (deprel) and to get the hang of the abbreviations used for POS tags and deprels. We discussed these in class, and I also did some independent research to understand the concepts better. What I understood is that getting the hang of it all only comes with practice. I also tried annotating some very common English sentences (such as "The quick brown fox jumps over the lazy dog.") to get a better sense of how spaCy handles English sentences versus multilingual ones.
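For anyone curious about what that annotation step looks like, here is a minimal sketch of the kind of loop we ran in Colab. It assumes spaCy's small English model en_core_web_sm has been downloaded; the specific model name is my assumption, and any English pipeline would work in the same way.

```python
import spacy

# Assumes the small English model has been downloaded beforehand,
# e.g. with: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("The quick brown fox jumps over the lazy dog.")

# Print each token with its part-of-speech tag, its dependency
# relation (deprel), and the head word it attaches to.
for token in doc:
    print(f"{token.text:<8} POS={token.pos_:<6} deprel={token.dep_:<10} head={token.head.text}")
```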

The novel I chose to work with is Arundhati Roy’s The Ministry of Utmost Happiness, which features many Indian languages alongside English, and the example sentences I worked with mostly contained Urdu/Hindi words. As we saw in class, spaCy tagged most Urdu/Hindi words as proper nouns, sometimes correctly and sometimes not. It was quite easy for me to spot the mistakes in POS tagging because of my personal familiarity with these languages, and in some cases the text itself supplies the literal translation right after the word, or makes it clear through context.
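To see how the English pipeline treats such code-mixed lines, the same loop can simply be run on a sentence that contains Hindi/Urdu words. The sentence below is a hypothetical example of my own, not a quotation from the novel, and what the tagger produces will depend on the model used.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed English model, as above

# Hypothetical code-mixed sentence (not from the novel), used only to
# check what the English pipeline does with Hindi/Urdu words such as
# "jigri dost" (close friend).
doc = nlp("She called him her jigri dost, her closest friend.")

for token in doc:
    print(f"{token.text:<10} POS={token.pos_:<6} deprel={token.dep_}")
```

Looking at the pos_ values for the non-English tokens makes it easy to see when the tagger falls back on a proper-noun reading or misinterprets a borrowed word.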

I have to mention that the initial fog has lifted at this point, and the process of understanding and identifying oddities got easier with every example I tried. But I do believe the software is still somewhat confused when it comes to identifying and tagging non-English words, and there is considerable scope for improvement in this area.
