I have always been intrigued by the potential of applying coding and programming tools to languages, but as someone with no background in coding or programming, I have also been intimidated by it. So when we first tried out Google Colab in a session of Writing Across Languages: The Post-Monolingual Anglophone Novel, I was very excited, and a bit skeptical about how it would turn out, especially since we were going to work with multilingual sentences rather than plain English ones. Because we tried it out first as a group, and most of us were new to this, it was easier for me to set aside my anxiety about using a programming language.
As expected, we did find some anomalies in the POS tagging of multilingual sentences, since most of these tools are built primarily around English. Initially, it was difficult for me to understand dependency relations (deprel) and to get the hang of the abbreviations used for POS tags and deprel labels. We discussed these in class, and I also did some independent research to understand these concepts better. What I understood is that getting the hang of it all only comes with practice. I also tried annotating some very common English sentences (such as “The quick brown fox jumps over the lazy dog.”) to get a better insight into how spaCy handles English sentences vs. multilingual sentences.
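For anyone else struggling with the abbreviations, here is a minimal sketch of how the tags and relations can be listed next to a plain-English gloss. It assumes spaCy and its small English model `en_core_web_sm` are installed in the Colab runtime; the model name and the helper names `token_rows` and `annotate` are my own choices for illustration, not necessarily what we used in class.

```python
def token_rows(tokens, explain=lambda label: None):
    """Return one (text, POS tag, deprel, head text, gloss) tuple per token.

    `tokens` can be a spaCy Doc or any iterable of objects exposing
    .text, .pos_, .dep_ and .head (hypothetical helper for illustration).
    """
    rows = []
    for tok in tokens:
        # explain() may return None for labels it does not know.
        gloss = explain(tok.dep_) or ""
        rows.append((tok.text, tok.pos_, tok.dep_, tok.head.text, gloss))
    return rows


def annotate(sentence, model="en_core_web_sm"):
    """Parse a sentence with spaCy and return its token rows.

    Assumption: the named model has already been downloaded.
    """
    import spacy  # imported here so token_rows works without spaCy

    nlp = spacy.load(model)
    # spacy.explain turns labels such as "nsubj" into a short description.
    return token_rows(nlp(sentence), explain=spacy.explain)
```

Calling `annotate("The quick brown fox jumps over the lazy dog.")` in a cell and printing each row gives a small table to compare against the multilingual examples. The exact tags can vary with the model and its version.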
The novel I chose to work with is Arundhati Roy’s The Ministry of Utmost Happiness, which features many Indian languages in addition to English, and the example sentences I worked with mostly contained Urdu/Hindi words. As we saw in class, spaCy tagged most Urdu/Hindi words as Proper Noun, sometimes correctly and sometimes not. It was quite easy for me to spot the mistakes in POS tagging because of my personal familiarity with these languages, and in some cases the text itself supplies the meaning of the word, either through context or through a literal translation that follows it.
I found the results for the sentence “I’m a mehfil, I’m a gathering.” quite interesting. The sentence has two clauses that mean the same thing (mehfil means gathering in Urdu), and both are technically independent of one another. And yet, the second “’m” is identified as the root, and the first as a ccomp (clausal complement). I believe the sentence was treated as a complex sentence in this case, but even so, I wonder why the first “’m” is not the root instead of the second. I have yet to figure this out, but it definitely seemed odd.
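To check this kind of result without reading the whole dependency table, something like the following sketch could pull out just the root and any clausal complements attached to it. The helper names `root_and_ccomps` and `parse` are hypothetical, and `en_core_web_sm` is an assumed model; a different model or version may parse the sentence differently from what I saw.

```python
def root_and_ccomps(tokens):
    """Return the root as (index, text) and a list of its ccomp children.

    Works on a spaCy Doc or any iterable of objects with
    .i, .text, .dep_ and .head (hypothetical helper for illustration).
    """
    tokens = list(tokens)
    # spaCy labels the head of the sentence "ROOT".
    root = next(tok for tok in tokens if tok.dep_ == "ROOT")
    ccomps = [
        (tok.i, tok.text)
        for tok in tokens
        # Compare by token index so the two "'m" tokens stay distinct.
        if tok.dep_ == "ccomp" and tok.head.i == root.i
    ]
    return (root.i, root.text), ccomps


def parse(sentence, model="en_core_web_sm"):
    """Parse a sentence with spaCy (assumes the model is installed)."""
    import spacy  # imported here so root_and_ccomps works without spaCy

    return spacy.load(model)(sentence)
```

Running `root_and_ccomps(parse("I'm a mehfil, I'm a gathering."))` returns token positions along with the text, which makes it easy to see which of the two identical-looking verbs the parser picked as the root.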
I have to mention that the initial fog has lifted at this point, and understanding the output and identifying oddities got easier the more examples I tried. But I do believe that the software is also a little confused when it comes to identifying and tagging non-English words, and there is great scope for improvement in this respect.
Reading through your entry, I realised that many of these were issues I had also flagged while annotating sentences. I faced similar challenges with POS tagging and with understanding dependency relations, especially for non-English words. Working on Ocean Vuong’s "On Earth We’re Briefly Gorgeous," I ran into the same Anglo-centric bias in spaCy. For instance, Vietnamese phrases like "Đẹp quá!" were misclassified, reflecting the software's limitations with non-English languages. Like you, I found that with practice and deeper exploration, the process became clearer. The language in the text you have used is one I understand pretty well, so going through your entry was insightful and showed me that, even though Vietnamese and Hindi/Urdu are very different, the software recognises and works with them in the same way. It's evident that these tools need refinement for multilingual support, but the journey of learning and identifying these anomalies has been incredibly insightful and rewarding.