Multi-Lingual Annotations in Python with “A Concise Chinese-English Dictionary for Lovers”

This was not my first time using Python to work out the tokens of sentences computationally. However, my last time working with Python (in a linguistics BA seminar with Prof. Kevin Tang) was some time ago, so I appreciated how easy it was made to run our example sentences through it.

Running the three example sentences given to us, I noticed that punctuation marks indicating speech are often interpreted as part of words or compounds and are tagged as POS, which makes it difficult to check which tokens correspond to which words.
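One possible workaround, sketched here under my own assumptions rather than taken from the course code, is to separate speech marks from the words they touch before the text reaches the tagger. The helper name split_speech_marks is hypothetical, and it uses only Python's standard library:

```python
import re

# Map typographic quotation marks to plain ASCII ones.
QUOTE_MAP = str.maketrans({"‘": "'", "’": "'", "“": '"', "”": '"', "„": '"'})


def split_speech_marks(text):
    """Hypothetical helper: make opening and closing quotes separate tokens."""
    text = text.translate(QUOTE_MAP)
    # Opening quote: at the start of the text or after whitespace, glued to a word.
    text = re.sub(r"(^|\s)(['\"])(?=\S)", r"\1\2 ", text)
    # Closing quote: glued to the end of a word, before whitespace or the end.
    text = re.sub(r"(?<=\S)(['\"])(?=\s|$)", r" \1", text)
    return text.split()


print(split_speech_marks("‘Aii, mwanangu, mbona wanitesa?’"))
```

Apostrophes inside a word (as in couldn't) are left alone, but a word-final apostrophe such as a plural possessive would also be split off, so this is only a rough pre-processing step.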

Additionally, foreign words were always interpreted as proper nouns – even in the Spanish phrase clarita del huevo, in which del is clearly not a proper noun. I had expected the Spanish to be easier to interpret, or perhaps translate, than the Swahili sentence, as Spanish might be more widely known, but Python does not do any translating and so struggles with anything that is not English. Dependencies can thus not be determined correctly. In the sentence Tías called me blanca, palida, clarita del huevo, the last three parts (blanca, palida, clarita del huevo) form a list: all are nouns or noun phrases on equal standing and not dependent on each other, yet Python marks huevo as a dependent of palida.
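For readers who want to check such tags and dependencies themselves, here is a minimal sketch. It assumes the course code is built on spaCy, which this post does not actually confirm; token_rows and demo are hypothetical names of my own. The attributes text, pos_, dep_ and head are standard on spaCy tokens.

```python
def token_rows(doc):
    """Return (text, part of speech, dependency label, head text) per token."""
    return [(tok.text, tok.pos_, tok.dep_, tok.head.text) for tok in doc]


def demo():
    # Assumption: spaCy and its small English model (en_core_web_sm) are installed.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Tías called me blanca, palida, clarita del huevo")
    for row in token_rows(doc):
        print(*row, sep="\t")
```

Calling demo() prints one line per token, which makes it easy to see which head a word such as huevo has been attached to.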

The given example with Swahili words in it cannot be properly tagged at all, as, according to Python, ‘Ayaaaana! Haki ya Mungu … aieee!’ The threat-drenched contralto came from the bushes to the left of the mangroves. ‘Aii, mwanangu, mbona wanitesa?’ is made up entirely of nouns or proper nouns. Only the English part could be identified correctly.

The English-Chinese example presents similar issues:

  • In Chinese, it is the same word ‘家’ (jia) for ‘home’ and ‘family’ and sometimes including ‘house’. To us, family is same thing as house, and this house is their only home too. ‘家’, a roof on top, then some legs and arms inside.

In this example, the Chinese hanzi are not tagged as proper nouns. Python interprets 家 as a noun once and as an adjective another time, marking it – quite nonsensically – as a dependent of legs.
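Since the hanzi are handled so inconsistently, it can help to locate them before tagging. The following is a minimal sketch of my own, not part of the code we were given: it checks each character against the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) and groups the text into runs of hanzi and everything else. The function names are hypothetical.

```python
from itertools import groupby


def is_han(ch):
    """True if ch lies in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def script_runs(text):
    """Split text into (label, chunk) runs, labelled 'han' or 'other'."""
    def label(ch):
        return "han" if is_han(ch) else "other"

    return [(lab, "".join(chars)) for lab, chars in groupby(text, key=label)]


for lab, chunk in script_runs("‘家’, a roof on top"):
    print(lab, repr(chunk))
```

The 'han' runs could then be passed to a Chinese tagger or simply set aside, while the remaining text goes to the English one. Rarer characters outside the basic block would need additional ranges.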

Now looking at the novel I have been reading – A Concise Chinese-English Dictionary for Lovers by Xiaolu Guo – I looked for similar multilingual sentences to test in the programme. I couldn’t find many such sentences, but tested the ones I did find:

  • ‘知识’ mean knowledge, ‘分子’ mean molecule.

Here, 知识 is correctly interpreted as a noun, although it is treated as a single noun rather than a noun phrase consisting of a verb (知 – to know) and a noun (识 – knowledge).

The same goes for 分子, which is interpreted as a single noun rather than a noun phrase consisting of a verb (分 – divide) and a noun (子 – son, child).

The word “mean” in this sentence is the verb “to mean”, but it is not conjugated correctly because the narrator is not fluent in English yet and struggles with English grammar. Python thus tags it as an adjective. Accordingly, the dependencies turn out incorrect as well: mean should function as the head of its sentence, but instead knowledge becomes the head upon which all other words depend.

  • 屁 is fart in Chinese. It is the word made up from two parts. 尸 is a symbol of a body with tail, and underneath that 比 represent two legs. That means fart, a kind of Chi.

In the third sentence, “represent” is interpreted as a dependent of “is” – again, likely due to the improper grammar the narrator uses. The hanzi are either not processed at all (屁), marked as a proper noun (尸), or marked as a noun (比).

  • Chi (), everything to do with Chi is very important to us Chinese.

This sentence starts out somewhat elliptically. The phrase very important to us Chinese is tagged and interpreted correctly, but Python struggles with everything that comes before it, marking Chi () as a dependent of is. Evidently, Python cannot interpret the punctuation correctly, which in this case should indicate that the first word is separate from the sentence after the comma.

In the next two examples I wanted to test how Python deals with incorrect English grammar without the interference of non-English words, hypothesising that errors like the one with mean above would also occur here.

  • I feeling I can die for all kinds of situation in every second.

In this case, given the previous errors Python made due to incorrect grammar and conjugation, I thought that feeling might be interpreted as a noun because the auxiliary “am” of the progressive form is missing. Surprisingly, Python has no problem recognising feeling as the verb it is supposed to be, correctly marking it as the head/root of the entire sentence.

  • I scared by cars because they seems coming from any possible directing.

As above, I wanted to test how Python deals with these grammatical errors (seems instead of seem; directing instead of direction). Again, surprisingly, all tokens were tagged and interpreted properly with all their dependencies. Even directing was correctly identified as a noun instead of a progressive verb.

Evidently, annotating multi-lingual sentences correctly is not a possibility – at least not with the Python code we have been given. While the programme has no problem interpreting English sentences with incorrect grammar, it is thrown for a loop as soon as non-English words are introduced, which was a very interesting observation to make.
