loanpy – a framework for computer-aided borrowing detection

Viktor Martinović

The primary goal of historical-comparative linguistics is to find out more about the history of a language. Insights beyond that belong, strictly speaking, to other scientific fields. As Sinor (1988) puts it:
Of course comparative linguistics may provide some valuable information on past events but one should not lose sight of the fact that languages and not peoples are the proper subject of linguistics, and diachronic linguistics should deal with the history of one or several languages and not with the history of peoples who spoke them. (p. XVIII)
Historical lexicology is a subfield of historical linguistics that deals with the etymologies of individual words. These words fall into two categories: inherited and borrowed. Inherited words usually have cognates in at least one other language of the same family, whereas borrowings are usually the consequence of language contact.
Computational methods are fairly new in historical linguistics, and manual methods still prevail. Among the more notable achievements of computational historical linguistics are the automated creation of phylogenetic trees and automated cognate detection (cf. Rama et al. 2018). There have been only a few attempts to detect borrowings computationally (cf. Babiker et al. 2020, p. 10); loanpy is one of them.
loanpy is a Python library that consists of four modules: helpers.py, reconstructor.py, adapter.py and loanfinder.py. Helpers contains independent functions that are called by the other modules. Reconstructor extracts sound change rules from an etymological dictionary, sorts them by their number of occurrences and stores them in a Python dictionary. This information is then used to reconstruct hypothetical old roots of other words in the modern-day language. Adapter adapts words from the tentative donor language according to the constraints of the tentative recipient language. Loanfinder compares the reconstructed pseudo-roots with the pseudo-adaptations and identifies phonetic matches. It then calculates the semantic similarity of those matches and sorts the output accordingly.
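The Reconstructor step described above can be illustrated with a minimal sketch. It is not loanpy's actual code or API: the function names extract_rules and reconstruct, the max_per_sound parameter and the toy word pairs are all hypothetical, and the input is assumed to be already aligned sound by sound.

```python
from collections import Counter, defaultdict
from itertools import product

# Hypothetical toy data, not real etymologies: (modern form, old form) pairs,
# assumed to be pre-aligned so that both sequences have the same length.
ALIGNED_PAIRS = [
    (["h", "a", "l"], ["k", "a", "l"]),
    (["h", "a", "t"], ["k", "a", "t"]),
    (["h", "o", "l"], ["k", "u", "l"]),
    (["h", "a", "b"], ["h", "a", "p"]),
]


def extract_rules(aligned_pairs):
    """Count modern-to-old sound correspondences and store them in a dict.

    Returns a dict mapping each modern sound to a list of old sounds,
    sorted from the most to the least frequently attested.
    """
    counts = defaultdict(Counter)
    for modern, old in aligned_pairs:
        for modern_sound, old_sound in zip(modern, old):
            counts[modern_sound][old_sound] += 1
    return {
        modern_sound: [old for old, _ in counter.most_common()]
        for modern_sound, counter in counts.items()
    }


def reconstruct(word, rules, max_per_sound=2):
    """Generate hypothetical old forms of a modern word.

    For every sound, the max_per_sound most frequent correspondences are
    tried; sounds without any attested rule are kept unchanged.
    """
    options = [rules.get(sound, [sound])[:max_per_sound] for sound in word]
    return ["".join(combo) for combo in product(*options)]


rules = extract_rules(ALIGNED_PAIRS)
candidates = reconstruct(["h", "o", "l"], rules)
print(rules["h"])    # old sounds attested for modern "h", most frequent first
print(candidates)    # hypothetical old roots for the toy word "hol"
```

In this sketch, every candidate is built only from attested correspondences, which is what distinguishes a rule-based reconstruction from a similarity-based one. Limiting the number of correspondences tried per sound keeps the number of candidates manageable for longer words.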
This framework accounts for two aspects that were mostly ignored in previous approaches. First, reconstructions and adaptations are rule-based rather than similarity-based (the drawback being that the framework can only be applied to languages for which etymological data is already available). Second, it takes semantic shift into account.
I have used this framework to investigate whether there might be as yet undetected Gothic loanwords in Hungarian. The first results seem promising but still require deeper analysis. In theory, the software should allow its users to input any two languages. Investigating to what degree this holds in practice will be the next step of my research.

References

  • Denis Sinor (ed.): The Uralic Languages, in: Handbuch der Orientalistik, 1988, E.J. Brill
  • Taraka Rama, Johann-Mattis List, Johannes Wahle, Gerhard Jäger: Are Automatic Methods for Cognate Detection Good Enough for Phylogenetic Reconstruction in Historical Linguistics?, 2018, Cornell University
  • Hiba Babiker, Abbie Hantgan, Johann-Mattis List: First steps towards the detection of contact layers in Bangime: A multi-disciplinary, computer-assisted approach, 2020, Humanities Commons