This page presents the data digitized and processed by the QuantHistLing project at the Philipps University in Marburg.
This is the filtered version of the corpus of dictionaries and wordlists. Due to copyright restrictions, we are not allowed to publish the whole dataset. This version comprises all dictionary and wordlist entries that contain a word whose stem matches the stem of a word in the Spanish Swadesh list. We used the entries of the Wikipedia Swadesh list to filter our data. Each entry of the Swadesh list and each entry of our sources was stemmed using the Snowball stemmer of the NLTK. If the stems are equal, the entry is included in the data you will find here. The Python script that we used to filter the entries is available here (the filter algorithm starts around line 67).
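The stem comparison described above can be sketched roughly as follows. This is not the project's filter script, only a minimal illustration of the idea. The function names (`tokens`, `swadesh_stems`, `filter_entries`) and the regex-based word splitting are assumptions made for this example. The sketch uses NLTK's Spanish Snowball stemmer when NLTK is installed and otherwise falls back to plain lowercasing so that it still runs.

```python
import re

try:
    # Snowball stemmer shipped with NLTK, as named in the text above.
    from nltk.stem.snowball import SnowballStemmer
    _stem = SnowballStemmer("spanish").stem
except ImportError:
    # Stand-in so the sketch runs without NLTK: lowercasing only, no real stemming.
    _stem = str.lower


def tokens(text):
    """Split a string into lowercase word tokens (an assumed tokenization)."""
    return re.findall(r"\w+", text.lower())


def swadesh_stems(swadesh_words, stem=_stem):
    """Collect the stems of all words in the Swadesh list."""
    return {stem(tok) for word in swadesh_words for tok in tokens(word)}


def filter_entries(entries, swadesh_words, stem=_stem):
    """Keep entries in which some word shares a stem with a Swadesh word."""
    stems = swadesh_stems(swadesh_words, stem)
    return [
        entry for entry in entries
        if any(stem(tok) in stems for tok in tokens(entry))
    ]


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the corpus.
    print(filter_entries(["el perro", "mesa grande"], ["perro", "agua"]))
```

Passing the stemmer as a parameter makes it easy to try a different stemmer or language without touching the filter logic.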
In addition we included one of the sources without any filter, so you have full access to the complete data of the following source: Thiesen, Wesley & Thiesen, Eva. 1998. Diccionario Bora—Castellano Castellano—Bora.
There are two download packages that contain the data from the whole corpus. One contains CSV files and includes only heads and translations in the case of dictionary sources, or concepts and counterparts in the case of wordlist sources. You may use this data in combination with the lingpy library, for example. The other package contains the complete basic data and annotations, encoded as a data package (see Data Protocols). Check the README in the package for more information about the content. The download links are:
- CSV files: http://www.quanthistling.info/data/downloads/csv/data.zip
- Data package: http://www.quanthistling.info/data/downloads/datapackages/data.zip
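As a starting point for working with the CSV package, the sketch below downloads the archive and iterates over the rows of every CSV file it contains, using only the Python standard library. The layout of the archive is not described on this page, so the tab delimiter, the UTF-8 encoding, and the `.csv` file extension are assumptions; check the files themselves and adjust the parameters. The function names `download` and `iter_csv_rows` are made up for this example.

```python
import csv
import io
import urllib.request
import zipfile

CSV_URL = "http://www.quanthistling.info/data/downloads/csv/data.zip"


def download(url=CSV_URL):
    """Fetch the archive and return its raw bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def iter_csv_rows(zip_bytes, delimiter="\t", encoding="utf-8"):
    """Yield (member_name, row) for every row of every .csv file in the archive.

    The delimiter and encoding defaults are assumptions, not documented facts.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue
            with archive.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding=encoding, newline="")
                for row in csv.reader(text, delimiter=delimiter):
                    yield name, row


if __name__ == "__main__":
    # Print the first few rows to inspect the actual column layout.
    for i, (name, row) in enumerate(iter_csv_rows(download())):
        print(name, row)
        if i >= 4:
            break
```

Reading the archive from memory avoids unpacking it to disk, which is convenient for a quick look at the columns before loading the files into another tool.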