To whom it may concern,
I am writing regarding the Cultural Diversity Observatory project and the data we are collecting. In short, the project aims to bridge the content gaps between language editions that relate to cultural and geographical aspects. To do this, we need to retrieve data from all language editions and Wikidata, and run scripts that crawl the category graph and the link graph in order to create datasets and statistics.
I am writing because we are stuck: we cannot automate the scripts that retrieve data from the Replicas. We were able to create the datasets a few months ago, but over the past months it has become impossible.
We are concerned because creating the dataset once for research purposes is one thing, and creating it on a monthly basis is another. The latter is what we promised in the project grant details, and now we cannot deliver it because of the infrastructure. Monthly updates are important because the data visualizations and statistics that Wikipedia communities will receive need to be kept up to date.
Lately there have been some changes in the Replicas databases, and queries that used to take several hours now get stuck completely. We have tried to code them in multiple ways: a) using complex queries, b) doing the joins in code, in memory, c) downloading the parts of the tables that we require and storing them in a local database.
None of these is a viable option given the current performance of the replicas.
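To illustrate approach (c), here is a simplified Python sketch. It is not one of our actual scripts: the table and column names (page, link, page_id, to_title), the helper functions, and the chunk size are placeholders chosen for the example, and the in-memory lists passed in stand in for paged queries against the replicas.

```python
import sqlite3
from typing import Iterable, Iterator, List, Tuple


def fetch_in_chunks(rows: Iterable[Tuple], chunk_size: int) -> Iterator[List[Tuple]]:
    """Yield rows in fixed-size batches (stand-in for paged replica queries)."""
    batch: List[Tuple] = []
    for row in rows:
        batch.append(row)
        if len(batch) == chunk_size:
            yield batch
            batch = []
    if batch:
        yield batch


def build_local_copy(conn: sqlite3.Connection,
                     pages: Iterable[Tuple[int, str]],
                     links: Iterable[Tuple[int, str]],
                     chunk_size: int = 2) -> None:
    """Store only the columns we need in a local database, batch by batch."""
    conn.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY, title TEXT)")
    conn.execute("CREATE TABLE link (from_id INTEGER, to_title TEXT)")
    for batch in fetch_in_chunks(pages, chunk_size):
        conn.executemany("INSERT INTO page VALUES (?, ?)", batch)
    for batch in fetch_in_chunks(links, chunk_size):
        conn.executemany("INSERT INTO link VALUES (?, ?)", batch)
    # Index the join column so the local join does not scan the whole table.
    conn.execute("CREATE INDEX link_to_title ON link (to_title)")
    conn.commit()


def inlink_counts(conn: sqlite3.Connection) -> List[Tuple[str, int]]:
    """Run the join locally: number of incoming links per page title."""
    cur = conn.execute(
        "SELECT p.title, COUNT(l.from_id) "
        "FROM page AS p LEFT JOIN link AS l ON l.to_title = p.title "
        "GROUP BY p.page_id, p.title ORDER BY p.page_id"
    )
    return cur.fetchall()
```

The idea is that each request to the replicas stays small and simple, and the expensive join runs on our own server. In practice even these small requests have become too slow.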
Bryan Davis suggested that this might be the moment to consult the Analytics team, considering that the Hadoop environment is designed to run long, complex queries and has massively more compute power than the Wiki Replicas cluster. We would certainly be relieved if you would consider letting us connect to these Analytics databases (Hadoop).
Please let us know if you need more information on the specific queries or the processes we are running. The server we are using is wcdo.eqiad.wmflabs. We will be happy to explain in detail anything you require.
Thanks.
Best regards,
Marc Miquel
PS: You can read about the method we follow to retrieve data and create the dataset here:
Miquel-Ribé, M., & Laniado, D. (2019). Wikipedia Cultural Diversity Dataset: A Complete Cartography for 300 Language Editions. Proceedings of the 13th International AAAI Conference on Web and Social Media. ICWSM. ACM. 2334-0770