Re: [Analytics] project Cultural Diversity Observatory / accessing analytics hadoop databases

9 Jul 2019

Hello,

...
 From your description, it seems that your main problem is not one of
computation but rather of data extraction. The labs replicas are not meant
for big data extraction jobs, as you have just found out. Neither is
Hadoop. Our team will soon be releasing a dataset of denormalized edit
data that you can probably use. It is still up for discussion whether the
data will be released as a JSON dump or in another format, but it is
basically a denormalized version of all the data held in the replicas, and
it will be created monthly.

Please take a look at the documentation of the dataset:
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_his…

This is the phab ticket:
https://phabricator.wikimedia.org/T208612

So, to sum up, once this dataset is out (we hope late this quarter or early
next), you can probably build your own datasets from it, which would make
your use of the replicas unnecessary. Hopefully this makes sense.

Thanks,

Nuria

On Mon, Jul 8, 2019 at 3:34 PM Marc Miquel <marcmiquel(a)gmail.com> wrote:

...
  To whom it might concern,

 I am writing with regard to the project *Cultural Diversity Observatory*
 and the data we are collecting. In short, this project aims to bridge the
 content gaps between language editions that relate to cultural and
 geographical aspects. For this we need to retrieve data from all language
 editions and Wikidata, and run some scripts that crawl the category and
 link graphs in order to create some datasets and statistics.

 I am writing because we are stuck: we cannot automate the scripts that
 retrieve data from the Replicas. We were able to create the datasets a
 few months ago, but during the past months it has been impossible.

 We are concerned because it is one thing to create the dataset once for
 research purposes and another to create it on a monthly basis. This is
 what we promised in the project grant details
 <https://meta.wikimedia.org/wiki/Grants:Project/WCDO/Culture_Gap_Monthly_Monitoring>
 and now we cannot do it because of the infrastructure. It is important to
 do it on a monthly basis because the data visualizations and statistics
 that Wikipedia communities will receive need to be kept up to date.

 Lately there have been some changes in the Replicas databases, and
 queries that used to take several hours now get stuck completely. We
 tried to code them in multiple ways: a) using complex queries, b) doing the
 joins in code, in memory, c) downloading the parts of the tables that we
 require and storing them in a local database. *None of these is an option
 now*, considering the current performance of the replicas.
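
 As an illustration of option c) only, a minimal sketch of copying the
 needed columns of a table into a local database in small, primary-key
 ordered chunks could look like the following. It is not the project's
 actual script. The source connection here is an in-memory SQLite stand-in
 (a real run would use a MySQL/MariaDB connection to the replicas), and the
 names copy_in_chunks, make_demo_source and page_local are hypothetical.
 The page table and its page_id and page_title columns follow the MediaWiki
 schema.

```python
import sqlite3


def copy_in_chunks(src, dst, chunk_size=1000):
    """Copy (page_id, page_title) from src to dst using keyset pagination.

    Each query asks only for rows with page_id greater than the last one
    seen, so no single query has to scan or return the whole table.
    Assumes page_id values are positive integers.
    """
    dst.execute(
        "CREATE TABLE IF NOT EXISTS page_local "
        "(page_id INTEGER PRIMARY KEY, page_title TEXT)"
    )
    last_id = 0
    copied = 0
    while True:
        rows = src.execute(
            "SELECT page_id, page_title FROM page "
            "WHERE page_id > ? ORDER BY page_id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not rows:
            break
        # INSERT OR REPLACE makes a re-run after an interruption harmless.
        dst.executemany(
            "INSERT OR REPLACE INTO page_local (page_id, page_title) "
            "VALUES (?, ?)",
            rows,
        )
        dst.commit()
        last_id = rows[-1][0]
        copied += len(rows)
    return copied


def make_demo_source(n_rows):
    """Build a small in-memory stand-in for the remote page table."""
    src = sqlite3.connect(":memory:")
    src.execute(
        "CREATE TABLE page "
        "(page_id INTEGER PRIMARY KEY, page_title TEXT, page_len INTEGER)"
    )
    src.executemany(
        "INSERT INTO page VALUES (?, ?, ?)",
        [(i, f"Title_{i}", i * 10) for i in range(1, n_rows + 1)],
    )
    src.commit()
    return src


if __name__ == "__main__":
    source = make_demo_source(25)
    local = sqlite3.connect(":memory:")
    print(copy_in_chunks(source, local, chunk_size=10))
```

 Short, bounded queries of this kind are less likely to be killed than one
 long-running query, but whether they are fast enough against the replicas
 is exactly the open question described above.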

 Bryan Davis suggested that this might be the moment to consult the
 Analytics team, considering that the Hadoop environment is designed to run
 long, complex queries and has massively more compute power than the Wiki
 Replicas cluster. We would certainly be relieved if you considered that we
 could connect to these Analytics databases (Hadoop).

 Let us know if you need more information on the specific queries or the
 processes we are running. The server we are using is wcdo.eqiad.wmflabs. We
 will be happy to explain in detail anything you require.

 Thanks.
 Best regards,

 Marc Miquel

 PS: You can read about the method we follow to retrieve data and create
 the dataset here:

 *Miquel-Ribé, M., & Laniado, D. (2019). Wikipedia Cultural Diversity
 Dataset: A Complete Cartography for 300 Language Editions. Proceedings of
 the 13th International AAAI Conference on Web and Social Media. ICWSM. ACM.
 2334-0770*
 www.aaai.org/ojs/index.php/ICWSM/article/download/3260/3128/
 _______________________________________________
 Analytics mailing list
 Analytics(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/analytics
