Dear Sir,
I thank you for your efforts. The link to H2020 is
https://ec.europa.eu/programmes/horizon2020/en/how-get-funding.
Yours sincerely,
Houcemeddine Turki
________________________________
From: Analytics <analytics-bounces(a)lists.wikimedia.org> on behalf of Houcemeddine A. Turki <turkiabdelwaheb(a)hotmail.fr>
Sent: Tuesday, 9 July 2019 16:12
To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Subject: Re: [Analytics] project Cultural Diversity Observatory / accessing analytics hadoop databases
Dear Sir,
I thank you for your efforts. When we were at WikiIndaba 2018, it was interesting to see your research work. The project is particularly interesting because many cultures across the world are underrepresented on the Internet, and on Wikipedia especially.
Concerning the formal collaboration, I think it would be useful if your team could apply for an H2020 grant. This worked for the Scholia project and could work for you as well.
Yours sincerely,
Houcemeddine Turki
________________________________
From: Analytics <analytics-bounces(a)lists.wikimedia.org> on behalf of Nuria Ruiz <nuria(a)wikimedia.org>
Sent: Tuesday, 9 July 2019 16:00
To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Subject: Re: [Analytics] project Cultural Diversity Observatory / accessing analytics hadoop databases
Marc:
> We'd like to start the formal process to have an active collaboration, as it seems there is no other solution available.
Given that formal collaborations are somewhat hard to obtain (the research team has only so many resources), my recommendation would be to import the public data into another computing platform that is not as constrained as labs in terms of space, and do your calculations there.
Thanks,
Nuria
On Tue, Jul 9, 2019 at 3:50 AM Marc Miquel <marcmiquel@gmail.com> wrote:
Thanks for your clarification, Nuria.
The categorylinks table is working better lately. Computing counts on the pagelinks table is critical, though, and I'm afraid there is no solution for that one.
I thought about creating a temporary pagelinks table with data from the dumps for each language edition. But replicating the pagelinks database on the server's local disk would be too costly in terms of time and space: the enwiki pagelinks table alone must be more than 50 GB, and the entire process would run for many days once the other language editions are included.
Another count I need is the number of editors per article, which also gets stuck on the revision table. For the rest of the data it is, as you said, more a matter of retrieval, and I can use alternatives.
The queries that obtain counts from pagelinks are something that worked before with the database replicas, and a database with more power, like Hadoop, would handle them with relative ease. The problem is a mixture of retrieval and computing power.
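The counting itself is simple once the link rows are available outside the replicas; a minimal sketch in Python, assuming (source_id, target_title) pairs parsed from a pagelinks dump (the pair shape and the sample titles are illustrative, not the actual dump schema):

```python
from collections import Counter

def inlink_counts(links):
    """Count incoming links per target title from (source_id, target_title)
    pairs. `links` stands in for rows parsed out of a pagelinks dump."""
    return Counter(target for _source, target in links)

# Hypothetical sample rows for illustration only.
links = [
    (1, "Tunisia"),
    (2, "Tunisia"),
    (3, "Sfax"),
]
counts = inlink_counts(links)
```

The point of the sketch is that the aggregation is a single streaming pass; the hard part Marc describes is getting the rows in the first place.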
We'd like to start the formal process to set up an active collaboration, as it seems there is no other solution available and we cannot stay stuck and fail to deliver the work we promised. I'll let you know when I have more info.
Thanks again.
Best,
Marc Miquel
Message from Nuria Ruiz <nuria@wikimedia.org> on Tue, 9 Jul 2019 at 1:44:
> Will there be a release for these two tables?
No, sorry, there will not be. The dataset release is about pages and users. To be extra clear, though: it is not tables, but a denormalized reconstruction of the edit history.
> Could I connect to Hadoop to see if the queries on pagelinks and categorylinks run faster?
It is a bit more complicated than just "connecting", but I do not think we have to dwell on that because, as far as I know, there is no categorylinks info in Hadoop.
Hadoop has the set of data from mediawiki that we use to create the dataset I pointed you
to:
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_his… and a
bit more.
Is it possible to extract some of this information from the xml dumps? Perhaps somebody
in the list has other ideas?
Thanks,
Nuria
P.S. Just so you know: in order to facilitate access to our computing resources and private data (there is no way for us to give access to only "part" of the data we hold in Hadoop), we require an active collaboration with our research team. We cannot support ad-hoc access to Hadoop for community members.
Here is some info:
https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations
On Mon, Jul 8, 2019 at 4:14 PM Marc Miquel <marcmiquel@gmail.com> wrote:
Hello Nuria,
This seems like an interesting alternative for some of the data (page, users, revision). It could really help and make some processes faster (at the moment we have given up on re-running the revision query, as the new user_agent change made it slower as well). So we will take a look at it as soon as it is ready.
However, the scripts are struggling with other tables: pagelinks and the category graph.
For instance, we need to count the percentage of links an article directs to other pages, or the percentage of links it receives from a group of pages. Likewise, we need to walk down the category graph starting from a specific group of categories. At the moment, the query that uses pagelinks is not really working for these counts, whether we pass parameters for the entire table or for specific parts (using batches).
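Walking down the category graph is essentially a breadth-first traversal that must tolerate cycles; a sketch, assuming the graph has already been loaded into a dict of category → direct subcategories (a stand-in for categorylinks rows, not the real table schema, and the category names are made up):

```python
from collections import deque

def descend_categories(subcat_graph, roots):
    """Return the set of all categories reachable downward from `roots`.

    `subcat_graph` maps a category to its direct subcategories. The `seen`
    set guards against revisiting nodes, since category graphs can contain
    cycles."""
    seen = set(roots)
    queue = deque(roots)
    while queue:
        cat = queue.popleft()
        for sub in subcat_graph.get(cat, ()):
            if sub not in seen:
                seen.add(sub)
                queue.append(sub)
    return seen

# Hypothetical graph with a cycle back to the root.
graph = {
    "Culture": ["Culture_by_country", "Arts"],
    "Culture_by_country": ["Culture_of_Tunisia"],
    "Culture_of_Tunisia": ["Culture"],
}
reached = descend_categories(graph, ["Culture"])
```

The same traversal works whether the adjacency comes from a dump, a local database, or batched queries; only the loading step changes.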
Will there be a release for these two tables? Could I connect to Hadoop to see if the queries on pagelinks and categorylinks run faster?
If there is any other alternative, we'd be happy to try it, as we have not been able to progress for several weeks.
Thanks again,
Marc
Message from Nuria Ruiz <nuria@wikimedia.org> on Tue, 9 Jul 2019 at 0:56:
Hello,
From your description it seems that your problem is not one of computation (well, not your main problem) but rather one of data extraction. The labs replicas are not meant for big data-extraction jobs, as you have just found out. Neither is Hadoop. Now, our team will soon be releasing a dataset of denormalized edit data that you can probably use. It is still up for discussion whether the data will be released as a JSON dump or in some other format, but it is basically a denormalized version of all the data held in the replicas, and it will be recreated monthly.
Please take a look at the documentation of the dataset:
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_his…
This is the phab ticket:
https://phabricator.wikimedia.org/T208612
So, to sum up, once this dataset is out (we hope late this quarter or early next) you can
probably build your own datasets from it thus rendering your usage of the replicas
obsolete. Hopefully this makes sense.
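Once such a dump is out, per-article aggregates like the distinct-editor count discussed above could be computed in a single streaming pass. A sketch assuming newline-delimited JSON with mediawiki_history-style field names (event_entity, page_id, event_user_id); these names are an assumption here, since the release format was still under discussion:

```python
import json
from collections import defaultdict

def editors_per_page(lines):
    """Count distinct editors per page from newline-delimited JSON revision
    events. Field names are assumptions about the eventual dump format."""
    editors = defaultdict(set)
    for line in lines:
        event = json.loads(line)
        # Only revision events carry an editing user for a page.
        if event.get("event_entity") == "revision":
            editors[event["page_id"]].add(event["event_user_id"])
    return {page: len(users) for page, users in editors.items()}

# Hypothetical sample events for illustration only.
sample = [
    '{"event_entity": "revision", "page_id": 1, "event_user_id": 10}',
    '{"event_entity": "revision", "page_id": 1, "event_user_id": 11}',
    '{"event_entity": "revision", "page_id": 1, "event_user_id": 10}',
    '{"event_entity": "page", "page_id": 2}',
]
editor_counts = editors_per_page(sample)
```

Because the dump is denormalized, no joins against user or page tables are needed; one pass over the file suffices.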
Thanks,
Nuria
On Mon, Jul 8, 2019 at 3:34 PM Marc Miquel <marcmiquel@gmail.com> wrote:
To whom it might concern,
I am writing in regard to the Cultural Diversity Observatory project and the data we are collecting. In short, this project aims at bridging the content gaps between language editions that relate to cultural and geographical aspects. For this we need to retrieve data from all language editions and Wikidata, and run some scripts that crawl down the category and link graphs in order to create some datasets and statistics.
The reason I am writing is that we are stuck: we cannot automate the scripts that retrieve data from the Replicas. We could still create the datasets a few months ago, but over the past months it has become impossible.
We are concerned because it is one thing to create the dataset once for research purposes and quite another to create it on a monthly basis. This is what we promised in the project grant details <https://meta.wikimedia.org/wiki/Grants:Project/WCDO/Culture_Gap_Mo…>, and now we cannot do it because of the infrastructure. It is important to do it monthly because the data visualizations and statistics the Wikipedia communities will receive need to be kept up to date.
Lately there have been some changes in the Replicas databases, and queries that used to take several hours now get stuck completely. We tried to code them in multiple ways: a) using complex queries, b) doing the joins in code logic and in memory, and c) downloading the parts of the tables we require and storing them in a local database. None of these is an option now, considering the current performance of the replicas.
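For approach (c), the retrieval can at least be kept bounded by slicing the table into fixed-size id ranges, so each query touches only a limited slice; a minimal sketch of just the batching logic (the queries themselves are omitted, and the range shape is an illustration, not the project's actual code):

```python
def id_batches(max_id, batch_size):
    """Yield inclusive (start, end) id ranges covering 0..max_id, so each
    extraction query can be limited to one bounded slice of a table."""
    start = 0
    while start <= max_id:
        yield (start, min(start + batch_size - 1, max_id))
        start += batch_size

batches = list(id_batches(10, 4))
```

Each range would then drive one query such as `WHERE pl_from BETWEEN start AND end`, keeping individual queries short even when the whole table is large.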
Bryan Davis suggested that this might be the moment to consult the Analytics team, considering that the Hadoop environment is designed to run long, complex queries and has massively more compute power than the Wiki Replicas cluster. We would certainly be relieved if you considered letting us connect to these Analytics databases (Hadoop).
Let us know if you need more information on the specific queries or the processes we are
running. The server we are using is wcdo.eqiad.wmflabs. We will be happy to explain in
detail anything you require.
Thanks.
Best regards,
Marc Miquel
PS: You can read about the method we follow to retrieve data and create the dataset here:
Miquel-Ribé, M., & Laniado, D. (2019). Wikipedia Cultural Diversity Dataset: A Complete Cartography for 300 Language Editions. Proceedings of the 13th International AAAI Conference on Web and Social Media (ICWSM). ISSN 2334-0770. www.aaai.org/ojs/index.php/ICWSM/article/download/3260/3128/
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics