Dear Cristina,
You are likely to find more researchers and people who regularly work with
our metadata on the research mailing list.
On Thu, 16 Sept 2021 at 14:04, Gava, Cristina via Wikimedia-l <
wikimedia-l(a)lists.wikimedia.org> wrote:
Hello everyone,
It is my first time interacting on this mailing list, so I would be happy
to receive feedback on how to better interact with the community :)
I am trying to access Wikipedia metadata in a streaming and
time/resource-sustainable manner. By metadata I mean many of the items
that can be found in the statistics of a wiki article, such as edits,
the list of editors, page views, etc.
I would like to do this for an online-classifier type of structure:
retrieve the data from a large number of wiki pages at regular intervals
and use it as input for predictions.
I tried the Wikipedia API; however, it is time- and resource-expensive,
both for me and for Wikipedia.
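
For concreteness, the kind of polling I mean looks roughly like this (a
minimal sketch against the MediaWiki Action API; the User-Agent contact
string is a placeholder):

    # Sketch: poll edit metadata for one article via the Action API.
    import requests

    API = "https://en.wikipedia.org/w/api.php"
    # Placeholder identification; a real contact address should go here.
    HEADERS = {"User-Agent": "my-research-bot/0.1 (contact: you@example.org)"}

    def recent_revisions(title, limit=50):
        """Return (timestamp, user) pairs for a page's latest revisions."""
        params = {
            "action": "query",
            "format": "json",
            "prop": "revisions",
            "titles": title,
            "rvprop": "timestamp|user",
            "rvlimit": limit,
            "maxlag": 5,  # back off politely when the servers are lagged
        }
        data = requests.get(API, params=params, headers=HEADERS,
                            timeout=30).json()
        page = next(iter(data["query"]["pages"].values()))
        return [(r["timestamp"], r.get("user", ""))
                for r in page.get("revisions", [])]

    print(recent_revisions("Alan Turing")[:5])

Doing this for many pages at every polling interval is exactly what gets
expensive, which is why I am looking at the database route below.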
My preferred choice now would be to query the specific tables in the
Wikipedia database, in the same way this is done through the Quarry tool.
The problem with Quarry is that I would like to build a standalone script,
without depending on a user interface like Quarry. Do you think this is
possible? I am still fairly new to all of this, and I don't know exactly
what the best direction is.
I saw [1] <https://meta.wikimedia.org/wiki/Research:Data> that I could
access the wiki replicas through either Toolforge or PAWS; however, I
didn't understand which one would serve me better. Could I ask you for
some feedback?
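
For instance, from the documentation it looks like a standalone script on
Toolforge could query the replicas roughly like this (an untested sketch;
the hostname pattern, the enwiki_p database name and the
~/replica.my.cnf credentials file are what the docs describe):

    # Sketch: count recent edits to one page via the Wiki Replicas,
    # assuming this runs inside Toolforge.
    import pymysql

    conn = pymysql.connect(
        host="enwiki.analytics.db.svc.wikimedia.cloud",  # analytics replica
        database="enwiki_p",
        read_default_file="~/replica.my.cnf",  # Toolforge credentials
    )
    with conn.cursor() as cur:
        # Edits in the last 30 days; rev_timestamp uses MediaWiki's
        # YYYYMMDDHHMMSS format, so build the cutoff in that format.
        cur.execute(
            """
            SELECT COUNT(*) FROM revision
            JOIN page ON rev_page = page_id
            WHERE page_namespace = 0
              AND page_title = %s
              AND rev_timestamp >=
                  DATE_FORMAT(NOW() - INTERVAL 30 DAY, '%%Y%%m%%d%%H%%i%%s')
            """,
            ("Alan_Turing",),  # example page title
        )
        print(cur.fetchone()[0])
    conn.close()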
Also, as far as I understood [2]
<https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake>, directly
accessing the Data Lake through Hive is too technical for what I need,
right? Especially since it seems I would need an account with production
shell access, which I honestly don't think I would be granted. Also, I am
not interested in accessing sensitive or private data.
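
As far as I can tell, page view counts at least are also exposed through
the public Wikimedia REST API, which needs no shell access at all; a rough
sketch (the User-Agent contact is again a placeholder):

    # Sketch: daily pageviews from the public Wikimedia REST API.
    import requests
    from urllib.parse import quote

    def daily_views(article, start, end, project="en.wikipedia"):
        url = (
            "https://wikimedia.org/api/rest_v1/metrics/pageviews/"
            f"per-article/{project}/all-access/all-agents/"
            f"{quote(article, safe='')}/daily/{start}/{end}"
        )
        headers = {"User-Agent":
                   "my-research-bot/0.1 (contact: you@example.org)"}
        resp = requests.get(url, headers=headers, timeout=30)
        resp.raise_for_status()
        return [(item["timestamp"], item["views"])
                for item in resp.json()["items"]]

    print(daily_views("Alan_Turing", "20210901", "20210915")[:3])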
The last resort would be parsing the analytics dumps, but this seems a
less organic way of retrieving and polishing the data. It would also be
strongly decentralised and tied to my physical machine, unless I uploaded
the polished data online every time.
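
If I went this way, I would probably stream the hourly files line by line
instead of storing them, roughly like this (a sketch; the URL pattern and
the four-column line format are taken from the pageviews dumps
documentation and should be double-checked):

    # Sketch: stream one hourly pageviews dump without keeping the
    # whole file in memory. Lines look like:
    #   domain_code page_title count_views total_response_size
    import gzip
    import urllib.request

    URL = ("https://dumps.wikimedia.org/other/pageviews/"
           "2021/2021-09/pageviews-20210916-140000.gz")

    with urllib.request.urlopen(URL) as raw, \
            gzip.open(raw, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.rstrip("\n").split(" ")
            if len(parts) != 4:
                continue  # skip malformed lines
            domain, title, views, _ = parts
            # Example filter: busy English Wikipedia pages only.
            if domain == "en" and int(views) > 1000:
                print(title, views)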
Sorry for the long message, but I thought it was better to give you a
clearer picture (hoping this is clear enough). Even a small hint would be
highly appreciated.
Best,
Cristina
[1] https://meta.wikimedia.org/wiki/Research:Data
[2] https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake