Kiril,
I wrote something a while back in Java that gets the number of
contributions per user for a given language edition of Wikipedia. It could
be altered for your purposes if the data structure of the namespaces is
the same or similar.
Restricting which contributions are counted to a specific date range is
also possible.
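To give a sense of the approach, a rough sketch of the date-range counting
against the public API is below (my tool is in Java; this Python outline
and its names are illustrative only, not the actual code):

  # Sketch: count one user's edits in a date range via the MediaWiki API
  # (list=usercontribs). The API parameters are real; the wiki, user, and
  # dates are placeholders.
  import requests

  API = "https://en.wikipedia.org/w/api.php"  # pick the language edition

  def count_contribs(user, start, end):
      """Count edits by `user` between two ISO 8601 timestamps."""
      params = {
          "action": "query",
          "list": "usercontribs",
          "ucuser": user,
          "ucstart": end,   # enumeration runs newest-to-oldest by default
          "ucend": start,
          "uclimit": "max",
          "format": "json",
      }
      total = 0
      while True:
          data = requests.get(API, params=params).json()
          total += len(data["query"]["usercontribs"])
          if "continue" not in data:
              return total
          params.update(data["continue"])  # follow API continuation

  print(count_contribs("ExampleUser",
                       "2019-01-01T00:00:00Z", "2019-06-01T00:00:00Z"))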
God Bless,
Jonathan
On Fri, Jun 7, 2019 at 8:00 AM <wiki-research-l-request(a)lists.wikimedia.org>
wrote:
Send Wiki-research-l mailing list submissions to
wiki-research-l(a)lists.wikimedia.org
To subscribe or unsubscribe via the World Wide Web, visit
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
or, via email, send a message with subject or body 'help' to
wiki-research-l-request(a)lists.wikimedia.org
You can reach the person managing the list at
wiki-research-l-owner(a)lists.wikimedia.org
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Wiki-research-l digest..."
Today's Topics:
1. Fwd: [Wikidata] Scaling Wikidata Query Service (Pine W)
2. Database of all users (Kiril Simeonovski)
3. Re: Database of all users (Federico Leva (Nemo))
4. Re: Database of all users (Kiril Simeonovski)
----------------------------------------------------------------------
Message: 1
Date: Thu, 6 Jun 2019 19:35:13 +0000
From: Pine W <wiki.pine(a)gmail.com>
To: "wikitech-l(a)lists.wikimedia.org" <wikitech-l(a)lists.wikimedia.org>,
Wiki Research-l <wiki-research-l(a)lists.wikimedia.org>
Subject: [Wiki-research-l] Fwd: [Wikidata] Scaling Wikidata Query
Service
Message-ID:
<CAF=dyJiJFXf7Jp8NUUu90Zd2dBT6J=FhTyjirAWRhN+UV2jLpQ(a)mail.gmail.com>
Content-Type: text/plain; charset="UTF-8"
Forwarding in case this is of interest.
Pine
( https://meta.wikimedia.org/wiki/User:Pine )
---------- Forwarded message ---------
From: Guillaume Lederrey <glederrey(a)wikimedia.org>
Date: Thu, Jun 6, 2019 at 7:33 PM
Subject: [Wikidata] Scaling Wikidata Query Service
To: Discussion list for the Wikidata project
<wikidata(a)lists.wikimedia.org>
Hello all!
There have been a number of concerns raised about the performance and
scaling of the Wikidata Query Service. We share those concerns and we are
doing our best to address them. Here is some info about what is going
on:
In an ideal world, WDQS should:
* scale in terms of data size
* scale in terms of number of edits
* have low update latency
* expose a SPARQL endpoint for queries
* allow anyone to run any queries on the public WDQS endpoint (see the
example after this list)
* provide great query performance
* provide a high level of availability
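To make the last two items concrete: any HTTP client can send arbitrary
SPARQL to the public endpoint, along these lines (a minimal illustration,
not part of our tooling; the query itself is just an example):

  # Run one SPARQL query against the public WDQS endpoint.
  import requests

  resp = requests.get(
      "https://query.wikidata.org/sparql",
      params={
          "query": "SELECT ?item WHERE { ?item wdt:P31 wd:Q146 } LIMIT 5",
          "format": "json",
      },
      headers={"User-Agent": "wdqs-example/0.1 (mailing list demo)"},
  )
  # Standard SPARQL JSON results: entity URIs for five house cats (Q146).
  for binding in resp.json()["results"]["bindings"]:
      print(binding["item"]["value"])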
Scaling graph databases is a "known hard problem", and we are reaching
a scale where there are no obvious easy solutions to address all the
above constraints. At this point, just "throwing hardware at the
problem" is not an option anymore. We need to go deeper into the
details and potentially make major changes to the current architecture.
Some scaling considerations are discussed in [1]. This is going to take
time.
Realistically, addressing all of the above constraints is unlikely to
ever happen. Some of the constraints are non-negotiable: if we can't
keep up with Wikidata in terms of data size or number of edits, it does
not make sense to address query performance. On other constraints, we
will probably need to compromise.
For example, the update process is asynchronous. It is by nature
expected to lag. In the best case, this lag is measured in minutes,
but can climb to hours occasionally. This is a case of prioritizing
stability and correctness (ingesting all edits) over update latency.
And while we can work to reduce the maximum latency, this will still
be an asynchronous process and needs to be considered as such.
We currently have one Blazegraph expert working with us to address a
number of performance and stability issues. We
are planning to hire an additional engineer to help us support the
service in the long term. You can follow our current work in Phabricator
[2].
If anyone has experience with scaling large graph databases, please
reach out to us; we're always happy to share ideas!
Thanks all for your patience!
Guillaume
[1]
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/ScalingStrategy
[2]
https://phabricator.wikimedia.org/project/view/1239/
--
Guillaume Lederrey
Engineering Manager, Search Platform
Wikimedia Foundation
UTC+2 / CEST
_______________________________________________
Wikidata mailing list
Wikidata(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
------------------------------
Message: 2
Date: Fri, 7 Jun 2019 08:57:38 +0200
From: Kiril Simeonovski <kiril.simeonovski(a)gmail.com>
To: Research into Wikimedia content and communities
<wiki-research-l(a)lists.wikimedia.org>
Subject: [Wiki-research-l] Database of all users
Message-ID:
<CABuEHm5mfDeo7sjrpmW_aK-Mpd2qH0c2JyGnoU9OdB5YtuTheg(a)mail.gmail.com>
Content-Type: text/plain; charset="UTF-8"
Dear all,
I was wondering if there is a way to extract, from XTools, a database of
all users (or a selection of users according to some criteria) with their
contributions to the Wikimedia projects up to a fixed point in time.
Thank you.
Best regards,
Kiril
------------------------------
Message: 3
Date: Fri, 7 Jun 2019 10:53:30 +0300
From: "Federico Leva (Nemo)" <nemowiki(a)gmail.com>
To: Research into Wikimedia content and communities
<wiki-research-l(a)lists.wikimedia.org>, Kiril Simeonovski
<kiril.simeonovski(a)gmail.com>
Subject: Re: [Wiki-research-l] Database of all users
Message-ID: <33f8a998-2144-1d49-5347-8c59018e2fcb(a)gmail.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Kiril Simeonovski, 07/06/19 09:57:
> with their contributions to the Wikimedia projects
Do you mean the *number* of their contributions, or literally all their
contributions? Filtering the stub dumps would be one systematic way to
get all the metadata about edits.
If you just need aggregate numbers with some filter by date, namespace,
or other criteria, the fastest way is probably to write a script which
loops through all the databases on Labs. For instance, I made this script
to list users who contribute in a certain language, to find translators
for very small languages:
<https://gerrit.wikimedia.org/r/plugins/gitiles/labs/tools/lists/+/master/sc…
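Roughly, such a loop could look like the sketch below. This is not my
actual script: the replica host pattern, the meta_p catalogue, and the
post-actor-migration schema are assumptions to check against the current
Wiki Replicas documentation.

  # Sketch: per-user, per-namespace edit counts up to a cutoff, per wiki.
  import os.path
  import pymysql

  CNF = os.path.expanduser("~/replica.my.cnf")  # Toolforge credentials
  CUTOFF = "20190601000000"  # rev_timestamp format: YYYYMMDDHHMMSS

  def connect(dbname, host):
      # Wiki Replicas convention: database "xxwiki" is exposed as "xxwiki_p".
      return pymysql.connect(host=host, database=dbname,
                             read_default_file=CNF)

  # meta_p lists all public wikis; the host names are an assumed pattern.
  meta = connect("meta_p", "meta.analytics.db.svc.wikimedia.cloud")
  with meta.cursor() as cur:
      cur.execute("SELECT dbname FROM wiki WHERE is_closed = 0")
      dbnames = [row[0] for row in cur.fetchall()]

  for db in dbnames:
      conn = connect(db + "_p", db + ".analytics.db.svc.wikimedia.cloud")
      with conn.cursor() as cur:
          # On very large wikis this aggregate is too heavy to run in one
          # go; batch by user or by date range in practice.
          cur.execute(
              """SELECT actor_name, page_namespace, COUNT(*)
                 FROM revision
                 JOIN actor ON rev_actor = actor_id
                 JOIN page ON rev_page = page_id
                 WHERE rev_timestamp <= %s
                 GROUP BY actor_name, page_namespace""",
              (CUTOFF,))
          for user, ns, edits in cur.fetchall():
              print(db, user, ns, edits, sep="\t")
      conn.close()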
Federico
------------------------------
Message: 4
Date: Fri, 7 Jun 2019 09:57:45 +0200
From: Kiril Simeonovski <kiril.simeonovski(a)gmail.com>
To: "Federico Leva (Nemo)" <nemowiki(a)gmail.com>
Cc: Research into Wikimedia content and communities
<wiki-research-l(a)lists.wikimedia.org>
Subject: Re: [Wiki-research-l] Database of all users
Message-ID:
<CABuEHm7ahWx9P=xa_km1S+Q3Z0WkOHaxcOunFx3AsA_cfnv-hg(a)mail.gmail.com>
Content-Type: text/plain; charset="UTF-8"
Hi Federico,
Thanks for the straightforward answer. My idea is to extract the number of
contributions across projects and namespaces.
Best,
Kiril
On Fri, Jun 7, 2019 at 9:53 AM Federico Leva (Nemo) <nemowiki(a)gmail.com>
wrote:
> Kiril Simeonovski, 07/06/19 09:57:
> > with their contributions to the Wikimedia projects
>
> Do you mean the *number* of their contributions, or literally all their
> contributions? Filtering the stub dumps would be one systematic way to
> get all the metadata about edits.
> If you just need aggregate numbers with some filter by date, namespace,
> or other criteria, the fastest way is probably to write a script which
> loops through all the databases on Labs. For instance, I made this script
> to list users who contribute in a certain language, to find translators
> for very small languages:
> <https://gerrit.wikimedia.org/r/plugins/gitiles/labs/tools/lists/+/master/sc…
>
> Federico
------------------------------
Subject: Digest Footer
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
------------------------------
End of Wiki-research-l Digest, Vol 166, Issue 4
***********************************************