Thank you Tilman, you reminded us that we need to ensure our way of
counting articles is the same as the metrics we get from Wikimedia. That's
very helpful.
Victor
Zainan Zhou (周载南) a.k.a. "Victor" <http://who/zzn>
Software Engineer, Data Engine
Google Inc.
zzn(a)google.com - 650.336.5691
1600 Amphitheatre Pkwy, LDAP zzn, Mountain View 94043
On Fri, Mar 30, 2018 at 10:51 AM, Tilman Bayer <tbayer(a)wikimedia.org> wrote:
Another data source is
https://meta.wikimedia.org/wiki/List_of_Wikipedias/Table (transcluded in
https://meta.wikimedia.org/wiki/List_of_Wikipedias ), which is updated
twice daily by a bot that directly retrieves the numbers as reported in
each wiki's [[Special:Statistics]] page, and can be considered reliable.
(I.e. it is using basically the same primary source as
http://wikistats.wmflabs.org/display.php?t=wp , the tool Ahmed mentioned.)
Two more comments inline below.
On Thu, Mar 29, 2018 at 2:12 PM, Dan Andreescu <dandreescu(a)wikimedia.org>
wrote:
Forwarding this question to the public Analytics
list, where it's good to
have these kinds of discussions. If you're interested in this data and how
it changes over time, do subscribe and watch for updates, notices of
outages, etc.
Ok, so on to your question. You'd like the *total # of articles for
each wiki*. I think the simplest way right now is to query the AQS
(Analytics Query Service) API, documented here:
https://wikitech.wikimedia.org/wiki/Analytics/AQS/Wikistats_2
To get the # of articles for a wiki, let's say
en.wikipedia.org, you can
get the timeseries of new articles per month since the beginning of time:
https://wikimedia.org/api/rest_v1/metrics/edited-pages/new/en.wikipedia.org/all-editor-types/all-page-types/monthly/2001010100/2018032900
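For what it's worth, summing those monthly values can be sketched like this. The "items" / "results" / "new_pages" field names are assumed from the AQS edited-pages/new response format; the sample payload below is hand-made for illustration, not real data:

```python
# Minimal sketch (response field names assumed from the AQS
# "edited-pages/new" endpoint): sum per-month new-article counts.
def total_new_pages(payload):
    """Sum the 'new_pages' value of every monthly data point."""
    return sum(point["new_pages"] for point in payload["items"][0]["results"])

# Tiny hand-made payload shaped like the real response:
sample = {"items": [{"results": [
    {"timestamp": "2018010100", "new_pages": 120},
    {"timestamp": "2018020100", "new_pages": 95},
]}]}

print(total_new_pages(sample))  # 215
```

In practice you would fetch the URL above (e.g. with urllib or requests) and pass the parsed JSON to the same function.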
Unless I'm mistaken, summing up these monthly numbers would yield 3.5
million articles - but English Wikipedia already has over 5 million per
https://en.wikipedia.org/wiki/Special:Statistics . Do you get a different
result?
In general, it's worth being aware that there are various subtleties
involved in defining article counts precisely, as detailed at
https://meta.wikimedia.org/wiki/Article_counts_revisited . (But not too
much aware, that's not good for your mental health. Seriously, that page is
a data analyst's version of a horror novel. Don't read it alone at night.)
And to get a list of all wikis, to plug into that
URL instead of "
en.wikipedia.org", the most up-to-date information is here:
https://meta.wikimedia.org/wiki/Special:SiteMatrix in table form or via
the mediawiki API:
https://meta.wikimedia.org/w/api.php?action=sitematrix&formatversion=2&format=json&maxage=3600&smaxage=3600.
Sometimes new sites won't have data in the AQS
API for a month or two until we add them and start crunching their stats.
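A sketch of pulling the Wikipedia domains out of that sitematrix response might look like this (the shape - numeric keys for per-language entries, a "site" list per entry, and "wiki" as the code for Wikipedia proper - is assumed from the formatversion=2 output; the sample payload is hand-made):

```python
# Hedged sketch: extract Wikipedia domains from a SiteMatrix response
# (formatversion=2). Numeric keys hold per-language entries; the
# "count" and "specials" keys are metadata and are skipped.
def wikipedia_domains(sitematrix):
    domains = []
    for key, entry in sitematrix["sitematrix"].items():
        if key in ("count", "specials"):
            continue
        for site in entry.get("site", []):
            if site.get("code") == "wiki":  # "wiki" = Wikipedia proper
                domains.append(site["url"].removeprefix("https://"))
    return domains

# Hand-made payload shaped like the real response:
sample = {"sitematrix": {
    "count": 2,
    "0": {"code": "en", "site": [
        {"url": "https://en.wikipedia.org", "dbname": "enwiki",
         "code": "wiki"},
        {"url": "https://en.wiktionary.org", "dbname": "enwiktionary",
         "code": "wiktionary"},
    ]},
    "specials": [{"url": "https://meta.wikimedia.org", "code": "meta"}],
}}

print(wikipedia_domains(sample))  # ['en.wikipedia.org']
```

Each resulting domain could then be substituted into the AQS URL above.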
The way I figured this out is to look at how our UI uses the API:
https://stats.wikimedia.org/v2/#/en.wikipedia.org/contributing/new-pages.
So if you were interested in something else, you can browse around there
and take a look at the XHR requests in the browser console. Have fun!
On Thu, Mar 29, 2018 at 12:54 AM, Zainan Zhou (a.k.a Victor) <
zzn(a)google.com> wrote:
Hi Dan,
How are you! This is Victor. It's been a while since we met at the 2018
Wikimedia Dev Summit. I hope you are doing great.
As I mentioned to you, my team works on extracting knowledge from
Wikipedia. It is currently undergoing a project that expands language
coverage. My teammate Yuan Gao (cc'ed here) is the tech lead of this
project. She plans to *monitor the list of all currently available
Wikipedia sites and the number of articles for each language*, so
that we can compare against our extraction system's output to sanity-check
whether there is a massive breakage of the extraction logic, or whether we
need to add/remove languages in the event that a new Wikipedia site is
introduced to/removed from the Wikipedia family.
I think your team at Analytics at Wikimedia probably knows best
where we can find this data. Here are 4 places we already know of, but
none seems to have the data:
- https://en.wikipedia.org/wiki/List_of_Wikipedias has the
  information we need, but the list is manually edited, not automatic
- https://stats.wikimedia.org/EN/Sitemap.htm has the full list, but
  the information seems pretty out of date (last updated almost a month ago)
- StatsV2 UI: https://stats.wikimedia.org/v2/#/all-projects, where I can't
  find the full list nor the number of articles
- API: https://wikimedia.org/api/rest_v1/, suggested by elukey on the
  #wikimedia-analytics channel; it doesn't seem to have the number of
  articles
Do you know what is a good place to find this information? Thank you!
Victor
Zainan Zhou (周载南) a.k.a. "Victor" <http://who/zzn>
Software Engineer, Data Engine
Google Inc.
zzn(a)google.com - 650.336.5691
1600 Amphitheatre Pkwy, LDAP zzn, Mountain View 94043
---------- Forwarded message ----------
From: Yuan Gao <gaoyuan(a)google.com>
Date: Wed, Mar 28, 2018 at 4:15 PM
Subject: Monitor the number of Wikipedia sites and the number of
articles in each site
To: Zainan Victor Zhou <zzn(a)google.com>
Cc: Wenjie Song <wenjies(a)google.com>, WikiData <wikidata(a)google.com>
Hi Victor,
as we discussed in the meeting, I'd like to monitor:
1) the number of Wikipedia sites
2) the number of articles in each site
Can you help us contact WMF to get a real-time, or at least daily,
update of these numbers? What we can find now is
https://en.wikipedia.org/wiki/List_of_Wikipedias, but the number of
Wikipedia sites there is manually updated and possibly out of date.
Such monitoring would help us catch these bugs.
--
Yuan Gao
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
--
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB