Forwarding this question to the public Analytics list, where it's good to have these kinds of discussions. If you're interested in this data and how it changes over time, do subscribe and watch for updates, notices of outages, etc.
Ok, so on to your question. You'd like the *total # of articles for each wiki*. I think the simplest way right now is to query the AQS (Analytics Query Service) API, documented here: https://wikitech.wikimedia.org/wiki/Analytics/AQS/Wikistats_2
To get the # of articles for a wiki, let's say en.wikipedia.org, you can get the timeseries of new articles per month since the beginning of time:
https://wikimedia.org/api/rest_v1/metrics/edited-pages/new/en.wikipedia.org/all-editor-types/all-page-types/monthly/2001010100/2018032900
And to get a list of all wikis, to plug into that URL instead of "en.wikipedia.org", the most up-to-date information is here: https://meta.wikimedia.org/wiki/Special:SiteMatrix in table form, or via the MediaWiki API: https://meta.wikimedia.org/w/api.php?action=sitematrix&formatversion=2&format=json&maxage=3600&smaxage=3600. Sometimes new sites won't have data in the AQS API for a month or two until we add them and start crunching their stats.
The way I figured this out is to look at how our UI uses the API: https://stats.wikimedia.org/v2/#/en.wikipedia.org/contributing/new-pages. So if you were interested in something else, you can browse around there and take a look at the XHR requests in the browser console. Have fun!
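For reference, the recipe above is only a few lines of Python. This is a minimal sketch, not an official client: the sitematrix response layout and the AQS field names ("items"/"results"/"new_pages") are assumptions based on the documentation linked above, and the User-Agent contact is hypothetical; verify both against live responses.

```python
# Sketch of the recipe above. Assumptions: sitematrix groups wikis under
# numeric keys plus "count"/"specials", and the AQS JSON has the shape
# items[0].results[*].new_pages. Verify against live responses.
import requests

HEADERS = {"User-Agent": "article-count-sketch/0.1 (you@example.com)"}  # hypothetical contact

# 1) List all wikis via the sitematrix module.
matrix = requests.get(
    "https://meta.wikimedia.org/w/api.php",
    params={"action": "sitematrix", "formatversion": 2, "format": "json"},
    headers=HEADERS,
).json()["sitematrix"]
wikis = [
    site["url"]
    for key, group in matrix.items()
    if key not in ("count", "specials")  # skip the total and the special wikis
    for site in group["site"]
]
print(len(wikis), "wikis, e.g.", wikis[:3])

# 2) Sum the monthly new-pages timeseries for one wiki.
url = ("https://wikimedia.org/api/rest_v1/metrics/edited-pages/new/"
       "en.wikipedia.org/all-editor-types/all-page-types/monthly/"
       "2001010100/2018032900")
data = requests.get(url, headers=HEADERS).json()
print(sum(r["new_pages"] for r in data["items"][0]["results"]))
```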
On Thu, Mar 29, 2018 at 12:54 AM, Zainan Zhou (a.k.a Victor) <zzn@google.com> wrote:
Hi Dan,
How are you! This is Victor. It's been a while since we met at the 2018 Wikimedia Dev Summit. I hope you are doing great.
As I mentioned to you, my team works on extracting knowledge from Wikipedia, and it's currently undergoing a project that expands language coverage. My teammate Yuan Gao (cc'ed here) is the tech lead of this project. She plans to *monitor the list of all currently available Wikipedia sites and the number of articles for each language*, so that we can compare against our extraction system's output to sanity-check whether there is a massive breakage of the extraction logic, or whether we need to add/remove languages when a new Wikipedia site is introduced to or removed from the Wikipedia family.
I think your team at Analytics at Wikimedia probably knows best where we can find this data. Here are 4 places we already know of, but they don't seem to have the data we need:
- https://en.wikipedia.org/wiki/List_of_Wikipedias, has the information we need, but the list is manually edited, not automatic
- https://stats.wikimedia.org/EN/Sitemap.htm, has the full list, but the information seems pretty out of date (last updated almost a month ago)
- StatsV2 UI: https://stats.wikimedia.org/v2/#/all-projects, I can't find the full list or the number of articles
- API https://wikimedia.org/api/rest_v1/, suggested by elukey on the #wikimedia-analytics channel; it doesn't seem to have # of article information
Do you know of a good place to find this information? Thank you!
Victor
Zainan Zhou (周载南) a.k.a. "Victor" | http://who/zzn
Software Engineer, Data Engine, Google Inc.
zzn@google.com | ecarmeli@google.com | 650.336.5691
1600 Amphitheatre Pkwy, LDAP zzn, Mountain View 94043
---------- Forwarded message ----------
From: Yuan Gao <gaoyuan@google.com>
Date: Wed, Mar 28, 2018 at 4:15 PM
Subject: Monitor the number of Wikipedia sites and the number of articles in each site
To: Zainan Victor Zhou <zzn@google.com>
Cc: Wenjie Song <wenjies@google.com>, WikiData <wikidata@google.com>
Hi Victor, as we discussed in the meeting, I'd like to monitor:
- the number of Wikipedia sites
- the number of articles in each site
Can you help us contact WMF to get a real-time, or at least daily, update of these numbers? What we can find now is https://en.wikipedia.org/wiki/List_of_Wikipedias, but the number of Wikipedia sites there is manually updated, and possibly out of date.
This monitoring would help us catch such bugs.
-- Yuan Gao
Hi Victor et al.,
[going to a slight tangent.]
On Thu, Mar 29, 2018 at 12:54 AM, Zainan Zhou (a.k.a Victor) <zzn@google.com> wrote:
As I mentioned to you, my team works on extracting knowledge from Wikipedia, and it's currently undergoing a project that expands language coverage.
Please keep us posted on this project as much as you can given all the constraints you operate under. We are doing a lot of work in a common space as we have learned in the past :) Publications, reports, or anything else that can be shared publicly would be very valuable for us. :) Also, do ping me if you or someone in your team will be in Lyon for The Web Conference 2018. Wiki Workshop http://wikiworkshop.org/2018/ is a great opportunity to catch up with the broader Wiki research/analytics community. :)
Best, Leila
Leila, how are you! Glad to receive your email. I answered inline below.
On Fri, Mar 30, 2018 at 5:36 AM, Leila Zia leila@wikimedia.org wrote:
Please keep us posted on this project as much as you can given all the constraints you operate under. We are doing a lot of work in a common space as we have learned in the past :) Publications, reports, or anything else that can be shared publicly would be very valuable for us. :)
Thanks Leila. What questions in particular would you be interested in publications/reports about? Let me know how we can help; I can try to see if I can find and connect you with the right team within Google.
Also, do ping me if you or someone in your team will be in Lyon for The Web Conference 2018. Wiki Workshop http://wikiworkshop.org/2018/ is a great opportunity to catch up with the broader Wiki research/analytics community. :)
I wasn't aware of the workshop, and probably can't go myself. Based on the page, it seems Cong Yu from Google Research will be going?
On Thu, Mar 29, 2018 at 12:54 AM, Zainan Zhou (a.k.a Victor) <zzn@google.com> wrote:
… *monitor the list of all currently available Wikipedia sites and the number of articles for each language* …
I don't think this has yet been mentioned: Wikistats has an automatically updated CSV file listing all the Wikipedias (and Wiktionaries, etc.) along with how many total articles, how many "good" articles, how many stubs, etc., each language has, from English to Kanuri:
https://wikistats.wmflabs.org/ -> "csv" link -> https://wikistats.wmflabs.org/api.php?action=dump&table=wikipedias&f...
I use this enough that I made a little JavaScript library to fetch this data, parse it, and programmatically yield a nice JSON representation:
https://github.com/fasiha/wikipedia-languages/
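For Python users, the same fetch-and-parse is only a few lines. A sketch under assumptions: the link above is truncated, so the exact query string of the dump endpoint (the "format" parameter below) and the CSV column names are guesses; follow the "csv" link on the site for the real URL, and inspect the header row.

```python
# Fetch and parse the wikistats CSV dump. The "format" parameter and the
# column names are assumptions; check the real "csv" link and header row.
import csv
import io
import requests

resp = requests.get(
    "https://wikistats.wmflabs.org/api.php",
    params={"action": "dump", "table": "wikipedias", "format": "csv"},  # assumed query string
    headers={"User-Agent": "wiki-monitor-sketch/0.1 (you@example.com)"},  # hypothetical contact
)
resp.raise_for_status()
rows = list(csv.DictReader(io.StringIO(resp.text)))
print(len(rows), "rows")
print(rows[0])  # inspect which columns hold the prefix and the article counts
```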
Sorry if I overlooked one of your requirements and am suggesting something that won't work for you.
Another data source is https://meta.wikimedia.org/wiki/List_of_Wikipedias/Table (transcluded in https://meta.wikimedia.org/wiki/List_of_Wikipedias), which is updated twice daily by a bot that directly retrieves the numbers as reported in each wiki's [[Special:Statistics]] page, and can be considered reliable. (I.e. it is using basically the same primary source as http://wikistats.wmflabs.org/display.php?t=wp, the tool Ahmed mentioned.)
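Note that the numbers on each wiki's [[Special:Statistics]] page are also exposed through the standard MediaWiki API (meta=siteinfo with siprop=statistics), so a monitor can poll the same primary source directly. A minimal sketch (the User-Agent contact is hypothetical):

```python
# Read the Special:Statistics numbers via the MediaWiki API.
import requests

def article_count(host):
    """Return the 'articles' count a wiki reports on Special:Statistics."""
    resp = requests.get(
        f"https://{host}/w/api.php",
        params={"action": "query", "meta": "siteinfo",
                "siprop": "statistics", "format": "json"},
        headers={"User-Agent": "wiki-monitor-sketch/0.1 (you@example.com)"},
    )
    resp.raise_for_status()
    return resp.json()["query"]["statistics"]["articles"]

print(article_count("en.wikipedia.org"))
```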
Two more comments inline below.
On Thu, Mar 29, 2018 at 2:12 PM, Dan Andreescu dandreescu@wikimedia.org wrote:
To get the # of articles for a wiki, let's say en.wikipedia.org, you can get the timeseries of new articles per month since the beginning of time:
https://wikimedia.org/api/rest_v1/metrics/edited-pages/new/en.wikipedia.org/all-editor-types/all-page-types/monthly/2001010100/2018032900
Unless I'm mistaken, summing up these monthly numbers would yield 3.5 million articles - but English Wikipedia already has over 5 million per https://en.wikipedia.org/wiki/Special:Statistics. Do you get a different result?
In general, it's worth being aware that there are various subtleties involved in defining article counts precisely, as detailed at https://meta.wikimedia.org/wiki/Article_counts_revisited . (But not too much aware, that's not good for your mental health. Seriously, that page is a data analyst's version of a horror novel. Don't read it alone at night.)
On Thu, Mar 29, 2018 at 7:51 PM, Tilman Bayer tbayer@wikimedia.org wrote:
Unless I'm mistaken, summing up these monthly numbers would yield 3.5 million articles
35 million actually (34,512,751), so even further off. Looking at the documentation, this seems to be the API request for *all* new pages, not just articles. The result still doesn't match https://en.wikipedia.org/wiki/Special:Statistics, which gives 45 million total pages; but maybe that's because redirects are counted differently or such.
The correct URL seems to be this (replace "all-page-types" above with "content"): https://wikimedia.org/api/rest_v1/metrics/edited-pages/new/en.wikipedia.org/all-editor-types/content/monthly/2001010100/2018032900
That would seem to yield 5,613,179 articles by February 28, 2018, still a notable discrepancy from the official number on [[Special:Statistics]] (5,600,831 right now, a month later), but a smaller one.
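To see the difference between the two filters directly, here is a small self-contained sketch. Same caveat as before: the AQS field names ("items"/"results"/"new_pages") are assumptions from the docs, and the User-Agent contact is hypothetical.

```python
# Compare the two page-type filters discussed above.
import requests

AQS = "https://wikimedia.org/api/rest_v1/metrics/edited-pages/new"
HEADERS = {"User-Agent": "article-count-sketch/0.1 (you@example.com)"}  # hypothetical contact

for page_type in ("all-page-types", "content"):
    url = (f"{AQS}/en.wikipedia.org/all-editor-types/"
           f"{page_type}/monthly/2001010100/2018032900")
    data = requests.get(url, headers=HEADERS).json()
    total = sum(r["new_pages"] for r in data["items"][0]["results"])
    print(page_type, total)  # per the numbers above: ~34.5M vs ~5.6M
```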
-- Tilman Bayer, Senior Analyst, Wikimedia Foundation. IRC (Freenode): HaeB
Thank you Tilman, you reminded us that we need to ensure our way of counting articles matches the metrics we get from Wikimedia. That's very helpful.
Victor
Hi Tilman, our team, i.e., the team working on extracting knowledge from Wikipedia at Google, has just compared our crawled data with https://meta.wikimedia.org/wiki/List_of_Wikipedias/Table. On the following sites, we have quite significant diffs:

Wikipedia site               # listed in Table    # from Google crawled data
http://ady.wikipedia.org/    409                  549
http://bjn.wikipedia.org/    1844                 1952
http://bo.wikipedia.org/     5818                 11120
We follow the same definition of an article (https://en.wikipedia.org/wiki/Wikipedia:What_is_an_article%3F) when counting the Google crawled data. Is there any way to debug why there are such huge diffs? Take bo.wikipedia.org for example: we tried to crawl all URLs listed in https://bo.wikipedia.org/w/index.php?title=Special:AllPages&hideredirect..., but it seems to contain redirect pages, so the total count of URLs is 16,498, not 5,818.
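One way to see where such a gap comes from is to count main-namespace pages with and without redirects via the MediaWiki API (list=allpages with apfilterredir). A sketch under assumptions: it counts the main namespace only, so it won't exactly reproduce a "Pages" total that includes talk pages, and even the non-redirect count can exceed the official article count, since that metric may additionally require at least one internal link.

```python
# Count main-namespace pages on a wiki, optionally excluding redirects.
# Paging through a large wiki this way is slow; fine for small ones like bo.
import requests

def count_pages(host, include_redirects):
    session = requests.Session()
    session.headers["User-Agent"] = "wiki-monitor-sketch/0.1 (you@example.com)"  # hypothetical contact
    params = {
        "action": "query", "list": "allpages", "apnamespace": 0,
        "apfilterredir": "all" if include_redirects else "nonredirects",
        "aplimit": "max", "format": "json",
    }
    total = 0
    while True:
        data = session.get(f"https://{host}/w/api.php", params=params).json()
        total += len(data["query"]["allpages"])
        if "continue" not in data:
            return total
        params.update(data["continue"])  # standard API continuation

print(count_pages("bo.wikipedia.org", True))   # main-namespace pages incl. redirects
print(count_pages("bo.wikipedia.org", False))  # non-redirect main-namespace pages
```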
On Wed, Aug 1, 2018 at 3:07 PM Yuan Gao gaoyuan@google.com wrote:
Hi Tilman, our team, i.e., the team working on extracting knowledge from Wikipedia at Google, has just compared our crawled data with https://meta.wikimedia.org/wiki/List_of_Wikipedias/Table. On the following sites, we have quite significant diffs:
The stats Special page for bo.wikipedia provides the following counts as of today:
Content pages (https://bo.wikipedia.org/w/index.php?title=Special:AllPages&hideredirects=1): 5,818
Pages (https://bo.wikipedia.org/wiki/Special:AllPages; all pages in the wiki, including talk pages, redirects, etc.): 16,498
A page, according to the software documentation, is: "The automatic definition used by the software at Special:Statistics (https://en.wikipedia.org/wiki/Special:Statistics) is: *any page that is in the article namespace, is not a redirect page (https://en.wikipedia.org/wiki/Wikipedia:Redirect), and contains at least one wiki link*." Could it be that your definition is broader than the MediaWiki one? https://en.wikipedia.org/wiki/Wikipedia:What_is_an_article%3F#Lists_of_artic... Another thing I would suggest checking is whether Google may be including duplicate results.
There could be some amount of caching in both the statistics calculation and the rendering of those pages, although probably not enough to double the number of articles.
Thank you very much Dan, this turns out to be very helpful. My teammates have started looking into it.
Thanks to Tilman for pointing out that this data is still being worked on. So, yes, there are lots of subtleties in how we count articles, redirects, content vs. non-content, etc. I don't have the answer to all of the discrepancies that Tilman found, but if you need a very accurate answer, the only way is to get an account on labs and start digging into how exactly you want to count the articles. As our datasets and APIs get more mature, we're hoping to give as much flexibility as everyone needs, but not so much as to drive people crazy. Until then, we're slowly improving our docs.
And yes, don't read some of this stuff alone at night, the buddy system works well for data analysis, lol
https://wikimedia.org/api/rest_v1/metrics/edited-pages/new/en.wikipedia.org/all-editor-types/content/monthly/2001010100/2018032900
That would seem to yield 5,613,179 articles by February 28, 2018, still a notable discrepancy from the official number on [[Special:Statistics]] (5,600,831 right now, a month later), but a smaller one.
A <1% difference is not surprising. Please see: https://wikitech.wikimedia.org/wiki/Analytics/AQS/Wikistats_2/Data_Quality
We have vetted this data (per project) against wikistats1 metrics definitions, and in most cases differences are less than 1.5%. Again, per project. We expect this to shift a little as we continue working on data quality, but we will always see small differences with other data sources. Some variability comes from data definitions, some from sources and computations.
Thanks Dan, that's very helpful. I asked two follow-up questions inline below.
On Sat, Mar 31, 2018 at 12:34 AM, Dan Andreescu dandreescu@wikimedia.org wrote:
Thanks to Tilman for pointing out that this data is still being worked on. So, yes, there are lots of subtleties in how we count articles, redirects, content vs. non-content, etc. I don't have the answer to all of the discrepancies that Tilman found, but if you need a very accurate answer, the only way is to get an account on labs and start digging into how exactly you want to count the articles.
What's the best way to sign up for the labs account? (Does it require certain qualifications?) And could you point us to the code, or the entry point of the code repository?
Zainan:
Labs is our cloud environment for volunteers; you can direct questions about it to the Cloud e-mail list.
https://wikitech.wikimedia.org/wiki/Help:Cloud_Services_Introduction
Thanks,
Nuria