Hi,
I always hear people saying that most of the articles usually receive little to no edits (and that is used to encourage participants to make sure their articles are good enough). I would like to know if there are statistics that support this for the English and Arabic Wikipedia.
Best, Reem
Reem Al-Kashif, 07/09/2016 15:52:
I always hear people saying that most of the articles usually receive little to no edits
Do you mean that many articles
* have not been edited in a long time (6+ months?),
* have few revisions (that is?), or
* have only a human editor or two?
(and that is used to encourage participants to make sure their articles are good enough).
Dubious reasoning; other things being equal, articles with errors or other deficiencies are more likely to be edited further.
I would like to know if there are statistics that support this for the English and Arabic Wikipedia.
Wikistats reports on the average number of edits per article (while you'd need a median at least): https://stats.wikimedia.org/EN/TablesArticlesEditsPerArticle.htm
MediaWiki lists the 5000 pages that have gone longest without an edit at https://ar.wikipedia.org/wiki/Special:AncientPages and you can easily replicate such a query, e.g. on http://quarry.wmflabs.org/
Nemo
Hi Reem,
Here are some rough estimates.
English - https://stats.wikimedia.org/EN/TablesWikipediaEN.htm
English has ~5.2 million articles, with an average of ~92 edits per article, not counting deleted edits (or deleted articles). Note that 80% of those articles are more than three years old, so they've had plenty of time to build up the 92 edits.
[The page does not explicitly say that only article edits are counted in the tables, but this is easy to confirm - https://en.wikipedia.org/wiki/Wikipedia:Statistics has 847m edits across all pages, well above the ~478m implied by 5.2m articles x 92 edits]
Arabic - https://stats.wikimedia.org/EN/TablesWikipediaAR.htm
Arabic has ~437k articles, ~31 edits/article - but only half of these are more than three years old, so they're on average a lot younger than the English ones.
As of July there are 3.3m edits/month in English - this is equal to an average of 0.63 edits/article/month - and 226k edits/month in Arabic, equal to 0.52 edits/article/month. July was a slow month for Arabic, and March had more than twice as many edits, 487k, across 415k articles.
These are plain averages. The distribution is going to be very skewed, so high-edit articles get most of the attention, and the other articles easily go months without attention. If we assume an 80:20 distribution - which is a wild guess but sounds plausible - then the "long tail" of 80% of articles would get 20% of the edits. In this case, a plausible average would be:
* English long tail, 4.16m articles and 660k edits/month = average of six months between each edit
* Arabic (July) long tail, 350k articles and 45k edits/month = average of seven or eight months between each edit
* Arabic (March) long tail, 332k articles and 97k edits/month = average of three and a half months between each edit
This is a broad range, but it feels more or less right for all those unloved pages...
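The long-tail arithmetic above can be reproduced with a few lines of Python. This is only a sketch: the 80:20 split is the same wild guess as in the text, the article and edit totals are the rounded Wikistats figures quoted above, and the function name `long_tail_months_between_edits` is made up for illustration.

```python
def long_tail_months_between_edits(articles, edits_per_month,
                                   tail_article_share=0.8, tail_edit_share=0.2):
    """Average months between edits for one long-tail article.

    Assumes (as a guess, not a measurement) that `tail_article_share` of the
    articles receive only `tail_edit_share` of the monthly edits.
    """
    tail_articles = articles * tail_article_share
    tail_edits = edits_per_month * tail_edit_share
    return tail_articles / tail_edits


# (articles, edits per month), rounded figures quoted in the message above.
figures = {
    "English (July)": (5_200_000, 3_300_000),
    "Arabic (July)": (437_000, 226_000),
    "Arabic (March)": (415_000, 487_000),
}

results = {
    name: long_tail_months_between_edits(articles, edits)
    for name, (articles, edits) in figures.items()
}

for name, months in results.items():
    print(f"{name}: about {months:.1f} months between edits")
```

Changing `tail_article_share` and `tail_edit_share` shows how sensitive the estimate is to the assumed split; a 90:10 split, for instance, stretches the gaps considerably.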
Andrew.
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
-- - Andrew Gray andrew.gray@dunelm.org.uk
Quick follow-up 'cause I was curious. I calculated the average and standard deviation for edits per namespace 0 article on enwiki. I tried to do it on the research db replicas but it took forever, so I did it on the hadoop cluster. Including archived pages isn't useful; it changes the results hardly at all. Including pages outside namespace 0 increases the standard deviation and decreases the average. Here are the results:
484,170,218 edits on namespace 0
12,756,342 pages in namespace 0

standard deviation for edits per page: 213.58
average edits per page: 38.02
average days between first and last edit per page: 1215.27
So considering the standard deviation is much larger than the mean, I'm pretty confident the answer is yes: I think the vast majority of articles in namespace 0 on enwiki get very few edits. The dataset we're working on releasing as part of wikistats 2.0 will allow these kinds of questions to be answered really easily and really quickly. Stay tuned over the next few quarters :)
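To illustrate why a standard deviation well above the mean points to a skewed distribution of edit counts, here is a small Python sketch. The `edit_counts` list is entirely made up for illustration (it is not enwiki data), and the conclusion is only suggestive: the median, or the share of pages below the mean, answers the question more directly than the mean and standard deviation do.

```python
import statistics

# Hypothetical per-article edit counts: many quiet pages, a few very busy ones.
# These numbers are invented for illustration only.
edit_counts = [1] * 50 + [3] * 30 + [10] * 15 + [200] * 4 + [3000]

mean = statistics.mean(edit_counts)
stdev = statistics.pstdev(edit_counts)   # population standard deviation
median = statistics.median(edit_counts)

# Fraction of pages whose edit count is below the mean.
share_below_mean = sum(c < mean for c in edit_counts) / len(edit_counts)

print(f"mean={mean:.1f} stdev={stdev:.1f} median={median}")
print(f"{share_below_mean:.0%} of pages have fewer edits than the mean")
```

In this toy sample a single very busy page pulls the mean far above what a typical page receives, which is the pattern the real figures hint at.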
And the queries: https://gist.github.com/milimetric/8b5f447e3ef09b6fe4384e0f75cc0b34
If you want to edit those queries to find something else out, I'm happy to run them one or two more times, but then I really have to get back to my real job :)
Hi Dan,
Thanks for running these!
I'm struck by the figure of 12.8m pages in ns0 - it looks like this includes redirects (there are ~7.6m ns0 redirects on enwiki, and ~5.2m articles). This will probably skew things a lot, as the majority of those will probably be edited once and never touched again, barring the target page being moved. Given they're ~60% of the pages, this will introduce a lot of extra weight for "articles with very few edits" and "articles that get edited very infrequently".
It might be worth trying to filter out redirects - I suspect this would have a noticeable effect on both the distribution and the mean time between edits.
Andrew.
Good point, updated to *exclude redirects* and rerun:
total_namespace_0_revisions: 457,574,404
total_namespace_0_pages: 5,236,104

per namespace 0 non-redirect article:

standard deviation of edits: *324.45*
*average* edits: *87.54*
standard deviation of days between first and last edit: *1360.16*
*average* days between first and last edit: *2316.37*
So you were right, Andrew: the numbers change, but I think the nature of the data is roughly the same. It's interesting that the average difference between first and last edit is smaller than two standard deviations. That suggests that the curve is also slightly lopsided, with perhaps lots of more recently created articles and few long-lived ones. But that "recent" could be the spike in the 2007-2011 period. It may be interesting to play with these metrics more, and I'll keep this in mind as we build the new infrastructure (making these queries as fast as possible and easy to dig into).
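As a rough cross-check of the two runs, the totals can be subtracted to see what the redirects contributed. This sketch assumes that both runs cover the same snapshot and that the whole difference between them is redirects; neither assumption is confirmed in the thread, and the variable names are made up for illustration.

```python
# Totals reported in the two runs (all of namespace 0 vs. non-redirects only).
all_ns0 = {"revisions": 484_170_218, "pages": 12_756_342}
articles_only = {"revisions": 457_574_404, "pages": 5_236_104}

# Assumption: everything dropped between the runs is a redirect.
redirect_pages = all_ns0["pages"] - articles_only["pages"]
redirect_revisions = all_ns0["revisions"] - articles_only["revisions"]

edits_per_redirect = redirect_revisions / redirect_pages
edits_per_article = articles_only["revisions"] / articles_only["pages"]

print(f"implied redirects: {redirect_pages:,} pages, {redirect_revisions:,} revisions")
print(f"edits per redirect: {edits_per_redirect:.2f}")
print(f"edits per article:  {edits_per_article:.2f}")

# The reported mean span between first and last edit is under two
# reported standard deviations.
mean_days, stdev_days = 2316.37, 1360.16
print(mean_days < 2 * stdev_days)
```

Under those assumptions the redirects average only a handful of edits each, which is why including them dragged the per-page mean down so sharply. The ratio recomputed here for articles comes out close to, but not exactly at, the reported average, so the queries presumably count pages or revisions slightly differently than a plain division of the totals.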
To Andrew's point about excluding redirects, see also this paper by Benjamin Mako Hill and Aaron Shaw (CCed): https://mako.cc/copyrighteous/consider-the-redirect (don't know if they have data for Arabic Wikipedia too)
In short, the distribution of edits is very different for redirects and articles. In light of this, and to address Reem's original question, it's probably worth looking at the actual histogram before relying on the average or other statistical moments.
Also interesting in this regard, although the data is not current: https://meta.wikimedia.org/wiki/Wikipedia_article_depth
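A minimal sketch of the kind of histogram suggested above, in Python. The bucket boundaries and the `edit_counts` sample are invented for illustration; real per-article counts would have to come from a query against the revision data.

```python
from collections import Counter

# Coarse, roughly logarithmic buckets (an arbitrary choice for this sketch).
BUCKET_ORDER = ["1", "2-9", "10-99", "100-999", "1000+"]


def bucket_for(edits):
    """Return the bucket label for a page with `edits` revisions."""
    if edits <= 1:
        return "1"
    if edits < 10:
        return "2-9"
    if edits < 100:
        return "10-99"
    if edits < 1000:
        return "100-999"
    return "1000+"


def histogram(edit_counts):
    """Count pages per bucket, keeping empty buckets in the result."""
    counts = Counter(bucket_for(c) for c in edit_counts)
    return {label: counts.get(label, 0) for label in BUCKET_ORDER}


# Hypothetical per-article edit counts, not real Wikipedia data.
edit_counts = [1] * 50 + [3] * 30 + [10] * 15 + [200] * 4 + [3000]

hist = histogram(edit_counts)
for label, pages in hist.items():
    print(f"{label:>8} edits | {'#' * pages} {pages}")
```

Running the same bucketing separately for redirects and for articles would show directly how different the two distributions are, rather than inferring it from the mean and standard deviation.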