Hi Giovanni,
Thank you for your message :)
You are correct that there is no information on page-to-page links as of
today, nor, for instance, on the historical values of whether revisions
were redirects.
We agree that such information is extremely valuable, and we plan to
extract it at some point.
It has not been done yet because those pieces of information are only
available through parsing the wikitext of every revision, which is not
only resource intensive but also technically complicated (templates,
version changes, etc.).
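To give a rough idea of what parsing one revision involves, here is a
minimal Python sketch. It assumes a naive regular expression for literal
`[[...]]` links; the function name `extract_links` and the sample text are
hypothetical and are not part of the dataset or its pipeline. It does not
expand templates, so links produced by templates are missed, which is one
of the complications mentioned above:

```python
import re

# Naive pattern for [[Target]], [[Target|label]] and [[Target#Section]] links.
# Assumption: only links written literally in the wikitext are of interest.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def extract_links(wikitext):
    """Return the link targets written literally in one revision's wikitext."""
    return [match.group(1).strip() for match in WIKILINK.finditer(wikitext)]


# Hypothetical sample revision text; the template is left unexpanded.
sample = "See [[Paris]] and [[France|the country]]. {{Infobox city}} [[Rome#History]]"
print(extract_links(sample))
```

Running something like this over every revision of every page is where the
resource cost comes from.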
You can be sure we will send another announcement when we release that
data :)
Best,
On Tue, Feb 11, 2020 at 10:30 PM Giovanni Luca Ciampaglia <glc3(a)mail.usf.edu>
wrote:
> Hi Joseph,
>
> Thanks a lot for creating and sharing such a valuable resource. I went
> through the schema and from what I understand there is no information about
> page-to-page links, correct? Are there any resources that would provide
> such historical data?
>
> Best,
>
> *Giovanni Luca Ciampaglia* ∙ glciampaglia.com
> Assistant Professor
> Computer Science and Engineering
> <https://www.usf.edu/engineering/cse/> ∙ University
> of South Florida <https://www.usf.edu/>
>
> *Due to Florida’s broad open records law, email to or from university
> employees is public record, available to the public and the media upon
> request.*
>
>
> On Mon, Feb 10, 2020 at 11:28 AM Joseph Allemandou <
> jallemandou(a)wikimedia.org> wrote:
>
> > Hi Analytics People,
> >
> > The Wikimedia Analytics Team is pleased to announce the release of the
> > most complete dataset we have to date to analyze content and contributors
> > metadata: Mediawiki History [1] [2].
> >
> > Data is in TSV format, usually released monthly around the 3rd of the
> > month, and every new release contains the full history of metadata.
> >
> > The dataset contains an enhanced [3] and historified [4] version of user,
> > page and revision metadata and serves as a base for the Wikistats API on
> > edits, users and pages [5] [6].
> >
> > We hope you will have as much fun playing with the data as we have
> > building it, and we're eager to hear from you [7], whether for issues,
> > ideas or usage of the data.
> >
> > Analytically yours,
> >
> > --
> > Joseph Allemandou (joal) (he / him)
> > Sr Data Engineer
> > Wikimedia Foundation
> >
> > [1] https://dumps.wikimedia.org/other/mediawiki_history/readme.html
> > [2] https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_his…
> > [3] Many pre-computed fields are present in the dataset, from edit-counts
> > by user and page to reverts and reverted information, as well as time
> > between events.
> > [4] Historical usernames and page titles (as well as user groups and
> > blocks), as accurate as possible, are available in addition to current
> > values, and are provided in a denormalized way on every event of the dataset.
> > [5] https://wikitech.wikimedia.org/wiki/Analytics/AQS/Wikistats_2
> > [6] https://wikimedia.org/api/rest_v1/
> > [7] https://phabricator.wikimedia.org/maniphest/task/edit/?title=Mediawiki%20Hi…
> > _______________________________________________
> > Wiki-research-l mailing list
> > Wiki-research-l(a)lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
> >
>
--
Joseph Allemandou (joal) (he / him)
Sr Data Engineer
Wikimedia Foundation
I want to echo what Nate said. We've been using this for more than a year
within the Wikimedia Foundation, and it has made analyses of editing
behavior much, much easier and faster, not to mention a lot less annoying.
This is the product of years of expert work by the Analytics team, and they
deserve plenty of congratulations for it 😊
On Mon, 10 Feb 2020 at 10:42, Nate E TeBlunthuis <nathante(a)uw.edu> wrote:
> Thank you so much Joal! I've been happily using this data for some time
> and I'm optimistic that it can make doing thorough analyses of Wikimedia
> projects much more accessible to the community, students, and researchers.
>
> -- Nate
> ------------------------------
> *From:* Wiki-research-l <wiki-research-l-bounces(a)lists.wikimedia.org> on
> behalf of Joseph Allemandou <jallemandou(a)wikimedia.org>
> *Sent:* Monday, February 10, 2020 8:27 AM
> *To:* A mailing list for the Analytics Team at WMF and everybody who has
> an interest in Wikipedia and analytics. <analytics(a)lists.wikimedia.org>;
> Research into Wikimedia content and communities <
> wiki-research-l(a)lists.wikimedia.org>; Product Analytics <
> product-analytics(a)wikimedia.org>
> *Subject:* [Wiki-research-l] Announcement - Mediawiki History Dumps
+Analytics who might be able to help with how reverts / Abuse Filter / etc.
figure into edit counts
In addition to the links from HaeB, I would also suggest reading the recent
report on content moderation on wikis, which on top of interviews has some
quantitative analyses and additional methods for understanding reverts:
https://meta.wikimedia.org/wiki/Research:Understanding_content_moderation_o…
Best,
Isaac
On Fri, Jan 31, 2020 at 8:25 PM Tilman Bayer <haebwiki(a)gmail.com> wrote:
> Concerning 1) and about analyzing reverts in general, see
> https://meta.wikimedia.org/wiki/Research:Revert .
>
> To explore 5), https://meta.wikimedia.org/wiki/AbuseFilter and
> https://tools.wmflabs.org/ptwikis/Filters:enwiki may be of interest.
>
> Regards, HaeB
>
> On Wed, Jan 29, 2020 at 12:01 PM Su-Laine Brodsky <sulainey(a)gmail.com>
> wrote:
>
> > Hi everyone,
> >
> > I’m looking for statistics about the edits that are reverted on the
> > English Wikipedia. This is for purposes of explaining to the public what
> > Wikipedia’s quality control processes are like. If hard numbers aren’t
> > available, I’m also interested in educated guesstimates.
> >
> > 1) An often-quoted statistic is that 7% of edits are reverted. Is this
> > still believed to be true?
> >
> > 2) According to
> > https://blog.wikimedia.org/2017/07/19/scoring-platform-team/, 2.5% of
> > edits are vandalism. There are other common reasons for reverting, and
> > I’m wondering if anyone has studied their frequency. Does anyone know what
> > percentage of all edits are reverted for being:
> > a) Spam (as perceived by the reverter)
> > b) Copyright violation
> > c) Violations of the Biographies of Living Persons policy
> >
> > 3) Do statistics on the number of edits per day on the English Wikipedia
> > (i.e. 164,000 edits per day) include edits that are blocked by the spam
> > blacklists or by edit filters?
> >
> > 4) How many edits per day on the English Wikipedia are prevented
> > (blocked) by the spam blacklists?
> >
> > 5) How many edits per day on the English Wikipedia are prevented by the
> > edit filters?
> >
> > 6) What percentage of all reverts are made by users of Huggle and Stiki?
> >
> > 7) What proportion of vandalism is quickly reverted? A 2007 study
> > (Priedhorsky et al.) found that 42% of vandalistic contributions are
> > repaired within one view and 70% within ten views. Have any newer
> > studies been done on this?
> >
> > Thanks in advance!
> >
> > Su-Laine
> > Vancouver, BC
> >
--
Isaac Johnson (he/him/his) -- Research Scientist -- Wikimedia Foundation