We've released a full, anonymized dump of article ratings (aka AFTv4) collected over 1 year since the deployment of the tool on the entire English Wikipedia (July 22, 2011 - July 22, 2012).
http://thedatahub.org/en/dataset/wikipedia-article-ratings
The dataset (which includes 11m unique article ratings along 4 dimensions) is licensed under CC0 and supersedes the partial dumps originally hosted on the dumps server. Real-time AFTv4 data remains available as usual via the toolserver. Feel free to get in touch if you have any questions about this data.
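For anyone planning to dig into the dump, here is a minimal sketch of aggregating ratings per dimension. The column layout and dimension names below are illustrative assumptions, not the actual schema — check the dataset documentation on The Data Hub for the real file format:

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows in a shape the dump might use (tab-separated):
# page_id, rating dimension, rating value (1-5). The real column layout is
# documented with the dataset; these names are assumptions for illustration.
SAMPLE = "12\ttrustworthy\t4\n12\tcomplete\t3\n34\ttrustworthy\t5\n34\twell-written\t2\n"

def mean_rating_per_dimension(tsv_text):
    """Average the 1-5 ratings for each rating dimension."""
    totals = defaultdict(lambda: [0, 0])  # dimension -> [sum, count]
    for page_id, dimension, value in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        totals[dimension][0] += int(value)
        totals[dimension][1] += 1
    return {dim: s / n for dim, (s, n) in totals.items()}

means = mean_rating_per_dimension(SAMPLE)
```

The same pattern streams over the full dump file without loading 11m rows into memory.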
Dario
Hi all,
I have long been wanting to ask this: would it be possible for the team compiling such datasets to put future (and, if possible, current) datasets on dumps.wikimedia.org, so that everything is easier to find and not scattered all over the place? Thanks!
On Tue, Oct 23, 2012 at 4:51 AM, Dario Taraborelli < dtaraborelli@wikimedia.org> wrote:
We've released a full, anonymized dump of article ratings (aka AFTv4) collected over 1 year since the deployment of the tool on the entire English Wikipedia (July 22, 2011 - July 22, 2012).
http://thedatahub.org/en/dataset/wikipedia-article-ratings
The dataset (which includes 11m unique article ratings along 4 dimensions) is licensed under CC0 and supersedes the partial dumps originally hosted on the dumps server. Real-time AFTv4 data remains available as usual via the toolserver. Feel free to get in touch if you have any questions about this data.
Dario

_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Many one-off and regular datasets, from query results to data dumps and similar, are now indexed [0] on The Data Hub (formerly CKAN), run by the Open Knowledge Foundation, for precisely this reason: so that data researchers can easily find data about Wikimedia and see when it is updated.
[0] - http://thedatahub.org/en/group/wikimedia
J.
<what James said>
The dumps server was never meant to become a permanent open data repository, but it started being used as an ad-hoc solution for hosting all sorts of datasets published by WMF on top of the actual XML dumps: that's the problem we're trying to fix.
Regardless of where the data is physically hosted, your go-to point to discover WMF datasets from now on is the DataHub. Think of it as a data registry: the registry is all you need to know in order to find where the data is hosted and to extract the appropriate metadata/documentation.
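To illustrate the "registry" idea: the DataHub runs CKAN, which exposes a machine-readable JSON API from which a dataset's hosting location and metadata can be retrieved. The sketch below assumes the action-API path and response shape of a standard CKAN install; the version deployed on thedatahub.org may differ, so treat the endpoint and field names as assumptions:

```python
import json
from urllib.parse import urlencode

# Endpoint shape assumed from standard CKAN deployments.
BASE = "http://thedatahub.org/api/3/action"

def package_show_url(dataset_id):
    """Build the URL that returns a dataset's registry entry
    (hosting location, resources, license)."""
    return BASE + "/package_show?" + urlencode({"id": dataset_id})

def resource_urls(api_response_text):
    """Extract the download URLs of a dataset's resources from a
    CKAN package_show JSON response."""
    payload = json.loads(api_response_text)
    return [res["url"] for res in payload["result"]["resources"]]

url = package_show_url("wikipedia-article-ratings")

# A trimmed, made-up example of the JSON a CKAN server returns:
sample = '{"success": true, "result": {"resources": [{"url": "http://example.org/aft4.tsv.gz"}]}}'
```

Scripts can thus resolve "where does this dataset live right now?" at run time through the registry rather than hard-coding a download location.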
HTH
Dario
cc-ed xmldatadumps-l
Hi,
That works for me, but I think more communication about this would be welcome. I've added a link to meta:Data_dumps¹ and I'll spread the word on the French Wikipedia, but a link on the dumps server's page for other downloads² would be great.
Most people I've helped to find data on the Wikimedia projects now know about dumps.wikimedia.org, but as far as I know none of them reads wiki-research-l.
Best regards,

¹ https://meta.wikimedia.org/wiki/Data_dumps
² http://dumps.wikimedia.org/other/

-- Jérémie
Thanks Jérémie,
we are definitely aiming for a more official announcement. The reason for the soft launch is that, after experimenting with the DataHub for a few months, we are still reporting to the developers issues that need to be addressed before a broader announcement: the CKAN data browser, for example, is quite rudimentary; there is limited support for batch file uploads; and data citation support is not keeping up with standards and best practices in the field. If anyone on these lists is interested in crash-testing the repository, I'd be happy to follow up off-list.
Despite these issues, CKAN remains our engine of choice: it's open source, actively maintained by OKFN (an organization whose mission is aligned with Wikimedia's), and currently used by large organizations and governments to run institutional repositories (such as http://data.gov.uk).
The long-term vision is that of an actual "data/API hub" built on top of a naked repository, to facilitate the discovery/reuse of various data sources. I copy below a note I posted some weeks ago to wikitech-l on this topic.
Dario
Begin forwarded message:
From: Dario Taraborelli dario@wikimedia.org
Subject: Re: [Wikitech-l] Proposal to add an API/Developer/Developer Hub link to the footer of Wikimedia wikis
Date: September 25, 2012 10:55:47 AM PDT
I am very excited to see this proposal and happy to help in my spare time; thanks for starting the thread. In fact, I started brainstorming a while ago with a number of colleagues and community members about what an ideal Wikimedia developer hub might look like.
My thoughts:
(1) the hub should be focused on documenting the reuse of Wikimedia's data sources (the API, the XML dumps, the IRC streams), not just the MediaWiki codebase. We are already investing a lot of outreach effort in the MediaWiki developer community; this hub should be broader in scope and support the development of third-party apps/services building on these data sources. A consultation we ran last year indicates that a large number of developers/researchers interested in building services/mashups on top of Wikipedia don't have a clue about what data/APIs we make available besides the XML dumps, or where to find this data: this is the audience we should build the developer hub for.
(2) the hub should host simple recipes on how to use existing data sources for building applications and list existing libraries for data crunching/manipulation. My initial attempt at listing Wikimedia/Wikipedia apps, mashups and data wrangling libraries is this spreadsheet, contributions are welcome [1]
(3) on top of documenting data sources/APIs, we should showcase the best applications that use them and incentivize more developers to play with our data, as Flickr does with its app garden. WMF designer Vibha Bamba created these two mockups [2] [3], loosely inspired by http://selection.datavisualization.ch, for a visual directory that we could initially host on Labs.
Dario
[1] https://docs.google.com/a/wikimedia.org/spreadsheet/ccc?key=0Ams-fyukCIlMdDV...
[2] http://commons.wikimedia.org/wiki/File:Wikipedia_DataViz-01.png
[3] http://commons.wikimedia.org/wiki/File:Wikipedia_DataViz-02.png
Hi Dario,

Thank you. That's indeed a very interesting data set.

Is anyone aware of any study or analysis of this or similar data on "article ratings"? Even a raw data analysis would be very helpful for setting up a systematic study. Unfortunately, I'm not up to date on the state of the art.
cheers, .Taha
Taha,
other than the internal reports during the product dev phase [1] and some occasional uses of this data in the literature, there hasn't been much work on AFT ratings. To my knowledge, the best use of this data outside of WMF is in Adam Hyland's work (he presented a study at Wikimania [2] and I think he's working on a follow-up paper).
Dario
[1] http://www.mediawiki.org/wiki/Article_feedback/Research
[2] http://en.wikipedia.org/wiki/User:Protonk/Article_Feedback
I forgot to mention Ashton Anderson's dataviz work based on AFTv4 data
https://graphics.stanford.edu/wikis/cs448b-11-fall/FP-AndersonAshton
…and on a final note, this is awesome work in progress that attempts to classify Wikipedia articles based on a broad range of quality metrics (including AFT ratings).
https://github.com/slaporte/qualityvis
Thanks Dario. I should also add your own CSCW'13 paper. Right?
No, that's based on the textual feedback data from a small random sample of articles [1] from the AFTv5 tests, not on the current ratings (AFTv4).
[1] http://meta.wikimedia.org/wiki/Research:AFT