Thanks Jérémie,
we are definitely aiming for a more official announcement. The reason for the soft launch is that, after experimenting with the DataHub for a few months, we are still reporting issues to the developers that need to be addressed before a broader announcement: the CKAN data browser, for example, is quite rudimentary; there is limited support for batch file upload; data citation support is not keeping up with standards and best practices in the field; etc. If anyone on these lists is interested in crash-testing the repository, I'd be happy to follow up off-list.
Despite these issues, CKAN remains our engine of choice: it's open source, actively maintained by OKFN (an organization whose mission is aligned with Wikimedia's), and currently used by large organizations and governments to run institutional repositories (such as http://data.gov.uk).
The long-term vision is that of an actual "data/API hub" built on top of a bare repository, to facilitate the discovery and reuse of various data sources. I've copied below a note I posted some weeks ago to wikitech-l on this topic.
Dario
Begin forwarded message:
From: Dario Taraborelli dario@wikimedia.org Subject: Re: [Wikitech-l] Proposal to add an API/Developer/Developer Hub link to the footer of Wikimedia wikis Date: September 25, 2012 10:55:47 AM PDT
I am very excited to see this proposal and happy to help in my spare time, thanks for starting the thread. In fact, I started brainstorming a while ago with a number of colleagues and community members on what an ideal Wikimedia developer hub might look like.
My thoughts:
(1) the hub should be focused on documenting reuse of Wikimedia's data sources (the API, the XML dumps, the IRC streams), not just the MediaWiki codebase. We are already investing quite a lot of outreach effort in the MediaWiki developer community; this hub should be broader in scope and support the development of third-party apps/services built on these data sources. A consultation we ran last year indicates that a large number of developers/researchers interested in building services/mashups on top of Wikipedia don't have a clue about what data/APIs we make available besides the XML dumps, or where to find this data: this is the audience we should build the developer hub for.
(2) the hub should host simple recipes on how to use existing data sources for building applications and list existing libraries for data crunching/manipulation. My initial attempt at listing Wikimedia/Wikipedia apps, mashups and data wrangling libraries is this spreadsheet; contributions are welcome [1].
(3) on top of documenting data sources/APIs, we should showcase the best applications that use them and incentivize more developers to play with our data, like Flickr does with its App Garden. WMF designer Vibha Bamba created these two mockups [2] [3], loosely inspired by http://selection.datavisualization.ch, for a visual directory that we could initially host on Labs.
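To give a sense of the kind of "simple recipe" point (2) has in mind: even a title-suggestion lookup against any Wikimedia wiki's api.php is a few lines of standard-library Python. This is only an illustrative sketch, not an official hub recipe; the endpoint shown is English Wikipedia's, and the function names are my own.

```python
import json
import urllib.parse
import urllib.request

# Any Wikimedia wiki exposes the same API entry point at /w/api.php.
API = "https://en.wikipedia.org/w/api.php"

def opensearch_url(api, term, limit=5):
    """Build an 'opensearch' (title suggestion) request URL."""
    params = urllib.parse.urlencode({
        "action": "opensearch",
        "search": term,
        "limit": limit,
        "format": "json",
    })
    return f"{api}?{params}"

def suggest_titles(api, term, limit=5):
    """Return page-title suggestions for a search term.

    The opensearch action returns a 4-element array:
    [query, titles, descriptions, urls] -- we keep the titles.
    """
    with urllib.request.urlopen(opensearch_url(api, term, limit)) as resp:
        return json.load(resp)[1]

# Hypothetical usage (performs a live request):
# print(suggest_titles(API, "data visualization"))
```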
Dario
[1] https://docs.google.com/a/wikimedia.org/spreadsheet/ccc?key=0Ams-fyukCIlMdDV... [2] http://commons.wikimedia.org/wiki/File:Wikipedia_DataViz-01.png [3] http://commons.wikimedia.org/wiki/File:Wikipedia_DataViz-02.png
On Oct 22, 2012, at 7:00 PM, Jérémie Roquet arkanosis@gmail.com wrote:
cc-ed xmldatadumps-l
Hi,
2012/10/23 Dario Taraborelli dtaraborelli@wikimedia.org:
2012/10/23 James Forrester james@jdforrester.org:
On 22 October 2012 16:03, Hydriz Wikipedia admin@alphacorp.tk wrote:
I have long been wanting to say this, but is it possible for the team compiling such datasets to put future (and, if possible, current) datasets on dumps.wikimedia.org, so that it is easier for everyone to find things rather than having them scattered all over the place? Thanks for that!
Many one-off and regular datasets, from query results to data dumps and similar, are now indexed [0] on The Data Hub (formerly CKAN) run by the Open Knowledge Foundation for precisely this reason - so that data researchers can easily find data about Wikimedia, and see when it's updated.
The dumps server was never meant to become a permanent open data repository, but it started being used as an ad-hoc solution to host all sorts of datasets published by WMF on top of the actual XML dumps: that's the problem we're trying to fix.
Regardless of where the data is physically hosted, your go-to point to discover WMF datasets from now on is the DataHub. Think of it as a data registry: the registry is all you need to know in order to find where the data is hosted and to extract the appropriate metadata/documentation.
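To make the "registry" idea concrete: CKAN instances like the DataHub expose a JSON "action" API, so a script can discover datasets and their actual download locations without knowing in advance where the files are hosted. A minimal sketch using only the standard library; the datahub.io base URL and the "wikimedia" search term are assumptions, and the function names are mine, not part of any official client:

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL of the CKAN instance acting as the registry.
CKAN_BASE = "https://datahub.io"

def package_search_url(base_url, query, rows=5):
    """Build the URL for CKAN's package_search action."""
    params = urllib.parse.urlencode({"q": query, "rows": rows})
    return f"{base_url}/api/3/action/package_search?{params}"

def search_datasets(base_url, query, rows=5):
    """Return (dataset title, resource URLs) pairs matching the query.

    The registry's metadata points at the resources' real hosts,
    wherever the data physically lives.
    """
    url = package_search_url(base_url, query, rows)
    with urllib.request.urlopen(url) as resp:
        payload = json.load(resp)
    return [
        (pkg["title"], [res["url"] for res in pkg.get("resources", [])])
        for pkg in payload["result"]["results"]
    ]

# Hypothetical usage (performs a live request):
# for title, urls in search_datasets(CKAN_BASE, "wikimedia"):
#     print(title, urls)
```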
That's fine by me, but I think more communication about this would be welcome. I've added a link to meta:Data_dumps¹ and I'll communicate about this on the French Wikipedia, but a link on the dumps' page for other downloads² would be great.
Most people I've helped to find data on the Wikimedia projects now know about dumps.wikimedia.org, but AFAIK none of them is reading wiki-research-l.
Best regards,
¹ https://meta.wikimedia.org/wiki/Data_dumps ² http://dumps.wikimedia.org/other/
-- Jérémie
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l