Thanks Jérémie,
we are definitely aiming for a more official announcement. The reason for the soft launch
is that, after experimenting with the DataHub for a few months, we are still reporting
issues to the developers that need to be addressed before a broader announcement. The CKAN
data browser, for example, is quite rudimentary; there is limited support for batch file
upload; data citation support is not keeping up with standards and best practices in the
field; etc. If anyone on these lists is interested in crash-testing the repository, I'd be
happy to follow up off-list.
Despite these issues, CKAN remains our engine of choice: it's open source, actively
maintained by OKFN (an organization whose mission is aligned with Wikimedia's), and
currently used by large organizations and governments to run institutional repositories
(like http://data.gov.uk).
The long-term vision is that of an actual "data/API hub" built on top of a naked
repository, to facilitate the discovery/reuse of various data sources. I copy below a note
I posted some weeks ago to wikitech-l on this topic.
Dario
Begin forwarded message:
From: Dario Taraborelli <dario(a)wikimedia.org>
Subject: Re: [Wikitech-l] Proposal to add an API/Developer/Developer Hub link to the
footer of Wikimedia wikis
Date: September 25, 2012 10:55:47 AM PDT
I am very excited to see this proposal and happy to help in my spare time; thanks for
starting the thread. In fact, I started brainstorming a while ago with a number of
colleagues and community members on what an ideal Wikimedia developer hub might look like.
My thoughts:
(1) the hub should be focused on documenting reuse of Wikimedia's data sources (the
API, the XML dumps, the IRC streams), not just the MediaWiki codebase. We are investing
quite a lot of outreach effort in the MediaWiki developer community, this hub should be
broader in scope and support the development of third-party apps/services building on
these data sources. A consultation we ran last year indicates that a large number of
developers and researchers interested in building services and mashups on top of Wikipedia
don't have a clue about what data/APIs we make available besides the XML dumps, or where
to find this data: this is the audience we should build the developer hub for.
(2) the hub should host simple recipes on how to use existing data sources for building
applications and list existing libraries for data crunching/manipulation. My initial
attempt at listing Wikimedia/Wikipedia apps, mashups, and data-wrangling libraries is this
spreadsheet; contributions are welcome [1]
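A recipe of the kind point (2) describes could be as small as the sketch below, which
builds a request URL against the standard MediaWiki Action API to fetch the latest
wikitext of a page. The helper name is made up for illustration; the endpoint and
parameters are the core API's:

```python
from urllib.parse import urlencode

def build_revision_query(title, endpoint="https://en.wikipedia.org/w/api.php"):
    """Build a MediaWiki Action API URL that fetches the latest
    revision (wikitext) of the given page as JSON."""
    params = {
        "action": "query",       # core query module
        "prop": "revisions",     # ask for revision data
        "rvprop": "content",     # include the revision text itself
        "titles": title,
        "format": "json",
    }
    return endpoint + "?" + urlencode(params)

url = build_revision_query("Open_Knowledge_Foundation")
print(url)
```

Fetching that URL (e.g. with `urllib.request.urlopen`) returns the page content as JSON;
a hub full of snippets like this would go a long way for newcomers.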
(3) on top of documenting data sources/APIs, we should showcase the best applications that
use them and incentivize more developers to play with our data, as Flickr does with its
app garden. WMF designer Vibha Bamba created these two mockups [2] [3], loosely inspired
by http://selection.datavisualization.ch, for a visual directory that we could initially
host on Labs.
Dario
[1]
https://docs.google.com/a/wikimedia.org/spreadsheet/ccc?key=0Ams-fyukCIlMdD…
[2]
http://commons.wikimedia.org/wiki/File:Wikipedia_DataViz-01.png
[3]
http://commons.wikimedia.org/wiki/File:Wikipedia_DataViz-02.png
On Oct 22, 2012, at 7:00 PM, Jérémie Roquet <arkanosis(a)gmail.com> wrote:
cc-ed xmldatadumps-l
Hi,
2012/10/23 Dario Taraborelli <dtaraborelli(a)wikimedia.org>:
2012/10/23 James Forrester <james(a)jdforrester.org>:
On 22 October 2012 16:03, Hydriz Wikipedia <admin(a)alphacorp.tk> wrote:
I have long been wanting to say this, but is it possible for the team behind compiling
such datasets to put future (and, if possible, current) datasets into
dumps.wikimedia.org, so that it is easier for everyone to find stuff and it isn't
scattered all over the place? Thanks for that!
Many one-off and regular datasets, from query results to data dumps and similar, are now
indexed[0] on The Data Hub (formerly CKAN), run by the Open Knowledge Foundation, for
precisely this reason: so that data researchers can easily find data about Wikimedia and
see when it's updated.
[0] -
http://thedatahub.org/en/group/wikimedia
The dumps server was never meant to become a permanent open data repository, but it
started being used as an ad-hoc solution to host all sorts of datasets published by WMF on
top of the actual XML dumps: that's the problem we're trying to fix.
Regardless of where the data is physically hosted, your go-to point for discovering WMF
datasets from now on is the DataHub. Think of it as a data registry: the registry is all
you need in order to find where the data is hosted and to extract the appropriate
metadata/documentation.
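To make the registry metaphor concrete: a DataHub entry is a CKAN package record whose
JSON lists the actual download locations. The sketch below parses such a record and pulls
out the resource URLs. The JSON shape follows CKAN's Action API responses, but the sample
record itself (dataset name, URLs) is invented for illustration:

```python
import json

# A trimmed, invented example of the JSON a CKAN instance such as
# thedatahub.org returns for a single dataset ("package") record.
SAMPLE_RESPONSE = json.dumps({
    "success": True,
    "result": {
        "name": "wikipedia-xml-dumps",  # hypothetical dataset id
        "notes": "Complete XML dumps of Wikipedia.",
        "resources": [
            {"format": "XML", "url": "http://dumps.wikimedia.org/enwiki/"},
            {"format": "CSV", "url": "http://example.org/pageviews.csv"},
        ],
    },
})

def resource_urls(response_text, fmt=None):
    """Return the download URLs listed in a CKAN package record,
    optionally filtered by file format."""
    record = json.loads(response_text)["result"]
    return [r["url"] for r in record["resources"]
            if fmt is None or r["format"].lower() == fmt.lower()]

print(resource_urls(SAMPLE_RESPONSE))
print(resource_urls(SAMPLE_RESPONSE, fmt="xml"))
```

The point is that a consumer never needs to know where the files live in advance: the
registry record carries the hosting locations and the descriptive metadata.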
That's fine for me, but I think more communication about this would be welcome. I've
added a link to meta:Data_dumps¹ and I'll communicate about this on the French Wikipedia,
but a link on the dumps page for other downloads² would be great.
Most people I've helped to find data on the Wikimedia projects now know about
dumps.wikimedia.org, but AFAIK none of them reads wiki-research-l.
Best regards,
¹
https://meta.wikimedia.org/wiki/Data_dumps
²
http://dumps.wikimedia.org/other/
--
Jérémie
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l