Domas Mituzas wrote:
>> Any tips? :-) My thoughts were that the schema used by the GlobalUsage extension might be reusable here (storing wiki, page namespace ID, page namespace name, and page title).
> I don't know what GlobalUsage does, but probably it is all wrong ;-)
GlobalUsage tracks file uses across a wiki family. Its schema is available here: http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/GlobalUsage/GlobalUsage.sql?view=log.
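For concreteness, here is a rough sketch of what a GlobalUsage-style table for per-page view counts could look like. The table and column names are made up for illustration (they are not taken from GlobalUsage.sql), and SQLite is used only to keep the example self-contained:

    import sqlite3

    # Hypothetical schema, loosely following the GlobalUsage idea of keying
    # rows by wiki, namespace ID, namespace name, and title; names invented here.
    DDL = """
    CREATE TABLE IF NOT EXISTS pageview_counts (
        pvc_wiki           TEXT    NOT NULL,  -- project/wiki identifier, e.g. 'en'
        pvc_namespace_id   INTEGER NOT NULL,  -- numeric namespace, 0 for articles
        pvc_namespace_name TEXT    NOT NULL,  -- namespace text, '' for the main namespace
        pvc_title          TEXT    NOT NULL,  -- page title without the namespace prefix
        pvc_views          INTEGER NOT NULL,  -- view count for the period
        pvc_period         TEXT    NOT NULL,  -- e.g. '2013-06-01T00' for an hourly bucket
        PRIMARY KEY (pvc_wiki, pvc_namespace_id, pvc_title, pvc_period)
    )
    """

    conn = sqlite3.connect("pageviews.db")
    conn.execute(DDL)
    conn.commit()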
>> But the biggest improvement would be post-processing (cleaning up) the source files. Right now if there are anomalies in the data, every re-user is expected to find and fix these on their own. It's _incredibly_ inefficient for everyone to adjust the data (for encoding strangeness, for bad clients, for data manipulation, for page existence possibly, etc.) rather than having the source files come out cleaner.
> Raw data is fascinating in that regard though - one can see which clients are bad, what the anomalies are, how they encode titles, which titles are erroneous, etc. There are zillions of ways to do post-processing, and none of them will match the needs of every user.
Yes, so providing raw data alongside cleaner data or alongside SQL table dumps (similar to the current dumps for MediaWiki tables) might make more sense here.
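As a trivial example of the sort of cleanup involved, here is a sketch in Python, assuming the hourly files use the usual four space-separated fields per line (project, percent-encoded title, view count, bytes); the function and file names are only placeholders, and real cleanup (bad clients, redirects, page existence) needs far more than this:

    import gzip
    import urllib.parse

    def clean_line(line):
        """Turn one raw line ('project title count bytes') into
        (project, title, views), or return None if the line looks broken."""
        parts = line.rstrip("\n").split(" ")
        if len(parts) != 4:
            return None                    # malformed line
        project, raw_title, count, _size = parts
        if not count.isdigit():
            return None                    # non-numeric view count
        try:
            # Titles are percent-encoded; reject sequences that are not valid UTF-8.
            title = urllib.parse.unquote(raw_title, errors="strict")
        except UnicodeDecodeError:
            return None                    # encoding strangeness
        return project, title.replace(" ", "_"), int(count)

    # Example use (hypothetical file name):
    with gzip.open("pagecounts-20130601-000000.gz", "rt", encoding="utf-8",
                   errors="replace") as f:
        rows = [r for r in (clean_line(l) for l in f) if r]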
> I'd love to see all the data preserved indefinitely. It is one of the most interesting datasets around, and its value for the future is quite incredible.
Nemo has done some work to put the files on the Internet Archive, I think.
>> The reality is that a large pile of data that's not easily queryable is directly equivalent to no data at all, for most users. Echoing what I said earlier, it doesn't make much sense for people to be continually forced to reinvent the wheel (post-processing raw data and putting it into a queryable format).
> I agree. By opening up the dataset I expected others to build upon it and create services. Apparently that hasn't happened. Since lots of people use the data, I guess there is a need for it, but not enough will to build anything for others to use, so it will end up being built within the WMF proper.
> Building a service where data would be shown on every article is a rather different task from just supporting analytical workloads. Building a queryable service has been on my todo list for a while, but there were too many initiatives around suggesting that someone else would do it ;-)
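To make "queryable" concrete, here is a toy continuation of the hypothetical sketches above: load cleaned (project, title, views) rows into the earlier table and ask it a question. This is only an illustration, not a proposal for how a real service would work:

    import sqlite3

    conn = sqlite3.connect("pageviews.db")   # the database from the earlier sketch

    def load(conn, period, rows):
        """Insert cleaned (project, title, views) rows for one hourly period.
        Namespace splitting is skipped, so everything lands in namespace 0."""
        conn.executemany(
            "INSERT OR REPLACE INTO pageview_counts VALUES (?, 0, '', ?, ?, ?)",
            [(project, title, views, period) for project, title, views in rows],
        )
        conn.commit()

    # Example query: the ten most-viewed pages on one project for one hour.
    top = conn.execute(
        """SELECT pvc_title, pvc_views
           FROM pageview_counts
           WHERE pvc_wiki = ? AND pvc_period = ?
           ORDER BY pvc_views DESC
           LIMIT 10""",
        ("en", "2013-06-01T00"),
    ).fetchall()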
Yes, beyond Henrik's site, there really isn't much. It would probably help if Wikimedia stopped engaging in so much cookie-licking. That was part of the purpose of this thread: to clarify what Wikimedia is actually planning to invest in this endeavor.
Thank you for the detailed replies, Domas. :-)
MZMcBride