Domas Mituzas wrote:
Any tips? :-)
My thoughts were that the schema used by the GlobalUsage
extension might be reusable here (storing wiki, page namespace ID, page
namespace name, and page title).
I don't know what GlobalUsage does, but probably it is all wrong ;-)
GlobalUsage tracks file uses across a wiki family. Its schema is available
here:
<http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/GlobalUsage/GlobalUsage.sql?view=log>.
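For the archives, here's a minimal sketch (Python plus SQLite) of what
reusing that shape for pageview data might look like. The table and column
names below are illustrative only, not copied from GlobalUsage.sql:

    import sqlite3

    # Illustrative pageview table borrowing the GlobalUsage idea of keying
    # rows by (wiki, namespace ID, namespace name, title). Every name here
    # is made up for this sketch; it is not the actual GlobalUsage schema.
    conn = sqlite3.connect("pageviews.db")
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS pageview (
        pv_wiki           TEXT NOT NULL,    -- e.g. 'en'
        pv_namespace_id   INTEGER NOT NULL,
        pv_namespace_name TEXT NOT NULL,    -- localized namespace text
        pv_title          TEXT NOT NULL,    -- title, no namespace prefix
        pv_views          INTEGER NOT NULL,
        pv_date           TEXT NOT NULL     -- 'YYYY-MM-DD'
    );
    CREATE INDEX IF NOT EXISTS pv_page
        ON pageview (pv_wiki, pv_namespace_id, pv_title, pv_date);
    """)
    conn.commit()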
But the biggest improvement would be post-processing (cleaning up) the
source files. Right now, if there are anomalies in the data, every re-user
is expected to find and fix these on their own. It's _incredibly_
inefficient for everyone to adjust the data (for encoding strangeness, for
bad clients, for data manipulation, possibly for page existence, etc.)
rather than having the source files come out cleaner.
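To make that concrete, here's a rough sketch of the per-line scrubbing
every re-user currently reinvents, assuming the familiar one-record-per-line
"project title count bytes" format of the raw files; the rejection rules
are illustrative, not a complete specification:

    import urllib.parse

    def clean_line(line):
        """Return (project, title, count) or None if the record is junk.
        Assumes the raw 'project title count bytes' line format; the
        rules below are illustrative, not exhaustive."""
        parts = line.rstrip("\n").split(" ")
        if len(parts) != 4:
            return None                  # malformed record
        project, raw_title, count, _bytes = parts
        if not count.isdigit():
            return None                  # corrupted or bogus counter
        try:
            # Titles arrive percent-encoded; bad clients send broken
            # encodings, which errors='strict' turns into an exception.
            title = urllib.parse.unquote(raw_title, errors="strict")
        except UnicodeDecodeError:
            return None
        title = title.replace("_", " ").strip()
        if not title:
            return None                  # empty or whitespace-only title
        return project, title, int(count)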
Raw data is fascinating in that regard though - one can see which clients
are bad, what the anomalies are, how titles get encoded, which titles are
erroneous, etc.
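(A concrete way to get that view - a small variation on the cleaning
sketch above that classifies each raw line instead of discarding it:)

    from collections import Counter

    def survey(lines):
        """Tally why raw lines would be rejected; the categories are
        illustrative and reuse clean_line() from the sketch above."""
        reasons = Counter()
        for line in lines:
            parts = line.rstrip("\n").split(" ")
            if len(parts) != 4:
                reasons["malformed"] += 1
            elif not parts[2].isdigit():
                reasons["bad counter"] += 1
            elif clean_line(line) is None:
                reasons["bad title encoding"] += 1
            else:
                reasons["ok"] += 1
        return reasons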
There are zillions of ways to do post-processing, and none of them will
match the needs of every user.
Yes, so providing raw data alongside cleaner data or alongside SQL table
dumps (similar to the current dumps for MediaWiki tables) might make more
sense here.
I'd love to see all the data preserved indefinitely. It is one of the most
interesting datasets around, and its value for the future is quite
incredible.
Nemo has done some work to put the files on the Internet Archive, I think.
The reality is that a large pile of data that's not easily queryable is
directly equivalent to no data at all, for most users. Echoing what I said
earlier, it doesn't make much sense for people to be continually forced to
reinvent the wheel (post-processing raw data and putting it into a
queryable format).
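Even something as small as loading the cleaned records into the SQLite
table sketched above would count as "queryable" for many users. A rough
sketch, assuming clean_line() and the pageview table from earlier:

    def load_day(conn, lines, date):
        """Bulk-insert one day of cleaned records into the illustrative
        pageview table; namespace splitting is omitted here, so every
        title lands in namespace 0."""
        rows = (
            (project, 0, "", title, count, date)
            for project, title, count in
            filter(None, (clean_line(l) for l in lines))
        )
        conn.executemany(
            "INSERT INTO pageview VALUES (?, ?, ?, ?, ?, ?)", rows)
        conn.commit()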
I agree. By opening up the dataset, I expected others to build upon it and
create services.
Apparently that hasn't happened. Since lots of people use the data, I guess
there is a need for it, but not enough will to build anything for others to
use, so it will end up being created within WMF proper.
Building a service where data would be shown on every article is quite a
different task from just supporting analytical workloads.
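(The difference shows up even at the query level - against the toy table
sketched above, the per-article service is an indexed point lookup, while
the analytical workload scans and aggregates:)

    # Per-article service: one indexed point lookup per request.
    def daily_views(conn, wiki, title):
        return conn.execute(
            "SELECT pv_date, pv_views FROM pageview "
            "WHERE pv_wiki = ? AND pv_title = ? ORDER BY pv_date",
            (wiki, title)).fetchall()

    # Analytical workload: aggregation over many rows at once.
    def monthly_top(conn, wiki, month, limit=100):
        return conn.execute(
            "SELECT pv_title, SUM(pv_views) AS views FROM pageview "
            "WHERE pv_wiki = ? AND pv_date LIKE ? "
            "GROUP BY pv_title ORDER BY views DESC LIMIT ?",
            (wiki, month + "-%", limit)).fetchall()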
For now, building a queryable service has been on my todo list, but there
were too many initiatives around suggesting that someone else would do it
;-)
Yes, beyond Henrik's site, there really isn't much. It would probably help
if Wikimedia stopped engaging in so much cookie-licking. That was part of
the purpose of this thread: to clarify what Wikimedia is actually planning
to invest in this endeavor.
Thank you for the detailed replies, Domas. :-)
MZMcBride