If you can narrow down the request a bit, that makes it more likely we'll slip something into the backup script. :) [...snip...] Is that actually what people want?
Well, given the disk space limitation, straight off the bat we know that people can't have everything they want. In particular, I suspect the all-revisions version in SQL would be way too large (roughly an extra 40 gig).
That only leaves SQL versions of pages_current.xml.gz, and pages_public.xml.gz. If there is space for both then that would be ideal, but I understand that may be asking too much.
Personally I don't want the talk pages, but others might; so if there's only space for one of the two, then I think a SQL version of pages_current.xml.gz (i.e. current revisions of all pages) is the way to go, because it would be useful to the widest possible audience.
Why do that when MediaWiki comes with an import tool built-in? ;)
Because I'm not importing it into MediaWiki. I'm importing it into a database, to then run non-MediaWiki software analysing the data - in particular: looking for bad wiki syntax ( http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Wiki_Syntax ), suggesting new useful redirects and disambigs ( http://en.wikipedia.org/wiki/User:Nickj/Redirects ), and searching for good potential wiki-links ( http://en.wikipedia.org/wiki/User:LinkBot - [although for this one I haven't worked out how to get those suggestions out to page authors in a really satisfactory way]). I want to improve the Wikipedia, not mirror it.
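To give an idea of the kind of analysis I mean, here's a rough sketch of one bad-syntax check (the sample wikitext is made up) - counting opening versus closing link brackets, which is roughly the sort of thing the Wiki Syntax project hunts for:

```shell
# Quick-and-dirty sketch of a bad-syntax check (sample text is made up):
# count opening vs. closing link brackets in a page's wikitext.
text='A [[good link]] and a [[broken link'
opens=$(printf '%s' "$text" | grep -o '\[\[' | wc -l)
closes=$(printf '%s' "$text" | grep -o ']]' | wc -l)
if [ "$opens" -ne "$closes" ]; then
    echo "Unbalanced brackets: $opens '[[' vs $closes ']]'"
fi
```

With the dump loaded into a database, the same sort of check would just run over each row's text column.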
In which version?
Ideally something that works with MySQL 3.23.49, but if that's too old then something that works with MySQL 4.0.24 instead.
What about everyone who wants something slightly different?
But the database dumps have never tried to be all things to all people.
Rather they've been snapshots of the various Wikipedias at semi-regular intervals of time, which you can load into a database (specifically MySQL, but if you can get it to work in another RDBMS, then more power to you).
I'm not asking for something entirely new, rather I'm asking for an equivalent replacement for what we already had.
(Also the SQL dump output needs to actually be tested before we dedicate a few gigs to it.)
I'm happy to be a guinea pig. Just tell me where to get it from, and I'll leave it importing overnight and report back any errors/problems and whether it worked or not.
All the best, Nick.
On 17/09/05, Nick Jenkins nickpj@gmail.com wrote:
If you can narrow down the request a bit, that makes it more likely we'll slip something into the backup script. :)
Why do that when MediaWiki comes with an import tool built-in? ;)
Because I'm not importing it into MediaWiki. I'm importing it into a database, to then run non-MediaWiki software analysing the data
The point being, I take it, that your software mimics the MediaWiki <=1.4 database schema, because it was designed to work with the old SQL dumps? So what you're asking for, specifically, is an SQL dump which mimics how the data would have looked in an installation of MediaWiki <=1.4 - which, it should be noted, is *not* how it actually looks in any database under 1.5 or later.
Of course, with an XML dump you could always have analysis software that used whatever schema was appropriate for the job in hand (I'm pretty sure at least one set of tools did this *anyway*, I think it was the Topbanana reports) and have your own import scripts, but obviously that requires rather more effort on your part.
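For instance, here's a toy sketch of such a do-it-yourself import step (the XML fragment below is made up, and a real script would want a proper XML parser rather than grep/sed):

```shell
# Toy sketch of a do-it-yourself import step: pull page titles out of a
# dump fragment with plain text tools. The sample XML here is made up;
# a real script would parse the actual pages_current.xml.gz properly.
cat <<'EOF' > sample.xml
<mediawiki>
  <page><title>Main Page</title><text>Welcome</text></page>
  <page><title>Sandbox</title><text>Play here</text></page>
</mediawiki>
EOF
grep -o '<title>[^<]*</title>' sample.xml | sed 's/<[^>]*>//g'
```

From there you could load the titles - or whatever fields you care about - into any schema that suits the analysis.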
What about everyone who wants something slightly different?
But the database dumps have never tried to be all things to all people.
Rather they've been snapshots of the various Wikipedias at semi-regular intervals of time, which you can load into a database
Yes, but until now, they've been snapshots from the database as used by the software, which could be loaded into that same software. The closest equivalent to that would be an SQL dump for v1.5, but that would be pretty pointless given there's already an import script, and wouldn't help people like yourself whose databases just mimic what the old schema looked like.
There's never been such a major redesign before, so the question of whether the dump was all things to all people never really arose - it just carried on being there.
Nick Jenkins wrote:
(Also the SQL dump output needs to actually be tested before we dedicate a few gigs to it.)
I'm happy to be a guinea pig. Just tell me where to get it from, and I'll leave it importing overnight and report back any errors/problems and whether it worked or not.
I haven't had a chance to upload some test conversions yet, but I'd appreciate it if you'd give a quick test of the program: http://leuksman.com/misc/mwdumper-preview.zip
This .zip contains the compiled converter program in its present state, with a bundled executable for Linux/x86 and as CIL assemblies for any other OS.
The Linux binary does *not* require a Mono installation; I've tested it on Ubuntu Hoary and Fedora Core 1. (It does require glib2, which is probably installed by default on any modern Linux.) For other OSs you'll need a Mono or .NET runtime to run the CIL binaries.
Run something like this to generate 1.4-format SQL:

  gzip -dc pages_public.xml.gz | ./mwdumper --format=sql:1.4 > out.sql
More info on the command-line options is in the README. If you're on Linux and the binary does or doesn't work, please let me know. (If it fails, please include output of 'ldd mwdumper'.)
-- brion vibber (brion @ pobox.com)
wikitech-l@lists.wikimedia.org