I want to find out the length of a bunch of articles. I have done this before for the Swedish Wikipedia by importing the page.sql dump into a local MySQL instance, which works just fine.
But now that I try it for the English Wikipedia, the database import (of 10 million rows, averaging 94 bytes each) appears to take somewhere between 24 and 48 hours (with keys disabled, I'm importing some 4500 rows per minute). This seems a bit unnecessary for just finding out the length of some 1000 articles, especially if I want to do it again when the next dump becomes available. Is there some API on the toolserver that I can use instead? Or should I consider retrieving action=raw from the live server and just counting the bytes? Where do I start?
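Something like this rough, untested sketch is what I have in mind for the action=raw route (it assumes a file of titles, one per line, and it makes one request per article, so it's only reasonable for a small batch):

    #!/usr/bin/perl
    # Untested sketch: fetch the raw wikitext of each title from the live
    # server and count the bytes.  One HTTP request per article, so keep
    # the batch small and pause between requests.
    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use URI::Escape qw(uri_escape);

    while (my $title = <>) {
        chomp $title;
        my $url = 'http://en.wikipedia.org/w/index.php?action=raw&title='
                . uri_escape($title);
        my $text = get($url);
        printf "%s\t%d\n", $title, defined $text ? length($text) : -1;
        sleep 1;    # be nice to the servers
    }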
I could even write a Perl script that parses the INSERT statements in page.sql and extracts the information I need, all in one pass. But that is not really what a MySQL dump is meant for.
Currently, there is a way to get many raw text articles at once through the [[mw:API]]. The revision's size is available in the database, and I plan to expose it shortly through that same API. You can already get the size of a revision if it was recently changed (see the list=recentchanges or list=watchlist queries).
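For example, a request along these lines should fetch the wikitext of several pages in one go (parameter names from memory, so check the auto-generated help at api.php):

    http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&titles=Foo|Bar|Baz&format=xml

Counting the bytes of what comes back will do as a stand-in for the length until the size field itself is exposed.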
Simetrical wrote:
You could ask someone with toolserver access to run the query for you.
My import crashed after 30 hours with a "duplicate key" error. I wrote a Perl script instead, and it parses the entire page.sql in 3 minutes. It's a little tricky with the backslash-escaped quotes, but it works almost every time. Still, this is a return to the 1980s. Isn't it possible to distribute MySQL tables as binary files? Or does PostgreSQL have that feature?
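The core of it looks roughly like this (a simplified sketch rather than the exact script; the column positions are an assumption read off the CREATE TABLE statement at the top of the dump, with page_namespace and page_title near the front and page_len last, so verify them against your copy):

    #!/usr/bin/perl
    # Sketch: pull page_namespace, page_title and page_len straight out of
    # page.sql without importing it into MySQL.
    use strict;
    use warnings;

    while (my $line = <>) {
        next unless $line =~ /^INSERT INTO/;

        # Each row is one parenthesised tuple; quoted strings may contain
        # backslash-escaped characters such as \' and \\ .
        while ($line =~ /\(((?:[^)']|'(?:[^'\\]|\\.)*')*)\)/g) {
            my $tuple = $1;

            # Split the tuple into fields: either a quoted string or a
            # bare numeric value.
            my @fields = $tuple =~ /('(?:[^'\\]|\\.)*'|[^,]+)/g;

            my ($ns, $title, $len) = ($fields[1], $fields[2], $fields[-1]);
            $title =~ s/^'|'$//g;      # strip the surrounding quotes
            $title =~ s/\\(.)/$1/g;    # undo the backslash escaping
            print "$ns\t$title\t$len\n";
        }
    }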
On 8/8/07, Lars Aronsson <lars@aronsson.se> wrote:
Isn't it possible to distribute MySQL tables as binary files?
Probably not, for InnoDB, or at any rate it would be a serious pain in the neck. "Table space" is shared among all tables and all databases on the server, in the same files. You can have per-table table spaces, but not completely: necessary metadata is still stored in the main table space file. Table space kind of sucks in other ways, too, like it generally doesn't shrink even if you drop tables/databases (without doing some extra trickery).
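For what it's worth, the per-table option is innodb_file_per_table in my.cnf; as said, the shared ibdata file still keeps the data dictionary and doesn't shrink on its own:

    [mysqld]
    # one .ibd file per InnoDB table; the shared ibdata file still holds
    # the data dictionary, so this only partly avoids the shared-tablespace mess
    innodb_file_per_table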
And of course the binary files are worthless to you if you're importing into MyISAM, PostgreSQL, Oracle (experimental though support may be), a different version of InnoDB, or any other database you might want to write support for, or for that matter a significantly different version of MediaWiki. That's why XML files are preferred, and failing that SQL files, which will at least work for any supported version of MySQL and any storage engine in it.
So with all that said, it shouldn't be surprising that grabbing a bit of data directly from the files would be a lot faster than actually importing them and then grabbing the data. Not that this is much of an API question anymore.
Simetrical wrote:
And of course the binary files are worthless to you if you're importing into MyISAM, PostgreSQL, Oracle (experimental though support may be), a different version of InnoDB, or any other database you might want to write support for, or for that matter a significantly different version of MediaWiki. That's why XML files are preferred,
Both the current SQL and XML dumps are similar to unpacking a tar (or zip) archive into whatever filesystem your own computer happens to use. What I wonder is whether there is no equivalent to downloading an ISO 9660 CDROM image and mounting that filesystem (perhaps read-only) as it is. Modern Unix dialects can "mount" disk images from files without burning an actual disk. Mounting an existing filesystem is instantaneous, and as you "cd" and "ls" down into its file tree, more and more of its inodes are buffered in the RAM managed by the running kernel.

What I would find useful is "mounting" (rather than "importing") a frozen copy of a database, then being able to "use" this new database, "show tables", "describe page", and "select count(*) from page". Apparently MySQL doesn't support this. Does PostgreSQL or Oracle or any other RDBMS?
One way around this could perhaps be to import the full XML dump into a local MySQL, then shut down MySQL and put the frozen files from /var/lib/mysql/ onto a Ubuntu "live" CDROM or DVD. Just boot up Ubuntu Linux from this disk, and you can immediately search through the full Wikipedia database (as long as it fits on the disk...). I have not tried it, and don't know if it's possible to run MySQL from a live disk with pre-loaded tables. Someone with extra time on their hands can find a new hobby here. If one person does this, everybody else can download and burn the CDROM image and get started much faster than waiting for a 30-hour database import.
Not that this is much of an API question anymore.
Correct. Sorry for going off topic.
On 8/8/07, Lars Aronsson <lars@aronsson.se> wrote:
One way around this could perhaps be to import the full XML dump into a local MySQL, then shut down MySQL and put the frozen files from /var/lib/mysql/ onto a Ubuntu "live" CDROM or DVD. Just boot up Ubuntu Linux from this disk, and you can immediately search through the full Wikipedia database (as long as it fits on the disk...). I have not tried it, and don't know if it's possible to run MySQL from a live disk with pre-loaded tables. Someone with extra time on their hands can find a new hobby here. If one person does this, everybody else can download and burn the CDROM image and get started much faster than waiting for a 30-hour database import.
An interesting idea, but not very useful for people who intend to actually reuse the data, who would then have to dump *and* import it. It would possibly be handy for people who just want to do a statistical analysis or something, but not if you're trying to mirror.
And no, the full Wikipedia database is way, way too large to fit on a DVD. Even the pages-articles one (articles/templates/image descriptions/primary meta-pages) is quite a lot too large, it looks like, at 2.7 GB bz2'd (and that only grows once uncompressed and imported). And that's leaving out images.
On Wed, 2007-08-08 at 19:18 -0700, Simetrical wrote:
Even the pages-articles one (articles/templates/image descriptions/primary meta-pages) is quite a lot too large, it looks like, at 2.7 GB bz2'd (and that only grows once uncompressed and imported). And that's leaving out images.
Speaking of images... I have the circa 2005 image.tar here for the English Wikipedia (about 70gb of images).
I've heard that if it were possible to tar up the latest images, it would be > 300gb. Is this accurate?
Is there a better way to retrieve them without slamming the servers or putting more pressure on brion and friends to tar them up? A torrent file perhaps?
I've got several terabytes of storage and plenty of bandwidth to host them on our public trackers, if such a thing were to even exist.
David A. Desrosiers wrote:
Speaking of images... I have the circa 2005 image.tar here for the English Wikipedia (about 70gb of images).
I've heard that if it were possible to tar up the latest images, it would be > 300gb. Is this accurate?
Yes.
Is there a better way to retrieve them without slamming the servers or putting more pressure on brion and friends to tar them up?
Not really.
A torrent file perhaps?
Nope. (But see the third-party attempts at it.)
-- brion vibber (brion @ wikimedia.org)
David A. Desrosiers wrote:
Is there a better way to retrieve them without slamming the servers or putting more pressure on brion and friends to tar them up? A torrent file perhaps?
I've got several terabytes of storage and plenty of bandwidth to host them on our public trackers, if such a thing were to even exist.
There were some torrents for enwiki, and possibly also for the images, set up by a third party. See the archives.