Hello Wikipedians,
I am in the process of making a local mirror of the Wikipedia encyclopedia and seem to have hit a stumbling block. But first, a related question. I have looked through the software documentation and the downloaded pages and haven't seen this, but just want to make sure. I have the base software, version 1.2.3, installed from the web interface. I had only one small but surmountable problem: when using IE 5.1.3 under Mac OS 9.0.4, I could not enter the name for the site. The field was overlaid by the info that should have been to its right. I switched to Netscape and then had no problem.
Now I am in the process of populating the database and was wondering if, in the maintenance folder (or someplace else), there is a set of scripts to fetch and load the actual base data content and then the weekly updates. I would like to keep this mirror up to date with the master copy.
From looking at the mailing list archive, I have seen it stated that there is no doc file explaining what each of the maintenance scripts does, and looking them over hasn't yielded one to create/update the database. If one doesn't exist, I am ready to do it manually. But in my first attempts I have hit a few problems.
The first trick is getting the correct data to do the upload with. I found the dump download page and the files for the EN version of the database (dated 2004-04-03). The current one looks fine, and I have been able to retrieve it and do some (not all) processing with it.

My first problem is the old database. My assumption is that it contains the full database content (minus images) prior to the new data in the current file. I notice that the format of the old/full file has changed recently and grown a lot. I tried to download the full DB as a single file and failed (403 - not authorized); this seems not to be unexpected, since there is mention of the multi-part files for those experiencing problems. The single files http://download.wikimedia.org/archives/en/20040403_cur_table.sql.bz2 and http://download.wikimedia.org/archives/en/20040403_old_table.sql.bz2 have names and formats that make sense to me.

The partials have me confused, especially given my inability to decompress them. There are only three files listed, but based on the file sizes (and one unlisted file) it appears there should be four: the first three come to exactly 2 GB each, a mathematical oddity if that were all there was, but a fourth file would even it out nicely. Also, the files themselves have names that give no clue as to their contents: http://download.wikimedia.org/archives/en/xaa , xab, xac, and the unlisted xad. What format are these, and how should they be joined together? I copied them over via wget and then tried to merge and decompress them but failed. The command I tried (to verify before actual processing) and the response were:
========= start of clip
-bash-2.05b$ nice bzip2 -t xaa xab xac xad
bzip2: xaa: file ends unexpectedly
bzip2: xab: bad magic number (file not created by bzip2)
bzip2: xac: bad magic number (file not created by bzip2)
bzip2: xad: bad magic number (file not created by bzip2)

You can use the `bzip2recover' program to attempt to recover data from undamaged sections of corrupted files.
========= end of clip
Are these files damaged, or am I just using the wrong software to do this? BTW, I am on a RH9 system running PHP 4.3.4, MySQL 4.0.18, and Apache 2.
To continue my testing and make sure I had everything else in place, I thought I'd try using the current file and see how that went. It might not be all the data, but it would give me a taste of how things were going. The decompress went fine, but I had a problem partway through the load. The data that did load was enough for me to do some minimal testing and verify that the software basically works and that I was close to doing the upload. The command I tried and the response I got were:
========= start of clip
-bash-2.05b$ nice mysql -p -uxxxxxxx wikipedia < 20040403_cur_table.sql
Enter password:
ERROR 1153 at line 831: Got a packet bigger than 'max_allowed_packet'
-bash-2.05b$
========= end of clip
What size should I be setting the 'max_allowed_packet' to?
Thanks in advance for your help and for creating this software and its associated database.
Paul
http://PrivacyDigest.com/ Daily news from the privacy front.
PS: The ls -al for the data files I downloaded is:

-rw-r--r--  1 wikipedia  psacln   850374900 Apr  3 02:09 20040403_cur_table.sql
-rw-r--r--  1 wikipedia  psacln  2000000000 Mar 22 17:32 xaa
-rw-r--r--  1 wikipedia  psacln  2000000000 Mar 22 17:35 xab
-rw-r--r--  1 wikipedia  psacln  2000000000 Mar 22 17:38 xac
-rw-r--r--  1 wikipedia  psacln  1614740369 Mar 22 17:40 xad
On Apr 7, 2004, at 21:11, Paul Hardwick wrote:
I had only one small but surmountable problem: when using IE 5.1.3 under Mac OS 9.0.4, I could not enter the name for the site. The field was overlaid by the info that should have been to its right. I switched to Netscape and then had no problem.
I'll check this out, thanks for the note.
Now I am in the process of populating the database and was wondering if, in the maintenance folder (or someplace else), there is a set of scripts to fetch and load the actual base data content and then the weekly updates. I would like to keep this mirror up to date with the master copy.
No, there is no such script. Unfortunately we don't yet have a good procedure for synchronizing a mirror other than throwing out and replacing the whole thing every week or so.
Just note: INSTALL THE WIKI FIRST, then load in the data. The dumps *drop* the existing tables and replace them, and the installer doesn't like to run over a partial set of tables. (The command-line install will drop any existing tables.)
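Until a real sync tool exists, the throw-out-and-replace cycle can be sketched as a few shell lines. The date, dump URL, database name, and credentials below are placeholders taken from this thread, not an official script, and it assumes the wiki is already installed:

```shell
# Hypothetical weekly refresh: fetch the current-table dump and stream it
# straight into MySQL. The dump itself DROPs and recreates the tables,
# so no manual cleanup is needed between runs.
DATE=20040403
wget "http://download.wikimedia.org/archives/en/${DATE}_cur_table.sql.bz2"
bzip2 -dc "${DATE}_cur_table.sql.bz2" | mysql -u mywikiuser -p wikipedia
```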
The partials have me confused,
First, the bad news. The partials weren't being updated automatically by the backup process, so what you downloaded was about a month old. If you want the April 3 backup, you'll have to grab them again. Sorry... :(
Also, the split files are up to xae now. Compression of old revisions reduces the raw disk space (& disk cache) needed for the table, but totally ruins the compression ratio of the downloadable dumps.
-bash-2.05b$ nice bzip2 -t xaa xab xac xad
bzip2: xaa: file ends unexpectedly
bzip2: xab: bad magic number (file not created by bzip2)
That will try to decompress each file in turn, which doesn't work; you need to concatenate them back into a single stream. The simplest thing might be to pipe it straight into mysql, assuming you're already set up:
cat xa? | bzip2 -dc | mysql -u mywikiuser -p mydatabase
Or if you'd like to output a big decompressed SQL file:
cat xa? | bzip2 -dc > old_table_20040403.sql
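If you want to convince yourself that plain byte-wise concatenation is the right way to rejoin the pieces, you can round-trip a small file locally. `split` here stands in for however the server-side chunks were produced (an assumption, but consistent with the fixed-size 2 GB pieces):

```shell
# Compress a small sample, chop the .bz2 into fixed-size byte chunks,
# then rejoin with cat and decompress; cmp verifies a lossless round trip.
printf 'hello wiki\n' > sample.txt
bzip2 -k sample.txt                  # writes sample.txt.bz2, keeps original
split -b 10 sample.txt.bz2 part_     # part_aa, part_ab, ... raw byte slices
cat part_a? | bzip2 -dc > restored.txt
cmp sample.txt restored.txt && echo OK
```

Note that `bzip2 -t` on an individual chunk fails exactly as in your transcript: only the first chunk has the bzip2 magic number, and it ends mid-stream.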
-bash-2.05b$ nice mysql -p -uxxxxxxx wikipedia < 20040403_cur_table.sql
Enter password:
ERROR 1153 at line 831: Got a packet bigger than 'max_allowed_packet'
-bash-2.05b$
========= end of clip
What size should I be setting the 'max_allowed_packet' to?
I think 16MB is the maximum, try that.
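For what it's worth, ERROR 1153 is enforced on the server side, so the limit usually has to be raised in my.cnf and mysqld restarted. A sketch using the MySQL 4.0-era `set-variable` syntax; treat the section and value as an illustration rather than a verified config:

    [mysqld]
    set-variable = max_allowed_packet=16M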
-- brion vibber (brion @ pobox.com)
mediawiki-l@lists.wikimedia.org