I have done a small amount of testing, and the results look good. Accordingly
I have started up one process to do dumps; please get your eyeballs on
them and let me know thumbs up or down. I'd like to start up the rest
of the processes by this time tomorrow, so if you can squeeze in some
time to look at them sooner rather than later, that would be awesome.
Thanks!
Ariel
p.s. Yes this means I am done travelling for a while, thank goodness. I
think I am sick of airplanes. And *very* sick of jet lag.
Hello,
As you know, Chinese has two similar written forms, Traditional Chinese
and Simplified Chinese, and it is hard to convert between them
correctly. I know the wiki software can do this conversion properly, so
why not release a separate Traditional Chinese dump and a Simplified
Chinese dump, rather than one combined dump? This could save Chinese
language researchers a lot of time.
Thanks. This is a serious suggestion!
A little bit before the scheduled deployment of the 1.17 branch on our
production servers, I will be halting production of XML dumps.
Deployment is set for Tuesday Feb 8 at 07:00 UTC, so a few hours before
that I'll start shutting down processes.
This is a precautionary measure; after the deployment and any hasty
fixes that may be needed, I will be doing some testing to ensure that
dumps are not impacted, before we restart them. Barring some bizarre
problem, we should be back up and running within a day or two.
Ariel
Hello,
there seems to be a problem with the current jawiki dump. The size of the complete history dump is only 4.3 GB, but the size of the previous dump was 19 GB.
Another issue: according to http://wikitech.wikimedia.org/view/Dumps#Worker_nodes there should be 3 threads for the large dumps, but for the past few days only 2 threads have been running.
Best regards,
Andreas
For folks who have not been following the saga on
http://wikitech.wikimedia.org/view/Dataset1
we were able to get the raid array back in service last night on the XML
data dumps server, and we are now busily copying data off of it to
another host. There's about 11T of dumps to copy over; once that's done
we will start serving these dumps read-only to the public again.
Because the state of the server hardware is still uncertain, we don't
want to do anything that might put the data at risk until that copy has
been made.
The replacement server is on order and we are watching that closely.
We have also been working on deploying a server to run one round of
dumps in the interim.
Thanks for your patience (which is a way of saying, I know you are all
out of patience, as am I, but hang on just a little longer).
Ariel
Hello
I have some questions for you, and I think they won't be hard to answer.
What is the status of the new Wikipedia dumps?
When will they be refreshed and available for download?
I've heard that it is possible to request your own dump, but I don't know how to do it. Can you help me?
And the last one:
http://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps
I think they are looking for volunteers and mirrors for the projects. Can you support them and start mirroring some languages?
Thank you.
I've just discovered this and thought other people parsing MediaWiki
dump files could also benefit.
Dump files contain a version number in the root element eg:
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.3/ http://www.mediawiki.org/xml/export-0.3.xsd" version="0.3"
xml:lang="vo">
A description of the changes to the format can be found at
http://www.mediawiki.org/xml/export-0.4.xsd :
Version 0.2 adds optional basic file upload info support,
which is used by our OAI export/import submodule.
Version 0.3 adds some site configuration information such
as a list of defined namespaces.
Version 0.4 adds per-revision delete flags, log exports,
discussion threading data, and a per-page redirect flag.
Notice that per-page redirect flags are documented to begin with
version 0.4 dump files.
In fact, the per-page redirect flag seems to have been in use since 28
July 2009 and can be found in dump files marked as version 0.3.
Given this, there are surely other features that appear in versions
earlier than documented, so it would be wise to allow for this when
parsing dump files rather than relying on the version declaration.
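As a minimal sketch of reading the version declaration (using Python's standard-library ElementTree, and treating the result as a hint rather than a guarantee, per the caveat above):

```python
# Sketch: read the schema version from a MediaWiki dump's root element
# without loading the whole (possibly multi-GB) file into memory.
import xml.etree.ElementTree as ET

def dump_version(path):
    # iterparse yields a "start" event for the root element as soon as
    # its opening tag has been read, so we can stop immediately; the
    # "version" attribute is unprefixed, so no namespace handling is needed.
    for _, elem in ET.iterparse(path, events=("start",)):
        return elem.get("version")
    return None
```

Feature detection (e.g. checking whether `<redirect>` elements actually occur) is then a separate pass; the declared version alone can't be trusted for that.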
Andrew Dunbar (hippietrail)
I'm trying to import the categorylinks table.
After MySQL finishes the inserts, it reaches the line:
ALTER TABLE `categorylinks` ENABLE KEYS
By doing a "show full processlist" I can see that MySQL is running this
with "Repair with keycache". From what I've read about this online,
building the indexes this way takes 20-30 times longer than with "Repair
by sorting".
Any ideas how to get around this?
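For what it's worth, a commonly cited workaround (a sketch, not verified against this particular import; the sizes below are illustrative): MySQL falls back to "Repair with keycache" when it estimates the temporary files for a sort-based rebuild would exceed myisam_max_sort_file_size, or when the sort buffer is too small, so raising both before the import usually lets it use "Repair by sorting":

```sql
-- Allow large temporary files for the sort-based index rebuild
-- (myisam_max_sort_file_size is a GLOBAL variable).
SET GLOBAL myisam_max_sort_file_size = 100 * 1024 * 1024 * 1024;  -- 100 GB
-- Give each index build a bigger in-memory sort buffer.
SET SESSION myisam_sort_buffer_size = 256 * 1024 * 1024;          -- 256 MB
-- Then re-run the import, or just the final step:
ALTER TABLE `categorylinks` ENABLE KEYS;
```

Checking "show full processlist" again afterwards should show "Repair by sorting" instead of "Repair with keycache".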