Hi everybody!
I'm a newbie on this list, so apologies in advance for any mistakes on my part :-))
I think my problem is important for people who, like me, have a local xAMP+M environment (for me x => W :-)) ) and want to load wiki dumps from time to time. Everything was fine until June 23rd, when the last .sql dumps of "my" Polish wiki were published. I could load the cur table into the db in about 10 minutes (roughly 300 pages per second). Now I find only an .xml file.
First, I completely don't understand this change. For production needs, for example restoring a db, .xml files are no match for .sql files, and the same is true for people like me. There was no help anywhere on how to use the new solution. When I finally found help here (in Brion's post), it didn't succeed: gzip -dc pages_current.xml.gz | php importDump.php stops after loading approximately 15,000 pages (out of about 190,000) while executing line 47 of the importDump script (1.5rc2). Then I found bug 2979 about line 47, which is still open (for 1.5rc4 too), and bug 3182, also still open.
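For comparison, the two workflows look roughly like this (the .sql file name below is only an example, not the exact name that was published for plwiki, and wikiuser/plwiki are placeholders for my local setup):

    # old way (until 23rd June): load the cur table dump straight into MySQL
    # (example file and database names, not the exact published ones)
    gzip -dc plwiki-cur-table.sql.gz | mysql -u wikiuser -p plwiki

    # new way: feed the XML dump to the MediaWiki maintenance script
    gzip -dc pages_current.xml.gz | php importDump.php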
Next I tried Kate's importDump.phps; things went well for a long time, but unfortunately at around page 142,000 PHP was suddenly terminated without any message except one from Windows. While PHP was working I watched the constant increase in memory consumption reported in bug 3182, though not so drastic as to make PHP fail on its own. And one more very important point: the flow of data through gzip & php is drastically low, approximately 5 pages/s, vs. 300 pages/s when I import an .sql dump. I suspect the export to .xml is just as slow and not acceptable for production needs. Are the wiki dbs really dumped to .xml, and, if needed, restored from xml?
So, is there any chance that people can get .sql dumps from download.wikimedia.org? XML dumps are completely useless for them.
Janusz 'Ency' Dorozynski
Dorożyński Janusz wrote:
So, is there any chance that people can get .sql dumps from download.wikimedia.org? XML dumps are completely useless for them.
No you can't get SQL dumps. :)
* The schema and compression formats keep changing, which breaks things for people trying to get at the data.
* There is no longer any equivalent to the "cur table" for current-revision-only SQL dumps.
* The page database now includes deleted page text, which can't be publicly redistributed.
* The databases as we have them contain a lot of indirection, with custom site-specific alternate tables.
* A lot of text isn't even in there but is stored on another server cluster, *making a raw SQL dump useless if you're not on our servers*.
So straight SQL dumps for the public are not possible.
If you like you can use the mwdumper tool to convert the XML dumps to local-import-friendly SQL instead of using importDump.php (which as you note needs a bug fix). This is still experimental for that purpose and is in CVS: http://cvs.sourceforge.net/viewcvs.py/wikipedia/mwdumper/
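For example, once it's built from CVS, a conversion pipeline could look roughly like this (the jar name, flags and database names here are illustrative and may differ from the current CVS code):

    # convert the XML dump to SQL INSERTs for the 1.5 schema and pipe them into MySQL
    # (wikiuser/wikidb are placeholders for your local setup)
    java -jar mwdumper.jar --format=sql:1.5 pages_current.xml.gz | mysql -u wikiuser -p wikidb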
When it's a little more tested I'll make a public release package for it.
-- brion vibber (brion @ pobox.com)
| -----Original Message-----
| From: ... Brion Vibber
| Sent: Wednesday, September 07, 2005 9:26 PM
/
| No you can't get SQL dumps. :)
:-(((
Bad news or good, at least it is news. Well, now the situation is more clear and reasonable.
However, it means that the open content declared and supported by the wiki, which existed in full since 23rd June, is now limited.
/
| If you like you can use the mwdumper tool to convert the XML dumps to
| local-import-friendly SQL instead of using importDump.php
/
| When it's a little more tested I'll make a public release
| package for it.
Ok.
Cheers, Janusz
Dorożyński Janusz wrote:
However, it means that the open content declared and supported by the wiki, which existed in full since 23rd June, is now limited.
No it doesn't, it means the complete opposite.
If we didn't want to produce working dumps for you to be able to get all text, *THEN* we would still be making SQL dumps for you THAT DON'T WORK. I like to think we're not complete idiots, so we're not doing that.
-- brion vibber (brion @ pobox.com)
| -----Original Message-----
| From: ... Brion Vibber
| Sent: Thursday, September 08, 2005 9:34 PM
/
| Dorożyński Janusz wrote:
| > However, it means that the open content declared and supported by the
| > wiki, which existed in full since 23rd June, is now limited.
|
| No it doesn't, it means the complete opposite.
|
| If we didn't want to produce working dumps for you to be able to get all
| text, *THEN* we would still be making SQL dumps for you THAT DON'T WORK.
| I like to think we're not complete idiots, so we're not doing that.
Brion, I think neither of us - you or me - wants a word as heavy as "complete ..." :-)). I certainly don't think that.
Also, my technical skills are not so poor that I can't understand circumstances like the effects of the internal db structure or the dispersal of data across instances. Look, after you explained the reasons I wrote a short, accepting message: "now the situation is more clear and reasonable". I understand that the growing size of the content, images for example, makes it pointless to download for a local wiki. And I know that isn't malice on your part but reality. Etc., etc.
But ...
Fact 1. Since 23rd June anybody could download the whole content (or text).
Fact 2. Past 23rd June anybody can download only an excerpt of the whole content (yes, yes, because there are important technical reasons, etc., and I accept them without any doubt).
But what that means is really a personal POV. I say "limiting"; maybe I'm not right, maybe I am. But I have the right to think and the right to my own opinion, like everybody has. And please don't think that I mean you are "complete ...". My thoughts are quite the opposite :-)
Well, I think we can close this matter, and instead of writing polemics you can do more pleasant things, like writing scripts (mwdumper, for example :-))).
Have a nice day, best rgds, Janusz
On 09/09/05, Dorożyński Janusz dorozynskij@poczta.onet.pl wrote:
But ...
Fact 1. Since 23rd June anybody could download the whole content (or text).
Fact 2. Past 23rd June anybody can download only an excerpt of the whole content (yes, yes, because there are important technical reasons, etc., and I accept them without any doubt).
As far as I know, this is not the situation - the dumps still represent the same data, it's just that they represent it in a different format. For instance, where previously you had the content of the 'cur' table, you now have an XML file (pages_current.xml.gz) which contains the same thing - the current version of each page. Similarly, the old 'old' table is available in a *more* friendly form (with the current revisions in too, which makes sense) as pages_full.xml.gz or filtered to the main namespace as all_titles_in_ns0.gz
The only downside is that the dumps are now harder to import, because the helper scripts are not yet well-developed. Once those scripts are working better, the new system will actually allow *improved* access to the data, since there can easily be more different dumps for different purposes.
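For instance, you can peek at the top of pages_current.xml.gz to confirm it carries the same thing the cur table did (a rough check; adjust the path to wherever you saved the file):

    # show the <mediawiki> header and the first <page>/<revision> elements;
    # each <page> holds a single <revision> with the current text
    gzip -dc pages_current.xml.gz | head -n 40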
Dorożyński Janusz wrote:
But ...
Fact 1. Since 23rd June anybody could download the whole content (or text).
Fact 2. Past 23rd June anybody can download only an excerpt of the whole content (yes, yes, because there are important technical reasons, etc., and I accept them without any doubt).
I'm not sure what this means.
First, "since 23rd June" and "past 23rd June" mean about the same thing. Second, you can both download the whole content (with complete history) and download excerpts (current versions only, or current versions of all non-talk pages except user pages). This is the case both before and after that date (though the *additional* current-non-talk-non-user dump is new as of September).
But what that means is really a personal POV. I say "limiting"; maybe I'm not right, maybe I am. But I have the right to think and the right to my own opinion, like everybody has. And please don't think that I mean you are "complete ...". My thoughts are quite the opposite :-)
What do you think is limiting?
-- brion vibber (brion @ pobox.com)
| -----Original Message-----
| From: ... Timwi
| Sent: Saturday, September 10, 2005 9:37 PM
/
| He meant "until 23rd June" or "before 23rd June". He made the same
| mistake three times, and it was obvious at least to me what he meant.
Oooops! Thanks Timwi, you are right.
I'm really sorry - it's all my fault.
Yes, I meant that before 23rd June we could use the .sql dumps and load them successfully via mysqld.
Reg., Janusz 'Ency' Dorozynski