Date: Wed, 17 Mar 2010 15:15:24 +0100
From: Platonides Platonides@gmail.com
Subject: Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
To: wikitech-l@lists.wikimedia.org
Message-ID: hnqo49$itc$1@dough.gmane.org
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Jamie Morken wrote:
Also I wonder if it is possible to convert from 7z to bz2 without having
to make the 5469GB file first? If this can be done then having only 7z
files would be fine, as the bz2 file could be created with a "normal"
PC (i.e. one without a 6TB+ hard drive). This would be a good solution,
but I'm not sure if it can be done. If it could though, might as well get
rid of all the large wikis' bz2 pages-meta-history files!
Sure. 7z e -so DatabaseDump.7z | bzip2 -9 > DatabaseDump.bz2
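For anyone who would rather script the conversion than type the pipe, here is a minimal Python sketch of the same idea: read the extracted stream in chunks and recompress it on the fly, so the uncompressed file never touches the disk. It assumes a `7z` binary on PATH; the function names `recompress_to_bz2` and `convert_7z_to_bz2` are made up for illustration.

```python
import bz2
import subprocess
from typing import BinaryIO

CHUNK = 1 << 20  # read 1 MiB at a time so memory use stays flat


def recompress_to_bz2(src: BinaryIO, dst: BinaryIO, level: int = 9) -> int:
    """Stream src into dst as bzip2; return the number of uncompressed bytes read."""
    compressor = bz2.BZ2Compressor(level)
    total = 0
    while True:
        chunk = src.read(CHUNK)
        if not chunk:
            break
        total += len(chunk)
        dst.write(compressor.compress(chunk))
    dst.write(compressor.flush())  # emit whatever the compressor still buffers
    return total


def convert_7z_to_bz2(archive: str, out_path: str) -> int:
    """Hypothetical wrapper: pipe `7z e -so archive` straight into a .bz2 file."""
    # Assumption: `7z` is installed; `e -so` writes the extracted data to stdout.
    proc = subprocess.Popen(["7z", "e", "-so", archive], stdout=subprocess.PIPE)
    with open(out_path, "wb") as dst:
        total = recompress_to_bz2(proc.stdout, dst)
    proc.stdout.close()
    if proc.wait() != 0:
        raise RuntimeError("7z exited with an error")
    return total
```

Calling `convert_7z_to_bz2("DatabaseDump.7z", "DatabaseDump.bz2")` should behave like the shell pipe; only the compressed output needs disk space.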
Hi,
Thanks for the info, I think 7z is the way to go :)
cheers, Jamie
Hi,
Firstly, congratulations on this! As I know, it has taken a long time!
And may I ask a small question: what is the difference between the current dump and the history dump? I know the current one only includes current edits, and the history one has all edits, as the introduction said. More specifically, how does the difference show up for a single article? Can anyone explain it in detail, please?
Additionally, why do all the statistics of Wikipedia only use the history dump for analysis?
Thanks very much!
Zeyi
Zeyi wrote:
Hi,
Firstly, congratulations on this! As I know, it has taken a long time!
And may I ask a small question: what is the difference between the current dump and the history dump? I know the current one only includes current edits, and the history one has all edits, as the introduction said.
You have explained the difference perfectly :)
More specifically, how does the difference show up for a single article? Can anyone explain it in detail, please?
It doesn't "show" the article. It's just a really, really large bunch of wikitext separated by XML tags. It is shown by a tool. If you just want to read the articles, you don't need histories.
Additionally, why do all the statistics of Wikipedia only use the history dump for analysis?
Because they study things like changes made to articles, number of edits per time...
Thanks very much!
You're welcome.
On Mar 19 2010, Platonides wrote:
Zeyi wrote:
Hi,
Firstly, congratulations on this! As I know, it has taken a long time!
And may I ask a small question: what is the difference between the current dump and the history dump? I know the current one only includes current edits, and the history one has all edits, as the introduction said.
You have explained the difference perfectly :)
More specifically, how does the difference show up for a single article? Can anyone explain it in detail, please?
It doesn't "show" the article. It's just a really, really large bunch of wikitext separated by XML tags. It is shown by a tool. If you just want to read the articles, you don't need histories.
What I mean is that if the current dump shows 30 edits under a particular article name, and the history dump shows 100 edits under the same article, what is the difference between these 30 and 100?
If I say that the current dump can explain how the current articles were built up from different edits, is that correct?
Additionally, why do all the statistics of Wikipedia only use the history dump for analysis?
Because they study things like changes made to articles, number of edits per time...
Thanks very much!
You're welcome.
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
On 03/19/2010 11:02 AM, zh509@york.ac.uk wrote:
What I mean is that if the current dump shows 30 edits under a particular article name, and the history dump shows 100 edits under the same article, what is the difference between these 30 and 100?
The current dump shows 1 edit for each article, only the most recent at the time that article was processed. The history dump shows all edits for all articles.
Conrad
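To check this on a real file, one can count the revisions recorded per page. The following is only a rough sketch: it assumes the dump follows the usual MediaWiki XML export layout (`<page>` elements holding one `<title>` and one or more `<revision>` elements), and `revisions_per_page` is a made-up helper name. On a current dump every count should come out as 1; on a history dump the counts are the full edit totals.

```python
import bz2
import xml.etree.ElementTree as ET
from typing import BinaryIO, Dict, Optional


def local(tag: str) -> str:
    """Strip the XML namespace, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def revisions_per_page(stream: BinaryIO) -> Dict[str, int]:
    """Return {page title: number of <revision> elements} for a dump stream."""
    counts: Dict[str, int] = {}
    title: Optional[str] = None
    revs = 0
    # iterparse reads incrementally, so the whole dump is never parsed at once.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            revs += 1
            elem.clear()  # drop the wikitext once the revision is counted
        elif name == "page":
            counts[title or ""] = revs
            title, revs = None, 0
            elem.clear()
    return counts


if __name__ == "__main__":
    # Hypothetical file name; any pages-meta-current / pages-meta-history
    # .xml.bz2 file should work the same way.
    with bz2.open("pages-meta-history.xml.bz2", "rb") as dump:
        for page_title, n in revisions_per_page(dump).items():
            print(page_title, n)
```

Keeping every title in a dictionary is fine for a small wiki; for enwiki it would be better to print or aggregate the counts as each page closes.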
On Mar 19 2010, Conrad Irwin wrote:
On 03/19/2010 11:02 AM, zh509@york.ac.uk wrote:
What I mean is that if the current dump shows 30 edits under a particular article name, and the history dump shows 100 edits under the same article, what is the difference between these 30 and 100?
The current dump shows 1 edit for each article, only the most recent at the time that article was processed. The history dump shows all edits for all articles.
Wow, can you confirm that only the latest edit can be collected from the current dump? So, the current dump isn't meaningful in terms of statistics?
Conrad
thanks, Zeyi
Zeyi wrote:
Wow, can you confirm that only the latest edit can be collected from the current dump?
Yes. That's the intent.
So, the current dump isn't meaningful in terms of statistics?
It's not extremely useful. There may be a few statistics for which it can be used.