The September en wikipedia dumps are done. Folks who use them, note that this is the first run with the generation of a pile of smaller files. The naming scheme, as you will have noticed, has an additional string: -p<first-page-id-contained>p<last-page-id-contained>. Expect the specific groupings to change from one run to the next; the split is time-based rather than based on the number of pages or revisions.
You may notice a gap of a few numbers between files; this would indicate that those pages were deleted and not included in the dump at all.
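Something like the following would pull those ranges back out of the filenames and point at the gaps; a rough sketch only, with the glob written for the split history files of this run (adjust it to whatever you actually downloaded):

#!/bin/bash
# Sketch: list the page-id range encoded in each split dump filename and
# report the gap between consecutive files (i.e. deleted pages).
# The zero-padded names sort in page-id order, so plain glob order is fine.
prev_last=0
for f in enwiki-20110901-pages-meta-history*.xml-p*p*.7z; do
    range=${f##*-p}            # e.g. 000000010p000002326.7z
    range=${range%.7z}
    first=$((10#${range%p*}))  # strip leading zeros so bash won't read octal
    last=$((10#${range#*p}))
    if [ "$prev_last" -gt 0 ] && [ "$first" -gt $((prev_last + 1)) ]; then
        echo "gap: page ids $((prev_last + 1))-$((first - 1)) missing (deleted pages)"
    fi
    echo "$f covers page ids $first-$last"
    prev_last=$last
done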
Since there were no issues with the network, database servers, broken MW deployments etc., the run finished without any need for restarts of a particular step; this is probably the fastest we'll ever see it run, in a little under 8 days.
Any issues, please let me know. I expect people will need a script to download these files easily; didn't someone on this list have a tool in the works?
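(Until such a tool shows up, something as blunt as the sketch below, just wget with an accept pattern, should do; the wiki name and run date are placeholders:)

#!/bin/bash
# Rough sketch: mirror every file of one dump run with wget.
# WIKI and DUMPDATE are placeholders; set them to the run you want.
WIKI=enwiki
DUMPDATE=20110901

# -r/-np: recurse but never ascend past the run directory
# -nd: no local directory hierarchy, -c: resume partial downloads
# -A: only keep the dump files themselves
wget -r -np -nd -c -A "$WIKI-$DUMPDATE-*" "http://dumps.wikimedia.org/$WIKI/$DUMPDATE/"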
Ariel
----- Original Message ----- From: "Ariel T. Glenn" ariel@wikimedia.org Date: Thursday, September 8, 2011 12:22 am Subject: [Xmldatadumps-l] another month, another dump. ho hum :-P To: xmldatadumps-l@lists.wikimedia.org
The September en wikipedia dumps are done. Folks who use them, note that this is the first run with the generation of a pile of smaller files. The naming scheme as you will have noticed has an additional string: -p<first-page-id-contained>p<last-pageid-contained> Expect the specific groupings to change from one run to the next; it's time-based, rather than based on the number of pages or revisions.
You may notice a gap of a few numbers between files; this would indicate that those pages were deleted and not included in the dump at all.
Since there were no issues with the network, database servers, broken MW deployments etc., the run finished without any need for restarts of a particular step; this is probably the fastest we'll ever see it run, in a little under 8 days.
Any issues, please let me know. I expect people will need a script to download these files easily; didn't someone on this list have a tool in the works?
Hi Ariel,
This download add-on for Firefox works quite well and is cross-platform:
http://en.wikipedia.org/wiki/DownThemAll! https://addons.mozilla.org/en-US/firefox/addon/downthemall/ http://www.downthemall.net/
cheers, Jamie
Ariel
Just to confirm, the enwiki-20110901-pages-articles.xml.bz2 file is the concatenation of all those sub-files, right?
Would it be possible to restore the filename of this file to enwiki-latest-pages-articles.xml.bz2 for consistency with all the other wikipedias?
For example, the latest full dump in http://dumps.wikimedia.org/dewiki/latest/ is called dewiki-latest-pages-articles.xml.bz2 and it's the same in all other languages.
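(That convention also makes grabbing the current articles dump for any language a one-liner; a sketch, with the language code as a parameter:)

#!/bin/bash
# Sketch: fetch the most recent pages-articles dump for a language via
# the "latest" directory convention, e.g.: ./get_latest.sh de
LANG_CODE=${1:?usage: get_latest.sh <language code>}
WIKI="${LANG_CODE}wiki"
wget -c "http://dumps.wikimedia.org/$WIKI/latest/$WIKI-latest-pages-articles.xml.bz2"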
Thanks, Eric
On Thu, Sep 8, 2011 at 12:22 AM, Ariel T. Glenn ariel@wikimedia.org wrote:
The September en wikipedia dumps are done. Folks who use them, note that this is the first run with the generation of a pile of smaller files. The naming scheme as you will have noticed has an additional string: -p<first-page-id-contained>p<last-pageid-contained> Expect the specific groupings to change from one run to the next; it's time-based, rather than based on the number of pages or revisions.
You may notice a gap of a few numbers between files; this would indicate that those pages were deleted and not included in the dump at all.
Since there were no issues with the network, database servers, broken MW deployments etc., the run finished without any need for restarts of a particular step; this is probably the fastest we'll ever see it run, in a little under 8 days.
Any issues, please let me know. I expect people will need a script to download these files easily; didn't someone on this list have a tool in the works?
Ariel
Yes it is, and the new naming scheme of the "latest" files is a bug; I need to fix that. Grrr.
Ariel
On Thu, 08-09-2011, at 16:05 -0700, Eric Sun wrote:
Just to confirm, the enwiki-20110901-pages-articles.xml.bz2 file is the concatenation of all those sub-files, right?
Would it be possible to restore the filename of this file to enwiki-latest-pages-articles.xml.bz2 for consistency with all the other wikipedias?
For example, the latest full dump in http://dumps.wikimedia.org/dewiki/latest/ is called dewiki-latest-pages-articles.xml.bz2 and it's the same in all other languages.
Thanks, Eric
Please have a look at these links again. If folks see any anomalies, please let me know. The names should be fixed at any rate.
Ariel
On Fri, 09-09-2011, at 09:49 +0300, Ariel T. Glenn wrote:
Yes it is, and the new naming scheme of the "latest" files is a bug, I need to fix that. Grrr.
Ariel
On 08/09/2011 09:22, Ariel T. Glenn wrote:
I expect people will need a script to download these files easily; didn't someone on this list have a tool in the works?
I wrote this simple bash script https://github.com/SoNetFBK/wiki-network/blob/master/download_dumps.sh
It's really simple to use. Usage: download_dumps.sh LANG [OUTPUT_DIR] [MATCHING_STRING]
Examples:
- download_dumps.sh en -> downloads every latest file from enwiki
- download_dumps.sh en /mydata/dumps -> the same, but saves everything in /mydata/dumps
- download_dumps.sh en /mydata/dumps history -> the same, but downloads only the files that contain the word "history" in the name (you can use regex too!)
P.S.: in the same repo you'll find other interesting stuff for analyzing the dumps (extracting a social network from user talk pages, content analysis, etc.). If you need more info, write to me ;)
fox wrote:
On 08/09/2011 09:22, Ariel T. Glenn wrote:
I expect people will need a script to download these files easily; didn't someone on this list have a tool in the works?
I wrote this simple bash script https://github.com/SoNetFBK/wiki-network/blob/master/download_dumps.sh
I like your trick to fetch the date. What I used to do was to manually download the md5sums file, then parse it to fetch everything from there.
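Roughly like this, for the record (a sketch; I'm assuming the checksum file is named <wiki>-<date>-md5sums.txt, so adjust that if your run differs):

#!/bin/bash
# Sketch of the md5sums approach: fetch the checksum list for a run,
# download every file named in it, then verify the lot.
WIKI=enwiki
DUMPDATE=20110901
BASEURL="http://dumps.wikimedia.org/$WIKI/$DUMPDATE"

wget -q "$BASEURL/$WIKI-$DUMPDATE-md5sums.txt"

# each line is "<md5>  <filename>"; pull the names and fetch them
awk '{print $2}' "$WIKI-$DUMPDATE-md5sums.txt" | while read -r f; do
    wget -c "$BASEURL/$f"
done

# check everything against the list once the downloads are done
md5sum -c "$WIKI-$DUMPDATE-md5sums.txt"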
What doesn't seem to work is the trick for the file list. In the output of
elinks -no-references -no-numbering -dump http://dumps.wikimedia.org/enwiki/20110901/
there's nothing matching "enwiki-"
On 16/09/2011 18:02, Platonides wrote:
I like your trick to fetch the date. What I used to do was to manually download the md5sums file, then parse it to fetch everything from there.
Oh yes! I didn't see that; maybe that's better.
What doesn't seem to work is the trick for the file list. In the output of
elinks -no-references -no-numbering -dump http://dumps.wikimedia.org/enwiki/20110901/
there's nothing matching "enwiki-"
I don't understand your problem, sorry. That page contains a lot of strings matching "enwiki-".
┌[fox☮MachI]-(~) └> elinks -no-references -no-numbering -dump http://dumps.wikimedia.org/enwiki/20110901/ | grep enwiki
enwiki dump progress on 20110901 * enwiki-20110901-pages-meta-history1.xml-p000000010p000002326.7z * enwiki-20110901-pages-meta-history1.xml-p000002327p000004609.7z * enwiki-20110901-pages-meta-history1.xml-p000004610p000006654.7z [...]
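If it helps, the same pipeline can be pushed one step further into a downloader, something like this (sketch only, reusing your elinks trick):

#!/bin/bash
# Sketch: turn the elinks dump of the listing page into a download list
# and hand each filename to wget.
URL="http://dumps.wikimedia.org/enwiki/20110901/"

elinks -no-references -no-numbering -dump "$URL" \
    | grep -o 'enwiki-20110901-[^ ]*' \
    | sort -u \
    | while read -r f; do
          wget -c "$URL$f"
      done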