Hi,
I would like to access the wiki dumps for Wikipedia and Wikitravel. Essentially, I am looking to get dumps for some US cities from both of these sources for product research. However, it is not clear from http://dumps.wikimedia.org/backup-index.html which files are the relevant ones to pick up.
The other question I had was about the level of granularity at which data is available in the dumps. The web service lets us retrieve a wiki entry, but the result is not easily parsed into individual sections or any more granular form. I was wondering whether the dumps solve this problem.
Appreciate any help about this.
Regards, Ashish
Hi,
I would like to access the wiki dumps for Wikipedia and Wikitravel. Essentially, I am looking to get dumps for some US cities from both of these sources for product research. However, it is not clear from http://dumps.wikimedia.org/backup-index.html which files are the relevant ones to pick up.
Which Wikipedia do you want? What data do you want? If you want the dumps of the English Wikipedia, find “enwiki” in the index page, which will link to the most recent dump files (currently http://dumps.wikimedia.org/enwiki/20120211/). If you want current text of articles, you want the pages-articles dump (currently http://dumps.wikimedia.org/enwiki/20120211/enwiki-20120211-pages-articles.xm..., 7.6 GB).
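For example, a rough, untested sketch of fetching it in Python (the exact .xml.bz2 filename below is an assumption, so check the dump index page for the current one) would just stream the download to disk instead of holding it in memory:

import shutil
import urllib.request

# Assumed filename -- verify it against the dump index page first.
DUMP_URL = ("http://dumps.wikimedia.org/enwiki/20120211/"
            "enwiki-20120211-pages-articles.xml.bz2")

with urllib.request.urlopen(DUMP_URL) as response, \
        open("enwiki-20120211-pages-articles.xml.bz2", "wb") as out:
    # Copy in chunks so the multi-gigabyte file never has to fit in memory.
    shutil.copyfileobj(response, out)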
Wikitravel is not run by Wikimedia, so you won't find their dump at the same page. You can find information about their dump at http://wikitravel.org/en/Wikitravel:Database_dump.
The other question I had was about the level of granularity at which data is available in the dumps. The web service lets us retrieve a wiki entry, but the result is not easily parsed into individual sections or any more granular form. I was wondering whether the dumps solve this problem.
No, the dumps contain the article text in the same form as what you can edit as a user or what you can get through the API. So, if you want to get any useful information out of it, you have to parse it yourself.
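If it helps, here is a minimal, untested sketch of what that parsing could look like: it walks the pages-articles dump and splits each page's wikitext into sections on "== Heading ==" lines. The local filename and the regex-based splitting are assumptions; a real wikitext parser would handle far more cases.

import bz2
import re
import xml.etree.ElementTree as ET

DUMP_FILE = "enwiki-20120211-pages-articles.xml.bz2"  # assumed local filename
SECTION_RE = re.compile(r"^==+\s*(.*?)\s*==+\s*$", re.MULTILINE)

def pages(path):
    """Yield (title, wikitext) pairs from a pages-articles dump."""
    title, text = None, None
    with bz2.open(path, "rb") as f:
        for _, elem in ET.iterparse(f):
            tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace prefix
            if tag == "title":
                title = elem.text
            elif tag == "text":
                text = elem.text or ""
            elif tag == "page":
                yield title, text
                elem.clear()  # keep memory use bounded

def sections(wikitext):
    """Split wikitext into (heading, body) pairs; the lead section gets heading None."""
    parts = SECTION_RE.split(wikitext)
    result = [(None, parts[0])]
    result.extend(zip(parts[1::2], parts[2::2]))
    return result

# Example: pull out the sections of a few US city articles.
for title, text in pages(DUMP_FILE):
    if title in {"Seattle", "Portland, Oregon"}:
        for heading, body in sections(text):
            print(title, heading, len(body))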
Petr Onderka [[en:User:Svick]]
2012/3/3 Petr Onderka <gsvick@gmail.com>:
Wikitravel is not run by Wikimedia, so you won't find their dump at the same page. You can find information about their dump at http://wikitravel.org/en/Wikitravel:Database_dump.
I have added info to that page about the WikiTeam dumps: http://code.google.com/p/wikiteam/downloads/list?can=2&q=wikitravel
The files from
dumps.wikimedia.org/other/pagecounts-raw/2011/2011-10/pagecounts-20111008-180001.gz
to
dumps.wikimedia.org/other/pagecounts-raw/2011/2011-10/pagecounts-20111008-220001.gz
appear to be Domas' (very informative) blog about MySQL locking, rather than true log summaries...
All the best,
Richard.
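For anyone checking their local copies, a quick untested sketch along these lines can flag such files: it assumes the pagecounts-raw line format of "project page_title count bytes" and treats anything else (including an HTML page saved under a .gz name) as suspect.

import gzip
import sys

def looks_like_pagecounts(path, sample=20):
    """Return True if the first lines look like 'project page_title count bytes'."""
    try:
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for i, line in enumerate(f):
                if i >= sample:
                    break
                fields = line.split()
                # Each data line should have four fields with a numeric view count.
                if len(fields) != 4 or not fields[2].isdigit():
                    return False
    except OSError:
        # Not even valid gzip -- e.g. an HTML error page saved with a .gz name.
        return False
    return True

if __name__ == "__main__":
    path = sys.argv[1]  # e.g. pagecounts-20111008-180001.gz
    print("looks like pagecounts" if looks_like_pagecounts(path) else "does not look like pagecounts")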
On 04/03/12 15:31, Richard Farmbrough wrote:
The files from pagecounts-20111008-180001.gz to pagecounts-20111008-220001.gz appear to be Domas' (very informative) blog about MySQL locking, rather than true log summaries...
Lol. Domas sneaking a copy of his blog between the pagecounts!
I suspect that when downloading the pagecounts, those files were missing and the webserver defaulted to show the blog.
Lol. Domas sneaking a copy of his blog between the pagecounts!
how otherwise would I get readers?
I suspect that when downloading the pagecounts, those files were missing and the webserver defaulted to show the blog.
hehe, yeah, there's a 404 handler. OTOH those files had to exist there when they were being downloaded; there was no automated purge at any time, so files wouldn't disappear...
Domas