The current dump
enwiki-20160204-pages-articles.xml.bz2
contains duplicate pages. In particular, "Total Nonstop Action" and "Ida de Grey" appear twice.
Is this going to be fixed, or should we assume that there might be duplicate pages in the dump? This never happened to us before.
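For anyone who wants to check their own copy, a minimal sketch of a duplicate-title scan in Python follows. The function name and the line-by-line regex approach are my own assumptions, not anything from the dump tooling; it only works because pages-articles dumps put each <title> element on its own line.

```python
import bz2
import re
from collections import Counter
from typing import Iterable, List

def duplicate_titles(lines: Iterable[str]) -> List[str]:
    """Return page titles that occur more than once in an XML dump stream."""
    counts = Counter()
    for line in lines:
        # pages-articles dumps emit one <title>...</title> element per line
        m = re.search(r"<title>(.*?)</title>", line)
        if m:
            counts[m.group(1)] += 1
    return [title for title, n in counts.items() if n > 1]

def report_duplicates(path: str = "enwiki-20160204-pages-articles.xml.bz2") -> None:
    """Stream the compressed dump and print any duplicated titles."""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for title in duplicate_titles(f):
            print(title)
```

Calling report_duplicates() streams the bz2 file directly, so nothing needs to be decompressed to disk; on a full enwiki dump this is slow but memory-light, since only the title counter is kept in RAM.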
Ciao,
seba
Hello all
I was wondering if someone could point me to the English Wikipedia
"enwiki-20150205-pages-articles-multistream" dump from which the 2015-04
DBpedia dumps were extracted. They used to be hosted on dump.wikipedia.org
but now return 404.
Thanks
Praveen
The 2015 and 2016 dumps are available at the Wikimedia dumps website.
However, I would like access to older XML Wikipedia dumps.
I've been googling all around and have tried everything from torrents to an
EBS snapshot on Amazon which supposedly contained many of the XML dumps.
I've managed to get access to random dumps for all years from 2006 to 2016;
however, I would like specific dumps, i.e. all dumps from around March for
each of those years.
I wonder if there is a repository, or if anyone could share them via
torrents (the current torrents don't have any seeds).
Thanks
Earlier today one of the snapshot hosts was mistakenly taken offline and
shortly thereafter rebooted. The cron job checker should pick this up
tomorrow and, after a few minutes, continue where the run left off. I'll be
checking to make sure that all goes as planned.
Ariel
Changing compression programs was mentioned on the list last month. Doing
Google searches for one thing can bring up something totally different...
https://github.com/powturbo/TurboBench
This compares over 50 different compression programs, with "pretty" graphs
and tables as output. I think the test results on the web page use part
of an enwiki dump file. See http://mattmahoney.net/dc/text.html for
information on the file.
Note: The developer might be the same person who did LZTurbo, a
closed-source fork of the GPL Tornado compression program. If you're
wondering, Tornado leaves BZ2 in the dust in terms of speed at the same
compression size. It is available for Linux and Windows, but is not in any
Linux repository I could find.
There is also Squash Compression Benchmark
https://quixdb.github.io/squash-benchmark/
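If you just want a rough feel for the speed/ratio trade-off before running a full benchmark suite, here is a minimal sketch using only Python's standard-library codecs (bz2 and lzma stand in for the formats discussed; the function names and the sample data are my own invention, and this is nothing like as rigorous as TurboBench or Squash):

```python
import bz2
import lzma
import time

def benchmark(name, compress, data):
    """Time one compressor on `data` and return (name, ratio, seconds)."""
    start = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - start
    return name, len(out) / len(data), elapsed

def run(data):
    """Compare stdlib bz2 and lzma on the same input."""
    return [
        benchmark("bz2", lambda d: bz2.compress(d, 9), data),
        benchmark("lzma", lambda d: lzma.compress(d), data),
    ]

if __name__ == "__main__":
    # Repetitive XML-ish sample; real dump text compresses differently.
    sample = b"<page><title>Example</title><text>...</text></page>\n" * 5000
    for name, ratio, secs in run(sample):
        print(f"{name}: ratio={ratio:.4f} time={secs:.3f}s")
```

On real dump data the rankings can differ a lot from a toy sample like this, which is exactly why the benchmark sites above use a chunk of an actual enwiki file.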
Bryan
Could somebody start the dumps? It would be nice to get two rounds of
dumps in a month for a change.
Also, shouldn't enwiki (dewiki? frwiki?) go first, as they take the
longest? I think the second round won't start until these are finished.
Bryan