Greetings XML Dump users and contributors!
This is your automatic monthly Dumps FAQ update email. This update
contains figures for the 20181101 full revision history content run.
We are currently dumping 916 projects in total.
---------------------
Stats for bnwikivoyage on date 20181101
Total size of page content dump files for articles, current content only (bytes):
6587638
Total size of page content dump files for all pages, current content only (bytes):
6825818
Total size of page content dump files for all pages, all revisions (bytes):
99741838
---------------------
Stats for enwiki on date 20181101
Total size of page content dump files for articles, current content only (bytes):
69650333728
Total size of page content dump files for all pages, current content only (bytes):
155512399552
Total size of page content dump files for all pages, all revisions (bytes):
18210409893326
---------------------
Sincerely,
Your friendly Wikimedia Dump Info Collector
Greetings XML Dump users and contributors!
Looks like https://wikimedia.bytemark.co.uk/ has not been updated since
2017-11-26. Maybe somebody should remove it from the mirror list, or
contact Bytemark to notify them?
Best regards,
Mariusz "Nikow" Klinikowski.
Greetings XML Dump users and contributors!
This is your automatic monthly Dumps FAQ update email. This update
contains figures for the 20190101 full revision history content run.
We are currently dumping 917 projects in total.
---------------------
Stats for rmywiki on date 20190101
Total size of page content dump files for articles, current content only (bytes):
2372738
Total size of page content dump files for all pages, current content only (bytes):
5012147
Total size of page content dump files for all pages, all revisions (bytes):
225293841
---------------------
Stats for enwiki on date 20190101
Total size of page content dump files for articles, current content only (bytes):
70514635946
Total size of page content dump files for all pages, current content only (bytes):
157464548890
Total size of page content dump files for all pages, all revisions (bytes):
18472280381222
---------------------
Sincerely,
Your friendly Wikimedia Dump Info Collector
Folks may have noticed already that the links presented for download of
pages-articles-multistream dumps are incorrect on the web pages for big
wikis. The files exist for download, but the wrong links were created.
I'll be looking into that and fixing it up over the next few days; in the
meantime you can manually download the files by specifying the right name.
Apologies for the inconvenience.
Ariel
I downloaded:
http://dumps.wikimedia.your.org/other/static_html_dumps/2008-06/en/wikipedi…
using wget and it seems to be fine:
$ _IFL="wikipedia-en-html.tar.7z"
$ ls -l "${_IFL}"
-rw-r--r-- 1 user user 15363543213 Jun 21 2008 wikipedia-en-html.tar.7z
$ file "${_IFL}"
wikipedia-en-html.tar.7z: 7-zip archive data, version 0.2
$ md5sum -b "${_IFL}"
03ce695cbf32a3f8636fa8d3f9c7d12e *wikipedia-en-html.tar.7z
$ sha256sum -b "${_IFL}"
c2794b6371a05017f03e2a345730fd763b1052872290b5c78763978a0b43c747 *wikipedia-en-html.tar.7z
$ sha512sum -b "${_IFL}"
d52a737ceca25ef18272ba70a4a56000a7a0bff92653fb462674333a0855f397c892b8aeb2e11206d391ba4cca48d46f5814d92db4d2096467519de38c5a189c *wikipedia-en-html.tar.7z
$ 7z l "${_IFL}"
7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,2 CPUs Intel(R) Pentium(R) CPU B940 @ 2.00GHz (206A7),ASM)
Scanning the drive for archives:
1 file, 15363543213 bytes (15 GiB)
Listing archive: wikipedia-en-html.tar.7z
--
Path = wikipedia-en-html.tar.7z
Type = 7z
Physical Size = 15363543213
Headers Size = 100
Method = LZMA:22
Solid = -
Blocks = 1
Date Time Attr Size Compressed Name
------------------- ----- ------------ ------------ ------------------------
2008-06-18 13:02:15 ..... 223674511360 15363543113 wikipedia-en-html.tar
------------------- ----- ------------ ------------ ------------------------
2008-06-18 13:02:15 223674511360 15363543113 1 files
$
But I can't get the name of the compressed/contained file, even though
ark and 7z show it. Here is my simple piece of code:
import java.io.File;
import java.io.IOException;
import org.apache.commons.compress.archivers.sevenz.SevenZArchiveEntry;
import org.apache.commons.compress.archivers.sevenz.SevenZFile;

String aIFl = "wikipedia-en-html.tar.7z";
File I7ZKFl = new File(aIFl);
if (I7ZKFl.exists()) {
    // open the archive and walk its entry metadata
    try (SevenZFile SvnZFl = new SevenZFile(I7ZKFl)) {
        SevenZArchiveEntry entry;
        int iIx = 0;
        while ((entry = SvnZFl.getNextEntry()) != null) {
            System.out.println("// __ [" + iIx + "]: |" + entry + "|");
            System.out.println("// __ .getName() |" + entry.getName() + "|");
            System.out.println("// __ .getSize() |" + entry.getSize() + "|");
            System.out.println("// __ .getLastModifiedDate() |" + entry.getLastModifiedDate() + "|");
            ++iIx;
        }
    } catch (IOException IOX) {
        IOX.printStackTrace(System.err);
    }
}
which, except for the missing name, faithfully printed:
// __ [0]: |org.apache.commons.compress.archivers.sevenz.SevenZArchiveEntry@179d3b25|
// __ .getName() |null|
// __ .getSize() |223674511360|
// __ .getLastModifiedDate() |Wed Jun 18 14:02:15 EDT 2008|
Why is it that I can't get the file name?
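(A note on the likely cause, not verified against this particular archive:
7z can omit entry names from the archive headers when an archive is created
from a single stream, in which case SevenZArchiveEntry.getName() returns
null. Commons Compress 1.19+ offers SevenZFile.getDefaultName(), which
derives a name from the archive's own file name; a minimal fallback inside
the loop above might look like:)

String name = entry.getName();
if (name == null) {
    // Assumption: Commons Compress >= 1.19. For "wikipedia-en-html.tar.7z"
    // getDefaultName() derives "wikipedia-en-html.tar".
    name = SvnZFl.getDefaultName();
}
System.out.println("// __ name (with fallback) |" + name + "|");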
Also, if OO works the way I expect, I should be able to access and process
the contained file by addressing it like this (using an exclamation mark):
wikipedia-en-html.tar.7z!wikipedia-en-html.tar
So, at this point I should be able to go:
String aIFl = "wikipedia-en-html.tar.7z!wikipedia-en-html.tar";
FileInputStream FISTarK = new FileInputStream(new File(aIFl));
TarArchiveInputStream tarInput = new TarArchiveInputStream(FISTarK);
TarArchiveEntry tArKEnt;
while((tArKEnt=tarInput.getNextTarEntry()) != null){
...
}
right?
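(For what it's worth, a sketch that skips the "!" path syntax entirely and
streams the inner tar straight out of the .7z. It assumes Apache Commons
Compress 1.20+ for SevenZFile.getInputStream() and is untested against this
archive:)

import java.io.File;
import java.io.InputStream;
import org.apache.commons.compress.archivers.sevenz.SevenZArchiveEntry;
import org.apache.commons.compress.archivers.sevenz.SevenZFile;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;

try (SevenZFile sevenZ = new SevenZFile(new File("wikipedia-en-html.tar.7z"))) {
    SevenZArchiveEntry inner = sevenZ.getNextEntry(); // the single contained tar
    try (InputStream tarBytes = sevenZ.getInputStream(inner);
         TarArchiveInputStream tarInput = new TarArchiveInputStream(tarBytes)) {
        TarArchiveEntry tArKEnt;
        while ((tArKEnt = tarInput.getNextTarEntry()) != null) {
            // each entry is one file from the static HTML dump
            System.out.println(tArKEnt.getName() + " (" + tArKEnt.getSize() + " bytes)");
        }
    }
}

(Note that a solid 7z archive has no random access, so this still
decompresses the full LZMA stream sequentially.)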
lbrtchx
TL;DR: Don't panic; the single articles multistream bz2 file for big wikis
will be produced shortly after the new smaller files.
Long version: For big wikis, which already have split-up article files, we
now produce one multistream file per article file. These are then recombined
into a single file later, with a single index file, in the fashion everyone
is used to.
This is part of the speedup work mentioned in the previous email.
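(For anyone wondering how the recombined file is used: the index file lists
lines of the form byte-offset:page-id:page-title, and each bz2 stream in the
data file can be decompressed on its own. A minimal sketch, assuming Commons
Compress, with the dump file name and the offset as hypothetical examples
copied by hand from an index; none of these specifics are from this
announcement:)

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

// Hypothetical offset taken from a line of the multistream index file.
long offset = 123456789L;
try (FileInputStream fis = new FileInputStream("enwiki-20190101-pages-articles-multistream.xml.bz2")) {
    // Jump to the start of the bz2 stream holding the wanted page
    // (the FileInputStream shares its position with its channel).
    fis.getChannel().position(offset);
    // Decompress just that one stream (typically ~100 pages), not the whole dump.
    InputStream oneStream = new BZip2CompressorInputStream(fis);
    oneStream.transferTo(System.out);
}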
Have a good weekend,
Ariel
If you use recompressxml from the mwbzutils package, as of version 0.0.9
(just deployed) it no longer writes bz2-compressed data to stdout by
default; instead it relies on the extension of the output file and will
write gzipped, bz2, or plain-text output accordingly. This means that if
it is directed to write to stdout, the output will be uncompressed. You
can work around this in your scripts by piping recompressxml's stdout
directly to bzip2.
This change came as part of some speedup work; I won't discuss that further
until we see how the next couple of runs go.
Thanks for your understanding.
Ariel