Greetings XML Dump users and contributors!
This is your automatic monthly Dumps FAQ update email. This update
contains figures for the 20181101 full revision history content run.
We are currently dumping 916 projects in total.
---------------------
Stats for bnwikivoyage on date 20181101
Total size of page content dump files for articles, current content only (bytes):
6587638
Total size of page content dump files for all pages, current content only (bytes):
6825818
Total size of page content dump files for all pages, all revisions (bytes):
99741838
---------------------
Stats for enwiki on date 20181101
Total size of page content dump files for articles, current content only (bytes):
69650333728
Total size of page content dump files for all pages, current content only (bytes):
155512399552
Total size of page content dump files for all pages, all revisions (bytes):
18210409893326
---------------------
Sincerely,
Your friendly Wikimedia Dump Info Collector
Greetings XML Dump users and contributors!
Looks like https://wikimedia.bytemark.co.uk/ has not been updated since
2017-11-26. Maybe somebody should remove it from the mirror list, or
contact Bytemark to notify them?
Best regards,
Mariusz "Nikow" Klinikowski.
Greetings XML Dump users and contributors!
This is your automatic monthly Dumps FAQ update email. This update
contains figures for the 20190101 full revision history content run.
We are currently dumping 917 projects in total.
---------------------
Stats for rmywiki on date 20190101
Total size of page content dump files for articles, current content only (bytes):
2372738
Total size of page content dump files for all pages, current content only (bytes):
5012147
Total size of page content dump files for all pages, all revisions (bytes):
225293841
---------------------
Stats for enwiki on date 20190101
Total size of page content dump files for articles, current content only (bytes):
70514635946
Total size of page content dump files for all pages, current content only (bytes):
157464548890
Total size of page content dump files for all pages, all revisions (bytes):
18472280381222
---------------------
Sincerely,
Your friendly Wikimedia Dump Info Collector
Folks may have noticed already that the links presented for download of
pages-articles-multistream dumps are incorrect on the web pages for big
wikis. The files exist for download, but the wrong links were created.
I'll be looking into that and fixing it up over the next few days; in the
meantime you can manually download the files by specifying the right name.
Apologies for the inconvenience.
Ariel
I downloaded:
http://dumps.wikimedia.your.org/other/static_html_dumps/2008-06/en/wikipedi…
using wget and it seems to be fine:
$ _IFL="wikipedia-en-html.tar.7z"
$ ls -l "${_IFL}"
-rw-r--r-- 1 user user 15363543213 Jun 21 2008 wikipedia-en-html.tar.7z
$ file "${_IFL}"
wikipedia-en-html.tar.7z: 7-zip archive data, version 0.2
$ md5sum -b "${_IFL}"
03ce695cbf32a3f8636fa8d3f9c7d12e *wikipedia-en-html.tar.7z
$ sha256sum -b "${_IFL}"
c2794b6371a05017f03e2a345730fd763b1052872290b5c78763978a0b43c747 *wikipedia-en-html.tar.7z
$ sha512sum -b "${_IFL}"
d52a737ceca25ef18272ba70a4a56000a7a0bff92653fb462674333a0855f397c892b8aeb2e11206d391ba4cca48d46f5814d92db4d2096467519de38c5a189c *wikipedia-en-html.tar.7z
$ 7z l "${_IFL}"
7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,2 CPUs Intel(R) Pentium(R) CPU B940 @ 2.00GHz (206A7),ASM)
Scanning the drive for archives:
1 file, 15363543213 bytes (15 GiB)
Listing archive: wikipedia-en-html.tar.7z
--
Path = wikipedia-en-html.tar.7z
Type = 7z
Physical Size = 15363543213
Headers Size = 100
Method = LZMA:22
Solid = -
Blocks = 1
Date Time Attr Size Compressed Name
------------------- ----- ------------ ------------ ------------------------
2008-06-18 13:02:15 ..... 223674511360 15363543113 wikipedia-en-html.tar
------------------- ----- ------------ ------------ ------------------------
2008-06-18 13:02:15 223674511360 15363543113 1 files
$
But I can't get the name of the compressed/contained file, even though
ark and 7z show it. Here is my simple piece of code:
import java.io.File;
import java.io.IOException;
import org.apache.commons.compress.archivers.sevenz.SevenZArchiveEntry;
import org.apache.commons.compress.archivers.sevenz.SevenZFile;

String aIFl = "wikipedia-en-html.tar.7z";
File I7ZKFl = new File(aIFl);
if (I7ZKFl.exists()) {
    // open the archive and walk its entry metadata
    try (SevenZFile SvnZFl = new SevenZFile(I7ZKFl)) {
        SevenZArchiveEntry entry;
        int iIx = 0;
        while ((entry = SvnZFl.getNextEntry()) != null) {
            System.out.println("// __ [" + iIx + "]: |" + entry + "|");
            System.out.println("// __ .getName() |" + entry.getName() + "|");
            System.out.println("// __ .getSize() |" + entry.getSize() + "|");
            System.out.println("// __ .getLastModifiedDate() |" + entry.getLastModifiedDate() + "|");
            ++iIx;
        }
    } catch (IOException IOX) {
        IOX.printStackTrace(System.err);
    }
}
which, except for the missing name, faithfully printed:
// __ [0]: |org.apache.commons.compress.archivers.sevenz.SevenZArchiveEntry@179d3b25|
// __ .getName() |null|
// __ .getSize() |223674511360|
// __ .getLastModifiedDate() |Wed Jun 18 14:02:15 EDT 2008|
Why is it that I can't get the file name?
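(A note on the likely cause, not verified against this particular archive:
7z can omit entry names from the archive headers when an archive is created
from a single stream, in which case SevenZArchiveEntry.getName() returns
null. Commons Compress 1.19+ offers SevenZFile.getDefaultName(), which
derives a name from the archive's own file name; a minimal fallback inside
the loop above might look like:)

String name = entry.getName();
if (name == null) {
    // Assumption: Commons Compress >= 1.19. For "wikipedia-en-html.tar.7z"
    // getDefaultName() derives "wikipedia-en-html.tar".
    name = SvnZFl.getDefaultName();
}
System.out.println("// __ name (with fallback) |" + name + "|");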
Also, if OO works the way I expect, I should be able to access and process
the contained file by addressing it like this (using an exclamation mark):
wikipedia-en-html.tar.7z!wikipedia-en-html.tar
So, at this point I should be able to go:
String aIFl = "wikipedia-en-html.tar.7z!wikipedia-en-html.tar";
FileInputStream FISTarK = new FileInputStream(new File(aIFl));
TarArchiveInputStream tarInput = new TarArchiveInputStream(FISTarK);
TarArchiveEntry tArKEnt;
while((tArKEnt=tarInput.getNextTarEntry()) != null){
...
}
right?
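(For what it's worth, a sketch that skips the "!" path syntax entirely and
streams the inner tar straight out of the .7z. It assumes Apache Commons
Compress 1.20+ for SevenZFile.getInputStream() and is untested against this
archive:)

import java.io.File;
import java.io.InputStream;
import org.apache.commons.compress.archivers.sevenz.SevenZArchiveEntry;
import org.apache.commons.compress.archivers.sevenz.SevenZFile;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;

try (SevenZFile sevenZ = new SevenZFile(new File("wikipedia-en-html.tar.7z"))) {
    SevenZArchiveEntry inner = sevenZ.getNextEntry(); // the single contained tar
    try (InputStream tarBytes = sevenZ.getInputStream(inner);
         TarArchiveInputStream tarInput = new TarArchiveInputStream(tarBytes)) {
        TarArchiveEntry tArKEnt;
        while ((tArKEnt = tarInput.getNextTarEntry()) != null) {
            // each entry is one file from the static HTML dump
            System.out.println(tArKEnt.getName() + " (" + tArKEnt.getSize() + " bytes)");
        }
    }
}

(Note that a solid 7z archive has no random access, so this still
decompresses the full LZMA stream sequentially.)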
lbrtchx
TL;DR: Don't panic; the single articles multistream bz2 file for big wikis
will be produced shortly after the new smaller files.
Long version: For big wikis, which already have split-up article files, we
now produce one multistream file per article file. These are then recombined
into a single file later, with a single index file, in the fashion everyone
is used to.
This is part of the speedup work mentioned in the previous email.
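(For anyone wondering how the recombined file is used: the index file lists
lines of the form byte-offset:page-id:page-title, and each bz2 stream in the
data file can be decompressed on its own. A minimal sketch, assuming Commons
Compress, with the dump file name and the offset as hypothetical examples
copied by hand from an index; none of these specifics are from this
announcement:)

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

// Hypothetical offset taken from a line of the multistream index file.
long offset = 123456789L;
try (FileInputStream fis = new FileInputStream("enwiki-20190101-pages-articles-multistream.xml.bz2")) {
    // Jump to the start of the bz2 stream holding the wanted page
    // (the FileInputStream shares its position with its channel).
    fis.getChannel().position(offset);
    // Decompress just that one stream (typically ~100 pages), not the whole dump.
    InputStream oneStream = new BZip2CompressorInputStream(fis);
    oneStream.transferTo(System.out);
}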
Have a good weekend,
Ariel
If you use recompressxml from the mwbzutils package, as of version 0.0.9
(just deployed) it no longer writes bz2-compressed data to stdout by
default; instead it relies on the extension of the output file and will
write gzipped, bz2, or plain-text output accordingly. This means that if
it is directed to write to stdout, the output will be uncompressed. You
can work around this in your scripts by piping recompressxml's stdout
directly to bzip2.
This change came as part of some speedup work; I won't discuss that further
until we see how the next couple of runs go.
Thanks for your understanding.
Ariel