If you extract the records from the XML into individual files, then ZFS
deduplication should take effect. I'm not sure this is a great way to
handle the data, because you will potentially waste a lot of disk space on
"slack" (the unused tail of each file's last block). (This is actually a
good use case for slack-free storage such as tail packing, since the files
won't grow, but that would break the block alignment that deduplication
depends on.) You will also have to decide what to call each file. If you
use page names you will have duplicates; if you use revision IDs then you
have effectively done the deduplication yourself, and you will need to
create an index of some form. A nice alternative would be to parse each
file and skip revision IDs you have already seen: the first file you
process will not shrink, but subsequent files will shrink a lot. You will,
however, be missing many revisions if you only use pages-current dumps.
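As a starting point, here is a minimal Python sketch of that approach. It
assumes an uncompressed pages-meta-history dump using the export-0.10
schema namespace (check your dump's <mediawiki> element and adjust NS);
the function and file names are placeholders of my own:

    import os
    import xml.etree.ElementTree as ET

    # Namespace of the dump schema; adjust to match your dump file.
    NS = "{http://www.mediawiki.org/xml/export-0.10/}"

    def split_revisions(dump_path, out_dir, seen_ids):
        # seen_ids is the external index of revision IDs already
        # extracted from earlier dumps; skipping them is the
        # deduplication step.
        os.makedirs(out_dir, exist_ok=True)
        for _, elem in ET.iterparse(dump_path, events=("end",)):
            if elem.tag != NS + "revision":
                continue
            rev_id = elem.findtext(NS + "id")
            if rev_id and rev_id not in seen_ids:
                seen_ids.add(rev_id)
                text = elem.findtext(NS + "text") or ""
                out_path = os.path.join(out_dir, rev_id + ".txt")
                with open(out_path, "w", encoding="utf-8") as f:
                    f.write(text)
            elem.clear()  # discard processed elements to keep memory flat

    seen = set()  # persist this between dumps (e.g. a dbm file)
    split_revisions("enwikinews-pages-meta-history.xml", "revisions", seen)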
On Wed, 29 Jul 2020 at 16:33, Ariel Glenn WMF <ariel(a)wikimedia.org> wrote:
Are you looking for the most current version of each page? Do you want
articles, or also talk pages, user pages and the rest?
In any case, there are a couple of projects that might be of interest to
you.
One is the so-called adds/changes dumps, available here:
https://dumps.wikimedia.org/other/incr/
The other is work being done on producing HTML dumps; you may follow the
progress of that on Phabricator:
https://phabricator.wikimedia.org/T254275
Note that parsing wikitext to generate HTML is quite resource-intensive;
you might look at the Kiwix project's mwoffliner tool for more information
about how they do it:
https://github.com/openzim/mwoffliner
Ariel
On Wed, Jul 29, 2020 at 1:48 PM griffin tucker <gtucker4.une(a)hotmail.com> wrote:
I figured I would decompress the .bz2 and .gz files and that subsequent
downloads of dumps would only store the changes, disregarding the
compressed .bz2, .gz, and .7z files.
My purposes are just experimenting/learning (I’m a first-year comp-sci
student), and I really like the idea of downloading multiple dumps without
them taking up much more space.
My plan was to download a few dumps of enwikinews as a test, and then go
for enwikipedia when it’s tested successfully.
I’ve just been doing this locally; however, I was planning on using cloud
virtual machines like AWS, and then moving the data to Glacier for
long-term storage (copies of the massive volumes).
I’ve tried following the guides for using MediaWiki to reproduce the
dumps, but it runs into errors after only a few thousand pages. I was going
to reproduce each dump and then scrape locally for .html files and store
those. Images would be a bonus.
Then, every month I want to run a script that would do all of this
automatically, storing to a dedup volume.
That’s my plan, anyway.
*From:* Ariel Glenn WMF <ariel(a)wikimedia.org>
*Sent:* Wednesday, 29 July 2020 4:49 PM
*To:* Count Count <countvoncount123456(a)gmail.com>
*Cc:* griffin tucker <gtucker4.une(a)hotmail.com>;
xmldatadumps-l(a)lists.wikimedia.org
*Subject:* Re: [Xmldatadumps-l] Has anyone had success with data
deduplication?
The basic problem is that the page content dumps are ordered by revision
number within each page, which makes good sense for dumps users but means
that the addition of a single revision to a page will shift all of the
remaining data, resulting in different compressed blocks. That's going to
be true regardless of the compression type.
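To see the effect concretely, here is a small self-contained Python
demonstration on synthetic data (not real dump files) of how one inserted
revision shifts every later fixed-size block so that none match at the
same offset:

    import hashlib

    def block_hashes(data, block_size=128 * 1024):
        # Hash fixed-size blocks the way a block-level deduplicator would.
        return [hashlib.sha256(data[i:i + block_size]).digest()
                for i in range(0, len(data), block_size)]

    old = b"".join(b"<revision><id>%d</id></revision>" % i
                   for i in range(100000))
    # Insert one new revision near the start; everything after it shifts.
    new = old[:1000] + b"<revision><id>999999</id></revision>" + old[1000:]

    old_b, new_b = block_hashes(old), block_hashes(new)
    matches = sum(a == b for a, b in zip(old_b, new_b))
    print(matches, "of", len(old_b), "blocks still match")  # should print 0

The same shift propagates into the compressed output, which is why the
compressed blocks diverge no matter which compressor is used.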
In the not too distant future we might switch over to multi-stream output
files for all page content, fixing the page id range per stream for bz2
files. This might let a user check the current list of page ids against the
previous one and only get the streams with the pages they want, in the
brave new Hadoop-backed object store of my dreams. 7z files are another
matter, and I don't see how we can do better there without rethinking them
altogether.
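To make the multistream idea concrete, here is a hedged Python sketch of
the client-side check; the index format shown (a fixed page-id range
mapped to a checksum of its bz2 stream) is purely hypothetical, since no
such files exist yet:

    def changed_streams(prev_index, curr_index):
        # Each index maps a fixed (first_page_id, last_page_id) range to
        # a checksum of the bz2 stream holding those pages. Only ranges
        # whose checksum changed (or that are new) need re-fetching.
        return [rng for rng, csum in curr_index.items()
                if prev_index.get(rng) != csum]

    prev = {(1, 1000): "c3ab", (1001, 2000): "9bc2"}
    curr = {(1, 1000): "c3ab", (1001, 2000): "77d0"}  # second stream changed
    print(changed_streams(prev, curr))  # -> [(1001, 2000)]

Because the page-id range per stream is fixed, an unchanged range hashes
identically between runs, which is exactly the alignment property that the
current single-stream dumps lack.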
Can you describe which dump files you are keeping and why having them in
sequence is useful? Maybe we can find a workaround that will let you get
what you need without keeping a bunch of older files.
Ariel
On Tue, Jul 28, 2020 at 8:48 AM Count Count <countvoncount123456(a)gmail.com> wrote:
Hi!
The underlying filesystem (ZFS) uses block-level deduplication, so unique
chunks of 128 KiB (the default record size) are stored only once. But the
128 KiB chunks making up successive dumps are mostly unique, since there
is no alignment between them, so deduplication will not help as far as I
can see.
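A quick way to verify this on real files is to hash fixed 128 KiB chunks
yourself and count repeats, which estimates the best case for block-level
dedup. A minimal Python sketch, with placeholder file names standing in
for two consecutive uncompressed dumps:

    import hashlib
    from collections import Counter

    def chunk_hashes(path, block_size=128 * 1024):
        # Yield a hash per fixed-size chunk, mimicking ZFS-style blocks.
        with open(path, "rb") as f:
            while chunk := f.read(block_size):
                yield hashlib.sha256(chunk).digest()

    counts = Counter()
    for path in ["enwikinews-20200701.xml", "enwikinews-20200720.xml"]:
        counts.update(chunk_hashes(path))

    total, unique = sum(counts.values()), len(counts)
    print(f"dedup would store roughly {100 * unique // total}% of the data")

If nearly every chunk is unique, enabling dedup on the dataset will only
cost RAM for the dedup table without saving space.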
Best regards,
Count Count
On Tue, Jul 28, 2020 at 3:51 AM griffin tucker <gtucker4.une(a)hotmail.com> wrote:
I’ve tried using FreeNAS/TrueNAS with a data-deduplication volume to store
multiple sequential dumps; however, it doesn’t seem to save much space at
all. I was hoping someone could point me in the right direction so that I
can download multiple (uncompressed) dumps without them taking up so much
room.
Has anyone tried anything similar and had success with data deduplication?
Is there a guide?
_______________________________________________
Xmldatadumps-l mailing list
Xmldatadumps-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l