I figured I would decompress the .bz2 and .gz files and that subsequent downloads of dumps
would only store the changes, disregarding the compressed .bz2, .gz, and .7z files.
My purposes are just experimenting/learning (I’m a first-year comp-sci student) and I
really like the idea of downloading multiple dumps without them taking up much more space.
My plan was to download a few dumps of enwikinews as a test, and then go for enwikipedia
when it’s tested successfully.
I’ve just been doing this locally; however, I was planning on using cloud virtual machines
like AWS, and then moving copies of the massive volumes to Glacier for long-term storage.
I’ve tried following the guides for using MediaWiki to reproduce the dumps, but it runs
into errors after only a few thousand pages. I was going to reproduce each dump, then
scrape it locally for .html files and store those. Images would be a bonus.
Then, every month I want to run a script that would do all of this automatically, storing
to a dedup volume.
That’s my plan, anyway.
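To make that concrete, the monthly job would look roughly like this (the URL pattern, wiki name, and mount point are placeholders; I haven’t tested this end to end):

# Rough monthly job: fetch the latest enwikinews pages-articles dump,
# decompress it on the fly, and store the uncompressed XML on the dedup volume.
import bz2
import urllib.request
from datetime import date
from pathlib import Path

WIKI = "enwikinews"
DUMP_URL = (f"https://dumps.wikimedia.org/{WIKI}/latest/"
            f"{WIKI}-latest-pages-articles.xml.bz2")
DEDUP_VOLUME = Path("/mnt/dedup")  # ZFS dataset mounted with dedup=on (assumed)

def fetch_and_store():
    target = DEDUP_VOLUME / f"{WIKI}-{date.today():%Y%m}-pages-articles.xml"
    decompressor = bz2.BZ2Decompressor()  # pages-articles is a single bz2 stream
    with urllib.request.urlopen(DUMP_URL) as response, open(target, "wb") as out:
        while True:
            chunk = response.read(1 << 20)
            if not chunk:
                break
            out.write(decompressor.decompress(chunk))
    return target

if __name__ == "__main__":
    print("stored", fetch_and_store())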
From: Ariel Glenn WMF <ariel(a)wikimedia.org>
Sent: Wednesday, 29 July 2020 4:49 PM
To: Count Count <countvoncount123456(a)gmail.com>
Cc: griffin tucker <gtucker4.une(a)hotmail.com>; xmldatadumps-l(a)lists.wikimedia.org
Subject: Re: [Xmldatadumps-l] Has anyone had success with data deduplication?
The basic problem is that the page content dumps are ordered by revision number within
each page, which makes good sense for dumps users but means that the addition of a single
revision to a page will shift all of the remaining data, resulting in different compressed
blocks. That's going to be true regardless of the compression type.
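Here is a tiny way to see the effect with synthetic text (nothing to do with the real dump format): add one small revision early in the file, recompress, and check how soon the two compressed outputs diverge. The example uses bz2, but gzip behaves the same way.

# Toy illustration: one extra "revision" shifts everything after it, so the
# recompressed output differs almost from the insertion point onward.
import bz2
import itertools

pages = "".join(f"<page><id>{i}</id><text>revision text {i}</text></page>\n"
                for i in range(100_000))
old = bz2.compress(pages.encode())
new = bz2.compress(pages.replace("<id>7</id>",
                                 "<id>7</id><text>one more revision</text>", 1).encode())

# First byte offset at which the two compressed files diverge.
diverge = next(i for i, (a, b) in enumerate(itertools.zip_longest(old, new)) if a != b)
print(f"compressed sizes: {len(old)} vs {len(new)} bytes")
print(f"outputs identical for only the first {diverge} bytes")
# Almost the whole compressed file changes, so block-level dedup finds
# nothing to share between the two dumps.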
In the not too distant future we might switch over to multi-stream output files for all
page content, fixing the page id range per stream for bz2 files. This might let a user
check the current list of page ids against the previous one and only get the streams with
the pages they want, in the brave new Hadoop-backed object store of my dreams. 7z files
are another matter, and I don't see how we can do better there without rethinking them
altogether.
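For what it's worth, the existing pages-articles-multistream files plus their index (one offset:page_id:title line per page) already let you pull out a single stream by byte range. A rough sketch, with example file names:

# Fetch just the bz2 stream that contains a given page, using the
# multistream index and an HTTP Range request. File names are examples.
import bz2
import urllib.request

INDEX_FILE = "enwiki-latest-pages-articles-multistream-index.txt"  # decompressed locally
DUMP_URL = ("https://dumps.wikimedia.org/enwiki/latest/"
            "enwiki-latest-pages-articles-multistream.xml.bz2")

def stream_range_for_page(page_id):
    """Byte range (start, end) of the bz2 stream that contains page_id."""
    offsets, target = [], None
    with open(INDEX_FILE) as index:
        for line in index:
            offset, pid, _title = line.split(":", 2)
            offset = int(offset)
            if not offsets or offsets[-1] != offset:
                offsets.append(offset)          # index lines come in file order
            if int(pid) == page_id:
                target = offset
    if target is None:
        raise ValueError(f"page {page_id} not in index")
    end = next((o for o in offsets if o > target), None)  # None means last stream
    return target, end

def fetch_stream(page_id):
    start, end = stream_range_for_page(page_id)
    byte_range = f"bytes={start}-" if end is None else f"bytes={start}-{end - 1}"
    request = urllib.request.Request(DUMP_URL, headers={"Range": byte_range})
    with urllib.request.urlopen(request) as response:
        return bz2.decompress(response.read())  # each stream decompresses on its own

# fetch_stream(12) returns the <page> blocks sharing a stream with page id 12.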
Can you describe which dump files you are keeping and why having them in sequence is
useful? Maybe we can find a workaround that will let you get what you need without keeping
a bunch of older files.
Ariel
On Tue, Jul 28, 2020 at 8:48 AM Count Count <countvoncount123456@gmail.com> wrote:
Hi!
The underlying filesystem (ZFS) uses block-level deduplication, so unique chunks of 128 KiB
(the default value) are only stored once. The 128 KiB chunks making up the dumps are mostly
unique, since there is no alignment, so deduplication will not help as far as I can see.
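A toy illustration of the alignment problem (purely synthetic data, not actual dump content): a single unaligned insertion shifts every later byte, so almost none of the fixed 128 KiB blocks hash the same afterwards.

# Split two nearly identical byte strings into 128 KiB blocks and count
# how many blocks still match after one small insertion near the start.
import hashlib

BLOCK = 128 * 1024  # ZFS default recordsize used for dedup

def block_hashes(data):
    return [hashlib.sha256(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)]

old = b"".join(b"<page><id>%d</id>some text</page>\n" % i for i in range(500_000))
new = old[:1_000_000] + b"<page>one new revision</page>\n" + old[1_000_000:]

old_hashes, new_hashes = block_hashes(old), block_hashes(new)
matching = sum(a == b for a, b in zip(old_hashes, new_hashes))
print(f"{matching} of {len(old_hashes)} blocks identical")
# Only the handful of blocks before the insertion still line up; everything
# after it is shifted off its 128 KiB boundary and gets stored again.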
Best regards,
Count Count
On Tue, Jul 28, 2020 at 3:51 AM griffin tucker <gtucker4.une@hotmail.com> wrote:
I’ve tried using freenas/truenas with a data deduplication volume to store multiple
sequential dumps, however it doesn’t seem to save much space at all – I was hoping someone
could point me in the right direction so that I can download multiple dumps and not have
it take up so much room (uncompressed).
Has anyone tried anything similar and had success with data deduplication?
Is there a guide?
_______________________________________________
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l