The basic problem is that the page content dumps are ordered by revision
number within each page, which makes good sense for dump users but means
that the addition of a single revision to a page will shift all of the
remaining data, resulting in different compressed blocks. That's going to
be true regardless of the compression type.
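A toy demonstration of that shift (bz2 here, but any compressor behaves the
same way; the revision markup is made up for illustration): inserting one
revision near the start of the input changes essentially all of the
compressed output after the first few header bytes.

```python
import bz2

rev = b"<revision>text %d</revision>"
history = b"".join(rev % i for i in range(100))

original = bz2.compress(history)
# one new revision inserted at the top of the page history
updated = bz2.compress(rev % 999 + history)

# count how many leading compressed bytes the two outputs share
prefix = 0
for a, b in zip(original, updated):
    if a != b:
        break
    prefix += 1
# prefix covers only a few header bytes; everything after it differs,
# so a previously downloaded dump shares almost nothing with the new one
```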
In the not too distant future we might switch over to multi-stream output
files for all page content, fixing the page id range per stream for bz2
files. This might let a user check the current list of page ids against the
previous one and only get the streams with the pages they want, in the
brave new Hadoop-backed object store of my dreams. 7z files are another
matter, and I don't see how we can do better there without rethinking them
altogether.
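A sketch of how that could look from the client side, assuming fixed page-id
ranges per stream (the ranges, sample content, and the streams_for helper
here are all hypothetical):

```python
import bz2

# hypothetical layout: each stream covers a fixed page-id range and is
# independently bz2-compressed, so a client can fetch streams selectively
streams = {
    (1, 100): bz2.compress(b"<page>...pages 1 to 100...</page>"),
    (101, 200): bz2.compress(b"<page>...pages 101 to 200...</page>"),
    (201, 300): bz2.compress(b"<page>...pages 201 to 300...</page>"),
}

def streams_for(page_ids, index):
    """Return only the id ranges whose stream contains a wanted page."""
    return [(lo, hi) for (lo, hi) in index
            if any(lo <= p <= hi for p in page_ids)]

# a client wanting pages 42 and 250 fetches two streams, not the whole dump
wanted = streams_for({42, 250}, streams)
```

Unchanged streams would then compress to identical bytes from one dump run
to the next, so a user comparing page-id lists could skip re-downloading
them entirely.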
Can you describe which dump files you are keeping and why having them in
sequence is useful? Maybe we can find a workaround that will let you get
what you need without keeping a bunch of older files.
Ariel
On Tue, Jul 28, 2020 at 8:48 AM Count Count <countvoncount123456(a)gmail.com>
wrote:
Hi!
The underlying filesystem (ZFS) uses block-level deduplication, so unique
chunks of 128 KiB (the default record size) are stored only once. But the
128 KiB chunks making up successive dumps are mostly unique, since there is
no alignment between them, so deduplication will not help as far as I can
see.
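A scaled-down illustration of that alignment problem (hypothetical 128-byte
blocks standing in for ZFS's 128 KiB records, and synthetic data standing in
for dump content):

```python
import hashlib

def chunks(data, size=128):
    """Hashes of fixed-size blocks, the way block-level dedup sees a file."""
    return {hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)}

# deterministic high-entropy stand-in for dump data (16 KiB)
base = b"".join(hashlib.sha256(str(i).encode()).digest() for i in range(512))

# one byte inserted at the front shifts every block boundary, so no block
# of the new file hashes the same as any block of the old one
shifted = b"X" + base
shared = chunks(base) & chunks(shifted)

# by contrast, data appended at the end leaves earlier blocks aligned,
# and those blocks dedup perfectly
appended = base + b"new revision text"
```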
Best regards,
Count Count
On Tue, Jul 28, 2020 at 3:51 AM griffin tucker <gtucker4.une(a)hotmail.com>
wrote:
I’ve tried using FreeNAS/TrueNAS with a data-deduplication volume to store
multiple sequential dumps; however, it doesn’t seem to save much space at
all. I was hoping someone could point me in the right direction so that I
can download multiple dumps and not have them take up so much room
(uncompressed).
Has anyone tried anything similar and had success with data deduplication?
Is there a guide?
_______________________________________________
Xmldatadumps-l mailing list
Xmldatadumps-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l