Hello, I'm working on a little project to perform cryptographically sound timestamping on Wikipedia snapshots. I'm using the opentimestamps.org service, which by default uses the SHA-256 hash. In order to get the SHA-256 for the timestamp, I need to download each file and compute the hash.
Currently the XML data dumps provide only MD5 and SHA-1 hashes. Both of these hash functions are obsolete: practical collision attacks have been demonstrated against each of them. I'm wondering: would the maintainers of this service be willing to add SHA-256 digests to the dumpstatus and checksum files going forward? SHA-256 is still cryptographically sound and would let me verify that I have the correct hash for timestamping.
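For concreteness, here's roughly what my client-side step looks like today (a minimal Python sketch; the filename and published digest below are placeholders, with the real SHA-1 normally taken from the dump's sha1sums file or dumpstatus):

import hashlib

def file_digests(path, chunk_size=1 << 20):
    # Chunked reads keep memory flat even on multi-gigabyte dump files.
    sha1, sha256 = hashlib.sha1(), hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            sha1.update(chunk)
            sha256.update(chunk)
    return sha1.hexdigest(), sha256.hexdigest()

# Placeholders: real values come from the dump's published checksums.
published_sha1 = "0000000000000000000000000000000000000000"
sha1, sha256 = file_digests("enwiki-20240401-pages-articles.xml.bz2")
assert sha1 == published_sha1, "downloaded file does not match published SHA-1"
print(sha256)  # digest submitted to the timestamping service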
Thanks in advance!
Best regards, Arthur
Adding new checksum files may or may not be a big deal. If the snapshot hosts have enough memory to keep the files in the page cache a bit longer, so they don't need to be read back from disk, computing the new checksums may be very fast.
https://wikitech.wikimedia.org/wiki/Dumps has more information on the setup.
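To illustrate the point, a hypothetical sketch (not the actual dumps code) of how a SHA-256 digest could piggyback on the existing checksum pass, so each file is still read from disk, or cache, only once:

import hashlib

def all_digests(path, algos=("md5", "sha1", "sha256"), chunk_size=1 << 20):
    # One read pass feeds every digest, so the extra SHA-256 costs no extra I/O.
    hashers = {name: hashlib.new(name) for name in algos}
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            for h in hashers.values():
                h.update(chunk)
    return {name: h.hexdigest() for name, h in hashers.items()}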
On 20/04/24 03:40, Arthur D. Edelstein wrote:
> In order to get the SHA-256 for the timestamp, I need to download each file and compute the hash.
I understand it's suboptimal, but if you're in a rush you can also use Toolforge and create a tool, a bit like https://dump-torrents.toolforge.org/ , to run sha256sum on the appropriate files (which are mounted even on the bastion host). I/O tends to be rather slow, but it may still be faster than downloading the files over your network connection.
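Something along these lines would work (a rough sketch; the mount point below is an assumption, so double-check where the dump files actually live on your Toolforge host):

import hashlib
from pathlib import Path

# Assumed mount point on Toolforge; verify before relying on it.
DUMPS = Path("/public/dumps/public/enwiki/20240401")

for path in sorted(DUMPS.glob("enwiki-*.bz2")):
    h = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(1 << 20):
            h.update(chunk)
    print(f"{h.hexdigest()}  {path.name}")  # matches sha256sum's output format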
Best, Federico
Many thanks, Federico! I am taking this approach.
Arthur,
The current Dumps infrastructure is in maintenance mode, but it'd definitely be nice to consider SHA-256 for Dumps 2.0.
What Federico mentions seems like the best choice if you need this now. Dumps 2.0 will take a long while, but do feel free to open a task at https://phabricator.wikimedia.org/ and tag it with "Dumps 2.0". Please describe your use case over there as well.
Thanks, -xabriel
Hi Xabriel,
Thanks! I opened a task: https://phabricator.wikimedia.org/T363184
Best regards, Arthur