Is there any reason these are tar.gz archives of a single file rather than
simply bzip2 of the file contents? The Wikidata dumps are bzip2 of one JSON
file, which allows parallel decompression. Having both tar (why tar up one
file at all?) and gzip in there means one has to decompress the whole thing
serially before any parallel processing can start. Is there some other
approach I am missing?
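Concretely, the best I can see to do with the current layout is stream
through the single gzip member and only fan out to workers after decoding
(a minimal sketch; the filename is just illustrative):

    import json
    import tarfile

    # With a .tar.gz there is one gzip stream, so the archive has to be
    # read serially; parallelism is only possible after each line is
    # decoded. Filename below is illustrative, not the real dump name.
    DUMP = "enwiki-NS0-20211020-ENTERPRISE-HTML.json.tar.gz"

    with tarfile.open(DUMP, "r:gz") as tar:
        member = tar.next()                    # the archive holds a single file
        with tar.extractfile(member) as ndjson:
            for line in ndjson:                # one JSON object per line
                article = json.loads(line)
                # ... hand `article` off to a worker pool here ...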
The Wikipedia XML dumps are also offered as multistream bzip2 with an
accompanying index file. That could be nice here too: with an index file one
could jump straight to the JSON line for a specific article without
decompressing everything before it.
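To make the idea concrete, here is a rough sketch of the reader side,
modelled on how the existing pages-articles-multistream index
(offset:page_id:title lines) is used today; the filename, offset, and
one-JSON-object-per-line layout are my assumptions about a hypothetical
Enterprise equivalent:

    import bz2
    import json

    def read_stream(dump_path, offset, chunk=256 * 1024):
        """Decompress the single bz2 stream that starts at byte `offset`."""
        decomp = bz2.BZ2Decompressor()       # handles exactly one bz2 stream
        parts = []
        with open(dump_path, "rb") as f:
            f.seek(offset)                    # offset comes from the index file
            while not decomp.eof:
                data = f.read(chunk)
                if not data:                  # ran off the end of the file
                    break
                parts.append(decomp.decompress(data))
        return b"".join(parts)

    # Each stream would hold a small batch of JSON lines; the index entry
    # tells you which line is the article you asked for.
    for line in read_stream("enterprise-html-multistream.json.bz2", 123456789).splitlines():
        article = json.loads(line)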
Also, is there an API endpoint or Special page that can return the same JSON
for a single Wikipedia page? The JSON structure looks very useful on its own
(i.e., outside of the bulk dumps).
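For comparison, fetching the HTML of a single page is already possible
through the REST API; what I'm after is the JSON envelope the dumps put
around it. A rough sketch of the existing per-page call (title and
User-Agent string are just examples):

    import requests

    # Existing per-page call: the public REST API returns the Parsoid HTML
    # for one article, though not the Enterprise JSON structure around it.
    TITLE = "Albert Einstein"
    resp = requests.get(
        f"https://en.wikipedia.org/api/rest_v1/page/html/{TITLE.replace(' ', '_')}",
        headers={"User-Agent": "dump-format-question/0.1 (research use)"},
    )
    resp.raise_for_status()
    html = resp.text                          # article body as Parsoid HTML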
On Tue, Oct 19, 2021 at 4:57 PM Ariel Glenn WMF <ariel(a)wikimedia.org> wrote:
I am pleased to announce that Wikimedia Enterprise's HTML dumps for
October 17-18th are available for public download; see
for more information. We
expect to make updated versions of these files available around the 1st/2nd
of the month and the 20th/21st of the month, following the cadence of the
standard SQL/XML dumps.
This is still an experimental service, so there may be hiccups from time to
time. Please be patient and report issues as you find them. Thanks!
Ariel "Dumps Wrangler" Glenn
See https://www.mediawiki.org/wiki/Wikimedia_Enterprise for much more about
Wikimedia Enterprise and its API.