Is there any reason these are tar.gz archives of a single file rather than
simply bzip2 of the file contents? The Wikidata dumps are bzip2 of one JSON
file, which allows parallel decompression. Having both tar (why tar up one
file at all?) and gzip in there means one has to decompress the whole thing
serially before any parallel processing can start. Is there some other
approach I am missing?
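Concretely, the best I can see to do with the current layout is stream
through the single gzip member and only fan out to workers after decoding
(a minimal sketch; the filename is just illustrative):

    import json
    import tarfile

    # With a .tar.gz there is one gzip stream, so the archive has to be
    # read serially; parallelism is only possible after each line is
    # decoded. Filename below is illustrative, not the real dump name.
    DUMP = "enwiki-NS0-20211020-ENTERPRISE-HTML.json.tar.gz"

    with tarfile.open(DUMP, "r:gz") as tar:
        member = tar.next()                    # the archive holds a single file
        with tar.extractfile(member) as ndjson:
            for line in ndjson:                # one JSON object per line
                article = json.loads(line)
                # ... hand `article` off to a worker pool here ...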
The Wikipedia XML dumps are also offered as multistream bzip2 with an
accompanying index file. That could be nice here too: with an index file one
could jump straight to the JSON line for a specific article without
decompressing everything before it.
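To make the idea concrete, here is a rough sketch of the reader side,
modelled on how the existing pages-articles-multistream index
(offset:page_id:title lines) is used today; the filename, offset, and
one-JSON-object-per-line layout are my assumptions about a hypothetical
Enterprise equivalent:

    import bz2
    import json

    def read_stream(dump_path, offset, chunk=256 * 1024):
        """Decompress the single bz2 stream that starts at byte `offset`."""
        decomp = bz2.BZ2Decompressor()       # handles exactly one bz2 stream
        parts = []
        with open(dump_path, "rb") as f:
            f.seek(offset)                    # offset comes from the index file
            while not decomp.eof:
                data = f.read(chunk)
                if not data:                  # ran off the end of the file
                    break
                parts.append(decomp.decompress(data))
        return b"".join(parts)

    # Each stream would hold a small batch of JSON lines; the index entry
    # tells you which line is the article you asked for.
    for line in read_stream("enterprise-html-multistream.json.bz2", 123456789).splitlines():
        article = json.loads(line)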
Also, is there an API endpoint or Special page that can return the same JSON
for a single Wikipedia page? The JSON structure looks very useful on its own
(i.e., outside of the bulk dumps).
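For comparison, fetching the HTML of a single page is already possible
through the REST API; what I'm after is the JSON envelope the dumps put
around it. A rough sketch of the existing per-page call (title and
User-Agent string are just examples):

    import requests

    # Existing per-page call: the public REST API returns the Parsoid HTML
    # for one article, though not the Enterprise JSON structure around it.
    TITLE = "Albert Einstein"
    resp = requests.get(
        f"https://en.wikipedia.org/api/rest_v1/page/html/{TITLE.replace(' ', '_')}",
        headers={"User-Agent": "dump-format-question/0.1 (research use)"},
    )
    resp.raise_for_status()
    html = resp.text                          # article body as Parsoid HTML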
On Tue, Oct 19, 2021 at 4:57 PM Ariel Glenn WMF <ariel(a)wikimedia.org> wrote:
I am pleased to announce that Wikimedia Enterprise's HTML dumps for
October 17-18th are available for public download; see
for more information. We
expect to make updated versions of these files available around the 1st/2nd
of the month and the 20th/21st of the month, following the cadence of the
standard SQL/XML dumps.
This is still an experimental service, so there may be hiccups from time to
time. Please be patient and report issues as you find them. Thanks!
Ariel "Dumps Wrangler" Glenn
See https://www.mediawiki.org/wiki/Wikimedia_Enterprise for much more about
Wikimedia Enterprise and its API.