On Fri, 5 May 2023, at 22:53, Evan Lloyd New-Schmidt wrote:
Hi, I'm starting a project that will involve repeated processing of HTML
Wikipedia articles.
Using the Enterprise dumps seems like it would be much simpler than
converting the XML dumps, but I don't know what the "experimental"
status really means.
Hi,
From my experience working with the Wiktionary HTML dumps I can say that the data quality
is quite poor: there are stale and missing entries
(https://phabricator.wikimedia.org/T305407).
There are also entire namespaces excluded from the dumps, and more recently there have
been issues with the dumps not getting updated.
So it depends on what kind of processing you need to do. In general I find parsing the
HTML much easier than converting the wikitext in the XML dumps, and hopefully they'll
manage to sort out the problems.
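For what it's worth, here's a minimal sketch of the kind of parsing I mean. The Enterprise HTML dump files are NDJSON (one JSON object per article); the field names used below ("name", "article_body" → "html") are my assumption based on the dumps I've worked with, so check them against your own files:

```python
import json

def extract_articles(lines):
    """Yield (title, html) pairs from NDJSON dump lines, skipping bad rows."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate truncated or corrupt lines in large dumps
        title = record.get("name")
        html = record.get("article_body", {}).get("html")
        if title and html:
            yield title, html

# Tiny inline sample standing in for a real dump file:
sample = ['{"name": "Example", "article_body": {"html": "<p>Hi</p>"}}']
for title, html in extract_articles(sample):
    print(title)
```

In practice you'd stream the lines straight out of the tar.gz archive rather than loading everything into memory.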
Jan