Hi, I'm starting a project that will involve repeated processing of HTML Wikipedia articles.
Using the enterprise dumps seems like it would be much simpler than converting the XML dumps, but I don't know what the "experimental" status really means.
I see in the original announcement post from a year and a half ago that there is a warning about bugs and downtime, but the meta wiki page and dumps site don't have any more information.
Is there less of a commitment to keep posting the enterprise dumps compared to the database XML dumps?
Thanks, Evan
On Fri, 5 May 2023, at 22:53, Evan Lloyd New-Schmidt wrote:
> Hi, I'm starting a project that will involve repeated processing of HTML Wikipedia articles.
> Using the enterprise dumps seems like it would be much simpler than converting the XML dumps, but I don't know what the "experimental" status really means.
Hi,
From my experience working with the Wiktionary HTML dumps I can say that the data quality is quite poor: there are stale and missing entries (https://phabricator.wikimedia.org/T305407).
There are also entire namespaces excluded from the dumps, and more recently there have been issues with the dumps not getting updated.
So it depends on what kind of processing you need to do. In general I find the parsing to be much easier, so hopefully they'll manage to sort out the problems.
Jan
> From my experience working with the Wiktionary HTML dumps I can say
> that the data quality is quite poor: there are stale and missing entries (https://phabricator.wikimedia.org/T305407).
Thank you Jan, that is very good to know. I'll follow that issue for updates.
- Evan
Hello Evan,
The Enterprise HTML dumps should be publicly available around the 22nd and the 3rd of each month, though there can be delays. We don't expect that to change any time soon. As to their content or the namespaces, I can't speak to that; someone from Wikimedia Enterprise will have to discuss their plans. More information about their content is available at https://meta.wikimedia.org/wiki/Wikimedia_Enterprise and you might be able to get a question about it answered on the corresponding discussion page. Hope that helps to clarify things a bit.
Ariel Glenn ariel@wikimedia.org
On Fri, May 5, 2023 at 11:54 PM Evan Lloyd New-Schmidt evan@new-schmidt.com wrote:
> Hi, I'm starting a project that will involve repeated processing of HTML Wikipedia articles.
> Using the enterprise dumps seems like it would be much simpler than converting the XML dumps, but I don't know what the "experimental" status really means.
> I see in the original announcement post from a year and a half ago that there is a warning about bugs and downtime, but the meta wiki page and dumps site don't have any more information.
> Is there less of a commitment to keep posting the enterprise dumps compared to the database XML dumps?
> Thanks, Evan
_______________________________________________
Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
To unsubscribe send an email to xmldatadumps-l-leave@lists.wikimedia.org