Hello.
The dump dewiki-20240520-pages-articles.xml contains many empty articles (96069 in ns 0). The first one has <id>15</id>, the last one <id>13102212</id>. For ns 0, this is a new phenomenon (introduced after 2024-03-01). Across all namespaces, the number of affected pages grew a lot:
# grep -c ' <text bytes="[0-9]*" />' dewiki-20240520-pages-articles.xml
101259
# grep -c ' <text bytes="[0-9]*" />' dewiki-20240301-pages-articles.xml
129
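The grep calls above count over all namespaces at once. For a per-namespace breakdown (such as the ns 0 figure), a minimal Python sketch could look like the following. It assumes the usual dump layout, with one element per line and the <ns> line of a page preceding its <text> line; the function name count_empty_by_ns is only illustrative and not part of any dump tooling.

```python
import re
from collections import Counter

# Same idea as the grep pattern above: a self-closing <text> element
# that carries a bytes attribute but no content.
EMPTY_TEXT = re.compile(r'<text bytes="[0-9]*" />')
NS_TAG = re.compile(r'<ns>(-?[0-9]+)</ns>')


def count_empty_by_ns(lines):
    """Count self-closing <text> elements per namespace.

    Assumption: one element per line, and each page's <ns> line
    comes before its <text> line.
    """
    counts = Counter()
    current_ns = None  # stays None if a page has no <ns> line
    for line in lines:
        if '<page>' in line:
            current_ns = None
        ns_match = NS_TAG.search(line)
        if ns_match:
            current_ns = int(ns_match.group(1))
        elif EMPTY_TEXT.search(line):
            counts[current_ns] += 1
    return counts
```

The function accepts any iterable of lines, so an open file handle for the unpacked dump can be passed directly; counts.most_common() then lists the namespaces with the most affected pages first.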
Greetings
Sven
Hello.
I have more details. The problematic change was introduced after the dump from 2024-05-01; I checked with the grep command from my previous message.
An obvious indication of a problem is that the (unpacked) dump size dropped by 6.5% from 2024-05-01 to 2024-05-20.
Hope that helps ...
Sven
Hi, this is very likely because of https://phabricator.wikimedia.org/T365155
Once that's fixed, it should get back to normal.
Best
On Wed, 22 May 2024 at 11:18, Sven Hartrumpf via Xmldatadumps-l xmldatadumps-l@lists.wikimedia.org wrote:
Hello.
I have more details. The problematic change was introduced after the dump from 2024-05-01; I checked with the grep command from my previous message.
An obvious indication of a problem is that the (unpacked) dump size dropped by 6.5% from 2024-05-01 to 2024-05-20.
Hope that helps ...
Sven
Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org To unsubscribe send an email to xmldatadumps-l-leave@lists.wikimedia.org
Hello Amir.
You wrote, 2024-05-22:
Hi, this is very likely because of https://phabricator.wikimedia.org/T365155 Once that's fixed, it should get back to normal.
Yes, the reported problem has disappeared for my programs (tested with dewiki-20240601-pages-articles.xml).
Thanks
Sven