Hi data dumpers,
Starting today, some of the URLs I've been using to find the latest dumps for current article revisions have begun 404ing.
Japanese (failing):
$ curl -I "http://dumps.wikimedia.org/enwiki/latest/jawiki-latest-pages-articles.xml.bz..."
HTTP/1.1 404 Not Found
Server: nginx/1.1.19
Date: Tue, 07 Jul 2015 23:40:40 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 169
Connection: keep-alive

English (working):
$ curl -I "http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz..."
HTTP/1.1 200 OK
Server: nginx/1.1.19
Date: Tue, 07 Jul 2015 23:39:36 GMT
Content-Type: application/octet-stream
Content-Length: 11984805689
Last-Modified: Fri, 05 Jun 2015 23:45:33 GMT
Connection: keep-alive
Accept-Ranges: bytes
Are these particular dump files going away, or is the "latest" symlink being updated before all dumps have completed?
You're trying to download the Japanese file from the "enwiki" directory.
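One way to avoid this kind of directory/file mismatch is to build the URL from a single wiki name instead of editing it by hand. The following is only a sketch: the function names are hypothetical, and the default suffix is an assumption taken from the dated filename pattern (jawiki-20150703-pages-articles.xml.bz2) seen elsewhere in this thread.

```python
from urllib.parse import urlparse

# Host as used in the curl commands in this thread.
BASE = "http://dumps.wikimedia.org"


def latest_dump_url(wiki, suffix="pages-articles.xml.bz2", base=BASE):
    """Build a 'latest' URL where the directory and file prefix always agree.

    The default suffix is an assumption; pass the one you actually need.
    """
    return f"{base}/{wiki}/latest/{wiki}-latest-{suffix}"


def directory_matches_file(url):
    """Return True if the file name starts with the wiki directory name."""
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) < 2:
        return False
    directory, filename = parts[0], parts[-1]
    return filename.startswith(directory + "-")
```

With this, directory_matches_file() returns False for a jawiki file requested under /enwiki/, which is the mistake pointed out above.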
On Tue, Jul 7, 2015 at 4:54 PM, Devesh Parekh dparekh@netflix.com wrote:
-- Devesh
Xmldatadumps-l mailing list Xmldatadumps-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
Good catch. Unfortunately, it doesn't exist in the jawiki directory either.
$ curl -I "http://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz..."
HTTP/1.1 404 Not Found
Server: nginx/1.1.19
Date: Wed, 08 Jul 2015 00:44:31 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 169
Connection: keep-alive
These are still 404ing today. They were working for months prior to yesterday.
Does the timing of the issue narrow it down to any recent changes in the dump system? I strongly suspect the latest directory is getting updated before the dump has completed. http://dumps.wikimedia.org/jawiki/20150703/ shows the files in the "latest" directory but also shows that the combined latest revision articles dump hasn't been created yet:
- waiting: Recombine articles, templates, media/file descriptions, and primary meta-pages.
  - jawiki-20150703-pages-articles.xml.bz2
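Until the cause is confirmed, a consumer could guard against a missing "latest" file by probing candidate URLs with a HEAD request (the same thing curl -I does above) and taking the first one that answers 200. This is a rough sketch, not part of the dump system: the function names are hypothetical, and which candidate URLs to try (for example, an older dated dump) is left to the caller.

```python
import urllib.error
import urllib.request


def head_status(url, timeout=30):
    """Send a HEAD request and return the HTTP status code."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; report the code instead.
        return err.code


def first_available(urls, probe=head_status):
    """Return the first URL whose probe reports 200, or None if none do.

    `probe` is injectable so the logic can be exercised without network access.
    """
    for url in urls:
        if probe(url) == 200:
            return url
    return None
```

Pass the "latest" URL first and any fallback URLs after it; a None result means every candidate failed the probe.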
I've created https://phabricator.wikimedia.org/T105847 to track this issue.