---------- Forwarded message ----------
From: Ariel Glenn WMF <ariel(a)wikimedia.org>
Date: Fri, Apr 22, 2016 at 9:21 PM
Subject: Re: [Xmldatadumps-l] Failed dumps
To: InfoSports <al(a)infosports.com>
Cc: gnosygnu <gnosygnu(a)gmail.com>, Ariel Glenn WMF <ariel(a)wikimedia.org>
I've been out ill this week. First day back today. I'm tracking this
issue here, including any reruns: https://phabricator.wikimedia.org/T133416
Ariel
On Fri, Apr 15, 2016 at 6:26 PM, InfoSports <al(a)infosports.com> wrote:
> I noticed the decreased size as well.
>
> Also, there are many duplicates in the download. Example article titles…
>
> Rainbow, California
> Murrieta, California
> Fallbrook, California
> Temecula, California
> Wildomar, California
> Sedco Hills, California
> Palomar Mountain
> ...and many more
>
>
> Please re-run the process. There are too many errors in this one to be
> usable.
>
> Thank you in advance.
>
> -Al
>
>
> > On Apr 14, 2016, at 8:56 PM, gnosygnu <gnosygnu(a)gmail.com> wrote:
> >
> > Hi. I think there may still be problems with the 2016-04-07 English Wikipedia dump. It's missing many articles in the Module namespace.
> >
> > Here are some details:
> > * I downloaded
> https://dumps.wikimedia.org/enwiki/20160407/enwiki-20160407-pages-articles.…
> . I got an XML file that was 10.8 GB (i.e.: it does not look severely
> truncated)
> > * I ran the following grep commands. Note that Module:Hatnote is blank.
> I ran the last grep to show that the criteria should be correct.
> > root~> grep "<title>Earth</title>" /home/root/xowa/wiki/en.wikipedia.org/enwiki-latest-pages-articles.xml
> > <title>Earth</title>
> > root~> grep "<title>Template:About</title>" /home/root/xowa/wiki/en.wikipedia.org/enwiki-latest-pages-articles.xml
> > <title>Template:About</title>
> > root~> grep "<title>Module:Hatnote</title>" /home/root/xowa/wiki/en.wikipedia.org/enwiki-latest-pages-articles.xml
> > root~> grep "<title>Module:" /home/root/xowa/wiki/en.wikipedia.org/enwiki-latest-pages-articles.xml
> > <title>Module:Location map/data/Croatia/doc</title>
> > <title>Module:Location map/data/USA Alabama/doc</title>
> > ...
> > * The following Modules appear to be missing in the 2016-04-07 dump
> > Module:Use_mdy_dates
> > Module:Pp-move-indef
> > Module:Protection_banner
> > Module:Unsubst
> > * By my count, there were 2,970 articles in the Module namespace in the 2016-03-05 dump. In contrast, there are only 652 in the 2016-04-07 dump.
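> > For reference, the namespace count can be reproduced with a single grep (a rough sketch; the path is simply wherever the decompressed dump lives locally):
> > root~> grep -c "<title>Module:" /home/root/xowa/wiki/en.wikipedia.org/enwiki-latest-pages-articles.xml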
> >
> > Let me know if you need any other information. I believe that the above can be verified by anyone else, but I'd be happy to provide more detail.
> >
> > Thanks.
> >
> >
> >
> >
> > On Thu, Apr 14, 2016 at 8:49 AM, Ariel Glenn WMF <ariel(a)wikimedia.org> wrote:
> > It hasn't failed. It's still running, but the jobs that previously failed have been left in that status until they get rerun. That's standard behavior. Don't worry, be happy! :-)
> >
> > Ariel
> >
> > On Thu, Apr 14, 2016 at 2:15 PM, Nicolas Vervelle <nvervelle(a)gmail.com> wrote:
> > But at least, pages-articles worked, so it's ok for me.
> >
> > On Thu, Apr 14, 2016 at 1:13 PM, Nicolas Vervelle <nvervelle(a)gmail.com> wrote:
> > Well, enwiki failed again today...
> >
> > On Wed, Apr 13, 2016 at 4:37 PM, Ariel Glenn WMF <ariel(a)wikimedia.org> wrote:
> > You are right. Two jobs were competing for enwiki since I allocated one more lousy core to the host that runs them. I've fixed the config to avoid that. It will resume in a few hours via cron.
> >
> > Ariel
> >
> > On Wed, Apr 13, 2016 at 4:37 PM, Nicolas Vervelle <nvervelle(a)gmail.com> wrote:
> > Thanks Ariel,
> >
> > It seems to have worked for some dumps (frwiki, for example), but other dumps are still failing (enwiki, for example).
> >
> > Nico
> >
> > On Tue, Apr 12, 2016 at 11:04 AM, Ariel Glenn WMF <ariel(a)wikimedia.org> wrote:
> > Hi Nicolas,
> >
> > These will be picked up on reruns, which will happen over the next day or so. The failure was caused by an obscure HHVM bug which only triggers under certain circumstances. For more information about that, see: https://phabricator.wikimedia.org/T94277
> >
> > This morning I did some job cleanup, switched the dump jobs to use php5 again, and the dumps have restarted.
> >
> > Ariel
> >
> > On Tue, Apr 12, 2016 at 11:25 AM, Nicolas Vervelle <nvervelle(a)gmail.com> wrote:
> > Hi,
> >
> > Is anyone working on the failed dumps for April? (enwiki, frwiki, ruwiki, itwiki, ...)
> >
> > Nico
> >
> > _______________________________________________
> > Xmldatadumps-l mailing list
> > Xmldatadumps-l(a)lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
Dear all,
I recently tried to import one of the latest SQL dumps of English
Wikipedia titles into MySQL:
* 2016-04-11 20:07:38 done Base per-page data (id, title, old restrictions, etc.)
  o enwiki-20160407-page.sql.gz <https://dumps.wikimedia.org/enwiki/20160407/enwiki-20160407-page.sql.gz> 1.3 GB
I think the import performance could be improved considerably.
Even though I disabled foreign key checks, set a large max packet size, and
made some other optimizations, the import took around 2 days on a laptop
with 8 GB of RAM.
I don't think I can disable autocommit because of the limited RAM, but even
when I tried without autocommit, modifying some columns took ages.
I did a test: I edited the SQL dump file and removed the index creation
from the CREATE TABLE statement. This was not easy, because very few text
editors can open such a large file.
I ran the import again, without the indices, key constraints, etc., and it
took less than two hours.
Creating the indices / key constraints after importing the tables took
around 10-15 minutes per index.
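For what it's worth, here is a rough sketch of that workflow for the page
table (untested; the database name and the list of secondary indexes are
illustrative and may not match the dump's actual schema):

# 1. Extract just the CREATE TABLE statement; it is tiny, so any text editor
#    can be used to delete the UNIQUE KEY / KEY lines (and the trailing
#    comma they leave behind) before running it.
zcat enwiki-20160407-page.sql.gz | sed -n '/^CREATE TABLE/,/^)/p' > create_page.sql
mysql -u root -p enwiki < create_page.sql   # after editing out the KEY lines

# 2. Stream only the INSERT statements into the server.
zcat enwiki-20160407-page.sql.gz | grep '^INSERT INTO' | mysql -u root -p enwiki

# 3. Add the secondary indexes in a single pass at the end.
mysql -u root -p enwiki -e "ALTER TABLE page
  ADD UNIQUE KEY name_title (page_namespace, page_title),
  ADD KEY page_random (page_random),
  ADD KEY page_len (page_len);"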
I think the problem with the original dump is that all the indices are
maintained on every insertion, which seems very inefficient.
So I would like to ask whether the dump files could be created in a more
optimized way, skipping the index creation at first and leaving it for the
end of the script.
Best regards,
Rafael
The current dump
enwiki-20160204-pages-articles.xml.bz2
contains duplicate pages. In particular, "Total Nonstop Action" and "Ida de Grey" appear twice.
Is this going to be fixed or should we assume that there might be duplicated pages in the dump? This never happened to us before.
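A rough way to check for such duplicates, assuming bzcat and GNU sort/uniq
(and enough disk for sort's temporary files), would be:

bzcat enwiki-20160204-pages-articles.xml.bz2 | grep '<title>' | sort | uniq -d

Any <title> line printed by that pipeline occurs more than once in the dump.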
Ciao,
seba
Hi!
Last year I was very kindly provided with a list of all SVG files on
Commons, that is, their *real* http(s) paths at the time. (Either by John
phoenixoverride(a)gmail.com or by Ariel T. Glenn aglenn(a)wikimedia.org.)
Could I get a current version of this dump, please? (With the real paths,
for files that actually exist.)
Back then the dump was
http://tools.wmflabs.org/betacommand-dev/reports/commonswiki_svg_list.txt.7z
as far as I remember.
(Someone told me I could create such a dump myself with some wiki-tools.
Is this really possible?)
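One approach that should work is paging through the API's allimages list
filtered to SVG. A rough, untested sketch, assuming curl and jq are
installed:

API="https://commons.wikimedia.org/w/api.php"
CONT=""
: > commons_svg_urls.txt
while :; do
  # one page of up to 500 SVG files, with their real upload URLs
  RESP=$(curl -sG "$API" --data-urlencode "action=query" \
      --data-urlencode "format=json" --data-urlencode "list=allimages" \
      --data-urlencode "aiprop=url" --data-urlencode "aimime=image/svg+xml" \
      --data-urlencode "ailimit=500" \
      ${CONT:+--data-urlencode "aicontinue=$CONT"})
  echo "$RESP" | jq -r '.query.allimages[].url' >> commons_svg_urls.txt
  # the API keeps returning an aicontinue token until the listing is exhausted
  CONT=$(echo "$RESP" | jq -r '.continue.aicontinue // empty')
  [ -z "$CONT" ] && break
done

Alternatively, the commonswiki image.sql.gz dump should contain the file
names and MIME types, though the upload paths would then have to be
reconstructed from them.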
Greetings
John
I sent this same message twice, but it didn't show up. Sorry if it ends up
appearing twice after all.