The current dump
enwiki-20160204-pages-articles.xml.bz2
contains duplicate pages. In particular, "Total Nonstop Action" and "Ida de Grey" appear twice.
Is this going to be fixed, or should we assume that the dump may contain duplicated pages? This has never happened to us before.
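For anyone who wants to verify a dump themselves, here is a rough sketch (assuming the standard MediaWiki export schema; the check is by title, and memory is kept bounded by streaming) that reports titles appearing on more than one page element:

```python
import xml.etree.ElementTree as ET
from collections import Counter

def local(tag):
    """Strip an XML namespace prefix ('{uri}page' -> 'page')."""
    return tag.rsplit("}", 1)[-1]

def find_duplicate_titles(stream):
    """Return sorted page titles occurring more than once in a MediaWiki export stream."""
    counts = Counter()
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) == "page":
            for child in elem:
                if local(child.tag) == "title":
                    counts[child.text] += 1
                    break
            elem.clear()  # free processed pages so a full dump fits in memory
    return sorted(t for t, n in counts.items() if n > 1)

# e.g. (file name from this thread, decompressed on the fly):
# import bz2
# with bz2.open("enwiki-20160204-pages-articles.xml.bz2", "rb") as f:
#     print(find_duplicate_titles(f))
```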
Ciao,
seba
Hi!
Last year I was very kindly provided with a list of all SVG files on
Commons, that is, their then *real* http(s) paths. (Either by John
phoenixoverride(a)gmail.com or by Ariel T. Glenn aglenn(a)wikimedia.org.)
Could I get a current version of this dump, please? (With the real paths
and really existing files.)
Back then the dump was
http://tools.wmflabs.org/betacommand-dev/reports/commonswiki_svg_list.txt.7z
as far as I remember.
(Someone told me I could create such a dump myself with some wiki-tools.
Is this really possible?)
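For what it's worth, a list like this can probably be rebuilt yourself with the public MediaWiki API (list=allimages filtered by MIME type, with continuation); this is only a sketch, and a full crawl of Commons' millions of files will be slow:

```python
import json
import urllib.parse
import urllib.request

API = "https://commons.wikimedia.org/w/api.php"

def build_query(aicontinue=None):
    """Query parameters for one page of SVG results."""
    params = {
        "action": "query",
        "format": "json",
        "list": "allimages",
        "aimime": "image/svg+xml",  # only SVG files
        "aiprop": "url",            # include the real upload.wikimedia.org path
        "ailimit": "500",
    }
    if aicontinue:
        params["aicontinue"] = aicontinue  # resume where the last batch ended
    return params

def iter_svg_urls():
    """Yield URLs of SVG files on Commons, following API continuation."""
    cont = None
    while True:
        url = API + "?" + urllib.parse.urlencode(build_query(cont))
        with urllib.request.urlopen(url) as resp:
            data = json.load(resp)
        for img in data["query"]["allimages"]:
            yield img["url"]
        cont = data.get("continue", {}).get("aicontinue")
        if not cont:
            break
```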
Greetings
John
I sent this same message twice, but it didn't show up. Apologies if it
ends up appearing twice after all.
Ariel,
Not sure if this belongs on the dumps mailing list or in the labs domain. The
enwiki dumps for this month are not showing up in labs' dumps directory.
This was brought to my attention by another user trying to use the dump file.
Bryan
Hi,
I have a related question.
Last week I looked for full xml dumps of huwiki from 2013.
* dumps.wikimedia.org provides dumps back to 2015-07-02 now
( https://dumps.wikimedia.org/huwiki/20150702/ )
* in Internet Archive there is a gap between 2012-06-13 and 2014-07-27
( https://archive.org/download/huwiki20120613 )
( https://archive.org/download/huwiki-20140727 )
Do you know any source where I could download dumps from this period?
Or did we lose them forever?
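One way to check what actually survives on archive.org is to probe its metadata endpoint, which (as documented) returns an empty JSON object for unknown identifiers. The dates to try are an assumption; both identifier spellings seen in the links above are covered:

```python
import json
import urllib.request

def candidate_identifiers(date):
    """Both identifier spellings seen on archive.org (with and without dash)."""
    return [f"huwiki{date}", f"huwiki-{date}"]

def item_exists(identifier):
    """True if archive.org knows this item (metadata API returns {} otherwise)."""
    url = f"https://archive.org/metadata/{identifier}"
    with urllib.request.urlopen(url) as resp:
        return bool(json.load(resp))

# e.g. probe a few dates inside the 2012-2014 gap:
# for date in ("20130601", "20131201"):
#     for ident in candidate_identifiers(date):
#         if item_exists(ident):
#             print("found:", ident)
```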
Thank you.
Best,
Samat
> 2015, 2016 dumps are available at the wikimedia dumps website.
>
> However I would like to have access to old xml wikipedia dumps.
> I've been googling all around and tried everything from Torrents to an EBS
> snapshot on amazon which supposedly contained many of the xml dumps.
>
> I've managed to somehow get access to assorted dumps from every year from
> 2006 to 2016; however, I would like to get specific dumps, e.g. all dumps
> from around March of each of those years.
>
> I wonder if there is a repository, or if anyone could share them via
> torrents (the current torrents don't have any seeds).
>
> Thanks
>
>
Fallback is: cable up the old 1Gb NIC (Chris has done this and set up the
port), PXE install on that, then move to the 10Gb NIC once we're back up.
Gross, but it gets the job done.
Set for tomorrow (Friday), 1 to 4 pm UTC; this time it should be much smoother.
Same caveats apply as before.
Ariel
On Wed, Mar 2, 2016 at 8:47 PM, Ariel Glenn WMF <ariel(a)wikimedia.org> wrote:
> PXE boot from non-embedded nic failed spectacularly despite our best
> efforts. This means we'll have to schedule another window once we have
> something new to try. I apologize for the extra inconvenience. All services
> are back exactly the way they were.
>
> Ariel
>
> On Wed, Mar 2, 2016 at 6:01 PM, Ariel Glenn WMF <ariel(a)wikimedia.org>
> wrote:
>
>> Extending this downtime window because we ran into unexpected issues with
>> PXE boot.
>>
>> On Tue, Mar 1, 2016 at 3:53 PM, Ariel Glenn WMF <ariel(a)wikimedia.org>
>> wrote:
>>
>>> Dataset1001, the host which serves dumps and other datasets to the
>>> public, as well as providing access to various datasets directly on
>>> stats100x, will be unavailable tomorrow for an upgrade to jessie. While I
>>> don't expect to need nearly 3 hours for the upgrade, better safe than
>>> sorry. In the meantime all files will be accessible via
>>> ms1001.wikimedia.org via the web, and all dumps and page view files
>>> from our mirrors as well.
>>>
>>> Thanks for your understanding.
>>>
>>> Ariel Glenn
>>>
I've turned on 'checkpointing' for the following wikis: eswiki, itwiki,
ruwiki, wikidatawiki
This means that, as for enwiki and now dewiki, page content will be
produced as more, smaller files. See
https://gerrit.wikimedia.org/r/#/c/274730/ for the changeset.
I've done this for two reasons: first, the files were getting large, and
second, recovery from failure in those steps will now be quicker and easier
for those wikis.
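For consumers, one consequence is that page content for a wiki now arrives as several bz2 part files rather than one. A rough sketch for reading them back as a single stream (the glob pattern below is illustrative, not the real naming scheme):

```python
import bz2
import glob

def chained_lines(pattern):
    """Yield decompressed lines from all matching part files, in sorted filename order."""
    for path in sorted(glob.glob(pattern)):
        with bz2.open(path, "rt", encoding="utf-8") as part:
            yield from part

# e.g.
# for line in chained_lines("eswiki-*-pages-articles*.xml-p*.bz2"):
#     process(line)
```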
Ariel
Dataset1001, the host which serves dumps and other datasets to the public,
as well as providing access to various datasets directly on stats100x, will
be unavailable tomorrow for an upgrade to jessie. While I don't expect to
need nearly 3 hours for the upgrade, better safe than sorry. In the
meantime all files will be accessible via ms1001.wikimedia.org via the web,
and all dumps and page view files from our mirrors as well.
Thanks for your understanding.
Ariel Glenn