We've seen some kernel errors in the logs on the server that hosts the
dumps. In order to upgrade and reboot, we need the currently running
dumps to exit as soon as they complete the current phase of their runs.
I expect to complete this before the end of the week; if necessary I
will shoot existing processes to get it done. In the meantime as dumps
complete, new ones will not be started; thanks for your patience.
Ariel
Create a script that makes a request to Special:Export using this category
as the feed:
https://en.wikipedia.org/wiki/Category:Candidates_for_speedy_deletion
More info https://www.mediawiki.org/wiki/Manual:Parameters_to_Special:Export
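A minimal sketch of such a script, assuming Python 3 and the standard MediaWiki endpoints: it fetches the category's page titles via the API (list=categorymembers), then POSTs them to Special:Export. The User-Agent string, function names, and output filename are illustrative, not from the thread.

```python
# Sketch: export all pages in a category via Special:Export.
# Assumptions: standard MediaWiki API endpoints; the UA string and
# filenames are placeholders for illustration.
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"
EXPORT = "https://en.wikipedia.org/wiki/Special:Export"
HEADERS = {"User-Agent": "category-export-sketch/0.1"}

def category_members(category, limit=50):
    # list=categorymembers returns the page titles in a category.
    params = urllib.parse.urlencode({
        "action": "query", "list": "categorymembers",
        "cmtitle": category, "cmlimit": limit, "format": "json",
    })
    req = urllib.request.Request(f"{API}?{params}", headers=HEADERS)
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return [m["title"] for m in data["query"]["categorymembers"]]

def export_payload(titles, current_only=True):
    # Special:Export takes a newline-separated "pages" field;
    # curonly=1 requests only the latest revision of each page.
    body = {"pages": "\n".join(titles)}
    if current_only:
        body["curonly"] = "1"
    return urllib.parse.urlencode(body).encode()

def export_category(category, outfile="export.xml"):
    titles = category_members(category)
    req = urllib.request.Request(EXPORT, data=export_payload(titles),
                                 headers=HEADERS)
    with urllib.request.urlopen(req) as resp, open(outfile, "wb") as f:
        f.write(resp.read())

# Usage (hits the live site):
# export_category("Category:Candidates_for_speedy_deletion")
```

For larger categories the API paginates with a `cmcontinue` token, which this sketch does not follow.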
2012/5/21 Mike Dupont <jamesmikedupont(a)googlemail.com>
> Well I would be happy for items like this:
> http://en.wikipedia.org/wiki/Template:Db-a7
> would it be possible to extract them easily?
> mike
>
> On Thu, May 17, 2012 at 2:23 PM, Ariel T. Glenn <ariel(a)wikimedia.org>
> wrote:
> > There's a few other reasons articles get deleted: copyright issues,
> > personal identifying data, etc. This makes maintaining the sort of
> > mirror you propose problematic, although a similar mirror is here:
> > http://deletionpedia.dbatley.com/w/index.php?title=Main_Page
> >
> > The dumps contain only data publicly available at the time of the run,
> > without deleted data.
> >
> > The articles aren't permanently deleted of course. The revision texts
> > live on in the database, so a query on toolserver, for example, could be
> > used to get at them, but that would need to be for research purposes.
> >
> > Ariel
> >
> > On 17-05-2012 (Thu), at 13:30 +0200, Mike Dupont wrote:
> >> Hi,
> >> I am thinking about how to collect articles deleted based on the "not
> >> notable" criteria,
> >> is there any way we can extract them from the mysql binlogs? how are
> >> these mirrors working? I would be interested in setting up a mirror of
> >> deleted data, at least that which is not spam/vandalism based on tags.
> >> mike
> >>
> >> On Thu, May 17, 2012 at 1:09 PM, Ariel T. Glenn <ariel(a)wikimedia.org> wrote:
> >> > We now have three mirror sites, yay!  The full list is linked to from
> >> > http://dumps.wikimedia.org/ and is also available at
> >> > http://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Curren…
> >> >
> >> > Summarizing, we have:
> >> >
> >> > C3L (Brazil) with the last 5 known good dumps,
> >> > Masaryk University (Czech Republic) with the last 5 known good dumps,
> >> > Your.org (USA) with the complete archive of dumps, and
> >> >
> >> > for the latest version of uploaded media, Your.org with http/ftp/rsync
> >> > access.
> >> >
> >> > Thanks to Carlos, Kevin and Yenya respectively at the above sites for
> >> > volunteering space, time and effort to make this happen.
> >> >
> >> > As people noticed earlier, a series of media tarballs per-project
> >> > (excluding commons) is being generated.  As soon as the first run of
> >> > these is complete we'll announce its location and start generating them
> >> > on a semi-regular basis.
> >> >
> >> > As we've been getting the bugs out of the mirroring setup, it is getting
> >> > easier to add new locations.  Know anyone interested?  Please let us
> >> > know; we would love to have them.
> >> >
> >> > Ariel
> >> >
> >> >
> >> > _______________________________________________
> >> > Wikitech-l mailing list
> >> > Wikitech-l(a)lists.wikimedia.org
> >> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
> >>
> >>
> >>
> >
> >
> >
>
>
>
> --
> James Michael DuPont
> Member of Free Libre Open Source Software Kosova http://flossk.org
> Contributor FOSM, the CC-BY-SA map of the world http://fosm.org
> Mozilla Rep https://reps.mozilla.org/u/h4ck3rm1k3
>
>
--
Emilio J. Rodríguez-Posada. E-mail: emijrp AT gmail DOT com
Pre-doctoral student at the University of Cádiz (Spain)
Projects: AVBOT <http://code.google.com/p/avbot/> |
StatMediaWiki <http://statmediawiki.forja.rediris.es> |
WikiEvidens <http://code.google.com/p/wikievidens/> |
WikiPapers <http://wikipapers.referata.com> |
WikiTeam <http://code.google.com/p/wikiteam/>
Personal website: https://sites.google.com/site/emijrp/
Over the next few days you'll see the number of processes drop as jobs
on each host complete. I'll be restarting them on each host as they
finish; this is part of my work to get deployment to suck less.
Ariel
We now have three mirror sites, yay! The full list is linked to from
http://dumps.wikimedia.org/ and is also available at
http://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Curren…
Summarizing, we have:
C3L (Brazil) with the last 5 known good dumps,
Masaryk University (Czech Republic) with the last 5 known good dumps,
Your.org (USA) with the complete archive of dumps, and
for the latest version of uploaded media, Your.org with http/ftp/rsync
access.
Thanks to Carlos, Kevin and Yenya respectively at the above sites for
volunteering space, time and effort to make this happen.
As people noticed earlier, a series of media tarballs per-project
(excluding commons) is being generated. As soon as the first run of
these is complete we'll announce its location and start generating them
on a semi-regular basis.
As we've been getting the bugs out of the mirroring setup, it is getting
easier to add new locations. Know anyone interested? Please let us
know; we would love to have them.
Ariel
It's amazing there are already so many years available for download.
Especially the larger zips must have been somewhat time-consuming to
compile! It would be great if 2008 or pre-2006 packages became available in
the near future. It is a really interesting development that
http://dumps.wikimedia.org is interested in not only compiling large
current packaged databases of various wikis, but also more historic
content. In the past, the Internet Archive was the sole distributor of
older (historic) wiki packages.
Eventually, in the far future, there will have to be some sort of viable
mechanism for cloning all the images stored on Wikimedia; for now, though,
the Picture of the Year packages are very interesting for those more
interested in the pretty images of Wikipedia. The POTY images also make
great wallpaper packs!
Hello everyone,
I am just working on a Wikipedia reader, and I noticed this little issue.
The data in the image metadata dumps (e.g. enwiki-20120403-image.sql.gz) gets somewhat truncated.
This stems from the img_description column being defined as tinyblob; a tinyblob holds at most 255 bytes.
I'd really love to use this dump instead of straining the servers (and taking forever).
Is this my fault, or can you do something to address this issue?
Most interesting for me would be commons of course, then the german, french and spanish wikipedias.
Best from Berlin,
Bastian
Please see the column definition:
`img_description` tinyblob NOT NULL
And the table structure:
CREATE TABLE `image` (
`img_name` varbinary(255) NOT NULL DEFAULT '',
`img_size` int(8) unsigned NOT NULL DEFAULT '0',
`img_width` int(5) NOT NULL DEFAULT '0',
`img_height` int(5) NOT NULL DEFAULT '0',
`img_metadata` mediumblob NOT NULL,
`img_bits` int(3) NOT NULL DEFAULT '0',
`img_media_type` enum('UNKNOWN','BITMAP','DRAWING','AUDIO','VIDEO','MULTIMEDIA','OFFICE','TEXT','EXECUTABLE','ARCHIVE') DEFAULT NULL,
`img_major_mime` enum('unknown','application','audio','image','text','video','message','model','multipart') NOT NULL DEFAULT 'unknown',
`img_minor_mime` varbinary(32) NOT NULL DEFAULT 'unknown',
`img_description` tinyblob NOT NULL,
`img_user` int(5) unsigned NOT NULL DEFAULT '0',
`img_user_text` varbinary(255) NOT NULL DEFAULT '',
`img_timestamp` varbinary(14) NOT NULL DEFAULT '',
`img_sha1` varbinary(32) NOT NULL DEFAULT '',
PRIMARY KEY (`img_name`),
KEY `img_size` (`img_size`),
KEY `img_timestamp` (`img_timestamp`),
KEY `img_usertext_timestamp` (`img_user_text`,`img_timestamp`),
KEY `img_sha1` (`img_sha1`)
) ENGINE=InnoDB DEFAULT CHARSET=binary;
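A rough way to gauge how much data the tinyblob limit clips, sketched in Python under some loud assumptions: the dump filename is illustrative, and the tuple scanning is deliberately naive (a real parser would have to split the INSERT tuples and handle escaped quotes inside values). The one solid fact it leans on is that a MySQL tinyblob holds at most 255 bytes, so a description of exactly that length was very likely cut off.

```python
# Sketch: flag probably-truncated img_description values in an
# image.sql.gz dump. Filename and parsing are illustrative; only the
# 255-byte tinyblob capacity is taken from the MySQL schema above.
import gzip
import re

TINYBLOB_MAX = 255  # MySQL tinyblob capacity in bytes

def looks_truncated(description: bytes) -> bool:
    # A description that exactly fills the tinyblob was probably cut off.
    return len(description) >= TINYBLOB_MAX

def count_suspect_values(path="enwiki-20120403-image.sql.gz"):
    # Naive scan: count quoted string values of maximal tinyblob length
    # anywhere in the dump (it does not isolate the description column,
    # and it ignores escaped quotes, so treat the number as a rough hint).
    pattern = re.compile(rb"'([^']{%d,})'" % TINYBLOB_MAX)
    suspects = 0
    with gzip.open(path, "rb") as f:
        for line in f:
            suspects += len(pattern.findall(line))
    return suspects
```

Note that the limit is in bytes, not characters, so multi-byte UTF-8 descriptions hit it even sooner.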