Yesterday, I wrote a post with some links to current dumps, old dumps,
and other raw data like Domas's visit logs. Also, some links to the Internet
Archive, where we can download some historical dumps. Please, can you share
your links?
Also, what about making a tarball with thumbnails from Commons? 800x600
would be a nice (re)-solution, to avoid a TB dump. If not, an image dump will
probably never be published. Commons is growing by ~5000 images per day.
It is scary.
Thanks for your reply. I can't download 8TB at home, of course, but perhaps
people at universities who want to do research on the data or mirror it can.
You can make image dumps by year (all images uploaded in 2005, all in
2006, etc.), and upload them to the Internet Archive (those folks rock). Also, I
have calculated that an image dump with all the 7M Commons images resized to
800x600 would be ~500 GB, today.
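As a rough sanity check of those figures, the arithmetic can be sketched in Python. It only restates numbers from this thread (7M images, ~500 GB, ~5000 new images per day). The average size per resized image is derived from them rather than measured, and decimal units (1 GB = 10^9 bytes) are an assumption.

```python
# Back-of-the-envelope check of the resized image dump estimate.
# Inputs are the figures quoted in this thread; decimal units are assumed.
TOTAL_IMAGES = 7_000_000
DUMP_BYTES = 500 * 10**9
NEW_IMAGES_PER_DAY = 5_000

# Implied average size of one 800x600 image (derived, not measured).
avg_bytes_per_image = DUMP_BYTES / TOTAL_IMAGES

# Implied growth of the dump if uploads continue at the current rate.
growth_bytes_per_day = avg_bytes_per_image * NEW_IMAGES_PER_DAY
growth_gb_per_year = growth_bytes_per_day * 365 / 10**9

print(f"average image size: {avg_bytes_per_image / 1000:.1f} kB")
print(f"daily growth: {growth_bytes_per_day / 10**6:.0f} MB")
print(f"yearly growth: {growth_gb_per_year:.0f} GB")
```

Under these assumptions the estimate implies roughly 71 kB per resized image and about 130 GB of growth per year.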
On the other hand, you can publish image dumps for individual Wikipedias
(look at the images column). Any legal problems with the English dump
containing fair use images?
2010/8/15 Ariel T. Glenn <ariel(a)wikimedia.org>
On 14-08-2010, Sat, at 23:25 -0700, Jamie Morken wrote:
> > Hi,
> > ----- Original Message -----
> > From: emijrp <emijrp(a)gmail.com>
> > Date: Friday, August 13, 2010 4:48 am
> > Subject: [Xmldatadumps-l] Dumps, dumps, dumps
> > To: xmldatadumps-l(a)lists.wikimedia.org
> > > Hi all;
> > >
> > > Yesterday, I wrote a post with some links to current dumps,
> > > old dumps,
> > > and other raw data like Domas's visit logs. Also, some links to
> > > the Internet Archive, where we can download some historical dumps.
> > > Please, can you share
> > > your links?
> > >
> > > Also, what about making a tarball with thumbnails from Commons?
> > > 800x600 would be a nice (re)-solution, to avoid a TB dump. If
> > > not, an image dump will
> > > probably never be published. Commons is growing by ~5000
> > > images per day.
> > > It is scary.
> > Yes, publicly available tarballs of image dumps would be great. Here's
> > what I think it would take to implement:
> > 1. allocate the server space for the image tarballs
> > 2. allocate the bandwidth for us to download them
> > 3. decide what tarballs will be made available (e.g. separated by wiki
> > or whole commons, thumbnails or 800x600 max, etc.)
> > 4. write the script(s) for collecting the image lists, automating the
> > image scaling and creating the tarballs
> > 5. done!
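The scripting step in the list above could look roughly like the following Python sketch. It is only an illustration under stated assumptions: `fit_within` and `build_tarball` are hypothetical helper names, the actual resizing is left out because it needs an image library outside the standard library, and nothing here reflects how Wikimedia produces or produced its dumps.

```python
import tarfile
from pathlib import Path


def fit_within(width, height, max_w=800, max_h=600):
    """Return (w, h) scaled down to fit max_w x max_h, keeping aspect ratio."""
    scale = min(max_w / width, max_h / height, 1.0)  # never upscale
    return max(1, round(width * scale)), max(1, round(height * scale))


def build_tarball(image_paths, tar_path):
    """Pack the given files into an uncompressed tarball; return the member count."""
    count = 0
    with tarfile.open(tar_path, "w") as tar:
        for path in image_paths:
            path = Path(path)
            # Store only the file name so the archive carries no local directory layout.
            tar.add(path, arcname=path.name)
            count += 1
    return count
```

For example, `fit_within(1600, 1200)` gives `(800, 600)` and a 400x300 image is left at its original size. The tarball is written without compression on the assumption that the images are already in a compressed format such as JPEG.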
> > None of those tasks are really that difficult; the hard part is
> > figuring out why there used to be tarball images available but not
> > anymore, especially when apparently there is adequate server space and
> > bandwidth. I guess it is one more thing that could break and then
> > people would complain about it not working.
> Images take up 8T or more these days (of course that includes deletes
> and earlier versions but those aren't the bulk of it). Hosting 8T
> tarballs seems out of the question... who would download them anyways?
> Having said that, hosting small subsets of images is quite possible and
> is something that has been discussed in the past. I would love to hear
> which subsets of images people want and would actually use.
We had a brief outage of the server that hosts (not runs) the XML dumps
earlier today. There is an issue with one of the spare disks, and when
we queried the RAID controller, the system hung. The server was brought
back up shortly afterwards and dumps have resumed but those that were in
progress were aborted. New runs of those will be conducted in a few
days when their turns come up again.
In other news, I noticed the empty mysql table dumps for en wikipedia,
ran one such dump by hand, and it ran fine. Those dumps were run just
after the upgrade to a patched version of mysqld; I can't be sure what the
particular problem was but I'm chalking it up to the transition for now.
We will be keeping our eye on the issue.
My name is Norbert Kurz and I am a student of applied computer science in
I downloaded the 7.8GB XML dump of the German Wikipedia and split it into
Now I want to parse the text in the text tag (<text>) into an HTML page.
My problem is that there is a special syntax for tables, lists, links etc.
My question is:
Is there a definition of the XML syntax, so that it is easily possible to write an
XML-to-HTML script?
Zu den Regisseuren, die das Pseudonym benutzt haben, gehören:
* [[Don Siegel]] und [[Robert Totten]] (für [[Frank Patch – Deine Stunden
* [[David Lynch]] (für die dreistündige Fernsehfassung von [[Der
Wüstenplanet (Film)|Der Wüstenplanet]]),
* [[Chris Christensen]] (The Omega Imperative),
* [[Stuart Rosenberg]] (für [[Let’s Get Harry]]),
* [[Richard C. Sarafian]] (für [[Starfire]]),
* [[Dennis Hopper]] (für [[Catchfire]]),
* [[Arthur Hiller]] (für [[An Alan Smithee Film: Burn Hollywood Burn]]),
* [[Rick Rosenthal]] (Birds II) und
* [[Kevin Yagher]] ([[Hellraiser IV – Bloodline]]).
* Der Pilotfilm der Serie [[MacGyver]] führt einen Alan Smithee als
Regisseur <ref>http://www.imdb.com/title/tt0165375/ </ref>
The asterisk means that there is a list item,
the two brackets [[ mean that there is a link, and
the pipe separates the link target from the shown name: [[ LINKNAME | SHOWN_NAME ]]
Is there a file that describes all of these special cases, the LaTeX
stuff written in the XML files ( \longrightarrow ) and the tables?
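To make the two cases described above concrete, here is a minimal Python sketch that turns "*" lines into an HTML list and [[ LINKNAME | SHOWN_NAME ]] into links. The markup inside <text> is MediaWiki wikitext, and this sketch covers only those two constructs: tables, templates, <ref> tags and LaTeX are passed through as escaped text. The function names and the "/wiki/" link prefix are assumptions made for illustration.

```python
import html
import re

# Matches [[Target]] or [[Target|Label]].
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def convert_links(text):
    """Escape the text, then turn [[Target|Label]] into <a> elements."""
    def repl(match):
        target = match.group(1).strip()
        label = (match.group(2) or target).strip()
        href = "/wiki/" + target.replace(" ", "_")  # assumed URL scheme
        return f'<a href="{href}">{label}</a>'
    return LINK_RE.sub(repl, html.escape(text))


def wikitext_to_html(wikitext):
    """Convert '*' lines to <ul> items and other non-empty lines to <p>."""
    out, in_list = [], False
    for line in wikitext.splitlines():
        if line.startswith("*"):
            if not in_list:
                out.append("<ul>")
                in_list = True
            out.append("<li>" + convert_links(line[1:].strip()) + "</li>")
        else:
            if in_list:
                out.append("</ul>")
                in_list = False
            if line.strip():
                out.append("<p>" + convert_links(line.strip()) + "</p>")
    if in_list:
        out.append("</ul>")
    return "\n".join(out)
```

Calling `wikitext_to_html` on the list above would give one `<li>` per director, with each [[...]] replaced by a link. A complete converter would need a real wikitext parser, since nested lists, templates and tables do not fit this line-by-line approach.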
Now I want to thank you all for your great work, I am happy that you make
the effort to export the whole Wikipedia, so other people
can download it and play around. Please keep up your good work.
Thanks in advance for your help.
Norbert Kurz, Stuttgart, Germany
Ok. How about changing the order and doing the .7z dump first?
-- Cheers, Dmitry
On Wed, Jul 21, 2010 at 4:03 AM, Jamie Morken <jmorken(a)shaw.ca> wrote:
> I was polling the http://download.wikimedia.org/enwiki/20100622/ page
> during the pages-meta-history.xml.bz2 database dump and here is some
> timestamped output from that page showing some errors that caused the dump
> to fail. Regarding the .bz2 dump format, Tomasz earlier suggested removing
> it and using .7z. I thought it might be good to keep the .bz2 format due to
> there being several programs that use it (e.g. WikiTaxi, BzReader). 7z
> format is probably the way to go though for the future, but I don't know if
> this would fix the database dump errors.