Xmldatadumps-l August 2010

xmldatadumps-l@lists.wikimedia.org

6 participants
6 discussions

by emijrp

Hi all;

Yesterday, I wrote a post[1] with some links to current dumps, old dumps, and other raw data like Domas' visit logs. It also has some links to the Internet Archive, where we can download some historical dumps. Please, can you share your links?

Also, what about making a tarball with thumbnails from Commons? 800x600 would be a nice (re)solution, to avoid a multi-TB dump. If not, an image dump will probably never be published. Commons is growing by ~5000 images per day. It is scary.

Regards,
emijrp

[1] http://emijrp.blogspot.com/2010/08/wikipedia-dumps.html

13 years, 8 months

Fwd: Dumps, dumps, dumps

by emijrp

Hi Ariel;

Thanks for your reply. I can't download 8 TB at home, of course, but perhaps people at universities who want to research or mirror the data can. You could make image dumps by year (all images uploaded in 2005, all in 2006, etc.) and upload them to the Internet Archive (those folks rock).

Also, I have calculated that an image dump with all 7M Commons images resized to 800x600 would be ~500 GB today.

On the other hand, you could publish image dumps for individual Wikipedias[1] (look at the images column). Are there any legal problems with the English dump containing fair use images?

Regards,
emijrp

[1] http://meta.wikimedia.org/wiki/List_of_Wikipedias#All_Wikipedias_ordered_by…

2010/8/15 Ariel T. Glenn <ariel(a)wikimedia.org>

> On Sat, 14-08-2010 at 23:25 -0700, Jamie Morken wrote:
> > Hi,
> >
> > ----- Original Message -----
> > From: emijrp <emijrp(a)gmail.com>
> > Date: Friday, August 13, 2010 4:48 am
> > Subject: [Xmldatadumps-l] Dumps, dumps, dumps
> > To: xmldatadumps-l(a)lists.wikimedia.org
> >
> > > Hi all;
> > >
> > > Yesterday, I wrote a post[1] with some links to current dumps,
> > > old dumps, and other raw data like Domas' visit logs. It also has
> > > some links to the Internet Archive, where we can download some
> > > historical dumps. Please, can you share your links?
> > >
> > > Also, what about making a tarball with thumbnails from Commons?
> > > 800x600 would be a nice (re)solution, to avoid a multi-TB dump. If
> > > not, an image dump will probably never be published. Commons is
> > > growing by ~5000 images per day. It is scary.
> >
> > Yes, publicly available tarballs of image dumps would be great. Here's
> > what I think it would take to implement:
> >
> > 1. allocate the server space for the image tarballs
> > 2. allocate the bandwidth for us to download them
> > 3. decide what tarballs will be made available (i.e. separated by wiki
> >    or whole Commons, thumbnails or 800x600 max, etc.)
> > 4. write the script(s) for collecting the image lists, automating the
> >    image scaling and creating the tarballs
> > 5. done!
> >
> > None of those tasks are really that difficult; the hard part is
> > figuring out why there used to be tarball images available but not
> > anymore, especially when apparently there is adequate server space and
> > bandwidth. I guess it is one more thing that could break, and then
> > people would complain about it not working.
>
> Images take up 8 TB or more these days (of course that includes deletes
> and earlier versions, but those aren't the bulk of it). Hosting 8 TB
> tarballs seems out of the question... who would download them anyway?
>
> Having said that, hosting small subsets of images is quite possible and
> is something that has been discussed in the past. I would love to hear
> which subsets of images people want and would actually use.
>
> Ariel
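For anyone who wants to sanity-check the size figures in this thread, here is a rough back-of-the-envelope sketch. It only uses the numbers quoted in the messages (~500 GB for 7M images at 800x600, and ~5000 new Commons images per day); decimal units (1 GB = 10^9 bytes) and a constant upload rate are assumptions made for the illustration, not measurements.

```python
# Figures quoted in the thread; units and constant growth are assumptions.
TOTAL_BYTES = 500 * 10**9      # ~500 GB estimate for the resized dump
IMAGE_COUNT = 7_000_000        # ~7M Commons images
IMAGES_PER_DAY = 5_000         # ~5000 new images per day

# Average size of one resized image implied by the estimate.
avg_bytes = TOTAL_BYTES / IMAGE_COUNT

# Extra data per day and per year if uploads continue at the same rate.
daily_bytes = avg_bytes * IMAGES_PER_DAY
yearly_bytes = daily_bytes * 365

print(f"average per image: {avg_bytes / 1e3:.1f} KB")   # 71.4 KB
print(f"growth per day:    {daily_bytes / 1e6:.1f} MB")  # 357.1 MB
print(f"growth per year:   {yearly_bytes / 1e9:.1f} GB") # 130.4 GB
```

So the estimate implies roughly 71 KB per resized image, and a resized dump would grow by something like 130 GB per year at the quoted upload rate, which is the scale a per-year image dump would have to handle.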

13 years, 8 months

dumps archive server outage, empty sql.gz files for en wikipedia

by Ariel T. Glenn

We had a brief outage of the server that hosts (not runs) the XML dumps earlier today. There is an issue with one of the spare disks, and when we queried the RAID controller, the system hung. The server was brought back up shortly afterwards and dumps have resumed, but those that were in progress were aborted. New runs of those will be conducted in a few days when their turns come up again.

In other news, I noticed the empty MySQL table dumps for en wikipedia, ran one such dump by hand, and it ran fine. Those dumps were run just after an upgrade to a patched version of mysqld; I can't be sure what the particular problem was, but I'm chalking it up to the transition for now. We will be keeping our eye on the issue.

Ariel Glenn

13 years, 8 months

wikipedia XML dump, the text tag

by Norbert Kurz

Hello,

my name is Norbert Kurz and I am a student of applied computer science in Germany. I downloaded the 7.8 GB XML dump of the German Wikipedia and split it into article files. Now I want to convert the text in the text tag (<text>) into an HTML page. My problem is that there is a special syntax for tables, lists, links, etc.

My question is: is there a definition of this syntax, so that it is easy to write an XML-to-HTML script?

E.g.

Zu den Regisseuren, die das Pseudonym benutzt haben, gehören:
* [[Don Siegel]] und [[Robert Totten]] (für [[Frank Patch – Deine Stunden sind gezählt]]),
* [[David Lynch]] (für die dreistündige Fernsehfassung von [[Der Wüstenplanet (Film)|Der Wüstenplanet]]),
* [[Chris Christensen]] (The Omega Imperative),
* [[Stuart Rosenberg]] (für [[Let’s Get Harry]]),
* [[Richard C. Sarafian]] (für [[Starfire]]),
* [[Dennis Hopper]] (für [[Catchfire]]),
* [[Arthur Hiller]] (für [[An Alan Smithee Film: Burn Hollywood Burn]]),
* [[Rick Rosenthal]] (Birds II) und
* [[Kevin Yagher]] ([[Hellraiser IV – Bloodline]]).
* Der Pilotfilm der Serie [[MacGyver]] führt einen Alan Smithee als Regisseur <ref>http://www.imdb.com/title/tt0165375/ </ref>

The asterisk means that there is a list item, the double brackets [[ ]] mean that there is a link, and the pipe separates the link target from the displayed text: [[ LINKNAME | SHOWN_NAME ]]

Is there a file that describes all of these special cases, as well as the LaTeX stuff written in the XML files ( \longrightarrow ) and the tables?

Now I want to thank you all for your great work. I am happy that you make the effort to export the whole Wikipedia, so other people can download it and play around. Please keep up your good work.

Thanks in advance for your help.

Best regards
Norbert Kurz, Stuttgart, Germany
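A note on the question above: the content of the <text> element is not XML but wiki markup (wikitext), and its behaviour is defined in practice by the MediaWiki parser. The sketch below is only a toy illustration of the two cases described in the message (bullet lines and [[target|label]] links). The function names and the "/wiki/" link prefix are made up for the example, and templates, tables, <ref> tags and math are not handled at all (a <ref> tag is simply escaped and shown as text).

```python
import html
import re

# [[Target]] or [[Target|Label]]; nested brackets are not supported here.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def render_inline(text):
    """Escape plain text and turn [[...]] links into <a> elements."""
    out = []
    pos = 0
    for m in LINK_RE.finditer(text):
        out.append(html.escape(text[pos:m.start()]))
        target = m.group(1).strip()
        label = (m.group(2) or target).strip()
        # Assumed URL scheme for the example: /wiki/<title with underscores>
        href = "/wiki/" + target.replace(" ", "_")
        out.append('<a href="{}">{}</a>'.format(html.escape(href), html.escape(label)))
        pos = m.end()
    out.append(html.escape(text[pos:]))
    return "".join(out)


def wikitext_to_html(wikitext):
    """Convert '*' lines to <ul>/<li> and other non-empty lines to <p>."""
    parts = []
    in_list = False
    for line in wikitext.splitlines():
        line = line.strip()
        if line.startswith("*"):
            if not in_list:
                parts.append("<ul>")
                in_list = True
            parts.append("<li>" + render_inline(line[1:].strip()) + "</li>")
        else:
            if in_list:
                parts.append("</ul>")
                in_list = False
            if line:
                parts.append("<p>" + render_inline(line) + "</p>")
    if in_list:
        parts.append("</ul>")
    return "\n".join(parts)


print(wikitext_to_html("* [[Der Wüstenplanet (Film)|Der Wüstenplanet]]"))
```

For anything beyond experiments, reusing an existing wikitext parser is likely to be much less work than extending a regex-based approach like this one.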

13 years, 8 months

Re: [Xmldatadumps-l] a bug with wikipedia table dump

by Platonides

Forwarding to xmldatadumps-l

Alexander Sibiryakov wrote:
> Hello.
>
> I found a bug in the dump of the 'page' table in the latest update on the
> Wikipedia dump service (http://download.wikimedia.org).
>
> This file shouldn't be empty, but it is:
> http://download.wikimedia.org/enwiki/20100730/enwiki-20100730-page.sql.gz
>
> The status at http://download.wikimedia.org/enwiki/20100730/ is 'done' for it.
>
> Thanks for reading.
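If you want to check a downloaded table dump for this problem yourself, a minimal sketch follows. It assumes the .sql.gz file has already been saved locally; the function name and the example path are hypothetical. It only tests whether the decompressed content is empty, which says nothing about why the dump came out that way.

```python
import gzip


def gzip_payload_is_empty(path):
    """Return True if the gzip file at `path` decompresses to zero bytes."""
    with gzip.open(path, "rb") as fh:
        # Reading a single byte is enough; no need to decompress everything.
        return fh.read(1) == b""


if __name__ == "__main__":
    import sys

    # Usage (hypothetical local file): python check_dump.py enwiki-20100730-page.sql.gz
    for name in sys.argv[1:]:
        state = "EMPTY" if gzip_payload_is_empty(name) else "has data"
        print(f"{name}: {state}")
```

Checking the decompressed content rather than the file size matters because a gzip file with an empty payload is still a few bytes long on disk.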

13 years, 8 months

Re: [Xmldatadumps-l] enwiki dump progress on 20100622

by Dmitry Chichkov

Ok. How about changing the order and doing the .7z dump first?

-- Cheers, Dmitry

On Wed, Jul 21, 2010 at 4:03 AM, Jamie Morken <jmorken(a)shaw.ca> wrote:
> Hi,
>
> I was polling the http://download.wikimedia.org/enwiki/20100622/ page
> during the pages-meta-history.xml.bz2 database dump and here is some
> timestamped output from that page showing some errors that caused the dump
> to fail. Regarding the .bz2 dump format, Tomasz earlier suggested removing
> it and using .7z. I thought it might be good to keep the .bz2 format due to
> there being several programs that use it (i.e. wikitaxi, bzreader). The 7z
> format is probably the way to go for the future, but I don't know if
> this would fix the database dump errors.
>
> cheers,
> Jamie

13 years, 8 months
