Hi,
don't know if this issue came up already - in case it did and has been
dismissed, I beg your pardon. In case it didn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the XML dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug that bunzip2 doesn't. :-)
Strange? Read on.
A few hours ago, I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804) together with some test
results from a few hours before that.
The results indicate that bzip2 and pbzip2 are mutually compatible:
each one can create archives the other can read. When it comes to
decompressing with pbunzip2, however, only pbzip2-compressed archives
work properly.
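For anyone who wants to reproduce the test matrix, a minimal sketch in
Python (assuming the bzip2/bunzip2 and pbzip2/pbunzip2 commands are all
installed and on the PATH) could look like this:

#!/usr/bin/env python3
# Rough sketch of the compatibility matrix described above. Assumes the
# bzip2/bunzip2 and pbzip2/pbunzip2 commands are installed and on the PATH.
import os
import shutil
import subprocess
import tempfile

def roundtrip(compressor, decompressor):
    workdir = tempfile.mkdtemp()
    plain = os.path.join(workdir, "sample.txt")
    with open(plain, "wb") as f:
        f.write(b"some compressible test data\n" * 100000)
    # Compress in place: both tools replace sample.txt with sample.txt.bz2.
    subprocess.check_call([compressor, plain])
    # Decompress with -k (keep the archive); success means the tool coped.
    result = subprocess.run([decompressor, "-k", plain + ".bz2"])
    ok = result.returncode == 0 and os.path.exists(plain)
    shutil.rmtree(workdir)
    return ok

for comp in ("bzip2", "pbzip2"):
    for decomp in ("bunzip2", "pbunzip2"):
        print(comp, "->", decomp, ":",
              "ok" if roundtrip(comp, decomp) else "FAILED")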
I propose compressing the archives with pbzip2 for the following
reasons:
1) If your archiving machines are SMP systems, this could lead to
better use of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2,
so everything should keep working for them as usual.
3) pbzip2-compressed archives can be uncompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the
host.
So to sum up: it's a no-lose, two-win situation if you migrate to
pbzip2. And all that just because pbunzip2 is slightly buggy. Isn't
that interesting? :-)
cheers,
--
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com      Managing Director: Richard Jelinek
Human Language Technology Experts   Registered office: Fürth
69216618 Mind Units                 Commercial register: AG Fürth, HRB-9201
now enabled for all wikis, even the small ones where it's completely
pointless :-P (But someday they'll be big too!)
Some of you may have noticed that media bundle production is behind. We
are running into new performance issues and poking at them; for now,
please just be patient and we'll get the bundles going again as soon as
we can.
Ariel
cc-ed xmldatadumps-l
Hi,
2012/10/23 Dario Taraborelli <dtaraborelli(a)wikimedia.org>:
> 2012/10/23 James Forrester <james(a)jdforrester.org>:
>> On 22 October 2012 16:03, Hydriz Wikipedia <admin(a)alphacorp.tk> wrote:
>>> I have long been wanting to say this, but is it possible for the team behind
>>> compiling such datasets to put future (and if possible, current) datasets
>>> into dumps.wikimedia.org so that it is easier for everyone to find stuff and
>>> not be all over the place? Thanks for that!
>>
>> Many one-off and regular datasets, from query results to data dumps
>> and similar, are now indexed[0] on The Data Hub (formerly CKAN) run by
>> the Open Knowledge Foundation for precisely this reason - so that data
>> researchers can easily find data about Wikimedia, and see when it's
>> updated.
>>
>> [0] - http://thedatahub.org/en/group/wikimedia
>
> The dumps server was never meant to become a permanent open data repository, but it started being used as an ad-hoc solution to host all sorts of datasets published by WMF on top of the actual XML dumps: that's the problem we're trying to fix.
>
> Regardless of where the data is physically hosted, your go-to point to discover WMF datasets from now on is the DataHub. Think of it as a data registry: the registry is all you need to know in order to find where the data is hosted and to extract the appropriate metadata/documentation.
That's fine with me, but I think more communication about this would be
welcome. I've added a link on meta:Data_dumps¹ and I'll communicate
about this on the French Wikipedia, but a link on the dumps page for
other downloads² would be great.
Most people I've helped to find data on the Wikimedia projects now
know about dumps.wikimedia.org, but AFAIK none of them is reading
wiki-research-l.
Best regards,
¹ https://meta.wikimedia.org/wiki/Data_dumps
² http://dumps.wikimedia.org/other/
--
Jérémie
---------- Forwarded message ----------
From: Zachary Harris <zacharyharris(a)hotmail.com>
Date: Mon, Oct 8, 2012 at 5:49 PM
Subject: [Wikitech-l] "Latest" md5sums of dumps
To: wikitech-l(a)lists.wikimedia.org
It is a bit confusing that
http://dumps.wikimedia.org/enwiki/20121001/enwiki-20121001-md5sums.txt
has newer hash values that aren't in
http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-md5sums.txt. I'm
guessing that the "latest" md5sums files get updated only at the end of
a cycle when all dumps for the month have been completed? (In other
words, the idea is that the "latest" directory contains backups that
have been *completed*, and the md5sums file isn't complete until *all of
the other backups* which it is going to reference are complete.) I was
expecting the "latest" md5sums file to have "rolling" content matching
the rolling content of the "latest" directory itself. If the latter is
possible without too much administrative hassle, it could be a nice
feature. Otherwise, it would seem that "latest-md5sums" is almost always
outdated relative to the other files in "latest".
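For what it's worth, a minimal sketch of checking a downloaded file
against one of those md5sums lists (assuming the usual md5sum-style
"<hex digest>  <filename>" lines) might be:

# Rough sketch: verify one downloaded dump file against an md5sums list.
import hashlib
import os
import sys

def load_md5sums(md5sums_path):
    # Parse md5sum-style lines of the form "<hex digest>  <filename>".
    sums = {}
    with open(md5sums_path) as f:
        for line in f:
            digest, _, name = line.strip().partition("  ")
            if digest and name:
                sums[name] = digest
    return sums

def md5_of(path, chunk_size=1 << 20):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

if __name__ == "__main__":
    md5sums_file, dump_file = sys.argv[1], sys.argv[2]
    expected = load_md5sums(md5sums_file).get(os.path.basename(dump_file))
    actual = md5_of(dump_file)
    print("OK" if actual == expected else
          "MISMATCH: expected %s, got %s" % (expected, actual))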
-Zach
Hi,
I would like to fetch the page text of all wiki pages that belong to a
movie-related category, e.g.:
http://en.wikipedia.org/wiki/Category:Hindi_songs
From the page text I would like to extract information related to song
title, song length, singer, name of movie/album, etc. I am not interested
in extracting images, just the information about the songs.
My questions:
1) Is there a way to download only the pages that I am interested in
(those belonging to a particular category) instead of downloading the
entire dump? (See the sketch after these questions.)
2) Is it required to have PHP knowledge to install the db dump on a local
machine?
3) Are there any tools that extract the information and provide the
required data to be stored in a MySQL database?
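To make question 1 concrete, here is a rough sketch of the kind of
retrieval I have in mind; it uses the public MediaWiki web API instead of
a dump and assumes the third-party "requests" package (I don't know
whether this is the recommended route):

# Rough sketch: list the members of a category via the public MediaWiki
# API and fetch each page's wikitext, instead of downloading the whole dump.
import requests

API = "https://en.wikipedia.org/w/api.php"

def category_members(category, session):
    # Yield page titles in the given category, following API continuation.
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": "500",
        "format": "json",
        "continue": "",
    }
    while True:
        data = session.get(API, params=params).json()
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        if "continue" not in data:
            break
        params.update(data["continue"])

def page_wikitext(title, session):
    # Return the current wikitext of one page.
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "titles": title,
        "format": "json",
    }
    data = session.get(API, params=params).json()
    page = next(iter(data["query"]["pages"].values()))
    return page["revisions"][0]["*"]

with requests.Session() as session:
    session.headers["User-Agent"] = "category-fetch-sketch/0.1 (replace with a real contact)"
    for title in category_members("Category:Hindi songs", session):
        print(title, len(page_wikitext(title, session)))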
If this is not the right forum to have my questions answered, could you
please redirect me to the appropriate one.
Thanks and regards,
Venkatesh Channal
£1.99 per page? :D
From: Roberto Flores
Sent: Thursday, October 04, 2012 3:15 AM
To: Pablo N. Mendes
Cc: James L ; Wikipedia Xmldatadumps-l ; Wikimedia developers
Subject: Re: [Xmldatadumps-l] [Wikitech-l] HTML wikipedia dumps: Could you please provide them, or make public the code for interpreting templates?
Could we have an HTML dump for X amount of money?
Something like a paid feature.
Include the CSS of course.
Also, leave the <math> tags as they are, as those have to be processed by 3rd party libraries.
2012/9/17 Pablo N. Mendes <pablomendes(a)gmail.com>
I also think the HTML dumps would be super useful!
Cheers
Pablo
On Sep 17, 2012 8:05 PM, "James L" <james_leaver(a)hotmail.com> wrote:
I’m all for continuing the HTML wiki dumps that were once done; 2007 was the last one? Why were these discontinued? They would be more useful than the so-called “XML”.
There is no complete solution for processing the dumps; the XML is most certainly not XML in its lowest form, and it IS DEFINITELY a moving target!
Regards,
From: Roberto Flores
Sent: Sunday, September 09, 2012 8:07 PM
To: Wikimedia developers
Cc: Wikipedia Xmldatadumps-l
Subject: Re: [Xmldatadumps-l] [Wikitech-l] HTML wikipedia dumps: Could you please provide them, or make public the code for interpreting templates?
Allow me to reply to each point:
(By the way, my offline app is called WikiGear Offline:)
http://itunes.apple.com/us/app/wikigear-offline/id453614487?mt=8
> Templates are dumped just like all other pages are...
Yes, but that's only a text description of what the template does.
Code must be written to actually process them into HTML.
There are tens of thousands of them, and some I can't even program myself (e.g., Wiktionary's conjugation templates).
If they were already pre-processed into HTML inside the articles' contents, that would solve all of my problems.
> What purpose would the dump serve? You don't want to keep the full dump
> on the device.
I made an indexing program that selects only content articles (namespaces included) and compresses it all to a reasonable size (e.g. about 7 GB for the English Wikipedia).
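(Not the actual WikiGear code, just a rough sketch of the general idea:
stream a pages-articles dump and keep only the namespaces of interest,
assuming the standard <page>/<ns>/<title>/<revision>/<text> layout of the
dumps.)

import bz2
import xml.etree.ElementTree as ET

KEEP_NAMESPACES = {"0"}  # MediaWiki namespace numbers to keep; 0 = articles

def local_name(tag):
    # Strip the XML namespace, e.g. "{http://...}page" -> "page".
    return tag.rsplit("}", 1)[-1]

def content_pages(dump_path):
    # Yield (title, wikitext) for pages in the selected namespaces.
    title = ns = text = None
    with bz2.open(dump_path, "rb") as f:
        for _, elem in ET.iterparse(f):
            name = local_name(elem.tag)
            if name == "title":
                title = elem.text
            elif name == "ns":
                ns = elem.text
            elif name == "text":
                text = elem.text
            elif name == "page":
                if ns in KEEP_NAMESPACES:
                    yield title, text
                title = ns = text = None
                elem.clear()  # drop page contents we no longer need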
> How would this template API function? What does import mean?
By this I mean a set of functions, written in some programming language,
to which I could send the template wiki markup and receive HTML to display.
Wikipedia does this whenever a page is requested, but I don't know the
exact mechanism through which it's performed.
Maybe you just need to make that code publicly available, and I'll try to make it work with my application somehow.
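To make the request concrete, here is a rough sketch of the kind of
interface I mean, using the live web API's action=parse as a stand-in
(this assumes the third-party "requests" package; doing the same locally,
without calling the live site, is what I'm really after):

# Rough sketch: send wikitext to the live web API (action=parse) and get
# back HTML with templates expanded server-side.
import requests

API = "https://en.wikipedia.org/w/api.php"

def render_wikitext(wikitext):
    params = {
        "action": "parse",
        "text": wikitext,
        "contentmodel": "wikitext",
        "format": "json",
    }
    data = requests.post(API, data=params).json()
    return data["parse"]["text"]["*"]

# e.g. a template-only snippet comes back as rendered HTML:
print(render_wikitext("{{convert|100|km|mi}}"))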
2012/9/9 Jeremy Baron <jeremy(a)tuxmachine.com>
On Sun, Sep 9, 2012 at 6:34 PM, Roberto Flores <f.roberto.isc(a)gmail.com> wrote:
> I have developed an offline Wikipedia, Wikibooks, Wiktionary, etc. app for
> the iPhone, which does a somewhat decent job at interpreting the wiki
> markup into HTML.
> However, there are too many templates for me to program (not to mention,
> it's a moving target).
> Without converting these templates, many articles are simply unreadable and
> useless.
Templates are dumped just like all other pages are. Have you found
them in the dumps? Which dump are you looking at right now?
> Could you please provide HTML dumps (I mean, with the templates
> pre-processed into HTML, everything else the same as now) every 3 or 4
> months?
A 3- or 4-month frequency seems unlikely to be useful to many people.
Otherwise, no comment.
> Or alternatively, could you make the template API available so I could
> import it in my program?
How would this template API function? What does import mean?
-Jeremy
Some time ago I announced a trial of bz2 multistream files for en
wikipedia on this list. The generated index files turned out to have a
problem with the offsets somewhere, and due to various other tasks this
fell by the wayside.
That bug is now fixed, the September en wikipedia bz2 multistream index
file was regenerated, and a little toy offline reader is now available
as a proof of concept for how one might work with these files. A brief
reminder about what the format does: it allows rough random access to
the XML page content. For the code, see:
https://gerrit.wikimedia.org/r/gitweb?p=operations/dumps.git;a=tree;f=toys/…
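That tree is the reference; purely to illustrate the access pattern, a
rough sketch in Python (assuming index lines of the form
"offset:pageid:title" and an already decompressed index file) might look
like this:

# Sketch of the access pattern the multistream format allows: look a page
# up in the index, seek to that byte offset in the multistream .xml.bz2,
# and decompress only the one bz2 stream holding it.
import bz2

def find_offset(index_path, wanted_title):
    # Return the byte offset of the stream containing wanted_title.
    with open(index_path, encoding="utf-8") as f:
        for line in f:
            offset, _, rest = line.rstrip("\n").partition(":")
            _pageid, _, title = rest.partition(":")
            if title == wanted_title:
                return int(offset)
    raise KeyError(wanted_title)

def read_stream(multistream_path, offset):
    # Decompress the single bz2 stream starting at offset; the result is a
    # chunk of <page>...</page> XML containing the wanted page.
    decomp = bz2.BZ2Decompressor()
    chunks = []
    with open(multistream_path, "rb") as f:
        f.seek(offset)
        while not decomp.eof:
            block = f.read(64 * 1024)
            if not block:
                break
            chunks.append(decomp.decompress(block))
    return b"".join(chunks).decode("utf-8")

# Usage (file names are placeholders):
# xml = read_stream("multistream.xml.bz2",
#                   find_offset("multistream-index.txt", "Some title"))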
If I don't hear how broken things are over the next few days, I expect
to enable generation of this format for all wiki projects shortly.
Ariel