Dear Jeremy,
Happy New Year, and thanks for your e-mail of 2012-12-31.
0) ADMINISTRATIVE
I fixed the capitalization of "Wikimedia" in both documentation and home page.
I am now subscribed to this list and have read the last two years of postings.
1) SPDY
WP-MIRROR 0.5 and prior versions obtain image files from
<http://upload.wikimedia.org/>. SPDY would reduce latency. WP-MIRROR
0.6 (not yet released) uses HTTP/1.1 persistent connections.
WP-MIRROR 0.6 has built-in profiling, and the image downloading
process now uses 64% less (wall clock) time. Therefore SPDY may not
provide much advantage. Thanks also for informing me of the image
tarballs.
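The persistent-connection approach can be sketched like this; the host
and paths are hypothetical placeholders, and this illustrates only the
reuse of one TCP connection for many requests, not WP-MIRROR's actual
downloader:

```python
import http.client

def fetch_many(host, paths):
    """Fetch several resources over a single persistent HTTP/1.1
    connection instead of opening one connection per file."""
    conn = http.client.HTTPConnection(host, timeout=60)
    results = {}
    try:
        for path in paths:
            conn.request("GET", path)    # reuses the same TCP connection
            resp = conn.getresponse()
            results[path] = resp.read()  # drain fully before next request
    finally:
        conn.close()
    return results
```

Skipping the TCP and TLS handshake per file is where the wall-clock
savings come from when fetching many small images from one host.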
Conclusion: I will not pursue SPDY, for lack of a requirement.
Action Item: WP-MIRROR 0.6 will make use of image tarballs.
2) METALINK
WP-MIRROR 0.5 and prior versions had to deal with thousands of corrupt
image files. Most of these were partial downloads: cURL would time
out and leave corrupt files behind. I currently deal with that by
validating the images. Validation, however, consumes a lot of time,
so I am looking for ways to improve the reliability of downloading.
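The validation step can be sketched as follows. This is only an
illustration of hash-based file validation, not WP-MIRROR's actual
code; the SHA-1 choice and the function names are my own assumptions:

```python
import hashlib
from pathlib import Path

def sha1_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-1 digest of a file, reading in 1 MiB chunks
    so large images do not have to fit in memory."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def validate_download(path, expected_sha1):
    """True if the file exists, is non-empty, and matches the hash;
    a partial download fails the hash check and can be re-fetched."""
    p = Path(path)
    if not p.is_file() or p.stat().st_size == 0:
        return False
    return sha1_of_file(p) == expected_sha1
```

The expense is clear from the sketch: every byte of every image must be
re-read and hashed, which is why more reliable downloads would be such
a win.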
Metalink was brought to my attention by Jason Skomorowski. The
relevant documents are RFC 5854 and RFC 6249. From the latter:
"This document describes a mechanism by which the benefit of mirrors
can be automatically and more effectively realized. All the
information about a download, including mirrors, cryptographic
hashes, digital signatures, and more can be transferred in
coordinated HTTP header fields, hereafter referred to as a
"Metalink". This Metalink transfers the knowledge of the download
server (and mirror database) to the client. Clients can fall back to
other mirrors if the current one has an issue. With this knowledge,
the client is enabled to work its way to a successful download even
under adverse circumstances. All this can be done without
complicated user interaction, and the download can be much more
reliable and efficient. In contrast, a traditional HTTP redirect to
a mirror conveys only minimal information -- one link to one server
-- and there is no provision in the HTTP protocol to handle failures.
Furthermore, in order to provide better load distribution across
servers and potentially faster downloads to users, Metalink/HTTP
facilitates multi-source downloads, where portions of a file are
downloaded from multiple mirrors (and, optionally, Peer-to-Peer)
simultaneously.
Upon connection to a Metalink/HTTP server, a client will receive
information about other sources of the same resource and a
cryptographic hash of the whole resource. The client will then be
able to request chunks of the file from the various sources,
scheduling appropriately in order to maximize the download rate."
The benefit to WP-MIRROR would be much more reliable downloads, which
would obviate the file validation process.
The benefits to folks on this e-mail list are: a) your mirror sites
would get more traffic (Ariel mentioned that they are getting very
little); b) the download process (for Metalink-capable clients) would
be robust against the outage of any one mirror; and c) Metalink-capable
clients are now common (cURL, KGet, ...).
I understand that the idea for Metalink originated with those who
posted GNU/Linux distributions in .iso format. With each new .iso
release, there would be a surge of downloading, causing many partial
downloads (i.e. much wasted bandwidth). Metalink helped spread the
load and, by transporting hashes, improved download integrity.
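The client-side fallback idea from RFC 6249 can be sketched as follows.
This is not a full Metalink/HTTP client (which reads mirrors and digests
from Link: and Digest: response headers); it only illustrates the
try-next-mirror-and-verify behavior, and all names here are hypothetical:

```python
import hashlib

def fetch_with_fallback(urls, fetch, expected_sha256=None):
    """Try each mirror URL in turn; return the payload from the first
    mirror that responds and (if a hash is given) verifies. `fetch`
    is any callable mapping a URL to bytes, e.g. a urllib wrapper."""
    last_error = None
    for url in urls:
        try:
            data = fetch(url)
        except OSError as err:       # refused, timed out, reset, ...
            last_error = err
            continue
        if expected_sha256 and hashlib.sha256(data).hexdigest() != expected_sha256:
            last_error = ValueError("hash mismatch from " + url)
            continue                 # partial/corrupt copy: try the next mirror
        return data
    raise RuntimeError("all mirrors failed; last error: %r" % (last_error,))
```

Because the hash travels with the mirror list, a corrupt partial
download is detected immediately and retried elsewhere, rather than
being discovered later by a separate validation pass.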
Conclusion: I will table the issue of Metalink, for lack of an
immediate requirement.
Action Item: WP-MIRROR 0.6 will incorporate your list of dump/tarball
mirror sites as a configurable parameter.
3) RSYNC
Thanks for letting me know that dumps and tarballs are available via
rsync. I much prefer rsync over HTTP and FTP. I mirror the Debian
archive, and recently switched from apt-mirror (which uses wget) to
ftpsync (which uses rsync); I am very happy with the results.
Action Item: WP-MIRROR 0.6 will make use of rsync.
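A minimal sketch of what "make use of rsync" could look like from a
wrapper. The rsync source path is hypothetical (substitute whatever
module/path the mirror operator publishes), and the flag set is just a
sensible starting point:

```python
import subprocess

# Hypothetical rsync source; substitute the actual module/path
# published by the dump mirror operator.
SOURCE = "rsync://mirror.example.org/wikimedia-dumps/enwiki/"

def build_rsync_cmd(source, dest):
    """-a preserves attributes; --partial keeps interrupted transfers
    so rsync can resume them; --timeout guards against stalled links."""
    return ["rsync", "-av", "--partial", "--timeout=600", source, dest]

def sync_dumps(source=SOURCE, dest="/var/mirror/enwiki/"):
    """Run rsync; only new or changed files are transferred."""
    return subprocess.run(build_rsync_cmd(source, dest), check=True)
```

The appeal over plain HTTP is exactly the partial-download problem
above: --partial lets an interrupted transfer resume instead of being
thrown away.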
Ariel raised some other points, which I shall address in a separate email.
Sincerely Yours,
Kent
Dear List Members,
Does anyone know if the Wikimedia Foundation plans to support Metalink
or SPDY for its dump files and/or image files? See the RFC references
below.
WP-MIRROR downloads dump and image files to build a mirror of a set of
wikipedias. WP-MIRROR 0.5 is feature complete. I am now looking for
ways to optimize performance (i.e. reduce mirror build time). Were
the WMF to support the above two protocols, downloads would be faster
and require less time spent on validation.
Sincerely Yours,
Kent
On 12/29/12, Sumana Harihareswara <sumanah(a)wikimedia.org> wrote:
> Hello! I'm sorry, but I don't know the answer to these questions;
> perhaps you could email the dumps mailing list
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l ? My
> apologies.
>
>
> Sumana Harihareswara
> Engineering Community Manager
> Wikimedia Foundation
>
>
> On Sun, Dec 16, 2012 at 6:14 AM, wp mirror <wpmirrordev(a)gmail.com> wrote:
>> Dear Sumana,
>>
>> 1) Metalink. Does the Wikimedia Foundation have any plans to support
>> metalink for either its dump files or its image files?
>>
>> Documentation:
>> <http://tools.ietf.org/html/rfc5854>, "The Metalink Download Description
>> Format"
>> <http://tools.ietf.org/html/rfc6249>, "Metalink/HTTP: Mirrors and Hashes"
>>
>> 2) SPDY. Does the Wikimedia Foundation have any plans to support SPDY?
>>
>> Documentation: <http://www.chromium.org/spdy>
>>
>> 3) WP-MIRROR. We last communicated 2012-01-06 in regards to WP-MIRROR.
>>
>> Status: WP-MIRROR 0.5 is `feature complete', and works
>> `out-of-the-box' for the GNU/Linux distributions: Debian 7.0 (wheezy)
>> and Ubuntu 12.10 (quantal).
>>
>> Future: Attention is turning towards performance enhancement and
>> porting to other distributions.
>>
>> Homepage: <http://www.nongnu.org/wp-mirror/>
>>
>> Please give it a try. Feedback is most welcome.
>>
>> Sincerely Yours,
>> Kent
>
I had an email exchange with one of the folks at our mirror sites about
the low volume of traffic they are getting. Clearly we need to
publicize this list better, bearing in mind that files on our mirrors
may be a day behind the live site. I wouldn't think that a day's delay
is very important in the grand scheme of things though.
So I'm looking for suggestions on how best to make the list of mirrors
visible to dumps users/downloaders. This includes changes to [1] and
[2] among other things. Bear in mind that 'best' also implies 'easy to
do' or 'here is a patch' :-D
Ariel
[1]
https://gerrit.wikimedia.org/r/gitweb?p=operations/dumps.git;a=blob;f=xmldu…
(download page for all dumps, showing each dump in order of completion)
[2]
https://gerrit.wikimedia.org/r/gitweb?p=operations/dumps.git;a=blob;f=xmldu…
(download page for a given dump)
Snapshot1, which was running several dumps for 'big' wikis, fell over
due to swapdeath today. While we investigate the issue, those jobs will
be stalled. I'll send an update as soon as we have more info.
Ariel
Hi,
I am trying to extract translations from Wiktionaries in different languages.
Currently I use the "All pages, current versions only" dump. Is there a way
to find out the language template tags (is that the correct term?) for each
Wiktionary and each language?
For example:
This is the Hungarian page 'karcsu' (slim, slender)
http://hu.wiktionary.org/wiki/karcs%C3%BA (the edit page:
http://hu.wiktionary.org/w/index.php?title=karcs%C3%BA&action=edit)
The translation table always (?) starts like this:
{{-ford-}}
{{trans-top}}
*{{en}}: {{t|en|slim}}, {{t|en|slender}}
Where {{-ford-}} comes from the word forditas (translation in
Hungarian; I skipped the accents). The translations look like the
third row and (hopefully) contain the other languages' wiki codes
(en, fr, de).
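A rough sketch of how such rows could be parsed; the regex is my own
assumption, tuned only to the hu.wiktionary examples above, and other
Wiktionaries will likely need different patterns:

```python
import re

# Matches {{t|<lang>|<word>}} translation templates as seen on
# hu.wiktionary; other Wiktionaries may use {{t+|...}} or other forms.
T_TEMPLATE = re.compile(r"\{\{t\|([a-z-]+)\|([^}|]+)")

def parse_translation_row(line):
    """Return (language code, [words]) for a row such as
    '*{{en}}: {{t|en|slim}}, {{t|en|slender}}', or None if the line
    carries no translation templates."""
    pairs = T_TEMPLATE.findall(line)
    if not pairs:
        return None
    return pairs[0][0], [word for _lang, word in pairs]
```

Running it on the example row above yields the language code "en" with
the words "slim" and "slender".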
Also, on the page 'slim' in the Hungarian Wiktionary there are some
tags which nobody would understand unless they are Hungarian and have
learned some Hungarian grammar.
http://hu.wiktionary.org/wiki/slim and
http://hu.wiktionary.org/w/index.php?title=slim&action=edit
The first line is:
{{engmell|comp=slimmer|sup=slimmest|pron=/slɪm/|audio=us}}
Where 'engmell' is derived from 'english melleknev', melleknev meaning
adjective in Hungarian. The rest is similarly confusing.
It gets even more confusing if I look at other Wiktionaries. It seems that
there are no standards that all Wiktionaries follow.
Is this meta-information available somewhere?
I hope I managed to explain it clearly and I am asking on the right list.
Thank you in advance,
Judit Acs
Hello,
I am new to this list and have a question about importing XML dumps from
Wikipedia (http://dumps.wikimedia.org/enwiki/20121101/) into an offline
MediaWiki database. I have locally installed XAMPP on Windows 8 and replaced
the included 32-bit MySQL version with the latest 64-bit version. I then
installed MediaWiki 1.20.0 with an empty database.
When trying to import an XML dump (the 2012-11-01 dump) with
importDump.php in the maintenance folder of the MediaWiki
installation, I get the following error after about 2 seconds:
"WikiRevision given a null title in import. You may need to adjust
$wgLegalTitleChars." which is thrown at line 1032 in Import.php, because
some $title seems to be null. Replacing the exception with "$this->title =
null" (evil ^^) leads to other errors.
xml2sql and mwdumper seem to be outdated as I cannot get them working with
the current dumps. Special:Import is not an option due to the size of the
XML files.
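For reference, one way to narrow this down before rerunning
importDump.php is to scan the dump for pages whose <title> element is
missing or empty. A minimal sketch (the schema namespace below matches
1.20-era dumps; adjust it to the xmlns on your file's root element):

```python
import xml.etree.ElementTree as ET

# Namespace for 1.20-era dumps; check the xmlns attribute on the
# <mediawiki> root element of your file and adjust if it differs.
NS = "{http://www.mediawiki.org/xml/export-0.8/}"

def find_bad_titles(path):
    """Yield the <id> of every page whose <title> is missing or empty,
    so the offending records can be inspected before re-importing."""
    for _event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            if not title or not title.strip():
                yield elem.findtext(NS + "id")
            elem.clear()  # keep memory flat on multi-gigabyte dumps
```

iterparse streams the file, so this works even on dumps far too large
for Special:Import or for loading the whole tree into memory.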
Any help would be appreciated :)
P.S. The cause is not the missing + in $wgLegalTitleChars that a
Google search on that error suggests.
Best Regards
Chris
Yay, the network (or NFS) performance issues on your.org seem to have
been straightened out, and last month's full dump is available; this
month's is running now.
Ariel
This dump is failing and due to our MediaWiki config setup on the
production cluster we don't get the exception message so I have no idea
what the problem is. I'll do some live hacks and look at this tomorrow.
Thanks for your patience.
Ariel