As a Commons admin I've thought a lot about the problem of distributing Commons dumps. For distribution, I believe BitTorrent is absolutely the way to go, but the torrent will require a small network of dedicated permaseeds (servers that seed indefinitely). These can easily be set up at low cost on Amazon EC2 "small" instances: the disk storage for the archives is free, since small instances include a large (~120 GB) ephemeral storage volume at no additional cost, and the cost of bandwidth can be controlled by configuring the BitTorrent client with either a bandwidth throttle or a transfer cap (or both). In fact, I think all Wikimedia dumps should be available through such a distribution channel, just as Linux installation media are today.
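To make the bandwidth-control point concrete, here is a rough Python sketch of how a transfer cap might be enforced against a Transmission-based permaseed through its JSON-RPC interface. The daemon address, the 250 GB figure, and the RPC field names are assumptions to be checked against the client's RPC documentation; this is an illustration, not a tested setup, and any client with a remote API could be used instead.

import json
import urllib.error
import urllib.request

RPC_URL = "http://localhost:9091/transmission/rpc"   # assumed default daemon address
MONTHLY_CAP_BYTES = 250 * 1024**3                     # assumed 250 GB monthly budget

def rpc_call(method, arguments=None, session_id=""):
    # POST one JSON-RPC request; on a 409 reply, retry with the CSRF
    # session id that the daemon returns in its response headers.
    body = json.dumps({"method": method, "arguments": arguments or {}}).encode()
    req = urllib.request.Request(RPC_URL, data=body,
                                 headers={"X-Transmission-Session-Id": session_id})
    try:
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 409:
            return rpc_call(method, arguments, err.headers["X-Transmission-Session-Id"])
        raise

def enforce_cap():
    # "current-stats" counts bytes since the daemon started, so a monthly
    # restart (or a recorded baseline) approximates a per-month window.
    stats = rpc_call("session-stats")
    uploaded = stats["arguments"]["current-stats"]["uploadedBytes"]
    if uploaded >= MONTHLY_CAP_BYTES:
        # Throttle uploads to a trickle once the cap is hit.
        rpc_call("session-set", {"speed-limit-up": 1,          # KB/s
                                 "speed-limit-up-enabled": True})
        print("Cap reached (%d bytes uploaded); seeding throttled." % uploaded)
    else:
        print("Within cap: %d of %d bytes used." % (uploaded, MONTHLY_CAP_BYTES))

if __name__ == "__main__":
    enforce_cap()

Run from cron every few minutes, a check like this keeps a permaseed's monthly bill predictable while leaving the speed throttle to the client's own settings.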
Additionally, it will be necessary to construct (and maintain) useful subsets of Commons media, such as "all media used on the English Wikipedia" or "thumbnails of all images on Wikimedia Commons", which are of particular interest to certain content reusers, since the full set is far too large for most of them. It's on this latter point that I want your input: what useful subsets of Wikimedia Commons does the research community want? Thanks for your feedback.
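As one concrete reading of the "thumbnails" subset, the sketch below asks the Commons API (prop=imageinfo with iiurlwidth) for scaled-down URLs of a handful of files. The 320 px width and the sample title are placeholders; a real subset builder would stream titles from the dumps rather than hard-code them.

import json
import urllib.parse
import urllib.request

API = "https://commons.wikimedia.org/w/api.php"
SAMPLE_TITLES = ["File:Example.jpg"]   # placeholder input; a real run would read titles from a dump

def thumb_urls(titles, width=320):
    # Ask the API for a server-side scaled rendering of each file.
    params = {
        "action": "query",
        "titles": "|".join(titles),
        "prop": "imageinfo",
        "iiprop": "url",
        "iiurlwidth": width,
        "format": "json",
    }
    url = API + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    for page in data["query"]["pages"].values():
        for info in page.get("imageinfo", []):
            yield page["title"], info.get("thumburl")

if __name__ == "__main__":
    for title, thumb in thumb_urls(SAMPLE_TITLES):
        print(title, thumb)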
--
Derrick Coetzee
User:Dcoetzee, English Wikipedia and Wikimedia Commons administrator
http://www.eecs.berkeley.edu/~dcoetzee/
On Mon, Jun 27, 2011 at 6:49 AM, wiki-research-l-request@lists.wikimedia.org wrote:
Date: Mon, 27 Jun 2011 06:18:31 -0400
From: Samuel Klein <sjklein@hcs.harvard.edu>
Subject: Re: [Wiki-research-l] Wikipedia dumps downloader
Thank you, Emijrp!
What about the dump of Commons images? [for those with 10 TB to spare]
SJ
On Sun, Jun 26, 2011 at 8:53 AM, emijrp emijrp@gmail.com wrote:
Hi all;
Can you imagine a day when Wikipedia is added to this list?[1]
WikiTeam has developed a script[2] to download all the Wikipedia dumps (and those of its sister projects) from dumps.wikimedia.org. It sorts them into folders and checks md5sums. It only works on Linux (it uses wget).
You will need about 100GB to download all the 7z files.
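For readers curious what the fetch-and-verify step looks like, here is a minimal Python sketch; the example URL and checksum are placeholders for illustration, not the WikiTeam script itself (which uses wget).

import hashlib
import urllib.request

DUMP_URL = "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"  # example file
EXPECTED_MD5 = "0123456789abcdef0123456789abcdef"   # hypothetical checksum from the published md5sums list

def download_and_verify(url, expected_md5, dest="dump.out"):
    md5 = hashlib.md5()
    with urllib.request.urlopen(url) as src, open(dest, "wb") as out:
        while True:
            chunk = src.read(1 << 20)   # 1 MB at a time keeps memory use flat
            if not chunk:
                break
            out.write(chunk)
            md5.update(chunk)
    if md5.hexdigest() != expected_md5:
        raise ValueError("md5 mismatch: got %s" % md5.hexdigest())
    return dest

# download_and_verify(DUMP_URL, EXPECTED_MD5)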
Save our memory.
Regards, emijrp
[1] http://en.wikipedia.org/wiki/Destruction_of_libraries
[2] http://code.google.com/p/wikiteam/source/browse/trunk/wikipediadownloader.py
--
Samuel Klein    identi.ca:sj    w:user:sj    +1 617 529 4266
Message: 5
Date: Mon, 27 Jun 2011 13:07:51 +0200
From: emijrp <emijrp@gmail.com>
Subject: Re: [Wiki-research-l] [Xmldatadumps-l] Wikipedia dumps downloader
To: Richard Farmbrough <richard@farmbrough.co.uk>
Cc: xmldatadumps-l@lists.wikimedia.org, wikiteam-discuss@googlegroups.com, Wikimedia Foundation Mailing List <foundation-l@lists.wikimedia.org>, Research into Wikimedia content and communities <wiki-research-l@lists.wikimedia.org>
Hi Richard;
Yes, a distributed project would probably be the best solution, but it is not easy to develop unless you use a library like BitTorrent (or similar) and you have many peers. However, most people don't seed the files for long, so sometimes it is better to depend on a few committed people than on a big but ephemeral crowd.
Regards, emijrp
2011/6/26 Richard Farmbrough richard@farmbrough.co.uk
It would be useful to have an archive of archives. I have to delete my old data dumps as time passes, for space reasons; however, a team could, between them, maintain multiple copies of every data dump. This would make a nice distributed project.
Hi;
@Derrick: I don't trust Amazon. Really, I don't trust the Wikimedia Foundation either. They can't and/or don't want to provide image dumps (which is worse?). The community donates images to Commons, the community donates money every year, and now the community needs to develop software to extract all the images, pack them, and of course host them in a permanent way. Crazy, right?
@Milos: Instead of splitting the image dump by the first letter of the filenames, I thought about splitting it by upload date (YYYY-MM-DD). That way, the first chunks (2005-01-01) will be tiny, and recent ones several GB each (for a single day).
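A possible sketch of that date-based chunking, assuming a plain tab-separated list of filename and upload timestamp as input (the field layout is an assumption for illustration, not the actual dump format):

from collections import defaultdict

def chunk_by_upload_day(lines):
    # Map "YYYY-MM-DD" -> filenames uploaded that day.
    chunks = defaultdict(list)
    for line in lines:
        filename, timestamp = line.rstrip("\n").split("\t")
        chunks[timestamp[:10]].append(filename)   # "2005-01-01T08:00:00Z" -> "2005-01-01"
    return chunks

if __name__ == "__main__":
    sample = ["Example.jpg\t2005-01-01T08:00:00Z",
              "Another_file.png\t2011-06-28T17:30:00Z"]
    for day, files in sorted(chunk_by_upload_day(sample).items()):
        print(day, len(files), "file(s)")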
Regards, emijrp
I remember that there's a request for comments at http://www.mediawiki.org/wiki/Research_Data_Proposals#Dump to prioritize the things which need to get done (or not).
Nemo