As a Commons admin I've thought a lot about the problem of distributing Commons dumps. For distribution, I believe BitTorrent is absolutely the way to go, but the torrent will require a small network of dedicated permaseeds (servers that seed indefinitely). These can easily be set up at low cost on Amazon EC2 "small" instances: the disk storage for the archives is free, since small instances include a large (~120 GB) ephemeral storage volume at no additional cost, and the cost of bandwidth can be controlled by configuring the BitTorrent client with either a bandwidth throttle or a transfer cap (or both). In fact, I think all Wikimedia dumps should be available through such a distribution channel, just as Linux installation media are today.
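To make the bandwidth-control point concrete, here is a rough Python sketch of how a transfer cap might be enforced against a Transmission-based permaseed through its JSON-RPC interface. The daemon address, the 250 GB figure, and the RPC field names are assumptions to be checked against the client's RPC documentation; this is an illustration, not a tested setup, and any client with a remote API could be used instead.

import json
import urllib.error
import urllib.request

RPC_URL = "http://localhost:9091/transmission/rpc"   # assumed default daemon address
MONTHLY_CAP_BYTES = 250 * 1024**3                     # assumed 250 GB monthly budget

def rpc_call(method, arguments=None, session_id=""):
    # POST one JSON-RPC request; on a 409 reply, retry with the CSRF
    # session id that the daemon returns in its response headers.
    body = json.dumps({"method": method, "arguments": arguments or {}}).encode()
    req = urllib.request.Request(RPC_URL, data=body,
                                 headers={"X-Transmission-Session-Id": session_id})
    try:
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 409:
            return rpc_call(method, arguments, err.headers["X-Transmission-Session-Id"])
        raise

def enforce_cap():
    # "current-stats" counts bytes since the daemon started, so a monthly
    # restart (or a recorded baseline) approximates a per-month window.
    stats = rpc_call("session-stats")
    uploaded = stats["arguments"]["current-stats"]["uploadedBytes"]
    if uploaded >= MONTHLY_CAP_BYTES:
        # Throttle uploads to a trickle once the cap is hit.
        rpc_call("session-set", {"speed-limit-up": 1,          # KB/s
                                 "speed-limit-up-enabled": True})
        print("Cap reached (%d bytes uploaded); seeding throttled." % uploaded)
    else:
        print("Within cap: %d of %d bytes used." % (uploaded, MONTHLY_CAP_BYTES))

if __name__ == "__main__":
    enforce_cap()

Run from cron every few minutes, a check like this keeps a permaseed's monthly bill predictable while leaving the speed throttle to the client's own settings.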
Additionally, it will be necessary to construct (and maintain) useful subsets of Commons media, such as "all media used on the English Wikipedia" or "thumbnails of all images on Wikimedia Commons", which are of particular interest to certain content reusers, since the full set is far too large for most of them. It's on this latter point that I want your input: what useful subsets of Wikimedia Commons does the research community want? Thanks for your feedback.
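As one concrete reading of the "thumbnails" subset, the sketch below asks the Commons API (prop=imageinfo with iiurlwidth) for scaled-down URLs of a handful of files. The 320 px width and the sample title are placeholders; a real subset builder would stream titles from the dumps rather than hard-code them.

import json
import urllib.parse
import urllib.request

API = "https://commons.wikimedia.org/w/api.php"
SAMPLE_TITLES = ["File:Example.jpg"]   # placeholder input; a real run would read titles from a dump

def thumb_urls(titles, width=320):
    # Ask the API for a server-side scaled rendering of each file.
    params = {
        "action": "query",
        "titles": "|".join(titles),
        "prop": "imageinfo",
        "iiprop": "url",
        "iiurlwidth": width,
        "format": "json",
    }
    url = API + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    for page in data["query"]["pages"].values():
        for info in page.get("imageinfo", []):
            yield page["title"], info.get("thumburl")

if __name__ == "__main__":
    for title, thumb in thumb_urls(SAMPLE_TITLES):
        print(title, thumb)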
--
Derrick Coetzee
User:Dcoetzee, English Wikipedia and Wikimedia Commons administrator
http://www.eecs.berkeley.edu/~dcoetzee/
On Mon, Jun 27, 2011 at 6:49 AM, wiki-research-l-request@lists.wikimedia.org wrote:
Date: Mon, 27 Jun 2011 06:18:31 -0400
From: Samuel Klein <sjklein@hcs.harvard.edu>
Subject: Re: [Wiki-research-l] Wikipedia dumps downloader
Thank you, Emijrp!
What about the dump of Commons images? [for those with 10 TB to spare]
SJ
On Sun, Jun 26, 2011 at 8:53 AM, emijrp emijrp@gmail.com wrote:
Hi all;
Can you imagine a day when Wikipedia is added to this list?[1]
WikiTeam has developed a script[2] to download all the Wikipedia dumps (and those of its sister projects) from dumps.wikimedia.org. It sorts them into folders and checks md5sums. It only works on Linux (it uses wget).
You will need about 100GB to download all the 7z files.
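For readers curious what the fetch-and-verify step looks like, here is a minimal Python sketch; the example URL and checksum are placeholders for illustration, not the WikiTeam script itself (which uses wget).

import hashlib
import urllib.request

DUMP_URL = "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"  # example file
EXPECTED_MD5 = "0123456789abcdef0123456789abcdef"   # hypothetical checksum from the published md5sums list

def download_and_verify(url, expected_md5, dest="dump.out"):
    md5 = hashlib.md5()
    with urllib.request.urlopen(url) as src, open(dest, "wb") as out:
        while True:
            chunk = src.read(1 << 20)   # 1 MB at a time keeps memory use flat
            if not chunk:
                break
            out.write(chunk)
            md5.update(chunk)
    if md5.hexdigest() != expected_md5:
        raise ValueError("md5 mismatch: got %s" % md5.hexdigest())
    return dest

# download_and_verify(DUMP_URL, EXPECTED_MD5)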
Save our memory.
Regards, emijrp
[1] http://en.wikipedia.org/wiki/Destruction_of_libraries
[2] http://code.google.com/p/wikiteam/source/browse/trunk/wikipediadownloader.py
--
Samuel Klein    identi.ca:sj    w:user:sj    +1 617 529 4266
Message: 5
Date: Mon, 27 Jun 2011 13:07:51 +0200
From: emijrp <emijrp@gmail.com>
Subject: Re: [Wiki-research-l] [Xmldatadumps-l] Wikipedia dumps downloader
To: Richard Farmbrough <richard@farmbrough.co.uk>
Cc: xmldatadumps-l@lists.wikimedia.org, wikiteam-discuss@googlegroups.com, Wikimedia Foundation Mailing List <foundation-l@lists.wikimedia.org>, Research into Wikimedia content and communities <wiki-research-l@lists.wikimedia.org>
Hi Richard;
Yes, a distributed project would probably be the best solution, but it is not easy to develop unless you use a library like BitTorrent (or similar) and you have many peers. However, most people don't seed the files for long, so sometimes it is better to depend on a few committed people than on a big but ephemeral crowd.
Regards, emijrp
2011/6/26 Richard Farmbrough richard@farmbrough.co.uk
It would be useful to have an archive of archives. I have to delete my old data dumps as time passes, for space reasons; however, a team could, between them, maintain multiple copies of every data dump. This would make a nice distributed project.
Hi;
@Derrick: I don't trust Amazon. Really, I don't trust the Wikimedia Foundation either. They can't and/or don't want to provide image dumps (which is worse?). The community donates images to Commons, the community donates money every year, and now the community needs to develop software to extract all the images, pack them, and of course host them in a permanent way. Crazy, right?
@Milos: Instead of splitting the image dump by the first letter of the filenames, I thought about splitting it by upload date (YYYY-MM-DD). That way, the first chunks (2005-01-01) will be tiny, and recent ones several GB each (for a single day).
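A possible sketch of that date-based chunking, assuming a plain tab-separated list of filename and upload timestamp as input (the field layout is an assumption for illustration, not the actual dump format):

from collections import defaultdict

def chunk_by_upload_day(lines):
    # Map "YYYY-MM-DD" -> filenames uploaded that day.
    chunks = defaultdict(list)
    for line in lines:
        filename, timestamp = line.rstrip("\n").split("\t")
        chunks[timestamp[:10]].append(filename)   # "2005-01-01T08:00:00Z" -> "2005-01-01"
    return chunks

if __name__ == "__main__":
    sample = ["Example.jpg\t2005-01-01T08:00:00Z",
              "Another_file.png\t2011-06-28T17:30:00Z"]
    for day, files in sorted(chunk_by_upload_day(sample).items()):
        print(day, len(files), "file(s)")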
Regards, emijrp
I remember that there's a request for comments at http://www.mediawiki.org/wiki/Research_Data_Proposals#Dump to prioritize the things which need to get done (or not).
Nemo