Hello, I was wondering how the decision is reached to split enwiki pages-meta-history into, say, N XML files. How is N determined? Is it based on something like "let's try to have X many pages per XML file" or "Y many revisions per XML file" or trying to keep the size (GB) of each XML file roughly equivalent? Or is N just an arbitrary number chosen because it sounds nice? :)
Thanks,
Diane
Hey guys,
Sorry for breaking the thread, but I just subscribed, so I think
this'll probably break mailman's threading headers.
This is very exciting news, and IA would love to have a copy! We're
more interested in being a historical mirror (on our item
infrastructure) than a live rsync/http/ftp mirror, but perhaps
we can also work something out for mirroring the latest dumps. (How big
are the last two or so?)
I suppose the next step is for me and Ariel to talk about technical
procedures and details, et cetera, but I just wanted to subscribe to
this ml and introduce myself.
Ariel, when you have a minute to chat, shoot me an email (or skype).
I'm thinking we just pull things at whatever frequency you guys push
out the data to your.org (which may or may not be scheduled yet) and
throw them into new items on the cluster.
Others' thoughts are, of course, always welcome.
Thanks!
Alex Buie
Collections Group
Internet Archive, a registered California non-profit library
abuie@archive.org
This is phase one of a plan to make uploaded media from WMF projects
accessible for download in bulk. It, like many other things lately, is
experimental and subject to breakage, change, etc.
First, a big thanks to Kevin Day from Your.org, who offered us the space
and worked with us for many hours to sort out networking issues, try
different NAS setups, and generally do what was needed to get this
going.
Rsync URL: ftpmirror.your.org::wikimedia-images/projectname/languagecode
For example:
rsync -a ftpmirror.your.org::wikimedia-images/wikipedia/commons /my/dir
would get you all of commons, including archived versions (no deleted
images, of course).
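If you only want a single project, the same pattern should work; for
example (untested, and assuming the projectname/languagecode layout
above applies to English Wikipedia):
rsync -a ftpmirror.your.org::wikimedia-images/wikipedia/en /my/dir  # path is an assumption based on the layout above
would get you just the media uploaded locally to en.wikipedia.org.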
Folks who are trying to download media for a specific project should
bear in mind that they will need not only the files from that project
but also those hosted on commons and used on the local project. I'm
looking into producing lists of those files for easy use by rsyncers.
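Once those lists exist, rsync's --files-from option should make them
easy to use; a rough sketch (the list filename here is hypothetical):
rsync -a --files-from=commons-used-by-enwiki.txt ftpmirror.your.org::wikimedia-images/wikipedia/commons /my/dir  # list name is hypothetical
which would copy from commons only the files named in the list.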
I would suggest that, rather than everyone downloading a zillion copies
of commons at once, folks coordinate a little bit, or just get the
pieces they need :-D
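For example, if you only need one slice of the tree, you can point
rsync at a deeper path (the subdirectory shown is hypothetical,
assuming the module exposes the usual hashed layout):
rsync -a ftpmirror.your.org::wikimedia-images/wikipedia/commons/a/ /my/dir/a/  # subdirectory is an assumption
and skip the rest of commons.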
The data that is there now is probably about 15-20 days old. It will
likely be a little while before I get the media rsync going on a regular
basis; I'm juggling a lot of pieces right now.
Ariel
P.S. This is not an April fools joke, it's April 2 here already :-P
I'm doing a little bit of work on deployment procedures for the dump
scripts as I push out a few small bug fixes and turn on logging. Over
the next day or so you'll notice interruptions or delays while the
conversion is happening.
Ariel
Hi all,
I have been looking at the Wikipedia database schema and I haven't
found any field suggesting that some content is geographically located.
Am I wrong?
If it is possible, I would like to download the geographically located
content of Wikipedia to do something similar to what Google Earth does
with the Wikipedia layer.
Is that possible?
Thanks in advance.
I can't go, but some people on this list should think about a panel that
discusses forkability, archival of content, and other related things. In
case this sounds attractive to someone who is planning to go, the
deadline for submissions is in a week!
http://wikimania2012.wikimedia.org/wiki/Submissions
I'm willing to have my brain picked by anyone who decides this is worth
doing, in case that's helpful.
Ariel
I've cranked up all the usual workers and kicked off an en wp run in
addition. If there is anything that squeaked by my spot checks, we'll
know about it soon...
Ariel