It was not our original intention at WikiTeam to create these media tarballs so that researchers could work from them. We created them so that everyone in the Wikimedia movement can rest assured that there is one backup copy of their media on the Internet Archive. Trust me, the number of people who will actually use these tarballs is going to be smaller than the number of people editing the smaller wikis combined; certainly everyone else is going to be using the data on Commons itself. So we can fully focus on making Commons more data-accessible, without the risk of people doing their research on the Internet Archive tarballs instead.

That being said, we can't even guarantee that the images in these tarballs are up to date. Each image should be regarded as a snapshot taken at the time of download, not as an effective live backup of all the images on Commons. We are looking into creating subsequent tarballs that take new uploads and re-uploads into account, so that Commons is actually backed up.

I guess the way we presented the tarballs on the Internet Archive is enough to deter anyone from conducting research directly on them, unless they mine the data in depth to get what they want, and even that is certainly going to be much tougher than mining the information from Commons directly in its current state.

On Mon, Oct 14, 2013 at 8:59 PM, Gerard Meijssen <gerard.meijssen@gmail.com> wrote:

Hoi,
I do agree that it is good to have the data in many places, and the Internet Archive on its own moves it to several places as well. Many of us have seen the IA servers at the Library of Alexandria.

While it is OK to find a use for the data at the IA, I would like us to concentrate first and foremost on how we can make better use of the media that is in Commons itself: how we can open it up to more use and make Commons more accessible.

Do realise that when there is a good use for all the data that is in the IA, the same use and more could be made with the larger amount of data that is in Commons itself.
Thanks,
GerardM

On 14 October 2013 14:26, Federico Leva (Nemo) <nemowiki@gmail.com> wrote:

Emilio J. Rodríguez-Posada, 14/10/2013 14:18:

Internet Archive has this problem in several other areas, like its
Wayback Machine: there is no search engine to search the billions
of grabbed websites by keyword or whatever.

Internet Archive is a pile of hard disks and a time capsule with
backups, and they try to do the best at showing the materials (media
players, pdf viewers), but it is not always easy or possible.

...and that's why Hay said we need someone with a good idea. :)
Now it's easy to download the dataset (though it's not perfect), but of course this doesn't automatically make something cool happen with it, except replication of the data in multiple places, which is a good thing in itself.

Nemo

_______________________________________________
Commons-l mailing list
Commons-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/commons-l

Regards,

Hydriz
