We now have three mirror sites, yay! The full list is linked to from http://dumps.wikimedia.org/ and is also available at http://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Current...
Summarizing, we have:
C3L (Brazil) with the last 5 known good dumps,
Masaryk University (Czech Republic) with the last 5 known good dumps,
Your.org (USA) with the complete archive of dumps, and
for the latest version of uploaded media, Your.org with http/ftp/rsync access.
Thanks to Carlos, Kevin and Yenya respectively at the above sites for volunteering space, time and effort to make this happen.
As people noticed earlier, a series of per-project media tarballs (excluding Commons) is being generated. As soon as the first run of these is complete we'll announce their location and start generating them on a semi-regular basis.
As we've been getting the bugs out of the mirroring setup, it is getting easier to add new locations. Know anyone interested? Please let us know; we would love to have them.
Ariel
Good work. We are finally approaching an indestructible corpus of knowledge.
2012/5/17 Ariel T. Glenn ariel@wikimedia.org
Hi, I am thinking about how to collect articles deleted under the "not notable" criterion. Is there any way we can extract them from the MySQL binlogs? How are these mirrors working? I would be interested in setting up a mirror of deleted data, at least that which is not spam/vandalism, based on tags. mike
On Thu, May 17, 2012 at 1:09 PM, Ariel T. Glenn ariel@wikimedia.org wrote:
There are a few other reasons articles get deleted: copyright issues, personally identifying data, etc. This makes maintaining the sort of mirror you propose problematic, although a similar mirror is here: http://deletionpedia.dbatley.com/w/index.php?title=Main_Page
The dumps contain only data publicly available at the time of the run, without deleted data.
The articles aren't permanently deleted, of course. The revision texts live on in the database, so a query on the toolserver, for example, could be used to get at them, but that would need to be for research purposes.
Ariel
On 17-05-2012 (Thu), at 13:30 +0200, Mike Dupont wrote:
On 17/05/12 14:23, Ariel T. Glenn wrote:
> There are a few other reasons articles get deleted: copyright issues, personally identifying data, etc. This makes maintaining the sort of mirror you propose problematic, although a similar mirror is here: http://deletionpedia.dbatley.com/w/index.php?title=Main_Page
> The dumps contain only data publicly available at the time of the run, without deleted data.
> The articles aren't permanently deleted of course.
And that is a much better way to retrieve them than the binlogs (which are only kept for a short time anyway).
> The revision texts live on in the database,
> so a query on toolserver, for example, could be used to get at them, but that would need to be for research purposes.
Not really. You could get a list of deleted titles/authors from the toolserver, but not the page contents, which for some strange reason are not replicated there (not even available to the roots).
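For what it's worth, the titles and authors (not the text) can be pulled from the archive table on the toolserver replicas. A minimal sketch, assuming Python with MySQLdb and the usual enwiki_p replica naming; the host name is illustrative:

    # Sketch: list recently deleted page titles from a toolserver replica.
    # Only metadata is available; the revision text is not replicated.
    import os
    import MySQLdb

    conn = MySQLdb.connect(
        host="enwiki-p.rrdb.toolserver.org",       # illustrative host name
        db="enwiki_p",
        read_default_file=os.path.expanduser("~/.my.cnf"),
    )
    cur = conn.cursor()
    cur.execute("""
        SELECT ar_namespace, ar_title, ar_timestamp, ar_user_text
        FROM archive
        WHERE ar_timestamp > %s
        ORDER BY ar_timestamp DESC
        LIMIT 100
    """, ("20120501000000",))
    for ns, title, ts, user in cur.fetchall():
        print("%s\t%s\t%s\t%s" % (ns, title, ts, user))
    conn.close()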
Well, I would be happy with items like this: http://en.wikipedia.org/wiki/Template:Db-a7 Would it be possible to extract them easily? mike
On Thu, May 17, 2012 at 2:23 PM, Ariel T. Glenn ariel@wikimedia.org wrote:
Create a script that makes a request to Special:Export, using this category as the feed: https://en.wikipedia.org/wiki/Category:Candidates_for_speedy_deletion
More info: https://www.mediawiki.org/wiki/Manual:Parameters_to_Special:Export
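A minimal sketch of that approach in Python (the requests library is assumed; the Special:Export parameters follow the manual page above, and error handling and paging are left out):

    # Sketch: export the current text of every member of a deletion category.
    import requests

    API = "https://en.wikipedia.org/w/api.php"
    EXPORT = "https://en.wikipedia.org/wiki/Special:Export"
    CATEGORY = "Category:Candidates_for_speedy_deletion"

    # 1. List the category members (only the first 500 in this sketch).
    r = requests.get(API, params={
        "action": "query",
        "list": "categorymembers",
        "cmtitle": CATEGORY,
        "cmlimit": "500",
        "format": "json",
    })
    titles = [m["title"] for m in r.json()["query"]["categorymembers"]]

    # 2. Post the titles to Special:Export; curonly=1 keeps only the latest revision.
    xml = requests.post(EXPORT, data={
        "pages": "\n".join(titles),
        "curonly": "1",
        "templates": "1",
    }).text

    with open("speedy_candidates.xml", "w", encoding="utf-8") as f:
        f.write(xml)

Dropping curonly should give the full history, subject to the wiki's export limits.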
2012/5/21 Mike Dupont jamesmikedupont@googlemail.com
Thanks! I'll run that once per day; they don't get deleted that quickly. mike
On Mon, May 21, 2012 at 9:11 PM, emijrp emijrp@gmail.com wrote:
The first version of the script is ready; it gets the versions, puts them in a zip and puts that on archive.org: https://github.com/h4ck3rm1k3/pywikipediabot/blob/master/export_deleted.py
Here is an example of the output: http://archive.org/details/wikipedia-delete-2012-05 http://ia601203.us.archive.org/24/items/wikipedia-delete-2012-05/archive2012...
I will cron this, and it should give us a start on saving deleted data. Articles will be exported once a day, even if they were exported yesterday, as long as they are still in one of the categories.
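Roughly, the zip-and-upload step looks like this; a sketch only, assuming the internetarchive Python client and illustrative file and item names rather than the exact code in export_deleted.py:

    # Sketch: bundle the day's Special:Export XML files into a zip and push it
    # to an archive.org item. Paths and the item name are illustrative.
    import datetime
    import glob
    import zipfile
    from internetarchive import upload   # assumed client library

    today = datetime.date.today()
    zip_name = "archive-%s.zip" % today.isoformat()

    with zipfile.ZipFile(zip_name, "w", zipfile.ZIP_DEFLATED) as zf:
        for xml_file in glob.glob("exports/*.xml"):
            zf.write(xml_file)

    # One item per month, one zip per daily cron run.
    item = "wikipedia-delete-%04d-%02d" % (today.year, today.month)
    upload(item, files=[zip_name], metadata={"mediatype": "web"})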
mike
On Mon, May 21, 2012 at 7:21 PM, Mike Dupont jamesmikedupont@googlemail.com wrote:
This is quite nice, though the item's metadata is a bit sparse :)
On Tue, May 29, 2012 at 3:40 AM, Mike Dupont <jamesmikedupont@googlemail.com> wrote:
Well, I have now updated the script to include the XML dump in raw format. I will have to add more information to the archive.org item, at least a basic readme. Another thing is that pywikipediabot does not seem to support full history export, so I will have to move over to the wikiteam version and rework it. I just spent 2 hours on this, so I am pretty happy with the first version.
mike
On Tue, May 29, 2012 at 1:52 AM, Hydriz Wikipedia admin@alphacorp.tk wrote:
OK, I merged the code from wikiteam and have a full-history dump script that uploads to archive.org; the next step is to fix the bucket metadata in the script. mike
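For the bucket metadata, one route is the archive.org S3-style API, where item metadata travels in x-archive-meta-* headers on the upload. A hedged sketch; the keys, item name and file are placeholders:

    # Sketch: upload a file and set item ("bucket") metadata via the
    # S3-compatible archive.org API. Keys and names are placeholders.
    import requests

    ITEM = "wikipedia-delete-2012-06"
    ACCESS_KEY = "YOUR_IA_ACCESS_KEY"
    SECRET_KEY = "YOUR_IA_SECRET_KEY"

    headers = {
        "authorization": "LOW %s:%s" % (ACCESS_KEY, SECRET_KEY),
        "x-archive-auto-make-bucket": "1",
        "x-archive-meta-mediatype": "web",
        "x-archive-meta-title": "Deleted Wikipedia articles (June 2012)",
        "x-archive-meta-description": "Daily Special:Export snapshots of pages tagged for deletion on the English Wikipedia.",
    }

    with open("archive-2012-06-01.zip", "rb") as f:
        r = requests.put(
            "http://s3.us.archive.org/%s/archive-2012-06-01.zip" % (ITEM,),
            data=f,
            headers=headers,
        )
    print(r.status_code, r.reason)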
On Tue, May 29, 2012 at 3:08 AM, Mike Dupont jamesmikedupont@googlemail.com wrote:
Code is here: https://github.com/h4ck3rm1k3/wikiteam
On Wed, May 30, 2012 at 6:26 AM, Mike Dupont jamesmikedupont@googlemail.com wrote:
I'm still interested in running a mirror too, as noted on Meta and sent out earlier by mail.
I'm just wondering: why is there no rsync possibility from the main server? It's "strange" that we need to rsync from a mirror.
Eh, mirrors rsync directly from dataset1001.wikimedia.org; see the module listing from "rsync dataset1001.wikimedia.org::".
However, the system limits rsync access to the mirrors only, to prevent others from rsyncing directly from Wikimedia.
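In practice a mirror's sync job is just rsync against that host. A minimal sketch in Python; the module name and destination path are illustrative, since the real module names come from the listing above, and it only works from a whitelisted mirror IP:

    # Sketch: list the modules the dumps server exposes, then mirror one of them.
    import subprocess

    HOST = "dataset1001.wikimedia.org"

    # Equivalent to "rsync dataset1001.wikimedia.org::" on the command line.
    subprocess.check_call(["rsync", HOST + "::"])

    # Mirror one module into a local directory ("dumps" is a placeholder name).
    subprocess.check_call([
        "rsync", "-av", "--delete",
        "%s::dumps/" % HOST,
        "/srv/mirror/dumps/",
    ])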
On Wed, May 30, 2012 at 4:52 PM, Huib Laurens sterkebak@gmail.com wrote:
Ok, cool.
And how will I get Wikimedia to allow our IP to rsync?
Best,
Huib
On Wed, May 30, 2012 at 10:54 AM, Hydriz Wikipedia admin@alphacorp.tk wrote:
Ariel will do that :)
BTW, just dig around inside the puppet configuration repository on Gerrit and you can find out more :)
On Wed, May 30, 2012 at 4:58 PM, Huib Laurens sterkebak@gmail.com wrote:
Ok.
I mailed Ariel about this; if all goes well I can have the mirror up and running by Friday.
Best, Huib
On Wed, May 30, 2012 at 10:59 AM, Hydriz Wikipedia admin@alphacorp.tk wrote:
Do you have a URL that you can reveal so that some of us can have a sneak peek? :P
On Wed, May 30, 2012 at 5:16 PM, Huib Laurens sterkebak@gmail.com wrote:
Sure :)
Later on we will duplicate this mirror to a Dutch mirror as well :)
Best, Huib
On Wed, May 30, 2012 at 11:18 AM, Hydriz Wikipedia admin@alphacorp.tk wrote:
I have cron archiving running now every 30 minutes: http://ia700802.us.archive.org/34/items/wikipedia-delete-2012-06/ It is amazing how fast this stuff gets deleted on Wikipedia. What about the proposed deletions? Are there categories for those? thanks mike
On Wed, May 30, 2012 at 6:26 AM, Mike Dupont jamesmikedupont@googlemail.com wrote:
https://github.com/h4ck3rm1k3/wikiteam code here
On Wed, May 30, 2012 at 6:26 AM, Mike Dupont jamesmikedupont@googlemail.com wrote:
Ok, I merged the code from wikiteam and have a full-history dump script that uploads to archive.org; the next step is to fix the bucket metadata in the script. mike
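For the bucket metadata, a rough sketch of an archive.org upload with explicit metadata, assuming the third-party internetarchive Python package and credentials already configured; the identifier, file name and fields below are placeholders rather than what the actual script uses:

from internetarchive import upload

# Placeholders only; not the real item or metadata used by the script.
item_id = "wikipedia-delete-2012-06"
metadata = {
    "title": "Wikipedia deleted-article snapshots, June 2012",
    "mediatype": "web",
    "description": "Special:Export snapshots of articles tagged for deletion.",
    "licenseurl": "https://creativecommons.org/licenses/by-sa/3.0/",
}

# upload() creates the item if it does not exist yet and attaches the metadata.
upload(item_id, files=["archive-2012-06-01.zip"], metadata=metadata)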
On Tue, May 29, 2012 at 3:08 AM, Mike Dupont jamesmikedupont@googlemail.com wrote:
Well, I have now updated the script to include the XML dump in raw format. I will have to add more information to the archive.org item, at least a basic readme. The other thing is that pywikipediabot does not seem to support the full history, so I will have to move over to the wikiteam version and rework it. I just spent 2 hours on this, so I am pretty happy with the first version.
mike
On Tue, May 29, 2012 at 1:52 AM, Hydriz Wikipedia admin@alphacorp.tk wrote:
This is quite nice, though the item's metadata is a bit sparse :)
On Tue, May 29, 2012 at 3:40 AM, Mike Dupont <jamesmikedupont@googlemail.com> wrote:
The first version of the script is ready; it gets the versions, puts them in a zip, and puts that on archive.org: https://github.com/h4ck3rm1k3/pywikipediabot/blob/master/export_deleted.py
here is an example output : http://archive.org/details/wikipedia-delete-2012-05
http://ia601203.us.archive.org/24/items/wikipedia-delete-2012-05/archive2012...
I will cron this, and it should give us a start on saving deleted data. Articles will be exported once a day, even if they were exported yesterday, as long as they are in one of the categories.
mike
On Mon, May 21, 2012 at 7:21 PM, Mike Dupont jamesmikedupont@googlemail.com wrote:
Thanks! And run that once per day; they don't get deleted that quickly. mike
On Mon, May 21, 2012 at 9:11 PM, emijrp emijrp@gmail.com wrote:
> Create a script that makes a request to Special:Export using this category as feed:
> https://en.wikipedia.org/wiki/Category:Candidates_for_speedy_deletion
>
> More info: https://www.mediawiki.org/wiki/Manual:Parameters_to_Special:Export
>
> 2012/5/21 Mike Dupont jamesmikedupont@googlemail.com
>> Well, I would be happy for items like this:
>> http://en.wikipedia.org/wiki/Template:Db-a7
>> Would it be possible to extract them easily?
>> mike
>>
>> On Thu, May 17, 2012 at 2:23 PM, Ariel T. Glenn ariel@wikimedia.org wrote:
>> > There are a few other reasons articles get deleted: copyright issues, personal identifying data, etc. This makes maintaining the sort of mirror you propose problematic, although a similar mirror is here:
>> > http://deletionpedia.dbatley.com/w/index.php?title=Main_Page
>> >
>> > The dumps contain only data publicly available at the time of the run, without deleted data.
>> >
>> > The articles aren't permanently deleted, of course. The revision texts live on in the database, so a query on toolserver, for example, could be used to get at them, but that would need to be for research purposes.
>> >
>> > Ariel
Any chance that these archives can be served via BitTorrent, so that even partial downloaders can become servers? Leveraging p2p would reduce overall bandwidth load on the servers and speed up downloads.
I second this idea. Large archives should always be available over BitTorrent. I would actually suggest posting magnet links for them instead of .torrent files, though. This way you can leverage the acceptable-source feature of magnet links.
https://en.wikipedia.org/wiki/Magnet_URI_scheme#Web_links_to_the_file
This way we get the best of both worlds: the constant availability of direct downloads, and the reduction in load that p2p filesharing provides.
Thank you, Derric Atzrott
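A rough sketch of how such a magnet link could be put together, assuming the info hash is already known; the hash and URL below are placeholders, not links to a real dump:

from urllib.parse import quote

# Placeholders only.
info_hash = "0123456789abcdef0123456789abcdef01234567"  # fake 40-hex BTIH
name = "enwiki-pages-articles.xml.bz2"
http_source = "http://dumps.wikimedia.org/example/" + name

magnet = (
    "magnet:?xt=urn:btih:" + info_hash
    + "&dn=" + quote(name)
    # "as" = acceptable source: a plain HTTP fallback clients may fetch from.
    + "&as=" + quote(http_source, safe="")
    # "ws" = web seed (BEP 19), so the swarm can also pull data over HTTP.
    + "&ws=" + quote(http_source, safe="")
)
print(magnet)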
This is a place where volunteers can step in and make it happen without the need for Wikimedia's infrastructure. (This means I can concentrate on my already very full plate of things too.)
http://meta.wikimedia.org/wiki/Data_dump_torrents
Have at!
Ariel
Dear Ariel,
Consider that the people who would need to use torrents most of all cannot host mirrors; this is a situation of the little guy being asked to do the heavy lifting.
It would save the WMF significant resources, and it would be more efficient than rsync. Doing this outside the WMF infrastructure does not make sense (authenticity, automation), and that is the reason why the use of torrents has traditionally failed. If the WMF does this, it should be possible for users to leverage all the mirrors simultaneously, which is why torrents are the preferred form of transport for Linux distributions.
Installing a torrent server should not significantly impact workload. The main problem, as I see it, is to write a maintenance script to create the magnet link/.torrent files once the dumps are generated and to publish them on the dump servers.
With your blessing, I would try to help with it in the context of, say, Labs, if it could be integrated into the dump release process.
Thanks for the great job with the dumps!
Oren Bochman
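A rough sketch of the maintenance step described above, assuming the third-party torf library; the tracker, web seed URL and paths are illustrative only, not an agreed-on setup:

from torf import Torrent

# Placeholders only; not the real dump paths or tracker.
dump_file = "/data/dumps/example/pages-articles.xml.bz2"
web_seed = "http://dumps.wikimedia.org/example/pages-articles.xml.bz2"

t = Torrent(
    path=dump_file,
    trackers=["udp://tracker.example.org:6969/announce"],  # hypothetical tracker
    webseeds=[web_seed],
    comment="Wikimedia XML dump (illustrative)",
)
t.generate()                      # hash the pieces
t.write(dump_file + ".torrent")   # publish the .torrent next to the dump
print(t.magnet())                 # magnet link to post alongside it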
Actually, the link that Ariel provided contains links to Burnbit, which claims [1] to use webseeds (i.e. the Wikimedia servers) to provide the torrents with data. So in theory it should be enough to have a script generate a Burnbit torrent (or any torrent with webseeds, for that matter) for each dump, which should be well within the reach of an interested user. As for the actual hosting of files, users will only host the files they choose to seed (presumably the ones they work with) and nothing more.
Strainu
You can create a script that uses Special:Export to export all articles in the deletion categories just before they are deleted.
Then import them into your "Deletionpedia".
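A minimal sketch of that approach, assuming the requests library, the standard MediaWiki API for listing category members, and the Special:Export parameters documented in the manual page linked earlier; the category, limit and output file name are just examples:

import requests

API = "https://en.wikipedia.org/w/api.php"
EXPORT = "https://en.wikipedia.org/wiki/Special:Export"
CATEGORY = "Category:Candidates for speedy deletion"

# 1. List the pages currently in the deletion category (first 500 only;
#    continuation is omitted for brevity).
r = requests.get(API, params={
    "action": "query",
    "list": "categorymembers",
    "cmtitle": CATEGORY,
    "cmlimit": "500",
    "format": "json",
})
titles = [m["title"] for m in r.json()["query"]["categorymembers"]]

# 2. Ask Special:Export for those pages. The "history" flag requests the
#    full revision history; large wikis may limit how much is returned.
xml = requests.post(EXPORT, data={
    "pages": "\n".join(titles),
    "history": "1",
}).text

with open("speedy-deletion-candidates.xml", "w", encoding="utf-8") as f:
    f.write(xml)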