Hi,
we're mirroring all the generated dumps, extracting them, harvesting data, etc. Lately I came to examine some of the more exotic languages, and to my surprise they were even more exotic than I thought. I propose we ditch them.
Afar (aa) Wikipedia
The latest on our servers is aar-20141223.xml.bz2 with 22974 bytes (we convert language codes to ISO 639-3).
It seems the wiki has been closed or moved into the Incubator:
http://meta.wikimedia.org/wiki/Proposals_for_closing_projects/Closure_of_Afa...
Nevertheless, this wiki keeps showing up in the XML dumps, pretending something is there. I believe we'd all be better off if dumps of it ceased.
---
Basically the same applies to the Ndonga Wikipedia:
http://meta.wikimedia.org/wiki/Proposals_for_closing_projects/Closure_of_Ndo...
But the xmldumps keep pouring in:
ndo-20141223.xml.bz2
etc. The same story holds for several other Wikimedia projects in other languages.
So in general: Could we stop dumping closed projects?
kind regards,
Agreed! I also propose scheduling dumps according to their importance, for example by article count.
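A minimal sketch of that scheduling idea (the function name and the sample counts are hypothetical, not anything the dump infrastructure actually provides): order the monthly queue by article count, largest first, so the big wikis start early in the month.

```python
def schedule_by_article_count(wikis):
    """Return wiki names ordered for dumping, most articles first.

    wikis: dict mapping wiki database name -> article count.
    """
    return sorted(wikis, key=wikis.get, reverse=True)


# Illustrative counts only; real numbers would come from site statistics.
counts = {"enwiki": 4_700_000, "dewiki": 1_800_000, "aawiki": 1, "ngwiki": 8}
queue = schedule_by_article_count(counts)
```

With these sample counts, `queue` starts with `enwiki`, so the two-week English Wikipedia run would begin at the top of the month rather than spilling into the next one.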
All the best,
On Sat, Jan 17, 2015 at 7:59 PM, Richard Jelinek rj@petamem.com wrote:
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com
Geschäftsführer: Richard Jelinek
Language Technology - We Mean IT!
Sitz der Gesellschaft: Fürth
2.58921 * 10^8 Mind Units
Registergericht: AG Fürth, HRB-9201
Xmldatadumps-l mailing list Xmldatadumps-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
All dumps are scheduled to run once a month; some just take longer than others to run, and there are issues sometimes. Not sure why the order really matters.
The WMF dumps all of its active databases (any database that is hosting a site, regardless of the status of the site). If you want dumps/wikis removed, file a request on Meta for the complete shutdown of the project. Otherwise, just use some common sense and ignore the inactive wikis.
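One way a mirror script could "ignore the inactive wikis": filter its fetch list against Wikimedia's list of closed databases. This sketch assumes the dblist file format (one database name per line, `#` for comments) as published under noc.wikimedia.org/conf/; the exact URL and the helper names here are assumptions, not part of the dump tooling.

```python
def parse_dblist(text):
    """Parse dblist-format text: one database name per line, '#' comments."""
    names = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            names.add(line)
    return names


def active_wikis(all_wikis, closed_dblist_text):
    """Return the wikis worth mirroring, dropping any listed as closed."""
    closed = parse_dblist(closed_dblist_text)
    return [w for w in all_wikis if w not in closed]


# Example with inline data; a real bot would download closed.dblist instead.
closed_text = "# closed wikis\naawiki\nngwiki\n"
to_fetch = active_wikis(["enwiki", "aawiki", "dewiki", "ngwiki"], closed_text)
```

Here `to_fetch` keeps only `enwiki` and `dewiki`, so the closed Afar and Ndonga dumps are never downloaded in the first place.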
On Sat, Jan 17, 2015 at 2:54 PM, Alex Druk alex.druk@gmail.com wrote:
On Sat, Jan 17, 2015 at 7:59 PM, Richard Jelinek rj@petamem.com wrote:
-- Thank you.
Alex Druk www.wikipediatrends.com alex.druk@gmail.com (775) 237-8550 Google voice
Hi John,
The order matters because bigger projects take much longer to dump. For example, the English Wikipedia takes about 14 days to process, so if its dump starts late in the month, it will only be ready the following month. And if something goes wrong, it takes too long to rerun the process. That has already happened several times in the last year alone.
Regards, Alex
On Sat, Jan 17, 2015 at 9:31 PM, John phoenixoverride@gmail.com wrote:
On Sat, Jan 17, 2015 at 2:54 PM, Alex Druk alex.druk@gmail.com wrote:
On Sat, Jan 17, 2015 at 7:59 PM, Richard Jelinek rj@petamem.com wrote:
On Sat, Jan 17, 2015 at 03:31:37PM -0500, John wrote:
All dumps are scheduled to run once a month; some just take longer than others to run, and there are issues sometimes. Not sure why the order really matters.
It doesn't (for us).
The WMF dumps all of its active databases (any database that is hosting a site, regardless of the status of the site). If you want dumps/wikis removed, file a request on Meta for the complete shutdown of the project.
Seriously? There is a difference between a closed project and a deleted project. Many closed projects still have an active database, because the vote often results in "close but don't delete". And I agree with that, so I would not request a complete shutdown.
My common sense tells me that dumping closed projects is just wrong. The dumps waste resources like space, time and attention.
Otherwise just use some common sense and ignore the inactive wikis.
You are aware that "use some common sense" in this case is addressed to a mirror script/fetchbot? I'd love to see an implementation of that in a formal programming language.
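Taking that challenge half-seriously, here is what a shred of "common sense" might look like in a fetchbot, under made-up assumptions: skip a wiki whose latest dump is tiny and exactly the same size as last month's, which is a decent hint that the project is closed. The function name and the 100 kB threshold are invented for illustration.

```python
def worth_fetching(latest_size, previous_size, min_bytes=100_000):
    """Heuristic: is this wiki's latest dump worth downloading?"""
    if latest_size >= min_bytes:
        return True  # big enough to plausibly contain real content
    # Tiny dump: fetch only if the size changed since the previous run.
    return latest_size != previous_size


# aar-20141223.xml.bz2 at 22974 bytes, identical to last month: skip it.
print(worth_fetching(22974, 22974))  # → False
```

An active small wiki still gets through (its size drifts month to month); only the frozen stubs are skipped.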
kind regards,
Talking about wasting resources, here is another one:
Dumping summaries for Yahoo! (Really? Why the hell? Is it our job to dump 20 GB, http://dumps.wikimedia.org/wikidatawiki/20141106/, every damn month of just Wikidata for Yahoo! bots?)
Please explain why the Wikimedia Foundation should spend resources and money on a commercial company for free.
On Sun, Jan 18, 2015 at 12:34 AM, Richard Jelinek rj@petamem.com wrote:
On Sat, Jan 17, 2015 at 11:31:45PM +0100, Federico Leva (Nemo) wrote:
Richard Jelinek, 17/01/2015 19:59:
The latest on our servers is aar-20141223.xml.bz2 with 22974 bytes
22974 entire bytes! What a terrible waste! If we want to save space, I propose cutting some formats of the en.wiki dumps. *That* would save a lot of resources. ;-)
[x] optimizing en.wiki dumps is a different topic where we may agree
[ ] you understood the problem of this topic

a) It's only one of many examples (because of its ISO code, the first).
b) Speaking of resources: I spoke of space, time, and attention; you stopped reading after "space", it seems.
c) How about installing a bot that sends only a couple of meaningless bytes to this ML periodically? Not a big deal, is it?
d) through z) are left as an exercise for the reader.
kind regards,