Interesting, considering that Wikispecies itself has 425 thousand main-namespace pages in total (including redirects)...
Nemo
-------- Original message -------- Subject: [Wikimedia-l] Lsjbot has now started to generate 1-1.5 M articles of species on sv:wp Date: Fri, 11 Jan 2013 17:45:25 +0100 From: Anders Wennersten mail@anderswennersten.se Reply-to: Wikimedia Mailing List wikimedia-l@lists.wikimedia.org To: Wikimedia Mailing List wikimedia-l@lists.wikimedia.org
Inspired by the bot-generated species articles made on nl:wp in late 2010, a colleague of mine, User:Lsj, started a similar project on sv:wp in early 2012. By October 2012 his bot had generated some 65,000 articles, with essentially complete coverage of all fungi and birds.
He has since extended the scope to include all living species, both animals and plants, which means another 1-1.5 million articles. Running at full permissible bot speed, the bot generates around 10,000 articles per day, but at a more realistic pace the full project will take the rest of 2013 to complete.
The bot code has been written in a language-independent way, so that it can be ported to other language versions with only a modest effort. All language-specific text strings are in external files, so the code itself does not need changing between language versions. Beyond Swedish, the code has also been tested on the Cebuano Wikipedia; full production on ceb:wp is ready to go, just awaiting community blessing there.
The core of the data is taken from the Catalogue of Life http://en.wikipedia.org/wiki/Catalogue_of_Life but the bot also checks against Commons, other language versions (interwiki links), and other appropriate databases, such as the IUCN Red List of threatened species.
The bot code is written in C# and uses the DotNetWikiBot framework.
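To give a concrete idea of the pattern, here is a minimal sketch, not Lsjbot's actual code: the key=value strings file and the hard-coded taxon data are invented for illustration, while Site and Page are DotNetWikiBot classes.

using System.Collections.Generic;
using System.IO;
using DotNetWikiBot;

class SpeciesBotSketch
{
    static void Main()
    {
        // Hypothetical strings file, one key=value pair per line, e.g.:
        //   intro={0} är en art som först beskrevs av {1}.
        //   category=Robotskapade artiklar
        // Porting the bot to another wiki means translating only this file.
        var strings = new Dictionary<string, string>();
        foreach (string line in File.ReadAllLines("strings.sv.txt"))
        {
            int eq = line.IndexOf('=');
            if (eq > 0)
                strings[line.Substring(0, eq)] = line.Substring(eq + 1);
        }

        // Log in and open the target page via DotNetWikiBot.
        Site site = new Site("http://sv.wikipedia.org", "BotAccount", "botPassword");
        Page page = new Page(site, "Solaster endeca");

        // In the real bot the taxon data would come from Catalogue of Life;
        // here it is hard-coded for illustration.
        string body = string.Format(strings["intro"], "Solaster endeca", "Linnaeus, 1771")
                      + "\n\n[[Kategori:" + strings["category"] + "]]";
        page.Save(body, "Bot: creating species article", false);
    }
}

With only the strings file translated (and community approval), the same code could in principle run against another wiki by changing the Site address and the file name.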
Example articles: http://sv.wikipedia.org/wiki/Lichenopora_verrucaria http://sv.wikipedia.org/wiki/Phylactolaemata http://sv.wikipedia.org/wiki/Rundkrassing http://ceb.wikipedia.org/wiki/Sipunculidae http://ceb.wikipedia.org/wiki/Solaster_endeca
The full set of created articles (which includes some other things besides organisms as well): http://sv.wikipedia.org/wiki/Kategori:Robotskapade_artiklar http://ceb.wikipedia.org/wiki/Kategoriya:Paghimo_ni_bot
My colleague is much too busy to join the discussion himself just now, but I think it could be an inspiration for us all.
Besides Lsj himself there are about ten users supporting him, checking that the bot generates correct data and so on. The project has also been discussed extensively on our village pump. Wikidata is not yet used.
The page where the project is currently being discussed is (in Swedish, of course):
http://sv.wikipedia.org/wiki/Anv%C3%A4ndardiskussion:Lsjbot/Projekt_alla_arter
Anders
_______________________________________________ Wikimedia-l mailing list Wikimedia-l@lists.wikimedia.org Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l
This is a good direction - I apologize for not responding sooner. I just hope the bot searches the entire web, so as to locate geographical indexing of species, e.g. fishnthesea.org. It has some search and display advantages I hope our project can incorporate. The overall model will by necessity be distributed and fractal in nature. Open-ended input, commenting, and quantifying for validity and interest is the goal. Cheers! Allan
Similar attempts have been made before on the English Wikipedia. Polbot has written a large number of missing articles by cross-checking with IUCN [1].
On Wikispecies, there were bots that wrote articles in a similar fashion [2], but the bot owner has since gone inactive. For our project, computer programmers are hard to come by because of the strong biology knowledge perceived to be required to contribute. There is a limited number of computer programmers who are also biologists (and vice versa). For us, the limiting factor is not the source-data website, but rather the programmers who could do such tasks.
We're fewer than 40 articles away from reaching the 350,000-article milestone. Perhaps we should have a brainstorm session on how and where to recruit these volunteer programmers?
[1] http://en.wikipedia.org/wiki/User:Polbot/older_tasks (see Polbot Function #6) [2] http://species.wikimedia.org/wiki/Wikispecies:Bots/Requests_for_approval/Mon...
Andrew
"Fill the world with children who care and things start looking up."
Andrew Leung, 13/01/2013 01:55:
Perhaps we should have a brainstorm session on how and where to recruit these volunteer programmers?
If Anders is right, the code was made so that all you really have to modify in order to run it is the (probably small) linguistic content. You could ask the bot owner for directions on what exactly needs to be customised; when there's consensus and the community has "translated" what needs translating, it's not hard to find someone to merely run the bot.
Nemo
I'm skeptical that mass creation of species articles is a good idea, at least until we have good integration with Wikidata. Such a bot would work with database data, and database data belongs in a database. Who is going to maintain millions of articles in a small Wikipedia when taxonomic changes happen, errors in the underlying database are corrected, or new information becomes available? On the English Wikipedia, we have enough of a problem maintaining the articles Polbot generated; the problems will be far worse on a smaller wiki that has fewer people qualified to work on biological articles.
Wikipedias are better at providing textual, complex information that does not fit well in a database. For database data, we should provide a bridge to a database (e.g., Wikidata), not replicate database content in an unmaintainable form.
Jelle Zijlstra, 13/01/2013 02:34:
I'm skeptical that mass creation of species articles is a good idea, at least until we have good integration with Wikidata. [...]
I surely agree with you, but I think Wikispecies is the only wiki exempt from such a consideration: it's its job, after all. As for Wikidata, there's https://meta.wikimedia.org/wiki/Wikidata/Notes/Future#Wikispecies which could use some additional work. As far as I understand, given that use of Wikidata by all projects other than Wikipedia is very far in the future, it's currently considered OK to have a plan where data is first ingested on a local wiki and then migrated to Wikidata. All the data they're adding to sv.wiki will eventually go to Wikidata together with all infobox data, so some kind of central planning is needed, and Wikispecies seems the most logical place. Again, if the Wikispecies community is interested you should probably get some feedback from the Wikidata team, but also not wait indefinitely for some perfect solution before starting to make things less broken.
Nemo
Under the current conditions of Wikidata, I wouldn't touch it with a 10-foot pole, let alone let it integrate with or replace Wikispecies. And here are the reasons why:
Note: I will be using Canis lupus (Gray wolf) for illustrative purposes.
1) The search results in Wikidata are woefully incorrect. If I type Canis lupus into Wikidata search, the correct result turns up as the 18th item on the search list. [1]
2) Spelling variation completely throws off Wikidata search. If I search for Grey wolf (with an e and not an a), Wikidata says there is no page found. [2]
3) It lacks taxonomy navigation, which is crucial for a taxonomy database. Even Commons does a better job than Wikidata. [3]
To show that this is a much more widespread problem (and to avoid being accused of cherry-picking), I will use Black Oak as the search term in the following example.
4) Wikidata is very poor at handling cases where different species share the same common name in different locations. A search on Wikipedia identifies three tree species [4] (one in the western U.S., one in the eastern U.S. and Canada, one in Australia). The same search on Wikispecies correctly identifies the first two of those species. [5] The Australian species page has not yet been created on Wikispecies, so you can conclude that the correct rate is 2 out of 2 (100%), or 2 out of 3 (66.6%) if you argue that the missing page should be counted. Conducting the same search on Wikidata brings up 7 pages. [6] None of them is a species page (they are links to a band, an album by that band, or a town). The correct rate on Wikidata is 0 out of 3, or 0%.
I haven't had time to investigate the accuracy of the interwiki links on Wikidata, but I think I could write an essay on how inaccurately they point. Plus, where will you store reference links to the articles that describe each species? Certainly not on Wikidata or Commons, and I rarely see editors do that on Wikipedia. Through the examples presented above, I believe that Wikidata is ill-suited to integrate with Wikispecies, and in my opinion we should be very cautious about the data quality of Wikidata if we decide to import information from there into Wikispecies.
[1] http://en.wikidata.org/w/index.php?search=Canis+lupus&title=Special%3ASe... [2] http://en.wikidata.org/w/index.php?title=Special%3ASearch&profile=defaul... [3] http://commons.wikimedia.org/wiki/Category:Canis_lupus [4] http://en.wikipedia.org/wiki/Black_oak [5] http://species.wikimedia.org/w/index.php?title=Special:Search&search=bla... [6] http://en.wikidata.org/w/index.php?search=Black+oak&title=Special%3ASear...
Andrew
"Fill the world with children who care and things start looking up."
For the search results order there's https://bugzilla.wikimedia.org/show_bug.cgi?id=43238 . For the rest, I'm not sure; I've asked the Wikidata team to comment.
Andrew Leung, 13/01/2013 05:57:
[...] Through the examples presented above, I believe that Wikidata is ill-suited to integrate with Wikispecies, and in my opinion we should be very cautious about the data quality of Wikidata if we decide to import information from there into Wikispecies.
I think the aim should be the opposite: Wikidata should be fed by Wikispecies, which is the natural provider of this information, and then other wikis should use this data for their infoboxes etc. If the *presentation* of data on Wikidata itself is wrong/useless, this is not a problem but rather something natural: it only means that Wikispecies (like the other wikis) will continue to exist as presentation of the data and other less structured information. What you see now are only interwikis, so it's quite natural that they're not of much value.
What I fear is multiple or dozens of Wikipedias feeding species data into Wikidata in a disorderly manner. Deduplicating millions of data entries from different sources is not fun. But if Wikispecies doesn't take the lead and use its competence to make sure this is done right, and if Wikispecies and/or Wikidata do not have all the data the various Wikipedias have or need, then disorderly feeding of data (like sv.wiki's) will inevitably happen at some point.
Nemo