One major concern... The next generation would believe that Botpedia started in 2001 and that Nupedia and Wikipedia lasted no more than a few days. We have to make sure not to hide facts from them. Let's document this in The Guinness Book of Records.
Fayssal F.
On Sun, 1 Jun 2008 01:36:44 -0700 (PDT) bobolozo bobolozo@yahoo.com
Subject: [WikiEN-l] User:FritzpollBot creating millions of new articles
To: wikien-l@lists.wikimedia.org
See http://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28proposals%29#User:Fri...
A bot has been approved to create articles on every place in the world which doesn't have an article yet, predicted to be about 2 million new stub articles.
There is some question as to whether or not this is a good idea, as it would double our number of articles within a few months, perhaps mess up Special:Random, and most of the new articles would forever be tiny stubs. There are suggestions that perhaps the bot could be limited to towns of a certain population size, or perhaps the tiny villages could be combined into lists instead of each having its own article.
I'm not arguing for or against this, just bringing it up here. If there are any concerns, speak now before the bot begins.
2008/6/1 Fayssal F. szvest@gmail.com:
One major concern... The next generation would believe that Botpedia started in 2001 and that Nupedia and Wikipedia lasted no more than a few days.
heh.
I think this bot-assisted programme of article creation is a Good Thing for topics where we do in fact have the data. Rambot's 2003 creation of 30,000 US placenames meant Wikipedia could claim *completeness* on the topic. (That's what the "encyclo-" prefix of "encyclopedia" means.) There is no reason not to bring similar completeness to our coverage of the rest of the world. It'll certainly help alleviate our systemic bias.
The issues I can see are editorial - the Rambot articles are data put in prose form that these days we'd do with a parameterised template, etc - but Fritzpoll seems quite aware of this and the programme appears to include considerable human review. Good.
The jump in article numbers will be noticed by the world. I've told the comcom list about this and have suggested a Wikimedia blog post on the topic.
The question that springs to mind is: what else can we get complete data on for bot-assisted article creation? Every state-level or higher politician in every country ever? What else?
- d.
How does simply having what is basically a placeholder entry for each town/village/city etc. alleviate systemic bias? Most of these articles will be ignored completely after creation, while those that are expanded and improved will reflect the factors that led to systemic bias in content initially.
As to the proposal itself, I don't think it's a big deal. I also don't think it's of much value - if it works out to be mainly one or two line stubs, what encyclopedic value does that really add? Only people who already know that a town exists and what country it's in will search for it, right? The one reference is nice, and the links to maps (if they're on all stubs) are cool, but still. Creating 2 million more articles that won't be touched for a decade seems unnecessary and doesn't really bring much actual knowledge to the 'pedia, but aside from damaging the hell out of my random article patrolling it doesn't appear to present a serious problem either.
As to random article patrolling, it's past time that there was a way to restrict this "randomness." Mostly I look for unreferenced and uncategorized BLPs, and if I could exclude all articles categorized as towns/villages/cities etc. that would really help. Is there already a way to do that?
Nathan
Nathan wrote:
How does simply having what is basically a placeholder entry for each town/village/city etc. alleviate systemic bias? Most of these articles will be ignored completely after creation, while those that are expanded and improved will reflect the factors that led to systemic bias in content initially.
It certainly can't _hurt_ systemic bias. Having a stub will allow people to start editing on the subject more easily, especially anonymous editors. And even if the article's ignored after creation, at least we'll have _something_ on that subject.
2008/6/1 Nathan nawrich@gmail.com:
How does simply having what is basically a placeholder entry for each town/village/city etc. alleviate systemic bias? Most of these articles will be ignored completely after creation,
I think you're dead wrong there. How many Rambot articles are untouched?
Creating 2 million more articles that won't be touched for a decade
[citation needed]
- d.
<ref>precognitive vision</ref>
Does that work? I don't have proof that it's true, just a strong suspicion. It seems unlikely to me that two million articles that have gone 7-odd years without creation are going to be heavily trafficked by editors once they are created en masse. Of course it does make it easier for IP editors to contribute information on these places, which is a good thing. But by and large, I think it will take many years for these articles to evolve into something useful.
Regarding the comparison to Rambot - not something I know a whole lot about. Rambot created articles based on incorporated communities in the US using census data, right? 30,000 articles according to Tim Starling and 90,000 according to MBisanz. Rambot ran 5 years ago, and the articles are about US townships - and yet a fair number of them, according to what I've seen and the debate on the proposal, are still not heavily modified ([[Nemacolin, Pennsylvania]] is an example). I think you can see how a direct comparison between these and the Fritzpoll bot-created articles doesn't really work. Articles about 100,000 towns in Africa are somewhat less likely to be edited than the Rambot articles were when they were created (at a time when they represented a significant portion of a much smaller 'pedia).
All that said, I'm not saying it shouldn't be done - just that it has less value than some seem to have assigned to it.
Nathan
2008/6/1 Nathan nawrich@gmail.com:
<ref>precognitive vision</ref> Does that work? I don't have proof that it's true, just a strong suspicion. It seems unlikely to me that two million articles that have gone 7-odd years without creation are going to be heavily trafficked by editors once they are created en masse.
They're being created with considerable human review and the active involvement of the relevant country WikiProjects, which should help.
a fair number of them, according to what I've seen and the debate on the proposal, are still not heavily modified ([[Nemacolin, Pennsylvania]] is an example).
I think a lot of that is that they look more substantial than they are, because of the data being rendered as prose. There are still many US town articles that have been substantially expanded but still have an unedited blob of Rambot output in there.
- d.
On Sun, Jun 1, 2008 at 10:38 AM, Nathan nawrich@gmail.com wrote:
All that said, I'm not saying it shouldn't be done - just that it has less value than some seem to have assigned to it.
WikiEN-l mailing list WikiEN-l@lists.wikimedia.org To unsubscribe from this mailing list, visit: https://lists.wikimedia.org/mailman/listinfo/wikien-l
Alright, so here's the "it shouldn't be done", and unless I see significant consensus I'll happily block the bot if tried. Let's let real people write articles on subjects of actual value, please, and if it takes longer, at least the results will be of some reasonable quality.
2008/6/1 Todd Allen toddmallen@gmail.com:
Alright, so here's the "it shouldn't be done", and unless I see significant consensus I'll happily block the bot if tried. Let's let real people write articles on subjects of actual value, please, and if it takes longer, at least the results will be of some reasonable quality.
He'll be working through the WikiProjects, as has been noted (and on the explanation pages). You appear to be saying you're going to work in opposition to them just because you can. Surely not.
- d.
Todd, you're a bit behind. I think the bot was technically approved awhile ago and has already started.
Nathan
On Sun, Jun 1, 2008 at 11:01 AM, Nathan nawrich@gmail.com wrote:
Todd, you're a bit behind. I think the bot was technically approved awhile ago and has already started.
Nathan
If there's consensus, then there is, and I'll go by it. I'll have to look into it a bit further. I just remember the garbage Rambot wrote, and I'd sure prefer not to see that on a larger scale. If this is better designed, then so it is, but I remain skeptical about bots writing articles.
2008/6/1 Todd Allen toddmallen@gmail.com:
If there's consensus, then there is, and I'll go by it. I'll have to look into it a bit further. I just remember the garbage Rambot wrote, and I'd sure prefer not to see that on a larger scale. If this is better designed, then so it is, but I remain skeptical about bots writing articles.
Fritzpoll is keenly aware of these issues and is addressing them as best he can. The test results look (to my eye) *way* better than what Rambot created. And he's going out and looking for editors to involve in the output.
- d.
Todd Allen schreef:
If this is better designed, then so it is, but I remain skeptical about bots writing articles.
What is your opinion about, for example, [[Ortonovo]], an article created by a bot 2 years ago, and not changed since then?
The major problem with the bot is that there may be more detailed databases on national level, while Fritz's bot uses only the data from a global database. But his last message on [[Wikipedia:Village pump (proposals)/FritzpollBot creating millions of new articles#Motion to recess by the operator]] indicates that he will also take a look at the individual countries, which shows he has the right attitude.
Eugene
On Mon, Jun 2, 2008 at 7:27 AM, Eugene van der Pijll eugene@vanderpijll.nl wrote:
The major problem with the bot is that there may be more detailed databases on national level, while Fritz's bot uses only the data from a global database. But his last message on [[Wikipedia:Village pump (proposals)/FritzpollBot creating millions of new articles#Motion to recess by the operator]] indicates that he will also take a look at the individual countries, which shows he has the right attitude.
Don't forget that since the bulk of these articles will be an infobox, it will be straightforward for other bots to add more information later.
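The point above — that infobox-based stubs are easy for later bots to enrich — can be sketched as a minimal parameter updater. The template layout, field names, and one-parameter-per-line assumption here are illustrative only, not any actual bot's code:

```python
import re

def update_infobox_param(wikitext, param, new_value):
    """Replace the value of an infobox parameter, or append it if absent.

    Assumes the simple one-parameter-per-line layout that bot-generated
    infoboxes typically use; nested templates would need a real parser.
    """
    pattern = re.compile(r"(\|\s*%s\s*=\s*)[^\n|]*" % re.escape(param))
    if pattern.search(wikitext):
        return pattern.sub(lambda m: m.group(1) + new_value, wikitext)
    # Parameter missing: insert it just before the closing braces.
    return wikitext.replace("}}", "| %s = %s\n}}" % (param, new_value), 1)

# Hypothetical stub, not real FritzpollBot output.
stub = """{{Infobox settlement
| name = Jandaba
| coordinates = 41.9/45.3
}}"""

print(update_infobox_param(stub, "population", "1200"))
```

A second bot run with a better national database could then fill in population, elevation, and so on without touching any human-written prose around the infobox.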
--- Nathan nawrich@gmail.com wrote:
Todd, you're a bit behind. I think the bot was technically approved awhile ago and has already started.
Nathan
The bot was technically approved, by whichever half a dozen people were involved in the bot approval process at the time. The bot approval committee doesn't really have the ability to speak to the wider issues at stake here, namely the effect of the creation of 2 million stub articles, which is why the issue is now being discussed on a wider scale.
The bot hasn't actually already started, it has created 100 test articles.
--- David Gerard dgerard@gmail.com wrote:
2008/6/1 Nathan nawrich@gmail.com:
How does simply having what is basically a placeholder entry for each town/village/city etc. alleviate systemic bias? Most of these articles will be ignored completely after creation,
I think you're dead wrong there. How many Rambot articles are untouched?
Creating 2 million more articles that won't be touched for a decade
[citation needed]
- d.
The obvious response to that is that the Rambot articles have been added to because they're articles on places full of English-speaking people who have internet access.
A small fishing village in Cambodia, or a community of 100 people in Kenya, may well have no internet access at all, and if they have it, they would not likely be visiting the English wikipedia as they wouldn't likely speak English.
2008/6/1 bobolozo bobolozo@yahoo.com:
A small fishing village in Cambodia, or a community of 100 people in Kenya, may well have no internet access at all, and if they have it, they would not likely be visiting the English wikipedia as they wouldn't likely speak English.
That doesn't really serve as a justification for express systematic bias (rather than the default systemic bias).
- d.
On Sun, Jun 1, 2008 at 2:18 PM, David Gerard dgerard@gmail.com wrote:
2008/6/1 bobolozo bobolozo@yahoo.com:
A small fishing village in Cambodia, or a community of 100 people in Kenya, may well have no internet access at all, and if they have it, they would not likely be visiting the English wikipedia as they wouldn't likely speak English.
That doesn't really serve as a justification for express systematic bias (rather than the default systemic bias).
- d.
Well, there is a line between "systemic bias" and "source bias". If secondary, reliable sources don't consider X important enough to write much about, we follow their lead, and do not write much. If they don't think X is important enough to write about at all, we follow their lead, and do not write anything at all. That's not our decision, and introducing anything beyond what sources do is introducing -our own- bias, however noble the motives behind it may be.
2008/6/1 Todd Allen toddmallen@gmail.com:
Well, there is a line between "systemic bias" and "source bias". If secondary, reliable sources don't consider X important enough to write much about, we follow their lead, and do not write much. If they don't think X is important enough to write about at all, we follow their lead, and do not write anything at all. That's not our decision, and introducing anything beyond what sources do is introducing -our own- bias, however noble the motives behind it may be.
All towns are "notable."
- d.
On Sun, Jun 1, 2008 at 2:47 PM, David Gerard dgerard@gmail.com wrote:
2008/6/1 Todd Allen toddmallen@gmail.com:
Well, there is a line between "systemic bias" and "source bias". If secondary, reliable sources don't consider X important enough to write much about, we follow their lead, and do not write much. If they don't think X is important enough to write about at all, we follow their lead, and do not write anything at all. That's not our decision, and introducing anything beyond what sources do is introducing -our own- bias, however noble the motives behind it may be.
All towns are "notable."
- d.
While that's been an oft-repeated canard in the past, it is by no means a given. Nor, even if true, does it mean the best organizational structure is to have a separate one-sentence article for every tiny dot on the map, when lists could handle it far better.
2008/6/4 Todd Allen toddmallen@gmail.com:
While that's been an oft-repeated canard in the past, it is by no means a given. Nor, even if true, does it mean the best organizational structure is to have a separate one-sentence article for every tiny dot on the map, when lists could handle it far better.
"Lists could handle it far better"? I think that falls equally under "oft-repeated but by no means true" ;-)
Let's consider a few metrics.
* Informational content - about even. A one or two-line article (name, location, coordinates) can easily be replicated as a table without losing any amount of information, or vice versa. (Indeed, there is a case to be made for having both, the way I think we have with French villages)
* Development potential - separate articles are much better, because tables are horrendous to edit for a new user. Even for an experienced user happy with the format, there's not really any way *to* expand our basic gazetteer entry if it's a line in a table - you can't really add a chunk on at the side. With a separate article, on the other hand, it's about as simple as it can be.
* Ease of maintenance - lists are much better. Only one article need be watched for vandalism, bot updates can be done in one edit, etc, (One small caveat: it does create potential confusion due to the hundreds of redirects - if the lists are reorganised, will all the redirects get moved?)
* Utility to reader - separate articles are marginally better. As mentioned above, the actual amount of content is the same for a line in a list as for a stub article. The difference is that if you click on [[Jandaba, Georgia]] and get a stub you get the content presented to you immediately, whereas if you get a list you have to poke around to find it.
Given all this, I really don't see a clear case that combined lists handle this sort of thing better - each has one clear plus point, but to my mind the expansion potential more than outweighs the maintenance benefit.
We can make a much better case for using lists when thinking of things like asteroids, where the known information is both limited and very unlikely to be significantly expanded in the normal course of events. However, given that there *is* information out there on the town of Jandaba*, and we would quite like people to write about it, using the method that encourages them to do so seems advisable...
This whole Fritzbot thing is a good thing--just like Rambot was. How long did it take most of those articles to get built up? Years?
The question is do you always look at the short term, or the long term. Fritzbot is a long term answer to some very large gaps in the system.
Honestly, the only real concern I can see is that it will flood newpages (fixable) or that it will flood Special:Random (fixable--just devalue any article smaller than a given size or with a hidden Fritzbot category).
Joe
Joe Szilagyi wrote:
Honestly, the only real concern I can see is that it will flood newpages (fixable) or that it will flood Special:Random (fixable--just devalue any article smaller than a given size or with a hidden Fritzbot category).
Or loosen up a bit on what warrants an article to allow the rest of the encyclopedia to grow to the same relative size. From my understanding of how random page works it wouldn't be easy to "weight" it this way.
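For illustration, the "devalue small articles" idea Joe describes amounts to rejection sampling. This toy sketch uses invented article sizes and is not how MediaWiki's Special:Random actually works (which is Bryan's point about weighting being hard in practice):

```python
import random

def weighted_random_article(articles, min_size=2000, stub_weight=0.1):
    """Pick a random article, accepting stubs only a fraction of the time.

    `articles` maps title -> size in bytes. Stubs (below min_size) are
    resampled with probability 1 - stub_weight, so each stub turns up
    roughly stub_weight times as often as each full article.
    """
    titles = list(articles)
    while True:
        title = random.choice(titles)
        if articles[title] >= min_size or random.random() < stub_weight:
            return title

# Invented example data: one real article among nine bot stubs.
pages = {"Pompeii": 45000}
pages.update({"Stub %d" % i: 300 for i in range(9)})

sample = [weighted_random_article(pages) for _ in range(10000)]
print(sample.count("Pompeii") / len(sample))  # roughly 0.5, not 0.1
```

Even this simple scheme shows the trade-off: a hidden "bot-created" category or a size threshold is easy to test against, but every rejected draw costs another database lookup.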
On Thu, Jun 5, 2008 at 12:45 AM, Andrew Gray shimgray@gmail.com wrote:
- Ease of maintenance - lists are much better. Only one article need
be watched for vandalism, bot updates can be done in one edit, etc,
Most simple vandalism isn't dealt with by watchlists anyway, but by recent changes listings, whether on-wiki, on the IRC feed or through separate tools that monitor these. Having separate articles makes no difference to these methods whatsoever, since they're edit-based and not page-based.
As for bots: well, the very reason for their existence is so that they can do the grunt work for us, and updates to large groups of articles is a perfect example of such work.
2008/6/4 Stephen Bain stephen.bain@gmail.com:
Most simple vandalism isn't dealt with by watchlists anyway, but by recent changes listings, whether on-wiki, on the IRC feed or through separate tools that monitor these. Having separate articles makes no difference to these methods whatsoever, since they're edit-based and not page-based.
Mmm... yes and no. I have a gut assumption that basically says "the fewer articles, the easier it is to monitor them"; it may not be a very strong connection compared to how it was Back In The Day, but I think the relationship is still there.
As for bots: well, the very reason for their existence is so that they can do the grunt work for us, and updates to large groups of articles is a perfect example of such work.
Listing does have the subtle advantage here that you *know* you're going to get all the entries - with individual articles, it's relatively easy to lose a few in your later updates.
Anyhow, you're supporting my position, I won't argue back too much :-)
On Wed, Jun 4, 2008 at 11:24 AM, Stephen Bain stephen.bain@gmail.com wrote:
On Thu, Jun 5, 2008 at 12:45 AM, Andrew Gray shimgray@gmail.com wrote:
- Ease of maintenance - lists are much better. Only one article need
be watched for vandalism, bot updates can be done in one edit, etc,
Most simple vandalism isn't dealt with by watchlists anyway, but by recent changes listings, whether on-wiki, on the IRC feed or through separate tools that monitor these. Having separate articles makes no difference to these methods whatsoever, since they're edit-based and not page-based.
Having separate articles makes no difference for watchlists either, once you've got the articles loaded on your watchlist. And mass-addition of lots of pages to your watchlist is fairly trivial with "edit raw watchlist". I'd say having multiple articles is usually easier for maintenance, and that's not a gut assumption, it's something I've thought about quite a bit.
That said, if you have so many articles that you can't *create* them by hand, then you're naturally going to have trouble *maintaining* them by hand. And from what I've seen the data being added by this bot lends itself naturally to a tabular format. It's raw data forced into a text form, and I think that makes all the difference.
I've thought about this enough to decide that I don't think these pages should be added. If the spam links to Encarta and Maplandia are removed (and no others are added), I'm fairly neutral on it. If someone wants to waste their time adding a couple million useless articles, that's their problem. But the example I've seen with the links to Encarta and Maplandia seems highly inappropriate.
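For what it's worth, the raw-watchlist trick mentioned above is easy to script: generate one title per line and paste the result into the "edit raw watchlist" page. A minimal sketch, with invented titles:

```python
def raw_watchlist(titles):
    """Build text to paste into the raw watchlist editor (one title per line).

    MediaWiki watches a page together with its talk page, so listing the
    article title alone is enough.
    """
    # Deduplicate while preserving order, and normalise underscores to spaces.
    seen = set()
    lines = []
    for t in titles:
        t = t.replace("_", " ").strip()
        if t and t not in seen:
            seen.add(t)
            lines.append(t)
    return "\n".join(lines)

# Hypothetical list of bot-created pages, e.g. scraped from the bot's
# contributions log.
bot_created = ["Jandaba, Georgia", "Jandaba,_Georgia", "Nemacolin, Pennsylvania"]
print(raw_watchlist(bot_created))
```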
On Wed, Jun 4, 2008 at 6:05 PM, Anthony wikimail@inbox.org wrote:
But the example I've seen with the links to Encarta and Maplandia seems highly inappropriate.
Some more on this, read http://www.maplandia.com/terms-of-use/
This isn't even an open source or non-profit project we're talking about adding millions of links to. And Encarta... Microsoft... I'm amazed I'm the only one raising an objection to this.
On Wed, Jun 4, 2008 at 3:20 PM, Anthony wikimail@inbox.org wrote:
On Wed, Jun 4, 2008 at 6:05 PM, Anthony wikimail@inbox.org wrote:
But the example I've seen with the links to Encarta and Maplandia seems highly inappropriate.
Some more on this, read http://www.maplandia.com/terms-of-use/
This isn't even an open source or non-profit project we're talking about adding millions of links to. And Encarta... Microsoft... I'm amazed I'm the only one raising an objection to this.
You aren't the only one: I've seen this brought up elsewhere (on Wikipedia itself, I think) - and Fritzpoll has already agreed that those links are problematic and won't be included, if I understand him correctly.
-Matt
On 6/5/08, Anthony wikimail@inbox.org wrote:
Having separate articles makes no difference for watchlists either, once you've got the articles loaded on your watchlist. And mass-addition of lots of pages to your watchlist is fairly trivial with "edit raw watchlist".
...
That said, if you have so many articles that you can't *create* them by hand, then you're naturally going to have trouble *maintaining* them by hand.
If you're in a watchlist mindset, yes. This was my point about recentchanges though: since it's edit-based, not page-based, the only thing that affects the rate at which vandalism/etc can be dealt with is the project-wide edit rate. You don't need to know that a page exists to be able to check a diff via recentchanges, or via some automated tool that gets its information from recentchanges.
You're only going to have trouble maintaining more articles if the very existence of those articles attracts more vandalism, but that view would completely ignore what we've seen through the whole history of the project, which is that productive contributors are attracted at a greater rate than vandals. If FritzpollBot articles attract more vandals, they'll simultaneously attract more contributors.
On Fri, Jun 6, 2008 at 3:53 AM, Stephen Bain stephen.bain@gmail.com wrote:
On 6/5/08, Anthony wikimail@inbox.org wrote:
That said, if you have so many articles that you can't *create* them by hand, then you're naturally going to have trouble *maintaining* them by hand.
You're only going to have trouble maintaining more articles if the very existence of those articles attracts more vandalism, but that view would completely ignore what we've seen through the whole history of the project, which is that productive contributors are attracted at a greater rate than vandals. If FritzpollBot articles attract more vandals, they'll simultaneously attract more contributors.
I see "maintaining" articles as including more than just stopping vandalism. If a city gets a new name, or the province it is in changes, or it merges with another city, or something like that, it requires "maintenance". This type of maintenance will probably be much more difficult to keep up with than protecting against vandalism.
I do think that having these articles will clearly attract both more vandals and more positive contributors. Without thinking about it too deeply I'd guess the ratio of positive to negative contributions will be lower than the current average. But I don't see that as a very big deal. I think the effect would be minor. If no one has created these articles yet, there probably aren't very many English speaking Internet users who care about the location enough to contribute *or* vandalize an article on it.
It often strikes me that the subtext behind the urge to listify is the fairly-well-embedded belief that subjects only get an article to themselves if they "deserve" it.
-Matt
bobolozo wrote:
A small fishing village in Cambodia, or a community of 100 people in Kenya, may well have no internet access at all, and if they have it, they would not likely be visiting the English wikipedia as they wouldn't likely speak English.
Hmm. By the same token, I guess we shouldn't have articles on [[Troy]], [[Pompeii]], [[Neolithic Europe]], [[Xanadu]], [[Atlantis]], or [[Mars]].
You're missing the point. I think anyone can agree that Troy and Pompeii have much more global significance than X fishing village, Cambodia.
Noble Story
Steve Summit scs@eskimo.com wrote:
bobolozo wrote:
A small fishing village in Cambodia, or a community of 100 people in Kenya, may well have no internet access at all, and if they have it, they would not likely be visiting the English wikipedia as they wouldn't likely speak English.
Hmm. By the same token, I guess we shouldn't have articles on [[Troy]], [[Pompeii]], [[Neolithic Europe]], [[Xanadu]], [[Atlantis]], or [[Mars]].
Mark Nilrad wrote:
Steve Summit scs@eskimo.com wrote:
bobolozo wrote:
A small fishing village in Cambodia, or a community of 100 people in Kenya, may well have no internet access at all, and if they have it, they would not likely be visiting the English wikipedia as they wouldn't likely speak English.
Hmm. By the same token, I guess we shouldn't have articles on [[Troy]], [[Pompeii]], [[Neolithic Europe]], [[Xanadu]], [[Atlantis]], or [[Mars]].
You're missing the point. I think anyone can agree that Troy and Pompeii have much more global significance than X fishing village, Cambodia.
Um, no, you missed my point. Arguing about notability or "global significance" is one thing. But it makes no sense to bring up the question of whether the location of an article has Internet access, or how many people there might speak English.
David Gerard wrote:
The question that springs to mind is: what else can we get complete data on for bot-assisted article creation? Every state-level or higher politician in every country ever? What else?
Television episodes... :)
My initial reaction when learning about this plan was "oh no, not again" - I remember Rambot and how for a long while random page would give me one of its articles half the time I clicked on it. But after thinking for a moment longer I think this will be a good thing, overall. I'm a fan of completeness too and this will help establish that completeness is an important criterion for article inclusion.
The only remaining problem I have with Rambot's articles is the formatting, which is archaic now and would require a rather clever bot to update without damaging edits that have been made to the original base. But I'm sure FritzpollBot will be using modern style guidelines for templates and citations when generating articles so that should be fine.
2008/6/1 Bryan Derksen bryan.derksen@shaw.ca:
The only remaining problem I have with Rambot's articles is the formatting, which is archaic now and would require a rather clever bot to update without damaging edits that have been made to the original base.
It took data and rendered it in prose form. These days we'd just put the data in as template parameters.
But I'm sure FritzpollBot will be using modern style guidelines for templates and citations when generating articles so that should be fine.
The considerable human review and involvement of country WikiProjects will help a great deal (and I suggest makes a nonsense of Nathan's assertion that these articles will be "untouched for a decade").
- d.
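As a sketch of the difference David describes — storing the data as template parameters rather than rendering it into prose — here is a toy generator. The template name and fields are illustrative, not FritzpollBot's actual output format:

```python
def settlement_infobox(record):
    """Render a gazetteer record as template parameters rather than prose.

    With the data held as parameters, a later style change only requires
    editing the template, not re-parsing prose in two million articles.
    """
    lines = ["{{Infobox settlement"]
    for key, value in record.items():
        lines.append("| %s = %s" % (key, value))
    lines.append("}}")
    return "\n".join(lines)

# Hypothetical gazetteer record.
record = {"name": "Jandaba", "country": "Georgia",
          "population": "1200", "coordinates": "{{coord|41.9|45.3}}"}
print(settlement_infobox(record))
```

Contrast this with Rambot's approach, where the same numbers were baked into sentences ("The population was X...") that no later bot could safely update.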
David Gerard wrote:
2008/6/1 Bryan Derksen bryan.derksen@shaw.ca:
But I'm sure FritzpollBot will be using modern style guidelines for templates and citations when generating articles so that should be fine.
The considerable human review and involvement of country WikiProjects will help a great deal (and I suggest makes a nonsense of Nathan's assertion that these articles will be "untouched for a decade").
This seems to be presented differently depending on which proposal you read. I'm finding stuff on the wiki about two million articles, which is virtually impossible if there is actually "considerable human review and involvement of the country WikiProjects" (especially since we don't even have WikiProjects on some countries).
This particular proposal also seems to be heavily relying on a few global placename databases that: 1) have very minimal data (often just coordinates); and 2) have lots of errors, or at best ambiguity. Someone already found a village in Nigeria that likely doesn't even exist [1].
I would be much more comfortable with this being done on a country-by-country basis, with country-specific data sources that are both more complete and more reliable. In fact this is already being done---besides RamBot, there have been other country-specific projects to, for example, add articles on all the [[en:woreda]]s of Ethiopia, villages in Afghanistan, and so on. These sorts of projects go on all the time, rarely encounter much opposition, and produce higher-quality results. The attempt to uniformly treat the entire world en masse, based on questionable global data sources, is what's a bit more controversial.
-Mark
[1] http://en.wikipedia.org/wiki/Wikipedia:Articles_for_deletion/Gnaa%2C_Nigeria...
2008/6/1 Delirium delirium@hackish.org:
This seems to be presented differently depending on which proposal you read. I'm finding stuff on the wiki about two million articles, which is virtually impossible if there is actually "considerable human review and involvement of the country WikiProjects" (especially since we don't even have WikiProjects on some countries).
What some people on the Village Pump are going nonlinear about has little to no relation to any actual plans.
This particular proposal also seems to be heavily relying on a few global placename databases that: 1) have very minimal data (often just coordinates); and 2) have lots of errors, or at best ambiguity. Someone already found a village in Nigeria that likely doesn't even exist [1]. I would be much more comfortable with this being done on a country-by-country basis, with country-specific data sources that are both more complete and more reliable. In fact this is already being done---besides RamBot, there have been other country-specific projects to, for example, add articles on all the [[en:woreda]]s of Ethiopia, villages in Afghanistan, and so on. These sorts of projects go on all the time, rarely encounter much opposition, and produce higher-quality results. The attempt to uniformly treat the entire world en masse, based on questionable global data sources, is what's a bit more controversial.
The most sensible actual discussion I can find is on [[User talk:Fritzpoll]]. Fritzpoll appears to be proceeding with all due caution, and at this stage will be preparing only lists of possible articles for creation, not actually creating them.
- d.
Delirium wrote:
This particular proposal also seems to be heavily relying on a few global placename databases that: 1) have very minimal data (often just coordinates); and 2) have lots of errors, or at best ambiguity. Someone already found a village in Nigeria that likely doesn't even exist.
This point was brought up on one of the discussion pages. I believe the plan is to use the intersection of two or three nominally-independent databases.
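The intersection idea is straightforward to sketch. The following is an illustration only, not Fritzpoll's actual code; the field names, the name-matching rule, and the 10 km coordinate tolerance are all assumptions. A placename is accepted only when a second, nominally-independent gazetteer corroborates it.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(a))

def corroborated(db_a, db_b, max_km=10):
    """Yield entries from db_a whose name also appears in db_b
    with coordinates within max_km of each other."""
    by_name = {}
    for e in db_b:
        by_name.setdefault(e["name"].casefold(), []).append(e)
    for e in db_a:
        for other in by_name.get(e["name"].casefold(), []):
            if haversine_km(e["lat"], e["lon"],
                            other["lat"], other["lon"]) <= max_km:
                yield e
                break

# Toy data: the dubious entry appears in only one database and is dropped.
nga = [{"name": "Gnaa", "lat": 9.1, "lon": 7.4},
       {"name": "Kano", "lat": 12.0, "lon": 8.5}]
geonames = [{"name": "Kano", "lat": 11.99, "lon": 8.52}]

print([e["name"] for e in corroborated(nga, geonames)])
```

An entry like the phantom Nigerian village would only survive this filter if both databases share the same upstream error, which is why "nominally-independent" matters.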
2008/6/1 David Gerard dgerard@gmail.com:
2008/6/1 Bryan Derksen bryan.derksen@shaw.ca:
The only remaining problem I have with Rambot's articles is the formatting, which is archaic now and would require a rather clever bot to update without damaging edits that have been made to the original base.
It took data and rendered it in prose form. These days we'd just put the data in as template parameters.
The places rambot was covering for the most part actually exist. We've had issues with the National Geospatial-Intelligence Agency in the past. Remember this mess:
http://en.wikipedia.org/wiki/Wikipedia:Articles_for_deletion/Gnaa%2C_Nigeria...
On Sun, Jun 1, 2008 at 12:09 PM, Bryan Derksen bryan.derksen@shaw.ca wrote:
The only remaining problem I have with Rambot's articles is the formatting, which is archaic now and would require a rather clever bot to update without damaging edits that have been made to the original base. But I'm sure FritzpollBot will be using modern style guidelines for templates and citations when generating articles so that should be fine.
Modern for 2008, sure. Archaic in 2010? Probably.
Anthony wrote:
On Sun, Jun 1, 2008 at 12:09 PM, Bryan Derksen bryan.derksen@shaw.ca wrote:
The only remaining problem I have with Rambot's articles is the formatting, which is archaic now and would require a rather clever bot to update without damaging edits that have been made to the original base. But I'm sure FritzpollBot will be using modern style guidelines for templates and citations when generating articles so that should be fine.
Modern for 2008, sure. Archaic in 2010? Probably.
Wikipedia's technologies and style guides are much more future-proof now. We use templates more extensively, allowing the formatting of chunks of the article to be changed site-wide in an easy manner, and things like ref tags will make it easier for future bots to parse the articles when attempting to update things.
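The difference is easy to see in a sketch. This is an illustration, not what FritzpollBot will actually emit: "Infobox settlement" is a real en.wp template, but the minimal field set and the stub sentence are assumptions. Because the data goes in as template parameters rather than hand-written prose, a later change to the template re-renders every article at once.

```python
def settlement_stub(name, country, lat, lon, population=None):
    """Render place data as template parameters plus one prose sentence."""
    lines = [
        "{{Infobox settlement",
        f"| name        = {name}",
        f"| country     = {country}",
        f"| coordinates = {{{{coord|{lat}|{lon}|display=title}}}}",
    ]
    if population is not None:
        lines.append(f"| population  = {population}")
    lines.append("}}")
    lines.append(f"'''{name}''' is a settlement in {country}.")
    return "\n".join(lines)

stub = settlement_stub("Kano", "Nigeria", 12.0, 8.5, population=2828861)
print(stub)
```

A 2003-style bot would instead have baked the population figure into a paragraph of prose, which is exactly what makes the Rambot articles hard to update mechanically now.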
On Sun, Jun 1, 2008 at 11:47 AM, David Gerard dgerard@gmail.com wrote:
The question that springs to mind is: what else can we get complete data on for bot-assisted article creation? Every state-level or higher politician in every country ever? What else?
The answer to that question as posed is very large, but the practical answer is probably considerably smaller. Cities got grandfathered in before "notability" standards arose. For example, there's lots of public domain information on every 501(c)(3) charity in the United States. But would people accept bot-assisted article creation for every one of them? Other examples would be every company that does business in Florida, every publicly traded corporation, every person who died in the World Trade Center on September 11th, every domain name, every public router on the Internet, every Perl module in CPAN, etc.
Anthony wrote:
On Sun, Jun 1, 2008 at 11:47 AM, David Gerard dgerard@gmail.com wrote:
The question that springs to mind is: what else can we get complete data on for bot-assisted article creation? Every state-level or higher politician in every country ever? What else?
The answer to that question as posed is very large, but the practical answer is probably considerably smaller. Cities got grandfathered in before "notability" standards arose. For example, there's lots of public domain information on every 501(c)(3) charity in the United States. But would people accept bot-assisted article creation for every one of them? Other examples would be every company that does business in Florida, every publicly traded corporation, every person who died in the World Trade Center on September 11th, every domain name, every public router on the Internet, every Perl module in CPAN, etc.
Articles on past publicly traded corporations would be interesting for everyone who pulls an old stock certificate from a trunk in the attic and starts to wonder what ever happened to that company.
Ec
2008/6/3 Ray Saintonge saintonge@telus.net:
Articles on past publicly traded corporations would be interesting for everyone who pulls an old stock certificate from a trunk in the attic and starts to wonder what ever happened to that company.
However, in many cases there will not be enough info to write an NPOV article. For example, an article on, say, the Salisbury and Southampton Canal Company that simply listed its dates of founding and collapse would be somewhat misleading, since it might lead the reader to think that at some point the company was at least doing okay.
Now, in practice most publicly listed companies will have generated some news coverage, so it is possible to write an NPOV article on them, but a data dump from Companies House isn't the way to do it.
geni wrote:
2008/6/3 Ray Saintonge saintonge@telus.net:
Articles on past publicly traded corporations would be interesting for everyone who pulls an old stock certificate from a trunk in the attic and starts to wonder what ever happened to that company.
However, in many cases there will not be enough info to write an NPOV article. For example, an article on, say, the Salisbury and Southampton Canal Company that simply listed its dates of founding and collapse would be somewhat misleading, since it might lead the reader to think that at some point the company was at least doing okay.
Now, in practice most publicly listed companies will have generated some news coverage, so it is possible to write an NPOV article on them, but a data dump from Companies House isn't the way to do it.
There's an amazing amount of information available if one knows where to look. I recently encountered a series of "Moody's Industrials" from the 1930s and 1940s available cheap on eBay. I didn't bid on them since I already have more than enough potential projects on other topics, but it's the kind of thing that would be a valuable reference for an interested person.
The relationship between lists and articles for each list element should be a matter of common sense, and it should be viewed as expansive rather than contractive. If everything that can be said about a subject fits in one line of a table alongside similar items, that is a good basis. We begin with that, and if any additional information is available there is a basis for an article. Contracting into a list is never a good idea if it means a loss of information.
Ec
David Gerard wrote:
The question that springs to mind is: what else can we get complete data on for bot-assisted article creation? Every state-level or higher politician in every country ever? What else?
From various data sources, mostly high quality, we could probably put together over a million new bot-generated articles on living species. However the current most common approach is to add them manually, attempting to flesh out the articles at least minimally as they're being added. This lets redlinks partly be used as a TODO list, instead of having to maintain a separate list of "articles that were added by a bot but still need to be expanded by real people". That could be done with a hidden category, though.
Starting with just a few special-purpose data sources: FishBase includes 30,000 or so species of fish; the Blattodea database includes 4,560 cockroaches; Antbase includes 10k+ ants; Avibase includes 10k species and 22k subspecies; etc.
It's not clear to me that importing of that sort would be an improvement over our current process, though. We're adding new species coverage at a fairly significant rate as it is, and the current loose arrangements are somewhat manageable.
It's even less clear to me that automatically adding articles on politicians would be useful, unless you can get at least *some* minimal data on what they did, as opposed to just a listing of office/birth/death. The latter could be useful in creating a list article, like [[List of Governors of SomeState]], but it wouldn't be particularly useful in creating articles on the individual people.
-Mark
On Sun, Jun 1, 2008 at 4:47 PM, David Gerard dgerard@gmail.com wrote:
2008/6/1 Fayssal F. szvest@gmail.com:
One major concern... The next generation would believe that Botpedia started on 2001 and that Nupedia and Wikipedia lasted no more than a few days.
heh.
I think this bot-assisted programme of article creation is a Good Thing for topics where we do in fact have the data. Rambot's 2003 creation of 30,000 US placenames meant Wikipedia could claim *completeness* on the topic. (That's what the "encyclo-" prefix of "encyclopedia" means.) There is no reason not to bring similar completeness to our coverage of the rest of the world. It'll certainly help alleviate our systemic bias.
The issues I can see are editorial - the Rambot articles are data put in prose form that these days we'd do with a parameterised template, etc - but Fritzpoll seems quite aware of this and the programme appears to include considerable human review. Good.
I agree. It might be worth the effort to add placeholders (e.g., HTML comments) in the wikitext in case information that is missing at article creation becomes available in machine-readable form later on. The articles could then be updated with little effort. Yay hackish database! ;-)
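The placeholder idea might look something like the sketch below. The marker format and helper names are invented for illustration; the point is just that a machine-findable comment marks each missing field, so a later bot pass can fill it in without parsing prose.

```python
def with_placeholders(data, fields):
    """Emit template parameters, leaving an HTML comment for each
    field the bot couldn't fill at article-creation time."""
    out = []
    for f in fields:
        if data.get(f) is not None:
            out.append(f"| {f} = {data[f]}")
        else:
            out.append(f"| {f} = <!-- BOT-TODO:{f} -->")
    return "\n".join(out)

def fill_placeholder(wikitext, field, value):
    """Later pass: replace a placeholder once the data turns up."""
    return wikitext.replace(f"<!-- BOT-TODO:{field} -->", str(value))

text = with_placeholders({"name": "Kano", "population": None},
                         ["name", "population"])
text = fill_placeholder(text, "population", 2828861)
print(text)
```

Since HTML comments are invisible to readers, the stubs look clean in the meantime and the markers cost nothing.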
Also, does this bot try to suggest photos of the place in question during article creation? That might be neat.
The question that springs to mind is: what else can we get complete data on for bot-assisted article creation? Every state-level or higher politician in every country ever? What else?
Scientific data springs to mind. Proteins. Minerals. Small molecules. And the redirects for all the "trivial" names. Now, if we only had a SMILES extension...
The rfam project [1] has put all their RNAs on Wikipedia for community annotation. There's a nice wikipedia-academia collaboration success story!
Species. Where's WikiSpecies? How's EOL [2] doing?
Astronomical objects. Oh wait, we *do* have all these space rocks covered, at least in lists...
Magnus
[1] http://en.wikipedia.org/wiki/Rfam [2] http://www.eol.org/
On Mon, 2008-06-02 at 14:20 +0100, Magnus Manske wrote:
Species. Where's WikiSpecies? How's EOL [2] doing?
Probably better than WikiSpecies, but that's not difficult at the moment.
KTC