http://commons.wikimedia.org/wiki/Image:Enwikipedia_articles_bios_pct_200710...
Discuss amongst yourselves.
On 16/10/2007, Gregory Maxwell gmaxwell@gmail.com wrote:
http://commons.wikimedia.org/wiki/Image:Enwikipedia_articles_bios_pct_200710...
So what did you do for numbers before [[Category:Living people]] was created in late 2005?
- d.
On 16/10/2007, David Gerard dgerard@gmail.com wrote:
On 16/10/2007, Gregory Maxwell gmaxwell@gmail.com wrote:
http://commons.wikimedia.org/wiki/Image:Enwikipedia_articles_bios_pct_200710...
So what did you do for numbers before [[Category:Living people]] was created in late 2005?
See http://commons.wikimedia.org/wiki/Image_talk:Enwikipedia_articles_bios_20071...
It looks like he took a current snapshot of the category and looked at their histories to find creation dates. That method misses any articles which have been deleted, but I don't think that's a serious issue (they were probably deleted for a good reason, so including them would artificially inflate past figures).
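For anyone who wants to replay that method on their own data, here is a minimal sketch. It assumes you already have two lists of creation dates, one for pages currently in the category and one for all articles. The function name and the sample dates are made up for illustration; this is not the script behind the graph.

```python
from bisect import bisect_right
from datetime import date

def share_over_time(bio_created, all_created, checkpoints):
    """For each checkpoint date, return (date, bios, all, percent), counting
    only articles whose creation date is on or before that checkpoint."""
    bios = sorted(bio_created)
    everything = sorted(all_created)
    result = []
    for day in checkpoints:
        n_bios = bisect_right(bios, day)       # bios created on or before `day`
        n_all = bisect_right(everything, day)  # all articles created on or before `day`
        pct = 100.0 * n_bios / n_all if n_all else 0.0
        result.append((day, n_bios, n_all, pct))
    return result

# Made-up creation dates, purely for illustration.
all_created = [date(2002, 1, 5), date(2002, 3, 1), date(2002, 6, 9), date(2003, 1, 2)]
bio_created = [date(2002, 3, 1), date(2003, 1, 2)]  # the subset currently in the category
for row in share_over_time(bio_created, all_created, [date(2002, 12, 31), date(2003, 12, 31)]):
    print(row)
```

Deleted articles never appear in either list, which is the limitation noted above.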
On 10/16/07, David Gerard dgerard@gmail.com wrote:
On 16/10/2007, Gregory Maxwell gmaxwell@gmail.com wrote: http://commons.wikimedia.org/wiki/Image:Enwikipedia_articles_bios_pct_200710...
So what did you do for numbers before [[Category:Living people]] was created in late 2005?
Rejoice for the impossible super-intelligence of hindsight: If an article is *currently* a living person bio it *always was*. The graph doesn't include the last few weeks of data because tagging needs time to handle new creations.
Obviously this doesn't allow us to count biographies which have since been deleted... but the hundreds of "Mike is Gay!" bios deleted every day(?) just moments after creation would probably add enough noise to make the graph unreadable. :)
As such, these graphs provide insight into the composition of Wikipedia somewhat more so than into the editing practices of the users.
Can someone please explain why Rambot appears to have sparked two years of exponential growth of *biographies*? :)
I don't recall seeing any appeals to write bios to offset the rambot geostubs.
On 10/16/07, Gregory Maxwell gmaxwell@gmail.com wrote:
On 10/16/07, David Gerard dgerard@gmail.com wrote:
On 16/10/2007, Gregory Maxwell gmaxwell@gmail.com wrote: http://commons.wikimedia.org/wiki/Image:Enwikipedia_articles_bios_pct_200710...
So what did you do for numbers before [[Category:Living people]] was created in late 2005?
Rejoice for the impossible super-intelligence of hindsight: If an article is *currently* a living person bio it *always was*.
The converse, however, is not true. [[Category:2007 deaths]] has over 2000 entries. [[Category:2006 deaths]] has over 3000 entries. Combined that only makes up for a quarter of a percent of all articles, but that's probably enough to put the drop at the end of the curve into question.
On 17/10/2007, Anthony wikimail@inbox.org wrote:
On 10/16/07, Gregory Maxwell gmaxwell@gmail.com wrote:
On 10/16/07, David Gerard dgerard@gmail.com wrote:
On 16/10/2007, Gregory Maxwell gmaxwell@gmail.com wrote: http://commons.wikimedia.org/wiki/Image:Enwikipedia_articles_bios_pct_200710...
So what did you do for numbers before [[Category:Living people]] was created in late 2005?
Rejoice for the impossible super-intelligence of hindsight: If an article is *currently* a living person bio it *always was*.
The converse, however, is not true. [[Category:2007 deaths]] has over 2000 entries. [[Category:2006 deaths]] has over 3000 entries. Combined that only makes up for a quarter of a percent of all articles, but that's probably enough to put the drop at the end of the curve into question.
Very good point. This would be an interesting statistic: What %age of articles on recently deceased people were started before they died? Anyone with a suitable bot care to calculate it? (Go through the 200* deaths categories, parse the article for the date of death [shouldn't be too hard in most cases], compare with the history page, repeat.)
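The comparison step of that bot is the easy part and can be sketched as below. The sketch assumes the creation date (from the history page) and the date of death (parsed from the article) have already been collected as pairs. The function name and the sample records are hypothetical, and the fetching and parsing are deliberately left out.

```python
from datetime import date

def share_created_before_death(records):
    """records: iterable of (article_created, date_of_death) pairs.
    Returns the fraction of articles that were started before the subject died."""
    records = list(records)
    if not records:
        raise ValueError("no records")
    before = sum(1 for created, died in records if created < died)
    return before / len(records)

# Invented examples, not real articles.
sample = [
    (date(2004, 5, 1), date(2007, 2, 10)),   # written while the subject was alive
    (date(2007, 2, 12), date(2007, 2, 10)),  # written after the death
    (date(2006, 1, 3), date(2006, 9, 30)),
    (date(2006, 10, 2), date(2006, 9, 30)),
]
print(share_created_before_death(sample))
```

Articles where no date of death can be parsed would have to be skipped or checked by hand, so the result is only as good as the parsing.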
On 17/10/2007, Anthony wikimail@inbox.org wrote:
On 10/16/07, Gregory Maxwell gmaxwell@gmail.com wrote:
On 10/16/07, David Gerard dgerard@gmail.com wrote:
On 16/10/2007, Gregory Maxwell gmaxwell@gmail.com wrote: http://commons.wikimedia.org/wiki/Image:Enwikipedia_articles_bios_pct_200710...
So what did you do for numbers before [[Category:Living people]] was created in late 2005?
Rejoice for the impossible super-intelligence of hindsight: If an article is *currently* a living person bio it *always was*.
The converse, however, is not true. [[Category:2007 deaths]] has over 2000 entries. [[Category:2006 deaths]] has over 3000 entries. Combined that only makes up for a quarter of a percent of all articles, but that's probably enough to put the drop at the end of the curve into question.
I was thinking about this last night. If we have a roughly even distribution of people, we'd expect something like ~1.5% of our living biographies to "die off" every year. This shouldn't noticeably affect the graph, though; it will mean that the proportion in the past was slightly higher than we indicate, and the growth curve thus slightly shallower, but the broad results still stand.
On the other hand, simply going by the "2--- deaths" categories won't help - a lot of these are people who we wrote about from their obituaries, and so the article was created after their death. Very close to being a living person, but not quite...
(Obituaries are a surprisingly good tool, in some regards - not just for content, but as indicators. If someone has an obituary appearing in multiple national newspapers, it's a good indication they have or had some degree of significance...)
The recent drop (as opposed to the slowing of the graph) is, I would argue, just an indicator of the fact that marking as a living/nonliving person is often done by someone other than the original author (it's an easy thing to forget), so new articles can take some time to be incorporated into the category.
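To put a rough size on the die-off effect mentioned above, here is a back-of-the-envelope sketch. It assumes a constant 1.5% of living subjects die each year and that every one of them is promptly moved out of the category; both are simplifications, and the function name is made up.

```python
def corrected_past_count(observed, years_ago, annual_death_rate=0.015):
    """Estimate how many living-person bios existed `years_ago` years back,
    given how many of them are still tagged as living today (`observed`)."""
    surviving_fraction = (1.0 - annual_death_rate) ** years_ago
    return observed / surviving_fraction

# Hypothetical input: 10,000 surviving bios that already existed five years ago.
print(corrected_past_count(10_000, 5))
```

Under those assumptions the five-year-old figure comes out roughly 8% higher than the count derived from today's category, which is consistent with "slightly higher than we indicate".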
On 10/17/07, Andrew Gray shimgray@gmail.com wrote:
The recent drop (as opposed to the slowing of the graph) is, I would argue, just an indicator of the fact that marking as a living/nonliving person is often done by someone other than the original author (it's an easy thing to forget), so new articles can take some time to be incorporated into the category.
It's quite possibly true... I guess we'll know when I regenerate the graphs in a couple of months.
I trimmed off the most recent data on the theory that it might not be valid, but the length of my trim was just a guess, based on the fact that other, more easily measured processes (image tagging, article deletion, etc.) show very strong front-loading that settles within a week or so.
On 10/16/07, Gregory Maxwell gmaxwell@gmail.com wrote:
Can someone please explain why Rambot appears to have sparked two years of exponential growth of *biographies*? :)
I don't recall seeing any appeals to write bios to offset the rambot geostubs.
I have two hypotheses.
The first is that geostubs tend to cry out for bios, usually politicians of the mayor/state legislator/city councilor varieties. This can even imply exponential growth in proportion to the original geostubs, as the positions turn over but the places stay the same.
The second hypothesis is that Rambot introduced a culture of comprehensiveness into the editor pool; now that we had an article on *every* place in the US, we now needed to have articles on *every* Nobel Prize winner, *every* current member of the national legislature, every *former* member of the national legislature, and so on. I would even go so far as to posit that some of our slowdown has been because lists of tenured faculty at Ivy League universities are harder to come by and less compelling than lists of politicians who appear in your local newspaper every day.
On 17/10/2007, Michael Noda michael.noda@gmail.com wrote:
The second hypothesis is that Rambot introduced a culture of comprehensiveness into the editor pool; now that we had an article on *every* place in the US, we now needed to have articles on *every* Nobel Prize winner, *every* current member of the national legislature, every *former* member of the national legislature, and so on.
Give that man a cookie. The lure of completeness is very good for getting the final 25% of anything written; the more complete sets we have to point to as precedents, the more likely any given grouping is going to be defined as "completable".
And things like biographies, which are human-written and individual by necessity, are a bigger lure than manually writing stubs on French communes, so these are the kind of things that people focus on as their sets to work on.
I like it.
On 10/17/07, Michael Noda michael.noda@gmail.com wrote:
On 10/16/07, Gregory Maxwell gmaxwell@gmail.com wrote:
Can someone please explain why Rambot appears to have sparked two years of exponential growth of *biographies*? :)
I don't recall seeing any appeals to write bios to offset the rambot geostubs.
I have two hypotheses.
The first is that geostubs tend to cry out for bios, usually politicians of the mayor/state legislator/city councilor varieties. This can even imply exponential growth in proportion to the original geostubs, as the positions turn over but the places stay the same.
The second hypothesis is that Rambot introduced a culture of comprehensiveness into the editor pool; now that we had an article on *every* place in the US, we now needed to have articles on *every* Nobel Prize winner, *every* current member of the national legislature, every *former* member of the national legislature, and so on. I would even go so far as to posit that some of our slowdown has been because lists of tenured faculty at Ivy League universities are harder to come by and less compelling than lists of politicians who appear in your local newspaper every day.
Do we even have any evidence that Rambot in fact did affect the growth of biographies? It affected the *percentage* of biographies, but that could be because it affected the growth of biographies *or* because it affected the growth of non-biographies. Trivially, it certainly caused a massive spike in non-biographies by introducing lots of non-biographies. And presumably in the immediate aftermath it caused a reduction in the number of new non-biographies, because suddenly all US cities no longer needed creation. There might be more to it than that, but a graph of the *percentage* of biographies doesn't provide the information to answer it.
On 10/17/07, Anthony wikimail@inbox.org wrote:
Do we even have any evidence that Rambot in fact did affect the growth of biographies?
http://commons.wikimedia.org/wiki/Image:Enwikipedia_articles_bios_200710.svg
And there's the answer. No. Rambot didn't affect the growth of biographies much at all. There's a spike in early 2002 (can someone check out what that is?), but the graph of biographies is basically unaffected by rambot (late 2002, right?).
A non-exponential graph focusing on the months right around the rambot push would make this more clear. Are the raw numbers available somewhere?
On 10/17/07, Anthony wikimail@inbox.org wrote: http://commons.wikimedia.org/wiki/Image:Enwikipedia_articles_bios_200710.svg
And there's the answer. No. Rambot didn't affect the growth of biographies much at all. There's a spike in early 2002 (can someone check out what that is?), but the graph of biographies is basically unaffected by rambot (late 2002, right?).
That's not true, you just can't see it on that scale. The rate of new bio creation changed from an average of 4/day around 2002-07 to 15/day in 2003-02 and the rate has continued climbing generally faster than the rate of new article creation has climbed since.
The early spike is almost certainly the conversion script artifact.
Perhaps rambot has nothing to do with it, but the bio creation behavior did change around that time.
You need to look at smoothed data because there is a huge weekly cycle in all WP data. ;)
A non-exponential graph focusing on the months right around the rambot push would make this more clear. Are the raw numbers available somewhere?
Linked directly from the image page.
On 10/17/07, Gregory Maxwell gmaxwell@gmail.com wrote:
On 10/17/07, Anthony wikimail@inbox.org wrote: http://commons.wikimedia.org/wiki/Image:Enwikipedia_articles_bios_200710.svg
And there's the answer. No. Rambot didn't affect the growth of biographies much at all. There's a spike in early 2002 (can someone check out what that is?), but the graph of biographies is basically unaffected by rambot (late 2002, right?).
That's not true, you just can't see it on that scale. The rate of new bio creation changed from an average of 4/day around 2002-07 to 15/day in 2003-02
July 2002 to February 2003 is a broad range, and July 2002 was during a period of lots of server downtime.
and the rate has continued climbing generally faster than the rate of new article creation has climbed since.
Well, yes, that's indisputable since the *percentage* has been rising.
The early spike is almost certainly the conversion script artifact.
Yes, that's it.
Perhaps rambot has nothing to do with it, but the bio creation behavior did change around that time.
http://spreadsheets.google.com/pub?key=p-pyYERq1P4N0GKZ6EvRPSw
It seems to me that bio creation increased *before* rambot, which was mid-October 2002. The increase happened in August/September 2002 (first time over 100/week on that graph was week ending 9/26/2002). Next big increase was July 2003 (last time under 100 on that graph was week ending 6/19/2003). I don't see any effect by rambot at all (week ending 10/24/2002 was a ho-hum 81 new bios).
What happened in August/September 2002? Well, July 21, 2002 brought new software on new servers. August 10, 2002 "David A. Wheeler...released html2wikipedia, a tool that translates HTML into Wikipedia's Wiki format." August 15, 2002 "We are now at www.wikipedia.org instead of www.wikipedia.com."
September 21, 2002 "The much-belated import of pre-January 2002 article edit histories from the old software has been done at last!"
October 18-26, 2002: "The so called rambot completed its mass entry of approximately 30,000 articles on U.S. cities. The process which began on October 18, took over a week to finish. It caused lots of discussion and problems with cluttering up the Recent Changes."
http://en.wikipedia.org/wiki/Wikipedia:Announcements_2002
January 22, 2003: Slashdot (this is the spike in the middle of the graph, 138 new bios that week)
You need to look at smoothed data because there is a huge weekly cycle in all WP data. ;)
Yeah, I decided to just go with weekly data.
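Binning by week is one simple way around the weekly cycle. A minimal sketch, assuming a plain list of creation dates as input, with weeks labelled by the Thursday they end on so that they line up with the "week ending" dates quoted above (9/26/2002 was a Thursday). The function name is made up, and this is not the spreadsheet's actual recipe.

```python
from collections import Counter
from datetime import date, timedelta

def weekly_totals(creation_dates, week_ending_weekday=3):
    """Count new articles per week, keyed by the date each week ends on.
    week_ending_weekday uses date.weekday() numbering (Monday=0, Thursday=3)."""
    totals = Counter()
    for d in creation_dates:
        # Days from `d` forward to the next week-ending day (0 if `d` is that day).
        days_until_end = (week_ending_weekday - d.weekday()) % 7
        totals[d + timedelta(days=days_until_end)] += 1
    return dict(sorted(totals.items()))

# Invented creation dates: two fall in the week ending 2002-09-26, one in the next.
print(weekly_totals([date(2002, 9, 20), date(2002, 9, 26), date(2002, 9, 27)]))
```

A 7-day moving average over daily counts would serve the same purpose if finer resolution is wanted.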
On 16/10/2007, Gregory Maxwell gmaxwell@gmail.com wrote:
http://commons.wikimedia.org/wiki/Image:Enwikipedia_articles_bios_pct_200710...
Discuss amongst yourselves.
Bio stubs are among the easiest stubs to write, so they were the low hanging fruit that everyone picked. Now most of the easy/obvious ones have been written (as evidenced by the recent statistics on the Nobel Prize winners), other types of articles are catching up.
One more step along the road to maturity? (I prefer maturity to completion as a measure - it's vaguely achievable, if nothing else.)
On 10/16/07, Gregory Maxwell gmaxwell@gmail.com wrote:
http://commons.wikimedia.org/wiki/Image:Enwikipedia_articles_bios_pct_200710...
Discuss amongst yourselves.
So to make sure I understand properly, this is saying that currently, roughly 11% of all articles in the English Wikipedia are biographies of living people?
If so, yikes.
I'm curious how much of the article count biographies in general (of either living or dead people) make up as well. I guess you'd have to follow the category tree of the (much less well known) [[Category:Dead_people]] to figure that out and add the two up ... the upper-level people categories seem a bit disorganized.
Re: Rambot -- it's a self-fulfilling prophecy :) if you have articles about places, then clearly you need articles about people who live in those places... right?
-- phoebe
So to make sure I understand properly, this is saying that currently, roughly 11% of all articles in the English Wikipedia are biographies of living people?
If so, yikes.
Sounds about right to me. Not sure if it's a good thing or not, but I don't think it's a terrible thing. Good job it's stopped increasing, though... much more and it would be a problem.
On 10/16/07, phoebe ayers phoebe.wiki@gmail.com wrote:
So to make sure I understand properly, this is saying that currently, roughly 11% of all articles in the English Wikipedia are biographies of living people?
Yes: 11% of non-redirect article pages are tagged with Category:Living people.
Since not all non-redirect pages are articles (90k disambigs, etc.) but all the Living people pages should be articles, the actual concentration is somewhat greater than 11%.
If so, yikes.
I'm curious how much of the article count biographies in general (of either living or dead people) make up as well. I guess you'd have to follow the category tree of the (much less well known) [[Category:Dead_people]] to figure that out and add the two up ... the upper-level people categories seem a bit disorganized.
I'm curious about this too, but ideas that include the words "follow the category tree" are generally complete non-starters if you care about remotely sane results.
I really wanted to break all of Wikipedia down into a dozen or so top level categories so I can make a stacked line graph showing the composition over time... but I've found no way of breaking up the articles using automated category analysis that doesn't produce utterly rubbish results.
I haven't looked specifically at doing that to identify dead people articles... and I will... but I do not have high hopes. My past experience suggests that the results will be nearly useless.
Re: Rambot -- it's a self-fulfilling prophecy :) if you have articles about places, then clearly you need articles about people who live in those places... right?
I'm sure that this is a sub-subject worthy of a research paper on its own. Some kind of spontaneous symmetry breaking? "What you lack is what you get" becomes "What you're getting you get more of" which becomes "What you've got you get more of"? ;)
Gregory Maxwell wrote:
On 10/16/07, phoebe ayers phoebe.wiki@gmail.com wrote:
If so, yikes.
I'm curious how much of the article count biographies in general (of either living or dead people) make up as well. I guess you'd have to follow the category tree of the (much less well known) [[Category:Dead_people]] to figure that out and add the two up ... the upper-level people categories seem a bit disorganized.
I'm curious about this too, but ideas that include the words "follow the category tree" are generally complete non-starters if you care about remotely sane results.
I really wanted to break all of Wikipedia down into a dozen or so top level categories so I can make a stacked line graph showing the composition over time... but I've found no way of breaking up the articles using automated category analysis that doesn't produce utterly rubbish results.
I haven't looked specifically at doing that to identify dead people articles... and I will... but I do not have high hopes. My past experience suggests that the results will be nearly useless.
The interesting graph could be based on the distribution in the category series "yyyy deaths".
Re: Rambot -- it's a self-fulfilling prophecy :) if you have articles about places, then clearly you need articles about people who live in those places... right?
I'm sure that this is a sub-subject worthy of a research paper on its own. Some kind of spontaneous symmetry breaking? "What you lack is what you get" becomes "What you're getting you get more of" which becomes "What you've got you get more of"? ;)
My impression would be that Rambot had more to do with the 2002 dip in the graph. There were complaints long after that all these small towns were overwhelming "random article" selection. If the biography articles represent the low-hanging fruit then Rambot was picking them up off the ground.
Ec
On 10/16/07, phoebe ayers phoebe.wiki@gmail.com wrote:
On 10/16/07, Gregory Maxwell gmaxwell@gmail.com wrote:
http://commons.wikimedia.org/wiki/Image:Enwikipedia_articles_bios_pct_200710...
Discuss amongst yourselves.
So to make sure I understand properly, this is saying that currently, roughly 11% of all articles in the English Wikipedia are biographies of living people?
If so, yikes.
http://en.wikipedia.org/wiki/Special:Mostlinkedcategories
Several of the largest categories are devoted to people, which immediately makes clear they are a large fraction of all articles.
"Category:Biography articles without listas parameter" has 300,000 members, so presumably there are at least 80,000 articles about people not currently presumed to be living.
Or put another way, we have slightly more than 2 biographies of living people for every 3 non-free images.
-Robert Rohde
On 10/16/07, Robert Rohde rarohde@gmail.com wrote:
On 10/16/07, phoebe ayers phoebe.wiki@gmail.com wrote:
On 10/16/07, Gregory Maxwell gmaxwell@gmail.com wrote:
http://commons.wikimedia.org/wiki/Image:Enwikipedia_articles_bios_pct_200710...
Discuss amongst yourselves.
So to make sure I understand properly, this is saying that currently, roughly 11% of all articles in the English Wikipedia are biographies of living people?
If so, yikes.
http://en.wikipedia.org/wiki/Special:Mostlinkedcategories
Several of the largest categories are devoted to people, which immediately makes clear they are a large fraction of all articles.
"Category:Biography articles without listas parameter" has 300,000 members, so presumably there are at least 80,000 articles about people not currently presumed to be living.
Or put another way, we have slightly more than 2 biographies of living people for every 3 non-free images.
Oh wait, there is also a "Category:Biography articles with listas parameter" with 140,000 members. If the with/without "listas" are assumed to be a complete set then there are 445,000 biography articles, which would give roughly equal numbers of living and dead biographies. So ~22% of Wikipedia is biographies.
Oh wait, there is also a "Category:Biography articles with listas parameter" with 140,000 members. If the with/without "listas" are assumed to be a complete set then there are 445,000 biography articles, which would give roughly equal numbers of living and dead biographies. So ~22% of Wikipedia is biographies.
Sounds about right to me.
Conclusive proof: I just clicked Random Page 10 times and got 2 biographies (didn't check if they were alive or not).
On 10/16/07, Robert Rohde rarohde@gmail.com wrote:
"Category:Biography articles without listas parameter" has 300,000 members, so presumably there are at least 80,000 articles about people not currently presumed to be living.
Or put another way, we have slightly more than 2 biographies of living people for every 3 non-free images.
Oh wait, there is also a "Category:Biography articles with listas parameter" with 140,000 members. If the with/without "listas" are assumed to be a complete set then there are 445,000 biography articles, which would give roughly equal numbers of living and dead biographies. So ~22% of Wikipedia is biographies.
I was excited for a moment ... I thought you'd found a way to identify the rest of them.. but alas: A quick glance shows that those categories have been applied to enormous numbers of articles which are very clearly not biographies... Mostly by overzealous bot operators who think they can trust the category hierarchy and who do not check their work. :(
I selected ten pages at random from it and got an equal mix of bands and albums. No bios.
Also, it's applied to talk pages which makes counting based on it more computationally expensive.. alas..
On 10/16/07, phoebe ayers phoebe.wiki@gmail.com wrote:
I'm curious how much of the article count biographies in general (of either living or dead people) make up as well.
22 months ago, it was around 25%. Six months ago, it was around 30%. Margin of error on both numbers is unknown, but I believe it to be about +/- 5%.
On 10/16/07, Mark Wagner carnildo@gmail.com wrote:
On 10/16/07, phoebe ayers phoebe.wiki@gmail.com wrote:
I'm curious how much of the article count biographies in general (of either living or dead people) make up as well.
22 months ago, it was around 25%. Six months ago, it was around 30%. Margin of error on both numbers is unknown, but I believe it to be about +/- 5%.
Out of curiosity, what's the source for this? (since we were just trying to figure out how to calculate it).
Some number in between 20 and 30 percent does feel about right, for what it's worth (not much).
-- phoebe
On 10/16/07, phoebe ayers phoebe.wiki@gmail.com wrote:
On 10/16/07, Mark Wagner carnildo@gmail.com wrote:
On 10/16/07, phoebe ayers phoebe.wiki@gmail.com wrote:
I'm curious how much of the article count biographies in general (of either living or dead people) make up as well.
22 months ago, it was around 25%. Six months ago, it was around 30%. Margin of error on both numbers is unknown, but I believe it to be about +/- 5%.
Out of curiosity, what's the source for this? (since we were just trying to figure out how to calculate it).
Some number in between 20 and 30 percent does feel about right, for what it's worth (not much).
Statistical sample of about 400 random articles, combined with my personal judgement of what constitutes a "biography of a person".
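As a sanity check on that margin of error, the usual normal-approximation formula for a sampled proportion can be sketched as follows. It covers sampling error only (not the judgement call about what counts as a biography), the 95% z-value of 1.96 is an assumption about the confidence level intended, and the function name is made up.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Half-width of the normal-approximation confidence interval for a
    proportion p estimated from n independent random samples."""
    return z * math.sqrt(p * (1.0 - p) / n)

# About 30% biographies in a sample of 400 random articles.
print(round(100 * margin_of_error(0.30, 400), 1))  # 4.5 (percentage points)

# Two biographies out of ten clicks on Random Page.
print(round(100 * margin_of_error(0.20, 10), 1))   # 24.8 (percentage points)
```

So +/- 5% is about right for 400 articles, while ten random pages pin the figure down to within roughly 25 points at best (and the approximation itself is shaky at that sample size).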
On 10/16/07, Gregory Maxwell gmaxwell@gmail.com wrote:
http://commons.wikimedia.org/wiki/Image:Enwikipedia_articles_bios_pct_200710...
Discuss amongst yourselves.
Thank goodness it seems to have peaked, definitely interested in the next couple months though... I'm a little surprised the percentage is so high, but I don't think it's bad really. Biographies of many living people are useful. Who's to say what the right percentage is. As long as it's somewhat stable of course :)
On 10/16/07, cohesion cohesion@sleepyhead.org wrote:
On 10/16/07, Gregory Maxwell gmaxwell@gmail.com wrote:
http://commons.wikimedia.org/wiki/Image:Enwikipedia_articles_bios_pct_200710...
Discuss amongst yourselves.
Thank goodness it seems to have peaked, definitely interested in the next couple months though... I'm a little surprised the percentage is so high, but I don't think it's bad really. Biographies of many living people are useful. Who's to say what the right percentage is. As long as it's somewhat stable of course :)
Indeed, that's the right question to ask: What percentage should it be? What's the percentage in other encyclopaedias?
Presumably, in a complete Wikipedia, the percentage would be much lower (I believe the current estimates are that ~5% of all humans are currently alive, and I'd guess our existing biographies are more about alive people than that). But how does it compare to other encyclopaedias?
Cheers WilyD
Presumably, in a complete Wikipedia, the percentage would be much lower (I believe the current estimates are that ~5% of all humans are currently alive, and I'd guess our existing biographies are more about alive people than that). But how does it compare to other encyclopaedias?
That's a pretty irrelevant number. The %age of (known) notable people alive today is much higher. How many people who lived over 1000 years ago do we know enough about to write an article on (if we still know about them, they're notable enough for me, so it's just a matter of them still being known)? A few hundred, maybe? (most will be rulers of the various countries that have records going back that far, I guess) The number increases as we get nearer to present day, but the proportion of living people that are notable is far higher than the proportion of dead people that are notable (that we know of).
On 10/16/07, Thomas Dalton thomas.dalton@gmail.com wrote:
The number increases as we get nearer to present day, but the proportion of living people that are notable is far higher than the proportion of dead people that are notable (that we know of).
Here comes the notability paradox again. "Notable" supposedly means "worthy of being noted", but in practice the meaning may be closer to "worthy of notations which still exist, preferably on the internet, preferably without an access fee, and preferably in English", which generally excludes the distant back issues of most periodical publications, and all of the defunct ones, or books which were burned prior to mass reproduction, or personal journals which fell into the drink, or anything that could have been written if only the would-be author was literate or even part of a time and culture where a written language existed.
We have to accept the unpleasant reality that there are a lot of dead people, ones whom we will never hear of, ever, whose lives — by virtue of being remarkably more interesting than their contemporaries — were certainly "worthy of note", but for whom the notes failed to preserve themselves, (or in some cases, even materialize), for whatever reason.
Not much we can do about that except keep looking.
Meanwhile, "verifiability" is the one objective and practical criterion for inclusion, and information about living people is exponentially easier to verify.
—C.W.
On 17/10/2007, Charlotte Webb charlottethewebb@gmail.com wrote:
On 10/16/07, Thomas Dalton thomas.dalton@gmail.com wrote:
The number increases as we get nearer to present day, but the proportion of living people that are notable is far higher than the proportion of dead people that are notable (that we know of).
Here comes the notability paradox again. "Notable" supposedly means "worthy of being noted", but in practice the meaning may be closer to "worthy of notations which still exist, preferably on the internet, preferably without an access fee, and preferably in English", which generally excludes the distant back issues of most periodical publications, and all of the defunct ones, or books which were burned prior to mass reproduction, or personal journals which fell into the drink, or anything that could have been written if only the would-be author was literate or even part of a time and culture where a written language existed.
That's why I was careful to say "that we know of". Clearly, we can't write articles about people we don't know about.
On 10/16/07, Wily D wilydoppelganger@gmail.com wrote:
On 10/16/07, cohesion cohesion@sleepyhead.org wrote:
On 10/16/07, Gregory Maxwell gmaxwell@gmail.com wrote:
http://commons.wikimedia.org/wiki/Image:Enwikipedia_articles_bios_pct_200710...
Discuss amongst yourselves.
Thank goodness it seems to have peaked, definitely interested in the next couple months though... I'm a little surprised the percentage is so high, but I don't think it's bad really. Biographies of many living people are useful. Who's to say what the right percentage is. As long as it's somewhat stable of course :)
Indeed, that's the right question to ask: What percentage should it be? What's the percentage in other encyclopaedias?
Presumably, in a complete Wikipedia, the percentage would be much lower (I believe the current estimates are that ~5% of all humans are currently alive, and I'd guess our existing biographies are more about alive people than that). But how does it compare to other encyclopaedias?
Cheers WilyD
Here's a historical tidbit from a lovely book I'm slowly reading by Robert Collison called "Encyclopedias: their history throughout the ages" (1966) -- he claims that Johann Heinrich Zedler's "Grosses vollstandiges Universal-Lexicon", first pub. in 1731, was the first encyclopedia to include biographies of living people. Not sure, in turn, how he figured this out (extensive historical research, I think) but it's nice to know that living bios have at least as long a pedigree in the modern encyclopedia as philosophical articles (e.g. the "Encyclopedie", first published in 1751) and technical/practical articles (e.g. Chambers' "Cyclopedia", first published in 1728).
Incidentally, if any of you are encyclopedia fans and you find an inexpensive copy of Collison's book, buy it -- it's out of print and difficult to find.
-- phoebe
phoebe ayers wrote:
Here's a historical tidbit from a lovely book I'm slowly reading by Robert Collison called "Encyclopedias: their history throughout the ages" (1966) -- he claims that Johann Heinrich Zedler's "Grosses vollstandiges Universal-Lexicon", first pub. in 1731, was the first encyclopedia to include biographies of living people. Not sure, in turn, how he figured this out (extensive historical research, I think) but it's nice to know that living bios have at least as long a pedigree in the modern encyclopedia as philosophical articles (e.g. the "Encyclopedie", first published in 1751) and technical/practical articles (e.g. Chambers' "Cyclopedia", first published in 1728).
Incidentally, if any of you are encyclopedia fans and you find an inexpensive copy of Collison's book, buy it -- it's out of print and difficult to find.
I found 13 copies listed at Abebooks, with the cheapest being for $50.00. It is a British publication so it uses the Encyclopaedia spelling in its title.
Ec
Wily D wrote:
Indeed, that's the right question to ask: What percentage should it be? What's the percentage in other encyclopaedias?
Presumably, in a complete Wikipedia, the percentage would be much lower (I believe the current estimates are that ~5% of all humans are currently alive, and I'd guess our existing biographies are more about alive people than that). But how does it compare to other encyclopaedias?
I'd guess ours is higher, and I think it *should* be higher, mainly due to our lack of space constraints. To a first approximation, the further you go back in history, the more biased the historical record is towards only documenting the exploits of very famous people; it's only relatively recently that good information is easily available on a very broad range of moderately-notable people. So you will get a much lower percentage of living people if you have 10,000 biographies versus if you have 250,000---not because the other 240,000 aren't useful biographies to have, but just because you didn't have any room for them.
-Mark
On 10/16/07, Delirium delirium@hackish.org wrote:
I'd guess ours is higher, and I think it *should* be higher, mainly due to our lack of space constraints. To a first approximation, the further you go back in history, the more biased the historical record is towards only documenting the exploits of very famous people; it's only relatively recently that good information is easily available on a very broad range of moderately-notable people. So you will get a much lower percentage of living people if you have 10,000 biographies versus if you have 250,000---not because the other 240,000 aren't useful biographies to have, but just because you didn't have any room for them.
I'm skeptical of the NOTPAPER argument for things like this.
We are constrained. True, we are not space constrained but neither are many modern commercial reference works.
We have many types of constraints: manpower, interest, process, and others...
Whenever resources are limited there are some possible allocations of resources which are more ideal (by some metric) than others.
I see no reason why the removal of the space constraint should change the *ideal* subject matter distribution substantially.
On 10/16/07, Gregory Maxwell gmaxwell@gmail.com wrote:
We have many types of constraints: manpower, interest, process, and others...
Whenever resources are limited there are some possible allocations of resources which are more ideal (by some metric) than others.
I'm not sure volunteer labor of the Wikipedia sort is a resource that can be significantly allocated, other than by an invisible hand. I wonder if it'd be possible to prove that a [[laissez-faire]] approach to notability produces a [[Pareto efficient]] allocation of resources.
I see no reason why the removal of the space constraint should change the *ideal* subject matter distribution substantially.
Doesn't the term "ideal" essentially mean "without constraints"?
Anthony wrote:
On 10/16/07, Gregory Maxwell gmaxwell@gmail.com wrote:
I see no reason why the removal of the space constraint should change the *ideal* subject matter distribution substantially.
Doesn't the term "ideal" essentially mean "without constraints"?
No, it's more like "optimum", in which case it may or may not have constraints.
Ec
On 10/17/07, Ray Saintonge saintonge@telus.net wrote:
Anthony wrote:
On 10/16/07, Gregory Maxwell gmaxwell@gmail.com wrote:
I see no reason why the removal of the space constraint should change the *ideal* subject matter distribution substantially.
Doesn't the term "ideal" essentially mean "without constraints"?
No, it's more like "optimum", in which case it may or may not have constraints.
Whether you call it "ideal" or "optimum", I'd say the sum of all human knowledge would be the answer. Ideally, if I wanted to know anything about anything I could ask Wikipedia and it'd tell me. Practically, space would limit that, verifiability/lack of records would limit that, policy against original research would limit that, interest would limit that, ease of use would limit that, privacy considerations would limit that, etc. I doubt the distribution of topics would be preserved after those limitations.
Gregory Maxwell wrote:
On 10/16/07, Delirium delirium@hackish.org wrote:
I'd guess ours is higher, and I think it *should* be higher, mainly due to our lack of space constraints. To a first approximation, the further you go back in history, the more biased the historical record is towards only documenting the exploits of very famous people; it's only relatively recently that good information is easily available on a very broad range of moderately-notable people. So you will get a much lower percentage of living people if you have 10,000 biographies versus if you have 250,000---not because the other 240,000 aren't useful biographies to have, but just because you didn't have any room for them.
I'm skeptical of the NOTPAPER argument for things like this.
We are constrained. True, we are not space constrained but neither are many modern commercial reference works.
We have many types of constraints: manpower, interest, process, and others...
Whenever resources are limited there are some possible allocations of resources which are more ideal (by some metric) than others.
I see no reason why the removal of the space constraint should change the *ideal* subject matter distribution substantially.
My point *was* precisely that the ideal distribution of living/nonliving biographies should change quite substantially depending on the size of the work, because the ideal living/nonliving distribution varies by level of notability. If you write a small encyclopedia of say 1,000 biographies, you're covering only the top-tier of famous people, of whom many are no longer living. If you write a comprehensive one of say 250,000 biographies (or a million, or two million), you're covering many more people who are notable only in niches, or only moderately notable---not just kings and famous generals and philosophers---of whom a much larger percentage (of those about whom any information survives, anyway) are alive.
Consider an area like philosophy: If you were to pick the top 100 most influential philosophers of all time, many (most?) would be dead. But if you were to pick the top 5,000, a much larger percentage would be currently alive. We want to cover the top 5,000 (or more!), not just the top 100, because we're a broad-coverage encyclopedia. And thus we'll have a larger percentage of our philosopher bios be on living people than if we were to delete all but the top 100 most important. This doesn't, of course, harm our coverage of those top 100 in any way, which is why I think focusing on percentages is worse than useless.
-Mark
On 10/17/07, Delirium delirium@hackish.org wrote:
My point *was* precisely that the ideal distribution of living/nonliving biographies should change quite substantially depending on the size of the work
[snip]
I fear that we're venturing deep into the realm of the subjective, so I'll spare the list a long reply...
Wouldn't your position only hold if there were more living people to write about than items in other subject areas and the project became large enough to actually begin exhausting them?
Hm.
On 10/16/07, Delirium delirium@hackish.org wrote:
Wily D wrote:
Indeed, that's the right question to ask: What percentage should it be? What's the percentage in other encyclopaedias?
Presumably, in a complete Wikipedia, the percentage would be much lower (I believe the current estimates are that ~5% of all humans are currently alive, and I'd guess our existing biographies are more about alive people than that). But how does it compare to other encyclopaedias?
I'd guess ours is higher, and I think it *should* be higher, mainly due to our lack of space constraints. To a first approximation, the further you go back in history, the more biased the historical record is towards only documenting the exploits of very famous people; it's only relatively recently that good information is easily available on a very broad range of moderately-notable people. So you will get a much lower percentage of living people if you have 10,000 biographies versus if you have 250,000---not because the other 240,000 aren't useful biographies to have, but just because you didn't have any room for them.
-Mark
The sourcing issue rings true-- "reliable sources" for people who aren't alive now drop off dramatically the further back you go, especially if you're talking about English-language sources for non-English speaking individuals. Furthermore, the historical sources that are available start to be less and less accessible to the average Wikipedian (i.e. not online or widely held in libraries). Whether or not that source gets cited in an article, you do need to know *something* about the person in order to write the article in the first place -- and as Mark says we know much less about moderately famous people from a long time ago than we do now.
-- phoebe
The question is how much of an article is acceptable. A stub together with a bibliographic reference can be written for anyone who appears in a print reference book--if the position or accomplishment seems notable. deWP has many articles of this sort, but when they are translated into enWP, they are generally deleted very quickly.
On 10/16/07, phoebe ayers phoebe.wiki@gmail.com wrote:
On 10/16/07, Delirium delirium@hackish.org wrote:
Wily D wrote:
Indeed, that's the right question to ask: What percentage should it be? What's the percentage in other encyclopaedias?
Presumably, in a complete Wikipedia, the percentage would be much lower (I believe the current estimates are that ~5% of all humans are currently alive, and I'd guess our existing biographies are more about alive people than that). But how does it compare to other encyclopaedias?
I'd guess ours is higher, and I think it *should* be higher, mainly due to our lack of space constraints. To a first approximation, the further you go back in history, the more biased the historical record is towards only documenting the exploits of very famous people; it's only relatively recently that good information is easily available on a very broad range of moderately-notable people. So you will get a much lower percentage of living people if you have 10,000 biographies versus if you have 250,000---not because the other 240,000 aren't useful biographies to have, but just because you didn't have any room for them.
-Mark
The sourcing issue rings true-- "reliable sources" for people who aren't alive now drop off dramatically the further back you go, especially if you're talking about English-language sources for non-English speaking individuals. Furthermore, the historical sources that are available start to be less and less accessible to the average Wikipedian (i.e. not online or widely held in libraries). Whether or not that source gets cited in an article, you do need to know *something* about the person in order to write the article in the first place -- and as Mark says we know much less about moderately famous people from a long time ago than we do now.
-- phoebe _______________________________________________ WikiEN-l mailing list WikiEN-l@lists.wikimedia.org To unsubscribe from this mailing list, visit: http://lists.wikimedia.org/mailman/listinfo/wikien-l
David Goodman wrote:
The question is how much of an article is acceptable. A stub together with a bibliographic reference can be written for anyone who appears in a print reference book--if the position or accomplishment seems notable. deWP has many articles of this sort, but when they are translated into enWP, they are generally deleted very quickly.
I haven't had that problem myself; what sorts of articles are getting deleted? I've translated quite a few very short articles on obscure 19th-century Germans from the German Wikipedia to English and I'd be surprised if anyone proposed deleting any of them, even if many consist of a few sentences of a stub and a reference or two.
-Mark
On 10/17/07, Delirium delirium@hackish.org wrote:
David Goodman wrote:
The question is how much of an article is acceptable. A stub together with a bibliographic reference can be written for anyone who appears in a print reference book--if the position or accomplishment seems notable. deWP has many articles of this sort, but when they are translated into enWP, they are generally deleted very quickly.
I haven't had that problem myself; what sorts of articles are getting deleted? I've translated quite a few very short articles on obscure 19th-century Germans from the German Wikipedia to English and I'd be surprised if anyone proposed deleting any of them, even if many consist of a few sentences of a stub and a reference or two.
While I can't say I'd be surprised if somebody made a systematic effort to delete your work (without giving a damn what exists on other projects), I dispute the notion that a subject can possibly deserve an article in one language, but not in another. I thought the mythical goal state was for every "valid topic" to have a corresponding article in every language.
—C.W.
Articles that have been edited by Delirium are, as one would expect, adequate in content and references. Not everyone who translates an article is that careful.
On 10/18/07, Charlotte Webb charlottethewebb@gmail.com wrote:
On 10/17/07, Delirium delirium@hackish.org wrote:
David Goodman wrote:
The question is how much of an article is acceptable. A stub together with a bibliographic reference can be written for anyone who appears in a print reference book--if the position or accomplishment seems notable. deWP has many articles of this sort, but when they are translated into enWP, they are generally deleted very quickly.
I haven't had that problem myself; what sorts of articles are getting deleted? I've translated quite a few very short articles on obscure 19th-century Germans from the German Wikipedia to English and I'd be surprised if anyone proposed deleting any of them, even if many consist of a few sentences of a stub and a reference or two.
While I can't say I'd be surprised if somebody made a systematic effort to delete your work (without giving a damn what exists on other projects), I dispute the notion that a subject can possibly deserve an article in one language, but not in another. I thought the mythical goal state was for every "valid topic" to have a corresponding article in every language.
—C.W.
On 18/10/2007, Charlotte Webb charlottethewebb@gmail.com wrote:
I dispute the notion that a subject can possibly deserve an article in one language, but not in another. I thought the mythical goal state was for every "valid topic" to have a corresponding article in every language.
Hmm. "Language" and "Language edition of Wikipedia" are subtly different here. Consider, for example, the Japanese "no non-public living figures" rule...
Quoting Andrew Gray shimgray@gmail.com:
On 18/10/2007, Charlotte Webb charlottethewebb@gmail.com wrote:
I dispute the notion that a subject can possibly deserve an article in one language, but not in another. I thought the mythical goal state was for every "valid topic" to have a corresponding article in every language.
Hmm. "Language" and "Language edition of Wikipedia" are subtly different here. Consider, for example, the Japanese "no non-public living figures" rule...
Well, in practice the rule on .en is in many ways stricter. Daniel Brandt for example is not only a public figure but a willing public figure (although judging from more recent BLP-privacy AfDs, my guess is that an article of someone of about his notability would be kept as of right now).
On 10/18/07, joshua.zelinsky@yale.edu joshua.zelinsky@yale.edu wrote:
Well, in practice the rule on .en is in many ways stricter. Daniel Brandt for example is not only a public figure but a willing public figure (although judging from more recent BLP-privacy AfDs, my guess is that an article of someone of about his notability would be kept as of right now).
Quite true, but that was always the case.
—C.W.
On 10/18/07, Andrew Gray shimgray@gmail.com wrote:
On 18/10/2007, Charlotte Webb charlottethewebb@gmail.com wrote:
I dispute the notion that a subject can possibly deserve an article in one language, but not in another. I thought the mythical goal state was for every "valid topic" to have a corresponding article in every language.
Hmm. "Language" and "Language edition of Wikipedia" are subtly different here. Consider, for example, the Japanese "no non-public living figures" rule...
In practice each language edition makes its own rules. But is that the mythical goal?
On 10/18/07, Andrew Gray shimgray@gmail.com wrote:
On 18/10/2007, Charlotte Webb charlottethewebb@gmail.com wrote:
I dispute the notion that a subject can possibly deserve an article in one language, but not in another. I thought the mythical goal state was for every "valid topic" to have a corresponding article in every language.
Hmm. "Language" and "Language edition of Wikipedia" are subtly different here. Consider, for example, the Japanese "no non-public living figures" rule...
Just to clarify, I did mean to say "a corresponding Wikipedia article in every language" (as opposed to one existing on some other, less reputable project).
—C.W.
On 10/18/07, Charlotte Webb charlottethewebb@gmail.com wrote:
While I can't say I'd be surprised if somebody made a systematic effort to delete your work (without giving a damn what exists on other projects), I dispute the notion that a subject can possibly deserve an article in one language, but not in another. I thought the mythical goal state was for every "valid topic" to have a corresponding article in every language.
As it stands right now, every language has their own standards for inclusion/notability. Most are very similar, but I would be shocked if some languages didn't choose differently about borderline cases.
Does every Wikipedia agree that every Pokemon, every high school, and the mayor of every town deserve their own wikipage? Probably not.
On 10/18/07, Robert Rohde rarohde@gmail.com wrote:
As it stands right now, every language has their own standards for inclusion/notability. Most are very similar, but I would be shocked if some languages didn't choose differently about borderline cases.
Does every Wikipedia agree that every Pokemon, every high school, and the mayor of every town deserve their own wikipage? Probably not.
Ideally, "content standards" of different Wikipedia projects wouldn't differ any more than the laws governing the jurisdictions in which the servers are located. Incidentally, I'm probably too much of an idealist for my own good.
—C.W.
On 16/10/2007, Wily D wilydoppelganger@gmail.com wrote:
Indeed, that's the right question to ask: What percentage should it be? What's the percentage in other encyclopaedias?
Very often, "none". It's one of the easy editorial choices to make :-)
On 10/17/07, Andrew Gray shimgray@gmail.com wrote:
On 16/10/2007, Wily D wilydoppelganger@gmail.com wrote:
Indeed, that's the right question to ask: What percentage should it be? What's the percentage in other encyclopaedias?
Very often, "none". It's one of the easy editorial choices to make :-)
WP:BEANS, kids.
—C.W.
Hi Gregory, All,
"Gregory Maxwell" gmaxwell@gmail.com wrote:
http://commons.wikimedia.org/wiki/Image:Enwikipedia_articles_bios_pct_200710...
Discuss amongst yourselves.
Do you see any chance of getting a similar graph for the percentage of articles regarding fictional (persons, places, spaceships, ... everything)?
Regards, Peter
Peter Jacobi schreef:
Do you see any chance of getting a similar graph for the percentage of articles regarding fictional (persons, places, spaceships, ... everything)?
I've done some work on this recently, resulting in http://commons.wikimedia.org/wiki/Image:Size_of_English_Wikipedia_August_200... .
That image is in percentages of text volume (in bytes), but I also have the percentages of article numbers. Unfortunately no time series. For people, the numbers are close to those of Gregory: 10.8% living, 8.9% dead.
I've identified 7.2% of articles as a location; this is probably an underestimate. 4.2% is disambiguation; 3.4% albums and singles; 3.0% tree-of-life articles; 1.6% movies. Over 60% unclassified stuff. Suggestions for more categories *and how to recognize them* are welcome.
Technical details: these numbers are the percentages of non-redirect articles in the main namespace of articles matching one of the following [[regex]]en:
- /\[\[[Cc]ategory:[Ll]iving people(\||\]\])/
- /\[\[[Cc]ategory:[^]]+ (births|deaths)(\||\]\])/
- /{{\s*[Cc]oor/ <-- This one is of very dubious quality
- /{{[dD]isamb/
- /\[\[[Cc]ategory:\d+ (albums|singles)(\||\]\])/
- /{{\s*[Tt]axobox\b/
- /\[\[[Cc]ategory:[^]]+ films(\||\]\])/
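As a rough illustration of how such a classification could be computed, here is a minimal Python sketch. It uses the patterns listed above, with the wikitext brackets and pipes escaped for Python's re module. The label names, the "first match wins" rule for pages that match several patterns, and the sample pages are all assumptions made for this example; the message does not say how overlaps were handled.

```python
import re
from collections import Counter

# Patterns adapted from the list in the message. Labels are hypothetical.
# Assumption: a page is assigned to the first pattern that matches.
PATTERNS = [
    ("living", re.compile(r"\[\[[Cc]ategory:[Ll]iving people(\||\]\])")),
    ("other_person", re.compile(r"\[\[[Cc]ategory:[^\]]+ (births|deaths)(\||\]\])")),
    ("location", re.compile(r"\{\{\s*[Cc]oor")),
    ("disambiguation", re.compile(r"\{\{[dD]isamb")),
    ("album_or_single", re.compile(r"\[\[[Cc]ategory:\d+ (albums|singles)(\||\]\])")),
    ("tree_of_life", re.compile(r"\{\{\s*[Tt]axobox\b")),
    ("film", re.compile(r"\[\[[Cc]ategory:[^\]]+ films(\||\]\])")),
]


def classify(wikitext: str) -> str:
    """Return the label of the first matching pattern, or 'unclassified'."""
    for label, pattern in PATTERNS:
        if pattern.search(wikitext):
            return label
    return "unclassified"


def percentages(pages):
    """Percentage of pages per label (by article count, not by bytes)."""
    counts = Counter(classify(text) for text in pages)
    return {label: 100.0 * n / len(pages) for label, n in counts.items()}


# Made-up sample pages, only to exercise the patterns.
SAMPLE = [
    "Jane Doe ... [[Category:1970 births]] [[Category:Living people]]",
    "John Roe ... [[Category:1801 births]] [[Category:1870 deaths]]",
    "{{Taxobox | name = Example}}",
    "{{disambig}}",
    "Some prose with no recognised markers.",
]

print(percentages(SAMPLE))
```

Note that "other_person" only means "has a births or deaths category but no Living people category"; it is not proof that the subject is dead.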
Eugene
On 10/17/07, Eugene van der Pijll eugene@vanderpijll.nl wrote:
...following [[regex]]en:
- /\[\[[Cc]ategory:[Ll]iving people(\||\]\])/
- /\[\[[Cc]ategory:[^]]+ (births|deaths)(\||\]\])/
- /{{\s*[Cc]oor/ <-- This one is of very dubious quality
Well I did just look at http://en.wikipedia.org/wiki/Special:Prefixindex/Template:Coor and didn't see any brewery-related navboxes.
- /{{[dD]isamb/
Note that there are other templates for specialized types of disambiguation pages (e.g. {{geodis}}, {{hndis}}, {{roaddis}} etc. to list places, people, roads sharing roughly the same name), not to mention the commonly used shorthand "{{dab}}" for the main template. A total membership count of [[Category:Disambiguation]] and its subcategories, ignoring duplicate titles (pages in more than one category), would be the most accurate obtainable number, and much higher than your estimate.
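For what it's worth, a text-matching approach could be widened to catch the templates named above. The following Python sketch is only an illustration: the template list is limited to the names mentioned in this message and is certainly incomplete, and the case-insensitive matching is a simplifying assumption. Counting category members, as suggested, would still be the more reliable method.

```python
import re

# Keeps the original prefix match on "disamb" and adds exact-name matches for
# the shorthand and specialised templates mentioned in the message.
# Assumption: matching is fully case-insensitive, which is looser than
# MediaWiki's own rules for template names.
DISAMBIG = re.compile(
    r"\{\{\s*(?:disamb|(?:dab|geodis|hndis|roaddis)\s*[|}])",
    re.IGNORECASE,
)


def is_disambiguation(wikitext: str) -> bool:
    """True if the page text uses one of the listed disambiguation templates."""
    return DISAMBIG.search(wikitext) is not None


print(is_disambiguation("{{dab}}"))
print(is_disambiguation("{{database}}"))
```

The short names are required to be followed by "|" or "}" so that an unrelated template whose name merely starts with "dab" is not counted.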
- /\[\[[Cc]ategory:\d+ (albums|singles)(\||\]\])/
- /{{\s*[Tt]axobox\b/
- /\[\[[Cc]ategory:[^]]+ films(\||\]\])/
In other news, although I realize there would be countless duplicates, I'd be interested in seeing some statistics on the subcategories of "people by occupation", "people by nationality", etc. etc. Just to see what sort of bio-topics are most and least likely to be written about. Just post the raw numbers, don't need a graph. You can post a link to it here if everybody promises not to panic about the number of cricket players.
—C.W.
On 10/17/07, Gregory Maxwell gmaxwell@gmail.com wrote:
http://commons.wikimedia.org/wiki/Image:Enwikipedia_articles_bios_pct_200710...
It must be a significant underestimate, since any article about a living person that is not explicitly labelled [[Category:Living people]] isn't counted.
Looking through the stubs I've created in the last year or two, I have made 7 about living people: Daniel Wyllie, Lech Kowalski, Franck Sorbier, Vadim Perelman, Eric Harshbarger, Adam Elliot and John Long (climber). Two of those are in the magic category, and 5 aren't. If that ratio held up across Wikipedia, the proportion of articles about living people would be about 40%, not 12%.
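The arithmetic behind that estimate can be spelled out in a short Python sketch. It simply divides the observed share by the fraction of the sample that carried the category; the function name is made up for this example, and the only inputs are the figures quoted above (12% observed, 2 of 7 tagged).

```python
def extrapolate_share(observed_share: float, tagged: int, sample_size: int) -> float:
    """Scale an observed share up by the tagging coverage seen in a sample.

    Assumes the sample's coverage (tagged / sample_size) holds for the whole
    wiki, which is a strong assumption for a sample of seven articles.
    """
    coverage = tagged / sample_size
    return observed_share / coverage


# Figures from the message: 12% observed, 2 of 7 sampled articles tagged.
estimate = extrapolate_share(0.12, 2, 7)
print(f"{estimate:.0%}")

# Sensitivity check: if just one more of the seven had been tagged.
print(f"{extrapolate_share(0.12, 3, 7):.0%}")
```

With 2 of 7 tagged the estimate is 42%, which matches the "about 40%" above; with 3 of 7 it would drop to 28%, which shows how sensitive the figure is to such a small sample.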
Steve
On 19/10/2007, Steve Bennett stevagewp@gmail.com wrote:
Looking through the stubs I've created in the last year or two, I have made 7 about living people: Daniel Wyllie, Lech Kowalski, Franck Sorbier, Vadim Perelman, Eric Harshbarger, Adam Elliot and John Long (climber). Two of those are in the magic category, and 5 aren't. If that ratio held up across Wikipedia, the proportion of articles about living people would be about 40%, not 12%.
In my experience the ratio is higher - 90%, perhaps. The category is generally a lot better known now than it was; I'm surprised you're getting such low results. Out of interest, did all your articles have [[Category:19xx births]]?
Hmm.
Sometime last year - I thought it was in the mailing list, but I can't find it, so it might be in IRC - we needed to produce numbers on how many living bios we had, and checked both the numbers in that category *and* something else. The "something else" numbers were 10-20% higher, I recall.
I *think* this latter was using the {{WP Biography}} talkpage template, but it might have been an aggregation of birth/death categories.
So, if we look for all articles matching something like ["Category:19xx births" NOT "Category:xxxx deaths"], translate that into SQL, see what you get.
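A rough sketch of what that query might look like, run here against a toy SQLite database from Python. The table and column names (categorylinks, cl_from, cl_to) are modelled loosely on MediaWiki's category link table, but the data is invented, and SQLite's GLOB operator is used for convenience; a query against the real database would need different pattern syntax and would be far more expensive.

```python
import sqlite3


def build_toy_db() -> sqlite3.Connection:
    """In-memory table of (page id, category) pairs with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT)")
    conn.executemany(
        "INSERT INTO categorylinks VALUES (?, ?)",
        [
            (1, "1950_births"), (1, "Living_people"),
            (2, "1960_births"),                      # born, not tagged as living
            (3, "1901_births"), (3, "1980_deaths"),
            (4, "1975_births"), (4, "Living_people"),
            (5, "1990_albums"),
        ],
    )
    return conn


# Pages in a "19xx births" category with no "xxxx deaths" category.
BORN_NOT_DIED = """
SELECT DISTINCT b.cl_from
FROM categorylinks AS b
WHERE b.cl_to GLOB '19[0-9][0-9]_births'
  AND NOT EXISTS (
      SELECT 1 FROM categorylinks AS d
      WHERE d.cl_from = b.cl_from AND d.cl_to GLOB '[0-9]*_deaths')
"""

# Extra filter: of those, the ones still missing the living-people category.
NOT_TAGGED = """
  AND NOT EXISTS (
      SELECT 1 FROM categorylinks AS l
      WHERE l.cl_from = b.cl_from AND l.cl_to = 'Living_people')
"""


def page_ids(conn: sqlite3.Connection, sql: str) -> list:
    """Run a query and return the matching page ids in ascending order."""
    return [row[0] for row in conn.execute(sql + " ORDER BY b.cl_from")]


conn = build_toy_db()
print(page_ids(conn, BORN_NOT_DIED))
print(page_ids(conn, BORN_NOT_DIED + NOT_TAGGED))
```

The second query is the one that would surface candidates for tagging. It would not catch people whose death is recorded some other way than a year-of-death category, so its results would still need checking by hand.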
Different category structure, but an older one, and one that might get some we've missed...
On 10/19/07, Andrew Gray shimgray@gmail.com wrote: [snip]
So, if we look for all articles matching something like ["Category:19xx births" NOT "Category:xxxx deaths"], translate that into SQL, see what you get.
It's a fairly expensive query to run (ever so much more than just matching a single category) ... but I ran it frequently in the past to help people populate the living people category.
On 19/10/2007, Steve Bennett stevagewp@gmail.com wrote:
It must be a significant underestimate, since any article about a living person that is not explicitly labelled [[Category:Living people]] isn't counted.
Sure, it's a lower bound, but I would be shocked if the coverage were that bad. Manually checking with random page indicates the numbers are about right, and certainly not your 90%. :)