According to their website, Britannica's Deluxe Edition 2004 CD has 75,000 articles.
http://store.britannica.com/escalate/store/DetailPage?pls=britannica&bc=...
Presumably, many of these are longer and of higher quality than our 150,000 English-language articles.
From our groovy new stats tool, we know that the average Wikipedia
article is 2,115 bytes. Perhaps of greater interest, 53% of our articles, or right around 75,000, are longer than 1,500 bytes.
So a question naturally comes to mind - how do our 75,000 >1,500 byte articles stack up against Britannica's 75,000 articles?
I ask because I continue to work on a plan for a drive to Wikipedia 1.0, and a big part of that plan involves getting a realistic assessment of what a Wikipedia 1.0 will look like, relative to Britannica.
If I end up setting a 'target date' for Wikipedia 1.0 of 1 year in the future, what might we realistically expect to achieve? What if I set the 'target date' for 2 years in the future?
What I'd hope to find is that we have a realistic chance of having a Wikipedia 1.0 release 1 year from now that rivals Britannica. But there's no need to hurry; if it will take 2 years or 5 years, that's how long it will take.
But I'd be very interested in getting some feedback and help on how to make that determination realistically.
--Jimbo
As a quick followup, Britannica also claims that the 32 volume print encyclopedia has 44 million words.
I just picked a random article of 2,245 bytes, which was also 376 words. That implies just under 6 bytes per word. This statistic could be improved by checking more articles, but the article looks pretty normal to me, so I think it's basically a sensible ballpark figure for now.
150,000 articles averaging 2,126 bytes means 318,900,000 bytes total or just over 53 million words.
Considering *just* the 75,000 articles over 1500 bytes and assuming conservatively that these are all *only* 1500 bytes long (manifestly untrue), we are looking at 18,750,000 words for just those longer articles.
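For anyone who wants to check or refine the arithmetic, here it is as a quick Python sketch. The 6 bytes/word figure comes from my single sampled article, so treat it as a rough assumption:

```python
# Back-of-the-envelope word counts. The 6 bytes/word figure comes
# from one sampled article (2,245 bytes / 376 words = just under 6),
# so treat it as a rough assumption.
BYTES_PER_WORD = 6

total_bytes = 150_000 * 2_126            # all articles at the average size
print(total_bytes / BYTES_PER_WORD)      # ~53,150,000 words vs. Britannica's 44 million

# Conservative floor for the 75,000 articles over 1,500 bytes,
# pretending each is *only* 1,500 bytes (manifestly untrue).
print(75_000 * 1_500 / BYTES_PER_WORD)   # 18,750,000 words
```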
It seems clear to me that we are already "in the ballpark" of the size of Britannica. Quality is, of course, an entirely different question. I think we are often superior and often drastically inferior. I suspect that our coverage contains strange and conspicuous 'holes' if we went through it via a "top down" approach, i.e. take lists of major topics and see if we've covered them.
--Jimbo
Jimmy Wales wrote:
As a quick followup, Britannica also claims that the 32 volume print encyclopedia has 44 million words.
44,000,000 / 75,000 ≈ 587 words per article
It seems clear to me that we are already "in the ballpark" of the size of Britannica. Quality is, of course, an entirely different question. I think we are often superior and often drastically inferior. I suspect that our coverage contains strange and conspicuous 'holes' if we went through it via a "top down" approach, i.e. take lists of major topics and see if we've covered them.
We already have many "List of ... topics" articles, as well as many other lists that can serve as most wanted lists. We already have a mechanically generated list of "Wanted pages" that works sometimes. A human generated most wanted list would also be very welcome if it's not allowed to become so long that it's useless.
When I first joined Wikipedia I made extensive use of the "Wanted pages" just to find things to do. A more experienced Wikipedian never runs short of things to do, but even then looking at the Wanted pages for something a little different can break monotonous habits. In other words "Wanted pages" is a lot more important to newbies than to veterans; nothing works better for retaining new contributors than to feel that they can contribute something that somebody else wants. It would be nice to see that feature working a little better than has been the case in recent months.
As I said in my previous response, I see a CD edition as a snapshot of WP at a given moment in time. I would suggest that there be a tag for a "print approved" version of any article. This is a version that has been reviewed for such things as libellous comments, copyright infringements, and adherence to NPOV. The snapshot would be a collection of all the latest print approved versions, stripped of dead links. WP1.0 will clearly be much smaller than its successors, but that's fine too, because we can continue with a maintainable promise of better things to come.
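Schematically, the bookkeeping could be as simple as this toy sketch; the names and the fetch_revision hook are made up, not MediaWiki's actual interface:

```python
# Toy sketch of "print approved" tagging. All names here are
# hypothetical; this is not MediaWiki's actual interface.

approved = {}   # article title -> revision id of the latest approved version

def mark_print_approved(title, rev_id):
    """Record that this revision passed review for libellous comments,
    copyright infringements, and adherence to NPOV."""
    approved[title] = rev_id

def snapshot(fetch_revision):
    """The CD/print snapshot: the latest approved revision of every
    article that has one. Dead-link stripping would also happen here."""
    return {title: fetch_revision(title, rev_id)
            for title, rev_id in approved.items()}
```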
Try as we might, it is inevitable that unwanted material will creep into the published versions. This could be minimized by having a lot of paid staff do nothing but check for this stuff, but that is not a cost-effective strategy consistent with our volunteer nature. To minimize the damage from such incidents, I would recommend short production runs that would ensure that supplies are exhausted before the expiry of any safe harbor periods. The next production run would have the formally questioned material excised while the matter is being reviewed; it can always be re-inserted at a later time if the challenge turns out to be groundless. Being seen to act quickly on these problems is a lot more important than going to extraordinary lengths to remove things that we only guess '''may''' be offending. Obviously offending material would, of course, still be excluded.
Ec
Jimmy Wales wrote:
So a question naturally comes to mind - how do our 75,000 >1,500 byte articles stack up against Britannica's 75,000 articles?
Really, how important is it that we always be looking over our shoulders to see what Britannica is doing? IIRC it was in the Landy/Bannister Miracle Mile race in the 1950s where the leader missed the record because he looked over his shoulder to see how his competitor was doing; that action affected his momentum. Britannica's 75,000 is relatively static compared to our more dynamic and more adaptable collection. To the extent that we have the inferior article on a subject, we also have the greater flexibility for improvement.
I ask because I continue to work on a plan for a drive to Wikipedia 1.0, and a big part of that plan involves getting a realistic assessment of what a Wikipedia 1.0 will look like, relative to Britannica.
Again, never mind "relative to Britannica". It may be more important to know who our target audience is going to be, and what kind of marketing strategy will reach that audience. What retail price will the public find acceptable, and how does that relate to our costs of production and shipping? What infrastructure do we need to support the sales that we do get?
I believe that our deficiencies can be turned into marketing assets. WP1.0 would be a "snapshot" of what Wikipedia is at a given point in time, to which is added a promise of improvement. Instead of the cash rebate that Britannica offers, we can offer some number of revised disks to be mailed in the future.
If I end up setting a 'target date' for Wikipedia 1.0 of 1 year in the future, what might we realistically expect to achieve? What if I set the 'target date' for 2 years in the future?
What I'd hope to find is that we have a realistic chance of having a Wikipedia 1.0 release 1 year from now that rivals Britannica. But there's no need to hurry; if it will take 2 years or 5 years, that's how long it will take.
I think the target date for WP1.0 is largely arbitrary. It should be chosen for the best market impact.
Ec
Ray Saintonge wrote:
So a question naturally comes to mind - how do our 75,000 >1,500 byte articles stack up against Britannica's 75,000 articles?
Really, how important is it that we always be looking over our shoulders to see what Britannica is doing?
Not very important! However, it's a nice yardstick and until we leave them in the dust, which we will, it's the gold standard to which we aspire. :-)
Again, never mind "relative to Britannica". It may be more important to know who our target audience is going to be, and what kind of marketing strategy will reach that audience. What retail price will the public find acceptable, and how does that relate to our costs of production and shipping? What infrastructure do we need to support the sales that we do get?
Well, the beauty of free software is that we really don't have to give such issues a lot of thought. Linus lets RedHat worry about that sort of thing. I'll let someone else worry about it. That's a half-facetious answer, but really, our goal in producing 1.0 isn't tied much to marketing issues.
I think the target date for WP1.0 is largely arbitrary. It should be chosen for the best market impact.
Well, I certainly sympathize with what you're saying, but I think we should think long-term. I'd rather wait 5 years before calling something 1.0 and releasing it for print publication and so on, and get it right, than rush something out the door now. We don't face pressure from the marketing department to deliver early, and that's a good thing.
--Jimbo
Jimmy Wales wrote:
I ask because I continue to work on a plan for a drive to Wikipedia 1.0, and a big part of that plan involves getting a realistic assessment of what a Wikipedia 1.0 will look like, relative to Britannica.
If I end up setting a 'target date' for Wikipedia 1.0 of 1 year in the future, what might we realistically expect to achieve? What if I set the 'target date' for 2 years in the future?
From this and another of your posts that suggested 'retail' (or was that someone else's post?) am I right in inferring that WP1.0 is something along the lines of the "Sifter" project that was proposed earlier? Or is it instead just a marker for the current Wikipedia? The latter would seem unsuited to freezing as a 1.0 though -- you wouldn't want to ever publish a print encyclopedia that contained even one article whose entire contents are "dskafldsafkjhdanz" or "i like ham", and the live Wikipedia invariably has a few of these. You wouldn't even really want to put that sort of thing on a CD-ROM. You wouldn't want to put blatantly factually incorrect information on a CD-ROM either, which is a bit of a problem (I've found plenty of Wikipedia articles with wrong dates and such).
Part of the advantage of Britannica is that you can be reasonably sure that when you read an article that states a fact, that fact is correct, or at least at the time of publication was believed to be correct by the experts in the field (no one can account for future discoveries, of course). If it is intended to rival Britannica, WP1.0 will have to have a similar level of reliability -- when it says someone was born on January 7, 1845, they better have actually been born on that date -- not, say, January 7, 1854. And when something is mentioned as a mainstream physics topic, it had better actually be one, not a fringe theory.
This is really what I see as the major obstacle in rivalling Britannica -- the sheer amount of information is a problem that will fix itself given some time. But to have *reliable* information is more difficult. Right now I use Wikipedia as a way to find out about topics I didn't know about, but not as an authoritative source -- I always check everything of importance (whether factual information or discussion of, say, philosophical topics) with another encyclopedia or a book before taking it as true. This is somewhat diminished on major articles -- I can be reasonably sure that the WW2 article is accurate, as it is high-profile enough and has enough people reading it. But for the articles on less high-profile topics, I'm not nearly as confident.
-Mark
Delirium wrote:
From this and another of your posts that suggested 'retail' (or was that someone else's post?) am I right in inferring that WP1.0 is something along the lines of the "Sifter" project that was proposed earlier?
The way I think about it is that 1.0 is a particular result, while "Sifter" is one proposal for how to get there, and not necessarily the only one.
I have a number of ideas about 1.0, ideas that need to be subjected to community scrutiny and analysis before we actually do anything. But here's a thumbnail sketch. Remember that these are ideas about 1.0, not ideas about how we should get there:
1. "Wikipedia 1.0 is about as good as Britannica" -- better in some areas, not as good in some other areas. But reasonably complete, and highly reliable.
2. "Wiki is not paper, but Wikipedia 1.0 is paper". The goal of a push towards 1.0 is specifically to produce a version that's purposefully edited and limited in some minor ways. We use 'Wiki is not paper' as a good reason to be permissive about the addition of relative obscurities, but we hope for a print publisher to pick up Wikipedia 1.0 and distribute it profitably and dirt-cheap to all the people of the world, which means paper and which means constraints.
3. "Wikipedia 1.0 is *just* Wikipedia 1.0" -- this is to remind us that this is just a 1.0 release, and so it won't be mature in every way, and that the mistakes we will inevitably make in 1.0 will be rectified in 1.2, 1.4, 2.0, 3.0 and beyond. We'll need to set some policies for 1.0, and then stick to them until we release, but after that, we can and should open the whole thing up to critical reflection for the next round.
I also have some ideas about the Wikipedia 1.0 (or "Sifter") process...
1. The process should unite and energize the existing Wikipedia community, not compete with it. Example: this should not happen on a different website, run by new volunteers, but should be done by us -- we've earned it.
2. The process should not interfere with the miracle that is Wikipedia. Example: no one should think that we're going to close the process of Wikipedia itself.
3. The process should be as open as possible, more open than anyone before us could have imagined possible, but only as open as is consistent with our goals. Example: since Wikipedia 1.0 is paper, we may have to be more strict about letting unknown people randomly do things relating to sifting.
Part of the advantage of Britannica is that you can be reasonably sure that when you read an article that states a fact, that fact is correct, or at least at the time of publication was believed to be correct by the experts in the field (no one can account for future discoveries, of course). If it is intended to rival Britannica, WP1.0 will have to have a similar level of reliability -- when it says someone was born on January 7, 1845, they better have actually been born on that date -- not, say, January 7, 1854. And when something is mentioned as a mainstream physics topic, it had better actually be one, not a fringe theory.
That's to be hoped, yes. How to get there is a different thing from saying that we do, in fact, want to get there.
--Jimbo
Jimmy Wales wrote:
- "Wiki is not paper, but Wikipedia 1.0 is paper". The goal of a
push towards 1.0 is specifically to produce a version that's purposefully edited and limited in some minor ways. We use 'Wiki is not paper' as a good reason to be permissive about the addition of relative obscurities, but we hope for a print publisher to pick up Wikipedia 1.0 and distribute it profitably and dirt-cheap to all the people of the world, which means paper and which means constraints.
I'm not sure 1.0 and "paper-wikipedia" are necessarily the same thing.
But "Paper" is definitely something we need to look into. However, it is more complex than sifting -- for example, we'd need a single article on "Lord of the Rings" that we would have to distill from the current dozens on characters, places, films, etc
This would require a clear plan:
* how much space do we have?
* how many articles can we have in each subject?
* how many words do we have for each article -- how do we weigh them relative to each other?
... etc -- basically, all the sorts of editorial decisions that Britannica & co have to make (see the toy sketch below).
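By way of illustration only -- the subjects and weights here are invented, and the 44 million word total is just Britannica's figure from earlier in the thread:

```python
# Hypothetical word-budget arithmetic: split a fixed print budget
# across subjects by editorial weight. Subjects/weights are made up.
def word_budget(total_words, weights):
    scale = total_words / sum(weights.values())
    return {subject: round(weight * scale)
            for subject, weight in weights.items()}

print(word_budget(44_000_000,
                  {"history": 3.0, "science": 3.0, "geography": 2.0,
                   "arts": 1.5, "fiction (e.g. Tolkien)": 0.5}))
# -> history and science get 13.2 million words each, fiction 2.2 million
```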
This sort of work would best be done by a relatively small group of editors -- this is why I don't think that "paper" is the same as 1.0. But I *really* like the idea Jimbo raised a long time ago about producing an at-cost encyclopedia for third world schools. :)
tarquin-
But "Paper" is definitely something we need to look into. However, it is more complex than sifting -- for example, we'd need a single article on "Lord of the Rings" that we would have to distill from the current dozens on characters, places, films, etc
Count me as a supporter of that idea for the online version as well ;-). Truthfully, we need a better way to handle multiple versions of a page -- effectively something like branches in CVS. This is useful for handling protected pages (edit a branch copy of the page and merge changes into main page once a certain time has elapsed or when a sysop approves the changes), for handling temp pages, and for handling permanent branches for the printed version. To do this we would need some good merging code which would also be useful to mostly get rid of those damn edit conflicts. Of course we would have to be careful to avoid overbranching into POV versions, perhaps by requiring each new branch to be approved in consensus.
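To make the merging idea concrete, here is a minimal line-based sketch in Python. It is naive -- it doesn't flag two insertions at the same point as a conflict, and real merge code would be much more careful -- and all the names are made up:

```python
import difflib

def changed_hunks(base, derived):
    """Hunks where `derived` differs from `base`: (i1, i2) is the
    replaced line range in `base`, `repl` the replacement lines."""
    sm = difflib.SequenceMatcher(None, base, derived)
    return [(i1, i2, derived[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes()
            if tag != "equal"]

def edits_conflict(base, ours, theirs):
    """Two branches conflict if they changed overlapping base lines."""
    for a1, a2, _ in changed_hunks(base, ours):
        for b1, b2, _ in changed_hunks(base, theirs):
            if a1 < b2 and b1 < a2:
                return True
    return False

def merge(base, ours, theirs):
    """Apply both branches' hunks onto base. Only valid when
    edits_conflict() is False, so the hunks are disjoint."""
    hunks = sorted(changed_hunks(base, ours) + changed_hunks(base, theirs),
                   key=lambda h: (h[0], h[1]))
    out, pos = [], 0
    for i1, i2, repl in hunks:
        out.extend(base[pos:i1])   # unchanged base lines before the hunk
        out.extend(repl)           # the branch's replacement
        pos = i2
    out.extend(base[pos:])
    return out
```

Branch maintenance would then amount to calling edits_conflict() and, when it returns False, merge(); everything else escalates to a human.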
We'll also need a flag for whether to include an article in the printed version or not (at this point the number of page-flags is getting so big that it is becoming increasingly useful to separate them into a meta namespace).
Hehehe, lots of work for our team of trained code monkeys ;-)
But I *really* like the idea Jimbo raised a long time ago about producing an at-cost encyclopedia for third world schools. :)
So do I. How cheap can we get? It might be more cost effective to build Wikipedia reader computers using old machines and cheap hard drives in the single-digit gigabyte range (of course that requires the cheap availability of electric power, which makes it a non-starter for the poorest countries). With some crack Linux hackers we should be able to whip up something that runs on a 486 (using a miniature Linux distro for embedded devices and a very lean web browser like Dillo).
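Rough arithmetic on capacity, assuming the ~319 MB of article text Jimbo estimated earlier in the thread and a guessed 3:1 compression ratio:

```python
# Rough capacity check for a cheap Wikipedia reader box.
# 318,900,000 bytes is Jimbo's estimate from earlier in the thread;
# the 3:1 compression ratio is a guess, not a measurement.
article_text_bytes = 318_900_000
compressed_bytes = article_text_bytes / 3     # ~106 MB
drive_bytes = 2 * 10**9                       # even a small 2 GB drive
print(f"{compressed_bytes / drive_bytes:.0%} of the drive")   # ~5%
```

So even with images and indexes on top, a salvaged small drive leaves plenty of headroom.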
Regards,
Erik
On 19 Aug 2003 01:06:00 +0200, Erik Moeller erik_moeller@gmx.de gave utterance to the following:
But I *really* like the idea Jimbo raised a long time ago about producing an at-cost encyclopedia for third world schools. :)
So do I. How cheap can we get? It might be more cost effective to build Wikipedia reader computers using old machines and cheap hard drives in the single-digit gigabyte range (of course that requires the cheap availability of electric power, which makes it a non-starter for the poorest countries). With some crack Linux hackers we should be able to whip up something that runs on a 486 (using a miniature Linux distro for embedded devices and a very lean web browser like Dillo).
I recall seeing a TV item where the South African Government had developed an English inventor's idea of a $5 hand (crank) powered radio and put it into production, and that they were looking at the possibility of doing something similar with a computer.
single-digit gigabyte range (of course that requires the cheap availability of electric power, which makes it a non-starter for the poorest countries). With some crack Linux hackers we should be able to whip up something that runs on a 486 (using a miniature Linux distro for embedded devices and a very lean web browser like Dillo).
I recall seeing a TV item where the South African Government had developed an English inventor's idea of a $5 hand (crank) powered radio and put it into production, and that they were looking at the possibility of doing something similar with a computer.
[[Simputer]]

Till Westermayer
Richard Grevers wrote:
On 19 Aug 2003 01:06:00 +0200, Erik Moeller erik_moeller@gmx.de gave utterance to the following:
I recall seeing a TV item where the South African Government had developed an English inventor's idea of a $5 hand (crank) powered radio and put it into production, and that they were looking at the possibility of doing something similar with a computer.
Trevor Baylis, BayGen radio -- we need an article on this! ;)
On Mon, 18 Aug 2003 07:22:40 -0700, Jimmy Wales jwales@bomis.com gave utterance to the following:
According to their website, Britannica's Deluxe Edition 2004 CD has 75,000 articles.
http://store.britannica.com/escalate/store/DetailPage?pls=britannica&bc=...
Presumably, many of these are longer and of higher quality than our 150,000 English-language articles.
From our groovy new stats tool, we know that the average Wikipedia
article is 2,115 bytes. Perhaps of greater interest, 53% of our articles, or right around 75,000, are longer than 1,500 bytes.
So a question naturally comes to mind - how do our 75,000 >1,500 byte articles stack up against Britannica's 75,000 articles?
But aren't a third of those 1500-byte+ articles the machine-generated ones on US small towns?
Richard-
But aren't a third of those 1500-byte+ articles the machine-generated ones on US small towns?
It's fair to include these in our regular count, but the fact that they are included should be prominently noted. I am personally somewhat opposed to having articles about <100 person towns (but could live with these being merged into one overview article).
Regards,
Erik
Richard Grevers wrote:
But aren't a third of those 1500-byte+ articles the machine-generated ones on US small towns?
You must stop feeling guilty for this. Those are good articles, a lot better than many initial Wikipedia articles. Just assume that any commercial encyclopedia contains machine-generated articles too.
The real difference is that they don't contain such detailed information on every little place name. They are more focused on "relevant" topics (however you define that) and have left out the irrelevant.