Jimmy Wales wrote:
To compare Wikipedia to Columbia Encyclopedia...
http://www.encyclopedia.com/
has the full text of Columbia.
There are pages for alphabetic browsing.
http://www.encyclopedia.com/browse/browse-Aa.asp
From these pages, it should be possible to get a list of all their
article titles.
These could be matched up against Wikipedia article titles.
Well, matching them up turns out not to be very easy. For example, what
they call "Abdül Aziz" is called "Abd-ul-Aziz" on Wikipedia.
I have used the following heuristics to match up articles:
- redirects (obviously)
- names in the other order ("Thomas Jefferson" rather than
"Jefferson, Thomas")
- middle names deleted
The latter two are already somewhat error-prone (though I haven't
spotted such an error yet).
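
For illustration, the three heuristics above could be sketched roughly as
follows in Python. This is only a sketch, not the script that produced the
numbers below: the function names (candidate_titles, match_title) and the
data structures (a set of Wikipedia titles and a dict mapping redirect
titles to their targets) are assumptions made for the example.

```python
def candidate_titles(columbia_title):
    """Return possible Wikipedia-style titles for an encyclopedia.com title.

    Hypothetical helper: the order of the list is the order in which
    candidates are tried.
    """
    title = columbia_title.strip()
    candidates = [title]

    # Heuristic: names in the other order,
    # e.g. "Jefferson, Thomas" -> "Thomas Jefferson".
    if "," in title:
        last, first = (part.strip() for part in title.split(",", 1))
        if last and first:
            candidates.append(f"{first} {last}")

    # Heuristic: middle names deleted,
    # e.g. "John Quincy Adams" -> "John Adams".
    # Only applied to candidates that are already in natural order.
    for name in list(candidates):
        words = name.split()
        if len(words) > 2 and "," not in name:
            candidates.append(f"{words[0]} {words[-1]}")

    return candidates


def match_title(columbia_title, wikipedia_titles, redirects):
    """Return the matching Wikipedia title, or None if nothing matches.

    wikipedia_titles: set of existing article titles (assumed input).
    redirects: dict mapping a redirect title to its target (assumed input).
    """
    for candidate in candidate_titles(columbia_title):
        # Heuristic: follow a redirect if one exists for this candidate.
        target = redirects.get(candidate, candidate)
        if target in wikipedia_titles:
            return target
    return None
```

With this sketch, "Jefferson, Thomas" matches [[Thomas Jefferson]], while
"Abdül Aziz" still finds nothing unless a redirect exists, which is exactly
the kind of miss described above. Dropping middle names is the riskiest
step, since two different people can share a first and last name.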
With just these, I was able to match up 24003 article titles from
encyclopedia.com with articles on Wikipedia. 25101 other article titles
did not yield Wikipedia equivalents (although many of them have one;
e.g. Aziz as mentioned above). A number of other titles (silly me forgot
to output their number) led to the same Wikipedia article; for example,
"Aachen" and "Aix-la-Chapelle" were listed separately on theirs, but of
course they're the same thing.
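
Counting the titles that collapse onto the same article is a small grouping
step. A minimal sketch, assuming the matches are available as
(encyclopedia.com title, Wikipedia title) pairs; the names group_by_target
and count_duplicates are made up for this example:

```python
from collections import defaultdict


def group_by_target(matches):
    """Group encyclopedia.com titles by the Wikipedia article they matched.

    matches: iterable of (columbia_title, wikipedia_title) pairs
    (an assumed input format).
    """
    groups = defaultdict(list)
    for columbia_title, wikipedia_title in matches:
        groups[wikipedia_title].append(columbia_title)
    return groups


def count_duplicates(groups):
    """Count titles beyond the first that point at an already-matched article."""
    return sum(len(titles) - 1 for titles in groups.values() if len(titles) > 1)
```

For the example above, "Aachen" and "Aix-la-Chapelle" would end up in one
group under [[Aachen]] and contribute one duplicate to the count.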
The 24003 Wikipedia articles I could match up amount to 79979774 bytes
(almost 80 MB).
Unfortunately, I also found that some of them are disambiguation pages;
for example, where encyclopedia.com has one and only one "Adalbert,
Saint", Wikipedia's [[Saint Adalbert]] disambiguates to
[[Adalbert of Prague]] and [[Adalbert of Magdeburg]].
So, clearly, this isn't quite as easy as it sounds. But anyway, here is the
complete report:
(**WARNING!** 5.3 MB file! Very slow server! Better let it
download, have dinner, and then view locally!)
http://lionking.org/~timwi/t/wikipedia/comparison.html
Greetings,
Timwi
P.S.: More fun projects? ;-)