On Tue, Sep 10, 2002 at 12:24:27PM -0700, Ray Saintonge wrote:
I looked a little more into this, manually tracing the path of 10 randomly chosen articles. I don't know what it does to the automatic path tracing idea, but it did lead to a number of observations.
[...]
Observations:
- In the samples the longest minimum path to the Main Page was only 4
articles. Any article linked from a user page would be 3 steps away from the user page, but this should not be considered a meaningful path.
- Two kinds of effectively orphan pages became evident, but these would
never appear on the special page listing of orphans. In the first example two pages link to each other but nothing else links to them. In example 6 the only links to the article are on user pages. Who would ever think to look there for a reference to an article?
- [[List of countries]] and [[United States]] should probably be
linked from the Main Page. The number of paths through these is enormous.
- Many of the links to [[United States]] are excessive. Many of the
uses are in passing where more information about the United States is unlikely to be needed. I think we can always assume a very basic level of understanding about what is meant by "United States". What would surprise me most about those who don't have that very basic level of understanding is how they managed to find Wikipedia in the first place.
I have done a much more complex and almost fully automatic topological analysis (of the Polish Wikipedia).
If you can read Polish, or think that you can find out what's going on by just looking at the numbers and lists, check: http://pl.wikipedia.com/wiki.cgi?Taw/Topologia_Wikipedii (the stats are a couple of days old)
Things that are done before computations:
* all empty, talk and user pages are removed
* all links to redirects are replaced by links to final articles, and then redirects are removed
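For anyone who wants to try the same clean-up on another Wikipedia, here is a minimal sketch of those two steps in Python. It is not the actual script. It assumes the link data has already been loaded into plain dictionaries, and the names `links`, `redirects`, `excluded`, `resolve` and `clean_graph` are all made up for the illustration.

```python
def resolve(title, redirects):
    """Follow a redirect chain to its final title, stopping if it loops."""
    seen = set()
    while title in redirects and title not in seen:
        seen.add(title)
        title = redirects[title]
    return title


def clean_graph(links, redirects, excluded):
    """Return a cleaned copy of the link graph.

    links     -- dict: page title -> iterable of linked titles (assumed input)
    redirects -- dict: redirect title -> target title (assumed input)
    excluded  -- set of titles to drop: empty, talk and user pages
    """
    # Pages that survive: not a redirect and not in the excluded set.
    keep = {p for p in links if p not in redirects and p not in excluded}
    cleaned = {}
    for page in keep:
        # Replace links to redirects by links to the final article.
        targets = {resolve(t, redirects) for t in links[page]}
        # Drop links that point at removed or unknown pages.
        cleaned[page] = sorted(t for t in targets if t in keep)
    return cleaned


if __name__ == "__main__":
    links = {
        "Main Page": ["USA", "Poland"],
        "USA": [],
        "United States": ["Poland"],
        "Poland": ["USA", "User:Taw"],
        "User:Taw": ["Poland"],
    }
    redirects = {"USA": "United States"}
    print(clean_graph(links, redirects, excluded={"User:Taw"}))
```

How a page is recognised as empty, talk or user depends on the database, so that part is left to whoever builds the `excluded` set.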
(About 1) Stats for Polish Wikipedia:
* not accessible   227 ( 4.716393102%)
* main page          1 ( 0.02077706212%)
* 1 hop             78 ( 1.620610846%)
* 2 hops          1199 (24.91169749%)
* 3 hops          2492 (51.77643881%)
* 4 hops           614 (12.75711614%)
* 5 hops           175 ( 3.635985872%)
* 6 hops            22 ( 0.4570953667%)
* 7 hops             5 ( 0.1038853106%)
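A table like this can be produced with a plain breadth-first search from the main page. The sketch below is only an illustration of that idea, not the script that generated the numbers above. It assumes the graph is a dict mapping each title to the titles it links to, and `hop_distances` and `hop_histogram` are hypothetical names.

```python
from collections import Counter, deque


def hop_distances(links, start):
    """Breadth-first search: minimum number of hops from start to each page."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in dist:
                dist[target] = dist[page] + 1
                queue.append(target)
    return dist


def hop_histogram(links, start):
    """Return ({hops: page count}, number of pages not accessible from start)."""
    dist = hop_distances(links, start)
    # Every title that appears as a source or as a link target.
    pages = set(links) | {t for targets in links.values() for t in targets}
    histogram = dict(Counter(dist.values()))
    return histogram, len(pages) - len(dist)


if __name__ == "__main__":
    toy = {
        "Main": ["A", "B"],
        "A": ["C"],
        "B": ["C"],
        "C": [],
        "X": ["Y"],  # X and Y link only to each other
        "Y": ["X"],
    }
    histogram, unreachable = hop_histogram(toy, "Main")
    total = sum(histogram.values()) + unreachable
    print(f"not accessible {unreachable} ({100 * unreachable / total:.2f}%)")
    for hops in sorted(histogram):
        count = histogram[hops]
        print(f"{hops} hops {count} ({100 * count / total:.2f}%)")
```

In the toy graph, X and Y link only to each other, so they land in the "not accessible" row even though neither would show up on a list of pages with no incoming links.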
(About 2) Much more interesting patterns can be found. Don't forget about articles linked from talk pages and yearbook pages.
(About 3) The most interesting thing computed was the "importance of links on the main page". The algorithm is simple: the sum of distances from the main page to each non-orphan node is computed, and a link is as valuable as the amount by which it improves this number. Both links to be added and links to be removed are evaluated. As a result we now have a rather more useful main page.
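Roughly, the computation can be sketched as below. This is a reconstruction from the description, not the real script, and it makes two assumptions of its own: "non-orphan" is taken to mean any page with at least one incoming link in the original graph, and a page that cannot be reached from the main page is charged a fixed penalty distance equal to the total number of pages. All function names are hypothetical.

```python
from collections import deque


def distances(links, start):
    """Breadth-first search distances from start."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in dist:
                dist[target] = dist[page] + 1
                queue.append(target)
    return dist


def total_distance(links, start, nodes, penalty):
    """Sum of distances from start to nodes; unreachable nodes cost penalty."""
    dist = distances(links, start)
    return sum(dist.get(n, penalty) for n in nodes)


def main_page_link_values(links, main):
    """Return (removal_cost, addition_gain) for links on the main page."""
    linked = {t for targets in links.values() for t in targets}
    pages = set(links) | linked
    non_orphans = linked - {main}   # assumption: has an incoming link
    penalty = len(pages)            # assumption: cost of an unreachable page
    base = total_distance(links, main, non_orphans, penalty)
    current = list(links.get(main, ()))

    removal_cost = {}
    for target in set(current):
        trial = dict(links)
        trial[main] = [t for t in current if t != target]
        # How much worse the sum gets without this link.
        removal_cost[target] = (
            total_distance(trial, main, non_orphans, penalty) - base
        )

    addition_gain = {}
    for target in pages - set(current) - {main}:
        trial = dict(links)
        trial[main] = current + [target]
        # How much better the sum gets with this link added.
        addition_gain[target] = base - total_distance(
            trial, main, non_orphans, penalty
        )
    return removal_cost, addition_gain


if __name__ == "__main__":
    toy = {"Main": ["A"], "A": ["B"], "B": ["C"], "C": ["D"], "D": []}
    cost, gain = main_page_link_values(toy, "Main")
    print("cost of removing:", cost)
    print("gain from adding:", gain)
```

This runs one search per candidate link, which is slow but simple. For a graph of a few thousand articles that should still be tolerable.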
(and ...) If anybody wants the scripts, tell me, but expect to do some work to adapt them to another Wikipedia.