I think that a subject classification of articles would vastly improve "soft security" and would save regulars a lot of time, since not everyone would have to check every edit as currently seems to be the case.
I'd still like to see if we couldn't build those subjects automatically in some way based on links in the database.
How about this: the possible topics coincide with the major pages listed on [[Main Page]] (from "Astronomy" to "Visual Arts"). The shortest link path from such a topic page to an article defines that article's topic. If there is no such path, then the article is classified as a topic orphan.
To compute these topics quickly, the cur table gets two new columns: topic and distance, where distance stands for the link distance from the Main Page topic page. If a new article is created, looking at the distance entries of all articles that link to the new one, and taking the minimum, immediately classifies the new one. If an existing article is saved, the topic and distance entries of all articles it links to (and their children) may need to be updated; these changes can be propagated in a recursive manner.
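For concreteness, a rough sketch of that propagation (in Python, with a hypothetical db object whose links_to/links_from/topic/distance/set_topic helpers stand in for queries against the cur and links tables; this is not actual MediaWiki code, and it only handles the case where a save shortens distances) might look like:

    from collections import deque

    def classify_new_article(article, db):
        # On creation: inherit topic and distance from the nearest linking article.
        parents = [p for p in db.links_to(article) if db.distance(p) is not None]
        if not parents:
            db.set_topic(article, topic=None, distance=None)   # topic orphan
            return
        best = min(parents, key=db.distance)
        db.set_topic(article, topic=db.topic(best), distance=db.distance(best) + 1)

    def propagate_after_save(article, db):
        # On save: push any improved (shorter) distances down through the links.
        queue = deque([article])
        while queue:
            a = queue.popleft()
            d, t = db.distance(a), db.topic(a)
            if d is None:
                continue
            for child in db.links_from(a):
                cd = db.distance(child)
                if cd is None or cd > d + 1:
                    db.set_topic(child, topic=t, distance=d + 1)
                    queue.append(child)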
Would that work?
Axel
Axel Boldt wrote:
I think that a subject classification of articles would vastly improve "soft security" and would save regulars a lot of time, since not everyone would have to check every edit as currently seems to be the case.
I'd still like to see if we couldn't build those subjects automatically in some way based on links in the database.
How about this: the possible topics coincide with the major pages listed on [[Main Page]] (from "Astronomy" to "Visual Arts"). The shortest link path from such a topic page to an article defines that article's topic. If there is no such path, then the article is classified as a topic orphan.
To compute these topics quickly, the cur table gets two new columns: topic and distance, where distance stands for the link distance from the Main Page topic page. If a new article is created, looking at the distance entries of all articles that link to the new one, and taking the minimum, immediately classifies the new one. If an existing article is saved, the topic and distance entries of all articles it links to (and their children) may need to be updated; these changes can be propagated in a recursive manner.
Would that work?
Axel
Interesting! I had a very similar thought a couple of months ago, and never bothered to mention it. I guess that qualifies Axel for the moral copyright on the idea.
The orphan (like the main page) would simply have a distance value of 0.
The path for a page could appear at the top of the article. This already happens in some places, and we manually do something similar in our Tree of Life project. I hope nobody complains about polyphyletic Wikipedia pages.
As for saving time for regulars, that may be a very limited benefit. We all have certain subject areas where we tend to track things and tend to ignore topics outside of that. The real time loss often doesn't appear until we have looked at an article to see if a minor change really is minor.
Eclecticology
Axel Boldt wrote:
How about this: the possible topics coincide with the major pages listed on [[Main Page]] (from "Astronomy" to "Visual Arts"). The shortest link path from such a topic page to an article defines that article's topic. If there is no such path, then the article is classified as a topic orphan.
I looked a little more into this, manually tracing the path of 10 randomly chosen articles. I don't know what it does to the automatic path-tracing idea, but it did lead to a number of observations.
First the data:
1. Abu Zubaydah <- Ibn al-Shaykh al-Libi <- Abu Zubaydah (Circular orphan); nothing else leads to these two. (Score 0)
2. Analysis of variance <- Statistics <- Main Page (Score 2)
3. Indianapolis Colts <- National Football League <- American football <- Sport <- Main Page (Score 4); also Indianapolis Colts <- Indiana <- United States (=United States of America) <- List of Countries (=Countries of the World) <- Geography <- Main Page (Score 5); and Indianapolis Colts <- 1969 <- 20th century <- Historical timeline|Centuries <- Main Page (Score 4)
4. Jerry Springer <- List of television programs <- List of reference tables (=Reference tables) <- Main Page (Score 3)
5. Heinrich Schliemann <- Archaeology <- Main Page (Score 2)
6. Hitchhiking <- User: Branko <- Special Pages: Registered Users (=User list) <- Main Page (Score 3). Same via User: Rootbeer. Access is only through two user pages; it's an orphaned orphan!
7. Nursing <- Health science <- Main page (Score 2)
8. Vsevolod I, Prince of Kiev <- Kievan Rus' <- History of Russia (=Russian history) <- History <- Main Page (Score 4). A score of 3 was possible through the page [[User: H. Jonat]]. (I swear this was random; I didn't ask for THAT user to appear.)
9. Morrisville <- Wikipedia:Links to disambiguating pages <- Wikipedia: utilities <- Main Page (Score 3)
10. Celestial sphere <- Astronomy and astrophysics <- Main Page (Score 2)
Observations:
1. In the samples the longest minimum path to the Main Page was only 4 articles. Any article linked from a user page would be 3 steps away from the user page, but this should not be considered a meaningful path.
2. Two kinds of effectively orphan pages became evident, but these would never appear on the special page listing of orphans. In the first example two pages link to each other but nothing else links to them. In example 6 the only links to the article are on user pages. Who would ever think to look there for a reference to an article?
3. [[List of countries]] and [[United States]] should probably be linked from the Main Page. The number of paths through these is enormous.
4. Many of the links to [[United States]] are excessive. Many of the uses are in passing, where more information about the United States is unlikely to be needed. I think we can always assume a very basic level of understanding about what is meant by "United States". What would surprise me most about those who don't have that very basic level of understanding is how they managed to find Wikipedia in the first place.
Eclecticology
On Tue, Sep 10, 2002 at 12:24:27PM -0700, Ray Saintonge wrote:
I looked a little more into this, manually tracing the path of 10 randomly chosen articles. I don't know what it does to the automatic path-tracing idea, but it did lead to a number of observations.
[...]
Observations:
- In the samples the longest minimum path to the Main Page was only 4
articles. Any article linked from a user page would be 3 steps away from the user page, but this should not be considered a meaningful path.
- Two kinds of effectively orphan pages became evident, but these would
never appear on the special page listing of orphans. In the first example two pages link to each other but nothing else links to them. In example 6 the only links to the article are on user pages. Who would ever think to look there for a reference to an article?
- [[List of countries]] and [[United States]] should probably be
linked from the Main Page. The number of paths through these is enormous.
- Many of the links to [[United States]] are excessive. Many of the
uses are in passing, where more information about the United States is unlikely to be needed. I think we can always assume a very basic level of understanding about what is meant by "United States". What would surprise me most about those who don't have that very basic level of understanding is how they managed to find Wikipedia in the first place.
I have done much more complex and almost-automatic topological analysis (of Polish Wikipedia).
If you can read Polish or think that you can find out what's going on by just looking at numbers and lists, check: http://pl.wikipedia.com/wiki.cgi?Taw/Topologia_Wikipedii (stats are a couple days old)
Things that are done before the computations:
* all empty, talk and user pages are removed
* all links to redirects are replaced by links to the final articles, and then the redirects are removed
(About 1) Stats for the Polish Wikipedia:
* not accessible: 227 (4.72%)
* main page: 1 (0.02%)
* 1 hop: 78 (1.62%)
* 2 hops: 1199 (24.91%)
* 3 hops: 2492 (51.78%)
* 4 hops: 614 (12.76%)
* 5 hops: 175 (3.64%)
* 6 hops: 22 (0.46%)
* 7 hops: 5 (0.10%)
(About 2) Many more interesting patterns can be found. Don't forget about articles linked from talk pages and yearbook pages.
(About 3) One of the most interesting things computed is the "importance of links on the main page". The algorithm is simple: the sum of distances from the main page to each non-orphan node is computed, and a link is as valuable as the amount by which it improves this sum. Both links-to-be-added and links-to-be-removed are computed. We now have a rather more useful main page.
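If I understand the algorithm correctly, a minimal sketch of that computation would look something like the following, where graph is a hypothetical map from each page to the pages it links to (pages that only become reachable through the new link would need separate handling, which this sketch ignores):

    from collections import deque

    def distance_sum(graph, start):
        # Sum of link distances from start to every reachable (non-orphan) page.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            page = queue.popleft()
            for nxt in graph.get(page, ()):
                if nxt not in dist:
                    dist[nxt] = dist[page] + 1
                    queue.append(nxt)
        return sum(dist.values())

    def link_value(graph, main_page, target):
        # Value of adding main_page -> target: how much the distance sum improves.
        before = distance_sum(graph, main_page)
        modified = {page: set(links) for page, links in graph.items()}
        modified.setdefault(main_page, set()).add(target)
        after = distance_sum(modified, main_page)
        return before - after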
(and ...) If anybody wants the scripts, tell me, but expect to do some work to adapt them to other Wikipedias.
Using the data set given, and assuming averaged daily growth between the given days, Wikipedia has since 2001-03-07 had an overall average daily growth of 0.632%.
The average growth rate (r) between two sampling days was calculated using
1+r=(d2/d1)^(1/n)
where d1 and d2 are the sample amounts on the first and second days and n is the number of days between samplings.
The 0.632% amount is a weighted mean of these results over a period of 551 days. Applying the formula:
n=log(100,000/42021)/log(1+r)
gives 138 days when rounded up to the nearest whole number. Thus the formula projects that article number 100,000 will be reached on 2003-01-24. Using the same techniques, growth in the last 30 days has been at the more modest rate of 0.410% per day. Projecting this gives a figure of 212 days, or 2003-04-08.
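For anyone who wants to reproduce these numbers, a small script along these lines (simply plugging in the figures quoted above) should give the same results:

    import math

    def daily_rate(d1, d2, n):
        # 1 + r = (d2/d1)^(1/n), so r = (d2/d1)^(1/n) - 1
        return (d2 / d1) ** (1.0 / n) - 1

    def days_to_reach(target, current, r):
        # n = log(target/current) / log(1 + r), rounded up
        return int(math.ceil(math.log(target / current) / math.log(1 + r)))

    print(days_to_reach(100000, 42021, 0.00632))   # 138 days, i.e. about 2003-01-24
    print(days_to_reach(100000, 42021, 0.00410))   # 212 days, i.e. about 2003-04-08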
At 2002-09-10 14:09 -0700, Ray Saintonge wrote:
Using the data set given, and assuming averaged daily growth between the given days, Wikipedia has since 2001-03-07 had an overall average daily growth of 0.632%.
The average growth rate (r) between two sampling days was calculated using
1+r=(d2/d1)^(1/n)
where d1 and d2 are the sample amounts on the first and second days and n is the number of days between samplings.
The 0.632% amount is a weighted mean of these results over a period of 551 days. Applying the formula:
n=log(100,000/42021)/log(1+r)
gives 138 days when rounded up to the nearest whole number. Thus the formula projects that article number 100,000 will be reached on 2003-01-24. Using the same techniques, growth in the last 30 days has been at the more modest rate of 0.410% per day. Projecting this gives a figure of 212 days, or 2003-04-08.
And suppose I hadn't wasted 10 years of my life on a technical university, how would you explain this to me?
What for example is the growth per year?
Greetings, Jaap
Jaap van Ganswijk wrote:
At 2002-09-10 14:09 -0700, Ray Saintonge wrote:
Using the data set given, and assuming averaged daily growth between the given days, Wikipedia has since 2001-03-07 had an overall average daily growth of 0.632%.
The average growth rate (r) between two sampling days was calculated using
1+r=(d2/d1)^(1/n)
where d1 and d2 are the sample amounts on the first and second days and n is the number of days between samplings.
The 0.632% amount is a weighted mean of these results over a period of 551 days. Applying the formula:
n=log(100,000/42021)/log(1+r)
gives 138 days when rounded up to the nearest whole number. Thus the formula projects that article number 100,000 will be reached on 2003-01-24. Using the same techniques, growth in the last 30 days has been at the more modest rate of 0.410% per day. Projecting this gives a figure of 212 days, or 2003-04-08.
And suppose I hadn't wasted 10 years of my life on a technical university, how would you explain this to me?
What for example is the growth per year?
The underlying premise is that growth is exponential. People more commonly encounter this with compound interest calculations. Thus $1,000 invested at 12% for one year will give $1,120 at the end of the year. If it is compounded semi-annually it will give 1.06 * 1.06 * 1000 or $1,123.60 at the end of the year. If it is compounded monthly it will give (1.01)^12 * 1000 = $1,126.83 at the end of the year. The calculations that I made are similar, although I have not taken into account any limitations that may exist upon Wikipedia's growth.
The annual growth rate based on 0.632% per day would be (1.00632)^365 - 1 = 896.861%. Based on 0.410% per day it would be 345.239%. These figures do seem quite high, but for a reality check Wikipedia's size on September 9 of this year was 42,021 and on September 9, 2001 it was 11,208. 42021/11208 is 3.74920, i.e. growth of 274.920%, but this does include some periods when the growth was considerably lower than it has been in the last 30 days.
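A quick sanity check of those figures, assuming nothing more than simple compounding:

    # Annualized rates from the daily rates, plus the year-over-year reality check.
    print((1.00632 ** 365 - 1) * 100)   # about 897% per year at 0.632% per day
    print((1.00410 ** 365 - 1) * 100)   # about 345% per year at 0.410% per day
    print(42021 / 11208)                # about 3.749, i.e. growth of about 275%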
Eclecticology
Ray Saintonge wrote:
The underlying premise is that growth is exponential. People more commonly encounter this with compound interest calculations. Thus $1,000 invested at 12% for one year will give $1,120 at the end of the year. If it is compounded semi-annually it will give 1.06 * 1.06 * 1000 or $1,123.60 at the end of the year. If it is compounded monthly it will give (1.01)^12 * 1000 = $1,126.83 at the end of the year. The calculations that I made are similar, although I have not taken into account any limitations that may exist upon Wikipedia's growth.
The annual growth rate based on 0.632% per day would be (1.00632)^365
- 1 = 896.861%
Based on 0.410% per day it would be 345.239%. These figures do seem quite high, but for a reality check Wikipedia's size on September 9 of this year was 42,021 and on September 9, 2001 it was 11,208. 42021/11208 is 3.74920, i.e. growth of 274.920%, but this does include some periods when the growth was considerably lower than it has been in the last 30 days.
Eclecticology
The graph half way down at [[Wikipedia:Size of Wikipedia]] illustrates this rather nicely.
It looks exponential to me, with a kink for the Great Slowdown of the Phase II software. Recent growth is about 217 articles/day for a size of about 42000 articles, and that's about 0.5% / day.
Extrapolated to 1 year, that's growth of about 500% (ie a factor of six size ratio) per year. The implications of this are huge, ''if'' this sort of growth rate keeps up. In a year's time, we can expect not 100,000 articles, but over 250,000. Then -- almost unbelievably -- 3 million the next year.
This suggests that we will definitely need some more scaling features in the software sooner rather than later.
Neil
Hi Neil and Ray,
At 2002-09-11 15:55 +0100, Neil Harris wrote:
Ray Saintonge wrote:
The underlying premise is that growth is exponential. People more commonly encounter this with compound interest calculations. Thus $1,000 invested at 12% for one year will give $1,120 at the end of the year. If it is compounded semi-annually it will give 1.06 * 1.06 * 1000 or $1,123.60 at the end of the year. If it is compounded monthly it will give (1.01)^12 * 1000 = $1,126.83 at the end of the year. The calculations that I made are similar, although I have not taken into account any limitations that may exist upon Wikipedia's growth.
The annual growth rate based on 0.632% per day would be (1.00632)^365
- 1 = 896.861%
Based on 0.410% per day it would be 345.239%. These figures do seem quite high, but for a reality check Wikipedia's size on September 9 of this year was 42,021 and on September 9, 2001 it was 11,208. 42021/11208 is 3.74920, i.e. growth of 274.920%, but this does include some periods when the growth was considerably lower than it has been in the last 30 days.
I know what exponential behaviour is, I was just hoping you'd give the figures in a clearer way instead of as a formula. It's usual to give the growth per year as a percentage and/or to give the amount of time in which the amount doubles.
The graph half way down at [[Wikipedia:Size of Wikipedia]] illustrates this rather nicely.
It looks exponential to me, with a kink for the Great Slowdown of the Phase II software. Recent growth is about 217 articles/day for a size of about 42000 articles, and that's about 0.5% / day.
Looks very linear to me.
And I think anyway, that the process will be more linear than exponential.
There are several aspects:
- When the number of people contributing stays fixed and they write a fixed number of articles per time unit, the growth will be linear.
- People may get bored or frustrated, however, and produce fewer articles. They may also lack the knowledge to write about anything other than their favorite subjects. Even if they did write about non-favorite subjects, it would go more slowly because they would have to do more research.
- People will also spend time on improving articles instead of writing new ones, and this gets worse the more articles there are.
- However, new people will join the club, and therefore super-linear behaviour could occur, but I think that the new people will at most counteract the amount by which the others start writing fewer articles.
- Even when people don't have to write articles themselves but can copy and edit them, the sources that they can easily copy from may dry up over time.
- And a major argument against super-linear behaviour of the growth is that the bigger the database becomes, the more complicated and time-consuming the interrelations will get, which with a fixed staff would let the growth tend towards logarithmic behaviour.
Given all these factors and the current graph, I think that the growth is more likely to be linear (and we should be happy enough with that).
Greetings, Jaap
Jaap van Ganswijk wrote:
Hi Neil and Ray,
I know what exponential behaviour is, I was just hoping you'd give the figures in a clearer way instead of as a formula. It's usual to give the growth per year as a percentage and/or to give the amount of time in which the amount doubles.
I did use an annual growth rate in my previous response, and Neil's comments seem to have answered the second approach. I'm sure that some of our more mathematically challenged Wikipedians will run the other way at the sight of any mathematical formula, but it was only fair for those who might want to pursue the matter further to know how I arrived at my view.
It looks exponential to me, with a kink for the Great Slowdown of the Phase II software. Recent growth is about 217 articles/day for a size of about 42000 articles, and that's about 0.5% / day.
Looks very linear to me.
And I think anyway, that the process will be more linear than exponential.
My projection was a hypothesis that is as subject to the constraints of the scientific method as any other. Choosing another data set could have given different results.
- When the number of people contributing stays fixed and they write a fixed number of articles per time unit, the growth will be linear.
Yes, but is the number of people contributing really staying fixed? People who only make a single contribution (including vandals) to Wikipedia are also contributors. What is the relationship between the number of such people in the last thirty days and the number of such people in the preceding 30 days? Any growth there is a function of people finding out that Wikipedia exists.
- People may get bored or frustrated, however, and produce fewer articles. They may also lack the knowledge to write about anything other than their favorite subjects. Even if they did write about non-favorite subjects, it would go more slowly because they would have to do more research.
I suspect that the proportion of people who have a 500 article exhaustion level will be relatively constant.
- People will also spend time on improving articles instead of writing new ones, and this gets worse the more articles there are.
Probably another relative constant. Improving articles includes splitting off sections into "new" articles when they get too long.
- However, new people will join the club, and therefore super-linear behaviour could occur, but I think that the new people will at most counteract the amount by which the others start writing fewer articles.
Subject to verification. See my comments above re one-time contributors.
- Even when people don't have to write articles themselves but can copy and edit them, the sources that they can easily copy from may dry up over time.
This is one of our limits to growth in the long run, but I don't see it as a factor in the near future.
- And a major argument against super-linear behaviour of the growth is that the bigger the database becomes, the more complicated and time-consuming the interrelations will get, which with a fixed staff would let the growth tend towards logarithmic behaviour.
Given all these factors and the current graph, I think that the growth is more likely to be linear (and we should be happy enough with that).
Indeed we should be happy with it.
In the spirit of compromise, perhaps the growth rate is now exponential but in the long term the rate of growth will be asymptotic to a linear function.
Eclecticology
On Thu, Sep 12, 2002 at 08:35:22PM +0200, Jaap van Ganswijk wrote:
Given all these factors and the current graph, I think that the growth is more likely to be linear (and we should be happy enough with that).
I don't know about the English Wikipedia, but all the others certainly grow at an exponential rate. If that isn't the case with the English Wikipedia, then maybe the software is the bottleneck and too much time is spent on maintenance tasks.
Ray Saintonge wrote:
The underlying premise is that growth is exponential. People more commonly encounter this with compound interest calculations. Thus $1,000 invested at 12% for one year will give $1,120 at the end of the year. If it is compounded semi-annually it will give 1.06 * 1.06 * 1000 or $1,123.60 at the end of the year. If it is compounded monthly it will give (1.01)^12 * 1000 = $1,126.83 at the end of the year. The calculations that I made are similar, although I have not taken into account any limitations that may exist upon Wikipedia's growth.
The annual growth rate based on 0.632% per day would be (1.00632)^365
- 1 = 896.861%
Based on 0.410% per day it would be 345.239%. These figures do seem quite high, but for a reality check Wikipedia's size on September 9 of this year was 42,021 and on September 9, 2001 it was 11,208. 42021/11208 is 3.74920, i.e. growth of 274.920%, but this does include some periods when the growth was considerably lower than it has been in the last 30 days.
Eclecticology
The graph half way down at [[Wikipedia:Size of Wikipedia]] illustrates this rather nicely.
It looks exponential to me, with a kink for the Great Slowdown of the Phase II software. Recent growth is about 217 articles/day for a size of about 42000 articles, and that's about 0.5% / day.
Extrapolated to 1 year, that's growth of about 500% (ie a factor of six size ratio) per year. The implications of this are huge, ''if'' this sort of growth rate keeps up. In a year's time, we can expect not 100,000 articles, but over 250,000. Then -- almost unbelievably -- 3 million the next year.
This suggests that we will definitely need some more scaling features in the software sooner rather than later.
Neil
On Tue, 10 Sep 2002 14:09:33 -0700 Ray Saintonge saintonge@telus.net wrote:
<information suggesting a non-linear growth curve for Wikipedia>
I have seen messages talking about changing the database engine from MySQL to PostgreSQL to fix table locking problems on a busy system.
I am concerned that this _type_ of engineering work may not be what is really needed.
My contention is that Wikipedia load can grow at an exponential rate but may be constrained by resource availability. There are many factors which cause self-multiplication.
Decisions which need to be made:
1) Do we want Wikipedia to be _able_ to grow at an exponential rate?
If yes:
a) We need to consider a technical system which can be put in place to distribute load such that no one system needs to handle all the load.
b) Consider whether the current social system of regulation can scale to meet demand, and monitor this.
c) Keep a conscious review open to ensure the quality of Wikipedia under such exponential growth, and consider adding constraints to growth if such a growth rate starts causing undesirable effects.
If no:
a) Consider how availability of the system will be limited in order to prevent exponential growth, and at what rate, if any, availability is extended.
b) Consider which parts of the system are best rationed to limit the growth rate, i.e. should searches, page views or edits be limited?
From my experience using the system over the last few days, I perceive there is currently a technical constraint limiting the rate of growth. This may be desirable, or it may be undesirable. Do we know which it is? Has an explicit decision been made?
A scalable solution is to give nearly all responsibility for all wiki functionality to mirror servers. Updates are posted directly to the main Wiki server which in turn posts the database updates to registered first tier mirrors which, in turn, can post database updates to second tier mirrors registered with them and so on. This way, all mirrors can be kept in sync in near real time with a minimum of CPU, memory and network load. The main server then need do nothing other than maintain database consistency, accept and post updates.
On Tue, 12 Nov 2002 11:56:31 +0000 Nick Hill nick@nickhill.co.uk wrote:
A scalable solution is to give nearly all responsibility for all wiki functionality to mirror servers. Updates are posted directly to the main Wiki server which in turn posts the database updates to registered first tier mirrors which, in turn, can post database updates to second tier mirrors registered with them and so on.
With such a scheme, IP blocking and anti-vandalism features would still be implemented in much the same way as they are now, on the main server, where the master database is held. The master server would handle the HTML form submissions.
This way, all mirrors can be kept in sync in near real time with a minimum of CPU, memory and network load. The main server then need do nothing other than maintain database consistency, accept and post updates.
The update system can be achieved by either:
1) The main server creating SQL files to be emailed to mirror servers, signed with a key pair and sequentially numbered to ensure they are automatically processed in order. This way the server can run asynchronously with the mirrors, which is better for the reliability of the server. The server will not need to wait for connection responses from the mirrors, and updates will be cached in the mail system should a mirror be unavailable. The server will only need to create one email per update; the mail system infrastructure will take care of sending the data to each mirror. In fact, a system such as the pipermail setup used on this list would solve the problem wonderfully: mirror admins simply subscribe to the list to get all updates sent to their machine, and can manually download any updates they are missing from the list! (A rough sketch of the mirror side of this option appears after option 2 below.)
Or
2) The master server opening a connection directly to the SQL daemon on each remote machine. In this case the server will need to track which updates each mirror has and has not received, and will need to wait for time-outs on non-operational mirrors. (This approach may also open exploits on the server via the SQL interface.)
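As a rough illustration of option 1 (all paths and helper names here are hypothetical placeholders, and the signature check and database import are left as stubs), the mirror side could apply the sequentially numbered, signature-checked update files strictly in order, something like:

    import os
    import re

    UPDATE_DIR = "/var/spool/wikipedia-updates"      # hypothetical locations
    STATE_FILE = "/var/spool/wikipedia-last-applied"

    def last_applied():
        try:
            return int(open(STATE_FILE).read())
        except (IOError, ValueError):
            return 0

    def pending_updates():
        # Return (sequence number, path) pairs for queued update files, in order.
        entries = []
        for name in os.listdir(UPDATE_DIR):
            m = re.match(r"update-(\d+)\.sql$", name)
            if m:
                entries.append((int(m.group(1)), os.path.join(UPDATE_DIR, name)))
        return sorted(entries)

    def apply_pending(verify_signature, apply_sql):
        # verify_signature(path) and apply_sql(text) are placeholders for the
        # key-pair check and the local database import.
        expected = last_applied() + 1
        for seq, path in pending_updates():
            if seq < expected:
                continue        # already applied
            if seq > expected:
                break           # a file is missing; wait rather than skip it
            if not verify_signature(path):
                break           # refuse unsigned or tampered updates
            apply_sql(open(path).read())
            open(STATE_FILE, "w").write(str(seq))
            expected = seq + 1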
On Mon, Sep 09, 2002 at 08:34:24PM +0200, Axel Boldt wrote:
I think that a subject classification of articles would vastly improve "soft security" and would save regulars a lot of time, since not everyone would have to check every edit as currently seems to be the case.
I'd still like to see if we couldn't build those subjects automatically in some way based on links in the database.
How about this: the possible topics coincide with the major pages listed on [[Main Page]] (from "Astronomy" to "Visual Arts"). The shortest link path from such a topic page to an article defines that article's topic. If there is no such path, then the article is classified as a topic orphan.
To compute these topics quickly, the cur table gets two new columns: topic and distance, where distance stands for the link distance from the Main Page topic page. If a new article is created, looking at the distance entries of all articles that link to the new one, and taking the minimum, immediately classifies the new one. If an existing article is saved, the topic and distance entries of all articles it links to (and their children) may need to be updated; these changes can be propagated in a recursive manner.
Would that work?
1. No, it wouldn't.
Both deletion and creation of links are hard problems:
Main page -> Biology -> A1 -> Chemistry -> A2 -> A3 -> A4 -> Target
A5 -> Target
Now what happens when we add a link from A1 to A5? There are lots of links from A5 to other articles; recursion here would mean recalculating a major part of the topology.
Deletion of any of the links in the current shortest path (if we store it somewhere) requires recalculation of the whole topology too.
2. But it would be possible to create the initial classification that way.
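A minimal sketch of such an initial classification, assuming a hypothetical links_from map built from the links table and taking the topic pages from [[Main Page]], might be:

    from collections import deque

    def initial_classification(topic_pages, links_from):
        # topic_pages: the major pages listed on [[Main Page]]
        # links_from: hypothetical map, page -> iterable of pages it links to
        topic, distance = {}, {}
        queue = deque()
        for t in topic_pages:
            topic[t], distance[t] = t, 0
            queue.append(t)
        while queue:
            page = queue.popleft()
            for nxt in links_from.get(page, ()):
                if nxt not in distance:          # first visit = shortest distance
                    topic[nxt] = topic[page]
                    distance[nxt] = distance[page] + 1
                    queue.append(nxt)
        # Anything missing from these maps is a topic orphan.
        return topic, distance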
I don't think it would be a good idea to hardwire article *subject categories* at all. We had that discussion some time ago, as Lee said. I was talking about *article types*.
On 9 Sep 2002, at 20:34, Axel Boldt wrote:
I think that a subject classification of articles would vastly improve "soft security" and would save regulars a lot of time, since not everyone would have to check every edit as currently seems to be the case.
Maybe a way around it would be to have a new level of op, say op1, which would be awarded to anyone who has had an account for 30 days or so and hasn't been banned.
Whenever someone who isn't at op1 level makes an edit to a page, an "edit check" counter appears and counts days. When anyone with op1 status looks at the page after checking it for vandalism, they could reset the counter back to zero.
That way we wouldn't get multiple people needlessly checking the same page for vandalism, and we could ensure that every newbie edit was checked for vandalism.
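A rough sketch of the bookkeeping this would need (an in-memory stand-in here; in practice it would presumably be a column or small table in the database):

    import time

    unchecked_since = {}   # page title -> time of the oldest unreviewed newbie edit

    def record_edit(page, editor_is_op1):
        # Start the counter when a non-op1 user edits a page that isn't already flagged.
        if not editor_is_op1 and page not in unchecked_since:
            unchecked_since[page] = time.time()

    def mark_checked(page, reviewer_is_op1):
        # An op1 user who has looked the page over resets the counter to zero.
        if reviewer_is_op1:
            unchecked_since.pop(page, None)

    def days_unchecked(page):
        started = unchecked_since.get(page)
        return 0.0 if started is None else (time.time() - started) / 86400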
How about this: the possible topics coincide with the major pages listed on [[Main Page]] (from "Astronomy" to "Visual Arts"). The shortest link path from such a topic page to an article defines that article's topic. If there is no such path, then the article is classified as a topic orphan.
An alternative idea:
For any page, follow all the links from it down to about 3-4 levels and assume these are all on related topics. To make this more accurate we could follow only two-way links. Then strip out any article which has more than, say, 50 double links, as it's likely to be the front page or something similar that is unrelated to the topic.
Not only would this provide autoclassification, but we could also use it to find pages that need to be written on a specific topic, by automatically generating a list of unwritten articles related to that topic.
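A rough sketch of that idea, assuming a hypothetical links_from map (page -> set of pages it links to) and using the depth and 50-link threshold suggested above:

    def two_way_links(page, links_from):
        # Pages that both link to and are linked from the given page.
        return {p for p in links_from.get(page, set())
                if page in links_from.get(p, set())}

    def related_pages(start, links_from, depth=4, hub_limit=50):
        related = {start}
        frontier = {start}
        for _ in range(depth):
            next_frontier = set()
            for page in frontier:
                for p in two_way_links(page, links_from):
                    # Skip hub pages (front page, big lists) with too many
                    # reciprocal links; they are unlikely to share the topic.
                    if len(two_way_links(p, links_from)) > hub_limit:
                        continue
                    if p not in related:
                        related.add(p)
                        next_frontier.add(p)
            frontier = next_frontier
        return related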
Imran
Imran Ghory wrote:
How about this: the possible topics coincide with the major pages listed on [[Main Page]] (from "Astronomy" to "Visual Arts"). The shortest link path from such a topic page to an article defines that article's topic. If there is no such path, then the article is classified as a topic orphan.
An alternative idea:
For any page, follow all the links from it down to about 3-4 levels and assume these are all on related topics. To make this more accurate we could follow only two-way links. Then strip out any article which has more than, say, 50 double links, as it's likely to be the front page or something similar that is unrelated to the topic.
I think that this would be more problematic than using "what links here". The links on a page include ones to years and countries, where the discussion usually has nothing to do with our subject of interest. A page in "what links here" has a more specific reason to link to our subject.
Eclecticology