This is a good discussion. The true dynamics won't be known until
you've got live users on the system, but based on what I've seen with
existing Wikipedia edits, the dynamics will be even more complex than
predicted so far (which is already pretty complex!).
Some other things to consider:
- the focus of Wikipedia articles drifts over time (with good feedback
loops built into the system, this should hopefully be
self-correcting)
- label/description disagreement occurs: the title says one thing, the
first few sentences (often all that people scan when working quickly)
say something different, and the article taken as a whole is about a
third thing
- you'll see different behavior depending on whether you track by
article number (internal ID) or article title
- the granularity of Wikipedia articles depends on the length of the
text, not just semantics. Concepts with lots of text get split across
multiple articles (e.g. WW II), while concepts which don't have much
written about them risk getting combined into composite articles about
multiple concepts.
- redirects are used for: aliases, misspellings, "see instead"
references to semantically different articles, and probably other
things that I'm not aware of. This mix of purposes makes it hard to do
anything meaningful with redirects as a single category.
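
To make the ID-versus-title point concrete, here is a rough sketch in
Python. The snapshot rows are made up for illustration, and the sketch
assumes that a renamed article keeps its internal ID while the old title
becomes a redirect with a new ID; check that assumption against whatever
data you actually use.

```python
from collections import defaultdict

# Hypothetical snapshots, invented for illustration only:
# (snapshot_date, page_id, title, is_redirect)
SNAPSHOTS = [
    ("2008-01", 101, "Foo", False),
    # Assumed rename: page 101 keeps its ID but gets a new title...
    ("2009-01", 101, "Foo (bar)", False),
    # ...and the old title lives on as a redirect with a new ID.
    ("2009-01", 202, "Foo", True),
]

def group_by(snapshots, key_index):
    """Group snapshot rows by the field at key_index."""
    groups = defaultdict(list)
    for row in snapshots:
        groups[row[key_index]].append(row)
    return dict(groups)

by_id = group_by(SNAPSHOTS, 1)     # track by internal ID
by_title = group_by(SNAPSHOTS, 2)  # track by title

# Tracking by ID follows the article through its rename.
titles_for_101 = [row[2] for row in by_id[101]]
# Tracking by title silently switches from the article to a redirect.
ids_for_foo = [row[1] for row in by_title["Foo"]]

print(titles_for_101)  # ['Foo', 'Foo (bar)']
print(ids_for_foo)     # [101, 202]
```

Neither view is wrong, but they answer different questions: the ID view
tracks one article's content across names, the title view tracks what a
reader typing that name would have landed on.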
Another source of data on the current articles and their behavior is
Freebase. Wikipedia-based topics which have been split or combined
retain an audit trail that lets you figure out what happened. It only
covers the last 5 years and only English Wikipedia, but within those
limitations it could provide some interesting insights. I'm happy to
help anyone who wants to work with this data.
Tom