New subject: A data point -- page parsing on beta.wikipedia.com

10 Jul 2002


      ...
This might be a good idea anyway, since I don't see the need for
having a stats page that recalculates the number of articles for
each and every request (esp. since the main reason for reworking
the php wikiware was to maximize efficiency).
The site stats are calculated each time a page is /saved/, and
just directly fetched when queried (except for page read counts,
which naturally have to be updated on each page read).  This isn't a
performance issue at all.
The same with the "good" article count--it's only updated when
a page is saved, when either (1) a new page is saved that qualifies
as "countable", (2) a formerly uncountable page is edited to become
countable, or (3) a formerly countable page is edited to become
uncountable.  So the {{NUMBEROFPAGES}} query on the front page is
not a calculation, just a lookup (and a faster one than each of
the links).
The criterion for "countable" is flexible; right now, a page is
countable if it (1) does not belong to any namespace, and (2)
contains a comma.
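
To make the save-time bookkeeping concrete, here is a minimal Python sketch (not the actual PHP code). The names is_countable, SiteStats, and on_save are invented for illustration; only the "namespace 0 and contains a comma" rule and the three update cases come from the description above.

```python
# Hypothetical sketch: names are illustrative, not taken from the wiki code.
from typing import Optional

MAIN_NAMESPACE = 0  # regular encyclopedia articles


def is_countable(namespace: int, text: str) -> bool:
    """Current criterion: no namespace (i.e. namespace 0) and contains a comma."""
    return namespace == MAIN_NAMESPACE and "," in text


class SiteStats:
    """Keeps the "good" article count as a stored number, adjusted on save."""

    def __init__(self) -> None:
        self.good_articles = 0

    def on_save(self, namespace: int, old_text: Optional[str], new_text: str) -> None:
        # old_text is None when a brand-new page is saved.
        was = old_text is not None and is_countable(namespace, old_text)
        now = is_countable(namespace, new_text)
        # +1 for cases (1) and (2), -1 for case (3), 0 otherwise.
        self.good_articles += int(now) - int(was)

    def number_of_pages(self) -> int:
        # Reading the count is a plain lookup; nothing is recalculated.
        return self.good_articles
```

In this sketch, saving a new main-namespace page with a comma bumps the count by one, editing the comma out drops it again, and page reads never touch the counter.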
The details of how namespaces are handled are really more of a
techie issue, but since you brought it up here, I'll detail it here.
The full details, of course, can be found in the code.
"Namespace" is a separate field of the database from "Title"; in
fact it's an integer (the exact text of each one depends on the
language).  Regular encyclopedia articles have a namespace of 0,
"Talk" (or "Diskussion") pages have a namespace of 1, etc.  Things
like the search function simply add (namespace=0) to the query,
and never bother looking at the title (which may contain colons).
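
As a rough illustration of the search restriction, here is a small Python sketch using an in-memory SQLite database. The table name "pages", its columns, and the helper names are assumptions made for the example; the real schema is in the code.

```python
# Hypothetical sketch: the "pages" table and its columns are assumed, not the real schema.
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pages (namespace INTEGER, title TEXT, text TEXT)")
    conn.executemany(
        "INSERT INTO pages VALUES (?, ?, ?)",
        [
            (0, "2001:_A_Space_Odyssey", "A film about a space odyssey, and more."),
            (1, "2001:_A_Space_Odyssey", "Discussion of the space odyssey article."),
            (2, "X", "User page that mentions the space odyssey film."),
        ],
    )
    return conn


def search_articles(conn: sqlite3.Connection, term: str) -> list:
    # The namespace test is just one more condition on an integer field;
    # the title is never parsed, so colons in it do not matter.
    rows = conn.execute(
        "SELECT title FROM pages WHERE namespace = 0 AND text LIKE ? ORDER BY title",
        (f"%{term}%",),
    )
    return [title for (title,) in rows]
```

With this demo data, searching for "space odyssey" returns only the article row, even though the Talk and User pages also contain the phrase.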
The actual names of the namespaces come into play when interpreting
links.  For example, when the software sees a link to [[User:X]],
it grabs whatever appears before the first colon and looks to see
if it is a known namespace or Interwiki.  If it is, then the code
looks up the article with a query along the lines of (namespace=2
and title='X'), and so on.  If it's not a recognized prefix, then
it uses a query like (namespace=0 and title='2001:_A_Space_Odyssey').
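
The prefix check can be sketched in Python roughly as follows. The lookup tables, the resolve_link name, the tuple it returns, and the space-to-underscore step are assumptions for illustration (the last one inferred from the example title above); what happens to Interwiki links after they are recognized is not described here, so the sketch only tags them.

```python
# Hypothetical sketch: tables and return shape are illustrative only.
NAMESPACES = {"Talk": 1, "User": 2}  # names vary by language; numbers do not
INTERWIKI_PREFIXES = {"de"}          # assumed example prefix
MAIN_NAMESPACE = 0


def resolve_link(target: str) -> tuple:
    """Map the inside of [[...]] to (kind, namespace-or-prefix, title)."""
    # Split on the FIRST colon only; later colons stay in the title.
    prefix, colon, rest = target.partition(":")
    if colon and prefix in NAMESPACES:
        return ("local", NAMESPACES[prefix], rest.strip().replace(" ", "_"))
    if colon and prefix in INTERWIKI_PREFIXES:
        return ("interwiki", prefix, rest.strip().replace(" ", "_"))
    # Unrecognized prefix (or no colon): the whole string is a
    # main-namespace title, colon included.
    return ("local", MAIN_NAMESPACE, target.strip().replace(" ", "_"))
```

Under these assumptions, "User:X" resolves to namespace 2 with title "X", while "2001: A Space Odyssey" stays in namespace 0 with title "2001:_A_Space_Odyssey", matching the two queries above.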
"Image:" is magic on other levels as well, but that's more detail
for later.

Re: [Wikipedia-l] minor ? about stats in new codebase. + a proposal