2008/9/18 Quse Guy <quseguy(a)yahoo.com>:
Greetings all,
Please excuse me if my message is inappropriate for this list, but I'm
looking for a place to start and unsure where to begin.
I'm trying to get some sense of the scope of the Template namespace on the
English-language Wikipedia: anything from sheer numbers, to which templates
are the most edited, to which are the most used (either in terms of total
number of transclusions/What Links Here, or else actual number of "hits").
To be a bit more specific, I'm particularly interested in those Templates
which are in the following two categories:
* http://en.wikipedia.org/wiki/Category:Navbox_(navigational)_templates
* http://en.wikipedia.org/wiki/Category:Infobox_templates
But both of these categories consist of a large number of subcategories,
sub-sub-categories, sub-sub-sub-...categories, making it difficult to
attempt even a basic count of all the Navbox and Infobox templates. Of
course, determining which Navboxes and Infoboxes are the most edited or the
most used would be impossible to ascertain manually.
I haven't written an SQL statement since taking a database course in 1998,
so I'm wary of downloading one of the database dumps and attempting to
manipulate things on my own. Nor am I even certain that the data included
in the dumps would allow me to aggregate across sub-sub-sub...categories, or
derive edit counts or use counts.
Perhaps there's a GUI tool or interface that would be helpful in compiling
these stats? Or perhaps these statistics are readily available and I simply
haven't looked in the right places :)
Your best bet is really to befriend a techie to construct the SQL
query for you, then submit it via the Query service:
<https://wiki.toolserver.org/view/Query_service>
However............... you could kinda-sorta hack something together
using the API. <http://en.wikipedia.org/w/api.php>
<http://www.mediawiki.org/wiki/API>
Here are my examples using the Python client mwclient
<https://mwclient.svn.sourceforge.net/svnroot/mwclient/trunk/mwclient>
and the interactive Python commandline:
initialise stuff
>>> import mwclient
>>> site = mwclient.Site('en.wikipedia.org')
>>> topcat = 'Category:Infobox templates'
So first, try and get all the subcats below the topcat in the tree by
recursing (very crudely) through them. Note namespace 14 is the
Category namespace.
>>> allcats = []
>>> allcats.append(topcat)
>>> subcats = [p.name for p in site.Pages[topcat] if p.namespace == 14]
>>> while len(subcats) > 0:
...     newsubcats = []
...     for s in subcats:
...         allcats.append(s)
...         newsubcats += [p.name for p in site.Pages[s] if p.namespace == 14]
...     subcats = newsubcats
...
(I'm actually not sure this works. I got bored and killed it, and
len(allcats) was already 429. It would be better to be more strict and
record which categories we have checked, to cut off potential cycles
in the category graph.)
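The cycle problem mentioned above could be handled by remembering which categories have already been seen. What follows is only a minimal sketch and has not been run against the live site. The names collect_categories and get_subcats are invented here; get_subcats stands in for whatever returns the subcategory names of one category (for example the site.Pages[...] list comprehension used earlier). The toy graph is made up and contains a deliberate cycle.

```python
from collections import deque


def collect_categories(topcat, get_subcats):
    """Return every category reachable from topcat, each listed once."""
    seen = {topcat}
    order = [topcat]
    queue = deque([topcat])
    while queue:
        cat = queue.popleft()
        for sub in get_subcats(cat):
            if sub not in seen:  # already visited, so skip it (this breaks cycles)
                seen.add(sub)
                order.append(sub)
                queue.append(sub)
    return order


# Made-up toy graph with a cycle (C links back to A), purely for illustration.
toy_graph = {
    'Category:A': ['Category:B', 'Category:C'],
    'Category:B': ['Category:C'],
    'Category:C': ['Category:A'],
}
allcats = collect_categories('Category:A', lambda c: toy_graph.get(c, []))
print(allcats)
```

With mwclient, get_subcats could presumably be something like lambda c: [p.name for p in site.Pages[c] if p.namespace == 14], so the loop terminates even if the category graph loops back on itself.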
Anyway, assuming that actually works, now we want to only get the
templates in those categories. Note the Template namespace is 10.
>>> alltemplates = []
>>> for cat in allcats:
...     for p in site.Pages[cat]:
...         if p.namespace == 10:
...             alltemplates.append(p.name)
...
OK so now, for each template, we want to find out how many pages it is
embedded in. (Templates are embedded when they are referenced in
{{curly brackets}}, as opposed to regular old [[links]]). To be more
careful here there is probably a way to only get the embeddedin
results for the main namespace (ie, use in articles).
>>> embeddict = {}
>>> for t in alltemplates:
...     template = site.Pages[t]
...     embedtotal = len(list(template.embeddedin()))
...     embeddict[t] = embedtotal
...
This also takes a particularly long time. (I killed my process so my
examples below are truncated)
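On restricting the count to articles: as far as I know, mwclient's embeddedin() accepts a namespace argument, so passing namespace=0 should limit the results to the main namespace. Treat that as an assumption and check it against your mwclient version. The sketch below also counts lazily rather than building a full list. count_embedded and FakePage are names invented for this example; FakePage is an offline stand-in so the snippet runs without touching the network.

```python
def count_embedded(page, namespace=0):
    """Count pages that transclude `page`, limited to one namespace.

    Assumes `page` offers an mwclient-style embeddedin(namespace=...) method.
    Items are counted one by one instead of being collected into a list.
    """
    return sum(1 for _ in page.embeddedin(namespace=namespace))


class FakePage(object):
    """Offline stand-in for an mwclient page (invented for this example)."""

    def __init__(self, users):
        self.users = users  # list of (namespace, title) pairs

    def embeddedin(self, namespace=None):
        for ns, title in self.users:
            if namespace is None or ns == namespace:
                yield title


fake = FakePage([
    (0, 'Some article'),
    (2, 'User:Someone/sandbox'),
    (0, 'Another article'),
])
print(count_embedded(fake))        # main namespace only
print(count_embedded(fake, None))  # every namespace
```

Against the real site the call would be count_embedded(site.Pages[t]) inside the loop over alltemplates.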
>>> embeddict.values()
[1, 6, 60, 141, 2, 2, 47, 0, 88, 19, 2, 212, 186, 1, 595, 76, 17, 444,
13, 70, 15, 87, 5, 11, 0, 102, 25, 356, 289, 1, 272, 184, 14, 2, 0,
14, 7, 2, 1407, 20, 0, 7, 32, 19, 0, 63, 1065, 31, 57, 72, 0, 2, 47,
5, 797, 3, 16, 3, 43, 99, 295, 14, 22, 0, 10, 9, 150, 6, 1, 1, 132, 5,
6, 110, 7, 42, 200, 58]
We can get a bit of an idea of really high usage.
>>> for k in embeddict.keys():
...     v = embeddict[k]
...     if v > 1000:
...         print k, v
...
Template:Infobox Website 1407
Template:Infobox Organization 1065
embeddict.keys() will also serve as a list of all the templates in
that whole category tree.
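Rather than picking a fixed threshold, the dictionary could also be sorted to give a ranking. A small sketch: top_templates is a name invented here, and in the sample data only the first two counts come from the session above, while the third entry is a made-up placeholder.

```python
def top_templates(embeddict, n=10):
    """Return the n (template, count) pairs with the highest counts."""
    return sorted(embeddict.items(), key=lambda kv: kv[1], reverse=True)[:n]


# First two counts are from the session above; the last entry is hypothetical.
sample = {
    'Template:Infobox Website': 1407,
    'Template:Infobox Organization': 1065,
    'Template:Infobox Example': 3,  # made-up placeholder
}
for name, count in top_templates(sample, n=2):
    print('%s %d' % (name, count))
```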
Determining the most edits could also be done via the API. But I would
question the relevance of this, as (A) all of these templates will be
highly nested and dependent on other templates, so perhaps they should
be considered too, and (B) all these templates are very likely to be
highly complex and the vast majority of users will be discouraged
implicitly and usually explicitly from editing them.
As for "hits", do you mean views of the Template: page? This can be
determined from some recent pageview statistics released
(<http://dammit.lt/wikistats/>, see <http://stats.grok.se/> as an
example), but probably more relevant is how many times the template is
viewed when it is used on articles.
You could try adding up the pageviews of the various articles a
template is embedded in (and again this is available via the API) and
this would probably be reasonably reliable, given that these templates
are at the top of pages and so if the article is loaded/viewed, very
likely the template is too. There is a slight difficulty in that it is
hard to figure out when the template was added to a particular
article; obviously pageviews before that time wouldn't have seen the
template. I suspect the only way to figure that out is to wade through
page revisions and that is also possible via the API :) but it gives
me a bit of a headache just thinking about it.
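The adding-up step itself is simple once you have the two pieces of data. A rough sketch, assuming you already hold a list of the articles a template is embedded in and a dict of per-article view counts for some period (how you obtain either is up to you). estimate_template_views is a name invented here, all titles and numbers are made up, and the sketch ignores the question of when the template was added to each article.

```python
def estimate_template_views(articles, pageviews):
    """Sum pageviews over the articles a template is embedded in.

    articles: iterable of article titles that transclude the template.
    pageviews: dict mapping article title -> view count for some period.
    Articles with no recorded views count as zero.
    """
    return sum(pageviews.get(title, 0) for title in articles)


# All titles and numbers below are invented for illustration.
articles = ['Article one', 'Article two', 'Article three']
pageviews = {'Article one': 120, 'Article two': 30}
print(estimate_template_views(articles, pageviews))
```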
The people at DBPedia might be able to help you.
<http://dbpedia.org/About> I am pretty sure most of the data they
extract from Wikipedia is from these kinds of templates, so they
probably have a good bunch of tips and tricks to share, too.
<plug> I also wrote a blog post about the history of and attitudes to
templates on English Wikipedia, which you might find interesting,
although probably not statistically relevant. :)
<http://brianna.modernthings.org/article/83/templatology-an-essay>
</plug>
cheers,
Brianna
--
They've just been waiting in a mountain for the right moment:
http://modernthings.org/