Hi all
Apparently, as a side effect of the "about wikipedia projects" thread, some people (including myself) have started to put their names and projects on http://meta.wikimedia.org/wiki/Research.
I encourage everyone to do the same. It's a great way to get an overview and to find people to talk to.
-- Daniel
Greetings all,
Please excuse me if my message is inappropriate for this list, but I'm looking for a place to start and am unsure where to begin.
I'm trying to get some sense of the scope of the Template namespace on the English-language Wikipedia: anything from sheer numbers, to which templates are the most edited, to which are the most used (either in terms of total number of transclusions/What Links Here, or else actual number of "hits").
To be a bit more specific, I'm particularly interested in those Templates which are in the following two categories:
- http://en.wikipedia.org/wiki/Category:Navbox_(navigational)_templates
- http://en.wikipedia.org/wiki/Category:Infobox_templates
But both of these categories consist of a large number of subcategories, sub-sub-categories, sub-sub-sub-...categories, making it difficult to attempt even a basic count of all the Navbox and Infobox templates. And of course, working out manually which Navboxes and Infoboxes are the most edited or the most used would be impossible.
I haven't written an SQL statement since taking a database course in 1998, so I'm wary of downloading one of the database dumps and attempting to manipulate things on my own. Nor am I even certain that the data included in the dumps would allow me to aggregate across sub-sub-sub...categories, or derive edit counts or use counts.
Perhaps there's a GUI tool or interface that would be helpful in compiling these stats? Or perhaps these statistics are readily available and I simply haven't looked in the right places :)
Again, any advice on this matter would be most appreciated.
Regards, David
In reply to Quse Guy quseguy@yahoo.com (2008/9/18):
Your best bet is really to befriend a techie to construct the SQL query for you, then submit it via the Query service: https://wiki.toolserver.org/view/Query_service
However... you could kinda-sorta hack something together using the API:
http://en.wikipedia.org/w/api.php
http://www.mediawiki.org/wiki/API
Here are my examples using the Python client mwclient (https://mwclient.svn.sourceforge.net/svnroot/mwclient/trunk/mwclient) and the interactive Python command line:
Initialise stuff:
>>> import mwclient
>>> site = mwclient.Site('en.wikipedia.org')
>>> topcat = 'Category:Infobox templates'
So first, try to get all the subcats below the topcat in the tree by recursing (very crudely) through them. Note that namespace 14 is the Category namespace.
>>> allcats = []
>>> allcats.append(topcat)
>>> subcats = [p.name for p in site.Pages[topcat] if p.namespace==14]
>>> while len(subcats) > 0:
...     newsubcats = []
...     for s in subcats:
...         allcats.append(s)
...         newsubcats += [p.name for p in site.Pages[s] if p.namespace==14]
...     subcats = newsubcats
...
(I'm actually not sure this works. I got bored and killed it, and len(allcats) was already 429. It would be better to be more strict and record which categories we have checked, to cut off potential cycles in the category graph.)
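The stricter traversal mentioned above could look roughly like the following sketch. It keeps a set of categories already seen, so a cycle in the category graph cannot make the loop run forever. The function name walk_categories, the get_subcats callable, and the toy category names are all assumptions made for illustration; the toy graph stands in for the live site so the example runs offline.

```python
from collections import deque


def walk_categories(top, get_subcats):
    """Return every category reachable from `top`, visiting each one once."""
    seen = {top}
    order = [top]
    queue = deque([top])
    while queue:
        cat = queue.popleft()
        for sub in get_subcats(cat):
            # Skip categories we have already recorded; this breaks cycles.
            if sub not in seen:
                seen.add(sub)
                order.append(sub)
                queue.append(sub)
    return order


# Toy category graph (hypothetical names) with a deliberate cycle:
# Category:C lists Category:A as a subcategory.
toy_graph = {
    "Category:A": ["Category:B", "Category:C"],
    "Category:B": ["Category:D"],
    "Category:C": ["Category:A", "Category:D"],
    "Category:D": [],
}

allcats = walk_categories("Category:A", lambda c: toy_graph.get(c, []))
print(allcats)
```

Against the real site, get_subcats could be the same list comprehension used in the session above, for example lambda c: [p.name for p in site.Pages[c] if p.namespace == 14].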
Anyway, assuming that actually works, now we want to only get the templates in those categories. Note the Template namespace is 10.
>>> alltemplates = []
>>> for cat in allcats:
...     for p in site.Pages[cat]:
...         if p.namespace == 10:
...             alltemplates.append(p.name)
...
OK so now, for each template, we want to find out how many pages it is embedded in. (Templates are embedded when they are referenced in {{curly brackets}}, as opposed to regular old [[links]].) To be more careful here, there is probably a way to get the embeddedin results for the main namespace only (i.e., use in articles).
>>> embeddict = {}
>>> for t in alltemplates:
...     template = site.Pages[t]
...     embedtotal = len(list(template.embeddedin()))
...     embeddict[t] = embedtotal
...
This also takes a particularly long time. (I killed my process so my examples below are truncated)
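A possible refinement, sketched here under assumptions: count the results as they stream in rather than building a full list, and restrict the count to the main namespace. This assumes the installed mwclient's embeddedin() accepts a namespace keyword, which is worth checking against your version. FakePage is a hypothetical stand-in so the example runs without network access, and count_embeds is an illustrative name. Note that this does not reduce the number of API requests per template, so it will not fix the slowness on its own.

```python
def count_embeds(page, namespace=0):
    """Count pages that transclude `page` without holding them all in memory."""
    return sum(1 for _ in page.embeddedin(namespace=namespace))


class FakePage:
    """Offline stand-in for an mwclient page object (hypothetical)."""

    def __init__(self, embedders):
        # List of (namespace number, page title) pairs.
        self._embedders = embedders

    def embeddedin(self, namespace=None):
        # Yield titles lazily, filtered by namespace when one is given.
        for ns, title in self._embedders:
            if namespace is None or ns == namespace:
                yield title


page = FakePage([
    (0, "Example article"),
    (0, "Another article"),
    (2, "User:Someone/Sandbox"),
])

article_count = count_embeds(page)                # main namespace only
total_count = count_embeds(page, namespace=None)  # every namespace
print(article_count, total_count)
```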
>>> embeddict.values()
[1, 6, 60, 141, 2, 2, 47, 0, 88, 19, 2, 212, 186, 1, 595, 76, 17, 444, 13, 70, 15, 87, 5, 11, 0, 102, 25, 356, 289, 1, 272, 184, 14, 2, 0, 14, 7, 2, 1407, 20, 0, 7, 32, 19, 0, 63, 1065, 31, 57, 72, 0, 2, 47, 5, 797, 3, 16, 3, 43, 99, 295, 14, 22, 0, 10, 9, 150, 6, 1, 1, 132, 5, 6, 110, 7, 42, 200, 58]
We can get a bit of an idea of really high usage.
>>> for k in embeddict.keys():
...     v = embeddict[k]
...     if v > 1000:
...         print k, v
...
Template:Infobox Website 1407
Template:Infobox Organization 1065
embeddict.keys() will also serve as a list of all the templates in that whole category tree.
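To rank templates by usage instead of applying a fixed threshold, one option is to sort the dictionary by count. The sketch below uses the two template counts reported in this thread; the other two entries and the function name top_templates are hypothetical, added only to give the sort something to do.

```python
def top_templates(embeddict, n=10):
    """Return the n most-embedded templates as (name, count) pairs."""
    return sorted(embeddict.items(), key=lambda kv: kv[1], reverse=True)[:n]


embeddict = {
    "Template:Infobox Organization": 1065,    # count from the thread
    "Template:Infobox Hypothetical B": 0,     # made-up entry
    "Template:Infobox Website": 1407,         # count from the thread
    "Template:Infobox Hypothetical A": 595,   # made-up entry
}

ranked = top_templates(embeddict, n=3)
for name, count in ranked:
    print(name, count)
```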
Determining the most edits could also be done via the API. But I would question the relevance of this, as (A) all of these templates will be highly nested and dependent on other templates, so perhaps they should be considered too, and (B) all these templates are very likely to be highly complex and the vast majority of users will be discouraged implicitly and usually explicitly from editing them.
As for "hits", do you mean views of the Template: page? This can be determined from the recently released pageview statistics (http://dammit.lt/wikistats/, see http://stats.grok.se/ as an example), but probably more relevant is how many times the template is viewed when it is used on articles.
You could try adding up the pageviews of the various articles a template is embedded in (and again this is available via the API) and this would probably be reasonably reliable, given that these templates are at the top of pages and so if the article is loaded/viewed, very likely the template is too. There is a slight difficulty in that it is hard to figure out when the template was added to a particular article; obviously pageviews before that time wouldn't have seen the template. I suspect the only way to figure that out is to wade through page revisions and that is also possible via the API :) but it gives me a bit of a headache just thinking about it.
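The adding-up step can be sketched as follows, assuming you have already collected per-day pageviews for each article and worked out the date on which the template first appeared in it. All names and numbers here are hypothetical; in practice the views would come from the pageview statistics and the dates from the page revisions.

```python
from datetime import date


def estimated_template_views(daily_views, added_on):
    """Sum article views on days when the template was already on the article."""
    total = 0
    for title, first_day in added_on.items():
        for day, views in daily_views.get(title, {}).items():
            # Views before the template was added never showed the template.
            if day >= first_day:
                total += views
    return total


# Hypothetical per-day pageviews for two articles that embed the template.
daily_views = {
    "Article one": {date(2008, 9, 1): 100, date(2008, 9, 2): 120},
    "Article two": {date(2008, 9, 1): 40, date(2008, 9, 2): 50},
}

# Hypothetical dates on which the template was first added to each article.
added_on = {
    "Article one": date(2008, 8, 1),
    "Article two": date(2008, 9, 2),
}

total = estimated_template_views(daily_views, added_on)
print(total)
```

Here the 40 views of "Article two" on 1 September are excluded because the template was only added the next day.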
The people at DBPedia might be able to help you. http://dbpedia.org/About I am pretty sure most of the data they extract from Wikipedia is from these kinds of templates, so they probably have a good bunch of tips and tricks to share, too.
<plug> I also wrote a blog post about the history of and attitudes to templates on English Wikipedia, which you might find interesting, although probably not statistically relevant. :) http://brianna.modernthings.org/article/83/templatology-an-essay </plug>
cheers, Brianna
You might ask the folks at Freebase http://www.freebase.com/ for help. They gave a presentation at one of the SF-Bay area meetups recently and described how they've managed to extract data about templates and infoboxes from Wikipedia. I am fuzzy on the details but they can probably help... I believe most of their code is open source.
-- Phoebe
(In reply to Quse Guy quseguy@yahoo.com, Wed, Sep 17, 2008 at 6:49 PM.)
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l