Hi,
We have set up MediaWiki within our University and have a fair understanding
of it. However, we failed to test the block feature, and we discovered this
after setting it up and starting to use it :(
Blocking seems to stop the username from posting, but it does not block the
user from logging in. Has there been any work to either block logins for
blocked accounts or add an enabled flag in the user table?
Although wikis should be open communities, we use ours for internal
documentation, and when an employee leaves (or a student drops a course,
etc.), we would like to be able to delete the account or block it from
gaining read access. I am willing to work on this code if no one has done so
already, but I want the effort to go into the main source so I don't have to
redo it with every upgrade.
Has anyone done this/started this/talked about this?
Dion Rowney
U of S
I don't know when I first heard the plans for Wikidata, but one
year ago I proposed a more lightweight alternative approach on
the [[m:talk:Wikidata]] page. Then nothing happened, and today I
implemented it. It's 98 lines of Perl that processes an XML dump
and extracts template call parameter values. The source code can
be had from http://meta.wikimedia.org/wiki/User:LA2/Extraktor
The SQL dump of the templatelinks table already tells us which
pages call which templates. This script goes beyond that to get
information about each individual call parameter.
The output format is a very simple awk-friendly text file. For
example, the German Wikipedia page [[de:Anthony Hope]] contains
the two template calls
{{PND|11901842X}}
{{Personendaten|
NAME=Hope, Anthony
|ALTERNATIVNAMEN=Hawkins, Anthony Hope
|KURZBESCHREIBUNG=englischer [[Rechtsanwalt]] und [[Autor]]
|GEBURTSDATUM=[[9. Februar]] [[1863]]
|GEBURTSORT=[[London]]
|STERBEDATUM=[[8. Juli]] [[1933]]
|STERBEORT=
}}
For this page, the output contains:
PND|Anthony Hope|1|1|11901842X
Personendaten|Anthony Hope|2|NAME|Hope, Anthony
Personendaten|Anthony Hope|2|ALTERNATIVNAMEN|Hawkins, Anthony Hope
Personendaten|Anthony Hope|2|KURZBESCHREIBUNG|englischer [[Rechtsanwalt]] und [[Autor]]
Personendaten|Anthony Hope|2|GEBURTSDATUM|[[9. Februar]] [[1863]]
Personendaten|Anthony Hope|2|GEBURTSORT|[[London]]
Personendaten|Anthony Hope|2|STERBEDATUM|[[8. Juli]] [[1933]]
Personendaten|Anthony Hope|2|STERBEORT|
As you can see, the |-separated fields are:
1. Name of the template called
2. Name of the page that called the template
3. Sequence number of this call within the page
4. Name or position number of the parameter
5. Value of the parameter
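To give a feel for how such an extraction can work, here is a
much-simplified sketch (not the actual Extraktor script; get that from
the URL above). It ignores nested template calls, <nowiki> sections,
comments, and pipes inside [[links|...]], all of which a real script
has to handle:

#!/usr/bin/perl -w
# Simplified sketch, not the real Extraktor: print template call
# parameters from an XML dump on stdin, in the format described above.
use strict;
local $/ = '</page>';                      # read one <page> at a time
while (my $page = <STDIN>) {
    my ($title) = $page =~ m{<title>([^<]*)</title>} or next;
    my $seq = 0;
    while ($page =~ m{\{\{([^{}]*)\}\}}gs) {   # top-level {{...}} only
        my @parts = split /\|/, $1;
        (my $tmpl = shift @parts) =~ s/^\s+|\s+$//g;
        $seq++;
        my $pos = 0;
        for my $p (@parts) {
            $pos++;
            my ($name, $value) = $p =~ /=/
                ? split(/=/, $p, 2) : ($pos, $p);
            s/^\s+|\s+$//g for $name, $value;
            print join('|', $tmpl, $title, $seq, $name, $value), "\n";
        }
    }
}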
The output for the entire German Wikipedia dump is
bunzip2 <dewiki-20060803-pages-articles.xml.bz2 |
perl extraktor.pl >de.params
du -sm de.params
123 megabytes. With some simple awk, I get the following
statistics: There are
awk '-F|' '{print $2,$3}' de.params | sort -u | wc -l
790,985 template calls using a total of
wc -l de.params
2,076,178 parameters (on average 2.62 parameters per call) from
awk '-F|' '{print $2}' de.params | sort -u | wc -l
397,929 different pages to
awk '-F|' '{print $1}' de.params | sort -u | wc -l
13,295 different templates. The most commonly supplied parameter
names over all templates are
awk '-F|' '{print $4}' de.params | sort | uniq -c | sort -nr
NAME (113038 occurrences), ALTERNATIVNAMEN (101799),
KURZBESCHREIBUNG (101723), GEBURTSORT (101706), GEBURTSDATUM
(101704), STERBEDATUM (101663), STERBEORT (101649), ID (10061),
ZEIT (6255), VORGÄNGER (6210), NACHFOLGER (6210), AMT (6199),
EINWOHNER (5942), FLÄCHE (5868), WEBSITE (5680), STAND_EINWOHNER
(5619), Name (5307), PJ (5242), PL (5240), LEN (5224), DS (5219),
OS (5214), OT (5152), MUSIK (5137), DT (5117), TITEL (4750),
INHALT (4739), PRO (4658), REG (4639), DRB (4627), AF (4568),
KAMERA (4557), SCHNITT (4516), Bild (4038), BILD (3562), PLZ
(3344), HÖHE (3265), GEMEINDEART (3180), BREITENGRAD (3074),
LÄNGENGRAD (3061), KANTON (3013), and NAME_ORT (3005).
Yes, the bad taste of all-caps parameter names has been a disease of the
German Wikipedia since the early days of the Personendaten project.
Personendaten is also the template that is called from 100,000
different pages. Let's see which templates use the
parameter named GEMEINDEART (kind of municipality):
awk '-F|' '$4 == "GEMEINDEART" {print $1}' de.params |
sort | uniq -c | sort -nr
Ort_Schweiz (2738 calls), Ortschaft_Schweiz (196),
Infobox_Slowakische_Gemeinde-K (121), Infobox_Slowakische_Gemeinde
(111), Ort_Liechtenstein (11), Infobox_Schweizer_Gemeinden (2),
Infobox_Deutsche_Städte (1).
Let's see which kinds of municipalities there are in Slovakia:
awk '-F|' '$1 == "Infobox_Slowakische_Gemeinde" &&
$4 == "GEMEINDEART" {print $5}' de.templates |
sort | uniq -c | sort -nr
Stadt (74), Stadtteil (21), Gemeinde (16).
And in Switzerland:
awk '-F|' '$1 == "Ort_Schweiz" &&
$4 == "GEMEINDEART" {print $5}' de.templates |
sort | uniq -c | sort -nr
Gemeinde (2591), Stadt (126), Gemeinden (12).
Perhaps "Gemeinden" (a plural) is an error that should be fixed?
Let's see which twelve pages use this value for this parameter to
this template:
awk '-F|' '$1 == "Ort_Schweiz" &&
$4 == "GEMEINDEART" &&
$5 == "Gemeinden" {print $2}' de.templates
Benken ZH, Flaach, Adlikon bei Andelfingen, Andelfingen ZH,
Berg am Irchel, Buch am Irchel, Dachsen, Dorf ZH, Feuerthalen,
Humlikon, Flurlingen, Henggart.
Hmm... It turns out that GEMEINDEART is not used in this infobox
template. That's odd. I'll leave it there.
I hope you get the point. Of course you can use your favorite
SQL database instead of awk. If you want speed, be sure to create
indexes for every column.
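As a hedged sketch of that route (the database name, credentials, table,
and column names here are all made up for illustration), loading
de.params into MySQL with Perl's DBI might look like this:

#!/usr/bin/perl -w
# Sketch: load de.params into a MySQL table and index every column.
use strict;
use DBI;
my $dbh = DBI->connect('DBI:mysql:database=params', 'user', 'secret',
                       { RaiseError => 1 });
$dbh->do(q{CREATE TABLE tp (
    tmpl  VARCHAR(255),    -- template name
    page  VARCHAR(255),    -- calling page
    seq   INT,             -- sequence number of the call within the page
    param VARCHAR(255),    -- parameter name or position number
    val   TEXT             -- parameter value
)});
$dbh->do('CREATE INDEX tp_tmpl  ON tp (tmpl)');
$dbh->do('CREATE INDEX tp_page  ON tp (page)');
$dbh->do('CREATE INDEX tp_param ON tp (param)');
$dbh->do('CREATE INDEX tp_val   ON tp (val(40))');  # TEXT needs a prefix
my $ins = $dbh->prepare('INSERT INTO tp VALUES (?,?,?,?,?)');
open my $in, '<', 'de.params' or die $!;
while (<$in>) {
    chomp;
    # Every line has five fields; the limit of 5 keeps any pipes
    # inside the value together.
    $ins->execute(split /\|/, $_, 5);
}
$dbh->disconnect;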
Imagine if there were a templateparameter table like this supported by
MediaWiki; then we could do this in real time. Say I'm filling out an
infobox template in a wysiwyg editor: which parameter names should I
supply, and which values should I typically use?
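Continuing the made-up tp table from the sketch above, both of those
questions become a single GROUP BY each:

#!/usr/bin/perl -w
# Sketch, querying the hypothetical tp table from the previous example.
use strict;
use DBI;
my $dbh = DBI->connect('DBI:mysql:database=params', 'user', 'secret',
                       { RaiseError => 1 });
# Which parameter names does a given template usually take?
my $names = $dbh->prepare(q{
    SELECT param, COUNT(*) AS n FROM tp
    WHERE tmpl = ? GROUP BY param ORDER BY n DESC});
$names->execute('Ort_Schweiz');
while (my ($param, $n) = $names->fetchrow_array) { print "$param\t$n\n" }
# And which values are typical for one of its parameters?
my $vals = $dbh->prepare(q{
    SELECT val, COUNT(*) AS n FROM tp
    WHERE tmpl = ? AND param = ? GROUP BY val ORDER BY n DESC LIMIT 10});
$vals->execute('Ort_Schweiz', 'GEMEINDEART');
while (my ($val, $n) = $vals->fetchrow_array) { print "$val\t$n\n" }
$dbh->disconnect;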
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik - http://aronsson.se
On 30/08/06, brion(a)svn.leuksman.com <brion(a)svn.leuksman.com> wrote:
> Revision: 16282
> Author: brion
> Date: 2006-08-30 03:56:17 -0700 (Wed, 30 Aug 2006)
>
> Log Message:
> -----------
> * Fix bug in wfRunHooks which caused corruption of objects in the hook list
> References EVIL! Not needed anymore in PHP 5 anyway.
Oh, thank fuck. I thought for a moment that a recent commit of mine to
Parser.php had screwed up the software, but couldn't find the
problematic change - all I could determine was that, for me, it seemed
to trigger when Cite was active.
It's wonderful what PHP will chuck out as HTML output when it
encounters "interesting" bugs like this. :)
Rob Church
Can you please take my son off this mailing list? For some reason he is receiving numerous e-mails, probably due to his signing up mistakenly. So sorry for the trouble. Thank you. Jill Aldridge
An automated run of parserTests.php showed the following failures:
Running test TODO: Table security: embedded pipes (http://mail.wikipedia.org/pipermail/wikitech-l/2006-April/034637.html)... FAILED!
Running test TODO: Link containing double-single-quotes '' (bug 4598)... FAILED!
Running test TODO: Template with thumb image (with link in description)... FAILED!
Running test Template infinite loop... FAILED!
Running test TODO: message transform: <noinclude> in transcluded template (bug 4926)... FAILED!
Running test TODO: message transform: <onlyinclude> in transcluded template (bug 4926)... FAILED!
Running test BUG 1887, part 2: A <math> with a thumbnail- math enabled... FAILED!
Running test TODO: HTML bullet list, unclosed tags (bug 5497)... FAILED!
Running test TODO: HTML ordered list, unclosed tags (bug 5497)... FAILED!
Running test TODO: HTML nested bullet list, open tags (bug 5497)... FAILED!
Running test TODO: HTML nested ordered list, open tags (bug 5497)... FAILED!
Running test TODO: Parsing optional HTML elements (Bug 6171)... FAILED!
Running test TODO: Inline HTML vs wiki block nesting... FAILED!
Running test TODO: Mixing markup for italics and bold... FAILED!
Running test TODO: 5 quotes, code coverage +1 line... FAILED!
Running test TODO: HTML Hex character encoding.... FAILED!
Running test TODO: dt/dd/dl test... FAILED!
Passed 412 of 429 tests (96.04%) FAILED!
Would it be possible to follow the lead of many websites and replace
the two-step process ("The page ... has been removed from your
watchlist." / click "Return to ...") with a one-step process, whereby
the page is shown again, but with a banner at the top that reads "This
page has been removed from your watchlist"?
I imagine the mechanism would be something like what happens when you
click a redirect (as we were discussing recently): The target of the
redirect is shown, but with some added text explaining where you came
from.
This would apply to at least these actions:
* Watch
* Unwatch
* Move
Something similar could be done when saving an edit, adding text like
"Your edit has been saved."
Any takers? Should I bugzilla this as a formal feature request?
Steve
On 25/08/06, yurik(a)svn.leuksman.com <yurik(a)svn.leuksman.com> wrote:
> Revision: 16217
> Author: yurik
> Date: 2006-08-24 21:32:33 -0700 (Thu, 24 Aug 2006)
>
> Log Message:
> -----------
> * Now [[MediaWiki:Disambiguationspage]] may have either disambig template name, or a list of links to disambig templates.
For what purpose?
> + if( $set === false ) {
> + $set = 'FALSE'; # We must always return a valid SQL query, but this way the DB will always quickly return an empty result
Don't issue the SELECT at all in that case; it's a waste of a database
connection.
You didn't update the release notes. This is a change which will
affect the user's use of that message, plus the behaviour of the page;
the release notes MUST be updated.
Rob Church
Hello,
I read the thread "how bad is a category with ....", and I was wondering
how categories get filled. If I understand correctly, categories are filled
in by the editors of an article. This assumes that those editors know the
whole set of categories and that the categories will not change over time.
I was wondering whether there are projects to help *detect* categories and
to help editors by *suggesting* categories.
I am thinking about two different technologies to deal with these two
problems:
1) Text clustering to help find categories, but probably not using
classical approaches where the word space is used to describe a document
(applying part-of-speech tagging
<http://en.wikipedia.org/wiki/Part-of-speech_tagging>, stemming
<http://en.wikipedia.org/wiki/Stemmer>, ...). I am thinking about
clustering the link graph (similar to the clique problem
<http://en.wikipedia.org/wiki/Clique_problem> but with different
constraints), i.e. each document would be described not by its words (or
lemmas, LSA vector, ...) but by its links to other articles, using an
algorithm that does not need the number of clusters in advance but
does need a distance or similarity threshold. With this kind of
processing you get a set of clusters that are linked together,
but a cluster will probably not be a complete graph (this is the
difference from the clique problem). Once you have the clusters, you
need to try to label them with a category:
- give the user the role of identifying the category name
- use the word space to find the best words that describe this set
of articles
- ...
Then you could run this algorithm on a category to try to split it into
subcategories.
2) Machine learning or link-graph exploration to suggest categories
while an article is being edited.
The first idea is to try to learn existing categories with a machine
learning algorithm (using the word space) in order to guess the categories
of a new article (but this algorithm would have to deal with new categories
and with the fact that the number of documents without a category is
greater than the number of documents with one).
The second idea is much simpler and easier to implement: when you
edit an article, suggest the categories of the articles it links to (this
could be replaced by another graph-exploration algorithm); a sketch of
this follows below.
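As a hedged sketch of that simpler idea, here is a direct query against
the MediaWiki pagelinks, page, and categorylinks tables (schema as of
MediaWiki 1.5; the connection details and the Perl wrapper are
placeholders for illustration):

#!/usr/bin/perl -w
# Sketch: suggest categories for a page by counting the categories
# of the pages it links to.
use strict;
use DBI;
my $dbh = DBI->connect('DBI:mysql:database=wikidb', 'wikiuser', 'secret',
                       { RaiseError => 1 });
my $sth = $dbh->prepare(q{
    SELECT cl.cl_to, COUNT(*) AS n
    FROM pagelinks pl
    JOIN page p ON p.page_namespace = pl.pl_namespace
               AND p.page_title     = pl.pl_title
    JOIN categorylinks cl ON cl.cl_from = p.page_id
    WHERE pl.pl_from = ?    -- page_id of the article being edited
    GROUP BY cl.cl_to
    ORDER BY n DESC
    LIMIT 10});
$sth->execute($ARGV[0]);    # pass the page_id on the command line
while (my ($cat, $n) = $sth->fetchrow_array) {
    print "$cat\t$n\n";     # candidate category and how many links share it
}
$dbh->disconnect;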
Are there any functions like these in MediaWiki, and do you think
this kind of algorithm could help?
Finally, do you know of people working on these functionalities (maybe
people working on the semantic web)?
Best Regards.
Julien Lemoine
I know I've done this once before, but this one's worse:
The name Pluto was first suggested by [[Venetia Burney|Venetia Phair
(née Burney)]], at the time an eleven-year-old girl from [[Oxford,
England|Oxford]], [[England]].<ref>{{cite web
|url=http://news.bbc.co.uk/1/hi/sci/tech/4596246.stm
|title=The girl who named a planet
|first= Paul
|last= Rincon
|publisher=BBC News
|accessdate=2006-03-05}}</ref> Venetia, who was interested in
[[Classical mythology]] as well as astronomy, suggested the name, the
Roman equivalent of [[Hades]], in a conversation to her grandfather
[[Falconer Madan]], a former [[librarian]] of [[Oxford University]]'s
[[Bodleian Library]].<ref>{{cite web
|url=http://www.amblesideonline.org/PR/PR62p030PlanetPluto.shtml
|title=The Planet 'Pluto'
|first= K.M
|last= Claxton
|publisher=Parents' Union School Diamond Jubilee Magazine, 1891-1951
(Ambleside: PUS, 1951), p. 30-32
|accessdate=2006-08-24}}</ref> Madan passed the suggestion to
Professor [[Herbert Hall Turner]], Turner then cabled the suggestion
to colleagues in America. After favourable consideration which was
almost unanimous{{fact}}, the name Pluto was officially adopted and an
announcement made by Slipher on [[1930-05-01]].
---
Can you believe that in that chunk of text, there are actually three
separate pieces of text, with two references between them? It's
totally unmanageable - attempting to actually edit the text that's
buried in there as a cohesive whole is next to impossible. Solutions
desperately wanted.
Steve