On 7/17/05, Steven Hilton mshiltonj@gmail.com wrote:
On 7/17/05, Ævar Arnfjörð Bjarmason avarab@gmail.com wrote:
Just counting words as something split on white space wouldn't work for MediaWiki, since not all languages split words on white space, and even English sometimes splits words on things other than white space, such as the apostrophe: "it's" expands to "it is" and is therefore two words.
I can't speak to other languages, but the Wikipedia entry on word count (http://en.wikipedia.org/wiki/Word_count) -- found when I was searching to see if MediaWiki had a word count feature -- mentions "automated word counting software which can count the actual words as delimited by whitespace" and that "Word counting algorithms can vary between software... Therefor, counting the words of a given document using different software may give unequal results."
I think splitting on whitespace will work 99% of the time, especially if the delimiter can be defined per language.
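To make the idea concrete, here is a minimal sketch in Python (not MediaWiki code) of whitespace splitting with a delimiter that can be set per language. The `DELIMITERS` table, its keys, and the `count_words` function are hypothetical names invented for illustration.

```python
import re

# Hypothetical per-language delimiter patterns. The keys and values are
# illustrative assumptions and are not taken from MediaWiki.
DELIMITERS = {
    "en": r"\s+",
    "default": r"\s+",
}

def count_words(text, lang="default"):
    """Count the non-empty tokens left after splitting on the language's delimiter."""
    pattern = DELIMITERS.get(lang, DELIMITERS["default"])
    return len([tok for tok in re.split(pattern, text) if tok])
```

With this rule, `count_words("it's a wiki", "en")` gives 3, so a contraction such as "it's" is counted as a single word.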
First of all, I've yet to be convinced that this is even worth the effort. Even so, you're getting the raw wikitext and counting whitespace in it, which is not a good idea: there will be a lot of whitespace that doesn't delimit words in things like wikitables, interwiki links, comments and so forth.
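A rough Python sketch of the problem, assuming crude regular expressions rather than a real wikitext parser: `strip_markup` and `naive_count` are hypothetical helpers, the patterns handle only non-nested constructs, and dropping a whole table also discards any prose in its cells.

```python
import re

def strip_markup(wikitext):
    """Crudely remove markup whose whitespace does not delimit prose words."""
    # HTML comments: <!-- ... -->
    text = re.sub(r"<!--.*?-->", " ", wikitext, flags=re.DOTALL)
    # Wikitables: {| ... |}  (non-nested tables only; cell text is dropped too)
    text = re.sub(r"\{\|.*?\|\}", " ", text, flags=re.DOTALL)
    # Interlanguage links such as [[de:Wort]] (a 2-3 letter prefix is assumed)
    text = re.sub(r"\[\[[a-z]{2,3}:[^\]]*\]\]", " ", text)
    return text

def naive_count(text):
    """Count tokens separated by any run of whitespace."""
    return len(text.split())

raw = (
    "Some text <!-- a hidden note --> here\n"
    '{| class="wikitable"\n| a || b\n|}\n'
    "[[de:Ein Wort]]"
)
```

Here `naive_count(strip_markup(raw))` is 3, while `naive_count(raw)` is several times larger because the comment, the table syntax and the link are all counted as words.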
If you feel like seriously working on it, open a bug for it, explain what it could be used for (I presume it's something more than just putting "This article has {{WORDCOUNT}} words" on every page), and make sure you implement it in such a way that it doesn't produce false positives and can be redefined in the Language class.
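As an illustration of what "redefined in the Language class" could mean, here is a small Python sketch of a default rule with a per-language override. The classes, the method name and the contraction table are hypothetical and do not reflect the actual MediaWiki Language class or its API.

```python
import re

class Language:
    """Hypothetical base class: the default rule splits on whitespace."""

    def count_words(self, text):
        return len(text.split())

class LanguageEn(Language):
    """Hypothetical English override: expand a few contractions, then count."""

    # Illustrative table only; a real list would be much longer.
    CONTRACTIONS = {"it's": "it is", "don't": "do not"}

    def count_words(self, text):
        for short, full in self.CONTRACTIONS.items():
            # Word boundaries keep "it's" from matching inside longer words.
            text = re.sub(r"\b" + re.escape(short) + r"\b", full, text,
                          flags=re.IGNORECASE)
        return super().count_words(text)
```

`Language().count_words("it's two words")` returns 3, whereas `LanguageEn().count_words("it's two words")` returns 4, so each language can decide for itself what counts as a word.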