On 7/17/05, Steven Hilton <mshiltonj(a)gmail.com> wrote:
On 7/17/05, Ævar Arnfjörð Bjarmason <avarab(a)gmail.com> wrote:
Just counting words as something split on white space wouldn't work
for MediaWiki, since not all languages split words on white space, and
even English sometimes splits words on things other than white space,
such as ': "it's" expands to "it is" and is therefore two words.
I can't speak to other languages, but the wiki entry on word count
(http://en.wikipedia.org/wiki/Word_count) -- found when I was
searching to see if mediawiki had a word count feature -- mentions
"automated word counting software which can count the actual words as
delimited by whitespace" and that "Word counting algorithms can vary
between software... Therefor, counting the words of a given document
using different software may give unequal results."
I think splitting on whitespace will work 99% of the time, especially
if the delimiter can be defined as language specific.
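
As a rough illustration of the language-specific delimiter idea, here is
a minimal Python sketch. The WORD_DELIMITERS table, its language codes
and patterns, and the count_words function are all hypothetical names
made up for this example; none of them exist in MediaWiki.

```python
import re

# Hypothetical per-language delimiter patterns (illustrative assumptions,
# not MediaWiki settings).
WORD_DELIMITERS = {
    "default": r"\s+",   # split on runs of whitespace only
    "en": r"[\s']+",     # also split on the apostrophe, so "it's" is two words
}


def count_words(text, lang="default"):
    """Count the non-empty pieces left after splitting on the delimiter."""
    pattern = WORD_DELIMITERS.get(lang, WORD_DELIMITERS["default"])
    return len([piece for piece in re.split(pattern, text) if piece])


print(count_words("it's a test"))        # plain whitespace split
print(count_words("it's a test", "en"))  # apostrophe treated as a delimiter
```

The two calls give different counts for the same text, which is the
point made in the Wikipedia quote: the result depends on how the
delimiter is defined.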
First of all, I've yet to be convinced that this is even worth the
effort, but even so, you're getting the raw wikitext and counting
whitespace in it, which is not a good idea: there will be a lot of
whitespace that doesn't delimit words in things like wikitables,
interwiki links, comments and so forth.
If you feel like seriously working on it, open a bug on it, explain
what it could be used for (I presume it's something more than just
putting "This article has {{WORDCOUNT}} words" on every page) and make
sure you implement it in such a way that it doesn't get false
positives and can be redefined in the Language class.
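
To make those two requirements concrete, here is a minimal Python
sketch, not MediaWiki code. The Language and LanguageEn classes below
are hypothetical stand-ins for a per-language class, and the regular
expressions are crude assumptions about wikitext: they drop comments,
whole tables (including their cell text) and lowercase-prefixed
interwiki-style links, and keep only the label of ordinary links. A real
implementation would work from the parser rather than from regexes.

```python
import re


class Language:
    """Hypothetical base class; subclasses may redefine word_pattern."""

    word_pattern = re.compile(r"\S+")  # default: a word is a run of non-whitespace

    def strip_markup(self, wikitext):
        # Remove HTML comments.
        text = re.sub(r"<!--.*?-->", " ", wikitext, flags=re.DOTALL)
        # Remove whole tables, cell text included (a simplification).
        text = re.sub(r"\{\|.*?\|\}", " ", text, flags=re.DOTALL)
        # Remove interwiki-style links such as [[de:Titel]] (crude heuristic).
        text = re.sub(r"\[\[[a-z-]{2,12}:[^\]|]*\]\]", " ", text)
        # Reduce [[target|label]] and [[target]] to the visible text.
        text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
        return text

    def count_words(self, wikitext):
        return len(self.word_pattern.findall(self.strip_markup(wikitext)))


class LanguageEn(Language):
    # Redefinition for English: the apostrophe also separates words.
    word_pattern = re.compile(r"[^\s']+")


sample = (
    "Hello [[world|World]] <!-- hidden note --> [[de:Hallo Welt]]\n"
    '{| class="wikitable"\n| a || b\n|}\n'
    "it's done"
)
print(Language().count_words(sample))
print(LanguageEn().count_words(sample))
```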