Brian:
ps: Does anyone know of a script that can strip out wiki syntax? This is pertinent. It will also be necessary to leave only paragraphs of text in the articles... the data below is noticeably skewed in some (but not all) of the measures.
Brian, here is an initial response:
Here is some perl code from the WikiCounts job that strips a lot of markup, used to get cleaner text for the word count and article size in chars. It is not 100% accurate, and not all markup is removed, but these regexps already slow down the whole job big time. The result is at least far closer to a decent word count than wc would give on the raw data.
$article =~ s/''+//go ;       # strip bold/italic formatting
$article =~ s/<[^>]+>//go ;   # strip <...> html

# these are valid UTF-8 chars, but it takes way too long to process, so
# I combine those in one set
# $article =~ s/[\xc0-\xdf][\x80-\xbf]|
#               [\xe0-\xef][\x80-\xbf]{2}|
#               [\xf0-\xf7][\x80-\xbf]{3}/x/gxo ;

# this one set selects UTF-8 faster (with 99.9% accuracy I would say)
$article =~ s/[\xc0-\xf7][\x80-\xbf]+/x/gxo ;     # count unicode chars as one char

$article =~ s/&\w+;/x/go ;    # count html entities as one char
$article =~ s/&#\d+;/x/go ;   # count html entities as one char

$article =~ s/\[\[ [^:\]]+ : [^\]]* \]\]//gxoi ;  # strip image/category/interwiki links
                                                  # a few internal links with a colon in the title will get lost too
$article =~ s/http : [\w.\/]+//gxoi ;             # strip external links

$article =~ s/==+ [^=]* ==+//gxo ;  # strip headers
$article =~ s/\n\**//go ;           # strip linebreaks + unordered list tags (other lists are relatively scarce)
$article =~ s/\s+/ /go ;            # remove extra spaces
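For illustration, here is a minimal, self-contained sketch of those substitutions wrapped in a sub and applied to a toy article; the sub name and the sample text are illustrative, not from WikiCounts:

    sub strip_wiki_markup {
        my ($article) = @_ ;
        $article =~ s/''+//go ;                           # bold/italic
        $article =~ s/<[^>]+>//go ;                       # html tags
        $article =~ s/[\xc0-\xf7][\x80-\xbf]+/x/gxo ;     # count each unicode char as one char
        $article =~ s/&\w+;/x/go ;                        # html entities
        $article =~ s/&#\d+;/x/go ;
        $article =~ s/\[\[ [^:\]]+ : [^\]]* \]\]//gxoi ;  # image/category/interwiki links
        $article =~ s/http : [\w.\/]+//gxoi ;             # external links
        $article =~ s/==+ [^=]* ==+//gxo ;                # headers
        $article =~ s/\n\**//go ;                         # linebreaks + unordered list tags
        $article =~ s/\s+/ /go ;                          # extra spaces
        return $article ;
    }

    my $sample = "== History ==\n* The '''Example''' article links to [[Category:Tests]] and http://example.org/page\n" ;
    my $clean  = strip_wiki_markup ($sample) ;
    my @words  = split ' ', $clean ;
    printf "%d words: %s\n", scalar @words, $clean ;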
Actually the code in WikiCountsInput.pl is a bit more complicated, as it tries to find a decent solution for ja/zh/ko. Also, numbers are counted as one word (including embedded points and commas).
if ($language eq "ja") { $words = int ($unicodes * 0.37) ; }
# etc.
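For illustration, that per-language adjustment could be structured roughly as below; only the ja factor of 0.37 comes from the snippet above, the table and sub name are illustrative, and the real factors live in WikiCountsInput.pl:

    # words-per-unicode-char factors; only ja => 0.37 is quoted above,
    # the other languages get their own factors in WikiCountsInput.pl
    my %words_per_char = ( "ja" => 0.37 ) ;

    sub estimate_words {
        my ($language, $words, $unicodes) = @_ ;
        # for scripts written without spaces, derive the word count from the
        # number of unicode chars rather than from whitespace-separated tokens
        return int ($unicodes * $words_per_char{$language})
            if exists $words_per_char{$language} ;
        return $words ;
    }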
pps: I recall from the Wikimania meeting that someone had a script to convert a dump to tab-delimited data. That would be useful to me... could someone provide a link?
http://karma.med.harvard.edu/mailman/private/freelogy-discuss/2006-July/000047.html
Erik: The largest articles take approx. 1/10 of a second running the binary produced by this C code. Using Inline::C in Perl, I could fairly easily embed the code (style.c from GNU Diction) into your script. It would take and return strings. "Simple!" =) Otherwise I can just produce the data in csv etc. and provide it to you.
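For what it's worth, a rough sketch of what that Inline::C embedding could look like; the C function below is a trivial stand-in, not the actual style.c code:

    use strict ;
    use warnings ;

    # stand-in C function; in practice the relevant routines from GNU Diction's
    # style.c would be compiled here and return their results as strings
    use Inline C => <<'END_C' ;
    int count_sentences (char* text)
    {
        int n = 0 ;
        char* p ;
        for (p = text ; *p != '\0' ; p++)
            if (*p == '.' || *p == '!' || *p == '?')
                n++ ;
        return n ;
    }
    END_C

    my $clean_text = "One sentence. Another one! A third?" ;
    printf "%d sentences\n", count_sentences ($clean_text) ;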
Questions and caveats: 1/10 sec x 2 million articles early in 2007 is 55 hours. Plus German, it is 80 hours. Of course you say 1/10 sec is for the largest articles only. Still, it adds up big time when all months are processed, and running WikiCounts incrementally, only adding data for the last month, has its drawbacks, as explained in our meeting at Wikimania. Is it 1/10 sec for all tests combined? Could we limit ourselves to the better researched tests, or the tests which are supported in more languages or deemed more sensible anyway? I would prefer tests that work in all alphabet-based languages. When wiki syntax is introduced that is not stripped by the regexps above or some other tool, it would produce artificial drift in the results over the months.
This data is very easy to reproduce. I provide a unix command for each that assumes you have installed the lynx text browser, which has a dump command to strip out html and leave text, and the GNU Diction package, which provides style. Style supports English/German.
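For concreteness, a sketch of that lynx/style pipeline driven from Perl; the file name is illustrative, and both tools are assumed to be on the path:

    my $html_file = "article.html" ;             # illustrative file name
    my $text = `lynx -dump $html_file` ;         # lynx strips the html and leaves plain text

    open my $style, "|-", "style"                # GNU Diction's style reads the text on stdin
        or die "cannot run style: $!" ;
    print $style $text ;
    close $style ;                               # style prints its readability report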
Stripping html is already done. See above.
I could imagine we run these tests on a yet-to-be-determined sample of all articles to save processing costs. Tracking 10,000 or 50,000 articles from month to month, if chosen properly (randomly?), should give decent results.
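One way to choose such a sample so that the very same articles are tracked every month is to hash the article title and keep a fixed fraction; a rough sketch, with an arbitrary 2% threshold:

    use Digest::MD5 qw (md5_hex) ;

    my $sample_fraction = 0.02 ;    # roughly 2% of all articles; tune to land between 10,000 and 50,000

    # deterministic: the same titles fall below the threshold in every monthly run
    sub in_sample {
        my ($title) = @_ ;
        my $bucket = hex (substr (md5_hex ($title), 0, 4)) / 0x10000 ;
        return $bucket < $sample_fraction ;
    }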
Cheers, Erik Zachte