Brian:
ps: Does anyone know of a script that can strip out wiki syntax? This
is pertinent. It will also be necessary to leave only paragraphs of
text in the articles... the data below is noticeably skewed in some
(but not all) of the measures.

Brian, here is an initial response:
Some Perl code from the WikiCounts job that strips a lot of markup,
used to get cleaner text for word counts and article size in characters.
It is not 100% accurate, and not all markup is removed, because these
regexps already slow down the whole job considerably. Still, the result
is far closer to a decent word count than wc would give on the raw data.
$article =~ s/\'\'+//go ;                      # strip bold/italic formatting
$article =~ s/\<[^\>]+\>//go ;                 # strip <...> html
# These are the valid UTF-8 sequences, but processing them separately
# takes far too long, so I combine them into one set:
# $article =~ s/[\xc0-\xdf][\x80-\xbf]|
#               [\xe0-\xef][\x80-\xbf]{2}|
#               [\xf0-\xf7][\x80-\xbf]{3}/x/gxo ;
# This single set matches UTF-8 faster (with 99.9% accuracy, I would say):
$article =~ s/[\xc0-\xf7][\x80-\xbf]+/x/gxo ;  # count each Unicode char as one char
$article =~ s/\&\w+\;/x/go ;                   # count each named HTML entity as one char
$article =~ s/\&\#\d+\;/x/go ;                 # count each numeric HTML entity as one char
$article =~ s/\[\[ [^\:\]]+ \: [^\]]* \]\]//gxoi ;  # strip image/category/interwiki links
                                               # (a few internal links with a colon
                                               # in the title will be lost too)
$article =~ s/http \: [\w\.\/]+//gxoi ;        # strip external links
$article =~ s/\=\=+ [^\=]* \=\=+//gxo ;        # strip headers
$article =~ s/\n\**//go ;                      # strip line breaks + unordered list tags
                                               # (other list types are relatively scarce)
$article =~ s/\s+/ /go ;                       # collapse extra whitespace
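For reference, the substitutions above can be collected into one helper. A minimal self-contained sketch (the function name strip_markup and the sample text are mine, not from WikiCounts):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Apply the same substitutions as in the WikiCounts snippet above,
# in the same order, and return the cleaned text.
sub strip_markup {
    my ($article) = @_;
    $article =~ s/\'\'+//go ;                      # bold/italic
    $article =~ s/\<[^\>]+\>//go ;                 # <...> html
    $article =~ s/[\xc0-\xf7][\x80-\xbf]+/x/go ;   # UTF-8 char -> one char
    $article =~ s/\&\w+\;/x/go ;                   # named entity -> one char
    $article =~ s/\&\#\d+\;/x/go ;                 # numeric entity -> one char
    $article =~ s/\[\[ [^\:\]]+ \: [^\]]* \]\]//gxoi ;  # image/category/interwiki
    $article =~ s/http \: [\w\.\/]+//gxoi ;        # external links
    $article =~ s/\=\=+ [^\=]* \=\=+//gxo ;        # headers
    $article =~ s/\n\**//go ;                      # line breaks + list bullets
    $article =~ s/\s+/ /go ;                       # collapse whitespace
    return $article ;
}

my $sample = "== History ==\nSome '''bold''' text with a [[Category:Test]] tag\n"
           . "and a link http://example.org/page here." ;
print strip_markup($sample), "\n" ;
```

Note that the `\n` removal deliberately deletes the newline, exactly as in the job above, so words on either side of a line break can fuse; that is faithful to the original behavior.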
Actually the code in WikiCountsInput.pl is a bit more complicated, as it
tries to find a decent solution for ja/zh/ko. Also, numbers are counted
as one word (including embedded periods and commas). For Japanese, for
example:
  if ($language eq "ja")
  { $words = int ($unicodes * 0.37) ; }

etc.
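Spelled out, that per-language correction might look like the following sketch. Only the 0.37 factor for ja comes from the mail above; the zh/ko factors and the function name are illustrative placeholders:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Estimate a word count for CJK languages from the number of Unicode
# characters. Only the 0.37 factor for Japanese appears in the mail
# above; the other factors are made-up placeholders.
sub estimate_words {
    my ($language, $unicodes) = @_;
    my %words_per_char = (
        ja => 0.37,   # from WikiCounts (see above)
        zh => 0.50,   # placeholder, not the real value
        ko => 0.50,   # placeholder, not the real value
    );
    return int ($unicodes * $words_per_char{$language})
        if exists $words_per_char{$language};
    return undef ;    # alphabet-based languages are counted differently
}

print estimate_words("ja", 1000), "\n";   # roughly 370
```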
pps: I recall from the Wikimania meeting that someone had a script to
convert a dump to tab-delimited data. That would be useful to me...
could someone provide a link?
http://karma.med.harvard.edu/mailman/private/freelogy-discuss/2006-July/000047.html
Erik: The largest articles take approx. 1/10 of a second to run through
the binary produced by this C code. Using Inline::C in Perl, I could
fairly easily embed the code (style.c from GNU Diction) into your
script. It would take and return strings. "Simple!" =) Otherwise I can
just produce the data in CSV etc. and provide it to you.
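For what it's worth, the Inline::C route would look roughly like the sketch below. The C function here is a trivial stand-in, not the real style.c from GNU Diction, and in the real setup it would take an article string and return its readability stats:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy demonstration of embedding C in Perl with Inline::C.
# count_sentences is an illustrative stand-in, not part of GNU Diction.
use Inline C => <<'END_C';
int count_sentences(char *text) {
    /* crude stand-in: count sentence-ending periods */
    int n = 0;
    char *p;
    for (p = text; *p; p++)
        if (*p == '.') n++;
    return n;
}
END_C

print count_sentences("One. Two. Three."), "\n";   # prints 3
```

The first run compiles the C code and caches the result, so the compilation cost is paid once, not per article.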
Questions and caveats:

1/10 sec x 2 million articles early in 2007 is 55 hours. Plus German is
another 80 hours. Of course you say 1/10 sec is for the largest articles
only. Still, it adds up big time when all months are processed, and
running WikiCounts incrementally, only adding data for the last month,
has its drawbacks, as explained in our meeting at Wikimania. Is it 1/10
sec for all tests combined? Could we limit ourselves to the better
researched tests, or the tests which are supported in more languages or
deemed more sensible anyway? I would prefer tests that work in all
alphabet-based languages. When wiki syntax is introduced that is not
stripped by the regexps above or some other tool, it would produce
artificial drift in the results over the months.
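The back-of-the-envelope figure above (0.1 sec x 2 million articles is about 55 hours) is easy to check:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Back-of-the-envelope check of the processing-cost estimate above:
# 0.1 seconds per article over 2 million articles.
my $secs_per_article = 0.1;
my $articles_en      = 2_000_000;

my $hours_en = $secs_per_article * $articles_en / 3600;
printf "English: %.1f hours\n", $hours_en;   # 55.6 hours, i.e. the ~55 quoted above
```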
This data is very easy to reproduce. I provide a unix command for each
that assumes you have installed the lynx text browser, which has a dump
command to strip out HTML and leave text, and the GNU Diction package,
which provides style. Style supports English/German.

Stripping HTML is already done; see above.
I could imagine we run these tests on a yet-to-be-determined sample of
all articles to save processing costs. Tracking 10,000 or 50,000
articles from month to month, if chosen properly (at random?), should
give decent results.
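A reproducible random sample of that kind could be drawn along these lines; the fixed seed, the article-ID range, and the sample size are all illustrative assumptions:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Draw a fixed random sample of article IDs once, then reuse it every
# month so the same articles are tracked over time. The ID range and
# sample size here are placeholders, not real WikiCounts values.
srand(42);                        # fixed seed -> identical sample each run
my @all_ids = (1 .. 1_000_000);   # stand-in for the real article IDs
my @sample  = (shuffle @all_ids)[0 .. 9_999];

print scalar(@sample), " articles in sample\n";
```

Fixing the seed matters here: a sample that changes every month would reintroduce exactly the month-to-month noise the sampling is meant to avoid.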
Cheers, Erik Zachte