Brian:
ps: Does anyone know of a script that can strip out wiki syntax? This
is pertinent. It will also be necessary to leave only paragraphs of
text in the articles... the data below is noticeably skewed in some
(but not all) of the measures.

Brian, here is an initial response:
Some Perl code from the WikiCounts job that strips a lot of markup,
used to get cleaner text for word counts and article size in characters.
It is not 100% accurate, and not all markup is removed, because these
regexps already slow down the whole job considerably. Still, the result
is far closer to a decent word count than wc would give on the raw data.
$article =~ s/\'\'+//go ;                      # strip bold/italic formatting
$article =~ s/\<[^\>]+\>//go ;                 # strip <...> html
# These are the valid UTF-8 sequences, but processing them separately
# takes far too long, so I combine them into one set:
# $article =~ s/[\xc0-\xdf][\x80-\xbf]|
#               [\xe0-\xef][\x80-\xbf]{2}|
#               [\xf0-\xf7][\x80-\xbf]{3}/x/gxo ;
# This single set matches UTF-8 faster (with 99.9% accuracy, I would say):
$article =~ s/[\xc0-\xf7][\x80-\xbf]+/x/gxo ;  # count each Unicode char as one char
$article =~ s/\&\w+\;/x/go ;                   # count each named HTML entity as one char
$article =~ s/\&\#\d+\;/x/go ;                 # count each numeric HTML entity as one char
$article =~ s/\[\[ [^\:\]]+ \: [^\]]* \]\]//gxoi ;  # strip image/category/interwiki links
                                               # (a few internal links with a colon
                                               # in the title will be lost too)
$article =~ s/http \: [\w\.\/]+//gxoi ;        # strip external links
$article =~ s/\=\=+ [^\=]* \=\=+//gxo ;        # strip headers
$article =~ s/\n\**//go ;                      # strip line breaks + unordered list tags
                                               # (other list types are relatively scarce)
$article =~ s/\s+/ /go ;                       # collapse extra whitespace
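For reference, the substitutions above can be collected into one helper. A minimal self-contained sketch (the function name strip_markup and the sample text are mine, not from WikiCounts):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Apply the same substitutions as in the WikiCounts snippet above,
# in the same order, and return the cleaned text.
sub strip_markup {
    my ($article) = @_;
    $article =~ s/\'\'+//go ;                      # bold/italic
    $article =~ s/\<[^\>]+\>//go ;                 # <...> html
    $article =~ s/[\xc0-\xf7][\x80-\xbf]+/x/go ;   # UTF-8 char -> one char
    $article =~ s/\&\w+\;/x/go ;                   # named entity -> one char
    $article =~ s/\&\#\d+\;/x/go ;                 # numeric entity -> one char
    $article =~ s/\[\[ [^\:\]]+ \: [^\]]* \]\]//gxoi ;  # image/category/interwiki
    $article =~ s/http \: [\w\.\/]+//gxoi ;        # external links
    $article =~ s/\=\=+ [^\=]* \=\=+//gxo ;        # headers
    $article =~ s/\n\**//go ;                      # line breaks + list bullets
    $article =~ s/\s+/ /go ;                       # collapse whitespace
    return $article ;
}

my $sample = "== History ==\nSome '''bold''' text with a [[Category:Test]] tag\n"
           . "and a link http://example.org/page here." ;
print strip_markup($sample), "\n" ;
```

Note that the `\n` removal deliberately deletes the newline, exactly as in the job above, so words on either side of a line break can fuse; that is faithful to the original behavior.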
Actually the code in WikiCountsInput.pl is a bit more complicated, as it
tries to find a decent solution for ja/zh/ko. Also, numbers are counted
as one word (including embedded periods and commas). For Japanese, for
example:
  if ($language eq "ja")
  { $words = int ($unicodes * 0.37) ; }

etc.
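Spelled out, that per-language correction might look like the following sketch. Only the 0.37 factor for ja comes from the mail above; the zh/ko factors and the function name are illustrative placeholders:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Estimate a word count for CJK languages from the number of Unicode
# characters. Only the 0.37 factor for Japanese appears in the mail
# above; the other factors are made-up placeholders.
sub estimate_words {
    my ($language, $unicodes) = @_;
    my %words_per_char = (
        ja => 0.37,   # from WikiCounts (see above)
        zh => 0.50,   # placeholder, not the real value
        ko => 0.50,   # placeholder, not the real value
    );
    return int ($unicodes * $words_per_char{$language})
        if exists $words_per_char{$language};
    return undef ;    # alphabet-based languages are counted differently
}

print estimate_words("ja", 1000), "\n";   # roughly 370
```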
pps: I recall from the Wikimania meeting that someone had a script to
convert a dump to tab-delimited data. That would be useful to me...
could someone provide a link?
http://karma.med.harvard.edu/mailman/private/freelogy-discuss/2006-July/000047.html
Erik: The largest articles take approx. 1/10 of a second to run through
the binary produced by this C code. Using Inline::C in Perl, I could
fairly easily embed the code (style.c from GNU Diction) into your
script. It would take and return strings. "Simple!" =) Otherwise I can
just produce the data in CSV etc. and provide it to you.
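For what it's worth, the Inline::C route would look roughly like the sketch below. The C function here is a trivial stand-in, not the real style.c from GNU Diction, and in the real setup it would take an article string and return its readability stats:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy demonstration of embedding C in Perl with Inline::C.
# count_sentences is an illustrative stand-in, not part of GNU Diction.
use Inline C => <<'END_C';
int count_sentences(char *text) {
    /* crude stand-in: count sentence-ending periods */
    int n = 0;
    char *p;
    for (p = text; *p; p++)
        if (*p == '.') n++;
    return n;
}
END_C

print count_sentences("One. Two. Three."), "\n";   # prints 3
```

The first run compiles the C code and caches the result, so the compilation cost is paid once, not per article.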
Questions and caveats:

1/10 sec x 2 million articles early in 2007 is 55 hours. Plus German is
another 80 hours. Of course you say 1/10 sec is for the largest articles
only. Still, it adds up big time when all months are processed, and
running WikiCounts incrementally, only adding data for the last month,
has its drawbacks, as explained in our meeting at Wikimania. Is it 1/10
sec for all tests combined? Could we limit ourselves to the better
researched tests, or the tests which are supported in more languages or
deemed more sensible anyway? I would prefer tests that work in all
alphabet-based languages. When wiki syntax is introduced that is not
stripped by the regexps above or some other tool, it would produce
artificial drift in the results over the months.
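The back-of-the-envelope figure above (0.1 sec x 2 million articles is about 55 hours) is easy to check:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Back-of-the-envelope check of the processing-cost estimate above:
# 0.1 seconds per article over 2 million articles.
my $secs_per_article = 0.1;
my $articles_en      = 2_000_000;

my $hours_en = $secs_per_article * $articles_en / 3600;
printf "English: %.1f hours\n", $hours_en;   # 55.6 hours, i.e. the ~55 quoted above
```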
This data is very easy to reproduce. I provide a unix command for each
that assumes you have installed the lynx text browser, which has a dump
command to strip out HTML and leave text, and the GNU Diction package,
which provides style. Style supports English/German.

Stripping HTML is already done; see above.
I could imagine we run these tests on a yet-to-be-determined sample of
all articles to save processing costs. Tracking 10,000 or 50,000
articles from month to month, if chosen properly (at random?), should
give decent results.
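A reproducible random sample of that kind could be drawn along these lines; the fixed seed, the article-ID range, and the sample size are all illustrative assumptions:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Draw a fixed random sample of article IDs once, then reuse it every
# month so the same articles are tracked over time. The ID range and
# sample size here are placeholders, not real WikiCounts values.
srand(42);                        # fixed seed -> identical sample each run
my @all_ids = (1 .. 1_000_000);   # stand-in for the real article IDs
my @sample  = (shuffle @all_ids)[0 .. 9_999];

print scalar(@sample), " articles in sample\n";
```

Fixing the seed matters here: a sample that changes every month would reintroduce exactly the month-to-month noise the sampling is meant to avoid.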
Cheers, Erik Zachte