On 05/11/2012 05:02 PM, Jon Robson wrote:
> It would be good if possible to get a more accurate feel for what
> percentage of articles use inline styles.
> e.g. articles that contain style= vs articles that don't
> This would help us get a better idea of what we are dealing with.
I just grepped an enwiki dump using the dumpGrepper tool in
extensions/VisualEditor/tests/parser:
zcat enwiki-latest-pages-articles.xml.gz \
| node dumpGrepper.js "\bstyle\s*=\s*['\"]"
(..all matches..)
################################################
Total revisions: 11687077
Total matches: 675254
Ratio: 5.77%
################################################
This includes templates, and it compares all matches against all
revisions, so the number of matched articles will be even lower.
So it is safe to assume that most content pages don't contain any inline
styles.
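To get the per-article figure Jon asked about, each page would have to be counted once no matter how many matches it contains. A rough sketch of that counting follows. It is not part of dumpGrepper: countStyledPages is a hypothetical helper, and it assumes the caller has already extracted one decoded wikitext string per page from the dump.

```javascript
// Hypothetical helper, not part of dumpGrepper.
// Same pattern as the grep above; no "g" flag, so test() keeps no state.
const STYLE_RE = /\bstyle\s*=\s*['"]/;

// Assumes pageTexts is an iterable of decoded wikitext strings, one per page.
// A page counts once, however many inline styles it contains.
function countStyledPages(pageTexts) {
  let total = 0;
  let matched = 0;
  for (const text of pageTexts) {
    total += 1;
    if (STYLE_RE.test(text)) {
      matched += 1;
    }
  }
  return { total, matched, ratio: total === 0 ? 0 : matched / total };
}

// Tiny made-up sample, only to show the counting behaviour.
const sample = [
  "Plain '''wikitext''' with no styling.",
  '{| style="width:100%"\n| cell\n|}',
  '<span style="color:red">a</span> <span style="color:blue">b</span>',
  'The word style appears here, but not as an attribute.',
];
const result = countStyledPages(sample);
console.log(`${result.matched} of ${result.total} pages matched`);
```

The third sample page has two inline styles but still counts as a single matched page, which is why a per-article ratio cannot come out higher than a per-match one.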
The bzip-compressed (1.2GB uncompressed) output can soon be found here:
http://dev.wikidev.net/gabriel/tmp/style.txt.bz2
Gabriel