On 05/11/2012 05:02 PM, Jon Robson wrote:
It would be good, if possible, to get a more accurate feel for what percentage of articles use inline styles, e.g. articles that contain style= vs. articles that don't. This would help us get a better idea of what we are dealing with.
I just grepped an enwiki dump using the dumpGrepper tool in extensions/VisualEditor/tests/parser:
zcat enwiki-latest-pages-articles.xml.gz \
  | node dumpGrepper.js "\bstyle\s*=\s*['\"]"
(..all matches..)
################################################
Total revisions: 11687077
Total matches: 675254
Ratio: 5.77%
################################################
This includes templates, and counts all matches against all revisions, so the number of matched articles will be even lower.
So it is safe to assume that most content pages don't contain any inline styles.
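To illustrate why the share of matched articles can be lower than the match ratio, here is a minimal sketch in Node. It is not part of dumpGrepper.js: the helper name countInlineStyles and the sample strings are made up for illustration, and it assumes that a single page can contribute more than one match to the total.

```javascript
// Hypothetical helper, not part of dumpGrepper.js. It applies the same
// pattern as the grep above to an in-memory list of page texts.
const STYLE_ATTR = /\bstyle\s*=\s*['"]/g;

function countInlineStyles(pages) {
  let totalMatches = 0; // every occurrence, several per page possible
  let matchedPages = 0; // pages with at least one occurrence
  for (const text of pages) {
    // With the g flag, match() returns all occurrences or null.
    const hits = (text.match(STYLE_ATTR) || []).length;
    totalMatches += hits;
    if (hits > 0) matchedPages += 1;
  }
  const total = pages.length;
  return {
    totalPages: total,
    totalMatches,
    matchedPages,
    matchRatio: total ? totalMatches / total : 0,
    pageRatio: total ? matchedPages / total : 0,
  };
}

// Made-up sample: one page with two inline styles, two pages with none.
const sample = [
  '<div style="color:red"><span style=\'x\'>a</span></div>',
  "plain wikitext with no attributes",
  "the word style alone does not match",
];
console.log(countInlineStyles(sample));
```

In this sample there are two matches but only one matched page out of three, so the per-page ratio is half the per-match ratio. Under the assumption stated above, the same effect would apply to the dump numbers.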
The bzip-compressed (1.2GB uncompressed) output can soon be found here: http://dev.wikidev.net/gabriel/tmp/style.txt.bz2
Gabriel