Hi.
I was curious about the prevalence and types of inline styling (or more specifically, inline CSS) in articles on the English Wikipedia.
The relevant task is here: https://phabricator.wikimedia.org/T115228. The results are here: https://phabricator.wikimedia.org/P2230.
The following are the top ten instances of inline styling from main namespace pages on the English Wikipedia, as of about 2015-10-02:
1552197 text-align: center;
499756 text-align: left;
355952 background: #dfffdf;
235222 background: #cfcfff;
215038 background: #efcfff;
210702 text-align: right;
143095 display: none;
93646 background: #efefef;
86391 font-size: 90%;
80420 background: #fff;
My hope is that a better understanding of the uses of inline styling will allow us to reduce the overall amount of it in smart ways. Semantic CSS classes and per-page or per-template CSS files might be part of this.
MZMcBride
Is this direct usage of the styling, or does it include styling introduced by templates as well?
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
K. Peachey wrote:
Is this direct usage of the styling, or does it include styling introduced by templates as well?
Just direct usage of styling, specifically anything approximately matching 'style="[...]"'. The script only looked at the wikitext directly used in pages as that's what's provided by the XML dumps.
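As a rough illustration of that matching step, here is a minimal sketch. It is not the actual script: the function name and the sample wikitext are made up for this example, and the pattern is the one quoted later in this thread. Note that the template call in the second sample contributes nothing, because the wikitext is not expanded.

```python
import re
from collections import Counter

# Approximate match for style="..." in raw (unexpanded) wikitext.
STYLE_RE = re.compile(r'style[ ]*=[ ]*"(.+?)"')

def count_inline_styles(page_texts):
    """Count every style="..." value found across the given wikitext strings."""
    counts = Counter()
    for text in page_texts:
        counts.update(STYLE_RE.findall(text))
    return counts

# Hypothetical sample pages, standing in for page text taken from an XML dump.
sample_pages = [
    '{| class="wikitable"\n'
    '| style="text-align: center;" | 1\n'
    '| style="text-align: center;" | 2\n'
    '|}',
    '<span style="display: none;">x</span> {{Infobox|name=Example}}',
]
counts = count_inline_styles(sample_pages)
for value, n in counts.most_common():
    print(n, value)
```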
I think having styling in templates is at least a much better place for it, as template invocations are tracked/indexed by MediaWiki and the use of templates means that the styling code is a lot more centralized.
MZMcBride
Perhaps some deeper segregation of those stats would be useful, i.e., separating the counts of styles used in templates from those used in pages.
Then we might have a better idea of what kind of patterns are being used directly in pages that should actually be moved to templates or stylesheets.
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://danielfriesen.name/]
Daniel Friesen wrote:
Perhaps some deeper segregation of those stats would be useful, i.e., separating the counts of styles used in templates from those used in pages.
Then we might have a better idea of what kind of patterns are being used directly in pages that should actually be moved to templates or stylesheets.
This reply confused me a little. The script I ran exclusively looked at pages in the main namespace and exclusively looked at an XML dump, which is unexpanded wikitext. That is, assuming people aren't doing a lot of inline styling as arguments/parameters to templates, we should already have a decent amount of segregation as I only looked at direct uses.
Looking at the template namespace or looking at pages post-expansion would be annoying. I think templates aren't necessarily a bad place for inline styling, so I'm a lot less focused on templates than I am on articles.
Vi to wrote:
Do you have an old dump to check whether the ratio has increased?
I'm personally not very interested in doing this, but using a similar dump from https://dumps.wikimedia.org/ and following the instructions laid out in https://phabricator.wikimedia.org/T115228 should make this fairly easy to do, if anyone is interested. I tried to methodically document all of the relevant source code and commands that I used, so that this same audit or an audit on another project or dump would be less work.
MZMcBride
Sorry, n/m then. I ended up skimming past the namespace mention and thought the stats were for all namespaces.
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://danielfriesen.name/]
Do you have an old dump to check whether the ratio has increased?
Vito
I'm not sure what your bug is, but those counts are way too high to be accurate reflections of the wikitext in the main namespace on enwiki.
-Robert Rohde
Robert Rohde wrote:
On Mon, Oct 26, 2015 at 2:13 AM, MZMcBride z@mzmcbride.com wrote:
The following are the top ten instances of inline styling from main namespace pages on the English Wikipedia, as of about 2015-10-02:
1552197 text-align: center;
499756 text-align: left;
355952 background: #dfffdf;
235222 background: #cfcfff;
215038 background: #efcfff;
210702 text-align: right;
143095 display: none;
93646 background: #efefef;
86391 font-size: 90%;
80420 background: #fff;
I'm not sure what your bug is, but those counts are way too high to be accurate reflections of the wikitext in the main namespace on enwiki.
Err, based on what? :-)
These numbers are instances of style="[...]", not page counts. Looking at a specific example from https://phabricator.wikimedia.org/P2230:
1164 font-family: 'microsoft yi baiti', 'noto sans yi', nsimsun-18030, simsun-18030, 'sil yi', code2000;
These 1,164 inline styling instances all come from a single article: https://en.wikipedia.org/w/index.php?oldid=672244691&action=edit.
Maybe that's the confusion? I tried to make my descriptions as clear as possible and I'm not saying a major bug is impossible, of course, but I don't have any reason so far to doubt the data I collected.
Another strange case is "background-color: {{/meta/color}};", which had 16,432 instances. This almost looks like it would try to transclude a subpage of the article, but due to subpages being disabled in the main namespace on the English Wikipedia, it's actually transcluding a template named "/meta/color": https://en.wikipedia.org/wiki/Template:/meta/color.
I did concurrently look at the approximate number of non-redirect pages that contain inline styling. My findings were that about 408,777 non-redirect pages contain some kind of inline styling on the English Wikipedia (cf. https://phabricator.wikimedia.org/T115228#1752223).
MZMcBride
Okay, I misunderstood those as page counts, which would be way too high. Even if they are explicit usage counts, I am still surprised they are that high.
BTW, is it surprising to anyone else that style elements aren't searchable by default? Searching for "efcfff" [1] gives only a single article result, despite "background: #efcfff;" being reported 200k times.
We can however search using "insource:efcfff" [2], which reports 5516 articles, implying this color is applied _on average_ roughly 39 times per article.
"display: none;" would appear even more impressive, with a reported 140k uses in just 218 articles [3] or an average of 656 usages per page containing it. That doesn't feel very likely to me. One possibility would be if you mistakenly counted some or all pages outside of the main namespace. Though only 218 articles use "display: none", there are nearly 31000 other pages that include it [4], which seems like a much more reasonable way to get to 140k total uses.
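For reference, the two averages above can be reproduced with a quick calculation. This is only a sketch of the arithmetic, using the exact counts from the report and the article counts from the searches; the variable names are my own.

```python
# Counts quoted in this thread.
efcfff_uses = 215038         # "background: #efcfff;" instances in the report
efcfff_articles = 5516       # articles found via insource:efcfff
display_none_uses = 143095   # "display: none;" instances in the report
display_none_articles = 218  # articles found via the search for "display: none"

# Average number of uses per article that contains the style.
efcfff_avg = efcfff_uses / efcfff_articles
display_none_avg = display_none_uses / display_none_articles
print(round(efcfff_avg), round(display_none_avg))  # prints "39 656"
```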
-Robert
[1] https://en.wikipedia.org/w/index.php?search=efcfff&title=Special%3ASearc...
[2] https://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=adva...
[3] https://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=adva...
[4] https://en.wikipedia.org/w/index.php?title=Special:Search&search=insourc...
It appears that the summary may have normalized the formatting of the CSS.
143095 display: none;
Your query[1] assumes a space after "display:" and gives 218 results. Using no space[2] gives 2,473 results, but still assumes that no other elements occur in the style attribute. A regex query[3] with "display:" + optional spaces + "none" gives 4,296 results, or a more reasonable average of 33 per result. That query may be overly aggressive and match outside of style contexts, but it also matches *list_style = text-align:center;display:none,* and *style="font-size: normal; text-align: left; display: none;"* which I think is a good thing (definitely in the latter case).
Parsing a dump of enwiki is more accurate than running insource: queries.
—Trey
[1] https://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=adva...
[2] https://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=adva...
[3] https://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=defa...
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
As I read it, MZMcBride's code [1] used the pattern
pattern = r'style[ ]*=[ ]*"(.+?)"'
and then normalized the internal spaces.
After substituting "display:none;", I think this translates directly to the regex search:
insource:/style[ ]*=[ ]*"display:[ ]*none;[ ]*"/i
That gives me 487 articles [2]. I would also note that MZMcBride's code will only report a "display: none" if that is the only style element. If multiple style elements are present then it becomes a count on a different line including those multiple elements.
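To make that last point concrete, here is a small sketch. The `normalize` function is a simplified stand-in of my own (lowercasing and collapsing whitespace), not the real normalization script, and the sample wikitext is invented.

```python
import re
from collections import Counter

# The capture pattern quoted above.
PATTERN = re.compile(r'style[ ]*=[ ]*"(.+?)"')

def normalize(value):
    # Hypothetical normalization: lowercase and collapse runs of whitespace.
    return re.sub(r'\s+', ' ', value.strip().lower())

wikitext = (
    '<div style="display: none;">a</div>'
    '<div style="Display:  none;">b</div>'
    '<div style="text-align: left; display: none;">c</div>'
)

# Each whole attribute value is one key, so the third div is tallied under
# its own combined key rather than under "display: none;".
counts = Counter(normalize(v) for v in PATTERN.findall(wikitext))
for value, n in counts.most_common():
    print(n, value)
```

So a page that always combines "display: none" with other declarations would not contribute to the "display: none;" line at all.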
I am happy to agree that searching the XML should be better than the local search tool, but I still find these numbers hard to reconcile.
-Robert Rohde
[1] https://github.com/mzmcbride/dump-reports/blob/a8dbbcb3/xmldumpreader.py
[2] https://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=adva...
Robert Rohde wrote:
After substituting "display:none;", I think this translates directly to the regex search:
insource:/style[ ]*=[ ]*"display:[ ]*none;[ ]*"/i
That gives me 487 articles.
Almost, but not quite. You actually want this:
insource:/style[ ]*=[ ]*"display:[ ]*none;?[ ]*"/i
With the semicolon being made optional, the search results increase from 487 to 2,487 currently on the English Wikipedia. The normalization script (https://phabricator.wikimedia.org/P2229) made the trailing semicolon consistent, in addition to lowercasing and trying to account for strange spacing. For whatever reason, "display: none;" is often written without the trailing semicolon in main namespace pages on the English Wikipedia.
I was worried that I may have made a major coding mistake, so I re-ran my script using this pattern:
pattern = r'style[ ]*=[ ]*"[ ]*display[ ]*:[ ]*none[ ]*;?[ ]*"'
The results are available here: https://phabricator.wikimedia.org/P2255. Sixteen articles have over 1,000 instances of "display: none;" each! The total is 142,176 instances of "display: none;" (normalized) in 2,507 main namespace pages on the English Wikipedia, as of about 2015-10-02.
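The effect of the optional semicolon is easy to check in isolation. The following is just a sketch on invented sample strings, comparing the stricter search pattern with the pattern I re-ran; the case-insensitive flag is an assumption that mirrors the /i in the search query.

```python
import re

# Stricter pattern: the trailing semicolon is required.
strict = re.compile(r'style[ ]*=[ ]*"display:[ ]*none;[ ]*"', re.IGNORECASE)
# Relaxed pattern: optional semicolon and more tolerant spacing.
relaxed = re.compile(
    r'style[ ]*=[ ]*"[ ]*display[ ]*:[ ]*none[ ]*;?[ ]*"', re.IGNORECASE
)

# Hypothetical attribute strings of the kind seen in wikitext.
samples = [
    'style="display:none;"',
    'style="display: none"',
    'style = "display : none ;"',
    'style="text-align: left; display: none;"',
]

strict_hits = [s for s in samples if strict.search(s)]
relaxed_hits = [s for s in samples if relaxed.search(s)]
print(len(strict_hits), len(relaxed_hits))  # prints "1 3"
```

Neither pattern matches the last sample, since both require "display: none" to be the only declaration in the attribute.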
I am happy to agree that searching the XML should be better than the local search tool, but I still find these numbers hard to reconcile.
After re-reviewing the code and re-running the script to focus on "display: none;" specifically, there's strong evidence to suggest that the numbers are accurate, if a bit surprising in some cases. :-)
MZMcBride
Okay. Thanks for making the extra effort.
-Robert