Okay. Thanks for making the extra effort.
-Robert
On Thu, Oct 29, 2015 at 6:05 AM, MZMcBride <z@mzmcbride.com> wrote:
Robert Rohde wrote:
Which, after substituting "display:none;" I think translates directly to the regex search:
insource:/style[ ]*=[ ]*"display:[ ]*none;[ ]*"/i
That gives me 487 articles.
Almost, but not quite. You actually want this:
insource:/style[ ]*=[ ]*"display:[ ]*none;?[ ]*"/i
With the semicolon being made optional, the search results increase from 487 to 2,487 currently on the English Wikipedia. The normalization script (https://phabricator.wikimedia.org/P2229) made the trailing semicolon consistent, in addition to lowercasing and trying to account for strange spacing. For whatever reason, "display: none;" is often written without the trailing semicolon in main namespace pages on the English Wikipedia.
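To make the effect of the optional semicolon concrete, here is a minimal Python sketch. It assumes the two insource: patterns above can be translated one-to-one into Python's re syntax (the on-wiki search uses a different regex engine, so this only approximates it), and the sample strings are invented for illustration.

```python
import re

# Python translations of the two search patterns quoted above.
# Assumption: for input this simple, they behave like the insource: regexes.
STRICT = re.compile(r'style[ ]*=[ ]*"display:[ ]*none;[ ]*"', re.IGNORECASE)
OPTIONAL = re.compile(r'style[ ]*=[ ]*"display:[ ]*none;?[ ]*"', re.IGNORECASE)

# Hypothetical wikitext snippets, made up for this example.
samples = [
    '<span style="display:none;">a</span>',    # trailing semicolon present
    '<span style="display: none">b</span>',    # no trailing semicolon
    '<span STYLE = "display:none ">c</span>',  # odd case and spacing, no semicolon
]

strict_hits = [bool(STRICT.search(s)) for s in samples]
optional_hits = [bool(OPTIONAL.search(s)) for s in samples]

for sample, strict, optional in zip(samples, strict_hits, optional_hits):
    print(f"strict={strict!s:5} optional={optional!s:5} {sample}")
```

Only the first sample satisfies the strict pattern, while all three satisfy the pattern with the optional semicolon, which is the kind of gap that would explain the jump in search results.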
I was worried that I may have made a major coding mistake, so I re-ran my script using this pattern:
pattern = r'style[ ]*=[ ]*"[ ]*display[ ]*:[ ]*none[ ]*;?[ ]*"'
The results are available here: https://phabricator.wikimedia.org/P2255. Sixteen articles have over 1,000 instances of "display: none;" each! The total is 142,176 instances of "display: none;" (normalized) in 2,507 main namespace pages on the English Wikipedia, as of about 2015-10-02.
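A per-page tally of this kind could be sketched roughly as follows. This is not the script from the paste: the function names and the sample page texts are hypothetical, reading the XML dump is left out, and case-insensitive matching is an assumption standing in for the lowercasing step described earlier.

```python
import re

# Pattern quoted in the message. IGNORECASE is an assumption that stands in
# for the lowercasing the normalization script is described as doing.
PATTERN = re.compile(
    r'style[ ]*=[ ]*"[ ]*display[ ]*:[ ]*none[ ]*;?[ ]*"',
    re.IGNORECASE,
)

def count_instances(wikitext):
    """Return the number of non-overlapping pattern matches in one page's text."""
    return len(PATTERN.findall(wikitext))

def tally(pages):
    """Return ({title: count} for pages with at least one match, grand total)."""
    counts = {title: count_instances(text) for title, text in pages.items()}
    per_page = {title: n for title, n in counts.items() if n > 0}
    return per_page, sum(per_page.values())

# Hypothetical page texts, invented for this example.
pages = {
    "Example A": '<span style="display:none">x</span> '
                 '<td style = " display : none ; ">y</td>',
    "Example B": "No hidden content here.",
    "Example C": '<div STYLE="display: none;">z</div>',
}

per_page, total = tally(pages)
print(per_page, total)
```

Counting matches rather than pages is what separates the two figures above: the number of pages with at least one hit versus the total number of instances across them.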
I am happy to agree that searching the XML should be better than the local search tool, but I still find these numbers hard to reconcile.
After re-reviewing the code and re-running the script to focus on "display: none;" specifically, there's strong evidence to suggest that the numbers are accurate, if a bit surprising in some cases. :-)
MZMcBride
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l