As I read it, MZMcBride's code [1] used the pattern
pattern = r'style[ ]*=[ ]*"(.+?)"'
and then normalized the internal spaces.
After substituting "display:none;", I think that translates directly to
the regex search:
insource:/style[ ]*=[ ]*\"display:[ ]*none;[ ]*\"/i
That gives me 487 articles [2]. I would also note that MZMcBride's code
will only report a "display: none;" count if that is the only declaration
in the style attribute. If the attribute contains multiple declarations,
it is counted on a different line that includes all of them.
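To make that counting behaviour concrete, here is a minimal Python sketch
of how I understand the approach. Only the pattern comes from the code
referenced above; the normalize() and count_styles() helpers and the sample
markup are assumptions for illustration, and the real normalization may
well differ.

```python
import re
from collections import Counter

# The pattern quoted in this message; everything else is an assumption.
STYLE_RE = re.compile(r'style[ ]*=[ ]*"(.+?)"')


def normalize(value):
    """Hypothetical normalization: lowercase, collapse whitespace runs,
    and leave exactly one space after each colon."""
    value = re.sub(r'\s+', ' ', value.strip().lower())
    return re.sub(r'\s*:\s*', ': ', value)


def count_styles(wikitext):
    """Count each whole attribute value once, not each declaration."""
    return Counter(normalize(v) for v in STYLE_RE.findall(wikitext))


# Made-up sample markup.
sample = (
    '<span style="display:none;">a</span>'
    '<span style = "display:  none;">b</span>'
    '<td style="text-align: left; display: none;">c</td>'
)
counts = count_styles(sample)
for value, n in counts.most_common():
    print(n, value)
```

In this sample the two attributes that contain only display:none collapse
into one line with a count of 2, while the attribute with two declarations
gets its own line and never contributes to the "display: none;" total.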
I am happy to agree that searching the XML should be better than the local
search tool, but I still find these numbers hard to reconcile.
-Robert Rohde
[1]
On Tue, Oct 27, 2015 at 4:03 PM, Trey Jones <tjones(a)wikimedia.org> wrote:
It appears that the summary may have normalized the formatting of the CSS.
143095 display: none;
Your query[1] assumes a space after "display:" and gives 218 results. Using
no space[2] gives 2,473 results, but still assumes that no other elements
occur in the style attribute. A regex query[3] with "display:" + optional
spaces + "none" gives 4,296 results, or a more reasonable average of 33 per
result. That query may be overly aggressive and match outside of style
contexts, but it also matches
*list_style = text-align:center;display:none,*
and *style="font-size: normal; text-align: left; display: none;"*, which I
think is a good thing (definitely in the latter case).
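The difference between a strict and a loose match can be sketched in
Python. Both patterns below are assumed approximations, not the exact
search queries: STRICT requires the whole attribute value to be
display:none, while LOOSE is "display:" + optional spaces + "none" anywhere
in the text. The last sample string is made up to show a match outside of a
style context.

```python
import re

# Assumed strict form: the attribute holds nothing but display:none.
STRICT = re.compile(r'style[ ]*=[ ]*"display:[ ]*none;[ ]*"', re.IGNORECASE)
# Assumed loose form: "display:" + optional spaces + "none", anywhere.
LOOSE = re.compile(r'display:[ ]*none', re.IGNORECASE)

samples = [
    'style="display:none;"',
    'list_style = text-align:center;display:none,',
    'style="font-size: normal; text-align: left; display: none;"',
    'Setting display: none hides the element.',  # hypothetical prose
]

# One (strict, loose) pair of booleans per sample string.
results = [(bool(STRICT.search(s)), bool(LOOSE.search(s))) for s in samples]
for s, (strict, loose) in zip(samples, results):
    print(f"strict={strict!s:5} loose={loose!s:5} {s}")
```

Only the first sample satisfies the strict pattern, while the loose one
matches all four, including the plain-prose sentence.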
Parsing a dump of enwiki is more accurate than running insource: queries.
—Trey
[1]
https://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=adv…
[2]
https://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=adv…
[3]
https://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=def…
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
On Tue, Oct 27, 2015 at 10:41 AM, Robert Rohde <rarohde(a)gmail.com> wrote:
Okay, I misunderstood those as page counts, which would be way too high.
Even if they are explicit usage counts, I am still surprised they are that
high.
BTW, is it surprising to anyone else that style elements aren't searchable
by default? Searching for "efcfff" [1] gives only a single article result
despite "background: #efcfff;" being reported 200k times.
We can however search using "insource:efcfff" [2], which reports 5516
articles, implying this color is applied _on average_ roughly 39 times per
article.
"display: none;" would appear even more impressive, with a reported 140k
uses in just 218 articles [3] or an average of 656 usages per page
containing it. That doesn't feel very likely to me. One possibility would
be if you mistakenly counted some or all pages outside of the main
namespace. Though only 218 articles use "display: none", there are nearly
31000 other pages that include it [4], which seems like a much more
reasonable way to get to 140k total uses.
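For reference, the averages quoted here are just the reported usage count
divided by the number of articles the search returns. A small Python
sketch, where uses_per_page() is a name made up for illustration:

```python
def uses_per_page(total_uses, pages):
    """Average number of reported uses per page returned by search."""
    return total_uses / pages


# Usage counts from MZMcBride's list, page counts from the searches above.
efcfff_avg = uses_per_page(215038, 5516)
display_none_avg = uses_per_page(143095, 218)

print(f"efcfff: {efcfff_avg:.1f} per article")
print(f"display: none: {display_none_avg:.1f} per article")
```

These round to 39 and 656 respectively.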
-Robert
[1]
https://en.wikipedia.org/w/index.php?search=efcfff&title=Special%3ASear…
[2]
https://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=adv…
[3]
https://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=adv…
[4]
https://en.wikipedia.org/w/index.php?title=Special:Search&search=insour…
On Tue, Oct 27, 2015 at 2:32 PM, MZMcBride <z(a)mzmcbride.com> wrote:
> Robert Rohde wrote:
> >On Mon, Oct 26, 2015 at 2:13 AM, MZMcBride <z(a)mzmcbride.com> wrote:
> >>The following are the top ten instances of inline styling from main
> >>namespace pages on the English Wikipedia, as of about 2015-10-02:
> >>
> >>1552197 text-align: center;
> >>499756 text-align: left;
> >>355952 background: #dfffdf;
> >>235222 background: #cfcfff;
> >>215038 background: #efcfff;
> >>210702 text-align: right;
> >>143095 display: none;
> >>93646 background: #efefef;
> >>86391 font-size: 90%;
> >>80420 background: #fff;
> >
> >I'm not sure what your bug is, but those counts are way too high to be
> >accurate reflections of the wikitext in the main namespace on enwiki.
>
> Err, based on what? :-)
>
> These numbers are instances of style="[...]", not page counts. Looking at
> a specific example from <https://phabricator.wikimedia.org/P2230>:
>
> 1164 font-family: 'microsoft yi baiti', 'noto sans yi', nsimsun-18030,
> simsun-18030, 'sil yi', code2000;
>
> These 1,164 inline styling instances all come from a single article:
> <https://en.wikipedia.org/w/index.php?oldid=672244691&action=edit>.
>
> Maybe that's the confusion? I tried to make my descriptions as clear as
> possible and I'm not saying a major bug is impossible, of course, but I
> don't have any reason so far to doubt the data I collected.
>
> Another strange case is "background-color: {{/meta/color}};", which had
> 16,432 instances. This almost looks like it would try to transclude a
> subpage of the article, but due to subpages being disabled in the main
> namespace on the English Wikipedia, it's actually transcluding a template
> named "/meta/color": <https://en.wikipedia.org/wiki/Template:/meta/color>.
>
> I did concurrently look at the approximate number of non-redirect pages
> that contain inline styling. My findings were that about 408,777
> non-redirect pages contain some kind of inline styling on the English
> Wikipedia (cf. <https://phabricator.wikimedia.org/T115228#1752223>).
>
> MZMcBride
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l