<div dir="ltr">Subbu, that's exactly the kind of thing I have in mind, and it further reinforces my sense that an API (server- or client-side) built on top of Parsoid could be a huge step forward in reliability.  If you're testing that the API works and the DOM spec is correct across that many articles, the things we build on top should "Just Work (TM)."  I'll be sure to look into this in my hacking. Thanks for replying!</div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, May 20, 2015 at 10:52 AM, Subramanya Sastry <span dir="ltr"><<a href="mailto:ssastry@wikimedia.org" target="_blank">ssastry@wikimedia.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
  
    
  
  <div bgcolor="#FFFFFF" text="#000000">
    <div><br>
       For Parsoid, we run tests [1] against a set of 160K articles that
      we randomly picked a couple of years back: about 10K articles from
      each of 16 wikis. We run roundtrip tests (wikitext
      -> html -> wikitext) and compare diffs, and we also run
      trivial edit tests (wikitext -> html -> add comment at end
      of page -> wikitext) to check how clean our roundtripping is.<br>
      <br>
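      <div>A minimal sketch of the two checks described above, in Python. The converter functions are hypothetical stand-ins passed in as parameters; the toy pair below uses bracketed pseudo-markup instead of real HTML and is not Parsoid's API. Only the diffing logic is the point.</div>

```python
import difflib
import re
from typing import Callable, List

Converter = Callable[[str], str]


def _diff(expected: str, actual: str) -> List[str]:
    """Unified diff of two texts; an empty list means they are identical."""
    return list(difflib.unified_diff(
        expected.splitlines(), actual.splitlines(),
        fromfile="expected", tofile="actual", lineterm=""))


def roundtrip_diff(wikitext: str, to_html: Converter,
                   to_wikitext: Converter) -> List[str]:
    """wikitext -> html -> wikitext, diffed against the original."""
    return _diff(wikitext, to_wikitext(to_html(wikitext)))


def trivial_edit_diff(wikitext: str, to_html: Converter,
                      to_wikitext: Converter,
                      comment: str = "[!-- rt test --]") -> List[str]:
    """wikitext -> html -> append a comment -> wikitext.

    Only the appended comment should differ from the original, so the
    result is diffed against the original plus that comment.
    """
    edited = to_html(wikitext) + "\n" + comment
    return _diff(wikitext + "\n" + comment, to_wikitext(edited))


# Toy stand-ins (assumptions, not Parsoid): bold is '''x''' in wikitext
# and [b]x[/b] in the pseudo-HTML.
def toy_to_html(wikitext: str) -> str:
    return re.sub(r"'''(.+?)'''", r"[b]\1[/b]", wikitext)


def toy_to_wikitext(html: str) -> str:
    return re.sub(r"\[b\](.+?)\[/b\]", r"'''\1'''", html)


clean = roundtrip_diff("'''Cat''' is an animal.", toy_to_html, toy_to_wikitext)
dirty = roundtrip_diff("[b]Dog[/b] is an animal.", toy_to_html, toy_to_wikitext)
print(len(clean), len(dirty) > 0)
```

      <div>The second input is a "dirty" roundtrip: the toy converters normalize the literal bracket markup into quote markup, so the diff is non-empty even though nothing was edited.</div>
      <br>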
      This testing has been extremely good at telling us when something
      is broken vs. when something is good to be deployed. Checking
      these results is part of our deployment process. We also collect
      performance statistics in each testing run. However, our testing
      database and its schema are not tuned well enough to let us
      actually track performance regressions, so that data has
      just sat in the db without being used for anything.<br>
      <br>
      But we've also recently been talking about:<br>
      * refreshing this set to pick a more proportional sample of articles from
      different wikis (more from enwiki, fewer from others, etc.); we
      have not done this yet.<br>
      * adding a separate, non-random set of pages that
      are particularly important (featured articles, etc.). So we would
      be interested in any set of articles that is considered important
      enough to be regularly tested against.<br>
      <br>
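      <div>One way the "more proportional" selection in the first bullet could be sketched in Python: split a fixed sample budget across wikis in proportion to their article counts, then draw titles at random. The function names and the article counts below are made up for illustration; nothing here reflects how the existing test set was built.</div>

```python
import random
from typing import Dict, List


def allocate(article_counts: Dict[str, int], total: int) -> Dict[str, int]:
    """Split `total` sample slots across wikis in proportion to size.

    Uses the largest-remainder method so the slots always sum to `total`.
    """
    grand_total = sum(article_counts.values())
    slots = {w: total * n // grand_total for w, n in article_counts.items()}
    leftover = total - sum(slots.values())
    # Hand the remaining slots to the wikis with the largest fractional part.
    by_remainder = sorted(
        article_counts,
        key=lambda w: (-(total * article_counts[w] % grand_total), w))
    for wiki in by_remainder[:leftover]:
        slots[wiki] += 1
    return slots


def pick_titles(titles_by_wiki: Dict[str, List[str]], total: int,
                seed: int = 0) -> Dict[str, List[str]]:
    """Draw a reproducible random sample of titles from each wiki."""
    rng = random.Random(seed)
    counts = {w: len(t) for w, t in titles_by_wiki.items()}
    return {w: rng.sample(titles_by_wiki[w], min(k, counts[w]))
            for w, k in allocate(counts, total).items()}


# Made-up article counts, for illustration only.
example = allocate({"enwiki": 5000, "dewiki": 2000, "frwiki": 2000,
                    "hewiki": 1000}, total=100)
print(example)
```

      <div>A hand-curated list of important pages (the second bullet) could simply be merged with the output of <code>pick_titles</code>, since it is a fixed set rather than a sample.</div>
      <br>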
      This map-reduce style testing code is general enough that
      it can be repurposed for other kinds of testing. For example, we
      have also reused this same rt-testing code to run visual
      diffs (comparing phantomjs renderings of php parser output and
      parsoid output for the same title) on a set of about 800 randomly
      selected enwiki articles [2].<br>
      <br>
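      <div>To illustrate why a map-reduce shape makes the harness reusable, here is a rough Python sketch: the per-title check is a plug-in function, and the reduce step only needs an error count per title. This is an assumption about the general pattern, not the actual rt-testing code; <code>run_suite</code> and <code>toy_check</code> are hypothetical names.</div>

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Tuple


def run_suite(titles: Iterable[str], check: Callable[[str], int],
              workers: int = 4) -> Tuple[Counter, List[Tuple[str, int]]]:
    """Map `check` over titles in parallel, then reduce to a summary.

    `check` returns the number of errors found for one title (0 = pass).
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map step: one independent test per title.
        results = list(pool.map(lambda t: (t, check(t)), titles))
    # Reduce step: pass/fail totals plus titles ranked by error count.
    summary = Counter("fail" if errors else "pass" for _, errors in results)
    top_fails = sorted((r for r in results if r[1]),
                       key=lambda r: (-r[1], r[0]))
    return summary, top_fails


def toy_check(title: str) -> int:
    """Placeholder check: pretends titles with odd length have errors."""
    return len(title) % 2 * len(title)


summary, top_fails = run_suite(["Cat", "India", "Europe", "Cat food"], toy_check)
print(summary, top_fails)
```

      <div>Swapping <code>toy_check</code> for a roundtrip diff or a visual diff changes what is tested without touching the map or reduce steps.</div>
      <br>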
      This kind of testing is essential for our deployments. I am not
      sure whether it is appropriate for other teams, but I am sharing it just in
      case.<br>
      <br>
      Subbu.<br>
      <br>
      [1] See <a href="http://parsoid-tests.wikimedia.org/topfails" target="_blank">http://parsoid-tests.wikimedia.org/topfails</a> and
      <a href="http://parsoid-tests.wikimedia.org/commits" target="_blank">http://parsoid-tests.wikimedia.org/commits</a>. The main page is
      <a href="http://parsoid-tests.wikimedia.org" target="_blank">http://parsoid-tests.wikimedia.org</a>, but it can sometimes
      time out when the db is clogged and old test results need
      clearing out.<br>
      <br>
      [2] <a href="http://parsoid-tests.wikimedia.org/visualdiff/" target="_blank">http://parsoid-tests.wikimedia.org/visualdiff/</a> with code at
      <a href="https://github.com/subbuss/parsoid_visual_diffs" target="_blank">https://github.com/subbuss/parsoid_visual_diffs</a><div><div class="h5"><br>
      <br>
      <br>
      On 05/20/2015 01:48 AM, Elena Tonkovidova wrote:<br>
    </div></div></div>
    <blockquote type="cite"><div><div class="h5">
      <div dir="ltr">The spreadsheet at <a href="https://docs.google.com/spreadsheets/d/14Ei-KWYbZcmvT70irx6NGIJCi17tF2o1szXnQsZ2h-A/edit#gid=0" target="_blank">https://docs.google.com/spreadsheets/d/14Ei-KWYbZcmvT70irx6NGIJCi17tF2o1szXnQsZ2h-A/edit#gid=0</a>
        lists the articles that I usually check when I do regression
        testing. 
        <div><br>
        </div>
        <div>One group is a set of articles that used to have
          performance or display issues:<br>
          <div>-  <span style="font-family:arial,sans,sans-serif;font-size:13px;font-weight:bold">Barack
              Obama, Cat, India, Richard Nixon, </span></div>
          <div><span style="font-family:arial,sans,sans-serif;font-size:13px;font-weight:bold">Europe,
              English language</span></div>
          <div><br>
          </div>
        </div>
        <div>Another group consists of articles used to test images and image
          galleries (GIF, SVG, image maps, charts, timelines, and galleries
          with a large number of images):</div>
        <div><br>
        </div>
        <div>- <b>Claude Monet </b>- extensive image gallery (different
          image sizes)</div>
        <div>- <b>List of go games</b> - many SVG images</div>
        <div>- <span style="color:rgb(0,0,0);font-family:arial,sans,sans-serif;font-size:13px;font-weight:bold;white-space:pre-wrap">Lilac chaser, </span><span style="color:rgb(0,0,0);font-family:arial,sans,sans-serif;font-size:13px;font-weight:bold;white-space:pre-wrap">Caridoid escape reaction </span><span style="color:rgb(0,0,0);font-family:arial,sans,sans-serif;font-size:13px;white-space:pre-wrap">- animated (GIF) images</span></div>
        <div><span style="color:rgb(0,0,0);font-family:arial,sans,sans-serif;font-size:13px;white-space:pre-wrap">- </span><span style="color:rgb(0,0,0);font-family:arial,sans,sans-serif;font-size:13px;white-space:pre-wrap"><b>The Club (dining club), Image map</b></span><span style="color:rgb(0,0,0);font-family:arial,sans,sans-serif;font-size:13px;white-space:pre-wrap"> - for image map images</span></div>
        <div><span style="color:rgb(0,0,0);font-family:arial,sans,sans-serif;font-size:13px;white-space:pre-wrap">- <b>Tel Aviv (Hebrew)</b> - for the timeline image template</span></div>
        <div><span style="color:rgb(0,0,0);font-family:arial,sans,sans-serif;font-size:13px;white-space:pre-wrap">- several specific articles with problems in their lead image</span></div>
        <div><span style="color:rgb(0,0,0);font-family:arial,sans,sans-serif;font-size:13px;white-space:pre-wrap"><br>
          </span></div>
        <div><span style="color:rgb(0,0,0);font-family:arial,sans,sans-serif;font-size:13px;white-space:pre-wrap">And, yes, it'd be really great if we could 1) define more precisely which article properties we are interested in testing (visit statistics, </span><span style="color:rgb(0,0,0);font-family:arial,sans,sans-serif;font-size:13px;white-space:pre-wrap">size, structure, special layouts, images, etc.) </span><span style="color:rgb(0,0,0);font-family:arial,sans,sans-serif;font-size:13px;white-space:pre-wrap">and 2) create a process (system) to find such articles.</span></div>
        <div><span style="color:rgb(0,0,0);font-family:arial,sans,sans-serif;font-size:13px;white-space:pre-wrap"><br>
          </span></div>
        <div>
          Also, there is still an open task - <a href="https://phabricator.wikimedia.org/T97151" target="_blank">https://phabricator.wikimedia.org/T97151</a>
          - Testing Page issues and disambiguation templates (T90250). Going
          through the lists at <a href="http://en.wikipedia.org/wiki/Category:Wikipedia_articles_with_content_issues" rel="noreferrer" target="_blank">http://en.wikipedia.org/wiki/Category:Wikipedia_articles_with_content_issues</a> and
          <a href="http://en.wikipedia.org/wiki/Wikipedia:Template_messages/General#Disambiguation_and_redirection" rel="noreferrer" target="_blank">http://en.wikipedia.org/wiki/Wikipedia:Template_messages/General#Disambiguation_and_redirection</a> should
          help to catch some issues.</div>
        <div><br>
        </div>
        <div>thanks</div>
        <div>Elena</div>
        <div class="gmail_extra"><br>
          <div class="gmail_quote">On Tue, May 19, 2015 at 9:23 PM,
            Brian Gerstle <span dir="ltr"><<a href="mailto:bgerstle@wikimedia.org" target="_blank">bgerstle@wikimedia.org</a>></span>
            wrote:<br>
            <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
              <div dir="ltr">+search</div>
              <div>
                <div>
                  <div class="gmail_extra"><br>
                    <div class="gmail_quote">On Tue, May 19, 2015 at
                      3:14 PM, Brian Gerstle <span dir="ltr"><<a href="mailto:bgerstle@wikimedia.org" target="_blank">bgerstle@wikimedia.org</a>></span>
                      wrote:<br>
                      <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
                        <div dir="ltr">The subject hints at a question
                          that's been nagging me for a while, and now
                          that I'm going to be hacking on testing in
                          Lyon I wanted to ask:
                          <div><br>
                          </div>
                          <div>Do we have a list of articles we usually
                            run tests against?</div>
                          <div><br>
                          </div>
                          <div>If not, do we have any processes for
                            curating such a list?  Would anyone be
                            interested in a brainstorming session at
                            Lyon to discuss this further?</div>
                          <div><br>
                          </div>
                          <div>Basically, as a developer, I would love
                            to have more confidence that some code I
                            wrote doesn't break on our most popular
                            articles.  Or, if we can get more
                            sophisticated, that <b>certain properties
                              of my code hold true for certain kinds of
                              generated pages</b>.*</div>
                          <div><br>
                          </div>
                          <div>Please respond with your thoughts and
                            whether you think I should create a phab
                            task for the hackathon about this.  In
                            either case, ping me anytime or grab me at
                            Lyon to discuss further!</div>
                          <div><br>
                          </div>
                          <div>Regards,</div>
                          <div><br>
                          </div>
                          <div>Brian</div>
                          <div><br>
                          </div>
                          <div>* Yes, I'm talking about using
                            property-based testing generators to create
                            random, shrinkable MW pages that we can run
                            tests on. Not sure if it's practical, but
                            could be an interesting experiment.</div>
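                          <div>To make the footnote concrete, here is a small hand-rolled Python sketch of the idea using only the standard library: generate random pages from a toy block grammar, check a property, and shrink any failing page by dropping blocks. The grammar, <code>toy_strip_markup</code>, and the property are all hypothetical; a real experiment would likely use an existing property-based testing library and a much richer generator.</div>

```python
import random
from typing import Callable, List, Optional

# Block generators for a toy wikitext grammar (an assumption; real pages
# are far richer).
BLOCKS = [
    lambda rng: "== Heading " + str(rng.randint(1, 9)) + " ==",
    lambda rng: "Paragraph number " + str(rng.randint(1, 99)) + ".",
    lambda rng: "[[Link " + str(rng.randint(1, 99)) + "]]",
    lambda rng: "{{Template|arg=" + str(rng.randint(1, 9)) + "}}",
]


def gen_page(rng: random.Random, max_blocks: int = 8) -> List[str]:
    """Generate a random page as a list of blocks."""
    return [rng.choice(BLOCKS)(rng) for _ in range(rng.randint(1, max_blocks))]


def shrink(blocks: List[str], fails: Callable[[List[str]], bool]) -> List[str]:
    """Greedily drop blocks for as long as the page still fails."""
    changed = True
    while changed:
        changed = False
        for i in range(len(blocks)):
            candidate = blocks[:i] + blocks[i + 1:]
            if candidate and fails(candidate):
                blocks, changed = candidate, True
                break
    return blocks


def find_counterexample(prop: Callable[[str], bool], runs: int = 200,
                        seed: int = 0) -> Optional[List[str]]:
    """Return a shrunk failing page, or None if every run passes."""
    rng = random.Random(seed)
    for _ in range(runs):
        blocks = gen_page(rng)
        if not prop("\n".join(blocks)):
            return shrink(blocks, lambda b: not prop("\n".join(b)))
    return None


def toy_strip_markup(page: str) -> str:
    """Hypothetical code under test: forgets to strip templates."""
    for token in ("[[", "]]", "=="):
        page = page.replace(token, "")
    return page


def no_markup_left(page: str) -> bool:
    """Property: stripped output contains no link or template markers."""
    out = toy_strip_markup(page)
    return "[[" not in out and "{{" not in out


print(find_counterexample(no_markup_left))
```

                          <div>Because the toy code under test never strips templates, the search fails on the first page containing one, and shrinking reduces that page to a single template block, which is the kind of minimal failing input the footnote has in mind.</div>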
                          <span><font color="#888888">
                              <div>
                                <div><br>
                                </div>
                                -- <br>
                                <div>
                                  <div dir="ltr">
                                    <div>
                                      <div dir="ltr">EN Wikipedia user
                                        page: <a href="https://en.wikipedia.org/wiki/User:Brian.gerstle" target="_blank">https://en.wikipedia.org/wiki/User:Brian.gerstle</a><br>
                                        IRC: bgerstle</div>
                                    </div>
                                  </div>
                                </div>
                              </div>
                            </font></span></div>
                      </blockquote>
                    </div>
                    <br>
                    <br clear="all">
                    <div><br>
                    </div>
                  </div>
                </div>
              </div>
              <br>
              _______________________________________________<br>
              reading-wmf mailing list<br>
              <a href="mailto:reading-wmf@lists.wikimedia.org" target="_blank">reading-wmf@lists.wikimedia.org</a><br>
              <a href="https://lists.wikimedia.org/mailman/listinfo/reading-wmf" target="_blank">https://lists.wikimedia.org/mailman/listinfo/reading-wmf</a><br>
              <br>
            </blockquote>
          </div>
          <br>
        </div>
      </div>
      <br>
      <br>
      </div></div><pre>_______________________________________________
QA mailing list
<a href="mailto:QA@lists.wikimedia.org" target="_blank">QA@lists.wikimedia.org</a>
<a href="https://lists.wikimedia.org/mailman/listinfo/qa" target="_blank">https://lists.wikimedia.org/mailman/listinfo/qa</a>
</pre>
    </blockquote>
    <br>
  </div>

<br></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature"><div dir="ltr"><div><div dir="ltr">EN Wikipedia user page: <a href="https://en.wikipedia.org/wiki/User:Brian.gerstle" target="_blank">https://en.wikipedia.org/wiki/User:Brian.gerstle</a><br>IRC: bgerstle</div></div></div></div>
</div>