On Wed, Jul 24, 2013 at 2:06 AM, Subramanya Sastry ssastry@wikimedia.org wrote:
Hi John and Risker,
First off, I do want to once again clarify that my intention in the previous post was not to claim that VE/Parsoid is perfect. It was more that we've fixed sufficient bugs at this point that the most significant "bugs" (bugs, not missing features) that need fixing (and are being fixed) are those that have to do with usability tweaks.
How do you know that? Have you performed automated tests on all Wikipedia content? Or are you waiting for users to find these bugs?
My intention in that post was also not to put some distance between us and the complaints, but just to clarify that we are fixing things as fast as we can, and that this can be seen in the recent changes stream.
John: specific answers to the edit diffs you highlighted in your post. I acknowledge your intention to make sure we don't make false claims about VE/Parsoid's usability. Thanks for taking the time to dig them up. My answers below are made with the intention of figuring out what the issues are so they can be fixed where they need to be.
On 07/23/2013 02:50 AM, John Vandenberg wrote:
On Tue, Jul 23, 2013 at 4:32 PM, Subramanya Sastry ssastry@wikimedia.org wrote:
On 07/22/2013 10:44 PM, Tim Starling wrote:
Round-trip bugs, and bugs which cause a given wikitext input to give different HTML in Parsoid compared to MW, should have been detected during automated testing, prior to beta deployment. I don't know why we need users to report them.
500+ edits are being done per hour using Visual Editor [1] (less at this time given that it is way past midnight -- I have seen about 700/hour at times). I did go and click on over 100 links and examined the diffs. I did that twice in the last hour. I am happy to report clean diffs on all edits I checked both times.
I did run into a couple of nowiki insertions, which are, strictly speaking, not erroneous since they are based on user input, but are more of a usability issue.
What is a dirty diff? One that inserts junk unexpectedly, unrelated to the user's input?
That is correct. Strictly speaking, yes: any changes to the wikitext markup that arose from parts of the page the user didn't change.
The broken table injection bugs are still happening.
https://en.wikipedia.org/w/index.php?title=Sai_Baba_of_Shirdi&curid=1441...
If the parser isn't going to be fixed quickly to ignore tables it doesn't understand, we need to find the templates and pages with these broken tables -- preferably using SQL and heuristics -- and fix them. The same needs to be done for all the other wikis, otherwise they are going to have the same problems happening randomly, causing lots of grief.
This may be related to https://bugzilla.wikimedia.org/show_bug.cgi?id=51217, and I have a tentative fix for it as of yesterday.
Fixes are of course appreciated. The pace of bugfixes is not the problem ...
VE and Parsoid devs have put in a lot of effort to recognize broken wikitext source, fix it or isolate it,
My point was that you don't appear to be doing analysis of how much of all Wikipedia content is broken; at least I don't see a public document listing which templates and pages are causing the parser problems, so the communities on each Wikipedia can fix them ahead of deployment.
I believe there is a bug about automated testing of the parser against existing pages, which would identify problems.
I scanned the Spanish 'visualeditor' tag's 50 recentchanges earlier and found a dirty diff, which I believe hasn't been raised in bugzilla yet.
https://bugzilla.wikimedia.org/show_bug.cgi?id=51909
50 VE edits on eswp is more than one day of recentchanges. Most of the top 10 wikis have roughly the same level of testing going on. That should be a concern. The number of VE edits is about to increase on another nine Wikipedias, with very little real impact analysis having been done. That is a shame, because the enwp deployment has provided us with a list of problems which will impact those wikis if they are using the same syntax, be it weird or broken or otherwise troublesome.
and protect it across edits, and roundtrip it back in original form to prevent corruption. I think we have been largely successful, but we still have more cases to go that are being exposed here, which we will fix. But, occasionally, these kinds of errors do show up -- and we ask for your patience as we fix these. Once again, this is not a claim to perfection, but a claim that this is not a significant source of corrupt edits. But, yes, even a 0.1% error rate does mean a big number in the absolute when thousands of pages are being edited -- and we will continue to pare this down.
Is 0.1% a real data point, or a stab in the dark? Because I found two in 100 on enwp; Robert found at least one in 200 on enwp; and I found 1 in 50 on eswp.
In addition to nowikis, there are also wikilinks that are not what the user intended
https://en.wikipedia.org/w/index.php?title=Ben_Tre&curid=1822927&dif...
https://en.wikipedia.org/w/index.php?title=Celton_Manx&curid=28176434&am...
You are correct, but this is not a dirty diff. I don't want to claim this is a user error entirely -- but a combination of user and software error.
fwiw, I wasn't claiming these or the ones that followed were dirty diffs; these are other problems which the software contributes to, *other* than the nowiki cases we know so well.
Here are three edits trying to add a section header and a sentence, with a wikilink in the section header. (In the process they added other junk into the page, probably unintentionally.)
https://en.wikipedia.org/w/index.php?title=Port_of_Davao&action=history&...
What is the problem here exactly? (that is a question, not a challenge). The user might have entered those newlines as well.
The VE UI is confusing, and did many silly things during those edits. The user had to resort to editing in source editor to clean it up. Step through the diffs.
On Tue, Jul 23, 2013 at 6:28 PM, John Vandenberg jayvdb@gmail.com wrote:
On Wed, Jul 24, 2013 at 2:06 AM, Subramanya Sastry ssastry@wikimedia.org wrote:
Hi John and Risker,
First off, I do want to once again clarify that my intention in the previous post was not to claim that VE/Parsoid is perfect. It was more that we've fixed sufficient bugs at this point that the most significant "bugs" (bugs, not missing features) that need fixing (and are being fixed) are those that have to do with usability tweaks.
How do you know that? Have you performed automated tests on all Wikipedia content?
Yes -- or at least a large random subset of wp content comprising 160,509 articles across a dozen or so different languages.
http://www.mediawiki.org/wiki/Parsoid/Roundtrip
--scott
On 07/23/2013 05:28 PM, John Vandenberg wrote:
On Wed, Jul 24, 2013 at 2:06 AM, Subramanya Sastry ssastry@wikimedia.org wrote:
Hi John and Risker,
First off, I do want to once again clarify that my intention in the previous post was not to claim that VE/Parsoid is perfect. It was more that we've fixed sufficient bugs at this point that the most significant "bugs" (bugs, not missing features) that need fixing (and are being fixed) are those that have to do with usability tweaks.
How do you know that? Have you performed automated tests on all Wikipedia content? Or are you waiting for users to find these bugs?
http://parsoid.wmflabs.org:8001/stats
This is the url for our round trip testing on 160K pages (20K each from 8 wikipedias).
Till late March, we used to run round trip testing on 100K enwp pages. We then moved to a mix of pages from different WPs to catch language and wiki-specific issues and fix them.
So, this is our methodology for catching parse and roundtrip errors on real WP pages and regressions.
I won't go into great detail about what the 3 numbers mean and the nuances.
But, 99.6% means that 0.4% of pages still had corruptions, and that 15% of pages had syntactic dirty diffs.
However, note that this is because the serialization behaves as if the entire document is edited (which lets us stress-test our serialization system), which is not the real behavior in production. In production, our HTML-to-wikitext serialization is smarter and attempts to only serialize modified segments, using the original wikitext for unmodified segments of the DOM (called selective serialization). So, in reality, the corruption percentage should be much smaller than even the 0.4%, and the dirty diffs will be way smaller as well (but you are still finding 1 in 200 or more) -- and this is separate from the nowiki issues.
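To illustrate the idea, here is a rough conceptual sketch of selective serialization (the types and names are illustrative only, not Parsoid's actual API):

  // Each top-level node records whether the edit touched it and, if not,
  // which byte range of the original wikitext produced it.
  interface RtNode {
    modified: boolean;
    dsr?: [number, number];   // [start, end) offsets into the original wikitext
  }

  function serializePage(
    nodes: RtNode[],
    originalWt: string,
    serializeFull: (n: RtNode) => string   // the full (normalizing) serializer
  ): string {
    return nodes.map(node =>
      !node.modified && node.dsr
        ? originalWt.slice(node.dsr[0], node.dsr[1])  // reuse original source verbatim
        : serializeFull(node)                          // re-serialize only what changed
    ).join('');
  }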
We are not solely dependent on users to find bugs for us, no, but in production, if there are corruptions that show up, it would be helpful if we are alerted.
Does that clarify?
VE and Parsoid devs have put in a lot of effort to recognize broken wikitext source, fix it or isolate it,
My point was that you don't appear to be doing analysis of how much of all Wikipedia content is broken; at least I don't see a public document listing which templates and pages are causing the parser problems, so the communities on each Wikipedia can fix them ahead of deployment.
Unfortunately, this is much harder to do. What we can consider is periodically swapping out our test pages for a fresh batch of pages so that new kinds of problems show up in automated testing. In some cases, detecting problems automatically is equivalent to being able to fix them up automatically as well.
Gabriel is currently on a (well-deserved) vacation and once he is back, we'll discuss this issue and see what can be done. But whenever we find problems, we've been fixing templates (about 3 or 4 fixed so far) or fixing the broken wikitext as well.
We also have this desirable enhancement/tool that we could build: https://bugzilla.wikimedia.org/show_bug.cgi?id=46705
I believe there is a bug about automated testing of the parser against existing pages, which would identify problems.
I scanned the Spanish 'visualeditor' tag's 50 recentchanges earlier and found a dirty diff, which I believe hasn't been raised in bugzilla yet.
https://bugzilla.wikimedia.org/show_bug.cgi?id=51909
50 VE edits on eswp is more than one day of recentchanges. Most of the top 10 wikis have roughly the same level of testing going on. That should be a concern. The number of VE edits is about to increase on another nine Wikipedias, with very little real impact analysis having been done. That is a shame, because the enwp deployment has provided us with a list of problems which will impact those wikis if they are using the same syntax, be it weird or broken or otherwise troublesome.
As indicated earlier, we have done automated RT testing on 20K pages on different WPs and fixed various problems, but yes, this will not catch all problematic scenarios.
and protect it across edits, and roundtrip it back in original form to prevent corruption. I think we have been largely successful, but we still have more cases to go that are being exposed here, which we will fix. But, occasionally, these kinds of errors do show up -- and we ask for your patience as we fix these. Once again, this is not a claim to perfection, but a claim that this is not a significant source of corrupt edits. But, yes, even a 0.1% error rate does mean a big number in the absolute when thousands of pages are being edited -- and we will continue to pare this down.
Is 0.1% a real data point, or a stab in the dark? Because I found two in 100 on enwp; Robert found at least one in 200 on enwp; and I found 1 in 50 on eswp.
Sorry -- I should have phrased that better. I just picked 0.1% as an arbitrary number to make the observation that even when it is as low as 0.1%, in absolute numbers, it can still be noticeable.
Subbu.
On Wed, Jul 24, 2013 at 9:02 AM, Subramanya Sastry ssastry@wikimedia.org wrote:
On 07/23/2013 05:28 PM, John Vandenberg wrote:
On Wed, Jul 24, 2013 at 2:06 AM, Subramanya Sastry ssastry@wikimedia.org wrote:
Hi John and Risker,
First off, I do want to once again clarify that my intention in the previous post was not to claim that VE/Parsoid is perfect. It was more that we've fixed sufficient bugs at this point that the most significant "bugs" (bugs, not missing features) that need fixing (and are being fixed) are those that have to do with usability tweaks.
How do you know that? Have you performed automated tests on all Wikipedia content? Or are you waiting for users to find these bugs?
http://parsoid.wmflabs.org:8001/stats
This is the url for our round trip testing on 160K pages (20K each from 8 wikipedias).
Fantastic! How frequently are those tests re-run? Could you add a last-run-date on that page?
Was a regression testsuite built using the issues encountered during the last parser rewrite?
-- John Vandenberg
On 07/23/2013 06:13 PM, John Vandenberg wrote:
On Wed, Jul 24, 2013 at 9:02 AM, Subramanya Sastry ssastry@wikimedia.org wrote:
On 07/23/2013 05:28 PM, John Vandenberg wrote:
On Wed, Jul 24, 2013 at 2:06 AM, Subramanya Sastry ssastry@wikimedia.org wrote:
Hi John and Risker,
First off, I do want to once again clarify that my intention in the previous post was not to claim that VE/Parsoid is perfect. It was more that we've fixed sufficient bugs at this point that the most significant "bugs" (bugs, not missing features) that need fixing (and are being fixed) are those that have to do with usability tweaks.
How do you know that? Have you performed automated tests on all Wikipedia content? Or are you waiting for users to find these bugs?
http://parsoid.wmflabs.org:8001/stats
This is the url for our round trip testing on 160K pages (20K each from 8 wikipedias).
Fantastic! How frequently are those tests re-run? Could you add a last-run-date on that page?
The tests are re-run after a bunch of commits that we think should be regression tested -- usually updated one or more times a day (when a lot of patches are being merged) or after a few days (during periods of low activity). The last code update was Thursday.
http://parsoid.wmflabs.org:8001/commits gives you the list of commits (and the date when the code was updated).
http://parsoid.wmflabs.org:8001/topfails gives you individual test results on every tested page, for more detail.
Currently we are updating our rt testing infrastructure to gather performance numbers as well (this has been on the cards for a long time, but never got the attention it needed). Marco is working on that part of our codebase as we speak: https://bugzilla.wikimedia.org/show_bug.cgi?id=46659 and other related bugs.
We do not deploy to production before we have run tests on a subset of pages in rt-testing. Given the nature of how tests are run, it is usually sufficient to run on about 1000 pages to know if there are serious regressions; sometimes we run on a larger subset of pages.
Was a regression testsuite built using the issues encountered during the last parser rewrite?
We also continually update a parser tests file (in the code repository) with minimized test cases based on regressions and odd wikitext usage. About 1100 tests so far that run in 4 modes (wt2html, wt2wt, html2wt, html2html) plus 14000 randomly generated edits to the tests to mimic edits and test our selective serializer. This is our first guard against bad code.
Subbu.
On Tue, Jul 23, 2013 at 7:13 PM, John Vandenberg jayvdb@gmail.com wrote:
http://parsoid.wmflabs.org:8001/stats
This is the url for our round trip testing on 160K pages (20K each from 8 wikipedias).
Fantastic! How frequently are those tests re-run? Could you add a last-run-date on that page?
The git sha1 displayed on the page can be turned into a timestamp. For example, it's currently showing git d5fe6c9052c23bcc0b63a4d0d1b3e5b68fd2ef37 and https://git.wikimedia.org/commit/mediawiki%2Fextensions%2FParsoid/d5fe6c9052... says that commit was authored on Fri Jul 19 10:20:39 2013 -0700. So, less than a week old (it takes a few days to crank through all the pages in its set).
Was a regression testsuite built using the issues encountered during the last parser rewrite?
Yes, mediawiki/core/tests/parser/parserTests.txt (which predates parsoid) has been continuously updated throughout the development process. --scott
On Tue, Jul 23, 2013 at 7:24 PM, C. Scott Ananian cananian@wikimedia.orgwrote:
Was a regression testsuite built using the issues encountered during the last parser rewrite?
Yes, mediawiki/core/tests/parser/parserTests.txt (which predates parsoid) has been continuously updated throughout the development process.
If you'd like to see the set of tests:
https://git.wikimedia.org/blob/mediawiki%2Fcore/master/tests%2Fparser%2Fpars... (git.wikimedia.org was temporarily down when I wrote my previous email.) --scott
On 07/23/2013 06:02 PM, Subramanya Sastry wrote:
On 07/23/2013 05:28 PM, John Vandenberg wrote:
VE and Parsoid devs have put in a lot of effort to recognize broken wikitext source, fix it or isolate it,
My point was that you don't appear to be doing analysis of how much of all Wikipedia content is broken; at least I don't see a public document listing which templates and pages are causing the parser problems, so the communities on each Wikipedia can fix them ahead of deployment.
Unfortunately, this is much harder to do. What we can consider is periodically swapping out our test pages for a fresh batch of pages so that new kinds of problems show up in automated testing. In some cases, detecting problems automatically is equivalent to being able to fix them up automatically as well.
Actually, we do have the beginnings of a page for this that I had forgotten about: http://www.mediawiki.org/wiki/Parsoid/Broken_wikitext_tar_pit I don't think it is very helpful at this time, or quite what you are asking for, but I am pointing it out for the record that we've thought about it some.
Some of these cases we are actually beginning to address:
* fostered content in top-level pages (we handle fostering from templates)
* handling of templates that produce part of a table cell, or multiple cells, or multiple attributes of an image.
Ideally, we would not have to support these kinds of use cases, but given what we are seeing in production now, we might try to deal with some of them ... Interestingly enough, we do a much better job of protecting against unclosed tables, fostered content out of tables, etc. when they come from templates than when such wikitext occurs in the page content itself. We have a couple of DOM analysis passes to detect those problems and protect them from editing ... but that needs to be extended to top-level page content.
Subbu.
On Wed, Jul 24, 2013 at 9:02 AM, Subramanya Sastry ssastry@wikimedia.org wrote:
http://parsoid.wmflabs.org:8001/stats
This is the url for our round trip testing on 160K pages (20K each from 8 wikipedias).
Very minor point .. there are ~400 missing pages on the list; is that intentional ? ;-)
One is 'Mos:time' which is in NS 0, and does actually exist as a redirect to the WP: manual of style: https://en.wikipedia.org/wiki/Mos:time
... But, 99.6% means that 0.4% of pages still had corruptions, and that 15% of pages had syntactic dirty diffs.
So 15% is 24000 pages which can bust, but may not if the edit doesn't touch the bustable part.
Does /topfails cycle through all 24000, 40 pages at a time?
Could you provide a dump of the list of 24000 bustable pages? Split by project? Each community could then investigate those pages for broken tables, and more critically .. templates which emit broken wikisyntax that is causing your team grief.
Do you have stats on each of those eight wikipedias? i.e. are there noticeable differences in the percentages on different wikipedias? If so, can you report those percentages for each project? I'm guessing Chinese is an example where there are higher percentages..?
-- John Vandenberg
On 07/23/2013 06:55 PM, John Vandenberg wrote:
On Wed, Jul 24, 2013 at 9:02 AM, Subramanya Sastry ssastry@wikimedia.org wrote:
http://parsoid.wmflabs.org:8001/stats
This is the url for our round trip testing on 160K pages (20K each from 8 wikipedias).
Very minor point .. there are ~400 missing pages on the list; is that intentional ? ;-)
One is 'Mos:time' which is in NS 0, and does actually exist as a redirect to the WP: manual of style: https://en.wikipedia.org/wiki/Mos:time
1. Some pages get deleted and then go 404 (http://parsoid.wmflabs.org:8001/failedFetches).
2. There are some (known) bugs in our rt testing infrastructure around recording results -- should be fixed once our testing infrastructure is updated and moved to MySQL (from sqlite).
... But, 99.6% means that 0.4% of pages still had corruptions, and that 15% of pages had syntactic dirty diffs.
So 15% is 24000 pages which can bust, but may not if the edit doesn't touch the bustable part.
No, 15% of pages aren't bust. 15% of pages introduce meaning-preserving (hence purely syntactic) dirty diffs depending on what piece of the page is edited. Whitespace diffs and the addition of " around attribute values are the most common ones.
For an example, see this: http://parsoid.wmflabs.org:8001/result/d5fe6c9052c23bcc0b63a4d0d1b3e5b68fd2e...
0.4% of pages (~640) are classified as having semantic diffs. We assign a numerical score in base 1000 (digit 3 = # errors, digit 2 = # semantic errors, digit 1 = # syntactic errors). When results are sorted in reverse order of score, this gives us the most egregious pages to focus on (crashers first, semantic errors next, then purely syntactic dirty diffs).
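To make the scoring concrete, here is a minimal sketch (the function and variable names are illustrative, and the capping of each digit at 999 is an assumption, not necessarily what the real code does):

  function pageScore(errors: number, semanticDiffs: number, syntacticDiffs: number): number {
    // Cap each base-1000 "digit" so a large count in a lower-severity
    // category can never outrank a higher-severity one (assumed here).
    const cap = (n: number) => Math.min(n, 999);
    return cap(errors) * 1000 * 1000 + cap(semanticDiffs) * 1000 + cap(syntacticDiffs);
  }

  // Sorting by this score in descending order surfaces crashers first, then
  // pages with semantic diffs, then purely syntactic dirty diffs.
  const pages = [
    { title: 'A', score: pageScore(0, 0, 3) },  // syntactic-only dirty diffs
    { title: 'B', score: pageScore(0, 2, 0) },  // semantic diffs
    { title: 'C', score: pageScore(1, 0, 0) },  // crasher / error
  ].sort((a, b) => b.score - a.score);          // order: C, B, A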
So, going to http://parsoid.wmflabs.org:8001/topfails and paging through that will give you what you are looking for. 16 pages with 40 entries each. We hang out on #mediawiki-parsoid and can help editors make sense of the diffs if anyone wants to look for broken wikitext and fix them.
Subbu.
On Tue, Jul 23, 2013 at 7:55 PM, John Vandenberg jayvdb@gmail.com wrote:
On Wed, Jul 24, 2013 at 9:02 AM, Subramanya Sastry ssastry@wikimedia.org wrote:
http://parsoid.wmflabs.org:8001/stats
This is the url for our round trip testing on 160K pages (20K each from 8 wikipedias).
Very minor point .. there are ~400 missing pages on the list; is that intentional ? ;-)
One is 'Mos:time' which is in NS 0, and does actually exist as a redirect to the WP: manual of style: https://en.wikipedia.org/wiki/Mos:time
I think it's an artifact of the changing article set on the wikis. We created the original page set months ago, and we haven't changed it since so that our results are still comparable over time. Since then 1) some pages have been deleted/moved, and 2) we fixed parsoid not to automatically follow redirects (bug 45808).
But, 99.6% means that 0.4% of pages still had corruptions, and that 15% of pages had syntactic dirty diffs.
So 15% is 24000 pages which can bust, but may not if the edit doesn't touch the bustable part.
Subbu covered this in his email. "Yes", but only if you consider an extra unrendered newline (etc.) a "bust". Syntactic diffs are wikitext differences which do *not* lead to visible differences. *Semantic* diffs are the ones which lead to visible differences. So 0.4% of the pages will "bust" iff the bustable part is touched.
Does /topfails cycle through all 24000, 40 pages at a time?
yes.
Could you provide a dump of the list of 24000 bustable pages? Split by project? Each community could then investigate those pages for broken tables, and more critically .. templates which emit broken wikisyntax that is causing your team grief.
We could do that. Usually there will be a very small number of broken templates which end up reused in lots of places. So it's probably best to just look at the first few pages, fix the issues there, and then retest.
Do you have stats on each of those eight wikipedias? i.e. are there noticeable differences in the percentages on different wikipedias? If so, can you report those percentages for each project? I'm guessing Chinese is an example where there are higher percentages..?
http://parsoid.wmflabs.org:8001/stats/en gives results just for en, etc. There are 10k titles from each of en de nl fr it ru es sv pl ja ar he hi ko zh is. (Of course, some titles have been deleted/moved as described above.) --scott
On Wed, Jul 24, 2013 at 1:55 AM, John Vandenberg jayvdb@gmail.com wrote:
Could you provide a dump of the list of 24000 bustable pages? Split by project? Each community could then investigate those pages for broken tables, and more critically .. templates which emit broken wikisyntax that is causing your team grief.
As Subbu said, I'm currently working on improving the round-trip test server, mostly on porting it from sqlite to MySQL but also on expanding the stats kept (with things like performance, etc.). If you think of some other data we should track, or any new report we could add, we certainly welcome suggestions :) Please open a new bug or add to the existing one: https://bugzilla.wikimedia.org/show_bug.cgi?id=46659
Or just drop by #wikimedia-parsoid, I'm marcoil there.
Cheers, Marc
On Wed, Jul 24, 2013 at 3:10 AM, Marc Ordinas i Llopis marcoil@wikimedia.org wrote:
As Subbu said, I'm currently working on improving the round-trip test server, mostly on porting it from sqlite to MySQL but also on expanding the stats kept (with things like performance, etc.). If you think of some other data we should track, or any new report we could add, we certainly welcome suggestions :) Please open a new bug or add to the existing one: https://bugzilla.wikimedia.org/show_bug.cgi?id=46659
Thanks for working on this! The Parsoid testing infrastructure is pretty awesome.
There are a few things I wish it tested, but they're mostly about how it tests things rather than what data is collected. For instance, it would be nice if the round-trip tests could round-trip from wikitext to HTML *string* and back, rather than to HTML *DOM* and back. This would help catch cases where the DOM doesn't cleanly round-trip through the HTML parser (foster-parenting for instance). It may be that this is already implemented, or that it was considered and rejected, I don't know.
Additionally, it might be helpful to have some tests looking for null DSRs or other broken data-parsoid stuff (because this breaks selser), and/or some sort of selser testing in general (though off the top of my head I'm not sure what that would look like). Another fun serialization test that could be done is stripping all data-parsoid attributes and asserting that this doesn't result in any semantic diffs (you'll get lots of syntactic diffs of course).
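Something along those lines might look like the following rough sketch (it assumes domino's standard DOM API; the serializeToWikitext and semanticDiffCount names in the commented driver are hypothetical stand-ins, not real Parsoid functions):

  import domino = require('domino');

  // Remove every data-parsoid attribute from an HTML string and return the
  // stripped markup, so it can be serialized and compared against the
  // unstripped version.
  function stripDataParsoid(html: string): string {
    const doc = domino.createDocument(html);
    const nodes = doc.querySelectorAll('[data-parsoid]');
    for (let i = 0; i < nodes.length; i++) {
      nodes[i].removeAttribute('data-parsoid');
    }
    return doc.body.innerHTML;
  }

  // Hypothetical test driver:
  //   const wtA = serializeToWikitext(html);
  //   const wtB = serializeToWikitext(stripDataParsoid(html));
  //   assert(semanticDiffCount(wtA, wtB) === 0);  // syntactic diffs are expected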
Or just drop by #wikimedia-parsoid, I'm marcoil there.
The channel is #mediawiki-parsoid :)
Roan
On 07/24/2013 09:58 AM, Roan Kattouw wrote:
There are a few things I wish it tested, but they're mostly about how it tests things rather than what data is collected. For instance, it would be nice if the round-trip tests could round-trip from wikitext to HTML *string* and back, rather than to HTML *DOM* and back. This would help catch cases where the DOM doesn't cleanly round-trip through the HTML parser (foster-parenting for instance). It may be that this is already implemented, or that it was considered and rejected, I don't know.
Yes, we've considered this for a while now. It's just not done yet, since we haven't had a chance to work on the testing infrastructure for over 6 months, until now.
Additionally, it might be helpful to have some tests looking for null DSRs or other broken data-parsoid stuff (because this breaks selser), and/or some sort of selser testing in general (though off the top of my head I'm not sure what that would look like). Another fun serialization test that could be done is stripping all data-parsoid attributes and asserting that this doesn't result in any semantic diffs (you'll get lots of syntactic diffs of course).
We've on and off talked about whether we could mimic editing on real pages and test the correctness of the resulting wikitext -- it is unclear at this time how best to do it, so it hasn't happened yet.
Also, a null DSR (* see below for what a DSR is) by itself is not a serious problem -- it just means that that particular DOM node will go through regular serialization (and *might* introduce dirty diffs). We also don't want to add a lot of noise to the testing results without having a way to filter useful things out of it.
But, we could brainstorm ways of doing this on IRC.
Subbu.
* DSR: DOM Source Range. Given a DOM node, a DSR tells us what range of wikitext generated that piece of HTML. While seemingly simple, calculating this accurately without introducing errors is quite tricky, given that wikitext is string-based and the DOM is structural and there is no clean mapping between them, especially in the presence of templates that generate fragments of an HTML string (ex: generating part of an HTML tag like a style attribute, generating multiple table cells, or multiple attributes, etc.). Selective serialization for avoiding dirty diffs relies crucially on the accuracy of this mapping.
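As a toy illustration of the kind of mapping this involves (the offsets and the {{style-tpl}} template are invented for the example):

  // Wikitext (29 characters): <div {{style-tpl}}>text</div>
  //   whole <div> element  -> wikitext range [0, 29)
  //   "text" text node     -> wikitext range [19, 23)
  //   {{style-tpl}}        -> wikitext range [5, 18), expanding to a tag
  //                           fragment such as style="color:red"
  // The style attribute therefore has no wikitext span of its own; the
  // serializer has to treat it as template-generated rather than directly
  // editable source, and an error in these ranges would make selective
  // serialization copy the wrong bytes.
  const dsrExample: { [node: string]: [number, number] } = {
    div: [0, 29],
    text: [19, 23],
  };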
On Wed, Jul 24, 2013 at 11:20 AM, Subramanya Sastry ssastry@wikimedia.orgwrote:
On 07/24/2013 09:58 AM, Roan Kattouw wrote:
There are a few things I wish it tested, but they're mostly about how it tests things rather than what data is collected. For instance, it would be nice if the round-trip tests could round-trip from wikitext to HTML *string* and back, rather than to HTML *DOM* and back. This would help catch cases where the DOM doesn't cleanly round-trip through the HTML parser (foster-parenting for instance). It may be that this is already implemented, or that it was considered and rejected, I don't know.
Yes, we've considered this for a while now. Just not done yet since we haven't had a chance to work on the testing infrastructure in over 6 months till now.
For what it's worth, both the DOM serialization-to-a-string and DOM parsing-from-a-string are done with the domino package. It has a substantial test suite of its own (originally from http://www.w3.org/html/wg/wiki/Testing I believe). So although the above is probably worth doing as a low-priority task, it's really a test of the third-party library, not of Parsoid. (Although, since I'm a co-maintainer of domino, I'd be very interested in fixing any bugs which it did turn up.)
The foster parenting issues mostly arise in the wikitext->parsoid DOM phase. Basically, the wikitext is tokenized into a HTML tag soup and then a customized version of the standard HTML parser is used to assemble the soup into a DOM, mimicking the process by which a browser would parse the tag soup emitted by the current PHP parser. So the existing test suite does expose these foster-parenting issues already. --scott
On Wed, Jul 24, 2013 at 2:49 PM, C. Scott Ananian cananian@wikimedia.org wrote:
For what it's worth, both the DOM serialization-to-a-string and DOM parsing-from-a-string are done with the domino package. It has a substantial test suite of its own (originally from http://www.w3.org/html/wg/wiki/Testing I believe). So although the above is probably worth doing as a low-priority task, it's really a test of the third-party library, not of Parsoid. (Although, since I'm a co-maintainer of domino, I'd be very interested in fixing any bugs which it did turn up.)
I didn't mean it as a test of Domino, I meant it as a test of Parsoid: does it generate things that are then foster-parented out, or other things that a compliant DOM parser won't round-trip? It's also a more realistic test, because the way that Parsoid is actually used by VE in practice is that it serializes its DOM, sends it over the wire to VE, which then does things with it and gives an HTML string back, which is then parsed through Domino. So even in normal operation, ignoring the fact that VE runs stuff through the browser's DOM parser, Parsoid itself already round-trips the HTML through Domino, effectively.
The foster parenting issues mostly arise in the wikitext->parsoid DOM phase. Basically, the wikitext is tokenized into a HTML tag soup and then a customized version of the standard HTML parser is used to assemble the soup into a DOM, mimicking the process by which a browser would parse the tag soup emitted by the current PHP parser. So the existing test suite does expose these foster-parenting issues already.
Does it really? There were a number of foster-parenting issues a few months ago where Parsoid inserted <meta> tags in places where they can't be put (e.g. <tr>s), and no one in the Parsoid team seemed to have noticed until I tracked down a few VE bugs to that problem.
Roan
On 07/25/2013 01:03 PM, Roan Kattouw wrote:
On Wed, Jul 24, 2013 at 2:49 PM, C. Scott Ananian cananian@wikimedia.org wrote:
For what it's worth, both the DOM serialization-to-a-string and DOM parsing-from-a-string are done with the domino package. It has a substantial test suite of its own (originally from http://www.w3.org/html/wg/wiki/Testing I believe). So although the above is probably worth doing as a low-priority task, it's really a test of the third-party library, not of Parsoid. (Although, since I'm a co-maintainer of domino, I'd be very interested in fixing any bugs which it did turn up.)
I didn't mean it as a test of Domino, I meant it as a test of Parsoid: does it generate things that are then foster-parented out, or other things that a compliant DOM parser won't round-trip? It's also a more realistic test, because the way that Parsoid is actually used by VE in practice is that it serializes its DOM, sends it over the wire to VE, which then does things with it and gives an HTML string back, which is then parsed through Domino. So even in normal operation, ignoring the fact that VE runs stuff through the browser's DOM parser, Parsoid itself already round-trips the HTML through Domino, effectively.
We use two different libraries for different things:
* html5 library for building a DOM from a tag soup
* domino for serializing DOM --> HTML and for parsing HTML --> DOM
When doing a WT2WT roundtrip test, there are 2 ways to do this:
1. wikitext --> tag soup --> DOM (in-memory tree) --> wikitext
2. wikitext --> tag soup --> DOM (in-memory tree) --> HTML (string) --> DOM --> wikitext
We currently do 1. in our wt2wt testing. If there are foster-parenting bugs in the HTML5 library, then they will get hidden if we use path 1. However, when using VE and serializing its result back to wikitext, we are effectively using path 2.
And, both Roan and Scott are correct. Pathway 2 would be a test of external libraries (HTML5 and Domino, not just domino). And, we did have bugs in the HTML5 parsing library we used (which I fixed based on reports from Roan) and then added them to parser tests.
But, if we use path 2. for all our RT testing for wp pages, other latent bugs with fostered content will show up.
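A rough sketch of the two paths, with declared stand-ins for the real pipeline stages (these are not the actual function names):

  // Stand-in declarations (hypothetical names) for the pipeline stages.
  declare function wikitextToDom(wt: string): Document;     // tokenizer + HTML5 tree builder
  declare function domToHtmlString(dom: Document): string;  // DOM --> HTML string (domino)
  declare function htmlStringToDom(html: string): Document; // HTML string --> DOM (domino)
  declare function domToWikitext(dom: Document): string;    // the wikitext serializer

  // Path 1: the DOM never leaves memory, so HTML string round-trip bugs stay hidden.
  function rtPath1(wt: string): string {
    return domToWikitext(wikitextToDom(wt));
  }

  // Path 2: what effectively happens with VE -- the DOM is serialized to an HTML
  // string and re-parsed, so foster-parenting and similar issues can surface.
  function rtPath2(wt: string): string {
    return domToWikitext(htmlStringToDom(domToHtmlString(wikitextToDom(wt))));
  }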
Hope this clarifies the issue.
Subbu.
On Thu, Jul 25, 2013 at 2:19 PM, Subramanya Sastry ssastry@wikimedia.orgwrote:
And, both Roan and Scott are correct. Pathway 2 would be a test of external libraries (HTML5 and Domino, not just domino). And, we did have bugs in the HTML5 parsing library we used (which I fixed based on reports from Roan) and then added them to parser tests.
If you're playing along at home, the domino bug was: https://github.com/fgnass/domino/pull/36
Hopefully there aren't too many more of those lurking. --scott
On Wed, Jul 24, 2013 at 4:58 PM, Roan Kattouw roan.kattouw@gmail.comwrote:
Or just drop by #wikimedia-parsoid, I'm marcoil there.
The channel is #mediawiki-parsoid :)
Yes, sorry… I hadn't had enough coffee :)