On Thu, Aug 17, 2006 at 12:21:13AM -0400, Eric Astor wrote:
> Let's see here. Please consider this an incomplete, unreliable list, meant
> solely as an indication of the basic problems encountered when attempting to
> formalize MediaWiki's wikitext... And I'm no expert on parsing, except in
> that I've spent a large part of the summer constructing parsers for
> essentially unparseable languages. Basic point, though, is that MediaWiki
> wikitext is INCREDIBLY context-sensitive.
>
> Single case that shows something interesting:
> '''hi''hello'''hi'''hello''hi'''
>
> Try running it through MediaWiki, and what do you get?
> <b>hi<i>hello</i></b><i>hi<b>hello</b></i><b>hi</b>
>
> In other words, you've discovered that the current syntax supports improper
> nesting of markup, in a rather unique fashion. I don't know of any way to
> duplicate this in any significantly formal system, although I believe a
> multiple-pass parser *might* be capable of handling it. In fact, some sort
> of multiple-pass parser (the MediaWiki parser) obviously can.
I suspect that the "proper" parsing of that particular combination is
undefined, and therefore you can do anything you like.
That's one of the points I was suggesting.
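For illustration, a minimal Python sketch (emphatically not MediaWiki's
actual quote-balancing pass) that just tokenizes the apostrophe runs --
any parser starts from a token stream like this and then has to decide
how, or whether, to nest the tags:

    import re

    # Minimal sketch only: split a line on ''''' / ''' / '' runs.
    # This is NOT the MediaWiki algorithm; it just makes the ambiguity
    # in Eric's example visible as a token stream.
    def tokenize_quotes(line):
        return [t for t in re.split(r"('{5}|'{3}|'{2})", line) if t]

    print(tokenize_quotes("'''hi''hello'''hi'''hello''hi'''"))
    # ["'''", 'hi', "''", 'hello', "'''", 'hi', "'''", 'hello', "''", 'hi', "'''"]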
> Also, templates need to be transcluded before most of the parsing can take
> place, since in the current system, the text may leave some
> syntactically-significant constructs incomplete, finishing them in the
> transclusion stage...
And of course, there's extensions, but I gather they're responsible for
calling the parser themselves, which seemed to make sense.
> In summary, for most definitions of formal, it is impossible to write formal
> grammars for most significant subsets of current MediaWiki syntax. I had
> significant success with a regex-based grammar specification (using Martel),
> backed by a VERY general backend capable of back-tracking and other clever
> tricks (mxTextTools) - but the recursive structure is virtually impossible
> to handle in a regex-based framework.
>
> - Eric Astor
>
> P.S. As indicated above, I honestly feel that the difficulties aren't
> insurmountable - if you're willing to build an appropriate parsing
> framework, which will be semi-formal at best.
>
> P.P.S. When possible, in my *copious* free time (</sarcasm>), I'm hoping to
> take another frontend to mxTextTools (SimpleParse, to be specific), modify
> it sufficiently to support all the necessary features, and then build
> something capable of parsing the current MediaWiki syntax (although I might
> have to drop support for improper nesting). I've no idea if or when this
> might happen, but I'm considering it a long-term goal if the current
> situation doesn't improve.
I don't know that I think that the spec has to be something you can
feed to Bison, certainly. But it has to be unambiguously parseable,
with as many corner cases defined as you can manage, at least by
humans, before it's worth trying anything more complicated.
And it's going to *have* to be done sooner or later. I haven't ever
even looked at the parser code, and just from people talking about it, I
can tell that there will come a time when it's just too tense to work
on anymore.
Hopefully it will get replaced before then.
On Wed, Aug 16, 2006 at 11:26:22PM -0400, Ivan Krstić wrote:
> Jay R. Ashworth wrote:
> > I don't know how useful it will be to have wikitext specified strictly,
> > and I don't think we'll be able to tell until we see how far off we
> > are, and what might need to be tweaked.
>
> This was discussed at hacking days. Brion's pronouncement is that the
> current syntax will admit essentially no backwards-incompatible changes.
My point was more based on taking advantage of the
implementation-defined and -dependent portions of the current 'spec';
things like specifying binding and precedence rules for cases like
Eric's first example, above.
It's unfortunate that formalization went on the table so late, but it
gets done for a reason: being an outgrowth of an engineering
construct, if you need it and you don't do it, then you Just Can't do
whatever it was that made you decide you needed it.
Wasn't someone from SoC working on this?
Did we ever get a final status report from the SoC work? (It's done
now, isn't it?)
And let's be quite clear: *Brion* (and Tim) will admit no
backwards-incompatible changes, not the syntax. The syntax is an
inanimate non-object.
(I'm not trying to be combative, there, just honest.)
Cheers,
-- jra
--
Jay R. Ashworth jra(a)baylink.com
Designer Baylink RFC 2100
Ashworth & Associates The Things I Think '87 e24
St Petersburg FL USA http://baylink.pitas.com +1 727 647 1274
The Internet: We paved paradise, and put up a snarking lot.
Hi, buddies. Thanks for your information on the grammar.
I'm not an expert on parsers; I just translated some MediaWiki PHP code
into Java. I have not fully tested the parser, but it seems to
work normally on several articles I copied from Wikipedia. Only a
subset of the MediaWiki markup is supported so far, including headings,
horizontal rules, internal links, lists, quotes and tables. The parser
might still be very buggy.
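Not the Java code itself, but here is a rough Python sketch of the kind
of line-level rules involved for two of those constructs (headings and
horizontal rules); the regexes are illustrative assumptions, not
MediaWiki's exact grammar:

    import re

    # Illustrative line-level rules only; not the actual Java port and
    # not MediaWiki's exact grammar.
    HEADING = re.compile(r'^(={1,6})\s*(.*?)\s*\1\s*$')
    HR = re.compile(r'^-{4,}\s*$')

    def render_line(line):
        if HR.match(line):                       # ---- becomes a horizontal rule
            return '<hr />'
        m = HEADING.match(line)
        if m:                                    # == Title == becomes <h2>Title</h2>
            level = len(m.group(1))
            return '<h%d>%s</h%d>' % (level, m.group(2), level)
        return line                              # everything else passes through

    print(render_line('== Heading =='))          # <h2>Heading</h2>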
I think 100% compatibility with MediaWiki is very
difficult, and in fact does not make much sense. Even if only a subset
of the grammar is supported, it would still be usable and useful in
most cases.
As for interoperability, I am using the Mwapi from MwJed,
and it works very well.
When I download the dump
http://download.wikimedia.org/svwiki/20060808/svwiki-20060808-templatelinks…
and uncompress it, I find violations of UTF-8 (apparently
remainders of ISO-8859-1) in these records:
0xf6 in (141524,10,'F\xf6rfattarstub'),
0xf6 in (154217,10,'Geografistub-Gr\xf6nland'),
0xe4 in (147111,10,'Japanskt_L\xe4n'),
0xe4 in (145703,10,'Motorv\xe4gar_i_Sverige'),
0xc4 in (125122,10,'RA\xc4'),
0xe5 in (146360,10,'Sk\xe5despelarstub'),
0xf6 and 0xd6 in (160822,10,'S\xf6dra_\xd6sterbotten'),
0xe4 in (145703,10,'TrafikplatsLandsv\xe4g'),
I could still import this SQL dump into mysql (4.0), but when I
open the SQL dump file in GNU Emacs (22.0.50) it doesn't go into
Unicode mode as it does for a clean UTF-8 file.
I've found no errors in some other files I've looked at.
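In case it's useful, a minimal Python 3 sketch of how such bytes can be
located; the filename is assumed from the URL above (after
uncompressing):

    # Minimal sketch: report dump lines that are not valid UTF-8.
    # Filename assumed from the URL above; Python 3.
    with open('svwiki-20060808-templatelinks.sql', 'rb') as f:
        for lineno, raw in enumerate(f, 1):
            try:
                raw.decode('utf-8')
            except UnicodeDecodeError as e:
                print('line %d: byte 0x%02x at offset %d is not UTF-8'
                      % (lineno, raw[e.start], e.start))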
This is a total of 9 violations in 8 records referring to 7
different pages (page ID 145703 appears twice) out of 330,000
records, so no real reason for panic. These seven page IDs are
not present in the page.sql dump, so these are apparently stale link
records that should have been removed from the database. If I run the
inner join:
select page_namespace, page_title, tl_namespace, tl_title
from page, templatelinks where page_id = tl_from;
the result is clean UTF-8. But the result is 2076 rows shorter
than the templatelinks table:
select count(*)
from page, templatelinks where page_id=tl_from;
328349
select count(*)
from templatelinks;
330425
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik - http://aronsson.se
On Tue, Aug 15, 2006 at 10:44:39AM +1000, Nick Jenkins wrote:
> As you can see from this edit :
> http://wikiwyg.wikia.com/index.php?title=Testpage&diff=136009&oldid=136007#Neapolitan_double_quotes:_dsomething
> (which was typed in wikiwyg mode), the '' gets converted to italics
> upon saving, not rendered as
> a literal '' (i.e. what you see in the wikiwyg mode - two quotes - is
> not what you get in the rendered HTML output after saving - italics in
> the headline).
Ok, now here's a completely different issue:
What should Wikiwyg *do* if you hand it something that looks like
wikitext?
My intuition is that it should *not* treat it as wikitext, and this is
the corner case that demonstrates why, but I can see arguments on both sides.
Discuss.
Cheers,
-- jra
--
Jay R. Ashworth jra(a)baylink.com
Designer Baylink RFC 2100
Ashworth & Associates The Things I Think '87 e24
St Petersburg FL USA http://baylink.pitas.com +1 727 647 1274
The Internet: We paved paradise, and put up a snarking lot.
Hi all,
Two questions: First, do any decent editing tools exist, specialised for
editing wikis, and in particular MediaWiki wikis? Such a thing would
have to be capable of browsing, but let you do editing in some kind of
more sophisticated, enhanced way - whatever that is.
Secondly, if such a thing doesn't exist (possible, since I haven't
heard of one), are there any real obstacles to it happening? Why has
all the discussion of WYSIWYG wiki editing been focused on server-side
implementations? With the exception of querying the database, why
can't all this be implemented locally, allowing a possibly richer user
experience by using native Windows (for example :)) calls rather than
the limitations of JavaScript?
Has no one tried?
Steve
Hello Wikitech,
I am curious about the degree (if any) to which Wikimedia experiences
DoS attacks on its servers. Mainly I'm curious about:
(a) whether attacks happen; and
(b) the character of the attacks themselves (application-level? SYN
flood? ICMP flood?).
Is this mailing list the correct forum in which to ask this question? If
not, should I email noc(a)wikimedia.org? I am a graduate student doing
research on DoS attacks and would be extremely grateful for any
information or help.
Many thanks in advance.
-Mike Walfish
We used to use SORBS to blacklist open proxies, but that's pretty
dodgy (requiring a $50 donation to remove an IP). Now we don't use
anything, which means that admins have to manually block thousands of
IPs if some spammer or vandal starts attacking from open proxies. See
bug 6988: http://bugs.wikimedia.org/show_bug.cgi?id=6988. An admin
from kuwiki just came on #mediawiki asking how to block *all*
anonymous users due to the severity of the onslaught (see
http://ku.wikipedia.org/w/index.php?title=Taybet:Recentchanges&hideliu=1).
Similar issues on a lesser scale occur on many projects, which is why
we have the open-proxy-blocking policy in the first place.
So, after some Googling, I found
http://www.declude.com/Articles.asp?ID=97, a list of various DNSBLs.
One promising one appears to be AHBL: see
http://www.ahbl.org/services.php. They offer various DNSBLs, but the
two of interest to us are probably their Tor and IRC lists (the latter
blocks open proxies and otherwise infected computers). Of course,
these need to be subjected to scrutiny before we actually use them,
and a whitelist (per-project? on Meta?) would be a good idea as well
in case we're convinced there's a false positive.
What should detected proxies be prevented from doing? Editing
anonymously, obviously, and creating accounts, at least to begin with.
Registered editing could eventually be prohibited if known good users
are whitelistable per-project somehow (by username, not by IP).
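For reference, the lookup itself is trivial; here's a minimal Python
sketch. The zone name is an assumption -- use whatever zones AHBL
actually publishes:

    import socket

    # Minimal DNSBL check: reverse the IPv4 octets, append the blocklist
    # zone, and see whether an A record exists.  The zone name here is an
    # assumption -- substitute AHBL's real zone(s).
    def is_listed(ip, zone='dnsbl.ahbl.org'):
        query = '.'.join(reversed(ip.split('.'))) + '.' + zone
        try:
            socket.gethostbyname(query)    # any answer means "listed"
            return True
        except socket.gaierror:            # NXDOMAIN means "not listed"
            return False

    print(is_listed('127.0.0.2'))          # most DNSBLs list this test address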
An automated run of parserTests.php showed the following failures:
Running test TODO: Table security: embedded pipes (http://mail.wikipedia.org/pipermail/wikitech-l/2006-April/034637.html)... FAILED!
Running test TODO: Link containing double-single-quotes '' (bug 4598)... FAILED!
Running test TODO: Template with thumb image (with link in description)... FAILED!
Running test TODO: message transform: <noinclude> in transcluded template (bug 4926)... FAILED!
Running test TODO: message transform: <onlyinclude> in transcluded template (bug 4926)... FAILED!
Running test BUG 1887, part 2: A <math> with a thumbnail- math enabled... FAILED!
Running test TODO: HTML bullet list, unclosed tags (bug 5497)... FAILED!
Running test TODO: HTML ordered list, unclosed tags (bug 5497)... FAILED!
Running test TODO: HTML nested bullet list, open tags (bug 5497)... FAILED!
Running test TODO: HTML nested ordered list, open tags (bug 5497)... FAILED!
Running test TODO: Parsing optional HTML elements (Bug 6171)... FAILED!
Running test TODO: Inline HTML vs wiki block nesting... FAILED!
Running test TODO: Mixing markup for italics and bold... FAILED!
Running test TODO: 5 quotes, code coverage +1 line... FAILED!
Running test TODO: HTML Hex character encoding.... FAILED!
Running test TODO: dt/dd/dl test... FAILED!
Passed 413 of 429 tests (96.27%) FAILED!
Do I correctly remember that Wikimedia projects do not keep log files and
statistics and stuff (other than what can be gleaned from the
database itself), to reduce server load? I think I remember somebody saying
that even log files are not kept... or that could have been some other
reality.
I found
http://en.wikipedia.org/wiki/Wikipedia:Statistics
and
http://stats.wikimedia.org/EN/ChartsWikipediaEN.htm
And a few other things, but nothing that looks like it would answer some of
the questions being asked.
So - hitting an external log with the standard client-issued image /
JavaScript counter thingy would get some of that. More data could be
gleaned by combining that with the info available from the database.
Has this already been discussed/beaten to death? Is it a dumb idea?
Anybody got a server and bandwidth to take a gazillion hits to crunch some
additional statistics? :-)
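Something like this is all the collector would need to be -- a minimal
Python sketch, everything here hypothetical, that logs each hit and
answers with an empty 204 so pages can embed it as an image or script
beacon:

    import time
    from wsgiref.simple_server import make_server

    # Hypothetical minimal hit collector: append one line per request to
    # a log file and return 204 No Content so the "image" costs nothing.
    def app(environ, start_response):
        with open('hits.log', 'a') as log:
            log.write('%f %s %s\n' % (time.time(),
                                      environ.get('REMOTE_ADDR', '-'),
                                      environ.get('PATH_INFO', '/')))
        start_response('204 No Content', [])
        return [b'']

    if __name__ == '__main__':
        make_server('', 8000, app).serve_forever()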