I've noticed a growing number of extensions extending link syntax (namely SMW's annotations, and other extensions using Embed:, Video:, or theoretically even Audio: namespaces for embedding things).
However, all implementations have serious issues. The parser handles links internally, but when an extension extends link syntax it's customary to use a regex rather than duplicate a small part of the parser. This normally leads to either a limited syntax that falls short of what the parser does, or a regex so complex that it causes server errors when the syntax is slightly broken (e.g. a missing trailing ]]).
For that reason I'm looking into adding a new feature to the parser: Link Hooks. Basically, this would allow an extension to hook into link processing for a namespace or a pattern.
I plan to support a number of flags, so it should handle most cases:
- Link/Media callbacks: link modification vs. embedding.
- Namespace/pattern: a namespace number, or a special pattern (like SMW's ::).
- Multi-params: pipe-separated params rather than one display text.
- Recursive parameters: things like Image:, where links can be inside parameters.
- Recursive link text: for patterns which break things up and may contain links.
Unfortunately, I hit a snag in the code when dealing with [[Embedablens:Page|Content with [[link|displaytext]] inside]]. I can't provide data to extensions in a sane way. Either plain text is sent to them and they work with that (albeit breaking things as usual), or I try to split on the |'s, which doesn't work with nested things, or I parse the nested links first, but then extensions get a hard-to-work-with mess passed to them as their data.
The nice way the preprocessor works with objects suggests that the best approach would probably be to recursively parse the text into link objects and then do our expansion, while also giving extensions access to the tree in special ways (extract as WikiText, HTML, or plain text).
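The second option (splitting on the pipes) is easy to demonstrate. This is only an illustration in Python of why a flat split fails on the example above, not code from the parser:

```python
# The body of the example link, without the outer [[ and ]].
body = "Embedablens:Page|Content with [[link|displaytext]] inside"

# A flat split cannot tell the outer pipe from the one inside the nested link.
naive_parts = body.split("|")

for part in naive_parts:
    print(repr(part))
```

The split yields three pieces instead of two, and the nested link is cut in half: "Content with [[link" and "displaytext]] inside".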
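A minimal sketch of what such a tree could look like, written in Python for brevity. LinkNode, parse_links and the two extraction methods are hypothetical names, and the parser's real rules (namespaces, invalid titles, and so on) are ignored; it only shows nested [[...]] constructs becoming objects whose pipe-separated parts stay intact.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class LinkNode:
    """One [[...]] construct; each pipe-separated part holds str and LinkNode items."""
    parts: List[List[Union[str, "LinkNode"]]] = field(default_factory=lambda: [[]])

    def to_wikitext(self) -> str:
        return "[[" + "|".join(
            "".join(c if isinstance(c, str) else c.to_wikitext() for c in part)
            for part in self.parts
        ) + "]]"

    def to_plain_text(self) -> str:
        # Assumption: the last part is the display text (the target if no pipe).
        return "".join(
            c if isinstance(c, str) else c.to_plain_text() for c in self.parts[-1]
        )


def parse_links(text: str) -> list:
    """Parse text into a list of str and LinkNode items."""
    root: list = []
    stack: List[LinkNode] = []  # currently open links, innermost last

    def emit(item) -> None:
        target = stack[-1].parts[-1] if stack else root
        if isinstance(item, str) and target and isinstance(target[-1], str):
            target[-1] += item  # merge adjacent text
        else:
            target.append(item)

    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair == "[[":
            stack.append(LinkNode())
            i += 2
        elif pair == "]]" and stack:
            emit(stack.pop())
            i += 2
        elif text[i] == "|" and stack:
            stack[-1].parts.append([])  # pipe belongs to the innermost open link
            i += 1
        else:
            emit(text[i])
            i += 1
    while stack:
        # Unclosed [[ : put it back as literal text (without the closing ]]).
        emit(stack.pop().to_wikitext()[:-2])
    return root


tree = parse_links("[[Embedablens:Page|Content with [[link|displaytext]] inside]]")
outer = tree[0]
print(outer.to_plain_text())
```

A callback handed `outer` could read `outer.parts[1]` as a list of text and link objects, or ask for the wikitext or plain-text form, instead of receiving pre-split or pre-rendered strings.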
Some research into the way the parser handles links gave good results at first: [[link [[inside of]] link]] nicely gives you a link to "inside of" with the outside stuff verbatim, just as the processor I have in mind would. However, I ran into an ugly, sticky mess with image embedding: http://dev.wiki-tools.com/wiki/LinkHook#Old_Tests (ignore the fact that my examples here don't have the frame option).
[[Image:File.ext|Caption]]
  Renders as an image with "Caption".
[[Image:File.ext|[[Image:File.ext|Caption]]]]
  Renders an image inside of another image that has a caption of "Caption".
[[Image:File.ext|[[Image:File.ext|[[link]]]]]]
  Renders [[link]] as a link; the rest is completely verbatim.
Honestly, the syntax is inconsistent with itself. If we were trying to stop embeds inside of embeds, then the last one should render as an image, with a link to [[link]] and the other Image: verbatim as a caption.
I believe there is a bug about the 2nd case, if anyone has it handy I'd love a link. I hunted through bugzilla but couldn't find it.
Some use cases, and what's expected of them, would be nice.
My issue is that Image links are functionally supposed to be the same as a setLinkHook using the Media, Multi-params, and Recursive parameters options (embed, but without a : at the start; pipe-separated parameters; and parameters that can have links inside of them). However, any extension or anything else using setLinkHook with the recursive parameters option would be expecting something different:
[[Embed:Title|[[Otherembed:Title]] and [[link]]]]
  Would actually render as an embed with two links (since it's inside of another embed, the 'Otherembed' reverts to a link).
[[Embed:Title|[[Otherembed:Title|[[link]]]]]]
  Would actually render as an embed with a link to [[link]] and the rest of the caption verbatim.
Daniel Friesen wrote:
[[Image:File.ext|Caption]] Renders as a image with "Caption" [[Image:File.ext|[[Image:File.ext|Caption]]]] Renders an image inside of another image that has a caption of "Caption". [[Image:File.ext|[[Image:File.ext|[[link]]]]]] Renders [[link]] as a link, the rest is completely verbatim.
Honestly, the syntax is inconsistent with itself. If we were trying to stop embeds inside of embeds, then the last one should render as an image, with a link to [[link]] and the other Image: verbatim as a caption.
Yes, links are not currently fully 'embeddable' in a recursive way. :( You're currently allowed just one level of 'link embedding' in the caption area for Image: links, and even that's special-cased.
It was basically stuck in as a hack on the existing link parsing, which was optimized for doing a single flat pass of links through the entire page; after we extended link syntax to allow image captions, there was a need to hack it up to allow links in the captions...
If it can be more cleanly done in a way that, as it happens, lets you do multiple levels cleanly, that would probably be just great! But definitely try to keep it clean and consistent. :)
In general, though, I'm not sure we should concentrate on using link syntax here; the trend these days seems to be to use parser functions for such things.
-- brion
Yes, I'm trying to keep things consistent. The only place where my idea for a processor differs from the current syntax is in the insane edge cases, and even then only in the two most unlikely of them. I'd like to track down the bug report related to that odd rendering. I can't imagine anyone relying on [[Image:File.ext|[[Image:File.ext|[[link]]]]]] rendering [[link]] as a link and the rest verbatim. However, there might be some insane use of [[Image:File.ext|[[Image:File.ext|Caption]]]].
Well, yes. Link syntax is not a one-stop solution; in fact, compared to ParserFunctions its features are going to be substandard. However, there are some nice cases where extending a link fits the syntax better than using a parser function, especially embedding via the Image: namespace and how that could be extended to embed things like Audio:. SMW also uses annotations, which most of the time do fit link-like syntax. SMW could use an #annotate pfunc for the ugly cases, but that's beside the point here. It would still be good to preserve the old syntax where possible.
~Daniel Friesen(Dantman, Nadir-Seen-Fire) of: -The Nadir-Point Group (http://nadir-point.com) --It's Wiki-Tools subgroup (http://wiki-tools.com) --The ElectronicMe project (http://electronic-me.org) --Games-G.P.S. (http://ggps.org) -And Wikia ACG on Wikia.com (http://wikia.com/wiki/Wikia_ACG) --Animepedia (http://anime.wikia.com) --Narutopedia (http://naruto.wikia.com)
Daniel Friesen wrote:
However there are some nice cases where extending a link fits the syntax better than using a parser function. <...> SMW Also uses annotations, which do most of the time fit in as link like syntax. SMW Could use an #annotate pfunc for the ugly cases, but that's beside the case here. It would still be good to preserve the old syntax where possible.
MW's and SMW's parsing of links isn't very impressive at the moment.
Due to the limitations of regexes for nested links, and other problems with overly complex expressions, I've had some fun playing around with an attempt at a new algorithm which would actually follow the desired rules.
The algorithm should return matching tags, nested and perfectly balanced (which is not the case now), at any complexity, and preferably without choking the hardware.
So far I have only a Delphi* version (draft), for which I did some functional and performance tests last night: http://wiki.rilnet.com/wiki/Pattern_matching_of_SMW_properties/Testdata
More about the regexes currently used in SMW (from a recent post by Markus Krötzsch) here: http://wiki.rilnet.com/wiki/Pattern_matching_of_SMW_properties
Regards,
// Rolf Lampa
* For Pascal->PHP conversion I extended an existing DelphiToCpp converter: http://wiki.rilnet.com/wiki/RIL_DelphiToPHP_Converter
Rolf Lampa wrote:
MW's and SMW's parsing of links isn't very impressing at the moment.
I might be talking out of my ass here, since I haven't really looked very much at the relevant parts of the code (except to know that the current link parsing is indeed hacky), but couldn't we somehow reuse whatever code we currently use to parse the curly bracket syntax for transclusion, parser functions and whatnot? After all, from a user's viewpoint, pretty much the only syntactical difference between linking and transclusion is the shape of the brackets.
Ilmari Karonen wrote:
Rolf Lampa wrote:
MW's and SMW's parsing of links isn't very impressing at the moment.
After all, from a user's viewpoint, pretty much the only syntactical difference between linking and transclusion is the shape of the brackets.
Hm, I'd say that it's about more than just keeping braces/brackets in balance. It's also about rules for the link content which (should) determine whether the entire link should be skipped or not (on syntax errors). Some rules require unique logic and awareness of context.
But as I said, I haven't looked into how the logic for braces is structured either (whether it allows for callbacks, for example, though those can be risky).
In any case, I'm about to learn some PHP, so I'll continue "playing around" with this for a while, if for no other reason than my own interest. :)
Regards,
// Rolf Lampa
TimStarling commented that adding links into the preprocessor would change the way they are handled and would break cases like http://sandbox.wiki-tools.com/edit/FakeLink, changing the syntax of WikiText in an incompatible way.
The actual code for Parser::replaceInternalLinks is a sort of explode engine. It starts off by exploding on [[, then searches for the ends and such, also taking into account Image:, which can have recursive stuff. It's quite ugly, but it works, and I have been able to find good points to run callbacks in. However, my issue lies in the |: there is no strict handling of those, and making them "safe" is handled by parsing links inside of the links before the | is broken up. That works well for the parser, but not for anything you want to send to a callback.
So the plan is to actually build an object tree similar to the Frames and Parts the preprocessor uses. This'll allow for better handling of things inside of callbacks.
Daniel Friesen wrote:
Timstarling commented that adding links into the preprocessor would change they way they are handled and would break cases like http://sandbox.wiki-tools.com/edit/FakeLink changing the syntax of WikiText in an incompatible way.
Yes, that's a good example.
<...>However my issue lies in the |, there is no strict handling of those and making them "safe" is handled by parsing links inside of the links before the | is broken up. Works good for the parser, but not for anything you want to send to a callback.
A temporary hint for extension writers (until a final generic solution is available in the framework) is to count the brackets and count the pipes only while at the "main-link" level, that is:
0. Start a loop examining the string or string fragment.
1. Count UP on [ brackets.                              // $BracketsCnt++
2. Count DOWN on ] brackets.                            // $BracketsCnt--
3. Count | (pipes) ONLY when $BracketsCnt equals two.   // if ( $BracketsCnt == 2 ) $PipesCnt++
4. Break the loop if more than one pipe was found.      // if ( $PipesCnt > 1 ) break;
This would detect the syntax error in the link "[[ | | ]]" as well as in "[[ | [[ | | ]] ]]" (on the second call, if called recursively).
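The counting steps above can be sketched as a small function. This is written in Python for brevity rather than PHP; the function name and the max_pipes parameter are made up for the example, and it only checks the single rule described (at most one pipe at the main-link level).

```python
def has_too_many_pipes(fragment: str, max_pipes: int = 1) -> bool:
    """Return True if more than max_pipes pipes occur at the main-link level.

    The main-link level is where exactly two '[' are open, i.e. directly
    inside the outermost [[ ... ]] of the fragment.
    """
    brackets = 0
    pipes = 0
    for ch in fragment:
        if ch == "[":
            brackets += 1            # step 1: count up
        elif ch == "]":
            brackets -= 1            # step 2: count down
        elif ch == "|" and brackets == 2:
            pipes += 1               # step 3: only count main-level pipes
            if pipes > max_pipes:
                return True          # step 4: stop early
    return False


print(has_too_many_pipes("[[ | | ]]"))          # True
print(has_too_many_pipes("[[ | [[ | | ]] ]]"))  # False: inner pipes are at depth 4
print(has_too_many_pipes("[[ | | ]]"[0:9]))     # True again for the inner link on a second call
```

Because pipes inside a nested link sit at bracket depth four, the outer call ignores them; the nested link has to be checked in its own call, as described above.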
So the plan is to actually build an object tree similar to the Frames and Parts the preprocessor uses. This'll allow for better handling of things inside of callbacks.
Which PHP file would you recommend I start looking at for this logic?
Regards,
// Rolf Lampa
The logic for links lies in parser/Parser.php's Parser::replaceInternalLinks (actually, it's now Parser::replaceInternalLinks2 that you should look at). TimStarling also made some recent changes, so you may also want to look at parser/LinkHolderArray.php.
For the preprocessor stuff, see parser/Preprocessor_DOM.php; however, the logic there is considerably more complex than what we'll need.
I'm trying to find a way to get nested things to work right without ruining TimStarling's recent improvements to the memory and speed of that area of the parser. I did a small benchmark between:
A) Recursive call: find [[, walk to the closing ]] and do a recursive call for the stuff in between. (This is similar to what we currently do, though we limit it to a depth of 2.)
B) Markers and a single hashtable: as we find [['s we build a stack of offsets; when a ]] is found we pop the last offset, create a new token in the hashtable with the contents in between, and replace the text with a marker. (When expanding, we need to expand multiple times, because the content of markers can have markers inside of it as well.)
For a really flat setup, A) and B) are similar, though A) does have a slightly lower footprint. (Do note that this test is flat string-replacement recursion: there are no link holders set up, and we don't run the setup that we would be running multiple times with the recursion, so an actual Parser implementation would likely be heavier.) However, when you get into an insane level of nested brackets, A) starts to take 10x the time that B) takes. This would be why we limit recursion to a depth of 2, but it may actually make using B) to create a tree possible. Of course, links would never be nested like that, but when you are using a different order of parsing it does become necessary.
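For reference, approach B) looks roughly like the following. This is a reconstruction in Python of the idea as described, not the benchmark code itself; the marker format and the function names are assumptions, and the input is assumed not to contain marker-like strings already.

```python
import re

# Hypothetical marker format, chosen so it is unlikely to occur in wikitext.
PREFIX, SUFFIX = "\x7fLINK-", "\x7f"
MARKER_RE = re.compile(re.escape(PREFIX) + r"\d+" + re.escape(SUFFIX))


def tokenize_links(text: str):
    """Replace every balanced [[...]] with a marker; return (text, table)."""
    table = {}   # marker -> link body (may contain markers of inner links)
    out = []     # output pieces built so far
    stack = []   # offsets into `out` where each open [[ was written
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair == "[[":
            stack.append(len(out))
            out.append(pair)
            i += 2
        elif pair == "]]" and stack:
            start = stack.pop()
            body = "".join(out[start + 1:])  # everything after the [[
            del out[start:]
            marker = f"{PREFIX}{len(table)}{SUFFIX}"
            table[marker] = body
            out.append(marker)
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out), table  # an unclosed [[ simply stays as literal text


def expand(text: str, table: dict, render) -> str:
    """Expand markers repeatedly, since bodies can contain further markers."""
    while MARKER_RE.search(text):
        text = MARKER_RE.sub(lambda m: render(table[m.group(0)]), text)
    return text


flat, table = tokenize_links("x [[a|[[b]] c]] y [[d")
print(len(table))
print(expand(flat, table, lambda body: "[[" + body + "]]"))
```

The scan is a single pass with no recursion; the price is the repeated expansion loop at the end, one pass per nesting level.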
I'm considering creating another parser (inheriting from Parser) in order to start experimenting and working on a different order. That would allow us to use the Parser_DiffTest to make sure that for all use cases syntax remains the same. (And also allow us to benchmark).
wikitech-l@lists.wikimedia.org