Hi,
I'd like to ask you for your input on a change to the DOM spec [1] we are considering. The idea is to simplify links by removing the explicit mw:WikiLink or mw:ExtLink rel attributes:
<a rel="mw:WikiLink" href="./Main_Page">Main Page</a> <a rel="mw:ExtLink" href="http://example.com">http://example.com</a>
would become just
<a href="Main_Page">Main Page</a> <a href="http://example.com">http://example.com</a>
Reasons for this change are:
- The external vs. internal link distinction is pretty simple to do with a prefix match on the href.
- When editing, an internal link can turn into an external one and vice-versa. Editors should not have to deal with updating the rel attribute to reflect the information already available in the href attribute.
- The page source will be slightly cleaner and smaller.
Potential disadvantages we see are
- For ISBN links [2], we will continue to link to Special:BookSources, which looks internal. Matching on that to identify ISBN links should however not be harder than it is right now.
Are you currently relying on these rel values? Do you see other issues with this proposed change?
Thanks for your input,
Gabriel
[1]: https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec#Wiki_links
Hi Gabriel
On 08/01/2014 21:37, Gabriel Wicke wrote:
I'd like to ask you for your input on a change to the DOM spec [1] we are considering. The idea is to simplify links by removing the explicit mw:WikiLink or mw:ExtLink rel attributes:
<a rel="mw:WikiLink" href="./Main_Page">Main Page</a> <a rel="mw:ExtLink" href="http://example.com">http://example.com</a>
would become just
<a href="Main_Page">Main Page</a> <a href="http://example.com">http://example.com</a>
Reasons for this change are:
- The external vs. internal link distinction is pretty simple to do with a prefix match on the href.
- When editing, an internal link can turn into an external one and vice-versa. Editors should not have to deal with updating the rel attribute to reflect the information already available in the href attribute.
- The page source will be slightly cleaner and smaller.
Potential disadvantages we see are
- For ISBN links [2], we will continue to link to Special:BookSources, which looks internal. Matching on that to identify ISBN links should however not be harder than it is right now.
Are you currently relying on these rel values? Do you see other issues with this proposed change?
I rely on this information (the "rel" attribute), and this looks to me like a not so elegant move.
It is neither clean nor robust, because what we will have to build to identify an external link is a heuristic that is not so easy to maintain.
I fully understand the need to build an efficient solution (I have an old computer myself), but wouldn't shortening "mw:WikiLink" and "mw:ExtLink" be a compromise?
Removing "./" from the href value really makes sense to me.
Emmanuel
At the moment VisualEditor depends on the rel attribute to identify the type of a link. Dropping the rel attribute wouldn't make things easier in VisualEditor when converting from internal to external (or the other way round), since that requires replacing the internal type of the annotation anyway.
I don't like that a consumer of the HTML DOM would have to know certain rules (starts with http or Special:BookSources) in order to recognize the type of the element.
Inez
Wikitext-l mailing list Wikitext-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitext-l
On Wed, Jan 8, 2014 at 5:24 PM, Inez Korczyński inez@wikia-inc.com wrote:
I don't like that a consumer of the HTML DOM would have to know certain rules (starts with http or Special:BookSources) in order to recognize the type of the element.
Note that they need to know special rules already, in order to properly identify interwiki links, PMID, RFC, ISBN, external links, etc. So the alternative to this proposal is really to add *more* rel types, past just mw:ExtLink and mw:WikiLink.
The current situation is the worst of both worlds -- we don't provide enough information to accurately identify the link type without examining the URL, but we also require some arbitrary distinction to be made in the rel attribute. --scott
On 08/01/2014 23:32, C. Scott Ananian wrote:
The current situation is the worst of both worlds -- we don't provide enough information to accurately identify the link type without examining the URL, but we also require some arbitrary distinction to be made in the rel attribute.
Yes, I'm not against more information.
Emmanuel
One option here is to get serious about the 'post processing' stage of parsoid. The goal was to have a stripped down HTML-based representation for MW content. But many applications (Kiwix, PDF rendering, etc.) want some additional "sugar" in the HTML. They would like, say, more precise link types, or alt and title tags on images, or redlink information. I believe gwicke's original plan was to have a post processing pipeline where this could be added.
Perhaps some of that postprocessing could optionally take place within the parsoid codebase. If you requested "elaborated html" from parsoid, it could go through and add some extra attributes and rel types. Would that sort of design satisfy your needs, Emmanuel (and allow us to remove the rel attribute in the core representation)? --scott
On Wed, Jan 8, 2014 at 5:36 PM, Emmanuel Engelhart kelson@kiwix.org wrote:
On 08/01/2014 23:32, C. Scott Ananian wrote:
The current situation is the worst of both worlds -- we don't provide enough information to accurately identify the link type without examining the URL, but we also require some arbitrary distinction to be made in the rel attribute.
Yes, I'm not against more information.
Emmanuel
Kiwix - Wikipedia Offline & more
- Web: http://www.kiwix.org
- Twitter: https://twitter.com/KiwixOffline
- more: http://www.kiwix.org/wiki/Communication
On 01/08/2014 01:29 PM, Emmanuel Engelhart wrote:
It is neither clean nor robust, because what we will have to build to identify an external link is a heuristic that is not so easy to maintain.
The good news is that removal of ./ also entails percent-encoding literal colons in internal links to avoid them being interpreted as protocols. That in turn makes matching for external links pretty simple:
/^[a-zA-Z]+:/
No need for an explicit list of protocols there.
Gabriel
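The prefix match described above can be sketched in JavaScript as follows. This is only an illustration: the regular expression is the one quoted in the message, while the function name and the sample hrefs are assumptions, and the sketch presumes that internal links already have their literal colons percent-encoded as proposed.

```javascript
// Sketch only. The regular expression comes from the message above;
// isExternalHref and the sample hrefs are hypothetical names and values.
function isExternalHref(href) {
  // A leading run of ASCII letters followed by ':' is treated as a protocol.
  return /^[a-zA-Z]+:/.test(href);
}

const samples = [
  'http://example.com',  // external: starts with "http:"
  'Main_Page',           // internal: no colon at all
  'Help%3AContents'      // internal: the colon is percent-encoded
];

for (const href of samples) {
  console.log(href, isExternalHref(href) ? 'external' : 'internal');
}
```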
On 08/01/2014 23:52, C. Scott Ananian wrote:
One option here is to get serious about the 'post processing' stage of parsoid. The goal was to have a stripped down HTML-based representation for MW content. But many applications (Kiwix, PDF rendering, etc.) want some additional "sugar" in the HTML. They would like, say, more precise link types, or alt and title tags on images, or redlink information. I believe gwicke's original plan was to have a post processing pipeline where this could be added.
Perhaps some of that postprocessing could optionally take place within the parsoid codebase. If you requested "elaborated html" from parsoid, it could go through and add some extra attributes and rel types. Would that sort of design satisfy your needs, Emmanuel (and allow us to remove the rel attribute in the core representation)?
If it is possible to offer both, as you propose, without generating too much additional work, then this looks like an excellent proposition to me.
Emmanuel
On 01/08/2014 03:05 PM, Emmanuel Engelhart wrote:
If it is possible to offer both, as you propose, without generating too much additional work, then this looks like an excellent proposition to me.
Instead of a heavy-weight postprocessor we can also just provide the code that classifies an href into different types. It is a pretty simple method, unlikely to be longer than a few lines.
Gabriel
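A classifier along those lines might look like the sketch below. It is not the Parsoid code: the function name, the returned category strings, and the assumption that an ISBN link's href is the percent-encoded form of a title starting with "Special:BookSources/" are all illustrative. Only the protocol regular expression and the Special:BookSources target come from this thread.

```javascript
// Sketch only: classifyHref and its return values are hypothetical.
function classifyHref(href) {
  // Anything that starts with "<letters>:" is read as a protocol.
  if (/^[a-zA-Z]+:/.test(href)) {
    return 'external';
  }
  // Internal hrefs are assumed to carry percent-encoded colons, so
  // decode before comparing against the special page name.
  let title;
  try {
    title = decodeURIComponent(href);
  } catch (e) {
    title = href; // malformed escape sequence: fall back to the raw href
  }
  if (title.startsWith('Special:BookSources/')) {
    return 'isbn';
  }
  return 'internal';
}

console.log(classifyHref('http://example.com'));
console.log(classifyHref('Special%3ABookSources/0596007973'));
console.log(classifyHref('Main_Page'));
```

A consumer that wants the old distinction back could run such a function over every link once and cache the result, rather than relying on a separate attribute staying in sync with the href.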
On 08/01/2014 23:56, Gabriel Wicke wrote:
The good news is that removal of ./ also entails percent-encoding literal colons in internal links to avoid them being interpreted as protocols. That in turn makes matching for external links pretty simple:
/^[a-zA-Z]+:/
This looks good indeed (and is easy to maintain), but it is still a heuristic, isn't it? In the worst case we will have titles matching this regexp.
On 01/08/2014 03:10 PM, Emmanuel Engelhart wrote:
This looks good indeed (and is easy to maintain), but it is still a heuristic, isn't it? In the worst case we will have titles matching this regexp.
Titles can't match this regexp after this change as we'll percent-encode all literal colons to avoid clients interpreting them as protocols. That was the reason why we added the ./ in the first place, but the idea now is making all links relative to the wiki root rather than the current page name. Main reason for this is that we don't want to rewrite HTML on page rename or when combining HTML from different pages.
Gabriel
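The guarantee described above can be illustrated with a small sketch. The helper name is hypothetical, and the space-to-underscore step is an assumption based on the "Main_Page" example earlier in the thread; the only behaviour taken from the message itself is that every literal colon in a title is percent-encoded.

```javascript
// Sketch only: titleToHref is a hypothetical helper, not Parsoid's code.
function titleToHref(title) {
  return title
    .replace(/ /g, '_')    // assumed: spaces become underscores, as in "Main_Page"
    .replace(/:/g, '%3A'); // percent-encode every literal colon
}

const protocolPrefix = /^[a-zA-Z]+:/;

for (const title of ['Main Page', 'Help:Contents', 'Foo:Bar:Baz']) {
  const href = titleToHref(title);
  // No encoded title can match, because no literal colon survives.
  console.log(href, protocolPrefix.test(href));
}
```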
On 09/01/2014 00:14, Gabriel Wicke wrote:
Titles can't match this regexp after this change as we'll percent-encode all literal colons to avoid clients interpreting them as protocols. That was the reason why we added the ./ in the first place, but the idea now is making all links relative to the wiki root rather than the current page name. Main reason for this is that we don't want to rewrite HTML on page rename or when combining HTML from different pages.
Then this move is OK with me; we have elegant solutions to replace the current ones. Thank you very much for having asked/warned people about this.
Emmanuel
Hi Gabriel,
I noticed the spec hasn't been updated yet (will you do that once the change is checked in?)
And could you also document that the "post-processing" will be available for clients that need "elaborated" information? (Maybe via an additional API or an additional CGI parameter, so that this information can easily be retrieved.)
In general, I think keeping more raw information is a way to provide more development flexibility for clients.
Thanks
On 01/12/2014 02:16 AM, Jiang BIAN wrote:
Hi Gabriel,
I noticed the spec hasn't been updated yet (will you do that once the change is checked in?)
Yes, I was asking here in advance so that it can inform our discussion.
And could you also document that the "post-processing" will be available for clients that need "elaborated" information? (Maybe via an additional API or an additional CGI parameter, so that this information can easily be retrieved.)
We'll document the code needed to distinguish external from internal links [1], and will also add documentation for the URL bases of RFC/PMID/ISBN links. The latter is already needed with the current spec.
Both encodings provide the same information. In the current spec the external vs. internal distinction is encoded redundantly in two attributes (rel and href), while the proposal is to encode this in the href attribute only.
Gabriel
[1]: var isExternal = /^[a-zA-Z]+:/.test(href)