Hi,
I'm building the ToC entries from Parsoid HTML content. Another part which caused some struggle is building the correct anchors for the section headings.
First I thought I could just use the id attributes in the heading tags Parsoid provides[2].
Example from [1]: <h2 id="mwCA">Template truncation</h2>
But then I thought about links to specific sections. Those would not use the same ids Parsoid generates.[3] They would use the anchorencoded tocline strings.[4]
Since I have not found an npm module which does anchorencoding in JavaScript, I wrote a small library function to do the same. It uses the phpjs npm module to account for the PHP-specific way URL encoding is done. Would you mind checking the anchorencode.js file and the associated test file anchorencode-test.js in my patch[5]?
If there is a JS implementation of this I'd be happy to hear about that, of course.
Thanks, Bernd
[1] https://test.wikipedia.org/wiki/Section_edit_links_bug2
[2] view-source:https://test.wikipedia.org/api/rest_v1/page/html/Section_edit_links_bug2
[3] https://test.wikipedia.org/api/rest_v1/page/html/Section_links
[4] https://www.mediawiki.org/wiki/Manual:PAGENAMEE_encoding#Encodings_compared
[5] https://gerrit.wikimedia.org/r/#/c/246100/7
On 10/27/2015 01:38 PM, Bernd Sitzmann wrote:
Hi,
I'm building the ToC entries from Parsoid HTML content. Another part which caused some struggle is building the correct anchors for the section headings.
In the past, we discussed generating HTML5 ids rather than the munged ids that are currently generated in mediawiki core (for html4 reasons that no longer apply to html5). However, we didn't go ahead with it because at least for old content, we have to generate both old munged and new ids since a lot of anchors would have escaped out in the wild. https://gerrit.wikimedia.org/r/#/c/226032/ has some comments about this.
I haven't thought this through, but could mobile generate html5 ids (which are less restrictive) instead of the html4-style ids? I suppose if those section links got shared and opened outside the mobile view, they would break in some cases.
In any case, the escapeId function in https://gerrit.wikimedia.org/r/#/c/226032/4/lib/ext.core.Sanitizer.js is the code to generate these ids if it helps (right now unused in Parsoid, but will be used once we start generating section ids).
Subbu.
If you care about edge cases, section anchor generation is rather complicated: anchors can be postfixed with an index when there are multiple identical titles, and HTML, templates and parser tags are handled differently for display and for anchor generation. (Yes, these can and do appear in titles. E.g. people sometimes put <math> tags in there, or italicize a word.)
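To make the "postfixed with an index" case concrete, here is a minimal JavaScript sketch of de-duplicating anchors within one page. The function name makeUniqueAnchors is hypothetical, and both the "_2", "_3" suffix format and the case-insensitive comparison are assumptions about what core does; they should be checked against the core parser before relying on them.

```javascript
// Sketch only: de-duplicate section anchors within a single page.
// ASSUMPTIONS (verify against MediaWiki core): repeated anchors get
// "_2", "_3", ... appended, and duplicates are detected case-insensitively.
function makeUniqueAnchors(anchors) {
  var seen = Object.create(null); // lower-cased anchors already handed out
  return anchors.map(function (anchor) {
    var key = anchor.toLowerCase();
    if (!seen[key]) {
      seen[key] = true;
      return anchor;
    }
    // Find the first free numeric suffix, starting at 2.
    var i = 2;
    while (seen[key + '_' + i]) {
      i++;
    }
    seen[key + '_' + i] = true;
    return anchor + '_' + i;
  });
}

console.log(makeUniqueAnchors(['History', 'Notes', 'History']));
```

The anchors have to be processed in document order, since the suffix depends on how many earlier headings produced the same anchor.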
On 10/27/2015 03:48 PM, Gergo Tisza wrote:
If you care about edge cases, section anchor generation is rather complicated: anchors can be postfixed with an index when there are multiple identical titles,
This would need to be handled to guarantee id uniqueness.
and HTML, templates and parser tags are handled differently for display and for anchor generation. (Yes, these can and do appear in titles. E.g. people sometimes put <math> tags in there, or italicize a word.)
But, if we move core and Parsoid to HTML5 ids, this shouldn't matter since the only restriction on HTML5 ids is that they shouldn't contain a space char as per https://html.spec.whatwg.org/multipage/dom.html#the-id-attribute
Subbu.
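For illustration, the HTML5 restriction mentioned above (a non-empty value with no space characters) can be checked in a few lines of JavaScript. Both function names are hypothetical, and replacing whitespace with underscores in toHtml5Id is only an assumption about what an HTML5-style scheme might do, not something core or Parsoid is confirmed to do.

```javascript
// Sketch only: the HTML5 id rule is "at least one character, no ASCII whitespace".
var ASCII_WHITESPACE = /[ \t\n\f\r]/;

function isValidHtml5Id(id) {
  return id.length > 0 && !ASCII_WHITESPACE.test(id);
}

// Hypothetical conversion of a heading to an HTML5-style id.
// ASSUMPTION: runs of whitespace become a single underscore.
function toHtml5Id(heading) {
  return heading.trim().replace(/[ \t\n\f\r]+/g, '_');
}

console.log(isValidHtml5Id(toHtml5Id('Template truncation')));
```

Note that this says nothing about uniqueness or about keeping old links working; those would still need separate handling.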
Another option could be to use compact stable element IDs https://phabricator.wikimedia.org/T116350 not based on the content. This would be less readable, but on the upside there wouldn't be any collisions, and links wouldn't break on minor heading changes.
Subbu, Gergo and Gabriel:
Thank you for your comments so far.
Just to be clear, ideally I want the anchor ids to be the same as those used in Core. I would really like Parsoid to provide the same anchor ids as Core does. That would also take care of the uniqueness issue. Is there a task for this? If not, I'd be happy to create one. In the meantime I'll use my own implementation until we get something from upstream.
If the anchor ids generated by the Mobile Content Service do not match the ones generated by Core then the app would not scroll to the correct section. Instead it would just stay at the top of the page.
The links can come from inside the same page, other pages, redirects, or even from outside the app/site. The app builds the correct <h[2-6]> tags using the anchor values provided by the Mobile Content Service output. This is why I don't want just an anchor id that looks like "mwCA". (Of course, that would be ok if core would do the same but right now it doesn't.)
Subbu: Thanks for the link to the JS code. I'll adapt my patch to include some of the additional substitutions. You may also want to check out my patch since I think some of the cases that are handled by the phpjs library are not handled in the Parsoid code.
Another thing I haven't found in the Parsoid code is ensuring uniqueness of ids. I'd be interested in how this is resolved in Core, too, of course, to make sure what we do on the JS side matches Core.
Cheers, Bernd
On Tue, Oct 27, 2015 at 3:10 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
Another option could be to use compact stable element IDs https://phabricator.wikimedia.org/T116350 not based on the content. This would be less readable, but on the upside there wouldn't be any collisions, and links wouldn't break on minor heading changes.
-- Gabriel Wicke Principal Engineer, Wikimedia Foundation
1. Parsoid doesn't generate section ids now, but when we do, yes, we'll make sure ids are compatible with core ids and unique. Will check the phpjs library code to see what we are missing. We don't have a ticket for generating section ids yet.
2. At some point, it makes sense to switch both core and Parsoid to a different id scheme (HTML5 ids is the obvious possibility, and Gabriel proposed another) and have fallback support for old-style ids (we've brainstormed some ideas in the past, but I forget the details right now).
Subbu.
On 10/27/2015 05:01 PM, Bernd Sitzmann wrote:
Subbu, Gergo and Gabriel:
Thank you for your comments so far.
Just to be clear, ideally I want the anchor ids to be the same as those used in Core. I would really like Parsoid to provide the same anchor ids as Core does. That would also take care of the uniqueness issue. Is there a task for this? If not, I'd be happy to create one. In the meantime I'll use my own implementation until we get something from upstream.
If the anchor ids generated by the Mobile Content Service do not match the ones generated by Core then the app would not scroll to the correct section. Instead it would just stay at the top of the page.
The links can come from inside the same page, other pages, redirects, or even from outside the app/site. The app builds the correct <h[2-6]> tags using the anchor values provided by the Mobile Content Service output. This is why I don't want just an anchor id that looks like "mwCA". (Of course, that would be ok if core would do the same but right now it doesn't.)
Subbu: Thanks for the link to the JS code. I'll adapt my patch to include some of the additional substitutions. You may also want to check out my patch since I think some of the cases that are handled by the phpjs library are not handled in the Parsoid code.
Another thing I haven't found in the Parsoid code is ensuring uniqueness of ids. I'd be interested in how this is resolved in Core, too, of course, to make sure what we do on the JS side matches Core.
Cheers, Bernd
Thanks, I've created a task for this [1] and updated my patch [2].
The advantage of having this done by Parsoid or even higher upstream is that if/when the ids get generated differently we would get it for free.
[1] https://phabricator.wikimedia.org/T116876
[2] https://gerrit.wikimedia.org/r/246100
-Bernd
On Tue, Oct 27, 2015 at 4:08 PM, Subramanya Sastry ssastry@wikimedia.org wrote:
- Parsoid doesn't generate section ids now, but when we do, yes, we'll make sure ids are compatible with core ids and unique. Will check the phpjs library code to see what we are missing. We don't have a ticket for generating section ids yet.
- At some point, it makes sense to switch both core and Parsoid to a different id scheme (HTML5 ids is the obvious possibility, and Gabriel proposed another) and have fallback support for old-style ids (we've brainstormed some ideas in the past, but I forget the details right now).
Subbu.
URL encoding and decoding in JS are done with encodeURIComponent and decodeURIComponent.
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Obj...
phpjs... http://media.giphy.com/media/42k1WHX6mNpSM/giphy.gif
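One caveat: encodeURIComponent does not escape exactly the same characters as PHP's urlencode(). It leaves ! ' ( ) * ~ untouched and encodes a space as %20, whereas urlencode() percent-encodes those characters and turns a space into +. A rough sketch of bridging the gap without phpjs follows. The function names phpUrlencode and legacyAnchorEncode are hypothetical, and the anchor steps (space to underscore, urlencode, %3A back to ":", remaining % to ".") are my understanding of the legacy html4-style escaping, so treat them as assumptions to verify against core. Any other steps core may apply are deliberately left out.

```javascript
// Sketch only: approximate PHP's urlencode() on top of encodeURIComponent().
// Note: encodeURIComponent throws a URIError on lone surrogates.
function phpUrlencode(str) {
  return encodeURIComponent(str)
    // Percent-encode the characters encodeURIComponent leaves alone
    // but PHP's urlencode() escapes.
    .replace(/[!'()*~]/g, function (c) {
      return '%' + c.charCodeAt(0).toString(16).toUpperCase();
    })
    // PHP's urlencode() represents a space as "+".
    .replace(/%20/g, '+');
}

// ASSUMPTION: legacy-style anchor = spaces to underscores, urlencode,
// "%3A" restored to ":", every remaining "%" turned into ".".
function legacyAnchorEncode(heading) {
  var id = phpUrlencode(heading.replace(/ /g, '_'));
  return id.replace(/%3A/g, ':').replace(/%/g, '.');
}

console.log(legacyAnchorEncode('Foo (bar)')); // prints "Foo_.28bar.29"
```

The second function would still need the duplicate-heading handling discussed earlier in the thread before its output can be used as a page-unique id.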
On Wed, Oct 28, 2015 at 1:06 AM, Bernd Sitzmann bernd@wikimedia.org wrote:
Thanks, I've created a task for this [1] and updated my patch [2].
The advantage of having this done by Parsoid or even higher upstream is that if/when the ids get generated differently we would get it for free.
[1] https://phabricator.wikimedia.org/T116876
[2] https://gerrit.wikimedia.org/r/246100
-Bernd
Mobile-l mailing list Mobile-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/mobile-l
On Tue, Oct 27, 2015 at 2:04 PM, Subramanya Sastry ssastry@wikimedia.org wrote:
But, if we move core and Parsoid to HTML5 ids, this shouldn't matter since the only restriction on HTML5 ids is that they shouldn't contain a space char as per https://html.spec.whatwg.org/multipage/dom.html#the-id-attribute
You would still have to output the legacy anchor as well, unless you want to break existing links.
Anyway I wasn't thinking of character sets but problems like https://phabricator.wikimedia.org/T26262. But maybe I am confusing things and section anchors handle those correctly and only the post-edit redirection ID fragment does not.