I have some general Parsoid questions I hoped someone here might help me with.
The background is that we are doing some preliminary work looking at how Text-to-Speech might work on Wikipedia (there will be some info online in the coming weeks).
One detail of this is that we might occasionally have to mark specific words or sentences that need to be handled differently (e.g. World War III -> World War 3). It is still unclear how frequent such cases would be, but if they are very frequent there would likely be push-back from the community against storing this markup in the normal wikitext.
In that case we would have to store the markup outside of the wikitext, and any viewing or editing of it would have to happen in some user-enabled extension of the normal environment.
And here we come to the questions:
1. If we had to store this markup outside of the wikitext, could this be done by storing the individual parsoid-data-units?
2. Would it be possible to add these units to the existing parsoid-data (which gets loaded from the wikitext) when loading a page?
3. Would it be possible to detect which of these units would be affected by edits to the wiki page?
This is still in the early stages, so we are mainly looking at what possibilities exist should we need them. Using Parsoid data was something we thought of as a lightweight alternative to storing a synced copy of the wikitext plus the additional markup.
Cheers, André
André Costa | GLAM-tekniker, Wikimedia Sverige | Andre.Costa@wikimedia.se | +46 (0)733-964574
Support free knowledge, become a member of Wikimedia Sverige. Read more at blimedlem.wikimedia.se
I think you are looking for a solution that can attach metadata to specific places in the DOM -- there have been other contexts where this has come up as well. So, I think we need a generic solution to do this.
That said, Parsoid assigns ids to individual elements in the DOM, so an easy way to do this would be to store this data keyed on element ids and then look up the metadata separately.
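As a rough illustration of the "keyed on element ids" idea, here is a minimal sketch in Python. It assumes the annotations live in a separate store (a plain dict here) and that the page HTML carries an id attribute on each element. The ids ("mwAQ", "mwAg"), the annotation fields, and the function names are all made up for the example; this is not a Parsoid API.

```python
from html.parser import HTMLParser

# Hypothetical side store: annotations keyed by element id, kept outside the wikitext.
ANNOTATIONS = {
    "mwAg": {"match": "World War III", "speak_as": "World War 3"},
}


class AnnotationCollector(HTMLParser):
    """Pairs the element ids seen in a document with stored annotations."""

    def __init__(self, store):
        super().__init__()
        self.store = store
        self.found = {}

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id")
        if element_id in self.store:
            self.found[element_id] = self.store[element_id]


def annotations_for(html, store):
    """Return (annotations whose element is present, ids that no longer resolve)."""
    collector = AnnotationCollector(store)
    collector.feed(html)
    collector.close()
    missing = sorted(set(store) - set(collector.found))
    return collector.found, missing


page = '<p id="mwAQ">Intro.</p><p id="mwAg">After World War III ended.</p>'
found, missing = annotations_for(page, ANNOTATIONS)
```

The second return value lists stored annotations whose id is no longer in the page, which is the case a consumer would have to handle if ids change between revisions.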
As for the stability of these ids, we don't guarantee it right now. This has come up previously ( https://phabricator.wikimedia.org/T116350 ), but we haven't tackled it because there hasn't been a compelling use case that would benefit immediately from it, and we cannot reliably guarantee that the ids will remain stable across a series of wikitext edits.
But on an edit-to-edit basis, Parsoid already does DOM diffs and identifies only the edited portions of the DOM (this is used internally to support no-dirty-diff serialization of edited HTML to wikitext). However, this functionality is currently not exposed outside of internal Parsoid use.
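Since the internal DOM diff is not exposed, a consumer would have to approximate it from the outside. The sketch below is one naive way to do that, not Parsoid's own algorithm: it compares the text content of each annotated element between two revisions of the HTML. It assumes well-formed HTML and that element ids survive the edit, which is not guaranteed; all names and ids are illustrative.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TextById(HTMLParser):
    """Maps each element id to the text contained in that element."""

    def __init__(self):
        super().__init__()
        self.open_ids = []  # one entry per open element: its id, or None
        self.text = {}

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        element_id = dict(attrs).get("id")
        self.open_ids.append(element_id)
        if element_id is not None:
            self.text[element_id] = ""

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.open_ids:
            self.open_ids.pop()

    def handle_data(self, data):
        # Text belongs to every open ancestor that has an id.
        for element_id in self.open_ids:
            if element_id is not None:
                self.text[element_id] += data


def text_by_id(html):
    parser = TextById()
    parser.feed(html)
    parser.close()
    return parser.text


def affected_ids(old_html, new_html, annotated_ids):
    """Return (ids whose text changed, ids that disappeared) among annotated_ids."""
    old_text, new_text = text_by_id(old_html), text_by_id(new_html)
    changed = {i for i in annotated_ids
               if i in new_text and old_text.get(i) != new_text[i]}
    removed = {i for i in annotated_ids if i not in new_text}
    return changed, removed
```

Annotations in the "changed" set would need re-checking, and those in the "removed" set would need re-anchoring by some other means.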
This doesn't answer your questions directly, but I hope it is at least in the direction of what you are looking for.
Subbu.
On 10/28/2015 06:31 AM, André Costa wrote:
[...]
Wikitext-l mailing list Wikitext-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitext-l
Hi Subbu,
Many thanks for your answer. It confirmed some of my thoughts on how this might be done.
I'll take this back to our team and get back if I have any updates.
Cheers, André
André Costa | GLAM-tekniker, Wikimedia Sverige | Andre.Costa@wikimedia.se | +46 (0)733-964574
On 28 October 2015 at 19:17, Subramanya Sastry ssastry@wikimedia.org wrote:
[...]
André, another option to anchor annotations (which is also linked from the task Subbu mentioned) is hypothesis' approximate match algorithm: https://github.com/hypothesis/dom-anchor-text-quote
This approach uses XPaths of a selection where available (which would benefit from stable element ids), but falls back to approximate phrase matching with some context. They use this to annotate arbitrary web pages and PDFs: https://hypothes.is/
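To make the "phrase matching with some context" idea concrete, here is a small Python sketch. It is not the Hypothesis implementation: it only illustrates storing a quote together with a prefix and suffix, using the context to choose between exact occurrences, and falling back to a fuzzy search (via difflib) when the quote itself has changed. The function names, the context length, and the threshold are assumptions.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity of two strings in the range 0.0 to 1.0."""
    return SequenceMatcher(None, a, b).ratio()


def make_anchor(text, start, end, context=20):
    """Describe text[start:end] by its content plus some surrounding context."""
    return {
        "exact": text[start:end],
        "prefix": text[max(0, start - context):start],
        "suffix": text[end:end + context],
    }


def context_score(text, start, end, anchor):
    """How well the text around a candidate span matches the stored context."""
    before = text[max(0, start - len(anchor["prefix"])):start]
    after = text[end:end + len(anchor["suffix"])]
    return similarity(anchor["prefix"], before) + similarity(anchor["suffix"], after)


def find_anchor(text, anchor, threshold=0.8):
    """Return the (start, end) span that best matches the anchor, or None."""
    exact = anchor["exact"]

    # Step 1: the quote still occurs verbatim; use context to pick an occurrence.
    candidates = []
    pos = text.find(exact)
    while pos != -1:
        candidates.append((pos, pos + len(exact)))
        pos = text.find(exact, pos + 1)
    if candidates:
        return max(candidates,
                   key=lambda span: context_score(text, span[0], span[1], anchor))

    # Step 2: the quote was edited; slide a window over the text and keep the
    # most similar span, provided it is similar enough.
    best_score, best_span = 0.0, None
    for start in range(len(text) - len(exact) + 1):
        end = start + len(exact)
        score = similarity(exact, text[start:end])
        if score > best_score:
            best_score, best_span = score, (start, end)
    return best_span if best_score >= threshold else None
```

The fuzzy step is quadratic and its span boundaries are only approximate, so this is meant to show the shape of the approach rather than something to run on full articles.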
Gabriel
On Mon, Nov 2, 2015 at 5:21 AM, André Costa andre.costa@wikimedia.se wrote:
[...]
Hi Gabriel,
Thanks for the tip. I'll pass this one along also.
Cheers, André
André Costa | GLAM-utvecklare, Wikimedia Sverige | Andre.Costa@wikimedia.se | +46 (0)733-964574
On 3 November 2015 at 01:11, Gabriel Wicke gwicke@wikimedia.org wrote:
[...]
-- Gabriel Wicke Principal Engineer, Wikimedia Foundation
A generic solution would be desirable for other reasons I can think of, like auto-switching between different spellings (color|colour) in different varieties of English, etc. It would be nice if this required no user-level gadgets or other customization, but would work regardless of login status.
wikitext-l@lists.wikimedia.org