Charles,

I foresee all three. Lexemes, by themselves, do no work. They are somewhat quantum-like masses: every lexeme has several potential states, but you don't know what flavor it will take until it is observed in the wild, after being acted upon by grammar and context. Grammar and context are the energy to lexemes' mass. So tagging lexemes is the first part of using language programmatically. A critical next step is either understanding a lexeme's current sense state or placing it in a sense state. This is semantic disambiguation. Grammar can give us some sense states, like setting part of speech. However, those pesky little modifiers are more sensitive to context than to grammar. "The yellow digging cat" is still somewhat stateless until we look left and right and find that the communication is about construction. Only then does our mind's eye resolve the cat into a large construction vehicle of some nature.
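
To make that concrete, here is a minimal toy sketch in Python of letting context collapse a lexeme into one sense. The sense labels and clue words are invented for illustration; this is not code from Wikipragmatica or Abstract Wikipedia.

    # Toy sketch: a lexeme stays "unresolved" until context collapses it into a sense.
    # Sense labels and clue words below are invented for illustration only.
    SENSES = {
        "cat": {
            "animal": {"fur", "meow", "pet", "kitten"},
            "construction_vehicle": {"construction", "excavator", "site", "digging"},
        }
    }

    def resolve_sense(lexeme, context_words):
        """Pick the sense whose clue words overlap the surrounding context the most."""
        scores = {sense: len(clues & context_words)
                  for sense, clues in SENSES.get(lexeme, {}).items()}
        best = max(scores, key=scores.get, default=None)
        return best if best and scores[best] > 0 else "unresolved"

    # "The yellow digging cat" with neighboring communication about a construction site:
    context = {"yellow", "digging", "construction", "site", "foundation"}
    print(resolve_sense("cat", context))   # -> construction_vehicle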

To get to multiple-choice questions, we have to set all of the sense states to support an interrogative. To do translation, we have to know the incoming sense states precisely, and then map from the incoming grammar rules to the outgoing ones. The team is going to be writing a ton of rules for the constructors. The problem is that the total number of rule combinations becomes compute-hard as you scale. The team at FrameNet tried to shortcut the rules at the phrase level; their work became convoluted with rules as the exceptions piled up. We don't communicate in words or phrases; we communicate in full concepts: sentences, or on social media, sentence fragments. They are fragments mostly because they shun sentence grammar. Grammar is hard.
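
A rough back-of-the-envelope sketch of that scaling problem, with purely made-up rule counts, just to show the shape of the growth:

    # Hypothetical, independent rule sets a constructor pipeline might have to
    # reconcile per language pair. The counts are invented; only the shape matters.
    from math import prod

    rule_counts = {
        "morphology": 40,
        "agreement": 25,
        "word_order": 15,
        "phrase_exceptions": 60,
    }

    print(f"{prod(rule_counts.values()):,} possible rule interactions")  # 900,000

    # Add a few more interacting rule sets, or more languages, and checking every
    # interaction quickly becomes compute-hard.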

Wikipragmatica is designed to do the same thing at the base level as Abstract. The curation has some bonus benefits due to the embedded context. However, its basic job is to do semantic disambiguation (your lexeme tagging is one part of disambiguation) while also providing context outside the concept (sentence) for larger communication construction (e.g., sentence fragments, paragraphs, emails, web pages, etc.). A thesaurus is critical for traditional lexeme manipulation, since some people say tomato and some people say nightshade. The point is that you can substitute lexemes and still mean the same thing. What Wikipragmatica does is exactly the same thing as a thesaurus, except at the sentence level (complete concept). So instead of synonyms, we have paraphrases. This way, we have ignored sentence-level grammar, and now we can use context for paragraph and larger grammars. Wikipragmatica skips all that messy sentence grammar since we compare sentences and context to conduct semantic disambiguation. There are far fewer rules above the sentence level. Interestingly, we do see some new rules resolving as we get more samples of pathing (context) between concepts. Sometimes the important context is five sentences away, not our next-door neighbor. Thus, keeping track of how sentences connect together in all communications gives us another super powerful tool.
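
A minimal sketch of that "thesaurus of sentences" idea, with invented data structures and entries (not anything from Wikipragmatica itself): map whole sentences to paraphrase clusters, then record the pathing between clusters.

    from collections import defaultdict

    # Paraphrase clusters: each concept ID groups sentences that mean the same thing,
    # the way a thesaurus entry groups synonymous words. Entries are invented.
    paraphrase_clusters = {
        "C1": {"The tomato is ripe.", "The nightshade fruit is ready to eat."},
        "C2": {"Pick it before the frost.", "Harvest it ahead of the first freeze."},
    }
    sentence_to_concept = {s: cid
                           for cid, group in paraphrase_clusters.items()
                           for s in group}

    # Observed "paths" (context links) between concepts across curated communications.
    concept_paths = defaultdict(int)

    def record_document(sentences):
        """Record which concepts follow which across a whole document."""
        concepts = [sentence_to_concept.get(s) for s in sentences]
        for a, b in zip(concepts, concepts[1:]):
            if a and b:
                concept_paths[(a, b)] += 1

    record_document(["The nightshade fruit is ready to eat.",
                     "Pick it before the frost."])
    print(dict(concept_paths))   # {('C1', 'C2'): 1}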

Denny and the team have selected a traditional path with a twist. The constructors are the first large-scale attempt that I know of to codify grammar for active use rather than for passive grammar checking. Google went with N-grams instead of grammars. GPT-3 went with extreme metadata. Grammar replication is going to be really hard when you do not have context brokering (meaning a service that can help derive context clues for lexeme sense resolution). Even if you know the markup of the incoming sentence, sentence five may still affect which translation grammar rule you apply. If the team is successful, it will be quite an achievement. I just think Wikipragmatica is a simpler, more robust solution with quite a few more use cases.
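
For what it's worth, here is one way a context broker could be sketched: a helper that scans a window of surrounding sentences for clues that resolve a lexeme's sense. The clue lists, window size, and function names are assumptions, not a description of any existing service.

    # Hypothetical context broker: look several sentences away for domain clues.
    CLUES = {
        "construction": {"excavator", "site", "foundation", "crane"},
        "animal": {"vet", "fur", "meow", "kitten"},
    }

    def broker_context(sentences, target_index, window=5):
        """Scan up to `window` sentences on either side of the target for clues."""
        lo = max(0, target_index - window)
        hi = min(len(sentences), target_index + window + 1)
        nearby = set()
        for i in range(lo, hi):
            if i != target_index:
                nearby |= set(sentences[i].lower().split())
        hits = {domain: len(words & nearby) for domain, words in CLUES.items()}
        best = max(hits, key=hits.get)
        return best if hits[best] > 0 else None

    doc = ["We poured the foundation on Monday.",
           "The crane arrived Tuesday.",
           "Rain delayed everything.",
           "The inspector signed off.",
           "Then the yellow digging cat rolled in."]
    print(broker_context(doc, target_index=4))   # -> construction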

I hope this addresses your comment. Please let me know if you would like further clarification.

Doug 

On Mon, May 17, 2021 at 11:55 PM Charles Matthews via Abstract-Wikipedia <abstract-wikipedia@lists.wikimedia.org> wrote:



On 17 May 2021 at 19:21 Douglas Clark <clarkdd@gmail.com> wrote:
I would like to clarify the base capability I see in Wikipragmatica, as well as the user community's work stream in support of its curation. My concern is that the path the team has chosen is a dead end beyond the limited use cases of Wikipedia day zero + a few years. An ecosystem of free knowledge certainly seems to lead outside the confines of today's wiki markup world for data and information acquisition. At some point, you will have to semantically disambiguate the remainder of the web. That is not in the manual tagging solution set.

So suppose we look beyond the proof-of-concept and the immediate impacts of the "materials" of the Abstract Wikipedia project: the concrete improvements in the Lexeme space in Wikidata, for example for medical and chemical vocabulary; and the repository including what broadly could be called "conversion scripts".

Various further topics have come up on this list. Some of those might be:

(a) Authoring multiple-choice questions in AW code, as a basis for multilingual educational materials.

(b) Publication of WikiJournals - the Wikimedia contribution to learned journals - in AW code that would then translate to multilingual versions.

(c) Using AW code as the target in generalised text-mining.

I think you are foreseeing something like (c). Certainly it is more like a blue-sky problem.

Charles

_______________________________________________
Abstract-Wikipedia mailing list -- abstract-wikipedia@lists.wikimedia.org
List information: https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikimedia.org/