Charles,
I foresee all three. Lexemes do no work. They are somewhat quantum-like
masses in that all lexemes have several potential states, but you don't
know what flavor they will take until observed in the wild after being
acted upon by grammar and context. Grammar and context are the energy to
lexemes' mass. So, tagging lexemes is the first part of using language
programmatically. A critical next step is either understanding a lexeme's
*current* sense state, or placing one in a sense state. This is semantic
disambiguation. Grammar can give us some sense states, like setting part of
speech. However, those pesky little modifiers are more sensitive to context
than grammar. "The yellow digging cat" is still somewhat stateless until
we look left and right and find the communication is about construction.
Now our mind's eye resolves the cat into a large construction vehicle of
some nature.
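For what it's worth, the kind of context-driven sense resolution I mean can be sketched in a few lines of Python. The sense inventory and the signature words below are entirely made up for illustration; this is just a simplified Lesk-style overlap score, not a real disambiguator:

```python
# Toy sketch: pick a lexeme's sense by overlap between context words
# and a (hypothetical, hand-made) signature set for each sense.

SENSES = {
    "cat": {
        "animal": {"pet", "fur", "meow", "feline"},
        "machine": {"construction", "excavator", "digging", "site", "vehicle"},
    }
}

def resolve_sense(lexeme, context_words):
    """Return the sense whose signature overlaps the context the most."""
    scores = {
        sense: len(signature & set(context_words))
        for sense, signature in SENSES[lexeme].items()
    }
    return max(scores, key=scores.get)

context = "the crew moved the yellow digging cat across the construction site".split()
print(resolve_sense("cat", context))  # -> machine
```

Looking "left and right" here is just set overlap; the real problem is that the useful context is rarely packaged this conveniently.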
To get to multiple choice questions, we have to set all of the sense states
to support an interrogative. To do translation, we have to know the
incoming sense states precisely, and then map the incoming grammar rules
to the outgoing ones. The team is going to be writing a ton
of rules for the constructors. The problem is that the total number of
rule combinations becomes compute-hard when you start to scale. The team at
FrameNet tried to shortcut the rules at the phrase level. Their work became
rule-convoluted, as the exceptions became significant. We don't communicate
in words or phrases; we communicate in full concepts: sentences, or, on
social media, sentence fragments. They are fragments mostly because they
shun sentence grammar. Grammar is hard.
Wikipragmatica is designed to do the same thing at the base level as
Abstract. The curation has some bonus benefits due to the embedded context.
However, its basic job is to do semantic disambiguation (your lexeme
tagging is one part of disambiguation) while also providing context outside
the concept (sentence) for larger communication construction (e.g.,
sentence fragments, paragraphs, emails, web pages, etc.). A thesaurus is
critical for traditional lexeme manipulation since some people say tomato
and some people say nightshade. The point being, you can substitute lexemes
and still mean the same thing. What Wikipragmatica does is exactly the same
thing as a thesaurus except at the sentence level (complete concept). So
instead of synonyms, we have paraphrases. This way, we have ignored
sentence-level grammar, and now we can use context for paragraph and larger
grammars. Wikipragmatica skips all that messy sentence grammar stuff since
we compare sentences and context to conduct semantic disambiguation. There
are far fewer rules above the sentence level. Interestingly, we do see
some new rules resolving as we get more samples of pathing (context)
between concepts. Sometimes the important context is five sentences away,
not our next-door neighbor. Thus, keeping track of how sentences connect
together across all communications gives us another powerful tool.
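To make the sentence-level thesaurus analogy concrete, here is a toy sketch. The bag-of-words cosine measure and the little paraphrase store are purely illustrative stand-ins, not anything Wikipragmatica actually implements:

```python
# Sketch of a sentence-level "thesaurus": instead of swapping synonyms,
# look up paraphrases -- sentences expressing the same complete concept.
# Bag-of-words cosine is a crude stand-in for a real similarity measure.

import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sentences' word-count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * \
           math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

# Hypothetical store of curated paraphrases.
paraphrase_store = [
    "Some people call a tomato a nightshade.",
    "Grammar rules are hard to scale.",
]

def best_paraphrase(sentence):
    """Return the stored sentence closest in meaning (by this crude score)."""
    return max(paraphrase_store, key=lambda s: cosine(sentence, s))

print(best_paraphrase("A tomato is sometimes called a nightshade."))
```

The point of the analogy survives even this crude scoring: two sentences can swap in for each other the way tomato and nightshade can, without any sentence grammar being consulted.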
Denny and the team have selected a traditional path with a twist. The
constructors are the first large-scale attempt that I know of to codify
grammar for active use rather than passive grammar checkers. Google went
with N-grams instead of grammars. GPT-3 went with extreme metadata. Grammar
replication is going to be very hard when you do not have context brokering
(meaning a service that can help derive context clues for lexeme sense
resolution). Even if you know the markup of the incoming sentence,
sentence five may affect which translation grammar rule you apply. If the
team is successful, it will be quite an achievement. I just think
Wikipragmatica is a simpler, more robust solution with quite a few more use
cases.
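As an aside, the N-gram route is simple to illustrate: no grammar rules at all, just counts of word sequences observed in a corpus (the corpus here is invented):

```python
# Tiny illustration of the N-gram approach: model language as counts of
# observed word sequences rather than as grammar rules.

from collections import Counter

def ngrams(tokens, n):
    """Yield all length-n windows over the token list."""
    return zip(*(tokens[i:] for i in range(n)))

corpus = "the cat dug the site the cat sat".split()
bigram_counts = Counter(ngrams(corpus, 2))
print(bigram_counts[("the", "cat")])  # -> 2
```

Counts like these sidestep grammar entirely, which is exactly why they also cannot tell you which sense state a lexeme is in.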
I hope this addresses your comment. Please let me know if you would like
further clarification.
Doug
On Mon, May 17, 2021 at 11:55 PM Charles Matthews via Abstract-Wikipedia <
abstract-wikipedia(a)lists.wikimedia.org> wrote:
On 17 May 2021 at 19:21 Douglas Clark <clarkdd(a)gmail.com> wrote:
I would like to clarify the base capability I see in Wikipragmatica, as
well as the user community's work stream in support of its curation. My
concern is the path the team has chosen is a dead end beyond the limited
use cases of Wikipedia day zero + a few years. An ecosystem of free
knowledge certainly seems to lead outside the confines of today's wiki
markup world for data and information acquisition. At some point, you will
have to semantically disambiguate the remainder of the web. That is not in
the manual tagging solution set.
So suppose we look beyond the proof-of-concept and the immediate impacts
of the "materials" of the Abstract Wikipedia project: the concrete
improvements in the Lexeme space in Wikidata, for example for medical and
chemical vocabulary; and the repository including what broadly could be
called "conversion scripts".
Various further topics have come up on this list. Some of those might be:
(a) Authoring multiple-choice questions in AW code, as a basis for
multilingual educational materials.
(b) Publication of WikiJournals - the Wikimedia contribution to learned
journals - in AW code that would then translate to multilingual versions.
(c) Using AW code as the target in generalised text-mining.
I think you are foreseeing something like (c). Certainly it is more like a
blue-sky problem.
Charles
_______________________________________________
Abstract-Wikipedia mailing list -- abstract-wikipedia(a)lists.wikimedia.org
List information:
https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikime…