You're welcome, Thad! I've already replied to you there; here is the link:
https://www.wikidata.org/wiki/Lexeme_talk:L269709, in case anyone was
wondering.
Since you mention *'d*, I feel bound to mention *'s*: our two most common
verbs in one handy package! I'm not even going to mention our multi-purpose
*s* inflection ;)
Best regards,
Al.
On Monday, 31 August 2020, Thad Guidry <thadguidry(a)gmail.com> wrote:
After researching a bit more and looking at the NLP programmatic space
(and semantic vector models), it seems that contractions are dealt with
both nicely and not so nicely ;-)
One caveat for the English language seems to be the expansion of
contractions, which requires contextual knowledge:
I'd -> I would
I'd -> I had
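To illustrate why context matters, here is a minimal Python sketch of one naive disambiguation heuristic (my own toy example, not how pycontractions or any real disambiguator works): choose "had" when the next word is a past participle, otherwise "would". The tiny hand-made participle list stands in for real POS tagging.

```python
# Naive heuristic for expanding the ambiguous English contraction "'d":
#   "'d" + past participle -> "had"   ("I'd gone" -> "I had gone")
#   "'d" + base-form verb  -> "would" ("I'd go"   -> "I would go")
# A tiny hand-made set of past participles stands in for real POS tagging.
PAST_PARTICIPLES = {"gone", "been", "seen", "done", "eaten", "taken"}

def expand_d(sentence: str) -> str:
    words = sentence.split()
    out = []
    for i, word in enumerate(words):
        if word.lower().endswith("'d"):
            stem = word[:-2]
            nxt = words[i + 1].lower() if i + 1 < len(words) else ""
            aux = "had" if nxt in PAST_PARTICIPLES else "would"
            out.extend([stem, aux])
        else:
            out.append(word)
    return " ".join(out)

print(expand_d("I'd gone home"))  # -> I had gone home
print(expand_d("I'd go home"))    # -> I would go home
```

A real rule engine would need proper tagging and a much larger lexicon, of course; the point is only that the choice cannot be made from the contraction alone.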
Regardless, I think having all the individual necessary Lexemes already in
Wikidata (I, would, had, etc.) will be useful so that a Rule engine can run
Wikilambda functions that incorporate Lexeme knowledge.
There are definite limits to SPO (subject-predicate-object) handling, and I
think I can see why Denny and team are working towards a third system where
rules for conflation, etc. can be programmed with functions rather than
stored directly in the Lexeme namespace.
For additional context, I've copied my reply to GrounderUK from our
discussion on one of the Lexeme talk pages, which has some interesting links:
@GrounderUK <https://www.wikidata.org/wiki/User:GrounderUK>: I think we
have enough information now to say that, as Denny had stated, the current
Lexeme handling will eventually need to account for rules. Those rules
(some of which you started to hint at) would be stored in Wikilambda
functions, which are then incorporated into the Lexeme pages and into
Abstract Wikipedia handling. The mere fact that individual Lexemes can be
stored and referenced already makes them useful, but programmatically it
makes sense to store contractions and other language rules in a rule
engine, and the current limits of SPO handling cannot account for that.
That is why I can see that we will need the new system Denny and team are
working on, and we can help it along by writing those rules once we have
that ability. Something like this tokenizer
<https://hackage.haskell.org/package/chatter-0.0.0.3/docs/src/NLP-Tokenize.html>
in Haskell, for instance, or the pycontractions
<https://pypi.org/project/pycontractions/> package [mentioned in an NLP
article
<https://medium.com/@lukei_3514/dealing-with-contractions-in-nlp-d6174300876b>]
Thanks to both of you nonetheless. I'll engage in the other discussion
areas. But all of this discussion here is very useful, even if only to
highlight the gaps in the current Lexeme system. Thadguidry
<https://www.wikidata.org/wiki/User:Thadguidry> (talk
<https://www.wikidata.org/wiki/User_talk:Thadguidry>) 15:27, 31 August
2020 (UTC)
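The tokenizers linked above typically split a contraction into separate tokens rather than expanding it, so each piece can be linked to its own Lexeme. Here is a minimal Python sketch of that idea (my own simplification, not the actual behavior of the chatter or pycontractions packages):

```python
import re

# Split common English contraction suffixes ('d, 's, 'll, 're, 've, 'm, n't)
# off as separate tokens, so "I'd" becomes "I" + "'d" and each token can be
# matched against its own Lexeme entry.
CONTRACTION = re.compile(r"^(.*?)('d|'s|'ll|'re|'ve|'m|n't)$", re.IGNORECASE)

def tokenize(text: str) -> list[str]:
    tokens = []
    for word in text.split():
        m = CONTRACTION.match(word)
        if m and m.group(1):  # require a non-empty stem before the suffix
            tokens.extend([m.group(1), m.group(2)])
        else:
            tokens.append(word)
    return tokens

print(tokenize("I'd say it's fine"))
# -> ['I', "'d", 'say', 'it', "'s", 'fine']
```

Note that this sidesteps the ambiguity problem entirely: *'d* stays *'d*, and deciding between *would* and *had* is deferred to a later rule (exactly the kind of thing a Wikilambda function could do).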
Thanks Al. And thanks Denny and team. Very much looking forward to the
"playground" for language rules.
(Python is exceptional for NLP primarily because of its quite good string
handling and wealth of existing NLP packages. To get C-like speed, Nim
<https://nim-lang.org/> is also well suited and Python-like, though a
rarely used language.)
(Haskell has also been used heavily for NLP, but has seen a decline in
that space over the last 5 years.)
Thad
https://www.linkedin.com/in/thadguidry/