After researching a bit more and looking at the NLP programming space (and
semantic vector models), it seems that contractions are handled nicely in
some cases and not so nicely in others ;-)
One caveat for English is that expanding a contraction requires contextual
knowledge, because the same contraction can stand for more than one phrase:
I'd -> I would
I'd -> I had
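
To make the ambiguity concrete, here is a minimal Python sketch. It is only an
illustration: the table, the word lists, and the expand() helper are all
hypothetical names invented for this example, and the "look at the next word"
heuristic is a stand-in for the real contextual knowledge a rule engine would
need. It is not how pycontractions or Wikilambda work.

```python
# Hypothetical lookup table: each contraction maps to every phrase it may stand for.
CANDIDATES = {
    "i'd": ["I would", "I had"],
    "can't": ["cannot"],
}

# Tiny hand-picked word lists, for illustration only. A real system would
# pull this grammatical information from a lexicon (e.g. Lexeme forms).
PAST_PARTICIPLES = {"been", "gone", "seen", "done", "known"}
BASE_FORMS = {"be", "go", "see", "do", "know", "like"}


def expand(contraction, next_word=None):
    """Return the candidate expansions, narrowed by the next word when possible."""
    key = contraction.lower()
    options = CANDIDATES.get(key, [contraction])
    if key == "i'd" and next_word:
        word = next_word.lower()
        if word in PAST_PARTICIPLES:
            return ["I had"]    # "I'd gone" -> "I had gone"
        if word in BASE_FORMS:
            return ["I would"]  # "I'd go" -> "I would go"
    # No usable context: report every candidate instead of guessing.
    return list(options)


print(expand("I'd", "go"))
print(expand("I'd", "gone"))
print(expand("I'd"))
```

Without a following word (or with one that is not in either list), the sketch
returns both candidates rather than picking one, which is the gap that
context-aware rules would have to fill.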
Regardless, I think having all the individual necessary Lexemes already in
Wikidata (I, would, had, etc.) will be useful so that a Rule engine can run
Wikilambda functions that incorporate Lexeme knowledge.
There are definite limits to SPO handling and I think I can see why Denny
and team are working towards a 3rd system where Rules for conflation, etc.
can be programmed with functions rather than directly storing in Lexeme
For additional context, I've copied my reply to GrounderUK from our
discussion on one of the Lexeme talk pages, which has some interesting links:
@GrounderUK <https://www.wikidata.org/wiki/User:GrounderUK>: I think we
have enough information now to say that, as Denny stated, the current
Lexeme handling will eventually need to account for rules. Those rules
(some of which you started to hint at) would be stored in Wikilambda
functions, which are then incorporated into the Lexeme pages and Abstract
Wikipedia handling. The mere fact that individual Lexemes can be stored
and referenced already makes them useful, but programmatically it makes
sense to store "contractions and other language rules" in a rule engine,
and the current limits of SPO handling cannot account for that. That is
why I can see that we will need the new system that Denny and team are
working on, and we can help it along by writing those rules once we have
that ability. Something like this tokenizer
for instance in Haskell or the pycontractions
<https://pypi.org/project/pycontractions/> package.[mentioned in an NLP
Thanks to both of you nonetheless. I'll engage in the other discussion
areas. But all of this discussion here is very useful, even if only to
highlight the gaps in the current Lexeme system. Thadguidry
<https://www.wikidata.org/wiki/User_talk:Thadguidry> 15:27, 31 August 2020
Thanks Al. And thanks Denny and team. Very much looking forward to the
"playground" for language rules.
(Python is exceptional for NLP primarily because of its quite good string
handling and wealth of existing NLP packages. To get C-like speed, Nim
<https://nim-lang.org/> is also well suited and Python-like, but a rarely
(Haskell is also heavily used for NLP but in the last 5 years has seen a
decline within the NLP programming space)