Al.

On Monday, 31 August 2020, Thad Guidry <thadguidry@gmail.com> wrote:

After researching a bit more and looking at the NLP programmatic space (and semantic vector models), it seems that contractions are dealt with nicely and not so nicely ;-)
One caveat for English seems to be the expansion of contractions, which requires contextual knowledge:
I'd -> I would
I'd -> I had
Regardless, I think having all the individual necessary Lexemes already in Wikidata (I, would, had, etc.) will be useful so that a Rule engine can run Wikilambda functions that incorporate Lexeme knowledge.
There are definite limits to SPO handling, and I think I can see why Denny and team are working towards a 3rd system where Rules for conflation, etc. can be programmed with functions rather than stored directly in the Lexeme namespace.
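
To make the "I'd" ambiguity concrete, here is a rough Python sketch of what such a rule might look like. Everything in it is an assumption for illustration: the CANDIDATES table, the tiny PAST_PARTICIPLES set and the expand() helper are hypothetical names, not part of Wikidata, Wikilambda or pycontractions, and the heuristic (a following past participle suggests "had", anything else suggests "would") is far cruder than what a real NLP package does.

```python
# Hypothetical rule table: each contraction maps to its candidate expansions.
# For the ambiguous "'d" case the order is [would-form, had-form].
CANDIDATES = {
    "i'd": ["I would", "I had"],
    "i'll": ["I will"],
    "can't": ["cannot"],
}

# Tiny hand-picked word list, purely illustrative. A real rule would look
# the word's grammatical features up in Lexeme data instead.
PAST_PARTICIPLES = {"been", "gone", "seen", "done", "known", "taken", "had"}


def expand(contraction, next_word=None):
    """Expand a contraction, using the following word to resolve ambiguity."""
    options = CANDIDATES.get(contraction.lower())
    if options is None:
        # Unknown token: return it unchanged.
        return contraction
    if len(options) == 1:
        # Unambiguous contraction: only one possible expansion.
        return options[0]
    # Ambiguous "'d": a following past participle (or "better") points to
    # the had-form; otherwise fall back to the would-form.
    if next_word is not None and next_word.lower() in PAST_PARTICIPLES | {"better"}:
        return options[1]
    return options[0]


if __name__ == "__main__":
    print(expand("I'd", "like"))
    print(expand("I'd", "seen"))
```

The point is only that the choice depends on context outside the contraction itself, which is why it reads more naturally as a function than as a statement stored on a Lexeme.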

For additional context, I've copied my reply to GrounderUK from our discussion on one of the Lexeme talk pages, which has some interesting links:

@GrounderUK: I think we have enough information now to say that, as Denny stated, the current Lexeme handling will eventually need to account for rules. Those rules (some of which you started to hint at) would be stored in Wikilambda functions that are then incorporated into the Lexeme pages and Abstract Wikipedia handling. The mere fact that individual Lexemes can be stored and referenced already makes them useful, but programmatically it makes sense to store "contractions and other language rules" in a rule engine, and the current limits of SPO handling cannot account for that. That is why I can see that we will need the new system that Denny and team are working on, and we can help it along by writing those rules once we have that ability. Something like this tokenizer in Haskell, for instance, or the pycontractions package [mentioned in an NLP article]. Thanks to both of you nonetheless. I'll engage in the other discussion areas. But all of this discussion here is very useful, even if only to highlight the gaps in the current Lexeme system. Thadguidry (talk) 15:27, 31 August 2020 (UTC)

Thanks Al. And thanks Denny and team. Very much looking forward to the "playground" for language rules.

(Python is exceptional for NLP primarily because of its quite good string handling and wealth of existing NLP packages. For C-like speed, Nim is also well suited and Python-like, though it is a rarely used language)
(Haskell is also heavily used for NLP but in the last 5 years has seen a decline within the NLP programming space)

Thad
https://www.linkedin.com/in/thadguidry/