You're welcome, Thad! I've already replied to you there; here is the link:
https://www.wikidata.org/wiki/Lexeme_talk:L269709, in case anyone was
wondering.
Since you mention *'d*, I feel bound to mention *'s*: our two most common
verbs in one handy package! I'm not even going to mention our multi-purpose
*s* inflection ;)
Best regards,
Al.
On Monday, 31 August 2020, Thad Guidry <thadguidry(a)gmail.com> wrote:
After researching a bit more and looking at the NLP programmatic space
(and semantic vector models), it seems that contractions are dealt with
both nicely and not so nicely ;-)
One caveat for the English language seems to be the expansion of
contractions, which requires contextual knowledge:
I'd -> I would
I'd -> I had
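To illustrate why context matters, here is a minimal Python sketch of one naive disambiguation heuristic (my own toy example, not how pycontractions or any real disambiguator works): choose "had" when the next word is a past participle, otherwise "would". The tiny hand-made participle list stands in for real POS tagging.

```python
# Naive heuristic for expanding the ambiguous English contraction "'d":
#   "'d" + past participle -> "had"   ("I'd gone" -> "I had gone")
#   "'d" + base-form verb  -> "would" ("I'd go"   -> "I would go")
# A tiny hand-made set of past participles stands in for real POS tagging.
PAST_PARTICIPLES = {"gone", "been", "seen", "done", "eaten", "taken"}

def expand_d(sentence: str) -> str:
    words = sentence.split()
    out = []
    for i, word in enumerate(words):
        if word.lower().endswith("'d"):
            stem = word[:-2]
            nxt = words[i + 1].lower() if i + 1 < len(words) else ""
            aux = "had" if nxt in PAST_PARTICIPLES else "would"
            out.extend([stem, aux])
        else:
            out.append(word)
    return " ".join(out)

print(expand_d("I'd gone home"))  # -> I had gone home
print(expand_d("I'd go home"))    # -> I would go home
```

A real rule engine would need proper tagging and a much larger lexicon, of course; the point is only that the choice cannot be made from the contraction alone.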
Regardless, I think having all the individual necessary Lexemes already in
Wikidata (I, would, had, etc.) will be useful so that a Rule engine can run
Wikilambda functions that incorporate Lexeme knowledge.
There are definite limits to SPO (subject-predicate-object) handling, and I
think I can see why Denny and team are working towards a third system where
rules for conflation, etc. can be programmed with functions rather than
stored directly in the Lexeme namespace.
For additional context, I've copied my reply to GrounderUK from our
discussion on one of the Lexeme talk pages, which has some interesting links:
@GrounderUK <https://www.wikidata.org/wiki/User:GrounderUK>: I think we
have enough information now to say that, as Denny had stated, the current
Lexeme handling will eventually need to account for rules. Those rules
(some of which you started to hint at) would be stored in Wikilambda
functions, which are then incorporated into the Lexeme pages and into
Abstract Wikipedia handling. The mere fact that individual Lexemes can be
stored and referenced already makes them useful, but programmatically it
makes sense to store contractions and other language rules in a rule
engine, and the current limits of SPO handling cannot account for that.
That is why I can see that we will need the new system Denny and team are
working on, and we can help it along by writing those rules once we have
that ability. Something like this tokenizer
<https://hackage.haskell.org/package/chatter-0.0.0.3/docs/src/NLP-Tokenize.html>
in Haskell, for instance, or the pycontractions
<https://pypi.org/project/pycontractions/> package [mentioned in an NLP
article
<https://medium.com/@lukei_3514/dealing-with-contractions-in-nlp-d6174300876b>]
Thanks to both of you nonetheless. I'll engage in the other discussion
areas. But all of this discussion here is very useful, even if only to
highlight the gaps in the current Lexeme system. Thadguidry
<https://www.wikidata.org/wiki/User:Thadguidry> (talk
<https://www.wikidata.org/wiki/User_talk:Thadguidry>) 15:27, 31 August
2020 (UTC)
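The tokenizers linked above typically split a contraction into separate tokens rather than expanding it, so each piece can be linked to its own Lexeme. Here is a minimal Python sketch of that idea (my own simplification, not the actual behavior of the chatter or pycontractions packages):

```python
import re

# Split common English contraction suffixes ('d, 's, 'll, 're, 've, 'm, n't)
# off as separate tokens, so "I'd" becomes "I" + "'d" and each token can be
# matched against its own Lexeme entry.
CONTRACTION = re.compile(r"^(.*?)('d|'s|'ll|'re|'ve|'m|n't)$", re.IGNORECASE)

def tokenize(text: str) -> list[str]:
    tokens = []
    for word in text.split():
        m = CONTRACTION.match(word)
        if m and m.group(1):  # require a non-empty stem before the suffix
            tokens.extend([m.group(1), m.group(2)])
        else:
            tokens.append(word)
    return tokens

print(tokenize("I'd say it's fine"))
# -> ['I', "'d", 'say', 'it', "'s", 'fine']
```

Note that this sidesteps the ambiguity problem entirely: *'d* stays *'d*, and deciding between *would* and *had* is deferred to a later rule (exactly the kind of thing a Wikilambda function could do).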
Thanks Al. And thanks Denny and team. Very much looking forward to the
"playground" for language rules.
(Python is exceptional for NLP primarily because of its quite good string
handling and wealth of existing NLP packages. To get C-like speed, Nim
<https://nim-lang.org/> is also well suited and Python-like, though a
rarely used language.)
(Haskell has also been used heavily for NLP, but has seen a decline in
that space over the last 5 years.)
Thad
https://www.linkedin.com/in/thadguidry/