Hi Team,
As usual for me, I love digging and pulling out weeds from the garden beds on the weekend. :-)
I searched the repo and did not see "apostrophe" or "contraction" mentioned at all:
https://github.com/google/abstracttext/search?q=apostrophe&unscoped_q=apostrophe
https://github.com/google/abstracttext/search?q=contraction&unscoped_q=contraction
I was hoping to see an example conversion function to help with contractions (shortened forms of words where letters have been omitted and replaced by apostrophes, and sometimes other characters). Is there one?
My use case (in the future) is to help Abstract Wikipedia more easily handle search & matching for English idioms and deal with alternative variants that sometimes have contracted forms of words within them. For example, https://www.wikidata.org/wiki/Lexeme:L311061:
"We will cross that bridge when we come to it"
"We'll cross that bridge when we come to it"
Idioms are so complex in English, with many alternative variants involving optional hyphens, apostrophes, etc. So I'm (<-- a contraction!) trying to understand some of the future ideas on how searchability might be improved by allowing hints somehow in Wikidata Lexemes, and what a first practice (maybe not best practice yet!) would begin to look like.
We might have parsing functions that already know that "we will" = "we'll". The lexeme for "we'll" is in fact already there: https://www.wikidata.org/wiki/Lexeme:L269709 GREAT! But I think that lexeme and others are missing additional information that would make them really useful with our later conversion functions or renderers.
So... Some of these questions are deep, forward-thinking, and probably will not have the best answers right now, but it's (<-- another contraction!!) useful to ask them now, I think:
1. Perhaps we could somehow mention that https://www.wikidata.org/wiki/Lexeme:L269709 is a contraction and not only a phrase? But I don't see how to do that currently; Lexical category allows only one value. If you know, let me know.
2. How would a function determine equivalency, handled with Z-objects, in the case of contractions? For example, would the mere fact that L269709 has two forms, -F1 and -F2, automatically return a boolean True from some function? Is that best? (I've put a rough sketch of what I'm imagining below this list.)
3. Seeing how ubiquitous contractions are... does that make them a good candidate in the future for separate indexing?
   a. Are L269709 and its -F1 and -F2 forms good enough for building very fast lookup or conversion functions for contractions that would use ElasticSearch indexes? This could be performant enough and completely stored in memory for the English language, I guess?
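To make question 2 concrete, here's a rough sketch of the kind of boolean function I mean. It only uses Wikidata's standard Special:EntityData endpoint and the standard lexeme JSON layout; the function names and the naive "both strings are forms" check are just my own illustration, nothing that exists in the repo:

```python
import json
from urllib.request import urlopen

def lexeme_forms(lexeme_id, lang="en"):
    """Fetch the form representations of a Wikidata lexeme (e.g. L269709)."""
    url = f"https://www.wikidata.org/wiki/Special:EntityData/{lexeme_id}.json"
    with urlopen(url) as response:
        entity = json.load(response)["entities"][lexeme_id]
    return {
        form["representations"][lang]["value"]
        for form in entity.get("forms", [])
        if lang in form["representations"]
    }

def same_lexeme_forms(a, b, lexeme_id):
    """True when both strings appear among the lexeme's forms."""
    forms = lexeme_forms(lexeme_id)
    return a in forms and b in forms

# L269709 currently has "we'll" (-F1) and "we will" (-F2) as forms:
print(same_lexeme_forms("we'll", "we will", "L269709"))  # True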
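Though I suspect a plain boolean like this loses information about *which* expansion is meant, which is part of why I'm asking whether it is "best".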
I'm all ears (<-- an idiom!!) (means "I am listening")
Thad
https://www.linkedin.com/in/thadguidry/
Hi Thad! "Il faut cultiver notre jardin." ("We must cultivate our garden.") [https://againstprofphil.org/2017/02/20/il-faut-cultiver-notre-jardin/]
I did add a contracted form to a lexeme a few weeks ago, just to test the water (https://www.wikidata.org/w/index.php?title=Lexeme:L1883&diff=1248168703). Specifically, this was "'m" as a contraction of "am" and, therefore, a form of *be*. But, as is common with contractions, the form is permitted only in specific contexts (some languages, like French, *require* contractions in particular contexts, so we certainly can't ignore them).
1. I'm not convinced that "we'll" is a phrase. It is certainly a contraction (of either "we shall" or "we will"). My understanding is that a *lexical phrase* would only be admitted as a lexeme if its definition (including its inflection) is not fully implied by its constituents. And I suggest that is not the case here. Rather, it is justified as a lexeme by the fact that it is a contraction (personally, I don't entirely agree with this justification in this particular case, because "'ll" is practically a separate word in contemporary English, but perhaps that 'll be a topic for another day).
2. From 1, "we'll" has three forms: "we'll", "we shall" and "we will". But these are not inflections, so perhaps they aren't strictly *forms* at all. Semantics aside, "we'll" is neither more nor less than "we" followed by "'ll", with the orthographic convention of no intervening space where a word begins with an apostrophe (unless 'tis an exception).

Anyway... What sort of function did you have in mind? My theory would be that only a parsing function would start with "we'll", and its result is ambiguous because "'ll" is ambiguous (being a form in two separate lexemes, *shall* and *will*).

Since you mention searching, I'd guess that a search would want to find results containing any one of the three forms, given "we'll", but only the given form and "we'll", given either of the other forms.

Since you also mention "equivalency", I would say that there is equivalence between "we'll" and "we shall", and there is equivalence between "we'll" and "we will", but there is no (implied) equivalence between "we shall" and "we will". (For a clearer case, consider "we'd", which might mean "we had" or "we should" or "we would". You might suspect equivalence between "we should" and "we would" but not between either and "we had". So it is not the common form of the contraction that implies the suspected equivalence; it is the fading/faded distinction between *shall* and *will*, facilitated by common contractions.)

Since you mention renderers, I would say not. But that's not a definitive no. It's more a case of we'll build that bridge when we get to where it isn't ;) A rendering function should be expected to leave the matter of contraction 'til later (which, like tomorrow, never comes). According to this theory, there might come a time when a renderer function has (in effect) "we shall" and returns "we'll", but not the other way around. (And according to my principle of losslessness, the result is actually both: "we'll<!--optional contraction of: we shall-->", in essence. A toy sketch of what I mean follows after point 3.)
3. No idea! (Just saying...)
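To make my "losslessness" point concrete, here's a toy sketch of such a one-way rendering rule. It is entirely illustrative (the mapping, the names, and the HTML-comment convention are all just mine, not anything proposed for Wikilambda):

```python
# Toy rendering rule, per the losslessness principle: contract, but keep
# the uncontracted original recoverable. Note it only runs one way
# (full form -> contraction), never the other way around.
CONTRACTIBLE = {
    "we shall": "we'll",
    "we will": "we'll",
}

def render_with_optional_contraction(text):
    for full, contracted in CONTRACTIBLE.items():
        text = text.replace(
            full, contracted + "<!--optional contraction of: " + full + "-->"
        )
    return text

print(render_with_optional_contraction("we shall cross that bridge"))
# we'll<!--optional contraction of: we shall--> cross that bridge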
Just to backtrack to your use case, I'd be inclined to lemmatize the entire phrase. My millennial Collins English Dictionary has sense 32 for *cross*: "cross a bridge when one comes to it". My Oxford Dictionary of English has the phrase "cross that bridge when one comes to it" under *bridge*. One never says "one", of course! There seem to be five or six occurrences in the British National Corpus. Five say "cross that bridge when"; three are for "we" (one attributive, modifying "attitude"; one imperative: "let's cross that bridge when we *get* to it"; one: "we can cross..."), one each for "he" and "she" with "would" (in the same text but not close together), and one, well... "we cross our bridges when we come to them and burn them behind us..." (Why, thank you, Sir Tom! https://en.wikipedia.org/wiki/Tom_Stoppard)
Thank you for "listening", Al.
Hi Al. Being able to avoid contractions would be advantageous in certain cases. For example, for Simple English rendering we might have a rule that says *no contractions* are allowed. This would help others learning the language, or simply enable smart auto-completion as you type on a Wikipedia page, replacing "we'll" with "we will", for instance. Or the reverse rendering: for those that might have their Wikipedia Babel native language for English set to en-N, you might *want contractions* in order to avoid verbosity for readers who can quickly read and understand English sentences that have them.
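As a rough sketch of what such a register rule might look like (the table, function names, and register labels are all just my illustration, not anything existing):

```python
# Toy register rule: Simple English expands contractions; en-N readers
# get them back. A real rule would handle capitalization, word
# boundaries, and ambiguity ("it's" = "it is" or "it has").
EXPANSIONS = {"we'll": "we will", "I'm": "I am", "it's": "it is"}
CONTRACTIONS = {full: short for short, full in EXPANSIONS.items()}

def apply_register(text, register):
    table = EXPANSIONS if register == "simple" else CONTRACTIONS
    for source, target in table.items():
        text = text.replace(source, target)
    return text

print(apply_register("we'll cross that bridge when we come to it", "simple"))
# we will cross that bridge when we come to it
print(apply_register("we will cross that bridge when we come to it", "en-N"))
# we'll cross that bridge when we come to it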
In general, there are tons of use cases for renderers that I have thought of, and this is only one of them, but they all rely on constructors, functions, and lexicographical data stored well and in some standardized fashion somewhere. Which is why I am always asking about best practices within Wikidata's Lexeme namespace. I find that I often look for some of the most complex issues to deal with, which helps me understand where the gaps will likely be in any system. Then we can begin conversations on how we'll tackle them.
Looking forward to others' opinions!
Nothing I said is intended to imply that contractions should be avoided. They are, in any event, a linguistic fact that should be recorded accurately in Wikidata. So, I have:
1. Added forms -F3 ('ll) and -F4 ('d) to L1891 *shall*. (I've left *will* for the time being.)
2. Added form -F3 (we shall) to L269709. I don't believe this is correct, but it parallels the existing -F2, so we can get rid of them both later. I added -F3 with the feature "contraction". This is even less correct, but it might help focus future debate.
3. Changed L269709 from a phrase to a contraction.
4. Added a "combines" (P5238) statement to L269709. I tried to add two, but the second merged itself with the first. I've left it as *we*+*shall*+*will* for the time being.
5. Created the discussion page for L269709 explaining the problem. Jura has responded "I think it would be better to have separate entities"...
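For the curious, here is roughly how a function could read those edits back out of the lexeme's JSON. The endpoint and field names are the standard Wikibase ones; the function name is mine, and I'm assuming the usual entity-id layout for the P5238 value:

```python
import json
from urllib.request import urlopen

def describe_lexeme(lexeme_id):
    """Print each form with its grammatical features (Q-ids), plus any
    lexemes it 'combines' (P5238)."""
    url = f"https://www.wikidata.org/wiki/Special:EntityData/{lexeme_id}.json"
    with urlopen(url) as response:
        entity = json.load(response)["entities"][lexeme_id]
    for form in entity.get("forms", []):
        rep = next(iter(form["representations"].values()))["value"]
        print(form["id"], repr(rep), form.get("grammaticalFeatures", []))
    for claim in entity.get("claims", {}).get("P5238", []):
        print("combines:", claim["mainsnak"]["datavalue"]["value"]["id"])

describe_lexeme("L269709")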
Those weeds keep growing... let's think of them as green manure! Al.
After researching a bit more and looking at the NLP programmatic space (and semantic vector models), it seems that contractions are dealt with nicely and not so nicely ;-) One caveat for the English language seems to arise upon expansion of contractions, which requires contextual knowledge:
I'd -> I would
I'd -> I had
Regardless, I think having all the individual necessary Lexemes already in Wikidata (I, would, had, etc.) will be useful so that a rule engine can run Wikilambda functions that incorporate Lexeme knowledge. There are definite limits to SPO handling, and I think I can see why Denny and team are working towards a third system where rules for conflation, etc. can be programmed with functions rather than stored directly in the Lexeme namespace.
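As a toy illustration of why that expansion needs context (and why a rule engine beats plain triples here), with a stub participle list standing in for the real Lexeme knowledge such a rule would consult:

```python
# Toy context rule for expanding "I'd": if the next word looks like a
# past participle, read "'d" as "had"; otherwise as "would". The word
# list is an illustrative stub; a real Wikilambda rule would consult
# Lexeme forms instead of a hard-coded set.
PAST_PARTICIPLES = {"done", "been", "had", "seen", "gone", "known"}

def expand_id(tokens):
    out = []
    for i, token in enumerate(tokens):
        if token.lower() == "i'd":
            nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
            out.extend(["I", "had" if nxt in PAST_PARTICIPLES else "would"])
        else:
            out.append(token)
    return out

print(" ".join(expand_id("I'd like to know how I'd done that".split())))
# I would like to know how I had done that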
For additional context, I've copied my reply to GrounderUK from our discussion on one of the Lexeme talk pages, which has some interesting links:
@GrounderUK https://www.wikidata.org/wiki/User:GrounderUK: I think we have enough information now to say that, as Denny had stated, the current Lexeme handling will eventually need to account for rules, where those rules (some of which you started to hint at) would be stored in Wikilambda functions that are then incorporated into the Lexeme pages and Abstract Wikipedia handling. The mere fact that individual Lexemes can be stored and referenced already makes them useful, but programmatically it makes sense to store "contractions and other language rules" in a rule engine, and the current limits of SPO handling cannot account for that. Which is why I can see that we will need that new system Denny and team are working on, and can help it out by writing those rules once we have that ability. Something like this tokenizer in Haskell, https://hackage.haskell.org/package/chatter-0.0.0.3/docs/src/NLP-Tokenize.html, or the pycontractions package, https://pypi.org/project/pycontractions/ (mentioned in an NLP article: https://medium.com/@lukei_3514/dealing-with-contractions-in-nlp-d6174300876b). Thanks to both of you nonetheless. I'll engage in the other discussion areas. But all of this discussion here is very useful, even if only to highlight the gaps in the current Lexeme system. Thadguidry https://www.wikidata.org/wiki/User:Thadguidry (talk: https://www.wikidata.org/wiki/User_talk:Thadguidry) 15:27, 31 August 2020 (UTC)
Thanks Al. And thanks Denny and team. Very much looking forward to the "playground" for language rules.
(Python is exceptional for NLP, primarily because of its quite good string handling and its wealth of existing NLP packages. To get C-like speed, Nim https://nim-lang.org/ is also well suited and Python-like, but is a rarely used language. Haskell is also heavily used for NLP, but in the last 5 years it has seen a decline within the NLP programming space.)
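Speaking of pycontractions: if I'm reading its README correctly, usage is roughly like this. It disambiguates expansions against a word-embedding model, so the embeddings file below is just a stand-in for whatever you have locally, and I haven't verified this exact snippet:

```python
from pycontractions import Contractions

# Per the pycontractions README (as I read it): load a gensim-format
# word-embedding model, which the package uses to pick the right
# expansion in context ("I'd" -> "I would" vs "I had").
cont = Contractions('GoogleNews-vectors-negative300.bin')  # stand-in path
cont.load_models()  # optional warm-up, so the first call isn't slow

print(list(cont.expand_texts(["I'd like to know how I'd done that!"],
                             precise=True)))
# Expected, roughly: ["I would like to know how I had done that!"]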
You're welcome, Thad! I've already replied to you there, but this is where: https://www.wikidata.org/wiki/Lexeme_talk:L269709, in case anyone was wondering.
Since you mention *'d*, I feel bound to mention *'s*: our two most common verbs in one handy package! I'm not even going to mention our multi-purpose *s* inflection ;)
Best regards, Al.