Re: [Wikidata-tech] Two questions about Lexeme Modeling

22 Nov 2016

  There are many many words with multiple spellings, but not many words
  with more than two, and few with more than three [citation needed].

That is not true in languages with a large number of dialects. For instance
in Catalan there are 5 standard spellings for "carrot" depending on which
dialect you choose, plus some more if you consider local variations:
https://ca.wikipedia.org/wiki/Pastanaga

But that is nothing compared to the 8 spellings of tomato or more if you
count the local variations:
https://ca.wikipedia.org/wiki/Tom%C3%A0quet

Additionally the same form can have different meanings depending on which
dialect you choose. For instance "pastenaga" means "orange carrot" in
Catalan from Catalonia, and "purple carrot" in Catalan from Valencia.

Which makes me think: how will dialects be handled? With Statements?

This is an example of a dialect map:
https://ca.wikipedia.org/wiki/Dialectes_del_catal%C3%A0#Divisi.C3.B3_dialec…

Regards and thanks for elaborating your long answer,
-d

On Mon, Nov 21, 2016 at 5:45 PM, Daniel Kinzler <daniel.kinzler(a)wikimedia.de> wrote:
  Hi all!

 Sorry for the delay. To keep the conversation in one place, I will reply to
 David, Denny, and Philipp in one mail. It's going to be a bit long,
 sorry...

 Am 11.11.2016 um 23:17 schrieb David Cuenca Tudela:
  Hi there!

  1) a possible solution could be to have another category of items ("Gxxx",
  grammatical rule?) to store grammatical structures, like "Noun + verb +
  object" or "Noun + reflexive verb" and then linking to that structure with
  a qualifier of the position that it uses on that structure.
  Example:
  "to shit" <grammatical structure required> "Subject + reflexive verb +
  reflexive pronoun"
     <role in grammatical structure> "reflexive verb"

 I see no need for a separate entity type, this could be done with a regular
 Item. If we want this to work nicely for display, though, the software would
 need to know about some "magic" properties and their meaning. Since Wikidata
 provides a stable global vocabulary, it would not be terrible to hard-code
 this. But still, it's special case code...

 This is pretty similar to Lemon's "Syntactic Frame" that Philipp pointed out,
 see below.

  2) I would prefer statements as they can be complemented with qualifiers as
  for why it has a certain spelling (geographical variant, old usage,
  corruption...).

 You can always use a statement for this kind of information, just as we do now
 on Wikidata with properties for the surname or official name.

 The question is how often the flexibility of a statement is really needed. If
 it's not too often, it would be ok to require both (the lemma and the
 statement) to be entered separately, as we do now for official name, birth
 name, etc.

 Another question is which (multi-term lemma or secondary lemma-in-a-statement)
 is easier to handle by a 3rd party consumer. More about that later.

  It would be nice however if there would be some mechanism to have a special
  kind of property that would use its value as an item alias. And this is
  something that could benefit normal items in Wikidata too, as most name
  properties like P1448, P1477 (official name, birth name, etc), should have
  their values automatically shown as aliases of the item in all languages, if
  that were technologically feasible.

 Yes, this would be very convenient. But it would also mix levels of content
 (editorial vs. sourced) that are now nicely separated. I'm very tempted, but
 I'm not sure it's worth it.

 Am 12.11.2016 um 00:08 schrieb Denny Vrandečić:
  Not only that. "I shit myself" is very different from "Don't shit yourself".
  It is not just the reflexivity. It might be the whole phrase.

 Yes, the boundary to a phrase is not clear cut. But if we need the full power
 of modeling as a phrase, we can always do that by creating a separate Lexeme
 for the phrase. The question is if that should be the preferred or even the
 only way to model the "syntactic frame".

 It's typical for a dictionary to have a list of meanings structured like this:

   to ask
   to ask so. sth.
   to ask so. for sth.
   to ask so. about sth.
   to ask so. after sb.
   to ask so. out
   ...

 It would be nice if we had an easy way to create such an overview. If each line
 is modeled as a separate Lexeme, we need to decide how these Lexemes should be
 connected to allow such an overview.

 I feel these "frames" should be attached to senses. Making all of them separate
 Lexemes will drive granularity up, making things hard to follow and maintain.

      We could also add this information as a special field in the Sense
      entity, but I don't even know what that field should contain, exactly.

 It could be a reference to an Item. Perhaps that item defines a specific
 pattern, like "$verb someone" or "$verb someone something" or "$verb
 oneself". That pattern (defined by a statement on the item) can then be used
 to render the concrete pattern for each word sense.
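
 A minimal sketch of how such a pattern Item could drive rendering, in Python.
 The frame identifiers ("frame-1" and so on) and the FRAME_PATTERNS table are
 made up for illustration and do not correspond to any existing Item or
 Wikibase API. The only real library call is string.Template, whose "$name"
 placeholder syntax happens to match the notation used above.

```python
from string import Template

# Hypothetical frame Items: each one carries a pattern (in practice this
# would be a statement on the Item). Identifiers are invented for this sketch.
FRAME_PATTERNS = {
    "frame-1": "$verb someone",
    "frame-2": "$verb someone something",
    "frame-3": "$verb oneself",
}


def render_frame(frame_id: str, verb: str) -> str:
    """Fill the lemma into the pattern stored on the referenced frame Item."""
    return Template(FRAME_PATTERNS[frame_id]).substitute(verb=verb)


# Each sense only stores a reference to a frame Item, not the rendered text.
senses_of_ask = ["frame-1", "frame-2"]
overview = [render_frame(frame_id, "to ask") for frame_id in senses_of_ask]
```

 With this shape, the dictionary-style overview is derived from the senses of a
 single Lexeme instead of being spread over one Lexeme per line.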

  Just a usage example on the sense? That would often be enough to express the
  proposition.

 Possible, but then it's unclear which parts of the grammar are required to
 generate a specific meaning. You'd need some kind of markup in the example,
 which I would like to avoid.

  I am not a friend of multi-variant lemmas. I would prefer to either have
  separate Lexemes or alternative Forms. Yes, there will be duplication in the
  data, but this is expected already, and also, since it is machine-readable,
  the duplication can be easily checked and bot-ified.

 Getting rid of bots that keep duplicate data in sync was one of the reasons we
 created Wikidata, and one of its major selling points. Bots have a lot of
 uses, but copying data around isn't really a good one.

 Also, how do you sync deletions? Reverts? The semantics is not trivial.

  Also, this is how Wiktionary works today:
  https://en.wiktionary.org/wiki/colour
  https://en.wiktionary.org/wiki/color

  Notice that there is no primacy of either.

 True. But that's not how other dictionaries work:

 https://dict.leo.org/ende/index_de.html#/search=color
 http://www.merriam-webster.com/dictionary/colour
 http://www.dictionary.com/browse/color?s=t

 Oxford even redirects: https://en.oxforddictionaries.com/definition/color

 Only dict.cc makes the distinction: https://www.dict.cc/?s=colour vs
 https://www.dict.cc/?s=color

 We are collecting a LOT of information about each Lexeme. Duplicating it for
 all spelling variants is going to be a HUGE pain. And it's not rare, either. I
 estimate that we'll be storing like 20% duplicates (assuming one in five words
 has two spellings, on average, across all languages). That also means 20%
 duplicate notifications in your feed, 20% more pages to watch. I don't
 like it...
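
 The arithmetic behind that estimate can be sketched as follows. This is only
 an illustration under the stated assumption (one in five words has exactly two
 spellings); the function name and the 100,000-word figure are made up, not
 measured data.

```python
def duplication(n_words: int, share_with_two_spellings: float):
    """Return (extra entries, growth relative to one-Lexeme-per-word,
    share of duplicates among all stored entries)."""
    # One additional Lexeme is needed for every word with a second spelling.
    extra = round(n_words * share_with_two_spellings)
    total = n_words + extra
    return extra, extra / n_words, extra / total


# Illustrative numbers only: 100,000 words, one in five with two spellings.
extra, growth, share_of_total = duplication(100_000, 0.2)
```

 Under these assumptions there are 20% more entries to store and watch than
 with one Lexeme per word; measured against the enlarged total, the duplicates
 make up one sixth of all entries.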

  Having multi-variant lemmas seems to complicate the situation a lot. I think
  it is important to have only one single Lemma for each Lexeme, in order to
  keep display logic simple

 Just show all variants, unless told otherwise. In the order given.

 There are many many words with multiple spellings, but not many words with
 more than two, and few with more than three [citation needed].

  But what is the difference for an entry that doesn't have a BE variant in
  order to reduce redundancy vs an entry that doesn't have a BE variant because
  it has not been entered yet.

 We have the problem of distinguishing these cases for all the modeling
 variants. Well, with statements you *could* use SomeValue, but I highly doubt
 that people will do that.

  Lemmas do not have the capability and flexibility of statements.

 True. When the full power of a Statement or Lemma is needed, just create one.
 I'm just saying that in the vast majority of cases, that's overkill, and a
 pain to manage, so that should not be the default way.

  How do you determine the primacy of the American or British English version?
  Fallback would be written into the code base, it would not be amenable to
  community editing through the wiki.

 I currently prefer to just always show all spellings, in the order given. For
 people who strongly prefer one version over the other, filtering/sorting can
 be applied by a gadget, or server side formatting code.

 Consumers that only want to show a single lemma can just show the first. Sure,
 people will need to figure out primacy. But they would have to do this also if
 you go with Statements (which spelling will be the one single lemma?) and
 separate Lexemes (either show all, or pick the "main" one somehow).

  Whether separate Lexemes or alternative Forms are better might be different
  from language to language, from case to case. By hard-coding the
  multi-variant lemmas, you not only pre-decided the case, but also made the
  code and the data model much more complicated. And not only for the initial
  development, but for perpetuity, whenever the data is used.

 I think for a 3rd party consumer that does care about variants, it's a LOT
 simpler to deal with multiple lemmas than to deal with Statements with special
 properties, getting the ranks right, etc.

 And for those who don't care about variants, joining list elements or just
 showing the first element is simple enough.

 Also: a Wikibase client will need code for dealing with TermLists anyway, since
 it needs to handle multi-lingual item labels.
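
 To illustrate how little a consumer needs for a multi-variant lemma, here is a
 rough Python sketch. It assumes the lemma arrives as an ordered list of
 (variant code, spelling) pairs; that shape, the codes "en-gb" and "en-us", and
 all function names are assumptions for this example, not the actual Wikibase
 serialization.

```python
from typing import List, Tuple

Term = Tuple[str, str]  # (variant code, spelling); assumed shape, see above

# A multi-variant lemma, in the order given by the editors.
lemma: List[Term] = [("en-gb", "colour"), ("en-us", "color")]


def show_all(terms: List[Term]) -> str:
    """Naive display: join all spellings in the order given."""
    return "/".join(text for _, text in terms)


def show_first(terms: List[Term]) -> str:
    """Consumers that want a single lemma just take the first one."""
    return terms[0][1]


def show_preferred(terms: List[Term], preferred: str) -> str:
    """Optional filtering, e.g. in a gadget: pick the preferred variant,
    falling back to the first spelling if it is not present."""
    for code, text in terms:
        if code == preferred:
            return text
    return show_first(terms)
```

 The first two functions cover consumers that do not care about variants; only
 the third needs to know anything about variant codes.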

 My broader point is: by keeping the (ontology level) meta-model simple, we
 would make the actual (instance level) model more complicated. I prefer a more
 complex meta-model, which allows for a simpler instance model. The instance
 model is what the community has to deal with, and it's what we'll have
 gigabytes of.

  We shouldn't force for perfection and covering everything from the beginning.

 That is true. But if we miss a crucial aspect, people will build workarounds.
 And cleaning those up is a lot of work - and sometimes impossible. This is what
 is locking us into the ancient wiki syntax.

  If not every case can be ideally modeled, but we can capture 99.9%

 People *will* capture 99.9% - the question is just how much energy that costs
 them, and how re-usable the result is.

  Also, there is always Wiktionary as the layer on top of Wikidata that
  actually can easily resolve these issues anyway.

 Agreed. But how exactly? For instance, take the two Wiktionary pages on "color"
 and "colour". Would they benefit more from two separate Lexemes (similar to how
 things are on Wiktionary), or from a single Lexeme, to automatically keep the
 pages in sync?

 The model determines how our data is going to be used. We cannot rely on the
 presentation layer to work out kinks in the model. And more importantly, we
 can't make fundamental changes to the model later, as that would break millions
 of pages.

  Once we have the simple pieces working, we can actually try to understand
  where the machinery is creaking and not working well, and then think about
  these issues.

 Slow iteration is nice as long as you don't produce artifacts you need to stay
 compatible with. I have become extremely wary of lock-in - Wikitext is the
 worst lock-in I have ever seen. Some aspects of how we implemented the Wikibase
 model for Wikidata also have proven to be really hard to iterate on. Iterating
 the model itself is even harder, since it is bound to break all clients in a
 fundamental way. We just got very annoyed comments just for making two fields
 in the Wikibase model optional.

 Switching from single-lemma to multi-lemma would be a major breaking change,
 with lots of energy burned on backwards compatibility. The opposite switch
 would be much simpler (because it adds guarantees, instead of removing them).
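
 The asymmetry can be shown with a toy example in Python. Both client functions
 are hypothetical and only stand in for third-party code written against one
 model or the other; nothing here reflects real client libraries.

```python
from typing import List


def multi_lemma_client(lemmas: List[str]) -> str:
    # Written against the multi-lemma model: handles any non-empty list.
    return " / ".join(lemmas)


def single_lemma_client(lemma: str) -> str:
    # Written against the single-lemma model: assumes exactly one string.
    return lemma.upper()


# Multi -> single: the data is now guaranteed to be a list of length one,
# which the existing multi-lemma client already handles.
still_works = multi_lemma_client(["color"])

# Single -> multi: the guarantee "exactly one string" is removed, and the
# existing single-lemma client fails on the new shape.
try:
    single_lemma_client(["colour", "color"])  # type: ignore[arg-type]
    broke = False
except AttributeError:
    broke = True
```

 Tightening the model leaves existing consumers working, while loosening it
 forces every consumer to be revisited.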

  But until then I would prefer to keep the system as dumb and simple as
  possible.

 I would prefer to keep the user generated *data* as straightforward as
 possible. That's more important to me than a simple meta-model. The complexity
 of the instance data determines the maintenance burden.

 Am 20.11.2016 um 21:06 schrieb Philipp Cimiano:
  Please look at the final spec of the lemon model:

  https://www.w3.org/community/ontolex/wiki/Final_Model_Specification#Syntactic_Frames

  In particular, check example: synsem/example7

 Ah, thank you! I think we could model this in a similar way, by referencing an
 Item that represents a (type of) frame from the Sense. Whether this should be
 a special field or just a Statement I'm still undecided on.

 Is it correct that in the Lemon model, it's not *required* to define a
 syntactic frame for a sense? Is there something like a default frame?

  2) Such spelling variants are modelled in lemon as two different
  representations of the same lexical entry.  [...]
  In our understanding these are not two different forms as you mention, but
  two different spellings of the same form.

 Indeed, sorry for being imprecise. And yes, if we have a multi-variant lemma,
 we should also have multi-variant Forms. Our lemma corresponds to the canonical
 form in Lemon, if I understand correctly.

  The preference for showing e.g. the American or English variant should be
  stated by the application that uses the lexicon.

 I agree. I think Denny is concerned with putting that burden on the
 application. Proper language fallback isn't trivial, and the application may be
 a lightweight JS library... But I think for the naive case, it's fine to simply
 show all representations.

 Thank you all for your input!

 --
 Daniel Kinzler
 Senior Software Developer

 Wikimedia Deutschland
 Gesellschaft zur Förderung Freien Wissens e.V.

 _______________________________________________
 Wikidata-tech mailing list
 Wikidata-tech(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikidata-tech

-- 
Etiamsi omnes, ego non
