Hi Stas, Markus, Denny!
For a long time now, we have been wanting to generate proper resource references
(URIs) for external identifier values, see
<https://phabricator.wikimedia.org/T121274>.
Implementing this is complicated by the fact that "expanded" identifiers may
occur in four different places in the data model (direct, statement, qualifier,
reference), and by the fact that we can't simply replace the old string value;
we need to provide an additional value.
I have attached three files with snippets of three different RDF mappings:
- Q111.ttl - the status quo, with normalized predicates declared but not used.
- Q111.rc.ttl - modeling resource predicates separately from normalized values.
- Q111.norm.ttl - modeling resource predicates as normalized values.
The "rc" variant means more overhead; the "norm" variant may have semantic
difficulties. Please look at the two options for the new mapping and let me know
which you like best. You can use a plain old diff between the files for a first
impression.
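For a rough idea of the general shape (a hypothetical sketch, not an excerpt
from the attached files; P212/ISBN-13 and the target URI pattern are only
illustrative):

```turtle
@prefix wd:   <http://www.wikidata.org/entity/> .
@prefix wdt:  <http://www.wikidata.org/prop/direct/> .
@prefix wdtn: <http://www.wikidata.org/prop/direct-normalized/> .

# Status quo: the external identifier is only a plain string value.
wd:Q111 wdt:P212 "978-3-16-148410-0" .

# What T121274 asks for: an additional resource (URI) value for the
# same identifier, here under a separate "normalized" predicate.
wd:Q111 wdtn:P212 <https://example.org/isbn/9783161484100> .
```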
--
Daniel Kinzler
Senior Software Developer
Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.
Hi all!
There are two questions about modelling lexemes that are bothering me. One is an
old question, and one I only came across recently.
1) The question that came up for me recently is how we model the grammatical
context for senses. For instance, "to ask" can mean requesting information, or
requesting action, depending on whether we use "ask somebody about" or "ask
somebody to". Similarly, "to shit" has entirely different meanings when used
reflexively ("I shit myself").
There is no good place for this in our current model. The information could be
placed in a statement on the word Sense, but that would be kind of non-obvious,
and would not (at least not easily) allow for a concise rendering, in the way we
see it in most dictionaries ("to ask sbdy to do sthg"). The alternative would be
to treat each usage with a different grammatical context as a separate Lexeme (a
verb phrase Lexeme), so "to shit oneself" would be a separate lemma. That could
lead to a fragmentation of the content in a way that is quite unexpected to
people used to traditional dictionaries.
We could also add this information as a special field in the Sense entity, but I
don't even know what that field should contain, exactly.
Got a better idea?
2) The older question is how we handle different renderings (spellings, scripts)
of the same lexeme. In English we have "color" vs "colour", in German we have
"stop" vs "stopp" and "Maße" vs "Masse". In Serbian, we have a Roman and
Cyrillic rendering for every word. We can treat these as separate Lexemes, but
that would mean duplicating all information about them. We could have a single
Lemma, and represent the others as alternative Forms, or using statements on the
Lexeme. But that raises the question of which spelling or script should be the
"main" one, to be used in the lemma.
I would prefer to have multi-variant lemmas. They would work like the
multi-lingual labels we have now on items, but restricted to the variants of a
single language. For display, we would apply a language fallback mechanism
similar to the one we now apply when showing labels.
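Serialized, such a multi-variant lemma might look much like the label maps we
have on items, keyed by language variant instead of language (a hypothetical
JSON shape for illustration, not an existing API):

```json
{
  "lemmas": {
    "en-us": { "language": "en-us", "value": "color" },
    "en-gb": { "language": "en-gb", "value": "colour" }
  }
}
```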
2b) If we treat lemmas as multi-variant, should Forms also be multi-variant, or
should they be per-variant? Should the gloss of a Sense be multi-variant? I
currently tend towards "yes" for all of the above.
What do you think?
--
Daniel Kinzler
Senior Software Developer
Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.
Dear Denny, Daniel,
thanks for your questions; I will try to answer them.
ad 1) "ask somebody about" and "ask somebody to" are two different
syntactic and semantic frames.
Please look at the final spec of the lemon model:
https://www.w3.org/community/ontolex/wiki/Final_Model_Specification#Syntact…
In particular, check example: synsem/example7
There you see two different syntactic frames for the word "give". In
this case they both represent the same sense, corresponding to an
exchange of goods, but with different syntactic constructions.
In your case for "ask" there would also be two syntactic frames, but two
senses instead of one.
If you want I can send you a modelled example.
2) Such spelling variants are modelled in lemon as two different
representations of the same lexical entry.
See ontolex/example3 in the above-mentioned spec. After all, it is the
same word, with the same meanings and the same pronunciation, but with a
different spelling in each dialect of English.
In our understanding these are not two different forms, as you suggest,
but two different spellings of the same form: a form represents a
particular grammatical variant, not a spelling variant. Both spellings
represent the same (grammatical) form, in this case the singular form of
the noun.
You do not need to specify one main written representation for each
form, as both are valid depending on the context.
The preference for showing e.g. the American or the British variant should
be stated by the application that uses the lexicon.
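In Turtle, the colour/color case from ontolex/example3 looks roughly like this
(prefixes abbreviated; the entry and form names are made up):

```turtle
@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix ex:      <http://example.org/lexicon/> .

# One lexical entry with one canonical form...
ex:colour a ontolex:LexicalEntry ;
    ontolex:canonicalForm ex:colour_form .

# ...and two written representations (spellings) of that same form.
ex:colour_form a ontolex:Form ;
    ontolex:writtenRep "colour"@en-GB , "color"@en-US .
```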
Does this help?
Philipp
On 11.11.16 at 20:07, Denny Vrandečić wrote:
> The Wikidata Lexeme model is basically based on Lemon, so I wanted to
> ask you whether you have answers for the following questions in Lemon?
>
> Feel free to answer directly to the list:
>
> https://lists.wikimedia.org/pipermail/wikidata-tech/2016-November/001057.ht…
>
>
> Cheers,
> Denny
>
>
>
> ---------- Forwarded message ---------
> From: Daniel Kinzler <daniel.kinzler(a)wikimedia.de
> <mailto:daniel.kinzler@wikimedia.de>>
> Date: Fri, Nov 11, 2016 at 9:03 AM
> Subject: [Wikidata-tech] Two questions about Lexeme Modeling
> To: wikidata-tech <wikidata-tech(a)lists.wikimedia.org
> <mailto:wikidata-tech@lists.wikimedia.org>>
--
Prof. Dr. Philipp Cimiano
AG Semantic Computing
Exzellenzcluster für Cognitive Interaction Technology (CITEC)
Universität Bielefeld
Tel: +49 521 106 12249
Fax: +49 521 106 6560
Mail: cimiano(a)cit-ec.uni-bielefeld.de
Office CITEC-2.307
Universitätsstr. 21-25
33615 Bielefeld, NRW
Germany
Hi,
I am not questioning or criticizing, just curious: why was it decided to
implement lemmas as terms? I guess it is for code reuse, but I just
wanted to ask.
Cheers,
Denny
Hi all,
I set up a Vagrant environment to do some hacking on Wikibase, but the
wikidata role seems to create a client only. Is there a role for the server?
Cheers,
Denny
Hi all -
We've been using a locally installed wikidata stand-alone service
(https://www.mediawiki.org/wiki/Wikidata_query_service/User_Manual#Standalon…)
for several months now. Recently the service went down for a significant
amount of time, and when we ran runUpdate.sh -n wdq, instead of catching
up to real time as it usually does, the update process lagged, failing
even to keep parity with real time.
Example output from the log:
09:30:39.805 [main] INFO org.wikidata.query.rdf.tool.Update - Polled up
to 2016-10-24T23:01:05Z at (0.0, 0.0, 0.0) updates per second and
(271.8, 56.2, 18.8) milliseconds per second
This is normal when starting the update of course, but the system never
seems to find its feet, and continues to stumble and lag. Restarting
both the blazegraph process and the update process has no lasting effect.
From time to time, a message like this will appear:
INFO org.wikidata.query.rdf.tool.RdfRepository - HTTP request failed:
org.apache.http.NoHttpResponseException: wikidata.cb.ntent.com:9999
failed to respond, retrying in 2175 ms.
I have experienced this effect in the past, and had success replacing an
old journal (the product of a long update process) with a new journal
rebuilt from the latest dump. This time, that strategy did not work. I
also tried rebuilding from the latest git pull from origin and
regenerating the journal, again with no effect.
This problem started about 3 days ago, and we're now polling up to a
point in time 18 hours earlier than real time.
I would appreciate any guidance.
Also: is this an appropriate list to write to with such problems? Are
there more appropriate places?
Thanks,
Eric Scott
Hi all!
This is an announcement for a breaking change to the Wikidata API, JSON and RDF
binding, to go live on 2016-11-15. It affects all clients that process quantity
values.
As Lydia explained in the mail she just sent to the Wikidata list, we have been
working on improving our handling of quantity values. In particular, we are
making upper and lower bounds optional: when the uncertainty of a quantity
measurement is not explicitly known, we no longer require the bounds to be
specified somehow anyway; they may simply be omitted.
This means that the upperBound and lowerBound fields of quantity values become
optional in all API input and output, as well as the JSON dumps and the RDF mapping.
Clients that import quantities should now omit the bounds if they do not have
explicit information on the uncertainty of a quantity value.
Clients that process quantity values must be prepared to process such values
without any upper and lower bound set.
That is, instead of this:

    "datavalue": {
        "value": {
            "amount": "+700",
            "unit": "1",
            "upperBound": "+710",
            "lowerBound": "+690"
        },
        "type": "quantity"
    },

clients may now also encounter this:

    "datavalue": {
        "value": {
            "amount": "+700",
            "unit": "1"
        },
        "type": "quantity"
    },
The intended semantics is that the uncertainty is unspecified if no bounds are
present in the XML, JSON, or RDF representation. If they are given, the
interpretation is as before.
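In client code, handling both shapes amounts to treating the bound fields as
optional. A minimal sketch in Python (parse_quantity is an illustrative helper,
not part of any Wikibase client library):

```python
def parse_quantity(datavalue):
    """Parse a Wikibase quantity datavalue; bounds may be absent.

    Returns (amount, lower, upper), where lower/upper are None when
    the uncertainty is unspecified.
    """
    value = datavalue["value"]
    amount = float(value["amount"])
    # After 2016-11-15, upperBound and lowerBound are optional.
    lower = float(value["lowerBound"]) if "lowerBound" in value else None
    upper = float(value["upperBound"]) if "upperBound" in value else None
    return amount, lower, upper

old_style = {"value": {"amount": "+700", "unit": "1",
                       "upperBound": "+710", "lowerBound": "+690"},
             "type": "quantity"}
new_style = {"value": {"amount": "+700", "unit": "1"},
             "type": "quantity"}

print(parse_quantity(old_style))  # (700.0, 690.0, 710.0)
print(parse_quantity(new_style))  # (700.0, None, None)
```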
For more information, see the JSON model documentation [1]. Note that quantity
bounds have been marked as optional in the documentation since August. The RDF
mapping spec [2] has been adjusted accordingly.
This change is scheduled for deployment on November 15.
Please let us know if you have any comments or objections.
-- daniel
[1] https://www.mediawiki.org/wiki/Wikibase/DataModel/JSON
[2] https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#Quantity
Relevant tickets:
* <https://phabricator.wikimedia.org/T115269>
Relevant patches:
* <https://gerrit.wikimedia.org/r/#/c/302248>
* <https://github.com/DataValues/Number/commit/2e126eee1c0067c6c0f35b4fae0388f…>
--
Daniel Kinzler
Senior Software Developer
Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.
Hi all -
We've been using a locally installed wikidata stand-alone service
(https://www.mediawiki.org/wiki/Wikidata_query_service/User_Manual#Standalon…)
for several months now. We're becoming increasingly plagued by
performance issues, and are wondering if one approach to the problem
might be to adopt the Blazegraph multi-GPU architecture
(https://www.blazegraph.com/product/gpu-accelerated/).
Could anyone provide guidance as to how much pain would be involved in
making such a transition?
Thanks,
Eric Scott