Hi everyone,
In the current deployment, we made a breaking change to
wikibase.RepoApi. We will do another breaking change to it in the next
deployment. If you use RepoApi on wikidata.org or any Wikidata client
wiki, please read these instructions on how to migrate your code.
== Use case 1: RepoApi.get, RepoApi.post ==
You exclusively use RepoApi.get or RepoApi.post.
In that case you can, and now have to, use a mw.Api instance directly.
If you are on wikidata.org, just use the following:
> var api = new mw.Api();
> // api.get( { … } );
> // api.post( { … } );
If you are on a Wikidata client wiki, use the new
wikibase.client.getMwApiForRepo:
> mw.loader.using( [ 'wikibase.client.getMwApiForRepo' ] ).done( function() {
> var api = wikibase.client.getMwApiForRepo();
> // api.get( { … } );
> // api.post( { … } );
> } );
In this case, I would encourage you to look into the features RepoApi
provides. Chances are RepoApi could build your API call for you.
== Use case 2: RepoApi's wrapper methods ==
You use one of RepoApi.createEntity, RepoApi.editEntity,
RepoApi.formatValue, RepoApi.getEntities, RepoApi.getEntitiesByPage,
RepoApi.parseValue, RepoApi.searchEntities,
RepoApi.setLabel, RepoApi.setDescription, RepoApi.setAliases,
RepoApi.setClaim, RepoApi.createClaim, RepoApi.removeClaim,
RepoApi.getClaims, RepoApi.setClaimValue, RepoApi.setReference,
RepoApi.removeReferences, RepoApi.setSitelink, or RepoApi.mergeItems.
You should continue using RepoApi, but you now have to provide it with a
mw.Api instance.
If you are on wikidata.org, just use the following:
> mw.loader.using( [ 'wikibase.RepoApi' ] ).done( function() {
> var mwApi = new mw.Api();
> var repoApi = new wikibase.RepoApi( mwApi );
> // mwApi.get( { … } );
> // mwApi.post( { … } );
> // repoApi.setClaim( … );
> } );
If you are on a Wikidata client wiki, use the new
wikibase.client.getMwApiForRepo:
> mw.loader.using( [ 'wikibase.client.getMwApiForRepo' ] ).done( function() {
> var mwApi = wikibase.client.getMwApiForRepo();
> var repoApi = new wikibase.RepoApi( mwApi );
> // mwApi.get( { … } );
> // mwApi.post( { … } );
> // repoApi.setClaim( … );
> } );
If you have any questions, feel free to contact me.
Bye,
Adrian
Hi all,
In the dump file
wikidatawiki-20140912-pages-articles.xml.bz2
I seem to find some items with the key "description" and some with
"descriptions".
For example, near the beginning of the file:
Q15 seems to have key "description"
Q17 seems to have key "descriptions"
This is rather unhelpful when running e.g. my stats script.
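For now, code reading the dump has to guard against both variants,
roughly like this (a PHP sketch for illustration; $entityJson is assumed
to hold the JSON text of one entity):
> $item = json_decode( $entityJson, true );
> // Accept both key variants until the dump settles on one:
> $descriptions = array();
> if ( isset( $item['descriptions'] ) ) {
>     $descriptions = $item['descriptions'];
> } elseif ( isset( $item['description'] ) ) {
>     $descriptions = $item['description'];
> }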
a) Can someone please confirm that I'm not crazy? I mean, in this instance.
b) Is this a bug, or a feature?
c) If a bug, is it already fixed for the next dump? Which key will it be?
(If a feature: why?)
Thanks,
Magnus
Hey,
I just noticed this commit [0], which gets rid of a pile of direct
BasicEntityIdParser usages for performance reasons.
It also addresses a problem which is apparently more widespread than I
thought: using an EntityIdParser that only works for Item IDs and Property
IDs. Unless you are only dealing with Item IDs or with Property IDs, an
EntityIdParser should always be injected. This allows the thing
constructing the object graph to add support for all required entity types,
and gives extensions to Wikibase Repo (or Wikibase Client) a chance to
register new entity types.
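To illustrate the pattern (a rough sketch with made-up class and method
names, not actual Wikibase code, apart from the EntityIdParser
interface):
> use Wikibase\DataModel\Entity\EntityIdParser;
>
> class ExampleIdLabeler {
>
>     private $idParser;
>
>     // Inject the parser instead of newing up a BasicEntityIdParser here,
>     // so the code wiring the object graph decides which entity types
>     // are supported (including ones registered by extensions).
>     public function __construct( EntityIdParser $idParser ) {
>         $this->idParser = $idParser;
>     }
>
>     public function describe( $serializedId ) {
>         $id = $this->idParser->parse( $serializedId );
>         return $id->getEntityType() . ': ' . $id->getSerialization();
>     }
>
> }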
Of course this also means that no new code that introduces such occurrences
should be allowed through review, even if it contains a "fix this later"
TODO (for new code there is no excuse to do it wrong).
[0] https://gerrit.wikimedia.org/r/#/c/167136/
Cheers
--
Jeroen De Dauw - http://www.bn2vs.com
Software craftsmanship advocate
Evil software architect at Wikimedia Germany
~=[,,_,,]:3
Hey,
I was wondering if we still use PHP serialization in our change
replication mechanism. (We need to be very careful when making changes
to the objects in WB DM if that is the case.) Looking at the code, I
discovered we have a changesAsJson setting, presumably introduced to
migrate away from the PHP serialization. Has that migration happened?
Can we get rid of the setting and the old PHP serialize code?
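(To illustrate the concern with a made-up sketch, not actual Wikibase
code: serialize() bakes class and property names into the payload, so
refactoring WB DM can break unserialization of already-queued changes,
while a JSON payload only fixes the keys we explicitly chose.)
> class SiteLink {
>     public $siteId = 'enwiki';
>     public $pageName = 'Germany';
> }
>
> echo serialize( new SiteLink() );
> // O:8:"SiteLink":2:{s:6:"siteId";s:6:"enwiki";s:8:"pageName";s:7:"Germany";}
> // Rename the class and unserialize() yields a __PHP_Incomplete_Class;
> // rename a property and its data silently lands on the old, unknown name.
>
> echo json_encode( array( 'site' => 'enwiki', 'title' => 'Germany' ) );
> // {"site":"enwiki","title":"Germany"} - decoupled from class internals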
Cheers
--
Jeroen De Dauw - http://www.bn2vs.com
Software craftsmanship advocate
Evil software architect at Wikimedia Germany
~=[,,_,,]:3
Hey,
In the current JSON, statements are grouped by property ("statement
groups") and the statements in each group are in a list (ordered). The
statement groups, however, are in a map (unordered) and no extra order
information is given [1].
Should we consider this the official definition? When we defined the
original datamodel, statements were ordered (and there were no groups by
property), but now we have the groups, and abandoning statement-group
order makes sense for Reasonator-style UIs that find a good order
automatically. However, this should then also match the internal
datamodel implementation as used, e.g., for diffs (if a bot edits an
item and statement order is not part of the data, it might randomly
reorder statement groups).
Tools that work on the JSON as it is now cannot guarantee any order of
statement groups. In Java, the order you get can actually differ from
run to run. So we should change our interfaces in WDTK from List to Set
(otherwise our tests fail as soon as we read more than one statement).
But before doing this, I would like to know if this is really the
future, or if there are plans to keep the order of statement groups and
to extend the JSON instead.
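To illustrate (a small made-up PHP fragment, not WDTK code): once the
claims map is decoded, nothing in the data fixes an order between the
groups, so whatever order a consumer sees is an artifact of its parser.
> $json = '{"claims":{"P31":[{"id":"Q1$a"}],"P17":[{"id":"Q1$b"}]}}';
> $item = json_decode( $json, true );
>
> // PHP happens to preserve the textual order of the keys (P31, P17),
> // but JSON object members are unordered by spec, so a Java Map (or
> // any other parser) may hand back the groups in a different order.
> foreach ( $item['claims'] as $propertyId => $statements ) {
>     echo $propertyId, ': ', count( $statements ), " statement(s)\n";
> }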
Cheers,
Markus
[1] The current JSON has the following ordered elements:
* statements in one statement group (the collection of all statements
with the same property) are stored as a list (ordered)
* statement qualifiers are a map (no order), but extra order information
is given ("qualifiers-order")
* the "list of references" of a statement is indeed a list (ordered)
* the snaks of each reference are a map (no order), but extra order
information is given ("snaks-order")
* aliases of each language are stored in a list (ordered)
What is not ordered:
* statement groups are stored in a map (from property ids to lists of
statements with this property); no order is expressed between statements
with different main properties.
* labels, descriptions, alias lists (map from language to content)
--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/
Hi Everyone,
Below are some chat logs between Daniel and me.
<hoo> aude: Around?
<DanielK_WMDE__> hoo: aude is having dinner with the multimedia team
<hoo> I feared that :S
<hoo> DanielK_WMDE__: Did you talk about the Q183 issues today?
* hoo came home late today...
<DanielK_WMDE__> no.
<DanielK_WMDE__> hoo: actually - Thiemo was doing some benchmarking
earlier, together with aude. Might have been that
<hoo> I see... but the issue is (sadly) more complex
<DanielK_WMDE__> regular work is pretty much zero right now though. we
have been in sessions with the multimedia folks all day
<hoo> some revisions of that item segfault, some fatal and some throw an
exception
<DanielK_WMDE__> bah
<DanielK_WMDE__> this happens since the switch to DataModel 1.0, right?
<hoo> That's a good question, actually
<DanielK_WMDE__> sigh
<hoo> not sure why shit hit the fan exactly now and not earlier
<DanielK_WMDE__> JeroenDeDauw: care to look into that?
<DanielK_WMDE__> hoo: can you collect your findings somewhere, and mail
a link to wikidata-tech?
<hoo> DanielK_WMDE__: Well, everything is bugzillad
<hoo> just search for Q183
<hoo> might be that it's spread across various products, though
(Wikimedia and Wikidata repo)
<hoo> This problem has so many parts :S
<hoo> But the one that old revisions sometimes can't be viewed/
unserialized(?) is the most minor one IMO
* James_F|Away is now known as James_F
<DanielK_WMDE__> but it seems like all of them are related to the change
in the DataModel and/or the serialization format
<DanielK_WMDE__> hoo: my feeling is that we should (partially) go back
to deferred unstubbing: have stub implementations of StatementList, etc.,
that would only instantiate the full structure when needed.
<DanielK_WMDE__> JeroenDeDauw: what do you think?
<hoo> Probably, yes
[...]
I'm posting them here in order to coordinate the efforts to solve this a
bit. So if you investigate anything, have any findings, or are working
on something, please let the others (and especially me) know.
Right now Tim is trying to help with the segfaults, but that's only one
of the problems we are seeing here.
Also:
<TimStarling> "value":"\u0b9a\u0bc6\u0bb0\u0bc1\u0bae\u0ba9\u0bbf"
<TimStarling> very efficient encoding there
<TimStarling> well, using UTF-8 only reduces it from 736163 to 721591
<ori> it's JSON, though the JSON specification doesn't require that you
encode code points outside the ASCII range; it simply allows it
<ori>
http://php.net/manual/en/json.constants.php#constant.json-unescaped-unicode
<ori> I think FormatJson::encode() supports a similar option, and it's
compatible with PHP 5.3
<TimStarling> yes, that's what I just tried, reduces oldid 158433886
from 736163 to 721591 bytes
<TimStarling> I haven't reproduced that crash in eval.php yet
<ori> hoo: FormatJson::encode( $value, /* $pretty = */ false,
FormatJson::UTF8_OK );
I think we should do this, provided all major JSON implementations
support it (which I guess they do). But, as Tim points out, this alone
is not going to help very much.
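For reference, the difference in plain PHP (JSON_UNESCAPED_UNICODE needs
PHP 5.4; FormatJson::UTF8_OK is, as ori says, the MediaWiki wrapper that
also works on 5.3):
> $value = 'செருமனி'; // the Tamil label from the log above
> echo json_encode( $value );
> // "\u0b9a\u0bc6\u0bb0\u0bc1\u0bae\u0ba9\u0bbf"
> echo json_encode( $value, JSON_UNESCAPED_UNICODE );
> // "செருமனி" - raw UTF-8, about half the bytes for this string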
Cheers,
Marius
Hi,
As some of you may have noticed, we recently had major trouble[0] with
Wikidata's Q183 (the item about Germany). The immediate cause of these
problems was the sheer size of that item.
Below I'm going to explain what led to the problems and how I was able
to work around them for now.
As can be seen in the item's edit history[1], the item hadn't been
edited between early July and the point when HHVM was enabled on
Wikidata. I can't tell for sure, but I presume the item had been
uneditable since July because of its size.
While the item's size presumably prevented editing, viewing it and
using it in the client never was a noticeable problem, at least not on
a large scale.
After HHVM was enabled on Wikidata it became possible to change the
item again, because HHVM is much faster than Zend PHP. This was done
once[1] and led to the item internally being converted to our new
serialization format (this can be seen in the huge size change of that
small content change). The item then became a problem for the slower
Zend PHP (in both the client and the repo) because it was much larger
than the old version.
As explained above, the item's size led to out-of-memory errors, PHP
segfaults, exceptions, and various other problems, causing major
disruption on both Wikidata and the Wikipedias. Because of the impact,
I was forced to find a way around the immediate problems in a timely
manner.
I started mitigating the problems by simply undoing the change to the
new serialization format, restoring the version from July 8[1]. To my
own surprise even that version didn't render, probably because our
DataModel is less efficient now than it used to be.
After that I pulled an older (and thus smaller) revision[2] from the
edit history which I could view (with oldid=120566337). But sadly this
revision, although much smaller than what we used to have, didn't
always render either (it may have worked sometimes and probably caused
a little less trouble in other components).
In the end I was forced to revert to revision 116786096[3], which has
far less data than the item used to hold. Using that revision, the item
can be rendered again and can also be used by other components again
(e.g. in the Wikipedias).
As any change to the item would again trigger a conversion to the new
serialization format (which would also make the stored data much
bigger), Marc-André Pelletier (coren) and I decided that the item needs
to be frozen in a working state, so that it can't be altered (that also
makes sure we can go back to the version with the most data once this
has been solved). Because of that we chose to superprotect the item to
make sure it can't be edited by anyone[4].
The above steps work around the problems caused in all components for
now, but of course they are only a temporary countermeasure and we will
need to fix this properly.
Cheers,
Marius
[0]: https://bugzilla.wikimedia.org/show_bug.cgi?id=71519 and others
[1]: https://www.wikidata.org/wiki/Q183?action=history
[2]: https://www.wikidata.org/w/index.php?title=Q183&oldid=120566337
[3]: https://www.wikidata.org/w/index.php?title=Q183&oldid=116786096
[4]:
https://www.wikidata.org/w/index.php?oldid=162225140#Temporary_protection_o…
We currently use memcached to share cached objects across wikis, most
importantly, entity objects (like data items). Ori suggested we should look into
alternatives. This is what he wrote:
[21:15] <ori> I was wondering if you think the way you use memcached is optimal
(this sounds like a loaded question but I mean it sincerely). And if not, I was
going to propose that you identify an optimal distributed object store, and I
was also going to offer to help push for procurement and deployment of such a
service on the WMF cluster.
[21:17] <ori> memcached is a bit of a black box. it is very difficult to get
comprehensible metrics about how much space and bandwidth you're utilizing,
especially when your data is mixed up with everything else that goes into memcached
[21:18] <ori> and the fact that you're serializing objects using php serialize()
rather than simple values makes it even harder, because it means that you can
only really poke around from php with wikidata code available
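To make that last point concrete (a made-up sketch using the PECL
memcached client, not our actual caching code):
> $cache = new Memcached();
> $cache->addServer( '127.0.0.1', 11211 );
>
> $entity = array( 'id' => 'Q64', 'labels' => array( 'en' => 'Berlin' ) );
>
> // serialize(): the raw cache value embeds PHP class/property names,
> // so only PHP code with the matching classes loaded can decode it.
> $cache->set( 'wikibase:entity:Q64', serialize( (object)$entity ) );
>
> // A neutral encoding: any tool that can talk to memcached can
> // inspect the value.
> $cache->set( 'wikibase:entity:Q64', json_encode( $entity ) );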
Just food for thought, for now... any suggestions for a shared object store?
In any case, thanks for looking into this, Ori!
--
Daniel Kinzler
Senior Software Developer
Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.
Hey all,
I'm happy to announce the 1.1 release of Wikibase DataModel. This
release contains only new features:
* The `Property` constructor now accepts an optional `StatementList`
parameter
* Added `Property::getStatements` and `Property::setStatements`
* Added `PropertyIdProvider` interface
* Added `ByPropertyIdGrouper`
* Added `BestStatementsFinder`
* Added `EntityPatcher` and `EntityPatcherStrategy`
* Added `StatementList::getAllSnaks` to use instead of `Entity::getAllSnaks`
* The `Statement` constructor now also accepts a `Claim` parameter
* Added `Statement::setClaim`
* The `Reference` constructor now accepts a `Snak` array
* Added `ReferenceList::addNewReference`
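For instance, statements on properties can now be handled much like on
items (a quick sketch; consult the release notes linked below for the
exact signatures):
> use Wikibase\DataModel\Entity\Property;
> use Wikibase\DataModel\Entity\PropertyId;
> use Wikibase\DataModel\Snak\PropertyNoValueSnak;
> use Wikibase\DataModel\Statement\StatementList;
>
> $property = Property::newFromType( 'string' );
>
> // New in 1.1: properties hold a StatementList, like items do.
> $statements = new StatementList();
> $statements->addNewStatement( new PropertyNoValueSnak( new PropertyId( 'P42' ) ) );
> $property->setStatements( $statements );
>
> echo $property->getStatements()->count();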
Internal changes were also made to improve quality; this is reflected,
for instance, here:
https://scrutinizer-ci.com/g/wmde/WikibaseDataModel/reports/
More information on Wikibase DataModel, including installation instructions
and release notes, can be found at https://github.com/wmde/WikibaseDataModel
Cheers
--
Jeroen De Dauw - http://www.bn2vs.com
Software craftsmanship advocate
Evil software architect at Wikimedia Germany
~=[,,_,,]:3