Hi all!
tl;dr: How to best handle the situation of an old parser cache entry not
containing all the info expected by a newly deployed version of code?
We are currently working to improve our usage of the parser cache for
Wikibase/Wikidata. For example, we are attaching additional information related to
language links to the ParserOutput, so we can use it in the skin when generating
the sidebar.
However, when we change what gets stored in the parser cache, we still need to
deal with old cache entries that do not yet have the desired information
attached. Here are a few options we have if the expected info isn't in the cached
ParserOutput:
1) ...then generate it on the fly, on every page view, until the parser cache is
purged. This seems bad, especially if generating the required info means hitting
the database.
2) ...then invalidate the parser cache for this page, and then a) just live with
this request missing a bit of output, b) generate it on the fly, or c) trigger a
self-redirect.
3) ...then generate it, attach it to the ParserOutput, and push the updated
ParserOutput object back into the cache. This seems nice, but I'm not sure how
to do that.
4) ...then force a full re-rendering and re-caching of the page, then continue.
I'm not sure how to do this cleanly.
So, the simplest solution seems to be 2, but it means that we potentially
invalidate the parser cache of *every* page on the wiki (though we will not hit
the long tail of rarely viewed pages immediately). It effectively means that any
such change requires all pages to be re-rendered eventually. Is that acceptable?
Solution 3 seems nice and surgical, just injecting the new info into the cached
object. Is there a nice and clean way to *update* a parser cache entry like
that, without re-generating it in full? Do you see any issues with this
approach? Is it worth the trouble?
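To make option 3 a bit more concrete, here is roughly what I have in mind
(untested sketch; the 'wikibase-languagelinks' key and the
generateLanguageLinkData() helper are just placeholders, and I'm assuming we
have the WikiPage and ParserOptions at hand):

  use MediaWiki\MediaWikiServices;

  // Untested sketch of option 3: take the cached ParserOutput, attach the
  // missing data, and push the updated object back into the parser cache.
  $parserCache = MediaWikiServices::getInstance()->getParserCache();
  $parserOutput = $parserCache->get( $wikiPage, $parserOptions );

  if ( $parserOutput && $parserOutput->getExtensionData( 'wikibase-languagelinks' ) === null ) {
      // Generate the missing info (this may hit the database once).
      $links = $this->generateLanguageLinkData( $wikiPage->getTitle() );

      // Attach it and write the updated ParserOutput back to the cache.
      $parserOutput->setExtensionData( 'wikibase-languagelinks', $links );
      $parserCache->save( $parserOutput, $wikiPage, $parserOptions );
  }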
Any input would be great!
Thanks,
daniel
--
Daniel Kinzler
Senior Software Developer
Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.
Hi,
The JSON-only dumps are at http://dumps.wikimedia.org/other/wikidata/. Is
this a preliminary location?
I am asking since this does not follow the directory structure
conventions of the rest of the WMF dump site, which is normally
something like
http://some-base-url/projectname/filename
The projectname here would be "wikidatawiki", but even
http://dumps.wikimedia.org/other/wikidatawiki/ would obviously not be a
good location if more projects were to export JSON.
For example, where would Commons put its JSON in the future?
Cheers,
Markus
WikibaseLib is a horrible kitchen sink, and I don't want to add more to the
mess. So I want to put the usage tracking code into sensible packages. However,
I'm a bit at a loss as to how to best split the different responsibilities into
packages. Here are some of the communication needs we have, implying which code
needs to be shared between repo and client:
The client needs to:
* load entity data
** need to share entity storage code
** but should not know about EntityContent
** and should have no write access
* look up properties by label, and look up labels of items
** need to share term storage code
** no need for write access
** no need for code for constraints checks, etc.
** should not have related maintenance scripts or schema update code
* look up data types for properties
** need to share property info storage code
** no need for write access
** should not have related maintenance scripts or schema update code
* load change details
** need to share change table storage code and value objects
** no need for write access
** no need for dispatching logic
** also should not have schema update code
* look up sitelinks by page title
** need to share link table storage code
** no need for write access
** should not have related maintenance scripts or schema update code
* update notification subscriptions
** need to share subscription storage code
** should not have related maintenance scripts or schema update code
So, there are 6 things the client and the repo both need to access. But the
write logic, or at least the maintenance logic, should not be bundled with the
leaner "read only" package. So I see 12 new packages... dependency hell.
So, what to do? Have 6 read-only packages, and stuff the maintenance logic into
a single package not used by the client? Also ugly.
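To illustrate what I mean by a lean "read only" package versus a repo-only
package, here is a tiny sketch (the interface and method names are made up,
just to show the split; the real names would of course differ):

  // Shared, client-safe package: read access only.
  interface LabelTermLookup {

      /**
       * Looks up the entities that have the given label.
       *
       * @param string $label
       * @param string $languageCode
       *
       * @return EntityId[]
       */
      public function getEntityIdsForLabel( $label, $languageCode );

  }

  // Repo-only package: write access; the related maintenance scripts
  // and schema update code would live next to this.
  interface LabelTermStore extends LabelTermLookup {

      public function saveTermsOfEntity( EntityDocument $entity );

      public function deleteTermsOfEntity( EntityId $entityId );

  }

The client would then only ever depend on the package that contains
LabelTermLookup (plus an implementation), while the repo pulls in the full
store.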
Ideas?
-- daniel
--
Daniel Kinzler
Senior Software Developer
Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.
Hi,
Some questions on the new dump options. I noticed that the XML dump
files declare exactly the same content model and format for the new data
model as they did for the old one. This is not so great, since it greatly
reduces the utility of the <model> information if the same model is used for
incompatible content. I am now trying to find a way to write code that
supports both old and new dumps. Hence my questions:
(1) The most recent full dump that is available contains the old format.
The most recent current dump that is available contains the new format.
Is it possible that a single dump contains both formats?
(2a) If the answer to (1) is no: what are/will be the first (or last)
full/current/daily dump files that use the new format?
(2b) If the answer to (1) is yes: what is the revision number at which
the change was made (i.e., what is the largest revision number that is
still in the old format)?
Many thanks,
Markus
Hi,
The new JSON dump format uses JSON maps for aliases (and many other
things). A JSON map is a thing of the form { ... }. However, if an item
has no aliases, the JSON dump sets its value to the empty *list* [].
This is bad since it makes object-model-based JSON parsers, such as
Jackson, trip up: the parser must know what type the parsed output should
have, and [] cannot be converted into a map.
It is possible that the problem occurs with other fields as well
(labels, descriptions, etc.).
There are two possible fixes:
* Correctly export empty maps as {}.
* Do not export empty maps at all (leave away the key).
The second is what happens in the API. It would be nice if the JSON
exported by the API were the same as the JSON exported in the dumps.
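For what it's worth, my guess is that this happens because PHP does not
distinguish an empty map from an empty list, so json_encode() falls back to
[]. A quick illustration of the issue and of the first fix (I have not looked
at the actual serialization code, so this is only a guess):

  // PHP cannot tell an empty map from an empty list, so an empty
  // (associative) array serializes as []:
  echo json_encode( [ 'en' => [ 'Foo', 'Bar' ] ] );  // {"en":["Foo","Bar"]}
  echo json_encode( [] );                            // []
  // Fix (1) amounts to emitting an explicit (empty) object instead:
  echo json_encode( new \stdClass() );               // {}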
If this is news to you, I can also create a bug report.
Cheers,
Markus