Dear all,
I am happy to announce the third release of Wikidata Toolkit [1], the
Java library for programming with Wikidata and Wikibase. The main new
features are:
* Full support for the (now) standard JSON format used by Wikidata
* Huge performance improvements (decompressing and parsing the whole
JSON dump now takes about 15min; was more like 80min before)
* Many new example programs for inspiration and guidance [2]
Maven users can get the library directly from Maven Central (see [1]);
this is the preferred method of installation. There is also an
all-in-one JAR at github [3] and of course the sources [4].
Version 0.3.0 is still in alpha. For the next release, we will focus on
the following tasks:
* Support a binary format for even faster random access (some of this is
done already, but not quite ready for release yet)
* A command-line tool for data processing/conversion tasks
* Support for storing and querying data
Feedback is very welcome. Developers are also invited to contribute via
github.
Cheers,
Markus
[1] https://www.mediawiki.org/wiki/Wikidata_Toolkit
[2]
https://github.com/Wikidata/Wikidata-Toolkit/tree/master/wdtk-examples
(scroll down for documentation)
[3] https://github.com/Wikidata/Wikidata-Toolkit/releases
(you'll also need to install the third-party dependencies manually when
using this)
[4] https://github.com/Wikidata/Wikidata-Toolkit/
Hello,
If you check the PHP file for SpecialUnconnectedPages (in
Wikibase/client/includes/specials/), it extends SpecialPage instead of
QueryPage. Is this intentional? And if not, do you agree it would be
better to rewrite the file to use the QueryPage class?
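For illustration, a QueryPage-based version could look roughly like the
sketch below (just my guess at the query - pages without a wikibase_item
page prop - the real logic is in the current special page):

class SpecialUnconnectedPages extends QueryPage {

	public function __construct() {
		parent::__construct( 'UnconnectedPages' );
	}

	/**
	 * QueryPage builds the SQL from this array and takes care of
	 * paging and result caching (querycache table) for us.
	 */
	public function getQueryInfo() {
		return array(
			'tables' => array( 'page', 'page_props' ),
			'fields' => array(
				'namespace' => 'page_namespace',
				'title' => 'page_title',
				'value' => 'page_id',
			),
			// pages with no wikibase_item page prop (illustrative)
			'conds' => array( 'pp_propname' => null ),
			'join_conds' => array(
				'page_props' => array(
					'LEFT JOIN',
					array( 'pp_page = page_id', "pp_propname = 'wikibase_item'" )
				),
			),
		);
	}

	public function formatResult( $skin, $result ) {
		$title = Title::makeTitle( $result->namespace, $result->title );
		return Linker::link( $title );
	}
}

That way we would get result caching and paging for free.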
Best
--
Amir
(+cc Nemo and Wikidata-tech)
On Fri, Sep 26, 2014 at 5:33 AM, Gergo Tisza <gtisza(a)wikimedia.org> wrote:
> On Fri, Sep 26, 2014 at 2:49 AM, Rob Lanphier <robla(a)wikimedia.org> wrote:
>
>> There's an item that Luis Villa added to the MW Core backlog that I'd
>> like to move to the Multimedia backlog:
>>
>> https://www.mediawiki.org/wiki/Wikimedia_MediaWiki_Core_Team/Backlog#Struct…
>>
>> I'm assuming everything that he describes fits nicely into what is
>> planned for Structured Data. Assuming that's true, should I just
>> copy/paste into a new card in Mingle, or a new page on mw.org or what?
>>
>
> This seems to be about article text, or mainly about article text
> (articles imported from other wikis and so on).
>
> The plan for the structured data project is to create Wikidata properties
> for legalese, install Wikibase on Commons (and possibly other wikis which
> have local images), make that Wikibase use Wikidata properties (and
> sometimes Wikidata items as values), create a new entity type called
> mediainfo (which is like a Wikibase item, but associated with a file), and
> add legal information to the mediainfo entries.
>
> Part of that (the Wikidata properties) could be reused for articles and
> other non-file content - the source, license etc. properties are generic
> enough. However, if we want to use this structure to attribute files, we
> would either have to make mediainfo into some more generic thing that can
> be attached to any wiki page, or abuse the langlink/badge feature to serve
> a similar purpose. That is a major course correction; if we want to do
> something like that, that should be discussed (with the involvement of the
> Wikidata team) as soon as possible.
>
Thanks for the analysis, Gergo! I was going to split Luis's proposal into a
separate wiki page, but I see Nemo has linked to this page as the "Canonical
page on the topic":
https://www.mediawiki.org/wiki/Files_and_licenses_concept
Without a deep reading that I'm admittedly just not going to have time for,
it's hard to tell how related the page that Nemo linked to is to the
concepts that Luis is trying to capture. Could someone (Nemo? Luis?) merge
Luis's requirements into the "canonical page" to Luis's satisfaction, so I
can delete most of the information from our backlog? I'll keep the item on
the MW Core backlog, since I don't know where else to put it, but it's
probably going to be relatively low priority for that team.
Multimedia team and Wikidata team, could you make sure you're considering
the requirements that Luis brought up as you build your solution? Even if
you decide to punt on some of the things that aren't strictly necessary for
files, it's still good to make sure you don't paint us into a corner
if/when we do try to do something more sophisticated for articles.
One thing I'll note, though: before we get too complacent in thinking that
files are somehow simpler than articles, we should consider these
relatively common scenarios:
* Group photo with potentially different per-person personality rights
* PDF of a slide deck with many images
* PDF of a Wikipedia article :-)
Rob
Hey all,
I've made a little video demonstrating how you can get a clone of Wikibase
DataModel, set it up, and run its tests. It's only 30 seconds long, yet
includes all you need to know. https://asciinema.org/a/12530
Cheers
--
Jeroen De Dauw - http://www.bn2vs.com
Software craftsmanship advocate
Evil software architect at Wikimedia Germany
~=[,,_,,]:3
We are now on MediaWiki 1.25 deployment branches of core. For Wikibase /
Wikidata branches, we are using a new naming scheme:
Instead of "mw1.25-wmf1" as in the past, we now have "wmf/1.25wmf1" which
is the same naming scheme as used by core and other extensions.
This allows Jenkins to test our branches against the corresponding version
of core, if available, instead of testing against master. Thanks to hashar
for making this possible. :)
Cheers,
Katie
--
Katie Filbert
Wikidata Developer
Wikimedia Germany e.V. | Tempelhofer Ufer 23-24, 10963 Berlin
Phone (030) 219 158 26-0
http://wikimedia.de
Wikimedia Germany - Society for the Promotion of Free Knowledge e.V. Entered
in the register of associations at Amtsgericht Berlin-Charlottenburg under
number 23 855, recognized as charitable by the tax office for corporations I
Berlin, tax number 27/681/51985.
Hi. The call with the WMF today turned up a pretty big issue.
WikibaseQueryEngine (WQE) depends on two big 3rd party libraries:
* Symfony (used for implementing a command line client for the query engine,
which we use to generate database tables)
* doctrine/dbal (a database abstraction layer we use to run queries and create
database tables)
In order to deploy these, the WMF would require a line-by-line review of
something around 50 thousand lines of code (or maybe it was 30 thousand, or
80 thousand - a lot, in any case). This is not feasible.
We (Katie, Jeroen, Chris, Nik, me, etc) have come up with a plan to get rid of
these dependencies:
1) Split the command line interface into a separate component (WQE-CLI perhaps)
that would not be deployed. Symfony is then out of the picture.
2) Go back to using MediaWiki's DB abstraction for running queries. This should
be easy.
3) For generating DB tables (aka schema creation), we create a separate
component (WQE schema generator or something) that would use dbal to generate
static SQL files for the supported DB systems (most importantly, MySQL and
SQLite); see the sketch below. This would be part of our build step, but the
code relying on dbal would not be deployed. Only the generated SQL files would
be used for deployment (either through the update script, or, in the case of
WMF, manually).
In the end, WQE could use either MW or dbal for running queries, so we could
deploy it without dbal on the WMF cluster, but it could also be used outside
MediaWiki.
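To make step 3 concrete, the schema generator could use dbal's schema API
roughly like this (just a sketch; the table and columns shown are made up,
not the actual WQE schema):

use Doctrine\DBAL\Schema\Schema;
use Doctrine\DBAL\Platforms\MySqlPlatform;
use Doctrine\DBAL\Platforms\SqlitePlatform;

// Define the schema once, using dbal's abstract representation.
// The table and columns here are illustrative only.
$schema = new Schema();
$table = $schema->createTable( 'wqe_claims' );
$table->addColumn( 'claim_id', 'string', array( 'length' => 32 ) );
$table->addColumn( 'property_id', 'integer' );
$table->addIndex( array( 'property_id' ) );

// Emit one static .sql file per supported platform. Only these
// generated files would be deployed; dbal itself stays a
// build-time dependency.
$platforms = array(
	'mysql' => new MySqlPlatform(),
	'sqlite' => new SqlitePlatform(),
);

foreach ( $platforms as $name => $platform ) {
	$sql = implode( ";\n", $schema->toSql( $platform ) ) . ";\n";
	file_put_contents( "sql/WikibaseQueryEngine.$name.sql", $sql );
}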
If there are no objections, I propose to take this on for the next sprint.
Perhaps we can start to split this into tasks on Bugzilla already.
Cheers,
Daniel
--
Daniel Kinzler
Senior Software Developer
Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.
Hi all!
The Wikibase team would like to allow data from any item to be used on any
client page. To do this, we need to track which item is being used where, so we
can purge the appropriate pages when the item changes. We would like
people with database experience to look at our proposal and let us know about
any concerns, especially wrt performance.
Here is a proposal for two database tables for tracking the usage of
entities across wikis:
https://gerrit.wikimedia.org/r/#/c/158078/9/usagetracking/includes/Usage/Sq…
https://gerrit.wikimedia.org/r/#/c/158078/9/subscription/includes/Subscript…
The "entity_usage" table would be on every client, recording wich entity is used
on which page (kind of like the iwlinks table). The "entity_per_client" table
would be on the repo, and track which wiki ("client") is interested in changes
to which entity.
Please have a look and let me know if you have any questions or suggestions,
especially with regards to the following use cases:
The following would happen when editing/re-parsing a page on a client wiki
(e.g. Wikipedia):
* get all entities used on a given page from entity_usage
* delete rows based on a page id and a list of entity ids from entity_usage
* insert rows for a page / entity pair into entity_usage
* query rows for a set of entities from entity_usage (with no page id specified).
* add rows for a set of (newly used) entities to the entity_per_client table
* remove rows for a set of (no longer used) entities from the entity_per_client table
The following would happen when dispatching a change from wikibase:
* looking up interested wikis for a list of entities from the entity_per_client
table.
* (notification via the job queue)
* looking up pages to be purged/updated based on a list of entity ids (and
possibly an aspect id) in the entity_usage table.
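To make these operations concrete, here is roughly what they could look
like through MediaWiki's DB abstraction layer (a sketch only; the eu_* and
epc_* column names are my shorthand, the actual schema is in the gerrit
change above):

// Client side: get all entities used on a given page.
$dbr = wfGetDB( DB_SLAVE );
$res = $dbr->select(
	'entity_usage',
	array( 'eu_entity_id', 'eu_aspect' ),
	array( 'eu_page_id' => $pageId ),
	__METHOD__
);

// Delete rows for entities no longer used on the page.
$dbw = wfGetDB( DB_MASTER );
$dbw->delete(
	'entity_usage',
	array(
		'eu_page_id' => $pageId,
		'eu_entity_id' => $removedEntityIds,
	),
	__METHOD__
);

// Insert rows for newly used entities.
$rows = array();
foreach ( $addedEntityIds as $entityId ) {
	$rows[] = array(
		'eu_page_id' => $pageId,
		'eu_entity_id' => $entityId,
	);
}
$dbw->insert( 'entity_usage', $rows, __METHOD__ );

// Repo side, when dispatching a change: look up which wikis are
// interested in the affected entities.
$repoDbr = wfGetDB( DB_SLAVE );
$res = $repoDbr->select(
	'entity_per_client',
	array( 'epc_client_id' ),
	array( 'epc_entity_id' => $entityIds ),
	__METHOD__
);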
-- daniel
--
Daniel Kinzler
Senior Software Developer
Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.
Hi Katie, hi Rob!
Thanks for your swift replies.
Am 10.09.2014 04:59, schrieb Katie Filbert:
> On Tue, Sep 9, 2014 at 7:25 PM, Daniel Kinzler <daniel.kinzler(a)wikimedia.de> wrote:
...
> So, there are two things we need to know in order to make an informed decision:
>
> 1) can we use the Ubuntu LTS packages for symfony and dbal?
>
>
> which versions are they?
php-doctrine-dbal (2.3.4-1)
php-symfony-console (2.3.1+dfsg-1)
Both are "universe" packages, no not "officially" supported by Ubuntu. But so is
php5-memcached and other stuff we use.
> are they hhvm-compatible?
>
> http://hhvm.h4cc.de/package/doctrine/dbal
The hhvm status page for dbal lists everything before 2.5 as "not tested", so we
just don't know whether it would work with hhvm. Worth a try, I suppose.
I can't find anything about hhvm compatibility on symfony.com, but I found some
blog posts from 2013 discussing symfony performance on hhvm, so that should work
fine.
Am 10.09.2014 07:34, schrieb Rob Lanphier:
...
> I'm kinda regretting bringing up the apt repo case, because it's basically
> a loophole in our review strategy, not something I'm sure we want to
> encourage or something I believe will get a great deal of traction.
> There's an outside chance that people shrug and say "sure, why not?",
> but I don't think it's going to be generally attractive.
I wouldn't call that a loophole, but rather an inconsistency. Nobody is going to
review libpcre, or the mysql client library, or Lucene. I find it confusing that
anything written in PHP apparently needs full review, while we just trust stuff
written in C or Java.
I'm not saying we should deploy just any 3rd party code without review. I'm
saying we should have better criteria than the language the library is written
in. One criterion could be the size of the install base - many eyes. But then,
Heartbleed happened...
>> 1) can we use the Ubuntu LTS packages for symfony and dbal?
>
> Per above, probably not likely.
But using, say, php5-curl or php5-oauth is fine?...
I don't want to whine and complain about the deployment/review policy; I'm
trying to find out whether there is a policy, and if so, what it is and what
rationale it is based on.
>> 2) when is 14.04 going to be rolled out?
>
> Concurrently with the HHVM upgrade. In the coming weeks. We don't
> have a set end date for when all machines will be converted, but we
> should be well underway by the end of this month. There's probably
> going to be a stubborn service or two that will stick around quite a
> bit longer than that, though.
Sadly, Wikibase seems to have some compat issues with hhvm. At least that would
be an explanation for the breakage on beta. We currently blame it on class
aliases, but nobody really knows. It seems kind of random...
>> Who can answer these questions? How do we poke TechOps?
>
> The Wikimedia Operations list is a good place to ask about this type of thing.
Yay, yet another mailing list!
Thanks,
Daniel
--
Daniel Kinzler
Senior Software Developer
Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.
Hoi,
The latest statistics show that we may finally say that "only 50% of our
items have zero, one or two statements". No longer can we say that more than
50% of our items have zero or one statement.
THAT is in my opinion a great indication of our progress in bringing data
to Wikidata. As a consequence we become more informative and useful.
It is important to know that the first statements are in many ways the most
difficult but also the most important. They often include the "instance of"
or "subclass of" statements.
Congratulations !!!
Thanks,
GerardM
Hi all!
tl;dr: How to best handle the situation of an old parser cache entry not
containing all the info expected by a newly deployed version of code?
We are currently working to improve our usage of the parser cache for
Wikibase/Wikidata. E.g., we are attaching additional information related to
language links to the ParserOutput, so we can use it in the skin when generating
the sidebar.
However, when we change what gets stored in the parser cache, we still need to
deal with old cache entries that do not yet have the desired information
attached. Here are a few options we have if the expected info isn't in the cached
ParserOutput:
1) ...then generate it on the fly. On every page view, until the parser cache is
purged. This seems bad especially if generating the required info means hitting
the database.
2) ...then invalidate the parser cache for this page, and then a) just live with
this request missing a bit of output, b) generate it on the fly, or c) trigger a
self-redirect.
3) ...then generate it, attach it to the ParserOutput, and push the updated
ParserOutput object back into the cache. This seems nice, but I'm not sure how
to do that.
4) ...then force a full re-rendering and re-caching of the page, then continue.
I'm not sure how to do this cleanly.
So, the simplest solution seems to be 2, but it means that we potentially
invalidate the parser cache of *every* page on the wiki (though we will not hit the
long tail of rarely viewed pages immediately). It effectively means that any
such change requires all pages to be re-rendered eventually. Is that acceptable?
Solution 3 seems nice and surgical, just injecting the new info into the cached
object. Is there a nice and clean way to *update* a parser cache entry like
that, without re-generating it in full? Do you see any issues with this
approach? Is it worth the trouble?
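For reference, here is roughly what I imagine option 3 to look like, using
ParserOutput's extension data and re-saving via the ParserCache. The
'wikibase-languagelinks' key and buildLanguageLinkData() are just
placeholders, and whether re-saving like this is actually safe is exactly
what I'm asking:

$parserCache = ParserCache::singleton();
$parserOutput = $parserCache->get( $wikiPage, $parserOptions );

if ( $parserOutput
	&& $parserOutput->getExtensionData( 'wikibase-languagelinks' ) === null
) {
	// Old cache entry: generate the missing data once, on the fly...
	$links = $this->buildLanguageLinkData( $wikiPage->getTitle() );
	$parserOutput->setExtensionData( 'wikibase-languagelinks', $links );

	// ...and write the updated object back, so later page views
	// find the data in the cache.
	$parserCache->save( $parserOutput, $wikiPage, $parserOptions );
}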
Any input would be great!
Thanks,
daniel
--
Daniel Kinzler
Senior Software Developer
Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.