Wikidata-tech December 2014

wikidata-tech@lists.wikimedia.org

13 participants
7 discussions

Re: [Wikidata-tech] claim IDs in wikidata
by Stas Malyshev 22 Dec '14

Hi!

> The best place for this kind of question would be the wikidata-tech mailing list
> <wikidata-tech(a)lists.wikimedia.org>. It would probably be a good idea if you
> (and whoever else deals with wikidata on the technical level) were subscribed
> there. It's pretty low traffic.

Thanks, I've sent the subscription request and am adding it to the CC. Still learning the right places to go for things :)

> Statement IDs are GUIDs (with the Item ID prefixed), and they do not change when
> the Statement changes (otherwise, they would be hashes, not IDs - References are
> currently handled by hash).

From the export/import point of view, I think I'd prefer immutable claims (i.e. the ID changes each time the claim changes) as they are easier to handle, but since that is not the case, I can switch to using the content hash instead. The performance impact (time spent calculating the hashes) should not be too big.

> One thing that would be rather easy to do is to make JSON dumps of just the
> items that changed in the last X hours. But that wouldn't tell you which
> statements changed.

I think for imports the best thing would be to have real diffs - i.e. a list of claims/item fields that were added/removed/changed - but if that's not feasible, a list of changed items would be great too. We may want this even more frequently than hourly. Item data is not that big, so loading it and running the diff manually would still be workable. It would be slightly slower for big items (since each claim for the item has to be examined) and requires maintaining an additional data structure to efficiently enumerate the claims, but it should still be workable.

Thanks,
Stas
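The manual diff described in this message could look roughly like the sketch below. It is only an illustration, not the importer's actual code: it assumes each item is a dict whose "claims" field maps property IDs to lists of statements, each carrying a stable "id", and it uses a SHA-1 over canonical JSON as a stand-in content hash (this is not the hash Wikibase itself computes). The function names are made up for the example.

```python
import hashlib
import json


def claim_hash(claim):
    """Hash a claim's content, leaving out its "id" so only the content counts."""
    content = {key: value for key, value in claim.items() if key != "id"}
    # sort_keys gives a canonical serialization, so equal content hashes equally.
    canonical = json.dumps(content, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def index_claims(item):
    """Build the extra lookup structure: statement ID -> content hash."""
    return {
        claim["id"]: claim_hash(claim)
        for statements in item.get("claims", {}).values()
        for claim in statements
    }


def diff_claims(old_item, new_item):
    """Return the statement IDs that were added, removed, or changed."""
    old, new = index_claims(old_item), index_claims(new_item)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }
```

Because statement IDs stay stable across edits, the ID identifies the claim and the hash detects whether its content changed; every claim of the item still has to be hashed once per comparison.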

Wikidata now has a mobile site
by Max Semenik 16 Dec '14

Hi,

today I've enabled MobileFrontend on wikidata.org. So far, it's still considered a trial, so no automatic redirection for mobile devices happens. While it was primarily needed to satisfy a dependency in WikiGrok, this is as good a chance as any to revisit the topic of having a mobile UI for Wikidata. The news is both good and bad: while claims are unstyled and therefore look broken, they don't take up the full width of a desktop screen, which is a good indicator that a bit of CSS should fix it. I think that even just viewing Wikidata from mobile would be really awesome. Compare for yourself: https://www.wikidata.org/wiki/Q2 vs. https://m.wikidata.org/wiki/Q2

--
Best regards,
Max Semenik ([[User:MaxSem]])

Article: "Facebook's Top Open Data Problems"
by Ori Livneh 16 Dec '14

Facebook just published this summary of a summit for database researchers held at Menlo Park last September. I recommend it. It contains a clear and concise description of Facebook's data infrastructure, and a description of the open problems they are thinking about, which is even more interesting.

https://research.facebook.com/blog/1522692927972019/facebook-s-top-open-dat…

To whet your appetite, here are the problems (the summaries are mostly my own paraphrase):

* Mobile: How should the shift toward mobile devices affect Facebook's data infrastructure?
* Reducing replication: How can we reduce the number of round trips between the application and data layers?
* Impact of Caching on Availability (aka "oh no, we just restarted memcached"): How do we harness the efficiency gains provided by caching without being brought to our knees by a sudden drop in cache hit rate?
* Sampling at logging time in a distributed environment: How should we sample log streams if we want to maintain accuracy and flexibility to answer post-hoc queries?
* Trading storage space and CPU: TL;DR: gzip --best or gzip --fast?
* Reliability of pipelines: Pipelines are less reliable than their individual parts. A pipeline composed of two systems, each 0.999 reliable, is only about 0.998 reliable, and the figure keeps dropping with every added stage. Much sadness. What to do?
* Globally distributed warehouse: consistency models and synchronization problems.
* Time series correlation and anomaly detection: AKA: I want an alert for that massive memcached bytes_out spike that doesn't also wake me up with false positives at 2AM.
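To make the pipeline-reliability point concrete, here is a small sketch of the arithmetic. It assumes the stages fail independently and that the pipeline works only if every stage works, so the overall reliability is the product of the stage reliabilities; the function name is invented for the example.

```python
import math


def pipeline_reliability(stage_reliabilities):
    """Reliability of a serial pipeline, assuming independent stage failures."""
    return math.prod(stage_reliabilities)


# Two stages at 0.999 each give 0.999 * 0.999, roughly 0.998.
two_stages = pipeline_reliability([0.999, 0.999])

# The loss compounds: eleven such stages come out at roughly 0.989.
eleven_stages = pipeline_reliability([0.999] * 11)
```

Under these assumptions the pipeline is always less reliable than its weakest stage, and each additional stage multiplies in a further loss.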

Wikidata item search via API based on labels and description
by Adrian Pohl 15 Dec '14

Hello,

I have a list of place names and want to find the corresponding Wikidata item for each name. The list includes "Köln" and "Düsseldorf", but also parts of towns that are recorded as compounds of the superior administrative entity and the district, like "Schmallenberg-Westernbödefeld" or "Kerpen-Manheim".

If I look these up via the Wikidata API with the wbsearchentities action, I have no problems with "Köln" and the like [1], but I don't get any results for compounds, see e.g. [2], although both strings are part of the label and the description of a Wikidata item. Via the Wikidata interface I get the right result, though. [3]

I have looked for quite some time but couldn't find a way to query Wikidata programmatically and get results similar to the website search. Thus, my question is: Is there a way to query Wikidata via an API over both the label fields and the description?

Background

I am working at the North Rhine-Westphalian Library Service Center (hbz) and we are currently building a new website for the North Rhine-Westphalian bibliography. [4] This bibliography collects articles, books and other media about places in the German federal state of North Rhine-Westphalia. Each record contains a string which indicates which place a resource is about. As soon as we have those links to Wikidata, we will think about how to link to a list of bibliographic resources about a place from the place's Wikipedia page. See the GitHub issue on this particular problem at [5].

All the best
Adrian

[1] https://www.wikidata.org/w/api.php?action=wbsearchentities&search=Köln&lang…
[2] https://www.wikidata.org/w/api.php?action=wbsearchentities&search=Kerpen%20…
[3] https://www.wikidata.org/w/index.php?search=Kerpen+Manheim
[4] http://lobid.org/nwbib
[5] https://github.com/hbz/nwbib/issues/42

--
Adrian Pohl
hbz - Hochschulbibliothekszentrum des Landes NRW
Tel: (+49)(0)221 - 400 75 235
http://www.hbz-nrw.de
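For illustration, the two kinds of request being contrasted could be built as in the sketch below. The first URL uses the wbsearchentities action from the message. The second is an assumption that this thread does not confirm: it uses MediaWiki's general full-text search module (action=query with list=search), which may come closer to what the website search returns. The helper names, the "de" language default, and the restriction to namespace 0 are choices made for this example only, and the code just builds URLs without sending any request.

```python
from urllib.parse import urlencode

API = "https://www.wikidata.org/w/api.php"


def wbsearchentities_url(term, language="de"):
    """URL for the entity search used in the message (matches labels and aliases)."""
    params = {
        "action": "wbsearchentities",
        "search": term,
        "language": language,
        "format": "json",
    }
    return API + "?" + urlencode(params)


def fulltext_search_url(term):
    """URL for MediaWiki's full-text search; assumed here as a possible alternative."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": term,
        "srnamespace": 0,  # assumption: items live in the main namespace
        "format": "json",
    }
    return API + "?" + urlencode(params)


print(wbsearchentities_url("Köln"))
print(fulltext_search_url("Kerpen Manheim"))
```

Whether the full-text results are good enough for matching compound place names would still need to be checked against the actual list.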

Resolving redirects
by Magnus Manske 10 Dec '14

There are currently ~500 item-to-item links on Wikidata where the "target item" is a redirect. Is there a bot resolving those? Should the merge API do that automatically? Or the merge script on site? Or Wikidata itself, after, say, a day of not reverting the merge?
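As a rough sketch of what "resolving" such links would involve, whichever component ends up doing it: given a map from redirected item IDs to their targets, each link's target is followed until it reaches a non-redirect item. The data shapes, function names, and item IDs below are hypothetical and chosen only for the example; this is not how any existing bot or the merge API is known to work.

```python
def resolve_redirect(item_id, redirects):
    """Follow redirects until reaching an item that is not itself a redirect."""
    seen = set()
    while item_id in redirects:
        if item_id in seen:
            raise ValueError(f"redirect cycle at {item_id}")
        seen.add(item_id)
        item_id = redirects[item_id]
    return item_id


def rewrite_links(links, redirects):
    """Return (source, property, target) links with redirected targets resolved."""
    return [
        (source, prop, resolve_redirect(target, redirects))
        for source, prop, target in links
    ]


# Hypothetical example: Q100 was merged into Q200, which was later merged into Q300.
redirects = {"Q100": "Q200", "Q200": "Q300"}
links = [("Q1", "P31", "Q100"), ("Q2", "P279", "Q5")]
fixed = rewrite_links(links, redirects)
```

Links whose target is not a redirect pass through unchanged, and chains of merges are followed to the final item.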

Things to get merged before the branch next week
by Daniel Kinzler 05 Dec '14

Hey!

Here are a few performance-relevant changes I think should get merged before we branch next week:

https://gerrit.wikimedia.org/r/#/c/170961/ "Determine update actions based on usage aspects." <--- the last bit missing for usage tracking

https://gerrit.wikimedia.org/r/#/c/176650/ "Use wb_terms table for label lookup." <--- should improve memory consumption a lot, and possibly also speed.

https://gerrit.wikimedia.org/r/#/c/167224/ "Defer entity deserialization" <--- should reduce memory footprint and improve speed of trivial operations like checking whether something is a redirect.

Are there any other performance improvements that we should get in? I imagine that this will be the last time we branch until the third week of January.

--
Daniel Kinzler
Senior Software Developer
Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

SiteStore loading (memcached traffic)
by hoo 04 Dec '14

Hi everyone,

I just wanted to post a quick summary of what I did today in order to significantly reduce the SiteStore-related memcached traffic.

I stumbled upon this comment https://phabricator.wikimedia.org/T58602#808530 and thus poked a bit at when we load sites from memcached. While doing that, I found that we were still loading the sites basically all the time.

To get that number down, I uploaded the following patches, which have already been reviewed, merged and even deployed (thanks for the reviews, Daniel and Katie):

* Don't lookup Sites from mc for the 'languageLinkSiteGroup' setting: https://gerrit.wikimedia.org/r/177419
* Don't load all sites for LangLinkHandler: https://gerrit.wikimedia.org/r/177429
* Don't access sites on WikibaseClient::getEntityIdForTitle: https://gerrit.wikimedia.org/r/177434

That led to a noticeable change in memcached traffic (see attachment).

Memcached traffic graphs: https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Memcached%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1417641558&g=network_report&z=large

I still have https://gerrit.wikimedia.org/r/177416 in review, which slightly changes the behavior of the other projects sidebar, but I think that this change also has quite some potential to reduce memcached traffic even further. It would be great if we could get that ready for backporting by Monday.

Cheers,
Marius
