Facebook just published this summary of a summit for database researchers
held at Menlo Park last September. I recommend it. It contains a clear and
concise description of Facebook's data infrastructure and, even more
interesting, a description of the open problems they are thinking about.
https://research.facebook.com/blog/1522692927972019/facebook-s-top-open-dat…
To whet your appetite, here are the problems (the summaries are mostly my
own paraphrases):
* Mobile: How should the shift toward mobile devices affect Facebook’s data
infrastructure?
* Reducing replication: How can we reduce the number of round trips between
the application and data layers?
* Impact of Caching on Availability (aka "oh no, we just restarted
memcached"): How do we harness the efficiency gains provided by caching
without being brought to our knees by a sudden drop in cache hit rate?
* Sampling at logging time in a distributed environment: How should we
sample log streams if we want to maintain accuracy and flexibility to
answer post-hoc queries?
* Trading storage space and CPU: TL;DR: gzip --best or gzip --fast?
* Reliability of pipelines: Pipelines are less reliable than the sum of
their parts. A pipeline composed of two systems, each 0.999 reliable,
is only 0.999 × 0.999 ≈ 0.998 reliable, and the losses compound with
every stage you add. Much sadness. What to do?
* Globally distributed warehouse: consistency models and synchronization
problems.
* Time series correlation and anomaly detection: AKA: I want an alert for
that massive memcached bytes_out spike that doesn't also wake me up with
false positives at 2AM.
Hello,
I have a list of place names and want to find the corresponding Wikidata
item for each name. The list includes "Köln" and "Düsseldorf", but also
parts of towns which are recorded as compounds of the superior
administrative entity and the district, like
"Schmallenberg-Westernbödefeld" or "Kerpen-Manheim".
If I look these up via the Wikidata API with the wbsearchentities action,
I get no problems with "Köln" and the like [1], but I won't get any
results for the compounds, see e.g. [2], although both strings are part
of the label and the description of a Wikidata item.
Via the Wikidata web interface I get the right result, though. [3]
I have looked for quite some time but couldn't find a way to query
Wikidata programmatically and get results similar to the website search.
Thus, my question is:
Is there a way to query Wikidata via an API over both the label and the
description fields?
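To illustrate, here is a minimal PHP sketch of both the call I am making
and an alternative I have not verified: the generic full-text search API
(action=query&list=search), which as far as I can tell also matches
descriptions and may be closer to what the website does.

<?php
// Entity search (what I use now): works for "Köln", but returns an
// empty result list for compound names like "Kerpen-Manheim".
$url = 'https://www.wikidata.org/w/api.php?action=wbsearchentities'
    . '&search=' . urlencode( 'Kerpen-Manheim' )
    . '&language=de&format=json';
$result = json_decode( file_get_contents( $url ), true );
var_dump( $result['search'] );

// Generic full-text search: matches more than just labels. Whether this
// is what the website search uses is an assumption on my part.
$url = 'https://www.wikidata.org/w/api.php?action=query&list=search'
    . '&srsearch=' . urlencode( 'Kerpen Manheim' )
    . '&format=json';
$result = json_decode( file_get_contents( $url ), true );
var_dump( $result['query']['search'] );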
Background
I am working at the North Rhine-Westphalian Library Service Center
(hbz) and we are currently building a new website for the
North Rhine-Westphalian bibliography. [4] This bibliography collects
articles, books and other media about places in the German federal
state of North Rhine-Westphalia. Each record contains a string which
indicates which place a resource is about. As soon as we have those
links to Wikidata, we will think about how to link from a place's
Wikipedia page to a list of bibliographic resources about that place.
See the GitHub issue on this particular problem at [5].
All the best
Adrian
[1]
https://www.wikidata.org/w/api.php?action=wbsearchentities&search=Köln&lang…
[2]
https://www.wikidata.org/w/api.php?action=wbsearchentities&search=Kerpen%20…
[3] https://www.wikidata.org/w/index.php?search=Kerpen+Manheim
[4] http://lobid.org/nwbib
[5] https://github.com/hbz/nwbib/issues/42
--
Adrian Pohl
hbz - Hochschulbibliothekszentrum des Landes NRW
Tel: (+49)(0)221 - 400 75 235
http://www.hbz-nrw.de
Hey everyone,
I want to introduce Gerrie [1] to you.
Gerrie is a crawler for Google's code review system, Gerrit.
As far as I know, the Wikimedia community uses Gerrit to improve and
develop various products like MediaWiki, wiki extensions, infrastructure
and so on. The Gerrit instance is located at gerrit.wikimedia.org [2].
During this activity, a lot of interesting data is created in the
background. Gerrie is a tool to import this data into an RDBMS like
MySQL. After that, you can run analyses with simple SQL queries.
To retrieve the data, Gerrit's SSH API is used. The benefit of this is
that you can analyze any Gerrit instance, e.g. TYPO3's [3]. I assume
that most of you are registered at gerrit.wikimedia.org [2] and have
entered your SSH public key into the system. Congratulations, you are
ready to crawl the data as well.
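If you want a quick check that your SSH key works before running Gerrie,
here is a small PHP sketch that fetches one changeset over the same SSH
API (replace yourusername with your own account; 29418 is Gerrit's
default SSH port):

<?php
// Fetch one open changeset from Gerrit's SSH API.
$output = shell_exec(
    'ssh -p 29418 yourusername@gerrit.wikimedia.org ' .
    'gerrit query --format=JSON limit:1 status:open'
);
// gerrit query emits one JSON object per line; the last line is
// query statistics, so decode only the first line here.
$lines = explode( "\n", trim( $output ) );
var_dump( json_decode( $lines[0], true ) );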
Gerrie is written in PHP and completely documented [4].
Even the database schema is documented [5] to help you analyze the data.
For a quick start, you can follow the Getting Started guide [6].
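Once the data is in MySQL, an analysis can be as small as one aggregate
query. A sketch using PDO (the table and column names here are invented
for illustration; check the schema documentation [5] for the real ones):

<?php
// Connect to the database that Gerrie filled (credentials are examples).
$pdo = new PDO( 'mysql:host=localhost;dbname=gerrie', 'user', 'password' );

// Hypothetical query: which ten people uploaded the most changesets?
$result = $pdo->query(
    'SELECT owner, COUNT(*) AS changesets
     FROM changeset
     GROUP BY owner
     ORDER BY changesets DESC
     LIMIT 10'
);
foreach ( $result as $row ) {
    echo $row['owner'] . ': ' . $row['changesets'] . "\n";
}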
The possible analyses are wide-ranging.
One use case, for example, is a gamification analysis like the Activity
Monitor, built upon TYPO3's Gerrit data for the TYPO3.CMS (the main
content management system) product:
http://metrics.andygrunwald.com/statistics/gerrit/activity-monitor/analysis…
In this analysis, every activity results in points. Based on the sum
of the points, a score list is created. The background color is
determined by a hash function based on the user's name, so the same
name always gets the same color.
Please keep in mind: this does not reflect contribution to the
community. It only displays a user's activity in a specific system.
I would love for you to test this tool.
If you need help, do not hesitate to ask; I will try to help as much as
possible. Have fun testing.
Cheers,
Andy
[1] Gerrie: https://github.com/andygrunwald/Gerrie
[2] gerrit.wikimedia.org: https://gerrit.wikimedia.org/
[3] Gerrit @ TYPO3: https://review.typo3.org/
[4] Documentation: http://gerrie.readthedocs.org/en/latest/
[5] Database Schema:
http://gerrie.readthedocs.org/en/latest/database/index.html#schema
[6] Getting started:
http://gerrie.readthedocs.org/en/latest/getting_started/index.html
Hey Hoo and Katie,
A few days ago, on IRC I mentioned that I could not see any immediate
problems with doing something like
if ( $entity instanceof StatementListProvider ) { /* do stuff with statements */ }
I started wondering why this is not bad, since I've definitely seen some
code that got seriously bad by doing this. And I've come to the conclusion
that this is fine:
function storeStatementsOfEntities( array $entities /* EntityDocument[] */ ) {
    foreach ( $entities as $entity ) {
        if ( $entity instanceof StatementListProvider ) { /* store statements */ }
    }
}
While this is not:
function storeAllPartsOfEntities( array $entities /* EntityDocument[] */ ) {
    foreach ( $entities as $entity ) {
        if ( $entity instanceof StatementListProvider ) { /* store statements */ }
        if ( $entity instanceof FingerprintProvider ) { /* store fingerprint */ }
        // ...
    }
}
In other words, if the context is specifically about one thing that an
entity can have, then it is fine. If, on the other hand, you need some
general handling for whole entities, such as diffing them, serializing
them, etc., then a different approach is needed. The pseudocode above
should make it clear why this is the case. (The second snippet suffers
from a big Open-Closed Principle violation. You'd need to modify it if
an extension adds a new type of entity containing a new type of field,
which you cannot do, since the dependency would go in the wrong
direction. So you become unable to define new types of entities via an
extension mechanism.)
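To sketch the direction of such a different approach (this is only an
illustration with invented names, not necessarily what the mail linked
below proposes): invert the dependency, so each part of an entity comes
with its own handler, and extensions register handlers for their new
field types instead of modifying the general function.

interface EntityPartStorer {
    public function canStore( EntityDocument $entity );
    public function store( EntityDocument $entity );
}

class StatementStorer implements EntityPartStorer {
    public function canStore( EntityDocument $entity ) {
        return $entity instanceof StatementListProvider;
    }
    public function store( EntityDocument $entity ) { /* store statements */ }
}

// The general function no longer needs to know the concrete parts;
// an extension adds a new handler instead of modifying this code.
function storeAllPartsOfEntities( array $entities, array $storers ) {
    foreach ( $entities as $entity ) {
        foreach ( $storers as $storer ) {
            if ( $storer->canStore( $entity ) ) {
                $storer->store( $entity );
            }
        }
    }
}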
IIRC the Lua handling code does something with whole entities, and thus
falls into the second category. If that is the case, then you probably need
to do something like what I outlined here:
https://lists.wikimedia.org/pipermail/wikidata-tech/2014-August/000546.html
Cheers
--
Jeroen De Dauw - http://www.bn2vs.com
Software craftsmanship advocate
Evil software architect at Wikimedia Germany
~=[,,_,,]:3
Hi Everyone,
I saw that we now have a LabelLookup [0], which is awesome (thanks,
aude).
I would like to use some kind of label lookup in our Lua bindings,
as getting labels there is insanely heavy (we fetch the entity, push
the whole thing into Lua and then extract the label).
Sadly, the current label lookup is uncached, and as far as I can see
we have neither a change to fix that nor a bug filed for the issue yet.
It would be great to hear why we don't have this yet (or am I missing
something?) and whether you think it's OK to use it already (I don't
want to knock over the DB by hitting wb_terms a lot).
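(What I have in mind is just a caching decorator around the lookup,
roughly like the sketch below; the interface and method names are
guessed, not the actual ones from [0]:)

class CachingLabelLookup {
    private $labelLookup;
    private $cache = array();

    public function __construct( $labelLookup ) {
        $this->labelLookup = $labelLookup;
    }

    // Method name guessed; delegate on a cache miss, reuse on a hit.
    public function getLabel( $entityId, $languageCode ) {
        $key = $entityId . '#' . $languageCode;
        if ( !array_key_exists( $key, $this->cache ) ) {
            $this->cache[$key] = $this->labelLookup->getLabel( $entityId, $languageCode );
        }
        return $this->cache[$key];
    }
}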
Cheers,
Marius
[0]: https://gerrit.wikimedia.org/r/169330