Facebook just published this summary of a summit for database researchers
held at Menlo Park last September. I recommend it. It contains a clear and
concise description of Facebook's data infrastructure and, even more
interesting, a description of the open problems they are thinking about.
https://research.facebook.com/blog/1522692927972019/facebook-s-top-open-dat…
To whet your appetite, here are the problems (the summaries are mostly my
own paraphrases):
* Mobile: How should the shift toward mobile devices affect Facebook’s data
infrastructure?
* Reducing replication: How can we reduce the number of round trips between
the application and data layers?
* Impact of Caching on Availability (aka "oh no, we just restarted
memcached"): How do we harness the efficiency gains provided by caching
without being brought to our knees by a sudden drop in cache hit rate?
* Sampling at logging time in a distributed environment: How should we
sample log streams if we want to maintain accuracy and flexibility to
answer post-hoc queries?
* Trading storage space and CPU: TL;DR: gzip --best or gzip --fast?
* Reliability of pipelines: Pipelines are less reliable than the sum of
their parts. A pipeline composed of two systems, each 0.999 reliable,
is only 0.999 × 0.999 ≈ 0.998 reliable, and the losses compound with
every stage you add. Much sadness. What to do?
* Globally distributed warehouse: consistency models and synchronization
problems.
* Time series correlation and anomaly detection: AKA: I want an alert for
that massive memcached bytes_out spike that doesn't also wake me up with
false positives at 2AM.
Hello,
I have a list of place names and want to find the corresponding Wikidata
item for each name. The list includes "Köln" and "Düsseldorf", but also
parts of towns which are recorded as compounds of the superior
administrative entity and the district, like
"Schmallenberg-Westernbödefeld" or "Kerpen-Manheim".
If I look these up via the Wikidata API with the wbsearchentities action,
I get no problems with "Köln" and the like [1], but I won't get any
results for the compounds, see e.g. [2], although both strings are part
of the label and the description of a Wikidata item.
Via the Wikidata web interface I get the right result, though. [3]
I have looked for quite some time but couldn't find a way to query
Wikidata programmatically and get results similar to the website search.
Thus, my question is:
Is there a way to query Wikidata via an API over both the label and the
description fields?
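To illustrate, here is a minimal PHP sketch of both the call I am making
and an alternative I have not verified: the generic full-text search API
(action=query&list=search), which as far as I can tell also matches
descriptions and may be closer to what the website does.

<?php
// Entity search (what I use now): works for "Köln", but returns an
// empty result list for compound names like "Kerpen-Manheim".
$url = 'https://www.wikidata.org/w/api.php?action=wbsearchentities'
    . '&search=' . urlencode( 'Kerpen-Manheim' )
    . '&language=de&format=json';
$result = json_decode( file_get_contents( $url ), true );
var_dump( $result['search'] );

// Generic full-text search: matches more than just labels. Whether this
// is what the website search uses is an assumption on my part.
$url = 'https://www.wikidata.org/w/api.php?action=query&list=search'
    . '&srsearch=' . urlencode( 'Kerpen Manheim' )
    . '&format=json';
$result = json_decode( file_get_contents( $url ), true );
var_dump( $result['query']['search'] );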
Background
I am working at the North Rhine-Westphalian Library Service Center
(hbz) and we are currently building a new website for the
North Rhine-Westphalian bibliography. [4] This bibliography collects
articles, books and other media about places in the German federal
state of North Rhine-Westphalia. Each record contains a string which
indicates which place a resource is about. As soon as we have those
links to Wikidata, we will think about how to link from a place's
Wikipedia page to a list of bibliographic resources about that place.
See the GitHub issue on this particular problem at [5].
All the best
Adrian
[1]
https://www.wikidata.org/w/api.php?action=wbsearchentities&search=Köln&lang…
[2]
https://www.wikidata.org/w/api.php?action=wbsearchentities&search=Kerpen%20…
[3] https://www.wikidata.org/w/index.php?search=Kerpen+Manheim
[4] http://lobid.org/nwbib
[5] https://github.com/hbz/nwbib/issues/42
--
Adrian Pohl
hbz - Hochschulbibliothekszentrum des Landes NRW
Tel: (+49)(0)221 - 400 75 235
http://www.hbz-nrw.de
Hey everyone,
I want to introduce Gerrie [1] to you.
Gerrie is a crawler for Google's code review system, Gerrit.
As far as I know, the Wikimedia community uses Gerrit to improve and
develop various products like MediaWiki, wiki extensions, infrastructure
and so on. The Gerrit instance is located at gerrit.wikimedia.org [2].
During this activity, a lot of interesting data is created in the
background. Gerrie is a tool to import this data into an RDBMS like
MySQL. After that, you can run analyses with simple SQL queries.
To retrieve the data, Gerrit's SSH API is used. The benefit of this is
that you can analyze any Gerrit instance, e.g. TYPO3's [3]. I assume
that most of you are registered at gerrit.wikimedia.org [2] and have
entered your SSH public key into the system. Congratulations, you are
ready to crawl the data as well.
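If you want a quick check that your SSH key works before running Gerrie,
here is a small PHP sketch that fetches one changeset over the same SSH
API (replace yourusername with your own account; 29418 is Gerrit's
default SSH port):

<?php
// Fetch one open changeset from Gerrit's SSH API.
$output = shell_exec(
    'ssh -p 29418 yourusername@gerrit.wikimedia.org ' .
    'gerrit query --format=JSON limit:1 status:open'
);
// gerrit query emits one JSON object per line; the last line is
// query statistics, so decode only the first line here.
$lines = explode( "\n", trim( $output ) );
var_dump( json_decode( $lines[0], true ) );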
Gerrie is written in PHP and completely documented [4].
Even the database schema is documented [5] to help you analyze the data.
For a quick start, you can follow the Getting Started guide [6].
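Once the data is in MySQL, an analysis can be as small as one aggregate
query. A sketch using PDO (the table and column names here are invented
for illustration; check the schema documentation [5] for the real ones):

<?php
// Connect to the database that Gerrie filled (credentials are examples).
$pdo = new PDO( 'mysql:host=localhost;dbname=gerrie', 'user', 'password' );

// Hypothetical query: which ten people uploaded the most changesets?
$result = $pdo->query(
    'SELECT owner, COUNT(*) AS changesets
     FROM changeset
     GROUP BY owner
     ORDER BY changesets DESC
     LIMIT 10'
);
foreach ( $result as $row ) {
    echo $row['owner'] . ': ' . $row['changesets'] . "\n";
}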
The possible analyses are wide-ranging.
One use case, for example, is a gamification analysis like the Activity
Monitor, built upon TYPO3's Gerrit data for the TYPO3.CMS (the main
content management system) product:
http://metrics.andygrunwald.com/statistics/gerrit/activity-monitor/analysis…
In this analysis, every activity results in points. Based on the sum
of the points, a score list is created. The background color is
determined by a hash function based on the user's name, so the same
name always gets the same color.
Please keep in mind: this does not reflect contribution to the
community. It only displays a user's activity in a specific system.
I would love for you to test this tool.
If you need help, do not hesitate to ask; I will try to help as much as
possible. Have fun testing.
Cheers,
Andy
[1] Gerrie: https://github.com/andygrunwald/Gerrie
[2] gerrit.wikimedia.org: https://gerrit.wikimedia.org/
[3] Gerrit @ TYPO3: https://review.typo3.org/
[4] Documentation: http://gerrie.readthedocs.org/en/latest/
[5] Database Schema:
http://gerrie.readthedocs.org/en/latest/database/index.html#schema
[6] Getting started:
http://gerrie.readthedocs.org/en/latest/getting_started/index.html
Hey Hoo and Katie,
A few days ago, on IRC I mentioned that I could not see any immediate
problems with doing something like
if ( $entity instanceof StatementListProvider ) { /* do stuff with statements */ }
I started wondering why this is not bad, since I've definitely seen some
code that got seriously bad by doing this. And I've come to the conclusion
that this is fine:
function storeStatementsOfEntities( array $entities /* EntityDocument[] */ ) {
    foreach ( $entities as $entity ) {
        if ( $entity instanceof StatementListProvider ) { /* store statements */ }
    }
}
While this is not:
function storeAllPartsOfEntities( array $entities /* EntityDocument[] */ ) {
    foreach ( $entities as $entity ) {
        if ( $entity instanceof StatementListProvider ) { /* store statements */ }
        if ( $entity instanceof FingerprintProvider ) { /* store fingerprint */ }
        // ...
    }
}
In other words, if the context is specifically about one thing that an
entity can have, then it is fine. If, on the other hand, you need some
general handling for whole entities, such as diffing them, serializing
them, etc., then a different approach is needed. The pseudocode above
should make it clear why this is the case. (The second snippet suffers
from a big Open-Closed Principle violation. You'd need to modify it if
an extension adds a new type of entity containing a new type of field,
which you cannot do, since the dependency would go in the wrong
direction. So you become unable to define new types of entities via an
extension mechanism.)
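To sketch the direction of such a different approach (this is only an
illustration with invented names, not necessarily what the mail linked
below proposes): invert the dependency, so each part of an entity comes
with its own handler, and extensions register handlers for their new
field types instead of modifying the general function.

interface EntityPartStorer {
    public function canStore( EntityDocument $entity );
    public function store( EntityDocument $entity );
}

class StatementStorer implements EntityPartStorer {
    public function canStore( EntityDocument $entity ) {
        return $entity instanceof StatementListProvider;
    }
    public function store( EntityDocument $entity ) { /* store statements */ }
}

// The general function no longer needs to know the concrete parts;
// an extension adds a new handler instead of modifying this code.
function storeAllPartsOfEntities( array $entities, array $storers ) {
    foreach ( $entities as $entity ) {
        foreach ( $storers as $storer ) {
            if ( $storer->canStore( $entity ) ) {
                $storer->store( $entity );
            }
        }
    }
}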
IIRC the Lua handling code does something with whole entities, and thus
falls into the second category. If that is the case, then you probably need
to do something like what I outlined here:
https://lists.wikimedia.org/pipermail/wikidata-tech/2014-August/000546.html
Cheers
--
Jeroen De Dauw - http://www.bn2vs.com
Software craftsmanship advocate
Evil software architect at Wikimedia Germany
~=[,,_,,]:3
Hi Everyone,
I saw that we now have a LabelLookup [0], which is awesome (thanks,
aude).
I would like to use some kind of label lookup in our Lua bindings,
as getting labels there is insanely heavy (we fetch the entity, push
the whole thing into Lua and then extract the label).
Sadly, the current label lookup is uncached, and as far as I can see
we have neither a change to fix that nor a bug filed for the issue yet.
It would be great to hear why we don't have this yet (or am I missing
something?) and whether you think it's OK to use it already (I don't
want to knock over the DB by hitting wb_terms a lot).
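(What I have in mind is just a caching decorator around the lookup,
roughly like the sketch below; the interface and method names are
guessed, not the actual ones from [0]:)

class CachingLabelLookup {
    private $labelLookup;
    private $cache = array();

    public function __construct( $labelLookup ) {
        $this->labelLookup = $labelLookup;
    }

    // Method name guessed; delegate on a cache miss, reuse on a hit.
    public function getLabel( $entityId, $languageCode ) {
        $key = $entityId . '#' . $languageCode;
        if ( !array_key_exists( $key, $this->cache ) ) {
            $this->cache[$key] = $this->labelLookup->getLabel( $entityId, $languageCode );
        }
        return $this->cache[$key];
    }
}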
Cheers,
Marius
[0]: https://gerrit.wikimedia.org/r/169330