Hey Steffen and Andy,
Continuing what I started on Twitter here, as some more characters might be
It seems that both our projects (FLOW3 and Wikidata) are in a similar
situation. We are using Gerrit as CR tool, and TravisCI to run our tests.
And we both want to have Travis run tests for all patchsets submitted to
Gerrit, and then +1 or -1 on verified based on the build passing or
failing. To what extent have you gotten such a thing to work on your
project? Is there code available anywhere? If both projects can use the
same code for this, I'd be happy to contribute to what you already have.
Jeroen De Dauw
Don't panic. Don't be evil. ~=[,,_,,]:3
Hello everybody! Glad to join the team!
I'm new to Wikidata and have a few questions:
1. I thought Wikidata gathered its data from Wikipedia infoboxes, but I also
find that the statements and the infoboxes do not entirely correspond. Some
infobox data is missing from the statements, and the statements also contain
more data than the infobox. So how does Wikidata actually create a new item?
What is the detailed process? If there are new or updated articles in
Wikipedia, how does Wikidata pick up these updates?
2. I think there is more valuable information in Wikipedia that is not
included in Wikidata, e.g. categories and infobox data. How can I enrich
Wikidata? I understand I could follow the process used by Wikidata bots; any
pointers?
Any suggestions or guidance are welcome!
I stumbled upon an inconsistency when parsing the dumpfile JSON:
In item Q58404 (Haaften) the aliases are an empty array, as in
The same holds for Q15982 (Wernigerode), which also has no aliases and
therefore an empty array.
In item Q189889 (Chicago) the aliases are an object:
The same is the case for Q42 (Douglas Adams).
Which one should it be now? I suspect there is an error during the
writing of the dumpfiles…
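
Until the dump format is clarified, a converter can be written to accept both shapes. The following is a minimal Python sketch, not the behaviour of any existing tool: the function name normalize_aliases and the sample alias value are made up for illustration, and it assumes that an empty array simply means "no aliases".

```python
import json


def normalize_aliases(raw):
    """Return aliases as a dict keyed by language code.

    Assumption: an empty JSON array means "no aliases" and is
    treated like an empty object.
    """
    if isinstance(raw, dict):
        return raw
    if isinstance(raw, list) and not raw:
        return {}
    # Anything else (e.g. a non-empty array) is unexpected; fail loudly.
    raise ValueError(f"unexpected aliases value: {raw!r}")


# Hypothetical fragments mirroring the two shapes described above.
without_aliases = json.loads('{"aliases": []}')
with_aliases = json.loads('{"aliases": {"en": ["some alias"]}}')

print(normalize_aliases(without_aliases["aliases"]))
print(normalize_aliases(with_aliases["aliases"]))
```

Raising on any other shape keeps a genuinely malformed entry from being silently converted.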
When I parse the JSON from the dumpfiles, I find coordinates like
However the altitude-key is not represented in the WDTK data model. Is
this key deprecated (i.e. should I ignore it when converting dump-JSON
PS. First usable dumpfile-converter class version sighted at the horizon.
We've been mostly following http://semver.org/ for our new components,
which is great. One thing we have not been doing well, however, is bumping
to 1.0 at the appropriate time.
> Major version zero (0.y.z) is for initial development. Anything may change
> at any time. The public API should not be considered stable.
In other words, in 0.y.z an increment of y denotes a breaking change, and an
increment of z does not. This is shifted from the "normal" x.y.z with x>0,
where breaking changes increment x, feature additions increment y and fixes
increment z. Hence if we wrongly stick around at 0.y.z, then we are
also wrongly using the rest of the numbering at that point.
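
To make the shift concrete, here is a small Python sketch of the numbering rule as read in this mail. It is only an illustration: the function name next_version and the three change labels are invented here, and the 0.y.z branch encodes the interpretation above rather than anything semver itself mandates.

```python
def next_version(version: str, change: str) -> str:
    """Return the next version for a change of kind
    'breaking', 'feature' or 'fix' (hypothetical labels)."""
    if change not in ("breaking", "feature", "fix"):
        raise ValueError(f"unknown change kind: {change}")
    major, minor, patch = (int(part) for part in version.split("."))

    if major == 0:
        # Initial development: y carries breaking changes,
        # z carries everything else.
        if change == "breaking":
            return f"0.{minor + 1}.0"
        return f"0.{minor}.{patch + 1}"

    # Normal x.y.z numbering with x > 0.
    if change == "breaking":
        return f"{major + 1}.0.0"
    if change == "feature":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"


print(next_version("0.3.1", "feature"))
print(next_version("1.3.1", "feature"))
```

The same feature addition moves the last number while at 0.y.z but the middle one after 1.0, which is why staying at 0.y.z too long skews the meaning of every later release number.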
So when should the bump to 1.0 happen?
> Version 1.0.0 defines the public API. The way in which the version number
> is incremented after this release is dependent on this public API and how
> it changes.
> How should I deal with revisions in the 0.y.z initial development phase?
> The simplest thing to do is start your initial development release at
> 0.1.0 and then increment the minor version for each subsequent release.
> How do I know when to release 1.0.0?
> If your software is being used in production, it should probably already
> be 1.0.0. If you have a stable API on which users have come to depend, you
> should be 1.0.0. If you're worrying a lot about backwards compatibility,
> you should probably already be 1.0.0.
> Doesn't this discourage rapid development and fast iteration?
> Major version zero is all about rapid development. If you're changing the
> API every day you should either still be in version 0.y.z or on a separate
> development branch working on the next major version.
Most of our components are used in production and have a relatively stable
API, yet they still have a version in the 0.y.z format. Let's pay more
attention to this for our new components, so they don't end up in the same
situation. For the existing ones, the next time we would otherwise increment
the second number, we can increment the first number instead.
Jeroen De Dauw
For some time now, it has been clear that a lot of people have use for the
code that can serialize and deserialize the Wikibase DataModel. Anyone
interacting with the API or the dumps and doing non-trivial things with the
data benefits from being able to use the domain objects provided by the
DataModel. While those have themselves been reusable for some time, the
code responsible for serialization is not. As the involved code also
suffers from serious design issues and is a sizeable chunk of our technical
debt, the idea is to create a new shiny dedicated component that is not
bound to Wikibase Repo or Wikibase Client.
As interest has been expressed in contributing to this, I'll briefly
outline the general idea, upcoming steps and relevant resources. If you are
not interested in contributing, you can stop reading here :)
I have created a new git repo for this component, which can be found here
This component should be to the DataModel component what AskSerialization
is to the Ask library. The approach and structure should be very
similar. A few stubs have been added to illustrate how to organize the
code, and autoloading and test bootstrap are in place, so the tests can be
run by executing "phpunit" in the root directory. The existing legacy code
for this serialization functionality can be found in Wikibase.git, in
I myself will be working on a very similar component which is aimed at
solving the technical debt around the serialization code for the format
used by the Wikibase Repo data access layer. This code resides in  and
follows much the same approach as the new component for the serialization
code dealing with the public format. Once the most critical issues this new
code will solve are tackled, I will likely start work on the former
Things to keep in mind when contributing to this component:
* Almost no code should know concrete instances of serializers. Program
against the interfaces. I.e., when the constructor of a serializer for a
higher level object (e.g. SnakSerializer) takes a collaborator for a lower
level one (e.g. DataValueSerializer), type hint against the interface (ie
* Unit tests should be provided for all code, along with round trip tests
for matching serializers and deserializers, and high level serialization
and deserialization integration tests.
* Write clean tests, with descriptive method names, ie as done in
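
The points above can be sketched as follows. This is an illustration only, written in Python rather than in the component's own language, and it is not the actual design: the interface names Serializer and Deserializer, the StringValue classes, the tuple used as a stand-in for a snak and the serialized shape are all assumptions made for this example.

```python
from typing import Any, Protocol


class Serializer(Protocol):
    def serialize(self, obj: Any) -> Any: ...


class Deserializer(Protocol):
    def deserialize(self, data: Any) -> Any: ...


class StringValueSerializer:
    """Hypothetical low level serializer for plain string values."""

    def serialize(self, obj: str) -> dict:
        return {"type": "string", "value": obj}


class StringValueDeserializer:
    def deserialize(self, data: dict) -> str:
        return data["value"]


class SnakSerializer:
    # The collaborator is typed against the interface, not a concrete class.
    def __init__(self, data_value_serializer: Serializer) -> None:
        self._data_value_serializer = data_value_serializer

    def serialize(self, snak: tuple) -> dict:
        property_id, value = snak
        return {
            "property": property_id,
            "datavalue": self._data_value_serializer.serialize(value),
        }


class SnakDeserializer:
    def __init__(self, data_value_deserializer: Deserializer) -> None:
        self._data_value_deserializer = data_value_deserializer

    def deserialize(self, data: dict) -> tuple:
        value = self._data_value_deserializer.deserialize(data["datavalue"])
        return (data["property"], value)


def test_snak_survives_serialization_round_trip() -> None:
    snak = ("P1", "some value")
    serialized = SnakSerializer(StringValueSerializer()).serialize(snak)
    restored = SnakDeserializer(StringValueDeserializer()).deserialize(serialized)
    assert restored == snak


test_snak_survives_serialization_round_trip()
```

Because SnakSerializer only sees the interface, a unit test can hand it a stub collaborator, while the round trip test exercises a matching serializer and deserializer pair together.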
Jeroen De Dauw
Since I am working on the conversion from the dump files to the wdtk
data model, I will have to take apart the "refs" section of the JSON
representing the stored items.
Now a "refs"-section most likely looks like this:
(Tried to format it for readability)
So I figured out the following: The outer array groups all references.
The second level array groups information about one reference (so if we
had multiple references, we would also have multiple second-level
arrays) and the inner arrays each group one specific piece of information about
one specific reference (as determined by the second level array they are
Am I correct so far?
The integer following the "value"-string denotes what the following
information is about. So if I read that the value is 577, I know that
there must be a specification of time, don't I?
Is there any specific reason why this is done in array form and not in
a JSON object? (Since the "value"-key is always there one could know
from its value, what other keys must be available.)
If yes, why mention the type of information (e.g. "time") again?
Am I overlooking something?
Thanks so far,
-- Fredo Erxleben
[Moving discussion to wikidata-tech, since it is of general interest.
The question was how the grouped statements seen in the UI are computed
from the flat list of statements found in the dumps]
StatementGroups are a new structure that we decided to introduce to the
data model last week (and which did not exist so far). They represent
statements that are grouped by property, as can be seen in the user
interface. The JSON generated by the Web API also groups statements by
As you noticed, the internal JSON format found in the dumps does not
have such groups -- there is just a list of statements there, which may
not even have statements of the same property next to each other. The
order of groups shown in the UI and the API results is computed from
this internal list. Having StatementGroups as an explicit construct will
allow the order of the groups to be controlled in a saner way.
Currently the groups are created from the plain list by putting the
group for a property "at the first position where a statement of this
property occurs in the list". More algorithmically:
* Input: a list of statements
* Output: a list of lists of statements
* result = [] // empty list
* Iterate over all statements in the input:
** if property of statement has no group in result yet:
*** create new group and append it to the result
** add the statement to its group
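
The steps above can be sketched in Python as follows. This only illustrates the described procedure and is not the Wikibase or WDTK code; statements are represented as plain dicts with a hypothetical "property" key, and the function name group_by_property is made up.

```python
def group_by_property(statements):
    """Group a flat list of statements by property.

    A group is placed at the position where the first statement of
    its property occurs; statements keep their original order.
    """
    result = []            # list of groups (lists of statements)
    position = {}          # property -> index of its group in result
    for statement in statements:
        prop = statement["property"]
        if prop not in position:
            position[prop] = len(result)
            result.append([])
        result[position[prop]].append(statement)
    return result


# Hypothetical flat list, with statements of one property not adjacent.
flat = [
    {"property": "P31", "id": "a"},
    {"property": "P17", "id": "b"},
    {"property": "P31", "id": "c"},
]
for group in group_by_property(flat):
    print([statement["id"] for statement in group])
```

The dictionary only remembers where each group sits, so the whole pass stays linear in the number of statements.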
The statements within each group are reordered in the UI, so that the
ones with preferred rank are on top, followed by the ones of normal
rank, followed by the ones of deprecated rank. The relative order of
statements with the same rank should be preserved in this operation.
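
A rank-based reordering of this kind can be sketched with a stable sort. Again this is only an illustration under assumptions: the "rank" key, its three string values and the function name order_by_rank are chosen for this example and are not taken from the UI code.

```python
# Lower number means higher up in the displayed group.
RANK_ORDER = {"preferred": 0, "normal": 1, "deprecated": 2}


def order_by_rank(group):
    """Return the statements of one group ordered by rank.

    Python's sorted() is stable, so statements with the same rank
    keep their relative order from the input list.
    """
    return sorted(group, key=lambda statement: RANK_ORDER[statement["rank"]])


# Hypothetical group of statements for a single property.
group = [
    {"id": "a", "rank": "normal"},
    {"id": "b", "rank": "deprecated"},
    {"id": "c", "rank": "preferred"},
    {"id": "d", "rank": "normal"},
]
print([statement["id"] for statement in order_by_rank(group)])
```

Since sorted() returns a new list, the original order stays available for anything that still depends on it.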
We do not have a representation for these rank-based groups in the data
model (it seemed not as critical as the StatementGroups), so the
statements in each group should just be kept in their original order.
This is also important for moving groups, since the group can currently
only be moved by moving its first statement in the internal dump list
(which may not have the highest rank). This will hopefully at some point
get easier with a new API action for moving groups. Reordering
statements into groups as described above is not significant for query
answering or for display -- in fact Wikibase does this internally on
some edits. So this preprocessing does not really change the data.
In the future, I hope that the rank-based ordering will also be maintained
within statement groups.
I hope this answers the original question :-)
On 19/02/14 18:33, Fredo Erxleben wrote:
> Hello everybody,
> I encountered a question when trying to take apart statements:
> So far I encountered JSON looking like this:
> … … …
> which is just one StatementGroup with some statements, as far as I
> understand it.
> Now, how would it look if the Item has multiple statement groups? Then
> the value of the "claim" key would not be an array of statements but
> an array of arrays of statements? Or do all Items have only one
> statement group?
> Kind regards
> -- Fredo
I added a comment on bug 40810 but I also want to send it here to reach
We still need to discuss how the badges should work in general. The
development plan says "Each sitelink can have zero or one badge
attached to it." This does not match the current implementation, which
allows multiple badges. So the implementation needs to be changed to
support only one badge.
However, if we decide to keep the option of multiple badges, we face some
other problems. First, the UI becomes harder to implement on Wikidata.
This is not a real problem, though. The real problem is that we cannot
determine which badge to display on the client. The only options
I see to solve this are either to create another config variable
(very ugly) or to allow only one badge, as stated in the development plan