Dear all,
I am happy to announce the third release of Wikidata Toolkit [1], the
Java library for programming with Wikidata and Wikibase. The main new
features are:
* Full support for the (now) standard JSON format used by Wikidata
* Huge performance improvements (decompressing and parsing the whole
JSON dump now takes about 15min; was more like 80min before)
* Many new example programs for inspiration and guidance [2]
Maven users can get the library directly from Maven Central (see [1]);
this is the preferred method of installation. There is also an
all-in-one JAR at github [3] and of course the sources [4].
Version 0.3.0 is still in alpha. For the next release, we will focus on
the following tasks:
* Support a binary format for even faster random access (some of this is
done already, but not quite ready for release yet)
* A command-line tool for data processing/conversion tasks
* Support for storing and querying data
Feedback is very welcome. Developers are also invited to contribute via
github.
Cheers,
Markus
[1] https://www.mediawiki.org/wiki/Wikidata_Toolkit
[2]
https://github.com/Wikidata/Wikidata-Toolkit/tree/master/wdtk-examples
(scroll down for documentation)
[3] https://github.com/Wikidata/Wikidata-Toolkit/releases
(you'll also need to install the third-party dependencies manually when
using this)
[4] https://github.com/Wikidata/Wikidata-Toolkit/
Hello,
If you check the PHP file for SpecialUnconnectedPages (in
Wikibase/client/includes/specials/), it extends SpecialPage instead of
QueryPage. Is this intentional? And if not, do you agree it would be
better to rewrite the file to use the QueryPage class?
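For illustration, a QueryPage-based version could look roughly like the
sketch below (just my guess at the query - pages without a wikibase_item
page prop - the real logic is in the current special page):

class SpecialUnconnectedPages extends QueryPage {

	public function __construct() {
		parent::__construct( 'UnconnectedPages' );
	}

	/**
	 * QueryPage builds the SQL from this array and takes care of
	 * paging and result caching (querycache table) for us.
	 */
	public function getQueryInfo() {
		return array(
			'tables' => array( 'page', 'page_props' ),
			'fields' => array(
				'namespace' => 'page_namespace',
				'title' => 'page_title',
				'value' => 'page_id',
			),
			// pages with no wikibase_item page prop (illustrative)
			'conds' => array( 'pp_propname' => null ),
			'join_conds' => array(
				'page_props' => array(
					'LEFT JOIN',
					array( 'pp_page = page_id', "pp_propname = 'wikibase_item'" )
				),
			),
		);
	}

	public function formatResult( $skin, $result ) {
		$title = Title::makeTitle( $result->namespace, $result->title );
		return Linker::link( $title );
	}
}

That way we would get result caching and paging for free.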
Best
--
Amir
(+cc Nemo and Wikidata-tech)
On Fri, Sep 26, 2014 at 5:33 AM, Gergo Tisza <gtisza(a)wikimedia.org> wrote:
> On Fri, Sep 26, 2014 at 2:49 AM, Rob Lanphier <robla(a)wikimedia.org> wrote:
>
>> There's an item that Luis Villa added to the MW Core backlog that I'd
>> like to move to the Multimedia backlog:
>>
>> https://www.mediawiki.org/wiki/Wikimedia_MediaWiki_Core_Team/Backlog#Struct…
>>
>> I'm assuming everything that he describes fits nicely into what is
>> planned for Structured Data. Assuming that's true, should I just
>> copy/paste into a new card in Mingle, or a new page on mw.org or what?
>>
>
> This seems to be about article text, or mainly about article text
> (articles imported from other wikis and so on).
>
> The plan for the structured data project is to create Wikidata properties
> for legalese, install Wikibase on Commons (and possibly other wikis which
> have local images), make that Wikibase use Wikidata properties (and
> sometimes Wikidata items as values), create a new entity type called
> mediainfo (which is like a Wikibase item, but associated with a file), and
> add legal information to the mediainfo entries.
>
> Part of that (the Wikidata properties) could be reused for articles and
> other non-file content - the source, license etc. properties are generic
> enough. However, if we want to use this structure to attribute files, we
> would either have to make mediainfo into some more generic thing that can
> be attached to any wiki page, or abuse the langlink/badge feature to serve
> a similar purpose. That is a major course correction; if we want to do
> something like that, that should be discussed (with the involvement of the
> Wikidata team) as soon as possible.
>
Thanks for the analysis, Gergo! I was going to split Luis's proposal into a
separate wiki page, but I see Nemo has linked to this page as the "Canonical
page on the topic":
https://www.mediawiki.org/wiki/Files_and_licenses_concept
Without a deep reading that I'm admittedly just not going to have time for,
it's hard to tell how related the page that Nemo linked to is to the
concepts that Luis is trying to capture. Could someone (Nemo? Luis?) merge
Luis's requirements into the "canonical page" to Luis's satisfaction, so I
can delete most of the information from our backlog? I'll keep the item on
the MW Core backlog, since I don't know where else to put it, but it's
probably going to be relatively low priority for that team.
Multimedia team and Wikidata team, could you make sure you're considering
the requirements that Luis brought up as you build your solution? Even if
you decide to punt on some of the things that aren't strictly necessary for
files, it's still good to make sure you don't paint us into a corner
if/when we do try to do something more sophisticated for articles.
One thing I'll note, though: before we get too complacent in thinking that
files are somehow simpler than articles, we should consider these
relatively common scenarios:
* Group photo with potentially different per-person personality rights
* PDF of a slide deck with many images
* PDF of a Wikipedia article :-)
Rob
Hey all,
I've made a little video demonstrating how you can get a clone of Wikibase
DataModel, set it up, and run its tests. It's only 30 seconds long, yet
includes all you need to know. https://asciinema.org/a/12530
Cheers
--
Jeroen De Dauw - http://www.bn2vs.com
Software craftsmanship advocate
Evil software architect at Wikimedia Germany
~=[,,_,,]:3
We are now on MediaWiki 1.25 deployment branches of core. For Wikibase /
Wikidata branches, we are using a new naming scheme:
Instead of "mw1.25-wmf1" as in the past, we now have "wmf/1.25wmf1" which
is the same naming scheme as used by core and other extensions.
This allows Jenkins to test our branches against the corresponding version
of core, if available, instead of testing against master. Thanks to hashar
for making this possible. :)
Cheers,
Katie
--
Katie Filbert
Wikidata Developer
Wikimedia Germany e.V. | Tempelhofer Ufer 23-24, 10963 Berlin
Phone (030) 219 158 26-0
http://wikimedia.de
Wikimedia Germany - Society for the Promotion of Free Knowledge e.V. Entered
in the register of associations at Amtsgericht Berlin-Charlottenburg under
number 23 855, recognized as charitable by the tax office for corporations I
Berlin, tax number 27/681/51985.
Hi. The call with the WMF today turned up a pretty big issue.
WikibaseQueryEngine (WQE) depends on two big 3rd party libraries:
* Symfony (used for implementing a command line client for the query engine,
which we use to generate database tables)
* doctrine/dbal (a database abstraction layer we use to run queries and create
database tables)
In order to deploy these, the WMF would require a line-by-line review of
something around 50 thousand lines of code (or maybe it was 30 thousand, or
80 thousand - a lot, in any case). This is not feasible.
We (Katie, Jeroen, Chris, Nik, me, etc) have come up with a plan to get rid of
these dependencies:
1) Split the command line interface into a separate component (WQE-CLI perhaps)
that would not be deployed. Symfony is then out of the picture.
2) Go back to using MediaWiki's DB abstraction for running queries. This should
be easy.
3) For generating DB tables (aka schema creation), we create a separate
component (WQE schema generator or something) that would use dbal to generate
static SQL files for the supported DB systems (most importantly, MySQL and
SQLite); see the sketch below. This would be part of our build step, but the
code relying on dbal would not be deployed. Only the generated SQL files would
be used for deployment (either through the update script, or, in the case of
WMF, manually).
In the end, WQE could use either MW or dbal for running queries, so we could
deploy it without dbal on the WMF cluster, but it could also be used outside
MediaWiki.
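To make step 3 concrete, the schema generator could use dbal's schema API
roughly like this (just a sketch; the table and columns shown are made up,
not the actual WQE schema):

use Doctrine\DBAL\Schema\Schema;
use Doctrine\DBAL\Platforms\MySqlPlatform;
use Doctrine\DBAL\Platforms\SqlitePlatform;

// Define the schema once, using dbal's abstract representation.
// The table and columns here are illustrative only.
$schema = new Schema();
$table = $schema->createTable( 'wqe_claims' );
$table->addColumn( 'claim_id', 'string', array( 'length' => 32 ) );
$table->addColumn( 'property_id', 'integer' );
$table->addIndex( array( 'property_id' ) );

// Emit one static .sql file per supported platform. Only these
// generated files would be deployed; dbal itself stays a
// build-time dependency.
$platforms = array(
	'mysql' => new MySqlPlatform(),
	'sqlite' => new SqlitePlatform(),
);

foreach ( $platforms as $name => $platform ) {
	$sql = implode( ";\n", $schema->toSql( $platform ) ) . ";\n";
	file_put_contents( "sql/WikibaseQueryEngine.$name.sql", $sql );
}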
If there are no objections, I propose to take this on for the next sprint.
Perhaps we can start to split this into tasks on Bugzilla already.
Cheers,
Daniel
--
Daniel Kinzler
Senior Software Developer
Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.
Hi all!
The Wikibase team would like to allow data from any item to be used on any
client page. To do this, we need to track which item is being used where, so we
can purge the appropriate pages when the item changes. We would like
people with database experience to look at our proposal and let us know about
any concerns, especially wrt performance.
Here is a proposal for two database tables for tracking the usage of
entities across wikis:
https://gerrit.wikimedia.org/r/#/c/158078/9/usagetracking/includes/Usage/Sq…
https://gerrit.wikimedia.org/r/#/c/158078/9/subscription/includes/Subscript…
The "entity_usage" table would be on every client, recording wich entity is used
on which page (kind of like the iwlinks table). The "entity_per_client" table
would be on the repo, and track which wiki ("client") is interested in changes
to which entity.
Please have a look and let me know if you have any questions or suggestions,
especially with regards to the following use cases:
The following would happen when editing/re-parsing a page on a client wiki
(e.g. Wikipedia):
* get all entities used on a given page from entity_usage
* delete rows based on a page id and a list of entity ids from entity_usage
* insert rows for a page / entity pair into entity_usage
* query rows for a set of entities from entity_usage (with no page id specified).
* add rows for a set of (newly used) entities to the entity_per_client table
* remove rows for a set of (no longer used) entities from the entity_per_client table
The following would happen when dispatching a change from wikibase:
* looking up interested wikis for a list of entities from the entity_per_client
table.
* (notification via the job queue)
* looking up pages to be purged/updated based on a list of entity ids (and
possibly an aspect id) in the entity_usage table.
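To make these operations concrete, here is roughly what they could look
like through MediaWiki's DB abstraction layer (a sketch only; the eu_* and
epc_* column names are my shorthand, the actual schema is in the gerrit
change above):

// Client side: get all entities used on a given page.
$dbr = wfGetDB( DB_SLAVE );
$res = $dbr->select(
	'entity_usage',
	array( 'eu_entity_id', 'eu_aspect' ),
	array( 'eu_page_id' => $pageId ),
	__METHOD__
);

// Delete rows for entities no longer used on the page.
$dbw = wfGetDB( DB_MASTER );
$dbw->delete(
	'entity_usage',
	array(
		'eu_page_id' => $pageId,
		'eu_entity_id' => $removedEntityIds,
	),
	__METHOD__
);

// Insert rows for newly used entities.
$rows = array();
foreach ( $addedEntityIds as $entityId ) {
	$rows[] = array(
		'eu_page_id' => $pageId,
		'eu_entity_id' => $entityId,
	);
}
$dbw->insert( 'entity_usage', $rows, __METHOD__ );

// Repo side, when dispatching a change: look up which wikis are
// interested in the affected entities.
$repoDbr = wfGetDB( DB_SLAVE );
$res = $repoDbr->select(
	'entity_per_client',
	array( 'epc_client_id' ),
	array( 'epc_entity_id' => $entityIds ),
	__METHOD__
);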
-- daniel
--
Daniel Kinzler
Senior Software Developer
Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.
Hi Katie, hi Rob!
Thanks for your swift replies.
Am 10.09.2014 04:59, schrieb Katie Filbert:
> On Tue, Sep 9, 2014 at 7:25 PM, Daniel Kinzler <daniel.kinzler(a)wikimedia.de> wrote:
...
> So, there are two things we need to know in order to make an informed decision:
>
> 1) can we use the Ubuntu LTS packages for symfony and dbal?
>
>
> which versions are they?
php-doctrine-dbal (2.3.4-1)
php-symfony-console (2.3.1+dfsg-1)
Both are "universe" packages, no not "officially" supported by Ubuntu. But so is
php5-memcached and other stuff we use.
> are they hhvm-compatible?
>
> http://hhvm.h4cc.de/package/doctrine/dbal
The hhvm status page for dbal lists everything before 2.5 as "not tested", so we
just don't know whether it would work with hhvm. Worth a try, I suppose.
I can't find anything about hhvm compatibility on symfony.com, but I found some
blog posts from 2013 discussing symfony performance on hhvm, so that should work
fine.
Am 10.09.2014 07:34, schrieb Rob Lanphier:
...
> I'm kinda regretting bringing up the apt repo case, because it's basically
> a loophole in our review strategy, not something I'm sure we want to
> encourage or something I believe will get a great deal of traction.
> There's an outside chance that people shrug and say "sure, why not?",
> but I don't think it's going to be generally attractive.
I wouldn't call that a loophole, but rather an inconsistency. Nobody is going to
review libpcre, or the mysql client library, or Lucene. I find it confusing that
anything written in PHP apparently needs full review, while we just trust stuff
written in C or Java.
I'm not saying we should deploy just any 3rd party code without review. I'm
saying we should have better criteria than the language the library is written
in. One criterion could be the size of the install base - many eyes. But then,
Heartbleed happened...
>> 1) can we use the Ubuntu LTS packages for symfony and dbal?
>
> Per above, probably not likely.
But using, say, php5-curl or php5-oauth is fine?...
I don't want to whine and complain about the deployment/review policy; I'm
trying to find out whether there is a policy, and if so, what it is and what
rationale it is based on.
>> 2) when is 14.04 going to be rolled out?
>
> Concurrently with the HHVM upgrade. In the coming weeks. We don't
> have a set end date for when all machines will be converted, but we
> should be well underway by the end of this month. There's probably
> going to be a stubborn service or two that will stick around quite a
> bit longer than that, though.
Sadly, Wikibase seems to have some compat issues with hhvm. At least that would
be an explanation for the breakage on beta. We currently blame it on class
aliases, but nobody really knows. It seems kind of random...
>> Who can answer these questions? How do we poke TechOps?
>
> The Wikimedia Operations list is a good place to ask about this type of thing.
Yay, yet another mailing list!
Thanks,
Daniel
--
Daniel Kinzler
Senior Software Developer
Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.
Hoi,
The latest statistics show that we may finally say that "only 50% of our
items have zero, one or two statements". No longer can we say that more than
50% of our items have zero or one statement.
THAT is in my opinion a great indication of our progress in bringing data
to Wikidata. As a consequence we become more informative and useful.
It is important to know that the first statements are in many ways the most
difficult but also the most important. They often include the "instance of"
or "subclass of" statements.
Congratulations !!!
Thanks,
GerardM
Hi all!
tl;dr: How to best handle the situation of an old parser cache entry not
containing all the info expected by a newly deployed version of code?
We are currently working to improve our usage of the parser cache for
Wikibase/Wikidata. E.g., we are attaching additional information related to
language links to the ParserOutput, so we can use it in the skin when generating
the sidebar.
However, when we change what gets stored in the parser cache, we still need to
deal with old cache entries that do not yet have the desired information
attached. Here are a few options we have if the expected info isn't in the cached
ParserOutput:
1) ...then generate it on the fly. On every page view, until the parser cache is
purged. This seems bad especially if generating the required info means hitting
the database.
2) ...then invalidate the parser cache for this page, and then a) just live with
this request missing a bit of output, b) generate it on the fly, or c) trigger a
self-redirect.
3) ...then generate it, attach it to the ParserOutput, and push the updated
ParserOutput object back into the cache. This seems nice, but I'm not sure how
to do that.
4) ...then force a full re-rendering and re-caching of the page, then continue.
I'm not sure how to do this cleanly.
So, the simplest solution seems to be 2, but it means that we potentially
invalidate the parser cache of *every* page on the wiki (though we will not hit the
long tail of rarely viewed pages immediately). It effectively means that any
such change requires all pages to be re-rendered eventually. Is that acceptable?
Solution 3 seems nice and surgical, just injecting the new info into the cached
object. Is there a nice and clean way to *update* a parser cache entry like
that, without re-generating it in full? Do you see any issues with this
approach? Is it worth the trouble?
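For reference, here is roughly what I imagine option 3 to look like, using
ParserOutput's extension data and re-saving via the ParserCache. The
'wikibase-languagelinks' key and buildLanguageLinkData() are just
placeholders, and whether re-saving like this is actually safe is exactly
what I'm asking:

$parserCache = ParserCache::singleton();
$parserOutput = $parserCache->get( $wikiPage, $parserOptions );

if ( $parserOutput
	&& $parserOutput->getExtensionData( 'wikibase-languagelinks' ) === null
) {
	// Old cache entry: generate the missing data once, on the fly...
	$links = $this->buildLanguageLinkData( $wikiPage->getTitle() );
	$parserOutput->setExtensionData( 'wikibase-languagelinks', $links );

	// ...and write the updated object back, so later page views
	// find the data in the cache.
	$parserCache->save( $parserOutput, $wikiPage, $parserOptions );
}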
Any input would be great!
Thanks,
daniel
--
Daniel Kinzler
Senior Software Developer
Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.