Hey Steffen and Andy,
Continuing what I started on Twitter here, as some more characters might be
helpful :)
It seems that both our projects (FLOW3 and Wikidata) are in a similar
situation. We are using Gerrit as our code review tool, and TravisCI to
run our tests. And we both want to have Travis run tests for all patchsets
submitted to Gerrit, and then vote +1 or -1 on Verified based on the build
passing or failing. To what extent have you gotten such a thing to work on
your
project? Is there code available anywhere? If both projects can use the
same code for this, I'd be happy to contribute to what you already have.
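(For reference, the Gerrit side of this presumably boils down to its SSH
review command; a sketch, with host, user and change numbers as
placeholders:

ssh -p 29418 bot@gerrit.example.org gerrit review --verified +1 <change>,<patchset>

The interesting part would be wiring Travis up to run that with the right
patchset after each build.)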
Cheers
--
Jeroen De Dauw
http://www.bn2vs.com
Don't panic. Don't be evil. ~=[,,_,,]:3
--
Hi!
I'm getting closer and closer to writing the UI for badges in the repo.
Does anyone have any ideas?
The only thing I have come up with is an additional row beneath each site
link which behaves (and looks) like alias editing currently does, but I
think that this would take up too much space.
Thanks,
Michał
We are planning to deploy URLs as data values rather soon (i.e. September
9, if all goes well).
There was a discussion on wikidata-l mailing list:
<http://www.mail-archive.com/wikidata-l@lists.wikimedia.org/msg02664.html>
The current implementation for URLs uses a string data value. There was
also an IRI data value developed (for this use case), but in a previous
(internal) discussion it was decided to use the string value instead.
The above thread included a few strong arguments by Markus for using the
IRI data value. If we want to do this, we need to decide that very quickly,
and change it accordingly.
Let's see if we can make the decision here on this list. We need to make
the decision by Monday at the latest, preferably earlier.
Here are my current thoughts (check also the above-mentioned thread if you
have not already). Just to point out my current bias: I currently have a
preference for using the string value, but I want wider input.
* I do not see the advantage of representing
'http://www.ietf.org/rfc/rfc1738.txt' as a structured data value of the
form { protocol : 'http', hierarchicalpart : 'www.ietf.org/rfc/rfc1738.txt',
query : '', fragment : '' }.
* If we use the string value, a number of necessary features come for free,
like the diffing, displaying it in the diffs, etc. Sure, there is the
argument that we can use the getString method for these, but then what is
the use case that we actually serve by using the structured data?
* I understand the advantages of being able to *identify* whether the value
of a snak is a string or a URL, but those seem to be the same advantages as
for knowing whether the value of a snak is a Commons media file name or a
string. None of the use cases, though, explain why using the
above data structure is advantageous over a simple string value.
Please let us collect the arguments for and against using the IRI data
value *structure* here (not about being able to *identify* whether a value
is an IRI or a plain string).
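To make the two options concrete, here is a minimal sketch in PHP (the
StringValue usage matches what we have; the IriValue constructor signature
is an assumption based on the structure shown above, not actual code):

// Option 1: the current approach, a plain string data value.
$url = new DataValues\StringValue( 'http://www.ietf.org/rfc/rfc1738.txt' );
$url->getValue(); // the full URL as one string

// Option 2: the structured IRI data value (field order assumed).
$url = new DataValues\IriValue(
	'http',                         // protocol
	'www.ietf.org/rfc/rfc1738.txt', // hierarchical part
	'',                             // query
	''                              // fragment
);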
Not completely independent of that, there are a few questions that need to
be answered but that are not as immediate, i.e. do not have to be decided
by next week:
* Should, in the external JSON structure, the data value type be listed
for every snak (as it currently is)? I.e. should it state "string" instead
of "Commons media filename"?
* Should, in the external JSON structure, the data type of the property
used be listed for every snak? This would then say URL, and it would solve
all the use cases mentioned by Markus, which rely on *identifying* this
distinction, not on the actual IRI data structure.
* Should, in the internal JSON structure, anything be changed?
The external JSON structure is the one used when communicating through the
API.
The internal JSON structure is the one that you get when using the dumps.
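To illustrate the first two questions, a snak carrying both pieces of
information could look roughly like this (sketched as a PHP array; the key
names and the property id are assumptions for illustration, not the actual
serialization format):

$snak = array(
	'snaktype' => 'value',
	'property' => 'P123',
	// data type of the property (second question):
	'datatype' => 'url',
	'datavalue' => array(
		// data value type (first question):
		'type' => 'string',
		'value' => 'http://www.ietf.org/rfc/rfc1738.txt',
	),
);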
We need to have an export of the whole Wikidata knowledge base in the
external JSON format, sooner rather than later, and hopefully also in RDF.
The lack of these dumps should not influence our decision right now, imho :)
Cheers,
Denny
--
Project director Wikidata
Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
Tel. +49-30-219 158 26-0 | http://wikimedia.de
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
Registered in the register of associations of the Amtsgericht
Berlin-Charlottenburg under number 23855 B. Recognized as charitable by the
Finanzamt für Körperschaften I Berlin, tax number 27/681/51985.
Hey,
Now that the design issues of EntityId have been fixed, it's Entity's turn :)
(Note: this is about domain layer implementation details. Not considering
changing anything visible to the user here.)
While working on the QueryEntity together with addshore, we ran into a
number of issues with the current implementation of Entity. The main
problem is that Entity objects are constructed from their internal
"serialization". The constructor, which is marked as protected, takes in
this "serialization" (in array form). This is rather awkward, consider how
we now typically construct a Property:
$property = Property::newEmpty();
$property->setDataTypeId( $id );
(We also have a static newFromDataTypeId which wraps this.)
There is a bunch of code that assumes one can create empty Entity objects,
especially in tests. I now think it was a mistake to allow this at all for
Property, which should not be constructed without a dataTypeId. The same
goes for QueryEntity, which should not be constructed without an Ask\Query.
It'd be much nicer if people could just use the constructors of the objects
and have these enforce the list of required parameters. They'd just take
the actual objects and not serializations.
$property = new Property( $id );
$queryEntity = new QueryEntity( $askQuery );
And since these things are enforced, one now gets back a string when
calling getDataTypeId, and an Ask\Query when calling getQueryDefinition,
rather than either that type or null.
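A minimal sketch of what this could look like (simplified and hypothetical,
just to illustrate the constructor enforcing its parameter):

class Property extends Entity {

	private $dataTypeId;

	/**
	 * @param string $dataTypeId
	 */
	public function __construct( $dataTypeId ) {
		$this->dataTypeId = $dataTypeId;
	}

	/**
	 * @return string (never null, since the constructor enforces it)
	 */
	public function getDataTypeId() {
		return $this->dataTypeId;
	}

}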
Serialization and deserialization code can also go into dedicated service
objects. This is already done in QueryEntity, which uses the same
serializers as the ones the web API will use, saving us from implementing a
second format, which would not be of much help here anyway (it'd save some
disk space...).
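As a rough illustration of such a service object (interface name and
signature are assumptions, not the actual serializer code):

interface Serializer {

	/**
	 * @param mixed $object
	 * @return array
	 */
	public function serialize( $object );

}

The Entity classes themselves then no longer need to know anything about
any serialization format.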
There is also a lot of room to be more strict about things. Right now you
can happily construct an Entity that has ints as aliases, or as language
codes for labels. On top of that, there are currently still TODOs from the
first months of the project in Entity related to normalization and handling
of duplicates. We might want to clearly define responsibilities at this
point :)
Oh, and of course Entity, Item, Property and Query should each go into
their own git repo.
Any objections or concerns about the above rambling?
There has lately also been some talk about doing things with Entity that
we did not consider before, such as entities that contain other entities.
Is there a list of such thoughts? If not, let's compile one here so they
can be taken into account.
Cheers
--
Jeroen De Dauw
http://www.bn2vs.com
Don't panic. Don't be evil. ~=[,,_,,]:3
--
Hey,
This email is meant to provide an overview of the plans regarding the
reorganization of the components in the DataValues.git repository.
Current component situation:
* DataValues
* ValueParsers, depends on DataValues
* ValueValidators, depends on DataValues
* ValueFormatters, depends on DataValues
* DataTypes, depends on all the above
* ValueView, depends on all the above
All of these bundle inheritance hierarchies and define both interfaces
and complex implementations.
Reorganization plans:
* DataValues, will hold interfaces, exceptions and trivial implementations
of current DataValues
* DataValues interfaces (still need a good name for this), will hold
interfaces, exceptions and trivial implementations of ValueParsers,
ValueFormatters and ValueValidators. Depends on DataValues
* DataValues implementations (still need a good name for this), will hold
common non-trivial implementations of the interfaces defined by the above
two components
* DataTypes, unchanged, now only dependent on DataValues and DataValues
interfaces
* ValueView, unchanged, now only dependent on DataTypes, DataValues and
DataValues interfaces
Dependencies are thus minimized and users are no longer forced to depend on
unstable concrete classes for no reason. Coincidentally the number of
components also drops by one.
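To illustrate the split: the interfaces component would hold definitions
such as the ValueParser interface, roughly of this shape (sketched from
memory; the actual code may differ):

namespace ValueParsers;

interface ValueParser {

	/**
	 * @param mixed $value
	 * @return \DataValues\DataValue
	 * @throws ParseException
	 */
	public function parse( $value );

}

The concrete parsers, formatters and validators built on top of such
interfaces would live in the implementations component.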
In terms of git repositories, everything is currently in a single
repository. Each component will go into its own repo, with the exception of
ValueView and DataTypes, which we'll at least initially put together. This
means creating 3 new git repos. The DataTypes git repo has already been
created, and we are awaiting removal of the old DataTypes code from
DataValues.git, which is currently blocked by a WMF configuration update.
Once this is done we can proceed with the remaining two repos.
When this reorganization is done and the components reside in their own
repos, we can make the two abstract ones releasable. These are the ones
most depended upon, and some of the current users have their own releases
blocked due to the lack of any released version of their DataValues
dependencies.
Cheers
--
Jeroen De Dauw
http://www.bn2vs.com
Don't panic. Don't be evil. ~=[,,_,,]:3
--
Hello,
currently, there is a strong interdependence between the Wikibase
development and Wikidata deployment, which has caused, and continues to
cause, friction. Sorry for that.
A suggestion made yesterday by Rob was to introduce a build step for the
deployment of Wikidata (something ULS already seems to be doing). If done
right, this would allow us to decouple the way components are split from
the way components are deployed.
If I understood it correctly, we would basically have either a deployment
module or a deployment branch on an existing module, which (preferably)
automatically gathers all dependencies, e.g. into a lib/ or dependency/
folder, or similar, and remains the sole module to be deployed.
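If we end up using Composer for the dependency gathering, the deployment
module might be little more than a manifest along these lines (a sketch
under the assumption that we use Composer; package names and versions are
made up):

{
	"name": "wikibase/deployment",
	"require": {
		"data-values/data-values": "~0.1",
		"wikibase/data-model": "~0.1"
	},
	"config": {
		"vendor-dir": "lib"
	}
}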
There are obviously a number of details to be decided, but I first wanted
to gather consensus on whether this is the way forward for the short term.
If so, we would start to work on this very soon.
Cheers,
Denny
--
Project director Wikidata
Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
Tel. +49-30-219 158 26-0 | http://wikimedia.de
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
Registered in the register of associations of the Amtsgericht
Berlin-Charlottenburg under number 23855 B. Recognized as charitable by the
Finanzamt für Körperschaften I Berlin, tax number 27/681/51985.
Hi Denny,
On Tue, Aug 27, 2013 at 4:27 AM, Denny Vrandečić <
denny.vrandecic(a)wikimedia.de> wrote:
> 2013/8/27 Rob Lanphier <robla(a)wikimedia.org>
>> On Mon, Aug 26, 2013 at 9:26 AM, Denny Vrandečić
>> <denny.vrandecic(a)wikimedia.de> wrote:
>> > We are indeed not yet using the RFC process, but I would prefer if we
>> > could agree to move to this process in the future, as this particular
>> > discussion has been going on a bit longer than I would expect for a
>> > discussion regarding the question of how to organize code.
>>
>>
>> This is not just about how we organize code. This is about how we
>> package our work. It's unusual for us to have extensions with hard
>> dependencies on other extensions, let alone the complicated hierarchy
>> you all have chosen. It may be ok to do that, but we should discuss
>> it before setting the precedent.
>>
>
> This was already discussed here:
> <http://www.mail-archive.com/wikitech-l@lists.wikimedia.org/msg69983.html>
Yes, it was discussed. I was traveling the week that thread broke out, and
never fully caught up on that particular thread (only fully read it this
morning), so my apologies.
That said, it didn't appear to me that we actually resolved anything there.
Did I miss something?
> Scribunto and Babel are other extensions that have dependencies, and there
> are more of them. We are not setting a sole precedent here.
Scribunto has optional integration with WikiEditor, SyntaxHighlight GeSHi,
and CodeEditor. Those three are all useful extensions to end users in their
own right, and Scribunto works fine without them. Moreover, I'm
comfortable that I could walk around the office and get a coherent and
mostly correct explanation about what all of these extensions do from many
(if not most) developers here.
I'm not as familiar with the dependencies that Babel has. In my cursory
look, the situation looks similar to Scribunto's dependencies. The
description for Babel could be a bit better, but it's sufficient for me to
know what end-user functionality it's providing.
I'm not saying that, because we haven't really managed dependencies between
internal libraries before, we can't do it now. However, I don't think
your examples support your case that this is already standard practice.
>> This gets exposed to end users via Special:Version:
>> https://en.wikipedia.org/wiki/Special:Version
>>
>> Since MediaWiki administrators often use this as a means of
>> understanding how to configure their wikis "like Wikipedia", it would
>> be nice if we didn't clutter that page up with a lot of the internals
>> of our systems. Each of the links should point to a page that does a
>> good job of describing what the extension does, and the vast majority
>> of them do. Unfortunately, for most of the Wikidata extensions right
>> now, the pages are pretty much boilerplate plus a one-liner in many
>> cases.
>>
>
> I just checked a random sample of other extensions, and most of them are
> just boilerplate plus a one-liner. I just started at the bottom:
> <https://www.mediawiki.org/wiki/Extension:ZeroRatedMobileAccess>
> <https://www.mediawiki.org/wiki/Extension:WikimediaMessages>
> <https://www.mediawiki.org/wiki/Extension:WikimediaShopLink>
Starting at the bottom isn't a representative sample. Moreover, the
documentation for ZeroRatedMobileAccess is mostly in the clearly linked
README, which is not the worst place for it.
For most of the other extensions I spot-checked, there was sufficient
information for me to understand the essential functionality provided.
> Also, this is a very different point than the one raised before, and we
> have, in several places, tried to improve our documentation, e.g. by
> creating this page:
> <https://www.mediawiki.org/wiki/Wikibase>
> But the quality of our documentation seems to be a shift in the focus of
> this discussion. I would be happy if we could define well what the actual
> point of discussion is, so that we can resolve it soon.
I'm trying to understand the breakup of the extensions, and in our
continuing discussion, you've pointed me at varying bits of documentation
that don't answer the questions that I have.
I'd like to understand what exactly is so terrible about the status quo
that you all are blocked in your development. If this refactoring is
really so urgent, why can't you clearly and concisely state not only what
you are doing, but *why* you are doing it?
>> Rather than get too far into the implementation details of what your
>> extensions are doing, I think maybe I'll hold off until someone on my
>> team has more time to think about this and comment on it.
>>
>
> As long as this does not contradict your other mail, where we said not to
> further delay the refactoring of the DataValues-related extensions, sure.
Well, we can move forward with this specifically:
https://gerrit.wikimedia.org/r/#/c/76481/
I would hope you can hold off on the other refactoring until there is at
least one person on my team who can confidently explain what the role of
each of the extensions you're proposing is. I was going to bite the bullet
and just get my head around it myself, but I need to be realistic about the
level of effort I can expend on this.
Rob
Did I say it'll take until next year before we have the new search
infrastructure? Looks like I was wrong; it's being beta-tested on
mediawiki.org already. Time to check how well it works with Wikibase, then!
Let's ask about that in the call tomorrow.
-- daniel
-------- Original Message --------
Subject: [Wikitech-l] New search backend live on mediawiki.org
Date: Wed, 28 Aug 2013 14:20:10 -0400
From: Nikolas Everett <neverett(a)wikimedia.org>
Reply-To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Today we threw the big lever and turned on our new search backend at
mediawiki.org. It isn't the default yet, but it is just about ready for you
to try. Here is what we think we've improved:
1. Templates are now expanded during search so:
1a. You can search for text included in templates
1b. You can search for categories included in templates
2. The search engine is updated very quickly after articles change.
3. A few funky things around intitle and incategory:
3a. You can combine them with a regular query (incategory:kings peaceful)
3b. You can use prefix searches with them (incategory:norma*)
3c. You can use them everywhere in the query (roger incategory:normans)
What we think we've made worse and we're working on fixing:
1. Because we're expanding templates, some things that probably shouldn't
be searched are being searched. We've fixed a few of these issues but I
wouldn't be surprised if more come up. We opened Bug 53426 regarding audio
tags.
2. The relative weighting of matches is going to be different. We're
still fine-tuning this and we'd appreciate any anecdotes describing search
results that seem out of order.
3. We don't currently index headings beyond the article title in any
special way. We'll be fixing that soon. (Bug 53481)
4. Searching for file names or clusters of punctuation characters doesn't
work as well as it used to. It still works reasonably well if you surround
your query in quotes but it isn't as good as it was. (Bugs 53013 and 52948)
5. "Did you mean" suggestions currently aren't highlighted at all and
sometimes we'll suggest things that aren't actually better. (Bugs 52286 and
52860)
6. incategory:"category with spaces" isn't working. (Bug 53415)
What we've changed that you probably don't care about:
1. Updating search in bulk is much slower than before. This is the
cost of expanding templates.
2. Search is now backed by a horizontally scalable search backend that is
being actively developed (Elasticsearch), so we're in a much better place
to expand on the new solution as time goes on.
Neat stuff if you run your own MediaWiki:
CirrusSearch is much easier to install than our current search
infrastructure.
So what will you notice? Nothing! That is because, while the new search
backend (CirrusSearch) is indexing, we've left the current search
infrastructure as the default while we work on our list of bugs. You can
see the results from CirrusSearch by performing your search as normal and
adding "&srbackend=CirrusSearch" to the url parameters.
If you notice any problems with CirrusSearch, please file bugs directly
against it:
https://bugzilla.wikimedia.org/enter_bug.cgi?product=MediaWiki%20extensions…
Nik Everett
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Hey,
I wrote up a little design spike for a bunch of code needed for the store
schema update functionality. This is nothing big or out of the ordinary
[0]. Since I am only continuing with this code later this week, and no real
implementation has been done so far, this commit now lends itself well to a
design-level review.
https://gerrit.wikimedia.org/r/#/c/81254/3/src/TableDefinitionReader.php
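For those who just want the gist without opening Gerrit, the reader boils
down to something of this shape (a speculative sketch based on the file
name only; the real interface is in the commit above):

interface TableDefinitionReader {

	/**
	 * @param string $tableName
	 * @return TableDefinition
	 */
	public function readDefinition( $tableName );

}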
[0] Someone please suggest writing a formal RFC, having 5 calls about it, a
dedicated architecture review, a dedicated security review, and perhaps its
own mailing list.
Cheers
--
Jeroen De Dauw
http://www.bn2vs.com
Don't panic. Don't be evil. ~=[,,_,,]:3
--
Hey,
We have a bunch of development TODOs which are blocked on a WMF deployment
config update. Aude made the following commit 3 weeks ago. Can
someone with the appropriate rights please have a look at it?
https://gerrit.wikimedia.org/r/#/c/76481/
Cheers
--
Jeroen De Dauw
http://www.bn2vs.com
Don't panic. Don't be evil. ~=[,,_,,]:3
--