Wikidata November 2015

wikidata@lists.wikimedia.org

63 participants
49 discussions

We plan to drop the wb_entity_per_page table
by hoo 01 Jun '17

01 Jun '17

Hey folks, we plan to drop the wb_entity_per_page table sometime soon[0], because it is just not required (as we will likely always have a programmatic mapping from entity id to page title) and it does not supported non -numeric entity ids as it is now. Due to this removing it is a blocker for the commons metadata. Is anybody using that for their tools (on tool labs)? If so, please tell us so that we can give you instructions and a longer grace period to update your scripts. Cheers, Marius [0]: https://phabricator.wikimedia.org/T95685

5 5

People who died in 2015 who were Dutch
by Gerard Meijssen 31 Aug '16

31 Aug '16

Hoi, Jura1 created a wonderful list of people who died in Brazil in 2015 [1]. It is a page that may update regularly from Wikidata thanks to the ListeriaBot. Obviously, there may be a few more because I am falling ever more behind with my quest for registering deaths in 2015. I have copied his work and created a page for people who died in the Netherlands in 2015 [2]. It is trivially easy to do this and, the result is great. The result looks great, it can be used for any country in any Wikipedia The Dutch Wikipedia indicated that they nowadays maintain important metadata at Wikidata. I am really happy that we can showcase their work. It is important work because as someone reminded me at some stage, this is part of what amounts to the policy of living people... Thanks, GerardM [1] https://www.wikidata.org/wiki/User:Jura1/Recent_deaths_in_Brazil [2] https://www.wikidata.org/wiki/User:Jura1/Recent_deaths_in_the_Netherlands

10 35

Freebase to Wikidata: Results from Tpt internship
by Denny Vrandečić 23 Feb '16

23 Feb '16

Hi all, as you know, Tpt has been working as an intern this summer at Google. He finished his work a few weeks ago and I am happy to announce today the publication of all scripts and the resulting data he has been working on. Additionally, we publish a few novel visualizations of the data in Wikidata and Freebase. We are still working on the actual report summarizing the effort and providing numbers on its effectiveness and progress. This will take another few weeks. First, thanks to Tpt for his amazing work! I have not expected to see such rich results. He has exceeded my expectations by far, and produced much more transferable data than I expected. Additionally, he also was working on the primary sources tool directly and helped Marco Fossati to upload a second, sports-related dataset (you can select that by clicking on the gears icon next to the Freebase item link in the sidebar on Wikidata, when you switch on the Primary Sources tool). The scripts that were created and used can be found here: https://github.com/google/freebase-wikidata-converter All scripts are released under the Apache license v2. The following data files are also released. All data is released under the CC0 license (in order to make this explicit, a comment has been added to the start of each file, stating the copyright and the license. If any script dealing with the files hiccups due to that line, simply remove the first line). https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-mapped-mis… The actual missing statements, including URLs for sources, are in this file. This was filtered against statements already existing in Wikidata, and the statements are mapped to Wikidata IDs. This contains about 14.3M statements (214MB gzipped, 831MB unzipped). These are created using the mappings below in addition to the mappings already in Wikidata. The quality of these statements is rather mixed. Additional datasets that we know meet a higher quality bar have been previously released and uploaded directly to Wikidata by Tpt, following community consultation. https://tools.wmflabs.org/wikidata-primary-sources/data/additional-mapping.… Contains additional mappings between Freebase MIDs and Wikidata QIDs, which are not available in Wikidata. These are mappings based on statistical methods and single interwiki links. Unlike the first set of mappings we had created and published previously (which required multiple interwiki links at least), these mappings are expected to have a lower quality - sufficient for a manual process, but probably not sufficient for an automatic upload. This contains about 3.4M mappings (30 MB gzipped, 64MB unzipped). https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-new-labels… This file includes labels and aliases for Wikidata items which seem to be currently missing. The quality of these labels is undetermined. The file contains about 860k labels in about 160 languages, with 33 languages having more than 10k labels each (14MB gzipped, 32MB unzipped). https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-reviewed-m… This is an interesting file as it includes a quality signal for the statements in Freebase. What you will find here are ordered pairs of Freebase mids and properties, each indicating that the given pair were going through a review process and likely have a higher quality on average. This is only for those pairs that are missing from Wikidata. The file includes about 1.4M pairs, and this can be used for importing part of the data directly (6MB gzipped, 52MB unzipped). Now anyone can take the statements, analyse them, slice and dice them, upload them, use them for your own tools and games, etc. They remain available through the primary sources tool as well, which has already led to several thousand new statements in the last few weeks. Additionally, Tpt and I created in the last few days of his internship a few visualizations of the current data in Wikidata and in Freebase. First, the following is a visualization of the whole of Wikidata: https://tools.wmflabs.org/wikidata-primary-sources/data/wikidata-color.png The visualization needs a bit of explanation, I guess. The y-axis (up/down) represents time, the x-axis (left/right) represents space / geolocation. The further down, the closer you are to the present, the further up the more you go in the past. Time is given in a rational scale - the 20th century gets much more space than the 1st century. The x-axis represents longitude, with the prime meridian in the center of the image. Every item is being put at its longitude (averaged, if several) and at its earliest point of time mentioned on the item. For items without either, neighbouring items propagate their value to them (averaging, if necessary). This is done repeatedly until the items are saturated. In order to understand that a bit better, the following image offers a supporting grid: each line from left to right represents a century (up to the first century), and each line from top to bottom represent a meridian (with London in the middle of the graph). https://tools.wmflabs.org/wikidata-primary-sources/data/wikidata-grid-color… The same visualizations has also been created for Freebase: https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-color.png https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-grid-color… In order to compare the two graphs, we also overlaid them over each other. I will leave the interpretation to you, but you can easily see the strengths of weaknesses of both knowledge bases. https://tools.wmflabs.org/wikidata-primary-sources/data/wikidata-red-freeba… https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-red-wikida… The programs for creating the visualizations are all available in the Github repository mentioned above (plenty of RAM is recommended to run it). Enjoy the visualizations, the data and the script! Tpt and I are available to answer questions. I hope this will help with understanding and analysing some of the results of the work that we did this summer. Cheers, Denny

10 10

Maintenance scripts for clients
by John Erling Blad 01 Feb '16

01 Feb '16

We lack several maintenance scripts for the clients, that is human readable special pages with reports on which pages lacks special treatment. In no particular order we need some way to identify unconnected pages in general (the present one does not work [1]), we need some way to identify pages that are unconnected but has some language links, we need to identify items that are used in some language and lacks labels (almost like [2],but on the client and for items that are somehow connected to pages on the client), and we need to identify items that lacks specific claims and the client pages use a specific template. There are probably more such maintenance pages, these are those that are most urgent. Now users start to create categories to hack around the missing maintenance pages, which create a bunch of categories.[3] At Norwegian Bokmål there are just a few scripts that utilize data from Wikidata, still the number of categories starts to grow large. For us at the "receiving end" this is a show stopper. We can't convince the users that this is a positive addition to the pages without the maintenance scripts, because them we more or less are in the blind when we try to fix errors. We can't use random pages to try to prod the pages to find something that is wrong, we must be able to search for the errors and fix them. This summer we (nowiki) have added about ten (10) properties to the infobokses, some with scripts and some with the property parser function. Most of my time I have not been coding, and I have not been fixing errors. I have been trying to explain to the community why Wikidata is a good idea. At one point the changes was even reverted because someone disagree with what we had done. The whole thing basically revolves around "my article got an Q-id in the infobox and I don't know how to fix it". We know how to fix it, and I have explained that to the editors at nowiki several times. They still don't get it, so we need some way to fix it, and we don't have maintenance scripts to do it. Right now we don't need more wild ideas that will swamp the development for months and years to come, we need maintenance scripts, and we need them now! [1] https://no.wikipedia.org/wiki/Spesial:UnconnectedPages [2] https://www.wikidata.org/wiki/Special:EntitiesWithoutLabel [3] https://no.wikipedia.org/wiki/Spesial:Prefiksindeks/Kategori:Artikler_hvor John Erling Blad /jeblad

4 7

Goal: Establish a framework to engage with data engineers and open data organizations
by Quim Gil 01 Feb '16

01 Feb '16

Hi, it's first of July and I would like to introduce you a quarterly goal that the Engineering Community team has committed to: Establish a framework to engage with data engineers and open data organizations https://phabricator.wikimedia.org/T101950 We are missing a community framework allowing Wikidata content and tech contributors, data engineers, and open data organizations to collaborate effectively. Imagine GLAM applied to data. If all goes well, by the end of September we would like to have basic documentation and community processes for open data engineers and organizations willing to contribute to Wikidata, and ongoing projects with one open data org. If you are interested, get involved! We are looking for * Wikidata contributors with good institutional memory * people that has been in touch with organizations willing to contribute their open data * developers willing to help improving our software and programming missing pieces * also contributors familiar with the GLAM model(s), what works and what didn't work This goal has been created after some conversations with Lydia Pintscher (Wikidata team) and Sylvia Ventura (Strategic Partnerships). Both are on board, Lydia assuring that this work fits into what is technically effective, and Sylvia checking our work against real open data organizations willing to get involved. This email effectively starts the bootstrapping of this project. I will start creating subtasks under that goal based on your feedback and common sense. -- Quim Gil Engineering Community Manager @ Wikimedia Foundation http://www.mediawiki.org/wiki/User:Qgil

12 19

Re: [Wikidata] [Wikimedia-l] Quality issues
by Andreas Kolbe 10 Dec '15

10 Dec '15

Gerard, On Tue, Nov 24, 2015 at 7:15 AM, Gerard Meijssen <gerard.meijssen(a)gmail.com> wrote: > Hoi, > To start of, results from the past are no indications of results in the > future. It is the disclaimer insurance companies have to state in all their > adverts in the Netherlands. When you continue and make it a "theological" > issue, you lose me because I am not of this faith, far from it. Wikidata is > its own project and it is utterly dissimilar from Wikipedia.To start of > Wikidata has been a certified success from the start. The improvement it > brought by bringing all interwiki links together is enormous.That alone > should be a pointer that Wikipedia think is not realistic. > These benefits are internal to Wikimedia and a completely separate issue from third-party re-use of Wikidata content as a default reference source, which is the issue of concern here. To continue, people have been importing data into Wikidata from the start. > They are the statements you know and, it was possible to import them from > Wikipedia because of these interwiki links. So when you call for sources, > it is fairly save to assume that those imports are supported by the quality > of the statements of the Wikipedias The quality of three-quarters of the 280+ Wikipedia language versions is about at the level the English Wikipedia had reached in 2002. Even some of the larger Wikipedias have significant problems. The Kazakh Wikipedia for example is controlled by functionaries of an oppressive regime[1], and the Croatian one is reportedly[2] controlled by fascists rewriting history (unless things have improved markedly in the Croatian Wikipedia since that report, which would be news to me). The Azerbaijani Wikipedia seems to have problems as well. The Wikimedia movement has always had an important principle: that all content should be traceable to a "reliable source". Throughout the first decade of this movement and beyond, Wikimedia content has never been considered a reliable source. For example, you can't use a Wikipedia article as a reference in another Wikipedia article. Another important principle has been the disclaimer: pointing out to people that the data is anonymously crowdsourced, and that there is no guarantee of reliability or fitness for use. Both of these principles are now being jettisoned. Wikipedia content is considered a reliable source in Wikidata, and Wikidata content is used as a reliable source by Google, where it appears without any indication of its provenance. This is a reflection of the fact that Wikidata, unlike Wikipedia, comes with a CC0 licence. That decision was, I understand, made by Denny, who is both a Google employee and a WMF board member. The benefit to Google is very clear: this free, unattributed content adds value to Google's search engine result pages, and improves Google's revenue (currently running at about $10 million an hour, much of it from ads). But what is the benefit to the end user? The end user gets information of undisclosed provenance, which is presented to them as authoritative, even though it may be compromised. In what sense is that an improvement for society? To me, the ongoing information revolution is like the 19th century industrial revolution done over. It created whole new categories of abuse, which it took a century to (partly) eliminate. But first, capitalists had a field day, and the people who were screwed were the common folk. Could we not try to learn from history? > and if anything, that is also where > they typically fail because many assumptions at Wikipedia are plain wrong > at Wikidata. For instance a listed building is not the organisation the > building is known for. At Wikidata they each need their own item and > associated statements. > > Wikidata is already a success for other reasons. VIAF no longer links to > Wikipedia but to Wikidata. The biggest benefit of this move is for people > who are not interested in English. Because of this change VIAF links > through Wikidata to all Wikipedias not only en.wp. Consequently people may > find through VIAF Wikipedia articles in their own language through their > library systems. > At the recent Wikiconference USA, a Wikimedia veteran and professional librarian expressed the view to me that * circular referencing between VIAF and Wikidata will create a humongous muddle that nobody will be able to sort out again afterwards, because – unlike wiki mishaps in other topic areas – here it's the most authoritative sources that are being corrupted by circular referencing; * third parties are using Wikimedia content as a *reference standard *when that was never the intention (see above). I've seen German Wikimedians express concerns that quality assurance standards have dropped alarmingly since the project began, with bot users mass-importing unreliable data. > So do not forget about Wikipedia and the lessons learned. These lessons are > important to Wikipedia. However, they do not necessarily apply to Wikidata > particularly when you approach Wikidata as an opportunity to do things in a > different way. Set theory, a branch of mathematics, is exactly what we > need. When we have data at Wikidata of a given quality.. eg 90% and we have > data at another source with a given quality eg 90%, we can compare the two > and find a subset where the two sources do not match. When we curate the > differences, it is highly likely that we improve quality at Wikidata or at > the other source. This sounds like "Let's do it quick and dirty and worry about the problems later". I sometimes get the feeling software engineers just love a programming challenge, because that's where they can hone and display their skills. Dirty data is one of those challenges: all the clever things one can do to clean up the data! There is tremendous optimism about what can be done. But why have bad data in the first place, starting with rubbish and then proving that it can be cleaned up a bit using clever software? The effort will make the engineer look good, sure, but there will always be collateral damage as errors propagate before they are fixed. The engineer's eyes are not typically on the content, but on their software. The content their bots and programs manipulate at times seems almost incidental, something for "others" to worry about – "others" who don't necessarily exist in sufficient numbers to ensure quality. In short, my feeling is that the engineering enthusiasm and expertise applied to Wikidata aren't balanced by a similar level of commitment to scholarship in generating the data, and getting them right first time. We've seen where that approach can lead with Wikipedia. Wikipedia hoaxes and falsehoods find their way into the blogosphere, the media, even the academic literature. The stakes with Wikidata are potentially much higher, because I fear errors in Wikidata stand a good chance of being massively propagated by Google's present and future automated information delivery mechanisms, which are completely opaque. Most internet users aren't even aware to what extent the Google Knowledge Graph relies on anonymously compiled, crowdsourced data; they will just assume that if Google says it, it must be true. In addition to honest mistakes, transcription errors, outdated info etc., the whole thing is a propagandist's wet dream. Anonymous accounts! Guaranteed identity protection! Plausible deniability! No legal liability! Automated import and dissemination without human oversight! Massive impact on public opinion![3] If information is power, then this provides the best chance of a power grab humanity has seen since the invention of the newspaper. In the media landscape, you at least have right-wing, centrist and left-wing publications each presenting their version of the truth, and you know who's publishing what and what agenda they follow. You can pick and choose, compare and contrast, read between the lines. We won't have that online. Wikimedia-fuelled search engines like Google and Bing dominate the information supply. The right to enjoy a pluralist media landscape, populated by players who are accountable to the public, was hard won in centuries past. Some countries still don't enjoy that luxury today. Are we now blithely giving it away, in the name of progress, and for the greater glory of technocrats? I don't trust the way this is going. I see a distinct possibility that we'll end up with false information in Wikidata (or, rather, the Google Knowledge Graph) being used to "correct" accurate information in other sources, just because the Google/Wikidata content is ubiquitous. If you build circular referencing loops fuelled by spurious data, you don't provide access to knowledge, you destroy it. A lie told often enough etc. To quote Heather Ford and Mark Graham, "We know that the engineers and developers, volunteers and passionate technologists are often trying to do their best in difficult circumstances. But there need to be better attempts by people working on these platforms to explain how decisions are made about what is represented. These may just look like unimportant lines of code in some system somewhere, but they have a very real impact on the identities and futures of people who are often far removed from the conversations happening among engineers." I agree with that. The "what" should be more important than the "how", and at present it doesn't seem to be. It's well worth thinking about, and having a debate about what can be done to prevent the worst from happening. In particular, I would like to see the decision to publish Wikidata under a CC0 licence revisited. The public should know where the data it gets comes from; that's a basic issue of transparency. Andreas [1] https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2015-10-07/Op-ed [2] http://www.dailydot.com/politics/croatian-wikipedia-fascist-takeover-contro… [3] http://www.politico.com/magazine/story/2015/08/how-google-could-rig-the-201…

5 9

provenance tracking for high volume edit sources (was Data model explanation and protection)
by Benjamin Good 09 Dec '15

09 Dec '15

In another thread, we are discussing the preponderance of problematic merges of gene/protein items. One of the hypotheses raised to explain the volume and nature of these merges (which are often by fairly inexperienced editors and/or people that seem to only do merges) was that they were coming from the wikidata game. It seems to me that anything like the wikidata game that has the potential to generate a very large volume of edits - especially from new editors - ought to tag its contributions so that they can easily be tracked by the system. It should be easy to answer the question of whether an edit came from that game (or any of what I hope to be many of its descendants). This will make it possible to debug what could potentially be large swathes of problems and to make it straightforward to 'reward' game/other developers with information about the volume of the edits that they have enabled directly from the system (as opposed to their own tracking data). Please don't misunderstand me. I am a big fan of the wikidata game and actually am pushing for our group to make a bio-specific version of it that will build on that code. I see a great potential here - but because of the potential scale of edits this could quickly generate, we (the whole wikidata community) need ways to keep an eye on what is going on. -Ben

3 4

Does a REST services exist that converts Wikipedia url to Wikidata Q id, and the converse?
by james＠j1w.xyz 09 Dec '15

09 Dec '15

I'm creating an app that provides Wikidata info in slide-out menus on top of Wikipedia pages. Here's a video of a prototype: https://vimeo.com/146061825 Much of the app will be implemented as REST services in the cloud, and one item of functionality required will be a REST service that returns the Q id given a Wikipedia URL (in any language). Another REST service required will return a Wikipedia URL given a Wikidata Q id and language code (e.g. "en" or "pt-br"). Does anything like this currently exist? Regards, James Weaver http://JavaFXpert.com http://CulturedEar.com

5 6

REST API for Wikidata
by Jeroen De Dauw 06 Dec '15

06 Dec '15

Hey all, I've created a very rough REST API for Wikidata and am looking for your feedback. * About this API: http://queryr.wmflabs.org * Documentation: http://queryr.wmflabs.org/about/docs * API root: http://queryr.wmflabs.org/api At present this is purely a demo. The data it serves is stale and potentially incomplete, the endpoints and formats they use are very much liable to change, the server setup is not reliable and I'm not 100% sure I'll continue with this little project. The main thing I'm going for with this API compared to the existing one is greater ease of use for common use cases. Several factors make this a lot easier to do in a new API than in the existing one: no need the serve all use cases, no need to retain compatibility with existing users and no framework imposed restrictions. You can read more about the difference on the website. You are invited to comment on the concept and on the open questions mentioned on the website. Cheers -- Jeroen De Dauw - http://www.bn2vs.com Software craftsmanship advocate ~=[,,_,,]:3

12 26

data access for Wikispecies, MediaWiki, Meta and Wikinews is coming
by Lydia Pintscher 02 Dec '15

02 Dec '15

Hey everyone :) Things seem to be going well so we'll move forward and give another bunch of projects access to the data on Wikidata. They so far only have access to the sitelinks. This will be: Wikispecies, MediaWiki, Meta and Wikinews. We'll do this on 2. December. Cheers Lydia -- Lydia Pintscher - http://about.me/lydia.pintscher Product Manager for Wikidata Wikimedia Deutschland e.V. Tempelhofer Ufer 23-24 10963 Berlin www.wikimedia.de Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V. Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter der Nummer 23855 Nz. Als gemeinnützig anerkannt durch das Finanzamt für Körperschaften I Berlin, Steuernummer 27/681/51985.

2 3

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

Wikidata November 2015