Hey folks,
we plan to drop the wb_entity_per_page table sometime soon [0], because
it is just not required (we will likely always have a programmatic
mapping from entity ID to page title) and, as it is now, it does not
support non-numeric entity IDs. Because of this, its removal is a
blocker for the Commons metadata work.
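For tool authors wondering what the programmatic mapping looks like in
practice, here is a minimal sketch, assuming the standard Wikidata
namespace conventions (items in the main namespace, properties in the
Property namespace); it is not the actual replacement code:

# Hedged sketch: entity ID -> page title without wb_entity_per_page,
# assuming standard Wikidata namespace conventions.
def entity_id_to_page_title(entity_id):
    if entity_id.startswith('Q'):
        return entity_id                # items: title equals the ID
    if entity_id.startswith('P'):
        return 'Property:' + entity_id  # properties: Property namespace
    raise ValueError('unexpected entity ID: %r' % entity_id)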
Is anybody using that table for their tools (on Tool Labs)? If so,
please tell us, so that we can give you instructions and a longer grace
period to update your scripts.
Cheers,
Marius
[0]: https://phabricator.wikimedia.org/T95685
Hoi,
Jura1 created a wonderful list of people who died in Brazil in 2015 [1].
It is a page that updates regularly from Wikidata thanks to ListeriaBot.
Obviously, there may be a few more entries to come, because I am falling
ever further behind in my quest to register deaths in 2015.
I have copied his work and created a page for people who died in the
Netherlands in 2015 [2]. It is trivially easy to do, and the result
looks great; it can be used for any country in any Wikipedia.
The Dutch Wikipedia indicated that they nowadays maintain important
metadata at Wikidata. I am really happy that we can showcase their work.
It is important work because, as someone reminded me at some stage, this
is part of what amounts to the policy on living people...
Thanks,
GerardM
[1] https://www.wikidata.org/wiki/User:Jura1/Recent_deaths_in_Brazil
[2]
https://www.wikidata.org/wiki/User:Jura1/Recent_deaths_in_the_Netherlands
Hi all,
as you know, Tpt spent this summer working as an intern at Google. He
finished his work a few weeks ago, and I am happy to announce today the
publication of all the scripts and the resulting data he worked on.
Additionally, we are publishing a few novel visualizations of the data
in Wikidata and Freebase. We are still working on the actual report
summarizing the effort and providing numbers on its effectiveness and
progress; this will take another few weeks.
First, thanks to Tpt for his amazing work! I had not expected to see
such rich results. He has exceeded my expectations by far, and produced
much more transferable data than I expected. Additionally, he also
worked on the primary sources tool directly and helped Marco Fossati to
upload a second, sports-related dataset (you can select it by clicking
on the gears icon next to the Freebase item link in the sidebar on
Wikidata, once you switch on the Primary Sources tool).
The scripts that were created and used can be found here:
https://github.com/google/freebase-wikidata-converter
All scripts are released under the Apache license v2.
The following data files are also released, all under the CC0 license.
(In order to make this explicit, a comment has been added at the start
of each file, stating the copyright and the license. If any script
dealing with the files hiccups due to that line, simply remove the
first line.)
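In a script, skipping that line could look like the following sketch
(the file name here is hypothetical, and the '#' comment marker is an
assumption - check the actual files):

import gzip

# Hedged sketch: read one of the released files, dropping the
# license/copyright comment if it is the first line.
with gzip.open('freebase-mapped-missing.tsv.gz', 'rt') as f:  # hypothetical name
    for i, line in enumerate(f):
        if i == 0 and line.startswith('#'):  # assumed comment marker
            continue
        handle(line)  # placeholder for your own processing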
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-mapped-mis…
The actual missing statements, including URLs for sources, are in this
file. They were filtered against statements already existing in
Wikidata, and are mapped to Wikidata IDs. The file contains about 14.3M
statements (214MB gzipped, 831MB unzipped), created using the mappings
below in addition to the mappings already in Wikidata. The quality of
these statements is rather mixed.
Additional datasets that we know meet a higher quality bar have been
previously released and uploaded directly to Wikidata by Tpt, following
community consultation.
https://tools.wmflabs.org/wikidata-primary-sources/data/additional-mapping.…
Contains additional mappings between Freebase MIDs and Wikidata QIDs
which are not available in Wikidata. These mappings are based on
statistical methods and single interwiki links. Unlike the first set of
mappings we created and published previously (which required at least
multiple interwiki links), these are expected to be of lower quality -
sufficient for a manual process, but probably not sufficient for an
automatic upload. This contains about 3.4M mappings (30MB gzipped, 64MB
unzipped).
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-new-labels…
This file includes labels and aliases that seem to be currently missing
from Wikidata items. The quality of these labels is undetermined. The
file contains about 860k labels in about 160 languages, with 33
languages having more than 10k labels each (14MB gzipped, 32MB
unzipped).
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-reviewed-m…
This is an interesting file, as it includes a quality signal for the
statements in Freebase. What you will find here are ordered pairs of
Freebase MIDs and properties, each indicating that the given pair went
through a review process and is therefore likely of higher quality on
average. Only pairs that are missing from Wikidata are included. The
file contains about 1.4M pairs, and it can be used for importing part
of the data directly (6MB gzipped, 52MB unzipped).
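As a sketch of how that filtering could start (assuming, hypothetically,
one whitespace-separated MID/property pair per line - check the actual
file format before relying on this):

# Hedged sketch: load the reviewed (MID, property) pairs as a lookup
# set, e.g. to filter another dataset down to the reviewed subset.
reviewed = set()
with open('freebase-reviewed-pairs.tsv') as f:  # hypothetical name
    for line in f:
        if line.startswith('#'):      # skip the license header
            continue
        mid, prop = line.split()[:2]  # assumed format: MID property
        reviewed.add((mid, prop))

print('%d reviewed pairs' % len(reviewed))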
Now anyone can take the statements, analyse them, slice and dice them,
upload them, use them for their own tools and games, etc. They also
remain available through the primary sources tool, which has already
led to several thousand new statements in the last few weeks.
Additionally, Tpt and I created in the last few days of his internship a
few visualizations of the current data in Wikidata and in Freebase.
First, the following is a visualization of the whole of Wikidata:
https://tools.wmflabs.org/wikidata-primary-sources/data/wikidata-color.png
The visualization needs a bit of explanation, I guess. The y-axis
(up/down) represents time, and the x-axis (left/right) represents space
/ geolocation. The further down, the closer you are to the present; the
further up, the further you go into the past. Time is given on a
non-linear scale - the 20th century gets much more space than the 1st
century. The x-axis represents longitude, with the prime meridian in
the center of the image.
Every item is placed at its longitude (averaged, if it has several) and
at the earliest point in time mentioned on the item. Items missing
either value get it propagated from neighbouring items (averaged, if
necessary). This is done repeatedly until the items are saturated.
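In code terms, the propagation works roughly like the sketch below (a
sketch only - the real implementation is in the repository mentioned
above, and all names here are made up):

# Rough sketch of the propagation step: fill in missing values from
# the average of linked items, repeating until nothing changes.
def propagate(values, links):
    # values: item -> float or None; links: item -> list of items
    changed = True
    while changed:
        changed = False
        for item, value in values.items():
            if value is not None:
                continue
            known = [values[n] for n in links[item] if values[n] is not None]
            if known:
                values[item] = sum(known) / len(known)
                changed = True
    return values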
To understand this a bit better, the following image offers a
supporting grid: each line from left to right represents a century (up
to the first century), and each line from top to bottom represents a
meridian (with London in the middle of the graph).
https://tools.wmflabs.org/wikidata-primary-sources/data/wikidata-grid-color…
The same visualizations have also been created for Freebase:
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-color.png
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-grid-color…
In order to compare the two graphs, we also overlaid them on each
other. I will leave the interpretation to you, but you can easily see
the strengths and weaknesses of both knowledge bases.
https://tools.wmflabs.org/wikidata-primary-sources/data/wikidata-red-freeba…
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-red-wikida…
The programs for creating the visualizations are all available in the
GitHub repository mentioned above (plenty of RAM is recommended to run
them).
Enjoy the visualizations, the data and the scripts! Tpt and I are
available to answer questions. I hope this will help with understanding
and analysing some of the results of the work we did this summer.
Cheers,
Denny
We lack several maintenance scripts for the clients, that is,
human-readable special pages with reports on which pages need special
treatment. In no particular order: we need some way to identify
unconnected pages in general (the present one does not work [1]); we
need some way to identify pages that are unconnected but have some
language links; we need to identify items that are used in some
language but lack labels (almost like [2], but on the client and for
items that are somehow connected to pages on the client); and we need
to identify items that lack specific claims while the client pages use
a specific template.
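A stopgap for the first of these can be hacked together with pywikibot
from Tool Labs - a hedged sketch, far too slow to replace a proper
special page, but enough to get a report:

import pywikibot

# Hedged sketch: list articles on a client wiki without a Wikidata item.
site = pywikibot.Site('no', 'wikipedia')
for page in site.allpages(namespace=0, filterredir=False):
    try:
        pywikibot.ItemPage.fromPage(page)
    except pywikibot.NoPage:
        print('unconnected: ' + page.title())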
There are probably more such maintenance pages; these are the most
urgent ones. Users have now started to create categories to hack around
the missing maintenance pages, which creates a bunch of categories. [3]
On Norwegian Bokmål there are just a few scripts that utilize data from
Wikidata, yet the number of categories is already growing large.
For us at the "receiving end" this is a show stopper. Without the
maintenance scripts we can't convince the users that this is a positive
addition to the pages, because without them we are more or less flying
blind when we try to fix errors. We can't prod random pages hoping to
stumble on something that is wrong; we must be able to search for the
errors and fix them.
This summer we (nowiki) added about ten (10) properties to the
infoboxes, some with scripts and some with the property parser
function. Most of my time has not gone into coding or fixing errors; it
has gone into trying to explain to the community why Wikidata is a good
idea. At one point the changes were even reverted because someone
disagreed with what we had done. The whole thing basically revolves
around "my article got a Q-id in the infobox and I don't know how to
fix it". We know how to fix it, and I have explained it to the editors
at nowiki several times. They still don't get it, so we need some way
to fix it, and we don't have the maintenance scripts to do it.
Right now we don't need more wild ideas that will swamp development for
months and years to come; we need maintenance scripts, and we need them
now!
[1] https://no.wikipedia.org/wiki/Spesial:UnconnectedPages
[2] https://www.wikidata.org/wiki/Special:EntitiesWithoutLabel
[3] https://no.wikipedia.org/wiki/Spesial:Prefiksindeks/Kategori:Artikler_hvor
John Erling Blad
/jeblad
Hi, it's the first of July and I would like to introduce you to a
quarterly goal that the Engineering Community team has committed to:
Establish a framework to engage with data engineers and open data
organizations
https://phabricator.wikimedia.org/T101950
We are missing a community framework allowing Wikidata content and tech
contributors, data engineers, and open data organizations to collaborate
effectively. Imagine GLAM applied to data.
If all goes well, by the end of September we would like to have basic
documentation and community processes for open data engineers and
organizations willing to contribute to Wikidata, and ongoing projects with
one open data org.
If you are interested, get involved! We are looking for:
* Wikidata contributors with good institutional memory
* people who have been in touch with organizations willing to
contribute their open data
* developers willing to help improve our software and program the
missing pieces
* contributors familiar with the GLAM model(s), and with what worked
and what didn't
This goal was created after some conversations with Lydia Pintscher
(Wikidata team) and Sylvia Ventura (Strategic Partnerships). Both are
on board, with Lydia making sure that this work fits what is
technically sensible, and Sylvia checking our work against real open
data organizations willing to get involved.
This email effectively starts the bootstrapping of this project. I will
start creating subtasks under that goal based on your feedback and common
sense.
--
Quim Gil
Engineering Community Manager @ Wikimedia Foundation
http://www.mediawiki.org/wiki/User:Qgil
The Gene Wiki team is experiencing a problem that may suggest some
areas for improvement in the general Wikidata experience.
When our project was getting started, we had some fairly long public
debates about how we should structure the data we wanted to load [1].
These resulted in a data model that, we think, remains pretty much true to
the semantics of the data, at the cost of distributing information about
closely related things (genes, proteins, orthologs) across multiple,
interlinked items. Now, as long as these semantic links between the
different item classes are maintained, this is working out great. However,
we are consistently seeing people merge items that our model needs to
keep distinct. Most commonly, we see people merging items about genes
with items about the protein product of the gene (e.g. [2]). This
happens nearly every day - especially on items related to the more
popular Wikipedia articles. (More examples: [3])
Merges like this, as well as other semantics-breaking edits, make it
very challenging to build downstream apps (like the Wikipedia infobox)
that depend on having certain structures in place. My question to the
list is: how do we best protect semantic models that span multiple
entity types in Wikidata? Related to this, is there an opportunity for
some consistent way of explaining these structures to the community
where they exist?
I guess the immediate solutions are to (1) write another bot that
watches for model-breaking edits and reverts them (see the sketch after
this paragraph) and (2) create a page on Wikidata somewhere that
succinctly explains the model and links back to the discussions that
went into its creation.
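For (1), a crude check could look like the sketch below: it flags items
that carry both a gene identifier and a protein identifier, which under
our model should never sit on a single item. A hedged sketch only -
P351 (Entrez Gene ID) and P352 (UniProt ID) are just two examples, and
a real bot would cover the whole model and watch recent changes:

import pywikibot

# Hedged sketch: detect a likely gene/protein merge on one item.
site = pywikibot.Site('wikidata', 'wikidata')
repo = site.data_repository()

def looks_merged(qid):
    claims = pywikibot.ItemPage(repo, qid).get().get('claims', {})
    return 'P351' in claims and 'P352' in claims  # gene and protein IDs together

if looks_merged('Q417782'):  # the item from example [2]
    print('gene and protein statements on one item - likely a bad merge')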
It seems that anyone who works beyond a single entity type is going to
face the same kinds of problems, so I'm posting this here in the hope
that generalizable patterns (and perhaps even supporting code) can be
realized by this community.
[1]
https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Molecular_biology#D…
[2] https://www.wikidata.org/w/index.php?title=Q417782&oldid=262745370
[3]
https://s3.amazonaws.com/uploads.hipchat.com/25885/699742/rTrv5VgLm5yQg6z/m…
Hey folks :)
Have you ever wondered what is happening on Wikidata around you? What
does Wikidata know about the building next door? Or the tube station a
few blocks away? Now you can find out. Head over to
https://www.wikidata.org/wiki/Special:Nearby and you will know.
This is another piece in the puzzle of making it easier to get an
overview of the data you care about.
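If you'd rather poke at the same data from a script, the GeoData search
API that (as far as I know) backs Special:Nearby can be queried
directly - a sketch, with central Berlin's coordinates picked
arbitrarily:

import requests

# Hedged sketch: items with coordinates near a point, via list=geosearch.
r = requests.get('https://www.wikidata.org/w/api.php', params={
    'action': 'query', 'list': 'geosearch',
    'gscoord': '52.52|13.405',  # lat|lon
    'gsradius': 1000,           # metres
    'gslimit': 10, 'format': 'json',
})
for hit in r.json()['query']['geosearch']:
    print('%s (%sm away)' % (hit['title'], hit['dist']))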
Share if you find something cool and unexpected around you!
Cheers
Lydia
PS: A special thanks to everyone who helped get this out, including Max
Semenik, Jon Robson, Florian, Erik Bernhardson and David Causse.
--
Lydia Pintscher - http://about.me/lydia.pintscher
Product Manager for Wikidata
Wikimedia Deutschland e.V.
Tempelhofer Ufer 23-24
10963 Berlin
www.wikimedia.de
Hey everyone :)
Today we are celebrating Wikidata's 3rd birthday. I've been with the
project since we started development 3.5 years ago, and I can't believe
what a ride it has been and how far we've come.
As for every birthday, celebrations are in order. We've created a page
at https://www.wikidata.org/wiki/Wikidata:Third_Birthday. There you can
find editorials (by Harmonia Amanda, Ash Crow and me) about the past
year and what is coming; please take a moment to read them. You will
also find a section for congratulations and wishes, presents and more.
Here's to many more years of Wikidata. Stay as awesome as you are!
Cheers
Lydia
PS: The development team has presents as well. I'll send an email
about them in a few hours. Ohhh the suspense :D
--
Lydia Pintscher - http://about.me/lydia.pintscher
Product Manager for Wikidata
Wikimedia Deutschland e.V.
Tempelhofer Ufer 23-24
10963 Berlin
www.wikimedia.de
Hoi,
Arguments have been raised in which Blazegraph was key to the problem.
It is, however, a server-based tool. Would someone please install it on
Labs, thereby making it available to all of us?
That way the argument becomes one that is of relevance to all of us; at
this stage it is very much a niche issue.
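To make the request concrete: once someone has it running on Labs,
anybody could query it over plain HTTP, something like the sketch below
(the endpoint path is the stock Blazegraph default and may well differ
on an actual Labs install):

import requests

# Hedged sketch: ask a Blazegraph server for ten arbitrary triples.
endpoint = 'http://localhost:9999/blazegraph/namespace/kb/sparql'  # assumed default
r = requests.get(endpoint,
                 params={'query': 'SELECT * WHERE { ?s ?p ?o } LIMIT 10'},
                 headers={'Accept': 'application/sparql-results+json'})
for row in r.json()['results']['bindings']:
    print(row)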
Thanks,
GerardM
Hey,
It's Wikidata's third birthday, right? So I prepared three gifts for you:
1- The AI-based anti-vandalism classifier is ready, after four months
of work, thanks to Aaron Halfaker. It's so big that I can't describe it
here; this is the link to the announcement:
<https://www.wikidata.org/wiki/Wikidata:Third_Birthday/Presents/ORES>
2- Remember Kian? Using a bot, I already added 100K statements where
Kian had high certainty, but there are far more to add, and they need
human review. Thanks to Magnus Manske, we now have a game that suggests
statements based on Kian
<https://tools.wmflabs.org/wikidata-game/distributed/#game=15> and you
can simply add them. What I did was populate a database with
suggestions and build an API around it. There are 2.6 million
suggestions in 17 languages, based on 53 models. I can easily add more
languages and models - just name them :)
3- There are still lots of old interwiki links (in case you don't
remember, things like [[en:Foo]], ewww) in small wikis, especially in
the template namespace, and there is a steady flow of them being added
in medium-sized wikis. Also, in the future we will need to clean them
out of Wiktionary \o/. Now we have a script in pywikibot named
interwikidata.py
<https://github.com/wikimedia/pywikibot-core/blob/master/scripts/interwikida…>
merged ten hours ago, thanks to jayvdb and xzise. It cleans pages, adds
links to Wikidata and creates items for pages in your wiki - i.e. it's
interwiki.py, but for wikis that use Wikidata.
Just run:
python pwb.py scripts/interwikidata.py -unconnectedpages -clean -create
Or, if you are a little bit more advanced with pywikibot, write a
script based on this and handle interwiki conflicts (more help in the
source code).
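For instance, the core of such a script might look like this sketch
(heavily simplified, assuming no interwiki conflicts and a hypothetical
target wiki; the real interwikidata.py linked above does much more):

import pywikibot
from pywikibot import textlib

# Hedged sketch: strip old-style interwiki links from pages that
# already have a Wikidata item.
site = pywikibot.Site('nn', 'wikipedia')  # hypothetical target wiki
for page in site.allpages(namespace=10):  # template namespace
    if not list(page.langlinks()):        # no old-style links: nothing to do
        continue
    try:
        pywikibot.ItemPage.fromPage(page)  # only clean connected pages
    except pywikibot.NoPage:
        continue
    page.text = textlib.removeLanguageLinks(page.text, site=site)
    page.save(summary='Removing interwiki links now served by Wikidata')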
Happy birthday!
Best