Hi there,
how is the development of phase #3 (lists) going? Is it due soon?
Sub-question: I guess the sorting feature in lists will be implemented by
default, as a list without sorting would be a bad idea?
Thanks for the answer.
Cheers,
Kozuch
Heya folks,
Here's your weekly serving of Wikidata news. This time it includes a
bit of mushroom and easter eggs ;-)
http://meta.wikimedia.org/wiki/Wikidata/Status_updates/2013_08_16
Cheers
Lydia
--
Lydia Pintscher - http://about.me/lydia.pintscher
Community Communications for Technical Projects
Wikimedia Deutschland e.V.
Obentrautstr. 72
10963 Berlin
www.wikimedia.de
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
Registered in the register of associations of the Amtsgericht
Berlin-Charlottenburg under number 23855 Nz. Recognized as charitable by
the Finanzamt für Körperschaften I Berlin, tax number 27/681/51985.
Hi,
Is there a place where I can find the slides used at this Wikimania? How
about linking them on the submission page, e.g. State of
Wikidata <http://wikimania2013.wikimedia.org/wiki/Submissions/State_of_Wikidata>?
Thanks
--
Jiang BIAN
This email may be confidential or privileged. If you received this
communication by mistake, please don't forward it to anyone else, please
erase all copies and attachments, and please let me know that it went to
the wrong person. Thanks.
Heya folks :)
I just posted a note about this to the Wikivoyage Traveler Pubs. We
plan to enable access to the data on Wikidata for Wikivoyage on the
26th of August. Data like the international calling code, time zone or
currency in a Wikivoyage article can then come from Wikidata.
Please let me know if you have any questions.
Cheers
Lydia
--
Lydia Pintscher - http://about.me/lydia.pintscher
Community Communications for Technical Projects
Wikimedia Deutschland e.V.
Obentrautstr. 72
10963 Berlin
www.wikimedia.de
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
Registered in the register of associations of the Amtsgericht
Berlin-Charlottenburg under number 23855 Nz. Recognized as charitable by
the Finanzamt für Körperschaften I Berlin, tax number 27/681/51985.
There are many applications that require complex structures to represent
data, like the proposal for a multilingual Wikipedia [1] or a possible
OpenMath implementation [2].
In Wikidata there is already support for 1-depth trees (qualifiers), and I
am wondering whether n-depth trees are something that could be implemented
in other namespaces, like multilingual statements or math formulas.
Would that be possible or are there better approaches?
Cheers,
Micru
[1]
http://meta.wikimedia.org/wiki/A_proposal_towards_a_multilingual_Wikipedia
[2] http://en.wikipedia.org/wiki/Openmath#Example
I’d like to see assertions of the sort
“Picture B represents topic X”
in commons. One can easily infer this for some pictures by noticing that “Picture B is included in the encyclopedia entry for topic X”, but often there are so many pictures of the topic that they aren’t all included in the topic page.
Often I see people use categories in Commons for this purpose, and I think this function should be done in a way that is synchronized with Wikidata, which would come with many benefits.
It would also be nice to have some way to mark an image as
“A photograph of X”
“A piece of art created by X”
“A 3-d molecular rendering of X”
as well, for that matter. There will be a fringe of tasks for which categories remain necessary, but it's a good idea to make as many of them machine-readable as we can.
From: Gerard Meijssen
Sent: Tuesday, August 13, 2013 5:08 AM
To: Discussion list for the Wikidata project.
Subject: Re: [Wikidata-l] Make Commons a wikidata client
Hoi,
As far as I am concerned, the categories used for images are not really helpful. While there are many images about Kiribati, you find only a few in the category by that name; the rest can be found in subcategories.
In the proposal for Commons there is a provision for tags. These tags can be populated to some extent from the categories the images are in.
The reason to have categories is that they are intended to help find images. Without them, and without tags, we would not have Commons as a functioning entity. However, the way they work, with all these subcategories and such, prevents many people, including myself, from using Commons as the source of images when they need them.
So yes, having categories is good in a half-arsed way, but we should get rid of them because we can have something better.
One other big advantage of tags is that they are typically single concepts that typically have translations, either in the labels in Wikidata or in Wiktionary. This allows us to make Commons a truly multilingual resource.
Thanks,
GerardM
On 10 August 2013 06:19, Maarten Dammers <maarten(a)mdammers.nl> wrote:
Hi everyone,
At Wikimania we had several discussions about the future of Wikidata and Commons. Some broader feedback would be nice.
Now we have a property "Commons category" (https://www.wikidata.org/wiki/Property:P373). This is a string property and only an intermediate solution.
In the long run Commons should probably be a Wikibase instance in its own right (structured metadata stored at Commons) integrated with Wikidata.org; see https://www.wikidata.org/wiki/Wikidata:Wikimedia_Commons for more info.
In the meantime we should make Commons a Wikidata client like Wikipedia and Wikivoyage. How would that work?
We have an item https://www.wikidata.org/wiki/Q9920 for the city Haarlem. It links to the Wikipedia article "Haarlem" and the Wikivoyage article "Haarlem". It should link to the Commons gallery "Haarlem" (https://commons.wikimedia.org/wiki/Haarlem).
We have an item https://www.wikidata.org/wiki/Q7427769 for the category Haarlem. It links to the Wikipedia category "Haarlem". It should link to the Commons category "Haarlem" (https://commons.wikimedia.org/wiki/Category:Haarlem).
The category item (Q7427769) links to the article item (Q9920) using the property "main category topic" (https://www.wikidata.org/wiki/Property:P301).
We would need to make an inverse property of P301 to make the backlink.
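For concreteness, here is a small sketch (not an existing tool; the function name is made up, and the JSON paths should be double-checked against the live API) of how a bot could follow the P301 link from the category item to the article item via the wbgetentities API:

    # Hypothetical helper: resolve the "main topic" item of a category item
    # through the Wikidata API. action=wbgetentities is the real API module;
    # the claim-structure access below should be verified against live data.
    import json
    import urllib.request

    def main_topic_of(category_item):
        url = ("https://www.wikidata.org/w/api.php?action=wbgetentities"
               "&ids=%s&props=claims&format=json" % category_item)
        with urllib.request.urlopen(url) as resp:
            entity = json.load(resp)["entities"][category_item]
        claim = entity["claims"]["P301"][0]
        return "Q%d" % claim["mainsnak"]["datavalue"]["value"]["numeric-id"]

    print(main_topic_of("Q7427769"))  # expected to print Q9920 (Haarlem)

An inverse property would let a tool walk the same link in the other direction without scanning all category items.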
Some reasons why this is helpful:
* Wikidata takes care of a lot of things like page moves, deletions, etc. Now, with P373 (Commons category), it's all manual
* Having Wikidata on Commons means that you can automatically get backlinks to Wikipedia, have intros for categories, etc.
* It's a step in the right direction. It makes it easier to do next steps
Small change, lots of benefits!
Maarten
Hi,
I am happy to report that an initial, yet fully functional RDF export
for Wikidata is now available. The exports can be created using the
wda-export-data.py script of the wda toolkit [1]. This script downloads
recent Wikidata database dumps and processes them to create RDF/Turtle
files. Various options are available to customize the output (e.g., to
export statements but not references, or to export only texts in English
and Wolof). The file creation takes a few hours (about three) on my
machine, depending on what exactly is exported.
For your convenience, I have created some example exports based on
yesterday's dumps. These can be found at [2]. There are three Turtle
files: site links only, labels/descriptions/aliases only, statements
only. The fourth file is a preliminary version of the Wikibase ontology
that is used in the exports.
The export format is based on our earlier proposal [3], but it adds a
lot of details that had not been specified there yet (namespaces,
references, ID generation, compound datavalue encoding, etc.). Details
might still change, of course. We might provide regular dumps at another
location once the format is stable.
As a side effect of these activities, the wda toolkit [1] is also
getting more convenient to use. Creating code for exporting the data
into other formats is quite easy.
Features and known limitations of the wda RDF export:
(1) All current Wikidata datatypes are supported. Commons-media data is
correctly exported as URLs (not as strings).
(2) One-pass processing. Dumps are processed only once, even though this
means that we may not know the types of all properties when we first
need them: the script queries wikidata.org to find missing information.
This is only relevant when exporting statements.
(3) Limited language support. The script uses Wikidata's internal
language codes for string literals in RDF. In some cases, this might not
be correct. It would be great if somebody could create a mapping from
Wikidata language codes to BCP 47 language codes (let me know if you
think you can do this, and I'll tell you where to put it); a rough sketch
of what such a mapping could look like follows after this list.
(4) Limited site language support. To specify the language of linked
wiki sites, the script extracts a language code from the URL of the
site. Again, this might not be correct in all cases, and it would be
great if somebody had a proper mapping from Wikipedias/Wikivoyages to
language codes.
(5) Some data excluded. Data that cannot currently be edited is not
exported, even if it is found in the dumps. Examples include statement
ranks and timezones for time datavalues. I also currently exclude labels
and descriptions for simple English, formal German, and informal Dutch,
since these would pollute the label space for English, German, and Dutch
without adding much benefit (other than possibly for simple English
descriptions, I cannot see any case where these languages should ever
have different Wikidata texts at all).
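To make item (3) (and, to some extent, item (4)) more concrete, here is a rough sketch in Python of what such a mapping could look like. The individual entries are only illustrative examples of places where Wikidata's internal codes diverge from BCP 47 and would need to be checked by someone who knows the codes well; anything not listed just falls through unchanged.

    # Illustrative only: a partial mapping from Wikidata-internal language
    # codes to BCP 47 tags. The concrete entries are examples and need review.
    WIKIDATA_TO_BCP47 = {
        "als": "gsw",             # used for Alemannic on Wikimedia sites
        "be-x-old": "be-tarask",
        "zh-classical": "lzh",
        "zh-min-nan": "nan",
        "zh-yue": "yue",
        "no": "nb",
        "simple": "en-x-simple",  # Simple English has no standard subtag
        "fiu-vro": "vro",
        "bat-smg": "sgs",
    }

    def to_bcp47(code):
        """Return a BCP 47 tag for a Wikidata-internal language code."""
        return WIKIDATA_TO_BCP47.get(code, code)

    def site_language(site_id):
        """Guess the language of a site ID like 'dewiki' or 'enwikivoyage'
        (item (4)); real exceptions would go into the same kind of table."""
        for suffix in ("wikivoyage", "wiki"):
            if site_id.endswith(suffix):
                return to_bcp47(site_id[: -len(suffix)].replace("_", "-"))
        return None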
Feedback is welcome.
Cheers,
Markus
[1] https://github.com/mkroetzsch/wda
Run "python wda-export.data.py --help" for usage instructions
[2] http://semanticweb.org/RDF/Wikidata/
[3] http://meta.wikimedia.org/wiki/Wikidata/Development/RDF
--
Markus Kroetzsch, Departmental Lecturer
Department of Computer Science, University of Oxford
Room 306, Parks Road, OX1 3QD Oxford, United Kingdom
+44 (0)1865 283529 http://korrekt.org/
On 12/08/13 17:07, Paul A. Houle wrote:
> My feelings are strong towards one-line-per-fact.
> Large RDF data sets have validity problems, and the difficulty of
> convincing publishers that this matters indicates that this situation
> will continue.
I hope that the Wikidata export that the script creates is actually
valid, but please feel free to report a bug if you find any problems.
Tomorrow I will update the online files I created, to make sure they are
based on the latest code as well.
> I’ve thought a bit about the problem of the “streaming converter
> from Turtle to N-Triples”. It’s true that this can be done in a
> streaming manner most of the time, but there is a stack that can get
> infinitely deep in Turtle so you can’t say, strictly, that memory
> consumption is bounded.
Yes, it is not a "streaming algorithm" in a formal sense, just a
streaming-style algorithm. But in practice, it would most likely work
quite well.
> It’s also very unclear to me how exactly to work around broken
> records and restart the parser in the general case. It’s not hard to
> mock up examples where a simple recovery mechanism works, but I dread
> the thought of developing one for commercial use where I’d probably be
> playing whack-a-mole for edge cases for years.
The beauty of a "quirks mode" Turtle parser is that there are no
requirements on it. If the Turtle is broken, then anything is better
than rejecting it altogether. The state of the art seems to be to give
up and return no triples, not even the ones that were well-formed higher
up in the file (which would also help to find the location of the error
...). A first improvement would be to keep the finished triples that
have been recognized so far. Then you can think about restarting
methods. And of course, while one can construct cases that seem confusing
(at least to the human eye), most errors in real Turtle documents are
missing escapes, missing terminators, or missing entities (unexpected
terminators), one at a time. It seems we can do a fairly reasonable
recovery in each case (and of course there is always "Too many errors,
giving up").
...
(I have nothing to say about compression)
> I am very much against blank nodes for ‘wiki-ish’ data that is shared
> between systems. The fact that Freebase reifies “blank nodes” as CVTs
> means that we can talk about them on the outside, reason about them,
> and then name them in order to interact with them on the live Freebase
> system. By their nature, blank nodes defy the “anyone, anything,
> anywhere” concept because they can’t be referred to. In the case of OWL
> that’s a feature not a bug because you can really close the world
> because nobody can add anything to a lisp-list without introducing a new
> node. Outside tiny tiny T-Boxes (say SUMO size), internal DSLs like
> SPIN, or expressing algebraic sorts of things (i.e. describe the mixed
> eigenstates of quarks in some Hadron), the mainstream of linked data
> doesn’t use them.
I think I agree. I also create proper URIs for all auxiliary objects and
confine bnodes to OWL axioms, which are always small standard patterns
that express one axiom.
> Personally I’d like to see the data published in Quad form and have the
> reification data expressed in the context field. As much as possible,
> the things in the (?s ?p ?o) fields should make sense as facts. Ideally
> you could reuse one ?c node for a lot of facts, such as if a number of
> them came in one transactions. You could ask for the ?c fields (show me
> all estimates for the population of Berlin from 2000 to the present and
> who made them) or you could go through the file of facts and pick the
> ?c’s that provide the point of view that you need the system to have.
This has been discussed before and there was a decision in the past that
the reification-based format should be implemented first, and that a
named-graph based format should be implemented as well, but later. So
hopefully we will have both at some time. For now, the RDF export is
still experimental, and has no concrete uses (or planned uses that I
know of) yet, so the priority of extensions/changes in the grand scheme
of things is rather low. This might change, of course.
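For illustration only (this is not something the current export does), the named-graph variant could be prototyped with rdflib's Dataset: the plain facts stay readable as (?s ?p ?o) triples, while a shared context node in the fourth position carries the provenance. All URIs below are invented for the example.

    # Sketch of the named-graph / quad idea; all names are made up.
    from rdflib import Dataset, Literal, Namespace, URIRef

    EX = Namespace("http://example.org/")
    ds = Dataset()

    # One context node groups a batch of facts (e.g. one import transaction).
    ctx = ds.graph(URIRef("http://example.org/context/c42"))
    ctx.add((EX.Berlin, EX.population, Literal(3460725)))
    ctx.add((EX.Berlin, EX.populationDeterminationYear, Literal(2010)))

    # Provenance about the whole batch hangs off the context identifier.
    meta = ds.graph(URIRef("http://example.org/context/meta"))
    meta.add((ctx.identifier, EX.statedBy, EX.SomeStatisticsOffice))

    # One line per fact, context URI in the fourth column.
    print(ds.serialize(format="nquads"))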
Cheers,
Markus
--
Markus Kroetzsch, Departmental Lecturer
Department of Computer Science, University of Oxford
Room 306, Parks Road, OX1 3QD Oxford, United Kingdom
+44 (0)1865 283529 http://korrekt.org/
My feelings are strong towards one-line-per-fact.
Large RDF data sets have validity problems, and the difficulty of convincing publishers that this matters indicates that this situation will continue.
I’ve thought a bit about the problem of the “streaming converter from Turtle to N-Triples”. It’s true that this can be done in a streaming manner most of the time, but there is a stack that can get infinitely deep in Turtle so you can’t say, strictly, that memory consumption is bounded.
It’s also very unclear to me how exactly to work around broken records and restart the parser in the general case. It’s not hard to mock up examples where a simple recovery mechanism works, but I dread the thought of developing one for commercial use where I’d probably be playing whack-a-mole for edge cases for years.
There was a gap of quite a few years in the late '90s when there weren't usable open-source web browsers, because a practical web browser had to: (1) read broken markup, and (2) render it exactly the same as Netscape 3. Commercial operations can get things like this done by burning out programmers, who finally show up at a standup meeting one day, smash their laptop and stomp out. It's not so easy in the open-source world, where you're forced to use carrots and not sticks.
As far as compression vs. inner format goes, I also have some thoughts, because for every product I've made in the last few years I have tried a few different packaging methods before releasing the final version.
Gzip eats up a lot of the 'markup bloat' in N-Triples because recently used IRIs and prefixes will be in the dictionary. The minus is that the dictionary isn't very big, so the contents of the dictionary itself are bloated; there isn't much entropy there, but the same markup bloat gets repeated hundreds of times; if you just put the prefixes in a hash table, that might be more like 1000 bytes total to represent that. When you prefix-compress RDF and then gzip it, you've got the advantage that the dictionary contains more entropy than it would otherwise. Even though gzip is not cutting out as much markup bloat, it is compressing against a better model of the document, so you get better results.
As has been pointed out, sorting helps. If you sort in ?s ?p ?o . order it helps, partially because the sorting itself removes entropy (there are N! possible unsorted files and only one sorted one) and obviously the dictionary is being set up to roll together common ?s and ?s ?p prefixes the way Turtle does.
Bzip's ability to work like a Markov chain with the element of chance taken out is usually more effective at compression than gzip, but I've noticed some exceptions. In the original :BaseKB products, all of the nodes looked like
<http://rdf.basekb.com/ns/m.112az>
I found my ?s ?p ?o sorted data compressed better with gzip than bzip, and perhaps the structure of the identifiers had something to do with it.
A big advantage of bzip is that the block-based nature of the compression means that blocks can be compressed and decompressed in parallel (pbzip2 is a drop-in replacement for bzip2), so the possible top speed of decompressing bzip data is in principle unlimited, even though bzip is a more expensive algorithm. Hadoop, in version 1.1.0+, can even automatically decompress a bzip2 file and split the result into separate mappers. Generally, system performance is better if you read data out of pre-split gzip, but it is just so easy to load a big bz2 into HDFS and point a lot of transistors at it.
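If anyone wants to reproduce this kind of comparison on their own dump, a few lines of Python with the standard gzip and bz2 modules are enough (the file name is just a placeholder for a sorted N-Triples file):

    # Compare raw, gzip and bzip2 sizes of an N-Triples dump.
    import bz2, gzip

    path = "dump.sorted.nt"   # placeholder: an ?s ?p ?o sorted N-Triples file
    with open(path, "rb") as f:
        data = f.read()

    print("raw  :", len(data))
    print("gzip :", len(gzip.compress(data)))
    print("bzip2:", len(bz2.compress(data)))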
I am very much against blank nodes for ‘wiki-ish’ data that is shared between systems. The fact that Freebase reifies “blank nodes” as CVTs means that we can talk about them on the outside, reason about them, and then name them in order to interact with them on the live Freebase system. By their nature, blank nodes defy the “anyone, anything, anywhere” concept because they can’t be referred to. In the case of OWL that’s a feature not a bug because you can really close the world because nobody can add anything to a lisp-list without introducing a new node. Outside tiny tiny T-Boxes (say SUMO size), internal DSLs like SPIN, or expressing algebraic sorts of things (i.e. describe the mixed eigenstates of quarks in some Hadron), the mainstream of linked data doesn’t use them.
Personally I’d like to see the data published in Quad form and have the reification data expressed in the context field. As much as possible, the things in the (?s ?p ?o) fields should make sense as facts. Ideally you could reuse one ?c node for a lot of facts, such as if a number of them came in one transactions. You could ask for the ?c fields (show me all estimates for the population of Berlin from 2000 to the present and who made them) or you could go through the file of facts and pick the ?c’s that provide the point of view that you need the system to have.