Wikidata-tech March 2015

wikidata-tech@lists.wikimedia.org

15 participants
5 discussions

by Jeroen De Dauw

Hey,

I'm wondering if we still need to support PHP 5.3. http://blog.ircmaxell.com/2014/12/on-php-version-requirements.html

I'd rather bump the minimum version up to PHP 5.5, and I know people have been talking about doing the same for MediaWiki itself. My question is whether this can already be done without making Wikibase undeployable on WMF servers. Is everything running HHVM yet, or is there some stuff relevant to Wikibase that still runs an unsupported version of PHP?

Cheers

--
Jeroen De Dauw - http://www.bn2vs.com
Software craftsmanship advocate
Evil software architect at Wikimedia Germany
~=[,,_,,]:3

8 years, 10 months

Thoughts on (not) exposing a SPARQL endpoint

by Daniel Kinzler

Hi all!

After the initial enthusiasm, I have grown increasingly wary of the prospect of exposing a SPARQL endpoint as Wikidata's canonical query interface. I decided to share my (personal and unfinished) thoughts about this on this list, as food for thought and a basis for discussion.

Basically, I fear that exposing SPARQL will lock us in with respect to the backend technology we use. Once it's there, people will rely on it, and taking it away would be very harsh. That would make it practically impossible to move to, say, Neo4J in the future. This is even more true if we expose vendor-specific extensions like RDR/SPARQL*.

Also, exposing SPARQL as our primary query interface probably means abruptly discontinuing support for WDQ. It's pretty clear that the original WDQ service is not going to be maintained once the WMF offers infrastructure for Wikidata queries. So, when SPARQL appears, WDQ would go away, and dozens of tools would need major modifications, or would just die.

So, my proposal is to expose a WDQ-like service as our primary query interface. This follows the general principle of having narrow interfaces to make it easy to swap out the implementation.

But the power of SPARQL should not be lost: a (sandboxed) SPARQL endpoint could be exposed to Labs, just like we provide access to replicated SQL databases there. On Labs, you get "raw" access, with added performance and flexibility, but no guarantees about interface stability.

In terms of development resources and timeline, exposing WDQ may actually get us a public query endpoint more quickly: sandboxing full SPARQL may well turn out to be a lot harder than sandboxing the more limited set of queries WDQ allows.

Finally, why WDQ and not something else, say, MQL? Because WDQ is specifically tailored to our domain and use case, and there already is an ecosystem of tools that use it.
We'd want to refine it a bit I suppose, but by and large, it's pretty much exactly what we need, because it was built around the actual demand for querying Wikidata.

So much for my current thoughts. Note that this is not a decision or recommendation by the Wikidata team, just my personal take.

-- daniel

--
Daniel Kinzler
Senior Software Developer
Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.
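The "narrow interface" principle from the message above can be illustrated with a small sketch. This is not WDQ's or Wikidata's actual API: the class names, the single `items_with_claim` operation, the `ex:` prefix, and the `run_query` transport are all hypothetical, and the sample item and property ids are only illustrative. The point is that clients depend on one small operation, so the backend behind it can be replaced without breaking them.

```python
from abc import ABC, abstractmethod


class QueryBackend(ABC):
    """The only operation the public service is allowed to rely on."""

    @abstractmethod
    def items_with_claim(self, prop, value):
        """Return sorted ids of items that have the statement (prop, value)."""


class InMemoryBackend(QueryBackend):
    """Stand-in for any store; holds (item, property, value) tuples."""

    def __init__(self, claims):
        self._claims = set(claims)

    def items_with_claim(self, prop, value):
        return sorted(i for i, p, v in self._claims if p == prop and v == value)


class SparqlBackend(QueryBackend):
    """Translates the narrow operation into a SPARQL query string."""

    def __init__(self, run_query):
        # Hypothetical transport: takes a query string, returns item ids.
        self._run_query = run_query

    def items_with_claim(self, prop, value):
        # "ex:" is a placeholder prefix, not Wikidata's actual RDF vocabulary.
        query = f"SELECT ?item WHERE {{ ?item ex:{prop} ex:{value} }}"
        return sorted(self._run_query(query))


class PublicQueryService:
    """What external tools talk to; it never exposes the backend itself."""

    def __init__(self, backend):
        self._backend = backend

    def claim(self, prop, value):
        return self._backend.items_with_claim(prop, value)


# Illustrative data only.
claims = [("Q42", "P31", "Q5"), ("Q64", "P31", "Q515")]
service = PublicQueryService(InMemoryBackend(claims))
print(service.claim("P31", "Q5"))
```

Swapping `InMemoryBackend` for `SparqlBackend` (or any other implementation) leaves `PublicQueryService.claim` and its callers unchanged, which is the property the proposal is after.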

9 years, 2 months

bio2rdf, wikidata and SPARQL

by Michel Dumontier

Hi all,

I was excited to learn about your plans to explore the use of SPARQL-capable stores for providing Wikidata. I currently run Bio2RDF ( http://bio2rdf.org), an open source project that transforms over 30 biomedical databases into 11B triples of Linked Data and serves them. For the past 10 years our project has relied on Virtuoso, primarily because it performs well under most circumstances (lookup and simple queries) and is open source. We are pleased to learn of the strides that BigData has made with its BlazeGraph release, and we are currently investigating its feasibility to support our project.

Our project currently loads each RDF dataset into a separate SPARQL endpoint, which induces a high memory overhead, but seems to scale better on a single server and also makes it vastly easier to update individual datasets rather than having to delete/update a large triple store. Thus, users must use SPARQL federation in order to query across the graph, or just download the freely available data files and build their own integrated database.

We have just begun the process of seriously analyzing our user logs in order to better understand the kinds of queries that our users formulate, and the content that they are interested in. We hope that our work will provide insight into access patterns, data quality, and overall performance. But what I can say is that most queries are relatively simple (select + 1 triple pattern) and that, unsurprisingly, frequency decreases exponentially with increased complexity. However, if the goal is to provide fast access, you might also look into http://linkeddatafragments.org/ . It's something that we're looking into.

I noticed your discussion about representation. I concur with Jerven that you should consider using explicit data structures that decompose complex concepts into computable fragments.
We have described our approach in applying ontology design patterns built from the Semanticscience Integrated Ontology (SIO) to represent arbitrary knowledge ( http://sio.semanticscience.org/), which is also friendly to reasoning with OWL ontologies. I would be happy to discuss this in greater detail if interested.

Finally, given the overlap in Bio2RDF with content in Wikidata, I would like to investigate ways in which we can interlink our repositories. It would be useful if Wikipedia/Wikidata users could automatically discover related content in Bio2RDF, and vice versa. One way is for us to dynamically ask whether either of us knows about an entity; another is that we share a data identifier registry (see identifiers.org). It would be great to hear your ideas on this!

Cheers!

m.

Michel Dumontier
Associate Professor of Medicine (Biomedical Informatics), Stanford University
Chair, W3C Semantic Web for Health Care and the Life Sciences Interest Group
http://dumontierlab.com
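For readers unfamiliar with the term, the "select + 1 triple pattern" query shape mentioned in the message above can be sketched over an in-memory list of triples. This is only an illustration of the concept, not how Virtuoso or BlazeGraph evaluate queries; the `match` helper and the `ex:` identifiers are made up for the example.

```python
def match(triples, s=None, p=None, o=None):
    """Return the triples matching one pattern; None acts as a variable."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Hypothetical data using a placeholder "ex:" prefix.
triples = [
    ("ex:protein1", "ex:encodedBy", "ex:gene1"),
    ("ex:protein2", "ex:encodedBy", "ex:gene2"),
    ("ex:protein1", "ex:label", "Arrestin"),
]

# Roughly: SELECT ?s ?o WHERE { ?s ex:encodedBy ?o }
for subject, _, obj in match(triples, p="ex:encodedBy"):
    print(subject, obj)
```

A lookup like this touches a single index in a real store, which is one reason such queries dominate the logs and are cheap to serve.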

9 years, 2 months

Positive thoughts on exposing a SPARQL endpoint

by Jerven Tjalling Bolleman

Dear WikiData Developers,

I am very pleased to read of your intention to deploy a SPARQL based API for WikiData. I am one of the developers behind sparql.uniprot.org, a public, free to use SPARQL endpoint (with content under CC-ND, soon to be made more liberal). UniProt can be seen as an encyclopaedia of facts about proteins, and we have about 90 million pages of entries, e.g. http://www.uniprot.org/uniprot/P49407 (corresponds to http://en.wikipedia.org/wiki/Arrestin). These 90 million entries in UniProtKB (and another 200 million or so in our supporting related datasets) lead to a bit more than 19 billion unique triples.

We currently use Virtuoso, but I completely understand why you selected BlazeGraph; it seems to me a better fit for your mission and deployment scenario. I evaluated what was then BigData 3 years ago and would have selected it if we did not need to go for vertical scaling due to having just 1U in our computer room available.

I also understand the worries that you have about being able to run a service that is resilient to DOS (intentional or not). We have not yet needed to run any mitigation strategies that are more complex than banning an IP+user-agent string. However, we do have a few ideas about how to make that automatic, even if we don’t have the time to work on these ideas yet. Our query timeouts are generously above what a sparsely used HTTP connection supports in practice.

You should also not be worried about GRAPH traversals; in our experience simple DISTINCT queries are much more painful. E.g. the common SELECT (COUNT(DISTINCT ?subject) AS ?count) WHERE { ?subject ?predicate ?object } that some people like to send daily can be a challenge on large databases. We could institute draconian timeouts but we don’t, because we want to get the difficult queries: the simple ones our users can do on our main site, but the analytical ones require a solution such as SPARQL.

Maintaining a SPARQL endpoint for public use requires you to focus on client management, not query management.
You can have one client with relatively easy queries, but one that happily sends you 40,000 of these per second. Others send you a query that looks complicated and expensive but is actually very selective and ends up taking next to no time at all.

One of the solutions that we will look into long term is a query manager in front of our endpoint. I will try to explain the idea behind that, but if it's fuzzy just ask and I will try to explain better ;)

Assume you have a CPU budget available for hosting your SPARQL endpoint. Let's say 100 ticks per day. If you have one user, that user should be able to get all 100 ticks. If you have one hundred users, each user gets 1 tick. If you limit your one user to 1 tick you have wasted 99 ticks. If you let the first user use 100 ticks and the other ninety-nine users their 1, you have broken the budget by 99 ticks.

One solution is therefore to allow clients to optimistically run a query for longer than their ticks, but the moment that other clients arrive they get kicked off. E.g. complex queries run ok on Sunday evening but not on Monday 10am.

You can do this by having a service in front of your API that takes SPARQL queries, queues them for execution in a priority queue, and sends back a 303 response with Retry-After and Location headers so that the HTTP responses don’t die. The priority queue ensures everyone can get a turn to run their queries, by reducing the priority of people who are already running a query etc…

The query management service can then see if a query returned in its tick and if anyone else has scheduled a query. If someone else has a query outstanding, the query manager tells the SPARQL engine to stop working on the query and send back any results if possible. I don’t know if BlazeGraph allows this, but I know how to do it for GraphDB and have an idea about how to do it in Virtuoso. I am however sure that Systap can add this to BlazeGraph on short notice.
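The tick budget and priority queue described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not an existing query manager: the function and class names, the `/results/<ticket>` path, and the 5 second retry hint are all hypothetical, and actually cancelling a running query inside the SPARQL engine is left out.

```python
import heapq
from itertools import count


def fair_share(budget_ticks, active_clients):
    """Ticks each client is guaranteed when the budget is split evenly."""
    return budget_ticks / max(active_clients, 1)


def may_keep_running(ticks_used, share, others_waiting):
    """Optimistic rule: exceed your share only while nobody else is queued."""
    return ticks_used < share or not others_waiting


class QueryQueue:
    """Priority queue that serves clients with the least usage first."""

    def __init__(self):
        self._heap = []
        self._tickets = count(1)  # ticket ids double as arrival-order tie-breakers
        self.ticks_used = {}      # client id -> ticks consumed so far

    def record(self, client, ticks):
        """Charge ticks to a client after (part of) a query has run."""
        self.ticks_used[client] = self.ticks_used.get(client, 0) + ticks

    def submit(self, client, query):
        """Queue a query and return the deferred response for the client."""
        ticket = next(self._tickets)
        used = self.ticks_used.get(client, 0)
        heapq.heappush(self._heap, (used, ticket, client, query))
        # Hypothetical result path and retry hint; the message only names the
        # 303 status and the Retry-After and Location headers.
        return {
            "status": 303,
            "headers": {"Location": f"/results/{ticket}", "Retry-After": "5"},
        }

    def next_query(self):
        """Pop the (client, query) pair that should run next, or None."""
        if not self._heap:
            return None
        _, _, client, query = heapq.heappop(self._heap)
        return client, query
```

With a budget of 100 ticks, `fair_share(100, 1)` gives the lone user everything and `fair_share(100, 100)` gives each user one tick, while `may_keep_running` captures the "Sunday evening versus Monday 10am" behaviour: a long query continues only until someone else is waiting.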
Such a query manager will deal with all common forms of DOS attacks.

Now to the advantages of SPARQL over the other options, Gremlin included: it is an open standard that is deployed in the wild for use over HTTP by complete strangers. Not just in academia but also e.g. at http://open-data.europa.eu/en/linked-data, so it's more likely that you can combine data in primary resources with data in WikiData. Also you won’t be the only ones worrying about attacks on your public endpoint and will have a larger community to share knowledge and code with.

On the reification business, we use it extensively and in our case about 15% of our triples are in reification quads. But that is the surface representation; how it is actually stored on disk (or materialised on demand) is completely different and not something you should worry about at this time. Most of the time you don’t need reification, and it can often be avoided by better modelling. For example, in our archive of protein sequences we used to use reification to note that a protein sequence can no longer be found in a database entry. Now we model a database entry just like all the other active entries; we just give it a new unique URI and an rdf:type to say it's obsolete and that it corresponds to a version of an active database entry.

When teaching SPARQL to scientists I call this way of thinking “model the measurement result, not the conclusion”. E.g. instead of saying

example:Jerven example:length_in_cm 197 .

model

example:Jerven example:measured_length [ example:length_measurement_result_in_cm 197 ; example:measurement_date "2015-02-27"^^xsd:date ] , [ example:length_measurement_result_in_cm 60 ; example:measurement_date "1983-06-21"^^xsd:date ] .

As you can see, in the second model you don’t have to worry about invalidating data, as the measurement stays correct. You would have been forced to model like this in practice with Titan, and you should do the same in RDF.
Just because reification is available does not mean you must use it ;)

I hope that this mail was constructive and that some of your worries will be lessened knowing that there are (possible) solutions to maintain a reliable service.

Regards,
Jerven

PS. There are more open source options for SPARQL, e.g. RDF4J

--
Jerven Tjalling Bolleman
SIB | Swiss Institute of Bioinformatics
CMU - 1, rue Michel Servet - 1211 Geneva 4
t: +41 22 379 58 85 - f: +41 22 379 58 58
Jerven.Bolleman(a)isb-sib.ch - http://www.isb-sib.ch
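The "model the measurement result, not the conclusion" advice in the message above is not specific to RDF. As a rough sketch of the same idea outside a triple store, dated measurements are stored as immutable records and the "current" value is derived when needed. The class and function names are hypothetical; the two measurements are the ones from the example in the message.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LengthMeasurement:
    """One dated observation; it stays true even after newer ones arrive."""
    subject: str
    length_cm: int
    measured_on: date


measurements = [
    LengthMeasurement("Jerven", 197, date(2015, 2, 27)),
    LengthMeasurement("Jerven", 60, date(1983, 6, 21)),
]


def latest_length_cm(subject, records):
    """Derive the conclusion (current length) from the stored measurements."""
    own = [m for m in records if m.subject == subject]
    if not own:
        return None
    return max(own, key=lambda m: m.measured_on).length_cm


print(latest_length_cm("Jerven", measurements))
```

Adding a new measurement never invalidates an old record, so nothing has to be retracted or annotated as outdated.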

9 years, 2 months

Wikidata Query Backend Update (take two!)

by Nikolas Everett

TL/DR: We've selected BlazeGraph to back the next Wikidata Query Service.

After Titan evaporated about a month ago we went back to the drawing board on back ends for a new Wikidata Query Service. We took four weeks (including a planned trip to Berlin) to settle on a backend. As you can see from the spreadsheet <https://docs.google.com/a/wikimedia.org/spreadsheets/d/1MXikljoSUVP77w7JKf9…> we've really blown out the number of options. As you can also see, we didn't finish filling them all out. But we've still pretty much settled on BlazeGraph <http://www.blazegraph.com/> anyway. Let me first explain what BlazeGraph is and then defend our decision to stop spreadsheet work.

BlazeGraph is a GPLed RDF triple store that natively supports SPARQL 1.1, RDFS, some OWL, and some extensions. Those are all semantic web terms and they translate into "it's a graph database with an expressive, mostly standardized query language and support for inferring stuff as data is added to and removed from the graph". It also has some features that you'd recognize from nice relational databases: join order rewriting, smart query planner, hash and nested loop joins, query rewrite rules, group by, order by, and aggregate functions.

These are all cool features, really the kind of things that we thought we needed, but they come with an "interesting" price. Semantic Web is a very old thing that's had a really odd degree of success. If you have an hour and a half, Jim Hendler can explain <https://www.youtube.com/watch?v=oKiXpO2rbJM> it to you. The upshot is that _tons_ of people have _tons_ of opinions. The W3C standardizes RDF, SPARQL, RDFS, OWL, and about a billion other things. There are (mostly non-W3C) standards for talking about people <http://xmlns.com/foaf/spec/>, social connections <http://rdfs.org/sioc/spec/>, and music <http://musicontology.com/specification/>. And they all have rules. And Wikidata doesn't. Not like these rules.
One thing I've learned from this project is that this lack of prescribed rules is one of Wikidata's founding principles. It's worth it to allow openness. So you _can_ set gender to "Bacon" or put geo coordinates on Amber <https://www.wikidata.org/wiki/Q1053330>.

Anyway! I argue that, at least for now, we should ignore many of these standards. We need to think of Wikidata Query Service as a tool to answer questions instead of as some grand statement about the semantic web. Mapping existing ontologies onto Wikidata is a task for another day.

I feel like these semantic web technologies and BlazeGraph in particular are good fits for this project, mostly because the quality of our "but what about X?" questions is very very high. "How much inference should we do instead of query rewriting?" instead of "Can we do inference? Can we do query rewriting?" And "Which standard vocabularies should we think about mapping to Wikidata?" Holy cow! In any other system there aren't "standard vocabularies" to even talk about mapping, much less a mechanism for mapping them. Much less two! It's almost an overwhelming wealth and, as I allude to above, it can be easy to bikeshed.

We've been reasonably careful to reach out to people we know are familiar with this space. We're well aware of projects like the Wikidata Toolkit and its RDF exports. We've been using those for testing. We've talked to so many people about so many things. It's really consumed a lot more time than I'd expected and made the search for the next backend very long. But I feel comfortable that we're in a good place. We don't know all the answers but we're sure there _are_ answers.

The BlazeGraph upstream has been super active with us. They've spent hours with us over hangouts, had me out to their office (a house an hour and a half from mine) to talk about data modeling, and spent a ton of time commenting on Phabricator tickets. They've offered to donate a formal support agreement as well.
And to get together with us about writing any features we might need to add to BlazeGraph. And they've added me as a committer (I told them I had some typos to fix but I have yet to actually commit them). And their code is well documented.

So by now you've realized I'm a fan. I believe that we should stop on the spreadsheet and just start work against BlazeGraph because I think we have phenomenal momentum with upstream. And it's a pretty clear winner on the spreadsheet at this point. But there are two other triple stores which we haven't fully filled out that might be viable: OpenLink Virtuoso Open Source and Apache Jena. Virtuoso is open core so I'm really loath to go too deep into it at this point. Their HA features are not open source, which implies that we'd have trouble with them as an upstream. Apache Jena just isn't known <http://www.w3.org/wiki/LargeTripleStores#Jena_TDB_.281.7B.29> to scale to data as large as BlazeGraph and Virtuoso. So I argue that these are systems that, in the unlikely event that BlazeGraph goes the way of Titan, we should start our third round of investigation against.

As it stands now I think we have a winner. We created a Phabricator task <https://phabricator.wikimedia.org/T90101> with lots of children to run down our remaining questions. The biggest remaining questions revolve around three areas:

1. Operational issues like "how should the cluster be deployed?" "do we use HA at all?" "how are rolling restarts done in HA?"
2. How should we represent the data in the database? BlazeGraph (and only BlazeGraph) has an extension that we *could* use called RDR. Should we use it?
3. Some folks have identified update rate as a risk. Not upstream, but others familiar with triple stores in general.

Our plan is to work on #2 over the next weeks because it really informs #1, as there are lots of working set size vs cpu time tradeoffs to investigate. We'll start on #1 shortly as well.
#3 is a potential risk area so we'll be sure to investigate it soon.

I admit I'm not super happy to leave the spreadsheet in its current unfilled-out state, but I'm excited to have something to work with and think it's the right thing to do right now.

So thanks for reading all of this. Please reply with comments.

Thanks again,

Nik

9 years, 2 months
