At the Semantic MediaWiki conference (SMWCon) a few days ago, Yuri mentioned that we're considering making our web API JSON-only. In response, Steve Newcomb emailed me the message below, and gave me permission to forward it to mediawiki-api for your consideration. Thank you, Steve Newcomb.
XML doesn't so much have a data/metadata distinction as it has a set of attributes on every element, which makes for a more complex data structure than JSON's object graphs. This makes it harder to create a common internal->external data structure mapping that works well with *both* XML and JSON output.
Only supporting one or the other means we have a more consistent internal API (for the API modules to export data) and a more consistent external API (for the consumers of the API).
As for naming: property names in JSON objects are equivalent to element and attribute names in XML, and require human selection in either case.
-- brion
On Sun, Mar 24, 2013 at 11:54 AM, Sumana Harihareswara < sumanah@wikimedia.org> wrote:
At the Semantic MediaWiki conference (SMWCon) a few days ago, Yuri mentioned that we're considering making our web API JSON-only. In response, Steve Newcomb emailed me the message below, and gave me permission to forward it to mediawiki-api for your consideration. Thank you, Steve Newcomb.
--
Sumana Harihareswara
Engineering Community Manager
Wikimedia Foundation
Dear Ms. Harihareswara,
The remarks that appear below, after my signature, are informed by participation in years of earnest presentations and discussions about XML vs. JSON at the Balisage conferences (see balisage.org).
The below remarks are extracted from the documentation of a tool we use in our consulting practice, which includes data management/publishing services for U.S. government customers. The extract is from a discussion of how the tool can optionally format XML for human-readability *without* polluting the data with spurious new whitespace. Then it digresses to more general considerations in a NOTE which is directly relevant to the JSON vs. XML question.
It all boils down to a simple question: will the data ever be used outside its current known applications and/or software? If the answer is "No", then JSON is probably the right choice. If the answer is "Yes", then XML is certainly a better choice, but then the questions arise: "Whose perspective on the data should be baked into it?", and "Who will pay the cost of baking it in?"
All best wishes for you and for humanity's ongoing invention of civilization, which depends on the longevity of knowledge,
Steve Newcomb srn@coolheads.com
In consideration of the haphazard way in which XML data are sometimes processed in the real world, one may with some justification worry about how a given XML document may someday be understood, especially when whitespace is significant. [This tool's] use of
markup characters for all readability-whitespace moots the criticism of XML that JSON is easier than XML to read and use for data interchange on account of the fact that, in JSON, all whitespace is intrinsically explicit and not subject to subsequent diddling when parsed, even when JSON data are elegantly formatted for readability.
Note: Needless to say, both syntaxes, XML and JSON, have advantages and disadvantages. In the context of this discussion, it may be worthwhile to highlight the essential difference between JSON and XML, which is that XML provides (demands, really) an explicit distinction between data and data-about-data (metadata), while JSON does not. In other words, XML requires specific classes of things to be endowed with names, while JSON imposes no such constraint. XML offers a standard way of unambiguously distinguishing the names of classes of data, and the names of attributes of those classes, from the data themselves. These names must be chosen somehow. Normally, the chosen names are meaningful. The choice of a specific name by a human being is the making of a semantic commitment. Thus, in XML, data are expressed in a way that almost inevitably reflects how someone (perhaps even the author!) thought the data should, or at least could, be understood. JSON, by contrast, does not demand that such a perspective be explicitly embedded in the data. If such a perspective is embedded in JSON data, JSON does not provide a standard way of abstracting that perspective from the data. But neither syntax prohibits the processing of data in terms of a data/metadata perspective other than the one(s) that were embedded in them. Whatever information XML can convey, JSON can also convey, and vice versa. However, if a data/metadata distinction needs to be baked into the data, such as when the data may need to be understood by a human being apart from any specific software application, XML is simpler to use, and the baked-in data/metadata distinction will be universally understandable as such, not only because of the World Wide Web Consortium XML Recommendation, but also because of ISO International Standard 8879-1986, as amended. 
If a baked-in data/metadata distinction is not desired, JSON is pretty clearly the better choice, but then at least two questions arise: (1) Are you certain that an embedded data/metadata distinction will be undesirable for all future applications of these data, including applications that do not yet exist? (2) Are you certain that you wish to forego your opportunity to influence how these data will be understood, including by persons as yet unborn?
Mediawiki-api mailing list Mediawiki-api@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/mediawiki-api
On 03/25/2013 11:31 AM, Brion Vibber wrote:
XML doesn't so much have a data/metadata distinction as it has a set of attributes on every element, which makes for a more complex data structure than JSON's object graphs. This makes it harder to create a common internal->external data structure mapping that works well with *both* XML and JSON output.
That's very true. (I'll show you my scars if you show me yours!)
Even the generic identifier of each element ...
(where "generic identifier" == "class name" == "tag")
... is, in fact, an attribute value. It's the value of the nameless attribute.
Ideally, all XML attributes are metadata *about* the data content of elements, while all data content of elements is the essence and substance. But, as already noted, one person's metadata is another person's data, and the question is always, "Who decides, and on what basis, and what's the decision?"
-----
I would argue that *all* of the difficulties encountered in maintaining a *common* data structure fall into one or more of the following categories:
(1) The XML data structure not being fit for purpose.
(2) The object data structure not being fit for purpose.
(3) The two structures not being fit for the *same* purpose.
(4) The object data structure not fully reflecting the data/metadata distinction that XML requires, and that (not coincidentally) is reasonably required for the interchange of application-independent data.
Speaking as a programmer, I think #4 is the one that programmers tend to trip over. We think in terms of objects and software, rather than in terms of the information that these objects are intended to convey, ultimately, to human beings. Our "customers" are machines, not human beings who need our data but, for any imaginable or unimaginable future reason, can't use our software.
The tags and attributes of XML -- indeed all the markup characters except what SGML calls "STAGO" (<), "ETAGO" (</), and "TAGC" (>) -- should generally be an irrelevant annoyance to anyone who is trying to get something working ASAP. It's a pity that the burden of maintaining XML falls on programmers, because they are the ones who care least about it, whose productivity suffers because of it, and whose attention to the underlying reasons for doing a good job with XML usually goes unrecognized and unrewarded. (Do I sound bitter?)
Speaking as a businessperson, before I invest in XML representations, I need to know why, because I know XML will cost real money, one way or another. In many scenarios, JSON is cheaper, and anyone who claims otherwise is ill-informed or lacks deep experience with both of them, especially in hybrid applications (Mediawiki). You guys know this; I'm preaching to the choir, here, but I want you to know that I, too, sing in your choir. Really.
Still speaking as a businessperson, customers do tend to demand XML, and at least some customers demand it for the right reasons. Some other customers demand XML for the wrong reasons, but that's OK because there are "right reasons" -- benefits to their organizations, and/or to the public -- that they, in their ignorance, don't recognize.
Still other customers demand XML for no apparent "right reason" -- perhaps out of something akin to brand loyalty. XML is simply not always the right answer. (For example, even after all these years, I'm still trying to understand why anyone would want to exchange an ODBMS, or even an RDBMS, for an "XML Database". But some do! Go figure.)
Speaking as a scholar with the motives of any data curator, I know that data objects that lack an embedded perspective on their components are extremely fragile and short-lived. Software rots, and often very quickly indeed. If I want a corpus of information to be enduringly accessible, I have to convert it to XML or SGML, and without delay.
Only supporting one or the other means we have a more consistent internal API (for the API modules to export data) and a more consistent external API (for the consumers of the API).
Very true. It's cheaper. Period. (And you get less.)
As for naming: property names in JSON objects are equivalent to element and attribute names in XML, and require human selection in either case.
Not the same. There is no distinction in JSON between what's meta and what's not. In XML, what's meta is in the markup (i.e., it's in the start-tags and end-tags), and what's not is in the content. That's the difference. Programmers *never* care about the data/metadata distinction, scholars *always* care about it, and businesspeople must do whatever the customer wants, or whatever their enterprise requires, at minimum expense. (Consultants, such as myself, get to advise all of them, which is what I'm doing right now.)
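To make the distinction in this exchange concrete, here is a minimal sketch using Python's standard library. The record itself (a page with an id, a language, and a title) and the "title" key are invented for illustration and are not taken from the MediaWiki API.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record, invented for illustration only.
xml_text = '<page id="42" lang="en">Main Page</page>'
json_text = '{"page": {"id": 42, "lang": "en", "title": "Main Page"}}'

element = ET.fromstring(xml_text)
# XML keeps three separate slots: the tag name, the attributes, and the content.
print(element.tag)     # the element name (the "generic identifier")
print(element.attrib)  # attributes, carried in the markup
print(element.text)    # content, carried between the tags

record = json.loads(json_text)
# JSON has one kind of slot: every member of an object is a name/value pair,
# so "id", "lang" and "title" all sit at the same level.
print(record["page"])
```

Both forms need human-chosen names, as Brion says; what differs is that the XML form records which values were placed in the markup and which in the content, while the JSON form leaves that to convention.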
P.S. XML is pretty secure. If you use a Python interpreter to read JSON data, as many do, anything can happen. I'm not sure that's relevant to Mediawiki, but it could be relevant, particularly in a case where the data may outlive the original software. It's easy to embed a virus in a large JSON dataset. There is no such inherent risk in XML; XML is not a programming language (despite the awkward ways in which XSLT can be abused).
P.P.S. My point is: Is the focus of your product software? Or is the focus data? If it's data, then make the software conform to the requirements of the data. If it's software (e.g., the API), then you should feel quite free to make the data conform to the requirements of the software. (But I find it hard to believe that the latter case is the Mediawiki case, actually.)
Steve Newcomb
On Mon, 25 Mar 2013 21:23:59 +0100, Steve Newcomb srn@coolheads.com wrote:
If you use a Python interpreter to read JSON data, as many do, anything can happen. I'm not sure that's relevant to Mediawiki, but it could be relevant, particularly in a case where the data may outlive the original software. It's easy to embed a virus in a large JSON dataset. There is no such inherent risk in XML; XML is not a programming language (despite the awkward ways in which XSLT can be abused).
False. This is a feature of some parsers (and which should - and AFAIK is in Python - be disabled by default), which sadly mistake JSON for a data serialization format, when it's merely a data interchange one.
These parsers allow certain JSON data (usually with specially formatted keys) to be parsed into arbitrary language constructs in addition to the well-known and expected arrays and maps. But again, this isn't a feature of JSON itself (if anything, it speaks of its versatility), and is as far as I can see completely irrelevant here.
On 03/25/2013 05:16 PM, Bartosz Dziewoński wrote:
On Mon, 25 Mar 2013 21:23:59 +0100, Steve Newcomb srn@coolheads.com wrote:
If you use a Python interpreter to read JSON data, as many do, anything can happen. I'm not sure that's relevant to Mediawiki, but it could be relevant, particularly in a case where the data may outlive the original software. It's easy to embed a virus in a large JSON dataset. There is no such inherent risk in XML; XML is not a programming language (despite the awkward ways in which XSLT can be abused).
False. This is a feature of some parsers (and which should - and AFAIK is in Python - be disabled by default), which sadly mistake JSON for a data serialization format, when it's merely a data interchange one.
Not sure what's false about what I said. Here's what I was talking about:
    #!/usr/bin/env python
    jsonDataSet = """{ 'this': 'hello', 'that': 'goodbye' }"""
    exec "myDictionary = %s" % ( jsonDataSet) ## <-- bad but real
Much can happen in an -exec-, including the definition of functions, and their assignment to "self" as methods. And recursive -exec-s, too.
These parsers allow certain JSON data (usually with specially formatted keys) to be parsed into arbitrary language constructs in addition to the well-known and expected arrays and maps. But again, this isn't a feature of JSON itself (if anything, it speaks of its versatility), and is as far as I can see completely irrelevant here.
The relevance depends on what people do with the data as represented. It's optimistic to expect knowledgeable, rational behaviors from human data recipients. As with any security concern:
* What's irrelevant is what we *expect* to happen.
* What's relevant is what *could* happen.
But I take your point that it's unlikely to be a problem, except for naive and/or desperate problem-solvers coping with situations and requirements we can scarcely imagine.
I was attempting to point out a different distinction from the one you draw between serialization and interchange (which is a very valid and appropriate distinction to bear in mind, here). I was saying...
JSON syntax is indistinguishable from a subset of Python syntax.
... whereas ...
XML syntax is very distinguishable from the syntax of any programming language.
XML syntax *could* be used for data serialization, but that would be directly contrary to its spirit, because much of a software implementation is implicit in its data structure (whether serialized or not), while XML is best used when the structures of its document instances are driven by, and imply, only the inherent semantics of the data. The reason is: *new* applications that will consume the data, including applications as yet unknown, won't have to compensate for the constraints and assumptions of *existing* applications of the same data.
Fact: From the perspective of an assumed set of existing popular software tools and practices, XML is significantly *less* versatile than JSON.
Fact: From the perspective of human beings, their cultures, their rapidly changing and diverse technological environments, etc., XML is significantly *more* versatile for data interchange than JSON. At a cost.
It all depends on what you're trying to accomplish. Which is more central to your mission: (1) software, or (2) data? You can't have it both ways. No one can serve two masters.
I would argue that even if you ultimately decide to forego all XML in favor of JSON, this discussion is well worth having. Things go better when everybody is on the same page.
On Tue, Mar 26, 2013 at 2:32 PM, Steve Newcomb srn@coolheads.com wrote:
On 03/25/2013 05:16 PM, Bartosz Dziewoński wrote:
On Mon, 25 Mar 2013 21:23:59 +0100, Steve Newcomb srn@coolheads.com wrote:
If you use a Python interpreter to read JSON data, as many do, anything can happen. I'm not sure that's relevant to Mediawiki, but it could be relevant, particularly in a case where the data may outlive the original software. It's easy to embed a virus in a large JSON dataset. There is no such inherent risk in XML; XML is not a programming language (despite the awkward ways in which XSLT can be abused).
False. This is a feature of some parsers (and which should - and AFAIK is in Python - be disabled by default), which sadly mistake JSON for a data serialization format, when it's merely a data interchange one.
Not sure what's false about what I said. Here's what I was talking about:
    #!/usr/bin/env python
    jsonDataSet = """{ 'this': 'hello', 'that': 'goodbye' }"""
    exec "myDictionary = %s" % ( jsonDataSet) ## <-- bad but real
Much can happen in an -exec-, including the definition of functions, and their assignment to "self" as methods. And recursive -exec-s, too.
You're trying to state that that utterly bogus fragment of python is somehow a failing of json, rather than a failing of whoever wrote that python code. That won't even parse all json correctly, since "true", "false", and "null" are valid json but not valid python.
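The point about "true", "false", and "null" can be checked with a short sketch. It assumes Python 3 and its standard json module; the payloads are invented for illustration, and ast.literal_eval stands in for exec/eval so that nothing is actually executed.

```python
import ast
import json

# Hypothetical payload, invented for illustration. It is valid JSON.
document = '{"this": "hello", "ok": true, "missing": null}'

# A JSON parser builds plain dicts, lists, strings, numbers, booleans and None.
# It does not execute anything found in the input.
parsed = json.loads(document)
print(parsed)

# Treating the same text as a Python literal fails, because true and null
# are not Python literals.
try:
    ast.literal_eval(document)
    python_accepts_it = True
except (ValueError, SyntaxError):
    python_accepts_it = False
print(python_accepts_it)

# Text that is Python code rather than JSON is rejected by the JSON parser
# instead of being run.
hostile = '__import__("os").getcwd()'
try:
    json.loads(hostile)
    json_accepts_hostile = True
except json.JSONDecodeError:
    json_accepts_hostile = False
print(json_accepts_hostile)
```

The single-quoted sample quoted above is a Python dict literal rather than valid JSON, so json.loads would reject it as well.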
On Tue, 26 Mar 2013 19:32:44 +0100, Steve Newcomb srn@coolheads.com wrote:
Not sure what's false about what I said. Here's what I was talking about:
    #!/usr/bin/env python
    jsonDataSet = """{ 'this': 'hello', 'that': 'goodbye' }"""
    exec "myDictionary = %s" % ( jsonDataSet) ## <-- bad but real
Much can happen in an -exec-, including the definition of functions, and their assignment to "self" as methods. And recursive -exec-s, too.
Are you even serious? How is that relevant? Who in their right mind would exec() arbitrary outside data?
I won't comment on the rest of your reply, as it's apparently a wall of text completely unrelated to what I said, and I think also to the original discussion (which I'm personally not interested in, but I just wanted to point out the obviously false pretense of your comment).
The issues I was talking about are the likes of CVE-2013-0269 [1] (see https://groups.google.com/forum/?fromgroups=#!topic/rubyonrails-security/4_Y... ).
XML has its own fair share of vulnerabilities in inadequately written parsers, such as the billion laughs attack[2] or the ability to access arbitrary files (using '!ENTITY file'). This is, however, just as irrelevant here as the JSON issues are.
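As a harmless illustration of the mechanism behind the billion laughs attack, the sketch below feeds a tiny document with two nested entities to Python's standard xml.etree.ElementTree. The document is invented for illustration; a real attack nests many more levels so that the expansion grows exponentially.

```python
import xml.etree.ElementTree as ET

# Harmless, hypothetical document: entity "b" expands to two copies of "a".
small_doc = """<!DOCTYPE d [
  <!ENTITY a "ha">
  <!ENTITY b "&a;&a;">
]>
<d>&b;&b;</d>"""

# The parser expands the entity references while building the tree,
# so the text of the root element is longer than what was written in it.
root = ET.fromstring(small_doc)
expanded = root.text
print(len(small_doc), "characters in,", len(expanded), "characters of text out")
```

Parsers differ in whether and how they limit this expansion, so the behaviour of the one in use is worth checking before it is given untrusted input.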
(EOT on my side.)
[1] http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2013-0269
[2] https://en.wikipedia.org/wiki/Billion_laughs
On Tue, Mar 26, 2013 at 1:50 PM, Bartosz Dziewoński matma.rex@gmail.com wrote:
Are you even serious? How is that relevant? Who in their right mind would exec() arbitrary outside data?
Wrong question. The right question is "how often do people exec() arbitrary outside data", and for the answer, look up "code injection exploit". I believe Wikipedia has an article on it.