Hi
I've been trying to use the API to get output into YAML format for processing in Ruby. Seems (see http://bugzilla.wikimedia.org/show_bug.cgi?id=12120 ) that the API's YAML output is broken, as it does not escape quotes, colons followed by spaces etc. This causes output like
title: Job: A Comedy of Justice
when getting the Wikipedia page with that title; because of the unescaped second colon followed by a space, Ruby's YAML.load chokes on this. It seems to me that this makes YAML output unusable for getting data from Wikipedia, since quite a few articles have ": " in their titles or other fields.
Any thoughts on how to circumvent this (without using XML as format; I know I can parse this from Ruby using REXML, but it ain't nearly as easy/clean as YAML), or on a time horizon for a fix/update?
Any help or suggestions are appreciated, thanks,
Loek Cleophas
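A minimal sketch of the failure described above, assuming a current Ruby where the yaml library is backed by Psych (the parser shipped with Ruby in 2008 may report the error differently). The helper name try_load is made up for illustration; the second input shows that quoting the value is enough for the parser to accept it.

```ruby
require 'yaml'

# Hypothetical helper: returns the parsed document, or the parser's
# error message as a String if loading fails.
def try_load(yaml_text)
  YAML.load(yaml_text)
rescue Psych::SyntaxError => e
  "parse error: #{e.message}"
end

broken = "title: Job: A Comedy of Justice\n"      # as emitted by the API
quoted = "title: \"Job: A Comedy of Justice\"\n"  # same value, quoted

p try_load(broken)  # an error string: the second ": " is rejected
p try_load(quoted)  # a Hash holding the full title as a String
```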
I am working on a Ruby client for the MediaWiki API and was having that same problem for a long time. I'm not good enough with YAML to have fixed it, so I switched to XML. I recall the problem having to do with YAML formatting as opposed to the content of the message, so you may have found a second bug. There is a class of the API dedicated to YAML formatting, so you can look through it for a place to clean output (if you're comfortable with PHP).
Briefly on XML: I found that switching from YAML to XML wasn't so bad. I know that YAML is attractive for how simple it is, but really XML isn't so bad either. My two cents.
-Eddie
Mediawiki-api mailing list Mediawiki-api@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/mediawiki-api
L. Cleophas wrote:
> Hi
> I've been trying to use the API to get output into YAML format for processing in Ruby. Seems (see http://bugzilla.wikimedia.org/show_bug.cgi?id=12120 ) that the API's YAML output is broken, as it does not escape quotes, colons followed by spaces etc. This causes output like
> title: Job: A Comedy of Justice
> when getting the Wikipedia page with that title; due to the unescaped second colon followed by space, Ruby's YAML.load chokes on this. Seems to me that this basically means that YAML output is not usable for getting data from wikipedia, quite a few of whose articles use ": " in titles or other fields.
I know, I should fix it. It's just that I'm not familiar with YAML at all, so I need to find some time to read the YAML specs and rewrite the YAML formatter.
> Any thoughts on how to circumvent this (without using XML as format; I know I can parse this from Ruby using REXML, but it ain't nearly as easy/clean as YAML), or on a time horizon for a fix/update?
XML is not that bad. There's also JSON, which is equally easy/clean.
Roan Kattouw (Catrope)
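For comparison, a small sketch of the JSON route from Ruby, using the standard json library. The payload below is a hand-written stand-in, only loosely shaped like an API query result, and page_titles is a hypothetical helper; the point is just that a title containing ": " needs no special handling in JSON.

```ruby
require 'json'

# Hypothetical payload, written by hand for illustration.
SAMPLE_JSON = '{"query":{"pages":{"1":{"title":"Job: A Comedy of Justice"}}}}'

# Hypothetical helper: collect the title of every page in the result.
def page_titles(json_text)
  JSON.parse(json_text).dig('query', 'pages').values.map { |page| page['title'] }
end

p page_titles(SAMPLE_JSON)
```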
Roan Kattouw wrote:
| I know, I should fix it. It's just that I'm not familiar with YAML at
| all, so I need to find some time to read the YAML specs and rewrite the
| YAML formatter.
If it's not a hard fix, it might be great to see that come in in the next day or two so I can merge it into the 1.12.0 release.
-- brion vibber (brion @ wikimedia.org)
Brion Vibber wrote:
> If it's not a hard fix, it might be great to see that come in in the next day or two so I can merge it into the 1.12.0 release.
Well ain't that a funny coincidence ;) I fixed the issue and applied the fix to the 1.12 branch. The magnitude of this problem was far greater than I had imagined: *every* mention of a page outside of the main namespace breaks, *every* ISO8601 timestamp breaks, which makes most query modules completely useless in YAML.
Roan Kattouw (Catrope)
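To illustrate the classes of values mentioned above (namespaced titles, ISO 8601 timestamps, text with quotes or "#"), here is a rough sketch of one way a formatter can sidestep them: emit every string as a double-quoted scalar. This is not the fix that went into the API's formatter; quote_scalar and to_quoted_yaml are hypothetical helpers, the sample values are made up, and only backslashes, double quotes and newlines are escaped.

```ruby
require 'yaml'

# Hypothetical helper: emit a value as a double-quoted YAML scalar,
# escaping only backslashes, double quotes and newlines.
def quote_scalar(value)
  escaped = value.to_s.gsub(/["\\]/) { |ch| "\\#{ch}" }
  escaped = escaped.gsub("\n") { '\\n' }
  %("#{escaped}")
end

# Hypothetical helper: render a flat Hash of strings as a YAML mapping.
def to_quoted_yaml(fields)
  fields.map { |key, value| "#{key}: #{quote_scalar(value)}\n" }.join
end

sample = {
  'title'     => 'Talk:Job: A Comedy of Justice',
  'timestamp' => '2008-03-13T09:55:00Z',
  'comment'   => 'say "hi" # not a comment'
}

puts to_quoted_yaml(sample)
p YAML.load(to_quoted_yaml(sample)) == sample
```

Quoting everything is blunt, but it keeps the emitter free of any rules about which plain scalars are safe.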
Roan Kattouw wrote:
| Brion Vibber wrote:
|> If it's not a hard fix, it might be great to see that come in in the
|> next day or two so I can merge it into the 1.12.0 release.
|>
| Well ain't that a funny coincidence ;) I fixed the issue and applied the
| fix to the 1.12 branch. The magnitude of this problem was far greater
| than I had imagined: *every* mention of a page outside of the main
| namespace breaks, *every* ISO8601 timestamp breaks, which makes most
| query modules completely useless in YAML.
Hmm, a related problem I see is that numeric strings aren't quoted. If I'm reading things properly, that means that, say, a page named "123" will decode with an integer value for the 'title' field instead of a string value.
While some scripting languages will happily convert automatically between integer and string types, others are pretty picky about this.
-- brion vibber (brion @ wikimedia.org)
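A quick sketch of the numeric-title concern, assuming Ruby's Psych-backed yaml library; title_string? is a hypothetical helper. An unquoted 123 is resolved to an integer by the parser, while the quoted form stays a string.

```ruby
require 'yaml'

# Hypothetical helper: does the 'title' field decode as a String?
def title_string?(yaml_text)
  YAML.load(yaml_text)['title'].is_a?(String)
end

p title_string?("title: 123\n")      # false: decoded as an integer
p title_string?("title: \"123\"\n")  # true: quoting keeps it a string
```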
Brion Vibber wrote:
> Hmm, a related problem I see is that numeric strings aren't quoted. If I'm reading things properly, that means that, say, a page named "123" will decode with an integer value for the 'title' field instead of a string value.
> While some scripting languages will happily convert automatically between integer and string types, others are pretty picky about this.
That's a generic problem which occurs with JSON and possibly other formats as well. I'll try to fix it, but with PHP's weak typing system it could be impossible.
Roan Kattouw (Catrope)
Roan Kattouw wrote:
> Brion Vibber wrote:
>> Hmm, a related problem I see is that numeric strings aren't quoted. If I'm reading things properly, that means that, say, a page named "123" will decode with an integer value for the 'title' field instead of a string value.
>> While some scripting languages will happily convert automatically between integer and string types, others are pretty picky about this.
> That's a generic problem which occurs with JSON and possibly other formats as well. I'll try to fix it, but with PHP's weak typing system it could be impossible.
Turns out I wasn't entirely correct.
JSON handles "123" vs. 123 just fine. However, YAML has a simple string format that doesn't require quotes (only works if the string doesn't contain stuff that has to be escaped), as in:
foo: bar
baz: 3
In this format there's no way to tell the string "3" from the integer 3, since both are encoded exactly the same. If people are having trouble with this, they should file a bug at Bugzilla and I'll use the literal format for numerical strings as well.
Roan Kattouw (Catrope)
Roan Kattouw wrote:
| JSON handles "123" vs. 123 just fine. However, YAML has a simple string
| format that doesn't require quotes (only works if the string doesn't
| contain stuff that has to be escaped), as in:
|
| foo: bar
| baz: 3
|
| Because of this, there's no way to tell the string "3" from the integer
| 3 because both are encoded exactly the same.
According to the Wikipedia article on YAML, you can quote or explicitly typecast the value to ensure it will be decoded as a string:
"3"
!!str 3
-- brion vibber (brion @ wikimedia.org)
Brion Vibber wrote:
> According to the Wikipedia article on YAML, you can quote or explicitly typecast the value to ensure it will be decoded as a string:
> "3"
> !!str 3
That's true. The following is also possible (and is used for strings with weird characters):
foo: |
  3
So I'll go and implement that.
Roan Kattouw (Catrope)
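The spellings discussed in this exchange can be compared side by side. This is a sketch assuming Ruby's Psych-backed yaml library, with foo_value as a hypothetical helper. One detail worth noting: a plain literal block keeps its final line break, so a consumer comparing against "3" would need to strip it, or the emitter would need the "|-" form.

```ruby
require 'yaml'

# Hypothetical helper: decode the document and return its 'foo' field.
def foo_value(yaml_text)
  YAML.load(yaml_text)['foo']
end

p foo_value(%(foo: "3"\n))     # "3"    double-quoted scalar
p foo_value("foo: !!str 3\n")  # "3"    explicit string tag
p foo_value("foo: |\n  3\n")   # "3\n"  literal block keeps the last newline
p foo_value("foo: |-\n  3\n")  # "3"    "-" strips the trailing newline
```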
Brion Vibber wrote:
| Roan Kattouw wrote:
| | I know, I should fix it. It's just that I'm not familiar with YAML at
| | all, so I need to find some time to read the YAML specs and rewrite the
| | YAML formatter.
|
| If it's not a hard fix, it might be great to see that come in in the
| next day or two so I can merge it into the 1.12.0 release.
... and I see you've committed it. Super! :)
-- brion
L. Cleophas wrote:
> or on a time horizon for a fix/update?
Bug 12120 (the escaping issue) has been fixed on trunk; it could take a few days (or weeks) before it goes live on Wikipedia and friends. I'll also try to get this fix into the final 1.12 release.
In the meantime, please read comment #3 on bug 12120 [1] and verify that the example I provided validates in your YAML parsers.
Roan Kattouw (Catrope)
Hi
Thanks for the very quick responses and action. Ruby's YAML parser seems to parse your example alright, though converting the result back to YAML again seems to yield a slightly different order of elements... That seems to be a Ruby YAML module issue though, not one for the API.
I haven't installed MediaWiki locally to check that the fix works, but I took a look at the code, and the fix seems to make sense. One small issue: your code seems to handle : and # anywhere inside strings, but comment #3 on the bug says it handles them at the beginning of strings. Typo?
Eagerly awaiting the fix to go live on Wikipedia etc. :-)
Thanks, Loek
[1] https://bugzilla.wikimedia.org/show_bug.cgi?id=12120#c3