Trung Dinh wrote:
Hi all, I have an issue when trying to parse data fetched from the Wikipedia API. This is the piece of code that I am using:

api_url = 'http://en.wikipedia.org/w/api.php'
api_params = 'action=query&list=recentchanges&rclimit=5000&rctype=edit&rcnamespace=0&rcdir=newer&format=json&rcstart=20160504022715'

f = urllib2.Request(api_url, api_params)
print ('requesting ' + api_url + '?' + api_params)
source = urllib2.urlopen(f, None, 300).read()
source = json.loads(source)
json.loads(source) raised the following exception: "Expecting , delimiter: line 1 column 817105 (char 817104)"
I tried to use source.encode('utf-8') and some other encodings, but none of them helped. Is there any workaround for this issue? Thanks :)
Hi.
Weird, I can't reproduce this error. I had to import the "json" and "urllib2" modules, but after doing so, executing the code you provided here worked fine for me: https://phabricator.wikimedia.org/P3009.
You probably want to use 'https://en.wikipedia.org/w/api.php' as your end-point (HTTPS, not HTTP).
As far as I know, JSON is always encoded as UTF-8, so you shouldn't need to encode or decode the data explicitly.
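For what it's worth, Python 2's json.loads() accepts a UTF-8 byte string directly and decodes it itself, so an explicit decode step is redundant. A minimal illustration:

import json

raw = '{"title": "Z\xc3\xbcrich"}'  # UTF-8 encoded bytes for "Zürich"
# json.loads() assumes UTF-8 for byte strings, so decoding first
# makes no difference to the result:
print (json.loads(raw) == json.loads(raw.decode('utf-8')))  # True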
The error you're getting generally means that the JSON was malformed for some reason. It seems unlikely that MediaWiki's api.php is outputting invalid JSON, but I suppose it's possible.
Since you're coding in Python, you may be interested in a framework such as https://github.com/alexz-enwp/wikitools.
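For example, a minimal sketch of the same recentchanges query using wikitools' wiki.Wiki and api.APIRequest classes (untested; check the project's documentation for the exact interface):

from wikitools import wiki, api

site = wiki.Wiki('https://en.wikipedia.org/w/api.php')
params = {
    'action': 'query',
    'list': 'recentchanges',
    'rclimit': '500',
    'rctype': 'edit',
    'rcnamespace': '0',
    'rcdir': 'newer',
    'rcstart': '20160504022715',
}
request = api.APIRequest(site, params)
result = request.query()  # returns the decoded result; handles continuation

It also takes care of details like URL encoding for you, which removes a whole class of malformed-request bugs.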
MZMcBride
On Thu, May 5, 2016 at 6:16 PM, MZMcBride <z@mzmcbride.com> wrote:
The error you're getting generally means that the JSON was malformed for some reason. It seems unlikely that MediaWiki's api.php is outputting invalid JSON, but I suppose it's possible.
There is https://phabricator.wikimedia.org/T132159 along those lines, although it's not an API issue.
I note that the reported issue is with list=recentchanges, the output of which (even at a constant timestamp offset) could easily change with page deletion or revdel.
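If the goal is just to walk forward through the recent changes feed, using the API's continuation mechanism instead of re-issuing queries at incremented timestamps sidesteps that problem. A rough sketch (based on the standard "continue" protocol described in the MediaWiki API docs, not tested):

import json
import urllib
import urllib2

api_url = 'https://en.wikipedia.org/w/api.php'
params = {
    'action': 'query',
    'list': 'recentchanges',
    'rclimit': '500',
    'rctype': 'edit',
    'rcnamespace': '0',
    'rcdir': 'newer',
    'rcstart': '20160504022715',
    'format': 'json',
    'continue': '',  # opt in to the modern continuation style
}
while True:
    url = api_url + '?' + urllib.urlencode(params)
    result = json.loads(urllib2.urlopen(url, None, 300).read())
    for change in result['query']['recentchanges']:
        pass  # process each change here
    if 'continue' not in result:
        break  # no more results
    params.update(result['continue'])  # carry the server-supplied cursor forward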
Guys,
Thanks so much for your prompt feedback. Basically, what I am doing is to keep sending requests, advancing the date & time until we reach the next day. Specifically, what I have is something like:
api_url = 'http://en.wikipedia.org/w/api.php'
date = '20160504022715'

while True:
    api_params = 'action=query&list=recentchanges&rclimit=5000&rctype=edit&rcnamespace=0&rcdir=newer&format=json&rcstart={date}'.format(date=date)
    f = urllib2.Request(api_url, api_params)
    source = urllib2.urlopen(f, None, 300).read()
    source = json.loads(source)
    # ... increase date ...
Given the above code, I am encountering a weird situation. In the query, if I set rclimit to 500 then it runs normally. However, if I set rclimit to 5000 as in my previous email, I see the error. I know that for recentchanges rclimit should be set to 500. But is there anything particular about the value of rclimit that could lead to broken JSON? One way I've been trying to narrow it down is to catch the decode error and dump the raw bytes around the offset it reports, as in the sketch below.
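A rough diagnostic sketch, with the offset hard-coded from the error message above:

import json

# 'source' is the raw response body read in the loop above.
try:
    data = json.loads(source)
except ValueError:
    offset = 817104  # the "(char 817104)" offset from the reported error
    print (repr(source[offset - 60:offset + 60]))  # inspect the broken region
    raise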
MZMcBride wrote:
The error you're getting generally means that the JSON was malformed for some reason. It seems unlikely that MediaWiki's api.php is outputting invalid JSON, but I suppose it's possible.
I left a note on the Phabricator task that Marius linked to: https://phabricator.wikimedia.org/T133866#2272654.
It seems api.php end-points really are outputting garbage characters in some cases, though it remains unclear which layer is to blame. :-/
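If you want to sanity-check a response for such garbage before parsing it, something like the following might work (my own diagnostic sketch, not anything api.php provides): first verify the payload is valid UTF-8, then look for control characters that are illegal unescaped inside JSON strings:

import re

def find_garbage(raw):
    # Step 1: is the payload valid UTF-8 at all?
    try:
        text = raw.decode('utf-8')
    except UnicodeDecodeError as e:
        return 'invalid UTF-8 at byte %d: %r' % (e.start, raw[e.start:e.start + 10])
    # Step 2: control characters must be \u-escaped inside JSON strings.
    match = re.search(u'[\x00-\x08\x0b\x0c\x0e-\x1f]', text)
    if match:
        return 'control character %r at offset %d' % (match.group(), match.start())
    return None  # nothing obviously wrong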
MZMcBride