brion@svn.wikimedia.org wrote:
Revision: 35075
Author: brion
Date: 2008-05-20 06:30:36 +0000 (Tue, 20 May 2008)
Log Message:
Got sick of testing local copies of Wikipedia articles with no images handy. Threw together a really quick hack FileRepo class to grab images from the remote wiki using the MediaWiki API.
Amazingly it sort of works -- be warned however:
- no scaling seems to be done yet -- multi-megapixel images in your browser :D
- no caching of lookups -- verrrry slow
- lookups are done one at a time, not in any kind of batching
- probably doesn't properly support lots and lots of things
<snip>
// For some reason API doesn't currently provide type info
$magic = MimeMagic::singleton();
$info['mime'] = $magic->guessTypesForExtension( $this->getExtension() );
list( $info['major_mime'], $info['minor_mime'] ) = self::splitMime( $info['mime'] );
$info['media_type'] = $magic->getMediaType( null, $info['mime'] );
$this->mInfo = $info;
You're right. I never noticed, but list=allimages has aiprop=mime whereas prop=imageinfo doesn't. I'll fix that discrepancy (and any other discrepancies between aiprop and iiprop that are likely to exist) right away.
function getThumbPath( $suffix = false ) {
    return false; // hrmmm
}
function getThumbUrl( $suffix = false ) {
    return false; // FLKDSJLKFDJS
}
Did you know you can get these through the API as well with the iiurlwidth and iiurlheight parameters? The foreign wiki will do the scaling for you, and return the URL of a thumbnail and its dimensions.
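For illustration, something along these lines ought to do it (untested sketch; it borrows the mApiBase/Http::get plumbing from findFile() below, and the method name is made up):

function getThumbUrlFromApi( $name, $width ) {
    // Ask the foreign wiki to scale the image and hand back a thumbnail URL
    $url = $this->mApiBase . '?' . wfArrayToCgi( array(
        'format' => 'json',
        'action' => 'query',
        'titles' => 'Image:' . $name,
        'prop' => 'imageinfo',
        'iiprop' => 'url',
        'iiurlwidth' => $width ) );
    $data = json_decode( Http::get( $url ), true );
    if( isset( $data['query']['pages'] ) ) {
        foreach( $data['query']['pages'] as $page ) {
            // 'thumburl' (plus 'thumbwidth'/'thumbheight') shows up when iiurlwidth is set
            if( isset( $page['imageinfo'][0]['thumburl'] ) ) {
                return $page['imageinfo'][0]['thumburl'];
            }
        }
    }
    return false;
}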
function findFile( $title, $time = false ) {
    $url = $this->mApiBase . '?' . wfArrayToCgi( array(
        'format' => 'json',
        'action' => 'query',
        'titles' => $title, // fixme -- canonical namespace
        'prop' => 'imageinfo',
        'iiprop' => 'timestamp|user|comment|url|size|sha1|metadata' ) );
    $json = Http::get( $url );
    $data = json_decode( $json, true );
Why do you use JSON here? Wouldn't it be better/faster/more intuitive/whatever to use format=php and unserialize()? Surely PHP can interpret its own format faster than JSON? Also, I thought Http::get() was a faux pas, so why use it here? Or is there no other option? As to the FIXME, it's kind of cryptic so I don't know exactly what you mean there, but the client API will normalize titles. Or are you worried about dewiki requesting "Bild:Foo.jpg" from en?
if( isset( $data['query']['pages'] ) ) {
    foreach( $data['query']['pages'] as $pageid => $info ) {
        if( isset( $info['imageinfo'][0] ) ) {
            return new ForeignAPIFile( $title, $this, $info['imageinfo'][0] );
        }
    }
}
I don't really get what the foreach is for: it's only one file, right? You can use reset() to get the first element of an array if it doesn't have index 0. Speaking of which, there should probably be some sanity check that verifies that $title is a legal title to prevent unnecessary requests or people sneaking in a '|' and not getting what they expect.
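Concretely, the foreach could collapse to something like this (untested):

$pages = $data['query']['pages'];
$info = reset( $pages ); // first (and only) element, whatever its page-ID key
if( isset( $info['imageinfo'][0] ) ) {
    return new ForeignAPIFile( $title, $this, $info['imageinfo'][0] );
}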
Roan Kattouw (Catrope)
Roan Kattouw wrote:
// For some reason API doesn't currently provide type info
You're right. I never noticed, but list=allimages has aiprop=mime whereas prop=imageinfo doesn't. I'll fix that discrepancy (and any other discrepancies between aiprop and iiprop that are likely to exist) right away.
Yay!
function getThumbPath( $suffix = false ) {
    return false; // hrmmm
}
function getThumbUrl( $suffix = false ) {
    return false; // FLKDSJLKFDJS
}
Did you know you can get these through the API as well with the iiurlwidth and iiurlheight parameters? The foreign wiki will do the scaling for you, and return the URL of a thumbnail and its dimensions.
Yep! Just didn't bother doing it yet as it was late and I wanted to go to bed. ;)
Why do you use JSON here? Wouldn't it be better/faster/more intuitive/whatever to use format=php and unserialize()? Surely PHP can interpret its own format faster than JSON?
Unserializing PHP serialized data that comes from a foreign server is potentially unsafe. Some versions of PHP have had outright security vulnerabilities in unserialization, but even in the best of cases you aren't necessarily sure what data type you're going to get out -- a compromised or malicious server could hand us an instance of any class in MediaWiki, with completely arbitrary contents.
Rather than do a full audit of the entire system, including all extensions, to ensure that no class does anything unsafe in its unserialization wake-up functions, using a data format that will just give us an associative array full of strings and ints sounds nice and safe.
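To illustrate the difference with a contrived snippet (not from the patch):

// json_decode() with $assoc = true can only ever produce arrays and scalars,
// no matter what the remote server sends
$data = json_decode( '{"query":{"pages":{"1":{"title":"Foo.jpg"}}}}', true );
var_dump( is_array( $data ) ); // bool(true)

// unserialize() reconstructs whatever class the *input* names,
// and runs its __wakeup() if it has one
$obj = unserialize( 'O:8:"stdClass":1:{s:3:"foo";s:3:"bar";}' );
var_dump( $obj ); // object(stdClass) -- the wire format chose the type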
As for performance, the overhead of JSON encoding/decoding should be fairly trivial compared to the network time and the backend's work. Feel free to benchmark it, but I doubt using PHP serialization would have any significant performance impact.
(PHP serialization might be nicer for compatibility, as PHP's native JSON decode module isn't guaranteed to be present, but we may have a JSON decoder sitting around for some of the AJAX stuff anyway; I haven't bothered to look.)
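If anyone feels like benchmarking it, a crude harness would be something like (arbitrary iteration count; either cost is dwarfed by the HTTP round trip):

$data = array( 'query' => array( 'pages' => range( 1, 1000 ) ) );
$json = json_encode( $data );
$ser = serialize( $data );

$t = microtime( true );
for( $i = 0; $i < 1000; $i++ ) { json_decode( $json, true ); }
echo "json_decode: ", microtime( true ) - $t, " sec\n";

$t = microtime( true );
for( $i = 0; $i < 1000; $i++ ) { unserialize( $ser ); }
echo "unserialize: ", microtime( true ) - $t, " sec\n";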
Also, I thought Http::get() was a faux pas, so why use it here?
On the contrary -- *not* using Http::get() is a faux pas. :)
We can control the timeout values etc with Http::get(), and choose the best backend (PHP's URL wrappers or CURL module), all nice and encapsulated. Duplicating all that by hand is a big no-no.
As to the FIXME, it's kind of cryptic so I don't know exactly what you mean there, but the client API will normalize titles. Or are you worried about dewiki requesting "Bild:Foo.jpg" from en?
Right. If the local wiki's not in English, it may construct the output using something other than the canonical 'Image:', which won't be recognized by the foreign repo.
if( isset( $data['query']['pages'] ) ) {
    foreach( $data['query']['pages'] as $pageid => $info ) {
        if( isset( $info['imageinfo'][0] ) ) {
            return new ForeignAPIFile( $title, $this, $info['imageinfo'][0] );
        }
    }
}
I don't really get what the foreach is for: it's only one file, right? You can use reset() to get the first element of an array if it doesn't have index 0.
IMHO, reset() and foreach() are equally ugly here. :)
Speaking of which, there should probably be some sanity check that verifies that $title is a legal title to prevent unnecessary requests or people sneaking in a '|' and not getting what they expect.
An invalid title shouldn't have reached us by this point; we're handed a Title object, not a string.
-- brion vibber (brion @ wikimedia.org)
Brion Vibber wrote:
Why do you use JSON here? Wouldn't it be better/faster/more intuitive/whatever to use format=php and unserialize()? Surely PHP can interpret its own format faster than JSON?
Unserializing PHP serialized data that comes from a foreign server is potentially unsafe. Some versions of PHP have had outright security vulnerabilities in unserialization, but even in the best of cases you aren't necessarily sure what data type you're going to get out -- a compromised or malicious server could hand us an instance of any class in MediaWiki, with completely arbitrary contents.
Rather than do a full audit of the entire system, including all extensions, to ensure that no class does anything unsafe in its unserialization wake-up functions, using a data format that will just give us an associative array full of strings and ints sounds nice and safe.
We're handling an *array* here, right? How the hell can calling unserialize() on something, then using [] on it, cause anything other than a fatal error? Or does PHP allow overloading of the [] operator? (I thought PHP didn't *do* operator overloading, but I could be wrong.) Anyway, is_array() should be safe enough here.
As for performance, the overhead of JSON encoding/decoding should be fairly trivial compared to the network time and the backend's work. Feel free to benchmark it, but I doubt using PHP serialization would have any significant performance impact.
(PHP serialization might be nicer for compatibility, as PHP's native JSON decode module isn't guaranteed to be present, but we may have a JSON decoder sitting around for some of the AJAX stuff anyway; I haven't bothered to look.)
I'm not saying it makes a huge difference and I know trying to optimize non-bottleneck code is discouraged, but if you can get better performance by just using a different function, it's too easy not to do. You're also right about the compatibility thing.
Also, I thought Http::get() was a faux pas, so why use it here?
On the contrary -- *not* using Http::get() is a faux pas. :)
We can control the timeout values etc with Http::get(), and choose the best backend (PHP's URL wrappers or CURL module), all nice and encapsulated. Duplicating all that by hand is a big no-no.
OK. I remember someone saying, in relation to bug 14024 (interwiki API), that calling Http::get() would take too long, etc.
As to the FIXME, it's kind of cryptic so I don't know exactly what you mean there, but the client API will normalize titles. Or are you worried about dewiki requesting "Bild:Foo.jpg" from en?
Right. If the local wiki's not in English, it may construct the output using something other than the canonical 'Image:', which won't be recognized by the foreign repo.
In this case, we need to canonicalize the title first (i.e. replace Bild: by Image: on a German wiki), then submit that title to the remote wiki, which will normalize Image: to whatever it uses.
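Something like this, presumably (sketch; assumes the Title object findFile() receives and the MWNamespace helper):

// Build the canonical (English) form of the title to send over the wire;
// MWNamespace::getCanonicalName() returns e.g. 'Image' regardless of what
// the local content language calls that namespace
$canonical = MWNamespace::getCanonicalName( $title->getNamespace() ) .
    ':' . $title->getDBkey();
// On a German wiki, [[Bild:Foo.jpg]] becomes "Image:Foo.jpg" here; the
// foreign repo then re-normalizes it to its own local form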
if( isset( $data['query']['pages'] ) ) {
    foreach( $data['query']['pages'] as $pageid => $info ) {
        if( isset( $info['imageinfo'][0] ) ) {
            return new ForeignAPIFile( $title, $this, $info['imageinfo'][0] );
        }
    }
}
I don't really get what the foreach is for: it's only one file, right? You can use reset() to get the first element of an array if it doesn't have index 0.
IMHO, reset() and foreach() are equally ugly here. :)
Personal preference then, I guess. To me, foreach() implies that something runs more than once (although that's impossible in this case because of the return statement).
Speaking of which, there should probably be some sanity check that verifies that $title is a legal title to prevent unnecessary requests or people sneaking in a '|' and not getting what they expect.
An invalid title shouldn't have reached us by this point; we're handed a Title object, not a string.
OK then. I didn't look at the calling code at all, just at the functions themselves.
Roan Kattouw (Catrope)
Roan Kattouw wrote:
Brion Vibber wrote:
Why do you use JSON here? Wouldn't it be better/faster/more intuitive/whatever to use format=php and unserialize()? Surely PHP can interpret its own format faster than JSON?
Unserializing PHP serialized data that comes from a foreign server is potentially unsafe.
[snip]
We're handling an *array* here, right? How the hell can calling unserialize() on something, then using [] on it, cause anything other than a fatal error?
"If the variable being unserialized is an object, after successfully reconstructing the object PHP will automatically attempt to call the __wakeup() member function (if it exists)."
Arbitrary data input + execution of a function -> potentially could cause problems, depending on what the wakeup function of the given class does.
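A contrived example of the failure mode (hypothetical class; nothing like it need exist in MediaWiki itself):

class CacheFile {
    public $path = '/tmp/example.txt';
    function __wakeup() {
        // Runs automatically inside unserialize() -- and the attacker
        // chose the contents of $this->path
        @unlink( $this->path );
    }
}

// Attacker-controlled response body: names the class, fires __wakeup()
$evil = 'O:9:"CacheFile":1:{s:4:"path";s:16:"/tmp/example.txt";}';
unserialize( $evil );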
At the moment that's probably reasonably safe given our codebase (ignoring that unserialize() itself has had a rash of low-level security problems such as buffer overflows), but it gives me the willies as a low-level problem.
Even if it does only cause a fatal error, now your site's broken because of a problem with a third-party provider. That's a pretty poor practice and should be avoided.
Or does PHP allow overloading of the [] operator? (I thought PHP didn't *do* operator overloading, but I could be wrong.) Anyway, is_array() should be safe enough here.
It sort of does -- classes can provide an array-like interface.
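For instance, via the ArrayAccess interface (PHP 5), contrived:

class FakeArray implements ArrayAccess {
    function offsetExists( $key ) { return true; }
    function offsetGet( $key ) { return 'anything I like'; } // arbitrary code here
    function offsetSet( $key, $value ) { }
    function offsetUnset( $key ) { }
}

$a = new FakeArray();
echo $a['query']; // no fatal error -- prints 'anything I like'
// ...though as you say, is_array() would still catch this one.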
I'm not saying it makes a huge difference and I know trying to optimize non-bottleneck code is discouraged, but if you can get better performance by just using a different function, it's too easy not to do.
That depends on what other concerns are present. Using low-level serialization on unvalidated input data just isn't a good security practice.
It's *convenient* for quickie hacks, but it's not very reliable and not very safe. These concerns trump a hypothetical speed advantage which may not even exist.
Also, I thought Http::get() was a faux pas, so why use it here?
On the contrary -- *not* using Http::get() is a faux pas. :)
We can control the timeout values etc with Http::get(), and choose the best backend (PHP's URL wrappers or CURL module), all nice and encapsulated. Duplicating all that by hand is a big no-no.
OK. I remember someone saying, in relation to bug 14024 (interwiki API), that calling Http::get() would take too long, etc.
That's an unrelated issue where you'd be trying to proxy third-party requests.
Right. If the local wiki's not in English, it may construct the output using something other than the canonical 'Image:', which won't be recognized by the foreign repo.
In this case, we need to canonicalize the title first (i.e. replace Bild: by Image: on a German wiki), then submit that title to the remote wiki, which will normalize Image: to whatever it uses.
Exactly. :)
I don't really get what the foreach is for: it's only one file, right? You can use reset() to get the first element of an array if it doesn't have index 0.
IMHO, reset() and foreach() are equally ugly here. :)
Personal preference then, I guess. To me, foreach() implies that something runs more than once (although that's impossible in this case because of the return statement).
reset() followed by current()/next() is exactly equivalent to foreach(), but without the clean, visible structure around the iterator.
Sometimes it's necessary because you're doing something _weird_ with your loop, but I tend to think of it as very icky. :)
Anyway, this can fairly easily be extended to implement the new multiple-title findFiles(), with the singular findFile() as a special case that passes only one title, so a foreach is gonna be just fine. ;)
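Roughly like this (untested sketch of the plural version; the API accepts multiple titles joined with '|', up to its per-request cap):

function findFiles( $titles ) {
    $url = $this->mApiBase . '?' . wfArrayToCgi( array(
        'format' => 'json',
        'action' => 'query',
        'titles' => implode( '|', $titles ),
        'prop' => 'imageinfo',
        'iiprop' => 'timestamp|user|comment|url|size|sha1|metadata' ) );
    $data = json_decode( Http::get( $url ), true );
    $files = array();
    if( isset( $data['query']['pages'] ) ) {
        foreach( $data['query']['pages'] as $info ) {
            if( isset( $info['imageinfo'][0] ) ) {
                $files[] = new ForeignAPIFile(
                    Title::newFromText( $info['title'] ), $this, $info['imageinfo'][0] );
            }
        }
    }
    return $files;
}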
-- brion vibber (brion @ wikimedia.org)