<rant> I'm currently working on the Scott Foresman image donation, cutting large scanned images into smaller, manually optimized ones. The category containing the unprocessed images is http://commons.wikimedia.org/wiki/Category:ScottForesman-raw
It's shameful. Honestly. Look at it. We're the world's #9 top web site, and this is the best we can do?
Yes, I know that the images are large, both in dimensions (~5000x5000px) and size (5-15MB each). Yes, I know that ImageMagick has problems with such images. But honestly, is there no open source software that can generate a thumbnail from a 15MB PNG without nuking our servers?
In case it's not possible (which I doubt, since I can generate thumbnails with ImageMagick from these on my laptop, one at a time; maybe a slow-running thumbnail generator, at least for "usual" sizes, on a dedicated server?), it's no use cluttering the entire page with broken thumbnails. Where's the option for a list view? You know, a table with linked title, size, uploader, date, no thumbnails? They're files, so why don't we use things that have proven useful in a file system?
And then, of course: "There are 200 files in this category." That's two lines below the "(next 200)" link. At that point, we know there are more than 200 images, but we forget about that two lines further down?
Yes, I know that some categories are huge, and that it would take too long to get the exact number. But would the exact number for large categories be useful? 500,000 or 500,001 entries, who cares? How many categories are that large anyway? 200 or 582 entries, now /that/ people might care about. Why not at least try to get a number, with a limit of, say, 5001, and
* give the exact number if it's less than 5001 entries
* say "over 5000 entries" if the query returns 5001
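(For illustration, a capped count like that could be run against the existing categorylinks table, where cl_to holds the category title; the category name and the exact form are just a sketch, not a finished patch.)

SELECT COUNT(*) AS n
FROM (SELECT 1 FROM categorylinks WHERE cl_to = 'ScottForesman-raw' LIMIT 5001) AS capped;
-- n <= 5000: show the exact number
-- n  = 5001: show "over 5000 entries"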
Yes, everyone's busy. Yes, there are more pressing issues (SUL, stable versions, you name it). Yes, MediaWiki wasn't developed as a media repository (tell me about it;-) Yes, "sofixit" myself.
Still, I ask: is this the best we can do?
Magnus
</rant>
On Wed, Feb 20, 2008 at 5:14 PM, Magnus Manske magnusmanske@googlemail.com wrote:
Yes, everyone's busy. Yes, there are more pressing issues (SUL, stable versions, you name it). Yes, MediaWiki wasn't developed as a media repository (tell me about it;-) Yes, "sofixit" myself.
Still, I ask: is this the best we can do?
No. We can do better. Both category handling and image handling are a complete mess and need to be reworked to have some level of features that doesn't require ludicrous amounts of manual or bot work to do things that should be one-click operations, or to find information that should be provided at first glance. We absolutely can and should do better.
The problem is . . . someone has to step up and do it, preferably someone who's willing and recognized as able to do large-scale rearchitecting of large and important subsystems. Frankly, I think that about limits our options to Tim at the moment. Some others, like Aaron or Werdna, are willing to do work like this (at least on some things), and others like VasilievVV have even tried to make a start on this very category of issue (he did image redirects, right?). But mostly either their work needs to be disabled by default for lack of polish or imperfect implementation, or it can't even be reviewed properly because Brion doesn't have time. From my perspective, it looks like Brion is much more willing to review big commits by Tim than anyone else, I guess because of some combination of a) he has to, Tim is a paid employee doing what he's specifically asked to and b) he doesn't have to look quite so closely because he knows he can trust Tim to do a good job overall. Both of which are perfectly natural and good reasons, but inevitably they limit things.
It just seems to me that we really don't have enough senior developers. Maybe we'll get another one later this year -- various Foundation statements have mentioned multiple developers to be hired in 2008. A paid junior developer is also much needed, for other reasons, and I'm not questioning the decision to do that first -- it will also give Brion more time since he won't have so many office duties -- but IMO this is pretty much how the facts lie. There's only so much possibility of having inexperienced to moderately experienced developers, whether volunteer or paid, write thousands of lines of code to revamp major features, and that's what this kind of complaint requires to be properly fixed.
On 2/21/08, Magnus Manske magnusmanske@googlemail.com wrote:
Yes, I know that some categories are huge, and that it would take too long to get the exact number.
Why is that? Couldn't we store the count somewhere and inc/dec it as required?
Steve
On Wed, Feb 20, 2008 at 8:59 PM, Steve Bennett stevagewp@gmail.com wrote:
On 2/21/08, Magnus Manske magnusmanske@googlemail.com wrote:
Yes, I know that some categories are huge, and that it would take too long to get the exact number.
Why is that? Couldn't we store the count somewhere and inc/dec it as required?
Of course. We just need to make a category table. Would cut down on the size of categorylinks, too, if we could use category id's.
On 2/21/08, Simetrical Simetrical+wikilist@gmail.com wrote:
On Wed, Feb 20, 2008 at 8:59 PM, Steve Bennett stevagewp@gmail.com wrote:
Why is that? Couldn't we store the count somewhere and inc/dec it as required?
Of course. We just need to make a category table. Would cut down on the size of categorylinks, too, if we could use category id's.
Actually, I've been looking into this recently (It's a very frequent personal request). I very much like Simetrical's idea of having a separate category table (I felt bad introducing a schema change to store a silly little count in the database per category). Domas tells me that checking the number of items in a category is about a five-second query for very big categories - so we probably don't want to rebuild it at all - even when saving a page with a category in it (think about it, 5-6 big categories would tie up the database servers for a minute or more).
So, if I were to implement this tomorrow, this is what I would do (feedback welcomed):
* Create a 'category' table starting off with a c_id, c_page_id, and c_count. This would fit in very nicely with something Tim was talking about last night - allowing flags to be put on categories (i.e. "don't show this category on the article", "show icon Image:X in the top corner of articles in this category", et cetera) which could be put in a separate column of the category table.
* On save of a category, or a page including that category, insert the requisite category entry (this would save a dirty great big migration script).
* When a categorylinks item is INSERTED or DELETED on links-update (the code is nice in that it only inserts/deletes those items which have been added/removed), do the requisite incrementing/decrementing on the category table.
* Display the count on the category page.
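(A rough SQL sketch of the above, keeping only the column names already mentioned - c_id, c_page_id, c_count; the types, the unique key and the placeholder variable are my assumptions, not a worked-out design.)

CREATE TABLE category (
  c_id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  c_page_id INT UNSIGNED NOT NULL,            -- page_id of the Category: page
  c_count   INT UNSIGNED NOT NULL DEFAULT 0,  -- number of members
  UNIQUE KEY (c_page_id)
);

-- on links-update, per categorylinks row added or removed for a category
-- (@cat_page_id stands in for the category page's id):
UPDATE category SET c_count = c_count + 1 WHERE c_page_id = @cat_page_id;
UPDATE category SET c_count = c_count - 1 WHERE c_page_id = @cat_page_id;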
This would add a performance penalty of a potentially very expensive database query per category, as a once-off, on page-save. It would enable a whole bunch of new functionality, such as proper display of membership counts on category pages, the stuff Tim was talking about last night with assigning a bit of metadata to categories (to make them a bit more useful).
I'm willing to look into implementing this in the next few months, depending upon how my understanding of the technical details corresponds to reality.
That would be nice. I haven't kept up with the last couple of versions, but there used to also be a problem with subcategories not showing up if they didn't sort into the initial 200 in the query limit. If this is going to get fixed, it would be nice if there was a fix for that too, if it's not already addressed...we're still on 1.9x
My badly named SplitCategoryPage extension was written to do that, but is only suitable for operations that are much, much smaller than Wikimedia projects. Nevertheless, if you want to see it in action:
http://gowiki.tamu.edu/wiki/index.php/Category:Eukaryota
This is a category with 111867 articles. The wiki is running on a single Intel quad Mac XServe blade that it shares with another wiki. No squid, no memcache. Traffic is microscopically (or perhaps femtoscopically?) small by WMF standards.
Jim
On Feb 20, 2008, at 10:52 PM, Andrew Garrett wrote:
On 2/21/08, Simetrical Simetrical+wikilist@gmail.com wrote:
On Wed, Feb 20, 2008 at 8:59 PM, Steve Bennett stevagewp@gmail.com wrote:
Why is that? Couldn't we store the count somewhere and inc/dec it as required?
Of course. We just need to make a category table. Would cut down on the size of categorylinks, too, if we could use category id's.
Actually, I've been looking into this recently (It's a very frequent personal request). I very much like Simetrical's idea of having a separate category table (I felt bad introducing a schema change to store a silly little count in the database per category). Domas tells me that checking the number of items in a category is about a five-second query for very big categories - so we probably don't want to rebuild it at all - even when saving a page with a category in it (think about it, 5-6 big categories would tie up the database servers for a minute or more).
So, if I were to implement this tomorrow, this is what I would do (feedback welcomed):
- Create a 'category' table starting off with a c_id, c_page_id, and c_count. This would fit in very nicely with something Tim was talking about last night - allowing flags to be put on categories (i.e. "don't show this category on the article", "show icon Image:X in the top corner of articles in this category", et cetera) which could be put in a separate column of the category table.
- On save of a category, or a page including that category, insert the requisite category entry (this would save a dirty great big migration script).
- When a categorylinks item is INSERTED or DELETED on links-update (the code is nice in that it only inserts/deletes those items which have been added/removed), do the requisite incrementing/decrementing on the category table.
- Display the count on the category page.
This would add a performance penalty of a potentially very expensive database query per category, as a once-off, on page-save. It would enable a whole bunch of new functionality, such as proper display of membership counts on category pages, the stuff Tim was talking about last night with assigning a bit of metadata to categories (to make them a bit more useful).
I'm willing to look into implementing this in the next few months, depending upon how my understanding of the technical details corresponds to reality.
-- Andrew Garrett
===================================== Jim Hu Associate Professor Dept. of Biochemistry and Biophysics 2128 TAMU Texas A&M Univ. College Station, TX 77843-2128 979-862-4054
On Thu, Feb 21, 2008 at 4:52 AM, Andrew Garrett andrew@epstone.net wrote:
So, if I were to implement this tomorrow, this is what I would do (feedback welcomed):
- Create a 'category' table starting off with a c_id, c_page_id, and c_count. This would fit in very nicely with something Tim was talking about last night - allowing flags to be put on categories (i.e. "don't show this category on the article", "show icon Image:X in the top corner of articles in this category", et cetera) which could be put in a separate column of the category table.
- On save of a category, or a page including that category, insert the requisite category entry (this would save a dirty great big migration script).
- When a categorylinks item is INSERTED or DELETED on links-update (the code is nice in that it only inserts/deletes those items which have been added/removed), do the requisite incrementing/decrementing on the category table.
- Display the count on the category page.
What about the number of items in different namespaces? At the very least, I'd like to separate pages and files in the count. Maybe also articles and other namespaces, so there'd be c_count_main, c_count_files, c_count_other.
This would add a performance penalty of a potentially very expensive database query per category, as a once-off, on page-save. It would enable a whole bunch of new functionality, such as proper display of membership counts on category pages, the stuff Tim was talking about last night with assigning a bit of metadata to categories (to make them a bit more useful).
Hmm... When you save a page, there can only be three "changes" regarding a category:
1. It was added
2. It was removed
3. It was kept
#3 won't cost anything; #1 and #2 could be solved by increasing/decreasing the counter.
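(Sketched with the namespace split suggested above, assuming the links-update code knows the member's namespace - 0 for articles, 6 for files - and using placeholder variables; nothing here is meant as final column naming.)

UPDATE category
SET c_count_main  = c_count_main  + IF(@member_ns = 0, 1, 0),
    c_count_files = c_count_files + IF(@member_ns = 6, 1, 0),
    c_count_other = c_count_other + IF(@member_ns NOT IN (0, 6), 1, 0)
WHERE c_page_id = @cat_page_id;
-- removal is the same statement with -1 instead of +1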
Not sure about long-term consistency (update cronjob?) and template edits (changing categories for lots'o'pages).
I'm willing to look into implementing this in the next few months, depending upon how my understanding of the technical details corresponds to reality.
Great!
Magnus
"Magnus Manske" magnusmanske@googlemail.com wrote in message news:fab0ecb70802210138r7e26d14y62dcab51d96ba19@mail.gmail.com...
On Thu, Feb 21, 2008 at 4:52 AM, Andrew Garrett andrew@epstone.net wrote:
- On save of a category, or a page including that category, insert the requisite category entry (this would save a dirty great big migration script).
- When a categorylinks item is INSERTED or DELETED on links-update (the code is nice in that it only inserts/deletes those items which have been added/removed), do the requisite incrementing/decrementing on the category table.
Hmm... When you save a page, there can only be three "changes" regarding a category:
- It was added
- It was removed
- It was kept
#3 won't cost anything; #1 and #2 could be solved by increasing/decreasing the counter.
How do you know whether a categorylinks item has been counted in the total count of items in the category? If the table is not populated via a migration script, i.e. entries are added as page edits are made, then a few deletions of pages that have not yet been added will cause you to end up with a negative count! Also, all counts will be wrong until all pages have been edited, which seems a bit pointless - am I misunderstanding something here?
- Mark Clements (HappyDog)
On Thu, Feb 21, 2008 at 10:55 AM, Mark Clements gmane@kennel17.co.uk wrote:
"Magnus Manske" magnusmanske@googlemail.com wrote in message news:fab0ecb70802210138r7e26d14y62dcab51d96ba19@mail.gmail.com...
On Thu, Feb 21, 2008 at 4:52 AM, Andrew Garrett andrew@epstone.net wrote:
- On save of a category, or a page including that category, insert the requisite category entry (this would save a dirty great big migration script).
- When a categorylinks item is INSERTED or DELETED on links-update (the code is nice in that it only inserts/deletes those items which have been added/removed), do the requisite incrementing/decrementing on the category table.
Hmm... When you save a page, there can only be three "changes" regarding a category:
- It was added
- It was removed
- It was kept
#3 won't cost anything; #1 and #2 could be solved by increasing/decreasing the counter.
How do you know whether a categorylinks item has been counted in the total count of items in the category? If the table is not populated via a migration script, i.e. entries are added as page edits are made, then a few deletions of pages that have not yet been added will cause you to end up with a negative count! Also, all counts will be wrong until all pages have been edited, which seems a bit pointless - am I misunderstanding something here?
You'd have to seed the counts once when the feature is turned on, of course.
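(Something like this one-off seeding, assuming Andrew's c_page_id/c_count columns; 14 is the Category namespace. Note that categories whose Category: page doesn't exist yet would be missed by this particular form.)

INSERT INTO category (c_page_id, c_count)
SELECT page_id, COUNT(*)
FROM categorylinks
JOIN page ON page_namespace = 14 AND page_title = cl_to
GROUP BY page_id;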
Magnus
"Magnus Manske" magnusmanske@googlemail.com wrote in message news:fab0ecb70802210312l221eec90w8fd73ebe1c019169@mail.gmail.com...
On Thu, Feb 21, 2008 at 10:55 AM, Mark Clements gmane@kennel17.co.uk wrote:
How do you know whether a categorylinks item has been counted in the total count of items in the category? If the table is not populated via a migration script, i.e. entries are added as page edits are made, then a few deletions of pages that have not yet been added will cause you to end up with a negative count! Also, all counts will be wrong until all pages have been edited, which seems a bit pointless - am I misunderstanding something here?
You'd have to seed the counts once when the feature is turned on, of course.
That's what I would have thought, but this comment about not requiring a migration script implied otherwise:
On Thu, Feb 21, 2008 at 4:52 AM, Andrew Garrett andrew@epstone.net wrote:
- On save of a category, or a page including that category, insert the requisite category entry (this would save a dirty great big migration script).
- Mark Clements (HappyDog)
On Wed, Feb 20, 2008 at 11:52 PM, Andrew Garrett andrew@epstone.net wrote:
Actually, I've been looking into this recently (It's a very frequent personal request). I very much like Simetrical's idea of having a separate category table
I got it from Rob, I think.
Domas tells me that checking the number of items in a category is about a five-second query for very big categories
I've observed this too, at http://en.wikipedia.org/wiki/Special:Categories?offset=Living_peopla . :)
So, if I were to implement this tomorrow, this is what I would do (feedback welcomed):
- Create a 'category' table starting off with a c_id, c_page_id, and c_count. This would fit in very nicely with something Tim was talking about last night - allowing flags to be put on categories (i.e. "don't show this category on the article", "show icon Image:X in the top corner of articles in this category", et cetera) which could be put in a separate column of the category table.
Not c_page_id. It needs to be c_title. Remember, the corresponding page may not exist. Also, why not have c_pages, c_images, c_subcategories separately? We do display them separately, and it just stores more info. Magnus' suggestion of separating out content namespace from others might be a good idea too.
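(So the sketch would become something like this - the column names are the ones suggested in this thread, everything else is an assumption.)

CREATE TABLE category (
  c_id            INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  c_title         VARCHAR(255) BINARY NOT NULL,  -- title, not page_id: the page may not exist
  c_pages         INT UNSIGNED NOT NULL DEFAULT 0,
  c_subcategories INT UNSIGNED NOT NULL DEFAULT 0,
  c_images        INT UNSIGNED NOT NULL DEFAULT 0,
  UNIQUE KEY (c_title)
);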
- On save of a category, or a page including that category, insert the requisite category entry (this would save a dirty great big migration script).
- When a categorylinks item is INSERTED or DELETED on links-update (the code is nice in that it only inserts/deletes those items which have been added/removed), do the requisite incrementing/decrementing on the category table.
I guess what you would do is UPDATE category SET c_count = c_count+1, and then check number of affected rows to see if it worked. And if not, you can regenerate the whole thing with an INSERT ... SELECT. And on category view, you can try reading from the category table, and if that fails read from the result set as usual. This seems sane: it has no additional overhead (above a migration script) once the migration is complete.
On the other hand, it means we have to keep and maintain these obsolescent bits of code around forever. It might be best to have the default update.php do a one-shot migration script; and on Wikimedia, push the category table updating code (in links-update) live first, then start running the migration gradually in the background. The category table updating code wouldn't have to check whether it actually worked or not, it would just affect zero rows if it didn't. Once Wikimedia was updated, of course, we would add features that depended on the count being accurate.
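(In SQL terms, roughly the following, using c_title as suggested above; 'Living_people' is just an example key.)

-- try the cheap in-place bump first
UPDATE category SET c_count = c_count + 1 WHERE c_title = 'Living_people';
-- if that affected zero rows, (re)build the row from categorylinks instead
INSERT INTO category (c_title, c_count)
SELECT cl_to, COUNT(*) FROM categorylinks WHERE cl_to = 'Living_people' GROUP BY cl_to;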
On Thu, Feb 21, 2008 at 2:42 AM, Jim Hu jimhu@tamu.edu wrote:
That would be nice. I haven't kept up with the last couple of versions, but there used to also be a problem with subcategories not showing up if they didn't sort into the initial 200 in the query limit. If this is going to get fixed, it would be nice if there was a fix for that too, if it's not already addressed...we're still on 1.9x
This is a separate issue (which is still quite true). It's because we don't page subcategories, articles, and images separately, although we display them separately. We retrieve the first 200 category members by sort key, and only then do we pigeonhole them into the correct part of the page.
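(Schematically, the category page does something like the query below and then sorts the rows into the three sections in PHP; this is an illustration, not the literal code.)

SELECT cl_from, cl_sortkey, page_namespace, page_title
FROM categorylinks
JOIN page ON page_id = cl_from
WHERE cl_to = 'Some_category'
ORDER BY cl_sortkey
LIMIT 200;
-- a subcategory whose sort key falls after the 200th member never reaches the
-- result set, so it never shows up in the "Subcategories" section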
There are various methods that have been discussed to deal with this. We could, for instance, include a null or other low-sorting ASCIIbetical character in the sort keys of subcategories. That would be the quick and ugly way, and would work okay, but would still possibly not work as expected -- subcategories would still be paged with everything else, just always on the first page, and not necessarily even that if someone used some really weird sort keys.
Arguably, a better way would be to have a separate subcategorylinks table, keep category articles' category inclusions out of the categorylinks table, and page them totally separately. This is harder and also fragments very similar info into separate tables. A similar solution would be to add a one-bit field (CHAR(0) NULL?! :D) to categorylinks, index it, and use that to indicate whether cl_from is a category or not. This bloats indexes a bit.
Some variant of one of these latter two might also be used for uploaded files, if we want to page those separately too. If we paged subcategories, articles, and files separately, then paging would match the layout of the page: none of the three sections of the page "interferes" with any other. This might be the most intuitive. It would also make sense to have somewhat smaller numbers of images displayed per page, if they're thumbnailed, which is impossible if they're paged together with articles.
Simetrical wrote:
There are various methods that have been discussed to deal with this. We could, for instance, include a null or other low-sorting ASCIIbetical character in the sort keys of subcategories. That would be the quick and ugly way, and would work okay, but would still possibly not work as expected -- subcategories would still be paged with everything else, just always on the first page, and not necessarily even that if someone used some really weird sort keys.
Arguably, a better way would be to have a separate subcategorylinks table, keep category articles' category inclusions out of the categorylinks table, and page them totally separately. This is harder and also fragments very similar info into separate tables. A similar solution would be to add a one-bit field (CHAR(0) NULL?! :D) to categorylinks, index it, and use that to indicate whether cl_from is a category or not. This bloats indexes a bit.
Although we may want to add cl_is_category, retrieving subcategories only is already possible with the current schema:
SELECT whatever FROM categorylinks JOIN page ON page_id=cl_from AND page_namespace=14 WHERE cl_title='Foo' AND cl_namespace=0;
No idea how efficient this query is, maybe it's evil and uses 57 filesorts, who knows ;)
Roan Kattouw (Catrope)
On Thu, Feb 21, 2008 at 11:11 AM, Roan Kattouw roan.kattouw@home.nl wrote:
Although we may want to add cl_is_category, retrieving subcategories only is already possible with the current schema:
SELECT whatever FROM categorylinks JOIN page ON page_id=cl_from AND page_namespace=14 WHERE cl_title='Foo' AND cl_namespace=0;
No idea how efficient this query is, maybe it's evil and uses 57 filesorts, who knows ;)
No filesorts, since nothing has to be sorted (no ORDER BY or GROUP BY). It will just scan every single categorylinks row corresponding to the given category. It won't be able to use an index for the entire query, either: it will have to hit the page table data blocks. Probably it will take 30 seconds or more on Wikimedia servers for Living people or other large categories, given that just scanning the index for all those rows (no joins, no hitting data blocks) is five seconds. Could be a few minutes, I guess. Either way it's not remotely acceptable.
That such things are possible to do with the current database is obvious. After all, we store all the relevant information: there's got to be some way to retrieve it, somehow, even if it requires querying all tables in their entirety and manually processing every row in PHP. So the only real question in these things is efficiency.
Simetrical wrote:
No filesorts, since nothing has to be sorted (no ORDER BY or GROUP BY). It will just scan every single categorylinks row corresponding to the given category. It won't be able to use an index for the entire query, either: it will have to hit the page table data blocks. Probably it will take 30 seconds or more on Wikimedia servers for Living people or other large categories, given that just scanning the index for all those rows (no joins, no hitting data blocks) is five seconds. Could be a few minutes, I guess. Either way it's not remotely acceptable.
That such things are possible to do with the current database is obvious. After all, we store all the relevant information: there's got to be some way to retrieve it, somehow, even if it requires querying all tables in their entirety and manually processing every row in PHP. So the only real question in these things is efficiency.
That's why I said we probably do want a cl_is_category boolean field. Of course filling it would be a huge operation on enwp.
Roan Kattouw (Catrope)
On Thu, Feb 21, 2008 at 11:25 AM, Roan Kattouw roan.kattouw@home.nl wrote:
That's why I said we probably do want a cl_is_category boolean field. Of course filling it would be a huge operation on enwp.
Well, not really. It wouldn't lock anything, so you could just run it and forget about it. The huge operation would be adding the column in the first place, since ALTER TABLE locks the table to writes. That has to be done by taking slaves out of rotation one by one, and is a pain, plus the alter itself takes a long time. That's a mark against that option; the other two require no such alterations.
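(Roughly like this - the index definition is my guess at what would be wanted, not a worked-out design.)

-- the painful part: the alter itself, which locks the table to writes
ALTER TABLE categorylinks
  ADD COLUMN cl_is_category TINYINT NOT NULL DEFAULT 0,
  ADD INDEX (cl_to, cl_is_category, cl_sortkey);

-- the backfill, by contrast, can just run in the background
UPDATE categorylinks JOIN page ON page_id = cl_from
SET cl_is_category = 1
WHERE page_namespace = 14;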
Of course, I've always wondered why we couldn't have a more painless way to do stuff like this, like some nice "run and keep half an eye on it" shell script. I don't see why it has to be a huge deal just because it takes a few hours, or even a few days. But I'm not a root and don't have much opportunity to try my hand at automation. :)
Hi,
Why not alter mw code to allow for dynamic table renames of any table (at present only user is covered). Then you can create a new table with the correct structure and move the records from one to another, before discontinuing the old table? Alter table is a nasty process on such a big db.
This could have other benefits, as you could potentially share tables between wikis more easily (though there is no way to identify which wiki... maybe there's a hook I could use for that). Basically I run a bunch of wikis and have unified certain aspects of the DB through hooks - writing categories, pages and search to a central store at present, keyed by original db id and wiki id. This allows me to search across wiki boundaries, but maintain wikis atomically (useful for security).
I am trying not to replace bits of the MW code base (merely supplement) as much as possible, so maintenance doesn't become a nightmare when you guys decide to rework the whole schema! Recentchanges is a new target and I am umming and ahhing over extending it in a similar fashion. I'll probably do the patch for that, but it occurs to me the strategy of an extra column for a wiki identifier must have merit for the Wikimedia Foundation, as it allows for a degree of centralization across projects.
Anyway, keep up the good work.
TTFN,
Alex
On Thu, Feb 21, 2008 at 4:31 PM, Simetrical Simetrical+wikilist@gmail.com wrote:
On Thu, Feb 21, 2008 at 11:25 AM, Roan Kattouw roan.kattouw@home.nl wrote:
That's why I said we probably do want a cl_is_category boolean field. Of course filling it would be a huge operation on enwp.
Well, not really. It wouldn't lock anything, so you could just run it and forget about it. The huge operation would be adding the column in the first place, since ALTER TABLE locks the table to writes. That has to be done by taking slaves out of rotation one by one, and is a pain, plus the alter itself takes a long time. That's a mark against that option; the other two require no such alterations.
Of course, I've always wondered why we couldn't have a more painless way to do stuff like this, like some nice "run and keep half an eye on it" shell script. I don't see why it has to be a huge deal just because it takes a few hours, or even a few days. But I'm not a root and don't have much opportunity to try my hand at automation. :)
On Thu, Feb 21, 2008 at 12:21 PM, Alex Powell alexp@exscien.com wrote:
Why not alter mw code to allow for dynamic table renames of any table (at present only user is covered). Then you can create a new table with the correct structure and move the records from one to another, before discontinuing the old table?
I'm not clear how you want this to proceed. Like having all updates, inserts, etc. be done to both tables at once, while the population is ongoing? I can't see why that wouldn't work, I admit. I don't know if it would be incredibly useful to Wikimedia, though, since alter table doesn't actually disrupt the site or anything. It's just kind of a pain. If anything it would be more useful to third parties.
I guess one question to ask is why MySQL can't do this transparently already. Maybe there's some drawback or catch I'm missing.
This could have other benefits, as you could potentially share tables between wikis easier (though there is no way to identify which wiki... maybe theres a hook I could use for that). Basically I run a bunch of wikis and have unified certain aspects of the DB through hooks - writing categories, pages and search to a central store at present, keyed by original db id and wiki id. This allows me to search across wiki boundaries, but maintain wikis atomically (useful for security).
I don't see how this is related. It would require additional work on top of the previous suggestion, at the very least, to allow modification of the query in addition to its replication.
On Thu, Feb 21, 2008 at 5:30 PM, Simetrical Simetrical+wikilist@gmail.com wrote:
On Thu, Feb 21, 2008 at 12:21 PM, Alex Powell alexp@exscien.com wrote:
Why not alter mw code to allow for dynamic table renames of any table (at present only user is covered). Then you can create a new table with the correct structure and move the records from one to another, before discontinuing the old table?
I'm not clear how you want this to proceed. Like having all updates, inserts, etc. be done to both tables at once, while the population is ongoing? I can't see why that wouldn't work, I admit. I don't know if it would be incredibly useful to Wikimedia, though, since alter table doesn't actually disrupt the site or anything. It's just kind of a pain. If anything it would be more useful to third parties.
I guess one question to ask is why MySQL can't do this transparently already. Maybe there's some drawback or catch I'm missing.
Yes live updates could go to both tables, or you can just do batch updates in the scripts that catch up with the live table. The main idea is not to lock the table to writes - I presumed that it would affect the site, as how do edits proceed whilst the table is locked??
RE MySQL, I expect that it's a hard bit of code for a diminishingly small number of users (cf. most MediaWiki updates being targeted only at the core audience). After all, alter table on a few hundred thousand record table is still quite quick. Maybe the later versions are better at this?
This could have other benefits, as you could potentially share tables between wikis easier (though there is no way to identify which wiki... maybe theres a hook I could use for that). Basically I run a bunch of wikis and have unified certain aspects of the DB through hooks - writing categories, pages and search to a central store at present, keyed by original db id and wiki id. This allows me to search across wiki boundaries, but maintain wikis atomically (useful for security).
I don't see how this is related. It would require additional work on top of the previous suggestion, at the very least, to allow modification of the query in addition to its replication.
Yes, true. More thinking out loud, as its a problem that I have had to deal with on a day to day basis (and am not happy with the solution).
On Thu, Feb 21, 2008 at 12:56 PM, Alex Powell alexp@exscien.com wrote:
Yes live updates could go to both tables, or you can just do batch updates in the scripts that catch up with the live table. The main idea is not to lock the table to writes - I presumed that it would affect the site, as how do edits proceed whilst the table is locked??
The magic of replication. :) The one doing the update goes through each slave: taking it out of rotation, applying the change, re-adding it to rotation, and let it catch up to the master, then repeating with the next slave. Finally the master is switched, so that a former slave (with updated schema) is the new master, and the process is repeated one last time for the old master. The alter still locks the tables, but it does so while the slave is offline, so it doesn't block anything, and there are enough slaves that the site doesn't suffer too much from being short one for a while.
Of course, this depends on the schemas being similar enough that replicated statements still can be executed on the new database. But since, helpfully, Wikimedia develops its own software, it can ensure that that's always the case. It's mostly a case of ordering everything right: add any new fields, then update software so it no longer needs old fields, then remove old fields.
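(Schematically, for a hypothetical new column - the names here are purely illustrative:)

-- 1. add the new field on each slave (and finally the old master) while it is out of rotation
ALTER TABLE categorylinks ADD COLUMN cl_is_category TINYINT NOT NULL DEFAULT 0;
-- 2. deploy code that no longer needs the old field
-- 3. only then remove the old field everywhere, the same way:
ALTER TABLE categorylinks DROP COLUMN cl_obsolete_field;  -- hypothetical old column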
RE MySQL, I expect that it's a hard bit of code for a diminishingly small number of users (cf. most MediaWiki updates being targeted only at the core audience). After all, alter table on a few hundred thousand record table is still quite quick. Maybe the later versions are better at this?
They aren't. Falcon, for MySQL 6.0, is supposed to support online alter table, but I don't think implementation of that feature has started yet, and the feature list is subject to change this early on. It's really a big pain for anyone who has a decent-sized database. So I figure if it weren't hard to do, it would be in the software already.
On Thu, Feb 21, 2008 at 6:38 PM, Simetrical Simetrical+wikilist@gmail.com wrote:
On Thu, Feb 21, 2008 at 12:56 PM, Alex Powell alexp@exscien.com wrote:
Yes live updates could go to both tables, or you can just do batch updates in the scripts that catch up with the live table. The main idea is not to lock the table to writes - I presumed that it would affect the site, as how do edits proceed whilst the table is locked??
The magic of replication. :) The one doing the update goes through each slave: taking it out of rotation, applying the change, re-adding it to rotation, and let it catch up to the master, then repeating with the next slave. Finally the master is switched, so that a former slave (with updated schema) is the new master, and the process is repeated one last time for the old master. The alter still locks the tables, but it does so while the slave is offline, so it doesn't block anything, and there are enough slaves that the site doesn't suffer too much from being short one for a while.
Of course, this depends on the schemas being similar enough that replicated statements still can be executed on the new database. But since, helpfully, Wikimedia develops its own software, it can ensure that that's always the case. It's mostly a case of ordering everything right: add any new fields, then update software so it no longer needs old fields, then remove old fields.
Clever. Forgot MySQL does replication on SQL statements. I imagine if you tried something like that on MS SQL the replication would break horribly, since it does binary replication. I guess in fact you are doing my suggestion, just at the column level, rather than table. Makes me want to go and set up a slave server ;)
Well to get back to the topic I think a master list of categories would be a good thing. I have an AJAX category suggester that does a SELECT DISTINCT WHERE LIKE, LIMIT, and I suspect that with as many categories as the wikipedia has it would prove to be very inefficient. A master list of categories would be a much easier way of getting it.
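(For comparison - the prefix is just an example, and the category table column follows the sketches earlier in this thread.)

-- what the suggester does now: DISTINCT over every matching categorylinks row
SELECT DISTINCT cl_to FROM categorylinks WHERE cl_to LIKE 'Biogr%' LIMIT 10;

-- against a master category table it becomes a short index range scan
SELECT c_title FROM category WHERE c_title LIKE 'Biogr%' LIMIT 10;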
Simetrical wrote:
Well, not really. It wouldn't lock anything, so you could just run it and forget about it. The huge operation would be adding the column in the first place, since ALTER TABLE locks the table to writes. That has to be done by taking slaves out of rotation one by one, and is a pain, plus the alter itself takes a long time. That's a mark against that option; the other two require no such alterations.
Of course, I've always wondered why we couldn't have a more painless way to do stuff like this, like some nice "run and keep half an eye on it" shell script. I don't see why it has to be a huge deal just because it takes a few hours, or even a few days. But I'm not a root and don't have much opportunity to try my hand at automation. :)
It's strange it hasn't been done yet.
On Fri, Feb 22, 2008 at 4:51 PM, Platonides Platonides@gmail.com wrote:
Simetrical wrote:
Of course, I've always wondered why we couldn't have a more painless way to do stuff like this, like some nice "run and keep half an eye on it" shell script. I don't see why it has to be a huge deal just because it takes a few hours, or even a few days. But I'm not a root and don't have much opportunity to try my hand at automation. :)
It's strange it hasn't been done yet.
Maybe it has, what do I know?
Magnus Manske wrote:
<rant> I'm currently working on the Scott Foresman image donation, cutting large scanned images into smaller, manually optimized ones. The category containing the unprocessed images is http://commons.wikimedia.org/wiki/Category:ScottForesman-raw
Use DjVu for scanned images, not PNG.
-- Tim Starling
On Thu, Feb 21, 2008 at 2:29 AM, Tim Starling tstarling@wikimedia.org wrote:
Magnus Manske wrote:
<rant> I'm currently working on the Scott Foresman image donation, cutting large scanned images into smaller, manually optimized ones. The category containing the unprocessed images is http://commons.wikimedia.org/wiki/Category:ScottForesman-raw
Use DjVu for scanned images, not PNG.
1. I didn't create or upload those (in case this was meant for me; if it was a general remark, I fully agree)
2. If the system allows it, users will use it. How many of our regular users even know about our DjVu capability, let alone friendly drive-by contributors? How many of them can create DjVu easily, and will bother doing so?
3. Doesn't solve the problem at hand.
Magnus
Hi!
But honestly, is there no open source software that can generate a thumbnail from a 15MB PNG without nuking our servers?
No.
Yes, everyone's busy. Yes, there are more pressing issues (SUL, stable versions, you name it). Yes, MediaWiki wasn't developed as a media repository (tell me about it;-) Yes, "sofixit" myself. Still, I ask: is this the best we can do?
Why do you ask, after all these 'yes'?
On Thu, Feb 21, 2008 at 11:10 AM, Domas Mituzas midom.lists@gmail.com wrote:
Hi!
But honestly, is there no open source software that can generate a thumbnail from a 15MB PNG without nuking our servers?
No.
Well. There's a task to add to my list, then. Might get around to it as early as 2020, or so ;-)
Yes, everyone's busy. Yes, there are more pressing issues (SUL, stable versions, you name it). Yes, MediaWiki wasn't developed as a media repository (tell me about it;-) Yes, "sofixit" myself. Still, I ask: is this the best we can do?
Why do you ask, after all these 'yes'?
Because I believe that these areas have been neglected for too long without proper reason, and I want to rattle everyone's cage about that.
The things I listed are rather basic things that shouldn't be broken, IMHO, in such generally important software in its seventh year of development.
Magnus
Hi!
Because I believe that these areas have been neglected for too long without proper reason, and I want to rattle everyone's cage about that.
You listed quite a lot of proper reasons. The issue you're describing probably didn't seem to be that important to overall project execution to those who did the development.
The things I listed are rather basic things that shouldn't be broken, IMHO, in such generally important software in its seventh year of development.
Unfortunately, the 'generally important' software is getting major development for 'narrow use' of it.
The counting of category members has been discussed quite a few times though. It probably has to be done for quite a few other reasons.
BR,
-----Original Message----- From: wikitech-l-bounces@lists.wikimedia.org [mailto:wikitech-l-bounces@lists.wikimedia.org] On Behalf Of Magnus Manske Sent: 20 February 2008 22:14 To: Wikimedia developers Subject: [Wikitech-l] Interface embarrassment rant
<rant> I'm currently working on the Scott Foresman image donation, cutting large scanned images into smaller, manually optimized ones. The category containing the unprocessed images is http://commons.wikimedia.org/wiki/Category:ScottForesman-raw
It's shameful. Honestly. Look at it. We're the world's #9 top web site, and this is the best we can do?
Yes, I know that the images are large, both in dimensions (~5000x5000px) and size (5-15MB each). Yes, I know that ImageMagick has problems with such images. But honestly, is there no open source software that can generate a thumbnail from a 15MB PNG without nuking our servers?
Thought MediaWiki had a JobQueue ? Can't it just dump the resizing of large images into that to be off loaded?
Jared
Jared Williams wrote:
Thought MediaWiki had a JobQueue ? Can't it just dump the resizing of large images into that to be off loaded?
The job queue works fine for doing lots of small things (do one small thing per request, nobody notices the delay), but big things will just delay some random guy's request by 10 seconds because MW is busy resizing an image someone else uploaded. We should really do this in background processes that don't interfere with Apache/PHP, but even then the load might be too heavy to take.
Roan Kattouw (Catrope)
-----Original Message----- From: wikitech-l-bounces@lists.wikimedia.org [mailto:wikitech-l-bounces@lists.wikimedia.org] On Behalf Of Roan Kattouw Sent: 22 February 2008 13:26 To: Wikimedia developers Subject: Re: [Wikitech-l] Interface embarrassment rant
Jared Williams wrote:
Thought MediaWiki had a JobQueue ? Can't it just dump the resizing of large images into that to be off loaded?
The job queue works fine for doing lots of small things (do one small thing per request, nobody notices the delay), but big things will just delay some random guy's request by 10 seconds because MW is busy resizing an image someone else uploaded. We should really do this in background processes that don't interfere with Apache/PHP, but even then the load might be too heavy to take.
Ah, does sound like the current JobQueue implementation is a bit of a hack. Perhaps time to move it to a service itself.
Jared
Jared Williams wrote:
Ah, does sound like the current JobQueue implementation is a bit of a hack. Perhaps time to move it to a service itself.
Not really. Most jobs are about scanning a page for links and updating the pagelinks table, which is a simple thing that can be done inside MW code. Moving it to a separate service would only complicate matters for people using shared hosting.
Roan Kattouw (Catrope)
Could this be done by one of those grid applications like SETI at home? Or would the bandwidth usage make it not worth the benefit? I bet a lot of Wikipedia users would install a screensaver that did image resizing for you.
Jim
On Feb 22, 2008, at 7:26 AM, Roan Kattouw wrote:
Jared Williams wrote:
Thought MediaWiki had a JobQueue ? Can't it just dump the resizing of large images into that to be off loaded?
The job queue works fine for doing lots of small things (do one small thing per request, nobody notices the delay), but big things will just delay some random guy's request by 10 seconds because MW is busy resizing an image someone else uploaded. We should really do this in background processes that don't interfere with Apache/PHP, but even then the load might be too heavy to take.
Roan Kattouw (Catrope)
Jim Hu wrote:
Could this be done by one of those grid applications like SETI at home? Or would the bandwidth usage make it not worth the benefit? I bet a lot of Wikipedia users would install a screensaver that did image resizing for you.
Maybe people should just resize their images before they upload them.
Roan Kattouw (Catrope)
On 22/02/2008, Roan Kattouw roan.kattouw@home.nl wrote:
Jim Hu wrote:
Could this be done by one of those grid applications like SETI at home? Or would the bandwidth usage make it not worth the benefit? I bet a lot of Wikipedia users would install a screensaver that did image resizing for you.
Maybe people should just resize their images before they upload them.
This was a donated image dump. You have grossly missed the point.
- d.
On Feb 22, 2008, at 8:26 AM, David Gerard wrote:
On 22/02/2008, Roan Kattouw roan.kattouw@home.nl wrote:
Jim Hu wrote:
Could this be done by one of those grid applications like SETI at home? Or would the bandwidth usage make it not worth the benefit? I bet a lot of Wikipedia users would install a screensaver that did image resizing for you.
Maybe people should just resize their images before they upload them.
This was a donated image dump. You have grossly missed the point.
and not answered the question. The grid application could download and return the images in bunches as tgz. Doesn't have to be one at a time.
- d.
On Fri, Feb 22, 2008 at 2:40 PM, Jim Hu jimhu@tamu.edu wrote:
On Feb 22, 2008, at 8:26 AM, David Gerard wrote:
On 22/02/2008, Roan Kattouw roan.kattouw@home.nl wrote:
Jim Hu wrote:
Could this be done by one of those grid applications like SETI at home? Or would the bandwidth usage make it not worth the benefit? I bet a lot of Wikipedia users would install a screensaver that did image resizing for you.
Maybe people should just resize their images before they upload them.
This was a donated image dump. You have grossly missed the point.
and not answered the question. The grid application could download and return the images in bunches as tgz. Doesn't have to be one at a time.
Or, we could have one dedicated server for long-running jobs. Pre-generate the "usual" thumbnail sizes for each large image. Maybe generate smaller thumbnails from larger ones, or multiple in one go, so it won't have to load a large image 10 times for 10 thumbnails. Not sure what to do for "unusual" sizes, though. Could be a DoS attack vector.
Magnus
On 22/02/2008, Magnus Manske magnusmanske@googlemail.com wrote:
Or, we could have one dedicated server for long-running jobs. Pre-generate the "usual" thumbnail sizes for each large image. Maybe generate smaller thumbnails from larger ones, or multiple in one go, so it won't have to load a large image 10 times for 10 thumbnails. Not sure what to do for "unusual" sizes, though. Could be a DoS attack vector.
Add 'em to the job queue in question.
- d.
On Fri, Feb 22, 2008 at 3:50 PM, Magnus Manske magnusmanske@googlemail.com wrote:
On Fri, Feb 22, 2008 at 2:40 PM, Jim Hu jimhu@tamu.edu wrote:
On Feb 22, 2008, at 8:26 AM, David Gerard wrote:
On 22/02/2008, Roan Kattouw roan.kattouw@home.nl wrote:
Jim Hu wrote:
Could this be done by one of those grid applications like SETI at home? Or would the bandwidth usage make it not worth the benefit? I bet a lot of Wikipedia users would install a screensaver that did image resizing for you.
Maybe people should just resize their images before they upload them.
This was a donated image dump. You have grossly missed the point.
and not answered the question. The grid application could download and return the images in bunches as tgz. Doesn't have to be one at a time.
Or, we could have one dedicated server for long-running jobs. Pre-generate the "usual" thumbnail sizes for each large image. Maybe generate smaller thumbnails from larger ones, or multiple in one go, so it won't have to load a large image 10 times for 10 thumbnails. Not sure what to do for "unusual" sizes, though. Could be a DoS attack vector.
Magnus
Just checked, and there are 51811 too-large PNGs on Commons, with a total size of 44 GB. It takes hemlock about 30 seconds to generate a 1024px thumbnail from a large file. Assuming a server that is not crowded with 150 users, the minimum is probably around 20 seconds in the best case. It would therefore take approximately between 2 and 3 weeks to generate thumbnails from all those files. Those files could then be used to create further smaller thumbnails from.
Bryan
On Fri, Feb 22, 2008 at 3:22 PM, Roan Kattouw roan.kattouw@home.nl wrote:
Jim Hu wrote:
Could this be done by one of those grid applications like SETI at home? Or would the bandwidth usage make it not worth the benefit? I bet a lot of Wikipedia users would install a screensaver that did image resizing for you.
Maybe people should just resize their images before they upload them.
Roan Kattouw (Catrope)
We prefer to have source images as hi-res as possible.
Bryan
"Roan Kattouw" roan.kattouw@home.nl wrote in message news:47BEDEB1.80107@home.nl...
Bryan Tong Minh wrote:
We prefer to have source images as hi-res as possible.
Well, images so huge they choke the resizing code can afford to be scaled down a *little*, can't they? Maybe we should move back to a more restrictive size limit?
Or better, fix the resizing code, which is what we're discussing here.
- Mark Clements (HappyDog)
On Fri, Feb 22, 2008 at 8:26 AM, Roan Kattouw roan.kattouw@home.nl wrote:
The job queue works fine for doing lots of small things (do one small thing per request, nobody notices the delay), but big things will just delay some random guy's request by 10 seconds because MW is busy resizing an image someone else uploaded. We should really do this in background processes that don't interfere with Apache/PHP, but even then the load might be too heavy to take.
Um, we do, don't we? Wikimedia runs the job queue as a cron job, using maintenance/runJobs.php, and $wgJobRunRate = 0. It would be kind of silly for anyone to do otherwise, if you have the right to run cron jobs in the first place. Image resizing is done on totally different servers anyway, IIRC.
On Fri, Feb 22, 2008 at 9:50 AM, Magnus Manske magnusmanske@googlemail.com wrote:
Or, we could have one dedicated server for long-running jobs. Pre-generate the "usual" thumbnail sizes for each large image. Maybe generate smaller thumbnails from larger ones, or multiple in one go, so it won't have to load a large image 10 times for 10 thumbnails. Not sure what to do for "unusual" sizes, though. Could be a DoS attack vector.
Well, I'm not clear on exactly what the issue is with resizing. Is it merely that it takes a minute per image? Or does it also take 45 GB of RAM? If the former, we could of course just toss a couple of extra servers at the problem. If the latter, not so easy. Domas' "No." makes me think something closer to the latter, but I really don't know.
On 22/02/2008, Simetrical Simetrical+wikilist@gmail.com wrote:
Well, I'm not clear on exactly what the issue is with resizing. Is it merely that it takes a minute per image? Or does it also take 45 GB of RAM? If the former, we could of course just toss a couple of extra servers at the problem. If the latter, not so easy. Domas' "No." makes me think something closer to the latter, but I really don't know.
It should be possible to thumbnail a PNG completely from the image stream, using minimal memory (at the expense of speed).
Imagemagick is a tool that is a jack of all trades and best solution for none. If the problem is with using Imagemagick to do it, use something else.
- d.
Hi!
It should be possible to thumbnail a PNG completely from the image stream, using minimal memory (at the expense of speed).
Progressive/interlaced images, anybody? Still, something has to hold at least a few lines of pixels... we've looked at this problem too many times, and still - images internally have to be expanded to a full uncompressed bitmap.
Imagemagick is a tool that is a jack of all trades and best solution for none. If the problem is with using Imagemagick to do it, use something else.
Use what? The current GD maintainer was in our channel a few times, telling us about a future product that might save us. That future product doesn't seem to have appeared. If anyone wants to revisit GD support for thumbnails, performance et al. - please tell us your results.
Really, this whole discussion is about 'oh, look, these guys have no clue', when we have already put a certain investment of time into looking at what can be done.
BR,
On Fri, Feb 22, 2008 at 6:02 PM, David Gerard dgerard@gmail.com wrote:
On 22/02/2008, Simetrical Simetrical+wikilist@gmail.com wrote:
Well, I'm not clear on exactly what the issue is with resizing. Is it merely that it takes a minute per image? Or does it also take 45 GB of RAM? If the former, we could of course just toss a couple of extra servers at the problem. If the latter, not so easy. Domas' "No." makes me think something closer to the latter, but I really don't know.
It should be possible to thumbnail a PNG completely from the image stream, using minimal memory (at the expense of speed).
I've already looked into the PNG format and libpng. Preliminaries:
The category I used as an example in my original complaint contains greyscale PNGs, the largest ~16MB.
A little C code I wrote reads this into memory. Takes about 3 seconds, and 22 MB (as far as I can tell from "top" in that time;-) Shouldn't be too hard to scale it down and output it.
And yes, it could basically work per-line as well, saving memory. But for this example, no need.
Magnus
On Thu, Feb 21, 2008 at 12:14 AM, Magnus Manske magnusmanske@googlemail.com wrote:
<rant> I'm currently working on the Scott Foresman image donation, cutting large scanned images into smaller, manually optimized ones. The category containing the unprocessed images is http://commons.wikimedia.org/wiki/Category:ScottForesman-raw
It's shameful. Honestly. Look at it. We're the world's #9 top web site, and this is the best we can do?
Yes, I know that the images are large, both in dimensions (~5000x5000px) and size (5-15MB each). Yes, I know that ImageMagick has problems with such images. But honestly, is there no open source software that can generate a thumbnail from a 15MB PNG without nuking our servers?
It got me annoyed as well. So I finally got around to writing a PNG resizer that does not need to load the entire file into memory: http://svn.wikimedia.org/viewvc/mediawiki/trunk/pngds/. I took a 6000x4000 PNG from that category and recompressed it to a 640px image in only 4 seconds, while using only a few hundred KB of memory.
Couple of downsides:
* It currently does not do any adaptive filtering on compression, which may lead to somewhat larger file sizes than necessary
* Compression level is hardcoded
* Only RGB, RGBA, Grayscale and Grayscale-Alpha images are supported. Palette images are unsupported.
* Single color transparency is discarded.
A MediaWiki media handler should probably be written for this as well.
Bryan
On Sat, Apr 19, 2008 at 12:35 PM, Bryan Tong Minh bryan.tongminh@gmail.com wrote:
Couple of downsides:
- It does currently not do any adaptive filtering on compression, which may lead to somewhat larger file sizes than necessary
- Compression level is hardcoded
[...]
Fixed now. Paeth filtering is on by default, which leads to a little bit smaller file sizes, but longer processing time. Can be disabled using --no-filtering.
Bryan
On Sat, Apr 19, 2008 at 12:22 PM, Bryan Tong Minh bryan.tongminh@gmail.com wrote:
On Sat, Apr 19, 2008 at 12:35 PM, Bryan Tong Minh bryan.tongminh@gmail.com wrote:
Couple of downsides:
- It does currently not do any adaptive filtering on compression, which may lead to somewhat larger file sizes than necessary
- Compression level is hardcoded
[...]
Fixed now. Paeth filtering is on by default, which leads to a little bit smaller file sizes, but longer processing time. Can be disabled using --no-filtering.
Cool, nice work! I hope that it goes live soon...
Thanks for this, Magnus
On Mon, Apr 21, 2008 at 12:07 PM, Magnus Manske magnusmanske@googlemail.com wrote:
On Sat, Apr 19, 2008 at 12:22 PM, Bryan Tong Minh
bryan.tongminh@gmail.com wrote:
On Sat, Apr 19, 2008 at 12:35 PM, Bryan Tong Minh bryan.tongminh@gmail.com wrote:
Couple of downsides:
- It does currently not do any adaptive filtering on compression,
which may lead to somewhat larger file sizes than necessary
- Compression level is hardcoded
[...]
Fixed now. Paeth filtering is on by default, which leads to a little bit smaller file sizes, but longer processing time. Can be disabled using --no-filtering.
Cool, nice work! I hope that it goes live soon...
Thanks for this, Magnus
Need to make a media handler first. I'll work on that the next few days if I can get the program compiled under Windows.
Bryan
Bryan Tong Minh wrote:
Need to make a media handler first. I'll work on that the next few days if I can get the program compiled under Windows.
Bryan
I didn't have any particular problem compiling pngds on Windows (after changing to uint32_t). What are you doing? Where are you having problems? If you're really stuck I suppose I could send you the binaries, but it's really easy. (Maybe we should go offlist.)
On Mon, Apr 21, 2008 at 11:49 PM, Platonides Platonides@gmail.com wrote:
Bryan Tong Minh wrote:
Need to make a media handler first. I'll work on that the next few days if I can get the program compiled under Windows.
Bryan
I didn't have any particular problem compiling pngds on Windows (after changing to uint32_t). What are you doing? Where are you having problems? If you're really stuck I suppose I could send you the binaries, but it's really easy. (Maybe we should go offlist.)
Well, I had some general problems getting MinGW to work...
But luckily, after getting very irritated that MSVC9 lacks some C99 features, I managed to get it compiling on Microsoft Visual C++ 9 Express.
The media handler has been committed as extensions/PNGHandler. (It's clearly not my day... it took me like 5 commits to get it in the correct place, and now I see that I have mixed up the capitalization.)
One remaining problem is that the handler does not create the thumb directories itself, and pngds crashes on that. Isn't MediaWiki supposed to create those directories itself?
Bryan
Bryan Tong Minh wrote:
Well, I had some general problems getting MinGW to work...
It's a bit hard to get the right downloads, but then it works like a charm. After all, it's gcc ;)
But luckily, after getting very irritated that MSVC9 lacks some C99 features, I managed to get it compiling on Microsoft Visual C++ 9 Express.
I see you removed the packed attribute in r33599, which would need to be done differently on MSVC. But not having the structure packing, even though it is manually aligned for now, could give surprises in the future. I'll try to make a patch which works conditionally for both compilers.
On Thu, Apr 24, 2008 at 12:53 AM, Platonides Platonides@gmail.com wrote:
Bryan Tong Minh wrote:
Well, I had some general problems getting MinGW to work...
It's a bit hard to get the right downloads, but then it works like a charm. After all, it's gcc ;)
But luckily, after getting very irritated that MSVC9 lacks some C99 features, I managed to get it compiling on Microsoft Visual C++ 9 Express.
I see you removed the packed attribute in r33599, which would need to be done differently on MSVC. But not having the structure packing, even though it is manually aligned for now, could give surprises in the future. I'll try to make a patch which works conditionally for both compilers.
Not really... The PNG header will always stay 13 bytes, even in newer versions, at least according to the specs. Note that I now use sizeof(pngheader) for malloc and a hardcoded 13 for fread. Or are there other foreseeable surprises?
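(For illustration: one common way to keep the struct packed on both compilers, using #pragma pack for MSVC and __attribute__((packed)) for GCC. This is a hypothetical sketch, not the actual r33599 code; the field names are made up.)

#include <stdint.h>

#ifdef _MSC_VER
#  pragma pack(push, 1)
#  define PACKED
#else
#  define PACKED __attribute__((packed))
#endif

/* Hypothetical 13-byte IHDR layout; the real pngds struct may differ.
 * The multi-byte fields are still big-endian on disk and need byte
 * swapping after fread(). */
typedef struct {
    uint32_t width;
    uint32_t height;
    uint8_t  bit_depth;
    uint8_t  color_type;
    uint8_t  compression;
    uint8_t  filter;
    uint8_t  interlace;
} PACKED pngheader;

#ifdef _MSC_VER
#  pragma pack(pop)
#endif

/* With packing in place, sizeof(pngheader) == 13 on both compilers,
 * matching the hardcoded 13 used for fread(). */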
I committed a handler as extensions/PNGHandler.
Bryan
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Bryan Tong Minh wrote:
It got me annoyed as well. So I finally got around to writing a PNG resizer that does not need to load the entire file into memory: http://svn.wikimedia.org/viewvc/mediawiki/trunk/pngds/. I took a 6000x4000 PNG from that category and recompressed it to a 640px image in only 4 seconds, while using only a few hundred KB of memory.
Sweeet! How does that compare to ImageMagick's speed on the same image?
Couple of downsides:
- It does currently not do any adaptive filtering on compression,
which may lead to somewhat larger file sizes than necessary
- Compression level is hardcoded
- Only RGB, RGBA, Grayscale and Grayscale-Alpha images are supported.
Palette images are unsupported.
- Single color transparency is discarded.
A MediaWiki media handler should probably be written for this as well.
That'd be super-cool...
- -- brion
On Mon, Apr 21, 2008 at 7:51 PM, Brion Vibber brion@wikimedia.org wrote:
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Bryan Tong Minh wrote:
It got me annoyed as well. So I finally got around to writing a PNG resizer that does not need to load the entire file into memory: http://svn.wikimedia.org/viewvc/mediawiki/trunk/pngds/. I took a 6000x4000 PNG from that category and recompressed it to a 640px image in only 4 seconds, while using only a few hundred KB of memory.
Sweeet! How does that compare to ImageMagick's speed on the same image?
About the same order of magnitude. I could only test on the toolserver, which is horribly overloaded, so the results might not be very reliable. Anyway, here they are:
imc = 'convert PSF_B-90003.png -size 240x360 PSF_B-90003-im.png'
pngds = './pngds PSF_B-90003.png PSF_B-90003-pngds.png --width 240 --height 360'
import os, time
it = []; pt = []
for i in xrange(10):
...     t = time.time()
...     os.system(imc)
...     it.append(time.time() - t)
...     t = time.time()
...     os.system(pngds)
...     pt.append(time.time() - t)
...
it
[29.209415912628174, 38.37901782989502, 36.218145132064819, 23.519179105758667, 29.034934043884277, 23.346055030822754, 31.353446960449219, 16.032020092010498, 20.690651893615723, 7.4122297763824463]
pt
[34.830365896224976, 24.175456047058105, 28.614557027816772, 18.715215921401978, 17.986761093139648, 18.512807130813599, 20.611849069595337, 12.038840055465698, 9.5885701179504395, 5.7849969863891602]
I would say that my tool is slightly faster, but I can't say for sure.
Bryan
Hello Bryan,
I would say that my tool is slightly faster, but I can't say for sure.
You should probably use 'time ./blah' - the user/system times are far more stable and more representative of performance than time.time().
Did you try benchmarking GraphicsMagick? That's a stability/performance fork that others use instead of IM.
On Tue, Apr 22, 2008 at 2:27 PM, Domas Mituzas midom.lists@gmail.com wrote:
Hello Bryan,
I would say that my tool is slightly faster, but I can't say for sure.
You should probably use 'time ./blah' - the user/system times are far more stable and more representative of performance than time.time().
Ah thanks.
bryan@hemlock:~/projects/pngds$ time ./pngds PSF_B-90003.png PSF_B-90003-pngds.png --width 240 --height 360
real    0m45.380s
user    0m3.000s
sys     0m0.080s

bryan@hemlock:~/projects/pngds$ time ./pngds PSF_B-90003.png PSF_B-90003-pngds.png --width 240 --height 360

real    0m11.095s
user    0m2.920s
sys     0m0.040s

bryan@hemlock:~/projects/pngds$ time ./pngds PSF_B-90003.png PSF_B-90003-pngds.png --width 240 --height 360

real    0m15.853s
user    0m2.960s
sys     0m0.070s

bryan@hemlock:~/projects/pngds$
ImageMagick:
bryan@hemlock:~/projects/pngds$ time convert PSF_B-90003.png -size 240x360 PSF_B-90003-im.png
real    0m7.660s
user    0m2.500s
sys     0m0.370s

bryan@hemlock:~/projects/pngds$ time convert PSF_B-90003.png -size 240x360 PSF_B-90003-im.png

real    0m5.736s
user    0m2.560s
sys     0m0.370s

bryan@hemlock:~/projects/pngds$ time convert PSF_B-90003.png -size 240x360 PSF_B-90003-im.png

real    0m3.998s
user    0m2.480s
sys     0m0.310s
ImageMagick is slightly faster here.
Did you try benchmarking GraphicsMagick? That's a stability/performance fork that others use instead of IM.
No, only the regular IM.
Bryan