A month ago, the PageImages extension[1] was black-deployed, intended to automatically associate images with articles. It populates its data when LinksUpdate is run, i.e. when a page or a template it transcludes is edited or purged. Since then, most pages have been re-parsed; however, slightly less than a million English WP articles remain:
select count(*), avg(page_len) from page where page_namespace=0 and page_is_redirect=0 and page_touched < '20121229000000';
+----------+---------------+
| count(*) | avg(page_len) |
+----------+---------------+
|   977568 |     3172.0948 |
+----------+---------------+
1 row in set (5 min 59.55 sec)
Waiting for these pages to be updated naturally could take forever:
select min(page_touched) from page where page_namespace=0 and page_is_redirect=0;
+-------------------+
| min(page_touched) |
+-------------------+
| 20090714142954    |
+-------------------+
1 row in set (2 min 15.13 sec)
That was [2] before I purged it: obscure topic, no templates.
Thus, I would like to populate this data with a script[3]. To reduce the scare, let me remark that these pages have almost no templates and are significantly smaller than average (3172 bytes vs. 5673), so they should mostly be fast to parse.
Is running it a good idea?
-----
[1] https://www.mediawiki.org/wiki/Extension:PageImages
[2] https://en.wikipedia.org/wiki/City_of_Melbourne_election,_2008
[3] https://gerrit.wikimedia.org/r/gitweb?p=mediawiki/extensions/PageImages.git;...
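For readers without shell access, the backfill proposed above can be approximated from the client side. The sketch below is not the maintenance script at [3]. It swaps in a different technique: batching titles into `action=purge` requests with `forcelinkupdate`, which, as far as I understand the API, asks MediaWiki to re-run the links update for each page. The function name `purge_request_bodies` and the batch size of 50 are assumptions made for illustration, and the bodies would have to be sent as POST requests.

```python
from urllib.parse import urlencode


def purge_request_bodies(titles, batch_size=50):
    """Return URL-encoded POST bodies for action=purge, one per batch of titles.

    Hypothetical helper: batch_size=50 is an assumed, conservative value,
    not a limit taken from the thread.
    """
    bodies = []
    for start in range(0, len(titles), batch_size):
        batch = titles[start:start + batch_size]
        bodies.append(urlencode({
            "action": "purge",
            "forcelinkupdate": 1,        # request a links update, not only a cache purge
            "format": "json",
            "titles": "|".join(batch),   # the API accepts several titles separated by "|"
        }))
    return bodies


if __name__ == "__main__":
    for body in purge_request_bodies(["City of Melbourne election, 2008", "Louis Bonaparte"]):
        print(body)
```

With close to a million pages this still means tens of thousands of requests, so a server-side script remains the more practical route.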
Max Semenik wrote:
A month ago, PageImages extension was black-deployed, intended to automatically associate images with articles.
I looked at https://www.mediawiki.org/wiki/Extension:PageImages and I'm still having difficulty understanding this extension's purpose. Is there a related bug or request for comment (RFC) for this?
select count(*), avg(page_len) from page where page_namespace=0 and page_is_redirect=0 and page_touched < '20121229000000';
+----------+---------------+
| count(*) | avg(page_len) |
+----------+---------------+
|   977568 |     3172.0948 |
+----------+---------------+
1 row in set (5 min 59.55 sec)
select count(*) from page where page_namespace=0 and page_is_redirect=1 and page_touched < '20120101000000';
+----------+
| count(*) |
+----------+
|       16 |
+----------+
1 row in set (26.61 sec)
I ran a script in December 2012 on the English Wikipedia that updated the page_touched date of every redirect in NS:0 (and a few other namespaces, I believe) where the page_touched date was not like '2012%'. I'd considered running the same script on non-redirects. It turns out that if you take the stored wikitext of pages and echo (post) it back at the wiki via the edit action a few million times, you can discover some interesting bugs.
Thus, I would like to populate this data with a script[3]. To reduce the scare, let me remark that these pages have almost no templates and are significantly smaller than average (3172 bytes vs. 5673), so they should mostly be fast to parse.
I don't think there's any reason to be scared here.
MZMcBride
On 01.02.2013, 9:21 MZMcBride wrote:
Max Semenik wrote:
A month ago, PageImages extension was black-deployed, intended to automatically associate images with articles.
I looked at https://www.mediawiki.org/wiki/Extension:PageImages and I'm still having difficulty understanding this extension's purpose.
It returns thumbnails associated with articles, attempting to return only meaningful images, not ones from maintenance templates, stubs or flag icons.
Is there a related bug or request for comment (RFC) for this?
A bug or an RFC is not required for WMF devs to work on something; we tend to do what our bosses say :)
I think there are still some serious issues with this extension. I have checked several pages and used the max limit parameter, and all it returns is a single thumb.
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
On 01.02.2013, 18:14 John wrote:
I think there are still some serious issues with this extension. I have checked several pages and used the max limit parameter, and all it returns is a single thumb.
That's the point. If you want to enumerate all images on a page, there's prop=images. PageImages returns just one thumb, the most appropriate one.
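To make the distinction concrete, here is a small sketch that builds the two query URLs side by side. The `prop=images` and `prop=pageimages` module names come from the message above. The parameter names `pithumbsize` and `imlimit`, the default values, and the helper names are assumptions based on my reading of the API documentation, so check them against api.php before relying on them.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def page_image_url(title, thumb_size=100):
    """URL asking PageImages for the single representative thumbnail of a page."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "pageimages",
        "titles": title,
        "pithumbsize": thumb_size,  # assumed parameter: thumbnail width in pixels
    }
    return API_ENDPOINT + "?" + urlencode(params)


def all_images_url(title, limit=50):
    """URL asking core prop=images to enumerate every image used on a page."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "images",
        "titles": title,
        "imlimit": limit,  # assumed parameter: how many images to list per request
    }
    return API_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(page_image_url("Louis Bonaparte"))
    print(all_images_url("Louis Bonaparte"))
```

The first URL should yield at most one thumbnail per title; the second lists file names and says nothing about which image is the meaningful one.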
It's broken: on pages where there are multiple images, it just shows the first one.
But not simply the first image to be found in the source, which in many cases is the icon in a maintenance template or top icon. For https://en.wikipedia.org/wiki/Louis_Bonaparte, for instance, the image returned is correctly the one from the infobox, not the book-with-question-mark icon from the needs-more-references template. There's still room for improvement, for sure; but it's definitely a legitimate piece of data to want to collect.
--HM
On Fri, Feb 1, 2013 at 10:03 AM, Max Semenik maxsem.wiki@gmail.com wrote:
On 01.02.2013, 18:14 John wrote:
I think there are still some serious issues with this extension. I have checked several pages and used the max limit parameter, and all it returns is a single thumb.
That's the point. If you want to enumerate all images on a page, there's prop=images. PageImages returns just one thumb, the most appropriate one.
That could have been made a lot more clear, both in the documentation and in the name of the module itself ("pageimages" implies more than one per page).
Also, BTW, the API module could use implementing getExamples() and getHelpUrls(). And I wonder if there's a reason it uses 50 and 100 rather than ApiBase::LIMIT_SMALL1 and ApiBase::LIMIT_SMALL2 for the limit. And why it defaults to 1 rather than 10 like pretty much everything else.
On 01.02.2013, 19:40 Brad wrote:
On Fri, Feb 1, 2013 at 10:03 AM, Max Semenik maxsem.wiki@gmail.com wrote:
On 01.02.2013, 18:14 John wrote:
I think there are still some serious issues with this extension. I have checked several pages and used the max limit parameter, and all it returns is a single thumb.
That's the point. If you want to enumerate all images on a page, there's prop=images. PageImages returns just one thumb, the most appropriate one.
That could have been made a lot more clear, both in the documentation and in the name of the module itself ("pageimages" implies more than one per page).
And I wonder if there's a reason it uses 50 and 100 rather than ApiBase::LIMIT_SMALL1 and ApiBase::LIMIT_SMALL2 for the limit. And why it defaults to 1 rather than 10 like pretty much everything else.
Because with File::transform()'s worst-case performance, 500 is too much.
On 1/31/13, Max Semenik maxsem.wiki@gmail.com wrote:
A month ago, the PageImages extension[1] was black-deployed, intended to automatically associate images with articles. It populates its data when LinksUpdate is run, i.e. when a page or a template it transcludes is edited or purged. Since then, most pages have been re-parsed; however, slightly less than a million English WP articles remain:
select count(*), avg(page_len) from page where page_namespace=0 and page_is_redirect=0 and page_touched < '20121229000000';
+----------+---------------+
| count(*) | avg(page_len) |
+----------+---------------+
|   977568 |     3172.0948 |
+----------+---------------+
1 row in set (5 min 59.55 sec)
[..]
You do realize that page_touched gets updated by a bunch of things, many of which do not cause a LinksUpdate to happen? So running the script as you proposed will not populate the table for all data.
Of course there really isn't any way to figure out when the last LinksUpdate happened, so I suppose page_touched is as close as we can get. I guess in most cases if something has had its page_touched updated by a non-LinksUpdate event, that probably means people actually look at the article, so someone has or will probably edit the article soon.
--bawolff
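The selection rule under discussion can be sketched in a few lines. This assumes page_touched values are 14-digit MediaWiki timestamps (YYYYMMDDHHMMSS), as in the queries above. The function names and the second sample row are hypothetical. As noted in the message above, page_touched is only a proxy for the last LinksUpdate, so a filter like this can miss pages that were touched without being re-parsed for links.

```python
from datetime import datetime

MW_TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"  # 14-digit MediaWiki timestamp layout


def parse_mw_timestamp(value):
    """Convert a timestamp such as '20121229000000' into a datetime."""
    return datetime.strptime(value, MW_TIMESTAMP_FORMAT)


def pages_needing_update(pages, cutoff):
    """Return titles whose page_touched is strictly older than the cutoff.

    `pages` is an iterable of (title, page_touched) pairs.
    """
    cutoff_time = parse_mw_timestamp(cutoff)
    return [title for title, touched in pages
            if parse_mw_timestamp(touched) < cutoff_time]


if __name__ == "__main__":
    sample = [
        ("City of Melbourne election, 2008", "20090714142954"),  # value from the thread
        ("Louis Bonaparte", "20130201100000"),                   # hypothetical value
    ]
    print(pages_needing_update(sample, "20121229000000"))
```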