Wikitech-l February 2008

wikitech-l@lists.wikimedia.org

100 participants
104 discussions

MediaWiki to Latex Converter
by Hugo Vincent 18 Jun '12

Hi everyone,

I recently set up a MediaWiki (http://server.bluewatersys.com/w90n740/) and I need to extract the content from it and convert it into LaTeX syntax for printed documentation. I have googled for a suitable OSS solution but nothing was apparent. I would prefer a script written in Python, but any recommendations would be very welcome. Do you know of anything suitable?

Kind Regards,
Hugo Vincent, Bluewater Systems.

6 13

Replacement stats for placeholder images?
by David Gerard 13 Oct '09

I've been putting placeholder images on a lot of articles on en:wp. e.g. [[Image:Replace this image male.svg]], which goes to [[Wikipedia:Fromowner]], which asks people to upload an image if they own one. I know it's inspired people to add free content images to articles in several cases. What I'm interested in is numbers. So what I'd need is a list of edits where one of the SVGs that redirects to [[Wikipedia:Fromowner]] is replaced with an image. (Checking which of those are actually free images can come next.) Is there a tolerably easy way to get this info from a dump? Any Wikipedia statistics fans who think this'd be easy? (If the placeholders do work, then it'd also be useful convincing some wikiprojects to encourage the things. Not that there's ownership of articles on en:wp, of *course* ...) - d.
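One possible way to approach the dump question is to walk each page's revisions in order and compare the image links of consecutive revisions. The sketch below shows only that comparison step, not the dump parsing. The function names are hypothetical, the placeholder set contains only the one file named in the message (a real run would need every SVG that redirects to [[Wikipedia:Fromowner]]), and the regular expression is a rough approximation of image-link syntax rather than MediaWiki's own parser.

```python
import re

# Assumption: only the placeholder named in the message; extend as needed.
PLACEHOLDERS = {"Replace this image male.svg"}

# Rough match for [[Image:Name...]] or [[File:Name...]] links.
IMAGE_LINK = re.compile(r"\[\[(?:Image|File):([^|\]]+)", re.IGNORECASE)


def normalise(name):
    """Trim, turn underscores into spaces, and upper-case the first letter."""
    name = name.strip().replace("_", " ")
    return name[:1].upper() + name[1:]


def images_in(wikitext):
    """Return the set of image names linked from one revision's wikitext."""
    return {normalise(name) for name in IMAGE_LINK.findall(wikitext)}


def placeholder_replaced(old_text, new_text):
    """True if an edit removed a placeholder and added a non-placeholder image."""
    old, new = images_in(old_text), images_in(new_text)
    removed = (old & PLACEHOLDERS) - new
    added = new - old - PLACEHOLDERS
    return bool(removed) and bool(added)
```

Applied to each pair of consecutive revisions, this would yield candidate edits; checking whether the new image is actually free would still be a separate step, as the message says.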

7 11

Case insensitive links (not just titles).
by subscribe＠divog.com.ru 23 Jun '08

Hi,

Sorry for my English :)

What I need is case insensitive titles. My solution for the problem was to change the collation in MySQL from <utf8_bin> to <utf8_general_ci> in table <page>, for field <page_title>.

But a bigger problem with links persists. In my case, if there is an article <Frank Dreben>, the link [[Frank Dreben]] is treated like a link to an existent article (GoodLink), but the link [[frank dreben]] is treated like a link to a non-existent article, so this link opens editing of the existent article <Frank Dreben>. What can be fixed so that the link [[frank dreben]] is treated like a GoodLink?

I've spent some time in Parser.php, LinkCache.php, Title.php, Linker.php, LinkBatch.php but found nothing useful. The last thing I tried was to do strtoupper on the title every time the array of the link cache is filled, in LinkCache.php. I also tried to do strtoupper on the title every time data is fetched from the array. I've tried to make titles in the cache case insensitive, but it didn't work out, not sure why - it seems like when links are constructed (parser, title, linker, etc.) only LinkCache methods are used.

Could anybody point me in a direction to dig in? :)

7 36

Interface embarrassment rant
by Magnus Manske 24 Apr '08

<rant>

I'm currently working on the Scott Foresman image donation, cutting large scanned images into smaller, manually optimized ones. The category containing the unprocessed images is http://commons.wikimedia.org/wiki/Category:ScottForesman-raw

It's shameful. Honestly. Look at it. We're the world's #9 top web site, and this is the best we can do?

Yes, I know that the images are large, both in dimensions (~5000x5000px) and size (5-15MB each). Yes, I know that ImageMagick has problems with such images. But honestly, is there no open source software that can generate a thumbnail from a 15MB PNG without nuking our servers?

In case it's not possible (which I doubt, since I can generate thumbnails with ImageMagick from these on my laptop, one at a time; maybe a slow-running thumbnail generator, at least for "usual" sizes, on a dedicated server?), it's no use cluttering the entire page with broken thumbnails. Where's the option for a list view? You know, a table with linked title, size, uploader, date, no thumbnails? They're files, so why don't we use things that have proven useful in a file system?

And then, of course: "There are 200 files in this category." That's two lines below the "(next 200)" link. At that point, we know there are more than 200 images, but we forget about that two lines further down?

Yes, I know that some categories are huge, and that it would take too long to get the exact number. But, would the exact number for large categories be useful? 500.000 or 500.001 entries, who cares? How many categories are that large anyway? 200 or 582 entries, now /that/ people might care about.

Why not at least try to get a number, set a limit to, say, 5001, and
* give the exact number if it's less than 5001 entries
* say "over 5000 entries" if it returns 5001

Yes, everyone's busy. Yes, there are more pressing issues (SUL, stable versions, you name it). Yes, MediaWiki wasn't developed as a media repository (tell me about it;-) Yes, "sofixit" myself.

Still, I ask: is this the best we can do?

Magnus

</rant>
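The capped-count idea in the rant can be sketched roughly as follows. This is only an illustration under stated assumptions: `capped_count` stands in for a database query with a limit of 5001, and both function names are hypothetical rather than anything in MediaWiki.

```python
def capped_count(members, cap):
    """Count items but stop once `cap` is reached, as a LIMIT clause would."""
    count = 0
    for _ in members:
        count += 1
        if count >= cap:
            break
    return count


def describe_category_size(members, limit=5000):
    """Exact count up to `limit`; otherwise just report that it is exceeded."""
    counted = capped_count(members, limit + 1)  # ask for at most 5001 rows
    if counted > limit:
        return f"over {limit} entries"
    return f"{counted} entries"
```

The point of asking for `limit + 1` rows is that hitting the cap is the only signal needed to know the category is "large"; the cost of counting is bounded no matter how big the category is.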

15 56

Broken dump enwiki-20080103-pages-meta-current.xml.bz2
by Lev Bishop 20 Apr '08

The most recent enwiki dump seems corrupt (CRC failure when bunzipping). Another person (Nessus) has also noticed this, so it's not just me: http://meta.wikimedia.org/wiki/Talk:Data_dumps#Broken_image_.28enwiki-20080…

Steps to reproduce:

lsb32@cmt:~/enwiki> md5sum enwiki-20080103-pages-meta-current.xml.bz2
9aa19d3a871071f4895431f19d674650 enwiki-20080103-pages-meta-current.xml.bz2
lsb32@cmt:~/enwiki> bzip2 -tvv enwiki-20080103-pages-meta-current.xml.bz2 &> bunzip.log
lsb32@cmt:~/enwiki> tail bunzip.log
[3490: huff+mtf rt+rld]
[3491: huff+mtf rt+rld]
[3492: huff+mtf rt+rld]
[3493: huff+mtf rt+rld]
[3494: huff+mtf rt+rld]
[3495: huff+mtf data integrity (CRC) error in data

You can use the `bzip2recover' program to attempt to recover data
from undamaged sections of corrupted files.

lsb32@cmt:~/enwiki> bzip2 -V
bzip2, a block-sorting file compressor. Version 1.0.3, 15-Feb-2005.

Copyright (C) 1996-2005 by Julian Seward.

This program is free software; you can redistribute it and/or modify
it under the terms set out in the LICENSE file, which is included
in the bzip2-1.0 source distribution.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
LICENSE file for more details.

bzip2: I won't write compressed data to a terminal.
bzip2: For help, type: `bzip2 --help'.
lsb32@cmt:~/enwiki>
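For anyone who wants to run the same integrity check from a script rather than with `bzip2 -t`, a minimal sketch using Python's standard `bz2` module might look like this. The function name `bz2_is_intact` is hypothetical, and the sketch only reports whether the whole file decompresses cleanly; unlike `bzip2 -tvv` it does not say which block failed.

```python
import bz2


def bz2_is_intact(path, chunk_size=1 << 20):
    """Stream-decompress `path`, discarding the output.

    Returns False if the data is corrupt (OSError) or truncated (EOFError).
    """
    try:
        with bz2.open(path, "rb") as stream:
            # Read in chunks so a multi-gigabyte dump never sits in memory.
            while stream.read(chunk_size):
                pass
    except (OSError, EOFError):
        return False
    return True
```

Called as `bz2_is_intact("enwiki-20080103-pages-meta-current.xml.bz2")`, it would be expected to return False for a file with the CRC failure shown in the log above.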

4 6

Re: [Wikitech-l] Primary account for single user login
by Anon Sricharoenchai 08 Apr '08

> Message: 8
> Date: Fri, 12 Oct 2007 17:59:22 +0200
> From: GerardM <gerard.meijssen(a)gmail.com>
> Subject: Re: [Wikitech-l] Primary account for single user login
>
> Hoi,
> This issue has been decided. Seniority is not fair either; there are
> hundreds if not thousands of users that have done no or only a few edits and
> I would not consider it fair when a person with say over 10.000 edits should
> have to defer to these typically inactive users.

1. Yes, it's not fair, but this is a truth about the Wikimedia projects that one has to admit. Imagine that all Wikimedia sites had had a single user login ever since they were first established: the one who registered first would own that username on all Wikimedia sites.

2. A person with fewer edits is not necessarily less active than one with more edits. And according to http://en.wikipedia.org/wiki/Wikipedia:Edit_count, ``Edit counts do not necessarily reflect the value of a user's contributions to the Wikipedia project.''

What if some users have a lower edit count,
* because they deliberately edit, preview, edit, and preview the articles, over and over, before submitting the deliberated versions to Wikimedia sites.
* Some users edit, edit and edit the articles in their offline storage, over and over, before submitting only the final versions to Wikimedia sites.

While some users have a higher edit count,
* because they often submit many changes without previewing them first, and have to correct the undeliberated edits, over and over.
* Some users often submit many minor changes, over and over, rather than accumulating the changes into fewer edits.
* Some users do many robot routines by themselves, rather than letting a real robot do those tasks.
* Some users often take part in many edit wars.
* Some users often take part in many arguments on many talk pages.

What if the users with a lower edit count try to increase their edit count to take back the status of primary account? What if they decide to change their editing habits to increase the edit count,
* by submitting many edits without a deliberate preview,
* by splitting the accumulated changes into many minor edits and submitting them separately,
* by stopping their robots and doing those robot routines by themselves,
* by joining edit wars.

3. According to 2) above, I think the better measurement of activeness is the time between the first edit and the last edit of that username. The formula will look like this:

activeness = last edit time - first edit time

> A choice has been made and as always, there will be people that will find an
> un-justice. There were many discussions and a choice was made. It is not
> good to revisit things continuously, it is good to finish things so that
> there is no point to it any more.
>
> Thanks,
> GerardM
>
> On 10/12/07, Anon Sricharoenchai <anon.hui(a)gmail.com> wrote:
> >
> > According to the conflict resolution process, that the account with
> > most edits is selected as a primary account for that username, this
> > may sound reasonable for the username that is owned by the same person
> > on all wikimedia sites.
> >
> > But the problem will come when the same username on those wikimedia
> > sites is owned by different person and they are actively in used.
> > The active account that has registered first (seniority rule) should
> > rather be considered the primary account.
> > Since, I think the person who register first should own that username
> > on the unified wikimedia sites.
> >
> > Imagine, what if the wikimedia sites have been unified ever since the
> > sites are first established long time ago (that their accounts have
> > never been separated), the person who register first will own that
> > username on all of the wikimedia sites.
> > The person who come after will be unable to use the registered
> > username, and have to choose their alternate username.
> > This logic should also apply on current wikimedia sites, after it have
> > been unified.
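The activeness measure proposed in point 3 of the reply can be written down directly. The sketch below is only an illustration: `pick_primary` is a hypothetical helper showing how the measure might be used to choose between conflicting accounts, and the data shape (a mapping from wiki name to a list of edit timestamps) is an assumption, not anything the single user login code is known to use.

```python
from datetime import datetime


def activeness(edit_times):
    """activeness = last edit time - first edit time (a timedelta)."""
    return max(edit_times) - min(edit_times)


def pick_primary(accounts):
    """Return the wiki whose account has the longest first-to-last edit span.

    `accounts` maps a wiki name to a non-empty list of edit datetimes.
    """
    return max(accounts, key=lambda wiki: activeness(accounts[wiki]))
```

Note that under this measure an account with two edits several years apart outranks one with thousands of edits made in a single month, which is exactly the trade-off the thread is arguing about.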

8 13

RfC: Wikipedia data displays
by Erik Moeller 06 Mar '08

We're planning to set up 4 data displays in the Wikimedia Foundation office - I'm thinking at least 19" screens, maybe larger. The intent here is not to appear "hip", but to make the office environment more interesting for visitors, such as potential donors. This creates conversation pieces and memorable moments - which is important for cultivating relationships.

I'd like to request your comments on what kinds of displays we could set up. Some initial ideas:

- Real-time recent changes. This should be relatively straightforward using the IRC feeds. Most effort here will go into prettification, I think. What would be a good IRC client to show multiple channels at once?
- Show random articles. Not particularly creative, but should also be fairly easy to do using some scripting. Would be nice to show stuff from projects beyond WP.
- Show articles matching to current searches. How difficult would it be to capture search data for this?
- Show the actual search strings. I don't love this one, because Google already does this, but it might be interesting content-wise.
- Show traffic data. What would be interesting displays here? Can we show bandwidth usage in real-time?
- Show images as they are being uploaded. Do we have anything like that already? If not, how hard would it be to implement?
- Data displays of developmental indicators - e.g. Gapminder data on Internet access, literacy, etc. Is there anything like this that we could do with relatively little effort? Any volunteers to put something together?
- Geomapping of access - some visualization of the primary clusters where traffic is coming from, based on sampling. I imagine this could be quite tricky - but might be a cool long-term project for a volunteer?
- Visualization of edit patterns, similar to: http://abeautifulwww.com/2007/05/20/visualizing-the-power-struggle-in-wikip…

Other ideas / comments?

--
Erik Möller
Deputy Director, Wikimedia Foundation
Support Free Knowledge: http://wikimediafoundation.org/wiki/Donate

13 19

Tag intersection, crazy idea
by Magnus Manske 03 Mar '08

I just had the following thought: For a tag intersection system,
* we limit queries to two intersections (show all pages with categories A and B)
* we assume on average 5 categories per page (can someone check that?)
then we have 5*4=20 intersections per page.

Now, for each intersection, we calculate MD5("A|B") to get an integer hash, and store that in a new table (page_id INTEGER, intersection_hash INTEGER). That table would be 4 times as long as the categorylinks table.
* Memory usage: Acceptable (?)
* Update: Fast, on page edit only
* Works for non-existing categories

On a query, we look for the (indexed) hash in a subquery, then check those against the actual categorylinks. Looking up an integer in the subquery should be fast enough ;-)

Given the number of categories and INTEGER >4bn, that would make the hash unique for all combinations of 65K categories (if the hash were truly randomly distributed, which it isn't), which should mean that the number of false positives (to be filtered by the main query) should be rather low.

If that's fast enough, we could even expand to three intersections (A, B, and C), querying "A|B", "A|C", and "B|C", and let PHP find the ones common to all three sets.

Summary: Fixing slow MySQL lookup by throwing memory at it... Feasible? Or pipe dream?

Magnus
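To make the proposal concrete, here is a rough in-memory sketch of the scheme, with plain Python lists and dicts standing in for the MySQL tables. All function names are hypothetical, and taking the first four bytes of the MD5 digest as the 32-bit integer is an assumption; the message does not say how the digest would be reduced to an INTEGER.

```python
import hashlib
from itertools import permutations


def intersection_hash(cat_a, cat_b):
    """32-bit integer derived from MD5("A|B") (assumed: first 4 digest bytes)."""
    digest = hashlib.md5(f"{cat_a}|{cat_b}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def intersection_rows(page_id, categories):
    """One (page_id, intersection_hash) row per ordered pair of categories."""
    unique = sorted(set(categories))
    return [(page_id, intersection_hash(a, b)) for a, b in permutations(unique, 2)]


def pages_in_both(cat_a, cat_b, hash_rows, categorylinks):
    """Look up the hash first, then filter false positives via categorylinks."""
    wanted = intersection_hash(cat_a, cat_b)
    candidates = {page_id for page_id, h in hash_rows if h == wanted}
    return sorted(
        page_id for page_id in candidates
        if {cat_a, cat_b} <= categorylinks[page_id]
    )
```

Storing ordered pairs is what gives 5*4=20 rows for a page with five categories, and it means a query for "A|B" or "B|A" both hit the index. Storing unordered pairs with the two names sorted before hashing would halve the table; that is a variation, not what the message describes.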

8 27

CSS minus
by Huji 02 Mar '08

This may be the dumbest question I've ever asked, so go easy on me please!

In SkinTemplate.php we have this:

$sitecss .= '@import "' . self::makeUrl( '-', "action=raw&gen=css$siteargs$skinquery" ) . '";' . "\n";

I understand that it helps when a useskin parameter is passed; what I don't understand is, why does it have to return a value when no such parameter is passed? It can simply return "nothing", can't it?

Hojjat

4 9

Re: [Wikitech-l] [MediaWiki-CVS] SVN: [31385] trunk/extensions/ConfirmEdit
by Brion Vibber 02 Mar '08

catrope(a)svn.wikimedia.org wrote:
> * Adding SimpleCaptcha::addCaptchaAPI() method that adds CAPTCHA information to an API result array. Other CAPTCHA implementations should override this method with a function that does the same (did this for FancyCaptcha and MathCaptcha)
[snip]
> + $resultArr['captcha']['type'] = 'simple';
> + $resultArr['captcha']['id'] = $index;
> + $resultArr['captcha']['question'] = $captcha['question'];
[snip]
> + $resultArr['captcha']['type'] = 'image';
> + $resultArr['captcha']['id'] = $index;
> + $resultArr['captcha']['url'] = $title->getLocalUrl( 'wpCaptchaId=' . urlencode( $index ) );
[snip]
> + $resultArr['captcha']['type'] = 'math';
> + $resultArr['captcha']['id'] = $index;
> + $resultArr['captcha']['sum'] = $sum;

Hmmm... How is an API client meant to figure out what to do with this captcha data, when there seems to be nothing consistent in how it's presented? How is it meant to display a new type of challenge which might be added in the future?

If you're going to have an API for this, I think it needs at least some minimum future-proofing; otherwise all the clients will break when something is tweaked in the captcha.

-- brion vibber (brion @ wikimedia.org)

4 9
