>>>From: Brion Vibber <brion(a)pobox.com>
>>>
>>>On Thu, 2003-05-01 at 08:53, Nick Reinking wrote:
>>> > Off the top of my head, I can't think of any simple mathematical way
>>>to
>>> > do what you want to do. (that being making articles selected randomly
>>> > less likely to be selected randomly over time)
>>>
>>>Not sure who wants to do that... For me it's quite sufficient to make
>>>the probability of selection random, and let the size of the selection
>>>field make multiple selections _reasonably_ rare.
>>
>>With over 100,000 articles, it's unlikely the user will ever see the same
>>article come up twice. A previous flawed algorithm had the property of
>>making a few articles much more likely to be selected.
>>
>I thought the reason we were having this discussion was because articles
>were repeating too often, indicating some fault in the randomizer?
>I have commonly seen an article twice in a sequence of 3-5 random page
>requests - about a dozen times in the past month. On one occasion I saw an
>article 3 times in a sequence of 6, which is definitely not statistically
>random.
A calculator is only as smart as the fool pressing the buttons. Same with
pseudorandom number generators. The PRNG appears to have been working
perfectly, allowing us to misuse it with amazing precision. For an
explanation of what went wrong, see my first post on this topic.
As for Brion's comment about MySQL updating RAND() seeding -- I assumed they
would be using a persistent seed, but that comment suggests they're making
the same mistake our own code makes: initialising the seed on every
connection, from the system time. Just because you ask for microseconds
doesn't mean you get microseconds. The PHP manual doesn't say how
microtime() is implemented; what if it's using the old 18.2 Hz clock?
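To make the failure mode concrete, here's a minimal sketch (hypothetical
code, not quoted from our tree) of seeding per request from a coarse clock:

  <?php
  // Seed on every request from the "microsecond" clock:
  list( $usec, $sec ) = explode( ' ', microtime() );
  mt_srand( (int) ( (float) $usec * 1000000 ) );

  // If the underlying clock only ticks at ~18.2 Hz, every request landing
  // in the same ~55 ms tick gets the same seed, and therefore the same
  // mt_rand() sequence: identical "random" results for all of them.
  echo mt_rand() / mt_getrandmax();
  ?>

Seeding once and letting the generator carry its own state forward avoids
the collision entirely.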
-- Tim Starling.
>From: Brion Vibber <brion(a)pobox.com>
>
>On Thu, 2003-05-01 at 08:53, Nick Reinking wrote:
> > Off the top of my head, I can't think of any simple mathematical way to
> > do what you want to do. (that being making articles selected randomly
> > less likely to be selected randomly over time)
>
>Not sure who wants to do that... For me it's quite sufficient to make
>the probability of selection random, and let the size of the selection
>field make multiple selections _reasonably_ rare.
With over 100,000 articles, it's unlikely the user will ever see the same
article come up twice. A previous flawed algorithm had the property of
making a few articles much more likely to be selected.
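(For the record, a quick birthday-problem estimate backs this up: with
N = 100,000 articles and k random hits, the chance of any repeat is about
1 - exp(-k(k-1)/2N), so a dozen clicks should collide only about 0.07% of
the time.)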
>The resetting of the random weight upon load is just meant to keep the
>pot stirring, and "in theory" shouldn't have any effect at all on
>randomness if the random indexes were random to begin with.
Quite right. I would recommend only setting cur_random on article creation.
It would save that little tiny bit of database load. There's no reason to
think this "pot stirring" will make the selection more random, now that the
pot has been pre-stirred.
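In code it's just one extra column on the insert path; a minimal sketch
(column list abbreviated, PHP 4 mysql functions assumed):

  <?php
  // Assign cur_random exactly once, at article creation, and never
  // touch it again on page views:
  $rand = mt_rand() / mt_getrandmax();
  mysql_query( "INSERT INTO cur (cur_title, cur_text, cur_random)" .
               " VALUES ('$title', '$text', $rand)" );
  ?>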
> > Seems to me that the best way to do this is either:
> > 1) Be truly random. In PostgreSQL, this would be something like:
> > SELECT cur_id from cur LIMIT 1 OFFSET random(SELECT count(cur_id) FROM cur)
> > (this function should be really fast, not sure if faster than:)
>
>Mysql doesn't seem to let you use a non-constant expression for
>'LIMIT offset,count' (and besides, still no subqueries), but we can pick
>a random number easily enough in PHP. The problem is that we aren't sure
>what range to use; hence a random index attached to each article whose
>range is known (between 0 and 1).
The MySQL documentation suggests using "ORDER BY RAND() LIMIT 1", for
versions later than 3.23. But I think the current algorithm is quite clever.
Its only drawback is the overhead associated with maintaining the extra
index, and initialising the field on article creation.
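For reference, the selection step in the current scheme amounts to
something like this (simplified sketch; the real code also has to handle
the case where nothing sorts above the random point):

  <?php
  // Pick a uniform point in [0,1] and take the first article at or
  // above it in cur_random order: one indexed range scan.
  $rand = mt_rand() / mt_getrandmax();
  $res = mysql_query( "SELECT cur_id FROM cur" .
                      " WHERE cur_random >= $rand" .
                      " AND cur_namespace = 0 AND cur_is_redirect = 0" .
                      " ORDER BY cur_random LIMIT 1" );
  $row = mysql_fetch_row( $res );
  ?>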
>We don't want to select non-article pages or redirects, but we don't
>currently have a count of exactly that. There's the "good articles"
>count, but it has other criteria, so if we used it we'd always crop off
>relatively recent articles. If we inflate it based on an estimate we'd
>get overemphasis on the _most_ recent article if we estimate too high.
>
>Sure, there's count(*), but that takes time to run... might be faster
>with an index specifically on cur_namespace,cur_is_redirect. Or we could
>maintain another column in the stats table which counts non-redirect
>article pages with _no_ other criteria (such as size or comma/bracket
>count). Rebuilding indexes on cur is _very_ slow, so if we need to
>change the indexes, best to plan ahead.
Maybe we could have a "good article" flag, stored in cur on edit.
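A sketch of what that flag might buy us (cur_is_article is a hypothetical
column name, and the criteria shown are stand-ins for whatever "good
article" ends up meaning):

  <?php
  // On save: recompute the flag for the page being touched.
  $flag = ( $namespace == 0 && !$isRedirect ) ? 1 : 0;
  mysql_query( "UPDATE cur SET cur_is_article = $flag WHERE cur_id = $id" );

  // The article count then becomes a cheap indexed query:
  $res = mysql_query( "SELECT COUNT(*) FROM cur WHERE cur_is_article = 1" );
  ?>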
-- Tim Starling.
>> If this is the problem, we are in luck because there have been a lot
>> of good improvement suggestions. But they all add complexity to the
>> code (or database setup) and "premature optimization is the root of
>> all kinds of evil," so if link checking isn't a bottleneck it would be
>> counterproductive to spend a lot of time to try to optimize it.
> I think it's a bit premature for that yet. I think the differently-
> rendered missing links feature is pretty critical, not just a frill.
> In the short term the hardware will bail us out until we can find
> a solution.
What I was suggesting is just that we _test_ the performance with
differently rendered links turned off, so that we know if this is a
place where optimization is required. I expect it is, but I think it
makes sense to check.
Even a simple hash in shared memory means that we are maintaining data
in two places all the time, which makes things somewhat more complex to
maintain.
I think this is a good idea -- if link checking is creating the current
bottleneck. And that's why I voiced support for turning off the links to
see the performance impact before anybody spends time coding an
optimized link checking routine.
--Mark Christensen
> Well, to be fair, one of the very first questions
> I asked upon joining wikien-l was whether the
> database and webserver were on one host or two,
> with exactly that reasoning in mind. But I don't
> think I've written any well-known books...
> unless the Wikipedia itself counts? ;)
I believe Jimbo was talking about Ian Gilfillan, author of "Mastering
MySQL 4", who posted a couple of suggestions back in February.
--Mark
Lee Daniel Crocker <lee(a)piclab.com> said:
> (David A. Wheeler <dwheeler(a)dwheeler.com>) said:
> >
> > 1. Perhaps for simple reads of the current article (cur), you
> > could completely skip using MySQL and use the filesystem instead.
>
> In other words, caching.
Sorry, I wasn't clear.
I wasn't thinking of caching - I was thinking of accessing the
filesystem INSTEAD of MySQL when getting the current wikitext.
Why? Well, I suspect that accessing the filesystem directly
is much faster than accessing the data via MySQL - if most
accesses are simple reads, then you can access it without
user-level locks, etc., etc. Even more importantly,
checking for existence is a simple filesystem check - which
is likely to be much faster than the MySQL request.
Would it be faster? I don't know; the only _real_ way to find out
is to benchmark it.
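For concreteness, the idea is something like this (directory layout and
file naming are my assumptions, not a proposal for the real layout):

  <?php
  // Current wikitext lives in one file per article:
  $path = '/var/wiki/cur/' . md5( $title ) . '.txt';

  // Existence check is a single stat() call, no database round trip:
  if ( file_exists( $path ) ) {
      $fp = fopen( $path, 'rb' );
      $text = fread( $fp, filesize( $path ) );
      fclose( $fp );
  }
  // ...versus: SELECT cur_text FROM cur WHERE cur_title = '...'
  ?>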
Of course, if wikipedia is near the breaking point for performance,
another approach would be to change the design so that reading
only requires one lookup (for the data itself).
You noted the two big problems, and I agree that they're the
sticking points.
You could abandon many user settings, except ones that the user
can supply themselves to select between different stylesheets, and
abandon displaying links differently depending on whether or not
they're there. Less desirable, but you've already abandoned supporting
search! Then you can cache the generated HTML as well.
If it's a choice between having a working wikipedia, and
having the bells & whistles, I think working is the better plan.
You can always include them as settable options, to be returned once
the system doesn't have performance problems.
Although databases are more flexible for storing structured
data, for simple unstructured data, a simple filesystem-based
approach might be more suitable. This also lets you use other
existing tools (like the many tools that index files for
later rapid searching).
A quick start might be to temporarily disable all checking
of links, and see if that helps much.
> > [Rendering] could also be sped up, e.g., by rewriting it in flex.
> > My "html2wikipedia" is written in flex - it's really fast and didn't
> > take long to write. The real problem is, I suspect that
> > isn't the bottleneck.
>
> It isn't. And there's no reason to expect flex to be any faster
> than any other language.
Actually, for some lexing applications flex can be MUCH faster.
That's because it can pre-compile a large set of patterns
into C, and compile the result. Its "-C" option can, for
some applications, result in blazingly fast operations.
You CAN do the same thing by hand, but it takes a long time to
hand-optimize that kind of code.
However, there's no point in rewriting what is not the bottleneck.
Which is why I was hoping to hear if someone has done measurements
to identify the real bottlenecks, e.g., "50% of the system
time is spent doing X". If most time is spent rendering
articles for display (without editing), then it's worth examining
what's taking the time. If the time is spent on checking if
links exist, then clearly that's worth examining.
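Even crude wall-clock bracketing would answer that. A minimal sketch:

  <?php
  // Bracket a suspect section with timestamps and log the difference.
  function now_float() {
      list( $usec, $sec ) = explode( ' ', microtime() );
      return (float) $usec + (float) $sec;
  }
  $t0 = now_float();
  // ... render the article, or check the links ...
  error_log( 'render took ' . ( now_float() - $t0 ) . 's' );
  ?>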
Oh, one note - if you want to simply store whether a
given article entry exists, and quickly check it, one
fancy way of doing this is by using a Bloom filter.
You can hash the article title, and then use a compact bit
structure to store its existence or non-existence.
More info, and MIT-licensed code, for a completely different
application are at:
http://www.ir.bbn.com/projects/SPIE
(there, they hash packets so that later queries can ask
"did you see this packet"?). Given the relatively small
size of article text, it's not clear you need this
(you can store all the titles in memory), but I just thought
I'd mention it.
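For the curious, here's a toy version of the idea (the sizes and the
two-positions-from-one-md5 trick are arbitrary choices for the sketch;
real filters use k independent hashes):

  <?php
  $size = 1 << 20;                           // bits in the filter
  $filter = str_repeat( "\0", $size >> 3 );  // packed bit array

  function bloom_bits( $title, $size ) {
      // Derive two bit positions from one md5 (7 hex chars each, to
      // stay within 32-bit integer range):
      $h = md5( $title );
      return array( hexdec( substr( $h, 0, 7 ) ) % $size,
                    hexdec( substr( $h, 8, 7 ) ) % $size );
  }
  function bloom_add( &$filter, $title, $size ) {
      foreach ( bloom_bits( $title, $size ) as $i ) {
          $filter[$i >> 3] = chr( ord( $filter[$i >> 3] ) | ( 1 << ( $i & 7 ) ) );
      }
  }
  function bloom_test( $filter, $title, $size ) {
      foreach ( bloom_bits( $title, $size ) as $i ) {
          if ( !( ord( $filter[$i >> 3] ) & ( 1 << ( $i & 7 ) ) ) ) {
              return false;   // definitely not present
          }
      }
      return true;            // probably present (false positives possible)
  }
  ?>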
Anyway, thanks for listening. My hope is that the Wikipedia
doesn't become a victim of its own success :-).
LDC wrote:
>> [ibiblio] would LOVE to host wikipedia.org...
>I don't think that's an option, but certainly
>they might be useful as a mirror, or storage
>for backups, etc.
My thoughts exactly LDC. If they go for it, we would
have a fantastic mirror AND we would still have
direct control over our own servers.
--Daniel Mayer (aka mav)
Someone needs to run this:
UPDATE cur SET cur_random=RAND()
Here's a quote from Village Pump explaining why:
The pseudorandom number generator which backs the Special:Randompage link
appears to be not so good. I just hit it about two dozen times and had
several links come up multiple times. -º¡º
Emphasis on pseudo. I got RNA transcription four times in about a dozen
clicks - twice in succession. Also, Zog's favorite song came up - but
redirected to Johnny Rebel. I thought redirects were excluded. Geoffrey
Here's what I found out with a few SQL queries. The cur_random field values
are strongly clustered up around the high end. In fact, there are no
articles at all between about 0.18 and 0.4, and only a few below 0.18. A few
minutes of browsing through the old versions of SpecialRandompage.php shows
why. A previous version of the software selected the lowest-numbered
cur_random value, and set it to a random value. So here's why we now see
poor results: Most of the pages are clustered up above 0.9 or so, so when
you click Special:Randompage , there's a high chance of picking one of the
few low-numbered articles. The cur_random value is then reset, and there's
still a high chance of the new value being below 0.9. Hence, the few
privileged low-numbered articles get selected far more often, and unless
someone re-randomizes the cur_random column, it takes a long time for the
high-numbered articles to diffuse back down. -- Tim Starling 04:25 May 1,
2003 (UTC)
--END QUOTE--
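To see the drift concretely, here's a toy simulation of the old
behaviour (illustrative only, not the wiki code):

  <?php
  // Start with 1000 uniform values, then repeatedly do what the old
  // code did: take the LOWEST value and re-randomize it.
  $n = 1000;
  $r = array();
  for ( $i = 0; $i < $n; $i++ ) {
      $r[] = mt_rand() / mt_getrandmax();
  }
  for ( $hit = 0; $hit < 20000; $hit++ ) {
      $min = 0;
      for ( $i = 1; $i < $n; $i++ ) {
          if ( $r[$i] < $r[$min] ) $min = $i;
      }
      $r[$min] = mt_rand() / mt_getrandmax();  // usually lands higher
  }
  // The low end keeps being vacated, so values pile up near 1.0,
  // which is exactly the clustering visible in cur_random.
  sort( $r );
  printf( "min=%f median=%f max=%f\n", $r[0], $r[$n / 2], $r[$n - 1] );
  ?>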
-- Tim Starling.