Wikipedia-l June 2002

wikipedia-l@lists.wikimedia.org

37 participants
155 discussions

New wiki renders fine with Mozilla 1.0RC3/Linux
by Robert Graham Merkel 13 Jun '02

No problems - I checked a couple of pages (including one or two w/images), and there don't seem to be any issues.

--
Robert Merkel  rgmerk(a)mira.net

Go You Big Red Fire Engine
  -- Unknown Audience Member at Adam Hills standup gig

Browser testing in new code
by lcrocker＠nupedia.com 13 Jun '02

Re: [Wikipedia-l] Searching in new codebase
by lcrocker＠nupedia.com 13 Jun '02

> The point of the parser was to detect the cases where the query
> was not well-formed, unbalanced brackets, "A and or B", et cetera,
> and then give some kind of syntax error to indicate what is wrong.
> What do you do with those cases now?

Now I let MySQL choke on it, and report back its error, which is actually much more useful than it sounds. The part of the program that reports MySQL errors is very good; I originally made it that way for debugging, but it's not bad for this case either. I just don't see that much benefit from making nicer error messages on badly formed searches.

> Btw. is it correct that you only highlight one search word per line
> in the result of the search?

I fixed that, and also the word-boundary problem, but it does still limit the context display to 60 characters before and after the first hit of each line. Personally, I think that's plenty to get a sense of context, but I might be convinced otherwise.

> Ah, I see, sorry for not checking your code first. So that is why
> the scoring doesn't work anymore. The simplest way to get scoring
> back is probably to not eliminate duplicates when processing.

Hmm. That might be a good idea. I might even be able to add extra duplicates for words in headings or something. I initially assumed that eliminating duplicates would speed the search, but if it hurts the scoring, that's not a good tradeoff.

> Not necessarily. The usual way to do this is define your own index
> table like
> Text_index(word, article, #occurrences)
> and then you let MySQL compute some sort of scoring and sort on
> that. This is tricky if you have OR and NOT but with only AND
> this is easy.

I think I'll try removing the dup-stripping first.
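
The index-table idea quoted above, Text_index(word, article, #occurrences) with AND-only queries, might look roughly like the following. This is only a sketch under stated assumptions: it uses Python's bundled SQLite rather than MySQL, the score is simply the sum of occurrence counts, and the names build_index and search_all_terms are hypothetical and not taken from the Wikipedia codebase.

```python
import re
import sqlite3
from collections import Counter


def build_index(conn, articles):
    """Fill a text_index(word, article, occurrences) table.

    `articles` maps a title to its text. Words are lower-cased runs of
    ASCII letters and digits (an assumption made for this sketch).
    """
    conn.execute(
        "CREATE TABLE text_index (word TEXT, article TEXT, occurrences INTEGER)"
    )
    for title, text in articles.items():
        counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
        conn.executemany(
            "INSERT INTO text_index VALUES (?, ?, ?)",
            [(word, title, n) for word, n in counts.items()],
        )


def search_all_terms(conn, terms):
    """Return (article, score) rows for articles containing every term.

    The score is the total number of occurrences of the search terms,
    so keeping duplicate counts is what makes the ranking possible.
    """
    terms = sorted({t.lower() for t in terms})
    if not terms:
        return []
    placeholders = ", ".join("?" for _ in terms)
    sql = (
        "SELECT article, SUM(occurrences) AS score FROM text_index "
        f"WHERE word IN ({placeholders}) "
        "GROUP BY article "
        "HAVING COUNT(DISTINCT word) = ? "  # implicit AND: all terms present
        "ORDER BY score DESC, article ASC"
    )
    return conn.execute(sql, (*terms, len(terms))).fetchall()
```

Calling build_index on an in-memory connection (sqlite3.connect(":memory:")) and then search_all_terms(conn, ["pvc", "pipe"]) returns only the articles that contain both words, ranked by how often they occur. Handling OR and NOT would need more than the single HAVING clause used here, which matches the remark that only the AND case is easy.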

Re: [Wikipedia-l] Searching in new codebase
by lcrocker＠nupedia.com 12 Jun '02

> Having said that I now favour removing my search code and moving
> to MySQL's binary search, because if you don't like its default
> scoring you can now use the +'s. It was fun writing a parser for
> boolean expressions but if we can get rid of that complicated
> piece of code and defer some of the work to the database I'm all
> for it.

Simplify, simplify. The first simplification I did was to get rid of the parser because it wasn't necessary--SQL is already doing it, so I just pass on the ANDs, ORs, and NOTs as they are. Yes, I put an implicit AND between terms, because that makes fast, small result sets.

The boolean searching in MySQL 4.0 would be great--but that's a BIG leap--MySQL 4.0 is not a stable product. It's alpha software, and I'm not so sure that giving up the reliability of 3.23 is worth the extra features. Does anyone on the list have experience with MySQL 4.0 in a production environment? MySQL 3.23 is very stable and reliable. Even recompiling it from source was simple (I did that to get rid of the 4-letter minimum--you can check that out at the new site--search for "PVC" for example).

The second change I made to the search was to parse the article text into a separate field the way we were already doing for titles. This field contains all the unique words of the article just once, case-folded and stripped of punctuation (so it fixes the '' problem, for example). I even do some processing for things like [[game]]s, which will put both "game" and "games" in the index. We could expand this preprocessing to do some things. We could also do our own scoring after MySQL returns the raw results, but that would require making a pass through the entire result set before displaying anything.

Another thing about the search in the new codebase is that it is blindingly fast--it can often return results within 2 seconds. When it's that fast, you don't need as many features because the user can do multiple searches.

> However, in the long run we should probably implement our own
> indexing. That would allow us to tackle several problems:
> - the ' problem
> - searching UTF-8 with proper collation without hacking the
>   character set
> - recognizing entities such as &ouml;
> - languages with inflections

All of these can be solved with the pre-processing already in the new codebase--in fact the ' problem is already solved. I haven't done anything with new character sets, but that should be pretty easy--take a look at SearchUpdate.php.

> - partial matches or ... we could wait for the MySQL team to
> implement the Generic user-suppliable UDF preparser as
> mentioned in their to-do list. Perhaps we should give them a
> call. :-)

I'm really big on stable, reliable software. Even if MySQL chose to implement something like that, I wouldn't recommend using it until it had been in production for a few months, and we can't even say that of 4.0 yet.
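
The preprocessing described in this message (each unique word stored once, case-folded, punctuation stripped, and [[game]]s contributing both "game" and "games") could be approximated as below. This is an illustrative Python sketch, not the code in SearchUpdate.php; the function name index_words and the exact regular expressions are assumptions.

```python
import re

# A plain wiki link immediately followed by letters, e.g. [[game]]s.
# Piped links such as [[target|label]] are deliberately not handled here.
LINK_WITH_SUFFIX = re.compile(r"\[\[([^\[\]|]+)\]\]([A-Za-z]+)")


def index_words(wikitext):
    """Return the sorted unique search words for one article's text."""

    def expand(match):
        base, suffix = match.group(1), match.group(2)
        # Keep both the link target and the target plus its suffix.
        return f" {base} {base}{suffix} "

    text = LINK_WITH_SUFFIX.sub(expand, wikitext)
    text = text.casefold()
    # Runs of letters and digits; quotes, brackets and other punctuation
    # (including the '' italic markers) act as separators and are dropped.
    words = re.findall(r"[^\W_]+", text)
    return sorted(set(words))
```

For example, index_words("''Chess'' is a [[game]]s classic") yields a list containing "chess", "game" and "games". Note that this sketch also splits contractions such as "don't" into two words, which a real indexer might want to treat differently.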

continued growth rate and quality
by Lars Aronsson 12 Jun '02

Wikipedia now has more than 50,000 pages (28,000 articles), and the 1,000 most recent changes were made in the last 5 days.

But is there any way to judge/measure/monitor the quality of the contribution as volume grows? Are new articles still written on new topics of general interest, or do more and more cover obscure topics? Do more duplicates appear? Is there any way to tell from statistics?

I guess the number of different authors that a new article attracts in its first three months could be an interesting statistical measure.

--
Lars Aronsson <lars(a)aronsson.se>
tel +46-70-7891609
http://aronsson.se/ http://elektrosmog.nu/ http://susning.nu/

Searching in new codebase
by Axel Boldt 12 Jun '02

The search code in the new codebase behaves similarly to our current code in that it assumes an implicit AND for several search terms, and doesn't return any results if no articles match all terms.

I wonder if this is the intuitive behavior for most users. I think Google has conditioned people to type in as much relevant information as possible to get better hits, and most search engines work that way. In fact, the built-in MySQL search code works that way too. Maybe we should use it directly? That way, we could also present the results according to relevancy (which MySQL reports), rather than alphabetically.

We would lose the boolean AND OR NOT operators, but newer versions of MySQL have substitutes: you use "+term" if you definitely want the term in your results, and you use "-term" if you definitely don't want it. This is almost as powerful as boolean searching.

Alternatively, we could have an "advanced search" page where you could construct a boolean search, include/exclude specific namespaces etc. Now that I think about it, a way to optionally search talk: and wikipedia: would probably be desirable.

Axel
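
The "+term" / "-term" behaviour described above can be illustrated with a small Python sketch. It only models the semantics as stated in this message (required terms, excluded terms, and unmarked terms that improve the ranking); it is not MySQL's actual relevance formula, and parse_query and score are hypothetical names.

```python
def parse_query(query):
    """Split a query into (required, excluded, optional) term sets."""
    required, excluded, optional = set(), set(), set()
    for token in query.lower().split():
        if token[0] in "+-":
            sign, term = token[0], token[1:]
        else:
            sign, term = "", token
        if not term:
            continue  # a bare "+" or "-" carries no term
        if sign == "+":
            required.add(term)
        elif sign == "-":
            excluded.add(term)
        else:
            optional.add(term)
    return required, excluded, optional


def score(article_words, query):
    """Return a simple relevance count, or None if the article is rejected.

    An article is rejected when a "+term" is missing, a "-term" is
    present, or none of the positive terms occur in it at all.
    """
    required, excluded, optional = parse_query(query)
    words = set(article_words)
    if not required <= words or excluded & words:
        return None
    hits = len(required) + len(optional & words)
    return hits if hits else None
```

With this model, score({"mysql", "search", "index"}, "+mysql -oracle search") gives 2, while any article containing "oracle" is dropped. Unmarked terms no longer have to match, which is the looser behaviour suggested above, in contrast to the implicit AND.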

Re: [Wikipedia-l] file uploads
by lcrocker＠nupedia.com 10 Jun '02

> Mistaken? In 1999 Unisys stated that its policy is to require
> a $5000 fee from websites that carry GIF images made by unlicensed
> software -- even nonprofit websites created and displayed with free
> software. Can Wikipedia prove that every GIF image uploaded to it
> has been created by a properly licensed GIF encoder? I think not.

Unisys can claim any damn thing it wants. But it's what the law says that matters, and the law says that Unisys is just blowing smoke up our ass on that claim. Only the claim on encoding software has any legal merit.

Again, I'm not averse to excluding GIFs from Wikipedia for many reasons, but fear of a legitimate patent infringement claim is not one of them.

New codebase update, question about searches
by lcrocker＠nupedia.com 10 Jun '02

Oops. Sorry about the empty message.

I recompiled MySQL with the 3-letter minimum instead of four, and implemented a search function on the new codebase. As with some other decisions, I went with speed over functionality I didn't see much use for, but I'm willing to be convinced. For example, title matches and text matches are separated, and there is no total count of matches at the top (that's a whole query in itself, and I didn't think it was useful enough to be worth the time). I trimmed down the amount of context shown with each hit, and added line numbers, just so that one gets an idea of the use of the term in context (and I put it in red).

When this search is satisfactory (I want to make at least one more tweak to solve the MySQL '' problem), that will be all of the major functionality of the Wiki, and only filling in a few special pages remains, so now is a good time for some major testing.

Also, I turned on the PHP option for really pedantic error checking, so if you get errors now that you didn't before, that's probably the reason (and please report them to http://sourceforge.net/tracker/?group_id=34373&atid=411192 ).

The test site is still http://www.piclab.com/newwiki/wiki.phtml
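
The trimmed context display with line numbers might be produced along these lines. This is a rough Python sketch rather than the PHP in the new codebase; it assumes a window of 60 characters on either side of the first hit in each line (the figure given elsewhere in this thread), whole-word matching, and a hypothetical function name context_lines.

```python
import re


def context_lines(text, term, width=60):
    """Return (line_number, snippet) pairs for lines containing `term`.

    Only the first hit on each line is used, and the snippet keeps at
    most `width` characters before and after that hit.
    """
    # \b restricts the match to whole words, case-insensitively.
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    results = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = pattern.search(line)
        if match is None:
            continue
        start = max(0, match.start() - width)
        end = min(len(line), match.end() + width)
        results.append((number, line[start:end]))
    return results
```

Highlighting the term (in red, as described) would be a separate step applied to each snippet when the result page is rendered.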

New codebase update, question about searches
by lcrocker＠nupedia.com 10 Jun '02

Re: [Wikipedia-l] file uploads
by lcrocker＠nupedia.com 10 Jun '02

> Still, as long as Wikipedia neither codes nor decodes GIFs, how can
> it be in violation?

It can't. Derek is completely mistaken on that score. Only software that encodes or decodes GIF has any problem, and even the case for decoders is pretty thin. So the patent itself is no reason to forgo use of GIF on web sites.

But it does make a political statement to avoid their use: in the long run, avoiding GIFs on web pages may reduce their use to such low levels that free software developers might be able to produce more non-patent-encumbered software for producing images. And PNG is a superior format anyway (and I'm not just saying that because I'm one of its developers--I was on the committee that created GIF too).
