Wikitech-l November 2007

wikitech-l@lists.wikimedia.org

90 participants
79 discussions

Re: [Wikitech-l] 78MB diff?
by tim＠greenscourt.com 21 Nov '07

This happens fairly frequently on very large pages on our wiki. While I don't have a good, well-rounded solution, I do know that it's easy to mark such edits as patrolled. Simply hover over the diff link of the page in question and note the rcid= value at the end of the URL. Then go to any other properly displayed diff page, grab the URL of the "mark as patrolled" link, paste it into your browser's URL field, and replace its rcid value with the rcid of the edit that you want to mark patrolled. Press Enter and it's patrolled.

Tim

> -------- Original Message --------
> Subject: [Wikitech-l] 78MB diff?
> From: "Travis Derouin" <travis(a)wikihow.com>
> Date: Wed, November 21, 2007 8:33 am
> To: "Wikimedia developers" <wikitech-l(a)lists.wikimedia.org>
>
> We have a strange diff on our site that appears to be 78MB in size
> that's causing errors:
>
> http://www.wikihow.com/index.php?title=Sweep-a-Girl-off-Her-Feet&diff=13665…
>
> Between this version:
>
> http://www.wikihow.com/index.php?title=Sweep-a-Girl-off-Her-Feet&oldid=1366…
>
> and this version:
>
> http://www.wikihow.com/index.php?title=Sweep-a-Girl-off-Her-Feet&oldid=1366…
>
> (obviously this is vandalism)
>
> It seems like the large diff is a result of a very long list of
> newlines being entered into the revision. I tried putting some error
> checking into DifferenceEngine to avoid displaying or storing large
> diffs in the cache, but it seems like this affects several areas of
> the code. This is the diff that was being stored:
>
> http://207.97.207.17/x/baddiff.html
>
> Any ideas? Has anyone run into this before?
>
> Travis
>
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l(a)lists.wikimedia.org
> http://lists.wikimedia.org/mailman/listinfo/wikitech-l
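The rcid swap described in Tim's reply can also be done programmatically rather than by hand-editing the URL. The following is a minimal Python sketch, not anything from the thread: the function names `get_rcid` and `swap_rcid` and the `wiki.example` URLs are hypothetical, and it assumes the "mark as patrolled" link carries the rcid as an ordinary query parameter.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def get_rcid(diff_url: str) -> str:
    """Return the rcid= value from the diff link of the broken page."""
    query = dict(parse_qsl(urlsplit(diff_url).query))
    return query["rcid"]


def swap_rcid(patrol_url: str, new_rcid: str) -> str:
    """Copy a working "mark as patrolled" URL, replacing only its rcid."""
    parts = urlsplit(patrol_url)
    params = [
        (key, new_rcid if key == "rcid" else value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return urlunsplit(parts._replace(query=urlencode(params)))


# Hypothetical URLs, for illustration only.
broken_diff = "http://wiki.example/index.php?title=Big_Page&diff=200&oldid=199&rcid=4242"
working_patrol = "http://wiki.example/index.php?title=Other_Page&action=markpatrolled&rcid=1000"

print(swap_rcid(working_patrol, get_rcid(broken_diff)))
```

Only the rcid value changes; every other parameter of the working link is kept, which mirrors the manual copy-and-replace procedure.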

Why is difficult to have a non-corrupt dump - other ways of getting the information
by Di (rut) 21 Nov '07

Dear All, especially Anthony and Platonides,

I'm not techy, so why hasn't it been possible to have a non-corrupt dump (one that includes history) in a long time? A professor of mine asked whether the problem could be man(person)-power, and whether it would be interesting/useful to have the university provide a programmer to help the dump happen.

Also, I now have a file from 2006, but I still wonder whether there is no place where one can access old dumps. These will/could be very important research-wise.

And last but not least: if the dumps don't work, then it is very important to be able to dump some articles with their full histories in other ways. I make my plea again: do you know who set the limit so that export only allows 100 revisions? Is there any way to hack that? Would it be possible to make an exception to get the data for a research study?

Thanks!
Rut

Oversight for logs
by Peter van Londen 20 Nov '07

Hi,

Is there something like an oversight function for log entries? The automatic summary generated when you delete an article can give away personal information if it was in the first line of the deleted article. If you forget to rewrite the summary, this personal data is accessible through the logs. Is there a way to delete a log entry?

(I hope this is the right list to ask.)

Kind regards,
Peter van Londen/Londenp

Parser: is *anything* a valid magic word?
by Steve Bennett 20 Nov '07

The parser has to parse and treat magic words like __TOC__. These words are defined in languages/messages/MessagesXx.php (and possibly overridden). That theoretically means that *anything* (like "a" or even " ") could be a magic word. That makes it hard to write a fast parser, as basically you would have to process every character one at a time, look for a match, move on to the next character...

So, two questions:

1) Is it possible/feasible to restrict the range of what could be a magic word in some way, e.g. that they have to start with __, or be drawn from some range of characters?

2) Is it possible to get a complete list of all the magic words currently used for all the languages of Wikipedia? Do the contents of the languages/messages directory already represent that?

I realise that the term "magic word" is somewhat ambiguous: I'm primarily referring to words like __TOC__ that can appear virtually anywhere, rather than words like "subst:" that require a special context, or magic variables like PAGENAME, which (afaik) have to be wrapped in {{..}}.

Thanks,
Steve
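To make the first question concrete: if magic words were restricted to a fixed shape such as `__WORD__`, a scanner could find candidates with one regular expression pass and then check each candidate against the configured set, instead of testing every known word at every character position. The sketch below is in Python and is only an illustration under that assumption. The `MAGIC_WORDS` set is a hypothetical sample rather than the contents of any MessagesXx.php file, and the uppercase-ASCII pattern is a simplification that localized words would likely not satisfy.

```python
import re

# Hypothetical sample; the real list would come from the language files.
MAGIC_WORDS = {"__TOC__", "__NOTOC__", "__NOEDITSECTION__"}

# Assumed restriction: a magic word is uppercase ASCII letters wrapped in "__".
CANDIDATE = re.compile(r"__[A-Z]+__")


def find_magic_words(text):
    """Return (offset, word) pairs for every recognised magic word in text."""
    return [
        (match.start(), match.group())
        for match in CANDIDATE.finditer(text)
        if match.group() in MAGIC_WORDS
    ]


print(find_magic_words("Intro __TOC__ text __FOO__ end __NOTOC__"))
```

Here `__FOO__` matches the candidate pattern but is rejected by the set lookup, so only configured words are reported.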

Mirrored Research Node [was: Why is difficult to have a non-corrupt dump...]
by Felipe Ortega 19 Nov '07

Platonides <Platonides(a)gmail.com> wrote:

> I made a proposal along those lines last month:
> http://thread.gmane.org/gmane.science.linguistics.wikipedia.technical/34547
> You're also welcome to comment on it ;)
> Although the main point seems to be whether the file compression is good enough... The acceptable level of compression varies with things like the WMF disk space available for dumps and the need for a better dump system.

Well, actually, if you read the previous threads on this list, you will see that this has been a recurrent topic over the last two months. AFAIK, this topic also got the attention of the board of trustees, as it is no joke.

It has now been more than a year since the last time we had a complete and valid stub-meta-history for enwiki. Brion has also heard these complaints, so please don't bother him again about that. Currently, he has no time to fix it properly. He also offered some solutions on his blog (read the previous threads, please).

Other big editions (dewiki, frwiki, plwiki...) also present serious problems with complete history dumps. And I think the whole problem arose because the DB server was too stressed, and the dump script lost connectivity to the MySQL backend.

We all agree on the following:

1. We would all like this problem to be fixed soon. Many of us researchers are stalled right now, waiting for new, fresh data.
2. The admins do not have enough time to fix it, because they have more important issues to attend to, and this is normal in a project as big as Wikipedia (let alone the rest of the Wikimedia Foundation projects).

In short: in my humble opinion, we should think about setting up:

1. A mirror, or several mirrors, to duplicate stub-meta-history info and thus offer alternative data repositories for research on Wikipedia and related projects. We at the URJC offer our facilities to the Wikimedia Foundation (and I think other people in this thread could do that too).
2. An intermediate board of researchers that would serve as a central point of contact (though mirrored in practice) for requesting research data about Wikipedia, and that would centralize petitions to the Wikimedia Foundation tech-masters.

This way, everyone could focus on their own tasks, and we would not slow down interesting research work about Wikipedia.

Regards.

Felipe

Re: [Wikitech-l] [MediaWiki-CVS] SVN: [27539] trunk/extensions/ParserFunctions
by Simetrical 19 Nov '07

On 11/16/07, raymond(a)svn.wikimedia.org <raymond(a)svn.wikimedia.org> wrote:
> Revision: 27539
> Author: raymond
> Date: 2007-11-16 08:02:24 +0000 (Fri, 16 Nov 2007)
>
> - $this->message = wfMsgForContent( "expr_$msg", htmlspecialchars( $parameter ) );
> + $this->message = '' . wfMsgForContent( "pfunc_expr_$msg", htmlspecialchars( $parameter ) ) . '';
> - return wfMsgForContent( 'pfunc_rel2abs_invalid_depth', $fullPath );
> + return '' . wfMsgForContent( 'pfunc_rel2abs_invalid_depth', $fullPath ) . '';
> - $result = wfMsgForContent( 'pfunc_time_error' );
> + $result = '' . wfMsgForContent( 'pfunc_time_error' ) . '';
> - return wfMsgForContent( 'pfunc_time_too_long' );
> + return '' . wfMsgForContent(

We usually use , not .

Parser madness
by Magnus Manske 18 Nov '07

Since we're on that topic again :-)

I'd like to announce that I've added a script to my wiki2xml package (svn: wiki2xml/php) that runs the MediaWiki parser tests on it. At first glance there are many errors, but on closer inspection the XML is actually pretty good in most cases; it's just that my XML-to-XHTML script is not entirely up to the task yet.

Also, the "expected results" in the parser tests are sometimes rather MediaWiki-specific. Does it matter if there's a space after <li>? It's not rendered anyway. Or "X\nY" vs. "X Y" in HTML: no difference, AFAIK (except in <pre>). These "non-errors" make up quite a few of the "wrong" results in my tests.

Cheers,
Magnus
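One way to keep such "non-errors" out of the failure count is to normalize both the expected and the actual HTML before comparing them. The Python sketch below is an assumption about how that could look, not part of wiki2xml or the MediaWiki parser tests: the names `normalize_html` and `equivalent` are hypothetical, and it handles only the two cases mentioned above (whitespace runs outside <pre>, and a space directly after <li>).

```python
import re

# Capture <pre>...</pre> blocks so they can be passed through untouched.
PRE_BLOCK = re.compile(r"(<pre\b.*?</pre>)", re.DOTALL | re.IGNORECASE)


def normalize_html(html: str) -> str:
    """Collapse insignificant whitespace, leaving <pre> content verbatim."""
    pieces = PRE_BLOCK.split(html)
    out = []
    for index, piece in enumerate(pieces):
        if index % 2 == 1:
            # Odd indices are the captured <pre> blocks: whitespace matters.
            out.append(piece)
        else:
            piece = re.sub(r"\s+", " ", piece)              # "X\nY" -> "X Y"
            piece = re.sub(r"(<li\b[^>]*>) ", r"\1", piece)  # "<li> a" -> "<li>a"
            out.append(piece)
    return "".join(out).strip()


def equivalent(expected: str, actual: str) -> bool:
    """Compare two HTML fragments modulo the normalizations above."""
    return normalize_html(expected) == normalize_html(actual)


print(equivalent("<p>X\nY</p>", "<p>X Y</p>"))
print(equivalent("<pre>X\nY</pre>", "<pre>X Y</pre>"))
```

A regex-based normalizer like this is deliberately crude; a real comparison would more likely parse both fragments and compare the resulting trees.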

dividing front-end from back-end grammar and parsers
by William Allen Simpson 17 Nov '07

I've just read the past couple of days of discussion, and would like to agree with Merlijn.

One of the points missed is that the pipe trick and many of the other "end cases" are actually pre-processed, not stored in the database. The easy examples being:

* [[turkey (bird)|]] is stored as [[turkey (bird)|turkey]]
* [[stuff]]ing is stored as [[stuff|stuffing]]

Other such behaviors could be regularized without affecting the existing articles.

Some years back, I made some suggestions in this wise, but they were not accepted. A case I was concerned with at the time was normalized pre-processing of [[stuff:]] versus [[:stuff]], and [[|stuff]] versus [[stuff|]], and their combinations -- [[:stuff (action)|]]. This is the kind of thing that could most easily be formalized.

In regularizing the grammar, think about how the back-end data could be normalized to a new grammar for editing, and then stored again in the back-end form. For example, the // and ** ideas we've talked about multiple times over the years. No reason that the database couldn't continue to store them as '' and '''. Or better as and !

If we stick to just front-end parsing, the project might be doable in our lifetimes.

===

And as a final note for the computer scientists, remember that we often use LR(1) and LALR(1) grammars, but RL(1) is also possible! MW syntax has often seemed to me more like RL....

(Yes, back in university we were all required to write a parser -- a year-long project. I've written several for later projects, too. But university was a very long time ago.)
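The pipe trick expansion from the first bullet above can be sketched as a small pre-save transform. This Python version is a simplified illustration, not MediaWiki's actual implementation: the names `pipe_trick_label` and `expand_pipe_trick` are hypothetical, and it assumes the label is derived by dropping a leading colon, a namespace-style prefix, and a trailing parenthetical.

```python
import re

# A link whose label is empty: [[target|]]
EMPTY_LABEL_LINK = re.compile(r"\[\[([^\[\]|]+)\|\]\]")


def pipe_trick_label(target: str) -> str:
    """Derive the display label for an empty-label link (simplified rules)."""
    label = target.lstrip(":")
    if ":" in label:
        # Assumed rule: drop a namespace-style prefix such as "Help:".
        label = label.split(":", 1)[1]
    # Drop a trailing parenthetical such as " (bird)".
    return re.sub(r"\s*\([^()]*\)$", "", label)


def expand_pipe_trick(wikitext: str) -> str:
    """Rewrite [[target|]] into [[target|label]] before the text is stored."""
    return EMPTY_LABEL_LINK.sub(
        lambda m: f"[[{m.group(1)}|{pipe_trick_label(m.group(1))}]]", wikitext
    )


print(expand_pipe_trick("A [[turkey (bird)|]] and [[:stuff (action)|]]."))
```

Links that already have a label, such as [[stuff|stuffing]], do not match the pattern and pass through unchanged, so the transform is safe to run on existing text.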

EBNF grammar project status?
by Steve Bennett 17 Nov '07

What's the status of the project to create a grammar for Wikitext in EBNF? There are two pages:

http://meta.wikimedia.org/wiki/Wikitext_Metasyntax
http://www.mediawiki.org/wiki/Markup_spec

Nothing seems to have happened since January this year. Also, the comments on the latter page seem to indicate the lack of a clear goal: is this just a fun project, is it to improve the existing parser, or is it to facilitate a new parser? It's obviously a lot of work, so it needs to be of clear benefit.

Brion requested the grammar IIRC (and there's a comment to that effect at http://bugzilla.wikimedia.org/show_bug.cgi?id=7 ), so I'm wondering what became of it. Is there still a goal of replacing the parser? Or is there some alternative plan?

Steve

live mirror
by Steve Summit 17 Nov '07

Never sure where to report these. http://wikitionary.biz.
