Hey list,
I have an IRC bot. I'm integrating the mediawiki RC announcement into the bot by having it listen on a UDP ip/port, parsing the incoming message, then announcing it on IRC.
I'm having trouble in the "parsing" part of that. The RC announcement message sent from MW is very messy, with random numbers and other characters mixed in. I've been trying to write a regex to group this into the Username, the edit reason, the page edited, and the URL of the diff....but I'm having quite a bit of trouble.
I have this in my LocalSettings.php: $wgRC2UDPAddress = '127.0.0.1'; $wgRC2UDPPort = '1223'; $wgRC2UDPPrefix = 'Wiki: ';
The string sent to the socket is something like: Wiki: 14[[07To Do14]]4 10 02http://domain.tld/wiki/index.php?diff=230&oldid=201 5* 03Username 5* (-45) 10Removed IRC line; added something else
The regex I wrote works when I test it by putting the above text into a string and applying the regex. However, it DOES NOT work when I actually run the bot and parse the data coming over the socket. I'm not sure why it acts like this. I first thought it had to do with line endings, but I tried removing the ^ and $, as well as setting the "m" flag, for multi-line (where . matches linebreaks as well).
The regex I have now is: /Wiki: [0-9]{2}[[[0-9]{2}(.+)[0-9]{2}]].*(http://domain.tld/wiki/index.php.+) [0-9]* [0-9]{2}(.+) [0-9]* .+ [0-9]{2}(.*)/
Note, this is a PCRE regex.
And again, it works fine when I'm testing it against a string of the text, but not the actual data being sent over the socket. I have no idea why.
Is there some sort of generic regex that is available somewhere for parsing this text? Or what? Why does MediaWiki choose such a messy string to use as the announcement? It just seems odd to me and very troublesome.
Thanks for the help.
On Sat, Sep 12, 2009 at 18:27, APseudoUtopia apseudoutopia@gmail.com wrote:
The string sent to the socket is something like: Wiki: 14[[07To Do14]]4 10 02http://domain.tld/wiki/index.php?diff=230&oldid=201 5* 03Username 5* (-45) 10Removed IRC line; added something else
aren't those color codes for IRC?
henna/Finne
On Sat, Sep 12, 2009 at 12:59 PM, Finne Boonen hennar@gmail.com wrote:
aren't those color codes for IRC?
henna/Finne
Ah, I suppose they are. I generally dislike using colors on IRC because the client I use doesn't handle colors well (and colors aren't even part of the RFC). Anyway, is there any way to strip these color codes out? I didn't see a configuration value for it anywhere in the docs.
Actually, could someone point me to the file that generates this RC announcement message, so I can edit it in MW instead of parsing it with a regex?
Thanks.
APseudoUtopia wrote:
The regex I have now is: /Wiki: [0-9]{2}[[[0-9]{2}(.+)[0-9]{2}]].*(http://domain.tld/wiki/index.php.+) [0-9]* [0-9]{2}(.+) [0-9]* .+ [0-9]{2}(.*)/
Those numbers are prepended by byte 3 (irc code introducing a color).
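A quick way to see this is to print the received datagram with its control bytes escaped. The following is only a sketch in Python (not the poster's bot language), and the sample bytes are an assumption: the start of the string quoted in this thread, with byte 3 put back in front of each color number.

```python
import re

# Assumed raw payload: the sample from this thread with \x03 restored
# before each color number (the exact byte layout is an assumption).
raw = b"Wiki: \x0314[[\x0307To Do\x0314]]\x034 \x0310 \x0302http://domain.tld/wiki/index.php?diff=230&oldid=201"

text = raw.decode("utf-8", errors="replace")

# repr() escapes control characters, so the hidden byte 3 becomes visible.
print(repr(text))

# A pattern written against the pasted text (where byte 3 was lost) ...
pasted_pattern = re.compile(r"Wiki: [0-9]{2}\[\[[0-9]{2}(.+)[0-9]{2}\]\]")
# ... versus one that expects byte 3 in front of each color number.
socket_pattern = re.compile(r"Wiki: \x03[0-9]{2}\[\[\x03[0-9]{2}(.+)\x03[0-9]{2}\]\]")

pasted_match = pasted_pattern.search(text)   # no match on the raw data
socket_match = socket_pattern.search(text)   # matches, group 1 is the page title
print(pasted_match, socket_match.group(1))
```

This would explain why a regex can pass against copy-pasted text and still fail on the socket: copying through a terminal or editor usually drops the invisible byte.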
The string sent to the socket is something like: Wiki: 14[[07To Do14]]4 10 02http://domain.tld/wiki/index.php?diff=230&oldid=201 5* 03Username 5* (-45) 10Removed IRC line; added something else
aren't those color codes for IRC?
henna/Finne
Ah, I suppose they are. I generally dislike using colors on IRC because the client I use doesn't manage colors well (And colors aren't even part of the RFC..
It's defined at http://www.mirc.com/help/color.txt. The MediaWiki UDP stream doesn't use background colors, so check for a character 3 followed by one or two digits.
). Anyway, is there any way to strip these color codes out? I didn't see a configuration value anywhere in the docs for it.
Actually, could someone point to me the file which generates this RC announcement message, and I can edit it in MW instead of parsing it with a regex?
Thanks.
Those colors *are* useful. They delimit the fields, even when they are empty. So, for example, the 4 10 you see surround a field that can hold either an action written in lowercase (move, block...) or capital letters, in which case they are flags (currently defined: NMB, for New, Minor and Bot).
Without the colors, you wouldn't be able to differentiate the fields in some corner cases, such as a username designed to trick you.
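To illustrate using the color codes as field delimiters, here is a rough sketch in Python. It assumes the raw line looks like the sample quoted in this thread with byte 3 restored before each color number, plus a bare byte 3 closing the URL, username and asterisk fields; that closing byte, the group names and the helper name parse_rc_line are assumptions, so check them against what your socket actually receives.

```python
import re

# Each field is bounded by color codes, so the captures stop at the next
# byte 3 rather than guessing from spaces or digits. The layout is assumed.
RC_LINE = re.compile(
    r"\x0314\[\[\x0307(?P<page>.+?)\x0314\]\]"    # [[Page title]]
    r"\x034 (?P<flags>[^\x03]*)\x0310 "           # action or NMB flags, may be empty
    r"\x0302(?P<url>[^\x03]*)\x03 \x035\*\x03 "   # diff URL
    r"\x0303(?P<user>[^\x03]*)\x03 \x035\*\x03 "  # username
    r"(?P<size>.*?) \x0310(?P<comment>[^\x03]*)"  # size change, edit summary
)

def parse_rc_line(line):
    """Return a dict of the captured fields, or None if the line does not match."""
    match = RC_LINE.search(line)
    return match.groupdict() if match else None

# Hypothetical raw line built from the sample in this thread.
sample = (
    "Wiki: \x0314[[\x0307To Do\x0314]]\x034 \x0310 "
    "\x0302http://domain.tld/wiki/index.php?diff=230&oldid=201\x03 \x035*\x03 "
    "\x0303Username\x03 \x035*\x03 (-45) "
    "\x0310Removed IRC line; added something else\x03\n"
)

fields = parse_rc_line(sample)
print(fields)
```

Because the username capture ends at the next byte 3, a username containing spaces, asterisks or digits cannot shift the later fields.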
APseudoUtopia wrote:
Ah, I suppose they are. I generally dislike using colors on IRC because the client I use doesn't manage colors well (And colors aren't even part of the RFC.. ). Anyway, is there any way to strip these color codes out? I didn't see a configuration value anywhere in the docs for it.
Stripping (Perl, adjust accordingly):
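    sub strip {
        my $text = shift;
        # Remove color codes (^C followed by fg or fg,bg digits)
        # and any remaining formatting control characters.
        $text =~ s/\cC\d{1,2}(?:,\d{1,2})?|[\cC\cB\cI\cU\cR\cO]//g;
        return $text;
    }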
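For a bot that is not written in Perl, the same substitution ports directly. The following is a sketch of the routine above in Python; the name strip_irc_formatting and the sample string are illustrative only. Note that \cI is the tab character (byte 9), so this also strips tabs, exactly as the Perl version does.

```python
import re

# Same alternatives as the Perl regex: a color code (byte 3 plus "fg" or
# "fg,bg" digits), or one of the control characters ^C ^B ^I ^U ^R ^O,
# written here as their byte values.
IRC_FORMATTING = re.compile(r"\x03\d{1,2}(?:,\d{1,2})?|[\x03\x02\x09\x15\x12\x0f]")

def strip_irc_formatting(text: str) -> str:
    """Return text with color codes and the listed control characters removed."""
    return IRC_FORMATTING.sub("", text)

# Hypothetical fragment in the style of the sample from this thread.
sample = "\x0314[[\x0307To Do\x0314]]\x034 \x0310 \x0303Username\x03"
stripped = strip_irc_formatting(sample)
print(repr(stripped))
```

Keep in mind that stripping throws away the field boundaries, so it is better suited to displaying the line than to parsing it.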
On Sat, Sep 12, 2009 at 12:27 PM, APseudoUtopia apseudoutopia@gmail.com wrote:
Hey list,
I have an IRC bot. I'm integrating the mediawiki RC announcement into the bot by having it listen on a UDP ip/port, parsing the incoming message, then announcing it on IRC.
Just use code from existing bots: http://www.mediawiki.org/wiki/Manual:IRC_RC_Bot or one of its see alsos, or http://www.mediawiki.org/wiki/Manual:MediaWiki-Recent_Changes-IRCBot
APseudoUtopia wrote:
Hey list,
I have an IRC bot. I'm integrating the mediawiki RC announcement into the bot by having it listen on a UDP ip/port, parsing the incoming message, then announcing it on IRC.
I'm having trouble in the "parsing" part of that. The RC announcement message sent from MW is very messy, with random numbers and other characters mixed in.
As others have pointed out, these are color codes, and they may help with parsing. But the format is very messy and incomplete. What you get in UDP is probably more or less readable, if messy, but once this is passed through IRC, long messages get cut off.
Basically: we need a better solution for this. My goal is to get a live feed of XML-formatted messages (using the same format used by the API) and to transmit them via XMPP (well, the first step will still be UDP). I have been talking about this for a while now, and got nods for the idea from the core admins at Wikimania.
Extension code (incomplete and largely unfinished, but able to emit XML via UDP) is here: http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions/XMLRC/.
But this isn't a priority project of mine, so it's progressing slowly, especially the part where I would have to set up an XMPP server (not needed if you can receive UDP directly, but without it, this project would be quite useless for Wikimedia projects). So... anyone want to help? I'm currently trying to set up Openfire on the toolserver. I have never worked with Openfire before. Can someone help me with that?
thanks daniel
mediawiki-l@lists.wikimedia.org