Hey list,
I have an IRC bot. I'm integrating the MediaWiki RC (recent changes) announcements into it by having the bot listen on a UDP IP/port, parse each incoming message, and then announce it on IRC.
I'm having trouble with the parsing part. The RC announcement message sent by MW is very messy, with seemingly random numbers and other characters mixed in. I've been trying to write a regex that captures the username, the edit reason, the page edited, and the URL of the diff, but I'm having quite a bit of trouble.
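For reference, the listening side amounts to something like the following. This is a minimal Python sketch for illustration only, not the bot's actual code; the function names make_listener and receive_line are made up, and the UTF-8 decoding is an assumption.

```python
import socket


def make_listener(host="127.0.0.1", port=1223):
    """Bind a UDP socket on the address MediaWiki sends RC lines to."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_line(sock, bufsize=4096):
    """Block until one datagram arrives and return it as text.

    Decoding as UTF-8 is an assumption; undecodable bytes are replaced
    rather than raising an error.
    """
    data, _sender = sock.recvfrom(bufsize)
    return data.decode("utf-8", errors="replace")
```

The bot would call make_listener() once and then call receive_line() in a loop, handing each returned string to the parsing step.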
I have this in my LocalSettings.php:
    $wgRC2UDPAddress = '127.0.0.1';
    $wgRC2UDPPort = '1223';
    $wgRC2UDPPrefix = 'Wiki: ';
The string sent to the socket is something like:
    Wiki: 14[[07To Do14]]4 10 02http://domain.tld/wiki/index.php?diff=230&oldid=201 5* 03Username 5* (-45) 10Removed IRC line; added something else
The regex I wrote works when I test it by putting the above text into a string and applying the regex to it. However, it DOES NOT work when I actually run the bot and parse the data coming over the socket, and I'm not sure why. I first thought it had to do with line endings, so I tried removing the ^ and $ anchors and also setting the "m" (multi-line) flag so that . would match line breaks as well, but neither made a difference.
The regex I have now is:
    /Wiki: [0-9]{2}\[\[[0-9]{2}(.+)[0-9]{2}\]\].*(http:\/\/domain\.tld\/wiki\/index\.php.+) [0-9]\* [0-9]{2}(.+) [0-9]\* .+ [0-9]{2}(.*)/
Note, this is a PCRE regex.
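A minimal sketch of that string test, ported to Python's re module for illustration (an assumption; the bot itself uses PCRE, though this particular pattern uses no syntax that differs between the two). The names SAMPLE, RC_PATTERN and parse_rc are hypothetical, the literal brackets, asterisks and dots are backslash-escaped, and the slashes need no escaping because Python has no / delimiters.

```python
import re

# The announcement text exactly as pasted above.
SAMPLE = (
    "Wiki: 14[[07To Do14]]4 10 02http://domain.tld/wiki/index.php"
    "?diff=230&oldid=201 5* 03Username 5* (-45) "
    "10Removed IRC line; added something else"
)

# Same structure as the PCRE pattern: page, diff URL, user, edit summary.
RC_PATTERN = re.compile(
    r"Wiki: [0-9]{2}\[\[[0-9]{2}(.+)[0-9]{2}\]\].*"
    r"(http://domain\.tld/wiki/index\.php.+) [0-9]\* [0-9]{2}(.+) "
    r"[0-9]\* .+ [0-9]{2}(.*)"
)


def parse_rc(line):
    """Return a dict of the four captured fields, or None if no match."""
    match = RC_PATTERN.search(line)
    if match is None:
        return None
    page, url, user, summary = match.groups()
    return {"page": page, "url": url, "user": user, "summary": summary}


print(parse_rc(SAMPLE))
```

Against the pasted text this captures "To Do", the diff URL, "Username" and the edit summary, which is the behaviour I see in my own string test.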
Again, it works fine when I test it against a string containing that text, but not against the actual data received over the socket. I have no idea why.
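One way to compare the two inputs would be to dump the raw datagram and look for bytes that do not survive copy and paste. A small sketch, assuming Python; control_bytes is a hypothetical helper written for this post, not part of any library.

```python
def control_bytes(data):
    """List (offset, hex value) for every byte outside printable ASCII.

    'Printable' here means 0x20 through 0x7e, so control characters,
    line endings and any non-ASCII bytes are all reported.
    """
    return [
        (offset, "0x{:02x}".format(byte))
        for offset, byte in enumerate(data)
        if byte < 0x20 or byte > 0x7E
    ]
```

Running it on the bytes received from the socket, before any decoding, and on the pasted test string encoded the same way would show whether the two are really identical.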
Is there a generic regex available somewhere for parsing this text? And why does MediaWiki use such a messy string for the announcement? It seems odd to me, and very troublesome.
Thanks for the help.