Hello, I'm currently partially rewriting the log system, because the current one doesn't support i18n well enough.
I'm trying to avoid any radical changes like changes to the database schema. My changes mostly touch handling log entries and formatting them.
So, if you know any defects in the current log system, or have an wish what the new should do, or know someplace where these kind of wishes exist, please tell me. I have scanned the list of bugs in bugzilla quickly, but it is a bit hard to find relevant bugs when there is no logging component.
I'm aiming to solve at least these bugs: https://bugzilla.wikimedia.org/30737 User names should be moved into log messages. https://bugzilla.wikimedia.org/24156 Messages of log entries should support GENDER https://bugzilla.wikimedia.org/24620 Log entries are difficult to localize; rewrite logs system https://bugzilla.wikimedia.org21716 Log entries assume sentence starts with username
-Niklas
It would be nice to have standard functions that supports storing associative arrays in log_params rather than fragile ordered lists. I ended up hacking up a quick function in FlaggedRevs for this. Newer log types could make use of this and existing ones could if they had some b/c code.
Bugzilla from niklas.laxstrom@gmail.com wrote:
Hello, I'm currently partially rewriting the log system, because the current one doesn't support i18n well enough.
I'm trying to avoid any radical changes like changes to the database schema. My changes mostly touch handling log entries and formatting them.
So, if you know any defects in the current log system, or have an wish what the new should do, or know someplace where these kind of wishes exist, please tell me. I have scanned the list of bugs in bugzilla quickly, but it is a bit hard to find relevant bugs when there is no logging component.
I'm aiming to solve at least these bugs: https://bugzilla.wikimedia.org/30737 User names should be moved into log messages. https://bugzilla.wikimedia.org/24156 Messages of log entries should support GENDER https://bugzilla.wikimedia.org/24620 Log entries are difficult to localize; rewrite logs system https://bugzilla.wikimedia.org21716 Log entries assume sentence starts with username
-Niklas
Niklas Laxström
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
I just commited many changes to logging code. There is more to come, but I think this is suitable place to write in more detail what is going on. I'd also like to request code review and testing :)
Thus far I have committed new formatting code and small cleanups. Both LogEventsList and RecentChanges are using the formatters now. I haven't committed my last patch, which changes Title.php to generate log entries using my new code. That will also fix page histories and IRC feed, which use static version of log action text, which is generated together when the new log item is inserted into the database.
There are two major parts in the new logging system: LogEntry and LogFormatter. LogEntry is a model around one log entry. It has multiple subclasses. For constructing new log entries, you will create a new ManualLogEntry and fill necessary info, after which you can call insert() and publish(). If you are loading entries from database, you can simply call DatabaseLogEntry::newFromRow( $row ). It supports rows both from logging and recentchanges table. Usually you want to go directly to LogFormatter and call newFromEntry or the hand newFromRow shortcut. LogFormatter provides getActionText() method, which formats the log action for you, taking those pesky LogPage::DELETED_FOO restrictions into account. The action text includes the username, to support different word orders. There is also getPlainActionText(), which formats the log entry so that it is suitable for page histories and IRC feeds.
LogEntries can have parameters. Parameters should be an associative array. When saved to database, it is encoded to JSON. If you can pass parameters directly to the message which is used to format the action text, you can name the keys like "#:foobar", where # is a number and should start from 4, because parameters 1, 2 and 3 are already reserved and should be common to all log entries. Those are user name link, username for gender and target page link respectively.
If they key is not in #:foobar format, it is not automatically available for the action text message. By subclassing LogFormatter you can do whatever you want with the parameters. Be aware of $this->plaintext value though, it indicates whether we can use any markup or just plaintext. This is how the MoveLogFormatter is registered. I've added a type/* shortcut to avoid some repetition. If the value is an existing class, it will be used. Otherwise the old behavior of calling the function is used through LegacyLogFormatter.
$wgLogActionsHandlers = array( // move, move_redir 'move/*' => 'MoveLogFormatter', );
So what does this all bring to us? * Flexible word order * The most complex piece of log formatting is done only once, and it also takes care of hiding any restricted items * Gender is supported * Ability to store parameters as an associative array * New message naming conventions to reduce boilerplate * Anonymous users can make log entries, that are actually shown * Global logs should be easier to implement now, but it is not directly supported by the current code. * Two simple methods: getActionText and getPlainActionText, instead of the mess of making log entries all over the place * All code for one log type is now in single place, instead of lots of switch $type in different places.
So once more, please text, review and comment. I still have lots to do, all the log types need to be converted one by one to the new system, to take the full benefit of improved i18n. Easiest way to find the commits is probably this page: http://www.mediawiki.org/wiki/Special:Code/MediaWiki/author/nikerabbit
-Niklas
Yay for log_params. I was thinking JSON would be appropriate here, so I'm glat to see that.
I'll toss these revs onto my review queue.
Bugzilla from niklas.laxstrom@gmail.com wrote:
I just commited many changes to logging code. There is more to come, but I think this is suitable place to write in more detail what is going on. I'd also like to request code review and testing :)
Thus far I have committed new formatting code and small cleanups. Both LogEventsList and RecentChanges are using the formatters now. I haven't committed my last patch, which changes Title.php to generate log entries using my new code. That will also fix page histories and IRC feed, which use static version of log action text, which is generated together when the new log item is inserted into the database.
There are two major parts in the new logging system: LogEntry and LogFormatter. LogEntry is a model around one log entry. It has multiple subclasses. For constructing new log entries, you will create a new ManualLogEntry and fill necessary info, after which you can call insert() and publish(). If you are loading entries from database, you can simply call DatabaseLogEntry::newFromRow( $row ). It supports rows both from logging and recentchanges table. Usually you want to go directly to LogFormatter and call newFromEntry or the hand newFromRow shortcut. LogFormatter provides getActionText() method, which formats the log action for you, taking those pesky LogPage::DELETED_FOO restrictions into account. The action text includes the username, to support different word orders. There is also getPlainActionText(), which formats the log entry so that it is suitable for page histories and IRC feeds.
LogEntries can have parameters. Parameters should be an associative array. When saved to database, it is encoded to JSON. If you can pass parameters directly to the message which is used to format the action text, you can name the keys like "#:foobar", where # is a number and should start from 4, because parameters 1, 2 and 3 are already reserved and should be common to all log entries. Those are user name link, username for gender and target page link respectively.
If they key is not in #:foobar format, it is not automatically available for the action text message. By subclassing LogFormatter you can do whatever you want with the parameters. Be aware of $this->plaintext value though, it indicates whether we can use any markup or just plaintext. This is how the MoveLogFormatter is registered. I've added a type/* shortcut to avoid some repetition. If the value is an existing class, it will be used. Otherwise the old behavior of calling the function is used through LegacyLogFormatter.
$wgLogActionsHandlers = array( // move, move_redir 'move/*' => 'MoveLogFormatter', );
So what does this all bring to us?
- Flexible word order
- The most complex piece of log formatting is done only once, and it
also takes care of hiding any restricted items
- Gender is supported
- Ability to store parameters as an associative array
- New message naming conventions to reduce boilerplate
- Anonymous users can make log entries, that are actually shown
- Global logs should be easier to implement now, but it is not
directly supported by the current code.
- Two simple methods: getActionText and getPlainActionText, instead of
the mess of making log entries all over the place
- All code for one log type is now in single place, instead of lots of
switch $type in different places.
So once more, please text, review and comment. I still have lots to do, all the log types need to be converted one by one to the new system, to take the full benefit of improved i18n. Easiest way to find the commits is probably this page: http://www.mediawiki.org/wiki/Special:Code/MediaWiki/author/nikerabbit
-Niklas
-- Niklas Laxström
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
On Thu, Sep 8, 2011 at 2:18 PM, Aaron Schulz aschulz4587@gmail.com wrote:
Yay for log_params. I was thinking JSON would be appropriate here, so I'm glat to see that.
Even though data in those fields is small enough, can serialize()/unserialize() be used instead? It's faster and doesn't require the mess of ServicesJSON to work correctly.
On 8 September 2011 13:36, Max Semenik maxsem.wiki@gmail.com wrote:
On Thu, Sep 8, 2011 at 2:18 PM, Aaron Schulz aschulz4587@gmail.com wrote:
Yay for log_params. I was thinking JSON would be appropriate here, so I'm glat to see that.
Even though data in those fields is small enough, can serialize()/unserialize() be used instead? It's faster and doesn't require the mess of ServicesJSON to work correctly.
Do those cause actual problems or is it just matter of preference? In my opinion JSON is much better for anyone who wants to dig the logs without using PHP. Also, is (un)serialize guaranteed to be stable across PHP versions?
-Niklas
On 11-09-08 04:25 AM, Niklas Laxström wrote:
On 8 September 2011 13:36, Max Semenik maxsem.wiki@gmail.com wrote:
On Thu, Sep 8, 2011 at 2:18 PM, Aaron Schulz aschulz4587@gmail.com wrote:
Yay for log_params. I was thinking JSON would be appropriate here, so I'm glat to see that.
Even though data in those fields is small enough, can serialize()/unserialize() be used instead? It's faster and doesn't require the mess of ServicesJSON to work correctly.
Do those cause actual problems or is it just matter of preference? In my opinion JSON is much better for anyone who wants to dig the logs without using PHP. Also, is (un)serialize guaranteed to be stable across PHP versions?
-Niklas
We already use serialize in HistoryBlob/Revision, the job queue, caching, file metadata, the localization cache, ...
So if you add any new fields to the db you should really stick to (un)serialize. We're already using serialize everywhere and we even use binary storage which is troublesome for anyone trying to stare at the database with most phpmyadmin installs. People being minorly inconvenienced when reading the database raw is the last of our issues. If you want to argue the irrelevant minority that would be slightly inconvenienced reading the database raw I'll argue the irrelevant minority that would be slightly inconvenienced trying to do db queries to mw code externally and have to parse json which isn't as simple as (un)serialize. ;) I'll also wager that HipHop makes the gap in speed between (un)serialize and json farther.
On 8 September 2011 17:57, Daniel Friesen lists@nadir-seen-fire.com wrote:
On 11-09-08 04:25 AM, Niklas Laxström wrote:
On 8 September 2011 13:36, Max Semenik maxsem.wiki@gmail.com wrote:
On Thu, Sep 8, 2011 at 2:18 PM, Aaron Schulz aschulz4587@gmail.com wrote:
Yay for log_params. I was thinking JSON would be appropriate here, so I'm glat to see that.
Even though data in those fields is small enough, can serialize()/unserialize() be used instead? It's faster and doesn't require the mess of ServicesJSON to work correctly.
Do those cause actual problems or is it just matter of preference? In my opinion JSON is much better for anyone who wants to dig the logs without using PHP. Also, is (un)serialize guaranteed to be stable across PHP versions?
-Niklas
We already use serialize in HistoryBlob/Revision, the job queue, caching, file metadata, the localization cache, ...
So if you add any new fields to the db you should really stick to (un)serialize. We're already using serialize everywhere and we even use binary storage which is troublesome for anyone trying to stare at the database with most phpmyadmin installs. People being minorly inconvenienced when reading the database raw is the last of our issues. If you want to argue the irrelevant minority that would be slightly inconvenienced reading the database raw I'll argue the irrelevant minority that would be slightly inconvenienced trying to do db queries to mw code externally and have to parse json which isn't as simple as (un)serialize. ;) I'll also wager that HipHop makes the gap in speed between (un)serialize and json farther.
Very well, r96585.
Big thank you for everyone who already looked and tested the code, especially to Aaron. I have fixed the few issues that have come up.
Have we reached to an agreement to serialize the parameters instead of formatting them with JSON? I am going commit code that actually creates log entries using this new system, so I'd rather be sure we are comfortable with what we have chosen, to avoid unnecessary mix of different formats in the database.
-Niklas
On 8 September 2011 20:00, Niklas Laxström niklas.laxstrom@gmail.com wrote:
On 8 September 2011 17:57, Daniel Friesen lists@nadir-seen-fire.com wrote:
On 11-09-08 04:25 AM, Niklas Laxström wrote:
On 8 September 2011 13:36, Max Semenik maxsem.wiki@gmail.com wrote:
On Thu, Sep 8, 2011 at 2:18 PM, Aaron Schulz aschulz4587@gmail.com wrote:
Yay for log_params. I was thinking JSON would be appropriate here, so I'm glat to see that.
Even though data in those fields is small enough, can serialize()/unserialize() be used instead? It's faster and doesn't require the mess of ServicesJSON to work correctly.
Do those cause actual problems or is it just matter of preference? In my opinion JSON is much better for anyone who wants to dig the logs without using PHP. Also, is (un)serialize guaranteed to be stable across PHP versions?
-Niklas
We already use serialize in HistoryBlob/Revision, the job queue, caching, file metadata, the localization cache, ...
So if you add any new fields to the db you should really stick to (un)serialize. We're already using serialize everywhere and we even use binary storage which is troublesome for anyone trying to stare at the database with most phpmyadmin installs. People being minorly inconvenienced when reading the database raw is the last of our issues. If you want to argue the irrelevant minority that would be slightly inconvenienced reading the database raw I'll argue the irrelevant minority that would be slightly inconvenienced trying to do db queries to mw code externally and have to parse json which isn't as simple as (un)serialize. ;) I'll also wager that HipHop makes the gap in speed between (un)serialize and json farther.
Very well, r96585.
I'd still prefer JSON for offline/non-PHP use. I'm not sure it's a huge deal though.
Bugzilla from niklas.laxstrom@gmail.com wrote:
Big thank you for everyone who already looked and tested the code, especially to Aaron. I have fixed the few issues that have come up.
Have we reached to an agreement to serialize the parameters instead of formatting them with JSON? I am going commit code that actually creates log entries using this new system, so I'd rather be sure we are comfortable with what we have chosen, to avoid unnecessary mix of different formats in the database.
-Niklas
On 8 September 2011 20:00, Niklas Laxström niklas.laxstrom@gmail.com wrote:
On 8 September 2011 17:57, Daniel Friesen lists@nadir-seen-fire.com wrote:
On 11-09-08 04:25 AM, Niklas Laxström wrote:
On 8 September 2011 13:36, Max Semenik maxsem.wiki@gmail.com wrote:
On Thu, Sep 8, 2011 at 2:18 PM, Aaron Schulz aschulz4587@gmail.com wrote:
Yay for log_params. I was thinking JSON would be appropriate here, so I'm glat to see that.
Even though data in those fields is small enough, can serialize()/unserialize() be used instead? It's faster and doesn't require the mess of ServicesJSON to work correctly.
Do those cause actual problems or is it just matter of preference? In my opinion JSON is much better for anyone who wants to dig the logs without using PHP. Also, is (un)serialize guaranteed to be stable across PHP versions?
-Niklas
We already use serialize in HistoryBlob/Revision, the job queue, caching, file metadata, the localization cache, ...
So if you add any new fields to the db you should really stick to (un)serialize. We're already using serialize everywhere and we even use binary storage which is troublesome for anyone trying to stare at the database with most phpmyadmin installs. People being minorly inconvenienced when reading the database raw is the last of our issues. If you want to argue the irrelevant minority that would be slightly inconvenienced reading the database raw I'll argue the irrelevant minority that would be slightly inconvenienced trying to do db queries to mw code externally and have to parse json which isn't as simple as (un)serialize. ;) I'll also wager that HipHop makes the gap in speed between (un)serialize and json farther.
Very well, r96585.
-- Niklas Laxström
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
On Thu, Sep 8, 2011 at 8:36 PM, Max Semenik maxsem.wiki@gmail.com wrote:
Even though data in those fields is small enough, can serialize()/unserialize() be used instead? It's faster and doesn't require the mess of ServicesJSON to work correctly.
I'd prefer JSON. I don't care about the speed, it's not a critical code path, and JSON is stable, well-defined and can be read by any client, whereas serialize is some scary PHP format that may or may not change without notice.
On 11-09-09 06:22 PM, Andrew Garrett wrote:
On Thu, Sep 8, 2011 at 8:36 PM, Max Semenik maxsem.wiki@gmail.com wrote:
Even though data in those fields is small enough, can serialize()/unserialize() be used instead? It's faster and doesn't require the mess of ServicesJSON to work correctly.
I'd prefer JSON. I don't care about the speed, it's not a critical code path, and JSON is stable, well-defined and can be read by any client, whereas serialize is some scary PHP format that may or may not change without notice.
- We already (un)serialize data in and out of the database. - (un)serialize can't change, if it does we already have problems. - These are for database storage we have no reason to input data into a private database in a format expecting people to read the data back from other clients. - json in php requires a mess of code and potentially a 3rd party libraries because: -- the bulit-in json json_{en,de}code library functions may not be installed -- the bulit-in json library in some cases actually has a bug that makes it encode/decode json incorrectly
On Sat, Sep 10, 2011 at 1:00 AM, Daniel Friesen lists@nadir-seen-fire.com wrote:
- json in php requires a mess of code and potentially a 3rd party
libraries because: -- the bulit-in json json_{en,de}code library functions may not be installed -- the bulit-in json library in some cases actually has a bug that makes it encode/decode json incorrectly
Well, that's why we have the FormatJson wrapper. Could use some tests to make sure that the output from Services_Json and json_{en,de}code are identical. I'm whipping up some trivial ones now.
-Chad
Am 10.09.2011 07:20, schrieb Chad:
- json in php requires a mess of code and potentially a 3rd party
libraries because: -- the bulit-in json json_{en,de}code library functions may not be installed -- the bulit-in json library in some cases actually has a bug that makes it encode/decode json incorrectly
Well, that's why we have the FormatJson wrapper. Could use some tests to make sure that the output from Services_Json and json_{en,de}code are identical. I'm whipping up some trivial ones now.
-Chad
This is a good idea. Filed as https://bugzilla.wikimedia.org/show_bug.cgi?id=30841 "Development of unit tests for FormatJson.php versus PHP built-in json_encode() json_decode() "
On Sat, Sep 10, 2011 at 2:30 AM, Thomas Gries mail@tgries.de wrote:
Am 10.09.2011 07:20, schrieb Chad:
- json in php requires a mess of code and potentially a 3rd party
libraries because: -- the bulit-in json json_{en,de}code library functions may not be installed -- the bulit-in json library in some cases actually has a bug that makes it encode/decode json incorrectly
Well, that's why we have the FormatJson wrapper. Could use some tests to make sure that the output from Services_Json and json_{en,de}code are identical. I'm whipping up some trivial ones now.
-Chad
This is a good idea. Filed as https://bugzilla.wikimedia.org/show_bug.cgi?id=30841 "Development of unit tests for FormatJson.php versus PHP built-in json_encode() json_decode() "
I said I was already whipping them up, and indeed I had committed them before you e-mailed. Resolving FIXED ;-)
-Chad
On 11-09-09 10:00 PM, Daniel Friesen wrote:
On 11-09-09 06:22 PM, Andrew Garrett wrote:
On Thu, Sep 8, 2011 at 8:36 PM, Max Semenik maxsem.wiki@gmail.com wrote:
Even though data in those fields is small enough, can serialize()/unserialize() be used instead? It's faster and doesn't require the mess of ServicesJSON to work correctly.
I'd prefer JSON. I don't care about the speed, it's not a critical code path, and JSON is stable, well-defined and can be read by any client, whereas serialize is some scary PHP format that may or may not change without notice.
- We already (un)serialize data in and out of the database.
- (un)serialize can't change, if it does we already have problems.
- These are for database storage we have no reason to input data into a
private database in a format expecting people to read the data back from other clients.
- json in php requires a mess of code and potentially a 3rd party
libraries because: -- the bulit-in json json_{en,de}code library functions may not be installed -- the bulit-in json library in some cases actually has a bug that makes it encode/decode json incorrectly
Here's another kick: - Using JSON in php, when you decode what you encoded you don't always get the same thing back (serialize you of course do)
var_dump(FormatJson::encode(array(1=>1,2=>2)));
string(13) "{"1":1,"2":2}" var_dump(FormatJson::decode('{"1":1,"2":2}')); object(stdClass)#20 (2) { ["1"]=> int(1) ["2"]=> int(2) }
array in, object out.
There is a FormatJson argument to return assoc arrays instead of objects, but this means rather than always getting the right type of data back you have to specifically note when you want assoc arrays and when you want objects.
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]
On Sat, Sep 10, 2011 at 3:13 AM, Daniel Friesen lists@nadir-seen-fire.com wrote:
On 11-09-09 10:00 PM, Daniel Friesen wrote:
On 11-09-09 06:22 PM, Andrew Garrett wrote:
On Thu, Sep 8, 2011 at 8:36 PM, Max Semenik maxsem.wiki@gmail.com wrote:
Even though data in those fields is small enough, can serialize()/unserialize() be used instead? It's faster and doesn't require the mess of ServicesJSON to work correctly.
I'd prefer JSON. I don't care about the speed, it's not a critical code path, and JSON is stable, well-defined and can be read by any client, whereas serialize is some scary PHP format that may or may not change without notice.
- We already (un)serialize data in and out of the database.
- (un)serialize can't change, if it does we already have problems.
- These are for database storage we have no reason to input data into a
private database in a format expecting people to read the data back from other clients.
- json in php requires a mess of code and potentially a 3rd party
libraries because: -- the bulit-in json json_{en,de}code library functions may not be installed -- the bulit-in json library in some cases actually has a bug that makes it encode/decode json incorrectly
Here's another kick:
- Using JSON in php, when you decode what you encoded you don't always
get the same thing back (serialize you of course do)
var_dump(FormatJson::encode(array(1=>1,2=>2)));
string(13) "{"1":1,"2":2}" var_dump(FormatJson::decode('{"1":1,"2":2}')); object(stdClass)#20 (2) { ["1"]=> int(1) ["2"]=> int(2) }
array in, object out.
There is a FormatJson argument to return assoc arrays instead of objects, but this means rather than always getting the right type of data back you have to specifically note when you want assoc arrays and when you want objects.
Only mildly annoying, and not enough of a reason to stop using JSON imo. That's the behavior of json_decode() anyway, so we're not diverging from upstream.
-Chad
On Sat, Sep 10, 2011 at 9:13 AM, Daniel Friesen lists@nadir-seen-fire.com wrote:
Here's another kick:
- Using JSON in php, when you decode what you encoded you don't always
get the same thing back (serialize you of course do)
Even serialize() doesn't round-trip cleanly in certain cases: https://bugs.php.net/bug.php?id=55495
Roan Kattouw (Catrope)
On 10 September 2011 12:23, Roan Kattouw roan.kattouw@gmail.com wrote:
On Sat, Sep 10, 2011 at 9:13 AM, Daniel Friesen lists@nadir-seen-fire.com wrote:
Here's another kick:
- Using JSON in php, when you decode what you encoded you don't always
get the same thing back (serialize you of course do)
Even serialize() doesn't round-trip cleanly in certain cases: https://bugs.php.net/bug.php?id=55495
Another point came up in IRC discussions: "JSON converts UTF-8 characters to unicode escape sequences. serialize() does not."
This effectively negates any size advantages JSON might have. I noticed JSON is used extensively in resource loader. Roan: did you consider this point when choosing JSON format for storing message resources?
-Niklas
On Sat, Sep 10, 2011 at 8:28 PM, Niklas Laxström niklas.laxstrom@gmail.com wrote:
Another point came up in IRC discussions: "JSON converts UTF-8 characters to unicode escape sequences. serialize() does not."
This effectively negates any size advantages JSON might have. I noticed JSON is used extensively in resource loader. Roan: did you consider this point when choosing JSON format for storing message resources?
No, I didn't know this and didn't think about it. The reason I chose JSON is because the primary use case of the message blobs was to be sent to JavaScript and used for i18n there, so it was convenient to keep it as JSON the whole time and not have to do any decoding or encoding for the primary use case. The only case in which PHP needs to read the JSON data is when a single message is changed in the MediaWiki: namespace, in which case the blob is decoded, the message contents are swapped out, and the blob is re-encoded.
Roan Kattouw (Catrope)
wikitech-l@lists.wikimedia.org