How to write a parser

Niklas Laxström

20 Jun 2012

1:02 p.m.

No, this is not about a wikitext parser. Rather something much simpler.

Have a look at [1] and you will see rules like:

  n in 0..1
  n is 2
  n mod 10 in 3..4,9 and n mod 100 not in 10..19,70..79,90..99
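To make the semantics concrete, here is how the last of the rules quoted above might look when written out by hand in JavaScript. This is only an illustrative sketch: the function name matchesLastRule and the helper inRange are made up for this example, and it assumes that "in" matches integers only.

```javascript
// Hand-written equivalent of the rule
//   n mod 10 in 3..4,9 and n mod 100 not in 10..19,70..79,90..99
// matchesLastRule and inRange are hypothetical names, not from any library.
function matchesLastRule(n) {
  // Assumption: "in" only matches integer values inside the range.
  const inRange = (x, lo, hi) => Number.isInteger(x) && x >= lo && x <= hi;
  const m10 = n % 10;
  const m100 = n % 100;
  const firstPart = inRange(m10, 3, 4) || inRange(m10, 9, 9);
  const secondPart = !(
    inRange(m100, 10, 19) || inRange(m100, 70, 79) || inRange(m100, 90, 99)
  );
  return firstPart && secondPart;
}
```

A parser for the rule format would have to produce something equivalent to this function from the rule text alone.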

Long ago, when I wanted to compare the plural rules of MediaWiki and CLDR, I wrote a parser for the CLDR rule format. Unfortunately my implementation uses regular expressions and eval, which makes it unsuitable for production. Now, writing parsers is not my area of expertise, so can you please point me to how to do this properly in PHP? Bonus points if it is also easily adaptable to JavaScript.

[1] http://unicode.org/repos/cldr-tmp/trunk/diff/supplemental/language_plural_ru...

-Niklas

-- Niklas Laxström

Domas Mituzas

20 Jun

3:13 p.m.

Well, the syntax is:

condition       = and_condition ('or' and_condition)*
and_condition   = relation ('and' relation)*
relation        = is_relation | in_relation | within_relation | 'n' <EOL>
is_relation     = expr 'is' ('not')? value
in_relation     = expr ('not')? 'in' range_list
within_relation = expr ('not')? 'within' range_list
expr            = 'n' ('mod' value)?
range_list      = (range | value) (',' range_list)*
value           = digit+
digit           = 0|1|2|3|4|5|6|7|8|9
range           = value'..'value
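The terminals of this grammar are few enough that the first step of a hand-written parser, splitting a rule into tokens, fits in a handful of lines. The following JavaScript is a rough sketch under my own assumptions: tokenize and KEYWORDS are hypothetical names, and the token shape ({ type, value }) is just one possible choice. A regular expression is used here only to recognise individual tokens; nothing is passed to eval.

```javascript
// Hypothetical helper: split a plural rule into tokens.
// The keyword list is taken from the grammar above.
const KEYWORDS = new Set(["n", "is", "not", "in", "within", "mod", "and", "or"]);

function tokenize(rule) {
  const tokens = [];
  // Sticky flag: the pattern must match exactly at lastIndex.
  const pattern = /\s*(\d+|\.\.|,|[a-z]+)/y;
  const end = rule.trimEnd().length;
  let pos = 0;
  while (pos < end) {
    pattern.lastIndex = pos;
    const match = pattern.exec(rule);
    if (match === null) {
      throw new SyntaxError(`Unexpected input at offset ${pos}`);
    }
    const text = match[1];
    if (/^\d+$/.test(text)) {
      tokens.push({ type: "number", value: Number(text) });
    } else if (text === ".." || text === "," || KEYWORDS.has(text)) {
      tokens.push({ type: text });
    } else {
      throw new SyntaxError(`Unknown word "${text}"`);
    }
    pos = pattern.lastIndex;
  }
  return tokens;
}
```

For example, tokenize("n is 2") yields three tokens: the keywords "n" and "is", followed by a number token with value 2.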

Would this one work: http://pear.php.net/package/PHP_ParserGenerator ?

Domas

Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Antoine Musso

3:16 p.m.

Have you considered using the `intl` PHP extension? It provides classes that support the plural / number formatting from the CLDR. Out of the box :-)

That is of course going to need a lot of rewriting and rethinking of the translatewiki system, but it would definitely be a huge time saver in the long term.
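For comparison on the JavaScript side, current engines ship a built-in with the same idea: Intl.PluralRules selects a CLDR plural category for a number without any rule parsing in application code. Note that this is the JavaScript counterpart, not the PHP `intl` API mentioned above, and the helper name pluralCategory is made up for this sketch.

```javascript
// Sketch: let the runtime's CLDR data pick the plural category.
// pluralCategory is a hypothetical wrapper around the standard Intl.PluralRules.
function pluralCategory(locale, n) {
  return new Intl.PluralRules(locale).select(n);
}

console.log(pluralCategory("en", 1)); // "one"
console.log(pluralCategory("en", 2)); // "other"
```

The trade-off is the same as with the PHP extension: the rules come from whatever CLDR version the runtime bundles, not from data the application controls.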

-- Antoine "hashar" Musso

Tim Starling

4:08 p.m.

For input which is guaranteed to be small, a recursive descent parser is a reasonable choice -- maybe not the fastest method, but easy to understand and fun to write. There's lots of useful reference material available with a web search, e.g.:

http://teaching.idallen.com/cst8152/98w/recursive_decent_parsing.html
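As a rough illustration of the recursive descent approach applied to the plural rule format, here is a JavaScript sketch with one function per grammar rule. Everything in it is an assumption made for the example: compileRule is a hypothetical name, the result is a predicate function rather than a syntax tree, "in" is treated as matching integers only while "within" matches any value in the range, and the bare 'n' <EOL> alternative of the grammar is not handled.

```javascript
// Sketch of a recursive descent parser for CLDR-style plural rules.
// compileRule is a hypothetical name; it returns a function n => boolean.
function compileRule(rule) {
  // Tokens: numbers, '..', ',', words, or any other single character.
  const tokens = rule.match(/\d+|\.\.|,|[a-z]+|\S/g) || [];
  let pos = 0;

  const accept = (t) => (tokens[pos] === t ? (pos++, true) : false);
  const expect = (t) => {
    if (!accept(t)) throw new SyntaxError(`Expected "${t}", found "${tokens[pos]}"`);
  };

  // value = digit+
  function value() {
    const t = tokens[pos];
    if (t === undefined || !/^\d+$/.test(t)) {
      throw new SyntaxError(`Expected a number, found "${t}"`);
    }
    pos++;
    return Number(t);
  }

  // range_list = (range | value) (',' range_list)*
  function rangeList() {
    const ranges = [];
    do {
      const low = value();
      const high = accept("..") ? value() : low; // a single value is a one-element range
      ranges.push([low, high]);
    } while (accept(","));
    return ranges;
  }

  // expr = 'n' ('mod' value)?
  function expr() {
    expect("n");
    if (accept("mod")) {
      const modulus = value();
      return (n) => n % modulus;
    }
    return (n) => n;
  }

  // relation = is_relation | in_relation | within_relation
  function relation() {
    const operand = expr();
    if (accept("is")) {
      const negate = accept("not");
      const expected = value();
      return (n) => (operand(n) === expected) !== negate;
    }
    const negate = accept("not");
    const integersOnly = accept("in");
    if (!integersOnly) expect("within");
    const ranges = rangeList();
    return (n) => {
      const x = operand(n);
      const hit =
        (!integersOnly || Number.isInteger(x)) &&
        ranges.some(([low, high]) => x >= low && x <= high);
      return hit !== negate;
    };
  }

  // and_condition = relation ('and' relation)*
  function andCondition() {
    let left = relation();
    while (accept("and")) {
      const l = left;
      const r = relation();
      left = (n) => l(n) && r(n);
    }
    return left;
  }

  // condition = and_condition ('or' and_condition)*
  function condition() {
    let left = andCondition();
    while (accept("or")) {
      const l = left;
      const r = andCondition();
      left = (n) => l(n) || r(n);
    }
    return left;
  }

  const predicate = condition();
  if (pos < tokens.length) throw new SyntaxError(`Unexpected "${tokens[pos]}"`);
  return predicate;
}
```

With this sketch, compileRule("n in 0..1") gives a function that returns true for 0 and 1 and false for 2. Since no eval is involved, the same structure should port to PHP fairly directly, using closures or a small tree of arrays in place of the arrow functions.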

-- Tim Starling

Gabriel Wicke

8:20 p.m.

I like the ease of disambiguation in Parsing Expression Grammars (PEG). Most PEG parser generators use memoization to achieve a runtime linear in the input. I have no experience with PEG parser generators for PHP, but am using PEG.js for the Parsoid tokenizer with good results.
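To illustrate the two points above, ordered choice and memoization, here is a small hand-rolled JavaScript sketch. It is not PEG.js and does not use its API; memoize, literal, sequence and choice are hypothetical helpers written for this example, and each parser is assumed to take (input, pos) and return { end, value } or null.

```javascript
// Cache a parser's result per input position, so that backtracking
// never re-parses the same rule at the same place (the packrat idea).
function memoize(parseFn) {
  const cache = new Map();
  let cachedInput = null;
  return (input, pos) => {
    if (input !== cachedInput) { cache.clear(); cachedInput = input; }
    if (!cache.has(pos)) cache.set(pos, parseFn(input, pos));
    return cache.get(pos);
  };
}

const literal = (text) => (input, pos) =>
  input.startsWith(text, pos) ? { end: pos + text.length, value: text } : null;

const sequence = (...parsers) => (input, pos) => {
  const values = [];
  for (const parser of parsers) {
    const result = parser(input, pos);
    if (result === null) return null;
    values.push(result.value);
    pos = result.end;
  }
  return { end: pos, value: values };
};

// Ordered choice: the first alternative that matches wins, so there is
// no ambiguity to resolve between 'range' and 'value'.
const choice = (...parsers) => (input, pos) => {
  for (const parser of parsers) {
    const result = parser(input, pos);
    if (result !== null) return result;
  }
  return null;
};

let valueCalls = 0; // counts how often the uncached rule body actually runs
const value = memoize((input, pos) => {
  valueCalls++;
  const match = /^\d+/.exec(input.slice(pos));
  return match ? { end: pos + match[0].length, value: Number(match[0]) } : null;
});
const range = sequence(value, literal(".."), value);
const rangeOrValue = choice(range, value);
```

When rangeOrValue parses "9", the range alternative first parses a value, fails on the missing "..", and the choice falls back to value at the same position. That second attempt is served from the cache, so the rule body runs only once.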

If you try a PHP PEG generator, then please let us know about your results!

Gabriel

Siebrand Mazeland

8:50 p.m.

A few links for the archive:
* https://en.wikipedia.org/wiki/Comparison_of_parser_generators
* https://github.com/hafriedlander/php-peg (triple licensed, under BSD, MPL and GPL by request)
* http://sourceforge.net/projects/lime-php/ (GPL licensed)

Krinkle

21 Jun

7:13 a.m.

You may already know this, but santhosh is working on a parser[1] in JavaScript (as a Node.js module, to be specific). I added a test suite to his repository. Ready to be expanded and built upon!

-- Krinkle

[1] https://github.com/santhoshtr/CLDRPluralRuleParser

Krinkle

7:20 a.m.

Would be nice if there was an official test suite to use as input for it, so we don't have to maintain the test suite manually.

Also a useful link, the syntax specification: http://unicode.org/reports/tr35/#Language_Plural_Rules

-- Krinkle
