I'm planning to parse the page text from wikipedia downloads.
Is there a document of all the supported markups (past and present), or is the PHP code all there is to go off of?
Jeremy Dunck wrote:
I'm planning to parse the page text from wikipedia downloads.
Is there a document of all the supported markups (past and present), or is the PHP code all there is to go off of?
Unfortunately there is no formal, official grammar for the wiki markup. I'm afraid you'll have to work off the code (which isn't always 'right') and various help pages (which aren't always right either ;)
There is a set of parser test cases in maintenance/parserTests.txt (not all of which are passed by the current code).
-- brion vibber (brion @ pobox.com)
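(As a rough illustration of what parserTests.txt looks like: each case is delimited by lines beginning with "!!" -- typically "!! test", "!! input", "!! result" and "!! end" -- though the exact section names have varied between versions. A minimal Python sketch, not the official test harness, for splitting such a file into cases:)

# Split a parserTests.txt-style file into named sections per test case.
# Assumes "!!"-prefixed delimiter lines; ignores articles, options and
# other directives the real file also contains.
def read_parser_tests(path):
    cases, current, section = [], {}, None
    with open(path, encoding='utf-8') as f:
        for line in f:
            if line.startswith('!!'):
                keyword = line[2:].strip().lower()
                if keyword == 'end':
                    if current:
                        cases.append(current)
                    current, section = {}, None
                else:
                    section = keyword
                    current[section] = ''
            elif section is not None:
                current[section] += line
    return cases

for case in read_parser_tests('maintenance/parserTests.txt'):
    print(case.get('test', '').strip())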
On 9/3/05, Brion Vibber brion@pobox.com wrote:
There is a set of parser test cases in maintenance/parserTests.txt (not all of which are passed by the current code).
Good info, thanks. I should have asked: is there already a page text parser in Python? A grep of the source for pywikipediabot didn't turn up anything.
Jeremy Dunck wrote:
Good info, thanks. I should have asked: is there already a page text parser in Python? A grep of the source for pywikipediabot didn't turn up anything.
pywikipediabot works directly with wikitext. I think Kate Turner wrote a simple MediaWiki in Python, but I can't find it now.
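(For illustration only -- this is not pywikipediabot code: working "directly with wikitext" generally means regex extraction over the raw markup, roughly like the snippet below. The pattern is a simplification that ignores nesting and many edge cases.)

import re

# Pull [[wikilink]] targets out of raw wikitext with a single regex.
LINK_RE = re.compile(r'\[\[([^\]|#]+)[^\]]*\]\]')

def extract_links(wikitext):
    return [m.group(1).strip() for m in LINK_RE.finditer(wikitext)]

print(extract_links("See [[Python (programming language)|Python]] and [[Lex]]."))
# -> ['Python (programming language)', 'Lex']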
On 9/3/05, Ashar Voultoiz hashar@altern.org wrote:
Jeremy Dunck wrote:
Good info, thanks. I should have asked: is there already a page text parser in Python? A grep of the source for pywikipediabot didn't turn up anything.
pywikipediabot works directly with wikitext. I think Kate Turner wrote a simple MediaWiki in Python, but I can't find it now.
Thanks for pointing that out. I'm just finding that the parsing logic is spread throughout the library in the form of regexes, rather than sitting in an actual parser that produces a tree.
I'm sure this was an informed choice -- I'll post to their mailing list for information on that choice.
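(A toy sketch of the contrast being drawn here: instead of regexes scattered through the library, a small hand-written parser that turns nested {{template}} markup into an explicit tree. The node shapes are invented for illustration, and the real syntax has far more cases.)

# Build a tree of nested {{...}} template calls from raw wikitext.
def parse_templates(wikitext):
    root = {'type': 'root', 'children': []}
    stack = [root]
    i = 0
    while i < len(wikitext):
        if wikitext.startswith('{{', i):
            node = {'type': 'template', 'children': []}
            stack[-1]['children'].append(node)
            stack.append(node)
            i += 2
        elif wikitext.startswith('}}', i) and len(stack) > 1:
            stack.pop()
            i += 2
        else:
            kids = stack[-1]['children']
            if kids and kids[-1].get('type') == 'text':
                kids[-1]['value'] += wikitext[i]
            else:
                kids.append({'type': 'text', 'value': wikitext[i]})
            i += 1
    return root

tree = parse_templates("{{cite|{{author|Dunck}}}} trailing text")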
On 03/09/05, Jeremy Dunck jdunck@gmail.com wrote:
On 9/3/05, Brion Vibber brion@pobox.com wrote:
There is a set of parser test cases in maintenance/parserTests.txt (not all of which are passed by the current code).
Good info, thanks. I should have asked: is there already a page text parser in Python? A grep of the source for pywikipediabot didn't turn up anything.
You might want to take a look at http://meta.wikimedia.org/wiki/Alternative_parsers - and, of course, please add any others you come across (or write) to that page...
Jeremy Dunck wrote:
Is there a document of all the supported markups (past and present), or is the PHP code all there is to go off of?
Well... kind of. You see, I wrote a parser for wiki syntax in lex/yacc (or rather, flex/bison). This does, in a way, define a formal grammar for the wiki syntax.
However, it does not parse HTML tags yet. This is the main reason it's not in use, and also the main reason I have lost interest in developing it further. Maybe other syntax elements are missing too, but I can't think of any just now.
http://cvs.sourceforge.net/viewcvs.py/wikipedia/flexbisonparse/
Timwi
On 9/6/05, Timwi timwi@gmx.net wrote:
However, it does not parse HTML tags yet. This is the main reason it's not in use, and also the main reason I have lost interest in developing it further. Maybe other syntax elements are missing too, but I can't think of any just now.
Thanks Timwi. Is this because of tag soup? No offense, but what was the challenge?
Jeremy Dunck wrote:
On 9/6/05, Timwi timwi@gmx.net wrote:
However, it does not parse HTML tags yet. This is the main reason it's not in use, and also the main reason I have lost interest in developing it further. Maybe other syntax elements are missing too, but I can't think of any just now.
Thanks Timwi. Is this because of tag soup? No offense, but what was the challenge?
The challenge is getting the parser to match up nested start and end tags correctly without having to add each permissible tag as a separate token to the lexer file. I haven't found a way to do this yet. I did actually start an attempt at adding every tag as an extra lexical token, and maybe it would have worked, but it proved laborious and boring.
Timwi
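(Not the flexbisonparse code, just a Python sketch of the alternative Timwi describes: lex any <tag> or </tag> as a single generic token kind and check the nesting with a stack, rather than declaring one lexical token per permitted tag name.)

import re

# Generic start/end tag token; attributes are skipped, self-closing tags ignored.
TAG_RE = re.compile(r'<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>')

def tags_balanced(text):
    stack = []
    for m in TAG_RE.finditer(text):
        closing, name = m.group(1) == '/', m.group(2).lower()
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False       # stray or mismatched end tag
    return not stack           # unclosed tags also count as failure

print(tags_balanced("<b>bold and <i>italic</i></b>"))  # True
print(tags_balanced("<b>bold and <i>italic</b></i>"))  # False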