Hi all, I had mentioned this in the rewrite roadmap, and noticed it came up on IRC as well, so I'd like to run this by the mailing list:
User:The Earwig has written a pure-Python (with optional C speedups) MediaWiki text parser named mwparserfromhell[1]. Currently we have the textlib library and various regexes that do this imperfectly. From my experience using mwparserfromhell (over 400k successful edits with no issues), I believe it is ready to be bundled with the framework. I think it would still be a good idea to keep textlib as a fallback, and for users who are currently using it and don't need to migrate.
As for actually adding it, in the rewrite branch we can just add it as a dependency in setup.py, and then convert various methods over. In trunk, I'm guessing we would need to add it as an external. (I'm not sure how that's actually done.)
[1] https://github.com/earwig/mwparserfromhell
-- Legoktm
Sounds interesting. What is the news for trunk users?
Yes, that'd be nice.
Hazard-SJ
________________________________
From: legoktm <legoktm.wikipedia@gmail.com>
To: WikiMedia Mailing Lists <pywikipedia-l@lists.wikimedia.org>
Sent: Wednesday, April 24, 2013 6:11 AM
Subject: [Pywikipedia-l] Using a true MediaWiki parser (mwparserfromhell) instead of textlib methods
_______________________________________________ Pywikipedia-l mailing list Pywikipedia-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l
I've gone ahead and implemented this in rewrite with r11737 (https://www.mediawiki.org/wiki/Special:Code/pywikipedia/11737) and r11738. It's opt-in, so if you want to use it you need to install the mwparserfromhell library (https://github.com/earwig/mwparserfromhell) yourself and set "use_mwparserfromhell = True" in your user-config.py. Eventually I'd like to add mwpfh as a requirement in setup.py and make it opt-out.
I haven't looked much into how it would be implemented in trunk, but I assume it will require something similar, which I'll do once it's fully working in rewrite. If you notice any weird bugs, please let me know so I can fix them in pywikibot or report them upstream :D
Thanks,
-- Legoktm
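The opt-in behaviour described above could look roughly like the following sketch. Only the flag name use_mwparserfromhell comes from the message; the helper template_names and the regex fallback are hypothetical illustrations written for Python 3, not the code committed in r11737.

```python
import re

try:
    import mwparserfromhell  # optional dependency, installed by the user
except ImportError:
    mwparserfromhell = None

# Mirrors the user-config.py setting named in the message above.
use_mwparserfromhell = True

# Hypothetical, deliberately naive fallback: it only copes with
# non-nested templates, which is the kind of limitation a real parser avoids.
_TEMPLATE_RE = re.compile(r"\{\{\s*([^|{}]+?)\s*(?:\|[^{}]*)?\}\}")


def template_names(text, use_parser=use_mwparserfromhell):
    """Return the names of the templates found in ``text``."""
    if use_parser and mwparserfromhell is not None:
        # Parser path: walk the parsed tree and read each template's name.
        code = mwparserfromhell.parse(text)
        return [str(t.name).strip() for t in code.filter_templates()]
    # Fallback path: used when the flag is off or the library is missing.
    return _TEMPLATE_RE.findall(text)


print(template_names("{{foo|bar}} text {{baz}}"))
```

Guarding the import this way means a missing library silently falls back to the old behaviour, which matches the idea of keeping textlib available for users who do not migrate.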
On 24.04.2013 13:11, legoktm wrote:
As for actually adding it, in the rewrite branch we can just add it as a dependency in setup.py, and then convert various methods over. In trunk, I'm guessing we would need to add it as an external. (I'm not sure how that's actually done.)
I would be happy to support you with this in trunk (and rewrite). The system I implemented in trunk (and I am currently trying to promote... ;) could easily take the link [1] and make sure that this dependency gets installed automatically at the first bot run.
[1] https://github.com/earwig/mwparserfromhell/archive/develop.zip
Greetings DrTrigon
[1] https://github.com/earwig/mwparserfromhell/archive/develop.zip
At this stage, it might be best to use stable releases [1] instead of the development version, so you'd want [2] (or [3], which will always point to the latest version).
Also, as a note, the parser only supports Python 2.7 and 3, not Python 2.6. Is it worth extending compatibility to 2.6?
[1] https://github.com/earwig/mwparserfromhell/releases
[2] https://github.com/earwig/mwparserfromhell/archive/v0.2.zip
[3] https://github.com/earwig/mwparserfromhell/archive/master.zip
Earwig
Yes of course! Makes a lot more sense to use a fixed stable revision.
I implemented this in trunk in r11744 [1]. The version used is v0.2.zip.
[1] http://www.mediawiki.org/wiki/Special:Code/pywikipedia/11744
Now the following works:
import wikipedia
import externals
(here you will get asked whether you want to download 'mwparserfromhell' once)
import mwparserfromhell
it can be easily tested with:
python -c 'import wikipedia;import externals;import mwparserfromhell'
Greetings and all the best! DrTrigon
As I am known to be a fool (sometimes ;) I have to correct myself here:
import wikipedia
import externals
externals.check_setup('mwparserfromhell')  # here you will be asked once whether you want to download 'mwparserfromhell'
import mwparserfromhell
it can be easily tested with:
$ python -c 'import wikipedia;import externals;externals.check_setup("mwparserfromhell");import mwparserfromhell'
Sorry for the inconvenience and enjoy! Greetings DrTrigon
One important question for me here is: how does it handle malform(at)ed wiki syntax, e.g. a text body like this:
### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ###
=== bad header 1 ==
A text containing mathematical equations like a = b + c or even something that could look like a header, like a == b == c but is anything BUT a header.
== bad header 2 ===
Some lorem ipsum bla bla ...
### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ###
I am desperately seeking a parser that has the same error behaviour and gives the same results as the original MediaWiki parser, even for malform(at)ed wiki text.
Is this the case for mwparserfromhell??
Thanks and Greetings DrTrigon
On Jul 14, 2013, at 6:48 AM, Dr. Trigon wrote:
One important question for me here is: how does it handle malform(at)ed wiki syntax [...]
I am desperately seeking a parser that has the same error behaviour and gives the same results as the original MediaWiki parser, even for malform(at)ed wiki text.
Is this the case for mwparserfromhell??
Why don't you test it? One of my main intentions designing the parser was to have it handle malformed wikitext faithfully. I think I've done a pretty good job, and the example you gave is parsed correctly. I mean, that's partly the reason why we're using a non-regex-based parser in the first place: to avoid regex's limitations. Do point out situations where it makes mistakes, though, so they can be fixed.
Earwig
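For readers who want to try the example from the question, here is a minimal reference sketch of what "the same result as MediaWiki" is assumed to mean for those headings. It approximates MediaWiki's line-based heading rule as commonly described (a line is a level-n heading if it starts and ends with n equals signs, with the deepest level tried first, so surplus equals signs stay in the title). The function heading_of is hypothetical; it is not taken from mwparserfromhell or from this thread.

```python
import re


def heading_of(line):
    """Return (level, title) if ``line`` is a heading under the assumed rule, else None."""
    # Try the deepest level first. The first level whose marker fits on
    # both ends of the line wins; extra "=" signs become part of the title.
    for level in range(6, 0, -1):
        marker = "=" * level
        match = re.match(r"^%s[ \t]*(.+?)[ \t]*%s\s*$" % (marker, marker), line)
        if match:
            return level, match.group(1)
    return None


sample = [
    "=== bad header 1 ==",
    "A text with a == b == c in it.",
    "== bad header 2 ===",
]
for line in sample:
    print(repr(line), "->", heading_of(line))
```

Under this assumed rule both unbalanced headers come out as level 2, and the line containing a == b == c is not a heading because it does not start with an equals sign. With mwparserfromhell installed, the level and title of each node returned by mwparserfromhell.parse(text).filter_headings() could be compared against this.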