Hello all
From one of my assignments as a bot operator I have some code which does template parsing and general text parsing (e.g. Image/File tags). It does not use regexes and is thus able to correctly parse nested templates and other such nasty things. I have written those as library classes and written tests for them which cover almost all of the code. I would now really like to contribute that code back to the community.
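[The brace-matching approach described above can be illustrated with a small hypothetical sketch; this is not the library posted in this thread, only the depth-counting idea that a single regex cannot express:]

```python
def find_template(text, name):
    """Return the (start, end) span of the first {{name ...}} template,
    stepping over nested {{...}} pairs with a depth counter.
    Hypothetical sketch only, not the code uploaded in this thread."""
    start = text.find('{{' + name)
    if start == -1:
        return None
    depth = 0
    i = start
    while i < len(text) - 1:
        if text[i:i + 2] == '{{':
            depth += 1
            i += 2
        elif text[i:i + 2] == '}}':
            depth -= 1
            i += 2
            if depth == 0:  # the outermost pair just closed
                return (start, i)
        else:
            i += 1
    return None  # unbalanced braces

# Example: the span of the outer template includes the nested one intact
text = "a {{Outer | x = {{Inner | y = 1}} }} b"
span = find_template(text, 'Outer')
```

[This is exactly what a plain regex cannot do reliably: which `}}` closes the opening pair depends on how many pairs were opened in between.]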
Would you be interested in adding this code to the pywikibot framework? If yes, can I send the code to someone for code review or how do you usually operate?
Greetings
Hannes
PS: wiki userpage is http://en.wikipedia.org/wiki/User:Hannes_R%C3%B6st
2012/1/23 Hannes Röst hroest_nospam2333@quantentunnel.de
Hello all
From one of my assignments as a bot operator I have some code which does template parsing and general text parsing (e.g. Image/File tags). It does not use regexes and is thus able to correctly parse nested templates and other such nasty things. I have written those as library classes and written tests for them which cover almost all of the code. I would now really like to contribute that code back to the community.
Hannes, this would be GREAT to see in the framework, because recognition of nested templates is one of the weakest points right now. We had a previous talk about this; I will try to dig it up in the archives. Creating test cases for such a toy is also an advanced task.
@Hannes: Yes, please. That would be *awesome*. Please upload your work to a patch tracker item: http://sourceforge.net/tracker/?group_id=93107&atid=603140 , and someone will take a look at it!
On 24 January 2012 00:17, Bináris wikiposta@gmail.com wrote:
Hannes, this would be GREAT to see in the framework, because recognition of nested templates is one of the weakest points right now. We had a previous talk about this; I will try to dig it up in the archives. Creating test cases for such a toy is also an advanced task.
You might have to dig quite far (summer 2006) for that - I have tried to write a wikitext parser, but lacking a formal specification (at that time), and not having any experience with writing parsers, I did not get very far (even though I did have the red dragon book).
http://svn.wikimedia.org/viewvc/pywikipedia/trunk/pywikiparser/
Best, Merlijn
2012-01-24 8:55 GMT+01:00 Merlijn van Deen valhallasw@arctus.nl:
http://svn.wikimedia.org/viewvc/pywikipedia/trunk/pywikiparser/
How to find this in the current mess? It leads to https://phabricator.wikimedia.org/diffusion/
On Fri, 2016-09-02 at 07:28 +0200, Bináris wrote:
2012-01-24 8:55 GMT+01:00 Merlijn van Deen valhallasw@arctus.nl:
http://svn.wikimedia.org/viewvc/pywikipedia/trunk/pywikiparser/
How to find this in the current mess?
See https://phabricator.wikimedia.org/T66848#1195047
andre
2016-09-02 10:34 GMT+02:00 Andre Klapper aklapper@wikimedia.org:
See https://phabricator.wikimedia.org/T66848#1195047
Thank you!
2012-01-24 8:55 GMT+01:00 Merlijn van Deen valhallasw@arctus.nl:
Please upload your work to a patch tracker item: http://sourceforge.net/tracker/?group_id=93107&atid=603140 , and someone will take a look at it!
This leads to https://sourceforge.net/p/pywikipediabot/patches/ where I can read:
*NOTE: Our bug tracker has been moved. Please submit your patches at bugzilla:* https://bugzilla.wikimedia.org/enter_bug.cgi?product=Pywikibot
Needs some update from someone who has the rights on SourceForge. People may well still find Pywiki there and we don't want them to reach the main page of Phabricator.
On Fri, 2016-09-02 at 07:34 +0200, Bináris wrote:
This leads to https://sourceforge.net/p/pywikipediabot/patches/ where I can read: NOTE: Our bug tracker has been moved. Please submit your patches at bugzilla: https://bugzilla.wikimedia.org/enter_bug.cgi?product=Pywikibot Needs some update from someone who has the rights on SourceForge. People may well still find Pywiki there and we don't want them to reach the main page of Phabricator.
Whoever can edit the SourceForge page could directly link to https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=pywik... if that was wanted. I don't see a big issue with reaching the main page of Phabricator though.
andre
Hello Hannes
Just wondering; is your text parser able to correctly find all headings (e.g. '== bla ==' as well as '<h2>bla</h2>') and distinguish headings from other similar text but within a paragraph? And finally return the byte offset of those headings?
I am using such a piece of code, written with the help of difflib; maybe it is useful here also? (even though I did not have that much time to write a unit test with full coverage... but a simple one is there ;)
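[A minimal sketch of the heading detection DrTrigon asks about; hypothetical code, not his difflib-based implementation. It reports character offsets, handles only the two forms mentioned above, and relies on the fact that a wikitext heading must be a whole line, which is what distinguishes it from similar text inside a paragraph:]

```python
def find_headings(text):
    """Yield (offset, title) for '== ... ==' wikitext headings and for
    <h2>...</h2> lines. Sketch only: a real parser has many more corner
    cases (levels 1-6, trailing comments, headings inside <pre>, ...)."""
    offset = 0
    for line in text.splitlines(True):  # keep line endings to track offsets
        stripped = line.rstrip('\n').strip()
        # Wikitext heading: the entire line is '== title ==', so the same
        # characters in the middle of a paragraph are not matched.
        if stripped.startswith('==') and stripped.endswith('==') and len(stripped) > 4:
            yield (offset + line.index('=='), stripped.strip('= ').strip())
        # HTML heading tag on a line of its own
        elif stripped.lower().startswith('<h2>') and stripped.lower().endswith('</h2>'):
            yield (offset + line.index('<'), stripped[4:-5].strip())
        offset += len(line)
```

[With byte offsets in the strict sense one would scan the encoded page text instead; the structure of the scan stays the same.]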
Greetings DrTrigon
On 23.01.2012 23:34, Hannes Röst wrote:
Hello all
From one of my assignments as a bot operator I have some code which does template parsing and general text parsing (e.g. Image/File tags). It does not use regexes and is thus able to correctly parse nested templates and other such nasty things. I have written those as library classes and written tests for them which cover almost all of the code. I would now really like to contribute that code back to the community.
Would you be interested in adding this code to the pywikibot framework? If yes, can I send the code to someone for code review or how do you usually operate?
Greetings
Hannes
PS: wiki userpage is http://en.wikipedia.org/wiki/User:Hannes_R%C3%B6st
_______________________________________________ Pywikipedia-l mailing list Pywikipedia-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l
Hi Merlijn
Great, I will do so tonight. I have to say that it is *not* an attempt to write a complete parser for wikitext but rather a solution to a very limited problem which I encountered. This means that I can find templates and parse them into key-value pairs, and there is also some code that can parse Image/File tags. However, it is not a complete parser; for example, it does not parse headings as DrTrigon asked, it is mostly doing templates at the moment. Also there is currently no support for unnamed parameters.
However it might be a starting point for further work. I also did not find formal specifications for wikitext so it was a lot of learning by doing. However I used it successfully on ~4k "Infobox Chemie" templates in the de-wiki.
Hannes
On 24 January 2012 09:55, Dr. Trigon dr.trigon@surfeu.ch wrote:
Hello Hannes
Just wondering; is your text parser able to correctly find all headings (e.g. '== bla ==' as well as '<h2>bla</h2>') and distinguish headings from other similar text but within a paragraph? And finally return the byte offset of those headings?
I am using such a piece of code, written with the help of difflib; maybe it is useful here also? (even though I did not have that much time to write a unit test with full coverage... but a simple one is there ;)
Greetings DrTrigon
On 23.01.2012 23:34, Hannes Röst wrote:
Hello all
From one of my assignments as a bot operator I have some code which does template parsing and general text parsing (e.g. Image/File tags). It does not use regexes and is thus able to correctly parse nested templates and other such nasty things. I have written those as library classes and written tests for them which cover almost all of the code. I would now really like to contribute that code back to the community.
Would you be interested in adding this code to the pywikibot framework? If yes, can I send the code to someone for code review or how do you usually operate?
Greetings
Hannes
PS: wiki userpage is http://en.wikipedia.org/wiki/User:Hannes_R%C3%B6st
Hi Hannes,
On 24 January 2012 10:27, Hannes Röst hroest_nospam2333@quantentunnel.de wrote:
a some very limited problem which I encountered. This means that I can find templates and parse them into key-value pairs and there is also some code that can parse Image/File tags.
I see. There is already code to do this (Page.getTemplatesWithParams), so it would be interesting to run your test suite on that, too. In any case, I prefer any solution without regexps over one that does use regexps, so I'm interested to see your work.
I also did not find formal specifications for wikitext so it was a lot of learning by doing.
There has been a lot of work on that in the last year or so. See, for instance http://www.mediawiki.org/wiki/Future/AST/Sweble and http://sweble.org/crystalball/ .
Best, Merlijn
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
However it might be a starting point for further work. I also did not find formal specifications for wikitext so it was a lot of learning by doing. However I used it successfully on ~4k "Infobox Chemie" templates in the de-wiki.
As far as I can see there is no such specification. We all know how Wikipedia handles text markup and what format we have to use (e.g. to create a heading and so on...) IF we use correct syntax...
The problem is: what happens IF you use NON-VALID wiki syntax on a page? The MediaWiki software will then do "something" to get (at least) a valid HTML page, but what fall-backs are used? What is the priority when parsing, and so on... In my opinion this is the main issue here, since "our" wikitext parser should behave similarly on wrong wiki syntax too... (quite a messy thing, as I experienced... obviously I am not a parser expert either ;)
This is why I did not write a parser, just my tiny (holy) 'getSections' method.
Greetings DrTrigon
Dear all
So I uploaded the code here: https://sourceforge.net/tracker/?func=detail&aid=3479070&group_id=93...
The following test probably describes best what the code is doing: working on a nested template, it is possible to retrieve the inner as well as the outer template as a dictionary of key-value pairs:
def test_nested_template(self):
    nested_template = u"""
Cras suscipit lorem eget elit pulvinar et molestie magna tempus. Vestibulum.
{{Toplevel template
 | key1 = value1
 | key2 = [[File:Al3+.svg|40px|Aluminiumion]] {{nested template 
 | nested_key1 = nested_value1
 | nested_key2 = nested_value2
 }}
 | key3 = value3
}}
and more text
"""
    # First fetch the outer template and assert that we get key1 through 3
    template = templateparser.parse_template(nested_template, 'Toplevel template')
    expected = u'[[File:Al3+.svg|40px|Aluminiumion]] {{nested template \n | nested_key1 = nested_value1\n | nested_key2 = nested_value2\n }} '
    self.assertEqual(len(template.parameters.keys()), 3)
    self.assertEqual(template.parameters['key1'], 'value1')
    self.assertEqual(template.parameters['key3'], 'value3')
    self.assertEqual(template.parameters['key2'], expected)
    self.assertEqual(template.start, 111)
    self.assertEqual(template.end, 401)
    self.assertFalse(template.parameters.has_key('nested_key1'))

    # Now fetch the inner (nested) template and assert that we get nested_key 1 and 2
    template = templateparser.parse_template(nested_template, 'nested template')
    self.assertEqual(len(template.parameters.keys()), 2)
    self.assertEqual(template.parameters['nested_key1'], 'nested_value1')
    self.assertEqual(template.parameters['nested_key2'], 'nested_value2')
    self.assertEqual(template.start, 239)
    self.assertEqual(template.end, 350)
Greetings
Hannes
On 24 January 2012 12:49, Dr. Trigon dr.trigon@surfeu.ch wrote:
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
However it might be a starting point for further work. I also did not find formal specifications for wikitext so it was a lot of learning by doing. However I used it successfully on ~4k "Infobox Chemie" templates in the de-wiki.
As far as I can see there is no such specification. We all know how Wikipedia handles text markup and what format we have to use (e.g. to create a heading and so on...) IF we use correct syntax...
The problem is: what happens IF you use NON-VALID wiki syntax on a page? The MediaWiki software will then do "something" to get (at least) a valid HTML page, but what fall-backs are used? What is the priority when parsing, and so on... In my opinion this is the main issue here, since "our" wikitext parser should behave similarly on wrong wiki syntax too... (quite a messy thing, as I experienced... obviously I am not a parser expert either ;)
This is why I did not write a parser, just my tiny (holy) 'getSections' method.
Greetings DrTrigon
I have found the previous talk about this: http://sourceforge.net/tracker/index.php?func=detail&aid=3158761&gro...