Hello all
From one of my assignments as a bot operator I have some code which does template parsing and general text parsing (e.g. Image/File tags). It does not use regexes and is thus able to correctly parse nested templates and other such nasty things. I have written those as library classes and written tests for them which cover almost all of the code. I would now really like to contribute that code back to the community.
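[The brace-matching approach described above can be illustrated with a small hypothetical sketch; this is not the library posted in this thread, only the depth-counting idea that a single regex cannot express:]

```python
def find_template(text, name):
    """Return the (start, end) span of the first {{name ...}} template,
    stepping over nested {{...}} pairs with a depth counter.
    Hypothetical sketch only, not the code uploaded in this thread."""
    start = text.find('{{' + name)
    if start == -1:
        return None
    depth = 0
    i = start
    while i < len(text) - 1:
        if text[i:i + 2] == '{{':
            depth += 1
            i += 2
        elif text[i:i + 2] == '}}':
            depth -= 1
            i += 2
            if depth == 0:  # the outermost pair just closed
                return (start, i)
        else:
            i += 1
    return None  # unbalanced braces

# Example: the span of the outer template includes the nested one intact
text = "a {{Outer | x = {{Inner | y = 1}} }} b"
span = find_template(text, 'Outer')
```

[This is exactly what a plain regex cannot do reliably: which `}}` closes the opening pair depends on how many pairs were opened in between.]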
Would you be interested in adding this code to the pywikibot framework? If yes, can I send the code to someone for code review or how do you usually operate?
Greetings
Hannes
PS: wiki userpage is http://en.wikipedia.org/wiki/User:Hannes_R%C3%B6st
2012/1/23 Hannes Röst hroest_nospam2333@quantentunnel.de
Hello all
From one of my assignments as a bot operator I have some code which does template parsing and general text parsing (e.g. Image/File tags). It does not use regexes and is thus able to correctly parse nested templates and other such nasty things. I have written those as library classes and written tests for them which cover almost all of the code. I would now really like to contribute that code back to the community.
Hannes, this would be GREAT to see in the framework, because recognition of nested templates is one of the weakest points right now. We had a previous talk about this; I will try to dig it up in the archives. Creating test cases for such a toy is also an advanced task.
@Hannes: Yes, please. That would be *awesome*. Please upload your work to a patch tracker item: http://sourceforge.net/tracker/?group_id=93107&atid=603140 , and someone will take a look at it!
On 24 January 2012 00:17, Bináris wikiposta@gmail.com wrote:
Hannes, this would be GREAT to see in the framework, because recognition of nested templates is one of the weakest points right now. We had a previous talk about this; I will try to dig it up in the archives. Creating test cases for such a toy is also an advanced task.
You might have to dig quite far (summer 2006) for that - I have tried to write a wikitext parser, but lacking a formal specification (at that time), and not having any experience with writing parsers, I did not get very far (even though I did have the red dragon book).
http://svn.wikimedia.org/viewvc/pywikipedia/trunk/pywikiparser/
Best, Merlijn
2012-01-24 8:55 GMT+01:00 Merlijn van Deen valhallasw@arctus.nl:
http://svn.wikimedia.org/viewvc/pywikipedia/trunk/pywikiparser/
How to find this in the current mess? It leads to https://phabricator.wikimedia.org/diffusion/
On Fri, 2016-09-02 at 07:28 +0200, Bináris wrote:
2012-01-24 8:55 GMT+01:00 Merlijn van Deen valhallasw@arctus.nl:
http://svn.wikimedia.org/viewvc/pywikipedia/trunk/pywikiparser/
How to find this in the current mess?
See https://phabricator.wikimedia.org/T66848#1195047
andre
2016-09-02 10:34 GMT+02:00 Andre Klapper aklapper@wikimedia.org:
See https://phabricator.wikimedia.org/T66848#1195047
Thank you!
2012-01-24 8:55 GMT+01:00 Merlijn van Deen valhallasw@arctus.nl:
Please upload your work to a patch tracker item: http://sourceforge.net/tracker/?group_id=93107&atid=603140 , and someone will take a look at it!
This leads to https://sourceforge.net/p/pywikipediabot/patches/ where I can read:
*NOTE: Our bug tracker has been moved. Please submit your patches at bugzilla:* https://bugzilla.wikimedia.org/enter_bug.cgi?product=Pywikibot
Needs some update from someone who has the rights on SourceForge. People may well still find Pywiki there and we don't want them to reach the main page of Phabricator.
On Fri, 2016-09-02 at 07:34 +0200, Bináris wrote:
This leads to https://sourceforge.net/p/pywikipediabot/patches/ where I can read: NOTE: Our bug tracker has been moved. Please submit your patches at bugzilla: https://bugzilla.wikimedia.org/enter_bug.cgi?product=Pywikibot Needs some update from someone who has the rights on SourceForge. People may well still find Pywiki there and we don't want them to reach the main page of Phabricator.
Whoever can edit the SourceForge page could directly link to https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=pywik... if that was wanted. I don't see a big issue with reaching the main page of Phabricator though.
andre
Hello Hannes
Just wondering; is your text parser able to correctly find all headings (e.g. '== bla ==' as well as '<h2>bla</h2>') and distinguish headings from other similar text but within a paragraph? And finally return the byte offset of those headings?
I am using such a piece of code, written with the help of difflib; maybe it is useful here also? (even though I did not have that much time to write a unit test with full coverage... but a simple one is there ;)
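[A minimal sketch of the heading detection DrTrigon asks about; hypothetical code, not his difflib-based implementation. It reports character offsets, handles only the two forms mentioned above, and relies on the fact that a wikitext heading must be a whole line, which is what distinguishes it from similar text inside a paragraph:]

```python
def find_headings(text):
    """Yield (offset, title) for '== ... ==' wikitext headings and for
    <h2>...</h2> lines. Sketch only: a real parser has many more corner
    cases (levels 1-6, trailing comments, headings inside <pre>, ...)."""
    offset = 0
    for line in text.splitlines(True):  # keep line endings to track offsets
        stripped = line.rstrip('\n').strip()
        # Wikitext heading: the entire line is '== title ==', so the same
        # characters in the middle of a paragraph are not matched.
        if stripped.startswith('==') and stripped.endswith('==') and len(stripped) > 4:
            yield (offset + line.index('=='), stripped.strip('= ').strip())
        # HTML heading tag on a line of its own
        elif stripped.lower().startswith('<h2>') and stripped.lower().endswith('</h2>'):
            yield (offset + line.index('<'), stripped[4:-5].strip())
        offset += len(line)
```

[With byte offsets in the strict sense one would scan the encoded page text instead; the structure of the scan stays the same.]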
Greetings DrTrigon
On 23.01.2012 23:34, Hannes Röst wrote:
Hello all
From one of my assignments as a bot operator I have some code which does template parsing and general text parsing (e.g. Image/File tags). It does not use regexes and is thus able to correctly parse nested templates and other such nasty things. I have written those as library classes and written tests for them which cover almost all of the code. I would now really like to contribute that code back to the community.
Would you be interested in adding this code to the pywikibot framework? If yes, can I send the code to someone for code review or how do you usually operate?
Greetings
Hannes
PS: wiki userpage is http://en.wikipedia.org/wiki/User:Hannes_R%C3%B6st
_______________________________________________ Pywikipedia-l mailing list Pywikipedia-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l
Hi Merlijn
Great, I will do so tonight. I have to say that it is *not* an attempt to write a complete parser for wikitext but rather a solution to a very limited problem which I encountered. This means that I can find templates and parse them into key-value pairs, and there is also some code that can parse Image/File tags. However, it is not a complete parser; for example, it does not parse headings as DrTrigon asked, it is mostly doing templates at the moment. Also there is currently no support for unnamed parameters.
However it might be a starting point for further work. I also did not find formal specifications for wikitext so it was a lot of learning by doing. However I used it successfully on ~4k "Infobox Chemie" templates in the de-wiki.
Hannes
On 24 January 2012 09:55, Dr. Trigon dr.trigon@surfeu.ch wrote:
Hello Hannes
Just wondering; is your text parser able to correctly find all headings (e.g. '== bla ==' as well as '<h2>bla</h2>') and distinguish headings from other similar text but within a paragraph? And finally return the byte offset of those headings?
I am using such a piece of code, written with the help of difflib; maybe it is useful here also? (even though I did not have that much time to write a unit test with full coverage... but a simple one is there ;)
Greetings DrTrigon
On 23.01.2012 23:34, Hannes Röst wrote:
Hello all
From one of my assignments as a bot operator I have some code which does template parsing and general text parsing (e.g. Image/File tags). It does not use regexes and is thus able to correctly parse nested templates and other such nasty things. I have written those as library classes and written tests for them which cover almost all of the code. I would now really like to contribute that code back to the community.
Would you be interested in adding this code to the pywikibot framework? If yes, can I send the code to someone for code review or how do you usually operate?
Greetings
Hannes
PS: wiki userpage is http://en.wikipedia.org/wiki/User:Hannes_R%C3%B6st
Hi Hannes,
On 24 January 2012 10:27, Hannes Röst hroest_nospam2333@quantentunnel.de wrote:
a some very limited problem which I encountered. This means that I can find templates and parse them into key-value pairs and there is also some code that can parse Image/File tags.
I see. There is already code to do this (Page.getTemplatesWithParams), so it would be interesting to run your test suite on that, too. In any case, I prefer any solution without regexps over one that does use regexps, so I'm interested to see your work.
I also did not find formal specifications for wikitext so it was a lot of learning by doing.
There has been a lot of work on that in the last year or so. See, for instance http://www.mediawiki.org/wiki/Future/AST/Sweble and http://sweble.org/crystalball/ .
Best, Merlijn
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
However it might be a starting point for further work. I also did not find formal specifications for wikitext so it was a lot of learning by doing. However I used it successfully on ~4k "Infobox Chemie" templates in the de-wiki.
As far as I can see there is no such specification. We all know how Wikipedia handles text markup and what format we have to use (e.g. to create a heading and so on...) IF we use correct syntax...
The problem is: what happens IF you use NON-VALID wiki syntax on a page? The MediaWiki software will then do "something" to get (at least) a valid HTML page, but what fall-backs are used? What is the priority when parsing, and so on... In my opinion this is the main issue here, since "our" wikitext parser should behave similarly on wrong wiki syntax too... (quite a messy thing, as I experienced... obviously I am not a parser expert either ;)
This is why I did not write a parser, just my tiny (holy) 'getSections' method.
Greetings DrTrigon
Dear all
So I uploaded the code here: https://sourceforge.net/tracker/?func=detail&aid=3479070&group_id=93...
The following test probably describes best what the code is doing: working on a nested template, it is possible to retrieve the inner as well as the outer template as a dictionary of key-value pairs:
def test_nested_template(self):
    nested_template = u"""
Cras suscipit lorem eget elit pulvinar et molestie magna tempus. Vestibulum.
{{Toplevel template
 | key1 = value1
 | key2 = [[File:Al3+.svg|40px|Aluminiumion]] {{nested template 
 | nested_key1 = nested_value1
 | nested_key2 = nested_value2
 }}
 | key3 = value3
}}
and more text
"""
    # First fetch the outer template and assert that we get key1 through 3
    template = templateparser.parse_template(nested_template, 'Toplevel template')
    expected = u'[[File:Al3+.svg|40px|Aluminiumion]] {{nested template \n | nested_key1 = nested_value1\n | nested_key2 = nested_value2\n }} '
    self.assertEqual(len(template.parameters.keys()), 3)
    self.assertEqual(template.parameters['key1'], 'value1')
    self.assertEqual(template.parameters['key3'], 'value3')
    self.assertEqual(template.parameters['key2'], expected)
    self.assertEqual(template.start, 111)
    self.assertEqual(template.end, 401)
    self.assertFalse(template.parameters.has_key('nested_key1'))

    # Now fetch the inner (nested) template and assert that we get nested_key 1 and 2
    template = templateparser.parse_template(nested_template, 'nested template')
    self.assertEqual(len(template.parameters.keys()), 2)
    self.assertEqual(template.parameters['nested_key1'], 'nested_value1')
    self.assertEqual(template.parameters['nested_key2'], 'nested_value2')
    self.assertEqual(template.start, 239)
    self.assertEqual(template.end, 350)
Greetings
Hannes
On 24 January 2012 12:49, Dr. Trigon dr.trigon@surfeu.ch wrote:
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
However it might be a starting point for further work. I also did not find formal specifications for wikitext so it was a lot of learning by doing. However I used it successfully on ~4k "Infobox Chemie" templates in the de-wiki.
As far as I can see there is no such specification. We all know how Wikipedia handles text markup and what format we have to use (e.g. to create a heading and so on...) IF we use correct syntax...
The problem is: what happens IF you use NON-VALID wiki syntax on a page? The MediaWiki software will then do "something" to get (at least) a valid HTML page, but what fall-backs are used? What is the priority when parsing, and so on... In my opinion this is the main issue here, since "our" wikitext parser should behave similarly on wrong wiki syntax too... (quite a messy thing, as I experienced... obviously I am not a parser expert either ;)
This is why I did not write a parser, just my tiny (holy) 'getSections' method.
Greetings DrTrigon
I have found the previous talk about this: http://sourceforge.net/tracker/index.php?func=detail&aid=3158761&gro...