OCG contains a "plaintext" backend which generates quite nice plain-text versions of WP articles. Try clicking "create a book" in the enwiki sidebar, "start book creator", go to some article, click "add this page to your book" in the header then "show book", then change the format in the drop down to "Word processor (plain text)" and click "download".
You can also take the "download as PDF" link, something like https://en.wikipedia.org/w/index.php?title=Special:Book&bookcmd=render_article&arttitle=Jack+Bosden&returnto=Jack+Bosden&oldid=741271566&writer=rdf2latex and replace the 'writer=rdf2latex' part at the end with 'writer=rdf2text', like: https://en.wikipedia.org/w/index.php?title=Special:Book&bookcmd=render_article&arttitle=Jack+Bosden&returnto=Jack+Bosden&oldid=741271566&writer=rdf2text
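The writer swap described above is just a query-string edit, so it can be scripted. A minimal sketch (the URL is the example render link from this thread; the helper name is my own):

```python
# Sketch: rewrite a Special:Book render URL to request plain text
# (writer=rdf2text) instead of the LaTeX/PDF backend (writer=rdf2latex).
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def swap_writer(url, writer="rdf2text"):
    """Return `url` with its 'writer' query parameter replaced."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query["writer"] = writer
    return urlunsplit(parts._replace(query=urlencode(query)))

pdf_url = ("https://en.wikipedia.org/w/index.php?title=Special:Book"
           "&bookcmd=render_article&arttitle=Jack+Bosden&writer=rdf2latex")
print(swap_writer(pdf_url))
```

Note this only rewrites the URL; whether the endpoint still serves the plain-text rendering depends on OCG being deployed on the wiki.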
These tools can be used from the command-line, as described at https://github.com/wikimedia/mediawiki-extensions-Collection-OfflineContentGenerator-text_renderer
I hope that helps! --scott
On Fri, Nov 18, 2016 at 3:15 AM, Reem Al-Kashif reemalkashif@gmail.com wrote:
Hi Scott,
Thank you so much for your reply and offer to help with Parsoid. I used DizzyLogic as an easy parser to strip Wikipedia articles' content of wiki markup; the results were plain text files. I used it to parse the whole English and Arabic Wikipedia dumps back in January. It was easy to use because my coding knowledge is limited. I read the link you kindly provided about Parsoid and I think it can help me with parsing. However, I'm not sure how to start on testing this.
Thank you :)
Best, Reem
On 11 November 2016 at 19:55, C. Scott Ananian cananian@wikimedia.org wrote:
It was removed from that article recently (19 Oct 2016: https://www.mediawiki.org/w/index.php?title=Alternative_parsers&type=revision&diff=2265815&oldid=2247632) with the following comment:
"That link has been dead for over a year now as per this stackoverflow comment: http://stackoverflow.com/questions/13546254/whats-a-fast- way-to-parse-a-wikipedia-xml-dump-for-article-content-and-populate"
If you'd like to explain what you would have used DizzyLogic for, I'd love to help you figure out how to use Parsoid to accomplish your goals. It's an officially-supported WMF parser which has much better correctness than any 'alternative' parser out there, implements a friendly API similar to mwparserfromhell (see https://doc.wikimedia.org/Parsoid/master/#!/guide/jsapi), and has a well-documented AST (https://www.mediawiki.org/wiki/Specs/HTML/1.2.1) which can be directly fetched via the REST API (cf. https://en.wikipedia.org/api/). I believe dumps have also been planned, but I'm not sure what the current status is. --scott
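Fetching that Parsoid-generated HTML from the REST API comes down to building one URL per article. A small sketch (the helper name and the example title are mine; it assumes the `/api/rest_v1/page/html/{title}` endpoint layout, with underscores in the title):

```python
# Sketch: build the Wikimedia REST API URL that returns the
# Parsoid-generated HTML for an article.
from urllib.parse import quote

def parsoid_html_url(title, wiki="en.wikipedia.org"):
    """REST API URL for the Parsoid HTML of `title` (use underscores)."""
    return "https://%s/api/rest_v1/page/html/%s" % (wiki, quote(title, safe=""))

url = parsoid_html_url("Alan_Turing")
# Fetching it is then a single request, e.g.:
#   import urllib.request
#   html = urllib.request.urlopen(url).read().decode("utf-8")
print(url)
```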
On Fri, Nov 11, 2016 at 7:57 AM, Reem Al-Kashif reemalkashif@gmail.com wrote:
Hi Pine,
Thank you for your reply. It is an alternative parser. I believe I first saw it on MediaWiki (here: https://www.mediawiki.org/wiki/Alternative_parsers).
Best, Reem
On 11 November 2016 at 09:47, Pine W wiki.pine@gmail.com wrote:
Was this something on Labs? If so, it might have been purged during one of the Labs cleanups.
Pine
On Tue, Nov 8, 2016 at 2:33 PM, Reem Al-Kashif reemalkashif@gmail.com wrote:
Hi,
I'm just wondering if anybody knows what happened to DizzyLogic wiki parser? The website and program vanished. I used it in January 2016 so I know it was there at this time.
Best, Reem
--
Kind regards,
Reem Al-Kashif
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
--
Kind regards,
Reem Al-Kashif
-- (http://cscott.net)
--
Kind regards,
Reem Al-Kashif
Hi Scott,
Thank you very much. This does the job! I'm wondering if this existed and I missed it back in January, because I remember looking at the book creator back then and there were fewer options (or maybe I simply missed them).
I will probably have to figure out a way to remove the references, external links, and notes sections. Regular expressions could probably help (other ideas/suggestions are welcome), but DizzyLogic had this cool thing where it added #Article at the beginning of each article to mark it. That would be a great feature to consider adding to the book creator.
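The regex idea above can be sketched roughly like this: cut the plain-text article at the first trailing-section heading. The heading names and format are assumptions about what the renderer emits, so the pattern would need adjusting against real output:

```python
# Rough sketch: drop trailing sections (References, External links,
# Notes, ...) from a plain-text article by cutting at the first
# matching heading line.
import re

TRAILING = re.compile(
    r"^(References|External links|Notes|See also)\s*$", re.IGNORECASE)

def strip_trailing_sections(text):
    """Return `text` truncated at the first trailing-section heading."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if TRAILING.match(line.strip()):
            return "\n".join(lines[:i]).rstrip()
    return text

article = "Jack Bosden\n\nSome body text.\n\nReferences\n[1] A source."
print(strip_trailing_sections(article))
```

One caveat: a body line that happens to consist only of the word "Notes" would also trigger the cut, so matching the renderer's exact heading decoration (underlines, numbering) would make this safer.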
Best, Reem
On 18 November 2016 at 17:17, C. Scott Ananian cananian@wikimedia.org wrote:
OCG contains a "plaintext" backend which generates quite nice plain-text versions of WP articles. Try clicking "create a book" in the enwiki sidebar, "start book creator", go to some article, click "add this page to your book" in the header then "show book", then change the format in the drop down to "Word processor (plain text)" and click "download".
You can also take the "download as PDF" link, something like https://en.wikipedia.org/w/index.php?title=Special:Book&bookcmd=render_article&arttitle=Jack+Bosden&returnto=Jack+Bosden&oldid=741271566&writer=rdf2latex and replace the 'writer=rdf2latex' part at the end with 'writer=rdf2text', like: https://en.wikipedia.org/w/index.php?title=Special:Book&bookcmd=render_article&arttitle=Jack+Bosden&returnto=Jack+Bosden&oldid=741271566&writer=rdf2text
These tools can be used from the command-line, as described at https://github.com/wikimedia/mediawiki-extensions-Collection-OfflineContentGenerator-text_renderer
I hope that helps! --scott
On Fri, Nov 18, 2016 at 3:15 AM, Reem Al-Kashif reemalkashif@gmail.com wrote:
Hi Scott,
Thank you so much for your reply and offer to help with Parsoid. I used DizzyLogic as an easy parser to strip Wikipedia articles' content of wiki markup; the results were plain text files. I used it to parse the whole English and Arabic Wikipedia dumps back in January. It was easy to use because my coding knowledge is limited. I read the link you kindly provided about Parsoid and I think it can help me with parsing. However, I'm not sure how to start on testing this.
Thank you :)
Best, Reem
On 11 November 2016 at 19:55, C. Scott Ananian cananian@wikimedia.org wrote:
It was removed from that article recently (19 Oct 2016: https://www.mediawiki.org/w/index.php?title=Alternative_parsers&type=revision&diff=2265815&oldid=2247632) with the following comment:
"That link has been dead for over a year now as per this stackoverflow comment: http://stackoverflow.com/questions/13546254/whats-a-fast-way -to-parse-a-wikipedia-xml-dump-for-article-content-and-populate"
If you'd like to explain what you would have used DizzyLogic for, I'd love to help you figure out how to use Parsoid to accomplish your goals. It's an officially-supported WMF parser which has much better correctness than any 'alternative' parser out there, implements a friendly API similar to mwparserfromhell (see https://doc.wikimedia.org/Parsoid/master/#!/guide/jsapi), and has a well-documented AST (https://www.mediawiki.org/wiki/Specs/HTML/1.2.1) which can be directly fetched via the REST API (cf. https://en.wikipedia.org/api/). I believe dumps have also been planned, but I'm not sure what the current status is. --scott
On Fri, Nov 11, 2016 at 7:57 AM, Reem Al-Kashif reemalkashif@gmail.com wrote:
Hi Pine,
Thank you for your reply. It is an alternative parser. I believe I first saw it on MediaWiki (here: https://www.mediawiki.org/wiki/Alternative_parsers).
Best, Reem
On 11 November 2016 at 09:47, Pine W wiki.pine@gmail.com wrote:
Was this something on Labs? If so, it might have been purged during one of the Labs cleanups.
Pine
On Tue, Nov 8, 2016 at 2:33 PM, Reem Al-Kashif reemalkashif@gmail.com wrote:
Hi,
I'm just wondering if anybody knows what happened to DizzyLogic wiki parser? The website and program vanished. I used it in January 2016 so I know it was there at this time.
Best, Reem
--
Kind regards,
Reem Al-Kashif
--
Kind regards,
Reem Al-Kashif
-- (http://cscott.net)
--
Kind regards,
Reem Al-Kashif
-- (http://cscott.net)