Hi,
Wikipedia now renders with a new design, so I have to adjust my tool, which obtained the text by downloading the page and applying an XPath. I have had mixed results so far, so my questions are:
- Is there a plan to support the old design through some additional parameters? Even if not forever, it would be useful to me for comparison purposes.
- Is there a better way to get the text? Basically I do some guesswork, converting classic tags like H1/H2 into pseudo headings, bullet tags into bullet characters, and so on. The issue with the new design for me is that floating content is now at the same level as all the other items of the //main[@id='content'] tag, so I will have to do some filtering to get the main content without the supplemental information.
Thanks
Max
Hi,
On 2/3/23 01:11, Max Vlasov wrote:
> - Is there a plan to support the old design through some additional parameters? Even if not forever, it would be useful to me for comparison purposes.
You can use ?useskin=vector to fetch pages with the old HTML structure.
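As a rough illustration of the suggestion above, here is a minimal Python sketch that builds such a URL. The helper name `legacy_skin_url` and the use of the `/w/index.php` entry point are assumptions made for this example; appending `?useskin=vector` to a regular `/wiki/Title` URL should work the same way.

```python
from urllib.parse import urlencode


def legacy_skin_url(title, host="en.wikipedia.org", skin="vector"):
    """Build a page URL that asks MediaWiki to render with a given skin.

    Hypothetical helper: only the `useskin` parameter comes from the
    thread; the host and entry point are assumptions for illustration.
    """
    query = urlencode({"title": title.replace(" ", "_"), "useskin": skin})
    return f"https://{host}/w/index.php?{query}"


print(legacy_skin_url("Alan Turing"))
# https://en.wikipedia.org/w/index.php?title=Alan_Turing&useskin=vector
```

The existing XPath-based download step could then be pointed at this URL instead of the default one.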
> - Is there a better way to get the text? Basically I do some guesswork, converting classic tags like H1/H2 into pseudo headings, bullet tags into bullet characters, and so on. The issue with the new design for me is that floating content is now at the same level as all the other items of the //main[@id='content'] tag, so I will have to do some filtering to get the main content without the supplemental information.
Have you looked at Parsoid HTML? It's annotated HTML that makes it pretty straightforward to parse and extract content from wiki pages.
See https://en.wikipedia.org/api/rest_v1/#/Page%20content/get_page_html__title_ for the API and https://www.mediawiki.org/wiki/Specs/HTML for the format.
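To give an idea of what working with Parsoid HTML might look like, here is a minimal Python sketch using only the standard library. It assumes that Parsoid wraps each section in a `<section data-mw-section-id="...">` element, as described in the HTML spec linked above; the names `parsoid_html_url`, `fetch` and `SectionText`, and the User-Agent string, are hypothetical. The parsing is demonstrated on a small inline sample rather than a live request.

```python
from html.parser import HTMLParser
from urllib.parse import quote
from urllib.request import Request, urlopen


def parsoid_html_url(title, host="en.wikipedia.org"):
    """URL of the REST endpoint that returns Parsoid HTML for a title."""
    encoded = quote(title.replace(" ", "_"), safe="")
    return f"https://{host}/api/rest_v1/page/html/{encoded}"


def fetch(url):
    """Download a URL as text (not called in this sketch)."""
    # Placeholder User-Agent; replace with your own tool name and contact.
    req = Request(url, headers={"User-Agent": "example-text-extractor/0.1"})
    with urlopen(req) as resp:
        return resp.read().decode("utf-8")


class SectionText(HTMLParser):
    """Collect text chunks per <section data-mw-section-id="..."> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.sections = {}  # section id -> list of text chunks
        self._stack = []    # ids of the currently open <section> elements

    def handle_starttag(self, tag, attrs):
        if tag == "section":
            sid = dict(attrs).get("data-mw-section-id")
            self._stack.append(sid)
            if sid is not None:
                self.sections.setdefault(sid, [])

    def handle_endtag(self, tag):
        if tag == "section" and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Text is attributed to the innermost open section that has an id.
        text = data.strip()
        if text and self._stack and self._stack[-1] is not None:
            self.sections[self._stack[-1]].append(text)


sample = (
    '<section data-mw-section-id="0"><p>Lead text.</p></section>'
    '<section data-mw-section-id="1"><h2>History</h2><p>Body.</p></section>'
)
parser = SectionText()
parser.feed(sample)
parser.close()
print(parser.sections)
```

For a real page, something like `parser.feed(fetch(parsoid_html_url("Alan Turing")))` would replace the inline sample.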
-- Legoktm
On Thu, Feb 2, 2023 at 10:30 PM Kunal Mehta <legoktm@debian.org> wrote:
> You can use ?useskin=vector to fetch pages with the old HTML structure.
Or use ?action=render to get the content without a skin. If you are willing to update your selectors, you are probably better off with Parsoid though. (There is also a library for traversing Parsoid HTML, https://github.com/wikimedia/parsoid-jsapi , though I'm not sure how well it's maintained.)
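As a rough sketch of how the skinless output could feed the kind of tag-to-text conversion described at the start of the thread, the Python below builds an `?action=render` URL and turns headings and list items into pseudo headings and bullet characters. The names `render_url`, `PlainText` and `html_to_text`, and the choice of `#` marks for headings, are assumptions made for this example, not part of any MediaWiki API. It runs on an inline sample, so no request is made.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


def render_url(title, host="en.wikipedia.org"):
    """URL that returns only the page content, without the skin."""
    query = urlencode({"title": title.replace(" ", "_"), "action": "render"})
    return f"https://{host}/w/index.php?{query}"


class PlainText(HTMLParser):
    """Convert headings, paragraphs and list items to plain-text lines."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []     # finished output lines
        self._buf = []      # text chunks of the line being built
        self._prefix = ""   # "## " for headings, "• " for list items

    def _flush(self):
        # Join buffered chunks, collapse whitespace, emit a non-empty line.
        text = " ".join("".join(self._buf).split())
        if text:
            self.lines.append(self._prefix + text)
        self._buf = []
        self._prefix = ""

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._flush()
            self._prefix = "#" * int(tag[1]) + " "
        elif tag == "li":
            self._flush()
            self._prefix = "• "
        elif tag == "p":
            self._flush()

    def handle_endtag(self, tag):
        if tag in self.HEADINGS or tag in ("li", "p"):
            self._flush()

    def handle_data(self, data):
        self._buf.append(data)


def html_to_text(html):
    parser = PlainText()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text outside a block element
    return "\n".join(parser.lines)


sample = "<h2>History</h2><p>Some <b>bold</b> text.</p><ul><li>One</li><li>Two</li></ul>"
print(html_to_text(sample))
```

Tables, infoboxes and other supplemental blocks would still need their own filtering rules on top of this.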
mediawiki-api@lists.wikimedia.org