Hello,
I am writing a Java program to extract the abstract of a Wikipedia page
given the title of the page. I have done some research and found
out that the abstract will be in rvsection=0.
So, for example, if I want the abstract of the 'Eiffel Tower' wiki page, I
query the API in the following way:
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Eiffel…
I then parse the XML that is returned and take the wikitext in the tag <rev
xml:space="preserve">, which represents the abstract of the Wikipedia page.
But this wikitext also contains the infobox data, which I do not need. I
would like to know if there is any way to remove the infobox data
and get only the wikitext related to the page's abstract, or if there is an
alternative method by which I can get the abstract of the page directly.
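
A minimal sketch of both routes, offered as an illustration rather than a
definitive solution. The first helper builds a query that asks the
TextExtracts module (prop=extracts with exintro and explaintext) for the
plain-text introduction, assuming that module is enabled on the wiki being
queried. The second removes the templates that open the section 0 wikitext,
assuming the infobox is one of those leading {{...}} blocks. The class and
method names are hypothetical.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class WikiAbstract {

    // Builds a query URL for the plain-text introduction of a page.
    // Assumption: the TextExtracts module (prop=extracts) is available.
    static String extractUrl(String title) {
        return "https://en.wikipedia.org/w/api.php?action=query&prop=extracts"
                + "&exintro=1&explaintext=1&format=xml&titles="
                + URLEncoder.encode(title, StandardCharsets.UTF_8);
    }

    // Drops every top-level {{...}} template that precedes the first prose
    // in the section 0 wikitext. Nested templates are handled by counting
    // brace pairs. Templates that appear later in the text are left alone.
    static String stripLeadingTemplates(String wikitext) {
        int pos = 0;
        while (true) {
            // Skip whitespace between consecutive leading templates.
            while (pos < wikitext.length()
                    && Character.isWhitespace(wikitext.charAt(pos))) {
                pos++;
            }
            if (!wikitext.startsWith("{{", pos)) {
                break; // prose (or something else) starts here
            }
            int depth = 0;
            int i = pos;
            while (i < wikitext.length()) {
                if (wikitext.startsWith("{{", i)) {
                    depth++;
                    i += 2;
                } else if (wikitext.startsWith("}}", i)) {
                    depth--;
                    i += 2;
                    if (depth == 0) {
                        break; // the leading template is closed
                    }
                } else {
                    i++;
                }
            }
            if (depth != 0) {
                break; // unbalanced braces: keep the remainder untouched
            }
            pos = i;
        }
        return wikitext.substring(pos);
    }
}
```

For example, WikiAbstract.extractUrl("Eiffel Tower") yields a URL ending in
titles=Eiffel+Tower. The template stripper does not touch image links such
as [[File:...]] or any other markup, so the result is still wikitext, not
plain text.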
Looking forward to your help.
Thanks in advance,
Aditya Uppu
Hi,
Wikipedia now renders with a new design, so my previous tool, which relied on
obtaining the text just by downloading the page and applying an XPath, has
to be adjusted to it. I have had mixed results, so my questions are:
- Is there a plan to support the old design through some additional
parameters? Even if not forever, it would be useful to me just for
comparison purposes.
- Is there another, better way to get the text? Basically, I do some
guesswork by converting classical tags such as H1/H2 into pseudo
headings, bullet tags into bullet characters, and so on. The
issue with the new design for me is that floating content is now at the
same level as all the other items of the //main[@id='content'] tag, so I
will have to do some filtering to get the main content without the
supplemental information.
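
The filtering described above could look roughly like the following sketch.
It assumes the downloaded page has already been converted to well-formed
XML without a default namespace (the JDK parser used here does not accept
raw HTML), and that the floating blocks can be recognised by a class name.
The class names passed in, and the names MainContentFilter and mainText,
are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class MainContentFilter {

    // Returns the text of the direct children of main[@id='content'],
    // skipping any child that carries one of the given class names.
    // Headings become pseudo headings and list items get a bullet prefix.
    static List<String> mainText(String xhtml, Set<String> skipClasses) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            NodeList children = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//main[@id='content']/*", doc,
                            XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < children.getLength(); i++) {
                Element el = (Element) children.item(i);
                if (hasAnyClass(el, skipClasses)) {
                    continue; // supplemental (floating) content
                }
                String tag = el.getTagName();
                if (tag.matches("h[1-6]")) {
                    out.add("== " + el.getTextContent().trim() + " ==");
                } else if (tag.equals("ul")) {
                    NodeList items = el.getElementsByTagName("li");
                    for (int j = 0; j < items.getLength(); j++) {
                        out.add("* " + items.item(j).getTextContent().trim());
                    }
                } else {
                    out.add(el.getTextContent().trim());
                }
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the page", e);
        }
    }

    // True if the element's class attribute contains any of the names.
    private static boolean hasAnyClass(Element el, Set<String> names) {
        for (String cls : el.getAttribute("class").split("\\s+")) {
            if (names.contains(cls)) {
                return true;
            }
        }
        return false;
    }
}
```

Passing the class names in as a parameter keeps the XPath itself unchanged,
so the same call can be pointed at either design for comparison.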
Thanks
Max