Hello,
I am writing a Java program to extract the abstract of a Wikipedia page given its title. I have done some research and found out that the abstract will be in rvsection=0.
So, for example, if I want the abstract of the "Eiffel Tower" wiki page, I query the API in the following way:
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles...
I then parse the XML data that comes back and take the wikitext in the tag <rev xml:space="preserve">, which represents the abstract of the Wikipedia page. But this wikitext also contains the infobox data, which I do not need. I would like to know if there is any way I can remove the infobox data and get only the wikitext related to the page's abstract, or if there is an alternative method by which I can get the abstract of the page directly.
Looking forward to your help.
Thanks in advance,
Aditya Uppu
2010/1/27 aditya srinivas usaditya86@gmail.com:
[...] But this wikitext also contains the infobox data, which I do not need. I would like to know if there is any way I can remove the infobox data and get only the wikitext related to the page's abstract, or if there is an alternative method by which I can get the abstract of the page directly.
The software doesn't know what the abstract is; it just gives you everything up to the first == Header ==. You can try removing the infobox by stripping out everything between each {{ and its matching }}. Finding the matching }} is the tricky part, because templates can be nested inside one another.
Roan Kattouw (Catrope)
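A minimal Java sketch of the brace-matching idea suggested above, for illustration only. The class and method names (TemplateStripper, stripTemplates) are hypothetical and not part of MediaWiki or any library. It keeps a nesting depth counter so that templates inside the infobox are handled, and it removes every {{ ... }} span, including any inline templates that appear in the lead text itself. It assumes well-formed wikitext and does not try to handle triple-brace parameters ({{{ ... }}}) or <nowiki> sections.

```java
public class TemplateStripper {

    /**
     * Removes every {{ ... }} span from the given wikitext, including nested
     * templates, and returns what is left with surrounding whitespace trimmed.
     */
    public static String stripTemplates(String wikitext) {
        StringBuilder out = new StringBuilder();
        int depth = 0; // how many {{ are currently open
        int i = 0;
        while (i < wikitext.length()) {
            if (wikitext.startsWith("{{", i)) {
                depth++;          // entering a (possibly nested) template
                i += 2;
            } else if (depth > 0 && wikitext.startsWith("}}", i)) {
                depth--;          // leaving the innermost open template
                i += 2;
            } else {
                if (depth == 0) { // keep only text outside all templates
                    out.append(wikitext.charAt(i));
                }
                i++;
            }
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        // Made-up sample input, not the real article text.
        String sample = "{{Infobox building\n| name = Eiffel Tower\n"
                + "| height = {{convert|324|m}}\n}}\n"
                + "The '''Eiffel Tower''' is an iron lattice tower in Paris.";
        System.out.println(stripTemplates(sample));
    }
}
```

The result is still wikitext (bold markers, [[links]], references and so on remain), so further cleanup would be needed to get plain text.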
Hi,
At http://download.wikimedia.org/enwiki/latest/ you can find a file called "enwiki-latest-abstract.xml" (http://download.wikimedia.org/enwiki/latest/enwiki-latest-abstract.xml), which has all the abstracts of the English Wikipedia.
You can find more information about this at http://en.wikipedia.org/wiki/Wikipedia_database
[]'s
Daniel Hasan Dalip
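If the abstract dump route is taken, the file is large, so a streaming parser is probably a better fit than loading it into memory. The sketch below uses Java's standard StAX API to scan for one title. It rests on assumptions that should be checked against the actual file: that each entry has a <title> element followed by an <abstract> element containing only text, and that titles are stored with a "Wikipedia: " prefix. The class and method names (AbstractLookup, findAbstract) are hypothetical.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class AbstractLookup {

    /**
     * Streams through the XML and returns the text of the first <abstract>
     * that follows a <title> equal to wantedTitle, or null if none is found.
     * Assumes (unverified) that <title> precedes <abstract> in each entry.
     */
    public static String findAbstract(Reader xml, String wantedTitle) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(xml);
            String currentTitle = null;
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if (name.equals("title")) {
                        currentTitle = reader.getElementText().trim();
                    } else if (name.equals("abstract")
                            && wantedTitle.equals(currentTitle)) {
                        return reader.getElementText().trim();
                    }
                }
            }
            return null;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse abstract XML", e);
        }
    }

    public static void main(String[] args) {
        // Tiny made-up document in the assumed layout, not real dump content.
        String xml = "<feed><doc><title>Wikipedia: Eiffel Tower</title>"
                + "<abstract>Example abstract text.</abstract></doc></feed>";
        System.out.println(
                findAbstract(new StringReader(xml), "Wikipedia: Eiffel Tower"));
    }
}
```

For the real file, a java.io.FileReader (or a buffered reader over the downloaded file) could be passed in place of the StringReader.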
Mediawiki-api mailing list Mediawiki-api@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/mediawiki-api