Hi folks,
I'm currently working on a research project that requires extracting article information from Wikipedia.
I managed to get pywikibot working on my computer and was able to pull out a few simple results.
One question is about the method pywikibot.pagegenerators.AllpagesPageGenerator.
Setting the argument "content" to True returns a page generator whose pages carry the current version of their text. But which version is returned if the argument is set to False?
Also, is there a way in pywikibot to get a page generator that contains articles/pages up to a certain date?
Maybe pywikibot isn't the right tool for this.
I was thinking of using the wiki dump data instead of the wiki API.
But the files seem huge. I'd appreciate any ideas on how to deal with this.
Thanks a lot!
hz.cmu
As masti says, if you're interested in the content of all pages then using a dump is much more efficient. There are some very useful Python 3 libraries for processing them here: http://pythonhosted.org/mediawiki-utilities/
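To make the dump route concrete, here is a minimal stdlib-only sketch of streaming pages out of a MediaWiki XML dump without loading the whole file into memory. It is an illustration under assumptions, not the mediawiki-utilities API: the export namespace URI varies between dump versions (check your dump's header), and the filename in the usage comment is hypothetical. The mediawiki-utilities libraries handle edge cases (multiple revisions per page, redirects, etc.) that this sketch does not.

```python
import bz2
import io
import xml.etree.ElementTree as ET

# Namespace of the MediaWiki XML export schema; the "0.10" part is an
# assumption and may differ in your dump -- check the <mediawiki> root tag.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def iter_pages(fileobj):
    """Stream (title, text) pairs from a MediaWiki XML dump file object,
    clearing each <page> element after use to keep memory bounded."""
    for _, elem in ET.iterparse(fileobj):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            # Current-revision dumps have one <revision> per <page>.
            text = elem.findtext(NS + "revision/" + NS + "text") or ""
            yield title, text
            elem.clear()  # free the parsed subtree

# Usage sketch (dumps ship bz2-compressed; filename is hypothetical):
# with bz2.open("enwiki-latest-pages-articles.xml.bz2", "rb") as f:
#     for title, text in iter_pages(f):
#         process(title, text)
```

Because `iterparse` reads incrementally and each page element is cleared once processed, this scales to multi-gigabyte dumps, which addresses the "files are huge" concern.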
Also, there's a bunch of researchers who are familiar with this sort of problem to be found over on wiki-research-l: https://lists.wikimedia.org/mailman/listinfo/wiki-research-l (I'm one of them)
Cheers, Morten
On 16 March 2017 at 11:23, masti mastigm@gmail.com wrote:
If you set content to False, page.text will be None and the content will be fetched live once you use it.
Working on the content of all pages will be easier from dumps, since you download them offline, compressed; otherwise you generate even more traffic on the live wiki.
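masti's point about the two modes can be sketched like this. This is an untested sketch assuming a configured user-config.py and the AllpagesPageGenerator signature from the pywikibot documentation of that era; it makes live API requests, so treat it as illustrative:

```python
import pywikibot
from pywikibot import pagegenerators

site = pywikibot.Site("en", "wikipedia")

# content=True: the current revision's text is bulk-fetched up front,
# so page.text is already populated when the page is yielded.
for page in pagegenerators.AllpagesPageGenerator(site=site, total=5,
                                                 content=True):
    print(page.title(), len(page.text))

# content=False: only titles are fetched here; touching page.text later
# triggers a separate live API request, which still returns the
# *current* revision at the moment you access it -- there is no older
# version involved, just lazy fetching.
for page in pagegenerators.AllpagesPageGenerator(site=site, total=5,
                                                 content=False):
    print(page.title())  # no text fetched yet
```

So the answer to the original question is that content=False does not select a different version; it only defers the fetch.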
masti
Thanks a lot for the suggestions, Morten and masti. It's great meeting you both.
I will take a look at the Python libraries.
Haifeng