Hello,
I want to add Esperanto words to fr.wiktionary, using as input a file where each line has the format "word:the fine definition". So I copied basic.py and started hacking on it to achieve my goal.
Now, it seems that the -file argument expects a file where each line is formatted as "[[Article name]]". Of course I can just create a second input file and read both in parallel, so that I feed the genFactory with the first and use the second to build the wiktionary entry. But maybe you could give me a hint on how I can write a generator that can feed a pagegenerators.GeneratorFactory() without creating a "mirror file" and without loading the whole file into main memory.
Kind regards, Mathieu
You have several options.

1- Use a regex, e.g.:

    import re, codecs
    import wikipedia
    site = wikipedia.getSite()
    f = codecs.open("file.txt", "r", "utf-8")
    R = re.compile("{{(.+?)}}")  # or another regex
    for name in R.findall(f.read()):
        page = wikipedia.Page(site, name)
        # do whatever you like with the page

2- Use readlines:

    import codecs
    import wikipedia
    site = wikipedia.getSite()
    f = codecs.open("file.txt", "r", "utf-8")
    for line in f.readlines():
        line = line.replace("\n", "").replace("\r", "")
        name = line.split(":")[0]  # or any other way you like to get the title
        page = wikipedia.Page(site, name)
        # do whatever you like with the page

As for not loading the whole file, I don't think it's possible; or you can simply read it, save what you need to some other variables or files, and close it (e.g. f.close()).
Best
On Sun, Dec 1, 2013 at 1:26 PM, Mathieu Stumpf <psychoslave@culture-libre.org> wrote:
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
On 1 December 2013 11:38, Amir Ladsgroup ladsgroup@gmail.com wrote:
2- Use readlines:

    import codecs
    import wikipedia
    site = wikipedia.getSite()
    f = codecs.open("file.txt", "r", "utf-8")
    for line in f.readlines():
        line = line.replace("\n", "").replace("\r", "")
        name = line.split(":")[0]  # or any other way you like to get the title
        page = wikipedia.Page(site, name)
        # do whatever you like with the page
You can read the file line by line using 'for line in f' -- this keeps only the current line in memory. Cleaning up the rest a bit results in:

    import codecs, wikipedia  # or pywikibot if you are using core
    site = wikipedia.getSite()  # or pywikibot.Site() if you are using core
    f = codecs.open("file.txt", "r", "utf-8")
    for line in f:
        line = line.strip()
        if ":" not in line:
            continue  # skip blank lines and lines without a separator
        # split on the first colon only, so the definition may contain colons
        name, definition = line.split(":", 1)
        page = wikipedia.Page(site, name)
        page.put(definition)  # probably something else, though.
If you need some more assistance, I'd suggest joining #pywikipediabot on irc.freenode.net -- it's typically quicker than e-mail :-)
Merlijn
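
The line handling discussed above can be tried without any wiki access. What follows is only a minimal sketch: the helper name parse_entry and the sample entries are made up for illustration, and it assumes the "word:definition" format from the original question.

```python
import io


def parse_entry(line):
    """Split one 'word:definition' line; return None for unusable lines."""
    line = line.strip()
    if not line or ":" not in line:
        return None  # blank line or no separator
    # maxsplit=1 keeps any further colons inside the definition
    word, definition = line.split(":", 1)
    word, definition = word.strip(), definition.strip()
    if not word:
        return None  # line started with the separator
    return word, definition


# Invented sample input; a real run would iterate over the opened file instead.
sample = io.StringIO(u"hundo:chien\n\nkato: chat: petit félin\npas de séparateur\n")
entries = [parse_entry(line) for line in sample]
```

Iterating over the file object, as here over the StringIO, hands over one line at a time, so the same loop should work on a large input file.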
Thanks, everybody, for all your answers. I think I should be able to achieve my goal using your advice.
On Sunday, 1 December 2013 at 21:26 +0100, Merlijn van Deen wrote:
Mathieu Stumpf psychoslave@culture-libre.org wrote:
Hello,
I want to add Esperanto words to fr.wiktionary, using as input a file where each line has the format "word:the fine definition". So I copied basic.py and started hacking on it to achieve my goal.
Now, it seems that the -file argument expects a file where each line is formatted as "[[Article name]]". Of course I can just create a second input file and read both in parallel, so that I feed the genFactory with the first and use the second to build the wiktionary entry. But maybe you could give me a hint on how I can write a generator that can feed a pagegenerators.GeneratorFactory() without creating a "mirror file" and without loading the whole file into main memory.
I think the secret sauce for a working generator is the Python "yield" keyword. I will try to provide a working example later.
//Saper
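
As a rough illustration of the "yield" idea, the sketch below reads a "word:definition" file lazily and yields one pair at a time. The function name iter_entries and the sample data are assumptions made for this example; they do not come from pywikibot or from the code mentioned elsewhere in the thread.

```python
import io
import itertools
import os
import tempfile


def iter_entries(path, encoding="utf-8"):
    """Yield (word, definition) pairs from a 'word:definition' file, lazily."""
    with io.open(path, "r", encoding=encoding) as handle:
        for line in handle:  # the file object hands over one line at a time
            word, sep, definition = line.strip().partition(":")
            if sep and word:  # ignore blank or malformed lines
                yield word.strip(), definition.strip()


# Demo on a throwaway file (sample data invented for illustration).
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with io.open(fd, "w", encoding="utf-8") as out:
    out.write(u"hundo:chien\nkato:chat\nbirdo:oiseau\n")

entries = iter_entries(demo_path)
first_two = list(itertools.islice(entries, 2))  # third line is never parsed
entries.close()  # ends the generator early, which also closes the file
os.remove(demo_path)
```

Because the function body only runs when the caller asks for the next item, nothing beyond the requested lines is parsed, and no second "mirror" file is needed.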
All "pagegenerators" return only a series of Page objects and nothing else; they are useful to create just a list of pages to work on.
I wrote a very simple mini-bot using a different kind of generator, one that feeds the bot with both the page name and the content.
You can download the code from Gerrit:
https://gerrit.wikimedia.org/r/98457
You should run it like this:
python onelinecontent.py -simulate -contentfile:somecontent
where "somecontent" contains:
A:Test one line
B:Second line
Hope that provides some starting point for you,
//Saper
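
The code from the Gerrit change is not reproduced here. The sketch below only illustrates the general idea of a generator that hands the bot both a title and its content, with a dry-run mode. Every name in it (content_pairs, FakePage, run, the simulate parameter) is hypothetical, and FakePage merely stands in for a real page object so the example runs offline.

```python
def content_pairs(lines):
    """Yield (title, content) from an iterable of 'title:content' lines."""
    for line in lines:
        title, sep, content = line.strip().partition(":")
        if sep and title:  # skip blank or malformed lines
            yield title, content


class FakePage(object):
    """Offline stand-in for a wiki page object (hypothetical)."""

    def __init__(self, title):
        self.title = title
        self.saved = None

    def put(self, text):
        self.saved = text


def run(pairs, make_page, simulate=True):
    """Build one page per pair; save only when simulate is False."""
    log = []
    for title, content in pairs:
        page = make_page(title)
        if simulate:
            log.append("would save %s: %s" % (title, content))
        else:
            page.put(content)
            log.append("saved %s" % title)
    return log


report = run(content_pairs(["A:Test one line", "B:Second line"]), FakePage)
```

With a real framework, make_page would presumably be a small function that builds the framework's own page object from a title; that part is an assumption and is left out here.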