You have several options.

1- Use a regex, e.g.:

    import re, codecs
    import wikipedia

    site = wikipedia.getSite()
    f = codecs.open("file.txt", "r", "utf-8")
    R = re.compile(r"\{\{(.+?)\}\}")  # or other types of regex
    for name in R.findall(f.read()):
        page = wikipedia.Page(site, name)
        # do whatever you like with the page

2- Use readlines:

    import codecs
    import wikipedia

    site = wikipedia.getSite()
    f = codecs.open("file.txt", "r", "utf-8")
    for line in f.readlines():
        line = line.replace("\n", "").replace("\r", "")
        name = line.split(":")[0]  # or any other way you like to get the title
        page = wikipedia.Page(site, name)
        # do whatever you like with the page
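
As for not loading the whole file: f.read() and f.readlines() both read it into memory in one go, but iterating over the file object itself ("for line in f:") reads one line at a time. Alternatively you can simply read it, save what you need to other variables or files, and close it (e.g. f.close()).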
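
A minimal sketch of the line-by-line approach, assuming the "word:the fine definition" format from your message. The name iter_entries and its sep parameter are made up for illustration; they are not part of pywikipedia. It only parses the file, so building the Page objects is left to the caller.

```python
import io


def iter_entries(path, sep=":"):
    """Lazily yield (title, definition) pairs from a 'word:definition' file.

    Hypothetical helper, not a pywikipedia function.
    """
    with io.open(path, "r", encoding="utf-8") as f:
        # Iterating over the file object reads one line at a time,
        # so the whole file is never held in memory.
        for line in f:
            line = line.rstrip("\r\n")
            if not line:
                continue  # skip blank lines
            # Split on the first separator only, so definitions may contain ':'
            title, _, definition = line.partition(sep)
            yield title.strip(), definition.strip()
```

Inside the bot you could then loop with "for title, definition in iter_entries("file.txt"):" and create wikipedia.Page(site, title) for each pair, which also avoids the second "mirror" file since the title and the definition come from the same line.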
Best
On Sun, Dec 1, 2013 at 1:26 PM, Mathieu Stumpf <psychoslave@culture-libre.org> wrote:
Hello,
I want to add Esperanto words to fr.wiktionary, using as input a file where each line has the format "word:the fine definition". So I copied basic.py and started hacking it to achieve my goal.
Now, it seems like the -file argument expects a file where each line is formatted as "[[Article name]]". Of course I can just create a second input file and read both in parallel, so that I feed the genFactory with the former and use the second to build the wiktionary entry. But maybe you could give me a hint on how I can write a generator that can feed a pagegenerators.GeneratorFactory() without creating a "mirror file" and without loading the whole file into main memory.
Kind regards, Mathieu
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l