[Pywikipediabot] Using the content of a file as input for articles

Mathieu Stumpf

1 Dec 2013

9:56 a.m.

Hello,

I want to add Esperanto words to fr.wiktionary, using as input a file where each line has the format "word:the fine definition". So I copied basic.py and started hacking it to achieve my goal.

Now, it seems that the -file argument expects a file where each line is formatted as "[[Article name]]". Of course I can just create a second input file and read both in parallel, so that I feed the genFactory with the former and use the second to build the Wiktionary entry. But maybe you could give me a hint on how I can write a generator that can feed a pagegenerators.GeneratorFactory() without creating a "mirror file" and without loading the whole file into main memory.

Kind regards, Mathieu

Amir Ladsgroup

1 Dec 2013

10:38 a.m.

You have several options.

1- Use a regex, e.g.:

    import re, codecs
    import wikipedia
    site = wikipedia.getSite()
    f = codecs.open("file.txt", "r", "utf-8")
    R = re.compile("{{(.+?)}}")  # or other types of regex
    for name in R.findall(f.read()):
        page = wikipedia.Page(site, name)
        # do whatever you like with the page

2- Use readlines:

    import codecs
    import wikipedia
    site = wikipedia.getSite()
    f = codecs.open("file.txt", "r", "utf-8")
    for line in f.readlines():
        line = line.replace("\n", "").replace("\r", "")
        name = line.split(":")[0]  # or any other way you like to get the title
        page = wikipedia.Page(site, name)
        # do whatever you like with the page

As for not loading the whole file, I don't think it's possible; or you can simply read it, save it to some other variables or files, and close it (e.g. f.close()).

Best

Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l

-- Amir

Merlijn van Deen

8:26 p.m.

On 1 December 2013 11:38, Amir Ladsgroup ladsgroup@gmail.com wrote:

> ...
>
> 2- use readlines:
>
>     import codecs
>     site = wikipedia.getSite()
>     f = codecs.open("file.txt", "r", "utf-8")
>     for line in f.readlines():
>         line = line.replace("\n", "").replace("\r", "")
>         name = line.split(":")[0]  # or any kind that you like to get the title
>         page = wikipedia.Page(site, name)
>         # do whatever you like with the page

You can read the file per-line using 'for line in f' -- this will just read the current line in memory. Cleaning up the rest a bit results in:

    import codecs, wikipedia  # or pywikibot if you are using core
    site = wikipedia.getSite()  # or pywikibot.Site() if you are using core
    f = codecs.open("file.txt", "r", "utf-8")
    for line in f:
        line = line.strip()
        name, definition = line.split(":", 1)
        page = wikipedia.Page(site, name)
        page.put(definition)  # probably something else, though.
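The loop above fails on a blank line or a line without a colon, because the two-name unpacking then receives a single value. A slightly more defensive variant could look like the sketch below. The function name `iter_entries` and the choice to skip malformed lines silently are assumptions made for illustration; they are not part of pywikipedia, and the wiki calls are left out so that the snippet runs on its own.

```python
import io


def iter_entries(path, encoding="utf-8"):
    """Yield (word, definition) pairs from a 'word:definition' file."""
    with io.open(path, "r", encoding=encoding) as f:
        for line in f:  # iterating the file object reads one line at a time
            line = line.strip()
            if not line or ":" not in line:
                continue  # assumption: skip blank or malformed lines
            # Split on the first colon only, so definitions may contain colons.
            word, definition = line.split(":", 1)
            yield word.strip(), definition.strip()
```

Each pair could then be handed to `wikipedia.Page(site, word)` (or `pywikibot.Page` in core) inside the caller's own loop; the `with` block closes the file once the generator is exhausted.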

If you need some more assistance, I'd suggest joining #pywikipediabot on irc.freenode.net -- it's typically quicker than e-mail :-)

Merlijn

Mathieu Stumpf

3 Dec 2013

7:21 a.m.

Thanks, everybody, for all your answers. I think I should be able to achieve my goal using your advice.

Marcin Cieslak

1 Dec 2013

1:59 p.m.

I think that the secret sauce for making a working generator is the Python "yield" keyword. I will try to provide a working example later.
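To illustrate what "yield" buys here, the following is a minimal sketch, not the example promised above. The names `entries` and `noisy_lines` and the sample words are made up for the demonstration. It shows that a generator function pulls one line from its source only when the consumer asks for the next item.

```python
def entries(lines):
    """Yield (word, definition) for each 'word:definition' line in an iterable."""
    for line in lines:
        word, sep, definition = line.strip().partition(":")
        if sep:  # only yield lines that actually contain a colon
            yield word, definition


def noisy_lines(log):
    """Stand-in for a file object: records each line at the moment it is read."""
    for raw in ["hundo:chien\n", "kato:chat\n", "birdo:oiseau\n"]:
        log.append(raw)
        yield raw


log = []
gen = entries(noisy_lines(log))  # nothing has been read yet
first = next(gen)                # reads exactly one line from the source
print(first, len(log))
```

Because `entries` accepts any iterable of lines, an open file object can be passed in directly, and only the current line is held in memory.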

//Saper

Marcin Cieslak

9:36 p.m.

All "pagegenerators" return only a series of Page objects and nothing else; they are useful to create just a list of pages to work on.

I wrote a very simple mini-bot using a different kind of generator that feeds the bot with both pagename and the content.

You can download the code from Gerrit:

https://gerrit.wikimedia.org/r/98457

You should run it like this:

python onelinecontent.py -simulate -contentfile:somecontent

where "somecontent" contains:

    A:Test one line
    B:Second line
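For readers who cannot fetch the Gerrit change, the overall shape of such a mini-bot might resemble the sketch below. This is a hypothetical reconstruction and does not reproduce the code in that change: `parse_args`, `run`, and the `save` callback are invented names, and in pywikibot the -simulate flag is normally consumed by the framework itself, so handling it by hand here is only a stand-in.

```python
def parse_args(argv):
    """Return (content_path, simulate) from arguments like '-contentfile:NAME'."""
    content_path, simulate = None, False
    prefix = "-contentfile:"
    for arg in argv:
        if arg.startswith(prefix):
            content_path = arg[len(prefix):]
        elif arg == "-simulate":
            simulate = True
    return content_path, simulate


def run(lines, simulate, save):
    """Call save(title, text) for each 'title:text' line; return the entry count.

    'save' is a hypothetical callback; a real bot would create a Page object
    and write the text to it there. With simulate set, nothing is saved.
    """
    count = 0
    for line in lines:
        title, sep, text = line.strip().partition(":")
        if not sep:
            continue  # assumption: ignore lines without a colon
        count += 1
        if not simulate:
            save(title, text)
    return count
```

With the two-line content file shown above, `run` would visit the pairs ("A", "Test one line") and ("B", "Second line") in order, passing the title and the content to `save` together.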

Hope that provides some starting point for you,

//Saper
