Thank you, Petr Onderka
I finally found the problem. It comes from the Python module lxml,
which works correctly with iterparse() on its own,
but when I use it to:
1st, match <title>,
2nd, go to its parent <page>,
I get this problem. I have now asked about it on the lxml mailing list.
Anyway, I can extract the page I want by doing it the other way round:
1st, match <page>,
2nd, match <title>.
It is just a little slower.
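
For reference, here is a minimal sketch of that second approach. It is not my exact script: the names find_page, local_name and SAMPLE are made up for the illustration, the namespace URL in SAMPLE is only a placeholder, and the fallback to the standard-library parser is an assumption so that the snippet runs even without lxml. My guess (not yet confirmed by the lxml list) is that the first approach fails because the "end" event of <title> arrives before the rest of its parent <page> has been parsed, whereas at the "end" event of <page> all children are complete.

```python
import io

try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: fall back to the stdlib parser, which offers the same
    # iterparse(source, events=...) call used below.
    import xml.etree.ElementTree as etree


def local_name(tag):
    """Strip the '{namespace}' prefix that the dump schema puts on every tag."""
    return tag.rsplit("}", 1)[-1] if isinstance(tag, str) else ""


def find_page(source, wanted_title):
    """Return (title, text) of the first <page> whose <title> is wanted_title."""
    for _event, elem in etree.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        # We are at the end of <page>, so its children are fully parsed.
        title = text = None
        for node in elem.iter():
            name = local_name(node.tag)
            if name == "title":
                title = node.text
            elif name == "text":
                text = node.text  # with several revisions, the last one wins
        if title == wanted_title:
            return title, text
        elem.clear()  # release the content of pages we do not need
    return None


# Tiny stand-in for the dump (hypothetical content, placeholder namespace).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page><title>Other</title><revision><text>skip me</text></revision></page>
  <page><title>Discussion:Apple</title><revision><text>full text here</text></revision></page>
</mediawiki>"""

print(find_page(io.BytesIO(SAMPLE), "Discussion:Apple"))
```

This is slower because every <page> is built completely before its title is checked, but the text is always whole when it is read.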
Thank you very much!
Kun JIN
On 03/03/2014 08:31 PM, Petr Onderka wrote:
On Fri, Feb 28, 2014 at 3:13 PM, Kun JIN <kun.jin(a)univ-bpclermont.fr> wrote:
I have another problem with "frwiki-20140208-pages-meta-current.xml". I
tried to extract "Discussion:Apple"
(http://fr.wikipedia.org/wiki/Discussion:Apple). In this dump, I got the
last revision of course, but the page has missing text (see the attached
file "page-Discussion:Apple.xml").
How exactly did you extract the text? When I look into that dump, I
can see the full text.
Petr Onderka
[[en:User:Svick]]