Thank you, Petr Onderka
I finally found the problem. It comes from the Python module lxml: iterparse() itself works correctly, but when I use it to first match <title> and then go up to its parent <page>, I get the missing text. I have now posted this question to the lxml mailing list.
Anyway, I can extract the page I want the other way round: first match <page>, then match <title>. It is just a little slower.
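For the archive, here is a minimal sketch of the "match <page> first, then <title>" approach. It uses the standard library's xml.etree.ElementTree.iterparse rather than lxml so that it runs without extra packages; the function name find_page and the small sample document are made up for illustration and are not my actual script. One likely explanation for the missing text, not confirmed in this thread, is that when the "end" event of <title> fires, the rest of the parent <page> has not been parsed yet, whereas at the "end" event of <page> all of its children are complete.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def find_page(source, wanted_title):
    """Return the serialized <page> whose <title> equals wanted_title, or None.

    Hypothetical helper for illustration only.
    """
    # React only to "end" events: at that point the element and all of its
    # children (including the revision text) have been parsed completely.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        # Look for the <title> among the direct children of this <page>.
        title = next(
            (child.text for child in elem if local_name(child.tag) == "title"),
            None,
        )
        if title == wanted_title:
            return ET.tostring(elem, encoding="unicode")
        # Drop the content of pages we do not need to keep memory use low.
        elem.clear()
    return None


# Tiny stand-in for a dump file (real dumps also declare an XML namespace).
sample = (
    "<mediawiki>"
    "<page><title>Apple</title>"
    "<revision><text>fruit</text></revision></page>"
    "<page><title>Discussion:Apple</title>"
    "<revision><text>full talk text</text></revision></page>"
    "</mediawiki>"
)

result = find_page(io.BytesIO(sample.encode("utf-8")), "Discussion:Apple")
print(result)
```

Every <page> is built in full before its title is checked, which is probably why this way is slower than matching <title> directly.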
Thank you very much!
Kun JIN
On 03/03/2014 08:31 PM, Petr Onderka wrote:
On Fri, Feb 28, 2014 at 3:13 PM, Kun JIN kun.jin@univ-bpclermont.fr wrote:
I have another problem with "frwiki-20140208-pages-meta-current.xml". I tried to extract "Discussion:Apple" (http://fr.wikipedia.org/wiki/Discussion:Apple). In this dump I got the last revision, of course, but the page has missing text (see the attached file "page-Discussion:Apple.xml").
How exactly did you extract the text? When I look into that dump, I can see the full text.
Petr Onderka [[en:User:Svick]]