2009/12/11 Behrang Saeedzadeh
<behrangsa(a)gmail.com>:
> Hi,
>
> I have downloaded enwiki-latest-all-titles-in-ns0.gz and I want to extract
> main titles and store them in another file. For example, some titles have
> meta information (e.g. disambiguation etc.) and I want these to be removed.
> Can I remove all the text between parentheses from the titles to achieve
> this?
>
> Also some titles start with the "!" character. and some are enclosed between
> two or three of them such as !Adiso_Amigos!. What is the purpose of "!" in
> such cases? Also why some titles are enclosed between two double quotes such
> as "400_Years_of_Telescope"?
>
> Finally, is there a document describing all these conventions?
>
> P.S: Is this the right place to ask such questions?
>
> Cheers,
> Behrang Saeedzadeh

The titles in that file are already complete page titles.
A title contains "(disambiguation)" because there is a page
with exactly that title.
There is also a page named "!Adios_Amigos!" (in fact a
redirect to ¡Adios_Amigos!).
If you want to filter out redirects and disambiguation pages,
enwiki-latest-all-titles-in-ns0 alone is not enough.
On the other hand, enwiki-latest-all-titles-in-ns0 already excludes
all non-article titles, such as templates, talk pages, user pages, and
pages about Wikipedia itself.
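
If you still want to look at the text between parentheses, here is a rough Python sketch that splits it off instead of deleting it. The helper names split_title and read_titles are made up for illustration, and I am assuming the file holds one title per line, with underscores in place of spaces:

```python
import gzip
import re

# A trailing parenthetical qualifier, e.g. "_(disambiguation)".
QUALIFIER = re.compile(r"^(?P<base>.+?)_\((?P<qualifier>[^()]*)\)$")


def split_title(title):
    """Return (base, qualifier); qualifier is None when the title
    does not end in a parenthetical."""
    match = QUALIFIER.match(title)
    if match:
        return match.group("base"), match.group("qualifier")
    return title, None


def read_titles(path):
    """Yield titles from the gzipped list (assumed: one title per line)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            title = line.rstrip("\n")
            if title:
                yield title
```

For example, split_title("Mercury_(planet)") and split_title("Mercury_(element)") both give the base "Mercury", so dropping the parentheses would merge distinct pages into one title. Keeping the qualifier lets you decide case by case, but it still cannot tell you which titles are redirects.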