Hi,
I have downloaded enwiki-latest-all-titles-in-ns0.gz and I want to extract main titles and store them in another file. For example, some titles have meta information (e.g. disambiguation) and I want this to be removed. Can I remove all the text between parentheses from the titles to achieve this?
Also, some titles start with the "!" character, and some are enclosed between two or three of them, such as !Adios_Amigos!. What is the purpose of "!" in such cases? And why are some titles enclosed in double quotes, such as "400_Years_of_Telescope"?
Finally, is there a document describing all these conventions?
P.S: Is this the right place to ask such questions?
Cheers,
Behrang Saeedzadeh
-------------------------------
http://my.opera.com/behrangsa
http://twitter.com/behrangsa
This is probably not the right place; you would want wikitech-l (where I've cc'ed this reply).
- d.
2009/12/11 Behrang Saeedzadeh behrangsa@gmail.com:
_______________________________________________
WikiEN-l mailing list
WikiEN-l@lists.wikimedia.org
To unsubscribe from this mailing list, visit: https://lists.wikimedia.org/mailman/listinfo/wikien-l
The titles are already complete as they stand. A title contains "(disambiguation)" because there is a page with exactly that title. Likewise, there is a page named "!Adios_Amigos!" (in fact a redirect to ¡Adios_Amigos!).
If you want to filter out redirects and disambiguation pages, enwiki-latest-all-titles-in-ns0 alone is not enough. On the other hand, enwiki-latest-all-titles-in-ns0 already excludes all non-article titles, such as templates, talk pages, user pages, and pages about Wikipedia.
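If you still want to drop a trailing parenthetical from each title, something along these lines might work. It is only a rough sketch: the function names and the regular expression are made up for illustration, it assumes the file holds one title per line with underscores in place of spaces, and it removes only a single "_(...)" at the very end of a title. Keep in mind that it cannot tell a redirect or disambiguation page from an article, and it may also shorten titles whose parentheses are a genuine part of the name.

```python
import gzip
import re

# Hypothetical pattern: one trailing qualifier such as "_(disambiguation)".
# The leading underscore is required, so a title that is nothing but a
# parenthetical is left untouched.
TRAILING_QUALIFIER = re.compile(r"_\([^()]*\)$")


def strip_qualifier(title: str) -> str:
    """Return the title without a trailing "_(...)" qualifier, if it has one."""
    return TRAILING_QUALIFIER.sub("", title)


def base_titles(path: str):
    """Yield each distinct base title from a gzip-compressed list of titles."""
    seen = set()
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            title = line.rstrip("\n")
            if not title:
                continue  # skip blank lines
            base = strip_qualifier(title)
            if base not in seen:
                seen.add(base)
                yield base
```

For example, you could write the result to another file with `open("titles.txt", "w", encoding="utf-8").writelines(t + "\n" for t in base_titles("enwiki-latest-all-titles-in-ns0.gz"))`, where "titles.txt" is just a placeholder name.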
Hi Behrang,
I think you should be asking these types of questions on a list where people give advice on how to do things using a certain programming language (e.g., Python, Perl, etc.). For instance, you can find lists related to Python here: http://www.python.org/community/lists/.
Best,
--muhammad abdul-mageed, Ph.D. student, Indiana University Computational Linguistics and School of Library & Info. Science
On Fri, Dec 11, 2009 at 7:27 AM, Behrang Saeedzadeh behrangsa@gmail.com wrote: