Hi,
On Sat, Dec 12, 2009 at 4:35 PM, David Gerard <dgerard(a)gmail.com> wrote:
2009/12/11 Behrang Saeedzadeh <behrangsa(a)gmail.com>:
Hi,
I have downloaded enwiki-latest-all-titles-in-ns0.gz and I want to
extract main titles and store them in another file. For example, some
titles have meta information (e.g. disambiguation etc.) and I want
these to be removed. Can I remove all the text between parentheses
from the titles to achieve this?
You have to parse it by hand.
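As a rough illustration of parsing by hand, here is a minimal Python
sketch. It assumes that a disambiguator is always a single trailing
"_(...)" group and that the file holds one title per line with
underscores instead of spaces. The function names are made up for this
example. Some titles carry parentheses as a real part of the name, so
treat this as a heuristic only.

```python
import gzip
import re

# Assumption: a disambiguator is one trailing "_(...)" group, e.g. "Mercury_(planet)".
_TRAILING_PAREN = re.compile(r"_\([^()]*\)$")


def strip_disambiguator(title: str) -> str:
    """Remove a single trailing '_(...)' group from a title, if present."""
    return _TRAILING_PAREN.sub("", title)


def main_titles(path: str):
    """Yield de-duplicated main titles from a gzipped, one-title-per-line file."""
    seen = set()
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            title = strip_disambiguator(line.rstrip("\n"))
            if title and title not in seen:
                seen.add(title)
                yield title
```

Parentheses that are not at the end of the title are left alone, so a
title such as "(I_Can't_Get_No)_Satisfaction" passes through unchanged.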
Also some titles start with the "!" character. and some are enclosed
between two or three of them such as !Adiso_Amigos!. What is the
purpose of "!" in such cases?
It's part of the topic's name (in the case of
<http://en.wikipedia.org/wiki/%C2%A1Adios_Amigos!>, the band's name). The
inverted exclamation mark is part of the Spanish language.
Also why some titles are enclosed between two double quotes such as
"400_Years_of_Telescope"?
Same case: The " are part of the topic's name (e.g.
<http://en.wikipedia.org/wiki/%22Weird_Al%22_Yankovic>).
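To see the actual characters behind the percent-escapes in the two URLs
above, a small Python sketch using the standard library's
urllib.parse.unquote may help. The helper name title_from_url is made
up for this example, and it assumes the title is simply everything
after "/wiki/".

```python
from urllib.parse import unquote


def title_from_url(url: str) -> str:
    """Return the percent-decoded page title from a /wiki/ URL (hypothetical helper)."""
    # Assumption: the title is everything after the first "/wiki/".
    encoded = url.split("/wiki/", 1)[1]
    return unquote(encoded)  # unquote decodes as UTF-8 by default


print(title_from_url("http://en.wikipedia.org/wiki/%C2%A1Adios_Amigos!"))
print(title_from_url("http://en.wikipedia.org/wiki/%22Weird_Al%22_Yankovic"))
```

The first title decodes to a leading inverted exclamation mark (U+00A1),
not a plain "!", and the second to literal double quotes around
Weird_Al.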
Marco
PS: Next time, please copy and paste the titles exactly so people have a
chance to see what you mean. Both of the examples you supplied had to be
corrected; the second one was missing a "the":
<http://en.wikipedia.org/wiki/"400_Years_of_the_Telescope">
--
VMSoft GbR
Nabburger Str. 15
81737 München
Geschäftsführer: Marco Schuster, Volker Hemmert
http://vmsoft-gbr.de