[WikiEN-l] Using english-Wikipedia XML dump

Carcharoth carcharothwp at googlemail.com
Mon Jun 29 10:27:55 UTC 2009


You might want to ask in the technical forum. Hopefully someone can
point you that way, or answer your question here.

Carcharoth

On Sat, Jun 27, 2009 at 10:24 PM, akhil1988<akhilanger at gmail.com> wrote:
>
> Hi All!
>
> I am new to this forum.
>
> I am looking for some references to help me use the Wikipedia XML dump.
>
> Here's what I have to do with the XML dump:
>
> I will set up a server on which people can browse Wikipedia articles and
> also a processed version of each article. By "processed version" I mean the
> Wikipedia article with some additional information attached to each line,
> e.g.:
>
> A line in a Wikipedia article (http://en.wikipedia.org/wiki/Chicago) goes
> as:
>
> Chicago (pronounced /ʃɨˈkɑːɡoʊ/ or /ʃɨˈkɔːɡoʊ/) is the largest city in the
> U.S. state of Illinois, and with over 2.8 million people is the third
> largest city in the country.
>
> My processed version of the Wikipedia page would look like this:
>
> Chicago (pronounced /ʃɨˈkɑːɡoʊ/ or /ʃɨˈkɔːɡoʊ/) is the largest city in the
> U.S. state of Illinois, and with over 2.8 million people is the third
> largest city in the country. <Some additional information about this line>
>
> Don't worry about "Some additional information about this line". It comes
> from some NLP (natural language processing) code that processes the line and
> generates additional information about it.
>
> So, if somebody wants to access the processed version of any Wikipedia
> article, he can go to: http://myserver/wiki/processed_Chicago
>
> I hope it is clear what I intend to do with the Wikipedia XML dump.
>
> For this I need to know the following things:
>
> 1. How should I extract articles from the XML dump, extract the plain text
> from them, process it, and then insert the processed text back line by line
> at the same place in the XML article, along with the additional information
> generated by the NLP step? Throughout this process, I want the page to keep
> the same look as the original Wikipedia page.
>
> 2. How can I render a Wikipedia page from the XML dump so that it looks like
> the online version of Wikipedia?
>
> 3. The XML dump does not contain images, so how can I display images when a
> page on my server is accessed?
>
> Any references or ideas in this regard will be greatly appreciated.
>
> Thanks,
> Akhil
> --
> View this message in context: http://www.nabble.com/Using-english-Wikipedia-XML-dump-tp24236727p24236727.html
> Sent from the English Wikipedia mailing list archive at Nabble.com.
>
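
For the first question above, a rough starting point might look like the
sketch below. It is only an illustration, not a tested recipe for the full
dump: it streams <page> elements with Python's standard xml.etree.ElementTree
and appends a note to every non-empty line. The names iter_pages, annotate
and process are made up for this example, annotate is a placeholder for the
real NLP step, and SAMPLE is a tiny hand-written stand-in for the dump.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream."""
    title, text = None, ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # drop the finished page's children to limit memory use


def annotate(line):
    """Hypothetical stand-in for the NLP step: here it just counts words."""
    return "<%d words>" % len(line.split())


def process(wikitext):
    """Append the annotation to each non-empty line, keeping line order."""
    out = []
    for line in wikitext.splitlines():
        out.append(line + " " + annotate(line) if line.strip() else line)
    return "\n".join(out)


# Hand-written sample in the shape of an export file (assumption: the real
# dump uses the same page/title/revision/text nesting).
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page>
    <title>Chicago</title>
    <revision>
      <text>Chicago is the largest city in Illinois.

It is the third largest city in the country.</text>
    </revision>
  </page>
</mediawiki>"""

pages = [(title, process(text))
         for title, text in iter_pages(io.BytesIO(SAMPLE.encode("utf-8")))]
for title, processed in pages:
    print(title)
    print(processed)
```

Note that the text stored in the dump is wiki markup, not plain text, so a
real pipeline would still need a step that strips or parses the markup before
the NLP code sees it. Rendering the result and serving images (the second and
third questions) are not covered by this sketch.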


