Java API for reading Wikipedia XML dumps


Delip Rao

18 Nov 2009

12:20 p.m.

Hello!

We have been working on a Java API for reading Wikipedia XML dumps for some time and it's now reasonably functional. Check out:

http://code.google.com/p/wikixmlj/

*Features:*

- Easy access to important elements of a Wikipedia page
- Also provides interfaces for Wiki text parsing.
- Memory efficient
  - SAX interface for parsing
  - Lazy loading of files for DOM
  - Callback support with DOM
- Directly operate on compressed wikipedia dumps (gzip/bzip2/native xml supported)
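
[Editor's sketch] For readers unfamiliar with the terms in the feature list, the following is a minimal illustration of what "SAX parsing with a callback on a compressed dump" can look like. It uses only the Java standard library and is not the wikixmlj API: the class `DumpStreamSketch`, the `PageCallback` interface, and the method names are hypothetical and chosen for illustration. It assumes a dump laid out as `<page>` elements, each containing a `<title>` and a `<text>` element, and it handles gzip only (the standard library has no bzip2 reader).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class DumpStreamSketch {

    /** Hypothetical callback type: invoked once for every completed page element. */
    public interface PageCallback {
        void process(String title, String text);
    }

    /** Decompresses a gzip stream on the fly and hands the XML to the SAX parser. */
    public static void parseGzip(InputStream gzipped, PageCallback callback)
            throws IOException, SAXException, ParserConfigurationException {
        try (InputStream xml = new GZIPInputStream(gzipped)) {
            parse(xml, callback);
        }
    }

    /** Streams the XML with SAX, so only one page is held in memory at a time. */
    public static void parse(InputStream xml, PageCallback callback)
            throws IOException, SAXException, ParserConfigurationException {
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder buffer = new StringBuilder();
            private String title;
            private String text;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                buffer.setLength(0);            // start collecting text for the new element
                if (qName.equals("page")) {     // forget the previous page's fields
                    title = null;
                    text = null;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                buffer.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                switch (qName) {
                    case "title": title = buffer.toString(); break;
                    case "text":  text = buffer.toString(); break;
                    case "page":  callback.process(title, text); break;
                    default:      break;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(xml, handler);
    }

    public static void main(String[] args) throws Exception {
        // Tiny in-memory stand-in for a dump file, compressed with gzip.
        String xml = "<mediawiki><page><title>Java</title>"
                + "<revision><text>A programming language.</text></revision></page></mediawiki>";
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(xml.getBytes(StandardCharsets.UTF_8));
        }
        parseGzip(new ByteArrayInputStream(out.toByteArray()),
                (title, text) -> System.out.println(title + ": " + text));
    }
}
```

The caller only supplies the callback and never touches the parser itself, which is the kind of separation the feature list describes. For the actual class and method names in wikixmlj, consult the project page linked above.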

Best, Delip


Felipe Ortega

18 Nov 2009

6:37 p.m.

Really interesting contribution.

I'll give it a try and let you know :).

Thanks!

F --

--- On Wed, 18/11/09, Delip Rao deliprao@gmail.com wrote:

From: Delip Rao deliprao@gmail.com
Subject: [Wiki-research-l] Java API for reading Wikipedia XML dumps
To: wiki-research-l@lists.wikimedia.org
Date: Wednesday, 18 November 2009, 18:20

Hello!

We have been working on a Java API for reading Wikipedia XML dumps for some time and it's now reasonably functional. Check out:

http://code.google.com/p/wikixmlj/

Features:
* Easy access to important elements of a Wikipedia page
* Also provides interfaces for Wiki text parsing.
* Memory efficient
      o SAX interface for parsing
      o Lazy loading of files for DOM
      o Callback support with DOM
* Directly operate on compressed wikipedia dumps (gzip/bzip2/native xml supported)

Best,
Delip


Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

Piotr Konieczny

24 Nov 2009

10:46 p.m.

Delip Rao wrote:

...

Hello!

We have been working on a Java API for reading Wikipedia XML dumps for some time and it's now reasonably functional. Check out:

http://code.google.com/p/wikixmlj/

*Features:*
* Easy access to important elements of a Wikipedia page
* Also provides interfaces for Wiki text parsing.
* Memory efficient
      o SAX interface for parsing
      o Lazy loading of files for DOM
      o Callback support with DOM
* Directly operate on compressed wikipedia dumps (gzip/bzip2/native
  xml supported)

Interesting. Is it usable by people who don't know what a "DOM parser" or "SAX interface" is?

-- Piotr Konieczny "The problem about Wikipedia is, that it just works in reality, not in theory."

Delip Rao

26 Nov 2009

3:17 p.m.

On Tue, Nov 24, 2009 at 10:46 PM, Piotr Konieczny piokon@post.pl wrote:

Interesting. Is it usable by people who don't know what a "DOM parser" or "SAX interface" is?

Thanks Piotr and everyone else for the interest. Yes, the API does not require any understanding of the underlying parser. Hope it fits your needs but let me know if there is anything I can accommodate.

Happy Thanksgiving!

