Dear Sir,
We are a group of three students currently pursuing our B.E. in IT
(Bachelor of Engineering, Information Technology) at Mumbai
University, India.
At present we are working on a project titled "AUTO EXTRACTION
OF CONTENTS FROM THE WORLD WIDE WEB" as part of our B.E.
project, at the renowned institute HBCSE-TIFR
(Homi Bhabha Centre for Science Education - Tata Institute of
Fundamental Research), under the guidance of the scientist
Dr. Nagarjuna G.
Our project is based on
OS - GNU/LINUX
Language - Python
Server - Zope
Application - GNOWSYS
GNOWSYS (Gnowledge Networking and Organizing System) is a web
application for developing and maintaining semantic web content.
It is developed in Python and works as an installed product in Zope.
Our project involves automatically extracting data from the
World Wide Web (WWW) and using GNOWSYS to handle this vast
amount of data. This will not only help us store data in the
Gnowledge base in the form of meaningful relationships but also
test its handling of a huge amount of data.
The URL for our site is "http://www.gnowledge.org"
In this regard we could think of no one but Wikipedia, which is
a phenomenon in itself.
We would be glad if you could answer a few of our queries:
1] In what format is the data stored in Wikipedia?
2] Apart from HTTP and FTP, are there any other specific protocols
in use that are required to communicate with the Wikipedia server?
3] How can we utilize the SQL dump?
We hope you will answer our queries at your earliest convenience.
With warm regards,
Thank you,
[ Rameez Don , Jaymin Darbari, Ulhas Dhuri ]
The last discussions on the lists about how to represent pronunciations
on Wikipedia didn't end with any definitive conclusions.
One of the proposed suggestions was to provide a way to input IPA in an
easy-to-use format and then have it automagically converted into
Unicode IPA. As a first step I wrote a Lex analyzer to convert X-SAMPA
to Unicode IPA (which results in C code that I compiled to an
executable). This will enable editors to enter pronunciations in the
ugly-but-easy-to-type X-SAMPA format and have them appear in IPA using
the Unicode IPA extensions.
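The core of such a converter is a simple longest-match-first substitution table. Here is a minimal sketch of the idea in Python rather than Lex; the table covers only a handful of X-SAMPA symbols for illustration, not the full inventory my analyzer handles:

```python
# Minimal X-SAMPA -> Unicode IPA substitution table (illustrative subset).
XSAMPA_TO_IPA = {
    "A": "\u0251",    # open back unrounded vowel
    "{": "\u00e6",    # near-open front unrounded vowel
    "@": "\u0259",    # schwa
    "E": "\u025b",    # open-mid front unrounded vowel
    "I": "\u026a",    # near-close front unrounded vowel
    "U": "\u028a",    # near-close back rounded vowel
    "T": "\u03b8",    # voiceless dental fricative
    "D": "\u00f0",    # voiced dental fricative
    "S": "\u0283",    # voiceless postalveolar fricative
    "Z": "\u0292",    # voiced postalveolar fricative
    "N": "\u014b",    # velar nasal
    "r\\": "\u0279",  # alveolar approximant (a two-character symbol)
}

def xsampa_to_ipa(text: str) -> str:
    """Convert an X-SAMPA string to Unicode IPA, longest match first."""
    # Sort symbols by length so "r\\" is tried before "r".
    symbols = sorted(XSAMPA_TO_IPA, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for sym in symbols:
            if text.startswith(sym, i):
                out.append(XSAMPA_TO_IPA[sym])
                i += len(sym)
                break
        else:
            out.append(text[i])  # pass through unmapped characters
            i += 1
    return "".join(out)
```

A Lex analyzer does the same thing, but with the longest-match behavior built into the generated scanner rather than hand-rolled.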
I also dove into the Wikipedia code and patched OutputPage.php to
support <IPA> </IPA> tags that surround SAMPA and output it as <SPAN
class="IPA"> </SPAN>, with the Unicode IPA HTML entities inside the SPAN
tag. Finally, I patched style/wikistandard.css to have a .IPA style that
explicitly sets the font-family to a list of fonts that are known to
contain the IPA Unicode extensions. This is necessary because some
browsers (notably Windows IE) don't display IPA Unicode characters,
even if they're installed, unless the currently active font has those
characters.
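The CSS side of that fix would look something like the following sketch; the exact font list in my wikistandard.css patch may differ, but these are fonts commonly known to include the IPA Unicode block:

```css
/* Hypothetical .IPA rule: prefer fonts known to contain IPA characters */
.IPA {
    font-family: "Lucida Sans Unicode", "Arial Unicode MS",
                 Gentium, "Doulos SIL", sans-serif;
}
```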
I envision the use of <IPA> tags on Wikipedia to be fairly limited, in
that IPA/SAMPA will only appear on pages discussing pronunciations, but
I think it will make a good starting point, perhaps, for representing
pronunciations on Wiktionary.
I can attach the diffs for OutputPage.php and style/wikistandard.css to
a future message (if I can ever get the sourceforge cvs server to
respond), but what should I do with my .lex file and Makefile for
building the parser? Should I post them too? I would guess most list
readers would not be pleased by my spamming them with such tediums. I
understand the hesitation developers have to handing out CVS access to
people whose code they have never seen, so email me privately if you
want me to send you what I've done.
Thanks!
- David [[User:Nohat]]
Note: for most purposes, X-SAMPA is backwards compatible with SAMPA for
various languages, but the Lex analyzer can be modified to support
something like <IPA lang="French"> for the language-specific SAMPA
encodings, if that is deemed desirable.
Jens Frank wrote:
>A RAID 0+1 configuration, with data striped over
>two disks on one adapter and mirrored to the
>other adapter, is more reliable and probably a little
>bit faster.
That's what my partner said when I told him about the proposed database server
(he has to deal with RAID issues at work sometimes).
-- Daniel Mayer (aka mav)
As some of you might remember, I tried to write a wiki(pedia) offline
reader/editor a while ago. After some trouble with the GUI, and because
someone rightly pointed out that my parser wouldn't do Unicode, I
stopped working on it.
Lately, I had several people asking me about a stand-alone wiki(pedia)
parser. The latest related request was on the mailing list yesterday
("Java code...").
Now, I have created a (partly) working parser, written in C++. It is
based on a string object I also wrote, natively using 16-bit chars and
thus supporting unicode. (A function to actually import/export mysql
unicode remains to be written, though).
Together with this, I started rewriting the common wikipedia functions
(skins, languages, user management, etc.). That part is still at its
beginning, but it can already render the "Wikipedia:How to edit a page"
in a half-complete standard skin. This includes a function to parse the
"LanguageXX.php" files, so no need to rewrite all of that. (Import is a
little slow, though, so that won't be a permanent solution; it rather
screams "conversion").
As an example, the whole thing comes as a command line tool, which can
render a wiki-style text (from file or pipe) into HTML (no skin or
standard skin, by parameter).
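To give a flavor of what such a renderer does, here is a toy version in Python (not my C++ code) that handles only bold, italics, and internal links; a real parser covers far more of the syntax, but the shape is similar:

```python
import re

def render_wikitext(text: str) -> str:
    """Toy wiki-to-HTML renderer: bold, italics, and [[internal links]]."""
    # '''bold''' must be handled before ''italic'' so triple quotes win.
    text = re.sub(r"'''(.+?)'''", r"<b>\1</b>", text)
    text = re.sub(r"''(.+?)''", r"<i>\1</i>", text)

    # [[Page]] and [[Page|label]] internal links.
    def link(m: "re.Match[str]") -> str:
        target, _, label = m.group(1).partition("|")
        href = target.replace(" ", "_")
        return f'<a href="/wiki/{href}">{label or target}</a>'

    return re.sub(r"\[\[(.+?)\]\]", link, text)
```

For example, `render_wikitext("'''bold''' and [[Main Page|home]]")` produces `<b>bold</b> and <a href="/wiki/Main_Page">home</a>`.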
I have committed the sources (hereby under the GPL) to the Wikipedia
CVS, in a new module called "Waikiki" (seemed like the natural extension
of wiki to me;-)
This could be the basis for several wiki(pedia) software projects:
* Offline editor
* Reader, to be distributed on CD/DVD
* Wiki(pedia) apache module
So, fire up those compilers and go to work! ;-)
Magnus
>> I don't think it's wise to have four parsers doing actually the same. <<
On the server side, speed is a desired feature, so C or C++ makes sense as a preferred language. They also have the advantage of being very widely understood. Broad understanding probably favors C or simple C++ (simple C++ meaning without really fancy use of classes and inheritance). Hopefully either of those would allow a lot of people to get involved in optimizing and reusing the code.
Hello,
it looks like many people have done almost the same thing: I've also
begun to write a parser. Mine is written in Python.
It's far from complete, but I don't want to spend my time
writing useless code, so I am asking how we want to continue:
I don't think it's wise to have four parsers all doing the same thing.
Of course, everybody can do what he/she wants, but I personally
would rather work on one program than do duplicated, useless work.
As Magnus Manske already pointed out, such a parser could be the base of
several desired tools. We could write a library that could be used by
these applications.
The difference between my program and all the others, AFAIK, is
how the wiki data actually gets loaded.
As someone suggested on meta.wikipedia.org, I have written
a file called 'raw.php' that retrieves the raw wiki data.
In my opinion that's quite useful for offline-editing applications
and such things.
You can get my parser, along with my adapted 'Article.php' and the file
'raw.php', here: www.fms-engel.de/buildHTML.tar.bz2
(Sorry for not providing a patch; when I find time, I'll do it.)
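A client for a raw.php-style endpoint can be sketched in a few lines of Python. Note that the parameter name 'title' and the URL shape here are assumptions for illustration, not the actual interface of my script:

```python
import urllib.parse
import urllib.request

def build_raw_url(base_url: str, title: str) -> str:
    """Build the request URL; the 'title' parameter name is an assumption."""
    return f"{base_url}/raw.php?{urllib.parse.urlencode({'title': title})}"

def fetch_raw(base_url: str, title: str) -> str:
    """Fetch the raw wikitext of an article, e.g. for offline editing."""
    with urllib.request.urlopen(build_raw_url(base_url, title)) as resp:
        return resp.read().decode("utf-8")
```

An offline editor would call `fetch_raw` to load an article, let the user edit the wikitext locally, and post it back later.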
I don't want to start a language-flamewar, but I probably prefer
Magnus Manske's version. C++ is quite fast and there are several
GUI libraries one can use for each platform to write nice GUIs.
I did not take a look at his program, but I guess it's much more mature
than, for example, mine.
I would really like to discuss this topic, as I think a parser, and the
possibilities that result from having one, is one of the most wanted
features, at least for me :)
There is probably a better solution than having four parsers that all
do the same thing...
Regards,
Frithjof
Would someone please look through the server logs and see whether the
suspicion about User:Mediator is correct? If it's banned user
"EntmoonOfTrolls", then it looks like Jimbo wants some quick action.
Ed Poor
-----Original Message-----
From: Jimmy Wales [mailto:jwales@bomis.com]
Sent: Monday, October 13, 2003 4:49 PM
To: lazolla(a)hotmail.com; English Wikipedia
Subject: Re: [WikiEN-l] EofT/142.177.etc/24 back as User:Mediator
Well, let's make it a priority to figure out if it is him, and I think
it's time for me to start pursuing some legal means here.
> And if anyone can confirm that User:Mediator is not 142.177.etc., I'll
> submit my apology and buy the next two rounds of drinks.
By all means, let's be reasonably sure first, but I think we have to be
very firm in this case.
--Jimbo
_______________________________________________
WikiEN-l mailing list
WikiEN-l(a)Wikipedia.org
http://mail.wikipedia.org/mailman/listinfo/wikien-l
Most Wikimedia mailing lists can now be read via gmane.org, a
mailing-list-to-Usenet service.
http://news.gmane.org/?match=wikipedia (old hierarchy, to be moved to the
new Wikimedia hierarchy)
http://news.gmane.org/?match=wikimedia
I have noticed that there are now several new lists that are not (yet) listed.
Lists, with proposed names and descriptions:
1. MediaWiki-l
http://mail.wikipedia.org/mailman/listinfo/mediawiki-l
gmane.org.wikimedia.mediawiki
"Announcements about the MediaWiki software and site admin list"
! post address is @wikimedia.org
2. Wikibugs-l
http://mail.wikipedia.org/mailman/listinfo/wikibugs-l
gmane.org.wikimedia.mediawikibugs
or
gmane.org.wikimedia.wikibugs
"Monitor updates to the MediaWiki bug tracker on SourceForge"
-> "Wikibugs-l" is the name of the list, but it is about bugs in the
MediaWiki software, so "Mediawikibugs" sounds clearer.
3. Wikilegal-l
http://mail.wikipedia.org/mailman/listinfo/wikilegal-l
gmane.org.wikimedia.wikilegal
"Discussion list about the legal matters of the Wikimedia projects"
! post address is @wikimedia.org
Is it ok if those lists are also listed at gmane.org? Comments?
There is not yet a:
* WikiquoteEN-l
* WiktionaryEN-l
I do not know whether those would be useful.
--
Contact: walter AT wikipedia.be
Want to write an article too? WikipediaNL, the free GNU/FDL encyclopedia
http://www.wikipedia.be
Tarquin, putting some specs on-line at Meta sounds like a good idea. And
per Alfio's suggestion, maybe we could start by copying "Editing help",
then massage it from a "tips for users" format into a "specs for
programmers" format.
Ed Poor