Does anyone have a requirements document for the Wikipedia parser?
If not, will those programmers who have already begun work on such a
parser, like Magnus and Frithjof, please send me any scraps of
documentation you have?
I would like to assemble these into a wiki grammar or something like
that, so we can help each other with parser development.
I guess the list of "stupid parser tricks" would start with bracket
notation for links:
[http://www.edpoor.com] is a link to my outdated, static website
[http://www.edpoor.com/images/Ae-inAndDog.jpg girl with dog] is an
annotated link
[[Iraq]] links to the Wikipedia article on Iraq
[[Iraq|Rummyland]] links to Iraq but is shown as "Rummyland" (a
Doonesbury reference, okay? ;-)
Etc.
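Just to make that concrete, here is a rough PHP sketch of how those
four bracket rules might be handled with regular expressions. It is a
sketch only, not the actual parser code, and the /wiki/ URL prefix is
just an example:

  <?php
  # Sketch: turn the bracket notations above into HTML links.
  # Order matters: piped links before plain ones, annotated external
  # links before bare ones.
  function renderLinks( $text ) {
      # [[Iraq|Rummyland]] -> internal link shown with a different label
      $text = preg_replace( '/\[\[([^|\]]+)\|([^\]]+)\]\]/',
          '<a href="/wiki/$1">$2</a>', $text );
      # [[Iraq]] -> plain internal link
      $text = preg_replace( '/\[\[([^\]]+)\]\]/',
          '<a href="/wiki/$1">$1</a>', $text );
      # [http://example.com girl with dog] -> annotated external link
      $text = preg_replace( '/\[(https?:\/\/[^ \]]+) ([^\]]+)\]/',
          '<a href="$1">$2</a>', $text );
      # [http://example.com] -> bare external link
      $text = preg_replace( '/\[(https?:\/\/[^ \]]+)\]/',
          '<a href="$1">$1</a>', $text );
      return $text;
  }
  ?>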
Along with the parsing rules for rendering text, there is the problem
of fetching and posting files. That is, coordinating each user's off-line
stash (cache?) with the database. Note that some users might not want
the entire encyclopedia, but perhaps only those articles they're working
on. Or articles one click away?
Ed Poor
The kerfuffle touched my what? I dunno :-)
I was under the impression that EVERY TIME a user requests a page,
the software has to double-check each internal link for the presence or
absence of the linked page.
My ancient Greek cry of jubilation only applies if this is the case...
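If it really does work that way, I would guess the check could at
least be batched into one query per page rather than one per link.
A rough sketch (the table and column names are from the current
schema, but the helper itself is made up):

  <?php
  # Sketch: given all titles linked from a page, find out in a single
  # query which ones exist, instead of one query per link.
  function existingTitles( $db, $titles ) {
      $quoted = array();
      foreach ( $titles as $t ) {
          $quoted[] = "'" . mysql_real_escape_string( $t, $db ) . "'";
      }
      $sql = 'SELECT cur_title FROM cur WHERE cur_namespace = 0 ' .
          'AND cur_title IN (' . implode( ',', $quoted ) . ')';
      $res = mysql_query( $sql, $db );
      $found = array();
      while ( $row = mysql_fetch_row( $res ) ) {
          $found[$row[0]] = true;   # title exists, render it blue
      }
      return $found;                # anything not in here is a red link
  }
  ?>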
Ed
From: "Poor, Edmund W" <Edmund.W.Poor(a)abc.com>
Subject: RE: [Wikitech-l] visited links and empty
links in red
Anthere,
I have never been to Cologne, and I'm sorry you're
feeling blue. ;-)
Edmond LePauvre
Me neither, Edmond.
I am right now fuming, so kinda red actually.
So you can picture me with blue skin and red patches on the cheeks.
The colors mixing together might turn black eventually :-)
As a result of a robot run on the nl: wikipedia, I am now left with
25840 missing or incorrect links in other wikipedias. I wanted to put
these missing links on my user pages to let people help get them in. I
have done this on a smaller scale before, but this time it took me more
than 1 minute to upload each of 5 segments of the list to the en:
wikipedia. Even retrieving them feels like it is bringing the server
down to its knees. Did something change in that aspect of the software
that slowed it down dramatically? The only thing I can imagine that is
"unique" to these pages is the number of international links.
See the 5 pages at: http://en.wikipedia.org/wiki/User_talk:Rob_Hooft
(but only try that if you want to investigate, because it takes about a
minute to generate each page from the database!)
Should I take these offline again in the meantime?
Rob
--
Rob W.W. Hooft || rob(a)hooft.net || http://www.hooft.net/people/rob/
I hereby declare that $wgDatabaseMessages can be safely set to true, as
long as memcached is enabled.
I rolled all the messages into one memcached entry. At first it didn't
seem to make much difference. I was confused as to why it seemed to be
taking much longer to load from a cache than from Language::getMessage,
even though they were both doing the same thing. The reason, when I
eventually discovered it, was quite surprising: is_array() is quite fast
when it returns false, but painfully slow when it returns true. Like, a
millisecond. I rearranged my code so that it doesn't use that function.
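Roughly, the idea is something like this (a sketch only, not the
actual code; the cache key, expiry, and client methods are
illustrative):

  <?php
  # Sketch: fetch all interface messages as one memcached entry and
  # fall back to the database on a miss. Assumes a memcached client
  # object with get( $key ) and set( $key, $value, $expiry ).
  function getAllMessagesCached( $memc, $db ) {
      $key = 'wikidb:messages';          # made-up key name
      $messages = $memc->get( $key );
      # Deliberately not is_array() here, for the reason given above.
      if ( !$messages ) {
          $messages = array();
          # Interface messages live in the MediaWiki namespace (8).
          $res = mysql_query( 'SELECT cur_title, cur_text FROM cur ' .
              'WHERE cur_namespace = 8', $db );
          while ( $row = mysql_fetch_row( $res ) ) {
              $messages[$row[0]] = $row[1];
          }
          $memc->set( $key, $messages, 86400 );   # keep for a day
      }
      return $messages;
  }
  ?>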
-- Tim Starling.
My C++ parser is now a working offline browser. This is achieved by
converting a mysql dump of the cur table into an sqlite database once,
then using apache/php as a frontend and a php file calling the compiled
C++ executable, which renders HTML on-the-fly. Browse using your
favourite browser :-)
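The PHP glue is only a few lines, roughly like this (a sketch; the
executable name and its argument are placeholders):

  <?php
  # Sketch: hand the requested title to the compiled C++ renderer and
  # pass its HTML output straight through to the browser.
  $title = isset( $_GET['title'] ) ? $_GET['title'] : 'Main Page';
  header( 'Content-Type: text/html; charset=utf-8' );
  passthru( './wiki2html ' . escapeshellarg( $title ) );
  ?>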
Why bother with this, if a static HTML version would do the same?
* Proof-of-principle
* sqlite databases can be changed, allowing for edits
* Encapsulated database object, which can easily be switched to use mysql or any
other database instead of sqlite for use on a website
* Full-text search (which I'll implement next)
Some things that bug me:
* Is there an easy way to call an executable directly from apache,
without the need for a PHP script in-between?
* Is there a no-setup-needed web server? Some .exe that I can start on
a Windows machine, then fire up the web browser, and view/edit my local
wikipedia copy?
Magnus
A couple of days ago, I noticed the visited links
color changed. It is now red, of a color that is quite
similar to the missing links red.
Is this due to my browser, and if so, how can I change that?
If it is software related, why was it changed? Where was it discussed?
Can it be reverted, please? Or changed to another color? For
poor-sighted people, the two reds are confusing; I had to switch back
to the "?" feature for empty links.
(I am using Cologne Blue)
Could someone take over this question? I don't know what to tell these
guys.
Thanks.
Ed Poor
-----Original Message-----
From: Don Rameez [mailto:rameezdon@hotmail.com]
Sent: Saturday, October 18, 2003 3:00 PM
To: Poor, Edmund W
Subject: RE: "Auto Extraction from WWW"
Dear Edmund;
Thanks for your reply.
I have installed MySQL on Windows and tried to open the SQL dump, but
somehow I was unable to do so.
Could you please guide me on this? (I am a novice as far as MySQL is
concerned.)
Hope to hear from you soon.
regards
Don Rameez
-------Original Message-------
From: Poor, Edmund W <mailto:Edmund.W.Poor@abc.com>
Date: Wednesday, October 15, 2003 01:06:19 AM
To: Don Rameez <mailto:rameezdon@hotmail.com>
Subject: RE: "Auto Extraction from WWW"
I don't know how to convert SQL tables into plain TEXT. Why not use a
SELECT statement? Like:
SELECT cur_text FROM cur
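If you need actual text files, a rough sketch like this one (untested;
the connection details and output path are only examples) should get
you started:

  <?php
  # Sketch: pull every article's wikitext out of the 'cur' table and
  # write each one to a plain text file. Connection details, database
  # name and output directory are examples only.
  $db = mysql_connect( 'localhost', 'wikiuser', 'secret' );
  mysql_select_db( 'wikidb', $db );
  $res = mysql_query( 'SELECT cur_title, cur_text FROM cur', $db );
  while ( $row = mysql_fetch_assoc( $res ) ) {
      $name = str_replace( '/', '_', $row['cur_title'] ) . '.txt';
      $fp = fopen( 'articles/' . $name, 'w' );
      fwrite( $fp, $row['cur_text'] );
      fclose( $fp );
  }
  ?>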
Ed Poor
-----Original Message-----
From: Don Rameez [mailto:rameezdon@hotmail.com]
Sent: Sunday, October 12, 2003 9:33 PM
To: Poor, Edmund W
Subject: RE: "Auto Extraction from WWW"
Dear Edmund,
Thanks for acknowledging so soon and for answering my queries.
I appreciate your concern for the knowledge base.
I would like to ask you one more query:
Q) We now have the SQL dump; apart from MySQL, is there any possibility
that we can access the data in some other format (say, plain TEXT)?
regards
Don Rameez
-------Original Message-------
From: Poor, Edmund W <mailto:Edmund.W.Poor@abc.com>
Date: Tuesday, October 14, 2003 11:27:37 AM
To: Don Rameez <mailto:rameezdon@hotmail.com>
Subject: RE: "Auto Extraction from WWW"
Your questions are best put to our senior developers, but I'll give you
some preliminary answers.
1. All our articles are stored as plain English text. There is a bit of
markup used for links.
2. We are not encouraging direct server-to-server links. Rather, we
invite users to edit articles via the web interface.
3. You can get a SQL dump, if you want the entire database. It's much
less than one GB in size, and could possibly fit on one CD (we are
planning to publish a CD eventually).
The difference between our project and yours is that we are a
non-encoded encyclopedia. We just have a collection of articles.
You are trying to "encode" knowledge, which is Very Difficult. Many
attempts have been made in the past; I can't think of a single success,
but I can think of half a dozen spectacular failures. It's harder than
it looks!
I applaud the attempt, but this task involves artificial intelligence
(AI), and AI has not progressed beyond the so-called "expert system" or
"neural net". These are the toys of AI and have not produced reliable,
comprehensive results.
What do you really hope to accomplish, in the next 5 to 10 years?
Sincerely,
Ed Poor
Developer & Sysop
Wikipedia
-----Original Message-----
From: Don Rameez [mailto:rameezdon@hotmail.com]
Sent: Saturday, October 11, 2003 11:35 PM
To: JeLuF(a)gmx.de; ts4294967296(a)hotmail.com; maveric149(a)yahoo.com;
Poor, Edmund W; wikitech-l(a)Wikipedia.org
Cc: nagarjun(a)hbcse.tifr.res.in
Subject: "Auto Extraction from WWW"
Dear Sir,
We are a group of 3 students currently pursuing our B.E. in IT
(Bachelor of Engineering, Information Technology) at Mumbai University,
INDIA.
We are working on a project titled "AUTO EXTRACTION OF CONTENTS FROM
THE WORLD WIDE WEB" as part of our BE coursework, at the renowned
institute HBCSE-TIFR (Homi Bhabha Center for Science Education - Tata
Institute of Fundamental Research), under the guidance of
Dr. Nagarjuna G.
Our project is based on :
OS - GNU/LINUX
Language - Python
Server - Zope
Application - GNOWSYS
GNOWSYS (Gnowledge Networking and Organizing System) is a web
application for developing and maintaining semantic web content. It is
developed in Python and works as an installed product in Zope. Our
project involves automatically extracting data from the World Wide Web
(WWW) and using GNOWSYS to handle this vast amount of data. This will
not only help us store data in the Gnowledge base in the form of
meaningful relationships, but also test its handling of large amounts
of data.
The URL for our site is http://www.gnowledge.org
In this regard we could think of no one but Wikipedia, which is in
itself a phenomenon.
We would be glad if you could answer a few of our queries:
1] In what format is the data stored in Wikipedia?
2] Apart from HTTP or FTP, are there any other specific protocols in
use that will be required to communicate with the Wikipedia server?
3] How can we utilize the SQL dump?
We hope you will answer our queries at the earliest
With warm regards
Thanking You
[ Rameez Don , Jaymin Darbari, Ulhas Dhuri ]