> If anyone is interested, I have a rudimentary Perl script that is
> capable of reading the downloadable SQL dump and outputting all the
> articles as separate files in a number of alphabetical directories.
> It's not very fast, but it works.
> What's missing from the script: wikimarkup -> HTML conversion,
Mr David A. Wheeler,
Have you seen my Perl script for converting the SQL dump to a
TomeRaider database? You might find useful code there.
It renders all pages in HTML, checks all hyperlinks, and unlinks half a
million orphaned ones. It edits the wiki code to remove redundant tags,
fixes some badly coded HTML tables, and adds stats and a
language-specific introduction. It replaces HTML tags with extended
ASCII (which saves a lot of space). It resolves redirects, making
hyperlinks point directly to the proper article. It removes tables that
contained only an image (plus possibly a single footer text).
In fact, I think the script could be extended to generate separate HTML
pages in a few hours, Plucker specifics not taken into account.
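For example, the redirect-resolution step might look roughly like this.
A minimal sketch in Python (the actual script is Perl, and all names
here are illustrative, not taken from WikiToTome.pl):

import re

REDIRECT_RE = re.compile(r'#REDIRECT\s*\[\[([^\]|#]+)', re.IGNORECASE)

def build_redirect_map(articles):
    # articles: dict mapping title -> raw wikitext
    redirects = {}
    for title, text in articles.items():
        m = REDIRECT_RE.match(text)
        if m:
            redirects[title] = m.group(1).strip()
    return redirects

def resolve(title, redirects, max_hops=5):
    # Follow redirect chains, guarding against loops.
    seen = set()
    while title in redirects and title not in seen and max_hops > 0:
        seen.add(title)
        title = redirects[title]
        max_hops -= 1
    return title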
Script: http://members.chello.nl/epzachte/Wikipedia/WikiToTome.pl
More info: http://members.chello.nl/epzachte/Wikipedia
Erik Zachte
Modified Files:
includes/SearchEngine.php
languages/Language.php
languages/LanguageDe.php
Modified the search to display a link for creating a new page when 'Go'
finds no article. To enable this, you have to modify "nogomatch" as
demonstrated in LanguageDe.php. The "showingresultnum" texts are
modified; the limit is removed now.
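To illustrate the idea (a conceptual Python sketch, not the actual PHP
Language classes; the message wording below is invented): each language
class overrides entries in the base message table, and "nogomatch" gets
a $1 placeholder for the queried title so it can offer a creation link.

# Conceptual sketch only; MediaWiki does this in PHP via the Language /
# LanguageDe classes. The message wording here is invented.
base_messages = {
    "nogomatch": "No page with this exact title exists.",
}
de_messages = {
    # Overridden to offer a link for creating the missing page; $1 is
    # replaced with the queried title. (German: "No article '$1'
    # exists. You can create it here: [[$1]].")
    "nogomatch": 'Es existiert kein Artikel "$1". '
                 'Du kannst ihn hier neu anlegen: [[$1]].',
}

def get_message(key, lang):
    table = de_messages if lang == "de" else {}
    return table.get(key, base_messages[key])

print(get_message("nogomatch", "de").replace("$1", "Nase"))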
Brion, take a look and see whether it's now more to your liking. I
still left $3 in, in case anyone insists on displaying the $limit value
($1). It's all tested, at home of course, using "Nase" (nose) as the
query.
If you think it's OK, please install it on the German Wikipedia;
LanguageDe.php contains further changes independent of the
SearchEngine.
Thank you in advance.
--
Smurf
smurf(a)AdamAnt.mud.de
------------------------- Anthill inside! ---------------------------
Just another DB-Error:
cur_id:          5144
cur_namespace:   0
cur_title:       Tacitus_(Kaiser)
                 (http://de.wikipedia.org/wiki/Tacitus_%28Kaiser%29)
cur_text:        #REDIRECT [[Marcus Claudius Tacitus]]
cur_is_redirect: 0
It should be "cur_is_redirect = 1", or did I miss something again?
How can this be solved (other than with "UPDATE cur SET
cur_is_redirect=1 WHERE cur_id = 5144")?
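A hedged sketch of a batch fix in Python (connection parameters are
placeholders, and I'm assuming direct database access via the
MySQL-python bindings): find every row whose text starts with #REDIRECT
but whose flag is unset, then set the flag in one pass.

import MySQLdb  # assumes the MySQL-python bindings are installed

db = MySQLdb.connect(host="localhost", user="wikiuser",
                     passwd="secret", db="wikidb")
c = db.cursor()
# List the mis-flagged rows first, for review.
c.execute("SELECT cur_id, cur_title FROM cur "
          "WHERE cur_text LIKE '#REDIRECT%' AND cur_is_redirect = 0")
for cur_id, cur_title in c.fetchall():
    print(cur_id, cur_title)
# Then fix them all at once instead of one UPDATE per cur_id.
c.execute("UPDATE cur SET cur_is_redirect = 1 "
          "WHERE cur_text LIKE '#REDIRECT%' AND cur_is_redirect = 0")
db.commit()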
--
Smurf
smurf(a)AdamAnt.mud.de
------------------------- Anthill inside! ---------------------------
On Monday, 19 May 2003 at 23:24, Thomas Corell wrote:
> Can someone check whether the English translation of
> ''showingresultsnum'' is proper English? I tried, but that doesn't
> mean much ;)
"Showing below <b>$3</b> results using the respective limit of <b>$1</b>
starting with #<b>$2</b>."
Whoa... :) I'd prefer something simpler:
"Showing up to <b>$1</b> results starting with #<b>$2</b>."
Or better yet, just give the actual number of results and let the chunk
size be shown by the "next X" / "prev X" links.
-- brion vibber (brion @ pobox.com)
> On Tuesday, 20 May 2003 at 03:53, Alfio Puglisi wrote:
> > I just subscribed (I'm the Wikipedia user At18) to ask about the
> > automatic HTML dump function. ...
>
> > If anyone is interested, I have a rudimentary Perl script that is
> > capable of reading the downloadable SQL dump and outputting all the
> > articles as separate files in a number of alphabetical directories.
> > It's not very fast, but it works.
That's great!
> > What's missing from the script: wikimarkup -> HTML conversion,
You should be able to call the existing PHP code that generates
HTML to do this.
A tool that generated the entire Wikipedia, in static HTML format,
would make it trivial to generate the "Plucker" format
for Palm PDAs. Plucker is an offline web browser for Palm PDAs;
it's open source software/Free Software (OSS/FS) released under the
GPL.
It can handle HTML as well as PNG, GIF, JPEG, plain text, and a few
others; HTML is usually rendered as you'd expect (hypertext, italics,
bold, font size changes, lists, and indenting all work).
It'd be very nice if the Wikipedia were available in Plucker format;
that would mean that an OSS/FS reader could be used to view the text
on a Palm PDA.
Plucker is available at: "http://www.plkr.org".
I have a Palm, and it is the MOST important program I use by far.
One minor problem is that Plucker doesn't have an index
facility. That could be solved by creating HTML pages that link to
sorted articles, e.g., "Master Index" could list "A, B, C...";
clicking on "A" would reach "Index A" which would list "AA, AB, AC...".
Then, modify the static version of the main page so you could quickly
jump to the master index.
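A minimal sketch of that two-level index in Python, assuming the static
dump is already a mapping from article titles to HTML filenames (the
file names here are made up):

from collections import defaultdict

def write_indexes(titles_to_files):
    # Group article titles by their first letter.
    by_letter = defaultdict(list)
    for title, fname in sorted(titles_to_files.items()):
        by_letter[title[0].upper()].append((title, fname))

    # Master index: one link per letter.
    with open("index_master.html", "w") as f:
        f.write("<html><body><h1>Master Index</h1>\n")
        for letter in sorted(by_letter):
            f.write(f'<a href="index_{letter}.html">{letter}</a>\n')
        f.write("</body></html>\n")

    # One page per letter, linking to the sorted articles.
    for letter, entries in by_letter.items():
        with open(f"index_{letter}.html", "w") as f:
            f.write(f"<html><body><h1>Index {letter}</h1>\n")
            for title, fname in entries:
                f.write(f'<a href="{fname}">{title}</a><br>\n')
            f.write("</body></html>\n")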
Internally, Plucker will break long pages (>32K) into multiple pages
with front and back links, but that'll be automatic and won't affect
anything.
I don't know of an automatic way to download the Wikipedia
images (which, in my mind, is a serious problem). Hopefully there
will soon be a way to download the images other than trawling.
However, for a Palm you'd have to drop the images in general anyway,
so for that particular use it wouldn't matter.
Just for kicks, I've added some preliminary, experimental support for
gzip encoding of pages that have been saved in the file cache. If
$wgUseGzip is not enabled in LocalSettings, it shouldn't have any
effect; if it is, it'll make compressed copies of cached files and then
serve them if the client claims to accept gzip.
At present this only affects file-cachable pages: so plain current page
views by not-logged-in users. Compression is only done when generating
the cached file, so it oughtn't to drain CPU resources too much. My
informal testing shows the gzipping takes about 2-3 ms, which is much
shorter than most of the page generation steps. (Though it will eat up
some additional disk space, as both uncompressed and compressed copies
are kept on disk.)
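To sketch the mechanism (hedged; the real code is PHP inside
MediaWiki's file cache, so this Python is purely illustrative):
compress once when the cache file is written, and pick the .gz twin at
serve time only if the client advertises gzip support.

import gzip, os

def save_cached(path, html):
    # Write the plain copy, then a compressed twin. Compression happens
    # once here, at generation time, not on every request.
    with open(path, "w") as f:
        f.write(html)
    with gzip.open(path + ".gz", "wt") as f:
        f.write(html)

def serve_cached(path, accept_encoding):
    # Serve the .gz copy only if the client claims gzip support.
    if "gzip" in accept_encoding and os.path.exists(path + ".gz"):
        return open(path + ".gz", "rb").read(), "gzip"
    return open(path, "rb").read(), None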
I'd appreciate some testing with various user agents to see if things
are working. If you receive a compressed page, there'll be a comment at
the end of the page like <!-- Cached/compressed [timestamp] -->
A few notes:
This needs zlib support compiled into PHP to work. I've done this on
Larousse.
An on-the-fly compression filter could also be turned on for dynamic
pages and logged-in users, but I haven't done this yet. Compression
support could be a user-selectable option, so those with problem
browsers could turn it off, or those with slow modems could turn it on
where it's off by default. :)
The purpose of all this is of course to save bandwidth; there are two
ends of this, the server and the client:
Jimbo has pooh-poohed concerns about our bandwidth usage; certainly the
server has a nice fat pipe to the internet and isn't in danger of
choking, and whatever Bomis's overall bandwidth usage, Jimbo hasn't
complained that we're crowding out his legitimate business. :) But
still, we're looking at 5-20 *gigabytes* *per day*. A fair chunk of
that is probably images and archive dumps, but a lot is text.
On the client end: schmucks with dial-up may appreciate a little
compression. :)
I've also fixed what seems to be a conflict between the page cache and
client-side caching.
There are some race conditions remaining as far as making sure that two
loads of the same page don't overwrite each other's work or read
another's page partway through, and adding a gzipped second file
perhaps complicates this a bit... and there are still some cases where
caches aren't invalidated properly.
-- brion vibber (brion @ pobox.com)
Hello,
I just subscribed (I'm the Wikipedia user At18) to ask about the
automatic HTML dump function. I see from the database page that it's
"in development".
If anyone is interested, I have a rudimentary Perl script that is
capable of reading the downloadable SQL dump and outputting all the
articles as separate files in a number of alphabetical directories.
It's not very fast, but it works.
What's missing from the script: wikimarkup -> HTML conversion, some
intelligence to autodetect redirects, dealing with images, and so on. I
don't know if someone is in charge of this function. If so, I can post
the script. Otherwise, I can develop it further myself, given some
directions.
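A rough Python sketch of the splitting step, assuming the articles have
already been pulled out of the SQL dump as (title, text) pairs (the
directory scheme and names are illustrative; the real script is Perl):

import os

def dump_articles(articles, outdir="dump"):
    for title, text in articles:
        # One subdirectory per leading letter; everything else goes
        # under "_".
        letter = title[0].upper() if title[:1].isalpha() else "_"
        subdir = os.path.join(outdir, letter)
        os.makedirs(subdir, exist_ok=True)
        safe = title.replace("/", "_")  # keep titles filesystem-safe
        with open(os.path.join(subdir, safe + ".txt"), "w") as f:
            f.write(text)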
Alfio
Hi,
I am currently working on a small Python library that allows reading
and writing articles. Among its features is (semi-)automatic detection
of copyright violations.
When an article is written by the script, I noticed that the IP address
of my machine shows up in the log. I would prefer my username to show
up, so that everyone can contact me easily. Hence my question: what
parameters do I have to send to /w/wiki.phtml in order to appear with
my username?
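My best guess so far, sketched in Python (hedged: the "wp..." field
names are read off the login form and may be wrong), is that the edit
has to carry the session cookies from a prior login:

import urllib.request, urllib.parse, http.cookiejar

# Keep cookies across requests so the login session survives.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(jar))

# Guessed form fields for the login page; they may well differ.
login = urllib.parse.urlencode({
    "wpName": "MyUserName",
    "wpPassword": "secret",
    "wpLoginattempt": "Log in",
}).encode()
opener.open("http://www.wikipedia.org/w/wiki.phtml"
            "?title=Special:Userlogin&action=submit", login)

# Any subsequent edit POST made through `opener` now carries the
# session cookies, so it should be attributed to the account.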
Any help with this would be very much appreciated,
thanks in advance,
Marco
P.S. I'm not sure this is the right list to ask: could someone please
obscure the email addresses in the list archives? The problem is that
when replying, many people quote the full email address, which then
shows up in the archive; see
http://mail.wikipedia.org/pipermail/wikipedia-l/2003-April/009828.html
for an example. I already have enough spam in my inbox :-(
> Ok, cool!
> It's not urgent... anyhow, the misspelling page doesn't work well on
> the French Wikipedia ;o)
> It doesn't check accents, even though accent errors (along with
> double consonants) are the most common misspellings in French.
> Can you add the misspelling function to "class Language" so that we
> can make a different version for "class LanguageFr"?
> Regards,
>
> Aoineko
Sorry, I just think there is perhaps an easier way.
I don't know exactly what this is used for:
"linktrail" => "/^([a-zàâçéèêîôû]+)(.*)\$/sD",
But perhaps we can use a different filter (is there one?) for the
misspelling page.
For the search page -> we want to find pages even if we forget an
accent.
For the misspelling page -> we want to find pages with accent
misspellings.
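For illustration, here is my understanding of the linktrail pattern as
a Python sketch (hedged; the real code is PHP): letters matching the
pattern right after a closing ]] are absorbed into the displayed link
text, so "[[jour]]s" renders as one link reading "jours".

import re

LINKTRAIL = re.compile(r'^([a-zàâçéèêîôû]+)(.*)$', re.S)

def absorb_trail(link_text, after):
    # Pull leading trail letters off the text that follows a link.
    m = LINKTRAIL.match(after)
    if m:
        return link_text + m.group(1), m.group(2)
    return link_text, after

print(absorb_trail("jour", "s restants"))  # -> ('jours', ' restants')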
Another question:
I have developer status on CVS. If I finally succeed in connecting to
CVS from my office, I will be able to make my own updates. Whom should
I ask before doing any update? This list? Where can I find the "todo
list"? I'm a video-game programmer, so I'm more efficient at hard-core
coding than at database programming.
Regards,
Aoineko
I've changed the string for the ISBN replacement page from "WIKI-ISBN"
to "MAGICNUMBER" to avoid triggering the "ISBN .." parser itself on
that page. Otherwise you get into trouble when you try something like
"ISBN WIKI-ISBN".
Regards,
Erik