I am trying to create an offline Wikipedia client for the Wikipedia XML dump. I know there are a lot of programs out there, but all of them seem to render pages very badly, because the wiki markup has evolved considerably while these programs are quite outdated and almost dead.
After scanning the internet since yesterday, I have come up with a number of libraries and programs, but none of them renders the page perfectly. Hence, I was toying with the idea of rendering the page using MediaWiki's PHP files, as the people at woc.fslab.de (Offline Wikipedia Client) have done. I have downloaded the Offline Wikipedia Client but haven't yet been able to figure out how to use it. Anyway, it looks too complicated and overly large. I want an offline client of my own that would be served over HTTP (I have tried importing the dump into a freshly installed MediaWiki, but rebuilding the links takes forever, and WikiFilter is not for Linux - though I tried running it under Wine).
So, my question is: can anyone please guide me to the PHP files of MediaWiki that I can use with little modification? I intend to provide all the necessary details - the list of templates and their code to substitute, the categories, the article markup, etc. - as input to the PHP file, and I expect to get back the HTML that can be sent to the user's browser.
All this may look very pointless, but after battering my brains over this thing and repeatedly getting disappointing results, my brain has gone fuzzy and desperate.
May you have peace of mind.
Regards, Apple Grew. My blog: http://applegrew.blogspot.com/
Apple Grew wrote:
So, my question is: can anyone please guide me to the PHP files of MediaWiki that I can use with little modification? I intend to provide all the necessary details - the list of templates and their code to substitute, the categories, the article markup, etc. - as input to the PHP file, and I expect to get back the HTML that can be sent to the user's browser.
I would modify the output classes, and take a look at the ?action=render parameter.
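For example, something like this should pull just the rendered article body, without the skin around it (the host and path are placeholders for a local install; untested):

<?php
// Fetch only the parsed body of a page via ?action=render.
// Adjust the base URL to wherever your MediaWiki lives.
$title = 'Main_Page';
$url   = 'http://localhost/wiki/index.php?title=' . urlencode( $title ) . '&action=render';
echo file_get_contents( $url );  // plain HTML fragment, no skin chrome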
Marco
I have now downloaded the Offline Client (http://woc.fslab.de/). It seems that they have patched MediaWiki 1.7. Hopefully, if we create a similarly patched 1.12alpha, we could get rendering as on the official pages.
On Feb 19, 2008 12:03 AM, Marco Schuster marco@harddisk.is-a-geek.org wrote:
Apple Grew wrote:
So, my question is: can anyone please guide me to the PHP files of MediaWiki that I can use with little modification? I intend to provide all the necessary details - the list of templates and their code to substitute, the categories, the article markup, etc. - as input to the PHP file, and I expect to get back the HTML that can be sent to the user's browser.
I would modify the output classes, and take a look at the ?action=render parameter.
Marco
What are these output classes?
On Feb 18, 2008 12:35 PM, Apple Grew applegrew@gmail.com wrote:
So, my question is: can anyone please guide me to the PHP files of MediaWiki that I can use with little modification? I intend to provide all the necessary details - the list of templates and their code to substitute, the categories, the article markup, etc. - as input to the PHP file, and I expect to get back the HTML that can be sent to the user's browser.
Basically, right now it's all integrated, no serious attempt has been made to separate it all out. A lot of the major parts are in includes/Parser.php, which calls many other files, such as Linker.php, ParserOptions.php, Title.php, User.php, and so on. The main entry point is Parser::parse(). I've never tried to sort it out myself, but apparently if you really know what you're doing it's not too hard to get it to work as desired without a web request.
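A minimal sketch of that entry point, assuming the MediaWiki environment has already been set up (e.g. from a maintenance script) - untested:

<?php
// Needs Title, Parser and ParserOptions, i.e. a bootstrapped MediaWiki.
$wikitext = "'''Hello''' [[world]]";
$title    = Title::newFromText( 'Sandbox' );
$parser   = new Parser();
$output   = $parser->parse( $wikitext, $title, new ParserOptions() );
echo $output->getText();  // the rendered HTML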
All this may look very pointless, but after battering my brains over this thing and repeatedly getting disappointing results, my brain has gone fuzzy and desperate.
If it's pointless, a lot of people have spent a lot of time doing something pointless. :) We should really decouple the parser somewhat more from the rest of the code and have it accessible standalone, to the extent possible. This seems to be a *very* common need, that wouldn't be too hard to address. Maybe someone could write up a maintenance script that will just accept wikitext on stdin and output HTML to stdout, with anything else necessary (e.g., title) passed as a command-line option. Ideally this wouldn't need a database, either, although in practice you'd need some amount of configuration (where should URLs point, what namespaces exist, . . .), which could hopefully be averted with sensible defaults.
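Something like this rough sketch, say (a made-up script, not part of MediaWiki; in this form it still bootstraps via commandLine.inc, so LocalSettings.php and a database are not avoided yet):

<?php
// parseStdin.php (hypothetical): read wikitext on stdin, print HTML on stdout.
// Drop it into maintenance/ so commandLine.inc can set up the environment;
// the page title can be passed as the first command-line argument.
require_once dirname( __FILE__ ) . '/commandLine.inc';

$wikitext = file_get_contents( 'php://stdin' );
$title    = Title::newFromText( isset( $args[0] ) ? $args[0] : 'Stdin' );

$parser = new Parser();
$output = $parser->parse( $wikitext, $title, new ParserOptions() );
print $output->getText();

Usage would then be something like: echo "[[Hello]]" | php maintenance/parseStdin.php Sandbox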
I'm not volunteering to write it, though.
Simetrical wrote:
Maybe someone could write up a maintenance script that will just accept wikitext on stdin and output HTML to stdout, with anything else necessary (e.g., title) passed as a command-line option. Ideally this wouldn't need a database, either, although in practice you'd need some amount of configuration (where should URLs point, what namespaces exist, . . .), which could hopefully be averted with sensible defaults.
This more or less exists already in the API:
http://en.wikipedia.org/w/api.php?action=parse&text=%5B%5Bhello%5D%5D&am...
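Against a local install, the same call looks roughly like this (the host is a placeholder):

<?php
// Ask api.php to parse a snippet of wikitext; with format=xml the
// rendered HTML comes back inside the <text> element of the response.
$params = array(
    'action' => 'parse',
    'text'   => '[[hello]]',
    'format' => 'xml',
);
echo file_get_contents( 'http://localhost/wiki/api.php?' . http_build_query( $params ) );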
Roan Kattouw (Catrope)
On Feb 19, 2008 3:25 AM, Roan Kattouw roan.kattouw@home.nl wrote:
This more or less exists already in the API:
http://en.wikipedia.org/w/api.php?action=parse&text=%5B%5Bhello%5D%5D&am...
Roan Kattouw (Catrope)
The problem with this is that it needs an install of MediaWiki with a working database. The database must also have the necessary template pages in it. If we try to use the API of the official website, then we need a working internet connection for that (at least while parsing the XML file, not always), plus it is pointless, as the XML file already contains the template information.
------Unrelated to the above issue------ One major problem with all the current implementations is that they simply abandon unresolvable things, like the pictures. It is a little complicated, but what can be done is to build a list of unresolved items (such as missing pictures) while indexing the dump. When rendering a page, if a picture is not on the local disk, resolve it to its online link and, in parallel, also download the picture to the local disk. Next time there is no need to link to the online content.
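A sketch of that idea (the function name and cache layout are made up; Special:FilePath is just one way to reach the online file):

<?php
// Use the locally cached copy of an image if we have it; otherwise point
// the page at the online file and queue it for download, so the next
// render can use the local copy.
function resolveImage( $name, $cacheDir, array &$downloadQueue ) {
    $local = $cacheDir . '/' . $name;
    if ( file_exists( $local ) ) {
        return $local;
    }
    $downloadQueue[] = $name;  // fetch it in the background later
    return 'http://commons.wikimedia.org/wiki/Special:FilePath/' . rawurlencode( $name );
}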
On Feb 19, 2008 3:04 AM, Apple Grew applegrew@gmail.com wrote:
On Feb 19, 2008 3:25 AM, Roan Kattouw roan.kattouw@home.nl wrote:
This more or less exists already in the API:
http://en.wikipedia.org/w/api.php?action=parse&text=%5B%5Bhello%5D%5D&am...
Roan Kattouw (Catrope)
The problem with this is that it needs an install of MediaWiki with a working database. The database must also have the necessary template pages in it. If we try to use the API of the official website, then we need a working internet connection for that (at least while parsing the XML file, not always), plus it is pointless, as the XML file already contains the template information.
One approach I took some time ago was to alter the database access script. As a quick hack, use regexp to find queries that want text or data, then return bogus data (where it's unimportant for the rendering) or text (retrieve from XML dump). Ignore anything that doesn't start with "SELECT" ;-)
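To give the flavour of it, a filter along these lines (an untested sketch; answerFromDump() and $dump->textForRevision() are made-up names, not MediaWiki code or the original hack, and a real version has to hand proper result objects back to the Database class):

<?php
// Catch SELECTs that ask for revision text and answer them from the XML
// dump; feed harmless dummy rows to everything else.
function answerFromDump( $sql, $dump ) {
    if ( !preg_match( '/^\s*SELECT/i', $sql ) ) {
        return array();                               // not a SELECT: ignore it
    }
    if ( preg_match( '/\bold_text\b/i', $sql ) &&
         preg_match( '/old_id\s*=\s*(\d+)/i', $sql, $m ) ) {
        // a text-table query: serve the revision text from the dump
        return array( array( 'old_text' => $dump->textForRevision( $m[1] ) ) );
    }
    return array( array() );                          // bogus data, rendering won't care
}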
However, it turned out that access in bzipped files was way too slow, unzipped data was way too large to be of use, and re-indexing would take ages. I even tried sqlite, which bogged down. Maybe sqlite3 does better these days.
Magnus
Hey thanks for the tip. I tried grepping using the following command grep -Rl "SELECT" .|grep -v "/.svn/"|grep -v "/docs/"|grep -v "/maintenance/"
and got the list below. It's really long. I have excluded the maintenance and docs folders completely. The files in the includes directory are the first place I would be looking.
./extensions/CrossNamespaceLinks/SpecialCrossNamespaceLinks_body.php ./extensions/CategoryTree/CategoryTreeFunctions.php ./includes/SearchTsearch2.php ./includes/SpecialAncientpages.php ./includes/SpecialLonelypages.php ./includes/SpecialWithoutinterwiki.php ./includes/ImagePage.php ./includes/SearchOracle.php ./includes/Export.php ./includes/SpecialUncategorizedpages.php ./includes/SpecialRecentchanges.php ./includes/SpecialMostlinked.php ./includes/Block.php ./includes/Sanitizer.php ./includes/SpecialRecentchangeslinked.php ./includes/SpecialWantedcategories.php ./includes/FileStore.php ./includes/LinkCache.php ./includes/SpecialUnusedcategories.php ./includes/SpecialDeadendpages.php ./includes/BagOStuff.php ./includes/SpecialShortpages.php ./includes/SpecialFewestrevisions.php ./includes/filerepo/File.php ./includes/filerepo/ICRepo.php ./includes/filerepo/LocalFile.php ./includes/SpecialUnusedimages.php ./includes/QueryPage.php ./includes/SiteStats.php ./includes/SpecialUnwatchedpages.php ./includes/Parser.php ./includes/SpecialExport.php ./includes/DatabaseOracle.php ./includes/Parser_OldPP.php ./includes/SearchPostgres.php ./includes/SpecialMostcategories.php ./includes/SpecialListredirects.php ./includes/SpecialLog.php ./includes/SpecialMostlinkedtemplates.php ./includes/Title.php ./includes/SpecialDisambiguations.php ./includes/SpecialDoubleRedirects.php ./includes/SkinTemplate.php ./includes/SpecialRandompage.php ./includes/SpecialMIMEsearch.php ./includes/SpecialPopularpages.php ./includes/LinkBatch.php ./includes/SpecialWantedpages.php ./includes/api/ApiQueryRecentChanges.php ./includes/Database.php ./includes/SpecialMostlinkedcategories.php ./includes/SpecialMostimages.php ./includes/Skin.php ./includes/SpecialBrokenRedirects.php ./includes/SpecialWatchlist.php ./includes/SearchMySQL.php ./includes/DatabasePostgres.php ./includes/SpecialNewimages.php ./includes/SpecialUnusedtemplates.php ./includes/SpecialMostrevisions.php ./includes/Categoryfinder.php ./includes/SpecialAllmessages.php ./includes/SpecialNewpages.php ./includes/SpecialUncategorizedimages.php ./includes/SpecialUpload.php ./includes/LinksUpdate.php ./includes/Article.php ./includes/WatchlistEditor.php ./skins/disabled/MonoBookCBT.php ./config/index.php ./tests/DatabaseTest.php ./tests/MediaWiki_TestCase.php ./profileinfo.php
On 2/19/08, Magnus Manske magnusmanske@googlemail.com wrote:
On Feb 19, 2008 3:04 AM, Apple Grew applegrew@gmail.com wrote:
On Feb 19, 2008 3:25 AM, Roan Kattouw roan.kattouw@home.nl wrote:
This more or less exists already in the API:
http://en.wikipedia.org/w/api.php?action=parse&text=%5B%5Bhello%5D%5D&am...
Roan Kattouw (Catrope)
The problem with this is that it needs an install of MediaWiki with a working database. The database must also have the necessary template pages in it. If we try to use the API of the official website, then we need a working internet connection for that (at least while parsing the XML file, not always), plus it is pointless, as the XML file already contains the template information.
One approach I took some time ago was to alter the database access script. As a quick hack, use regexp to find queries that want text or data, then return bogus data (where it's unimportant for the rendering) or text (retrieve from XML dump). Ignore anything that doesn't start with "SELECT" ;-)
However, it turned out that access in bzipped files was way too slow, unzipped data was way too large to be of use, and re-indexing would take ages. I even tried sqlite, which bogged down. Maybe sqlite3 does better these days.
Magnus
On Feb 19, 2008 1:26 PM, Apple Grew applegrew@gmail.com wrote:
Hey thanks for the tip. I tried grepping using the following command grep -Rl "SELECT" .|grep -v "/.svn/"|grep -v "/docs/"|grep -v "/maintenance/"
and got the list below. It's really long. I have excluded the maintenance and docs folders completely. The files in the includes directory are the first place I would be looking.
. . .
I think he meant alter the Database class, so that it intercepted queries that were being made by the parser and substituted some fake answer. Searching for every single SELECT made anywhere in MediaWiki code will give you a rather long list, yes.
Simetrical wrote:
On Feb 19, 2008 1:26 PM, Apple Grew applegrew@gmail.com wrote:
Hey thanks for the tip. I tried grepping using the following command grep -Rl "SELECT" .|grep -v "/.svn/"|grep -v "/docs/"|grep -v "/maintenance/"
and got the list below. It's really long. I have excluded the maintenance and docs folders completely. The files in the includes directory are the first place I would be looking.
. . .
I think he meant alter the Database class, so that it intercepted queries that were being made by the parser and substituted some fake answer. Searching for every single SELECT made anywhere in MediaWiki code will give you a rather long list, yes.
I once looked at it, too. But I think it'd be easier to replace Title. Database works at a level that is too low.
On Tue, Feb 19, 2008 at 6:59 PM, Simetrical Simetrical+wikilist@gmail.com wrote:
On Feb 19, 2008 1:26 PM, Apple Grew applegrew@gmail.com wrote:
Hey thanks for the tip. I tried grepping using the following command grep -Rl "SELECT" .|grep -v "/.svn/"|grep -v "/docs/"|grep -v "/maintenance/"
and got the list below. It's really long. I have excluded the maintenance and docs folders completely. The files in the includes directory are the first place I would be looking.
. . .
I think he meant alter the Database class, so that it intercepted queries that were being made by the parser and substituted some fake answer. Searching for every single SELECT made anywhere in MediaWiki code will give you a rather long list, yes.
Unless you want to have all the interface functionality of MediaWiki (which you don't, apparently), ./includes/Database.php should be all you need.
IIRC it was about 5 or so methods to tweak to get a page to render.
Magnus
As far as rendering wiki markup is concerned, how is Database.php going to help?
Hi, Magnus, is wiki2xml (http://svn.wikimedia.org/viewvc/mediawiki/trunk/wiki2xml/php/) your project? Its output was pretty encouraging, but it seems it has some issues with rendering the {{wikipedia}} template. Use the following url to see the problem I am talking about.
http://tools.wikimedia.de/~magnus/wiki2xml/w2x.php?doit=1&whatsthis=wiki...
And what does the Use API checkbox do? When I check it, the templates are not rendered at all.
BTW ------------ This is off-topic ------------ Are attachments allowed in the mailing list?
Apple Grew wrote:
Are attachments allowed in the mailing list?
Most attachments are automatically stripped for security reasons.
You can, of course, link to the file uploaded elsewhere.
MinuteElectron.
On Wed, Feb 20, 2008 at 5:26 PM, Apple Grew applegrew@gmail.com wrote:
As far as rendering wiki markup is concerned, how is Database.php going to help?
1. Tweak the database class to grab content from your local source instead of a database.
2. Have MediaWiki do the rendering for you by simply requesting the article via HTTP.
3. Profit!
Hi, Magnus, is wiki2xml (http://svn.wikimedia.org/viewvc/mediawiki/trunk/wiki2xml/php/) your project?
Yes.
Its output was pretty encouraging, but it seems it has some issues with rendering the {{wikipedia}} template. Use the following url to see the problem I am talking about.
http://tools.wikimedia.de/~magnus/wiki2xml/w2x.php?doit=1&whatsthis=wiki...
Internally, wikitext always gets converted to XML first, then that gets converted to the actual output format. Here, you're apparently running into two problems:
* XHTML output is a little shaky ATM
* Something doesn't parse right, as you can see when you switch to XML output (at the end, there's a "<span>" as plain text)
I'll look into that.
And what does the Use API checkbox do? When I check it, the templates are not rendered at all.
This option works when you enter a list of articles, instead of raw wikitext; it will then use the API of the given MediaWiki installation to do the template replacement, which is faster and more reliable.
Magnus
On Thu, Feb 21, 2008 at 3:23 PM, Magnus Manske magnusmanske@googlemail.com wrote:
On Wed, Feb 20, 2008 at 5:26 PM, Apple Grew applegrew@gmail.com wrote:
As far as rendering wiki markup is concerned, how is Database.php going to help?
1. Tweak the database class to grab content from your local source instead of a database.
2. Have MediaWiki do the rendering for you by simply requesting the article via HTTP.
3. Profit!
Hi, Magnus, is wiki2xml (http://svn.wikimedia.org/viewvc/mediawiki/trunk/wiki2xml/php/) your project?
Yes.
Its output was pretty encouraging, but it seems it has some issues with rendering the {{wikipedia}} template. Use the following url to see the problem I am talking about.
http://tools.wikimedia.de/~magnus/wiki2xml/w2x.php?doit=1&whatsthis=wiki...
Internally, wikitext always gets converted to XML first, then that gets converted to the actual output format. Here, you're apparently running into two problems:
* XHTML output is a little shaky ATM
* Something doesn't parse right, as you can see when you switch to XML output (at the end, there's a "<span>" as plain text)
I'll look into that.
And what does the Use API checkbox do? When I check it, the templates are not rendered at all.
This option works when you enter a list of articles, instead of raw wikitext; it will then use the API of the given MediaWiki installation to do the template replacement, which is faster and more reliable.
Magnus
Thanks for the reply; the database trick looks encouraging. Currently I am trying to use the Bliki (Java) library to do the rendering. This is a better option for me, as I am also programming my application in Java.
However, it turned out that access in bzipped files was way too slow, unzipped data was way too large to be of use, and re-indexing would take ages. I even tried sqlite, which bogged down. Maybe sqlite3 does better these days.
The KDE app for reading the wiki dump does exactly this - it reads directly from bz2 files - and it is not slow. URL: http://www.kde-apps.org/content/show.php?content=65244