I am trying to create an offline Wikipedia client for the Wikipedia XML dump. I know there are a lot of programs out there, but all of them seem to render pages very badly, because the wiki markup has evolved considerably while these programs are quite outdated and almost dead.
After scanning the internet since yesterday, I have come up with a number of libraries and programs, but none of them renders the page perfectly. Hence, I was toying with the idea of rendering the page using MediaWiki's PHP files, as the people at woc.fslab.de (Offline Wikipedia Client) have done. I have downloaded the Offline Wikipedia Client but haven't yet been able to figure out how to use it. Anyway, it looks too complicated and overly large. I want an offline client of my own that would be served over HTTP (I have tried importing the dump into a freshly installed MediaWiki, but rebuilding the links takes forever, and WikiFilter is not for Linux - though I tried running it under Wine).
So, my question is: can anyone please guide me to the PHP files of MediaWiki that I can use with little modification? I intend to provide all the necessary details - the list of templates and their code to substitute, the categories, the article markup, etc. - as input to the PHP file, and I expect to get back the HTML that can be sent to the user's browser.
All this may look very pointless, but after battering my brains over this thing and repeatedly getting disappointing results, my brain has gone fuzzy and desperate.
May you have peace of mind.
Regards, Apple Grew. My blog: http://applegrew.blogspot.com/
Apple Grew wrote:
So, my question is: can anyone please guide me to the PHP files of MediaWiki that I can use with little modification? I intend to provide all the necessary details - the list of templates and their code to substitute, the categories, the article markup, etc. - as input to the PHP file, and I expect to get back the HTML that can be sent to the user's browser.
I would modify the output classes, and take a look at the ?action=render parameter.
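For example, something like this should pull just the rendered article body, without the skin around it (the host and path are placeholders for a local install; untested):

<?php
// Fetch only the parsed body of a page via ?action=render.
// Adjust the base URL to wherever your MediaWiki lives.
$title = 'Main_Page';
$url   = 'http://localhost/wiki/index.php?title=' . urlencode( $title ) . '&action=render';
echo file_get_contents( $url );  // plain HTML fragment, no skin chrome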
Marco
I have now downloaded the Offline Client (http://woc.fslab.de/). It seems that they have patched MediaWiki 1.7. Hopefully, if we create a similarly patched 1.12alpha, we could get rendering as on the official pages.
On Feb 19, 2008 12:03 AM, Marco Schuster marco@harddisk.is-a-geek.org wrote:
Apple Grew wrote:
So, my question is: can anyone please guide me to the PHP files of MediaWiki that I can use with little modification? I intend to provide all the necessary details - the list of templates and their code to substitute, the categories, the article markup, etc. - as input to the PHP file, and I expect to get back the HTML that can be sent to the user's browser.
I would modify the output classes, and take a look at the ?action=render parameter.
Marco
What are these output classes?
On Feb 18, 2008 12:35 PM, Apple Grew applegrew@gmail.com wrote:
So, my question is: can anyone please guide me to the PHP files of MediaWiki that I can use with little modification? I intend to provide all the necessary details - the list of templates and their code to substitute, the categories, the article markup, etc. - as input to the PHP file, and I expect to get back the HTML that can be sent to the user's browser.
Basically, right now it's all integrated, no serious attempt has been made to separate it all out. A lot of the major parts are in includes/Parser.php, which calls many other files, such as Linker.php, ParserOptions.php, Title.php, User.php, and so on. The main entry point is Parser::parse(). I've never tried to sort it out myself, but apparently if you really know what you're doing it's not too hard to get it to work as desired without a web request.
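A minimal sketch of that entry point, assuming the MediaWiki environment has already been set up (e.g. from a maintenance script) - untested:

<?php
// Needs Title, Parser and ParserOptions, i.e. a bootstrapped MediaWiki.
$wikitext = "'''Hello''' [[world]]";
$title    = Title::newFromText( 'Sandbox' );
$parser   = new Parser();
$output   = $parser->parse( $wikitext, $title, new ParserOptions() );
echo $output->getText();  // the rendered HTML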
All this may look very pointless, but after battering my brains over this thing and repeatedly getting disappointing results, my brain has gone fuzzy and desperate.
If it's pointless, a lot of people have spent a lot of time doing something pointless. :) We should really decouple the parser somewhat more from the rest of the code and have it accessible standalone, to the extent possible. This seems to be a *very* common need, that wouldn't be too hard to address. Maybe someone could write up a maintenance script that will just accept wikitext on stdin and output HTML to stdout, with anything else necessary (e.g., title) passed as a command-line option. Ideally this wouldn't need a database, either, although in practice you'd need some amount of configuration (where should URLs point, what namespaces exist, . . .), which could hopefully be averted with sensible defaults.
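Something like this rough sketch, say (a made-up script, not part of MediaWiki; in this form it still bootstraps via commandLine.inc, so LocalSettings.php and a database are not avoided yet):

<?php
// parseStdin.php (hypothetical): read wikitext on stdin, print HTML on stdout.
// Drop it into maintenance/ so commandLine.inc can set up the environment;
// the page title can be passed as the first command-line argument.
require_once dirname( __FILE__ ) . '/commandLine.inc';

$wikitext = file_get_contents( 'php://stdin' );
$title    = Title::newFromText( isset( $args[0] ) ? $args[0] : 'Stdin' );

$parser = new Parser();
$output = $parser->parse( $wikitext, $title, new ParserOptions() );
print $output->getText();

Usage would then be something like: echo "[[Hello]]" | php maintenance/parseStdin.php Sandbox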
I'm not volunteering to write it, though.
Simetrical wrote:
Maybe someone could write up a maintenance script that will just accept wikitext on stdin and output HTML to stdout, with anything else necessary (e.g., title) passed as a command-line option. Ideally this wouldn't need a database, either, although in practice you'd need some amount of configuration (where should URLs point, what namespaces exist, . . .), which could hopefully be averted with sensible defaults.
This more or less exists already in the API:
http://en.wikipedia.org/w/api.php?action=parse&text=%5B%5Bhello%5D%5D&am...
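Against a local install, the same call looks roughly like this (the host is a placeholder):

<?php
// Ask api.php to parse a snippet of wikitext; with format=xml the
// rendered HTML comes back inside the <text> element of the response.
$params = array(
    'action' => 'parse',
    'text'   => '[[hello]]',
    'format' => 'xml',
);
echo file_get_contents( 'http://localhost/wiki/api.php?' . http_build_query( $params ) );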
Roan Kattouw (Catrope)
On Feb 19, 2008 3:25 AM, Roan Kattouw roan.kattouw@home.nl wrote:
This more or less exists already in the API:
http://en.wikipedia.org/w/api.php?action=parse&text=%5B%5Bhello%5D%5D&am...
Roan Kattouw (Catrope)
The problem with this is that it needs an install of MediaWiki with a working database. The database must also have the necessary template pages in it. If we try to use the API of the official website, then we need a working internet connection for that (at least while parsing the XML file, not always), plus it is pointless, as the XML file already contains the template information.
------Unrelated to the above issue------ One major problem with all the current implementations is that they simply abandon unresolvable things, like the pictures. It is a little complicated, but what can be done is to build a list of unresolved items (such as missing pictures) while indexing the dump. When rendering a page, if a picture is not on the local disk, resolve it to its online link and, in parallel, also download the picture to the local disk. Next time there is no need to link to the online content.
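A sketch of that idea (the function name and cache layout are made up; Special:FilePath is just one way to reach the online file):

<?php
// Use the locally cached copy of an image if we have it; otherwise point
// the page at the online file and queue it for download, so the next
// render can use the local copy.
function resolveImage( $name, $cacheDir, array &$downloadQueue ) {
    $local = $cacheDir . '/' . $name;
    if ( file_exists( $local ) ) {
        return $local;
    }
    $downloadQueue[] = $name;  // fetch it in the background later
    return 'http://commons.wikimedia.org/wiki/Special:FilePath/' . rawurlencode( $name );
}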
On Feb 19, 2008 3:04 AM, Apple Grew applegrew@gmail.com wrote:
On Feb 19, 2008 3:25 AM, Roan Kattouw roan.kattouw@home.nl wrote:
This more or less exists already in the API:
http://en.wikipedia.org/w/api.php?action=parse&text=%5B%5Bhello%5D%5D&am...
Roan Kattouw (Catrope)
The problem with this is that it needs an install of MediaWiki with a working database. The database must also have the necessary template pages in it. If we try to use the API of the official website, then we need a working internet connection for that (at least while parsing the XML file, not always), plus it is pointless, as the XML file already contains the template information.
One approach I took some time ago was to alter the database access script. As a quick hack, use regexp to find queries that want text or data, then return bogus data (where it's unimportant for the rendering) or text (retrieve from XML dump). Ignore anything that doesn't start with "SELECT" ;-)
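To give the flavour of it, a filter along these lines (an untested sketch; answerFromDump() and $dump->textForRevision() are made-up names, not MediaWiki code or the original hack, and a real version has to hand proper result objects back to the Database class):

<?php
// Catch SELECTs that ask for revision text and answer them from the XML
// dump; feed harmless dummy rows to everything else.
function answerFromDump( $sql, $dump ) {
    if ( !preg_match( '/^\s*SELECT/i', $sql ) ) {
        return array();                               // not a SELECT: ignore it
    }
    if ( preg_match( '/\bold_text\b/i', $sql ) &&
         preg_match( '/old_id\s*=\s*(\d+)/i', $sql, $m ) ) {
        // a text-table query: serve the revision text from the dump
        return array( array( 'old_text' => $dump->textForRevision( $m[1] ) ) );
    }
    return array( array() );                          // bogus data, rendering won't care
}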
However, it turned out that access in bzipped files was way too slow, unzipped data was way too large to be of use, and re-indexing would take ages. I even tried sqlite, which bogged down. Maybe sqlite3 does better these days.
Magnus
Hey thanks for the tip. I tried grepping using the following command grep -Rl "SELECT" .|grep -v "/.svn/"|grep -v "/docs/"|grep -v "/maintenance/"
and got the list below. It's really long. I have excluded the maintenance and docs folders completely. The files in the includes directory are the first place I would be looking.
./extensions/CrossNamespaceLinks/SpecialCrossNamespaceLinks_body.php ./extensions/CategoryTree/CategoryTreeFunctions.php ./includes/SearchTsearch2.php ./includes/SpecialAncientpages.php ./includes/SpecialLonelypages.php ./includes/SpecialWithoutinterwiki.php ./includes/ImagePage.php ./includes/SearchOracle.php ./includes/Export.php ./includes/SpecialUncategorizedpages.php ./includes/SpecialRecentchanges.php ./includes/SpecialMostlinked.php ./includes/Block.php ./includes/Sanitizer.php ./includes/SpecialRecentchangeslinked.php ./includes/SpecialWantedcategories.php ./includes/FileStore.php ./includes/LinkCache.php ./includes/SpecialUnusedcategories.php ./includes/SpecialDeadendpages.php ./includes/BagOStuff.php ./includes/SpecialShortpages.php ./includes/SpecialFewestrevisions.php ./includes/filerepo/File.php ./includes/filerepo/ICRepo.php ./includes/filerepo/LocalFile.php ./includes/SpecialUnusedimages.php ./includes/QueryPage.php ./includes/SiteStats.php ./includes/SpecialUnwatchedpages.php ./includes/Parser.php ./includes/SpecialExport.php ./includes/DatabaseOracle.php ./includes/Parser_OldPP.php ./includes/SearchPostgres.php ./includes/SpecialMostcategories.php ./includes/SpecialListredirects.php ./includes/SpecialLog.php ./includes/SpecialMostlinkedtemplates.php ./includes/Title.php ./includes/SpecialDisambiguations.php ./includes/SpecialDoubleRedirects.php ./includes/SkinTemplate.php ./includes/SpecialRandompage.php ./includes/SpecialMIMEsearch.php ./includes/SpecialPopularpages.php ./includes/LinkBatch.php ./includes/SpecialWantedpages.php ./includes/api/ApiQueryRecentChanges.php ./includes/Database.php ./includes/SpecialMostlinkedcategories.php ./includes/SpecialMostimages.php ./includes/Skin.php ./includes/SpecialBrokenRedirects.php ./includes/SpecialWatchlist.php ./includes/SearchMySQL.php ./includes/DatabasePostgres.php ./includes/SpecialNewimages.php ./includes/SpecialUnusedtemplates.php ./includes/SpecialMostrevisions.php ./includes/Categoryfinder.php ./includes/SpecialAllmessages.php ./includes/SpecialNewpages.php ./includes/SpecialUncategorizedimages.php ./includes/SpecialUpload.php ./includes/LinksUpdate.php ./includes/Article.php ./includes/WatchlistEditor.php ./skins/disabled/MonoBookCBT.php ./config/index.php ./tests/DatabaseTest.php ./tests/MediaWiki_TestCase.php ./profileinfo.php
On 2/19/08, Magnus Manske magnusmanske@googlemail.com wrote:
On Feb 19, 2008 3:04 AM, Apple Grew applegrew@gmail.com wrote:
On Feb 19, 2008 3:25 AM, Roan Kattouw roan.kattouw@home.nl wrote:
This more or less exists already in the API:
http://en.wikipedia.org/w/api.php?action=parse&text=%5B%5Bhello%5D%5D&am...
Roan Kattouw (Catrope)
The problem with this is that it needs an install of MediaWiki with a working database. The database must also have the necessary template pages in it. If we try to use the API of the official website, then we need a working internet connection for that (at least while parsing the XML file, not always), plus it is pointless, as the XML file already contains the template information.
One approach I took some time ago was to alter the database access script. As a quick hack, use regexp to find queries that want text or data, then return bogus data (where it's unimportant for the rendering) or text (retrieve from XML dump). Ignore anything that doesn't start with "SELECT" ;-)
However, it turned out that access in bzipped files was way too slow, unzipped data was way too large to be of use, and re-indexing would take ages. I even tried sqlite, which bogged down. Maybe sqlite3 does better these days.
Magnus
On Feb 19, 2008 1:26 PM, Apple Grew applegrew@gmail.com wrote:
Hey thanks for the tip. I tried grepping using the following command grep -Rl "SELECT" .|grep -v "/.svn/"|grep -v "/docs/"|grep -v "/maintenance/"
and got the list below. It's really long. I have excluded the maintenance and docs folders completely. The files in the includes directory are the first place I would be looking.
. . .
I think he meant alter the Database class, so that it intercepted queries that were being made by the parser and substituted some fake answer. Searching for every single SELECT made anywhere in MediaWiki code will give you a rather long list, yes.
Simetrical wrote:
On Feb 19, 2008 1:26 PM, Apple Grew applegrew@gmail.com wrote:
Hey thanks for the tip. I tried grepping using the following command grep -Rl "SELECT" .|grep -v "/.svn/"|grep -v "/docs/"|grep -v "/maintenance/"
and got the list below. It's really long. I have excluded the maintenance and docs folders completely. The files in the includes directory are the first place I would be looking.
. . .
I think he meant alter the Database class, so that it intercepted queries that were being made by the parser and substituted some fake answer. Searching for every single SELECT made anywhere in MediaWiki code will give you a rather long list, yes.
I once looked at it, too. But I think it'd be easier to replace Title. Database works at a level that is too low.
On Tue, Feb 19, 2008 at 6:59 PM, Simetrical Simetrical+wikilist@gmail.com wrote:
On Feb 19, 2008 1:26 PM, Apple Grew applegrew@gmail.com wrote:
Hey thanks for the tip. I tried grepping using the following command grep -Rl "SELECT" .|grep -v "/.svn/"|grep -v "/docs/"|grep -v "/maintenance/"
and got the list below. It's really long. I have excluded the maintenance and docs folders completely. The files in the includes directory are the first place I would be looking.
. . .
I think he meant alter the Database class, so that it intercepted queries that were being made by the parser and substituted some fake answer. Searching for every single SELECT made anywhere in MediaWiki code will give you a rather long list, yes.
Unless you want to have all the interface functionality of MediaWiki (which you don't, apparently), ./includes/Database.php should be all you need.
IIRC it was about 5 or so methods to tweak to get a page to render.
Magnus
As far as rendering wiki markup is concerned, how is Database.php going to help?
Hi, Magnus, is wiki2xml (http://svn.wikimedia.org/viewvc/mediawiki/trunk/wiki2xml/php/) your project? Its output was pretty encouraging, but it seems it has some issues with rendering the {{wikipedia}} template. Use the following url to see the problem I am talking about.
http://tools.wikimedia.de/~magnus/wiki2xml/w2x.php?doit=1&whatsthis=wiki...
And what does the Use API checkbox do? When I check it, the templates are not rendered at all.
BTW ------------ This is off-topic ------------ Are attachments allowed in the mailing list?
Apple Grew wrote:
Are attachments allowed in the mailing list?
Most attachments are automatically stripped for security reasons.
You can, of course, link to the file uploaded elsewhere.
MinuteElectron.
On Wed, Feb 20, 2008 at 5:26 PM, Apple Grew applegrew@gmail.com wrote:
As far as rendering wiki markup is concerned, how is Database.php going to help?
1. Tweak the database class to grab content from your local source instead of a database.
2. Have MediaWiki do the rendering for you by simply requesting the article via HTTP.
3. Profit!
Hi, Magnus, is wiki2xml (http://svn.wikimedia.org/viewvc/mediawiki/trunk/wiki2xml/php/) your project?
Yes.
Its output was pretty encouraging, but it seems it has some issues with rendering the {{wikipedia}} template. Use the following url to see the problem I am talking about.
http://tools.wikimedia.de/~magnus/wiki2xml/w2x.php?doit=1&whatsthis=wiki...
Internally, wikitext always gets converted to XML first, then that gets converted to the actual output format. Here, you're apparently running into two problems:
* XHTML output is a little shaky ATM
* Something doesn't parse right, as you can see when you switch to XML output (at the end, there's a "<span>" as plain text)
I'll look into that.
And what does the Use API checkbox do? When I check it, the templates are not rendered at all.
This option works when you enter a list of articles, instead of raw wikitext; it will then use the API of the given MediaWiki installation to do the template replacement, which is faster and more reliable.
Magnus
On Thu, Feb 21, 2008 at 3:23 PM, Magnus Manske magnusmanske@googlemail.com wrote:
On Wed, Feb 20, 2008 at 5:26 PM, Apple Grew applegrew@gmail.com wrote:
As far as rendering wiki markup is concerned, how is Database.php going to help?
1. Tweak the database class to grab content from your local source instead of a database.
2. Have MediaWiki do the rendering for you by simply requesting the article via HTTP.
3. Profit!
Hi, Magnus, is wiki2xml (http://svn.wikimedia.org/viewvc/mediawiki/trunk/wiki2xml/php/) your project?
Yes.
Its output was pretty encouraging, but it seems it has some issues with rendering the {{wikipedia}} template. Use the following url to see the problem I am talking about.
http://tools.wikimedia.de/~magnus/wiki2xml/w2x.php?doit=1&whatsthis=wiki...
Internally, wikitext always gets converted to XML first, then that gets converted to the actual output format. Here, you're apparently running into two problems:
* XHTML output is a little shaky ATM
* Something doesn't parse right, as you can see when you switch to XML output (at the end, there's a "<span>" as plain text)
I'll look into that.
And what does the Use API checkbox do? When I check it, the templates are not rendered at all.
This option works when you enter a list of articles, instead of raw wikitext; it will then use the API of the given MediaWiki installation to do the template replacement, which is faster and more reliable.
Magnus
Thanks for the reply; the database trick looks encouraging. Currently I am trying to use the Bliki (Java) library to do the rendering. This is a better option for me, as I am also programming my application in Java.
However, it turned out that access in bzipped files was way too slow, unzipped data was way too large to be of use, and re-indexing would take ages. I even tried sqlite, which bogged down. Maybe sqlite3 does better these days.
The KDE app for reading the wiki dump does exactly this - it reads directly from bz2 files - and it is not slow. URL: http://www.kde-apps.org/content/show.php?content=65244