Note: I cross-posted this to several lists, because I think this is of
interest to many; please reply on wikitech-l only.
A long, long time ago, I started writing a PHP script to convert
MediaWiki markup into XML. I believe it is now feature-complete and
relatively reliable. Not only can it process a single wiki text, but
also a list of articles, fetching the text from any MediaWiki-based
site online. It uses the same mechanism to fetch and expand templates.
The generated XML can now be converted into other formats. For
demonstration [1], I offer "plain text" and DocBook XML.
What I cannot demonstrate (due to limitations of my hosting service) is
the subsequent conversion to HTML or PDF from the DocBook XML. However,
it is quite easy to set up an automatic conversion locally if you have
the necessary DocBook stylesheets and tools installed.
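For instance, with the standard DocBook XSL stylesheets and Apache FOP
installed, the whole chain is two commands; here wrapped in PHP so it
can run unattended (the stylesheet path is an assumption about your
installation):

  <?php
  # Sketch: DocBook XML -> XSL-FO (xsltproc) -> PDF (Apache FOP).
  $xsl = '/usr/share/xml/docbook/stylesheet/fo/docbook.xsl'; # adjust path
  $in  = 'articles.docbook.xml';
  shell_exec( "xsltproc $xsl $in > articles.fo" );       # DocBook -> XSL-FO
  shell_exec( "fop -fo articles.fo -pdf articles.pdf" ); # XSL-FO -> PDF
  ?>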
As an example, I have generated a PDF [2] by
1. Entering the titles of the articles I want to have
2. Choosing "DocBook PDF" as output format
3. Clicking "Convert"
4. Waiting for the PDF to open
Really, that easy! :-)
I am well aware of some shortcomings of the example PDF; however, most
of them (no left margin, gigantic tables, misshaped images) are flaws of
DocBook, or of the default stylesheets I use. I'm not really familiar
with DocBook and hope for help from people who are.
While the converter seems to work pretty well, I'm sure there are lots
of fun bugs to find. If you do find a page that breaks, please mail me
the title so I can find the bug, or even better, fix it yourself! The
code is in CVS, in the "wiki2xml" module, "php" directory (ignore the
old C code in the main directory ;-).
A word about speed: yes, creating a PDF takes some time. However, most
of it is DocBook at work, plus the loading times for articles and
templates. Converting the example from wiki markup to XML to DocBook XML
to PDF takes 2 minutes 20 seconds in total, but the actual wiki-to-XML
conversion is done in just 8 seconds.
Apart from bug fixing, my next priority is ODT (OpenOffice) output.
Also, I would like to extend Special:Export in MediaWiki so it can
return a list of authors, which could then be added automagically to
all converted files.
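Once that is in place, the consuming side should be trivial. A rough
sketch (the regex stands in for real XML parsing, and the per-URL
export currently returns only the last revision anyway):

  <?php
  # Sketch: list the contributors named in a Special:Export dump.
  $xml = file_get_contents(
      'http://en.wikipedia.org/wiki/Special:Export/Biology' );
  preg_match_all( '!<username>(.*?)</username>!', $xml, $m );
  print implode( ', ', array_unique( $m[1] ) );
  ?>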
Awaiting your feedback,
Magnus
[1] http://magnusmanske.de/wiki2xml/w2x.php
[2] http://magnusmanske.de/wiki2xml/Biology_topics.pdf (3.7 MB!)
Hello
I'm quite new to the wiki world, so please excuse my ignorance.
I'm wondering if anyone has ideas about a wiki supporting workflow.
Documents certainly go through different stages from the time they are
created until they are filed or removed.
Is there an existing solution for this? Have you ever thought about it?
Regards
Jorge
Hello everyone,
I would like to know whether it is possible to retrieve the content for
a specific request from outside, that is, NOT from the Wikipedia
website itself.
The setup would be: fill a form (text area) with the word I am
searching for, on my website or in a program or something like that.
This website, program, etc. then connects (somehow) to the Wikipedia
website and returns the content (if the word is specific enough) to the
website, program, etc.
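Something like the following minimal sketch is what I have in mind,
assuming MediaWiki's action=raw entry point is a legitimate way to do
it (I do not know whether it is):

  <?php
  # Sketch: return the raw wikitext of the article named in the form.
  $title = urlencode( $_POST['searchword'] );
  $text  = file_get_contents(
      "http://en.wikipedia.org/w/index.php?title=$title&action=raw" );
  echo htmlspecialchars( $text );
  ?>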
I hope I have formulated the case in an understandable way. Now the
concrete questions are:
- Is there a Wikipedia API specification that would let me reach my goal?
- If yes, where can I get it?
- If not, what alternatives are there?
Thank you in advance!
Gab
An automated run of parserTests.php showed the following failures:
Running test BUG 361: URL within URL, not bracketed... FAILED!
Running test External links: invalid character... FAILED!
Running test Bug 2702: Mismatched <i> and <a> tags are invalid... FAILED!
Running test A table with no data.... FAILED!
Running test A table with nothing but a caption... FAILED!
Running test Link containing "#<" and "#>" % as a hex sequences... FAILED!
Running test Template with thumb image (wiht link in description)... FAILED!
Running test Link to image page... FAILED!
Running test BUG 1887: A ISBN with a thumbnail... FAILED!
Running test BUG 1887: A <math> with a thumbnail... FAILED!
Running test BUG 561: {{/Subpage}}... FAILED!
Running test Simple category... FAILED!
Running test Basic section headings... FAILED!
Running test Section headings with TOC... FAILED!
Running test Handling of sections up to level 6 and beyond... FAILED!
Running test Resolving duplicate section names... FAILED!
Running test Template with sections, __NOTOC__... FAILED!
Running test Link inside a section heading... FAILED!
Running test Media link with nasty text... FAILED!
Running test Bug 2095: link with pipe and three closing brackets... FAILED!
Running test Parser hook: static parser hook inside a comment... FAILED!
Running test Sanitizer: Validating the contents of the id attribute (bug 4515)... FAILED!
Passed 275 of 297 tests (92.59%) FAILED!
Why is the option to make subpages in the main namespace not enabled by
default? Is there something wrong with having subpages in the main
namespace?
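For reference, this seems to be the relevant LocalSettings.php switch
(as far as I can tell; the exact form may differ between versions):

  # LocalSettings.php: allow Page/Subpage splitting in the main namespace.
  $wgNamespacesWithSubpages[NS_MAIN] = true;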
hi,
My hosting provider is very crappy. One of the most basic things, shell
access, is not available. This has many disadvantages, because I cannot
run almost any of the maintenance scripts. I found that I could use cron
jobs to replace a shell (a job does what I would do in a shell), but it
is very tedious and has been the cause of many problems. I had to import
something into the wiki and ran the script through a cron job, but
forgot to remove the entry, and it kept importing over and over until I
noticed that pages which had been deleted were coming back.
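In hindsight, even a crude guard like this sketch would have saved me
(the marker file is my own invention, not something MediaWiki ships
with; importDump.php is the stock maintenance script):

  <?php
  # Sketch: let a cron-driven import run exactly once, even if the
  # crontab entry is forgotten afterwards.
  $marker = dirname( __FILE__ ) . '/import.done';
  if ( !file_exists( $marker ) ) {
      shell_exec( 'php maintenance/importDump.php dump.xml' );
      touch( $marker );
  }
  ?>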
Moreover, the server is in such a location that a cvs update is never
successful. On my own computer, cvs update fetches the code in 20 or 30
seconds, but on that server it has never completed. I made a script that
kept retrying until it succeeded, and it managed to update only once in
24 hours. This definitely caused problems for the host, and they
temporarily suspended my account. So I see no way out of this except
changing my hosting provider.
Does anybody know a good hosting provider that suits my needs?
vedant
Good day Rich Morin,
on Wednesday, 22 March 2006 at 01:58 you wrote:
RM> At 2:28 PM -0800 3/21/06, Brion Vibber wrote:
>> G@B wrote:
>>> - Is there a Wikipedia API specification that would let me reach my goal?
>>
>> Not at this time.
>>
>>> - If not, what alternatives are there?
>>
>> If you're a nice person: open a web browser and point it at Wikipedia.
RM> That's approximately what I do in my web pages. My text is rife with
RM> links to WP (to the point that $WP is defined in the PHP code :-). In
RM> fact, the ability to use WP as a source of explanatory footnotes is a
RM> very big win for me.
RM> However, I can imagine situations where this would not be an optimal
RM> solution. For example, someone might wish to present the user with
RM> tooltips, image-mapped diagrams for context and navigation, etc. So,
RM> the web site would display derived information (probably linking to
RM> WP, as well).
RM> This sort of analysis requires intimate familiarity with the structure
RM> of the input data. So, definitions, consistency, and stability are
RM> important requirements.
RM> At 11:25 PM +0100 3/21/06, G@B wrote:
>> - If not, what alternatives are there?
RM> Here are some possibilities I've considered:
RM> * screen scraping
RM> UI design is hard enough without trying to keep things convenient
RM> for use by programs. So, most developers (let alone contributors)
RM> won't optimize for this use.
RM> XHTML, if used, helps with some low-level syntax issues (XML
RM> parsers work :-), but the structure may still be chaotic and
RM> subject to unannounced changes.
RM> Nonetheless, I've suggested that Semantic WP (SWP) tags be part
RM> of the generated XHTML, to enable analysis (etc) by browsers.
RM> * XML/SOAP/...
RM> Quite possible, assuming that WP will allow it and someone can do
RM> (or support) the necessary standardization and implementation.
RM> The SWP folks will be forced to do something like this, if nobody
RM> gets there first. In any case, it won't be trivial to do right.
RM> * RDBMS (eg, MySQL)
RM> Assuming that read access is available, a script can easily send
RM> off queries and evaluate the replies. WP could, in fact, allow
RM> this, but caution would be worthwhile, as this level of access
RM> might create new openings for DDoS attacks, etc. OTOH, if access
RM> were controlled, correct behavior could be enforced by fiat.
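RM> A sketch of the sort of query I mean (untested; table and column
RM> names assume the MediaWiki 1.5+ schema, and old_flags compression
RM> is ignored for brevity):
RM>
RM>   <?php
RM>   // Read the current text of one article straight out of MySQL.
RM>   $db = mysql_connect( 'localhost', 'reader', 'secret' );
RM>   mysql_select_db( 'wikidb', $db );
RM>   $res = mysql_query(
RM>       "SELECT old_text FROM page, revision, text
RM>         WHERE page_namespace = 0 AND page_title = 'PHP'
RM>           AND rev_id = page_latest AND old_id = rev_text_id", $db );
RM>   $row = mysql_fetch_assoc( $res );
RM>   echo $row['old_text'];
RM>   ?>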
RM> Otherwise, you fall back to Brion's suggestion of keeping a mirror.
RM> The last time I checked, this was not a turnkey procedure, but the
RM> situation may be different now. Is mirroring automated now?
RM> * code-level (e.g., PHP) access
RM> If you have access to the MW code base, you can grab any data you
RM> like. However, this puts you in the role of maintaining a forked
RM> version of MW. Of course, if your changes are deemed useful and
RM> safe, you might get them into the MW code base. In fact, putting
RM> XML access and/or SWP facilities into MW is an example of this.
RM> * command-line access
RM> If you have command-line access on a machine where MW is running,
RM> and appropriate permissions, you can access data in a number of
RM> ways. For example, you could look directly into the MySQL files
RM> or behind MW's back at generated files, etc. (However, YMMV!)
RM> In summary, there are a variety of options. My own approach is to use
RM> mediated database access (eg, via Perl's DBI module). This shields me
RM> from implementation details and reduces portability issues. With the
RM> exception of the DB structure, I can treat MW largely as a "black box".
RM> Although I'm not sure I'll need it, DBI-Link provides a way to access
RM> arbitrary databases via PostgreSQL. So, if PostgreSQL can provide
RM> facilities that MySQL (or whatever) does not, it can be used as a
RM> "wrapper":
RM> ??? -> PostgreSQL -> PL/PerlU -> DBI-Link -> Perl DBI -> MySQL (etc)
RM> I would be happy to hear of other possibilities, etc. TMTOWTDI!
RM> -r
Thank you for your advice, guys. I hope that in the future there will
be a more comfortable way to get the data. For me it would be
convenient to have the solution (for example) in Java, considering that
a lot of web technologies are being developed in this area (see JSP,
JSF, etc.).
Well, another question for Rich Morin:
Can you please give the URL of your page so I can see how "the wiki
implementation" looks and works?
--
BR
Gab
Hi there,
I've finished my first hack on MediaWiki and coded a new extension:
User Comment [1]. The main purpose of this extension is to let users
write comments about the subject of a page, for posting tips or
annotations.
Well, I often read the online PHP manual [2], and sometimes the
information you find in a user comment is more important than the
explanation of the function itself. So I think you could improve your
wiki with this feature! You can see this extension in action on a
Spanish wiki about LDAP [3].
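If you want to try it, installation follows the usual extension
pattern. The snippet below is only an illustrative skeleton of that
pattern (names and paths are mine, not the actual UserComments code):

  <?php
  # In LocalSettings.php: load the extension (path is illustrative).
  require_once( "extensions/UserComments/UserComments.php" );

  # Inside the extension file, a typical tag-extension setup:
  $wgExtensionFunctions[] = 'wfSetupUserComments';

  function wfSetupUserComments() {
      global $wgParser;
      # Render <usercomment>...</usercomment> through our callback.
      $wgParser->setHook( 'usercomment', 'renderUserComment' );
  }

  function renderUserComment( $input, $args, $parser ) {
      # Escape the raw input so a comment cannot inject HTML.
      return '<div class="usercomment">' . htmlspecialchars( $input ) . '</div>';
  }
  ?>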
Anyway, how can I update Category:Extensions on meta.wikimedia.org to
include this one?
Cheers!
[1] http://meta.wikimedia.org/wiki/User:Kan/UserComments
[2] http://www.php.net
[3] http://sugus.eii.us.es/siledap/WikiLDAP:DN
Hello,
Is there a way to analyze the Wikipedia logs to figure out which
processes take the most time? There is no immediate need, but I wanted
to float an idea for consideration. If we could identify the operations
that cause a heavy server load and push them onto a distributed process
pool, would that help Wikipedia, or MediaWiki in general? I imagine
there would be a trade-off between processing time and the speed of
network traffic. Say we determined that the code that creates a diff
between two pages is a hog and could be put into the pool: we could use
something like BOINC, http://boinc.berkeley.edu/, to standardize the
pool, and hand the diff work over as the server load gets heavy. BOINC
is really aimed at research tasks, so it would need adapting for
MediaWiki; I just used it as an example to keep this message short and
get your feedback.
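As for finding the hot spots in the first place, MediaWiki already
carries simple instrumentation; a sketch of the kind of measurement I
mean, using its wfProfileIn/wfProfileOut helpers (the diff call itself
is hypothetical):

  // Time a suspected hot spot with MediaWiki's own profiler hooks.
  wfProfileIn( 'diff-generation' );
  $diffText = $diffEngine->diff( $oldLines, $newLines ); // hypothetical
  wfProfileOut( 'diff-generation' );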
Thanks
Jonathan