Note: I cross-posted this to several lists, because I think this is of
interest to many; please reply on wikitech-l only.
A long, long time ago, I started writing a PHP script to convert
MediaWiki markup into XML. I believe it is now feature-complete and
relatively reliable. Not only can it process a single wiki text, but
also a list of articles, fetching the text from any MediaWiki-based
site online. It uses the same mechanism to fetch and expand templates.
The generated XML can now be converted into other formats. For
demonstration [1], I offer "plain text" and DocBook XML.
What I cannot demonstrate (due to limitations of my hosting service) is
the subsequent conversion to HTML or PDF from the DocBook XML. However,
it is quite easy to set up an automatic conversion locally if you have
the necessary DocBook stylesheets and tools installed.
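For instance, with the standard DocBook XSL stylesheets and Apache FOP
installed, the whole chain is two commands; here wrapped in PHP so it
can run unattended (the stylesheet path is an assumption about your
installation):

  <?php
  # Sketch: DocBook XML -> XSL-FO (xsltproc) -> PDF (Apache FOP).
  $xsl = '/usr/share/xml/docbook/stylesheet/fo/docbook.xsl'; # adjust path
  $in  = 'articles.docbook.xml';
  shell_exec( "xsltproc $xsl $in > articles.fo" );       # DocBook -> XSL-FO
  shell_exec( "fop -fo articles.fo -pdf articles.pdf" ); # XSL-FO -> PDF
  ?>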
As an example, I have generated a PDF [2] by
1. Entering the titles of the articles I want to have
2. Choosing "DocBook PDF" as output format
3. Clicking "Convert"
4. Waiting for the PDF to open
Really, that easy! :-)
I am well aware of some shortcomings of the example PDF; however, most
of them (no left margin, gigantic tables, misshaped images) are flaws of
DocBook, or of the default stylesheets I use. I'm not really familiar
with DocBook and hope for help from people who are.
While the converter seems to work pretty well, I'm sure there are lots
of fun bugs to find. If you do find a page that breaks, please mail me
the title so I can find the bug, or even better, fix it yourself! The
code is in CVS, in the "wiki2xml" module, "php" directory (ignore the
old C code in the main directory ;-).
A word about speed: yes, creating a PDF takes some time. However, most
of it is DocBook at work, plus the loading times for articles and
templates. Converting the example from wiki markup to XML to DocBook XML
to PDF takes 2 minutes 20 seconds in total, but the actual wiki-to-XML
conversion is done in just 8 seconds.
Apart from bug fixing, my next priority is ODT (OpenOffice) output.
Also, I would like to extend Special:Export in MediaWiki so it can
return a list of authors, which could then be added automagically to
all converted files.
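Once that is in place, the consuming side should be trivial. A rough
sketch (the regex stands in for real XML parsing, and the per-URL
export currently returns only the last revision anyway):

  <?php
  # Sketch: list the contributors named in a Special:Export dump.
  $xml = file_get_contents(
      'http://en.wikipedia.org/wiki/Special:Export/Biology' );
  preg_match_all( '!<username>(.*?)</username>!', $xml, $m );
  print implode( ', ', array_unique( $m[1] ) );
  ?>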
Awaiting your feedback,
Magnus
[1] http://magnusmanske.de/wiki2xml/w2x.php
[2] http://magnusmanske.de/wiki2xml/Biology_topics.pdf (3.7 MB!)
Hello
I'm quite new to the wiki world, so please excuse my ignorance.
I'm wondering if anyone has ideas about a wiki supporting workflow.
Documents certainly go through different stages from the time they are
created until they are filed or removed.
Is there an existing solution for this? Have you ever thought about it?
Regards
Jorge
Hello everyone,
I would like to know whether it is possible to retrieve the content for
a specific request from outside, that is, NOT from the Wikipedia
website itself.
The setup would be: fill a form (text area) with the word I am
searching for, on my website or in a program or something like that.
This website, program, etc. then connects (somehow) to the Wikipedia
website and returns the content (if the word is specific enough) to the
website, program, etc.
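Something like the following minimal sketch is what I have in mind,
assuming MediaWiki's action=raw entry point is a legitimate way to do
it (I do not know whether it is):

  <?php
  # Sketch: return the raw wikitext of the article named in the form.
  $title = urlencode( $_POST['searchword'] );
  $text  = file_get_contents(
      "http://en.wikipedia.org/w/index.php?title=$title&action=raw" );
  echo htmlspecialchars( $text );
  ?>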
I hope I have formulated the case in an understandable way. Now the
concrete questions are:
- Is there a Wikipedia API specification that would let me reach my goal?
- If yes, where can I get it?
- If not, what alternatives are there?
Thank you in advance!
Gab
An automated run of parserTests.php showed the following failures:
Running test BUG 361: URL within URL, not bracketed... FAILED!
Running test External links: invalid character... FAILED!
Running test Bug 2702: Mismatched <i> and <a> tags are invalid... FAILED!
Running test A table with no data.... FAILED!
Running test A table with nothing but a caption... FAILED!
Running test Link containing "#<" and "#>" % as a hex sequences... FAILED!
Running test Template with thumb image (wiht link in description)... FAILED!
Running test Link to image page... FAILED!
Running test BUG 1887: A ISBN with a thumbnail... FAILED!
Running test BUG 1887: A <math> with a thumbnail... FAILED!
Running test BUG 561: {{/Subpage}}... FAILED!
Running test Simple category... FAILED!
Running test Basic section headings... FAILED!
Running test Section headings with TOC... FAILED!
Running test Handling of sections up to level 6 and beyond... FAILED!
Running test Resolving duplicate section names... FAILED!
Running test Template with sections, __NOTOC__... FAILED!
Running test Link inside a section heading... FAILED!
Running test Media link with nasty text... FAILED!
Running test Bug 2095: link with pipe and three closing brackets... FAILED!
Running test Parser hook: static parser hook inside a comment... FAILED!
Running test Sanitizer: Validating the contents of the id attribute (bug 4515)... FAILED!
Passed 275 of 297 tests (92.59%) FAILED!
Why is the option to make subpages in the main namespace not enabled by
default? Is there something wrong with having subpages in the main
namespace?
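For reference, this seems to be the relevant LocalSettings.php switch
(as far as I can tell; the exact form may differ between versions):

  # LocalSettings.php: allow Page/Subpage splitting in the main namespace.
  $wgNamespacesWithSubpages[NS_MAIN] = true;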
hi,
My hosting provider is very crappy. One of the most basic things, shell
access, is not available. This has many disadvantages, because I cannot
run almost any of the maintenance scripts. I found that I could use cron
jobs to replace a shell (a job does what I would do in a shell), but it
is very tedious and has been the cause of many problems. I had to import
something into the wiki and ran the script through a cron job, but
forgot to remove the entry, and it kept importing over and over until I
noticed that pages which had been deleted were coming back.
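In hindsight, even a crude guard like this sketch would have saved me
(the marker file is my own invention, not something MediaWiki ships
with; importDump.php is the stock maintenance script):

  <?php
  # Sketch: let a cron-driven import run exactly once, even if the
  # crontab entry is forgotten afterwards.
  $marker = dirname( __FILE__ ) . '/import.done';
  if ( !file_exists( $marker ) ) {
      shell_exec( 'php maintenance/importDump.php dump.xml' );
      touch( $marker );
  }
  ?>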
Moreover, the server is in such a location that a cvs update is never
successful. On my own computer, cvs update fetches the code in 20 or 30
seconds, but on that server it has never completed. I made a script that
kept retrying until it succeeded, and it managed to update only once in
24 hours. This definitely caused problems for the host, and they
temporarily suspended my account. So I see no way out of this except
changing my hosting provider.
Does anybody know a good hosting provider that suits my needs?
vedant
Good day Rich Morin,
on Wednesday, 22 March 2006 at 01:58 you wrote:
RM> At 2:28 PM -0800 3/21/06, Brion Vibber wrote:
>> G@B wrote:
>>> - Is there a Wikipedia API specification that would let me reach my goal?
>>
>> Not at this time.
>>
>>> - If not, what alternatives are there?
>>
>> If you're a nice person: open a web browser and point it at Wikipedia.
RM> That's approximately what I do in my web pages. My text is rife with
RM> links to WP (to the point that $WP is defined in the PHP code :-). In
RM> fact, the ability to use WP as a source of explanatory footnotes is a
RM> very big win for me.
RM> However, I can imagine situations where this would not be an optimal
RM> solution. For example, someone might wish to present the user with
RM> tooltips, image-mapped diagrams for context and navigation, etc. So,
RM> the web site would display derived information (probably linking to
RM> WP, as well).
RM> This sort of analysis requires intimate familiarity with the structure
RM> of the input data. So, definitions, consistency, and stability are
RM> important requirements.
RM> At 11:25 PM +0100 3/21/06, G@B wrote:
>> - If not, what alternatives are there?
RM> Here are some possibilities I've considered:
RM> * screen scraping
RM> UI design is hard enough without trying to keep things convenient
RM> for use by programs. So, most developers (let alone contributors)
RM> won't optimize for this use.
RM> XHTML, if used, helps with some low-level syntax issues (XML
RM> parsers work :-), but the structure may still be chaotic and
RM> subject to unannounced changes.
RM> Nonetheless, I've suggested that Semantic WP (SWP) tags be part
RM> of the generated XHTML, to enable analysis (etc) by browsers.
RM> * XML/SOAP/...
RM> Quite possible, assuming that WP will allow it and someone can do
RM> (or support) the necessary standardization and implementation.
RM> The SWP folks will be forced to do something like this, if nobody
RM> gets there first. In any case, it won't be trivial to do right.
RM> * RDBMS (eg, MySQL)
RM> Assuming that read access is available, a script can easily send
RM> off queries and evaluate the replies. WP could, in fact, allow
RM> this, but caution would be worthwhile, as this level of access
RM> might create new openings for DDoS attacks, etc. OTOH, if access
RM> were controlled, correct behavior could be enforced by fiat.
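RM> A sketch of the sort of query I mean (untested; table and column
RM> names assume the MediaWiki 1.5+ schema, and old_flags compression
RM> is ignored for brevity):
RM>
RM>   <?php
RM>   // Read the current text of one article straight out of MySQL.
RM>   $db = mysql_connect( 'localhost', 'reader', 'secret' );
RM>   mysql_select_db( 'wikidb', $db );
RM>   $res = mysql_query(
RM>       "SELECT old_text FROM page, revision, text
RM>         WHERE page_namespace = 0 AND page_title = 'PHP'
RM>           AND rev_id = page_latest AND old_id = rev_text_id", $db );
RM>   $row = mysql_fetch_assoc( $res );
RM>   echo $row['old_text'];
RM>   ?>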
RM> Otherwise, you fall back to Brion's suggestion of keeping a mirror.
RM> The last time I checked, this was not a turnkey procedure, but the
RM> situation may be different now. Is mirroring automated now?
RM> * code-level (e.g., PHP) access
RM> If you have access to the MW code base, you can grab any data you
RM> like. However, this puts you in the role of maintaining a forked
RM> version of MW. Of course, if your changes are deemed useful and
RM> safe, you might get them into the MW code base. In fact, putting
RM> XML access and/or SWP facilities into MW is an example of this.
RM> * command-line access
RM> If you have command-line access on a machine where MW is running,
RM> and appropriate permissions, you can access data in a number of
RM> ways. For example, you could look directly into the MySQL files
RM> or behind MW's back at generated files, etc. (However, YMMV!)
RM> In summary, there are a variety of options. My own approach is to use
RM> mediated database access (eg, via Perl's DBI module). This shields me
RM> from implementation details and reduces portability issues. With the
RM> exception of the DB structure, I can treat MW largely as a "black box".
RM> Although I'm not sure I'll need it, DBI-Link provides a way to access
RM> arbitrary databases via PostgreSQL. So, if PostgreSQL can provide
RM> facilities that MySQL (or whatever) does not, it can be used as a
RM> "wrapper":
RM> ??? -> PostgreSQL -> PL/PerlU -> DBI-Link -> Perl DBI -> MySQL (etc)
RM> I would be happy to hear of other possibilities, etc. TMTOWTDI!
RM> -r
Thank you for your advice, guys. I hope that in the future there will
be a more comfortable way to get the data. For me it would be
convenient to have the solution (for example) in Java, considering that
a lot of web technologies are being developed in this area (see JSP,
JSF, etc.).
Well, another question for Rich Morin:
Can you please give the URL of your page so I can see how "the wiki
implementation" looks and works?
--
BR
Gab
Hi there,
I've finished my first hack on MediaWiki and coded a new extension:
User Comment [1]. The main purpose of this extension is to let users
write comments about the subject of a page, for posting tips or
annotations.
Well, I often read the online PHP manual [2], and sometimes the
information you find in a user comment is more important than the
explanation of the function itself. So I think you could improve your
wiki with this feature! You can see this extension in action on a
Spanish wiki about LDAP [3].
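If you want to try it, installation follows the usual extension
pattern. The snippet below is only an illustrative skeleton of that
pattern (names and paths are mine, not the actual UserComments code):

  <?php
  # In LocalSettings.php: load the extension (path is illustrative).
  require_once( "extensions/UserComments/UserComments.php" );

  # Inside the extension file, a typical tag-extension setup:
  $wgExtensionFunctions[] = 'wfSetupUserComments';

  function wfSetupUserComments() {
      global $wgParser;
      # Render <usercomment>...</usercomment> through our callback.
      $wgParser->setHook( 'usercomment', 'renderUserComment' );
  }

  function renderUserComment( $input, $args, $parser ) {
      # Escape the raw input so a comment cannot inject HTML.
      return '<div class="usercomment">' . htmlspecialchars( $input ) . '</div>';
  }
  ?>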
Anyway, how can I update Category:Extensions on meta.wikimedia.org to
include this one?
Cheers!
[1] http://meta.wikimedia.org/wiki/User:Kan/UserComments
[2] http://www.php.net
[3] http://sugus.eii.us.es/siledap/WikiLDAP:DN
Hello,
Is there a way to analyze the Wikipedia logs to figure out which
processes take the most time? There is no immediate need, but I wanted
to float an idea for consideration. If we could identify the operations
that cause a heavy server load and push them onto a distributed process
pool, would that help Wikipedia, or MediaWiki in general? I imagine
there would be a trade-off between processing time and the speed of
network traffic. Say we determined that the code that creates a diff
between two pages is a hog and could be put into the pool: we could use
something like BOINC, http://boinc.berkeley.edu/, to standardize the
pool, and hand the diff work over as the server load gets heavy. BOINC
is really aimed at research tasks, so it would need adapting for
MediaWiki; I just used it as an example to keep this message short and
get your feedback.
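As for finding the hot spots in the first place, MediaWiki already
carries simple instrumentation; a sketch of the kind of measurement I
mean, using its wfProfileIn/wfProfileOut helpers (the diff call itself
is hypothetical):

  // Time a suspected hot spot with MediaWiki's own profiler hooks.
  wfProfileIn( 'diff-generation' );
  $diffText = $diffEngine->diff( $oldLines, $newLines ); // hypothetical
  wfProfileOut( 'diff-generation' );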
Thanks
Jonathan