I am currently testing out the possibility of using MediaWiki for a historical image collection. The image links and metadata are in a MySQL database. What is the best way to batch import a collection into MediaWiki? Would the MediaWiki Bulk Page Creator be the best way?
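If it helps, here's roughly how I picture pulling the metadata out of
MySQL - one wikitext description page per image, ready to feed to
whatever bulk import tool turns out to be best. The table and column
names are just from my own schema, so treat it as a sketch:

<?php
// Dump one wikitext description page per image from the metadata database.
$db = mysql_connect( 'localhost', 'user', 'password' );
mysql_select_db( 'image_collection', $db );

$res = mysql_query( "SELECT filename, title, year, notes FROM images", $db );
while ( $row = mysql_fetch_assoc( $res ) ) {
    $text = "== Summary ==\n"
          . "'''" . $row['title'] . "''' (" . $row['year'] . ")\n\n"
          . $row['notes'] . "\n\n"
          . "[[Category:Historical image collection]]\n";
    // One .txt file per image, named after the image file itself.
    file_put_contents( $row['filename'] . '.txt', $text );
}
?>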
Thanks,
Nick Ruest
Hi all,
After the previous thread, I ask out of curiosity: what if there
were just a small number of servers spread around the world, not owned
by the WMF? Something like:
- Users can access en.wikipedia.org, en.wikipedia-brazil.org,
en.wikipedia-japan.org etc interchangeably
- Each site is a complete copy of the database, and capable of serving
independently
- Every change on one database is immediately replicated to the others
- To make this worthwhile, all the servers except the original are
owned and operated by third parties on a sponsorship arrangement,
presumably meaning they get to discreetly stick a logo somewhere
Is this sort of thing technically feasible? The clear advantage is
faster response time for people near one of the overseas hosts,
especially when just browsing. I can see obvious problems with the
replication (e.g. two conflicting write requests arriving
simultaneously), and with what happens when a link fails for an
extended period of time. Perhaps a simpler model:
- Read requests are fulfilled by these distributed servers
- All write requests are sent to the one central server which
immediately pushes out a "dirty page" notification (but not page
content) to the other servers
- The distributed servers fetch updated pages when a user requests a
"dirty page", or perhaps after some time period, to avoid the whole
database becoming too out of date
This model would rely on the fact that the vast majority of requests
are reads, not writes, and attempts to reduce the impact of a page
which is heavily modified by one server, while infrequently requested
on another.
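Very roughly, and with every name invented, I imagine each mirror
doing something like this (I'm thinking out loud in PHP here, not
describing anything that exists):

<?php
// The central server pushes only a "this page is stale" flag; no content.
function markPageDirty( $title ) {
    mysql_query( "UPDATE mirror_cache SET is_dirty = 1 WHERE page_title = '"
        . mysql_real_escape_string( $title ) . "'" );
}

// Every read at the mirror: serve the local copy unless it is missing or
// has been flagged dirty, in which case fetch it from the central server.
function getPageForReader( $title ) {
    $res = mysql_query( "SELECT page_text, is_dirty FROM mirror_cache"
        . " WHERE page_title = '" . mysql_real_escape_string( $title ) . "'" );
    $row = mysql_fetch_assoc( $res );
    if ( !$row || $row['is_dirty'] ) {
        $text = file_get_contents( 'http://en.wikipedia.org/w/index.php'
            . '?action=raw&title=' . urlencode( $title ) );
        mysql_query( "REPLACE INTO mirror_cache (page_title, page_text, is_dirty)"
            . " VALUES ('" . mysql_real_escape_string( $title ) . "', '"
            . mysql_real_escape_string( $text ) . "', 0)" );
        return $text;
    }
    return $row['page_text'];
}
?>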
Thoughts? I'm obviously not a network or database engineer, so I'm just
wondering if this would be a workable or useful solution. It at least
avoids the problems of untrusted and unreliable servers, and gives an
additional benefit in responsiveness.
Steve
An automated run of parserTests.php showed the following failures:
This is MediaWiki version 1.10alpha (r19986).
Reading tests from "maintenance/parserTests.txt"...
Reading tests from "extensions/Cite/citeParserTests.txt"...
Reading tests from "extensions/Poem/poemParserTests.txt"...
18 still FAILING test(s) :(
* URL-encoding in URL functions (single parameter) [Has never passed]
* URL-encoding in URL functions (multiple parameters) [Has never passed]
* TODO: Table security: embedded pipes (http://mail.wikipedia.org/pipermail/wikitech-l/2006-April/034637.html) [Has never passed]
* TODO: Link containing double-single-quotes '' (bug 4598) [Has never passed]
* TODO: message transform: <noinclude> in transcluded template (bug 4926) [Has never passed]
* TODO: message transform: <onlyinclude> in transcluded template (bug 4926) [Has never passed]
* BUG 1887, part 2: A <math> with a thumbnail- math enabled [Has never passed]
* TODO: HTML bullet list, unclosed tags (bug 5497) [Has never passed]
* TODO: HTML ordered list, unclosed tags (bug 5497) [Has never passed]
* TODO: HTML nested bullet list, open tags (bug 5497) [Has never passed]
* TODO: HTML nested ordered list, open tags (bug 5497) [Has never passed]
* TODO: Inline HTML vs wiki block nesting [Has never passed]
* TODO: Mixing markup for italics and bold [Has never passed]
* TODO: 5 quotes, code coverage +1 line [Has never passed]
* TODO: dt/dd/dl test [Has never passed]
* TODO: Images with the "|" character in the comment [Has never passed]
* TODO: Parents of subpages, two levels up, without trailing slash or name. [Has never passed]
* TODO: Parents of subpages, two levels up, with lots of extra trailing slashes. [Has never passed]
Passed 493 of 511 tests (96.48%)... 18 tests failed!
On 2/18/07, Guillaume Pierre <gpierre(a)cs.vu.nl> wrote:
> As Gerard said, the Vrije Universiteit Amsterdam is working on
> distributed decentralized hosting of a wikipedia-like site. Our first
> results are summarized in an article available here:
> http://www.globule.org/publi/DWECWH_webist2007.html
>
The meat of the idea seems to be to use distributed hash tables to
allow the main database to be moved onto multiple mostly-independent
computers (i.e. break away from the inefficient MySQL
replication/cluster model). This is absolutely something which should
be done. Wikipedia's data model screams for the adoption of this
solution.
I question the benefit of then allowing untrusted third parties to run
the servers, though, because at the end of the paper you acknowledge
that all the data is going to have to pass back through trusted
parties anyway. I'm not convinced that there would be significant
cost savings from introducing untrusted third parties in this
case. Once you've achieved approximately linear scaling of the
database servers, which the appropriate use of DHTs will do, it seems
to me that the costs of downloading the data from untrusted third
parties (doubling the bandwidth) and checking the signatures (eating
up CPU) are going to be nearly as great as the cost of simply adding
another database server.
Of course, I see why you're proposing it - allowing untrusted third
parties to interact directly with the end-user would require end-users
to install some sort of client software if they want to authenticate
the content. But I really think that's the way you've gotta go if
you're going to achieve a real cost savings (or cost distribution).
Let the end-user software check the signatures.
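The check itself is cheap to express - something like the sketch
below, where the key distribution and how the signature travels
alongside the article are obviously hand-waved:

<?php
// Verify that $articleText really was signed by the Foundation's key.
// $signature would arrive with the article from the untrusted host.
function isArticleAuthentic( $articleText, $signature, $publicKeyFile ) {
    $publicKey = openssl_pkey_get_public( file_get_contents( $publicKeyFile ) );
    // openssl_verify() returns 1 for a valid signature, 0 for an invalid one.
    $ok = ( openssl_verify( $articleText, $signature, $publicKey ) === 1 );
    openssl_free_key( $publicKey );
    return $ok;
}
?>

The point is that the cost lands on the end-user's machine instead of
on a trusted server that has to re-download and re-check everything.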
Anthony
Wikipedia has hundreds of wonderful portals on every imaginable topic.
They are perhaps among the most underexposed treasures on the site.
It would be lovely if people could subscribe to a portal feed, or to
individual "boxes" (portals are typically arranged as rectangular
boxes, each with different content). Example:
http://en.wikipedia.org/wiki/Portal:Free_software
How could this be achieved? One way would be to support an RSS
extension that would operate as follows:
1) You put something like
<makefeed>
title=Selected article on free software
addto=freesoftware.xml
addto=freesoftware-sel.xml
</makefeed>
<feedicon>
feed=freesoftware-sel.xml
</feedicon>
inside a template, or indeed any page.
2) When a user edits a page, the extension checks for the presence of
<makefeed> in the wikitext. If it is present, it adds a
( ) Add as new item to RSS feed
( ) Update most recent RSS feed item for this page
(x) no change
selection to the page, below where the minor edit checkbox is. This
selection should only be available to users with a definable
permission level (e.g. autoconfirmed).
3) The feeds could be directly updated/written on the disk, in the
images/ directory. In any case, the <feedicon> tag would generate a
pretty link to a feed with a given name.
The feed content would be the action=render output for the page where
the <makefeed> instruction is found (ideally sans noinclude). It could
also include the edit summary.
Given that a feed could be accessed from multiple pages, you could
build aggregated feeds (in the above example, freesoftware.xml would
be a feed for the whole free software portal) and individual ones
(freesoftware-sel.xml would only be the selected article box). You'd
have to do some clever scanning of the file on disk to make safe
updates, but it shouldn't be too hard.
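To give a sense of how small the moving parts are, here's a rough
sketch of the tag half in PHP. Everything in it - the function names,
the way the key=value lines are parsed, linking straight into the
upload directory - is made up for illustration; the feed writing
itself would hang off a save hook such as ArticleSaveComplete.

$wgExtensionFunctions[] = 'wfMakeFeedSetup';

function wfMakeFeedSetup() {
    global $wgParser;
    $wgParser->setHook( 'makefeed', 'wfMakeFeedTag' );
    $wgParser->setHook( 'feedicon', 'wfFeedIconTag' );
}

// Parse the key=value lines inside <makefeed> / <feedicon>.
function wfMakeFeedParams( $input ) {
    $params = array( 'addto' => array() );
    foreach ( explode( "\n", trim( $input ) ) as $line ) {
        if ( strpos( $line, '=' ) === false ) {
            continue;
        }
        list( $key, $value ) = array_map( 'trim', explode( '=', $line, 2 ) );
        if ( $key == 'addto' ) {
            $params['addto'][] = $value;
        } else {
            $params[$key] = $value;
        }
    }
    return $params;
}

// <makefeed> produces no visible output; it only marks the page so the
// save hook knows which feed files to append an <item> to (using the
// action=render output of the page as the item body).
function wfMakeFeedTag( $input, $args, &$parser ) {
    return '';
}

// <feedicon> just emits a pretty link to the named feed file.
function wfFeedIconTag( $input, $args, &$parser ) {
    global $wgUploadPath;
    $params = wfMakeFeedParams( $input );
    $url = htmlspecialchars( $wgUploadPath . '/' . $params['feed'] );
    return '<a href="' . $url . '" class="feedlink">RSS</a>';
}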
Any conceptual flaws? Any takers? I think this could really make a big
difference for content re-use, not just in the context of Wikipedia.
But the portals seem like a particularly attractive target application
to me.
--
Peace & Love,
Erik
DISCLAIMER: This message does not represent an official position of
the Wikimedia Foundation or its Board of Trustees.
"An old, rigid civilization is reluctantly dying. Something new, open,
free and exciting is waking up." -- Ming the Mechanic
Jim Wilson and I have been working on a new extension, which I call
PagesOnDemand. It does the following:
- Hooks into ArticleFromTitle
- Uses a regex to look for a pattern in the requested title
- If the title matches the pattern and the page does not already
exist, runs a function to create the desired stub content and saves it
Note that if this runs when a user clicks on a red link, it acts like
a blue link - the user is directed to a view of the brand new page.
Currently, if the user searches for a title that matches the regex
and then clicks the red link on the search results page, or clicks
"create this page", the extension populates the page, but the user
lands on the edit page rather than a view of it.
PagesOnDemand is set up so that other extension writers can add
additional regex/content creation combinations. Before I put up a
page in mediawiki.org, I'd like to check with the experts here about
how I've done a couple of things to let future extensions use
PagesOnDemand:
1) The future extensions would need to register to use a "hook"
inside the PagesOnDemand extension. So I'd like to reserve that hook
name - or whatever the devs want me to name it, within reason ;) -
within MediaWiki. I assume that this would have to be approved by
Brion et al.? (I put "hook" in quotes because I'm not using RunHooks
to run the function.)
2) To keep the regex and the function together, I've set it up so
that the content creator extensions register by pushing a 2-element
array onto $wgHooks, where the array is (regex, functionName). For
example, to look for page titles that correspond to PubMed IDs (which
is why I wrote it), the creator extension uses:
$wgHooks['PagesOnDemand'][] = array('/^PMID:\\d+$/', 'wfPubMedOnDemand');
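For anyone trying to picture the dispatch side, here's a simplified
sketch of the loop this registration implies (the function name and
the callback signature are just illustrative):

$wgHooks['ArticleFromTitle'][] = 'wfPagesOnDemandDispatch';

function wfPagesOnDemandDispatch( &$title, &$article ) {
    global $wgHooks;
    // Only fire for titles that don't exist yet.
    if ( $title->exists() || !isset( $wgHooks['PagesOnDemand'] ) ) {
        return true;
    }
    foreach ( $wgHooks['PagesOnDemand'] as $entry ) {
        list( $regex, $function ) = $entry;
        if ( preg_match( $regex, $title->getText() ) ) {
            // The registered creator builds the stub content and saves it,
            // e.g. wfPubMedOnDemand() for PMID: titles.
            call_user_func( $function, $title );
            break;
        }
    }
    return true;
}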
Is this registration approach kosher? Jim W. doesn't like it, but I don't like his
alternative, which is to pass the regex matching to the content
creator. I don't want to distribute the code on mediawiki.org if it
does something evil. Thanks in advance for advice.
Jim
=====================================
Jim Hu
Associate Professor
Dept. of Biochemistry and Biophysics
2128 TAMU
Texas A&M Univ.
College Station, TX 77843-2128
979-862-4054
An automated run of parserTests.php showed the following failures:
This is MediaWiki version 1.10alpha (r19982).
Reading tests from "maintenance/parserTests.txt"...
Reading tests from "extensions/Cite/citeParserTests.txt"...
Reading tests from "extensions/Poem/poemParserTests.txt"...
18 still FAILING test(s) :(
* URL-encoding in URL functions (single parameter) [Has never passed]
* URL-encoding in URL functions (multiple parameters) [Has never passed]
* TODO: Table security: embedded pipes (http://mail.wikipedia.org/pipermail/wikitech-l/2006-April/034637.html) [Has never passed]
* TODO: Link containing double-single-quotes '' (bug 4598) [Has never passed]
* TODO: message transform: <noinclude> in transcluded template (bug 4926) [Has never passed]
* TODO: message transform: <onlyinclude> in transcluded template (bug 4926) [Has never passed]
* BUG 1887, part 2: A <math> with a thumbnail- math enabled [Has never passed]
* TODO: HTML bullet list, unclosed tags (bug 5497) [Has never passed]
* TODO: HTML ordered list, unclosed tags (bug 5497) [Has never passed]
* TODO: HTML nested bullet list, open tags (bug 5497) [Has never passed]
* TODO: HTML nested ordered list, open tags (bug 5497) [Has never passed]
* TODO: Inline HTML vs wiki block nesting [Has never passed]
* TODO: Mixing markup for italics and bold [Has never passed]
* TODO: 5 quotes, code coverage +1 line [Has never passed]
* TODO: dt/dd/dl test [Has never passed]
* TODO: Images with the "|" character in the comment [Has never passed]
* TODO: Parents of subpages, two levels up, without trailing slash or name. [Has never passed]
* TODO: Parents of subpages, two levels up, with lots of extra trailing slashes. [Has never passed]
Passed 493 of 511 tests (96.48%)... 18 tests failed!
http://www.aspirationtech.org/events/devsummit
This should be of interest to anyone involved in both Wikimedia
Foundation issues and open source technology. Sorry for the late
notice, I just found out about it. Aspiration has a couple of other
cool projects, such as their index of nonprofit tools (some of them
open source, some of them not):
http://www.socialsourcecommons.org/
As well as Penguin Days:
http://www.penguinday.org/
--
Peace & Love,
Erik
DISCLAIMER: This message does not represent an official position of
the Wikimedia Foundation or its Board of Trustees.
"An old, rigid civilization is reluctantly dying. Something new, open,
free and exciting is waking up." -- Ming the Mechanic
An automated run of parserTests.php showed the following failures:
This is MediaWiki version 1.10alpha (r19975).
Reading tests from "maintenance/parserTests.txt"...
Reading tests from "extensions/Cite/citeParserTests.txt"...
Reading tests from "extensions/Poem/poemParserTests.txt"...
18 still FAILING test(s) :(
* URL-encoding in URL functions (single parameter) [Has never passed]
* URL-encoding in URL functions (multiple parameters) [Has never passed]
* TODO: Table security: embedded pipes (http://mail.wikipedia.org/pipermail/wikitech-l/2006-April/034637.html) [Has never passed]
* TODO: Link containing double-single-quotes '' (bug 4598) [Has never passed]
* TODO: message transform: <noinclude> in transcluded template (bug 4926) [Has never passed]
* TODO: message transform: <onlyinclude> in transcluded template (bug 4926) [Has never passed]
* BUG 1887, part 2: A <math> with a thumbnail- math enabled [Has never passed]
* TODO: HTML bullet list, unclosed tags (bug 5497) [Has never passed]
* TODO: HTML ordered list, unclosed tags (bug 5497) [Has never passed]
* TODO: HTML nested bullet list, open tags (bug 5497) [Has never passed]
* TODO: HTML nested ordered list, open tags (bug 5497) [Has never passed]
* TODO: Inline HTML vs wiki block nesting [Has never passed]
* TODO: Mixing markup for italics and bold [Has never passed]
* TODO: 5 quotes, code coverage +1 line [Has never passed]
* TODO: dt/dd/dl test [Has never passed]
* TODO: Images with the "|" character in the comment [Has never passed]
* TODO: Parents of subpages, two levels up, without trailing slash or name. [Has never passed]
* TODO: Parents of subpages, two levels up, with lots of extra trailing slashes. [Has never passed]
Passed 493 of 511 tests (96.48%)... 18 tests failed!