Hello,
I have used Special:Export at en.wikipedia to export
"Diabetes_mellitus" and ticked the box "include templates" (I'm only
really after the templates).
The resulting XML file is 40.1 MB, so I decided to go with mwdumper
rather than Special:Import.
I'm working on a fresh build of mediawiki on my local system. When
running the command:
java -jar mwdumper.jar --format=sql:1.5 Wikipedia-20090113203939.xml | mysql -u root -p wiki
It is returning the following error:
1 pages (0.102/sec), 1,000 revs (102.062/sec)
ERROR 1062 (23000) at line 99: Duplicate entry '45970' for key 1
Exception in thread "main" java.io.IOException: XML document structures must start and end within the same entity.
        at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source)
        at org.mediawiki.dumper.Dumper.main(Unknown Source)
Caused by: org.xml.sax.SAXParseException: XML document structures must start and end within the same entity.
        at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
        at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
        at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
        at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
        at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.endEntity(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentScannerImpl.endEntity(Unknown Source)
        at org.apache.xerces.impl.XMLEntityManager.endEntity(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:176)
        ... 2 more
Can anyone please advise? After some googling, the only advice I
managed to find was:
"Before you start, try clearing the tables that mwdumper works in:
DELETE FROM page; DELETE FROM revision; DELETE FROM text; "
I have done this and tried again, but the same error continues.
Many thanks, Dawson
Just some basic comments; I'm sure Brion has more.
leonsp(a)svn.wikimedia.org schreef:
> Revision: 45755
> Author: leonsp
> Date: 2009-01-14 22:20:15 +0000 (Wed, 14 Jan 2009)
>
> Log Message:
> -----------
> (bug 17028) Added support for IBM DB2 database. config/index.php has new interface elements that only show up if PHP has ibm_db2 module enabled. AutoLoader knows about the new DB2 classes. GlobalFunctions has a new constant for DB2 time format. Revision class fixed slightly. Also includes new PHP files containing the Database and Search API implementations for IBM DB2.
>
> [...]
> Modified: trunk/phase3/includes/Revision.php
> ===================================================================
> --- trunk/phase3/includes/Revision.php 2009-01-14 22:15:50 UTC (rev 45754)
> +++ trunk/phase3/includes/Revision.php 2009-01-14 22:20:15 UTC (rev 45755)
> @@ -961,6 +961,10 @@
> */
> static function getTimestampFromId( $title, $id ) {
> $dbr = wfGetDB( DB_SLAVE );
> + // Casting fix for DB2
> + if ($id == '') {
> + $id = 0;
> + }
>
You should probably use intval($id) here.
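For instance (just a sketch of the suggested change):

    // intval() maps '' to 0 as well, so the explicit special case
    // is unnecessary, and non-numeric garbage is handled too
    $id = intval( $id );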
> [...]
>
> Added: trunk/phase3/includes/SearchIBM_DB2.php
> ===================================================================
> --- trunk/phase3/includes/SearchIBM_DB2.php (rev 0)
> +++ trunk/phase3/includes/SearchIBM_DB2.php 2009-01-14 22:20:15 UTC (rev 45755)
> @@ -0,0 +1,247 @@
> +<?php
> +# Copyright (C) 2004 Brion Vibber <brion(a)pobox.com>
>
If you wrote this file, you should attribute yourself.
>
> Added: trunk/phase3/includes/db/DatabaseIbm_db2.php
> ===================================================================
> --- trunk/phase3/includes/db/DatabaseIbm_db2.php (rev 0)
> +++ trunk/phase3/includes/db/DatabaseIbm_db2.php 2009-01-14 22:20:15 UTC (rev 45755)
>
> +/**
> + * Utility class for generating blank objects
> + * Intended as an equivalent to {} in Javascript
> + * @ingroup Database
> + */
> +class BlankObject {
> +}
>
Just use $obj = new stdClass; here.
> +
> +/**
> + * This represents a column in a DB2 database
> + * @ingroup Database
> + */
> +class IBM_DB2Field {
> + private $name, $tablename, $type, $nullable, $max_length;
> +
> + /**
> + * Builder method for the class
> + * @param Object $db Database interface
> + * @param string $table table name
> + * @param string $field column name
> + * @return IBM_DB2Field
> + */
> + static function fromText($db, $table, $field) {
> + [...]
> + }
> + /**
> + * Get column name
> + * @return string column name
> + */
> + function name() { return $this->name; }
> + /**
> + * Get table name
> + * @return string table name
> + */
> + function tableName() { return $this->tablename; }
> + /**
> + * Get column type
> + * @return string column type
> + */
> + function type() { return $this->type; }
> + /**
> + * Can column be null?
> + * @return bool true or false
> + */
> + function nullable() { return $this->nullable; }
> + /**
> + * How much can you fit in the column per row?
> + * @return int length
> + */
> + function maxLength() { return $this->max_length; }
> +}
>
Why do you need this? The other Database backends don't have it.
> +
> +/**
> + * Wrapper around binary large objects
> + * @ingroup Database
> + */
> +class IBM_DB2Blob {
> + private $mData;
> +
> + function __construct($data) {
> + $this->mData = $data;
> + }
> +
> + function getData() {
> + return $this->mData;
> + }
> +}
>
Why do you need these wrapper objects?
> [...]
> + public function is_numeric_type( $type ) {
> + switch (strtoupper($type)) {
> + case 'SMALLINT':
> + case 'INTEGER':
> + case 'INT':
> + case 'BIGINT':
> + case 'DECIMAL':
> + case 'REAL':
> + case 'DOUBLE':
> + case 'DECFLOAT':
> + return true;
> + }
> + return false;
> + }
>
Indentation looks wrong here.
> + /**
> + * Construct a LIMIT query with optional offset
> + * This is used for query pages
> + * $sql string SQL query we will append the limit too
> + * $limit integer the SQL limit
> + * $offset integer the SQL offset (default false)
> + */
> + public function limitResult($sql, $limit, $offset=false) {
> + if( !is_numeric($limit) ) {
> + throw new DBUnexpectedError( $this, "Invalid non-numeric limit passed to limitResult()\n" );
> + }
> + if( $offset ) {
> + wfDebug("Offset parameter not supported in limitResult()\n");
> + }
> + // TODO implement proper offset handling
> + // idea: get all the rows between 0 and offset, advance cursor to offset
> + return "$sql FETCH FIRST $limit ROWS ONLY ";
> + }
>
So DB2 spells LIMIT $n as FETCH FIRST $n ROWS ONLY, and this wrapper
doesn't even implement offset handling?
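If offsets are wanted, DB2 can emulate them with ROW_NUMBER(); a rough,
untested sketch (the mw_rownum alias is made up, and the ordering must
come from the inner query for the slice to be deterministic):

    public function limitResult( $sql, $limit, $offset = false ) {
        if ( !is_numeric( $limit ) ) {
            throw new DBUnexpectedError( $this,
                "Invalid non-numeric limit passed to limitResult()\n" );
        }
        if ( $offset ) {
            // Number the rows of the inner query, then keep only the
            // window ($offset, $offset + $limit]
            $min = intval( $offset ) + 1;
            $max = intval( $offset ) + intval( $limit );
            return "SELECT * FROM ( "
                . "SELECT sub.*, ROW_NUMBER() OVER() AS mw_rownum "
                . "FROM ( $sql ) AS sub ) AS tmp "
                . "WHERE mw_rownum BETWEEN $min AND $max";
        }
        return "$sql FETCH FIRST $limit ROWS ONLY ";
    }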
> + /**
> + * USE INDEX clause
> + * DB2 doesn't have them and returns ""
> + * @param sting $index
> + */
> + public function useIndexClause( $index ) {
> + return "";
> + }
>
What do you mean DB2 "doesn't have them"? FORCE INDEX isn't supported in
DB2? Then unless its index choosing algorithm is extremely good, it
won't be able to run certain queries with satisfactory efficiency.
> + public function select( $table, $vars, $conds='', $fname = 'DatabaseIbm_db2::select', $options = array(), $join_conds = array() )
> + {
> + $res = parent::select( $table, $vars, $conds, $fname, $options, $join_conds );
> +
> + // We must adjust for offset
> + if ( isset( $options['LIMIT'] ) ) {
> + if ( isset ($options['OFFSET'] ) ) {
> + $limit = $options['LIMIT'];
> + $offset = $options['OFFSET'];
> + }
> + }
>
This only sets $limit if both $options['LIMIT'] and $options['OFFSET']
are set, which I'm pretty sure is not what you want.
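Presumably it should default the offset instead, along these lines:

    if ( isset( $options['LIMIT'] ) ) {
        $limit = $options['LIMIT'];
        // OFFSET is optional; don't require it for LIMIT to take effect
        $offset = isset( $options['OFFSET'] ) ? $options['OFFSET'] : 0;
    }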
> +
> +
> + // DB2 does not have a proper num_rows() function yet, so we must emulate it
> + // DB2 9.5.3/9.5.4 and the corresponding ibm_db2 driver will introduce a working one
> + // Yay!
>
You probably want to detect the version and use num_rows() if it's
available.
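For example (untested sketch; I'm assuming the ibm_db2 connection
resource lives in $this->mConn and $stmt holds the statement resource):

    // DBMS_VER is a dotted string such as '09.05.0003', so a plain
    // string comparison is enough to gate on 9.5.3+
    $info = db2_server_info( $this->mConn );
    if ( strcmp( $info->DBMS_VER, '09.05.0003' ) >= 0 ) {
        $numRows = db2_num_rows( $stmt );
    }
    // on older servers, fall through to the COUNT(*) emulation below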
> +
> + // we want the count
> + $vars2 = array('count(*) as num_rows');
> + // respecting just the limit option
> + $options2 = array();
> + if ( isset( $options['LIMIT'] ) ) $options2['LIMIT'] = $options['LIMIT'];
>
Can't you just rewrite LIMIT n to FETCH FIRST n ROWS ONLY here?
> Added: trunk/phase3/maintenance/ibm_db2/README
> ===================================================================
> --- trunk/phase3/maintenance/ibm_db2/README (rev 0)
> +++ trunk/phase3/maintenance/ibm_db2/README 2009-01-14 22:20:15 UTC (rev 45755)
> @@ -0,0 +1,41 @@
> +== Syntax differences between other databases and IBM DB2 ==
> +{| border cellspacing=0 cellpadding=4
> +!MySQL!!IBM DB2
> +|-
> +
> +|SELECT 1 FROM $table LIMIT 1
> +|SELECT COUNT(*) FROM SYSIBM.SYSTABLES ST
> +WHERE ST.NAME = '$table' AND ST.CREATOR = '$schema'
> [...]
>
This is probably better off as plain text than as wikitext.
> Added: trunk/phase3/maintenance/ibm_db2/tables.sql
> ===================================================================
> --- trunk/phase3/maintenance/ibm_db2/tables.sql (rev 0)
> +++ trunk/phase3/maintenance/ibm_db2/tables.sql 2009-01-14 22:20:15 UTC (rev 45755)
> @@ -0,0 +1,604 @@
> +-- DB2
> +
> +-- SQL to create the initial tables for the MediaWiki database.
> +-- This is read and executed by the install script; you should
> +-- not have to run it by itself unless doing a manual install.
> +-- This is the IBM DB2 version.
> +-- For information about each table, please see the notes in maintenance/tables.sql
> +-- Please make sure all dollar-quoting uses $mw$ at the start of the line
> +-- TODO: Change CHAR/SMALLINT to BOOL (still used in a non-bool fashion in PHP code)
> +
> +
> +
> +
> +CREATE SEQUENCE user_user_id_seq AS INTEGER START WITH 0 INCREMENT BY 1;
> +CREATE TABLE mwuser ( -- replace reserved word 'user'
> + user_id INTEGER NOT NULL PRIMARY KEY, -- DEFAULT nextval('user_user_id_seq'),
> + user_name VARCHAR(255) NOT NULL UNIQUE,
> + user_real_name VARCHAR(255),
> + user_password clob(1K),
> + user_newpassword clob(1K),
> + user_newpass_time TIMESTAMP,
> + user_token VARCHAR(255),
> + user_email VARCHAR(255),
> + user_email_token VARCHAR(255),
> + user_email_token_expires TIMESTAMP,
> + user_email_authenticated TIMESTAMP,
> + user_options CLOB(64K),
> + user_touched TIMESTAMP,
> + user_registration TIMESTAMP,
> + user_editcount INTEGER
> +);
> +CREATE INDEX user_email_token_idx ON mwuser (user_email_token);
>
You shouldn't rename indices like that, because index names are used in
FORCE INDEX clauses (oh wait, but they weren't supported, right?)
> +-- should be replaced with OmniFind, Contains(), etc
> +CREATE TABLE searchindex (
> + si_page int NOT NULL,
> + si_title varchar(255) NOT NULL default '',
> + si_text clob NOT NULL
> +);
>
Don't you need some index on this table to enable efficient searching?
Again, this is only a very shallow look on my part; Brion will
probably have more interesting things to say.
Roan Kattouw (Catrope)
Hi,
Just completed a project using the Wikipedia page counters made available by
Domas Mituzas (
http://lists.wikimedia.org/pipermail/wikitech-l/2007-December/035435.html).
WikiGeist is an attempt to build the Wikipedia equivalent of Google's Hot
Trends or other websites' "most popular" widgets. It tracks, aggregates,
ranks and reports the page views on en.wikipedia.org. There are three types
of report: Top Pages by Count (ranks the articles according to the number of
page views during the past hour), Top New Entries (ranks the articles by
page views where the prior page views were 0) and Top Pages by Page Count
Increase. When articles are accessed individually, an excerpt of the
Wikipedia page is shown, as well as a graph of the trend during the past 24
hours.
Let me know what you think of it.
Thanks.
willy -- [[user:Tookam]]
It seems image redirects are somehow broken. You cannot view history
diffs (nothing shows up), and categories don't work on image redirects.
Is this bug known? An example is visible at
<http://test.wikipedia.org/w/index.php?title=File:ARTI.svg>; see my
edits in the history there. It's broken on all wikis, not only on test.
I made my test there to see whether the new code to move images,
activated on test, would cure the problem. It seems it will not.
Marcus Buck
Hello,
People on Persian Wikipedia are eager to know if there is a way to let
MediaWiki's search feature suggest some keywords when misspelled words are
searched for on Persian Wikipedia. As I'm not that familiar with the
MediaWiki search suggestions feature, I would be thankful if someone could
share their thoughts on this idea.
Best,
Hojjat (aka Huji)
Hello,
I am looking for a way to download just the templates; even a way
to download only the basic templates needed to make an infobox would
be fine.
I have set up a MediaWiki and I'm looking to use the infobox template
from Wikipedia. I've tried to manually copy the template but then
realised there were too many transcluded pages and this would take me
forever, which led me to http://en.wikipedia.org/wiki/Wikipedia_talk:Database_download#Downloading_t…
and it appears that quite a few other people are asking the same
question as me. One answer that's been given is to download
pages-articles.xml.bz2, however it's 4.1 GB and includes all Wikipedia
articles, not just the templates.
Could someone advise?
Thank you, Dawson
Hi,
I have an extension that needs to parse some wiki text. I started with:
$title = $parser->internalParse($title);
but found that links were not always processed correctly: [[Main Page]] was
staying as-is, though interestingly the first [[:image: ]] tag was parsed.
However,
$title =
$parser->parse($title,$wgTitle,ParserOptions::newFromUser($wgUser),false,false)->getText();
did work. This is with MW 1.13.1.
However, now the batch processes take forever, and the culprit is the
"->parse" call - it seems to take exponentially longer to run. internalParse
runs quicker, so for the time being I have detected command-line functions
and NS_SPECIAL pages and sent those through internalParse (as I don't really
care whether the pages parse correctly there).
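Roughly, the branching looks like this (simplified sketch):

    // fast path where rendering fidelity doesn't matter
    if ( $wgCommandLineMode || $wgTitle->getNamespace() == NS_SPECIAL ) {
        $html = $parser->internalParse( $text );
    } else {
        $html = $parser->parse( $text, $wgTitle,
            ParserOptions::newFromUser( $wgUser ), false, false )->getText();
    }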
Are there any known limitations of internalParse? Should it always
output the same as parse, or are there certain cases which will never work
with it?
Kind regards,
Alex
--
Alex Powell
Exscien Training Ltd
Tel: +44 (0) 1865 876562
Mob: +44 (0) 759 5048178
skype: alexp700
mailto:alexp@exscien.com
http://www.exscien.com
Registered in England and Wales 05927635, Unit 10 Wheatley Business Centre,
Old London Road, Wheatley, OX33 1XW, England
Hello,
if you look at http://download.wikipedia.org/dawiki/20090109/ the dump
process for the new dawiki dump seems to be frozen. Perhaps it should be
killed manually.
Best regards
Andim