Hi everyone,
I am trying to restore the revision table from the Wikipedia dumps. I understand that the file I need is probably enwiki-XX-pages-logging.xml.gz
I've downloaded the file and I am using mwdumper version 1.16 from https://integration.wikimedia.org/ci/job/MWDumper-package/org.wikimedia$mwdu...
When I execute the following, I get this error:

java -server -jar mwdumper.jar --format=sql:1.5 enwiki-20130503-pages-logging.xml.gz | gzip -vc > enwiki-latest-pages-articles.sql.gz

Exception in thread "main" java.lang.IllegalArgumentException: Unexpected <id> outside a <page>, <revision>, or <contributor>
        at org.mediawiki.importer.XmlDumpReader.readId(XmlDumpReader.java:329)
        at org.mediawiki.importer.XmlDumpReader.endElement(XmlDumpReader.java:204)
        at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:392)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
        at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
        at org.mediawiki.dumper.Dumper.main(Dumper.java:142)
0.0%
Mwdumper works fine with other 7z/XML files, but not with this one. I tried a couple of different pages-logging XML files, including ones from other-language Wikipedias.
Does anyone know what this error is and why it occurs on this specific file?
PS: I've also tried to build mwdumper: git clone https://gerrit.wikimedia.org/r/p/mediawiki/tools/mwdumper.git mwdumper
However, I couldn't use make or ant, since there is no build.xml or Makefile in the repository.
I appreciate any help you can give me with this.
On Thu, 16-05-2013 at 11:03 +0200, Michael Tsikerdekis wrote:
Hi everyone,
I am trying to restore the revision table from Wikipedia dumps. I understand that the file that I need is probably enwiki-XX-pages-logging.xml.gz
Actually you want one of the XML files with page content: enwiki-20130503-pages-articles.xml.bz2, enwiki-20130503-pages-meta-current.xml.bz2, or one of the various meta-history bz2 or 7z files. This depends on whether you want the current revision of articles and related-namespace pages only, the current revision of all pages, or all revisions of all pages.
Mwdumper will generate SQL from these files to populate the revision, text, and page tables, all intermingled in one output file, so you'll want to grab just the statements pertaining to the revision table if that's all you want to recreate. I don't know how well it works with the latest dumps.
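Grabbing just the revision-table statements can be done with a line filter over the generated SQL. This is only a sketch: it assumes mwdumper emits one statement per line, each beginning literally with "INSERT INTO revision" — check your actual output first, since the table name may be backquoted or carry a prefix.

```shell
# Keep only revision-table statements from mwdumper SQL read on stdin.
# Assumption (see above): one "INSERT INTO revision ..." statement per line.
filter_revision_sql() {
  grep '^INSERT INTO revision'
}

# Typical use (filenames are examples):
#   zcat dump.sql.gz | filter_revision_sql | gzip > revision-only.sql.gz
```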
The pages-logging XML file could conceivably be used to repopulate the logging table and, poorly, part of the user table; I imagine most folks use it for research purposes rather than for importing data.
PS: I've also tried to build mwdumper: git clone https://gerrit.wikimedia.org/r/p/mediawiki/tools/mwdumper.git mwdumper
However, I couldn't use make or ant, since there is no build.xml or Makefile in the repository.
You can backtrack a couple revisions in git to get one that's buildable. I'm cc-ing Chad on this since he knows about the build setup.
Ariel
On Fri, May 17, 2013 at 1:10 AM, Ariel T. Glenn ariel@wikimedia.org wrote:
PS: I've also tried to build mwdumper: git clone https://gerrit.wikimedia.org/r/p/mediawiki/tools/mwdumper.git mwdumper
However, I couldn't use make or ant, since there is no build.xml or Makefile in the repository.
You can backtrack a couple revisions in git to get one that's buildable. I'm cc-ing Chad on this since he knows about the build setup.
That would be because we swapped out Ant in favor of Maven a little while back. `mvn package` should work just fine.
-Chad
Great, that should work just fine! The pages-meta-history files are the ones I want, although I modified the text blob columns to varchar, since I don't really need that data restored and those columns tend to be the largest.
Thank you both for your help!
Michael
On Fri, May 17, 2013 at 11:57 AM, Chad innocentkiller@gmail.com wrote:
<snip>
MediaWiki-l mailing list
MediaWiki-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
I am having a problem with one of the files. Can anyone verify whether the problem is with the file or with mwdumper?
I am using a freshly built version from git (just built it). Here is the log:
$ 7za e -so enwiki-20130503-pages-meta-history1.xml-p000006887p000009316.7z |java -server -jar mwdumper-1.16.jar --format=sql:1.5 | gzip -vc > temp.sql.gz
7-Zip (A) 9.04 beta  Copyright (c) 1999-2009 Igor Pavlov  2009-05-30
p7zip Version 9.04 (locale=en_US.ISO-8859-15,Utf16=on,HugeFiles=on,8 CPUs)
Processing archive: enwiki-20130503-pages-meta-history1.xml-p000006887p000009316.7z
Extracting  enwiki-20130503-pages-meta-history1.xml-p000006887p000009316

3 pages (1.165/sec), 1,000 revs (388.35/sec)
3 pages (0.356/sec), 2,000 revs (237.164/sec)
8 pages (0.677/sec), 3,000 revs (253.807/sec)
13 pages (1.058/sec), 4,000 revs (325.627/sec)
13 pages (0.992/sec), 5,000 revs (381.505/sec)
16 pages (1.169/sec), 6,000 revs (438.436/sec)
16 pages (1.016/sec), 7,000 revs (444.501/sec)
17 pages (0.854/sec), 8,000 revs (401.849/sec)
17 pages (0.695/sec), 9,000 revs (367.752/sec)
18 pages (0.675/sec), 10,000 revs (374.967/sec)
18 pages (0.653/sec), 11,000 revs (399.332/sec)
18 pages (0.626/sec), 12,000 revs (417.043/sec)
18 pages (0.6/sec), 13,000 revs (433.117/sec)
18 pages (0.555/sec), 14,000 revs (431.766/sec)
18 pages (0.499/sec), 15,000 revs (416.17/sec)
19 pages (0.509/sec), 16,000 revs (428.483/sec)
22 pages (0.58/sec), 17,000 revs (448.43/sec)
22 pages (0.571/sec), 18,000 revs (467.302/sec)
23 pages (0.546/sec), 19,000 revs (450.835/sec)
24 pages (0.564/sec), 20,000 revs (469.649/sec)
26 pages (0.587/sec), 21,000 revs (473.912/sec)
28 pages (0.623/sec), 22,000 revs (489.182/sec)
31 pages (0.684/sec), 23,000 revs (507.469/sec)
31 pages (0.647/sec), 24,000 revs (500.584/sec)
33 pages (0.655/sec), 25,000 revs (495.835/sec)
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048
        at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:392)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
        at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
        at org.mediawiki.dumper.Dumper.main(Dumper.java:142)
77.4%
Michael
On Fri, May 17, 2013 at 3:57 PM, Michael Tsikerdekis tsikerdekis@gmail.com wrote:
<snip>
On Sun, 19-05-2013 at 23:43 +0200, Michael Tsikerdekis wrote:
$ 7za e -so enwiki-20130503-pages-meta-history1.xml-p000006887p000009316.7z |java -server -jar mwdumper-1.16.jar --format=sql:1.5 | gzip -vc > temp.sql.gz
<snip>
31 pages (0.647/sec), 24,000 revs (500.584/sec)
33 pages (0.655/sec), 25,000 revs (495.835/sec)
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048
        at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
Can you please rerun mwdumper with the additional argument --progress=1 which should tell us the exact number of revisions processed before it dies?
Thanks,
Ariel
Thanks Ariel.
I reran it with --progress=1, and here are the final lines:
33 pages (0.594/sec), 25,370 revs (456.336/sec)
33 pages (0.594/sec), 25,371 revs (456.329/sec)
33 pages (0.594/sec), 25,372 revs (456.315/sec)
33 pages (0.593/sec), 25,373 revs (455.718/sec)
33 pages (0.593/sec), 25,374 revs (455.695/sec)
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048
        at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:392)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
        at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
        at org.mediawiki.dumper.Dumper.main(Dumper.java:142)
77.4%
Michael
On Mon, May 20, 2013 at 1:09 PM, Ariel T. Glenn ariel@wikimedia.org wrote:
<snip>
On Mon, 20-05-2013 at 13:18 +0200, Michael Tsikerdekis wrote:
<snip>
The file itself is fine; proof of that is that I isolated the problematic page, removed the first revision (which had been processed without problems), and then all remaining revisions, including the 'bad' one, were handled properly.
This is most likely a regression: http://www.gossamer-threads.com/lists/wiki/mediawiki/128069
Our spec says to build against Maven's Xerces version 2.7.1, and I expect that never got the patch [1]. I'm not sure which version of the Xerces library is good ([2]).
I'm adding Chad back on the cc though since he'll have to update the build specs. Chad, do you want a bugzilla report for this?
Ariel
[1] http://www.gossamer-threads.com/lists/wiki/mediawiki/128069 [2] https://issues.apache.org/jira/browse/XERCESJ-1257?page=com.atlassian.jira.p...
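The isolation step described above (pulling a single page's element out of the decompressed XML stream) can be sketched with a small awk helper. The helper name and the assumption that the <page>/</page> tags sit on their own lines are mine; the tag names themselves come from the dump format.

```shell
# Print only the Nth <page>...</page> element from a dump stream on stdin.
# Sketch only: assumes <page> open/close tags appear on their own lines.
extract_page() {
  awk -v n="$1" '
    /<page>/             { p++ }       # count page elements as they open
    p == n               { print }     # emit lines of the wanted page
    /<\/page>/ && p == n { exit }      # stop once that page is closed
  '
}

# Typical use (filename is an example):
#   7za e -so history.7z | extract_page 2 > bad-page.xml
```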
Great, at least we know what's causing it! I'd seen the thread about Xerces before, but it was so old that I assumed it was probably unrelated.
Let me know when there is a new build to try out or anything else I can do to help fix the problem.
Michael
On Mon, May 20, 2013 at 4:41 PM, Ariel T. Glenn ariel@wikimedia.org wrote:
<snip>
Update on the matter: I've edited pom.xml and changed the Xerces version, which was set to 2.7.1, to 2.9.1, 2.11.0, 2.8.0, and other versions.
The out-of-bounds error changes on later versions, but it still persists. I also tried mwdumper with an older Wikipedia dump: 20130102.
This time the error appears on the first file: enwiki-20130102-pages-meta-history1.xml-p000000010p000002070.7z
Should I report a new bug on bugzilla for mwdumper?
Michael
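For reference, the pom.xml version swap described above can be scripted as a stream edit. bump_xerces is a hypothetical helper, and it assumes the version number appears literally inside a <version> element in pom.xml:

```shell
# Rewrite a literal <version> element read on stdin; old and new version
# strings are the two arguments. Assumption: the version is declared
# literally in the pom.xml <dependency> block, not via a property.
bump_xerces() {
  sed "s#<version>$1</version>#<version>$2</version>#"
}

# Typical use:
#   bump_xerces 2.7.1 2.11.0 < pom.xml > pom.xml.new && mv pom.xml.new pom.xml
#   mvn package
```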
On Mon, May 20, 2013 at 4:49 PM, Michael Tsikerdekis tsikerdekis@gmail.com wrote:
<snip>
If you can stomach it, I would report it upstream, linking to the earlier version of the bug with its proposed patch, etc. I can give them a test file consisting of the one page with all its revisions, "only" 170 MB uncompressed :-D
It's fine to open a report locally too in mwdumper and link the upstream report.
Thanks,
Ariel
On Tue, 21-05-2013 at 15:57 +0200, Michael Tsikerdekis wrote:
<snip>
Thanks, Ariel. One small thing: where exactly can I report it upstream? Got a URL?
Michael
On Tue, May 21, 2013 at 5:45 PM, Ariel T. Glenn ariel@wikimedia.org wrote:
<snip>
I think this will do it:
http://xerces.apache.org/xerces2-j/jira.html
Ariel
On Tue, 21-05-2013 at 17:53 +0200, Michael Tsikerdekis wrote:
<snip>
Posted the bug here:
https://issues.apache.org/jira/browse/XERCESJ-1614
Now we just have to wait and see. In the meantime, I'll try the official MediaWiki importer and see if that works.
Michael
On Tue, May 21, 2013 at 6:30 PM, Ariel T. Glenn ariel@wikimedia.org wrote:
<snip>
On Wed, 22-05-2013 at 13:55 +0200, Michael Tsikerdekis wrote:
Posted the bug here:
https://issues.apache.org/jira/browse/XERCESJ-1614
Now we just have to wait and see. In the meantime I'll try to use the official mediawiki importer and see if that works.
If you mean importDump.php, I strongly recommend against it. For a large wiki it's going to take forever, if indeed it completes at all.
I would suggest removing the page that causes the problem and using mwdumper on the rest, then using importDump.php on the one page and its revisions only.
There are a couple of other experimental tools you could try, but they are indeed experimental; see http://meta.wikimedia.org/wiki/Data_dumps/Tools_for_importing
Ariel