If you can stomach it, I would report it upstream, linking to the earlier version of the bug they had, with the proposed patch etc. I can give them a test file consisting of the one page with all its revisions, "only" 170 MB uncompressed :-D
It's also fine to open a report locally against mwdumper and link it to the upstream report.
Thanks,
Ariel
On Tue, 21-05-2013, at 15:57 +0200, Michael Tsikerdekis wrote:
Update on the matter: I edited pom.xml and changed the xerces version, which was set to 2.7.1, to 2.9.1, 2.11.0, 2.8.0 and other versions.
The out-of-bounds error changes form on later versions, but it still persists. I also tried mwdumper with an older Wikipedia dump: 20130102.
With that dump the error appears on the very first file: enwiki-20130102-pages-meta-history1.xml-p000000010p000002070.7z
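When swapping xerces versions in pom.xml, it can help to confirm which parser the JVM actually loads at run time, since a stale jar or the JDK's built-in parser may be picked up instead. The following is only a sketch, not part of mwdumper: the class name ParserInfo is made up for illustration, and it assumes the Xerces jar exposes org.apache.xerces.impl.Version.getVersion(), which it looks up by reflection so the file compiles without Xerces present.

```java
import javax.xml.parsers.SAXParserFactory;

// Hypothetical helper, not part of mwdumper.
public class ParserInfo {
    /** Name of the SAX parser factory class that JAXP actually selected. */
    public static String saxFactoryClass() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    /**
     * Version string reported by a standalone Xerces jar, or null if no such
     * class is found on the classpath. Reflection avoids a compile-time
     * dependency on Xerces.
     */
    public static String xercesVersion() {
        try {
            Class<?> v = Class.forName("org.apache.xerces.impl.Version");
            return (String) v.getMethod("getVersion").invoke(null);
        } catch (ReflectiveOperationException | LinkageError e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println("SAX factory: " + saxFactoryClass());
        String version = xercesVersion();
        System.out.println("Xerces: " + (version != null ? version : "not found on classpath"));
    }
}
```

Running it with the same classpath as mwdumper should show whether the version set in pom.xml is the one in use.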
Should I report a new bug on bugzilla for mwdumper?
Michael
On Mon, May 20, 2013 at 4:49 PM, Michael Tsikerdekis tsikerdekis@gmail.com wrote:
Great! At least we know what's causing it. I had seen the thread about xerces before, but it was so old that I assumed it was unrelated.
Let me know when there is a new build to try out or anything else I can do to help fix the problem.
Michael
On Mon, May 20, 2013 at 4:41 PM, Ariel T. Glenn ariel@wikimedia.org wrote:
On Mon, 20-05-2013, at 13:18 +0200, Michael Tsikerdekis wrote:
33 pages (0.593/sec), 25,374 revs (455.695/sec)
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048
    at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
...
The file itself is fine; proof of that is that I isolated the problematic page, removed the first revision (which had been processed without problems) and then all remaining revisions including the 'bad' one were handled properly.
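One way to find which page to isolate is to parse the dump with a bare SAX handler that remembers the last page title and revision count, then report them when the parser throws. This is a rough sketch rather than the procedure actually used here: the class FailingPageFinder and its methods are hypothetical, it uses whatever SAX parser JAXP selects, and it assumes the usual page/title/revision element names of a MediaWiki XML dump.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of mwdumper.
public class FailingPageFinder {
    /** Tracks the most recent page title and the revisions started in that page. */
    static class Tracker extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        String lastTitle = "(none)";
        int pages = 0;
        int revsInPage = 0;
        boolean inTitle = false;

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if (qName.equals("page")) {
                pages++;
                revsInPage = 0;
                lastTitle = "(untitled)";
            } else if (qName.equals("title")) {
                inTitle = true;
                text.setLength(0);
            } else if (qName.equals("revision")) {
                revsInPage++;
            }
        }

        @Override
        public void characters(char[] ch, int start, int len) {
            if (inTitle) text.append(ch, start, len);
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (qName.equals("title")) {
                lastTitle = text.toString();
                inTitle = false;
            }
        }
    }

    /** Parses the stream and returns a one-line report of where parsing stopped. */
    public static String scan(InputStream in) {
        Tracker t = new Tracker();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(in, t);
            return "ok: " + t.pages + " pages";
        } catch (Exception e) {
            // Also catches runtime errors such as ArrayIndexOutOfBoundsException.
            return "failed in page '" + t.lastTitle + "' at revision " + t.revsInPage + ": " + e;
        }
    }

    public static void main(String[] args) throws Exception {
        // Reads the named file, or stdin so a decompressor can be piped in.
        try (InputStream in = args.length > 0 ? new FileInputStream(args[0]) : System.in) {
            System.out.println(scan(in));
        }
    }
}
```

The revision number in the report counts revisions started within the current page, so the failure lies in or after that revision.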
This is most likely a regression: http://www.gossamer-threads.com/lists/wiki/mediawiki/128069 Our spec says to build against maven's xerces version 2.7.1, and I expect that never got the patch [1]. I'm not sure what version of the xerces library is good ([2]).
I'm adding Chad back on the cc though since he'll have to update the build specs. Chad, do you want a bugzilla report for this?
Ariel
[1] http://www.gossamer-threads.com/lists/wiki/mediawiki/128069
[2] https://issues.apache.org/jira/browse/XERCESJ-1257?page=com.atlassian.jira.p...
MediaWiki-l mailing list MediaWiki-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/mediawiki-l