Thanks, Ariel. One small thing: where exactly can I report it upstream? Got
a URL?
Michael
On Tue, May 21, 2013 at 5:45 PM, Ariel T. Glenn <ariel(a)wikimedia.org> wrote:
If you can stomach it I would report it upstream, linking to the earlier
version of the bug they had with a proposed patch etc. I can give them
a test file consisting of the one page with all its revisions, "only"
170 MB uncompressed :-D
It's fine to open a report locally too in mwdumper and link the upstream
report.
Thanks,
Ariel
On 21-05-2013, Tuesday, at 15:57 +0200, Michael Tsikerdekis wrote:
Update on the matter. I've edited pom.xml and changed the xerces version,
which was set to 2.7.1, to 2.9.1, 2.11.0, 2.8.0 and other versions.
The out-of-bounds error changes on later versions, but it still persists.
Also, I tried to use mwdumper with an older version of the Wikipedia dump:
20130102. The error still appears, this time on the first file:
enwiki-20130102-pages-meta-history1.xml-p000000010p000002070.7z
Should I report a new bug on bugzilla for mwdumper?
Michael
On Mon, May 20, 2013 at 4:49 PM, Michael Tsikerdekis
<tsikerdekis(a)gmail.com> wrote:
> Great! At least we know what's causing it. I've seen the thread about
> xerces before but it was too old, so I thought there was probably no
> relation.
>
> Let me know when there is a new build to try out or anything else I
> can do to help fix the problem.
Michael
On Mon, May 20, 2013 at 4:41 PM, Ariel T. Glenn <ariel(a)wikimedia.org>
wrote:
>
>> On 20-05-2013, Monday, at 13:18 +0200, Michael Tsikerdekis wrote:
>>
> >> > 33 pages (0.593/sec), 25,374 revs (455.695/sec)
> >> > Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048
> >> >   at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
> >> >   at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
> >> >   at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
> >> >   at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
> >> >   at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
> >> >   at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
> >> >   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> >> >   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> >> >   at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
> >> >   at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
> >> >   at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
> >> ...
> >>
> >> The file itself is fine; proof of that is that I isolated the
> >> problematic page, removed the first revision (which had been processed
> >> without problems) and then all remaining revisions including the 'bad'
> >> one were handled properly.
> >>
> >> This is most likely a regression:
> >> http://www.gossamer-threads.com/lists/wiki/mediawiki/128069
> >> Our spec says to build against maven's xerces version 2.7.1, and I
> >> expect that never got the patch [1]. I'm not sure what version of the
> >> xerces library is good ([2]).
> >>
> >> I'm adding Chad back on the cc though since he'll have to update the
> >> build specs. Chad, do you want a bugzilla report for this?
> >>
> >> Ariel
> >>
> >> [1] http://www.gossamer-threads.com/lists/wiki/mediawiki/128069
> >> [2] https://issues.apache.org/jira/browse/XERCESJ-1257?page=com.atlassian.jira.…
_______________________________________________
MediaWiki-l mailing list
MediaWiki-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l