Greetings,
I have developed an offline Wikipedia, Wikibooks, Wiktionary, etc. app for the iPhone, which does a somewhat decent job of converting the wiki markup into HTML. However, there are too many templates for me to implement (not to mention, they're a moving target). Without converting these templates, many articles are simply unreadable and useless.
Could you please provide HTML dumps (I mean, with the templates pre-processed into HTML, everything else the same as now) every 3 or 4 months? Or alternatively, could you make the template API available so I could import it in my program?
Best regards, Roberto Flores
What purpose would the dump serve? You don't want to keep the full dump on the device.
On Sun, Sep 9, 2012 at 6:34 PM, Roberto Flores f.roberto.isc@gmail.com wrote:
I have developed an offline Wikipedia, Wikibooks, Wiktionary, etc. app for the iPhone... Without converting these templates, many articles are simply unreadable and useless.
Templates are dumped just like all other pages are. Have you found them in the dumps? Which dump are you looking at right now?
Could you please provide HTML dumps (I mean, with the templates pre-processed into HTML, everything else the same as now) every 3 or 4 months?
A 3- or 4-month frequency seems unlikely to be useful to many people. Otherwise, no comment.
Or alternatively, could you make the template API available so I could import it in my program?
How would this template API function? What does import mean?
-Jeremy
Allow me to reply to each point:
(By the way, my offline app is called WikiGear Offline: http://itunes.apple.com/us/app/wikigear-offline/id453614487?mt=8)
Templates are dumped just like all other pages are...
Yes, but that's only the wikitext definition of what the template does. Code must be written to actually process them into HTML. There are tens of thousands of them, and some I can't even implement myself (e.g., Wiktionary's conjugation templates). If they were already pre-processed into HTML inside the articles' contents, that would solve all of my problems.
What purpose would the dump serve? You don't want to keep the full dump on the device.
I made an indexing program that selects only content articles (with the relevant namespaces included) and compresses everything to a reasonable size (about 7 GB for the English Wikipedia).
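For readers who want to do something similar, here is a minimal sketch of that kind of selection step, in Python. It assumes a standard pages-articles XML dump; the export-schema URI varies between dump versions, and content_pages is just an illustrative name, not Roberto's actual code:

import bz2
import xml.etree.ElementTree as ET

# Assumption: the schema version in this namespace URI differs from
# dump to dump; check the <mediawiki> root element of your dump.
MW = "{http://www.mediawiki.org/xml/export-0.10/}"

def content_pages(dump_path):
    """Stream a pages-articles dump, yielding only main-namespace pages."""
    with bz2.open(dump_path, "rb") as f:
        for _, elem in ET.iterparse(f):
            if elem.tag == MW + "page":
                if elem.findtext(MW + "ns") == "0":  # ns 0 = content articles
                    title = elem.findtext(MW + "title")
                    text = elem.findtext(MW + "revision/" + MW + "text")
                    yield title, text
                elem.clear()  # free memory as the stream advances

Typical use would be: for title, text in content_pages("enwiki-pages-articles.xml.bz2"): ...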
How would this template API function? What does import mean?
By this I mean a set of functions, written in some computer language, to which I could send the template within the wiki markup and receive HTML to display.
Wikipedia does this whenever a page is requested, but I don't know the exact mechanism by which it's done. Maybe you just need to make that code publicly available, and I'll try to make it work with my application somehow.
Take a look at http://en.wikipedia.org/w/api.php?action=parse; it is exactly what you are looking for. Also, a 7 GB app is something you want to CLEARLY state, as eating up that much device space and download bandwidth is probably a problem for most users.
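For what it's worth, a minimal sketch of that call from Python with the requests library (the article title here is an arbitrary example):

import requests

# Fetch the fully rendered HTML of one article via action=parse;
# all templates are expanded server-side before the HTML is returned.
resp = requests.get("https://en.wikipedia.org/w/api.php", params={
    "action": "parse",
    "page": "Albert Einstein",  # arbitrary example title
    "prop": "text",             # request only the rendered HTML body
    "format": "json",
})
html = resp.json()["parse"]["text"]["*"]
print(html[:200])

The ["parse"]["text"]["*"] path is where the classic JSON result format puts the rendered body.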
I think there is a slight misunderstanding about what my app is and does:
It is an offline Wikipedia (et al.) viewer that contains all the content articles in the dump. Everything must be contained within the app's code and the processed dump files, which are downloadable from my own site (gearapps.com).
Take a look at http://en.wikipedia.org/w/api.php?action=parse...
My app is supposed to be fully offline. It does not make any network connections, so I can't use the online API. I need to have the template-processing code within the app, or the templates pre-processed into the dump.
Also, a 7 GB app is something you want to CLEARLY state, as eating up that much device space and download bandwidth is probably a problem for most users.
The files are provided on my own site, so they don't add any load to Wikipedia's servers. The file sizes are shown when you go to download them.
Basically, since you are making an offline app, you either need to parse the wiki pages into HTML offline yourself, or include parsing code in your app.
You are not the first to want this, but due to the nature and complexity of the markup (which includes "parser functions") and of the parser, this is not trivial.
The only parser that is guaranteed to parse MediaWiki markup correctly is MediaWiki itself, but the parser is tied to other code.
There is an open feature request to separate this code so that apps like yours can take just the part of the rendering code they need, or translate that part of the code into another programming language.
Bug 25984 - Isolate parser from database dependencies https://bugzilla.wikimedia.org/show_bug.cgi?id=25984
Nobody at Wikimedia is working on this, but there are some patches from other people that will certainly get you on your way.
But the developers at Wikimedia are very busy making a whole new parser and a WYSIWYG editor to go with it.
Hopefully this will clean up the code to the point that making your own parser becomes a lot easier.
Good luck and sympathy (-: Andrew Dunbar (hippietrail)
On Sun, Sep 9, 2012 at 7:07 PM, Roberto Flores f.roberto.isc@gmail.com wrote:
By this I mean a set of functions, written in some computer language, to which I could send the template within the wiki markup and receive HTML to display. ... Maybe you just need to make that code publicly available, and I'll try to make it work with my application somehow.
See https://gerrit.wikimedia.org/r/gitweb?p=operations%2Fmediawiki-config.git (master branch) for the current configuration (including which extensions are enabled for a specific wiki), and https://gerrit.wikimedia.org/r/gitweb?p=mediawiki/core.git;a=shortlog;h=refs... (wmf/1.20wmf10 branch) for the core code and extensions (extensions are in submodules), with the versions of each repo that are currently deployed. That branch name changes about every 2 weeks.
-Jeremy
Shouldn't you be using ZIM, and aren't dumpHTML and siblings The Right Way to do it? See also http://openzim.org/Build_your_ZIM_file
Nemo
Sent from a phone, which may have done silly autocorrections. On 9 Sep 2012 20:35, "Roberto Flores" f.roberto.isc@gmail.com wrote:
Could you please provide HTML dumps (I mean, with the templates pre-processed into HTML, everything else the same as now) every 3 or 4 months? Or alternatively, could you make the template API available so I could import it in my program?
You can use the API parse action to parse the pages, including templates.
On Sun, Sep 9, 2012 at 8:34 PM, Roberto Flores f.roberto.isc@gmail.com wrote:
Could you please provide HTML dumps (I mean, with the templates pre-processed into HTML, everything else the same as now) every 3 or 4 months?
How a template is rendered into HTML depends very much on the context (e.g., page title, last modification date) and the arguments that it is called with. So an HTML render of all template pages is unlikely to be very useful to you.
Bryan
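A hedged sketch of Bryan's point, in Python with the requests library: the same wikitext parsed under two different context titles produces different HTML. ({{PAGENAME}} is a magic word rather than a template, but many templates consult the same context.)

import requests

API = "https://en.wikipedia.org/w/api.php"

def render(wikitext, title):
    # action=parse accepts raw wikitext plus a context title; the parser
    # resolves context-sensitive constructs like {{PAGENAME}} against it.
    r = requests.get(API, params={
        "action": "parse",
        "text": wikitext,
        "title": title,
        "prop": "text",
        "format": "json",
    })
    return r.json()["parse"]["text"]["*"]

snippet = "This page is called {{PAGENAME}}."
print(render(snippet, "Alpha"))  # HTML saying "...called Alpha."
print(render(snippet, "Beta"))   # same input wikitext, different output

This is why pre-rendering each template page in isolation, without the calling context and arguments, wouldn't help an offline reader.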
Bryan Tong Minh wrote:
How a template is rendered into HTML depends very much on the context (e.g., page title, last modification date) and the arguments that it is called with. So an HTML render of all template pages is unlikely to be very useful to you.
This reply and others in this thread don't make any sense to me. It seems like the original poster is looking at http://dumps.wikimedia.org/other/static_html_dumps/ and noticing that the HTML dumps haven't been updated in years. This is a problem, and it's tracked by https://bugzilla.wikimedia.org/show_bug.cgi?id=15017.
For reference, the English Wikipedia has over 4 million content pages. At a rate of a page per second (which is completely unrealistic, but let's just assume for a moment), you could get the HTML of every content page on the English Wikipedia in about 46.3 days (4,000,000 seconds ÷ 86,400 seconds per day). The suggestion in any discussion regarding HTML dumps that people should just use the API themselves (presumably in combination with an XML dump containing the wikitext of each page) is just absurd. There's enormous value in the HTML dumps. This subject came up in December 2011, and from the comments in that thread, it seemed as though the only reason the HTML dumps haven't been updated is that nobody has run the relevant script.
MZMcBride
On Mon, Sep 10, 2012 at 8:01 AM, MZMcBride z@mzmcbride.com wrote:
...There's enormous value in the HTML dumps. This subject came up in December 2011, and from the comments in that thread, it seemed as though the only reason the HTML dumps haven't been updated is that nobody has run the relevant script.
MZMcBride
AFAIK, E:DumpHTML needs some loving first.
K. Peachey wrote:
AFAIK, E:DumpHTML needs some loving first.
Can you elaborate on this? Is there anything actually stopping the extension (or rather the script) from being run? Of course every piece of software has bugs or feature requests, but if there are blockers to actually running this script, can you point me to the list of them (or, preferably, add them as blockers to bug 15017)?
For context, "E:DumpHTML" refers to https://www.mediawiki.org/wiki/Extension:DumpHTML, a pseudo-extension (quasi-extension?) used to generate HTML dumps.
MZMcBride
* MZMcBride z@mzmcbride.com [2012-09-10 02:45]:
Can you elaborate on this? Is there anything actually stopping the extension (or rather the script) from being run? ...
I use this extension on my wiki (http://spiele.j-crew.de/, http://misc.j-crew.de/wiki-dump/), but I find it quite brittle in the face of MediaWiki software changes. Every few months, a change in trunk breaks the extension in one way or another.
I recently submitted a bunch of fixes for the extension (see https://gerrit.wikimedia.org/r/#/c/17697/). These changes used to work for me a few months ago, but on current trunk image handling in DumpHTML is broken again (filename mangling of images seems broken, and thumbs are not included in the dump, which used to work).
I think HTML dumps of Wikipedia would be very useful, but it needs someone from the WMF to actively maintain this extension.
Best regards, Thomas
On 9 Sep 2012, at 23:33, Thomas Bleher ThomasBleher@gmx.de wrote:
I use this extension on my wiki (http://spiele.j-crew.de/, http://misc.j-crew.de/wiki-dump/), but I find it quite brittle in the face of MediaWiki software changes. Every few months, a change in trunk breaks the extension in one way or another.
This sounds like a good candidate for some Jenkins integration tests (commits are meant to break neither core MediaWiki nor Wikimedia's deployment, which I guess would include this).
Of course, as you say, we'd need someone to take it on and at least fix it up enough to pass in the first place before extending the CI to it.
J.
Dear Roberto
On 09/09/2012 20:34, Roberto Flores wrote:
I have developed an offline Wikipedia, Wikibooks, Wiktionary, etc. app for the iPhone, which does a somewhat decent job at interpreting the wiki markup into HTML.
Great idea, but why reinvent the wheel on the format instead of using the open and interoperable ZIM format backed by the movement: * http://www.openzim.org
We already have a few open-source readers and content: * http://www.kiwix.org * http://cip.github.com/WikiOnBoard
However, there are too many templates for me to program (not to mention, it's a moving target). Without converting these templates, many articles are simply unreadable and useless.
Could you please provide HTML dumps (I mean, with the templates pre-processed into HTML, everything else the same as now) every 3 or 4 months? Or alternatively, could you make the template API available so I could import it in my program?
The way you want to do it cannot really work, for the reasons other people have already explained. There are Wikimedians who have been involved with these topics for years; why not talk with them before starting your project? We have a dedicated mailing list, and you are welcome on it: https://lists.wikimedia.org/mailman/listinfo/offline-l
Regards, Emmanuel
In all frankness, I don't see how it can be complicated or mind-blowing to generate an HTML dump when the software is already there to produce a wiki-markup one. Wiki-markup dumps are of very limited use, and mainly to yourselves alone.
No need to tell me to go solve my problems myself; that's what I've been doing all along.
On 14/09/2012 05:26, Roberto Flores wrote:
In all frankness, I don't see how it can be complicated or mind-blowing to generate an HTML dump when the software is already there to produce a wiki-markup one. ...
Good luck! Emmanuel