Hi,
I created a Node.js service to convert wikitext to HTML, with a frontend
(written in Golang) that reads a Wikipedia dump and feeds wikitext to this
service[1]. However, after doing all this I discovered that Parsoid needs
to contact the Wikimedia servers for template expansion. Since I want to
convert the entire Wikipedia dump to HTML, I do not want to keep hitting
the Wikimedia servers with template expansion requests.
So, are there any plans to add support to Parsoid for doing this expansion
offline? Once the Wikipedia dump is downloaded, I want the entire process
of converting to HTML to be offline (of course, I don't need images).
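For context, the dump-reading half of such a pipeline can be sketched as
below. This is a minimal Python sketch, not the Golang frontend from [1]. It
assumes an uncompressed MediaWiki XML export (the pages-articles format with
<page>, <title> and <text> elements), and the helper names local and
iter_pages are hypothetical. It only streams (title, wikitext) pairs; posting
them to the conversion service and expanding templates are left out.

```python
import xml.etree.ElementTree as ET
from typing import IO, Iterator, Tuple


def local(tag: str) -> str:
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(dump: IO[bytes]) -> Iterator[Tuple[str, str]]:
    """Yield (title, wikitext) for each <page> of a MediaWiki XML dump.

    Assumption: every page carries one <title> and at least one <text>;
    if a page has several revisions, the last <text> seen wins.
    """
    title, text = None, None
    for _event, elem in ET.iterparse(dump, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            if title is not None and text is not None:
                yield title, text
            title, text = None, None
            elem.clear()  # release the finished page to keep memory flat
```

Because iterparse streams the file, memory use should stay roughly constant
even for a full dump, which is the main reason for preferring it over loading
the whole tree.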
[1] https://github.com/nitingupta910/wikiparser
Thanks,
Nitin
I've published the first step of that MediaWiki Extension tool. So far
it's only a script that can populate a database with extension metadata
from extensions in Gerrit.
So, if you have any use for your own sqlite database of extension
metadata, here's how to get one (assuming you have node and git installed).
$ git clone -b v0.1.1 https://github.com/redwerks/mediawiki-extensionservice.git
$ cd mediawiki-extensionservice/
$ npm install
$ npm install -g sequelize-cli
$ npm install sqlite3
$ mkdir storage/
$ echo "STORAGE_DIR=./storage" >> .env
$ echo "DATABASE_TYPE=sqlite" >> .env
$ echo "DATABASE_STORAGE=./storage/db.sqlite" >> .env
$ sequelize db:migrate
$ bin/cron.js
You'll have to wait a few minutes for it to finish. But at the end you
can use whatever sqlite tools you have to look at the database in
`./storage/db.sqlite`.
The Extensions table will contain a list of extensions. Besides the
indexed extid and composerName columns, each row has a data column
containing JSON with data on the extension.
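If you would rather read the table from a script than from a sqlite shell,
something like the following should work. It is a minimal sketch: the table
name Extensions comes from the description above, but the exact column names
(extid, data) are assumed from that description rather than checked against
the schema, and iter_extensions is a hypothetical helper.

```python
import json
import sqlite3
from typing import Any, Dict, Iterator, Tuple


def iter_extensions(conn: sqlite3.Connection) -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Yield (extid, decoded JSON data) for every row of the Extensions table.

    Assumption: the columns are literally named "extid" and "data".
    """
    query = "SELECT extid, data FROM Extensions ORDER BY extid"
    for extid, data in conn.execute(query):
        yield extid, json.loads(data)
```

Calling iter_extensions(sqlite3.connect("./storage/db.sqlite")) after the cron
script has finished would then give one decoded dictionary per extension.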
Some of this data is only available for extensions containing an
extension.json file (The .php file is not parsed).
- name: The name of the extension (English text for "namemsg" -> "name"
-> final fallback to the extension's dirname)
- description: The extension description (English text for
"descriptionmsg" -> "description" -> empty)
- versionHint: The "version" in extension.json.
This data will always be available:
- repository: The git repository url.
- composerName: The "name" in composer.json if present.
- sources: An array of some of the possible ways to install the extension.
  - git-master if the repository has a master branch (basically everything)
  - git-stable if the repository has a HEAD that points to something
    other than master (in this case a "stableBranch" will also be present)
  - git-rel if the repository has REL#_## branches (basically everything)
  - git-tag if the repository has #.#.# or v#.#.# tags
  - composer if the repository has a valid composer.json with a name.
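
The rules for "sources" can be illustrated with a short sketch. This is not
the code the script uses; it is one reading of the rules listed above, written
in Python. The function name detect_sources, its parameters, and the two
regular expressions (taking "#" to mean one or more digits) are all
assumptions.

```python
import re
from typing import Iterable, List, Optional

# Assumed reading of "REL#_##" and "#.#.# or v#.#.#": each "#" run is digits.
REL_BRANCH = re.compile(r"^REL\d+_\d+$")
VERSION_TAG = re.compile(r"^v?\d+\.\d+\.\d+$")


def detect_sources(branches: Iterable[str],
                   tags: Iterable[str],
                   head_branch: Optional[str],
                   composer_name: Optional[str] = None) -> List[str]:
    """Return the source kinds that apply to a repository, in list order."""
    branch_set = set(branches)
    sources = []
    if "master" in branch_set:
        sources.append("git-master")
    if head_branch is not None and head_branch != "master":
        sources.append("git-stable")  # HEAD points somewhere other than master
    if any(REL_BRANCH.match(b) for b in branch_set):
        sources.append("git-rel")
    if any(VERSION_TAG.match(t) for t in tags):
        sources.append("git-tag")
    if composer_name:
        sources.append("composer")
    return sources
```

A repository with branches master and REL1_24, a tag v1.2.3, HEAD on master
and a composer name would get git-master, git-rel, git-tag and composer under
these assumptions.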
--
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://danielfriesen.name/]
As part of T88290, we are going to be making some changes to the data-mw
spec for <ref> tags.
So far, the data-mw attribute for <ref> tags had the entire HTML for the
reference represented in the body.html property in data-mw. However, in
order to reduce the size of the HTML that we generate (and reduce
network load and parsing load on clients, especially visual editor), we
have been working on a change where we add a reference to the HTML via
the body.id attribute in data-mw.
https://gerrit.wikimedia.org/r/#/c/191593/ is the patch that Marc has
been working on.
An example at the end of this email will show the specific change and
how it looks. We will update the DOM spec page[1] shortly.
So, once this patch is reviewed, tested, and deployed (most likely Feb
25 or Mar 2 unless there are concerns / problems that show up), Parsoid
will only be emitting an id-based reference to the HTML. However,
Parsoid will continue to accept both data-mw.body.html and
data-mw.body.id for serialization.
That said, because of the specifics of Parsoid's selective serializer
implementation, if a <ref>'s content has been edited, Parsoid expects to
see *some* edit in the wrapper HTML of the <ref> itself. If you continue
to send Parsoid data-mw.body.html back, all will work fine (since that
will register as an edit). But, if you send Parsoid data-mw.body.id
back, you should change the value of that id to a different value.
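
For clients other than VE, handling both forms might look roughly like the
sketch below. It is written in Python purely for illustration, and the helper
names ref_body and mark_ref_edited are hypothetical. How the new id should be
chosen, and whether the matching element inside <references> has to be renamed
to match, is not covered here, so the caller supplies the new id and the sketch
only checks that it differs from the old one.

```python
import json
from typing import Tuple


def ref_body(data_mw: str) -> Tuple[str, str]:
    """Return ("html", markup) or ("id", element_id) for a <ref>'s data-mw."""
    body = json.loads(data_mw).get("body", {})
    if "html" in body:
        return "html", body["html"]
    if "id" in body:
        return "id", body["id"]
    raise ValueError("data-mw has neither body.html nor body.id")


def mark_ref_edited(data_mw: str, new_id: str) -> str:
    """Return data-mw with body.id replaced, so the wrapper counts as edited.

    Assumption: any value different from the old id is enough.
    data-mw that still uses body.html is returned with its content unchanged.
    """
    obj = json.loads(data_mw)
    body = obj.get("body", {})
    if "id" in body:
        if new_id == body["id"]:
            raise ValueError("new id must differ from the current one")
        body["id"] = new_id
    return json.dumps(obj)
```

With the new-style example at the end of this email, ref_body would return the
pair ("id", "mw-reference-text-cite_note-1"), and the reference text itself
would then be looked up in the element carrying that id.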
This update is a bit late in coming -- kind of lost track of it amidst
the work, but as far as we know, only VE is affected by this change and
they have already fixed their code. Flow has confirmed they aren't. But,
let us know if there are any questions / concerns.
Subbu and Marc.
[1] https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec
--------------------------------------------------------------------------------------------------------------------------------------------------------
Wikitext
--------
A <ref>
This is a '''[[bolded link]]''' and this is a {{echo|transclusion}}
</ref>
<references />
Current HTML
------------
<p>A <span about="#mwt2" class="reference" id="cite_ref-1"
rel="dc:references" typeof="mw:Extension/ref"
data-mw='{"name":"ref","body":{"html":"This is a <b
data-parsoid='{\"dsr\":[19,40,3,3]}'><a rel=\"mw:WikiLink\"
href=\"./Bolded_link\" title=\"Bolded link\"
data-parsoid='{\"stx\":\"simple\",\"a\":{\"href\":\"./Bolded_link\"},\"sa\":{\"href\":\"bolded
link\"},\"dsr\":[22,37,2,2]}'>bolded link</a></b> and this is
a <span about=\"#mwt3\" typeof=\"mw:Transclusion\"
data-parsoid='{\"pi\":[[{\"k\":\"1\",\"spc\":[\"\",\"\",\"\",\"\"]}]],\"dsr\":[55,76,null,null]}'
data-mw='{\"parts\":[{\"template\":{\"target\":{\"wt\":\"echo\",\"href\":\"./Template:Echo\"},\"params\":{\"1\":{\"wt\":\"transclusion\"}},\"i\":0}}]}'>transclusion</span>\n"},"attrs":{}}'><a
href="#cite_note-1">[1]</a></span></p>
<ol class="references" typeof="mw:Extension/references" about="#mwt5"
data-mw='{"name":"references","attrs":{}}'>
<li about="#cite_note-1" id="cite_note-1"><span rel="mw:referencedBy"><a
href="#cite_ref-1">↑</a></span> This is a <b><a rel="mw:WikiLink"
href="./Bolded_link" title="Bolded link">bolded link</a></b> and this is
a <span about="#mwt3" typeof="mw:Transclusion"
data-mw='{"parts":[{"template":{"target":{"wt":"echo","href":"./Template:Echo"},"params":{"1":{"wt":"transclusion"}},"i":0}}]}'>transclusion</span>
</li>
</ol>
New HTML
--------
<p>A <span about="#mwt2" class="reference" id="cite_ref-1"
rel="dc:references" typeof="mw:Extension/ref"
data-mw='{"name":"ref","body":{"id":
"mw-reference-text-cite_note-1"},"attrs":{}}'><a
href="#cite_note-1">[1]</a></span></p>
<ol class="references" typeof="mw:Extension/references" about="#mwt5"
data-mw='{"name":"references","attrs":{}}'>
<li about="#cite_note-1" id="cite_note-1"><span rel="mw:referencedBy"><a
href="#cite_ref-1">↑</a></span> <span id="mw-reference-text-cite_note-1"
class="mw-reference-text">This is a <b><a rel="mw:WikiLink"
href="./Bolded_link" title="Bolded link">bolded link</a></b> and this is
a <span about="#mwt3" typeof="mw:Transclusion"
data-mw='{"parts":[{"template":{"target":{"wt":"echo","href":"./Template:Echo"},"params":{"1":{"wt":"transclusion"}},"i":0}}]}'>transclusion</span>
</span>
</li>
</ol>
Hello there,
I have a bit of a problem with parsing http://www.cruiserswiki.org/wiki/ on
my machine.
When I use Parsoid with this wiki, none of the external links on the page
seem to be converted into <a> elements.
For example, on the page http://www.cruiserswiki.org/wiki/Cesme the external
link
[http://en.wikipedia.org/wiki/Cesme Çesme] (near the bottom of the page)
appears in the Parsoid output as-is instead of being converted to
something like: <a href="http://en.wikipedia.org/wiki/Cesme" class="external
text" rel="nofollow" target="_blank">
I have Ubuntu 14.04 on my machine and initially I tried version 0.2.0 of
Parsoid from http://parsoid.wmflabs.org:8080/deb . Then I cloned Parsoid
from the git repository, but the result is the same.
Any feedback would be highly appreciated.
Regards
Vadim
Hi
I am currently trying to create a cache for "mwoffliner": a cache for images
(thumbnails) and a cache for Parsoid output. For the images/thumbnails
it's pretty straightforward thanks to the "last-modified" header.
Unfortunately, for the Parsoid output, this seems to be more
complicated. Gabriel's htmldumper relies only on the oldid value, but
I'm not really satisfied by this approach because I want to be able to
download a new version of the HTML for the same oldid if necessary (for
example if the HTML output was improved with a Parsoid fix).
There is an "age" header but I don't really understand the fundamental
difference from "last-modified". Do we have the same information here
but presented in another way? If yes, why is that better than
"last-modified"?
In addition there is the "x-varnish" header, but IMO this is internal
information I should not rely on (and BTW, from time to time we get
responses with two "x-varnish" header entries, which looks pretty weird to
me - see PS).
Finally, my question: could we introduce a "last-modified" HTTP header?
Regards
Emmanuel
PS: Here is an example of a request with two "x-varnish" headers:
$ curl -I \
  "http://parsoid-lb.eqiad.wikimedia.org/dewiki/Almer%C3%ADa?oldid=133672544"
HTTP/1.1 200 OK
X-Powered-By: Express
Vary: Accept-Encoding
Access-Control-Allow-Origin: *
Cache-Control: s-maxage=2592000
content-revision-id: 133672544
X-Parsoid-Performance: duration=4063; start=1416051524354
Content-Type: text/html; charset=UTF-8
X-Varnish: 735376643 735208307
Via: 1.1 varnish
Date: Sat, 15 Nov 2014 12:03:47 GMT
X-Varnish: 1047669169
Age: 1499
Via: 1.1 varnish
Connection: keep-alive
X-Cache: cp1058 hit (6), cp1058 frontend miss (0)
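
On the "age" question: in HTTP caching, Age is the number of seconds the
response has been sitting in a cache, so subtracting it from Date gives an
estimate of when the cached copy was generated. A minimal Python sketch of
that arithmetic follows; the function name generated_at is hypothetical, and
whether this estimate is a good enough stand-in for "last-modified" is an
open assumption.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def generated_at(date_header: str, age_header: str) -> datetime:
    """Estimate when a cached response was generated: Date minus Age."""
    served = parsedate_to_datetime(date_header).astimezone(timezone.utc)
    return served - timedelta(seconds=int(age_header))
```

For the response above (Date 12:03:47 GMT, Age 1499) this gives 11:38:48 GMT
on 15 Nov 2014, which lines up with the X-Parsoid-Performance values: start
1416051524354 is 11:38:44 GMT, plus a duration of about 4 seconds.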
--
Kiwix - Wikipedia Offline & more
* Web: http://www.kiwix.org
* Twitter: https://twitter.com/KiwixOffline
* more: http://www.kiwix.org/wiki/Communication