Hi All,
My name is Hagar Shilo. I'm a web developer and a student at Tel Aviv University, Israel.
This summer I will be working on a user search menu and user filters for Wikipedia's "Recent changes" section. Here is the workplan: https://phabricator.wikimedia.org/T190714
My mentors are Moriel and Roan.
I am looking forward to becoming a Wikimedia developer and an open source contributor.
Cheers, Hagar
Welcome / ברוכה הבאה!
Hi all,
I am wondering what is the fastest/best way to get a local dump of English Wikipedia in HTML? We are looking just for the current versions (no edit history) of articles for the purposes of a research project.
We have been exploring using bliki [1] to do the conversion of the source markup in the Wikipedia dumps to HTML, but the latest version seems to take on average several seconds per article (including after the most common templates have been downloaded and stored locally). This means it would take several months to convert the dump.
We also considered using Nutch to crawl Wikipedia, but with a reasonable crawl delay (5 seconds) it would take several months to get a copy of every article in HTML (or at least the "reachable" ones).
Hence we are a bit stuck right now and not sure how to proceed. Any help, pointers or advice would be greatly appreciated!!
Best, Aidan
[1] https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Home
Just in case you have not thought of it, how about taking the XML dump and converting it to the format you are looking for?
Ref https://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_W...
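As a very rough sketch (assuming Python 3 and the bzip2 pages-articles dump; the file name below is just a placeholder), streaming the dump and pulling out the current wikitext of each page could look something like this; converting that wikitext to HTML is, of course, the separate and harder step:

```python
# Rough sketch: stream the pages-articles XML dump and yield (title, wikitext)
# pairs for the current revision of each page. Standard library only; the
# dump file name is a placeholder for whichever dump you download.
import bz2
import xml.etree.ElementTree as ET

DUMP = "enwiki-latest-pages-articles.xml.bz2"  # placeholder file name

def local_name(elem):
    # The export XML uses a versioned namespace; strip it so the code
    # does not depend on the exact schema version.
    return elem.tag.rsplit("}", 1)[-1]

def iter_pages(path):
    with bz2.open(path, "rb") as f:
        title, text = None, None
        for _event, elem in ET.iterparse(f, events=("end",)):
            name = local_name(elem)
            if name == "title":
                title = elem.text
            elif name == "text":
                text = elem.text or ""
            elif name == "page":
                yield title, text
                elem.clear()  # keep memory usage flat while streaming

if __name__ == "__main__":
    for title, wikitext in iter_pages(DUMP):
        # Hand the wikitext to whatever wikitext-to-HTML converter you choose.
        print(title, len(wikitext))
```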
Fae
Hi Fae,
On 03-05-2018 16:18, Fæ wrote:
Just in case you have not thought of it, how about taking the XML dump and converting it to the format you are looking for?
Ref https://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_W...
Thanks for the pointer! We are currently attempting to do something like that with bliki. The issue is that we are interested in the semi-structured HTML elements (lists, tables, etc.), which are often generated by external templates with complex structures. Often we cannot even tell from a template's invocation in an article whether it will produce a table, a list, a box, etc.; e.g., the markup might just say "Weather box", which gets expanded into a table.
Although bliki can help us interpret and expand those templates, each page takes quite a long time, meaning months of computation to get the semi-structured data we want from the dump. Because of these templates we have not had much success yet with this route of taking the XML dump and converting it to HTML (or even parsing it directly); hence we're still looking for other options. :)
Cheers, Aidan
Hey Aidan!
I would suggest checking out RESTBase ( https://www.mediawiki.org/wiki/RESTBase), which offers an API for retrieving HTML versions of Wikipedia pages. It's maintained by the Wikimedia Foundation and used by a number of production Wikimedia services, so you can rely on it.
I don't believe there are any prepared dumps of this HTML, but you should be able to iterate through the RESTBase API, as long as you follow the rules (from https://en.wikipedia.org/api/rest_v1/):
- Limit your clients to no more than 200 requests/s to this API. Each API endpoint's documentation may detail more specific usage limits.
- Set a unique User-Agent or Api-User-Agent header that allows us to contact you quickly. Email addresses or URLs of contact pages work well.
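For example, a minimal, polite client for a single page could look something like this (just a sketch, assuming Python with the requests library; the User-Agent contact string is a placeholder you should replace with your own details):

```python
# Minimal sketch: fetch the Parsoid HTML of one article from the REST API.
import requests

# Placeholder contact details -- the API rules ask for a way to reach you.
HEADERS = {"User-Agent": "enwiki-html-research/0.1 (contact: you@example.org)"}

def fetch_html(title):
    url = ("https://en.wikipedia.org/api/rest_v1/page/html/"
           + requests.utils.quote(title, safe=""))
    resp = requests.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return resp.text

if __name__ == "__main__":
    html = fetch_html("Tel Aviv University")
    print(html[:200])
```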
On 2018-05-03 20:54, Aidan Hogan wrote:
I am wondering what is the fastest/best way to get a local dump of English Wikipedia in HTML? We are looking just for the current versions (no edit history) of articles for the purposes of a research project.
The Kiwix project provides HTML dumps of Wikipedia for offline reading: http://www.kiwix.org/downloads/
Their downloads use the ZIM file format; it looks like there are libraries available for reading it in many programming languages: http://www.openzim.org/wiki/Readers
Also, for the curious, the request for dedicated HTML dumps is tracked in this Phabricator task: https://phabricator.wikimedia.org/T182351
On Friday 04 May 2018 03:49 AM, Bartosz Dziewoński wrote:
The Kiwix project provides HTML dumps of Wikipedia for offline reading: http://www.kiwix.org/downloads/
In case you need pure HTML rather than the ZIM file format, you could check out mwoffliner [1], the tool used to generate the ZIM files. It dumps HTML files locally before generating the ZIM file. Although the HTML is only an intermediate output for the tool, it can be kept if you wish. See [2] for more information about the options the tool accepts.
I'm not sure whether it's possible to instruct the tool to stop immediately after dumping the pages, thus avoiding the creation of the ZIM file altogether. But you could work around that by watching the verbose output (turned on with the '--verbose' option) to identify when dumping has completed and then stopping the tool manually.
In case of any doubts about using the tool, feel free to reach out.
References:
[1] https://github.com/openzim/mwoffliner
[2] https://github.com/openzim/mwoffliner/blob/master/lib/parameterList.js
On Tuesday 08 May 2018 05:53 PM, Kaartic Sivaraam wrote:
In case you need pure HTML and not the ZIM file format, you could check out mwoffliner[1], ...
Note that this HTML is (of course) not identical to what you see when visiting Wikipedia. For example, the sidebar links and the table of contents are not present.
Hi all,
Many thanks for all the pointers! In the end we wrote a small client to grab documents from RESTBase (https://www.mediawiki.org/wiki/RESTBase) as suggested by Neil. The HTML looks perfect, and with the generous 200 requests/second limit (which we could not even manage to reach with our local machine), it only took a couple of days to grab all current English Wikipedia articles.
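(The client really is tiny; a simplified sketch, not our exact code, assuming Python with requests and a placeholder 'titles.txt' with one article title per line, would look something like the following.)

```python
# Simplified sketch of a throttled RESTBase HTML grabber. It reads article
# titles (one per line) from a placeholder file and writes one HTML file per
# article, staying far below the documented 200 requests/s limit.
import pathlib
import time
import requests

HEADERS = {"User-Agent": "enwiki-html-research/0.1 (contact: you@example.org)"}
OUT_DIR = pathlib.Path("html_dump")
OUT_DIR.mkdir(exist_ok=True)

def grab(session, title):
    url = ("https://en.wikipedia.org/api/rest_v1/page/html/"
           + requests.utils.quote(title, safe=""))
    resp = session.get(url, headers=HEADERS, timeout=30)
    if resp.ok:
        safe_name = title.replace("/", "_")
        (OUT_DIR / (safe_name + ".html")).write_text(resp.text, encoding="utf-8")
    return resp.status_code

if __name__ == "__main__":
    with requests.Session() as session, open("titles.txt", encoding="utf-8") as f:
        for line in f:
            title = line.strip()
            if title:
                grab(session, title)
                time.sleep(0.05)  # ~20 requests/s, well below the limit
```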
@Kaartic, many thanks for the offers of help with extracting HTML from ZIM! We did investigate that option in parallel, converting ZIM to HTML using Zimreader-Java [1], and indeed it looked promising, but we had some issues extracting links. We did not try the mwoffliner tool you mentioned since we got what we needed through RESTBase in the end. In any case, the offers are much appreciated. :)
Best, Aidan
[1] https://github.com/openzim/zimreader-java
Good luck / בהצלחה!
On Thu, May 3, 2018 at 7:39 PM, Amir E. Aharoni < amir.aharoni@mail.huji.ac.il> wrote:
Welcome / ברוכה הבאה!