Hi All,
My name is Hagar Shilo. I'm a web developer and a student at Tel Aviv University, Israel.
This summer I will be working on a user search menu and user filters for Wikipedia's "Recent changes" section. Here is the workplan: https://phabricator.wikimedia.org/T190714
My mentors are Moriel and Roan.
I am looking forward to becoming a Wikimedia developer and an open source contributor.
Cheers, Hagar
Welcome / ברוכה הבאה!
Hi all,
I am wondering what is the fastest/best way to get a local dump of English Wikipedia in HTML? We are looking just for the current versions (no edit history) of articles for the purposes of a research project.
We have been exploring using bliki [1] to do the conversion of the source markup in the Wikipedia dumps to HTML, but the latest version seems to take on average several seconds per article (including after the most common templates have been downloaded and stored locally). This means it would take several months to convert the dump.
We also considered using Nutch to crawl Wikipedia, but with a reasonable crawl delay (5 seconds) it would take several months to get a copy of every article in HTML (or at least the "reachable" ones).
Hence we are a bit stuck right now and not sure how to proceed. Any help, pointers or advice would be greatly appreciated!!
Best, Aidan
[1] https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Home
Just in case you have not thought of it, how about taking the XML dump and converting it to the format you are looking for?
Ref https://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_W...
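As a very rough sketch (assuming Python 3 and the bzip2 pages-articles dump; the file name below is just a placeholder), streaming the dump and pulling out the current wikitext of each page could look something like this; converting that wikitext to HTML is, of course, the separate and harder step:

```python
# Rough sketch: stream the pages-articles XML dump and yield (title, wikitext)
# pairs for the current revision of each page. Standard library only; the
# dump file name is a placeholder for whichever dump you download.
import bz2
import xml.etree.ElementTree as ET

DUMP = "enwiki-latest-pages-articles.xml.bz2"  # placeholder file name

def local_name(elem):
    # The export XML uses a versioned namespace; strip it so the code
    # does not depend on the exact schema version.
    return elem.tag.rsplit("}", 1)[-1]

def iter_pages(path):
    with bz2.open(path, "rb") as f:
        title, text = None, None
        for _event, elem in ET.iterparse(f, events=("end",)):
            name = local_name(elem)
            if name == "title":
                title = elem.text
            elif name == "text":
                text = elem.text or ""
            elif name == "page":
                yield title, text
                elem.clear()  # keep memory usage flat while streaming

if __name__ == "__main__":
    for title, wikitext in iter_pages(DUMP):
        # Hand the wikitext to whatever wikitext-to-HTML converter you choose.
        print(title, len(wikitext))
```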
Fae
Hi Fae,
On 03-05-2018 16:18, Fæ wrote:
Just in case you have not thought of it, how about taking the XML dump and converting it to the format you are looking for?
Ref https://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_W...
Thanks for the pointer! We are currently attempting to do something like that with bliki. The issue is that we are interested in the semi-structured HTML elements (lists, tables, etc.), which are often generated by external templates with complex structures. Often we cannot even tell from a template's invocation in an article whether it will produce a table, a list, a box, etc.; e.g., the markup might just say "Weather box", which gets expanded into a table.
Although bliki can help us interpret and expand those templates, each page takes quite a long time, meaning months of computation to get the semi-structured data we want from the dump. Because of these templates we have not had much success yet with this route of taking the XML dump and converting it to HTML (or even parsing it directly); hence we're still looking for other options. :)
Cheers, Aidan
Hey Aidan!
I would suggest checking out RESTBase ( https://www.mediawiki.org/wiki/RESTBase), which offers an API for retrieving HTML versions of Wikipedia pages. It's maintained by the Wikimedia Foundation and used by a number of production Wikimedia services, so you can rely on it.
I don't believe there are any prepared dumps of this HTML, but you should be able to iterate through the RESTBase API, as long as you follow the rules (from https://en.wikipedia.org/api/rest_v1/):
- Limit your clients to no more than 200 requests/s to this API. Each API endpoint's documentation may detail more specific usage limits.
- Set a unique User-Agent or Api-User-Agent header that allows us to contact you quickly. Email addresses or URLs of contact pages work well.
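For example, a minimal, polite client for a single page could look something like this (just a sketch, assuming Python with the requests library; the User-Agent contact string is a placeholder you should replace with your own details):

```python
# Minimal sketch: fetch the Parsoid HTML of one article from the REST API.
import requests

# Placeholder contact details -- the API rules ask for a way to reach you.
HEADERS = {"User-Agent": "enwiki-html-research/0.1 (contact: you@example.org)"}

def fetch_html(title):
    url = ("https://en.wikipedia.org/api/rest_v1/page/html/"
           + requests.utils.quote(title, safe=""))
    resp = requests.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return resp.text

if __name__ == "__main__":
    html = fetch_html("Tel Aviv University")
    print(html[:200])
```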
On 2018-05-03 20:54, Aidan Hogan wrote:
I am wondering what is the fastest/best way to get a local dump of English Wikipedia in HTML? We are looking just for the current versions (no edit history) of articles for the purposes of a research project.
The Kiwix project provides HTML dumps of Wikipedia for offline reading: http://www.kiwix.org/downloads/
Their downloads use the ZIM file format; it looks like there are libraries available for reading it in many programming languages: http://www.openzim.org/wiki/Readers
Also, for the curious, the request for dedicated HTML dumps is tracked in this Phabricator task: https://phabricator.wikimedia.org/T182351
On Friday 04 May 2018 03:49 AM, Bartosz Dziewoński wrote:
The Kiwix project provides HTML dumps of Wikipedia for offline reading: http://www.kiwix.org/downloads/
In case you need pure HTML rather than the ZIM file format, you could check out mwoffliner [1], the tool used to generate the ZIM files. It dumps HTML files locally before generating the ZIM file. Although the HTML is only an intermediate output for the tool, it can be kept if you wish. See [2] for more information about the options the tool accepts.
I'm not sure whether it's possible to instruct the tool to stop immediately after dumping the pages, thus avoiding the creation of the ZIM file altogether. But you could work around that by watching the verbose output (turned on with the '--verbose' option) to identify when dumping has completed and then stopping the tool manually.
In case of any doubts about using the tool, feel free to reach out.
References:
[1] https://github.com/openzim/mwoffliner
[2] https://github.com/openzim/mwoffliner/blob/master/lib/parameterList.js
On Tuesday 08 May 2018 05:53 PM, Kaartic Sivaraam wrote:
In case you need pure HTML and not the ZIM file format, you could check out mwoffliner[1], ...
Note that this HTML is (of course) not identical to what you see when visiting Wikipedia. For example, the sidebar links and the table of contents are not present.
Hi all,
Many thanks for all the pointers! In the end we wrote a small client to grab documents from RESTBase (https://www.mediawiki.org/wiki/RESTBase) as suggested by Neil. The HTML looks perfect, and with the generous 200 requests/second limit (which we could not even manage to reach with our local machine), it only took a couple of days to grab all current English Wikipedia articles.
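(The client really is tiny; a simplified sketch, not our exact code, assuming Python with requests and a placeholder 'titles.txt' with one article title per line, would look something like the following.)

```python
# Simplified sketch of a throttled RESTBase HTML grabber. It reads article
# titles (one per line) from a placeholder file and writes one HTML file per
# article, staying far below the documented 200 requests/s limit.
import pathlib
import time
import requests

HEADERS = {"User-Agent": "enwiki-html-research/0.1 (contact: you@example.org)"}
OUT_DIR = pathlib.Path("html_dump")
OUT_DIR.mkdir(exist_ok=True)

def grab(session, title):
    url = ("https://en.wikipedia.org/api/rest_v1/page/html/"
           + requests.utils.quote(title, safe=""))
    resp = session.get(url, headers=HEADERS, timeout=30)
    if resp.ok:
        safe_name = title.replace("/", "_")
        (OUT_DIR / (safe_name + ".html")).write_text(resp.text, encoding="utf-8")
    return resp.status_code

if __name__ == "__main__":
    with requests.Session() as session, open("titles.txt", encoding="utf-8") as f:
        for line in f:
            title = line.strip()
            if title:
                grab(session, title)
                time.sleep(0.05)  # ~20 requests/s, well below the limit
```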
@Kaartic, many thanks for the offers of help with extracting HTML from ZIM! We did investigate that option in parallel, converting ZIM to HTML using Zimreader-Java [1], and indeed it looked promising, but we had some issues extracting links. We did not try the mwoffliner tool you mentioned since we got what we needed through RESTBase in the end. In any case, the offers are much appreciated. :)
Best, Aidan
[1] https://github.com/openzim/zimreader-java
Good luck / בהצלחה!
On Thu, May 3, 2018 at 7:39 PM, Amir E. Aharoni < amir.aharoni@mail.huji.ac.il> wrote:
Welcome / ברוכה הבאה!