Detailed technical report on an undergraduate student project at Virginia Tech (work in progress) to import the entire English Wikipedia history dump into the university's Hadoop cluster and index it using Apache Solr, to "allow researchers and developers at Virginia Tech to benchmark configurations and big data analytics software":
Steven Stulga, "English Wikipedia on Hadoop Cluster" https://vtechworks.lib.vt.edu/handle/10919/70932 (CC BY 3.0)
IIRC this has rarely, if ever, been attempted before because of the sheer size of the dataset (10 TB uncompressed). And it looks like the author encountered an out-of-memory error that he wasn't able to solve before the end of the term...
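For anyone curious, here is a minimal sketch of what such a pipeline can look like when the dump is parsed as a stream, so the full 10 TB never has to sit in memory at once. The dump path, Solr core name, XML namespace, and field names below are placeholders, not values taken from the report:

```python
# Minimal sketch: stream a MediaWiki full-history XML dump and index revisions into Solr.
# The core name ("enwiki"), dump path, and field names are placeholders, not from the report.
import xml.etree.ElementTree as ET
import requests

DUMP_PATH = "enwiki-pages-meta-history.xml"                # placeholder path
SOLR_UPDATE = "http://localhost:8983/solr/enwiki/update"   # placeholder core
NS = "{http://www.mediawiki.org/xml/export-0.10/}"         # namespace varies by dump version

def revisions(path):
    """Yield one dict per <revision>, clearing parsed subtrees so memory stays bounded."""
    title = None
    for _, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == NS + "title":
            title = elem.text                      # <title> closes before the revisions
        elif elem.tag == NS + "revision":
            yield {
                "id": elem.findtext(NS + "id"),
                "title": title,
                "timestamp": elem.findtext(NS + "timestamp"),
                "text": elem.findtext(NS + "text") or "",
            }
            elem.clear()                           # free the revision subtree immediately
        elif elem.tag == NS + "page":
            elem.clear()                           # drop the emptied page element as well

batch = []
for doc in revisions(DUMP_PATH):
    batch.append(doc)
    if len(batch) >= 1000:                         # send documents in batches
        requests.post(SOLR_UPDATE, json=batch).raise_for_status()
        batch = []
if batch:
    requests.post(SOLR_UPDATE, json=batch).raise_for_status()
requests.post(SOLR_UPDATE, json={"commit": {}}).raise_for_status()  # final commit
```

The key point is clearing each parsed <revision> element as soon as it has been sent, which keeps the parser's memory footprint roughly constant regardless of the dump size.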
-- Tilman Bayer Senior Analyst Wikimedia Foundation IRC (Freenode): HaeB
That would be wonderful.
Is it something that you just found or do you actually know them?
Maybe they could consider starting with a smaller language. If their software is not good at parsing languages other than English, even the Simple English Wikipedia would be more manageable.
-- Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי http://aharoni.wordpress.com “We're living in pieces, I want to live in peace.” – T. Moore
Yes, of course *processing* the entire history (even with text) has been done before - but perhaps not storing or indexing it.
BTW is anyone still using "Wikihadoop"? https://blog.wikimedia.org/2011/11/21/do-it-yourself-analytics-with-wikipedi... https://github.com/whym/wikihadoop
On Wed, May 18, 2016 at 3:09 AM, Dan Andreescu dandreescu@wikimedia.org wrote:
Hi Tilman, thanks for pointing to this research. We have indeed worked on this kind of project, for both ORES and the WikiCredit system, and there are many challenges around memory and processing time. Loading the entire history without text is what we're working on right now for our Wikistats 2.0 project, and even that has many challenges.
As far as I can tell right now, any simple attempt to handle all the data in one way or in one place is going to run into some sort of limit. If anybody finds otherwise, that would be very useful for our work.
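To illustrate the metadata-only idea, here is a rough pyspark sketch; the HDFS path and the per-revision JSON schema are assumptions made for illustration, not the actual Wikistats 2.0 pipeline:

```python
# Rough sketch: aggregate full-history revision *metadata* (no article text) with Spark.
# The HDFS path and per-revision JSON schema are assumptions, not the Wikistats 2.0 layout.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("enwiki-history-metadata").getOrCreate()

# Expects one JSON object per revision, e.g.
# {"page_title": "...", "timestamp": "2016-05-17T09:54:00Z", "user": "..."}
revisions = spark.read.json("hdfs:///wikimedia/enwiki/revision-metadata/")  # placeholder path

# Edits per page (most-edited first) and edits per month over the whole history.
per_page = revisions.groupBy("page_title").count().orderBy(F.desc("count"))
per_month = (revisions
             .withColumn("month", F.substring("timestamp", 1, 7))
             .groupBy("month")
             .count()
             .orderBy("month"))

per_page.show(20, truncate=False)
per_month.show(20, truncate=False)
```

Because Spark distributes the scan across the cluster, nothing here requires the whole history to fit on one machine, which is exactly the limit the single-place approach keeps hitting.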
@Tilman: Yes, some people are using Wikihadoop to convert dumps from XML to JSON. I added some code to it recently and I maintain it.
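For anyone who hasn't seen that workflow, the flattening step looks roughly like this: the same streaming-parse idea as the Solr sketch earlier in the thread, but writing one JSON object per revision instead of posting to Solr. This is plain Python for illustration, not Wikihadoop's actual code, and the field names are assumptions rather than its real output schema:

```python
# Illustration only: flatten a MediaWiki history dump into one JSON object per revision.
# Not Wikihadoop's actual implementation; the output field names are assumed.
import json
import sys
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # namespace varies by dump version

def xml_to_json_lines(xml_path, out=sys.stdout):
    title = None
    for _, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag == NS + "title":
            title = elem.text
        elif elem.tag == NS + "revision":
            out.write(json.dumps({
                "page_title": title,
                "revision_id": elem.findtext(NS + "id"),
                "timestamp": elem.findtext(NS + "timestamp"),
                "comment": elem.findtext(NS + "comment") or "",
                "text": elem.findtext(NS + "text") or "",
            }) + "\n")
            elem.clear()   # discard the revision subtree once it has been written
        elif elem.tag == NS + "page":
            elem.clear()

if __name__ == "__main__":
    xml_to_json_lines(sys.argv[1])   # usage: python dump_to_json.py <dump.xml>
```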