Hi all;
I'm starting a new project, a wiki search engine. It uses MediaWiki, Semantic MediaWiki and other minor extensions, and some tricky templates and bots.
I remember Wikia Search and how it failed. It had a mini-article thingy for the introduction, then a lot of links compiled by a crawler, and something similar to a social network.
My project idea (which still needs a cool name) is different. Although it uses an introduction and images copied from Wikipedia, and some links from the "External links" sections, that is only a starting point. The purpose is that the community adds, removes and orders the results for each term, and creates redirects for similar terms to avoid duplicates.
Why this? I think that Google PageRank isn't enough. It is frequently gamed by link farms, SEO operators and other people trying to push their websites to the top.
Search "Shakira" in Google for example. You see 1) Official site, 2) Wikipedia 3) Twitter 4) Facebook, then some videos, some news, some images, Myspace. It wastes 3 or more results in obvious nice sites (WP, TW, FB). The wiki search engine puts these sites in the top, and an introduction and related terms, leaving all the space below to not so obvious but interesting websites. Also, if you search for "semantic queries" like "right-wing newspapers" in Google, you won't find real newspapers but "people and sites discussing about ring-wing newspapers". Or latex and LaTeX being shown in the same results pages. These issues can be resolved with disambiguation result pages.
How do we choose which results go above or below? The rules are not fully designed yet, but we could put official sites first, then .gov or .edu domains, which tend to be important, and after them unofficial websites and blogs, giving priority to the local language, etc., and reaching consensus along the way.
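To make the idea concrete, here is a minimal sketch in Python of how such an initial ordering could be computed before editors adjust it by hand. The result-dictionary shape, the categories and the weights are placeholders for this example only, not agreed policy:

    from urllib.parse import urlparse

    # Hypothetical sketch: score candidate results for a term so that official
    # sites come first, then .gov/.edu domains, then everything else, with a
    # small bonus for the local language. Editors can still reorder anything.
    def score(result, query_language="es"):
        domain = urlparse(result["url"]).netloc
        s = 0
        if result.get("official"):
            s += 100                      # official sites first
        if domain.endswith((".gov", ".edu")):
            s += 50                       # institutional domains next
        if result.get("language") == query_language:
            s += 10                       # prefer the local language
        return s

    def initial_order(results, query_language="es"):
        return sorted(results, key=lambda r: score(r, query_language), reverse=True)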
We can control aggressive spam with spam blacklists, by semi-protecting or protecting highly visible pages, and by using bots or tools to check changes.
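On the bot side, a rough sketch of what "checking changes" could look like with pywikibot; the wiki family name and the blacklist entries here are made up for illustration:

    import pywikibot

    # Hypothetical sketch: scan recent changes and flag pages whose current
    # text contains a blacklisted domain, so a human can review them.
    site = pywikibot.Site("en", "librefind")                    # assumed family name
    BLACKLIST = ["casino-spam.example", "cheap-pills.example"]  # placeholder domains

    for change in site.recentchanges(total=100):
        title = change["title"]
        text = pywikibot.Page(site, title).text
        if any(domain in text for domain in BLACKLIST):
            print("Possible spam on %s (edit by %s)" % (title, change["user"]))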
The content obviously has a CC BY-SA license and the results can be exported. I think that this approach is the opposite of what Google does today.
For long-tail queries like "Albert Einstein birthplace" we can redirect to the most obvious results page (in this case, Albert Einstein), either with a hand-made redirect or in software (a small change to MediaWiki).
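For the bot-made case, a minimal pywikibot sketch; the wiki family name and the query-to-target mapping are assumptions for illustration, and a redirect is just a normal wiki page whose text is "#REDIRECT [[Target]]":

    import pywikibot

    # Hypothetical sketch: create redirect pages so variant queries land on
    # a single results page instead of producing duplicates.
    site = pywikibot.Site("en", "librefind")              # assumed family name

    redirects = {
        "Albert Einstein birthplace": "Albert Einstein",  # example from above
        "Einstein birthplace": "Albert Einstein",
    }

    for source, target in redirects.items():
        page = pywikibot.Page(site, source)
        if not page.exists():                             # never overwrite existing pages
            page.text = "#REDIRECT [[%s]]" % target
            page.save(summary="Bot: redirect variant query to its results page")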
You can check a very early alpha version here: http://www.todogratix.es (only in Spanish for now, sorry), which I'm feeding with some bots.
I think that it is an interesting experiment. I'm open to your questions and feedback.
Regards, emijrp
I agree that this sounds like an interesting experiment. I hope that you get good faith editors. I worry that you’ll get COI editors playing with the search rankings. Do you have a way in mind to deal with that issue?
Pine
Yes, there are some options: (semi-)protection, blocks, spam blacklists, FlaggedRevs, AbuseFilter and some more. All of them are well-known MediaWiki features and extensions.
Thanks for your interest.
-- Emilio J. Rodríguez-Posada. E-mail: emijrp AT gmail DOT com. Pre-doctoral student at the University of Cádiz (Spain). Projects: AVBOT http://code.google.com/p/avbot/ | StatMediaWiki http://statmediawiki.forja.rediris.es | WikiEvidens http://code.google.com/p/wikievidens/ | WikiPapers http://wikipapers.referata.com | WikiTeam http://code.google.com/p/wikiteam/ Personal website: https://sites.google.com/site/emijrp/
After some tests and usability improvements, I'm going to launch an English alpha version.
I still need a cool name for the project. Any ideas?
Stay tuned.
Well, here are some quick ideas:
* Checkpedia
* Questionmark (might sound a bit strange, but still relevant to the general idea)
* Factsearch
* Wikiget
…
Pierre-Carl Langlais langlais.qobuz@gmail.com writes:
Well here are some quick ideas. *Checkpedia
I like this one.
What about "wikisearch"?
-- Bastien
Thanks. You've proposed nice names, but most of those domains are already registered by domain parkers. :(
Mhh... what about wikidigg.org?
It has not been bought yet and it matches the purpose quite well, IMO.
-- Bastien
We finally decided on a name and registered the domains: LibreFind.
Hi all again;
After some months, we have the domain for LibreFind[1] and some usable results[2][3] (the bot is running). There is also a mailing list[4] and a Google Code project[5].
I would like to invite you to join the brainstorming. We need to establish policies on how to sort results, bots to check dead links, crawlers to improve the results, and more. You can request an account for the closed beta.
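For the dead-link bot, a rough sketch of the idea in Python; the sample URLs are placeholders, and a real bot would pull the external links from LibreFind pages via the MediaWiki API instead:

    import requests

    # Hypothetical sketch: report links that no longer respond so editors can review them.
    def check_links(urls, timeout=10):
        dead = []
        for url in urls:
            try:
                r = requests.head(url, allow_redirects=True, timeout=timeout)
                if r.status_code >= 400:
                    dead.append((url, r.status_code))
            except requests.RequestException as exc:
                dead.append((url, str(exc)))
        return dead

    if __name__ == "__main__":
        sample = ["http://www.librefind.org", "http://example.invalid/gone"]
        for url, reason in check_links(sample):
            print("Dead link?", url, reason)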
Thanks for your time, emijrp
[1] http://www.librefind.org [2] http://www.librefind.org/wiki/Spain [3] http://www.librefind.org/wiki/Edgar_Allan_Poe [4] http://groups.google.com/group/librefind [5] https://code.google.com/p/librefind/
Hi, awesome to see this move forward. It solves a major namespace-style problem (for the namespace of queries) and I fully support it. Good luck with the work, and I would love to help test the beta.
Sam.
Hi Samuel, thanks for your kind words. I'm going to contact you to create the account. Your experience in so many open-knowledge projects would be helpful!
WikiSym/OpenSym just began in Hong Kong http://opensym.org/wsos2013/program/day1
Proceedings at http://opensym.org/wsos2013/program/proceedings. Follow on Twitter #wikisym #opensym
Thanks, Dirk!
Heather Ford Oxford Internet Institute Doctoral Programme www.ethnographymatters.net @hfordsa on Twitter http://hblog.org
How great. Thanks for the link, and much love for your citations analysis. (please, please follow up with a comparison across languages other than English!)
SJ, just arrived in HKG
Thanks, SJ :) Yes! Shilad, Dave and I just met in Minneapolis to make plans :)