Announcing the Wikimedia technical search tool


Waldir Pimenta

10 Mar 2013

2:50 p.m.

Hi all,

I'd like to announce a recently created tool that might help the Wikimedia technical community find stuff more easily. Sometimes relevant information is buried in IRC chat logs, messages in any of several mailing lists, pages in mediawiki.org, commit messages, etc. This tool (essentially a custom google search engine that filters results to a few relevant URL patterns) is aimed at relieving this problem. Test it here: http://hexm.de/mw-search

The motivation for the tool came from a post by Niklas [1], specifically the section "Coping with the proliferation of tools within your community". In the comments section, Nemo announced his initiative to create a custom google search to fit at least some of the requirements presented in that section, and I've offered to help him tweak it further. The URL list is still incomplete and can be customized by editing the page http://www.mediawiki.org/wiki/Wikimedia_technical_search (syncing with the actual engine still will have to happen by hand, but should be quick).
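To make the idea of "filtering results to a few relevant URL patterns" concrete, here is a minimal sketch of pattern-based inclusion. It is only an illustration: the pattern list is hypothetical (the real list lives on the wiki page linked above), and shell-style matching via `fnmatchcase` is an assumption, not a description of how the Google custom search engine evaluates its patterns.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns, loosely modelled on the kinds of sources named in
# the announcement. They are NOT the tool's actual configuration.
URL_PATTERNS = [
    "www.mediawiki.org/wiki/*",
    "lists.wikimedia.org/pipermail/wikitech-l/*",
    "bots.wmflabs.org/~wm-bot/logs/*",
]


def strip_scheme(url: str) -> str:
    """Drop a leading http:// or https:// so patterns can omit the scheme."""
    for prefix in ("https://", "http://"):
        if url.startswith(prefix):
            return url[len(prefix):]
    return url


def is_included(url: str, patterns=URL_PATTERNS) -> bool:
    """Return True if the URL matches at least one shell-style pattern."""
    bare = strip_scheme(url)
    return any(fnmatchcase(bare, pattern) for pattern in patterns)
```

For example, `is_included("http://www.mediawiki.org/wiki/Wikimedia_technical_search")` is true under these assumed patterns, while a URL on en.wikipedia.org is not.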

Besides feedback on whether the engine works as you'd expect, I would like to start some discussion about the ability for Google's bots to crawl some of the resources that are currently included in the URL filters, but return no results. For example, the IRC logs at bots.wmflabs.org/~wm-bot/logs/. Some workarounds are used (e.g. using github for code search since gitweb isn't crawlable) but that isn't possible for all resources. What can we do to improve the situation?
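One possible reason a resource returns no results is that its host's robots.txt disallows crawling; there may well be other causes (for instance, pages that are too slow or not linked from anywhere). The sketch below checks a URL against robots.txt rules using Python's standard `urllib.robotparser`. The robots.txt content shown is hypothetical and is not taken from any of the hosts mentioned here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only. The real files on
# the hosts discussed in this thread may look quite different.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /~wm-bot/logs/
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With the assumed rules above, `can_crawl(EXAMPLE_ROBOTS_TXT, "Googlebot", "http://bots.wmflabs.org/~wm-bot/logs/")` is false, while the site root is allowed.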

--Waldir

1. http://laxstrom.name/blag/2013/02/11/fosdem-talk-reflections-23-docs-code-an...


Platonides

10 Mar 2013

8:53 p.m.

On 10/03/13 15:50, Waldir Pimenta wrote:

...

The motivation for the tool came from a post by Niklas [1], specifically the section "Coping with the proliferation of tools within your community". In the comments section, Nemo announced his initiative to create a custom google search to fit at least some of the requirements presented in that section, and I've offered to help him tweak it further. The URL list is still incomplete and can be customized by editing the page http://www.mediawiki.org/wiki/Wikimedia_technical_search (syncing with the actual engine still will have to happen by hand, but should be quick).

I'm not convinced about [[en:MediaWiki_talk:*]] and [[en:Template_talk:*]], they can bring quite a bit of noise (similarly for [[en:Wikipedia:Village_pump_(technical)]]). I see how interesting discussions could be happening there, though.

...

Besides feedback on whether the engine works as you'd expect, I would like to start some discussion about the ability for Google's bots to crawl some of the resources that are currently included in the URL filters, but return no results. For example, the IRC logs at bots.wmflabs.org/~wm-bot/logs/. Some workarounds are used (e.g. using github for code search since gitweb isn't crawlable) but that isn't possible for all resources. What can we do to improve the situation?

Do we really want Google to index them?

Waldir Pimenta

9:35 p.m.

On Sun, Mar 10, 2013 at 8:53 PM, Platonides <Platonides@gmail.com> wrote:

...

I'm not convinced about [[en:MediaWiki_talk:*]] and [[en:Template_talk:*]], they can bring quite a bit of noise (similarly for [[en:Wikipedia:Village_pump_(technical)]]). I see how interesting discussions could be happening there, though.

The tabs in the search results page (sorry I didn't mention them in the previous email) can be used to filter results to more relevant content, if desired. I think that might help coping with noise.

...

Besides feedback on whether the engine works as you'd expect, I would like to start some discussion about the ability for Google's bots to crawl some of the resources that are currently included in the URL filters, but return no results. For example, the IRC logs at bots.wmflabs.org/~wm-bot/logs/. Some workarounds are used (e.g. using github for code search since gitweb isn't crawlable) but that isn't possible for all resources. What can we do to improve the situation?

Do we really want Google to index them?

Why log them publicly if we don't make them searchable? Either we're committed to being open or we're not... having a public but hard-to-use archive seems somewhat contradictory to me.

--Waldir

David Gerard

10:09 p.m.

On 10 March 2013 21:35, Waldir Pimenta <waldir@email.com> wrote:

...

Why log them publicly if we don't make them searchable? Either we're committed to being open or we're not... having a public but hard-to-use archive seems somewhat contradictory to me.

This is in fact the policy with mailing list archives. (And it is in fact stupid, having been put in place to appease people who don't understand that if you say something on a public list, you've said it in public.)

- d.

Sumana Harihareswara

9:02 p.m.

Cool! Thanks for making this!

-- 
Sumana Harihareswara
Engineering Community Manager
Wikimedia Foundation

Sumana Harihareswara

17 Jun 2013

3:25 a.m.

Waldir, in case it helps you improve the Wikimedia tech search tool, I wanted you to see this post about Mozilla's codebase search tool: https://blog.mozilla.org/webdev/2013/06/13/dxr-digests-the-firefox-codebase/ Try it out: http://dxr.mozilla.org/

-- 
Sumana Harihareswara
Engineering Community Manager
Wikimedia Foundation

Federico Leva (Nemo)

9:47 a.m.

No, DXR doesn't help the CSE. I also doubt it will help restoring full text search on our end, but we'll see. https://bugzilla.wikimedia.org/show_bug.cgi?id=49674

On the contrary, gitblit is more robust and faster than gitweb, so it allows crawling by search engines. It was crawled very quickly so I added the repos to the search tool on the 10th. https://www.mediawiki.org/w/index.php?title=Wikimedia_technical_search&diff=708290&oldid=663882

Nemo

S Page

18 Jun 2013

6:55 p.m.

On Sun, Mar 10, 2013 at 7:50 AM, Waldir Pimenta <waldir@email.com> wrote:

...

Test it here: http://hexm.de/mw-search

Nice. Is there a way to pass a query string to it, e.g. http://hexm.de/mw-search?q=%s ? Then we could store this as a bookmarklet with keyword 'ts'[1] and type `ts 49604` to search. It works if you do a custom search for foo and replace the &q=foo in the long www.google.com/cse URL with &q=%s.
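The `%s` substitution that a keyword bookmark performs can be sketched as below. This only mimics what the browser does when expanding the template; whether the short hexm.de URL actually accepts a `q` parameter is exactly the open question above, so the template is a placeholder rather than a confirmed endpoint, and `expand_keyword` is a hypothetical helper name.

```python
from urllib.parse import quote_plus

# Placeholder template taken from the question above. It is NOT confirmed
# that the short URL forwards a q= parameter to the search engine.
TEMPLATE = "http://hexm.de/mw-search?q=%s"


def expand_keyword(template: str, terms: str) -> str:
    """Replace %s in the template with the URL-encoded search terms."""
    return template.replace("%s", quote_plus(terms))
```

Typing `ts 49604` would then correspond to `expand_keyword(TEMPLATE, "49604")`; terms containing spaces are encoded with `+`.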

Nemo commented:

...

gitblit is more robust and faster than gitweb, so it allows crawling by search engines.

It's working but gitblit pages have generic <title> tags and no meta description or keywords, so the results don't show the title of a patch. http://support.google.com/webmasters/bin/answer.py?hl=en&answer=35624sug... how HTML pages should be structured (though Google is deliberately vague to hinder search result spammers) and https://developers.google.com/custom-search/docs/structured_data talks about rich snippets available to custom search (I've never tried it).
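A quick way to see what a page offers a search engine for its result snippet is to pull out the `<title>` and the meta description. The following is a rough sketch using only Python's standard `html.parser`; the class and function names are made up for illustration, and the sample markup in the usage note is invented rather than copied from a real gitblit page.

```python
from html.parser import HTMLParser


class SnippetFields(HTMLParser):
    """Collect the <title> text and the meta description of an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            # Attribute values can be None, hence the "or ''" guard.
            if (attr.get("name") or "").lower() == "description":
                self.description = attr.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def snippet_fields(html: str):
    """Return (title, description); description is None when absent."""
    parser = SnippetFields()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.description
```

Running `snippet_fields` on an invented page such as `<html><head><title>Gitblit</title></head></html>` yields the generic title and no description, which is the situation described above.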

[1] like the essential "jump to Wikipedia page" 'w' bookmarklet https://en.wikipedia.org/wiki/%s . Why search when you can go direct.

-- =S Page software engineer on E3


wikitech-l@lists.wikimedia.org
