Hi all,
I'd like to announce a recently created tool that might help the Wikimedia technical community find stuff more easily. Sometimes relevant information is buried in IRC chat logs, messages in any of several mailing lists, pages in mediawiki.org, commit messages, etc. This tool (essentially a custom google search engine that filters results to a few relevant URL patterns) is aimed at relieving this problem. Test it here: http://hexm.de/mw-search
The motivation for the tool came from a post by Niklas [1], specifically the section "Coping with the proliferation of tools within your community". In the comments section, Nemo announced his initiative to create a custom google search to fit at least some of the requirements presented in that section, and I've offered to help him tweak it further. The URL list is still incomplete and can be customized by editing the page http://www.mediawiki.org/wiki/Wikimedia_technical_search (syncing with the actual engine still will have to happen by hand, but should be quick).
Besides feedback on whether the engine works as you'd expect, I would like to start some discussion about the ability for Google's bots to crawl some of the resources that are currently included in the URL filters, but return no results. For example, the IRC logs at bots.wmflabs.org/~wm-bot/logs/. Some workarounds are used (e.g. using github for code search since gitweb isn't crawlable) but that isn't possible for all resources. What can we do to improve the situation?
--Waldir
1. http://laxstrom.name/blag/2013/02/11/fosdem-talk-reflections-23-docs-code-an...
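One thing that can be checked mechanically is whether a site's robots.txt rules would keep a crawler away from a given URL. Below is a minimal sketch using Python's standard urllib.robotparser. The robots.txt content is invented for illustration: the thread does not say what the real file contains, or whether robots.txt is the reason the logs return no results. Only the log path is taken from the message above.

```python
from urllib.robotparser import RobotFileParser

# ASSUMPTION: hypothetical robots.txt rules, not the real file served by the host.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /~wm-bot/logs/
"""


def is_crawlable(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://bots.wmflabs.org/~wm-bot/logs/",
        "http://bots.wmflabs.org/index.html",  # hypothetical page, for contrast
    ):
        print(url, is_crawlable(EXAMPLE_ROBOTS_TXT, "Googlebot", url))
```

If a check like this reports that a resource is allowed and it still never shows up in results, the cause is probably elsewhere (missing inbound links, slow or unstable pages, and so on).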
On 10/03/13 15:50, Waldir Pimenta wrote:
The motivation for the tool came from a post by Niklas [1], specifically the section "Coping with the proliferation of tools within your community". In the comments section, Nemo announced his initiative to create a custom google search to fit at least some of the requirements presented in that section, and I've offered to help him tweak it further. The URL list is still incomplete and can be customized by editing the page http://www.mediawiki.org/wiki/Wikimedia_technical_search (syncing with the actual engine still will have to happen by hand, but should be quick).
I'm not convinced about [[en:MediaWiki_talk:*]] and [[en:Template_talk:*]], they can bring quite a bit of noise (similarly for [[en:Wikipedia:Village_pump_(technical)]]). I see how interesting discussions could be happening there, though.
Besides feedback on whether the engine works as you'd expect, I would like to start some discussion about the ability for Google's bots to crawl some of the resources that are currently included in the URL filters, but return no results. For example, the IRC logs at bots.wmflabs.org/~wm-bot/logs/. Some workarounds are used (e.g. using github for code search since gitweb isn't crawlable) but that isn't possible for all resources. What can we do to improve the situation?
Do we really want Google to index them?
On Sun, Mar 10, 2013 at 8:53 PM, Platonides Platonides@gmail.com wrote:
I'm not convinced about [[en:MediaWiki_talk:*]] and [[en:Template_talk:*]], they can bring quite a bit of noise (similarly for [[en:Wikipedia:Village_pump_(technical)]]). I see how interesting discussions could be happening there, though.
The tabs in the search results page (sorry I didn't mention them in the previous email) can be used to filter results to more relevant content, if desired. I think that might help cope with the noise.
Besides feedback on whether the engine works as you'd expect, I would like to start some discussion about the ability for Google's bots to crawl some of the resources that are currently included in the URL filters, but return no results. For example, the IRC logs at bots.wmflabs.org/~wm-bot/logs/. Some workarounds are used (e.g. using github for code search since gitweb isn't crawlable) but that isn't possible for all resources. What can we do to improve the situation?
Do we really want Google to index them?
Why log them publicly if we don't make them searchable? Either we're committed to being open or we're not... having a public but hard-to-use archive seems somewhat contradictory to me.
--Waldir
On 10 March 2013 21:35, Waldir Pimenta waldir@email.com wrote:
Why log them publicly if we don't make them searchable? Either we're committed to being open or we're not... having a public but hard-to-use archive seems somewhat contradictory to me.
This is in fact the policy with mailing list archives. (And it is in fact stupid, having been put in place to appease people who don't understand that if you say something on a public list, you've said it in public.)
- d.
On 03/10/2013 10:50 AM, Waldir Pimenta wrote:
Cool! Thanks for making this!
On 03/10/2013 07:50 AM, Waldir Pimenta wrote:
Waldir, in case it helps you improve the Wikimedia tech search tool, I wanted you to see this post about Mozilla's codebase search tool: https://blog.mozilla.org/webdev/2013/06/13/dxr-digests-the-firefox-codebase/ Try it out: http://dxr.mozilla.org/
No, DXR doesn't help the CSE. I also doubt it will help restore full-text search on our end, but we'll see. https://bugzilla.wikimedia.org/show_bug.cgi?id=49674
On the contrary, gitblit is more robust and faster than gitweb, so it allows crawling by search engines. It was crawled very quickly so I added the repos to the search tool on the 10th. https://www.mediawiki.org/w/index.php?title=Wikimedia_technical_search&diff=708290&oldid=663882
Nemo
On Sun, Mar 10, 2013 at 7:50 AM, Waldir Pimenta waldir@email.com wrote:
Test it here: http://hexm.de/mw-search
Nice. Is there a way to pass a query string to it, e.g. http://hexm.de/mw-search?q=%s ? Then we could store this as a bookmarklet with keyword 'ts'[1] and type `ts 49604` to search. It works if you do a custom search for foo and replace the &q=foo in the long www.google.com/cse URL with &q=%s.
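For anyone unfamiliar with keyword bookmarks, the browser simply swaps the typed terms into the `%s` slot of the stored URL. A rough sketch of that substitution in Python follows. The function name and the `cx=EXAMPLE_ENGINE_ID` value are placeholders made up for illustration, since the real long www.google.com/cse URL is not quoted in this thread.

```python
from urllib.parse import quote_plus

# ASSUMPTION: placeholder template; the engine ID is not the real one.
CSE_TEMPLATE = "https://www.google.com/cse?cx=EXAMPLE_ENGINE_ID&q=%s"


def expand_keyword_url(template: str, terms: str) -> str:
    """Replace the %s placeholder with the URL-encoded search terms."""
    return template.replace("%s", quote_plus(terms))


if __name__ == "__main__":
    # Roughly what typing `ts 49604` would expand to with the template above.
    print(expand_keyword_url(CSE_TEMPLATE, "49604"))
    # The 'w' bookmark from footnote [1] works the same way.
    print(expand_keyword_url("https://en.wikipedia.org/wiki/%s", "MediaWiki"))
```

Browsers differ in how they encode spaces in the typed terms; `quote_plus` is just one reasonable choice for a query-string parameter.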
Nemo commented:
gitblit is more robust and faster than gitweb, so it allows crawling by search engines.
It's working but gitblit pages have generic <title> tags and no meta description or keywords, so the results don't show the title of a patch. http://support.google.com/webmasters/bin/answer.py?hl=en&answer=35624sug... how HTML pages should be structured (though Google is deliberately vague to hinder search result spammers) and https://developers.google.com/custom-search/docs/structured_data talks about rich snippets available to custom search (I've never tried it).
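To see what a crawler has to work with on a given page, one can pull out the <title> and the description/keywords meta tags. Here is a small sketch using only Python's standard html.parser. The sample HTML is made up to resemble a page with a generic title and no meta tags; it is not copied from an actual gitblit page.

```python
from html.parser import HTMLParser


class HeadInfoParser(HTMLParser):
    """Collect the <title> text and the description/keywords meta tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            name = (attr.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.meta[name] = attr.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def head_info(html: str) -> dict:
    """Return the title, description and keywords found in an HTML string."""
    parser = HeadInfoParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "description": parser.meta.get("description"),
        "keywords": parser.meta.get("keywords"),
    }


if __name__ == "__main__":
    # ASSUMPTION: invented sample page with a generic title and no meta tags.
    sample = "<html><head><title>Gitblit</title></head><body>commit view</body></html>"
    print(head_info(sample))
```

A page where the title is the same everywhere and both meta fields come back as None gives a search engine very little to distinguish one result from another.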
[1] like the essential "jump to Wikipedia page" 'w' bookmarklet https://en.wikipedia.org/wiki/%s . Why search when you can go direct?
wikitech-l@lists.wikimedia.org