ClassCrawler – extremely fast and structured code search engine - Wikitech-l

dima.batt＠speedandfunction.com

4 Feb 2022

4:39 p.m.

Speed & Function has prototyped ClassCrawler, an extremely fast and structured code search engine.

PROBLEM: Working with Wikimedia code is time-consuming and risky. It takes a lot of time and effort to research classes and methods with specific characteristics, to investigate dependencies and complexity, and to understand how refactoring impacts the whole system.

SOLUTION FROM S&F:

- ClassCrawler code search tool - https://classcrawler.prettyclear.com/
- Project repository in GitLab - https://gitlab.com/snf1/classcrawler; https://gitlab.com/snf1/classcrawler-php

WHAT IS a ClassCrawler? ClassCrawler is a code search engine.

It parses PHP code into a structured representation → saves it to MongoDB → presents results in a simple web interface, so you can use the full power of MongoDB queries to find exactly what you need.

See the Guide and Use Cases - https://classcrawler.prettyclear.com/guide

…And this is just the beginning of what can be done in the future.

WE NEED YOUR FEEDBACK

- We share this idea from developers (us) to developers – YOU! So you can start using this capable and powerful tool in your everyday workflow when it is released.
- Your feedback is extremely valuable. Please tell us what you think and help us make it perfect for you and other developers.
- You can share your feedback directly through ClassCrawler – https://classcrawler.prettyclear.com/feedback

Your friends from S&F

Adam Baso

4 Feb

4:58 p.m.

Dima,

Thanks for sharing this. I was wondering, might it be possible to help bring this sort of functionality into Code Search ( https://codesearch.wmflabs.org ) ? I noticed the presentation of the search UI looked similar, but I see how the symbol resolution might be something useful for Code Search and upstream Hound. One thing I've missed from the now olden days of Windows native development was the ability of some of the tooling to do this kind of stuff elegantly. Just to be clear: we don't necessarily have lots of support for some of the tools that are up and running, but I know longtime contributors Legoktm and Ladsgroup are on the list here as well.

Adam Baso (he/him/his/Adam) Director of Engineering Product Engineering Wikimedia Foundation

Kunal Mehta

5 Feb

1:40 a.m.

Hi,

On 2/4/22 08:58, Adam Baso wrote:

...

Thanks for sharing this. I was wondering, might it be possible to help bring this sort of functionality into Code Search ( https://codesearch.wmflabs.org ) ? I noticed the presentation of the search UI looked similar, but I see how the symbol resolution might be something useful for Code Search and upstream Hound.

Indeed, we've been discussing and exploring symbol-based search for a while now: https://phabricator.wikimedia.org/T183795. There are some pretty neat upstream projects that do this like https://searchfox.org/ and zoekt, which is designed for Gerrit integration. I would also note that things are likely to change whenever we migrate to GitLab, which has its own search functionality built-in (https://phabricator.wikimedia.org/T268196). My assumption is that GitLab will add symbol-based search eventually to compete with GitHub, hopefully that ends up in the CE version someday...

While I very much disagree with the opening proposition that "Working with Wikimedia code is time-consuming and risky", I think symbol search of the MediaWiki codebase would be incredibly powerful and unlock a new level of tooling, just like Codesearch did when it was first introduced, so I'm glad to see people looking into it! For example we could do stuff like https://phabricator.wikimedia.org/T186771 with it.

There were two main principles in building MediaWiki Codesearch, first that everything be licensed as free software[1] (which Bryan covered as well) and that we try to use upstream as much as possible. Our modifications are injected by a proxy rather than patch the upstream code...which has turned out to be incredibly stable over the years.

If you want to collaborate, all the code and setup is in Git and I can add you to the project, but I see little to no value in building proprietary tools or reinventing what other projects have done pretty well rather than building on top of them.

[1] https://mako.cc/writing/hill-free_tools.html

-- Legoktm

Amir Sarabadani

8:38 p.m.

I co-maintain codesearch with Kunal and I have similar notes. I hope that instead of duplicating the work, we could join forces to improve the development productivity infrastructure.

Codesearch has been working fine in the past couple of years. There is a new frontend being built and I hope we can deploy it soon to provide a better user experience and I personally don't see a value in re-implementing codesearch. Especially using non-open source software.

As a longer-term solution, I hope/dream we could implement what Google has for automating refactoring. It's called LSC [1] (Large Scale Changes) and we could even piggyback on the library upgrader tool to automate easy deprecation fixes so developers could focus on complex cases. It's disheartening to me to see the valuable time of our volunteer developers being spent on something that could be automated. (For example, see the sheer number of patches made for this deprecation: https://gerrit.wikimedia.org/r/q/bug:T286694). It doesn't have to be able to parse PHP code and do complex magic at first. We can start with simple regex replacements and then add rector (a really nice library for doing refactors in PHP) and its equivalents in other languages.

[1] For more information see "Software Engineering at Google: Lessons Learned from Programming Over Time" book: https://www.goodreads.com/book/show/48816586-software-engineering-at-google
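To make the "start with simple regex replacements" idea concrete, here is a minimal sketch of one automated deprecation fix. The names `oldHelper` and `NewService::helper` are invented for illustration and do not refer to any real MediaWiki deprecation; a real tool would also need to open patches, run tests, and so on.

```python
import re
from typing import Tuple

# Hypothetical deprecation: calls to the global function oldHelper() should
# become NewService::helper(). Both names are made up for this sketch.
DEPRECATED_CALL = re.compile(r"\boldHelper\s*\(")


def rewrite_source(source: str) -> Tuple[str, int]:
    """Replace each oldHelper( call and return (new_source, number_of_edits)."""
    return DEPRECATED_CALL.subn("NewService::helper(", source)


sample = (
    "$a = oldHelper(1);\n"
    "$b = myoldHelper(2);\n"   # different identifier, must stay untouched
    "$c = oldHelper (3);\n"    # whitespace before the parenthesis is tolerated
)
new_source, count = rewrite_source(sample)
print(count)
print(new_source)
```

Note that a pattern like this cannot tell a free function call from a method call such as `$obj->oldHelper(`, which is one reason a parser-based tool would be the natural second step.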

Best

-- Amir (he/him)

Daniel Kinzler

9:18 p.m.

Am 05.02.22 um 21:38 schrieb Amir Sarabadani:

...

Codesearch has been working fine in the past couple of years. There is a new frontend being built and I hope we can deploy it soon to provide a better user experience and I personally don't see a value in re-implementing codesearch. Especially using non-open source software.

While I agree with several points that have been raised, in particular about licensing and building on top of existing tools, I'd like to point out that the idea is not to re-implement codesearch, but to overcome some of its limitations. What we use codesearch for most is finding usages of methods (and sometimes classes). This works fine if the method name is fairly unique. But if the method name is generic, or you are moving a method from one class to another and you want to find callers of the old method, but not the new method, then regular expressions just don't cut it.

Basically, I'd want codesearch to allow me to do the kind of "find callers" search that IDEs like PhpStorm support. Sure, I could do it in the IDE, but I can't link to that from a ticket, and I'd have to make sure I have exactly the right set of extensions installed (and updated).

A tool very much like codesearch, but based not on regular expressions but rather on symbols and their relationships, would be very valuable to me. The question of how exactly it should be built is of course open.
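A small sketch of the difference, under invented data: the class names `OldHome` and `NewHome`, the file names, and the idea that an indexer has already resolved each call's receiver class are all assumptions made for illustration, not a description of any existing tool.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class CallSite:
    file: str
    line: int
    receiver_class: str  # class the call resolves to, as a symbol indexer would record it
    method: str


# Hypothetical source lines, keyed by (file, line).
SOURCE_LINES = {
    ("A.php", 10): "$x = $old->getName();",
    ("B.php", 22): "$y = $new->getName();",
    ("C.php", 7): "return $this->thing->getName();",
}

# Hypothetical symbol index for the same three call sites.
CALL_SITES = [
    CallSite("A.php", 10, "OldHome", "getName"),
    CallSite("B.php", 22, "NewHome", "getName"),
    CallSite("C.php", 7, "OldHome", "getName"),
]


def regex_callers(method):
    """Text search: every line that syntactically calls ->method(."""
    pattern = re.compile(r"->" + re.escape(method) + r"\s*\(")
    return sorted(loc for loc, text in SOURCE_LINES.items() if pattern.search(text))


def symbol_callers(cls, method):
    """Symbol search: only calls whose receiver resolves to the given class."""
    return sorted(
        (c.file, c.line)
        for c in CALL_SITES
        if c.receiver_class == cls and c.method == method
    )


print(regex_callers("getName"))
print(symbol_callers("OldHome", "getName"))
```

The regex variant returns all three locations because the text alone does not say which class `getName()` belongs to; the symbol variant can exclude the `NewHome` caller.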

-- Daniel Kinzler Principal Software Engineer, Core Platform Wikimedia Foundation

Inductiveload

9:46 p.m.

On the theme of code search, I have submitted a MediaWiki docset to the Dash user contributions repo, so you can also use it with Zeal[1], which looks like this [2, 3]. It includes collaboration diagrams and references/referenced by listings.

Obviously, as it's only from the core doxygen it doesn't include extensions, but I find it goes a long way to demystifying things. Because it's only built[4] manually and submitted manually, it's not bang up-to-date, but I have found it handy. Really it just provides a less clunky interface to doxygen's HTML.

You can download the docset via the XML feed here (just paste this into Zeal's "Add Feed" button): https://zealusercontributions.vercel.app/api/docsets/Mediawiki.xml

You can also see a lot more user contributed docsets here: https://zealusercontributions.vercel.app

Cheers,

-- IL

[1] https://zealdocs.org/
[2] https://phabricator.wikimedia.org/F34943176
[3] https://phabricator.wikimedia.org/F34943177
[4] https://github.com/inductiveload/mediawiki_docset

Giuseppe Lavagetto

7 Feb

7:43 a.m.

On Sat, Feb 5, 2022 at 10:19 PM Daniel Kinzler dkinzler@wikimedia.org wrote:

...

Am 05.02.22 um 21:38 schrieb Amir Sarabadani:

Codesearch has been working fine in the past couple of years. There is a new frontend being built and I hope we can deploy it soon to provide a better user experience and I personally don't see a value in re-implementing codesearch. Especially using non-open source software.

While I agree with several points that have been raised, in particular about licensing and building on top of existing tools, I'd like to point out that the idea is not to re-implement codesearch, but to overcome some of its limitations. What we use codesearch for most is finding usages of methods (and sometimes classes). This works fine if the method name is fairly unique. But if the method name is generic, or you are moving a method from one class to another and you want to find callers of the old method, but not the new method, then regular expressions just don't cut it.

Ok, why do you think symbol search can't be integrated in the current codesearch? That's what Amir was proposing. Sadly I don't think much of the current code of ClassCrawler can be reused for that goal, and it's a pity.

Cheers, Giuseppe

-- Giuseppe Lavagetto Principal Site Reliability Engineer, Wikimedia Foundation

tdvit＠mail.com

8:14 a.m.

peter.ovchyn＠speedandfunction.com

8 Feb

2:23 p.m.

In my opinion, there is a misunderstanding regarding the ClassCrawler intention to replace codesearch. They’re different.

While codesearch simply shows the occurrences of the wanted text in multiple repositories, ClassCrawler shows and searches for relationships between methods and classes.

For example, this query https://classcrawler.prettyclear.com/?q=%7B%22fullName%22%3A%20%22AbstractCo... not only shows you a method and the place where it occurs, it additionally shows “overrides/overridden” sections with methods that belong to the same hierarchy.

Furthermore, you can search not only for a specific text or symbol, but also for a specific relationship: https://classcrawler.prettyclear.com/?q=%7B%22overrides%22%3A+%7B%22%24in%22... shows all methods which override Content::getRedirectTarget.

And it processes not only core but extensions as well. Note that “ProofreadPage\Page\PageContent::getRedirectTarget()” belongs to an extension.

Basically, ClassCrawler is like your local IDE that works globally like codesearch. So ClassCrawler does not replace codesearch; rather, it greatly extends its functionality.
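The relationship query described above can be sketched without a database. The document shape below (a `fullName` string and an `overrides` list) is an assumption inferred from the field names visible in the query URLs; the actual ClassCrawler schema is not shown in this thread, and `Example\OtherContent` and `Example\Unrelated` are invented entries. The matcher implements only plain equality and MongoDB's `$in` operator.

```python
# Hypothetical method documents, shaped like what a crawler might store.
METHODS = [
    {"fullName": "Content::getRedirectTarget", "overrides": []},
    {
        "fullName": "ProofreadPage\\Page\\PageContent::getRedirectTarget",
        "overrides": ["Content::getRedirectTarget"],
    },
    {
        "fullName": "Example\\OtherContent::getRedirectTarget",  # invented
        "overrides": ["Content::getRedirectTarget"],
    },
    {"fullName": "Example\\Unrelated::run", "overrides": []},  # invented
]


def matches(doc, query):
    """Return True if doc satisfies every field condition in query."""
    for field, cond in query.items():
        value = doc.get(field)
        if isinstance(cond, dict) and "$in" in cond:
            # For array fields, $in matches when any element is in the list.
            candidates = value if isinstance(value, list) else [value]
            if not any(v in cond["$in"] for v in candidates):
                return False
        elif value != cond:
            return False
    return True


def find(query):
    return [doc["fullName"] for doc in METHODS if matches(doc, query)]


overriders = find({"overrides": {"$in": ["Content::getRedirectTarget"]}})
print(overriders)
```

Against a real MongoDB collection, the same query dictionary could presumably be passed to a driver call such as pymongo's `find`, which is what makes the "all power of MongoDB query" claim plausible.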

Daniel Kinzler

11 Feb

9:25 a.m.

Am 07.02.22 um 08:43 schrieb Giuseppe Lavagetto:

...

Ok, why do you think symbol search can't be integrated in the current codesearch? That's what Amir was proposing. Sadly I don't think much of the current code of ClassCrawler can be reused for that goal, and it's a pity.

It could be integrated with the UI, but would require a very different backend.

I very much like Kunal's suggestion about LSIF. I'll reply to his mail.

-- Daniel Kinzler Principal Software Engineer, Core Platform Wikimedia Foundation

Kunal Mehta

8:15 a.m.

Hi,

On 2/5/22 13:18, Daniel Kinzler wrote:

...

Basically, I'd want codesearch to allow me to do the kind of "find callers" search that IDEs like PhpStorm support. Sure, I could do it in the IDE, but I can't link to that from a ticket, and I'd have to make sure I have exactly the right set of extensions installed (and updated).

Seems like what you're asking for is https://docs.gitlab.com/ee/user/project/code_intelligence.html, right?

AFAICT that functionality is available in self-hosted GitLab, it just requires someone writing a LSIF implementation for PHP, as it's not already listed on https://lsif.dev/#implementations-server.

Assuming no one else is, working on that would be a pretty cool project and nice contribution to the broader PHP community :-)

-- Legoktm

Daniel Kinzler

9:41 a.m.

Am 11.02.22 um 09:15 schrieb Kunal Mehta:

...

Seems like what you're asking for is https://docs.gitlab.com/ee/user/project/code_intelligence.html, right?

AFAICT that functionality is available in self-hosted GitLab, it just requires someone writing a LSIF implementation for PHP, as it's not already listed on https://lsif.dev/#implementations-server.

Oh nice! I did a few minutes of digging on LSIF and Sourcegraph, and it does sound quite good! Sourcegraph provides a search backend, navigation frontend, integration API and plugins for GitLab as well as Phabricator and even browser extensions. We could integrate it with codesearch as well, via its API. And LSIF is an open format for representing the kind of info we need.

Peter, what do you think of targeting LSIF instead of MongoDB?

I mean, just as an experiment. We still need to look closely whether LSIF really covers our needs. https://code.visualstudio.com/blogs/2019/02/19/lsif says that: /Same as LSP, LSIF doesn't contain any program symbol information nor does the LSIF define any symbol semantics (for example, what makes the definition of a symbol or whether a method overrides another method). The LSIF therefore doesn't define a symbol database, which is consistent with the LSP approach./

That's actually quite a bummer. The most critical kind of search after "where is this called" is "what overrides this method"... Do I understand correctly that LSIF doesn't do that?

PS: Sourcegraph's licensing model is a bit confusing though, seems like it's Apache for the core, and "open but not free" for some extra bits.

-- Daniel Kinzler Principal Software Engineer, Core Platform Wikimedia Foundation

peter.ovchyn＠speedandfunction.com

4:25 p.m.

Daniel Kinzler wrote:

...

Am 11.02.22 um 09:15 schrieb Kunal Mehta:

...
Seems like what you're asking for is https://docs.gitlab.com/ee/user/project/code_intelligence.html, right?

AFAICT that functionality is available in self-hosted GitLab, it just requires someone writing a LSIF implementation for PHP, as it's not already listed on https://lsif.dev/#implementations-server.

Oh nice! I did a few minutes of digging on LSIF and Sourcegraph, and it does sound quite good! Sourcegraph provides a search backend, navigation frontend, integration API and plugins for GitLab as well as Phabricator and even browser extensions. We could integrate it with codesearch as well, via its API. And LSIF is an open format for representing the kind of info we need.

Peter, what do you think of targeting LSIF instead of MongoDB?

Not sure. I need to research. My main concern is that LSIF is an index-based graph. So it needs some stuff on top of it to turn it into something usable for humans. Sourcegraph is

...

I mean, just as an experiment. We still need to look closely whether LSIF really covers our needs. https://code.visualstudio.com/blogs/2019/02/19/lsif says that: /Same as LSP, LSIF doesn't contain any program symbol information nor does the LSIF define any symbol semantics (for example, what makes the definition of a symbol or whether a method overrides another method). The LSIF therefore doesn't define a symbol database, which is consistent with the LSP approach./

That's actually quite a bummer. The most critical kind of search after "where is this called" is "what overrides this method"... Do I understand correctly that LSIF doesn't do that?

Seems like it does with limitations.

...

PS: Sourcegraph's licensing model is a bit confusing though, seems like it's Apache for the core, and "open but not free" for some extra bits.

Generally speaking, seems like LSIF can work. I'll research and get back next week.
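To illustrate the concern that LSIF is an index-style graph needing a layer on top, here is a rough sketch that answers "where is this referenced" by walking a tiny, hand-written, LSIF-style dump. The dump is simplified and the vertex and edge field names follow one reading of the LSIF 0.4 format; they are assumptions that should be checked against the specification, and the file URIs are invented.

```python
import json

# Simplified LSIF-style dump, one JSON element per line (all values invented).
DUMP = """
{"id": 1, "type": "vertex", "label": "document", "uri": "file:///src/Content.php"}
{"id": 2, "type": "vertex", "label": "range", "start": {"line": 4, "character": 17}, "end": {"line": 4, "character": 34}}
{"id": 3, "type": "vertex", "label": "resultSet"}
{"id": 4, "type": "edge", "label": "next", "outV": 2, "inV": 3}
{"id": 5, "type": "vertex", "label": "referenceResult"}
{"id": 6, "type": "edge", "label": "textDocument/references", "outV": 3, "inV": 5}
{"id": 7, "type": "vertex", "label": "document", "uri": "file:///src/Caller.php"}
{"id": 8, "type": "vertex", "label": "range", "start": {"line": 12, "character": 8}, "end": {"line": 12, "character": 25}}
{"id": 9, "type": "edge", "label": "item", "outV": 5, "inVs": [8], "document": 7, "property": "references"}
"""


def load(dump):
    """Split the dump into a vertex lookup table and a list of edges."""
    vertices, edges = {}, []
    for line in dump.strip().splitlines():
        element = json.loads(line)
        if element["type"] == "vertex":
            vertices[element["id"]] = element
        else:
            edges.append(element)
    return vertices, edges


def follow(edges, label, out_v):
    """Ids of vertices reachable from out_v over single-target edges with this label."""
    return [e["inV"] for e in edges if e["label"] == label and e["outV"] == out_v]


def find_references(vertices, edges, range_id):
    """Walk range -> resultSet -> referenceResult -> item edges -> (uri, line)."""
    locations = []
    for result_set in follow(edges, "next", range_id):
        for ref_result in follow(edges, "textDocument/references", result_set):
            for e in edges:
                if e["label"] == "item" and e["outV"] == ref_result:
                    uri = vertices[e["document"]]["uri"]
                    for rid in e["inVs"]:
                        locations.append((uri, vertices[rid]["start"]["line"]))
    return locations


vertices, edges = load(DUMP)
refs = find_references(vertices, edges, 2)
print(refs)
```

Even for this single question, three hops through the graph are needed, which is the kind of work a frontend such as Sourcegraph or GitLab would take on for the user.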

peter.ovchyn＠speedandfunction.com

12 Feb

2:54 p.m.

https://sourcegraph.com/github.com/wikimedia/mediawiki - we can check sourcegraph for mediawiki.

Jay prakash

4 Feb

4:59 p.m.

Hi Dima,

It looks nearly identical to our current code search tool https://codesearch.wmcloud.org/. I am curious how much faster it is than hound-search.[1]

Jay Prakash

[1] https://github.com/hound-search/hound

peter.ovchyn＠speedandfunction.com

8 Feb

2:43 p.m.

Yes, you're right. The interface looks similar. The functionality is different, though. While codesearch implements full-text search, ClassCrawler builds a relationship graph and searches in it.

https://classcrawler.prettyclear.com/?q=%7B%22fullName%22%3A%20%22AbstractCo...

https://classcrawler.prettyclear.com/?q=%7B%22overrides%22%3A+%7B%22%24in%22...

Bryan Davis

4 Feb

5:24 p.m.

On Fri, Feb 4, 2022 at 9:40 AM dima.batt@speedandfunction.com wrote:

...

saves to MongoDB

This is problematic from the point of view of shared use within the Wikimedia movement. MongoDB is a source available product under the non-free SSPL license [0]. This license was invented by MongoDB and submitted to the Open Source Initiative (OSI) for OSI approval and then later withdrawn [1]. The OSI now have a page explaining why this license is not likely to ever be given OSI approval [2].

This is all esoteric property rights management things to many people, but the Wikimedia Cloud Services environment (Cloud VPS and Toolforge) Terms of Use [3] requires that software installed in these environments is licensed under an OSI approved license. Thus MongoDB, modern versions of Elasticsearch, and other SSPL licensed software are not allowed. Even if SSPL was OSI approved it would be problematic in Cloud VPS & Toolforge as the main point of the license is to restrict cloud service providers from offering SSPL licensed software as a service to their clients.

[0]: https://en.wikipedia.org/wiki/Server_Side_Public_License [1]: https://lists.opensource.org/pipermail/license-review_lists.opensource.org/2... [2]: https://opensource.org/node/1099 [3]: https://wikitech.wikimedia.org/wiki/Wikitech:Cloud_Services_Terms_of_use#Wha...

Bryan

-- Bryan Davis Technical Engagement Wikimedia Foundation Principal Software Engineer Boise, ID USA [[m:User:BDavis_(WMF)]] irc: bd808

participants (11)

Adam Baso
Amir Sarabadani
Bryan Davis
Daniel Kinzler
dima.batt＠speedandfunction.com
Giuseppe Lavagetto
Inductiveload
Jay prakash
Kunal Mehta
peter.ovchyn＠speedandfunction.com
tdvit＠mail.com