Brute Force Code Nav

2024/08/15

Sophisticated IDEs are nice. Or... at least they are supposed to be. In theory.

Here is the promise: you can use an all-encompassing indexer that understands your code and lets you navigate it quickly. You can immediately see where a particular function is being called from, taking into account its exact set of parameters. When asked to rename a class, it will find every use of it, but will not mess up that other class that has the same name in a different scope.

Now, for such a system to be useful, it needs to be 100% accurate. If it understands 90% of your code but has no idea how to expand that weird template that uses that other library that only your build system knows where to find, then whatever lives in that part will not show up in your search results. You'll only realize this once you try building your code, or possibly not even then. (Unless you were specifically looking for something in that part. In that case, you are simply doomed.)

One solution is to have as good an approximation of your actual build system as possible, for example by getting it to output the actual build commands. Then your indexer can approximate what your compiler does (perhaps by being pretty much the same thing as your compiler).

Of course, this is somewhat hard to do if your programming language insists on not being compiled until you end up running it.

This post is not about how to do this well. This post is about how to get 70% of the way there, by instead doing something that's extremely stupid and low effort (but nice and robust).

Ctags

Things in a code base are already named so that they are easy to find.

In fact, people who are reviewing a PR on GitHub do not have access to an IDE anyway; they generally appreciate it if not every function in your code is named get(), even though IDEs understand perfectly well which one is which.

Also, what you really need is not precision, but recall. If your system has low precision, it might return some results that are calls to things other than your function (... unless they are all called get(), that is); the extra ones are not too hard to filter out. Low recall is harder to fix: results that are missing never show up for you to notice.
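To put numbers on that trade-off, here is a small sketch. The call sites and the two "tools" are made up for illustration; only the definitions of precision and recall are standard.

```python
def precision_and_recall(returned, relevant):
    """Return (precision, recall) for a set of search hits.

    `returned` is what the tool showed; `relevant` is what you actually wanted.
    """
    returned, relevant = set(returned), set(relevant)
    true_hits = returned & relevant
    precision = len(true_hits) / len(returned) if returned else 1.0
    recall = len(true_hits) / len(relevant) if relevant else 1.0
    return precision, recall


# Hypothetical data: ten real call sites of some function.
real_calls = {f"call_{i}" for i in range(10)}

# Text search: finds all ten, plus two unrelated lines with the same name.
text_search = real_calls | {"unrelated_1", "unrelated_2"}

# Indexer: every hit is correct, but it never saw one call inside a template.
indexer = real_calls - {"call_7"}

print(precision_and_recall(text_search, real_calls))
print(precision_and_recall(indexer, real_calls))
```

In this made-up case the text search has the worse precision, but the two extra hits are visible and easy to dismiss. The call the indexer missed is not visible at all.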

So can we just index all the words in the text with some knowledge of what's a class and function definition, and return search results somewhat liberally?

This is the idea behind TAGS files. The format is relatively simple: it just lists noteworthy places in the code (such as definitions) in a form that is easy to look up by name.
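A toy version of the idea, as a sketch only: this is not the real TAGS format or how ctags is implemented. The regular expression, the function names and the choice to recognize only Python-style definitions are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Assumption: we only care about Python-style "def" and "class" lines.
DEFINITION = re.compile(r"^\s*(?:async\s+def|def|class)\s+([A-Za-z_]\w*)")


def build_tag_index(sources):
    """Map each defined name to the (filename, line number) pairs defining it.

    `sources` is a dict of filename -> file contents.
    """
    index = defaultdict(list)
    for filename, text in sources.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            match = DEFINITION.match(line)
            if match:
                index[match.group(1)].append((filename, lineno))
    return dict(index)


def lookup(index, name):
    """Return every known definition site for `name`, liberally."""
    return index.get(name, [])
```

Note that lookup() happily returns every definition that shares a name, in any scope. That is the "somewhat liberal" part: it does not try to decide which get() you meant.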

We can take such ideas to a logical extreme though.

Search

Wouldn't it be nice if we could just search within a project, without any indexing whatsoever? Using whatever regular expression we want?

Of course, this is not especially feasible if your repo is large enough, so...

... well, actually, with modern SSDs and ripgrep, it is perfectly feasible, to the point of not taking a noticeable amount of time even in huge code bases. Coming from plain grep, it's unbelievable, until you try it.
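For a feel of how little machinery "no index at all" needs, here is a brute-force search sketched in Python. It is a stand-in for illustration, not how ripgrep works: it is single-threaded, it does not read .gitignore, and the skip_dirs parameter is something invented for this example.

```python
import os
import re


def search_tree(root, pattern, skip_dirs=(".git",)):
    """Yield (path, line number, line) for every line matching `pattern`."""
    regex = re.compile(pattern)
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped directories in place so os.walk does not descend into them.
        dirnames[:] = sorted(d for d in dirnames if d not in skip_dirs)
        for filename in sorted(filenames):
            path = os.path.join(dirpath, filename)
            try:
                with open(path, encoding="utf-8", errors="replace") as handle:
                    for lineno, line in enumerate(handle, start=1):
                        if regex.search(line):
                            yield path, lineno, line.rstrip("\n")
            except OSError:
                continue  # unreadable file: skip it rather than abort the search
```

Something like `for hit in search_tree(".", r"fetch_user_\w+"): print(hit)` then lists every matching line under the current directory. On a large repo, expect this to be far slower than the real tool.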

(Or you're already using Visual Studio Code, which can do this out of the box?)

"Navigation by search" also has the nice property of forcing you to pick long, unique function and class names, resulting in more readable code, without even accounting for searchability.

Refactoring

Sometimes, that missing 30% of functionality sounds like it would be nice to have. Being able to rename a function without further thought is very cool, especially considering that you're more likely to end up doing things that are easy to do.

In practice though, you'll end up reading the code you're refactoring anyway, as renaming functions is just a tiny piece of what you're doing. Having to click through each of the search results before you rename them is also a good way of ensuring that your interfaces are not ugly and inconvenient to use. (More refactor ideas!)
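If you want a little help with the clicking, a rename can be previewed instead of applied. The sketch below is one hypothetical way to do that (preview_rename is a name made up here): it matches on word boundaries only and hands back every line it would change, so you still get to look at each one.

```python
import re


def preview_rename(text, old_name, new_name):
    """List (line number, before, after) for each line the rename would touch."""
    # \b keeps "get" from matching inside "get_user" or "widget".
    pattern = re.compile(r"\b" + re.escape(old_name) + r"\b")
    changes = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        # A function as the replacement avoids backslash/group interpretation.
        renamed = pattern.sub(lambda _match: new_name, line)
        if renamed != line:
            changes.append((lineno, line, renamed))
    return changes
```

This knows nothing about scopes, so it will offer to rename every get in sight. Whether each one is really yours is still for you to decide.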

Ambitions

Meanwhile, IDEs themselves feel like they are missing at least an additional 30%. Wouldn't it be nice to have a call tree that gives an overview of how your system works, while also taking asynchronous calls into account and leaving out the boring library ones?

We might be coming full circle on this. Simple text search is powerful because it just treats code as text, reading it the way humans do. IDEs try to build a fancy set of abstractions instead, which takes a lot of effort to maintain but brings you some benefits. Meanwhile, the even more novel concept of stuffing your entire code base into an LLM and having it extract the interesting parts is closer to treating code as text, with the benefit that LLMs can deal with any human-readable format.

Will it bring us that extra 30%?

(Until then, ripgrep it is.)