June 28, 2022

Better npm search proposal

TL;DR: Current npm search engines aren't great. I explore an npm search algorithm that gives fewer points for popularity and more for consistency of commits, releases, and responses in issues/discussions. This way I want to: 1) save a lot of developer time, 2) give visibility to conscientious developers who don't promote their work, and 3) more 👇.

The problem with finding a good npm package

I have constant issues with finding good npm packages:

npms.io has the best-performing algorithm for me. However, sometimes it's slow, and other times I get an error and no results. Some checks no longer work. Because the index is updated only when a new version is published, this results in older packages having higher scores. If you go through the implementation, you will find that a lot of the ranking is determined by things that don't correlate well with how good the library is ¹. All this reminds me of "Bogus rankings damage what they rank. Especially if those being ranked play along."

npmjs.com search is bad. For example, one of my libraries that shows at the top on npms.io shows in 13th place on npmjs.com. The library is the most downloaded localStorage hook, and it has had consistent commits and releases for the past 2 years. I don't know what happened with "npm Blog Archive: Better search is here!".

Others. I have hopes for the future of socket.dev, which often has good results. However, the UX still has some issues and sometimes the results aren't optimal. The search quality of libraries.io is inconsistent. I also use GitHub search and Google.

Currently, I use the bash script below to search all the places I mentioned at once:

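# Join all arguments into a single search query.
args="$*"
# URL-encode the query. Passing it as an argument (instead of interpolating it
# into the JavaScript source) keeps quotes and backslashes in the query intact.
encodedValue=$(node --eval 'process.stdout.write(encodeURIComponent(process.argv[1]))' "$args")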

open -a "Google Chrome" \
  "https://npms.io/search?q=$encodedValue" \
  "https://socket.dev/search?q=$encodedValue" \
  "https://github.com/search?l=TypeScript&q=$encodedValue&type=Repositories" \
  "https://github.com/search?l=JavaScript&q=$encodedValue&type=Repositories" \
  "https://libraries.io/search?languages=&platforms=NPM&q=$encodedValue" \
  "https://www.google.com/search?q=site:npmjs.org+$encodedValue"

The bash script helps. However, this workflow is time-consuming and frustrating. My experience is: I open 10-20 tabs, close the duplicates, close everything irrelevant, close everything without recent activity, and dive deep into those that are left. It seems a lot of other people have the same problem.

My proposed solution

The main question I ask myself when I think about a solution is: If you open-source the algorithm and people try to optimize for it, does it yield better libraries? Here are the things I'm proposing:

Commit, release, response consistency. Most points are given for consistency of releases, commits, and issue responses (excluding non-maintainers). The longer the period of consistency, the better: libraries that have existed for a long time and have been consistently updated should have the highest scores. Think about it: if a library has been consistently updated for a long time, don't you want to see it regardless of the download count? More points for evenly spread activity, fewer points for occasional bursts ². Optionally, if a library is over a threshold, show an icon/badge for consistency.
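A minimal sketch of one way "evenly spread activity" could be scored. It is only one candidate measure, not a settled part of the proposal: it buckets activity into calendar months and computes the normalized Shannon entropy of the buckets (sometimes called Pielou's evenness), which is 1 when every month has the same amount of activity and 0 when everything happens in a single month. The function names `monthlyCounts` and `evenness`, the monthly bucket size, and the fixed window are all assumptions made for illustration.

```typescript
// Hypothetical helper: count events per calendar month (UTC) over a window of
// `months` months that ends in the month of `end`. Oldest month comes first.
function monthlyCounts(events: Date[], end: Date, months: number): number[] {
  const counts: number[] = [];
  for (let i = 0; i < months; i++) counts.push(0);

  const endIndex = end.getUTCFullYear() * 12 + end.getUTCMonth();
  for (const event of events) {
    const index = event.getUTCFullYear() * 12 + event.getUTCMonth();
    const offset = endIndex - index; // 0 = the month of `end`
    if (offset >= 0 && offset < months) {
      counts[months - 1 - offset] += 1;
    }
  }
  return counts;
}

// Normalized Shannon entropy of the buckets, in the range 0..1.
// 1 = activity spread perfectly evenly, 0 = no activity or a single burst.
function evenness(counts: number[]): number {
  let total = 0;
  for (const c of counts) total += c;
  if (total === 0 || counts.length < 2) return 0;

  let entropy = 0;
  for (const c of counts) {
    if (c > 0) {
      const p = c / total;
      entropy -= p * Math.log(p);
    }
  }
  return entropy / Math.log(counts.length);
}
```

Because the window is fixed, a library that has only been active for a few months of it leaves most buckets empty and scores lower than one that has been active throughout, which matches the "longer is better" idea. The measure ignores the volume of activity, so it would likely need to be combined with other signals.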

Account consistency. Some people go directly to Sindre Sorhus's repositories page and search there. If a person spends a large amount of time contributing, that's valuable. Give points to consistent accounts. Optionally, if a user is over a threshold, show an icon/badge.

Popularity. You can't ignore stars and downloads. They are an important factor. However, give them fewer points. This is a key aspect of this algorithm.

Give more points to. Most search engines have a "Sort by" option. In my experience, this doesn't work well. This is why I'm proposing an alternative "Give more points to" option that just shifts the weight toward a specific criterion. Possible values are "Repo consistency", "Account consistency", and "Popularity". "Repo consistency" is selected by default. Selecting "Popularity" makes it work more like existing search engines.
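A rough sketch of how the "Give more points to" option could work, assuming each criterion has already been reduced to a number between 0 and 1. The type names, the `score` function, and the weights `BASE_WEIGHT` and `BOOSTED_WEIGHT` are hypothetical placeholders; the real values would need tuning against actual search results.

```typescript
type Criterion = "repoConsistency" | "accountConsistency" | "popularity";

// Each signal is assumed to be normalized to the range 0..1 beforehand.
type Signals = { [K in Criterion]: number };

// Hypothetical weights: the selected criterion counts three times as much.
const BASE_WEIGHT = 1;
const BOOSTED_WEIGHT = 3;

// Weighted average of the signals. The `boost` argument is the value chosen
// in the "Give more points to" option; repo consistency is the default.
function score(signals: Signals, boost: Criterion = "repoConsistency"): number {
  const criteria: Criterion[] = ["repoConsistency", "accountConsistency", "popularity"];
  let weighted = 0;
  let totalWeight = 0;
  for (const criterion of criteria) {
    const weight = criterion === boost ? BOOSTED_WEIGHT : BASE_WEIGHT;
    weighted += weight * signals[criterion];
    totalWeight += weight;
  }
  return weighted / totalWeight;
}

// Example: a steadily maintained but little-known library versus a popular
// one with sporadic activity.
const steady: Signals = { repoConsistency: 0.9, accountConsistency: 0.5, popularity: 0.1 };
const viral: Signals = { repoConsistency: 0.1, accountConsistency: 0.5, popularity: 0.9 };
const steadyWinsByDefault = score(steady) > score(viral);
const viralWinsOnPopularity = score(viral, "popularity") > score(steady, "popularity");
```

Unlike "Sort by", every criterion still contributes to the result; the option only changes which one dominates.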

Exclude bots. Bot activity should be excluded. Otherwise, the search will probably get a lot worse, and it opens an opportunity for easy manipulation. For example, a version bump by a bot shouldn't count at all. This is similar to how GitHub's repo contributions page works.
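A minimal sketch of a bot filter, under the assumption that each activity record carries the author's GitHub login and, when available, the account type. It relies on two GitHub conventions: app accounts have logins ending in `[bot]` (for example `dependabot[bot]`), and the REST API reports their `type` as `"Bot"`. The `ActivityEvent` shape and the function names are hypothetical.

```typescript
// Hypothetical shape of one commit, release, or issue response.
interface ActivityEvent {
  authorLogin: string;
  authorType?: string; // e.g. "User" or "Bot", when the data source provides it
  date: string; // ISO 8601 timestamp
}

// Treat an event as bot activity if the account is typed as a bot
// or the login uses GitHub's "[bot]" suffix.
function isBot(event: ActivityEvent): boolean {
  return event.authorType === "Bot" || /\[bot\]$/.test(event.authorLogin);
}

// Keep only the events that should count toward consistency.
function humanActivity(events: ActivityEvent[]): ActivityEvent[] {
  return events.filter((event) => !isBot(event));
}
```

Automation that runs under a regular user account or a personal access token would not be caught by this check, so a real implementation would probably need additional heuristics.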

A possible pitfall in the idea

A big portion of repos will have a low consistency rating. A good fallback may be needed to account for that. I'm not sure if popularity is a good enough fallback.

Is it possible that the strange and opinionated scores used by other search engines are needed? I would bet on "no", but I'm very cautious with that guess.

What I've done to move the idea forward

I contacted Algolia and they gave me access to their npm index. I can use it for a basic implementation of my idea because it includes the history of all the releases. Also, the API returns sorted search results that can be used as a fallback or a base score. I'm not sure if this will be enough to produce consistently better results compared to other search engines.
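For illustration, here is a sketch of turning a release history into a sorted list of dates that a consistency score could consume. It does not use the Algolia index, whose record format isn't described here. Instead it assumes the shape of the `time` field in a public npm registry package document, which maps each version to its publish timestamp and also contains the `created` and `modified` keys. The `releaseDates` name is hypothetical.

```typescript
// Assumed shape of the `time` field of an npm registry package document:
// version strings (plus "created" and "modified") mapped to ISO timestamps.
type PackageTimes = { [key: string]: unknown };

// Extract the publish date of every version, oldest first.
function releaseDates(time: PackageTimes): Date[] {
  const dates: Date[] = [];
  for (const key in time) {
    if (!Object.prototype.hasOwnProperty.call(time, key)) continue;
    // These two keys describe the package as a whole, not a release.
    if (key === "created" || key === "modified") continue;

    const value = time[key];
    if (typeof value !== "string") continue; // skip anything that isn't a timestamp

    const date = new Date(value);
    if (!isNaN(date.getTime())) dates.push(date);
  }
  dates.sort((a, b) => a.getTime() - b.getTime());
  return dates;
}
```

Release dates alone cover only one of the three consistency signals; commits and issue responses would have to come from another source.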

I created a new discussion in the npm/feedback repo to share my idea. I also mentioned my idea in relevant discussions: "npm scores", "Weird search behavior with stats", and "Improve search functionality on npmjs.com".

If you are someone who can move this idea forward, please write to me.

Why I wrote this article

At first, I wanted a better search. However, after researching and exploring the topic, I now like the idea more for the opportunities it can create:

save a lot of developer time ⁴
allow non-vocal but conscientious developers to be recognized
incentivize the right thing so that, in theory, the quality of libraries and the ecosystem as a whole improves
in a utopian future, when the ecosystem improves and more people rely on open source, open-source developers get paid better

¹ Some of the things considered for the ranking are: badge count in the readme, readme length, .npmignore or package.json's files property, existence of a changelog.md file, and whether it uses a linter. Some of the checks aren't implemented well and incorrectly return false. See "npms.io ranking algorithm explained".
² I'm not entirely sure what the specific implementation should look like. I think it should calculate evenness, something like what is discussed in "Is there a measure of 'evenness' of spread?". However, if you understand the algorithm/maths behind it, write to me so I can add it to the article.
⁴ Reminds me of a Steve Jobs story: "Well, let's say you can shave 10 seconds off of the boot time. Multiply that by five million users and that's 50 million seconds, every single day."