Some package managers are faster than others. The early JavaScript package managers, npm and yarn, are commonly replaced these days by faster alternatives like bun and pnpm. I've also seen benchmarks between package managers where the performance gap is rather large, but it wasn't obvious to me why one package manager would ever be significantly faster than another.

To understand more about package manager performance, I traced some call paths through bun's Zig codebase and pnpm's TypeScript codebase, but I was still missing some details about the performance challenges these projects were taking on.
So I built my own toy package manager called caladan. For now, it just does two things: install npm packages from a valid `package-lock.json` file and run bin scripts.
I wanted to get close to the cold install performance of bun, and I'm pretty happy with the results. Benchmarks are usually incorrect, so there's a good chance I'm being unfair to bun here. Here are the results nonetheless:
```
# ran on m1 mac w/ 600mbps network, bun v1.2.5
# both have an equivalent lockfile with 16 packages (311mb on disk)
# cache is cleared before each run with `bun pm cache rm && rm -rf node_modules`
./benchmark.sh

Benchmark 1: ./caladan install-lockfile fixtures/1
  Time (mean ± σ):     1.767 s ±  0.052 s    [User: 2.168 s, System: 2.236 s]
  Range (min … max):   1.729 s …  1.857 s    5 runs

Benchmark 2: bun install --force --ignore-scripts --no-cache --network-concurrency 64
  Time (mean ± σ):     1.587 s ±  0.097 s    [User: 0.496 s, System: 1.293 s]
  Range (min … max):   1.486 s …  1.693 s    5 runs

Summary
  bun install --force --ignore-scripts --no-cache --network-concurrency 64 ran
    1.11 ± 0.08 times faster than ./caladan install-lockfile fixtures/1
```
The much lower user time of bun points to its efficient Zig codebase. Seeing similar-ish system times and overall wall-clock times suggests that both tools have the same fundamental limits (whether network, disk I/O, or system call overhead). On a faster and more capable machine, bun would be able to make better use of the available resources.
To verify that my package manager is doing the same work, I checked that the sizes of the directories inside `node_modules` were comparable, and that the bin scripts ran without any errors (e.g. `nanoid`, `next`, and `image-size`).
```
./caladan run fixtures/1 nanoid
Running nanoid with args: []
Working directory: fixtures/1
guxvWmbNcvIuAowqzrnEu
```
The benchmark script is open source and hopefully you'll correct me if I've set it up unfairly.
I'll outline my efforts to get close to bun's cold install performance in the following sections.
Installing a Package
`package-lock.json` is generated automatically by a previous install to lock the exact versions of all dependencies (and their dependencies) in a Node.js project. It ensures consistent installations across different environments by recording the precise dependency tree that was resolved at that install.
It's mostly made up of dependency entries like this:
"dependencies": {// .."date-fns": {"version": "2.29.3","resolved": "<https://registry.npmjs.org/date-fns/-/date-fns-2.29.3.tgz>","integrity": "sha512-dDCnyH2WnnKusqvZZ6+jA1O51Ibt8ZMRNkDZdyAyK4YfbDwa/cEmuztzG5pk6hqlp9aSBPYcjOlktquahGwGeA=="},
Our job, as a minimal package manager, is to install all of these dependencies.
- Parse `package-lock.json` (see the parsing sketch below)
- Download the compressed files from `resolved`
- Verify their `integrity` by calculating the hash of these files
- Extract them to `node_modules`
- Parse `node_modules/$package/package.json` and check for a `bin` property
- (If so, create a symlink inside `node_modules/.bin/$package`)
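For the first step, parsing is mostly a matter of describing the lockfile's shape to `encoding/json`. Here's a minimal sketch based on the entry shown above; the type and field names (and the fixture path) are my own, not necessarily what caladan uses:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// PackageLock models just the parts of package-lock.json we need here.
type PackageLock struct {
	Dependencies map[string]PackageInfo `json:"dependencies"`
}

// PackageInfo mirrors the per-dependency fields shown in the snippet above.
type PackageInfo struct {
	Version   string `json:"version"`
	Resolved  string `json:"resolved"`  // tarball URL
	Integrity string `json:"integrity"` // e.g. "sha512-<base64 digest>"
}

// parseLockfile reads and decodes a package-lock.json file.
func parseLockfile(path string) (*PackageLock, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var lock PackageLock
	if err := json.Unmarshal(data, &lock); err != nil {
		return nil, err
	}
	return &lock, nil
}

func main() {
	lock, err := parseLockfile("fixtures/1/package-lock.json")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(len(lock.Dependencies), "dependencies to install")
}
```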
Not listed here are other features, like pre- and post-install scripts, that I haven't implemented. I think I'm also missing some validation steps (e.g. checking if `package.json` differs from the lockfile).
To get everything working, I started by implementing these steps to run sequentially. It was very slow, taking ~30 seconds to install all the packages for my small project.
I got a 2x improvement by skipping packages that I didn't need to install (i.e. by filtering by OS). On my MacBook, I don't need to install `node_modules/@next/swc-darwin-x64`, but I do need to install `node_modules/@next/swc-darwin-arm64`.
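A rough sketch of what that check can look like. This version assumes the platform is encoded in the package name (as it is for the `@next/swc-*` packages), which isn't necessarily how caladan does it; newer lockfile formats also expose explicit `os` and `cpu` fields that could be used instead.

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
)

// skipForPlatform reports whether a platform-specific package (such as
// "@next/swc-darwin-x64") targets a different OS/arch than this machine.
// Sketch only: it just inspects the package name suffix.
func skipForPlatform(pkgName string) bool {
	hostOS := runtime.GOOS // "darwin", "linux", "windows", ...
	if hostOS == "windows" {
		hostOS = "win32" // npm's name for Windows
	}
	// Map Go's arch names to npm's naming convention.
	hostArch := map[string]string{"amd64": "x64", "arm64": "arm64", "386": "ia32"}[runtime.GOARCH]

	for _, osName := range []string{"darwin", "linux", "win32"} {
		for _, arch := range []string{"x64", "arm64", "ia32"} {
			if strings.HasSuffix(pkgName, "-"+osName+"-"+arch) {
				return osName != hostOS || arch != hostArch
			}
		}
	}
	return false // not platform-specific; always install
}

func main() {
	// On an M1 MacBook this prints: true false
	fmt.Println(
		skipForPlatform("@next/swc-darwin-x64"),
		skipForPlatform("@next/swc-darwin-arm64"),
	)
}
```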
The next big improvement was to run things in parallel. I put each package's download-and-extract step in its own goroutine and stuck them in an errgroup.
```go
g := errgroup.Group{}

// Process each package in parallel
for pkgName, pkgInfo := range packages {
	g.Go(func() error {
		// Skip OS-specific packages that don't match current OS
		// ..

		// Create package directory
		// ..

		// Normalize package path
		// ..

		// Download the package tarball
		return DownloadAndExtractPackage(
			ctx,
			httpSemaphore,
			tarSemaphore,
			client,
			pkgInfo.Resolved,
			pkgInfo.Integrity,
			pkgPath,
		)
	})
}

// Wait for all packages to complete
err := g.Wait()
// ..
```
This was much faster than doing everything sequentially. However, without limits on parallelism, there was resource contention in two areas: HTTP requests and unzipping files.
Comparing CPU Profiles
From reading their codebases, I knew that bun and pnpm used different levels of concurrency for HTTP requests and unzipping files.
When I added separate semaphores around these steps, the performance of my install step improved by ~20% for the small project I've been testing. I knew intuitively that these semaphores helped with resource contention, but I thought it would be interesting to prove this using profiling tools.
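For reference, a minimal sketch of how the two semaphores can be constructed and used with `golang.org/x/sync/semaphore`; the limits shown are illustrative rather than caladan's actual values (sizing extraction at roughly 1.5x the core count is discussed below):

```go
package main

import (
	"context"
	"fmt"
	"runtime"

	"golang.org/x/sync/semaphore"
)

func main() {
	ctx := context.Background()

	// One limiter caps in-flight registry downloads, the other caps
	// concurrent tarball extractions.
	httpSemaphore := semaphore.NewWeighted(64)
	tarSemaphore := semaphore.NewWeighted(int64(float64(runtime.NumCPU()) * 1.5))

	// Each package's goroutine then brackets the expensive steps:
	_ = httpSemaphore.Acquire(ctx, 1)
	fmt.Println("downloading tarball...")
	httpSemaphore.Release(1)

	_ = tarSemaphore.Acquire(ctx, 1)
	fmt.Println("extracting tarball...")
	tarSemaphore.Release(1)
}
```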
I've chosen to highlight the effect of adding the semaphore for unzipping files as the performance improvement is more significant there.
In my program, I have an env var that allows me to output CPU profiles:
```go
if cpuProfilePath := os.Getenv("CPU_PROFILE"); cpuProfilePath != "" {
	f, err := os.Create(cpuProfilePath)
	if err != nil {
		fmt.Printf("Error creating CPU profile file: %v\n", err)
		os.Exit(1)
	}
	pprof.StartCPUProfile(f)
	defer pprof.StopCPUProfile()
	fmt.Printf("CPU profiling enabled, writing to: %s\n", cpuProfilePath)
}
```
I used pprof's `-text` output to compare the two profiles (with the unzip semaphore and without it) side by side in my code editor:
```
go tool pprof -text cpu_without_sema.prof > cpu_without_sema.txt
go tool pprof -text cpu_with_sema.prof > cpu_with_sema.txt
```
Decompression Performance Improvement
With the semaphore, the core decompression functions accounted for a smaller share of overall program time and were also quicker to run. Below is the profile data for `huffmanBlock` (decoding a single Huffman block) and `huffSym` (reading the next Huffman-encoded symbol).
```
# with semaphore
 flat  flat%   sum%    cum   cum%
0.11s  1.94% 89.22%  0.42s  7.42%  compress/flate.(*decompressor).huffmanBlock
0.11s  1.94% 87.28%  0.19s  3.36%  compress/flate.(*decompressor).huffSym

# without semaphore
 flat  flat%   sum%    cum   cum%
0.11s  1.88% 88.57%  0.51s  8.70%  compress/flate.(*decompressor).huffmanBlock
0.19s  3.24% 82.08%  0.29s  4.95%  compress/flate.(*decompressor).huffSym
```
There was also a ~5% decrease in the time spent waiting on system calls (`syscall.syscall`) and I/O (`os.(*File).Write` and `os.(*File).ReadFrom`).
More Detail on Why
The semaphore limits the number of concurrent extraction operations, reducing CPU, memory, and I/O contention. By matching the extraction concurrency to the available CPU resources (using 1.5x the number of cores), the system avoids thrashing and excessive context switching.
Notably, there was an increase in "scheduling time". This may seem counterintuitive, but here it's desirable: it means synchronization is more orderly and predictable, with less chaotic contention for system resources:
```
runtime.schedule      +2.70%
runtime.park_m        +1.23%
runtime.gopreempt_m   +0.42%
runtime.goschedImpl   +0.42%
runtime.notewakeup    +0.21%
runtime.lock          +1.31%
runtime.lockWithRank  +1.31%
runtime.lock2         +1.31%
```
We traded a small amount of scheduling time for faster I/O and faster decompression (CPU).
Keeping Things in Memory
One of the ways you can be fast is to avoid disk operations altogether. This was the final optimization I added. Initially, I downloaded each package to a temporary file and then extracted it into `node_modules`.
I realized I could do everything at the same time using the HTTP response stream:
- Download the bytes of the archive
- Extract directly to the final location
- Calculate the hash as we go so we can verify each package's integrity
```go
// DownloadAndExtractPackage downloads a package tarball and extracts it
func DownloadAndExtractPackage(ctx context.Context, httpSemaphore, tarSemaphore *semaphore.Weighted, client *http.Client, url, integrity, destPath string) error {
	httpSemaphore.Acquire(ctx, 1)
	defer httpSemaphore.Release(1)

	// Request the tarball
	resp, err := client.Get(url)
	if err != nil {
		return fmt.Errorf("error downloading package: %v", err)
	}
	defer resp.Body.Close()

	// Setup hash verification
	var hash interface {
		io.Writer
		Sum() []byte
	}
	// ..

	// Use a TeeReader to compute hash while reading
	teeReader := io.TeeReader(resp.Body, hash)
	reader := teeReader

	tarSemaphore.Acquire(ctx, 1)
	defer tarSemaphore.Release(1)

	// Extract directly from the download stream
	err = extractTarGz(reader, destPath)
	if err != nil {
		return fmt.Errorf("error extracting package: %v", err)
	}

	// Compare hashes
	// ..

	return nil
}
```
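The hash setup and comparison are elided above. As a rough sketch of what the verification involves (the `integrity` field is `<algorithm>-<base64 digest>`, sha512 in the entries shown earlier), here's one way it could be done using the standard `hash.Hash` interface rather than the custom interface in the snippet; this is illustrative, not caladan's code:

```go
package main

import (
	"crypto/sha512"
	"encoding/base64"
	"fmt"
	"hash"
	"strings"
)

// verifyIntegrity compares a computed digest against an npm integrity
// string of the form "<algorithm>-<base64 digest>". Sketch: only sha512
// is handled, which covers the lockfile entries shown earlier.
func verifyIntegrity(h hash.Hash, integrity string) error {
	expected, ok := strings.CutPrefix(integrity, "sha512-")
	if !ok {
		return fmt.Errorf("unsupported integrity algorithm: %q", integrity)
	}
	actual := base64.StdEncoding.EncodeToString(h.Sum(nil))
	if actual != expected {
		return fmt.Errorf("integrity mismatch: got %s, want %s", actual, expected)
	}
	return nil
}

func main() {
	// The hash would normally be fed by the io.TeeReader during download;
	// here we simulate the stream with a fixed payload.
	h := sha512.New()
	h.Write([]byte("pretend this is the tarball stream"))
	integrity := "sha512-" + base64.StdEncoding.EncodeToString(h.Sum(nil))

	fmt.Println(verifyIntegrity(h, integrity)) // <nil>
}
```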
In a way, everything gets blocked on the semaphore that wraps the extraction step. But since extraction is an order of magnitude faster than downloading the bytes over the network, it feels like a good design.
Running Scripts
The final part of my package manager program configures the symlinks for any bin scripts that the packages might have. It also runs them when invoked with `caladan run <directory> <script> <args>`.
After a package is downloaded to `node_modules/$package/`, it has a `package.json` file which may have a `bin` property.
For example, `nanoid` has:
"bin": "./bin/nanoid.cjs",
Which means there's a file at `node_modules/nanoid/bin/nanoid.cjs` that we need to create an executable symlink for at `node_modules/.bin/nanoid`.
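A sketch of how that symlink can be created in Go; the function and parameter names here are mine, and it only handles the string form of `bin` (the field can also be an object mapping several command names to files):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// linkBin creates node_modules/.bin/<name> pointing at the file that the
// package's "bin" field references, and marks the script as executable.
func linkBin(nodeModules, pkgName, binRelPath, binName string) error {
	target := filepath.Join(nodeModules, pkgName, binRelPath) // node_modules/nanoid/bin/nanoid.cjs
	link := filepath.Join(nodeModules, ".bin", binName)       // node_modules/.bin/nanoid

	if err := os.MkdirAll(filepath.Dir(link), 0o755); err != nil {
		return err
	}
	// The script itself needs the executable bit set.
	if err := os.Chmod(target, 0o755); err != nil {
		return err
	}
	// Use a relative symlink so node_modules can be relocated.
	rel, err := filepath.Rel(filepath.Dir(link), target)
	if err != nil {
		return err
	}
	// Replace any stale link left over from a previous install.
	_ = os.Remove(link)
	return os.Symlink(rel, link)
}

func main() {
	// Assumes nanoid has already been extracted into node_modules/nanoid.
	if err := linkBin("node_modules", "nanoid", "bin/nanoid.cjs", "nanoid"); err != nil {
		fmt.Println("link error:", err)
	}
}
```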
The hardest part is getting the relative file paths correct and ensuring that args are passed correctly. Running the script isn't too hard; it's effectively just `exec.Command`:
```go
func Run(directory string, args []string) {
	// Split the script name from its arguments (caladan run <directory> <script> <args>)
	scriptName, scriptArgs := args[0], args[1:]

	// Set up command to run script using project-relative path
	binScriptName := filepath.Join("./node_modules/.bin", scriptName)
	cmd := exec.Command("sh", "-c", binScriptName+" "+strings.Join(scriptArgs, " "))

	// Set working directory to the specified directory (project root)
	cmd.Dir = directory
	fmt.Printf("Working directory: %s\n", directory)

	// Connect standard IO
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.Stdin = os.Stdin

	// Run the command and wait for it to finish
	err := cmd.Run()
	// ..
```
To Conclude
All this to have a package manager that implements 2% of the spec that users expect, hah.
It's ~700 lines of Go (open source) and it was fun to write. I now have a better understanding of the upper end of the performance that's possible when installing packages.
I'd like to be able to handle a cold install from a `package.json` (creating and updating the lockfile) at similar speeds. I hope to put together a follow-up post when I'm able to get my dependency resolution and hoisting to match how `npm` does it.
I'd also like to look into the cache optimizations that `bun` uses for repeat package installs, which in some cases take only tens of milliseconds.
After getting up close to the basics of package manager-ing over the past week, I feel like JavaScript doesn't cut it as far as the required performance is concerned. I used to think that package managers were network-bound, but now I've changed my mind.
The raw performance (and the concurrency primitives) of a systems-y language like Go gives you so much more power.
To end on a Jarred Sumner post:
A lot of performance optimizations come from looking closely at things people assume is "just the network" or "just I/O bound"