smoores.dev

Building stuff on the Internet.

Overcoming I/O Limits in Node.js

Nov. 18, 2024

Note: A previous version of this post utilized fileHandle.readableWebStream() to create an async iterator to stream a file's contents into memory. I have since discovered that there’s a bug in Node.js’ Web Streams implementation that gets triggered by this usage. I filed an issue report for it here!

Node.js has a (poorly) documented 2GB size limitation on file reads, and a seemingly undocumented 2GB size limitation on FormData parsing from web Request objects. This is a brief walkthrough of how I updated Storyteller to work around these limitations, allowing users to process longer books! If you’re running into this issue as well, and are just looking for the solution, you can skip ahead to how I fixed it.

A few days ago, a Storyteller user reported a bug. They were unable to download a book that had just finished syncing; when they tried, they saw an error in their server logs:

code: 'ERR_FS_FILE_TOO_LARGE'

Those of you familiar with the Cosmere may be unsurprised to learn that this was a Brandon Sanderson novel (Sanderson writes some famously lengthy novels). The synced book, which includes the audio narration, was 3.1GB. This is large, but it’s not that large. For comparison, ultra-HD video files for movies are often more than 10GB. It certainly didn’t seem large enough to trigger some sort of hard-coded file size limitation.

But the error message was pretty clear, so I went digging. I started with the Node.js docs, which had a pretty terse but not unhelpful explanation:

ERR_FS_FILE_TOO_LARGE

An attempt has been made to read a file whose size is larger than the maximum allowed size for a Buffer.

The next stop, of course, was the Node.js Buffer docs, to find the maximum allowed size for a Buffer:

On 64-bit architectures, this value currently is 2^53 - 1 (about 8 PiB).

... Wait, what? The maximum Buffer size is about 8 pebibytes. That’s 2.6 million Brandon Sanderson novels. This made no sense at all, and sure enough, I found a pull request on the Node repo, miraculously opened the very day I was looking into this issue, that updates the docs in response to a comment from a Node maintainer on another issue:

This is a documentation issue. The 2GB limit is not for the Buffer, but rather an I/O limit.

And indeed, there’s a pull request on the libuv repo from seven years ago attempting to remove this limitation. libuv is the asynchronous I/O library used by Node.js under the hood. According to the PR description, some platforms don’t allow I/O operations for more than INT32_MAX bytes at a time, so libuv hard codes a limit of 2GB for all such calls.
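To see how that limit surfaces, here’s the kind of single-shot read that trips it (the path is just a placeholder for any file over 2GB):

import { readFile } from "node:fs/promises"

try {
  // readFile issues the whole read in one go, so a 2GB+ file
  // fails before any data comes back
  await readFile("/path/to/synced-book")
} catch (error) {
  console.error(error) // → { code: 'ERR_FS_FILE_TOO_LARGE', ... }
}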

I now had two useful pieces of information. First, Node.js Buffers can be really, really big. And second, libuv only allows you to read 2GB of data from a file at a time. The solution, therefore, was to stream data into a buffer in memory, one chunk at a time:

import { open } from "node:fs/promises"

const fileHandle = await open(path)
const stats = await fileHandle.stat()
// Buffers can be far larger than 2GB; allocate one for the whole file up front
const fileData = new Uint8Array(stats.size)

// Copy the file into the buffer one chunk at a time, so that no single
// I/O call ever exceeds the 2GB limit
let i = 0
for await (const chunk of fileHandle.createReadStream()) {
  const buffer = chunk as Buffer
  fileData.set(buffer, i)
  i += buffer.byteLength
}

await fileHandle.close()
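For context, here’s roughly how that buffered data can then be returned from a download route handler. This is just a sketch, not Storyteller’s actual endpoint; the helper and content type are placeholders:

// Placeholder standing in for the streamed read shown above
declare function readSyncedBook(): Promise<Uint8Array>

// Sketch of a download route handler that returns the buffered data
export async function GET() {
  const fileData = await readSyncedBook()
  return new Response(fileData, {
    headers: { "Content-Type": "application/octet-stream" },
  })
}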

This fixes the original issue that was reported; users can now download files of any size, as long as their server has enough memory available. As I worked through it, though, I remembered another Storyteller bug report that mentioned a 2GB limit:

TypeError: Failed to parse body as FormData.

If the audio file(s) is ≥ 2GB, it will trigger the error.

This occurs when uploading a file to Storyteller that’s more than 2GB in size. The error is, oddly enough, different: it seems that undici’s form data parser simply stops after 2GB, even if there’s still more data to read, which results in a misleading error suggesting that the body was formatted incorrectly.
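That error comes from the one-shot request.formData() parse; in a Next.js route handler, the pattern looks roughly like this (the field name and response here are made up for the example):

export async function POST(request: Request) {
  // Parsing the whole body in one call is where undici's 2GB limit bites
  const formData = await request.formData()
  // "audio" is a made-up field name for this example
  const file = formData.get("audio") as File
  // ... write the file to disk, kick off processing, etc.
  return new Response(null, { status: 201 })
}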

To fix this, I had a hunch that I needed to take a similar approach, but this time streaming data from the network onto disk, rather than from disk into memory. This was made somewhat more complicated by the fact that the request body was encoded as multipart/form-data and contained multiple files. I couldn’t simply stream the bytes directly to disk; I had to actually parse the data on the fly!

Luckily, there’s a time-tested library that serves precisely this purpose: busboy. Using busboy, I was able to stream each file to disk, regardless of its size (note that this is within a Next.js route handler, so request.body is a web ReadableStream):

import { randomUUID } from "node:crypto"
import { createWriteStream, mkdirSync } from "node:fs"
import { tmpdir } from "node:os"
import { join } from "node:path"
import { Readable } from "node:stream"
import busboy from "busboy"

const body = request.body

const headers = Object.fromEntries(request.headers.entries())
const tmpDir = join(tmpdir(), `storyteller-upload-${randomUUID()}`)

await new Promise((resolve, reject) => {
  const bus = busboy({ headers: headers })
  // busboy emits a "file" event for each file field in the multipart body,
  // handing us a readable stream that can be piped straight to disk
  bus.on("file", (name, file, info) => {
    const tmpNamedDir = join(tmpDir, name)
    const tmpFile = join(tmpNamedDir, info.filename)
    mkdirSync(tmpNamedDir, { recursive: true })
    file.pipe(createWriteStream(tmpFile))
  })

  bus.on("error", reject)
  bus.on("close", resolve)

  // Convert the web ReadableStream into a Node.js Readable and pipe it into busboy
  Readable.fromWeb(body).pipe(bus)
})

Hopefully, one day, that libuv pull request will get merged. At the very least, hopefully we can fix the Node.js docs that still incorrectly suggest that this error is related to max Buffer size. But for now, Storyteller users can read their Brandon Sanderson novels, and that’ll have to do.