smoores.dev

Building stuff on the Internet.

The Unreasonable Effectiveness of ProseMirror Model in Rich Text Transformation

May 18, 2026

By day, I’m a simple rich text editing engineer. I spend almost all of my working hours thinking about, using, and sometimes reimplementing ProseMirror. I do love ProseMirror, probably quite a bit more than the next guy, but it is a little all-consuming, if I’m being honest.

Which is why by night I maintain Storyteller, a platform for automatically aligning, reading, and listening to readaloud-enabled ebooks. It has nothing at all to do with rich text editing, so obviously it doesn’t depend on ProseMirror.

Obviously

Except about a month ago I might have added a minimal implementation of ProseMirror Model in Storyteller’s alignment package. But I can explain! It’s not my fault! It’s just that ProseMirror’s data model is such a good fit for rich text. I couldn’t resist. I don’t have a problem, you have a problem.

My problem

Storyteller’s primary job is to “align” ebooks and audiobooks. The basic idea is that we extract the text of the ebook, use automatic speech recognition to transcribe the audiobook, and then use a text-to-text forced alignment algorithm to figure out the best match for each sentence of text in the audiobook. ASR gives us the timestamps of each word in the transcript, so we can then figure out where each sentence of text starts and stops in the audio timeline.

This is genuinely hard, but even after we do all of this, there’s another hard problem we have to solve. EPUB files use XHTML (HTML semantics with XML syntax) to represent textual content. They use SMIL (a different XML application) to represent text-to-audio synchronization. In SMIL, text is referenced by URI, and audio is referenced by URI + start and end timestamps. Here’s an example:

<par id="sentence1">
<text src="chapter001.xhtml#sentence1" />
<audio src="audio001.mp4" clipBegin="0" clipEnd="3" />
</par>

If you’re familiar with URIs, you may be noticing an interesting limitation here. The URI for the text element uses a URI fragment (#sentence1) to specify which span of the text this audio clip corresponds to.

That means that we can only synchronize audio clips at the level of HTML elements (and only if those elements have unique IDs)! This is a pretty significant limitation, since nearly all EPUBs only have textblock-level markup, and rarely with IDs on every element. What do we do, if we want to provide a sentence-level synchronization? What about word-level?

Marking it up

If our only mechanism for referencing a span of text is via an element ID (technically, it’s not!), then our only option for modifying which spans we can reference is to modify the markup itself. We need to ensure that every span of text we care about is wrapped in a single contiguous element with a unique element ID. So, by way of example, the following XHTML:

<p>
Call me Ishmael. Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.
</p>

Needs to become:

<p>
<span id="sentence1">Call me Ishmael.</span> <span id="sentence2">Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.</span>
</p>

Which, at first glance, doesn’t seem so bad? You could imagine an algorithm that looks roughly like:

  1. For each text block
    1. Split the text content of the text block into sentences
    2. For each sentence
      1. Create a <span> with an ID, using a global counter to make sure they’re unique, and set the text content to be that sentence
    3. Replace the text block’s children with the concatenated span elements
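
As a rough illustration, here is a minimal sketch of that naive algorithm for a single plain-text block. It assumes a regex-based `splitSentences` as a stand-in for a real sentence segmenter, and all of the helper names (`splitSentences`, `escapeXml`, `wrapSentences`) are hypothetical, not Storyteller's actual API.

```typescript
// Naive splitter: a stand-in for a real sentence segmenter (assumption).
function splitSentences(text: string): string[] {
  return text.match(/[^.!?]+[.!?]+\s*|[^.!?]+$/g) ?? []
}

function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
}

// Wraps each sentence in a <span>, drawing IDs from a shared counter so
// they stay unique across text blocks.
function wrapSentences(text: string, counter: { next: number }): string {
  return splitSentences(text)
    .map((sentence) => {
      // Keep whitespace between sentences outside of the spans
      const trimmed = sentence.replace(/\s+$/, "")
      const trailing = sentence.slice(trimmed.length)
      const id = `sentence${counter.next++}`
      return `<span id="${id}">${escapeXml(trimmed)}</span>${trailing}`
    })
    .join("")
}

console.log(wrapSentences("Call me Ishmael. Some years ago.", { next: 1 }))
// <span id="sentence1">Call me Ishmael.</span> <span id="sentence2">Some years ago.</span>
```

This only works because the input is a plain string; it has no way to deal with child elements inside the text block.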

It’s a good thought, but unfortunately we’re not only working with plain text. Well, maybe it’s not unfortunate if you’re a reader, but it does make our lives a bit more challenging! Let’s look at another example:

<p>
This is a sentence with <em>emphasis. And it continues</em> into the next sentence!
</p>

Now we have a conundrum. We can preserve the original markup, but only at the expense of our ability to uniquely identify each sentence. If we want to keep the emphasis exactly as it is, we’re stuck with splitting up our sentence spans instead:

<p>
<span id="sentence1-1">This is a sentence with </span><em><span id="sentence1-2">emphasis.</span> <span id="sentence2-1">And it continues</span></em><span id="sentence2-2"> into the next sentence!</span>
</p>

But this isn’t what we want. It means that we no longer have any real control over which spans of text get highlighted for the user while they’re using readaloud mode — instead, we’re limited to working around the existing markup. And the more the markup varies, the more we have to split up our sentences.

Instead, we can split up the emphasis:

<p>
<span id="sentence1">This is a sentence with <em>emphasis.</em></span><em> </em><span id="sentence2"><em>And it continues</em> into the next sentence!</span>
</p>

Now we’re back to exactly two spans, one per sentence. We’ve broken up the emphasis markup quite a lot, but it should look visually identical to the reader, and the semantics are more or less the same, too. I should be clear that I know that this isn’t a perfect solution — it’s absolutely possible for an EPUB to have CSS that breaks if you change the markup like this, for example — but it works in every book we’ve tested so far, and I think that it’s the right tradeoff for our product needs.

At this point, it should be clear that this problem is actually fairly complex. It’s also, essentially, a rich text editing problem. The sentence spans are inline marks that we need to add to specific spans of text, which may already have any number of existing inline marks on them. It’s the same problem we have to solve when we add a bold mark to a selection that already has a link and italics in it. And the best solution to nearly every rich text editing problem is, well, ProseMirror!

Making it ProseMirror

Our naïve approach to solving this problem, before we took inline elements into account, was a single-pass approach that mutated the DOM tree in place. But ProseMirror has a much more structured approach to modeling rich text. Let’s step back and frame this the way that ProseMirror does.

Step 1: Parsing

First, we need to parse the DOM into a data structure that’s going to be useful for us. That data structure is, essentially, the ProseMirror Model, though we’re not going to enforce a schema like ProseMirror usually does, because we don’t know what the shape of our document will be ahead of time.

In ProseMirror, content is essentially broken down into nodes, marks, and attributes. Nodes and marks can both have attributes. Nodes can have child nodes, and they can also have marks. Semantically, a node represents a piece of content, and a mark represents some formatting information for that content. Here’s how we would model our second example with ProseMirror:

root(
  p(
    text("This is a sentence with "),
    text("emphasis. And it continues", [em]),
    text(" into the next sentence!")
  )
)

This looks pretty similar to the XHTML, but there’s one rather important difference: the text is flat. Rather than nesting the emphasized text one level deeper than the un-emphasized text, all of the text nodes are direct children of the paragraph node. This will be important later!
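
To make the shape concrete, here is a minimal sketch of how that flat model could be written down as TypeScript types. The type and helper names (`ElementNode`, `textNode`, `block`) are assumptions made for this example, and they use plain objects rather than the classes in the parser that follows.

```typescript
type Mark = { tagName: string; attrs: Record<string, string> }
type TextNode = { type: "text"; text: string; marks: Mark[] }
type ElementNode = {
  type: "element"
  tagName: string
  attrs: Record<string, string>
  children: (ElementNode | TextNode)[]
  marks: Mark[]
}

function textNode(value: string, marks: Mark[] = []): TextNode {
  return { type: "text", text: value, marks }
}

function block(tagName: string, children: (ElementNode | TextNode)[]): ElementNode {
  return { type: "element", tagName, attrs: {}, children, marks: [] }
}

const em: Mark = { tagName: "em", attrs: {} }

const paragraph = block("p", [
  textNode("This is a sentence with "),
  textNode("emphasis. And it continues", [em]),
  textNode(" into the next sentence!"),
])

// Every text node is a direct child of the paragraph. The emphasis is
// just data attached to the second text node, not an extra tree level.
const paragraphText = paragraph.children
  .map((child) => (child.type === "text" ? child.text : ""))
  .join("")

console.log(paragraphText)
```

Because the text is flat, reading the paragraph's text content is a single pass over its children, with no recursion through inline elements.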

Our parser ends up actually looking rather simple. Note that BLOCKS is a list of XHTML tag names that represent block content, like p, li, h1, div, etc.:

class Parser {
  parseDom(xml: ParsedXml) {
    const children = this.parseDomChildren(xml)
    return new Root(children)
  }

  parseDomChildren(xml: ParsedXml) {
    const children: (Node | TextNode)[] = []
    for (const child of xml) {
      const result = this.parseDomNode(child)
      const nodes = Array.isArray(result) ? result : [result]
      children.push(...nodes)
    }

    return children
  }

  parseDomNode(
    xmlNode: XmlNode,
    marks?: Mark[],
  ): Node | TextNode | (Node | TextNode)[] {
    if (isTextNode(xmlNode)) {
      return new TextNode(xmlNode.text, marks)
    }

    // Block elements become nodes, and their children are parsed recursively
    if (BLOCKS.includes(xmlNode.tagName)) {
      return new Node(
        xmlNode.tagName,
        xmlNode.attrs,
        this.parseDomChildren(xmlNode.children),
        marks,
      )
    }

    // Childless inline elements become leaf nodes
    if (!xmlNode.children.length) {
      return new Node(xmlNode.tagName, xmlNode.attrs, [], marks)
    }

    // Any other inline element becomes a mark on its flattened descendants
    return xmlNode.children.flatMap((child) =>
      this.parseDomNode(child, [
        ...(marks ?? []),
        new Mark(xmlNode.tagName, xmlNode.attrs),
      ]),
    )
  }
}

Step 2: Transforming

Now that we have our data model, let’s see what we can do with it! First we need to figure out where each sentence starts and stops. ProseMirror has two tools that help us with this: positions and mappings.

In ProseMirror, every single position in the document is addressed by a unique integer. This enables some really wonderful properties, including mappings, which we’ll talk about next. For now, here’s a simple document with the positions annotated:

root(
  ⁰p(
    text("¹a²b³c⁴")
  )⁵
)

Some things to note:

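  1. There are unique positions before and after node boundaries. Position 0 is before the paragraph node, and position 1 is inside the paragraph node, just after its opening boundary and before the letter “a”.
  2. Since text nodes don’t have children (just like in the HTML DOM, a text node is a leaf node with a value), they have no opening or closing boundary of their own. There’s just one position at the start of the text node: position 1 here is both the start of the paragraph’s content and the start of the text.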
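
The position rules above can be sketched as a small size calculation. This follows ProseMirror's indexing rule (one position per character, one for a leaf node, and an opening plus a closing position for a node with content); treating any childless element as a leaf is an assumption of this sketch, and the type and function names are hypothetical.

```typescript
type TextNode = { type: "text"; text: string }
type ElementNode = { type: "element"; tagName: string; children: DocNode[] }
type DocNode = TextNode | ElementNode

// One position per character, one for a leaf element, and two extra
// (open + close) for an element that has content.
function nodeSize(node: DocNode): number {
  if (node.type === "text") return node.text.length
  if (node.children.length === 0) return 1
  return 2 + node.children.reduce((sum, child) => sum + nodeSize(child), 0)
}

// Collects [character, position just before it] pairs for a list of nodes.
function charPositions(
  nodes: DocNode[],
  pos = 0,
  out: [string, number][] = [],
): [string, number][] {
  for (const child of nodes) {
    if (child.type === "text") {
      for (let i = 0; i < child.text.length; i++) {
        out.push([child.text.charAt(i), pos + i])
      }
    } else if (child.children.length > 0) {
      // Step over the opening boundary before visiting the content
      charPositions(child.children, pos + 1, out)
    }
    pos += nodeSize(child)
  }
  return out
}

const abc: DocNode = {
  type: "element",
  tagName: "p",
  children: [{ type: "text", text: "abc" }],
}

console.log(nodeSize(abc)) // 5
console.log(charPositions([abc])) // [["a", 1], ["b", 2], ["c", 3]]
```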

A mapping is a data structure that represents how positions change through a sequence of transformations to the document. For example, if we inserted the character “d” at position 2 in our previous document, we would end up with this new document:

root(
  ⁰p(
    text("¹a²d³b⁴c⁵")
  )⁶
)

And this mapping:

0 → 0
1 → 1
2 → 3
3 → 4
4 → 5
5 → 6

You can read this as saying “the content that was previously at position 2 is now at position 3”. It’s a way to track nodes and text across changes — before our insertion, “b” was at position 2, and after it’s at position 3. The mapping is how we know where it moved to!
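
Here is a minimal sketch of how such a mapping can be computed. It is modeled on the way ProseMirror's StepMap stores changed ranges as (start, old size, new size) triples, but it is a simplified stand-in rather than the library's implementation, and the names `MapRange` and `mapPos` are assumptions.

```typescript
// [start, oldSize, newSize]: at `start`, `oldSize` positions were
// replaced by `newSize` positions.
type MapRange = [number, number, number]

// Maps a position through a sorted list of ranges. `assoc` decides which
// side of inserted content the position sticks to (1 = after, -1 = before).
function mapPos(ranges: MapRange[], pos: number, assoc: 1 | -1 = 1): number {
  let diff = 0
  for (const [start, oldSize, newSize] of ranges) {
    if (start > pos) break
    const end = start + oldSize
    if (pos <= end) {
      const side =
        oldSize === 0 ? assoc : pos === start ? -1 : pos === end ? 1 : assoc
      return start + diff + (side < 0 ? 0 : newSize)
    }
    diff += newSize - oldSize
  }
  return pos + diff
}

// Inserting "d" at position 2: nothing removed, one position added.
const insertD: MapRange[] = [[2, 0, 1]]

console.log([0, 1, 2, 3, 4, 5].map((pos) => mapPos(insertD, pos)))
// [0, 1, 3, 4, 5, 6]
```

Note that position 2 only moves to 3 because the default association is "after the insertion". Calling `mapPos(insertD, 2, -1)` keeps it at 2, in front of the new character.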

There are two transforms that we need to implement for Storyteller, using our new position and mapping systems:

Transform 1: Lift text

First, we need to “lift” the text out of the document. We want to end up with a flat string that we can pass to our sentence segmenter, and an accompanying mapping that lets us identify what position each character in the string had in the original document.

To make this easier, we’ll implement ProseMirror’s descendants iterator. This is a function that takes a node and calls a callback for each descendant of that node:

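export function descendants(
  node: Root | Node,
  cb: (
    node: Node | TextNode,
    pos: number,
    parent: Node | Root,
    index: number,
  ) => boolean,
  pos = 0,
) {
  // Step over this node's opening boundary, if it has one
  pos += node.border

  // enumerate (a helper not shown here) yields [index, child] pairs
  for (const [i, child] of enumerate(node.children)) {
    // The callback returns false to skip this child's descendants
    const descend = cb(child, pos, node, i)
    if (descend && !child.isLeaf) {
      descendants(child as Node, cb, pos)
    }
    pos += child.nodeSize
  }
}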

Now we can use this iterator to produce a single string with the text of the document. Each time we append to the string, we also add a single StepMap to our mapping, “deleting” the positions between the start of the text we’re adding now and the end of the last text we added.

Taking a look at our sample document again:

root(
  p(
    text("This is a sentence with "),
    text("emphasis. And it continues", [em]),
    text(" into the next sentence!")
  )
)

Here’s the string produced by liftText:

This is a sentence with emphasis. And it continues into the next sentence!

And our mapping looks like:

1 → 0
2 → 1
3 → 2
... // and so on

We’ve “deleted” the paragraph block boundary, so the beginning of the first character (“T”), which was previously at position 1, is now at position 0.
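
A minimal sketch of this lifting step might look like the following. To stay short it walks the tree directly instead of using `descendants`, and it records every deleted range in original-document coordinates (as one list of triples) rather than building a sequence of StepMaps. Those simplifications, and the names used here, are assumptions of the sketch.

```typescript
type TextNode = { type: "text"; text: string }
type ElementNode = { type: "element"; tagName: string; children: DocNode[] }
type DocNode = TextNode | ElementNode

// [start, oldSize, newSize]; newSize is always 0 because we only delete.
type DeletedRange = [number, number, number]

function liftText(children: DocNode[]): { result: string; ranges: DeletedRange[] } {
  let result = ""
  let lastEnd = 0 // document position just after the last lifted text
  const ranges: DeletedRange[] = []

  const walk = (nodes: DocNode[], start: number): number => {
    let pos = start
    for (const node of nodes) {
      if (node.type === "text") {
        // "Delete" everything between the previous text and this one
        if (pos > lastEnd) ranges.push([lastEnd, pos - lastEnd, 0])
        result += node.text
        pos += node.text.length
        lastEnd = pos
      } else if (node.children.length === 0) {
        pos += 1 // a leaf element occupies a single position
      } else {
        pos = walk(node.children, pos + 1) + 1 // opening + closing boundary
      }
    }
    return pos
  }

  const size = walk(children, 0)
  // Delete whatever follows the final text (e.g. closing boundaries)
  if (size > lastEnd) ranges.push([lastEnd, size - lastEnd, 0])
  return { result, ranges }
}

const sample: DocNode[] = [
  {
    type: "element",
    tagName: "p",
    children: [
      { type: "text", text: "This is a sentence with " },
      { type: "text", text: "emphasis. And it continues" },
      { type: "text", text: " into the next sentence!" },
    ],
  },
]

const lifted = liftText(sample)
console.log(lifted.result.length) // 74
console.log(lifted.ranges) // [[0, 1, 0], [75, 1, 0]]
```

The two deleted ranges are the paragraph's opening boundary (position 0) and its closing boundary (position 75). Marks are left out of the types here because they don't affect positions.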

Transform 2: Add mark

Now we have a flat string to segment. Our segmenter can output sentences like:

[
"This is a sentence with emphasis. ",
"And it continues into the next sentence!"
]

The next step is to add our sentence spans, based on these sentences. Remember how I said that it would be important later that ProseMirror models text as “flat” sequences of text nodes, with no nesting? This is why!

Our first sentence starts at position 0 in the flattened string, and ends at position 34. The second sentence starts at 34, and ends at 74. Since we’re going to ultimately be transforming our ProseMirror document, not the flat string, we need to find the corresponding positions in the original document.

Luckily, we have a mapping for exactly this purpose! We can invert the mapping, producing a new mapping like this:

0 → 1
1 → 2
2 → 3
... // and so on

This inverted mapping represents how positions would change if we transformed our flat string back into the ProseMirror document. So position 0 becomes 1, 34 becomes 35, and 74 becomes 75.
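
To illustrate the inversion, here is a small sketch that uses the same simplified (start, old size, new size) triples as ProseMirror's StepMap. The `invert` and `mapPos` functions are hypothetical stand-ins, and the two deleted ranges are hard-coded to the paragraph boundaries of the sample document.

```typescript
type MapRange = [number, number, number] // [start, oldSize, newSize]

// Swap old and new sizes, and shift each start into the coordinate
// space that exists after the original change was applied.
function invert(ranges: MapRange[]): MapRange[] {
  let diff = 0
  return ranges.map(([start, oldSize, newSize]) => {
    const inverted: MapRange = [start + diff, newSize, oldSize]
    diff += newSize - oldSize
    return inverted
  })
}

function mapPos(ranges: MapRange[], pos: number, assoc: 1 | -1 = 1): number {
  let diff = 0
  for (const [start, oldSize, newSize] of ranges) {
    if (start > pos) break
    const end = start + oldSize
    if (pos <= end) {
      const side =
        oldSize === 0 ? assoc : pos === start ? -1 : pos === end ? 1 : assoc
      return start + diff + (side < 0 ? 0 : newSize)
    }
    diff += newSize - oldSize
  }
  return pos + diff
}

// Lifting the sample paragraph deleted its opening boundary (position 0)
// and its closing boundary (position 75).
const deleted: MapRange[] = [[0, 1, 0], [75, 1, 0]]
const inverted = invert(deleted)

console.log(inverted) // [[0, 0, 1], [74, 0, 1]]
console.log(mapPos(inverted, 0)) // 1
console.log(mapPos(inverted, 34)) // 35
console.log(mapPos(inverted, 74, -1)) // 75
```

The end of the last sentence is mapped with an association of -1. With the default of 1, position 74 would land after the re-inserted closing boundary, at 76, which is outside the paragraph.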

We know that marks have to be assigned to nodes. If we want to have one mark for sentence 1, and a different mark for sentence 2, we need to ensure that we have distinct text nodes to assign those marks to. That means we need to split our second text node, the one with the emphasis mark, in two:

root(
  p(
    text("This is a sentence with "),
    text("emphasis. ", [em]),
    text("And it continues", [em]),
    text(" into the next sentence!")
  )
)

Adding our marks is now very simple! We traverse our ProseMirror document, keeping track of the position we’re at, and add a mark to every text node we encounter between the start and end positions of the new mark.

addMark(from: 1, to: 35, {tagName: "span", attrs: {id: "sentence1"}})



root(
  p(
    text("This is a sentence with ", [span{id=sentence1}]),
    text("emphasis. ", [span{id=sentence1}, em]),
    text("And it continues", [em]),
    text(" into the next sentence!")
  )
)

Note that we always add new marks to the beginning of the marks array for a node. This is important — when we serialize back to XHTML later, we’ll collapse adjacent nodes with the same starting mark into a single XHTML element.

And again for the second sentence:

addMark(from: 35, to: 75, {tagName: "span", attrs: {id: "sentence2"}})



root(
  p(
    text("This is a sentence with ", [span{id=sentence1}]),
    text("emphasis. ", [span{id=sentence1}, em]),
    text("And it continues", [span{id=sentence2}, em]),
    text(" into the next sentence!", [span{id=sentence2}])
  )
)
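
The split-then-mark step can be sketched like this. To keep it short, this version only handles the flat text children of a single block, and takes the position of the block's first character as a `blockStart` parameter. That parameter, and the plain-object types, are assumptions of this sketch rather than the signature used in Storyteller.

```typescript
type Mark = { tagName: string; attrs: Record<string, string> }
type TextNode = { text: string; marks: Mark[] }

// Adds `mark` to the text between `from` and `to`, splitting any text
// node that straddles either boundary. New marks go first in the array.
function addMark(
  nodes: TextNode[],
  blockStart: number,
  from: number,
  to: number,
  mark: Mark,
): TextNode[] {
  const out: TextNode[] = []
  let pos = blockStart
  for (const node of nodes) {
    const end = pos + node.text.length
    const cutFrom = Math.max(from, pos)
    const cutTo = Math.min(to, end)
    if (cutFrom >= cutTo) {
      out.push(node) // no overlap with the marked range
    } else {
      if (cutFrom > pos) {
        out.push({ text: node.text.slice(0, cutFrom - pos), marks: node.marks })
      }
      out.push({
        text: node.text.slice(cutFrom - pos, cutTo - pos),
        marks: [mark, ...node.marks],
      })
      if (cutTo < end) {
        out.push({ text: node.text.slice(cutTo - pos), marks: node.marks })
      }
    }
    pos = end
  }
  return out
}

const emMark: Mark = { tagName: "em", attrs: {} }
const original: TextNode[] = [
  { text: "This is a sentence with ", marks: [] },
  { text: "emphasis. And it continues", marks: [emMark] },
  { text: " into the next sentence!", marks: [] },
]

const withFirst = addMark(original, 1, 1, 35, {
  tagName: "span",
  attrs: { id: "sentence1" },
})
const withBoth = addMark(withFirst, 1, 35, 75, {
  tagName: "span",
  attrs: { id: "sentence2" },
})

console.log(withBoth.map((node) => node.text))
```

The function returns a new array instead of mutating its input, so each call can be tested on its own.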

Step 3: Serializing

Now all that’s left is to serialize this back to XHTML. In general, this is pretty straightforward — traverse the ProseMirror tree, and produce an XHTML element for each node. We have to be a little thoughtful about how we handle marks, though, so that we don’t split our sentence spans!

To keep our complexity manageable, we use a simple rule when serializing marks. We join adjacent nodes whose first mark is the same. This doesn’t always lead to the longest possible contiguous span, but it is predictable (and much easier to implement). This is why it was important earlier that we add our new span marks to the beginning of the mark set — that guarantees that they’ll always be contiguous.

The algorithm is recursive. For a given set of adjacent text nodes, partition them by their first mark. For each partition, serialize the first mark, remove it from each node’s mark set, and then partition again. Here’s an example:

text("a", [strong, em, ins]),
text("b", [strong, em]),
text("c", [strong, del, em]),
text("d", [strong, del, a])

First, we partition by first mark, then we remove that first mark from each node’s mark set. All of these nodes have the same first mark, so we only end up with one partition:

strong
======
text("a", [em, ins]),
text("b", [em]),
text("c", [del, em]),
text("d", [del, a])

Then we repeat:

strong
======

em
------
text("a", [ins]),
text("b")

del
------
text("c", [em]),
text("d", [a])

This gives us a pretty clear XHTML representation:

<strong>
<em>
<ins>a</ins>
b
</em>
<del>
<em>c</em>
<a>d</a>
</del>
</strong>

Again, this is not the optimal serialization, if we were to optimize for contiguity. We could have re-ordered the marks in the third text node to extend the em mark across the first three nodes. That’s ok! Our goal is predictability, not optimal contiguity.
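
Here is a minimal sketch of that recursive partitioning for a run of adjacent text nodes. It compares marks structurally (tag name plus JSON-encoded attributes) and does only minimal escaping. Both of those choices, and all of the names, are assumptions of this sketch.

```typescript
type Mark = { tagName: string; attrs: Record<string, string> }
type TextNode = { text: string; marks: Mark[] }

function sameMark(a: Mark | undefined, b: Mark | undefined): boolean {
  if (!a || !b) return a === b
  return a.tagName === b.tagName && JSON.stringify(a.attrs) === JSON.stringify(b.attrs)
}

function escapeText(value: string): string {
  return value.replace(/&/g, "&amp;").replace(/</g, "&lt;")
}

function serializeInline(nodes: TextNode[]): string {
  let out = ""
  let i = 0
  while (i < nodes.length) {
    const first: Mark | undefined = nodes[i].marks[0]
    // Collect the run of adjacent nodes that share this first mark
    let j = i + 1
    while (j < nodes.length && sameMark(first, nodes[j].marks[0])) j++
    const run = nodes.slice(i, j)
    if (first === undefined) {
      // No marks left: emit plain text
      out += run.map((node) => escapeText(node.text)).join("")
    } else {
      const attrs = Object.entries(first.attrs)
        .map(([key, value]) => ` ${key}="${value}"`)
        .join("")
      // Strip the shared first mark, then partition the run again
      const inner = serializeInline(
        run.map((node) => ({ text: node.text, marks: node.marks.slice(1) })),
      )
      out += `<${first.tagName}${attrs}>${inner}</${first.tagName}>`
    }
    i = j
  }
  return out
}

const mk = (tagName: string): Mark => ({ tagName, attrs: {} })

console.log(
  serializeInline([
    { text: "a", marks: [mk("strong"), mk("em"), mk("ins")] },
    { text: "b", marks: [mk("strong"), mk("em")] },
    { text: "c", marks: [mk("strong"), mk("del"), mk("em")] },
    { text: "d", marks: [mk("strong"), mk("del"), mk("a")] },
  ]),
)
// <strong><em><ins>a</ins>b</em><del><em>c</em><a>d</a></del></strong>
```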

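Going back to our demo doc, we now have the serialization we wanted at the beginning (assuming the whitespace between the two sentences is trimmed from the segmenter’s output before the marks are added, so that it stays outside of both spans):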

<p>
<span id="sentence1">This is a sentence with <em>emphasis.</em></span><em> </em><span id="sentence2"><em>And it continues</em> into the next sentence!</span>
</p>

What this gives us

const doc = parseDom(body)

const lifted = liftText(doc.root)

const segmentation = await segmentChapter(lifted.result, {
  primaryLocale: locale,
})

const inverted = lifted.mapping.invert()

let root = doc.root
let pos = 0
let i = 0
for (const sentence of segmentation) {
  root = addMark(
    root,
    inverted.map(pos),
    inverted.map(pos + sentence.length, -1),
    new Mark("span", { id: `sentence${i}` })
  )
  pos += sentence.length
  i++
}

const markedUpBody = serializeDom(root)

This is, in my opinion, really elegant! We hijacked ProseMirror’s data and transform models, which are so well suited for rich text that the code we ended up with is simple, legible, and easy to test.

Now, if you’ll excuse me, I’m going to get back to adding support for rich text notes to the Storyteller mobile app. What? Yeah of course I’m using ProseMirror, why do you ask?