Phonetic Matching
Nov. 12, 2024
Just a heads up: this post starts out somewhat technical and includes a discussion of interesting algorithmic topics, like forced alignment and phonetic matching. But it ends by delving into some deeper social and human topics that might not be what everyone is looking for in a blog that’s mostly about software.
Recently, I’ve spent quite a lot of time working on Storyteller. It’s an open source, self-hostable platform for automatically syncing audiobooks and ebooks. You give Storyteller the ebook and audiobook files for the same book, and it spits out a new ebook file with the audio embedded, such that the text can be highlighted while the audio is playing. In order to do this, Storyteller has to answer a pretty challenging question:
How do you automatically align the text with the corresponding audio?
In computer science, this problem is called “forced alignment”, and there are multiple ways to approach it. Storyteller’s approach is to transcribe the audio to text — with timestamps for each word — and then use a fuzzy matching algorithm to attempt to line up each sentence in the text with the corresponding transcription. Since there are timestamps for each word in the transcription, this allows Storyteller to determine when a given sentence starts and ends in the audio.
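As a rough sketch of what that looks like (the types and names here are illustrative, not Storyteller’s actual code):

```typescript
// Illustrative types, not Storyteller's actual API. The speech-to-text
// step produces a list of words, each with timestamps.
interface TranscribedWord {
  text: string;
  start: number; // seconds into the audio
  end: number;
}

// Once fuzzy matching has identified the slice of transcribed words
// that corresponds to a sentence, the sentence's audio span falls out
// of the first and last matched words' timestamps.
function sentenceAudioSpan(matchedWords: TranscribedWord[]) {
  return {
    start: matchedWords[0].start,
    end: matchedWords[matchedWords.length - 1].end,
  };
}
```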
Storyteller delegates the first part of this solution, transcribing the audio, to dedicated tools like OpenAI’s Whisper. Transcription is a very, very hard problem, and one that is genuinely best solved by large neural networks like Whisper. Even Whisper isn’t perfect, though, and even if it were, audiobook narrators don’t always narrate the written text perfectly. In fact, sometimes there are significant, intentional differences between the narration and what’s in the text version of the book!
To account for this, Storyteller uses a fuzzy matching algorithm to determine which part of the transcription matches a given sentence. The basic idea is to walk through the text of the book one sentence at a time and, for each sentence, compare it to the next chunk of transcription, looking for the window that looks the least different. For example, the actual text of Moby Dick contains this sentence:
Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.
Which Whisper transcribed as this:
Call me Ishmael. Some years ago, never mind how long precisely, having little or no money in my purse and nothing particular to interest me on sure. I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation.
Storyteller’s goal is to identify that the sentence starts at character 17 of the transcription and ends at character 254, because that’s the closest match to the text. It has to account for the fact that the transcription used commas instead of dashes in the first clause, that “shore” was transcribed as “sure,” and that the transcription breaks this sentence up into two sentences, with the first ending after the word “sure.” It does this using the Levenshtein distance, a measure of how many single-character edits need to be made to one string to turn it into another. For example, the Levenshtein distance between “shore” and “sure” is 2, because we would need to make 2 edits to “shore” to turn it into “sure”:
Replace the “o” with an “u”: shure
Delete the “h”: sure
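To make this concrete, here’s a minimal sketch of both pieces: the standard dynamic-programming Levenshtein distance, and a brute-force search for the transcription offset that’s the smallest edit distance away from a sentence. The names and the simplistic character-by-character scan are mine for illustration; this isn’t Storyteller’s actual implementation.

```typescript
// Standard dynamic-programming Levenshtein distance, using a single
// rolling row of the edit-distance matrix.
function levenshtein(a: string, b: string): number {
  const distances = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diagonal = distances[0]; // distance for prefixes (i-1, j-1)
    distances[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const previous = distances[j];
      distances[j] = Math.min(
        distances[j] + 1, // delete from `a`
        distances[j - 1] + 1, // insert into `a`
        diagonal + (a[i - 1] === b[j - 1] ? 0 : 1) // substitute, or match
      );
      diagonal = previous;
    }
  }
  return distances[b.length];
}

// Brute-force search for the start offset in the transcription where
// the sentence matches best. Real matchers are smarter about
// narrowing the search space, but the idea is the same.
function findBestOffset(sentence: string, transcription: string): number {
  let bestOffset = 0;
  let bestDistance = Infinity;
  const lastOffset = transcription.length - sentence.length;
  for (let offset = 0; offset <= lastOffset; offset++) {
    const window = transcription.slice(offset, offset + sentence.length);
    const distance = levenshtein(sentence, window);
    if (distance < bestDistance) {
      bestDistance = distance;
      bestOffset = offset;
    }
  }
  return bestOffset;
}

console.log(levenshtein("shore", "sure")); // 2
```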
This example actually reveals a flaw in this approach to forced alignment: English spelling is weird. “Shore” and “sure” are pronounced almost identically, but they’re spelled quite differently. We had to make two edits to “shore,” a word that’s only five letters long. This situation is fairly common, because transcription tools often “mishear” a spoken word as a different, valid word that sounds similar but may be spelled very differently.
For over a year, this has been the status quo for Storyteller. Levenshtein distance calculations were good enough for most use cases, and if sometimes Storyteller missed a sentence or two... well, it’s a fully automated process, it’ll never be perfect!
Then, about two weeks ago, a Storyteller user made a suggestion:
Have you considered using Soundex or another phonetic encoding on both texts before performing the forced alignment? It can be used with Levenshtein distance and might be more robust to inaccurate transcriptions since you’re comparing similar sounding words rather than similar spellings of words.
In fact, I had not considered using a phonetic encoding for Storyteller, because, of course, I had never heard of phonetic encodings. So, with shores on my mind, I started researching the world of phonetic encodings.
In 1922, Robert C. Russell and Margaret King Odell patented an algorithm called Soundex. It was a variation on a patent that Russell had acquired four years earlier, and it was part of a collection of indexing-related patents that Russell’s company owned and sold. They marketed it to companies and government organizations as an indexing system that would group together names that were pronounced the same, even if they were spelled differently.
Years later, as part of President Franklin D. Roosevelt’s Works Progress Administration, a variation on the algorithm was developed and named “American Soundex”. It was initially used by the Census Bureau to assist in finding records for people who needed official proof of age, because many states did not have uniform birth registries. There was no guarantee that someone with the surname “Mayer” hadn’t been registered under the misspelled name “Meyer”, for example. But both “Mayer” and “Meyer” are encoded as M600 in Soundex!
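The algorithm is simple enough to sketch in a few lines. This is a simplified version that skips some of the official edge cases, like the special treatment of “h” and “w” between letters that share a code:

```typescript
// Simplified American Soundex sketch; assumes a non-empty, alphabetic
// name, and skips the "h"/"w" separator edge case.
const CODES: Record<string, string> = {
  b: "1", f: "1", p: "1", v: "1",
  c: "2", g: "2", j: "2", k: "2", q: "2", s: "2", x: "2", z: "2",
  d: "3", t: "3",
  l: "4",
  m: "5", n: "5",
  r: "6",
};

function soundex(name: string): string {
  const letters = name.toLowerCase().replace(/[^a-z]/g, "");
  const first = letters[0];
  let previousCode = CODES[first] ?? "";
  let digits = "";
  for (const letter of letters.slice(1)) {
    const code = CODES[letter] ?? "";
    // Vowels and other uncoded letters are dropped; adjacent letters
    // with the same code collapse into a single digit
    if (code && code !== previousCode) digits += code;
    previousCode = code;
  }
  // Keep the first letter, pad with zeros, truncate to four characters
  return (first.toUpperCase() + digits + "000").slice(0, 4);
}

console.log(soundex("Mayer")); // "M600"
console.log(soundex("Meyer")); // "M600"
```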
It’s worth noting that Soundex predates modern computing by several decades. At the time, records were encoded by hand, which is part of why the algorithm was so simple. Since then, and especially in recent years with the explosion in computing power, several more elaborate phonetic encoding algorithms have been developed.
In 1970, New York State introduced the “New York State Identification and Intelligence System” (NYSIIS) algorithm, an improvement over the American Soundex system.
In 1985, genealogists Gary Mokotoff and Randy Daitch developed Daitch-Mokotoff Soundex, after encountering issues trying to apply the original Russell Soundex to Germanic and Slavic Jewish surnames.
In 1990, Lawrence Philips, then an analyst working at an insurance agency, developed an alternative to Soundex called Metaphone, which attempts to index words (not just names) by their English pronunciation. He now runs Anthropomorphic Software, a company that owns and sells licenses for Metaphone 3, the most recent iteration of the algorithm.
In 2008, mathematician and genealogist Alexander Beider and computer and software engineer Stephen Morse developed the Beider-Morse Phonetic Matching algorithm to aid in their Jewish genealogical research.
Beider-Morse Phonetic Matching (BMPM) stands out from the rest of the algorithms. The authors developed their own limited alphabet of phonetic tokens, and thousands of contextual rules for encoding. The results are remarkable. Taking a look at our “shore”/“sure” example from earlier, we can see that BMPM knows that they can be pronounced the same way, despite the different spelling:
> encode("shore", approx)
- sDr
- sor
- sur
- sDrY
- sDri
- sorY
- sori
- surY
- suri
> encode("sure", approx)
- siur
- siurY
- siuri
- sur
- surY
- suri
The encodings for both words contain sur, surY, and suri. Without even needing a Levenshtein distance-based matching step, we can already see that these words might be pronounced the same way. Even better, BMPM encodings produce far fewer false positives than earlier encodings, making them much more useful for automated searches, like the kind that Storyteller does.
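That shared-encoding property suggests a very simple matching rule: two words might sound alike if their sets of encodings overlap. Here’s a sketch; the encode function is assumed to behave like the session above and return a set of phonetic tokens, since the exact API varies between BMPM implementations:

```typescript
// Assumed: an encode() that returns the set of plausible BMPM
// phonetic tokens for a word, like the output above. The signature
// is illustrative, not any particular library's API.
declare function encode(word: string, mode: "approx" | "exact"): Set<string>;

// Two words might be pronounced the same way if they share at least
// one phonetic encoding.
function mightSoundAlike(a: string, b: string): boolean {
  const encodingsA = encode(a, "approx");
  const encodingsB = encode(b, "approx");
  for (const token of encodingsA) {
    if (encodingsB.has(token)) return true;
  }
  return false;
}

// mightSoundAlike("shore", "sure") returns true, via the shared
// encodings "sur", "surY", and "suri".
```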
In the paper announcing BMPM, Morse walks through an example usage of the algorithm, using his own grandfather’s surname as the query. He talks about how, years ago, he found “Grandpa Louis’s” immigration ship record by manually searching through historical immigration records on microfilm readers. Then, when the Ellis Island database was put online, he used the Daitch-Mokotoff Soundex algorithm to search for names that sounded like “Matinsky”, his grandfather’s surname before he got married and changed it to Morse.
The new algorithm he developed with Beider enabled him to discover that his grandfather had two cousins, Abraham and Judel, who lived in Manhattan. Their names had been transcribed by immigration agents as “Meistinsky”, rather than “Matinsky”.
Beider and Morse, and Daitch and Mokotoff before them, represent something integral about what it means to be a Jew in the Diaspora. Something about peeling back layers, about searching for connection and history in a world that has tried quite hard to remove them from us.
This past Yom Kippur, my wife and I drove two hours to spend the afternoon at my aunt’s house, with my cousins. As the night drew on, conversation roamed from television shows and books to politics and philosophy. The circle grew as we touched on increasingly sensitive and challenging topics, drawing us in.
We didn’t agree, per se. We were engaging in debate as often as we were engaging in conversation. But we all love each other deeply, and the amount of care and restraint that went into how each person expressed their disagreement was palpable.
The conversation died down as we started cleaning up from dinner and packing away the outdoor furniture. One of my cousins, with whom I had been in disagreement most during the group debates, came up to me, and we started talking again. As we did, we realized that we had both been fixated on the same thought recently: to be Jewish is to be constantly reminded that you might not be safe. Through that lens, our disagreements started to look… different. This was a lens through which she and I would always be able to understand each other. It’s a lens through which I can understand almost every Jewish person that I meet, no matter how little our other perspectives overlap.
On Morse’s website, in addition to a BMPM index of the Ellis Island database, is an index of the Dachau Concentration Camp Records. Just another reminder. Another lens to see through.