
SKU Deduplication with Sentence Embeddings, Beyond Levenshtein

Sharvari Joshi | Updated May 30, 2026 | 2 min read

Duplicate SKUs are quiet poison for inventory accuracy: the same physical item under two codes splits stock, breaks reorder math, and inflates the catalogue. The usual dedup tool, string-edit distance, catches typos but misses the duplicates that matter most: the ones described differently. Sentence embeddings close that gap.

Why Levenshtein is not enough

Levenshtein (edit distance) counts the single-character insertions, deletions, and substitutions needed to turn one string into another. It catches "Widget Blue" vs "Widjet Blue", which differ by one character. It misses "16oz Coffee Mug" vs "Coffee Mug, 16 ounce" vs "Mug - coffee - 16 oz", which are the same product written three ways. The words are reordered and the unit is spelled differently each time, so the strings line up poorly character by character. Edit distance scores them as unrelated, and the duplicates survive.
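To make the gap concrete, here is a minimal sketch of the standard dynamic-programming edit distance, run on the example strings above. The function name `levenshtein` is just an illustrative choice, not something from a particular library.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))          # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute (free when the characters match)
            ))
        prev = curr
    return prev[-1]


typo = levenshtein("Widget Blue", "Widjet Blue")
reworded = levenshtein("16oz Coffee Mug", "Coffee Mug, 16 ounce")
print(typo)       # 1: a single substituted character
print(reworded)   # well above 1, even though both strings name the same mug
```

Any edit-distance cutoff loose enough to pair the two mug descriptions would also pair many unrelated short product names, which is why tuning the cutoff alone does not rescue the approach.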

What embeddings do differently

A sentence-embedding model maps each item description to a vector positioned by meaning, not spelling. Descriptions that mean the same thing land near each other in vector space, so you can measure how alike two items are by the cosine similarity of their vectors:

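# illustrative: `model` is any sentence-embedding model exposing encode()
import numpy as np
emb = np.asarray(model.encode(item_descriptions))        # one vector per SKU
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)   # unit length, so dot product = cosine similarity
sim = float(emb[i] @ emb[j])                             # close to 1.0 = near-identical meaning
if sim > THRESHOLD:                                      # candidate duplicate, not an automatic merge
    queue_for_review(i, j)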

Now "16oz Coffee Mug" and "Coffee Mug, 16 ounce" score as near-duplicates despite being worded differently.
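For readers who want to see the arithmetic, the following is a small dependency-free sketch of cosine similarity. The three-dimensional vectors are made-up stand-ins chosen for illustration; a real sentence-embedding model would produce vectors with hundreds of dimensions, and its actual scores will differ.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0                      # treat an all-zero vector as matching nothing
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, NOT the output of any real model
toy_embeddings = {
    "16oz Coffee Mug": [0.90, 0.10, 0.40],
    "Coffee Mug, 16 ounce": [0.88, 0.12, 0.42],
    "Garden Hose 50ft": [0.05, 0.95, 0.10],
}

same_item = cosine_similarity(
    toy_embeddings["16oz Coffee Mug"], toy_embeddings["Coffee Mug, 16 ounce"]
)
different_item = cosine_similarity(
    toy_embeddings["16oz Coffee Mug"], toy_embeddings["Garden Hose 50ft"]
)
print(round(same_item, 3), round(different_item, 3))
```

Vectors pointing in nearly the same direction score close to 1.0 regardless of their length, which is the property that lets differently worded descriptions of one item match.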

The threshold trap

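The catch is the threshold. Set it too high and you miss real duplicates; too low and you merge genuinely different items (a 16oz and a 12oz mug are semantically close but are different SKUs). Two safeguards matter: blend embedding similarity with hard attributes such as size, so that a 16oz and a 12oz mug are never paired on text alone, and treat every pair above the threshold as a candidate for review, not as an automatic merge.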
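One way to keep the 16oz and 12oz mugs apart is to check a hard attribute before a pair is accepted as a candidate. The sketch below assumes the size can be parsed from the description with a simple pattern; `SIZE_PATTERN`, `size_in_oz`, `is_candidate`, and the 0.85 default threshold are all hypothetical choices for illustration, and a real catalogue would more likely read size from a structured attribute field.

```python
import re

# Hypothetical pattern: a number followed by "oz", "ounce", or "ounces"
SIZE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(?:oz|ounces?)\b", re.IGNORECASE)


def size_in_oz(description):
    """Return the first size in ounces found in the text, or None if absent."""
    match = SIZE_PATTERN.search(description)
    return float(match.group(1)) if match else None


def is_candidate(desc_a, desc_b, similarity, threshold=0.85):
    """Accept a pair only if it is similar enough AND its sizes do not conflict."""
    if similarity < threshold:
        return False
    size_a, size_b = size_in_oz(desc_a), size_in_oz(desc_b)
    if size_a is not None and size_b is not None and size_a != size_b:
        return False                    # hard attribute disagrees: different SKUs
    return True


# The similarity values here are invented inputs, not model output
print(is_candidate("16oz Coffee Mug", "Coffee Mug, 16 ounce", similarity=0.93))
print(is_candidate("16oz Coffee Mug", "12oz Coffee Mug", similarity=0.97))
```

The attribute check acts as a veto and never as a reason to merge: a pair with matching sizes still has to clear the similarity threshold and then the human review.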

The human step is the point

This is a Path-B-style workflow: the model proposes, a person disposes. The embedding pass turns an impossible manual job (compare every SKU to every other) into a short ranked list of likely duplicates for a data steward to confirm. That review is not overhead; it is what keeps you from collapsing two real products into one and corrupting the very master data you are trying to clean.
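The propose-then-confirm step can be sketched as a ranked queue plus an explicit decision record. Everything named here (`pair_count`, `review_queue`, `apply_decisions`, the SKU codes, the scores, and the 0.85 threshold) is a hypothetical illustration under the assumption that pair scores have already been computed.

```python
def pair_count(n):
    """Number of distinct SKU pairs a full manual comparison would need."""
    return n * (n - 1) // 2


def review_queue(scored_pairs, threshold, limit):
    """Keep pairs at or above the threshold, highest similarity first."""
    above = [(score, a, b) for (a, b), score in scored_pairs.items() if score >= threshold]
    above.sort(reverse=True)
    return above[:limit]


def apply_decisions(queue, decisions):
    """Return only the pairs a steward explicitly confirmed; nothing merges by default."""
    return [(a, b) for _, a, b in queue if decisions.get((a, b)) == "confirm"]


# Invented scores for illustration only
scored_pairs = {
    ("SKU-001", "SKU-002"): 0.97,
    ("SKU-003", "SKU-004"): 0.91,
    ("SKU-001", "SKU-005"): 0.42,
}
queue = review_queue(scored_pairs, threshold=0.85, limit=10)

# The steward's verdicts; an unreviewed pair is simply left alone
decisions = {("SKU-001", "SKU-002"): "confirm", ("SKU-003", "SKU-004"): "reject"}
merges = apply_decisions(queue, decisions)

print(pair_count(10_000))   # 49995000 pairs if every SKU were compared by hand
print(merges)               # [('SKU-001', 'SKU-002')]
```

Defaulting to "no merge" for anything unreviewed or rejected means a wrong score can cost the steward a few seconds, but cannot on its own collapse two real products into one.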

The takeaway

Use edit distance for typos, but use sentence embeddings to find the semantic duplicates that actually fragment your stock. Blend embedding similarity with hard attributes, treat the output as candidates, and keep a human in the merge loop. The result is a cleaner catalogue and inventory math that stops splitting one item across two codes.


Implementing this at your scale?

The walkthrough above comes from production work. AvanSaber’s inventory practice has implemented variations of this pattern across multiple customer engagements.

If you are building this and want expert review of your design, or would rather have the team that built this build yours, book a discovery conversation or describe your situation at [email protected].
