Extracting merchant names from bank-transaction strings

How we extracted merchant names from a major bank's transaction feed with a character-level CNN — why it beat RNN baselines, and what we'd build today.


A few years ago, one of the largest retail banks we work with came to us with what sounded like a small problem: take their post-terminal transaction feed and extract clean merchant names from the raw strings the card network handed them.

The strings looked like this:

CASTLE ACADEMY LLC 303-663-7300 CO
APL* ITUNES.COM/BILL 866-712-7753 CA
TARGET 00014423 WATERTOWN MA
USAA.COM PMT - THANK YOU SAN ANTONIO TX

The bank’s customers were opening their banking app and asking “where did I spend?” — and the answer the app could give was the whole raw string. It looked like data exhaust. The product team wanted “Castle Academy”, “iTunes”, “Target” — the actual merchant a person would recognize on their statement.

Why “just regex it” doesn’t work

If you stare at one of those strings, the rules feel obvious. The merchant is at the start, before the phone number. Strip city and state. Strip transaction IDs. Done.

That works for about 70% of strings. But the bank had thousands of acquirers feeding into the network, each one with its own slightly different convention:

  • Some put the merchant after a prefix code: APL* ITUNES.COM/BILL
  • Some embed numbers inside the merchant name: STORE #1023 CHILDREN'S PLACE
  • Some use abbreviations that aren’t merchant names at all: PMT - THANK YOU
  • Some have legitimate merchants whose names are numeric or all-symbols
  • Many strings mix UTF-8, Latin-1, and “we tried our best”

Every regex we wrote against a sample broke on next week’s batch. The bank’s existing in-house approach was a multi-thousand-line ruleset maintained by hand. It took an analyst the better part of a week to add support for each new acquirer, and accuracy was still well below where the product team wanted it.

Why character-level — and not word-level NER

The natural first move for an NLP engineer here is named-entity recognition: tokenize the string and tag each token as B-MERCHANT, I-MERCHANT, or O. We tried it; the results were poor.

The problem is that a “token” doesn’t really exist in this data. 00014423, APL*, ITUNES.COM/BILL — these aren’t tokens by any whitespace or punctuation rule that generalizes. The word-level RNN baselines we trained converged but plateaued well below where we needed to be.

So we went one level down: model the string as a sequence of characters, and predict two indices — start and end of the merchant span — directly.
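Concretely, a training example is just the raw string plus two character indices. A tiny illustration (the indices here are ours, chosen for this example):

```python
# one labeled example: raw transaction string + character span of the merchant
transaction = "APL* ITUNES.COM/BILL 866-712-7753 CA"
start, end = 5, 20

# the label picks out the merchant substring directly
assert transaction[start:end] == "ITUNES.COM/BILL"

# the model's two heads each emit a distribution over positions
# 0..len(transaction)-1; the argmax of each gives the predicted (start, end)
```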

The architecture

The model is a fully-convolutional character-level network with two output heads, one for the start index and one for the end.

input string  →  char embedding (8-d, length 300, zero-padded)

              ┌─ Conv1d(k=1) ┐
              ├─ Conv1d(k=3) ┤
              ├─ Conv1d(k=5) ┤   →  concat  →  BN  →  ReLU
              └─ Conv1d(k=7) ┘

              [DilatedConvBlock]  ×  6, residual concat to previous block

              concat outputs of all 6 blocks

              ┌─ Conv1d → softmax over positions  →  start
              └─ Conv1d → softmax over positions  →  end

Two ideas drove the design.

Multi-kernel input stack. Real merchant patterns live at different character scales. Brand prefixes like APL* or SQ * are 2–4 characters; words like STARBUCKS are 9. Running four parallel Conv1d branches with kernel sizes 1, 3, 5, 7 lets every layer see all those scales at once. We concatenate the four feature maps, normalize, ReLU, and pass the stacked tensor up.

Dilated residuals. Each subsequent block is a dilated Conv1d whose output is concatenated with the previous block’s input — a DenseNet-style stack. As you go deeper, each position sees more of the string without losing local detail.

The output head is a pair of Conv1d layers, each reducing to a single channel; each result is flattened and softmaxed across the position axis. The argmax of each distribution gives the predicted (start, end).
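As a sketch, this is the shape of the network in modern PyTorch. Module names and channel sizes are illustrative, not the production code's, and we use a learned character embedding here rather than the original 8-bit encoding:

```python
import torch
import torch.nn as nn

class MultiKernelStem(nn.Module):
    """Four parallel Conv1d branches (k = 1, 3, 5, 7), concatenated."""
    def __init__(self, in_ch, branch_ch):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(in_ch, branch_ch, k, padding=k // 2) for k in (1, 3, 5, 7)
        )
        self.bn = nn.BatchNorm1d(4 * branch_ch)

    def forward(self, x):                       # x: (batch, in_ch, length)
        out = torch.cat([b(x) for b in self.branches], dim=1)
        return torch.relu(self.bn(out))

class DilatedBlock(nn.Module):
    """Dilated Conv1d whose output is concatenated with its input, DenseNet-style."""
    def __init__(self, in_ch, growth, dilation=2):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, growth, 3, padding=dilation, dilation=dilation)
        self.bn = nn.BatchNorm1d(growth)

    def forward(self, x):
        return torch.cat([x, torch.relu(self.bn(self.conv(x)))], dim=1)

class MerchantSpanCNN(nn.Module):
    def __init__(self, vocab_size=128, emb_dim=8, branch_ch=16, growth=32, n_blocks=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.stem = MultiKernelStem(emb_dim, branch_ch)
        ch = 4 * branch_ch
        blocks = []
        for _ in range(n_blocks):               # channels grow by `growth` per block
            blocks.append(DilatedBlock(ch, growth))
            ch += growth
        self.blocks = nn.Sequential(*blocks)
        self.start_head = nn.Conv1d(ch, 1, 1)   # one logit per character position
        self.end_head = nn.Conv1d(ch, 1, 1)

    def forward(self, char_ids):                # char_ids: (batch, length)
        x = self.embed(char_ids).transpose(1, 2)
        x = self.blocks(self.stem(x))
        # raw position logits; softmax/argmax over the length axis happens downstream
        return self.start_head(x).squeeze(1), self.end_head(x).squeeze(1)
```

Every convolution is padded so the length axis is preserved end to end, which is what lets the two heads score every character position.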

The trap: positional bias

When we trained the first version, accuracy looked great. Suspiciously great. We dug in.

Roughly 74% of training transactions had the merchant at position 0. The model had learned, with very high confidence, to always predict start = 0 and end ≈ length-of-first-word. On a naïve held-out set it scored well — by exploiting the prior, not by understanding anything.

We needed augmentation that broke the positional prior without breaking the labels. Three operations did the work:

  • Slide. Keep the merchant substring intact but move it to every other word boundary in the same string. One transaction becomes N transactions, each with a different (start, end).
  • Swap. Take two transactions, swap their merchant substrings between them. Both labels move with their merchant.
  • Pad. Splice another transaction’s non-merchant content into the start of this one’s, pushing the merchant deeper into the string.

In code, this lives in build_dataset.py:

# slide: tmp_transaction is the source string with the merchant removed,
# and idx is a word-boundary offset where the merchant is re-inserted
new_transaction = " ".join(
    [tmp_transaction[:idx], transaction[start:end], tmp_transaction[idx:]]
)

# swap: splice transaction_2's merchant span into transaction_1's slot;
# the new label is (start_1, start_1 + (end_2 - start_2))
transaction_3 = (
    transaction_1[:start_1] + transaction_2[start_2:end_2] + transaction_1[end_1:]
)

After augmentation, the position-0 fraction dropped from 74% to roughly 20%. Validation accuracy on the augmented data initially fell — exactly what we wanted, because the model now had to actually read the string instead of leaning on the prior. Once we trained through it, span accuracy on the un-augmented held-out test set was meaningfully higher than the un-augmented baseline.

What worked, in production

A few takeaways from running this in the bank’s pipeline:

  • Character-level CNNs are fast. Inference on a single CPU core was sub-millisecond per transaction. The bank could embed the model inline in their nightly ETL without a GPU.
  • The augmentation was the model. Roughly half of our final accuracy came from the data work, not the architecture. We now apply this rule routinely in client engagements: before you reach for a bigger model, check whether your data is teaching the model to cheat.
  • The CNN beat the RNN baselines we tried. A character-level convolutional stack with this kind of dilation pattern was a better fit for the data shape than any word-level recurrent model we trained — and it was much cheaper to serve.

What we’d do differently in 2026

The public version of the code is frozen on 2018-era PyTorch (torch==0.4.1.post2, complete with Variable, loss.data[0], and a hand-rolled iterator). Things we’d change today:

Modeling.

  • Replace the 8-bit binary character encoding (bin(ord(c)) truncated to 8 bits) with a learned nn.Embedding over a fixed vocabulary. The current scheme silently truncates ord(c) >= 256 and gives correlated representations to unrelated characters. Cheap accuracy.
  • Move dilation from a fixed 2 per block to the standard [1, 2, 4, 8, 16, 32] exponential schedule for proper exponential receptive-field growth.
  • Strip the in-model nn.Softmax and train against CrossEntropyLoss on raw logits — the current code double-softmaxes.
  • Add a biaffine span head that scores every (i, j) pair with j > i directly, instead of two independent position softmaxes that can predict end < start.
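A minimal sketch of what that span head could look like (the class and parameter names are ours; `h` stands in for the per-position feature map coming out of the conv stack):

```python
import torch
import torch.nn as nn

class BiaffineSpanHead(nn.Module):
    """Score every candidate span (i, j); spans with j < i are masked out."""
    def __init__(self, in_dim, head_dim=64):
        super().__init__()
        self.start_proj = nn.Linear(in_dim, head_dim)
        self.end_proj = nn.Linear(in_dim, head_dim)
        self.W = nn.Parameter(torch.empty(head_dim, head_dim))
        nn.init.xavier_uniform_(self.W)

    def forward(self, h):                       # h: (batch, length, in_dim)
        s = self.start_proj(h)                  # (batch, length, head_dim)
        e = self.end_proj(h)                    # (batch, length, head_dim)
        # bilinear score for every (start i, end j) pair
        scores = torch.einsum("bid,de,bje->bij", s, self.W, e)
        L = h.size(1)
        mask = torch.ones(L, L, device=h.device).triu().bool()
        return scores.masked_fill(~mask, float("-inf"))
```

Training flattens the scores to (batch, L·L) and applies cross-entropy against the index start·L + end; decoding is a single argmax, and end ≥ start holds by construction.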

A modern baseline. A fine-tuned CANINE — Google’s character-level BERT — would likely beat the CNN by several points of F1 with a few thousand labeled rows. We’d ship it as the accuracy ceiling and keep the CNN as the fast-inference baseline behind a --model flag.

Code and ops. A real train.py (we never shipped one publicly), pyproject.toml instead of an unbuildable 2018 requirements.txt, AMP mixed precision, W&B for experiment tracking, and a Dockerfile so a new engineer can docker run and reproduce a model in 90 seconds.
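A minimal pyproject.toml for that setup might look like the following (the package name and version pins are illustrative, not tested against the frozen code):

```toml
[project]
name = "merchant-span"
version = "0.1.0"
requires-python = ">=3.10"
dependencies = [
    "torch>=2.2",
    "wandb>=0.16",
]

[project.scripts]
train = "merchant_span.train:main"

[build-system]
requires = ["setuptools>=68"]
build-backend = "setuptools.build_meta"
```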

Eval discipline. The augmentation can leak the same merchant string into both train and val if you split by row index. The right split is by merchant identity — partition merchants first, then route their transactions. We’ve watched more than one team get burned by this since.
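A sketch of that split, assuming the data is a list of (merchant, transaction) pairs. Hashing the merchant identity, rather than the row, keeps the partition deterministic and guarantees augmented copies of the same merchant never straddle the boundary:

```python
import hashlib

def split_by_merchant(records, val_fraction=0.1):
    """Route all of a merchant's transactions to the same side of the split.

    records: iterable of (merchant, transaction) pairs.
    """
    train, val = [], []
    for merchant, transaction in records:
        # hash the merchant identity so the assignment is stable across runs
        h = int(hashlib.sha1(merchant.encode("utf-8")).hexdigest(), 16)
        bucket = (h % 1000) / 1000
        (val if bucket < val_fraction else train).append((merchant, transaction))
    return train, val
```

With this in place, no merchant string can appear on both sides of the split, no matter how many augmented variants of it exist.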


The core insight survived: when your input is dirty, character-level convolution with aggressive label-preserving augmentation is a strong, cheap, debuggable baseline. We’ve reached for variants of this design several times since the bank engagement, and it keeps earning its place.

If you’re staring at a similar problem — fragmentary structured strings where the entity boundary is what matters — start here. The transformer can come later.