Build product feed training data for LLMs with barcode and MPN matching

Dylan Bertalli

19 Nov 2025 • 3 min read

LLM ranking and generation models learn faster when examples are unambiguous. Product feed training data is only as good as its identifiers. Use barcode and MPN matching to group identical items across merchants, then layer price, discount, and availability to produce clean labels that teach models to compare, rank, and generate with precision. Normalized identifiers remove guesswork when titles vary, which is the rule rather than the exception in affiliate feeds.

LLM here means large language model. Barcode refers to GTIN, UPC, EAN, or ISBN. MPN means manufacturer part number. These fields anchor product identity across networks and merchants even when names and descriptions diverge. A single normalized product result can consolidate offers from many sources.

Why identifiers matter for training data

Most merchants describe the same product differently. If you train on titles alone, your model will confuse variants or bundle listings with the base item. A barcode is the most reliable way to assert that two listings are the same physical product. Use MPN to pin down the manufacturer's exact part, and SKU for merchant-specific precision.

Normalization lets you compare like with like, then label price or availability outcomes with confidence. This is the foundation for model features such as best price suggestion, offer deduping, and explanation prompts.
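As a rough illustration of what normalization can look like in practice, the sketch below reduces UPC, EAN, and GTIN values to one zero-padded 14-digit key and rejects values whose check digit fails. The check digit uses the standard GS1 mod-10 calculation. The function names are illustrative, and the choice to reject rather than repair bad values is an assumption about your pipeline. ISBN-13 values pass through as EAN-13; ISBN-10 values are rejected here and would need converting first.

```python
import re
from typing import Optional


def gtin_check_digit(body: str) -> int:
    """GS1 mod-10 check digit for a digit string that excludes the check digit."""
    total = 0
    # Weights alternate 3, 1, 3, ... starting from the rightmost body digit.
    for i, ch in enumerate(reversed(body)):
        total += int(ch) * (3 if i % 2 == 0 else 1)
    return (10 - total % 10) % 10


def normalize_barcode(raw: str) -> Optional[str]:
    """Return a 14-digit key, or None if the value is not a valid GTIN."""
    digits = re.sub(r"\D", "", raw or "")  # drop spaces, hyphens, stray text
    if len(digits) not in (8, 12, 13, 14):
        return None
    if gtin_check_digit(digits[:-1]) != int(digits[-1]):
        return None
    return digits.zfill(14)
```

With this key, a UPC such as 036000291452 and the same code written with spaces or a leading zero land in the same group.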

Barcode matching across merchants

• Start from a product barcode to retrieve every matching listing across merchants. This collapses title variation and ensures you are training on the same physical item.
• For each matched offer, extract final price, regular price, discount, availability, and merchant ID. Label the lowest final price as best offer and others as alternatives.
• Optionally rank by discount or availability to create multiple supervisory signals.
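The steps above can be sketched in a few lines. This is a minimal example that assumes offers have already been retrieved and flattened into records; the `Offer` class and its field names are hypothetical and do not mirror any particular feed schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Offer:
    # Hypothetical field names chosen for this sketch.
    barcode: str
    merchant_id: str
    final_price: float
    regular_price: float
    in_stock: bool


def label_offers(offers):
    """Group offers by barcode and mark the lowest final price in each group."""
    groups = defaultdict(list)
    for offer in offers:
        groups[offer.barcode].append(offer)

    labeled = []
    for group in groups.values():
        # On a price tie, min() keeps the first offer it saw.
        best = min(group, key=lambda o: o.final_price)
        for offer in group:
            label = "best_offer" if offer is best else "alternative"
            labeled.append((offer, label))
    return labeled
```

Each barcode group yields exactly one best offer, so every training example has a single unambiguous target.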

Turn deduplication off to expose all matched offers when training a ranking head. Turn it on to generate a single representative entry for generation tasks where duplicates add noise.

ASIN to barcode mapping

When your seed is an Amazon ASIN, query by ASIN and barcode to locate the same product at other merchants. Label examples with fields that matter for decisioning, such as final price and in stock. This removes manual searching and creates reliable positive pairs for contrastive learning.
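One way to turn an ASIN seed into positive pairs, once the listings are in hand, is sketched below. It assumes each listing is a dict with `asin`, `barcode`, and `merchant_id` keys; those names and the sample values are hypothetical.

```python
def positive_pairs_for_asin(asin, listings):
    """Pair the seed ASIN listing with same-barcode listings at other merchants."""
    seeds = [l for l in listings if l.get("asin") == asin and l.get("barcode")]
    if not seeds:
        return []  # no seed, or the seed carries no barcode to match on
    seed = seeds[0]
    return [
        (seed, other)
        for other in listings
        if other is not seed
        and other.get("barcode") == seed["barcode"]
        and other["merchant_id"] != seed["merchant_id"]
    ]


# Hypothetical sample data for illustration only.
listings = [
    {"asin": "B000TEST01", "barcode": "00036000291452", "merchant_id": "amazon",
     "final_price": 19.99, "in_stock": True},
    {"asin": None, "barcode": "00036000291452", "merchant_id": "m2",
     "final_price": 18.50, "in_stock": True},
    {"asin": None, "barcode": "04006381333931", "merchant_id": "m3",
     "final_price": 9.00, "in_stock": False},
]
pairs = positive_pairs_for_asin("B000TEST01", listings)
```

Here `pairs` holds one (seed, match) tuple, and each side keeps its final price and in stock fields for labeling.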

Positive and negative set construction

Use barcode equality to define positive matches. Create hard negatives from near matches that share brand and model family but differ in MPN or bundle contents. This teaches the model to avoid lookalike traps that often mislead price comparison pages.
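A minimal sketch of that split follows. The `brand`, `model_family`, `mpn`, and `bundle` keys are assumed fields; in a real feed, model family in particular would probably have to be derived from the title or MPN.

```python
from itertools import combinations


def build_pairs(items):
    """Return (positives, hard_negatives) from a list of product dicts."""
    positives, hard_negatives = [], []
    for a, b in combinations(items, 2):
        if a["barcode"] == b["barcode"]:
            positives.append((a, b))
        elif (
            a["brand"] == b["brand"]
            and a["model_family"] == b["model_family"]
            and (a["mpn"] != b["mpn"] or a["bundle"] != b["bundle"])
        ):
            # Lookalikes: same brand and family, but a different part or bundle.
            hard_negatives.append((a, b))
        # Every other pair is skipped: it is either an easy negative or ambiguous.
    return positives, hard_negatives
```

Skipping the remaining pairs keeps the negative set small and difficult, which is usually the point of mining hard negatives.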

For discovery and brainstorming tasks where recall matters, start with the "any" field to cast a wide net, then refine with brand, price, and availability so your negatives are realistic, not random.

Pricing and availability labeling

Training a deal detector or best value summarizer requires more than raw price. Use regular price, final price, and discount to compute savings, and use on sale as a binary filter. Combine these with in stock to avoid teaching the model to recommend dead links.
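If a feed gives you only regular and final price, the remaining fields can be derived. The sketch below assumes savings is regular price minus final price, floored at zero, and that discount is savings as a percentage of regular price. The function and key names are illustrative.

```python
def price_features(regular_price, final_price, in_stock):
    """Derive savings, discount percent, and an on sale flag for one offer."""
    if regular_price <= 0:
        raise ValueError("regular_price must be positive")
    savings = max(regular_price - final_price, 0.0)
    return {
        "savings": round(savings, 2),
        "discount_pct": round(100 * savings / regular_price, 2),
        "on_sale": final_price < regular_price,
        # Out of stock offers stay in the data but should not be recommended.
        "recommendable": in_stock,
    }
```

For example, `price_features(100.0, 80.0, True)` gives savings of 20.0 and a discount of 20.0 percent.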

Example labeling recipe
• Sort: order offers by final price ascending, breaking ties by higher discount.
• Sale Discount: bucket offers by whether the discount is greater than 20 percent.
• Availability: include only in stock for positive examples. Use out of stock as counterexamples in ranking.
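The recipe can be expressed as a single pass over one product's offers. This is a sketch under assumed field names (`final_price`, `discount_pct`, `in_stock`), with the 20 percent threshold exposed as a parameter.

```python
def label_recipe(offers, discount_threshold=20.0):
    """Rank one product's offers and attach the labels from the recipe."""
    # Lowest final price first; on a tie, the higher discount wins.
    ranked = sorted(offers, key=lambda o: (o["final_price"], -o["discount_pct"]))
    rows = []
    for rank, offer in enumerate(ranked, start=1):
        rows.append({
            **offer,
            "rank": rank,
            "big_discount": offer["discount_pct"] > discount_threshold,
            "example_type": "positive" if offer["in_stock"] else "counterexample",
        })
    return rows
```

Out of stock offers keep their rank, so a ranking model sees them, but they are tagged as counterexamples rather than positives.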

Handling cross currency effects

When the same product appears in multiple regions, connect listings to one product ID, then normalize comparisons while preserving local currency context. This prevents models from conflating currency differences with price changes.
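One way to do this is to add a comparable price alongside the local one instead of overwriting it. The sketch below uses a hard-coded rate table purely as a placeholder; the rates are made up, and a real pipeline would pull dated rates from a source of its choosing.

```python
from decimal import Decimal

# Placeholder rates to USD, for illustration only. These are not real quotes.
RATES_TO_USD = {
    "USD": Decimal("1"),
    "EUR": Decimal("1.10"),
    "GBP": Decimal("1.25"),
}


def add_comparable_price(listing, rates=RATES_TO_USD):
    """Return a copy of the listing with a USD price added next to the local one."""
    rate = rates[listing["currency"]]  # KeyError flags an unsupported currency
    local = Decimal(str(listing["final_price"]))
    comparable = (local * rate).quantize(Decimal("0.01"))
    # The local price and currency stay untouched so regional context survives.
    return {**listing, "final_price_usd": comparable}
```

Recording the rate date with each example would let you separate a real price change from a currency move later.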

Implementation with Affiliate.com

Open the Query Builder, use the "any" field to collect candidates, then switch to barcode or MPN for exact matches. Layer discount and in stock, toggle deduplication based on your task, and click Share to capture a reproducible query link for your training pipeline.

Prefer bulk collection through the Product Search API when you need volume. Use comparison sets to produce focused evaluation suites, for example one product with multiple merchant offers where the label is the best final price. Verify prices in the live UI before shipping benchmarks.
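Once offers are collected, a comparison set can be serialized as one JSON line per product. The sketch below assumes the offers are already in memory as dicts and makes no API calls; the record layout and key names are hypothetical.

```python
import json


def comparison_example(product_id, offers):
    """Build one evaluation record whose label is the best final price."""
    best = min(offers, key=lambda o: o["final_price"])
    return {
        "product_id": product_id,
        "offers": offers,
        "label": {
            "best_merchant_id": best["merchant_id"],
            "best_final_price": best["final_price"],
        },
    }


def to_jsonl(examples):
    """Serialize records as JSON Lines, with sorted keys for stable diffs."""
    return "\n".join(json.dumps(e, sort_keys=True) for e in examples)
```

Stable key order makes it easier to see what changed when you regenerate a benchmark after re-verifying prices.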