The right way to store Chilean RUTs in your database

Storing Chilean RUTs as the user typed them (Postgres, MySQL, SQLite — same problem) — dots, dashes, lowercase k, leading zeros, whitespace — is how you get duplicate customers and broken joins. The same person arrives as 12.345.678-5, 12345678-5, and 12345678k, and a naïve equality check treats every one as a different identity. The bug is not in the query; it is in the decision to tolerate multiple representations. Normalize once at the write boundary and pick a canonical column layout up front. Everything downstream becomes simpler: lookups are point reads, uniqueness constraints work as expected, and no ad-hoc REPLACE() gymnastics appear in SQL to paper over the inconsistency.

How should I store a Chilean RUT in the database?#

Pick Layout B — a single column, cleaned to digits-plus-verifier (123456785) — for most product databases. Normalize at the write boundary with clean() from rut.ts, add a UNIQUE index, and a Postgres CHECK constraint for defense-in-depth. Reach for Layout C (body + verifier columns) only when the body-only query pattern is frequent.

Layout	What it stores	Pick when
A	Formatted (`12.345.678-5`)	Almost never — only when the column is purely for display
B	Cleaned digits (`123456785`)	Default for product DBs; one index, predictable joins
C	`body` + `verifier` columns	Analytics or joins against government datasets keyed on the body

The three layouts (and which to pick)#

Layout A — single column, formatted stores the RUT as it appears on a Chilean ID card: 12.345.678-5. It is human-readable, but a poor choice for joins and indexes. Every writer must format the value identically, or you end up with the same multi-representation problem you were trying to avoid.

Layout B — single column, cleaned stores only digits plus the verifier character, no punctuation: 123456785. This is the recommended layout for most applications. It is compact, sorts predictably, requires a single index, and integrates cleanly with any ORM or raw SQL equality lookup. The verifier is appended directly to the body digits, so the value is self-contained and re-formattable at the display layer whenever needed.

Layout C — two columns, body and verifier splits the RUT into a body column (e.g. 12345678) and a verifier column ('5' or 'K', one character). This layout pays off when you frequently query the body without the verifier — common in analytics pipelines and joins against government datasets that key on the body alone.

For most product databases, Layout B is the right starting point. Reach for Layout C only when the body-only query pattern is frequent enough to justify managing two columns.

Normalize on write with `clean()` and `decompose()`#

Before any RUT value reaches an INSERT or UPDATE, it must pass the strict acceptance gate. Call validate(input, { strict: true }) and reject the request if it returns false. This blocks malformed values, bad check digits, and placeholder patterns like 11.111.111-1 that are technically self-consistent but never correspond to a real identity.

clean(input, { throwOnError: false }) strips all punctuation and uppercases the verifier character, producing the canonical digit string for Layout B. decompose(input, { throwOnError: false }) returns { body, verifier } for Layout C. The throwOnError: false option returns null on failure instead of throwing, so you can branch on the result without a try/catch. When a path needs only one half — a body-only analytics key, say — getBody() and getVerifier() return just that part, each with the same { throwOnError: false } safe mode and a VerifierDigit-typed result for the verifier.

TypeScript

import { clean, decompose, validate } from "rut.ts";
 
export function canonicalizeRut(input: string): string | null {
  if (!validate(input, { strict: true })) return null;
  return clean(input, { throwOnError: false }); // e.g. "123456785"
}
 
export function splitRut(input: string) {
  if (!validate(input, { strict: true })) return null;
  return decompose(input, { throwOnError: false }); // { body, verifier }
}

Keep these helpers at the service boundary. The service layer owns the validation decision; the repository layer receives values that are already canonical.

Indexes and constraints#

Always add a UNIQUE index on the canonical RUT column. Duplicate customers are expensive to clean up, and the constraint is cheap insurance. On PostgreSQL, pair it with a CHECK constraint for defense-in-depth:

sql

ALTER TABLE customers
  ADD CONSTRAINT customers_rut_format
  CHECK (rut ~ '^[0-9]{7,8}[0-9K]$');

This catches accidental writes that bypass the application layer — direct psql sessions, migration scripts, or a service that skips the validation helper. The regex is simple and anchored; it will not affect write performance.

The K verifier must be stored uppercase. clean() already uppercases it unconditionally — do not use CITEXT to paper over a lowercase k. It costs comparison performance on every index scan and masks an application bug that should be fixed at the source.

For Layout C, index (body, verifier) as a composite unique index. If your database supports generated columns, a generated verifier = calculate_verifier(body) column prevents a body and verifier from different RUTs from being stored in the same row. In PostgreSQL, a trigger achieves the same effect when generated columns are not available.

Migrating existing dirty data#

If you are retrofitting normalization onto an existing table, do not apply the UNIQUE constraint first. Canonicalize the data, detect collisions, and resolve them before constraining.

A migration script that calls validate() and clean() over every existing row will flag values that do not survive strict validation — malformed entries that were never real RUTs. Once every surviving row has a canonical value, run a collision detection query:

sql

SELECT rut_canonical, COUNT(*) AS n
FROM customers
GROUP BY rut_canonical
HAVING COUNT(*) > 1;

A collision means two formatted strings like 12.345.678-5 and 12345678-5 canonicalize to the same identity. The library cannot decide which row to keep — that is a business decision. Pick a survivor based on data completeness or recency, merge dependent records (orders, addresses, documents) to the survivor's primary key, then delete the duplicates. Only after that step is the table safe to constrain.

Pitfalls#

Storing a lowercase k. Some input sources produce the verifier as lowercase k. clean() returns uppercase unconditionally — trust it and do not second-guess the casing in SQL after the fact.

Trimming punctuation with SQL REPLACE(). A REPLACE(rut, '.', '') chain in SQL does not handle unicode dashes (‐, –, —) or non-breaking whitespace. These characters are rare but not impossible in data that arrived from a copy-paste or an external API. Normalize in the application layer with clean(), which handles the full range of separators the library is designed for, then write the result to the database. rut.ts ships zero runtime dependencies, so wiring it into a repository layer adds nothing to your transitive package graph.

Joining a formatted column against a cleaned column. If one table stores 12.345.678-5 and another stores 123456785, an equality join returns zero rows. No error is raised and tests that use matching literal values in fixtures will not catch it. The silent failure can persist for months in production. The fix is to normalize both sides to the same canonical form before any join runs — or, better, to normalize at write time so the mismatch never exists in the first place. For comparisons that happen in application code rather than in SQL, equals() sidesteps the problem entirely: equals('12.345.678-5', '123456785') returns true by comparing the underlying values, with no manual normalization on either side.

Letting two sources of truth drift. A RUT that lives in the customers table and also in a denormalized event payload stored as JSON will drift if the payload was written before normalization was enforced. Establish a single canonical source and treat any other occurrence as derived. When a discrepancy appears, always trust the source of truth and correct the derived copy.

Echoing the RUT into your logs. Query logs, audit trails, and slow-query logs routinely capture column values — and a RUT is PII under Ley 19.628. Don't mirror the raw value into application logs that shadow the database. mask('12.345.678-5') → '12.***.***-5' gives you a correlatable token to trace a row without copying the identity into your log pipeline.