Why regex alone can't validate a Chilean RUT

Nearly every "Chilean RUT validation" thread on the internet ends with a regex. Stack Overflow answers accumulate upvotes, blog posts replicate the pattern, and a copy-paste ecosystem of regular expressions circulates across codebases. The pattern feels correct — it rejects obvious garbage and accepts familiar-looking strings. The problem is that regex matches shape. A RUT is valid only when its verifier digit reconciles with the body under Modulo 11, and no regular expression can do arithmetic. The right tool is two tools: one for shape, one for semantics.

Can regex validate a Chilean RUT?

No. A regular expression can verify the shape of a RUT string — digit count, separator position, allowed verifier characters — but it cannot perform the Modulo 11 arithmetic that proves the verifier digit reconciles with the body. The correct production pipeline is isRutLike(value) for shape plus validate(value, { strict: true }) for semantics. Both ship in rut.ts, the zero-dependency, TypeScript-native library — no extra packages, no hand-rolled patterns.

What regex actually checks

The popular pattern for Chilean RUT validation is ^[0-9]+-[0-9kK]$. It checks that the string starts with one or more digits, contains a hyphen, and ends with a digit or the letter k/K. That is a reasonable first filter, but it has both false positives and false negatives that matter in practice.

Consider what it accepts that it should not. The string 12345678-9 matches cleanly (body digits, hyphen, digit verifier), but the correct verifier for body 12345678 under Modulo 11 is 5, not 9. The strings 99999999-9 and 00000000-0 also match, and both even satisfy the checksum; as a semantic matter they are meaningless, since a body of eight repeated digits is a placeholder that was never assigned as a real identity. And 123456789012-K passes the unbounded + quantifier: twelve body digits, a hyphen, a K. No length ceiling was enforced.

Now consider what the same pattern rejects that you do want to accept. A user who types 12.345.678-5 — the dot grouping that appears on every Chilean national ID card — gets rejected because the pattern does not allow dots. That is a real RUT in its canonical display format, and a regex without the formatted variant turns legitimate users into validation failures.
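Both failure modes are easy to reproduce by running the pattern against strings like the ones discussed above. The following is a minimal sketch in plain TypeScript with no library involved; `popularPattern` and `samples` are illustrative names, and the first sample uses body 12345678, whose Modulo 11 verifier works out to 5.

```typescript
// The widely copied pattern under discussion.
const popularPattern = /^[0-9]+-[0-9kK]$/;

// Each input is paired with what a correct validator should conclude.
const samples: Array<{ input: string; note: string }> = [
  { input: "12345678-9", note: "wrong verifier (Modulo 11 gives 5)" },
  { input: "123456789012-K", note: "twelve-digit body, not a RUT" },
  { input: "12.345.678-5", note: "valid RUT in dotted display format" },
];

for (const { input, note } of samples) {
  const outcome = popularPattern.test(input) ? "match" : "no match";
  console.log(`${input} -> ${outcome} (${note})`);
}
```

The first two inputs match even though neither should be accepted, and the third is rejected even though it is a legitimate RUT.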

Regex can be a denial-of-service vector

Unbounded repetition is not only a correctness bug; it can be a security one. The simple anchored pattern above backtracks only linearly, but the more elaborate variants that try to accept optional dots and spaces often nest or overlap quantifiers. A regex with ambiguous or nested repetition can be driven into catastrophic backtracking: a crafted input makes the engine explore exponentially many ways to match, and a single request can pin a CPU core for seconds, stalling every other request on the same process. This class of bug, ReDoS, has taken down production services built on a one-line "RUT regex" copied from a forum thread.

The defense is to bound the input before the matcher ever runs. rut.ts caps input length up front (a fixed 64-character security bound; a real RUT is ~9 characters), so an attacker can't hand the parser a megabyte of digits to chew on — validate('9'.repeat(1_000_000)) returns false immediately, without ever entering a matching loop. Treating RUT acceptance as a trust boundary, not just a string check, is the whole posture of the library; the security guide covers the threat model in full.
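The same idea can be sketched in a few lines for code that must run its own pattern. This is an illustration only, not how rut.ts is implemented: `looksLikeRut`, `MAX_INPUT_LENGTH`, and the 1-to-8-digit body bound are assumptions made for the example.

```typescript
// Hypothetical helper for illustration; rut.ts applies its own bound internally.
const MAX_INPUT_LENGTH = 64; // mirrors the 64-character bound described above

// Bounded shape pattern: 1 to 8 body digits (an assumption), a hyphen, one verifier.
const boundedShape = /^[0-9]{1,8}-[0-9kK]$/;

function looksLikeRut(input: string): boolean {
  // The length check runs first, so the regex never sees oversized input.
  if (input.length > MAX_INPUT_LENGTH) return false;
  return boundedShape.test(input);
}

console.log(looksLikeRut("9".repeat(1000000))); // false, rejected by the length check
console.log(looksLikeRut("12345678-5"));        // true
```

Checking the length is a constant-time operation, so the cost of rejecting a megabyte of digits does not depend on how the pattern behaves.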

Shape vs. semantics

These two failure modes point at a deeper distinction. Shape is "is this string in roughly the right form?" — bounded length, allowed character set, expected separators in plausible positions. Semantics is "does the verifier digit match the body under Modulo 11?" — and answering that requires performing the weighted sum, applying the modular reduction, and mapping the result to the verifier character. Those are arithmetic operations. A regular expression engine cannot execute them.
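As a concrete illustration of that arithmetic, here is a minimal sketch of the standard Modulo 11 computation, with weights 2 through 7 cycling from the rightmost digit. `computeVerifier` is a hypothetical helper written for illustration, not the rut.ts implementation, and it assumes the body has already been reduced to plain digits. For body 12345678 the weighted sum is 8·2 + 7·3 + 6·4 + 5·5 + 4·6 + 3·7 + 2·2 + 1·3 = 138; 138 mod 11 is 6, and 11 − 6 gives a verifier of 5.

```typescript
// Illustrative helper, not the rut.ts implementation.
// Assumes `body` contains only the digits of the RUT body (no dots, no verifier).
function computeVerifier(body: string): string {
  let sum = 0;
  let weight = 2;
  // Walk the body right to left, cycling the weights 2, 3, 4, 5, 6, 7.
  for (let i = body.length - 1; i >= 0; i--) {
    sum += Number(body.charAt(i)) * weight;
    weight = weight === 7 ? 2 : weight + 1;
  }
  // Modular reduction, then map the result to the verifier character.
  const result = 11 - (sum % 11);
  if (result === 11) return "0";
  if (result === 10) return "K";
  return String(result);
}

console.log(computeVerifier("12345678")); // "5"
console.log(computeVerifier("12345670")); // "K"
```

The loop, the multiplication, and the modular reduction are exactly the steps a pattern matcher has no way to express.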

isRutLike() from rut.ts answers the shape question. It checks that the input looks like a RUT — the right kind of characters, a plausible length, a separator in a position that makes sense — without executing the checksum computation. It is fast and appropriate anywhere you need a cheap first filter. validate() answers the semantic question: it normalizes the input, runs the full Modulo 11 computation, verifies the check digit, and — when called with { strict: true } — rejects repeated-digit placeholder patterns that satisfy the algorithm but were never real identities. That arithmetic is the part you don't want to re-implement: the Modulo 11 path in rut.ts is covered by a 523-case suite with property-based tests (fast-check) and Stryker mutation testing at ~90% on that exact hot path — the kind of coverage a hand-rolled checker almost never gets.

Use them together, but do not conflate them: a shape failure and a semantic failure are different problems, and they deserve different error messages for the user.

The right two-step

In any system that processes RUT strings from the outside world, a two-step pattern keeps things fast and correct. Use isRutLike() as a fast short-circuit on untrusted input streams: CSV ingestion, search bar filters, log scanners. Discarding obvious non-RUTs before running arithmetic is cheaper at scale.

validate(value, { strict: true }) belongs at every trust boundary — before an API endpoint stores a RUT as a customer identity, before a database write, before an authentication flow accepts a national ID as a credential. Those are the moments where the full semantic check is mandatory.

```typescript
import { isRutLike, isValidRut } from "rut.ts";
import type { Rut } from "rut.ts";

export function acceptRut(input: unknown): Rut | null {
  if (typeof input !== "string") return null;
  if (!isRutLike(input)) return null;                    // cheap shape reject
  if (!isValidRut(input, { strict: true })) return null; // authoritative + narrows
  return input;                                          // input is now typed `Rut`
}
```

The typeof check comes first: rejecting non-strings before any parsing is a baseline defensive habit. The shape check comes second because it is cheap and eliminates most garbage input. The authoritative check comes last because it is the only one that guarantees correctness. And because that last check is isValidRut, a type guard rather than a plain boolean function, the value that survives all three checks is typed as the branded Rut, not a bare string. The function returns proof that the value was validated, which the caller can carry through the type system instead of re-checking.

When regex is fine

Regex is the right tool for extracting RUT-shaped substrings from free-form text: log lines, support tickets, CSV cells with chatty headers. A regex finds the candidates. You then pipe each candidate through validate() to filter false positives — strings that matched the pattern but carry a wrong check digit. The regex is the candidate extractor, not the validator.
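The extract-then-validate split might look like the sketch below. To keep the example self-contained, `hasValidVerifier` is a local stand-in for the checksum step; in a real pipeline that role belongs to validate() from rut.ts. `extractRuts` and the candidate pattern are likewise assumptions made for illustration.

```typescript
// Local stand-in for the semantic check (a real pipeline would call the library).
function hasValidVerifier(body: string, verifier: string): boolean {
  let sum = 0;
  let weight = 2;
  for (let i = body.length - 1; i >= 0; i--) {
    sum += Number(body.charAt(i)) * weight;
    weight = weight === 7 ? 2 : weight + 1;
  }
  const result = 11 - (sum % 11);
  const expected = result === 11 ? "0" : result === 10 ? "K" : String(result);
  return verifier.toUpperCase() === expected;
}

// The regex only finds candidates; the checksum decides which ones are kept.
function extractRuts(text: string): string[] {
  // Bounded on purpose: 7 or 8 body digits, optional dot grouping, one verifier.
  const candidatePattern = /\b(\d{1,2}(?:\.?\d{3}){2})-([\dkK])\b/g;
  const found: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = candidatePattern.exec(text)) !== null) {
    const body = match[1].replace(/\./g, "");
    if (hasValidVerifier(body, match[2])) found.push(match[0]);
  }
  return found;
}

const ticket =
  "Ticket 4471: customer 12.345.678-5 called; an earlier note says 12.345.678-9. Second account: 12345670-K.";
console.log(extractRuts(ticket)); // keeps 12.345.678-5 and 12345670-K, drops 12.345.678-9
```

The pattern matches all three RUT-shaped substrings in the ticket; only the checksum step removes the one with the wrong verifier.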

Quick visual highlighting in a code editor is another appropriate use. Syntax highlighting is not a runtime trust decision, and the user is looking at the result — the editor is not making an identity claim.

Stripping formatting before normalization, however, is a job for clean(), not a regex of your own. A hand-rolled pattern will miss unicode dashes — en-dashes, em-dashes, figure dashes — that appear in real data copied from documents or received from external APIs. clean() handles the full range of separator variants; your regex will not.
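The gap is easy to reproduce. In this minimal sketch, `naiveStrip` is a hypothetical stand-in for the kind of hand-rolled cleanup described above; it strips dots and ASCII hyphens and nothing else.

```typescript
// A typical hand-rolled cleanup: remove dots and ASCII hyphens only.
function naiveStrip(value: string): string {
  return value.replace(/[.\-]/g, "");
}

const asciiHyphen = "12.345.678-5";
const enDash = "12.345.678\u20135"; // U+2013 EN DASH, common in text pasted from documents

console.log(naiveStrip(asciiHyphen)); // "123456785"
console.log(naiveStrip(enDash));      // "12345678\u20135": the en dash survives
```

The second result still contains a separator, so any later step that expects digits plus a verifier receives a malformed string.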

Pitfalls

Hand-rolled RUT regexes fail in predictable ways. The most common: case sensitivity on the verifier. Some patterns accept lowercase k but not uppercase K, or the reverse. Real users type both. Patterns that hard-code one case reject legitimate RUTs silently.

A second pitfall is formatting assumptions. A pattern that requires dot grouping — XX.XXX.XXX-Y — rejects the equivalent 12345678-5 typed without dots. A pattern that rejects the formatted form blocks users pasting from an official document. Both representations refer to the same identity.

A third is the unbounded body quantifier. Patterns that use [0-9]+ without a length ceiling accept bodies of any length. A 20-digit body is not a RUT, but the regex will not say so. Validators with no length cap have reached production and caused downstream confusion when an oversized string passed validation and failed at the database column's character limit.

Finally, watch for untrimmed input. An anchored pattern (^...$) applied to a string with leading or trailing whitespace fails even when the RUT itself is correct. The failure is confusing because the value looks fine in a regex tester. Trim before you match, or use isRutLike(), which handles that internally.
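Two of these pitfalls, verifier case and untrimmed input, can be shown in a short sketch. The patterns below are illustrative examples of hand-rolled regexes, not anything shipped by rut.ts.

```typescript
// Pitfall: the verifier class hard-codes lowercase k.
const lowercaseOnly = /^[0-9]+-[0-9k]$/;
console.log(lowercaseOnly.test("12345670-k")); // true
console.log(lowercaseOnly.test("12345670-K")); // false, same RUT typed in uppercase

// Pitfall: an anchored pattern applied to untrimmed input.
const anchored = /^[0-9]+-[0-9kK]$/;
const pasted = " 12345678-5 ";
console.log(anchored.test(pasted));        // false, the spaces break the anchors
console.log(anchored.test(pasted.trim())); // true
```

Neither rejection says anything about the RUT itself; both are artifacts of how the pattern was written.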