Coordinately

Address Standardization

Address standardization is the cleaning step that turns messy human input into a canonical form a geocoder can look up. USPS Publication 28 defines US conventions: street-suffix abbreviations, directional prefixes, secondary-unit handling, ZIP+4 normalization. The Universal Postal Union defines international conventions. The article covers the standardization rules, the libpostal library, common pitfalls (over-aggressive normalization, multilingual edge cases), and why standardization is upstream of every geocoder.

By . Published . Last updated .

The /learn/how-geocoding-works article identified standardization as the second step in the geocoding pipeline (after parsing, before lookup). This article goes deeper on standardization specifically — the rules, the standards, the implementation patterns, and the edge cases.

What standardization does

The job: transform an arbitrary address string into a canonical form. Examples:

Input:  "123 main st apt 4b, springfield il 62701"
Output: "123 MAIN ST APT 4B, SPRINGFIELD IL 62701-1234"

Input:  "1600 Penn Ave NW Washington DC"
Output: "1600 PENNSYLVANIA AVE NW, WASHINGTON DC 20500"

Input:  "350 5 av nyc"
Output: "350 5TH AVE, NEW YORK NY 10118"

Standardization combines several operations:

  • Casing: normalise to a defined convention (USPS uses uppercase; other systems use title case).
  • Abbreviation expansion / contraction: USPS standard is “Avenue → AVE”; ISO 19160-1 prefers full spelling.
  • Directional placement: “NW Pennsylvania Ave” → “Pennsylvania Ave NW” (per USPS convention).
  • Secondary-unit handling: “Apt 4B” → “APT 4B”; alternative forms (“Unit 4B”) normalised to a consistent designator.
  • State abbreviation: “Illinois” → “IL”.
  • ZIP+4 lookup: complete the postcode with the +4 extension where derivable.

The result is a string that can be reliably looked up in any USPS-CASS-certified database or fed into any major geocoder.

USPS Publication 28

USPS Pub 28 is the authoritative US standardization reference. Key sections:

Street suffix abbreviations

USPS Pub 28 Appendix C2 lists the canonical abbreviations for every street suffix. A selection:

| Long form | USPS abbreviation | | ------------------ | ----------------- | | Avenue | AVE | | Boulevard | BLVD | | Court | CT | | Drive | DR | | Highway | HWY | | Lane | LN | | Parkway | PKWY | | Place | PL | | Road | RD | | Square | SQ | | Street | ST | | Terrace | TER |

There are over 200 street suffixes in the official table. Standardizers map any input variation (Av, Avn, Avenue, AVE) to the canonical abbreviation.

Directional prefixes and suffixes

N    NORTH
S    SOUTH
E    EAST
W    WEST
NE   NORTHEAST
NW   NORTHWEST
SE   SOUTHEAST
SW   SOUTHWEST

Directionals can be prefix (before the street name) or suffix (after). The USPS convention varies by address: “1600 Pennsylvania Avenue NW” has NW as a suffix; “100 N Main St” has N as a prefix. Standardizers handle both placements.

Secondary-unit designators

USPS Pub 28 Appendix C1 lists secondary-unit designators:

APT       Apartment
STE       Suite
UNIT      Unit
BLDG      Building
FL        Floor
RM        Room
PH        Penthouse

For dwellings with multiple units, the secondary designator identifies the specific unit. USPS convention places the secondary after the street: “123 MAIN ST APT 4B”.

State abbreviations

Two-letter USPS state codes (e.g., IL, NY, CA) per ISO 3166-2:US. Used in addresses, ZIP-code databases, and shipping-rate tables.

ZIP+4 normalization

ZIP codes can be just 5 digits or the full ZIP+4 form (5

  • 4 = 9 digits with a hyphen). Standardization may look up the +4 extension from USPS data:
Input:  "1600 Pennsylvania Ave NW, Washington DC 20500"
Output: "1600 PENNSYLVANIA AVE NW, WASHINGTON DC 20500-0001"

The +4 extension narrows the address to a specific mail-delivery route or block, useful for direct-mail optimisation. Not all addresses have a published +4; some have multiple +4 extensions for different sides of the street or different sequence types.

International standardization

The Universal Postal Union (UPU) publishes country-specific addressing standards through its Addressing Solutions programme. Each country has its own conventions:

  • United Kingdom: postcode is the primary identifier (15 addresses average per postcode unit). Lines: building number / name, street, locality, post town, postcode.
  • Germany: house number after street, postal code before city. “Hauptstraße 123, 80331 München”.
  • France: number, then street, then 5-digit postal code + city.
  • Japan: prefecture / municipality / district / block / lot / building. Very different from Western conventions.
  • China: province / city / district / street / number in addresses that include them; rural addresses often reference administrative villages.
  • India: variable. Urban addresses are Western-like; rural addresses often reference landmarks (“opposite the school”) more than formal streets.

Standardizing across all these requires country-specific rules and a country detector to dispatch to the right ruleset. libpostal is the standard open-source answer; commercial standardization services (Loqate, Melissa, SmartyStreets) offer country-specific certified standardization.

libpostal

libpostal is the dominant open-source international address-parsing-and- standardization library. Released by OpenVenues / Mapzen in 2015, it:

  • Handles 60+ languages.
  • Uses statistical models trained on OpenStreetMap data.
  • Provides both parsing (string → components) and expansion (components → standardised variants).
  • Is written in C with bindings for Python, Ruby, Go, Java, JavaScript (via WASM).
  • Is used internally by Pelias (open-source geocoder), OpenAddresses, and several commercial systems.

libpostal's parse_address function returns labelled components; expand_address returns canonical-form variants. For production use, the C library is the recommended path; the bindings layer over it.

CASS certification (US)

USPS Coding Accuracy Support System (CASS) certifies standardization software for US addresses. CASS-certified software passes the USPS's monthly tests against a known address dataset, demonstrating standardization accuracy. Commercial standardization providers (Melissa, SmartyStreets, Loqate) are CASS-certified; their output is the gold standard for US mailing-list cleanup, qualifying for USPS commercial-mail discounts.

CASS-certified standardization is more aggressive than generic standardization — it includes USPS's proprietary delivery-point validation (DPV) data, which flags addresses that don't exist or aren't deliverable. For bulk mailing, this saves money on undeliverable mail.

Common pitfalls

A few standardization mistakes that recur:

Over-aggressive normalization. Aggressive standardization can break valid addresses. For example, “The Pennsylvania” is a building name in some addresses; standardizing to “PENNSYLVANIA” loses that information. Best practice: preserve the original input alongside the standardized form.

Locale mismatches. A standardizer configured for US addresses applied to UK addresses produces nonsense (“SW1A 1AA” in “London W1” doesn't map to USPS conventions). Always dispatch to the right country-specific standardizer based on a country-detection step.

Smart-quote autocorrect. Modern spreadsheets and word processors auto-replace straight quotes with smart quotes: ' becomes . Address suffixes that include apostrophes (O'Hare, McDonald's) break when standardizers expect ASCII apostrophes.

Truncation. Address fields in older databases were often limited to 30 or 40 characters. Truncated addresses lose secondary-unit information; standardizers need to either preserve the truncation or detect that the truncation occurred.

Multiple address lines. Some inputs use line 1 / line 2 / city / state / ZIP separate fields. Some use a single combined string. Standardizers need to handle both forms.

Non-USPS conventions. Inside the US, some addresses don't follow USPS conventions: military APO / FPO / DPO addresses, rural delivery routes, PO boxes. Each has its own format that diverges from the standard.

Standardization in the wild: a worked example

Consider a contact-list dataset with these inputs:

1.  123 main st, springfield IL 62701
2.  123 Main Street, Springfield Illinois 62701-1234
3.  123 main st spfd il
4.  123 Main, Springfield (no state, no ZIP)
5.  123 MAIN ST APT 4B, SPRINGFIELD IL 62701
6.  123 main # 4B springfield IL
7.  Apt 4B, 123 Main Street, Springfield IL 62701

A standardizer that applies USPS Pub 28 conventions should produce canonical output for each:

1.  123 MAIN ST, SPRINGFIELD IL 62701
2.  123 MAIN ST, SPRINGFIELD IL 62701-1234
3.  123 MAIN ST, SPRINGFIELD IL  (ZIP lookup possible if state+city allow)
4.  123 MAIN, SPRINGFIELD  (incomplete; flagged as low-confidence)
5.  123 MAIN ST APT 4B, SPRINGFIELD IL 62701
6.  123 MAIN ST APT 4B, SPRINGFIELD IL  (# converted to APT)
7.  123 MAIN ST APT 4B, SPRINGFIELD IL 62701  (apartment moved to suffix position)

Note how the standardizer:

  • Uppercases everything per USPS convention.
  • Expands “Street” to “ST”.
  • Recognises “#” and various secondary indicators all map to “APT”.
  • Re-orders apartment-first input to apartment-suffix output.
  • Converts “Illinois” to “IL”.
  • Flags incomplete input as lower confidence (case 4).

A downstream geocoder receiving any of inputs 1, 2, 5, 6, or 7 in standardized form returns the same coordinate; without standardization, only inputs that happen to match the geocoder's indexed format would resolve. Standardization is what makes geocoding robust to input variability.

Why standardization is hard

Several reasons standardization isn't a solved problem:

  • No single canonical form: USPS uses one convention, ISO another, country-specific norms vary. Choosing “the” canonical form requires deciding which target.
  • Information loss vs preservation: aggressive standardization (strip apartment numbers, normalise building names) loses information; gentle standardization doesn't deduplicate well. Production systems typically store both forms.
  • Language and script handling: addresses in Cyrillic, Devanagari, CJK, or Arabic scripts need transliteration or native-script handling. libpostal handles this; some USPS-focused tools don't.
  • Historical addresses: 19th-century street names, renamed streets, addresses that no longer exist — a standardizer trained on modern data may not match.
  • Informal addresses: rural, favela, or otherwise non-formal addresses don't fit USPS Pub 28; they need geocoder-specific handling (often POI matching or landmark-based lookup).

Implementation patterns

For developers handling address-cleaning at scale:

Standardize at the input boundary, not throughout the pipeline. A single canonical form makes downstream code (deduplication, geocoding, deliverability checking) simpler.

Preserve the raw input alongside the standardized form. Users sometimes need to see their original input for verification; auditing also benefits.

Use a CASS-certified library for US production work. Loqate, SmartyStreets, Melissa, and others offer CASS-certified APIs. For non-US work, libpostal is typically sufficient.

Track standardization confidence. Outputs labeled “high confidence” (rooftop match, all components valid) should be treated differently from “medium confidence” (some components inferred) and “low confidence” (missing components).

Re-standardize periodically. Addresses change (new construction, renaming, new ZIP+4 assignments). A standardized address from 2018 may not match 2026 USPS data; periodic re-standardization keeps the dataset fresh.

Handle the international case explicitly. For multi-country datasets, run a country-detection step first, then dispatch to the right country-specific standardizer. A US standardizer applied to a UK address produces nonsense output that downstream geocoding will fail to resolve. The country-detection step itself can use libpostal's parser, the postal code format, or explicit user input.

Cache only the standardized form for downstream indexing, not the raw input. The standardized form is canonical; the raw input is an artefact of the user's typing. Indexing on the raw input creates duplicates that the standardized form would eliminate.

Common misconceptions

“Standardization is the same as deduplication.” Related but distinct. Standardization normalises input to a canonical form. Deduplication identifies duplicate records across a database. Standardization is upstream of deduplication: standardize first, then check for duplicates in the canonical form.

“USPS Publication 28 applies worldwide.” US only. International addresses follow country-specific conventions defined by national postal authorities and the UPU.

“Standardization is solved by libpostal.” libpostal is excellent for parsing and basic standardization, but production address-cleaning often requires CASS-certified output (for US) or country-certified output (for other jurisdictions) that includes deliverability flags. libpostal doesn't certify deliverability — that requires authoritative postal data.

“Modern geocoders don't need pre-standardized input.” They're tolerant of unstandardized input (the parser handles common variations), but standardized input gives more reliable results. For batch mailing-list work, pre-standardize aggressively; for interactive geocoding, the geocoder's internal parsing is usually sufficient.

“Standardization preserves all information.” It may lose information. Casing variations (proper names), historical address forms (legacy spellings), and informal descriptive components (“the red door”) are typically stripped or normalized away. Production systems that need to preserve original input should store both the original and standardized forms.

“ZIP+4 is always the more specific form.” ZIP+4 narrows to a delivery point but can be ambiguous on the “sides of a street” question (one side may have a different +4 than the other). For most applications, the 5-digit ZIP is sufficient; the +4 matters for direct mail and presort discounts.

Frequently asked questions

What is address standardization?

Address standardization is the process of converting a free-form address string into a canonical form that a geocoder or postal system can reliably look up. '123 main st, springfield il' becomes '123 MAIN ST, SPRINGFIELD, IL 62701' (or whichever canonical form the standardizer targets). Standardization handles abbreviations, casing, punctuation, spelling variations, and structural reorganisation. It's the upstream step before geocoding lookup in any production address-processing system.

What is USPS Publication 28?

USPS Publication 28 is the official US Postal Service standard for addressing US mail. It defines: standard street-suffix abbreviations (Avenue → AVE, Boulevard → BLVD), directional prefixes (N, S, E, W, NE, NW, SE, SW), secondary-unit designators (Suite → STE, Apartment → APT), ZIP+4 format, state abbreviations, and other conventions. Publication 28 is the canonical reference for US address standardization; every commercial US address-cleaning service implements it (or a close variant).

Why is address standardization needed before geocoding?

Because address databases index canonical forms, not arbitrary input variations. A geocoder's index might have 'Main Street' but not 'main st'; without standardization, a query for 'main st' fails to match. Standardization is the bridge between human input variability and the geocoder's structured lookup. Some geocoders perform standardization internally (Google, Mapbox); others expect pre-standardized input (some specialised mailing-list cleaning services). Either way, the standardization step happens — it's just a question of where.

How does international standardization differ from US?

Dramatically. US conventions (number, street, city, state, ZIP) don't apply globally. The UK uses postcodes that identify ~15 addresses; Germany has separate street and number ordering; Japan addresses use prefecture / ward / block / building rather than street; China and India have substantial informal addressing in rural areas. The Universal Postal Union (UPU) publishes country-specific standards, but the conventions are inherently country-specific. International standardization libraries (libpostal) handle 60+ languages and many country-specific formats.

What's the difference between standardization and parsing?

Parsing extracts components (house_number, street, city, etc.) from a string. Standardization normalizes those components to canonical form (Ave → Avenue, lowercase → uppercase, etc.). Many systems do both in a single library call but conceptually they're distinct: parsing is structural (component extraction), standardization is normalization (form transformation). libpostal does both. USPS CASS-certified standardization services (commercial) typically do both within a single workflow.

Sources

  1. USPSUSPS Publication 28 — Postal Addressing Standards · https://pe.usps.com/text/pub28/welcome.htm · Accessed .
  2. UPUUniversal Postal Union — International addressing standards · https://www.upu.int/en/Postal-Solutions/Programmes-Services/Addressing-Solutions · Accessed .
  3. libpostallibpostal — open-source international address parser · https://github.com/openvenues/libpostal · Accessed .
  4. US CensusTIGER/Line — US address-range reference · https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html · Accessed .

Cite this article

APA format:

Steve K. (2026). Address Standardization. Coordinately. https://coordinately.org/learn/address-standardization

BibTeX:

@misc{coordinately_addressstandardization_2026,
  author = {K., Steve},
  title  = {Address Standardization},
  year   = {2026},
  publisher = {Coordinately},
  url    = {https://coordinately.org/learn/address-standardization},
  note   = {Accessed: 2026-06-05}
}