Methodology – Cefic Comext ETL

Pipeline at a glance

Overview

This pipeline ingests monthly EU external trade statistics from the Eurostat Comext bulk distribution, filters them down to the chemistry-relevant CN chapters (CN 20–39), loads them into a partitioned PostgreSQL table, then republishes three Parquet files: a filtered fact table containing only the substances Cefic tracks, plus a partner-country and a product-code dimension. The three files feed the EU27 chemicals trade radar SPA.

It runs unattended every Monday morning. Once the run finishes, this site is regenerated so the Status page reflects exactly what is in the database and on disk.

Download

Eurostat .7z archive per period

Extract

Decompress to .dat tab-delimited

Transform

Filter CN chapters · cast types · clean nulls

Load

INSERT into partitioned comext_monthly

Export

Three Parquet files for the trade radar

Stages

Pipeline stages

1. Download (`etl/download.py`)

Each target period YYYYMM is fetched from the Eurostat dissemination API. A HEAD/GET returning 404 is treated as "month not yet published" rather than an error: the run logs a warning and moves on. This is what you see when running the pipeline ahead of the Eurostat release calendar.

2. Extract (`etl/extract.py`)

Archives are .7z bundles containing one .dat file encoded in latin-1, tab- or comma-delimited depending on the vintage. The extractor handles both delimiters transparently.

3. Transform (`etl/transform.py`)

Reads in chunks of 200,000 rows with pandas to keep memory bounded.
Renames legacy column headers (REPORTER → DECLARANT, VALUE_EUR → VALUE_IN_EUROS, …) to a stable schema.
Filters rows whose PRODUCT_NC starts with one of the broad CN prefixes 20, 21, 28, 29, 30, 31, 32, 33, 34, 35, 36, 38, 39: the universe of substances Cefic might care about.
Casts numeric columns; drops pd.NA rows that would break the COPY into PostgreSQL.

4. Load (`etl/load.py`)

Cleaned chunks are pushed into comext_monthly in batches of 50,000 rows via psycopg2.execute_values. The loader is idempotent (see Schedule & idempotency below).

5. Export (`export_parquet.py`)

After every successful load (and on every cron run, even if no new file was available), the Parquet exporter rewrites comext_export.parquet alongside the two dimension files. This guarantees that the artefacts the trade radar consumes always reflect the current state of PostgreSQL.

Cadence

Schedule & idempotency

The cron job cron_weekly.sh runs every Monday at 06:00 CEST and chains three pipeline stages before rebuilding the downstream sites:

python3 run_etl.py --year $(date +%Y)        # 1. pick up newly published months
python3 run_etl.py --check-revisions          # 2. re-import months that Eurostat republished
python3 export_parquet.py                     # 3. rebuild Parquet files

Idempotency is enforced by the etl_log table: a (period, filename) pair already marked success will cause the loader to skip the download entirely. The --check-revisions step bypasses that idempotency selectively, but only for periods where Eurostat’s upload calendar shows a republication more recent than our last load timestamp (rolling 24-month window). It uses DELETE WHERE period=X + INSERT, so the row count is replaced rather than duplicated. A manual --force remains available for one-off backfills:

python3 run_etl.py --year 2025 --month 3 --force

Eurostat typically publishes month N around the middle of month N + 2, then republishes corrected versions of the previous 6–12 months as Member States submit late or amended figures. It is normal for the year-to-date pass to find nothing new (the Status page will read "No new data") while the revision sweep still re-imports several past months. Both outcomes are visible on the Status page.

Source

Data source

Files are downloaded from:

https://ec.europa.eu/eurostat/api/dissemination/files
  ?file=comext/COMEXT_DATA/PRODUCTS/full_v2_{YYYY}{MM}.7z

The corresponding Eurostat dataset is EU trade since 1988 by HS6 / CN8 (Comext bulk). Coverage in this database starts at January 2002 and runs to the latest period Eurostat has published.

Scope

Product filtering

Two filters apply, at two different points of the pipeline:

Broad chapter filter (load time): only CN codes in chapters 20–39 are stored in PostgreSQL. This already drops > 80 % of the raw Comext volume.
Narrow substance filter (export time): the Parquet export only keeps rows whose 8-digit CN matches one of ~1,369 codes listed in data/upload/SubstanceId.csv. This file is the CPA2015 → CN2025 correspondence table maintained by Cefic; it is the single source of truth for "is this substance in scope?".

Updating SubstanceId.csv and re-running export_parquet.py is the supported way to add or remove substances. The underlying PostgreSQL table never needs to be touched.

Schema · Postgres

PostgreSQL schema

comext_monthly is partitioned by year using PostgreSQL declarative partitioning. New years (e.g. 2027) need a one-line partition added to sql/schema.sql before they can be loaded.

CREATE TABLE comext_monthly (
    id              BIGSERIAL,
    period          INTEGER     NOT NULL,  -- YYYYMM
    declarant       CHAR(2)     NOT NULL,
    partner         CHAR(2)     NOT NULL,
    product_nc      CHAR(8)     NOT NULL,
    flow            SMALLINT    NOT NULL,  -- 1 = import, 2 = export
    value_in_euros  BIGINT,
    quantity_in_kg  BIGINT,
    sup_quantity    BIGINT,
    loaded_at       TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (id, period)
) PARTITION BY RANGE (period);

The legacy years (2002 → 2023) live in a non-partitioned twin table comext_monthly_legacy. The Parquet exporter UNIONs both at query time so consumers see one continuous time series.

Outputs

Parquet export

The exporter runs a single SQL query joining comext_monthly and comext_monthly_legacy, filters by the substance allow-list, and streams the result to a local Parquet file. A few notable transforms happen here, not in the database:

period is converted from the integer YYYYMM to a real datetime set to the last day of the month, so the trade radar can do native date math.
The CN code is left-padded with zeros to a full 8 characters.
chapter_cn is derived as the first two characters of product_nc for grouping.
Greece is normalised: Eurostat encodes it as GR in the raw files but uses EL in its own publications. The exporter rewrites GR → EL before the EU27 aggregation step.

Aggregation

EU27 aggregation

The Parquet export collapses the 27 EU member states into a single virtual country code, on both sides of the trade flow:

The export only keeps rows whose declarant is an EU27 country: GB/UK and every other non-EU27 reporter are filtered out. The declarant column is then flattened to the constant EU.
When the partner is an EU27 country, its code is replaced by EU27. Non-EU27 partners (including GB/UK) keep their original ISO code.

The two different labels are intentional: they let the trade radar distinguish "EU reporting" from "trade with the EU bloc" without having to reapply a join. Aggregation happens at the SUM(value_in_euros) / SUM(quantity_in_kg) level so totals reconcile against published Eurostat aggregates.

EU27 is the post-Brexit definition (24 member states + Croatia + Bulgaria + Romania, no UK). Historical periods before 2020 use the same EU27 mask for consistency; UK rows always remain visible separately.

Schema · Parquet

Output columns

The fact Parquet that powers the EU27 chemicals trade radar contains exactly these columns:

Column	Type	Description
`period`	datetime	Last day of the month (e.g. `2025-03-31`)
`declarant`	char(2)	Always `EU` (rows are filtered to EU27 reporters only, GB/UK excluded)
`partner`	char(2)	Trade partner, or `EU27` for the EU27 bloc
`product_nc`	char(8)	Combined Nomenclature 8-digit code
`cpa2015`	string	Joined from `SubstanceId.csv`
`chapter_cn`	char(2)	First two digits of `product_nc`
`flow`	smallint	`1` = import, `2` = export
`flow_label`	string	`Import` or `Export`
`value_in_euros`	bigint	Trade value in EUR (statistical value)
`quantity_in_kg`	bigint	Trade volume in kilograms

Dimensions · partners

Partner dimension (`partner_dim.parquet`)

Alongside the fact table, export_parquet.py rebuilds a small secondary Parquet: a dimension table of every partner country code that can appear in the fact's partner column. It is sourced live from Eurostat's authoritative SDMX 3.0 ESTAT/PARTNER codelist (gzipped TSV, ~18 KB), so labels stay in sync with Eurostat without any manual maintenance.

Join it on partner_code to get English, French and German labels for every country and Eurostat aggregate (EU27_2020, EFTA, EXT_EU27_2020 and others). 506 rows total: 488 from the Eurostat codelist plus 18 hand-curated entries for codes that still appear in older Comext archives but have been retired from the current codelist: historical country codes (CS, YU, TP, GR, GB) and COMEXT special codes for bunkers, stores, confidential trade and Spanish exclaves (QP–QZ, XC, XL, XM, XS). The source column distinguishes them so the trade radar can filter them out as needed.

Column	Type	Description
`partner_code`	string	Primary key (ISO alpha-2 or Eurostat aggregate code)
`label_en`	string	English label (e.g. `China`, `European Union - 27 countries (from 2020)`)
`label_fr`	string	French label
`label_de`	string	German label
`is_standard_code`	boolean	Eurostat "Standard code" flag, not a real-country filter (aggregates like `EU27_2020` are also flagged `true`)
`source`	string	`eurostat` (488 codes from the SDMX codelist), `historical` (5 retired country codes: `CS`, `YU`, `TP`, `GR`, `GB`) or `special` (13 Comext codes for bunkers, stores, unspecified partners, Spanish exclaves…)
`note`	string	Free-text composition note (list of member countries for aggregates, or explanation of special codes)
`ec_corporate_code`	string	Parent aggregate (e.g. `CHN`, `EUR`)
`ec_corporate_uri`	string	Authority URI on `publications.europa.eu`

Gotcha: our fact Parquet collapses the 27 EU member states into the synthetic codes EU (declarant) and EU27 (partner). Neither exists in the official Eurostat codelist (the trade radar ships a fallback label for them).

Dimensions · products

Product dimension (`product_dim.parquet`)

A third Parquet, rebuilt in the same export_parquet.py run, carries the full Combined Nomenclature hierarchy sourced live from Eurostat's authoritative SDMX 3.0 ESTAT/CXT_NC codelist (~1.5 MB gzipped, 40 409 codes). Every HS chapter, heading, HS 6-digit subheading and CN 8-digit code ships with English, French and German labels, plus two derived columns (level and parent_code) that let the trade radar build a drill-down hierarchy without any manual work.

Column	Type	Description
`product_code`	string	Primary key. Join to the fact's `product_nc` column for the 8-digit codes.
`label_en`	string	English label (e.g. `Chlorine`, `INORGANIC CHEMICALS`)
`label_fr`	string	French label
`label_de`	string	German label
`level`	string	Position in the CN hierarchy: `chapter` (2 digits), `heading` (4), `hs_subheading` (6), `cn_code` (8) or `special` for `TOTAL` and `00*` confidential aggregates.
`parent_code`	string	Parent in the hierarchy, the code minus its last two characters. Null for chapters and specials.
`chapter`	string	First two characters (HS chapter). Handy for direct chapter-level slicing.
`is_not_elsewhere_specified`	boolean	`true` for catch-all rows containing `X` (e.g. `280110XX`, `2801XX`). These represent trade that couldn't be attributed to a specific subcode.

Hierarchy breakdown: 98 chapters, 1 822 headings, 9 450 HS subheadings, 29 028 CN 8-digit codes, 11 specials. The current codelist covers 100 % of the 1 369 distinct product_nc values in our fact table.

Caveats

Known limitations

Eurostat revisions. Comext republishes corrected versions of past months several times a year. The Monday cron now runs run_etl.py --check-revisions after the year-to-date pass: it parses Eurostat’s dates of latest data uploads workbook (per-MS, Intra and Extra), finds any period whose latest republication is more recent than our etl_log.finished_at (rolling 24-month window), and re-imports it via DELETE WHERE period=X + INSERT (no duplicates, no manual --force). The "Eurostat revisions sweep" section on the Status page reports each per-period reload (when Eurostat republished, when we previously loaded, when we reloaded, and how many rows were replaced).
Suppressed cells. Eurostat masks small or confidential trade flows (typically < 100 € or single-trader exports). These rows are simply absent; there is no flag column.
Partial 2020. The 2020 archive originally shipped 17 monthly files (mid-year revision). All are loaded; the duplicate periods are deduplicated at the etl_log level.
Brexit. The United Kingdom is in the data both as an EU27 member (until December 2020) and as a third country (from January 2021). Time series should be sliced accordingly.
No real-time freshness. The Status page is rebuilt only when the cron runs, or when generate_status.py is called manually. A Parquet rebuild without a status refresh will not update the homepage.

CLI

Operational commands

# Full pipeline (every year listed in config.py)
python3 run_etl.py

# A specific year
python3 run_etl.py --year 2025

# A specific month
python3 run_etl.py --year 2025 --month 3

# Force a reload even if etl_log says it is done
python3 run_etl.py --year 2025 --month 3 --force

# Detect and re-import periods that Eurostat republished since our last load
# (default window: 24 months back; --min-period YYYYMM to override)
python3 run_etl.py --check-revisions
python3 run_etl.py --check-revisions --min-period 202301

# Just one stage
python3 run_etl.py --step download
python3 run_etl.py --step load

# Rebuild the Parquet file
python3 export_parquet.py

# Refresh this status site
python3 scripts/generate_status.py

How the radar is built.

Overview

Pipeline stages

1. Download (etl/download.py)

2. Extract (etl/extract.py)

3. Transform (etl/transform.py)

4. Load (etl/load.py)

5. Export (export_parquet.py)