Methodology

How the Cefic Comext ETL pipeline turns raw Eurostat archives into a single Parquet file ready for Power BI — and the editorial choices made along the way.

Overview

This pipeline ingests monthly EU external trade statistics from the Eurostat Comext bulk distribution, filters them down to the chemistry-relevant CN chapters (CN 20–39), loads them into a partitioned PostgreSQL table, then republishes a single Parquet file that contains only the substances Cefic tracks.

It runs unattended every Monday morning. Once the run finishes, this site is regenerated so the Status page reflects exactly what is in the database and on disk.

1
Download
Eurostat .7z archive per period
2
Extract
Decompress to .dat tab-delimited
3
Transform
Filter CN chapters · cast types · clean nulls
4
Load
INSERT into partitioned comext_monthly
5
Export
Single Parquet for Power BI

Pipeline stages

1. Download (etl/download.py)

Each target period YYYYMM is fetched from the Eurostat dissemination API. A HEAD/GET returning 404 is treated as "month not yet published" rather than an error: the run logs a warning and moves on. This is what you see when running the pipeline ahead of the Eurostat release calendar.

2. Extract (etl/extract.py)

Archives are .7z bundles containing one .dat file encoded in latin-1, tab- or comma-delimited depending on the vintage. The extractor handles both delimiters transparently.

3. Transform (etl/transform.py)

  • Reads in chunks of 200,000 rows with pandas to keep memory bounded.
  • Renames legacy column headers (REPORTER → DECLARANT, VALUE_EUR → VALUE_IN_EUROS, …) to a stable schema.
  • Filters rows whose PRODUCT_NC starts with one of the broad CN prefixes 20, 21, 28, 29, 30, 31, 32, 33, 34, 35, 36, 38, 39 — the universe of substances Cefic might care about.
  • Casts numeric columns; drops pd.NA rows that would break the COPY into PostgreSQL.

4. Load (etl/load.py)

Cleaned chunks are pushed into comext_monthly in batches of 50,000 rows via psycopg2.execute_values. The loader is idempotent — see Schedule & idempotency below.

5. Export (export_parquet.py)

After every successful load (and on every cron run, even if no new file was available), the Parquet exporter rewrites comext_export.parquet. This guarantees the file Power BI consumes always reflects the current state of PostgreSQL.

Schedule & idempotency

The cron job cron_weekly.sh runs every Monday at 06:00 CEST:

python3 run_etl.py --year $(date +%Y)
python3 export_parquet.py

Idempotency is enforced by the etl_log table: a (period, filename) pair already marked success will cause the loader to skip the download entirely. To force a reload — for example after upstream Eurostat republishes a corrected file — pass --force:

python3 run_etl.py --year 2025 --month 3 --force
i

Eurostat typically publishes month N around the middle of month N + 2. It is normal for the latest cron run to find nothing new — the Status page will read "No new data", not an error.

Data source

Files are downloaded from:

https://ec.europa.eu/eurostat/api/dissemination/files
  ?file=comext/COMEXT_DATA/PRODUCTS/full_v2_{YYYY}{MM}.7z

The corresponding Eurostat dataset is EU trade since 1988 by HS6 / CN8 (Comext bulk). Coverage in this database starts at January 2002 and runs to the latest period Eurostat has published.

Product filtering

Two filters apply, at two different points of the pipeline:

  1. Broad chapter filter (load time) — only CN codes in chapters 20–39 are stored in PostgreSQL. This already drops > 80 % of the raw Comext volume.
  2. Narrow substance filter (export time) — the Parquet export only keeps rows whose 8-digit CN matches one of ~1,369 codes listed in data/upload/SubstanceId.csv. This file is the CPA2015 → CN2025 correspondence table maintained by Cefic; it is the single source of truth for "is this substance in scope?".

Updating SubstanceId.csv and re-running export_parquet.py is the supported way to add or remove substances — the underlying PostgreSQL table never needs to be touched.

PostgreSQL schema

comext_monthly is partitioned by year using PostgreSQL declarative partitioning. New years (e.g. 2027) need a one-line partition added to sql/schema.sql before they can be loaded.

CREATE TABLE comext_monthly (
    id              BIGSERIAL,
    period          INTEGER     NOT NULL,  -- YYYYMM
    declarant       CHAR(2)     NOT NULL,
    partner         CHAR(2)     NOT NULL,
    product_nc      CHAR(8)     NOT NULL,
    flow            SMALLINT    NOT NULL,  -- 1 = import, 2 = export
    value_in_euros  BIGINT,
    quantity_in_kg  BIGINT,
    sup_quantity    BIGINT,
    loaded_at       TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (id, period)
) PARTITION BY RANGE (period);

The legacy years (2002 → 2023) live in a non-partitioned twin table comext_monthly_legacy. The Parquet exporter UNIONs both at query time so consumers see one continuous time series.

Parquet export

The exporter runs a single SQL query joining comext_monthly and comext_monthly_legacy, filters by the substance allow-list, and streams the result to a local Parquet file. A few notable transforms happen here, not in the database:

  • period is converted from the integer YYYYMM to a real datetime set to the last day of the month, so Power BI can do native date math.
  • The CN code is left-padded with zeros to a full 8 characters.
  • chapter_cn is derived as the first two characters of product_nc for grouping.
  • Greece is normalised: Eurostat encodes it as GR in the raw files but uses EL in its own publications. The exporter rewrites GR → EL before the EU27 aggregation step.

EU27 aggregation

The Parquet export collapses the 27 EU member states into a single virtual country code, on both sides of the trade flow:

  • When the declarant is an EU27 country, its code is replaced by EU.
  • When the partner is an EU27 country, its code is replaced by EU27.

The two different labels are intentional: they let Power BI distinguish "EU reporting" from "trade with the EU bloc" without having to reapply a join. Aggregation happens at the SUM(value_in_euros) / SUM(quantity_in_kg) level so totals reconcile against published Eurostat aggregates.

!

EU27 is the post-Brexit definition (24 member states + Croatia + Bulgaria + Romania, no UK). Historical periods before 2020 use the same EU27 mask for consistency — UK rows always remain visible separately.

Output columns

The Parquet file Cefic Power BI consumes contains exactly these columns:

ColumnTypeDescription
perioddatetimeLast day of the month (e.g. 2025-03-31)
declarantchar(2)Reporter country, or EU for the EU27 bloc
partnerchar(2)Trade partner, or EU27 for the EU27 bloc
product_ncchar(8)Combined Nomenclature 8-digit code
cpa2015stringJoined from SubstanceId.csv
chapter_cnchar(2)First two digits of product_nc
flowsmallint1 = import, 2 = export
flow_labelstringImport or Export
value_in_eurosbigintTrade value in EUR (statistical value)
quantity_in_kgbigintTrade volume in kilograms

Partner dimension (partner_dim.parquet)

Alongside the fact table, export_parquet.py rebuilds a small secondary Parquet — a dimension table of every partner country code that can appear in the fact's partner column. It is sourced live from Eurostat's authoritative SDMX 3.0 ESTAT/PARTNER codelist (gzipped TSV, ~18 KB), so labels stay in sync with Eurostat without any manual maintenance.

Join it in Power BI on partner_code to get English, French and German labels for every country and Eurostat aggregate (EU27_2020, EFTA, EXT_EU27_2020…). 506 rows total — 488 from the Eurostat codelist plus 18 hand-curated entries for codes that still appear in older Comext archives but have been retired from the current codelist: historical country codes (CS, YU, TP, GR, GB) and COMEXT special codes for bunkers, stores, confidential trade and Spanish exclaves (QPQZ, XC, XL, XM, XS). The source column distinguishes them so you can filter them out in Power BI if needed.

ColumnTypeDescription
partner_codestringPrimary key (ISO alpha-2 or Eurostat aggregate code)
label_enstringEnglish label (e.g. China, European Union - 27 countries (from 2020))
label_frstringFrench label
label_destringGerman label
is_standard_codebooleanEurostat "Standard code" flag — not a real-country filter (aggregates like EU27_2020 are also flagged true)
sourcestringeurostat (488 codes from the SDMX codelist), historical (5 retired country codes: CS, YU, TP, GR, GB) or special (13 Comext codes for bunkers, stores, unspecified partners, Spanish exclaves…)
notestringFree-text composition note (list of member countries for aggregates, or explanation of special codes)
ec_corporate_codestringParent aggregate (e.g. CHN, EUR)
ec_corporate_uristringAuthority URI on publications.europa.eu

Gotcha: our fact Parquet collapses the 27 EU member states into the synthetic codes EU (declarant) and EU27 (partner). Neither exists in the official Eurostat codelist — provide a fallback label for them in Power BI.

Product dimension (product_dim.parquet)

A third Parquet, rebuilt in the same export_parquet.py run, carries the full Combined Nomenclature hierarchy sourced live from Eurostat's authoritative SDMX 3.0 ESTAT/CXT_NC codelist (~1.5 MB gzipped, 40 409 codes). Every HS chapter, heading, HS 6-digit subheading and CN 8-digit code ships with English, French and German labels, plus two derived columns (level and parent_code) that let Power BI build a drill-down hierarchy without any manual work.

ColumnTypeDescription
product_codestringPrimary key. Join to the fact's product_nc column for the 8-digit codes.
label_enstringEnglish label (e.g. Chlorine, INORGANIC CHEMICALS)
label_frstringFrench label
label_destringGerman label
levelstringPosition in the CN hierarchy: chapter (2 digits), heading (4), hs_subheading (6), cn_code (8) or special for TOTAL and 00* confidential aggregates.
parent_codestringParent in the hierarchy — the code minus its last two characters. Null for chapters and specials.
chapterstringFirst two characters (HS chapter). Handy for direct chapter-level slicing.
is_not_elsewhere_specifiedbooleantrue for catch-all rows containing X (e.g. 280110XX, 2801XX). These represent trade that couldn't be attributed to a specific subcode.

Hierarchy breakdown: 98 chapters, 1 822 headings, 9 450 HS subheadings, 29 028 CN 8-digit codes, 11 specials. The current codelist covers 100 % of the 1 369 distinct product_nc values in our fact table.

Known limitations

  • Eurostat revisions. Comext republishes corrected versions of past months several times a year. The pipeline only picks them up when run with --force; otherwise an already-loaded period is treated as final.
  • Suppressed cells. Eurostat masks small or confidential trade flows (typically < 100 € or single-trader exports). These rows are simply absent — there is no flag column.
  • Partial 2020. The 2020 archive originally shipped 17 monthly files (mid-year revision). All are loaded; the duplicate periods are deduplicated at the etl_log level.
  • Brexit. The United Kingdom is in the data both as an EU27 member (until December 2020) and as a third country (from January 2021). Time series should be sliced accordingly.
  • No real-time freshness. The Status page is rebuilt only when the cron runs, or when generate_status.py is called manually. A Parquet rebuild without a status refresh will not update the homepage.

Operational commands

# Full pipeline (every year listed in config.py)
python3 run_etl.py

# A specific year
python3 run_etl.py --year 2025

# A specific month
python3 run_etl.py --year 2025 --month 3

# Force a reload even if etl_log says it is done
python3 run_etl.py --year 2025 --month 3 --force

# Just one stage
python3 run_etl.py --step download
python3 run_etl.py --step load

# Rebuild the Parquet file
python3 export_parquet.py

# Refresh this status site
python3 scripts/generate_status.py