Methodology
How the Cefic Comext ETL pipeline turns raw Eurostat archives into a single Parquet file ready for Power BI — and the editorial choices made along the way.
Overview
This pipeline ingests monthly EU external trade statistics from the Eurostat Comext bulk distribution, filters them down to the chemistry-relevant CN chapters (CN 20–39), loads them into a partitioned PostgreSQL table, then republishes a single Parquet file that contains only the substances Cefic tracks.
It runs unattended every Monday morning. Once the run finishes, this site is regenerated so the Status page reflects exactly what is in the database and on disk.
.7z archive per period.dat tab-delimitedcomext_monthlyPipeline stages
1. Download (etl/download.py)
Each target period YYYYMM is fetched from the Eurostat dissemination
API. A HEAD/GET returning 404 is treated as
"month not yet published" rather than an error: the run logs a warning
and moves on. This is what you see when running the pipeline ahead of the
Eurostat release calendar.
2. Extract (etl/extract.py)
Archives are .7z bundles containing one .dat file
encoded in latin-1, tab- or comma-delimited depending on the
vintage. The extractor handles both delimiters transparently.
3. Transform (etl/transform.py)
- Reads in chunks of 200,000 rows with pandas to keep memory bounded.
- Renames legacy column headers (
REPORTER → DECLARANT,VALUE_EUR → VALUE_IN_EUROS, …) to a stable schema. - Filters rows whose
PRODUCT_NCstarts with one of the broad CN prefixes20, 21, 28, 29, 30, 31, 32, 33, 34, 35, 36, 38, 39— the universe of substances Cefic might care about. - Casts numeric columns; drops
pd.NArows that would break the COPY into PostgreSQL.
4. Load (etl/load.py)
Cleaned chunks are pushed into comext_monthly in batches of
50,000 rows via psycopg2.execute_values. The loader
is idempotent — see Schedule & idempotency below.
5. Export (export_parquet.py)
After every successful load (and on every cron run, even if no new file was
available), the Parquet exporter rewrites comext_export.parquet.
This guarantees the file Power BI consumes always reflects the current state
of PostgreSQL.
Schedule & idempotency
The cron job cron_weekly.sh runs every Monday at 06:00 CEST:
python3 run_etl.py --year $(date +%Y)
python3 export_parquet.py
Idempotency is enforced by the etl_log table: a
(period, filename) pair already marked success will
cause the loader to skip the download entirely. To force a reload — for example
after upstream Eurostat republishes a corrected file — pass --force:
python3 run_etl.py --year 2025 --month 3 --force
Eurostat typically publishes month N around the middle of month N + 2. It is normal for the latest cron run to find nothing new — the Status page will read "No new data", not an error.
Data source
Files are downloaded from:
https://ec.europa.eu/eurostat/api/dissemination/files
?file=comext/COMEXT_DATA/PRODUCTS/full_v2_{YYYY}{MM}.7z
The corresponding Eurostat dataset is EU trade since 1988 by HS6 / CN8 (Comext bulk). Coverage in this database starts at January 2002 and runs to the latest period Eurostat has published.
Product filtering
Two filters apply, at two different points of the pipeline:
- Broad chapter filter (load time) — only CN codes in chapters 20–39 are stored in PostgreSQL. This already drops > 80 % of the raw Comext volume.
-
Narrow substance filter (export time) — the Parquet export
only keeps rows whose 8-digit CN matches one of ~1,369 codes listed in
data/upload/SubstanceId.csv. This file is theCPA2015 → CN2025correspondence table maintained by Cefic; it is the single source of truth for "is this substance in scope?".
Updating SubstanceId.csv and re-running export_parquet.py
is the supported way to add or remove substances — the underlying PostgreSQL
table never needs to be touched.
PostgreSQL schema
comext_monthly is partitioned by year using PostgreSQL declarative
partitioning. New years (e.g. 2027) need a one-line partition added to
sql/schema.sql before they can be loaded.
CREATE TABLE comext_monthly (
id BIGSERIAL,
period INTEGER NOT NULL, -- YYYYMM
declarant CHAR(2) NOT NULL,
partner CHAR(2) NOT NULL,
product_nc CHAR(8) NOT NULL,
flow SMALLINT NOT NULL, -- 1 = import, 2 = export
value_in_euros BIGINT,
quantity_in_kg BIGINT,
sup_quantity BIGINT,
loaded_at TIMESTAMPTZ DEFAULT NOW(),
PRIMARY KEY (id, period)
) PARTITION BY RANGE (period);
The legacy years (2002 → 2023) live in a non-partitioned twin table
comext_monthly_legacy. The Parquet exporter UNIONs
both at query time so consumers see one continuous time series.
Parquet export
The exporter runs a single SQL query joining comext_monthly and
comext_monthly_legacy, filters by the substance allow-list, and
streams the result to a local Parquet file. A few notable transforms happen
here, not in the database:
periodis converted from the integerYYYYMMto a real datetime set to the last day of the month, so Power BI can do native date math.- The CN code is left-padded with zeros to a full 8 characters.
chapter_cnis derived as the first two characters ofproduct_ncfor grouping.- Greece is normalised: Eurostat encodes it as
GRin the raw files but usesELin its own publications. The exporter rewritesGR → ELbefore the EU27 aggregation step.
EU27 aggregation
The Parquet export collapses the 27 EU member states into a single virtual country code, on both sides of the trade flow:
- When the declarant is an EU27 country, its code is replaced by
EU. - When the partner is an EU27 country, its code is replaced by
EU27.
The two different labels are intentional: they let Power BI distinguish "EU
reporting" from "trade with the EU bloc" without having to reapply a join.
Aggregation happens at the SUM(value_in_euros) /
SUM(quantity_in_kg) level so totals reconcile against published
Eurostat aggregates.
EU27 is the post-Brexit definition (24 member states + Croatia + Bulgaria + Romania, no UK). Historical periods before 2020 use the same EU27 mask for consistency — UK rows always remain visible separately.
Output columns
The Parquet file Cefic Power BI consumes contains exactly these columns:
| Column | Type | Description |
|---|---|---|
period | datetime | Last day of the month (e.g. 2025-03-31) |
declarant | char(2) | Reporter country, or EU for the EU27 bloc |
partner | char(2) | Trade partner, or EU27 for the EU27 bloc |
product_nc | char(8) | Combined Nomenclature 8-digit code |
cpa2015 | string | Joined from SubstanceId.csv |
chapter_cn | char(2) | First two digits of product_nc |
flow | smallint | 1 = import, 2 = export |
flow_label | string | Import or Export |
value_in_euros | bigint | Trade value in EUR (statistical value) |
quantity_in_kg | bigint | Trade volume in kilograms |
Partner dimension (partner_dim.parquet)
Alongside the fact table, export_parquet.py rebuilds a small
secondary Parquet — a dimension table of every partner country code that
can appear in the fact's partner column. It is sourced live
from Eurostat's authoritative
SDMX 3.0 ESTAT/PARTNER codelist
(gzipped TSV, ~18 KB), so labels stay in sync with Eurostat without any
manual maintenance.
Join it in Power BI on partner_code to get English, French
and German labels for every country and Eurostat aggregate
(EU27_2020, EFTA, EXT_EU27_2020…).
506 rows total — 488 from the Eurostat codelist plus
18 hand-curated entries for codes that still appear in older Comext
archives but have been retired from the current codelist: historical
country codes (CS, YU, TP,
GR, GB) and COMEXT special codes for bunkers,
stores, confidential trade and Spanish exclaves
(QP–QZ, XC, XL,
XM, XS). The source column
distinguishes them so you can filter them out in Power BI if needed.
| Column | Type | Description |
|---|---|---|
partner_code | string | Primary key (ISO alpha-2 or Eurostat aggregate code) |
label_en | string | English label (e.g. China, European Union - 27 countries (from 2020)) |
label_fr | string | French label |
label_de | string | German label |
is_standard_code | boolean | Eurostat "Standard code" flag — not a real-country filter (aggregates like EU27_2020 are also flagged true) |
source | string | eurostat (488 codes from the SDMX codelist), historical (5 retired country codes: CS, YU, TP, GR, GB) or special (13 Comext codes for bunkers, stores, unspecified partners, Spanish exclaves…) |
note | string | Free-text composition note (list of member countries for aggregates, or explanation of special codes) |
ec_corporate_code | string | Parent aggregate (e.g. CHN, EUR) |
ec_corporate_uri | string | Authority URI on publications.europa.eu |
Gotcha: our fact Parquet collapses the 27 EU member
states into the synthetic codes EU (declarant) and
EU27 (partner). Neither exists in the official Eurostat
codelist — provide a fallback label for them in Power BI.
Product dimension (product_dim.parquet)
A third Parquet, rebuilt in the same export_parquet.py
run, carries the full Combined Nomenclature hierarchy sourced live
from Eurostat's authoritative
SDMX 3.0 ESTAT/CXT_NC codelist
(~1.5 MB gzipped, 40 409 codes). Every HS chapter, heading,
HS 6-digit subheading and CN 8-digit code ships with English,
French and German labels, plus two derived columns
(level and parent_code) that let Power BI
build a drill-down hierarchy without any manual work.
| Column | Type | Description |
|---|---|---|
product_code | string | Primary key. Join to the fact's product_nc column for the 8-digit codes. |
label_en | string | English label (e.g. Chlorine, INORGANIC CHEMICALS) |
label_fr | string | French label |
label_de | string | German label |
level | string | Position in the CN hierarchy: chapter (2 digits), heading (4), hs_subheading (6), cn_code (8) or special for TOTAL and 00* confidential aggregates. |
parent_code | string | Parent in the hierarchy — the code minus its last two characters. Null for chapters and specials. |
chapter | string | First two characters (HS chapter). Handy for direct chapter-level slicing. |
is_not_elsewhere_specified | boolean | true for catch-all rows containing X (e.g. 280110XX, 2801XX). These represent trade that couldn't be attributed to a specific subcode. |
Hierarchy breakdown: 98 chapters, 1 822 headings,
9 450 HS subheadings, 29 028 CN 8-digit codes,
11 specials. The current codelist covers 100 % of the 1 369
distinct product_nc values in our fact table.
Known limitations
- Eurostat revisions. Comext republishes corrected versions of past months several times a year. The pipeline only picks them up when run with
--force; otherwise an already-loaded period is treated as final. - Suppressed cells. Eurostat masks small or confidential trade flows (typically < 100 € or single-trader exports). These rows are simply absent — there is no flag column.
- Partial 2020. The 2020 archive originally shipped 17 monthly files (mid-year revision). All are loaded; the duplicate periods are deduplicated at the
etl_loglevel. - Brexit. The United Kingdom is in the data both as an EU27 member (until December 2020) and as a third country (from January 2021). Time series should be sliced accordingly.
- No real-time freshness. The Status page is rebuilt only when the cron runs, or when
generate_status.pyis called manually. A Parquet rebuild without a status refresh will not update the homepage.
Operational commands
# Full pipeline (every year listed in config.py)
python3 run_etl.py
# A specific year
python3 run_etl.py --year 2025
# A specific month
python3 run_etl.py --year 2025 --month 3
# Force a reload even if etl_log says it is done
python3 run_etl.py --year 2025 --month 3 --force
# Just one stage
python3 run_etl.py --step download
python3 run_etl.py --step load
# Rebuild the Parquet file
python3 export_parquet.py
# Refresh this status site
python3 scripts/generate_status.py