Code-Driven Manifesto: Evaluating Tokyo's Open Data from an Engineering Lens

Hey there, Okamu here! Today I'm taking a swing at Tokyo's open data offerings with an engineer's eye, "code speaks" style. I'll walk through what's working, what's painful, and how to actually make these datasets usable in production.
- Tokyo publishes many CSV datasets (evacuation shelters, PM2.5 streams, stats) but formats and update rhythms vary
- Machine-readability gaps (PDFs, rounding quirks, missing APIs) limit reuse; small engineering fixes unlock big value
- Concrete proposals: schema-first APIs, streaming endpoints, validation CI, and developer SDKs
Conclusion
Tokyo has a lot of valuable public data (see portal.data.metro.tokyo.lg.jp and catalog.data.metro.tokyo.lg.jp), and legal groundwork is mostly fine (CC BY 4.0 for shelters). But from an engineering standpoint the ecosystem needs: consistent machine-readable formats, stable APIs (including time-series/geo APIs), metadata standards (DCAT/JSON-LD), and automated quality checks. Implementing those will turn documents into dependable infrastructure for apps, research, and policy verification.
Report
What I looked at
These public sources are relevant: Tokyo Open Data Portal and Catalog (portal.data.metro.tokyo.lg.jp / catalog.data.metro.tokyo.lg.jp), Japan Dashboard and e-Stat, and national UX/UI guidance from the Digital Agency and METI. Also notable: datasets like evacuation shelter lists (licensed CC BY 4.0) and 1-minute PM2.5 measurements appear in CSV form.
Pain points — engineer's POV
- Format heterogeneity: CSVs, occasional PDFs, and some HTML tables. Any developer will tell you that parsing PDFs is a last resort!
- Rounding/typing issues: the catalog notes that some indices are rounded to one decimal place, while values with no fractional part are published as bare integers, so a single column mixes representations. That breaks reproducible aggregation and type inference.
- API gaps: the portal and catalog are well maintained, but there are few guaranteed REST/SDK contracts for time-series or geospatial queries (e.g., no OGC API - Features endpoint).
- Update cadence & provenance: some datasets moved from yearly to monthly updates; consumers need stable timestamps, versioning, and changelogs.
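The update-cadence point is easy to operationalize on the consumer side: diff each new release against the previous snapshot, keyed on a stable ID, and emit a changelog. A minimal sketch (the shelter_id column and the sample rows are hypothetical, not from the actual dataset):

```python
import pandas as pd

def snapshot_diff(old: pd.DataFrame, new: pd.DataFrame, key: str) -> dict:
    """Summarize changes between two dataset releases keyed on a stable ID."""
    old_ids, new_ids = set(old[key]), set(new[key])
    return {
        "added": sorted(new_ids - old_ids),
        "removed": sorted(old_ids - new_ids),
        "unchanged": len(old_ids & new_ids),
    }

# Two hypothetical releases of a shelter list
old = pd.DataFrame({"shelter_id": ["S1", "S2"], "name": ["A", "B"]})
new = pd.DataFrame({"shelter_id": ["S2", "S3"], "name": ["B", "C"]})
print(snapshot_diff(old, new, "shelter_id"))
# {'added': ['S3'], 'removed': ['S1'], 'unchanged': 1}
```

Ideally the publisher would ship this diff with each release; until then, consumers can generate it themselves.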
Quick technical checks & examples
These are small, practical snippets developers use when a portal gives CSVs.
1) Fetch CSV and load with pandas (handles many quirks):
import pandas as pd

url = "https://catalog.data.metro.tokyo.lg.jp/dataset/.../resource.csv"
df = pd.read_csv(url)
# fix numeric types: coerce object columns that are fully numeric
# (errors='ignore' is deprecated in recent pandas, so check explicitly)
for c in df.select_dtypes(include="object").columns:
    converted = pd.to_numeric(df[c], errors="coerce")
    if converted.notna().all():
        df[c] = converted
2) Rounding-safe aggregation (avoid float-then-round traps):
# if original source rounds, request raw counts or keep distributed decimals
df['population_total'] = df[['age0_9','age10_19',...]].sum(axis=1)
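To see why upstream rounding matters, here is the round-then-sum trap in four lines; the index values are made up, but the drift is exactly what pre-rounded catalog columns produce:

```python
# Hypothetical raw index values that the publisher would round
# to one decimal place before release.
raw = [10.04, 10.04, 10.04]
rounded_first = sum(round(v, 1) for v in raw)  # sums the pre-rounded values
true_total = round(sum(raw), 1)                # rounds once, at the end
print(rounded_first, true_total)  # 30.0 30.1
```

The two totals disagree in the first decimal place, which is why consumers need raw counts (or full-precision decimals) rather than pre-rounded columns.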
3) Convert to efficient storage for APIs:
# CSV -> Parquet for fast reads and columnar queries
python -c "import pandas as pd; pd.read_csv('in.csv').to_parquet('out.parquet')"
Concrete improvement roadmap
- Adopt DCAT/JSON-LD metadata across the catalog so machine agents can discover datasets reliably.
- Provide OGC/Feature API or simple REST endpoints for geospatial datasets (evacuation shelters as GeoJSON with stable IDs).
- Expose time-series with pagination and ISO timestamps for PM2.5 (consider SSE or MQTT for near-real-time subscribers).
- Publish schema (Table Schema / Data Package) and semantic types to avoid ad-hoc parsing.
- Run CI on datasets: schema validation, null-rate alerts, distribution drift checks. Publish changelog with each release.
- Offer example SDKs and reproducible notebooks (GitHub + Binder) that show how to reproduce policy metrics.
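The validation-CI item above can start very small. A sketch of a per-release check, assuming a hypothetical contract for the shelter dataset (the real schema would come from a published Table Schema file, and the null-rate threshold is my own placeholder):

```python
import pandas as pd

# Hypothetical contract; not the dataset's actual published schema.
SCHEMA = {"shelter_id": "object", "lat": "float64", "lon": "float64"}
MAX_NULL_RATE = 0.01  # illustrative threshold

def validate(df: pd.DataFrame) -> list:
    """Return a list of violations; an empty list means the release passes CI."""
    errors = []
    for col, dtype in SCHEMA.items():
        if col not in df.columns:
            errors.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            errors.append(f"{col}: expected {dtype}, got {df[col].dtype}")
        elif df[col].isna().mean() > MAX_NULL_RATE:
            errors.append(f"{col}: null rate above {MAX_NULL_RATE:.0%}")
    return errors

good = pd.DataFrame({"shelter_id": ["S1"], "lat": [35.68], "lon": [139.69]})
print(validate(good))  # []
```

Run this in CI on every release, fail the build on a non-empty list, and distribution-drift checks can be layered on later.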
Policy measurement & gaps
This is where data meets politics: if policy sets numerical targets, the portal must expose both the target and periodic measurements in machine-readable form. Right now, fragmented tables and rounding rules make automated monitoring brittle. A simple contract: every policy KPI gets a dataset with (kpi_id, target, period, actual_value, source_url, last_updated).
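That KPI contract is small enough to write down directly. An illustrative record type using the field names proposed above (the class and sample values are mine, not an existing government schema):

```python
from dataclasses import dataclass

@dataclass
class KpiRecord:
    """One row of the proposed per-KPI dataset contract (illustrative)."""
    kpi_id: str
    target: float
    period: str         # e.g. "2025-Q1"
    actual_value: float
    source_url: str
    last_updated: str   # ISO 8601 date

    def attainment(self) -> float:
        """Fraction of the target achieved in this period."""
        return self.actual_value / self.target

rec = KpiRecord("kpi-001", 100.0, "2025-Q1", 87.5,
                "https://example.org/source", "2025-04-01")
print(f"{rec.attainment():.1%}")  # 87.5%
```

With every KPI published in this shape, "is the policy on track?" becomes a one-line query instead of a manual table hunt.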
UX and developer experience
Digital Agency's service design guidance and METI's UI/UX tips matter: catalogs should be searchable, filterable, and provide sample queries. A developer portal with curl and Python examples reduces friction massively.
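As one concrete sample query: the catalog appears to be CKAN-based, so a developer portal could show how to build a search request against the standard CKAN Action API. This is a sketch under that assumption; the endpoint path and parameters are CKAN defaults, not something I have confirmed for Tokyo's deployment:

```python
from urllib.parse import urlencode

# Assumes the standard CKAN Action API is exposed at this path.
BASE = "https://catalog.data.metro.tokyo.lg.jp/api/3/action/package_search"
params = {"q": "shelter", "rows": 5}  # illustrative query
url = f"{BASE}?{urlencode(params)}"
print(url)
# A client would then fetch it, e.g.:
#   requests.get(url).json()["result"]["results"]
```

The equivalent curl one-liner belongs right next to it in the docs; that pairing is exactly the friction-reducer the guidance calls for.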
Wrap-up
These datasets are already valuable — evacuation shelters, environmental time-series, and statistical tables are gold. But to turn them into robust civic infrastructure, Tokyo should standardize metadata, provide stable APIs (geo & time-series), and add automated validation and developer tooling. Small investments (JSON-LD metadata, OGC APIs, CI pipelines) yield outsized returns for transparency, research, and app development.
A word from Okamu
I've built startups and shipped product-grade data systems — this is totally doable. Make the catalog an API-first platform, add CI for data, and watch civic innovation explode! Let's turn policy into verifiable code.
Sources
- https://portal.data.metro.tokyo.lg.jp/
- https://catalog.data.metro.tokyo.lg.jp/dataset
- https://catalog.data.metro.tokyo.lg.jp/ja/dataset/?res_format=CSV
- https://opendata.pref.saitama.lg.jp/
- https://metidx-gov.note.jp/n/n9468573c213b
- https://www.digital.go.jp/policies/servicedesign/government-system-ui
- https://zenn.dev/govtechtokyo/articles/b65dc687e50918
- https://picks-design.com/blog/5751/
- https://www.meti.go.jp/meti_lib/report/2024FY/000072.pdf
- https://www.digital.go.jp/resources/japandashboard
- https://dashboard.e-stat.go.jp/
- https://www.kantei.go.jp/
- https://www.kantei.go.jp/jp/news/index.html
- https://www.stat.go.jp/dstart/tool/