Data Platform Engineer (Data Quality & AI Workflows)

May 8, 2026

No location

Internship

GetonBoard


Job Description

Tritone Analytics is a music-technology startup building a forensic royalty auditing platform. We help artists, managers, and rights-holders identify unpaid or misreported royalties by combining deterministic financial analysis with AI-assisted workflows.
Our platform ingests real royalty statements from major distributors, labels, and publishers, normalizes them into a canonical analytical schema, and surfaces discrepancies through both deterministic audit checks and LLM-powered contract analysis.
You'll work directly on the data infrastructure that powers everything: the ingestion pipeline, schema normalization, data quality systems, and the preparation layer that feeds our AI workflows.
This is hands-on, production-grade work. Our platform processes 15M+ royalty rows across 30+ databases from 20+ label and publisher datasets. The data is real, messy, and financially consequential.
This will be a short-term internship with the intention of converting to a full-time position.

Data ingestion & normalization
Extend our profile-based CSV detection system to handle new sources and edge cases
Map inconsistent raw schemas to our canonical normalized schema, reconciling formats across a wide range of distributor and publisher statement types (see the sketch after this list)
Debug ingestion failures, detect schema variants, and write deterministic fix logic
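
To make the normalization work concrete, here is a minimal sketch of header-based profile detection, assuming hypothetical canonical field names and alias sets; our real profile registry and detection logic are considerably richer:

import csv
from pathlib import Path

# Hypothetical mapping from raw header variants to canonical field names.
HEADER_ALIASES = {
    "units": {"units", "qty", "quantity", "unit_count"},
    "net_amount": {"net", "net_amount", "net payable", "amount_net"},
    "rate": {"rate", "royalty_rate", "per_unit_rate"},
}

def detect_profile(raw_headers: list[str]) -> dict[str, str]:
    """Map each raw header to a canonical field, or raise on unknown variants."""
    mapping: dict[str, str] = {}
    for raw in raw_headers:
        key = raw.strip().lower()
        for canonical, aliases in HEADER_ALIASES.items():
            if key in aliases:
                mapping[raw] = canonical
                break
        else:
            raise ValueError(f"Unrecognized header: {raw!r}")
    return mapping

def normalize_rows(path: Path) -> list[dict[str, str]]:
    """Read a raw statement CSV and rename its columns to the canonical schema."""
    with path.open(newline="", encoding="utf-8-sig") as f:  # tolerate BOMs
        reader = csv.DictReader(f)
        mapping = detect_profile(list(reader.fieldnames or []))
        return [{mapping[k]: v for k, v in row.items()} for row in reader]

Failing loudly on an unrecognized header, rather than guessing, is the point: a silent mis-mapping is exactly the kind of bug this role exists to prevent.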
Data quality & validation
Write SQL validation queries to catch normalization errors, equation mismatches, and financial inconsistencies
Build and extend audit checks that verify royalty calculations deterministically (e.g. rate × units = net, gross × participation = artist amount; see the query sketch after this list)
Profile and document data loss during transformation
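
For a flavor of these checks, here is a hedged sketch of one audit query run through DuckDB, our analytical engine, assuming a hypothetical royalty_lines table with canonical column names:

import duckdb

# Hypothetical database file and table; real schemas vary by dataset.
con = duckdb.connect("royalties.duckdb")
mismatches = con.execute(
    """
    SELECT statement_id, line_id, rate, units, net_amount,
           rate * units AS expected_net,
           net_amount - rate * units AS delta
    FROM royalty_lines
    WHERE abs(net_amount - rate * units) > 0.005  -- half-cent rounding tolerance
    ORDER BY abs(delta) DESC
    """
).fetchall()
for row in mismatches:
    print(row)

The tolerance matters: distributor statements round at different stages, so a deterministic check has to distinguish legitimate rounding from genuine misreporting.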
AI pipeline support
Prepare and chunk contract documents (PDF/DOCX) for vector-based retrieval (see the chunking sketch after this list)
Clean and structure inputs for LLM extraction workflows (contract term extraction, rate comparison, anomaly classification)
Help maintain the read-only SQL agent that answers financial queries against live analytical databases
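
As an illustration of the chunking step, a minimal sketch assuming plain text has already been extracted from the PDF/DOCX; the sizes and overlap here are illustrative, not our production values:

def chunk_text(text: str, max_chars: int = 1500, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks, preferring paragraph boundaries."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = current[-overlap:]  # carry an overlap tail into the next chunk
        current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

Respecting paragraph boundaries keeps contract clauses intact, which matters for retrieval quality when the LLM is asked about a specific royalty term.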
Infrastructure & reliability
Write and maintain tests (4,000+ in the suite); new modules require 80%+ coverage
Improve error handling and logging in ingestion pipelines
Collaborate via GitHub: PRs, code review, and CI/CD workflows (tests, linting, type checking run on every PR)
Work within strict type-checking constraints (mypy strict mode throughout; see the sketch after this list)
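
For reference, a small example of the testing style this implies: a type-annotated pytest test that passes mypy in strict mode. The helper and its rounding rule are invented for illustration, not taken from our codebase:

from decimal import Decimal

import pytest

def artist_amount(gross: Decimal, participation: Decimal) -> Decimal:
    """Hypothetical audit helper: gross × participation = artist amount."""
    return (gross * participation).quantize(Decimal("0.01"))

@pytest.mark.parametrize(
    ("gross", "participation", "expected"),
    [
        (Decimal("100.00"), Decimal("0.50"), Decimal("50.00")),
        (Decimal("0.03"), Decimal("0.333"), Decimal("0.01")),
    ],
)
def test_artist_amount(gross: Decimal, participation: Decimal, expected: Decimal) -> None:
    assert artist_amount(gross, participation) == expected

Note the use of Decimal rather than float: in royalty math, binary floating-point rounding errors are real financial discrepancies waiting to happen.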

Must have:
Strong Python: comfortable with real production data scripts, not just notebooks
Strong SQL: analytical queries, data validation, debugging financial discrepancies
Experience wrangling genuinely messy data — inconsistent headers, encoding issues, multi-format CSVs, missing columns
Ability to trace a data bug from raw file → transformation → output and explain what went wrong
Comfortable working with Git and GitHub: branching, PRs, code review, and CI/CD pipelines
Comfortable reading and extending other people's code in a mid-size codebase
Comfortable using AI tools (e.g., ChatGPT, Claude, Copilot) as part of your development workflow: for debugging, data analysis, or accelerating implementation
Nice to have:
DuckDB (our primary analytical engine — a real advantage)
Parquet / columnar formats (PyArrow, Polars)
Vector databases or RAG pipelines
Experience with financial or accounting data
Music industry domain knowledge (royalty statements, publishing vs master, mechanical vs performance)
pytest, mypy, type-annotated Python
Who Thrives Here
You enjoy finding the bug in a pipeline by reading data, not just logs. You're comfortable with "the schema changed again" as a normal day. You care about correctness — in this domain, a normalization error means an artist doesn't get paid. You write tests because you've been burned before, not because someone asked you to.
We're a small team moving fast on real problems in a notoriously opaque industry.