Orca - a modern data platform

Orca is a template for building a production-ready and agentic-enabled data warehouse covering the needs of 99% of data teams. It leverages a local-first development workflow that scales to the cloud using best-in-class, free and open-source tools.

Orca is a set of patterns and a reference implementation of a modern data stack. It provides a comprehensive framework for data ingestion, transformation, modeling, analytics, machine learning and reporting.

Orca is currently in early development. This README serves as a roadmap and does not reflect the current implementation status.

Feedback and contributions are welcome!

Design Philosophy

  • Open: Rely on open code, standards, formats and protocols.
  • Composable: Components can be easily replaced, extended, or removed.
  • Declarative: Code-based tools enable modern development practices and agentic workflows, and improve interoperability and reproducibility.

Core Value Proposition

  • Fits most data teams' needs: Covers the main lifecycle of data in a data warehouse (orchestration and observability, ingestion, transformation, reporting, machine learning).
  • Production-ready: Provides a solid foundation for production workloads (environment management, deployments, CI/CD, secrets management).
  • Modern development practices: Version control, changes that are one commit away, CI/CD, testing, code review, and more.
  • Agentic-ready: Enables agentic behavior by giving agents the right context, tools, and security boundaries.
  • Quick onboarding: Get up and running quickly with your data sources, with a clear path to building your data warehouse.

Architecture

Orca uses the following stack:

| Role | Tool | Purpose |
| --- | --- | --- |
| Environment | uv | Python environment management. |
| Orchestration | Dagster | Orchestrates the asset graph, providing observability and scheduling. |
| Ingestion | dlt | Handles robust, schema-evolving data loading from APIs and external sources. |
| Compute | DuckDB | Provides serverless, in-process SQL compute for fast analytical queries. |
| Storage | DuckLake | Manages the data lake layer, decoupling storage (S3/Parquet/Iceberg) from compute. |
| Transformation | SQLMesh | Brings CI/CD, virtual environments, and column-level lineage to SQL transformations. |
| Modeling | Malloy | Defines a rich, composable semantic layer. |
| Reporting | Evidence | Generates static BI reports using Markdown and SQL. |
| Data Apps | Streamlit | Builds interactive data applications in pure Python. |
| Notebooks | Marimo | Provides a reactive, reproducible notebook environment for exploration and ML. |

Agentic-Ready

Orca is designed to be agentic-ready, allowing agents to operate autonomously across the stack:

  • Add a new data source from API docs.
  • Create a new transformation to clean and join data.
  • Define a new semantic model to represent a business concept.
  • Create a Streamlit dashboard to share insights with stakeholders.

The agentic-ready architecture is enabled by an infrastructure-as-context approach:

  1. Progressive Context Loading: Using AGENTS.md files at the root and within subfolders like /ingestion or /reporting, agents gain situational awareness of the directory structure, local conventions, and coding best practices.
  2. Documentation access: The Context7 MCP grants agents direct access to external documentation, ensuring they can reference the official specs for every tool in the stack.
  3. Specialized Skills: Dynamically loaded instruction sets and guidance for specific workflows:
    • project-onboarding: Guides users through understanding the stack and connecting their data, up to a working dashboard.
    • dev-ops: Assists users with DevOps-related questions (environment management, deployments, CI/CD, secrets management).
    • data-ingestion: Scaffolds advanced dlt pipelines with schema evolution, testing, and monitoring.
    • data-transformation:
      • Creates SQLMesh transformation pipelines.
      • Enforces SQLMesh best practices.
      • Implements and tests SCD Types 1, 2, and 3, and Star/Snowflake schemas.
    • data-analyst-python:
      • Queries the data warehouse using Python.
      • Creates plots using Altair.
      • Transforms data using Polars.
      • Creates Streamlit dashboards.
    • data-scientist:
      • Creates advanced scikit-learn pipelines and models.
      • Explains model results using PDP, SHAP, and LIME.
  4. Security Boundaries: Agents operate within a role-based, least-privilege execution environment to ensure safe command execution.
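To illustrate the progressive-context-loading idea, a subfolder-level AGENTS.md might look like the following. This content is a hypothetical example of the pattern, not the actual files shipped with Orca:

```markdown
# ingestion/AGENTS.md (hypothetical example)

## Purpose
dlt pipelines that load external sources into the `raw` dataset in DuckDB.

## Conventions
- One pipeline module per source.
- Use `write_disposition="merge"` with a declared primary key for incremental loads.
- Secrets come from `.dlt/secrets.toml`; never hardcode credentials.

## Commands
- Sync the environment before running anything: `uv sync`
```

An agent working in `/ingestion` loads this file on top of the root AGENTS.md, so it inherits project-wide rules while picking up the folder's local conventions.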

Get started

  1. Click on Use this template and Create a new repository.
  2. Clone this new repository and open it in VS Code.
  3. Read .agents/skills/project-onboarding/SKILL.md, or simply ask an agent to Get onboarded. If properly configured, it will quickly help you integrate your data or get to know the architecture better. For a quick demo, see Orca-demo.

Staying Up-to-Date

To incorporate the latest features, improvements and fixes from the core template, follow this workflow:

```bash
# Ensure the upstream remote is configured (ignore the error if it already exists)
git remote add upstream https://github.com/mathisdrn/Orca.git || true

# Fetch and merge the latest changes
git fetch upstream
git merge upstream/main --allow-unrelated-histories

# Sync the Python environment
uv sync
```

After running these commands, resolve any merge conflicts manually.

References

Prior Art & Inspiration

Real-world implementations of the modern data stack that inspired this project.

Core Concepts & Philosophy

Foundational reading to understand the architectural decisions behind this platform.

Emerging Standards & Interoperability

Future-facing protocols and standards relevant to the platform's roadmap.
