A data generator, simply put, is a function or a set of functions that generates data, either randomly or according to specific rules and patterns. Data generators make it easy to create fake datasets for development, testing, or simulation purposes.
Personally, I love to model different business domains and build my own simulations of those domains -- creating datasets that represent the operations happening within them. Whether you're a beginner or a seasoned professional working on pet projects or trying to learn something new, knowing how to build your own data generator helps eliminate the scarcity of datasets in niche or unfamiliar domains.
In this episode of ruff notes, I'll walk you through a fun design problem and my approach to building a simple, reusable data generator. If you want to code along, you can find the code here.
The Quick and Dirty Way
A data generator can be as simple as a function that creates data and performs a write operation against a destination. Most times when I'm trying to get a fake dataset up and running, I reach for the faker library, throw together a few helper functions, and write out an array of records directly into a database or a file. It's fast, it works, and it gets me to "something useful" without overthinking it.
Let's say we want to simulate orders for a fictional ecommerce system, and we've already got a Postgres table called ecommerce.orders with the following columns:
id
customer_name
item_id
quantity
total_amount
We can slap together a generator that inserts, updates, and deletes records like this:
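A sketch of that quick-and-dirty approach. To keep it runnable anywhere, SQLite from the standard library stands in for Postgres (swap in psycopg2.connect for the real thing), and a hardcoded name list stands in for faker's fake.user_name():

```python
import os
import random
import sqlite3
import tempfile
import uuid

# Throwaway database file; with Postgres this would be psycopg2.connect(...)
DB_PATH = os.path.join(tempfile.gettempdir(), f"orders_{uuid.uuid4().hex}.db")

NAMES = ["ada", "grace", "linus", "margaret"]  # stand-in for fake.user_name()

def setup():
    conn = sqlite3.connect(DB_PATH)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(id TEXT PRIMARY KEY, customer_name TEXT, item_id INTEGER, "
        "quantity INTEGER, total_amount REAL)"
    )
    conn.commit()
    conn.close()

def generate_orders(n=10):
    for _ in range(n):
        # Smell: a brand-new connection for every single record.
        conn = sqlite3.connect(DB_PATH)
        order_id = str(uuid.uuid4())
        # Smell: table and column layout hardcoded inline.
        conn.execute(
            "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
            (order_id, random.choice(NAMES), random.randint(1, 100),
             random.randint(1, 5), round(random.uniform(5.0, 500.0), 2)),
        )
        # Smell: updates and deletes applied inline, in the same pass.
        if random.random() < 0.3:
            conn.execute("UPDATE orders SET quantity = quantity + 1 WHERE id = ?",
                         (order_id,))
        if random.random() < 0.1:
            conn.execute("DELETE FROM orders WHERE id = ?", (order_id,))
        conn.commit()
        conn.close()

setup()
generate_orders(20)
```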
It's simple and does the job. But as you can already guess, there's a lot going on in that one function. We're creating data, opening a new DB connection for every record, writing directly to a hardcoded table, and applying updates and deletes all inline.
A Little More Structure
We can start breaking things up -- one function for generating data, a few more for inserting, updating, and deleting. It doesn't solve everything, but it starts to separate concerns and gives us room to grow.
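A sketch of that refactor, again with SQLite standing in for Postgres and a name list standing in for faker:

```python
import random
import sqlite3
import uuid

NAMES = ["ada", "grace", "linus", "margaret"]  # stand-in for faker

def gen_order_data():
    # One job: build a record. No SQL, no connections.
    return {
        "id": str(uuid.uuid4()),
        "customer_name": random.choice(NAMES),
        "item_id": random.randint(1, 100),
        "quantity": random.randint(1, 5),
        "total_amount": round(random.uniform(5.0, 500.0), 2),
    }

def insert_orders(conn, records):
    # One connection, many inserts -- but still a hardcoded table and columns.
    conn.executemany(
        "INSERT INTO orders (id, customer_name, item_id, quantity, total_amount) "
        "VALUES (:id, :customer_name, :item_id, :quantity, :total_amount)",
        records,
    )
    conn.commit()

def update_order(conn, order_id, quantity):
    conn.execute("UPDATE orders SET quantity = ? WHERE id = ?", (quantity, order_id))
    conn.commit()

def delete_order(conn, order_id):
    conn.execute("DELETE FROM orders WHERE id = ?", (order_id,))
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, customer_name TEXT, "
             "item_id INTEGER, quantity INTEGER, total_amount REAL)")
records = [gen_order_data() for _ in range(10)]
insert_orders(conn, records)
update_order(conn, records[0]["id"], 99)
delete_order(conn, records[1]["id"])
```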
This version introduces:
A clear data generator (gen_order_data)
Isolated functions for insert, update, and delete
A single connection for all inserts (yay!)
While this feels more loosely coupled, we still have a ton of assumptions embedded in it. For instance:
We assume the tables already exist.
We assume the schema is fixed.
We hardcode table and column names.
We can't lift the insert/update/delete functions and use them elsewhere.
This means that every time you want to simulate a different domain or rule set, you're back to rewriting everything from scratch.
Toward a Reusable Design
As soon as we change anything -- a column name, a table structure, or even the target domain -- we find ourselves rewriting or hacking through our functions all over again. That's a signal: the code is too tightly coupled to the current use case.
We can do better.
By stepping back and applying some basic software design principles, we open up the possibility for something more flexible -- a system that's loosely coupled, reusable, and easier to adapt.
If we strip away the domain-specific details, what we're really trying to do isn't that complicated. At a high level, the flow we care about looks like this:
Define the schema and ensure the table(s) exist
Generate one or more records
Insert those records into a database
Optionally update or delete a subset of those records
That's it.
So here's the question I asked myself: What if every column could define its own logic for how values are generated -- and the record generator didn't care what those values were?
What if structure and behavior lived side-by-side? With that mindset, we can shift from hardcoded record-building functions to something more declarative, modular, and expressive -- and that starts with rethinking how we define a column.
To do that, we introduce a simple but powerful class: ColumnDefinition. Each column in our table becomes an instance of this class, carrying two things -- a name and a function that knows how to generate a value for that column.
This decouples the "what" from the "how."
Rather than hardcoding column logic into a generator function, we define columns as self-contained units. Each one knows how to produce its own value, and our record generator simply loops through the list and assembles a dictionary.
Here's what that looks like:
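A minimal version, assuming each generator is a zero-argument callable:

```python
class ColumnDefinition:
    """A column's name paired with the function that generates its value."""
    def __init__(self, name, generator):
        self.name = name
        self.generator = generator  # zero-argument callable

def generate_record(columns):
    # The generator doesn't know (or care) what the columns are --
    # it just asks each one to produce its own value.
    return {col.name: col.generator() for col in columns}
```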
Let's bring this to life with our ecommerce example.
Previously, we generated each field inline using fake.user_name(), random.randint(...), and so on, all buried inside a single function. Now, we define each column's behavior independently using ColumnDefinition, and pass them into our record generator.
This makes it incredibly easy to tweak the structure of the data we're generating. Want to simulate a different domain? Swap out the column definitions -- the record generator doesn't change.
Here's how we might define the orders table for our ecommerce use case:
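One way it might look, with the small ColumnDefinition class repeated so the snippet stands alone and a hardcoded name list in place of fake.user_name():

```python
import random
import uuid

class ColumnDefinition:  # repeated here so this snippet runs on its own
    def __init__(self, name, generator):
        self.name = name
        self.generator = generator

def generate_record(columns):
    return {col.name: col.generator() for col in columns}

NAMES = ["ada", "grace", "linus", "margaret"]  # stand-in for fake.user_name()

ORDER_COLUMNS = [
    ColumnDefinition("id", lambda: str(uuid.uuid4())),
    ColumnDefinition("customer_name", lambda: random.choice(NAMES)),
    ColumnDefinition("item_id", lambda: random.randint(1, 100)),
    ColumnDefinition("quantity", lambda: random.randint(1, 5)),
    ColumnDefinition("total_amount", lambda: round(random.uniform(5.0, 500.0), 2)),
]

record = generate_record(ORDER_COLUMNS)
```

Swapping domains now means swapping this list -- nothing else changes.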
Reusable Mutations: Insert, Update, Delete
Our earlier insert and update functions were hardcoded -- they assumed specific column names, specific update logic, and a known table. That works when everything is fixed, but it quickly falls apart the moment we need to adapt.
What we really want is a component that handles mutations (insert, update, delete) generically. Something that can look at a dictionary of values and write it to the right table -- without caring what the columns are.
That's where the MutationEngine comes in. It takes care of translating our records into SQL statements dynamically. It knows which table to write to, and which column is the primary key, but it doesn't assume anything about the data structure itself. Even better: it works across any domain, as long as you pass in the right config and a well-formed record.
Here's a first look:
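This sketch assumes SQLite-style `?` placeholders (with psycopg2 you'd use `%s`); the table name and primary key come in through the constructor, and the column names come from the record itself:

```python
import sqlite3

class MutationEngine:
    """Generic insert/update/delete: table and primary key come from config,
    column names come from whatever record you hand it."""
    def __init__(self, conn, table, primary_key="id"):
        self.conn = conn
        self.table = table
        self.pk = primary_key

    def insert(self, record):
        cols = ", ".join(record)
        placeholders = ", ".join("?" for _ in record)
        self.conn.execute(
            f"INSERT INTO {self.table} ({cols}) VALUES ({placeholders})",
            tuple(record.values()),
        )
        self.conn.commit()

    def update(self, key, changes):
        assignments = ", ".join(f"{col} = ?" for col in changes)
        self.conn.execute(
            f"UPDATE {self.table} SET {assignments} WHERE {self.pk} = ?",
            (*changes.values(), key),
        )
        self.conn.commit()

    def delete(self, key):
        self.conn.execute(
            f"DELETE FROM {self.table} WHERE {self.pk} = ?", (key,)
        )
        self.conn.commit()
```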
Putting It All Together
We now have all the moving parts:
A ColumnDefinition class that defines how each field is generated
A generate_record() function that turns those definitions into real data
A MutationEngine that inserts, updates, and deletes records without caring what the table or columns look like
So let's wire everything up and take it for a spin. We'll define our column schema, use it to generate some fake ecommerce order data, insert the records into our database, and then perform a few random updates and deletions -- all using the same reusable tools.
Here's what that looks like in practice:
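The pieces from the earlier sections are repeated here so the snippet runs standalone, with SQLite standing in for Postgres and a small name list for faker:

```python
import random
import sqlite3
import uuid

# --- the reusable pieces, repeated so this runs on its own ---
class ColumnDefinition:
    def __init__(self, name, generator):
        self.name, self.generator = name, generator

def generate_record(columns):
    return {c.name: c.generator() for c in columns}

class MutationEngine:
    def __init__(self, conn, table, primary_key="id"):
        self.conn, self.table, self.pk = conn, table, primary_key
    def insert(self, record):
        cols = ", ".join(record)
        placeholders = ", ".join("?" for _ in record)
        self.conn.execute(f"INSERT INTO {self.table} ({cols}) VALUES ({placeholders})",
                          tuple(record.values()))
    def update(self, key, changes):
        assignments = ", ".join(f"{c} = ?" for c in changes)
        self.conn.execute(f"UPDATE {self.table} SET {assignments} WHERE {self.pk} = ?",
                          (*changes.values(), key))
    def delete(self, key):
        self.conn.execute(f"DELETE FROM {self.table} WHERE {self.pk} = ?", (key,))

# --- wiring it up for the ecommerce orders use case ---
NAMES = ["ada", "grace", "linus"]  # stand-in for faker
columns = [
    ColumnDefinition("id", lambda: str(uuid.uuid4())),
    ColumnDefinition("customer_name", lambda: random.choice(NAMES)),
    ColumnDefinition("item_id", lambda: random.randint(1, 100)),
    ColumnDefinition("quantity", lambda: random.randint(1, 5)),
    ColumnDefinition("total_amount", lambda: round(random.uniform(5.0, 500.0), 2)),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, customer_name TEXT, "
             "item_id INTEGER, quantity INTEGER, total_amount REAL)")
engine = MutationEngine(conn, "orders")

records = [generate_record(columns) for _ in range(20)]
for r in records:
    engine.insert(r)

# simulate a few random changes against what we just inserted
for r in random.sample(records, 5):
    engine.update(r["id"], {"quantity": random.randint(6, 10)})
for r in random.sample(records, 3):
    engine.delete(r["id"])
conn.commit()
```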
Managing the Schema
So far, we've made a lot of progress. We can define columns flexibly, generate full records, and mutate them without ever writing SQL by hand. But there's still one major assumption holding this whole system together:
The table already exists.
And sure -- maybe that's fine when you're hacking around locally. But if we're serious about building a reusable tool, we can't rely on a manually pre-created schema. The generator should be able to stand on its own, spin up the table if needed, and move on.
This leads us to the final piece of the puzzle: automated schema creation.
To automate schema creation, we extend the ColumnDefinition class to carry not just how a column generates its values, but also how it should be defined in SQL. That means adding metadata like the SQL type (TEXT, UUID, INTEGER) and any constraints (PRIMARY KEY, NOT NULL, etc.).
Once each column knows how to describe itself, we can loop through the list and generate a valid CREATE TABLE statement on the fly.
The SchemaManager handles this. It reads your column definitions, checks if the schema exists, and creates the table if it's missing -- all without you needing to write raw SQL.
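A sketch of both extensions. SQLite has no CREATE SCHEMA, so the schema check here collapses into CREATE TABLE IF NOT EXISTS; on Postgres you'd also run CREATE SCHEMA IF NOT EXISTS first:

```python
import sqlite3
import uuid

class ColumnDefinition:
    """Extended: each column also knows how to describe itself in SQL."""
    def __init__(self, name, generator, sql_type="TEXT", constraints=""):
        self.name = name
        self.generator = generator
        self.sql_type = sql_type        # e.g. TEXT, UUID, INTEGER
        self.constraints = constraints  # e.g. PRIMARY KEY, NOT NULL
    def to_sql(self):
        # One fragment of the CREATE TABLE, e.g. "quantity INTEGER NOT NULL"
        return f"{self.name} {self.sql_type} {self.constraints}".strip()

class SchemaManager:
    """Builds and runs the CREATE TABLE statement from column definitions."""
    def __init__(self, conn, table):
        self.conn = conn
        self.table = table
    def ensure_table(self, columns):
        cols = ", ".join(c.to_sql() for c in columns)
        # On Postgres, run CREATE SCHEMA IF NOT EXISTS here first.
        self.conn.execute(f"CREATE TABLE IF NOT EXISTS {self.table} ({cols})")
        self.conn.commit()

columns = [
    ColumnDefinition("id", lambda: str(uuid.uuid4()), "TEXT", "PRIMARY KEY"),
    ColumnDefinition("customer_name", lambda: "ada", "TEXT", "NOT NULL"),
    ColumnDefinition("quantity", lambda: 1, "INTEGER", "NOT NULL"),
]
conn = sqlite3.connect(":memory:")
SchemaManager(conn, "orders").ensure_table(columns)
```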
This gives our data generator one more superpower: it's now self-sufficient.
Going End-to-End: From Schema to Mutations in Batches
We now have all the pieces we need to build an end-to-end data generator -- one that can:
Automatically create its target table
Generate realistic records based on column definitions
Insert those records into the database
Perform updates and deletes to simulate real-world data changes
To bring it all together, we'll extend our example slightly. Instead of inserting one record at a time, we'll introduce a simple batch_generator function that can produce records in bulk.
This not only improves performance but also mirrors how data typically arrives in production systems -- in bursts or batches.
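A minimal batch_generator, with the ColumnDefinition pieces repeated so it runs standalone:

```python
import random
import uuid

class ColumnDefinition:  # repeated here so this snippet runs on its own
    def __init__(self, name, generator):
        self.name, self.generator = name, generator

def generate_record(columns):
    return {c.name: c.generator() for c in columns}

def batch_generator(columns, batch_size, num_batches):
    # Yield lists of records rather than one record at a time,
    # mirroring how data tends to arrive in production: in bursts.
    for _ in range(num_batches):
        yield [generate_record(columns) for _ in range(batch_size)]

columns = [
    ColumnDefinition("id", lambda: str(uuid.uuid4())),
    ColumnDefinition("quantity", lambda: random.randint(1, 5)),
]
batches = list(batch_generator(columns, batch_size=50, num_batches=4))
```

Each yielded batch can then go straight into an executemany-style bulk insert instead of fifty individual round trips.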
With everything wired up -- schema manager, batch generator, mutation engine -- we now have a flexible, plug-and-play data generation system that's ready for real-world use.
Wrapping Up
That's it -- a complete, end-to-end walkthrough of building a reusable data generator. We've gone from a tightly coupled script to a cleanly abstracted system with:
Declarative column definitions
A generic record generator
Flexible insert/update/delete logic
Automated schema creation
Batch support
These are the building blocks -- and now they're yours to build on.
If you're thinking of taking it further, here are some ideas:
Package this into a mini Python library
Add unit tests and integration tests (practice TDD)
Build CLI commands around your generators
Add support for other destinations (CSV, JSON, object storage)
Layer in constraints, defaults, or schema evolution
Try it out with a domain you're curious about -- health, finance, gaming, you name it
Whether you're mocking up data for integration tests or simulating an entire business process, you now have a flexible foundation that can grow with your imagination.
Go build something cool!
Until next time -- keep hacking!!
Thanks for reading Data Engineering Basics! Subscribe for free to receive new posts and support my work.