If you're like me, you probably also have many scripts lying around which look like this:
require 'csv'
require 'date'

# Dump daily Jira metrics for 2024 and 2025 into a single CSV file.
CSV.open('jira.csv', 'w') do |csv|
  [2024, 2025].each do |year|
    days_in_year = Date.new(year, 12, 31).yday # 366 in leap years
    (1..days_in_year).each do |day|
      date = Date.ordinal(year, day)
      break if date > Date.today

      # `jira` is whichever API client wrapper you happen to use.
      jira.metrics(date.strftime('%Y-%m-%d')).each do |result|
        csv << result
      end
    end
  end
end
Scripts that collect massive amounts of data from various sources across your company's ecosystem: Jira, GitHub, Google Drive, and Confluence.
These scripts are your probes, the ones that let you make data-driven decisions: decisions that help you save costs, improve team performance, and optimize processes.
But just like me, you probably also don't have a good place to store all that data.
You might have been using CSV files, JSON files, or spreadsheets. You might have later upgraded to using a database like MySQL or PostgreSQL. And if you're lucky, you might have even been using a data warehouse like Snowflake or BigQuery.
But none of these solutions work well for busy engineering managers. Files are clumsy to manage, databases can be rigid and require maintenance, and data warehouses are expensive and complex.
Until now!
What is DuckLake?
DuckLake gives you your own data lake which you can carry in your pocket. In essence, it is a data lake specification which uses Parquet files as the storage format and a database to store its metadata. For a more detailed (and more accurate) explanation, you should watch Hannes Mühleisen and Mark Raasveldt introducing DuckLake.
Nothing magical about it, which is another reason I love it so much. It's simple, lightweight, and easy to use.
First let's install DuckDB. There are a gazillion ways to do this. In my case, I use Homebrew on macOS, so I can install it like this:
brew install duckdb
Once installed, we can launch it with:
duckdb
Now we’re in DuckDB, but we still need to install the DuckLake extension, which can be done as follows:
INSTALL ducklake;
And to start using it:
ATTACH 'ducklake:metadata.ducklake' AS ducklake;
USE ducklake;
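By default, DuckLake will put its Parquet files right next to the metadata file. If I’m reading the DuckLake docs right, you can also pass a DATA_PATH option when attaching (instead of the plain ATTACH above) to put them somewhere else; the folder name below is just an example:
ATTACH 'ducklake:metadata.ducklake' AS ducklake (DATA_PATH 'lake_files/');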
And we’re off to the races 🏇
Eating data
The first thing you want to do is start ingesting data into your personal data lake. For the sake of the example, let’s assume you have some data you scraped from Jira, saved in a CSV file called jira_2024_2025.csv.
Let’s create a table called jira and ingest the data into it.
CREATE TABLE ducklake.jira AS
SELECT * FROM read_csv_auto('jira_2024_2025.csv');
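read_csv_auto will sniff the delimiter, header, and column types for you. If it guesses a type wrong, you can swap in read_csv and spell out the problematic columns yourself. For example, assuming the export has a created_at column (a made-up name, yours will differ):
CREATE TABLE ducklake.jira AS
SELECT * FROM read_csv('jira_2024_2025.csv', header = true, types = {'created_at': 'DATE'});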
We can verify that the table has been created by running:
SHOW TABLES;
And verify that the data is in place by running:
FROM ducklake.jira;
This is DuckDB’s shorthand for SELECT * FROM ducklake.jira and will return all the rows we just ingested.
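And of course you’re not limited to dumping the whole table; this is plain SQL, so you can start slicing right away. Assuming the export has status and resolved_at columns (made-up names again), something like this gives you resolved tickets per month:
SELECT date_trunc('month', resolved_at) AS month,
       count(*) AS resolved_tickets
FROM ducklake.jira
WHERE status = 'Done'
GROUP BY month
ORDER BY month;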
If we look in the file system, we can see that DuckDB has created a couple of files and folders. The metadata.ducklake file is the actual DuckDB database which contains all the necessary metadata. Next to it is a folder called metadata.ducklake.files which contains all the Parquet files. At this point you should see a folder called main, under which you will find another folder called jira. This represents the table we currently have in our data lake, and under it you will see a single Parquet file.
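You don’t even have to leave the DuckDB shell to peek at those files; DuckDB’s glob function will list them for you:
SELECT * FROM glob('metadata.ducklake.files/main/jira/*.parquet');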
Eating more data
You realize that there is more value in also having data from 2023, so you go ahead and generate that CSV as well. To import it, run the following:
INSERT INTO ducklake.jira
SELECT * FROM read_csv_auto('jira_2023.csv');
When we look in the file system again, we should see a second Parquet file appear under metadata.ducklake.files/main/jira.
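A quick sanity check never hurts; the row count should have grown by exactly the number of rows in jira_2023.csv:
SELECT count(*) FROM ducklake.jira;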
Road to DataViz
As engineering managers, we’ve been collecting all this data in order to uncover issues and tell a story. This means we need to get the data into our favorite data visualization tool (Tableau, Power BI, or Livebook). Since DuckLake is a relatively new technology, there aren’t that many adapters for it yet. But since we use DuckDB under the hood, we can export the data in any format we’d like. I prefer Parquet again, since it’s lightweight and all the DataViz tools mentioned above have connectors for it.
We can export a table like so:
COPY ducklake.jira TO 'jira.parquet' (FORMAT parquet);
This will create a new file called jira.parquet on the file system, which we can then point our DataViz tool of choice at.
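And if you want to double-check the export before firing up the DataViz tool, DuckDB will happily query the Parquet file directly:
SELECT count(*) FROM 'jira.parquet';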
Conclusion
DuckLake makes it incredibly easy for engineering managers to collect, store, and analyze data without the usual headaches. You don't need to rely on the cloud. Everything can live locally on your machine, making it both private and portable. If you ever want to scale up, you can simply move your Parquet files to blob storage and host the metadata database on a server, giving you flexibility as your needs grow (which I doubt you will need anytime soon).
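For the curious: going by the DuckLake docs, that scaled-up setup is still just an ATTACH statement, with the metadata catalog living in something like PostgreSQL and the data files in a bucket. Roughly along these lines (the connection string and bucket are placeholders, and you’ll need the relevant DuckDB extensions installed):
ATTACH 'ducklake:postgres:dbname=ducklake host=my-metadata-server' AS ducklake
    (DATA_PATH 's3://my-bucket/lake/');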
DuckLake is super lightweight, fast, and efficient. You can store years of data in just a few megabytes, and querying or exporting your data is a breeze. With support for open formats like Parquet and seamless integration with popular data visualization tools, you get all the power of a modern data lake without the complexity or cost. For busy engineering managers who want actionable insights without the overhead, DuckLake is a game changer.