Blog

Distributed Dashboarding with DuckDB-Wasm, could it be the future of BI?

16 Apr, 2024
Xebia Background Header Wave

A blog from 2023: “BIG DATA IS DEAD” by Jordan Tigani, inspired me to dive a bit into the DuckDB ecosystem and Motherduck it’s cloud offering. Tigani is a former founding engineer of Google BigQuery, and worked for more than 10 years on the product. In his blog “BIG DATA IS DEAD”, Tigani explains that in his eyes the Big Data era turned out differently than expected. Lots of organizations are building these giant data warehouses, leveraging cloud compute to run their supposably big data workloads in the cloud. In reality it turns out that about 90% of the queries being executed on BigQuery, process less than 100 MB of data. Don’t get me wrong, Big Data does exist, however it is not being queried or processed as often as you would expect.

In the meantime over the past ten years the specs of an average laptop increased drastically, it is rather common than an exception to have 16 or 32 GB of memory and a powerful multi-core processor in a standard business laptop. Doesn’t it make more sense, especially for these smaller workloads, to be run on your local machine? A machine that is already paid for. Of course this will bring other challenges, I mean when do you decide to run something locally or in the cloud? And do you even want to decide this yourself? Isn’t the cost of transferring data over a network (referring to egress costs in the cloud) more expensive than the actual compute for the workload itself? Unfortunately in this blog post you will not find answers to these questions, however the concept seems promising. I would highly recommend giving Tagani’s article a read, to get more feeling for the figures and the point he is trying to make.

A company called Motherduck recently raised $100 Million to work on their cloud offering of DuckDB, and is working on solutions and answers around the questions I mentioned before. DuckDB is a blazingly fast in-process analytical database system, written in C++. Motherduck is not "just a cloud offering" of DuckDB, they actually offer some cool features on top of DuckDB. Motherduck their cloud offering is built around the concept of hybrid query execution. With hybrid query execution Motherduck decides to run a query locally, partly locally and partly in the cloud, or fully in the cloud. With the embedded nature of DuckDB and the arrival of DuckDB-Wasm this is being made possible.

With DuckDB-Wasm it is possible to run DuckDB entirely within the browser (powered by WebAssembly). This appears to be a pioneering step towards a possible new era in the data warehousing and analytics domain. The idea is to do more data processing locally on your own machine, instead of spending money on compute in the cloud. Motherduck is really focusing on this concept, and I am quite excited what the future will bring, and if it will change the way how we currently handle BI use-cases on supposably big data 😉

Distributed Dashboarding 📊

The term Distributed Dashboarding might have caught your attention, and triggered you into reading this blog post in the first place. I have to be honest here, it is a term I just made up to get your attention. However the concept is in my opinion quite promising, and I think we will see more and more examples of it in the future. The idea behind Distributed Dashboarding is to send a part, or in some cases the full dataset towards the client together with the (analytical) processing logic. The browser downloads this dataset once, and then stores it into the cache of the browser. Because of DuckDB-Wasm and the embedded nature of DuckDB, the browser can load this dataset into DuckDB. This offers a blazingly fast SQL interface on top of your raw dataset. All the slicing and dicing of the data can happen within the browser on your local machine.

Of course there are a lot of challenges around this concept, and it will definitely not fit all the existing BI and/or OLAP use-cases. However I think there are use-cases where this can work quite well, especially those were you do not necessarily do analysis on extremely large amounts of data. I will go through some of the challenges that I can think of. For example what if the dataset is quite big or the network latency is not that good? Maybe it is possible to send pre-aggregated data for the dashboard in the first place, and then download the entire dataset (or the part that you need) in the background? Maybe if the client detects slow network latency, it can decide to do the execution in the cloud. Or have some type of hybrid execution, a part of the query gets executed locally and another part in the cloud. For example only the right hand side of a join operation, or a subquery, can get executed in the cloud and combined with your local data. I also think there are a lot of use-cases where most users only look at the last week or even last days of data. Maybe it is possible to include this (smaller) timespan of data with the dashboard and do the analytical processing for most users locally. However when a user wants to look further back than the provided timespan, the query gets executed against the cloud warehouse. Another challenge might be the retrieval of increments. For instance if I have the last month of data locally, and I open the dashboard tomorrow, there should be a mechanism that downloads the new day of data and combines it with the local dataset.

Hopefuly the Motherduck extension for DuckDB will offer solutions for most of these problems. When developing BI dashboards you don’t want to really write in-depth logic about when to execute something locally, in the cloud, or use some sort of hybrid execution.

Example of Distributed Dashboarding using React and DuckDB-Wasm

To bring the concept of Distributed Dashboarding a bit more into a real-life example, and get my hands dirty on using DuckDB-Wasm, I decided to build a small dashboard application using React and DuckDB-Wasm. A colleague worked on an internal use-case by which we track Github contributions of our colleagues, and offer a dashboard to gain insights into for example which OSS project we are most active. The project uses dbt with the dbt-duckdb adapter, and a self made dbt plugin which fetches data from the Github API. Eventually dbt writes all data to a .duckdb file as an artifact. This .duckdb file contains all the github contributions of our colleagues, and is shipped with the web-application so it can be used in the browser with DuckDB-Wasm.

Github Contributions Dashboard using DuckDB-Wasm Try out the github-contributions dashboard yourself

The cool thing is that the browser only has to download the github-contributions.duckdb file once, and from then on it is stored within browser cache. All the filtering and aggregations going on in the dashboard are being done by SQL statements executed against the local DuckDB-Wasm instance. Github Contributions Dashboard using DuckDB-Wasm

To make this possible I used the duckdb-wasm-kit which makes it a bit easier to use DuckDB-Wasm within React applications. The duckdb-wasm-kit offers hooks and utilities to make it easier to use DuckDB-Wasm within a react application. In case your are curious of the exact implementation, the source code for the github-contributions project can be found here.

To install the duckdb-wasm-kit you can use the following command:

npm install duckdb-wasm-kit

To load an existing .duckdb file into DuckDB-Wasm it is important that the .duckdb file is accessible over HTTP(S) via fetch. This can be done by calling the initializeDuckDb function within the useEffect hook in the root of your application.

import { DuckDBConfig } from "@duckdb/duckdb-wasm";
import { initializeDuckDb } from "duckdb-wasm-kit";

const MyApp = () => {
    useEffect(() => {
        const config: DuckDBConfig = {
            path: "./sample_database.duckdb"
        }
        initializeDuckDb({ config, debug: true });
    }, []);
    ...
}

Congratulations! 🎉

You now know how to load in an existing .duckdb database file into your React web application. The next step is to actually query some data, and display this data in the UI. To do this we are going to use the useDuckDbQuery hook. It is good to know that DuckDB-Wasm uses Apache Arrow as their data protocol for the data import and all query results. Arrow is a database-friendly columnar format that is organized in chunks of column vectors, called record batches and that support zero-copy reads with only a small overhead. The npm library apache-arrow implements the Arrow format in the browser and is already used by other data processing frameworks.

import { useDuckDbQuery } from "duckdb-wasm-kit";

const MyComponent = () => {
    const { arrow, loading, error } = useDuckDbQuery(`
        SELECT * FROM movies
        WHERE actor = 'John Cleese';
    `);

    if (loading) {
        return <SomeLoadingSpinner />;
    }

    if (error) {
        // some error handling
        return <SomeErrorComponent />;
    }

    if (arrow) {
        return <SomeDataRenderComponent arrow={arrow} />
    }
}

Another option is to use the useDuckDb hook and gain access to all the AsyncDuckDb methods.

t.b.d.

Conculsion

t.b.d.

Photo by NASA on Unsplash

Questions?

Get in touch with us to learn more about the subject and related solutions

Explore related posts