Our journey to become data-driven with Metabase and Redshift

Developers May 10, 2022

Storyblok is the first headless CMS that works for developers & marketers alike.

The journey began over a year ago when we started to face difficulties in keeping up with monitoring our business performances due to the growing amount of data. It became almost impossible to maintain every piece of data manually since we hadn’t decided on a proper data infrastructure yet.

Back then, we had enough data sources to create meaningful dashboards, but the amount of manual work required to preprocess, aggregate, and visualize the data was not scalable for us. The manual collection process required to report our business performance through Key Performance Indicators (KPIs) involved many hours of manual data curation, filtering, aggregation, and copying and pasting to calculate and visualize them on a spreadsheet. That spreadsheet was then used to inform stakeholders about the success and growth. At that time, usual KPIs would include the development of our Annual Recurring Revenue (ARR), Monthly Recurring Revenue (MRR), and other financial key indicators as well.

The complexity grew, even more, when the numbers of business metrics to keep track of were also growing with us. Suddenly, we had to maintain a lot more data in the spreadsheet to keep the report up-to-date. We added metrics that would report our Marketing efforts, which included new data sources that needed to be maintained as well. This was the turning point to evolve our data communication.

The inspiration to become data-driven

Before we started researching our technical setup, we had identified our biggest challenges and made up a clear list of needs that we wanted to cover. The biggest pain point for us to resolve was the lack of transparency and the incomplete view of our business performance each month. The data was also mainly accessible by only one department and isolated from the rest of the company. Those data silos can cause action steps to be drawn not based on data but rather on one’s experience and knowledge. I don’t think this is a bad thing to do in general, especially in the early stages of a business. However, if the business grows further, there is a point in time where it gets almost impossible to know everything by fact. Noteworthy is that we wanted to automate our reporting process so that at each point in time, one would know exactly our current state. With this list in our hands, we started our journey to become data-driven.

The use cases

At the beginning of our journey, everything started without a Data Scientist. We had no dedicated person or department to take care of the volume of data that we had collected and wanted to analyze. It was all done in between the tasks of our Operations department and our CEO himself to create reports that were crucial for communicating the status-quo within and outside of the company. When we began our journey, it was our VP of Engineering who assessed potential fits to start working with the data that we have.

1. Monthly investor update

The first use case in becoming data-driven was propelled by the efforts of our VP of Operations, Lydia. She is responsible for keeping an eye on our numbers and informing investors about our current state of business. The problem here was, that once a month, she was required to collect all the data from different sources to create a report that would inform about the state of the business. The goal was to improve this process by having an automated dashboard that is always up to date, with zero maintenance required.

A diagram showing the initial setup, forcing the team to manually put the data in spreadsheets. — The initial setup for the monthly financial overview

At the early stages, there was no data pipeline in place to collect and visualize the data. It was necessary to find and extract the values manually in all sources like Stripe, Salesforce and the ERP system to be able to report the state of the company in a monthly investors update. We had no database or other tools to quickly access the needed information in one place.

Once a month, Lydia would filter and collect the necessary data to create the reports within Google Sheets. This process was very time consuming, and many hours were spent on this every single month. Another challenge was that we were not able to scale with the ERP system accordingly. All of these reasons would ultimately lead to the introduction of a different ERP system, and a reconsideration of the related data stack. A learning for us was, that calculating Churn, Attrition and Upsell with the given Stripe reports were not easy for us as it would have seemed. It is hard to recalculate the numbers on our end based on what the report was showing to us. Also thinking of a data structure that is scalable is something that requires business knowledge and time.

2. Self-Service marketing overview

Our second use case in becoming data-driven was propelled by the efforts of our VP of Marketing, Thomas. In his role he spends money on marketing initiatives, and this needs to be justified accordingly and tracked if those investments pay off. This includes finding numbers that support the decisions of his team. The goal was to improve once again our processes by creating an automated reporting of the most important KPIs and a visual representation to make it easy to spot the correlation between spending and outcome.

A diagram showing the marketing report setup — The current setup for the marketing dashboard

Then began our research for the technical data setup. Does a Data Warehouse or a Data Lake fit our needs? That was one of the questions that we had to resolve. Which tool does fit our needs to visualize the data accordingly? Who is responsible for it? Were additional questions that needed to be answered.

Due to the fact that we already had a technical setup in the AWS ecosystem, we decided to stick with it after getting in touch with the Account Management Team of AWS who helped us by scheduling regular meetings to assess the needs and find possible solutions. We tried out different services with different visualization platforms on top of it to find the best solution for us to make data communication possible.

After a thorough investigation, we decided to go with Amazon Redshift as the Data Warehouse of choice. Why did we decide against a Data Lake you might ask now?

Due to Redshift’s ability to not only handle structured data but also unstructured, it gave us the flexibility that we searched for. Additionally, the data can be processed faster because of the Schema-on-Write approach that comes with a Data Warehouse. (Schema-on-Write is defined as creating a schema for data before writing it into the database.) The raw data is easier accessible and stored in flat hierarchies which can be accessed and understood also by non-engineering-focused individuals. In order to ingest data from different sources, we decided to use AWS Lambda to fetch and load data from (Application Programming Interfaces) APIs or by extracting Comma-separated files from the data sources.

Just as important as finding the right data setup was also the research for a Business Intelligence (BI) software. After thorough investigation and research, we switched from Amazon QuickSight to the Open Source solution Metabase. Metabase can be used to visually represent information based on the data stored in a database using Structured Query Language (SQL). With the power of SQL, all kinds of diagrams and metrics can be now realized and collected to tell a story with data and to understand and improve customer funnels.

After a few months of research, we found our data stack. This allowed us to finally collect and store our analytical data to start building dashboards. The first one was an automated dashboard showing key metrics that can be shared with stakeholders. This could be realized by accessing around 400 to 800 rows of data from around four different sources. The second one involved already ten data sources that needed to be accessed, preprocessed, aggregated, and visualized. With an increasing volume of around 15 to 30 million rows of data, we could build a monthly Marketing overview focused on key areas that we had chosen.

Another challenge for us was to make sense of the different kinds of data tables so that a calculation using the data stored in our database returns valid, transparent, and reproducible results at all times compared to the calculations made in some data sources’ analytics page. A learning for us was the time necessary to understand the data so that it can be then communicated and interpreted in the correct way. With our data stack, it is now easier to organize and filter the data based on our needs.

3. Investor real-time update

Our third use case was propelled by the efforts of our CEO of Storyblok, Dominik. In his role, he needs to prepare new data for investors very frequently. In order to be able to act fast on incoming requests, it was our goal to have the current data accessible for each Investor meeting at any point in time.

A diagram showing the current workflow of Storyblok's data pipeline with AWS Lambda, Amazon Redshift and Metabase — The current Storyblok data pipeline

Now, having also Metabase as our BI software of choice, we can build almost real-time dashboards to inform about the status-quo of our company. Metabase helps us to make the data accessible for everyone, without the barrier of needing to know SQL. The flexibility to organize and filter the reports is one of the benefits since we use Metabase. Taking Dominik as an example here. He can now log in to Metabase, click on the piece of information he needs and then filter and aggregate the data in a few mouse clicks. This gives us now the possibility to think further about all the ways we can use our data.

The conclusion

In the last couple of months, we have established a data pipeline with combined efforts from different teams to not only collect the necessary information in order to understand the data, but also to build and connect each and every connector to our Data Warehouse. Becoming data-driven is a team effort. We have successfully passed the starting phases of our journey and can now provide a Self-Service data experience to not only a core group of individuals but to all stakeholders as well. The next phase in our data journey includes applying Machine Learning approaches to forecast business developments and to identify user experience events like ‘AHA’ moments that trigger Cross-sell or Upsell opportunities. This is just the beginning of what we can do with our data.