Are you looking to break into the exciting world of analytics engineering, or to transition your career to leverage powerful data insights? The journey can seem daunting, with a myriad of tools and technologies to master. However, with the right approach, you can build a robust skill set and a compelling portfolio using freely available resources.
This guide will walk you through setting up a modern data transformation pipeline, using publicly available datasets and open-source tools, providing you with an opportunity to develop and demonstrate your skills.
Modern Data Transformations: Your Essential Toolkit
Before we dive into the project, let's briefly look at the core components of the data transformation pipeline we'll be building. The beauty of this pipeline is its accessibility; many components offer free tiers or open-source versions perfect for learning.
- Data Warehouse: Google BigQuery
A highly scalable, serverless, and cost-effective cloud data warehouse. We'll leverage its public datasets, which are free to query up to a certain limit, making it ideal for learning.
- Transformation: dbt (data build tool)
dbt enables data analysts and engineers to transform data in their warehouse by simply writing SQL. It brings software engineering best practices like version control, testing, and documentation to the data transformation workflow.
- Version Control: Git + GitHub
Essential for any collaborative or structured development, Git allows you to track changes in your code, while GitHub provides a platform for hosting your repositories and collaborating with others.
- CI/CD: GitHub Actions
Implement continuous integration and continuous delivery directly into your GitHub repositories. This will be crucial for automating your dbt project's testing and deployment.
Your Learning Roadmap: Step-by-Step Implementation
Here's a structured approach to building your modern data transformation pipeline:
0. Prerequisites
We assume that you are already familiar with the topics below; if not, it is best to focus on these areas before moving on to the subsequent steps.
- SQL
SQL is fundamental for many job titles, including Analytics Engineer, as it is the primary language used both to query data stored in databases and to transform that data. Familiarity with CTEs (common table expressions), JOINs and window functions is expected; a short query combining all three appears after this list. If you are not familiar with these areas, you can sign up for online courses (DataCamp, Udemy, etc.) or attend one of Xebia's SQL training courses.
- Version Control (Git)
Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later. For aspiring analytics engineers, understanding the basics of version control (most commonly Git) is fundamental: it enables collaborative development on SQL queries, dbt models, Python scripts and more, allowing teams to track every modification, revert to previous states and merge contributions seamlessly. In modern data transformation pipelines, version control ensures data quality, reproducibility, auditability, and efficient teamwork. There are several commonly used platforms for hosting Git repositories, including GitHub, GitLab and Azure DevOps.
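To gauge whether your SQL is at the level described above, here is a small, self-contained sketch that combines a CTE, a JOIN and a window function, written in BigQuery's SQL dialect. The orders and customers tables are hypothetical and included purely for illustration; if you can read and write a query like this comfortably, you are ready for the next step.

```sql
-- Hypothetical tables: orders(order_id, customer_id, order_date, amount)
--                      customers(customer_id, country)
with recent_orders as (
    -- CTE: restrict to orders from the last 90 days
    select
        order_id,
        customer_id,
        order_date,
        amount
    from orders
    where order_date >= date_sub(current_date(), interval 90 day)
)

select
    c.country,
    r.order_id,
    r.amount,
    -- Window function: rank each customer's orders by value
    row_number() over (
        partition by r.customer_id
        order by r.amount desc
    ) as customer_order_rank
from recent_orders as r
inner join customers as c
    on r.customer_id = c.customer_id
```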
1. Setting Up Your Environment (BigQuery & dbt Cloud)
- BigQuery
Create a Google Cloud account (it's free!). Create a new project and enable the BigQuery API. In the BigQuery UI, ensure you can see the bigquery-public-data project in the Explorer pane. In the "IAM and admin" > "Service accounts" section, create a service account and a JSON key for this service account. Note that this key is sensitive, so please store it safely! A short query you can use to confirm access to the public datasets appears after this step's checklist.
- GitHub
Create a GitHub account (it's free!). Create a repository, preferably public so others can see your skills. Note that it is possible to change the visibility of your repository between public and private as often as you like.
- dbt Cloud
Create a dbt Cloud account (it's free for individuals!). Create a new project, selecting "BigQuery" as the warehouse and uploading the service account JSON file created earlier. Connect your GitHub repository to dbt Cloud by following this guide. You should be able to run dbt debug successfully from the Studio in dbt Cloud.
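With the accounts connected, a quick way to confirm that you can actually reach the public datasets (as mentioned in the BigQuery step above) is to run a small query against one of them from the BigQuery console or the dbt Cloud Studio. The table below is the Bitcoin transactions table used later in this guide; the month filter is only an example, but filtering on the partition column keeps the amount of data scanned, and therefore the cost, low.

```sql
-- Count one month of Bitcoin transactions in the public dataset.
-- block_timestamp_month is the table's partition column; adjust the month as needed.
select
    count(*) as transaction_count
from `bigquery-public-data.crypto_bitcoin.transactions`
where block_timestamp_month = date '2025-01-01'
```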
2. Structuring Your Data Transformations (dbt Best Practices)
- Model Organisation
Using the bigquery-public-data BigQuery project, select tables to use as sources. Here are some interesting options:
1. crypto_bitcoin.transactions: A large, constantly updating table containing Bitcoin transactions. This source gives you the opportunity to create an incremental model.
2. github_repos.commits: Analyse the commit activity in a GitHub repo (or repos) of your choice.
3. wikipedia.pageviews_2025: Take a look at which Wikipedia articles are trending.
Structure your dbt project with staging, intermediate, and marts layers according to dbt's recommendations; sketches of a staging model and an incremental mart model appear after this list.
- Materialisations
Configure your dbt_project.yml, carefully considering which materialisation (e.g. table, view, incremental) makes the most sense for each layer of your models; a sketch of the relevant dbt_project.yml excerpt appears after this list.
- Tests
Add both singular and generic tests to ensure your marts layer contains high-quality data; the properties-file sketch after this list shows a couple of generic tests.
- Exposures
Explore adding dbt exposures to define downstream consumption of your dbt models, particularly for your BI reports; an example exposure is included in the properties-file sketch after this list.
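To make the model organisation step concrete, here is a minimal sketch of a staging model for the Bitcoin transactions source. It assumes you have declared a source named crypto_bitcoin pointing at the bigquery-public-data project in a sources .yml file, and the column list is illustrative rather than exhaustive (verify the exact column names in the BigQuery UI).

```sql
-- models/staging/crypto_bitcoin/stg_crypto_bitcoin__transactions.sql
-- Staging layer: light renaming and aliasing only, no business logic.
with source as (

    select * from {{ source('crypto_bitcoin', 'transactions') }}

),

renamed as (

    select
        `hash`          as transaction_hash,
        block_timestamp,
        block_number,
        input_count,
        output_count,
        output_value    as output_value_satoshi
    from source

)

select * from renamed
```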
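The same source is a natural fit for the incremental model mentioned above, since new transactions arrive constantly. Here is a hedged sketch of a daily-grain mart built on top of the staging model from the previous sketch; the model name and aggregation are just one reasonable choice.

```sql
-- models/marts/fct_daily_bitcoin_transactions.sql
-- Incremental model: after the first full build, only recent days are reprocessed.
{{
    config(
        materialized = 'incremental',
        unique_key = 'transaction_date'
    )
}}

select
    date(block_timestamp)       as transaction_date,
    count(*)                    as transaction_count,
    sum(output_value_satoshi)   as total_output_value_satoshi
from {{ ref('stg_crypto_bitcoin__transactions') }}

{% if is_incremental() %}
-- Reprocess from the latest day already in the table so a partially loaded day is completed.
where date(block_timestamp) >= (select max(transaction_date) from {{ this }})
{% endif %}

group by transaction_date
```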
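For the materialisations, the relevant part of dbt_project.yml could look like the excerpt below. The project and folder names are assumptions; match them to your own project, and override individual models (such as the incremental mart above) in their own config blocks.

```yaml
# dbt_project.yml (excerpt) - names are assumptions, align them with your project
name: analytics_portfolio
profile: analytics_portfolio

models:
  analytics_portfolio:
    staging:
      +materialized: view        # cheap, always reflects the latest source data
    intermediate:
      +materialized: ephemeral   # inlined into downstream models, not queried directly
    marts:
      +materialized: table       # fast for BI tools; use incremental per model where it pays off
```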
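Generic tests and an exposure can sit together in a properties file next to your marts models. A minimal sketch, reusing the mart from the earlier sketches (the dashboard name, URL and owner are placeholders):

```yaml
# models/marts/_marts__models.yml (excerpt)
version: 2

models:
  - name: fct_daily_bitcoin_transactions
    description: One row per day of Bitcoin transaction activity.
    columns:
      - name: transaction_date
        tests:
          - unique
          - not_null
      - name: transaction_count
        tests:
          - not_null

exposures:
  - name: bitcoin_activity_dashboard
    type: dashboard
    maturity: medium
    url: https://example.com/your-dashboard   # placeholder
    description: BI dashboard built on the daily Bitcoin transactions mart.
    depends_on:
      - ref('fct_daily_bitcoin_transactions')
    owner:
      name: Your Name
      email: you@example.com
```

A singular test, by contrast, is simply a SQL file in the tests/ folder that returns the rows violating an expectation; the test passes when the query returns no rows.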
3. Version Control & CI/CD with GitHub & GitHub Actions
- Development Workflow
Make use of version control best practices by creating feature branches and pull requests. Ensure each commit contains related work and has a clear commit message.
- Continuous Integration (CI)
Add a CI job and configure "slim CI". This will automate testing of your changes on every pull request, ensuring data quality and preventing breaking changes; a sketch of a GitHub Actions workflow following this pattern appears after this list.
- Branch Protection
Add a branch protection rule to your main branch so that pull requests cannot be merged unless the CI job succeeds.
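As referenced above, here is a hedged sketch of what a slim-CI-style GitHub Actions workflow could look like if you run dbt Core inside the workflow (with dbt Cloud you would instead configure the CI job in the dbt Cloud UI). The secret name, profile name, GCP project and the location of the production manifest are all assumptions.

```yaml
# .github/workflows/ci.yml
name: dbt CI

on:
  pull_request:
    branches: [main]

jobs:
  dbt_ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dbt
        run: pip install dbt-bigquery

      - name: Write BigQuery credentials
        # Assumes a repository secret holding the service account JSON key from step 1.
        run: echo '${{ secrets.BIGQUERY_KEYFILE_JSON }}' > ${{ github.workspace }}/keyfile.json

      - name: Write a CI profile
        # The profile name must match the `profile:` entry in dbt_project.yml.
        run: |
          mkdir -p ~/.dbt
          cat > ~/.dbt/profiles.yml << EOF
          analytics_portfolio:
            target: ci
            outputs:
              ci:
                type: bigquery
                method: service-account
                keyfile: ${{ github.workspace }}/keyfile.json
                project: your-gcp-project-id   # assumption - replace with your project
                dataset: dbt_ci
                threads: 4
          EOF

      - name: Build only modified models and their children
        # Slim CI needs a manifest.json from a production run, assumed here to be
        # available in ./prod-state (e.g. committed or downloaded as an artifact).
        run: |
          dbt deps
          dbt build --select state:modified+ --defer --state ./prod-state
```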
4. Visualising Your Insights (BI Tools)
- Dashboards
With your data transformed and validated by dbt, the next step is to connect your chosen BI tool (e.g. Tableau Public or Power BI Desktop; many offer free versions or trials) to your dbt-generated marts tables. Building dashboards demonstrates that the data available in your marts layer can answer business questions, as the short query below illustrates.
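As an illustration, the kind of question a dashboard on the daily transactions mart from the earlier sketches could answer can first be sanity-checked with a direct query (qualify the table with your own project and dataset; the mart name is the assumed one from step 2):

```sql
-- Which were the ten busiest days for Bitcoin transactions over the last year?
select
    transaction_date,
    transaction_count
from fct_daily_bitcoin_transactions
where transaction_date >= date_sub(current_date(), interval 1 year)
order by transaction_count desc
limit 10
```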
Bonus Points
Want to really show off? Here are some additional options you can pursue:
- dbt Core
As an alternative to dbt Cloud you can use dbt Core, the open-source version of dbt. This guide provides a quickstart. This allows you to demonstrate your knowledge of setting up a Python virtual environment and your comfort working with a command line interface (CLI). VS Code is a great integrated development environment (IDE) to help you get started.
- CI/CD
Using GitHub Actions, it is possible to create your own CI workflow that compiles and runs your dbt project. In addition, you can create a continuous delivery (CD) workflow that publishes your dbt docs site to GitHub Pages.
- Formatter
Use a SQL formatter on your code to ensure a standardised approach to formatting. SQLFluff and sqlfmt are two widely used options.
- Dev Container and Docker
Is your background in other areas of engineering? Do you already have Docker knowledge? One way to demonstrate this is by using dbt Core with a dev container: a complete development environment that includes all the essential packages, tools, dependencies, and configurations needed to develop with dbt.
- Orchestration
Orchestration refers to the regular running of your dbt project so that the underlying dbt models contain up-to-date data. Generally this is performed using dbt Cloud or a tool like Airflow or Dagster, but you can also use GitHub Actions to schedule your dbt project. When designing your orchestration, keep in mind questions such as how you will handle retries, whether alerts will be sent upon failure, and which logs are output so you can debug what the pipeline actually did. A sketch of a scheduled GitHub Actions workflow appears at the end of this list.
- Add to your CV
Already have a dbt repo you are proud of? Don't forget to include a link on your CV so recruiters and hiring managers can take a look!
- Books
For those interested in upskilling on other aspects of Analytics Engineering, there are many books available such as:
1. Designing Data-Intensive Applications: A resource for understanding and building data-intensive applications, covering data modeling, storage systems, and distributed systems.
2. Fundamentals of Analytics Engineering: This book guides you through designing and implementing data pipelines, data modeling, schema design, data quality, governance, observability, collaborative coding practices (Git), and automating workflows with CI/CD pipelines and orchestrators.
3. The Data Warehouse Toolkit: The classic guide to data modeling in data warehouses, focused on the Kimball dimensional modeling methodology.
- Community
Inspiration can also come from others. There are a number of communities where you can ask questions, get inspired, or see the work of others:
1. dbt Slack: A Slack workspace for all questions related to dbt, frequented by over 20k users.
2. r/dataengineering: For more technical questions, the data engineering subreddit is an active space for questions as well as project showcases.
3. Newsletters: There are a number of newsletters/blogs that are worth signing up for such as Data Products from Chad Sanderson, The Analytics Engineering Roundup from dbt Labs and Benn Stancil's Substack.
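Finally, to close the loop on the orchestration idea above, here is a hedged sketch of a scheduled GitHub Actions workflow for dbt Core. It reuses the same secret, profile and project-name assumptions as the CI sketch in step 3; retries, alerting and log retention are left to GitHub Actions' own settings and are worth reviewing explicitly.

```yaml
# .github/workflows/nightly.yml
name: Nightly dbt build

on:
  schedule:
    - cron: "0 5 * * *"   # every day at 05:00 UTC
  workflow_dispatch:      # allows manual runs when debugging

jobs:
  dbt_build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dbt
        run: pip install dbt-bigquery

      - name: Write credentials and profile
        # Same assumptions as the CI workflow: a BIGQUERY_KEYFILE_JSON secret and a
        # profile pointing dbt at your own GCP project (not the public data project).
        run: |
          echo '${{ secrets.BIGQUERY_KEYFILE_JSON }}' > ${{ github.workspace }}/keyfile.json
          mkdir -p ~/.dbt
          cat > ~/.dbt/profiles.yml << EOF
          analytics_portfolio:
            target: prod
            outputs:
              prod:
                type: bigquery
                method: service-account
                keyfile: ${{ github.workspace }}/keyfile.json
                project: your-gcp-project-id   # assumption - replace with your project
                dataset: analytics
                threads: 4
          EOF

      - name: Build the project
        # --fail-fast stops on the first error so failures surface quickly in the
        # Actions UI, where run logs can be inspected and notifications configured.
        run: |
          dbt deps
          dbt build --fail-fast
```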
Conclusion
Embarking on a career in analytics engineering doesn't require expensive courses or proprietary software. By leveraging powerful open-source tools and publicly available datasets, you can build a modern data transformation pipeline. This hands-on project using BigQuery, dbt and GitHub will not only equip you with practical skills but also provide a tangible portfolio piece to showcase your abilities to prospective employers.
Start building, keep learning, and don't be afraid to experiment. The world of data awaits!
Feel you could benefit from more personalised advice? Our analytics engineer consultants are here to help – just contact us and we'll get back to you soon. Or are you a future analytics engineer interested in learning more? Check out our training courses or have a look at our job openings.
Written by Pádraic Slattery
Pádraic is a technical-minded engineer passionate about helping organizations derive business value from data. With experience in data engineering, Business Intelligence development, and data analysis, he specializes in data ingestion pipelines and DataOps.