
BigQuery Storage Write API: A Hands-On Guide with Python and Protobuf

30 Mar, 2025

Real-time data is key in today’s fast-paced analytics world. The BigQuery Storage Write API lets you stream data directly into BigQuery tables. In this guide, we’ll walk through several hands-on examples using the BigQuery Storage Write API. You’ll see how to handle different data writing needs with different delivery semantics using Python and Protobuf.

Let’s get started!

BigQuery Storage Write API

When you need to write data into BigQuery in near-real time, there are several options, and the BigQuery Storage Write API is one of them. It is a low-level, cost-efficient interface that unifies streaming ingestion and batch loading behind a single low-latency, high-throughput API.

From a developer’s perspective, using the BigQuery Storage Write API is all about sending your data to BigQuery as a stream of Protobuf messages. You start by creating a write stream and then convert your data into the Protobuf format that matches your table schema. Each write operation is a simple API call that sends a chunk of data, and depending on the stream type, you then decide whether and when to commit the changes. The Storage Write API offers several writing options, and we will dive into the details of each of them in the following sections.

Writing Options

Two sentences about Protobuf

Protocol Buffers, or Protobuf for short, is a method developed and maintained by Google for serializing structured data. Think of it as a more efficient and strictly typed alternative to formats like JSON or XML. You define your data structure in a message definition language (.proto files), then use a compiler to generate code for various programming languages such as Python, Java, or C++. This generated code lets you easily encode your in-memory data into Protobuf messages and decode it back again. Compared to XML or JSON, Protobuf produces smaller, faster-to-process messages, which makes it ideal for performance-sensitive, scalable applications.
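To make the encode/decode round trip concrete, here is a minimal Python sketch using the Timestamp well-known type that ships with the protobuf package; the same SerializeToString and ParseFromString calls apply to the message classes we will generate for our tables later on.

from google.protobuf.timestamp_pb2 import Timestamp

# Build a message and serialize it into the compact binary wire format.
ts = Timestamp()
ts.GetCurrentTime()
payload: bytes = ts.SerializeToString()

# On the receiving side, decode the bytes back into a typed message.
decoded = Timestamp()
decoded.ParseFromString(payload)
print(decoded.ToDatetime())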

Setup and Prerequisites

The repository with the code examples is available here. You can clone the repository and run the examples locally if you want to follow along. For a better developer experience, the repository ships with a devcontainer setup and just commands, so open the repository in the devcontainer and you will be equipped with all the tools you need.

Once the repository is open in the devcontainer, run the following commands to set up the environment:

  • Copy conf_example.yaml to conf.yaml and fill in the GCP project ID and the BigQuery dataset ID (if the dataset does not exist, we will create it in a later step)
just setup
  • Install the project dependencies
just install
  • Log in to gcloud
just login

In the following examples, we will simulate a learning management system (LMS) data pipeline with the entities shown in the ERD diagram below.

ERD Diagram

Each entity has its own table in BigQuery, and the table schemas are described as JSON files under the misc/schemas folder. So, let's create the dataset and tables in BigQuery.

  • Create the BigQuery dataset and the required tables (the script iterates over the files under the misc/schemas folder and creates a table in the BigQuery dataset for each JSON file)
just bq_init

You can skip the following two steps, as their results are already in the repository. They are included to show how the .proto files and the Python classes are generated.

  • Generate the proto messages from the BigQuery table schemas (this creates a .proto file under the misc/proto folder for each entity under the misc/schemas folder)
just generate_proto
  • Compile the .proto files into Python classes (this places the Python classes under the src/bigquery_storage_write_api_examples/entities/ folder)
just compile_proto

Usually, the .proto files and the generated Python classes are stored in the repository and are only regenerated when the schemas change.
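As a quick sanity check of the generated code, you can import one of the compiled classes and build a message from a plain Python dict, the same way the examples below do. A minimal sketch, assuming a module name derived from the entity (the actual module path in the repository may differ) and hypothetical field names:

from google.protobuf.json_format import ParseDict

# Hypothetical module name; the compiled classes live under
# src/bigquery_storage_write_api_examples/entities/.
from bigquery_storage_write_api_examples.entities.students_pb2 import RawStudents

student = {"id": "42", "name": "Ada Lovelace"}  # hypothetical fields
raw_student = ParseDict(js_dict=student, message=RawStudents(), ignore_unknown_fields=True)
payload = raw_student.SerializeToString()  # the bytes that get appended to a write stream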

BigQuery Storage Write API – Default Stream

The default stream in the Storage Write API is built for scenarios where data keeps coming in. When you write to this stream, your data becomes available for queries immediately, and the system guarantees at-least-once delivery. If you're moving from the legacy tabledata.insertAll API, you'll find that the default stream works similarly but offers better resilience, fewer scaling issues, and a lower price.

Now, let’s see how it looks from the code perspective. In our example LMS system, we want to write the students data to the students table in BigQuery using the default stream. The source code is here.

First, we need to initialize the stream with the client, table info, proto message descriptor, etc.

Initialization of the stream
def _init_stream(self):
  # Initialize the client
  self.write_client = BigQueryWriteClient()

  # Create the table path
  self.table_path = self.write_client.table_path(self.project_id, self.dataset_id, self.table_id)

  # Create the stream name
  self.stream_name = self.write_client.write_stream_path(
      self.project_id, self.dataset_id, self.table_id, "_default"
  )

  # Create the proto descriptor
  self.proto_descriptor: DescriptorProto = DescriptorProto()
  RawStudents.DESCRIPTOR.CopyToProto(self.proto_descriptor)

  # Create the proto schema
  self.proto_schema = ProtoSchema(proto_descriptor=self.proto_descriptor)

  # Create the proto data (template for the data to be written)
  self.proto_data: AppendRowsRequest.ProtoData = AppendRowsRequest.ProtoData()
  self.proto_data.writer_schema = self.proto_schema

  # Create the request template
  self.request_template = AppendRowsRequest()
  self.request_template.write_stream = self.stream_name
  self.request_template.proto_rows = self.proto_data

  # Create the stream
  self.append_rows_stream: AppendRowsStream = AppendRowsStream(self.write_client, self.request_template)

Then, when the stream is created, we can start writing the data to it. For that, we need to create a request:

Creating the request
def _request(self, students: list[dict]) -> AppendRowsRequest:
  request = AppendRowsRequest()
  proto_data = AppendRowsRequest.ProtoData()

  proto_rows = ProtoRows()

  for student in students:
      raw_student = ParseDict(js_dict=student, message=RawStudents(), ignore_unknown_fields=True)
      proto_rows.serialized_rows.append(raw_student.SerializeToString())

  proto_data.rows = proto_rows
  request.proto_rows = proto_data
  return request

After that, we can send the request to the stream:

Sending the request
def _write_students(self, request: AppendRowsRequest) -> AppendRowsResponse:
    self.logger.debug("Sending a request to BigQuery")
    response_future = self.append_rows_stream.send(request)
    # if this doesn't raise an exception, all rows are considered successful
    result = response_future.result()
    self.logger.debug(f"🎓 Result: {result}")
    return result
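For completeness, here is a rough sketch of how these pieces can be wired together in a run() method; the actual implementation in the repository may differ, and the fake-data helper shown here is hypothetical.

def run(self):
    # Hypothetical helper that returns a list of dicts matching the students schema.
    students = self._generate_fake_students(1_000)

    # Build one AppendRowsRequest for the whole batch and send it on the default stream.
    request = self._request(students)
    self._write_students(request)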

If you run the following just command, 1,000 rows of fake student data are generated and written to the students table in BigQuery.

just default_stream_example
Result

![Result](https://xebia.com/wp-content/uploads/2025/03/bigquery-storage-write-api-a-hands-on-guide-with-python-and-protobuf-default_stream_write_result.jpg)

BigQuery Storage Write API – Pending Type Stream

With a pending stream type, any records you write are held in a buffer until you decide to commit the stream. Once you commit, all the buffered data becomes available for querying. BigQuery also offers a committed type stream that makes data available immediately upon write. We’ll cover the details of committed type streams a bit later.

For this example, we will generate and write 1,000 fake course records to the courses table in BigQuery (source code).

The initialization is almost the same as in the previous example, so only the relevant differences are shown below.

def _init_stream(self):
    ... # same as in the previous example
    self.write_stream.type_ = types.WriteStream.Type.PENDING
    ... # same as in the previous example
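One difference worth spelling out: unlike the default stream, a pending stream must be created explicitly before you can append to it. A minimal sketch of that step, reusing the attribute names from the snippets above:

# Create the PENDING write stream on the target table; appends then go to its
# name instead of the "_default" stream.
self.write_stream = self.write_client.create_write_stream(
    parent=self.table_path, write_stream=self.write_stream
)
self.stream_name = self.write_stream.name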

As in the previous example, we need to create a request to write the data to the stream. However, we also have to specify an offset for each request. The offset is an integer indicating the position in the stream at which the rows in the request should be appended. If there is an issue, like a network glitch or a retry, BigQuery uses offsets to know which rows it has already received and which are still missing. When several requests are sent on the same stream, their offsets must line up with the number of rows sent so far, so the order of the rows is maintained.

def _request(self, courses: list[dict], offset: int) -> types.AppendRowsRequest:
  ... # same as in the previous example
  request.offset = offset
  ... # same as in the previous example
  return request

def run(self):
  batches = ...

  # The first request must always have an offset of 0.
  offset = 0

  for batch_index, batch in enumerate(batches):
      request = self._request(batch, offset)
      self._write_courses(request=request, batch_index=batch_index, batch_size=len(batch))
      # Offset must equal the number of rows that were previously sent.
      offset += len(batch)
      # The input() is used to pause the execution of the script to allow you to check the data in the table.
      input("Press Enter to continue...")

Run the following just command. It will generate 1,000 fake course records and send them to the BigQuery table. The execution pauses between batches, so you can check the data in the table before the stream is committed.

just pending_type_stream_example
Result

![Result](https://xebia.com/wp-content/uploads/2025/03/bigquery-storage-write-api-a-hands-on-guide-with-python-and-protobuf-pending_type_stream_write_result.png)

BigQuery Storage Write API – Committed Type Stream

The committed type stream makes every record that you write available for consumption straight away. This makes it perfect for streaming workloads where low read latency is crucial. It turns out that the default stream is actually a form of committed stream that operates on an at-least-once basis. The next example shows the exactly-once delivery guarantees of the committed stream.

The code for this example is here and it’s almost the same as the previous example, except the stream type is committed.

def _init_stream(self):
  ... # same as in the previous example
  self.write_stream.type_ = types.WriteStream.Type.COMMITTED
  ... # same as in the previous example
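The write loop itself looks like the one in the pending example, but there is no finalize or commit step: every successful append is immediately queryable, and the tracked offsets are what give you exactly-once behaviour on retries. A rough sketch:

# Sketch: with a COMMITTED stream there is no finalize or commit step.
batches = ...  # same batching as in the pending example
offset = 0
for batch in batches:
    request = self._request(batch, offset)
    self.append_rows_stream.send(request).result()  # rows are queryable right away
    offset += len(batch)  # offsets let BigQuery detect and reject duplicate retries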

Run the following just command. It will generate 5 fake enrollment records and send them to the BigQuery table. As in the previous example, the execution pauses so you can check the underlying BigQuery table state after each write API call.

just committed_type_stream_example
Result

![Result](https://xebia.com/wp-content/uploads/2025/03/bigquery-storage-write-api-a-hands-on-guide-with-python-and-protobuf-committed_type_stream_write_result.png)

BigQuery Storage Write API – Buffered Type Stream

Buffered type streams are an advanced option, mainly intended for BigQuery connector developers, such as the Apache Beam BigQuery I/O connector. With this stream type, rows are held in a buffer until you flush the stream, at which point the buffered rows up to the flush offset are committed. Essentially, it gives you row-level commit control. However, for most use cases, especially if you want small batches to be written together, it is simpler and more effective to use the committed type and send the batch in one request.

The code for this example is here.
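The distinguishing call for a buffered stream is FlushRows: appended rows stay hidden until you advance the flush offset. A minimal sketch, reusing the attribute names from the earlier examples:

# Rows appended so far are buffered; flushing up to an offset makes them visible.
flush_request = types.FlushRowsRequest(write_stream=self.stream_name)
flush_request.offset = offset
self.write_client.flush_rows(flush_request)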

The following just command will generate 5 fake class records and send them to the BigQuery table.

just buffered_type_stream_example
Result

![Result](https://xebia.com/wp-content/uploads/2025/03/bigquery-storage-write-api-a-hands-on-guide-with-python-and-protobuf-buffered_type_stream_write_result.png)

Wrapping Up

We've explored the BigQuery Storage Write API and its different stream types: default, pending, committed, and buffered. You learned how to use Python and Protobuf to build both real-time and batch data ingestion. Feel free to tweak the code, experiment with different options, and share your results. And remember, the BigQuery Storage Write API isn't the only way to write to a BigQuery table; Google offers other options, such as Pub/Sub and Dataflow, to suit different workloads and architectures. Happy coding!

Questions?

Get in touch with us to learn more about the subject and related solutions
