Leveraging Serverless and Generative AI for Image Captioning on GCP

In today’s age of abundant data, especially visual data, it’s imperative to understand and categorize images efficiently. Whether it’s for enhancing search functionality, assisting differently-abled individuals, or simply summarizing vast galleries, the automation of image captioning is becoming increasingly relevant.

In my recent endeavor, I explored integrating serverless architecture with generative AI to auto-caption images on Google Cloud Platform (GCP). The objective was to tap into GCP’s scalable infrastructure, without the overhead of server management, while benefiting from Vertex AI’s image captioning abilities. The result? An automated event-driven application that, once set up, requires little to no maintenance and produces a caption for any image uploaded to a designated storage bucket shortly after the upload.

In this blog post, I’ll delve into the specifics of how this system was designed, the underlying code that powers it, and how you can set it up in your own GCP environment using Terraform, a popular infrastructure as code (IaC) tool. Whether you’re a developer looking for a scalable image captioning solution or a tech enthusiast eager to know more about serverless and AI integrations, this post promises a blend of technical depth with practical insights. I’ll first give you the TL;DR, and after that let’s dive in!

TL;DR

We’ve built an automated, serverless system on Google Cloud Platform where:

Users upload images to a Google Cloud Storage Bucket.
A bucket notification publishes each upload to a Pub/Sub topic, which in turn triggers our main Python Cloud Function.
This function leverages Vertex AI to generate captions for the images.
The captions, paired with the image filenames, are then stored in Google Cloud Firestore.
End result? Users drop an image in a bucket and swiftly get an AI-crafted caption from Firestore. It’s efficient, scalable, and harnesses the best of cloud and AI! 🚀🖼🔍
You’ll find the source code at https://github.com/binxio/tf-serverless-image-caption-generator

Technological Overview: Unpacking the Building Blocks

Before diving deep into the code and deployment intricacies, let’s take a moment to unpack the key technologies and components that form the backbone of this solution. Understanding these will provide a clearer picture of how everything seamlessly ties together in our serverless, AI-driven image captioning system.

Firestore Database: A flexible, scalable database for mobile, web, and server applications from Firebase and Google Cloud. In our setup, this is where the image captions are stored.
Cloud Storage Bucket: GCP’s unified object storage, allowing worldwide storage and retrieval of any amount of data. This is where the images are uploaded and the Cloud Function’s source code is stored.
Cloud Pub/Sub: A real-time messaging service that allows you to send and receive messages between independent applications. It aids in bucket notifications when a new image is added.
Vertex AI’s ImageCaptioningModel: A model from Vertex AI (a suite of machine learning tools on GCP), the ImageCaptioningModel uses generative AI techniques to create descriptions for images. In our system, it’s the powerhouse behind generating the captions.
Python Cloud Function: Serving as the bridge between our Cloud Storage Bucket and Vertex AI, this Python Cloud Function is triggered whenever a new image is uploaded. It fetches the image, gets a caption using Vertex AI, and then stores the result in Firestore.
Terraform: A tool for building, changing, and versioning infrastructure safely and efficiently. With our setup, Terraform allows us to codify the entire infrastructure, ensuring repeatability and scalability without manual intervention.

Understanding the Terraform Code

Terraform is an instrumental tool in this project. It allows us to define and provision the required infrastructure using a declarative configuration language. Let’s delve into the various components of our serverless application as defined by our Terraform configurations. Each snippet offers a glimpse into how each segment is orchestrated on the Google Cloud Platform.

Variable Declarations and Provider Configuration

These segments determine which GCP project and region the resources should be created in. I need some of these variables later on as well, so I’ve defined them at the beginning of the Terraform file.

variable "project_id" { ... }
variable "region" { ... }
variable "location" { ... }

provider "google" {
  project = var.project_id
  region  = var.region
}

Cloud Storage Bucket for Images

A dedicated bucket designed for storing images. These images, when uploaded, act as the trigger for our event-driven architecture.

resource "google_storage_bucket" "image_bucket" {
  name     = "your-image-bucket"
  location = "US"
}

Cloud Function Deployment

This represents the heart of the serverless application. The function springs into action whenever a message arrives on the Pub/Sub topic. Its source code, packed into a zip file, is uploaded to a separate storage bucket.

resource "google_cloudfunctions_function" "process_image_function" {
  name                  = "process-image"
  description           = "Processes uploaded image data"
  available_memory_mb   = 1024
  source_archive_bucket = google_storage_bucket.cloudfunction_bucket.name
  source_archive_object = google_storage_bucket_object.function_archive.name
  entry_point           = "process_image"
  runtime               = "python311"

  labels = {
    zip-md5 = local.combined_md5
  }

  event_trigger {
    event_type = "google.pubsub.topic.publish"
    resource   = google_pubsub_topic.bucket_notifications.name
  }
}

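The labels block is a mechanism to redeploy the function only when its source has changed: the zip-md5 label changes whenever the checksum of the source files changes, so Terraform sees a difference and updates the function. The next snippet explains how it is done.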

Zip File Creation and Update Technique

Leveraging MD5 checksums, the system smartly repackages and redeploys the Cloud Function only when there’s a change in the source or its dependencies.

locals {
  main_md5         = md5(file("${path.module}/function/main.py"))
  requirements_md5 = md5(file("${path.module}/function/requirements.txt"))
  combined_md5     = md5("${local.main_md5}${local.requirements_md5}")
}

resource "null_resource" "create_zip" {
  triggers = {
    file1_checksum = md5(file("${path.module}/function/main.py"))
    file2_checksum = md5(file("${path.module}/function/requirements.txt"))
  }

  provisioner "local-exec" {
    command = "zip -j ${path.module}/${local.combined_md5}.zip ${path.module}/function/*"
  }
}
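
To make the checksum logic concrete, here is a rough Python equivalent of what the locals block computes. It is only a sketch for illustration: the helper names md5_hex, combined_md5 and combined_md5_for_dir are my own and are not part of the repository, and it assumes Terraform’s md5() hashes the UTF-8 bytes of the string and returns a hex digest.

```python
import hashlib
from pathlib import Path


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest of raw bytes (mirrors md5(file(...)) in the locals block)."""
    return hashlib.md5(data).hexdigest()


def combined_md5(main_src: bytes, requirements_src: bytes) -> str:
    """Hash the two per-file hex digests concatenated as one string."""
    joined = md5_hex(main_src) + md5_hex(requirements_src)
    return md5_hex(joined.encode("utf-8"))


def combined_md5_for_dir(function_dir: str) -> str:
    """Convenience wrapper for a folder holding main.py and requirements.txt."""
    folder = Path(function_dir)
    return combined_md5(
        (folder / "main.py").read_bytes(),
        (folder / "requirements.txt").read_bytes(),
    )


# Hypothetical file contents, just to show that any edit changes the result.
before = combined_md5(b"print('v1')\n", b"google-cloud-storage\n")
after = combined_md5(b"print('v2')\n", b"google-cloud-storage\n")
print(before != after)
```

Because the combined digest feeds both the zip file name and the zip-md5 label, a change in either file gives Terraform something new to deploy, while an unchanged source produces the same value and no redeploy.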

Pub/Sub Topic for Bucket Notifications

The Pub/Sub Topic is primed to capture notifications when images are added to the storage bucket. A binding establishes permissions while a notification configuration binds the bucket to the topic.

resource "google_pubsub_topic" "bucket_notifications" {
  name = "bucket-notifications-topic"
}

resource "google_pubsub_topic_iam_binding" "bucket_pubsub_publisher" {
  topic = google_pubsub_topic.bucket_notifications.name
  role  = "roles/pubsub.publisher"
  members = [
    "serviceAccount:service-<service_account_id>@gs-project-accounts.iam.gserviceaccount.com"
  ]
}

resource "google_storage_notification" "bucket_notification" {
  bucket         = google_storage_bucket.image_bucket.name
  payload_format = "JSON_API_V1"
  topic          = google_pubsub_topic.bucket_notifications.name
  depends_on     = [google_pubsub_topic_iam_binding.bucket_pubsub_publisher]
}
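
To get a feel for what ends up on the topic, here is a small sketch of how such a notification could be unpacked in Python. The sample event is hand-written for illustration, not captured from a real bucket; it assumes the message carries eventType, bucketId and objectId attributes plus, for JSON_API_V1, a base64-encoded JSON description of the object in data. The parse_notification helper is a hypothetical name of mine.

```python
import base64
import json


def parse_notification(event: dict) -> dict:
    """Pull the fields we care about out of a Pub/Sub-delivered bucket notification."""
    attributes = event.get("attributes") or {}
    payload = {}
    if event.get("data"):
        # With payload_format JSON_API_V1 the data field is base64-encoded JSON.
        payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    return {
        "event_type": attributes.get("eventType"),
        "bucket": attributes.get("bucketId"),
        "object": attributes.get("objectId"),
        "content_type": payload.get("contentType"),
    }


# Hand-written sample event (illustrative values only).
sample_payload = {"name": "cat.jpg", "bucket": "your-image-bucket", "contentType": "image/jpeg"}
sample_event = {
    "attributes": {
        "eventType": "OBJECT_FINALIZE",
        "bucketId": "your-image-bucket",
        "objectId": "cat.jpg",
        "payloadFormat": "JSON_API_V1",
    },
    "data": base64.b64encode(json.dumps(sample_payload).encode("utf-8")).decode("ascii"),
}

print(parse_notification(sample_event))
```

The Cloud Function further down only needs the attributes, which is why it never decodes the data field.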

Firestore Database Configuration

Firestore, a NoSQL database, is used to store image captions. The configuration below provisions the Firestore service and sets up a native database instance.

resource "google_firestore_database" "your-amazing-database" {
  project     = var.project_id
  name        = "your-amazing-database"
  location_id = var.location
  type        = "FIRESTORE_NATIVE"
}

With these segments, the Terraform configuration creates an automated, event-driven architecture on GCP. Together, these components generate captions for uploaded images using Vertex AI.

The Python Cloud Function for Image Captioning

With our infrastructure efficiently established through Terraform, we now venture into the very core of our solution: the serverless Python Cloud Function that leverages Vertex AI to caption images stored in Google Cloud Storage. Let’s have a look:

Setting Up

At the outset, we import the required libraries, including logging for debugging and traceback for detailed error reporting. The Google Cloud Storage and Firestore modules enable interactions with their respective services. The Vertex AI image captioning model will be the superstar here, granting our function its AI capabilities.

import logging
import traceback
from google.cloud import storage, firestore
from vertexai.vision_models import Image, ImageCaptioningModel
import os

Initializing the Captioning Model & Firestore Client

We preload our ImageCaptioningModel and initialize a Firestore client. Doing so at the global scope ensures these are done just once, rather than upon every function invocation, aiding performance.

model = ImageCaptioningModel.from_pretrained("imagetext@001")
db = firestore.Client()
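
A related pattern, shown here only as an alternative and not what the function above does, is lazy initialization: the expensive object is created on first use and then reused by later invocations on the same instance. The sketch below uses a plain dictionary as a stand-in for the model so it runs anywhere; get_model, handler and init_calls are hypothetical names.

```python
import functools

# Counts how often the (stand-in) expensive setup actually runs.
init_calls = 0


@functools.lru_cache(maxsize=None)
def get_model():
    """Create the expensive object on first call, then return the cached one."""
    global init_calls
    init_calls += 1
    # Stand-in only; in the real function this would be something like
    # ImageCaptioningModel.from_pretrained("imagetext@001").
    return {"name": "imagetext@001"}


def handler(event):
    """Simulated invocation that needs the model."""
    model = get_model()
    return model["name"]


# Three simulated invocations share a single initialization.
for _ in range(3):
    handler({})
print(init_calls)
```

The trade-off is roughly this: eager globals pay the setup cost during cold start, while lazy ones pay it on the first request that needs them.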

Processing images

This is the function that is triggered when a new image is uploaded to the bucket:

Event Analysis: Initially, the function decodes the incoming event, discerning event type and target object details.
Image Download: If the event signifies a new image (OBJECT_FINALIZE), it proceeds to download the image to a temporary local space for processing.
Captioning: The image is then passed to the get_captions function, which, powered by Vertex AI, determines an appropriate caption.
Firestore Storage: Post-captioning, the image’s name and the computed caption are stored as a document in Firestore.

def process_image(data, context):
    """Triggered from a message on a Cloud Pub/Sub topic."""
    logging.info("Function started.")
    logging.info("Received data: " + str(data))

    try:
        attributes = data.get('attributes', {})
        event_type = attributes.get('eventType')
        bucket_id = attributes['bucketId']
        object_id = attributes['objectId']

        if event_type != "OBJECT_FINALIZE":
            logging.info("Event type is not OBJECT_FINALIZE. Exiting function.")
            return "No action taken"

        # Initialize a client
        storage_client = storage.Client()
        bucket = storage_client.bucket(bucket_id)
        blob = bucket.blob(object_id)

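        # Download the blob to a local file. Object names may contain "/",
        # so only the final path component is used for the temp file name.
        local_path = f'/tmp/{os.path.basename(object_id)}'
        blob.download_to_filename(local_path)
        logging.info(f'{object_id} downloaded to {local_path}.')

        captions = get_captions(local_path)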

        if captions:
            first_caption = captions[0]
            logging.info(first_caption)

            # Store filename and caption in Firestore
            doc_ref = db.collection('captions').document(object_id)
            doc_ref.set({
                'filename': object_id,
                'caption': first_caption
            })

            logging.info(f"Stored {object_id} and its caption in Firestore.")
    except Exception as e:
        stack_trace = traceback.format_exc()
        logging.error(f"Error during processing: {e}\n{stack_trace}")
    return "Processed image data"
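
One thing to keep in mind: the function uses the object name directly as the Firestore document ID, and Firestore document IDs cannot contain a forward slash. If images are uploaded under a prefix such as photos/cat.jpg, the name would need to be encoded first. Here is a minimal sketch of one way to do that, using percent-encoding; the helper names caption_doc_id and object_id_from_doc are hypothetical and not part of the repository.

```python
from urllib.parse import quote, unquote


def caption_doc_id(object_id: str) -> str:
    """Percent-encode the object name so it contains no '/' characters."""
    return quote(object_id, safe="")


def object_id_from_doc(doc_id: str) -> str:
    """Reverse the encoding to recover the original object name."""
    return unquote(doc_id)


doc_id = caption_doc_id("photos/cat.jpg")
print(doc_id)
print(object_id_from_doc(doc_id))
```

Since the original filename is also stored in the document’s filename field, lookups by the real object name remain possible either way.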

Image Captioning with Vertex AI

The image, loaded from its file, is fed into GCP Vertex AI’s pretrained vision model. The model then returns the top caption (as we’ve set number_of_results to 1).

def get_captions(filename):
    """Return the list of captions generated for the image file."""
    image = Image.load_from_file(filename)
    captions = model.get_captions(
        image=image,
        number_of_results=1,
        language="en",
    )
    return captions

Flow of the serverless application

Upload: A user uploads an image to the Cloud Storage Bucket.
Notification: The upload event triggers a Pub/Sub message.
Invocation: This message then activates our Cloud Function.
Captioning: Within the function, the Vertex AI model processes the image, deducing an apt caption.
Storing: The caption, alongside the image filename, finds its place in Firestore.

Images can be uploaded to the bucket in bulk, and our Cloud Function will create captions on the fly, in parallel.
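
To try the bulk behaviour, a small helper along these lines could push a folder of images to the bucket. It is a sketch and not part of the repository: upload_images and IMAGE_SUFFIXES are hypothetical names, the suffix list is my own assumption, and the bucket argument is assumed to be a google.cloud.storage Bucket (any object with a compatible blob(...).upload_from_filename(...) will do).

```python
from pathlib import Path

# Assumed set of image extensions; adjust to whatever you actually upload.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif"}


def upload_images(bucket, folder: str) -> list[str]:
    """Upload every image file in `folder` to `bucket`; return the object names."""
    uploaded = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            # Each finished upload produces its own notification downstream.
            bucket.blob(path.name).upload_from_filename(str(path))
            uploaded.append(path.name)
    return uploaded


# Example usage (requires google-cloud-storage and credentials):
#   from google.cloud import storage
#   bucket = storage.Client().bucket("your-image-bucket")
#   print(upload_images(bucket, "./images"))
```

Each upload results in a separate Pub/Sub message, so the captions are produced by independent function invocations rather than by one long-running job.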

How to deploy

Clone the Repository:

git clone git@github.com:binxio/tf-serverless-image-caption-generator.git
cd tf-serverless-image-caption-generator

Initialise Terraform:

terraform init

Update Variables: You may want to adjust the default values for project_id, region, and location in the Terraform files, or override them using the -var option with terraform apply.

Review and Apply Changes:

terraform plan
terraform apply

Review the resources that will be created/modified and type yes when prompted.

Once deployed, navigate to the Google Cloud Console to ensure that the resources have been created.
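
For a quick end-to-end check after uploading a test image, the caption can also be read back from Firestore in a few lines of Python. This is a sketch under assumptions: fetch_caption is a hypothetical helper, the db argument is assumed to be a google.cloud.firestore Client pointed at the same project and database the function writes to, and the collection and field names are the ones used in the Cloud Function.

```python
from typing import Optional


def fetch_caption(db, filename: str) -> Optional[str]:
    """Return the stored caption for `filename`, or None if there is none yet."""
    snapshot = db.collection("captions").document(filename).get()
    if not snapshot.exists:
        return None
    return snapshot.to_dict().get("caption")


# Example usage (requires google-cloud-firestore and credentials):
#   from google.cloud import firestore
#   db = firestore.Client()
#   print(fetch_caption(db, "cat.jpg"))
```

If this returns None right after an upload, the function may simply not have finished yet; the function logs in the Cloud Console are the place to look next.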

In conclusion

By weaving together Google Cloud’s storage, Pub/Sub, and Firestore with Vertex AI, we’ve showcased a fraction of the potential of the fusion of cloud and AI technology. In today’s rapidly evolving digital landscape, it’s crucial for developers, businesses, and curious individuals alike to embrace the tools at their disposal, continuously learn, and innovate. The synergy between cloud and AI exemplified in this project paves the way for countless applications — limited only by our imaginations.

Here’s to more such integrations, adventures, and the uncharted frontiers of tech we’re yet to explore. Onward and upward! 🚀
