AWS Glue and AWS Data Pipeline have a lot in common. The primary goal of both solutions is to move data. Many of their use cases overlap.
However, there are also fundamental differences.
In this entry, we compare the two services to help you choose the one better suited to your needs.
What is AWS Data Pipeline?
The AWS Data Pipeline web service enables you to easily automate the movement and transformation of data. It helps to process and move information between different AWS compute and storage services, as well as on-premises data sources, at specified intervals.
With the use of AWS Data Pipeline, you can access your data where it’s stored, transform and process it at scale, and move the results efficiently to other AWS services – like Amazon RDS, Amazon DynamoDB, Amazon S3, or Amazon EMR.
Key Features of AWS Data Pipeline
With AWS Data Pipeline, you can create complex data processing workloads that are repeatable, highly available, and fault-tolerant. You don’t have to worry about ensuring resource availability, managing inter-task dependencies, retrying transient failures or timeouts in individual tasks, or creating a failure notification system.
Specifically, AWS Data Pipeline offers flexibility features such as scheduling, dependency tracking, and error handling, which you can use through predefined activities and preconditions or by creating your own. For example, you can configure a pipeline to run Amazon EMR jobs, execute SQL queries directly against databases, or execute custom applications running on Amazon EC2 or in your own datacenter. You can therefore set up powerful custom pipelines to analyze and process information without dealing with the complexities of reliably scheduling and executing application logic.
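To make the scheduling, precondition, and activity pieces concrete, here is a minimal sketch of a pipeline definition built as a Python dictionary and serialized to JSON. The object ids, the S3 path, and the shell command are hypothetical placeholders. The field names follow the general shape of Data Pipeline definition files and should be treated as an assumption to verify against the current AWS documentation. The `unresolved_refs` helper is also hypothetical: it only checks that every reference between objects points at an object that exists.

```python
import json


def build_pipeline_definition(input_key: str, command: str) -> dict:
    """Build a definition with a schedule, a precondition, a resource, and an activity."""
    schedule = {
        "id": "DailySchedule",  # hypothetical id
        "type": "Schedule",
        "period": "1 day",
        "startAt": "FIRST_ACTIVATION_DATE_TIME",
    }
    input_ready = {
        "id": "InputReady",  # precondition: only run once the input object exists
        "type": "S3KeyExists",
        "s3Key": input_key,
    }
    worker = {
        "id": "WorkerInstance",  # EC2 resource the activity runs on
        "type": "Ec2Resource",
        "terminateAfter": "1 Hour",
        "schedule": {"ref": "DailySchedule"},
    }
    activity = {
        "id": "ProcessInput",
        "type": "ShellCommandActivity",
        "command": command,
        "runsOn": {"ref": "WorkerInstance"},
        "precondition": {"ref": "InputReady"},
        "schedule": {"ref": "DailySchedule"},
        "maximumRetries": "3",  # retry transient failures
    }
    return {"objects": [schedule, input_ready, worker, activity]}


def unresolved_refs(definition: dict) -> set:
    """Return referenced ids that do not match any object in the definition."""
    ids = {obj["id"] for obj in definition["objects"]}
    refs = {
        value["ref"]
        for obj in definition["objects"]
        for value in obj.values()
        if isinstance(value, dict) and "ref" in value
    }
    return refs - ids


if __name__ == "__main__":
    definition = build_pipeline_definition(
        "s3://example-bucket/input/_SUCCESS",  # hypothetical path
        "echo processing",
    )
    print(json.dumps(definition, indent=2))
```

The activity depends on the precondition, the resource, and the schedule purely through `ref` fields, which is how the dependency tracking described above is expressed in a definition.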
Finally, AWS Data Pipeline also enables you to move and process information previously locked up in on-premises data silos.
What is AWS Glue?
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy and cost-effective to categorize data, clean and enrich it, and move it reliably between data stores and data streams.
With AWS Glue, you can transform and move AWS Cloud data into your data store. You can also load data from disparate static or streaming data sources into your data warehouse or data lake for regular reporting and analysis. That way – by storing information in a data warehouse or data lake – data from different parts of your business is integrated, providing a common source of data for making decisions.
Key Features of AWS Glue
AWS Glue is serverless, so you don’t need to set up or manage any infrastructure. At its core are a central metadata repository (the AWS Glue Data Catalog), an ETL engine that automatically generates Scala or Python code, and a flexible scheduler that handles dependency resolution, job monitoring, and retries.
AWS Glue is designed to work with semi-structured data. It also introduces a component called a dynamic frame.
A dynamic frame is similar to an Apache Spark data frame (a data abstraction that organizes data into rows and columns), except that each record is self-describing, so no schema is required up front. Dynamic frames provide schema flexibility and a set of advanced transformations designed specifically for them. You can also convert between dynamic frames and Spark data frames, taking advantage of both AWS Glue and Spark transformations to perform the analyses you need.
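The idea of self-describing records can be illustrated without any AWS dependency. The sketch below is plain Python, not the AWS Glue API: `infer_schema` and `resolve_choice` are hypothetical helpers. The first derives a schema from the records themselves and lists several candidate types for a field when records disagree. The second settles such a field on one type, which is roughly the situation dynamic frames are designed to handle.

```python
from collections import defaultdict


def infer_schema(records):
    """Collect, per field, the sorted Python type names seen across all records."""
    seen = defaultdict(set)
    for record in records:
        for field, value in record.items():
            seen[field].add(type(value).__name__)
    return {field: sorted(types) for field, types in seen.items()}


def resolve_choice(records, field, cast):
    """Return copies of the records with `field` cast to a single type where present."""
    return [
        {**record, field: cast(record[field])} if field in record else dict(record)
        for record in records
    ]


# Two records that do not share a fixed schema: "id" changes type and
# "coupon" appears only in the second record.
records = [
    {"id": 1, "price": 9.99},
    {"id": "2", "price": 4.5, "coupon": "SPRING"},
]

schema = infer_schema(records)               # "id" has two candidate types
cleaned = resolve_choice(records, "id", int)  # settle "id" on int
```

In an actual Glue job, the conversion mentioned above is done with `DynamicFrame.toDF()` and `DynamicFrame.fromDF(...)` from the `awsglue` library, which is available inside the Glue runtime rather than as part of this sketch.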
With the AWS Glue console, you can discover data, transform it, and make it available for search and querying. The console calls the underlying services to orchestrate the work required to transform your data. You can also use the AWS Glue API operations to interface with AWS Glue services. Finally, you can edit, debug, and test your Python or Scala Apache Spark ETL code in a familiar development environment.
When do I Use AWS Glue, and When do I Use AWS Data Pipeline?
AWS Glue and AWS Data Pipeline have a lot in common. Both can do similar things:
- moving and transforming data across different components in the AWS Cloud,
- integrating natively with S3, DynamoDB, RDS, or Redshift,
- deploying and managing long-running asynchronous tasks,
- assisting with your organization’s ETL tasks.
However, from a practical perspective, AWS Glue is more of a managed ETL service, while AWS Data Pipeline is more of a managed workflow service. One of the key differences lies in the technology: Glue is built on Apache Spark, so its ETL jobs are written in Scala or Python.
When Should I Use AWS Glue? (AWS Glue Use Cases)
AWS Glue can be used to organize, cleanse, validate, and format data for storage in a data warehouse or data lake. You can use it to transform and move AWS Cloud information into your data store. You can also load data from disparate static or streaming data sources into your data warehouse or data lake for regular reporting and analysis. By keeping data in a data lake or data warehouse, you bring together data from various parts of your organization and create a common source of data for data-driven decision-making (DDDM).
You can use AWS Glue to run serverless queries against your Amazon S3 data lake. AWS Glue can catalog your Amazon Simple Storage Service (Amazon S3) data, making it available for querying with Amazon Athena and Amazon Redshift Spectrum. With crawlers, your metadata stays in sync with the underlying data. Athena and Redshift Spectrum can directly query your Amazon S3 data lake using the AWS Glue Data Catalog. With AWS Glue, you access and analyze data through one unified interface without loading it into multiple data silos.
AWS Glue enables you to create event-driven ETL pipelines. You can run your ETL jobs as soon as new data becomes available in Amazon S3. You can also register this new dataset in the AWS Glue Data Catalog as part of your ETL jobs.
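One way to wire this up, offered here as an assumption rather than the only option, is an AWS Lambda function subscribed to S3 object-created notifications that starts a Glue job for each new object. In the sketch below, the job name `example-etl-job` and the `--input_path` job argument are hypothetical. The Glue client is passed in as a parameter so the logic can be exercised without AWS credentials.

```python
from urllib.parse import unquote_plus

JOB_NAME = "example-etl-job"  # hypothetical Glue job name


def new_object_paths(event):
    """Extract s3:// paths from an S3 notification event."""
    paths = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        paths.append(f"s3://{bucket}/{key}")
    return paths


def start_etl_runs(glue_client, event, job_name=JOB_NAME):
    """Start one Glue job run per new object and return the run ids."""
    run_ids = []
    for path in new_object_paths(event):
        response = glue_client.start_job_run(
            JobName=job_name,
            Arguments={"--input_path": path},  # hypothetical job argument
        )
        run_ids.append(response["JobRunId"])
    return run_ids
```

Inside a Lambda handler, this would be called as `start_etl_runs(boto3.client("glue"), event)`. Registering the new dataset in the Data Catalog would then happen inside the Glue job itself.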
Crucially, AWS Glue helps you understand your data assets. You can store your data with AWS services and maintain an overview of your stored information through the AWS Glue Data Catalog. You can browse the Data Catalog to quickly search and discover the datasets you own, and maintain the relevant metadata in one central repository.
When Should I Use AWS Data Pipeline? (AWS Data Pipeline Use Cases)
- ETL Unstructured Data. Analyze unstructured data like clickstream logs using Hive or Pig on EMR, combine it with structured data from RDS and upload it to Redshift for easy querying.
- ETL Data to Amazon Redshift. Copy RDS or DynamoDB tables to S3, transform data structure, run analytics using SQL queries and load it to Redshift.
- Load AWS Log Data to Amazon Redshift. Load log files, such as AWS billing logs or AWS CloudTrail, Amazon CloudFront, and Amazon CloudWatch logs, from Amazon S3 to Redshift.
- Data Loads and Extracts. Copy data from your RDS or Redshift table to S3 and vice-versa.
- Move to Cloud. Easily copy data from your on-premises data store, like a MySQL database, and move it to an AWS data store, like S3, to make it available to a variety of AWS services such as Amazon EMR, Amazon Redshift, and Amazon RDS.
- Amazon DynamoDB Backup and Recovery. Periodically back up your DynamoDB table to S3 for disaster recovery purposes.
Factors that Drive AWS Data Pipeline vs AWS Glue Decision
AWS Data Pipeline vs AWS Glue: Infrastructure Management
As we’ve mentioned, AWS Glue is serverless – meaning that developers don’t have to manage any infrastructure. In Glue’s Apache Spark environment, scaling, provisioning, and configuration are fully managed.
On the other hand, AWS Data Pipeline isn’t serverless. Your developers define the pipelines and get more control over the resources underlying them.
Importantly, these differences determine the kind of skills and bandwidth you need to invest in your ETL activities on the AWS Cloud.
AWS Data Pipeline vs AWS Glue: Operational Methods
AWS Glue supports Redshift, SQL, Amazon RDS, Amazon S3, and DynamoDB. It also provides built-in transformations and supports the Apache Spark framework.
AWS Data Pipeline lets you define data transformations through APIs and JSON. It supports only Redshift, SQL, DynamoDB, and the platforms supported by EMR, in addition to shell commands.
AWS Data Pipeline vs AWS Glue: Compatibility/Compute Engine
AWS Glue runs ETL jobs on its virtual resources in a serverless Apache Spark environment.
AWS Data Pipeline isn’t limited to Apache Spark. It enables you to use other engines like Hive or Pig. Thus, if your ETL jobs don’t require Apache Spark, or if they require multiple engines, AWS Data Pipeline might be preferable.
AWS Data Pipeline vs AWS Glue: Pricing
With AWS Glue, you pay an hourly rate, billed by the second, for crawlers and ETL jobs. For the AWS Glue Data Catalog, you pay a monthly fee for storing and accessing the metadata. The first million objects stored and the first million accesses are free. If you provision a development endpoint to interactively develop your ETL code, you pay an hourly rate, billed per second. For AWS Glue DataBrew, interactive sessions are billed per session and DataBrew jobs are billed per minute. The AWS Glue Schema Registry is offered at no additional charge.
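Per-second billing against an hourly rate is simple arithmetic, sketched below with placeholder inputs. The hourly rate, the number of capacity units, and the minimum billed duration are hypothetical parameters, not published AWS prices, so substitute values from the current pricing page.

```python
def job_cost(runtime_seconds, hourly_rate, capacity_units=1, minimum_seconds=0):
    """Estimate the cost of one run billed per second at an hourly rate.

    All inputs are placeholders supplied by the caller; `minimum_seconds`
    models a minimum billed duration, if one applies.
    """
    if runtime_seconds < 0:
        raise ValueError("runtime_seconds must not be negative")
    billed_seconds = max(runtime_seconds, minimum_seconds)
    return hourly_rate * capacity_units * billed_seconds / 3600
```

For example, a 30-minute run on two capacity units at a placeholder rate of 0.50 per unit-hour comes to 0.50 × 2 × 1800 / 3600 = 0.50.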
AWS Data Pipeline is billed based on how often your activities and preconditions are scheduled to run and where they run (AWS or on-premises). High-Frequency activities are ones scheduled to execute more than once a day; for example, an activity scheduled to execute every hour or every 12 hours is High Frequency. Low-Frequency activities are ones scheduled to execute one time a day or less. Inactive pipelines are those in PENDING, INACTIVE, and FINISHED states.
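The billing categories above reduce to two small rules, sketched here in Python. The function names are hypothetical, and the code only encodes the definitions given in the text: an activity scheduled more than once a day is high frequency, and a pipeline in the PENDING, INACTIVE, or FINISHED state is inactive.

```python
from datetime import timedelta

INACTIVE_STATES = {"PENDING", "INACTIVE", "FINISHED"}


def frequency_class(period: timedelta) -> str:
    """Classify an activity by the interval between its scheduled runs."""
    if period <= timedelta(0):
        raise ValueError("period must be positive")
    # A period shorter than one day means more than one run per day.
    return "high" if period < timedelta(days=1) else "low"


def is_inactive(pipeline_state: str) -> bool:
    """Return True for pipeline states that the text lists as inactive."""
    return pipeline_state.upper() in INACTIVE_STATES
```

With these rules, an activity scheduled every hour or every 12 hours is classified as high frequency, while one scheduled daily or weekly is low frequency.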
Compliance Requirements and Security Certifications
AWS Data Pipeline does not provide built-in support for demonstrating compliance with requirements such as GDPR. However, this is not an automatic deal breaker. It means that you need to manage the checklists and all the necessary parameters yourself rather than through the tool.
AWS Glue, on the other hand, supports GDPR compliance and is HIPAA eligible. As a result, you can create a report directly with the help of the tool.