Google Cloud Dataflow – a service for streaming and batch data processing


Kamil Lasek

22 November 2022

Table of contents

  • What is Dataflow? The service overview
    • Dataflow features and advantages
    • Subtypes of the service
  • When is it worth using Dataflow?
  • How much does Dataflow cost?

What is Dataflow? The service overview

Dataflow is a serverless, secure, and cost-flexible cloud service for processing streaming and batch data with minimal latency. The idea behind the service is to eliminate the need for complex operations and maintenance by automating infrastructure provisioning and scaling. Horizontal and vertical autoscaling provide the flexibility to handle seasonal and spiky workloads without overspending. Dataflow allows businesses to look at the whole picture, keep full control over data ingestion, and improve the speed and performance of a product or application.

In other words, GCP Cloud Dataflow is a managed data pipeline service: it runs the streaming and batch pipelines that you build with the Apache Beam SDK.
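
To make this concrete, below is a minimal sketch of a Beam batch pipeline submitted to the Dataflow runner. The project ID, region, and bucket names are placeholders, not real resources.

```python
# A minimal Apache Beam word-count pipeline targeting the Dataflow runner.
# Project, region and bucket names below are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",             # execute on the Dataflow service
    project="my-gcp-project",            # placeholder project ID
    region="europe-west2",
    temp_location="gs://my-bucket/tmp",  # staging area for the job
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input.txt")
        | "SplitWords" >> beam.FlatMap(lambda line: line.split())
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        | "CountPerWord" >> beam.CombinePerKey(sum)
        | "Format" >> beam.Map(lambda kv: f"{kv[0]}: {kv[1]}")
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output")
    )
```

The same pipeline runs unchanged on the local DirectRunner, which is a common way to test logic before submitting a job to Dataflow.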

Dataflow features and advantages

  • Programming languages – Google Dataflow supports Python, Java, and Go.
  • Vertical and horizontal autoscaling that dynamically adjusts compute capacity to demand.
  • Smart diagnostics enabling SLO-based data management, job visualisation, and automatic recommendations for improving performance.
  • Streaming Engine, which moves parts of pipeline execution from worker VMs to the service’s backend, decreasing data latency.
  • Dataflow Shuffle, which moves grouping and joining operations from VMs to Dataflow’s backend, so batch pipelines scale seamlessly.
  • Dataflow SQL, which enables the development of streaming Dataflow pipelines in SQL.
  • Flexible Resource Scheduling (FlexRS), which reduces batch processing costs by scheduling jobs flexibly and mixing preemptible VMs with regular instances.
  • Dataflow templates for sharing pipelines across the organisation.
  • Notebooks integration, which lets users build pipelines in Vertex AI Notebooks and deploy them with the Dataflow runner.
  • Real-time change data capture for synchronising and replicating data with minimal latency.
  • Inline monitoring, which gives access to job metrics for troubleshooting batch and streaming pipelines.
  • Customer-managed encryption keys (CMEK) to protect batch and streaming pipelines.
  • Integration with VPC Service Controls for an additional layer of security.
  • Private IPs, which harden the data processing infrastructure and reduce consumption of Google Cloud project quota. (Streaming Engine, CMEK, and private IPs can all be enabled through pipeline options – see the sketch after this list.)
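
Several of these capabilities are switched on through pipeline options rather than code changes. The snippet below is a sketch for the Beam Python SDK; the project, bucket, and KMS key names are placeholders.

```python
# A sketch of enabling Streaming Engine, CMEK and private IPs via
# pipeline options in the Beam Python SDK. All resource names are placeholders.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",
    region="europe-west2",
    temp_location="gs://my-bucket/tmp",
    enable_streaming_engine=True,  # offload shuffle/state to the Dataflow backend
    dataflow_kms_key=(             # CMEK: customer-managed encryption key
        "projects/my-gcp-project/locations/europe-west2/"
        "keyRings/my-ring/cryptoKeys/my-key"
    ),
    use_public_ips=False,          # worker VMs receive private IPs only
)
```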

Subtypes of the service

Google Cloud offers three subtypes of the service:

  • Dataflow Prime – a service that scales both horizontally (adding more worker instances) and vertically (increasing machine specifications, i.e. CPU and memory) for streaming data processing workloads. Dataflow Prime allows you to create more efficient pipelines and gain insights in real time.
  • Dataflow Go – a service offering native Go programming language support for batch and streaming data processing workloads. It leverages Apache Beam’s multi-language pipeline model, so Go pipelines can use the performance of Java I/O connectors and ML transforms.
  • Dataflow ML – dedicated to running PyTorch and scikit-learn models directly within the pipeline, with very little code. Dataflow already supports ML workloads, for example by giving pipelines access to GPUs or by training models through frameworks such as TensorFlow Extended (TFX); Dataflow ML extends these capabilities. A minimal sketch of in-pipeline inference follows below.
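
As an illustration of the Dataflow ML idea, here is a hedged sketch of in-pipeline inference using Beam’s RunInference API, which Dataflow ML builds on. It assumes Beam 2.40+ with scikit-learn installed; the model path is a placeholder, assumed to point at a pickled scikit-learn estimator.

```python
# A minimal sketch of in-pipeline inference with Beam's RunInference API.
# The model path is a placeholder for a pickled scikit-learn estimator.
import numpy as np
import apache_beam as beam
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.sklearn_inference import (
    ModelFileType,
    SklearnModelHandlerNumpy,
)

model_handler = SklearnModelHandlerNumpy(
    model_uri="gs://my-bucket/models/model.pkl",  # placeholder model location
    model_file_type=ModelFileType.PICKLE,
)

with beam.Pipeline() as p:  # add Dataflow options to run on the service
    (
        p
        | "Examples" >> beam.Create([np.array([1.0, 2.0]), np.array([3.0, 4.0])])
        | "Predict" >> RunInference(model_handler)  # one prediction per element
        | "Print" >> beam.Map(print)
    )
```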

When is it worth using Dataflow?

Dataflow is an excellent choice for applications such as real-time artificial intelligence, data warehousing, or streaming analytics. In what situations will Dataflow work? You can use it for research, retail segmentation, clickstream analysis, or point-of-sale analysis. The service will also help detect fraud in financial services, personalise the user experience in games, and support IoT analytics in industries such as manufacturing, healthcare, or logistics.

One of the Dataflow customers is Renault. As early as 2016, the French car brand started a digital transformation as part of its alignment with Industry 4.0, which resulted in the implementation of Google Cloud infrastructure. Since then, the Dataflow service has become the primary tool for most of the platform’s data processing needs. As a result, Renault uses Dataflow to quickly extract and transform data from production sites and other connected key reference databases.

Here you’ll find information about Google Dataflow best practices: Tips and tricks to get your Cloud Dataflow pipelines into production

How much does Dataflow cost?

Like most services within Google Cloud, Dataflow operates on a pay-as-you-use model. Usage is billed for the resources your jobs consume – vCPU per hour, memory per hour, Persistent Disk, and the volume of data processed – with rates that depend on the region in which the job runs and the type of VM instance. Rates are quoted per hour, but Dataflow usage is billed in per-second increments on a per-job basis. When Dataflow works with other services, such as Cloud Storage or Pub/Sub, the costs of those services apply separately.
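
To illustrate per-second billing, below is a small worked example with purely hypothetical rates (actual rates vary by region and resource and are listed on the Google Cloud pricing page).

```python
# A worked example of Dataflow's per-second billing.
# The rates below are hypothetical placeholders, not published prices.
VCPU_RATE_PER_HOUR = 0.06     # hypothetical $/vCPU/hour
MEMORY_RATE_PER_HOUR = 0.004  # hypothetical $/GB/hour

vcpus, memory_gb = 1, 3.75    # one batch worker (see the configurations below)
job_seconds = 90 * 60         # the job ran for 90 minutes

billed_hours = job_seconds / 3600  # per-second increments -> exactly 1.5 h
cost = billed_hours * (vcpus * VCPU_RATE_PER_HOUR
                       + memory_gb * MEMORY_RATE_PER_HOUR)
print(f"Compute cost: ${cost:.4f}")  # $0.1125 under these hypothetical rates
```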

One of Dataflow’s pricing features is Flexible Resource Scheduling (FlexRS), which mixes preemptible and regular VMs in a single Dataflow worker pool, giving access to cheaper processing resources.
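
In the Beam Python SDK, FlexRS can be requested with a single pipeline option; the sketch below uses placeholder project and bucket names.

```python
# A sketch of enabling FlexRS for a batch job. Resource names are placeholders.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",
    region="europe-west2",
    temp_location="gs://my-bucket/tmp",
    flexrs_goal="COST_OPTIMIZED",  # allow delayed scheduling on cheaper capacity
)
```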

Below are examples of Dataflow prices billed for one hour of usage in the London (europe-west2) region, with the following instance configurations:

  • for Batch: 1 vCPU, 3.75 GB memory, 250 GB Persistent Disk if not using Dataflow Shuffle, 25 GB Persistent Disk if using Dataflow Shuffle;
  • for FlexRS: 2 vCPU, 7.50 GB memory, 25 GB Persistent Disk per worker, with a minimum of two workers;
  • for Streaming: 4 vCPU, 15 GB memory, 400 GB Persistent Disk if not using Streaming Engine, 30 GB Persistent Disk if using Streaming Engine.
(Figure: Dataflow pricing example)

See also:

  • What is machine learning, and why does it matter?
  • What is serverless, and how does it work?
  • Cloud SQL – a cloud database. What is it, and why is it worth using?


Kamil Lasek

Music and marketing enthusiast. In the first field he fulfils himself as a DJ, and in the second as Chief Marketing Officer. He follows the principle: “You can’t improve what you don’t measure.”
