Data Engineering on Google Cloud

Looking to gain hands-on experience with data engineering and machine learning on Google Cloud? This four-day course will teach you everything you need to know about designing and building data processing systems.

£2,195 ex VAT
4-day course
Partner of the Year
Virtual, Private
Virtual Classroom
A convenient, interactive learning experience that enables you to attend one of our courses from the comfort of your own home, or anywhere you can log on. We offer Virtual Classroom on selected live classroom courses; where available, it will appear as an option under the location drop-down. These can also be booked as Private Virtual Classrooms for exclusive business sessions.
Private
A private training session for your team. Groups can be of any size and can be held at a location of your choice, including our training centres.

As a Google Cloud Partner, we’ve been selected to deliver this four-day Data Engineering course.

Our expert Cloud trainers will combine presentations and demos with practical, lab-orientated workshops to teach you the steps required to design data processing systems. You’ll learn how to build end-to-end data pipelines, analyse data and carry out machine learning.

The course covers structured, unstructured, and streaming data, so if you’re a budding Data Engineer, you’ll leave with the ability and confidence to be able to apply your new skills to a variety of scenarios and datasets.

Our Data Engineering on Google Cloud course is delivered via Virtual Classroom. We also offer it as a private training session that can be delivered virtually or at a location of your choice in the UK.

This is an intermediate-level course. If you’re looking to master the basics first, you would benefit from our Google Cloud Fundamentals: Big Data & Machine Learning course.

Want to combine the two? Take a look at our Professional Data Engineer track, which will prepare you for the Google Professional Data Engineer certification.

Course overview

Who should attend:

This course is intended for developers who are responsible for:

  • Extracting, loading, transforming, cleaning, and validating data
  • Designing pipelines and architectures for data processing
  • Integrating analytics and machine learning capabilities into data pipelines
  • Querying datasets, visualising query results, and creating reports

What you'll learn:

By the end of this course, you will be able to:

  • Design and build data processing systems on Google Cloud
  • Process batch and streaming data by implementing autoscaling data pipelines on Dataflow
  • Derive business insights from extremely large datasets using BigQuery
  • Leverage unstructured data using Spark and ML APIs on Dataproc
  • Enable instant insights from streaming data
  • Understand ML APIs and BigQuery ML, and learn to use AutoML to create powerful models without coding

Prerequisites

To benefit from this course, participants should have completed our Google Cloud Fundamentals: Big Data & Machine Learning course or have equivalent experience. You should also have:

  • Basic proficiency with a common query language such as SQL
  • Experience with data modelling and ETL (extract, transform, load) activities
  • Experience with developing applications using a common programming language such as Python
  • Familiarity with machine learning and/or statistics

Course agenda

Module 1: Introduction to Data Engineering
  • Analyse data engineering challenges
  • Introduction to BigQuery
  • Data lakes and data warehouses
  • Transactional databases versus data warehouses
  • Partner effectively with other data teams
  • Manage data access and governance
  • Build production-ready pipelines
  • Review Google Cloud customer case study
  • Lab: Using BigQuery to do Analysis
Module 2: Building a Data Lake
  • Introduction to data lakes
  • Data storage and ETL options on Google Cloud
  • Building a data lake using Cloud Storage
  • Securing Cloud Storage
  • Storing all sorts of data types
  • Cloud SQL as a relational data lake
Module 3: Building a Data Warehouse
  • The modern data warehouse
  • Introduction to BigQuery
  • Getting started with BigQuery
  • Loading data
  • Exploring schemas
  • Schema design
  • Nested and repeated fields
  • Optimising with partitioning and clustering
  • Lab: Loading Data into BigQuery
  • Lab: Working with JSON and Array Data in BigQuery
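Nested and repeated fields let a single BigQuery row carry array-valued record columns, which you then flatten at query time with UNNEST. As a rough, hypothetical illustration of that idea in plain Python (not BigQuery itself), flattening one order row with a repeated "items" field looks like this:

```python
# Toy illustration of BigQuery-style nested and repeated fields.
# A single "order" row carries a repeated record field "items";
# flattening it mimics SELECT ... FROM orders, UNNEST(items).
order = {
    "order_id": "A-100",
    "customer": "alice",
    "items": [                      # repeated (array) field of nested records
        {"sku": "pen", "qty": 2},
        {"sku": "pad", "qty": 1},
    ],
}

def unnest(row, repeated_field):
    """Yield one flat row per element of the repeated field."""
    for item in row[repeated_field]:
        flat = {k: v for k, v in row.items() if k != repeated_field}
        flat.update(item)
        yield flat

rows = list(unnest(order, "items"))
# Two flat rows, one per item, each repeating order_id and customer.
```

The pay-off in BigQuery is the same as in the sketch: related detail records travel with their parent row, so no join is needed to reassemble them.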
Module 4: Introduction to Building Batch Data Pipelines
  • EL, ELT, ETL
  • Quality considerations
  • How to carry out operations in BigQuery
  • Shortcomings of ELT
  • ETL to solve data quality issues
Module 5: Executing Spark on Dataproc
  • The Hadoop ecosystem
  • Run Hadoop on Dataproc
  • Cloud Storage instead of HDFS
  • Optimise Dataproc
  • Lab: Running Apache Spark jobs on Dataproc
Module 6: Serverless Data Processing with Dataflow
  • Introduction to Dataflow
  • Why customers value Dataflow
  • Dataflow pipelines
  • Aggregating with GroupByKey and Combine
  • Side inputs and windows
  • Dataflow templates
  • Dataflow SQL
  • Lab: A Simple Dataflow Pipeline (Python/Java)
  • Lab: MapReduce in Dataflow (Python/Java)
  • Lab: Side inputs (Python/Java)
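GroupByKey and Combine are the two core aggregation transforms in Dataflow pipelines. To keep this sketch self-contained, here is the pattern they implement expressed over an in-memory list in plain Python rather than with the Apache Beam SDK:

```python
from collections import defaultdict

# The pattern behind Dataflow's GroupByKey and Combine transforms,
# sketched in plain Python over an in-memory list of (key, value) pairs.
events = [("uk", 3), ("fr", 1), ("uk", 4), ("fr", 2), ("de", 5)]

# GroupByKey: collect every value under its key.
grouped = defaultdict(list)
for key, value in events:
    grouped[key].append(value)

# Combine (per key): reduce each group with an associative function.
totals = {key: sum(values) for key, values in grouped.items()}
# totals == {"uk": 7, "fr": 3, "de": 5}
```

In a real pipeline the combine function must be associative and commutative so Dataflow can apply it in parallel across workers; `sum` satisfies both.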
Module 7: Managing Data Pipelines with Cloud Data Fusion & Cloud Composer
  • Building batch data pipelines visually with Cloud Data Fusion
  • Components
  • UI overview
  • Building a pipeline
  • Exploring data using Wrangler
  • Orchestrating work between Google Cloud services with Cloud Composer
  • Apache Airflow environment
  • DAGs and operators
  • Workflow scheduling
  • Monitoring and logging
  • Lab: Building and Executing a Pipeline Graph in Data Fusion
  • Optional Lab: An introduction to Cloud Composer
Module 8: Introduction to Processing Streaming Data
  • Process Streaming Data
  • Explain streaming data processing
  • Describe the challenges with streaming data
  • Identify the Google Cloud products and tools that can help address streaming data challenges
Module 9: Serverless Messaging with Pub/Sub
  • Introduction to Pub/Sub
  • Pub/Sub push versus pull
  • Publishing with Pub/Sub code
  • Lab: Publish Streaming Data into Pub/Sub
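The push-versus-pull distinction is about who initiates delivery: a push subscription sends each message to an endpoint, while a pull subscriber asks for messages when it is ready. Here is a toy in-memory model of the pull pattern (deliberately not the google-cloud-pubsub API, so it runs anywhere):

```python
import queue

# Toy model of the Pub/Sub pull pattern (not the google-cloud-pubsub API):
# publishers append messages to a topic's queue; a pull subscriber asks
# for a batch of messages when it is ready, instead of having them pushed.
topic = queue.Queue()

def publish(message: bytes) -> None:
    topic.put(message)

def pull(max_messages: int) -> list:
    """Return up to max_messages without blocking, like a pull subscription."""
    messages = []
    while len(messages) < max_messages and not topic.empty():
        messages.append(topic.get())
    return messages

publish(b"sensor-reading-1")
publish(b"sensor-reading-2")
batch = pull(max_messages=10)   # both messages, in publish order
```

Pull suits subscribers that control their own pace (batch workers, Dataflow); push suits webhook-style endpoints that must react immediately.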
Module 10: Dataflow Streaming Features
  • Streaming data challenges
  • Dataflow windowing
  • Lab: Streaming Data Pipelines
Module 11: High-Throughput BigQuery & Bigtable Streaming Features
  • Streaming into BigQuery and visualising results
  • High-throughput streaming with Cloud Bigtable
  • Optimising Cloud Bigtable performance
  • Lab: Streaming Analytics and Dashboards
  • Lab: Streaming Data Pipelines into Bigtable
Module 12: Advanced BigQuery Functionality & Performance
  • Analytic window functions
  • Using WITH clauses
  • GIS functions
  • Performance considerations
  • Lab: Optimising your BigQuery Queries for Performance
  • Optional Lab: Partitioned Tables in BigQuery
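An analytic (window) function lets every row keep its identity while gaining an aggregate computed over a window of related rows. As a hypothetical plain-Python sketch of what SQL's `SUM(amount) OVER (ORDER BY day)` produces:

```python
import itertools

# What an analytic (window) function computes, in plain Python:
# each row keeps its identity but gains an aggregate over a window
# of related rows; here, a running total ordered by day, the
# equivalent of SQL's SUM(amount) OVER (ORDER BY day).
sales = [("Mon", 10), ("Tue", 5), ("Wed", 20)]
running = list(itertools.accumulate(amount for _, amount in sales))
rows = [(day, amount, total)
        for (day, amount), total in zip(sales, running)]
# rows == [("Mon", 10, 10), ("Tue", 5, 15), ("Wed", 20, 35)]
```

Contrast this with GROUP BY, which collapses the rows: a window function returns one output row per input row, which is why it is the tool for running totals, rankings, and moving averages.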
Module 13: Introduction to Analytics & AI
  • What is AI?
  • From ad-hoc data analysis to data-driven decisions
  • Options for ML models on Google Cloud
Module 14: Prebuilt ML Model APIs for Unstructured Data
  • Challenges dealing with unstructured data
  • ML APIs for enriching data
  • Lab: Using the Natural Language API to Classify Unstructured Text
Module 15: Big Data Analytics with Notebooks
  • What’s a notebook?
  • BigQuery magic and ties to Pandas
  • Lab: BigQuery in Jupyter Labs on AI Platform
Module 16: Production ML Pipelines
  • Ways to do ML on Google Cloud
  • Vertex AI Pipelines
  • AI Hub
  • Lab: Running Pipelines on Vertex AI
Module 17: Custom Model Building with SQL in BigQuery ML
  • BigQuery ML for quick model building
  • Supported models
  • Lab option 1: Predict Bike Trip Duration with a Regression Model in BigQuery ML
  • Lab option 2: Movie Recommendations in BigQuery ML
Module 18: Custom Model Building with AutoML
  • Why AutoML?
  • AutoML Vision
  • AutoML NLP
  • AutoML Tables

Upcoming courses

Tue, Aug 06 2024, Virtual Classroom
Data Engineering on Google Cloud
£2,195 ex VAT
Tue, Dec 03 2024, Virtual Classroom
Data Engineering on Google Cloud
£2,195 ex VAT