Building Data Engineering Pipelines in Python
Learn how to use PySpark to build data engineering pipelines in Python. Discover how to ingest data from a RESTful API into a data lake, and how to write unit tests for data transformation pipelines. Explore Apache Airflow and learn how to trigger components of an ETL pipeline on a time schedule. Build robust and reusable components with this comprehensive course.
Course Features
Cost: Free Trial
Provider: Datacamp
Certificate: No Information
Language: English
Course Overview
❗The content presented here is sourced directly from the Datacamp platform. For comprehensive course details, including enrollment information, simply click the 'Go to class' link on our website.
Updated on June 30th, 2023
This course provides an introduction to building data engineering pipelines in Python. Students will learn how to use PySpark to process data in a data lake in a structured manner. They will also explore the fundamentals of Apache Airflow, a popular workflow-orchestration tool that lets them trigger the components of an ETL pipeline on a time schedule and execute tasks in a specific order.
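To give a taste of what that structured processing can look like, the sketch below reads raw files from a data lake, applies a simple cleaning transformation, and writes the result back in a columnar format. This is a minimal illustration, not course material: the bucket paths and column names are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("clean_ratings").getOrCreate()

# Read raw CSV files from the data lake's landing zone (hypothetical path).
ratings = spark.read.csv("s3://data-lake/landing/ratings/",
                         header=True, inferSchema=True)

# Structured cleaning: drop incomplete rows, keep only valid 1-5 ratings.
clean = (
    ratings
    .dropna(subset=["user_id", "rating"])
    .filter(F.col("rating").between(1, 5))
)

# Persist the cleaned data to the lake's clean zone in Parquet format.
clean.write.mode("overwrite").parquet("s3://data-lake/clean/ratings/")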
Students will gain an understanding of what a data platform is, how data gets into it, and how data engineers build its foundations. They will also learn how to ingest data from a RESTful API into the data platform's data lake via a self-written ingestion pipeline built with Singer's taps and targets. Additionally, students will explore various types of testing and learn how to write unit tests for their PySpark data transformation pipeline so that they can create robust and reusable components.
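To make the Singer idea concrete: a tap is a program that prints SCHEMA and RECORD messages to stdout, and a target consumes them. The sketch below is an assumption-laden illustration using the singer Python package against a hypothetical REST endpoint, not the course's own code.

import requests
import singer

# Declare the stream's shape once, as a SCHEMA message on stdout.
schema = {
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": "string"},
    }
}
singer.write_schema(stream_name="products", schema=schema,
                    key_properties=["id"])

# Fetch rows from the (hypothetical) REST API and emit RECORD messages.
response = requests.get("https://api.example.com/v1/products")
response.raise_for_status()
singer.write_records(stream_name="products", records=response.json())

Piping the tap's output into a target, for example python tap_products.py | target-csv, is what composes the pair into an ingestion pipeline.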
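A unit test for such a transformation typically factors the logic into a small function and exercises it on a tiny in-memory DataFrame rather than the data lake itself. Here is a minimal pytest sketch, again with hypothetical column names and cleaning rules:

import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

def clean_ratings(df):
    # Hypothetical transformation under test: keep complete, valid ratings.
    return (
        df.dropna(subset=["user_id", "rating"])
          .filter(F.col("rating").between(1, 5))
    )

@pytest.fixture(scope="session")
def spark():
    # A small local Spark session is enough for unit tests.
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()

def test_clean_ratings_drops_invalid_rows(spark):
    rows = [(1, 4), (2, None), (3, 9)]
    df = spark.createDataFrame(rows, ["user_id", "rating"])

    result = clean_ratings(df)

    # Only the row with a complete, in-range rating should survive.
    assert result.count() == 1
    assert result.first()["user_id"] == 1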
[Applications]
The skills taught in this course apply directly to building data engineering pipelines in Python. After completing it, learners will be able to explain what a data platform is, how data gets into it, and how data engineers build its foundations; ingest data from a RESTful API into the data platform's data lake via a self-written ingestion pipeline built with Singer's taps and targets; write unit tests for a PySpark data transformation pipeline, creating robust and reusable components; and use Apache Airflow to trigger the various components of an ETL pipeline on a time schedule and execute tasks in a specific order.
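The scheduling piece might look like the following minimal Airflow DAG. The task names and Python callables are hypothetical stand-ins for the ingestion and transformation steps sketched above, not the course's solution.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():
    ...  # e.g. run the Singer tap/target pair

def transform():
    ...  # e.g. submit the PySpark cleaning job

with DAG(
    dag_id="etl_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",  # trigger the pipeline once per day
    catchup=False,
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    transform_task = PythonOperator(task_id="transform",
                                    python_callable=transform)

    # The >> operator fixes execution order: ingest runs before transform.
    ingest_task >> transform_task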
[Career Path]
A career path recommended to learners of this course is that of a Data Engineer. Data Engineers are responsible for designing, building, and maintaining data pipelines that enable the flow of data from source systems to target systems. They are also responsible for ensuring the accuracy and integrity of the data that is being transferred.
Data Engineers must have a strong understanding of data structures, algorithms, and software engineering principles. They must also be familiar with the various tools and technologies used to build data pipelines, such as PySpark, Singer, Apache Airflow, and others.
The development trend for Data Engineers is to become more specialized in the tools and technologies they use. As data pipelines become more complex, Data Engineers must be able to understand the nuances of the tools they use and be able to troubleshoot any issues that arise. Additionally, Data Engineers must be able to work with other teams, such as Data Scientists and Business Analysts, to ensure that the data pipelines they build are meeting the needs of the organization.
[Education Path]
The recommended educational path for learners of this course is a Bachelor's degree in Data Engineering. This degree program will provide students with the knowledge and skills necessary to design, develop, and maintain data engineering pipelines. Students will learn how to use various tools and technologies to create data pipelines, such as PySpark, Singer's taps and targets, Apache Airflow, and more. They will also learn how to test and debug data pipelines, as well as how to optimize them for performance. Additionally, students will gain an understanding of the principles of data engineering, such as data modeling, data warehousing, and data security.
Data engineering is evolving rapidly as more organizations rely on data-driven decision-making, so data engineering degrees are becoming increasingly popular and in demand. As the demand for data engineers grows, so too will the need for more advanced degrees, such as Master's and Doctoral degrees in Data Engineering, which equip students to design and develop complex data engineering pipelines and to understand the principles behind them.
Course Syllabus
Ingesting Data
Creating a data transformation pipeline with PySpark
Testing your data pipeline
Managing and orchestrating a workflow
Course Provider
Datacamp's Stats at AZClass