Make a career shift and become a Data Engineer
Data is eating the world. Humans and IoT devices generate zettabytes of data! This evolution has created new opportunities for software engineers, whose data interactions were so far limited to the occasional processing of a few megabytes. If you want to be a part of this revolution, mastering data engineering skills is necessary.
If you have some coding experience and want to make a career shift to adapt to the new data-driven companies, then you should check out the Become a Data Engineer course, where you will learn data engineering concepts across 12 modules. Every module is composed of theoretical and demo lessons, concluded by a practical homework coding exercise. By the end, you should be able to write streaming and batch pipelines and better understand what is going on in the data world.
You can follow the course in one of 3 different modes. If you prefer to work alone, the Basic plan should fit you best. If you prefer to be a part of a learning community and also get personalized feedback on your homework solutions, then you should try the Community plan. Finally, if you think you'll need some individual help to move on, then you should try the Community+ option.
Your instructor
My name is Bartosz Konieczny and I am a data engineer who has been working with software since 2009. I'm also an Apache Spark enthusiast, an AWS, Azure, and GCP certified cloud user, a blogger, and a speaker. I like to share knowledge, as you can discover on my waitingforcode.com blog or at conferences like Spark+AI Summit 2019 and Data+AI Summit 2020. In 2021 I was also nominated to the Databricks Beacons program.
Check me out on social media: LinkedIn, Twitter, GitHub, Stack Overflow, Facebook
Course Curriculum
- Introduction
- Data pipeline
- Batch processing
- Streaming processing
- Data architectures - Lambda
- Data architectures - Kappa
- Data architectures - Zeta
- Data architectures - SMACK
- Data architectures - data lake, lake house, data mesh
- Data quality
- Data stores
- Data consistency
- Distributed processing
- Data knowledge
- Tools used in the course
- Homework
- Introduction
- Problem statement
- API Gateway - introduction
- API Gateway - technical aspects
- Data governance
- Data migration
- Delivery semantics
- Idempotency
- Fault-tolerance
- Replication
- Partitioning - introduction
- Partitioning - methods
- Partitioning and bucketing - demo
- ACID file formats - introduction
- ACID file formats - Delta Lake demo
- Batch layer
- Homework
- Introduction
- Problem statement
- Data enrichment
- Data anonymization
- Late data
- Deduplication
- Metadata
- Schema
- Schema registry
- Schema registry - demo
- Schema evolution - demo
- Schema management for semi-structured data
- Schema management for semi-structured data - demo
- Serialization
- Monitoring and alerting
- Monitoring and alerting - demo
- Data validation
- Data validation - demo
- Homework
- Introduction
- Problem statement
- Stateful processing - introduction
- Stateful processing - window
- Stateful processing - arbitrary stateful processing
- Stateful processing - state store implementations
- Stateful processing - stateful logic
- Stateful processing - window demo
- Stateful processing - arbitrary stateful processing demo
- Shuffle
- Late data
- Scalability
- Elasticity
- Fault-tolerance
- Idempotency
- Idempotency - demo
- Reprocessing
- Complex Event Processing - theory
- Complex Event Processing - framework
- Complex Event Processing - demo
- Messaging patterns
- Debugging - tips
- Homework
- Introduction
- Problem statement
- Data pipeline steps
- Staging area
- ETL vs ELT
- Patterns
- Orchestration framework
- Alerting
- Idempotency
- Idempotency and small data
- Idempotency and append data
- Idempotency and append data - demo
- Idempotency and immutable data (versioning)
- Idempotency and immutable data (versioning) - demo
- Data reprocessing
- Triggers
- Task examples
- Best practices
- Data lineage
- Data lineage - demo
- Homework
- Introduction
- Problem statement
- SQL
- JOINs
- Execution plans
- Approximate algorithms
- Data warehousing
- Columnar format vs row format
- Encoding
- Data modeling - normalized data
- Data modeling - dimensional models
- Data modeling - data vault
- Data modeling - denormalization
- Data modeling - pipeline integration
- Real-time SQL - Structured Streaming
- Real-time SQL - Kafka SQL
- Data security
- Data security - credentials and permissions demo
- Data security - data encryption demo
- Data security - data versioning demo
- Homework
- Introduction
- Problem statement
- Visualization types
- Data exploration
- Data exploration - Jupyter example
- Data visualization - JavaScript frameworks
- Data visualization - Python frameworks
- Data visualization - Reporting tools
- Reporting tools - intermediate storage
- Data catalog
- Data mart
- Data visualization and batch processing
- Data visualization and streaming processing
- Best practices
- Homework
- Introduction
- Problem statement
- Polyglot persistence
- Asynchronous communication - theory and Scala example
- Asynchronous communication - Python example
- Compression
- Bulk operations
- Data mutation
- Window-based processing
- Time-series
- Time-series - demo
- RESTful web services
- RESTful web services - demo
- Homework
- Introduction
- Problem statement
- Main concepts
- ML workflow
- Compute environment
- ML workflow - Notebook demo
- ML workflow - automation demo (ETL)
- Online learning
- Online learning - demo
- Model quality
- Serving layer
- Rendezvous architecture
- ML engineer
- ML workflow platform
- ML workflow platform - demo
- Homework
- Introduction
- Problem statement
- Cloud computing - why
- Cloud computing - introduction
- Cloud computing - data services typology
- AWS cloud data services
- GCP cloud data services
- Azure cloud data services
- Docker
- Kubernetes
- Software engineering best practices - Scala example
- Software engineering best practices - Python example
- Tests and data processing - demo
- DevOps - introduction
- DevOps - components
- DevOps - data example
- DevOps - GitHub Actions demo
- Data processing frameworks - going distributed
- Data processing frameworks - going distributed - Hadoop YARN demo
- Data processing frameworks - going distributed - Kubernetes demo
- Data processing frameworks - going distributed - tips
- Serverless
- Frameworks and libraries not covered
- Homework
Pricing plans
Basic
- 12 weeks of course videos
- Homework exercises
- Lifetime access to the course & updates
Community
- 12 weeks of course videos
- Homework exercises
- Lifetime access to the course & updates
- Lifetime access to the user group
- Lifetime access to the GitHub project
- Individual homework feedback
- 12 class Live Calls in English
Community+
- 12 weeks of course videos
- Homework exercises
- Lifetime access to the course & updates
- Lifetime access to the user group
- Lifetime access to the GitHub project
- Individual homework feedback
- Lifetime access to class Live Calls in English
- 3 1-on-1 calls in Polish, English, or French
- Access to the recorded Live Calls
Watch 3 sample lessons of the Become a Data Engineer course
Frequently asked questions
What will I be capable of by the end of the course?
After 12 weeks you should be able to:
- understand scalability in distributed data systems
- design data systems
- know the most important patterns of modern data architectures
- understand data concepts globally and use that knowledge to make sense of existing data frameworks and data stores
- write batch and streaming pipelines with modern data frameworks and data stores
- switch faster to managed cloud services
When does the course start and how long does it take?
The course starts 4 times a year: in March, June, September, and November. You will need at least 12 weeks to finish it. Joining the course outside these dates won't be possible. If you miss a date, you can subscribe to the mailing list to be alerted before the next opening.
How long do I have access to this course?
You get lifetime access to the course, including all updates.
Can I follow the course at my own pace?
Yes, you can organize your learning plan as you want. The course is divided into 12 weeks for easier presentation of the content, but it doesn't mean you have to finish it in 12 weeks.
How long does every week take?
Except for Weeks 0 and 12, every week has between 70 and 138 minutes of recorded lessons.
What do I need to follow the course?
Time and motivation. The content tries to cover as many areas of data engineering as possible, so it will require strong motivation. On the technical side, if you want to do the homework exercises, you should be able to write some code (preferably in Python, Scala, or Java) and run Docker images on your computer.
What about the code snippets?
Most of the code snippets are written in Scala, Java and Python. However, they use very basic concepts of these languages, so even if you don't know any of them, you should be able to understand the examples.
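To give you an idea of that level, here is a hypothetical snippet, comparable in complexity to the lessons' examples (the function and data are invented for this illustration):

    # A small deduplication helper using only basic Python:
    # plain functions, collections, and a loop.
    def deduplicate(events):
        seen_ids = set()
        unique_events = []
        for event in events:
            if event["id"] not in seen_ids:
                seen_ids.add(event["id"])
                unique_events.append(event)
        return unique_events

    print(deduplicate([{"id": 1}, {"id": 1}, {"id": 2}]))
    # prints: [{'id': 1}, {'id': 2}]

If you can read this without trouble, the snippets shouldn't be a blocker.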
What if I'm not satisfied with the course?
If for any reason the course doesn't satisfy you, we will issue a refund. The guarantee is valid for 21 days from the first week's publication date.
Will I get an invoice?
Yes, just let me know at contact@becomedataengineer.com. For EU customers: I don't have an EU VAT number, so I cannot include VAT in the price.
Do you have a group offer?
YES! If you have a team of 3 people or more, contact me at contact@becomedataengineer.com and you will get a 10% discount!
Can I pay for the course in installments?
No, it's too complicated logistically, so I prefer a one-time payment.
How can I communicate with you?
The content and all public communication will be in English, but if something is unclear, I can explain it in French or Polish in private. Just let me know!
How can I ask a question?
Please create a new topic on our user group. The idea behind open questions and transparency is to share this extra knowledge with the whole class.
How will I get access to the course?
After your payment is confirmed, I will send you an e-mail with all the details to access the learning platform.
What is the difference between a class call and a 1-on-1 call?
A 1-on-1 call is a call between you and me, whereas a class call is a call with the other students.
How do I join the class Live Calls?
Every 3 weeks you will find the link to the class Live Call.
When will the class Live Calls happen?
On Saturdays at 5 PM UTC. If that's too late for your timezone, and you're not alone in this situation, I will propose another slot on Saturday before noon (UTC).
How do I join a 1-on-1 live call?
Just write me a message that you need one and we'll find a time that fits both of us.
Can you add new content?
If you think something is missing from the curriculum, please contact me at contact@becomedataengineer.com and I will try to add this topic. You have lifetime access to the course, so you will see this and other updates.
What do I need to do the homework exercises?
You should be able to run Docker images and use an IDE. I will give all my examples in IntelliJ and PyCharm, so having the same tools will facilitate troubleshooting.
I will give the examples in Scala or Python, but you can do the homework exercises in any language you want.
What is the format of the course?
Most of the time you will see me explaining concepts on a blackboard. From time to time I will step aside and share my screen to explain data concepts with some code or slides.
What will I code during the course?
You've just joined the data engineering team at MyBlogAnalytics and you've been asked to implement different data pipelines with Apache Spark, Apache Kafka, Apache Airflow, and PostgreSQL, preferably in Python or Scala.
During your first week you will have to implement the data ingestion part, i.e. moving the data from your consumers into the system.
Right after that, you'll implement an analytical pipeline and make its results visible to non-technical end users.
By the end of your mission, you'll expose your data through an API to the other technical departments of your company. You'll also collaborate with the data scientists on your team to create a Machine Learning pipeline.
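To give you a taste of that work, below is a minimal PySpark sketch of a batch ingestion job in the spirit of the first week; the bucket paths and column names are invented for this illustration, not taken from the actual homework:

    # Hypothetical batch ingestion job: read raw visit events,
    # clean them, and store them for the analytical pipelines.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("myblog-visits-ingestion").getOrCreate()

    # Read the raw JSON events sent by the consumers (hypothetical path).
    raw_visits = spark.read.json("s3a://myblog-raw-data/visits/")

    # Basic cleansing: drop invalid events and derive a partitioning column.
    cleaned = (raw_visits
        .filter(F.col("visit_id").isNotNull())
        .withColumn("event_time", F.to_timestamp("event_time"))
        .withColumn("event_date", F.to_date("event_time")))

    # Write the curated dataset, partitioned for the downstream jobs.
    cleaned.write.mode("overwrite").partitionBy("event_date") \
        .parquet("s3a://myblog-curated-data/visits/")

The homework solutions will of course be richer (idempotency, late data, monitoring), but this is the general shape of a batch pipeline.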
Do I have to do all the homework exercises in order?
No. Even though the homework assignments are logically connected, they don't have to be done in order. For example, you can start with the Week 5 homework and go back to Week 1 later.
What is the format of the homework exercises?
There are 3 task categories: conceptual work where I will ask you to define some architectures, coding exercises where I will ask you to implement one component of our system, and open questions where I will ask you to do some research before answering.
Every week's homework is composed of 2 parts called "main" and "bonus". To move forward you should at least complete the "main" part, and if you're stuck, you can always ask for help on our user group or during a call.
How will you review my homework?
Most of the time you will create a Pull Request on GitHub and send me the link to your solution. From time to time, I will give feedback on the forum, but that's limited to the conceptual tasks, just to spread the knowledge. If you prefer not to share your solution publicly, you can still submit the homework as a Pull Request from a private GitHub project.
How can you help me if I don't know how to start the homework?
I will answer your message on our user group. For every coding homework I have also prepared a video explaining a possible solution. You can use it as inspiration.
Will I learn Apache Spark internals?
No, the course is not intended to dive deep into Apache Spark internals but rather to show how to use Spark in data projects. That said, I will do my best to answer any question related to your homework or a concept described in the course.
Where can I find the answers to my questions?
If I cannot provide a satisfactory answer during a call, whether a 1-on-1 or a class call, I will post the answer on our user group. All questions are anonymous, so on the user group you will only see the question and my answer.
How long can I access the user group?
If the user group is included in your plan (Community or Community+), you have access to it for as long as the course is online.
Why is the number of Community+ spots restricted?
To give students the best experience, only a few spots of that type are offered at every opening.