Professional Data Engineering
Big Data solutions are incredibly complex. They usually require the correct application of five to ten, or more, complicated technologies, all working together. That increase in complexity is why Big Data is such a different animal: it is roughly an order of magnitude more complex than small data.
This course is intended only for teams and companies facing real Big Data problems. We focus on practical use cases and real-world applications. The course doesn’t just show the technologies; it shows how they’re used in data pipelines and how professional Data Engineers use them.
Duration: Online – Two weeks of concerted effort/Eight weeks of less effort
Intended Audience: Technical, Software Engineers, QA, Analysts
Prerequisites: Intermediate-Level Java
You Will Learn
- What exists in the Big Data ecosystem so you can use the right tool for the right job.
- An understanding of how HDFS works and how to interact with it.
- An understanding of how MapReduce works and how each phase works.
- An understanding of how Spark works and how each phase works.
- What Java 8 Lambdas are and how they make your Spark code humanly readable.
- The basics of coding a Spark job with Java to build your Big Data foundation.
- The various API methods in Spark and what they do.
- How SQL can be used with a Spark job and when that vastly improves your productivity and code.
- How to create Java code that runs as a function during a Spark SQL command, so you can reuse existing Java code or run use-case-specific queries.
- The basics of coding a MapReduce job with Java to build your Big Data foundation.
- What the advanced features of the MapReduce API are that only the true experts know.
- How Apache Crunch gives you a very different, more Java-centric API than MapReduce.
- How to use Apache Crunch to do the things that are painfully difficult in raw MapReduce, like joining datasets and performing secondary sorts.
- The simple and advanced SQL-like commands available in Hive.
- How to extend Hive commands with custom non-Java code to do company- or use-case-specific queries.
- How to move data out of and into relational databases like MySQL and Oracle from Hadoop/Spark using Apache Sqoop.
- How to move files and network data from many different computers to Hadoop using Apache Flume.
- What Hue is and how it aids in creating browser-based data products.
- How Apache Oozie makes it possible to create repeatable workflows that enterprises need.
- How all of these technologies come together as a solution for ETL, click stream, and sessionization use cases.
- The steps and iterations to take when creating a Big Data solution.
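To give a feel for the Java 8 Lambdas point above, here is a minimal sketch in plain Java. It is not course material and does not use the Spark API; it uses the standard `java.util.stream` library as a stand-in, and the class name `LambdaWordCount` and its members are hypothetical names chosen for illustration. It contrasts an anonymous inner class with the equivalent one-line lambda, then uses lambdas in a small word count, the classic first Big Data exercise.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical illustration class; plain Java streams stand in for a Spark job.
public class LambdaWordCount {

    // Pre-Java 8 style: an anonymous inner class just to split a line into words.
    static final Function<String, List<String>> SPLIT_VERBOSE =
            new Function<String, List<String>>() {
                @Override
                public List<String> apply(String line) {
                    return Arrays.asList(line.toLowerCase().split("\\s+"));
                }
            };

    // Java 8 style: the same logic written as a one-line lambda.
    static final Function<String, List<String>> SPLIT_LAMBDA =
            line -> Arrays.asList(line.toLowerCase().split("\\s+"));

    // Counts how often each word occurs; a TreeMap keeps the words sorted.
    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                .flatMap(line -> SPLIT_LAMBDA.apply(line).stream())
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("big data is big", "data pipelines move data");
        // Prints {big=2, data=3, is=1, move=1, pipelines=1}
        System.out.println(countWords(lines));
    }
}
```

Both `SPLIT_VERBOSE` and `SPLIT_LAMBDA` behave identically; the lambda simply removes the boilerplate, which is the readability gain the bullet on Java 8 Lambdas refers to.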
Thinking in Big Data
Introducing Big Data
What is Hadoop?
Introduction to HDFS
Introduction to MapReduce
Coding with MapReduce
Using Apache Maven
Advanced MapReduce Classes
MapReduce and Avro
Columnar File Formats
Coding With Parquet
Coding With Crunch
Crunch API Pipelines
Augmenting Hive With UDFs and Transforms
Coding With Spark
Using Apache Maven
Built-In Transformations and Actions
Spark SQL API
Spark SQL UDFs
Moving and Accessing Data
Hue and Oozie
Step 0 – Learning
How To Learn
Habits of Successful Students
Habits of Unsuccessful Students
Simple Big Data
Review and Application
Ways Of Visualizing Data
Charting With Dimple and D3.js
Importance of Visualization
Creating the Right Visualizations
The Basics of HBase
Architecting HBase Solutions
Doing Data Science on the NFL Play by Play Dataset
Enter the Query – The Hive Story
Algorithms Alone – Lost in data
Engineering Big Data Solutions
- Apache Hadoop
- Apache Spark
- Apache Hive
- Apache Pig
- Apache HBase
- Apache Impala
- Apache Kafka
- Apache Parquet
- Apache Crunch
- Apache Sqoop
- Apache Flume
- Apache Oozie