Data Engineering: Major Technologies To Learn In 2022
Data Engineering is growing so fast that the field is flooded with tools and technologies, and it is easy to get lost trying to learn them all.
A more effective approach is to categorise these technologies and understand one tool from each category really well. The core ideas stay the same across tools, with only minor differences.
Let’s dive in:
1. Languages
Python, Scala
Python is easy to learn and dynamically typed. PySpark, the Python API for Apache Spark, is more widely adopted than the native Scala API. Airflow is written in Python, and most popular big data tools offer Python APIs.
Scala is a statically typed language used at major technology companies. It runs on the JVM and can use Java libraries. Apache Kafka, Apache Spark, and Apache Flink are all written in Java and Scala.
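To get a feel for the Python side, here is a minimal PySpark sketch. It assumes a local Spark installation (pip install pyspark); the data and column names are just illustrative.

```python
# A minimal PySpark sketch: the same DataFrame API exists in Scala.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("intro").getOrCreate()

# Build a small DataFrame and run a simple aggregation.
df = spark.createDataFrame(
    [("alice", 3), ("bob", 5), ("alice", 7)],
    ["user", "clicks"],
)
df.groupBy("user").sum("clicks").show()

spark.stop()
```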
2. Hadoop (MapReduce and HDFS)
Hadoop is the idea behind many of these technologies. Studying it helps in understanding core concepts like scalability, replication, fault tolerance, and partitioning.
HDFS, the storage layer of Hadoop, is a distributed file system.
MapReduce, a batch processing programming model published in 2004, was subsequently implemented in various open-source data systems, including Hadoop, MongoDB, and CouchDB.
Although the importance of MapReduce is declining, it is worth understanding because it provides a clear picture of why and how batch processing is useful.
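To see the model concretely, here is a toy, single-process word count written as explicit map, shuffle, and reduce phases. This is only an illustration; real frameworks run the same phases in parallel across many machines.

```python
# A toy illustration of the MapReduce model: word count.
from collections import defaultdict

def map_phase(documents):
    # Emit (word, 1) pairs, one per occurrence.
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    # Group all values by key, as the framework's shuffle step would.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Sum the counts for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data is big", "data pipelines move data"]
print(reduce_phase(shuffle(map_phase(docs))))
# {'big': 2, 'data': 3, 'is': 1, 'pipelines': 1, 'move': 1}
```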
3. Ingestion
The tools below ingest raw data into the platform.
Apache Kafka, AWS Kinesis, Cloud Pub/Sub
Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications. It offers much stronger durability guarantees than traditional message queues like RabbitMQ or Redis-based queues.
AWS Kinesis and Cloud Pub/Sub are the managed streaming equivalents from AWS and Google Cloud respectively.
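Here is a minimal producer sketch using the kafka-python client (pip install kafka-python). The broker address and topic name are assumptions for illustration.

```python
# Publish a raw event to an ingestion topic.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed local broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

producer.send("raw-events", {"user_id": 42, "action": "click"})
producer.flush()  # block until the event is acknowledged by the broker
```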
4. Fast Storage
This category covers storing real-time data with fast access.
Apache Kafka, AWS Kinesis, Cloud Pub/Sub
Most of the event-streaming tools used for ingestion double as fast storage, since they retain events for a configurable period and let consumers replay them.
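Because the broker retains events on disk, a new consumer can read history from the beginning. A sketch with the same assumptions as above (kafka-python, a local broker, a "raw-events" topic):

```python
# Replay retained events from the oldest offset onwards.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "raw-events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # start from the oldest retained event
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    print(message.offset, message.value)
```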
5. Slow storage/direct data lake access
This category stores the data for batch processing.
HDFS, AWS S3, Google Cloud Storage
In this category, all the data is stored as a data lake or staging layer that batch processing jobs read from.
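Landing a raw file in the lake is often a single API call. A sketch using boto3 (pip install boto3); the bucket and key names are placeholders.

```python
# Stage a raw extract under a date-partitioned prefix for batch jobs to pick up.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="events_2022-01-01.json",
    Bucket="my-data-lake",          # placeholder bucket
    Key="raw/events/dt=2022-01-01/events.json",
)
```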
6. Real-time processing and analytics
The tools below are used for processing streaming data.
Apache Flink, Spark Streaming, Amazon Kinesis Analytics, Cloud Dataflow, ksqlDB, Kafka Streams
Apache Flink is a fast-growing solution for real-time data with minimal latency, as it processes events one at a time. Spark Streaming is also widely used, but it processes data in micro-batches, which adds a small amount of latency.
Amazon Kinesis Analytics and Cloud Dataflow are the stream processing solutions provided by AWS and Google Cloud respectively.
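A Structured Streaming sketch showing the micro-batch style: Spark reads a Kafka topic and maintains running counts. This assumes a local broker and the spark-sql-kafka connector package on the classpath; topic and key names are placeholders.

```python
# Count events per key across micro-batches and print to the console.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "raw-events")
    .load()
)

counts = events.groupBy(col("key")).count()

query = (
    counts.writeStream.outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination()
```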
7. Batch processing and analytics
These tools are used for batch processing.
Apache Spark, AWS Glue, Dataproc, AWS EMR, Presto, Trino
Apache Spark is the most widely used engine for ETL operations. It offers APIs in Python, Scala, Java, and R, and it is an in-memory compute engine, which gives it significant performance benefits over disk-based MapReduce.
AWS Glue and AWS EMR are Spark-based managed solutions from AWS.
Dataproc is the equivalent managed offering from Google Cloud.
Presto is an open-source distributed SQL query engine for running interactive analytic queries against data sources of all sizes; Trino is a fork of Presto maintained by its original creators.
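A typical batch ETL job with PySpark reads raw files from the lake, transforms them, and writes the result back as Parquet. The paths and column names below are placeholders.

```python
# Read raw JSON from the lake, clean it, and write partitioned Parquet.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.appName("batch-etl").getOrCreate()

raw = spark.read.json("s3a://my-data-lake/raw/events/")  # placeholder path

curated = (
    raw.withColumn("event_date", to_date(col("timestamp")))
    .filter(col("user_id").isNotNull())
    .dropDuplicates(["event_id"])
)

# Partitioned Parquet is a common layout for downstream batch consumers.
curated.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3a://my-data-lake/curated/events/"
)
```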
8. Orchestration overlay
These tools are used to orchestrate data pipelines.
Apache Airflow, Google Cloud Composer, Astronomer, AWS Step Functions
All pipelines need to be scheduled and their dependencies defined. Apache Airflow is a Python-based workflow engine in which pipelines are created dynamically as code; a minimal DAG is sketched below.
Google Cloud Composer and Astronomer are managed Airflow offerings, while AWS Step Functions is an AWS-native orchestration service that does not depend on Airflow.
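A minimal Airflow DAG: two dependent tasks on a daily schedule. The task bodies are placeholders for real extract/transform logic.

```python
# Define a daily pipeline where transform runs only after extract succeeds.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull raw data from the source")

def transform():
    print("clean and load the data")

with DAG(
    dag_id="daily_events_pipeline",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    extract_task >> transform_task  # dependency: extract before transform
```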
9. Data Warehouse
The tools below store data for analytics and reporting.
AWS Redshift, BigQuery, Snowflake, PostgreSQL
All the curated and transformed data is stored in the data warehouse for analytics and reporting purposes.
AWS Redshift is a massively parallel processing (MPP) warehouse provided by AWS, while BigQuery and Snowflake are also widely used, mainly for ELT (extract-load-transform) workloads.
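In the ELT pattern, transformation happens inside the warehouse with SQL. A sketch using the google-cloud-bigquery client (pip install google-cloud-bigquery); the dataset and table names are placeholders.

```python
# Aggregate raw, already-loaded events into a reporting table inside BigQuery.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    CREATE OR REPLACE TABLE analytics.daily_clicks AS
    SELECT DATE(event_time) AS day, COUNT(*) AS clicks
    FROM raw.events
    GROUP BY day
"""
client.query(query).result()  # wait for the job to finish
```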
10. Serving Layer
These tools serve data to consumers.
NoSQL Databases — Cassandra, DynamoDB
Relational Databases — PostgreSQL, MySQL
This layer is where APIs and data consumers are served with minimal latency. Cassandra is a highly available NoSQL database commonly used here.
DynamoDB is a highly scalable NoSQL database provided by AWS.
PostgreSQL and MySQL are relational databases; choose between the relational and NoSQL options based on your access patterns and consistency needs.
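Serving often means low-latency point reads by key. A sketch with boto3; the table name and key schema are placeholders.

```python
# A low-latency key lookup from DynamoDB, its typical access pattern.
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("user-profiles")  # placeholder table

response = table.get_item(Key={"user_id": "42"})
print(response.get("Item"))
```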
These categories and technologies will help you build a tree of knowledge: focus first on the roots and the trunk, and climb into the branches only when required.