Data Engineering
Data Comparison
- datacompy
A Python library that facilitates the comparison of two DataFrames in Pandas, Polars, Spark and more. The library goes beyond basic equality checks by providing detailed insights into discrepancies at both row and column levels.
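A minimal sketch of how a datacompy comparison is typically driven, assuming two Pandas DataFrames that share an "id" key; the frames and column names are illustrative:

```python
import pandas as pd
import datacompy

df1 = pd.DataFrame({"id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})
df2 = pd.DataFrame({"id": [1, 2, 4], "amount": [10.0, 21.0, 40.0]})

# Join on the key column(s) and compare the remaining columns value by value.
compare = datacompy.Compare(df1, df2, join_columns="id")
print(compare.report())  # row- and column-level summary of matches and mismatches
```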
Data Ingestion
- Kafka
Publish-subscribe messaging rethought as a distributed commit log.
- BottledWater
Change data capture from PostgreSQL into Kafka. Deprecated.
- kafkat
Simplified command-line administration for Kafka brokers.
- kafkacat
Generic command line non-JVM Apache Kafka producer and consumer.
- pg-kafka
A PostgreSQL extension to produce messag…
- AWS Kinesis
A fully managed, cloud-based service for real-time data processing over large, distributed data streams.
- RabbitMQ
Robust messaging for applications.
- dlt
A fast and simple pipeline-building library for Python data developers; runs in notebooks, cloud functions, Airflow, and more.
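A minimal dlt sketch, assuming `pip install "dlt[duckdb]"`; the pipeline, dataset, and table names are placeholders:

```python
import dlt

rows = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]

pipeline = dlt.pipeline(
    pipeline_name="demo_pipeline",   # illustrative name
    destination="duckdb",            # local destination for experimentation
    dataset_name="demo_data",
)
load_info = pipeline.run(rows, table_name="users")  # schema is inferred, data is loaded
print(load_info)
```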
- FluentD
An open source data collector for unified logging layer.
- Embulk
An open source bulk data loader that helps data transfer between various databases, storages, file formats, and cloud services.
- Apache Sqoop
A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
- Heka
Data Acquisition and Processing Made Easy. Deprecated.
- Gobblin
Universal data ingestion framework for Hadoop from LinkedIn.
- Nakadi
An open source event messaging platform that provides a REST API on top of Kafka-like queues.
- Pravega
Provides a new storage abstraction - a stream - for continuous and unbounded data.
- Apache Pulsar
An open-source distributed pub-sub messaging system.
- AWS Data Wrangler
Utility belt to handle data on AWS.
- Airbyte
Open-source data integration for modern data teams.
- Artie
Real-time data ingestion tool leveraging change data capture.
- Sling
CLI data integration tool specialized in moving data between databases, as well as storage systems.
- Meltano
CLI & code-first ELT.
- Singer SDK
The fastest way to build custom data extractors and loaders compliant with the Singer Spec.
- Google Sheets ETL
Live import all your Google Sheets to your data warehouse.
- CsvPath Framework
A delimited data preboarding framework that fills the gap between MFT and the data lake.
- Estuary Flow
No/low-code data pipeline platform that handles both batch and real-time data ingestion.
- db2lake
Lightweight Node.js ETL framework for databases → data lakes/warehouses.
File System
- AWS S3
Object storage built to retrieve any amount of data from anywhere.
- smart_open
Utils for streaming large files (S3, HDFS, gzip, bz2).
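A brief smart_open sketch, assuming AWS credentials are already configured and using a placeholder bucket and key; decompression is inferred from the file extension:

```python
from smart_open import open

# Stream an S3 object line by line without downloading the whole file.
with open("s3://my-bucket/logs/events.jsonl.gz", "r") as fin:
    for line in fin:
        print(line.rstrip())
```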
- Alluxio
A memory-centric distributed storage system enabling reliable data sharing at memory-speed across cluster frameworks, such as Spark and MapReduce.
- CEPH
A unified, distributed storage system designed for excellent performance, reliability, and scalability.
- JuiceFS
A high-performance Cloud-Native file system driven by object storage for large-scale data storage.
- OrangeFS
Orange File System is a branch of the Parallel Virtual File System.
- SnackFS
A bite-sized, lightweight HDFS compatible file system built over Cassandra.
- GlusterFS
Gluster Filesystem.
- XtreemFS
Fault-tolerant distributed file system for all storage needs.
- SeaweedFS
SeaweedFS is a simple and highly scalable distributed file system with two objectives: to store billions of files and to serve them fast. Instead of supporting full POSIX file system semantics, SeaweedFS implements only a key-to-file mapping; by analogy with "NoSQL", you can call it "NoFS".
- S3QL
A file system that stores all its data online using storage services like Google Storage, Amazon S3, or OpenStack.
- LizardFS
A software-defined, distributed, parallel, scalable, fault-tolerant, geo-redundant, and highly available file system.
Serialization format
- Apache Avro
Apache Avro™ is a data serialization system.
- Apache Parquet
A columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
- Snappy
A fast compressor/decompressor. Used with Parquet.
- PigZ
A parallel implementation of gzip for modern multi-processor, multi-core machines.
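A minimal sketch of writing and reading Parquet with Snappy compression via pandas and pyarrow (both assumed installed); the file path and columns are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"user_id": [1, 2, 3], "score": [0.3, 0.7, 0.9]})
df.to_parquet("scores.parquet", engine="pyarrow", compression="snappy")

# Columnar layout allows reading only the columns you need.
restored = pd.read_parquet("scores.parquet", columns=["user_id"])
print(restored)
```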
- Apache ORC
The smallest, fastest columnar storage for Hadoop workloads.
- Apache Thrift
The Apache Thrift software framework, for scalable cross-language services development.
- ProtoBuf
Protocol Buffers - Google's data interchange format.
- SequenceFile
A flat file consisting of binary key/value pairs. It is extensively used in MapReduce as input/output formats.
- Kryo
A fast and efficient object graph serialization framework for Java.
Stream Processing
- Apache Beam
A unified programming model that implements both batch and streaming data processing jobs that run on many execution engines.
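A tiny Apache Beam pipeline sketch that runs on the default local DirectRunner; the element values and step labels are illustrative:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | "Create" >> beam.Create(["alpha", "beta", "gamma"])
        | "Lengths" >> beam.Map(lambda word: (word, len(word)))
        | "Print" >> beam.Map(print)
    )
```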
- Spark Streaming
Makes it easy to build scalable fault-tolerant streaming applications.
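A minimal Spark Structured Streaming sketch (pyspark assumed installed) that counts words from a local socket source and prints running totals to the console; host and port are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("wordcount").getOrCreate()

lines = (spark.readStream.format("socket")
         .option("host", "localhost").option("port", 9999).load())
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```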
- Apache Flink
A streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.
- Apache Storm
A free and open source distributed realtime computation system.
- Apache Samza
A distributed stream processing framework.
- Apache NiFi
An easy to use, powerful, and reliable system to process and distribute data.
- Apache Hudi
An open source framework for managing storage for real-time processing; one of its most interesting features is upsert support.
- CocoIndex
An open source ETL framework for building fresh indexes for AI.
- VoltDB
An ACID-compliant RDBMS which uses a shared nothing architecture.
- PipelineDB
The Streaming SQL Database.
- Spring Cloud Dataflow
Streaming and tasks execution between Spring Boot apps.
- Bonobo
A data-processing toolkit for python 3.5+.
- Robinhood's Faust
Forever scalable event processing & in-memory durable K/V store as a library with asyncio & static typing.
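A minimal Faust agent sketch, assuming a local Kafka broker at localhost:9092; the app, topic, and module names are illustrative (run with a command like `faust -A myapp worker -l info`):

```python
import faust

app = faust.App("myapp", broker="kafka://localhost:9092")
orders_topic = app.topic("orders", value_type=bytes)

@app.agent(orders_topic)
async def process(orders):
    # Events arrive as an async stream; each message is handled in order.
    async for order in orders:
        print("received order:", order)
```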
- HStreamDB
The streaming database built for IoT data storage and real-time processing.
- Kuiper
A lightweight IoT data analytics and streaming engine for the edge, implemented in Go, that can run on all kinds of resource-constrained edge devices.
- Zilla
An API gateway built for event-driven architectures and streaming that supports standard protocols such as HTTP, SSE, gRPC, MQTT, and the native Kafka protocol.
- SwimOS
A framework for building real-time streaming data processing applications that supports a wide range of ingestion sources.
- Pathway
Performant open-source Python ETL framework with Rust runtime, supporting 300+ data sources.
Batch Processing
- Batch ML
H2O - Fast scalable machine learning API for smarter applications.
Mahout - An environment for quickly creating scalable performant machine learning applications.
Spark MLlib - Spark's scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives.
- Batch Graph
GraphLab Create - A machine learning platform that enables data scientists and app developers to easily create intelligent apps at scale.
Giraph - An iterative graph processing system built for high scalability.
Spark GraphX - Apache Spark's API for graphs and graph-parallel computation.
- Batch SQL
[Presto](https://prestodb.github.io/docs…
- Hadoop MapReduce
A software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
- Spark
A multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
- Spark Packages
A community index of packages for Apache Spark.
- Deep Spark
Connecting Apache Spark with different data stores. Deprecated.
- Spark RDD API Examples
Examples by Zhen He.
- Livy
…
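A small PySpark batch sketch (pyspark assumed installed): read a CSV, aggregate, and write Parquet; file paths and column names are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily_totals").getOrCreate()

orders = spark.read.csv("orders.csv", header=True, inferSchema=True)
totals = orders.groupBy("customer_id").agg(F.sum("amount").alias("total_amount"))
totals.write.mode("overwrite").parquet("daily_totals.parquet")  # overwrite previous output
```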
- AWS EMR
A web service that makes it easy to quickly and cost-effectively process vast amounts of data.
- Data Mechanics
A cloud-based platform deployed on Kubernetes making Apache Spark more developer-friendly and cost-effective.
- Tez
An application framework which allows for a complex directed-acyclic-graph of tasks for processing data.
- Bistro
A light-weight engine for general-purpose data processing including both batch and stream analytics. It is based on a novel unique data model, which represents data via functions and processes data via columns operations as opposed to having only set operations in conventional approaches like MapReduce or SQL.
- Substation
A cloud native data pipeline and transformation toolkit written in Go.
Charts and Dashboards
- Highcharts
A charting library written in pure JavaScript, offering an easy way of adding interactive charts to your web site or web application.
- ZingChart
Fast JavaScript charts for any data set.
- C3.js
D3-based reusable chart library.
- SmoothieCharts
A JavaScript Charting Library for Streaming Data.
- PyXley
Python helpers for building dashboards using Flask and React.
- Plotly
Flask, JS, and CSS boilerplate for interactive, web-based visualization apps in Python.
- Apache Superset
A modern, enterprise-ready business intelligence web application.
- Redash
Make Your Company Data Driven. Connect to any data source, easily visualize and share your data.
- Metabase
The easy, open source way for everyone in your company to ask questions and learn from data.
- PyQtGraph
A pure-python graphics and GUI library built on PyQt4 / PySide and numpy. It is intended for use in mathematics / scientific / engineering applications.
- Seaborn
A Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics.
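A short seaborn sketch using its bundled "tips" example dataset (downloaded on first use); the column names come from that dataset:

```python
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="day")  # statistical scatter plot
plt.savefig("tips.png")
```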
- QueryGPT
Natural language database query interface with automatic chart generation, supporting Chinese and English queries.
Workflow
- Luigi
A Python module that helps you build complex pipelines of batch jobs.
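A minimal Luigi task sketch with a parameter, a target, and a local-scheduler run; the file names are illustrative:

```python
import luigi

class WordCount(luigi.Task):
    input_path = luigi.Parameter(default="input.txt")

    def output(self):
        # The target's existence tells Luigi the task is complete.
        return luigi.LocalTarget("word_count.txt")

    def run(self):
        with open(self.input_path) as fin, self.output().open("w") as fout:
            fout.write(str(sum(len(line.split()) for line in fin)))

if __name__ == "__main__":
    luigi.build([WordCount()], local_scheduler=True)
```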
- Cascading
Java based application development platform.
- Airflow
A system to programmatically author, schedule, and monitor data pipelines.
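A minimal Airflow DAG sketch using the TaskFlow API (Airflow 2.4+ assumed for the `schedule` argument); the DAG id and schedule are illustrative:

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def example_pipeline():
    @task
    def extract():
        return [1, 2, 3]

    @task
    def load(rows):
        print(f"loaded {len(rows)} rows")

    load(extract())  # task dependency is inferred from the data flow

example_pipeline()
```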
- Azkaban
A batch workflow job scheduler created at LinkedIn to run Hadoop jobs. Azkaban resolves the ordering through job dependencies and provides an easy-to-use web user interface to maintain and track your workflows.
- Oozie
A workflow scheduler system to manage Apache Hadoop jobs.
- Pinball
DAG-based workflow manager. Job flows are defined programmatically in Python. Supports output passing between jobs.
- Dagster
An open-source Python library for building data applications.
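A minimal Dagster sketch with two software-defined assets, where the dependency is expressed through the function argument; the asset names are illustrative, and it can be run locally with `dagster dev` or materialized in-process as shown:

```python
from dagster import asset, materialize

@asset
def raw_numbers():
    return [1, 2, 3]

@asset
def doubled_numbers(raw_numbers):
    # Depends on raw_numbers via the matching parameter name.
    return [n * 2 for n in raw_numbers]

if __name__ == "__main__":
    materialize([raw_numbers, doubled_numbers])
```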
- Hamilton
A lightweight library to define data transformations as a directed-acyclic graph (DAG). If you like dbt for SQL transforms, you will like Hamilton for Python processing.
- Kedro
A framework that makes it easy to build robust and scalable data pipelines by providing uniform project templates, data abstraction, configuration and pipeline assembly.
- Dataform
An open-source framework and web based IDE to manage datasets and their dependencies. SQLX extends your existing SQL warehouse dialect to add features that support dependency management, testing, documentation and more.
- Census
A reverse-ETL tool that lets you sync data from your cloud data warehouse to SaaS applications like Salesforce, Marketo, HubSpot, Zendesk, etc. No engineering favors required, just SQL.
- dbt
A command line tool that enables data analysts and engineers to transform data in their warehouses more effectively.
- Kestra
Scalable, event-driven, language-agnostic orchestration and scheduling platform, built on Java with an API-first architecture, to manage millions of workflows declaratively in code.
- RudderStack
A warehouse-first Customer Data Platform that enables you to collect data from every application, website and SaaS platform, and then activate it in your warehouse and business tools.
- PACE
An open source framework that allows you to enforce agreements on how data should be accessed, used, and transformed, regardless of the data platform (Snowflake, BigQuery, DataBricks, etc.)
- Prefect
An orchestration and observability platform. With it, developers can rapidly build and scale resilient code, and triage disruptions effortlessly.
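A minimal Prefect 2.x flow sketch; the flow and task names are illustrative:

```python
from prefect import flow, task

@task(retries=2)
def fetch_numbers():
    return [1, 2, 3]

@task
def summarize(numbers):
    print("sum =", sum(numbers))

@flow
def etl_flow():
    summarize(fetch_numbers())  # orchestrated with retries and observability

if __name__ == "__main__":
    etl_flow()
```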
- Multiwoven
The open-source reverse ETL, data activation platform for modern data teams.
- SuprSend
Create automated workflows and logic using APIs for your notification service. Add templates, batching, preferences, and an in-app inbox, with workflows to trigger notifications directly from your data warehouse.
- Mage
Open-source data pipeline tool for transforming and integrating data.
- SQLMesh
An open-source data transformation framework for managing, testing, and deploying SQL and Python-based data pipelines with version control, environment isolation, and automatic dependency resolution.
Data Lake Management
- lakeFS
An open source platform that delivers resilience and manageability to object-storage based data lakes.
- Project Nessie
A Transactional Catalog for Data Lakes with Git-like semantics. Works with Apache Iceberg tables.
- Ilum
A modular Data Lakehouse platform that simplifies the management and monitoring of Apache Spark clusters across Kubernetes and Hadoop environments.
- Gravitino
An open-source, unified metadata management for data lakes, data warehouses, and external catalogs.
- FlightPath Data
FlightPath is a gateway to a data lake's bronze layer, protecting it from invalid external data file feeds as a trusted publisher.
ELK Elastic Logstash Kibana
- docker-logstash
A highly configurable Logstash (1.4.4) Docker image running Elasticsearch (1.7.0) and Kibana (3.1.2).
- elasticsearch-jdbc
JDBC importer for Elasticsearch.
- ZomboDB
PostgreSQL Extension that allows creating an index backed by Elasticsearch.
Docker
- Gockerize
Package golang service into minimal Docker containers.
- Flocker
Easily manage Docker containers & their data.
- Rancher
RancherOS is a 20 MB Linux distro that runs the entire OS as Docker containers.
- Kontena
Application Containers for Masses.
- Weave
Weaving Docker containers into applications.
- Zodiac
A lightweight tool for easy deployment and rollback of dockerized applications.
- cAdvisor
Analyzes resource usage and performance characteristics of running containers.
- Micro S3 persistence
Docker microservice for saving/restoring volume data to S3.
- Rocker-compose
Docker composition tool with idempotency features for deploying apps composed of multiple containers. Deprecated.
- Nomad
A cluster manager, designed for both long-lived services and short-lived batch processing workloads.
- ImageLayers
Visualize Docker images and the layers that compose them.
Realtime
- Twitter Realtime
The Streaming APIs give developers low latency access to Twitter's global stream of Tweet data.
- Eventsim
Event data simulator. Generates a stream of pseudo-random events from a set of users, designed to simulate web traffic.
- Reddit
Real-time data is available including comments, submissions and links posted to reddit.
Data Dumps
- GitHub Archive
GitHub's public timeline since 2011, updated every hour.
- Common Crawl
Open source repository of web crawl data.
- Wikipedia
Wikipedia's complete copy of all wikis, in the form of Wikitext source and metadata embedded in XML. A number of raw database tables in SQL form are also available.
Prometheus
- Prometheus.io
An open-source service monitoring system and time series database.
- HAProxy Exporter
Simple server that scrapes HAProxy stats and exports them via HTTP for Prometheus consumption.
Data Profiler
- Data Profiler
The DataProfiler is a Python library designed to make data analysis, monitoring, and sensitive data detection easy.
- YData Profiling
A general-purpose open-source data profiler for high-level analysis of a dataset.
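A short ydata-profiling sketch producing an HTML report for a Pandas DataFrame; the data and output path are illustrative:

```python
import pandas as pd
from ydata_profiling import ProfileReport

df = pd.DataFrame({"age": [25, 32, None, 41], "city": ["NY", "SF", "NY", None]})

profile = ProfileReport(df, title="Demo profiling report")
profile.to_file("report.html")  # variable stats, missing values, correlations, etc.
```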
- Desbordante
An open-source data profiler specifically focused on discovery and validation of complex patterns in data.
Testing
- Grai
A data catalog tool that integrates into your CI system exposing downstream impact testing of data changes. These tests prevent data changes which might break data pipelines or BI dashboards from making it to production.
- DQOps
An open-source data quality platform for the whole data platform lifecycle from profiling new data sources to applying full automation of data quality monitoring.
- DataKitchen
Open Source Data Observability for end-to-end Data Journey Observability, data profiling, anomaly detection, and auto-created data quality validation tests.
- GreatExpectation
Open Source data validation framework to manage data quality. Users can define and document "expectations": rules about how data should look and behave.
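A brief sketch using the classic pandas-dataset API of older Great Expectations releases (newer GX versions expose a different entry point); the DataFrame and expectations are illustrative:

```python
import great_expectations as ge
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "age": [25, 40, 130]})
gx_df = ge.from_pandas(df)

print(gx_df.expect_column_values_to_not_be_null("id"))
print(gx_df.expect_column_values_to_be_between("age", min_value=0, max_value=120))  # flags 130
```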
- RunSQL
Free online SQL playground for MySQL, PostgreSQL, and SQL Server. Create database structures, run queries, and share results instantly.
- Spark Playground
Write, run, and test PySpark code on Spark Playground's online compiler. Access real-world sample datasets & solve interview questions to enhance your PySpark skills for data engineering roles.
- daffy
Decorator-first DataFrame contracts/validation (columns/dtypes/constraints) at function boundaries. Supports Pandas/Polars/PyArrow/Modin.
- Snowflake Emulator
A Snowflake-compatible emulator for local development and testing.
Forums
- /r/dataengineering
News, tips, and background on Data Engineering.
- /r/etl
Subreddit focused on ETL.
Conferences
- Data Council
The first technical conference that bridges the gap between data scientists, data engineers and data analysts.
Podcasts
- Data Engineering Podcast
The show about modern data infrastructure.
- The Data Stack Show
A show where they talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
Books
- Snowflake Data Engineering
A practical introduction to data engineering on the Snowflake cloud data platform.
- Best Data Science Books
This blog offers a curated list of top data science books, categorized by topics and learning stages, to aid readers in building foundational knowledge and staying updated with industry trends.
- Architecting an Apache Iceberg Lakehouse
A guide to designing an Apache Iceberg lakehouse from scratch.
- Learn AI Data Engineering in a Month of Lunches
A fast, friendly guide to integrating large language models into your data workflows.