Hey everyone,
We wanted to share a little bit about how we've been using some cool open-source tools to automate our data-heavy tasks. If you're drowning in data and manual reports, you might find this interesting.
At Bitscorp, we handle a lot of data from various sources, and one of the key parts of our workflow is generating reports for our clients and for our internal use. Manually running queries and compiling reports is not just tedious, but also prone to errors. That's where Apache Airflow comes in.
Our Data Stack for Automation
We've found a pretty sweet spot with the following combination:
- Amazon S3: Our data lands in S3 buckets, partitioned by date. This is pretty standard, but it's a reliable way to store and organize large volumes of data.
- Apache Hive: We use Hive to run SQL-like queries on the data stored in S3 (there's a rough sketch of how that looks right after this list). It's great for those big data processing jobs.
- PostgreSQL: This is where we store the results of our Hive queries. It's our go-to for a reliable, easy-to-use database for the final, aggregated data.
- Apache Airflow: This is the conductor of our orchestra. Airflow lets us define our workflows as code (Python, to be specific) and schedules them to run automatically.
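To make the S3-plus-Hive part a bit more concrete, here's roughly how date-partitioned files in a bucket get exposed to Hive as an external table. This is only a sketch of the kind of HQL involved, the sort of string you could hand to a HiveOperator like the ones in the DAG further down; the bucket name, columns, and storage format are placeholders, not our real schema.

# Placeholder sketch of the HQL that exposes date-partitioned S3 data to Hive
# (for example via a HiveOperator's `hql` argument). The bucket, columns, and
# file format below are illustrative only.
create_raw_data_hql = """
CREATE EXTERNAL TABLE IF NOT EXISTS raw_data (
    event_id STRING,   -- placeholder column
    payload  STRING    -- placeholder column
)
PARTITIONED BY (`date` STRING)
STORED AS PARQUET
LOCATION 's3a://example-bucket/raw-data/';

-- Newly landed daily partitions need to be registered before Hive can see them:
MSCK REPAIR TABLE raw_data;
"""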
A Quick Look at an Airflow DAG
Here’s a simplified example of what an Airflow DAG (Directed Acyclic Graph) looks like for our process. This DAG defines the tasks and their dependencies.
from airflow import DAG
from airflow.providers.apache.hive.operators.hive import HiveOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator
from airflow.utils.dates import days_ago

with DAG(
    dag_id='daily_report_generator',
    schedule_interval='@daily',
    start_date=days_ago(1),
    catchup=False,
) as dag:
    # Task to run a Hive query to process data from S3
    process_data = HiveOperator(
        task_id='process_data_from_s3',
        hql="""
            INSERT OVERWRITE TABLE processed_reports
            SELECT ...
            FROM raw_data
            WHERE date = '{{ ds }}';
        """,
    )

    # Task to copy the processed data to PostgreSQL
    store_results = PostgresOperator(
        task_id='store_results_in_postgres',
        sql="""
            TRUNCATE TABLE daily_reports;
            INSERT INTO daily_reports
            SELECT * FROM processed_reports;
        """,
    )

    process_data >> store_results
In this example, process_data runs a Hive query to transform the raw data, and once it finishes, store_results takes over and copies the data into our PostgreSQL database. Airflow makes sure this runs every day, and it can automatically retry a task that fails (more on that just below).
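One thing worth noting: retries aren't on by default. They're configured per task or through the DAG's default_args, along the lines of the following sketch (the specific values here are just illustrative):

from datetime import timedelta

# Illustrative retry settings; these would be passed to the DAG, e.g.
# DAG(..., default_args=default_args), and apply to every task in it.
default_args = {
    'retries': 2,                         # re-run a failed task up to two more times
    'retry_delay': timedelta(minutes=5),  # wait five minutes between attempts
}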
Why We Like It
This setup gives us:
- Automation: No more manual query running.
- Reliability: Airflow has great logging and retry mechanisms.
- Scalability: This stack can handle a growing amount of data.
- Flexibility: We can add more complex steps and integrations as needed (one example is sketched below).
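As one example of that flexibility, we could put a sensor in front of the Hive step so the DAG waits for the day's partition to actually land in S3 before processing it. This is just a sketch: it assumes the Amazon provider package is installed (the S3KeySensor import path can vary with provider version), and the bucket and key pattern are placeholders, not our real layout.

from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

# Hypothetical extra task, placed inside the `with DAG(...)` block from the
# example above: wait for the day's marker file before running the Hive query.
wait_for_data = S3KeySensor(
    task_id='wait_for_raw_data',
    bucket_name='example-bucket',                  # placeholder bucket
    bucket_key='raw-data/date={{ ds }}/_SUCCESS',  # placeholder key pattern
    poke_interval=300,       # check S3 every 5 minutes
    timeout=6 * 60 * 60,     # give up after 6 hours
)

# The new task slots in ahead of the existing chain:
wait_for_data >> process_data >> store_results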
Need Help with Your Data?
If you're looking to build out your own data pipelines, automate your reporting, or just need a hand with data analysis, we'd love to help. We have a lot of experience in this area and can get you up and running quickly.