Hey everyone,
We wanted to share a little bit about how we've been using some cool open-source tools to automate our data-heavy tasks. If you're drowning in data and manual reports, you might find this interesting.
At Bitscorp, we handle a lot of data from various sources, and one of the key parts of our workflow is generating reports for our clients and for our internal use. Manually running queries and compiling reports is not just tedious, but also prone to errors. That's where Apache Airflow comes in.
Our Data Stack for Automation
We've found a pretty sweet spot with the following combination:
- Amazon S3: Our data lands in S3 buckets, partitioned by date. This is pretty standard, but it's a reliable way to store and organize large volumes of data.
- Apache Hive: We use Hive to run SQL-like queries on the data stored in S3 (there's a rough sketch of how that looks right after this list). It's great for those big data processing jobs.
- PostgreSQL: This is where we store the results of our Hive queries. It's our go-to for a reliable, easy-to-use database for the final, aggregated data.
- Apache Airflow: This is the conductor of our orchestra. Airflow lets us define our workflows as code (Python, to be specific) and schedules them to run automatically.
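To make the S3-plus-Hive part a bit more concrete, here's roughly how date-partitioned files in a bucket get exposed to Hive as an external table. This is only a sketch of the kind of HQL involved, the sort of string you could hand to a HiveOperator like the ones in the DAG further down; the bucket name, columns, and storage format are placeholders, not our real schema.

# Placeholder sketch of the HQL that exposes date-partitioned S3 data to Hive
# (for example via a HiveOperator's `hql` argument). The bucket, columns, and
# file format below are illustrative only.
create_raw_data_hql = """
CREATE EXTERNAL TABLE IF NOT EXISTS raw_data (
    event_id STRING,   -- placeholder column
    payload  STRING    -- placeholder column
)
PARTITIONED BY (`date` STRING)
STORED AS PARQUET
LOCATION 's3a://example-bucket/raw-data/';

-- Newly landed daily partitions need to be registered before Hive can see them:
MSCK REPAIR TABLE raw_data;
"""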
A Quick Look at an Airflow DAG
Here’s a simplified example of what an Airflow DAG (Directed Acyclic Graph) looks like for our process. This DAG defines the tasks and their dependencies.
from airflow import DAG
from airflow.providers.apache.hive.operators.hive import HiveOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator
from airflow.utils.dates import days_ago

with DAG(
    dag_id='daily_report_generator',
    schedule_interval='@daily',
    start_date=days_ago(1),
    catchup=False,
) as dag:
    # Task to run a Hive query to process data from S3
    process_data = HiveOperator(
        task_id='process_data_from_s3',
        hql="""
            INSERT OVERWRITE TABLE processed_reports
            SELECT ...
            FROM raw_data
            WHERE date = '{{ ds }}';
        """,
    )

    # Task to copy the processed data to PostgreSQL
    store_results = PostgresOperator(
        task_id='store_results_in_postgres',
        sql="""
            TRUNCATE TABLE daily_reports;
            INSERT INTO daily_reports
            SELECT * FROM processed_reports;
        """,
    )

    process_data >> store_results
In this example, process_data runs a Hive query to transform the raw data, and once it finishes, store_results takes over and copies the data into our PostgreSQL database. Airflow makes sure this runs every day, and it can automatically retry a task that fails (more on that just below).
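One thing worth noting: retries aren't on by default. They're configured per task or through the DAG's default_args, along the lines of the following sketch (the specific values here are just illustrative):

from datetime import timedelta

# Illustrative retry settings; these would be passed to the DAG, e.g.
# DAG(..., default_args=default_args), and apply to every task in it.
default_args = {
    'retries': 2,                         # re-run a failed task up to two more times
    'retry_delay': timedelta(minutes=5),  # wait five minutes between attempts
}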
Why We Like It
This setup gives us:
- Automation: No more manual query running.
- Reliability: Airflow has great logging and retry mechanisms.
- Scalability: This stack can handle a growing amount of data.
- Flexibility: We can add more complex steps and integrations as needed (one example is sketched below).
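As one example of that flexibility, we could put a sensor in front of the Hive step so the DAG waits for the day's partition to actually land in S3 before processing it. This is just a sketch: it assumes the Amazon provider package is installed (the S3KeySensor import path can vary with provider version), and the bucket and key pattern are placeholders, not our real layout.

from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

# Hypothetical extra task, placed inside the `with DAG(...)` block from the
# example above: wait for the day's marker file before running the Hive query.
wait_for_data = S3KeySensor(
    task_id='wait_for_raw_data',
    bucket_name='example-bucket',                  # placeholder bucket
    bucket_key='raw-data/date={{ ds }}/_SUCCESS',  # placeholder key pattern
    poke_interval=300,       # check S3 every 5 minutes
    timeout=6 * 60 * 60,     # give up after 6 hours
)

# The new task slots in ahead of the existing chain:
wait_for_data >> process_data >> store_results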
Need Help with Your Data?
If you're looking to build out your own data pipelines, automate your reporting, or just need a hand with data analysis, we'd love to help. We have a lot of experience in this area and can get you up and running quickly.