I tried scheduling an ETL pipeline. Here's what happened that I didn't expect.

by root June 19, 2026


I mentioned that scheduling would be the next hurdle I'd face.

So here I am, walking toward it.

But before I explain what happened, let me give some background for anyone coming across this for the first time.

I'm a systems analyst who decided to move into data engineering. Instead of just taking courses and collecting certificates, I decided to learn by building and writing about it publicly. Every article in this series documents what I actually built, the decisions I made, what failed, and what I learned from it.

The first article, my 12-month self-study roadmap, laid out a plan for how I'd approach this transition. In the second, as a complete beginner, I built my first ETL pipeline from scratch using the GitHub API. In the third, I took the same pipeline and made it more production-ready by adding SQLite storage, idempotent processing, and Google Drive persistence, all inside Google Colab.

This article is the fourth. It picks up exactly where the third left off.

I expected to spend most of my time choosing and configuring scheduling tools. What I didn't expect was that before I could think about schedules, I had to deal with something more fundamental. My pipeline couldn't run outside of Google Colab. And until that changed, no scheduler in the world could help me.

This is the story of what actually happened.

First wall: My pipeline was in Colab

Before setting up a schedule, I wanted to understand what it actually takes to run a pipeline automatically. So, with that question in mind, I took a proper look at my code for the first time.

The load section looks like this:

conn = sqlite3.connect('/content/drive/MyDrive/github_repos.db')

That path, /content/drive/MyDrive/, only exists inside Google Colab. It is where Colab mounts Google Drive when you connect Drive to your notebook. Outside of Colab the path does not exist, so when a scheduler tries to run this script, it crashes right there.
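To make that failure concrete, here is a minimal sketch in which a temporary directory stands in for a machine without the Colab mount. The helper name try_connect and the probe table are made up for illustration and are not part of my pipeline.

```python
import os
import sqlite3
import tempfile


def try_connect(db_path):
    """Return None if SQLite can open db_path, otherwise the error message."""
    try:
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS probe (id INTEGER)")
        conn.close()
        return None
    except sqlite3.OperationalError as exc:
        return str(exc)


with tempfile.TemporaryDirectory() as tmp:
    # The "drive/MyDrive" folders are never created, just as the Colab
    # mount point is never created on a machine that is not Colab.
    missing = os.path.join(tmp, "drive", "MyDrive", "github_repos.db")
    error = try_connect(missing)

print(error)
```

SQLite will create a missing database file, but it will not create missing parent folders, which is why the hard-coded path fails outside Colab.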

What's interesting is that my code had no google.colab imports. There were no Colab-specific libraries. There was just one hard-coded path that I had typed without thinking. The dependency was a path, not code.

This was something I didn't expect at first. I thought the challenge was learning the scheduling tool. Instead, the first lesson was that my environment was part of my pipeline and I hadn't been aware of it.

The fix was easy. Instead of hard-coding the Colab path, I made the database path configurable through an environment variable.

import os
import sqlite3

# Read the database location from the environment; fall back to a local file.
DB_PATH = os.environ.get('DB_PATH', 'github_repos.db')
conn = sqlite3.connect(DB_PATH)

The script now uses whatever path is set in the environment. If nothing is set, it falls back to creating a local github_repos.db file in the folder the script is run from. With that one change, the pipeline is no longer tied to Colab.
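Here is a small sketch of that fallback behaviour, assuming the same DB_PATH variable and default file name as above. The function resolve_db_path, the repos table, and the scheduled.db file name are hypothetical, added only so both cases can be shown without touching the real environment.

```python
import os
import sqlite3
import tempfile


def resolve_db_path(env=os.environ):
    """Return the DB_PATH value if it is set, otherwise the local default."""
    return env.get("DB_PATH", "github_repos.db")


# With nothing set, the pipeline falls back to a file in the working directory.
default_path = resolve_db_path({})

# A scheduler (or a shell) can point the same script somewhere else.
with tempfile.TemporaryDirectory() as tmp:
    configured_path = resolve_db_path({"DB_PATH": os.path.join(tmp, "scheduled.db")})
    conn = sqlite3.connect(configured_path)
    conn.execute("CREATE TABLE IF NOT EXISTS repos (id INTEGER PRIMARY KEY)")
    conn.commit()
    conn.close()
    file_was_created = os.path.exists(configured_path)

print(default_path, file_was_created)
```

Passing the environment in as a plain dictionary keeps the lookup easy to test; in the real script, os.environ plays that role.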

Running outside of Colab for the first time

Before setting up the scheduler, I wanted to confirm that the script actually works on its own. So I saved it as pipeline.py and created a requirements.txt listing the two required libraries.

requests
pandas

and ran it from my terminal. It printed:

Pipeline complete. Duplicates handled.

And a file called github_repos.db appeared in my folder. The same pipeline I used to run in Colab could now run anywhere as a plain Python script.

It felt like a bigger deal than I expected. Not because the change was complicated; it wasn't. But I realized that I had been thinking of the pipeline as a notebook, when what I actually had was a script that happened to live inside one.

Choosing a scheduling tool

At this point, I had a standalone script. Next, I needed something to run it on a schedule.

I considered a few options. APScheduler lets you define schedules inside your Python code. It works while the process is running, but stops as soon as I close the terminal. It's not really scheduling, it's just a loop. Airflow is the industry standard for orchestrating pipelines, but it requires a server, a metadata database, and a web interface to run. That's a lot of infrastructure for where I am now.
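To show what I mean by "just a loop", here is a rough standard-library sketch of an in-process scheduler. It is not APScheduler's API; the function run_on_interval and its parameters are invented for illustration, and the sleep function is injectable so the example finishes instantly.

```python
import time


def run_on_interval(job, interval_seconds, max_runs, sleep=time.sleep):
    """Call job() max_runs times, sleeping between runs.

    The schedule lives entirely inside this process: if the process
    exits (for example, the terminal is closed), nothing runs again.
    """
    results = []
    for run_number in range(max_runs):
        results.append(job())
        if run_number < max_runs - 1:
            sleep(interval_seconds)
    return results


# Record the requested sleeps instead of actually waiting a day each time.
naps = []
outcomes = run_on_interval(lambda: "ran", 86400, 3, sleep=naps.append)
print(outcomes, naps)
```

The loop is only as durable as the process hosting it, which is the gap that an external scheduler fills.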

GitHub Actions was somewhere in between. It's free, runs on GitHub's servers, schedules are defined in code, and there's no infrastructure to maintain. The tradeoff is that it is designed for CI/CD workflows rather than pipeline orchestration, so it has limitations when it comes to complex dependencies and monitoring. But at my stage, it's a realistic choice.

Let's be honest: tools like Airflow exist for a reason. Proper orchestration is needed when your pipeline grows, when there are dependencies between tasks, and when you need visibility into what's running and what's failing. GitHub Actions is not that. But it is a good first step, and understanding its limitations is part of understanding what those more serious tools are actually solving.

Setting up GitHub Actions

GitHub Actions works through workflow files, which are YAML files that you place in a specific folder inside your repository. The folder structure looks like this:

github-etl/
├── .github/
│   └── workflows/
│       └── schedule.yml
├── pipeline.py
└── requirements.txt

Here is the complete workflow file I created:

name: Run ETL Pipeline

on:
  schedule:
    - cron: '0 9 * * *'
  workflow_dispatch:

jobs:
  run-pipeline:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run pipeline
        run: python pipeline.py

Let's take a look at what each part does.

cron: '0 9 * * *' is the actual schedule. Cron is a time-based job scheduling format that has been used on Unix systems for decades. The five values represent the minute, hour, day of the month, month, and day of the week. So 0 9 * * * means: minute 0 of hour 9, every day of the month, every month, every day of the week. That is, every day at 9 a.m. UTC.
workflow_dispatch adds a manual trigger. This means you can also click a button in GitHub to run the workflow without waiting for the scheduled time. That is useful for testing.
runs-on: ubuntu-latest tells GitHub to start a fresh Linux machine for each run. Every time the workflow is triggered, GitHub creates a clean environment, installs dependencies, runs the script, and shuts everything down. There is no machine sitting somewhere running my code. It's temporary.

The steps are simple. Checkout code pulls the code from the repository onto the runner. Set up Python installs the version you specify. Install dependencies runs pip install -r requirements.txt. Then Run pipeline runs the script.
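To check my reading of the cron expression, I find it helpful to compute the next run time by hand. The sketch below is a deliberately simplified matcher, not a full cron implementation: it only understands * and plain integers, and it does not reproduce cron's special handling when both day fields are restricted. The function names are my own.

```python
from datetime import datetime, timedelta, timezone

FIELD_NAMES = ("minute", "hour", "day_of_month", "month", "day_of_week")


def parse_cron(expression):
    """Split a five-field cron expression; '*' becomes None (match anything)."""
    parts = expression.split()
    if len(parts) != 5:
        raise ValueError("expected exactly 5 cron fields")
    return {
        name: None if part == "*" else int(part)
        for name, part in zip(FIELD_NAMES, parts)
    }


def matches(fields, moment):
    """True if every restricted field equals the corresponding part of moment."""
    actual = {
        "minute": moment.minute,
        "hour": moment.hour,
        "day_of_month": moment.day,
        "month": moment.month,
        # cron counts Sunday as 0; isoweekday() counts Sunday as 7
        "day_of_week": moment.isoweekday() % 7,
    }
    return all(want is None or want == actual[name] for name, want in fields.items())


def next_run(expression, after):
    """Return the first whole minute strictly after `after` that matches."""
    fields = parse_cron(expression)
    candidate = after.replace(second=0, microsecond=0) + timedelta(minutes=1)
    for _ in range(366 * 24 * 60):  # search at most one year of minutes
        if matches(fields, candidate):
            return candidate
        candidate += timedelta(minutes=1)
    raise ValueError("no matching time within a year")


start = datetime(2026, 6, 19, 10, 30, tzinfo=timezone.utc)
upcoming = next_run("0 9 * * *", start)
print(upcoming.isoformat())
```

Starting from 10:30 UTC on June 19, the next match is 09:00 UTC on June 20, which agrees with "every day at 9 a.m. UTC".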

What happened when I ran it

After pushing the workflow file to GitHub, I went to the Actions tab in my repository and triggered it manually using the workflow_dispatch button.

It ran. 27 seconds from start to finish. Without me doing anything after clicking the button, the pipeline retrieved the data from the GitHub API, transformed it, and loaded it into SQLite. All of this happened on a GitHub server.

One warning showed up on the first run.

Node.js 20 actions are deprecated...

This was because I was using older versions of the checkout and setup-python actions. The fix was updating actions/checkout@v3 to actions/checkout@v4 and actions/setup-python@v4 to actions/setup-python@v5. After that, the workflow ran without issue.

What I actually learned

Going into this, I thought scheduling was all about choosing the right tool. What I found was that setting up a schedule forced me to think about something I hadn't thought carefully about before: portability.

A pipeline that only runs in one specific environment isn't really a pipeline. It's a script tied to a platform. Making it schedulable means making it portable first, and making it portable means understanding what you're actually depending on.

The hard-coded path was a small thing. But catching it has changed the way I think about writing pipeline code going forward. Whenever I write a path, a credential, or an environment-specific value, I now ask whether it exists outside the context I'm building it in.

Another thing I learned is that scheduling and orchestration are two different things. GitHub Actions handles scheduling well. It doesn't handle things like retrying failed runs with backoff, alerting you when something goes wrong, visualizing pipeline dependencies, or managing multiple interdependent pipelines. Those are the orchestration problems that tools like Airflow are built to solve.
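As one example of what "retrying with backoff" means, here is a minimal hand-rolled sketch. It is not how Airflow implements retries; run_with_retries, flaky_task, and the delay values are all hypothetical, and the sleep function is injectable so the example does not actually wait.

```python
import time


def run_with_retries(task, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run task(); on failure wait base_delay, then 2x, 4x, ... before retrying.

    The last failure is re-raised once max_attempts has been reached.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# A stand-in for a pipeline step that fails twice before succeeding.
attempts = {"count": 0}


def flaky_task():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


delays = []
result = run_with_retries(flaky_task, max_attempts=4, base_delay=1.0, sleep=delays.append)
print(result, delays)
```

Even this small amount of logic has to live somewhere, and an orchestrator provides it as configuration rather than code I maintain myself.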

I'm not there yet. But now I understand why those tools exist in a way I didn't before.

What's next

The pipeline is now running every day at 9:00 AM (UTC). Data is being collected. And I'm starting to notice something. When you run a pipeline every day, you start to care about the data it produces in a different way.

Are all the records clean? Are there repositories slipping through with missing fields? Do the viral flags actually mean anything, or did I define them so that they are "no" for almost everything?

Those are data quality problems. And they are the next wall I'm heading toward.

This is part of an ongoing series chronicling my transition from systems analyst to data engineer. Thanks for following along. If this is the first article in the series you've read, the earlier articles are linked below.

From Data Analyst to Data Engineer: My 12-Month Self-Study Roadmap

As a complete beginner, I built my first ETL pipeline. Here's how.

I thought data engineering was just about writing scripts. I was wrong.

Please connect with me on LinkedIn, YouTube, and Twitter.
