Monday, May 25, 2026

Part of my data engineering journey series. In Part 1, I shared a 12-month roadmap for moving from data analyst to data engineer. This is where the actual building begins.

When I published my first article documenting my data engineering efforts, something unexpected happened. People resonated with it. Strangers reached out and said they were excited to follow along. That was a nice touch.

But it also came with pressure.

Suddenly, this was no longer a personal goal I could quietly abandon when the going got tough. People were watching. People were in the same boat. And honestly, that accountability is part of the reason you're reading this right now.

So I had to move. And like anyone starting a new skill, the first thing I did was look for resources. There are countless tutorials on data engineering on the internet: YouTube videos, courses, and written guides. There were more than I could keep up with.

However, I didn't feel like consuming only theory. I needed to build something: a real thing, with real data, that actually worked in the end.

So I closed the tutorials and opened a Google Colab notebook instead. I found the GitHub API documentation and decided to build my first ETL pipeline from scratch. No hand-holding. Just me, Python, and my goals.

This article documents that experience in full: the code, the messes, the small wins, and the things I learned by doing.

First, what is ETL?

Before I get into what I built, let me quickly explain what ETL actually means, because I had to look it up myself a while back.

ETL stands for Extract, Transform, Load. It is one of the most fundamental concepts in data engineering.

  • Extract means going somewhere to get the data: APIs, databases, websites, files. You pull raw information from the source.
  • Transform means cleaning and formatting the data: removing bad rows, adding new columns, and restructuring it so it is actually useful.
  • Load means storing the cleaned data somewhere: a database, a data warehouse, or a simple CSV file.

That's it. Running these three steps in sequence is a data pipeline. Everything else in data engineering (Airflow, Spark, Databricks) is just a more sophisticated way of doing the same three things at scale.
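To make the three steps concrete before the real pipeline, here is a minimal sketch that uses only the standard library. The function names, the hard-coded sample rows, and the in-memory buffer are all assumptions for illustration; they are not part of the pipeline described in this article.

```python
import csv
import io


def extract():
    """Extract: return raw records (hard-coded here; normally an API or database call)."""
    return [
        {"name": "alpha", "stars": "120", "description": "demo repo"},
        {"name": "beta", "stars": "45", "description": None},
    ]


def transform(rows):
    """Transform: drop rows without a description and convert stars to int."""
    return [
        {"name": r["name"], "stars": int(r["stars"]), "description": r["description"]}
        for r in rows
        if r["description"]
    ]


def load(rows, target):
    """Load: write the cleaned rows as CSV to any file-like target."""
    writer = csv.DictWriter(target, fieldnames=["name", "stars", "description"])
    writer.writeheader()
    writer.writerows(rows)
    return len(rows)


buffer = io.StringIO()  # stands in for a real file or database
loaded = load(transform(extract()), buffer)
print(loaded, "row(s) loaded")
```

Keeping each step in its own function means any one of them can be swapped out later (a different source, a different destination) without touching the others.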

I'm at the start of my roadmap, so I've kept it simple: pure Python, no orchestration tool yet. However, the shape of the problem is the same.

What I built

I extracted data from the GitHub API, specifically the most-starred Python repositories created in the past 30 days. I then cleaned it up, added new columns, and saved the output as a CSV file.
Simple. Real. And completely mine.

Here's how it went.

Step 1: Extraction

The first thing I had to do was figure out how to talk to the GitHub API. An API is essentially a door that a company or platform opens so developers can request data programmatically, without having to copy and paste anything by hand.

GitHub has a free public API. No account or paid plan is required for basic searches.

Here is the code I wrote to extract the data:

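import requests

# GitHub's repository search endpoint
url = "https://api.github.com/search/repositories"

# Search query: Python repos created after the given date, most-starred first
params = {
    "q": "language:python created:>2025-04-22",
    "sort": "stars",
    "order": "desc",
    "per_page": 30
}

response = requests.get(url, params=params)
# Parse the JSON body into a Python dictionary
knowledge = response.json()

print(response.status_code)
print(knowledge.keys())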

I'll be honest: at first this block confused me. It was my first time using the requests library. The params dictionary, with its q key, felt like foreign syntax. And I didn't immediately understand what .json() was doing or why it was needed.

Let's break it down briefly.

  • requests.get() is how you knock on GitHub's door and ask for something. The url is the address of what you are looking for.
  • The params dictionary is the specific question you are asking. In this case: "Sort by stars and show 30 results for Python repositories created after April 22nd."
  • .json() converts GitHub's response from raw text into a Python dictionary you can work with.
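To see how a params dictionary turns into the question GitHub actually receives, the sketch below builds the query with a computed cutoff date instead of a hard-coded one and prints the resulting URL. The helper name build_params and the use of urllib.parse.urlencode are my own assumptions for illustration; requests does its own encoding when you pass params, and its exact escaping may differ slightly.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

BASE_URL = "https://api.github.com/search/repositories"


def build_params(today, days=30, per_page=30):
    """Build search parameters for Python repos created in the last `days` days."""
    cutoff = today - timedelta(days=days)
    return {
        "q": f"language:python created:>{cutoff.isoformat()}",
        "sort": "stars",
        "order": "desc",
        "per_page": per_page,
    }


# A fixed date keeps the example reproducible; date.today() would be used in practice.
params = build_params(date(2025, 5, 22))
full_url = f"{BASE_URL}?{urlencode(params)}"
print(params["q"])
print(full_url)
```

Passing the date in as an argument, rather than calling date.today() inside the function, makes the query easy to test and rerun for any day.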

When I ran it, I got the following:

200 
dict_keys(['total_count', 'incomplete_results', 'items'])

A 200 means success. It is how the web says "your request was successful." If you see a 403 or 404, something went wrong.
The dictionary has three keys. total_count shows the number of repositories that match your search. incomplete_results indicates whether GitHub had to cut the search short, for example because the query timed out. And items is where the actual data lives.
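The status code and the three keys can be checked in one place before moving on. This is a minimal sketch, not part of the original notebook: the function name check_search_response is hypothetical, the payloads are made up, and reading a "message" field from the error body is an assumption about how GitHub usually reports errors.

```python
def check_search_response(status_code, payload):
    """Return the list of repositories, or raise if the request clearly failed."""
    if status_code != 200:
        # Error bodies usually carry a "message" field; fall back if it is absent.
        message = payload.get("message", "no message provided")
        raise RuntimeError(f"GitHub request failed with {status_code}: {message}")
    if payload.get("incomplete_results"):
        print("Warning: GitHub reported incomplete results for this search.")
    return payload.get("items", [])


# Made-up payloads that mimic the shape of a search response.
ok_payload = {
    "total_count": 2,
    "incomplete_results": False,
    "items": [{"name": "a"}, {"name": "b"}],
}
items = check_search_response(200, ok_payload)
print(len(items), "repositories returned")

try:
    check_search_response(403, {"message": "API rate limit exceeded"})
except RuntimeError as err:
    error_text = str(err)
    print(error_text)
```

With the requests code above, the call would look like check_search_response(response.status_code, response.json()).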

I then ran the second block to peek inside.

print("Total matches on GitHub:", knowledge['total_count'])
print("Repos returned:", len(knowledge['items']))

# Inspect the first (most-starred) repository
first_repo = knowledge['items'][0]
print("\nFirst repo name:", first_repo['name'])
print("Stars:", first_repo['stargazers_count'])
print("Language:", first_repo['language'])
print("URL:", first_repo['html_url'])

output:

Total matches on GitHub: 9228201
Repos returned: 30

First repo name: expertise
Stars: 139136
Language: Python
URL: https://github.com/anthropics/expertise

The first result was an Anthropic repository with 139,000 stars. Real data. Live. Pulled by code I wrote.

Extraction complete.

Step 2: Transform

I now had 30 repositories in a Python list, each a nested dictionary with dozens of fields. I didn't need most of them. The transformation step takes the raw, messy data and shapes it into something clean and purposeful.

First, I pulled out only the fields I needed and loaded them into a pandas DataFrame.

import pandas as pd

repos = []

# Keep only the fields of interest from each repository record
for repo in knowledge['items']:
    repos.append({
        "name": repo['name'],
        "owner": repo['owner']['login'],
        "stars": repo['stargazers_count'],
        "forks": repo['forks_count'],
        "language": repo['language'],
        "description": repo['description'],
        "url": repo['html_url'],
        "created_at": repo['created_at']
    })

df = pd.DataFrame(repos)
df.head()

Seeing the DataFrame appear was a real "wow" moment. I had gone from a wall of JSON to a clean, easy-to-read table of rows with labeled columns.

Next, I applied the following three transformations:

# Drop rows where the description is missing
df_clean = df.dropna(subset=['description'])

# Add a viral flag for repos with over 50k stars
df_clean = df_clean.copy()
df_clean['viral'] = df_clean['stars'].apply(lambda x: 'Yes' if x > 50000 else 'No')

# Sort by stars, descending
df_clean = df_clean.sort_values('stars', ascending=False).reset_index(drop=True)

print("Before cleaning:", len(df))
print("After cleaning:", len(df_clean))

output:

Before cleaning: 30
After cleaning: 29

One repository had no description and was removed. The viral column was added cleanly. The data is now sorted and structured.
Transformation complete.
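If the pandas calls feel opaque, it can help to see the same three transformations written out by hand. The sketch below does that in plain Python on a tiny made-up sample; the rows, the transform function name, and the flag labels are assumptions for illustration, and the pandas version above remains the one the pipeline uses.

```python
VIRAL_THRESHOLD = 50_000  # same cut-off the article uses

sample = [  # made-up rows, not real GitHub data
    {"name": "repo-a", "stars": 1_200, "description": "small tool"},
    {"name": "repo-b", "stars": 75_000, "description": "popular project"},
    {"name": "repo-c", "stars": 300, "description": None},
]


def transform(rows):
    # 1. Drop rows whose description is missing (the job dropna does).
    kept = [dict(r) for r in rows if r["description"] is not None]
    # 2. Add the viral flag (the job the apply/lambda does).
    for r in kept:
        r["viral"] = "Yes" if r["stars"] > VIRAL_THRESHOLD else "No"
    # 3. Sort by stars, highest first (the job sort_values does).
    return sorted(kept, key=lambda r: r["stars"], reverse=True)


clean = transform(sample)
for row in clean:
    print(row["name"], row["stars"], row["viral"])
```

Copying each row with dict(r) leaves the input untouched, which mirrors why the pandas version calls .copy() before adding a column.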

Step 3: Load

Last step: take the clean data and save it somewhere. I kept this simple and loaded it into a CSV file.

# Write the cleaned table to disk without the DataFrame index column
df_clean.to_csv('github_trending_repos.csv', index=False)

print("Pipeline complete. File saved.")
print(f"{len(df_clean)} repos loaded into github_trending_repos.csv")

output:

Pipeline complete. File saved.
29 repos loaded into github_trending_repos.csv

I downloaded the file and opened it. A clean spreadsheet with 29 rows and 9 columns. Real GitHub data, shaped and saved by a pipeline I built from scratch.

Loading complete.

What was this actually like?

Previously, whenever I wanted to work with data, I looked for a public dataset that someone had already cleaned and uploaded. Kaggle, Google Dataset Search, everywhere. I had always been a consumer of data prepared by someone else.

This changed something for me.

The moment I realized that I could simply point Python at an API I was interested in and extract live data myself, the possibilities felt completely different. You aren't limited to existing datasets. You can build pipelines to create datasets.

It's a different kind of power. And it's one of the things that drew me to data engineering in the first place.

What's next

This pipeline is simple by design. I'm at the beginning of the roadmap, and I'm not going to pretend I'm using Airflow or Spark just yet. But the fundamentals are real. Extract, transform, load. It works. I built it. I understand it.

The next step is to make it more robust: schedule it to run daily, save the output to a SQLite database instead of a flat CSV, and start tracking long-term trends across repositories.
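As a rough idea of what the SQLite version of the load step could look like, here is a sketch using Python's built-in sqlite3 module. The table name, the column set, the fetched_on column, and the sample rows are all assumptions of mine, not a design this series has settled on.

```python
import sqlite3

rows = [  # made-up rows standing in for the cleaned pipeline output
    ("repo-b", "owner-b", 75000, "2025-05-01"),
    ("repo-a", "owner-a", 1200, "2025-05-01"),
]

conn = sqlite3.connect(":memory:")  # use a file path such as "repos.db" to persist
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS repos (
        name TEXT,
        owner TEXT,
        stars INTEGER,
        fetched_on TEXT,
        PRIMARY KEY (name, owner, fetched_on)
    )
    """
)
# INSERT OR REPLACE keeps a rerun on the same day from duplicating rows.
conn.executemany("INSERT OR REPLACE INTO repos VALUES (?, ?, ?, ?)", rows)
conn.commit()

top = conn.execute("SELECT name, stars FROM repos ORDER BY stars DESC").fetchall()
print(top)
```

Storing the fetch date alongside each row is what would make long-term trend tracking possible: each daily run adds a new snapshot instead of overwriting the last one. pandas also offers DataFrame.to_sql, which may be a shorter route from a DataFrame to the same table.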

And eventually, I'll use Airflow to orchestrate everything. But that's a future article.

So far, the most important thing I've proven to myself is that building teaches you things tutorials never will. I spent weeks in Tutorial Land and barely moved. It took me an hour to actually build this, and I learned more about ETL than from any video.

Stop watching. Start building.

This is Part 2 of my ongoing data engineering series. I'll document every step of the journey, even the parts that don't go smoothly. Be sure to check out my more in-depth take on ETL on my YouTube channel below.

Please connect with me on LinkedIn, YouTube, and Twitter.
