Pandas was some of the fashionable instruments when studying information science in 2020. Though new instruments are targeted on enhancing Pandas’ weaknesses in dealing with very giant datasets, I nonetheless use Pandas for a lot of information cleansing, processing, and evaluation duties. Granted, Pandas has a tough time when working with billions of rows, however it’s undoubtedly ample for something smaller.
You possibly can see Pandas getting used not solely in EDA and notebooks, but in addition in manufacturing methods.
This text describes a number of information cleansing and processing operations to display the facility of Pandas.
Let’s begin with a dataset containing stock-keeping models (SKUs) and search API responses for these SKUs.
import pandas as pd
search_results = pd.read_csv("search_results.csv")
search_results.head()
The search result’s a listing of dictionaries, like this:
search_results.loc[0, "search_result"]
"[{'my_id': 'HBCV00007F5Y2B', 'distance': 1.0, 'entity': {}},
{'my_id': 'HBCV00007UPQBM', 'distance': 1.0, 'entity': {}},
{'my_id': 'HBCV00008I29IH', 'distance': 1.0, 'entity': {}},
{'my_id': 'HBCV00006U3ZYB', 'distance': 0.8961254358291626, 'entity': {}},
{'my_id': 'HBCV0000AFA4H6', 'distance': 0.8702399730682373, 'entity': {}},
{'my_id': 'HBCV00009CDGD4', 'distance': 0.86175537109375, 'entity': {}},
{'my_id': 'HBCV000046336T', 'distance': 0.8594968318939209, 'entity': {}},
{'my_id': 'HBCV00009QDZRT', 'distance': 0.8572311997413635, 'entity': {}},
{'my_id': 'HBCV00008E11P3', 'distance': 0.8553324937820435, 'entity': {}},
{'my_id': 'HBV00000C4IY6', 'distance': 0.8539167642593384, 'entity': {}}]
... and 5 entities remaining"
As you’ll be able to see within the output, this isn’t a correct checklist in dictionary format due to the final half (“… and 5 entities remaining”). It is usually saved as a single string.
To make higher use of this, we have to convert it to a listing of appropriate dictionaries. The next line of code splits the string on “…”, removes the final half, and will get the primary break up.
search_results.loc[0, "search_result"].break up("...")[0].strip()
Nonetheless, the output continues to be a single string. You possibly can convert it to a listing utilizing Python’s built-in ast module.
import ast
res = ast.literal_eval(search_results.loc[0, "search_result"].break up("...")[0].strip())
res
[{'my_id': 'HBCV00007F5Y2B', 'distance': 1.0, 'entity': {}},
{'my_id': 'HBCV00007UPQBM', 'distance': 1.0, 'entity': {}},
{'my_id': 'HBCV00008I29IH', 'distance': 1.0, 'entity': {}},
{'my_id': 'HBCV00006U3ZYB', 'distance': 0.8961254358291626, 'entity': {}},
{'my_id': 'HBCV0000AFA4H6', 'distance': 0.8702399730682373, 'entity': {}},
{'my_id': 'HBCV00009CDGD4', 'distance': 0.86175537109375, 'entity': {}},
{'my_id': 'HBCV000046336T', 'distance': 0.8594968318939209, 'entity': {}},
{'my_id': 'HBCV00009QDZRT', 'distance': 0.8572311997413635, 'entity': {}},
{'my_id': 'HBCV00008E11P3', 'distance': 0.8553324937820435, 'entity': {}},
{'my_id': 'HBV00000C4IY6', 'distance': 0.8539167642593384, 'entity': {}}]
Search outcomes are actually displayed as a correct checklist of dictionaries. This was solely a single line. The identical operation should be utilized to all SKUs (that’s, the complete SKU column).
One choice is to undergo all of the rows throughout the for loop and carry out the identical operation. Nonetheless, this isn’t the most suitable choice. Vectorized operations ought to be most popular at any time when attainable. Vectorized operations principally imply executing code for all rows directly.
I used break up on a single line to take away the final a part of the string, however it did not work for vectorized operations. A extra sturdy choice appears to be utilizing common expressions.
search_results.loc[:, 'search_result'] = search_results['search_result'].str.substitute(r"....*", "", regex=True).str.strip()
This code selects “…” and all the things after it and replaces nothing. In different phrases, the half “…and the remaining 5 entities” is eliminated.
Now all of the rows within the search consequence column shall be displayed as a correct checklist of dictionaries.
search_results.loc[10, "search_result"]
"[{'my_id': 'HBCV00007F5Y2B', 'distance': 1.0, 'entity': {}},
{'my_id': 'HBCV00007UPQBM', 'distance': 1.0, 'entity': {}},
{'my_id': 'HBCV00008I29IH', 'distance': 1.0, 'entity': {}},
{'my_id': 'HBCV00006U3ZYB', 'distance': 0.8961254358291626, 'entity': {}},
{'my_id': 'HBCV0000AFA4H6', 'distance': 0.8702399730682373, 'entity': {}},
{'my_id': 'HBCV00009CDGD4', 'distance': 0.86175537109375, 'entity': {}},
{'my_id': 'HBCV000046336T', 'distance': 0.8594968318939209, 'entity': {}},
{'my_id': 'HBCV00009QDZRT', 'distance': 0.8572311997413635, 'entity': {}},
{'my_id': 'HBCV00008E11P3', 'distance': 0.8553324937820435, 'entity': {}},
{'my_id': 'HBV00000C4IY6', 'distance': 0.8539167642593384, 'entity': {}}]"
These are nonetheless saved as strings, however you’ll be able to simply convert them to lists utilizing the ast module. This shall be accomplished within the subsequent step.
What I am serious about is the SKU returned within the search outcomes. Create a brand new column by extracting the SKUs within the dictionary. It may be accessed utilizing the “my_id” key within the dictionary.
This operation has three components.
- Convert the search consequence string to a listing utilizing the literal_eval perform.
- Extract the SKU from the my_id key within the dictionary
- Do that with checklist comprehension to get the SKUs from all dictionaries within the checklist.
You possibly can carry out all these operations by making use of a lambda perform to each row like this:
search_results.loc[:, "result_skus"] =
search_results["search_result"].apply(lambda x: [item['my_id'] for merchandise in ast.literal_eval(x)])
search_results.head()

Every row within the result_skus column comprises a listing of 10 SKUs. Suppose it’s essential to put these 10 SKUs in numerous rows. For every row within the sku column, 10 rows are created from the checklist within the result_skus column. Pandas has an easy approach to do that. That’s the explode perform.
information = search_results[["sku", "result_skus"]].explode("result_skus", ignore_index=True)
information.head()

I created a brand new dataframe with sku and result_skus columns. The next diagram exhibits how the explode perform works.

Let’s take into account the other. I’ve a dataframe just like the one above, however I wish to have all the outcomes for SKU in a single row.
You possibly can group the rows by sku utilizing the groupby perform and apply the checklist perform to the result_skus column.
new_data = information.groupby("sku", as_index=False)["result_skus"].apply(checklist)
new_data.head()
It will take you again to the earlier step.

I used the Increase perform to create a dataframe with separate rows for every SKU within the result_skus column. What if I want to separate it into completely different columns as a substitute of rows?
One choice is to use the pd.Sequence perform to the result_skus column and concatenate the ensuing column to the unique dataframe.
new_cols = new_data["result_skus"].apply(pd.Sequence)
new_data = pd.concat([new_data, new_cols], axis=1)
new_data.head()

Columns 0-9 comprise the ten SKUs from the result_skus column. This code utilizing the apply perform shouldn’t be a vectorized operation.
There’s an alternative choice that’s vectorized and far sooner.
new_cols = pd.DataFrame(new_data["result_skus"].tolist())
new_data = pd.concat([new_data, new_cols], axis=1)
This code supplies the identical dataframe as above, however is far sooner.
We have illustrated typical information cleansing and processing duties that information scientists and analysts might encounter on the job. I have been on this area for over 5 years now, and until you are working with very giant datasets (e.g. billions of rows), Pandas has all the time been ample for what you want.
Appropriate for such giant datasets, the device’s syntax is much like Pandas. For instance, PySpark is sort of a mixture of Pandas and SQL. Polars is similar to Pandas when it comes to syntax. Due to this fact, studying and working towards Pandas stays an especially worthwhile talent for these working in information science and AI.
Thanks for studying.

