8.2. Utility Libraries for Pandas
8.2.1. Speed up Pandas’ apply()
Don’t use .apply() in Pandas blindly!
.apply() applies an operation to every element in a DataFrame (row-wise or column-wise). It’s the most obvious choice, but there is a better option: the Swifter package.
Swifter tries to pick the best way to run your .apply() function by either:
- Vectorizing your function
- Parallelizing using Dask
- Falling back to Pandas’ .apply() if the dataset is small
That gives your function a huge boost. In the example below, you only need to use df.swifter.apply() to take advantage of Swifter’s capabilities.
!pip install swifter

import swifter
import pandas as pd

df = pd.DataFrame(...)

def my_function(input_value):
    ...
    return output_value

df["Column"] = df["Column"].swifter.apply(lambda x: my_function(x))
8.2.2. Reduce DataFrame Memory with dtype_diet
By default, Pandas DataFrames don’t use the smallest possible data types for their columns.
This results in unnecessary memory usage.
Changing data types can drastically reduce the memory usage of your DataFrame.
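To see why, here is a minimal hand-rolled sketch in plain Pandas (the column name and values are made up; dtype_diet automates this kind of analysis across a whole DataFrame):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"count": np.arange(100_000)})  # defaults to int64

before = df.memory_usage(deep=True).sum()

# Downcast to the smallest unsigned integer type that fits the values
df["count"] = pd.to_numeric(df["count"], downcast="unsigned")

after = df.memory_usage(deep=True).sum()
print(f"{before / 1024:.0f} KB -> {after / 1024:.0f} KB")
```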
Using dtype_diet, you can automatically change the data types to the smallest (and most memory-efficient) ones.
!pip install dtype-diet

from dtype_diet import optimize_dtypes, report_on_dataframe
import pandas as pd

df = pd.read_csv("")

# Get recommendations
proposed_df = report_on_dataframe(df, unit="MB")
new_df = optimize_dtypes(df, proposed_df)

print(f"Original df memory: {df.memory_usage(deep=True).sum()/1024/1024} MB")
print(f"Proposed df memory: {new_df.memory_usage(deep=True).sum()/1024/1024} MB")
8.2.3. Validate Pandas DataFrames with pandera
Do you want to validate your Pandas DataFrames? Try pandera.
pandera is a data validation library for Pandas DataFrames and Series. It provides a convenient way to define and enforce data quality constraints.
You can define complex custom constraints or use the built-in ones.
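For intuition, the age constraint in the example below could be checked by hand in plain Pandas. This is a rough sketch of what pandera automates; with a schema you also get typed columns, reusable checks, and informative errors:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 40, 200],
})

# Manual check: every age must lie strictly between 0 and 100
invalid = df[~df["age"].between(1, 99)]
if not invalid.empty:
    print(f"{len(invalid)} invalid row(s):\n{invalid}")
```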
!pip install pandera

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({
    "name": pa.Column(pa.String),
    "age": pa.Column(pa.Int, checks=[
        pa.Check(lambda x: x > 0, element_wise=True),
        pa.Check(lambda x: x < 100, element_wise=True)
    ]),
})

data = {
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 40, 200],
}

df = pd.DataFrame(data)

schema.validate(df)  # raises a SchemaError: age 200 fails the x < 100 check
8.2.4. Boost Pandas’ Performance With One Line With modin
Thinking of rewriting a large Pandas codebase for speed? Think again.
You can use modin as a drop-in replacement for Pandas, with a 3X-5X speed-up. Just install modin and replace the import statement.
It may not be as fast as Polars, but you will save hours of development time while still gaining a performance boost.
!pip install "modin[all]"
import modin.pandas as pd
df = pd.read_csv("")
8.2.5. Chat with your Dataframe with PandasAI
You can chat with your Pandas DataFrame in a few lines of code.
With PandasAI, you can use LLMs to analyze your data, generate visuals, and create reports in your own words.
Currently, PandasAI supports popular LLMs from providers like OpenAI, Anthropic, Google, and Amazon, or Ollama for local LLMs.
!pip install pandasai

from pandasai import SmartDataframe
from pandasai.llm import OpenAI
from pandasai.helpers.openai_info import get_openai_callback

llm = OpenAI()
df = SmartDataframe("data.csv", config={"llm": llm, "conversational": False})

with get_openai_callback() as cb:
    response = df.chat("Calculate the sum of the gdp of north american countries")