8.2. Utility Libraries for Pandas#

8.2.1. Speed up Pandas’ apply()#

Don’t use .apply() in Pandas blindly!

.apply() is used to apply operations on all the elements in a dataframe (row-wise or column-wise).

It’s the most obvious choice, but there is a better option:

Instead, use the Swifter package.

Swifter tries to pick the best way to implement the .apply() function by either:

  • Vectorizing your function

  • Parallelizing using Dask

  • Using .apply() from Pandas if the dataset is small

That gives your function a huge boost.

In the example below, you only need to replace .apply() with .swifter.apply() to make use of Swifter’s capabilities.

!pip install swifter
import swifter
import pandas as pd

df = pd.DataFrame(...)

def my_function(input_value):
    ...
    return output_value

df["Column"] = df["Column"].swifter.apply(my_function)
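Under the hood, the biggest win comes from vectorization. As a rough pandas-only sketch (the `normalize` function and column are invented for illustration), this is the kind of rewrite Swifter attempts automatically:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": np.random.rand(1_000)})

def normalize(x):
    # plain Python function, called once per element
    return (x - 0.5) * 2

# element-wise apply: one Python-level call per row
slow = df["value"].apply(normalize)

# vectorized equivalent: one NumPy operation over the whole column
fast = (df["value"] - 0.5) * 2

assert slow.equals(fast)
```

On a large Series, the vectorized form avoids one Python function call per element, which is where most of .apply()'s overhead comes from.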

8.2.2. Reduce DataFrame Memory with dtype_diet#

By default, Pandas DataFrames don’t use the smallest possible data types for their columns.

This results in unnecessary memory usage.

Changing data types can drastically reduce the memory usage of your DataFrame.

Using dtype_diet, you can automatically change the data types to the smallest (and most memory-efficient) one.

!pip install dtype-diet
from dtype_diet import optimize_dtypes, report_on_dataframe
import pandas as pd

df = pd.read_csv("")
# Get Recommendations
proposed_df = report_on_dataframe(df, unit="MB")
new_df = optimize_dtypes(df, proposed_df)
print(f'Original df memory: {df.memory_usage(deep=True).sum()/1024/1024} MB')
print(f'Proposed df memory: {new_df.memory_usage(deep=True).sum()/1024/1024} MB')
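For intuition, here is roughly what such an optimization boils down to in plain Pandas (column names and values are made up for illustration): downcasting numeric columns and converting low-cardinality strings to categoricals.

```python
import pandas as pd

df = pd.DataFrame({
    "id": range(1_000),                  # int64 by default
    "score": [1.5] * 1_000,              # float64 by default
    "city": ["Berlin", "Paris"] * 500,   # object by default
})

before = df.memory_usage(deep=True).sum()

# downcast numeric columns to the smallest type that fits the values
df["id"] = pd.to_numeric(df["id"], downcast="unsigned")     # -> uint16
df["score"] = pd.to_numeric(df["score"], downcast="float")  # -> float32
# low-cardinality strings compress well as categoricals
df["city"] = df["city"].astype("category")

after = df.memory_usage(deep=True).sum()
print(f"{before/1024:.1f} KB -> {after/1024:.1f} KB")
```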

8.2.3. Validate Pandas DataFrames with pandera#

Do you want to validate your Pandas DataFrames?

Try pandera.

pandera is a data validation library for Pandas DataFrames and Series.

It provides a convenient way to define and enforce data quality constraints.

You can even define complex constraints or use the in-built constraints.

!pip install pandera
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({
    "name": pa.Column(pa.String),
    "age": pa.Column(pa.Int, checks=[
        pa.Check(lambda x: x > 0, element_wise=True),
        pa.Check(lambda x: x < 100, element_wise=True)
    ]),
})

data = {
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 40, 200],
}
df = pd.DataFrame(data)


schema.validate(df)  # raises a SchemaError: age 200 fails the x < 100 check
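To see what such an element-wise check enforces, the same age constraint can be expressed as a plain boolean mask (a pandas-only sketch, not part of pandera's API):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 40, 200],
})

# same constraint as the pandera checks: 0 < age < 100
valid = (df["age"] > 0) & (df["age"] < 100)

# pandera raises a SchemaError on failure; here we just collect the bad rows
failures = df[~valid]
print(failures)  # Charlie's age of 200 fails the check
```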

8.2.4. Boost Pandas’ Performance With One Line With modin#

If you already have a large codebase built on Pandas and are considering rewriting it for speed, think again.

You can also use modin as a drop-in replacement for Pandas, with a 3X-5X speed-up.

Just install modin and replace the import statement.

It may not be as fast as Polars, but you will save hours of development time and still gain a performance boost.

!pip install "modin[all]"
import modin.pandas as pd

df = pd.read_csv("")
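Since Modin mirrors the Pandas API, one common pattern (an assumption, not prescribed here) is a guarded import, so the same code runs whether or not Modin is installed:

```python
# drop-in swap: the rest of the codebase stays unchanged
try:
    import modin.pandas as pd  # parallel engine when available
except ImportError:
    import pandas as pd        # plain Pandas fallback

df = pd.DataFrame({"gdp": [21.4, 1.6, 1.3]})
total = df["gdp"].sum()
print(total)
```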

8.2.5. Chat with your DataFrame with PandasAI#

You can chat with your Pandas DataFrame in a few lines of code.

With PandasAI, you can use LLMs to analyze your data, generate visuals, and create reports in your own words.

Currently, PandasAI supports popular LLMs from providers like OpenAI, Anthropic, Google, Amazon, or Ollama for local LLMs.

!pip install pandasai
from pandasai import SmartDataframe
from pandasai.llm import OpenAI
from pandasai.helpers.openai_info import get_openai_callback

llm = OpenAI()

df = SmartDataframe("data.csv", config={"llm": llm, "conversational": False})

with get_openai_callback() as cb:
    response = df.chat("Calculate the sum of the gdp of north american countries")

print(response)
print(cb)  # token usage of the call