8.2. Utility Libraries for Pandas
8.2.1. Speed up Pandas’ apply()
Don’t use .apply() in Pandas blindly!
.apply() applies an operation to every element in a DataFrame (row-wise or column-wise). It’s the most obvious choice, but there is a better option: the Swifter package.
Swifter tries to pick the best way to run your .apply() function by either:
- Vectorizing your function
- Parallelizing using Dask
- Falling back to Pandas’ .apply() if the dataset is small
That gives your function a huge boost. In the example below, you only need to use df.swifter.apply() to take advantage of Swifter’s capabilities.
!pip install swifter

import swifter
import pandas as pd

df = pd.DataFrame(...)

def my_function(input_value):
    ...
    return output_value

df["Column"] = df["Column"].swifter.apply(lambda x: my_function(x))
8.2.2. Reduce DataFrame Memory with dtype_diet
By default, Pandas DataFrames don’t use the smallest possible data types for their columns.
This results in unnecessary memory usage.
Changing data types can drastically reduce the memory usage of your DataFrame.
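To see why, here is a minimal hand-rolled sketch in plain Pandas (the column name and values are made up; dtype_diet automates this kind of analysis across a whole DataFrame):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"count": np.arange(100_000)})  # defaults to int64

before = df.memory_usage(deep=True).sum()

# Downcast to the smallest unsigned integer type that fits the values
df["count"] = pd.to_numeric(df["count"], downcast="unsigned")

after = df.memory_usage(deep=True).sum()
print(f"{before / 1024:.0f} KB -> {after / 1024:.0f} KB")
```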
Using dtype_diet, you can automatically change the data types to the smallest (and most memory-efficient) ones.
!pip install dtype-diet

from dtype_diet import optimize_dtypes, report_on_dataframe
import pandas as pd

df = pd.read_csv("")

# Get recommendations
proposed_df = report_on_dataframe(df, unit="MB")
new_df = optimize_dtypes(df, proposed_df)

print(f"Original df memory: {df.memory_usage(deep=True).sum()/1024/1024} MB")
print(f"Proposed df memory: {new_df.memory_usage(deep=True).sum()/1024/1024} MB")
8.2.3. Validate Pandas DataFrames with pandera
Do you want to validate your Pandas DataFrames? Try pandera.
pandera is a data validation library for Pandas DataFrames and Series. It provides a convenient way to define and enforce data quality constraints.
You can define complex custom constraints or use the built-in ones.
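For intuition, the age constraint in the example below could be checked by hand in plain Pandas. This is a rough sketch of what pandera automates; with a schema you also get typed columns, reusable checks, and informative errors:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 40, 200],
})

# Manual check: every age must lie strictly between 0 and 100
invalid = df[~df["age"].between(1, 99)]
if not invalid.empty:
    print(f"{len(invalid)} invalid row(s):\n{invalid}")
```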
!pip install pandera

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({
    "name": pa.Column(pa.String),
    "age": pa.Column(pa.Int, checks=[
        pa.Check(lambda x: x > 0, element_wise=True),
        pa.Check(lambda x: x < 100, element_wise=True)
    ]),
})

data = {
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 40, 200],
}

df = pd.DataFrame(data)

schema.validate(df)  # raises a SchemaError: age 200 fails the x < 100 check
8.2.4. Boost Pandas’ Performance With One Line With modin
Thinking of rewriting a large Pandas codebase for speed? Think again.
You can use modin as a drop-in replacement for Pandas, with a 3X-5X speed-up. Just install modin and replace the import statement.
It may not be as fast as Polars, but you will save hours of development time while still gaining a performance boost.
!pip install "modin[all]"
import modin.pandas as pd
df = pd.read_csv("")
8.2.5. Chat with your Dataframe with PandasAI
You can chat with your Pandas DataFrame in a few lines of code.
With PandasAI, you can use LLMs to analyze your data, generate visuals, and create reports in your own words.
Currently, PandasAI supports popular LLMs from providers like OpenAI, Anthropic, Google, and Amazon, or Ollama for local LLMs.
!pip install pandasai

from pandasai import SmartDataframe
from pandasai.llm import OpenAI
from pandasai.helpers.openai_info import get_openai_callback

llm = OpenAI()
df = SmartDataframe("data.csv", config={"llm": llm, "conversational": False})

with get_openai_callback() as cb:
    response = df.chat("Calculate the sum of the gdp of north american countries")