8.1. Pandas Tips and Tricks#

8.1.1. Filter Pandas in a readable format#

Do you need a more readable way to filter Dataframes in Pandas?

Try df.query().

You can specify the condition using a string.

This can be sometimes more convenient and readable than boolean indexing.

And it’s fast, due to the optimized Cython-based code used under the hood.

import pandas as pd

df = pd.DataFrame({"A": [1,2,3], "B": [4,5,6]})

filtered_df = df.query("A > 1 & B < 6")

8.1.2. Get frequency of time series with pd.infer_freq#

When working with time series, it is useful to know the frequency of the data.

But especially in larger datasets, it might be difficult to tell the frequency immediately.

To get the frequency of a time series in Pandas, use pd.infer_freq().

It infers the most likely frequency given the input index.

Below you can see how to infer the most likely frequency for a DatetimeIndex.

import pandas as pd
time_index = pd.date_range(start='1/1/2020 20:00:00', end='10/1/2020 00:00:00')

pd.infer_freq(time_index)
'D'

8.1.3. Change the Plotting Backend#

By default, Pandas uses matplotlib as its plotting backend.

You can change it to, let’s say Plotly, with one line of code.

See below how to do that with only one line.

import pandas as pd

pd.options.plotting.backend = 'plotly'

df = pd.DataFrame(dict(a=[5,7,9,3], b=[1,6,4,10]))
fig = df.plot()
fig.show()

8.1.4. Style your DataFrames#

Did you know you can style your DataFrames in Pandas?

You just have to define a condition to apply colors in a function.

And use DataFrame.style.applymap() to apply the condition.

import pandas as pd

sales_data = sales_data = {
    'Product Name': ['Product A', 'Product B', 'Product C', 'Product D'],
    'Revenue': [10000, 5000, 15000, 1000],
}

sales_df = pd.DataFrame(sales_data)

# Apply styles to the dataframe
def coloring(val):
    color = 'red' if val <= 10000 else 'green'
    return 'background-color: %s' % color

sales_df.style.applymap(coloring, subset=['Revenue'])

8.1.5. Set Precision of Displayed Floats#

In Pandas, you can control the precision of the displayed values.

Just use the .set_option() function.

import pandas as pd

pd.set_option('display.precision', 2)

data = {'Value': [1.2343129, 5.8956701, 6.224289]}
df = pd.DataFrame(data)

8.1.6. Faster I/O with Parquet#

Whenever you work with bigger datasets, please avoid using CSV format (or similar).

CSV files are text files, which are human-readable, and therefore a popular option to store data.

For small datasets, this is not a big issue.

But, what if your data has millions of rows?

It can get really slow to do read/write operations on them.

On the other side, binary files exist too.

They consist of 0s and 1s and are not meant to be human-readable but to be used by programs that know how to interpret them.

Because of that, binary files are more compact and consume less space.

Parquet is one popular binary file format, which is more memory-efficient than CSVs.

import pandas as pd

# Shape: (100000000, 5)
df = pd.DataFrame(...)

# Time: 1m 58s
df.to_csv("data.csv")

# Time: 8s
df.to_parquet("data.parquet")