5.2. EDA

5.2.1. Analyze and visualize data interactively with D-Tale

Do you still perform your EDA manually?

And do the same repetitive steps?

Let D-Tale help you.

D-Tale is a powerful Python library that allows you to easily inspect and analyze your data interactively.

You can view your data in a web-based interface with various visualizations.

D-Tale supports a wide variety of data types and formats, including CSV, Excel, JSON, SQL, and more.

Bonus: You can export the corresponding source code for your steps.

!pip install dtale
import dtale
import pandas as pd

df = pd.DataFrame([dict(a=1,b=2,c=3)])

d = dtale.show(df)

5.2.2. Use Dark Mode in Matplotlib

For all Dark Mode fans:

You can use Matplotlib in Dark Mode too.

Just activate the built-in 'dark_background' style before you create your figure.

import matplotlib.pyplot as plt

plt.style.use('dark_background')

fig, ax = plt.subplots()

ax.plot(range(1, 5), range(1, 5))
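If you only want Dark Mode for a single figure rather than the whole session, a minimal sketch could use Matplotlib's `plt.style.context()` context manager. The helper name `plot_line` is made up for this illustration.

```python
import matplotlib.pyplot as plt


def plot_line(style_name):
    """Draw the same line plot with a style applied only temporarily.

    `plot_line` is a hypothetical helper for this example.
    """
    # The style is active only inside the `with` block
    with plt.style.context(style_name):
        fig, ax = plt.subplots()
        ax.plot(range(1, 5), range(1, 5))
        ax.set_title(style_name)
    return fig


dark_fig = plot_line("dark_background")  # black figure background
light_fig = plot_line("default")         # Matplotlib's default look
```

Figures created outside the `with` block keep whatever style was active before, so the rest of your notebook is unaffected.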

5.2.3. No-Code EDA with PandasGUI

PandasGUI provides a PyQt application to analyze and interactively plot your pandas DataFrames without writing a lot of code.

It offers various functionalities like:

  • Filtering

  • Summary Statistics

  • Different Visualizations like Word Clouds, Bar Charts, etc.

from pandasgui import show
from pandasgui.datasets import pokemon
show(pokemon)

5.2.4. Analyze Missing Values with missingno

If you want to analyze missing values in your data, use missingno’s nullity correlation.

It shows how strongly the presence or absence of a value in one column is related to the presence or absence of values in other columns.

The heatmap works great for picking out data completeness relationships between variable pairs.

!pip install missingno
import missingno
import pandas as pd

df = pd.read_csv("your_data.csv")

missingno.heatmap(df)
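To get a feel for the numbers behind the heatmap, here is a rough pandas-only sketch of a nullity correlation. It uses a small made-up DataFrame and is an approximation of the idea, not missingno's own code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data:
# "b" is missing exactly where "a" is missing,
# "c" is missing exactly where "a" is present,
# "d" has no missing values at all.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "b": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "c": [np.nan, 2.0, np.nan, 4.0, np.nan, np.nan],
    "d": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# True where a value is missing
null_mask = df.isnull()

# Columns that are always present (or always missing) have no variance,
# so a correlation is undefined for them; drop them first
partial = null_mask.loc[:, null_mask.nunique() > 1]

# Pearson correlation of the 0/1 missingness indicators
nullity_corr = partial.astype(int).corr()
print(nullity_corr.round(2))
```

A value near 1 means two columns tend to be missing together, a value near -1 means one is present when the other is missing, and values near 0 indicate no clear relationship.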

5.2.5. Powerful Correlation with phik

Are you looking for a powerful correlation method?

Try Phik!

Phik (or ϕk) works consistently between categorical, ordinal and interval variables while capturing non-linear dependencies.

It reverts to Pearson’s correlation coefficient when the input follows a bivariate normal distribution.

See below how you can use it in Python.

!pip install phik
import pandas as pd
import matplotlib.pyplot as plt

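import phik  # importing phik adds the phik_matrix() method to DataFrames
from phik import resources
from phik.report import plot_correlation_matrix

# Load the example dataset that ships with phik
df = pd.read_csv(resources.fixture('fake_insurance_data.csv.gz'))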
corr_matrix = df.phik_matrix()
plot_correlation_matrix(corr_matrix.values, 
                        x_labels=corr_matrix.columns, 
                        y_labels=corr_matrix.index, 
                        vmin=0, 
                        vmax=1,  
                        color_map="Blues",
                        title="Correlation Matrix", 
                        fontsize_factor=1.5, 
                        figsize=(10, 8))
plt.tight_layout()

5.2.6. Display X-Axis of Time Series Plots Correctly with autofmt_xdate()

Plotting Time Series data in Matplotlib often leaves you with overlapping and unreadable date labels on the x-axis.

To solve this problem, use fig.autofmt_xdate().

It rotates and right-aligns the x-axis labels so they no longer collide.

import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from datetime import datetime
import numpy as np

dates = np.array([
    datetime(2023, 1, 1),
    datetime(2023, 2, 1),
    datetime(2023, 3, 1),
    datetime(2023, 4, 1),
    datetime(2023, 5, 1),
    datetime(2023, 6, 1),
    datetime(2023, 7, 1)
])
values = np.array([10, 20, 15, 25, 30, 28, 35])

# Create a figure and an axes object
fig, ax = plt.subplots()

ax.plot(dates, values)

ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))

fig.autofmt_xdate()

plt.show()
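If the default rotation is not enough for a dense axis, `autofmt_xdate()` also accepts `rotation` and `ha` (horizontal alignment) keyword arguments. A small sketch with made-up daily data:

```python
from datetime import datetime, timedelta

import matplotlib.pyplot as plt

# Hypothetical daily series, dense enough for crowded labels
dates = [datetime(2023, 1, 1) + timedelta(days=i) for i in range(60)]
values = list(range(60))

fig, ax = plt.subplots()
ax.plot(dates, values)

# Rotate the date labels by 45 degrees and right-align them
fig.autofmt_xdate(rotation=45, ha="right")

# Inspect the first tick label to confirm the settings were applied
first_label = ax.get_xticklabels()[0]
print(first_label.get_rotation(), first_label.get_ha())
```

Call `plt.show()` afterwards to display the figure when you are not working in a notebook.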

5.2.7. Beautiful Map Plots with Plotly

With Plotly, you can create beautiful geo maps with a few lines of code.

Plotly supports map plots like:

  • Filled areas on maps

  • Bubble Maps

  • Hexbin maps

  • Lines on maps

import plotly.express as px
df = px.data.gapminder().query("year == 2007")
fig = px.scatter_geo(df, locations="iso_alpha",
                     size="pop")
fig.show()

5.2.8. Mosaic Plots with Matplotlib

You can create a mosaic of subplots in Matplotlib.

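plt.subplot_mosaic() (or fig.subplot_mosaic() on an existing figure) allows you to arrange multiple subplots in a grid-like fashion, specifying their positions and sizes using a string.

Each character names a subplot, a repeated character makes that subplot span several cells, and a "." leaves a cell empty.

It is a powerful way to control the layout of your subplots.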

import matplotlib.pyplot as plt

# Define the layout of subplots
layout = '''
    AAE
    C.E
    '''

fig = plt.figure(constrained_layout=True)
axd = fig.subplot_mosaic(layout)

for key, ax in axd.items():
    ax.plot([1, 2, 3, 4], [1, 4, 9, 16])
    ax.set_title(f"Plot {key}")
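The same layout can also be written as a single-line string with `;` separating the rows. The sketch below uses that shorthand and inspects the resulting dictionary to show how the letters map to grid cells.

```python
import matplotlib.pyplot as plt

# "AAE;C.E" is the one-line form of the two-row layout above
fig, axd = plt.subplot_mosaic("AAE;C.E", constrained_layout=True)

# The returned dict maps each label to its Axes; "." creates no Axes
labels = sorted(axd)

# "A" spans two columns, "E" spans two rows
a_columns = list(axd["A"].get_subplotspec().colspan)
e_rows = list(axd["E"].get_subplotspec().rowspan)

for label in labels:
    axd[label].set_title(f"Plot {label}")

print(labels, a_columns, e_rows)
```

Because the axes are addressed by label rather than by index, you can rearrange the layout string without touching the plotting code.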