5.8. Preprocessing#

5.8.1. Clean your text data with clean-text#

Content on the web and in social media is never clean.

clean-text does the preprocessing for you.

You can specify if and how you want to clean your text.

!pip install clean-text[gpl]
from cleantext import clean

text = '''
       If you want to talk, send me an email: testmail@outlook.com,
       call me +71112392 or visit my website: https://testurl.com.
       Calling me is not free, it’s costing 0.40$ per minute.
       '''

clean(text,
    fix_unicode=True,              # fix various unicode errors
    to_ascii=True,                 # transliterate to closest ASCII representation
    lower=True,                    # lowercase text
    no_urls=True,                  # replace all URLs with a special token
    no_emails=True,                # replace all email addresses with a special token
    no_phone_numbers=True,         # replace all phone numbers with a special token
    no_numbers=True,               # replace all numbers with a special token
    no_digits=True,                # replace all digits with a special token
    no_currency_symbols=True,      # replace all currency symbols with a special token
    no_punct=True,                 # remove punctuations
    lang="en"                      # set to 'de' for German special handling
)
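To get a rough idea of what such a cleaning step does, here is a minimal stand-in built only on the standard library. The regexes and placeholder tokens below are illustrative choices, not clean-text's actual implementation:

```python
import re

def naive_clean(text: str) -> str:
    """Very rough stand-in for clean-text: lowercase and mask URLs,
    emails, and numbers with placeholder tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "<url>", text)          # mask URLs
    text = re.sub(r"\S+@\S+\.\S+", "<email>", text)        # mask emails
    text = re.sub(r"\+?\d[\d\s-]{5,}\d", "<phone>", text)  # mask phone numbers
    text = re.sub(r"\d+(\.\d+)?", "<number>", text)        # mask remaining numbers
    return " ".join(text.split())                          # normalize whitespace

print(naive_clean("Email me: testmail@outlook.com or call +71112392."))
```

clean-text does all of this (and much more, e.g. unicode repair via ftfy and ASCII transliteration) with well-tested rules, so prefer it over hand-rolled regexes in real projects.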

5.8.2. Detect and Fix your Data Quality Issues#

Do you want to detect data quality issues?

Try pandas_dq.

pandas_dq is a relatively new library, focusing on detecting data quality issues and fixing them automatically, such as:

  • Zero-Variance Columns

  • Rare Categories

  • Highly correlated Features

  • Skewed Distributions

!pip install pandas_dq -q
import pandas as pd
import numpy as np
from pandas_dq import dq_report, Fix_DQ
from sklearn.datasets import load_iris
data = load_iris()
data = pd.DataFrame(data=np.c_[data['data'], data['target']],
                    columns=data['feature_names'] + ['target'])
dq_report(data, verbose=1)
fdq = Fix_DQ()
data_transformed = fdq.fit_transform(data)
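Checks like these are simple to reason about on their own. A hand-rolled sketch of two of them, zero-variance columns and highly correlated column pairs, might look like this; the 0.9 threshold and the toy DataFrame are arbitrary choices for illustration:

```python
import pandas as pd

def zero_variance_columns(df: pd.DataFrame) -> list:
    """Columns where every row holds the same single value."""
    return [col for col in df.columns if df[col].nunique() <= 1]

def highly_correlated_pairs(df: pd.DataFrame, threshold: float = 0.9) -> list:
    """Pairs of numeric columns with |correlation| above the threshold."""
    corr = df.corr(numeric_only=True).abs()
    pairs = []
    for i, a in enumerate(corr.columns):
        for b in corr.columns[i + 1:]:
            if corr.loc[a, b] > threshold:
                pairs.append((a, b))
    return pairs

df = pd.DataFrame({"constant": [1, 1, 1, 1],
                   "x": [1.0, 2.0, 3.0, 4.0],
                   "y": [2.0, 4.0, 6.0, 8.0]})
print(zero_variance_columns(df))    # ['constant']
print(highly_correlated_pairs(df))  # [('x', 'y')]
```

pandas_dq bundles many such checks behind one report and one transformer, which is the point of using it.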

5.8.3. Convert Natural Language Numbers into their Numerical Representation#

If you want to convert natural language numbers into numerical values, try numerizer.

numerizer is a Python library for converting numbers in text to their corresponding numerical values.

numerizer supports a wide range of numeric formats, including whole numbers, decimals, percentages and currencies.

Note: Since version 0.2, numerizer is available as a SpaCy extension.

!pip install numerizer
from numerizer import numerize

text_1 = "Twenty five dollars"
text_2 = "Two hundred and forty three thousand four hundred and twenty one"
text_3 = "platform nine and three quarters"


num_1 = numerize(text_1)
num_2 = numerize(text_2)
num_3 = numerize(text_3)

print(num_1) # Output: 25 dollars
print(num_2) # Output: 243421
print(num_3) # Output: platform 9.75
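To build an intuition for what numerizer does, here is a toy converter for simple whole-number phrases. It only knows the hand-picked vocabulary below and is nowhere near numerizer's coverage:

```python
WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
         "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
         "hundred": 100, "thousand": 1000}

def to_number(text: str) -> int:
    """Parse a simple English number phrase into an int."""
    total = current = 0
    for word in text.lower().split():
        if word == "and":
            continue                  # 'two hundred and three'
        value = WORDS[word]
        if value == 100:
            current *= value          # 'two hundred' -> 200
        elif value == 1000:
            total += current * value  # close out the thousands group
            current = 0
        else:
            current += value
    return total + current

print(to_number("twenty five"))  # 25
```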

5.8.4. Make your Numbers and dates human-friendly#

Looking to make your numbers human-friendly?

Try humanize.

humanize formats your numbers and dates in a way that is intuitive to understand.

It provides various functionalities like:

  • Convert large integers to readable file sizes

  • Convert floats to fractions

  • Convert dates into a human-understandable format

  • Make big integers more readable

!pip install humanize
import humanize

# Convert bytes to human readable format
humanize.naturalsize(1024000) # Output: 1.0 MB

# Convert a number to its word equivalent
humanize.intword(123500000) # Output: 123.5 million

# Convert a float to its fractional equivalent
humanize.fractional(0.9) # Output: 9/10

# Convert seconds to a readable format
import datetime as dt
humanize.naturaldelta(dt.timedelta(seconds = 1200)) # Output: 20 minutes
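humanize.naturalsize is easy to approximate yourself; a bare-bones decimal (1000-based) sketch of the idea might look like this:

```python
def natural_size(num_bytes: float) -> str:
    """Format a byte count with decimal (1000-based) units,
    roughly like humanize.naturalsize's default mode."""
    if abs(num_bytes) < 1000:
        return f"{int(num_bytes)} bytes"
    for unit in ("kB", "MB", "GB", "TB"):
        num_bytes /= 1000
        if abs(num_bytes) < 1000 or unit == "TB":
            return f"{num_bytes:.1f} {unit}"

print(natural_size(1024000))  # 1.0 MB
```

humanize additionally supports binary (1024-based) units, localization, and many other formats, so it is the better choice beyond quick scripts.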

5.8.5. Cleaner Pipeline definition in Scikit-Learn#

When you build Pipelines in scikit-learn,

use make_pipeline instead of the Pipeline class.

With the Pipeline class, definitions can get verbose for complex pipelines because you have to name every step yourself.

make_pipeline makes your pipeline definition short and elegant.

from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

num_pipeline = make_pipeline(KNNImputer(), RobustScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy="most_frequent"))

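make_pipeline names each step automatically after its class (lowercased), which is exactly what you would otherwise spell out by hand with Pipeline:

```python
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import RobustScaler

short = make_pipeline(KNNImputer(), RobustScaler())
print([name for name, _ in short.steps])  # ['knnimputer', 'robustscaler']

# The equivalent, more verbose Pipeline definition:
long = Pipeline([("knnimputer", KNNImputer()),
                 ("robustscaler", RobustScaler())])
```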
5.8.6. Select Columns for your Pipeline easily#

If you want a convenient way to select columns for your scikit-learn pipelines,

use make_column_selector.

You can even provide complex regex patterns to select the columns you want.

Afterward, you can use the result in your Pipelines easily.

import numpy as np
from sklearn.compose import make_column_selector

# Will only select columns with 'Feature' in their name
columns_with_feature = make_column_selector(pattern='Feature')

# Will only select numeric columns
num_columns = make_column_selector(dtype_include=np.number)
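A selector is just a callable that takes a DataFrame and returns the matching column names, so you can sanity-check it directly before wiring it into a ColumnTransformer. The toy DataFrame below is only for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector

df = pd.DataFrame({"Feature_a": [1, 2],
                   "Feature_b": [3.0, 4.0],
                   "label": ["x", "y"]})

select_features = make_column_selector(pattern="Feature")
select_numeric = make_column_selector(dtype_include=np.number)

print(select_features(df))  # ['Feature_a', 'Feature_b']
print(select_numeric(df))   # ['Feature_a', 'Feature_b']
```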

5.8.7. Rare Label Encoding with feature-engine#

How do you tackle rare labels in your dataset?

Rare labels can cause issues during model training, as they may not have sufficient representation for the model to learn meaningful patterns.

For this problem, use RareLabelEncoder from feature_engine.

It will convert all rare labels (based on a threshold) to the label “Rare”.

!pip install feature_engine
import pandas as pd
from feature_engine.encoding import RareLabelEncoder

data = ['red', 'blue', 'red', 'green', 'yellow', 'yellow', 'red', "black", "violet", "green", "green"]

df = pd.DataFrame({'color': data})

rare_encoder = RareLabelEncoder(tol=0.1, n_categories=5, variables=['color'])

df_encoded = rare_encoder.fit_transform(df)

df_encoded
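The grouping itself is straightforward to reproduce with plain pandas, which also illustrates what a relative-frequency cut-off of 0.1 means for the data above (this is a sketch of the idea, not feature-engine's implementation):

```python
import pandas as pd

colors = ['red', 'blue', 'red', 'green', 'yellow', 'yellow',
          'red', 'black', 'violet', 'green', 'green']
df = pd.DataFrame({'color': colors})

freq = df['color'].value_counts(normalize=True)
rare = freq[freq < 0.1].index  # labels below the 10% threshold
df['color_grouped'] = df['color'].where(~df['color'].isin(rare), 'Rare')

print(df['color_grouped'].value_counts())
```

Here 'blue', 'black', and 'violet' each appear once (about 9% of rows), so they fall below the threshold and are collapsed into 'Rare'.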