The Ultimate Guide to Machine Learning: Feature Engineering — Part 2

Simranjeet Singh
21 min read · Feb 28, 2023

Introduction

Welcome to the second part of “The Ultimate Guide to Machine Learning”. In the first part, we discussed Exploratory Data Analysis (EDA), which is a crucial step in the machine learning pipeline. In this part, we will delve into Feature Engineering, another important aspect of the machine learning process.

Feature Engineering is the process of transforming raw data into meaningful features that can be used by machine learning algorithms to make accurate predictions. It involves selecting, extracting, and transforming features to enhance the performance of the model. Good feature engineering can make a huge difference in the accuracy of the model, while bad feature engineering can lead to poor performance.

👉 Before Starting the Blog, Please Subscribe to my YouTube Channel and Follow Me on Instagram 👇
📷 YouTube — https://bit.ly/38gLfTo
📃 Instagram — https://bit.ly/3VbKHWh

👉 Do Donate 💰 or Give me a Tip 💵 if you really like my blogs, because I am from India and am not able to get into the Medium Partner Program. Click Here to Donate or Tip 💰 — https://bit.ly/3oTHiz3

Fig.1 — Feature Engineering

In this guide, we will cover a range of techniques that are commonly used in Feature Engineering. We will start with feature selection and extraction, which involves identifying the most important features in the data. Then, we will move on to encoding categorical variables, which is an essential step when working with non-numerical data. We will also cover scaling and normalization, creation of new features, handling imbalanced data, handling skewness and kurtosis, handling rare categories, handling time-series data, feature transformation, one-hot encoding, count and frequency encoding, binning, grouping, and text preprocessing.

By the end of this guide, you will have a comprehensive understanding of Feature Engineering techniques and how they can be used to enhance the performance of your machine learning models. Let’s get started!

Table of Contents

  1. Feature selection and extraction
  2. Encoding categorical variables
  3. Scaling and Normalization
  4. Creation of new features
  5. Handling imbalanced data
  6. Handling skewness and kurtosis
  7. Handling rare categories
  8. Handling time-series data
  9. Text preprocessing

Feature selection and extraction

Feature selection and extraction is an essential part of machine learning that involves selecting the most relevant features from the dataset to improve the model’s accuracy and efficiency. Here, we will discuss some popular methods for feature selection and extraction, along with Python code snippets.

1. Principal Component Analysis (PCA): PCA is a dimensionality reduction technique that reduces the number of features in the dataset by finding a new set of features that captures the most variance in the data. The new features, called principal components, are orthogonal to each other and can be used to approximately reconstruct the original dataset.

Let’s see how to perform PCA on a dataset using scikit-learn:

from sklearn.decomposition import PCA

# create a PCA object
pca = PCA(n_components=2)

# fit and transform the data
X_pca = pca.fit_transform(X)

# calculate the explained variance ratio
print("Explained variance ratio:", pca.explained_variance_ratio_)

Here, we create a PCA object and specify the number of components we want to extract. We then fit and transform the data to obtain the new set of features. Finally, we calculate the explained variance ratio to determine how much variance in the data is captured by each principal component.
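
Because the components capture most of the variance, they can also be used to approximately reconstruct the original data, as noted above. A minimal sketch continuing from the snippet (reusing the same pca object and feature matrix X):

import numpy as np

# Approximately reconstruct the original features from the retained components
X_reconstructed = pca.inverse_transform(X_pca)

# The mean squared reconstruction error reflects the variance lost with the dropped components
reconstruction_error = np.mean((np.asarray(X) - X_reconstructed) ** 2)
print("Mean squared reconstruction error:", reconstruction_error)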

2. Linear Discriminant Analysis (LDA): LDA is a supervised learning technique that is used for feature extraction in classification problems. It works by finding a new set of features that maximizes the separation between the classes in the data.

Let’s see how to perform LDA on a dataset using scikit-learn:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# create an LDA object
lda = LinearDiscriminantAnalysis(n_components=1)

# fit and transform the data
X_lda = lda.fit_transform(X, y)

Here, we create an LDA object and specify the number of components we want to extract. We then fit and transform the data to obtain the new set of features.

3. Correlation Analysis: Correlation analysis is used to identify the correlation between the features in the dataset. Features that are highly correlated with each other can be removed from the dataset as they provide redundant information.

Let’s see how to perform correlation analysis on a dataset using pandas:

import pandas as pd
import numpy as np

# calculate the absolute correlation matrix
corr_matrix = df.corr().abs()

# keep only the upper triangle so each feature pair is considered once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# drop one feature from every highly correlated pair (correlation > 0.8)
to_drop = [column for column in upper.columns if (upper[column] > 0.8).any()]
df = df.drop(columns=to_drop)

Here, we calculate the absolute correlation matrix with pandas and keep only its upper triangle so that each pair of features is considered once. We then drop one feature from every pair whose correlation exceeds 0.8 using the drop method.

https://www.researchgate.net/figure/Overview-of-feature-selection-methods-for-machine-learning-algorithms_fig1_344212522
Fig.2 — Feature Selection Measures

4. Recursive Feature Elimination (RFE): RFE is a method for selecting features by recursively considering smaller and smaller subsets of features. At each iteration, the model is trained on the remaining features and the importance of each feature is ranked. The least important feature is then eliminated, and the process is repeated until the desired number of features is obtained.

Here is an example of using RFE for feature selection:

import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_diabetes  # load_boston has been removed from recent scikit-learn versions

data = load_diabetes()
X, y = data.data, data.target

model = LinearRegression()
rfe = RFE(model, n_features_to_select=5)
rfe.fit(X, y)

selected_features = np.array(data.feature_names)[rfe.support_]
print(selected_features)

5. Tree-based Methods: Decision trees and random forests are popular tree-based methods used for this purpose. In these methods, a tree structure is created based on the features that are most important for predicting the target variable. The importance of each feature is calculated by the reduction in impurity that results from splitting the data based on that feature.

In decision trees, the feature with the highest information gain is selected as the root node, and the data is split based on that feature. This process is repeated recursively until a stopping criterion is met, such as a maximum tree depth or a minimum number of samples per leaf.

In random forests, multiple decision trees are built using random subsets of the features and the data. The importance of each feature is calculated as the average reduction in impurity across all trees. This helps to reduce the variance of the model and improve its generalizability.

from sklearn.ensemble import RandomForestRegressor

# Load the data (load_data() is a placeholder; X is assumed to be a DataFrame)
X, y = load_data()

# Create a random forest regressor
rf = RandomForestRegressor(n_estimators=100, random_state=42)

# Fit the model
rf.fit(X, y)

# Get feature importances
importances = rf.feature_importances_

# Print feature importances
for feature, importance in zip(X.columns, importances):
    print(feature, importance)

Tree-based methods can also be used for feature extraction. In this case, we can extract new features based on the decision boundaries of the tree. For example, we can use the leaf node of a decision tree as a new binary feature that indicates whether a data point falls within that region of the feature space.
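
A minimal sketch of this idea with scikit-learn: the apply method returns the index of the leaf each sample falls into, which can then be one-hot encoded into new binary features. The dataset and tree settings below are illustrative assumptions:

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)

# Fit a shallow tree; each leaf defines a region of the feature space
tree = DecisionTreeRegressor(max_depth=3, random_state=42)
tree.fit(X, y)

# apply() returns the leaf index for every sample
leaf_indices = tree.apply(X).reshape(-1, 1)

# One-hot encode the leaf indices into new binary features
encoder = OneHotEncoder()
leaf_features = encoder.fit_transform(leaf_indices).toarray()
print(leaf_features.shape)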

6. Wrapper Methods: These are a type of feature selection method where a model is trained and evaluated on different subsets of features. The performance of the model is measured for each subset of features, and the best subset is selected based on the model’s performance.

Here’s an example of how to implement a wrapper method using Recursive Feature Elimination (RFE) with a support vector machine (SVM) classifier in scikit-learn:

from sklearn.svm import SVC
from sklearn.feature_selection import RFE
from sklearn.datasets import load_iris

# load the iris dataset
data = load_iris()
X = data.data
y = data.target

# create an SVM classifier
svm = SVC(kernel='linear')

# create a feature selector using RFE with SVM
selector = RFE(svm, n_features_to_select=2)

# fit the selector to the data
selector.fit(X, y)

# print the selected features
print(selector.support_)
print(selector.ranking_)

In this example, we first load the iris dataset and split it into features (X) and target (y). We then create an SVM classifier with a linear kernel, build a feature selector using RFE with the SVM, and fit it to the data. Finally, we print the selected features using the support_ and ranking_ attributes of the selector.

Forward Selection: Forward Selection is a wrapper method that involves iteratively adding one feature at a time to the model until the performance of the model stops improving. Here’s how it works in Python:

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Load the dataset
X, y = load_dataset()

# Initialize the feature selector
selector = SequentialFeatureSelector(LinearRegression(), n_features_to_select=5, direction='forward')

# Fit the feature selector
selector.fit(X, y)

# Print the selected features
print(selector.support_)

In the code above, we first load the dataset and then initialize the SequentialFeatureSelector object with a linear regression model and a parameter n_features_to_select that specifies the number of features we want to select. We then fit the selector on the dataset and print the selected features.

Backward Elimination: Backward Elimination is a wrapper method that involves iteratively removing one feature at a time from the model until the performance of the model stops improving. Here’s how it works in Python:

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Load the dataset
X, y = load_dataset()

# Initialize the feature selector
selector = SequentialFeatureSelector(LinearRegression(), n_features_to_select=5, direction='backward')

# Fit the feature selector
selector.fit(X, y)

# Print the selected features
print(selector.support_)

In the code above, we initialize the SequentialFeatureSelector object with a linear regression model and a parameter direction=’backward’ to perform backward elimination. We then fit the selector on the dataset and print the selected features.

Exhaustive Search: Exhaustive Search is a wrapper method that evaluates all possible subsets of features and selects the best subset based on a scoring criterion. It is only practical for small numbers of features, since the number of subsets grows exponentially. Here’s how it works in Python:

from itertools import combinations
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# Load the dataset (X is assumed to be a DataFrame)
X, y = load_dataset()

# Initialize variables
best_score = -float('inf')
best_features = None

# Loop over all possible subsets of features
for k in range(1, len(X.columns) + 1):
    for subset in combinations(X.columns, k):
        # Evaluate a linear regression model on this subset with cross-validated R2
        X_subset = X[list(subset)]
        score = cross_val_score(LinearRegression(), X_subset, y, cv=5, scoring='r2').mean()
        # Update the best subset of features
        if score > best_score:
            best_score = score
            best_features = subset

# Print the best subset of features
print(best_features)

In the code above, we first load the dataset and then loop over all possible subsets of features using the itertools.combinations function. For each subset, we evaluate a linear regression model with a cross-validated R2 score (scoring on the training data alone would always favour the full feature set). We then keep the subset with the highest score and print the selected features.

7. Embedded Methods: These involve selecting features as part of the model training process. Examples include Lasso and Ridge regression, which add a penalty term to the loss function; Lasso in particular encourages sparse feature selection by driving some coefficients exactly to zero.

Lasso regression: Lasso regression adds a penalty term proportional to the absolute value of the model coefficients (the L1 norm) rather than their square. This leads to an aggressive feature selection process, as some coefficients can be set exactly to zero. Lasso regression is particularly useful when dealing with high-dimensional data, as it can effectively reduce the number of features used in the model.

from sklearn.linear_model import Lasso
from sklearn.datasets import load_diabetes  # load_boston has been removed from recent scikit-learn versions
from sklearn.preprocessing import StandardScaler

data = load_diabetes()
X = data.data
y = data.target

# Standardize the features
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Fit the Lasso model
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

# Get the coefficients (exactly-zero entries correspond to dropped features)
coefficients = lasso.coef_

Ridge regression: Ridge regression adds a penalty term proportional to the square of the magnitude of the model coefficients (the L2 norm). This shrinks the coefficients towards zero without setting them exactly to zero, so it reduces the influence of less important features rather than removing them from the model outright.

from sklearn.linear_model import Ridge
from sklearn.datasets import load_diabetes  # load_boston has been removed from recent scikit-learn versions
from sklearn.preprocessing import StandardScaler

data = load_diabetes()
X = data.data
y = data.target

# Standardize the features
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Fit the Ridge model
ridge = Ridge(alpha=0.1)
ridge.fit(X, y)

# Get the coefficients
coefficients = ridge.coef_

In both cases, the regularization parameter alpha controls the strength of the penalty term. A higher value of alpha shrinks the coefficients more aggressively; for Lasso, this also drives more coefficients exactly to zero and therefore yields a sparser set of selected features.
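
To turn these coefficients into an actual feature subset, scikit-learn’s SelectFromModel can wrap the Lasso estimator; a minimal sketch, reusing the standardized X and y from the examples above:

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Keep only the features whose Lasso coefficients are non-zero
selector = SelectFromModel(Lasso(alpha=0.1))
selector.fit(X, y)

X_selected = selector.transform(X)
print("Selected feature mask:", selector.get_support())
print("Reduced shape:", X_selected.shape)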

Encoding Categorical Variables

Encoding categorical variables is a crucial step in feature engineering that involves converting categorical variables into a numerical form that machine learning algorithms can understand. Here are some common techniques used for encoding categorical variables:

1. One-Hot Encoding:

One-hot encoding is a technique that converts categorical variables into a set of binary features, where each feature corresponds to a unique category in the original variable. In this technique, a new binary column is created for each category, and the value is set to 1 if the category is present and 0 if not.

Here’s an example using the pandas library:

import pandas as pd

# create a sample dataframe
df = pd.DataFrame({
'color': ['red', 'blue', 'green', 'red', 'yellow', 'blue']
})

# apply one-hot encoding
one_hot_encoded = pd.get_dummies(df['color'])
print(one_hot_encoded)

2. Label Encoding:

Label encoding is a technique that assigns a unique numerical value to each category in the original variable. With scikit-learn’s LabelEncoder, the labels are assigned according to the sorted (alphabetical) order of the categories, not the order in which they appear in the data.

Here’s an example using the scikit-learn library:

from sklearn.preprocessing import LabelEncoder

# create a sample dataframe
df = pd.DataFrame({
'color': ['red', 'blue', 'green', 'red', 'yellow', 'blue']
})

# apply label encoding
label_encoder = LabelEncoder()
df['color_encoded'] = label_encoder.fit_transform(df['color'])
print(df)

Fig.3 — Encoding Data

3. Ordinal Encoding:

Ordinal encoding is a technique that assigns a numerical value to each category in the original variable based on their order or rank. In this technique, the categories are ordered based on a specific criterion, and the categories are assigned numerical values based on their position in the order.

Here’s an example using the category_encoders library:

import category_encoders as ce

# create a sample dataframe
df = pd.DataFrame({
'size': ['S', 'M', 'L', 'XL', 'M', 'S']
})

# apply ordinal encoding with an explicit mapping from size to rank
ordinal_encoder = ce.OrdinalEncoder(mapping=[{'col': 'size', 'mapping': {'S': 1, 'M': 2, 'L': 3, 'XL': 4}}])
df = ordinal_encoder.fit_transform(df)
print(df)

Scaling and Normalization

Scaling and Normalization are important steps in feature engineering to ensure that the features are on a similar scale and have similar ranges. This can help improve the performance of some machine learning algorithms and make the optimization process faster. Here are some common techniques used for scaling and normalization:

1. Standardization: Standardization scales the features so that they have zero mean and unit variance. This is done by subtracting the mean from each value and then dividing it by the standard deviation. The resulting values will have a mean of zero and a standard deviation of one.

Here is an example of standardization using scikit-learn:

from sklearn.preprocessing import StandardScaler

# Create a StandardScaler object
scaler = StandardScaler()

# Fit and transform the data
X_scaled = scaler.fit_transform(X)

2. Min-Max Scaling: Min-Max scaling scales the features to a fixed range, usually between 0 and 1. This is done by subtracting the minimum value from each value and then dividing by the range.

Here is an example of Min-Max scaling using scikit-learn:

from sklearn.preprocessing import MinMaxScaler

# Create a MinMaxScaler object
scaler = MinMaxScaler()

# Fit and transform the data
X_scaled = scaler.fit_transform(X)

https://becominghuman.ai/what-does-feature-scaling-mean-when-to-normalize-data-and-when-to-standardize-data-c3de654405ed
Fig.4 — Standardization and Normalization

3. Robust Scaling: Robust scaling is similar to standardization, but it uses the median and interquartile range instead of the mean and standard deviation. This makes it more robust to outliers in the data.

Here is an example of Robust scaling using scikit-learn:

from sklearn.preprocessing import RobustScaler

# Create a RobustScaler object
scaler = RobustScaler()

# Fit and transform the data
X_scaled = scaler.fit_transform(X)

4. Normalization: Normalization rescales each observation (row) to have unit norm; with the default L2 norm, the sum of squares of a sample’s feature values equals 1. This is useful for algorithms that compare samples by direction rather than magnitude, such as those based on cosine similarity.

Here is an example of normalization using scikit-learn:

from sklearn.preprocessing import Normalizer

# Create a Normalizer object
scaler = Normalizer()

# Fit and transform the data
X_scaled = scaler.fit_transform(X)

Creating New Features

Creating new features is an important step in feature engineering that involves creating new variables or columns from existing data. This can help to capture complex relationships between the features and improve the accuracy of the models.

Here are some techniques for creating new features:

1. Interaction features: Interaction features are created by multiplying two or more existing features together. This can help to capture the joint effects of the features and uncover new patterns in the data. For example, if we have two features, “age” and “income”, we can create a new interaction feature called “age_income” by multiplying these two features together.

Here is an example of creating an interaction feature using Pandas in Python:

import pandas as pd

# create a sample data frame
data = pd.DataFrame({'age': [25, 30, 35],
'income': [50000, 60000, 70000]})

# create a new interaction feature
data['age_income'] = data['age'] * data['income']

# display the updated data frame
print(data)

2. Polynomial features: Polynomial features are created by raising existing features to a higher power. This can help to capture non-linear relationships between the features and improve the accuracy of the models. For example, if we have a feature “age”, we can create a new polynomial feature called “age_squared” by squaring this feature.

Here is an example of creating polynomial features using Scikit-learn in Python:

from sklearn.preprocessing import PolynomialFeatures
import numpy as np

# create a sample data set
X = np.array([[1, 2],
[3, 4]])

# create polynomial features up to degree 2
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# display the updated feature matrix
print(X_poly)

3. Binning: Binning involves grouping continuous values into discrete categories. This can help to capture non-linear relationships and reduce the impact of outliers in the data. For example, if we have a feature “age”, we can create a new binned feature called “age_group” by grouping the ages into different categories such as “0–18”, “18–25”, “25–35”, “35–50”, and “50+”.

Here is an example of creating binned features using Pandas in Python:

import pandas as pd

# create a sample data frame
data = pd.DataFrame({'age': [20, 25, 30, 35, 40, 45, 50, 55]})

# create bins for different age groups
bins = [0, 18, 25, 35, 50, float('inf')]
labels = ['0-18', '18-25', '25-35', '35-50', '50+']
data['age_group'] = pd.cut(data['age'], bins=bins, labels=labels)

# display the updated data frame
print(data)

Handling Imbalanced Data

Dealing with imbalanced data is an important aspect of machine learning. Imbalanced data is a situation where the distribution of the target variable is not uniform, and one class is underrepresented compared to the other. This can lead to a bias in the model toward the majority class, and the model may perform poorly on the minority class. Some of the techniques to handle imbalanced data are:

1. Upsampling: Upsampling involves creating more samples for the minority class by resampling the existing samples with replacement. This can be done using the resample function from the sklearn.utils module.

from sklearn.utils import resample

# Upsample minority class
X_upsampled, y_upsampled = resample(X_minority, y_minority, replace=True, n_samples=len(X_majority), random_state=42)

2. Downsampling: Downsampling involves removing some samples from the majority class to balance the distribution. This can be done using the resample function from the sklearn.utils module.

from sklearn.utils import resample

# Downsample majority class
X_downsampled, y_downsampled = resample(X_majority, y_majority, replace=False, n_samples=len(X_minority), random_state=42)

https://www.analyticsvidhya.com/blog/2020/07/10-techniques-to-deal-with-class-imbalance-in-machine-learning/
Fig.4 — UnderSampling and OverSampling

3. Synthetic Minority Over-sampling Technique (SMOTE): SMOTE involves creating synthetic samples for the minority class based on the existing samples. This can be done using the SMOTE function from the imblearn.over_sampling module.

from imblearn.over_sampling import SMOTE

# Use SMOTE to upsample minority class
sm = SMOTE(random_state=42)
X_resampled, y_resampled = sm.fit_resample(X, y)

4. Class Weighting: Class weighting involves assigning a weight to each class in the model to account for the imbalance. This can be done using the class_weight parameter of the model.

from sklearn.linear_model import LogisticRegression

# Use class weighting to handle imbalance
clf = LogisticRegression(class_weight='balanced', random_state=42)
clf.fit(X_train, y_train)

5. Anomaly Detection: Anomaly detection identifies rare events or observations that deviate significantly from the expected or normal behavior, and one option is to remove them, for example with the IsolationForest class from the sklearn.ensemble module. In the case of imbalanced data, where one class has far fewer observations than the other, anomaly detection can also be used to identify and label the rare observations in the minority class as anomalies, which can help balance the dataset and improve the performance of machine learning models.

One common approach for anomaly detection in imbalanced data is to use unsupervised learning techniques such as clustering, where the minority class observations are clustered into distinct groups based on their similarities. The observations in the minority class that do not belong to any of these clusters can be labeled as anomalies.

Another approach is to use supervised learning techniques such as one-class classification, where a model is trained on the majority class data to learn the normal behavior of the data. The minority class observations that deviate significantly from the learned normal behavior are then labeled as anomalies.

from sklearn.ensemble import IsolationForest

# Use anomaly detection to handle imbalance
clf = IsolationForest(random_state=42)
clf.fit(X_train)

# keep only the samples the forest labels as inliers (+1), using a single mask for X and y
mask = clf.predict(X_train) == 1
X_train, y_train = X_train[mask], y_train[mask]
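
For the one-class classification approach described earlier, here is a minimal sketch using scikit-learn’s OneClassSVM; X_majority (majority-class samples only) and X_test are assumed names for illustration:

from sklearn.svm import OneClassSVM

# Learn the "normal" behaviour from majority-class samples only
oc_svm = OneClassSVM(kernel='rbf', nu=0.05)
oc_svm.fit(X_majority)

# +1 = consistent with the majority class, -1 = potential anomaly / minority-like observation
labels = oc_svm.predict(X_test)
print("Number of flagged anomalies:", (labels == -1).sum())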

6. Cost-Sensitive Learning: Cost-sensitive learning involves assigning a different cost to each type of error in the model to account for the imbalance. This can be done using the sample_weight parameter when fitting the model.

from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Compute a weight for every training sample, inversely proportional to its class frequency
sample_weights = compute_sample_weight(class_weight='balanced', y=y_train)

# Use cost-sensitive learning to handle imbalance
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train, sample_weight=sample_weights)

Skewness and Kurtosis Handling

Skewness and kurtosis are statistical measures that can help in understanding the distribution of data. Skewness measures the degree of asymmetry in the data, while kurtosis measures how heavy the tails of the distribution are relative to a normal distribution (often described as its peakedness or flatness).

https://corporatefinanceinstitute.com/resources/data-science/skewness/
Fig.5 — Skewness

Skewed data can negatively affect the performance of machine learning models. Therefore, it is important to handle skewness in the data. Here are some techniques to handle skewness in the data:

  1. Log transformation: Logarithmic transformation can be used to reduce the skewness of data. It is most effective for positively skewed data and requires all values to be positive.
  2. Square root transformation: The square root transformation can be used to reduce the skewness of data. It can be applied to positively skewed data.
  3. Box-Cox transformation: The Box-Cox transformation is a more general transformation method that can handle both positively and negatively skewed data, but it requires strictly positive values. It uses a parameter lambda to determine the type of transformation to be applied to the data.

Here’s some Python code to demonstrate these transformations:

import numpy as np
import pandas as pd
from scipy import stats

# Generate some skewed data
data = np.random.gamma(1, 10, 1000)

# Calculate skewness and kurtosis
skewness = stats.skew(data)
kurtosis = stats.kurtosis(data)

print("Skewness:", skewness)
print("Kurtosis:", kurtosis)

# Log transformation
log_data = np.log(data)
log_skewness = stats.skew(log_data)
log_kurtosis = stats.kurtosis(log_data)

print("Log Skewness:", log_skewness)
print("Log Kurtosis:", log_kurtosis)

# Square root transformation
sqrt_data = np.sqrt(data)
sqrt_skewness = stats.skew(sqrt_data)
sqrt_kurtosis = stats.kurtosis(sqrt_data)

print("Sqrt Skewness:", sqrt_skewness)
print("Sqrt Kurtosis:", sqrt_kurtosis)

# Box-Cox transformation
box_cox_data, _ = stats.boxcox(data)
box_cox_skewness = stats.skew(box_cox_data)
box_cox_kurtosis = stats.kurtosis(box_cox_data)

print("Box-Cox Skewness:", box_cox_skewness)
print("Box-Cox Kurtosis:", box_cox_kurtosis)

Handling kurtosis can be done by applying a transformation similar to that used for handling skewness. Some techniques for handling kurtosis include:

  1. Log transformation: Logarithmic transformation can also be used to handle kurtosis in the data.
  2. Square transformation: The square transformation can also be used to handle kurtosis in the data.
  3. Box-Cox transformation: The Box-Cox transformation can also be used to handle kurtosis in the data.

https://www.scribbr.com/statistics/kurtosis/
Fig.6 — Kurtosis

Here’s some Python code to demonstrate these transformations:

import numpy as np
import pandas as pd
from scipy import stats

# Generate some positive data with high kurtosis (the log transform requires positive values)
data = np.abs(np.random.normal(0, 5, 1000))**3

# Calculate skewness and kurtosis
skewness = stats.skew(data)
kurtosis = stats.kurtosis(data)

print("Skewness:", skewness)
print("Kurtosis:", kurtosis)

# Log transformation
log_data = np.log(data)
log_skewness = stats.skew(log_data)
log_kurtosis = stats.kurtosis(log_data)

print("Log Skewness:", log_skewness)
print("Log Kurtosis:", log_kurtosis)

Handling Rare Categories

Handling rare categories refers to the process of dealing with categories in categorical variables that occur infrequently in the data. Rare categories can cause problems in machine learning models, as they may not have enough representation in the data to be accurately modeled. Some techniques for handling rare categories are:

  1. Grouping the rare categories: This involves grouping rare categories into a single category or a few categories. This reduces the number of categories in the variable and increases the representation of the rare categories.
  2. Replacing the rare categories with a more common category: This involves replacing the rare categories with the most common category in the variable. This can be effective if the rare categories are not important for the analysis.
  3. One-hot encoding with a flag: This involves creating a new category for rare categories and flagging them as rare. This allows the model to treat rare categories differently from other categories.

Here’s an example of how to handle rare categories using the Titanic dataset:

import pandas as pd
import numpy as np

# load Titanic dataset
titanic = pd.read_csv('titanic.csv')

# view value counts of the 'Embarked' column ('S' is the most common port, 'Q' the least)
counts = titanic['Embarked'].value_counts()
print(counts)

# define "rare" categories, for example those appearing in fewer than 100 rows
rare = counts[counts < 100].index

# 1. group rare categories into a single 'Rare' category
titanic['Embarked_grouped'] = np.where(titanic['Embarked'].isin(rare), 'Rare', titanic['Embarked'])
print(titanic['Embarked_grouped'].value_counts())

# 2. replace rare categories with the most common category
titanic['Embarked_replaced'] = np.where(titanic['Embarked'].isin(rare), counts.idxmax(), titanic['Embarked'])
print(titanic['Embarked_replaced'].value_counts())

# 3. create a binary flag that marks rare categories
titanic['Embarked_is_rare'] = titanic['Embarked'].isin(rare).astype(int)

Handling Time Series Data

Handling time-series data involves several techniques such as data preprocessing, feature extraction, and modeling. Let’s take a look at some of the techniques and how they can be implemented using Python.

Fig.7 — Time Series Data

1. Data Preprocessing: Time-series data often contain missing values, outliers, and noise that can affect the performance of the model. Therefore, it is essential to preprocess the data before training. Common preprocessing techniques include imputation, outlier handling, and scaling; a small outlier-handling sketch follows below.
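
For instance, outliers in a series can be clipped to an interquartile-range band before modeling; a minimal sketch, assuming the values live in a pandas Series (the column name 'value' is an assumption):

import pandas as pd

def clip_outliers(series: pd.Series) -> pd.Series:
    """Clip values that fall outside 1.5 * IQR of the first and third quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# usage on an assumed 'value' column of a time-series data frame
# data['value'] = clip_outliers(data['value'])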

2. Feature Extraction: Feature extraction involves extracting relevant information from the time-series data that can be used for modeling. Some popular feature extraction techniques include Rolling Statistics, Fourier Transform, and Wavelet Transform.

3. Modeling: Once the data has been preprocessed and features extracted, it can be used for modeling. Some popular models for time-series data include ARIMA: Autoregressive Integrated Moving Average (ARIMA), LSTM: Long Short-Term Memory (LSTM), and Prophet.

Let’s take a look at an example of how some of these techniques can be implemented in Python:

# Import libraries
import numpy as np
import pandas as pd
import pywt
from sklearn.preprocessing import StandardScaler
from statsmodels.tsa.arima.model import ARIMA
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout

# Load time-series data (assumes a 'date' column and a single value column)
data = pd.read_csv('time_series_data.csv', parse_dates=['date'])

# Preprocess data: forward-fill missing values, restrict the date range, set the index
data = data.ffill()
data = data[(data['date'] > '2020-01-01') & (data['date'] < '2021-12-31')]
data.set_index('date', inplace=True)

# Scale the series (keep it as a DataFrame so pandas methods such as rolling() still work)
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(data), index=data.index, columns=data.columns)

# Extract features: rolling statistics, Fourier transform, wavelet transform
rolling_mean = scaled.rolling(window=7).mean()
fft = np.fft.fft(scaled.values.ravel())
wavelet_approx, wavelet_detail = pywt.dwt(scaled.values.ravel(), 'db1')

# Train an ARIMA model and forecast the next 30 periods beyond the training window
model = ARIMA(scaled, order=(1, 1, 1))
model_fit = model.fit()
predictions = model_fit.forecast(steps=30)

# Build supervised samples for the LSTM: 7 past steps -> next step
values = scaled.values
X_train, y_train = [], []
for i in range(7, len(values)):
    X_train.append(values[i-7:i, 0])
    y_train.append(values[i, 0])
X_train, y_train = np.array(X_train), np.array(y_train)

X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))

# Define the LSTM model
model = Sequential()
model.add(LSTM(units=50, return_sequences=True, input_shape=(X_train.shape[1], 1)))
model.add(Dropout(0.2))
model.add(LSTM(units=50, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(units=50))
model.add(Dropout(0.2))
model.add(Dense(units=1))

# Compile the model
model.compile(optimizer='adam', loss='mean_squared_error')

# Fit the model to the training data
model.fit(X_train, y_train, epochs=100, batch_size=32)
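
The snippet above covers ARIMA and LSTM. Prophet, the third model mentioned, follows a similar fit-then-forecast pattern; a minimal sketch, assuming the prophet package (v1.0+) is installed and that the raw data frame's value column is named 'value':

from prophet import Prophet

# Prophet expects a data frame with columns 'ds' (datestamp) and 'y' (value)
prophet_df = data.reset_index().rename(columns={'date': 'ds', 'value': 'y'})

m = Prophet()
m.fit(prophet_df)

# Forecast 30 periods beyond the end of the training data
future = m.make_future_dataframe(periods=30)
forecast = m.predict(future)
print(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail())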

Text Pre-Processing

Text preprocessing is a crucial step in feature engineering when dealing with text data. The goal is to convert the raw text into a numerical representation that can be used for machine learning models. Here are some common text preprocessing techniques in Python:

  1. Tokenization: This involves breaking up a sentence or document into individual words or phrases. The NLTK library provides various tokenizers, such as word tokenizer and sentence tokenizer.

from nltk.tokenize import word_tokenize, sent_tokenize

text = "This is a sample sentence. It contains some words."
words = word_tokenize(text)
sentences = sent_tokenize(text)

print(words)
# Output: ['This', 'is', 'a', 'sample', 'sentence', '.', 'It', 'contains', 'some', 'words', '.']

print(sentences)
# Output: ['This is a sample sentence.', 'It contains some words.']

2. Stop word removal: Stop words are commonly occurring words that do not add any meaning to the text, such as “a”, “the”, “and”, etc. Removing stop words can improve the efficiency of text processing and reduce the size of the data. The NLTK library provides a list of stop words for various languages.

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.casefold() not in stop_words]

print(filtered_words)
# Output: ['sample', 'sentence', '.', 'contains', 'words', '.']

Fig.8 — Text Processing

3. Stemming and Lemmatization: Stemming and lemmatization are techniques used to reduce words to their base or root form. For example, “running”, “runner”, and “runs” can be reduced to the root word “run”. The NLTK library provides various stemmers and lemmatizers.

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

stemmed_words = [stemmer.stem(word) for word in filtered_words]
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]

print(stemmed_words)
# Output: ['sampl', 'sentenc', '.', 'contain', 'word', '.']

print(lemmatized_words)
# Output: ['sample', 'sentence', '.', 'contains', 'word', '.']

4. Text normalization: Text normalization involves converting text to a standardized form, such as converting all text to lowercase, removing punctuation, and replacing abbreviations and contractions with their full forms.

import re

def normalize_text(text):
    text = text.lower()
    # expand (a few) contractions before stripping punctuation, otherwise the apostrophes are already gone
    text = re.sub(r"\b(can't|won't|shouldn't)\b", 'not', text)
    text = re.sub(r"\b(i'm|you're|he's|she's|it's|we're|they're)\b", 'be', text)
    # remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    return text

text = "I can't believe it's not butter!"
normalized_text = normalize_text(text)

print(normalized_text)
# Output: 'i not believe be not butter'
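
Putting the pieces together, the four steps above can be combined into a single preprocessing function; a minimal sketch that reuses the NLTK tools and the normalize_text function defined in this section:

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def preprocess(text):
    """Normalize, tokenize, remove stop words, and lemmatize a raw string."""
    text = normalize_text(text)                          # normalization step from above
    tokens = word_tokenize(text)                         # tokenization
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop_words]  # stop word removal
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in tokens]     # lemmatization

print(preprocess("The runners were running quickly through the fields!"))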

Conclusion

In conclusion, feature engineering is a critical step in the machine learning process that involves transforming raw data into a format that can be effectively used by machine learning algorithms. In this blog post, we have covered various techniques for feature engineering, including feature selection and extraction, encoding categorical variables, scaling and normalization, creation of new features, handling imbalanced data, handling skewness and kurtosis, handling rare categories, handling time-series data, feature transformation, and text preprocessing.

Here are the key takeaways from this post:

  1. Feature selection and extraction can be done using statistical methods such as PCA, LDA, and correlation analysis, as well as machine learning methods such as tree-based methods, wrapper methods, and embedded methods.
  2. Encoding categorical variables can be done using techniques such as one-hot encoding, label encoding, and ordinal encoding.
  3. Scaling and normalization can be done using techniques such as min-max scaling, standard scaling, and robust scaling.
  4. Text preprocessing involves techniques such as tokenization, stopword removal, stemming, and lemmatization.

If you like the article and would like to support me make sure to:

👏 Clap for the story (100 Claps) and follow me 👉🏻Simranjeet Singh

📑 View more content on my Medium Profile

🔔 Follow Me: LinkedIn | Medium | GitHub | Twitter | Telegram

🚀 Help me reach a wider audience by sharing my content with your friends and colleagues.

🎓 Want to start a career in Data Science and Artificial Intelligence but do not know how? I offer data science and AI mentoring sessions and long-term career guidance.

📅 Consultation or Career Guidance

📅 1:1 Mentorship — About Python, Data Science, and Machine Learning

Book your Appointment
