Data Preprocessing Using scikit-learn👨‍💻

Ashish Trada
6 min read · Oct 7, 2021

There are a lot of preprocessing methods but we will mainly focus on the following methodologies:

(1) Encoding the Data

(2) Normalization

(3) Standardization

(4) Imputing the Missing Values

(5) Discretization

Dataset information

For this tutorial, we are using the ‘New York City Airbnb Open Data’ dataset. To download the dataset, click here.


This dataset contains numeric as well as categorical data, has columns on different scales, and contains missing values. So it is a perfect dataset for practicing preprocessing.
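For readers who don't have the CSV handy, here is a hypothetical miniature frame with the same mix of types (the column names follow the Airbnb dataset; the values are made up):

```python
# A tiny, made-up frame mimicking the Airbnb data's mix of
# numeric columns, categorical columns, and missing values.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Manhattan", "Bronx"],  # categorical
    "price": [149, 225, 89],                                    # numeric, larger scale
    "reviews_per_month": [0.21, np.nan, 4.64],                  # numeric, contains NaN
})

print(df.dtypes)
print(df.isna().sum())  # reviews_per_month has one missing value
```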

Encoding

Encoding means converting information into another format.

Encoding is needed whenever we have categorical values. Encoding assigns one unique number to each entity. Most of the time categorical values come as labels (e.g., spam/ham, yes/no, fake/true), so the computer will not treat them as features, because the computer works with numbers. We have to assign a numerical value to each category, and that process is called ‘encoding’.

There are two types of encoders we will discuss here.

(1) Label Encoder

In machine learning, a column in the dataset often has more than one category. To convert those categories into numerical features, we can use a label encoder. The label encoder assigns a unique number to each category.


As you can see, the ‘neighbourhood_group’ column has five categories: (1) Bronx, (2) Brooklyn, (3) Manhattan, (4) Queens, (5) Staten Island. After using the label encoder, we convert them into 0, 1, 2, 3, and 4.

The classes_ attribute helps us map each numerical code back to its label category (e.g., index 0: Bronx).
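A minimal sketch of label-encoding that column with scikit-learn (the values here are a made-up sample, not the real dataset):

```python
from sklearn.preprocessing import LabelEncoder

groups = ["Brooklyn", "Manhattan", "Bronx", "Queens", "Staten Island", "Brooklyn"]

le = LabelEncoder()
encoded = le.fit_transform(groups)

# classes_ lists the categories in sorted order; each index is the assigned code
print(list(le.classes_))  # ['Bronx', 'Brooklyn', 'Manhattan', 'Queens', 'Staten Island']
print(list(encoded))      # [1, 2, 0, 3, 4, 1]
```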

(2) One Hot Encoder

The one-hot encoder does the same thing but in a different way. The label encoder assigns each category a number, while the one-hot encoder creates a whole new column for each category. So if you have 3 categories in a column, the one-hot encoder will add 3 more columns to your dataset.


We can also confirm this by comparing the values of any row in both frames (the original and the new transformed_data).

Which encoder to use depends on the dataset and its behavior. One-hot encoding increases dimensionality, but it is usually the better choice: with label encoding, the model may treat the numerical codes as ordered and compare them with each other, leading to wrong assumptions. That is why one-hot encoding is used more in the real world.
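A minimal one-hot sketch on the same made-up sample (fit_transform returns a sparse matrix by default, so we call .toarray() to inspect it):

```python
from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects 2-D input: one row per sample, one column per feature
groups = [["Brooklyn"], ["Manhattan"], ["Bronx"]]

ohe = OneHotEncoder()
onehot = ohe.fit_transform(groups).toarray()  # dense view of the sparse result

print(ohe.categories_)  # columns are created in sorted category order
print(onehot)
# [[0. 1. 0.]   Brooklyn
#  [0. 0. 1.]   Manhattan
#  [1. 0. 0.]]  Bronx
```

Each row has exactly one 1, in the column of its category.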

Normalization

In the real world, data is rarely available on one scale; data columns almost always have different scales. To bring all the columns onto one scale, we can use normalization methods. Normalization converts the whole dataset to one scale. With the help of normalization, we can increase computational speed, and we can also detect outliers more easily.

MinMaxScaler is one type of normalizer; it normalizes the data using the minimum and maximum values of each feature column.

This technique re-scales each feature to a distribution of values between 0 and 1. For every feature, the minimum value of that feature gets transformed into 0, and the maximum value gets transformed into 1. The general equation is:

x_scaled = (x − x_min) / (x_max − x_min)
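A quick sketch of the equation in action with MinMaxScaler, on made-up price-like values:

```python
from sklearn.preprocessing import MinMaxScaler

# Made-up values on an arbitrary scale; min = 50, max = 250
prices = [[50.0], [100.0], [150.0], [250.0]]

scaler = MinMaxScaler()
scaled = scaler.fit_transform(prices)

# (x - 50) / (250 - 50): min -> 0.0, max -> 1.0
print(scaled.ravel())  # [0.   0.25 0.5  1.  ]
```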

Standardization

Standardization is another type of scaling; it ensures that the transformed data has mean = 0 and standard deviation = 1.

This is very useful for optimization algorithms like gradient descent: the values are centered near zero, which helps increase computational speed. Rescaling is also important for distance-based algorithms like KNN.

The value StandardScaler produces is known as the z-score, z = (x − μ) / σ, which is also used for outlier detection.
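A minimal StandardScaler sketch on made-up values, confirming the mean-0 / std-1 property:

```python
from sklearn.preprocessing import StandardScaler

values = [[2.0], [4.0], [6.0], [8.0]]  # mean = 5, population std = sqrt(5)

scaler = StandardScaler()
z = scaler.fit_transform(values)

# The transformed column has mean 0 and standard deviation 1
print(z.ravel())
print(z.mean(), z.std())
```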

Imputing Missing Values

Handling missing values is an important task that every data scientist has to do. We can handle missing values in two ways.

(1) Remove the data (whole rows) that have missing values.

(2) Fill in the values using some strategy, with an imputer.

We can remove rows with missing values when the ratio of missing values to the total number of values is low. In that situation, we can drop them using dropna() in pandas.

If the ratio is high, we have to impute the values.

Thankfully, scikit-learn gives us the SimpleImputer class, which helps us fill in missing values. It replaces NaN values with a specified placeholder.

The reviews_per_month column has 10,052 NaN values.


As you can see, we use strategy=‘mean’, which means all the missing values will be filled with the mean of that particular column.

We can also use ‘median’, ‘most_frequent’, or ‘constant’ as the strategy.
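A minimal SimpleImputer sketch, using a tiny made-up stand-in for the reviews_per_month column:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up column with one NaN
reviews = np.array([[1.0], [3.0], [np.nan], [5.0]])

imputer = SimpleImputer(strategy="mean")
filled = imputer.fit_transform(reviews)

# NaN replaced by the column mean: (1 + 3 + 5) / 3 = 3.0
print(filled.ravel())  # [1. 3. 3. 5.]
```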

Discretization

Discretization is the process of putting values into buckets so that there are a limited number of possible states. In simple terms, it converts continuous numerical features into categorical columns.

When a variable has a wide range of possible values and it is difficult to classify the data, we group the continuous values into buckets. This feature-conversion methodology is called discretization.

There are 3 types of discretization transforms available in scikit-learn:

(1) Quantile Discretization Transform

(2) Uniform Discretization Transform

(3) KMeans Discretization Transform

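All three transforms are available through scikit-learn’s KBinsDiscretizer via the strategy parameter (‘quantile’, ‘uniform’, ‘kmeans’). A minimal sketch with made-up prices, using the uniform strategy:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Made-up prices with one large outlier
prices = np.array([[10.0], [20.0], [30.0], [40.0], [200.0]])

# 3 equal-width bins over [10, 200]; encode='ordinal' returns the bin index
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
bins = disc.fit_transform(prices)

print(bins.ravel())  # the outlier at 200 lands alone in the last bin
```

Swapping strategy="quantile" would instead put roughly equal numbers of samples in each bin.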

Conclusion

There is a lot more to data preprocessing; I discussed some of the common methods. You can learn more about it in the scikit-learn documentation.
