Data Pre-processing Tasks Using Python with Data Reduction Techniques

Ashish Trada
4 min read · Oct 28, 2021

Data Reduction:
Data mining is a technique for handling huge amounts of data, and analysis becomes harder as the volume of data grows. To deal with this, we use data reduction techniques, which aim to increase storage efficiency and reduce data storage and analysis costs.

Dimensionality Reduction:
This reduces the size of the data through encoding mechanisms, which can be lossy or lossless. If the original data can be retrieved after reconstruction from the compressed data, the reduction is called lossless; otherwise it is lossy. Two effective methods of dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis).

Principal Component Analysis (PCA)

Principal component analysis (PCA) is a technique for reducing the dimensionality of large datasets, increasing interpretability while minimizing information loss. For many machine learning applications it helps to be able to visualize your data, and visualizing 2- or 3-dimensional data is not that challenging. You can use PCA to reduce 4-dimensional data to 2 or 3 dimensions so that you can plot it and, hopefully, understand the data better.

The amount of detail in datasets is growing rapidly. This can cause problems, because the most important features of a dataset may be buried in useless data. That is why data pre-processing has become crucial for any dataset. Here, we discuss different data reduction methods that remove unnecessary data from a dataset and make it more efficient for our model to run on.

The scikit-learn documentation lists several feature selection methods. Here, we will apply different feature selection methods to the same dataset to compare their performance.

Dataset Used

The dataset used for carrying out data reduction is 'Iris', available via the sklearn.datasets module.

Load Dataset
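
A minimal sketch of this step, loading Iris straight from scikit-learn (the variable names are illustrative):

```python
from sklearn.datasets import load_iris

# Load the Iris dataset bundled with scikit-learn
iris = load_iris()
X, y = iris.data, iris.target

print(X.shape)            # (150, 4): 150 samples, 4 features
print(iris.feature_names)
```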

The data has four features. To test the effectiveness of the different feature selection methods, we add some noise features to the dataset.

add noise
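
A sketch of one way to do this; the noise distribution and scale here are assumptions:

```python
import numpy as np

# Append 4 uniform-random noise columns to the 4 informative features
# (X comes from the loading step above)
rng = np.random.RandomState(0)
noise = rng.uniform(0, 3, size=(X.shape[0], 4))
X = np.hstack([X, noise])

print(X.shape)  # (150, 8)
```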

The dataset now has 8 features: 4 are informative and the other 4 are noise.

Principal Component Analysis (PCA)

Import Library
DataFrame after using standard scaler
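
A sketch of the imports and the scaling step. PCA is sensitive to feature scale, so the features are standardized first; the DataFrame column names are assumptions:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Standardize every feature to zero mean and unit variance
# (X is the 8-feature matrix built in the noise step above)
X_scaled = StandardScaler().fit_transform(X)

cols = ['sepal length', 'sepal width', 'petal length', 'petal width',
        'noise1', 'noise2', 'noise3', 'noise4']
df_scaled = pd.DataFrame(X_scaled, columns=cols)
print(df_scaled.head())
```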

PCA Projection to 2D

The original data has 4 informative columns (sepal length, sepal width, petal length, and petal width). In this section, the code projects the scaled data into 2 dimensions. The new components are simply the two main directions of variation.

PCA for 2D projection
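
A sketch of the projection using scikit-learn's PCA (DataFrame and column names are illustrative):

```python
import pandas as pd
from sklearn.decomposition import PCA

# Project the standardized data onto the 2 directions of greatest variance
pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)

principal_df = pd.DataFrame(
    components,
    columns=['principal component 1', 'principal component 2'])

# Fraction of the total variance captured by each component
print(pca.explained_variance_ratio_)
```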

Concatenating the DataFrames along axis=1, resultant_Df is the final DataFrame before plotting the data.

Concatenating target column into dataframe
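
A sketch of that concatenation, reusing the resultant_Df name from the text:

```python
import pandas as pd

# Attach the target column so points can be coloured by class when plotting
# (principal_df and y come from the steps above)
resultant_Df = pd.concat([principal_df, pd.Series(y, name='target')],
                         axis=1)
print(resultant_Df.head())
```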

Now, let's visualize the DataFrame:
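
A plotting sketch with matplotlib; the colours and figure size are assumptions:

```python
import matplotlib.pyplot as plt

# Scatter plot of the two principal components, coloured by class
# (resultant_Df and iris come from the steps above)
fig, ax = plt.subplots(figsize=(8, 6))
for label, colour in zip((0, 1, 2), ('r', 'g', 'b')):
    subset = resultant_Df[resultant_Df['target'] == label]
    ax.scatter(subset['principal component 1'],
               subset['principal component 2'],
               c=colour, s=50, label=iris.target_names[label])
ax.set_xlabel('principal component 1')
ax.set_ylabel('principal component 2')
ax.set_title('2-component PCA')
ax.legend()
plt.show()
```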

2D representation of dataframe

Variance Threshold

Variance Threshold is a simple baseline approach to feature selection. It removes all features whose variance doesn't meet some threshold. By default, it removes all zero-variance features. Our dataset has no zero-variance features, so the data isn't affected here.

Variance Threshold
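
A sketch using scikit-learn's VarianceThreshold with its default threshold, applied to the unscaled features:

```python
from sklearn.feature_selection import VarianceThreshold

# Drop features whose variance is below the threshold; the default
# threshold of 0.0 removes only constant (zero-variance) features
selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X)

print(X_reduced.shape)      # unchanged here: no zero-variance features
print(selector.variances_)  # variance of each feature
```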

Univariate Feature Selection

Univariate feature selection works by selecting the best features based on univariate statistical tests. We compare each feature to the target variable to see whether there is a statistically significant relationship between them; a common choice of test is analysis of variance (ANOVA). When we analyze the relationship between one feature and the target variable, we ignore the other features, which is why the method is called 'univariate'. Each feature gets its own test score.
Finally, all the test scores are compared, and the features with the top scores are selected.
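
A sketch using SelectKBest with the ANOVA F-test; keeping k=4 features is an assumption matching the number of informative features:

```python
from sklearn.feature_selection import SelectKBest, f_classif

# Score each feature independently against the target with the ANOVA F-test,
# then keep the k highest-scoring features
selector = SelectKBest(score_func=f_classif, k=4)
X_best = selector.fit_transform(X, y)

print(selector.scores_)        # per-feature F-scores
print(selector.get_support())  # True for the selected features
```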

Recursive Feature Elimination

Recursive feature elimination (RFE) is a feature selection method that fits a model and removes the weakest feature (or features) until the specified number of features is reached. RFE requires the number of features to keep to be specified in advance; however, it is often not known beforehand how many features are relevant.

RFE using Random Forest Classifier
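
A sketch of RFE with a random forest estimator; keeping 4 features is again an assumption:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Fit a random forest, drop the weakest feature, and repeat
# until only n_features_to_select features remain
rfe = RFE(estimator=RandomForestClassifier(random_state=0),
          n_features_to_select=4)
rfe.fit(X, y)

print(rfe.support_)  # True for kept features, False for eliminated ones
print(rfe.ranking_)  # selected features get rank 1
```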

Here, only the original columns are marked True, while all the extra noise columns we added are marked False.

In this blog, we have seen how to use different feature selection methods on the same data and evaluated their performance.
