How to create a custom data transformer using sklearn?

2022-05-21 14:52:34 By : Ms. Diana Teng

Sklearn is a Python-based machine learning package that provides a collection of data transformations for modifying data as needed. Many simple data-cleaning steps, such as deleting columns, are often performed manually on the data, which means writing custom code. Sklearn provides a mechanism to standardise these unique data transformations so that they can be used like any other transform, either directly on the data or as part of a modelling pipeline. In this article, you will learn how to create and apply custom data transformations in sklearn.

Let’s start by understanding the custom data transformer.

Sklearn directly provides many data preparation strategies, such as scaling numerical input variables and modifying variable probability distributions. Data preparation is the process of modifying raw data to make it fit for machine learning algorithms.

When assessing model performance with data sampling approaches such as k-fold cross-validation, these transforms can be fitted and applied to a dataset without leaking data.

While the data preparation techniques provided by sklearn are comprehensive, additional data preparation steps may still be necessary. These steps are often performed manually before modelling and require the creation of bespoke code. The danger is that such data preparation stages will be carried out inconsistently.

The approach is to use the FunctionTransformer class to construct a custom data transform in sklearn. This class lets the user define a function that will be invoked to change the data. The user defines the function and can make any valid alteration, such as modifying values or eliminating columns (though not removing rows). The class can then be used in sklearn like any other data transform, for example, to convert data directly or within a modelling pipeline.
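As a minimal sketch (the log function here is illustrative, not taken from the article), a custom function can be wrapped in FunctionTransformer and then used like any built-in transform:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Illustrative custom function: log-transform all values.
def log_transform(X):
    return np.log1p(X)

# Wrap the function so sklearn can call it like any other transform.
transformer = FunctionTransformer(log_transform)

X = np.array([[1.0, 10.0], [100.0, 1000.0]])
X_new = transformer.fit_transform(X)  # applies log1p elementwise
```

Because FunctionTransformer is stateless by default, fit learns nothing here; it simply makes the function usable inside a pipeline.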

Before moving on to creating a Custom Transformer, here are a couple of things worth being familiar with:


To develop a custom transformer, we simply need to satisfy a few fundamental requirements: the class must expose fit and transform methods so that sklearn can call it like any built-in transform.

Create a basic custom transformer

A custom class is built in which fit and transform functions are defined. Both of these functions are necessary for a pipeline to work smoothly: the pipeline performs all the operations defined in the class and returns the transformed data frame.
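A minimal sketch of such a class, assuming a column-dropping operation with illustrative column names (the article's own class is not reproduced here):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

# Illustrative custom transformer that drops one named column.
class DropColumnTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        # Nothing to learn, but fit must still return self for pipelines.
        return self

    def transform(self, X):
        return X.drop(columns=[self.column])

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
out = DropColumnTransformer(column="b").fit_transform(df)
```

Inheriting from TransformerMixin gives the class fit_transform for free, and BaseEstimator supplies get_params/set_params so it behaves like any sklearn estimator.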

Let’s build a custom transformer, apply it to the data frame, and predict some values, starting with importing the necessary libraries.
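The exact import list is not shown here; a plausible set for the steps that follow (plotting libraries omitted) might be:

```python
# Assumed imports for this walkthrough.
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
```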

Reading, preprocessing and data analysis:

This article uses a dataset from the insurance sector, in which the cost of insurance is predicted from various features and observations.

The plot shows the distribution of insurance charges with respect to customers’ body mass index (BMI), categorised by age.

The data is split into training and validation sets in the standard 70:30 ratio.
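A sketch of the split (the column names stand in for the insurance dataset, which is not reproduced here):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in frame; the real dataset has features such as age and bmi
# and a "charges" target (assumed here).
df = pd.DataFrame({"age": range(10), "bmi": range(10), "charges": range(10)})
X, y = df.drop(columns=["charges"]), df["charges"]

# 70:30 train/validation split, as described in the article.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```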

Next, create a custom transformer and use it to transform the training data for the learner.

A pipeline is employed because applying all of these stages sequentially with individual code blocks would be cumbersome and error-prone. Because a pipeline keeps the sequence in a single block of code, the pipeline itself becomes an estimator, capable of completing all operations in a single statement.

Compared to the basic transformer above, this class adds something extra: the user can specify the names of the features on which the operations should be performed, by passing them as arguments to the transformer.
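A sketch of such a parameterised transformer, assuming a log operation and illustrative feature names:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

# Transformer that log-transforms only the features named in its
# constructor; the operation and column names are illustrative.
class LogFeatureTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, features):
        self.features = features

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        for col in self.features:
            X[col] = np.log1p(X[col])
        return X

df = pd.DataFrame({"bmi": [20.0, 30.0], "age": [25, 40]})
out = LogFeatureTransformer(features=["bmi"]).fit_transform(df)
```

Only the listed columns are altered; the rest of the frame passes through untouched.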

The linear regression model is built using the custom transformer. The transformer converts the values to logarithms so that the learner is less biased toward larger values, a bias that is common in linear regression models.
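A sketch of the modelling step on synthetic stand-in data (the real insurance data is not reproduced; FunctionTransformer stands in for the custom log transformer):

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the insurance features and charges.
rng = np.random.default_rng(0)
X = pd.DataFrame({"age": rng.uniform(18, 65, 100),
                  "bmi": rng.uniform(18, 40, 100)})
y = 100 * X["age"] + 50 * X["bmi"] + rng.normal(0, 10, 100)

# Log-transform the features, then fit the linear model, in one pipeline.
model = Pipeline([
    ("log", FunctionTransformer(np.log1p)),
    ("lr", LinearRegression()),
])
model.fit(X, y)
preds = model.predict(X)
```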

The regression plot between the observed and predicted insurance charges shows that the regression line explains the relationship adequately.

So far, we have built a custom transformer and used it to predict values. But what if we want to customise an existing transformer offered by sklearn? Let’s customise the ordinal encoder and apply it to the data used above.

from sklearn.preprocessing import OrdinalEncoder

By using the super() method, any predefined transformer can be customised as needed.
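One way to sketch this (the subclass, its behaviour, and the column names are assumptions, not the article's exact code): subclass OrdinalEncoder and delegate to the parent via super() so that only chosen columns are encoded and the result stays a DataFrame.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical customisation: encode only the named columns and keep
# the remaining columns intact.
class SelectiveOrdinalEncoder(OrdinalEncoder):
    def __init__(self, columns):
        super().__init__()
        self.columns = columns

    def fit(self, X, y=None):
        super().fit(X[self.columns])  # learn categories on the subset
        return self

    def transform(self, X):
        X = X.copy()
        X[self.columns] = super().transform(X[self.columns])
        return X

df = pd.DataFrame({"sex": ["male", "female"], "bmi": [22.0, 30.0]})
out = SelectiveOrdinalEncoder(columns=["sex"]).fit(df).transform(df)
```

Categories are ordered alphabetically by default, so "female" maps to 0 and "male" to 1, while the numeric bmi column passes through unchanged.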

Custom transformers provide a high degree of freedom and control for data preprocessing. In this article they were particularly useful for encapsulating a phase of the data-processing workflow, making the code much more understandable. More such custom transformers can be constructed as needed with scikit-learn; give it a shot, it will be worthwhile.


© Analytics India Magazine Pvt Ltd 2022