
In part 1 of the series we looked at various methods of normalising data, including min-max and Box-Cox transformations. In this part we look at the following:

- Value Mapping
- Discretization
- Equal Width Discretization
- Equal Frequency Discretization
- Aggregation

Value Mapping

Sometimes a data set contains variables that are textual in character but signify an order. For example, a column may have three distinct values: Low, Medium and High. These can be mapped numerically to 0, 1 and 2. However, care must be exercised when choosing the values, as they must reflect the degree of change in mathematical terms. Who is to say that the right values are not 0, 5 and 6, for instance?
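As a minimal sketch of such a mapping in base R, a named vector can act as the lookup table. The sample values and the variable names (`level`, `evenly_spaced`, `unevenly_spaced`) are hypothetical and only illustrate that the choice of numbers is a modelling decision.

```r
# Hypothetical ordered text values from a column
level <- c("Low", "High", "Medium", "Low")

# Two candidate lookup tables; which one is appropriate depends on the domain
evenly_spaced   <- c(Low = 0, Medium = 1, High = 2)
unevenly_spaced <- c(Low = 0, Medium = 5, High = 6)

# Indexing a named vector by the text values performs the mapping
mapped_even   <- unname(evenly_spaced[level])
mapped_uneven <- unname(unevenly_spaced[level])

print(mapped_even)
print(mapped_uneven)
```

Both mappings preserve the order Low < Medium < High; they differ only in how far apart the levels are assumed to be.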

Another very frequent example of value mapping arises when we need to map categorical values into separate columns. This is often required when preparing data for deep learning. It is termed one-hot encoding, signifying that for each row only one of the resulting boolean columns is "hot" (set to 1).

Consider the dataset below, followed by its one-hot encoded form:

Category | Article | Quantity |
---|---|---|
Electronics | Mobile Phone | 100 |
Electronics | Tablet | 100 |
Electronics | Laptop | 60 |
Furniture | Table | 25 |
Furniture | Chair | 100 |

Electronics | Furniture | Article | Quantity |
---|---|---|---|
1 | 0 | Mobile Phone | 100 |
1 | 0 | Tablet | 100 |
1 | 0 | Laptop | 60 |
0 | 1 | Table | 25 |
0 | 1 | Chair | 100 |

As can be observed, this makes the data set quite sparse if the category column has many distinct values.
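The encoding can be sketched in base R by building one 0/1 indicator column per distinct category value. The data frame name `sales` and the helper variable names are illustrative assumptions, and the quantities follow the original (unencoded) table.

```r
# The example data set from the first table
sales <- data.frame(
  Category = c("Electronics", "Electronics", "Electronics", "Furniture", "Furniture"),
  Article  = c("Mobile Phone", "Tablet", "Laptop", "Table", "Chair"),
  Quantity = c(100, 100, 60, 25, 100)
)

# One indicator column per distinct category: 1 where the row matches, else 0
indicators <- sapply(unique(sales$Category), function(value) {
  as.integer(sales$Category == value)
})

# Replace the Category column with its indicator columns
encoded <- cbind(as.data.frame(indicators), sales[, c("Article", "Quantity")])
print(encoded)
```

With many distinct categories this produces many mostly-zero columns, which is the sparsity mentioned above.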

Discretization

Discretization (also referred to as binning) is the process of converting continuous variables (or nominal variables) into their discrete counterparts. Intuitively it may appear that discretization leads to a loss of information; however, in certain circumstances the process is quite valuable. For example, the risk profile of a customer, instead of being represented as any value from 0 to 100, may be categorised as Very Low, Low, Medium, High or Very High. In particular, if there is suspicion about the accuracy of the continuous variable, discretization may be a desirable normalisation step.

The mathematical value of discretization arises because each individual value in the original dataset may occur very infrequently, which leads to poor modelling and correlation. A discretization of a different nature could be applied to export data, for instance. The export data may have millions of companies, each exporting a handful of materials. It may be valuable to group the companies into industries, as a summarised view at the industry level may lend itself to much better analysis.
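A rough sketch of the export example in base R is shown here. The company names, industries and export values are entirely made up for illustration, and `aggregate()` is just one of several ways to summarise by group.

```r
# Hypothetical export records: one row per company
exports <- data.frame(
  company  = c("Company A", "Company B", "Company C", "Company D"),
  industry = c("Textiles", "Textiles", "Steel", "Steel"),
  value    = c(10, 15, 40, 5)
)

# Summarise the export value at the industry level
by_industry <- aggregate(value ~ industry, data = exports, FUN = sum)
print(by_industry)
```

Four sparse company-level rows collapse into two industry-level rows that are easier to compare.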

While discretization may appear to be simply a process of grouping together like values in a dataset, there are certain decisions that require consideration. How many intervals to choose is one of them. Two different approaches are commonly used, and both are explained with the dataset below.

Student | Math | Physics | Chemistry | English | Biology | Economics | History | Civics |
---|---|---|---|---|---|---|---|---|
John | 55 | 45 | 56 | 87 | 21 | 52 | 89 | 65 |
Suresh | 75 | 55 | 0 | 64 | 90 | 61 | 58 | 2 |
Ramesh | 25 | 54 | 89 | 76 | 95 | 87 | 56 | 74 |
Jessica | 78 | 55 | 86 | 63 | 54 | 89 | 75 | 45 |
Jennifer | 58 | 96 | 78 | 46 | 96 | 77 | 83 | 53 |

Equal Width Discretization

The algorithm first finds the minimum and maximum values and then splits the range into intervals of equal width, based on the number of intervals required.

So let's say we want 5 intervals and the marks range from 0 to 100. In this case the bins would be 0-20, 21-40, 41-60, 61-80 and 81-100. After equal width discretization, the table of counts per bin would look as below:

Marks | Math | Physics | Chemistry | English | Biology | Economics | History | Civics |
---|---|---|---|---|---|---|---|---|
0-20 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
21-40 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
41-60 | 2 | 4 | 1 | 1 | 1 | 1 | 2 | 2 |
61-80 | 2 | 0 | 1 | 3 | 0 | 2 | 1 | 2 |
81-100 | 0 | 1 | 2 | 1 | 3 | 2 | 2 | 0 |
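The counts per bin can be reproduced with a short base R sketch using `cut()` and `table()`. The object names (`marks`, `breaks`, `labels`, `counts`) are illustrative; the bin edges assume the fixed 0 to 100 range used in the text rather than each subject's own minimum and maximum.

```r
# Marks per subject, in the order John, Suresh, Ramesh, Jessica, Jennifer
marks <- data.frame(
  Math      = c(55, 75, 25, 78, 58),
  Physics   = c(45, 55, 54, 55, 96),
  Chemistry = c(56, 0, 89, 86, 78),
  English   = c(87, 64, 76, 63, 46),
  Biology   = c(21, 90, 95, 54, 96),
  Economics = c(52, 61, 87, 89, 77),
  History   = c(89, 58, 56, 75, 83),
  Civics    = c(65, 2, 74, 45, 53)
)

# Five equal-width bins over the range 0 to 100: edges at 0, 20, 40, 60, 80, 100
breaks <- seq(0, 100, length.out = 6)
labels <- c("0-20", "21-40", "41-60", "61-80", "81-100")

# cut() assigns each mark to a bin (right-closed, with 0 included in the first bin);
# table() then counts the marks per bin for every subject
counts <- sapply(marks, function(x) {
  table(cut(x, breaks = breaks, labels = labels, include.lowest = TRUE))
})
print(counts)
```

Every column of `counts` sums to 5, since each of the five students has one mark per subject.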

Equal Frequency Discretization

The algorithm finds the minimum and maximum values and thereafter divides the range into the given number of intervals, in such a way that every interval contains an equal number of the sorted values.

As we have five intervals and five observations, each interval would get only one value. So listing the bins should suffice for each subject, as each bin would have a frequency of 1.

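The results are generated using the classInt package in R.

The code for Maths is as below.

    library(classInt)

    # Maths marks of the five students
    dataset <- c(55, 75, 25, 78, 58)

    # Request five class intervals
    classIntervals(dataset, 5)

Maths - [18.375,40) [40,56.5) [56.5,66.5) [66.5,76.5) [76.5,84.625]

Note that the interior breaks sit midway between neighbouring sorted marks (for example, 40 lies halfway between 25 and 55), so each mark lands in its own bin.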
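For comparison, equal frequency bins can also be sketched in base R by using quantiles as the bin edges. This is an alternative illustration, not the classInt computation above, so the edges differ from the ones reported there; the variable names are assumptions. Any edge placed between two neighbouring marks produces the same grouping.

```r
# Maths marks of the five students
maths <- c(55, 75, 25, 78, 58)
k <- 5  # number of intervals

# Quantiles at 0%, 20%, ..., 100% serve as the bin edges
breaks <- quantile(maths, probs = seq(0, 1, length.out = k + 1))

# Assign each mark to its bin; include.lowest keeps the minimum in the first bin
bins <- cut(maths, breaks = breaks, include.lowest = TRUE)
bin_counts <- table(bins)
print(bin_counts)
```

With five marks and five intervals, each bin ends up with a frequency of 1, as described above. With tied values the bins can only be approximately equal in size.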

Aggregation

Sometimes the variable that you are trying to visualise may not be part of the original dataset, but may be a derived variable computed as a function of one or more variables in the original dataset.

As an example, we may have a dataset that has runs scored and balls faced by each batsman in a cricket match. What we may be interested in, however, could be the metric called strike rate, which is defined simply as

strike rate = (runs scored / balls faced) × 100

Strike rate is then an aggregated variable.
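A small base R sketch of deriving this variable follows, assuming the usual batting definition of strike rate as runs scored per 100 balls faced. The batsmen and their figures are hypothetical.

```r
# Hypothetical innings data: runs scored and balls faced per batsman
innings <- data.frame(
  batsman = c("Batsman A", "Batsman B"),
  runs    = c(45, 30),
  balls   = c(30, 40)
)

# Derived variable: runs scored per 100 balls faced
innings$strike_rate <- innings$runs / innings$balls * 100
print(innings)
```

The new `strike_rate` column does not exist in the raw data; it is computed from two existing columns.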

That sums up the most common data normalisation techniques.