Target encoding

Common approaches to encoding categorical data are one-hot encoding and label encoding. Recently I came across another method called “target encoding”, which can be more effective, especially for high-cardinality features. The technique was introduced by Daniele Micci-Barreca (“A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems”).

The idea is to use the relationship between the categorical feature x and the target y to obtain a more meaningful numerical representation of x. If x has N unique values xi, then the encoding of each xi is defined as a function of the count of xi (how many times we observe x equal to xi) and the mean of y over the rows where x equals xi.
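To make this concrete, here is a minimal sketch of the idea in Python. It is not the exact formula from the paper (which uses a more elaborate weighting function); the `target_encode` helper, its `smoothing` parameter, and the toy data are purely illustrative. It simply blends each category's mean target with the global mean, weighted by how often the category occurs:

```python
import pandas as pd

def target_encode(x, y, smoothing=1.0):
    """Illustrative target encoding: blend each category's mean target
    with the global mean, weighted by the category's count."""
    df = pd.DataFrame({"x": x, "y": y})
    prior = df["y"].mean()  # global mean of the target
    stats = df.groupby("x")["y"].agg(["count", "mean"])
    # Categories seen more often trust their own mean more than the prior.
    weight = stats["count"] / (stats["count"] + smoothing)
    encoding = weight * stats["mean"] + (1 - weight) * prior
    return df["x"].map(encoding)

# Toy usage: categories with different average targets get different codes.
x = ["CA", "CA", "TX", "TX", "TX", "NY"]
y = [0, 1, 1, 2, 1, 0]
print(target_encode(x, y))
```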

I ran an experiment on the Kaggle Gun Violence dataset to gauge how effective target encoding is. The code for this experiment is available at https://github.com/tienduccao/blog_demo/blob/master/target_encoding/Demo.ipynb.

The target y in this dataset is n_killed, the number of people killed in a gun violence incident. The categorical feature x is state, the state where the incident happened. If we apply LabelEncoder to x, the encoded values look like this:

[Figure: encoded values of state using LabelEncoder]

The following figure shows the encoded values of x using TargetEncoder:

[Figure: encoded values of state using TargetEncoder]

It is clear that TargetEncoder’s encoded values reflect a trend of the target n_killed.
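Roughly, the two encodings compared in the figures can be produced as below. This is a sketch, assuming scikit-learn's LabelEncoder and the category_encoders package are used; the CSV file name is hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from category_encoders import TargetEncoder

df = pd.read_csv("gun-violence-data.csv")  # hypothetical file name

# LabelEncoder assigns an arbitrary integer per state, unrelated to n_killed.
label_encoded = LabelEncoder().fit_transform(df["state"])

# TargetEncoder replaces each state with a smoothed mean of n_killed.
target_encoded = TargetEncoder(cols=["state"]).fit_transform(
    df[["state"]], df["n_killed"]
)
```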

To verify whether TargetEncoder improves the model’s predictive power, I ran a RandomForestClassifier (5-fold stratified cross-validation) on a set of categorical features (“state”, “city_or_county”, “gun_stolen”, “gun_type”, “participant_age_group”, “participant_gender”, “participant_relationship”, “participant_status”, “participant_type”), encoded first with LabelEncoder and then with TargetEncoder.
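A sketch of how the TargetEncoder variant of this comparison might be set up is shown below. It is not the notebook’s exact code: the file name, the fillna handling, and the model hyperparameters are assumptions. The encoder is fit on the training fold only, to avoid leaking the target into the encoded test features:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import log_loss
from category_encoders import TargetEncoder

features = ["state", "city_or_county", "gun_stolen", "gun_type",
            "participant_age_group", "participant_gender",
            "participant_relationship", "participant_status",
            "participant_type"]

df = pd.read_csv("gun-violence-data.csv")  # hypothetical file name
X, y = df[features].fillna("missing"), df["n_killed"]

scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

    # Fit the encoder on the training fold only to avoid target leakage.
    encoder = TargetEncoder(cols=features).fit(X_train, y_train)
    model = RandomForestClassifier(n_estimators=100, n_jobs=-1)
    model.fit(encoder.transform(X_train), y_train)

    proba = model.predict_proba(encoder.transform(X_test))
    scores.append(log_loss(y_test, proba, labels=model.classes_))

print("mean log-loss:", np.mean(scores))
```

The LabelEncoder baseline follows the same loop, with the encoder swapped out.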

The first approach achieves an average log-loss of 0.02103, while the second achieves 0.01292 (lower is better).
