Thursday, May 26, 2016

How to encode a categorical variable in sklearn?

In most well-established machine learning systems, categorical variables are handled naturally. For example, in R you would use factors, and in Weka you would use nominal variables. This is not the case in scikit-learn. The decision trees implemented in sklearn use only numerical features, and these features are always interpreted as continuous numeric variables.
Thus, simply replacing the strings with a hash code should be avoided: because the feature is treated as a continuous numerical variable, any coding you use will induce an order which simply does not exist in your data.
One example is coding ['red','green','blue'] as [1,2,3], which would produce weird things like 'red' being lower than 'blue', and the average of a 'red' and a 'blue' being a 'green'. A more subtle example might happen when you code ['low', 'medium', 'high'] as [1,2,3]. In the latter case the ordering may actually make sense, yet subtle inconsistencies can still arise when 'medium' is not exactly in the middle between 'low' and 'high'.
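To make this concrete, here is a minimal sketch of what integer coding looks like with sklearn's LabelEncoder (the toy data is made up for illustration):

    from sklearn.preprocessing import LabelEncoder

    # Hypothetical toy data, just to show the induced ordering.
    colors = ['red', 'green', 'blue', 'red']

    le = LabelEncoder()
    codes = le.fit_transform(colors)
    print(codes)  # [2 1 0 2] -- classes are sorted, so 'blue'=0, 'green'=1, 'red'=2

    # Any model reading this column now sees 'blue' < 'green' < 'red',
    # an order that does not exist in the original data.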
Finally, the answer to your question lies in coding the categorical feature into multiple binary features. For example, you might code ['red','green','blue'] with 3 columns, one for each category, having 1 when the category matches and 0 otherwise. This is called one-hot encoding, binary encoding, one-of-k encoding, or whatever name you prefer; a sketch is shown below. You can check the sklearn documentation on encoding categorical features and on feature extraction (hashing and dicts). Obviously, one-hot encoding expands your space requirements, and sometimes it hurts performance as well.
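As a rough sketch, one way to get such binary columns in sklearn is DictVectorizer from the feature-extraction module mentioned above (the data and column name below are made up for illustration):

    from sklearn.feature_extraction import DictVectorizer

    # Hypothetical toy data: one categorical column named 'color'.
    data = [{'color': 'red'}, {'color': 'green'}, {'color': 'blue'}]

    vec = DictVectorizer(sparse=False)
    X = vec.fit_transform(data)

    print(vec.feature_names_)  # ['color=blue', 'color=green', 'color=red']
    print(X)
    # [[ 0.  0.  1.]
    #  [ 0.  1.  0.]
    #  [ 1.  0.  0.]]

Note that DictVectorizer expands each distinct string into its own 0/1 column, so no spurious order is introduced; for large cardinalities, the hashing trick (FeatureHasher) from the same module keeps the dimensionality bounded at the cost of possible collisions.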
