Feature hashing on multiple categorical features (columns)


I want to hash the 'Genre' feature into 6 columns and, separately, the 'Publisher' feature into another 6 columns. What I want looks like this:

      Genre      Publisher  0    1    2    3    4    5      0    1    2    3    4    5 
0     Platform  Nintendo  0.0  2.0  2.0 -1.0  1.0  0.0    0.0  2.0  2.0 -1.0  1.0  0.0
1       Racing      Noir -1.0  0.0  0.0  0.0  0.0 -1.0   -1.0  0.0  0.0  0.0  0.0 -1.0
2       Sports     Laura -2.0  2.0  0.0 -2.0  0.0  0.0   -2.0  2.0  0.0 -2.0  0.0  0.0
3  Roleplaying      John -2.0  2.0  2.0  0.0  1.0  0.0   -2.0  2.0  2.0  0.0  1.0  0.0
4       Puzzle      John  0.0  1.0  1.0 -2.0  1.0 -1.0    0.0  1.0  1.0 -2.0  1.0 -1.0
5     Platform      Noir  0.0  2.0  2.0 -1.0  1.0  0.0    0.0  2.0  2.0 -1.0  1.0  0.0

The following code does what I want:

import pandas as pd
from sklearn.feature_extraction import FeatureHasher

d = {'Genre': ['Platform', 'Racing', 'Sports', 'Roleplaying', 'Puzzle', 'Platform'],
     'Publisher': ['Nintendo', 'Noir', 'Laura', 'John', 'John', 'Noir']}
df = pd.DataFrame(data=d)

# One FeatureHasher per categorical column
fh1 = FeatureHasher(n_features=6, input_type='string')
fh2 = FeatureHasher(n_features=6, input_type='string')

hashed_features1 = fh1.fit_transform(df['Genre']).toarray()
hashed_features2 = fh2.fit_transform(df['Publisher']).toarray()

# Keep the original columns and append the hashed columns
pd.concat([df[['Genre', 'Publisher']], pd.DataFrame(hashed_features1), pd.DataFrame(hashed_features2)],
          axis=1)

This works for the two features above, but if I have 40 categorical features the approach becomes tedious. Is there another way to do it?

4 comments

Great first question -- I added some more tags so it gets seen by people with the right skills.
By 40 categorical features, do you mean 40 columns of categorical data?
Noor: Yes, 40 columns of categorical data.
Noor: Thank you, Patrick Artner.
Tags: python, pandas, dataframe, scikit-learn, feature-extraction

Noor
Posted on 2019-01-19
2 Answers
Jan K
Posted on 2020-12-05
Accepted

Hashing (Updated)

Assuming new categories may appear in some of the features, hashing is the way to go. Just two notes:

  • Be aware of the possibility of collisions and adjust the number of features accordingly (see the short sketch below)
  • In your case, you want to hash each feature separately
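
To make the collision note concrete, here is a minimal sketch. It assumes each category value is hashed as a single token (a list like ['Platform'] per row, unlike the character-level hashing in the question's code) and counts how many of the Genre values end up sharing a hashed column for a given n_features; raising n_features leaves more buckets and makes such collisions less likely.

    import numpy as np
    from sklearn.feature_extraction import FeatureHasher

    genres = ['Platform', 'Racing', 'Sports', 'Roleplaying', 'Puzzle']

    def n_collisions(n_features):
        # Hash each category as one whole token: [['Platform'], ['Racing'], ...]
        fh = FeatureHasher(n_features=n_features, input_type='string')
        hashed = fh.transform([[g] for g in genres]).toarray()
        # Each row has a single non-zero entry (+1 or -1); its position is the bucket
        buckets = np.abs(hashed).argmax(axis=1)
        # Number of categories that landed in an already-occupied bucket
        return len(genres) - len(set(buckets))

    for n in (2, 4, 8, 16):
        print(n, n_collisions(n))
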
One Hot Vector

If the number of categories per feature is fixed and not too large, use one-hot encoding.

I would recommend using either of the two:

  • sklearn.preprocessing.OneHotEncoder
  • pandas.get_dummies
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.feature_extraction import FeatureHasher
    from sklearn.preprocessing import OneHotEncoder
    df = pd.DataFrame({'feature_1': ['A', 'G', 'T', 'A'],
                       'feature_2': ['cat', 'dog', 'elephant', 'zebra']})
    # Approach 0 (Hashing per feature)
    n_orig_features = df.shape[1]
    hash_vector_size = 6
    ct = ColumnTransformer([(f't_{i}', FeatureHasher(n_features=hash_vector_size, 
                            input_type='string'), i) for i in range(n_orig_features)])
    res_0 = ct.fit_transform(df)  # res_0.shape[1] = n_orig_features * hash_vector_size
    # Approach 1 (OHV)
    res_1 = pd.get_dummies(df)
    # Approach 2 (OHV)
    res_2 = OneHotEncoder(sparse=False).fit_transform(df)
    

    res_0 :

    array([[ 0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  1., -1.,  0., -1.],
           [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  2., -1.,  0.,  0.,  0.],
           [ 0., -1.,  0.,  0.,  0.,  0., -2.,  2.,  2., -1.,  0., -1.],
           [ 0.,  0.,  0.,  0.,  1.,  0.,  0.,  2.,  1., -1.,  0., -1.]])
    

    res_1 :

       feature_1_A  feature_1_G  feature_1_T  feature_2_cat  feature_2_dog  feature_2_elephant  feature_2_zebra
    0            1            0            0              1              0                   0                0
    1            0            1            0              0              1                   0                0
    2            0            0            1              0              0                   1                0
    3            1            0            0              0              0                   0                1
    

    res_2 :

    array([[1., 0., 0., 1., 0., 0., 0.],
           [0., 1., 0., 0., 1., 0., 0.],
           [0., 0., 1., 0., 0., 1., 0.],
           [1., 0., 0., 0., 0., 0., 1.]])
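
Finally, if the goal is the labelled DataFrame shown in the question rather than a bare array, Approach 0 can also be written as a plain loop over the categorical columns. A sketch, assuming every column of df is a categorical string column and using illustrative column names of the form f'{col}_{j}':

    import pandas as pd
    from sklearn.feature_extraction import FeatureHasher

    df = pd.DataFrame({'Genre': ['Platform', 'Racing', 'Sports', 'Roleplaying', 'Puzzle', 'Platform'],
                       'Publisher': ['Nintendo', 'Noir', 'Laura', 'John', 'John', 'Noir']})

    hash_vector_size = 6
    parts = [df]  # keep the original columns, as in the desired output
    for col in df.columns:  # the same loop works for 2 or 40 categorical columns
        fh = FeatureHasher(n_features=hash_vector_size, input_type='string')
        # Same character-level hashing as the question's code
        hashed = fh.fit_transform(df[col]).toarray()
        parts.append(pd.DataFrame(hashed,
                                  columns=[f'{col}_{j}' for j in range(hash_vector_size)],
                                  index=df.index))

    result = pd.concat(parts, axis=1)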