Feature hashing on multiple categorical features (columns)


I want to hash the 'Genre' feature into 6 columns and, separately, the 'Publisher' feature into another 6 columns. What I want looks like this:

      Genre      Publisher  0    1    2    3    4    5      0    1    2    3    4    5 
0     Platform  Nintendo  0.0  2.0  2.0 -1.0  1.0  0.0    0.0  2.0  2.0 -1.0  1.0  0.0
1       Racing      Noir -1.0  0.0  0.0  0.0  0.0 -1.0   -1.0  0.0  0.0  0.0  0.0 -1.0
2       Sports     Laura -2.0  2.0  0.0 -2.0  0.0  0.0   -2.0  2.0  0.0 -2.0  0.0  0.0
3  Roleplaying      John -2.0  2.0  2.0  0.0  1.0  0.0   -2.0  2.0  2.0  0.0  1.0  0.0
4       Puzzle      John  0.0  1.0  1.0 -2.0  1.0 -1.0    0.0  1.0  1.0 -2.0  1.0 -1.0
5     Platform      Noir  0.0  2.0  2.0 -1.0  1.0  0.0    0.0  2.0  2.0 -1.0  1.0  0.0

The following code does what I want:

import pandas as pd
from sklearn.feature_extraction import FeatureHasher

d = {'Genre': ['Platform', 'Racing', 'Sports', 'Roleplaying', 'Puzzle', 'Platform'],
     'Publisher': ['Nintendo', 'Noir', 'Laura', 'John', 'John', 'Noir']}
df = pd.DataFrame(data=d)

# One FeatureHasher per categorical column
fh1 = FeatureHasher(n_features=6, input_type='string')
fh2 = FeatureHasher(n_features=6, input_type='string')

hashed_features1 = fh1.fit_transform(df['Genre']).toarray()
hashed_features2 = fh2.fit_transform(df['Publisher']).toarray()

# Keep the original columns and append the hashed columns
pd.concat([df[['Genre', 'Publisher']], pd.DataFrame(hashed_features1), pd.DataFrame(hashed_features2)],
          axis=1)

This works for the two features above, but if I have 40 categorical features the approach becomes tedious. Is there another way to do it?

4 comments

Great first question -- I added some more tags so it gets seen by people with the right skills.
By 40 categorical features, do you mean 40 columns of categorical data?
Noor: Yes, 40 columns of categorical data.
Noor: Thank you, Patrick Artner.
Tags: python, pandas, dataframe, scikit-learn, feature-extraction

Noor
Posted on 2019-01-19
2 Answers
Jan K
Posted on 2020-12-05
Accepted

Hashing (Updated)

Assuming new categories may appear in some of the features, hashing is the way to go. Just two notes:

  • Be aware of the possibility of collisions and adjust the number of features accordingly (see the short sketch below)
  • In your case, you want to hash each feature separately
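
To make the collision note concrete, here is a minimal sketch. It assumes each category value is hashed as a single token (a list like ['Platform'] per row, unlike the character-level hashing in the question's code) and counts how many of the Genre values end up sharing a hashed column for a given n_features; raising n_features leaves more buckets and makes such collisions less likely.

    import numpy as np
    from sklearn.feature_extraction import FeatureHasher

    genres = ['Platform', 'Racing', 'Sports', 'Roleplaying', 'Puzzle']

    def n_collisions(n_features):
        # Hash each category as one whole token: [['Platform'], ['Racing'], ...]
        fh = FeatureHasher(n_features=n_features, input_type='string')
        hashed = fh.transform([[g] for g in genres]).toarray()
        # Each row has a single non-zero entry (+1 or -1); its position is the bucket
        buckets = np.abs(hashed).argmax(axis=1)
        # Number of categories that landed in an already-occupied bucket
        return len(genres) - len(set(buckets))

    for n in (2, 4, 8, 16):
        print(n, n_collisions(n))
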
One Hot Vector

If the number of categories per feature is fixed and not too large, use one-hot encoding.

I would recommend using either of the two:

  • sklearn.preprocessing.OneHotEncoder
  • pandas.get_dummies
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.feature_extraction import FeatureHasher
    from sklearn.preprocessing import OneHotEncoder
    df = pd.DataFrame({'feature_1': ['A', 'G', 'T', 'A'],
                       'feature_2': ['cat', 'dog', 'elephant', 'zebra']})
    # Approach 0 (Hashing per feature)
    n_orig_features = df.shape[1]
    hash_vector_size = 6
    ct = ColumnTransformer([(f't_{i}', FeatureHasher(n_features=hash_vector_size, 
                            input_type='string'), i) for i in range(n_orig_features)])
    res_0 = ct.fit_transform(df)  # res_0.shape[1] = n_orig_features * hash_vector_size
    # Approach 1 (OHV)
    res_1 = pd.get_dummies(df)
    # Approach 2 (OHV)
    res_2 = OneHotEncoder(sparse=False).fit_transform(df)
    

    res_0 :

    array([[ 0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  1., -1.,  0., -1.],
           [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  2., -1.,  0.,  0.,  0.],
           [ 0., -1.,  0.,  0.,  0.,  0., -2.,  2.,  2., -1.,  0., -1.],
           [ 0.,  0.,  0.,  0.,  1.,  0.,  0.,  2.,  1., -1.,  0., -1.]])
    

    res_1 :

       feature_1_A  feature_1_G  feature_1_T  feature_2_cat  feature_2_dog  feature_2_elephant  feature_2_zebra
    0            1            0            0              1              0                   0                0
    1            0            1            0              0              1                   0                0
    2            0            0            1              0              0                   1                0
    3            1            0            0              0              0                   0                1
    

    res_2 :

    array([[1., 0., 0., 1., 0., 0., 0.],
           [0., 1., 0., 0., 1., 0., 0.],
           [0., 0., 1., 0., 0., 1., 0.],
           [1., 0., 0., 0., 0., 0., 1.]])
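
Finally, if the goal is the labelled DataFrame shown in the question rather than a bare array, Approach 0 can also be written as a plain loop over the categorical columns. A sketch, assuming every column of df is a categorical string column and using illustrative column names of the form f'{col}_{j}':

    import pandas as pd
    from sklearn.feature_extraction import FeatureHasher

    df = pd.DataFrame({'Genre': ['Platform', 'Racing', 'Sports', 'Roleplaying', 'Puzzle', 'Platform'],
                       'Publisher': ['Nintendo', 'Noir', 'Laura', 'John', 'John', 'Noir']})

    hash_vector_size = 6
    parts = [df]  # keep the original columns, as in the desired output
    for col in df.columns:  # the same loop works for 2 or 40 categorical columns
        fh = FeatureHasher(n_features=hash_vector_size, input_type='string')
        # Same character-level hashing as the question's code
        hashed = fh.fit_transform(df[col]).toarray()
        parts.append(pd.DataFrame(hashed,
                                  columns=[f'{col}_{j}' for j in range(hash_vector_size)],
                                  index=df.index))

    result = pd.concat(parts, axis=1)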