I'm trying to perform feature selection by evaluating my regression's coefficient outputs and selecting the features with the highest-magnitude coefficients. The problem is, I don't know how to get the respective features, as only the coefficients are returned from the coef_ attribute. The documentation says:

Estimated coefficients for the linear regression problem. If multiple targets are passed during the fit (y 2D), this is a 2D array of shape (n_targets, n_features), while if only one target is passed, this is a 1D array of length n_features.

I am calling regression.fit(A, B), where A is a 2-D array holding the tf-idf value of each feature in each document. Example format:

         "feature1"   "feature2"
"Doc1"    .44          .22
"Doc2"    .11          .6
"Doc3"    .22          .2

B holds my target values for the data, which are just numbers from 1 to 100 associated with each document:

"Doc1"    50
"Doc2"    11
"Doc3"    99

Using regression.coef_, I get a list of coefficients, but not their corresponding features! How can I get the features? I'm guessing I need to modify the structure of my B targets, but I don't know how.

Here X is a DataFrame holding your independent variables:

coefficients = pd.concat([pd.DataFrame(X.columns),pd.DataFrame(np.transpose(logistic.coef_))], axis = 1)

The assumption you stated, that the order of regression.coef_ is the same as the column order of the training set, holds true in my experience. (It works with the underlying data and also checks out against the correlations between X and y.)
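One way to check the ordering assumption is to fit on data whose true coefficients are known. The sketch below uses made-up data: the column names match the question, while the coefficients 3 and -5 and the intercept 50 are assumptions chosen purely for illustration. If coef_ follows the column order, the fitted values should come back attached to the right names.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data built from known (assumed) coefficients.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((50, 2)), columns=["feature1", "feature2"])
y = 3.0 * X["feature1"] - 5.0 * X["feature2"] + 50.0

regression = LinearRegression().fit(X, y)

# coef_ has one entry per column of X, in column order,
# so the column names can be attached as the index.
coefs = pd.Series(regression.coef_, index=X.columns)
print(coefs)
```

Because the target contains no noise, the recovered values should match the assumed coefficients up to floating-point error.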

You can do that by creating a data frame:

cdf = pd.DataFrame(regression.coef_, X.columns, columns=['Coefficients'])
print(cdf)
regression.coef_ is now returned as a dataframe, so to do this: cdf = pd.concat([pd.DataFrame(X.columns),pd.DataFrame(np.transpose(regression.coef_))], axis = 1)
– tim.newport
Nov 4, 2021 at 2:58

@ytu try coefficients = pd.DataFrame({"Feature":X.columns,"Coefficients":np.transpose(logistic.coef_[0])})
– plumbus_bouquet
Apr 5, 2018 at 4:27
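The comments above differ on whether coef_ needs transposing or indexing with [0]. That depends on its shape, which in turn depends on the estimator and on the shape of the target. A small sketch on made-up data (all values here are illustrative assumptions) that prints the shapes for three common cases:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((40, 3))                            # 40 samples, 3 features

y_single = X @ np.array([1.0, 2.0, 3.0])           # one regression target
y_multi = np.column_stack([y_single, -y_single])   # two regression targets
y_binary = (X[:, 0] > 0.5).astype(int)             # binary class labels

shape_single = LinearRegression().fit(X, y_single).coef_.shape
shape_multi = LinearRegression().fit(X, y_multi).coef_.shape
shape_binary = LogisticRegression().fit(X, y_binary).coef_.shape

print(shape_single)  # 1-D: one coefficient per feature
print(shape_multi)   # 2-D: one row per target
print(shape_binary)  # 2-D with a single row, which is why coef_[0] appears above
```

With a 1-D coef_, the DataFrame construction in the answer works directly; with a 2-D coef_, select a row or transpose first.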

I suppose you are working on some feature selection task. Using regression.coef_ does give the coefficients in the same order as the features, i.e. regression.coef_[0] corresponds to "feature1" and regression.coef_[1] corresponds to "feature2". This should be what you want.

I would also recommend the tree models from sklearn, which can be used for feature selection as well. To be specific, check out here.
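A minimal sketch of the tree-based route, assuming ExtraTreesRegressor as the model (chosen here because the question is a regression problem; it is not named in the answer). The data and the column names "signal", "noise1" and "noise2" are made up. Importances are attached to column names the same way as coefficients.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

# Made-up data: only the "signal" column actually drives the target.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((200, 3)), columns=["signal", "noise1", "noise2"])
y = 10.0 * X["signal"] + rng.normal(scale=0.1, size=200)

model = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ has one entry per column of X, in column order.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Unlike coefficients, importances are non-negative and sum to one, so they rank features but say nothing about the direction of the effect.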

This is true as long as regression.coef_ returns coefficient values in the same order. Thanks.
– jeffrey
Nov 16, 2014 at 0:55

The ExtraTreesClassifier is actually very interesting, but it seems there is no way to retrieve the actual features which it picked after the model has been fit?
– jeffrey
Nov 16, 2014 at 1:17

@jeffrey Yes, but I always select feature by clf.feature_importances_ to retrieve the importance ranking of features. Well intuitively it is just like the coefficients of the Linear Model, isn't it?
– Jake0x32
Nov 16, 2014 at 1:41

Well, if you use a feature selection method like a CountVectorizer(), it has a method get_feature_names(). Then you can map get_feature_names() to .coef_ (I think they are in order, I'm not sure). However, you cannot do this with the tree.
– jeffrey
Nov 16, 2014 at 1:56

Coefficients and features in zip

print(list(zip(X_train.columns.tolist(),logreg.coef_[0])))

Coefficients and features in DataFrame

pd.DataFrame({"Feature":X_train.columns.tolist(),"Coefficients":logreg.coef_[0]})

This is the easiest and most intuitive way:

pd.DataFrame(logisticRegr.coef_, columns=x_train.columns)

or the same but transposing index and columns

pd.DataFrame(logisticRegr.coef_, columns=x_train.columns).T
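For a multi-class logistic regression, coef_ has one row per class, so it can help to label the rows as well. A sketch on made-up data, where the class labels "low", "mid" and "high" and the feature names are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x_train = pd.DataFrame(rng.random((150, 2)), columns=["feature1", "feature2"])

# Three made-up classes derived from feature1.
bins = np.digitize(x_train["feature1"].to_numpy(), [1 / 3, 2 / 3])
y_train = np.array(["low", "mid", "high"])[bins]

logisticRegr = LogisticRegression(max_iter=1000).fit(x_train, y_train)

# Rows follow logisticRegr.classes_ (sorted labels); columns follow x_train.columns.
table = pd.DataFrame(
    logisticRegr.coef_,
    index=logisticRegr.classes_,
    columns=x_train.columns,
)
print(table)
```

Calling table.T gives the transposed layout, with one row per feature and one column per class.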

Suppose your training data X is stored in 'df_X'. You can then zip the column names and coefficients into a dictionary and feed it into a pandas DataFrame to get the mapping:

pd.DataFrame(dict(zip(df_X.columns,model.coef_[0])),index=[0]).T

Try putting them in a Series with the data's column names as the index:

coeffs = pd.Series(model.coef_[0], index=X.columns.values)
coeffs.sort_values(ascending = False)
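Note that sort_values(ascending=False) orders by signed value, so large negative coefficients end up last, whereas the question asks for the highest magnitude. A sketch of ordering by absolute value instead, using hypothetical coefficient values and feature names in place of a fitted model.coef_[0]:

```python
import pandas as pd

# Hypothetical coefficients, standing in for model.coef_[0].
coeffs = pd.Series([0.8, -2.5, 0.1, 1.7], index=["f1", "f2", "f3", "f4"])

# Order by absolute value while keeping the original signs.
order = coeffs.abs().sort_values(ascending=False).index
by_magnitude = coeffs.reindex(order)

# Names of the two features with the largest-magnitude coefficients.
top_two = by_magnitude.head(2).index.tolist()
print(by_magnitude)
print(top_two)
```

Comparing magnitudes this way assumes the features are on comparable scales.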
        
