
I am trying to merge the results of a predict method back with the original data in a pandas.DataFrame object.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split  # (sklearn.cross_validation in older versions)
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
data = load_iris()
# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data 
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)
# add outcome variable
df['class'] = data.target
X = np.matrix(df.loc[:, [0, 1, 2, 3]])
y = np.array(df['class'])
# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# I've got my predictions now
y_hats = model.predict(X_test)

To merge these predictions back with the original df, I try this:

df['y_hats'] = y_hats

But that raises:

ValueError: Length of values does not match length of index

I know I could split the df into train_df and test_df and this problem would be solved, but in reality I need to follow the path above to create the matrices X and y (my actual problem is a text classification problem in which I normalize the entire feature matrix before splitting into train and test). How can I align these predicted values with the appropriate rows in my df, since the y_hats array is zero-indexed and seemingly all information about which rows were included in the X_test and y_test is lost? Or will I be relegated to splitting dataframes into train-test first, and then building feature matrices? I'd like to just fill the rows included in train with np.nan values in the dataframe.
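
For illustration, the kind of bookkeeping I am after would look something like this sketch (idx_train / idx_test are placeholder names; the rest reuses the variables above), though I am not sure it is the cleanest way:

# split an array of row positions alongside X and y so the test rows can be recovered
indices = np.arange(df.shape[0])
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, y, indices, train_size=0.8)
model = DecisionTreeClassifier().fit(X_train, y_train)
df['y_hats'] = np.nan                               # rows used for training stay NaN
df.loc[idx_test, 'y_hats'] = model.predict(X_test)  # align predictions to their rows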

I believe that sklearn supports DataFrames and Series as args to train_test_split, so it should work if you pass a sub-section of your df; besides, the returned splits keep their indices, so you can use those to index back into your df using iloc, see docs: scikit-learn.org/stable/modules/generated/… – EdChum Nov 21, 2016 at 20:56
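
A minimal sketch of that suggestion, reusing the df and classifier from the question (the splits keep df's index, so the test predictions can be written straight back, leaving the training rows as NaN):

X_train, X_test, y_train, y_test = train_test_split(
    df.loc[:, [0, 1, 2, 3]], df['class'], train_size=0.8)
model = DecisionTreeClassifier().fit(X_train, y_train)
df['y_hats'] = np.nan                                    # rows used for training stay NaN
df.loc[X_test.index, 'y_hats'] = model.predict(X_test)   # aligned by row label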

Your y_hats length will only be the length of the test data (20%) because you predicted on X_test. Once your model is validated and you're happy with the test predictions (by comparing the model's predictions on X_test with the true X_test values), you should rerun the predict on the full dataset (X). Add these two lines to the bottom:

y_hats2 = model.predict(X)
df['y_hats'] = y_hats2

EDIT: per your comment, here is an updated version that returns the dataset with the predictions appended where they were in the test dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split  # (sklearn.cross_validation in older versions)
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
data = load_iris()
# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data 
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)
# put the outcome variable in its own dataframe
df_class = pd.DataFrame(data = data.target)
# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# I've got my predictions now
y_hats = model.predict(X_test)
y_test['preds'] = y_hats
df_out = pd.merge(df,y_test[['preds']],how = 'left',left_index = True, right_index = True)
That doesn't really solve my issue of merging only those data that were in test to begin with. If you merge back predictions for every row, how do you know which were in the original test matrices? For all I know, I could run the lines you added, but have no idea whether the model already saw some of the rows in X (therefore kind of invalidating the whole purpose of train-test). – blacksite Nov 21, 2016 at 21:07
@flyingmeatball Hi, I am trying to do the exact same thing, but when the y_hats are stored as a variable they become a numpy array rather than a DataFrame, which needs to be converted back to pandas to do the merge. At that point, the merge on indices cannot be done. What am I missing? – bernando_vialli May 4, 2018 at 15:56
y_test['preds'] = y_hats causes this error: ValueError: Wrong number of items passed 2, placement implies 1 – asmgx Jun 20, 2019 at 23:05
# split directly on the DataFrames so the original index is preserved
X_train, X_test, y_train, y_test = train_test_split(df, df_class, train_size = 0.8)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
y_hats = model.predict(X_test)
y_hats = pd.DataFrame(y_hats)
# reset the indices so actuals and predictions line up positionally
df_out = X_test.reset_index()
df_out["Actual"] = y_test.reset_index()["Columns_Name"]  # "Columns_Name" is your target column
df_out["Prediction"] = y_hats.reset_index()[0]

You can create a y_hats DataFrame that copies the index from X_test, then merge it with the original data.

y_hats_df = pd.DataFrame(data = y_hats, columns = ['y_hats'], index = X_test.index.copy())
df_out = pd.merge(df, y_hats_df, how = 'left', left_index = True, right_index = True)

Note that the left join will include the train-data rows (their y_hats will be NaN). Omitting the how parameter (the default is an inner join) will return just the test-data rows.
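
For example, with the left join above, the training rows can be spotted by their missing predictions, while the default inner join keeps only the test rows (a small sketch reusing df, y_hats_df and df_out from this answer):

train_rows = df_out[df_out['y_hats'].isna()]                                 # rows from the training split
test_only = pd.merge(df, y_hats_df, left_index = True, right_index = True)  # inner join: test rows only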

Welcome to Stack Overflow. Thank you for contributing an answer. I think your answer could be further improved using this article. Any chance you can add more context to this? – James Mchugh Dec 8, 2019 at 3:11

You can probably make a new dataframe and add to it the test data along with the predicted values:

data['y_hats'] = y_hats
data.to_csv('data1.csv')
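
A sketch of what that would look like, assuming X_test is a DataFrame slice of the original df rather than a numpy matrix:

test_with_preds = X_test.copy()      # the held-out rows only
test_with_preds['y_hats'] = y_hats   # lengths match by construction
test_with_preds.to_csv('data1.csv')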
data['y_hats'] = y_hats causes this error: ValueError: Wrong number of items passed 2, placement implies 1 – asmgx Jun 20, 2019 at 23:10
# predict on the validation set, then build a DataFrame of predictions
# that reuses X_valid's index so rows can be matched back to the originals
predicted = m.predict(X_valid)
predicted_df = pd.DataFrame(data=predicted, columns=['y_hat'],
                            index=X_valid.index.copy())
df_out = pd.merge(X_valid, predicted_df, how='left', left_index=True,
                  right_index=True)

This worked well for me. It maintains the indexing positions.

pred_prob = model.predict_proba(X_test)[:, 1]        # probability of the positive class
pred_class = np.where(pred_prob > 0.5, "Yes", "No")  # for a binary (Yes/No) target
predictions = pd.DataFrame(pred_class, columns=['Prediction'])
my_new_df = pd.concat([my_old_df, predictions], axis=1)
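
One caveat with this concat: pd.concat(..., axis=1) aligns on the index, so my_old_df (a placeholder name) needs an index that matches the fresh 0..n-1 index of predictions. A sketch of a variant that sidesteps this by reusing my_old_df's own index (assuming my_old_df is the test-set slice, so the lengths match):

predictions = pd.DataFrame(pred_class, columns=['Prediction'], index=my_old_df.index)
my_new_df = pd.concat([my_old_df, predictions], axis=1)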

Here is a solution that worked for me:

It consists of building, for each of your folds/iterations, one dataframe which includes observed and predicted values for your test set; this way, you make use of the index (ID) contained in y_true, which should correspond to your subjects' IDs (in my code: 'SubjID').

You then concatenate the DataFrames generated for each fold (five folds of test data in my case) and merge the predictions back into your original dataset.

I hope this helps!

fold_frames = []  # one dataframe of observed + predicted values per fold
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    # [...] your model
    # performance is measured on test set
    y_true, y_pred = y_test, clf.predict(X_test)
    # Save observed and predicted values for this test set, keyed by subject ID
    fold_df = pd.DataFrame(y_true).reset_index()
    fold_df = fold_df.join(pd.Series(y_pred, name='y_pred'))
    fold_df.set_index('SubjID', inplace=True)
    fold_frames.append(fold_df)
# Create dataframe with observed and predicted values for all subjects
ObsPred_Concat = pd.concat(fold_frames)
original_df['y_pred'] = ObsPred_Concat['y_pred']
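
A quick sanity check on the concatenated result (a sketch; it assumes every subject lands in exactly one test fold, which is what a standard k-fold split guarantees):

assert ObsPred_Concat.index.is_unique           # no subject predicted twice
assert len(ObsPred_Concat) == len(original_df)  # every subject got a prediction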

First you need to convert your y_val or y_test data into a DataFrame.

compare_df = pd.DataFrame(y_val)

Then just create a new column with the predicted data.

compare_df['predicted_res'] = y_pred_val

After that, you can easily filter the rows where the prediction matches the original value using a simple condition.

test_df = compare_df[compare_df['y_val'] == compare_df['predicted_res'] ]
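
The same comparison frame also gives a quick accuracy figure, for example (keeping the column names assumed above):

accuracy = (compare_df['y_val'] == compare_df['predicted_res']).mean()
print(f"validation accuracy: {accuracy:.3f}")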
        
