
I'm trying to calculate the Root Mean Squared Logarithmic Error, for which I have found a few options. One is to use the sklearn metric mean_squared_log_error and take its square root:

np.sqrt(mean_squared_log_error( target, predicted_y ))

But I get the following error:

Mean Squared Logarithmic Error cannot be used when targets contain negative values

I have also tried a solution from a Kaggle post:

import math

# A function to calculate Root Mean Squared Logarithmic Error (RMSLE)
def rmsle(y, y_pred):
    assert len(y) == len(y_pred)
    terms_to_sum = [(math.log(p + 1) - math.log(t + 1)) ** 2.0
                    for t, p in zip(y, y_pred)]
    return (sum(terms_to_sum) / len(y)) ** 0.5

Same issue; this time I get a domain error.
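The domain error is easy to reproduce in isolation: log(x + 1) is only defined for x > -1, so any target or prediction at or below -1 makes math.log fail. A minimal sketch:

```python
import math

# log(x + 1) is only defined for x > -1, so any value at or
# below -1 triggers the "math domain error" seen above
try:
    math.log(-2 + 1)  # log(-1) is undefined over the reals
    raised = False
except ValueError as e:
    raised = True
    print(e)  # math domain error
```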

In the same post they comment the following regarding the negative log issue:

You're right. You have to transform y_pred and y_test to make sure they don't carry negative values.

In my case, when predicting weather temperature (originally in degrees Celsius), the solution was to convert the values to Kelvin before calculating the RMSLE:

rmsle(data.temp_pred + 273.15, data.temp_real + 273.15)

Is there any standard way of using this metric that works with negative values?

Normalize both arrays to the range 0 to 1.

If you're using scikit-learn, you can use sklearn.preprocessing.minmax_scale:

minmax_scale(arr, feature_range=(0,1))

Before you do this, save the min and max of arr so you can recover the actual values later:

normalized = (value - arr.min()) / (arr.max() - arr.min()) # Illustration
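As a concrete sketch (with illustrative values): if you scale both arrays with one shared min and max, they are shifted by the same amount, and keeping those statistics lets you invert the transform afterwards:

```python
import numpy as np

temp_real = np.array([-3.0, 2.0, 3.0])   # illustrative values
temp_pred = np.array([-2.0, 2.0, 3.0])

# Use one shared min/max so both arrays are scaled identically
lo = min(temp_real.min(), temp_pred.min())
hi = max(temp_real.max(), temp_pred.max())

real_scaled = (temp_real - lo) / (hi - lo)
pred_scaled = (temp_pred - lo) / (hi - lo)

# Saving lo and hi lets you recover the original values
recovered = real_scaled * (hi - lo) + lo
```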
I'm not sure if this is what you want. From the docs, minmax_scale independently scales each feature, but you want both temp_pred and temp_real to be scaled by the same amount.
– Kyle, Sep 8, 2019 at 8:25

If temp_pred = [-2, 2, 3] and temp_real = [-3, 2, 3], then ideally -3 would scale to 0 and both arrays would scale based on -3 being the lowest value (that is what I meant by "the same"). But since each feature is scaled independently, temp_pred will scale based on -2 being the lowest value and temp_real will scale based on -3 being the lowest value. I have not used minmax_scale extensively, but based on the documentation this is what I think it will do.
– Kyle, Sep 9, 2019 at 17:11

To emphasize: RMSLE is appropriate if you're trying to predict the correct order of magnitude of a variable (e.g. "is it in the hundreds or in the thousands"). When your variable is temperature (degrees Kelvin), as in the example, this is rarely the case: when you're working with temperatures, you usually care about the absolute difference, not just the order of magnitude. A good example where RMSLE is appropriate is the "Boston housing prices" dataset from Kaggle.
– Itamar Mushkin, Sep 8, 2019 at 7:54

I had a similar problem: one of the predictions was negative, although all of the training target values were positive. I narrowed this down to outliers and solved it by using the RobustScaler from sklearn, which not only scales the data but also handles outliers.

Scale features using statistics that are robust to outliers.
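A minimal sketch of that approach, with illustrative data: RobustScaler centers on the median and scales by the interquartile range, so a single outlier does not dominate the scaling, and inverse_transform recovers the original units:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Illustrative training targets with one large outlier
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0]).reshape(-1, 1)

scaler = RobustScaler()            # centers on the median, scales by the IQR
y_scaled = scaler.fit_transform(y)

# inverse_transform maps scaled predictions back to the original units
y_back = scaler.inverse_transform(y_scaled)
```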
