scipy curve_fit gives incorrect results for large X values

Question 1

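To determine a trend over a period of time, I use scipy curve_fit with X values from time.time(), for example 1663847528.7147126 (1.6 billion). A linear fit sometimes produces erroneous results, and providing approximate initial p0 values doesn't help. I found that the magnitude of X is a crucial factor in this error, and I wonder why.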

Here is a simple snippet that shows a working and a non-working X offset:

import scipy.optimize
def fit_func(x, a, b):
    return a + b * x
y = list(range(5))
x = [1e8 + a for a in range(5)]
print(scipy.optimize.curve_fit(fit_func, x, y, p0=[-x[0], 0]))
# Result is correct:
#   (array([-1.e+08,  1.e+00]), array([[ 0., -0.],
#          [-0.,  0.]]))
x = [1e9 + a for a in range(5)]
print(scipy.optimize.curve_fit(fit_func, x, y, p0=[-x[0], 0.0]))
# Result is not correct:
#   OptimizeWarning: Covariance of the parameters could not be estimated
#   warnings.warn('Covariance of the parameters could not be estimated',
#   (array([-4.53788811e+08,  4.53788812e-01]), array([[inf, inf],
#          [inf, inf]]))
# An almost perfect p0 for b removes the warning, but curve_fit still doesn't work
print(scipy.optimize.curve_fit(fit_func, x, y, p0=[-x[0], 0.99]))
# Result is not correct:
#   (array([-7.60846335e+10,  7.60846334e+01]), array([[-1.97051972e+19,  1.97051970e+10],
#          [ 1.97051970e+10, -1.97051968e+01]]))
# ...but perfect p0 works
print(scipy.optimize.curve_fit(fit_func, x, y, p0=[-x[0], 1.0]))
#(array([-1.e+09,  1.e+00]), array([[inf, inf],
#       [inf, inf]]))
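As a side question, is there perhaps a more efficient way to do a linear fit? Sometimes I also want a second-order polynomial fit, though.
Tested with Python 3.9.6 and SciPy 1.7.1 under Windows 10.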

Question 2


          
Root cause

You are facing two problems:

Fitting procedures are scale sensitive. The units chosen for a specific variable (e.g. µA instead of kA) can artificially prevent an algorithm from converging properly (e.g. one variable is several orders of magnitude bigger than another and dominates the regression);

Floating-point arithmetic error. When switching from 1e8 to 1e9, you hit the magnitude at which this kind of error becomes predominant.
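One way to make the scale sensitivity concrete is to look at the condition number of the design matrix of the linear model. The sketch below is an illustration added here rather than part of the original analysis, and the helper name `design_matrix` is hypothetical. It compares the raw x from the question with a standardized copy. A condition number near or above 1/eps (about 4.5e15 for float64) suggests the problem is effectively singular at working precision, and the value computed for the raw case is itself only a rough estimate for that reason.

```python
import numpy as np

def design_matrix(x):
    """Columns [1, x] of the linear model y = a + b * x (hypothetical helper)."""
    return np.column_stack([np.ones_like(x), x])

t = np.arange(5, dtype=float)
x_raw = 1e9 + t                                  # same x as in the question
x_std = (x_raw - x_raw.mean()) / x_raw.std()     # standardized copy

# 2-norm condition numbers (ratio of largest to smallest singular value)
cond_raw = np.linalg.cond(design_matrix(x_raw))
cond_std = np.linalg.cond(design_matrix(x_std))

print(f"raw:          {cond_raw:.3g}")
print(f"standardized: {cond_std:.3g}")
```

After standardization the two columns are orthogonal and have equal norm, so the condition number drops to about 1.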
          
          
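The second problem is really important to realize. Say you are limited to a representation with 8 significant digits. Then 1 000 000 000 and 1 000 000 001 are the same number, because both are limited to the form 1.0000000e9, and we cannot accurately represent 1.0000000_e9, which requires one more digit (the _). This is why your second example fails.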
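The limited-digits thought experiment can be reproduced with NumPy's float32 type, which carries roughly 7 significant decimal digits. This is only an analogy chosen for illustration (the question itself uses float64), sketched under that assumption:

```python
import numpy as np

a, b = 1e9, 1e9 + 1

# float32 keeps roughly 7 significant decimal digits: the two values collide.
same_in_float32 = np.float32(a) == np.float32(b)

# float64 keeps roughly 15: the two values stay distinct.
same_in_float64 = np.float64(a) == np.float64(b)

print(same_in_float32, same_in_float64)   # True False

# Gap between neighbouring representable values around 1e9 for each type
print(np.spacing(np.float32(a)), np.spacing(np.float64(a)))
```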
          
          
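Additionally, you are using a non-linear least squares algorithm to solve a linear least squares problem, which is also somewhat related to your issue.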
          
          
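You have three solutions:

Normalize;

Normalize and change the methodology/algorithm;

Increase the machine precision.

I will go with the first one as it is the most generic; the second was proposed by @blunova and makes total sense; the last is probably an inherent limitation.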
          
          
Normalization

To mitigate both problems, a common solution is normalization. In your case, a simple standardization is enough:

import numpy as np
import scipy.optimize
y = np.arange(5)
x = 1e9 + y
def fit_func(x, a, b):
    return a + b * x
xm = np.mean(x)         # 1000000002.0
xs = np.std(x)          # 1.4142135623730951
result = scipy.optimize.curve_fit(fit_func, (x - xm)/xs, y)
# (array([2.        , 1.41421356]),
# array([[0., 0.],
#        [0., 0.]]))
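# Back transformation (note: here a is the slope and b the intercept,
# i.e. the y = x*a + b convention, the reverse of fit_func's argument order):
a = result[0][1]/xs                    # 1.0
b = result[0][0] - xm*result[0][1]/xs  # -1000000000.0
The same result can be obtained with the sklearn interface: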
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression())
])
pipe.fit(x.reshape(-1, 1), y)
pipe.named_steps["scaler"].mean_          # array([1.e+09])
pipe.named_steps["scaler"].scale_         # array([1.41421356])
pipe.named_steps["regressor"].coef_       # array([1.41421356])
pipe.named_steps["regressor"].intercept_  # 2.0
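Back transformation
When you normalize, the fit result is expressed in terms of the normalized variable. To get the fit parameters you actually want, you only need a bit of algebra to convert the regressed parameters back to the scale of the original variable.
Simply write down the transformation and solve it: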
 y = x'*a' + b'
x' = (x - m)/s
 y = x*a + b
Which gives you the following solution:
a = a'/s
b = b' - m/s*a'
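The two formulas above can be checked numerically. The following is a minimal sketch, assuming the data from the question; the helper name `back_transform` is made up for this example, and `np.polyfit` stands in for whichever linear fit was run on the standardized variable.

```python
import numpy as np

def back_transform(a_prime, b_prime, m, s):
    """Map slope/intercept fitted on x' = (x - m)/s back to the scale of x."""
    a = a_prime / s
    b = b_prime - m / s * a_prime
    return a, b

y = np.arange(5, dtype=float)
x = 1e9 + y
m, s = x.mean(), x.std()

# Linear least squares on the standardized variable.
# np.polyfit returns coefficients from highest degree to lowest: slope, intercept.
a_prime, b_prime = np.polyfit((x - m) / s, y, deg=1)

a, b = back_transform(a_prime, b_prime, m, s)
print(a, b)   # close to 1.0 and -1e9
```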
Precision addendum
NumPy's default floating-point type is float64, which, as you would expect, has about 15 significant digits:
x.dtype                            # dtype('float64')
np.finfo(np.float64).precision     # 15
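But scipy.optimize.curve_fit relies on a least-squares solver (scipy.optimize.leastsq for the default method, scipy.optimize.least_squares when bounds are given), which uses a squared metric to drive the optimization.
Without digging into the details, I suspect this is where the problem occurs: when all values are close to 1e9, you reach the threshold at which floating-point arithmetic error becomes predominant.
So the 1e9 threshold you hit is not about telling apart the numbers in your variable x (float64 has enough precision to represent them almost exactly as distinct values) but about how they are used when solving: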
minimize F(x) = 0.5 * sum(rho(f_i(x)**2), i = 0, ..., m - 1)
subject to lb <= x <= ub
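To see why squaring matters at this magnitude, the sketch below squares a value near 1e9 in float64 and compares it with exact integer arithmetic. It illustrates the suspicion stated above and does not show where exactly the solver loses precision.

```python
import numpy as np

x = 1e9 + 1.0                    # exactly representable in float64
exact = (10**9 + 1) ** 2         # Python integers are exact: 1000000002000000001

squared = x * x                  # float64 product, rounded to the nearest representable value
lost = exact - int(squared)      # what the rounding discarded

print(np.spacing(1e18))          # gap between neighbouring float64 values near 1e18: 128.0
print(lost)                      # 1
```

Near 1e18 consecutive float64 values are 128 apart, so unit-sized differences in x can no longer be resolved once the values are squared.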
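You can also check in the scipy.optimize.least_squares signature that the default tolerances are about eight decades wide (1e-08):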
scipy.optimize.least_squares(fun, x0, jac='2-point', bounds=(- inf, inf),
    method='trf', ftol=1e-08, xtol=1e-08, gtol=1e-08, x_scale=1.0,
    loss='linear', f_scale=1.0, diff_step=None, tr_solver=None, 
    tr_options={}, jac_sparsity=None, max_nfev=None, verbose=0,
    args=(), kwargs={})
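Tuning these may let the algorithm take additional steps before reaching convergence (if it converges at all), but it will not replace or beat the benefit of normalization.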
Methods comparison
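Interestingly, the scipy.stats.linregress method is scale tolerant by design. It uses variable normalization, pure linear algebra and a numerical stability trick (see the TINY variable) to solve the LS problem even under problematic conditions.
This of course contrasts with scipy.optimize.curve_fit, which is an NLLS solver implemented as an iterative optimization algorithm (see the Levenberg-Marquardt algorithm, which blends Gauss-Newton and gradient descent steps).
If you stick to linear least squares problems (linear in the parameters, not in the variables, so a second-order polynomial is still LLS), then a dedicated LLS method may be a simpler choice, since it handles normalization for you.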
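For the second-order polynomial mentioned in the question, one option not covered in this answer is `numpy.polynomial.Polynomial.fit`, which rescales x onto the window [-1, 1] before solving the linear least squares problem. The sketch below uses synthetic data made up for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

t = np.arange(5, dtype=float)
x = 1e9 + t
y = 0.5 * t**2 + 2.0 * t + 1.0    # synthetic second-order data, made up for this example

# Polynomial.fit maps x onto [-1, 1] internally, then solves a linear least squares problem.
p = Polynomial.fit(x, y, deg=2)

# The fitted object can be evaluated directly on the original x values.
residual = np.max(np.abs(p(x) - y))
print(residual)
```

The coefficients stored in `p` refer to the rescaled variable. `p.convert()` expresses them in terms of the raw x, but for x around 1e9 those coefficients are huge and poorly conditioned, so evaluating `p(x)` directly is usually the safer route.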

Question 3


          
           
            
             
              
               
                
                 
If you only need to compute a linear fit, I believe curve_fit is unnecessary, and I would simply use SciPy's linregress function instead:
>>> from scipy import stats
>>> y = list(range(5))
>>> x = [1e8 + a for a in range(5)]
>>> stats.linregress(x, y)
LinregressResult(slope=1.0, intercept=-100000000.0, rvalue=1.0, pvalue=1.2004217548761408e-30, stderr=0.0, intercept_stderr=0.0)
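The same call can be tried with the 1e9 offset that broke curve_fit in the question. This is a small sketch assuming the question's data; the exact printed digits may vary slightly between SciPy versions, so only approximate values are noted.

```python
import numpy as np
from scipy import stats

y = np.arange(5, dtype=float)
x = 1e9 + y                       # the offset that broke curve_fit in the question

res = stats.linregress(x, y)
print(res.slope, res.intercept)   # close to 1.0 and -1e9
```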