GAM (Generalized Additive Model): Overview and Implementation in R (textboy's blog)

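Chinese-language material on GAMs is remarkably scarce, so nearly everything has to be found abroad. I went through somewhere between 50 and 100 websites, plus a good number of papers and other references, and compiled the notes below. Discussion is welcome.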

GAM: Generalized additive model

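Concept	Some or all of the predictors in a regression model enter through smooth functions, which reduces the model risk of imposing a linear specification. The assumptions are weak: the predictors need not be linearly related to the response (the relationship may be linear or nonlinear). Because the model is additive, it sidesteps the curse of dimensionality that a fully nonparametric regression runs into when there are many explanatory variables. Smooth functions are applied to continuous explanatory variables. * http://plantecology.syr.edu/fridley/bio793/gam.html
Equation	$g(E[y]) = \beta_0 + f_1(x_1) + \dots + f_p(x_p)$, where g is a link function, the observations y are independent, and each $f_i(x_i)$ is an unknown smooth function that replaces the term $\beta_i x_i$ of classical linear regression (an unspecified nonparametric function replaces a single coefficient). Few requirements are placed on the sample, so the model is widely applicable.
Estimation	Least squares, likelihood
Goodness of fit	Pseudo coefficient (PCf): PCf = 1 - RD / ND, where RD is the residual deviance and ND is the null deviance.
Types	Additive/nonparametric: every predictor enters through a smooth function. Parametric: every predictor enters linearly. Semiparametric/partial linear: some predictors enter linearly and the rest through smooth functions. Thin-plate spline: a joint smooth f(x1, x2), which allows for interaction between two predictors.
Considerations	If x1 and x2 are not independent and interact, model them jointly with a thin-plate spline f(x1, x2). Not every term in the model has to be nonlinear: making all of them nonlinear increases the computational cost and the risk of overfitting. Check whether x_i is linearly related to y to decide whether it needs a smooth function. The choice should follow statistical and operational considerations.
Smooth functions	See "Spline functions" below.
Drawbacks	The lack of a parametric functional form makes it difficult to score new data directly.
Q&A	How should smooth terms be defined in mgcv's gam in R? There are competing philosophies, from "Try everything and go with the one that produces the best fit" (as measured by something like AIC) to "Write the one model that best reflects your understanding of the data-generating process and use it."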
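The pseudo coefficient from the "Goodness of fit" row can be computed directly from a fitted model. The sketch below is only an illustration: it uses a plain glm on the built-in mtcars data (both are my assumptions, not part of the original notes). A gam object from mgcv stores the same two deviances, so the same arithmetic applies there.

```r
# Pseudo coefficient PCf = 1 - RD / ND from a fitted model.
# The data set (mtcars) and the formula are illustrative assumptions.
fit <- glm(am ~ wt, data = mtcars, family = binomial())

rd <- deviance(fit)       # residual deviance (RD)
nd <- fit$null.deviance   # null deviance (ND): intercept-only model
pcf <- 1 - rd / nd

cat(sprintf("RD = %.3f, ND = %.3f, PCf = %.3f\n", rd, nd, pcf))
```

A PCf close to 1 means the model explains most of the null deviance; a value close to 0 means it barely improves on the intercept-only model.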

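Generalized cross-validation (GCV)

The basic idea: when any single entry b_i of the measurement vector b in Ax = b is left out, the chosen regularization parameter should allow the omitted entry to be predicted well from the remaining data.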
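As a minimal sketch of this idea, the code below picks the ridge (Tikhonov) regularization parameter for a simulated system Ax = b by minimizing the usual GCV score n * RSS / (n - tr(H))^2, where H is the influence matrix of the regularized fit. The simulated data, the ridge penalty and the grid of candidate values are all assumptions chosen for illustration.

```r
# GCV for choosing a regularization parameter in Ax = b (ridge penalty).
# Data, penalty and lambda grid are illustrative assumptions.
set.seed(1)
n <- 60
p <- 8
A <- matrix(rnorm(n * p), n, p)
x_true <- c(2, -1, 0.5, rep(0, p - 3))
b <- drop(A %*% x_true + rnorm(n, sd = 0.5))

gcv_score <- function(lambda) {
  # Influence ("hat") matrix of the regularized fit: b_hat = H b
  H <- A %*% solve(crossprod(A) + lambda * diag(p), t(A))
  rss <- sum((b - H %*% b)^2)
  # GCV approximates leave-one-out prediction error without refitting n times
  n * rss / (n - sum(diag(H)))^2
}

lambdas <- 10^seq(-3, 3, length.out = 25)
scores <- vapply(lambdas, gcv_score, numeric(1))
best_lambda <- lambdas[which.min(scores)]

cat("lambda with the lowest GCV score:", best_lambda, "\n")
```

mgcv applies the same principle to the smoothing parameters of a GAM when method = "GCV.Cp" is used.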

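Mallows' Cp (the Cp criterion)

A statistic used to help choose among several candidate regression models: $C_p = SSE_p / S^2 - (n - 2p)$, where $SSE_p$ is the error sum of squares of the candidate model with p parameters, $S^2$ is the mean squared error of the full model, and n is the sample size.

Note: comparing regression models with Mallows' Cp is only valid when the candidates are drawn from the same full set of predictors.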
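The formula can be checked with a few lines of base R. This is a sketch under assumptions: the mtcars data and the three candidate models are mine, and the error variance is estimated from the largest model, as is conventional.

```r
# Mallows' Cp = SSE_p / S^2 - (n - 2p) for a few nested candidates.
# Data set and candidate formulas are illustrative assumptions.
full <- lm(mpg ~ wt + hp + qsec, data = mtcars)
n <- nrow(mtcars)
s2 <- sum(residuals(full)^2) / df.residual(full)  # S^2 from the full model

mallows_cp <- function(model) {
  p <- length(coef(model))          # parameters, including the intercept
  sse_p <- sum(residuals(model)^2)
  sse_p / s2 - (n - 2 * p)
}

candidates <- list(
  wt    = lm(mpg ~ wt, data = mtcars),
  wt_hp = lm(mpg ~ wt + hp, data = mtcars),
  full  = full
)
cp <- vapply(candidates, mallows_cp, numeric(1))
print(round(cp, 2))
```

For the full model Cp equals its number of parameters by construction, so the statistic is only informative for the smaller candidates, where a Cp close to p suggests little bias.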

Combining with a scorecard

S0 = Intercept (only for the Bernoulli likelihood objective function)

c1, c2, ..., cp = Scorecard characteristics

S1, S2, ..., Sq = Score weights associated with the bins of a characteristic

X1, X2, ..., Xq = Dummy indicator variables for the bins of a characteristic

The key is how the score weights are set.

Distribution of Y	Link function	f(Y), applied to the mean of Y
Normal	Identity	Y
Binomial	Logit	Logit(Y)
Poisson	Log	Log(Y)
Gamma	Inverse	1/Y
Negative binomial	Log	Log(Y)
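The default links in the table can be read off R's family objects. The sketch below covers the four families that ship with base R; the negative binomial family is not in base R (it is provided by add-on packages), so it is left out here.

```r
# Default link functions of the base R families listed in the table.
families <- list(
  normal   = gaussian(),
  binomial = binomial(),
  poisson  = poisson(),
  gamma    = Gamma()
)

# Name of each family's default link
links <- vapply(families, function(f) f$link, character(1))
print(links)

# Apply each link to the same mean value mu = 0.5
mu <- 0.5
eta <- vapply(families, function(f) f$linkfun(mu), numeric(1))
print(eta)
```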

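Spline functions

Concept: in the early days, draftsmen fixed a thin, flexible strip of wood (the "spline") at the sample points with weights, let it bend freely everywhere else, and then traced along the strip. The resulting curve is called a spline curve.

A spline is a piecewise smooth function that also satisfies smoothness conditions where the pieces join. It has good numerical stability and convergence properties.

Splines can be of any degree; quadratic and cubic splines are the most common.

(1) Cubic spline interpolation (cubic smoothing spline)

Definition: if $S(x) \in C^2[a, b]$ and $S(x)$ is a cubic polynomial on each subinterval $[x_j, x_{j+1}]$, where $a = x_0 < x_1 < \dots < x_n = b$ are given knots, then $S(x)$ is called a cubic spline on the knots $x_0, x_1, \dots, x_n$.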

. To the left of the sequence of knots, a natural cubic spline is a line (the same holds to the right of the last knot).

. Between knots, a natural cubic spline is a third degree polynomial curve. Hence the cubic in the name.

. At the knots, the curve must be continuous. At the knots, the derivative also must be continuous (no corner). At the knots, the second derivative must be continuous.
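These three properties can be checked numerically with base R's splinefun using method = "natural". The knot locations and values below are made up for illustration.

```r
# Natural cubic spline through a handful of made-up knots.
knots_x <- 0:5
knots_y <- c(1, 3, 2, 5, 4, 6)
s <- splinefun(knots_x, knots_y, method = "natural")

# The curve passes through the knots
print(s(knots_x))

# Outside the knots the spline is a straight line: second derivative is 0
print(s(c(-2, -1, 7, 8), deriv = 2))

# At an interior knot the second derivative is continuous:
# the values just left and just right of x = 2 agree
eps <- 1e-6
print(c(left = s(2 - eps, deriv = 2), right = s(2 + eps, deriv = 2)))
```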

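(2) Cyclic spline

Cyclic splines live on a "circle": for example, the covariate takes values in the interval [0, 1) with 0 and 1 identified, so the fitted curve joins up at the two ends. Examples are the cyclic cubic regression spline and the cyclic P-spline.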
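A minimal sketch of a cyclic smooth with mgcv, assuming simulated periodic data on [0, 1). The two values passed through knots tell the cyclic cubic basis where the circle closes; the sample size, noise level and k are arbitrary choices.

```r
library(mgcv)

# Simulated periodic signal on [0, 1); all settings are illustrative.
set.seed(42)
n <- 300
d <- data.frame(x = runif(n))
d$y <- sin(2 * pi * d$x) + rnorm(n, sd = 0.3)

# bs = "cc": cyclic cubic regression spline; the end points 0 and 1
# are supplied so that the fitted curve is periodic on exactly [0, 1]
fit_cc <- gam(y ~ s(x, bs = "cc", k = 10), data = d,
              knots = list(x = c(0, 1)), method = "REML")

# The fitted curve takes the same value at both ends of the interval
ends <- predict(fit_cc, newdata = data.frame(x = c(0, 1)))
print(ends)
```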

Concept	Separate cubic polynomials are fit at each section and then joined at the knots to create a continuous curve.
	The flexibility of each smooth is measured by its effective degrees of freedom (edf). In typical OLS regression the model degrees of freedom are equivalent to the number of predictors/terms in the model.
	s(Girth, Height) # Girth and Height are not independent; they interact
	gam(Overall ~ Income + Edu + Health, data = d) # with no smooth terms this is the same as glm
	Smooth terms are simply the predictors to which a smooth function has been applied, e.g. s(agecont), te(Month, Age).
	http://www.rdocumentation.org/packages/mgcv/functions/gam
gam syntax	gam(y ~ s(x, k = , bs = )) / gam(y ~ te(x, k = , bs = ))
	k: sets the dimension of the basis used for each smooth term (see choose.k). The smooths are penalized regression smoothers, so k is an upper limit on flexibility. k can be any number; the default is normally 10. To check a choice, refit with a substantially increased k and see whether there is a pattern in the residuals that the larger k could explain.
	bs: the basis type; see smooth.terms for the full list. tp (default): thin plate regression spline; cr: penalized cubic regression spline; cs: shrinkage version of cr; cc: cyclic cubic regression spline; ps: P-spline; cp: cyclic P-spline; ad: adaptive smoothing; fs: factor smooth interaction.
	s: smooth term, s(covariate, k = ...); te: tensor product smooth.
	gam(formula,family=gaussian(),data=list(),weights=NULL,subset=NULL, na.action,offset=NULL,method="GCV.Cp", optimizer=c("outer","newton"),control=list(),scale=0, select=FALSE,knots=NULL,sp=NULL,min.sp=NULL,H=NULL,gamma=1, fit=TRUE,paraPen=NULL,G=NULL,in.out,...)
	offset: can be used to supply a model offset for use in fitting. Note that this offset will always be completely ignored when predicting, unlike an offset included in formula.
	control: a list of fit control parameters to replace the defaults returned by gam.control.
	method: the smoothing parameter estimation method, e.g. "GCV.Cp", "GACV.Cp", "REML", "P-REML", "ML", "P-ML" (ML = maximum likelihood, REML = restricted maximum likelihood).
	fit: if TRUE, gam sets up the model and fits it; if FALSE, the model is set up and an object G containing what would be required to fit it is returned.
	gamma: multiplier to inflate the degrees of freedom in the GCV/UBRE/AIC score.
	select: TRUE adds an extra penalty to each term so that it can be penalized to zero.
	s(x1, by = x2): a separate smooth of x1 for each value of x2, e.g. with Loc = America and Doy = as.numeric(format(Date, format = "%j")), use s(Doy, by = Loc).
test	gam.check(b) # basis dimension check; k' = k - 1
	summary(gammodel) # (1) GCV: lower is better. (2) R-sq.(adj): closer to 1 is better.
	AIC(mod_1d, mod_2d) # (3) lower is better
	anova(b) # Wald-like tests
	anova(mod_1d, mod_2d, test = "Chisq") # prefer the lower residual deviance
	anova(b, b1, test = "F") # (4) select the significant one
plot	plot(mod_gam2, pages = 1, residuals = TRUE, shade = TRUE, col = '#FF8000')
	vis.gam(mod_gam2, type = "response", plot.type = "contour")
	vis.gam(mod_gam2, type = "response", plot.type = "persp", border = NA, phi = 30, theta = 30)
	* If the graph looks noisy, the smooth function may not be suitable.
	* http://stats.stackexchange.com/questions/14746/what-does-the-dashed-bounds-mean-when-plotting-a-contour-plot-with-r-gam
Q&A	Err: - not meaningful for factors in: Ops.factor(xx, shift[i])
	A: A factor is being smoothed, which isn't supported. "Smooth" means that f(x_1) must be close to f(x_2) when x_1 is close to x_2; if a factor has levels "brick", "sky" and "purple", how far is it from "brick" to "purple"?
	Err: A term has fewer unique covariate combinations than specified maximum degrees of freedom / basis dimension is larger than number of unique covariates
	A: The covariate (or combination of covariates) in the smooth takes fewer distinct values than the basis dimension k asks for. Reduce k, or treat the variable as a factor.
	Q: How to choose a proper smoothing spline (bs = '?')
	A: 1) Use the default; 2) use a tensor product of "cr" smooths for bivariate smoothing, i.e. te(x1, x2, bs = "cr").
Summary	Formula: LN_Brutto ~ s(agecont, by = Sex) + factor(Sex) + te(Month, Age) + s(Month, by = Sex)

Parametric coefficients:
              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    4.32057     0.01071   403.34    <2e-16 ***
factor(Sex)m   0.27708     0.01376    20.14    <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                     edf   Ref.df       F   p-value
s(agecont):Sexf   8.1611   8.7526  20.170   < 2e-16 ***
s(agecont):Sexm   6.6695   7.5523  32.689   < 2e-16 ***
te(Month,Age)    10.3651  12.7201   6.784  2.19e-12 ***
s(Month):Sexf     0.9701   0.9701   0.641     0.430
s(Month):Sexm     1.3750   1.6855   0.193     0.787
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Rank: 60/62
R-sq.(adj) = 0.781   Deviance explained = 78.7%
GCV = 0.048221   Scale est. = 0.046918   n = 1093
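The calls listed in the table can be put together into one small workflow. The sketch below runs on simulated data; the variable names x, g and y, the choices k = 15 and bs = "cr", and the use of REML are my assumptions for illustration, not settings from the model summarized above.

```r
library(mgcv)

# Simulated data: the effect of x is a sine curve in group "f"
# and a straight line in group "m". All settings are illustrative.
set.seed(7)
n <- 400
d <- data.frame(
  x = runif(n),
  g = factor(sample(c("f", "m"), n, replace = TRUE))  # factors via factor()
)
d$y <- ifelse(d$g == "f", sin(2 * pi * d$x), 2 * d$x) + rnorm(n, sd = 0.3)

# 1. Without smooth terms, gam() gives the same fit as glm()
m_lin_gam <- gam(y ~ x + g, data = d)
m_lin_glm <- glm(y ~ x + g, data = d)

# 2. One smooth of x shared by both groups; the factor stays parametric
m_smooth <- gam(y ~ s(x, k = 15, bs = "cr") + g, data = d, method = "REML")

# 3. A separate smooth of x for each level of g (by-variable smooth)
m_by <- gam(y ~ s(x, by = g, k = 15, bs = "cr") + g, data = d,
            method = "REML")

# Model comparison: lower AIC is better
aic <- AIC(m_lin_gam, m_smooth, m_by)
print(aic)
print(anova(m_smooth, m_by, test = "F"))

# edf and approximate significance of each smooth term
print(summary(m_by)$s.table)
```

The factor enters both as a parametric term and as the by variable. Smooths with a factor by variable are centered, so the parametric term is what carries the difference in level between the groups.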

References:
Chapter 7 of http://www-bcf.usc.edu/~gareth/ISL/
https://en.wikipedia.org/wiki/Spline_(mathematics)
http://web.as.uky.edu/statistics/users/pbreheny/621/F10/notes/11-4.pdf
http://learning.cis.upenn.edu/c
