傲视众生的凉面 · 微服务应用性能分析实战13 ...· 4 月前 · |
细心的冲锋衣 · Java Swing布局管理器(详解版)· 1 年前 · |
潇洒的饼干 · MySQL递归查询上下级菜单 - 个人文章 ...· 1 年前 · |
博学的企鹅 · Socket浅析通信原理和java编程(一) ...· 1 年前 · |
有腹肌的啄木鸟 · ID选择,你做对了吗? - ...· 1 年前 · |
Please note that the source “Glottolog” represents the Glottolog v4.6 trees (Hammarström et al., 2022 ) downloaded and preprocessed using the glottoTrees package (Round, 2021 ), with added branch length using the following methods (see Round, 2022 for details): “original” (all branch length are equal), “exponential” (branch lengths are exponentially distributed: 1/2 k for the kth deepest branch), and “ultrametric” (rescaling the terminal branches so that all tips are equally distant from the root); please note that these branch lengths do not affect the tree topology but just the lengths of the branches. Jäger ( 2018 ) provides a global phylogeny, but also individual phylogenies for each language family. Grollemund et al. ( 2015 ) provides one summary and 100 individual posterior trees for Bantu (a subgroup of Atlantic-Congo). Gray et al. ( 2009 ) provides one summary and 1,000 individual posterior trees for Austronesian. Chang et al. ( 2015 ) provides one summary and 1,000 individual posterior trees for Indo-European, but all languages have a word for “blue,” making these trees unusable for the phylogenetic methods. Bouckaert et al. ( 2018 ) provides one summary and 1,000 individual posterior trees for Pama-Nyungan. Zhang et al. ( 2019 ) provides one Maximum Clade Credibility (MCC) tree for Sino-Tibetan. Hruschka et al. ( 2015 ) provides one summary and 100 individual posterior trees for Turkic. Honkola et al. ( 2013 ) and Jäger ( 2018 ) provide one tree, and one summary and 1,000 individual posterior trees, respectively, for Uralic, but all languages have a word for “blue,” making these trees unusable for the phylogenetic methods. Both Jäger ( 2018 ) and Bouckaert et al. ( 2022 ) provide world-wide “global” phylogenies constructed using very different methods and data.
Most of the methods used here build incrementally on those used by Josserand et al. ( 2021 ), with the exception of the phylogenetic methods. First, there is the now “standard” mixed-effects/hierarchical logistic regressions approach, where one regresses the binary dependent variable blue (i.e., does the language have a specific word for “blue”?) on various (combinations of) predictors (such as the mean UV-B incidence), with controlling for “Galton's problem” and language contact by having language family and macroarea as random effects (Jaeger et al., 2011 ; Ladd et al., 2015 ; Josserand et al., 2021 ). These regressions were preferentially performed in a Bayesian framework (using brms in R ; Bürkner, 2018 ), but also using a frequentist approach (using glmer ; Bates et al., 2015 ) in some cases. In both frameworks, model comparison (which of two models should be considered “better”?), model simplification (starting from a “full” model containing a set of potential predictors, removing the predictors that do no contribute “significantly,” and retaining only those that do), and variable selection (does an individual predictor “significantly” help predicting the dependent variable?) were performed. In the frequentist framework, the p -values reported by glmer() for individual predictors (based on the Wald Z -test) and the p -values reported by anova() (based on the likelihood ratio test) and Δ AIC (difference in Akaike Information Crierion scores) for model comparison were used throughout (the α-level was 0.05 and the threshold for Δ AIC was 3). In the Bayesian framework, model comparison was based on BFs (Bayes factors), WAIC (the Widely Applicable Information Criterion or the Watanabe-Akaike Information Criterion), LOO (Leave-One-Out cross-validation), and KFOLD ( k -fold cross-validation, with k = 10) as implemented by bayes_factor() in brms and by loo_compare() in loo (Vehtari et al., 2022 ). For BFs, the cutoff was , while for the others, the cutoff was 4.0 points. Please note that there might be differences between BFs, on the one hand, and WAIC/LOO/KFOLD, on the other, due to the default use of improper priors (see, for example, here ) and to intrinsic differences in what these indices capture (McElreath, 2020 ), such that the decisions here were based on a combination of these indices. For model simplification and variable selection, the posterior distribution of the predictor of interest vis-à-vis 0.0 (judged jointly from the posterior plot and the 95% Highest Density Interval) and formal hypothesis tests against 0 [either directional, when a direction is a priori hypothesized, or punctual; please note that this is the posterior probability that the variable is in the given relationship with 0 or the posterior probability that the variable is not 0, respectively as given by hypothesis() in brms ], supplemented by model comparison (as described above), were used. To control for “Galton's problem,” the family as a random effect (most models) was included, but also a model where the “global” language phylogeny of Jäger ( 2018 ) and the associated phylogenetic variance-covariance matrix were as a grouping term (using brms 's gr() syntax, and ape 's vcv.phylo() function; Paradis and Schliep, 2019 ) was run. Likewise, to control for contact, macroarea as a random effect (most models) was included, but also a model where a 2D Gaussian process (one per macroarea, using brms 's grouping gr() function by longitude, latitude, and macroarea), as suggested in McElreath ( 2020 ) and Naranjo and Becker ( 2022 ), was run. Moreover, one extra model was fitted, where both the “global” language phylogeny of Jäger ( 2018 ) and the 2D Gaussian process, as described above, were included.
Second, mediation analyses were conducted, which can quantify the direct and the indirect (or mediated) effects of a treatment ( T ) on an outcome ( O ) possibly mediated by a mediator ( M ). Thus, there is a direct effect (with strength a , represented as ) and an indirect effect “flowing” through M ( , with two components of strengths b and c , respectively), with the total effect (i.e., the overall influence of T on O , ) of strength a + b × c . These are estimated here by fitting the two mixed-effects regressions (with family and macroarea as random effects) to the data jointly (using R 's notation):
These were fitted in a Bayesian framework (using brms ), estimating, for each individual component ( T → O , T → M , and M → O ), its strength ( a , b , and c , respectively), as well as their 95% HDIs, and their “significance” was judged based on the inclusion of 0 in the 95% HDI; for the effects (total, direct, and indirect), their strength ( a + b × c , a , and b × c , respectively) was estimated, as well as their 95% HDIs, and their “significance” was judged based on the inclusion of 0 in the 95% HDI and the posterior probability of the hypothesis p (estimate = 0) (using hypothesis() in brms ). These mediation models were also fitted using piecewise Structural Equation Models in a frequentist framework (using package piecewiseSEM in R ; Lefcheck, 2016 ), which allows not only the estimation of the total, direct, and indirect effects (with bootstrapping 95% CIs and p -values) and of a , b , and c (with standard errors and p -values) but also to test the existence of the direct effect using d-separation (within Judea Pearl's causality framework; Lefcheck, 2016 ; Pearl and Mackenzie, 2018 ). Please note that only those mediation models that make sense theoretically and where the three components ( T → O , T → M , and M → O ) were individually “significant” (as regressions), or when they were of particular a priori theoretical importance were actually estimated.
Third, path analysis (Wright, 1934 ) models were fitted using lavaan (Rosseel, 2012 ), which model those relationships that are theoretically important (see Josserand et al., 2021 for details) to the primary hypothesis. While this method allows the simultaneous modeling of multiple influences (paths) between several variables (which the mediation approach does not), it cannot (at the moment) control for the effects of family and macroarea (as the mediation models do); moreover, this was fitted in a frequentist framework. To address some of these limitations, path analysis was also conducted in a piecewise Structural Equation Models framework where the individual regressions composing the model are fitted simultaneously either in a frequentist (using piecewiseSEM ) or Bayesian (using brms ) approach, which allow the inclusion of family and macroareas as random effects and the use of generalized linear models (in particular, of logistic regression; N.B ., piecewiseSEM does currently have some limitations that might affect the use of dichotomous variables). However, it can be argued that some of the predictors are, in fact, indirect measurements of the unmeasured latent variables that presumably play the causal role, in particular UV-B incidence (captured by its mean and standard deviation), “ cultural complexity” (partly captured by subsistence and population size), and possibly climate (captured by various climate PCs and humidity). Therefore, Structural Equation Models with latent variables were also implemented using lavaan with the aforementioned limitations, with the partial exception of also conducting a multi-group analysis using macroarea as a grouping factor, which allows the estimation of separate parameters for each macroarea.
Fourth, various techniques were employed to check which of the many potential predictors of blue do, in effect, predict it. For all these techniques, the full dataset was split randomly into a training (80% of datapoints) and testing (remaining 20%) datasets, 100 times (this allows the testing of how well the model generalizes to new “unseen” data). Then, Bayesian multiple logistic regression with manual model simplification (as implemented by brms ), conditional inference trees (as implemented by ctree() in package partykit ; Hothorn and Zeileis, 2015 ), random forests (as implemented by randomForest() in package randomForest ; Liaw and Wiener, 2002 ), conditional random forests (as implemented by cforest() in package partykit ), and Support Vector Machines [SVMs, as implemented by fit(…,model=~svm~) in the rminer package; Cortez, 2020 ] were fitted.
Finally, several phylogenetic analyses were performed, as follows. The phylogenetic signal of blue was estimated using three methods: the Fritz and Purvis ( 2010 )' D , as implemented by phylo.d() in package caper (Orme et al., 2018 ), which provides a numeric estimate D of the phylogenetic signal and also two p -values associated with the hypotheses ( D = 0 that the character is “clumped,” evolving on the phylogeny under a Brownian motion model and D = 1 that the character is random relative to the phylogeny, respectively). The remaining two methods are based on performing the logistic phylogenetic regression of blue with no predictors, as implemented by binaryPGLMM() in package ape [Paradis and Schliep, 2019 ; which gives the “phylogenetic signal measured as the scalar magnitude of the phylogenetic variance-covariance matrix s2 * V” (denoted here as s2 ) and the p -value of the “likelihood ratio test of the hypothesis H0 that s2 = 0”], and by phyloglm() in package phylolm [Ho and Ane, 2014 ; using Ives and Garland, 2009 's method, which uses “alpha to estimate the level of phylogenetic correlation” (denoted here as α); this might come with a warning if α is too close to its limits, in which case, this probably means that the phylogenetic signal is, in fact, negligible]. Then, ancestral state reconstruction for blue was performed (estimating the probability that a proto-language had a dedicated word for “blue”) using two methods: the one implemented by ace() in package ape and based on Pagel ( 1994 ) (both the single-rate, ER, equivalent in this case to the symmetric, SYM, model, and the all-rates, ARD, model were used; both estimate the appropriate transition rate(s), 1 for ER and 2 for ARD, and the probability of a “0,” i.e., the absence of “blue,” at the root; furthermore, the two methods were compared, using the Likelihood Ratio test and AIC, retaining the one best fitting the data), and the one implemented by rerootingMethod() in package phytools and based on Yang et al. ( 1995 ) (which estimates the marginal ancestral state estimates by re-rooting the tree; this works only for symmetric models, in this case ER, and gives the transition rate and the probability of a “0” at the root). Also, the correlated evolution of blue with all its potential predictors was estimated using two methods: one implemented by fitPagel() in package phytools based on Pagel ( 1994 ) (this only works for binary characters, so the continuous predictors were dichotomized using median split, i.e., all values <the median → “0,” all others → “1”), and the threshold model as implemented by threshBayes() also in package phytools and based on Felsenstein ( 2012 ) (this is a Bayesian method which works with both discrete and continuous characters). Finally, phylogenetic regression of blue on all its potential predictors of interest was performed, using three methods: the Phylogenetic Generalized Linear Mixed Model for Binary Data as implemented by binaryPGLMM() in package ape , the Phylogenetic Generalized Linear Model as implemented by phyloglm() in package phylolm (implementing the phylogenetic logistic regression of Ives and Garland, 2009 with both an optimized GEE approximation to the penalized likelihood of the logistic regression and the maximization of the penalized likelihood of the logistic regression methods), and the Bayesian logistic regression controlling for phylogeny as implemented in package brms , using gr(glottocode, cov = A,) where A is the phylogenetic variance-covariance matrix of the language family (in this case, the “flat” logistic regressions which completely disregard the phylogenetic information was also estimated, providing a baseline test of the relationship between blue and the considered predictor while ignoring Galton's problem).
There are 834 unique glottocodes, distributed as shown in Figure 1 . They belong to 155 unique language families (as per Glottolog , Hammarström et al., 2022 ), but the distribution is highly skewed, with most languages belonging to the Austronesian (glottocode aust1307 ; 134 languages), Indo-European (glottocode indo1319 ; 86 languages), Sino-Tibetan ( sino1245 ; 85), Afro-Asiatic ( afro1255 ; 51), and Pama-Nyungan ( pama1250 ; 48), while 10 have only three languages, 8 have just two languages, and 110 only one, reflecting by and large the actual distribution of languages across families. Likewise, the distribution of the languages across the six Glottolog macorareas is uneven: ordered by decreasing number of languages, there are 350 languages in Eurasia , 187 in Papunesia , 88 in South America , 86 in Africa , 74 in Australia , and 49 in North America . Therefore, this extension of the database, from 142 unique datapoints (i.e., unique glottocodes) in 32 unique families to 834 unique datapoints in 155 unique families, resulted in a 5.87 times (or 487.3%) overall increase, both in terms of new language families added (123, of which most contain less than five languages but three are rather large: Nakh-Daghestanian, 32 languages, Timor-Alor-Pantar, 25, and Hmong-Mien, 25) as well as by adding new languages to existing families (mostly with just a few new languages, with the exception of Austronesian, 134 vs. 9; Sino-Tibetan, 8 vs. 9; Pama-Nyungan, 48 vs. 1; Indo-European, 86 vs. 41; Afro-Asiatic, 51 vs. 13; Tai-Kadai, 25 vs. 1; Austroasiatic, 25 vs. 3; and Uralic, 28 vs. 9). All macroareas have now many more languages, with the most dramatic increases for Australia (74 vs. 2 or an 3600.0% increase) and Papunesia (187 vs. 9, 1977.8%), followed by South America (88 vs. 12, 633.3%), North America (49 vs. 9, 444.4%), Eurasia (350 vs. 79, 343.0%), and Africa (86 vs. 31, 177.4%).
The binary variable blue , coding the presence (“yes”) or not (“no”) of a dedicated word for “blue” in a given language, was coded for all the 834 languages, of which 549 (65.8%) do have such a word (i.e., blue is “yes”) and the remaining 285 (34.2%) do not. Visually ( Figure 1 ), their distribution seems to be spatially non-random, with the majority of languages without a word for “blue” seemingly clustered closer to the equator. However, this impression can be misleading due to various confounding factors (Ladd et al., 2015 ), paramount being “Galton's problem” (Mace and Holden, 2005 ) and language contact. The first refers to the fact that related languages (i.e., languages from the same family) are not independent, as they may inherit some of their characteristics from the family's proto-language, while the second refers to the fact that languages in contact may come to share characteristics as well.
When using TOMS as a source of data, information was recovered for all 834 (100%) languages. The mean UV-B incidence (denoted here UV mT ) varies between a minimum of 70.9 mW/m 2 and a maximum of 238.6 mW/m 2 , with a mean of 208.5 mW/m 2 and a median of 221 mW/m 2 , and a standard deviation of 29.0 mW/m 2 and an inter-quartile range (IQR) of 16.1 mW/m 2 . As can be seen in Figure 2 , UV mT is sharply skewed toward high values, reflecting the relatively small number of languages at very high latitudes.
The standard deviation of the UV-B incidence ( UV sT ) varies between a minimum of 1.1 mW/m 2 and a maximum of 71.2 mW/m 2 , with a mean of 16.4 mW/m 2 and a median of 7.6 mW/m 2 , and a standard deviation of 17.5 mW/m 2 and an IQR of 12.9 mW/m 2 . As can be seen in Supplementary Figure 2 , UV sT is sharply skewed toward low values, essentially because most languages in the dataset have a low seasonal variation in UV-B incidence.
When using WorldClim as a source of data, information was recovered for 829 (99.4%) languages. The mean UV-B incidence ( UV mW ) varies between a minimum of 6,780 kJ/m 2 day and a maximum of 22,681 kJ/m 2 day, with a mean of 16,499 kJ/m 2 day and a median of 17,045 kJ/m 2 day, and a standard deviation of 3470.2 kJ/m 2 day, and an IQR of 5,270 kJ/m 2 day. Its standard deviation ( UV sW ) varies between a minimum of 448.5 kJ/m 2 day and a maximum of 8,625 kJ/m 2 day, with a mean of 3,460 kJ/m 2 day and a median of 2,490 kJ/m 2 day, and a standard deviation of 2156.6 kJ/m 2 day and an IQR of 3981.4 kJ/m 2 day. Please see Supplementary Figures 3 , 4 .
There is a negative correlation between the mean and sd of the UV-B incidence (in TOMS : Pearson's r = −0.96, p = 0; Spearman's ρ = −0.75, p = 1.26·10 −149 ; and in WorldClim : r = −0.55, p = 7.13·10 −67 , ρ = −0.43, p = 1.26·10 −38 ) as expected due to the relationship between latitude and seasonality. As hinted by Pearson's r and Spearman's ρ and shown in Supplementary Figure 5 , this relationship is clearer for TOMS , and, for both databases, it is non-linear.
As expected, the two databases are positively correlated with each other (for mean UV-B incidence: r = 0.78, p = 5.00·10 −170 , ρ = 0.64, p = 4.78·10 −96 ; for sd UV-B incidence: r = 0.86, p = 1.64·10 −249 , ρ = 0.75, p = 2.49·10 −148 ), but the relationship is far from perfect and is non-linear (see Supplementary Figure 6 ), suggesting that the two databases do not capture the same information about UV-B light incidence.
Elevation was available for all 834 (100%) languages and was heavily skewed toward low altitudes (see Supplementary Figure 7 ), ranging between −6 and 5,161 m, with a mean of 652.1 m, a median of 336 m, a standard deviation of 823.4 m, and an IQR of 746.8 m.
Climate data from WorldClim were available for all but three languages. The first Principal Component, PC 1 , of the climate variables, explains 53.9% of the variance and its high values reflect low seasonality, wet and hot climates (see Supplementary Figure 8 ). PC 2 explains 23.3% of the variance (see Supplementary Figure 9 ), while PC 3 explains only 7.1% of the variance (see Supplementary Figure 10 ), and they are harder to interpret.
For specific humidity (measured in grams of vapor per kilogram of air), data were available for all 834 (100%) languages. The mean of yearly medians (shortened as median humidity or hum m ) ranges from 0.0014 to 0.02, with a mean of 0.012, median of 0.013, standard deviation of 0.005, and an IQR of 0.01 (see Supplementary Figure 11 ). The mean of yearly IQRs (shortened as median variation or hum v ) ranges from 0.00035 to 0.012, with a mean of 0.0044, median of 0.0041, standard deviation of 0.003, and an IQR of 0.005 (see Supplementary Figure 12 ).
The distances to the nearest large bodies of water are measured in kilometers (km) as the crow flies, and are subdivided in the distance to the nearest lake ( dist2lake , or d2l ), to the nearest river ( dist2river or d2r ), to the nearest sea or ocean ( dist2ocean or d2o ), and the minimum between the three (i.e., the distance to the nearest large body of water irrespective of its type, denoted dist2water or d2w ). These data were available for all 834 (100%) languages. The dist2lake is heavily skewed to the left with a few extreme outliers, and ranges between 0.5 and 2,770 km, with a mean of 40.4 km, a median of 21.4 km, a standard deviation of 115.0 km, and an IQR of 37.2 km (see Supplementary Figure 13 ). The dist2river is also heavily skewed to the left with a few extreme outliers, and ranges between 0.9 and 3,527 km, with a mean of 81.1 km, a median of 39.7 km, a standard deviation of 176.9 km, and an IQR of 68.5 km (see Supplementary Figure 14 ). The dist2ocean is less skewed and ranges between 0.6 and 2,194 km, with a mean of 317.7 km, a median of 146.5 km, a standard deviation of 354.8 km, and an IQR of 549.4 km (see Supplementary Figure 15 ). The dist2water is skewed to the left and ranges between 0.5 and 238.6 km, with a mean of 19.8 km, a median of 13.1 km, a standard deviation of 24.6 km, and an IQR of 17.3 km (see Supplementary Figure 16 ).
For population size collected from both sources, data were missing only for 63 languages (covering thus 771 or 92.4% of the languages) in the Ethnologue and 78 (covering 756 or 90.6% of the languages) in the Wikipedia. The data primarily derived from the Ethnologue (measured in tens of thousands of speakers to reduce the order of magnitude of the numbers displayed) are heavily skewed toward small languages, and ranges between 0 (for 45 languages, including 41 reported as recently extinct) and 84,091, with a mean of 465.6, a median of 1.0, a standard deviation of 3,480, and an IQR of 29.1 (see Supplementary Figure 17 ). The data primarily derived from Wikidata / Wikipedia (also measured in tens of thousands of speakers), is also heavily skewed toward small languages, and ranges between 0 (for 46 languages, including 42 reported as recently extinct) and 92,000, with a mean of 663.6, a median of 1.0, a standard deviation of 4,257.1, and an IQR of 30.9 (see Supplementary Figure 18 ). There is a strong positive and 1:1 linear relationship between the two sources (Pearson's r = 0.98 and Spearman's ρ = 0.98, both with p < 2.2·10 −16 ; see Supplementary Figure 19 ), with a few languages where the two estimates differ, in most cases due to the year of the estimate (very important for extremely endangered languages) or on the different type of categories of people considered (native speakers only, including L2 speakers as well, or even ethnicity).
Subsistence data were available only for 712 (85.4%) languages, and many more (553, 77.7%) practice subsistence modes centered around food production (“agriculture”) than those (159, 22.3%) whose subsistence mode is based on hunting, fishing, gathering, and/or foraging (“hunter-gatherers”). As expected, the latter tend to be found in marginal lands, being present mainly (in this dataset) in Australia, South America, and northern Eurasia (see Supplementary Figure 20 ).
For 15 variables, their value at the putative origin of the language families were also available. However, these values come with several caveats: first, the putative geographic origins are in most cases very controversial and come with probably very large errors; second, the value of the variables are present-day values, which might differ from their values at the time the proto-languages were spoken (ranging from hundreds to thousands of years, and usually now known with certitude). For each of these variables, the origin of family-level values versus the language-level values was plotted, their Pearson and Spearman correlations were computed, and their VIF (variance inflation factor) when used (as fixed effects) to predict blue in a mixed-effects logistic model with family and macroarea as random effects were estimated (see Table 2 ).
Variable Pearson's r Spearman's ρ | |||
---|---|---|---|
Longitude | r = 0.88, p = 6.7·10 −276 | ρ = 0.86, p = 3.9·10 −248 | 1.7 |
Latitude | r = 0.86, p = 7.8·10 −246 | ρ = 0.78, p = 3.7·10 −170 | 2.4 |
UV-B (mean; TOMS) | r = 0.83, p = 1.8·10 −214 | ρ = 0.67, p = 3.0·10 −111 | 2.4 |
UV-B (sd; TOMS) | r = 0.87, p = 4.5·10 −258 | ρ = 0.71, p = 2.3·10 −129 | 2.7 |
UV-B (mean; WorldClim) | r = 0.69, p = 1.1·10 −116 | ρ = 0.57, p = 4.1·10 −73 | 1.5 |
UV-B (sd; WorldClim) | r = 0.78, p = 1.4·10 −172 | ρ = 0.68, p = 7.8·10 −115 | 1.8 |
Climate PC1 | r = 0.68, p = 1.2·10 −114 | ρ = 0.58, p = 2.4·10 −76 | 1.5 |
Climate PC2 | r = 0.56, p = 7.5·10 −70 | ρ = 0.55, p = 1.2·10 −66 | 1.2 |
Climate PC3 | r = 0.36, p = 1.7·10 −26 | ρ = 0.40, p = 5.5·10 −33 | 1.1 |
Humidity (median) | r = 0.75, p = 1.7·10 −154 | ρ = 0.72, p = 3.9·10 −133 | 1.7 |
Humidity (IQR) | r = 0.53, p = 5.0·10 −62 | ρ = 0.40, p = 1.3·10 −32 | 1.2 |
Dist. to lakes | r = 0.13, p = 0.00017 | ρ = 0.15, p = 8.1·10 −6 | 1.0 |
Dist. to rivers | r = 0.03, p = 0.373 | ρ = 0.02, p = 0.547 | 1.0 |
Dist. to oceans/seas | r = 0.65, p = 8.2·10 −101 | ρ = 0.61, p = 9.9·10 −86 | 1.3 |
Dist. to water | r = 0.30, p = 2.9·10 −19 | ρ = 0.27, p = 5.9·10 −15 | 1.1 |
Shown are: their Pearson's (2nd column) and Spearman's (3rd column) columns, as well as the variance inflation factor (VIF, 4th column) of a mixed-effects logistic regression of blue having the language-level and the family-origin-level values as fixed effects, and family and macroarea as random effects.
It can be seen, first, that the geographical locations of the present-day languages and of the putative origin of language families are very highly correlated, which is to be expected. However, there are a few families which show a very large spread among their daughter languages ( Table 3 ), of particular interest here being those with a large spread in latitude, as latitude is the main driver of UV-B incidence as well as having a strong influence on climate.
Family Glottocode sd (longitude) sd (latitude) No. of lgs | ||||
---|---|---|---|---|
Athabaskan-Eyak-Tlingit | atha1245 | 19.1 | 15.0 | 4 |
Indo-European | indo1319 | 58.4 | 15.0 | 86 |
Atlantic-Congo | atla1278 | 17.5 | 14.7 | 25 |
Tupian | tupi1275 | 5.1 | 11.7 | 7 |
Arawakan | araw1281 | 6.3 | 10.2 | 8 |
Afro-Asiatic | afro1255 | 12.2 | 9.6 | 51 |
Chukotko-Kamchatkan | chuk1271 | 9.3 | 8.4 | 2 |
Nuclear-Macro-Je | nucl1710 | 2.1 | 8.3 | 5 |
Turkic | turk1311 | 25.2 | 8.0 | 12 |
Tungusic | tung1282 | 11.0 | 7.7 | 5 |
Eskimo-Aleut | eski1264 | 45.9 | 6.5 | 6 |
Pama-Nyungan | pama1250 | 11.7 | 6.5 | 48 |
Austronesian | aust1307 | 24.7 | 6.4 | 134 |
Uralic | ural1272 | 21.4 | 6.1 | 28 |
Austroasiatic | aust1305 | 5.0 | 5.3 | 25 |
Uto-Aztecan | utoa1244 | 7.1 | 4.6 | 3 |
Tai-Kadai | taik1256 | 3.9 | 4.2 | 25 |
Dravidian | drav1251 | 2.6 | 3.8 | 5 |
Sino-Tibetan | sino1245 | 6.9 | 3.3 | 85 |
Nilotic | nilo1247 | 1.6 | 2.9 | 4 |
Mongolic-Khitan | mong1349 | 35.6 | 2.7 | 3 |
Pano-Tacanan | pano1259 | 3.0 | 2.4 | 8 |
Yukaghir | yuka1259 | 3.0 | 2.3 | 2 |
The standard deviation of longitude (3rd column), the number of languages with data in the family (5th column), the family name (1st column), and glottocode (2nd column).
Given these, it is no surprise that most variables show high correlations between the language-level and family-origin-level values (except for the distances to lakes and to rivers, the latter being the only non-significant one, given their high dependence on small-scale details of geography and climate), but it is also interesting to note that the highest VIF is ≈2.7, which is well below the usual cutoff of 5, and suggests that the family-origin-level values do not carry the same information as the language-level values.
A priori , it is extremely important to control for Galton's problem, and for language contact (Ladd et al., 2015 ) so, it was also checked if, on these data, including language family and macroarea as random effects in a mixed-effects regression model is statistically justified or not. For this, the null model , m 0 (i.e., in which blue is regressed only on the intercept, without any predictors), with both family and macroarea as random effects [in R's notation, m 0 = blue ~1+(1| family ) + (1| macroarea )] and the null models that miss one of these random effects [ m 0 − f = blue ~1+(1| macroarea ) and m 0 − m = blue ~1+(1| family )] were compared (in both the frequentist and Bayesian frameworks). It was found that m 0 has an Intraclass Coefficient Coefficient, ICC (which can be interpreted as the proportion of the variation explained by the grouping of observations as given by the random effects, ranging from 0%, when the random structure doe not explain anything, to 100%, when the random structure is enough by itself to explain the data), of 27.3%. Removing family significantly drops the fit ( m 0 vs m 0 − f : LR model comparison's p = 2.57·10 −6 , ΔAIC = 20.1, BF = 16059.0, ΔLOO = 14.5, ΔWAIC = 15.1, ΔKFOLD = 12.8), as does removing macroarea ( p = 2.45·10 −5 , ΔAIC = 15.8, BF = 1108.7, ΔLOO = 4.8, ΔWAIC = 3.0, ΔKFOLD = 10.0). Thus, both random effects will be systematically included in the following models.
Both frequentist and Bayesian logistic mixed-effects regressions of blue on each of the following predictors of potential interest individually were performed: UV-B incidence (mean and sd, separately from TOMS and WorldClim), latitude, subsistence strategy, elevation, climate (PC1, PC2, and PC3), humidity (median and IQR), distance to large bodies of water (separately for distance to lakes, rivers, seas/oceans, and any type of large body of water), and population size (separately from the Ethnologue and Wikipedia/Wikidata). For the numeric predictors (all except subsistence ), the process began with the quadratic model [ blue ~1+ x + x 2 +(1| family ) + (1| macroarea )], while for discrete predictors (only subsistence ), it started with the linear model. Then, they were (automatically) simplified by dropping first the quadratic effect (if it exits), then the linear effect, and retaining the simplest model (which could well be the null model) that explains the data equally well as the most complex model. With this, it was found that the predictors which seem to have an individual effect on blue with various degrees of confidence are (see Table 4 ): UV-B as measured at the location of the languages (clearly negative for the mean, either quadratic [TOMS] or linear [WorldClim], and clearly positive for sd, probably linear [TOMS] and [WorldClim]), UV-B at the origins of the language families (suggestive linear, negative for the mean [TOMS], and positive for sd [TOMS] and [WorldClim]), latitude at the location of the languages (clearly linear positive), latitude at the origins of the language families (linear positive), climate PC1 at the origins of the language families (possibly negative linear), humidity median and at the origins of the language families (possibly negative linear), and distance to lakes (probably negative linear). The clearest signals are thus for UV-B incidence (negative for their mean and positive for their standard deviation) and latitude (at the language and family origins, positive). It is interesting to note that, in general, the frequentist and Bayesian estimates are in very good numeric agreement, but that the Bayesian approach tends to be more conservative.
Predictor Approach Formula | ||
---|---|---|
UV-B (mean; TOMS) | Frequentist | −0.20 ± 0.08 x 2 −0.93 ± 0.24 x |
Bayesian | −0.52[−0.83, −0.21] x | |
UV-B (sd; TOMS) | Frequentist | 0.57 ± 0.14 x |
Bayesian | 0.58[0.30, 0.89] x | |
UV-B (mean; WorldClim) | Frequentist | −0.36 ± 0.14 x |
Bayesian | {−0.36[−0.64, −0.09] x } | |
UV-B (sd; WorldClim) | Frequentist | 0.32 ± 0.13 x 2 +0.07 ± 0.16 x |
Bayesian | {0.28[0.02, 0.57] x } | |
UV-B (mean fam.; TOMS) | Frequentist | −0.31 ± 0.15 x |
Bayesian | {−0.31[−0.63, 0.00] x } | |
UV-B (sd fam.; TOMS) | Frequentist | 0.31 ± 0.14 x |
UV-B (sd fam.; WorldClim) | Frequentist | 0.33 ± 0.15 x |
Bayesian | {0.32[0.03, 0.65] x } | |
Latitude | Frequentist | 3.07 ± 1.00 x |
Bayesian | 2.68[0.84, 4.77] x | |
Latitude (fam.) | Frequentist | 2.93 ± 1.08 x |
Bayesian | 2.41[0.29, 4.52 x ] | |
Elevation | Frequentist | −0.06 ± 0.02 x 2 +0.62 ± 0.23 x |
PC1 (fam.) | Frequentist | −0.37 ± 0.13 x |
Bayesian | {−0.37[−0.64, −0.09] x } | |
Humidity (median) | Frequentist | −67.32 ± 25.12 x |
Humidity (median fam.) | Frequentist | −87.21 ± 31.69 x |
Dist. lakes | Frequentist | −0.20 ± 0.09 x |
Bayesian | {−0.20[−0.38, −0.02] x } |
It shows the predictor, the type of regression (frequentist, i.e., using glmer , or Bayesian, using brms ), and the (essential) regression formula. For this last column, it uses the following conventions: for frequentist regressions, it gives the point estimate ± standard error, while for the Bayesian regressions, it gives the point estimate (i.e., the mean posterior) [95% HDI]. For both, it gives either the quadratic or the linear formula, as appropriate. If the Bayesian linear model is not formally better than the null (but marginally so) and the 95% HDI does not contain 0, it still gives the linear model but enclosed within { and }; please note that the global intercept α nor its variation by the random effect structure are shown, as these are not relevant for establishing the direction and relative strength of the predictor's effect. So, as an example, the first row is interpreted as a quadratic model (frequentist) with coefficient −0.20 ± 0.08 for the quadratic term and −0.93 ± 0.24 for the linear term, while the last row is a Bayesian linear regression that is not formally better than the null model, but where the slope, −0.20, is very probably negative, as 0∉[−0.38, −0.02].
Comparing the two UV-B incidence databases, TOMS and WorldClim, in terms of their capacity to predict blue when using UV-B mean in a mixed-effects logistic regression with family and macroarea as random effects, we suggests that TOMS is a better predictor [ glmer : Δ AIC = 6.4, Δ BIC = 6.4; brms : BF = 23.0, Δ LOO = 2.4(2.6), Δ WAIC = 2.3(2.6), Δ KFOLD = 3.5(3.4)]. Likewise, fitting a mixed-effects logistic regression of blue as above, but with UV-B mean and sd from both databases as fixed effects simultaneously found high VIFs for UV-B mean (8.6) and sd (12.2) from TOMS, and low VIFs for the mean (2.2) and sd (2.9) from WorldClim, suggesting that the two databases contain highly overlapping information. Taken together with the substantive difference between the two databases in terms of what they actually mean in terms of UV-B incidence, it was decided to only use the TOMS data in the reminder of the article.
Several mediation models having blue as outcome were fitted (see Supplementary Figures 21 – 39 ), and it was found that, first, the significant positive total effect of latitude on blue (Bayesian: TE = 3.6[1.4, 5.9], piecewiseSEM: TE = 0.03[0.01, 0.05], p = 0.0001; please note that the effects are standardized but not the regression coefficients) is composed of a non-significant negative direct effect (Bayesian: DE = −5.7[−12.5, 1.1]), piecewiseSEM: no estimate) and a significant positive indirect effect (Bayesian: IE = 9.3[2.4, 16.1], piecewiseSEM: IE = 0.03[0.01, 0.05], p = 0.0001), the latter mediated through UV-B mean and composed of a significant negative effect of latitude on UV-B mean (Bayesian: β T → M = −7.1[−7.2, −6.9], piecewiseSEM: β T → M = −7.1 ± 0.1, p = 0) and a significant negative effect of UV-B mean on blue (Bayesian: β M → O = −1.3[−2.3, −0.4], piecewiseSEM: β M → O = −0.5 ± 0.1, p = 0.0007; see Supplementary Figure 21 ). When using UV-B sd instead, the results are similar, but suggest that UV-B mean is a better mediator of the relationship: the significant positive total effect of latitude on blue (Bayesian: TE = 3.1[0.1, 5.2], piecewiseSEM: TE = 0.01[0.00, 0.02], p = 0.0006) is composed of a significant negative direct effect (Bayesian: DE = −8.9[−15.6, −2.4], piecewiseSEM: DE = −0.02[−0.03, −0.00], p = 0.009) and a significant positive indirect effect (Bayesian: IE = 12.0[5.8, 18.6], piecewiseSEM: IE = 0.03[0.01, 0.04], p = 0.0003), the latter mediated through UV-B sd and composed of a significant positive effect of latitude on UV-B sd (Bayesian: β T → M = 6.6[6.4, 6.8], piecewiseSEM: β T → M = 6.6 ± 0.1, p = 0) and a significant positive effect of UV-B sd on blue (Bayesian: β M → O = 1.8[0.9, 2.8], piecewiseSEM: β M → O = 1.5 ± 0.4, p = 0.0003, see Supplementary Figure 22 ). Climate PC1, population size , and distance to lakes do not mediate the relationship between latitude and blue , but subsistence might (piecewiseSEM: IE = −0.01[−0.01, −0.00], p = 0, β T → M = −0.6 ± 0.1, p = 0, β M → O = 1.1 ± 0.3, p = 0, Supplementary Figures 23 – 25 , 34 ). Subsistence seems to mediate some of the relationship between UV-B ( mean and sd ) and blue ( Supplementary Figures 28 , 29 ), but population size does not ( Supplementary Figures 30 , 31 ). Focusing on distance to lakes suggests that its negative effect on blue is, in fact, mediated by latitude and UV-B incidence ( Supplementary Figures 33 – 39 ).
I fitted various path analyses models that reflect to various degrees our causal beliefs connecting blue, UV-B, latitude , and other predictors using three different techniques, each with its own advantages and disadvantages: “classical” variance-based SEM (as implemented by lavaan ; Rosseel, 2012 ), frequentist piecewise SEM ( piecewiseSEM ; Lefcheck, 2016 ), and Bayesian piecewise SEM (using brms ; Bürkner, 2018 ).
With lavaan , two types of path models were fitted, as in Josserand et al. ( 2021 ). The “full” models include all the potentially relevant variables ( latitude, UV-B incidence, distance to lakes, climate PC1, subsistence, population size , and blue ) and most paths are directional (except for UV-B ↔ climate PC1 , and UV-B ↔ distance to lakes , which are modeled as correlations). Three such models were fitted: one including UV-B mean ( Supplementary Figure 40 ), one including UV-B sd ( Supplementary Figure 41 ), and one including both UV-B mean and UV-B sd (modeled as correlated; Supplementary Figure 42 ). All these models fit the data rather well and are equivalent in terms of fitting [for all: , p = 0.74>0.05, CFI = 1.0, TLI = 1.02, NNFI = 1.02, and RMSEA = 0.00], suggesting that using the mean or the sd of UV-B incidence are equivalent from this pint of view. Using only the mean results in a negative but non-significant path UV-B → blue , using only the sd results in a significant positive path, and when including both, only the positive path from sd remains significant; the positive path subsistence → blue is significant in all models. The “relaxed” models kept the direction of the effect only in those cases for which there are strong a priori reasons ( Figure 3 and Supplementary Figures 43 , 44 ). This results in very good fits to the data and now the three models have slightly different fits as well [mean: , p = 0.96>0.05, CFI = 1.00, TLI = 1.01, NNFI = 1.01, RMSEA = 0.00; sd: , p = 0.99>0.05, CFI = 1.00, TLI = 1.01, NNFI = 1.01, RMSEA = 0.00; both: , p = 0.99>0.05, CFI = 1.00, TLI = 1.01, NNFI = 1.01, RMSEA = 0.00]. As above, including individually the mean and sd of UV-B incidence results in significant paths to blue of similar strengths (negative and positive, respectively), but including both makes their paths to blue non-significant. Likewise, subsistence → blue is significant and positive in all models.
In contrast with lavaan , piecewiseSEM allows the inclusion of family and macroarea as random effects, and I fitted two path models corresponding to the “full” models above separately for mean and sd UV-B incidence (see Supplementary Figures 45 , 46 ). Including the mean results in a much smaller AIC than not including it (Δ AIC = −263.9), the model fits the data very well [ , p = 0, and Fisher's C (14) = 167.5, p = 0], but the negative effect of mean(UV-B) is not significant (standardized β = −0.40, p = 0.065). Likewise, the standard deviation results in a Δ AIC = −221.5, the model fits the data very well [ , p = 0, and Fisher's C (14) = 69.0, p = 0], and the positive effect of sd(UV-B) is highly significant (standardized β = 0.73, p = 0.0003). In both models, subsistence (AGR) has a significant positive effect on blue .
While more flexible than lavaan , piecewiseSEM still has certain restrictions that may affect the results, prompting me to also implement piecewise SEM using brms to fit the two models described above (see Supplementary Figures 47 , 48 ). The model including the mean is overwhelmingly better than the one without it [ BF = 1.2· 42 , Δ LOO = 121.2(24.0), Δ WAIC = 123.7(23.6), and Δ KFOLD = 120.0(25.6)] and finds a clear negative effect of mean(UV-B) on blue [β = −1.10[−1.99, −0.23], posterior p (β < 0) = 0.98]. Likewise, the model including the standard deviation is overwhelmingly better than the one without it [ BF = 2.5· 16 , Δ LOO = 70.9(37.9), Δ WAIC = 71.7(37.8), and Δ KFOLD = 66.3(40.2)] and finds a clear positive effect of sd(UV-B) on blue (β = 1.93[1.09, 2.79], posterior p (β > 0) = 1.00). Both models find a significant positive effect of subsistence (AGR) on blue and there may be hints of a negative effect of distance to lakes and a positive effect of population size .
However, it is arguably incorrect to include simultaneously both the mean and standard deviation of UV-B incidence as they are highly correlated and causally linked, being two connected aspects of the same unmeasured construct capturing the UV-B incidence received by a geographic location in a year. Likewise, subsistence and population size are, arguably, proxies for an unmeasured “ cultural complexity” that might affect blue . Therefore, I also implemented a series of Structural Equation Models that explicitly model the latent variables UV-B incidence , measured by mean(UV-B) and sd(UV-B), and cultural complexity , measured by subsistence and population size (climate is captured by climPC1 ). However, currently only lavaan allows latent constructs and, given the complexities of fitting such models, I start with the main hypothesis and I subsequently added other factors to the model. First, the model implementing the main hypothesis (see Supplementary Figure 49 ) that blue is influenced by the latent UV-B incidence which is affected by latitude fits the data [ , p = 0.17 > 0.05, CFI = 1.00, TLI = 0.99, NNFI = 0.99, RMSEA = 0.032] and finds a significant negative effect of UV-B incidence on blue (standardized β = −0.83, p = 0.008); this latent loads approximately equally but with opposed signs on the mean (loading fixed to 1.0) and sd (loading −1.003, p = 0). Adding the climate ( climPC1 ) improves the fit [ , p = 0.75 > 0.05, CFI = 1.00, TLI = 1.00, NNFI = 1.00, RMSEA = 0.00] and does not alter the relationship among blue, UV-B incidence , and latitude . Further adding the latent cultural complexity (see Supplementary Figure 50 ) makes the model to not formally fit the data anymore [ , p = 0.025 ≤ 0.05] but the fit indices are still very good ( CFI = 0.99, TLI = 0.99, NNFI = 0.99, RMSEA = 0.04); the relationship among blue, UV-B incidence , and latitude remains the same (with a slightly weaker β UVB → blue = −0.64, p = 0.002), and there is now a significant positive relationship between cultural complexity (mainly loading positively on subsistence but also on population size ) and blue (β culture → blue = 0.46, p = 0.0). However, adding the distance to lakes makes the model not fit the data and degrades its fit indices as well [ , p = 0 ≤ 0.05, CFI = 0.94, TLI = 0.88, NNFI = 0.88, RMSEA = 0.12] suggesting that we should not put too much weight on it, but it introduces a significant negative effect of this variable on blue and does not alter the previous relationships of interest.
Finally, while lavaan does not currently handle random effects, I attempted to control for the effect of macroarea by modeling it as a grouping factor in the first model that embodies the main hypothesis, estimating the models' parameters for each macroarea ( N.B ., this is fundamentally different from a random effects approach and cannot be applied to the language family due to the large number of families and generally very low number of languages per family). This fits the data well enough [ , p = 0.065 > 0.05, CFI = 0.97, TLI = 0.90, NNFI = 0.90, RMSEA = 0.08] and finds the following estimates of β UVB → blue (with 95% CIs and p -values) per macroareas: Africa (−2.15[−4.24, −0.06], p = 0.044), Eurasia (−3.24[−5.82, −0.65], p = 0.014), Australia (1.55[−2.04, 5.13], p = 0.40), Papunesia (−3.39[−6.51, −0.28], p = 0.033), North America (0.40[−2.54, 3.35], p = 0.79), and South America (−1.74[−3.99, 0.51], p = 0.13).
A Bayesian mixed effects logistic regression with family and macroarea as random effects, using all potential predictors as fixed effects fits well the full dataset (76.6% accuracy, 77.5% sensitivity, 74.1% specificity, 88.7% precision, and 77.5% recall). When randomly splitting the dataset into 80% training/20% testing subsets 100 times, using all the potential predictors, a good fit on the testing subsets was obtained (70.1 ± 2.9% accuracy, 74.7 ± 3.2% sensitivity, 59.4 ± 6.6% specificity, 81.7 ± 3.9% precision, and 74.7 ± 3.2% recall). Manual simplification retains the following three predictors [estimate, 95% HDI and p (ROPE)]: UV-B sd (β = 0.81[0.49, 1.11], p ( ROPE ) = 0.00), subsistence (β = 0.87[0.2, 1.52], p ( ROPE ) = 0.0003), and distance to oceans (family) (β = 0.24[0.02, 0.47], p ( ROPE ) = 0.29); this model still fits the full data well (76.0% accuracy, 76.7% sensitivity, 74.0% specificity, 89.2% precision, and 76.7% recall).
A conditional inference tree using all the potential predictors fits the full dataset well (72.9% accuracy, 72.0% sensitivity, 79.2% specificity, 96.2% precision, and 74.1% recall) and seems to make a distinction among the African, Eurasian, North American, and Papunesian languages, on the one hand, and the South American and Australian languages, on the other; for the former split, UV-B (sd) has a positive effect on blue , while for the second, the longitude of the family has a positive effect (see Supplementary Figure 51 ). When randomly splitting the dataset into 80% training/20% testing subsets 100 times, using all the potential predictors, good fits on the testing subsets were obtained (70.8 ± 3.5% accuracy, 74.3 ± 4.0% sensitivity, 62.2 ± 8.4% specificity, 85.0 ± 5.9% precision, and 74.3 ± 4.0% recall).
Both random forests and conditional random forests fit the dataset well (72.3 ± 0.5% accuracy, 75.5 ± 0.4% sensitivity, 64.9 ± 0.8% specificity, 83.3 ± 0.5% precision, and 75.5 ± 0.4% recall; and 81.4 ± 0.3% accuracy, 81.2 ± 0.3% sensitivity, 81.9 ± 0.6% specificity, 93.3 ± 0.2% precision, and 81.2 ± 0.3% recall, respectively). Various measures of variable importance suggest the following top five predictors: UV-B (mean), distance to oceans (family), UV-B (sd), latitude , and macroarea (accuracy-based predictor importance from random forests); UV-B (mean), UV-B (sd), latitude, population size , and elevation (Gini-index-based predictor importance from random forests); and macroarea, latitude (family), climate PC1 (family), UV-B (mean) , and UV-B (sd) (unconditional predictor importance from conditional random forests).
An SVM using all potential predictors fits well the full dataset (74.7% accuracy, 74.3% sensitivity, 76.0% specificity, 91.7% precision, and 74.3% recall), and the top five predictors by importance are as follows: distance to lakes (family), macroarea, elevation (family), subsistence , and climate PC1 (family). When randomly splitting the dataset into 80% training/20% testing subsets 100 times, using all the potential predictors, very good fits on the testing subsets were obtained (71.5 ± 2.9% accuracy, 71.8 ± 3.4% sensitivity, 70.7 ± 7.4% specificity, 89.9 ± 3.0% precision, and 71.8 ± 3.4% recall).
Furthermore, the Bayesian logistic regression of blue on the full set of potential predictors with manual simplification in brms were fitted, using (a) a 2D Gaussian process to model the spatial relationships between the languages (McElreath, 2020 ; Naranjo and Becker, 2022 ), which should better capture the continuous dependency of the probability and/or intensity of language contact on geographical space within macroareas (as opposed to the categorical use of macroareas as a random effect) while still including family as a random effect; (b) the “global” language phylogeny in Jäger ( 2018 ) to model the detailed “vertical” historical relationships between languages (as opposed to the categorical approach of using language family as a random effect) while still including macroarea as a random effect; and (c) combining both (a) and (b) in a single model where a 2D Gaussian process models the within-macroarea continuous language contact and the “global” phylogeny to model the detailed “vertical” historical relationships between languages. However, given that these models are very computationally expensive, they were not generalized to each individual predictor nor to the mediation models. Their findings clearly support the a priori hypothesis: after manual simplification, the model retained for (a) includes a negative effect of UV-B mean (β = −0.64[−0.98, −0.29], p (β = 0) = 0.03, p (β < 0) = 1.00), of subsistence (agriculture: β = 1.18[0.51, 1.86], p (β = 0) = 0.03, p (β > 0) = 1.00), and negative of distance to lakes (β = −0.20[−0.40, 0.02], p (β = 0) = 0.83, p (β < 0) = 0.97). The models retained in (b) and (c) both include only the negative effect of UV-B mean (β = −0.83[−1.90, 0.06], p (β = 0) = 0.36, p (β < 0) = 0.99, and β = −0.73[−1.39, −0.07], p (β = 0) = 0.30, p (β < 0) = 0.99, respectively).
A total of 4,259 trees with branch length for 13 language families ( Supplementary Figures 52 – 83 ) and two “global” trees (not shown due to their size) were collected (see Table 1 for summaries). The number of languages with data in a family varies between 10 (Turkic; Hruschka et al., 2015 ) and 129 (Austronesian; Round, 2021 ), with 641 and 703 languages in the two “global” trees (Jäger, 2018 ; Bouckaert et al., 2022 , respectively). The percent of languages with a dedicated word for “blue” varies between ≈19% (Pama-Nyungan; Bouckaert et al., 2018 ; Round, 2021 ) and 100% (Uralic; Honkola et al., 2013 ; Jäger, 2018 ), with ≈66% for the two “global” trees. The corresponding Shannon entropy varies between an uninformative 0.00 (when “blue” is at 100%) to 0.99 (e.g., Atlantic-Congo; Jäger, 2018 ); for the two “global” trees, it is a very high 0.92.
The phylogenetic signal of blue independently in each of the available trees was estimated and it was found, in summary, that there seems to be a significant phylogenetic signal at least in Austronesian, Indo-European, possibly Hmong-Mien, and the two “global” trees, but the results are rather patchy and seem to depend on the particular tree and method used (see Supplementary Table 2 for details). This probably reflects the need for large trees, as the signal for the two very large “global” trees is quite strong and consistence across methods.
Given this, it is not surprising that the ancestral state reconstruction of blue seems to depend on the particular tree and method used, but the following families seem to have had a specific word for “blue” in their proto-languages: Afro-Asiatic, Austroasiatic, Austronesian, Indo-European, Nakh-Daghestanian, Sino-Tibetan, Tai-Kadai, Turkic, and Uralic, while proto-Pama-Nyungan seems not to have had it. The “global” tree of Jäger ( 2018 ) seems to have had a specific word for “blue” at its root, but the other “global” tree (Bouckaert et al., 2022 ) is uninformative. See Supplementary Table 3 for details.
The correlated evolution between blue and each of its potential predictors was estimated separately. Focusing on UV-B incidence, there seems to be some evidence for correlated evolution between blue and UV-B mean and between blue and UV-B sd in a few families and trees (Austroasiatic and Hmong-Mien, and Austronesian, Hmong-Mien and Pama-Nyungan, respectively), as well as in both “global” trees (see Supplementary Tables 4 , 5 ). For the other predictors, there is some evidence for correlated evolution with blue in some families as well as in one or both “global” trees, but is inconsistent—please see the full analysis report for details.
The phylogenetic logistic regression of blue on each of the potential predictors separately was performed using two non-Bayesian and one Bayesian approach, and for each, the corresponding non-phylogenetic logistic regression was also fitted as a baseline comparison which ignores “Galton's problem” (Mace and Holden, 2005 ). The results are presented in the Supplementary Figures 85 – 92 (see Supplementary Figure 84 for the full caption and interpretation key) and summarized in Supplementary Table 6 . Several predictors show suggestive signals of coherent association with blue across multiple families and the two “global” phylogenies, including UV-B mean (negative), UV-B sd (positive), longitude (positive), population size (positive), climate PC3 (negative), and distance to lakes (negative), while climate PC1 varying between families and subsistence (agriculture) seems to have a positive effect blurred by being constant in so many families.
Putting all these results together and focusing on the a priori main hypothesis of a negative effect of UV-B incidence on the existence of a dedicated word for “blue,” it was found that: first, there is significant correlated evolution for Austroasiatic and Hmong-Mien using the Bayesian approach, and using both methods. On the other hand, the logistic phylogenetic regression finds a significant negative effect only in a few cases (14 or 5.6% trees belonging to Indo-European, Uralic, and Sino-Tibetan and the two “global” phylogenies), but the estimated β's are negative in the majority of cases (≈65% when including the posterior trees, which give a very strong influence to the few families with such trees, and ≈75% when excluding them, which gives a much more balanced view); importantly, there is a significant strong negative effect for all methods in the two “global” phylogenies.
Second, there is also a signal of correlated evolution between UV-B sd and blue , and there is a significant positive effect for 11 (4.4%) cases and a positive β for ≈60 and ≈65% of cases, respectively; there is a strong positive signal in both “global” phylogenies.
Third, plotting the relationship between blue and UV-B incidence in each family separately (see Supplementary Figures 85 – 92 and the full analysis report) suggests that, first, only two families (Atlantic-Congo and Tai-Kadai) do not show any effect of UV-B incidence on blue , nine (sub)families show a signal consistent with the hypothesis of an effect of UV-B on blue , but three families show an effect in the opposite direction to that predicted(for UV-B mean : Uralic, for UV-B sd : Timor-Alor-Pantar, and for both: Turkic). However, it is clear that for Trukic and Uralic, this is driven by one outlier each in the north, while for Timor-Alor-Pantar, there is very little variation in UV-B incidence (The case of Indo-European is interesting as the MCMC summary and posterior trees seem to show an opposite effect to the expected one, while the Glottolog trees show an effect in the expected direction, with the Jäger ( 2018 ) tree showing essentially a null effect). Importantly, both “global” phylogenies show a clear and significant effect in the expected direction for both the mean and standard deviation of UV-B incidence.
Josserand et al. ( 2021 ), based on the original hypothesis by Brown and Lindsey ( 2004 ), only tested a linear negative effect of mean UV-B incidence on the probability of having a specific word for “blue.” However, the actual shape of the relationship might not be strictly linear and its particular shape might give hints as to the details of the causal mechanisms involved.
To help better understand these relations, the z -scored values of UV-B incidence (mean and sd) back-map to their raw values as follows. For UV-B mean : 0.0 → 208.5 mW / m 2 , −4.75 → 70.89 mW / m 2 , and 1.04 → 238.6 mW / m 2 ; in general, . For UV-B sd : 0 → 16.4 mW / m 2 , −0.87 → 1.13 mW / m 2 , and 3.13 → 71.2 mW / m 2 ; in general, . Supplementary Figures 93 – 95 show the relationship between UV-B incidence (mean and sd) and the presence of a specific word for “blue” globally and per macroarea.
Polynomial logistic regression up to degree 3 in the fixed effect were conducted (both Bayesian and non-Bayesian) while controlling for family and macroarea as random effects, and the results are extremely similar. For UV-B mean , the linear model finds a clear negative relationship, where going from the minimum mean UV-B incidence of 70.9 mW / m 2 to the maximum of 238.6 mW / m 2 is associated with a drop in the probability of “blue” from about 94% (with a 95%CI of [77%, 99%]) to about 46% (with a 95%CI of [30%, 63%]). However, the model with both linear and quadratic effects fits the data marginally better [vs. linear: , p = 0.026, Δ AIC = 2.9, Δ BIC = −1.8], which suggests that the relationship might not be linear (or even monotonic) at low mean UV-B incidences (the confidence interval is very wide), and instead the probability of “blue” might plateau (or reach a maximum) at about 140 mW / m 2 of about 84% [67, 93%] and falls off to 35% [20, 55%] for the maximum mean UV-B incidence, and also (but see the very wide 95%CI!) toward 62% [15, 94%] for the minimum mean UV-B incidence. However, the “dip” at lower mean UV-B incidences (higher latitudes) could be an artifact of hunter–gatherer populations whose languages tend to lack a word for “blue.” Therefore, the same polynomial regression but also including all the interactions with subsistence was also fitted. With these, manual model simplification suggests that the best model (Δ AIC = 123.6, Δ BIC = 124.4) actually comprises the linear effect of UV-B mean and the independent contribution of subsistence (i.e., with no interaction between the two). For UV-B sd , the linear and the quadratic models fit equally well [ , p = 0.36, Δ AIC = −1.2, Δ BIC = −5.9], so we will use the linear model when going from the minimum sd UV-B incidence of 1.1 mW / m 2 to the maximum of 71.2 mW / m 2 is associated with an increase in the probability of “blue” from about 47% [31, 63%] to about 90% [73, 96%]. Adding the independent contribution of subsistence results in an even better fit (Δ AIC = 129.7, Δ BIC = 125.7). Figure 4 shows the predictions of these models: it can be seen that including subsistence removes the need for a quadratic effect in UV-B mean and highlights the overall lower probability of a specific word for “blue” in hunter–gatherer languages but no detectable interaction (within the limits of the dataset) between subsistence and UV-B incidence.
This extension of the database resulted in a massive increase in the language families covered, and in the languages within families and macroareas. A slight majority of the languages in the extended database do have a specific word for “blue” and are spread across a wide range of UV-B incidences. The other potentially relevant variables were also extended to all of or to a sizable proportion of the data. This resulted in a much better coverage of small families and isolates, and of Australia and Papunesia, offering a much more representative sample of present-day linguistic diversity and increased statistical power relative to the original study (Josserand et al., 2021 ). Moreover, by increasing the available data for several large families, it made possible the application of various phylogenetic methods as well as the addition of piecewise path analysis and of Structural Equation Models with latent variables.
Overall, the large set of diverse methods used overwhelmingly supports the a priori hypothesis of a negative effect of mean UV-B incidence on the probability that a language has a specific word for “blue.” First, this negative effect is found in the individual logistic regression of “blue” on the mean UV-B incidence. Second, it also appears in the mediation analysis, where mean UV-B incidence fully mediates the overall effect of latitude on “blue,” and in the various path and Structural Equation models. Third, the phylogenetic methods provide some evidence of correlated evolution and of a negative phylogenetic effect in several large families and in two “global” language phylogenies. Moreover, it emerged that the annual variation in UV-B incidence is strongly negatively correlated with mean UV-B incidence (as expected due to astronomic considerations) and, in most models, both tend to explain very similar variation, resulting in one “removing” the other from the model when included simultaneously (most often, the variation is retained and the mean is “dropped”). With this in mind, variation in UV-B has a clear positive effect on “blue” in the individual logistic regression, in the mediation, path and Structural Equation analyses, and in the suggestive signal in the phylogenetic analyses. Moreover, modeling the mean and variation in UV-B incidence as indicators of the latent UV-B incidence recovers the expected effect of this latent variable on “blue.” Thus, while most techniques do find a “significant” overall negative effect of mean UV-B on “blue,” there is none among the remaining techniques that supports an overall positive effect, and even among the techniques that suggest no effect, this seems to be due to overlapping variance with other predictors. While “global” language phylogenies have serious issues and it is unclear to what degree they reflect the “vertical” historical connections between languages (especially beyond the level of established language families), the fact that two such “global” phylogenies, constructed using widely different methods and datasets, find overwhelming support for the negative effect of mean UV-B on “blue” after controlling for “Galton's problem” at this global scale is more than encouraging. The fact that a phylogenetic effect was detected for certain families suggests that the effect is indeed diachronic and may play out at the time-scale of within-family divergence (i.e., thousands to hundreds of years). It is important, however, to point out that the measurements used here for UV-B incidence (but also for climate, humidity and distances to bodies of water) are present-day measurements that, on the one hand, may deviate quite strongly from their values during the periods of interest (presumably more so for some regions than for others) and, on the other, represent a snapshot of a variable timeseries (again, in a region- and time-dependent manner). In particular, the UV-B incidence used may not accurately reflect historical values due to the human-induced ozone layer depletion and its depletion and its slow recovery following the “Montreal Protocol” from 1978 (see https://en.wikipedia.org/wiki/Montreal_Protocol ), very probably with strong variation across geographic regions (e.g., Australia), but it is unclear how we can extrapolate its values back to the pre-industrial period globally and with the required spatio-temporal resolution (Lindfors et al., 2007 ; den Outer et al., 2010 ; Čížková et al., 2018 ).
Climate and humidity seem to have a much less clear and consistent effect in this larger dataset. The previously found negative effect of distance to lakes on blue , which Josserand et al. ( 2021 ) were careful not to over-interpret, is much weaker but still arguably discernible at least as a trend in this extended dataset, especially when using mediation, path analyses, latent variable SEM, and even phylogenetic regression, but the mediation analyses conducted specifically with this variable in mind seem to suggest that this might be due to it being related to latitude and UV-B incidence (this relationship probably reflects the vagaries of the current disposition of landmasses on Earth, on the one hand, and the causal links among latitude, climate, and the density of lakes and UV-B incidence, on the other).
However, there might be an overall weak effect of subsistence (practicing agriculture increases the probability of “blue”) and possibly of population size. Nevertheless, given that both are far-from-perfect proxies for the unmeasured (and arguably extremely hard to measure) cultural complexity and capture different aspects thereof (Josserand et al., 2021 ), the fact that there is a “switch” in their contribution to “blue” between Josserand et al. ( 2021 ) and this study should not be taken too literally and, coupled with the results of Structural Equation Modeling including a latent “cultural complexity”, gives extra support to the positive influence of cultural complexity on “blue.” Interestingly, subsistence seems to be required to properly explain the shape of the relationship between UV-B incidence and “blue,” as it helps account for the few northern populations who do not have a specific term for “blue.” It turns out that these apparent exceptions do, in fact, support the a priori hypothesis which states that high UV-B incidence generates a negative pressure against a specific term for “blue,” but, in contrast, low UV-B incidence does not induce any specific bias for or against “blue” and, instead, allows other factors to “play freely,” as it were. And indeed, this is what it was found: the relationship between UV-B incidence and “blue” is negative linear overall if one accounts for hunter–gatherer populations living with low UV-B incidence but do not have a dedicated word for “blue.” This is highly similar to other cases reported in the literature, in particular concerning the positive effect of a small or absent alveolar ridge prominence on click consonants (Moisik and Dediu, 2017 ) and the negative effect of an edge-to-edge bite on labiodentals (Blasi et al., 2019 ), where the bias is effectively asymmetric. Moreover, the finding that the relationship is very probably linear should help guide the search for the detailed causal mechanisms involved, suggesting an additive effect of UV-B incidence on the perception of blue as well as an additive effect on language across time. Nevertheless, as highlighted in Josserand et al. ( 2021 ), we must keep in mind that this is very likely a multi-factorial complex causal process involving multiple temporal and organizational scales, ranging from the intra-individual physiological lens brunescence and the associated perceptual and cognitive mechanisms of compensating and adapting to it to the large-scale presumably cross-generational and inter-individual language change in structured communities reflecting the decreased perception of “blue” among its most affected (older) members. While many of these components are still in need of thorough study and require inter-disciplinary and methodologically diverse approaches, the conversation has already started (see, for example, Josserand et al., 2021 , the recent technical comment to it in Hardy et al., 2023 and our response in Josserand et al., 2023 , touching on these aspects).
In conclusion, enlarging the database using primary and secondary sources of data vastly increased its representativity of the world's linguistic diversity and allowed the application of phylogenetic methods to investigate the diachronic component of the negative influence of UV-B incidence on the existence of a specific word for “blue.” It can only be highlighted that such extensions are an essential component of science and that, in this case, it supports and refines the previous findings of Josserand et al. ( 2021 ) and the original proposal (Lindsey and Brown, 2002 ) of a negative effect of UV-B incidence on the probability that a language has a specific word for “blue.”
The data and code needed to reproduce the results reported here, as well as the full analysis report can be found at: https://github.com/ddediu/colors-UV-update .
DD designed the research, checked, corrected, collected data, performed the analyses, and wrote the study.
Thanks to Karl Seifen, Tessa Vermeir, Rigele Na, and Lea Mouton for their expert opinions concerning specific languages, to Jayden Macklin-Cordes for help with data on the Australian languages, and to Rémi Anselme for painstakingly checking and collecting information on subsistence strategies. Thanks to Simon Greenhill for suggesting to use CLICS, and for confirming that brms would be useful and that using global phylogenies might solve the power issues of within-family phylogenetic analyses. Special thanks to Mathilde Josserand who collected, cleaned, merged most of the new data, and provided feedback on the article, and who, despite my opinion to the contrary, explicitly considered that her contributions do not warrant her being a co-author on this article. Finally, thanks to two reviewers whose comments greatly improved the article, and, in particular, to the reviewer who not only suggested to add data from Lexibank but also provided the code needed to obtain the blue/green colexification.
The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The handling editor AB-B declared a past co-authorship with the author.
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpsyg.2023.1143283/full#supplementary-material
细心的冲锋衣 · Java Swing布局管理器(详解版) 1 年前 |