The havoc an IPv4/IPv6 conflict can wreak: why curl and wget are so slow on CentOS 7, and the fix

This problem ate up an entire day. It started when I tried to download a GEO dataset through our lab's CentOS server and the download simply would not come through:

> gse <- getGEO('GSE10')
Error in open.connection(x, "rb") : 
  Timeout was reached: Resolving timed out after 10000 milliseconds
> gse <- getGEO('GSE10')
Error in open.connection(x, "rb") : 
  Timeout was reached: Resolving timed out after 10000 milliseconds
> gse <- getGEO('GSE10')
Error in open.connection(x, "rb") : 
  Timeout was reached: Resolving timed out after 10000 milliseconds
> gse <- getGEO('GSE10')
Error in open.connection(x, "rb") : 
  Timeout was reached: Resolving timed out after 10000 milliseconds

At first I thought the download speed was just too slow. A web search said that downloading GEO data requires adjusting the download method, so I followed along and added this to my code:

options( 'download.file.method.GEOquery' = 'libcurl' )

The problem persisted.
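Applied in full, that workaround is nothing more than an option set before the getGEO call; a minimal sketch (getOption is used here only to confirm the setting took effect):

library(GEOquery)

# Ask GEOquery to download files via libcurl instead of the default method
options('download.file.method.GEOquery' = 'libcurl')
getOption('download.file.method.GEOquery')   # should print "libcurl"

gse <- getGEO('GSE10')   # still timed out all the same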

Left with no choice, I started debugging it myself, tracing down through the call chain:

Error in open.connection(x, "rb"): Timeout was reached: Resolving timed out after 10000 milliseconds
Traceback:
1. getGEO("GSE10")
2. getAndParseGSEMatrices(GEO, destdir, AnnotGPL = AnnotGPL, getGPL = getGPL, 
 .     parseCharacteristics = parseCharacteristics)
3. getDirListing(sprintf(gdsurl, stub, GEO))
4. xml2::read_html(url)
5. read_html.default(url)
6. suppressWarnings(read_xml(x, encoding = encoding, ..., as_html = TRUE, 
 .     options = options))
7. withCallingHandlers(expr, warning = function(w) invokeRestart("muffleWarning"))
8. read_xml(x, encoding = encoding, ..., as_html = TRUE, options = options)
9. read_xml.character(x, encoding = encoding, ..., as_html = TRUE, 
 .     options = options)
10. read_xml.connection(con, encoding = encoding, ..., as_html = as_html, 
  .     base_url = x, options = options)
11. open(x, "rb")
12. open.connection(x, "rb")

The failure bottoms out in a call to read_xml, so I went to the xml2 package's GitHub repository, where an issue described a similar problem. The maintainer said flatly that the underlying calls depend on the curl package, so this kind of problem wasn't his to fix. I then turned my attention to curl: something had to be stopping it from working properly.
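The dependency chain is visible right in the traceback: step 4 is xml2::read_html, which fetches the URL through curl. Calling it by hand reproduces the same timeout; a sketch, using the GEO matrix directory URL from the traceback:

library(xml2)

# The same call GEOquery makes internally (getDirListing -> read_html);
# it hits the same 10 s resolution timeout
page <- read_html("https://ftp.ncbi.nlm.nih.gov/geo/series/GSEnnn/GSE10/matrix/")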

After some debugging, it turned out the original error is essentially the same as this one:

> con <- curl::curl("https://ftp.ncbi.nlm.nih.gov/geo/series/GSEnnn/GSE10/matrix/") 
> readLines(con)
Error in readLines(con) : 
  Timeout was reached: Resolving timed out after 10000 milliseconds

Some people said the libcurl version was at fault, so I found a way to upgrade curl to the latest release. The problem persisted.
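For anyone retracing this, the libcurl build that R's curl package links against is easy to check; a small sketch (the version in the comment is roughly what a stock CentOS 7 ships and is illustrative only):

# Which libcurl is the R curl package built against?
curl::curl_version()$version   # stock CentOS 7 ships around 7.29.0; upgrading did not help here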

The error is a resolution timeout, so increasing the time allowed for resolution ought to fix it. Running the same request from a terminal, I found the result did come back successfully after ten-odd seconds. Maddeningly, after half a day of Googling I couldn't find the proper way to set the timeout limit, and the documentation didn't mention it either, so the only option left was to open a GitHub issue and ask the author.

The author got back with a solution very quickly:

h <- curl::new_handle(timeout = 60)
con <- curl::curl("https://ftp.ncbi.nlm.nih.gov/geo/series/GSEnnn/GSE10/matrix/", handle = h) 
readLines(con)

The code above does run successfully, but the problem I actually need to solve is GEOquery's download: GEOquery calls xml2, and xml2 in turn calls curl. curl offered a compromise, but I had no way to apply it up that chain!

So all that was left was to Google with other key details, and I finally found the answer on StackExchange: it's a system DNS problem!!!

For details, see the Q&A: Slow Responses with curl and wget on CentOS 7

The ultimate fix is to append the following line to the end of /etc/resolv.conf:

options single-request-reopen
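For illustration only, the file might end up looking something like this (the nameserver addresses are placeholders; keep whatever entries your system already has and just append the options line):

# /etc/resolv.conf
nameserver 8.8.8.8
nameserver 8.8.4.4
options single-request-reopen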

The root cause of the problem is a conflict between IPv4 and IPv6 (see https://aarvik.dk/disable-ipv6/): when the two lookups share the same port/socket, domain name resolution becomes extremely slow. I tested against Baidu: before the fix, curling Baidu took about 6 s; after the fix, it usually returns within 1 s.
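The before-and-after difference can be measured from inside R too; a rough sketch using curl::curl_fetch_memory (the elapsed time includes the DNS lookup, so it mirrors the command-line curl timings above):

library(curl)

# Elapsed time includes DNS resolution: ~6 s before the resolv.conf change, well under 1 s after
system.time(res <- curl_fetch_memory("https://www.baidu.com"))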

With the problem fixed, getGEO takes about 5 seconds, comfortably under the 10 s timeout in the error message, so everything just works:

> gse <- GEOquery::getGEO('GSE10')
Setting options('download.file.method.GEOquery'='auto')
Setting options('GEOquery.inmemory.gpl'=FALSE)
Found 1 file(s)
GSE10_series_matrix.txt.gz
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSEnnn/GSE10/matrix/GSE10_series_matrix.txt.gz'
Content type 'application/x-gzip' length 364007 bytes (355 KB)
==================================================
downloaded 355 KB
Parsed with column specification:
cols(
  TAG = col_character(),
  GSM571 = col_double(),
  GSM572 = col_double(),
  GSM573 = col_double(),
  GSM574 = col_double()