URIs in HttpClient


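1. URL overview

Standard URL format:

scheme://host:optional-port/resource-path?optional-query#optional-fragment

That is: scheme://authority:port/path?query#fragment

Full URL format:

scheme://username:password@host:optional-port/resource-path?optional-query#optional-fragment

That is: scheme://userid:password@authority:port/path?query#fragment

Note: the user-info part lets you embed credentials directly in the URL, which, for example, saves you from being prompted for login details when accessing SVN.

[Figure: where each URL component appears in the HTTP message headers]

Note: the fragment identifies a portion of the network resource.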
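To make the component names above concrete, here is a minimal sketch that splits a URL with the JDK's `java.net.URI` (not HttpClient's own `URI` class). The URL, the class name `UrlParts`, and the method `describe` are illustrative assumptions.

```java
import java.net.URI;

final class UrlParts {
    /** Returns the components of a hierarchical URL, one "name=value" per line. */
    static String describe(String url) {
        URI uri = URI.create(url); // throws IllegalArgumentException on bad syntax
        return "scheme=" + uri.getScheme() + "\n"
             + "userInfo=" + uri.getUserInfo() + "\n"
             + "host=" + uri.getHost() + "\n"
             + "port=" + uri.getPort() + "\n"
             + "path=" + uri.getPath() + "\n"
             + "query=" + uri.getQuery() + "\n"
             + "fragment=" + uri.getFragment();
    }

    public static void main(String[] args) {
        // Hypothetical URL that exercises every component.
        System.out.println(describe(
            "http://user:secret@www.example.com:8080/docs/index.html?lang=en#intro"));
    }
}
```

The fragment is parsed on the client side only; it is not normally sent to the server as part of the request line.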

For more detail, see:

(1) 6 Things You Should Know About Fragment URLs

http://blog.httpwatch.com/2011/03/01/6-things-you-should-know-about-fragment-urls/

(2) Revisiting URL fragments through a small issue with changing a QQ password

http://www.cnblogs.com/syf/archive/2013/04/02/2995903.html

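2. Differences between URI and URL

A URI (uniform resource identifier) uniquely identifies a resource. URIs are either absolute or relative. An absolute URI starts with a scheme followed by a colon, for example http://www.cnn.com/articles/articles.html. A relative URI does not start with a scheme, for example articles/articles.html.

A URL (uniform resource locator) is a specific kind of URI: URLs are a subset of URIs that also say how to locate the resource. In other words, a URL not only identifies a resource uniquely but also carries the information needed to reach it. Because a URL must provide enough information to locate the resource, it is absolute. A so-called relative URL is always interpreted against an absolute URL, so in effect it is still absolute.

Note: URIs were introduced to address some shortcomings of URLs, such as the URL having to change whenever the resource moves.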
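The absolute/relative distinction can be checked with a short sketch. It uses the JDK's `java.net.URI` rather than HttpClient's class, and the class name `RelativeUriDemo` and its methods are hypothetical names chosen for illustration.

```java
import java.net.URI;

final class RelativeUriDemo {
    /** True when the string starts with a scheme, i.e. it is an absolute URI. */
    static boolean isAbsolute(String s) {
        return URI.create(s).isAbsolute();
    }

    /** Interprets a relative reference against an absolute base URI. */
    static String resolve(String base, String relative) {
        return URI.create(base).resolve(relative).toString();
    }

    public static void main(String[] args) {
        System.out.println(isAbsolute("http://www.cnn.com/articles/articles.html")); // true
        System.out.println(isAbsolute("articles/articles.html"));                    // false
        // The relative form only becomes locatable once a base is supplied.
        System.out.println(resolve("http://www.cnn.com/", "articles/articles.html"));
    }
}
```

Note the trailing slash on the base: `java.net.URI` resolves relative paths against the base path, so the base in this sketch is given a non-empty path.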

For more detail, see:

http://docs.oracle.com/javase/1.5.0/docs/api/java/net/URI.html

http://en.wikipedia.org/wiki/Uniform_Resource_Identifier

3. URIs in HttpClient

HttpClient supports the following URL format (the fragment is ignored by default):

scheme://[userid:password@]authority:port/path?query

Some background on the URI grammar:

URI-reference = [ absoluteURI | relativeURI ] [ "#" fragment ]
absoluteURI   = scheme ":" ( hier_part | opaque_part )
relativeURI   = ( net_path | abs_path | rel_path ) [ "?" query ]
hier_part     = ( net_path | abs_path ) [ "?" query ]
net_path      = "//" authority [ abs_path ]
abs_path      = "/" path_segments
rel_path      = [ path ] [ ";" params ] [ "?" query ]
authority     = server | reg_name
host          = hostname | IPv4address | IPv6reference

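Terminology:

reg_name: a name registered for the server, such as www.baidu.com.

hier_part: a hierarchical part, in which delimiters separate components at different levels.

opaque_part: a part that has no delimiters separating its components.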
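The hier_part / opaque_part distinction is visible in the JDK's `java.net.URI`, which exposes it through `isOpaque()`. The following is a minimal sketch; the class name `OpaqueDemo`, the method `kind`, and the sample URIs are assumptions made for illustration.

```java
import java.net.URI;

final class OpaqueDemo {
    /** Classifies a URI string as "relative", "opaque", or "hierarchical". */
    static String kind(String s) {
        URI uri = URI.create(s);
        if (!uri.isAbsolute()) {
            return "relative"; // no scheme, so neither hier_part nor opaque_part applies
        }
        // An opaque URI is absolute and its scheme-specific part does not begin with "/".
        return uri.isOpaque() ? "opaque" : "hierarchical";
    }

    public static void main(String[] args) {
        System.out.println(kind("http://www.baidu.com/s?wd=HttpClient")); // hierarchical
        System.out.println(kind("mailto:someone@example.com"));           // opaque
        System.out.println(kind("/ms?ie=utf8"));                          // relative
    }
}
```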

Example: a complete URL

http://userName:password@www.testpage.com/otengyue/article/search?ie=utf8&oe=utf8

URI-reference: http://userName:password@www.testpage.com/otengyue/article/search?ie=utf8&oe=utf8

Path: /otengyue/article/search

Host: www.testpage.com

Authority: userName:password@www.testpage.com

abs_path: /otengyue/article/search

net_path: //userName:password@www.testpage.com/otengyue/article/search

Further reading:

URL specification (RFC 1738): http://tools.ietf.org/html/rfc1738

A blog post introducing RFC 1738 (in Chinese): http://blog.csdn.net/msgsnd/article/details/2172306

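[Figure: class diagram of the URL-handling classes in HttpClient]

The URI class encapsulates the components of a URL. By default the components are stored in escaped (encoded) form using the configured charset; whether the input string is treated as already escaped is chosen when the instance is constructed.

(1) The default charset for URI components is "UTF-8". There are two ways to change it.

<1> Permanent change: call the static method URI.setDefaultProtocolCharset(String charset). (Not recommended.)

Place the following before all other code: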

	// Requires: import org.apache.commons.httpclient.URI;
	try {
	     URI.setDefaultProtocolCharset("gbk");
	} catch (URI.DefaultCharsetChanged cc) {
	     // CASE 1: the exception could be ignored, when it is set by user
	     if (cc.getReasonCode() == URI.DefaultCharsetChanged.PROTOCOL_CHARSET) {
	         // CASE 2: let user know the default protocol charset changed
	     } else {
	         // CASE 2: let user know the default document charset changed
	     }
	}
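<2> Temporary change: pass the charset as a constructor argument. This affects only the instance being created.
(2) Setting and getting URI components
getRaw*/getEscaped*  return the component in its raw (escaped) form
get*                 return the component in decoded form
setRaw*/setEscaped*  set the component directly (the argument is stored as given, without encoding)
set*                 set the component (the argument is encoded before it is stored)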
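The raw versus decoded getter pairs have a close analogue in the JDK's `java.net.URI`, which can be used to see the difference without the HttpClient jar. This is only a sketch of the same idea, not HttpClient's API; the class name `RawVsDecoded` and the sample URL are assumptions.

```java
import java.net.URI;

final class RawVsDecoded {
    static final String SAMPLE =
        "http://www.example.com/a%20b/search?wd=%E4%B8%AD%E6%96%87";

    /** Path exactly as written in the URI, escapes intact. */
    static String rawPath(String s)  { return URI.create(s).getRawPath(); }

    /** Path with percent-escapes decoded. */
    static String path(String s)     { return URI.create(s).getPath(); }

    /** Query exactly as written in the URI. */
    static String rawQuery(String s) { return URI.create(s).getRawQuery(); }

    /** Query with percent-escapes decoded (java.net.URI decodes them as UTF-8). */
    static String query(String s)    { return URI.create(s).getQuery(); }

    public static void main(String[] args) {
        System.out.println(rawPath(SAMPLE));
        System.out.println(path(SAMPLE));
        System.out.println(rawQuery(SAMPLE));
        System.out.println(query(SAMPLE));
    }
}
```

When forwarding a URI to another system, the raw form is usually the safer one to pass along, since decoding and re-encoding can change reserved characters.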
(3) URI instantiation
	// escaped=false: the string is not yet escaped and is encoded when parsed.
	// Pass the flag explicitly rather than relying on a constructor default.
	boolean escaped = false;
	URI uri = new URI("http://www.baidu.com/search/s?ie=utf8&oe=utf8&wd=HttpClient&tn=98010089_dg&ch=1", escaped);
	// From an absolute URI string
	URI uri1 = new URI("http://www.baidu.com/search/s?ie=utf8&oe=utf8&wd=HttpClient&tn=98010089_dg&ch=1");
	// Set the URI charset at construction time
	URI uri2 = new URI("http://www.baidu.com", "gbk");
	// From a relative URI; do not forget the leading "/"
	URI uri3 = new URI("/ms?ie=utf8&oe=utf8&wd=HttpClient&tn=98010089_dg&ch=1");
	// From a base URI plus a relative URI
	// uri4.getURI(): http://www.baidu.com/ms?ie=utf8&oe=utf8&wd=HttpClient&tn=98010089_dg&ch=1
	URI uri4 = new URI(uri1, uri3);
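(4) Protocol-specific classes
For a specific protocol such as http, use the corresponding URI subclass. For example, the HttpURL class for the HTTP protocol wraps a set of methods that simplify working with URIs (HttpURL stores its components with automatic encoding). Note that HttpURL's no-argument constructor is protected and cannot be called from outside.
[Figure: HttpURL constructors]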
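4. Encoding issues
Character encoding in HttpClient concerns three parts: the URL, the headers, and the request/response body.
URL encoding follows RFC 1738, which specifies US-ASCII and therefore cannot represent double-byte characters directly. The URI class in HttpClient uses UTF-8 as its default charset. The Content-Type header may carry charset information, for example: Content-Type: text/html; charset=UTF-8.
For a GET request the parameters live in the query string, which is part of the URI, so the concern is how Chinese characters in the parameters are encoded.
For a POST request the parameters live in the body, so the concern is the encoding of the body.
Solutions:
(1) Encoding GET request parameters
<1> Set the URI charset
The URI class source shows that the charset defaults to UTF-8 when none is set: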
	/** The charset of the protocol used by this URI instance. */
	protected String protocolCharset = null;

	/** The default charset of the protocol.  RFC 2277, 2396 */
	protected static String defaultProtocolCharset = "UTF-8";

To use a different charset, pass it to the constructor:

	public URI(char[] escaped, String charset)
	        throws URIException, NullPointerException {
	    protocolCharset = charset;
	    parseUriReference(new String(escaped), true);
	}
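<2> Set the charset of the query string
In practice, Chinese-encoding problems are usually confined to the request parameters in the query string, so encoding just those parameters is enough.
	queryString = EncodingUtil.formUrlEncode(params, "UTF-8"); // NameValuePair[] params
	queryString = URIUtil.encodeQuery(params, "UTF-8"); // String params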
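To see why the charset used for the query string matters, here is a minimal sketch using the JDK's `URLEncoder` and `URLDecoder` instead of HttpClient's utilities. The class name `QueryEncodingDemo` is hypothetical, and the sketch assumes the JDK in use ships the GBK charset (standard JDK distributions do).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

final class QueryEncodingDemo {
    /** Percent-encodes a parameter value using the named charset. */
    static String encode(String value, String charset) {
        try {
            return URLEncoder.encode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    /** Decodes a percent-encoded value using the named charset. */
    static String decode(String value, String charset) {
        try {
            return URLDecoder.decode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String word = "\u4e2d\u6587"; // the two characters of "Chinese (language)"
        String utf8 = encode(word, "UTF-8");
        String gbk = encode(word, "GBK");
        System.out.println("UTF-8: " + utf8);
        System.out.println("GBK:   " + gbk);
        // Decoding with the wrong charset is what produces garbled text.
        System.out.println("GBK bytes read as UTF-8 match original: "
            + decode(gbk, "UTF-8").equals(word));
    }
}
```

The same text yields different byte sequences under each charset, so client and server have to agree on one; the server cannot infer it from the percent-escapes alone.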
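(2) Encoding POST request parameters
The main step is to add Content-Type: text/html; charset=UTF-8 to the headers of the POST request.
The HttpMethodParams source that sets the charset is shown below (the default is ISO-8859-1 when none is set):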
	/**
	 * Sets the default charset to be used for writing content body,
	 * when no charset explicitly specified.
	 * @param charset The charset
	 */
	public void setContentCharset(String charset) {
	    setParameter(HTTP_CONTENT_CHARSET, charset);
	}

	/**
	 * Returns the default charset to be used for writing content body,
	 * when no charset explicitly specified.
	 * @return The charset
	 */
	public String getContentCharset() {
	    String charset = (String) getParameter(HTTP_CONTENT_CHARSET);
	    if (charset == null) {
	        LOG.warn("Default content charset not configured, using ISO-8859-1");
	        charset = "ISO-8859-1";
	    }
	    return charset;
	}
It can be set in any of three ways:
<1> Set Content-Type in the headers of the POST request
	PostMethod method = new PostMethod();
	method.addRequestHeader("Content-Type", "text/html;charset=UTF-8");
<2> Set the content charset on HttpClientParams
	HttpClient httpClient = new HttpClient();
	HttpClientParams params = httpClient.getParams();
	params.setContentCharset("UTF-8");
<3> Set the content charset on HttpMethodParams
	PostMethod method = new PostMethod();
	HttpMethodParams params = method.getParams();
	params.setContentCharset("UTF-8");
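The reason the ISO-8859-1 default is a problem for Chinese text can be shown with plain JDK string conversion. This sketch does not call HttpClient at all; it only imitates writing a body to bytes and reading it back, and the class name `BodyCharsetDemo` and the sample body are assumptions.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BodyCharsetDemo {
    /** Encodes the body with one charset and decodes the bytes with another. */
    static String roundTrip(String body, Charset writeCharset, Charset readCharset) {
        byte[] wire = body.getBytes(writeCharset); // what would be sent as the body
        return new String(wire, readCharset);      // what the receiver reconstructs
    }

    public static void main(String[] args) {
        String body = "name=\u4e2d\u6587";
        // ISO-8859-1 cannot represent the two Chinese characters, so each one
        // is replaced by '?' before the bytes ever leave the client.
        System.out.println(roundTrip(body, StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1));
        // With UTF-8 on both sides the text survives.
        System.out.println(roundTrip(body, StandardCharsets.UTF_8, StandardCharsets.UTF_8).equals(body));
    }
}
```

Because the loss happens at encoding time, no server-side setting can recover the characters afterwards, which is why the charset has to be set on the client.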
(3) Encoding of the request/response body
The HttpMethodBase source that returns the response body is as follows:
	/**
	 * Returns the response body of the HTTP method, if any, as a {@link String}.
	 * If response body is not available or cannot be read, returns <tt>null</tt>
	 * The string conversion on the data is done using the character encoding specified
	 * in <tt>Content-Type</tt> header. Buffers the response and this method can be
	 * called several times yielding the same result each time.
	 * Note: This will cause the entire response body to be buffered in memory. A
	 * malicious server may easily exhaust all the VM memory. It is strongly
	 * recommended, to use getResponseAsStream if the content length of the response
	 * is unknown or reasonably large.
	 * @return The response body or <code>null</code>.
	 * @throws IOException If an I/O (transport) problem occurs while obtaining the
	 * response body.
	 */
	public String getResponseBodyAsString() throws IOException {
	    byte[] rawdata = null;
	    if (responseAvailable()) {
	        rawdata = getResponseBody();
	    }
	    if (rawdata != null) {
	        return EncodingUtil.getString(rawdata, getResponseCharSet());
	    } else {
	        return null;
	    }
	}

	/**
	 * Returns the character encoding of the response from the <tt>Content-Type</tt> header.
	 * @return String The character set.
	 */
	public String getResponseCharSet() {
	    return getContentCharSet(getResponseHeader("Content-Type"));
	}
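As shown, getResponseCharSet reads the charset of the response data from the Content-Type header. This requires the servlet to set the Content-Type header of the response correctly.
The source shows that after receiving a response, HttpClient determines the charset in this order: the charset in the HTTP header; if the header carries no charset, the contentCharset configured in HttpClientParams; if none is configured, ISO-8859-1. The relevant source is in HttpMethodBase: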
	/**
	 * Returns the character encoding of the request from the <tt>Content-Type</tt> header.
	 * @return String The character set.
	 */
	public String getRequestCharSet() {
	    return getContentCharSet(getRequestHeader("Content-Type"));
	}

	/**
	 * Returns the character set from the <tt>Content-Type</tt> header.
	 * @param contentheader The content header.
	 * @return String The character set.
	 */
	protected String getContentCharSet(Header contentheader) {
	    LOG.trace("enter getContentCharSet( Header contentheader )");
	    String charset = null;
	    if (contentheader != null) {
	        HeaderElement values[] = contentheader.getElements();
	        // I expect only one header element to be there
	        // No more. no less
	        if (values.length == 1) {
	            NameValuePair param = values[0].getParameterByName("charset");
	            if (param != null) {
	                // If I get anything "funny"
	                // UnsupportedEncondingException will result
	                charset = param.getValue();
	            }
	        }
	    }
	    if (charset == null) {
	        charset = getParams().getContentCharset();
	        if (LOG.isDebugEnabled()) {
	            LOG.debug("Default charset used: " + charset);
	        }
	    }
	    return charset;
	}
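Therefore, when the response headers lack Content-Type: text/html; charset=utf-8 and the body comes back garbled, set the fallback charset on the client before sending the request:
	httpClient.getParams().setContentCharset("gbk");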
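The lookup order described above (header charset, then the configured default, then ISO-8859-1) can be sketched in a few lines of plain Java. This is a simplified illustration, not HttpClient's header parser; the class name `CharsetResolver` and its `resolve` method are hypothetical.

```java
final class CharsetResolver {
    private static final String PREFIX = "charset=";

    /**
     * Picks the charset for decoding a response body.
     * contentType is the raw Content-Type header value (may be null);
     * configuredDefault is the client-level setting (may be null).
     */
    static String resolve(String contentType, String configuredDefault) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                // Parameter names are case-insensitive.
                if (p.regionMatches(true, 0, PREFIX, 0, PREFIX.length())) {
                    String value = p.substring(PREFIX.length()).trim();
                    if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                        value = value.substring(1, value.length() - 1); // strip quotes
                    }
                    if (!value.isEmpty()) {
                        return value; // 1. charset from the header wins
                    }
                }
            }
        }
        // 2. configured default, 3. ISO-8859-1 as the last resort
        return configuredDefault != null ? configuredDefault : "ISO-8859-1";
    }

    public static void main(String[] args) {
        System.out.println(resolve("text/html; charset=UTF-8", "gbk"));
        System.out.println(resolve("text/html", "gbk"));
        System.out.println(resolve(null, null));
    }
}
```

The configured default only takes effect when the server omits the charset, so it does not help when the server declares a charset that is simply wrong.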