Apache Doris-0036-数据导入-04-外部导入-S3

Apache Doris-0036-数据导入-04-外部导入-S3

0、背景

研究一下 Apache Doris 官方文档

(1)本系列文章

格瑞图:Apache Doris-0001~0004-快速开始

格瑞图:Apache Doris-0005~0012-数据模型

格瑞图:Apache Doris-0013~0016-数据划分

格瑞图:Apache Doris-0017~0022-使用指南

格瑞图:Apache Doris-0023~0025-上卷查询

格瑞图:Apache Doris-0026-最佳实践

格瑞图:Apache Doris-0027-动态模式

格瑞图:Apache Doris-0028~0032-索引概述

格瑞图:Apache Doris-0033-数据导入-01-导入总览

格瑞图:Apache Doris-0034-数据导入-02-本地导入

格瑞图:Apache Doris-0035-数据导入-03-外部导入-HDFS

1、外部导入 - S3

(1)对象存储服务

使用 比特纳米的 MinIO 镜像 搭建一个 MinIO 容器。

请参考文章 格瑞图:容器化-部署容器-minio 。文章版本为:2021-12-10T23:03:39Z。

本篇使用的是最新版本:2023-03-24T21:41:23Z。 时间过得真快~

1、创建一个访问密钥

2、使用访问密钥,通过派森代码,创建一个 poc 桶,并将派森代码自身上传到 poc 桶中。

from minio import Minio
from minio.error import S3Error
def main():
    # Create a client with the MinIO server playground, its access key
    # and secret key.
    client = Minio(
        "10.10.1.11:9000",
        secure=False,
        access_key="ka1QIeq416hnCGJW",
        secret_key="pSNyFwhmI4gvVgM3HtQAIB68ozk2RAJL",
    # Make 'poc' bucket if not exist.
    found = client.bucket_exists("poc")
    if not found:
        client.make_bucket("poc")
    else:
        print("Bucket 'poc' already exists")
    # Upload 'file.uploader.py' as object name
    # 'poc.py' to bucket 'poc'.
    client.fput_object(
        "poc", "poc.py", "file.uploader.py",
    print(
        "'file.uploader.py' is successfully uploaded as "
        "object 'poc.py' to bucket 'poc'."
if __name__ == "__main__":
    try:
        main()
    except S3Error as exc:
        print("error occurred.", exc)

3、将测试数据上传到概念验证 poc 桶里

(2)适用场景

版本 0.14 之后支持从支持 S3 协议的在线存储系统导入数据。
数据量级别在几十到百吉比特 GB。

(3)导入语法

跟代理导数类似,要替换一下 WITH BROKER broker_name 为 WITH S3

    WITH S3
        "AWS_ENDPOINT" = "AWS_ENDPOINT",
        "AWS_ACCESS_KEY" = "AWS_ACCESS_KEY",
        "AWS_SECRET_KEY"="AWS_SECRET_KEY",
        "AWS_REGION" = "AWS_REGION"
    )

(4)创建数据表

[root@vm30 ~]# mysql -uroot -p123456 -P9030 -hvm30 demo -e '
> CREATE TABLE IF NOT EXISTS load_s3_file_test
>     id INT,
>     age TINYINT,
>     name VARCHAR(50)
> unique key(id)
> DISTRIBUTED BY HASH(id) BUCKETS 3
> properties("replication_num" = "1");'
mysql: [Warning] Using a password on the command line interface can be insecure.
[root@vm30 ~]# mysql -uroot -p123456 -P9030 -hvm30 demo -e 'desc load_s3_file_test;'
mysql: [Warning] Using a password on the command line interface can be insecure.
+-------+-------------+------+-------+---------+---------+
| Field | Type        | Null | Key   | Default | Extra   |
+-------+-------------+------+-------+---------+---------+
| id    | INT         | Yes  | true  | NULL    |         |
| age   | TINYINT     | Yes  | false | NULL    | REPLACE |
| name  | VARCHAR(50) | Yes  | false | NULL    | REPLACE |
+-------+-------------+------+-------+---------+---------+
[root@vm30 ~]#

(5)导入数据

[root@vm30 ~]# mysql -uroot -p123456 -P9030 -hvm30 demo -e '
> LOAD LABEL demo.label_20220411
>     DATA INFILE("s3://poc/test_s3.csv")
>     INTO TABLE `load_s3_file_test`
>     COLUMNS TERMINATED BY ","
>     (id,age,name)
> WITH S3
>     "AWS_ENDPOINT" = "http://bm11:9000",
>     "AWS_ACCESS_KEY" = "ka1QIeq416hnCGJW",
>     "AWS_SECRET_KEY" = "pSNyFwhmI4gvVgM3HtQAIB68ozk2RAJL",
>     "AWS_REGION" = "cn-beijing"
> PROPERTIES
>     "timeout"="3600"
mysql: [Warning] Using a password on the command line interface can be insecure.
[root@vm30 ~]#
[root@vm30 ~]# mysql -uroot -p123456 -P9030 -hvm30 demo -e 'SHOW LOAD WHERE LABEL = "label_20220411"\G;'
mysql: [Warning] Using a password on the command line interface can be insecure.
*************************** 1. row ***************************
         JobId: 12413
         Label: label_20220411
         State: PENDING
      Progress: ETL:0%; LOAD:0%
          Type: BROKER
       EtlInfo: NULL
      TaskInfo: cluster:N/A; timeout(s):10; max_filter_ratio:0.0
      ErrorMsg: NULL
    CreateTime: 2023-04-11 20:13:55
  EtlStartTime: NULL
 EtlFinishTime: NULL
 LoadStartTime: NULL
LoadFinishTime: NULL
           URL: NULL
    JobDetails: {"Unfinished backends":{},"ScannedRows":0,"TaskNumber":0,"LoadBytes":0,"All backends":{},"FileNumber":0,"FileSize":0}
 TransactionId: 0
  ErrorTablets: {}
[root@vm30 ~]#

数据加载作业提交后好久都处于等待状态 Pending。

具体什时候开始执行呢?什么机制导致其开始执行呢?

(说明:通过后续执行成功的时间,可以判断这里的是超时时间,任务无法连接 S3 服务。)

等了八分钟还没有执行。再次查看就报错了:

[root@vm30 ~]# mysql -uroot -p123456 -P9030 -hvm30 demo -e 'SHOW LOAD WHERE LABEL = "label_20220411"\G;'
mysql: [Warning] Using a password on the command line interface can be insecure.
*************************** 1. row ***************************
         JobId: 12413
         Label: label_20220411
         State: CANCELLED
      Progress: ETL:N/A; LOAD:N/A
          Type: BROKER
       EtlInfo: NULL
      TaskInfo: cluster:N/A; timeout(s):10; max_filter_ratio:0.0
      ErrorMsg: type:ETL_RUN_FAIL; msg:errCode = 2, detailMessage = S3 list path exception. path=s3://poc/test_s3.csv, err: errCode = 2, detailMessage = S3 list path failed. path=s3://poc/test_s3.csv,msg=errors while get file status getFileStatus on s3://poc/test_s3.csv: com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to bm11:443 [bm11/10.10.1.11] failed: Connection refused (Connection refused): Unable to execute HTTP request: Connect to bm11:443 [bm11/10.10.1.11] failed: Connection refused (Connection refused)
    CreateTime: 2023-04-11 20:13:55
  EtlStartTime: NULL
 EtlFinishTime: NULL
 LoadStartTime: NULL
LoadFinishTime: 2023-04-11 20:22:19
           URL: NULL
    JobDetails: {"Unfinished backends":{},"ScannedRows":0,"TaskNumber":0,"LoadBytes":0,"All backends":{},"FileNumber":0,"FileSize":0}
 TransactionId: 0
  ErrorTablets: {}
[root@vm30 ~]#
ErrorMsg: type:ETL_RUN_FAIL; msg:errCode =2, detailMessage = S3 list path exception. path=s3://poc/test_s3.csv, err: errCode = 2, detailMessage = S3 list path failed. path=s3://poc/test_s3.csv,msg=errors while get file status getFileStatus on s3://poc/test_s3.csv: com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to bm11:443 [bm11/10.10.1.11] failed: Connection refused (Connection refused): Unable to execute HTTP request: Connect to bm11:443 [bm11/10.10.1.11] failed: Connection refused (Connection refused)
S3 list path exception - S3 列出路径异常。
S3 list path failed - S3 列出路径失败。
errors while get file status getFileStatus on - 在 s3://... 上通过 gFS 获取文件状态时发生错误。
Unable to execute HTTP request - 无法执行 HTTP 请求。
Connection refused (Connection refused) - 连接拒绝。

这里要通过安全方式连接 MinIO 服务。但是部署的容器没有证书。

(6)删除重做

使用 CLEAN LABEL label_20220411 FROM demo; 从数据库 demo 中删除标签 label_20220411 之后重新提交:

[root@vm30 ~]# mysql -uroot -p123456 -P9030 -hvm30 demo -e '
> CLEAN LABEL label_20220411 FROM demo;'
mysql: [Warning] Using a password on the command line interface can be insecure.
[root@vm30 ~]# mysql -uroot -p123456 -P9030 -hvm30 demo -e '
> LOAD LABEL demo.label_20220411
>     DATA INFILE("s3://poc/test_s3.csv")
>     INTO TABLE `load_s3_file_test`
>     COLUMNS TERMINATED BY ","
>     (id,age,name)
> WITH S3
>     "AWS_ENDPOINT" = "http://bm11:9000",
>     "AWS_ACCESS_KEY" = "ka1QIeq416hnCGJW",
>     "AWS_SECRET_KEY" = "pSNyFwhmI4gvVgM3HtQAIB68ozk2RAJL",
>     "AWS_REGION" = "cn-beijing",
>     "use_path_style" = "true"
> PROPERTIES
>     "timeout"="10"
mysql: [Warning] Using a password on the command line interface can be insecure.
[root@vm30 ~]#

这次换为 "AWS_ENDPOINT" = " bm11:9000 " 试试。

[root@vm30 ~]# mysql -uroot -p123456 -P9030 -hvm30 demo -e '
> SHOW LOAD WHERE LABEL = "label_20220411"\G;'
mysql: [Warning] Using a password on the command line interface can be insecure.
*************************** 1. row ***************************
         JobId: 12431
         Label: label_20220411
         State: FINISHED
      Progress: ETL:100%; LOAD:100%
          Type: BROKER
       EtlInfo: unselected.rows=0; dpp.abnorm.ALL=0; dpp.norm.ALL=5
      TaskInfo: cluster:N/A; timeout(s):10; max_filter_ratio:0.0
      ErrorMsg: NULL
    CreateTime: 2023-04-11 20:56:08
  EtlStartTime: 2023-04-11 20:56:10
 EtlFinishTime: 2023-04-11 20:56:10
 LoadStartTime: 2023-04-11 20:56:10
LoadFinishTime: 2023-04-11 20:56:10
           URL: NULL
    JobDetails: {"Unfinished backends":{"6cf156ffcca14295-9e2d523c9700c060":[]},"ScannedRows":5,"TaskNumber":1,"LoadBytes":125,"All backends":{"6cf156ffcca14295-9e2d523c9700c060":[10003]},"FileNumber":1,"FileSize":44}
 TransactionId: 1021
  ErrorTablets: {}
[root@vm30 ~]#

(6)查询数据

终于成功了,查询一下数据验证结果:

[root@vm30 ~]# mysql -uroot -p123456 -P9030 -hvm30 demo -e 'select * from load_s3_file_test;'
mysql: [Warning] Using a password on the command line interface can be insecure.
+------+------+-------+
| id   | age  | name  |
+------+------+-------+
|    4 |    3 | bush  |