Apache Doris-0036-数据导入-04-外部导入-S3
0、背景
研究一下 Apache Doris 官方文档 。
(1)本系列文章
格瑞图:Apache Doris-0001~0004-快速开始
格瑞图:Apache Doris-0005~0012-数据模型
格瑞图:Apache Doris-0013~0016-数据划分
格瑞图:Apache Doris-0017~0022-使用指南
格瑞图:Apache Doris-0023~0025-上卷查询
格瑞图:Apache Doris-0028~0032-索引概述
格瑞图:Apache Doris-0033-数据导入-01-导入总览
格瑞图:Apache Doris-0034-数据导入-02-本地导入
格瑞图:Apache Doris-0035-数据导入-03-外部导入-HDFS
1、外部导入 - S3
(1)对象存储服务
使用 比特纳米的 MinIO 镜像 搭建一个 MinIO 容器。
请参考文章 格瑞图:容器化-部署容器-minio 。文章版本为:2021-12-10T23:03:39Z。
本篇使用的是最新版本:2023-03-24T21:41:23Z。 时间过得真快~
1、创建一个访问密钥
2、使用访问密钥,通过派森代码,创建一个 poc 桶,并将派森代码自身上传到 poc 桶中。
from minio import Minio
from minio.error import S3Error
def main():
# Create a client with the MinIO server playground, its access key
# and secret key.
client = Minio(
"10.10.1.11:9000",
secure=False,
access_key="ka1QIeq416hnCGJW",
secret_key="pSNyFwhmI4gvVgM3HtQAIB68ozk2RAJL",
# Make 'poc' bucket if not exist.
found = client.bucket_exists("poc")
if not found:
client.make_bucket("poc")
else:
print("Bucket 'poc' already exists")
# Upload 'file.uploader.py' as object name
# 'poc.py' to bucket 'poc'.
client.fput_object(
"poc", "poc.py", "file.uploader.py",
print(
"'file.uploader.py' is successfully uploaded as "
"object 'poc.py' to bucket 'poc'."
if __name__ == "__main__":
try:
main()
except S3Error as exc:
print("error occurred.", exc)
3、将测试数据上传到概念验证 poc 桶里
(2)适用场景
版本 0.14 之后支持从支持 S3 协议的在线存储系统导入数据。
数据量级别在几十到百吉比特 GB。
(3)导入语法
跟代理导数类似,要替换一下 WITH BROKER broker_name 为 WITH S3
WITH S3
"AWS_ENDPOINT" = "AWS_ENDPOINT",
"AWS_ACCESS_KEY" = "AWS_ACCESS_KEY",
"AWS_SECRET_KEY"="AWS_SECRET_KEY",
"AWS_REGION" = "AWS_REGION"
)
(4)创建数据表
[root@vm30 ~]# mysql -uroot -p123456 -P9030 -hvm30 demo -e '
> CREATE TABLE IF NOT EXISTS load_s3_file_test
> id INT,
> age TINYINT,
> name VARCHAR(50)
> unique key(id)
> DISTRIBUTED BY HASH(id) BUCKETS 3
> properties("replication_num" = "1");'
mysql: [Warning] Using a password on the command line interface can be insecure.
[root@vm30 ~]# mysql -uroot -p123456 -P9030 -hvm30 demo -e 'desc load_s3_file_test;'
mysql: [Warning] Using a password on the command line interface can be insecure.
+-------+-------------+------+-------+---------+---------+
| Field | Type | Null | Key | Default | Extra |
+-------+-------------+------+-------+---------+---------+
| id | INT | Yes | true | NULL | |
| age | TINYINT | Yes | false | NULL | REPLACE |
| name | VARCHAR(50) | Yes | false | NULL | REPLACE |
+-------+-------------+------+-------+---------+---------+
[root@vm30 ~]#
(5)导入数据
[root@vm30 ~]# mysql -uroot -p123456 -P9030 -hvm30 demo -e '
> LOAD LABEL demo.label_20220411
> DATA INFILE("s3://poc/test_s3.csv")
> INTO TABLE `load_s3_file_test`
> COLUMNS TERMINATED BY ","
> (id,age,name)
> WITH S3
> "AWS_ENDPOINT" = "http://bm11:9000",
> "AWS_ACCESS_KEY" = "ka1QIeq416hnCGJW",
> "AWS_SECRET_KEY" = "pSNyFwhmI4gvVgM3HtQAIB68ozk2RAJL",
> "AWS_REGION" = "cn-beijing"
> PROPERTIES
> "timeout"="3600"
mysql: [Warning] Using a password on the command line interface can be insecure.
[root@vm30 ~]#
[root@vm30 ~]# mysql -uroot -p123456 -P9030 -hvm30 demo -e 'SHOW LOAD WHERE LABEL = "label_20220411"\G;'
mysql: [Warning] Using a password on the command line interface can be insecure.
*************************** 1. row ***************************
JobId: 12413
Label: label_20220411
State: PENDING
Progress: ETL:0%; LOAD:0%
Type: BROKER
EtlInfo: NULL
TaskInfo: cluster:N/A; timeout(s):10; max_filter_ratio:0.0
ErrorMsg: NULL
CreateTime: 2023-04-11 20:13:55
EtlStartTime: NULL
EtlFinishTime: NULL
LoadStartTime: NULL
LoadFinishTime: NULL
URL: NULL
JobDetails: {"Unfinished backends":{},"ScannedRows":0,"TaskNumber":0,"LoadBytes":0,"All backends":{},"FileNumber":0,"FileSize":0}
TransactionId: 0
ErrorTablets: {}
[root@vm30 ~]#
数据加载作业提交后好久都处于等待状态 Pending。
具体什时候开始执行呢?什么机制导致其开始执行呢?
(说明:通过后续执行成功的时间,可以判断这里的是超时时间,任务无法连接 S3 服务。)
等了八分钟还没有执行。再次查看就报错了:
[root@vm30 ~]# mysql -uroot -p123456 -P9030 -hvm30 demo -e 'SHOW LOAD WHERE LABEL = "label_20220411"\G;'
mysql: [Warning] Using a password on the command line interface can be insecure.
*************************** 1. row ***************************
JobId: 12413
Label: label_20220411
State: CANCELLED
Progress: ETL:N/A; LOAD:N/A
Type: BROKER
EtlInfo: NULL
TaskInfo: cluster:N/A; timeout(s):10; max_filter_ratio:0.0
ErrorMsg: type:ETL_RUN_FAIL; msg:errCode = 2, detailMessage = S3 list path exception. path=s3://poc/test_s3.csv, err: errCode = 2, detailMessage = S3 list path failed. path=s3://poc/test_s3.csv,msg=errors while get file status getFileStatus on s3://poc/test_s3.csv: com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to bm11:443 [bm11/10.10.1.11] failed: Connection refused (Connection refused): Unable to execute HTTP request: Connect to bm11:443 [bm11/10.10.1.11] failed: Connection refused (Connection refused)
CreateTime: 2023-04-11 20:13:55
EtlStartTime: NULL
EtlFinishTime: NULL
LoadStartTime: NULL
LoadFinishTime: 2023-04-11 20:22:19
URL: NULL
JobDetails: {"Unfinished backends":{},"ScannedRows":0,"TaskNumber":0,"LoadBytes":0,"All backends":{},"FileNumber":0,"FileSize":0}
TransactionId: 0
ErrorTablets: {}
[root@vm30 ~]#
ErrorMsg: type:ETL_RUN_FAIL; msg:errCode =2, detailMessage = S3 list path exception. path=s3://poc/test_s3.csv, err: errCode = 2, detailMessage = S3 list path failed. path=s3://poc/test_s3.csv,msg=errors while get file status getFileStatus on s3://poc/test_s3.csv: com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to bm11:443 [bm11/10.10.1.11] failed: Connection refused (Connection refused): Unable to execute HTTP request: Connect to bm11:443 [bm11/10.10.1.11] failed: Connection refused (Connection refused)
S3 list path exception - S3 列出路径异常。
S3 list path failed - S3 列出路径失败。
errors while get file status getFileStatus on - 在 s3://... 上通过 gFS 获取文件状态时发生错误。
Unable to execute HTTP request - 无法执行 HTTP 请求。
Connection refused (Connection refused) - 连接拒绝。
这里要通过安全方式连接 MinIO 服务。但是部署的容器没有证书。
(6)删除重做
使用 CLEAN LABEL label_20220411 FROM demo; 从数据库 demo 中删除标签 label_20220411 之后重新提交:
[root@vm30 ~]# mysql -uroot -p123456 -P9030 -hvm30 demo -e '
> CLEAN LABEL label_20220411 FROM demo;'
mysql: [Warning] Using a password on the command line interface can be insecure.
[root@vm30 ~]# mysql -uroot -p123456 -P9030 -hvm30 demo -e '
> LOAD LABEL demo.label_20220411
> DATA INFILE("s3://poc/test_s3.csv")
> INTO TABLE `load_s3_file_test`
> COLUMNS TERMINATED BY ","
> (id,age,name)
> WITH S3
> "AWS_ENDPOINT" = "http://bm11:9000",
> "AWS_ACCESS_KEY" = "ka1QIeq416hnCGJW",
> "AWS_SECRET_KEY" = "pSNyFwhmI4gvVgM3HtQAIB68ozk2RAJL",
> "AWS_REGION" = "cn-beijing",
> "use_path_style" = "true"
> PROPERTIES
> "timeout"="10"
mysql: [Warning] Using a password on the command line interface can be insecure.
[root@vm30 ~]#
这次换为 "AWS_ENDPOINT" = " http:// bm11:9000 " 试试。
[root@vm30 ~]# mysql -uroot -p123456 -P9030 -hvm30 demo -e '
> SHOW LOAD WHERE LABEL = "label_20220411"\G;'
mysql: [Warning] Using a password on the command line interface can be insecure.
*************************** 1. row ***************************
JobId: 12431
Label: label_20220411
State: FINISHED
Progress: ETL:100%; LOAD:100%
Type: BROKER
EtlInfo: unselected.rows=0; dpp.abnorm.ALL=0; dpp.norm.ALL=5
TaskInfo: cluster:N/A; timeout(s):10; max_filter_ratio:0.0
ErrorMsg: NULL
CreateTime: 2023-04-11 20:56:08
EtlStartTime: 2023-04-11 20:56:10
EtlFinishTime: 2023-04-11 20:56:10
LoadStartTime: 2023-04-11 20:56:10
LoadFinishTime: 2023-04-11 20:56:10
URL: NULL
JobDetails: {"Unfinished backends":{"6cf156ffcca14295-9e2d523c9700c060":[]},"ScannedRows":5,"TaskNumber":1,"LoadBytes":125,"All backends":{"6cf156ffcca14295-9e2d523c9700c060":[10003]},"FileNumber":1,"FileSize":44}
TransactionId: 1021
ErrorTablets: {}
[root@vm30 ~]#
(6)查询数据
终于成功了,查询一下数据验证结果:
[root@vm30 ~]# mysql -uroot -p123456 -P9030 -hvm30 demo -e 'select * from load_s3_file_test;'
mysql: [Warning] Using a password on the command line interface can be insecure.
+------+------+-------+
| id | age | name |
+------+------+-------+
| 4 | 3 | bush |