Hive 3.1.1
Spark 3.1.1
你的堆栈溢出问题提出,我引用一下。
"我需要以编程方式为Hive表和视图生成DDL语句。我试着用Spark和Beeline来完成这项任务。Beeline每条语句大约需要5-10秒,而Spark在几毫秒内就完成了同样的事情。我打算使用Spark,因为它比Beeline快。使用spark从蜂巢中获取DDL语句的一个缺点是,它把CHAR,VARCHAR字符当作String来处理,并且不保留CHAR,VARCHAR数据类型所包含的长度信息。同时,Beeline保留了CHAR、VARCHAR数据类型的数据类型和长度信息。我正在使用Spark 2.4.1和Beeline 2.1.1。下面给出了创建表的示例命令及其显示创建表的输出。"
在测试数据库的Hive中创建一个简单的表
hive> use test;
hive> create table etc(ID BIGINT, col1 VARCHAR(30), col2 STRING);
hive> desc formatted etc;
# col_name data_type comment
id bigint
col1 varchar(30)
col2 string
# Detailed Table Information
Database: test
OwnerType: USER
Owner: hduser
CreateTime: Fri Mar 11 18:29:34 GMT 2022
LastAccessTime: UNKNOWN
Retention: 0
Location: hdfs://rhes75:9000/user/hive/warehouse/test.db/etc
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}}
bucketing_version 2
numFiles 0
numRows 0
rawDataSize 0
totalSize 0
transient_lastDdlTime 1647023374
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
现在让我们去看一下火花-外壳
scala> spark.sql("show create table test.etc").show(false)
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|createtab_stmt |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|CREATE TABLE `test`.`etc` (
`id` BIGINT,
`col1` VARCHAR(30),
`col2` STRING)
USING text
TBLPROPERTIES (
'bucketing_version' = '2',
'transient_lastDdlTime' = '1647023374')
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
你可以看到Spark正确地显示了列
现在让我们通过beeline在hive中创建同样的表
0: jdbc:hive2://rhes75:10099/default> use test
No rows affected (0.019 seconds)
0: jdbc:hive2://rhes75:10099/default> create table etc(ID BIGINT, col1 VARCHAR(30), col2 STRING)
. . . . . . . . . . . . . . . . . . > No rows affected (0.304 seconds)
0: jdbc:hive2://rhes75:10099/default> desc formatted etc
. . . . . . . . . . . . . . . . . . > +-------------------------------+----------------------------------------------------+----------------------------------------------------+
| col_name | data_type | comment |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| # col_name | data_type | comment |
| id | bigint | |
| col1 | varchar(30) | |
| col2 | string | |
| | NULL | NULL |
| # Detailed Table Information | NULL | NULL |
| Database: | test | NULL |
| OwnerType: | USER | NULL |
| Owner: | hduser | NULL |
| CreateTime: | Fri Mar 11 18:51:00 GMT 2022 | NULL |
| LastAccessTime: | UNKNOWN | NULL |
| Retention: | 0 | NULL |
| Location: | hdfs://rhes75:9000/user/hive/warehouse/test.db/etc | NULL |
| Table Type: | MANAGED_TABLE | NULL |
| Table Parameters: | NULL | NULL |
| | COLUMN_STATS_ACCURATE | {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}} |
| | bucketing_version | 2 |
| | numFiles | 0 |
| | numRows | 0 |
| | rawDataSize | 0 |
| | totalSize | 0 |
| | transient_lastDdlTime | 1647024660 |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL |
| InputFormat: | org.apache.hadoop.mapred.TextInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL |
| Compressed: | No | NULL |
| Num Buckets: | -1 | NULL |
| Bucket Columns: | [] | NULL |
| Sort Columns: | [] | NULL |
| Storage Desc Params: | NULL | NULL |
| | serialization.format | 1 |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
33 rows selected (0.159 seconds)
现在在spark-shell中再次检查
scala> spark.sql("show create table test.etc").show(false)
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|createtab_stmt |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|CREATE TABLE `test`.`etc` (
`id` BIGINT,
`col1` VARCHAR(30),
`col2` STRING)
USING text
TBLPROPERTIES (
'bucketing_version' = '2',
'transient_lastDdlTime' = '1647024660')