如何更新或删除数据_云原生大数据计算服务 MaxCompute-阿里云帮助中心

相关文章推荐

鼻子大的汤圆 · Visual Studio 2017 ...· 3 月前 ·

不要命的帽子 · merge into用法详解· 1 月前 ·

威武的茴香 · SQL中merge into用法_sql ...· 1 月前 ·

挂过科的葡萄 · 使用MERGE ...· 1 月前 ·

打酱油的帽子 · 10分钟看懂深度学习的基本数据类型——张量- ...· 1 年前 ·

私奔的山楂 · DataGridView ...· 1 年前 ·

性感的小蝌蚪 · 教你用Wii手柄+蓝牙玩转Android模拟 ...· 2 年前 ·

大鼻子的书包 · 江西财经大学经济学院· 2 年前 ·

MaxCompute的 delete 、 update 功能具备与传统数据库用法类似的删除或更新表中指定行的能力。

实际使用


    delete

、


    update

功能时，系统会针对每一次删除或更新操作自动生成用户不可见的Delta文件，用于记录删除或更新的数据信息。具体实现原理如下：


     delete

：Delta文件中使用


     txnid(bigint)

和


     rowid(bigint)

字段标识Transactional表的Base文件（表在系统中的存储形式）中的记录在哪次删除操作中被删除。

例如，表t1的Base文件为f1，且内容为 a, b, c, a, b ，当执行 delete from t1 where c1='a'; 命令后，系统会生成一个单独的 f1.delta 文件。假设 txnid 是 t0 ，则 f1.delta 的内容是 ((0, t0), (3, t0)) ，标识行0和行3，在txnt0中被删除了。如果再执行一次 delete 操作，系统会又生成一个 f2.delta 文件，该文件仍然是根据Base文件f1编号，读取文件时，基于Base文件f1和当前所有Delta文件的共同表示结果，读取没有被删除的数据。


     update

：


     update

操作会转换为


     delete


     insert into

的实现逻辑。


      delete

、


      update

功能具备的优势如下：

写数据量下降

此前，MaxCompute通过 insert into 或 insert overwrite 操作方式删除或更新表数据，更多信息，请参见插入或覆写数据（INSERT INTO | INSERT OVERWRITE）。当用户需要更新表或分区中的少量数据时，如果通过 insert 操作实现，需要先读取表的全量数据，然后通过 select 操作更新数据，最后通过 insert 操作将全量数据写回表中，效率较低。使用 delete 、 update 功能后，系统无需写回全部数据，写数据量会显著下降。

对于按量计费场景，


     delete

、


     update

和


     insert overwrite

任务的写数据部分不收费，但是


     delete

、


     update

任务需要按分区过滤读取需要变更的数据，用于标注删除的记录或写回更新的记录，而读取数据部分依然遵照SQL作业按量计费模型收费，所以


     delete

、


     update

任务相比


     insert overwrite

任务，费用并不能因为写数据量减少而减少。

对于包年包月场景，


     delete

、


     update

减少了写数据资源消耗，与


     insert overwrite

相比，相同资源可以运行更多的任务。

重要多次


      delete

、


      update

操作会使Transactional表的底层存储增大，会提高存储和后续查询费用，且影响后续查询效率，建议定期合并（Compact）后台数据。更多合并操作信息，请参见合并Transactional表文件。

当作业并发运行且操作的目标表相同时，可能会出现作业冲突问题，更多信息，请参见 ACID语义。

delete 、 update 功能适用于随机、低频删除或更新表或分区中的少量数据。例如，按照T+1周期性地批量对表或分区中5%以下的行删除或更新数据。

delete 、 update 功能不适用于高频更新、删除数据或实时写入目标表场景。


       delete

、


       update

功能及对应Transactional表的使用限制如下：

仅支持Transactional表。更多创建Transactional表信息，请参见表操作。

在创建表时，不支持将聚簇表、外部表设置为Transactional表。

不支持MaxCompute内部表、外部表、聚簇表与Transactional表互转。

不支持其他系统的作业（例如MaxCompute Spark、PAI、Graph）访问Transactional表。

不支持


     clone table

、


     merge partition

操作。

不支持通过备份与恢复功能备份数据，因此在对Transactional表的重要数据执行


     update

、


     delete

或


     insert overwrite

操作前需要手动通过


     select


     insert

操作将数据备份至其他表中。

通过


      delete

、


      update

操作删除或更新表或分区内的数据时，注意事项如下：

如果需要对表中较少数据进行删除或更新操作，且操作和后续读数据的频率也不频繁，建议使用


     delete

、


     update

操作，并且在多次执行删除或更新操作之后，请合并表的Base文件和Delta文件，降低表的实际存储。更多信息，请参见合并Transactional表文件。

如果删除或更新行数较多（超过5%）并且操作不频繁，但后续对该表的读操作比较频繁，建议使用


     insert overwrite

或


     insert into

操作。更多信息，请参见插入或覆写数据（INSERT INTO | INSERT OVERWRITE）。

例如，某业务场景为每次删除或更新10%的数据，一天更新10次。建议根据实际情况评估 delete 、 update 操作产生的费用及后续对读性能的消耗是否小于每次使用 insert overwrite 或 insert into 操作产生的费用及后续对读性能的消耗，比较两种方式在具体场景中的效率，选择更优方案。

删除数据会生成Delta文件，所以删除数据不一定能降低存储，如果您希望通过


     delete

操作删除数据来降低存储，请合并表的Base文件和Delta文件，降低表的实际存储。更多信息，请参见合并Transactional表文件。

MaxCompute会按照批处理方式执行


     delete

、


     update

作业，每一条语句都会使用资源并产生费用，建议您使用批量方式删除或更新数据。例如您通过Python脚本生成并提交了大量行级别更新作业，且每条语句只操作一行或者少量行数据，则每条语句都会产生与SQL扫描输入数据量对应的费用，并使用相应的计算资源，多条语句累加时将明显增加费用成本，降低系统效率。命令示例如下。

--推荐方案。
update table1 set col1= (select value1 from table2 where table1.id = table2.id and table1.region = table2.region);
--不推荐方案。
update table1 set col1=1 where id='2021063001'and region='beijing';                  
update table1 set col1=2 where id='2021063002'and region='beijing';

示例1：创建非分区表acid_delete，并导入数据，执行


     delete

操作删除满足指定条件的行数据。命令示例如下：

--创建Transactional表acid_delete。
create table if not exists acid_delete(id bigint) tblproperties ("transactional"="true"); 
--插入数据。
insert overwrite table acid_delete values(1),(2),(3),(2); 
--查看插入结果。
select * from acid_delete; 
+------------+
| id         |
+------------+
| 1          |
| 2          |
| 3          |
| 2          |
+------------+
--删除id为2的行，如果在MaxCompute客户端（odpscmd）执行，需要输入yes|no确认。
delete from acid_delete where id = 2; 
--查看结果表中数据只有1、3。
select * from acid_delete; 
+------------+
| id         |
+------------+
| 1          |
| 3          |
+------------+

示例2：创建分区表acid_delete_pt，并导入数据，执行


     delete

操作删除满足指定条件的行。命令示例如下：

--创建Transactional表acid_delete_pt。 
create table if not exists acid_delete_pt(id bigint) partitioned by(ds string) tblproperties ("transactional"="true");
--添加分区。
alter table acid_delete_pt add if not exists partition (ds= '2019');
alter table acid_delete_pt add if not exists partition (ds= '2018');
--插入数据。
insert overwrite table acid_delete_pt partition (ds='2019') values(1),(2),(3);
insert overwrite table acid_delete_pt partition (ds='2018') values(1),(2),(3);
--查看插入结果。
select * from acid_delete_pt;
+------------+------------+
| id         | ds         |
+------------+------------+
| 1          | 2018       |
| 2          | 2018       |
| 3          | 2018       |
| 1          | 2019       |
| 2          | 2019       |
| 3          | 2019       |
+------------+------------+
--删除分区为2019且id为2的数据，如果在MaxCompute客户端（odpscmd）执行，需要输入yes|no确认。
delete from acid_delete_pt where ds='2019' and id = 2;
--查看结果表中已删除分区为2019且id为2的数据。
select * from acid_delete_pt;
+------------+------------+
| id         | ds         |
+------------+------------+
| 1          | 2018       |
| 2          | 2018       |
| 3          | 2018       |
| 1          | 2019       |
| 3          | 2019       |
+------------+------------+

示例3：创建目标表acid_delete_t和关联表acid_delete_s，通过关联操作删除满足指定条件的行。命令示例如下：

--创建目标Transactional表acid_delete_t和关联表acid_delete_s。
create table if not exists acid_delete_t(id int,value1 int,value2 int) tblproperties ("transactional"="true");
create table if not exists acid_delete_s(id int,value1 int,value2 int);
--插入数据。
insert overwrite table acid_delete_t values(2,20,21),(3,30,31),(4,40,41);
insert overwrite table acid_delete_s values(1,100,101),(2,200,201),(3,300,301);
--删除acid_delete_t表中id与acid_delete_s表中id不匹配的行。如果在MaxCompute客户端（odpscmd）执行，需要输入yes|no确认。
delete from acid_delete_t where not exists (select * from acid_delete_s where acid_delete_t.id=acid_delete_s.id);
--查看结果表中只有id为2、3的数据。
select * from acid_delete_t;
+------------+------------+------------+
| id         | value1     | value2     |
+------------+------------+------------+
| 2          | 20         | 21         |
| 3          | 30         | 31         |
+------------+------------+------------+

命令格式

--方式1
update <table_name> set <col1_name> = <value1> [, <col2_name> = <value2> ...] [WHERE <where_condition>];
--方式2
update <table_name> set (<col1_name> [, <col2_name> ...]) = (<value1> [, <value2> ...])[WHERE <where_condition>];
--方式3
UPDATE <table_name>
       SET <col1_name> = <value1> [ , <col2_name> = <value2> , ... ]
        [ FROM <additional_tables> ]
        [ WHERE <where_condition> ]

table_name ：必填。待执行


     update

操作的Transactional表名称。

col1_name 、 col2_name ：至少更新一个。待修改行对应的列名称。

value1 、 value2 ：至少更新一个列值。修改后的新值。

where_condition ：可选。WHERE子句，用于筛选满足条件的数据。更多WHERE子句信息，请参见 WHERE子句（where_condition）。如果不带WHERE子句，会更新表中的所有数据。

additional_tables ：可选，from子句。


      update

支持from子句，使用from子句时，


      update

的使用会更加方便，不使用from子句与使用from子句的对比示例如下。不使用from子句

update target set v =
  (select min(v) from src group by k where target.k = src.key)
  where target.k in (select k from src);

使用from子句

update target set v = b.v
  from (select k, min(v) v from src group by k) b
  where target.k = b.k;

从上述示例代码可见：

当用源表的多行数据更新目标表的一行数据的时，由于不知道用哪条源表的数据去更新，所以遇到这种情况需要用户写聚合操作来保证数据源的unique性，可以看出不使用from子句时，代码不够简洁，用from子句的写比较简洁易懂。

关联更新的时候，如果只更新交集数据，不使用from子句时需要写额外的where条件，相对于from语法而言也不太简洁。

示例1：创建非分区表acid_update，并导入数据，执行


     update

操作更新满足指定条件的行对应的列数据。命令示例如下：

--创建Transactional表acid_update。  
create table if not exists acid_update(id bigint) tblproperties ("transactional"="true");
--插入数据。
insert overwrite table acid_update values(1),(2),(3),(2);
--查看插入结果。
select * from acid_update; 
+------------+
| id         |
+------------+
| 1          |
| 2          |
| 3          |
| 2          |
+------------+
--将所有id为2的行，id值更新为4。
update acid_update set id = 4 where id = 2; 
--查看更新结果，2被更新为4。
select * from acid_update; 
+------------+
| id         |
+------------+
| 1          |
| 3          |
| 4          |
| 4          |
+------------+

示例2：创建分区表acid_update，并导入数据，执行


     update

操作更新满足指定条件的行对应的列数据。命令示例如下：

--创建Transactional表acid_update_pt。 
create table if not exists acid_update_pt(id bigint) partitioned by(ds string) tblproperties ("transactional"="true");
--添加分区。
alter table acid_update_pt add if not exists partition (ds= '2019');
--插入数据。
insert overwrite table acid_update_pt partition (ds='2019') values(1),(2),(3);
--查看插入结果
select * from acid_update_pt where ds = '2019';
+------------+------------+
| id         | ds         |
+------------+------------+
| 1          | 2019       |
| 2          | 2019       |
| 3          | 2019       |
+------------+------------+
--更新指定行的一列数据，将分区为2019的所有id=2的行，id值更新为4。
update acid_update_pt set id = 4 where ds = '2019' and id = 2; 
--查看更新结果，2被更新为4。
select * from acid_update_pt where ds = '2019';
+------------+------------+
| id         | ds         |
+------------+------------+
| 4          | 2019       |
| 1          | 2019       |
| 3          | 2019       |
+------------+------------+

示例3：创建目标表acid_update_t和关联表acid_update_s，实现同时更新多列值。命令示例如下：

--创建待更新目标Transactional表acid_update_t和关联表acid_update_s。
create table if not exists acid_update_t(id int,value1 int,value2 int) tblproperties ("transactional"="true");
create table if not exists acid_update_s(id int,value1 int,value2 int);
--插入数据。
insert overwrite table acid_update_t values(2,20,21),(3,30,31),(4,40,41);
insert overwrite table acid_update_s values(1,100,101),(2,200,201),(3,300,301);
--方式一：用常量更新。
update acid_update_t set (value1, value2) = (60,61);
--查询方式一目标表结果数据。
select * from acid_update_t;
+------------+------------+------------+
| id         | value1     | value2     |
+------------+------------+------------+
| 2          | 60         | 61         |
| 3          | 60         | 61         |
| 4          | 60         | 61         |
+------------+------------+------------+
--方式二：关联更新，规则为acid_update_t表左关联acid_update_s表。
update acid_update_t set (value1, value2) = (select value1, value2 from acid_update_s where acid_update_t.id = acid_update_s.id);
--查询方式二目标表结果数据。
select * from acid_update_t;
+------------+------------+------------+
| id         | value1     | value2     |
+------------+------------+------------+
| 2          | 200        | 201        |
| 3          | 300        | 301        |
| 4          | NULL       | NULL       |
+------------+------------+------------+
--方式三：关联更新，规则为增加过滤条件，只更新交集。
update acid_update_t set (value1, value2) = (select value1, value2 from acid_update_s where acid_update_t.id = acid_update_s.id) where acid_update_t.id in (select id from acid_update_s);
--查询方式三目标表结果数据。
select * from acid_update_t;
+------------+------------+------------+
| id         | value1     | value2     |
+------------+------------+------------+
| 4          | 40         | 41         |
| 2          | 200        | 201        |
| 3          | 300        | 301        |
+------------+------------+------------+
--方式四：用汇总结果关联更新。
update acid_update_t set (id, value1, value2) = (select id, max(value1),max(value2) from acid_update_s where acid_update_t.id = acid_update_s.id group by acid_update_s.id) where acid_update_t.id in (select id from acid_update_s);
--查询方式四目标表结果数据。
select * from acid_update_t;
+------------+------------+------------+
| id         | value1     | value2     |
+------------+------------+------------+
| 4          | 40         | 41         |
| 2          | 200        | 201        |
| 3          | 300        | 301        |
+------------+------------+------------+

示例4：涉及两个表的简单关联查询，示例命令如下：

--创建更新目标表acid_update_t和关联表acid_update_s
create table if not exists acid_update_t(id bigint,value1 bigint,value2 bigint) tblproperties ("transactional"="true");
create table if not exists acid_update_s(id bigint,value1 bigint,value2 bigint);
--插入数据
insert overwrite table acid_update_t values(2,20,21),(3,30,31),(4,40,41);
insert overwrite table acid_update_s values(1,100,101),(2,200,201),(3,300,301);
--查询数据
select * from acid_update_t;
--返回：
+------------+------------+------------+
| id         | value1     | value2     |
+------------+------------+------------+
| 2          | 20         | 21         |
| 3          | 30         | 31         |
| 4          | 40         | 41         |
+------------+------------+------------+
select * from acid_update_s;
+------------+------------+------------+
| id         | value1     | value2     |
+------------+------------+------------+
| 1          | 100        | 101        |
| 2          | 200        | 201        |
| 3          | 300        | 301        |
+------------+------------+------------+
--关联更新，目标表增加过滤条件，只取交集
update acid_update_t set value1 = b.value1, value2 = b.value2
from acid_update_s b where acid_update_t.id = b.id;
--查看更新结果，20被更新为200，21被更新为201，30被更新为300，31被更新为301
select * from acid_update_t;
+------------+------------+------------+
| id         | value1     | value2     |
+------------+------------+------------+
| 4          | 40         | 41         |
| 2          | 200        | 201        |
| 3          | 300        | 301        |
+------------+------------+------------+
                           示例5：涉及多个表的复杂关联查询，示例命令如下：--创建更新目标表acid_update_t和关联表acid_update_s
create table if not exists acid_update_t(id bigint,value1 bigint,value2 bigint) tblproperties ("transactional"="true");
create table if not exists acid_update_s(id bigint,value1 bigint,value2 bigint);
create table if not exists acid_update_m(id bigint,value1 bigint,value2 bigint);
--插入数据
insert overwrite table acid_update_t
values(2,20,21),(3,30,31),(4,40,41),(5,50,51);
insert overwrite table acid_update_s
values (1,100,101),(2,200,201),(3,300,301),(4,400,401),(5,500,501);
insert overwrite table acid_update_m
values(3,30,101),(4,400,201),(5,300,301);
--查询数据
select * from acid_update_t;
+------------+------------+------------+
| id         | value1     | value2     |
+------------+------------+------------+
| 2          | 20         | 21         |
| 3          | 30         | 31         |
| 4          | 40         | 41         |
| 5          | 50         | 51         |
+------------+------------+------------+
select * from acid_update_s;
+------------+------------+------------+
| id         | value1     | value2     |
+------------+------------+------------+
| 1          | 100        | 101        |
| 2          | 200        | 201        |
| 3          | 300        | 301        |
| 4          | 400        | 401        |
| 5          | 500        | 501        |
+------------+------------+------------+
select * from acid_update_m;
+------------+------------+------------+
| id         | value1     | value2     |
+------------+------------+------------+
| 3          | 30         | 101        |
| 4          | 400        | 201        |
| 5          | 300        | 301        |
+------------+------------+------------+
--关联更新，并且在where中同时对原表和目标表进行过滤
update acid_update_t
set value1 = acid_update_s.value1, value2 = acid_update_s.value2
from acid_update_s
where acid_update_t.id = acid_update_s.id
and acid_update_s.id > 2
and acid_update_t.value1 not in
(select value1 from acid_update_m where id = acid_update_t.id)
and acid_update_s.value1 not in
(select value1 from acid_update_m where id = acid_update_s.id);
--查看更新结果，acid_update_t表只有id为5的数据符合条件，
对应value1被更新尾500，valu2被更新尾501
select * from acid_update_t;
+------------+------------+------------+
| id         | value1     | value2     |
+------------+------------+------------+
| 5          | 500        | 501        |
| 2          | 20         | 21         |
| 3          | 30         | 31         |
| 4          | 40         | 41         |
+------------+------------+------------+
                  合并Transactional表文件
                  Transactional表底层物理存储为不支持直接读取的Base文件和Delta文件。对Transactional表执行update或delete操作，不会修改Base文件，只会追加Delta文件，所以会出现更新或删除次数越多，表实际占用存储越大的情况，多次累积的Delta文件会产生较高的存储和后续查询费用。
                  
对同一表或分区，执行多次update或delete操作，会生成较多Delta文件。系统读数据时，需要加载这些Delta文件来确定哪些行被更新或删除，较多的Delta文件会影响数据读取效率。此时您可以将Base文件和Delta合并，减少存储以便提升数据读取效率。
                     
命令格式alter table <table_name> [partition (<partition_key> = '<partition_value>' [, ...])] compact {minor|major};
                           table_name：必填。待合并文件的Transactional表名称。
                           
partition_key：可选。当Transactional表为分区表时，指定分区列名。
                           
partition_value：可选。当Transactional表为分区表时，指定分区列名对应的列值。
                           
major|minor：至少选择其中一个。二者的区别是：
                                 
minor：只将Base文件及其下所有的Delta文件合并，消除Delta文件。
                                 
major：不仅将Base文件及其下所有的Delta文件合并，消除Delta文件，还会把表对应的Base文件中的小文件进行合并。当Base文件较小（小于32 MB）或有Delta文件的情况下，等价于重新对表执行insert overwrite操作，但当Base文件足够大（大于等于32 MB ），且不存在Delta文件的情况下，不会重写。
                           
示例1：基于Transactional表acid_delete，合并表文件。命令示例如下：alter table acid_delete compact minor;
返回结果如下：Summary:
Nothing found to merge, set odps.merge.cross.paths=true if cross path merge is permitted.
                           示例2：基于Transactional表acid_update_pt，合并表文件。命令示例如下：alter table acid_update_pt partition (ds = '2019') compact major;
返回结果如下：Summary:
table name: acid_update_pt /ds=2019  instance count: 2  run time: 6
  before merge, file count:        8  file size: 2613  file physical size: 7839
   after merge, file count:        2  file size: 679  file physical size: 2037
                           问题现象：执行update操作时，报错ODPS-0010000:System internal error - fuxi job failed, caused by: Data Set should contain
                                 exactly one row。
                           
问题原因：待更新的行数据与子查询结果中的数据无法一一对应，系统无法判断对哪一行数据进行更新。命令示例如下：update store set (s_county, s_manager) = (select d_country, d_manager from store_delta sd where sd.s_store_sk = store.s_store_sk) where s_store_sk in (select s_store_sk from store_delta);
通过子查询select d_country, d_manager from store_delta sd where sd.s_store_sk = store.s_store_sk与store_delta关联，并用store_delta的数据更新store。假设store的s_store_sk中有[1, 2, 3]三行数据，如果store_delta的s_store_sk有[1, 1]两行数据，数据无法一一对应，执行报错。
                           
解决措施：确保待更新的行数据与子查询结果中的数据一一对应。
                           问题现象：在DataWorks DataStudio中使用compact命令时，报错ODPS-0130161:[1,39] Parse exception - invalid token 'minor', expect one of 'StringLiteral','DoubleQuoteStringLiteral'。
                           
问题原因：DataWorks独享资源组中的MaxCompute客户端版本不支持compact命令。
                           
解决措施：请通过DataWorks交流群联系技术支持团队升级独享资源组中的MaxCompute客户端版本。