MongoDB data synchronization capabilities supported by DataWorks

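Supported versions

Only MongoDB 4.x and 5.x are supported.

Limits

Data Integration connects to MongoDB with a MongoDB database account. An ApsaraDB for MongoDB instance has a default root account. For security reasons, do not use root as the access account when you add a MongoDB data source.
If MongoDB is a sharded cluster, configure the mongos address in the data source, not the address of a mongod/shard node. Otherwise, a sync task that extracts data from MongoDB may query only the specified shard instead of the full data set. For more information about mongos and mongod, see mongos and mongod.
When concurrency is greater than 1, the _id field must have the same type in every document of the collections configured for the sync task (for example, all string or all ObjectId). Otherwise, some data cannot be synced.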
```
"useSplitVector" : false
```
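
The `_id` type requirement above can be checked before concurrency is raised. The following is a minimal sketch over in-memory documents, not part of DataWorks: the function names `id_type_counts` and `has_uniform_id_type` and the sample data are hypothetical, and in practice the documents would come from a MongoDB driver query rather than a literal list.

```python
from collections import Counter


def id_type_counts(docs):
    """Count how many documents use each Python type for their _id field."""
    return Counter(type(doc.get("_id")).__name__ for doc in docs)


def has_uniform_id_type(docs):
    """Return True when every document's _id has the same type."""
    return len(id_type_counts(docs)) <= 1


# Hypothetical sample: two string ids and one integer id.
sample = [{"_id": "a1"}, {"_id": "a2"}, {"_id": 3}]
print(id_type_counts(sample))
print(has_uniform_id_type(sample))
```

A collection that fails this check would need its `_id` values normalized, or the task would need to run with a concurrency of 1.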

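Supported field types

MongoDB data types supported by MongoDB Reader

Type	Batch read (MongoDB Reader)	Description
ObjectId	Supported	Object ID type.
Double	Supported	64-bit floating-point type.
32-bit integer	Supported	32-bit integer.
64-bit integer	Supported	64-bit integer.
Decimal128	Supported	Decimal128 type.
String	Supported	String type.
Boolean	Supported	Boolean type.
Timestamp	Supported	Timestamp type.
Date	Supported	Date type.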

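Type	Batch read (MongoDB Reader)	Description
Document	Supported	Embedded document type. If the type property is not configured, the Document is serialized to JSON directly. If type is set to document, the value is treated as a nested type and MongoDB Reader reads the Document's properties by path. For a detailed example, see "Data type example 2: Recursively parsing multi-level nested Documents" below.
Array	Supported	Array type. If type is set to array.json or arrays, the array is serialized to JSON directly. If type is set to array or document.array, the elements are joined into a string. The separator (splitter in column) defaults to a comma.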
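
The two ways of rendering an array can be sketched as follows. This is an illustration of the rule described above, not the plugin's code: the function name `render_array` is hypothetical, and the exact JSON spacing produced by MongoDB Reader is an assumption.

```python
import json


def render_array(values, column_type, splitter=","):
    """Render a MongoDB array value as a string for the given column type."""
    if column_type in ("array.json", "arrays"):
        # Serialize the whole array as one compact JSON string.
        return json.dumps(values, ensure_ascii=False, separators=(",", ":"))
    if column_type in ("array", "document.array"):
        # Join the elements with the configured splitter (comma by default).
        return splitter.join(str(v) for v in values)
    raise ValueError(f"not an array column type: {column_type}")


tags = ["a", "b", "c"]
print(render_array(tags, "arrays"))
print(render_array(tags, "array"))
print(render_array(tags, "document.array", splitter=" "))
```

The JSON form keeps element boundaries unambiguous, whereas the joined form loses them whenever an element contains the splitter.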

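Special Data Integration data type: combine

Type	Batch read (MongoDB Reader)	Description
Combine	Supported	Custom Data Integration type. If type is set to combine, MongoDB Reader removes the keys that correspond to the configured columns, then serializes all remaining content of the Document to JSON. For a detailed example, see "Data type example 1: Using the combine type" below.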
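
The combine behavior described above can be sketched in a few lines. This is a simplified model under stated assumptions: the function name `combine`, the sample document, and the compact JSON formatting are hypothetical, and only top-level keys are considered.

```python
import json


def combine(document, configured_names):
    """Serialize every top-level field that no configured column claims."""
    rest = {k: v for k, v in document.items() if k not in configured_names}
    return json.dumps(rest, ensure_ascii=False, separators=(",", ":"))


# Hypothetical document and column configuration.
doc = {"a": "a", "b": "b", "x_1": 1, "x_2": 2}
columns = [
    {"name": "a", "type": "string"},
    {"name": "b", "type": "string"},
    {"name": "doc", "type": "combine"},
]
# Keys claimed by explicitly configured (non-combine) columns.
explicit = {c["name"] for c in columns if c["type"] != "combine"}
print(combine(doc, explicit))
```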

MongoDB Reader data type conversion

Converted type category	MongoDB data type
LONG	INT, LONG, document.INT, and document.LONG
DOUBLE	DOUBLE and document.DOUBLE
STRING	STRING, ARRAY, document.STRING, document.ARRAY, and COMBINE
DATE	DATE and document.DATE
BOOLEAN	BOOL and document.BOOL
BYTES	BYTES and document.BYTES
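
The conversion table can be expressed as a lookup. The sketch below copies the table rows into a dictionary; the names `CONVERSION_TABLE` and `reader_category` are hypothetical, and treating type names as case-insensitive is an assumption made for convenience.

```python
# Rows copied from the conversion table: category -> MongoDB data types.
CONVERSION_TABLE = {
    "LONG": ["INT", "LONG", "document.INT", "document.LONG"],
    "DOUBLE": ["DOUBLE", "document.DOUBLE"],
    "STRING": ["STRING", "ARRAY", "document.STRING", "document.ARRAY", "COMBINE"],
    "DATE": ["DATE", "document.DATE"],
    "BOOLEAN": ["BOOL", "document.BOOL"],
    "BYTES": ["BYTES", "document.BYTES"],
}

# Inverted index: lowercased type name -> category.
CATEGORY_BY_TYPE = {
    type_name.lower(): category
    for category, type_names in CONVERSION_TABLE.items()
    for type_name in type_names
}


def reader_category(column_type):
    """Return the converted category for a column type; raise KeyError if unlisted."""
    return CATEGORY_BY_TYPE[column_type.lower()]


print(reader_category("document.int"))
print(reader_category("combine"))
```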

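MongoDB Writer data type conversion

Type category	MongoDB data type
Integer	INT and LONG
Floating point	DOUBLE
String	STRING and ARRAY
Date and time	DATE
Boolean	BOOL
Binary	BYTES

Data type example 1: Using the combine type

"column": [
    {"name": "a", "type": "string"},
    {"name": "b", "type": "string"},
    {"name": "doc", "type": "combine"}
]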

odps_column1	odps_column2	odps_column3
a	b	{x_1,x_2}
a	b	{x_2,x_3,x_4}
a	b	{x_5}

Data type example 2: Recursively parsing multi-level nested Documents

{
    "name": "name1",
    "a": {
        "b": {
            "c": "this is value"
        }
    }
}

{"name":"_id","type":"string"}
{"name":"name","type":"string"}
{"name":"a.b.c","type":"document"}
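
Reading a nested field by a dotted path such as `a.b.c` can be sketched as a walk through nested dictionaries. This is an illustration only: the function name `read_path` and the choice to return `None` for a missing key are assumptions, and the sample document is a hypothetical one in which `c` is nested under `a.b`.

```python
def read_path(document, path):
    """Follow a dotted path through nested dicts; return None if a key is missing."""
    current = document
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# Hypothetical document with field c nested under a.b.
doc = {"name": "name1", "a": {"b": {"c": "this is value"}}}
print(read_path(doc, "a.b.c"))
print(read_path(doc, "a.x.c"))
```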

MongoDB Reader script demo

{
    "type": "job",
    "version": "2.0", // Version number.
    "steps": [
        {
            "category": "reader",
            "name": "Reader",
            "parameter": {
                "datasource": "datasourceName", // Data source name.
                "collectionName": "tag_data", // Collection name.
                "query": "", // Query filter.
                "column": [
                    {
                        "name": "unique_id", // Field name.
                        "type": "string" // Field type.
                    },
                    {"name": "sid", "type": "string"},
                    {"name": "user_id", "type": "string"},
                    {"name": "auction_id", "type": "string"},
                    {"name": "content_type", "type": "string"},
                    {"name": "pool_type", "type": "string"},
                    {"name": "frontcat_id", "type": "array", "splitter": ""},
                    {"name": "categoryid", "type": "array", "splitter": ""},
                    {"name": "gmt_create", "type": "string"},
                    {"name": "taglist", "type": "array", "splitter": " "},
                    {"name": "property", "type": "string"},
                    {"name": "scorea", "type": "int"},
                    {"name": "scoreb", "type": "int"},
                    {"name": "scorec", "type": "int"},
                    {"name": "a.b", "type": "document.int"},
                    {"name": "a.b.c", "type": "document.array", "splitter": " "}
                ]
            },
            "stepType": "mongodb"
        },
        {
            "stepType": "stream",
            "parameter": {},
            "name": "Writer",
            "category": "writer"
        }
    ],
    "setting": {
        "common": {
            "column": {
                "timeZone": "GMT+0" // Time zone.
            }
        },
        "errorLimit": {
            "record": "0" // Number of error records allowed.
        },
        "speed": {
            "throttle": true, // When throttle is false, mbps has no effect and the rate is not limited; when throttle is true, the rate is limited.
            "concurrent": 1, // Job concurrency.
            "mbps": "12" // Rate limit; here 1 mbps = 1 MB/s.
        }
    },
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    }
}
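
A script like the demo above can be sanity-checked before submission. The sketch below assumes, as the demo suggests, that each entry in `order.hops` refers to a step by its `name`; the function name `check_hops` is hypothetical and this is not a DataWorks validation API.

```python
def check_hops(job):
    """Return a list of problems found in the job's order.hops section."""
    step_names = {step["name"] for step in job.get("steps", [])}
    problems = []
    for hop in job.get("order", {}).get("hops", []):
        for end in ("from", "to"):
            if hop.get(end) not in step_names:
                problems.append(f"hop {end}={hop.get(end)!r} matches no step name")
    return problems


# Trimmed-down job with the same step names as the demo.
job = {
    "type": "job",
    "version": "2.0",
    "steps": [
        {"stepType": "mongodb", "name": "Reader", "category": "reader", "parameter": {}},
        {"stepType": "stream", "name": "Writer", "category": "writer", "parameter": {}},
    ],
    "order": {"hops": [{"from": "Reader", "to": "Writer"}]},
}
print(check_hops(job))
```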

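MongoDB Reader script parameters

Parameter	Description
datasource	Data source name. Script mode supports adding data sources; the value must match the name of the added data source.
collectionName	The MongoDB collection name.
hint	MongoDB supports the hint parameter, which makes the query optimizer use a specific index to complete the query. In some cases this improves query performance. For details, see the hint parameter. Example: `{ "collectionName":"test_collection", "hint":"{age:1}" }`
column	The document field names in MongoDB, configured as an array to represent multiple MongoDB fields. name: the field name. type: the supported types are string (string), long (integer), double (floating-point number), date (date), bool (Boolean), bytes (binary sequence), arrays (read as a JSON string, for example ["a","b","c"]), array (read as a string joined by splitter, for example `a,b,c`; the arrays format is recommended), and combine (when data is read with the MongoDB Reader plugin, merges multiple fields of a MongoDB document into one JSON string). splitter: MongoDB supports arrays but the Data Integration framework does not, so arrays read from MongoDB are joined into a string with this separator.
batchSize	Number of records fetched per batch. Optional. Default: `1000`.
cursorTimeoutInMs	Cursor timeout in milliseconds. Optional. Default: `1000 * 60 * 10 = 600000`. A negative value means the cursor never times out.
query	Restricts the range of MongoDB data returned. Time conditions must use the supported time format; timestamp-type values cannot be used directly.
splitFactor	If data skew is severe, consider increasing splitFactor to split the data at a finer granularity without increasing concurrency.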
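
The optional parameters and their defaults can be modeled as a small helper. This is a sketch of the documented defaults only, not the plugin's implementation; the names `DEFAULTS`, `with_defaults`, and `cursor_never_times_out` are hypothetical.

```python
# Defaults taken from the parameter table.
DEFAULTS = {"batchSize": 1000, "cursorTimeoutInMs": 1000 * 60 * 10}


def with_defaults(parameter):
    """Return a copy of a reader parameter block with optional defaults filled in."""
    merged = dict(DEFAULTS)
    merged.update(parameter)
    return merged


def cursor_never_times_out(parameter):
    """A negative cursorTimeoutInMs means the cursor never times out."""
    return with_defaults(parameter)["cursorTimeoutInMs"] < 0


reader = {"datasource": "datasourceName", "collectionName": "tag_data"}
print(with_defaults(reader))
print(cursor_never_times_out({"cursorTimeoutInMs": -1}))
```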

MongoDB Writer script demo

{
    "type": "job",
    "version": "2.0", // Version number.
    "steps": [
        {
            "stepType": "stream",
            "parameter": {},
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType": "mongodb", // Plugin name.
            "parameter": {
                "datasource": "", // Data source name.
                "column": [
                    {
                        "name": "_id", // Column name.
                        "type": "ObjectId" // Data type. If replaceKey is _id, type must be ObjectId; if it is string, the replacement cannot be performed.
                    },
                    {"name": "age", "type": "int"},
                    {"name": "id", "type": "long"},
                    {"name": "wealth", "type": "double"},
                    {"name": "hobby", "type": "array", "splitter": " "},
                    {"name": "valid", "type": "boolean"},
                    {"name": "date_of_join", "format": "yyyy-MM-dd HH:mm:ss", "type": "date"}
                ],
                "writeMode": { // Write mode.
                    "isReplace": "true",
                    "replaceKey": "_id"
                },
                "collectionName": "datax_test" // Collection name.
            },
            "name": "Writer",
            "category": "writer"
        }
    ],
    "setting": {
        "errorLimit": { // Number of error records allowed.
            "record": "0"
        },
        "speed": {
            "throttle": true, // When throttle is false, mbps has no effect and the rate is not limited; when throttle is true, the rate is limited.
            "concurrent": 1, // Job concurrency.
            "mbps": "1" // Rate limit; here 1 mbps = 1 MB/s.
        },
        "jvmOption": "-Xms1024m -Xmx1024m"
    },
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    }
}
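
The interaction between `throttle` and `mbps` in the `speed` block can be sketched as follows. The function name `rate_limit_mb_per_s` is hypothetical, and treating a missing `throttle` key as false is an assumption rather than documented behavior.

```python
def rate_limit_mb_per_s(speed):
    """Return the effective limit in MB/s, or None when the job is not throttled."""
    if not speed.get("throttle", False):
        # With throttle disabled, mbps has no effect.
        return None
    # mbps is written as a string in the demo; 1 mbps = 1 MB/s.
    return float(speed["mbps"])


print(rate_limit_mb_per_s({"throttle": True, "concurrent": 1, "mbps": "1"}))
print(rate_limit_mb_per_s({"throttle": False, "concurrent": 1, "mbps": "12"}))
```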

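MongoDB Writer script parameters

Parameter	Description	Required	Default
datasource	Data source name. Script mode supports adding data sources; the value must match the name of the added data source.	Yes	None
collectionName	The MongoDB collection name.	Yes	None
column	The document field names in MongoDB, configured as an array to represent multiple MongoDB fields. name: the field name. type: the field type. int: 32-bit integer. string: string. array: `splitter` is required and is used to split the source string. For example, if the source data is `a,b,c` and `splitter` is a comma `,`, the data is split into the array `["a","b","c"]` and written to MongoDB. `{"type":"array","name":"col_split_array","splitter":",","itemtype":"string"}`	Yes	None
writeMode	Specifies whether data is overwritten during transfer. It contains isReplace and replaceKey. isReplace: true overwrites records that have the same replaceKey; false does not overwrite. replaceKey: the business primary key of each record, used for overwriting. Only a single key is supported; it is usually the primary key in MongoDB.	No	None
preSql	A pre-operation executed before data is written to MongoDB, such as clearing historical data. An empty preSql means that no pre-operation is configured. When you configure preSql, make sure that it is valid JSON.	No	None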

```
query=(BasicDBObject) com.mongodb.util.JSON.parse(json);
col.deleteMany(query);
```
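
The writer-side array splitting and the writeMode overwrite rule from the parameter table can be modeled in memory. This sketch uses a plain Python list as a stand-in for a collection; the function names `split_to_array` and `write_record` are hypothetical, and a real MongoDB collection enforces its own uniqueness rules, which this list does not.

```python
def split_to_array(value, splitter):
    """Split a source string into a list, as described for the array column type."""
    return value.split(splitter)


def write_record(collection, record, is_replace, replace_key):
    """Append a record, or overwrite the stored record that shares its replace key."""
    if is_replace:
        for i, existing in enumerate(collection):
            if existing.get(replace_key) == record.get(replace_key):
                collection[i] = record
                return
    collection.append(record)


collection = []
write_record(collection, {"_id": 1, "hobby": split_to_array("a,b,c", ",")}, True, "_id")
write_record(collection, {"_id": 1, "hobby": ["d"]}, True, "_id")  # overwrites _id 1
write_record(collection, {"_id": 2, "hobby": ["e"]}, False, "_id")  # plain insert
print(collection)
```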
