InputSplit

InputSplit is the first step: it divides the input data into chunks that are handed to the mappers.

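By default, splits follow the blocks of the data file: one block corresponds to one mapper, and each mapper is preferentially launched on the machine that holds its block.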
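As a rough illustration of the one-block-per-mapper default, here is a minimal Python sketch that cuts a file into block-sized splits. The name compute_splits is hypothetical, and the sketch ignores details such as data locality and configurable minimum/maximum split sizes.

```python
def compute_splits(file_length, block_size):
    """Return (offset, length) pairs, one per block-sized split."""
    splits = []
    offset = 0
    while offset < file_length:
        # The last split may be shorter than a full block.
        length = min(block_size, file_length - offset)
        splits.append((offset, length))
        offset += length
    return splits


MB = 1024 * 1024
# A 300 MB file with 128 MB blocks yields three splits, hence three mappers.
for offset, length in compute_splits(300 * MB, 128 * MB):
    print(offset // MB, length // MB)
```

Each tuple corresponds to one mapper's share of the file in this simplified model.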

To customize how splits are produced, override the getSplits method of InputFormat.

https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/mapred/InputFormat.html

In streaming:

-inputformat JavaClassName	Optional	Class you supply should return key/value pairs of Text class. If not specified, TextInputFormat is used as the default
-outputformat JavaClassName	Optional	Class you supply should take key/value pairs of Text class. If not specified, TextOutputFormat is used as the default

These two parameters specify which InputFormat and OutputFormat classes to use.

Partition

Partition assigns the map output to the different reducers. A custom partitioner usually extends the class "org.apache.hadoop.mapreduce.Partitioner".
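The core of a partitioner is a function from a key to a reducer index. The sketch below shows the common hash-modulo approach in Python. hash_partition is a hypothetical helper, and zlib.crc32 is only a stand-in for the hash that Hadoop computes on the Java side.

```python
import zlib


def hash_partition(key, num_reducers):
    """Map a key to a reducer index in the range [0, num_reducers)."""
    # zlib.crc32 stands in for Java's key.hashCode(); it is stable across
    # runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


keys = ["apple", "banana", "apple", "cherry"]
for key in keys:
    print(key, "-> reducer", hash_partition(key, 3))
```

Identical keys always land on the same reducer, which is the property the later stages rely on.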

Grouping

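This concept is harder to grasp. Before the data reaches the reducer it is grouped once more: each group is passed to the same reducer in a single reduce call, and the key for that call is the key of the first record in the group.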

https://stackoverflow.com/questions/14728480/what-is-the-use-of-grouping-comparator-in-hadoop-map-reduce

In the accepted answer, a-1 and a-2 are merged by the grouping into a single group, keyed by a-1, for the reducer to process.
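To make the a-1 / a-2 example concrete, here is a small Python simulation of the grouping step. The function names and the "name-number" key layout are assumptions made for illustration; this is not Hadoop's API.

```python
from itertools import groupby


def natural_key(record):
    # Composite key "a-1" -> ("a", 1): the full sort order of the records.
    name, number = record[0].split("-")
    return (name, int(number))


def group_key(record):
    # Grouping compares only the part of the key before the dash.
    return record[0].split("-")[0]


def reduce_calls(records):
    """Return one (key, values) pair per simulated reduce call."""
    ordered = sorted(records, key=natural_key)
    calls = []
    for _, group in groupby(ordered, key=group_key):
        group = list(group)
        # The key of the call is the key of the first record in the group.
        calls.append((group[0][0], [value for _, value in group]))
    return calls


records = [("a-2", "y"), ("b-1", "z"), ("a-1", "x")]
print(reduce_calls(records))  # [('a-1', ['x', 'y']), ('b-1', ['z'])]
```

Without the grouping step, a-1 and a-2 would each trigger their own reduce call.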

So how are Partition and Grouping handled in streaming?

In streaming, the Partition class can be specified with a command-line parameter:

-partitioner JavaClassName	Optional	Class that determines which reduce a key is sent to

It can also be specified with another set of parameters that use sort-style key specifications:

-D stream.map.output.field.separator=. \
-D stream.num.map.output.key.fields=4 \
-D mapred.text.key.partitioner.options=-k1,2 \

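This sets "." as the field separator, treats the first four fields as the key, and partitions on the first and second fields. The partitioner options are read by KeyFieldBasedPartitioner, so the job also needs -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner.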
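The effect of these three options can be imitated in a few lines of Python. This is only a sketch of the behaviour described above: split_key_value and partition_for are hypothetical helpers, the sample line is made up, and crc32 again stands in for Hadoop's internal hash.

```python
import zlib


def split_key_value(line, separator=".", num_key_fields=4):
    """Split a map output line into (key, value) after the Nth separator."""
    fields = line.split(separator)
    key = separator.join(fields[:num_key_fields])
    value = separator.join(fields[num_key_fields:])
    return key, value


def partition_for(key, num_reducers, separator=".", first=1, last=2):
    """Pick a reducer from key fields first..last (1-based, like -k1,2)."""
    fields = key.split(separator)
    partition_key = separator.join(fields[first - 1:last])
    # crc32 is a stand-in for the hash used on the Java side.
    return zlib.crc32(partition_key.encode("utf-8")) % num_reducers


key, value = split_key_value("11.12.1.2.payload")
print(key, value)  # 11.12.1.2 payload
print(partition_for(key, 5) == partition_for("11.12.9.9", 5))  # True
```

Keys that share their first two fields go to the same reducer even though the full four-field keys differ.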

-D mapreduce.partition.keycomparator.options='-k1,2 -k3,3nr -k4,4nr'

For comparison, the Linux sort command:

sort -k1,1 -k2,2n -k3,3nr # sort by the first column, then by the second column as a number, then by the third column as a number in reverse order
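The same three-level ordering can be written as a Python sort key, which may help when checking what a comparator option string will do. This is a sketch assuming whitespace-separated rows whose second and third columns are integers; sort_like_command is a hypothetical name.

```python
def sort_like_command(rows):
    """Order rows like `sort -k1,1 -k2,2n -k3,3nr` on whitespace-separated text."""
    def sort_key(row):
        col1, col2, col3 = row.split()[:3]
        # Negating the third column gives descending numeric order.
        return (col1, int(col2), -int(col3))
    return sorted(rows, key=sort_key)


rows = ["b 1 5", "a 10 1", "a 2 3", "a 2 7"]
print(sort_like_command(rows))  # ['a 2 7', 'a 2 3', 'a 10 1', 'b 1 5']
```

Note that "a 2 ..." sorts before "a 10 ..." because the second column is compared numerically rather than as text.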

Grouping has no corresponding implementation in streaming mode, but partitioning can be used in its place.
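One way to read this: if the partition fields are the fields you would have grouped on, every record of a group reaches the same reducer in sorted order, and the reducer script can detect group boundaries itself. The Python sketch below assumes tab-separated key/value lines and "."-separated key fields; reduce_stream and its parameters are hypothetical.

```python
def reduce_stream(lines, separator="\t", field_separator=".", group_fields=1):
    """Yield (group, values) by watching for changes in the leading key fields."""
    current_group = None
    values = []
    for line in lines:
        key, _, value = line.rstrip("\n").partition(separator)
        group = field_separator.join(key.split(field_separator)[:group_fields])
        if group != current_group:
            # A new group starts: emit the one that just ended.
            if current_group is not None:
                yield current_group, values
            current_group, values = group, []
        values.append(value)
    if current_group is not None:
        yield current_group, values


# In a real streaming reducer the lines would come from sys.stdin.
sample = ["a.1\tx\n", "a.2\ty\n", "b.1\tz\n"]
for group, values in reduce_stream(sample):
    print(group, values)
```

This only works when the input is already sorted so that records of one group are adjacent, which is what the partition and comparator options above are meant to arrange.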

Parameter	Optional/Required	Description
-input directoryname or filename	Required	Input location for mapper
-output directoryname	Required	Output location for reducer
-mapper executable or JavaClassName	Required	Mapper executable
-reducer executable or JavaClassName	Required	Reducer executable
-file filename	Optional	Make the mapper, reducer, or combiner executable available locally on the compute nodes
-inputformat JavaClassName	Optional	Class you supply should return key/value pairs of Text class. If not specified, TextInputFormat is used as the default
-outputformat JavaClassName	Optional	Class you supply should take key/value pairs of Text class. If not specified, TextOutputFormat is used as the default
-partitioner JavaClassName	Optional	Class that determines which reduce a key is sent to
-combiner streamingCommand or JavaClassName	Optional	Combiner executable for map output
-cmdenv name=value	Optional	Pass environment variable to streaming commands
-inputreader	Optional	For backwards-compatibility: specifies a record reader class (instead of an input format class)
-verbose	Optional	Verbose output
-lazyOutput	Optional	Create output lazily. For example, if the output format is based on FileOutputFormat, the output file is created only on the first call to output.collect (or Context.write)
-numReduceTasks	Optional	Specify the number of reducers
-mapdebug	Optional	Script to call when map task fails
-reducedebug	Optional	Script to call when reduce task fails
