Question

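When Spark reads a very large local file, are its partitions automatically distributed across multiple nodes once the file is loaded into memory?

By "local file" I mean a file on the local file system of one particular node, not on HDFS. If they are not distributed automatically, are the corresponding partitions copied to multiple worker nodes for parallel computation only when an action is executed? I would appreciate a pointer to the relevant source code; I have not found it yet.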

5 Answers

羊咩 build awesome software system · Accepted Answer

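I. Are the corresponding partitions copied to multiple worker nodes for parallel computation when an action is executed?

No. When reading from the local file system rather than HDFS, the same file must exist on every worker node, and at read time the task on each worker node reads its own part of that file. For example, suppose you have a file and a Spark cluster (node1 is the master; node2 and node3 are the two slaves). The file must then exist on both node2 and node3, and the tasks on these two nodes each read half of it; otherwise the job fails. (One more point to note: the node your Spark app runs on, i.e. the driver, also needs the file, because the file is used to compute the partitions.)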
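To make "each task reads its own part of the file" concrete, here is a minimal sketch in plain Scala, without Spark. It writes a small temp file and reads it back as two byte ranges, the way the tasks on node2 and node3 would each read from their own local copy. The names LocalRangeRead, readRange and demo are hypothetical, and the sketch ignores the line-boundary adjustment that a real record reader performs.

```scala
import java.io.RandomAccessFile
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

object LocalRangeRead {
  // Hypothetical helper: read `len` bytes starting at `start` from a local file,
  // the way one task would read only its own slice of its node's copy.
  def readRange(path: Path, start: Long, len: Int): String = {
    val raf = new RandomAccessFile(path.toFile, "r")
    try {
      val buf = new Array[Byte](len)
      raf.seek(start)
      raf.readFully(buf)
      new String(buf, StandardCharsets.UTF_8)
    } finally raf.close()
  }

  // Write a 10-byte temp file, then read it as two halves.
  def demo(): (String, String) = {
    val path = Files.createTempFile("spark-local-demo", ".txt")
    try {
      Files.write(path, "aaaa\nbbbb\n".getBytes(StandardCharsets.UTF_8))
      val half = (Files.size(path) / 2).toInt
      (readRange(path, 0L, half), readRange(path, half.toLong, half))
    } finally Files.deleteIfExists(path)
  }
}
```

If the file were missing on one of the nodes, the read for that range would simply fail there, which is the error case described above.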

II. Which part of the source code is involved.

1. Tracing the source from the file-reading method SparkContext.textFile(path) shows that it uses TextInputFormat to create a HadoopRDD.

def textFile(
      path: String,
      minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
    assertNotStopped()
    hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
      minPartitions).map(pair => pair._2.toString)
}

def hadoopFile[K, V](
      path: String,
      inputFormatClass: Class[_ <: InputFormat[K, V]],
      keyClass: Class[K],
      valueClass: Class[V],
      minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = withScope {
    assertNotStopped()
    // A Hadoop configuration can be about 10 KB, which is pretty big, so broadcast it.
    val confBroadcast = broadcast(new SerializableConfiguration(hadoopConfiguration))
    val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
    new HadoopRDD(
      this,
      confBroadcast,
      Some(setInputPathsFunc),
      inputFormatClass,
      keyClass,
      valueClass,
      minPartitions).setName(path)
}
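As a rough illustration of what the two methods above do together: TextInputFormat produces one (byte offset, line) pair per line, and textFile keeps only the second element. The sketch below models that in plain Scala for ASCII text with "\n" line endings. TextRecords, records and textFileModel are hypothetical names, not part of Spark or Hadoop.

```scala
object TextRecords {
  // Simplified model of what TextInputFormat yields for ASCII text:
  // one (byte offset of the line, line content) pair per line.
  def records(content: String): Seq[(Long, String)] = {
    val lines = content.split("\n").toSeq
    // Offset of each line = total length of the preceding lines plus their newlines.
    val offsets = lines.scanLeft(0L)((off, line) => off + line.length + 1)
    offsets.zip(lines)
  }

  // Mirrors the `.map(pair => pair._2.toString)` step: drop the offset, keep the text.
  def textFileModel(content: String): Seq[String] =
    records(content).map(pair => pair._2)
}
```

For the input "ab\ncde\nf" the model yields the pairs (0, "ab"), (3, "cde"), (7, "f"), and textFileModel returns just the three lines.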

2. Next, look at HadoopRDD. For your question the most important method is getPartitions, which determines how the input file is divided into Partitions:

override def getPartitions: Array[Partition] = {
    val jobConf = getJobConf()
    // add the credentials here as this can be called before SparkContext initialized
    SparkHadoopUtil.get.addCredentials(jobConf)
    val inputFormat = getInputFormat(jobConf)
    if (inputFormat.isInstanceOf[Configurable]) {
      inputFormat.asInstanceOf[Configurable].setConf(jobConf)
    }
    val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
    val array = new Array[Partition](inputSplits.size)
    for (i <- 0 until inputSplits.size) {
      array(i) = new HadoopPartition(id, i, inputSplits(i))
    }
    array
}
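To give a feel for how a getSplits(jobConf, minPartitions) call can turn a file size into byte ranges, here is a simplified model in plain Scala. It follows my understanding of the old-API Hadoop FileInputFormat (goal size = total size / requested splits, capped by the block size, with a 10% slop on the last split); it is a sketch under those assumptions, not the actual Hadoop code. SplitModel, Split and computeSplits are hypothetical names, and the block size is just a parameter whose real value depends on your Hadoop configuration.

```scala
object SplitModel {
  // Hypothetical stand-in for Hadoop's FileSplit: just a byte range.
  final case class Split(start: Long, length: Long)

  // Assumption: the last split may run up to 10% over the target size.
  private val SplitSlop = 1.1

  def computeSplits(
      totalSize: Long,
      numSplits: Int,
      blockSize: Long,
      minSize: Long = 1L): Seq[Split] = {
    // Target size if the file were cut into exactly `numSplits` pieces.
    val goalSize = totalSize / math.max(numSplits, 1)
    // Never larger than a block, never smaller than the configured minimum.
    val splitSize = math.max(minSize, math.min(goalSize, blockSize))
    val splits = scala.collection.mutable.ArrayBuffer.empty[Split]
    var remaining = totalSize
    while (remaining.toDouble / splitSize > SplitSlop) {
      splits += Split(totalSize - remaining, splitSize)
      remaining -= splitSize
    }
    // Whatever is left becomes the final (possibly shorter or slightly longer) split.
    if (remaining != 0) splits += Split(totalSize - remaining, remaining)
    splits.toSeq
  }
}
```

With a 100-byte file, 2 requested splits and a large block size, the model gives two 50-byte ranges. Under this model the requested number acts as a hint: when the block size is smaller than half the file, more than two splits come out.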

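Here, val inputSplits = inputFormat.getSplits(jobConf, minPartitions) divides your input file into multiple Splits, and each Split corresponds to one Partition. Because this is the local file system, the file system is resolved from the "file://" prefix (I will not paste that source). minPartitions is 2 here if you did not specify it (the default is defaultMinPartitions, which is min(defaultParallelism, 2)), so the file is divided into at least 2 parts. Each Split carries SplitLocationInfo describing which node holds it and how it is stored. A FileSplit, for example, contains (hosts, start, len, path): which host and which path the data lives at, and reading len bytes from offset start gives the content of that Split. For a local file the host is simply localhost, path is the file path you passed in, and start and len follow from a simple calculation over those parts, which I will not go into. With this information we can construct the preferred locations of each Split: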

override def getPreferredLocations(split: Partition): Seq[String] = {
    val hsplit = split.asInstanceOf[HadoopPartition].inputSplit.value
    val locs: Option[Seq[String]] = HadoopRDD.SPLIT_INFO_REFLECTIONS match {
      case Some(c) =>
        try {
          val lsplit = c.inputSplitWithLocationInfo.cast(hsplit)
          val infos = c.getLocationInfo.invoke(lsplit).asInstanceOf[Array[AnyRef]]
          Some(HadoopRDD.convertSplitLocationInfo(infos))
        } catch {
          case e: Exception =>
            logDebug("Failed to use InputSplitWithLocations.", e)
            None
        }
      case None => None
    }
    locs.getOrElse(hsplit.getLocations.filter(_ != "localhost"))
}

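This code shows that a split whose host is localhost has no preferred location. The task for that partition is therefore appended to the no_prefs task queue and scheduled according to the corresponding data-locality level.

3. Task scheduling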
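The effect of the final filter can be checked in isolation. The sketch below copies only the hsplit.getLocations.filter(_ != "localhost") step into plain Scala; PreferredLocs and hasLocalityPreference are hypothetical names, and the host lists are made-up inputs.

```scala
object PreferredLocs {
  // Mirrors `hsplit.getLocations.filter(_ != "localhost")` from the snippet above.
  def preferredLocations(splitHosts: Seq[String]): Seq[String] =
    splitHosts.filter(_ != "localhost")

  // Hypothetical helper: an empty result means there is no locality hint
  // for the task that will read this split.
  def hasLocalityPreference(splitHosts: Seq[String]): Boolean =
    preferredLocations(splitHosts).nonEmpty
}
```

A split of a local file reports only localhost, so its preferred-location list is empty. A split that reports real hostnames (node2, node3 in the example cluster) keeps them.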

val taskIdToLocations = try {
      stage match {
        case s: ShuffleMapStage =>
          partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id))}.toMap
        case s: ResultStage =>
          val job = s.resultOfJob.get
          partitionsToCompute.map { id =>
            val p = job.partitions(id)
            (id, getPreferredLocs(stage.rdd, p))
          }.toMap