TCP 172.18.1.194:63937 172.18.7.101:8032 ESTABLISHED 17412
TCP 172.18.1.194:63940 172.18.7.102:9000 ESTABLISHED 17412
TCP 172.18.1.194:63952 172.18.7.121:50010 ESTABLISHED 17412
TCP 192.168.132.1:63923 0.0.0.0:0 LISTENING 17412
TCP [::]:4040 [::]:0 LISTENING 17412
Detailed Version
Before explaining in detail: there are only these three related conf variables:
spark.driver.host
spark.driver.port
spark.driver.bindAddress
There are NO variables like spark.driver.hostname or spark.local.ip, but there IS an environment variable called SPARK_LOCAL_IP.
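As a sketch of where these settings live, they can be passed on the spark-submit command line (the IP addresses below are placeholders taken from the netstat output above, not values you should copy):

```shell
# Pass the three conf variables on the command line (placeholder addresses):
spark-submit \
  --conf spark.driver.host=172.18.1.194 \
  --conf spark.driver.port=63937 \
  --conf spark.driver.bindAddress=172.18.1.194 \
  ...

# Or set the environment variable before submitting:
export SPARK_LOCAL_IP=172.18.1.194
```

They can equally be set in conf/spark-defaults.conf or via SparkConf in code.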
Before explaining the variables, we first have to understand the application submission process.
Main roles of machines:
development machine
master node (YARN / Spark Master)
worker node
There is an ApplicationMaster for each application, which takes care of resource requests to the cluster and monitors the status of jobs (stages).
The ApplicationMaster is in the cluster, always.
Place of the Spark driver
development machine: client mode
within the cluster: cluster mode, same place as the ApplicationMaster
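The driver's place is chosen by the --deploy-mode flag of spark-submit; a minimal sketch (my-app.jar is a hypothetical application jar):

```shell
# Client mode (the default): the driver runs on the submitting machine
spark-submit --master yarn --deploy-mode client my-app.jar

# Cluster mode: the driver runs inside the cluster, alongside the ApplicationMaster
spark-submit --master yarn --deploy-mode cluster my-app.jar
```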
Let's say we are talking about client mode.
The Spark application can be submitted from a development machine, which acts as a client machine of the application, as well as a client machine of the cluster.
The Spark application can alternatively be submitted from a node within the cluster (a master node, a worker node, or just a machine with no resource-manager role).
The client machine might not be in the same subnet as the cluster, and this is one case these variables try to deal with. Think about your internet connection: it is usually not possible for your laptop to be reached from anywhere around the globe the way google.com can.
At the beginning of the application submission process, spark-submit on the client side uploads the necessary files to the Spark master or YARN, and negotiates resource requests. In this step the client connects to the cluster; the cluster's address is the destination address the client tries to reach.
Then the ApplicationMaster starts on the allocated resource.
The resource allocated for the ApplicationMaster is by default random and cannot be controlled by these variables; it is controlled by the scheduler of the cluster, if you're curious about this.
Then the ApplicationMaster tries to connect BACK to the Spark driver. This is the place where these conf variables take effect.
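So in client mode, for the connect-back to succeed, the driver must advertise an address the cluster can actually reach: spark.driver.host is the address the driver tells the cluster to connect to, spark.driver.bindAddress is the local address its listening socket binds to (it defaults to spark.driver.host), and spark.driver.port is the listening port (default 0, i.e. a random ephemeral port). A hedged sketch, where 172.18.1.194 stands in for the client machine's cluster-facing IP:

```shell
spark-submit \
  --master yarn \
  --deploy-mode client \
  --conf spark.driver.host=172.18.1.194 \
  --conf spark.driver.port=63937 \
  my-app.jar
```

Pinning spark.driver.port to a fixed value like this is only needed when a firewall between the client and the cluster requires a known port; otherwise the random default is fine.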