java - Spark Cannot assign requested address: Service Driver failed after 16 retries

Collectives™ on Stack Overflow

Find centralized, trusted content and collaborate around the technologies you use most.
Learn more about Collectives
Teams
Q&A for work
Connect and share knowledge within a single location that is structured and easy to search.
Learn more about Teams
I have a cluster of 3 workers Spark. (worker-1, worker-2, worker-3) that runs with Spark 2.0.2.
The Spark Master is started on worker-1.
I submit my application with the following script :
#!/bin/bash
sparkMaster=spark://worker-1:6066
mainClass=my.package.Main
jar=/path/to/my/jar-with-dependencies.jar
driverPort=7079
blockPort=7082
deployMode=cluster
$SPARK_HOME/bin/spark-submit \
  --conf "spark.driver.port=${driverPort}"\
  --conf "spark.blockManager.port=${blockPort}"\
  --class $mainClass \
  --master $sparkMaster \
  --deploy-mode $deployMode \
When my driver is started on the worker-1 (Worker + Master), everything is ok, and my application is correctly executed using all workers
But when my driver start on another worker (worker-2 or worker-3), he fails with error : 
Launch Command: "/usr/java/jdk1.8.0_181-amd64/jre/bin/java" "-cp" "/root/spark-2.0.2-bin-hadoop2.7/conf/:/root/spark-2.0.2-bin-hadoop2.7/jars/*" "-Xmx1024M" "-Dspark.submit.deployMode=cluster" "-Dspark.app.name=my.package.Main" "-Dspark.driver.port=7083" "-Dspark.blockManager.port=7082" "-Dspark.master=spark://worker-1:7077" "-Dspark.jars=file:/path/to/my/jar-with-dependencies.jar" "org.apache.spark.deploy.worker.DriverWrapper" "spark://Worker@worker-2:7078" "/data/spark/work/driver-20181001132624-0001/jar-with-dependencies.jar" "my.package.Main"
========================================
org.apache.spark.internal.Logging$class.logWarning(Logging.scala:66) | Service 'Driver' could not bind on port 0. Attempting port 1.
org.apache.spark.internal.Logging$class.logWarning(Logging.scala:66) | Service 'Driver' could not bind on port 0. Attempting port 1.
org.apache.spark.internal.Logging$class.logWarning(Logging.scala:66) | Service 'Driver' could not bind on port 0. Attempting port 1.
org.apache.spark.internal.Logging$class.logWarning(Logging.scala:66) | Service 'Driver' could not bind on port 0. Attempting port 1.
Exception in thread "main" java.net.BindException: Cannot assign requested address: Service 'Driver' failed after 16 retries! Consider explicitly setting the appropriate port for the service 'Driver' (for example spark.ui.port for SparkUI) to an available port or increasing spark.port.maxRetries.
        at sun.nio.ch.Net.bind0(Native Method)
        at sun.nio.ch.Net.bind(Net.java:433)
        at sun.nio.ch.Net.bind(Net.java:425)
        at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
        at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
        at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:125)
        at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:485)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1089)
        at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:430)
        at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:415)
        at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:903)
        at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:198)
        at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:348)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
        at java.lang.Thread.run(Thread.java:748)
My 3 workers are configured as follow : 
SPARK_LOCAL_IP=worker-[X]
SPARK_LOCAL_DIRS=/data/spark/tmp
SPARK_WORKER_PORT=7078
SPARK_WORKER_DIR=/data/spark/work
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=86400 -Dspark.worker.cleanup.interval=1800"
After multiple attempts to solve this problem, I tried to force the start of the driver on the master machine by adding to my submit the following option : 
  --conf "spark.driver.host=worker-1"
But the driver still start on a random worker, so it does not solve my problem.
Edit : 
When I submit with the spark.driver.host option, the option does not appear in the Launch Command log (but the spark.driver.port appear, so I don't understand why my option is not taken this time).
Edit 2 : 
I have done some deeper tests : 
I now have only one worker running on worker-2, still submitting from worker-1 where my master is running.
When I submit my application, I can see on my worker logs : 
2018-10-04 11:27:39,794 | dispatcher-event-loop-6 | INFO | org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54) | Asked to launch driver driver-20181004112739-0003
2018-10-04 11:27:39,833 | DriverRunner for driver-20181004112739-0003 | INFO | org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54) | Copying user jar file:/path/to/myjar-with-depencies.jar to /data/spark/work/driver-20181004112739-0003/myjar-with-depencies.jar
2018-10-04 11:27:39,833 | DriverRunner for driver-20181004112739-0003 | INFO | org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54) | Copying /path/to/myjar-with-depencies.jar to /data/spark/work/driver-20181004112739-0003/myjar-with-depencies.jar
2018-10-04 11:27:40,243 | DriverRunner for driver-20181004112739-0003 | INFO | org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54) | Launch Command: "/usr/java/jdk1.8.0_181-amd64/jre/bin/java" "-cp" "/root/spark-2.0.0-bin-hadoop2.7/conf/:/root/spark-2.0.0-bin-hadoop2.7/jars/*" "-Xmx1024M" "-Dspark.driver.supervise=false" "-Dspark.history.fs.cleaner.interval=12h" "-Dspark.submit.deployMode=cluster" "-Dspark.master=spark://worker-1:7077" "-Dspark.history.fs.cleaner.maxAge=1d" "-Dspark.app.name=my.package.Main" "-Dspark.jars=file:/path/to/myjar-with-depencies.jar" "org.apache.spark.deploy.worker.DriverWrapper" "spark://Worker@worker-2:7078" "/data/spark/work/driver-20181004112739-0003/myjar-with-depencies.jar" "my.package.Main"
2018-10-04 11:27:42,692 | dispatcher-event-loop-8 | WARN | org.apache.spark.internal.Logging$class.logWarning(Logging.scala:66) | Driver driver-20181004112739-0003 exited with failure
And I still have the same error in my driver logs. 
I then tried to run manually the command that is launched by the DriverRunner :
"/usr/java/jdk1.8.0_181-amd64/jre/bin/java" "-cp" "/root/spark-2.0.0-bin-hadoop2.7/conf/:/root/spark-2.0.0-bin-hadoop2.7/jars/*" "-Xmx1024M" "-Dspark.driver.supervise=false" "-Dspark.history.fs.cleaner.interval=12h" "-Dspark.submit.deployMode=cluster" "-Dspark.master=spark://worker-1:7077" "-Dspark.history.fs.cleaner.maxAge=1d" "-Dspark.app.name=my.package.Main" "-Dspark.jars=file:/path/to/myjar-with-depencies.jar" "org.apache.spark.deploy.worker.DriverWrapper" "spark://Worker@worker-2:7078" "/data/spark/work/driver-20181004112739-0003/myjar-with-depencies.jar" "my.package.Main"
And when I do that, the application start correctly (surprisingly). 
What is the difference between my manual start, and the one from the Driver-Runner that can cause my binding error ?
Note : 
I have made no modification on the Driver-Runner command line to work
I manually launched my command line in root, and my spark runs in root too.
Had the same behavior on Spark 2.0.0 and Spark 2.0.2
So, I answer my own question as I found the reason of this weird behavior.
It does happend when I run the spark-submit from a machine where there is a spark-env.sh file. And more precisely, when the SPARK_LOCAL_IP is set in this machine.
To avoid this problem, I created a 4th machine, running only a Spark Master and with no spark-env.sh file and from which I run my spark submit.
                You mean to say in master machine no need  to configure " spark-env.sh " and " spark-defaults.conf  "  file.  Only need to configure slave file with number of slaves hostname or IP.  Right ?
– Bimal Kumar Dalei
                Jan 11, 2021 at 7:26
        Thanks for contributing an answer to Stack Overflow!
Please be sure to answer the question. Provide details and share your research!
But avoid …
Asking for help, clarification, or responding to other answers.
Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.