I am trying to create a Spark cluster using the Bitnami Spark image and connect it to Minio storage created with the Bitnami Minio image.
The following is my docker-compose file, assembled from snippets provided by Bitnami:
version: '2'
networks:
  spark-network:
    driver: bridge
services:
  minio:
    image: bitnami/minio:latest
    ports:
      - '9000:9000'
      - '9001:9001'
    environment:
      - MINIO_ROOT_USER=<INSERT>
      - MINIO_ROOT_PASSWORD=<INSERT>
    networks:
      - spark-network
  spark:
    image: bitnami/spark:3.3.2
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    ports:
      - '8080:8080'
      - '7077:7077'
    networks:
      - spark-network
    depends_on:
      - minio
  spark-worker:
    image: bitnami/spark:3.3.2
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=3G
      - SPARK_WORKER_CORES=2
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    networks:
      - spark-network
    depends_on:
      - spark
      - minio
When I run docker compose up, everything seems fine: the worker(s) appear to be connected to the master, and Minio is accessible from localhost (I can also upload data). In the session config below, I have also tried replacing localhost with the IP address reported by the master Docker container.
The following is my Spark Session config:
spark = SparkSession.builder \
    .master(f'spark://localhost:7077') \
    .appName("docker_spark_minio_storage") \
    .config('spark.jars', '/opt/bitnami/spark/jars/hadoop-aws-3.3.2') \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000") \
    .config("spark.hadoop.fs.s3a.access.key", <INSERT>) \
    .config("spark.hadoop.fs.s3a.secret.key", <INSERT>) \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .getOrCreate()
Here is my PySpark code, which I run from my own terminal, i.e. without uploading the script to any of the containers:
df = (spark.read
      .format('csv')
      .option('inferSchema', 'true')
      .option('header', 'true')
      .option('delimiter', delimiter)
      .load('s3a://<CONTAINER>/<FILENAME.csv>'))
Problem:
Spark cannot find the hadoop-aws jar needed to read Minio: "java.io.FileNotFoundException: Jar /opt/bitnami/spark/jars/hadoop-aws-3.3.2 not found". It is however fully visible in the provided path when checking in the docker container(s). Why?
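One check that may help narrow this down: as far as I understand, in client mode the paths given in `spark.jars` are resolved on the machine where the driver runs, which in this setup is my terminal rather than a container. The sketch below only tests whether a path exists as a regular file on the machine running the script. `jar_visible_to_driver` is a hypothetical helper name, not part of PySpark.

```python
import os


def jar_visible_to_driver(path: str) -> bool:
    """Return True if `path` is an existing regular file on this machine.

    Hypothetical helper: call it from the same Python process that builds
    the SparkSession, since that process hosts the driver in client mode.
    """
    return os.path.isfile(path)


if __name__ == "__main__":
    # Path copied from the session config above (note: no .jar suffix).
    candidate = "/opt/bitnami/spark/jars/hadoop-aws-3.3.2"
    print(candidate, "->", jar_visible_to_driver(candidate))
```

If this prints `False` on the machine that starts the session, the jar being present inside the containers would not be enough for the driver to find it.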
Changing the config to .config('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.2'), rather than pointing to the jar file, avoids this particular problem, but the read then freezes (at stage 0/0). When I point to a nonexistent file path in Minio, it reports that the file does not exist, so at least it seems to find the file when I give the correct path. I am however not sure that the job is actually run on the Spark cluster, since my terminal prints some local paths. I've tried adding .config('spark.submit.deployMode', 'client'), .config("spark.driver.bindAddress", "") and .config('spark.driver.host', ""), but that made no difference.
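To narrow down the freeze, it may help to test which endpoints are reachable from where each piece of code runs. Inside a container, `localhost` refers to that container itself, so `http://localhost:9000` as seen from a worker is not necessarily the Minio service. On the compose network, services are normally reachable by their service name (here `minio`). The sketch below is a plain TCP check using only the standard library. `can_connect` is a hypothetical helper, and the host and port values mentioned afterwards are assumptions taken from the compose file.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened.

    Name-resolution failures, refused connections and timeouts are all
    subclasses of OSError, so they are reported as "not reachable".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, comparing `can_connect("localhost", 9000)` on the host with `can_connect("minio", 9000)` and `can_connect("localhost", 9000)` inside a worker container would show whether the workers can see Minio under the endpoint the session config gives them.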
Information:
When running spark locally, i.e. changing to .master('local[*]'), everything works fine, I am able to read from Minio using the exact same configuration. But I want to run it on the docker container(s). I can see that there are different OpenJDK versions on my local machine and in the docker container (19.0.2 locally and 1.8.0_362 in spark container(s)). Might this be the reason? I don't see why, so please explain if this is the case!
I am using the same PySpark versions, i.e. 3.3.2.
The container(s) seem to be able to run Spark code, e.g. when I create a dataframe with mock data in code, rather than reading from Minio.
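On the JDK difference mentioned above: the two version strings use different naming schemes, so `1.8.0_362` is Java 8 while `19.0.2` is Java 19. As far as I can tell, the Spark 3.3 documentation lists Java 8, 11 and 17, so the local JDK 19 falls outside that list. Whether that matters for this problem is an open question. The sketch below just normalizes both schemes to a major version for comparison. `java_major_version` is a hypothetical helper name.

```python
def java_major_version(version: str) -> int:
    """Extract the major Java version from a plain version string.

    Handles the legacy scheme ("1.8.0_362" is Java 8) and the
    modern scheme ("19.0.2" is Java 19).
    """
    parts = version.strip().split(".")
    if parts[0] == "1" and len(parts) > 1:
        # Legacy "1.x" numbering: the major version is the second field.
        return int(parts[1])
    return int(parts[0])


if __name__ == "__main__":
    local, container = "19.0.2", "1.8.0_362"
    print(java_major_version(local), java_major_version(container))
```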
Thank you!
I have been googling for days, asked ChatGPT, etc. I can't find anyone having had a similar problem. I am expecting the spark container(s) to read my data from Minio and simply show it so I can get somewhere.