
I am trying to create a Spark cluster using the Bitnami Spark image, and to connect it to MinIO storage created with the Bitnami MinIO image.

The following is my docker-compose file, assembled from pieces provided by Bitnami:

version: '2'
networks:
  spark-network:
    driver: bridge
services:
  minio:
    image: bitnami/minio:latest
    ports:
      - '9000:9000'
      - '9001:9001'
    environment:
      - MINIO_ROOT_USER=<INSERT>
      - MINIO_ROOT_PASSWORD=<INSERT>
    networks:
      - spark-network
  spark:
    image: bitnami/spark:3.3.2
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    ports:
      - '8080:8080'
      - '7077:7077'
    networks:
      - spark-network
    depends_on:
      - minio
  spark-worker:
    image: bitnami/spark:3.3.2
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=3G
      - SPARK_WORKER_CORES=2
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    networks:
      - spark-network
    depends_on:
      - spark
      - minio

When running docker compose up, everything seems fine: the worker(s) connect to the master, and MinIO is accessible from localhost (I can also upload data). I have tried changing localhost to the IP of the master docker container.
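As a quick way to see what the containers themselves can reach (a hedged aside: /minio/health/live is MinIO's liveness endpoint, and the service names match the compose file above; adjust for your setup), one can open a Python shell inside a worker, e.g. with docker compose exec spark-worker python, and probe both endpoints:

from urllib.request import urlopen

# Run inside a worker container. 'minio' is the compose service name on
# spark-network; 'localhost' here refers to the worker container itself,
# not the docker host.
for url in ('http://minio:9000/minio/health/live',
            'http://localhost:9000/minio/health/live'):
    try:
        print(url, '->', urlopen(url, timeout=3).status)
    except Exception as exc:
        print(url, '->', exc)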

The following is my Spark Session config:

    spark = SparkSession.builder \
        .master('spark://localhost:7077') \
        .appName("docker_spark_minio_storage") \
        .config('spark.jars', '/opt/bitnami/spark/jars/hadoop-aws-3.3.2') \
        .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
        .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000") \
        .config("spark.hadoop.fs.s3a.access.key", <INSERT>) \
        .config("spark.hadoop.fs.s3a.secret.key", <INSERT>) \
        .config("spark.hadoop.fs.s3a.path.style.access", "true") \
        .getOrCreate()

Here is my PySpark code, run from my terminal, i.e. not uploading the script to any container(s):

df = (spark.read
  .format('csv')
  .option('inferSchema', 'true')
  .option('header', 'true')
  .option('delimiter', delimiter)
  .load('s3a://<CONTAINER>/<FILENAME>.csv'))

Problem:

  • Spark cannot find the hadoop-aws jar needed to read from MinIO: "java.io.FileNotFoundException: Jar /opt/bitnami/spark/jars/hadoop-aws-3.3.2 not found". The jar is, however, fully visible at the given path when I check inside the docker container(s). Why?
  • Changing the config to .config('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.2'), rather than pointing at the jar file, avoids that particular error but freezes on read (at stage 0/0). When I point at a nonexistent path in MinIO, it tells me the file does not exist, so it at least recognizes the file when given the correct path. I am not sure the job actually runs on the Spark cluster, though, since my terminal prints some local paths. I have tried adding .config('spark.submit.deployMode', 'client'), .config("spark.driver.bindAddress", "") and .config('spark.driver.host', ""), but nothing changed (see the sketch after this list).
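A note on the freeze at stage 0/0: with the driver on the host and the executors in containers, a common cause is that the executors cannot connect back to the driver. Below is a minimal sketch of that idea, under the assumption that the host's LAN IP is reachable from the containers; the socket-based lookup is illustrative and may pick the wrong interface, in which case hard-code the address.

import socket
from pyspark.sql import SparkSession

# Assumption: executors in the worker containers must be able to open
# connections back to the driver on the host. An empty spark.driver.host,
# as tried above, gives them nothing to dial; a concrete address does.
host_ip = socket.gethostbyname(socket.gethostname())  # or hard-code the host's IP

spark = (SparkSession.builder
    .master('spark://localhost:7077')
    .config('spark.driver.host', host_ip)           # address executors connect back to
    .config('spark.driver.bindAddress', '0.0.0.0')  # listen on all interfaces
    .getOrCreate())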
Information:

  • When running Spark locally, i.e. changing to .master('local[*]'), everything works fine and I can read from MinIO using the exact same configuration. But I want to run it on the docker container(s). I can see that the OpenJDK versions differ between my local machine and the docker container (19.0.2 locally, 1.8.0_362 in the Spark container(s)). Might this be the reason? I don't see why, so please explain if it is!
  • I am using the same PySpark version everywhere, i.e. 3.3.2.
  • The container(s) do seem able to run Spark code, e.g. when I create a dataframe from mock data in code rather than reading from MinIO.

I have been googling for days, asked ChatGPT, etc., and I can't find anyone who has had a similar problem. I expect the Spark container(s) to read my data from MinIO and simply show it, so I can get somewhere. Thank you!

EDIT: I have now made sure that I run the same OpenJDK version on my local machine and in the docker container running Spark. The issue is unchanged: reading from MinIO with a local Spark master works magnificently, but with the master in the Bitnami Spark docker container it freezes. – Paddy Mar 20 at 13:55

Make sure you've disabled https on the s3a connector, and consider that "localhost" isn't likely to be the right answer in a container. Use the real host. – stevel Mar 23 at 15:31
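Following stevel's comment, here is a sketch of the session config with https disabled on the S3A connector and the endpoint pointed at the machine's real address instead of localhost (reachable both by the host driver and, via the published port, by the containers). The spark.jars entries are assumptions: spark.jars expects full file names, so the .jar extension is included, and hadoop-aws needs the matching aws-java-sdk-bundle beside it (the exact bundle version paired with Hadoop 3.3.2 may differ).

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .master('spark://localhost:7077')
    .appName('docker_spark_minio_storage')
    # Full jar file names (the original path lacked the .jar extension),
    # plus the AWS SDK bundle that hadoop-aws depends on; version assumed.
    .config('spark.jars',
            '/opt/bitnami/spark/jars/hadoop-aws-3.3.2.jar,'
            '/opt/bitnami/spark/jars/aws-java-sdk-bundle-1.11.1026.jar')
    .config('spark.hadoop.fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
    # The real host address, not localhost, so the executors can resolve it too:
    .config('spark.hadoop.fs.s3a.endpoint', 'http://<HOST-IP>:9000')
    .config('spark.hadoop.fs.s3a.connection.ssl.enabled', 'false')
    .config('spark.hadoop.fs.s3a.access.key', '<INSERT>')
    .config('spark.hadoop.fs.s3a.secret.key', '<INSERT>')
    .config('spark.hadoop.fs.s3a.path.style.access', 'true')
    .getOrCreate())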
