
I am trying to load data into MinIO storage using Spark.

Below is the Spark program -

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from datetime import datetime
from pyspark.sql import Window, functions as F
spark = SparkSession.builder.appName("MinioTest").getOrCreate()
sc = spark.sparkContext
spark.conf.set("spark.hadoop.fs.s3a.endpoint", "https://minioendpoint.com/")
spark.conf.set("spark.hadoop.fs.s3a.access.key", "username")
spark.conf.set("spark.hadoop.fs.s3a.secret.key", "password")
spark.conf.set("spark.hadoop.fs.s3a.path.style.access", True)
spark.conf.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
df = spark.read.csv('s3a://bucketname/spark-operator-on-k8s/data/input/input.txt',header=True)
df.write.format('csv').options(delimiter='|').mode('overwrite').save('s3a://bucketname/spark-operator-on-k8s/data/output/')

Spark Submit Command -

/usr/middleware/spark-3.2.0-bin-hadoop3.2/bin/spark-submit \
  --jars /usr/middleware/maven/hadoop-aws-3.2.0.jar,/usr/middleware/maven/aws-java-sdk-bundle-1.11.375.jar \
  --driver-class-path /usr/middleware/maven/hadoop-aws-3.2.0.jar,/usr/middleware/maven/aws-java-sdk-bundle-1.11.375.jar \
  --conf spark.executor.extraClassPath="/usr/middleware/maven/hadoop-aws-3.2.0.jar:/usr/middleware/maven/aws-java-sdk-bundle-1.11.375.jar" \
  /usr/middleware/miniocerts/minio.py

Error 1 - Although the password was provided in the script, I am not sure why this error is thrown -

java.nio.file.AccessDeniedException: s3a://bucketname/spark-operator-on-k8s/data/output: org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: No AWS Credentials provided by TemporaryAWSCredentialsProvider SimpleAWSCredentialsProvider EnvironmentVariableCredentialsProvider IAMInstanceCredentialsProvider : com.amazonaws.SdkClientException: Unable to load AWS credentials from environment variables (AWS_ACCESS_KEY_ID (or AWS_ACCESS_KEY) and AWS_SECRET_KEY (or AWS_SECRET_ACCESS_KEY))

Then I exported the credentials and re-ran spark-submit:

export AWS_ACCESS_KEY_ID=username
export AWS_SECRET_KEY=password
/usr/middleware/spark-3.2.0-bin-hadoop3.2/bin/spark-submit \
  --jars /usr/middleware/maven/hadoop-aws-3.2.0.jar,/usr/middleware/maven/aws-java-sdk-bundle-1.11.375.jar \
  --driver-class-path /usr/middleware/maven/hadoop-aws-3.2.0.jar,/usr/middleware/maven/aws-java-sdk-bundle-1.11.375.jar \
  --conf spark.executor.extraClassPath="/usr/middleware/maven/hadoop-aws-3.2.0.jar:/usr/middleware/maven/aws-java-sdk-bundle-1.11.375.jar" \
  /usr/middleware/miniocerts/minio.py

Now the error is -

java.nio.file.AccessDeniedException: s3a://bucketname/spark-operator-on-k8s/data/input/input.txt: getFileStatus on s3a://bucketname/spark-operator-on-k8s/data/input/input.txt: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: XX9TMS6ANGYZXXKN; S3 Extended Request ID: t4kasdasda=; Proxy: null), S3 Extended Request ID: t4kYUfgfSAnw7ymP:403 Forbidden at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:249)

mc S3 policy -

The access policy for the user is:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::bucketname",
        "arn:aws:s3:::bucketname/*"
      ]
    }
  ]
}
Any pointers on what the error could be?

An AWS request ID implies this response is coming from AWS S3, not MinIO. Maybe you are setting up the s3a config options too late in the job... this would explain why the password wasn't found. – stevel Oct 26, 2021 at 14:25

I changed the spark.conf.set to spark.sparkContext._jsc.hadoopConfiguration().set and added the certs to cacerts using the keytool command, and it worked. – Rafa Oct 28, 2021 at 11:59
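The fix described in the comments can be sketched as follows (a minimal sketch; the endpoint and credentials are the placeholders from the question, and the helper name is hypothetical). The point is to copy the S3A options onto the JVM-side Hadoop configuration before the first s3a:// path is touched, rather than via spark.conf.set afterwards:

```python
# S3A options for MinIO; endpoint and credentials are placeholders.
S3A_OPTIONS = {
    "fs.s3a.endpoint": "https://minioendpoint.com/",
    "fs.s3a.access.key": "username",
    "fs.s3a.secret.key": "password",
    "fs.s3a.path.style.access": "true",  # MinIO is typically addressed path-style
    "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
}

def apply_s3a_options(spark, options=S3A_OPTIONS):
    """Copy each option onto the underlying Hadoop configuration,
    which is what the S3A filesystem actually reads."""
    hconf = spark.sparkContext._jsc.hadoopConfiguration()
    for key, value in options.items():
        hconf.set(key, value)
```

Call apply_s3a_options(spark) immediately after getOrCreate(), before the first spark.read, so the credentials are in place when the S3A filesystem is initialized.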

I modified your code. It should work now.

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from datetime import datetime
from pyspark.sql import Window, functions as F
spark = SparkSession.builder.appName("MinioTest").getOrCreate()
sc = spark.sparkContext
spark.conf.set("spark.hadoop.fs.s3a.endpoint", "https://minioendpoint.com/")
spark.conf.set("spark.hadoop.fs.s3a.access.key", "username")
spark.conf.set("spark.hadoop.fs.s3a.secret.key", "password" )
spark.conf.set("spark.hadoop.fs.s3a.path.style.access", "true")
spark.conf.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
df = spark.read.csv('s3a://bucketname/spark-operator-on-k8s/data/input/input.txt',header=True)
df.write.format('csv').options(delimiter='|').mode('overwrite').save('s3a://bucketname/spark-operator-on-k8s/data/output/')
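Alternatively (a sketch, reusing the placeholder paths and credentials from the question), the same settings can be supplied at submit time with the spark.hadoop.* prefix, which Spark copies into the Hadoop configuration before the job starts, so they cannot arrive too late:

/usr/middleware/spark-3.2.0-bin-hadoop3.2/bin/spark-submit \
  --jars /usr/middleware/maven/hadoop-aws-3.2.0.jar,/usr/middleware/maven/aws-java-sdk-bundle-1.11.375.jar \
  --conf spark.hadoop.fs.s3a.endpoint=https://minioendpoint.com/ \
  --conf spark.hadoop.fs.s3a.access.key=username \
  --conf spark.hadoop.fs.s3a.secret.key=password \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  /usr/middleware/miniocerts/minio.py

This keeps credentials out of the script itself, though for production a Hadoop credential provider or environment-based secrets would be preferable to plain-text --conf values.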
