Collectives™ on Stack Overflow

Find centralized, trusted content and collaborate around the technologies you use most.

Learn more about Collectives

Teams

Q&A for work

Connect and share knowledge within a single location that is structured and easy to search.

Learn more about Teams

PANDAS & glob - Excel file format cannot be determined, you must specify an engine manually

Ask Question

I am not sure why I am getting this error although sometimes my code works fine!

Excel file format cannot be determined, you must specify an engine manually.

Here below is my code with steps:

1- list of columns of customers Id:

customer_id = ["ID","customer_id","consumer_number","cus_id","client_ID"]

2- The code to find all xlsx files in a folder and read them:

l = [] #use a list and concat later, faster than append in the loop
for f in glob.glob("./*.xlsx"):
    df = pd.read_excel(f).reindex(columns=customer_id).dropna(how='all', axis=1)
    df.columns = ["ID"] # to have only one column once concat
    l.append(df)
all_data  = pd.concat(l, ignore_index=True) # concat all data

I added the engine openpyxl

df = pd.read_excel(f, engine="openpyxl").reindex(columns = customer_id).dropna(how='all', axis=1)

Now I got a different error:

BadZipFile: File is not a zip file

pandas version: 1.3.0 python version: python3.9 os: MacOS

is there a better way to read all xlsx files from a folder ?

I've met excatly same problem on linux and the reason is just as @pirateofebay said. There are some temporary files which are hidden. I've solved this by using pd.ExcelWriter to open all files related and then use writer.close() to close them one by one. Then we are good, you can use pd.read_excel as you wish. :D – Christy Jul 27, 2022 at 6:55

Found it. When an excel file is opened for example by MS excel a hidden temporary file is created in the same directory:

~$datasheet.xlsx

So, when I run the code to read all the files from the folder it gives me the error:

Excel file format cannot be determined, you must specify an engine manually.

When all files are closed and no hidden temporary files ~$filename.xlsx in the same directory the code works perfectly.

Also make sure you're using the correct pd.read_* method. I ran into this error when attempting to open a .csv file with read_excel() instead of read_csv(). I found this handy snippet here to automatically select the correct method by Excel file type.

if file_extension == 'xlsx':
    df = pd.read_excel(file.read(), engine='openpyxl')
elif file_extension == 'xls':
    df = pd.read_excel(file.read())
elif file_extension == 'csv':
    df = pd.read_csv(file.read())
for file in os.listdir(folder path):
    if not file.startswith("~") and file.endswith(".xlsx"):
        print(file)

https://stackoverflow.com/a/32241271/17411729

link to an answer on how to remove hidden files

Mac = go to folder press cmd + shift + . will show the hidden file, delete it, run it back.

Thank you for pointing out a potentially duplicated question. However, there are two things that you may consider: 1) to mention it only as a comment to the question, rather than an answer 2) If the solution in the SO page that you referred is not exactly the same, you should include the steps that you took too, not only the link – Aryo Pradipta Gema Nov 25, 2021 at 8:34 Thanks for making me aware, I will try to keep that in mind the next time :) would love to move my answer to the comments, unfortunately, I'm not allowed to make comments until I get 50 reps. – TimTiger Dec 7, 2021 at 18:13

In macOS, an "invisible file" named ".DS_Store" is automatically generated in each folder. For me, this was the source of the issue. I solved the problem with an if statement to bypass the "invisible file" (which is not an xlsx, so thus would trigger the error)

for file in os.scandir(test_folder):
    filename = os.fsdecode(file)
    if '.DS_Store' not in filename:
        execute_function(file)
        

Thanks for contributing an answer to Stack Overflow!

  • Please be sure to answer the question. Provide details and share your research!

But avoid

  • Asking for help, clarification, or responding to other answers.
  • Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.