
I would like to call NLTK to do some NLP on Databricks with PySpark. I have installed NLTK from the library tab of Databricks, so it should be accessible from all nodes.

My Python 3 code:

 import pyspark.sql.functions as F
 from pyspark.sql.types import StringType
 import nltk
 nltk.download('punkt')

 def get_keywords1(col):
     # split the text into sentences and return them as a string
     sentences = nltk.sent_tokenize(col)
     return str(sentences)

 get_keywords_udf = F.udf(get_keywords1, StringType())

I ran the above code and got:

 [nltk_data] Downloading package punkt to /root/nltk_data...
 [nltk_data]   Package punkt is already up-to-date!

When I run the following code:

 t = spark.createDataFrame(
     [(2010, 1, 'rdc', 'a book'), (2010, 1, 'rdc', 'a car'),
      (2007, 6, 'utw', 'a house'), (2007, 6, 'utw', 'a hotel')],
     ("year", "month", "u_id", "objects"))
 t1 = t.withColumn('keywords', get_keywords_udf('objects'))
 t1.show() # error here !

I got this error:

 PythonException:
  An exception was thrown from the Python worker. Please see the stack trace below.
 Traceback (most recent call last):
 LookupError:
 **********************************************************************
 Resource punkt not found.
 Please use the NLTK Downloader to obtain the resource:

 >>> import nltk
 >>> nltk.download('punkt')

 For more information see: https://www.nltk.org/data.html

 Attempted to load tokenizers/punkt/PY3/english.pickle

 Searched in:
   - '/root/nltk_data'
   - '/databricks/python/nltk_data'
   - '/databricks/python/share/nltk_data'
   - '/databricks/python/lib/nltk_data'
   - '/usr/share/nltk_data'
   - '/usr/local/share/nltk_data'
   - '/usr/lib/nltk_data'
   - '/usr/local/lib/nltk_data'

I have downloaded 'punkt'. It is located at

/root/nltk_data/tokenizers

I have added the folder location to the PATH in the Spark environment.

Why can it not be found?

I tried the solutions from NLTK. Punkt not found and from How to config nltk data directory from code?, but none of them work for me.

I have tried updating the NLTK data path:

 nltk.data.path.append('/root/nltk_data/tokenizers/')

It does not work. It seems that nltk cannot see the newly added path!

I also copied punkt to a path where nltk will search:

cp -r /root/nltk_data/tokenizers/punkt /root/nltk_data

but nltk still cannot see it.

thanks

I had a similar issue. The reason your code is failing is that you have installed the package only on the master node, not on the worker nodes. When your code runs in parallel on all the machines, the worker nodes error out. You need to find a way to copy the downloaded files from master to workers during cluster creation/setup, or, if your cluster has internet access, download it in the function (not the best way, but it would unblock you). – Karan Sharma Aug 17, 2020 at 10:29

@KaranSharma, thanks, but I have installed NLTK from the library tab of Databricks. It should be accessible from all nodes. – user3448011 Aug 19, 2020 at 15:55

One other potential issue could be that the nltk library downloads 'punkt' on each node and the nodes might not have internet access. I would check the cluster configuration and make sure the internet is available. You can even run something like a ping command on all nodes just to check the issue. – Karan Sharma Aug 19, 2020 at 17:22
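
A minimal sketch of that last fallback, assuming the worker nodes have internet access (the /tmp/nltk_data directory is just an illustrative choice): the download and the path tweak happen inside the UDF, so they run on whichever executor evaluates it rather than only on the driver.

    import nltk
    import pyspark.sql.functions as F
    from pyspark.sql.types import StringType

    def get_keywords1(col):
        # Runs on the executor; once the package is present on a node the
        # download call is effectively a cheap up-to-date check.
        nltk.download('punkt', download_dir='/tmp/nltk_data', quiet=True)
        if '/tmp/nltk_data' not in nltk.data.path:
            nltk.data.path.append('/tmp/nltk_data')
        return str(nltk.sent_tokenize(col))

    get_keywords_udf = F.udf(get_keywords1, StringType())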

When spinning up a Databricks single-node cluster this will work fine: installing nltk via pip and then using nltk.download to get the prebuilt models/text works.

Assumptions: the user is programming in a Databricks notebook with Python as the default language.

When spinning up a multinode cluster there are a couple of issues you will run into.

  • You are registering a UDF that relies on code from another module. In order for this UDF to work on every node in the cluster, the module needs to be installed at the cluster level (i.e. nltk installed on the driver and all worker nodes). The module can be installed via an init script at cluster start time or via the Libraries section in the Databricks Compute section. More on that here (I also give code examples below): https://learn.microsoft.com/en-us/azure/databricks/libraries/cluster-libraries.

    Now when you run the UDF the module will exist on all nodes of the cluster.

  • Using nltk.download() to get data that the module references. When we run nltk.download() interactively in a multi-node cluster, it only downloads to the driver node. So when your UDF executes on the other nodes, those nodes will not contain the needed data in the paths nltk searches by default. To see these default paths, run nltk.data.path. (A quick end-to-end check appears at the end of this answer.)

    To overcome this there are two possibilities I have explored. One of them works.

  • (doesn't work) Using an init script, install nltk, then in that same init script call nltk.download via a one-line bash/python expression after the install, like...

    python -c "import nltk; nltk.download('all')"
    

    I have run into the issue where nltk is installed but not found after the install. I'm assuming virtual environments are playing a role here.

  • (works) Using an init script, install nltk.

  • Create the script:

        dbutils.fs.put('/dbfs/databricks/scripts/nltk-install.sh', """#!/bin/bash
        pip install nltk""", True)

  • Check it out:

        head '/dbfs/databricks/scripts/nltk-install.sh'

  • Configure the cluster to run the init script on start up (Databricks Cluster Init Script Config).

  • In the cluster configuration, create the environment variable NLTK_DATA="/dbfs/databricks/nltk_data/". This is used by the nltk package to search for data/model dependencies (Databricks Cluster Env Variable Config).

  • Start the cluster.

    After it is installed and the cluster is running, check to make sure the environment variable was correctly created.

    import os 
    os.environ.get("NLTK_DATA")
    

    Then check to make sure that nltk is pointing towards the correct paths.

    import nltk
    nltk.data.path
    

    If '/dbfs/databricks/nltk_data/' is within the list, we are good to go. Download the data you need.

    nltk.download('all', download_dir="/dbfs/databricks/nltk_data/")
    

    Notice that we downloaded the dependencies to Databricks storage, so every node has access to the nltk default dependencies. Because we specified the environment variable NLTK_DATA at cluster creation time, nltk will look in that directory when it is imported. The only difference here is that nltk now points to our Databricks storage, which is accessible by every node.

    Since the data exists in mounted storage at cluster start-up, we shouldn't need to re-download it every time.

    After following these steps you should be all set to play with nltk and all of its default data/models.
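
    As a quick end-to-end check (referenced from the second bullet above), here is a rough sketch of the kind of UDF this setup enables, assuming the NLTK_DATA environment variable and the /dbfs/databricks/nltk_data/ download from the previous steps are in place:

    import nltk
    import pyspark.sql.functions as F
    from pyspark.sql.types import StringType

    def split_into_sentences(text):
        # Each worker reads NLTK_DATA from the cluster environment, so punkt
        # is resolved from the shared /dbfs location without re-downloading.
        return str(nltk.sent_tokenize(text))

    split_udf = F.udf(split_into_sentences, StringType())

    df = spark.createDataFrame([("a book. a car.",)], ["objects"])
    df.withColumn("sentences", split_udf("objects")).show(truncate=False)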

I recently encountered the same issue when using NLTK in a Glue job.

Adding the 'missing' file to all nodes resolved the issue for me. I'm not sure if it will help in Databricks, but it is worth a shot.

    sc.addFile('/tmp/nltk_data/tokenizers/punkt/PY3/english.pickle')
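
One way to consume the distributed file on the workers (a sketch, assuming the pickle was shipped with sc.addFile as above and that the installed nltk version still ships punkt as a pickle): SparkFiles.get resolves the node-local copy on whichever executor the UDF runs, and the unpickled punkt tokenizer exposes a tokenize method.

    import pickle
    from pyspark import SparkFiles
    import pyspark.sql.functions as F
    from pyspark.sql.types import StringType

    def split_sentences(text):
        # Resolve the node-local copy shipped by sc.addFile and unpickle the
        # punkt sentence tokenizer (requires nltk to be installed on the node).
        with open(SparkFiles.get('english.pickle'), 'rb') as f:
            tokenizer = pickle.load(f)
        return str(tokenizer.tokenize(text))

    split_sentences_udf = F.udf(split_sentences, StringType())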
    

Drew Ringo's suggestion almost worked for me. If you're using a multi-node cluster in Databricks, you will face the problems Ringo mentioned. For me, a much simpler solution was running the following init script:

    dbutils.fs.put("dbfs:/databricks/scripts/nltk_punkt.sh", """#!/bin/bash
    pip install nltk
    python -m nltk.downloader punkt""",True)
    

Make sure to add the file path under Advanced options -> Init Scripts found within the Cluster Configuration menu.

The first of Drew Ringo's two possibilities will work if your cluster's init script looks like this:

    /databricks/python/bin/pip install nltk
    /databricks/python/bin/python -m nltk.downloader punkt

He is correct to assume that his original issue relates to virtual environments.
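
For reference, a sketch of creating that init script from a notebook, following the same dbutils.fs.put pattern used in the answers above (the script path is just an example):

    dbutils.fs.put("dbfs:/databricks/scripts/nltk-venv-install.sh", """#!/bin/bash
    /databricks/python/bin/pip install nltk
    /databricks/python/bin/python -m nltk.downloader punkt""", True)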
