I have created a custom pipeline in Azure Machine Learning that runs a script called data_prep.py. This script requires the en_core_web_sm model from the spaCy library for data cleaning.

I have defined an environment YAML file for my pipeline, where I included the spacy package as a dependency. However, when I try to include the en_core_web_sm model in the YAML file, it fails to create the environment.

I have also attempted to download the en_core_web_sm model manually and store it in a folder, and then referenced the model path in my data_prep.py script. However, this approach is not working either.

I would greatly appreciate any guidance on how to correctly include the en_core_web_sm model in my Azure ML pipeline environment. Is there a recommended method to include spaCy models in Azure ML environments? How can I ensure that the en_core_web_sm model is accessible to my pipeline and can be used by data_prep.py?

Thank you for your assistance.

Hello @Kavinamoole, Abhishek

Thanks for reaching out to us. To include the en_core_web_sm model in your Azure ML pipeline environment, you can try the following steps:

Install the en_core_web_sm model in your environment: Add the model's wheel as a pip dependency in your environment YAML file. Here is an example:

dependencies:
  - python=3.8
  - spacy
  # List pip explicitly so conda can process the pip section below.
  - pip
  - pip:
    - https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.1.0/en_core_web_sm-3.1.0-py3-none-any.whl

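This will install the en_core_web_sm model from the official spaCy models releases on GitHub. Note that each model release is built for a specific spaCy minor version (the 3.1.0 model expects a spaCy 3.1.x release), so pick the model version that matches the spaCy version in your environment.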
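If you need a different model version, the wheel URL follows the same pattern as the one in the YAML above. The sketch below is only an illustration: model_wheel_url and same_minor are hypothetical helper names, not part of spaCy, and the URL pattern is inferred from the example link, so please check the releases page before relying on it.

```python
def model_wheel_url(name: str, version: str) -> str:
    """Build a release URL for a spaCy model wheel (pattern assumed from the example link)."""
    tag = f"{name}-{version}"
    return (
        "https://github.com/explosion/spacy-models/releases/download/"
        f"{tag}/{tag}-py3-none-any.whl"
    )


def same_minor(spacy_version: str, model_version: str) -> bool:
    """Return True when both versions share the same major.minor prefix."""
    return spacy_version.split(".")[:2] == model_version.split(".")[:2]


# Example: print the URL to paste under the pip section of the YAML file.
print(model_wheel_url("en_core_web_sm", "3.1.0"))
```

You can use same_minor to sanity-check a spaCy version against a model version before building the environment.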

Load the en_core_web_sm model in your script: Once the model is installed in your environment, you can load it by name with the spacy.load() function. Here is an example:

import spacy
nlp = spacy.load("en_core_web_sm")

This will load the en_core_web_sm model into memory so that data_prep.py can use it for data cleaning.
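If the load fails inside the pipeline, it can help to check first whether the model package was actually installed in the environment. Here is a minimal sketch of that idea; is_installed and load_model are hypothetical helper names of my own, and the check relies on the fact that a pip-installed spaCy model is an importable Python package.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(package_name) is not None


def load_model(model_name: str = "en_core_web_sm"):
    """Load a spaCy model, failing early with a clear message if it is missing."""
    if not is_installed(model_name):
        raise RuntimeError(
            f"Model package '{model_name}' is not installed in this environment. "
            "Check the pip section of the environment YAML file."
        )
    import spacy  # imported lazily so the check above works without spaCy

    return spacy.load(model_name)
```

In data_prep.py you would then call nlp = load_model() instead of calling spacy.load() directly, so a missing model shows up in the job logs with an explicit message.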

Verify that the model is accessible: Run a simple test script in the same environment that loads the model and processes a short piece of text. If the script runs without errors, the en_core_web_sm model is installed correctly and can be used in your pipeline.
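A test like that could look roughly like the sketch below. smoke_test is a hypothetical helper name and the sample sentence is arbitrary; it only assumes that the loaded pipeline is callable on a string and yields tokens with a text attribute, which is how spaCy pipelines behave.

```python
def smoke_test(nlp, text: str = "Azure ML pipelines can clean text with spaCy."):
    """Run the pipeline on a sample sentence and return the token texts."""
    doc = nlp(text)
    tokens = [token.text for token in doc]
    if not tokens:
        # An empty result suggests the pipeline did not process the text.
        raise RuntimeError("The pipeline produced no tokens")
    return tokens
```

For example, at the top of data_prep.py you could run print(smoke_test(spacy.load("en_core_web_sm"))) and look for the token list in the job output.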

I hope this helps! Please give it a try and let me know if you have any other questions.

Regards,

Yutong