Python: I can't install a spaCy model in an EMR PySpark notebook
Tags: python, amazon-web-services, pyspark, amazon-emr, spacy

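I currently have an AWS EMR cluster with a notebook attached to it.

I want to load a spaCy model (en_core_web_sm), but first I need to download it. That is normally done with python -m spacy download en_core_web_sm, but I can't figure out how to do it from within a PySpark session.

Here is my configuration: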

%%configure -f
{
    "name":"conf0",
    "kind": "pyspark",
    "conf":{
          "spark.pyspark.python": "python",
          "spark.pyspark.virtualenv.enabled": "true",
          "spark.pyspark.virtualenv.type":"native",
          "spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv"
    },
    "files":["s3://my-s3/code/utils/NLPtools.py",
            "s3://my-s3/code/utils/Parse_wikidump.py",
            "s3://my-s3/code/utils/S3_access.py",
            "s3://my-s3/code/utils/myeval.py",
            "s3://my-s3/code/utils/rank_metrics.py",
            "s3://my-s3/code/utils/removeoutput.py",
            "s3://my-s3/code/utils/secret_manager.py",
            "s3://my-s3/code/utils/word2vec.py"]
}
I can run commands like these, and they work fine:

sc.install_pypi_package("boto3")
sc.install_pypi_package("pandas")
sc.install_pypi_package("hdfs")
sc.install_pypi_package("NLPtools")
sc.install_pypi_package("numpy")
sc.install_pypi_package("tqdm")
sc.install_pypi_package("wikipedia")
sc.install_pypi_package("filechunkio")
sc.install_pypi_package("thinc")
sc.install_pypi_package("gensim")
sc.install_pypi_package("termcolor")
sc.install_pypi_package("boto")
sc.install_pypi_package("spacy")
sc.install_pypi_package("langdetect")
sc.install_pypi_package("pathos")
But of course, since I haven't been able to download the model, I get the following error when I try to load it:

An error was encountered:
[E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
Traceback (most recent call last):
  File "/mnt/tmp/spark-eef27750-07a4-4a8a-82dc-b006827e7f1f/userFiles-ec6ecbe3-558b-42df-bd38-cd33b2340ae0/NLPtools.py", line 13, in <module>
    nlp = spacy.load('en_core_web_sm', disable=['parser', 'textcat'])
  File "/tmp/1596550154785-0/lib/python2.7/site-packages/spacy/__init__.py", line 30, in load
    return util.load_model(name, **overrides)
  File "/tmp/1596550154785-0/lib/python2.7/site-packages/spacy/util.py", line 175, in load_model
    raise IOError(Errors.E050.format(name=name))
IOError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
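The E050 message names the places spaCy looked: a shortcut link, an installed Python package, or a path to a data directory. As a rough diagnostic, the sketch below checks the last two from inside the session. It assumes Python 3 (the traceback above shows Python 2.7, where importlib.util is not available), diagnose_model is a hypothetical helper name rather than part of spaCy, and shortcut links are not checked.

```python
import importlib.util
from pathlib import Path


def diagnose_model(name):
    """Report whether `name` is an importable package or an existing path.

    Hypothetical helper for illustration only: it mirrors two of the
    lookups mentioned in the E050 message and never calls spaCy itself.
    """
    try:
        is_package = importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec raises for some malformed or relative names
        is_package = False
    return {
        "is_python_package": is_package,
        "is_existing_path": Path(name).exists(),
    }


if __name__ == "__main__":
    print(diagnose_model("en_core_web_sm"))
```

If both values are False in the session, the model is not installed in the environment that the session's interpreter sees.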
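I tried installing it directly on the cluster (master and worker nodes), but that is "outside" the PySpark session, so it is not picked up. Commands like !python -m spacy download en_core_web_sm also don't work in the PySpark notebook.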
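A likely reason the node-level install is not seen is that the session runs in its own virtualenv: the configuration sets spark.pyspark.virtualenv.enabled, and the traceback shows site-packages under /tmp/1596550154785-0/. The sketch below is one way to see which interpreter a session is using. describe_interpreter is a hypothetical helper, and the virtualenv check is a common heuristic, not anything EMR-specific.

```python
import sys


def describe_interpreter():
    """Return where the running Python lives and whether it looks like a virtualenv."""
    # venv makes sys.base_prefix differ from sys.prefix; older virtualenv
    # releases set sys.real_prefix instead.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    in_virtualenv = hasattr(sys, "real_prefix") or base_prefix != sys.prefix
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "in_virtualenv": in_virtualenv,
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(key, "=", value)
```

Run in a notebook cell, this describes the driver's interpreter. If its prefix differs from the one used when installing on the nodes, packages installed there will generally not be visible to the session.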


Thanks in advance.

The best way to install spaCy and its model is to use an EMR bootstrap script. This one worked for me.

My configuration:

Release label: emr-5.32.0
Hadoop distribution: Amazon 2.10.1
Applications: Spark 2.4.7, JupyterEnterpriseGateway 2.1.0, Livy 0.7.0
My script:

#!/bin/bash -xe

#### WARNING ####
## After modifying this script, upload it to S3 again.

version=1.1

printf "This is the latest script %s\n" "$version"

# Python modules that are not part of the standard Amazon Machine Image:
sudo /usr/bin/pip3.7 install -U \
  boto3 \
  pandas \
  langdetect \
  hdfs \
  tqdm \
  pathos \
  wikipedia \
  filechunkio \
  gensim \
  termcolor \
  awswrangler

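# Install spaCy. Order matters!
# First upgrade the build prerequisites and pip itself.
sudo /usr/bin/pip3.7 install -U \
  numpy \
  Cython \
  pip

# The upgraded pip lives under /usr/local/bin, so use that path from here on.
sudo /usr/local/bin/pip3.7 install -U spacy

# Download the model with the same Python 3 installation.
sudo python3 -m spacy download en_core_web_sm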
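Two important points to note:

  • Use sudo for all commands
  • Upgrade pip, and use its new path afterwards (/usr/local/bin/pip3.7 instead of /usr/bin/pip3.7)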
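
One way to sidestep the second point is to call pip as a module of a fixed interpreter (python3 -m pip), so the commands do not depend on where the pip executable ends up after the upgrade. The sketch below expresses the same install order in Python. It is an illustration, not the script used above: build_steps and run_steps are hypothetical names, and it assumes that python3 on the node is the interpreter the notebook kernel uses.

```python
import subprocess


def build_steps(python="python3", model="en_core_web_sm", use_sudo=True):
    """Return the install commands in the same order as the bootstrap script."""
    prefix = ["sudo"] if use_sudo else []
    pip = prefix + [python, "-m", "pip", "install", "-U"]
    return [
        pip + ["numpy", "Cython", "pip"],  # build prerequisites and pip itself
        pip + ["spacy"],                   # spaCy, once pip has been upgraded
        prefix + [python, "-m", "spacy", "download", model],  # then the model
    ]


def run_steps(steps, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in steps:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failure


if __name__ == "__main__":
    run_steps(build_steps())  # dry run: only prints the commands
```

The default dry run makes it easy to review the commands before running them on a node with run_steps(build_steps(), dry_run=False).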