Python 3.x: Efficient chunking with pandas

python-3.x pandas

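I'm working with a pandas DataFrame that can hold a large number of documents, and for each document I need to extract noun phrases that match a defined pattern. Everything works, but it is rather slow. I'm sure there is a better way to get the same result, but my Python knowledge isn't good enough yet.

So any suggestions on how to improve this would be much appreciated.

Here is my code: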

import nltk
from nltk import untag
from nltk.tokenize import sent_tokenize
from nltk.chunk.regexp import ChunkRule, RegexpChunkParser

def chunckMe(text, rule):
    # Return the lower-cased text of every chunk labelled 'LBL' found in `text`.
    phrases = []
    chunk_parser = RegexpChunkParser(rule, chunk_label='LBL')

    for sent in sent_tokenize(text):
        d_words = nltk.word_tokenize(sent)
        d_tagged = nltk.pos_tag(d_words)
        tree = chunk_parser.parse(d_tagged)

        # The root node is labelled 'S', so only matched chunks pass this test.
        for subtree in tree.subtrees():
            if subtree.label() == 'LBL':
                phrases.append(" ".join(untag(subtree)).lower())

    return phrases

# main def
def rm_main(data):

    np_all=[]

    # This works but can probably be done much better ...

    for index,row in data.iterrows():

        str=row["txt"]

        chunk_rule = ChunkRule("<JJ.*><NN.*>+|<JJ.*>*<NN.*><CC>*<NN.*>+|<CD><NN.*>", "Simple noun phrase")
        tags = chunckMe(str,[chunk_rule])
        np_all.append(', '.join(set(tags)))

    data['noun_phrases']=np_all

    return data

Do you know how I can avoid or improve the iterrows part? It works, but I strongly suspect there is a better way to do this.
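
One possible direction, offered as a minimal sketch rather than a measured fix: build the chunk rule and parser once instead of once per row, iterate over the "txt" column directly instead of calling iterrows, and tag all sentences of a document with a single nltk.pos_tag_sents call. The helper names make_nltk_extractor and add_noun_phrases are hypothetical, introduced only for illustration. The sketch also sorts the phrases so the output is deterministic, which the original set-based join does not guarantee. No timings were taken, and tokenizing and tagging are likely to remain the dominant cost.

```python
import pandas as pd


def make_nltk_extractor(pattern, label="LBL"):
    """Build the chunk parser once and return a text -> phrases function."""
    # Imported lazily so the pandas part can be tried without NLTK data installed.
    from nltk import pos_tag_sents, sent_tokenize, word_tokenize, untag
    from nltk.chunk.regexp import ChunkRule, RegexpChunkParser

    parser = RegexpChunkParser(
        [ChunkRule(pattern, "Simple noun phrase")], chunk_label=label
    )

    def extract(text):
        # Tokenize every sentence first, then tag them all in one call.
        sentences = [word_tokenize(s) for s in sent_tokenize(text)]
        phrases = set()
        for tagged in pos_tag_sents(sentences):
            for subtree in parser.parse(tagged).subtrees():
                if subtree.label() == label:
                    phrases.add(" ".join(untag(subtree)).lower())
        # Sorted so the joined string is the same on every run.
        return ", ".join(sorted(phrases))

    return extract


def add_noun_phrases(data, extract, column="txt"):
    """Apply `extract` to each document in `column`, without iterrows."""
    # Iterating over the column avoids building a Series object per row.
    data["noun_phrases"] = [extract(text) for text in data[column]]
    return data
```

With NLTK and its tokenizer and tagger data installed, rm_main could then be reduced to something like add_noun_phrases(data, make_nltk_extractor("<JJ.*><NN.*>+|<JJ.*>*<NN.*><CC>*<NN.*>+|<CD><NN.*>")). Passing the extractor in as an argument also makes it easy to swap in a cheaper function when testing the DataFrame handling on its own.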