
Python: Consequences of abusing nltk's word_tokenize(sent)

Tags: python, nltk


I want to split a paragraph into words. I have the lovely nltk.tokenize.word_tokenize(sent) on hand, but help(word_tokenize) says, "This tokenizer is designed to work on a sentence at a time."


Does anyone know what happens if you use it on a paragraph instead, i.e. up to 5 sentences? I've tried it myself on a few short paragraphs and it seems to work, but that's hardly conclusive proof.

nltk.tokenize.word_tokenize(text) is simply a thin wrapper function that calls the tokenize method of an instance of a tokenizer class, which apparently uses simple regexes to parse a sentence.

The class documentation states:

This tokenizer assumes that the text has been segmented into sentences. Any periods -- apart from those at the end of the string -- are assumed to be part of the word they are attached to (e.g. for abbreviations, etc.), and are not separately tokenized.

The method itself is quite simple:

def tokenize(self, text):
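    # Split multi-word contractions (matched by the CONTRACTIONS2/CONTRACTIONS3 regexes) into separate tokens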
    for regexp in self.CONTRACTIONS2:
        text = regexp.sub(r'\1 \2', text)
    for regexp in self.CONTRACTIONS3:
        text = regexp.sub(r'\1 \2 \3', text)

    # Separate most punctuation
    text = re.sub(r"([^\w\.\'\-\/,&])", r' \1 ', text)

    # Separate commas if they're followed by space.
    # (E.g., don't separate 2,500)
    text = re.sub(r"(,\s)", r' \1', text)

    # Separate single quotes if they're followed by a space.
    text = re.sub(r"('\s)", r' \1', text)

    # Separate periods that come before newline or end of string.
    text = re.sub('\. *(\n|$)', ' . ', text)

    return text.split()
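As an aside, the tokenizer class behind this wrapper can also be instantiated and called directly. Below is a minimal sketch, assuming the class is nltk.tokenize.TreebankWordTokenizer (the answer does not name it, and older NLTK releases may have wrapped a different implementation); the output shown comes from a recent NLTK, may vary by version, and also illustrates the contraction-splitting step above:

>>> from nltk.tokenize import TreebankWordTokenizer
>>> TreebankWordTokenizer().tokenize("I can't do that.")
['I', 'ca', "n't", 'do', 'that', '.']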
Basically, if the period falls at the end of the string, the method typically tokenizes it as a separate token:

>>> nltk.tokenize.word_tokenize("Hello, world.")
['Hello', ',', 'world', '.']
Any period inside the string is tokenized as part of the word, on the assumption that it belongs to an abbreviation:

>>> nltk.tokenize.word_tokenize("Hello, world. How are you?") 
['Hello', ',', 'world.', 'How', 'are', 'you', '?']
As long as that behavior is acceptable, you should be fine.
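If that behavior is not acceptable for multi-sentence input, one common workaround is to split the paragraph into sentences with nltk.sent_tokenize first and word-tokenize each sentence separately. This is a minimal sketch, not taken from the answers here; it assumes the Punkt sentence model has been downloaded (nltk.download('punkt')), and the exact tokens may vary by NLTK version:

>>> import nltk
>>> para = "I went to the store. Then I came home."
>>> [w for s in nltk.sent_tokenize(para) for w in nltk.word_tokenize(s)]
['I', 'went', 'to', 'the', 'store', '.', 'Then', 'I', 'came', 'home', '.']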

Try out this hack:

>>> from string import punctuation as punct
>>> sent = "Mr President, Mr President-in-Office, indeed we know that the MED-TV channel and the newspaper Özgür Politika provide very in-depth information. And we know the subject matter. Does the Council in fact plan also to use these channels to provide information to the Kurds who live in our countries? My second question is this: what means are currently being applied to integrate the Kurds in Europe?"
# Add spaces before punctuations
>>> for ch in sent:
...     if ch in punct:
...             sent = sent.replace(ch, " "+ch+" ")
# Remove double spaces if it happens after adding spaces before punctuations.
>>> sent = " ".join(sent.split())
Then most probably the following code is also what you need to compute the frequencies =)

>>> from nltk.tokenize import word_tokenize
>>> from nltk.probability import FreqDist
>>> fdist = FreqDist(word.lower() for word in word_tokenize(sent))
>>> for i in fdist:
...     print(i, fdist[i])

Comments:

Ahh, that behavior is not acceptable, as I use word frequencies for text classification. What a thorough answer, thanks!

This advice is now outdated: nltk.word_tokenize() now splits the text into sentences with the Punkt sentence tokenizer before determining the tokens.

Great hack! I'll give it a try.

nltk.word_tokenize() now handles text that contains multiple sentences.
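To illustrate the point made in the comments above, here is a quick check one can run with a recent NLTK, where word_tokenize() splits the text into sentences with the Punkt tokenizer before word-tokenizing each one (a sketch; the exact tokens may vary by version):

>>> from nltk import word_tokenize
>>> word_tokenize("Hello, world. How are you?")
['Hello', ',', 'world', '.', 'How', 'are', 'you', '?']

Note that 'world.' from the older example is now split into 'world' and '.'.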