How do I count phrases in Python and use them as a header?

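I have a file in which I am trying to get counts of phrases. There are about 100 phrases that I need to count in certain lines of text. As a simple example, I have the following: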

phrases = """hello
name
john doe
"""

text1 = 'id=1: hello my name is john doe.  hello hello.  how are you?'
text2 = 'id=2: I am good.  My name is Jane.  Nice to meet you John Doe'

header = ''
for phrase in phrases.splitlines():
    header = header+'|'+phrase
header = 'id'+header

I would like to be able to produce output like this:

id|hello|name|john doe
1|3|1|1
2|0|1|1

I have the header working. I just don't know how to count each phrase and append the counts to the output.

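You can use .count to count occurrences of a substring in a string.

So, apart from the accidental matches mentioned in the comments below, this should work: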

phrases = """hello
name
john doe
"""

text1 = 'id=1: hello my name is john doe.  hello hello.  how are you?'
text2 = 'id=2: I am good.  My name is Jane.  Nice to meet you John Doe'

texts = [text1,text2]

header = ''
for phrase in phrases.splitlines():
    header = header+'|'+phrase
header = 'id'+header
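print(header)

# start numbering at 1 so the rows match the ids in the expected output
for id, text in enumerate(texts, 1):
    textcount = [id]
    for phrase in header.split('|')[1:]:
        # lower-case the text so 'John Doe' matches 'john doe'
        textcount.append(text.lower().count(phrase))
    print("|".join(map(str, textcount)))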

The above assumes you have a list of texts in id order, but if they all start with "id=n" you could do the following instead:

for text in texts:
    id = text[3]  # assumes id is 4th char
    textcount = [id]
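
A fuller sketch of that idea, written for Python 3: rather than relying on the id being the fourth character, it reads the id from the id=n: prefix with a regular expression, so multi-digit ids would work too. The helper name count_row and the assumption that every text begins with id=<digits>: are illustrative and not part of the original answer.

```python
import re

phrases = ['hello', 'name', 'john doe']
texts = [
    'id=1: hello my name is john doe.  hello hello.  how are you?',
    'id=2: I am good.  My name is Jane.  Nice to meet you John Doe',
]

def count_row(text, phrases):
    """Return 'id|count|count|...' for one text (hypothetical helper)."""
    # Assumption: every text starts with 'id=<digits>:'
    match = re.match(r'id=(\d+):', text)
    if match is None:
        raise ValueError('text does not start with id=<digits>:')
    # Count only in the part after the prefix, case-insensitively
    body = text[match.end():].lower()
    counts = [body.count(phrase) for phrase in phrases]
    return '|'.join([match.group(1)] + [str(c) for c in counts])

print('id|' + '|'.join(phrases))
for text in texts:
    print(count_row(text, phrases))
```

This still uses plain substring counting, so it shares the accidental-match caveat mentioned above.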

Create a list of the header phrases:

In [6]: p=phrases.strip().split('\n')

In [7]: p
Out[7]: ['hello', 'name', 'john doe']

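Use a regular expression with word boundaries, i.e. \b, to find the occurrences while avoiding partial matches. The re.I flag makes the search case-insensitive: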

In [11]: import re

In [14]: re.findall(r'\b%s\b' % p[0], text1)
Out[14]: ['hello', 'hello', 'hello']

In [15]: re.findall(r'\b%s\b' % p[0], text1, re.I)
Out[15]: ['hello', 'hello', 'hello']

In [16]: re.findall(r'\b%s\b' % p[1], text1, re.I)
Out[16]: ['name']

In [17]: re.findall(r'\b%s\b' % p[2], text1, re.I)
Out[17]: ['john doe']

Wrap the call in len() to get the number of matches found.
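
Putting those pieces together, a minimal Python 3 sketch might look like the following. count_phrase is a hypothetical helper name, and re.escape is added on the assumption that some of the phrases could contain regex metacharacters; numbering the rows from 1 is also an assumption.

```python
import re

phrases = """hello
name
john doe
"""

text1 = 'id=1: hello my name is john doe.  hello hello.  how are you?'
text2 = 'id=2: I am good.  My name is Jane.  Nice to meet you John Doe'

p = phrases.strip().split('\n')

def count_phrase(phrase, text):
    """Count whole-word, case-insensitive occurrences of phrase in text."""
    pattern = r'\b%s\b' % re.escape(phrase)
    return len(re.findall(pattern, text, re.I))

print('id|' + '|'.join(p))
for n, text in enumerate([text1, text2], 1):
    counts = [count_phrase(phrase, text) for phrase in p]
    print('|'.join(map(str, [n] + counts)))
```

With around 100 phrases, compiling each pattern once with re.compile before looping over the texts may be worth considering.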

This doesn't answer your question (@askewchan and @Fredrik have already done that), but I thought I'd offer some advice on the rest of your approach:

It might be better to define your phrases in a list:

phrases = ['hello', 'name', 'john doe']

That way, you can skip the loop when creating the header:

header = 'id|' + '|'.join(phrases)

You can then drop the .split('|')[1:] part in askewchan's answer and simply iterate with for phrase in phrases.

Gives (note that john and doe are counted here as separate, case-sensitive phrases):

id|hello|name|john|doe
1|3|1|1|1
2|0|1|0|0

Great. I knew it had to be something simple. I was trying to split the text on spaces and it turned into a mess. Thanks a lot.

Using .count on its own can be dangerous here, because of accidental matches. Think of Othello, or enamel.

@DSM Yes, I think the solution would be to use regex, but I can't help with that.

Thanks a lot. That's what I needed.

@myname No worries, hope it helped.
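
To illustrate the accidental-match concern raised in the comments, here is a small Python 3 comparison, assuming whole-word matching is what is wanted. The sample sentence is made up for illustration.

```python
import re

sample = 'Othello said hello to the enamel name plate'

# Plain substring counting also hits 'hello' inside 'Othello'
# and 'name' inside 'enamel'.
substring_hello = sample.lower().count('hello')
substring_name = sample.lower().count('name')

# Word-boundary matching only counts standalone words.
word_hello = len(re.findall(r'\bhello\b', sample, re.I))
word_name = len(re.findall(r'\bname\b', sample, re.I))

print(substring_hello, word_hello)
print(substring_name, word_name)
```

In this sample the substring counts come out higher than the word-boundary counts, which is the mismatch the comment warns about.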