How do I determine the probability of a word in Python?

python linux

I have two files. Doc1 has the following format:

TOPIC:  0 5892.0
site 0.0371690427699
Internet 0.0261371350984
online 0.0229124236253
web 0.0218940936864
say 0.0159538357094

TOPIC:  1 12366.0
web 0.150331554262
site 0.0517548115801
say 0.0451237263464
Internet 0.0153647096879
online 0.0135856380398

...and so on up to topic 99, following the same pattern.

Doc2 has the following format:

0 0.566667 0 0.0333333 0 0 0 0.133333 ..........

and so on. There are 100 values in total, one for each topic.

Now I need to find the weighted average probability of each word, i.e.:

P(w) = alpha_1*P(w_1) + alpha_2*P(w_2) + ... + alpha_n*P(w_n)

where alpha_n is the value in the nth position of Doc2, corresponding to the nth topic, and P(w_n) is the probability of the word w in the nth topic.

That is, for the word "say", the probability should be

P(say) = 0*0.0159 + 0.5666*0.045+.......

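Similarly, I need to compute the probability for every word.

For the multiplication, if the word is taken from topic 0, then the 0th value from Doc2 must be used, and so on.

So far I have only used the code below to count word occurrences, but I have never extracted their values, so I am confused.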
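As a quick check of the formula, here is a minimal Python 3 sketch that evaluates it for "say" using only the two topics shown above and the first two Doc2 values. The names `topic_probs` and `weights` are made up for illustration; with all 100 topics the sum would simply have more terms.

```python
# P(say) in topic 0 and in topic 1, copied from the Doc1 sample above
topic_probs = [0.0159538357094, 0.0451237263464]
# first two Doc2 values, i.e. the weights for topic 0 and topic 1
weights = [0.0, 0.566667]

# weighted sum: weight_0 * P_0(say) + weight_1 * P_1(say)
p_say = sum(a * p for a, p in zip(weights, topic_probs))
print(p_say)
```

Because the weight for topic 0 is 0, only the topic 1 term contributes here.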

from collections import defaultdict

with open(doc2, "r") as f:
    with open(doc3, "w") as f1:
        words = " ".join(line.strip() for line in f)
        d = defaultdict(int)
        for word in words.split():
            d[word] += 1
            for key, value in d.iteritems():
                f1.write(key + ' ' + str(value) + ' ')
            print '\n'

My output should look like this:

 say = "prob of this word calculated by above formula"
 site = "
 internet = "

and so on.

What am I doing wrong?

Assuming you ignore the TOPIC lines, use a defaultdict to group the values, then do the calculation at the end:

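from collections import defaultdict

d = defaultdict(list)
with open("doc1") as f, open("doc2") as f2:
    # all weights from doc2 as a list, so they can be reused for every word
    values = list(map(float, f2.read().split()))
    for line in f:
        # skip blank lines and the TOPIC header lines
        if line.strip() and not line.startswith("TOPIC"):
            name, val = line.split()
            d[name].append(float(val))

# the i-th value collected for a word is paired with the i-th weight,
# which assumes every word appears in every topic, in topic order
for k, v in d.items():
    print("Prob for {} is {}".format(k, sum(i * j for i, j in zip(v, values))))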

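Another approach is to compute as you go: each time you hit a new section, i.e. a TOPIC line, increment a counter and use it as an index to pick the right value from values: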

from collections import defaultdict

d = defaultdict(float)

with open("doc1") as f, open("doc2") as f2:
    # create list of all floats from doc2
    values = list(map(float, f2.read().split()))
    ind = -1  # index of the current topic; becomes 0 at the first TOPIC line
    for line in f:
        # on a new TOPIC line, increase ind to get the corresponding index into values
        if line.startswith("TOPIC"):
            ind += 1
            continue
        # ignore empty lines
        if line.strip():
            # get word and float, and multiply the val by the corresponding value from values
            name, val = line.split()
            d[name] += float(val) * values[ind]

for k, v in d.items():
    print("Prob for {} is {}".format(k, v))

Using the two Doc1 sections shown above and the following in doc2:

0 0.566667 0 0.0333333 0

this outputs the following:

Prob for web is 0.085187930859
Prob for say is 0.0255701266375
Prob for online is 0.0076985327511
Prob for site is 0.0293277438137
Prob for Internet is 0.00870667394471
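For anyone who wants to reproduce those numbers without creating files, here is a self-contained Python 3 sketch of the same running-sum idea. It is only an illustration: `io.StringIO` stands in for the real doc1 and doc2 files, and the function name `weighted_probs` is hypothetical, not something from the question.

```python
from collections import defaultdict
import io

# Sample data copied from the question; StringIO stands in for the real files.
DOC1 = """\
TOPIC:  0 5892.0
site 0.0371690427699
Internet 0.0261371350984
online 0.0229124236253
web 0.0218940936864
say 0.0159538357094

TOPIC:  1 12366.0
web 0.150331554262
site 0.0517548115801
say 0.0451237263464
Internet 0.0153647096879
online 0.0135856380398
"""
DOC2 = "0 0.566667 0 0.0333333 0\n"


def weighted_probs(doc1, doc2):
    """Return {word: sum over topics of weight[topic] * P(word in topic)}."""
    weights = [float(x) for x in doc2.read().split()]
    probs = defaultdict(float)
    topic = -1  # index of the current TOPIC section
    for line in doc1:
        if line.startswith("TOPIC"):
            topic += 1
            continue
        if line.strip():
            word, val = line.split()
            probs[word] += float(val) * weights[topic]
    return dict(probs)


result = weighted_probs(io.StringIO(DOC1), io.StringIO(DOC2))
for word, prob in sorted(result.items()):
    print("Prob for {} is {}".format(word, prob))
```

The words are printed in sorted order here, so the line order may differ from the listing above, and the last printed digits can differ slightly between Python versions.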

You can also use itertools groupby:

from collections import defaultdict
from itertools import groupby, imap

d = defaultdict(float)

with open("doc1") as f, open("doc2") as f2:
    values = imap(float, f2.read().split())
    # lambda x: not x.strip() splits the lines into groups on the empty lines
    for k, v in groupby(f, key=lambda x: not x.strip()):
        if not k:
            # skip the TOPIC header line of this group
            next(v)
            # get matching float from values
            weight = next(values)
            # iterate over the remaining lines of the group
            for s in v:
                name, val = s.split()
                d[name] += float(val) * weight

for k, v in d.items():
    print("Prob for {} is {}".format(k, v))

For Python 3, every imap should be changed to just map, which also returns an iterator in Python 3.
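To illustrate that remark, here is a tiny Python 3 sketch showing that the built-in `map` is lazy and can be advanced with `next()`, just like `itertools.imap` in Python 2. The sample string is only a stand-in for the contents of doc2.

```python
# In Python 3, map() returns an iterator rather than a list.
values = map(float, "0 0.566667 0".split())

first = next(values)    # consumes the first item
second = next(values)   # consumes the second item
rest = list(values)     # whatever is left over

print(first, second, rest)
```

An iterator like this cannot be indexed and can only be consumed once, so wrap it in `list()` if you need `values[i]` or want to loop over it more than once.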

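So, for example, 0.0159538357094 multiplied by 5892.0? I only see you multiplying by the corresponding elements of the doc2 file in your question.

No worries, you're welcome. You had me a little confused :)

No problem, at least you made a good effort to solve it yourself, which is more than a lot of people here do ;)

I will take a look in the a.m., but basically we just need to replace .read().split() with splitting each line, then iterate over the lines, opening a file for each line and writing the output.

Are you using the code exactly as posted, and with python2?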
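The per-line idea from the comments could be sketched roughly as follows in Python 3. This is only a guess at what was meant: it assumes doc2 holds one row of weights per line and that one output file per row is wanted. The helper names and the `out_{}.txt` file pattern are hypothetical; the thread does not specify them.

```python
from collections import defaultdict


def read_topics(path):
    """Parse doc1 into a list of {word: prob} dicts, one per TOPIC section."""
    topics = []
    with open(path) as f:
        for line in f:
            if line.startswith("TOPIC"):
                topics.append({})
            elif line.strip():
                # assumes every word line comes after some TOPIC line
                word, val = line.split()
                topics[-1][word] = float(val)
    return topics


def write_per_row(doc1_path, doc2_path, out_pattern="out_{}.txt"):
    """For each non-empty line of doc2, write the weighted probabilities to its own file."""
    topics = read_topics(doc1_path)
    written = []
    with open(doc2_path) as f2:
        for row, line in enumerate(f2):
            if not line.strip():
                continue
            weights = [float(x) for x in line.split()]
            probs = defaultdict(float)
            # zip stops at the shorter sequence, so surplus weights or topics are ignored
            for weight, topic in zip(weights, topics):
                for word, p in topic.items():
                    probs[word] += weight * p
            out_path = out_pattern.format(row)
            with open(out_path, "w") as out:
                for word, prob in probs.items():
                    out.write("{} {}\n".format(word, prob))
            written.append(out_path)
    return written
```

Calling `write_per_row("doc1", "doc2")` would then produce `out_0.txt`, `out_1.txt`, and so on, one per weight row. Parsing doc1 once up front avoids re-reading it for every line of doc2.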