Python N-grams to columns

python pandas

Given the following data frame:

import pandas as pd
d=['Hello', 'Helloworld']
f=pd.DataFrame({'strings':d})
f
    strings
0   Hello
1   Helloworld

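I would like to split each string into chunks of 3 characters and use those chunks as column headers for a matrix of 1s and 0s, where each cell indicates whether the string in that row contains the given 3-character chunk.

Like this: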

    Strings     Hel     low     orl
0   Hello         1       0       0
1   Helloworld    1       1       1

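Note that the string "Hello" has a 0 in the "low" column, because a 1 is assigned only for exact chunk matches. If a chunk matches more than once (for example, if the string were "HelHel"), it would still be assigned only a 1, although it would also be good to know how to count the matches and assign a 2 instead.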
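To make the difference between presence (1 or 0) and counts concrete, here is a minimal sketch using only the standard library. The helper name `chunk` is hypothetical and is not part of the question's code; it assumes that a trailing piece shorter than 3 characters is dropped, as in the expected output above.

```python
from collections import Counter

def chunk(s, k):
    """Split s into consecutive, non-overlapping pieces of exactly k characters.

    Hypothetical helper for illustration; a trailing piece shorter than k is dropped.
    """
    return [s[i:i + k] for i in range(0, len(s) - k + 1, k)]

# Counting keeps the multiplicity of each chunk...
counts = Counter(chunk("HelHel", 3))

# ...while a presence indicator collapses every non-zero count to 1.
presence = {c: 1 for c in counts}

print(dict(counts))   # how often each chunk occurs
print(presence)       # whether each chunk occurs at all
```

With counts, "HelHel" gets a 2 in the "Hel" column; with presence, it gets a 1.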

Ultimately, I am trying to prepare the data for use with LSHForest in SKLearn, so I expect there to be many distinct string values.

Here is what I have tried so far:

#Split into chunks of exactly 3
def split(s, chunk_size):
    a = zip(*[s[i::chunk_size] for i in range(chunk_size)])
    return [''.join(t) for t in a]
cols=[split(s,3) for s in f['strings']]
cols

[['Hel'], ['Hel', 'low', 'orl']]

#Get all elements into one list:
import itertools
colsunq=list(itertools.chain.from_iterable(cols))
#Remove duplicates:
colsunq=list(set(colsunq))
colsunq

['orl', 'Hel', 'low']

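Now all I need to do is create a column in f for each element of colsunq, and put a 1 in it wherever the string in the "strings" column contains the chunk named by that column header.

Thanks in advance!

Note: if shingling is preferred: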

#Shingle into strings of exactly 3
def shingle(word):
    a = [word[i:i + 3] for i in range(len(word) - 3 + 1)]
    return [''.join(t) for t in a]
#Shingle (i.e. "hello" -> "hel","ell",'llo')
a=[shingle(w) for w in f['strings']]
#Get all elements into one list:
import itertools
colsunq=list(itertools.chain.from_iterable(a))
#Remove duplicates:
colsunq=list(set(colsunq))
colsunq
['wor', 'Hel', 'ell', 'owo', 'llo', 'rld', 'orl', 'low']

Chunking and shingling

f.strings.apply(chunkit, k=3)

0              [Hel]
1    [Hel, low, orl]
Name: strings, dtype: object

f.strings.apply(count_chunks, k=3).fillna(0)
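The definitions of `chunkit` and `count_chunks` used in the two calls above are not shown. The following is a sketch of what they might look like, modeled on the shingle helpers in this answer; the function bodies (and the helper name `str_chunk`) are assumptions, not the answerer's confirmed code.

```python
import pandas as pd

def str_chunk(s, k):
    """Yield consecutive, non-overlapping pieces of s, each exactly k characters long."""
    i = 0
    while i + k <= len(s):
        yield s[i:i + k]
        i += k  # jump a full chunk ahead, so pieces never overlap

def chunkit(s, k):
    # Assumed definition: collect the chunks into a list.
    return list(str_chunk(s, k))

def count_chunks(s, k):
    # Assumed definition: one count per distinct chunk, indexed by the chunk text.
    return pd.Series(chunkit(s, k)).value_counts()

f = pd.DataFrame({'strings': ['Hello', 'Helloworld']})

# Each row yields a Series of counts; apply aligns them into one column per chunk.
# Chunks missing from a row come back as NaN, which fillna(0) turns into 0.
counts = f.strings.apply(count_chunks, k=3).fillna(0)
print(counts)
```

If only presence is wanted rather than counts, the result can be reduced with something like `(counts > 0).astype(int)`.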

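# Shingling: overlapping windows of length k, advancing one character at a time
def str_shingle(s, k):
    i, j = 0, k
    while j <= len(s):
        yield s[i:j]
        i, j = i + 1, j + 1

def shingleit(s, k):
    return list(str_shingle(s, k))

def count_shingles(s, k):
    # pd.value_counts is deprecated in recent pandas; use the Series method instead
    return pd.Series(shingleit(s, k)).value_counts()

# One column per distinct shingle; shingles absent from a row become 0
f.strings.apply(count_shingles, k=3).fillna(0)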

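I am having trouble applying your solution to the shingling example I added to the original question. Can you offer some guidance? If you prefer, I can post it as a separate question.

@DanceParty2 I actually had that in mind when I wrote this. It is a very simple modification. I have updated my post.

Wow, this is great. Thank you very much! One more thing: I noticed that it takes a while on a large data frame. In the end I need a list of lists to pass into scikit-learn's LSHForest, like this: [[1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0], [1.0,1.0,1.0,1.0,1.0,1.0,1.0], ...]. I know I can wrap it with np.array().tolist(), but is there a way to get it into that form from the data frame (if that would save time)?

I have been struggling with that too. The only way I can think of to improve it is to use Cython, but I would not expect a significant improvement. I will update the post.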
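As an illustration of the list-of-lists format asked about in the comments, the sketch below builds the count rows directly in plain Python, without going through `apply` and a data frame. The names `shingles` and `count_matrix` are hypothetical, and no timing has been done here, so whether this is actually faster on a large data frame is an open assumption.

```python
def shingles(s, k):
    """Return all overlapping substrings of s with length k."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]

def count_matrix(strings, k):
    """Build (columns, rows): rows is a list of lists of float counts.

    Hypothetical helper; columns are ordered by first appearance of each shingle.
    """
    shingled = [shingles(s, k) for s in strings]

    # Assign each distinct shingle a column position in order of first appearance.
    vocab = {}
    for row in shingled:
        for g in row:
            vocab.setdefault(g, len(vocab))

    # Fill one dense row of counts per input string.
    rows = []
    for row in shingled:
        vec = [0.0] * len(vocab)
        for g in row:
            vec[vocab[g]] += 1.0
        rows.append(vec)
    return list(vocab), rows

columns, matrix = count_matrix(['Hello', 'Helloworld'], 3)
print(columns)
print(matrix)
```

The rows are dense, so with many distinct strings the matrix becomes wide and mostly zeros; a sparse representation may be a better fit in that case.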