Python: How to efficiently count rows in a single pass over a DataFrame

I have a DataFrame of strings that looks like this:

ID_0 ID_1
 g    k
 a    h
 c    i
 j    e
 d    i
 i    h
 b    b
 d    d
 i    a
 d    h
For each pair of strings, I can count how many rows contain either of them as follows:

import pandas as pd
import itertools

df = pd.read_csv("test.csv", header=None, prefix="ID_", usecols = [0,1])

alphabet_1 = set(df['ID_0'])
alphabet_2 = set(df['ID_1'])
# This just makes a set of all the strings in the dataframe.
alphabet = alphabet_1 | alphabet_2
# This iterates over all pairs and counts how many rows contain either string in either column.
for (x, y) in itertools.combinations(alphabet, 2):
    print(x, y, len(df.loc[df['ID_0'].isin([x, y]) | df['ID_1'].isin([x, y])]))
This gives:

a c 3
a b 3
a e 3
a d 5
a g 3
a i 5
a h 4
a k 3
a j 3
c b 2
c e 2
c d 4
[...]
The problem is that my DataFrame is very large and the alphabet has size 200, so this approach makes an independent pass over the entire DataFrame for every pair of letters (with 200 symbols that is 200*199/2 = 19,900 full passes).

Is it possible to get the same output with a single pass over the DataFrame somehow?


Timings

I created some test data with:

import numpy as np
import pandas as pd
from string import ascii_lowercase
n = 10**4
data = np.random.choice(list(ascii_lowercase), size=(n,2))
df = pd.DataFrame(data, columns=['ID_0', 'ID_1'])

#Testing Parfait's answer
def f(row):
    ser = len(df[(df['ID_0'] == row['ID_0']) | (df['ID_1'] == row['ID_0'])|
                 (df['ID_0'] == row['ID_1']) | (df['ID_1'] == row['ID_1'])])
    return(ser)

%timeit df.apply(f, axis=1)
1 loops, best of 3: 37.8 s per loop
I would like to be able to do this for n=10**8. Can this be sped up?


Consider a DataFrame.apply() approach:

from io import StringIO
import pandas as pd

data = '''ID_0,ID_1
g,k
a,h
c,i
j,e
d,i
i,h
b,b
d,d
i,a
d,h
'''    
df = pd.read_csv(StringIO(data))

def f(row):
    ser = len(df[(df['ID_0'] == row['ID_0']) | (df['ID_1'] == row['ID_0'])|
                 (df['ID_0'] == row['ID_1']) | (df['ID_1'] == row['ID_1'])])
    return(ser)

df['CountIDs'] = df.apply(f, axis=1)
print(df)
#   ID_0 ID_1  CountIDs
# 0    g    k         1
# 1    a    h         4
# 2    c    i         4
# 3    j    e         1
# 4    d    i         6
# 5    i    h         6
# 6    b    b         1
# 7    d    d         3
# 8    i    a         5
# 9    d    h         5
Alternative solutions:

# VECTORIZED w/ list comprehension
def f(x, y, z):    
    ser = [len(df[(df['ID_0'] == x[i]) | (df['ID_1'] == x[i])|
                  (df['ID_0'] == y[i]) | (df['ID_1'] == y[i])]) for i in z]
    return(ser)

df['CountIDs'] = f(df['ID_0'], df['ID_1'], df.index)

# USING map()
def f(x, y):
    ser = len(df[(df['ID_0'] == x) | (df['ID_1'] == x)|
                 (df['ID_0'] == y) | (df['ID_1'] == y)])
    return(ser)

df['CountIDs'] = list(map(f, df['ID_0'], df['ID_1']))

# USING zip() w/ list comprehension
def f(x, y):
    ser = len(df[(df['ID_0'] == x) | (df['ID_1'] == x)|
                 (df['ID_0'] == y) | (df['ID_1'] == y)])
    return(ser)

df['CountIDs'] = [f(x,y) for x,y in zip(df['ID_0'], df['ID_1'])]

# USING apply() w/ isin()
def f(row):
    ser = len(df[(df['ID_0'].isin([row['ID_0'], row['ID_1']]))|
                 (df['ID_1'].isin([row['ID_0'], row['ID_1']]))])
    return(ser)

df['CountIDs'] = df.apply(f, axis=1)

By doing the counting with some clever combinatorics/set theory, you can avoid the row-level sub-iteration entirely:

# Count of individual characters and pairs.
char_count = df['ID_0'].append(df.loc[df['ID_0'] != df['ID_1'], 'ID_1']).value_counts().to_dict()
pair_count = df.groupby(['ID_0', 'ID_1']).size().to_dict()

# Get the counts.
df['count'] = [char_count[x] if x == y
               else char_count[x] + char_count[y] - (pair_count[x, y] + pair_count.get((y, x), 0))
               for x, y in df[['ID_0', 'ID_1']].values]
Resulting output:

  ID_0 ID_1  count
0    g    k      1
1    a    h      4
2    c    i      4
3    j    e      1
4    d    i      6
5    i    h      6
6    b    b      1
7    d    d      3
8    i    a      5
9    d    h      5
I compared the output of my method against the row-level iteration method on a dataset with 5,000 rows, and all of the counts matched.
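A standalone sketch of how such a cross-check could be set up (the names, the random test data, and the use of pd.concat instead of Series.append are illustrative, not from the original post):

# Hypothetical consistency check: compare the set-theory counts against the
# brute-force apply()/isin() counts on a random 5,000-row sample.
import numpy as np
import pandas as pd
from string import ascii_lowercase

df_test = pd.DataFrame(np.random.choice(list(ascii_lowercase), size=(5000, 2)),
                       columns=['ID_0', 'ID_1'])

# Set-theory counts (pd.concat is used here because newer pandas versions
# no longer provide Series.append).
singles = pd.concat([df_test['ID_0'],
                     df_test.loc[df_test['ID_0'] != df_test['ID_1'], 'ID_1']])
char_count = singles.value_counts().to_dict()
pair_count = df_test.groupby(['ID_0', 'ID_1']).size().to_dict()
fast = [char_count[x] if x == y
        else char_count[x] + char_count[y] - (pair_count[x, y] + pair_count.get((y, x), 0))
        for x, y in df_test[['ID_0', 'ID_1']].values]

# Brute-force row-level counts, as in the apply()/isin() approach above.
slow = df_test.apply(lambda row: len(df_test[df_test['ID_0'].isin([row['ID_0'], row['ID_1']]) |
                                             df_test['ID_1'].isin([row['ID_0'], row['ID_1']])]),
                     axis=1)

assert list(slow) == fast  # every row count matches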

Why does this work? It basically just relies on the formula for counting the size of the union of two sets:

|A ∪ B| = |A| + |B| - |A ∩ B|

The cardinality for a given element is just its character count. When the two elements are distinct, the cardinality of the intersection is just the count of the pair of elements, in either order. Note that when the two elements are the same, the formula reduces to just char_count.
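For example, for the row i a in the sample data, the dictionaries built above give char_count['i'] == 4 and char_count['a'] == 2, while the pair (i, a) occurs once, so the single row containing both letters is not double-counted. A minimal worked check (assuming the char_count and pair_count dictionaries from the snippet above are in scope):

# Worked check of the union formula for the row ('i', 'a') from the sample data.
x, y = 'i', 'a'
count = char_count[x] + char_count[y] - (pair_count.get((x, y), 0) + pair_count.get((y, x), 0))
print(count)  # 4 + 2 - (1 + 0) = 5, matching the 'count' column above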

Timings

Using the timing setup from the question, and the following function for this answer:

def root(df):
    char_count = df['ID_0'].append(df.loc[df['ID_0'] != df['ID_1'], 'ID_1']).value_counts().to_dict()
    pair_count = df.groupby(['ID_0', 'ID_1']).size().to_dict()
    df['count'] = [char_count[x] if x == y
                   else char_count[x] + char_count[y] - (pair_count[x, y] + pair_count.get((y, x), 0))
                   for x, y in df[['ID_0', 'ID_1']].values]
    return df
I got the following timings for n=10**4:

%timeit root(df.copy())
10 loops, best of 3: 25 ms per loop

%timeit df.apply(f, axis=1)
1 loop, best of 3: 49.4 s per loop
I got the following timings for n=10**6:

%timeit root(df.copy())
10 loops, best of 3: 2.22 s per loop

My solution appears to scale approximately linearly.
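A rough sketch of how that scaling claim could be checked (the harness below is illustrative; it assumes root is defined as above, and absolute numbers will vary by machine):

# Rough scaling check for root(): time it at increasing n and compare the growth.
# Note: root() as written uses Series.append, which newer pandas replaces with pd.concat.
import time
import numpy as np
import pandas as pd
from string import ascii_lowercase

for n in (10**4, 10**5, 10**6):
    d = pd.DataFrame(np.random.choice(list(ascii_lowercase), size=(n, 2)),
                     columns=['ID_0', 'ID_1'])
    start = time.perf_counter()
    root(d)
    print(n, round(time.perf_counter() - start, 3))  # roughly 10x time per 10x rows if linear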

Can you post the desired results for the specific example at the top of the question? I am getting results but want to check that they match yours.

Your expected results do not look consistent with the input. When i appears 4 times across both columns and a appears 2 times across both columns, shouldn't the row i a give 6?

@Parfait 'a i' gives 5 because one of those rows contains both 'a' and 'i', so it should not be double-counted. I believe the results I posted in the question are correct.

Possible duplicate.

Thanks. Sadly this approach seems to be very slow.

See the other approaches; the best is isin(), which is slightly faster than the | filter logic. Interestingly, I conceptualized this solution from SQL using subqueries. Where does the data come from? To speed it up you would need to vectorize (I tried), but I do not see how to get past the required iteration, since every row has to go through the counting process.