Counting common items between two string columns with Python

I would appreciate it if someone could help me count the matching state names across two columns of a CSV file. For example, given the columns State_born_in and State_lives_in:

State_born_in    State_lives_in
New York         Florida
Massachusetts    Massachusetts
Florida          Massachusetts
Illinois         Illinois
Iowa             Texas
New Hampshire    Massachusetts
California       California
Basically, I want to count how many people live in the same state they were born in, and then get the percentage of all people who live in their birth state. So in the example above my count would be 3, since three people (the Massachusetts, Illinois, and California rows) live in the same state they were born in. For the percentage I would just divide that count by the number of observations (3/7, about 43% here). I'm fairly new to pandas, but this is what I've tried so far:

import pandas as pd

df = pd.read_csv("uscitizens.csv")
# rows where the two columns match
counts = df[df['State_born_in'] == df['State_lives_in']]
percentage = len(counts) / len(df['State_born_in'])

Also, how would I do this on a dataset with more than 2 million observations? I would really appreciate anyone's help.

Are you expecting this?

counts = df[ df['State_born_in'] == df['State_lives_in'] ].groupby('State_born_in').agg(['count']).sum()
counts / len(df['State_born_in'])
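
For reference, a minimal self-contained sketch of that approach on the question's seven-row sample (the inline DataFrame is just a stand-in for the real CSV):

import pandas as pd

# stand-in for pd.read_csv("uscitizens.csv"), built from the question's sample
df = pd.DataFrame({
    'State_born_in':  ['New York', 'Massachusetts', 'Florida', 'Illinois',
                       'Iowa', 'New Hampshire', 'California'],
    'State_lives_in': ['Florida', 'Massachusetts', 'Massachusetts', 'Illinois',
                       'Texas', 'Massachusetts', 'California'],
})

counts = df[df['State_born_in'] == df['State_lives_in']].groupby('State_born_in').agg(['count']).sum()
print(counts / len(df['State_born_in']))  # 3 matches out of 7 rows, about 0.43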
You can take the length of the filtered DataFrame and divide it by the length of the original DataFrame; that denominator is the same as the length of the index, which the timings below show is the fastest way to get it.
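
A minimal sketch, assuming the CSV and column names from the question:

import pandas as pd

df = pd.read_csv("uscitizens.csv")

# length of the filtered DataFrame over the length of the original one;
# len(df.index) is the cheapest way to get a DataFrame's row count
same = df[df['State_born_in'] == df['State_lives_in']]
percentage = 100 * len(same.index) / len(df.index)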

Timings:

In [21]: %timeit len(same.index)
The slowest run took 18.82 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 546 ns per loop

In [22]: %timeit same.shape[0]
The slowest run took 21.82 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.37 µs per loop

In [23]: %timeit len(same['State_born_in'])
The slowest run took 46.92 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 10.3 µs per loop

A faster solution (demonstrated below on the answerer's own nine-row sample, not the question's seven rows):

same = (df['State_born_in'] == df['State_lives_in'])
print (same)
0    False
1     True
2     True
3     True
4    False
5     True
6    False
7    False
8     True
dtype: bool

counts = same.sum()
print (counts)
5

percentage = 100 * counts/len(df.index)
print (percentage)
55.5555555556
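Summing the boolean mask counts the True values directly and never materializes a filtered copy of the DataFrame, which is why it comes out slightly ahead of the filter-then-measure approach in the 2M-row timings below.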
Timings on a 2M-row DataFrame:

#[2000000 rows x 2 columns]
df = pd.concat([df]*200000).reset_index(drop=True)
#print (df)


In [127]: %timeit (100 * (df['State_born_in'] == df['State_lives_in']).sum()/len(df.index))
1 loop, best of 3: 444 ms per loop

In [128]: %timeit (100 * len(df[(df['State_born_in'] == df['State_lives_in'])].index)/len(df.index))
1 loop, best of 3: 472 ms per loop

Thank you so much! By the way, how could I do this on a large dataset with about 2 million observations?

I think you can use this solution - it handles large DataFrames well. Or did you run into some problem?

I solved it :). Basically, I rewrote df = pd.read_csv("uscitizens.csv","a") as with open('uscitizens.csv', 'r') as f: so that the file is processed as a stream instead of being loaded into memory, but I followed your approach with line[2] == line[3], where line[2] is State_born_in and line[3] is State_lives_in. Thanks for your help!
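
For reference, a minimal sketch of that streaming approach using the standard csv module (the column indices 2 and 3 are taken from the comment above and depend on the real file's layout, so treat them as an assumption):

import csv

matches = 0
total = 0
with open('uscitizens.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for line in reader:
        total += 1
        # indices assumed from the comment; adjust to the actual CSV layout
        if line[2] == line[3]:
            matches += 1

percentage = 100 * matches / total
print(matches, percentage)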