Counting common items between two string columns with Python

I would appreciate it if someone could help me count the matching state names across two columns of a CSV file. For example, given the columns State_born_in and State_lives_in:

State_born_in    State_lives_in
New York         Florida
Massachusetts    Massachusetts
Florida          Massachusetts
Illinois         Illinois
Iowa             Texas
New Hampshire    Massachusetts
California       California
Basically, I want to count how many people live in the same state they were born in, and then get the percentage of all people who live in their birth state. So in the example above my count would be 3, since three people (the Massachusetts, Illinois, and California rows) live in the same state they were born in. For the percentage I would just divide that count by the number of observations (3/7, about 43% here). I'm fairly new to pandas, but this is what I've tried so far:

import pandas as pd

df = pd.read_csv("uscitizens.csv")
# rows where the two columns match
counts = df[df['State_born_in'] == df['State_lives_in']]
percentage = len(counts) / len(df['State_born_in'])

Also, how would I do this on a dataset with more than 2 million observations? I would really appreciate anyone's help.

Are you expecting this?

counts = df[ df['State_born_in'] == df['State_lives_in'] ].groupby('State_born_in').agg(['count']).sum()
counts / len(df['State_born_in'])
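
For reference, a minimal self-contained sketch of that approach on the question's seven-row sample (the inline DataFrame is just a stand-in for the real CSV):

import pandas as pd

# stand-in for pd.read_csv("uscitizens.csv"), built from the question's sample
df = pd.DataFrame({
    'State_born_in':  ['New York', 'Massachusetts', 'Florida', 'Illinois',
                       'Iowa', 'New Hampshire', 'California'],
    'State_lives_in': ['Florida', 'Massachusetts', 'Massachusetts', 'Illinois',
                       'Texas', 'Massachusetts', 'California'],
})

counts = df[df['State_born_in'] == df['State_lives_in']].groupby('State_born_in').agg(['count']).sum()
print(counts / len(df['State_born_in']))  # 3 matches out of 7 rows, about 0.43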
You can take the length of the filtered DataFrame and divide it by the length of the original DataFrame; that denominator is the same as the length of the index, which the timings below show is the fastest way to get it.
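
A minimal sketch, assuming the CSV and column names from the question:

import pandas as pd

df = pd.read_csv("uscitizens.csv")

# length of the filtered DataFrame over the length of the original one;
# len(df.index) is the cheapest way to get a DataFrame's row count
same = df[df['State_born_in'] == df['State_lives_in']]
percentage = 100 * len(same.index) / len(df.index)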

Timings:

In [21]: %timeit len(same.index)
The slowest run took 18.82 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 546 ns per loop

In [22]: %timeit same.shape[0]
The slowest run took 21.82 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.37 µs per loop

In [23]: %timeit len(same['State_born_in'])
The slowest run took 46.92 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 10.3 µs per loop

A faster solution (demonstrated below on the answerer's own nine-row sample, not the question's seven rows):

same = (df['State_born_in'] == df['State_lives_in'])
print (same)
0    False
1     True
2     True
3     True
4    False
5     True
6    False
7    False
8     True
dtype: bool

counts = same.sum()
print (counts)
5

percentage = 100 * counts/len(df.index)
print (percentage)
55.5555555556
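Summing the boolean mask counts the True values directly and never materializes a filtered copy of the DataFrame, which is why it comes out slightly ahead of the filter-then-measure approach in the 2M-row timings below.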
Timings on a 2M-row DataFrame:

#[2000000 rows x 2 columns]
df = pd.concat([df]*200000).reset_index(drop=True)
#print (df)


In [127]: %timeit (100 * (df['State_born_in'] == df['State_lives_in']).sum()/len(df.index))
1 loop, best of 3: 444 ms per loop

In [128]: %timeit (100 * len(df[(df['State_born_in'] == df['State_lives_in'])].index)/len(df.index))
1 loop, best of 3: 472 ms per loop

Thank you so much! By the way, how could I do this on a large dataset with about 2 million observations?

I think you can use this solution - it handles large DataFrames well. Or did you run into some problem?

I solved it :). Basically, I rewrote df = pd.read_csv("uscitizens.csv","a") as with open('uscitizens.csv', 'r') as f: so that the file is processed as a stream instead of being loaded into memory, but I followed your approach with line[2] == line[3], where line[2] is State_born_in and line[3] is State_lives_in. Thanks for your help!
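
For reference, a minimal sketch of that streaming approach using the standard csv module (the column indices 2 and 3 are taken from the comment above and depend on the real file's layout, so treat them as an assumption):

import csv

matches = 0
total = 0
with open('uscitizens.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for line in reader:
        total += 1
        # indices assumed from the comment; adjust to the actual CSV layout
        if line[2] == line[3]:
            matches += 1

percentage = 100 * matches / total
print(matches, percentage)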