Using Pandas to identify pairs of matching records for further analysis
I run a multiple-choice survey at the start and at the end of the semester, and I want to analyze whether students' answers to the questions change significantly between the two.

Some students will answer the first survey but not the second, or the other way around, for many reasons. I want to drop those students from the analysis.

Note that the students do not all answer at the same time (or even on the same day). Some may respond a day before or a day after the survey is assigned, so I cannot rely on the date/time. I have to rely on matching email addresses.

The questions typically have answers like "strongly agree or disagree, agree or disagree, or not sure".

My data file looks like this:
Email address: text
Time: date/time
Multiple Choice Q1: [agree, disagree, neutral]
Multiple Choice Q2: [agree, disagree, neutral]
Here is a mock-up of the input:
import pandas as pd

# Note: the answer below refers to this frame as df
input = pd.DataFrame({'email':
                      ['joe@sample.com', 'jane@sample.com', 'jack@sample.com',
                       'joe@sample.com', 'jane@sample.com', 'jack@sample.com', 'jerk@sample.com'],
                      'date': ['jan 1 2019', 'jan 2 2019', 'jan 1 2019',
                               'july 2, 2019', 'july 1 2019', 'july 1, 2019', 'july 1, 2019'],
                      'are you happy?': ["yes", "no", "no", "yes", "yes", "yes", "no"],
                      'are you smart?': ['no', 'no', 'no', 'yes', 'yes', 'yes', 'yes']})
And here is a mock-up of the output:
output = pd.DataFrame({'question': ['are you happy?', 'are you smart?'],
'change score': [+0.6, +1]})
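What a great exercise, thanks for the suggestion.
The logic of the change score: for "are you happy?", Joe stayed the same while Jack and Jane went from "no" to "yes", so (0+1+1)/3, which is about 0.67 (shown as +0.6 in the mock-up above). For "are you smart?", all three students went from "no" to "yes", so (1+1+1)/3 = 1. jerk@sample.com is not counted, because he did not respond to the start-of-semester survey, only to the end-of-semester one.
Here are the first two lines of my data file: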
Timestamp,Email Address,How I see myself [I am comfortable in a leadership position],How I see myself [I like and am effective working in a team],How I see myself [I have a feel for business],How I see myself [I have a feel for marketing],How I see myself [I hope to start a company in the future],How I see myself [I like the idea of working at a large company with a global impact],"How I see myself [Carreerwise, I think working at a startup is very risky]","How I see myself [I prefer an unstructured, improvisational job]",How I see myself [I like to know exactly what is expected of me so I can excel],How I see myself [I've heard that I can make a lot of money in a startup and that is important to me so I can support myself and my family],How I see myself [I would never work at a significant company (like Google)],How I see myself [I definitely want to work at a significant company (like Facebook)],How I see myself [I have confidence in my intuitions about creating a successful business],How I see myself [The customer is always right],How I see myself [Don't ask users what they want: they don't know what they want],How I see myself [If you create what customers are asking for you will always be behind],"How I see myself [From the very start of designing a business, it is crucial to talk to users and customers]",What is your best guess of your career 3 years after your graduation?,Class,Year of expected graduation (undergrad or grad),"How I see myself [Imagine you've been working on a new product for months, then discover a competitor with a similar idea. 
The best response to this is to feel encouraged because this means that what you are working on is a real problem.]",How I see myself [Most startups fail],How I see myself [Row 20],"How I see myself [For an entrepreneur, Strategic skills are more important than having a great (people) network]","How I see myself [Strategic vision is crucial to success, so that one can consider what will happen several moves ahead]",How I see myself [It's important to stay focused on your studies rather than be dabbling in side projects or businesses],How I see myself [Row 23],How I see myself [Row 22]
8/30/2017 18:53:21,s@b.edu,I agree,Strongly agree,I agree,I'm not sure,I agree,I agree,I'm not sure,I agree,I agree,I'm not sure,I disagree,I disagree,I disagree,I disagree,I disagree,Strongly disagree,I agree,working with film production company,Sophomore,2020,,,,,,,,
Starting from your initial dataframe: first, we convert your dates to proper datetimes
df['date'] = pd.to_datetime(df['date'])
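Then we create two variables. The first is a boolean mask that checks each email appears at least twice; the second holds months 1 and 7, the start and end of the semester (on the assumption that you might have duplicate entries). .loc lets us apply boolean conditions to the dataframe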
s = df.groupby('email')['email'].transform('count') >= 2
months = [1,7] # start & end of semester.
df2 = df.loc[(df['date'].dt.month.isin(months)) & (s)]
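To see what this filter keeps, here is a minimal sketch run against the mock input from the question. The names has_pair, in_survey_month and paired are illustrative only, and the dates are written in ISO form (an assumption made here) so that parsing does not depend on how pandas handles mixed date formats.

```python
import pandas as pd

# Mock input from the question, with dates rewritten in ISO form (assumption).
df = pd.DataFrame({
    "email": ["joe@sample.com", "jane@sample.com", "jack@sample.com",
              "joe@sample.com", "jane@sample.com", "jack@sample.com",
              "jerk@sample.com"],
    "date": pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-01",
                            "2019-07-02", "2019-07-01", "2019-07-01",
                            "2019-07-01"]),
    "are you happy?": ["yes", "no", "no", "yes", "yes", "yes", "no"],
    "are you smart?": ["no", "no", "no", "yes", "yes", "yes", "yes"],
})

# True for every row whose email occurs at least twice in the frame.
has_pair = df.groupby("email")["email"].transform("count") >= 2

# True for rows dated in the start (January) or end (July) month.
in_survey_month = df["date"].dt.month.isin([1, 7])

# Keep only rows that satisfy both conditions.
paired = df.loc[in_survey_month & has_pair]
kept_emails = sorted(paired["email"].unique())
print(kept_emails)
```

Note that the count mask only checks how many rows an email has, not that one row falls in each month, so a student who answered the same survey twice would still pass.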
Now we need to reshape the data so that we can run some logical tests more easily
df3 = (
df2.set_index(["email", "date"])
.stack()
.reset_index()
.rename(columns={0: "answer", "level_2": "question"})
.sort_values(["email", "date"])
)
email date question answer
0 jack@sample.com 2019-01-01 are you happy? no
1 jack@sample.com 2019-01-01 are you smart? no
2 jack@sample.com 2019-07-01 are you happy? yes
3 jack@sample.com 2019-07-01 are you smart? yes
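Now we need to work out whether Jack's answers changed between the start and the end of the semester. If they did, we assign a score, using map with a dictionary built from the output dataframe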
score_dict = dict(zip(output["question"], output["change score"]))
# transform (rather than apply) keeps df3's own index, so the mask lines up with df3
s2 = df3.groupby(["email", "question"])["answer"].transform(lambda x: x.ne(x.shift()))
df3.loc[s2 & (df3["date"].dt.month == 7), "score"] = df3["question"].map(score_dict)
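Logically, we only want to apply a score to values that changed and that are not from the start-of-semester survey. So Joe's value for "are you happy?" is NaN, because he chose yes at the start of the semester and yes again at the end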
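As an alternative to mapping pre-computed scores, the change score described in the question can be computed directly from the paired answers. The following is only a sketch under stated assumptions: "no" is coded as 0 and "yes" as 1, the survey wave is inferred from the month (1 = start, 7 = end), each student has at most one row per wave, and the names wave, coded, common and change are illustrative.

```python
import pandas as pd

# Mock input from the question, with dates rewritten in ISO form (assumption).
df = pd.DataFrame({
    "email": ["joe@sample.com", "jane@sample.com", "jack@sample.com",
              "joe@sample.com", "jane@sample.com", "jack@sample.com",
              "jerk@sample.com"],
    "date": pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-01",
                            "2019-07-02", "2019-07-01", "2019-07-01",
                            "2019-07-01"]),
    "are you happy?": ["yes", "no", "no", "yes", "yes", "yes", "no"],
    "are you smart?": ["no", "no", "no", "yes", "yes", "yes", "yes"],
})
questions = ["are you happy?", "are you smart?"]

# Assumption: month 1 is the start-of-semester wave, month 7 the end.
df["wave"] = df["date"].dt.month.map({1: "start", 7: "end"})

# Assumption: code answers numerically, no = 0 and yes = 1.
coded = df[["email", "wave"]].join(
    df[questions].apply(lambda col: col.map({"no": 0, "yes": 1}))
)

start = coded[coded["wave"] == "start"].set_index("email")[questions]
end = coded[coded["wave"] == "end"].set_index("email")[questions]

# Only students who answered both surveys take part in the comparison.
common = start.index.intersection(end.index)

# Mean per-question change (end minus start) over the paired students.
change = (end.loc[common] - start.loc[common]).mean()
output = change.rename("change score").rename_axis("question").reset_index()
print(output)
```

On the mock data this gives 2/3 for "are you happy?" and 1.0 for "are you smart?", and jerk@sample.com drops out because he has no start-of-semester row.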
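You may want to add more logic to the scoring so that Y/N changes are treated differently, and you will need to clean up your dataframe, starting with the first row, but something along these lines should work.

Looking at your data, it seems pretty straightforward: you need to 1) filter out records that have <2 on a particular date, and 2) quantify each response. The second is usually the domain of market researchers. What I usually do (for them) is assign a 1-5 measure to the answers, then analyze and weight accordingly.

Adding to what @DataNearior said: if you could create a dummy dataframe that replicates your original df and the expected output, with an explanation, that would be helpful.

Great suggestion, since it forced me to face specifically what I need. I have updated the question.

@pitosalas I've added an answer, but I now think it might be a good idea to split the answer column further. Then you can test whether someone went from yes to no or from no to yes, and rank accordingly.

Thanks! Your answer is very helpful. I'll try it as soon as I can!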
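The two suggestions in the comments (a 1-5 measure for the answers, and testing the direction of a change) can be sketched together. The answer labels are taken from the sample row in the question; the 1-5 coding, the three email addresses and their answers are hypothetical and only there to illustrate the idea.

```python
import pandas as pd

# Assumption: a 1-5 coding of the answer labels seen in the sample data row.
LIKERT = {
    "Strongly disagree": 1,
    "I disagree": 2,
    "I'm not sure": 3,
    "I agree": 4,
    "Strongly agree": 5,
}

# Hypothetical paired answers to one question, indexed by email.
start = pd.Series({"a@x.edu": "I agree",
                   "b@x.edu": "I'm not sure",
                   "c@x.edu": "Strongly agree"})
end = pd.Series({"a@x.edu": "Strongly agree",
                 "b@x.edu": "I disagree",
                 "c@x.edu": "Strongly agree"})

# Signed change per student: positive means more agreement at the end.
delta = end.map(LIKERT) - start.map(LIKERT)

# Direction of the change, analogous to testing no -> yes versus yes -> no.
direction = delta.map(lambda d: "up" if d > 0 else "down" if d < 0 else "same")
print(pd.DataFrame({"delta": delta, "direction": direction}))
```

Keeping the signed delta rather than a plain changed/unchanged flag means that moves toward and away from agreement no longer cancel out unnoticed, and the per-student deltas can then be passed to whatever paired test suits the analysis.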