Using Pandas to identify pairs of matching records for further analysis


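I run a multiple-choice survey at the start and at the end of the semester, and I want to analyse whether students' answers change significantly between the two.

Some students will answer the first survey but not the second, or vice versa, for any number of reasons. I want to drop those students from the analysis.

Note that the students do not all respond at the same time (or even on the same day). Some may respond a day before or a day after the survey is assigned, so I cannot rely on the date/time. I have to rely on matching email addresses.

The questions typically offer the choices "strongly agree or disagree, agree or disagree, or not sure".

My data file looks like this: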

Email address: text
Time: date/time
Multiple Choice Q1: [agree, disagree, neutral]
Multiple Choice Q2: [agree, disagree, neutral]

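I need to filter out the records of students who did not respond both times (at the start and at the end of the semester).

I need a way to quantify how much each answer changed.

I have had plenty of ideas, but they all boil down to some form of brute-force, old-fashioned looping and saving.

With Pandas, I imagine there is a more elegant way.

Here is a mock-up of the input: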

import pandas as pd

# Named df rather than input: input shadows a Python builtin, and the answer below uses df.
df = pd.DataFrame({'email':
                   ['joe@sample.com', 'jane@sample.com', 'jack@sample.com',
                    'joe@sample.com', 'jane@sample.com', 'jack@sample.com', 'jerk@sample.com'],
                   'date': ['jan 1 2019', 'jan 2 2019', 'jan 1 2019',
                            'july 2, 2019', 'july 1 2019', 'july 1, 2019', 'july 1, 2019'],
                   'are you happy?': ["yes", "no", "no", "yes", "yes", "yes", "no"],
                   'are you smart?': ['no', 'no', 'no', 'yes', 'yes', 'yes', 'yes']})

And here is a mock-up of the output:

output = pd.DataFrame({'question': ['are you happy?', 'are you smart?'],
                       'change score': [+0.6, +1]})

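What a great exercise, thanks for the suggestion. The logic of the change score is as follows. For "are you happy?", Joe stays the same while Jack and Jane go from "no" to "yes", so the score is (0+1+1)/3, roughly 0.67 (written loosely as 0.6 in the mock-up). For "are you smart?", all three students go from "no" to "yes", so the score is (1+1+1)/3 = 1. jerk@sample.com is not counted, because he did not respond to the start-of-semester survey, only to the end-of-semester one.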
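The arithmetic above can be sketched directly in pandas. This is a minimal sketch rather than a definitive solution: it assumes one response per student per survey round, writes the dates in ISO form so they parse unambiguously, and uses a hypothetical cutoff date to separate the two rounds. The names questions, cutoff, start, end, both and change_score are illustrative.

```python
import pandas as pd

# Mock-up data from the question, with dates rewritten in ISO form (an assumption).
df = pd.DataFrame({
    "email": ["joe@sample.com", "jane@sample.com", "jack@sample.com",
              "joe@sample.com", "jane@sample.com", "jack@sample.com", "jerk@sample.com"],
    "date": pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-01",
                            "2019-07-02", "2019-07-01", "2019-07-01", "2019-07-01"]),
    "are you happy?": ["yes", "no", "no", "yes", "yes", "yes", "no"],
    "are you smart?": ["no", "no", "no", "yes", "yes", "yes", "yes"],
})
questions = ["are you happy?", "are you smart?"]

# Hypothetical cutoff separating the start-of-semester and end-of-semester rounds.
cutoff = pd.Timestamp("2019-04-01")
start = df[df["date"] < cutoff].set_index("email")[questions]
end = df[df["date"] >= cutoff].set_index("email")[questions]

# Keep only students who appear in both rounds.
both = start.index.intersection(end.index)

# True where a student's answer differs; the column mean is the fraction who changed.
changed = start.loc[both].ne(end.loc[both])
change_score = changed.mean()
print(change_score)
```

Because the pairing is done on the email index, the exact response dates do not matter beyond deciding which round a row belongs to.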

Here are the first two lines of my data file:

Timestamp,Email Address,How I see myself [I am comfortable in a leadership position],How I see myself [I like and am effective working in a team],How I see myself [I have a feel for business],How I see myself [I have a feel for marketing],How I see myself [I hope to start a company in the future],How I see myself [I like the idea of working at a large company with a global impact],"How I see myself [Carreerwise, I think working at a startup is very risky]","How I see myself [I prefer an unstructured, improvisational job]",How I see myself [I like to know exactly what is expected of me so I can excel],How I see myself [I've heard that I can make a lot of money in a startup and that is important to me so I can support myself and my family],How I see myself [I would never work at a significant company (like Google)],How I see myself [I definitely want to work at a significant company (like Facebook)],How I see myself [I have confidence in my intuitions about creating a successful business],How I see myself [The customer is always right],How I see myself [Don't ask users what they want: they don't know what they want],How I see myself [If you create what customers are asking for you will always be behind],"How I see myself [From the very start of designing a business, it is crucial to talk to users and customers]",What is your best guess of your career 3 years after your graduation?,Class,Year of expected graduation (undergrad or grad),"How I see myself [Imagine you've been working on a new product for months, then discover a competitor with a  similar idea.  
The best response to this is to feel encouraged because this means that what you are working on is a real problem.]",How I see myself [Most startups fail],How I see myself [Row 20],"How I see myself [For an entrepreneur, Strategic skills are more important than  having a great (people) network]","How I see myself [Strategic vision is crucial to success, so that one can consider what will happen several moves ahead]",How I see myself [It's important to stay focused on your studies rather than be dabbling in side projects or businesses],How I see myself [Row 23],How I see myself [Row 22]
8/30/2017 18:53:21,s@b.edu,I agree,Strongly agree,I agree,I'm not sure,I agree,I agree,I'm not sure,I agree,I agree,I'm not sure,I disagree,I disagree,I disagree,I disagree,I disagree,Strongly disagree,I agree,working with  film production company,Sophomore,2020,,,,,,,,
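The long "How I see myself [...]" headers can be shortened before any analysis. The following is a minimal sketch, assuming the text inside the square brackets is the actual question; it uses a two-question excerpt of the header above, and short_name plus the date and email column names are illustrative choices, not part of the original file.

```python
import io
import re
import pandas as pd

# Two-question excerpt of the real header and first row, for illustration only.
raw = io.StringIO(
    'Timestamp,Email Address,How I see myself [I have a feel for business],'
    '"How I see myself [Carreerwise, I think working at a startup is very risky]"\n'
    "8/30/2017 18:53:21,s@b.edu,I agree,I'm not sure\n"
)
survey = pd.read_csv(raw)

def short_name(col):
    """Return the text inside [...] if present, otherwise the column name unchanged."""
    m = re.search(r"\[(.*)\]", col, flags=re.DOTALL)
    return m.group(1).strip() if m else col

survey = survey.rename(columns=short_name).rename(
    columns={"Timestamp": "date", "Email Address": "email"}
)
# An explicit format avoids any ambiguity between month and day.
survey["date"] = pd.to_datetime(survey["date"], format="%m/%d/%Y %H:%M:%S")
print(survey.columns.tolist())
```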

Starting from your initial dataframe.

First, we convert your dates to proper datetimes:

df['date'] = pd.to_datetime(df['date'])

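Then we create two variables: the first checks that each email address appears at least twice, and the second holds the months of the two surveys, months 1 and 7 (assuming you may have duplicate entries). .loc lets us apply boolean conditions to the dataframe.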

s = df.groupby('email')['email'].transform('count') >= 2
months = [1,7] # start & end of semester.
df2 = df.loc[(df['date'].dt.month.isin(months)) & (s)]
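One caveat with a count-based filter is that a student who submits twice in the same round also reaches a count of 2. A stricter variant, sketched below on hypothetical data, requires a response in both survey months. The emails, the column q and the names in_window, n_months and paired are made up for illustration.

```python
import pandas as pd

# Hypothetical responses: a answers both rounds, b answers twice in July, c only in January.
df = pd.DataFrame({
    "email": ["a@x.edu", "a@x.edu", "b@x.edu", "b@x.edu", "c@x.edu"],
    "date": pd.to_datetime(["2019-01-01", "2019-07-01", "2019-07-01",
                            "2019-07-02", "2019-01-03"]),
    "q": ["yes", "no", "yes", "yes", "no"],
})

# The count-based mask keeps b, who never answered the first round.
count_mask = df.groupby("email")["email"].transform("count") >= 2

# Stricter check: restrict to the two survey months, then require both to be present.
in_window = df[df["date"].dt.month.isin([1, 7])].copy()
in_window["month"] = in_window["date"].dt.month
n_months = in_window.groupby("email")["month"].transform("nunique")
paired = in_window[n_months == 2]
print(paired)
```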


Now we need to reshape the data so that some logical tests are easier to run:

df3 = (
    df2.set_index(["email", "date"])
    .stack()
    .reset_index()
    .rename(columns={0: "answer", "level_2": "question"})
    .sort_values(["email", "date"])
)

             email       date        question answer  
0  jack@sample.com 2019-01-01  are you happy?     no    
1  jack@sample.com 2019-01-01  are you smart?     no    
2  jack@sample.com 2019-07-01  are you happy?    yes    
3  jack@sample.com 2019-07-01  are you smart?    yes
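The same long layout can also be produced with melt, which names the new columns directly instead of renaming level_2 and 0 afterwards. This is a sketch on Jack's two rows only; long_df is an illustrative name. Note that melt keeps rows with missing answers as NaN, so a dropna may be needed afterwards.

```python
import pandas as pd

# Jack's two responses in the wide layout, with ISO dates for unambiguous parsing.
df2 = pd.DataFrame({
    "email": ["jack@sample.com", "jack@sample.com"],
    "date": pd.to_datetime(["2019-01-01", "2019-07-01"]),
    "are you happy?": ["no", "yes"],
    "are you smart?": ["no", "yes"],
})

# One row per (email, date, question), with the answer in its own column.
long_df = (
    df2.melt(id_vars=["email", "date"], var_name="question", value_name="answer")
    .sort_values(["email", "date", "question"])
    .reset_index(drop=True)
)
print(long_df)
```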

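Now we need to work out whether Jack's answers changed between the start and the end of the semester. If they did, we assign a score, using map and a dictionary built from the output dataframe: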


score_dict = dict(zip(output["question"], output["change score"]))

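# transform returns a mask on df3's own index, so it aligns with df3 regardless of pandas version
s2 = df3.groupby(["email", "question"])["answer"].transform(lambda x: x.ne(x.shift()))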

df3.loc[(s2) & (df3["date"].dt.month == 7), "score"] = df3["question"].map(
    score_dict
)

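Logically, we only want to apply a score to a value that has changed and that belongs to the end-of-semester month (month 7) rather than the first month.

So Joe's value for the "are you happy?" question is NaN, because he chose yes at the start of the semester and yes again at the end.

You may want to add more logic to the scoring to treat Y/N differently, and judging by the first row of your file you will need to clean up your dataframe, but something along these lines should work.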
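One way to treat yes and no differently is to score the direction of the change. The sketch below is an assumption-laden illustration, not the method above: it encodes yes as 1 and no as 0, takes last minus first per student and question, and averages per question. The students and the names value, delta and net are hypothetical.

```python
import pandas as pd

# Hypothetical long-format data: one row per (email, date, question).
df3 = pd.DataFrame({
    "email": ["joe", "joe", "jane", "jane", "kim", "kim", "lee", "lee"],
    "date": pd.to_datetime(["2019-01-01", "2019-07-01"] * 4),
    "question": ["are you happy?"] * 8,
    "answer": ["yes", "yes", "no", "yes", "yes", "no", "no", "yes"],
})

# Assumed encoding: yes = 1, no = 0.
df3["value"] = df3["answer"].map({"yes": 1, "no": 0})

# Sort by date so that first/last mean start/end of semester within each group.
g = df3.sort_values("date").groupby(["email", "question"])["value"]
delta = g.last() - g.first()   # +1: no -> yes, -1: yes -> no, 0: unchanged

# Net shift per question, averaged over paired students.
net = delta.groupby(level="question").mean()
print(net)
```

A net value near zero can hide offsetting moves in both directions, so it may be worth inspecting delta.value_counts() as well.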
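Looking at your data, this seems fairly simple: you need to 1) filter out the records that have fewer than 2 entries on the specific dates, and 2) quantify each answer. The second part is usually the domain of market researchers. What I usually do (for them) is assign a 1-5 measure to the answers, then analyse and weight accordingly.

Adding to what @DataNearior said, it would help if you could create a dummy dataframe that replicates your original df, along with the expected output and an explanation.

Great suggestion, because it forced me to confront exactly what I need. I have updated the question.

@pitosalas I have added an answer, but now I think it might be a good idea to split the answer column further. Then you can test whether someone went from yes to no or from no to yes, and rank accordingly.

Thanks! Your answer is very helpful. I will try it as soon as I can!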
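The 1-5 measure suggested in the comments could look like the sketch below. It assumes the five answer labels seen in the sample data row map onto an ordered scale; the numeric values, the students and the column Q1 are hypothetical.

```python
import pandas as pd

# Assumed ordering of the answer labels that appear in the sample data file.
likert = {
    "Strongly disagree": 1,
    "I disagree": 2,
    "I'm not sure": 3,
    "I agree": 4,
    "Strongly agree": 5,
}

# Hypothetical start and end responses for a single question.
start = pd.DataFrame({"email": ["a@x.edu", "b@x.edu", "c@x.edu"],
                      "Q1": ["I agree", "I'm not sure", "I disagree"]})
end = pd.DataFrame({"email": ["a@x.edu", "b@x.edu", "d@x.edu"],
                    "Q1": ["Strongly agree", "I disagree", "I agree"]})

# The default inner merge keeps only students who answered both surveys.
paired = start.merge(end, on="email", suffixes=("_start", "_end"))

# Signed shift per student: positive means moving toward agreement.
shift = paired["Q1_end"].map(likert) - paired["Q1_start"].map(likert)
print(shift.tolist())
print(shift.mean())
```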