Using Pandas to identify pairs of matching records for further analysis


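I run a multiple-choice survey at the start and at the end of the semester, and I want to analyze whether students' answers to the questions change significantly between the two.

Some students will respond to the first survey but not the second, or vice versa, for any number of reasons. I want to drop those students from the analysis.

Note that students don't all respond at the same time (or even on the same day). Some may respond the day before or the day after the survey is assigned, so I can't rely on the date/time. I have to rely on matching email addresses.

The questions use the typical choices: "strongly agree", "agree", "disagree", "strongly disagree", or "not sure".

My data file looks like this: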

Email address: text
Time: date/time
Multiple Choice Q1: [agree, disagree, neutral]
Multiple Choice Q2: [agree, disagree, neutral]
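  • I need to filter out the records of students who did not respond twice (at both the start and the end of the semester)
  • I need to find a way to quantify how much each answer changed
  • I've had plenty of ideas, but they are all some form of brute-force, old-fashioned looping and saving

Using pandas, I imagine there is a more elegant way.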


Here is a mock-up of the input:

    input = pd.DataFrame({'email': 
                       ['joe@sample.com', 'jane@sample.com', 'jack@sample.com', 
                        'joe@sample.com', 'jane@sample.com', 'jack@sample.com', 'jerk@sample.com'],
                      'date': ['jan 1 2019', 'jan 2 2019', 'jan 1 2019',
                               'july 2, 2019', 'july 1 2019', 'july 1, 2019', 'july 1, 2019'],
                      'are you happy?': ["yes", "no", "no", "yes", "yes", "yes", "no"],
                      'are you smart?': ['no', 'no', 'no', 'yes', 'yes' , 'yes', 'yes']})
    
Here is a mock-up of the output:

    output = pd.DataFrame({'question': ['are you happy?', 'are you smart?'],
                           'change score': [+0.6, +1]})
    
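What a great exercise, thanks for the suggestion. The logic of the change score: for "are you happy?", joe stayed the same while jack and jane went from "no" to "yes", so (0+1+1)/3 ≈ 0.67 (written as +0.6 in the mock-up). For "are you smart?", all three students went from "no" to "yes", so (1+1+1)/3 = 1. jerk@sample.com doesn't count, because he only responded to the ending survey and not to the starting one.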
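The arithmetic described above can be reproduced with a short sketch. It assumes the mock-up input, encodes "no" as 0 and "yes" as 1 (so a yes-to-no change would count as -1, which the question does not specify), and treats each student's earliest and latest rows as the start and end responses.

```python
import pandas as pd

df = pd.DataFrame({
    "email": ["joe@sample.com", "jane@sample.com", "jack@sample.com",
              "joe@sample.com", "jane@sample.com", "jack@sample.com", "jerk@sample.com"],
    "date": ["jan 1 2019", "jan 2 2019", "jan 1 2019",
             "july 2, 2019", "july 1 2019", "july 1, 2019", "july 1, 2019"],
    "are you happy?": ["yes", "no", "no", "yes", "yes", "yes", "no"],
    "are you smart?": ["no", "no", "no", "yes", "yes", "yes", "yes"],
})
questions = ["are you happy?", "are you smart?"]

# Parse each date string on its own, because the mock-up mixes formats.
df["date"] = df["date"].map(pd.to_datetime)

# Keep only students with exactly two responses, ordered start -> end.
n_responses = df.groupby("email")["email"].transform("count")
paired = df[n_responses == 2].sort_values(["email", "date"])

first = paired.groupby("email")[questions].first()
last = paired.groupby("email")[questions].last()

# Assumed coding: no = 0, yes = 1, so the difference is -1, 0 or +1.
coding = {"no": 0, "yes": 1}
change = (last.apply(lambda col: col.map(coding))
          - first.apply(lambda col: col.map(coding)))

# Average change per question across the paired students.
change_score = change.mean()
print(change_score)
```

Here `change` has one row per paired student, so jerk@sample.com is absent and the per-question means come out as 2/3 and 1.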


Here are the first two lines of my data file:

    Timestamp,Email Address,How I see myself [I am comfortable in a leadership position],How I see myself [I like and am effective working in a team],How I see myself [I have a feel for business],How I see myself [I have a feel for marketing],How I see myself [I hope to start a company in the future],How I see myself [I like the idea of working at a large company with a global impact],"How I see myself [Carreerwise, I think working at a startup is very risky]","How I see myself [I prefer an unstructured, improvisational job]",How I see myself [I like to know exactly what is expected of me so I can excel],How I see myself [I've heard that I can make a lot of money in a startup and that is important to me so I can support myself and my family],How I see myself [I would never work at a significant company (like Google)],How I see myself [I definitely want to work at a significant company (like Facebook)],How I see myself [I have confidence in my intuitions about creating a successful business],How I see myself [The customer is always right],How I see myself [Don't ask users what they want: they don't know what they want],How I see myself [If you create what customers are asking for you will always be behind],"How I see myself [From the very start of designing a business, it is crucial to talk to users and customers]",What is your best guess of your career 3 years after your graduation?,Class,Year of expected graduation (undergrad or grad),"How I see myself [Imagine you've been working on a new product for months, then discover a competitor with a  similar idea.  
The best response to this is to feel encouraged because this means that what you are working on is a real problem.]",How I see myself [Most startups fail],How I see myself [Row 20],"How I see myself [For an entrepreneur, Strategic skills are more important than  having a great (people) network]","How I see myself [Strategic vision is crucial to success, so that one can consider what will happen several moves ahead]",How I see myself [It's important to stay focused on your studies rather than be dabbling in side projects or businesses],How I see myself [Row 23],How I see myself [Row 22]
    8/30/2017 18:53:21,s@b.edu,I agree,Strongly agree,I agree,I'm not sure,I agree,I agree,I'm not sure,I agree,I agree,I'm not sure,I disagree,I disagree,I disagree,I disagree,I disagree,Strongly disagree,I agree,working with  film production company,Sophomore,2020,,,,,,,,
    

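Starting from your initial dataframe (called df below):

First, we convert your dates to proper datetimes.

    # On pandas 2.0+, mixed date formats like the mock-up's may need format="mixed".
    df['date'] = pd.to_datetime(df['date'])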
    
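Then we create two variables: the first flags students who have at least two entries, and the second holds months 1 and 7 (assuming you may have duplicate entries). .loc lets us apply boolean conditions to the dataframe.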

    s = df.groupby('email')['email'].transform('count') >= 2
    months = [1,7] # start & end of semester.
    df2 = df.loc[(df['date'].dt.month.isin(months)) & (s)]
    

Now we need to reshape the data so that it is easier to run some logical tests.

    df3 = (
        df2.set_index(["email", "date"])
        .stack()
        .reset_index()
        .rename(columns={0: "answer", "level_2": "question"})
        .sort_values(["email", "date"])
    )
    
                 email       date        question answer  
    0  jack@sample.com 2019-01-01  are you happy?     no    
    1  jack@sample.com 2019-01-01  are you smart?     no    
    2  jack@sample.com 2019-07-01  are you happy?    yes    
    3  jack@sample.com 2019-07-01  are you smart?    yes    
    
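Now we need to work out whether Jack's answers changed between the start and the end of the semester and, if so, assign a score. We'll make use of map and create a dictionary from your output dataframe.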

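    # Map each question to the change score from the expected-output frame.
    score_dict = dict(zip(output["question"], output["change score"]))

    # True where an answer differs from the same student's previous answer to
    # that question. Comparing against a grouped shift() keeps df3's index,
    # whereas groupby(...).apply(...) can return a differently indexed result
    # on newer pandas versions.
    s2 = df3["answer"].ne(df3.groupby(["email", "question"])["answer"].shift())

    df3.loc[(s2) & (df3["date"].dt.month == 7), "score"] = df3["question"].map(
        score_dict
    )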
    

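Logically, we only want to apply a score to a value that has changed and that is not in the first month, which is why the assignment is restricted to month 7.

So Joe's "are you happy?" question has a score of NaN, because he chose yes at the start of the semester and yes at the end.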
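Why the month condition matters can be seen on a tiny example: the first answer in each group has no predecessor, so comparing it with the shifted value (NaN) also reports a "change". The frame and names below are made up for illustration.

```python
import pandas as pd

answers = pd.DataFrame({
    "email": ["joe", "joe", "jack", "jack"],
    "answer": ["yes", "yes", "no", "yes"],
})

# Previous answer from the same student; NaN for each student's first row.
previous = answers.groupby("email")["answer"].shift()

# A value compared with NaN counts as "not equal", so first rows are flagged.
changed = answers["answer"].ne(previous)

# Requiring a previous answer leaves only genuine start -> end changes.
real_change = changed & previous.notna()

print(pd.DataFrame({"changed": changed, "real_change": real_change}))
```

Filtering on `previous.notna()` plays the same role as the month == 7 condition, without depending on calendar months.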


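You may want to add more logic to the scoring to treat yes/no changes differently, and judging by the first row of your file you will need to clean up your dataframe, but something along these lines should work.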
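The month filter assumes responses fall in January and July, while the sample row in the question is dated 8/30/2017. One possible alternative, sketched here under the assumption that each student's earliest row is the start survey and the latest row is the end survey, pairs responses by order instead of by calendar month. The column names Timestamp and Email Address come from the question's header; the helper name pair_responses, the shortened question label, and every row except the first email address are hypothetical.

```python
import pandas as pd

def pair_responses(df, email_col="Email Address", time_col="Timestamp"):
    """Return one row per student, with start and end answers side by side."""
    df = df.copy()
    # Normalise emails so "S@B.edu " and "s@b.edu" are treated as the same student.
    df[email_col] = df[email_col].str.strip().str.lower()
    df[time_col] = pd.to_datetime(df[time_col])
    df = df.sort_values([email_col, time_col])

    # Drop students with fewer than two responses.
    counts = df.groupby(email_col)[email_col].transform("count")
    df = df[counts >= 2]

    # Earliest row = start survey, latest row = end survey (an assumption).
    start = df.drop_duplicates(email_col, keep="first").set_index(email_col)
    end = df.drop_duplicates(email_col, keep="last").set_index(email_col)
    return start.join(end, lsuffix=" (start)", rsuffix=" (end)")

raw = pd.DataFrame({
    "Timestamp": ["8/30/2017 18:53:21", "8/31/2017 09:10:00",
                  "12/11/2017 14:02:45", "12/12/2017 08:30:00"],
    "Email Address": ["s@b.edu", "t@b.edu", "S@B.edu ", "u@b.edu"],
    "I have a feel for business": ["I agree", "I disagree", "Strongly agree", "I agree"],
})

paired = pair_responses(raw)
print(paired)
```

Only s@b.edu responded twice in this toy data, so the result has a single row with "(start)" and "(end)" columns for each original column.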

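Looking at your data, this seems fairly simple: you need to 1) filter out the records that number <2 for the given dates and 2) quantify each response. The second part is usually the domain of market researchers; what I usually do (for them) is assign a 1-5 measure to the answers, then analyze and weight accordingly.

Adding to what @DataNearior said: it would help if you could create a dummy dataframe that replicates your original df, along with the expected output and an explanation.

Great suggestion, as it forced me to pin down exactly what I need. I've updated the question.

@pitosalas I've added an answer, but I now think it might be a good idea to split the answer column further; then you can test whether someone went from yes to no or from no to yes, and rank accordingly.

Thanks! Your answer is very useful. I'll try it as soon as I can!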
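The 1-5 measure and the direction-of-change idea from the comments could look like the sketch below. The numeric coding is an assumption (the labels are taken from the sample row in the question), and the two small frames are invented stand-ins for already paired start and end answers.

```python
import pandas as pd

# Assumed coding of the survey labels onto a 1-5 scale.
LIKERT = {
    "Strongly disagree": 1,
    "I disagree": 2,
    "I'm not sure": 3,
    "I agree": 4,
    "Strongly agree": 5,
}

students = ["a@b.edu", "c@b.edu", "d@b.edu"]
start = pd.DataFrame({
    "I have a feel for business": ["I agree", "I'm not sure", "I disagree"],
    "I have a feel for marketing": ["I'm not sure", "I agree", "Strongly agree"],
}, index=students)
end = pd.DataFrame({
    "I have a feel for business": ["Strongly agree", "I agree", "I disagree"],
    "I have a feel for marketing": ["I agree", "I disagree", "Strongly agree"],
}, index=students)

def to_scores(frame):
    # Labels missing from LIKERT become NaN and are skipped by mean().
    return frame.apply(lambda col: col.map(LIKERT))

# Positive = moved toward agreement, negative = moved toward disagreement.
shift = to_scores(end) - to_scores(start)

mean_shift = shift.mean()        # average movement per question
moved_up = shift.gt(0).sum()     # students who moved toward agreement
moved_down = shift.lt(0).sum()   # students who moved toward disagreement

print(pd.DataFrame({"mean_shift": mean_shift, "up": moved_up, "down": moved_down}))
```

Treating the scale as evenly spaced numbers is a simplification; whether a mean difference is the right summary depends on the analysis you have in mind.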