Python: Finding unique values in a dataset

python sql

I have a dataset that looks like this:

Visitor ID    Page Id      TimeStamp
1             a            x1
2             b            x2
3             c            x3 
2             d            x4

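These are the rules for the data:

1. Treat this as web data from visitors who come to a website and interact with it. VID is the visitor's unique ID, Page Id is the ID of the page they visited, and TimeStamp is the time of the visit.

2. If a page is refreshed, the timestamp changes, so a new row is created in the dataset with the same VID and Page Id but a different TimeStamp.

3. If the visitor clicks through to another page, both the timestamp and the Page Id change. Say they are on page 'a' first and then go to page 'b': they get another record in the dataset with the same VID, but now with Page Id = b and a new timestamp.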

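Question:

I want to find all unique VIDs that visited page 'b' after visiting page 'a'. Note that I want this within a specific session or day.

Can someone help with both a SQL and a Pythonic way of doing this?

Thanks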

The SQL approach would be:

SELECT DISTINCT t1.vid
FROM my_table AS t1
INNER JOIN my_table AS t2 ON t1.vid = t2.vid
WHERE t1.page_id = 'a' AND t2.page_id = 'b' AND t1.time < t2.time;

Just to get you or others started on the Python part:

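If you can, put the data into a NumPy structured array, with one field each for the visitor ID (`vid`), the page ID (`pid`) and the timestamp (`time`).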
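A minimal sketch of building such an array from the sample rows in the question. The field names `vid`, `pid` and `time` are taken from the loop later in this answer; the dtypes and the numeric timestamps are assumptions made for illustration, since the question only shows placeholders x1 to x4.

```python
import numpy as np

# Sample rows as (visitor id, page id, timestamp).
# The numeric timestamps are made up for this sketch.
rows = [
    (1, "a", 100),
    (2, "b", 150),
    (3, "c", 210),
    (2, "d", 260),
]

# Assumed dtypes: integer ids, short unicode page ids, integer timestamps.
record_dtype = [("vid", "i8"), ("pid", "U8"), ("time", "i8")]
records = np.array(rows, dtype=record_dtype)
```

Indexing by field name, such as `records['vid']`, then returns that column as an ordinary array.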

The `time` field can be any comparable int, float, str or Python object; in fact, 'x1', 'x2' and so on would work too. You can then do things like:

records_of_interest = records[records['time'] > 200]

Then I would loop over the visitor IDs and check whether their records match your criteria:

import numpy as np

target_vids = []
for vid in np.unique(records['vid']):
    # this visitor's records, sorted chronologically
    visits = np.sort(records[records['vid'] == vid], order='time')
    # positions of the visits to 'a' and 'b' within that ordering
    pos_a = np.where(visits['pid'] == 'a')[0]
    pos_b = np.where(visits['pid'] == 'b')[0]
    # skip visitors who did not visit both pages
    # (calling .min() or .max() on an empty array would raise)
    if pos_a.size == 0 or pos_b.size == 0:
        continue
    # keep the visitor if some visit to 'b' came after a visit to 'a'
    if pos_a.min() < pos_b.max():
        target_vids.append(vid)

`target_vids` now contains the desired visitor IDs.
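For comparison, the same check can be sketched without NumPy. The helper name `vids_b_after_a`, its parameters and the `(vid, pid, time)` tuple layout are illustrative assumptions, not part of the original answer. The optional `start` and `end` bounds are one possible way to restrict the search to a single day or session, as the question asks.

```python
def vids_b_after_a(rows, first="a", then="b", start=None, end=None):
    """Return the sorted visitor ids that saw page `then` after page `first`.

    rows  -- iterable of (vid, pid, time) tuples with comparable times
    start -- optional inclusive lower bound on time
    end   -- optional exclusive upper bound on time
    """
    earliest_first = {}  # vid -> earliest time `first` was visited
    latest_then = {}     # vid -> latest time `then` was visited
    for vid, pid, time in rows:
        # ignore rows outside the requested window
        if start is not None and time < start:
            continue
        if end is not None and time >= end:
            continue
        if pid == first:
            if vid not in earliest_first or time < earliest_first[vid]:
                earliest_first[vid] = time
        elif pid == then:
            if vid not in latest_then or time > latest_then[vid]:
                latest_then[vid] = time
    return sorted(
        vid
        for vid, t_first in earliest_first.items()
        if vid in latest_then and t_first < latest_then[vid]
    )


# Hypothetical sample: visitor 1 goes a -> b, visitor 2 goes b -> a.
sample = [(1, "a", 10), (1, "b", 20), (2, "b", 5), (2, "a", 15), (3, "a", 30)]
matches = vids_b_after_a(sample)
```

This makes a single pass over the rows and keeps only two timestamps per visitor, so the input does not need to be sorted beforehand.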

Also, there are Python interfaces to SQL, which might reduce your problem to a single language...
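As one illustration of that idea, the self-join from the SQL answer can be run from Python through the standard-library `sqlite3` module. The table name `my_table` and the columns `vid`, `page_id` and `time` follow that query; the in-memory database and the sample rows are assumptions made for this sketch.

```python
import sqlite3

# Hypothetical sample rows: (vid, page_id, time).
rows = [(1, "a", 10), (1, "b", 20), (2, "b", 5), (2, "a", 15), (3, "a", 30)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (vid INTEGER, page_id TEXT, time INTEGER)")
conn.executemany("INSERT INTO my_table VALUES (?, ?, ?)", rows)

# Self-join: pair each visit to the first page with any later visit
# to the second page by the same visitor.
query = """
    SELECT DISTINCT t1.vid
    FROM my_table AS t1
    INNER JOIN my_table AS t2 ON t1.vid = t2.vid
    WHERE t1.page_id = ? AND t2.page_id = ? AND t1.time < t2.time
    ORDER BY t1.vid
"""
vids = [row[0] for row in conn.execute(query, ("a", "b"))]
conn.close()
```

Passing the page IDs as parameters rather than formatting them into the query string keeps the same statement reusable for other page pairs.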

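I don't think this is quite what we want. This selects all VIDs that have both pages a and b. What I need are the VIDs that visited page b after visiting page a.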
