间隔连续问题
1. 数据结构:某游戏公司记录的用户每日登录数据
表名:game_user
字段名:id(用户id)、dt(日期)
2. 需求:
① 创建表
② 计算每个用户最大的连续登录天数,可以间隔一天。如:如果一个用户在1,3,5,6登录游戏,则视为连续6天登录。
3. 数据准备:
创建文件game_user.txt
[atguigu@hadoop102 ~]$ vim /opt/module/hive/datas/game_user.txt
1001 2022-05-01 23:21:33
1003 2022-05-02 23:21:33
1002 2022-05-01 23:21:33
1003 2022-05-01 23:21:33
1001 2022-05-03 23:21:33
1003 2022-05-04 23:21:33
1002 2022-05-01 23:21:33
1001 2022-05-05 23:21:33
1001 2022-05-01 23:21:33
1002 2022-05-06 23:21:33
1001 2022-05-06 23:21:33
1001 2022-05-07 23:21:33
4. 答案:
① 创建表
hive(default)>
create table game_user(
id bigint,
dt string
)
row format delimited
fields terminated by '\t';
② 计算每个用户最大的连续登录天数,可以间隔一天
select id, datediff(max_dt, min_dt)
from
(
select id, max(dt) as max_dt, min(lag_dt) as min_dt
from
(select id, dt, lag_dt, datediff(dt, lag_dt) as date_diff, if(datediff(dt, lag_dt) > 2, 1, 0) as flag from(select id, dt, lag(dt, 1, '1970-01-01') over (partition by id order by dt asc) as lag_dtfrom (select id, date_format(dt, 'yyyy-MM-dd') as dtfrom game_user) tb1) tb2
)tb3where flag = 0
group by id
) tb4;