消灭暴力扫描,世界属于渐进
数仓建设过程中大部分表都是增量表,当计算过去一段时间的聚合指标时,常规的实现方式会重复扫描分区,带来大量计算的浪费。本文我们将介绍一些增量计算的方式,避免重复扫描分区,提高计算效率~
现象
当我们要计算过去一段时间的指标(如过去30天的成交量)时,最直接的实现方式就是暴力扫描多个分区的天表数据:
SELECT pk
-- 滑动窗口:开始和结束时间每天变化
,SUM(IF(dt > '${bizdate-nd}' AND dt <= '${bizdate}', m, 0)) AS m_sum_nd
,MAX(IF(dt > '${bizdate-nd}' AND dt <= '${bizdate}', m, 0)) AS m_max_nd
-- 累计窗口:开始时间固定,结束时间变化
,SUM(IF(dt > '${startdate}' AND dt <= '${bizdate}', m, 0)) AS m_sum_td
,MAX(IF(dt > '${startdate}' AND dt <= '${bizdate}', m, 0)) AS m_max_td
FROM a_1d
-- 假定startdate > bizdate-nd
WHERE dt > '${bizdate-nd}'
AND dt <= '${bizdate}'
GROUP BY pk;
这样就会浪费大量资源:
反复扫描重复分区,产出耗时长。
当在同一个作业耦合不同周期指标时,若下游需产出时效保证,会重复计算短周期指标。
本文就结合阿里云MaxCompute计算服务,来讨论几种解决方案~
技术优化
递推计算
适用于指标计算存在递推公式:避免暴力扫描,效率大大提升
累计窗口:SUM、MAX等都可递推计算
滑动窗口:SUM可递推计算,但MAX不可以, 因为最大值可以处于临界窗口边界
缺点
当世界没有银弹,当改用增量计算后,依赖需设置为自依赖,补数据时需串行执行,若需求紧急,需再暴力扫描初始化,开发成本高。
JOIN实现
SELECT COALESCE(today.pk, yst.pk) AS pk
,SUM(COALESCE(m_sum_nd, 0) + COALESCE(today.m_1d, 0) - COALESCE(nd.m_1d, 0)) AS m_sum_nd
,SUM(COALESCE(m_sum_td, 0) + COALESCE(today.m_1d, 0)) AS m_sum_td
,GREATEST(m_max_td, today.m_1d) AS m_max_td
FROM (
SELECT pk
,m_1d
FROM a_1d
WHERE dt = '${bizdate}'
) today
FULL OUTER JOIN (
SELECT pk
,m_sum_nd
,m_sum_td
,m_max_td
FROM a_nd
WHERE dt = '${bizdate-1d}'
) yst
ON today.pk = yst.pk
LEFT JOIN (
SELECT pk
,m_1d
FROM a_1d
WHERE dt = '${bizdate-nd}'
) nd
ON COALESCE(today.pk, yst.pk) = nd.pk;
缺点:join算子消耗大,且需串行执行,当维度和周期过多时,效率提升不明显。
UNION实现
优化:可并行,效率更高。
SELECT pk
,SUM(m_sum_nd + m_1d - m_np1d) AS m_sum_nd
FROM (
SELECT pk
,m_sum_nd
,0 AS m_1d
,0 AS m_np1d
FROM a_nd
WHERE dt = '${bizdate-1d}'
UNION ALL
SELECT pk
,0 AS m_sum_nd
,m_1d
,0 AS m_np1d
FROM a_1d
WHERE dt = '${bizdate}'
UNION ALL
SELECT pk
,0 AS m_sum_nd
,0 AS m_1d
,m_1d AS m_np1d
FROM a_1d
WHERE dt = '${bizdate-nd}'
) a
GROUP BY pk;
缺点:字段对齐比较麻烦,可能需要一个copliot。
prompt
You are a SQL expert that knows use progressive method to calculate past periods metircs. For example, when you need calculate past 30 days metric each day, you can use recurrence formula: use yesterday 30 days metric plus today 1 day metric substract 30 days ago 1 day metric Pay attention to use variable like ${bizdate-30}, not use date functions. Use the following format: Question: "Question here" Answer: "SQL Query to run" Here an example: Question: "Suppose we have two tables: CREATE TABLE a_1d ( pk STRING ,m_1d BIGINT ,n_1d BIGINT ,dt STRING ); CREATE TABLE a_nd ( pk STRING ,m_7d BIGINT ,n_7d BIGINT ,m_30d BIGINT ,n_30d BIGINT ,dt STRING ); How to calculate passed 7, 30 days metrics using SUM aggregation progressively by a_1d and a_nd? Answer: SELECT pk ,SUM(m_7d + m_1d - m_8d) AS m_7d ,SUM(n_7d + n_1d - n_8d) AS n_7d ,SUM(m_30d + m_1d - m_31d) AS m_30d ,SUM(n_30d + n_1d - n_31d) AS n_30d FROM ( SELECT pk ,m_nd ,n_nd ,0 AS m_1d ,0 AS n_1d ,0 AS m_8d ,0 AS n_8d ,0 AS m_31d ,0 AS n_31d FROM a_nd WHERE dt = '${bizdate-1}' UNION ALL SELECT pk ,0 AS m_nd ,0 AS n_nd ,m_1d ,n_1d ,0 AS m_8d ,0 AS n_8d ,0 AS m_31d ,0 AS n_31d FROM a_1d WHERE dt = '${bizdate}' UNION ALL SELECT pk ,0 AS m_nd ,0 AS n_nd ,0 AS m_1d ,0 AS n_1d ,m_1d AS m_8d ,n_1d AS n_8d ,0 AS m_31d ,0 AS n_31d FROM a_1d WHERE dt = '${bizdate-7}' UNION ALL SELECT pk ,0 AS m_nd ,0 AS n_nd ,0 AS m_1d ,0 AS n_1d ,0 AS m_8d ,0 AS n_8d ,m_1d AS m_31d ,n_1d AS n_31d FROM a_1d WHERE dt = '${bizdate-30}' ) a GROUP BY pk;" Question: "Suppose we have two tables: {1d table DDL} and {nd table DDL} How to calculate passed {periods} days metrics using SUM aggregation progressively by {1d table name} and {nd table name} ? " Answer: "不可递推指标
处理nd滑动窗口中计算MAX等不可递推指标时,可以用MAP、ARRAY等复杂类型数据结构存放过去时间周期的指标,然后直接对复杂类型中的指标进行聚合计算:
每天增量维护复杂数据结构:移除(n+1)d指标并添加1d的指标。
利用复杂类型函数聚合计算:过滤出相应时间范围的指标聚合。
SELECT pk
,ARRAY_MAX(MAP_VALUES(MAP_FILTER(m_map, (k, v) -> k > '${bizdate-nd}'))) AS m_max_nd
FROM (
SELECT COALESCE(today.pk, yst.pk) AS pk
-- 维护过去nd的map
,MAP_CONCAT(MAP_FILTER(m_map, (k, v) -> k > '${bizdate-nd}'), MAP('${bizdate}', m_1d)) AS m_map
FROM (
SELECT pk
,m_max_nd
,m_map
FROM a_nd
WHERE dt = '${bizdate-1d}'
) yst
FULL JOIN (
SELECT pk
,m_1d
FROM a_1d
WHERE dt = '${bizdate}'
) today
ON yst.pk = today.pk
) mid;
缺点:复杂类型使用技巧较高,当指标类型很多时,需存放过多指标,可能也会有性能问题。
渐进计算
修改代码的ROI还是太低,有没有更简单的方法?渐进计算带来了一些曙光,只需在暴力扫描的代码上添加一行参数即可:
-- 开启渐进计算
SET odps.progressive.enable=true;
SELECT pk
-- 滑动窗口:开始和结束时间每天变化
,SUM(IF(dt > '${bizdate-nd}' AND dt <= '${bizdate}', m, 0)) AS m_sum_nd
,MAX(IF(dt > '${bizdate-nd}' AND dt <= '${bizdate}', m, 0)) AS m_max_nd
-- 累计窗口:开始时间固定,结束时间变化
,SUM(IF(dt > '${startdate}' AND dt <= '${bizdate}', m, 0)) AS m_sum_td
,MAX(IF(dt > '${startdate}' AND dt <= '${bizdate}', m, 0)) AS m_max_td
FROM a_1d
-- 假定startdate > bizdate-nd
WHERE dt > '${bizdate-nd}'
AND dt <= '${bizdate}'
GROUP BY pk;
原理简析
第一次运行时产生可复用的中间表,比如天表,月表等。
再结合当天增量数据和中间表,生成最终结果。
特点
第一次由于需要生成中间表,消耗时间长,所以需要在周期前需提前运行一次。
中间表的合并、拆分、生命周期由平台关联,用户不用关心。
缺点
由于是通用的计算优化,效果可能没有之前的增量效率高。
高级特性使用较少,遇到问题时可能较难解决。
规范建模
利用规范化研发自动生成指标,定义即研发, 计算优化统一由平台进行,不用用户关心。
总结
本文讨论了几种优化暴力扫描的技巧,但我希望这些技巧以后都不用用户操心,可以自动集成在开发平台中,开发人员只需关心业务逻辑即可~