消灭暴力扫描,世界属于渐进

数仓建设过程中大部分表都是增量表,当计算过去一段时间的聚合指标时,常规的实现方式会重复扫描分区,带来大量计算的浪费。本文我们将介绍一些增量计算的方式,避免重复扫描分区,提高计算效率~

现象

当我们要计算过去一段时间的指标(如过去30天的成交量)时,最直接的实现方式就是暴力扫描多个分区的天表数据:

SELECT  pk
      	-- 滑动窗口:开始和结束时间每天变化
        ,SUM(IF(dt > '${bizdate-nd}' AND dt <= '${bizdate}', m, 0)) AS m_sum_nd
        ,MAX(IF(dt > '${bizdate-nd}' AND dt <= '${bizdate}', m, 0)) AS m_max_nd
        -- 累计窗口:开始时间固定,结束时间变化
        ,SUM(IF(dt > '${startdate}' AND dt <= '${bizdate}', m, 0)) AS m_sum_td
        ,MAX(IF(dt > '${startdate}' AND dt <= '${bizdate}', m, 0)) AS m_max_td
FROM    a_1d
-- 假定startdate > bizdate-nd
WHERE   dt > '${bizdate-nd}'
AND     dt <= '${bizdate}'
GROUP BY pk;

这样就会浪费大量资源:

  • 反复扫描重复分区,产出耗时长。

  • 当在同一个作业耦合不同周期指标时,若下游需产出时效保证,会重复计算短周期指标。

本文就结合阿里云MaxCompute计算服务,来讨论几种解决方案~

技术优化

递推计算

  • 适用于指标计算存在递推公式:避免暴力扫描,效率大大提升

    • 累计窗口:SUM、MAX等都可递推计算

\sum^n_{i=0} a_i = \sum^{n-1}_{i=0} a_i + a_n
\max^n_{i=0} a_i = \max(\max^{n-1}_{i=0} a_i, a_n)
  • 滑动窗口:SUM可递推计算,但MAX不可以, 因为最大值可以处于临界窗口边界

\sum^n_{i=1} a_i = \sum^{n-1}_{i=0} a_i + a_n - a_0

  • 缺点

    • 当世界没有银弹,当改用增量计算后,依赖需设置为自依赖,补数据时需串行执行,若需求紧急,需再暴力扫描初始化,开发成本高。

JOIN实现

SELECT  COALESCE(today.pk, yst.pk)                                                   AS pk
        ,SUM(COALESCE(m_sum_nd, 0) + COALESCE(today.m_1d, 0) - COALESCE(nd.m_1d, 0)) AS m_sum_nd
        ,SUM(COALESCE(m_sum_td, 0) + COALESCE(today.m_1d, 0))                        AS m_sum_td
        ,GREATEST(m_max_td, today.m_1d)                                              AS m_max_td
FROM    (
            SELECT  pk
                    ,m_1d
            FROM    a_1d
            WHERE   dt = '${bizdate}'
        ) today
FULL OUTER JOIN (
                    SELECT  pk
                            ,m_sum_nd
                            ,m_sum_td
                            ,m_max_td
                    FROM    a_nd
                    WHERE   dt = '${bizdate-1d}'
                ) yst
ON      today.pk = yst.pk
LEFT JOIN   (
                SELECT  pk
                        ,m_1d
                FROM    a_1d
                WHERE   dt = '${bizdate-nd}'
            ) nd
ON      COALESCE(today.pk, yst.pk) = nd.pk;
  • 缺点:join算子消耗大,且需串行执行,当维度和周期过多时,效率提升不明显。

UNION实现

  • 优化:可并行,效率更高。

SELECT  pk
        ,SUM(m_sum_nd + m_1d - m_np1d) AS m_sum_nd
FROM    (
            SELECT  pk
                    ,m_sum_nd
                    ,0 AS m_1d
                    ,0 AS m_np1d
            FROM    a_nd
            WHERE   dt = '${bizdate-1d}'
            UNION ALL
            SELECT  pk
                    ,0 AS m_sum_nd
                    ,m_1d
                    ,0 AS m_np1d
            FROM    a_1d
            WHERE   dt = '${bizdate}'
            UNION ALL
            SELECT  pk
                    ,0    AS m_sum_nd
                    ,0    AS m_1d
                    ,m_1d AS m_np1d
            FROM    a_1d
            WHERE   dt = '${bizdate-nd}'
) a
GROUP BY pk;
  • 缺点:字段对齐比较麻烦,可能需要一个copliot。

prompt You are a SQL expert that knows use progressive method to calculate past periods metircs. For example, when you need calculate past 30 days metric each day, you can use recurrence formula: use yesterday 30 days metric plus today 1 day metric substract 30 days ago 1 day metric Pay attention to use variable like ${bizdate-30}, not use date functions. Use the following format: Question: "Question here" Answer: "SQL Query to run" Here an example: Question: "Suppose we have two tables: CREATE TABLE a_1d ( pk STRING ,m_1d BIGINT ,n_1d BIGINT ,dt STRING ); CREATE TABLE a_nd ( pk STRING ,m_7d BIGINT ,n_7d BIGINT ,m_30d BIGINT ,n_30d BIGINT ,dt STRING ); How to calculate passed 7, 30 days metrics using SUM aggregation progressively by a_1d and a_nd? Answer: SELECT pk ,SUM(m_7d + m_1d - m_8d) AS m_7d ,SUM(n_7d + n_1d - n_8d) AS n_7d ,SUM(m_30d + m_1d - m_31d) AS m_30d ,SUM(n_30d + n_1d - n_31d) AS n_30d FROM ( SELECT pk ,m_nd ,n_nd ,0 AS m_1d ,0 AS n_1d ,0 AS m_8d ,0 AS n_8d ,0 AS m_31d ,0 AS n_31d FROM a_nd WHERE dt = '${bizdate-1}' UNION ALL SELECT pk ,0 AS m_nd ,0 AS n_nd ,m_1d ,n_1d ,0 AS m_8d ,0 AS n_8d ,0 AS m_31d ,0 AS n_31d FROM a_1d WHERE dt = '${bizdate}' UNION ALL SELECT pk ,0 AS m_nd ,0 AS n_nd ,0 AS m_1d ,0 AS n_1d ,m_1d AS m_8d ,n_1d AS n_8d ,0 AS m_31d ,0 AS n_31d FROM a_1d WHERE dt = '${bizdate-7}' UNION ALL SELECT pk ,0 AS m_nd ,0 AS n_nd ,0 AS m_1d ,0 AS n_1d ,0 AS m_8d ,0 AS n_8d ,m_1d AS m_31d ,n_1d AS n_31d FROM a_1d WHERE dt = '${bizdate-30}' ) a GROUP BY pk;" Question: "Suppose we have two tables: {1d table DDL} and {nd table DDL} How to calculate passed {periods} days metrics using SUM aggregation progressively by {1d table name} and {nd table name} ? " Answer: "

不可递推指标

处理nd滑动窗口中计算MAX等不可递推指标时,可以用MAP、ARRAY等复杂类型数据结构存放过去时间周期的指标,然后直接对复杂类型中的指标进行聚合计算:

  1. 每天增量维护复杂数据结构:移除(n+1)d指标并添加1d的指标。

  2. 利用复杂类型函数聚合计算:过滤出相应时间范围的指标聚合。

SELECT  pk
        ,ARRAY_MAX(MAP_VALUES(MAP_FILTER(m_map, (k, v) -> k > '${bizdate-nd}'))) AS m_max_nd
FROM    (
            SELECT  COALESCE(today.pk, yst.pk)                                                             AS pk
                    -- 维护过去nd的map
                    ,MAP_CONCAT(MAP_FILTER(m_map, (k, v) -> k > '${bizdate-nd}'), MAP('${bizdate}', m_1d)) AS m_map
            FROM    (
                        SELECT  pk
                                ,m_max_nd
                                ,m_map
                        FROM    a_nd
                        WHERE   dt = '${bizdate-1d}'
            ) yst
            FULL JOIN   (
                            SELECT  pk
                                    ,m_1d
                            FROM    a_1d
                            WHERE   dt = '${bizdate}'
            ) today
            ON      yst.pk = today.pk
) mid;
  • 缺点:复杂类型使用技巧较高,当指标类型很多时,需存放过多指标,可能也会有性能问题。

渐进计算

修改代码的ROI还是太低,有没有更简单的方法?渐进计算带来了一些曙光,只需在暴力扫描的代码上添加一行参数即可:

-- 开启渐进计算
SET odps.progressive.enable=true;

SELECT  pk
      	-- 滑动窗口:开始和结束时间每天变化
        ,SUM(IF(dt > '${bizdate-nd}' AND dt <= '${bizdate}', m, 0)) AS m_sum_nd
        ,MAX(IF(dt > '${bizdate-nd}' AND dt <= '${bizdate}', m, 0)) AS m_max_nd
        -- 累计窗口:开始时间固定,结束时间变化
        ,SUM(IF(dt > '${startdate}' AND dt <= '${bizdate}', m, 0)) AS m_sum_td
        ,MAX(IF(dt > '${startdate}' AND dt <= '${bizdate}', m, 0)) AS m_max_td
FROM    a_1d
-- 假定startdate > bizdate-nd
WHERE   dt > '${bizdate-nd}'
AND     dt <= '${bizdate}'
GROUP BY pk;
  • 原理简析

    • 第一次运行时产生可复用的中间表,比如天表,月表等。

    • 再结合当天增量数据和中间表,生成最终结果。

  • 特点

    • 第一次由于需要生成中间表,消耗时间长,所以需要在周期前需提前运行一次。

    • 中间表的合并、拆分、生命周期由平台关联,用户不用关心。

  • 缺点

    • 由于是通用的计算优化,效果可能没有之前的增量效率高。

    • 高级特性使用较少,遇到问题时可能较难解决。

规范建模

  • 利用规范化研发自动生成指标,定义即研发, 计算优化统一由平台进行,不用用户关心。

总结

本文讨论了几种优化暴力扫描的技巧,但我希望这些技巧以后都不用用户操心,可以自动集成在开发平台中,开发人员只需关心业务逻辑即可~


消灭暴力扫描,世界属于渐进
https://syntomic.cn/archives/eliminate-violent-scanning
作者
syntomic
发布于
2023年12月09日
更新于
2024年08月27日
许可协议