"Emulating" MATCH_RECOGNIZE in PostgreSQL

Gespeichert von bluthg am Do., 11.06.2020 - 12:58

Things that happen on a bank holiday...

So, this morning I was made aware of this blog post covering a rather new (completely new to me, TBTH) SQL:2016 feature, MATCH_RECOGNIZE, as something that could be added to PostgreSQL.

The example challenge in the blog post is to find "cases" peaks in COVID-19 data.

I claimed that the main task here would be to add the grammar to PostgreSQL, as the task at hand can be done with normal WINDOW functions already.

However, Vik Fearing stated "You cannot do MATCH_RECOGNIZE with window functions" (spoiler: you can, at least for simple cases like this).

Challenge accepted!

I just couldn't resist ;-)

So, I downloaded the CSV as "covid.csv", created a little database, created a "rawdata" table and filled it:

# create database covid;
# \c covid
# create table raw(dateRep date,day int,month int,year int,cases int,deaths int,countriesAndTerritories text,geoId text ,countryterritoryCode text,popData2018 int,continentExp text);

# \copy raw FROM 'Downloads/covid.csv' WITH (FORMAT CSV, HEADER);

And after some fiddling (the first attempt had 3 CTEs and 37 lines of code), I came up with this SQL (which yields the exact same result):

WITH peaks AS (
SELECT countriesandterritories country
  , daterep AS day
  , cases
  , sign(cases - LAG(cases) OVER w) + sign(LEAD(cases) OVER w - cases) = 0
        AND LEAD(cases) OVER w < cases
    AS is_peak
  FROM raw
  WHERE countriesandterritories = 'Switzerland'
  WINDOW w AS (PARTITION BY countriesandterritories ORDER BY daterep)
  ORDER BY daterep
)
SELECT Country, day, cases,
        row_number() OVER () AS "match#"
FROM peaks
WHERE is_peak;

The key line;

  , sign(cases - LAG(cases) OVER w) + sign(LEAD(cases) OVER w - cases) = 0
        AND LEAD(cases) OVER w < cases
    AS is_peak

probably needs some explanation.

w is defined as WINDOW w AS (PARTITION BY countriesandterritories ORDER BY daterep). So, LEAD(cases) and LAG(cases) represent the "cases" column in the record before ("yesterday") and after ("tomorrow") the one we're currently looking at ("today") for our countriesandterritories entry ('Switzerland').

So, cases - LAG(cases) OVER w is a positive number if today's cases are higher than yesterday's and a negative one when the numbers went down.

Accordingly, LEAD(cases) OVER w - cases will be positive if tomorrow's number is higher and vice versa.

Since we're only looking at trends, we can simplify that by just keeping the sign() result (which is +1, 0 or -1). As it happens, at any peak (high or low), the sum of those two sign() results is 0 (1 + -1 or -1 + 1).

Now, the sum would also be 0 on plateaus (0 + 0), and we are looking for high peaks only! Which is why we also check that after "today", the case numbers are going down, so we add AND LEAD(cases) OVER w < cases as a second condition.

The result

covid=# \e
   country   |    day     | cases | match# 
-------------+------------+-------+--------
 Switzerland | 2020-02-26 |     1 |      1
 Switzerland | 2020-02-28 |     7 |      2
 Switzerland | 2020-03-07 |   122 |      3
 Switzerland | 2020-03-09 |    68 |      4
 Switzerland | 2020-03-14 |   267 |      5
 Switzerland | 2020-03-16 |   841 |      6
 Switzerland | 2020-03-18 |   450 |      7
 Switzerland | 2020-03-22 |  1237 |      8
 Switzerland | 2020-03-24 |  1044 |      9
 Switzerland | 2020-03-28 |  1390 |     10
 Switzerland | 2020-03-31 |  1138 |     11
 Switzerland | 2020-04-03 |  1124 |     12
 Switzerland | 2020-04-08 |   590 |     13
 Switzerland | 2020-04-10 |   785 |     14
 Switzerland | 2020-04-16 |   583 |     15
 Switzerland | 2020-04-18 |   346 |     16
 Switzerland | 2020-04-20 |   336 |     17
 Switzerland | 2020-04-24 |   228 |     18
 Switzerland | 2020-04-26 |   216 |     19
 Switzerland | 2020-05-01 |   179 |     20
 Switzerland | 2020-05-09 |    81 |     21
 Switzerland | 2020-05-11 |    54 |     22
 Switzerland | 2020-05-17 |    58 |     23
 Switzerland | 2020-05-21 |    40 |     24
 Switzerland | 2020-05-24 |    18 |     25
 Switzerland | 2020-05-27 |    15 |     26
 Switzerland | 2020-05-29 |    35 |     27
 Switzerland | 2020-06-06 |    23 |     28
(28 Zeilen)

covid=# 

So, is MATCH_RECOGNIZE void?!?

No, certainly not. It's just a pretty bad example, to be honest. My SQL is 1 LOC more, but I find it much more comprehensible (the MATCH_RECOGNIZE syntax is not exactly nice, is it?).

Many every-day MATCH_RECOGNIZE scenarios can be built with WINDOW functions already. And PostgreSQL's rich choice of functions help keeping the code readable.

German speaking readers may want to have a look at Markus Winand's nice talk here, which shows some much better examples: https://modern-sql.com/de/feature/match_recognize.

--

Neuen Kommentar hinzufügen