An Effective Aggregation Methodology

Michael Kamfonas
Copyright © InfoKarta inc

 The Double-Dipping Query Pattern

The classic form of the fact-dimension join works for queries written against the base fact. In order to redirect queries to pre-joined MQTs, the columns involved in the query must be embedded in the MQT. This becomes impractical with names and long descriptions for several levels per dimension since if embedded in the MQTs they would make their rows extremely long. A more appropriate query pattern is to make a double dip to each dimension. First, using a core set of identifier fields that are carried in the MQTs to filter and group the data, and second to pick-up descriptions and names, by rejoining back to the dimension after the grouping. This pattern can be either explicitly introduced by the query designer or implicitly induced by the rewrite step of the optimizer cued by informational functional dependency constraints. This article explores this “double-dipping” query pattern in the context of realistic cube design, and discuses the pros and cons of designing it explicitly or relying on the optimizer to invoke it implicitly. 

Overview

Q1:

Select P.CLS_NM, P.CLS_DSC, L.STR_NM, L.ZIP, C.YR, C.WK, S.SUM(AMT)

From SALES S join PROD P on S.UPC_ID=P.UPC_ID join LOC L on S.STR_ID=L.STR_ID

             join CAL  C on S.DT_ID=C.DT_ID

where P.CAT_ID = 101 and L.RGN_ID = 205 and C.YR = 2005 and C.WK = 40

group by P.CLS_NM, P.CLS_DSC, L.STR_NM, L.ZIP, C.YR, C.WK


A typical dimensional query pattern involves a fact table and one join per dimension. Let’s take a Sales fact, and three dimensions: location, product and calendar. The dimensions typically include multiple levels, each level characterized at least by an identifier and a name, often including a long description or special attributes particular to the level. For example, the product dimension may have UPC as the base level rolling into subclass (SCL), to class (CLS) and to category (CAT). Each of the levels has a token ID generated by the system, a name (NM) and a long description (DSC). The location dimension may include stores (STR) at the bottom level, which carry ZIP code, and open date (OPN_DT) as special attributes, with district (DST), region (RGN), division (DIV) etc. as its higher levels. A typical drill-down query would select a cell of the OLAP cube by filtering on each dimension, and summing quantities and amounts after grouping at the filtering level or lower along each dimension. For example, lets view week 40 sales of major category 101 for region 205 broken down by store and product class. Let’s assume that the displayed results must include class descriptions, store names and ZIP code. The query would look something like Q1.

Let’s also assume the existence of an MQT at the weekly level by store by subclass, which of course we want to use. Here are a couple MQT definitions that could be used to successfully redirect our query:

MQT01 as (

Select P.SCL_ID,P.SCL_NM,P.SCL_DSC,P.CLS_ID,P.CLS_NM, P.CLS_DSC,P.CAT_ID...,

       S.STR_ID,

       C.WK_ID, C.YR, C.WK, ...,

       S.SUM(AMT) as AMT

From   SALES S join PROD P on S.UPC_ID=P.UPC_ID join CAL C on S.DT_ID=C.DT_ID

group by ...)

;

MQT02 as (

Select P.SCL_ID,P.SCL_NM,P.SCL_DSC,P.CLS_ID,P.CLS_NM, P.CLS_DSC,P.CAT_ID..., 

       L.STR_ID, STR_NM, STR_DSC, L.ZIP, ..., L.DST_ID, ..., L.RGN_ID

       C.WK_ID, C.YR, C.WK, ...,

       S.SUM(AMT)

From SALES S join PROD P on S.UPC_ID=P.UPC_ID join LOC L on S.STR_ID=L.STR_ID

             join CAL  C on S.DT_ID=C.DT_ID

group by ...)

;

MQT03 as (

Select P.SCL_ID, P.CLS_ID, P.CAT_ID..., 

       C.YR, C.WK_ID, C.PER_ID, ...,

       S.SUM(AMT)

From SALES S join PROD P on S.UPC_ID=P.UPC_ID join CAL  C on S.DT_ID=C.DT_ID

group by ...)

;

MQT04 as (

Select P.SCL_ID, P.CLS_ID, P.CAT_ID..., 

       L.STR_ID, L.DST_ID, L.RGN_ID, L.DIV_ID,

       C.YR, C.WK_ID, C.PER_ID, ...,

       S.SUM(AMT)

From SALES S join PROD P on S.UPC_ID=P.UPC_ID join LOC L on S.STR_ID=L.STR_ID

             join CAL  C on S.DT_ID=C.DT_ID

group by ...)

;


Both MQT01 and MQT02 include each level of product starting from subclass and up, and each level of time starting from week and up. MQT02 also includes the complete location dimension. Also notice that they include IDs, names (NM), descriptions (DSC) and special attributes (e.g. ZIP). The optimizer will gladly match our query Q1 with either of the first two MQTs and rewrite the query like Q1.1 and Q1.2. In both cases the performance is likely to improve substantially because the amount of data in the two MQTs should be at least one, if not two levels of magnitude less than the base table, and two or three joins respectively are avoided.

If we are to include all names and descriptions in the MQT, however, the MQT row becomes too large. It is preferable to design MQTs that are not cluttered by long descriptive columns, but only with level identifiers and critical attributes that can facilitate fast filtering and grouping. MQT03 and MQT04 are examples of this approach with row size that could be hundreds, if not thousands, of bytes shorter than the first two MQTs respectively. These MQTs can only be used by a double-dipping query such as Q1.3 or Q1.4. These queries essentially use the subset of the columns that match the MQT for filtering and grouping, and then join back to a distinct subset of the dimensions to extract names and descriptions at the grouping level. This double-dipping to the dimension can be derived by the optimizer if certain functional dependencies are guaranteed and defined in the schema. Alternatively a double-dipping query pattern can be explicitly coded. The pros and cons will be discussed along with design alternatives and trade-offs.

Q1.1:

Select M.CLS_NM, M.CLS_DSC, L.STR_NM, L.ZIP, M.YR, M.WK, SUM(M.AMT)

From MQT01 M join LOC L on M.STR_ID=L.STR_ID

where M.MCAT_ID = 101 and L.RGN_ID = 205 and M.YR = 2005 and M.WK = 40

group by M.CLS_NM, M.CLS_DSC, L.STR_NM, L.ZIP, M.YR, M.WK

;

Q1.2:

Select M.CLS_NM, M.CLS_DSC, M.STR_NM, M.ZIP, M.YR, M.WK, SUM(M.AMT)

From MQT02 M

where M.MCAT_ID = 101 and M.RGN_ID = 205 and M.YR = 2005 and M.WK = 40

group by M.CLS_NM, M.CLS_DSC, M.STR_NM, M.ZIP, M.YR, M.WK

;

Q1.3:

Select P.CLS_NM, P.CLS_DSC, L.STR_NM, L.ZIP, M.YR, M.WK, SUM(M.AMT)

From MQT03 M join LOC L on M.STR_ID=L.STR_ID

             join (select distinct CLS_ID, CLS_NM, CLS_DSC from PROD) P

               on M.CLS_ID=P.CLS_ID

where M.MCAT_ID = 101 and L.RGN_ID = 205 and M.YR = 2005 and M.WK = 40

group by M.CLS_NM, M.CLS_DSC, L.STR_NM, L.ZIP, M.YR, M.WK

;

Q1.4:

Select M.CLS_NM, M.CLS_DSC, M.STR_NM, M.ZIP, M.YR, M.WK, M.SUM(AMT)

From MQT04 M join (select distinct CLS_ID, CLS_NM, CLS_DSC from PROD) P

               on M.CLS_ID=P.CLS_ID

             join LOC P on M.STR_ID=L.ID

where M.MCAT_ID = 101 and M.RGN_ID = 205 and M.YR = 2005 and M.WK = 40

group by P.CLS_NM, P.CLS_DSC, L.STR_NM, L.ZIP, M.YR, M.WK

;