Temporal Attributes in the Data Mart

a2cudGVjaA==bW9ja2JpdGVzLmNvbQ==

April 12, 2009

Abstract

Data mart facts almost always have temporal attributes such as date and time. Ways of organizing such attributes to efficiently support DSS queries are presented.

The Dimensions

Instead of storing temporal attributesSuch as: transaction datetime, event start datetime, event end daytime, etc. as facts, it is common to use temporal dimensions–this provides for query efficiency and flexibility in temporal slicing and dicing. The choices are to either have two separate dimensions one for date and one for time, or to combine date and time into one datetime dimension.

The Date Dimension

It is fairly easy to build the date dimension using the facilities of an RDBMS. The following code creates the table on a PostgreSQL database:

create table date_dimension (
    date_key                    INT4        not null,
    date_type                   VARCHAR(16) not null,
    full_date                   DATE        not null,
    day_of_week                 VARCHAR(9)  null,
    day_number_in_epoch         INT4        null,
    month_number_in_epoch       INT4        null,
    day_number_in_week          INT2        null,
    day_number_in_month         INT2        null,
    day_number_in_year          INT2        null,
    last_day_in_week_indicator  flag        null,
    last_day_in_month_indicator flag        null,
    week_ending_date            DATE        null,
    week_number_in_year         INT2        null,
    month_name                  VARCHAR(9)  null,
    month_number_in_year        INT2        null,
    year_month                  CHAR(7)     null,
    quarter                     INT2        null,
    year_quarter                CHAR(6)     null,
    half_year                   INT2        null,
    year                        INT2        null,
    weekday_flag                flag        null,
    constraint PK_DATE_DIMENSION primary key (date_key),
    constraint AK_AK1_DATE_DIM unique (date_type, full_date)
);

and this code creates a function that populates it from the epoch dateUsually 1970/1/1 on Unix systems. to 2100/12/31:

-- populates date_dimension
-- install it and run with: select date_dim();

create or replace function date_dim() returns void as $$

declare
epoch date := 'epoch'::date;
current_date date := epoch;
end_date date := '2100/12/31';

begin

insert into date_dimension (date_key, date_type, full_date)
    values(-1, 'inapplicable', '1900/1/1');

insert into date_dimension (date_key, date_type, full_date)
    values(-2, 'missing', '1900/1/2');

insert into date_dimension (date_key, date_type, full_date)
    values(-3, 'corrupt', '1900/1/3');

while current_date <= end_date loop

insert into date_dimension
(
date_key,
date_type,
full_date,
day_of_week,
day_number_in_epoch,
month_number_in_epoch,
day_number_in_week,
day_number_in_month,
day_number_in_year,
last_day_in_week_indicator,
last_day_in_month_indicator,
week_ending_date,
week_number_in_year,
month_name,
month_number_in_year,
year_month,
quarter,
year_quarter,
half_year,
year,
weekday_flag
)
values
(
current_date - epoch + 1, -- date_key
'date', -- date_type
current_date, -- full_date
rtrim(to_char(current_date, 'Day')), -- day_of_week
current_date - epoch + 1, -- day_number_in_epoch
-- month_number_in_epoch
extract(year from age(current_date, 'epoch')) * 12
    + extract(month from age(current_date, 'epoch')) + 1,
extract(isodow from current_date), -- day_number_in_week
extract(day from current_date), -- day_number_in_month
extract(doy from current_date), -- day_number_in_year
(case when extract(isodow from current_date) = 7
    then 'Y' else 'N' end)::flag, -- last_day_in_week_indicator
-- last_day_in_month_indicator
(case when extract(month from current_date) <>
    extract(month from current_date + '1 day'::interval)
    then 'Y' else 'N' end)::flag,
(current_date + '7 day'::interval -
    (extract(isodow from current_date)
    || ' day')::interval)::date, -- week_ending_date
extract(week from current_date), -- week_number_in_year
rtrim(to_char(current_date, 'Month')), -- month_name
extract(month from current_date), -- month_number_in_year
-- year_month
to_char(current_date, 'YYYY') || '_'
    || to_char(current_date, 'MM'),
extract(quarter from current_date), -- quarter
-- year_quarter
to_char(current_date, 'YYYY') || '_'
    || to_char(current_date, 'Q'),
-- half_year
case when extract(quarter from current_date) <= 2
    then 1 else 2 end,
extract(year from current_date), -- year
(case when rtrim(to_char(current_date, 'Day'))
    in ('Saturday', 'Sunday')
    then 'N' else 'Y' end)::flag -- weekday_flag
);

current_date := current_date + '1 day'::interval;

end loop;

end;
$$ language plpgsql;

The Time Dimension

Similarly, to create the time dimension:

create table time_dimension (
    time_key                INT4        not null,
    time_type               VARCHAR(16) not null,
    full_time               TIME        not null,
    hour_number_in_day      INT2        null,
    minute_number_in_day    INT2        null,
    minute_number_in_hour   INT2        null,
    second_number_in_day    INT4        null,
    second_number_in_hour   INT2        null,
    second_number_in_minute INT2        null,
    half_hour               INT2        null,
    quarter_hour            INT2        null,
    constraint PK_TIME_DIMENSION primary key (time_key),
    constraint AK_AK1_TIME_DIM unique (time_type, full_time)
);

and populate it with:

-- populates time_dimension
-- install it and run with: select time_dim()

create or replace function time_dim() returns void as $$

declare
current_time time := '00:00:00';
current_second_number_in_day int4 := 0;
end_second_number_in_day int4 := 24 * 60 * 60 - 1;

begin

insert into time_dimension (time_key, time_type, full_time)
    values(-1, 'inapplicable', '00:00:00');
insert into time_dimension (time_key, time_type, full_time)
    values(-2, 'missing', '00:00:00');
insert into time_dimension (time_key, time_type, full_time)
    values(-3, 'corrupt', '00:00:00');

-- loops '00:00:00' to '23:59:59'
while current_second_number_in_day
    <= end_second_number_in_day loop

insert into time_dimension
(
time_key,
time_type,
full_time,
hour_number_in_day,
minute_number_in_day,
minute_number_in_hour,
second_number_in_day,
second_number_in_hour,
second_number_in_minute,
half_hour,
quarter_hour
)
values
(
current_second_number_in_day + 1, -- time_key
'time', -- time_type
current_time, -- full_time
extract(hour from current_time), -- hour_number_in_day
-- minute_number_in_day
extract(hour from current_time) * 60
    + extract(minute from current_time),
extract(minute from current_time), -- minute_number_in_hour
current_second_number_in_day, -- second_number_in_day
-- second_number_in_hour
extract(minute from current_time) * 60
    + extract(second from current_time),
extract(second from current_time), -- second_number_in_minute
-- half_hour
floor((extract(minute from current_time) * 60
    + extract(second from current_time)) / 1800) + 1,
-- quarter_hour
floor((extract(minute from current_time) * 60
    + extract(second from current_time)) / 900) + 1
);

current_time := current_time + '1 second'::interval;
current_second_number_in_day :=
    current_second_number_in_day + 1;

end loop;

end;
$$ language plpgsql;

The Datetime Dimension

The combination of the date and time dimensions into one datetime dimension is straightforward, the first three columns are datetime_key, datetime_type, full_datetime and the rest are the combined columns of the two dimensions.

Choosing the Right Dimensions

The choice of temporal dimensions depends on how fine the temporal resolution needs to be to accomodate queries to the data mart.Which of course depends on the nature of the data stored. It is usually preferable to have separate date and time dimensions because the cardinality of a single datetime dimension explodes quickly depending on the time granularity. For example, for 50 years we have:

Dimension	Granularity	Cardinality
datetime	Second	1,576,800,000 (50 × 365 × 24 × 60 × 60)
datetime	Minute	26,280,000 (50 × 365 × 24 × 60)
datetime	Hour	438,000 (50 × 365 × 24)
date	Day	18,250 (50 × 365)
time	Second	86,400 (24 × 60 × 60)
time	Minute	1,440 (24 × 60)

Temporal Ranges

A Case for Using a Datetime Dimension

Although it is often unnecessary to combine date and time into one dimension, there are cases where it may be warranted. For example, facts associated with temporal ranges (e.g., orbit images from a satellite) rather than a point in time (e.g., sales transactions) can benefit from the use of a datetime dimension. Consider the example of calculating monthly counts temporally constrained by a starting datetime of '2005/1/1 15:00' and ending datetime of '2007/12/31 13:00'. The simple implementation using the date and time dimensions

-- don't do this
select
    d.year, d.month_number_in_year, count(1)
from
fact f
join date_dimension d on d.date_key
    between f.begin_date_key and f.end_date_key
join time_dimension t on t.time_key
    between f.begin_time_key and f.end_time_key
where
    d.full_date >= '2005/1/1' and t.full_time >= '15:00'
    and d.full_date <= '2007/12/31' and t.full_time <= '13:00'
group by
    d.year, d.month_number_in_year
order by
    d.year, d.month_number_in_year

is in fact very wrong because:

It incorrectly applies the time constrains to all days instead of the beginning and ending days only.
It over-counts fact records because of the join to the time dimension.

The correct implementation using the date and time dimensions

-- you probably don't want to do this either
-- distinct so we don't over-count fact records
select
    d.year, d.month_number_in_year, count(distinct f.fact_key)
from
fact f
join date_dimension d on d.date_key
    between f.begin_date_key and f.end_date_key
join time_dimension t on
    ( -- inside the period but not the first or last day
        d.date_key > f.begin_date_key
            and d.date_key < f.end_date_key
    )
    or
    ( -- the first, but not the last day
        d.date_key = f.begin_date_key
            and t.time_key >= f.begin_time_key
        and d.date_key < f.end_date_key
    )
    or
    ( -- the last but not the first day
        d.date_key > f.begin_date_key
        and d.date_key = f.end_date_key
            and t.time_key <= f.end_time_key
    )
    or
    ( -- both the first and the last day
        d.date_key = f.begin_date_key
            and t.time_key >= f.begin_time_key
        and d.date_key = f.end_date_key
            and t.time_key <= f.end_time_key
    )
where
    (
        d.full_date > '2005/1/1' and d.full_date < '2007/12/31')
    or
    (
        d.full_date = '2005/1/1' and t.time_key >= '15:00'
    )
    or
    (
        d.full_date = '2007/12/31' and t.time_key <= '13:00'
    )
group by
  d.year, d.month_number_in_year
order by
  d.year, d.month_number_in_year

will be slowOR clauses and range joins are not exactly the fastest SQL operations. if the data mart is sufficiently large.As it should, otherwise there would be no need for a data mart. With a datetime dimension the query is reduced to:

select
    d.year, d.month_number_in_year, count(1)
from
fact f
join datetime_dimension dt
    on dt.datetime_key
        between f.begin_datetime_key and f.end_datetime_key
where
    dt.full_datetime
        between '2005/1/1 15:00' and '2007/12/31 13:00'
group by
    dt.year, d.month_number_in_year
order by
    dt.year, d.month_number_in_year

There is one more improvement to group by queries such as the one above that applies when using either the date or datetime dimensions. Instead of applying the between operator to every day or time unit, we can pre-compute the group by's temporal boundaries and determine fact temporal overlap only for them. Consider the following query:

select
    month_number_in_year,
    (
        select count(distinct fact_key)
        from fact
        where z._min <= fact_end_date_key
            and fact_begin_date_key <= z._max
    )
from
(
    select
        month_number_in_year,
        min(date_key) as _min, max(date_key) as _max
    from date_dimension
    where year = 2008
    group by month_number_in_year
) z

This calculates the monthly counts for facts that have a temporary overlapFor example, they may begin in 2007 and end in 2008 or 2009, or begin in 2008 and end in 2009. into 2008. The key here is that for a given period starting at P_s and ending at P_e a fact starting at F_s and ending at F_e has a temporary overlap with that period if (P_s <= F_e) and (F_s <= P_e). The second part of the query above simply calculates the P_s and P_e date_key values for every month in 2008.

Using a Temporal Range Bridge

Another way of working with temporal ranges is to have a temporal range bridge table that associates temporal units with facts. For the example of orbit images, we can store one entry per image per hour for the entire temporal range of the image. This alone is not sufficient for temporal constraining to the minuteWhich can be implemented via a hybrid approach., but it allows for efficient temporal slicing and dicing using an equijoin between the temporal range bridge and the date dimension which can be very fast as long as the temporal unit for the bridge is not unreasonably small.For example, a temporal unit of one second, or even one minute, is probably a bad idea for facts with an average range of several hours. The query using the bridge is:

-- distinct because the bridge has one entry per fact per hour
select
    d.year, d.month_number_in_year, count(distinct b.fact_key)
from
fact_temporal_range_bridge b
right join date_dimension d on d.date_key = b.date_key
where
    d.full_date between '2005/1/1' and '2007/12/31'
group by
    d.year, d.month_number_in_year
order by
    d.year, d.month_number_in_year

The downside of the temporal range bridge approach is that, depending on the granularity of the temporal unit stored per fact record, its size can explode quickly, especially if some facts represent large temporal ranges. This can have a negative impact on:

The ETL rate of insert.
Query performance, depending on indexing and the nature of the queries.