Temporal Attributes in the Data Mart
a2cudGVjaA==bW9ja2JpdGVzLmNvbQ==Abstract
Data mart facts almost always have temporal attributes such as date and time. Ways of organizing such attributes to efficiently support DSS queries are presented.
The Dimensions
Instead of storing temporal attributesSuch as: transaction datetime, event start datetime, event end daytime, etc. as facts, it is common to use temporal dimensions–this provides for query efficiency and flexibility in temporal slicing and dicing. The choices are to either have two separate dimensions one for date and one for time, or to combine date and time into one datetime dimension.
The Date Dimension
It is fairly easy to build the date dimension using the facilities of an RDBMS. The following code creates the table on a PostgreSQL database:
create table date_dimension (
date_key INT4 not null,
date_type VARCHAR(16) not null,
full_date DATE not null,
day_of_week VARCHAR(9) null,
day_number_in_epoch INT4 null,
month_number_in_epoch INT4 null,
day_number_in_week INT2 null,
day_number_in_month INT2 null,
day_number_in_year INT2 null,
last_day_in_week_indicator flag null,
last_day_in_month_indicator flag null,
week_ending_date DATE null,
week_number_in_year INT2 null,
month_name VARCHAR(9) null,
month_number_in_year INT2 null,
year_month CHAR(7) null,
quarter INT2 null,
year_quarter CHAR(6) null,
half_year INT2 null,
year INT2 null,
weekday_flag flag null,
constraint PK_DATE_DIMENSION primary key (date_key),
constraint AK_AK1_DATE_DIM unique (date_type, full_date)
);
and this code creates a function that populates it from the epoch dateUsually 1970/1/1 on Unix systems. to 2100/12/31:
-- populates date_dimension
-- install it and run with: select date_dim();
create or replace function date_dim() returns void as $$
declare
epoch date := 'epoch'::date;
current_date date := epoch;
end_date date := '2100/12/31';
begin
insert into date_dimension (date_key, date_type, full_date)
values(-1, 'inapplicable', '1900/1/1');
insert into date_dimension (date_key, date_type, full_date)
values(-2, 'missing', '1900/1/2');
insert into date_dimension (date_key, date_type, full_date)
values(-3, 'corrupt', '1900/1/3');
while current_date <= end_date loop
insert into date_dimension
(
date_key,
date_type,
full_date,
day_of_week,
day_number_in_epoch,
month_number_in_epoch,
day_number_in_week,
day_number_in_month,
day_number_in_year,
last_day_in_week_indicator,
last_day_in_month_indicator,
week_ending_date,
week_number_in_year,
month_name,
month_number_in_year,
year_month,
quarter,
year_quarter,
half_year,
year,
weekday_flag
)
values
(
current_date - epoch + 1, -- date_key
'date', -- date_type
current_date, -- full_date
rtrim(to_char(current_date, 'Day')), -- day_of_week
current_date - epoch + 1, -- day_number_in_epoch
-- month_number_in_epoch
extract(year from age(current_date, 'epoch')) * 12
+ extract(month from age(current_date, 'epoch')) + 1,
extract(isodow from current_date), -- day_number_in_week
extract(day from current_date), -- day_number_in_month
extract(doy from current_date), -- day_number_in_year
(case when extract(isodow from current_date) = 7
then 'Y' else 'N' end)::flag, -- last_day_in_week_indicator
-- last_day_in_month_indicator
(case when extract(month from current_date) <>
extract(month from current_date + '1 day'::interval)
then 'Y' else 'N' end)::flag,
(current_date + '7 day'::interval -
(extract(isodow from current_date)
|| ' day')::interval)::date, -- week_ending_date
extract(week from current_date), -- week_number_in_year
rtrim(to_char(current_date, 'Month')), -- month_name
extract(month from current_date), -- month_number_in_year
-- year_month
to_char(current_date, 'YYYY') || '_'
|| to_char(current_date, 'MM'),
extract(quarter from current_date), -- quarter
-- year_quarter
to_char(current_date, 'YYYY') || '_'
|| to_char(current_date, 'Q'),
-- half_year
case when extract(quarter from current_date) <= 2
then 1 else 2 end,
extract(year from current_date), -- year
(case when rtrim(to_char(current_date, 'Day'))
in ('Saturday', 'Sunday')
then 'N' else 'Y' end)::flag -- weekday_flag
);
current_date := current_date + '1 day'::interval;
end loop;
end;
$$ language plpgsql;
The Time Dimension
Similarly, to create the time dimension:
create table time_dimension (
time_key INT4 not null,
time_type VARCHAR(16) not null,
full_time TIME not null,
hour_number_in_day INT2 null,
minute_number_in_day INT2 null,
minute_number_in_hour INT2 null,
second_number_in_day INT4 null,
second_number_in_hour INT2 null,
second_number_in_minute INT2 null,
half_hour INT2 null,
quarter_hour INT2 null,
constraint PK_TIME_DIMENSION primary key (time_key),
constraint AK_AK1_TIME_DIM unique (time_type, full_time)
);
and populate it with:
-- populates time_dimension
-- install it and run with: select time_dim()
create or replace function time_dim() returns void as $$
declare
current_time time := '00:00:00';
current_second_number_in_day int4 := 0;
end_second_number_in_day int4 := 24 * 60 * 60 - 1;
begin
insert into time_dimension (time_key, time_type, full_time)
values(-1, 'inapplicable', '00:00:00');
insert into time_dimension (time_key, time_type, full_time)
values(-2, 'missing', '00:00:00');
insert into time_dimension (time_key, time_type, full_time)
values(-3, 'corrupt', '00:00:00');
-- loops '00:00:00' to '23:59:59'
while current_second_number_in_day
<= end_second_number_in_day loop
insert into time_dimension
(
time_key,
time_type,
full_time,
hour_number_in_day,
minute_number_in_day,
minute_number_in_hour,
second_number_in_day,
second_number_in_hour,
second_number_in_minute,
half_hour,
quarter_hour
)
values
(
current_second_number_in_day + 1, -- time_key
'time', -- time_type
current_time, -- full_time
extract(hour from current_time), -- hour_number_in_day
-- minute_number_in_day
extract(hour from current_time) * 60
+ extract(minute from current_time),
extract(minute from current_time), -- minute_number_in_hour
current_second_number_in_day, -- second_number_in_day
-- second_number_in_hour
extract(minute from current_time) * 60
+ extract(second from current_time),
extract(second from current_time), -- second_number_in_minute
-- half_hour
floor((extract(minute from current_time) * 60
+ extract(second from current_time)) / 1800) + 1,
-- quarter_hour
floor((extract(minute from current_time) * 60
+ extract(second from current_time)) / 900) + 1
);
current_time := current_time + '1 second'::interval;
current_second_number_in_day :=
current_second_number_in_day + 1;
end loop;
end;
$$ language plpgsql;
The Datetime Dimension
The combination of the date and time dimensions into one datetime dimension
is straightforward, the first three columns are
datetime_key, datetime_type, full_datetime
and the rest are the combined columns of the two dimensions.
Choosing the Right Dimensions
The choice of temporal dimensions depends on how fine the temporal resolution needs to be to accomodate queries to the data mart.Which of course depends on the nature of the data stored. It is usually preferable to have separate date and time dimensions because the cardinality of a single datetime dimension explodes quickly depending on the time granularity. For example, for 50 years we have:
| Dimension | Granularity | Cardinality |
|---|---|---|
| datetime | Second | 1,576,800,000 (50 × 365 × 24 × 60 × 60) |
| datetime | Minute | 26,280,000 (50 × 365 × 24 × 60) |
| datetime | Hour | 438,000 (50 × 365 × 24) |
| date | Day | 18,250 (50 × 365) |
| time | Second | 86,400 (24 × 60 × 60) |
| time | Minute | 1,440 (24 × 60) |
Temporal Ranges
A Case for Using a Datetime Dimension
Although it is often unnecessary to combine date and time into one dimension, there are cases where it may be warranted. For example, facts associated with temporal ranges (e.g., orbit images from a satellite) rather than a point in time (e.g., sales transactions) can benefit from the use of a datetime dimension. Consider the example of calculating monthly counts temporally constrained by a starting datetime of '2005/1/1 15:00' and ending datetime of '2007/12/31 13:00'. The simple implementation using the date and time dimensions
-- don't do this
select
d.year, d.month_number_in_year, count(1)
from
fact f
join date_dimension d on d.date_key
between f.begin_date_key and f.end_date_key
join time_dimension t on t.time_key
between f.begin_time_key and f.end_time_key
where
d.full_date >= '2005/1/1' and t.full_time >= '15:00'
and d.full_date <= '2007/12/31' and t.full_time <= '13:00'
group by
d.year, d.month_number_in_year
order by
d.year, d.month_number_in_year
is in fact very wrong because:
- It incorrectly applies the time constrains to all days instead of the beginning and ending days only.
- It over-counts fact records because of the join to the time dimension.
The correct implementation using the date and time dimensions
-- you probably don't want to do this either
-- distinct so we don't over-count fact records
select
d.year, d.month_number_in_year, count(distinct f.fact_key)
from
fact f
join date_dimension d on d.date_key
between f.begin_date_key and f.end_date_key
join time_dimension t on
( -- inside the period but not the first or last day
d.date_key > f.begin_date_key
and d.date_key < f.end_date_key
)
or
( -- the first, but not the last day
d.date_key = f.begin_date_key
and t.time_key >= f.begin_time_key
and d.date_key < f.end_date_key
)
or
( -- the last but not the first day
d.date_key > f.begin_date_key
and d.date_key = f.end_date_key
and t.time_key <= f.end_time_key
)
or
( -- both the first and the last day
d.date_key = f.begin_date_key
and t.time_key >= f.begin_time_key
and d.date_key = f.end_date_key
and t.time_key <= f.end_time_key
)
where
(
d.full_date > '2005/1/1' and d.full_date < '2007/12/31')
or
(
d.full_date = '2005/1/1' and t.time_key >= '15:00'
)
or
(
d.full_date = '2007/12/31' and t.time_key <= '13:00'
)
group by
d.year, d.month_number_in_year
order by
d.year, d.month_number_in_year
will be slowOR clauses and range joins are not exactly the fastest SQL operations. if the data mart is sufficiently large.As it should, otherwise there would be no need for a data mart. With a datetime dimension the query is reduced to:
select
d.year, d.month_number_in_year, count(1)
from
fact f
join datetime_dimension dt
on dt.datetime_key
between f.begin_datetime_key and f.end_datetime_key
where
dt.full_datetime
between '2005/1/1 15:00' and '2007/12/31 13:00'
group by
dt.year, d.month_number_in_year
order by
dt.year, d.month_number_in_year
There is one more improvement to group by queries such as the one above that applies when using either the date or datetime dimensions. Instead of applying the between operator to every day or time unit, we can pre-compute the group by's temporal boundaries and determine fact temporal overlap only for them. Consider the following query:
select
month_number_in_year,
(
select count(distinct fact_key)
from fact
where z._min <= fact_end_date_key
and fact_begin_date_key <= z._max
)
from
(
select
month_number_in_year,
min(date_key) as _min, max(date_key) as _max
from date_dimension
where year = 2008
group by month_number_in_year
) z
This calculates the monthly counts for facts that have a temporary overlapFor example, they may begin in 2007 and end in 2008 or 2009, or begin in 2008 and end in 2009. into 2008. The key here is that for a given period starting at Ps and ending at Pe a fact starting at Fs and ending at Fe has a temporary overlap with that period if (Ps <= Fe) and (Fs <= Pe). The second part of the query above simply calculates the Ps and Pe date_key values for every month in 2008.
Using a Temporal Range Bridge
Another way of working with temporal ranges is to have a temporal range bridge table that associates temporal units with facts. For the example of orbit images, we can store one entry per image per hour for the entire temporal range of the image. This alone is not sufficient for temporal constraining to the minuteWhich can be implemented via a hybrid approach., but it allows for efficient temporal slicing and dicing using an equijoin between the temporal range bridge and the date dimension which can be very fast as long as the temporal unit for the bridge is not unreasonably small.For example, a temporal unit of one second, or even one minute, is probably a bad idea for facts with an average range of several hours. The query using the bridge is:
-- distinct because the bridge has one entry per fact per hour
select
d.year, d.month_number_in_year, count(distinct b.fact_key)
from
fact_temporal_range_bridge b
right join date_dimension d on d.date_key = b.date_key
where
d.full_date between '2005/1/1' and '2007/12/31'
group by
d.year, d.month_number_in_year
order by
d.year, d.month_number_in_year
The downside of the temporal range bridge approach is that, depending on the granularity of the temporal unit stored per fact record, its size can explode quickly, especially if some facts represent large temporal ranges. This can have a negative impact on:
- The ETL rate of insert.
- Query performance, depending on indexing and the nature of the queries.