Temporal Attributes in the Data Mart
a2cudGVjaA==bW9ja2JpdGVzLmNvbQ==Abstract
Data mart facts almost always have temporal attributes such as date and time. Ways of organizing such attributes to efficiently support DSS queries are presented.
The Dimensions
Instead of storing temporal attributesSuch as: transaction datetime, event start datetime, event end daytime, etc. as facts, it is common to use temporal dimensions–this provides for query efficiency and flexibility in temporal slicing and dicing. The choices are to either have two separate dimensions one for date and one for time, or to combine date and time into one datetime dimension.
The Date Dimension
It is fairly easy to build the date dimension using the facilities of an RDBMS. The following code creates the table on a PostgreSQL database:
create table date_dimension ( date_key INT4 not null, date_type VARCHAR(16) not null, full_date DATE not null, day_of_week VARCHAR(9) null, day_number_in_epoch INT4 null, month_number_in_epoch INT4 null, day_number_in_week INT2 null, day_number_in_month INT2 null, day_number_in_year INT2 null, last_day_in_week_indicator flag null, last_day_in_month_indicator flag null, week_ending_date DATE null, week_number_in_year INT2 null, month_name VARCHAR(9) null, month_number_in_year INT2 null, year_month CHAR(7) null, quarter INT2 null, year_quarter CHAR(6) null, half_year INT2 null, year INT2 null, weekday_flag flag null, constraint PK_DATE_DIMENSION primary key (date_key), constraint AK_AK1_DATE_DIM unique (date_type, full_date) );
and this code creates a function that populates it from the epoch dateUsually 1970/1/1 on Unix systems. to 2100/12/31:
-- populates date_dimension -- install it and run with: select date_dim(); create or replace function date_dim() returns void as $$ declare epoch date := 'epoch'::date; current_date date := epoch; end_date date := '2100/12/31'; begin insert into date_dimension (date_key, date_type, full_date) values(-1, 'inapplicable', '1900/1/1'); insert into date_dimension (date_key, date_type, full_date) values(-2, 'missing', '1900/1/2'); insert into date_dimension (date_key, date_type, full_date) values(-3, 'corrupt', '1900/1/3'); while current_date <= end_date loop insert into date_dimension ( date_key, date_type, full_date, day_of_week, day_number_in_epoch, month_number_in_epoch, day_number_in_week, day_number_in_month, day_number_in_year, last_day_in_week_indicator, last_day_in_month_indicator, week_ending_date, week_number_in_year, month_name, month_number_in_year, year_month, quarter, year_quarter, half_year, year, weekday_flag ) values ( current_date - epoch + 1, -- date_key 'date', -- date_type current_date, -- full_date rtrim(to_char(current_date, 'Day')), -- day_of_week current_date - epoch + 1, -- day_number_in_epoch -- month_number_in_epoch extract(year from age(current_date, 'epoch')) * 12 + extract(month from age(current_date, 'epoch')) + 1, extract(isodow from current_date), -- day_number_in_week extract(day from current_date), -- day_number_in_month extract(doy from current_date), -- day_number_in_year (case when extract(isodow from current_date) = 7 then 'Y' else 'N' end)::flag, -- last_day_in_week_indicator -- last_day_in_month_indicator (case when extract(month from current_date) <> extract(month from current_date + '1 day'::interval) then 'Y' else 'N' end)::flag, (current_date + '7 day'::interval - (extract(isodow from current_date) || ' day')::interval)::date, -- week_ending_date extract(week from current_date), -- week_number_in_year rtrim(to_char(current_date, 'Month')), -- month_name extract(month from current_date), -- month_number_in_year -- year_month to_char(current_date, 'YYYY') || '_' || to_char(current_date, 'MM'), extract(quarter from current_date), -- quarter -- year_quarter to_char(current_date, 'YYYY') || '_' || to_char(current_date, 'Q'), -- half_year case when extract(quarter from current_date) <= 2 then 1 else 2 end, extract(year from current_date), -- year (case when rtrim(to_char(current_date, 'Day')) in ('Saturday', 'Sunday') then 'N' else 'Y' end)::flag -- weekday_flag ); current_date := current_date + '1 day'::interval; end loop; end; $$ language plpgsql;
The Time Dimension
Similarly, to create the time dimension:
create table time_dimension ( time_key INT4 not null, time_type VARCHAR(16) not null, full_time TIME not null, hour_number_in_day INT2 null, minute_number_in_day INT2 null, minute_number_in_hour INT2 null, second_number_in_day INT4 null, second_number_in_hour INT2 null, second_number_in_minute INT2 null, half_hour INT2 null, quarter_hour INT2 null, constraint PK_TIME_DIMENSION primary key (time_key), constraint AK_AK1_TIME_DIM unique (time_type, full_time) );
and populate it with:
-- populates time_dimension -- install it and run with: select time_dim() create or replace function time_dim() returns void as $$ declare current_time time := '00:00:00'; current_second_number_in_day int4 := 0; end_second_number_in_day int4 := 24 * 60 * 60 - 1; begin insert into time_dimension (time_key, time_type, full_time) values(-1, 'inapplicable', '00:00:00'); insert into time_dimension (time_key, time_type, full_time) values(-2, 'missing', '00:00:00'); insert into time_dimension (time_key, time_type, full_time) values(-3, 'corrupt', '00:00:00'); -- loops '00:00:00' to '23:59:59' while current_second_number_in_day <= end_second_number_in_day loop insert into time_dimension ( time_key, time_type, full_time, hour_number_in_day, minute_number_in_day, minute_number_in_hour, second_number_in_day, second_number_in_hour, second_number_in_minute, half_hour, quarter_hour ) values ( current_second_number_in_day + 1, -- time_key 'time', -- time_type current_time, -- full_time extract(hour from current_time), -- hour_number_in_day -- minute_number_in_day extract(hour from current_time) * 60 + extract(minute from current_time), extract(minute from current_time), -- minute_number_in_hour current_second_number_in_day, -- second_number_in_day -- second_number_in_hour extract(minute from current_time) * 60 + extract(second from current_time), extract(second from current_time), -- second_number_in_minute -- half_hour floor((extract(minute from current_time) * 60 + extract(second from current_time)) / 1800) + 1, -- quarter_hour floor((extract(minute from current_time) * 60 + extract(second from current_time)) / 900) + 1 ); current_time := current_time + '1 second'::interval; current_second_number_in_day := current_second_number_in_day + 1; end loop; end; $$ language plpgsql;
The Datetime Dimension
The combination of the date and time dimensions into one datetime dimension
is straightforward, the first three columns are
datetime_key, datetime_type, full_datetime
and the rest are the combined columns of the two dimensions.
Choosing the Right Dimensions
The choice of temporal dimensions depends on how fine the temporal resolution needs to be to accomodate queries to the data mart.Which of course depends on the nature of the data stored. It is usually preferable to have separate date and time dimensions because the cardinality of a single datetime dimension explodes quickly depending on the time granularity. For example, for 50 years we have:
Dimension | Granularity | Cardinality |
---|---|---|
datetime | Second | 1,576,800,000 (50 × 365 × 24 × 60 × 60) |
datetime | Minute | 26,280,000 (50 × 365 × 24 × 60) |
datetime | Hour | 438,000 (50 × 365 × 24) |
date | Day | 18,250 (50 × 365) |
time | Second | 86,400 (24 × 60 × 60) |
time | Minute | 1,440 (24 × 60) |
Temporal Ranges
A Case for Using a Datetime Dimension
Although it is often unnecessary to combine date and time into one dimension, there are cases where it may be warranted. For example, facts associated with temporal ranges (e.g., orbit images from a satellite) rather than a point in time (e.g., sales transactions) can benefit from the use of a datetime dimension. Consider the example of calculating monthly counts temporally constrained by a starting datetime of '2005/1/1 15:00' and ending datetime of '2007/12/31 13:00'. The simple implementation using the date and time dimensions
-- don't do this select d.year, d.month_number_in_year, count(1) from fact f join date_dimension d on d.date_key between f.begin_date_key and f.end_date_key join time_dimension t on t.time_key between f.begin_time_key and f.end_time_key where d.full_date >= '2005/1/1' and t.full_time >= '15:00' and d.full_date <= '2007/12/31' and t.full_time <= '13:00' group by d.year, d.month_number_in_year order by d.year, d.month_number_in_year
is in fact very wrong because:
- It incorrectly applies the time constrains to all days instead of the beginning and ending days only.
- It over-counts fact records because of the join to the time dimension.
The correct implementation using the date and time dimensions
-- you probably don't want to do this either -- distinct so we don't over-count fact records select d.year, d.month_number_in_year, count(distinct f.fact_key) from fact f join date_dimension d on d.date_key between f.begin_date_key and f.end_date_key join time_dimension t on ( -- inside the period but not the first or last day d.date_key > f.begin_date_key and d.date_key < f.end_date_key ) or ( -- the first, but not the last day d.date_key = f.begin_date_key and t.time_key >= f.begin_time_key and d.date_key < f.end_date_key ) or ( -- the last but not the first day d.date_key > f.begin_date_key and d.date_key = f.end_date_key and t.time_key <= f.end_time_key ) or ( -- both the first and the last day d.date_key = f.begin_date_key and t.time_key >= f.begin_time_key and d.date_key = f.end_date_key and t.time_key <= f.end_time_key ) where ( d.full_date > '2005/1/1' and d.full_date < '2007/12/31') or ( d.full_date = '2005/1/1' and t.time_key >= '15:00' ) or ( d.full_date = '2007/12/31' and t.time_key <= '13:00' ) group by d.year, d.month_number_in_year order by d.year, d.month_number_in_year
will be slowOR clauses and range joins are not exactly the fastest SQL operations. if the data mart is sufficiently large.As it should, otherwise there would be no need for a data mart. With a datetime dimension the query is reduced to:
select d.year, d.month_number_in_year, count(1) from fact f join datetime_dimension dt on dt.datetime_key between f.begin_datetime_key and f.end_datetime_key where dt.full_datetime between '2005/1/1 15:00' and '2007/12/31 13:00' group by dt.year, d.month_number_in_year order by dt.year, d.month_number_in_year
There is one more improvement to group by queries such as the one above that applies when using either the date or datetime dimensions. Instead of applying the between operator to every day or time unit, we can pre-compute the group by's temporal boundaries and determine fact temporal overlap only for them. Consider the following query:
select month_number_in_year, ( select count(distinct fact_key) from fact where z._min <= fact_end_date_key and fact_begin_date_key <= z._max ) from ( select month_number_in_year, min(date_key) as _min, max(date_key) as _max from date_dimension where year = 2008 group by month_number_in_year ) z
This calculates the monthly counts for facts that have a temporary overlapFor example, they may begin in 2007 and end in 2008 or 2009, or begin in 2008 and end in 2009. into 2008. The key here is that for a given period starting at Ps and ending at Pe a fact starting at Fs and ending at Fe has a temporary overlap with that period if (Ps <= Fe) and (Fs <= Pe). The second part of the query above simply calculates the Ps and Pe date_key values for every month in 2008.
Using a Temporal Range Bridge
Another way of working with temporal ranges is to have a temporal range bridge table that associates temporal units with facts. For the example of orbit images, we can store one entry per image per hour for the entire temporal range of the image. This alone is not sufficient for temporal constraining to the minuteWhich can be implemented via a hybrid approach., but it allows for efficient temporal slicing and dicing using an equijoin between the temporal range bridge and the date dimension which can be very fast as long as the temporal unit for the bridge is not unreasonably small.For example, a temporal unit of one second, or even one minute, is probably a bad idea for facts with an average range of several hours. The query using the bridge is:
-- distinct because the bridge has one entry per fact per hour select d.year, d.month_number_in_year, count(distinct b.fact_key) from fact_temporal_range_bridge b right join date_dimension d on d.date_key = b.date_key where d.full_date between '2005/1/1' and '2007/12/31' group by d.year, d.month_number_in_year order by d.year, d.month_number_in_year
The downside of the temporal range bridge approach is that, depending on the granularity of the temporal unit stored per fact record, its size can explode quickly, especially if some facts represent large temporal ranges. This can have a negative impact on:
- The ETL rate of insert.
- Query performance, depending on indexing and the nature of the queries.