Date Tags dataset

BRD Presentation layer

The Presentation layer's role is to respond to all user needs for reporting, data analytics and front-end applications like visualization or dashboarding. The focus is to optimize read-access, as opposed to write-access. The challenge is to optimize read-access without knowing the exact data access pattern that will be triggered from users interactions.

In this post, I'll define the physical data model created for a Redshift DWH Cloud target platform. This implementation choice influences considerably the resulting physical data model.

Redshift

Redshift is a Massively parallel processing (MPP) Cloud-based database suited for BI and analytics needs running on top of commodity hardware based architectures available from AWS. Among its features, we can highlight Columnar storage, high compression data, execution of query compiled code. More info available here and I've also gathered details on a separate post covering key aspects influencing data modeling.

Physical Data model

BRD integration layer is highly normalized and this can penalize Redshift data access performance. So we should first de-normalize our model by merging descriptive tables (ex. user_*) and association ones (ex. review_similarto, work_title, ..) into their parent entity.

Looking at the physical data model of the integration layer's model (see below), we see it is a bit complex and not suited to any end-user. Physical Data model

However Presentation layer should be easily interpreted and understood by a much larger audience, hence simpler. We have one central fact table REVIEW storing facts rating and nb_likes (and raw review text) for each review, surrounded by dimensions DIM_DATE, DIM_REVIEWER, DIM_BOOK and DIM_SITE. We also have tables REL_TAG and REL_AUTHOR capturing the more complex many-to-many relationship between a review and dimensions DIM_TAG and DIM_AUTHOR respectively.

Other changes required are due to Redshift support limitation of data type compared to Potgres. The TEXT type are converted to VARCHAR using data input to determine its size. Size is based on bytes and not character, hence to determine size required use SQL function octet_length() instead of char_length(). Also, UUID are replaced by BIGINT (note that UUID are useful for data integration where load can be parallelized without dependency on surrogate keys lookup, but not so useful for our Presentation layer).

For each table, we can also set Distribution type using assumptions on most frequent query joins, Sort key, as well as Compression encoding for all attributes.

The rest of this post discusses implementation details of some table.

Review

REVIEW table is by far our biggest table and needs to be distributed optimally. Its distribution key should correspond to one FK that often gets used for joining another large table. Two logical candidates are book_id for table BOOK or reviewer_id for table REVIEWER. At this point, it is quite difficult to know whether analysis of reviews will be mostly done against users demographic or book information. Without real usage audit, we can only guess... and that's ok, our layered-architecture approach offers the advantage of re-building this layer on demand (Integration-layer safely keeps our data)!

Let's assume book_id is the best candidate and co-locate reviews with their associated books data. As for sorting, we'll use review's date (DATE_ID) as a way to optimize time-series report.

create table presentation.review (
    id bigint primary key,
    id_similarto bigint,
    -- dimensions
    book_id int not null,
    reviewer_id int not null,
    site_id smallint not null,
    date_id smallint not null,
    -- facts
    rating smallint,
    nb_likes int,
    lang_code char(3),
    review varchar(30000),  --based on max found
    foreign key (book_id) references presentation.book(id),
    foreign key (reviewer_id) references presentation.reviewer(id),
    foreign key (site_id) references presentation.site(id),
    foreign key (date_id) references presentation.dim_date(id),
    foreign key (lang_code) references presentation.dim_language(code)
)
diststyle KEY distkey (book_id)
sortkey(date_id)
;

Tag

It is recommended to define character type as small as possible. We can determine the max size needed for the tag using a pre-loaded Postgres DB by:

select max(octet_length(tag)) from dim_tag;

The table TAG is relatively small (300K rows) and loaded in append-mode (inserting only), so we can safely set its Distribution style to ALL to offer best read performance:

create table presentation.dim_tag (
    tag_id int primary key,
    tag varchar(80) unique not null,
    lang_code char(3) not null,
    foreign key (lang_code) references presentation.dim_language(code)
)
diststyle ALL
;

Book

BOOK table holds informational data on Book reviewed. For reasons explained in REVIEW table, we chose id as distribution key. We also de-normalize different language titles pivoted by a few popular languages. This allows to do search query by title leveraging all cluster nodes.

We'll also sort rows along title_ori which should be used frequently as predicate.

create table presentation.dim_book (
    id int primary key,
    title_ori text,
    ori_lang_code char(3),
    mds_code varchar(30),
    --pivot by popular lang
    english_title varchar(500),
    french_title varchar(500),
    german_title varchar(500),
    dutch_title varchar(500),
    spanish_title varchar(500),
    italian_title varchar(500),
    japenese_title varchar(500),
    swedish_title varchar(500),
    finish_title varchar(500),
    portuguese_title varchar(500)
)
diststyle KEY distkey (book_id)
sortkey(title_ori)
;

Reviewer

REVIEWER table holds informational data on users having reviewed one or more books. As with any dimension, its id is the attribute used in joins, so we'll use it for Distributing its rows. We decide to arbitrarily sort rows based on birth_year to favor queries filtering on this attribute.

create table presentation.dim_reviewer (
    reviewer_id int primary key,
    username varchar(200) not null,
    gender char(1) not null,
    birth_year smallint not null,
    status varchar(20) not null,
    occupation varchar(100),
    city varchar(200) not null,
    lati float,
    longi float,
    site_name varchar(20) not null
)
diststyle key distkey (id)
sortkey(birth_year)
;

Tuning parameters summary

The following table summarizes the choice of distribution style and sort key (not used at this point, but TODO later for optimization) for all tables.

Table name Sort key Distribution style
Dim_author name ALL
Rel_author n.a. KEY (book_id)
Dim_book note KEY (book_id)
Dim_date note ALL
Dim_language n.a. ALL
Dim_mds note ALL
Dim_reviewer note KEY (reviewer_id)
Dim_site n.a. ALL
Dim_tag note ALL
Rel_tag KEY (book_id)
Review note by key (book_id)

note: these would be set following realistic data access usage

Next step will be to populate it through Redshift load command and applying our business rules transformation logic. To be covered in a separate post.