Loading into Redshift

This post goes through the steps needed to set up a new Redshift cluster and load data into it. We'll use a standard approach, but there are many alternatives; see here for more details. For simplicity, we assume that a number of ETL jobs already exist and generate the presentation-layer data as flat files. In real life, these ETL jobs would hold and maintain the business rules and transformation logic required by the project, but for now we focus only on the loading mechanisms involved with Redshift.
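To make the loading step concrete, here is a minimal sketch of what loading flat files can look like. It assumes the files sit in S3 and are loaded with Redshift's COPY command, which this excerpt does not confirm. The helper `build_copy_statement`, the table name, the bucket path and the IAM role ARN are all hypothetical placeholders. The function only assembles the SQL text and does not connect to a cluster.

```python
def build_copy_statement(table, s3_path, iam_role, delimiter="|", gzip=False):
    """Return a Redshift COPY statement for delimited flat files stored in S3.

    Hypothetical helper for illustration: it builds the SQL string only and
    does not execute anything against a cluster.
    """
    parts = [
        f"COPY {table}",
        f"FROM '{s3_path}'",          # S3 prefix holding the flat files
        f"IAM_ROLE '{iam_role}'",     # role the cluster assumes to read S3
        f"DELIMITER '{delimiter}'",   # field separator used by the ETL output
    ]
    if gzip:
        parts.append("GZIP")          # files are gzip-compressed
    return "\n".join(parts) + ";"


# All names below are placeholders, not values from the original post.
sql = build_copy_statement(
    table="presentation.review",
    s3_path="s3://example-bucket/review/",
    iam_role="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
    gzip=True,
)
print(sql)
```

The resulting statement would then be run through any SQL client connected to the cluster. Pointing COPY at a prefix rather than at a single file lets Redshift load several files in parallel.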

Setting up an Amazon Redshift Cluster

The pre-requisite ...

more ...

BRD summary stats

After a few weeks spent harvesting and integrating book reviews, it is time to share some statistics.

Over 22 million reviews have been harvested from a sample of roughly 300K books (I use the term Book here instead of Work). I started by harvesting books sequentially (by id) and later processed them by popularity as a way to get more reviews. Some books catalogued in Librarything could not be found on the other sites, while others had no reviews.

Statistics          Librarything   Goodreads   Babelio
Book sample size    300K           --          --
Books found         --             216K        78K
Number of reviews   1.2M           20.5M       415K
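As a quick sanity check, the rounded figures in the table above can be combined to recover the overall review count and to estimate how much of the sample was found on each site. This is only a rough sketch: the inputs are the rounded values shown in the table, so the results are approximate, and the variable names are illustrative.

```python
# Rounded figures taken from the table above.
SAMPLE_SIZE = 300_000
books_found = {"Goodreads": 216_000, "Babelio": 78_000}
reviews = {"Librarything": 1_200_000, "Goodreads": 20_500_000, "Babelio": 415_000}

# Total number of reviews across the three sites.
total_reviews = sum(reviews.values())

# Share of the Librarything sample that was found on each other site.
coverage = {site: found / SAMPLE_SIZE for site, found in books_found.items()}

print(f"Total reviews: {total_reviews / 1e6:.1f}M")
for site, share in coverage.items():
    print(f"{site}: {share:.0%} of the sample found")
```

With these inputs the total comes to about 22.1M reviews, consistent with the "over 22 million" figure, and roughly 72% of the sample was found on Goodreads against 26% on Babelio.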

Notes on ...

more ...

BRD Presentation layer

Date Tags dataset

The Presentation layer's role is to serve all user needs for reporting, data analytics and front-end applications such as visualization or dashboarding. The focus is on optimizing read access rather than write access. The challenge is to do so without knowing the exact data access patterns that user interactions will trigger.

In this post, I'll define the physical data model created for a Redshift cloud DWH target platform. This implementation choice considerably influences the resulting physical data model.

Redshift

Redshift is a massively parallel processing (MPP), cloud-based database suited for BI and analytics needs, running on top ...

more ...