BRD summary stats

After a few weeks spent harvesting and integrating book reviews, it is time to share some statistics.

Over 22 million reviews have been harvested on a sample of roughly 300K books (I use the term Book here instead of Work). I started harvesting books sequentially (by id) and later processed them by popularity as a way to get more reviews. Some books catalogued in Librarything could not be found on the other sites, while others had no reviews.

Statistics          Librarything   Goodreads   Babelio
Book sample size    300K           --          --
Book found          --             216K        78K
Number of reviews   1.2M           20.5M       415K

Notes on ...

more ...

How to do data integration, BRD example (part3)

Business Layer

This layer contains derived data needed by the Presentation/Delivery layer. We can build components like associations, groupings or hierarchies defined by Business Rules, and also do data cleansing to fix issues found in our raw data.

Building new association: Similar Reviews

Let's say we're required to find similar reviews written on a given Work. This could be useful to:

  • Identify duplication issues
  • Identify users duplicating reviews within or across sites
  • Identify spam where reviews are written to bias opinion
  • Find plagiarism among reviewers

How do we do that? Data processing on unstructured text is efficiently done using NoSQL ...
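
As a small, engine-agnostic illustration of what "similar reviews" could mean in practice, here is a minimal sketch that compares reviews by their word shingles using Jaccard similarity. This is an assumption made for illustration, not necessarily the approach taken in the full post; the function names, the review ids, the shingle size and the threshold are all hypothetical.

```python
import re
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_pairs(reviews, threshold=0.5, k=3):
    """Yield (id1, id2, score) for review pairs whose similarity reaches the threshold.

    `reviews` maps a hypothetical review id to its text; all reviews are
    assumed to be written on the same Work.
    """
    sets = {rid: shingles(text, k) for rid, text in reviews.items()}
    for (id1, s1), (id2, s2) in combinations(sets.items(), 2):
        score = jaccard(s1, s2)
        if score >= threshold:
            yield id1, id2, score
```

Comparing every pair is quadratic in the number of reviews, so a sketch like this only makes sense when reviews are first grouped per Work; at the scale of millions of reviews an approximate technique would likely be needed instead.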

more ...

How to do data integration, BRD example (part2)

Physical Data model

This post presents the physical data model. Compared to the logical model, it contains many more tables. Relational databases are less flexible than schema-less NoSQL environments, and a highly normalized model is one technique used to mitigate that rigidity through extension. We accommodate changes by adding new structures as we discover new attributes and relationships relevant to our evolving needs. Interested readers can check methods like Data Vault or Anchor Modeling.
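
To make "extension by adding new structures" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names are hypothetical and are not taken from the actual BRD model; the point is only that a newly discovered attribute (here, a rating) gets its own table instead of altering an existing one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Core table: only the stable identity of a review (hypothetical names).
    CREATE TABLE review (
        review_id INTEGER PRIMARY KEY,
        site      TEXT NOT NULL
    );
    -- Each attribute lives in its own table keyed by review_id.
    CREATE TABLE review_text (
        review_id INTEGER PRIMARY KEY REFERENCES review(review_id),
        body      TEXT NOT NULL
    );
""")
conn.execute("INSERT INTO review VALUES (1, 'Goodreads')")
conn.execute("INSERT INTO review_text VALUES (1, 'A real page turner.')")

# Later we discover a new attribute (a rating). Instead of altering
# the existing tables, we add a new structure next to them.
conn.execute("""
    CREATE TABLE review_rating (
        review_id INTEGER PRIMARY KEY REFERENCES review(review_id),
        rating    INTEGER NOT NULL
    )
""")
conn.execute("INSERT INTO review_rating VALUES (1, 4)")

# Existing queries keep working; new ones join in the extra table.
row = conn.execute("""
    SELECT r.site, t.body, g.rating
    FROM review r
    JOIN review_text t USING (review_id)
    LEFT JOIN review_rating g USING (review_id)
""").fetchone()
```

The cost of this flexibility is visible in the last query: reading a full record means joining several narrow tables, which is one reason the physical model has so many more tables than the logical one.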

To explain some details of the physical data model, we'll look at the code. Although SQL is not well suited to self-documenting code, most DB engines support explicit ...

more ...

How to do data integration, BRD example

Data Integration: one of the main BI functions

BI environment architecture is often left as an afterthought. The business pressures technical teams for delivery, so they quickly jump into designing star schemas or dimensional models (the Presentation Layer) and neglect the Integration Layer. End result: no separation of concerns exists between the integration and presentation aspects.

Integration and Presentation are critical functions that must be decoupled into separate layers (at least logically) reflecting their independent goals and specifications. Integration is concerned with capturing raw and untransformed data originating from sources, while Presentation applies transformation and business rules to derive ...
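
The separation described above can be sketched in a few lines. This is only an illustration under stated assumptions: the record fields, the function names and the business rule (ignore ratings that cannot be parsed) are hypothetical and do not come from the BRD project.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RawReview:
    """Integration layer record: stored as received, no business rules applied."""
    source: str   # site the record came from
    work_id: str  # hypothetical identifier of the reviewed Work
    rating: str   # kept as the raw string delivered by the source


def integrate(records):
    """Integration layer: capture source records untransformed."""
    return [RawReview(**rec) for rec in records]


def average_rating_by_work(raw_reviews):
    """Presentation layer: apply a business rule and derive an average per Work."""
    ratings = defaultdict(list)
    for review in raw_reviews:
        try:
            ratings[review.work_id].append(float(review.rating))
        except ValueError:
            continue  # the rule lives here, not in the integration layer
    return {work: sum(vals) / len(vals) for work, vals in ratings.items()}
```

Because the raw records are never modified, changing the business rule later only means rewriting the presentation function and re-deriving; nothing has to be harvested again.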

more ...

Why Book Review?

Tags: BRD

Passion

To keep working on any personal project you need (above all) motivation. Without any constraints or external pressure to work on something, what can help you maintain motivation? Can factors like potential gain, popularity or recognition help? Answer from personal experience: no. These probably help at the beginning, but in the long run they'll leave you unfulfilled.

So what else can bring you lasting motivation? One word: passion! Working on things compatible with your personal interests makes your work feel less like work and more like leisure! Personally, I enjoy working on data-oriented projects so I only need to find ...

more ...