BRD - DataPlatform Franchising

Preamble

This introduction is written as a piece of anticipatory fiction (a "récit d'anticipation"), hopefully predicting the outcome of the first DataPlatform Franchising experiment!

Follow the BRD tag for follow-ups.

BRD: the first DataPlatform Franchising

BRD, or Book Review Dataset, is a rich dataset gathering millions of book reviews and ratings along with book reference data. Its goal is "understanding books" rather than "understanding readers". Information about reviewers/users is anonymized (to be discussed), with only demographic information available (gender, country, age, etc.).

BRD showcases end-to-end BI and data integration expertise applied to produce a high-quality dataset collected from social media sites.

BRD is relevant to many analytical needs, from applications such as recommender systems, data mining and machine learning to online analytical processing (OLAP). It is useful for anyone looking to gain insights into:

Book explicit and implicit relationships
Book readers' socio-demographics
Review text mining to identify fake reviews, sock puppets, griefers and troll reviewers
Book sentiment and opinion analysis and its evolution
Book recommendation engine and collaborative filtering
Book ratings evolution timeline and analysis
Book and author characteristics versus reviews
Book reviews text analysis through sophisticated NLP and text mining techniques
Book preference variation across cultural and language differences

It provides ready-to-consume data:

Book title (original and translated)
Book author
Book reference data (ISBN, Library of Congress subjects, MDS, ...)
Socio-demographics of readers (reviewers)
Standardized review rating
Review text (cleaned, de-duplicated, with formatting and tags stripped, ...)
Tags given to books

It contains over ? million reviews collected from well-known sites (feel free to contact us if you run a book review site):

LibraryThing
Goodreads
Amazon
International Amazon (br, ca, fr, de, in, it, jp, mx, nl, es, uk and au)
Babelio

With BRD, you can start doing analytics on millions of book reviews without any initial investment!

Benefits for website owners

First off, BRD has a pay-per-use pricing policy applicable to all registered users. A percentage of the profit generated is shared among site owners pro rata, i.e. in proportion to the number of reviews each site contributes.
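To make the pro-rata idea concrete, here is a minimal sketch of how a shared amount could be split by review count. The function name, the site names, the review counts and the amount of 1000.0 are all made up for illustration; the actual percentage that goes into the shared pool is not decided here.

```python
def pro_rata_shares(pool, reviews_by_site):
    """Split `pool` among sites in proportion to their review counts.

    `pool` is the amount set aside for site owners (hypothetical figure).
    """
    total = sum(reviews_by_site.values())
    if total == 0:
        # No reviews contributed yet: nothing to distribute.
        return {site: 0.0 for site in reviews_by_site}
    return {site: pool * count / total for site, count in reviews_by_site.items()}


# Hypothetical sites and counts, for illustration only.
shares = pro_rata_shares(
    1000.0,
    {"site_a": 600_000, "site_b": 300_000, "site_c": 100_000},
)
print(shares)
```

With these made-up numbers, a site contributing 60% of the reviews receives 60% of the pool.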

BRD's focus is not on understanding user/reader behavior and profiles, so it only needs each site user's unique id for integration purposes (data de-duplication and demographics rollup). BRD will not expose this id to its registered users (except for users coming from one of the participating sites).

Besides monetizing your data asset, you can also enhance your own analytical capability by extending your database with many more millions of reviews.

If you choose to participate in BRD, you can also take advantage of BRD's data integrity checks and obtain special reports on:

Data anomalies
Plagiarized reviews of the same book
Duplicate reviews and reviewers across sites
Fake reviews or spam
Other on-demand analysis

BRD Data Integration

BRD collects and integrates heterogeneous, multi-source datasets into its cloud-based solution. All reviews are reported at the Work and language level. The Work is the main integration point: it consolidates the reviews assigned to all Books regardless of edition, translation or format (print, digital or other). Refer to LibraryThing's (LT) definition of a Work. For simplicity, the terms Work and Book are used interchangeably.
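As a rough sketch of what Work-level consolidation could look like, the snippet below groups reviews by Work and language, ignoring edition and format. Everything in it is an assumption made for illustration: the edition-to-Work lookup table, the field names (`edition_id`, `language`, `site`, `rating`), and the reading of "language" as the language of the review.

```python
from collections import defaultdict

# Hypothetical edition-to-Work lookup; real ids would come from the
# book reference data.
EDITION_TO_WORK = {
    "isbn-print-en": "work-1",
    "isbn-ebook-en": "work-1",
    "isbn-print-fr": "work-1",
    "isbn-other": "work-2",
}


def consolidate_by_work(reviews):
    """Group reviews by (work_id, language), ignoring edition and format."""
    grouped = defaultdict(list)
    for review in reviews:
        work_id = EDITION_TO_WORK.get(review["edition_id"])
        if work_id is None:
            continue  # unknown edition: simply skipped in this sketch
        grouped[(work_id, review["language"])].append(review)
    return dict(grouped)


reviews = [
    {"edition_id": "isbn-print-en", "language": "en", "site": "site_a", "rating": 8},
    {"edition_id": "isbn-ebook-en", "language": "en", "site": "site_b", "rating": 6},
    {"edition_id": "isbn-print-fr", "language": "fr", "site": "site_c", "rating": 9},
    {"edition_id": "isbn-other", "language": "en", "site": "site_a", "rating": 7},
]
by_work = consolidate_by_work(reviews)
for key, group in by_work.items():
    print(key, len(group))
```

Here the print and e-book editions of the same Work end up in one group, while the French reviews of that Work are reported separately.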

BRD Data Cleansing

BRD applies procedures to cleanse, conform and validate data:

Review data is de-duplicated
- Spot readers posting the same review within or across sites
Data error correction
- Reviews with data issues can be flagged or corrected
- Reviews that are too short or considered not meaningful can be filtered out
Data harmonization
- Ratings are normalized to a 10-point scale to accommodate both full-star and half-star systems
- Tags can be aggregated across sites and reported as-is, or reformatted to merge similar tags (case-insensitive, singular form)
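The cleansing steps listed above can be sketched in a few lines. This is only an illustration under stated assumptions: source ratings use a 5-star scale (with or without half stars), a "duplicate" is the same text on the same Work once case and whitespace are ignored, and the singular form is obtained with a deliberately naive rule that will get irregular words wrong. All function and field names are hypothetical.

```python
from collections import Counter


def normalize_rating(value, scale_max=5):
    """Map a rating on a 0..scale_max scale onto a 10-point scale."""
    return value * 10 / scale_max


def normalize_tag(tag):
    """Lowercase a tag and apply a naive singular-form rule."""
    t = tag.strip().lower()
    if t.endswith("ies") and len(t) > 4:
        return t[:-3] + "y"          # "mysteries" becomes "mystery"
    if t.endswith("s") and not t.endswith("ss") and len(t) > 3:
        return t[:-1]                # "classics" becomes "classic"
    return t


def merge_tags(tags):
    """Count tags after merging case and singular/plural variants."""
    return Counter(normalize_tag(t) for t in tags)


def dedupe_reviews(reviews):
    """Keep the first review per (work_id, normalized text) pair."""
    seen = set()
    unique = []
    for review in reviews:
        key = (review["work_id"], " ".join(review["text"].lower().split()))
        if key in seen:
            continue
        seen.add(key)
        unique.append(review)
    return unique


sample = [
    {"work_id": "work-1", "site": "site_a", "text": "A  wonderful read."},
    {"work_id": "work-1", "site": "site_b", "text": "a wonderful read."},
    {"work_id": "work-1", "site": "site_b", "text": "Not for me."},
]
print(normalize_rating(3.5))
print(merge_tags(["Classics", "classic", "Mysteries", "mystery"]))
print(len(dedupe_reviews(sample)))
```

A real pipeline would probably need fuzzy matching for near-duplicates and a proper lemmatizer for tags; exact matching is used here only to keep the sketch short.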

BRD initial steps

The initial goal is to get enough reviews (maybe for 10-15% of the Books on LibraryThing) to validate the cloud DW design choices and run realistic experiments with the analytics/visualization applications.

I'll contact site owners so they are aware of this experiment and of my intention to harvest reviews for these books. Although the data is publicly available, that does not give me the legal right to harvest their sites (web harvesting is a huge business on the Internet, yet no clear jurisprudence exists). Among other things, you need to respect site licensing, limit your request rate, and deal with a lot more issues that are over my head... in other words, be a good citizen.
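On the purely technical side of being a good citizen, two habits are easy to sketch: honoring a site's robots.txt and spacing out requests. The snippet below is an illustration only and is no substitute for checking each site's licensing terms. The robots.txt content, the "brd-bot" user agent, the example.org URLs and the `PoliteThrottle` class are all made up; the parser itself is the standard library's `urllib.robotparser`.

```python
import time
from urllib import robotparser


class PoliteThrottle:
    """Enforce a minimum delay between consecutive requests to one site."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least `min_interval` seconds have passed."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Made-up robots.txt content; a real crawler would fetch the site's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rules = robotparser.RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

# Fall back to a 1-second delay (an arbitrary choice) if none is declared.
delay = rules.crawl_delay("brd-bot") or 1.0
throttle = PoliteThrottle(delay)

print(rules.can_fetch("brd-bot", "https://example.org/work/1"))
print(rules.can_fetch("brd-bot", "https://example.org/private/x"))
```

In a harvesting loop, each URL would first be checked with `rules.can_fetch(...)`, and `throttle.wait()` would be called right before every request.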