February 2015 has been a big month for snow this year across the nation.  I love to think about all those snowflakes all being unique, like fingerprints and social security numbers.  Almost immediately my mind transitions into data analytics because I start thinking about star and snowflake schemas.  Those two schemas are of course the long standing optimal database designs for a data mart.  Let’s explore a little more why a star schema benefits your association’s data mart.

This is an image of a star schema intended to capture event registrations for analytical purposes.  As with so many other data marts, this one is populated from a normalized, transactional, enterprise CRM/AMS system.

registrant star

In the center of the diagram is what is referred to as the fact table.  The fact table contains field references to the other tables in the schema, referred to as dimension tables.  The fact table also holds measurable amounts of the fact, such as registration fee amount.  There might be a small quantity of other fields as well, such as in this case where there is a field that indicates if the registration was a guest or not (holding 1 or 0 for its value).

Moving on to the dimension tables, we see that each table has a set of standard fields for load date, active flag, valid from date, and valid to date.  Plus, each has a primary key unique to the table that is not from the source system.  This is called a surrogate key.  The recommended field type for the primary key in dimension tables is an auto incrementing integer because it ensures uniqueness.  Dimension tables will enable your data to be sliced and diced and allow you to segment your association’s measures which are being held in the fact table.

Let’s assess the benefit of this architecture because you might be pondering why it is so much better than the traditional relational database structure of a transactional database, such as an AMS/CRM.

Change Tracking – change tracking happens at the dimension table level which is why it has all those standard fields for dates and active flag. There are multiple types of change (slowly changing dimensions) which are outside the scope of this topic, but the short version is that some types of change will yield an additional row in the same table representing the same customer. The structure supports this because of the surrogate key and active flag (with dates).

  1. Scale – Since all keys are integers, the amount of disk space consumed is minimal – compared for example to using a GUID (36 characters). Also, you do not have to do calculations on the fly for the thousands or millions of records in a data set when you have dimension tables such as the date table.  The date table stores all the specifics about a particular date, such as the quarter in which it occurred and each of the date parts and their long and short names. Specifically one place you might look in this diagram is registration type. Sure, it might be a little easier to store the registration type in the fact table instead of as a reference, but for hundreds of thousands of rows, storing the text value of the registration type is not as efficient/scalable as storing an integer that represents it, where the registration type is stored once.
  2. Table Width – creating large dimension tables has a benefit in analysis with a tool such as Tableau because it lends itself to easily building drill downs. One such example is storing country, state, and city with the individual record. That is a natural hierarchy – just be careful to aggregate data to the audience’s needs.

It turns out that this post could be much longer because there are entire books dedicated to star schemas with more information like don’t store NULL in any field and don’t store short codes like M for Male.  However, this is a good place to stop and head out to play in the snow!