A disciplined approach to data modelling in the non-relational space

Shashank Baravani
11 min read · Mar 28, 2021

The process of data modelling is a complex and elaborate one. My most important learning over the years has been that one should strive to draw a clear distinction between identifying what kind of data store to opt for, deciding how the data is to be modelled, organised and stored, and finally choosing the specific technologies.

Relational modelling is typically driven by the structure of the available data, aptly described by the idiom “What answers do I have?”. Conversely, non-relational (and non-search) data modelling is typically driven by access patterns, as indicated by the idiom “What questions do I have?”. The overarching goal is to optimise for scale. The purpose of this blog is to focus on data modelling in the non-relational space. I believe these are the high level steps one can follow.

1. Data store characterisation
2. Logical data model - sharding strategy
3. Logical data model - data summarisation
4. Logical data model - table and column design
5. Tech stack selection
6. Physical data model

Data store characterisation

I believe this is the most important part of data modelling: one should first identify the characteristics of the data store that will solve the problem at hand. The goal is not to be able to say, for example, “I need Cassandra or HBase” at the end of the exercise. Instead the intent is to be able to say “I need a columnar key-value store”, “I need a relational store with strong support for aggregations” or “I need a write-optimised OLTP database with strong consistency guarantees”. This requires understanding the business domain, the nature of the underlying data, the high level access patterns and the consistency model demanded by the use cases.

Nature of data

We can understand the nature of data by asking some of the following questions:

1. Do I need a search index that supports free text search and pagination, with some additional capabilities around ranking and sorting? Hint: you are most probably looking for a search store.
2. Are some parts of the data always going to be requested by the entity id? Will the requested entity data also require additional data from related entities? Will some of this data be read/written at web scale? Hint: you are most probably looking for a column oriented k-v store.
3. Do the entities whose data we are trying to store have rich and complex relationships amongst themselves? Are there stringent integrity constraints that govern the values some of the entity attributes can take? Hint: you are most probably looking for a relational data store.
4. Are we dealing with temporal data whose creation happens at monotonically increasing points in time and whose variation across time is of interest? Hint: you are most probably looking for a time series data store.

Access patterns

To understand access patterns in their totality we mainly need to understand the cardinality of the queried fields, the query complexity and the frequency of updates.

  • Query profile: What is the composition of queries in terms of lookups, exact match queries and range queries?
  • Query complexity: If the queries are complex and involve a large number of fields then the data store will have to support a large number of secondary indexes or some other efficient mechanism of querying.
  • Cardinality of queried fields: The cardinality of the queried fields can reveal which file systems and index data structures are the most efficient for supporting the queries.
+-------------------------------+-------------------+----------------+
| Cardinality of queried fields | Data structures   | Store          |
+-------------------------------+-------------------+----------------+
| High or unique                | Hash table/B+tree | Relational/k-v |
| Mid & equal distribution      | B+tree/LSM trees  | Relational/k-v |
| Low & unequal distribution    | Bitmaps           | Relational     |
| Very low                      | Roaring bitmaps   | Search         |
+-------------------------------+-------------------+----------------+
  • Read and write coverage: Will the reads and writes happen uniformly across the data universe?
  • Support for aggregations: Do the queries involve aggregating across multiple dimensions over a large volume of data, say, spanning a year or so? In such scenarios you are going to need a data store that supports aggregations at scale as a first class citizen.
  • Frequency of updates: Is the volume of reads expected to be orders of magnitude greater than writes, or vice versa? The data store will have to be optimised accordingly for the read or the write path.
  • Volume of data and scale of overall traffic: This will determine if you need a distributed data store that can handle web scale storage and compute.

A detailed analysis of database storage engines under the hood can be found here.

Operational hardness

Though not a mandatory step, given that most companies prefer to build software in the cloud, the following checks are useful.

  • Bulk upload support: Are you going to find yourself doing bulk uploads, either on a periodic basis or even as a backup?
  • Out of the box mirroring for disaster recovery: Some frameworks are quite amenable to disaster recovery in the sense that they support duplication of data with an eye on failover after a catastrophic event, e.g. Kafka.
  • Cloud agnostic technologies: If you don’t want to be tied to a specific cloud provider then it makes sense to stick to cloud agnostic technologies.

Consistency Model

The use cases will reveal whether we need a data store that supports the strongest forms of consistency or whether an eventually consistent store will suffice. Consistency models are a separate topic of discussion and merit their own write up, which can be found here.

Partitioning or Sharding Strategy

  • Horizontal partitioning (sharding): Horizontal partitioning can be done in the following ways
- Round robin partitioning: just a random distribution of data across the different shards
- Hash partitioning: data lands in a shard as determined by the hash of the primary key
- Range partitioning: data within a range of values of the primary key ends up in a shard, e.g. where the primary key is or contains a timestamp, the data for a specific day can end up in one shard
- Range partitioning with salting: same as above but with an additional salt in the primary key to avoid hot spotting, i.e. too much traffic landing in specific ranges (see the sketch after this list)
  • Vertical partitioning: This vertically slices the data by grouping together related attributes, e.g. a user’s payment details can be one vertical partition while their profile details can be another. In some columnar data stores this is accomplished simply through column families.
  • Functional partitioning or application level sharding: All data for a specific geography can go into a shard, e.g. put data for US in one shard and data for IN in another.
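As an illustration, here is a minimal Python sketch of hash partitioning and salted range keys computed on the application side. The shard count, salt bucket count and key format are assumptions made for the example, not prescriptions.

import hashlib

NUM_SHARDS = 16        # assumed shard count, purely illustrative
NUM_SALT_BUCKETS = 8   # spreads a hot key range across a few buckets

def hash_shard(primary_key: str) -> int:
    # Hash partitioning: the target shard is a stable hash of the key
    digest = hashlib.md5(primary_key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def salted_range_key(entity_id: str, day: str) -> str:
    # Range partitioning with salting: a small prefix spreads a hot
    # time range across several key ranges
    salt = hash_shard(entity_id) % NUM_SALT_BUCKETS
    return f"{salt:02d}-{entity_id}-{day}"

print(hash_shard("user234"))                      # some shard in [0, 15]
print(salted_range_key("user234", "2021-03-28"))  # e.g. "05-user234-2021-03-28"

At read time a range scan fans out over all the salt buckets and the results are merged.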

Data summarisation

Once we determine that data needs to be organised in a non-relational way, what are the different techniques for data modelling? I came across this excellent blog that does justice to this topic in a very comprehensive way, but for a brief summary you may take a look at the following techniques.

De Normalisation

Localisation of data to satisfy the data needs of a single query, which optimises for speed over storage, with the following trade-offs:

  • Query data volume (IO per query) vs total data volume: Localisation of data to support fast querying leads to duplication of data and large data footprints. How far are we willing to go?
  • Processing complexity vs total data volume: Should we optimise for compute or storage? If you are in the cloud (GCP, AWS or Azure) then compute is cheaper than storage, so optimise accordingly. A small example follows below.
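To make this concrete, here is a small sketch (with hypothetical field names) of a normalised order split across entities versus a denormalised order document that can serve a single “order details” query without joins:

# Normalised: separate entities, joined at query time
user    = {"user_id": "u1", "name": "Asha", "city": "Bengaluru"}
product = {"sku": "p42", "title": "Keyboard", "price": 1499}
order   = {"order_id": "o9", "user_id": "u1", "sku": "p42", "qty": 1}

# Denormalised: one read serves the whole "order details" query,
# at the cost of duplicated user and product data
order_doc = {
    "order_id": "o9",
    "qty": 1,
    "user": {"user_id": "u1", "name": "Asha", "city": "Bengaluru"},
    "product": {"sku": "p42", "title": "Keyboard", "price": 1499},
}

Any change to the user’s city now has to be propagated to every order document that embeds it, which is the storage and write-amplification price paid for fast reads.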

Universal Schema

A soft schema allows one to form classes of entities with complex internal structures (nested entities) and to vary the structure of particular entities, e.g. a single product schema for all products across various categories, with varying attributes, in an e-commerce application.
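A minimal sketch of such a soft product schema (the attribute names are made up): every product shares a few core fields, while category-specific attributes live in a free-form map.

# Shared core fields; category specific attributes live in a free-form map
tshirt = {
    "product_id": "p1",
    "category": "apparel",
    "title": "Plain Tee",
    "attributes": {"size": "L", "colour": "navy", "fabric": "cotton"},
}
phone = {
    "product_id": "p2",
    "category": "electronics",
    "title": "Budget Phone",
    "attributes": {"ram_gb": 4, "storage_gb": 64, "battery_mah": 5000},
}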

Aggregation

Combine multiple related entities into a single entity, e.g. a user’s account, address and payment details can be condensed into a single UserProfile entity.
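A sketch of that UserProfile aggregate (field names assumed), where entities that were previously separate rows become nested sections of one document:

user_profile = {
    "user_id": "u1",
    "account":  {"email": "asha@example.com", "status": "active"},
    "address":  {"line1": "12 MG Road", "city": "Bengaluru", "pin": "560001"},
    "payments": [{"type": "card", "last4": "4242"}, {"type": "upi", "vpa": "asha@okbank"}],
}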

Flattening

  • Tree aggregates can be stored as a simple JSON document, e.g. the hierarchy of reportees under a senior employee of a large organisation
  • Trees as materialised paths, where the key is a composite consisting of the path traversed, e.g. restaurants in various locations in a food ordering application (see the sketch below)
  • Graphs as adjacency lists, e.g. relationship graphs of a social networking website
  • Trees as nested sets
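For instance, a minimal sketch (hypothetical keys and values) of a tree flattened into materialised paths and a graph flattened into adjacency lists:

# Tree as materialised paths: the composite key encodes the path traversed
restaurants_by_path = {
    "IN/KA/Bengaluru/Indiranagar": ["r101", "r102"],
    "IN/KA/Bengaluru/Koramangala": ["r103"],
    "IN/MH/Mumbai/Bandra":         ["r201", "r202"],
}
# A prefix scan on "IN/KA/Bengaluru/" returns every restaurant in the city.

# Graph as adjacency lists: each key maps a user to the users they follow
follows = {
    "jose":  ["fred", "gabriel", "hiroshi"],
    "sofia": ["hiroshi", "seo_yoon", "jakob"],
}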

Application side Joins

In the case of queries that require joins but cannot be fully localised, and where the data required for the join can fit in memory, this technique can be handy, e.g. offers and promotions data from one source can be joined with payments data from another source in an application side join.
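A minimal sketch of such a join performed in application code (the field names are hypothetical), assuming the smaller offers dataset fits in memory:

# Smaller dataset, loaded fully into memory and indexed by the join key
offers = {
    "ORDER1": {"promo": "NEWYEAR10", "discount": 100},
    "ORDER3": {"promo": "FREESHIP", "discount": 40},
}

# Larger dataset, streamed or paged from another store
payments = [
    {"order_id": "ORDER1", "amount": 900},
    {"order_id": "ORDER2", "amount": 450},
]

# Hash join on the application side: probe the in-memory map per payment
joined = [
    {**p, **offers.get(p["order_id"], {"promo": None, "discount": 0})}
    for p in payments
]
# joined -> payment rows enriched with promo details where a match exists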

Distributed locks as a proxy for transactional behaviour

The lock can be taken on a composite key when the transaction spans many entities.
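As an illustration, here is a sketch of acquiring such a lock with Redis (assuming the redis-py client; the key names and timeout are made up). It is a simplified outline, not a production-grade lock.

import uuid
import redis

r = redis.Redis()

def with_entity_lock(entity_ids, ttl_seconds=10):
    # Composite lock key covering all entities touched by the "transaction"
    lock_key = "lock:" + ":".join(sorted(entity_ids))
    token = str(uuid.uuid4())  # identifies this lock owner
    # SET NX succeeds only if no one else holds the lock; EX guards against crashes
    acquired = r.set(lock_key, token, nx=True, ex=ttl_seconds)
    return lock_key, token, bool(acquired)

lock_key, token, ok = with_entity_lock(["user234", "order991"])
if ok:
    try:
        pass  # perform the multi-entity update here
    finally:
        # naive release; a production version would verify the token via a Lua script
        r.delete(lock_key)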

Dimensionality Reduction

The intent here is to reduce an n-dimensional space to an (n-1)-dimensional space to simplify representation, storage and processing. This can be achieved using various techniques, a couple of which are sketched after the list below.

  • Composite keys: Create a composite key, which is unique, by combining all candidate attributes when it is clear that every attribute forming the composite key is going to participate in the query. For example, a data set comprising user ids, the websites they have visited and the count of visits can have a composite key of userid and website.
  • Merge dimensions into a composite value: e.g. the latitude and longitude of a point in 2D space can be represented as a geohash in 1D space.
  • Split dimensions: a single record with n dimensions is split into n records with 1 dimension each, e.g. a user location composed of city, state and country can be split into user-city, user-state and user-country.
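A minimal sketch of the composite key and split dimensions techniques (field names and key formats are purely illustrative):

# Composite key: (userid, website) uniquely identifies a visit-count record
visit_counts = {
    ("user234", "facebook.com"): 17,
    ("user234", "linkedin.com"): 5,
}
# Or flattened into a single string key for a k-v store
flat_key = "user234#facebook.com"

# Split dimensions: one 3-dimensional record becomes three 1-dimensional records
user_location = {"user": "user234", "city": "Bengaluru", "state": "KA", "country": "IN"}
split_records = [
    {"key": "user234-city",    "value": user_location["city"]},
    {"key": "user234-state",   "value": user_location["state"]},
    {"key": "user234-country", "value": user_location["country"]},
]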

Inverted Index

For use cases that require a large number of secondary indexes, we can create reverse lookups on each attribute that can potentially participate in a query. Of course, this makes sense only if the attribute on which the reverse lookup is built is unique or has very high cardinality; otherwise it will lead to storage problems for attributes that map to a large number of keys. For example, queries on an employee database on various high cardinality fields such as department, team, date of joining, etc.
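A tiny sketch of building such reverse lookups in memory (hypothetical records; in practice each index would be a separate table keyed by the attribute value):

from collections import defaultdict

employees = [
    {"emp_id": "e1", "team": "payments", "doj": "2020-01-15"},
    {"emp_id": "e2", "team": "catalogue", "doj": "2020-01-15"},
    {"emp_id": "e3", "team": "payments", "doj": "2021-03-01"},
]

# One inverted index per queryable attribute: value -> list of primary keys
index_by_team = defaultdict(list)
index_by_doj = defaultdict(list)
for e in employees:
    index_by_team[e["team"]].append(e["emp_id"])
    index_by_doj[e["doj"]].append(e["emp_id"])

print(index_by_team["payments"])   # ['e1', 'e3']
print(index_by_doj["2020-01-15"])  # ['e1', 'e2']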

Table Design

Tall and narrow tables vs short and wide tables:

This is a very handy link to follow up on table and row key design for time series data. But a generalised approach is presented here.

  • In general, and especially for non time series data, this decision can be based entirely on the pivot of your queries, the scope of data served in each query and the frequency of updates. As a thumb rule, prefer short and narrow tables that are focussed on a specific area of the business domain, e.g. one table to store user-to-address data and another to store user-to-interests data. The former changes rarely while the latter changes more frequently; this lets you optimise each table differently.
  • For time series data, the above decision can be made depending on the cardinality of what is being measured. So, for example, if you are going to store values of different measures seen over a period of time, then tall and narrow tables are preferred, where each row represents a timestamp and the columns represent values of different measures. This optimises for aggregations over time periods. However, if you have a slightly different use case where you are planning to store transaction data of millions of merchants, then following the same strategy won’t be useful, since you are most likely to query by merchant id and not aggregate over a timestamp. In such cases a short and wide table is more useful (see the sketch after this list).
  • Single universal table vs table per entity: If you are storing partial aggregates for various kinds of sales reports, do you store all the data in one table or do you go for separate tables for the various products on which sales are being measured? This depends entirely on the volume of data and how amenable the data is to a universal schema without causing too much sparsity.
  • Single column family with multiple columns vs multiple column families: This depends on how much and what data is going to be sent out in each query. For example, if you are sure to request all of the product details from the catalogue then put all the columns in a single family. If, on the contrary, you are going to request a user’s payment, profile and subscription details separately, then put them in different column families. The frequency of updates and retention needs might also vary across column families.
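To illustrate the tall-and-narrow versus short-and-wide trade-off, here is a rough sketch (row keys and column names are invented) of data laid out both ways:

# Tall and narrow: one row per timestamp, a handful of measure columns.
# Good for aggregating a measure across a time range.
tall_narrow_rows = {
    "2021-03-28T10:00": {"cpu": 0.61, "memory": 0.72, "disk": 0.33},
    "2021-03-28T10:01": {"cpu": 0.64, "memory": 0.70, "disk": 0.34},
}

# Short and wide: one row per merchant, many columns keyed by timestamp.
# Good when queries pivot on the merchant rather than on time.
short_wide_rows = {
    "merchant123": {"txn#2021-03-28T10:00": 450, "txn#2021-03-28T10:07": 1200},
    "merchant456": {"txn#2021-03-28T10:03": 99},
}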

Column families

  • Group related columns based on information consumed per query and frequency of reads or writes
  • Allocate columns to column families based on data retention needs.

Columns

  • Treat column qualifiers as data by defining them to be composite keys, e.g. in a “follows” table the row for Jose can have the column qualifiers Fred:book-club, Gabriel:work and Hiroshi:tennis, while the row for Sofia has Hiroshi:work, Seo Yoon:school and Jakob:chess-club.
  • Create as many columns as you need in the table, since a flexible schema-on-read is the goal of data modelling in NoSQL databases.
  • Avoid using too many columns in any single row. Even though a table can have millions of columns, a row should not.

Rows

  • Rows that are too big can result in reduced read performance.
  • All data that needs to be updated atomically should go to a single row.
  • Store related entities in adjacent rows, to make reads more efficient.

Row keys

Row keys have to be designed keeping in mind query patterns and data storage.

  • Entity Id as Row key: This is the most basic k-v model
  • Composite key comprising the entity Id as Row key: This is the composite key pattern, which uses the key itself to localise data that is associated with the entity but derived from other entities.
Key = user234-facebook.com | Value= collection of visit details
Key = user234-linkedin.com | Value= collection of visit details
  • Entity Id with Day/Month/Year suffix as Row key: When the values are a collection of temporal data with length running into billions, it makes sense to distribute the data across rows by suffixing the primary key with the Day/Month/Year. This also supports range queries within the context of an entity id.
Key = merchant123-Jan-2021 | Value= set of time stamped transactions
Key = merchant123-Feb-2021 | Value= set of time stamped transactions ......
Key = merchant123-Dec-2021 | Value= set of time stamped transactions
  • Monotonically increasing row keys support range queries, but they can also concentrate writes on a single node and cause hot spotting, so consider salting or otherwise spreading such keys when the write volume is high (see the sketch below).
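A small sketch (illustrative key format) of how entity-plus-time row keys keep one merchant’s data contiguous and range-scannable while keeping different merchants apart:

def row_key(merchant_id: str, year_month: str) -> str:
    # e.g. "merchant123#2021-01": entity id first, then a sortable time suffix
    return f"{merchant_id}#{year_month}"

table = {
    row_key("merchant123", "2021-01"): ["txn-a", "txn-b"],
    row_key("merchant123", "2021-02"): ["txn-c"],
    row_key("merchant456", "2021-01"): ["txn-d"],
}

# A prefix scan returns all of merchant123's 2021 rows and nothing else
rows_2021 = {k: v for k, v in table.items() if k.startswith("merchant123#2021-")}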

Tech stack selection

The overall tech stack selection process is driven by ACID properties, CAP properties, the consistency model, resiliency characteristics, performance and operational maturity. Having said that, do remember that you would have already made certain decisions around ACID, CAP and consistency models during the first phase of data store characterisation. Consequently, for this phase, the focus will be more on resiliency, performance and operational maturity, which will enable us to finalise a particular technology.

Physical Data Model

At this point the physical data model is only a procedural eventuality. It would involve, for example, DDL statements and other configurations tied to the specific technology. The finer aspects will be covered by that technology’s documentation, so I won’t go into the details of physical data modelling here.

So this brings us to the end of this blog. To summarise once again, we covered the following steps in varying detail. We started with data store characterisation and then proceeded to the logical data model, which catered to a specific kind of data store. The logical data model focussed on sharding, data summarisation and table design. Lastly, we spoke of tech stacks and the checks that simplify the selection process.

Hope you found this useful and will contemplate using this template for your production use cases.
