Logical Data Modeling

Data Modeling Methodology

Process to design a logical model

Uses a top-down approach
Can be algorithmically defined
Effective in the long run, in contrast to treating schema design as a dark art

Query-Driven Data Modeling

This methodology uses the conceptual data model, application workflow and access patterns as inputs to generate the logical data model.

The methodology's mapping rules and patterns guide schema design and ensure that the resulting logical model is correct, meaning that every query in the workflow is supported by a table.

This logical model is described using the Chebotko diagram notation.

Logical Model

Chebotko Diagram

Visual diagram for Cassandra tables and access patterns

Chebotko Diagrams

Graphical representation of Cassandra database schema design
Documents the logical and physical data model


Chebotko Diagram Notation

Table representation

Logical-level shows column names and properties
Physical-level also shows the column data type

Although the physical-level Chebotko diagram appears to only have some additional information, it actually reflects the later optimized database schema after performing the analysis and validation steps. It should have all the information needed to instantiate the table in CQL.
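
As a rough illustration of "all the information needed to instantiate the table in CQL", the sketch below renders a CREATE TABLE statement from a physical-level table description. The column names, CQL types, and clustering order are assumptions made for the example, and the dictionary layout and to_cql helper are hypothetical, not part of the Chebotko notation.

```python
# Hypothetical physical-level description of one table.
# Column names, types, and ordering are assumed for illustration.
videos_by_user = {
    "name": "videos_by_user",
    "columns": [                      # (column name, CQL type)
        ("user_id", "uuid"),
        ("uploaded_timestamp", "timestamp"),
        ("video_id", "timeuuid"),
        ("title", "text"),
    ],
    "partition_key": ["user_id"],
    "clustering": [                   # (column name, sort order)
        ("uploaded_timestamp", "DESC"),
        ("video_id", "ASC"),
    ],
}


def to_cql(table):
    """Render a CREATE TABLE statement from a physical-level description."""
    cols = ",\n".join(f"  {name} {cql_type}" for name, cql_type in table["columns"])
    # Partition key goes in its own parentheses, followed by clustering columns.
    pk_parts = ["(" + ", ".join(table["partition_key"]) + ")"]
    pk_parts += [name for name, _ in table["clustering"]]
    stmt = (
        f"CREATE TABLE {table['name']} (\n{cols},\n"
        f"  PRIMARY KEY ({', '.join(pk_parts)})\n)"
    )
    if table["clustering"]:
        order = ", ".join(f"{name} {direction}" for name, direction in table["clustering"])
        stmt += f" WITH CLUSTERING ORDER BY ({order})"
    return stmt + ";"


print(to_cql(videos_by_user))
```

The logical-level diagram would carry only the column names and key markers; the types in the second element of each column pair are what the physical level adds.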

Access patterns

Directed links and query labels show how tables are accessed
Similar to the application workflow, but now with our logical information

Logical UDT diagram

Represents user-defined types and tuples

At the logical level, the name of a UDT diagram is the column name—including the column notation.
For tuples, field names are used at the logical level.

Physical UDT diagram

Represents user-defined types and tuples

At the physical level, a name is given to a UDT or will default to <column_name>_type.
For tuples, field names are not kept at the physical level.
CQL types are only captured at the physical level.

Example Chebotko Diagram

Cassandra Data Modeling Principles

Know your data
Know your queries
Nest data
Duplicate data

The principles reflect the key takeaways from the data-modeling-wins vertex. These slides go into the principles in more detail when transitioning from a conceptual model.

Nesting and duplicating data are essentially forms of data denormalization.

Know Your Data

Understanding the data is the key to successful design

Data captured by conceptual data model
Define what is stored in database
Preserve properties so that data is organized correctly

Components of the conceptual data model include:

Entities
Relationships
Attributes
Keys
Cardinality constraints

Key constraints affect schema design

Entity and relationship keys affect the table primary keys
Primary key uniquely identifies a row / entity / relationship
Composed of a key and possibly additional columns

Both tables can store the same data about videos but the data organization is different.

The key of video is represented by column video_id.

The primary key of videos_by_user includes additional columns besides video_id.
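
A minimal sketch of how the primary key decides row identity, assuming that videos_by_user uses (user_id, video_id) as its primary key (the slides do not list the exact columns). The tables are modeled as plain dictionaries keyed by primary key value, which mirrors the fact that a Cassandra write with an existing primary key overwrites that row. The helper names are hypothetical.

```python
def primary_key(row, key_columns):
    """Build the primary key value of a row from the listed key columns."""
    return tuple(row[column] for column in key_columns)


def upsert(table, row, key_columns):
    """Insert the row, or overwrite the row that has the same primary key."""
    table[primary_key(row, key_columns)] = row


videos = {}          # primary key: (video_id)
videos_by_user = {}  # primary key: (user_id, video_id), assumed for the example

rows = [
    {"video_id": 1, "user_id": "alice", "title": "Intro"},
    {"video_id": 2, "user_id": "alice", "title": "Part 2"},
    {"video_id": 1, "user_id": "alice", "title": "Intro (edited)"},  # same key as row 1
]
for row in rows:
    upsert(videos, row, ["video_id"])
    upsert(videos_by_user, row, ["user_id", "video_id"])

# Both tables hold the same two videos; only the organization differs.
print(sorted(videos))
print(sorted(videos_by_user))
```

Because video_id is part of both primary keys, each video still maps to exactly one row in either table.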

Cardinality constraints affect the key for relationships

One-to-One relationship can use the key from either entity.

Many-to-Many uses the key from both entities.

Key of the "uploads" relationship (top): user id.

Key of the "comments on" relationship (bottom): user id and video id.

Know Your Queries

Queries directly affect schema design

Queries captured by application workflow model
Table schema design changes if queries change

application-workflow

Schema design organizes data to efficiently run queries

Partition per query — ideal
Partition+ per query — acceptable
Table scan — anti-pattern
Multi-table — anti-pattern

Independent queries can be executed in parallel, which is one reason partition per query is the most efficient pattern. However, some queries must execute serially, as in client-side joins, where one query runs first and a subsequent query is built from its result. This can be less efficient even when each query retrieves only a single partition. Client-side joins may be discussed later in more detail.

Partition per query

The most efficient access pattern
Query accesses only one partition to retrieve results
Partition can be single-row or multi-row

Examples:

Find information for a specified user—retrieve one row / partition.
Find all comments for a specified user—retrieve many rows in one partition.

An extreme form of this is to model all data into as few partitions as possible, which may make the partitions too large and hurt performance.
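
The two examples above can be sketched with a toy in-memory table that counts partition reads. The Table class, the table names users and comments_by_user, and the sample data are assumptions for illustration; this models the access pattern only, not Cassandra's storage engine.

```python
from collections import defaultdict


class Table:
    """Toy table: partition key -> list of rows, with a partition-read counter."""

    def __init__(self):
        self.partitions = defaultdict(list)
        self.partitions_read = 0

    def insert(self, partition_key, row):
        self.partitions[partition_key].append(row)

    def read_partition(self, partition_key):
        """Return all rows of one partition and record the access."""
        self.partitions_read += 1
        return list(self.partitions.get(partition_key, []))


users = Table()             # single-row partitions
comments_by_user = Table()  # multi-row partitions

users.insert("u1", {"email": "u1@example.com"})
comments_by_user.insert("u1", {"video": "v1", "comment": "Nice"})
comments_by_user.insert("u1", {"video": "v2", "comment": "Great"})
comments_by_user.insert("u2", {"video": "v1", "comment": "Hmm"})

profile = users.read_partition("u1")              # one row, one partition
comments = comments_by_user.read_partition("u1")  # many rows, one partition
print(len(profile), len(comments), comments_by_user.partitions_read)
```

Each query is answered by a single read_partition call, regardless of how many rows the partition holds.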

Partition+ per query

Less efficient access pattern but not necessarily bad
Query needs to access multiple partitions to retrieve results

An example would be "Find movies that match one of multiple genres".
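
A small sketch of the genre example, assuming a hypothetical movies_by_genre table partitioned by genre. Each requested genre costs one partition access, which is roughly what a CQL IN restriction on the partition key implies.

```python
# Hypothetical table: partition key (genre) -> movie titles in that partition.
movies_by_genre = {
    "comedy": ["Movie A", "Movie B"],
    "drama": ["Movie C"],
    "horror": ["Movie D"],
}


def find_movies(genres):
    """Return matching titles and the number of partitions accessed."""
    titles = []
    partitions_accessed = 0
    for genre in genres:
        partitions_accessed += 1  # one partition per requested genre
        titles.extend(movies_by_genre.get(genre, []))
    return titles, partitions_accessed


print(find_movies(["comedy", "drama"]))
```

The cost grows with the number of genres in the query, not with the size of the table, which is why this pattern is acceptable while a table scan is not.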

Table scan and Multi-table

Least efficient type of query but may be needed in some cases
Query needs to access all partitions in one or more tables to retrieve results

An example of a multi-table query would be "Retrieve all data in a database".

This is a pattern you do not want to use frequently.

It can be justified when you are deliberately minimizing data redundancy and performing relatively efficient client-side joins. The data duplication factor and client-side joins may be discussed in more detail later.

Nest Data

Data nesting is the main data modeling technique

Nesting organizes multiple entities into a single partition
Supports partition per query data access
Three data nesting mechanisms
- Clustering columns—multi-row partitions
- Collection columns
- User-defined type columns

Clustering columns - primary data nesting mechanism

Partition key identifies an entity that other entities will nest into
Values in a clustering column identify the nested entities
Multiple clustering columns implement multi-level nesting

Table videos does not do data nesting.

Table actors_by_video nests all actors featured in a video into a partition for that video; the partition is identified by video_id.

The shaded region denotes nested data.
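
Nesting with a clustering column can be sketched as a mapping from partition key to a sorted list of rows. The sketch assumes actor_name is the clustering column of actors_by_video and that character_name is a regular column; both names are hypothetical.

```python
import bisect
from collections import defaultdict

# partition key (video_id) -> rows kept sorted by the clustering column (actor_name)
actors_by_video = defaultdict(list)


def insert(video_id, actor_name, character_name):
    """Place the row inside its partition, keeping clustering order."""
    bisect.insort(actors_by_video[video_id], (actor_name, character_name))


insert("v1", "Zoe", "Pilot")
insert("v1", "Adam", "Captain")
insert("v2", "Bob", "Narrator")

# All actors of video v1 come back from one partition, already ordered.
print([actor for actor, _ in actors_by_video["v1"]])
```

Adding a second clustering column would correspond to sorting on a longer tuple, which is the multi-level nesting mentioned above.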

User-defined type—secondary data nesting mechanism

Represents a one-to-one relationship, but can be used in conjunction with collections
Easier than working with multiple collection columns

video_type is a user-defined type.

Table videos_by_user nests all videos as a collection in the videos column with the video_type type.
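
A minimal sketch of nesting through a user-defined type inside a collection. The VideoType class stands in for video_type, and its fields are assumed for illustration since the slides do not list them.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoType:
    """Stand-in for the video_type UDT; the fields are assumed."""
    video_id: int
    title: str


# user_id -> value of the videos column (a collection of video_type values)
videos_by_user = defaultdict(list)


def add_video(user_id, video):
    videos_by_user[user_id].append(video)


add_video("u1", VideoType(1, "Intro"))
add_video("u1", VideoType(2, "Part 2"))
add_video("u2", VideoType(3, "Demo"))

# One partition read returns the user together with all nested videos.
print([video.title for video in videos_by_user["u1"]])
```

Grouping the video fields in one type keeps them together in a single collection column, instead of spreading them over several parallel collections.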

Duplicate Data

Better to duplicate than to join data

Partition per query and data nesting may result in data duplication
- Query results are pre-computed and materialized
- Data can be duplicated across tables, partitions, and / or rows

Different "views" of the same data = different queries.

Duplicating data in Cassandra across multiple tables, partitions, and rows is a common practice that is required to efficiently support different queries over the same data.

In the Cassandra world, the trade-off between space efficiency and time efficiency is almost always in favor of the latter. Normalization is not a priority here.

Data duplication can scale, joins cannot

With duplication, Cassandra essentially works as "join on write", possibly needing to write data to multiple tables. This contrasts with relational databases, which normalize data and "join on read".

Query: Find information about videos that include a specified actor.

Design on the left requires a join: If there are 10 videos with the same actor, table videos must be queried for 10 partitions, which may reside on 10 different nodes.

Design on the right requires no join and accesses only one partition for that query.
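
The two designs can be compared with a toy model that counts partition reads. The exact tables in the left design are not spelled out here, so the video_ids_by_actor lookup table is an assumption, as are the function names and sample data.

```python
videos = {}              # left design: partition key video_id
video_ids_by_actor = {}  # left design: actor -> video ids (assumed lookup table)
videos_by_actor = {}     # right design: actor -> duplicated video rows


def add_video(video_id, title, actors):
    """'Join on write': one logical write fans out to every table that needs it."""
    videos[video_id] = {"title": title}
    for actor in actors:
        video_ids_by_actor.setdefault(actor, []).append(video_id)
        videos_by_actor.setdefault(actor, []).append(
            {"video_id": video_id, "title": title}
        )


def find_with_join(actor):
    """Left design: read the id list, then one videos partition per id."""
    partitions_read = 1
    titles = []
    for video_id in video_ids_by_actor.get(actor, []):
        partitions_read += 1
        titles.append(videos[video_id]["title"])
    return titles, partitions_read


def find_duplicated(actor):
    """Right design: the answer is already materialized in one partition."""
    rows = videos_by_actor.get(actor, [])
    return [row["title"] for row in rows], 1


for i in range(10):
    add_video(f"v{i}", f"Video {i}", ["actor1"])

print(find_with_join("actor1")[1], find_duplicated("actor1")[1])
```

The read cost of the joined design grows with the number of videos per actor, while the duplicated design pays that cost once, at write time.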

Mapping Rules

For the query-driven methodology

Mapping rules ensure that a logical data model is correct
Each query has a corresponding table
Tables are designed to allow queries to execute properly
Tables return data in the correct order

Mapping Rule 1: Entities and relationships
Mapping Rule 2: Equality search attributes
Mapping Rule 3: Inequality search attributes
Mapping Rule 4: Ordering attributes
Mapping Rule 5: Key attributes
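
The slides only name the rules at this point, so the sketch below is one plausible reading rather than the definitive algorithm: equality search attributes form the partition key, and inequality search attributes, ordering attributes, and key attributes follow as clustering columns. The function name, the attribute names, and the exact ordering of clustering columns are assumptions.

```python
def derive_primary_key(equality, inequality, ordering, key_attributes):
    """Derive (partition_key, clustering_columns) from a query's attributes.

    Assumed reading of Mapping Rules 2 to 5:
    - equality search attributes become the partition key
    - inequality, ordering, then key attributes become clustering columns
    """
    partition_key = list(equality)
    clustering = []
    for attribute in list(inequality) + list(ordering) + list(key_attributes):
        # Each attribute appears in the primary key at most once.
        if attribute not in partition_key and attribute not in clustering:
            clustering.append(attribute)
    return partition_key, clustering


# Hypothetical query: videos of a given user uploaded after a date, newest first.
print(derive_primary_key(
    equality=["user_id"],
    inequality=["uploaded_timestamp"],
    ordering=["uploaded_timestamp"],
    key_attributes=["video_id"],
))
```

Appending the key attribute last keeps rows unique even when two videos share the same user and upload time.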