Logical Data Modeling

Data Modeling Methodology

Process to design a logical model

  • Uses a top-down approach
  • Can be algorithmically defined
  • Effective in the long run—vs. dark art
query driven methodology

Query-Driven Data Modeling

 

workflow

This methodology uses the conceptual data model, application workflow and access patterns as inputs to generate the logical data model.

This is done using the methodology’s mapping rules and patterns to guide schema design and to ensure that the resulting logical model is correct and works properly.

This logical model is described using the Chebotko diagram notation.

Logical Model

Chebotko Diagram

  • Visual diagram for Cassandra tables and access patterns
chebotko diagram

Chebotko Diagrams

  • Graphical representation of Cassandra database schema design
  • Documents the logical and physical data model
chebotko diagram 1
  • Graphical representation of Cassandra database schema design
  • Documents the logical and physical data model
chebotko diagram 2
  • Graphical representation of Cassandra database schema design
  • Documents the logical and physical data model
chebotko diagram 3
  • Graphical representation of Cassandra database schema design
  • Documents the logical and physical data model
chebotko diagram 4
  • Graphical representation of Cassandra database schema design
  • Documents the logical and physical data model
chebotko diagram 5

Chebotko Diagram Notation

Table representation

  • Logical-level shows column names and properties
  • Physical-level also shows the column data type
table schema

Although the physical-level Chebotko diagram appears to only have some additional information, it actually reflects the later optimized database schema after performing the analysis and validation steps. It should have all the information needed to instantiate the table in CQL.

Access patterns

  • Directed links and query label shows how tables are accessed
  • Similar to the application workflow, but now with our logical information
access patterns

Logical UDT diagram

  • Represents user defined types and tuples
logical udt diagrams
  • At the logical level, the name of a UDT diagram is the column name—​including the column notation.
  • For tuples, field names are used at the logical level.

Physical UDT diagram

  • Represents user defined types and tuples
physical udt diagrams
  • At the physical level, a name is given to a UDT or will default to <column_name>_type.
  • For tuples, field names are not kept at the physical level.
  • CQL types are only captured at the physical level.

Example Chebotko Diagram

complete chebotko

Cassandra Data Modeling Principles

  • Know your data
  • Know your queries
  • Nest data
  • Duplicate data

The principles reflect the key takeaways from the data-modeling-wins vertex. These slides goes into these principles in more detail when transitioning from a conceptual model.

Nesting and duplicating data is essentially data denormalization.

Know Your Data

Understanding the data is the key to successful design

  • Data captured by conceptual data model
  • Define what is stored in database
  • Preserve properties so that data is organized correctly
er diagram

Components of the conceptual data model includes:

  • Entities
  • Relationships
  • Attributes
  • Keys
  • Cardinality constraints

Key constraints affect schema design

  • Entity and relationship keys affect the table primary keys
  • Primary key uniquely identifies a row / entity / relationship
  • Composed of a key and possibly additional columns
primary key

Both tables can store the same data about videos but the data organization is different.

The key of video is represented by column video_id.

The primary key of videos_by_user includes additional columns besides video_id.

Cardinality constraints affect the key for relationships

cardinality constraints

One-to-One relationship can use the key from either entity.

Many-to-Many uses the key from both entities.

uploads' key (top): User id.

comments on' key (bottom): User id and Video id.

Know Your Queries

Queries directly affect schema design

  • Queries captured by application workflow model
  • Table schema design changes if queries change

application-workflow

Schema design organizes data to efficiently run queries

  • Partition per query — ideal
  • Partition+ per query — acceptable
  • Table scan — anti-pattern
  • Multi-table — anti-pattern

Multiple queries can be executed in parallel, which makes partition per query the most efficient. However it is possible to design queries to execute serially, such as in the case of client-side joins. This is where a query will execute first, and then a subsequent query can be executed based on the previous result. This may not be as efficient even if retrieving only a single partition per query. Client-side joins may be discussed later in more detail.

Partition per query

  • The most efficient access pattern
  • Query accesses only one partition to retrieve results
  • Partition can be single-row or multi-row
partition query

Examples:

  • Find information for a specified user—​retrieve one row / partition.
  • Find all comments for a specified user—​retrieve many rows in one partition.
An extreme form of this is to model all data into as few partitions as possible, which may make them too large and have undesirable performance.

Partition+ per query

  • Less efficient access pattern but not necessarily bad
  • Query needs to access multiple partitions to retrieve results
multi partition query

An example would be "Find movies that match one of multiple genres".

Table scan and Multi-table

  • Least efficient type of query but may be needed in some cases
  • Query needs to access all partitions in a table(s) to retrieve results
table query

An example of a multi-table query would be "Retrieve all data in a database".

This is a pattern you do not want to use frequently.

You may know what you’re doing if you are actively minimizing data redundancy and performing relatively efficient, client-side joins. More information about this may be discussed later for data duplication factor and client side joins.

Nest Data

Data nesting is the main data modeling technique

  • Nesting organizes multiple entities into a single partition
  • Supports partition per query data access
  • Three data nesting mechanisms

    • Clustering columns—​multi-row partitions
    • Collection columns
    • User-defined type columns

Clustering columns - primary data nesting mechanism

  • Partition key identifies an entity that other entities will nest into
  • Values in a clustering column identify the nested entities
  • Multiple clustering columns implement multi-level nesting
nest data clustering

Table videos does not do data nesting.

Table actors_by_video nests all videos that an actor was featured into a partition for that actor—​the partition is identified by video_id.

The shaded region denotes nested data.

User-defined type—​secondary data nesting mechanism

  • Represents one-to-one relationship, but can use in conjunction with collections
  • Easier than working with multiple collection columns
nest data udt

`video_type is a user-defined type.

Table videos_by_user nests all videos as a collection in the videos column with the video_type type.

Duplicate Data

Better to duplicate than to join data

  • Partition per query and data nesting may result in data duplication

    • Query results are pre-computed and materialized
    • Data can be duplicated across tables, partitions, and / or rows

duplicate

Different "views" of the same data = different queries.

Duplicating data in Cassandra across multiple tables, partitions, and rows is a common practice that is required to efficiently support different queries over the same data.

In the Cassandra world, the trade-off between space efficiency and time efficiency is almost always in favor of the latter. Normalization is not a priority here.

Data duplication can scale, joins cannot

duplicate join

With duplication, Cassandra essentially works as "join on write", needing to possibly write data to multiple tables. This contrasts with relational databases that normalizes data and "joins on read".

Query: Find information about videos that includes a specified actor.

Design on the left requires a join: If there are 10 videos with the same actor, table videos must be queried for 10 partitions, which may reside on 10 different nodes.

Design on the right requires no join and accesses only one partition for that query.

Mapping Rules

For the query-driven methodology

  • Mapping rules ensure that a logical data model is correct
  • Each query has a corresponding table
  • Tables are designed to allow queries to execute properly
  • Tables return data in the correct order
  • Mapping Rule 1: Entities and relationships
  • Mapping Rule 2: Equality search attributes
  • Mapping Rule 3: Inequality search attributes
  • Mapping Rule 4: Ordering attributes
  • Mapping Rule 5: Key attributes

MR1: Entities And Relationships

  • Entity and relationship types map to tables
  • Entities and relationships map to partitions or rows
  • Partition may have data about one or more entities and relationships
  • Attributes are represented by columns
mr1 1
Violating this rule means you may have an incomplete data model

Rationale
Data from the conceptual level must also be preserved at the logical level.
The same data can be duplicated across multiple tables, partitions, or rows.

  • Entity and relationship types map to tables
  • Entities and relationships map to partitions or rows
  • Partition may have data about one or more entities and relationships
  • Attributes are represented by columns
mr1 2
Violating this rule means you may have an incomplete data model

Rationale
Data from the conceptual level must also be preserved at the logical level.
The same data can be duplicated across multiple tables, partitions, or rows.

  • Entity and relationship types map to tables
  • Entities and relationships map to partitions or rows
  • Partition may have data about one or more entities and relationships
  • Attributes are represented by columns
mr1 3
Violating this rule means you may have an incomplete data model

Rationale
Data from the conceptual level must also be preserved at the logical level.
The same data can be duplicated across multiple tables, partitions, or rows.

  • Entity and relationship types map to tables
  • Entities and relationships map to partitions or rows
  • Partition may have data about one or more entities and relationships
  • Attributes are represented by columns
mr1 4
Violating this rule means you may have an incomplete data model

Rationale
Data from the conceptual level must also be preserved at the logical level.
The same data can be duplicated across multiple tables, partitions, or rows.

  • Entity and relationship types map to tables
  • Entities and relationships map to partitions or rows
  • Partition may have data about one or more entities and relationships
  • Attributes are represented by columns
mr1 5
Violating this rule means you may have an incomplete data model

Rationale
Data from the conceptual level must also be preserved at the logical level.
The same data can also be duplicated across multiple tables, partitions, or rows.
A partition may have data about one or more entities and relationships.

Relationship type example

  • Each relationship becomes a row in the table
  • Relationship type attributes are represented by columns
  • Queries and relationship cardinality affects the design of the primary key
mr1 relationship

Table videos_by_user supports queries on user_id, and optionally, video_id.
Table users_by_video supports queries on video_id.

MR2: Equality Search Attributes

Attributes to query on must be in the front of the primary key

  • Primary key is an ordered set of columns, made up of partition key and clustering columns
  • A partition key is formed by one or more of the leading primary key columns
  • Supported queries must include all partition key columns in the query
mr2 1
Violating this rule means that you may not be able to query on a specific column

Rationale Table can only be queried on columns that make up the front of a primary key, which must include all partition key columns. The table can also be queried on all of the primary key columns.

Entity type example

  • Equality search attributes become initial columns of a primary key
  • Querying on: title and type
mr2 entity

All tables support queried on title and type. However the ordering of the primary key columns can change what columns can be queried. In 'videos_by_title_type', title and type must both be queried since they make up the primary key. In videos_by_title, title is first and makes up the partition key, and can always be queried. type can only be queried if title is queried too. In videos_by_type, type is first and can always be queried, and title can only be queried in addition with type.

Relationship type example

  • Equality search attributes become initial columns of a primary key
  • Querying on: last name and first name
mr2 relationship

All tables support querying on last_name and first_name. In addition, all tables support querying on last_name and first_name and video_id. videos_by_user_2 can also query on just one column: last_name.

MR3: Inequality Search Attributes

Inequality search attributes become clustering columns

  • Clustering columns follow partition key columns in a primary key
  • The column involved in an inequality search must come after columns in the primary key that are used in an equality search
mr3 1
Violating this rule means you may not be able to do a certain inequality search or range scan

Partition key columns do not allow for inequality searches, although there is a token-based inequality search. However, results from a token-based query are not meaningful for the commonly used partitioners. For queries that require an inequality search but without an equality search, an arbitrary column or bucket can be created for use as a search predicate.

Rationale Clustering columns support inequality (range) query. Only one clustering column can be used in an inequality predicate.

Entity type example

  • Inequality search attributes become clustering columns
  • Querying on: last_name = ? and registration_date > ?
mr3 entity

In users, registration_date is a clustering column.
The primary keys are formed by columns that are queried on, last_name (equality), and registration_date (inequality).
The primary keys also include columns that correspond to the key attributes, user_id.

Relationship type example

  • Inequality search attributes become clustering columns
  • Querying on: user_id = ? and uploaded_timestamp > ?
mr3 relationship

In table videos_by_user, uploaded_timestamp is a clustering column.

How is the primary key formed? * Equality search attributes * Inequality search attributes * Key attributes

MR4: Ordering Attributes

Ordering attributes map to clustering columns

  • In each partition, CQL rows are ordered based on the clustering columns
  • Ascending or descending order
mr4 1
Violating this rule means the query may not return column values in the order you want

Entity type example

  • Ordering attributes become clustering columns
  • Querying on: last_name = ? and registration_date > ?
  • Ordering attributes: registration_date (ASC)
mr4 entity

In both tables, registration_date is a clustering column. The tables maintain the DESC and ASC order, respectively, for the rows in each partition. Retrieving default ordering is efficient. It can also be reversed in a query, which is slightly less efficient. Setting CLUSTERING ORDER BY can order columns that wouldn’t be possible for a ORDER BY query, such as having registration_date DESC and user_id ASC.

Relationship type example

  • Ordering attributes become clustering columns
  • Querying on: user_id = ? and uploaded_timestamp > ?
  • Ordering attributes: uploaded_timestamp (DESC)
mr4 relationship

uploaded_timestamp (DESC) and video_id (ASC) are clustering columns.

How is the primary key formed? * Equality search attributes user_id * Inequality search attributes uploaded_timestamp * Ordering attributes uploaded_timestamp, video_id * Key attributes video_id

MR5: Key Attributes

Key attributes map to primary key columns

  • Primary key must include the columns that represent key attributes
  • Position and order of such columns may vary
  • Primary key may have additional columns to support specific queries
Violating this rule may result in upsert operations—​loss of data

Rationale
Entities and relationships are uniquely identified by key attribute values.
Row in a table is uniquely identified by primary key column values.

Entity type example

  • Entity type key attributes are included as primary key columns
  • Queries also affect primary key design
mr5 entity

Key attribute for entity type User: id.
Tables users and users_by_last_name have the column user_id that correspond to the key attribute and are part of the primary key for these tables.

Relationship type example

  • Relationship type key attributes are included as primary key columns
  • Queries affect primary key design
mr5 relationship

Key attribute for relationship posts: id.
id represents a role of Video in a relationship; a video can only participate in the relationship one time.
Tables videos_by_user and users_by_video have column video_id that corresponds to the key attribute and is part of the primary keys for these tables.

Applying Mapping Rules

  • Create a table schema from the conceptual data model and for each query
  • Apply the mapping rules in order
apply mp 1
  • Create a table schema from the conceptual data model and for each query
  • Apply the mapping rules in order—​MR1
apply mp 2
  • Create a table schema from the conceptual data model and for each query
  • Apply the mapping rules in order—​MR1—​MR2
apply mp 3
  • Create a table schema from the conceptual data model and for each query
  • Apply the mapping rules in order—​MR1—​MR2—​MR3
apply mp 4
  • Create a table schema from the conceptual data model and for each query
  • Apply the mapping rules in order—​MR1—​MR2—​MR3—​MR4
apply mp 5
  • Create a table schema from the conceptual data model and for each query
  • Apply the mapping rules in order—​MR1—​MR2—​MR3—​MR4—​MR5
apply mp 6

Exercise 11—​Extend the KillrVideo Logical Model

Mapping Patterns

  • Semi-formal definitions for common mapping use cases
  • Provides a graphical reference for designing a logical model
mapping patterns 1
  • Only covers use cases that nest data with the clustering column mechanism.
  • Ordering of the results is not taken into consideration.
  • These limitations mean that mapping patterns complement modeling principles and mapping rules, but doesn’t replace them.
  • Still need to consider modeling principles and mapping rules.
  • Semi-formal definitions for common mapping use cases
  • Provides a graphical reference for designing a logical model
mapping patterns 2
  • Only covers use cases that nest data with the clustering column mechanism.
  • Ordering of the results is not taken into consideration.
  • These limitations mean that mapping patterns complement modeling principles and mapping rules, but doesn’t replace them.
  • Still need to consider modeling principles and mapping rules.
  • Semi-formal definitions for common mapping use cases
  • Provides a graphical reference for designing a logical model
mapping patterns 3
  • Only covers use cases that nest data with the clustering column mechanism.
  • Ordering of the results is not taken into consideration.
  • These limitations mean that mapping patterns complement modeling principles and mapping rules, but doesn’t replace them.
  • Still need to consider modeling principles and mapping rules.
  • Semi-formal definitions for common mapping use cases
  • Provides a graphical reference for designing a logical model
mapping patterns 4
  • Only covers use cases that nest data with the clustering column mechanism.
  • Ordering of the results is not taken into consideration.
  • These limitations mean that mapping patterns complement modeling principles and mapping rules, but doesn’t replace them.
  • Still need to consider modeling principles and mapping rules.

Helps guide schema design

  • Based on mapping rules and modeling principles for correctness and efficiency
  • Enables automation of a schema design
  • For each query, find an applicable mapping pattern and apply it

Query-Driven Methodology Mapping Patterns

Nine different patterns to use, depending on query requirements

  • (2) Entity mapping patterns
  • (2) 1:1 relationship mapping patterns
  • (2) 1:n relationship mapping patterns
  • (2) m:n relationship mapping patterns
  • (1) Hierarchical mapping pattern
Mapping Pattern Data Nesting Duplication

Entity and 1:1 relationship—​partitioned by key attributes

Maybe

No

Entity and 1:1 relationship—​partitioned by non-key attributes

Maybe

Maybe

1:n relationship

Yes

Maybe

m:n relationship

Yes

Yes

Hierarchical

No

No

Entities and 1:1 relationships partitioned by key attributes may have data nesting if there are multiple key attributes. Entities and 1:1 relationships partitioned by non-key attributes will only have duplicated data if querying on a set-valued attribute, which essentially would be converted to a 1:n or m:n relationship. 1:n relationships may have data duplication if querying on entity on the right. m:n relationships patterns potentially the most expensive data duplication

Entity Mapping Patterns

Query attributes = key attributes

mapping patterns entity 1

A slightly different entity called Movie is used for this example, where title and release year make up the key attributes for the entity Movie.

Key attributes are in blue, regular attributes are in green.

Query attributes != key attributes

mapping patterns entity 2

A slightly different entity called Movie is used for this example, where title and release year make up the key attributes for the entity Movie.

Key attributes are in blue, regular attributes are in green.

Relationship Mapping Patterns

Design a table for a query accessing a relationship of multiple entities

  • Mapping patterns vary depending on the relationship cardinality

    • 1:1 relationship
    • 1:n relationship
    • m:n relationship

1:1 relationship mapping patterns

  • 1:1 relationship mapping is similar to entity mapping
  • Relationship entities can be merged together into one entity
  • The key attributes can be derived from either one of the entities
mapping patterns rel 1 1

In this 1:1 relationship, attributes from both entities were merged together, and the key attributes from Movie was arbitrarily chosen to be used for the merged entity Movie - First Showing.

For the Movie and Movie - First Showing entity, key attributes are in blue and regular attributes are in green. For the First Showing entity, key attribute are in red and regular attributes are in orange.

1:n Relationship Mapping Patterns

Query attributes = key attributes

mapping patterns rel 1 n 1

Primary key: All query attributes, followed by all key attributes of User uploads Video. Static columns: Non-key attributes of User, if and only if all key attributes of User are part of the partition key.

Query attributes != key attributes

mapping patterns rel 1 n 2

Primary key: All query attributes, followed by all key attributes of User uploads Video. Additional User attributes can be added to the table at the cost of duplicating them for every video.

n:1 relationship is the reverse direction of 1:n relationship

mapping patterns rel n 1
  • Mapping is similar to a 1:1 relationship with data duplication
  • Data nesting only if the table is partitioned by non-key attributes
  • Attributes for the entity on the right will be duplicated

For the Video entity, key attributes are in blue and regular attributes are in green. For the User entity, key attribute are in red and regular attributes are in orange. Attributes for the relationship is uploaded by is in white.

m:n Relationship Mapping Patterns

Query attributes = key attributes

mapping patterns rel m n 1

Primary key: All query attributes, followed by all key attributes of Video features Actor. Static columns: Non-key attributes of Video, if and only if all key attributes of Video are part of the partition key. Columns title and description are not included in the table to the right due to unwanted duplication.

Query attributes != key attributes

mapping patterns rel m n 2

Primary key: All query attributes, followed by all key attributes of Video features Actor. Potentially expensive data duplication.

Hierarchical Mapping Patterns

Three approaches to designing hierarchies

  • Creates one table as a supertype for hierarchies
  • Simplest to use with both covering and not covering constraints
  • Separate table for each type (not discussed here)
  • Table per subtype (not discussed here)

Not only is there no duplication, but there is also no wasted space due to Cassandra’s storage model. Other databases may still allocate disk space even if columns do not have values.

Disjoint

mapping patterns hierarchical disjoint

The type column is used to specify the video’s type, and can be TV Show, Movie, or something else.

Not Disjoint

mapping patterns hierarchical not disjoint

type is now a set column so that multiple types can be assigned to a video.

Exercise 12—​Map additional requirements for KillrVideo

Physical Data Model

physical trans