Process to design a logical model
This methodology uses the conceptual data model, application workflow and access patterns as inputs to generate the logical data model.
This is done using the methodology’s mapping rules and patterns to guide schema design and to ensure that the resulting logical model is correct and works properly.
This logical model is described using the Chebotko diagram notation.
Chebotko Diagram
Table representation
Although the physical-level Chebotko diagram appears to only have some additional information, it actually reflects the later optimized database schema after performing the analysis and validation steps. It should have all the information needed to instantiate the table in CQL.
Access patterns
Logical UDT diagram
Physical UDT diagram
<column_name>_type
.The principles reflect the key takeaways from the data-modeling-wins vertex. These slides goes into these principles in more detail when transitioning from a conceptual model.
Nesting and duplicating data is essentially data denormalization.
Understanding the data is the key to successful design
Components of the conceptual data model includes:
Key constraints affect schema design
Both tables can store the same data about videos but the data organization is different.
The key of video
is represented by column video_id
.
The primary key of videos_by_user
includes additional columns besides video_id
.
Cardinality constraints affect the key for relationships
One-to-One relationship can use the key from either entity.
Many-to-Many uses the key from both entities.
uploads'
key (top): User id
.
comments on'
key (bottom): User id
and Video id
.
Queries directly affect schema design
Schema design organizes data to efficiently run queries
Multiple queries can be executed in parallel, which makes partition per query the most efficient. However it is possible to design queries to execute serially, such as in the case of client-side joins. This is where a query will execute first, and then a subsequent query can be executed based on the previous result. This may not be as efficient even if retrieving only a single partition per query. Client-side joins may be discussed later in more detail.
Partition per query
Examples:
An extreme form of this is to model all data into as few partitions as possible, which may make them too large and have undesirable performance. |
Partition+ per query
An example would be "Find movies that match one of multiple genres".
Table scan and Multi-table
An example of a multi-table query would be "Retrieve all data in a database".
This is a pattern you do not want to use frequently. |
You may know what you’re doing if you are actively minimizing data redundancy and performing relatively efficient, client-side joins. More information about this may be discussed later for data duplication factor and client side joins.
Data nesting is the main data modeling technique
Three data nesting mechanisms
Clustering columns - primary data nesting mechanism
Table videos
does not do data nesting.
Table actors_by_video
nests all videos that an actor was featured into a partition for that actor—the partition is identified by video_id
.
The shaded region denotes nested data.
User-defined type—secondary data nesting mechanism
`video_type is a user-defined type.
Table videos_by_user
nests all videos as a collection in the videos
column with the video_type
type.
Better to duplicate than to join data
Partition per query and data nesting may result in data duplication
Different "views" of the same data = different queries.
Duplicating data in Cassandra across multiple tables, partitions, and rows is a common practice that is required to efficiently support different queries over the same data.
In the Cassandra world, the trade-off between space efficiency and time efficiency is almost always in favor of the latter. Normalization is not a priority here.
Data duplication can scale, joins cannot
With duplication, Cassandra essentially works as "join on write", needing to possibly write data to multiple tables. This contrasts with relational databases that normalizes data and "joins on read".
Query: Find information about videos that includes a specified actor.
Design on the left requires a join: If there are 10 videos with the same actor, table videos
must be queried for 10 partitions, which may reside on 10 different nodes.
Design on the right requires no join and accesses only one partition for that query.
For the query-driven methodology
Violating this rule means you may have an incomplete data model |
Rationale
Data from the conceptual level must also be preserved at the logical level.
The same data can be duplicated across multiple tables, partitions, or rows.
Violating this rule means you may have an incomplete data model |
Rationale
Data from the conceptual level must also be preserved at the logical level.
The same data can be duplicated across multiple tables, partitions, or rows.
Violating this rule means you may have an incomplete data model |
Rationale
Data from the conceptual level must also be preserved at the logical level.
The same data can be duplicated across multiple tables, partitions, or rows.
Violating this rule means you may have an incomplete data model |
Rationale
Data from the conceptual level must also be preserved at the logical level.
The same data can be duplicated across multiple tables, partitions, or rows.
Violating this rule means you may have an incomplete data model |
Rationale
Data from the conceptual level must also be preserved at the logical level.
The same data can also be duplicated across multiple tables, partitions, or rows.
A partition may have data about one or more entities and relationships.
Relationship type example
Table videos_by_user
supports queries on user_id
, and optionally, video_id
.
Table users_by_video
supports queries on video_id
.
Attributes to query on must be in the front of the primary key
Violating this rule means that you may not be able to query on a specific column |
Rationale Table can only be queried on columns that make up the front of a primary key, which must include all partition key columns. The table can also be queried on all of the primary key columns.
Entity type example
title
and type
All tables support queried on title
and type
. However the ordering of the primary key columns can change what columns can be queried.
In 'videos_by_title_type', title
and type
must both be queried since they make up the primary key.
In videos_by_title
, title
is first and makes up the partition key, and can always be queried. type
can only be queried if title
is queried too.
In videos_by_type
, type
is first and can always be queried, and title
can only be queried in addition with type
.
Relationship type example
last name
and first name
All tables support querying on last_name
and first_name
.
In addition, all tables support querying on last_name
and first_name
and video_id
.
videos_by_user_2
can also query on just one column: last_name
.
Inequality search attributes become clustering columns
Violating this rule means you may not be able to do a certain inequality search or range scan |
Partition key columns do not allow for inequality searches, although there is a token-based inequality search. However, results from a token-based query are not meaningful for the commonly used partitioners. For queries that require an inequality search but without an equality search, an arbitrary column or bucket can be created for use as a search predicate.
Rationale Clustering columns support inequality (range) query. Only one clustering column can be used in an inequality predicate.
Entity type example
last_name
= ? and registration_date
> ?In users
, registration_date
is a clustering column.
The primary keys are formed by columns that are queried on, last_name
(equality), and registration_date
(inequality).
The primary keys also include columns that correspond to the key attributes, user_id
.
Relationship type example
user_id
= ? and uploaded_timestamp
> ?In table videos_by_user
, uploaded_timestamp is a clustering column.
How is the primary key formed? * Equality search attributes * Inequality search attributes * Key attributes
Ordering attributes map to clustering columns
Violating this rule means the query may not return column values in the order you want |
Entity type example
last_name
= ? and registration_date
> ?registration_date
(ASC)In both tables, registration_date
is a clustering column. The tables maintain the DESC and ASC order, respectively, for the rows in each partition.
Retrieving default ordering is efficient. It can also be reversed in a query, which is slightly less efficient.
Setting CLUSTERING ORDER BY can order columns that wouldn’t be possible for a ORDER BY query, such as having registration_date
DESC and user_id
ASC.
Relationship type example
user_id
= ? and uploaded_timestamp
> ?uploaded_timestamp
(DESC)uploaded_timestamp
(DESC) and video_id
(ASC) are clustering columns.
How is the primary key formed?
* Equality search attributes user_id
* Inequality search attributes uploaded_timestamp
* Ordering attributes uploaded_timestamp
, video_id
* Key attributes video_id
Key attributes map to primary key columns
Violating this rule may result in upsert operations—loss of data |
Rationale
Entities and relationships are uniquely identified by key attribute values.
Row in a table is uniquely identified by primary key column values.
Entity type example
Key attribute for entity type User
: id
.
Tables users
and users_by_last_name
have the column user_id
that correspond to the key attribute and are part of the primary key for these tables.
Relationship type example
Key attribute for relationship posts
: id
.
id
represents a role of Video
in a relationship; a video can only participate in the relationship one time.
Tables videos_by_user
and users_by_video
have column video_id
that corresponds to the key attribute and is part of the primary keys for these tables.
Helps guide schema design
Nine different patterns to use, depending on query requirements
Mapping Pattern | Data Nesting | Duplication |
---|---|---|
Entity and 1:1 relationship—partitioned by key attributes |
Maybe |
No |
Entity and 1:1 relationship—partitioned by non-key attributes |
Maybe |
Maybe |
1:n relationship |
Yes |
Maybe |
m:n relationship |
Yes |
Yes |
Hierarchical |
No |
No |
Entities and 1:1 relationships partitioned by key attributes may have data nesting if there are multiple key attributes. Entities and 1:1 relationships partitioned by non-key attributes will only have duplicated data if querying on a set-valued attribute, which essentially would be converted to a 1:n or m:n relationship. 1:n relationships may have data duplication if querying on entity on the right. m:n relationships patterns potentially the most expensive data duplication
Query attributes = key attributes
A slightly different entity called Movie
is used for this example, where title
and release year
make up the key attributes for the entity Movie
.
Key attributes are in blue, regular attributes are in green.
Query attributes != key attributes
A slightly different entity called Movie
is used for this example, where title
and release year
make up the key attributes for the entity Movie
.
Key attributes are in blue, regular attributes are in green.
Design a table for a query accessing a relationship of multiple entities
Mapping patterns vary depending on the relationship cardinality
1:1 relationship mapping patterns
In this 1:1 relationship, attributes from both entities were merged together, and the key attributes from Movie
was arbitrarily chosen to be used for the merged entity Movie - First Showing
.
For the Movie
and Movie - First Showing
entity, key attributes are in blue and regular attributes are in green.
For the First Showing
entity, key attribute are in red and regular attributes are in orange.
Query attributes = key attributes
Primary key: All query attributes, followed by all key attributes of User uploads Video
.
Static columns: Non-key attributes of User
, if and only if all key attributes of User
are part of the partition key.
Query attributes != key attributes
Primary key: All query attributes, followed by all key attributes of User uploads Video
.
Additional User
attributes can be added to the table at the cost of duplicating them for every video.
n:1 relationship is the reverse direction of 1:n relationship
For the Video
entity, key attributes are in blue and regular attributes are in green.
For the User
entity, key attribute are in red and regular attributes are in orange.
Attributes for the relationship is uploaded by
is in white.
Query attributes = key attributes
Primary key: All query attributes, followed by all key attributes of Video features Actor
.
Static columns: Non-key attributes of Video
, if and only if all key attributes of Video
are part of the partition key.
Columns title
and description
are not included in the table to the right due to unwanted duplication.
Query attributes != key attributes
Primary key: All query attributes, followed by all key attributes of Video features Actor
.
Potentially expensive data duplication.
Three approaches to designing hierarchies
Not only is there no duplication, but there is also no wasted space due to Cassandra’s storage model. Other databases may still allocate disk space even if columns do not have values.
Disjoint
The type
column is used to specify the video’s type, and can be TV Show
, Movie
, or something else.
Not Disjoint
type
is now a set column so that multiple types can be assigned to a video.