About Me

My photo
I am passionate .NET developer who believes in the power of collaboration and craftsmanship.

Blog Archive

Tuesday, February 3, 2015

Cassandra DB - Event Store (Part II First Model)

My first model is based on a simple SQL implementation that I have used in the past which is defined as follows:

CREATE TABLE [dbo].[Events](
[Id] [int] PRIMARY KEY IDENTITY,
[Name] [nvarchar](50) NOT NULL,
[Version] [int] NOT NULL,
[Data] [varbinary](max) NOT NULL
) ON [PRIMARY]

The table contains only four columns:
  • Id - An identity integer value which is the primary key and used to determine the version of the Event Store with the following query.
  • SELECT MAX(Id) as version FROM Events
  • Name - Is a key used to identify the stream of events for an aggregate.
  • Version - Used to sort the events for a particular stream to ensure they are order in chronological order.
  • Data - The raw binary representation of the events for the given key and version.
The first modeling challenge that I have to deal with is how to determine the version of the Event Store on system start. Since Cassandra does not have an Identity column and because its great for time series modeling the logical decision is to try and use the timestamp datatype. So lets see how that could look.

CREATE KEYSPACE IF NOT EXISTS 
"EventStore" WITH replication = 
{'class':'SimpleStrategy', 'replication_factor':3};

CREATE TABLE IF NOT EXISTS
 "EventStore"."Events" (
id uuid,
name varchar,
version int,
version_time_stamp timestamp,
data blob,
PRIMARY KEY(id, name))
WITH CLUSTERING ORDER BY (name DESC);

CREATE INDEX 
IF NOT EXISTS version_index ON 
"Events"(version_time_stamp);

This time around there are five columns and one index:

  • id - primary partition key
  • name - Name is the same as the SQL model
  • version - The same as the SQL model
  • version_time_stamp - Time stamp when the record was created
  • data - The same as the SQL model
  • version_index - Index Used to pull missing records

Now that I have a table model and index defined I would like to write the following queries to get the current version of the event store and compare that value against the last know version and if it differs then use that last known value as a parameter in the second query to get all of the missing records.

//Get the current version
SELECT version_timestamp
FROM Events 
LIMIT 1

//Get All the records since the last version
SELECT id, name,version,version_timestamp,data 
FROM Events 
WHERE version_index >= ?


The problem with this solution is that I can not use version_index without the partition key, the version_timestamp needs to be a part of the CLUSTERING ORDER BY statement in the table definition to ensure proper ordering and my secondary index does not have high cardinality. Which does not follow the recommendations from DataStax.

When to Use Secondary IndexesCassandra's built-in secondary indexes are best on a column family having many rows that contain the indexed value. The more unique values that exist in a particular column, the more overhead you will have, on average, to query and maintain the index. For example, suppose you had a user table with a billion users and wanted to look up users by the state they lived in. Many users will share the same column value for state (such as CA, NY, TX, etc.). This would be a good candidate for a secondary index.
When Not to Use Secondary Indexes
Do not use secondary indexes to query a huge volume of records for a small number of results. For example, if you create indexes on columns that have many distinct values, a query between the fields will incur many seeks for very few results. In the column family with a billion users, looking up users by their email address (a value that is typically unique for each user) instead of by their state, is likely to be very inefficient. It would probably be more efficient to manually maintain a dynamic column family as a form of an index instead of using a secondary index. For columns containing unique data, it is sometimes fine performance-wise to use secondary indexes for convenience, as long as the query volume to the indexed column family is moderate and not under constant load.
I basically have a model that is full of impedances mismatches with my queries. So what is the next step? Well I will go through that the next time but here is a hint based on some advice that got recently from DataStax's Luke Tillman author of CQL Poco on modeling with Cassandra.
Many times, you'll end up with a "table per query" type data model, where you insert multiple copies of the data at write time with each table designed to handle a specific query.
Before signing off let me recommend two excellent resources created by Luke Tillman which are a great slide aimed at .NET Developers and the other is Luke's blog and in particular the following post if you are just getting started.

Thanks for reading. 

No comments:

Post a Comment