CREATE TABLE [dbo].[Events](
    [Id] [int] PRIMARY KEY IDENTITY,
    [Name] [nvarchar](50) NOT NULL,
    [Version] [int] NOT NULL,
    [Data] [varbinary](max) NOT NULL
) ON [PRIMARY]
The table contains only four columns:
- Id - An identity integer that serves as the primary key and is used to determine the version of the Event Store with the following query:
SELECT MAX(Id) as version FROM Events
The first modeling challenge I have to deal with is how to determine the version of the Event Store on system start. Since Cassandra does not have an identity column, and because it is great for time series modeling, the logical decision is to try the timestamp data type. So let's see how that could look.
CREATE KEYSPACE IF NOT EXISTS "EventStore"
    WITH replication = {'class':'SimpleStrategy', 'replication_factor':3};

CREATE TABLE IF NOT EXISTS "EventStore"."Events" (
    id uuid,
    name varchar,
    version int,
    version_time_stamp timestamp,
    data blob,
    PRIMARY KEY (id, name)
) WITH CLUSTERING ORDER BY (name DESC);

CREATE INDEX IF NOT EXISTS version_index ON "EventStore"."Events" (version_time_stamp);
- id - The partition key
- name - The same as the SQL model; also the clustering column
- version - The same as the SQL model
- version_time_stamp - Time stamp when the record was created
- data - The same as the SQL model
- version_index - Index used to pull missing records
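As a small illustration of how a record might be written against this model, here is a sketch of an insert. The event name and the blob value are made-up placeholders, and toTimestamp(now()) assumes Cassandra 2.2 or later (older versions offer dateOf(now()) instead).

```cql
// Sketch only: 'AccountCreated' and 0xCAFE are hypothetical placeholder values.
INSERT INTO "EventStore"."Events" (id, name, version, version_time_stamp, data)
VALUES (
    uuid(),               // random UUID used as the partition key
    'AccountCreated',     // event name
    1,                    // event version
    toTimestamp(now()),   // time the record was created (Cassandra 2.2+)
    0xCAFE                // serialized event payload as a blob literal
);
```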
Now that I have a table model and an index defined, I would like to write the following queries. The first gets the current version of the event store so that it can be compared against the last known version. If the two differ, the last known value becomes the parameter of the second query, which gets all of the missing records.
// Get the current version
SELECT version_time_stamp FROM "EventStore"."Events" LIMIT 1;

// Get all the records since the last version
SELECT id, name, version, version_time_stamp, data FROM "EventStore"."Events" WHERE version_time_stamp >= ?;
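It is worth spelling out how Cassandra would treat the two queries above. The annotated sketch below restates them against the table defined earlier; the comments describe general Cassandra behaviour as I understand it, and the exact error or warning depends on the Cassandra version in use.

```cql
// Without a partition key restriction, rows come back in partition token order,
// not in insertion or timestamp order. LIMIT 1 therefore returns an arbitrary
// event, not the most recent one.
SELECT version_time_stamp FROM "EventStore"."Events" LIMIT 1;

// A range restriction (>=) on a regular column is not something the built-in
// secondary index can serve; those indexes are meant for equality lookups.
// Depending on the version, this is either rejected outright or only accepted
// with ALLOW FILTERING, which scans the data instead of seeking to it.
SELECT id, name, version, version_time_stamp, data
FROM "EventStore"."Events"
WHERE version_time_stamp >= ?
ALLOW FILTERING;
```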
When to Use Secondary Indexes
Cassandra's built-in secondary indexes are best on a column family having many rows that contain the indexed value. The more unique values that exist in a particular column, the more overhead you will have, on average, to query and maintain the index. For example, suppose you had a user table with a billion users and wanted to look up users by the state they lived in. Many users will share the same column value for state (such as CA, NY, TX, etc.). This would be a good candidate for a secondary index.
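The users-by-state case described above might look like the following sketch. The table, column, and index names are assumptions for illustration, and the statements assume a keyspace has already been selected with USE.

```cql
// Hypothetical users table; all names here are illustrative assumptions.
CREATE TABLE IF NOT EXISTS users (
    user_id uuid PRIMARY KEY,
    email   text,
    state   text
);

// state has few distinct values shared by many rows,
// which is the situation the built-in index handles best.
CREATE INDEX IF NOT EXISTS users_state_idx ON users (state);

// Equality lookup on the indexed column.
SELECT user_id, email FROM users WHERE state = 'TX';
```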
When Not to Use Secondary Indexes
Do not use secondary indexes to query a huge volume of records for a small number of results. For example, if you create indexes on columns that have many distinct values, a query between the fields will incur many seeks for very few results. In the column family with a billion users, looking up users by their email address (a value that is typically unique for each user) instead of by their state, is likely to be very inefficient. It would probably be more efficient to manually maintain a dynamic column family as a form of an index instead of using a secondary index. For columns containing unique data, it is sometimes fine performance-wise to use secondary indexes for convenience, as long as the query volume to the indexed column family is moderate and not under constant load.
I basically have a model that is full of impedance mismatches with my queries. So what is the next step? Well, I will go through that next time, but here is a hint based on some advice that I recently got from DataStax's Luke Tillman, author of CQL Poco, on modeling with Cassandra.
Many times, you'll end up with a "table per query" type data model, where you insert multiple copies of the data at write time with each table designed to handle a specific query.
Before signing off, let me recommend two excellent resources created by Luke Tillman: a great slide deck aimed at .NET developers, and Luke's blog, in particular the following post if you are just getting started.
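To make the "table per query" idea quoted above a little more concrete, here is a minimal sketch of a second table shaped for the "everything since the last known version" query. It is only one possible layout, not necessarily the model this series ends up with. The table name "EventsByTime", the bucket column with its daily granularity, and the event_time column are all assumptions for illustration.

```cql
// Hypothetical query table: one partition per day, events ordered newest first.
CREATE TABLE IF NOT EXISTS "EventStore"."EventsByTime" (
    bucket     text,      // assumed daily bucket, e.g. '2015-03-01'
    event_time timeuuid,  // time-based UUID: unique and sortable by time
    id         uuid,
    name       varchar,
    version    int,
    data       blob,
    PRIMARY KEY ((bucket), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

// Newest event in a bucket: the first row of the partition.
SELECT event_time
FROM "EventStore"."EventsByTime"
WHERE bucket = ?
LIMIT 1;

// Everything in a bucket after the last known event.
SELECT id, name, version, event_time, data
FROM "EventStore"."EventsByTime"
WHERE bucket = ? AND event_time > ?;
```

With a layout like this, each event would be written to both tables at insert time, and a catch-up that spans several days would run the second query once per bucket.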
Thanks for reading.