CASSANDRA LAB GUIDEBOOK
TABLE OF CONTENTS
LAB 1: INSTALLING CASSANDRA
LAB 2: CASSANDRA DATA MODELLING USING CQLSH
LAB 3: WORKING WITH CASSANDRA CLI
LAB 4: INSTALLING DATASTAX COMMUNITY EDITION FREE PACKAGE OF CASSANDRA ON A TWO-NODE UBUNTU 12.04
LAB 5: CONFIGURING A MULTIPLE NODE CLUSTER (SINGLE DATA CENTER)
LAB 6: CONFIGURING MULTI-NODE CLUSTER WITH MULTIPLE DATA CENTERS
LAB 7: CONFIGURING CASSANDRA PROPERTIES
LAB 8: ADDING CAPACITY TO EXISTING CLUSTER
LAB 9: CASSANDRA REPLICATION AND SNITCHES CONFIGURATION
LAB 10: INSTALLING AND USING OPSCENTER
LAB 11: MONITORING USING NODETOOL
LAB 1: INSTALLING CASSANDRA
Lab Environment: Ubuntu 12.04 LTS on Amazon AWS
Installation Checklist: Before we begin the installation, ensure that your system fulfills the following requirements:
● Advanced Package Tool (APT) is installed.
● Root or sudo access to the install machine.
● Python 2.6+ (needed if installing OpsCenter).
● Latest version of Oracle Java SE Runtime Environment (JRE) 7.
Installing the JDK:
Commands: Use the following commands to add the Java repository and to ensure the right Java version is installed, as expected by Cassandra.
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer
Install Cassandra: In a terminal window:
● Check which version of Java is installed by running the following command:
$ java -version
● Use the latest version of Oracle Java 7 on all nodes.
● Add the DataStax Community repository to /etc/apt/sources.list.d/cassandra.sources.list:
$ echo "deb http://debian.datastax.com/community stable main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list
● Add the DataStax repository key to your APT trusted keys.
$ curl -L http://debian.datastax.com/debian/repo_key | sudo apt-key add -
● Install the package. For example:
$ sudo apt-get update
$ sudo apt-get install dsc20=2.0.11-1 cassandra=2.0.11
After installation, perform a test to verify that Cassandra is running.
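One way to do this, assuming a single local node, is to check the service and the ring status:
$ sudo service cassandra status
$ nodetool status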
LAB 2: CASSANDRA DATA MODELLING USING CQLSH
LAB PRE-REQUISITE: Completed Lab 1.
Step 1: Launch the cqlsh prompt to start building Cassandra elements.
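For example, on the local node (append a host and port to reach a remote node, as noted below):
$ cqlsh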
Note: We can use cqlsh to connect to other nodes by appending the host (either hostname or IP address) and port as command-line parameters.
Step 2: Creating a Keyspace: We will create a keyspace called “DEMO” using SimpleStrategy as its replication strategy. We will set the replication factor to one for a single-node Cassandra cluster.
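A minimal sketch of this step, using the keyspace name and options described above:
cqlsh> CREATE KEYSPACE demo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
cqlsh> USE demo;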
Step 3: Create Table:
Use “Describe Keyspace” to pull the content of the keyspace as specified below.
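For illustration, assuming a hypothetical users table (the actual table used in the original exercise may differ):
cqlsh:demo> CREATE TABLE users (
    user_id text PRIMARY KEY,
    first_name text,
    last_name text
);
cqlsh:demo> DESCRIBE KEYSPACE demo;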
Step 4: Insert data to table as specified below.
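For example, inserting one row into the hypothetical users table:
cqlsh:demo> INSERT INTO users (user_id, first_name, last_name) VALUES ('jsmith', 'John', 'Smith');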
Step 5: Arrange the data into columns.
Step 6: Do not use INSERT to add data; instead, use UPDATE as specified below. In Cassandra, UPDATE behaves as an upsert.
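A sketch, again using the hypothetical users table:
cqlsh:demo> UPDATE users SET last_name = 'Brown' WHERE user_id = 'jsmith';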
Step 7: Cassandra supports collection data. Collections can be managed by using set, list or map as specified below.
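For example, adding collection columns to the hypothetical users table and updating them:
cqlsh:demo> ALTER TABLE users ADD emails set<text>;
cqlsh:demo> UPDATE users SET emails = emails + {'john@example.com'} WHERE user_id = 'jsmith';
cqlsh:demo> ALTER TABLE users ADD top_scores list<int>;
cqlsh:demo> UPDATE users SET top_scores = [90, 85] WHERE user_id = 'jsmith';
cqlsh:demo> ALTER TABLE users ADD phones map<text, text>;
cqlsh:demo> UPDATE users SET phones = {'home': '555-0100'} WHERE user_id = 'jsmith';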
Step 8: Creating index. Do the following activity to understand the relevance of indexes.
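A sketch of creating a secondary index on a non-key column and then filtering on it (column names are the hypothetical ones used above):
cqlsh:demo> CREATE INDEX ON users (last_name);
cqlsh:demo> SELECT * FROM users WHERE last_name = 'Brown';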
Step 9: Ordering records using ORDER BY, as specified in the sample below.
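ORDER BY applies to clustering columns within a partition, so this sketch assumes a hypothetical table with a clustering column:
cqlsh:demo> CREATE TABLE user_events (
    user_id text,
    event_time timestamp,
    event text,
    PRIMARY KEY (user_id, event_time)
);
cqlsh:demo> SELECT * FROM user_events WHERE user_id = 'jsmith' ORDER BY event_time DESC;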
Step 10: Deleting records from the table. Shown below is the method of deleting a record from the table.
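For example, removing a single column and then the whole row from the hypothetical users table:
cqlsh:demo> DELETE first_name FROM users WHERE user_id = 'jsmith';
cqlsh:demo> DELETE FROM users WHERE user_id = 'jsmith';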
LAB 3: WORKING WITH CASSANDRA CLI
LAB PRE-REQUISITE: Completed Lab 1
Step 1: Start Cassandra CLI
Step 2: Create a keyspace
Step 3: Show user and system keyspaces
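A sketch of Steps 1 to 3, assuming a local node on the default Thrift port and a hypothetical keyspace named Demo:
$ cassandra-cli -h localhost -p 9160
[default@unknown] create keyspace Demo
    with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
    and strategy_options = {replication_factor:1};
[default@unknown] show keyspaces;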
Step 4: Delete a Cassandra Keyspace
Step 5: Switch to a keyspace
Step 6: Create a Cassandra column family
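Sketches for Steps 4 to 6, reusing the hypothetical Demo keyspace and a hypothetical Users column family:
[default@unknown] drop keyspace Demo;
(recreate it as in Step 2 before continuing)
[default@unknown] use Demo;
[default@Demo] create column family Users
    with comparator = UTF8Type
    and key_validation_class = UTF8Type
    and default_validation_class = UTF8Type;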
Step 7: Inserting Cassandra columns
Step 8: Inserting Cassandra Column
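For example, for Steps 7 and 8, inserting two columns into one row of the hypothetical Users column family:
[default@Demo] set Users['jsmith']['first'] = 'John';
[default@Demo] set Users['jsmith']['last'] = 'Smith';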
Step 9: Retrieve Cassandra Columns.
Step 10: Define the column display type if it displays the information as binary data.
Step 11: List all Cassandra Columns.
Step 12: List columns limited to maximum of 2.
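Sketches for Steps 9 through 12 (the assume command controls how keys, comparators, and values are displayed):
[default@Demo] get Users['jsmith'];
[default@Demo] assume Users keys as utf8;
[default@Demo] assume Users comparator as utf8;
[default@Demo] assume Users validator as utf8;
[default@Demo] list Users;
[default@Demo] list Users limit 2;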
Step 13: Drop a Cassandra column family.
Step 14: Deleting a row.
Step 15: Deleting a specific column.
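Sketches for Steps 13 to 15, using the hypothetical Users column family:
[default@Demo] drop column family Users;
(recreate and repopulate it before trying the deletes below)
[default@Demo] del Users['jsmith'];
[default@Demo] del Users['jsmith']['first'];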
Step 16: Open a file and enter the contents specified below. Save it as Sample-schema.txt.
Step 17: Executing a script.
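A hypothetical Sample-schema.txt and the command to execute it with the CLI's -f option:
create keyspace Demo
    with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
    and strategy_options = {replication_factor:1};
use Demo;
create column family Users with comparator = UTF8Type;

$ cassandra-cli -h localhost -f Sample-schema.txt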
LAB 4: INSTALLING DATASTAX COMMUNITY EDITION FREE PACKAGE OF CASSANDRA ON A TWO-NODE UBUNTU 12.04
In this lab we are going to install DataStax Community on two nodes. Please replace the node IPs with the ones assigned to you by the administrator. The steps below are common to both nodes:
Common steps for node1 (IPv4 192.0.2.1) and node2 (IPv4 192.0.2.2)
Step 1: Let's get the DataStax repository key and add it to our system.
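The key can be added the same way as in Lab 1:
$ curl -L http://debian.datastax.com/debian/repo_key | sudo apt-key add -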
Step 2: Edit the sources.list file.
Step 3: Add the following line to it.
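This is presumably the same repository line used in Lab 1:
deb http://debian.datastax.com/community stable main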
Step 4: Use apt-get to ensure the system recognizes the new source.
Step 5: Start installing python-cql and dsc
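A sketch of Steps 4 and 5; dsc20 follows the Lab 1 install, and python-cql is assumed to be the CQL driver package in the same repository:
$ sudo apt-get update
$ sudo apt-get install python-cql dsc20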
Step 6: Stop Cassandra
Step 7: Install DataStax OpsCenter and its prerequisites
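A sketch of Steps 6 and 7; the opscenter package name is an assumption based on the DataStax Community repository:
$ sudo service cassandra stop
$ sudo apt-get install opscenter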
Step 8: Edit cassandra.yaml on node1 (specify your IP address)
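A minimal sketch of the node1 settings (only the address-related properties are shown; the exact values from the original lab may differ):
listen_address: 192.0.2.1
rpc_address: 192.0.2.1
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "192.0.2.1"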
Step 9: Edit opscenterd.conf to change the IP of the interface address.
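A sketch of the relevant section of opscenterd.conf (the file location and section layout are assumptions):
[webserver]
port = 8888
interface = 192.0.2.1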
Step 10: Start Cassandra.
Step 11: Launch node2 and edit cassandra.yaml as specified below.
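A minimal sketch for node2, pointing its seed at node1 (exact values from the original lab may differ):
listen_address: 192.0.2.2
rpc_address: 192.0.2.2
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "192.0.2.1"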
Step 12: Start Cassandra
Step 13: Verify the ring status
Step 14: Verify the output and ensure the status is Up.
Step 15: Start the opscenterd service
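Possible commands for Steps 12, 13, and 15 (the opscenterd service name is assumed from the packaged install):
$ sudo service cassandra start
$ nodetool status
$ sudo service opscenterd start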
Step 16: Connect to OpsCenter on node1 from a web browser.
Step 17: OpsCenter displays two active nodes but needs the agents installed on each node. Click the "fix" link.
Step 18: Enter your Credentials to let the agent install and accept the fingerprint.
Step 19: Use cassandra-cli to connect to Cassandra. Connect to the cluster from node1:
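For example, assuming the default Thrift port:
$ cassandra-cli -h 192.0.2.1 -p 9160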
Step 20: Define a KeySpace.
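A sketch using a hypothetical keyspace name; a replication factor of 2 matches the two-node cluster:
[default@unknown] create keyspace Demo
    with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
    and strategy_options = {replication_factor:2};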
LAB 5: CONFIGURING A MULTIPLE NODE CLUSTER (SINGLE DATA CENTER)
LAB PRE-REQUISITE: Completed Lab 1.
Step 1: Once we have installed the Cassandra packages on all nodes, we need to ensure all instances are stopped. You must stop the server and clear the data. Doing this removes the default cluster_name (Test Cluster) from the system table; all nodes must use the same cluster name.
$ sudo service cassandra stop
$ sudo rm -rf /var/lib/cassandra/data/system/*
Note: In Cassandra, the term Data Center is a grouping of nodes. Data Center is synonymous with replication group, that is, a grouping of nodes configured together for replication purposes.
Step 2: Planning: Get all the nodes along with their respective IP addresses and identify the seeds. In our case we are using the ones described below. Having more than one seed node is considered best practice in Cassandra.
node0  110.82.155.0 (seed1)
node1  110.82.155.1
node2  110.82.155.2
node3  110.82.156.3 (seed2)
node4  110.82.156.4
node5  110.82.156.5
Before we proceed we must ensure Cassandra services are not running and the data is cleared as indicated above.
Step 3: Set the properties in the cassandra.yaml file for each node.
Location: /etc/cassandra/cassandra.yaml
Sample:
Step 4: Set the following properties (not all properties are covered in this segment).
● num_tokens: recommended value: 256
● -seeds: internal IP address of each seed node. Seed nodes do not bootstrap, which is the process of a new node joining an existing cluster. For new clusters, the bootstrap process on seed nodes is skipped.
● listen_address: If not set, Cassandra asks the system for the local address, the one associated with its hostname. In some cases Cassandra doesn't produce the correct address and you must specify the listen_address.
● endpoint_snitch: name of snitch (See endpoint_snitch.) If you are changing snitches, see Switching snitches.
● auto_bootstrap: false (Add this setting only when initializing a fresh cluster with no data.)
Step 5: Make these entries in cassandra.yaml:
cluster_name: 'MyCassandraCluster'
num_tokens: 256
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "110.82.155.0,110.82.156.3"
listen_address:
rpc_address: 0.0.0.0
endpoint_snitch: GossipingPropertyFileSnitch
Step 6: Configure the rack file; assign the data center and rack names you determined in the prerequisites. For example:
1. In the cassandra-rackdc.properties file:
# indicate the rack and dc for this node
dc=DC1
rack=RAC1
2. After you have installed and configured Cassandra on all nodes, start the seed nodes one at a time, and then start the rest of the nodes.
$ sudo service cassandra start
Note: If the node has restarted because of automatic restart, you must first stop the node and clear the data directories. Step 7: Verify that the ring is up and running. Use nodetool to verify as specified below. $ nodetool status
Each Node should be listed as specified below.
LAB 6: CONFIGURING MULTI-NODE CLUSTER WITH MULTIPLE DATA CENTERS
LAB PRE-REQUISITE: Completed Lab 1 & 2.
Step 1: Checklist to fulfill before we start this lab.
● Install Cassandra on each node.
● Choose a name for the cluster.
● Get the IP address of each node.
● Determine which nodes will be seed nodes.
● Determine the snitch and replication strategy. The GossipingPropertyFileSnitch and NetworkTopologyStrategy are recommended for production environments.
● If you're using multiple data centers, determine a naming convention for each data center and rack. For example: DC1, DC2 or 100, 200 and RAC1, RAC2 or R101, R102. Choose the names carefully; renaming a data center is not possible.
Step 2: Assuming we have installed Cassandra on all nodes, we now configure it for multiple data centers. Shown below are our assumptions.
node0  10.168.66.41 (seed1)
node1  10.176.43.66
node2  10.168.247.41
node3  10.176.170.59 (seed2)
node4  10.169.61.170
node5  10.169.30.138
Note: If Cassandra is running, you must stop the server and clear the data using the following commands.
$ sudo service cassandra stop
$ sudo rm -rf /var/lib/cassandra/data/system/*
Step 3: Configure the cassandra.yaml file with the following properties:
• num_tokens: recommended value: 256
• -seeds: internal IP address of each seed node. Seed nodes do not bootstrap, which is the process of a new node joining an existing cluster. For new clusters, the bootstrap process on seed nodes is skipped.
• listen_address: If not set, Cassandra asks the system for the local address, the one associated with its hostname. In some cases Cassandra doesn't produce the correct address and you must specify the listen_address.
• endpoint_snitch: name of snitch (See endpoint_snitch.) If you are changing snitches, see Switching snitches.
• auto_bootstrap: false (Add this setting only when initializing a fresh cluster with no data.)
Since in our case we have identical nodes, we can copy the same configuration file to all the nodes.
Step 4: Edit the file as specified below.
cluster_name: 'MyCassandraCluster'
num_tokens: 256
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "10.168.66.41,10.176.170.59"
listen_address:
endpoint_snitch: GossipingPropertyFileSnitch
Step 5: Configure the data center and rack as specified in the cassandra-rackdc.properties file.
Nodes 0 to 2:
# indicate the rack and dc for this node
dc=DC1
rack=RAC1
Nodes 3 to 5:
# indicate the rack and dc for this node
dc=DC2
rack=RAC1
Step 6: Start the seed nodes first, and then start the rest of the nodes.
$ sudo service cassandra start
Step 7: Verify the ring. $ nodetool status
View the output of this command and ensure each node is listed and in a normal state.
LAB 7: CONFIGURING CASSANDRA PROPERTIES
LAB PRE-REQUISITES: Completed Lab 1, 2 & 3.
Step 1: Configuring gossip settings.
Step 2: Configure the heap dump. Analyzing a heap dump helps in troubleshooting memory problems. By default, Cassandra puts the heap dump file in a subdirectory of the working root directory when running as a service. If Cassandra does not have write permission to the root directory, the heap dump fails. If the root directory is too small to accommodate the heap dump, the server crashes. To prevent these failures, configure a heap dump directory that is writable by Cassandra and large enough to accommodate a full heap. The default dump location is /etc/ds/Cassandra. Follow these steps to configure the heap dump directory:
1. Open the cassandra-env.sh file for editing.
2. Scroll down to the comment about the heap dump path:
# set jvm HeapDumpPath with CASSANDRA_HEAPDUMP_DIR
3. On the line after the comment, set CASSANDRA_HEAPDUMP_DIR to the path you want to use:
# set jvm HeapDumpPath with CASSANDRA_HEAPDUMP_DIR
CASSANDRA_HEAPDUMP_DIR=<path>
4. Save the cassandra-env.sh file and restart Cassandra.
Step 3: Configure virtual nodes on a new cluster. Note: Generally, all nodes should have the same number of virtual nodes when they have equal hardware capability. If hardware capability varies among the nodes in your cluster, always assign a proportionally larger number of virtual nodes to the larger machines. Follow the steps provided below to enable virtual nodes on a new cluster:
● Set the number of tokens on each node in your cluster with the num_tokens parameter in the cassandra.yaml file.
● The recommended value is 256. Do not set the initial_token parameter.
Step 4: Configure virtual nodes on an existing cluster. Note: Enabling virtual nodes (vnodes) has less impact on performance if you bring up another data center configured with vnodes already enabled and let Cassandra's automatic mechanisms distribute the existing data into the new nodes. Follow these steps to configure:
1. Add a new data center to the cluster.
2. Once the new data center with vnodes enabled is up, switch your clients to use the new data center.
3. Run a full repair with nodetool repair. This ensures that after you move the clients to the new data center, any previous writes are added to the new data center and that nothing is dropped when you remove the old data center.
4. Update your schema to no longer reference the old data center.
5. Remove the old data center from the cluster.
Step 5: Log configuration.
LAB 8: ADDING CAPACITY TO EXISTING CLUSTER
LAB PRE-REQUISITES: Completed Lab 1, 2, 3 & 4.
Task 1: Adding a node to an existing cluster
Step 1: Install Cassandra on the new nodes, but do not start Cassandra. Note: Follow Lab 1 to install Cassandra on the new node.
Step 2: Set the following properties in cassandra.yaml:
• auto_bootstrap - If this option has been set to false, you must set it to true. This option is not listed in the default cassandra.yaml configuration file and defaults to true.
• cluster_name - The name of the cluster the new node is joining.
• listen_address/broadcast_address - May usually be left blank. Otherwise, use the IP address or host name that the other Cassandra nodes use to connect to the new node.
• endpoint_snitch - The snitch Cassandra uses for locating nodes and routing requests.
• num_tokens - The number of vnodes to assign to the node.
• seed_provider - The -seeds list in this setting determines which nodes the new node should contact to learn about the cluster and establish the gossip process.
Note: Ensure the new node is not listed in the -seeds list. Change any other non-default settings you have made to your existing cluster in the cassandra.yaml file and the cassandra-topology.properties or cassandra-rackdc.properties files.
Step 3: Start Cassandra on each new node.
Step 4: Use nodetool to verify.
Step 5: Run nodetool cleanup on each of the previously existing nodes to remove the keys no longer belonging to those nodes.
Note: Wait for cleanup to complete on one node before doing the next.
Task 2: Adding a data center to a cluster
Follow these steps to add a data center to a cluster.
Step 1: Before you start, check and ensure you are using NetworkTopologyStrategy for all keyspaces.
Step 2: Set the following properties for all nodes in the cluster. Edit the cassandra.yaml file:
• Add (or edit) auto_bootstrap: false.
• Set other properties, such as -seeds and endpoint_snitch, to match the cluster settings.
• If you want to enable vnodes, set num_tokens.
• Update the relevant property file for the snitch used on all servers to include the new nodes. You do not need to restart.
  GossipingPropertyFileSnitch: cassandra-rackdc.properties
  PropertyFileSnitch: cassandra-topology.properties
Step 3: Start Cassandra on the new nodes.
Step 4: After all nodes are running in the cluster, change the keyspace properties to specify the desired replication factor for the new data center.
Step 5: Run nodetool rebuild specifying the existing data center on all nodes in the new data center:
nodetool rebuild -- name_of_existing_data_center
Step 6: For each new node, change auto_bootstrap: false to true, or remove it, in the cassandra.yaml file.
Task 3: Replacing a dead node
In several scenarios you will have to replace a node that has died for some reason (for example, hardware failure).
Follow these steps to replace it.
Step 1: Confirm that the node is dead using nodetool status. The nodetool command shows a down status (DN) for the dead node.
Step 2: Note the address of the dead node (it will be used in Step 6).
Step 3: Install Cassandra on the new node, but do not start Cassandra. If it has started, stop it and clear the data.
Step 4: Set the following properties in cassandra.yaml and, depending on the snitch, the cassandra-topology.properties or cassandra-rackdc.properties configuration files:
• auto_bootstrap - If this option has been set to false, you must set it to true. This option is not listed in the default cassandra.yaml configuration file and defaults to true.
• cluster_name - The name of the cluster the new node is joining.
• listen_address/broadcast_address - May usually be left blank. Otherwise, use the IP address or host name that other Cassandra nodes use to connect to the new node.
• endpoint_snitch - The snitch Cassandra uses for locating nodes and routing requests.
• num_tokens - The number of vnodes to assign to the node. If the hardware capabilities vary among the nodes in your cluster, you can assign a proportional number of vnodes to the larger machines.
• seed_provider - The -seeds list in this setting determines which nodes the new node should contact to learn about the cluster and establish the gossip process.
Step 5: Change any other non-default settings you have made to your existing cluster in the cassandra.yaml file and the cassandra-topology.properties or cassandra-rackdc.properties files. Use the diff command to find and merge (by hand) any differences between existing and new nodes.
Step 6: Start the replacement node with the replace_address option. Packaged installs: add the following option to the /usr/share/cassandra/cassandra-env.sh file:
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=address_of_dead_node"
Step 7: After the new node finishes bootstrapping, remove the option you added in Step 6.
Step 8: Remove the old node's IP address from the cassandra-topology.properties or cassandra-rackdc.properties file.
Note: In production, wait at least 72 hours to ensure that the old node's information is removed from gossip. If it is removed from the property file too soon, you'll face problems and complications.
Task 4: Replacing a dead seed node
Cassandra doesn't allow a seed node to be bootstrapped. To replace a dead seed node, follow these steps:
Step 1: Promote an existing node to be a seed node by adding its IP address to the -seeds list, and remove the IP address of the dead seed node from the cassandra.yaml file for each node in the cluster.
Step 2: Replace the dead node as specified in the previous task.
Task 5: Replacing a running node
Sometimes we need to replace a node with a new node, such as when updating to new hardware or performing maintenance. The process is simple: start the replacement node, integrate it into the cluster, and then decommission the old node. The steps are described below.
Step 1: Prepare and start the replacement node as specified in the first task.
Step 2: Confirm that the replacement node is alive.
$ nodetool status
Step 3: Note the Host ID of the replacement node. Use this ID to decommission the original node from the cluster with the command below.
$ nodetool decommission
Task 6: Removing a node
Sometimes you may want to reduce the size of a data center by removing nodes from the cluster. Follow the steps shown below to do so.
Step 1: Use the nodetool status command to see the status of the node (UN, DN) as shown below.
Step 2: If the node is up, run nodetool decommission.
$ nodetool decommission
Step 3: If the node is down:
• If the cluster uses vnodes, remove the node using the nodetool removenode command.
• If the cluster does not use vnodes, additional steps are needed before running the nodetool removenode command.
Task 7: Decommissioning a data center
Follow these steps to remove a data center so no information is lost.
• Make sure no clients are still writing to any nodes in the data center.
• Run a full repair with nodetool repair. This ensures that all data is propagated from the data center being decommissioned.
• Change all keyspaces so they no longer reference the data center being removed.
• Run nodetool decommission on every node in the data center being removed.
Task 8: Repairing nodes
Node repair ensures the data on a replica remains consistent with the data on other nodes. The nodetool repair command repairs inconsistencies across all of the replicas for a given range of data. Guidelines for running routine node repair include:
• The hard requirement for routine repair frequency is the value of gc_grace_seconds. Run a repair operation at least once on each node within this time period. Following this important guideline ensures that deletes are properly handled in the cluster.
• Use caution when running routine node repair on more than one node at a time and schedule regular repair operations for low-usage hours.
• In systems that seldom delete or overwrite data, you can raise the value of gc_grace with minimal impact to disk space. This allows wider intervals for scheduling repair operations with the nodetool utility.
LAB 9: CASSANDRA REPLICATION AND SNITCHES CONFIGURATION
LAB PRE-REQUISITES: Completed Labs 1, 2, 3 & 4.
Task 1: Defining the replication factor
Use the syntax given below to define the replication factor while creating the keyspace.
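A sketch in CQL (the keyspace name and factor are examples):
cqlsh> CREATE KEYSPACE demo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};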
Note: The replication factor reflects the total number of replicas.
• A value of one means storing data on one node only (no duplicate copy).
• A value higher than one provides data redundancy for failover.
• A value of 3 means data is committed to 3 separate nodes.
• A value of 3 also means each node now acts as a replica for 3 separate token ranges.
Task 2: Changing the Cassandra Replication Factor
To increase the replication factor for a cluster we need to perform the following tasks:
1. Update the keyspace definition to contain the desired replication_factor (for example, with ALTER KEYSPACE).
2. Restart Cassandra on each server.
3. Run nodetool repair against each keyspace individually.
Task 3: Setting the replica placement strategy
The replica placement strategy determines which other nodes are picked as replicas besides the one chosen by the token.
Replica Placement Strategies:
• SimpleStrategy (Default): Returns the nodes that are next to each other on the ring.
• NetworkTopologyStrategy: Configure the number of replicas per data center as specified in the strategy_options.
• OldNetworkTopologyStrategy: Places one replica in one data center while the rest are on different racks in the current data center.
Update the replication factor used for two different data centers, DC1 and DC2:
• If the second data center is for load balancing based on geographical location.
• If the second data center is only for disaster recovery.
To define the placement strategy, use the following command to create the keyspace.
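A sketch using NetworkTopologyStrategy with per-data-center replica counts (the counts shown are examples):
cqlsh> CREATE KEYSPACE demo WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 2, 'DC2': 1};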
Task 4: Configuring Snitches
Snitches give Cassandra hints on how to route inter-node communication more effectively. We can use the following snitches:
SimpleSnitch (Default)
• Full class name: org.apache.cassandra.locator.SimpleSnitch
• Use this if all nodes are located in a single data center.
RackInferringSnitch
• Full class name: org.apache.cassandra.locator.RackInferringSnitch
• Automatically determines the network topology by analyzing the IP addresses.
• Assumes the second octet identifies the data center and the third octet identifies the rack.
PropertyFileSnitch
• Full class name: org.apache.cassandra.locator.PropertyFileSnitch
• Stores the network description in a property file.
Edit the cassandra-topology.properties file as specified.
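A sketch of cassandra-topology.properties entries (format: node IP = data center:rack; the IPs reuse the Lab 5 examples):
110.82.155.0=DC1:RAC1
110.82.155.1=DC1:RAC1
110.82.156.3=DC2:RAC1
# default assignment for unknown nodes
default=DC1:RAC1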
Use RackInferringSnitch or PropertyFileSnitch for a Cassandra cluster deployed across multiple data centers. It provides hints to Cassandra on where to replicate data and reduces inter-node communication latency.
To configure the snitch for Cassandra:
vi conf/cassandra.yaml
Change endpoint_snitch as specified below:
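For example, to use the property-file snitch:
endpoint_snitch: PropertyFileSnitch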
With NetworkTopologyStrategy or OldNetworkTopologyStrategy placement strategy, one of the rack-aware snitches (RackInferringSnitch or PropertyFileSnitch) must be used. By default, snitches are wrapped in a dynamic snitch that monitors read latency. When needed, Cassandra routes requests away from poorly performing nodes.
LAB 10: INSTALLING AND USING OPSCENTER
LAB PREREQUISITE: Completed Lab 1.
Task 1: Install OpsCenter for monitoring (Basic)
This lab assumes you have completed Lab 1. Follow the steps shown below to install OpsCenter.
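A sketch, assuming the DataStax Community repository from Lab 1 is already configured and that the package and service names are opscenter and opscenterd:
$ sudo apt-get update
$ sudo apt-get install opscenter
$ sudo service opscenterd start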
Task 2: Log in to launch OpsCenter
Connect to OpsCenter in a web browser at http://localhost:8888
Task 3: Explore the various functionalities of OpsCenter
The home page displays options to create a new cluster or add an existing one. OpsCenter can be used to provision, monitor, and even customize the cluster.
Adding a cluster: we need to specify the IP/host of a node in the cluster.
The dashboard provides various views for working with the cluster.
When we move to the Data view, we find the list of all keyspaces, which we can select and manage from here.
The view shown below provides the metrics in the Activity view.
A sample Ring view is shown below.
In the Nodes view we get the list of all nodes.
In our current default cluster we have only 1 node, as shown below.
Monitoring view on the dashboard with default metrics.
The screen shown below shows how we can customize the metrics of our choice.
Finally, we plot the metrics on a graph.
Once we specify the metrics, we get the graph below.
LAB 11: MONITORING USING NODETOOL
LAB PRE-REQUISITE: Completed Lab 1
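Some common nodetool monitoring commands to try (run from any node; add -h <host> to target a remote node):
$ nodetool status
$ nodetool info
$ nodetool ring
$ nodetool tpstats
$ nodetool cfstats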
Try these commands on your Ring.