Presented By Kelly Technologies www.kellytechno.com
Agenda
- What is distributed computing?
- Why distributed computing?
- Common architectures
- Best practice
- Case studies: Condor and Hadoop (HDFS and MapReduce)
What is Distributed Computing/System?
Distributed computing
- A field of computer science that studies distributed systems.
- The use of distributed systems to solve computational problems.
Distributed system
- Wikipedia: there are several autonomous computational entities, each of which has its own local memory; the entities communicate with each other by message passing.
- Operating System Concepts: the processors communicate with one another through various communication lines, such as high-speed buses or telephone lines; each processor has its own local memory.
What is Distributed Computing/System?
Distributed program
- A computer program that runs on a distributed system.
Distributed programming
- The process of writing distributed programs.
What is Distributed Computing/System?
Common properties
- Fault tolerance: when one or more nodes fail, the whole system still works, apart from some loss of performance. This requires checking the status of each node.
- Each node plays a partial role: each computer has only a limited, incomplete view of the system and may know only one part of the input.
- Resource sharing: each user can share the computing power and storage resources in the system with other users.
- Load sharing: dispatching tasks to the nodes spreads the load across the whole system.
- Easy to expand: adding nodes should take little time, ideally none.
Why Distributed Computing?
The nature of the application
- Performance
  - Computing intensive: the task consumes a lot of computation time, e.g. estimating π (see the sketch after this list).
  - Data intensive: the task deals with a large number of files or very large files, e.g. Facebook, the LHC (Large Hadron Collider).
- Robustness
  - No SPOF (Single Point Of Failure): other nodes can re-execute a task that was running on a failed node.
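As a toy computing-intensive example, here is a minimal sketch that estimates π by Monte Carlo sampling, splitting the samples across workers and combining the partial counts. The worker count and sample size are illustrative assumptions, and threads stand in for distributed nodes.

    import java.util.concurrent.*;
    import java.util.concurrent.atomic.LongAdder;

    public class PiEstimate {
        public static void main(String[] args) throws Exception {
            final int workers = 4;                    // assumed: one worker per node
            final long samplesPerWorker = 1_000_000;  // assumed sample size
            ExecutorService pool = Executors.newFixedThreadPool(workers);
            LongAdder hits = new LongAdder();
            for (int w = 0; w < workers; w++) {
                pool.submit(() -> {
                    ThreadLocalRandom rnd = ThreadLocalRandom.current();
                    long local = 0;
                    for (long i = 0; i < samplesPerWorker; i++) {
                        double x = rnd.nextDouble(), y = rnd.nextDouble();
                        if (x * x + y * y <= 1.0) local++;  // inside the quarter circle
                    }
                    hits.add(local);                        // combine partial results
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
            double pi = 4.0 * hits.sum() / (workers * samplesPerWorker);
            System.out.println("pi ~= " + pi);
        }
    }

Each worker computes independently; only the small partial counts are combined at the end, which is exactly the property that makes this kind of task easy to distribute.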
Common Architectures
- Communicate and coordinate work among concurrent processes.
- Processes communicate by sending/receiving messages, either synchronously or asynchronously (sketched below).
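A minimal sketch of synchronous message passing between two processes over a TCP socket; for brevity both ends run in one JVM here, and the port number is an illustrative assumption.

    import java.io.*;
    import java.net.*;

    public class MessageDemo {
        public static void main(String[] args) throws IOException {
            // Receiver: accept one connection and print what it reads.
            try (ServerSocket server = new ServerSocket(9000)) {   // assumed port
                new Thread(() -> send()).start();
                try (Socket peer = server.accept();
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(peer.getInputStream()))) {
                    System.out.println("received: " + in.readLine());
                }
            }
        }

        // Sender: connect and write one message; the call blocks until sent (synchronous).
        static void send() {
            try (Socket s = new Socket("localhost", 9000);
                 PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
                out.println("hello from a peer");
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

An asynchronous variant would hand the message to a queue or a non-blocking channel and continue without waiting for delivery.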
Common Architectures
Master/slave architecture
- Master/slave is a model of communication where one device or process has unidirectional control over one or more other devices.
- Database replication: the source database can be treated as the master and the destination database as a slave.
- Client-server: web browsers and web servers.
Common Architectures
Data-centric architecture
- Using a standard, general-purpose relational database management system instead of customized in-memory or file-based data structures and access methods.
- Using dynamic, table-driven logic instead of logic embodied in previously compiled programs.
- Using stored procedures instead of logic running in middle-tier application servers.
- Using a shared database as the basis for communication between parallel processes, instead of direct inter-process communication via message passing.
Best Practice
Data intensive or computing intensive?
- Consider the data size, the amount of data, and the attributes of the data you consume.
- Computing intensive: we can move the data to the nodes where we execute the jobs.
- Data intensive: we can split/replicate the data across different nodes, then execute our tasks on those nodes, which reduces data movement at execution time. Master nodes need to know where the data are located.
No data loss when incidents happen
- SAN (Storage Area Network)
- Data replication on different nodes
Synchronization
- When splitting a task across different nodes, how can we make sure the subtasks stay synchronized?
Best Practice
Robustness
- The system stays safe when one or some of the nodes fail.
- Failed nodes should recover when they come back online, with little or no manual action.
  - Condor: restart the daemon.
- Failure detection
  - When any node fails, the master node can detect the situation, e.g. by heartbeat detection (a sketch follows this list).
  - Apps/users don't need to know that a partial failure happened; the system restarts their tasks on other nodes.
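A minimal sketch of heartbeat-based failure detection on the master side, assuming each worker sends a small UDP datagram every second; the port and timeout values are illustrative assumptions.

    import java.net.*;
    import java.util.*;
    import java.util.concurrent.ConcurrentHashMap;

    public class HeartbeatMaster {
        static final int PORT = 9999;          // assumed heartbeat port
        static final long TIMEOUT_MS = 3_000;  // assumed: dead after ~3 missed beats

        public static void main(String[] args) throws Exception {
            Map<String, Long> lastSeen = new ConcurrentHashMap<>();

            // Periodically scan for nodes whose heartbeat is overdue.
            new Timer(true).scheduleAtFixedRate(new TimerTask() {
                public void run() {
                    long now = System.currentTimeMillis();
                    lastSeen.forEach((node, t) -> {
                        if (now - t > TIMEOUT_MS)
                            System.out.println(node + " looks dead; reschedule its tasks");
                    });
                }
            }, TIMEOUT_MS, TIMEOUT_MS);

            // Receive heartbeats: each datagram's source address identifies the node.
            try (DatagramSocket socket = new DatagramSocket(PORT)) {
                byte[] buf = new byte[64];
                while (true) {
                    DatagramPacket p = new DatagramPacket(buf, buf.length);
                    socket.receive(p);
                    lastSeen.put(p.getAddress().getHostAddress(),
                                 System.currentTimeMillis());
                }
            }
        }
    }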
Best Practice
Network issues
- Bandwidth: if a node holds none of the data, account for the bandwidth cost of copying files to it before executing a task there.
Scalability
- Easy to expand
  - Hadoop: modify the configuration and start the daemon.
Optimization
- What can we do if the performance of some nodes is poor?
  - Monitor the performance of each node, based on information exchanges such as heartbeats or logs.
  - Resume the same task on another node.
Best Practice
- Apps/users shouldn't need to know how the nodes communicate with each other.
- User mobility: users can access the system from one entry point, or from anywhere.
  - Grid: UI (User Interface)
  - Condor: submit machine
Case study - Condor
Condor
- Suited to computing-intensive jobs.
- Queuing policy: matches tasks to computing nodes.
- Resource classification: each resource can advertise its attributes, and the master classifies resources accordingly (a simplified matching sketch follows).
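To illustrate the idea of attribute-based matchmaking, here is a deliberately simplified sketch in plain Java; it is a stand-in for Condor's real ClassAd mechanism, and the attribute names and values are invented for illustration.

    import java.util.*;

    public class MatchDemo {
        public static void main(String[] args) {
            // Each resource advertises its attributes (simplified stand-in for ClassAds).
            Map<String, Map<String, Object>> machines = new HashMap<>();
            machines.put("node1", Map.of("Memory", 2048, "Arch", "X86_64"));
            machines.put("node2", Map.of("Memory", 512,  "Arch", "X86_64"));

            // A task states its requirements; the master picks a matching machine.
            Map<String, Object> req = Map.of("Memory", 1024, "Arch", "X86_64");
            for (var e : machines.entrySet()) {
                Map<String, Object> attrs = e.getValue();
                boolean ok = (int) attrs.get("Memory") >= (int) req.get("Memory")
                          && attrs.get("Arch").equals(req.get("Arch"));
                if (ok) System.out.println("matched task to " + e.getKey());
            }
        }
    }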
Case study - Condor
Roles
- Central Manager: the collector of information and the negotiator between resources and resource requests.
- Execution machine: responsible for executing Condor tasks.
- Submit machine: responsible for submitting Condor tasks.
- Checkpoint server: responsible for storing all checkpoint files for the tasks.
Case study - Condor
Robustness
- If one execution machine fails, the same task can be executed on other nodes.
- Recovery: only the daemon needs to be restarted once the failed nodes are back online.
Case study - Condor
Resource sharing
- Each Condor user can share computing power with other Condor users.
Synchronization
- Users need to take care of it themselves: MPI jobs can run in a Condor pool, but users must think about synchronization and deadlock issues.
Failure detection
- The central manager knows when a node fails, based on the update notifications the nodes send.
Scalability
- Only a few commands are needed when new nodes come online.
Case study - Hadoop
HDFS
- NameNode: manages the file system namespace and regulates access to files by clients; determines the mapping of blocks to DataNodes.
- DataNode: manages the storage attached to the node it runs on; stores CRC codes; sends heartbeats to the NameNode. Each file is split into chunks (blocks), and each chunk is stored on several DataNodes.
- Secondary NameNode: responsible for merging the fsImage and the EditLog.
A client-side sketch follows.
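A minimal sketch of a client talking to HDFS through the Hadoop Java FileSystem API. The NameNode URI and file paths are illustrative assumptions, and fs.default.name is the classic property name (newer releases use fs.defaultFS).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.default.name", "hdfs://namenode:9000");  // assumed NameNode address
            conf.setInt("dfs.replication", 3);        // replicate each block to 3 DataNodes
            FileSystem fs = FileSystem.get(conf);

            // The client asks the NameNode for block locations;
            // the data itself flows to/from the DataNodes.
            fs.copyFromLocalFile(new Path("/tmp/input.txt"),       // assumed local file
                                 new Path("/user/demo/input.txt"));
            System.out.println("exists: " + fs.exists(new Path("/user/demo/input.txt")));
        }
    }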
Case study - Hadoop
MapReduce framework
- JobTracker: responsible for dispatching jobs to each TaskTracker, and for job management such as removing and scheduling jobs.
- TaskTracker: responsible for executing tasks; usually the TaskTracker launches another JVM to execute the task.
A WordCount sketch of the programming model follows.
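As a sketch of how a job is expressed for this framework, here is the classic WordCount in the org.apache.hadoop.mapreduce Java API; the input/output paths are assumptions, and Job.getInstance assumes a reasonably recent Hadoop release (older JobTracker-era code used new Job(conf, ...)).

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Map: emit (word, 1) for every word in the input split.
        public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            public void map(Object key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String tok : value.toString().split("\\s+")) {
                    if (tok.isEmpty()) continue;
                    word.set(tok);
                    ctx.write(word, ONE);
                }
            }
        }

        // Reduce: sum the counts for each word.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class);   // local pre-aggregation on map side
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("/user/demo/input"));    // assumed
            FileOutputFormat.setOutputPath(job, new Path("/user/demo/output")); // assumed
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The framework dispatches map and reduce tasks to TaskTrackers near the data; the user code only defines the per-record map and per-key reduce logic.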
Case study - Hadoop
(Figure from Hadoop: The Definitive Guide)
Case study - Hadoop
Data replication
- Data are replicated to different nodes, which reduces the possibility of data loss.
- Data locality: a job is sent to the node where its data are.
Robustness
- If one DataNode fails, we can get the data from other nodes.
- If one TaskTracker fails, we can start the same task on a different node.
Recovery
- Only the daemon needs to be restarted once the failed nodes are back online.
Case study - Hadoop
Resource sharing
- Each Hadoop user can share computing power and storage space with other Hadoop users.
Synchronization
- No synchronization between tasks is required.
Failure detection
- The NameNode/JobTracker knows when a DataNode/TaskTracker fails, based on heartbeats.
Case study - Hadoop
Scalability
- Only a few commands are needed when new nodes come online.
Optimization
- A speculative task is launched only when a task takes too much time on one node.
- The slower copy is killed once the other one finishes (a configuration sketch follows).
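A sketch of toggling speculative execution through the job configuration, using the classic JobTracker-era property names; newer Hadoop releases renamed them, as noted in the comments.

    import org.apache.hadoop.conf.Configuration;

    public class SpeculativeConfig {
        public static Configuration make() {
            Configuration conf = new Configuration();
            // Classic (Hadoop 1.x) property names; Hadoop 2+ renamed them to
            // mapreduce.map.speculative and mapreduce.reduce.speculative.
            conf.setBoolean("mapred.map.tasks.speculative.execution", true);
            conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);
            return conf;
        }
    }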