SDJ Open 01/2013



ACUNU ANALYTICS WHEN SECONDS MATTER... WE PROVIDE IMMEDIATE BUSINESS INSIGHT

PREDICTABLE FAST QUERIES

LOW-LATENCY ANSWERS

Even as data volumes grow

Sub-second from source to insight

HISTORICAL, FRESH DATA

POWERFUL ANALYTICS

Combined for deeper insight

Qualitative and quantitative

Benefits:

RICH VISUALIZATION

SCALABILITY, DISTRIBUTION, AND AVAILABILITY

Easy to assemble dashboards and APIs

Round the world and into the peta-scale

Acunu Analytics on Cassandra delivers instant answers from event streams, making it simple to build analytic applications and live dashboards that deliver fresh insight from live data.

There is no lag introduced by batching up events on arrival, nor by indexing loaded data before it is available to queries. Results arrive with millisecond latency.

Our unique approach of pre-building, on ingest, the cubes that satisfy the queries means predictable, very short query times that are not dependent on data volumes.

Read about our customers’ success and hear them talk in our webinars at www.acunu.com

www.acunu.com | @acunu contact@acunu.com | US office: +1 866-487-2650 | UK office: +44 203-1760143


Big data goes in. Business insight comes out. Our software turns your massive streams of machine data into business insights by making sense of website clickstreams, mobile devices, applications and other technologies that drive your business. It’s what our market-leading customers call real-time Operational Intelligence. Over half of the Fortune 100™ use Splunk software and have the business results to prove it. In days or weeks, not months or years.

www.splunk.com/goto/listen

www.splunk.com © 2013 Splunk Inc. All Rights Reserved. Fortune 100 is a trademark of the FORTUNE magazine division of Time Inc.


Editor’s Note

Dear Readers,

We're proud to give you a brand new Software Developer's Journal issue: SDJ Open. In this magazine we'll show you the essence of what SDJ is all about.

team Editor in Chief: Stanisław Winch stanislaw.winch@sdjournal.org

The first two issues of this year were dedicated to open-source projects. Inside, you could learn everything about Cassandra and Hadoop. Big data management is one of the most popular subjects in the whole IT world, and Cassandra and Hadoop give us possibilities which just a few years ago were beyond our imagination. A fast, high-performance distributed file system, as well as a framework for distributed computation, are examples of the great contribution Cassandra and Hadoop have made to big data solutions.

Editorial Advisory Board: Lee Sylvester, Adrian Cockcroft

The big thing we want to share with you is Apache ISIS, one of the newest top-level projects from The Apache Software Foundation. It's a framework for rapidly developing domain-driven apps in Java. ISIS will easily turn your domain services into a web app or a RESTful API. You can read about it in SDJ Open exclusively!

Publisher: Paweł Marciniak

Inside you will find out what ISIS is in "Introduction to ISIS" (a great article by one of its creators, Dan Haywood). To learn more about ISIS you can read about "Restful Objects" (one of the constituent parts of the framework), also by Dan Haywood. You will also get to know "A hybrid approach to enabling real-time queries to end-users" (by Benoit Perroud). A very useful article is "Grokking the Menagerie: An Introduction to the Hadoop Software Ecosystem" (by Blake Matheny). From the Cassandra issue we've included "COTS to Cassandra" (by Christopher Keller) and "Getting Started with Cassandra, using Node.js" (by Russell Bradberry).

We hope that now you are really "in" our magazine. Hope to see you soon, and stay tuned for more news @SDJ_EN. Stanisław Winch & the SDJ Team

Special thanks to our Beta testers and Proofreaders who helped us with this issue. Our magazine would not exist without your assistance and expertise. Also we want to thank Dan Haywood from Apache ISIS for his great contribution.

Managing Director: Katarzyna Kurant katarzyna.kurant@sdjournal.org
Production Director: Andrzej Kuca andrzej.kuca@sdjournal.org
Art Director: Ireneusz Pogroszewski ireneusz.pogroszewski@sdjournal.org
DTP: Ireneusz Pogroszewski
Marketing Director: Anahita Rouyan anahita.rouyan@sdjournal.org
Publisher: Software Press sp. z o.o. SK, 02-682 Warszawa, ul. Bokserska 1
Phone: 1 917 338 3631
http://en.sdjournal.org/

Whilst every effort has been made to ensure the highest quality of the magazine, the editors make no warranty, expressed or implied, concerning the results of the content's usage. All trademarks presented in the magazine were used for informative purposes only. All rights to trademarks presented in the magazine are reserved by the companies which own them.

DISCLAIMER! The techniques described in our magazine may be used in private, local networks only. The editors hold no responsibility for the misuse of the techniques presented or any data loss.




Contents

06

ISIS exclusive

Introducing ISIS By Dan Haywood

To stop himself from procrastinating in his work, the Greek orator Demosthenes would shave off half his beard. Too embarrassed to go outside and with nothing else to do, his work got done. We could learn a lesson or two from old Demosthenes. After all, we forever seem to be taking an old concept and inventing a new technology around it (always remembering to invent a new acronym, of course!) – anything, it would seem, instead of getting down to the real work of solving business problems.

18

Restful Objects On Apache ISIS By Dan Haywood

So, why REST? Well, REST is becoming a common way to integrate systems and components together, the idea being to have computer systems work together the same way that the world wide web works. In practical terms this almost always means using HTTP as the network protocol, and some sort of markup (analogous to HTML) as the way of representing information. Sometimes (X)HTML itself is used for this representation format, other times some other XML format such as AtomPub or RDF, sometimes a proprietary XML format, most commonly of all is to use JSON. You’ll find that many organizations now provide a RESTful API into their services – Twitter, Google Maps, NetFlix, Amazon, to name just a few.

32

Introduction to Hadoop

Grokking the Menagerie: An Introduction to the Hadoop Software Ecosystem by Blake Matheny

The Hadoop ecosystem offers a rich set of libraries, applications, and systems with which you can build scalable big data applications. As a newcomer to Hadoop it can be a daunting task to understand all the tools available to you, and where they all fit in. Knowing the right terms and tools can make getting started with this exciting set of technologies an enjoyable process.

38

Hybrid approach to enable real-time queries to end users by Benoit Perroud

Enhancing the BigData stack with real-time search capabilities is the next natural step for the Hadoop ecosystem, because the MapReduce framework was not designed with synchronous processing in mind. There is a lot of traction today in this area, and this article will try to answer the question of how to fill this gap with specific open-source components and build a dedicated platform that will enable real-time queries on an Internet-scale data set.

44

Cassandra

Getting Started with Cassandra, using Node.js by Russell Bradberry

Although Cassandra is classified as a NoSQL database, it is important to note that NoSQL is a misnomer. A more correct term would be NoRDBMS, as SQL is only a means of accessing the underlying data. In fact, Cassandra comes with an SQL-like interface called CQL. CQL is designed to take away a lot of the complexity of inserting and querying data via Cassandra.

50

COTS to Cassandra by Christopher Keller

At NASA’s Advanced Supercomputing Center, we have been running a pilot project for about eighteen months using Apache Cassandra to store IT security event data. While it’s not certain that a Cassandra based solution will go into production eventually, I’d like to share my experiences during the journey.




ISIS exclusive

Introducing Apache Isis™ To stop himself from procrastinating in his work, the Greek orator Demosthenes would shave off half his beard. Too embarrassed to go outside and with nothing else to do, his work got done.

We could learn a lesson or two from old Demosthenes. After all, we forever seem to be taking an old concept and inventing a new technology around it (always remembering to invent a new acronym, of course!) – anything, it would seem, instead of getting down to the real work of solving business problems.

Domain-driven design (hereafter DDD) puts the emphasis elsewhere, "tackling complexity in the heart of software". And Apache Isis™ (http://isis.apache.org) – an open source Java framework at the Apache Software Foundation – helps you build your business applications with ease. No beard shaving necessary.

In this article we're going to quickly review why a domain-driven approach is important, and then we'll see how Isis™ supports those principles of DDD, and how it lets you rapidly build apps either for prototyping or indeed deployment into production.

Understanding Domain-Driven Design

There's no doubt that we developers love the challenge of understanding and deploying complex technologies. But understanding the nuances and subtleties of the business domain itself is just as great a challenge; perhaps more so. We've all worked with technologies that require us to write reams of boilerplate, where coding the app is so much rote and routine. If we devoted our efforts instead to understanding the domain and addressing its subtleties, then we could build clean, maintainable software that did a better job for our stakeholders. There's no doubt that our stakeholders would thank us for it.

There are two central ideas at the heart of domain-driven design:


• The ubiquitous language is about getting the whole team (both domain experts and developers) to communicate transparently using a domain model.
• Meanwhile, model-driven design is about capturing that model in a very straightforward manner in code.

Ubiquitous Language
Build a common language between the domain experts and developers by using the concepts of the domain model as the primary means of communication. Use the terms in speech, in diagrams, in writing, and when presenting. If an idea cannot be expressed using this set of concepts, then go back and extend the model. Look for and remove ambiguities and inconsistencies.

Model-Driven Design
There must be a straightforward and very literal way to represent the domain model in terms of software. The model should balance these two requirements: form the ubiquitous language of the development team while also being representable in code. Changing the code means changing the model; refining the model requires a change to the code.

But how can a software framework help in this? Well, in the case of Apache Isis, it does so by adopting the naked objects (NO) architectural pattern. The fundamental idea of NO is to infer both the structure and behaviour of the domain objects from the code, and to allow the end-user to interact with the domain objects directly through a user interface generated dynamically (at runtime) from those domain objects.
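To make the naked objects idea concrete, here is a deliberately tiny, hypothetical sketch (the Customer class and its members are illustrative, not from any real app): a naked objects framework such as Isis would infer a name property from the getter/setter pair and a placeOnHold action from the remaining public method, and generate a UI offering both.

// Hypothetical domain entity: both structure (properties) and behaviour
// (actions) are inferred from the code by a naked objects framework.
public class Customer {

    private String name;
    // getter/setter pair -> rendered as an editable "Name" property
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }

    private boolean onHold;
    public boolean isOnHold() { return onHold; }
    public void setOnHold(boolean onHold) { this.onHold = onHold; }

    // remaining public method -> rendered as a "Place On Hold" action (button)
    public Customer placeOnHold() {
        setOnHold(true);
        return this;
    }
}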




Let me emphasise the "and behaviour" bit here. The naked objects pattern is sometimes characterized merely as a means to build CRUD applications. And while you can indeed build CRUD apps with an NO framework such as Isis, its real value comes from the fact that domain object behaviour is also exposed by the framework. For example, suppose you work for an estate management company where the lease agreements with tenants can be based on a multitude of factors. For index-linked leases, an invoice must be sent out each quarter and an adjustment must be made retrospectively when an index rate is updated. The non-trivial logic to recalculate and reissue invoices can all be exposed as behaviour on the business object. Moreover, the different calculations required for other lease types can be invoked in the same way: polymorphic operations, at the same abstraction level at which the domain expert thinks! (A sketch of how this might look follows at the end of this section.)

But another reason why you should consider using Isis to build your DDD app – perhaps the killer reason – is the speed of development that it allows. To build an application in Isis means writing the classes in your domain model, but it does not require you to write any UI logic. No controllers, no view models, no HTML pages with markup, no DTOs… nothing. As you might imagine, that lets you build stuff very rapidly indeed. In fact, you can even pair-program with your domain expert, building a working webapp that you can both evaluate. More subtly, because it doesn't cost much in terms of time to build the app, you are much more likely to experiment and try out other designs. Each time you iterate you'll gain greater insights into the business domain, and all the time you'll be extending and enriching your ubiquitous language.
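Here is that sketch: a hypothetical outline of the estate-management example, with all class and method names invented for illustration. The point is simply that the recalculation logic is an action on the domain object itself, and is polymorphic across lease types.

import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the estate-management example from the text; none of
// these classes come from a real Isis application.
abstract class Lease {
    // each lease type supplies its own calculation: a polymorphic operation
    // at the same abstraction level as the domain expert's language
    public abstract List<Invoice> recalculateInvoices(LocalDate asAt);
}

class IndexLinkedLease extends Lease {
    private double latestIndexRate;   // updated retrospectively when the index changes

    @Override
    public List<Invoice> recalculateInvoices(LocalDate asAt) {
        List<Invoice> reissued = new ArrayList<>();
        // reprice each quarterly invoice using the latest index rate and
        // reissue any whose amount has changed (details elided)
        return reissued;
    }
}

class Invoice { /* amount, period, etc. elided */ }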

How does Apache Isis compare to other frameworks?
Many other frameworks promise rapid application development, so how do they compare with Apache Isis? For many developers, the most obvious difference of Apache Isis is its deliberate lack of an explicit controller layer or views. Instead, both state and behaviour are automatically exposed in its generic object-oriented UI. This UI can then be customized in various ways. Isis supports a number of different viewers; the Wicket viewer (based on Apache Wicket™ (http://wicket.apache.org)) allows components to be registered that will render objects in a calendar view, or in a Google map, or as a chart, or whatever. Another viewer is its RestfulObjects viewer which – as you might have guessed – exposes a full RESTful API from the domain object model. Using this REST API you can, if you wish, hand-craft your own UI, with Isis automatically taking care of the grunt work of generating the RESTful controllers.

Yeah, jolly good. But I suspect by now you want to see some code. So let's see how you build an application in Isis.

Yet another “ToDo” app…

It seems that the "ToDo" app has become the modern equivalent of "Hello World". But we make no excuses for using the ToDo app as the basis for Isis' quickstart archetype; its relative lack of complexity makes it easy for you to refactor towards your own requirements. If you go to the Isis website (at http://isis.apache.org) then you'll find instructions and screencasts showing you how to run the quickstart (Wicket/Restful/JDO) archetype (http://isis.apache.org/getting-started/quickstart-archetype.html). All the example code and discussion that follows relates to that app. Do run the archetype yourself and follow along.

Listing 1. ToDoItems domain service's methods

@Named("ToDos")
public class ToDoItems {
    ...
    @MemberOrder(sequence = "1")
    public List<ToDoItem> notYetComplete() { ... }

    @MemberOrder(sequence = "2")
    public List<ToDoItem> complete() { ... }

    @MemberOrder(sequence = "3")
    public ToDoItem newToDo( ... ) { ... }

    @MemberOrder(sequence = "4")
    public List<ToDoItem> allToDos() { ... }
}


Figure 1. ToDoItems domain service’s methods rendered in the UI as actions




Domain Services vs. Domain Entities

The quickstart application consists of just two domain classes. The ToDoItem entity is obvious enough – it represents a single item on your todo list – while the ToDoItems class (note the plural) is a domain service acting as both repository (to look up existing ToDoItems) and as a factory (to create new ToDoItems). To see how Isis automatically builds a UI from the code, look at the signature of the methods in the ToDoItems service in Listing 1. In the user interface this is rendered as shown in Figure 1.

Most of these methods have no parameters, but the newToDo() method does. If invoked, it will, as Figure 2 shows, render a page prompting for argument values. You can see that there's actually enough information within the ToDoItems domain class for Isis to build this UI; each method in the class corresponds to (what Isis calls) an action in the UI. The menu item names can be inferred directly from the method names, while annotations – such as @MemberOrder – are used as additional rendering hints. Similarly the @Named annotation is used in the menu ("TODOS"); otherwise the menu label would have been derived from the class name.

Of course, not everyone is a fan of annotations, and we also sometimes hear complaints that such a programming model pollutes the domain model with UI concerns. However, the way that Isis builds up its internal metamodel is completely extensible, and so if you wanted to define some other way by which the ordering is defined (e.g. by reading some external XML file), then you are very welcome to do so. Another reason you might do this is if you wanted to use Isis for prototyping, but preferred to deploy on top of some other, existing, framework. In this case you could "teach" Isis about the programming conventions of your preferred deployment framework, but still benefit from Isis' rapid development cycle during earlier prototyping.

Domain Service Implementations

So far we've only discussed the method signatures of a domain service, so let's now look at the implementations. Listing 2 shows the newToDo() method/action (with some helper methods inlined to simplify the discussion). In the method implementation note the container field, which is used to obtain the current user, to instantiate and finally to persist the ToDoItem. The container is of type DomainObjectContainer, defined in Isis' application library, and it represents the domain objects' only hard dependency upon the framework. Even then, DomainObjectContainer is an interface, and so can easily be mocked out during unit testing.

Persistent agnostic entities
When writing an Isis application, you can create and persist entities using the DomainObjectContainer's newTransientInstance() and persist() methods. In Isis, entities are perhaps best described as "persistent agnostic" rather than "persistent ignorant"; they do know that they are persisted, but they don't know (nor care) how they are persisted.

Listing 2. ToDoItems newToDo() method

public class ToDoItems {
    ...
    @MemberOrder(sequence = "3")
    public ToDoItem newToDo(
            @Named("Description") String description,
            @Named("Category") Category category,
            @Named("Due by") LocalDate dueBy) {
        final String ownedBy = container.getUser().getName();
        final ToDoItem toDoItem = container.newTransientInstance(ToDoItem.class);
        toDoItem.setDescription(description);
        toDoItem.setCategory(category);
        toDoItem.setOwnedBy(ownedBy);
        toDoItem.setDueBy(dueBy);
        container.persist(toDoItem);
        return toDoItem;
    }
}

Listing 3. DomainObjectContainer injection

public class ToDoItems {
    ...
    private DomainObjectContainer container;

    public void setContainer(DomainObjectContainer container) {
        this.container = container;
    }
}

Figure 2. ToDoItems newToDoItems action takes arguments




But how does the domain service get hold of the container? Well, it is injected as a dependency, as shown in Listing 3. Not only is the DomainObjectContainer injected into the ToDoItems domain service, it is also automatically injected into the ToDoItem entity (by virtue of the fact that the instantiation of the ToDoItem is done using the container). In fact, Isis goes a little further: any of the domain services will also be injected into domain entities automatically. In this sense the DomainObjectContainer just happens to be one built-in general purpose domain service.
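As a concrete illustration of that last point, here is a minimal sketch following the same setter pattern as Listing 3. This particular setter is not shown in the quickstart listings in this article, so treat it as illustrative.

// Sketch: any registered domain service (here, the ToDoItems service itself)
// is injected into an entity simply by declaring a setter of the matching type;
// Isis calls such setters when the entity is instantiated via the container.
public class ToDoItem {
    ...
    private ToDoItems toDoItems;

    public void setToDoItems(final ToDoItems toDoItems) {
        this.toDoItems = toDoItems;
    }
}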

Let's look at another of the ToDoItems domain service methods, notYetComplete(). This is shown in Listing 4. Once more the method delegates to the DomainObjectContainer, this time calling its allMatches() method that returns all instances of the specified type meeting some criteria. We call this a "naïve" implementation of the repository because, although it is easy to write, this isn't code that would scale to large volumes; the filtering is actually done client-side. We trade off the speed of development against performance. When your domain model starts to stabilize, though, you'll want to replace this naïve implementation with one that will scale. Apache Isis has a pluggable persistence layer, and the implementation most commonly

Listing 4. The ToDoItems repository's implementation of the notYetComplete() action method

public class ToDoItems {
    ...
    public List<ToDoItem> notYetComplete() {
        return container.allMatches(ToDoItem.class, new Filter<ToDoItem>() {
            @Override
            public boolean accept(final ToDoItem t) {
                return ownedByCurrentUser(t) && !t.isComplete();
            }
        });
    }
}

Listing 5. The JDO-specific ToDoItemsJdo's implementation of the notYetComplete() action method

public class ToDoItemsJdo {
    ...
    public List<ToDoItem> notYetComplete() {
        return container.allMatches(
            new QueryDefault<ToDoItem>(ToDoItem.class, "todo_notYetComplete",
                "ownedBy", currentUserName()));
    }
}

Listing 6. The JDO annotations on the ToDoItem entity

@PersistenceCapable(identityType=IdentityType.DATASTORE)
@Queries({
    @Query(
        name="todo_notYetComplete", language="JDOQL",
        value="SELECT FROM dom.todo.ToDoItem "
            + "WHERE ownedBy == :ownedBy && complete == false")
    ...
})
public class ToDoItem {
    ...
}



used for production systems is the JDO objectstore, driven by the reference implementation for JDO, namely DataNucleus (http://datanucleus.org). Although JDO isn't as well known as JPA, DataNucleus is a very capable ORM, and is for example used within Google App Engine (https://developers.google.com/appengine/docs/java/overview). Thus, the ToDoItemsJdo class subclasses ToDoItems and overrides the repository methods with methods that delegate the querying to the database. Listing 5 shows the corresponding method, with the supporting annotations appearing on the ToDoItem entity in Listing 6.

One other thing we haven't yet discussed is how Isis knows about this domain service in the first place. Well, this is done through the WEB-INF/isis.properties file, reading the isis.services key. This can be seen in Listing 7. As you can see, there are in fact three services registered: ToDoItemsJdo (already

discussed), along with ToDoItemsFixturesService and AuditServiceDemo. This is why there are three menu items in Figure 1, for example. What do these other services do? Well, the ToDoItemsFixturesService provides a convenient way to install sample ("fixture") data, for prototyping with the users and/or for testing. This service would never be deployed into production, but for prototyping we usually run with a non-persistent data configuration (either the in-memory objectstore or the JDO objectstore configured against an in-memory HSQLDB database). In either case it's enormously beneficial to be able to quickly set up some realistic data. The AuditServiceDemo service, meanwhile (only available with the JDO objectstore), allows any changes to entities to be automatically captured and recorded in some fashion. Listing 8 shows the AuditService interface as defined by the JDO object store. And with that, let's now turn our attention to domain entities.

Listing 7. Registering the domain services in WEB-INF/isis.properties

isis.services = objstore.jdo.todo.ToDoItemsJdo,\
                fixture.todo.ToDoItemsFixturesService,\
                dom.audit.AuditServiceDemo

Listing 8. The AuditService interface allows simple auditing to be performed with ease

public interface AuditService {
    @Hidden
    public void audit(String user, long currentTimestampEpoch,
            String objectType, String identifier,
            String preValue, String postValue);
}

Listing 9. The ToDoItem entity's properties

public class ToDoItem ... {

    public static enum Category { Professional, Domestic, Other }

    @RegEx(validation = "\\w[@&:\\-\\,\\.\\+ \\w]*")
    @MemberOrder(sequence = "1")
    public String getDescription() { ... }
    public void setDescription( ... ) { ... }

    @MemberOrder(sequence = "2")
    public Category getCategory() { ... }
    public void setCategory( ... ) { ... }

    @Hidden
    public String getOwnedBy() { ... }
    public void setOwnedBy( ... ) { ... }

    @Disabled
    @MemberOrder(sequence = "4")
    public boolean isComplete() { ... }
    @Hidden
    public void setComplete( ... ) { ... }

    @MemberOrder(name="Detail", sequence = "3")
    @Optional
    public LocalDate getDueBy() { ... }
    public void setDueBy( ... ) { ... }

    @Hidden(where=Where.ALL_TABLES)
    @Optional
    @MultiLine(numberOfLines=5)
    @MemberOrder(name="Detail", sequence = "6")
    public String getNotes() { ... }
    public void setNotes( ... ) { ... }

    @javax.jdo.annotations.Persistent
    @Optional
    @MemberOrder(name="Detail", sequence = "7")
    public Blob getAttachment() { ... }
    public void setAttachment( ... ) { ... }

    @Hidden(where=Where.ALL_TABLES)
    @javax.jdo.annotations.Persistent
    @Disabled
    @Named("Version")
    @MemberOrder(name="Detail", sequence = "99")
    public Long getVersionSequence() { ... }
}
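Relating back to the AuditService interface in Listing 8, a minimal sketch of what an implementation might look like follows. The body here is an assumption for illustration only, not the actual AuditServiceDemo that ships with the quickstart app.

// Hypothetical sketch of an AuditService implementation.
public class AuditServiceDemo implements AuditService {

    @Override
    public void audit(String user, long currentTimestampEpoch,
            String objectType, String identifier,
            String preValue, String postValue) {
        // the simplest possible "recording in some fashion": write to stdout;
        // a production implementation might persist an audit entity instead
        System.out.println(user + " changed " + objectType + ":" + identifier
                + " from [" + preValue + "] to [" + postValue + "]");
    }
}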



Domain Entities

Whereas domain services only expose behaviour through actions, domain entities also hold state: properties and collections. The Isis runtime sits on top of the objectstore (in the case of the JDO objectstore, integrating very closely with the JDO/DataNucleus APIs) and transparently handles such concerns as lazy loading and dirty object tracking. Moreover Isis adds optimistic locking to the lifecycle management, preventing lost updates and similar problems.

In terms of code, you can think of a domain entity as a javabean/pojo "on steroids". It uses the getters and setters to identify the properties and collections, while any remaining public methods are taken to be actions of the entity. For example, Listing 9 sketches out the properties of the ToDoItem class. If you invoke one of the ToDoItems domain service actions such as notYetComplete(), then a list of matching ToDoItem instances is rendered in a table. This is shown in Figure 3. Comparing this to the code, you can see that the table automatically renders each property appropriately (for example, a checkbox for the complete property, or a download link for the attachment property). You'll also notice though that not every property is shown; those that are annotated as @Hidden or @Hidden(where=Where.ALL_TABLES) are missing. Visibility is just one of three business rules that Isis supports on class members, each of which can be specified declaratively (as here) or imperatively. We'll see examples of the other business rules and of imperative business rules shortly.

One thing you'll also have noticed is that every entity is identified by a title and an icon. The title can be specified either using the @Title annotation, or, as in this case, through a title() method. This is shown in Listing 10. It's possible to specify the image to be used as an icon through a similar mechanism, but more commonly this is done just by naming convention. In the various screenshots the icon being rendered is in a file ToDoItem.png.

Let's get back to looking at the rest of the ToDoItem entity though. Listing 11 shows (an outline of) the collections and actions that make up ToDoItem. Clicking on any of the links in the table (Figure 3) will take us to a page showing a single entity. This is shown in Figure 4.

Listing 10. The title of the ToDoItem is provided by the title() method

public String title() {
    final TitleBuffer buf = new TitleBuffer();
    buf.append(getDescription());
    if (isComplete()) {
        buf.append(" - Completed!");
    } else {
        if (getDueBy() != null) {
            buf.append(" due by ", getDueBy());
        }
    }
    return buf.toString();
}

Listing 11. The ToDoItem entity's collections and actions

public class ToDoItem ... {
    @Disabled
    @MemberOrder(sequence = "1")
    @Resolve(Type.EAGERLY)
    public SortedSet<ToDoItem> getDependencies() { ... }
    public void setDependencies( ... ) { ... }

    @MemberOrder(sequence = "5")
    @NotPersisted
    @Resolve(Type.LAZILY)
    public List<ToDoItem> getSimilarItems() { ... }

    @Bulk
    @MemberOrder(sequence = "1")
    public ToDoItem completed() { ... }

    @MemberOrder(sequence = "2")
    public ToDoItem notYetCompleted() { ... }

    @MemberOrder(name="dependencies", sequence = "3")
    public ToDoItem add( ... ) { ... }

    @MemberOrder(name="dependencies", sequence = "4")
    public ToDoItem remove( ... ) { ... }

    @Named("Clone")
    @MemberOrder(sequence = "3")
    public ToDoItem duplicate( ... ) { ... }
}

Figure 3. A collection of ToDoItems is automatically shown in a table



On the left hand side are the properties, on the right are the collections, and at the top and also associated with the collections are buttons for the actions. As we saw previously, the @MemberOrder annotation is used to specify the relative order of class members to each other, though note now that for properties the @MemberOrder's name attribute is used to group them together ("General", "Detail"). @MemberOrder also orders the collections and actions, and for actions the name attribute can optionally be used to associate an action more closely with a collection; thus the add and remove actions are shown within the dependencies collection.

If you look at the notes property you'll notice the @MultiLine annotation; this is why that property is rendered using a textbox rather than a textfield. The category property meanwhile is rendered as a dropdown and its values are constrained by the Category enum.

If we hit the "Edit" button then we can change the properties of the entity. Not every property can be edited though; those that are annotated @Disabled – such as the complete property – remain read-only. Derived properties also cannot be edited; an example is the version property. After visibility/invisibility, this enablement/disablement is the second of the business rules we can apply to class members.

Of those properties that can be changed, not every property is mandatory. Those that are not are annotated with @Optional. We chose this approach when designing the Isis programming model (rather than the opposite of having an @Required or @Mandatory annotation) because identifying those properties that are strictly optional takes domain analysis; it's safer for Isis to assume that properties are required unless told otherwise.

Finally, you'll see that the description property is annotated with @RegEx, specifying a regular expression. This is one of a handful of annotations that can validate a property to make sure its value is correct; another commonly used one is @MaxLength. And validation is the third of the business rules that Isis lets us apply to class members.

In all we therefore have three types of rules: visibility rules, enablement rules and validation rules. That's a bit of a mouthful, so you can also remember them as the "see it, use it, do it" rules.
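As a quick preview of the imperative flavour covered in the next section, each rule type maps onto a supporting-method prefix that Isis matches by naming convention. The sketch below is hypothetical and written for an imaginary ownedBy property; the article only shows the disableXxx() and validateXxx() forms, so the boolean hideXxx() form shown here is an assumption, as is the currentUserIsAdmin() helper.

// Hypothetical sketch: the three "see it, use it, do it" rules expressed imperatively.
public boolean hideOwnedBy() {                  // "see it": returning true hides the property
    return !currentUserIsAdmin();               // currentUserIsAdmin() is a hypothetical helper
}
public String disableOwnedBy() {                // "use it": a non-null reason disables editing
    return isComplete() ? "Cannot reassign a completed item" : null;
}
public String validateOwnedBy(String ownedBy) { // "do it": a non-null message rejects the value
    return ownedBy.trim().isEmpty() ? "Owner cannot be blank" : null;
}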

Declarative and also Imperative Business Rules

So much for the annotations; Isis also allows business rules to be specified imperatively, using a number of supporting methods. For example, Listing 12 fleshes out the dueBy property. Both validateDueBy() and clearDueBy() are examples of supporting methods for the dueBy property; Isis matches them by naming convention. In the case of the validateDueBy() method, this is called whenever the user enters a new value, prior to actually calling the setter.

Listing 12. Supporting methods for the ToDoItem entity's dueBy property

private LocalDate dueBy;

@javax.jdo.annotations.Persistent
@MemberOrder(name="Detail", sequence = "3")
@Optional
public LocalDate getDueBy() {
    return dueBy;
}
public void setDueBy(final LocalDate dueBy) {
    this.dueBy = dueBy;
}
public void clearDueBy() {
    setDueBy(null);
}
public String validateDueBy(final LocalDate dueBy) {
    if (dueBy == null) {
        return null;
    }
    return isMoreThanOneWeekInPast(dueBy) ?
        "Due by date cannot be more than one week old" : null;
}

Listing 13. Supporting methods for the ToDoItem entity's completed action

@Bulk
@MemberOrder(sequence = "1")
public ToDoItem completed() {
    setComplete(true);
    return this;
}
public String disableCompleted() {
    return complete ? "Already completed" : null;
}

Figure 4. A single ToDoItem entity, as rendered in Isis



If the validate method returns a non-null string, then, as shown in Figure 5, this is used as the error message in the UI. Another example of a supporting method, this time to disable an action, can be seen in the completed action, as per Listing 13. As we see in Figure 6, this disable method will cause the button for the action to be greyed out for a ToDoItem whose complete property is true. Hovering over the button provides a tooltip as to why the action cannot be invoked.

While we're on the topic of the completed action, you might have noticed the @Bulk annotation. This can be annotated on no-arg actions, and if so, it allows the action to be performed against all selected objects at once. If you look back you'll notice the "Complete" button above the table in Figure 3.

Subtractive Programming
In other frameworks the functionality of the app must be built up piece by piece. With Isis it is to some extent the other way around: defining the domain class structure gives us basic CRUD behaviour, and then we add the "hide it, use it, do it" rules to subtract functionality. Where things are more like other frameworks is that, on top of this basic CRUD behaviour, we must identify and implement the domain object actions that provide the main value-added business logic. But we can get to that point very quickly, and won't have exhausted ourselves writing reams of boilerplate in the meantime.

Usability

As well as implementing business rules, supporting methods also help in the usability of the resultant application. A common case is in providing defaults for arguments when invoking actions, and/or in providing a selection of choices. For example, Figure 7 shows this in the case of the ToDoItem's clone action. And Listing 14 shows the corresponding code. The default arguments for the three parameters are provided in the supporting methods default0Duplicate(), default1Duplicate() and default2Duplicate() (that is, defaultNDuplicate(), where N is the zero-based parameter number).

Listing 14. The supporting methods for the clone action that provide the defaults

@Named("Clone")
@MemberOrder(sequence = "3")
public ToDoItem duplicate(
        @Named("Description") String description,
        @Named("Category") ToDoItem.Category category,
        @Named("Due by") @Optional LocalDate dueBy) {
    return toDoItems.newToDo(description, category, dueBy);
}
public String default0Duplicate() {
    return getDescription() + " - Copy";
}
public Category default1Duplicate() {
    return getCategory();
}
public LocalDate default2Duplicate() {
    return getDueBy();
}

Figure 5. A single ToDoItem entity, as rendered in Isis

Figure 6. Isis automatically disables the completed action

Figure 7. Defaults are provided for the clone action



Note, by the way, that the actual method is called duplicate() and is renamed to "Clone" using the @Named annotation; clone() itself is an inherited method whose semantics ought not to be overloaded. This technique can also be used for actions whose name would clash with a Java keyword, for example package or default.
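For instance, an action that should appear as "Package" in the UI could not use that word as its Java method name; the hypothetical sketch below (the method name and sequence are illustrative) applies the same renaming technique.

// Hypothetical sketch: "package" is a Java keyword, so the action gets a legal
// method name and is renamed in the UI via @Named, just like duplicate()/"Clone".
@Named("Package")
@MemberOrder(sequence = "5")
public ToDoItem packageUp() { ... }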

Injecting Domain Services
If you look at the ToDoItem's duplicate() method, you'll see it's a one-liner:

return toDoItems.newToDo(description, category, dueBy);

The toDoItems object being delegated to here is an injected domain service; the injection happens automatically through a setter. It's a small but important point; domain services extend the reach of domain entities so that they can do such things as:

• call out to other systems via web services
• publish events onto an enterprise service bus
• generate representations of themselves as word documents or PDFs
• attach barcodes to such a PDF so that a document can be scanned
• send out emails, text messages or tweets (see the sketch below)

The (Java) interface for each of these is defined as part of the domain object model, expressed in terms that make sense to the domain expert. The technology that implements all this is down in the infrastructure. During testing, each of these interfaces can be easily mocked out, and the service implementations themselves tested independently.

And what would Isis look like if it didn't support dependency injection in this way? Well, you'd need to have some sort of application layer that requested state from the domain object and called the domain service. Not only would that make our domain objects "dumber", it would also introduce an additional layer of code to maintain.

Let's move on to the choices supporting method, an example of which can be seen in the remove action. If dependencies have been added to a ToDoItem, then it makes sense only to offer those dependencies for removal. This is shown in Figure 8. The supporting methods in this case can be found in Listing 15 (there are also examples of disableXxx() and validateXxx() methods here).

We didn't actually look at the ToDoItem's add action method yet. The Java method for it is simple enough, as shown in Listing 16. But the way that Isis renders this is interesting because, as shown in Figure 9, the user can specify the ToDoItem for the action by just typing its title.

Listing 15. The supporting methods for the remove action

@MemberOrder(name="dependencies", sequence = "4")
public ToDoItem remove(final ToDoItem toDoItem) {
    getDependencies().remove(toDoItem);
    return this;
}
public String disableRemove(final ToDoItem toDoItem) {
    return getDependencies().isEmpty() ?
        "No dependencies to remove" : null;
}
public String validateRemove(final ToDoItem toDoItem) {
    if(!getDependencies().contains(toDoItem)) {
        return "Not a dependency";
    }
    return null;
}
public List<ToDoItem> choices0Remove() {
    return Lists.newArrayList(getDependencies());
}

Listing 16. The ToDoItem add action method takes a reference to another entity

@MemberOrder(name="dependencies", sequence = "3")
public ToDoItem add(final ToDoItem toDoItem) {
    getDependencies().add(toDoItem);
    return this;
}
public String validateAdd(final ToDoItem toDoItem) {
    if(getDependencies().contains(toDoItem)) {
        return "Already a dependency";
    }
    if(toDoItem == this) {
        return "Can't set up a dependency to self";
    }
    return null;
}

Figure 8. Choices are provided for the remove action



How does Isis know to do this? Well, the wiring is done declaratively. As Listing 17 shows, we annotate the ToDoItem class with @AutoComplete which, in turn, specifies an action to be invoked on the ToDoItems repository/domain service.

One possible issue with all these supporting methods – associated to their class member by naming convention – is that renaming the class member might leave the supporting method "orphaned". To counteract that, Isis validates the resultant metamodel that it builds up internally; any orphaned methods are flagged when Isis starts. Listing 18 shows what appears in the stack trace if validateDueBy() is renamed to validateXDueBy(). This metamodel validation is extensible, by the way. If you want to register your own checks (perhaps to enforce some project-specific convention), then you can.

Figure 9. The add action's parameters can be searched for using autocomplete

Customizing the User Interface

The user interface provided by Isis is – we think – reasonably usable given that it is completely generic. But there will be times when you want to extend it. In the case of Isis’ Wicket viewer (the viewer that we’ve been showing in the screenshots), it provides an extensible API to allow different renderings of any of the components on the UI, leveraging Apache Wicket’s own Component interface. For example, one extension that is available (https://github.com/danhaywood/isis-wicket-gmap3) is to automatically render entities on a map, as shown in Figure 10.

Listing 17. Autocomplete is provided through an annotation and a method

@AutoComplete(repository=ToDoItems.class, action="autoComplete")
public class ToDoItem ... {
    ...
}

public class ToDoItemsJdo {
    ...
    @Hidden
    public List<ToDoItem> autoComplete(final String description) {
        return allMatches(
            new QueryDefault<ToDoItem>(ToDoItem.class, "todo_autoComplete",
                "ownedBy", currentUserName(),
                "description", description));
    }
}

Listing 18. Isis checks that supporting methods do not get orphaned

...
Caused by: org.apache.isis.core.metamodel.specloader.validator.MetaModelInvalidException: 1: dom.todo.ToDoItem#validateXDueBy has prefix validate, has probably been orphaned. If not an orphan, then rename and use @Named annotation
    at org.apache.isis.core.metamodel.specloader.validator.ValidationFailures.assertNone(ValidationFailures.java:40)
    at org.apache.isis.core.metamodel.specloader.ObjectReflectorDefault.init(ObjectReflectorDefault.java:237)
    at org.apache.isis.core.runtime.system.session.IsisSessionFactoryAbstract.init(IsisSessionFactoryAbstract.java:110)
...



Supporting this is a matter of implementing a Locatable interface, as shown in Listing 19. Other extensions to support graphs, calendars and so on are just as easily supported.

If you want to go further, then you can consider some of the other viewers that Isis provides. One in particular – the RestfulObjects viewer – automatically provides a complete RESTful API of your domain objects, exposing both their state and their behaviour/business rules. Using this you are free to build any bespoke user interface that you wish.

Should you, though? There's a large maintenance cost when you start building bespoke UIs; as you've seen, Isis can generate a reasonably good user interface without introducing that maintenance cost. So what are the pros and cons? Well, a distinction that we like to make is between users that are "problem solvers" compared to those that are "process followers".

Problem solvers are those users who have a (reasonably) deep understanding of the domain, and want a computer system to help them work within that domain in whichever way makes sense to them. The domain objects need to enforce their business logic (they shouldn't do things they aren't allowed to, nor allow themselves to be put into an invalid state), but the system doesn't prescribe how it is interacted with. Many line-of-business/enterprise apps (pick your term) fall into this category. And as I hope you'll have guessed, the generic user interfaces provided by Isis support this category of users very well.

Process followers, meanwhile, are those users who are following a more narrowly defined process. This may be because they are inexperienced with the domain, or it may be that the process is high volume, simple, and needs optimization. These are the situations when a customized user interface should be built. In this case Restful Objects is a great way to provide the backend plumbing, while you can choose your favourite user interface technology for the front-end.

The dangers of custom UIs
While there are good and valid reasons for writing a custom UI to your app (see the "process followers" discussion in the main text), you should try to defer that step to as late as possible. Once implemented, a user interface tends to "bake in" the underlying domain model and make it that much harder to refactor. Jumping too prematurely to a custom UI also means that insights into the domain model can get lost. We should not be wasting time tweaking fonts; if there's a discussion to be had, it would be why the user would want the font size larger in one place of the UI than another. It presumably isn't arbitrary, which means that there's probably some domain knowledge waiting to be discovered. But the most obvious danger of having a custom-written UI is that we can very easily start writing business logic directly in the presentation layer. What starts as a quick check for a non-negative number soon mushrooms into a huge chunk of logic that really should reside within a domain object model. In contrast, Isis' generic presentation and application layers act as a kind of firewall, ensuring that domain logic does not leak out of the domain. With Isis it's impossible to misplace domain logic because there's only one place to write your code: in the domain objects themselves!

Moving into Production

At some point you're presumably going to want to take your application into production. Isis apps are typically built using Maven, resulting in a WAR file, so in principle there's not a lot to do. Over and above making sure your app is fully tested, of course, there are though two primary areas to consider. The first we've discussed already – developing objectstore-specific implementations of your repositories in order to scale. Thus, you should have repositories

Listing 19. Implement Locatable to support being rendered on a map

public class ToDoItem implements Locatable {
    ...
    @javax.jdo.annotations.Persistent
    private Location location;

    @MemberOrder(name="Detail", sequence = "10")
    @Optional
    public Location getLocation() {
        return location;
    }
    public void setLocation(Location location) {
        this.location = location;
    }
}

Figure 10. Entities can be rendered within a Google™ map



such as the ToDoItemsJdo class, rather than the naïve implementations in the ToDoItems class.

The other main aspect to consider is security. Isis provides an integration with Apache Shiro™ (http://shiro.apache.org), so the authentication part of the puzzle can be handled using Shiro's own pluggable architecture. Authorization, too, can be handled by Shiro; the only thing you need to know is Isis' format for permissions, which is:

packageName:ClassName:memberName:r,w

where:

• memberName is the property, collection or action name
• r indicates that the member is visible
• w indicates that the member is usable (editable or invokable)

However, Shiro handily allows "*" to be used as a wildcard at any level, or indeed to be omitted. Thus we can have permissions such as:

myapp.dom.todo:ToDoItem:dueBy:r,w
myapp.dom.todo:ToDoItem:dueBy:r
myapp.dom.todo:ToDoItem:add:*
myapp.dom.todo:ToDoItem:*:r
myapp.dom.todo:ToDoItem:*
myapp.dom.todo:ToDoItem
myapp.dom.todo:*:*:r
myapp.dom.todo:*
myapp.dom.todo
*

As I've already noted, we do recognize that Isis is a young product, and while you may like the rapid prototyping that it offers, you may prefer to deploy your application on some other framework. Doing that will involve writing additional code of course – controllers and views, mostly – but you should be able to leverage most or all of your domain object model. You'll have to decide whether you want to bring in a dependency on Isis' applib (which defines the annotations such as @MemberOrder and the DomainObjectContainer interface), or whether you'll avoid the applib completely and instead bend Isis to follow your own framework's conventions. We recommend the former, but it's your choice, of course.

Closing Thoughts

If you've got this far in the article, then first: well done (it is quite long!), and second, I hope that you've liked what you've seen. But while this article should have given you a good idea of Isis' programming conventions and how they translate into a fully working webapp, we have really only scratched the surface of the benefits to be had when using Isis. I can quite honestly guarantee that you'll be amazed at how productive you will find yourself once you start to use Isis in earnest.

And even though we tout Isis' killer feature as being its ability to create a user interface directly from your domain object model, that isn't really what Isis is about at all. You'll quickly discover when you work with Isis that things like names matter very much; and names are very much the heart of the ubiquitous language concept. It's funny how being able to quickly and cheaply rename classes and class members significantly enhances understanding. To give an example: on one piece of new functionality I've been working on, the class name has been renamed from Service, to ServiceRequest, to Questionnaire, as we figured out what we really wanted the app to do. You'll find yourselves with some quite sophisticated concepts being identified; a couple that spring to mind are BookRenewalCycle (pertaining to how a government agency reissues expired pension books), or Indexable (for an item whose year-on-year costs are based upon some external Index). Hopefully you can see why we call Isis a framework for domain-driven design.

It's easy to get started with Apache Isis. The quickstart archetype will generate the same application that we've discussed in this article, and you'll find plenty of help on the mailing lists (http://isis.apache.org/support.html). I hope to see you there.


Dan Haywood
Dan is a freelance consultant, developer, writer and trainer, specializing in domain-driven design, agile development, enterprise architecture and REST on the Java and .NET platforms. He blogs at http://danhaywood.com. He's a well-known advocate of the naked objects pattern, and was instrumental in the success of the first large-scale naked objects system, which administers state benefits for citizens in Ireland. He continues in his role there as an advisor to the government. Dan is a committer and the current chair of the Apache Isis™ project, and the primary author of various of its components. He is also the author of the Restful Objects specification, which defines a hypermedia API for exposing domain object models. Apache Isis provides one implementation of this spec, and Dan is also a committer on Restful Objects.NET, an implementation for .NET on ASP.NET MVC.



ISIS exclusive

Restful Objects on Apache Isis™ If you’ve read the other article on Apache Isis™ in this issue – Introducing Apache Isis – then you may recall that I mentioned Isis’ ability to automatically generate a RESTful API for your domain object model. In this article we’re going to delve a little deeper into this capability.

To illustrate the ideas, I'll once more be using the example "To Do" application, as generated by the Isis quickstart archetype (http://isis.apache.org/getting-started/quickstart-archetype.html). If you haven't read the other article, then you might want to do so first; it provides a bit of context as to what Isis is all about and will introduce you to this simple domain. But let's start off by making sure we're on the same page when it comes to REST.

A bit about REST

So, why REST? Well, REST is becoming a common way to integrate systems and components together, the idea being to have computer systems work together the same way that the world wide web works. In practical terms this almost always means using HTTP as the network protocol, and some sort of markup (analogous to HTML) as the way of representing information. Sometimes (X)HTML itself is used for this representation format, other times some other XML format such as AtomPub or RDF, sometimes a proprietary XML format; most common of all is to use JSON. You'll find that many organizations now provide a RESTful API into their services – Twitter, Google Maps, NetFlix, Amazon, to name just a few.

REST is not only used to integrate systems across the internet, though; it's increasingly being used within the intranet. Two common use-cases are to link independent systems within an organization (for example, through an enterprise service bus), and to act as the communication between the client- and server-side portions of a single application.

A million ways to write an app
As the main text notes, REST is increasingly important as the communication protocol within a single application, and there are now numerous ways to build the client-side portion of that app. Many, for example, are building apps in Javascript and hosted in the web browser, using any one of the multitude of Javascript libraries out there (Backbone, Knockout, Angular, EmberJs, Kendo, and so on). The generic term for these is single page apps (SPAs); the user never refreshes the web page. Think gmail. But Javascript isn't the only UI technology out there. Other equally capable options include Microsoft's XAML technologies (WinRT, Silverlight, WPF), JavaFX and Flex. And, of course, we have all the mobile technologies: Android, iPhone, more XAML (Windows Phone), as well as cross-platform solutions such as Xamarin's MonoTouch. The list goes on and on. And in all cases, REST is an important enabler to deliver the data to these apps.

Not all RESTful APIs are equal though. In fact, according to the strict definition of REST (http://www.ics.uci.edu/~fielding/pubs/dissertation/rest_arch_style.htm), most APIs that claim to be RESTful are nothing of the sort. The one thing that is absolutely fundamental to a RESTful API is hypermedia links between representations. If we think of REST as a generalization of the way that the world wide web works, then the phrase "hypermedia links" means something equivalent to HTML's humble <a href="..."> and <form method="..."> elements. As we know, the former instructs the web browser to use an HTTP GET to traverse to some related representation; the latter is an instruction to use either HTTP GET or HTTP POST to upload some information to the server, and again render the resultant representation. This support for hypermedia links is so critical to the "pure" definition of REST that it even has a (rather




clumsy) acronym: HATEOAS. That stands for "hypermedia as the engine of application state". Think of checking out a shopping basket when using some website, where each page of the checkout flow takes you through to the next page, and finally to the finish. This is what HATEOAS means; you've just used an application where your state within that application was determined by the links on the page. Another term that's starting to be used more commonly as a way of describing REST is "Hypermedia API". I must admit I prefer this term, if only because it rolls off the tongue a little easier than HATEOAS.

Web services in REST vs Web services in SOAP
Since REST is about building a (different sort of) web service, it's useful to compare it to the way we used to build web services a few years ago, namely SOAP. With SOAP, we define pairs of XML messages to represent a request and response, and we send the request message via HTTP POST to a single endpoint. Although SOAP uses the HTTP protocol here, it isn't doing so in a web-like manner. There are no links between different resource URIs, for one thing; all requests go to a single endpoint that must look inside the request to see what operation we want to perform, then does a clumsy switch statement to delegate to the right bit of functionality.

Another difference between REST and SOAP is that, in using HTTP POST, SOAP ignores one of the main benefits of the web architecture: caching. When your web browser loads a page it requests multiple resources: the HTML, the CSS stylesheets, globs of Javascript, images and so forth. But of course most of those resources are cached either in the web browser's own cache, or in some intermediary cache server. Even if not yet cached, they are often served from dedicated CDN (content delivery network) servers. REST is an architecture designed to optimize the hell out of HTTP GET. None of this happens in SOAP.

One other aspect of REST that's important to mention is media types. The reason that a web browser knows what to do with a CSS file, or an HTML page, or an image, is because of the media type that is served up with the page (in the HTTP Content-Type header). Thus we have "text/html" or "image/png". Serving up such media types is important because they tell the REST client how to process the page/representation being returned to it.

Writing RESTful APIs

So REST is an important architectural style, the intention being to have computer systems communicate over web infrastructure in a way that goes with the conventions of the web rather than fighting against them. Fair enough. So how would one normally go about writing such an API?


Well, in the same way that Java EE defines servlets and JSPs and the like, there is similarly the JAX-RS (JSR-311) API (http://jcp.org/en/jsr/detail?id=311) that aims to make it easy enough to write controllers to route requests through to the various URIs that make up a RESTful API. You can do this, and lots and lots of people do.

However... just as building a custom UI is an expensive business, so too is building a custom RESTful API. All the same issues that one must address in building a custom UI – the structure of the pages/representations, the widget set, the URIs to expose, etc. – all of these must also be addressed when building a custom RESTful layer. Indeed, if you subscribe to the various mailing lists focusing on REST (http://tech.groups.yahoo.com/group/rest-discuss/) then you'll see that quite a lot of the discussion focuses on these relatively unimportant details.

Another issue with building a custom RESTful API is ensuring that it is self-consistent. Just like the web, a REST API consists of a bunch of URIs (a generalization of the URLs that your web browser navigates to) with representations that link to other URIs. But if you make a mistake in the rendering of one of those links, then you are going to get a 404 error. And if you are following the HATEOAS principle strictly, that mistake means you've just broken your app. When you build a custom RESTful API, you also need to thoroughly check that every hypermedia link rendered in your representations does indeed point to a valid URI. I am sure that you test your applications, but remember that testing these APIs can only really be done by integration testing: running your app in a web server and hitting it across the network. There are of course good testing libraries such as Arquillian (http://www.jboss.org/arquillian.html) to assist, but these will still be comparatively slow tests to run.
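To give a feel for what hand-writing such controllers with JAX-RS involves, here is a minimal, hypothetical sketch; the resource class, URI layout and JSON shape are all illustrative, not taken from any real API. The point is that every representation, and every hypermedia link within it, has to be assembled and kept consistent by hand.

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;

// Hypothetical hand-written JAX-RS controller: one such class per domain concept.
@Path("/toDoItems")
public class ToDoItemResource {

    @GET
    @Path("/{id}")
    @Produces("application/json")
    public String get(@PathParam("id") String id) {
        // in a real API the entity would be looked up and mapped to JSON here;
        // note that the "self" link below is easy to get wrong, breaking HATEOAS
        return "{ \"id\": \"" + id + "\","
             + " \"links\": [ { \"rel\": \"self\", \"href\": \"/toDoItems/" + id + "\" } ] }";
    }
}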

Why Restful Objects

I wrote the Restful Objects specification (http://restfulobjects.org) because it occurred to me (as it has to many others, no doubt) that there's a correspondence between the RESTful web (of representations/pages with hypermedia links) and the web that can be inferred from a domain object model (a web of domain objects that relate to each other and can interact with each other). For example (and referring to Isis' quickstart app here), if we start at the ToDoItems domain service, I can invoke its notYetCompleted action to obtain a list of ToDoItem entities. Using the ToDoItem's add action I can add a reference to another ToDoItem. Looking at any given ToDoItem I can navigate to other items through either its dependencies collection or its similarTo collection. What the Restful Objects spec aims to do is to formalize this correspondence between these two different "webs of stuff". For example, the URI


http://localhost:8080/restful/objects/ToDoItem:L_1

is a resource corresponding to ToDoItem with id=1. More interestingly, it will have a number of links, one of which is:

http://localhost:8080/restful/objects/ToDoItem:L_1/collections/similarItems

by which we can traverse to the related ToDoItems. However, the intention in writing the spec was also to be able to exploit the metamodel that sits within Isis; or, said another way, to be able to leverage the "see it, use it, do it" business rules (discussed in the previous article) that Isis enforces. If the domain object hides a property, then the representation of that domain object would similarly have no such link to the (representation of the) related entity. If a domain object made a property disabled, then there would be no link in its representation to be able to update its value. If an action was invoked with invalid arguments, then a 400 response code would be returned.

The Restful Objects spec, then, defines not only the resource URIs and representations for each domain entity and domain service; it also defines the media types for these representations, the links between the representations, how to follow links that represent the updating of an entity's property, similarly how to follow links to add to or remove from an entity's collection or to invoke a domain entity's or service's actions. It also describes what HTTP caching should be performed, how optimistic locking is implemented, and a whole bunch more.

For the mathematicians among us: 2 + 5 = 7.

½ x ½ = ¼.

SELECT a, b FROM t1 WHERE c=3

Order newOrder = cust.placeOrder(product, 3);

Oh, and also... (http://en.wikipedia.org/wiki/File:Rubik%27s_cube.svg)

What is common to all the above? Well, they are all examples of an algebra. When we multiply two numbers together we get another number; when we invoke a method on an object we get back some other object; when we twist a face on the Rubik’s cube we still have a cube, but with a different permutation.


One might also say that a RESTful API forms an algebra of representations. We start off at a representation, say the home page. We follow the "my profile" link, and we end up at another representation, a page showing the profile of the currently logged-in user. We follow the "my basket" link and we end up on a page representing the shopping basket of the currently logged-in user. And so on it goes. Just like solving the Rubik's cube.

All told, the Restful Objects spec runs to over 200 pages. Granted, that's a lot of words, but the upside is that it defines how to render a generic API for pretty much any domain object model. While the representation of a ToDoItem domain entity will vary from the representation of the ToDoItems domain service, fundamentally those representations will be built out of the same component parts, all defined within the spec. To go back to my UI analogy, the representations of the domain objects are all built out of the same "widget set". Moreover, we can guarantee that the set of representations and resources is self-consistent because there's an underlying "web of stuff" (our domain objects) that they relate to. That's a whole bunch of testing that just doesn't need to be done. You don't really need to document your RESTful API either; your developers will be able to figure out what to expect just by looking at your domain classes.

Restful Objects on Isis

So far in this article I've only really been talking about the Restful Objects specification, so now let's turn to Restful Objects on Isis; in other words, how Isis generates a RESTful API conformant with the Restful Objects specification. There are parallels here with the way in which Isis renders the human-usable webapp (as discussed in the other article). Indeed, this is reflected in Isis' internal architecture; both the webapp and the RESTful API are each generated by their respective component. This is shown in Figure 1.

As the diagram shows, Isis has various types of component, the most important of which are its presentation mechanisms (also known as "viewers"), its security mechanisms, and its persistence mechanisms ("object stores"). Thus, the Wicket viewer generates the human-usable webapp, while the RESTful API is generated by the Restful Objects (RO) viewer. The Wicket viewer (as you probably have guessed) uses Apache Wicket™ (http://wicket.apache.org) as its underlying technology, while the RO viewer sits on top of JBoss RestEasy (http://www.jboss.org/resteasy), an implementation of the JAX-RS specification. The Isis architecture is shown around a hexagon because it is, in fact, an example of the hexagonal architecture (http://alistair.cockburn.us/Hexagonal+architecture).


The domain object model sits in the middle of the system, dependent on nothing. The viewer(s) provide a way for the end user to interact with the domain object model, through whichever channel makes sense. The security mechanism intercepts the interactions to ensure that the user has permission to view/modify the domain objects. Finally, the object store subscribes to changes to the domain objects and automatically saves changes to existing objects and/or persists new objects.

If you've run the quickstart application then you'll have seen that the welcome page (shown in Figure 2) summarizes the above components, and invites you to follow the link to either the Wicket or the RO viewer. The Wicket viewer was covered pretty comprehensively in the other article, so let's now explore what the RO viewer has to offer.

Before we dive in, the first thing to note is that the representations served up are JSON (Javascript Object Notation). You will presumably write some sort of client-side app to consume these JSON representations, but you'll also find that most web browsers have plugins that allow them to render the JSON in a sensible format. This is a great development aid for when you do get around to writing that snazzy app. My own preference is to use Google Chrome, with two plugins: jsonview (to render the JSON) and REST Console (to be able to submit arbitrary HTTP requests). Other alternatives include Postman (Chrome) and RESTClient (Firefox). But experiment yourself to find the tools that you find the most comfortable; new ones are being written all the time, so it would seem.

I also must make one small admission. The RO viewer in Isis currently implements a slightly earlier version of the RO spec (v0.5.6). So if you go comparing the output of the RO viewer with the spec you'll see that while all the main features are implemented, there are some minor (mostly cosmetic) differences here and there. A release of the RO viewer is planned for later this year to bring it up to the first formal release of the RO spec (v1.0.0).

OK, with that out of the way, let's now start to dig into the detail. If we click on the "restful/" link, then we'll get a BASIC HTTP challenge, as shown in Figure 3. Other than the notion that there is a current user, neither the RO spec nor Isis' RO viewer has much to say about authentication; the quickstart archetype configures a simple Basic challenge, but other more sophisticated authentication mechanisms could be used as required.

Figure 2. Quickstart Welcome Page

Figure 3. Basic HTTP challenge

Figure 1. Hexagonal Architecture


Assuming that you're past the login page, the first representation rendered by the RO viewer (and as defined by the RO spec, of course) is the home page. If you are using a plugin such as jsonview, then it will be pretty-printed in your browser, as shown in Figure 4. As you can see, the jsonview plugin is clever enough to notice hyperlinks and we can follow them just as we would an <a href="..."> on an HTML page. (We will, shortly, see some links that cannot be followed this way, hence the need for another plugin such as the REST Console; but we can do a surprisingly large amount just by following links within jsonview.)

In terms of the content of the page shown, you can see that it's a list of links. The RO spec defines this format, with links and extensions as standard json-properties in every RO representation. As you might expect, you'll find link nodes underneath the links json-property, while extensions is for implementations to add additional information not defined within the RO spec. The format of each link itself is also well-defined; for example, Listing 1 shows the link to the services representation. For legibility I've numbered various elements [1], [2], [3] etc. The href json-property [2] is obvious enough, but the other json-properties are equally important. The rel json-property [1] in particular describes the relationship of the resource with respect to this representation; think of it as a search key. If I were building a HATEOAS client that just follows links, then I would search for the link based on a particular value of the rel key. The method json-property [3] indicates the HTTP verb to use. In this case (and for all the links on the home page) it's a GET, but it might also be an HTTP PUT, POST or DELETE. This is the RO spec's equivalent of an HTML <form method="...">.

Lastly, the type json-property [4] indicates the media type of the representation that will be returned by following the link. As you can see, it is basically application/json; however, the RO spec also utilizes the ability of media types to specify parameters. The profile parameter therefore further qualifies the media type to say what flavour of JSON is being served up.

Minting new media types vs media type parameters

Media types are an important part of REST, and the usual practice is to define (or to "mint", as true RESTafarians would say) separate media types for each of the distinct representations. Thus, one might have

application/vnd.org.restfulobjects-list+json

to represent a list of elements; the "vnd.org.restfulobjects" part represents a vendor extension. The reason that the RO spec doesn't follow this approach is mostly pragmatic: tools such as jsonview don't recognize this media type as a JSON document (the "+json" bit is just a convention, it doesn't imply a JSON format), and so fail to render the page. Using parameters is also more extensible; the RO spec v1.0.0 also defines a couple of other parameters to layer further information on top. For example, the media type for a ToDoItem would be served up as

application/json;profile=urn:org.restfulobjects:reprtypes/domainobject;x-ro-domain-type=http://~/domain-types/dom.ToDoItem

A small caveat: if you go searching for this media type in Isis, you won't yet find it because, as explained in the main text, the RO viewer in Isis currently implements a slightly earlier version of the RO spec.

Listing 1. Link to services

{
[1]  "rel" : "services",
[2]  "href" : "http://localhost:8080/restful/services",
[3]  "method" : "GET",
[4]  "type" : "application/json;profile=\"urn:org.restfulobjects/list\""
}

Figure 4. The jsonview plugin rendering the RO home page

Let's now follow the services link [1]; we should end up with the representation in Listing 2. As you can see under the value json-property [1], its content is a list with links, in turn, to the three domain services that are registered in Isis' WEB-INF/isis.properties config file. The combination of the rel and id json-properties [2,3] disambiguates them. (Note that the RO spec v1.0.0 combines these into a single rel, whose value for example would be urn:org.restfulobjects:rels/service;serviceId=toDoItems.)

If we were building a Javascript or other application then the representation in Listing 2 would be sufficient to build a menu bar, but we don’t yet have the information to render the menu items underneath. To do that for the ToDoItems service, we should follow the first link. This gives us: Listing 3. Under the members json-property [1] you can see entries for the five methods defined by the ToDoItems class (elided for legibility). Using this information your snazzy app could now render the menu items. To invoke an action, though, we need to follow the link to a representation of the detail of the action itself. For example, Listing 4 shows the detail of the ToDoItems’ newToDo action.

There's some important information here. For a start, the parameters json-property [7] tells the client that this action expects three parameters. In the case of the category parameter, we also see that the representation includes a set of choices [8]; these presumably would be rendered in a drop-down list box. In the links json-property [1] we can see a rel="invoke" link [2]. As before we see the href, the method, and the type json-properties [3,4,5], and we also see an arguments json-property [6]. This provides an outline of the JSON to be submitted when the link is followed. It is equivalent to the application/x-www-form-urlencoded encoding that is used by web browsers to encode the contents of HTML <form>s. Looking again at the "invoke" link, you probably also noticed that the value of the method json-property [4]

Listing 2. Representation of Services

{
  "links" : [ {
    "rel" : "self",
    "href" : "http://localhost:8080/restful/services",
    "method" : "GET",
    "type" : "application/json;profile=\"urn:org.restfulobjects/list\""
  } ],
[1] "value" : [ {
[2]   "id" : "toDoItems",
[3]   "rel" : "service",
      "href" : "http://localhost:8080/restful/services/toDoItems",
      "method" : "GET",
      "type" : "application/json;profile=\"urn:org.restfulobjects/domainobject\"",
      "title" : "ToDos"
    }, {
[2]   "id" : "fixture.todo.ToDoItemsFixturesService",
[3]   "rel" : "service",
      "href" : "http://localhost:8080/restful/services/fixture.todo.ToDoItemsFixturesService",
      "method" : "GET",
      "type" : "application/json;profile=\"urn:org.restfulobjects/domainobject\"",
      "title" : "Fixtures"
    }, {
[2]   "id" : "dom.audit.AuditServiceDemo",
[3]   "rel" : "service",
      "href" : "http://localhost:8080/restful/services/dom.audit.AuditServiceDemo",
      "method" : "GET",
      "type" : "application/json;profile=\"urn:org.restfulobjects/domainobject\"",
      "title" : "Audit Service Demo"
    } ],
  "extensions" : { }
}




Listing 3. Representation of the ToDoItems domain service

{
  "links" : [ { ... } ],
  "extensions" : { ... },
  "oid" : "objstore.jdo.todo.ToDoItemsJdo:1",
  "title" : "ToDos",
  "serviceId" : "toDoItems",
[1] "members" : [ {
      "id" : "notYetComplete",
      "memberType" : "action",
      "links" : [ {
        "rel" : "details",
        "href" : "http://localhost:8080/restful/services/toDoItems/actions/notYetComplete",
        "method" : "GET",
        "type" : "application/json;profile=\"urn:org.restfulobjects/actionresult\""
      } ]
    }, {
      "id" : "complete",
      "memberType" : "action",
      "links" : [ { ... } ]
    }, {
      "id" : "newToDo",
      "memberType" : "action",
      "links" : [ { ... } ]
    }, {
      "id" : "allToDos",
      "memberType" : "action",
      "links" : [ { ... } ]
    }, {
      "id" : "similarTo",
      "memberType" : "action",
      "links" : [ { ... } ]
    } ]
}

Listing 4. Representation of the ToDoItems domain service's newToDo action

{
  "id" : "newToDo",
  "memberType" : "action",
[1] "links" : [ ..., {
[2]   "rel" : "invoke",
[3]   "href" : "http://localhost:8080/restful/services/toDoItems/actions/newToDo/invoke",
[4]   "method" : "POST",
[5]   "type" : "application/json;profile=\"urn:org.restfulobjects/actionresult\"",
[6]   "arguments" : {
        "description" : null,
        "category" : null,
        "dueBy" : null
      }
    } ],
[7] "parameters" : [ {
      "num" : 0,
      "id" : "description",
      "name" : "Description",
      "description" : ""
    }, {
      "num" : 1,
      "id" : "category",
      "name" : "Category",
      "description" : "",
[8]   "choices" : [ "Professional", "Domestic", "Other" ]
    }, {
      "num" : 2,
      "id" : "dueBy",
      "name" : "Due by",
      "description" : ""
    } ],
  "extensions" : { ... }
}

Listing 5. Representation of the ToDoItems domain service's notYetCompleted action

{
  "id" : "notYetComplete",
  "memberType" : "action",
  "links" : [ ..., {
[1]   "rel" : "invoke",
      "href" : "http://localhost:8080/restful/services/toDoItems/actions/notYetComplete/invoke",
[2]   "method" : "GET",
      "type" : "application/json;profile=\"urn:org.restfulobjects/actionresult\"",
      "arguments" : { }
    } ],
  "parameters" : [ ],
  "extensions" : { ... }
}



is POST this time rather than a GET. That's because this is an action that will have side effects (it will create a new ToDoItem). Moreover, if it were invoked twice it would create two new objects; this is not an idempotent operation. The RO spec therefore requires that such actions are invoked using HTTP POST. An idempotent action, on the other hand, is one that also has a side effect, but whose post-condition is fixed. The ToDoItem's completed action is a good example, because its end result is always that the ToDoItem ends up in a completed state, no matter how many times it is invoked. Still other actions have no side effects at all; queries are the obvious example. For example, from Listing 3 we could follow the link for the detail of the ToDoItems notYetCompleted action, shown in Listing 5. From the "invoke" rel [1] we can see that this action is invoked using the HTTP GET method [2].

As always with Isis, the representation being generated is inferred from information within the domain object model. In this particular case Isis looks for the @ActionSemantics annotation to determine whether a GET, PUT or POST should be used. Listing 6 shows this for the ToDoItems class. If the @ActionSemantics annotation is omitted then Isis plays safe and assumes that the action is not idempotent (and thus must be invoked using an HTTP POST). This is the case with the newToDo action. Because the notYetComplete action is a GET and takes no arguments, we can invoke it directly from within the jsonview plugin. We'll do that in just a minute, but I just want to detour slightly to look at how one might informally invoke the newToDo action, which is invoked with a POST and takes arguments. For this, I use the REST Console plugin, shown in Figure 5.

Listing 6. Specifying action semantics in the ToDoItems class

public class ToDoItems ... {
    ...
    @ActionSemantics(Of.SAFE)
    @MemberOrder(sequence = "1")
    public List<ToDoItem> notYetComplete() { ... }

    @ActionSemantics(Of.SAFE)
    @MemberOrder(sequence = "2")
    public List<ToDoItem> complete() { ... }

    @MemberOrder(sequence = "3")
    public ToDoItem newToDo(
            @Named("Description") String description,
            @Named("Category") Category category,
            @Named("Due by") LocalDate dueBy) { ... }

    @ActionSemantics(Of.SAFE)
    @MemberOrder(sequence = "4")
    public List<ToDoItem> allToDos() { ... }

    @NotInServiceMenu
    @ActionSemantics(Of.SAFE)
    @MemberOrder(sequence = "5")
    public List<ToDoItem> similarTo(final ToDoItem toDoItem) { ... }
    ...
}

Figure 5. The REST Console plugin allows non-GET methods to be invoked

Figure 6. The REST Console plugin shows the response body...

Figure 7. ...and the REST Console plugin shows the response headers
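As an aside, you don't need a browser plugin at all; any HTTP client will do. Below is a rough sketch, using nothing but java.net.HttpURLConnection, of how one might POST to the newToDo invoke link. The credentials and the argument JSON are illustrative only and should be adjusted to match the arguments outline that the action's representation returns.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import javax.xml.bind.DatatypeConverter;

public class InvokeNewToDo {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:8080/restful/services/toDoItems/actions/newToDo/invoke");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");                 // newToDo is non-idempotent, hence POST
        conn.setRequestProperty("Authorization",
                "Basic " + DatatypeConverter.printBase64Binary("sven:pass".getBytes("UTF-8")));
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);

        // illustrative arguments; fill in the outline returned in the action's "arguments" json-property
        String body = "{ \"description\": \"Write SDJ article\", \"category\": \"Professional\", \"dueBy\": null }";
        OutputStream os = conn.getOutputStream();
        os.write(body.getBytes("UTF-8"));
        os.close();

        System.out.println("HTTP " + conn.getResponseCode()); // expect 200 plus a returned representation
    }
}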



Assuming all the information is entered correctly, we should get a 200 response and a returned representation, as shown in Figures 6 and 7. OK, let's spin back to the representation of the notYetCompleted action (as per Listing 5) and follow the "invoke" link [1]. As shown in Listing 7, this will take us to a representation of the list of ToDoItem entities that meet the criteria. In the links json-property [1] we can see there's a rel="self" link [2]. You might have already noticed that most RO representations provide such a link; it is conceptually equivalent to the this keyword in Java. In the case of invoking an action it provides a bookmark to being able to re-execute the action: refresh my to-do list, in the case of notYetComplete.

Listing 7. List representation of those ToDoItems not yet complete

{
[1] "links" : [ {
[2]   "rel" : "self",
      "href" : "http://localhost:8080/restful/services/toDoItems/actions/notYetComplete/invoke",
      "method" : "GET",
      "type" : "application/json;profile=\"urn:org.restfulobjects/actionresult\"",
      "args" : { }
    } ],
[3] "resulttype" : "list",
[4] "result" : {
      "value" : [ ..., {
[5]     "rel" : "object",
        "href" : "http://localhost:8080/restful/objects/TODO:L_5^1:sven:1359490144166",
        "method" : "GET",
        "type" : "application/json;profile=\"urn:org.restfulobjects/domainobject\"",
[6]     "title" : "Buy a book on REST !!! due by 2013-02-28"
      } ],
[7]   "links" : [ {
[8]     "rel" : "returntype",
        ...
      }, {
        "rel" : "elementtype",
        ...
      } ],
      "extensions" : { }
    },
  "extensions" : { ... }
}

The resulttype json-property [3] indicates that the result of the action was a list, and so the result json-property [4] contains a representation of that list. Other possibilities are an object, a scalar, or just void. For lists, though, the contents are links. As usual we have an href to the related object [5], but note that we also have a title json-property [6]. This allows our application to render the list without having to follow each of the links in turn.

Countering the dreaded N+1 selects problem

If you've worked with object relational mappers (and Isis, of course, uses one under the covers in its JDO object store), then you'll be familiar with the "N+1 selects" problem. A better name might be the 1+N selects problem; we query the database once to get a list of matching items, and then we make a subsequent N queries to obtain detail about each of the matching items. If we were to write the SQL by hand we would probably use a left outer join and so do just one query to obtain all the information in a single hit. And ORMs too support this; indicate that a collection should be loaded eagerly and it will do a left outer join for you.

Restful Objects also tackles this issue. It's worth pointing out that the issue may not be as severe as you might think, because if the related objects are cached then the "N" retrievals are cheap as they involve no network traffic. But for those times when it is an issue (when the information about the related object must be bang up-to-date), RO allows the client to specify a "hint" through an additional reserved query argument, indicating which links should be followed. The server provides the detail by inlining the representations as an additional value json-property within the link (alongside href and rel).

Let's now follow one of the links, say for the new ToDoItem that was added during Figures 5, 6 and 7. We should end up with a representation as shown in Listing 8. OK, let's unpick all this. Under links [1] we again have a link with rel="self" [2], for bookmarking. The oid json-property [5] is a unique (internal) identifier for the object that you also see appearing in the various hrefs. So far as the RO spec is concerned its format is opaque (so it could be encrypted if you wanted); but if you're interested, it consists of: an alias for the class name ("TODO"), a long integer (L_5), a version number (1), a "last updated by" (sven) and a "last updated at" (timestamp). The last three are related to Isis' built-in optimistic locking support.

Most of the content of the entity's representation falls under the members json-property [6]. Here we find representations of each of the object's properties, collections and actions, for example the description property [7], the similarItems collection [9], and the add action [11]. To see the contents of the similarItems collection we would follow the rel="details" link [10] (unless an eager loading hint has been provided), whereas the property's value is shown inline [8]. Actions are invoked on entities in exactly the same way that they are invoked on domain services.

I skipped over [3], the "describedby" link, and [4], the "modify" link. I'll come back to [3] shortly, but to briefly explain the "modify" link: it's there to enable a bulk update of multiple properties in a single shot. Note also that it is an HTTP PUT link; this is an idempotent operation. To finish up our tour of the various representations, let's follow the similarItems details link [9], to end up with the representation shown in Listing 9. Once more there is a rel="self" link [1] for bookmarking, and notice too the rel="up" that takes us up to the parent representation, namely that of the ToDoItem.

Listing 8. The representation of a ToDoItem entity

{
[1] "links" : [ {
[2]   "rel" : "self",
      "href" : "http://localhost:8080/restful/objects/TODO:L_5^1:sven:1359448010106",
      "method" : "GET",
      "type" : "application/json;profile=\"urn:org.restfulobjects/domainobject\"",
      "title" : "Buy a book on REST !!! due by 2013-02-28"
    }, {
[3]   "rel" : "describedby",
      "href" : "http://localhost:8080/restful/domainTypes/dom.todo.ToDoItem",
      ...
    }, {
[4]   "rel" : "modify",
      "href" : "http://localhost:8080/restful/objects/TODO:L_5^1:sven:1359448010106",
      "method" : "PUT",
      "type" : "application/json;profile=\"urn:org.restfulobjects/domainobject\"",
      "arguments" : {
        "members" : [ { ... } ]
      }
    } ],
[5] "oid" : "TODO:L_5^1:sven:1359448010106",
    "title" : "Buy a book on REST !!! due by 2013-02-28",
[6] "members" : [ ..., {
[7]   "id" : "description",
      "memberType" : "property",
      "links" : [ {
        "rel" : "details",
        "href" : "http://localhost:8080/restful/objects/TODO:L_5^1:sven:1359448010106/properties/description",
        "method" : "GET",
        "type" : "application/json;profile=\"urn:org.restfulobjects/objectproperty\""
      } ],
[8]   "value" : "Buy a book on REST !!!"
    }, {
      ...
      "disabledReason" : "Always disabled"
    }, {
[9]   "id" : "similarItems",
      "memberType" : "collection",
      "links" : [ {
[10]    "rel" : "details",
        "href" : "http://localhost:8080/restful/objects/TODO:L_5^1:sven:1359448010106/collections/similarItems",
        "method" : "GET",
        "type" : "application/json;profile=\"urn:org.restfulobjects/objectcollection\""
      } ]
    }, ..., {
[11]  "id" : "add",
      "memberType" : "action",
      "links" : [ {
        "rel" : "details",
        "href" : "http://localhost:8080/restful/objects/TODO:L_5^1:sven:1359448010106/actions/add",
        "method" : "GET",
        "type" : "application/json;profile=\"urn:org.restfulobjects/actionresult\""
      } ]
    } ],
  "extensions" : { ... }
}
27


ISIS exclusive

A nice little example of HATEOAS, there. Meanwhile, in [3] we have the links to all the elements of the collection (same as in Listing 7).

A map of the landscape

What with all the detailed representations we've been staring at, I suspect by now you have some form of curly-bracketed snow blindness (I know that I have!). So it might be worth stepping back and looking at the big picture; Figure 8, in fact. This little map shows all of the resources (URIs) and the corresponding representations that they generate. Starting top left at the home page, you should be able to trace the journey that we've taken:

• we started at the home page, and then went
• to the domain service list, then
• to a particular domain service (of ToDoItems), then
• to the detail of a couple of its object actions (notYetComplete and newToDo), then
• to the result of invoking an action (notYetComplete) – namely, a list, then
• to a particular domain object (ToDoItem), then
• to the detail of one of its collections (similarItems)

This map (lifted from the RO spec itself) is probably the best argument I can make as to the HATEOAS-ness of the Restful Objects API.

How to build a snazzy app!

While playing around with jsonview and similar plugins is great for learning the RO API, at some point you will, presumably, want to build an app to consume the JSON representations.

If you intend to build a Javascript single-page app, then you can of course just use your favourite libraries to consume JSON. But you might also want to check out Spiro (http://restfulobjects.codeplex.com/wikipage?title=Spiro&referringTitle=Home); this provides a TypeScript (and therefore Javascript) library for interacting with Restful Objects. If you are writing a Java-based app, then you can also check out the Restful Objects viewer's applib within Isis. Sitting on top of RESTEasy's client library, this provides a set of utility classes to make it easy to consume the RO API. If .NET is your target client platform, at the time of writing there is no open source equivalent to Isis' applib. However, this is likely to change in the next couple of months.

All the above are options for writing your own bespoke apps, and for those technologies that I haven't mentioned (Perl, Python etc.) there are doubtless JSON/network libraries out there. But I must also mention that one of the objectives of the RO spec was to provide a set of representations that allow generic viewers (in other words, naked objects-style viewers) to be built as well. My point being: a sophisticated enough generic viewer might mean there is no need to build a bespoke app.

If you go back to Listing 8 you'll see that under links [3] there is a link with rel="describedby". You might have noticed these links elsewhere, too; they provide access to a bunch of representations of the Isis metamodel itself. Let me describe it this way: the representation in Listing 8 is analogous to a java.lang.Object (a ToDoItem, to be exact), while the representation that you would get if you followed the "describedby" link would be analogous to a java.lang.Class (the ToDoItem.class, to be exact).

Listing 9. Contents of a ToDoItem's similarItems collection

{
  "id" : "similarItems",
  "memberType" : "collection",
  "links" : [ {
[1] "rel" : "self",
    "href" : "http://localhost:8080/restful/objects/TODO:L_5^1:sven:1359490238213/collections/similarItems",
    ...
  }, {
[2] "rel" : "up",
    "href" : "http://localhost:8080/restful/objects/TODO:L_5^1:sven:1359490238213",
    ...
  } ],
  "extensions" : { ... },
[3] "value" : [ {
    "rel" : "object",
    "href" : "http://localhost:8080/restful/objects/TODO:L_3^1:sven:1359490238224",
    ...
    "title" : "Write blog post"
  }, {
    "rel" : "object",
    "href" : "http://localhost:8080/restful/objects/TODO:L_4^1:sven:1359490238224",
    ...
    "title" : "Organize brown bag due by 2013-02-11"
  } ]
}



Access to this metadata is essential for building a generic viewer. Indeed, the Spiro library mentioned above was borne out of the first iteration (called "spiro-classic") of building an SPA viewer.

Restful Objects on .NET

Apache Isis isn't the only open source framework to implement the Restful Objects specification: another alternative is Restful Objects for .NET (http://restfulobjects.codeplex.com). This allows you to build your domain object models in either C# or VB.Net, and in fact (unlike Isis, which lags a little) implements the spec to v1.0.0. Spiro is a part of the Restful Objects for .NET project. Also of note: RO for .NET is a sister project to Naked Objects MVC (http://nakedobjects.codeplex.com/), another open source project, whose project lead is Richard Pawson, inventor of the naked objects pattern. Let me also acknowledge all the valuable feedback that Richard and his colleague, Stef Cascarini, provided in the development of the RO spec.

Another project under way is AROW (https://github.com/adamhoward/arow), also a Javascript SPA. This is modelled on a drag-n-drop metaphor, as shown in Figure 9.

There's also an online demo for this viewer (linked to from the project page). We do expect to see more viewers over time. In fact, I was contacted by someone while writing this article expressing an interest in building a viewer based on Apache Flex; and someone else in the Isis community hinted recently at an intention to build a viewer using EmberJs. When I originally wrote the RO spec I had hoped that it might kindle outside interest; it's heartening to see this starting to occur.

Not without controversy

While there’s a growing community who are now working with Restful Objects, there are those who describe the idea of exposing a domain model over REST as wrong (“deluded” was the adjective used, if I remember correctly). The argument goes that exposing domain entities as resources “cannot be HATEOAS because there is no sensible way to create links between resources that expose the application state”. This is plainly not true; Restful Objects exposes entities as resources, and is fully HATEOAS. With its “see it, use it, do it” rules there are very clear rationales for the presence/absence of links.

Figure 8. A map of Restful Objects Resources and Representations




Others assert that even if possible, exposing domain object models is a bad idea because this results in a tight coupling of client and server, whereas in a RESTful system client and server should be able to evolve independently. Accordingly, one should expose only ViewModels and/or objects representing use-cases, both of which should be versioned, and which effectively insulate the client from underlying changes in the domain model. I suggest that this argument, while not totally wrong, is being applied indiscriminately; the truth is much more nuanced. There are situations where this line of argument is valid, but many where it is not. Two factors need to be considered.

The first is whether the client and server are both under the control of the same party. As we noted in the introduction, REST supports many different use cases. For a publicly-accessible RESTful API – such as one operated by Twitter or Amazon – the server and clients are owned by (many) separate parties. In such a situation exposing the domain model would indeed not be a good approach. That said, this doesn't rule out Restful Objects; it can work perfectly well with view models and/or use-case objects. The spec discusses these patterns in more detail. For those cases where the client and server are controlled by the same party – in the form of enterprise applications for primarily internal use – exposing domain entities over REST is not only safe, it is actually a good idea. Such applications typically need to give the user access to a much broader range of data and functionality than a public-facing application. (If you have déjà vu here: this is the problem-solver vs process-follower argument I raised in the previous article.)

The second factor to consider is whether the client is a bespoke client or a generic client. To date, almost all clients consuming RESTful APIs, whether for intranet or public internet use, are bespoke – so there is a need to insulate them from changes to the domain model. But that is not true of a generic client, which will respond automatically to changes in the domain model. Indeed, the concept of a generic client is actually more in line with the spirit of REST (and the letter of it too) than a bespoke client. After all, at one level, a web browser is just a generic client to a RESTful interface, operating at a document level. Restful Objects enables the idea of a generic client operating at a higher level of abstraction: that of a domain object model.

We can plot these two factors into a 2x2 grid, as shown in Table 1. If we are dealing with an intranet application and/or a generic client, then exposing domain entities through a RESTful API is both safe and effective. Only where you are working on a public internet application and with bespoke clients is it necessary to heed the advice that domain entities should be entirely masked by ViewModels and/or use-case objects, insulating clients from server changes. To be clear: even for the bottom-right corner, Restful Objects applies. In such cases one would just expose only domain services that would return very limited, or even void, representations. The fact that the bottom-right corner of this grid has received most attention to date is possibly more a reflection of the difficulties of building bespoke RESTful APIs than an inherent limitation.

Table 1. Should domain entities be exposed? Two factors to consider

Form of client | Deployment: Intranet             | Deployment: Internet
---------------+----------------------------------+----------------------------------------------------
Generic        | Exposing domain entities is fine | Exposing domain entities is fine
Bespoke        | Exposing domain entities is fine | Expose only versioned view models and/or use cases

Figure 9. The AROW generic viewer



Another common refrain – that RESTful APIs can only be built for applications that have a small, tightly-defined state-transition diagram – is further evidence of people thinking only in that box.

Closing Thoughts

I've been working with the naked objects pattern in various incarnations for nigh on a decade now, and remain convinced that it's an excellent way to build line-of-business enterprise apps. Restful Objects is a natural progression of the naked objects concept. In some ways, it's even more naked than the naked objects pattern itself.

I've noticed that many people who encounter naked objects think of it as a user interface technology. Given (the sad truth) that some of our previous viewers haven't been as pretty as our current generation of viewers, they've thus been turned off the idea. But at its heart naked objects isn't really about UIs; it's about ensuring that business logic stays where it should be: in the domain object model, accessible to the user. The auto-generated UI is a significant enabler for this because (as described in the previous article) it allows us to iterate very rapidly when building that domain object model. But if naked objects isn't about the UI, then Restful Objects really isn't about the UI; after all, there is none to see. Instead, RO is just about representing the domain object's structure and behaviour in as neutral a way as possible: as a set of JSON representations.

Anyway, I hope you'll take the time to play some more with RO alongside your traditional webapps. Build bespoke apps against it, build generic apps against it, use it for point-to-point integrations between systems, use it as an endpoint to be called from an enterprise service bus, use it as a convenient way to migrate data into your new app. There are lots of ways to use the Restful Objects API, and I look forward to hearing about how you've used it on the Isis mailing lists (http://isis.apache.org/support.html).

Dan Haywood

Dan is a freelance consultant, developer, writer and trainer, specializing in domain-driven design, agile development, enterprise architecture and REST on the Java and .NET platforms. He's a well-known advocate of the naked objects pattern, and was instrumental in the success of the first large-scale naked objects system, which administers state benefits for citizens in Ireland. He continues in his role there as an advisor to the government. Dan is a committer and the current chair of the Apache Isis™ project, and the primary author of various of its components. He is also the author of the Restful Objects specification, which defines a hypermedia API for exposing domain object models. Apache Isis provides one implementation of this spec, and Dan is also a committer on Restful Objects.NET, an implementation for .NET on ASP.NET MVC.




Grokking the Menagerie: An Introduction to the Hadoop Software Ecosystem

The Hadoop ecosystem offers a rich set of libraries, applications, and systems with which you can build scalable big data applications. As a newcomer to Hadoop it can be a daunting task to understand all the tools available to you, and where they all fit in. Knowing the right terms and tools can make getting started with this exciting set of technologies an enjoyable process.

Hadoop provides both a highly available, high performance distributed file system known as HDFS as well as a framework for distributed computation known as MapReduce. This combination provides an extremely powerful abstraction for data processing that can allow you to process terabytes of data in just seconds. After Yahoo first open sourced Hadoop, the rapid growth of the developer ecosystem came to reflect the fact that this type of system was useful for many types of problems. Although Hadoop started out as a project aimed at helping to solve 'big data' problems, today it is

a much more diverse software ecosystem and has inspired many related projects and competing systems. The Hadoop software ecosystem can be generally described by Figure 1. We will cover the bulk of this software ecosystem at a cursory level in order to get you familiar with the most popular tools and applications available to you.

Requirements

Today there are a number of companies providing commercial support of Hadoop, including Cloudera and Hortonworks. Cloudera provides a stable,

Figure 1. Hadoop Ecosystem




commercially supported distribution of Hadoop known as CDH, or Cloudera's Distribution including Hadoop. CDH is free to download and includes nearly all of the components covered in this article. Software versions covered in this article assume the reader is using CDH4.1.2. You can find a list of the versions of software packaged with CDH4 here: http://bit.ly/Xu84UK. You can find downloadable packages for CDH4.1.2 here: http://bit.ly/Vff4KM.

Hadoop

Hadoop is a number of components working together to provide the following services:

• A fault-tolerant, high performance distributed file system known as HDFS
• A distributed computation framework
• A job management system

Although Hadoop provides a distributed file system, HDFS is not typically used on its own. The basic unit of computation in Hadoop is described by MapReduce. The Hadoop architecture runs four primary services providing this functionality. These components are listed below.

• NameNode – Manages HDFS interactions
• JobTracker – Farms out MapReduce tasks to cluster nodes, ideally nodes with the data
• TaskTracker – Accepts tasks and handles map/reduce/shuffle operations
• DataNode – Stores data in HDFS

A typical Hadoop cluster is described in Figure 2. Clients interact with the JobTracker to submit MapReduce jobs and with the NameNode to interact with data. In older versions of Hadoop the NameNode was a

single point of failure but newer versions of Hadoop provide both high availability NameNode support as well as HDFS federation. Hadoop provides batch processing semantics and offers no ability to update (append to) a file once it has been written to HDFS. Hadoop is the perfect environment to collect large volumes of write once data to run analytics jobs on. In the early days of Hadoop, when it was an internal Yahoo project, Hadoop was initially designed to support web-scale, crawler-based search. Although it has evolved significantly since then, Hadoop still excels at this type of workload.

The Data Pipeline: DistCp, Sqoop, Flume, Kafka, and Scribe

Once your Hadoop cluster is online you will want to start thinking about your data pipeline. The data pipeline is going to be largely responsible for getting most of your data into your cluster. Depending on the data source, you may want a different tool for the job.

DistCp

DistCp comes with Hadoop and is arguably the simplest way of getting data into your cluster. DistCp recursively copies files and directories between file systems. For instance, if you wanted to copy everything from your home directory into HDFS you might run the following command:

hadoop distcp -p "file:///home/jdoe" "hdfs://namenode:8020/user/jdoe"

This command will recursively copy all files and directories found in /home/jdoe into your Hadoop cluster with a prefix path of /user/jdoe. The source and destination file systems don’t have to be HDFS, they may also be an S3 file system or any other supported file system type.

Figure 2. Hadoop Architecture




Sqoop

Sqoop is primarily useful for copying data from a SQL database into your cluster. At Tumblr we use Sqoop to load commonly needed SQL data, such as blog meta data, into our cluster. SQL data copied by Sqoop into HDFS is available to your MapReduce jobs. As of version 1.4.1 Sqoop supports MySQL, PostgreSQL, Oracle, and HSQLDB. Note that there are some limitations in how you can use Sqoop, and if you have a heavily sharded DB environment (as we do at Tumblr), there can be some added challenges.

Flume and Kafka

Although they are separate projects, Flume and Kafka are both useful for supporting a data or event stream. For instance, you might want to capture page view or login events from an application, and have those events available in Hadoop for processing. Flume and Kafka can both support this type of use case. Your application can simply emit an event to be collected by Flume/Kafka, and Flume/Kafka will eventually feed it to your Hadoop cluster. Although Flume is distributed with CDH4, Kafka is not. Kafka is the event processing infrastructure we use at Tumblr, but was originally created at LinkedIn. Flume is a significantly more feature-rich piece of software than Kafka, but in practice I have found Kafka to be simpler to administer and maintain.

The Flume architecture supports the notion of a source and a sink, where a source is something like an HTTP POST or syslog, and a sink is something like an IRC channel or HDFS. You connect sources and sinks via a channel. Flume manages taking data via a source, optionally applying a transformation on the data, and pushing this data into a sink. This flexible architecture makes creating a real-time data pipeline considerably easier than you might expect. Older versions of Flume had a dependency on ZooKeeper but Flume-NG (distributed with CDH4) has no such dependency.

Kafka is actually a persistent messaging system, with producers creating messages and consumers reading messages. The default Kafka distribution provides a Hadoop consumer which can be used for reading messages and writing them to a Hadoop cluster. One benefit of the Kafka architecture, besides simplicity, is that you are able to achieve extremely high message throughput (hundreds of thousands of messages per second). If you require high-throughput durable messaging, Kafka may be a good option for you.
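As a rough illustration of the producer side, and assuming the Kafka 0.8 Java producer API (the broker addresses, topic name and payload below are entirely made up), emitting an event might look something like this:

import java.util.Properties;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class PageViewEmitter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("metadata.broker.list", "broker1:9092,broker2:9092"); // illustrative brokers
        props.put("serializer.class", "kafka.serializer.StringEncoder");

        Producer<String, String> producer = new Producer<String, String>(new ProducerConfig(props));
        // fire-and-forget an event; a Hadoop consumer can later drain the topic into HDFS
        producer.send(new KeyedMessage<String, String>("page-views", "{\"url\":\"/post/123\",\"ts\":1359490144}"));
        producer.close();
    }
}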

Scribe

Scribe is a high performance distributed logging system created by Facebook. Although Scribe is not part of


CDH4, it is a common part of many data pipelines (including the one at Tumblr) so is included here for completeness. The Scribe architecture consists of clients, messages, categories, servers and stores. A client submits a message with an associated category to a server. A message can be arbitrary data such as a JSON blob or a tab delimited string. A category might be something like db_query_time or cache_hit_count. A Scribe server can be configured to route different categories to one or more stores, where a store might be a file, another Scribe server, or an HDFS directory. This allows you to create a hierarchy of Scribe servers providing message routing. One benefit of Scribe is that message sending is generally asynchronous, and buffered locally. A typical setup has a Scribe server installed on every server which is responsible for buffering local messages to disk (or memory). The local server will be configured to submit Scribe data upstream to central servers periodically, but since the writes are buffered locally, sending can continue even if the upstream server is under load or down.

The Analysis Toolkit: Hive, Pig, Scalding, Mahout, Giraph, and Hue

Now that you are collecting all kinds of data you will want to think about how to analyze your data. Although typically the basic component of analysis is a MapReduce job, there are a number of tools and libraries out there to make writing your MR jobs much simpler.

Hive, Pig, and Scalding

Writing MapReduce jobs is where you will spend the majority of your time interacting with your Hadoop cluster. Someone once referred to the MapReduce API as being similar to assembly for Hadoop; low level, unwieldy, and no fun. While I personally enjoy assembly, I understand the metaphor. There are a number of frameworks for simplifying the writing of MapReduce jobs, with Hive, Pig, and Scalding being just a few of them. Hive and Pig are bundled with CDH4 and provide query languages for interacting with your Hadoop data. Scalding is a Scala framework developed by Twitter that does not provide a query language, but does provide a very easy to use API. Because WordCount is the canonical example for MapReduce, let's compare these abstractions below. The original WordCount example is on the order of 60 lines of Java code. I won't include it due to the length, but you can find it here: http://bit.ly/WgVyIN. Let's first look at Hive. Hive provides a very SQL-like abstraction for interacting with Hadoop data. The example in Listing 1 copies data from /tmp/article.txt into Hadoop. It then runs an HQL query on this data and outputs the results of the query into Hadoop. All example listings will follow this general format.


Pig is another popular abstraction for writing MapReduce jobs. Pig is not quite as widely adopted as Hive, in my personal opinion due to the fact that it's fairly different from SQL (and therefore Hive), so the barrier to entry may be a bit higher. The example in Listing 2 does roughly the same thing as the Hive version. Scalding is a library for Scala users that provides a very easy to use abstraction on top of Hadoop. Since it's written in Scala (and therefore runs on the JVM alongside Java), you have the power of the MapReduce API if you need it. The example in Listing 3 does roughly the same thing as the Hive example.

Hue

Hue is a web based environment packaged with CDH4 for interacting with your Hadoop environment and most importantly, writing and running MapReduce jobs. Hue allows you to look at and interact with files in HDFS,

browse and manage MR jobs, and design jobs before you submit them to the cluster. Hue also allows you to perform Hive queries, and edit and manage Oozie workflows. At Tumblr we found that Hadoop adoption and usage increased dramatically after installing Hue and making it available to the team.

Mahout and Giraph

With access to huge volumes of data, machine learning and graph processing algorithms are common applications for Hadoop clusters. Mahout is a library provided with CDH4 offering a rich set of scalable machine learning/data mining algorithms optimized to leverage the power of Hadoop. You can use Mahout for classification (such as spam detection), clustering (such as market segmentation), recommendations (such as for e-commerce sites), as well as for a number of other applications. Giraph is not yet

Listing 1. Hive WordCount

CREATE TABLE docs (line STRING);
LOAD DATA INPATH '/tmp/article.txt' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
  SELECT word, count(1) AS count FROM
    (SELECT explode(split(line, '\s')) AS word FROM docs) w
  GROUP BY word
  ORDER BY word;

Listing 2. Pig WordCount

input_lines = LOAD '/tmp/article.txt' AS (line:chararray);
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
filtered_words = FILTER words BY word MATCHES '\\w+';
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/article-wc.txt';

Listing 3. Scalding WordCount

class WordCountJob(args : Args) extends Job(args) {
  TextLine( "/tmp/article.txt" )
    .flatMap('line -> 'word) { line : String => tokenize(line) }
    .groupBy('word) { _.size }
    .write( Tsv( "/tmp/article-wc.txt" ) )

  def tokenize(text : String) : Array[String] =
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
}




provided with CDH but it appears that it may be in an upcoming release. Giraph is designed with graph processing, such as page rank calculations and shared connections, in mind. Once large volumes of data are accessible via a powerful computational abstraction such as MapReduce, these types of algorithmic applications become much more commonplace. Mahout and Giraph make these types of sophisticated applications accessible to engineers without a machine learning background.

The Analysis Pipeline: Oozie and Azkaban

Once you have a number of analysis jobs running, you will eventually want to tie them together into a workflow, which I refer to as the analysis pipeline. As your data pipeline exists to get data into your cluster, your analysis pipeline exists to get intelligence out of your cluster. Imagine that you have created a set of MapReduce and other scripts to do the following:

1. Look for files in a few directories
2. Join the data from these files, and import it into Hadoop
3. Analyze the data, classifying it as SPAM or not SPAM
4. For SPAM data, create a mechanical turk job to have a person verify it
5. Import the mechanical turk data into Hadoop
6. Run another MR job to determine your rate of false positives
7. Email someone with the results

You can imagine that steps 1, 2, 5 and maybe 7 might be done with a shell script. Step 3 is probably a MapReduce job, where the input comes from step 2 and the output feeds step 4. Step 5 feeds a new MR job described by step 6, and once that MR job is done an email needs to be kicked off. Phew, this is getting complicated. Once you have the power of Hadoop at your fingertips, job workflow can get fairly complex. In order to reliably and dependably coordinate the hand-off between each step you will want a system to help manage this process. Oozie and Azkaban provide just that kind of coordination, along with a number of other nice features such as job scheduling and parameterization of your workflow. Oozie is provided with CDH and can be interacted with via the Hue web interface. Azkaban is not provided with CDH but has some differences in implementation (such as workflow ordering) which make it better to use for certain applications.

Distributed Coordination with ZooKeeper

Once you start building and working with a distributed system a number of common problems arise. For instance, let's assume you have an application with the following snippet of code for managing some arbitrary resource:

Lock resourceLock = new ReentrantLock();
if (resourceLock.tryLock(10, TimeUnit.MILLISECONDS)) {
    ...
    consumeResource();
    resourceLock.unlock();
}

This works fine if it is running on a single server, but what if your application now needs to run on multiple servers? How do you coordinate resource locking across multiple servers and applications? This is just the kind of problem that ZooKeeper can help address. Although this example is a bit simplistic, the hardest part of building a robust distributed system is handling the inevitable failures that occur in these systems. ZooKeeper is built to be highly available and reliable even in the face of failures. ZooKeeper provides primitives for distributed locks and queues, group membership, leader elections, distributed configurations, just to name a few. ZooKeeper works by electing a leader (master) from the set of running ZooKeeper instances. Reads can be fulfilled from any healthy ZooKeeper server, and writes flow through the leader. The leader confirms a write once a quorum (majority) of servers confirms the update. Because a majority is required, it is critical for production environments to deploy an odd number of ZooKeeper servers. Without an odd number of servers, it’s impossible to reliably achieve a quorum in the event of a network partition. This limitation helps prevent split brained clusters which can be common in distributed systems. ZooKeeper is a dependency for newer versions of Hadoop if you choose to configure a high availability NameNode. ZooKeeper is a requirement for HBase and is used for master election as well as coordination of regions. Older versions of Flume also depend on ZooKeeper for master reliability. Because implementing ZooKeeper code can be tricky, there are a number of libraries that exist for simplifying this code. One example is curator (https://github.com/Netflix/curator), a ZooKeeper framework created by Netflix. Curator is not included with CDH4.
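For example, a distributed version of the lock above can be sketched with Curator's InterProcessMutex recipe. The connection string and lock path below are placeholders, and package names may differ between Curator releases; treat this as an illustration rather than a drop-in implementation.

import java.util.concurrent.TimeUnit;
import com.netflix.curator.framework.CuratorFramework;
import com.netflix.curator.framework.CuratorFrameworkFactory;
import com.netflix.curator.framework.recipes.locks.InterProcessMutex;
import com.netflix.curator.retry.ExponentialBackoffRetry;

public class DistributedLockExample {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181",            // placeholder ZooKeeper ensemble
                new ExponentialBackoffRetry(1000, 3));
        client.start();

        InterProcessMutex lock = new InterProcessMutex(client, "/locks/some-resource");
        if (lock.acquire(10, TimeUnit.MILLISECONDS)) {   // same idea as tryLock, but cluster-wide
            try {
                consumeResource();
            } finally {
                lock.release();
            }
        }
        client.close();
    }

    private static void consumeResource() { /* ... */ }
}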

Getting Online with HBase

While Hadoop targets batch processing applications, HBase targets online (real-time) processing applications. HBase is a non-relational distributed database which provides a column-oriented data model. HBase is modeled after Google BigTable and builds on top of Hadoop to allow real-time reads and writes over data stored in HDFS.




The HBase Data Model

Conceptually there are parallels between HBase and an RDBMS. HBase has the notion of a table, column family, column, row, row key, and cell. A table in HBase is a collection of column families, just as a database is a collection of tables in an RDBMS. A column family in HBase is a collection of columns and rows, just as a table is a collection of columns and rows in an RDBMS. Columns in HBase don’t need to be declared before use, which is in stark contrast to a traditional RDBMS. A row key in HBase is what makes a row directly accessible, and is most similar to a primary key in a traditional RDBMS. A row key, column, and version is how a cell is specified in HBase. A cell may have one or more versions. When versions are used, a common value to use is a timestamp. In HBase, a range of row keys is known as a region. A region is the minimum unit of scalability and load balancing in HBase. Each region can be assigned to only one server which is responsible for managing that region.

HBase Components

HBase builds on top of Hadoop and provides the following additional components:

• Master – assigns regions to region servers and load balances regions
• Region Server – responsible for read/write requests for its assigned regions

ZooKeeper is used to provide high availability of the master process. In the event that one master process becomes unavailable, another master will be elected as leader. It is typical in an HBase cluster to have many region servers, where each region server is responsible for many regions. The master process will assign regions and rebalance (move) regions that become hot, but is not involved in the read/write path to regions. It is important to note that schema design can have a big impact on cluster performance. For instance, if your application uses monotonically increasing row keys (such as a timestamp), there will always be a 'hot' region server. That is, a region server which is currently managing the region getting the vast majority of writes. It is important to design your row keys such that writes are typically well distributed across all regions.
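One common mitigation is to salt the row key with a short, deterministic prefix so that consecutive timestamps land in different regions. The sketch below is illustrative only (the bucket count and key format are made up), and the price is that a time-range scan now has to fan out across every bucket:

import hashlib

N_BUCKETS = 16  # roughly the number of regions you want writes spread across

def salted_key(timestamp_key):
    # derive a stable bucket from the key itself so readers can recompute it
    bucket = int(hashlib.md5(timestamp_key.encode()).hexdigest(), 16) % N_BUCKETS
    return "%02d-%s" % (bucket, timestamp_key)

# salted_key("2013-01-20T10:15:00.123") -> "<bucket>-2013-01-20T10:15:00.123"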

Automating Hadoop with Whirr

Given the large number of libraries, applications, services, and dependencies in the Hadoop ecosystem it is important to think about automation up front for your production clusters. Apache Whirr is a set of tools and libraries for running Hadoop infrastructure in your cloud of choice. Whirr can help you spend less time setting up your systems by providing reusable recipes that you can use to launch your Hadoop instances. Whirr is a fairly recent addition to the Hadoop ecosystem but has gained traction as it provides consistent environments quickly and on-demand. Because Whirr is based on jclouds you can deploy your Hadoop cluster onto dozens of different providers including Amazon, GoGrid, OpenStack, and SoftLayer. Setting up a new HBase cluster, for example, is as simple as running the following command:

whirr launch-cluster --config recipes/hbase-cdh.properties \

--private-key-file ~/.ssh/id_rsa_whirr

See the Whirr documentation (http://whirr.apache.org/) for more recipes and examples.

Coming Soon: Impala

Cloudera Impala is a very recent entry in the Hadoop ecosystem but is worth an honorable mention. Similar to Google Dremel, Impala provides real-time ad hoc query capabilities for Apache Hadoop. Impala can query both HBase and HDFS, and bypasses MapReduce in order to provide real-time results. There is more information available here: http://bit.ly/UJodrj.

Conclusion

As you have seen, the Hadoop software ecosystem offers an enormous number of options for addressing your distributed computing needs. Despite the diversity of tools and systems, they all have a role to play in solving these common challenges. With a better understanding of the tools and systems out there for your use, I hope you are now armed to go out and explore on your own.

Blake Matheny
Blake is currently the VP of Engineering at Tumblr where he works on building the backend infrastructure that serves millions of users and billions of page-views per month. Blake has been a distributed systems engineer for more than ten years, an active Hadoop user and developer since 2008, and has for the past two years been having an affair with HBase. Blake lives in Brooklyn, NY with his wife, a cat that hates him, a dog named Alan, and their collection of books. You can find him @bmatheny or blakematheny@acm.org.



Introduction to Hadoop

A Hybrid Approach to Enabling Real-time Queries to End-Users

Since it became an Apache Top Level Project in early 2008, Hadoop has established itself as the de-facto industry standard for batch processing. The two layers composing its core, HDFS and MapReduce, are strong building blocks for data processing. Running data analysis and crunching petabytes of data is no longer fiction. But the MapReduce framework does have two major downsides: query latency and data freshness.

At the same time, businesses have started to exchange more and more data through REST APIs, leveraging HTTP verbs (GET, POST, PUT, DELETE) and URIs (for instance http://company/api/v2/domain/identifier), pushing the need to read data in a random-access style – from simple key/value lookups to complex queries. Enhancing the BigData stack with real-time search capabilities is the next natural step for the Hadoop ecosystem, because the MapReduce framework was not designed with synchronous processing in mind. There is a lot of traction today in this area and this article will try to answer the question of how to fill in this gap with specific open-source components and build a dedicated platform that will enable real-time queries on an Internet-scale data set. I will start by defining a couple of terms to ensure everyone is on the same page before proposing a hybrid approach and discussing why it could be a good fit. I will eventually come up with a concrete example, discussing which technologies could be a good match, and how they would interact together.

Definitions

Real-time Queries to End-Users

The term real-time is quite a buzz word these days, and has a couple of different meanings. Real-time, from the point of view of the end-users, means synchronous queries with low latency, just like REST or any other RPC, as opposed to batch processing or asynchronous function calls. Asynchronous APIs are less convenient to deal with and could be a barrier to mass adoption, even more so when the ultimate goal is to render the content to a user. Imagine a website where you need to enter your email address to be notified when the page is eventually generated. Due to the inherent nature of surfing – and user’s low tolerance to waiting – a vast majority of APIs are synchronous and hopefully have low latency.

Figure 1. Batch scheduling and data lifetime





Real-time, from the point of view of the data, means querying the latest version of the data. I will not distinguish here between strong and eventual consistency, and will consider eventually consistent data as available in real time. Real-time data here is contrasted with batch-transformed data. Imagine you are batch processing daily log files to aggregate sums of requests per distinct query source per day. You will obviously wait till the end of the current day before starting your aggregation, which means that you will only be able to query the data from the day before. Expressed in a more generic way, data processed in batches is a snapshot view (time stamped when the batch process started) and will already be outdated by the time the batch finishes. The frequency at which you run your batches dictates the maximum time delta between the latest view of the data and the data available for reading – one day in our example (plus the batch processing time), referred to as processing time in Figure 1. There are still plenty of use cases where this model of offline processing fits well, and it has the strong advantage of being simple and easier to scale. As the last definition, end-users in this context will be understood as any consumer of the API, whether internal or external, human or computer. If I put all these definitions together and try to refine the initial goal of the article, from an end-user point of view, I would try to come up with a REST API allowing advanced search on a big data set, reducing the gap between data arrival and data availability for query to a minimum – a couple of seconds or less.

Data Access Patterns

One of the questions attentive readers should have in mind so far is why we would come up with a hybrid approach when there are plenty of NoSQL datastores performing really well today. The short answer is that use cases with lots of reads and writes are harder to scale. In computer science, whatever you are building, there is always a tradeoff, because your resources (RAM, bandwidth, CPU power, etc.) are not infinite. Everyone is aware of the time-memory tradeoff in algorithms; the CAP theorem is also about tradeoffs. It is just a question of which property you are ready to sacrifice. So let us distinguish three types of data access patterns, classified by their read/write ratios:

• Ratio tends to 0 – few reads, lots of writes; will be called write-only
• Ratio tends to 1 – same order of reads and writes; called random read/write
• Ratio tends to infinity – lots of reads, few writes; can be called read-only or write-once


Data access with unbalanced read or write amounts is easier to scale than the general-purpose random read/write pattern, because the read-only and write-only patterns do not force the tradeoffs that general-purpose random read/write does. Systems that scale really well for a random read/write data access use case will have other tradeoffs – for instance, single points of failure or bigger operating costs. One strong requirement in this architecture is therefore to prefer specific over generic components, because, as Tristan Nitot, president of Mozilla Europe, would say: the only place where cheese is free is under the trap.

Putting Hands-on

Now that we are done with the definitions and theoretical discussions, it is time to start talking about concrete pieces of software and forming a possible implementation. Let us pick an example to illustrate and discuss the architecture. The goal from this point on is to build a platform that takes Apache HTTP log entries as input and can answer the following queries:

• Give me all the requests done by a given IP in a given time range
• Give me all the IPs, with the sum of their hits, that requested a given page within a given time range
• Give me the average requests per hour per country for a given month

As you can figure out, the queries are non-trivial. The requirement of responding with low latency means we have to move away from running a MapReduce job for every request.

Pre-Computed Views

Forget about real-time for a minute and let us concentrate on the processing. One intuitive way of answering queries with low latency is to process the log entries, aggregate the data along different dimensions, and store the results, called pre-computed views, in a dedicated datastore allowing synchronous requests. In a log entry with the default format, the following information is available:

• Remote IP
• Remote logname (most likely “-”)
• Remote user (most likely “-”)
• Time
• HTTP request (verb, version, URI)
• Status code
• Size of the response

Pre-generating all the views to answer all the possible queries increases the size of the pre-generated views quadratically. It is obviously not the option explored here because of its storage cost. Instead, we will define the minimum dimensions we will pre-compute views for. The dimensions to keep are the most often requested ones. In our specific case, we will keep time, source IP (or simply IP) and URI. Then the granularity for the dimensions needs to be defined. For instance, the time granularity could be seconds, minutes, hours, days, etc. Choosing the right dimension granularity is application specific, because the smaller your granularity, the more precision you can provide to your end-user (for instance, the sum of hits for the last 15 seconds). But it remains a tradeoff between storage footprint and precision. As an alternative, a dimension could have different granularities over time: the older the data, the larger the granularity. In the case of IP addresses, aggregation could be done per single IP (/32 in case of IPv4) for a first period of time, per /24 block for the n following periods and finally per /20 block for the remaining periods. In our example, we will have a time granularity of one hour for the first month, one-day granularity for the next two months and one-week granularity for the last 6 months. This will allow us to query nine months of data, with decreasing granularity depending on the desired time range (Figure 2). In order to be able to serve our queries, here is the list of data sets we need to generate. The format is key = value, where key and value are given in a pseudo-JSON format: [] for a list and {} for a tuple.

• { time } = [ { IP, URI, hits } ]
• { time, IP } = [ { URI, hits } ]
• { time, URI } = [ { IP, hits } ]

The first data set simply lists, for all possible time values (remember the granularity of hours, days and weeks), the list of tuples of IP, URI and the aggregated sum of hits for this tuple. The second data set goes in the same direction, with the exception that the key could be all the possible values of time combined with IPs. The value is the list of all URIs requested by the given IP at a given time. OK, now how do we query these aggregated data sets efficiently? One requirement of the solution is to allow bulk loading of the data into a specialized datastore: once the processing is done, the fresh data sets will be uploaded into a dedicated datastore that will discard the old version of the data and start serving the new version. This datastore will be highly optimized for reading. It can be mostly read-only, or write-once. HBase, Cassandra or Voldemort provide this capability, but the most specialized datastore at the time of writing is ElephantDB – and remember, I prefer to use highly specialized building blocks (Figure 3). ElephantDB is a light datastore that downloads pre-generated data sets (domains) from a distributed filesystem and allows key-value queries through a Thrift interface. The data set is stored in an appropriate format (BerkeleyDB, and more recently LevelDB), reducing the amount of work to be done before serving the data down to zero. The query model of ElephantDB is key-value only, so the end-user queries need to be translated into a couple of key-value queries to ElephantDB. The query translation is really model oriented. In our case, if you want to receive the list of URIs queried by a given IP between time t1 and time t2, you will explicitly generate all the keys { ti, ip }, query ElephantDB for the list of URIs for each of these keys and join all the URIs together before returning to the end-user. Remember the time granularity decreases with the age of the data, keeping the number of queries to issue predictable. If your granularity is smaller than the one chosen here, the number of queries will increase proportionally. But you can always add constraints in the user space to force users to keep the number of queries reasonable. Moreover, keeping the join-fork pattern in mind, we can also run the queries in multiple batches to try to parallelize the query. So far we have achieved the first implication of the term ‘real-time’ – synchronous, low-latency queries. We are still missing the ability to query the most recent version of the data, i.e. taking into account the data updated between two pre-computed views. Let us catch up on the moving part.

Figure 2. Granularity can decrease in time

Figure 3. Batch layer
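To make the query translation concrete, here is a small Python sketch of the key generation and join step. The key formats, the granularity boundaries and the store object (standing in for the ElephantDB Thrift client) are all assumptions for illustration:

from datetime import datetime, timedelta

def time_keys(start, end, now=None):
    # walk the requested range, emitting hour keys for the first month of age,
    # day keys for the next two months, and week keys beyond that
    now = now or datetime.utcnow()
    keys = []
    cursor = start
    while cursor < end:
        age = now - cursor
        if age <= timedelta(days=30):
            keys.append(cursor.strftime('%Y-%m-%d %H'))
            cursor += timedelta(hours=1)
        elif age <= timedelta(days=90):
            keys.append(cursor.strftime('%Y-%m-%d'))
            cursor += timedelta(days=1)
        else:
            keys.append(cursor.strftime('%Y-W%W'))
            cursor += timedelta(weeks=1)
    return keys

def uris_for_ip(store, ip, start, end):
    # one key-value lookup per { time, IP } key, joined before returning
    uris = set()
    for bucket in time_keys(start, end):
        uris.update(store.get((bucket, ip), []))
    return uris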

Hybrid Approach

It is interesting to note at this point that the missing data is only a small subset of the entire data, tightly coupled to the frequency at which the batches are run. What about building a second layer, complementary to our first batch processing layer?

The data flowing in will be processed as a stream, transformed and aggregated on the fly, and made available for query for a small amount of time, flushing and restarting the computation once the next batch is available. Figure 4 shows the data that is missing from the batch layer and needs to be filled in before we can claim that we are serving real-time data. The trick here is to maintain two different sets of data: one updating and serving until the new batch is available, and another that will be used once the new batch is ready, but that needs to start at the exact same time the batch processing starts, so as not to miss any updates. Once the new batch is available, the previous data set can simply be discarded. The required building blocks for this real-time layer are a stream processing engine and a datastore that works well for the random read/write data access pattern – fortunately only on a subset of the log entries. The new software that is gaining a lot of traction, and is a good candidate for becoming the de-facto standard for stream processing, is Storm. Storm is a distributed stream processing engine. With Storm, processing is expressed as a directed acyclic graph (DAG) where every node (bolt) is a transformation of the data and every edge can be local or remote. Events enter the DAG through a source (spout) and are in turn sent to every following bolt of the graph. Each bolt can have a different number of instances processing the event, and thus processing can be parallelized depending on the resources needed by each transformation. Storm seems to be a perfect match for running on-the-fly transformation of the log entries and eventually storing the aggregations in a dedicated datastore. Any datastore can be used here, with the preference being Cassandra or HBase. Again the datastore will be queried in the same manner as ElephantDB, with the difference being the time granularity. In a Cassandra world, the idea is to roll column families from one batch to another, in order to be able to truncate the previous one – a very efficient way to free all the resources of data sets that are no longer used. Regarding the data model, keys will be the same as in ElephantDB ( { time }, { time, IP }, { time, URI } ) and the values will be, respectively, a composite column name with { IP, URI } and simple column names with { URI } and { IP }. The hits will be stored in counters.

Figure 4. Data complementary to batches

Figure 5. Batch and real-time layer
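The rolling trick for the real-time layer can be illustrated independently of Storm and Cassandra. The Python sketch below keeps two in-memory counter sets; it is only a stand-in for the rolled column families described above, and the method names are made up:

from collections import defaultdict

class RealtimeViews(object):
    # 'serving' answers queries until the next batch lands; 'pending' accumulates
    # the same events from the moment the new batch run starts

    def __init__(self):
        self.serving = defaultdict(int)
        self.pending = defaultdict(int)

    def record(self, hour, ip, uri):
        for view in (self.serving, self.pending):
            view[(hour, ip, uri)] += 1

    def batch_started(self):
        # the new batch covers everything up to now, so the pending view restarts here
        self.pending = defaultdict(int)

    def batch_finished(self):
        # the batch layer now serves the older data: drop it from the real-time layer
        self.serving, self.pending = self.pending, defaultdict(int)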

Merging both layers

Once we have the two layers – batch and real-time – running together, the final step is to query both layers simultaneously and merge the results before returning them to the end-user (Figure 5). The last missing part is the technology used to perform the merge. I proposed an approach using a join-fork pattern to parallelize the queries, but now that we have introduced Storm, I will argue that Storm is a better match for this part: every end-user query is transformed into an event in Storm. The Storm DAG is doubled, with one path querying the real-time layer and the other querying the batch layer. An intermediate bolt could be introduced to split the query into several events if the number of requests is too high, and the scalability of the system is inherently determined by the parallelism factor defined for each bolt. Moreover, this approach achieves higher availability, because if one server dies, the remaining servers running Storm will continue to correctly process the queries. So here we are! We have a complete picture of a hybrid platform – composed of one batch and one real-time layer – empowering real-time queries for end-users.
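Whether a Storm bolt or a join-fork client performs it, the merge itself is simple because the two layers cover disjoint time ranges: the per-URI hit counts just add up, as in this small sketch (the example values are made up):

from collections import Counter

def merge_hits(batch_rows, realtime_rows):
    # batch_rows and realtime_rows are iterables of (uri, hits) pairs
    merged = Counter()
    for rows in (batch_rows, realtime_rows):
        for uri, hits in rows:
            merged[uri] += hits
    return merged.most_common()

# merge_hits([("/index.html", 120)], [("/index.html", 3), ("/api/v2/foo", 1)])
# -> [("/index.html", 123), ("/api/v2/foo", 1)]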

Benefits of the Approach

As a conclusion I will walk through the benefits of the platform, and discuss a couple of properties that come for free. First of all, the platform is entirely based on open-source building blocks. This means no license costs, and a high probability that the platform will continue to sustain growing data sets, more user queries, smaller granularity, etc. An interesting property is the recovery in case of error: all the data is stored in append mode in the distributed filesystem. If a bug corrupts the data of the pre-computed views, or of the real-time views, it is easy to restart from scratch, flushing all the views and recovering from the buggy state. And the last benefit is availability: every piece of the puzzle is available enough to make the overall platform highly available.

References

• CAP Theorem: http://en.wikipedia.org/wiki/CAP_theorem
• ElephantDB: https://github.com/nathanmarz/elephantdb
• Storm: https://github.com/nathanmarz/storm
• Cassandra: http://cassandra.apache.org
• HBase: http://hbase.apache.org
• Join-Fork

Next steps

The next step is to keep an eye open on Dremel and its clones, Impala, Apache Drill, etc. At the time of writing none of these are fully operational, but they could add their own value to the ecosystem.

Benoit Perroud
Benoit holds an M.S. in Computer Science from EPFL, Switzerland. He is currently a Software Engineer at Verisign, developing and operating the company-wide data processing platform based on Hadoop infrastructure. His work includes ingesting and storing data feeds, engineering a data retrieval platform – querying and searching on Internet-scale data sets – and supporting internal products to run and optimize MapReduce processing. In addition to being a father of two lovely children, he is an Apache Committer, NoSQL enthusiast and frequent speaker at Swiss and European tech conferences.





Cassandra

Getting Started with Cassandra, using Node.js

Apache Cassandra is the massively scalable open-source NoSQL database. It is an Apache Software Foundation top-level project designed to handle very large amounts of data in real time while providing availability even across multiple datacenters and the cloud.

Cassandra evolved from work at Google, Amazon and Facebook, and is in use at leading companies such as Netflix, Rackspace, and eBay. The Chair of the Apache Cassandra Project is Jonathan Ellis, co-founder and CTO of DataStax. Although Cassandra is classified as a NoSQL database, it is important to note that NoSQL is a misnomer. A more correct term would be NoRDBMS, as SQL is only a means of accessing the underlying data. In fact, Cassandra comes with an SQL-like interface called CQL. CQL is designed to take away a lot of the complexity of inserting and querying data via Cassandra. In this article, we will walk through the basics of what Cassandra is, getting up and running with Cassandra, and a simple Node.js application sample to show the ease of use.

The basics

The basic Cassandra schema starts off with the keyspace. The keyspace is synonymous with a database in an RDBMS. The keyspace defines how many times the data is replicated, and in which datacenters the data resides. In every keyspace there is a set of column families. A column family is like a table in an RDBMS; however, there is no set schema. While you can specify the value type for a specific column, this can also be done on the fly. This feature allows you to create millions of columns in a single row, and a great use case for this is time-series data. Each column family has a set of rows, which consist of columns. Every column is a tuple that contains the column name, value, time stamp, and optionally a TTL. The column name can be of any supported type, including a composite of several types. These are called composite columns.


Cassandra can query data in several different ways. You can select columns directly by their name, just as you would in an RDBMS. You can also select slices of columns by their name, part of their name, or their ordinal position. It is important to note that columns are automatically ordered according to their name and not necessarily by when they were added.
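For readers who want to see the wide-row, time-series pattern outside of CQL, here is a small sketch using pycassa, the Python client that also appears in the "COTS to Cassandra" article later in this issue. The keyspace and column family names are made up, and the column family is assumed to use UTF8-type column names so the timestamp strings below sort chronologically:

import json
import time

import pycassa

pool = pycassa.ConnectionPool('metrics', server_list=['127.0.0.1:9160'])
events = pycassa.ColumnFamily(pool, 'events')

# one wide row per source, one column per reading, expiring after 7 days
row_key = 'sensor-42'
col_name = time.strftime('%Y-%m-%dT%H:%M:%S', time.gmtime())
events.insert(row_key, {col_name: json.dumps({'temp': 21.5})}, ttl=7 * 86400)

# a slice query: one hour of readings, returned ordered by column name
readings = events.get(row_key,
                      column_start='2013-01-20T10:00:00',
                      column_finish='2013-01-20T11:00:00',
                      column_count=10000)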

Getting started with Cassandra

Getting started with Cassandra is fairly easy. You can download the binaries or source as a tarball directly from planetcassandra.org. If you are running OS X and use Homebrew, you can simply run "brew install cassandra". There is also a package for Debian-based Linux distributions. Testing small clusters locally is easiest done using Cassandra Cluster Manager (CCM). Let's walk through using CCM to start up a small three-node cluster on our local machine.

$ sudo ifconfig lo0 alias 127.0.0.2 up
$ sudo ifconfig lo0 alias 127.0.0.3 up
$ git clone git://github.com/pcmanus/ccm.git
$ cd ccm
$ sudo ./setup.py install
$ ccm create test -v 1.1.8
$ ccm populate -n 3
$ ccm start

After the cluster starts, you can verify that your cluster is up and running by using the nodetool command (Listing 1). The nodetool command is your primary means of managing tasks not related to queries across the cluster. This includes adding a new node, decommissioning a node, rebalancing the data in the cluster, checking statistics, and more.

CQL

Cassandra Query Language (CQL) is an SQL-like language for querying Cassandra. Although CQL has many similarities to SQL, there are some fundamental differences. For example, the CQL adaptation to the Cassandra data model and architecture doesn't support operations such as JOINs, which make no sense in a non-relational database. To start using CQL all you need to do is start up the cqlsh shell. There are 2 main versions of CQL, CQL2 and CQL3. The default version of CQL for Cassandra 1.1.x is CQL2, however, CQL3 is preferred and should be used if possible. Let's start a shell using CQL3:

$ cqlsh -3
Connected to test at localhost:9160.
[cqlsh 2.2.0 | Cassandra 1.1.8-SNAPSHOT | CQL spec 3.0.0 | Thrift protocol 19.33.0]
Use HELP for help.
cqlsh>

Now that we are in the shell, we can create our keyspace and column families.

cqlsh> CREATE KEYSPACE webinar
   ... WITH strategy_class=SimpleStrategy
   ... AND strategy_options:replication_factor=1;

To start using the newly created keyspace, the command is just like in SQL.

cqlsh> USE webinar;

Now we can create a column family.

cqlsh:webinar> CREATE COLUMNFAMILY users (email text, first_name text, last_name text, PRIMARY KEY (email));

Inserting a row is also very similar to SQL.

cqlsh:webinar> INSERT INTO users (email, first_name, last_name) VALUES ('foo@bar.com', 'Foo', 'Bar');

As is selecting data:

cqlsh:webinar> SELECT * FROM users;

 email       | first_name | last_name
-------------+------------+-----------
 foo@bar.com | Foo        | Bar

Creating an application

Creating an application in Cassandra has been made a relatively simple task by the collective efforts of the community in creating easy-to-use drivers that all have a similar API. No matter what language you prefer, there is development going on for integration with Cassandra. For the purposes of this exercise, we will be using Node.js. Node.js is a platform built on Chrome's JavaScript runtime for easily building fast, scalable network applications. Node.js uses an event-driven, non-blocking I/O model that makes it lightweight and efficient, perfect for data-intensive real-time applications that run across distributed devices. Node.js especially lends itself to being a suitable starting point because of its ease of use and out-of-the-box non-blocking I/O capabilities. In this example we will use the ExpressJS framework, as it is the most widely used web framework for Node.js. First let's install Express and create our app.

$ npm install -g express
$ express node-cassandra

Now open the folder node-cassandra using your favorite editor. In the directory listing you will see a file called package.json; this is where the application's dependencies are maintained. Edit that file and add "helenus": "*" to the dependencies. Helenus is the Node.js driver for Cassandra and is how we are going to communicate with Cassandra in this application. Now go back to the command line and, in the application's root directory, run "npm install" to install all of the application's dependencies.

Listing 1. Output of nodetool command describing cluster

$ nodetool -p7100 ring
Note: Ownership information does not include topology, please specify a keyspace.
Address    DC           Rack   Status  State    Load     Owns     Token
                                                                  113427455640312821154458202477256070484
127.0.0.1  datacenter1  rack1  Up      Normal   6.71 KB  33.33%   0
127.0.0.3  datacenter1  rack1  Up      Normal   6.71 KB  33.33%   56713727820156410577229101238628035242
127.0.0.2  datacenter1  rack1  Up      Normal   6.71 KB  33.33%   113427455640312821154458202477256070484




Let's connect to our test cluster using Helenus. First let's edit the app.js file and modify the top requires to add our driver.

/**
 * Module dependencies.
 */

var express = require('express')
  , routes = require('./routes')
  , user = require('./routes/user')
  , http = require('http')
  , path = require('path')
  , helenus = require('helenus');

The next step is to create our connection pool. The Helenus connection pool gives you a lot of options when it comes to managing the connections. It has automatic detection of dead nodes and attempts to reconnect to nodes that are considered dead. If a node goes down during a request, the request will fail and it is up to the client to decide whether to retry the request or not. The driver will not throw an exception on a downed node unless all nodes are down. This behavior allows for multiple-node outages that would otherwise cause an application to crash. By default, Helenus will select a connection at random to send writes to; however, this behavior can be overridden. There is an optional function that can be passed when creating the connection that will choose a connection based on any logic; a good example use would be to implement round-robin connection selection. The only required fields in a Helenus connection are the keyspace and host names; you can also specify the CQL version, user, password, default timeout for connections, and the host pool size. The host pool size is the number of connections to make to each node. It currently defaults to 1, however we recommend a minimum of 3 connections per node. Let's create our connection pool:

var pool = new helenus.ConnectionPool({
  hosts       : ['127.0.0.1:9160', '127.0.0.2:9160', '127.0.0.3:9160'],
  keyspace    : 'webinar',
  cqlVersion  : '3.0.0',
  hostPoolSize: 3
});

Now that we have specified our connection pool parameters, we can connect. We don't want to have the web server start taking requests before we establish our connection, so we will put the server startup in the callback function. The callback function is the function that is invoked once the connection has been established.

pool.connect(function(err){
  if(err){
    throw(err);
  }

  http.createServer(app).listen(app.get('port'), function(){
    console.log("Express server listening on port " + app.get('port'));
  });
});

As you can see, the function takes an error argument. This parameter will be null unless there was a problem establishing the connection to Cassandra. In the event there is an error, we do not want to continue any further, so we throw the error, causing it to halt the application. Now that we have connected to the database, we need to make the connection available to the rest of the application. We can do this by adding it to the app configuration.

app.configure(function(){
  app.set('port', process.env.PORT || 3000);
  app.set('views', __dirname + '/views');
  app.set('view engine', 'jade');
  app.set('cassandra', pool);
  app.use(express.favicon());
  app.use(express.logger('dev'));
  app.use(express.bodyParser());
  app.use(express.methodOverride());
  app.use(app.router);
  app.use(express.static(path.join(__dirname, 'public')));
});

So anywhere that has access to the application can now access the connection pool via the "cassandra" setting. The last thing we will do in this file is create a few routes; these will tell our app how we want to handle the data that comes in.

app.get('/', routes.index);
app.post('/', routes.new);
app.delete('/', routes.delete);

Now we will create the endpoints to the routes we just created. Edit the index.js file in the routes folder and add the index, new, and delete methods.

exports.index = function(req, res, next){

};

exports.new = function(req, res, next){

};

exports.delete = function(req, res, next){

};

The functions for these methods take 3 arguments. The first argument is the request, the second is the response and the third is the "next" method, which is used when there is an error that we want to pass to the browser. First we will create the "index", or user listing. To access Cassandra we will need to get at the application, which is available as an object on the request.

req.app.get('cassandra').cql('SELECT * FROM users LIMIT 10', function(err, users){
  if(err){
    return next(err);
  }

  res.render('index', { title: 'Users', users: users });
});

In the above code, we access the connection pool and call the CQL statement "SELECT * FROM users LIMIT 10". The callback method takes 2 arguments. The first is the error argument that will be passed to the browser if it exists. The second is the response from Cassandra containing the users returned from the query. When using the "cql" method, you can also use variable-based replacement that will properly escape the input before sending it to Cassandra for processing. This is essential when creating an application that takes user input. We can now create the "delete" and "new" methods in the routes file (Listing 2). The routes will redirect the user to the index after the operation has completed. Now that we have our basic routes created, we can create our view. A view can take variables passed to it from the render command. In this example we will use Jade, the default template engine in Express. Edit the "index.jade" file in the views directory (Listing 3). Since a Cassandra column consists not only of the key and value, but also the TTL and timestamp, the response from the SELECT query will be an object that allows you to get the column by name and retrieve the TTL, value and timestamp if needed. In the example above, we can get the email address for the user by calling "user.get('email').value". Now that we have created the view we can start the application.

Listing 2. Example POST and DELETE verbs

exports.new = function(req, res, next){
  var insert = 'UPDATE users SET first_name=?, last_name=? WHERE email=?',
      params = [req.body.first_name, req.body.last_name, req.body.email];

  req.app.get('cassandra').cql(insert, params, function(err, users){
    if(err){
      return next(err);
    }

    res.redirect('/');
  });
};

exports.delete = function(req, res, next){
  var remove = 'DELETE FROM users WHERE email=?',
      params = [req.body.email];

  req.app.get('cassandra').cql(remove, params, function(err, users){
    if(err){
      return next(err);
    }

    res.redirect('/');
  });
};





$ node app.js

You can now point your browser to http://localhost:3000 (Figure 1). As you can see, the user we added from the original CQL statements above is already there. You can also add and remove users via the page. All of the above code and examples are available on GitHub at: https://github.com/devdazed/node-cassandra-webinar.

Russ Bradberry
Twitter: @devdazed
Russell Bradberry is the Principal Architect at SimpleReach where he is responsible for architecting and building out highly scalable data solutions. He has put into market a wide range of products including a real-time bidding ad server, a rich media ad management tool, a content recommendation system and, most recently, a real-time social intelligence platform. He is also a DataStax Cassandra MVP and the author of Helenus, the Node.js Cassandra driver.

Figure 1. Screen shot of application

Listing 3. Jade view code

extends layout

block content

  h1= title

  table
    thead
      tr
        th First Name
        th Last Name
        th Email
        th
    tbody
      each user in users
        tr
          td #{user.get('first_name').value}
          td #{user.get('last_name').value}
          td #{user.get('email').value}
          td
            form(action='/', method='post')
              input(id='_method', type='hidden', name='_method', value='delete')
              input(id='email', name='email', type='hidden', value="#{user.get('email').value}")
              input(type='submit',value='Remove User',data-transition='fade', data-theme='c')

  h2 New User
  form(action='/',method='post')
    tbody
      tr
        td: input(id='email',type='text',value='',placeholder='foo@bar.com',name='email')
        td: input(id='first_name',type='text',value='',placeholder='First Name',name='first_name')
        td: input(id='last_name',type='text',value='',placeholder='Last Name',name='last_name')
        td: input(type='submit',value='Add User',data-transition='fade', data-theme='c')




Cassandra

COTS to Cassandra

At NASA's Advanced Supercomputing Center, we have been running a pilot project for about eighteen months using Apache Cassandra to store IT security event data. While it's not certain that a Cassandra-based solution will eventually go into production, I'd like to share my experiences during the journey.

During the summer of 2011 I began looking for a replacement technology for a commercial event management solution that would handle our anticipated growth in data, improve ingestion and retrieval times, and eliminate a single point of failure in a data store. My team and I researched various approaches, including traditional relational database technologies, and we came to the conclusion that the amount of resources required to implement many of the native Cassandra features (time-to-live, snapshotting, no single point of failure) in another technology would take away from our goal of quickly solving this problem and moving on to the next one. Throughout that fall, as we had available time, my team and I set up a small three-node Cassandra cluster on a virtualized server and we learned by doing. Using Python and Pycassa, I wrote a data parser and several sample query routines against time series data so we could understand how Cassandra behaved. We added and dropped keyspaces, we modified column families, we let data expire and then reloaded it, and we shut down virtual interfaces at critical times in an attempt to understand how Cassandra would behave. I came away mostly impressed, but cautious of how little we understood data modeling and how much effort would be required in terms of parallelizing our code for increased performance. We took our data points, timings, and overall impressions and made a pitch to NASA that Cassandra mapped well to our use cases and our environment. Some of the highlights were:

• our environment was tailor-made for Cassandra, being heavily write-centric with far fewer reads
• replication/availability is a core Cassandra capability
• data expiration via TTL timestamps on all inserted data would be immensely helpful in keeping an accurate rolling window
• flexibility in organizing the data

Perhaps the most important thing I tried to convey was that Cassandra was an enabling technology that allowed us to ask increasingly complex questions of our data.

Deployment

As it came time to order the hardware, I decided that we would go with five physical servers divided into various virtual machines:

• 3 2U servers for Cassandra (48GB RAM, 24 cores, 2 SATA 500GB drives, 6 SAS 600GB drives, RAID card)
• 1 2U server to handle all the data ingestion
• 1 2U server to handle our analytics

During the Christmas and New Year's holidays, a time when most offices slow down, one member of my team and I got busy with our performance testing. We ran through several different Cassandra configurations and benchmarks before settling on our final solution of three virtual machines per physical server. We decided to base our benchmark configuration off of similar tests performed by the Netflix team. We used the stress tool shipped with Cassandra 1.0.6:

$ cassandra-stress -D "nodes" -e ONE -l 3 -t 150 -o INSERT -c 10 -r -i 1 -n 2000000




• "nodes" was a comma-separated list of all cluster IPs
• consistency level of one
• replication factor of three
• one hundred fifty threads
• perform a write operation
• ten columns per key
• random key generator
• report progress in one-second intervals
• two million keys

Our cassandra.yaml file was configured with two changes from the default. Aside from the seed provider, per the documentation we changed:

concurrent_writes: 56

We first tested two virtual machines per physical server with fully virtualized disks (the blue line). We noticed our operations fluctuated quite dramatically, which we assumed to be disks competing for throughput. Next we tried three virtual machines per physical server with an assumption going in that we'd get worse performance as the disks had another image to keep up with (the green line). Finally, we decided to de-virtualize the I/O and bind two disks together in a RAID 0 for each virtual machine (red line). As you can see, this increased our operations per second, as well as smoothing the overall curve, so I knew we were making progress (Figure 1).

Here is the corresponding latency graph for each configuration. Again, the red line has the lowest overall latency while remaining the smoothest (Figure 2).

Our final virtual machine configuration is as follows:

• Xen 4.1.2
• Gentoo Linux (kernel 2.6.38)
• 7 cores
• 15 GB RAM
• 1.2 TB disk (2 SAS disks per VM in a RAID 0)

Each hypervisor is configured with:

• Xen 4.1.2
• Gentoo Linux (kernel 2.6.38)
• 3 cores
• 512MB RAM
• 500 GB disk (2 SATA disks in a RAID 1)

I chose not to run bare metal, as several talks with the Cassandra developers on Freenode’s #cassandra channel made it clear that I’d get better Cassandra performance by virtualizing our larger machines into more numerous smaller ones.

Data Modeling

Figure 1. Write Operations

Figure 2. Latency

When you ask people about data modeling in Cassandra, one of the first things you inevitably hear is to organize the column families based on the questions you plan to ask of the data.





This seems absurdly obvious until you come to the stark realization that you do not have a good handle on every question across every data type and you need flexible models. By far, the most time was spent experimenting with various ways of organizing the column families and modifying our Python code to be multiprocess for our heavier data feeds. I was fortunate in that all of our data is well structured and was largely time series in nature. This meant that we usually ended up with a column family that looked similar to the one shown in Table 1. Since we were dealing with time series data and we'd often bound our lookups by a timestamp, we'd row partition by time, sometimes in minute intervals (as above), fifteen-minute intervals, or even hour intervals depending on how large the data was. Each column would then be a unique entry for that piece of data with the value being a JSON-encoded blob. If a particular piece of data occurred over multiple partitions, we'd write that piece of data to each row. Pretty straightforward really. The drawback to this approach is that without any other column families, we'd be doing a full row scan (sometimes in the tens of thousands of columns) for a single entry, which consisted of decoding each JSON blob and looking for a particular value. In practice, I found that this wasn't much of a problem for small time intervals as it only took a few seconds per row. However, as I started increasing our query size to hundreds or thousands of rows (1440 rows for a twenty-four-hour period), I realized that we had to spend some time working with the Python multiprocessing module so we could fetch rows in parallel if we wanted any sort of response time not measured in minutes. While this allowed us to take full advantage of multiple cores, it left a pretty high barrier to entry in terms of development skills required by a group largely staffed by operations folks.
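A minimal sketch of that parallel fetch is shown below. The keyspace, column family and JSON field names ('security', 'events', 'src_ip') are invented, and each worker naively opens its own connection pool, which is wasteful but keeps the example short:

import json
from datetime import datetime, timedelta
from multiprocessing import Pool

import pycassa

ROW_FMT = '%Y-%m-%d %H:%M'  # one row per minute, as in Table 1

def fetch_matches(row_key):
    pool = pycassa.ConnectionPool('security', server_list=['127.0.0.1:9160'])
    events = pycassa.ColumnFamily(pool, 'events')
    matches = []
    try:
        # full row scan: decode every JSON blob and keep the ones we care about
        for _, blob in events.get(row_key, column_count=100000).items():
            event = json.loads(blob)
            if event.get('src_ip') == '10.0.0.1':
                matches.append(event)
    except pycassa.NotFoundException:
        pass
    return matches

if __name__ == '__main__':
    start = datetime(2012, 12, 1)
    keys = [(start + timedelta(minutes=i)).strftime(ROW_FMT) for i in range(1440)]
    workers = Pool(processes=8)
    results = workers.map(fetch_matches, keys)   # 1440 rows fetched in parallel
    workers.close()
    hits = [event for chunk in results for event in chunk]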

Even without getting into multiprocessing, there were ways to optimize retrieval by creating lookup tables pertaining to values in the JSON blobs. As an example, we could decide to index by IP address (Table 2). With this column family we simply need to find the IP address in question, retrieve the column names for every occurrence, and then go and grab the values from the first column family. A small optimization (at the cost of data redundancy) would be to write the full JSON blob into the column_value so we don't have to read two tables when looking up by IP address. If the data values are frequently changing, having to ensure consistency across multiple column families is a solvable problem, but a problem nonetheless. In our environment, the data is immutable, so either approach would be fine: it's speed versus storage.

Takeaways

Cassandra can be daunting if your only background is comprised of single-instance relational databases. While I ultimately got things working as expected, there was a significant time and skill investment in doing so. Like any complex technology, it's expert friendly: the more you understand it, the more you and your organization will get out of it. These days, it's not uncommon to find someone who can deliver all of these skills alone, but having a well-engineered team of these people can significantly reduce the time it takes to go from nothing to full production:

• a systems administrator to manage a distributed cluster and tune the OS and Cassandra
• a business analyst to fully understand the data and the questions being asked of it
• a software developer to facilitate the optimal means of retrieving the data and writing the complex analytics

Table 1. Sample column family

row key           | column_name                           | column_value
2012-12-01 00:00  | ffd82988-1c01-498b-ac53-14ef9cb9467b  | JSON blob
2012-12-01 00:00  | fff3ad45-b4b8-42df-a021-6c53ad2095a4  | JSON blob
2012-12-01 00:01  | fff76205-75ff-4cca-bb9e-2c963be10f54  | JSON blob
2012-12-01 00:01  | ffea8e76-5c24-426f-9bb9-a40912e7a8f0  | JSON blob
2012-12-01 00:01  | fff753d5-91f1-4045-86ad-8a20d4fa4fc5  | JSON blob

Table 2. Indexed by IP address

row key  | column_name                           | column_value
a.b.c.d  | fff3ad45-b4b8-42df-a021-6c53ad2095a4  | empty
a.b.c.d  | fff76205-75ff-4cca-bb9e-2c963be10f54  | empty
a.b.c.d  | ffea8e76-5c24-426f-9bb9-a40912e7a8f0  | empty
a.b.c.d  | fff753d5-91f1-4045-86ad-8a20d4fa4fc5  | empty
a.b.c.e  |                                       |





While making this journey, it's difficult to overstate how much I've learned about our data, our use cases, and Cassandra. To that end, I'm happy to pass on some tidbits I've learned from DataStax engineers, Cassandra developers, and the community at large:

Operations

• try and keep your disk utilization lower than 50% so you won't get full disks in the event of a node failing
• if repairing, use the '-pr' option so you don't repair the same range twice
• once token IDs are set, you can remove them from cassandra.yaml, thus allowing each node in the cluster to use the same file
• using MX4J as an interface to JMX caused random crashes across every node until it was removed
• have a planned method of managing configurations and upgrades across a distributed cluster

Tuning

As I loaded more and more data into our cluster, I started seeing errors in the Cassandra logs, so I needed to make the following changes to cassandra.yaml:

flush_largest_memtables_at: 0.50
reduce_cache_sizes_at: 0.60
reduce_cache_capacity_to: 0.4

I've not gone back and re-run our benchmarks with these changes and the latest versions of Cassandra, as the cluster is under load now (and we aren't experiencing any performance issues).

Modeling

• a good understanding of the data and the required analytics cannot be over emphasized
• don't use supercolumns
• reading successive column families can cause query timeouts – be prepared to refactor your data if this happens
• don't be afraid of de-normalizing your data even if it changes – have well defined ways of updating every occurrence
• secondary indexes never worked well for me, so I don't use them
• materialized views work very well if you have the disk space

Overall

• give yourself time to master any complex technology, Cassandra is no different – you can stand up a cluster in less than a day, but getting the most out of it will take longer
• Cassandra is a work in progress, be prepared for bugs – 1.1.3 removed the ability to drop column families (fixed in 1.1.4)

Conclusion

Usually the first question someone might ask is "Do I need Cassandra or can I use a relational database?" That can be tough to answer without delving into a myriad of requirements, but one of the best answers I've heard was that it makes sense to use Cassandra when your data no longer fits comfortably on one node. As our data set is currently measured in the hundreds of gigabytes, it remains to be seen if Cassandra will move into production given the amount of development expertise required to write the analytics. Cassandra has a SQL-like query language called CQL that allows business analysts familiar with SQL to avoid writing complex Python, Java, or your high-level language of choice. With proper schemas, it might be very possible to only use CQL; however, we're not at that point.


CHRISTOPHER KELLER (@cnkeller), Solutions Architect, CSC
Christopher Keller is a Solutions Architect at CSC where he specializes in big data technologies for the High Performance Computing group. When he's not grappling with data, he holds a second degree blue belt in Gracie Jiu-Jitsu. cnkeller@gmail.com






