GlobalLogic Leaders in Software R&D Services
NETWORK PROCESSORS OF THE PAST, PRESENT AND FUTURE
A White Paper by GlobalLogic
Connect. Collaborate. Innovate.
TABLE OF CONTENTS

ABSTRACT
INTRODUCTION
DEFINITION OF A NETWORK PROCESSOR
    Network Processor
    Management Processor Unit
    Packet Co-Processor Unit
KEY FEATURES
    Data Forwarding Functions
    Wire Speed Performance
    Scalability
    Simple Programmability
    Flexibility
    Modular Software
    Management Functions
    Third Party Support
    High Functional Integration
APPLICATIONS
    Routing
    Security
    QoS Related Applications
    ATM Switching
FUNCTIONAL COMPONENTS
    Data Parsing
    Classification
    Lookup
    Computation
    Data Manipulation
    Traffic Management
    Control Processing
    Media Access Control
ARCHITECTURE
    Next Generation Network-Specific Processors
FAST PATH – SLOW PATH
    Control Plane
    Data Plane
SUMMARY
ABSTRACT

This white paper researches the role of the network processor in the past, present and future. It discusses the functional components and applications of the network processor, as well as its evolution and architecture.
INTRODUCTION

In the ever-evolving network equipment segment, vendors have to walk the tightrope between the demands of gigabit performance and intelligent processing challenges such as QoS and Layer 7 applications. In the face of fierce competition and shorter time-to-market, network solution providers are shifting gears in the fast lane. Switching from general purpose CPUs to ASICs to SoCs to FPGAs to ASIPs in less than two decades is the stuff of which Formula 1 racing dreams are made! Balancing performance, flexibility, cost efficiency, time-to-market and intelligent processing is a tough challenge, but network processors have lived up to these expectations. The network processor segment is slowly but steadily making inroads as the future of network equipment solutions. Let us take a quick look at the journey of network processors of the past, present and future.
Figure 1. Comparison of Technologies
DEFINITION OF A NETWORK PROCESSOR

Network Processor

A network processor is a programmable system-on-chip that broadly consists of two distinct units: one for performing intelligent packet processing and the other for handling fast path tasks such as packet switching. These two hardware functional units, combined with optimized yet customizable software, together define the network processor.
Management Processor Unit

The intelligent management processing unit (MPU) of a network processor is a central processor that provides essential management and system-level functionality, along with programmable intelligent packet processing (such as Layer 7 applications) that can be customized rapidly. The MPU can perform offloaded, non-critical packet processing. It is also responsible for initializing the co-processor units and for communication with network management systems and host processors.
Packet Co-Processor Unit

An array of co-processors powered by optimized networking software accelerates the well-understood packet processing functions in the fast path. These include compute-intensive packet processing such as packet switching at Layer 2/3, QoS management, etc. These co-processors handle the bulk of ingress traffic, which needs to follow common packet processing: classification, forwarding, computation, modification, etc.

Figure 2. Network Processor Block Diagram
In addition to the two units mentioned above, it is common to use off-chip specialized co-processors to provide additional service functions ranging from encryption to policy management to segmentation/re-assembly. Network processor units (NPUs) also need several high bandwidth interfaces, as their main function is to perform data switching. Some of these interfaces (such as MAC devices) may be integrated in the network processor to further increase the overall value proposition.

The challenge for new networking equipment is to create the best of both worlds by performing sophisticated packet processing at wire speed. Network processor technology uses communication software components that are combined with advanced packet processing technology, "standard" programming interfaces, and a robust development environment. This enables network equipment vendors to quickly bring to market a wide array of different products based on the same hardware and software architecture. The result is significantly faster time-to-market for new products and dramatically longer time-in-market by using software upgrades to deliver new, advanced services that extend the product life cycle.

Network processors interface with the host CPU through standard bus interfaces such as PCI. They also interface with external memory units such as SDRAM/SRAM to implement lookup tables and PDU buffer pools. Another important interface of network processors is a standards-based switch fabric interface such as CSIX. Network processors interface with physical devices such as MACs/framers for PDU ingress/egress operations.
KEY FEATURES

Some of the key features that define and distinguish a network processor are:
Data Forwarding Functions

Network processors must provide fundamental data forwarding functions. Data processing from ingress to egress can be roughly classified into flow classification and flow processing.

Flow classification refers to the process through which a data PDU is examined in order to decide its further processing. In the first stage of classification, the ingress PDU is reassembled if required. Then traffic policing is performed using user-defined algorithms and appropriate actions to thwart policy violations. Third and most importantly, deep packet classification is required to make forwarding decisions. The fast path implementation of the network processor must support any depth of frame classification at wire speed, irrespective of packet length. Furthermore, deep classification must be fully programmable in order to process any possible protocol PDU. As an add-on function, intelligent statistics collection is required by the network processor to track the classification results of path flows.

Flow processing is the stage during which forwarding decisions are applied based on the result of flow classification. Forwarding decisions should be made based on standard Layer 2/3 switching algorithms or proprietary/customizable protocols. This function should be optimized to eliminate bottlenecks, irrespective of the traffic bandwidth or the number of nodes in the network deployment. The network processor must provide a programmable buffer management interface to perform packet forward/drop actions on the data streams; several industry standard buffer management algorithms are available for this purpose. A user programmable stream modification facility on data flows should be provided, especially for header manipulation during forwarding. The network processor must also provide traffic shaping by using various scheduling algorithms during the output scheduling stage. Finally, a statistics gathering function needs to be implemented for tracking flow processing.
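As a rough illustration of the classification-then-action flow described above, the sketch below matches a hypothetical 5-tuple key against a flow table and returns a per-flow action with a hit counter for statistics. The key layout, table size and hash are illustrative assumptions; real network processors implement this step in microcode or dedicated classification engines.

    /* Minimal sketch of fast-path flow classification.  Keys must be
     * zero-initialized before filling so struct padding compares equal. */
    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    struct flow_key {
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port;
        uint8_t  protocol;
    };

    enum flow_action { FLOW_FORWARD, FLOW_DROP, FLOW_SLOW_PATH };

    struct flow_entry {
        struct flow_key  key;
        enum flow_action action;
        uint64_t         hit_count;        /* per-flow statistics */
    };

    #define FLOW_TABLE_SIZE 4096           /* illustrative size */
    static struct flow_entry flow_table[FLOW_TABLE_SIZE];

    static size_t flow_hash(const struct flow_key *k)
    {
        /* Simple multiplicative hash; hardware would use parallel
         * hash units or a CAM instead. */
        uint32_t h = k->src_ip ^ (k->dst_ip * 2654435761u)
                   ^ ((uint32_t)k->src_port << 16) ^ k->dst_port
                   ^ k->protocol;
        return h & (FLOW_TABLE_SIZE - 1);
    }

    enum flow_action classify(const struct flow_key *k)
    {
        struct flow_entry *e = &flow_table[flow_hash(k)];
        if (memcmp(&e->key, k, sizeof *k) == 0) {
            e->hit_count++;                /* statistics collection */
            return e->action;
        }
        return FLOW_SLOW_PATH;             /* unknown flow: punt */
    }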
Wire Speed Performance

Network processors must live up to wire speed performance expectations. Typically all network processing follows the parallel processing model in the fast path, which is a distinguishing feature of these devices. Network processors must evolve to track and support complex networking applications at a high performance rate. Thousands of simultaneous connections need to be managed by core devices that use technologies such as MPLS. Also, quickly emerging VoIP services lay down strict quality of service requirements in the fast path. Network processors must be able to support a variety of protocols such as ATM/AAL5, IP, Ethernet, etc. In many cases, network processors must also provide support for legacy protocols such as IPX and SNA.
The bottom line is that network processors must be able to support large bandwidth connections, multiple protocols and advanced features without becoming a performance bottleneck. That is, network processors must provide wire speed, non-blocking performance regardless of the bandwidth requirements, the type of protocol or the features that are enabled. It is a tall order to fill, but the solution lies in following a highly optimized fast path processing model for common networking tasks such as Layer 2 and Layer 3 switching, packet classification, etc. A fully optimized processing architecture with a high MIPS (millions of instructions per second) to Gbps (gigabits per second) ratio is required to support wire speed operation at high bandwidths while still having processing headroom for advanced applications.
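To make the MIPS-to-Gbps ratio concrete, the back-of-envelope calculation below estimates the per-packet instruction budget. The link rate and MIPS figures are illustrative assumptions, not vendor data.

    /* Back-of-envelope instruction budget per packet (illustrative).
     * At 2.5 Gbps (OC-48) with worst-case 64-byte packets:
     *   packets/s = 2.5e9 bits / (64 * 8 bits) ~= 4.9 million pps
     * A 5000-MIPS processing complex then leaves roughly
     *   5e9 / 4.9e6 ~= 1000 instructions per packet
     * for classification, lookup, modification and queuing combined. */
    #include <stdio.h>

    int main(void)
    {
        double link_bps = 2.5e9;        /* assumed OC-48 payload rate */
        double pkt_bits = 64.0 * 8.0;   /* worst-case minimum packet  */
        double mips     = 5000.0;       /* assumed aggregate MIPS     */

        double pps    = link_bps / pkt_bits;
        double budget = (mips * 1e6) / pps;
        printf("%.1f Mpps -> %.0f instructions/packet\n",
               pps / 1e6, budget);
        return 0;
    }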
Scalability

Compounding this wire speed performance crisis is the clear requirement for such performance to scale in two dimensions. Today it must scale sufficiently to build a complete platform with multiple access speeds. In the future, network processor architecture must give developers confidence that their network processor of choice will keep pace with the constant increase in interface speeds. One of the primary requirements for a network processor is to be able to rapidly scale its performance to support ever-increasing bandwidth requirements: Gigabit Ethernet / OC-12 (today), OC-48 (near future), and OC-192 / 10 Gigabit Ethernet (distant future).

With the increasing deployment of high speed bandwidth technologies such as Gigabit Ethernet and DWDM, the demand on networking equipment performance is increasing at an exponential rate. However, the speed of ICs is still bound by Moore's Law! It is clear that network processor vendors must provide a breakthrough in scalability. More specifically, the throughput of a family of network processors must scale upwards significantly over time. This scaling should largely preserve the original programming interface and the original fast path software.
Simple Programmability

With network application requirements rapidly changing, simple programmability is the key to facilitating the customization and integration of emerging technologies. This need for simple adaptability is perhaps the strongest argument in favor of a network processor, especially in contrast to its high-speed but rigid cousins such as ASICs. Time-to-market is critical for the success of any product, meaning that a user-friendly development environment, as well as extensive development and debugging tool support, is essential. In this fiercely competitive marketplace, simple programmability and development environment support make network processors accessible to even the smallest equipment manufacturers.
In order to meet the demand for simple programmability, network processor manufacturers are striving to supply programming and testing tools that are easy to use. These programming tools are based on a simple programming language that allows for the reuse of code wherever possible. In addition, programming tools provide extensive testing capabilities and intelligent debugging features such as descriptive codes and definitions. They also strive to provide code level statistics for optimization. Testing tools must be able to simulate real world conditions and provide accurate measurements of throughput and other performance metrics.

Another important consideration is the selection of a suitable high-level programming language for network processor programming. By far, the most common software languages in real-time communication systems are C and C++. Programming in C and C++ enhances the future portability of the code base, which enables its use in future generations of network processors and industry standard programming interfaces. This option is not possible with specialized languages or state-machine codes.
Flexibility

For real platform leverage, a network processor must be universally applicable across a wide range of interfaces, protocols and product types. This requires programmability at all levels of the protocol stack, from Layer 2 through Layer 7. Protocol support must include packets, cells and data streams (separately or in combination) across various interfaces to meet the requirements of carrier edge devices, which are the cornerstone of the emerging multi-service carrier network.

The number of instructions required to implement a fast path application must be kept to a minimum. To accomplish this goal, each instruction must be powerful and targeted at performing data path functions. A user must be able to implement a new application, such as a new protocol or a scheduling algorithm, in a matter of days instead of months. The programming interface must be concise. Furthermore, a user must be able to implement virtually any network application, existing or futuristic, without needing to replace the network processor hardware. System operators should be able to add new routes, connections and forwarding treatments at runtime without affecting the fast path flow.

True network processors integrate all the functions implemented between the physical interfaces and the switching fabric, thus enabling an open approach for the PHY and fabric levels. This permits best-of-breed, multi-vendor solutions that allow vendors to offer true product differentiation and scalability. In addition, software implementation of these functions allows simpler upgrade paths in this constantly changing networking world.
Modular Software

Network processor-based solutions must also address the fundamental programming and processing models. As multiple processing cores are grouped together to achieve very high throughput, the simple application of traditional programming tools leads to extreme complexity in multiprocessor software development. Regardless of the language or tools employed, asking a programmer to understand and distribute time-sensitive, interdependent task execution across multiple processing cores adds tremendous complexity to the process and goes directly against the aim of total time-to-market. Ideally, even with multiple processor cores in operation, a programmer should be able to address the NPU as a simple, single, logical processor from a software perspective, thus minimizing programming complexity while maximizing real-time performance.

In order to facilitate scalability across multiple product lines and migration across multiple generations, software architecture must be highly modular and simultaneously facilitate both portability and the rapid rollout of additional features. A communication processor cannot deliver software flexibility and portability if the programming interfaces are dependent on the processor. The processor's architecture must support generic communications programming interfaces to simplify the programming task and allow future software reuse across processor generations. By delivering software stability across product generations, network processors radically improve software development cycles and reliability, which is the largest factor in total system availability. Furthermore, open APIs allow for the maximum degree of integration with industry standard application software.
Management Functions

A management interface allows an external source such as a management system or human user to monitor and modify the operation of the communication device on the fly. It is common for network processor vendors to provide standards-based management software that runs in the main processor core to manage the network device, thus eliminating the need for any separate host processor. Typical management activities include:

• Configuration – setting device parameters that affect the device's runtime behavior
• Performance Monitoring – collecting information concerning the performance of the device, which may prompt reconfiguration
• Usage Monitoring – collecting information concerning the usage of the device
• Fault Monitoring – collecting information concerning the faults, errors and warnings related to device operation
• Diagnostics and Testing – prompting the device to perform self-diagnostics and tests, and reporting the information to the management source
Network processors must be able to gather performance and traffic flow statistics that can (1) be collected by a billing or accounting system using such common protocols as RMON and SNMP and (2) provide support for services such as SLA management and enforcement.

Network processors also play a key role in enforcing the various classes of service that providers may offer. To enforce these policies, traffic must be identified and classified at the ingress. Traffic filtering also typically occurs at this boundary using access control lists or some other policy enforcement mechanism. By performing management tasks such as identification, classification and accounting within the network processor, hardware vendors can take advantage of the specialized nature of these processors to provide a large performance boost to their products.
Third Party Support

To realize the full potential of a software-driven environment, the network processor needs to be the foundation of a complete communications platform that takes advantage of industry-wide hardware extensions, software applications and tool suites. This is only possible with an architecture that has the flexibility to support third-party protocol stacks, to support any PHY or fabric interface, and to link with industry standard tools.

Developers typically must integrate network processor functions with a high-speed interconnect (i.e., a switch fabric or switching engine) and add the queuing and scheduling services of a traffic management engine to facilitate rich Quality of Service (QoS). This integration can be one of the most difficult and time-consuming tasks for developers. Clearly, minimizing the time spent in developing additional "glue" logic, whether in hardware or software, is another essential part of delivering time-to-market. Network equipment providers typically look to a vendor to supply total platforms of compatible, integrated network processors, traffic management engines and switch fabrics.
High Functional Integration

Network processors need to provide a high level of system integration that dramatically reduces part count and system complexity while simultaneously improving performance, as compared to a design that incorporates multiple components. In addition, a highly integrated network processor avoids the interconnection bottlenecks that are common with component-oriented designs.
Integrated co-processor engines (such as for classification or queuing) can be fully utilized by internal processing units without interconnection penalties. Integration of lower layer functions (such as SONET framers) within the chip also enables higher port densities and lower costs than have typically been possible in the past. Therefore the high functional integration of bus interconnects, co-processing engines, special purpose hardware, memory units and standard bus and switch interfaces is an important characteristic of the network processor.
APPLICATIONS

As the number of applications for network processors has grown, the market has begun to segment into three main network equipment areas: core, edge and access. Each of these areas has different target applications and performance requirements.

Core devices sit in the middle of the network. As a result, they are the most performance-critical and the least dependent on flexibility. Examples of these devices are gigabit and terabit routers. Edge devices sit between the core network and access devices. Examples of edge devices include URL load balancers and firewalls. They are focused on medium-high data rates and higher layer processing, so a certain amount of flexibility is required. Access equipment provides various devices access to the network. Most of their computation relates to aggregating numerous traffic streams and forwarding them through the network. Examples of access devices include cable modem termination systems and base stations for wireless networks.

Each level of the network requires a different mix of processing performance, features and costs. To meet these needs effectively, network processors must be optimized not only for the specific requirements of the equipment, but also for the services delivered in each segment of the network infrastructure.
Figure 3. Network Equipment Areas
Routing

Routers are the workhorses of the Internet. A router accepts packets from one of several network interfaces and either drops them or sends them out through one or more of its other interfaces. Packets may traverse a dozen or more routers as they make their way across the Internet.

In order to forward an IPv4 packet, a router must examine the destination address of an incoming packet, determine the address of the next-hop router, and forward the packet to the next-hop address. The next-hop route is stored in a routing table, which is
created and maintained by a routing protocol, such as Border Gateway Protocol (BGP). There are many different algorithms for performing routing table lookups. However, they all use longest prefix matching, which allows entries to contain wildcards and finds the entry that most specifically matches the input address. For example, all packets going to subnet 128.32.xxx.xxx may have the same next-hop address. While this significantly reduces the size of the routing table, multiple lookups may be required depending upon the data structures and algorithms used.

The ideal packet forwarding solution must be able to support the link data rate and must be large enough to accommodate the routing table sizes of next-generation routing equipment (up to 512K routes at the edge). It must also be able to handle prolonged bursts of route updates with low update latency. In addition, network processors can handle a large number of addresses in the routing table and high data rates, which is not possible for entirely software-based solutions.

Although the traditional methods of IP forwarding (hashing and trees) are well understood, they are limited by table updates that are dependent on the prefix length and the size of the table. They cannot provide a level of performance comparable with hardware solutions and will not scale to higher data rates. Software-controlled, ternary CAM-based solutions are a step closer to the ideal, but they are saddled with the requirement that the table must be sorted. In addition, the update rate is dependent upon how many entries are in the table. Similarly, hardware-based solutions such as ASICs remain highly inflexible towards fast-changing requirements such as intelligent feature additions and the scaling of new protocols in Internet routers.

Co-processor-based network processors for IPv4 packet forwarding offer a solution to the table update and searching problem by providing single-cycle performance for all prefix lengths and table sizes. These co-processors allow router vendors to guarantee the performance of their routers even during the heaviest bursts of traffic.
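For illustration, the sketch below implements longest prefix matching with a simple one-bit-per-level (unibit) trie of the kind alluded to above. Production routers use compressed or multibit tries, or TCAMs; the node layout here is an assumption chosen for clarity.

    /* Minimal unibit trie for IPv4 longest prefix matching.
     * Illustrative sketch only. */
    #include <stdint.h>
    #include <stddef.h>

    struct trie_node {
        struct trie_node *child[2];  /* branch on the next address bit */
        int      has_route;          /* does a prefix end at this node? */
        uint32_t next_hop;           /* next-hop address for that prefix */
    };

    /* Walk the destination address bit by bit, remembering the last
     * node that carried a route: that is the longest matching prefix. */
    uint32_t lpm_lookup(const struct trie_node *root, uint32_t dst_ip,
                        uint32_t default_hop)
    {
        uint32_t best = default_hop;
        const struct trie_node *n = root;
        int bit = 31;
        while (n != NULL) {
            if (n->has_route)
                best = n->next_hop;           /* longer match found */
            if (bit < 0)
                break;                        /* all 32 bits consumed */
            n = n->child[(dst_ip >> bit) & 1];
            bit--;
        }
        return best;
    }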
Security

IP Security provides an extensible security platform at Layer 3 for higher layer protocols. This relieves higher layer protocols from defining their own ad-hoc security measures. Most security mechanisms entail encryption/decryption of data to ensure confidentiality. This is a highly compute-intensive task, making security applications a common target for network processor-based platforms. Many network processors in the market today integrate a special purpose crypto engine for hardware acceleration of encryption/decryption algorithms.

IPSec is by far the most commonly used protocol to provision IP VPN solutions. It consists of two protocols:

• Authentication Header (AH) – proof of data origin, data integrity and anti-replay protection
• Encapsulating Security Payload (ESP) – AH plus data confidentiality but limited traffic flow confidentiality
Either of these protocols can be implemented in transport mode (protects higher layer protocols only) or tunnel mode (protects the IP layer and higher layer protocols by encapsulating the original IP packet in another packet).

To ensure that all participating network equipment is consistent, some connection-related state must be stored at each of the endpoints of a secure connection. Called a Security Association (SA), this state determines how to protect traffic, what traffic to protect, and with whom protection is performed. The SA is updated using various control protocols and is consulted for data plane operations.

AH does not provide data confidentiality, but it does verify the sender and data integrity. The Security Parameters Index (SPI) and the destination address help identify the SA used to authenticate the packet. The sequence number field is a monotonically increasing counter used for anti-replay protection, which guards against replay attacks. The anti-replay service is implemented by a sliding window of acceptable sequence numbers.

For ingress packets, a device that supports AH must execute the following operations:

1. If the packet is fragmented, wait for all fragments and reassemble them.
2. Find the SA used to protect the packet (based on destination address and SPI).
3. Check the validity of the sequence number.
4. Check the Integrity Check Value (ICV).
   a. Save the authenticated data and clear the authentication field.
   b. Clear all mutable fields.
   c. Pad the packet if necessary.
   d. Execute the authenticator algorithm to compute the digest.
   e. Compare this digest to the authenticated data field.
5. Possibly advance the window of acceptable sequence numbers.

The following list enumerates the steps involved in supporting AH for egress packets:

1. Increment the sequence number in the SA.
2. Populate the fields in the AH header.
3. Clear the mutable fields in the IP header.
4. Compute the Integrity Check Value (ICV) using the authentication algorithm and the key defined in the SA.
5. Copy the ICV to the authentication data field.

Encapsulating Security Payload (ESP) provides data confidentiality and authentication. ESP defines a header and trailer that surround the protected payload. The presence of the trailer means that the payload may have to be padded (with zeros) to ensure 32-bit alignment. Some data encryption algorithms require a random initialization vector; if necessary, this is stored just before the protected data.

The list below illustrates the major steps required to support ESP for ingress packets:

1. Wait for additional fragments, if applicable.
2. Check for the SA and drop the packet if one does not exist.
3. Check the sequence number and drop the packet if it is outside of the window or is a duplicate.
4. Authenticate the packet (same as Step 4 in AH ingress support).
5. Decrypt the payload using the key and cipher from the SA.
6. Check the validity of the packet against its mode (transport vs. tunnel).
7. Check the address, the port and/or the protocol, depending on the SA.

On the egress side, the following functions must be executed for each packet:

1. Insert the ESP header and fill in its fields. For transport mode, an ESP header just needs to be inserted. For tunnel mode, the original IP packet must first be wrapped in another IP packet, then the ESP header added.
2. Encrypt the packet using the cipher from the SA.
3. Authenticate the packet using the appropriate algorithm from the SA and insert the digest into the authentication field in the trailer.
4. Recompute and populate the checksum field.
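The sequence-number check in the ingress lists above is the sliding-window anti-replay service described earlier. It can be sketched as follows; the 64-entry window is a typical size (cf. RFC 4303), and the state layout is an illustrative assumption that a real implementation would update atomically per SA.

    /* Sliding-window anti-replay check for one Security Association. */
    #include <stdint.h>
    #include <stdbool.h>

    struct replay_state {
        uint32_t highest_seq;   /* highest sequence number seen so far */
        uint64_t window;        /* bit i set => (highest_seq - i) seen  */
    };

    /* Returns true if the packet is acceptable and records it. */
    bool replay_check_and_update(struct replay_state *rs, uint32_t seq)
    {
        if (seq == 0)
            return false;                      /* zero is never valid  */
        if (seq > rs->highest_seq) {           /* new highest: slide   */
            uint32_t shift = seq - rs->highest_seq;
            rs->window = (shift >= 64) ? 0 : rs->window << shift;
            rs->window |= 1ULL;                /* mark seq as seen     */
            rs->highest_seq = seq;
            return true;
        }
        uint32_t offset = rs->highest_seq - seq;
        if (offset >= 64)
            return false;                      /* too old: drop        */
        if (rs->window & (1ULL << offset))
            return false;                      /* duplicate: drop      */
        rs->window |= 1ULL << offset;          /* accept and record    */
        return true;
    }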
QoS Related Applications

QoS applications have recently come to the forefront due to emerging applications such as VoIP, Service Level Agreement implementation, usage-based accounting, etc. On one hand, well-defined QoS protocols such as DiffServ and IntServ enable a wide variety of services and provisioning policies, either end-to-end or within a particular set of networks. On the other hand, collecting network usage information pertaining to flows and sessions is essential for billing and network analysis applications.

QoS protocols such as DiffServ require data flow identification based on some well-defined classification criteria. Classification can be performed based on QoS-specific fields (e.g., the DSCP field or Flow ID field) or on multiple header fields (e.g., source/destination IP/TCP address/port). Once a classifier rule is used to map an incoming packet to a QoS flow, all packets in the same flow receive the same treatment. A typical action taken in the data path based on flow-associated policies is the prioritized output scheduling of frames. The packet scheduler undertakes control of forwarding different packet streams using a set of queues. The main function of a packet scheduler is to reorder the output queue using an algorithm such as weighted fair queuing (WFQ) or round robin. Also, low priority traffic may be dropped in scenarios where there is insufficient buffer space. Ingress rate limiting can be performed based on pre-defined rate limits per flow. Dropping packets in case of rate limit
violations is also known as traffic policing; a minimal policer sketch appears at the end of this section. Another QoS operation is admission control, which decides whether a new flow can be granted the requested QoS. This is implemented with a control-plane reservation setup protocol such as RSVP.

Usage-based billing applications require highly granular policy rules to associate bandwidth usage with selected applications, specific users and distinct content. For example, it is necessary to track the download and bandwidth usage of a client when accessing a server and using RTSP (Real Time Streaming Protocol) to play the latest rock video clip. The main network processor functions and corresponding packet processing tasks for catering to this application include:

• Recognizing session initiation for a specific server: Layer 3 IP addresses and Layer 4 port numbers
• Monitoring the login session to identify the user name: Layer 5-7 extraction of login information
• Recognizing the RTSP session and associating it with the user: Layer 4 port numbers and Layer 5 keyword detection
• Identifying the desired file name (e.g., video clip) to download: Layer 5-7 extraction of the file name and matching against users-and-programs policy tables
• Recognizing the download session and associating it with the user: Layer 4 port numbers and Layer 5-7 keyword detection
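As referenced above, the per-flow rate limiting (policing) step can be sketched with a token bucket. The rate, burst and timestamp conventions below are illustrative assumptions.

    /* Minimal token-bucket policer for per-flow rate limiting. */
    #include <stdint.h>
    #include <stdbool.h>

    struct token_bucket {
        uint64_t tokens;    /* current tokens, in bytes           */
        uint64_t burst;     /* bucket depth, in bytes             */
        uint64_t rate;      /* fill rate, in bytes per second     */
        uint64_t last_ns;   /* timestamp of the last refill       */
    };

    /* Returns true if the packet conforms; false is a policing drop. */
    bool police(struct token_bucket *tb, uint64_t now_ns, uint32_t pkt_len)
    {
        /* Refill in proportion to elapsed time, capped at the depth. */
        uint64_t elapsed = now_ns - tb->last_ns;
        tb->tokens += tb->rate * elapsed / 1000000000ull;
        if (tb->tokens > tb->burst)
            tb->tokens = tb->burst;
        tb->last_ns = now_ns;

        if (tb->tokens < pkt_len)
            return false;   /* rate limit violation: drop the packet */
        tb->tokens -= pkt_len;
        return true;
    }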
ATM Switching

Asynchronous Transfer Mode (ATM) is a connection-oriented standard in which the end stations determine a virtual circuit (VC), or path, through an ATM network. The VCs are made up of different virtual paths (VPs), or paths between switches. Once control plane functions set up a VC, an ATM switch simply switches ATM cells from input ports to output ports. This switching is based on consulting a lookup table indexed by two fields in ATM cells: the virtual path identifier (8-bit VPI) and the virtual circuit identifier (16-bit VCI). A switch may then alter the VPI and VCI fields of the cell to reflect the new link on which the cell is traveling.

The ATM Adaptation Layers (AALs) provide different ways for ATM to communicate with higher layer protocols. The most popular method is AAL5, which is often used for IP over ATM. Since IP packets are larger than ATM cells (48-byte payload), AAL5 provides a guideline by which to segment IP packets so they can travel over an ATM network and to reassemble the ATM cells back into IP packets. To accomplish this, AAL5 defines its own protocol data unit (PDU), consisting of:
• Payload – higher layer PDU, maximum 65,535 bytes
• Padding – so that the AAL5 PDU evenly fits into a whole number of ATM cells
• 8-byte trailer:
  o User-to-user (UU) field: 1 byte
  o Common Part Indicator (CPI) field: 1 byte
  o Length field: 2 bytes
  o CRC field: 4 bytes
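The padding arithmetic implied by this layout can be sketched in a few lines; the helper names are illustrative.

    /* AAL5 padding arithmetic: payload + pad + 8-byte trailer must
     * fill an exact number of 48-byte ATM cell payloads. */
    #include <stdint.h>

    #define ATM_CELL_PAYLOAD 48
    #define AAL5_TRAILER_LEN 8

    /* Number of zero bytes of padding required. */
    uint32_t aal5_padding(uint32_t payload_len)
    {
        uint32_t used = (payload_len + AAL5_TRAILER_LEN) % ATM_CELL_PAYLOAD;
        return used ? ATM_CELL_PAYLOAD - used : 0;
    }

    /* Number of ATM cells the AAL5 PDU will occupy. */
    uint32_t aal5_cells(uint32_t payload_len)
    {
        return (payload_len + aal5_padding(payload_len) + AAL5_TRAILER_LEN)
               / ATM_CELL_PAYLOAD;
    }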
After calculating the amount of padding needed, as well as the length field and CRC field, the AAL5 PDU is simply sliced into 48-byte chunks that are used as the payload for ATM cells.

The Next Generation Network (NGN) is converging towards using omnipresent IP DSLAMs as edge devices to aggregate multiple DSL connections on the CPE side and connect to an IP/Ethernet router/switch on the other side. This involves an ATM to Ethernet conversion function, which is implemented in the data path of the network processor. An intelligent IP DSLAM will act as a Layer 2/Layer 3 switch in addition to implementing the ATM-Ethernet interworking function. A host of differentiating features such as QoS support, security mechanisms and management functions are made feasible by using a network processor at the core of the IP DSLAM solution.
FUNCTIONAL COMPONENTS

To examine how the applications outlined in the previous section map to network processors, we can make some generalizations about what type of processing is done on a PDU between the time it is received and when it is retransmitted. The applications decompose into their computational kernels, which broadly fall into seven categories: data parsing, classification, lookup, computation, data manipulation, traffic management and control processing. Based on the application requirements, they can be mapped onto the various functional components.
Data Parsing

Data parsing includes parsing cell or packet headers that contain addresses, protocol information, etc. In the past, parsing functions were fixed based on the type of device being constructed. For example, LAN bridges by definition only needed to look at the Layer 2 Ethernet header. Today, switching devices need the flexibility to examine and gain access to a wide variety of information at all layers of the OSI model, both in real time and on a conditional packet-by-packet basis.
Classification

Classification refers to matching a packet or cell against a set of criteria defined at Layers 2, 3, 4 or higher of the OSI model. Once data is parsed, it must be classified in order to determine the required action. This examination consists of looking at the PDU content to see which patterns of interest it contains. This process is referred to as "classification," and it is used in routing, firewalling, QoS implementation and policy enforcement. Following a data classification action such as a filtering/forwarding decision, advanced QoS and accounting functions that are based on a specific end-to-end traffic flow may be taken. This is an area of rapidly changing requirements.
Because of high packet volume and the need to classify and process packets at wire speed, hardware acceleration has become the industry standard method. In one implementation, hardware acceleration is provided for Layer 2/3 classification, while flexible software provides for processing at higher layers. For example, a network processor may enforce a policy that prioritizes an enterprise's internal communications over external web traffic. The first step in this process is to distinguish between the two traffic types. The first trade‐off between hardware and software involves packet classification. Although software can be used to compare and analyze a number of different fields, these functions are more suited to hardware acceleration since Layer 2 Ethernet and Layer 3 IP classification are well defined.
Lookup

The lookup kernel is the actual action of looking up data based on a key. It is mostly used in conjunction with pattern matching (classification) to find a specific entry in a table. The data structures and algorithms used are dependent on the type of lookup
required (one-to-one or many-to-one) and the size of the key. For ATM and MPLS, this field is quite small and the mapping is one-to-one, so often only one lookup is required. However, for IPv4 and IPv6 routing, the large address field and longest prefix matching (LPM) requirement make it impossible to find the destination address in one memory access. Therefore, trees are used to efficiently store the address table, and multiple lookups are required.

Table lookups are a critical operation in packet processing. Acceleration is becoming more and more critical as systems need to provide sophisticated QoS functions that require additional lookups farther up the OSI stack. The memory accesses required for table lookups (addresses, routes or flows) should be optimized in hardware with co-processor support that accelerates this function.
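To contrast the two lookup styles just described, a one-to-one label lookup can be a single direct-indexed memory access, unlike the multi-step trie walk shown earlier for LPM. The table layout below is an illustrative assumption.

    /* Direct-indexed lookup for a 20-bit MPLS label: one memory
     * access, no search.  The full 2^20-entry table (~8 MB here) is
     * for illustration; real devices size and partition label space. */
    #include <stdint.h>

    #define MPLS_LABEL_SPACE (1u << 20)       /* 20-bit label field */

    struct label_entry {
        uint32_t out_label;                   /* label to swap in   */
        uint16_t out_port;                    /* egress port        */
        uint8_t  valid;
    };

    static struct label_entry ilm[MPLS_LABEL_SPACE]; /* incoming label map */

    const struct label_entry *mpls_lookup(uint32_t label)
    {
        const struct label_entry *e = &ilm[label & (MPLS_LABEL_SPACE - 1)];
        return e->valid ? e : 0;
    }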
Computation

The types of computation required for packet processing vary widely. To support IPSec, encryption, decryption and authentication algorithms need to be applied over an entire packet. Most protocols require that a checksum or CRC value be computed. Often, this value simply needs to be updated, not recalculated, based on changes to header fields. Network equipment that implements protocols supporting the fragmentation (and reassembly) of PDUs requires computation to determine whether all fragments of a particular PDU have arrived.
Data Manipulation

We consider any function that modifies or translates a packet header or packet data within or between protocols to be data manipulation. For example, in IPv4 routing, the time-to-live (TTL) field must be decremented by one at each hop. Additional instances of data manipulation include adding tags and header fields, and replacing fields. Other examples in this space include segmentation, reassembly and fragmentation.

The variety of low-layer transport protocols is matched only by the diversity of protocol combinations and services. Transformation requirements can range from address translation within a given protocol (such as IP) to full protocol encapsulation or conversion (such as between IP and ATM). A PDU may be modified: for example, an IP packet will have its TTL counter reduced, and in label-switched traffic an incoming label will be replaced with an outgoing label. Headers may be added or removed. Modification usually entails recalculation of a CRC or checksum.
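The TTL decrement just mentioned is a good example of updating a checksum rather than recomputing it: the sketch below uses the incremental update technique of RFC 1141/1624 (the same trick used in production IP stacks), with a minimal IPv4 header layout assumed and no options shown.

    /* Decrement IPv4 TTL and incrementally update the header checksum. */
    #include <stdint.h>
    #include <arpa/inet.h>

    struct ipv4_hdr {
        uint8_t  ver_ihl, tos;
        uint16_t total_len, id, frag_off;
        uint8_t  ttl, protocol;
        uint16_t checksum;          /* one's-complement, network order */
        uint32_t src, dst;
    };

    /* Returns 0 if TTL has expired (packet goes to the slow path). */
    int ip_decrement_ttl(struct ipv4_hdr *ip)
    {
        if (ip->ttl <= 1)
            return 0;
        ip->ttl--;
        /* TTL occupies the high byte of its 16-bit word, so that word
         * dropped by 0x0100; add it back into the one's-complement
         * sum and fold the carry. */
        uint32_t check = ip->checksum;
        check += htons(0x0100);
        ip->checksum = (uint16_t)(check + (check >= 0xFFFF));
        return 1;
    }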
Traffic Management

Traffic management includes the queuing, policing and scheduling of data traffic through the device according to defined QoS parameters, based on the results of classification and established policies. This function is central to supporting the convergence of voice, video and data in next-generation networks. Queue management is the scheduling and storage of ingress and egress PDUs. This includes coordination with fabric interfaces and elements of the network processor that need to access packets. The queue management kernel is responsible for enforcing dropping and traffic shaping policies, as well as storing packets for packet assembly, segmentation and
various QoS applications. The actual QoS process of selecting packets, discarding packets and rate limiting flows must be done as the packet leaves the system after processing. A combination of specialized hardware functions and software configuration must be implemented at the output point to manage the QoS egress functions. Retransmission of a PDU is generally not straightforward: some PDUs may be prioritized over others while some may be discarded, and multiple queues may exist with different priorities.
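One minimal realization of multiple priority queues with a tail-drop policy, as described above, might look like the following. Queue counts and depths are illustrative assumptions; real devices use WFQ/WRED-class algorithms rather than strict priority.

    /* Strict-priority egress scheduler over N queues with tail-drop. */
    #include <stddef.h>

    #define NUM_PRIO   4
    #define QUEUE_LEN  256

    struct pkt;                            /* opaque packet handle */

    struct queue {
        struct pkt *slot[QUEUE_LEN];
        unsigned head, tail, count;
    };

    static struct queue egress[NUM_PRIO];  /* 0 = highest priority */

    /* Tail-drop: returns 0 when the queue is full and the packet is lost. */
    int enqueue(unsigned prio, struct pkt *p)
    {
        struct queue *q = &egress[prio];
        if (q->count == QUEUE_LEN)
            return 0;
        q->slot[q->tail] = p;
        q->tail = (q->tail + 1) % QUEUE_LEN;
        q->count++;
        return 1;
    }

    /* Always serves the highest-priority non-empty queue first. */
    struct pkt *dequeue(void)
    {
        for (unsigned prio = 0; prio < NUM_PRIO; prio++) {
            struct queue *q = &egress[prio];
            if (q->count) {
                struct pkt *p = q->slot[q->head];
                q->head = (q->head + 1) % QUEUE_LEN;
                q->count--;
                return p;
            }
        }
        return NULL;                       /* all queues empty */
    }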
Control Processing

Control processing encompasses a number of different tasks that do not need to be performed at wire speed, including exception handling, table updates, details of TCP protocols and statistics gathering. While statistics are gathered on a per packet basis, this function is often executed in the background using polling or interrupt-driven approaches. Gathering this data requires examining incoming data and incrementing counters.
Media Access Control

Implementation of low-layer protocols (such as Ethernet, SONET framing, ATM cell processing and so on) is covered under media access control. These protocols define how data is represented on the communications channel, as well as the rules that govern how that channel is accessed. Paradoxically, this is the area of greatest standardization among network devices due to standards-based protocol definitions. It is also the area of greatest diversity due to the wide and ever-growing variety of protocols. These include:

• Ethernet, with three different versions at 10 Mbps, 100 Mbps and 1000 Mbps
• SONET, supporting both data packets and ATM cells at a wide range of standard rates (OC-3, OC-12, OC-48 and so on)
• Legacy T/E-carrier interfaces from the existing public voice infrastructure
• A variety of emerging optical interfaces that must all coexist and interact
ARCHITECTURE

The field of network processors is notable for its great architectural heterogeneity. In general, however, it can safely be said that network processors universally provide programmable support for processing packets, and that this usually takes the form of one or more packet processors. These can be supported either on a single chip or across multiple chips. In addition, network processors universally support a number of MAC-level ports, some memory and some form(s) of processor interconnect. Network processor designs can be broadly divided into three main architecture types:
1. General RISC-based architecture
2. Augmented RISC architecture (with hardware accelerators)
3. Next generation network-specific processors

The first two architectures are sufficient for building today's fast Ethernet products. However, they will be unable to provide full 7-layer processing on more than a handful of ports at gigabit speed.
Figure 4. Network Processor Architectures
Next Generation Network-Specific Processors

A new wave of network processors, namely network-specific processors, is now being developed to provide the processing performance required for next-generation networking products. Network-specific processors integrate many small, fast processor cores that are each tailored to perform a specific networking task. By optimizing the individual processor cores for packet-processing tasks, network-specific processors overcome the limitations of RISC-based architectures. Network-specific processors can
deliver the packet processing performance to handle an appreciable number of ports at gigabit and terabit speeds.

With network-specific processors, exceptionally fast packet processing at high bandwidths is achieved by optimizing both the instruction set and the data path. Since each task-oriented core is designed with a specific networking function in mind, it uses a concise instruction set to accomplish the task. It may require as few as 1/10th the number of instructions used by a RISC-based processor to accomplish the same task.

Some of the important paradigms directing the architecture of the next generation of network-specific processors are listed below:

• Exploit parallelism – multiple processing elements
• Hide latency – multithreading and pipelining increase throughput
• Mix SRAM and DRAM to improve utilization and latency
• Optimize for header inspection and lookups using bit field extraction, comparisons and alignment
• Bind common special functions using integrated functions and/or co-processors
• Optimize housekeeping functions and inter-unit communications
• High integration
FAST PATH – SLOW PATH

The basic concept inherent to all network processor architectures is the distribution of packets between a fast path and a slow path. The fast path is usually the native way of transferring data in a given architecture. It is very efficient and optimizes the performance in use. The fast path is used when no special operations on the data packets are needed. If a packet entering the system needs some extra servicing (e.g., routing protocol packets or filter configuration packets), the packet has to be sent to the slow path. As its name suggests, the slow path is slower than the fast path. In the slow path, however, packets can be manipulated and processed in more complex ways than in the fast path.

Fast path operations – such as classification, lookups or priority checking – must be done on every packet at wire speed. An additional level of processing involves system level functions for which performance is not as critical, such as maintaining tables, communicating status information and keeping statistics. Networking systems need to create an efficient environment that diverts these slow path packets for processing and lets the slow path processor easily inject the processed packets back into the system. An open architecture should be employed so that an industry standard processor can execute existing code for slow path processing. This approach is much more efficient than porting slow path code to a new networking-specific processor, and it provides an effective mechanism for communicating between data plane and control plane processors.

For incoming data, the division between the data plane and the control plane has to be chosen carefully. If performance is critical, as it usually is in gigabit-class devices, slow path processing has to be kept low. Keeping in mind both the performance requirement of the fast path and the intelligent processing requirement of the slow path, the network processor architecture is divided into the data plane and the control plane.

The control plane is typically implemented in software that executes on a general purpose processor. It includes protocols and network management software, which handle control packets, perform data plane table updates, and perform interface management and statistics retrieval. The data plane, on the other hand, is typically constructed with programmable and configurable hardware entities. It performs switching functions and transfers packets from one interface to another. It also performs classification, scheduling, filtering and other functions.

Dividing network processor software into two planes has many advantages. A significant advantage is that the operation and/or load of one plane does not affect the other plane, which eliminates many causes of failure. It also adds the advantage of being able
to leverage advances in the data plane technologies without impacting the control plane.
Figure 5. Control Plane and Data Plane

Typically, data plane tasks require a small amount of code but a large amount of processing power, since they handle almost 99% of the ingress traffic. In contrast, control plane tasks require little processing power but a large amount of code, as they handle intelligent processing functions for 1% of the ingress traffic.

The different requirements of data plane and control plane tasks are often addressed by what is called a fast path – slow path design. In this type of design, as packets enter the networking device, their destination address and port are examined; based on that examination, they are sent internally on either the "slow path" or the "fast path". Packets that need minimal or normal processing take the fast path, and packets that need unusual or complex processing take the slow path. Fast path packets correspond to data plane tasks, while slow path packets correspond to control plane tasks. Once they have been processed, packets from both the slow and fast path may leave via the same network interface. The fast path is found in the data processor half (i.e., data plane) of the network processor.
Dividing up the processing in this way provides substantial implementation flexibility. While the slow path processing will almost certainly be implemented with a CPU, fast path processing can be implemented with an FPGA, ASIC, co-processor or maybe just another CPU. This architecture is particularly strong because it allows you to implement simple time-critical algorithms in hardware and complex algorithms in software. For example, in Intel's IXP1200, the fast path is serviced by the microengines, and the slow path is implemented by the operating system running on the StrongARM core of the IXP1200.
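In code, the dispatch decision reduces to a small predicate. The skeleton below is a schematic of that split; every helper function is a hypothetical hook supplied by the rest of the system, not any vendor's API.

    /* Fast path / slow path dispatch skeleton. */
    #include <stdbool.h>

    struct pkt;

    /* Hypothetical hooks supplied elsewhere in the system. */
    bool is_control_protocol(const struct pkt *p);  /* e.g. OSPF, BGP, RSVP */
    bool is_fragmented(const struct pkt *p);
    bool has_ip_options(const struct pkt *p);
    bool lookup_failed(const struct pkt *p);
    void slow_path_enqueue(struct pkt *p);          /* to the control CPU */
    void fast_path_forward(struct pkt *p);          /* wire-speed path    */

    /* Conditions that force a packet onto the slow path. */
    static bool needs_slow_path(const struct pkt *p)
    {
        return is_control_protocol(p)   /* control plane traffic      */
            || is_fragmented(p)         /* reassembly is complex      */
            || has_ip_options(p)        /* rare, option-by-option     */
            || lookup_failed(p);        /* no route: exception case   */
    }

    void dispatch(struct pkt *p)
    {
        if (needs_slow_path(p))
            slow_path_enqueue(p);
        else
            fast_path_forward(p);
    }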
Using a router as an example, this phenomenon can be considered from two vantage points: code size or processing requirements. It seems apparent that one could handle the data plane tasks of the router without a lot of code. On the other hand, even in a traditional network device like a router, control task implementations vary. All routers will have code to handle routing protocols like OSPF and BGP, and they will almost certainly have a serial port for configuration. However, they may be managed via a web browser, Java application, SNMP, or all three. This can add up to a lot of code.

Now, let's consider the packets entering the router. Nearly all of them are addressed to somewhere else, and they need to be examined and forwarded to these destinations very quickly. For example, in order for a router to run at wire speed on a 155 Mbps OC-3 link, it needs to forward a 64-byte packet in about three microseconds. These packets may not need to have much done to them, but the tasks need to be done in a timely manner. This requires tight code and a lot of processing power. By contrast, the occasional OSPF packet that causes the routing tables to be updated, or an HTTP request to make a configuration change, might require a fair bit of code to be handled properly; however, this will have little impact on overall processing requirements.
Control Plane

The control plane of the network processor is responsible for critical tasks such as network management, policy applications, signaling and topology management. Much of the behavior of a network processor is subject to control and configuration. A classifier function must be told what patterns to detect, and a queue management function must have its queues specified. Routing tables need to be updated. Control and configuration parameters originate either in policy decisions or in network protocols, but they are usually conveyed to the network processor by a general purpose processor (GPP). The network processor is said to operate in the data plane, and the GPP is said to operate in the control plane. Information also flows from the data plane to the control plane: the network processor may deliver signaling PDUs to the control plane, gather statistics that are returned to the control plane, or notify the control plane of error conditions. Generally, control plane tasks are less time-critical.
Although signaling PDUs will travel into and out of the traffic manager through the network processor along with other PDUs, they are different in that they are usually handled by the control plane. PDUs that are handled by the control plane are said to travel the slow path, while the majority that enter and exit without being seen by the GPP are said to travel the fast path. Some non‐signaling PDUs also travel the slow path. A network processor may delegate PDUs with unusually complex processing to the GPP to reduce the complexity and size of the network processor code. This tactic also prevents difficult PDUs from reducing the ability of the network processor to handle its normal workload.
It is important that the control plane is implemented on an integrated, high-performance, low-power processing core. It should be designed to handle a broad range
of complex processing tasks, including application processing; communication with the backplane; managing and updating data structures shared with data plane processing engines, such as routing tables; and setting up and controlling media and switch fabric devices. In addition, the control plane core handles exception packets that require additional complex processing. A multi-stage, high-efficiency processing pipeline architecture that minimizes latency and enables high clock speeds with low power consumption is ideally suited to implementing the control plane.

It is advantageous if the control plane core is integrated in the network processor chipset rather than implemented externally. This integrated approach gives OEMs significant flexibility in matching processing tasks to resources and minimizes integration costs.

Apart from the generality/specificity of their packet processors, different network processors make different choices regarding the centralization/decentralization of control and management. For example, some network processors rely exclusively on external control in the form of a host workstation. Others (e.g., the IXP1200) incorporate a commodity CPU on the network processor that runs an operating system. Still others support a sufficiently powerful and general packet processor that any of these can potentially serve as a locus of control and management.

The IXP1200's on-board StrongARM CPU runs a commodity OS such as Linux. In addition to handling slow path packet processing, the StrongARM is also responsible for loading code onto the microengines and stopping and starting them as required. The Motorola C-Port, on the other hand, has no built-in centralized controller. Instead, it relies on a host workstation to load and supervise the operation of its channel controller packet processors. Nevertheless, it is theoretically possible to dedicate one of the channel controllers to take the supervisory role, especially if fine-grained dynamic reconfiguration of the network processor is a goal. Similarly, the EZChip relies on a host workstation for control and management. In this case, there is no alternative because dedicating one of the packet processors, even if possible (cf. their lack of generality), would introduce an unacceptable bottleneck in the pipeline.
Data Plane

The data plane performs operations occurring in real time on the "packet path". The data plane implements core device operations such as receiving, processing and transmitting packets. The common data plane operations include:

• Media Access Control – implementing low-level protocols such as Ethernet, SONET framing, ATM cell processing, etc.
• Data Parsing – parsing cell or packet headers for address or protocol information
• Classification – identifying packets against criteria (filtering/forwarding decision, QoS, accounting, etc.)
•
Data Transformation – transforming packet data between protocols
•
Traffic Management – queuing, scheduling and policing packet data
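Taken together, these operations form a per-packet pipeline. The following minimal C sketch shows one possible sequence over an Ethernet/IPv4 packet; the descriptor layout, byte offsets (which assume an untagged Ethernet frame carrying IPv4) and stage bodies are illustrative assumptions, since real data planes operate on hardware buffer descriptors.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software packet descriptor. */
typedef struct {
    uint8_t *data;       /* start of the Ethernet frame      */
    size_t   len;
    uint16_t l3_proto;   /* filled in by parsing             */
    uint8_t  class_id;   /* filled in by classification      */
    uint8_t  queue_id;   /* filled in by traffic management  */
} pkt_t;

/* Each common data plane operation becomes one pipeline stage. */
static void parse(pkt_t *p)     /* Data Parsing: read the EtherType */
{
    p->l3_proto = (uint16_t)((p->data[12] << 8) | p->data[13]);
}
static void classify(pkt_t *p)  /* Classification: IPv4 vs. other   */
{
    p->class_id = (p->l3_proto == 0x0800) ? 1 : 0;
}
static void transform(pkt_t *p) /* Data Transformation: e.g., TTL   */
{
    p->data[22] -= 1;  /* IPv4 TTL sits at offset 22 of the frame */
}
static void queue_pkt(pkt_t *p) /* Traffic Management: pick a queue */
{
    p->queue_id = p->class_id;
}

void data_plane(pkt_t *p)
{
    parse(p);
    classify(p);
    transform(p);
    queue_pkt(p);
}
```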
As networks continue to evolve, the value of network processor technology will increasingly depend on intelligent packet processing at wire speed, rather than on raw performance alone. The ability of carriers to provision and bill for new services will require a combination of performance and flexible control over the processing resources of the data plane. On an OC-192 (10 Gbps) link, deep packet inspection must occur in intervals as short as 35 nanoseconds; at 10 Gbps, 35 ns corresponds to roughly 44 bytes on the wire, on the order of a minimum-length packet. A high-performing data plane is expected to perform the necessary Layer 3-7 processing on these cells/packets and then transmit them in the correct sequence and at the required rate, without loss.
Ideally, a multiprocessing data plane architecture ensures that aggregate processing capacity is available for rich packet/cell processing, even at 10 Gbps wire speeds, in applications that traditionally required high-speed ASICs. An inherently parallel data plane allows a single-stream packet/cell analysis problem such as routing to be decomposed into multiple sequential tasks (packet receive, route table lookup, packet classification) that can be linked together. The performance and flexibility of a software-defined processing pipeline allow multiple tasks to be completed simultaneously while preserving data and time dependencies. As network requirements evolve, a powerful and flexible data plane design will enable OEMs to scale performance and add features to meet new requirements.
Multithreading is a popular technique in data plane design for enhancing overall performance. Multiple-register technology enables data and event signals to be shared among threads and processing engines at virtually zero latency while maintaining coherency. Another innovation, the ring buffer, establishes FIFO "producer-consumer" relationships among processing engines, providing a highly efficient mechanism for flexibly linking tasks across multiple software pipelines. Through the combination of flexible software pipelining and fast inter-process communication, data planes can be adapted for access, edge and core applications to perform complex processing at wire speed.
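The ring-buffer mechanism can be pictured as a small single-producer/single-consumer FIFO between two processing engines. The C sketch below is a minimal software analogue under that assumption; real network processors implement such rings in hardware or scratchpad memory, and all names here are illustrative.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define RING_SLOTS 256U  /* power of two, so wrap-around is a cheap mask */

/* Single-producer/single-consumer ring linking two pipeline stages. */
typedef struct {
    void            *slot[RING_SLOTS];
    _Atomic unsigned head;  /* advanced only by the producer */
    _Atomic unsigned tail;  /* advanced only by the consumer */
} ring_t;

/* Producer side: returns false when the ring is full (back-pressure). */
bool ring_put(ring_t *r, void *pkt)
{
    unsigned h = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (h - t == RING_SLOTS)
        return false;
    r->slot[h & (RING_SLOTS - 1)] = pkt;
    atomic_store_explicit(&r->head, h + 1, memory_order_release);
    return true;
}

/* Consumer side: returns NULL when the ring is empty. */
void *ring_get(ring_t *r)
{
    unsigned t = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&r->head, memory_order_acquire);
    if (t == h)
        return NULL;
    void *pkt = r->slot[t & (RING_SLOTS - 1)];
    atomic_store_explicit(&r->tail, t + 1, memory_order_release);
    return pkt;
}
```

The acquire/release pairing is what gives the "virtually zero latency while maintaining coherency" property described above: neither side ever takes a lock.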
Most network processors implement the data plane with multiple packet processors, but these vary in nature from units with very general instruction sets to single-purpose dedicated units, not programmable, for tasks such as checksum calculation or hashing. Furthermore, some network processors feature only one type of packet processor, while others support several different types. For example, the Intel IXP1200 supports a uniform set of six so-called microengines that serve as packet processors. These are 233-600 MHz CPUs whose instruction set includes I/O to and from MAC ports, packet queuing support and checksum calculation. They support hardware threads with zero context-switch overhead and can be programmed either in assembler or in C. The Motorola C-Port [8], on the other hand, employs so-called channel processors: generic packet processors grouped in sets of four that share an area of fast memory. In addition, it supports a range of dedicated, non-programmable processors that perform functions such as queue management, table lookup and buffer management. As a third example, the EZchip NP-1 has no fully generic processors. Rather, it exclusively employs dedicated packet processors that perform specific tasks such as packet parsing, table lookup or packet modification. Although each is dedicated to its given 'domain', they are quite flexible and programmable within that domain.
SUMMARY
Considerations
The issues faced when choosing a network processor architecture are complex. The merits of an architecture depend on algorithm behavior, which in turn depends on traffic characteristics. Furthermore, network processors will spend their working lives handling traffic that can only be guessed at today. Today's predictions about Internet traffic are likely to be upended by the next new application, just as the current volume of MP3 downloads could not have been predicted a few years ago.
Many network processor architectures are available today, and the only safe prediction is that some will succeed while others will fail. Different designs embody different predictions about traffic and processing, and not all of them will be correct. It may turn out that different architectures dominate different application areas. Survival will depend on non-technical as well as technical issues. One of the most important technical issues is scalability: a good question to ask a network processor vendor is how a device architecture that runs at, say, OC-48 (2.5 Gbps) today scales to OC-192 (10 Gbps) or OC-768 (40 Gbps), since each step quarters the per-packet time budget (roughly 128 ns for a minimum-size packet at OC-48 shrinks to about 32 ns at OC-192 and 8 ns at OC-768).
When exploring the design possibilities of network processors in Next Generation Network service platforms, it is important to look for the following architectural features:
• High-Performance Processing Capability
• Scalable Processing Architecture
• Flexibility and Programmability
• Ability to Leverage Co-Processors and Memory
• Headroom for Emerging Services
• Variety of Open Interconnect Mechanisms
• Control Plane Processor Independence
• Robust Software Development Environment
• The Right Mix of Processing-Element and Functional-Unit Parallelism
Challenges
While much of the industry has focused on the hardware side of these systems, the software side presents its own challenges. The complexity of these architectures makes them very difficult for programmers to reason about, let alone to support effectively with high-level languages. While C/C++ compilers exist for many network processors, performance-critical code will continue to be written in assembly. Is there a common programming model that can target multiple network processors, much as C does for general-purpose processors? Will this software difficulty force future architectures to become much more programmable?
To use network processors, GPP software must be designed to isolate the fast path from the slow path. Take into consideration the following (a sketch of one such interface appears after the list):
• The GPP must communicate with the network processor via a carefully defined API.
• Functionality running on the network processor must be minimized.
• Network processor code may be written in assembly and functional languages, as well as C++.
• Network processor code must exploit the hardware architecture of the network processor, particularly its multiprocessor architecture.
• Future generations of network processors are likely to use larger numbers of processing elements (PEs) and to have more powerful co-processors.
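By way of illustration only, a narrow GPP-to-network-processor boundary might look like the following C header sketch. Every function name is invented for this example; a real vendor SDK defines its own equivalents, but the shape (load and control the engines, push table updates down, surface exceptions up) is the carefully defined API the list above calls for.

```c
/* np_api.h -- a hypothetical, minimal GPP <-> network processor API. */
#include <stddef.h>
#include <stdint.h>

/* Load a microcode image onto the NP's packet engines and control them. */
int np_load_microcode(const void *image, size_t len);
int np_start_engines(void);
int np_stop_engines(void);

/* Keep fast-path state (e.g., forwarding tables) on the NP; the GPP
 * only issues updates across this narrow boundary. */
int np_route_add(uint32_t prefix, uint32_t mask, uint32_t next_hop);
int np_route_del(uint32_t prefix, uint32_t mask);

/* Exception PDUs the NP cannot handle surface to the GPP here, and
 * the GPP can inject responses back onto the wire. */
int np_rx_exception(void *buf, size_t buflen, size_t *pdulen);
int np_tx_inject(const void *buf, size_t len);
```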
Network equipment developers recognize that choosing network processors is only the first step. Only when they have delivered their complete software feature sets; integrated those sets into a total network processing platform capable of providing the necessary high-capacity switching throughput and the required system-level Quality of Service; achieved wire-speed performance for all of their functionality; and future-proofed it all with headroom and scalability, then, and only then, have they truly achieved time-to-market.
Conclusion
The time for network processors has truly arrived. The industry is ripe for combining hardware and software components into a high-performance, programmable solution, and the demand for speed and features shows no sign of ending. The strength of a network processor is that its programmable, off-the-shelf parts are created specifically for networking applications. Over time, network processors will almost certainly displace both general-purpose CPUs and ASICs: they hold a performance advantage over CPU architectures that were designed years ago for other tasks, and a flexibility advantage over ASICs.
The opportunity, however, extends beyond the standard benefits of off‐the‐shelf merchant silicon. Processors that form the foundation of complete communications platforms, based on a simple programming model, promise to radically improve the way networking technology is brought to market. This adds up to better product features,
faster time-to-market and better reliability for network equipment vendors and their customers. With their combination of high performance and software-programmable flexibility, network processors are indeed revolutionizing the way feature-rich, high-speed networking and communications products are brought to market today.
Opportunities for network processors even exist outside of networking. They may be suitable for other tasks that handle heavy streams of data; possibilities include RAID arrays in large servers and the video-switching equipment at cable-TV head-end plants. Only time will tell how far network processors spread into other areas.
To summarize:
• Network processors are developing very fast and are a hot research area.
• Multithreaded network processor architectures provide tremendous packet processing capability.
• Network processors can be applied across various network layers and applications.
• Hardwired performance, programmability of deep packet processing and differentiated solutions are some of the advantages that network processors offer.
• Software reuse and portability are further benefits; open software interfaces encourage third-party development, so interoperability and functionality will grow for network processors in the coming years.
The final winner in the network processor industry will be the company that can quickly deliver price-competitive products to meet market requirements; support all of the applications and interfaces of the value chain; and focus on superior customer service, supply chain management and marketing programs.