Data Guard Performance Issues
PERFORMANCE ISSUES OF DATA GUARD
Carel-Jan Engel, DBA!ert

Oracle Data Guard can protect your database from data loss, but at what cost? Apart from licensing, what does it consume in resources, and how can you manage that consumption? Will day-to-day work suffer from performance degradation when Data Guard is put in service? This paper will lead you through the parameters that govern Data Guard's performance and protection level. It discusses the network-related issues that influence the performance of your Data Guard configuration, and it helps you determine what protection level is within your reach, and at what (performance) cost.
Introduction

Whenever I am interviewing a DBA candidate, I always ask "What do you think is the most important role you will have as a DBA?" Some candidates squirm, others answer directly and with confidence, but more often than not (in my experience) they do not provide the answer I am looking for. My preferred answer is: having a solid recovery policy and a backup policy that supports it. You can argue that requirements vary from site to site, sure. But invariably the data in the database is business centric, and without access to that data the large majority of companies will struggle as an ongoing entity.

Having worked for various airports since 1994, providing High Availability became a pretty dominant factor in my daily DBA life. So it has been with tremendous glee and joy that I have watched the development of Data Guard from its infancy as Standby Database in 7.3, to a little-known shell script engine in 8i, to the very effective tool it is today. Many an Oracle widget vendor tried to ply me with the magic of OPS, but in those pre-RAC days my heart swung firmly toward the little buddy we'll, from now on, refer to as DG. And, to be honest, for High Availability I rank DG higher than RAC. The cluster may not contain a Single Point Of Failure, but the cluster itself IS a Single Point Of Failure. DG provides me with a redundant database as well, ruling out that last SPOF. Combine that with the growing requirements for data availability, remove the transactional overhead of similar solutions such as Replication, and Oracle was certainly onto a winner.

Oracle's software suite has blossomed emphatically over the past years; some strong products have emerged, and that trend looks set to continue with the likes of RAC, ASM, ADDM and others. DG, however, remains in the shadow of these products, but that shadow does nothing to weaken its strength as a pivotal role player in any strong backup and recovery solution. This paper will focus, rather surprisingly, on DG and DG performance. I know all too well its strengths from an HA/DR perspective, but am increasingly interested in its high-end performance and how my clients can reap the DG benefits. Thus, in a world of high performance systems and highly available applications, DG has definitely cemented itself as a solid player, has won me over several times, and I sincerely hope the words that follow will aid you in a decision to join the DG faithful (most of whom do not wear funny hats, nor attend odd conventions wearing strange clothes, but generally do enjoy a good glass of Scotch!).
Data Guard Purpose

DG protects your data by providing a suite of processes that move data from your production database to 1..n standby sites (where n is up to 9 with 9iR2). It is both an HA and a true disaster recovery solution. RAC provides an HA solution across nodes as well as linear scalability, yet does not provide true DR; that is where DG comes in, and combined with RAC it provides enticing levels of enterprise data protection. However, make no mistake: DG is a capable HA component in and of itself. 9i has introduced synchronous log transfer, which provides absolutely no data loss, as well as other transfer methods that keep your standby(s) in near real-time synchronization. Together with TAF (and TAF-aware applications) a failover can pass almost undetected by end users. Incomplete transactions will be rolled back, but connections can be transferred to the new primary automatically. Many systems simply do not require full availability, such that in case of a disaster a short outage of a couple of minutes is often acceptable.

DG, in its physical form, is a perfect solution for events that render a primary site unusable: consider any number of natural disasters, or even something like total power grid failure coupled with backup generator collapse. Or, during a power failure, one discovers that the generator can serve the computer hardware, but not the air conditioning. Despite the costly investment in a generator, you have to shut down the systems before everything gets overheated. These real-life stories happen, and if you're protecting critical enterprise data, you need data protection reflective of the data's importance.

DG provides a transactionally consistent "remote" image (it could be local, should you so choose, but I would not consider that an optimal decision) and provides options on the level of synchronicity. With features such as delays built into redo application and the capability to deploy multiple sites, you can not only survive natural disasters, but those associated with humanoid carbon units as well. Additionally, DG can be used to maintain uptime SLAs in the face of required node maintenance: SAN work that will affect the primary? No worries, seamlessly switch over (role reversal) to your standby and away you go. Memory upgrade? Likewise. Easy, effective and very, very realistic. Depending on your data protection requirements, DG can operate with little footprint on the primary site. Sort of a having-your-cake-and-eating-it-too type of situation. Negligible overhead but data protection *and* it all comes free with the Enterprise Edition [1].

[1] Servers must be licensed if the software is used (loaded into memory) more than 10 days per year. Because DG needs an active instance at the standby server, that server needs to be licensed. However, significant discounts have been seen negotiated, especially at the end of May.
Data Guard Benefits

Disaster Recovery
Natural disaster or common humanoid, a DG instance is there to protect you from the wild world we currently inhabit. Using the DELAY attribute of the LOG_ARCHIVE_DEST_n parameter, applying the redo information at the standby database can be postponed. When a dropped table or other logical error is detected before the delay has passed, it is likely that the missing information can be retrieved from the standby database. This doesn't harm the protection level, but comes at the cost of a somewhat larger outage at failover time.

High Availability
Both failover (disaster) and switchover (planned) options aid administrators in meeting enterprise-level service level agreements for data availability. Instead of long-running restore/recovery operations, data processing can continue almost instantly (within a couple of minutes in most cases) at the standby server. All this requires careful planning, of course, but that is no different from straightforward recovery techniques.

Guaranteed Data Protection
Synchronous redo transfer guarantees consistent standby data. For less demanding systems, various options for asynchronous redo transfer are available as well. As opposed to many hardware-based mirroring options, DG provides true synchronous data replication at transaction granularity, with far smaller bandwidth demands than storage mirroring needs.

Ease of Administration
Simple setup, simple administration, good data dictionary coverage, easy deployment. Of course all the commands needed are available through the good old SQL*Plus interface, but DG comes with a GUI interface as well, offering simple point-and-click setup. Personally I favour the command line interface. Knowing the commands behind the scenes helps in understanding the nuts and bolts of DG. This is quite important when dealing with High Availability: learning about the concepts at recovery time is bad timing.

Data Protection Flexibility
Three protection modes provide excellent data protection alternatives, varying from the Fort Knox-like 'maximum protection' mode to the lean-back 'maximum performance' mode. Maximum protection mode protects your transactions at all cost: it favours shutting down the database over completing a transaction in fewer than two (or more) databases. Maximum performance mode, in its most at-ease setup, takes care of forwarding the redo data to the standby at some point, preferably before the archived redo files get rounded up by some background cleaning process.

Light Footprint
Intelligent coupling of network capacity with synchronous log transfer means you can have consistent standbys without a heavy transactional penalty. These days servers come with several network interfaces, so reserving an interface for your DG traffic is not all that difficult. For many systems, however, throughput is not the bottleneck; it is the latency of the network that counts, especially when in synchronous replication mode.

Functionality Options
It is possible to open the standby database Read Only. The standby can then be used to take over the load of heavy reporting processes, hammering the database without harming the day-to-day data processing. Reports often do not need data up to the latest hour or day, so in such a situation DG can help reduce the primary's reporting load, releasing precious CPU/IO/memory bandwidth for more critical business processes. This, of course, comes at some cost: the data protection level will remain unchanged, but depending on the amount of redo data produced, recovery of the primary will take some time. So the outage of the system might be somewhat longer if you are struck by disaster whilst the standby database is in Read Only mode. But if configured correctly, all transactions are safe.

Logical Standby can help in reporting scenarios where preparing aggregated data proves useful for the performance of reports, but maintaining that aggregated data is not appropriate in the primary. Logical Standby was introduced in 9iR2. It has quite some restrictions, many, but not all, of which are lifted in 10g. From a performance point of view, Logical Standby requires enabling Supplemental Logging in your primary database. This generates extra redo, all of which needs to be written, copied and forwarded across disks and network. Apart from that, the Logical Standby receives the changes through genuine SQL statements, as opposed to the recovery mechanism of Physical Standby. This brings the cost of creating the SQL from standby redo logfiles and executing that SQL, including all the redo, undo and archiving created by this process. On top of that, the standby needs to run the reports.

Resource Maximization
I know of some sites considering abandoning tape backups. DG, geographically distributed, guarantees data availability, especially when two standbys are put in operation. When this is not an option for you, e.g. because of legal obligations, you can consider creating backups at the standby site. This has the advantage that your tapes are stored off-site right away. The previously mentioned possibility of moving reporting off to your standby also helps maximise hardware capacity utilisation.

It's Free
Don't fall over, I speak the truth. Almost. Watch the footnote. When you have an Enterprise Edition license, DG is yours without extra cost. When you use the standby server for QA purposes or even development, and you bought CPU-based licenses, deployment of DG is free [2].

[2] See footnote [1] and the Cost section.
Requirements

DG can be implemented on all supported Oracle platforms but has the following restrictions:

• You must be Enterprise Edition licensed
• You must be on the same Oracle release and operating system (and operating system architecture) on both ends
• You must be in ARCHIVELOG mode
• Although you do not have to have the same file system structure, it sure makes things easier if they are the same
• Setting FORCE_LOGGING is not a requirement, but, as above, is a jolly good idea unless you like surprises
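As a minimal sketch, meeting the ARCHIVELOG and FORCE_LOGGING requirements on the primary looks like this (run as SYSDBA; note that switching to ARCHIVELOG mode needs a bounce):

SQL> shutdown immediate
SQL> startup mount
SQL> alter database archivelog;
SQL> alter database force logging;
SQL> alter database open;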
Cost

DG comes at no extra charge if you are EE licensed. Licensing a more or less static (from an application perspective) node might not be particularly popular with your bean counters, but should be accepted as part and parcel of a robust DR strategy. That said, Oracle has been known to provide "aggressive" pricing for standbys. Your mileage may vary, of course.
Features

DG comes with a plethora of well-documented features; a brief listing of some of the highlights must include:

• Real managed recovery:
  SQL> recover managed standby database disconnect from session;
• Simple role reversal (switchover)
• Up to 9 standby destinations (LOG_ARCHIVE_DEST_1 for local, up to 9 more for remote)
• Cascading standbys
• Automatic gap detection
• The ARCHIVE_LAG_TARGET parameter (very nice indeed)
• RAC support
• Read-only and Logical Standby
• Three protection modes: maximum protection, maximum availability, maximum performance. I personally do not like these naming conventions, even though they are better than the previous ones. Mind you, I have no better names to suggest yet.
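For the record, switching between these modes is a single statement on the primary. A sketch (the chosen mode is illustrative; raising the mode to maximum protection requires the database to be mounted, not open, and the destination must meet the prerequisites for the mode, such as standby redo logs and LGWR SYNC transport):

SQL> alter database set standby database to maximize availability;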
There are a few parameters specific to performance though:

• Parallel recovery:
  SQL> recover managed standby database parallel {x} disconnect from session;

• LOG_ARCHIVE_DEST_n attributes:

  • LGWR/ARCH
    • Controls whether you ship redo in near real time via LGWR (SYNC/ASYNC) or at log switch via ARCH

  • AFFIRM/NOAFFIRM
    • AFFIRM when in SYNC mode means synchronous writes at your remote instance, potentially increasing commit response time whilst providing maximum data protection
    • If you place the database in maximum protection or maximum availability mode, the database will set this attribute to AFFIRM for redo log archiving destinations using LGWR

  • MAX_FAILURE/REOPEN
    • Controls retry attempts and the time between retries when network issues are preventing shipment to remote instances

  • NET_TIMEOUT
    • Only valid when using LGWR
    • Specifies how long LGWR will wait for the status of a network operation. If a response is not received within this period, LGWR will disconnect the connection to the standby. When in maximum protection mode, specifying a low value for this attribute can cause your database to be shut down by LGWR when the last remote archive destination vanishes

  • SYNC/ASYNC
    • Only applies to LGWR operations
    • The options are pretty self-explanatory
    • Defaults to SYNC=PARALLEL for LGWR
    • For ASYNC you can specify a number of blocks (e.g. ASYNC=512), which, in turn, specifies the size of the network buffer to be used. Larger values can partly compensate for slower networks.
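To make these attributes concrete, here is a hypothetical pair of destination settings combining them. The service name STBY and all values are illustrative assumptions, not recommendations:

SQL> alter system set log_archive_dest_2='SERVICE=STBY LGWR SYNC AFFIRM NET_TIMEOUT=30 REOPEN=60 MAX_FAILURE=10';
SQL> alter system set log_archive_dest_2='SERVICE=STBY LGWR ASYNC=512 NOAFFIRM REOPEN=300';

The first variant ships redo synchronously and waits for the standby write (AFFIRM); the second trades protection for speed, using a 512-block network buffer.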
Architecture

Log Transport Services
A rather fancy name for existing processes, nothing more. There are no actual "log transport services" in the literal sense; rather, LTS consists of the service (LGWR or ARCH) used to transport the redo stream to the Remote File Server (RFS) at the standby site(s).

Log Apply Services
Made up of the Remote File Server (RFS), Managed Recovery Process (MRP), Logical Standby Process (LSP) and ARCn processes. The RFS receives redo from either the LGWR or ARCH process on the primary and writes it to the standby archived redo log, standby redo log or archived redo log. In a Physical Standby, the Managed Recovery Process (MRP) applies archived redo to the standby immediately following a primary log switch. The same holds for a Logical Standby, the only variant being that the LSP process performs the redo application. Note that the MRP cannot apply redo while a Physical Standby is open in Read Only mode; it will catch up when the database is put back in Managed Recovery mode.
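As a quick way of watching these processes at work, the v$managed_standby view on the standby lists them. A sketch (the column list is abbreviated and the output will vary per configuration):

SQL> select process, status, sequence#, block# from v$managed_standby;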
The following diagrams depict two different configurations. The first shows a scenario in which LGWR is responsible for shipping redo to the standby site; the second shows a scenario in which ARCn is responsible. Notice the lack of standby redo logs when using the ARCn process.

[Figure 1: Redo forwarding by LGWR. The primary instance (online redo groups, LGWR, ARCn, archived redo logs) ships redo via Oracle Net, synchronously or asynchronously, to the standby instance, where RFS writes it into the standby redo logs, ARCn archives them, and MRP or LSP applies the redo.]
This picture shows the primary database on the left-hand side, having two online redo groups with two members each. Normally you wouldn't have fewer than three groups, but this works fine for the picture. LGWR forwards redo information either synchronously or asynchronously to the standby server. The Remote File Server process, abbreviated as RFS, runs on the receiving server, writing redo information into the standby redo log files. Standby redo log files are another type of redo log file, alongside online and archived redo log files. The documentation states there should be at least one more group of standby redo log files than online redo groups. Remember this and take care at database creation, because creating more space in the controlfile for extra redo log files cannot be done without bouncing the instance. Note: although the picture has two members per standby redo group, this is not strictly necessary. Both the online and standby redo groups should be equally sized, but the number of members per group may vary. Upon a log switch at the primary, the standby will also initiate a log switch, creating an archived logfile at the standby. The Managed Recovery Process will pick up the archived logfile and apply its contents to the database.
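A sketch of adding standby redo log groups; the group numbers and file names are illustrative, and the 25 MB size should match that of your online redo logs:

SQL> alter database add standby logfile group 4 ('/u01/oradata/stby/srl04.log') size 25m;
SQL> alter database add standby logfile group 5 ('/u01/oradata/stby/srl05.log') size 25m;
SQL> alter database add standby logfile group 6 ('/u01/oradata/stby/srl06.log') size 25m;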
[Figure 2: Redo forwarding by ARCn. The primary instance (online redo groups, LGWR, archived redo logs) lets ARCn ship archived redo via Oracle Net to the standby instance, where RFS writes it to the archived redo logs and MRP or LSP applies it; there are no standby redo logs.]

This picture shows how redo information is transported by the ARCn process to the standby at log switch time, which is pretty much how the old Hot Standby and 8i DG solutions worked. Note that there is a risk of losing a certain amount of transactions, depending on the size of your online redo log files: when the primary becomes totally unavailable, all unarchived redo information, including the corresponding transactions, is lost.
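A destination for the ARCn-based configuration of Figure 2 can be as simple as the line below; adding the DELAY attribute discussed earlier postpones redo application at the standby. The service name STBY and the 30-minute delay are illustrative assumptions:

SQL> alter system set log_archive_dest_2='SERVICE=STBY ARCH DELAY=30';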
Performance

DG will degrade the performance of your primary database one way or another. The good news is that most of us won't be hurt much by this degradation, as it is hardly noticeable. Actually, when I initially submitted the abstract of this paper for the UKOUG Conference 2004, I still had to put together and run the proper tests to get conclusive evidence of any DG performance issues. After I put them together, they didn't 'work': the deadline was approaching fast, and despite my best efforts I could see no performance effect on the primary database.

Configuration

Test hardware for the kind of tests I wanted to perform is hard to obtain. However, I was very happy that one of my customers (see Acknowledgements at the end of the paper) was so kind as to provide me two servers to perform the tests on. The hardware consisted of two IBM servers, each with one Intel Pentium 3.0 GHz HyperThreading CPU. Both systems were equipped with 2 GB of memory; storage consisted of two HDDs per system, one 10,000 RPM disk and one 15,000 RPM disk. Furthermore, both systems were equipped with a 1 Gb NIC as well as a 100 Mb NIC. Although this is not a high-end enterprise configuration, I see quite some systems running on comparable configurations. The OS was Red Hat AS; the database was Oracle 9.2.0.5. The db_cache_size was 16 MB.

Test harness

I put together a test harness, which enabled me to run several test scenarios unattended. Every test run initializes a database with the proper users, tablespaces etc., puts the proper init<SID>.ora files in place, instantiates the standby when appropriate, enables Statspack snapshotting every 5 minutes and restarts the database(s) for a clean start. Then the test itself runs, producing logfiles and statistics about test run performance. Upon test completion, the test schema and perfstat schema are exported, the contents of bdump/udump/cdump are saved and the database is cleaned up for the next run. By creating several scenarios, each in their own subdirectory, I was able to have the test harness execute them one after another, unattended. All I had to do after the run finished was collect all the statistics, import them into my test repository and compare the results.

OLTP tests

As I said, the good news was that it was hard to create so much load on the system that enabling Data Guard was noticeable. I started my DG performance tests by putting together an OLTP simulation. I used dbaman for this, a tool provided by James Morle. It basically consists of a tcl shell, extended with OCI calls. Together with some other programs built by James, it is possible to run a number of processes in a repeatable way, and piece together a good bit of logging information about the whole setup.

The data model of the test consists of the tables PRODUCTS, CUSTOMERS, ORDERS, ORDERLINES, INVOICES, INVOICELINES and PAYMENTS. All primary and foreign key constraints are in place, plus some extra 'performance indexes'. The customers table also contains an aggregated value, representing the total debt of the customer. This brings the need for frequent updating of the customers table during invoicing and payment processing. Just to create some extra redo. There are 4052 products and 4053 customers. Orders are assigned to customers randomly. The product_ids in the orderlines refer to products on a random basis. My OLTP simulation consisted of 100 sessions, running 5 different programs: 64 sessions performed order entry and 10 sessions performed order changing; order delivery was done by 16 sessions, invoicing by 6 sessions and payments by 4 sessions.

ORDERENTRY
The system started with order entry. Every order was inserted with a randomly chosen intended extension date and a randomly chosen delivery date. Initially the orders were inserted with a random number (between 1 and 10) of orderlines.

ORDEREXTENSION
The average interval between order entry and order extension was 3m18s. 25% of the orders got extended around the intended extension date, up to 18 orderlines. This caused some lookups in the products table for prices as well.

ORDERDELIVERY
After on average 17 minutes orders got delivered. This actually was just the update of the actual_delivery_date column of the orders table.

INVOICING
Delivered orders were ready for invoicing. The invoicing program created the invoice, generating an invoiceline for every orderline. The total amount of the invoice got added to the debt of the customer. Delivered orders got invoiced on average 10 seconds after delivery.

PAYMENTS
The invoices got paid in randomly chosen partial payments. Every payment was subtracted from the total debt in the customers table.

The OLTP simulation took some time before every process was active. I started with somewhat longer intervals between all the actions, and ran the process for an hour. Because testing several scenarios took too much time this way, I reduced the intervals and ran the tests for 30 minutes. Some results: in 30 minutes 42824 orders were created (23/second), with 258335 orderlines (143/second). 4329 orders got invoiced, 1301 of them got paid. 31 logswitches were performed; the logfile size was 25 MB. So, 775 MB of logging was created in 30 minutes, a rate of 1.5 GB of redo log per hour.

The 'problem' was that enabling DG didn't affect the results at all. Whether redo log forwarding was done synchronously or asynchronously, with or without the AFFIRM attribute, the throughput of the system remained the same. Running the test scripts for 30 minutes didn't change the figures by more than 0.1-0.3%. This was not enough difference to draw any conclusions. However, it is good news for most (potential) DG users, because it confirms that DG can keep up with quite a lot of transactions, even on (or thanks to) a suboptimal I/O system. When I get my hands on a configuration with better performing storage, I will repeat the tests.

Batch-oriented tests

Because I needed some more varied figures for this paper, I decided to include some batch-oriented tests as well. Furthermore, I decided to take the risky move of putting the online redo log files on a ramdisk for one of the runs, to simulate 'optimal' storage. The acceleration this should give the LGWR was expected to make the influence of DG more obvious.

NOTE ABOUT REDO LOG FILES ON RAMDISK
Take ultimate care when putting online redo log files on ramdisk. It may cause data loss when a failure occurs. For huge data loads and unresumable batches, which require you to revert to the last backup upon any failure, this can help you reduce the time for these runs, provided there is no other activity on the database. I've seen ramdisks available on Linux and AIX. On Solaris you can put your redo log files on the /tmp filesystem; this behaves like a ramdisk, with comparable performance gains. The same risks as with ramdisks exist. Finally, don't create multiple members in your redo log groups when you put them on ramdisk.

The batch test was much simpler than the OLTP test. It had 4 parallel sessions, each performing this script (through dbaman):

create table t as select * from all_objects;
delete from t;
rollback;
delete from t;
drop table t;
Instead of measuring the amount of work done in a predetermined amount of time, I measured the time each scenario needed to run this script 100 times per session. This resulted in 70 logswitches (+/- 1), i.e. 1750 MB of redo log. I created 12 scenarios. The first scenario was the baseline, without DG enabled. The next 10 scenarios used LGWR to forward the redo to the standby instance. The last scenario was based on redo forwarding by ARCH. The scenarios ran in approximately 13-14 minutes, resulting in approximately 1.9 MB of redo per second. The 12 scenarios were run 3 times, in 3 different configurations: redo on ramdisk using a 100 Mb network, redo on harddisk using a 100 Mb network, and redo on harddisk using a 1 Gb network. The tables below show the results of these runs:

Scenario  Protection mode    Redo forwarding by  SYNC/ASYNC  AFFIRM  (NO)PARALLEL  ASYNC buffer size  Duration (sec.)  Relative speed
0         NONE               -                   -           -       -             -                  685              1.00
1         MAX. PERFORMANCE   LGWR                SYNC        N       NOPARALLEL    -                  844              1.23
2         MAX. AVAILABILITY  LGWR                SYNC        Y       NOPARALLEL    -                  801              1.17
3         MAX. PROTECTION    LGWR                SYNC        Y       NOPARALLEL    -                  836              1.22
4         MAX. PERFORMANCE   LGWR                SYNC        N       PARALLEL      -                  826              1.21
5         MAX. AVAILABILITY  LGWR                SYNC        Y       PARALLEL      -                  832              1.21
6         MAX. PROTECTION    LGWR                SYNC        Y       PARALLEL      -                  841              1.23
7         MAX. PERFORMANCE   LGWR                ASYNC       -       -             50                 728              1.06
8         MAX. PERFORMANCE   LGWR                ASYNC       -       -             100                829              1.21
9         MAX. PERFORMANCE   LGWR                ASYNC       -       -             200                708              1.03
10        MAX. PERFORMANCE   LGWR                ASYNC       -       -             400                711              1.04
11        MAX. PERFORMANCE   ARCH                -           -       -             -                  720              1.05

Table 1: Batch runs, redo on ramdisk, 100 Mb network
The strangest observation in this run is scenario 2. Despite running with the AFFIRM option enabled in the system parameter LOG_ARCHIVE_DEST_2, this run is faster than scenario 1. Because the test environment is no longer available, I cannot explain this result, nor investigate further. I will try to repeat this test on new test configurations for sure.
Scenario  Protection mode    Redo forwarding by  SYNC/ASYNC  AFFIRM  (NO)PARALLEL  ASYNC buffer size  Duration (sec.)  Relative speed
0         NONE               -                   -           -       -             -                  780              1.00
1         MAX. PERFORMANCE   LGWR                SYNC        N       NOPARALLEL    -                  763              0.98
2         MAX. AVAILABILITY  LGWR                SYNC        Y       NOPARALLEL    -                  994              1.27
3         MAX. PROTECTION    LGWR                SYNC        Y       NOPARALLEL    -                  959              1.23
4         MAX. PERFORMANCE   LGWR                SYNC        N       PARALLEL      -                  944              1.21
5         MAX. AVAILABILITY  LGWR                SYNC        Y       PARALLEL      -                  929              1.19
6         MAX. PROTECTION    LGWR                SYNC        Y       PARALLEL      -                  950              1.22
7         MAX. PERFORMANCE   LGWR                ASYNC       -       -             50                 946              1.21
8         MAX. PERFORMANCE   LGWR                ASYNC       -       -             100                836              1.07
9         MAX. PERFORMANCE   LGWR                ASYNC       -       -             200                873              1.12
10        MAX. PERFORMANCE   LGWR                ASYNC       -       -             400                848              1.09
11        MAX. PERFORMANCE   ARCH                -           -       -             -                  921              1.18

Table 2: Batch runs, redo on HDD, 100 Mb network

Comparing scenario 0, the baseline, with Table 1 shows that putting the redo log files on ramdisk in the first run had a significant influence: that run was 13% faster. Take care: read the note about redo log files on ramdisk above. Given the baseline, the relative performance degradation does not differ significantly between Table 1 and Table 2, except for scenario 2. This needs to be investigated further.
Scenario  Protection mode    Redo forwarding by  SYNC/ASYNC  AFFIRM  (NO)PARALLEL  ASYNC buffer size  Duration (sec.)  Relative speed
0         NONE               -                   -           -       -             -                  818              1.00
1         MAX. PERFORMANCE   LGWR                SYNC        N       NOPARALLEL    -                  896              1.10
2         MAX. AVAILABILITY  LGWR                SYNC        Y       NOPARALLEL    -                  902              1.10
3         MAX. PROTECTION    LGWR                SYNC        Y       NOPARALLEL    -                  909              1.11
4         MAX. PERFORMANCE   LGWR                SYNC        N       PARALLEL      -                  873              1.07
5         MAX. AVAILABILITY  LGWR                SYNC        Y       PARALLEL      -                  861              1.05
6         MAX. PROTECTION    LGWR                SYNC        Y       PARALLEL      -                  817              1.00
7         MAX. PERFORMANCE   LGWR                ASYNC       -       -             50                 804              0.98
8         MAX. PERFORMANCE   LGWR                ASYNC       -       -             100                830              1.01
9         MAX. PERFORMANCE   LGWR                ASYNC       -       -             200                846              1.03
10        MAX. PERFORMANCE   LGWR                ASYNC       -       -             400                849              1.04
11        MAX. PERFORMANCE   ARCH                -           -       -             -                  875              1.07

Table 3: Batch runs, redo on HDD, 1 Gb network

Table 3 shows the results with a 1 Gb network connection. Especially when parallel log forwarding is enabled, this reduces the performance loss to single-digit percentages, as opposed to the 100 Mb network runs, where parallel redo log forwarding hardly had any effect at all. As mentioned before, the tests generate quite large amounts of redo log, up to 7.75 GB per hour. I haven't seen many systems doing this in real life. I have no explanation for the fact that the baseline in Table 3 is slower than the baseline in Table 2, given that the baseline script for both is the same.
Conclusion

The performance of my OLTP system emulation was not degraded with DG enabled. This matches what I observe in the real-life DG installations I have been involved with as well. Most systems suffer more from bad application design than from redo log forwarding for DG purposes. Of course, when bad application design results in loads of unnecessary redo, this will affect total performance. Performance of batch loads, however, was significantly degraded when redo log was forwarded by the LGWR process in synchronous mode. Asynchronous redo log forwarding performed better, though it is important to determine which buffer size provides the best result when in asynchronous mode. Because this buffer size can be changed dynamically, it is easy to play around with the settings to find that value. Pay specific attention to your network speed, throughput and latency in any configuration, as they play extremely important roles in overall DG performance. Implementing Data Guard will always be a trade-off between cost, yield and availability, but it can and does work in the real world without compromising performance-driven service level agreements.
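As an illustration of 'playing around' with the ASYNC buffer size mentioned above, the attribute can be changed on the fly with a single statement; the service name STBY and the block count are illustrative assumptions:

SQL> alter system set log_archive_dest_2='SERVICE=STBY LGWR ASYNC=400 NOAFFIRM';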
Epilogue

Investigating performance issues, root causes and solutions is a never-ending game. As such, this paper is not and never will be complete; as my investigation and testing continue, the paper will evolve. Any comments, additions, remarks or questions you might have are always welcome at cjpengel.dbalert@xs4all.nl.
Acknowledgements

I wish to thank Enovation Portal Technology of Capelle aan den IJssel, The Netherlands, for providing me with dedicated equipment and access to their systems. Without such access, the testing for this paper wouldn't have been possible. Running the several tests was made very easy by the availability of dbaman, the tcl-based shell wholeheartedly provided by James Morle, owner and founder of Scale Abilities Ltd, and author of the great book Scaling Oracle8i (about 8i, but upward compatible with newer versions). Last, but not least, I wish to thank Casey Dyke of Sydney, Australia, and Rachel Carmichael of New York, USA, for their cooperative support in discussing the contents of this paper, and for their invaluable input on the topic of grammar and language. Without their contributions this paper wouldn't have been half as readable as it is now.
About the author

Carel-Jan Engel (1960) lives and works in The Netherlands. He has been working in IT since 1982 and started to work with Oracle version 4 in 1985. In 1992 he founded the Dutch software company Ease Automation, which he headed for almost 10 years. Some of the main projects of Ease Automation were situated at airports. These projects had an important High Availability aspect and inspired him to develop several techniques for standby databases, often pushing Oracle technology to its limits. In 2002 he decided to continue his career as a freelancer. He is a member of the OakTable Network and a regular speaker at conferences, mostly about his favourite topic, Oracle Data Guard.