This document is a synopsis of the handout distributed at “Operation of Financial ALM/Risk Management Models and Application of Computing Grid Technologies,” a seminar held by Numerical Technologies on July 11, 2008 in Tokyo.
A new high-speed computing system that is much more lightweight and inexpensive than grid computing systems
Who Wins?
Let’s assume you trade in relative value arbitrage or statistical arbitrage.
The person on the left discovers combinations of independent stock pair trades and automatically executes transactions. Behind him is the faithful system specialist, who uses grid computing to calculate market distortions 100 times faster than an ordinary computer.
The person on the right in front of the 4×4 screens is a typical arbitrageur. Each of the screens shows the same grid computing environment as the one on the left, that is, each one computes 100 times faster than a normal computer. His system calculates the arbitrage opportunities and executes transactions in cross markets in real time — whether it be interest rates, equities or commodities — whatever he comes up with.
Who wins in the market? The trader on the right, of course!


Your exclusive HPC environment
This document introduces accelerator technology that increases the computing capacity of a desktop computer more than 100 times.
A single personal computer provides the grid environment.
Exercise: Find the European swaption price using the BGM model
Let’s look at a practical example
Even though we are stressing that the system is very fast, that alone may not be convincing enough without a practical exercise. Let’s take up the pricing of a European swaption based on BGM [1997](*) as an example of an interest rate derivative product that involves a huge amount of computation. We simply use the Monte Carlo technique.
If you are an exchange trader rather than an interest rate trader, please substitute the wording to suit your situation as you read the following description. An exchange trade calculation involves the same scale of calculation (or pricing model) as a Bermudan swaption and, in the worst case, also requires the Monte Carlo technique.
This example is intended to evaluate performance, so it is set up as a hands-on test that you can try yourself. To this end, we faithfully followed the implementation procedure described in “Modeling Derivatives in C++,” Justin London, John Wiley & Sons, p. 652.
- Find the LIBOR set using LFM
- The above equation is converted into a difference equation using the Euler method
- Hereafter, we calculate the payoff from LIBOR defined for each of N time-steps:
- Iterate the above process for M times (Monte Carlo method) to find the expected value:
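The four steps above can be sketched in C++ as follows. This is a minimal illustration under our own assumptions, not the code measured in this document and not the book's listing: it assumes a one-factor LFM with a single flat volatility, simulation under the spot LIBOR measure, one Euler step (applied to the logarithm of each rate) per accrual period, and a payer swaption. The function name `lfmPayerSwaption` and all parameter names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper (not NtMonteCarloLFM, not the book's code).
// Monte Carlo price of a European payer swaption in a one-factor LFM
// with flat volatility, simulated under the spot LIBOR measure.
//   f0     : initial forward rates; f0[k] applies to [k*tau, (k+1)*tau]
//   a      : expiry index, T_a = a*tau, with 1 <= a < f0.size()
//   strike : fixed rate K; the swap pays at T_{a+1}, ..., T_b, b = f0.size()
double lfmPayerSwaption(const std::vector<double>& f0, std::size_t a,
                        double strike, double sigma, double tau,
                        std::size_t paths, std::uint64_t seed) {
    const std::size_t b = f0.size();
    assert(a >= 1 && a < b && paths > 0 && tau > 0.0);
    std::mt19937_64 rng(seed);
    std::normal_distribution<double> gauss(0.0, 1.0);
    double sum = 0.0;
    for (std::size_t p = 0; p < paths; ++p) {
        std::vector<double> f(f0);
        double bank = 1.0;  // discretely rebalanced bank account (numeraire)
        for (std::size_t m = 0; m < a; ++m) {
            bank *= 1.0 + tau * f[m];     // f[m] is fixed at T_m, accrues to T_{m+1}
            const double z = gauss(rng);  // one factor: one shock drives all rates
            double drift = 0.0;
            for (std::size_t k = m + 1; k < b; ++k) {
                // Spot-measure drift: running sum over the rates still alive,
                // using values from the start of the step.
                drift += sigma * tau * f[k] / (1.0 + tau * f[k]);
                // Euler step on log f[k]; this keeps the rate positive.
                f[k] *= std::exp((sigma * drift - 0.5 * sigma * sigma) * tau
                                 + sigma * std::sqrt(tau) * z);
            }
        }
        // Payoff at expiry: annuity * max(swap rate - strike, 0).
        double discount = 1.0, annuity = 0.0;
        for (std::size_t k = a; k < b; ++k) {
            discount /= 1.0 + tau * f[k];  // P(T_a, T_{k+1})
            annuity += tau * discount;
        }
        const double swapRate = (1.0 - discount) / annuity;
        sum += annuity * std::max(swapRate - strike, 0.0) / bank;
    }
    return sum / static_cast<double>(paths);  // Monte Carlo expected value
}
```

With `sigma` set to zero the rates stay at their initial values, so the result collapses to the discounted intrinsic value. That gives a simple sanity check before running a large number of paths.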

Note that this book is full of errors and illogical statements. We corrected the obvious irregularities. We also removed inefficient implementation choices from the original program presented in the book, such as in-loop push_back operations, and optimized the code at the C++ language level before doing speed comparisons. Otherwise, the speedup would appear to be “1000 times,” and such a claim would undermine the reliability of the present case study. We had just one week to work on this problem because the essential hardware, which will be described later, did not ship until just before this document was written. Frankly speaking, some bugs may remain, but they should not have any effect on the conclusion.
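To illustrate the kind of C++-level optimization mentioned above, the sketch below fills a buffer that is sized once, instead of growing it with push_back inside the loop. It is an illustrative example only. The function `makePath` and its parameters are hypothetical and do not come from the book or from our benchmark code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative only: fills a buffer of n values.
// Sizing the vector once avoids the repeated reallocation and copying
// that push_back can trigger when it is called inside a hot loop.
std::vector<double> makePath(std::size_t n, double start, double step) {
    assert(n > 0);
    std::vector<double> path(n);  // single allocation up front
    double value = start;
    for (std::size_t i = 0; i < n; ++i) {
        path[i] = value;          // plain indexed store, no capacity growth
        value += step;
    }
    return path;
}
```

Calling `reserve(n)` before a push_back loop is another common way to avoid the reallocations.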
Formulate a convenient Excel add-in application
Do you like seeing the “Program Start-up Procedure” prompt while trading? Absolutely not!
The world is filled with difficult-to-use systems. Are you willing to ask someone else to start up a grid computing system for you each time you want to set a price? Just say no to such environments. Instead, just start Excel side by side with the Bloomberg on your desktop. That should be enough, shouldn’t it?
In our example, the European swaption model was implemented as an Excel add-in function, and the swaption price was calculated by defining some calibrated parameters. All you need to do is enter the functions into the cells of your Excel spreadsheet as usual. The NtMonteCarloLFM function then completes 1 million Monte Carlo iterations in less than one second. This is a 175-fold performance increase relative to a fully optimized C++ program (in a comparison with our company’s own products).
European Swaption Calculator for Excel (Engineering Sample)
With 175 times faster computational capacity on hand, high volume calculations are performed with ease
An accurate price is obtainable
If a sufficiently high-speed Monte Carlo method can be used, you need not use an analytical solution based on complex and questionable hypotheses. The quant resources allocated to the development of high-speed solutions can be used for other tasks.
A convergence improvement measure (the antithetic variates method in our example) was used as well. Look at this enhanced convergence performance!
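The antithetic idea mentioned above can be sketched as follows: each normal draw z is paired with its mirror image -z, and the two payoffs are averaged before they enter the estimate. This is a generic illustration and not the implementation inside NtMonteCarloLFM. The function name `antitheticMean` and its parameters are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>

// Hypothetical helper: estimates E[payoff(Z)] for standard normal Z
// using antithetic pairs (z, -z).
template <typename Payoff>
double antitheticMean(Payoff payoff, std::size_t pairs, std::uint64_t seed) {
    assert(pairs > 0);
    std::mt19937_64 rng(seed);
    std::normal_distribution<double> gauss(0.0, 1.0);
    double sum = 0.0;
    for (std::size_t i = 0; i < pairs; ++i) {
        const double z = gauss(rng);
        // Average the payoff over the draw and its mirror image.
        sum += 0.5 * (payoff(z) + payoff(-z));
    }
    return sum / static_cast<double>(pairs);
}
```

The benefit depends on the payoff. It is largest when the payoff is close to monotonic in z, and the variance is never worse than that of the same number of independent pairs when the payoff is monotonic.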
Sensitivity analyses and difference methods are also performed easily
Various analyses are possible
The NtMonteCarloLFM functions entered into the cells can be copied to other cells in the spreadsheet. This HPC is as easy to use as ordinary Excel functions.
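A sensitivity analysis by the difference method amounts to repricing with a bumped input. The sketch below shows a central difference around a generic pricing function. The name `centralDifference` and the `price` callback are hypothetical stand-ins, and the add-in's actual interface is not shown in this document.

```cpp
#include <cassert>
#include <functional>

// Hypothetical helper: central-difference sensitivity of a pricing
// function with respect to one input x, using a bump of size `bump`.
double centralDifference(const std::function<double(double)>& price,
                         double x, double bump) {
    assert(bump > 0.0);
    return (price(x + bump) - price(x - bump)) / (2.0 * bump);
}
```

When `price` is itself a Monte Carlo estimate, it is usually advisable to use the same random seed for both evaluations so that sampling noise largely cancels in the difference.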
The accelerator is the secret of high-speed operations
NVIDIA GTX280
In our example, we used two NVIDIA GTX280 accelerator cards that were released in June 2008 and inserted them into expansion slots in a PC.
These accelerator cards use a technology called the General Purpose Graphics Processing Unit, or GPGPU. ATI also sells GPGPU devices under the RV770 product name. Accelerator technology is not limited to GPGPUs: accelerator products based on other technologies, such as Cell/B.E. and ClearSpeed, are also on the market. Intel has announced a related product, code-named Larrabee, which will be released in the 2008 to 2009 period. GPGPUs have recently been a hot topic in the HPC market.
In our example system we selected the NVIDIA GPGPU card for testing because the product was less expensive and easily available in the market. We had no other particular reasons for our choice, and we would like to try other options proactively.

Results of Benchmark Testing
Comparisons were made with reference to the time required for calculations performed on a PC equipped with an Intel Core2 Quad Q9550 2.83 GHz processor, using one core of the CPU without a GPGPU.
| Monte Carlo iterations and precision type | Without GPGPU (double-precision in all cases) | GPGPUx1 | GPGPUx2 |
|---|---|---|---|
| 1 million iterations, double-precision | 90.618 sec | 4.043 sec (22 times faster) | 2.094 sec (43 times faster) |
| 1 million iterations, single-precision | 90.618 sec | 1.017 sec (89 times faster) | 0.596 sec (152 times faster) |
| 10 million iterations, single-precision | 904.376 sec | 10.223 sec (88 times faster) | 5.165 sec (175 times faster) |
- Compared with the single-core performance of 760.044 sec for 10 million Monte Carlo iterations performed separately on an Intel Xeon X5460 3.16 GHz processor, the highest speed processor currently available in the market, our system was 147 times faster. Since this is not a comparison with SMP, to compare the results with 4-, 8- or 16-core processors, simply divide the value by 4, 8 or 16 to estimate the relative difference appropriately.
- Double-precision was used for all non-GPGPU timings. On the Intel-based CPUs, single-precision would run slower than double-precision, so single-precision CPU timings would not allow a correct comparison under the assumed actual operating environments.
- The time resolution of the clock function of PCs is low. There is also the effect of thread scheduling by the operating system. For these reasons, time measurements of less than one second inevitably incur errors. System startup overhead also limits the measured speedup, as Amdahl’s Law predicts. Nevertheless, the performance improvement was linear, as shown by the results for 10 million Monte Carlo iterations. Broadly speaking, as far as our example is concerned, a performance enhancement of some 88 times per GPGPU was achieved. Externally mountable NVIDIA x4 HPC units are scheduled to be launched this fall. At roughly 88 times per GPGPU, such an x4 unit would be up to about 350 times faster.
- Slow double-precision operations have been reported to be a weak point of the GTX280. In terms of FLOPS performance, double-precision is only one-eighth of the single-precision performance. Nevertheless, the results of the above measurements indicate a performance ratio of about 1:3.5, and as such, the GTX280 is performing better than expected. This is because in financial calculations many integer operations are involved, even when double-precision is used, and the characteristic slow double-precision operations are hidden when 32-bit integer calculations are a substantially large part of the total operations.
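The “times faster” figures in the table and the core-count adjustment described in the notes above are simple ratios, which the following sketch makes explicit. The function names are hypothetical. Dividing by the number of cores assumes ideal linear scaling on the CPU side, which is a best case for the CPU and only a rough estimate.

```cpp
#include <cassert>

// Speedup as the ratio of baseline time to accelerated time.
double speedup(double baselineSeconds, double acceleratedSeconds) {
    assert(baselineSeconds > 0.0 && acceleratedSeconds > 0.0);
    return baselineSeconds / acceleratedSeconds;
}

// Rough estimate of the speedup against an n-core CPU, assuming the
// CPU code scales perfectly across its cores (an optimistic assumption).
double speedupVersusCores(double singleCoreSpeedup, int cores) {
    assert(cores > 0);
    return singleCoreSpeedup / cores;
}
```

For example, `speedup(904.376, 5.165)` reproduces the figure of about 175 for two GPGPUs at 10 million iterations, and `speedupVersusCores(speedup(760.044, 5.165), 4)` gives roughly 37 against an ideally scaled 4-core Xeon X5460.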
Low Construction Cost
Exceedingly low hardware setup costs compared with grid computing
[Precautions to note for anyone trying to duplicate our setup]
We assembled our PC from parts because the GTX280 requires a large-capacity power supply and a new-generation PCI bus (PCI Express Gen.2). To use two GTX280 cards, as in our hand-made unit, the power supply capacity must be much higher than usual. It is dangerous to blindly use PCs with non-conforming specs. To use reliable products in your company, switch to the GPGPU version (TESLA) of the GTX280 and select a high-end PC equipped with a large-capacity power supply.
The most difficult task of all: programming
Accelerator programming is difficult
After selecting the accelerator type you wish to use, you need to understand the entire system, beginning with the hardware. Without this basic but overall understanding, you won’t even be able to start up the system. Parallel processing on a grid system is of course difficult, but I believe an accelerator system is more difficult. An average programmer will likely be unable to develop such a system.
Outsourced offshore programmers will not be able to adequately exploit the system’s performance.
Financial organizations typically invest in very good hardware, but their poor software development strategies spoil their investment in such high-end systems.
All in all, if you need top-notch development personnel, you had better look for them within financial organizations. Very capable people can be recruited, in particular, from among engineers and expert quants.

This means that you, the reader of this document, are the first candidate for a programming position.
Unfortunately, currently available accelerator systems are only good for small-scale development
| | Grid system | Accelerator system |
|---|---|---|
| Versatility | Various programming models can be selected. | Current-generation systems assume SIMD (limited MIMD). Good at iteration processing but poor at volume data processing. The current-generation accelerators accept only small programs. |
| Reliability | Track records do exist. However, both engineers and applications are underdeveloped and many poor systems are being used in financial HPC compared with academic HPC. | Just started to develop. Check the results with alternative means to avoid risk. No way to know which systems would survive. |
| Cost | Very expensive | Exceedingly low |
| Development risk | Financial HPC is mostly underdeveloped compared with academic HPC. | Development risk is low because the scale of development cannot become too large. But, |
- Using accelerator systems for pricing independent products or for small-scale risk management:
An accelerator system is the first choice even when the life of the product may last only one or two years. This is because the cost of an accelerator system is exceedingly low. In addition, under the same conditions as our example, an accelerator with its high-speed connection may perform better than a grid, in which communication between nodes can cause delays.
- Using accelerator systems for risk management of large-scale portfolios requiring a large volume of code resources:
An accelerator system is not suitable for use in a large-scale development project because of its complex programming model. In addition, no one knows which accelerator product will survive in the future. You may launch a large-scale, long-term development project involving accelerators, only to find that the relevant products are no longer available in the market by the time you are scheduled to complete the project. The reliability of the derived numerical values is also low (low IEEE754 compliance level). The high-speed memory of a current-generation accelerator is as small as that of an 8-bit PC (only 16 KB of shared memory in the NVIDIA GTX280). The capacity of the low-speed global memory is similar to that of a 32-bit PC (1 GB in the GTX280). For these reasons, the even slower data transfer between the CPU and the accelerator (8 GB/s one-way with PCI Express Gen.2 x16) becomes a bottleneck, making it difficult to process volume transactions that require a 64-bit operating system. We therefore have no alternative other than to select a grid system at present.
- A hybrid system as used in academic HPC may be useful in cases where a double Monte Carlo technique is involved.
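To put the transfer bottleneck described above into numbers, the following sketch estimates the time needed to move a block of data across the bus at a given bandwidth. It is back-of-the-envelope arithmetic only. The function name is hypothetical, and the estimate ignores latency and protocol overhead, so real transfers will be slower.

```cpp
#include <cassert>

// Idealized one-way transfer time: data size divided by bandwidth.
// Ignores latency and protocol overhead (an optimistic assumption).
double transferSeconds(double gigabytes, double gigabytesPerSecond) {
    assert(gigabytes >= 0.0 && gigabytesPerSecond > 0.0);
    return gigabytes / gigabytesPerSecond;
}
```

At the 8 GB/s quoted for PCI Express Gen.2 x16, filling the 1 GB global memory of a GTX280 once takes at least `transferSeconds(1.0, 8.0)`, or 0.125 seconds. A data set many times larger than the card memory must pay this cost repeatedly.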
Keeping up with the latest HPC technologies
Conflict of interest with vendors
Grid computing provides hardware vendors with a business opportunity to market a large number of blade servers. Vendors can suffer great losses if a customer switches from a grid to an accelerator system. You should not expect your hardware vendor to propose an accelerator system unless your financial organization insists. If you leave all the decision-making to consultants working for hardware makers or integrators, the purchaser, that is, the financial organization, is likely to get burned.
Information gathering at HPC conferences
Held around November every year, SC is one of the major HPC conferences. There have been many conference presentations on financial HPC recently. Financial organizations should send their staff to gather information. They should not readily accept meaningless offers for grid computing systems from incompetent vendors or integrators, but should prepare to follow the technical trends by themselves.
Otherwise, personnel in your organization will not be able to rise above the level of those mediocre people who keep saying, “I’m wondering which to select, Red Hat or Windows,” or “It’s all right now, we’ve introduced Platform Middleware.” Without expertise and thorough knowledge of low-latency broadband networking, topology, parallel I/O, parallel programming models and other HPC-specific topics, your huge investments in systems will simply be wasted. This also applies to technologies that should be used selectively for the right jobs, including accelerator technology.

- The copyright of this document belongs to Numerical Technologies Incorporated.
- Product names and company names appearing in this document are generally the trademarks or registered trademarks of their respective owners.
- Numerical Technologies Incorporated does not warrant the completeness or correctness of this document. In no case will Numerical Technologies Inc. be held responsible for ordinary direct, indirect, consequential, accidental, special or punitive damage arising from or related to this document, even if Numerical Technologies Inc. receives notice to the effect that Numerical Technologies may be liable for compensation for any damages.










