Testing a Range of Stochastic Values.
At this point, it is impossible to avoid some amount of optimization. We are hopeful that the underlying premise of the method is good, and we would like to vary the calculation period as well as the entry and exit parameters to see if we can increase the number of trades, as well as the returns. The pattern of these relationships is predictable, so we might be on safe ground. For example, as we make the entry thresholds farther apart, we will get fewer trades, the profits should be bigger (if we keep the exit at the same level), and the risk should get larger because we're holding the trade longer.
If we hold the entry levels constant and make the exit points closer, then we should increase the number of trades, reduce the size of the profit, and decrease the overall risk because the trades will be held for a shorter time.
We will also vary the calculation period for the stochastic indicator. As that value gets larger, the indicator will reach its extremes less often, generating fewer trading signals. A shorter calculation period will produce more signals. We would like to have as many signals as possible but recognize that a higher signal frequency also means smaller price moves and consequently smaller potential profits. There are always trade-offs that must be balanced. In addition, we know from Chapter 2 that when we look at shorter time intervals, we see more price noise. We expect that for this mean-reverting method, shorter calculation periods should favor our trading strategy.
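As a reference for these tests, the sketch below shows a raw stochastic calculation in Python with pandas. This is our own illustration, assuming the standard raw stochastic formula (the position of the close within the high-low range of the lookback window); the function name and library choice are not from the original tests.

```python
import pandas as pd

def raw_stochastic(close: pd.Series, high: pd.Series, low: pd.Series,
                   period: int) -> pd.Series:
    """Raw stochastic: where today's close sits within the high-low
    range of the past `period` days, scaled from 0 to 100."""
    lowest = low.rolling(period).min()
    highest = high.rolling(period).max()
    return 100 * (close - lowest) / (highest - lowest)
```

Shortening `period` makes the indicator reach its extremes more often, which is exactly the trade-off discussed above.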
Because we are testing different combinations and must be concerned about overfitting, we will watch carefully to see that the pattern of results has the shape that we expect and that the numbers do not jump around.
Using In-Sample and Out-of-Sample Data.
If we were working in completely uncharted waters, that is, developing a strategy from a new concept, we would want to divide the data into in-sample and out-of-sample partitions. We would then test all of the new concepts on the in-sample data until we were satisfied with the rules and the test results. Finally, we would run our best rules and parameters through the unseen, out-of-sample data. We expect the results to be worse than the in-sample performance because there are many more patterns that we could not have anticipated. Still, if the ratio of return to risk of the in-sample test was 2.0 and the out-of-sample test was 1.2, we would consider that a success. It would be ideal if the out-of-sample data performed the same as the in-sample data, but that rarely happens because there is always some degree of overfitting, even if it is unconscious and unintended.
However, if the out-of-sample test is a complete failure, yielding a ratio near zero, then the method is also a failure. You cannot review the new results, find the problem area, and fix it, because that is feedback. You no longer have true out-of-sample data, and there is good reason to believe that your improvements are simply overfitting the data more and will result in trading losses.
Specifying the Tests.
Our primary measurement statistic is the information ratio (annualized return divided by annualized volatility), and the results are shown in Table 3.5. The overbought entry levels are varied from stochastic values of 50 to 30 and the corresponding exit levels from 10 points under the entry to zero. Rather than test only the pair AMR-CAL, we will show the average of all tests for the six combinations of the four airline stocks: LCC-CAL.
LCC-AMR.
LCC-LUV.
AMR-CAL.
AMR-LUV.
CAL-LUV.
We take this approach because the choice of parameters should work for all the pairs, not just for AMR-CAL. When we look at average results of all tests, we won't be able to see how the individual pairs performed, but we will know if one set of thresholds is better than others when applied to all markets. This will prevent us from looking too closely at the detail. We also expect to get smoother results by averaging the ratios for each pair.
Which Parameter First?
There are three parameters to test: the stochastic calculation period, the entry threshold, and the exit threshold. The general rule is to test first the parameter that has the most effect on performance. That seems to be the stochastic calculation period. We should expect that longer periods (larger values) will generate fewer trades. The entry threshold will also be a major factor in determining the number of trades: The greater the threshold, the fewer the trades. The exit threshold will have only a small effect on the frequency of trading. If we exit sooner, then there is a chance that prices will reverse and allow us to enter again, but that should happen much less often. The order of testing will then be:
Stochastic calculation period.
Entry threshold (with exit set to zero).
Exit threshold.
To begin, we need to pick some reasonable values, which will be a calculation period of 20, an entry of 60, and an exit of zero. We expect that the best calculation period will be less than 20 and the entry may be less than 60. Exiting at zero is normal, but exiting a short at 10 might be safer. Table 3.5 shows the results of these tests beginning in January 2000; however, some of these stocks start later due to mergers.
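A sketch of how such a test sequence might be organized is shown below. The `backtest` callback, the pair list, and the exact grid of calculation periods are our own placeholder assumptions; they stand in for the full simulation that produced Table 3.5.

```python
from statistics import mean
from typing import Callable, Sequence

def average_ratio(backtest: Callable[[str, int, int, int], float],
                  pairs: Sequence[str],
                  period: int, entry: int, exit_level: int) -> float:
    """Average the information ratio over all pairs for one parameter set.
    `backtest` is a placeholder for the full pairs simulation; it returns
    the information ratio for a single pair."""
    return mean(backtest(p, period, entry, exit_level) for p in pairs)

# Step 1 of the test order: vary only the calculation period while
# holding entry = 60 and exit = 0 (hypothetical grid, with wider
# spacing at the slow end as discussed in the text):
# for n in (20, 18, 16, 14, 12, 10, 9, 8, 7, 6):
#     print(n, average_ratio(backtest_pair, all_six_pairs, n, 60, 0))
```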
TABLE 3.5 Initial tests of four airline stocks, six pairs.
The ratio will be the key statistic for determining success or failure. In Table 3.5, the ratio increases in a reasonably orderly way as the momentum calculation period declines from 20 to 6, as shown in Figure 3.6. The calculation periods are shown along the bottom and the ratios along the left scale. Periods of 10 and lower are clearly better, with 7 the best. The period 7 also had the highest profits per share, which is critical to success. We could have tested all calculation periods, but the difference between 20 and 19 days (a change of 5%) would not be as significant as the difference between 7 and 8 (a change of 14%), so we've skipped some values at the high end and included all of them at the lower end.
FIGURE 3.6 Information ratios for tests of momentum calculation periods.
We expect that everyone would choose the period 7, not just because it has the highest ratio but because it falls in the middle of the profitable set of tests. It is best to avoid the value 6 because faster trades are likely to have smaller profits per trade. The 10-day test may be better because of the profits per trade, but readers will need to perform these tests themselves to verify the results, and they can make other choices at that time. You can use these results to convince yourself that this is a viable approach to trading, but you can never simply accept someone else's work without verifying it yourself.
TABLE 3.6 Airline pairs with entry threshold of 50.
TABLE 3.7 Airline pairs with entry threshold of 40.
The greatest concern is the average number of trades. Because U.S. Airways (LCC) started trading in October 2005, all combinations using LCC will be more than 5 years shorter than the other pairs. If we consider all pairs trading for the full 10 years, an average of 36 trades is only 3.6 trades per year. That may not be enough to hold our attention. One way to increase the number of trades is to lower the entry threshold below 60; however, by lowering the threshold, we will also expose ourselves to greater risk because we will enter more trades before they reach their extremes. Tables 3.6 and 3.7 show the results of lowering the threshold to 50 and 40. The averages show that the number of trades increases along with all of the other statistics (the annualized rate of return, per share return, and information ratio), but the entry threshold of 50 is noticeably better than the threshold of 40. For the threshold of 50, more of the individual pairs were profitable (see the far right column) than with the entry threshold of 40. Calculation periods of 10 and lower are still best, and 7 is again the peak performer.
The number of trades has increased by making the threshold lower, but the profits per share are, on average, at only $0.082, which is below what we believe is a safe margin of error, given execution costs. We would consider testing the exit level of 10, compared to zero, to assure us that we exit the trade more often. But because the profits per share are marginally small, exiting sooner would reduce those profits, and it would be unlikely that these stocks would generate a net profit. One answer is to look at the volatility of the market for each individual stock and trade only when the volatility is relatively high. That will reduce the number of trades but should increase the profits per trade. It will probably add risk because there are fewer trades and less diversification, and risk is always associated with higher volatility. But low volatility isn't an option if it doesn't produce sufficient profits.
Before looking at volatility, let's inspect the returns of the individual pairs. Up to now, we have looked at the average of the tests, which is a good way to avoid overfitting. But we need to understand the profits per trade. Table 3.8 shows that results are significantly skewed, with the first two pairs, U.S. Airways (LCC)-Continental (CAL) and U.S. Airways-American (AMR), posting very large per share returns, and all other pairs posting returns that are below what we would consider sufficient for netting a profit. Still, all pairs are profitable, which can be seen as a good start.
TABLE 3.8 Results of individual pairs for airlines, momentum period 7, entry threshold 50, exit threshold 0.
If we go back to the original test that used a 60 entry threshold, we expect the profits per trade to increase, although there would be fewer trades. Table 3.9 shows that results are as expected. The per share results go up on average, and the LCC-LUV pair increases from 4.1 cents to 12.5 cents, enough to produce a real profit. There are some differences in the results of the first two pairs, LCC-CAL and LCC-AMR, due to better entries or fewer trades, but the gains of those two pairs hold up nicely. The number of trades drops predictably, as do the net profits. Three of the six pairs are tradable.
TABLE 3.9 Results of individual pairs for airlines, momentum period 7, entry threshold 60, exit threshold 0.
One last approach is to visually inspect the individual net asset value (NAV) streams. In Figure 3.7a, these results are messy. If we look closer at the more recent U.S. Airways pairs in Figure 3.7b, the returns are much more orderly.
FIGURE 3.7 NAVs for (a) all airline pairs and (b) airline pairs using U.S. Airways (LCC) as one leg.
Are These Results Robust?
Now we come to the difficult part, deciding whether these results are robust. If they are, then we can comfortably trade these pairs. The answer, such as it is, comes partly from a more philosophical view of this process.
On the positive side, the idea of trading distortions between two fundamentally related stocks is a basic and believable concept. We used a stochastic indicator to measure the relative momentum of the two stocks and then found those points where they diverged. This was simply the difference between the two stochastic indicator values. The larger the difference that determined the entry threshold, the fewer the trades and the larger the profits per trade. That is all according to expectations. We exit when the stochastic values come back together.
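The rules just described reduce to a few lines of logic. The sketch below is our own rendering of them, not code from the original tests; it assumes stochastic values on a 0-100 scale and a state variable holding the current position.

```python
def pair_signal(stoch1: float, stoch2: float, position: int,
                entry: float = 60.0, exit_level: float = 0.0) -> int:
    """One-bar update of the pairs position from the two stochastics.
    Returns +1 (buy leg 1 / sell leg 2), -1 (sell leg 1 / buy leg 2),
    or 0 (flat)."""
    diff = stoch1 - stoch2
    if position == 0:
        if diff >= entry:
            return -1   # leg 1 overbought relative to leg 2
        if diff <= -entry:
            return +1   # leg 1 oversold relative to leg 2
        return 0
    # exit when the stochastic values come back together
    if position == -1 and diff <= exit_level:
        return 0
    if position == +1 and diff >= -exit_level:
        return 0
    return position
```

Setting `exit_level` to 10 rather than zero exits the trade sooner, matching the earlier comment that exiting a short at 10 might be safer.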
When we run a set of tests, varying the calculation period of the stochastic, we get more trades for shorter holding periods. Again, this is very normal and conceptually correct if we are trying to emphasize the price noise. The results are continuous in terms of the number of trades, profits per trade, and information ratio.
On the negative side, we have clearly tested combinations of parameters. If we test enough, then some are very likely to be profitable, but statistics tell us that a small number of profitable results within a larger set of tests do not have predictive qualities. There are also not as many trades over the test period as we would like, but that may be the normal outcome of highly correlated stocks that don't diverge often, rather than just spurious price moves. And some of the results show very small net profits and even some losses.
One way to determine robustness is to consider the percentage of profitable results over all tests. In other words, if we used a reasonable range of calculation periods for the momentum indicator and reasonable entry and exit thresholds, and we found the percentage of profitable tests, then a large percentage would tell us that this method is sound, even though some returns were small and others large. It would remove the possibility that this method worked for only a narrow set of conditions. We find this a strong measurement of robustness.
Another confirmation of robustness is to apply this exact method on other sectors with similar fundamental relationships. If the results were similar, then we would be more confident and, at the same time, have additional pairs to trade that would provide valuable diversification.
For now, we can say that there is nothing wrong with the current results, but they are not sufficient to draw a conclusion. We would also prefer pairs that had more trades.
TARGET VOLATILITY.
Before moving on, notice that the standard deviation of returns in Table 3.7 is 12% for all pairs. That is called the target volatility. To compare the returns of different pairs, we need to make the risk equal for each of the pairs' NAV streams. We use 12% annualized volatility as the industry standard.
There are a number of steps needed to equalize the risk of all the pairs that will be traded. This has the consequence of maximizing diversification by avoiding the arbitrary allocation of more or less of the investment risk to any one pair. The first step was to volatility-adjust the two legs so that each stock in the pair had the same risk exposure. The exact way of doing that was given in the section "Different Position Sizes." The next step is to equalize the risk of each pair relative to each other. To do that, scale the number of shares traded in each pair to a level that represents a target volatility, in this case, 12%.
A 12% target volatility is where the annualized standard deviation of the daily returns is equal to 0.12. To get to that number after the fact, based on all data, follow these steps:
Record the daily net profits and losses of both legs of the pairs trade.
Find the standard deviation of the entire series of profits and losses.
Multiply that standard deviation by the square root of 252 in order to annualize.
The investment size necessary to trade pair i with 12% volatility is

$$\text{Investment}_i = \frac{\sigma_i \sqrt{252}}{V_T}$$

where $\sigma_i$ is the standard deviation of the daily dollar returns of pair i and $V_T$ is the target volatility. Therefore, if the standard deviation of daily returns is $100, the annualized volatility is $1,587. For a target volatility of 12%, we would need an investment of $1,587/0.12 = $13,229 for that pair.
Create a NAV series by applying the normal formula for the compounded rate of return, but divide each daily profit or loss (expressed in dollars) by the investment size calculated in the last step.
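Putting those steps together, a minimal sketch in Python (our own illustration; it assumes the daily combined dollar profits and losses of a pair are already in a pandas Series):

```python
import math
import pandas as pd

def nav_at_target_vol(daily_pl: pd.Series,
                      target_vol: float = 0.12) -> pd.Series:
    """Size the investment so annualized volatility equals the target,
    then compound the daily returns into a NAV series starting at 100."""
    ann_dollar_vol = daily_pl.std() * math.sqrt(252)   # annualized $ volatility
    investment = ann_dollar_vol / target_vol           # e.g., $1,587 / 0.12 = $13,229
    returns = daily_pl / investment                    # daily percentage returns
    return 100 * (1 + returns).cumprod()               # compounded NAV
```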
As seen in Table 3.9, the risk for all of the pairs is shown as 12% (the column headed Std) so that the annualized returns and other statistics can be compared on an equal footing. The investment in each pair will be different, depending on its volatility.
Combining Pairs into a Portfolio.
This method of adjusting to a common volatility works well for comparing the performance of individual pairs, but does not work for combining the pairs into a portfolio. For that we need to have the same performance volatility for each pair relative to the same investment size. The steps are nearly the same.
Choose an arbitrary investment or an actual one. If the amount is $100,000 and there are six pairs, then each pair gets 1/6, or $16,666.
Divide the dollar value of the daily returns by the investment size to get the percentage returns each day.
Find the annualized volatility of the returns: the standard deviation times the square root of 252.
Divide the target volatility, for example, 12%, by the annualized volatility of this pair, giving us the adjustment factor, AFi.
Multiply all position sizes for the ith pair by AFi.
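A sketch of those five steps, again assuming the daily dollar profits and losses of one pair are in a pandas Series (our own illustration):

```python
import math
import pandas as pd

def adjustment_factor(daily_pl: pd.Series, n_pairs: int,
                      total_investment: float = 100_000.0,
                      target_vol: float = 0.12) -> float:
    """AF_i: the factor by which all position sizes of pair i are
    multiplied so it runs at the target volatility on its equal
    share of the total investment."""
    allocation = total_investment / n_pairs            # e.g., $100,000 / 6 = $16,666
    daily_returns = daily_pl / allocation              # percentage returns
    ann_vol = daily_returns.std() * math.sqrt(252)     # annualized volatility
    return target_vol / ann_vol                        # multiply positions by AF_i
```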
This process adjusts all position sizes for all pairs to create the same risk for each trade. We also know that the volatility of the portfolio of pairs will be less than the target volatility due to diversification. Again, we would want to increase our position sizes to bring the volatility back up to the target level, but there may not be excess money to allow that. This will be discussed later in this chapter under "Benefiting from Pseudo-Leverage," but it will be most important in the chapters concerned with futures markets.
Note that a target volatility of 12% means that there is a 16% chance that we will lose more than 12% and a 2.5% chance that we will lose more than 24% over a one-year period. The target volatility is simply one standard deviation of the returns; therefore, it has the same properties as a normal distribution. The comfort range for most traders can be as high as 17% volatility; for fund managers, it may be as low as 6%.
One of the major limitations in trading stocks is that just because we want a target volatility of 12% doesn't mean that the system performance will permit that much leverage. In the stock market, we are limited by having to pay the full share price, which is not the case with futures or options. Near the end of this chapter, we will look again at what happens when the total cost of buying and selling shares exceeds the maximum investment that we determined in advance was needed to achieve a target volatility of 12%.
Filtering Volatility.
Volatility has been discussed in different ways up to now. Rather than try to pick buy and sell thresholds based on absolute price differences between two stocks, we decided to use the difference between two stochastic values to normalize volatility and make the buy and sell levels adaptive. We also used volatility to determine the size of the positions traded in the two legs. By adjusting the position size, we prevented the returns of one stock from overwhelming the other when one of the stocks had much greater volatility. We now need to address the relationship between profits and volatility. It seems reasonable that there is a point where price volatility is too low to produce a profit. On the other side of that question, we might want to know if there is a point where volatility is so high that the risk does not compensate for the returns.
Alternative Methods for Measuring Volatility.
The original reason for finding the volatility was for adjusting the position sizes of the two legs. In the extreme, what if CAL was much more volatile than AMR, or if CAL was trading at $150 and AMR at $10? Then the success of every trade would depend entirely on the success of CAL because both its profits and its losses would overwhelm AMR, if trading an equal number of shares of each. During this test period, that situation is not a problem because the prices of both stocks were similar, although CAL was twice the price of AMR during short periods in the fourth quarter of 2008. We know that, under normal circumstances, there is a direct relationship between price and volatility; therefore, CAL should be significantly more volatile than AMR most of the time. We should expect to trade more shares of AMR than CAL to adjust for that volatility. Our solution needs to be general because other pairs may not trade near the same price. In an earlier section, "Different Position Sizes," we introduced the use of the average true range.
The traditional way of measuring annualized volatility, V, uses the standard deviation of the 1-period returns r over the past n days, multiplied by the square root of the number of data points in a year:

$$V = \sigma(r)\sqrt{252}$$

For monthly returns, we would multiply by $\sqrt{12}$. However, using only the closing price differences when the high and low are also available has been shown to give inferior results. Instead, we'll use the average true range, ATR, measured over the same n days. For any day, the true range, TR, is the largest of the high minus the low, the high minus the previous close, and the previous close minus the low:

$$TR_t = \max(H_t - L_t,\; H_t - C_{t-1},\; C_{t-1} - L_t)$$

The ATR is the average of the past n values of TR.
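Both measures are straightforward to compute. The sketch below is a Python illustration of the two definitions above, using pandas (our own code, not from the original tests):

```python
import math
import pandas as pd

def annualized_vol(close: pd.Series, n: int) -> pd.Series:
    """Traditional measure: stdev of 1-day returns over n days, annualized."""
    returns = close.pct_change()
    return returns.rolling(n).std() * math.sqrt(252)

def atr(high: pd.Series, low: pd.Series, close: pd.Series, n: int) -> pd.Series:
    """Average true range over n days, in dollars per share."""
    prev_close = close.shift(1)
    tr = pd.concat([high - low, high - prev_close, prev_close - low],
                   axis=1).max(axis=1)
    return tr.rolling(n).mean()
```

With either measure, the volatility-adjusted position of leg 2 is the leg 1 position times the ratio of the volatilities, vol1/vol2, so the more volatile leg trades fewer shares.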
When we calculate the volatility based on the standard deviation during the last three quarters of 2008, we find that AMR has a higher volatility than CAL, even though it is trading at a much lower price. Of course, it is possible that prices jump around more if one stock is in the news more than the other. It is also likely that stocks at very low price levels, and in the news, make moves that are proportionately greater than stocks trading at higher prices. Regardless, the volatility numbers generated by the traditional standard deviation method do not seem intuitively correct. Results are shown in Table 3.10. For the first trade, we set the AMR position to 100 shares; then the CAL position is 100 × (0.903/0.748) = 121 shares. The position size of CAL is larger because its volatility is lower.
TABLE 3.10 AMR-CAL continuous results during 2008 using annualized standard deviation to determine position size.
TABLE 3.11 AMR-CAL continuous results during 2008 using average true range to determine position size.
In recalculating volatility using the average true range, we find the opposite relationship: CAL had significantly higher volatility than AMR. Where the ATR of AMR was $0.768 in the first trade, the ATR of CAL was $1.558, twice the amount. This is a remarkable difference, considering both measurements were based on the same time period. The only reasonable explanation is that the intraday volatility of CAL was very high, but it tended to close nearer to its previous close. That is, the closing price volatility did not reflect the wide price swings that occurred during the trading day. When we think of volatility, we must consider the intraday price swings. It is an interesting lesson in volatility, but it may be more pronounced for shorter calculation periods. But it seems likely that if the average true range is twice the value of the close-to-close price changes, the annualized volatility would also be quite different. This brings to mind that there are many risk calculations based on this traditional volatility measurement, and they may yield results that are far too low. Underestimating risk can have serious repercussions. Options traders will also know that implied volatility is calculated using the same standard deviation method. Is that value also too low, and does that present a trading opportunity?
Returning to our position size calculation, to normalize the volatility using the ATR, we calculated the CAL position as 100 × (0.768/1.558) = 49. The smaller position reflects the higher volatility. These results are shown in Table 3.11. Based on the annualized standard deviation method, the sum of the four trades was $529 and $0.61 per share. Using the ATR, the sum was $478 with $0.78 per share. It's difficult to tell from four trades which method will produce better results, but we'll stay with the average true range as a more intuitively robust measurement.
Volatility Threshold Filter.
Having settled on a way to measure volatility, we now need to decide if performance is dependent on the volatility level, that is, are we more likely to be successful during periods of high or low volatility? Again, we'll use the ATR as our measure of volatility because we believe it is more robust and intuitively easier to explain. We'll revert to our original example, AMR and CAL, using a 7-period momentum indicator, but this time we will reduce the entry threshold to 40 to increase the number of trades. This pair was a modest performer because the ratio was not high, but if a volatility filter works, the results should be improved.
FIGURE 3.8 Surface chart showing the relationship between trade PL and the volatility of the two stocks, measured using ATR, at the time of trade entry.
Low Volatility Filter.
The first step is to show how the profits for any trade are dependent, if at all, on the volatility of either stock at the time the trade is entered. We recorded the entry ATR values, the final trade PL, and the profits per share and created the three-dimensional surface chart shown in Figure 3.8. This chart may look a bit bumpy at first, but closer inspection will show a hump near the middle and lower points around the outside. It's the outside that is of particular interest. In the left corner, where the volatilities of both legs (vol1 and vol2) are at low values, the chart makes its steepest dip. There may be other areas, but for now we are interested in only this one. We can also see that, as volatility increases, there is a sharp drop along the right front of the chart. We may also be able to remove trades with very high volatility, which translates into risk, but we'll save that for the next section.
The vol1 and vol2 values used in Figure 3.8 are in dollars per share. It turns out that both AMR and CAL had volatility ranges that were nearly the same. If we were to use those values to define our threshold and we wanted to isolate the lower left corner, then we would need to say: Do not enter any trade if the AMR volatility is less than $1.50 and the CAL volatility is less than $0.80, based on a 7-day average true range.
It doesn't take much analysis to realize that we are seriously overfitting the data by looking at this much detail. We can try to make this more general by reverting to the traditional annualized standard deviation measurement. Although we prefer the average true range, that measurement is very specific to each stock, while we can express the annualized standard deviation as a single percentage value that makes sense for every stock price.
For this filter to be triggered, we require that both legs be less than the volatility threshold. If that occurs at the time of entry, then we do not enter that trade. In addition, we wait until that trade would have been exited before we enter another. The reasoning for the entry rule is that low volatility is expected to produce low per share returns. We did not look at the case where only one leg was under the low volatility threshold.
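As a sketch, the entry test reduces to the following (our own illustration; the waiting rule, staying out until the filtered trade would have exited, is handled by the surrounding trade loop):

```python
def allow_entry(ann_vol1: float, ann_vol2: float,
                threshold: float) -> bool:
    """Low volatility filter: block a new entry only when BOTH legs'
    annualized volatilities (e.g., 0.30 for 30%) are below the threshold."""
    return not (ann_vol1 < threshold and ann_vol2 < threshold)
```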
Table 3.12 and Figure 3.9 show a sequence of tests from low to high annualized volatility, beginning at 10%, which did not have any effect, and ending at 50%, which reduced trades from 98 to 60. The reason that the annualized volatility reaches such high levels so often is that the annualizing process uses only seven days of data. During a very volatile few days, the annualization of those price moves will look unusually high and cannot be sustained for long.
TABLE 3.12 Results of low volatility entry filter on AMR x CAL show some improvement but only when 27 trades are removed, 28% of all trades.
FIGURE 3.9 Results of low volatility filter applied to AMR x CAL shows that removing trades at low volatility increases both returns and per share profits through 40% annualized volatility.
The low volatility filter was able to increase the profits per share only from 9.4 cents to 12.8 cents. That may be enough to net a profit after costs, but it also removed 27 trades to achieve that gain.
High Distortion Filter.
A more interesting filter is one that recognizes that large differences in the volatility of the two legs of a pair cannot easily be resolved by volatility adjusting, even if it makes sense in theory. For example, if we were to trade a pair where leg 1 traded at $100 and leg 2 at $2, we would expect to trade a very large amount of leg 2 for each share of leg 1. Then, if the $2 share were to jump, it might gain 25% or 50% in one day, while that size move would be nearly impossible for a $100 per share stock. Perhaps this can be recognized using annualized volatility; however, we have decided that the ATR is a better measure. Even using annualized volatility, there is still a large risk of imbalance.
An easy way to recognize that two legs are not compatible is to take the ratio of the shares needed to be traded after applying volatility adjusting. We'll call this the distortion ratio. If leg 2 needs to trade 100 shares to offset the volatility of 10 shares in leg 1, we get a distortion ratio of 0.10, calculated as

$$DR = \min\left(\frac{\text{shares}_1}{\text{shares}_2},\; \frac{\text{shares}_2}{\text{shares}_1}\right)$$

By using the min function, we get the minimum of the two share ratios, regardless of whether leg 1 or leg 2 is bigger. Table 3.13 shows the results of a small set of tests using the distortion ratio. When the ratio is 0.4, then all trades with share ratios more extreme than 40 to 100 are filtered. At a ratio of 0.5, all trades with share ratios more extreme than 50 to 100 are filtered. The best results come from a ratio of 0.60, which means the shares must be no more extreme than 60 to 100. This essentially puts the restriction on the two legs that they have reasonably similar volatility, that is, the volatility of one leg is less than twice the volatility of the other leg.
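In code, the filter is only a few lines (a sketch under the definitions above; the function names are ours):

```python
def distortion_ratio(shares1: float, shares2: float) -> float:
    """Minimum of the two share ratios: 10 shares vs. 100 shares -> 0.10."""
    return min(shares1 / shares2, shares2 / shares1)

def allow_trade(shares1: float, shares2: float,
                min_ratio: float = 0.60) -> bool:
    """Block trades whose volatility-adjusted share sizes are more
    extreme than min_ratio (0.60 = no worse than 60 to 100)."""
    return distortion_ratio(shares1, shares2) >= min_ratio
```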
TABLE 3.13 AMR x CAL filtered with the high distortion ratio.
FIGURE 3.10 Comparison of filters applied to AMR x CAL shows that the high-distortion filter was best.
The results of these two filters, the low volatility filter and the distortion filter, can be seen in Figure 3.10. The NAVs of the original case (No Filters) are very similar to the low volatility returns. The high-distortion filter does much better, not only avoiding a sharp drawdown but generally selecting better trades. This results in a final NAV above 180, much higher than the previous peak at 160, while the other two NAV streams are only able to return to their previous highs. One concern is that the distortion filter is removing trades that are only modestly unbalanced, rather than extremes. That may be another sign of overfitting. Because this method improved results at every level that was more extreme (where 0.1 is most extreme), the general solution that works on all pairs may not be as good as this example.
What We Have Learned from Airline Pairs.
We would like to conclude that our premise was correct: low volatility is not good for this strategy, and two stocks that are out of balance with regard to volatility present unwanted risk and unstable expectations.
We could fiddle around further with the parameters and volatility filters and perhaps find a combination that produces enough profit and enough trades to be worth trading, but that is clearly going in the direction of overfitting. Some might question whether we have already overfit the problem, but the concept has been kept fairly clean. So far, we have:
Identified a pairs trading opportunity based on a fundamental market relationship.
Used a relative value measurement, the stochastic, to identify when two fundamentally related stocks have diverged.
Isolated the entry point based on normal trade-offs, that is, greater potential but fewer trades if we entered at larger extremes.
Exited the trade when the relative differences approach or are equal to zero. We did not consider holding the trade looking for a reverse distortion.
Adjusted the position sizes to the same volatility using the average true range, which we considered a more robust measurement than the annualized standard deviation of the daily changes.
Applied the same rules and parameters to each pair in our test sector.