I've combined my interest in the predictability of financial variables (the time series of stocks and indices), mathematical programming, complexity and self-organizing systems, and the graphical display of information into an exercise over the past week examining the behavior of a time-series statistical measure called the Hurst exponent. But, first, a little background.
Benoit Mandelbrot was one of the most interesting thinkers of the 20th century. He passed away in 2010, but his work spanned over fifty years and a broad range of topics. He is best known for coining the term 'fractal' and for describing a two-dimensional object called the Mandelbrot set. There is much to be said about Mandelbrot, but for my purposes we need to focus on his financial analysis and his book, "The (Mis)Behavior of Markets".
Mandelbrot was doing quantitative financial analysis in the 1960s, and tried a simple experiment: he applied well-known statistical techniques to historical price data of a traded commodity (cotton). His thoughts are at once controversial and widely accepted. You have to read the book to see the rich interplay of ideas and the fertile tension between Mandelbrot's mathematical approach to stock price statistics and the various 'accepted' theories.
I focus here on one idea Mandelbrot had begun to develop in the final decade of his life. Among the many statistical quirks of market prices -- be they the cotton prices Mandelbrot wrote about in a controversial 1963 paper or prices of IBM stock -- is their long-range dependence. That is, when you measure the autocorrelation of the stock price time series, you find that the price today is heavily dependent on the price over the past several days. Mandelbrot was beginning to explore a measure of this long-range dependency (actually, months and years instead of days) called the Hurst exponent. I won't reproduce the math here, but the Hurst exponent is a parameter in a 'data generating function'. The data generating function (DGF) is a mathematical relationship that a theoretician proposes as a surrogate for the processes that generate a time series. It is often the basis of quantitative analysis of the open market in traded stocks (or currencies, commodities, government bonds, tulips, etc.).
The ultimate goal of analyzing any time series is to determine the underlying data generating function. It often occurs that we never know the actual function, but that our proposed function provides a close enough imitation that we can do predictions. These predictions might be about a range of values for future prices, or they may be about risk: the probability that a stock will fall below a particular value.
Long-term dependency is, itself, an indicator of the propensity of the system to go through large, unexpected 'phase-change' type shifts. That is, long-term dependency helps us define risk. The concept was first observed by the civil engineers who watch the flow of rivers and must determine the proper capacity for dams and other flood control measures. Hurst himself was an engineer designing such systems for the Nile in the 1950s.
One does not calculate the actual Hurst exponent of a time series. This is because, as I stated before, we don't know the real data generating function. But, one can estimate the Hurst exponent. There are several formulas for this. Hurst used an estimator called the "rescaled range" or R/S. More recently, researchers working with a synthetic data set (in which we know the DGF) have published a more accurate estimator, the rescaled variance or V/S.
One would think that the wide array of quantitative tools provided in the various computer languages would mean that somebody would have coded up the R/S and V/S estimators in Java, R, or Python. One would be wrong. There is an R/S estimator as part of an importable module in R, but there doesn't seem to be much discussion of its behavior or even validation of the algorithm. And nobody seems to have extended it to the V/S estimator.
So I wrote one. I created, in Python, a module that calculates both an R/S and a V/S estimator given a time series. This includes a rescaling process that segments the time series, calculates the statistic on the segments at several scales, and reports H as the slope of a log-log regression of the statistic against segment length. I'll spare you the math, but you'll have to take my word for the fact that this is how the literature describes the proper way to estimate H.
The Hurst exponent measures long-range dependency, and varies between 0 and 1. A random, white-noise time series has no long-term dependency, and its H will be around 0.5. Highly dependent streams of data will have H approaching 1.0. Some odd time series are anti-correlated -- very sharp and volatile -- and will have an H approaching 0.
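For readers who want to see the shape of the calculation, here is a minimal sketch of a textbook R/S estimate. It is not the module described above: the function names, the doubling window sizes, and the non-overlapping segmentation are all assumptions made for illustration, and it leaves out the V/S estimator and any small-sample bias correction.

```python
import numpy as np

def rescaled_range(segment):
    """R/S of one segment: range of cumulative deviations over the std dev."""
    x = np.asarray(segment, dtype=float)
    cumulative = np.cumsum(x - x.mean())      # running sum of deviations from the mean
    r = cumulative.max() - cumulative.min()   # the range, R
    s = x.std()                               # the scale, S
    return r / s if s > 0 else np.nan

def hurst_rs(series, min_window=8):
    """Estimate H as the slope of log(mean R/S) against log(window length)."""
    x = np.asarray(series, dtype=float)
    n = len(x)
    log_w, log_rs = [], []
    w = min_window
    while w <= n // 2:                        # window sizes double: 8, 16, 32, ...
        # split the series into non-overlapping segments of length w
        values = [rescaled_range(x[i:i + w]) for i in range(0, n - w + 1, w)]
        values = [v for v in values if np.isfinite(v) and v > 0]
        if values:
            log_w.append(np.log(w))
            log_rs.append(np.log(np.mean(values)))
        w *= 2
    slope, _intercept = np.polyfit(log_w, log_rs, 1)
    return slope

# Synthetic series for comparison (fixed seed so runs are repeatable).
rng = np.random.default_rng(0)
white = rng.standard_normal(4096)             # no long-range dependence
walk = np.cumsum(white)                       # highly persistent
anti = np.diff(white)                         # anti-correlated

h_white, h_walk, h_anti = hurst_rs(white), hurst_rs(walk), hurst_rs(anti)
```

On series like these, the estimate for the random walk should come out well above the one for white noise, and the estimate for the differenced noise well below it. R/S is known to be biased upward on short windows, so the white-noise value tends to land somewhat above 0.5.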
That's what I did in December 2012. Over the past week I downloaded the past three years of data from all 30 DJIA stocks, and explored the value of H. But, before I describe my results, I need to get the non-financial analyst on the same page as financial analysts.
Actual closing prices of a stock like Alcoa or IBM are highly correlated. There's no real surprise here: there is an underlying value in a company and future profits fall within a pretty well-defined and stable range. The Hurst exponent of the prices is well above 0.9, and for many of the DJIA 30 stocks is near 0.99. Most of that is uninteresting.
Stock market quantitative analysts actually focus on 'returns'. These can be calculated in a variety of ways. Assume that we have a price at time t, called P(t), and a time difference d. The simple return over that interval can be defined as R(d) = (P(t + d) - P(t))/P(t). Simple returns compound multiplicatively, so they grow roughly exponentially as d gets larger. For this and other reasons, stock market analysts work with the logarithm of the gross return, 1 + R(d) = P(t + d)/P(t). The log return is then r(d) = log(1 + R(d)) = log(P(t + d)/P(t)) = log P(t + d) - log P(t), which is zero when the price is unchanged (as log(1) = 0) and which adds up across consecutive intervals.
That sets it up. In my next post, I'll show the results of H using R/S and V/S for the daily and weekly DJIA averages and the 30 industrials.
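A short sketch of the two return definitions, using made-up prices rather than real market data. The function names are illustrative only.

```python
import numpy as np

def simple_returns(prices, d=1):
    """R(d) = (P(t + d) - P(t)) / P(t) for every t where both prices exist."""
    p = np.asarray(prices, dtype=float)
    return (p[d:] - p[:-d]) / p[:-d]

def log_returns(prices, d=1):
    """log P(t + d) - log P(t), i.e. the log of the gross return P(t + d)/P(t)."""
    log_p = np.log(np.asarray(prices, dtype=float))
    return log_p[d:] - log_p[:-d]

# Hypothetical closing prices, for illustration only.
prices = [100.0, 102.0, 101.0, 105.0]
simple = simple_returns(prices)
logged = log_returns(prices)

# Log returns are additive: the daily values sum to the whole-period log return.
total = logged.sum()
```

The additivity is the practical payoff: a weekly log return is just the sum of the daily ones, which is not true of simple returns.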
This is a blog about the theory of complex adaptive systems. How do we recognize them, work with them, analyze them, moderate them, and--hopefully, some day--predict their behavior?
Thursday, January 17, 2013
Wednesday, January 02, 2013
An Agent-Based Model of Insurgency
Back in 2007, I created a model in NetLogo that reproduces the research presented in "Modeling civil violence: An agent-based computational approach" by Santa Fe researcher Joshua M. Epstein. The full citation is:
Epstein, J. M. “Modeling Civil Violence: An Agent-based Computational Approach.” Proceedings of the National Academy of Sciences 99, no. 90003 (May 7, 2002): 7243–7250. doi:10.1073/pnas.092080199
Below is a screen shot of my model (click to enlarge). It postulates a toroidal 'world' in which there are civilians and cops. The civilians can become active revolutionaries (red) or remain passive (blue). The cops patrol the landscape and arrest the active revolutionaries, one per turn. The plots on the screen show the number of citizens that are active, and the number that are in jail.
In this model I can vary many of the parameters that were discussed in the 2002 article. But, for the purposes of my research, I have also put in a selector to change the activation pattern. I have been exploring two different activation methods. Epstein used random activation: agents were chosen at random to execute their methods: move and act (cops would arrest, citizens would choose whether they would be active). An additional, typical activation scheme is "uniform", in which all agents get one turn, but their sequence is shuffled each turn. These two methods are analogous to sampling with and without replacement, respectively.
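The difference between the two schemes is easy to see in a few lines of Python. This is only a sketch, not the NetLogo code: the function names are made up, and it assumes that one 'tick' of random activation means as many draws as there are agents.

```python
import random

def random_activation(agents, rng):
    """Sampling WITH replacement: in one tick an agent may act twice or not at all."""
    return [rng.choice(agents) for _ in agents]

def uniform_activation(agents, rng):
    """Sampling WITHOUT replacement: every agent acts exactly once, in shuffled order."""
    order = list(agents)
    rng.shuffle(order)
    return order

rng = random.Random(42)                # fixed seed so the example is repeatable
agents = list(range(10))               # stand-ins for citizens and cops

random_order = random_activation(agents, rng)
uniform_order = uniform_activation(agents, rng)
```

Under random activation some agents are usually skipped in a given tick while others act more than once. Under uniform activation the set of actors is the same every tick and only the order changes.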
My next problem is: how do I characterize the output? Epstein used the 'pulses' of revolt that appear in the model. So, in order to quantify these pulses I needed to build a 'pulse detector'. This is simply a counting algorithm (written in Python) that creates a sequence of data points for each pulse: the time since the last pulse and the height of the pulse.
This, in turn, requires a definition of a pulse. Epstein arbitrarily chose a value of 50. That is, a 'revolt' event occurs in his data when over fifty agents have converted to 'active' status. The revolt (or pulse) ends when that number drops below 50.
I needed something a little less arbitrary, that could be used for a variety of 'revolt' time series across a broad parameter space. Thus, I chose a threshold of one standard deviation. That is, once the number of 'active' agents goes above the average number of actives plus the standard deviation of the number of actives, there would be a 'pulse'. I used the standard deviation of the whole sample, so this required a completion of the run in order to define a pulse. It could not be computed real-time.
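A minimal sketch of such a pulse detector follows. It is not my original script. It assumes that 'time since the last pulse' is measured from the start of one pulse to the start of the next, and that a pulse ends as soon as the count falls back to or below the threshold.

```python
import numpy as np

def detect_pulses(active_counts):
    """Return (threshold, inter-arrival gaps, peak heights) for a finished run."""
    counts = np.asarray(active_counts, dtype=float)
    # Whole-sample mean plus one standard deviation, so the run must be complete.
    threshold = counts.mean() + counts.std()

    pulses = []                        # (start_tick, peak_height) per pulse
    in_pulse = False
    for tick, value in enumerate(counts):
        if value > threshold:
            if not in_pulse:           # a new pulse begins
                in_pulse, start, peak = True, tick, value
            else:                      # still inside the pulse: track its maximum
                peak = max(peak, value)
        elif in_pulse:                 # count fell back: the pulse is over
            pulses.append((start, peak))
            in_pulse = False
    if in_pulse:                       # the run ended in the middle of a pulse
        pulses.append((start, peak))

    starts = [s for s, _ in pulses]
    gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
    heights = [p for _, p in pulses]
    return threshold, gaps, heights

# Tiny hand-made series of 'active' counts, for illustration only.
series = [0, 0, 0, 10, 12, 0, 0, 0, 0, 0, 20, 0, 0, 0, 0, 0]
threshold, gaps, heights = detect_pulses(series)
```

For this toy series the threshold works out to roughly 8.4, so the detector finds two pulses: one starting at tick 3 with a peak of 12 and one starting at tick 10 with a peak of 20.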
A distribution of these inter-arrival times is shown in the first histogram. This is a model run using random activation, and executed for 25,208 'ticks'. The majority of peaks happen after a wait of less than 25 ticks. But, in one instance, a gap of over 200 ticks occurred between peaks. I found that this distribution held for all activation methods and for all the parameter changes I instituted.
Additionally, a similar distribution can be found for the maxima of the peaks. The second histogram shows the distribution of peak heights -- the largest number of active citizens in each peak -- for the same large run. Most revolts involved fewer than 10 individuals, while a very few exceeded 60. The largest was over 80.
These two output values -- inter-arrival average and average peak height -- represent the 'model behavior parameters' that can be further examined. Thus, with the output quantified, I can then proceed to evaluate the impact of activation schemes on these values.
I also examined which input parameters could be varied to change the averages of the inter-arrival times and the revolt peaks. The candidates were: citizen vision, cop vision, threshold (for activation), maximum jail term, and a constant, k, that is supposed to create a plausible arrest probability. Citizen vision appears to be one input parameter that makes a difference, so I began with this. Citizens will make several decisions based on what they see. They will choose to become active, in part, based on whether there are 'cops' within their vision, and based upon how many other citizens are already active within their vision. Epstein reported results for a citizen vision of 7.0. I found that interesting results occurred as vision is increased. (Note that, in my explorations, 'cop' vision remains fixed at 7.0. I did not find that varying 'cop' vision changed the output very much. People just got swept into jail more efficiently.)
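For reference, here is how the citizen's decision uses those inputs, as I read the rule in Epstein's paper: grievance is hardship times (1 - legitimacy), the estimated arrest probability is 1 - exp(-k * cops/actives) counted within the citizen's vision, and the citizen goes active when grievance minus risk-weighted arrest probability exceeds the threshold. The Python below is a sketch and not my NetLogo code. The function and argument names are made up, the defaults k = 2.3 and threshold = 0.1 are assumed from the paper's baseline runs, and the actives count is assumed to include the deciding citizen, so it is never zero.

```python
import math

def arrest_probability(cops_in_vision, actives_in_vision, k=2.3):
    """Estimated arrest probability: 1 - exp(-k * cops/actives), within vision."""
    # Guard against division by zero; the count is assumed to include the
    # deciding citizen, so it should be at least 1 anyway.
    ratio = cops_in_vision / max(actives_in_vision, 1)
    return 1.0 - math.exp(-k * ratio)

def becomes_active(hardship, legitimacy, risk_aversion,
                   cops_in_vision, actives_in_vision,
                   threshold=0.1, k=2.3):
    """Go active when grievance minus net risk exceeds the threshold."""
    grievance = hardship * (1.0 - legitimacy)
    net_risk = risk_aversion * arrest_probability(cops_in_vision,
                                                  actives_in_vision, k)
    return grievance - net_risk > threshold

# With one cop and one active in view, k = 2.3 gives a probability near 0.9.
p_one_on_one = arrest_probability(1, 1)

# An aggrieved citizen with no cops in sight rebels; with cops close by, not.
rebels_unwatched = becomes_active(0.9, 0.1, 0.5, cops_in_vision=0, actives_in_vision=3)
rebels_watched = becomes_active(0.9, 0.1, 0.9, cops_in_vision=3, actives_in_vision=1)
```

This also shows why citizen vision matters: a wider vision changes both counts in the cops-to-actives ratio, and therefore the perceived risk.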
The results -- the change in inter-arrival time and peak height as a function of different citizen visions -- are shown in the next two scatter plots.
There is, to be sure, quite a variance in the outcomes. But, it can be seen that peak height is highly dependent on activation at all levels of citizen vision, and inter-arrival times of revolts seem to be dependent on activation at higher levels of citizen vision. For random activation, the inter-arrival times seem to become chaotic at a citizen vision of 8.6. Runs at this setting (citizen vision = 8.6, random activation) result in an average gap between peaks that can be as low as 17.9 and as high as 70.9.
I'll discuss all this later, but I thought it would be interesting to post the results. I'm also conducting more runs to help better characterize variation in output. (I'll say more about the behavior of the model as the sample size gets larger in a subsequent post.)
One thing appears to be clear from this data, however. The choice of activation makes a significant difference in quantitative model outcome.
Blogging to Resume
This has always been intended to be a blog in which I discuss my research into the world of complex adaptive systems. I've been spending the past several months reinvigorating my modeling skills, as well as developing my quantitative analysis capabilities (both the skills and the software).
In the coming year, I expect there will be much of interest here. In my next post, I plan to present some results in one of the three modeling areas I am investigating. The next few entries will document:
- The basic precepts of my dissertation research
- Preliminary results for an agent-based model of civil revolt.
- Progress and results of a model of the labor market.
- Progress and results of a model of stock market trading.
- Underlying questions that ABM research can help with.
- The companion statistical techniques that should accompany ABM research and help to understand its output.
- The question of how to build 'valid' agent-based models. (I think there needs to be an article written on standardizing the validation process.)