ASU Bayesian Approach to Reservoir Storage Project
Description
Download mcmc file and copy the contents into a function in a Google Colab notebook (or the IDE of your choice). Add cells containing the porosity and layer thickness data, then work your way through the assignment. After finishing and plotting,
Make sure functions are defined within their own code cells, and add comments to explain your code choices and logic. Name your file “LastName_FirstInitial_BayesReserviorProblem” and upload below. Make sure to delete any unnecessary or extraneous code.
The Bayesian model
Unlike the frequentist approach, Bayesians can directly talk about probability. Here we will use Bayes’ theorem to update our estimate of porosity and layer thickness based on the new observations. Recall the definition of the theorem:
(1) p(θ∣d)=p(d∣θ)p(θ)p(d)
p(d) is of course the prior probability distribution, or just the “prior.” It represents our state of belief about the system before collecting or interpreting any new data. It’s important to realize that even if you don’t formally update your beliefs based on new data, but you do look at the new data, you’ve contaminated the prior. Thus, it is important to establish beforehand what the prior model is that best captures your understanding or beliefs about a system.
p(d|theta) is called the “likelihood” function, and it represents an update to our beliefs given the new data. Formally, it is how likely the data are given some model or parameterization of the system. For example, what is the likelihood of observing the 6 new porosity measurements, given that the prior mean is true? How about a slightly different mean?
Here’s an important point:
The best estimate for the mean reservoir porosity based purely on the new data is simply the arithmetic average of the sampled porosity values. However, this ignores the fact that we had some prior beliefs about the system before observing any data. If you were told that the reservoir was a sandstone, would you believe that porosity could be 0.5%, on average, over the entire reservoir? Why or why not? What if you were told porosity was 90%? This simple example demonstrates that we tend to have fairly strong prior beliefs about what models make sense in a given system or context, and we would be fairly likely to reject observations that suggested porosity was less than 1% for a sandstone or greater than 90%. The formal integration of our prior beliefs with the data is one of the things that makes Bayesian probability inference so powerful.
The Evidence p(d)
What would p(d) be in this case? It’s the probability of observing the given data, given all possible models (including all possible values of θ ), and it is called the “evidence.” How would you compute this? You would have to simulate the data that you would get from all possible values of θ, then integrate over all of those possibilities to get the final probability. This would likely be very difficult for you to do, and potentially very inefficient. For our simple problem it might be possible to do, but for a more complex problem determining p(d) becomes prohibitive. How then do we calculate p(theta | d)?
Markov Chain Monte Carlo – MCMC
Markov Chain Monte Carlo (Links to an external site.) methods are a way of sampling the posterior distribution (left side of the equality) without needing to know the denominator p(d) on the right-hand side. The basic idea is that p(d) is a (unknown) constant, so it can be ignored if you only want the relative probability of two different models. I.e., p(d) is a function only of the data (no theta in there), so it will be the same regardless of whatever model you propose. So, we can compute the probability of two values of theta, and then simply compare to see which one is better (has higher probability). In this homework you will use an MCMC algorithm to sample from the posterior distribution of porosity and layer thickness, given the prior distributions we know and the likelihood function, discussed below.
(2) p(θ|d)∝p(d|θ)p(θ)
where the proportionality sign shows that we are longer looking at proper probabilities, but at something which is proportional (up to a constant) to the actual probabilities. Since we are only concerned about distinguishing which values of porosity are probable given our data, we don’t need to know their absolute probability, only the relative.
Bayesian estimation of porosity and layer thickness
MCMC works by drawing correlated samples of θ and then computing the likelihood of those samples. The particular algorithm we will use is called “Metropolis-Hastings.” It works like this:
1) Decide what the prior probability distribution p(θ) should be. In this case we are given the prior distributions, which are the Gaussian and Uniform distributions we’ve already discussed. Note: because we have specified the thickness distribution as uniform, any value of thickness outside the bounds (10 and 14) will be automatically rejected, even if they are favored by the data! For this homework it will not be a problem, but in the real world you must be careful about rejecting models when you have bounds active, in case you inadvertently exclude some possible models because of your bounds. The data is given in the table:
Borehole # | Reservoir Porosity (%) | Reservoir thickness (m) |
1 | 29.3 | 12.5 |
2 | 21.0 | 11.7 |
3 | 19.2 | 12.2 |
4 | 29.1 | 13.1 |
5 | 21.9 | 13.0 |
6 | 23.1 | 13.7 |
2) Specify your likelihood function (Links to an external site.) p(d∣θ) . The likelihood function mathematically encapsulates how well the data fit a given model. So in this case, given a value for mean porosity, how well do the observations fit that mean? In our example, the likelihood is a function of the difference between the data and the tested mean. In our particular case, the likelihood is very simple because we don’t have any measurement errors, and we are simply looking at the empirical distribution of the mean:
(3) logp(d|θ)∝−0.5∑i(μϕ−d¯)2σi2
where d(bar) is the mean of all the measurements and is subtracted from the mean porosity. We have to decide what to use for the standard deviation. Normally, this would be measurement error or uncertainty associated with each measurement, but in this case we do not have those. For this problem, we will assume that the standard deviation is the same the prior uncertainty (so 3% for porosity). For layer thickness, compute the sample standard deviation of the measurements and use that for each σi.
3) You will need to write a function that takes in 3 required and 3 optional arguments:
- x – the unknown parameters, in this case the means of porosity and thickness
- bounds – a list of two lists, each containing the lower and upper bounds for porosity and thickness. E.g. bound = [[-np.inf, 10], [np.inf, 14]]
- the data for the variable you are looking at (either porosity or thickness.
- Optionally, a standard deviation of the data
- The prior mean (if using a Gaussian prior)
- The prior variance (if using a Gaussian prior)
Your function should return the log-likelihood given by equation 3 above, multiplied by your prior. You should have a single function that you call twice to compute the distribution for porosity and thickness.
4) Run MCMC to search out the space of likely models. Do this by setting the parameters required in mcmc.py, and passing the function that you wrote in the appropriate argument spot. The input arguments are:
- Niter – the number of MCMC steps to take
- stepsize – the size of your step in parameter space
- fun – the function you write
- x0 – the starting value or estimate for the mean (use the sample average)
- bounds – your list of bounds
- bcut – the number of initial samples to skip (make this 0 for this homework)
- k – the number of intermediate steps to skip (make this 10 for this homework)
- data – the actual porosity or thickness data
Be sure that you collect all of the output variables separately:
- xhat – the estimated values of the mean
- likelihoods – the actual likelihoods (proportional to probabilities) of each sample
- acceptance_ratio – the ratio of samples you kept to the total, should be around 0.3 – 0.7 or so.
5) After you have run the code, plot a histogram of the posterior samples (xhat). You now have an empirical distribution of the posterior probability function! Create one histogram for each of variables (porosity and thickness). You should not have any thickness values above 14 or below 10.
6) Create a report outlining the steps you took, include your likelihood function and any other functions you may have used, and your final histograms. Report the mean and 95% confidence bounds on porosity and layer thickness. Use the values in your histogram to determine the confidence bounds.
Have a similar assignment? "Place an order for your assignment and have exceptional work written by our team of experts, guaranteeing you A results."