Recommended reading on HMMs
- A tutorial on Hidden Markov Models and selected applications in speech recognition, L. Rabiner, 1989, Proc. IEEE 77(2):257--286.
- What HMMs can do, Jeff Bilmes, U. Washington Tech Report, Feb 2002
- Markovian Models for Sequential Data, Y. Bengio, Neural Computing Surveys 2, 129--162, 1999.
- Probabilistic independence networks for hidden Markov models P. Smyth, D. Heckerman, and M. Jordan, Neural Computation , vol.9, no. 2, 227--269, 1997.
- Notes from Dan Ellis' class on speech processing at Columbia (2002). I snagged the excellent notes for lecture 10 and Lecture 11.
- Krstulovic, Doss & Bourlard's EPFL matlab lab manuals on HMMs here .
- Bibliography on HMMs (2001)
- Bookmarks on HMMs
- Machine Learning for Sequential Data: A Review. Tom Dietterich. In T. Caelli (Ed.) Lecture Notes in Computer Science (2002).
- Teaching Baum Welch using Excel spreadsheets, Jason Eisner
http://cs.brown.edu/research/ai/dynamics/tutorial/Documents/HiddenMarkovModels.html
http://www.cs.jhu.edu/~langmea/resources/lecture_notes/hidden_markov_models.pdf
http://nbviewer.ipython.org/7460513
http://www.cs.jhu.edu/~langmea/resources/lecture_notes/hidden_markov_models.pdf
http://nbviewer.ipython.org/7460513
import math
import numpy
class HMM(object):
''' Simple Hidden Markov Model implementation. User provides
transition, emission and initial probabilities in dictionaries
mapping 2-character codes onto floating-point probabilities
for those table entries. States and emissions are represented
with single characters. Emission symbols comes from a finite. '''
def __init__(self, A, E, I):
''' Initialize the HMM given transition, emission and initial
probability tables. '''
# put state labels to the set self.Q
self.Q, self.S = set(), set() # states and symbols
for a, prob in A.iteritems():
asrc, adst = a[0], a[1]
self.Q.add(asrc)
self.Q.add(adst)
# add all the symbols to the set self.S
for e, prob in E.iteritems():
eq, es = e[0], e[1]
self.Q.add(eq)
self.S.add(es)
self.Q = sorted(list(self.Q))
self.S = sorted(list(self.S))
# create maps from state labels / emission symbols to integers
# that function as unique IDs
qmap, smap = {}, {}
for i in xrange(len(self.Q)): qmap[self.Q[i]] = i
for i in xrange(len(self.S)): smap[self.S[i]] = i
lenq = len(self.Q)
# create and populate transition probability matrix
self.A = numpy.zeros(shape=(lenq, lenq), dtype=float)
for a, prob in A.iteritems():
asrc, adst = a[0], a[1]
self.A[qmap[asrc], qmap[adst]] = prob
# make A stochastic (i.e. make rows add to 1)
self.A /= self.A.sum(axis=1)[:, numpy.newaxis]
# create and populate emission probability matrix
self.E = numpy.zeros(shape=(lenq, len(self.S)), dtype=float)
for e, prob in E.iteritems():
eq, es = e[0], e[1]
self.E[qmap[eq], smap[es]] = prob
# make E stochastic (i.e. make rows add to 1)
self.E /= self.E.sum(axis=1)[:, numpy.newaxis]
# initial probabilities
self.I = [ 0.0 ] * len(self.Q)
for a, prob in I.iteritems():
self.I[qmap[a]] = prob
# make I stochastic (i.e. adds to 1)
self.I = numpy.divide(self.I, sum(self.I))
self.qmap, self.smap = qmap, smap
# Make log-base-2 versions for log-space functions
self.Alog = numpy.log2(self.A)
self.Elog = numpy.log2(self.E)
self.Ilog = numpy.log2(self.I)
def jointProb(self, p, x):
''' Return joint probability of path p and emission string x '''
p = map(self.qmap.get, p) # turn state characters into ids
x = map(self.smap.get, x) # turn emission characters into ids
tot = self.I[p[0]] # start with initial probability
for i in xrange(1, len(p)):
tot *= self.A[p[i-1], p[i]] # transition probability
for i in xrange(0, len(p)):
tot *= self.E[p[i], x[i]] # emission probability
return tot
def jointProbL(self, p, x):
''' Return log2 of joint probability of path p and emission
string x. Just like self.jointProb(...) but log2 domain. '''
p = map(self.qmap.get, p) # turn state characters into ids
x = map(self.smap.get, x) # turn emission characters into ids
tot = self.Ilog[p[0]] # start with initial probability
for i in xrange(1, len(p)):
tot += self.Alog[p[i-1], p[i]] # transition probability
for i in xrange(0, len(p)):
tot += self.Elog[p[i], x[i]] # emission probability
return tot
def viterbi(self, x):
''' Given sequence of emissions, return the most probable path
along with its probability. '''
x = map(self.smap.get, x) # turn emission characters into ids
nrow, ncol = len(self.Q), len(x)
mat = numpy.zeros(shape=(nrow, ncol), dtype=float) # prob
matTb = numpy.zeros(shape=(nrow, ncol), dtype=int) # backtrace
# Fill in first column
for i in xrange(0, nrow):
mat[i, 0] = self.E[i, x[0]] * self.I[i]
# Fill in rest of prob and Tb tables
for j in xrange(1, ncol):
for i in xrange(0, nrow):
ep = self.E[i, x[j]]
mx, mxi = mat[0, j-1] * self.A[0, i] * ep, 0
for i2 in xrange(1, nrow):
pr = mat[i2, j-1] * self.A[i2, i] * ep
if pr > mx:
mx, mxi = pr, i2
mat[i, j], matTb[i, j] = mx, mxi
# Find final state with maximal probability
omx, omxi = mat[0, ncol-1], 0
for i in xrange(1, nrow):
if mat[i, ncol-1] > omx:
omx, omxi = mat[i, ncol-1], i
# Backtrace
i, p = omxi, [omxi]
for j in xrange(ncol-1, 0, -1):
i = matTb[i, j]
p.append(i)
p = ''.join(map(lambda x: self.Q[x], p[::-1]))
return omx, p # Return probability and path
def viterbiL(self, x):
''' Given sequence of emissions, return the most probable path
along with log2 of its probability. Just like viterbi(...)
but in log2 domain. '''
x = map(self.smap.get, x) # turn emission characters into ids
nrow, ncol = len(self.Q), len(x)
mat = numpy.zeros(shape=(nrow, ncol), dtype=float) # prob
matTb = numpy.zeros(shape=(nrow, ncol), dtype=int) # backtrace
# Fill in first column
for i in xrange(0, nrow):
mat[i, 0] = self.Elog[i, x[0]] + self.Ilog[i]
# Fill in rest of log prob and Tb tables
for j in xrange(1, ncol):
for i in xrange(0, nrow):
ep = self.Elog[i, x[j]]
mx, mxi = mat[0, j-1] + self.Alog[0, i] + ep, 0
for i2 in xrange(1, nrow):
pr = mat[i2, j-1] + self.Alog[i2, i] + ep
if pr > mx:
mx, mxi = pr, i2
mat[i, j], matTb[i, j] = mx, mxi
# Find final state with maximal log probability
omx, omxi = mat[0, ncol-1], 0
for i in xrange(1, nrow):
if mat[i, ncol-1] > omx:
omx, omxi = mat[i, ncol-1], i
# Backtrace
i, p = omxi, [omxi]
for j in xrange(ncol-1, 0, -1):
i = matTb[i, j]
p.append(i)
p = ''.join(map(lambda x: self.Q[x], p[::-1]))
return omx, p # Return log probability and path
In [2]:
# We experiment with joint probabilities first
hmm = HMM({"FF":0.9, "FL":0.1, "LF":0.1, "LL":0.9}, # transition matrix A
{"FH":0.5, "FT":0.5, "LH":0.75, "LT":0.25}, # emission matrix E
{"F":0.5, "L":0.5}) # initial probabilities I
jprob1 = hmm.jointProb("FFFLLLFFFFF", "THTHHHTHTTH")
myprob1 = (0.5 ** 9) * (0.75 ** 3) * (0.9 ** 8) * (0.1 ** 2)
jprob1, myprob1
# these should be about equal
Out[2]:
In [3]:
# confirming that log version of jointProb works as expected
jprobL1 = hmm.jointProbL("FFFLLLFFFFF", "THTHHHTHTTH")
math.log(jprob1, 2), jprobL1
Out[3]:
In [4]:
# Trying another path
jprob2 = hmm.jointProb("FFFFFFFFFFF", "THTHHHTHTTH")
myprob2 = (0.5 ** 12) * (0.9 ** 10)
jprob2, myprob2
# these should be about equal
Out[4]:
In [5]:
# Note that jprob2 is greater than jprob1
# Now we experiment with viterbi decoding
jprobOpt, path = hmm.viterbi("THTHHHTHTTH")
path
Out[5]:
In [6]:
# maximum likelihood path is same path (all fair) as the second one
# we tried above, so jprobOpt should equal jprob2
jprobOpt, jprob2
Out[6]:
In [7]:
# confirming that log version of viterbi works as expected
jprobLOpt, _ = hmm.viterbiL("THTHHHTHTTH")
math.log(jprobOpt, 2), jprobLOpt
Out[7]:
In [8]:
# Now let's make a new HMM with the same states but where jumps
# between fair (F) and loaded (L) are much more probable
hmm = HMM({"FF":0.6, "FL":0.4, "LF":0.4, "LL":0.6}, # transition matrix A
{"FH":0.5, "FT":0.5, "LH":0.8, "LT":0.2}, # emission matrix E
{"F":0.5, "L":0.5}) # initial probabilities I
hmm.viterbi("THTHHHTHTTH")
Out[8]:
In [9]:
# Here's an example of underflow. Note that probability returned
# is 0.0 and the state string becomes all Fs after a while.
hmm.viterbi("THTHHHTHTTH" * 100)
Out[9]:
In [10]:
# Moving to log2 domain fixes underflow
hmm.viterbiL("THTHHHTHTTH" * 100)
Out[10]:
In [11]:
cpgHmm = HMM({'IO':0.20, 'OI':0.20, 'II':0.80, 'OO':0.80},
{'IA':0.10, 'IC':0.40, 'IG':0.40, 'IT':0.10,
'OA':0.25, 'OC':0.25, 'OG':0.25, 'OT':0.25},
{'I' :0.50, 'O' :0.50})
x = 'ATATATACGCGCGCGCGCGCGATATATATATATA'
logp, path = cpgHmm.viterbiL(x)
print x
print path # finds the CpG island fine
In [12]:
x = 'ATATCGCGCGCGATATATCGCGCGCGATATATAT'
logp, path = cpgHmm.viterbiL(x)
print x
print path # finds two CpG islands fine
In [13]:
x = 'ATATATACCCCCCCCCCCCCCATATATATATATA'
logp, path = cpgHmm.viterbiL(x)
print x
print path # oops! - this is just a bunch of Cs. What went wrong?
Hidden Markov Models
Introduction
The following presentation is adapted from [Rabiner & Juang, 1986] and [Charniak, 1993].Notational conventions
- T = length of the sequence of observations (training set)
- N = number of states (we either know or guess this number)
- M = number of possible observations (from the training set)
- Omega_X = {q_1,...q_N} (finite set of possible states)
- Omega_O = {v_1,...,v_M} (finite set of possible observations)
- X_t random variable denoting the state at time t (state variable)
- O_t random variable denoting the observation at time t (output variable)
- sigma = o_1,...,o_T (sequence of actual observations)
Distributional parameters
- A = {a_ij} s.t. a_ij = Pr(X_t+1 = q_j |X_t = q_i) (transition probabilities)
- B = {b_i} s.t. b_i(k) = Pr(O_t = v_k | X_t = q_i t) (observation probabilities)
- pi = {pi_i} s.t. pi_i = Pr(X_0 = q_i) (initial state distribution)
Definitions
A hidden Markov model (HMM) is a five-tuple (Omega_X,Omega_O,A,B,pi). Let lambda = {A,B,pi} denote the parameters for a given HMM with fixed Omega_X and Omega_O.Problems
- 1. Find Pr(sigma|lambda): the probability of the observations given the model.
- 2. Find the most likely state trajectory given the model and observations.
- 3. Adjust lambda = {A,B,pi} to maximize Pr(sigma|lambda).
Motivation
A discrete-time, discrete-space dynamical system governed by a Markov chain emits a sequence of observable outputs: one output (observation) for each state in a trajectory of such states. From the observable sequence of outputs, infer the most likely dynamical system. The result is a model for the underlying process. Alternatively, given a sequence of outputs, infer the most likely sequence of states. We might also use the model to predict the next observation or more generally a continuation of the sequence of observations.Hidden Markov models are used in speech recognition. Suppose that we have a set W of words and a separate training set for each word. Build an HMM for each word using the associated training set. Let lambda_w denote the HMM parameters associated with the word w. When presented with a sequence of observations sigma, choose the word with the most likely model, i.e.,w* = arg max_{w in W} Pr(sigma|lambda_w)
Forward-Backward Algorithm
Preliminaries
Define the alpha values as follows,alpha_t(i) = Pr(O_1=o_1,...,O_t=o_t, X_t = q_i | lambda)Note that
alpha_T(i) = Pr(O_1=o_1,...,O_T=o_T, X_T = q_i | lambda) = Pr(sigma, X_T = q_i | lambda)The alpha values enable us to solve Problem 1 since, marginalizing, we obtain
Pr(sigma|lambda) = sum_i=1^N Pr(o_1,...,o_T, X_T = q_i | lambda) = sum_i=1^N alpha_T(i)Define the beta values as follows,
beta_t(i) = Pr(O_t+1=o_t+1,...,O_T=o_T | X_t = q_i, lambda)We will need the beta values later in the Baum-Welch algorithm.
Algorithmic Details
- 1. Compute the forward (alpha) values:
a. alpha_1(i) = pi_i b_i(o_1) b. alpha_t+1(j) = [sum_i=1^N alpha_t(i) a_ij] b_j(o_t+1)
- 2. Computing the backward (beta) values:
a. beta_T(i) = 1 b. beta_t(i) = sum_j=1^N a_ij b_j(o_t+1) beta_t+1(j)
Viterbi Algorithm
Intuition
Compute the most likely trajectory starting with the empty output sequence; use this result to compute the most likely trajectory with an output sequence of length one; recurse until you have the most likely trajectory for the entire sequence of outputs.Algorithmic Details
- 1. Initialization:
For 1 <= i <= N, a. delta_1(i) = pi b_i(o_1) b. Phi_1(i) = 0
- 2. Recursion:
For 2 <= t <= T, 1 <= j <= N, a. delta_t(j) = max_i [delta_t-1(i)a_ij]b_j(o_t) b. Phi_t(j) = argmax_i [delta_t-1(i)a_ij]
- 3. Termination:
a. p* = max_i [delta_T(i)] b. i*_T = argmax_i [delta_T(i)]
- 4. Reconstruction:
For t = t-1,t-2,...,1, i*_t = Phi_t+1(i*_t+1)
- The resulting trajectory, i*_1,...,i*_T, solves Problem 2.
Baum-Welch Algorithm
Intuition
To solve Problem 3 we need a method of adjusting the lambda parameters to maximize the likelihood of the training set.Suppose that the outputs (observations) are in a 1-1 correspondence with the states so that N = M, varphi(q_i) = v_i and b_i(j) = 1 for j = i and 0 for j != i. Now the Markov process is not hidden at all and the HMM is just a Markov chain. To estimate the lambda parameters for this Markov chain it is enough just to calculate the appropriate frequencies from the observed sequence of outputs. These frequencies constitute sufficient statistics for the underlying distributions.In the more general case, we can't observe the states directly so we can't calculate the required frequencies. In the hidden case, we use expectation maximization (EM) as described in [Dempster et al., 1977].
Instead of calculating the required frequencies directly from the observed outputs, we iteratively estimated the parameters. We start by choosing arbitrary values for the parameters (just make sure that the values satisfy the requirements for probability distributions).
We then compute the expected frequencies given the model and the observations. The expected frequencies are obtained by weighting the observed transitions by the probabilities specified in the current model. The expected frequencies so obtained are then substituted for the old parameters and we iterate until there is no improvement. On each iteration we improve the probability of O being observed from the model until some limiting probability is reached. This iterative procedure is guaranteed to converge on a local maximum of the cross entropy (Kullback-Leibler) performance measure.
Preliminaries
The probability of a trajectory being in state q_i at time t and making the transition to q_j at t+1 given the observation sequence and model.xi_t(i,j) = Pr(X_t = q_i, X_t+1 = q_j | sigma, lambda)We compute these probabilities using the forward backward variables.
alpha_t(i) a_ij(o_t+1) beta_t+1(j) xi_t(i,j) = ------------------------------------- Pr(O | lambda)The probability of being in q_i at t given the observation sequence and model.
gamma_t(i) = Pr(X_t = q_i | sigma, lambda)Which we obtain by marginalization.
gamma_t(i) = sum_j xi_t(i,j)Note that
sum_t=1^T gamma_t(i) = expected number of transitions from q_iand
sum_t=1^T xi_t(i,j) = expected number of transitions from q_i to q_j
Algorithmic Details
- 1. Choose the initial parameters, lambda = {A, B, pi}, arbitrarily.
- 2. Reestimate the parameters.
a. bar{pi}_i = gamma_t(i) sum_t=1^T-1 xi_t(i,j) b. bar{a}_ij = ------------------------ sum_t=1^T-1 gamma_t(i) sum_t=1^T-1 gamma_t(j) 1_{o_t = k} c. bar{b}_j(k) = ------------------------------------ sum_t=1^T-1 gamma_t(j)
- where 1_{o_t = k} = 1 if o_t = k and 0 otherwise.
- 3. Let bar{A} = {bar{a}_ij}, bar{B} = {bar{b}_i(k)}, and bar{pi} = {{bar{pi}_i}.
- 4. Set bar{lambda} to be {bar{A}, bar{B}, bar{pi}}.
- 5. If lambda = bar{lambda} then quit, else set lambda to be bar{lambda} and return to Step 2.
Bayesian Network Algorithms
The Bayesian network representation is shown in Figure 1.X_0 X_1 X_2 X_3 X_T-1 X_T o ----> o ----> o ----> o ... o ----> o | | | | | | | | | | | | v v v v v v o o o o ... o o O_0 O_1 O_2 O_3 O_T-1 O_T Fig. 1: Bayesian network representation for an HMMIn the description of the Baum-Welch algorithm provided above, the computation of the expected sufficient statistics depends on computing the following term for all i and j in Omega_X.
xi_t(i,j) = Pr( X_t=q_i, X_t+1=q_j | sigma, lambda)These computations in turn rely on computing the forward and backward variables (the alpha's and beta's).
alpha_t(i) a_ij(o_t+1) beta_t+1(j) xi_t(i,j) = ------------------------------------- Pr(sigma | lambda)Generally, the forward and backward variables are computed using the forward-backward procedure which uses dynamic programming to compute the variables in time polynomial in |Omega_X|, |Omega_O|, and T. In the following paragraphs, we show how the xi's can be computed using standard Bayesian network inference algorithms in the same big-Oh complexity. One advantage of this approach is that it extends easily to the case in which the hidden part of the model is factored into some number of state variables.In the network shown in Figure 1 the O_t's are known. In particular, we have that O_i=o_i. If we assign the O_t's accordingly, use the probabilities indicated by lambda, and apply a standard Bayesian network inference algorithm, we obtain for every X_t the posterior distribution Pr(X_t|sigma,lambda). This isn't exactly what we need since X_t and X_t+1 are clearly not independent. If they were independent, then we could obtain Pr(X_t,X_t+1|sigma,lambda) from the product of Pr(X_t|sigma,lambda) and Pr(X_t+1|sigma,lambda). There are a number of remedies but one approach which is graphically intuitive involves adding a new state variable (X_t,X_t+1) which is the obvious deterministic function X_t and X_t+1. This addition results in the network shown in Figure 2.
X_0,X_1 X_1,X_2 X_2,X_3 ... X_T-1,X_T o o o o o / ^ / ^ / ^ ^ / ^ / | / | / | | / | / | / | / | | / | o ----> o ----> o ----> o ... o ----> o | | | | | | | | | | | | v v v v v v o o o o ... o o O_0 O_1 O_2 O_3 O_T-1 O_T Fig. 2: Bayesian network with joint variables, (X_t,X_t+1)If we update the network in Figure 2 with O_t=o_t then we obtain Pr((X_t,X_t+1)|sigma,lambda) directly. It should be clear that we can eliminate the X_t (except for X_0) to obtain the singly connected network shown in Figure 3 which can be updated in time polynomial in |Omega_X|, |Omega_O|, and T.
X_0 X_0,X_1 X_1,X_2 X_2,X_3 ... X_T-1,X_T o ----> o ----> o ----> o o ----> o | | | | | | | | | | | | v v v v v v o o o o ... o o O_0 O_1 O_2 O_3 O_T-1 O_T Fig. 3: Bayesian network with X_t's eliminatedThe extension to HMMs with factored state spaces (e.g., see Figure 4) is graphically straightforward. The computational picture is more complicated and depends on the specifics of the update algorithm. It is important to point out, however, that there is a wide range of update algorithms, both approximate and exact, to choose from.
X_1 o ----> o ----> o ----> o ... o ----> o \ \ \ \ \ \ \ \ \ \ \ \ X_2 o ----> o ----> o ----> o ... o ----> o | | | | | | | | | | | | v v v v v v O o o o o ... o o Fig. 4: Bayesian network representation for an HMM with factored state space Omega_X = Omega_X_1 times Omega_X_2. The state variable is two dimensional X_t = X_{1,t},X_{2,t}.
댓글 없음:
댓글 쓰기