전체 페이지뷰

2017년 6월 6일 화요일

HMM (Hidden Markov Model)

Recommended reading on HMMs

http://cs.brown.edu/research/ai/dynamics/tutorial/Documents/HiddenMarkovModels.html

http://www.cs.jhu.edu/~langmea/resources/lecture_notes/hidden_markov_models.pdf
http://nbviewer.ipython.org/7460513

import math
import numpy

class HMM(object):
    ''' Simple Hidden Markov Model implementation.  User provides
        transition, emission and initial probabilities in dictionaries
        mapping 2-character codes onto floating-point probabilities
        for those table entries.  States and emissions are represented
        with single characters.  Emission symbols comes from a finite.  '''
    
    def __init__(self, A, E, I):
        ''' Initialize the HMM given transition, emission and initial
            probability tables. '''
        
        # put state labels to the set self.Q
        self.Q, self.S = set(), set() # states and symbols
        for a, prob in A.iteritems():
            asrc, adst = a[0], a[1]
            self.Q.add(asrc)
            self.Q.add(adst)
        # add all the symbols to the set self.S
        for e, prob in E.iteritems():
            eq, es = e[0], e[1]
            self.Q.add(eq)
            self.S.add(es)
        
        self.Q = sorted(list(self.Q))
        self.S = sorted(list(self.S))
        
        # create maps from state labels / emission symbols to integers
        # that function as unique IDs
        qmap, smap = {}, {}
        for i in xrange(len(self.Q)): qmap[self.Q[i]] = i
        for i in xrange(len(self.S)): smap[self.S[i]] = i
        lenq = len(self.Q)
        
        # create and populate transition probability matrix
        self.A = numpy.zeros(shape=(lenq, lenq), dtype=float)
        for a, prob in A.iteritems():
            asrc, adst = a[0], a[1]
            self.A[qmap[asrc], qmap[adst]] = prob
        # make A stochastic (i.e. make rows add to 1)
        self.A /= self.A.sum(axis=1)[:, numpy.newaxis]
        
        # create and populate emission probability matrix
        self.E = numpy.zeros(shape=(lenq, len(self.S)), dtype=float)
        for e, prob in E.iteritems():
            eq, es = e[0], e[1]
            self.E[qmap[eq], smap[es]] = prob
        # make E stochastic (i.e. make rows add to 1)
        self.E /= self.E.sum(axis=1)[:, numpy.newaxis]
        
        # initial probabilities
        self.I = [ 0.0 ] * len(self.Q)
        for a, prob in I.iteritems():
            self.I[qmap[a]] = prob
        # make I stochastic (i.e. adds to 1)
        self.I = numpy.divide(self.I, sum(self.I))
        
        self.qmap, self.smap = qmap, smap
        
        # Make log-base-2 versions for log-space functions
        self.Alog = numpy.log2(self.A)
        self.Elog = numpy.log2(self.E)
        self.Ilog = numpy.log2(self.I)
    
    def jointProb(self, p, x):
        ''' Return joint probability of path p and emission string x '''
        p = map(self.qmap.get, p) # turn state characters into ids
        x = map(self.smap.get, x) # turn emission characters into ids
        tot = self.I[p[0]] # start with initial probability
        for i in xrange(1, len(p)):
            tot *= self.A[p[i-1], p[i]] # transition probability
        for i in xrange(0, len(p)):
            tot *= self.E[p[i], x[i]] # emission probability
        return tot
    
    def jointProbL(self, p, x):
        ''' Return log2 of joint probability of path p and emission
            string x.  Just like self.jointProb(...) but log2 domain. '''
        p = map(self.qmap.get, p) # turn state characters into ids
        x = map(self.smap.get, x) # turn emission characters into ids
        tot = self.Ilog[p[0]] # start with initial probability
        for i in xrange(1, len(p)):
            tot += self.Alog[p[i-1], p[i]] # transition probability
        for i in xrange(0, len(p)):
            tot += self.Elog[p[i], x[i]] # emission probability
        return tot
    
    def viterbi(self, x):
        ''' Given sequence of emissions, return the most probable path
            along with its probability. '''
        x = map(self.smap.get, x) # turn emission characters into ids
        nrow, ncol = len(self.Q), len(x)
        mat   = numpy.zeros(shape=(nrow, ncol), dtype=float) # prob
        matTb = numpy.zeros(shape=(nrow, ncol), dtype=int)   # backtrace
        # Fill in first column
        for i in xrange(0, nrow):
            mat[i, 0] = self.E[i, x[0]] * self.I[i]
        # Fill in rest of prob and Tb tables
        for j in xrange(1, ncol):
            for i in xrange(0, nrow):
                ep = self.E[i, x[j]]
                mx, mxi = mat[0, j-1] * self.A[0, i] * ep, 0
                for i2 in xrange(1, nrow):
                    pr = mat[i2, j-1] * self.A[i2, i] * ep
                    if pr > mx:
                        mx, mxi = pr, i2
                mat[i, j], matTb[i, j] = mx, mxi
        # Find final state with maximal probability
        omx, omxi = mat[0, ncol-1], 0
        for i in xrange(1, nrow):
            if mat[i, ncol-1] > omx:
                omx, omxi = mat[i, ncol-1], i
        # Backtrace
        i, p = omxi, [omxi]
        for j in xrange(ncol-1, 0, -1):
            i = matTb[i, j]
            p.append(i)
        p = ''.join(map(lambda x: self.Q[x], p[::-1]))
        return omx, p # Return probability and path
    
    def viterbiL(self, x):
        ''' Given sequence of emissions, return the most probable path
            along with log2 of its probability.  Just like viterbi(...)
            but in log2 domain. '''
        x = map(self.smap.get, x) # turn emission characters into ids
        nrow, ncol = len(self.Q), len(x)
        mat   = numpy.zeros(shape=(nrow, ncol), dtype=float) # prob
        matTb = numpy.zeros(shape=(nrow, ncol), dtype=int)   # backtrace
        # Fill in first column
        for i in xrange(0, nrow):
            mat[i, 0] = self.Elog[i, x[0]] + self.Ilog[i]
        # Fill in rest of log prob and Tb tables
        for j in xrange(1, ncol):
            for i in xrange(0, nrow):
                ep = self.Elog[i, x[j]]
                mx, mxi = mat[0, j-1] + self.Alog[0, i] + ep, 0
                for i2 in xrange(1, nrow):
                    pr = mat[i2, j-1] + self.Alog[i2, i] + ep
                    if pr > mx:
                        mx, mxi = pr, i2
                mat[i, j], matTb[i, j] = mx, mxi
        # Find final state with maximal log probability
        omx, omxi = mat[0, ncol-1], 0
        for i in xrange(1, nrow):
            if mat[i, ncol-1] > omx:
                omx, omxi = mat[i, ncol-1], i
        # Backtrace
        i, p = omxi, [omxi]
        for j in xrange(ncol-1, 0, -1):
            i = matTb[i, j]
            p.append(i)
        p = ''.join(map(lambda x: self.Q[x], p[::-1]))
        return omx, p # Return log probability and path
In [2]:
# We experiment with joint probabilities first

hmm = HMM({"FF":0.9, "FL":0.1, "LF":0.1, "LL":0.9}, # transition matrix A
          {"FH":0.5, "FT":0.5, "LH":0.75, "LT":0.25}, # emission matrix E
          {"F":0.5, "L":0.5}) # initial probabilities I
jprob1 = hmm.jointProb("FFFLLLFFFFF", "THTHHHTHTTH")
myprob1 = (0.5 ** 9) * (0.75 ** 3) * (0.9 ** 8) * (0.1 ** 2)
jprob1, myprob1
# these should be about equal
Out[2]:
(3.5469405120849628e-06, 3.5469405120849624e-06)
In [3]:
# confirming that log version of jointProb works as expected
jprobL1 = hmm.jointProbL("FFFLLLFFFFF", "THTHHHTHTTH")
math.log(jprob1, 2), jprobL1
Out[3]:
(-18.104993435171657, -18.104993435171654)
In [4]:
# Trying another path
jprob2 = hmm.jointProb("FFFFFFFFFFF", "THTHHHTHTTH")
myprob2 = (0.5 ** 12) * (0.9 ** 10)
jprob2, myprob2
# these should be about equal
Out[4]:
(8.51265722900391e-05, 8.512657229003909e-05)
In [5]:
# Note that jprob2 is greater than jprob1

# Now we experiment with viterbi decoding
jprobOpt, path = hmm.viterbi("THTHHHTHTTH")
path
Out[5]:
'FFFFFFFFFFF'
In [6]:
# maximum likelihood path is same path (all fair) as the second one
# we tried above, so jprobOpt should equal jprob2
jprobOpt, jprob2
Out[6]:
(8.51265722900391e-05, 8.51265722900391e-05)
In [7]:
# confirming that log version of viterbi works as expected
jprobLOpt, _ = hmm.viterbiL("THTHHHTHTTH")
math.log(jprobOpt, 2), jprobLOpt
Out[7]:
(-13.520030934450498, -13.520030934450496)
In [8]:
# Now let's make a new HMM with the same states but where jumps
# between fair (F) and loaded (L) are much more probable
hmm = HMM({"FF":0.6, "FL":0.4, "LF":0.4, "LL":0.6}, # transition matrix A
          {"FH":0.5, "FT":0.5, "LH":0.8, "LT":0.2}, # emission matrix E
          {"F":0.5, "L":0.5}) # initial probabilities I
hmm.viterbi("THTHHHTHTTH")
Out[8]:
(2.8665446400000001e-06, 'FFFLLLFFFFL')
In [9]:
# Here's an example of underflow.  Note that probability returned
# is 0.0 and the state string becomes all Fs after a while.
hmm.viterbi("THTHHHTHTTH" * 100)
Out[9]:
(0.0,
 'FFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF')
In [10]:
# Moving to log2 domain fixes underflow
hmm.viterbiL("THTHHHTHTTH" * 100)
Out[10]:
(-1824.4030071946879,
 'FFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFFFFFLLLFFFFL')
In [11]:
cpgHmm = HMM({'IO':0.20, 'OI':0.20, 'II':0.80, 'OO':0.80},
             {'IA':0.10, 'IC':0.40, 'IG':0.40, 'IT':0.10,
              'OA':0.25, 'OC':0.25, 'OG':0.25, 'OT':0.25},
             {'I' :0.50, 'O' :0.50})
x = 'ATATATACGCGCGCGCGCGCGATATATATATATA'
logp, path = cpgHmm.viterbiL(x)
print x
print path # finds the CpG island fine
ATATATACGCGCGCGCGCGCGATATATATATATA
OOOOOOOIIIIIIIIIIIIIIOOOOOOOOOOOOO
In [12]:
x = 'ATATCGCGCGCGATATATCGCGCGCGATATATAT'
logp, path = cpgHmm.viterbiL(x)
print x
print path # finds two CpG islands fine
ATATCGCGCGCGATATATCGCGCGCGATATATAT
OOOOIIIIIIIIOOOOOOIIIIIIIIOOOOOOOO
In [13]:
x = 'ATATATACCCCCCCCCCCCCCATATATATATATA'
logp, path = cpgHmm.viterbiL(x)
print x
print path # oops! - this is just a bunch of Cs.  What went wrong?
ATATATACCCCCCCCCCCCCCATATATATATATA
OOOOOOOIIIIIIIIIIIIIIOOOOOOOOOOOOO

Hidden Markov Models

Introduction

The following presentation is adapted from [Rabiner & Juang, 1986] and [Charniak, 1993].

Notational conventions

T = length of the sequence of observations (training set)
N = number of states (we either know or guess this number)
M = number of possible observations (from the training set)
Omega_X = {q_1,...q_N} (finite set of possible states)
Omega_O = {v_1,...,v_M} (finite set of possible observations)
X_t random variable denoting the state at time t (state variable)
O_t random variable denoting the observation at time t (output variable)
sigma = o_1,...,o_T (sequence of actual observations)

Distributional parameters

A = {a_ij} s.t. a_ij = Pr(X_t+1 = q_j |X_t = q_i) (transition probabilities)
B = {b_i} s.t. b_i(k) = Pr(O_t = v_k | X_t = q_i t) (observation probabilities)
pi = {pi_i} s.t. pi_i = Pr(X_0 = q_i) (initial state distribution)

Definitions

hidden Markov model (HMM) is a five-tuple (Omega_X,Omega_O,A,B,pi). Let lambda = {A,B,pi} denote the parameters for a given HMM with fixed Omega_X and Omega_O.

Problems

1. Find Pr(sigma|lambda): the probability of the observations given the model.
2. Find the most likely state trajectory given the model and observations.
3. Adjust lambda = {A,B,pi} to maximize Pr(sigma|lambda).

Motivation

A discrete-time, discrete-space dynamical system governed by a Markov chain emits a sequence of observable outputs: one output (observation) for each state in a trajectory of such states. From the observable sequence of outputs, infer the most likely dynamical system. The result is a model for the underlying process. Alternatively, given a sequence of outputs, infer the most likely sequence of states. We might also use the model to predict the next observation or more generally a continuation of the sequence of observations.Hidden Markov models are used in speech recognition. Suppose that we have a set W of words and a separate training set for each word. Build an HMM for each word using the associated training set. Let lambda_w denote the HMM parameters associated with the word w. When presented with a sequence of observations sigma, choose the word with the most likely model, i.e.,
        w* = arg max_{w in W} Pr(sigma|lambda_w)

Forward-Backward Algorithm

Preliminaries

Define the alpha values as follows,
        alpha_t(i) = Pr(O_1=o_1,...,O_t=o_t, X_t = q_i | lambda)
Note that
        alpha_T(i) = Pr(O_1=o_1,...,O_T=o_T, X_T = q_i | lambda)
                   = Pr(sigma, X_T = q_i | lambda)
The alpha values enable us to solve Problem 1 since, marginalizing, we obtain
        Pr(sigma|lambda) = sum_i=1^N Pr(o_1,...,o_T, X_T = q_i | lambda)
                         = sum_i=1^N alpha_T(i)
Define the beta values as follows,
        beta_t(i) = Pr(O_t+1=o_t+1,...,O_T=o_T | X_t = q_i, lambda)
We will need the beta values later in the Baum-Welch algorithm.

Algorithmic Details

1. Compute the forward (alpha) values:
        a. alpha_1(i) = pi_i b_i(o_1)
        b. alpha_t+1(j) = [sum_i=1^N alpha_t(i) a_ij] b_j(o_t+1)
2. Computing the backward (beta) values:
        a. beta_T(i) = 1
        b. beta_t(i) = sum_j=1^N a_ij b_j(o_t+1) beta_t+1(j)

Viterbi Algorithm

Intuition

Compute the most likely trajectory starting with the empty output sequence; use this result to compute the most likely trajectory with an output sequence of length one; recurse until you have the most likely trajectory for the entire sequence of outputs.

Algorithmic Details

1. Initialization:
        For 1 <= i <= N,

        a. delta_1(i) = pi b_i(o_1)

        b. Phi_1(i) = 0
2. Recursion:
        For 2 <= t <= T, 1 <= j <= N,

        a. delta_t(j) = max_i [delta_t-1(i)a_ij]b_j(o_t)

        b. Phi_t(j) = argmax_i [delta_t-1(i)a_ij]
3. Termination:
        a. p* = max_i [delta_T(i)]

        b. i*_T = argmax_i [delta_T(i)]
4. Reconstruction:
        For t = t-1,t-2,...,1,

        i*_t = Phi_t+1(i*_t+1) 
The resulting trajectory, i*_1,...,i*_T, solves Problem 2.

Baum-Welch Algorithm

Intuition

To solve Problem 3 we need a method of adjusting the lambda parameters to maximize the likelihood of the training set.Suppose that the outputs (observations) are in a 1-1 correspondence with the states so that N = M, varphi(q_i) = v_i and b_i(j) = 1 for j = i and 0 for j != i. Now the Markov process is not hidden at all and the HMM is just a Markov chain. To estimate the lambda parameters for this Markov chain it is enough just to calculate the appropriate frequencies from the observed sequence of outputs. These frequencies constitute sufficient statistics for the underlying distributions.
In the more general case, we can't observe the states directly so we can't calculate the required frequencies. In the hidden case, we use expectation maximization (EM) as described in [Dempster et al., 1977].
Instead of calculating the required frequencies directly from the observed outputs, we iteratively estimated the parameters. We start by choosing arbitrary values for the parameters (just make sure that the values satisfy the requirements for probability distributions).
We then compute the expected frequencies given the model and the observations. The expected frequencies are obtained by weighting the observed transitions by the probabilities specified in the current model. The expected frequencies so obtained are then substituted for the old parameters and we iterate until there is no improvement. On each iteration we improve the probability of O being observed from the model until some limiting probability is reached. This iterative procedure is guaranteed to converge on a local maximum of the cross entropy (Kullback-Leibler) performance measure.

Preliminaries

The probability of a trajectory being in state q_i at time t and making the transition to q_j at t+1 given the observation sequence and model.
        xi_t(i,j) = Pr(X_t = q_i, X_t+1 = q_j | sigma, lambda)
We compute these probabilities using the forward backward variables.
                       alpha_t(i) a_ij(o_t+1) beta_t+1(j)
        xi_t(i,j) =  -------------------------------------
                                  Pr(O | lambda) 
The probability of being in q_i at t given the observation sequence and model.
        gamma_t(i) = Pr(X_t = q_i | sigma, lambda)
Which we obtain by marginalization.
        gamma_t(i) = sum_j xi_t(i,j)
Note that
        sum_t=1^T gamma_t(i) = expected number of transitions from q_i
and
        sum_t=1^T xi_t(i,j) = expected number of transitions from q_i to q_j

Algorithmic Details

1. Choose the initial parameters, lambda = {A, B, pi}, arbitrarily.
2. Reestimate the parameters.
        a. bar{pi}_i = gamma_t(i) 

                        sum_t=1^T-1 xi_t(i,j) 
        b. bar{a}_ij = ------------------------
                        sum_t=1^T-1 gamma_t(i) 

                          sum_t=1^T-1 gamma_t(j) 1_{o_t = k}
        c. bar{b}_j(k) = ------------------------------------
                               sum_t=1^T-1 gamma_t(j) 
where 1_{o_t = k} = 1 if o_t = k and 0 otherwise.
3. Let bar{A} = {bar{a}_ij}, bar{B} = {bar{b}_i(k)}, and bar{pi} = {{bar{pi}_i}.
4. Set bar{lambda} to be {bar{A}, bar{B}, bar{pi}}.
5. If lambda = bar{lambda} then quit, else set lambda to be bar{lambda} and return to Step 2.

Bayesian Network Algorithms

The Bayesian network representation is shown in Figure 1.
        X_0     X_1     X_2     X_3             X_T-1   X_T
        o ----> o ----> o ----> o       ...     o ----> o 
        |       |       |       |               |       |  
        |       |       |       |               |       |  
        v       v       v       v               v       v  
        o       o       o       o       ...     o       o  
        O_0     O_1     O_2     O_3             O_T-1   O_T

        Fig. 1: Bayesian network representation for an HMM 
In the description of the Baum-Welch algorithm provided above, the computation of the expected sufficient statistics depends on computing the following term for all i and j in Omega_X.
        xi_t(i,j) = Pr( X_t=q_i, X_t+1=q_j | sigma, lambda)
These computations in turn rely on computing the forward and backward variables (the alpha's and beta's).
                       alpha_t(i) a_ij(o_t+1) beta_t+1(j)
        xi_t(i,j) =  -------------------------------------
                              Pr(sigma | lambda)
Generally, the forward and backward variables are computed using the forward-backward procedure which uses dynamic programming to compute the variables in time polynomial in |Omega_X|, |Omega_O|, and T. In the following paragraphs, we show how the xi's can be computed using standard Bayesian network inference algorithms in the same big-Oh complexity. One advantage of this approach is that it extends easily to the case in which the hidden part of the model is factored into some number of state variables.In the network shown in Figure 1 the O_t's are known. In particular, we have that O_i=o_i. If we assign the O_t's accordingly, use the probabilities indicated by lambda, and apply a standard Bayesian network inference algorithm, we obtain for every X_t the posterior distribution Pr(X_t|sigma,lambda). This isn't exactly what we need since X_t and X_t+1 are clearly not independent. If they were independent, then we could obtain Pr(X_t,X_t+1|sigma,lambda) from the product of Pr(X_t|sigma,lambda) and Pr(X_t+1|sigma,lambda). There are a number of remedies but one approach which is graphically intuitive involves adding a new state variable (X_t,X_t+1) which is the obvious deterministic function X_t and X_t+1. This addition results in the network shown in Figure 2.
             X_0,X_1 X_1,X_2 X_2,X_3    ...          X_T-1,X_T    
                o       o       o               o       o       
              / ^     / ^     / ^               ^     / ^       
            /   |   /   |   /   |               |   /   |       
          /     | /     | /     |               | /     |       
        o ----> o ----> o ----> o       ...     o ----> o       
        |       |       |       |               |       |       
        |       |       |       |               |       |       
        v       v       v       v               v       v       
        o       o       o       o       ...     o       o  
        O_0     O_1     O_2     O_3             O_T-1   O_T     

        Fig. 2: Bayesian network with joint variables, (X_t,X_t+1)
If we update the network in Figure 2 with O_t=o_t then we obtain Pr((X_t,X_t+1)|sigma,lambda) directly. It should be clear that we can eliminate the X_t (except for X_0) to obtain the singly connected network shown in Figure 3 which can be updated in time polynomial in |Omega_X|, |Omega_O|, and T.
       X_0   X_0,X_1 X_1,X_2 X_2,X_3    ...          X_T-1,X_T    
        o ----> o ----> o ----> o               o ----> o       
        |       |       |       |               |       |       
        |       |       |       |               |       |       
        v       v       v       v               v       v       
        o       o       o       o       ...     o       o  
        O_0     O_1     O_2     O_3             O_T-1   O_T     

        Fig. 3: Bayesian network with X_t's eliminated
The extension to HMMs with factored state spaces (e.g., see Figure 4) is graphically straightforward. The computational picture is more complicated and depends on the specifics of the update algorithm. It is important to point out, however, that there is a wide range of update algorithms, both approximate and exact, to choose from.
   X_1  o ----> o ----> o ----> o       ...     o ----> o 
          \       \       \                       \     
            \       \       \                       \   
              \       \       \                       \ 
   X_2  o ----> o ----> o ----> o       ...     o ----> o 
        |       |       |       |               |       |  
        |       |       |       |               |       |  
        v       v       v       v               v       v  
    O   o       o       o       o       ...     o       o  

        Fig. 4: Bayesian network representation for an HMM with
        factored state space Omega_X = Omega_X_1 times Omega_X_2.
 The state variable is two dimensional X_t = X_{1,t},X_{2,t}.

References

See [Rabiner & Juang, 1986] and [Rabiner, 1989] for a general introduction and applications in speech. See [Charniak, 1993] for applications in natural language processing including part of speech tagging. Charniak [1993] provides lots of examples that provide useful insight. See [Rabiner, 1989] and [Fraser & Dimitriadis, 1994] for details regarding numerical issues that arise in implementing the above algorithm. Rabiner and Juang [1986] also discuss variant algorithms for continuous observation spaces using multivariate Gaussian models.

댓글 없음:

댓글 쓰기