A Python project


PROJECT_HMM

Sunil

Rahul

Lithin

Hema

Roshan

Karthik

CONTENT

Introduction

Forward Probability

Backward Probability

Viterbi Algorithm

Forward Algorithm

Forward-backward Algorithm

Training a HMM

1 INTRODUCTION

A Hidden Markov model is a Markov chain whose states are not directly observable. We instead make indirect observations about the state through events that result from those hidden states. Since these observations are not sufficient to determine the state, we associate with each state a probability of producing each observation. If every such probability is either 1 or 0, the model reduces to an ordinary Markov model.

Example 1

To understand how Hidden Markov models work, we take the example of a student attending classes. On any given day, a student may or may not attend a particular class. This is an observable entity.

Let us assume that the student misses class for only one of two reasons: the student is sick, or the student simply decides to skip the class. This information, i.e. the reason, is not known to us and hence is a hidden entity. Hidden Markov models serve as a tool for identifying these hidden entities given the observed ones.

We now introduce the components of a Hidden Markov model and relate them to the above example.

We first define the set of entities that we can observe, the observations. In this case we have 2 observations, “Attends class” and “Missed class”. We next define the set of hidden entities or the states of the HMM. In our case we have 2 states, “Sick” and “Healthy”.

Suppose we know the student's behaviour: the probability with which the student attends or misses class, given his state, is defined by the matrix below.

            Attends class   Missed class
Sick            0.1             0.9
Healthy         0.8             0.2

The above matrix states that given that the student is sick, the probability that he attends class is 0.1 and the probability he misses class is 0.9. Similarly, given that the student is healthy, the probability that he attends class is 0.8 and the probability he misses class is 0.2. This matrix is referred to as the emission matrix and defines the probability with which a given observation is emitted from a particular hidden state.

Next we look at the probability for the student to transition from one state to another.

            Sick    Healthy
Sick        0.6      0.4
Healthy     0.2      0.8

The above matrix is referred to as the transition matrix. This tells us that given that the student is Sick today, the probability that he is sick tomorrow is 0.6 while the probability of him being healthy tomorrow is 0.4. Similarly, if the student is healthy today, the probability that the student is healthy tomorrow is 0.8 while the probability of being sick tomorrow is 0.2 .

The last component of a HMM is the start/initial probabilities which define the probability of the first hidden state. This can be seen as a special case of transition probability as the first state, by virtue of being first, doesn't have any state preceding it and hence transition probabilities cannot be applied.

Sick        0.1
Healthy     0.9

This implies that the probability of the first state being Sick is 0.1, while the probability of the first state being Healthy is 0.9.
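In code, the three components above can be written down directly. The following Python sketch (the variable names are our own, not part of the text) encodes the example model:

```python
# Components of the example HMM from the text.
# States: "Sick", "Healthy"; observations: "A" (attends class), "M" (misses class).

start_prob = {"Sick": 0.1, "Healthy": 0.9}

# trans_prob[today][tomorrow]
trans_prob = {
    "Sick":    {"Sick": 0.6, "Healthy": 0.4},
    "Healthy": {"Sick": 0.2, "Healthy": 0.8},
}

# emit_prob[state][observation]
emit_prob = {
    "Sick":    {"A": 0.1, "M": 0.9},
    "Healthy": {"A": 0.8, "M": 0.2},
}

# Sanity check: every probability row sums to 1.
assert abs(sum(start_prob.values()) - 1.0) < 1e-9
for row in list(trans_prob.values()) + list(emit_prob.values()):
    assert abs(sum(row.values()) - 1.0) < 1e-9
```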

BASIC DEFINITIONS

O : {O1, O2, …, OT}   Sequence of observations
Q : {Q1, Q2, …, QT}   Sequence of corresponding hidden states
S : {S1, S2, …, Sn}   Set of unique states in the Hidden Markov model
M : {M1, M2, …, Mm}   Set of unique observations in the Hidden Markov model
n : Number of hidden states in the model
m : Number of unique observations in the model
T : Transition matrix
E : Emission matrix
π : Initial/start probability
α : Forward probability values
β : Backward probability values

Note:

αt(Si) = αt(i) = αt(S) = αt(Sick) for i=1. Similarly, i=2 refers to the state 'Healthy' (H).

1st order Markov assumption: the probability of the state at time 't' depends only on the state at time 't-1', and not on the states before 't-1'.

P(Qt | Q1, Q2, …, Qt-1) = P(Qt | Qt-1)

In other words, once Qt-1 is known, the earlier states Q1, Q2, …, Qt-2 do not influence Qt. Hidden Markov models work on this assumption, together with the assumption that each observation Ot depends only on the current state Qt. The initial probability array, transition matrix and observation/emission matrix completely define an HMM.

For the example discussed in the introduction, the states are { S1:Sick , S2 : Healthy}

while the observations are { M1:Attends class , M2 : Missed class}.

Suppose O = {A, M, M, A} is a sequence of observations that we observe.

(Denoting the observations Attends class and Missed class as A and M respectively.)

So the student attends class on the first day, misses class for the next two days and attends class on the fourth day. Given this sequence, we compute forward and backward probabilities, which are crucial for training Hidden Markov models.

2. Forward probability

Given an observation sequence O = {O1, O2, …, OT}, the forward probability αt(i) for a state Si at time t is the probability of observing the sequence O1, O2, …, Ot and ending in the state Si.

αt(i) = P(Qt = Si, O1, O2, …, Ot)

Given the observation sequence O = {A, M, M, A}, the forward probability for the student to be Sick on the 2nd day (t=2) is given by:

α2(S1) = P(Q2 = S1, O1 = A, O2 = M)

Computing forward probabilities

For time t=1,

α1(1) = P(Q1 = S1, O1 = A)

This is a special case: since this is the first state, there is no earlier state to transition from, so the start probability π is used in place of a transition probability.

α1(1) = π(S)*E(A|S) = 0.1*0.1 = 0.01

α1(2) = π(H)*E(A|H) = 0.9*0.8 = 0.72

For the rest of the time instances, the forward probabilities are calculated using :

αt(i) = Σj αt-1(j)*T(Si|Sj)*E(Ot|Si)

On examining the above equation, we can see that it sums the probabilities of all possible paths to the state Si at time t. If the model is in state Sj at time t-1 and is to move to the state Si at time t for the observation Ot, then it first has to transition from Sj to Si, given by the transition probability T(Si|Sj), and then has to emit the observation Ot, given by the emission probability E(Ot|Si).

For time t=2,

α2(1) = α1(1)*T(S|S)*E(M|S) + α1(2)*T(S|H)*E(M|S)

α2(2) = α1(1)*T(H|S)*E(M|H) + α1(2)*T(H|H)*E(M|H)

For time t=3,

α3(1) = α2(1)*T(S|S)*E(M|S) + α2(2)*T(S|H)*E(M|S)

α3(2) = α2(1)*T(H|S)*E(M|H) + α2(2)*T(H|H)*E(M|H)

In a similar fashion the forward probabilities for all the time instances in a given sequence can be calculated. The final forward probability values for the observation sequence O = {A, M, M, A} are:

Forward Prob      A          M          M          A
Sick (S)         0.01       0.135      0.0938     0.0062
Healthy (H)      0.72       0.116      0.0294     0.0488
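The recursion above is short to implement. Here is a minimal Python sketch (function and variable names are our own) that computes the forward values for O = {A, M, M, A}:

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """alpha[t][s] = P(Q_t = s, O_1..O_t); t is 0-indexed here."""
    alpha = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    for t in range(1, len(obs)):
        alpha.append({
            s: sum(alpha[t - 1][j] * trans_p[j][s] for j in states) * emit_p[s][obs[t]]
            for s in states
        })
    return alpha

states = ("Sick", "Healthy")
start_p = {"Sick": 0.1, "Healthy": 0.9}
trans_p = {"Sick": {"Sick": 0.6, "Healthy": 0.4},
           "Healthy": {"Sick": 0.2, "Healthy": 0.8}}
emit_p = {"Sick": {"A": 0.1, "M": 0.9},
          "Healthy": {"A": 0.8, "M": 0.2}}

alpha = forward(["A", "M", "M", "A"], states, start_p, trans_p, emit_p)
# alpha[0]["Sick"] = 0.1 * 0.1 = 0.01 ; alpha[0]["Healthy"] = 0.9 * 0.8 = 0.72
```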

3. Backward probability

Given an observation sequence O = {O1, O2, …, OT}, the backward probability βt(i) for a state Si at time t is the probability that the sequence Ot+1, Ot+2, …, OT is observed, given that the state is Si at time 't'.

βt(i) = P(Ot+1, Ot+2, …, OT | Qt = Si)

The computation starts at time T and works backwards through the network from observation OT down to Ot+1, and is thus simply a "backwards" variant of the forward algorithm.

Given the observation sequence O = {A, M, M, A}, the backward probability for the student who is sick on the second day to miss class on the third day and attend class on the fourth day is given by:

β2(S) = P(O3 = M, O4 = A | Q2 = S)

Computing Backward Probabilities:

For time t=T,

At time T, the probability of being in state Si and generating nothing further is simply 1, since there is no more output to account for.

βT(Si) = P(OT+1 = ∅ | QT = Si) = 1

The backward probability at any time ‘t’ before T can be recursively estimated using the iterative formula:

βt(Si) = Σj E(Ot+1|Sj)*T(Sj|Si)*βt+1(Sj)

which can be intuitively understood as the sum, over all possible states Sj that the network can be in at time 't+1', of the product of:

1. The probability of the observation Ot+1 showing up in state Sj : E(Ot+1|Sj)

2. The probability of the state being Sj at time 't+1' given that the state is Si at time 't' : T(Sj|Si)

3. The backward probability that the state is Sj at time 't+1' : βt+1(Sj)

So for the example,

At t=T,

βT(H) = βT(S) = 1

At t=T-1,

βT-1(H) = E(A|H)*T(H|H)*βT(H) + E(A|S)*T(S|H)*βT(S)

βT-1(S) = E(A|H)*T(H|S)*βT(H) + E(A|S)*T(S|S)*βT(S)

In a similar fashion the backward probabilities for all the time instances in a given sequence can be calculated. The final backward probability values for the observation sequence O = {A, M, M, A} are:

Backward Prob     A          M          M         A
Sick (S)        0.1532     0.258      0.38       1
Healthy (H)     0.0742     0.174      0.66       1
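The backward recursion is just as short. The following minimal Python sketch (names are our own) reproduces the β table above:

```python
def backward(obs, states, trans_p, emit_p):
    """beta[t][s] = P(O_{t+1}..O_T | Q_t = s); beta at the last step is 1."""
    T = len(obs)
    beta = [{} for _ in range(T)]
    beta[T - 1] = {s: 1.0 for s in states}
    for t in range(T - 2, -1, -1):
        beta[t] = {
            s: sum(emit_p[j][obs[t + 1]] * trans_p[s][j] * beta[t + 1][j]
                   for j in states)
            for s in states
        }
    return beta

states = ("Sick", "Healthy")
trans_p = {"Sick": {"Sick": 0.6, "Healthy": 0.4},
           "Healthy": {"Sick": 0.2, "Healthy": 0.8}}
emit_p = {"Sick": {"A": 0.1, "M": 0.9},
          "Healthy": {"A": 0.8, "M": 0.2}}

beta = backward(["A", "M", "M", "A"], states, trans_p, emit_p)
# beta[2]["Sick"] = 0.1*0.6 + 0.8*0.4 = 0.38, matching the table.
```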

VITERBI ALGORITHM

The Viterbi algorithm is used to find the most likely sequence of hidden states that can generate the given set of observations.

So what we essentially do at each step of the algorithm is: for every state, we consider all the ways of arriving there from the states at the previous step, compare those path probabilities, keep the highest, and move on.

Let us define:

δi(t) : the probability of the most probable path ending in state Si at time t.

So,

δi(t) = maxQ1,Q2,…,Qt-1 P(Q1, Q2, …, Qt-1, Qt = Si, O1 O2 … Ot | λ),

where λ denotes the given initial, transition and emission matrices: λ = (π, T, E)

Now, since the Observation sequence is given and our task is to find the most probable hidden state sequence,

We can make our task easier and more efficient by computing δi(t) recursively as:

δi(t) = max1≤j≤N(δj(t-1)*T(Si|Sj))*E(Ot|Si)

At each step, we also store, for each state and each time t, the value of j that attains the maximum in the above equation.

The recursion is initialised with:

δi(1) = πi*E(O1|Si), 1 ≤ i ≤ N

The probability of the best path is found by maximising δi(T) at time t = T:

Pathprob = max1≤i≤N(δi(T))

So we first calculate all δ values from the initial time up to the end time T.

We then choose the highest probability endpoint and then backtrack from there to find the highest probability path.

QT = Si, where i = argmax1≤i≤N(δi(T))

Qt-1 = Sj, where j is the stored argmax used when computing δ at time t for the chosen state Qt.

Note that the most likely path is not the only possible optimality criterion; for example, choosing the most likely state at each individual time requires a different algorithm and can give a slightly different result. But the overall most likely path provided by the Viterbi algorithm is an optimal state sequence for many purposes.

Example:

Consider the example of a student attending classes.

For the given set of observations, say {A, M, M, A}, we can find the most probable sequence of states that generates the observation sequence through the Viterbi algorithm.

The most probable sequence for the given observation sequence is {Healthy, Sick, Sick, Healthy}.
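A compact Python sketch of the algorithm (our own function names; ties in the max are broken by first occurrence) for the student model:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for obs."""
    # delta[t][s]: probability of the best path ending in state s at step t.
    delta = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(obs)):
        delta.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda j: delta[t - 1][j] * trans_p[j][s])
            back[t][s] = prev
            delta[t][s] = delta[t - 1][prev] * trans_p[prev][s] * emit_p[s][obs[t]]
    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: delta[-1][s])
    path = [state]
    for t in range(len(obs) - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    return path[::-1]

states = ("Sick", "Healthy")
start_p = {"Sick": 0.1, "Healthy": 0.9}
trans_p = {"Sick": {"Sick": 0.6, "Healthy": 0.4},
           "Healthy": {"Sick": 0.2, "Healthy": 0.8}}
emit_p = {"Sick": {"A": 0.1, "M": 0.9},
          "Healthy": {"A": 0.8, "M": 0.2}}

path = viterbi(["A", "M", "M", "A"], states, start_p, trans_p, emit_p)
```

For O = {A, M, M, A} this returns Healthy, Sick, Sick, Healthy: the two missed classes in the middle are best explained by sickness.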


Finding the most probable sequence:

Given the sequence of observations O : {O1, O2, …, OT}, we have to find the most probable sequence of hidden states Q : {Q1, Q2, …, QT} that can generate the given observations.

If Vj,k is the probability of the most probable state sequence accounting for the first j observations that has k as its final state, then

Vj,k = P(Oj | k) * maxq∈S(Tq,k * Vj-1,q), where V1,k = P(O1 | k) * πk

The Viterbi path can be retrieved by saving back pointers that remember which state q was used in the recursion above. Let Ptr(k, j) be the function that returns the value of q used to compute Vj,k for j > 1. Then:

qT = argmaxq∈S(VT,q)

qj-1 = Ptr(qj, j)


TRAINING A HMM: BAUM-WELCH ALGORITHM

We have seen how to calculate the probability of a certain hidden state given an observation sequence, and also how to find the most likely hidden-state sequence. But how can we say for sure that our prediction of the hidden states is close to the actual sequence? And if it isn't, how do we make our algorithm better at prediction?

In other words, how do we train our HMM model?

This section is to answer that particular question.

So, we have our forward and backward quantities:

αt(i) = P(Qt = Si, O1, O2, …, Ot)

βt(i) = P(Ot+1, Ot+2, …, OT | Qt = Si)

We now note a few results which will be helpful later:

(i) By the chain rule and the Markov property:

P(O1, O2, …, OT, Qt = Si) = P(O1, O2, …, Ot, Qt = Si) * P(Ot+1, Ot+2, …, OT | Qt = Si)

= αt(i) * βt(i)

This means that the probability of observing the full sequence with the model in state Si at time 't' is αt(i) * βt(i).

The summation of αt(i) * βt(i) over all time 't' is proportional to the expected number of times the model is ever in state Si.

(ii) P(O1, O2, …, OT) = Σi P(Qt = Si, O1, O2, …, OT)

= Σi αt(i) * βt(i) for any t; in particular, since βT(i) = 1, this equals Σi αT(i)

So the summation of αt(i) * βt(i) over all states Si gives the probability of observing the observation sequence 'O'.

Therefore, the probability of being in state Si given the entire observation sequence 'O' is

P(Qt = Si | O1, O2, …, OT) = P(O1, O2, …, OT, Qt = Si) / P(O1, O2, …, OT)

= αt(i) * βt(i) / Σj αT(j)

For ease, we define γi(t) = P(Qt = Si | O).

Example (calculation of gamma): with the forward and backward values computed above for O = {A, M, M, A}, P(O) = Σi α4(i) = 0.0062 + 0.0488 = 0.055, so γS(2) = α2(S)*β2(S)/P(O) = 0.135*0.258/0.055 ≈ 0.63, i.e. the student was most likely sick on day 2.

Re-estimation of Emission Probability:

As seen in (i), the summation of αt(i) * βt(i) over all time 't' is proportional to the expected number of times the model is in state Si. The emission probability for a given state Si, E(Ok | Si), can now be re-estimated as a weighted count over the corpus:

E(Ok | Si) = expected number of times Ok is emitted from state Si / expected number of times the model is in state Si

= sum of γi(t) over times when Ok is observed / sum of γi(t) over all time

= Σt: Ot = Ok αt(i) * βt(i) / Σt αt(i) * βt(i)

Example,(calculation of new emission prob.)

Re-estimation of Transition Probability:

For any time 't', the likelihood of a transition from state Si to state Sj is the product of:

the forward probability of getting to Si : αt(i)

the probability of taking the transition : Ti,j * E(Ot+1 | Sj)

the backward probability of completing the observation sequence from state Sj : βt+1(j)

i.e., P(Qt = Si, Qt+1 = Sj, O1 … OT) = αt(i) * Ti,j * E(Ot+1 | Sj) * βt+1(j)

Thus, the expected number of times this transition is taken is proportional to the sum of these terms over all time t : Σt αt(i) * Ti,j * E(Ot+1 | Sj) * βt+1(j)

The expected number of times of ever being in state Si is likewise proportional to Σt αt(i) * βt(i)

Therefore, the new transition probability Ti,j = Σt αt(i) * Ti,j * E(Ot+1 | Sj) * βt+1(j) / Σt αt(i) * βt(i)

Example, (calculation of new transition prob.)
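The re-estimation formulas above can be checked end to end with a short self-contained script. This is our own minimal sketch for the two-state student model, handling a single observation sequence (a full Baum-Welch implementation would iterate this step and average over many sequences):

```python
def baum_welch_step(obs, states, start_p, trans_p, emit_p):
    """One Baum-Welch re-estimation pass over a single observation sequence."""
    T = len(obs)
    # Forward pass: alpha[t][i] = P(Q_t = i, O_1..O_t)
    alpha = [{i: start_p[i] * emit_p[i][obs[0]] for i in states}]
    for t in range(1, T):
        alpha.append({i: sum(alpha[t-1][j] * trans_p[j][i] for j in states)
                         * emit_p[i][obs[t]] for i in states})
    # Backward pass: beta[t][i] = P(O_{t+1}..O_T | Q_t = i)
    beta = [{} for _ in range(T)]
    beta[T-1] = {i: 1.0 for i in states}
    for t in range(T-2, -1, -1):
        beta[t] = {i: sum(emit_p[j][obs[t+1]] * trans_p[i][j] * beta[t+1][j]
                          for j in states) for i in states}
    prob_O = sum(alpha[T-1][i] for i in states)
    # gamma[t][i] = P(Q_t = i | O); xi[t][i][j] = P(Q_t = i, Q_{t+1} = j | O)
    gamma = [{i: alpha[t][i] * beta[t][i] / prob_O for i in states}
             for t in range(T)]
    xi = [{i: {j: alpha[t][i] * trans_p[i][j] * emit_p[j][obs[t+1]]
                  * beta[t+1][j] / prob_O for j in states}
           for i in states} for t in range(T-1)]
    # Re-estimated parameters, as weighted counts.
    new_start = {i: gamma[0][i] for i in states}
    new_trans = {i: {j: sum(xi[t][i][j] for t in range(T-1))
                        / sum(gamma[t][i] for t in range(T-1))
                     for j in states} for i in states}
    # Observation alphabet {"A", "M"} is hard-coded for this example.
    new_emit = {i: {o: sum(gamma[t][i] for t in range(T) if obs[t] == o)
                       / sum(gamma[t][i] for t in range(T))
                    for o in ("A", "M")} for i in states}
    return new_start, new_trans, new_emit

states = ("Sick", "Healthy")
start_p = {"Sick": 0.1, "Healthy": 0.9}
trans_p = {"Sick": {"Sick": 0.6, "Healthy": 0.4},
           "Healthy": {"Sick": 0.2, "Healthy": 0.8}}
emit_p = {"Sick": {"A": 0.1, "M": 0.9},
          "Healthy": {"A": 0.8, "M": 0.2}}

new_start, new_trans, new_emit = baum_welch_step(
    ["A", "M", "M", "A"], states, start_p, trans_p, emit_p)
```

One quick sanity check: every re-estimated probability row still sums to 1.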

The quantities used in these re-estimation formulas are often written as:

ξi,j(t) = P(Qt = Si, Qt+1 = Sj | O)

Evidently, Σj=1..N ξi,j(t) = γi(t)

We can write γi(t) = αt(i)*βt(i) / Σk=1..N αt(k)*βt(k)

Consequently, ξi,j(t) = αt(i)*T(Sj|Si)*E(Ot+1|Sj)*βt+1(j) / Σk=1..N αt(k)*βt(k)

APPLICATIONS:

A prototypical application would be tracking sequences of sign-language gestures. There is information at each frame about which gesture is present, but it may be ambiguous. However, using the HMM we can impose the prior knowledge that certain signs are more likely to follow others, and get an improved result.

Forward Algorithm:

The forward algorithm gives us the probability of an observed sequence under the HMM. Why do we need the probability of an observed sequence? This type of problem occurs in speech recognition, where a large number of Markov models are used, each one modelling a particular word. Each utterance of a word gives us a sequence of observation variables, and we pick the Markov model that assigns this observation sequence the highest probability. The forward algorithm is also an important subroutine of the forward-backward algorithm.

A naive way to find the probability of an observed sequence is to sum over all possible sequences of the hidden states. For example consider the weather, given observation of a man either carrying or not carrying an umbrella.

Observations : No , Yes [ does he carry an umbrella or not]


States : Rain , Sun


Now given a sequence of observations :

O = No,Yes,Yes

We can compute :

P(O) = P(O|Sun,Sun,Sun) * P(Sun,Sun,Sun) + P(O|Sun,Rain,Sun) * P(Sun,Rain,Sun) + P(O|Sun,Sun,Rain) * P(Sun,Sun,Rain) + … + P(O|Rain,Rain,Rain) * P(Rain,Rain,Rain)

or

P(O) = ∑ P(O|s1,s2,s3) * P(s1,s2,s3)

where the summation runs over all possible combinations of s1, s2, s3

Hence for an observation sequence of length 'N', with 'K' states, we are required to sum over about K^N state sequences in total. Calculating the probability in this manner is computationally expensive. By exploiting the time invariance of the emission and transition probabilities, the forward algorithm reduces the total number of computations.

The forward algorithm keeps track of the partial probabilities 𝞪t(k). The partial probability 𝞪t(k) is the probability of reaching state 'k' at stage 't' along any of the possible paths through the first 't' stages. We see that the probability 𝞪t+1(k) can be computed from the knowledge of 𝞪t(x) ∀ x, where x represents a state.

Consider the same example, as used above.Let an observation sequence

O = {o1,o2,o3,o4 …….on } be observed where ot is the observation at any stage ‘t’. In order to compute 𝞪t+1(Rain) we see that:

𝞪t+1(Rain) = emission_prob(Ot+1 when Rain) * 𝞪t(Rain) * transition_prob(Rain->Rain)

+ emission_prob(Ot+1 when Rain) * 𝞪t(Sun) * transition_prob(Sun->Rain)

We know that, ‘Rain’ or ‘Sun’ occurs in stage ‘t’. Also, we have already computed 𝞪t(Rain) and 𝞪t(Sun) which account for all possible paths, upto ‘t’ stages. Hence 𝞪t+1(Rain) is computed using fact that ‘Rain’ or ‘Sun’ in ‘t’ stage can both transition to ‘Rain’ in ‘t+1’ stage. This will now give us all possible paths to ‘Rain’ in stage ‘t+1’. Additionally, Observation Ot+1 is seen in the ‘t+1’ stage, which is also factored into the probability 𝞪t+1(Rain) in terms of the emission probability.

We can similarly compute 𝞪t+1(Sun). Hence in general :

𝞪t+1(s) = em_prob (s,Ot+1 ) * ∑x { 𝞪t(x) * trans_prob( x->s) }

where the summation is applied over all states ‘x’ and ‘s’ is one particular state.

Hence we can loop over the stages 't' to compute the partial probabilities at every stage. The probability of occurrence of an observation sequence of length 'N' is then

p(O) = ∑x 𝞪N(x) : where summation is over all states ‘x’


The above is true because 𝞪N(x) gives us the probability of all paths ending in state 'x' after 'N' stages. Since any final state can occur, we sum 𝞪N(x) over all possible states 'x'.

Hence we can iteratively compute 𝞪t(x) for every stage, and all that is left is the initialization.

We see that :

𝞪1(x) = probability of state 'x' with observation O1 after the first stage

=> 𝞪1(x) = start_prob(x) * emission_prob(O1 in state x)

Hence the number of computations is greatly reduced, compared to our naive method.

The entire method is summarised in the below pseudo-code:

Let observation sequence be O = {o1,o2,o3,o4 …….on }


Pseudo-Code:

// Initialize 𝞪1(x)
for each state x:
    𝞪1(x) = start_prob(x) * emission_prob(O1 in state x)

// Loop over the remaining stages; 'N' is the number of stages
for (t = 2; t <= N; t++):
{
    for each state s:
    {
        𝞪t(s) = 0
        for each state x:
            𝞪t(s) = 𝞪t(s) + 𝞪t-1(x) * trans_prob(x -> s)
        𝞪t(s) = 𝞪t(s) * em_prob(s, Ot)
    }
}

// Total probability is 𝞪N summed over all the states
total_prob = 0
for each state s:
    total_prob = total_prob + 𝞪N(s)

// Return the total probability of the observation sequence O
return total_prob
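As a sanity check of the complexity argument, the Python sketch below computes P(O) both by brute-force enumeration over all K^N state sequences and by the forward recursion, and the two agree. The Rain/Sun numbers are made up for illustration, since the text does not give any:

```python
from itertools import product

# Hypothetical parameters for the umbrella example (assumed, not from the text).
states = ("Rain", "Sun")
start_p = {"Rain": 0.4, "Sun": 0.6}
trans_p = {"Rain": {"Rain": 0.7, "Sun": 0.3},
           "Sun":  {"Rain": 0.2, "Sun": 0.8}}
emit_p = {"Rain": {"Yes": 0.9, "No": 0.1},   # "Yes" = carries an umbrella
          "Sun":  {"Yes": 0.2, "No": 0.8}}

def brute_force(obs):
    """Naive method: sum over all K^N hidden-state sequences."""
    total = 0.0
    for seq in product(states, repeat=len(obs)):
        p = start_p[seq[0]] * emit_p[seq[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= trans_p[seq[t - 1]][seq[t]] * emit_p[seq[t]][obs[t]]
        total += p
    return total

def forward_prob(obs):
    """Forward algorithm: the same quantity in O(N * K^2) operations."""
    a = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for t in range(1, len(obs)):
        a = {s: emit_p[s][obs[t]] * sum(a[x] * trans_p[x][s] for x in states)
             for s in states}
    return sum(a.values())

obs = ["No", "Yes", "Yes"]
```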

Forward-Backward Algorithm:

The FB algorithm computes the posterior probability distribution of the hidden states given the observations.

For example, suppose a doctor observes a villager's reported symptoms {"normal", "dizzy", "dizzy", "cold"} on four consecutive days, where the hidden states are "Healthy" and "Fever". The FB algorithm computes the probability of the villager being "Healthy" (or having "Fever") on any of those four days; for example, the probability of the person being healthy on the third day.

Another example is handwriting recognition:

Hidden states : the actual letters of the alphabet

Observed variables : the messy letters that one actually writes

Given the observed sequence of messy letters, the FB algorithm allows us to compute the posterior probabilities of the hidden states, i.e. of the actual letters.

This algorithm relies on the idea of Dynamic Programming, which is a fancy name for reusing things we have already computed to make things faster.

The forward part computes the joint probability of state Sk and the first k observations O1:k, and does this ∀k=1…n.

The backward part computes the probability of the remaining observations Ok+1:n given state Sk, again for each k.

The last part of the algorithm is to merge the forward and the backward part to give the full posterior probabilities.

Before jumping to calculate Forward and Backward Probabilities , we should know following probability properties:

marginalisation of Probability :

P(Sk, O1:k) = ∑ P(Sk, Sk-1, O1:k-1, Ok)    {summation over all possible states Sk-1}

Chain Rule of Probability :

P(S1,O1)=P(O1|S1)P(S1)

We want to use dynamic programming, so we want to express P(Sk, O1:k) as a function of k-1, so that we can set up a recursion. We can bring Sk-1 into the equation using marginalisation.


Forward Probability: F(k) = P(Sk, O1:k)

Using marginalisation, this can be written as

= ∑ P(O1:k-1, Sk-1, Ok, Sk), where the summation is over all possible states Sk-1

Using the chain rule, the above equation can be written as

= ∑ P(O1:k-1, Sk-1) P(Ok, Sk | O1:k-1, Sk-1)

In the second term inside the summation, O1:k-1 can be dropped (Markov property):

= ∑ P(O1:k-1, Sk-1) P(Ok, Sk | Sk-1)

= ∑ P(O1:k-1, Sk-1) P(Sk | Sk-1) P(Ok | Sk)

Here P(Ok | Sk-1, Sk) reduces to P(Ok | Sk), since the kth observation depends solely on the kth state.

= ∑ F(k-1) P(Sk | Sk-1) P(Ok | Sk)

The first term is the recursion over k-1, so we compute these values for all k = 2…n. The second term is the transition probability and the third is the emission probability, both of which we assumed known, and we are done.

Calculating forward probability for the villager example:

Suppose villager A's responses over four consecutive days are {"normal", "dizzy", "dizzy", "cold"}. Given the first two days of observations, the probability of villager A being in the Healthy state on the second day can be calculated as follows:

F(2)=P( S2,O1:2) = P( “Healthy”,”normal”,”dizzy”)

using marginalisation and summing over all previously possible states

= P( “Healthy”,”Healthy”,”normal”,”dizzy”) + P( “Healthy”,”Fever”,”normal”,”dizzy”)

using chain rule to break the terms :

= P("Healthy", "normal") . P("dizzy", "Healthy" | "Healthy", "normal") + P("Fever", "normal") . P("dizzy", "Healthy" | "Fever", "normal")

Now we can drop "normal" in both conditional terms because of the Markov property:

= P("Healthy", "normal") . P("dizzy", "Healthy" | "Healthy") + P("Fever", "normal") . P("dizzy", "Healthy" | "Fever")

= P("Healthy", "normal") . P("Healthy" | "Healthy") . P("dizzy" | "Healthy") + P("Fever", "normal") . P("Healthy" | "Fever") . P("dizzy" | "Healthy")

Backward probability:

For the backward part, we compute the values in a similar way, by using marginalization and the chain rule.

B(k) = probability of seeing Ok+1, Ok+2, …, Om given that the state at time k is Sk

B(k) = P( Ok+1:m| Sk)

Using marginalisation of probability ,

= ∑P( Ok+2:m, Ok+1, Sk+1| Sk) (summation is over all possible states Sk+1)

Using Chain rule of Probability

= ∑P( Ok+2:m|(Ok+1, Sk+1, Sk)).P(Ok+1, Sk+1| Sk)

In the first term we can drop Ok+1 and Sk, as the state Sk+1 alone determines the sequence Ok+2:m:

= ∑ P(Ok+2:m | Sk+1) . P(Sk+1 | Sk) P(Ok+1 | Sk+1)

= ∑ B(k+1) . P(Sk+1 | Sk) P(Ok+1 | Sk+1)

Again, the first term is the recursion over k+1, so we compute these values for all k = 1…n-1. The second term is the transition probability and the third is the emission probability, both of which we assumed known, and we are done.

Taking same example of Villager A again,

B(2) = P(O3:4 | S2) = P("dizzy", "cold" | "Healthy")

marginalising over all next possible hidden states ,

= P("dizzy", "Healthy", "cold" | "Healthy") + P("dizzy", "Fever", "cold" | "Healthy")

using the chain rule and the recursion for B(3):

= P("Healthy" | "Healthy") . P("dizzy" | "Healthy") . B(3)[S3 = "Healthy"] + P("Fever" | "Healthy") . P("dizzy" | "Fever") . B(3)[S3 = "Fever"]

Combining Forward and Backward Probabilities:

It computes the posterior marginals of all hidden states given the sequence of observed variables.

P(Sk| O1:m) ∝ P(Ok+1:m |Sk,O1:k) P( Sk,O1:k)

= P( Ok+1:m| Sk) .P( Sk,O1:k)

= B(k). F(k )

Again taking same example of Villager A:

The probability of villager A being in state "Healthy" on the second day is proportional to the probability of being healthy after seeing the observations "normal", "dizzy", times the probability of seeing the sequence ("dizzy", "cold") given the "Healthy" state on day 2:

∝ P("Healthy", "normal", "dizzy") * P("dizzy", "cold" | "Healthy")

The first term is obtained from the forward probability, the second from the backward probability.

Summary of the Forward-Backward algorithm:

The FB algorithm picks a particular hidden state and finds its probability, given an observation sequence. The hidden state S at time t must account for the observed sequence both before t and after t. So:

First, we calculate the probability of a state S at time t given the past observations, using the forward algorithm.

Then, given this state S, we calculate the probability of the future observations (which we already know), using the backward algorithm.

Multiplying the two probabilities (and normalising) gives the probability of the hidden state being S at time t.
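Putting the two halves together for the student model from the earlier sections, a minimal sketch (our own names) computes the full posterior over states at every time step:

```python
def posteriors(obs, states, start_p, trans_p, emit_p):
    """P(Q_t = s | O) for every time t and state s, via F(t) * B(t), normalised."""
    T = len(obs)
    # Forward part: F[t][s] = P(Q_t = s, O_1..O_t)
    F = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    for t in range(1, T):
        F.append({s: emit_p[s][obs[t]] * sum(F[t-1][x] * trans_p[x][s]
                                             for x in states) for s in states})
    # Backward part: B[t][s] = P(O_{t+1}..O_T | Q_t = s)
    B = [{} for _ in range(T)]
    B[T-1] = {s: 1.0 for s in states}
    for t in range(T-2, -1, -1):
        B[t] = {s: sum(trans_p[s][x] * emit_p[x][obs[t+1]] * B[t+1][x]
                       for x in states) for s in states}
    p_obs = sum(F[T-1][s] for s in states)
    return [{s: F[t][s] * B[t][s] / p_obs for s in states} for t in range(T)]

states = ("Sick", "Healthy")
start_p = {"Sick": 0.1, "Healthy": 0.9}
trans_p = {"Sick": {"Sick": 0.6, "Healthy": 0.4},
           "Healthy": {"Sick": 0.2, "Healthy": 0.8}}
emit_p = {"Sick": {"A": 0.1, "M": 0.9},
          "Healthy": {"A": 0.8, "M": 0.2}}

gamma = posteriors(["A", "M", "M", "A"], states, start_p, trans_p, emit_p)
```

Each gamma[t] sums to 1; for this sequence the model is most likely Healthy on days 1 and 4, and Sick on days 2 and 3.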

Applications of the Forward-Backward Algorithm:

Inference, for example change detection.

Estimating parameters: this is done by the Baum-Welch algorithm, which couples the Forward-Backward algorithm with expectation maximization.

Sampling from the posterior distribution over state sequences.

Sunil

Rahul

Lithin

Hema

Roshan

Karthik

CONTENT

Introduction

Forward Probability

Backward Probability

Viterbi Algorithm

Forward Algorithm

Forward-backward Algorithm

Training a HMM

1 INTRODUCTION

A Hidden Markov model is a Markov chain for which the states are not explicitly observable .We instead make indirect observations about the state by events which result from those hidden states .Since these observables are not sufficient/complete to describe the state, we associate a probability with each of the observable coming from a particular state . In other words, if the probability falls to either 1 or 0,it reduces to a Markov model.

Example 1

To understand how Hidden Markov models work, we take the example of a student attending classes. On any given day, a student may or may not attend a particular class. This is an observable entity.

Let us assume that the student doesn’t attend class for only one of the two reasons, the student is sick or the student simply decides to skip the class. This information i.e the reason, is not known to us and hence is a hidden entity. Hidden Markov models serve as tool for us to identify these hidden entities given the observed entities.

We now introduce the components of an Hidden Markov model and relate them to the above example.

We first define the set of entities that we can observe, the observations. In this case we have 2 observations, “Attends class” and “Missed class”. We next define the set of hidden entities or the states of the HMM. In our case we have 2 states, “Sick” and “Healthy”.

Suppose we are aware of the behaviour of the student and know the probability with which the student attends the class or not given the state that the student is defined by the below matrix.

Attends class

Missed class

Sick

0.1

0.9

Healthy

0.8

0.2

The above matrix states that given that the student is sick, the probability that he attends class is 0.1 and the probability he misses the class is 0.9. Similarly, given that the student is healthy, the probability that he attends class is 0.8 and the probability he misses the class is 0.2 . This matrix is referred to as the emission matrix and defines the probability with which a given observation is emitted at a particular observed state.

Next we look at the probability for student to transition from one state to another.

Sick

Healthy

Sick

0.6

0.4

Healthy

0.2

0.8

The above matrix is referred to as the transition matrix. This tells us that given that the student is Sick today, the probability that he is sick tomorrow is 0.6 while the probability of him being healthy tomorrow is 0.4. Similarly, if the student is healthy today, the probability that the student is healthy tomorrow is 0.8 while the probability of being sick tomorrow is 0.2 .

The last component of a HMM is the start/initial probabilities which define the probability of the first hidden state. This can be seen as a special case of transition probability as the first state, by virtue of being first, doesn't have any state preceding it and hence transition probabilities cannot be applied.

Sick

0.1

Healthy

0.9

This implies that the probability of the first state being Sick is 0.1 while the probability of the first state being Healthy is 0.9

BASIC DEFINITIONS

O : { O1, O2 , …. OT } Sequence of observations

Q : {Q1 , Q2 , …. QT } Sequence of corresponding hidden states

S : {S1 , S2, ……, Sn } Set of unique states in the Hidden Markov model

M : {M1 , M2, ……, Mn } Set of unique observations in the Hidden Markov model

n : Number of hidden states in the model

m : Number of unique observations in the model

T : Transition matrix

E : Emission matrix

: Initial/ Start probability

: Forward probability values

: backward probability values

Note:

t(Si)=t(i)=t(S)= t(sick) for i=1 .Similarly i=2 implies state - ‘Healthy’.

1st order Markov assumption :The probability of occurrence of an event at time 't' depends only on the observation at time 't-1' and not on the events that happened before 't-1'.

P(Ot | O1,O2,…,Ot-1) =P(Ot |Ot-1)

In other words, the observations O1,O2,...Ot-1 do not impact the observation Ot. Hidden Markov models work on this assumption.The initial probability array, transition array and observation/emission array can completely define a HMM.

For the example discussed in the introduction, the states are { S1:Sick , S2 : Healthy}

while the observations are { M1:Attends class , M2 : Missed class}.

Suppose O={A, M, M, A}is a sequence of observations that we observe.

(Denoting the observations Attends class and Missed class as A and M respectively.)

So the student attends class on the first day, misses class for the next two days and attends class on the third day. Given the sequence, we compute forward and backward probabilities that are crucial for training Hidden Markov models.

2. Forward probability

Given an observation sequence O={O1 , O2,.... OT}, the forward probability t(i) for a state Siat time t is the probability for the sequence O0, O1,...,Ot to end in the state Si.

t(i) = P(Qt= Si , O1, O2,...,Ot )

Given the observation sequence O={A, M, M, A}, the forward probability for the student to be Sick on the 2nd day (t=2) is given by :

2(S1) = P( Q2=S1 ,O1=A, O2=M )

Computing forward probabilities

For time t=1,

1(1) = P( Q1=S , O1=A)

This is a special case, as we are not transitioning to the current state from any other state as this is the first state and there aren’t any states before this to transition from.

1(1) = (S)*E( A | S ) = 0.1*0.1 = 0.001

1(2) = (H)*E( A | H ) = 0.9*0.9 = 0.81

For the rest of the time instances, the forward probabilities are calculated using :

t(i) = j t-1(j)*T(Si|Sj)*E(Ot|Si)

On examining the above equation, we can see that the equation basically sums the probabilities of all possible paths to the state Si at time t. If the model is in state Sjin at time t-1 and is to move to the state Si at time t for the observation Ot, then it first has to transition from state Sjto Si, given by the transition probability T(Si|Sj), and then needs to emit the observation Ot, given by the emission probability E(Ot|Si).

For time t=2,

α2(1) = α1(1)*T(S|S)*E(M|S) + α1(2)*T(S|H)*E(M|S)

α2(2) = α1(1)*T(H|S)*E(M|H) + α1(2)*T(H|H)*E(M|H)

For time t=3,

α3(1) = α2(1)*T(S|S)*E(M|S) + α2(2)*T(S|H)*E(M|S)

α3(2) = α2(1)*T(H|S)*E(M|H) + α2(2)*T(H|H)*E(M|H)

In a similar fashion, the forward probabilities for all the time instances in a given sequence can be calculated. The final forward probability values for the observation sequence O = {A, M, M, A} are:

Forward Prob  |   A    |   M     |   M     |   A
Sick (S)      | 0.001  | 0.1463  | 0.1023  | 0.0067
Healthy (H)   | 0.81   | 0.1296  | 0.0324  | 0.0538
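The recursion above can be sketched in Python. The emission matrix is the one from the introduction and the t = 1 values are taken from the table; the transition matrix is not stated in the text, so the one below is an inferred assumption, chosen so that the recursion reproduces the table.

```python
states = ["S", "H"]  # Sick, Healthy

# Emission matrix from the introduction; transition matrix is an
# assumption inferred to be consistent with the tables in the text.
T = {"S": {"S": 0.6, "H": 0.4}, "H": {"S": 0.2, "H": 0.8}}
E = {"S": {"A": 0.1, "M": 0.9}, "H": {"A": 0.8, "M": 0.2}}
alpha_1 = {"S": 0.001, "H": 0.81}  # t = 1 values from the table above

def forward(obs):
    """alpha[t][i] = P(Q_t = S_i, O_1..O_t), computed left to right."""
    alpha = [dict(alpha_1)]
    for o in obs[1:]:
        prev = alpha[-1]
        # alpha_t(i) = sum_j alpha_{t-1}(j) * T(S_i|S_j) * E(O_t|S_i)
        alpha.append({i: sum(prev[j] * T[j][i] for j in states) * E[i][o]
                      for i in states})
    return alpha

alpha = forward(["A", "M", "M", "A"])
print(alpha[1]["S"])  # ~0.1463, matching the table
```

Running `forward(["A", "M", "M", "A"])` reproduces the table above up to rounding.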

3. Backward probability

Given an observation sequence O = {O1, O2, ..., OT}, the backward probability βt(i) for a state Si at time t is the probability that the sequence Ot+1, Ot+2, ..., OT is observed given that the state is Si at time t.

βt(i) = P(Ot+1, Ot+2, ..., OT | Qt = Si)

The algorithm starts at time T and then works backwards through the network from observation OT down to Ot+1, and is thus simply a “backwards” variant of the forward algorithm.

Given the observation sequence O = {A, M, M, A}, the backward probability for the student being sick on the second day is the probability that he misses class on the third day and attends on the fourth day, given that he is sick on the second day: β2(S) = P(O3=M, O4=A | Q2=S)

Computing Backward Probabilities:

For time t=T,

The probability of being in state Si at time T and generating nothing further is simply 1, since there is no more output in this context.

βT(Si) = P(OT+1 = ∅ | QT = Si) = 1

The backward probability at any time ‘t’ before T can be recursively estimated using the iterative formula:

βt(Si) = ∑j E(Ot+1|Sj)*T(Sj|Si)*βt+1(Sj)

which can be intuitively understood as the product of:

1. the probability of the observation Ot+1 showing up in state Sj : E(Ot+1|Sj)

2. the probability of the state being Sj at time ‘t+1’ given the state is Si at time ‘t’ : T(Sj|Si)

3. the backward probability that the state is Sj at time ‘t+1’ : βt+1(Sj)

and then a summation over all possible states Sj that the network can be in at time ‘t+1’.

So for the example,

At t=T,

βT(H) = βT(S) = 1

At t=T-1,

βT-1(H) = E(A|H)*T(H|H)*βT(H) + E(A|S)*T(S|H)*βT(S)

βT-1(S) = E(A|H)*T(H|S)*βT(H) + E(A|S)*T(S|S)*βT(S)

In a similar fashion, the backward probabilities for all the time instances in a given sequence can be calculated. The final backward probability values for the observation sequence O = {A, M, M, A} are:

Backward Prob  |   A     |   M    |   M   |  A
Sick (S)       | 0.1532  | 0.258  | 0.38  |  1
Healthy (H)    | 0.0742  | 0.174  | 0.66  |  1
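The backward recursion can be sketched the same way. As before, the transition matrix is an inferred assumption (the text never states it); the emission matrix is the one from the introduction.

```python
states = ["S", "H"]  # Sick, Healthy
T = {"S": {"S": 0.6, "H": 0.4}, "H": {"S": 0.2, "H": 0.8}}  # assumed
E = {"S": {"A": 0.1, "M": 0.9}, "H": {"A": 0.8, "M": 0.2}}  # from the text

def backward(obs):
    """beta[t][i] = P(O_{t+1}..O_T | Q_t = S_i); index 0 is day 1."""
    beta = [{i: 1.0 for i in states}]   # base case: beta_T(i) = 1
    for o in reversed(obs[1:]):         # observations O_T down to O_2
        nxt = beta[0]
        # beta_t(i) = sum_j E(O_{t+1}|S_j) * T(S_j|S_i) * beta_{t+1}(j)
        beta.insert(0, {i: sum(E[j][o] * T[i][j] * nxt[j] for j in states)
                        for i in states})
    return beta

beta = backward(["A", "M", "M", "A"])
print(beta[0]["S"])  # ~0.1532, matching the table
```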

VITERBI ALGORITHM

Viterbi algorithm is used to find out the most likely sequence of hidden states that can generate the given set of observations.

So, what we essentially do at each step of the algorithm is calculate, for every state, the probability of the most probable path ending in that state. For each state we compare the candidate paths arriving from every previous state, choose whichever is higher, store the winning predecessor, and move on.

Let us define:

δi(t) as the probability of the most probable path ending in state Si at time t.

So,

δi(t) = maxQ1,Q2,....,Qt-1( P(Q1, Q2, ...., Qt-1, Qt = Si, O1 O2 .... Ot | λ) ),

where λ denotes the given initial, transition and emission matrices: λ = (π, A, B)

Now, since the observation sequence is given and our task is to find the most probable hidden state sequence,

we can make our task easier and more efficient by computing δi(t) recursively as:

δi(t) = max1≤j≤N( δj(t-1)*T(Si|Sj) )*E(Ot|Si)

At each step, mind, we store the value of j that achieves this maximum for each state i and time t; these back pointers are needed later to recover the path.

The recursion is initialised at t = 1 with:

δi(1) = πi*E(O1|Si),  1 ≤ i ≤ N

The probability of the best path is found by maximizing δi(T) at time t=T, as follows:

P* = max1≤i≤N( δi(T) )

So, we first calculate all δ values from the initial time till the end time T.

We then choose the highest probability endpoint and then backtrack from there to find the highest probability path.

QT = Si, where i = argmax1≤i≤N( δi(T) )

Qt-1 = Sj, where j is the stored back pointer for the state chosen at time t.

Note that the most likely path is not the only possible optimality criterion; for example, choosing the most likely state at each individual time requires a different algorithm and can give a slightly different result. But the overall most likely path provided by the Viterbi algorithm is an optimal state sequence for many purposes.
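A minimal Python sketch of the algorithm for the student example. As in the earlier sections, the transition matrix is an inferred assumption (the text does not state it), and the δ values at t = 1 are taken to be the same as the forward values at t = 1.

```python
states = ["S", "H"]  # Sick, Healthy
T = {"S": {"S": 0.6, "H": 0.4}, "H": {"S": 0.2, "H": 0.8}}  # assumed
E = {"S": {"A": 0.1, "M": 0.9}, "H": {"A": 0.8, "M": 0.2}}  # from the text
delta_1 = {"S": 0.001, "H": 0.81}  # pi(i) * E(O_1|S_i), as in the forward pass

def viterbi(obs):
    """Return the most likely state sequence for the given observations."""
    delta = [dict(delta_1)]
    psi = [{}]                          # back pointers; psi[0] is unused
    for o in obs[1:]:
        prev, d, p = delta[-1], {}, {}
        for i in states:
            # best predecessor j maximising delta_j(t-1) * T(S_i|S_j)
            best = max(states, key=lambda j: prev[j] * T[j][i])
            d[i] = prev[best] * T[best][i] * E[i][o]
            p[i] = best
        delta.append(d)
        psi.append(p)
    # choose the best endpoint, then backtrack through the stored pointers
    q = max(states, key=lambda i: delta[-1][i])
    path = [q]
    for t in range(len(obs) - 1, 0, -1):
        q = psi[t][q]
        path.insert(0, q)
    return path

print(viterbi(["A", "M", "M", "A"]))  # ['H', 'S', 'S', 'H']
```

With these assumed numbers the most likely hidden sequence for {A, M, M, A} comes out as Healthy, Sick, Sick, Healthy, matching the intuition that missed classes indicate sickness.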

Example:

Consider the example of a student attending classes.

For the given set of observations, say {A, M, M, A}, we can find the most probable sequence of states that can generate the given observation sequence through the Viterbi algorithm.

Most probable sequence for the given observation sequence is { }


Finding the most probable sequence:

Given the sequence of observations O : {O1, O2, ..., OT}, we now have to find the most probable sequence of hidden states Q : {Q1, Q2, ..., QT} that can generate the given observations.

If Vj,k is the probability of the most probable state sequence responsible for the first j observations that has k as its final state, then

Vj,k = maxqϵS( P(Oj | k) * Tq,k * Vj-1,q ), where V1,k = P(O1 | k) * Πk

The Viterbi path can be retrieved by saving back pointers that remember which state q was used in the equation above. Let Ptr(k, j) be the function that returns the value of q used to compute Vj,k if j > 1, or k if j = 1. Then:

qT = argmaxqϵS( VT,q )

qj-1 = Ptr(qj, j)


TRAINING A HMM: BAUM-WELCH ALGORITHM

We have seen how to calculate the probability of a certain hidden state given an observation sequence, and also how to find the most likely hidden state sequence. But how can we say for sure that our prediction of the hidden states is close to the actual sequence? And if it isn’t, how can our algorithm become better at prediction?

In other words, how do we train our HMM model?

This section is to answer that particular question.

So, recall the forward and backward probabilities of our HMM:

αt(i) = P(Qt = Si, O1, O2, ......, Ot)

βt(i) = P(Ot+1, Ot+2, ....... OT | Qt = Si)

We now observe a few results which will be helpful later:

(i) By the chain rule of probability:

P(O1, O2, ..., OT, Qt = Si) = P(O1, O2, ..., Ot, Qt = Si)*P(Ot+1, Ot+2, ....... OT | Qt = Si)

= αt(i)*βt(i)

This means that the probability of being in state ‘Si’ at time ‘t’ while generating the whole observation sequence is αt(i)*βt(i).

The summation of αt(i)*βt(i) over all time ‘t’ measures how much time the model ever spends in state Si.

(ii) P(O1, O2, ..., OT) = ∑i P(O1, O2, ..., OT, Qt = Si) = ∑i αt(i)*βt(i), for any time t

So the summation of αt(i)*βt(i) over all states ‘Si’ gives the probability of observing the observation sequence ‘O’.

Therefore, the probability of being in state Si given the entire observation sequence ‘O’ is

P(Qt = Si | O1, O2, ..., OT) = P(O1, O2, ..., OT, Qt = Si) / P(O1, O2, ..., OT)

= αt(i)*βt(i) / ∑j αt(j)*βt(j)

For ease, we define γi(t) = P(Qt = Si | O)

Example,(calculation of gamma)

Re-estimation of Emission Probability:

As seen in (i), the summation of αt(i)*βt(i) over all time ‘t’ measures how often the model is ever in state Si. The emission probability for a given state ‘Si’, P(Ok | Qt = Si), can now be re-estimated as a weighted count over the corpus.

P(Ok | Qt = Si) = chances of observing Ok from state Si over all possible observations

= sum of γi(t) over times when Ok is observed / sum of γi(t) over all time

= ∑{t when Ok is observed} αt(i)*βt(i) / ∑t αt(i)*βt(i)

Example,(calculation of new emission prob.)

Re-estimation of Transition Probability:

For any time ‘t’, the likelihood of a transition from state i to j is the product of:

the forward probability of getting to Si : αt(i)

the probability of taking the transition : Ti,j * P(Ot+1 | Sj)

the backward probability of completing the observation sequence from state Sj : βt+1(j)

i.e., P(Qt = Si, Qt+1 = Sj and O1 ... OT) = αt(i) * Ti,j * P(Ot+1 | Sj) * βt+1(j)

Thus, the probability of taking this transition at any time during the process is the sum of these terms over all time t = ∑t αt(i) * Ti,j * P(Ot+1 | Sj) * βt+1(j)

The probability of ever being in state Si at any time = ∑t αt(i)*βt(i)

Therefore, the new transition probability = ∑t αt(i) * Ti,j * P(Ot+1 | Sj) * βt+1(j) / ∑t αt(i)*βt(i)

Example,(calculation of new transition prob.)
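One full re-estimation step can be sketched as follows, reusing the forward and backward recursions from the earlier sections. The transition matrix and the t = 1 forward values are the same assumed/inferred ones used throughout these sketches; they are not stated explicitly in the text.

```python
states = ["S", "H"]
obs = ["A", "M", "M", "A"]
T = {"S": {"S": 0.6, "H": 0.4}, "H": {"S": 0.2, "H": 0.8}}  # assumed
E = {"S": {"A": 0.1, "M": 0.9}, "H": {"A": 0.8, "M": 0.2}}  # from the text
alpha_1 = {"S": 0.001, "H": 0.81}                           # as before

# forward and backward passes (same recursions as the earlier sections)
alpha = [dict(alpha_1)]
for o in obs[1:]:
    prev = alpha[-1]
    alpha.append({i: sum(prev[j] * T[j][i] for j in states) * E[i][o]
                  for i in states})
beta = [{i: 1.0 for i in states}]
for o in reversed(obs[1:]):
    nxt = beta[0]
    beta.insert(0, {i: sum(E[j][o] * T[i][j] * nxt[j] for j in states)
                    for i in states})

n = len(obs)
prob_O = sum(alpha[-1].values())  # P(O) = sum_i alpha_T(i), since beta_T = 1

# gamma_i(t) = alpha_t(i) * beta_t(i) / P(O)
gamma = [{i: alpha[t][i] * beta[t][i] / prob_O for i in states}
         for t in range(n)]

# re-estimated emission: time spent in S_i emitting O_k / time spent in S_i
new_E = {i: {o: sum(gamma[t][i] for t in range(n) if obs[t] == o) /
                sum(gamma[t][i] for t in range(n))
             for o in ["A", "M"]} for i in states}

# re-estimated transition: expected i -> j transitions / time spent in S_i
new_T = {}
for i in states:
    denom = sum(alpha[t][i] * beta[t][i] for t in range(n - 1))
    new_T[i] = {j: sum(alpha[t][i] * T[i][j] * E[j][obs[t + 1]] * beta[t + 1][j]
                       for t in range(n - 1)) / denom
                for j in states}
```

Each γ vector, each row of `new_E` and each row of `new_T` sums to 1, which is a useful sanity check on the update formulas.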

These quantities are usually written compactly as:

ξi,j(t) = P(Qt = Si, Qt+1 = Sj | O)

Evidently, ∑j=1..N ξi,j(t) = γi(t)

We can write γi(t) = αt(i)*βt(i) / ∑j=1..N αt(j)*βt(j)

Consequently, ξi,j(t) = αt(i)*T(Sj|Si)*E(Ot+1|Sj)*βt+1(j) / ∑j=1..N αt(j)*βt(j)

APPLICATIONS:

A prototypical application would be tracking sequences of sign-language gestures. There is information at each frame about which gesture is present, but it may be ambiguous. However, using the HMM we can impose prior knowledge that certain signs are more likely to follow others, and get an improved result.

Forward Algorithm:

The forward algorithm gives us the probability of an observed sequence in the HMM. So why do we need to find the probability of an observed sequence? This type of problem occurs in speech recognition, where a large number of Markov models are used, each one modelling a particular word. Each utterance of a word gives us a set of observation variables, and we pick the Markov model that assigns the highest probability to this observation sequence. The forward algorithm is also an important sub-routine of the forward-backward algorithm.

A naive way to find the probability of an observed sequence is to sum over all possible sequences of the hidden states. For example, consider the weather, given the observation of a man either carrying or not carrying an umbrella.

Observations : No , Yes [ does he carry an umbrella or not ] (analogous to Missed class / Attends class in the student example)

States : Rain , Sun (analogous to Sick / Healthy)

Now given a sequence of observations :

O = No,Yes,Yes

We can compute :

P(O) = P(O|Sun,Sun,Sun) * P(Sun,Sun,Sun) + P(O|Sun,Rain,Sun) * P(Sun,Rain,Sun) + P(O|Sun,Sun,Rain) * P(Sun,Sun,Rain) + .......... + P(O|Rain,Rain,Rain) * P(Rain,Rain,Rain)

or

P(O) = ∑ P(O|s1,s2,s3) * P(s1,s2,s3)

where the above summation is over all possible combinations of s1, s2, s3

Hence for an observation sequence of length ‘N’, and with ‘K’ states, we are required to compute around K^N path probabilities in total. Calculating the probability in this manner is computationally expensive. Hence, by using the time invariance of the emission and transition probabilities, the forward algorithm reduces the total number of computations.
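The brute-force sum over all K^N paths can be written directly. The probabilities below are purely illustrative assumptions for the umbrella example; none of them are given in the text.

```python
from itertools import product

states = ["Rain", "Sun"]
pi = {"Rain": 0.4, "Sun": 0.6}                       # assumed
T = {"Rain": {"Rain": 0.7, "Sun": 0.3},              # assumed
     "Sun": {"Rain": 0.3, "Sun": 0.7}}
E = {"Rain": {"Yes": 0.9, "No": 0.1},                # assumed
     "Sun": {"Yes": 0.2, "No": 0.8}}
obs = ["No", "Yes", "Yes"]

# enumerate every hidden path (s1, s2, s3): K^N = 2^3 = 8 terms
p_naive = 0.0
for path in product(states, repeat=len(obs)):
    p = pi[path[0]] * E[path[0]][obs[0]]             # P(s1) * P(O1|s1)
    for t in range(1, len(obs)):
        p *= T[path[t - 1]][path[t]] * E[path[t]][obs[t]]
    p_naive += p

print(p_naive)  # P(O), summed over all 8 hidden paths
```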

The forward algorithm keeps track of the partial probabilities 𝞪t(k). The partial probability 𝞪t(k) is the probability of reaching state ‘k’ at stage ‘t’, accumulated over all possible paths through the first ‘t’ stages. We see that the probability 𝞪t+1(k) can be computed with the knowledge of 𝞪t(x) ∀ x, where x represents a state.

Consider the same example, as used above.Let an observation sequence

O = {o1,o2,o3,o4 …….on } be observed where ot is the observation at any stage ‘t’. In order to compute 𝞪t+1(Rain) we see that:

𝞪t+1(Rain) = emission_prob( Ot+1 when Rain) * 𝞪t(Rain) * transition_prob(Rain->Rain)

+ emission_prob( Ot+1 when Rain) * 𝞪t(Sun) * transition_prob(Sun->Rain)

We know that either ‘Rain’ or ‘Sun’ occurs at stage ‘t’. Also, we have already computed 𝞪t(Rain) and 𝞪t(Sun), which account for all possible paths up to ‘t’ stages. Hence 𝞪t+1(Rain) is computed using the fact that ‘Rain’ or ‘Sun’ at stage ‘t’ can both transition to ‘Rain’ at stage ‘t+1’. This gives us all possible paths to ‘Rain’ at stage ‘t+1’. Additionally, observation Ot+1 is seen at stage ‘t+1’, which is also factored into the probability 𝞪t+1(Rain) in terms of the emission probability.

We can similarly compute 𝞪t+1(Sun). Hence in general :

𝞪t+1(s) = em_prob (s,Ot+1 ) * ∑x { 𝞪t(x) * trans_prob( x->s) }

where the summation is applied over all states ‘x’ and ‘s’ is one particular state.

Hence we can loop over the stages ‘t’ to compute the partial probabilities at any stage. The probability of occurrence of an observation sequence of length ‘N’ is

p(O) = ∑x 𝞪N(x) : where summation is over all states ‘x’


The above holds because 𝞪N(x) gives us the probability of all paths ending in state ‘x’ after ‘N’ stages. Since any final state can occur, we sum 𝞪N(x) over all possible states ‘x’.

Hence we can iteratively compute 𝞪t(x) for every stage, and all that is left is the initialization.

We see that :

𝞪1(x) = probability of being in state ‘x’ with observation O1 after the first stage

=> 𝞪1(x) = start_prob(x) * emission_prob( O1 in state x)

Hence the number of computations is greatly reduced, compared to our naive method.

The entire method is summarised in the below pseudo-code:

Let observation sequence be O = {o1,o2,o3,o4 …….on }


Pseudo-Code:

//Initialize : 𝞪1(x)

for each state x:

𝞪1(x) = start_prob(x) * emission_prob( O1 in state x)

//loop over the remaining stages, where ‘N’ is the number of stages

for (t = 2 ; t <= N ; t++):

{

for each state s:

{

𝞪t(s) = 0

for each state x:

𝞪t(s) = 𝞪t(s) + em_prob(s, Ot) * ( 𝞪t-1(x) * trans_prob(x->s) )

}

}

//Total probability is 𝞪N summed over all the states

total_prob = 0

for each state s:

total_prob = total_prob + 𝞪N(s)

//Return the total probability of the observation sequence O

return total_prob
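The pseudo-code translates almost line for line into Python. The Rain/Sun probabilities below are illustrative assumptions (the text gives no numbers for this example).

```python
states = ["Rain", "Sun"]
start_prob = {"Rain": 0.4, "Sun": 0.6}               # assumed
trans_prob = {"Rain": {"Rain": 0.7, "Sun": 0.3},     # assumed
              "Sun": {"Rain": 0.3, "Sun": 0.7}}
em_prob = {"Rain": {"Yes": 0.9, "No": 0.1},          # assumed
           "Sun": {"Yes": 0.2, "No": 0.8}}

def forward_prob(obs):
    # Initialize: alpha_1(x) = start_prob(x) * em_prob(O1 in state x)
    alpha = {x: start_prob[x] * em_prob[x][obs[0]] for x in states}
    # Loop over the remaining stages t = 2..N
    for o in obs[1:]:
        alpha = {s: em_prob[s][o] * sum(alpha[x] * trans_prob[x][s]
                                        for x in states)
                 for s in states}
    # Total probability is alpha_N summed over all the states
    return sum(alpha.values())

print(forward_prob(["No", "Yes", "Yes"]))  # same value as the naive sum
```

This takes O(N*K^2) operations instead of the O(K^N) of the naive enumeration, while producing exactly the same probability.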

Forward-Backward Algorithm:

FB algorithm computes the posterior probability distribution of the states given the observations.

Suppose a doctor observes a villager's responses {“normal”, “dizzy”, “dizzy”, “cold”} on four consecutive days. The FB algorithm then computes the probability of the villager being “healthy” (or having a “fever”) on any of those four days; for example, you could compute the probability of the person being healthy on the third day.

Example 2: Handwriting recognition

Hidden states : the actual letters of the alphabet

Observed variables : the messy letters that one actually writes

Given the observed sequence of messy letters, the FB algorithm allows us to compute the posterior probabilities of the internal hidden states, i.e. of the actual letters.

This algorithm relies on the idea of Dynamic Programming, which is a fancy name for saying that we reuse things we have already computed to make things faster.

The forward part computes the joint probability of state Sk and the first k observations O1:k, and does this ∀k=1…n

The backward part computes the probability of the remaining observations Ok+1:n given state Sk, again for each k.

The last part of the algorithm is to merge the forward and the backward part to give the full posterior probabilities.

Before jumping to calculate Forward and Backward Probabilities , we should know following probability properties:

Marginalisation of probability:

P(Sk, O1:k) = ∑ P(Sk, Sk-1, O1:k-1, Ok)  {summation is over all possible states Sk-1}

Chain rule of probability:

P(S1, O1) = P(O1 | S1) * P(S1)

We want to use dynamic programming, so we want to express P(Sk, O1:k) as a function of k-1, so that we can set up a recursion. We can bring Sk−1 into the equation using marginalisation.


Forward probability, F(k) = P(Sk, O1:k)

Using marginalisation, it can be written as

= ∑ P(O1:k-1, Sk-1, Ok, Sk) , here the summation is over all possible states Sk-1

Using the chain rule, the above equation can be written as

= ∑ P(O1:k-1, Sk-1) * P(Ok, Sk | O1:k-1, Sk-1)

In the second term inside the summation, O1:k-1 can be dropped (Markov property):

= ∑ P(O1:k-1, Sk-1) * P(Ok, Sk | Sk-1)

= ∑ P(O1:k-1, Sk-1) * P(Sk | Sk-1) * P(Ok | Sk)

(here P(Ok | Sk, Sk-1) reduces to P(Ok | Sk), since the kth observation depends solely on the kth state)

= ∑ F(k-1) * P(Sk | Sk-1) * P(Ok | Sk)

The first term is the recursion term over k-1, so we compute this value ∀k=2…n. The second term is the transition probability and the third term is the emission probability, both of which we assume are known.

Calculating the forward probability for the villager example:

Suppose villager A's responses over four consecutive days are {“normal”, “dizzy”, “dizzy”, “cold”}. Given the first two days of observations, the probability of villager A being in the “Healthy” state on the second day can be calculated as follows:

F(2) = P(S2, O1:2) = P(“Healthy”, ”normal”, ”dizzy”)

using marginalisation and summing over all previously possible states:

= P(“Healthy”, ”Healthy”, ”normal”, ”dizzy”) + P(“Healthy”, ”Fever”, ”normal”, ”dizzy”)

using the chain rule to break up the terms:

= P(”Healthy”, ”normal”) * P(“dizzy”, “Healthy” | “Healthy”, “normal”) + P(”Fever”, ”normal”) * P(“dizzy”, “Healthy” | “Fever”, “normal”)

Now we can drop “normal” in both conditional terms because of the Markov property:

= P(”Healthy”, ”normal”) * P(“dizzy”, “Healthy” | “Healthy”) + P(”Fever”, ”normal”) * P(“dizzy”, “Healthy” | “Fever”)

= P(“Healthy”, “normal”) * P(“Healthy” | ”Healthy”) * P(“dizzy” | “Healthy”) + P(“Fever”, “normal”) * P(“Healthy” | ”Fever”) * P(“dizzy” | “Healthy”)

Backward probability:

For the backward part, we compute the values in a similar way, by using marginalization and the chain rule.

B(k) = probability of seeing Ok+1, Ok+2, ..., Om given that the state at time k is Sk

B(k) = P(Ok+1:m | Sk)

Using marginalisation of probability,

= ∑ P(Ok+2:m, Ok+1, Sk+1 | Sk)  (summation is over all possible states Sk+1)

Using the chain rule of probability,

= ∑ P(Ok+2:m | Ok+1, Sk+1, Sk) * P(Ok+1, Sk+1 | Sk)

In the first term we can drop Ok+1 and Sk, as the state Sk+1 alone determines the sequence Ok+2:m:

= ∑ P(Ok+2:m | Sk+1) * P(Sk+1 | Sk) * P(Ok+1 | Sk+1)

= ∑ B(k+1) * P(Sk+1 | Sk) * P(Ok+1 | Sk+1)

Again, the first term is the recursion term over k+1, so we compute this value ∀k=1…n-1, working backwards. The second term is the transition probability and the third term is the emission probability, both of which we assume are known.

Taking same example of Villager A again,

B(2) = P(O3:4 | S2) = P(“dizzy”, ”cold” | ”Healthy”)

marginalising over all next possible hidden states,

= P(“dizzy”, S3=”Healthy”, ”cold” | “Healthy”) + P(“dizzy”, S3=”Fever”, ”cold” | “Healthy”)

using the chain rule and the recursion:

= P(“Healthy” | “Healthy”) * P(“dizzy” | “Healthy”) * B(3 for “Healthy”) + P(“Fever” | “Healthy”) * P(“dizzy” | “Fever”) * B(3 for “Fever”)

Combining the forward and backward probabilities:

The FB algorithm computes the posterior marginals of all hidden states given the sequence of observed variables.

P(Sk | O1:m) ∝ P(Ok+1:m | Sk, O1:k) * P(Sk, O1:k)

= P(Ok+1:m | Sk) * P(Sk, O1:k)

= B(k) * F(k)

Again taking the same example of villager A:

The probability of villager A being in state “Healthy” on the second day is proportional to the probability of being healthy after seeing the observations (“normal”, “dizzy”), multiplied by the probability of seeing the sequence (“dizzy”, “cold”) from the “Healthy” state on day 2:

= P(“Healthy”, ”normal”, ”dizzy”) * P(“dizzy”, ”cold” | ”Healthy”)

The first term is the forward probability F(2), while the second is the backward probability B(2).

Summary of the Forward-Backward algorithm:

The FB algorithm picks a particular hidden state and finds its probability, given an observation sequence.

Our hidden state S at time t must account for the observed sequence both before t and after t.

So,

First, we calculate the probability of a state S at time t given the past observations, using the forward algorithm.

Then, given this state S, we calculate the probability of the future observations (which we already know), using the backward algorithm.

Multiplying both probabilities gives us the probability of the hidden state being S at time t.
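These steps can be sketched end to end for the villager example. All numeric probabilities below are illustrative assumptions; the text does not give parameters for this example.

```python
states = ["Healthy", "Fever"]
pi = {"Healthy": 0.6, "Fever": 0.4}                          # assumed
T = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},              # assumed
     "Fever": {"Healthy": 0.4, "Fever": 0.6}}
E = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},  # assumed
     "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}

def posteriors(obs):
    """P(S_k | O) for every day k, via F(k) * B(k) / P(O)."""
    # forward pass: F[k][i] = P(S_k = i, O_1..O_k)
    F = [{i: pi[i] * E[i][obs[0]] for i in states}]
    for o in obs[1:]:
        prev = F[-1]
        F.append({i: E[i][o] * sum(prev[j] * T[j][i] for j in states)
                  for i in states})
    # backward pass: B[k][i] = P(O_{k+1}..O_m | S_k = i)
    B = [{i: 1.0 for i in states}]
    for o in reversed(obs[1:]):
        nxt = B[0]
        B.insert(0, {i: sum(T[i][j] * E[j][o] * nxt[j] for j in states)
                     for i in states})
    prob_O = sum(F[-1].values())
    return [{i: F[k][i] * B[k][i] / prob_O for i in states}
            for k in range(len(obs))]

post = posteriors(["normal", "dizzy", "dizzy", "cold"])
print(post[2]["Healthy"])  # P(villager healthy on day 3 | all four observations)
```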

Applications of the Forward-Backward algorithm:

Inference: for example, change detection.

Estimating parameters: this is done using the Baum-Welch algorithm, coupling the forward-backward algorithm with expectation maximization.

Sampling from the posterior distribution.