CC-BY
Fabian M. Suchanek

Probability Theory

A type-safe introduction to Probability Theory,
starring Shrek and Puss
In all of the following, if there is a definition followed by a “special case”, it is fully sufficient to
know and understand the special case for the purpose of most applications in Information
Extraction, including Markov Random Fields, Hidden Markov Models, Markov Chains, and Conditional
Random Fields.
Overview

                            only visible variables    visible and invisible variables
Chains                      Markov Chains             Hidden Markov Models
Complex dependencies        Markov Random Fields      Conditional Random Fields
and/or feature functions
Introduction to Probabilities
Def: Universe

A universe (also: sample space) is the set of all possible worlds
(also: possible states of the world, outcomes of a random trial).
Special case: Universe

Given n named sets X1, ..., Xn, we use as universe their Cartesian
product: Ω = X1 × ... × Xn. Each possible world is then a tuple with
one value per named set.

Example: X1 = {Shrek, Puss}, X2 = {shouts, purrs}

world    X1       X2
w1       Shrek    shouts
w2       Puss     purrs
w3       Shrek    purrs
w4       Puss     shouts
Set of events / Sigma algebra

A sigma algebra (also: set of events) over a universe Ω is a set of
subsets of the universe that fulfills certain conditions. It is the
set of “allowable events”.

Special case: We use as sigma-algebra the set of all subsets of the
universe: F = 2^Ω.
Def: Probability

A probability (also: probability measure, probability distribution)
is a function P: F → [0,1] such that
• P(∅) = 0
• P(Ω) = 1
• P(A1 ∪ ... ∪ An) = P(A1) + ... + P(An)
  for a finite number of pairwise disjoint sets A1, ..., An.

For general (not necessarily disjoint) sets A1, ..., An, we have the
“union bound”:
P(A1 ∪ ... ∪ An) ≤ P(A1) + ... + P(An)
...because the Ai may overlap, in which case the right-hand side
counts the overlapping worlds several times.
Special case: Probability

We use as probability a function that maps every possible world to [0,1]:
P: Ω → [0,1]

world    X1       X2        P
w1       Shrek    shouts    P(w1)=0.4
w2       Puss     purrs     P(w2)=0.3
w3       Shrek    purrs     P(w3)=0.1
w4       Puss     shouts    P(w4)=0.2
Special case: Probability

For our probability function, we define P(A) := Σ_{w∈A} P(w) for A ⊆ Ω.

Example: P({w1,w2}) = P(w1) + P(w2) = 0.4 + 0.3 = 0.7
Def: Probability space

A probability space is a triple (Ω, F, P) of a universe, a sigma
algebra, and a probability measure.

We assume a fixed probability space from now on.
Def: Random variable

A random variable is a function that takes a possible world and
returns a value (also: state).
(The random variable “extracts” a feature from the state of the world.)

Example: the random variable A: Ω → {scary, cosy} describes the
atmosphere of a world:
A(w1)=scary, A(w2)=cosy, A(w3)=cosy, A(w4)=cosy
Are Random Variables random?

Random variables are
• neither random
• nor variables.
They take a possible world and return a characteristic of that world.

[Wikipedia/Random Variables]
Special case: Random variable

In our universe Ω = X1 × ... × Xn, the named sets can be considered
random variables: Xi maps a world to its Xi-component. We consider
only these as random variables.

Example: X1(w1)=Shrek, X1(w2)=Puss, X1(w3)=Shrek, X1(w4)=Puss
Def: Events

An event is a subset of the universe.

Example: the event where someone purrs is {w2, w3}.
Special case: Events

We define an event as “X=x” := {w : w ∈ Ω, X(w)=x}.
Note that X=x is not a statement, but a shorthand notation for a set
of possible worlds.

Example: “X2=purrs” = {w2, w3}
Events

An event has a probability:
P(X1=Shrek) = P({w1,w3}) = P(w1) + P(w3) = 0.5
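The special case above is easy to compute with. Below is a minimal Python sketch that encodes the running example; the helper names `event` and `prob` are my own, not part of the slides:

```python
# The running example: four possible worlds over X1 x X2, with their probabilities.
WORLDS = {
    "w1": {"X1": "Shrek", "X2": "shouts"},
    "w2": {"X1": "Puss",  "X2": "purrs"},
    "w3": {"X1": "Shrek", "X2": "purrs"},
    "w4": {"X1": "Puss",  "X2": "shouts"},
}
P = {"w1": 0.4, "w2": 0.3, "w3": 0.1, "w4": 0.2}

def event(var, value):
    """The event 'var=value' as a set of possible worlds."""
    return {w for w, state in WORLDS.items() if state[var] == value}

def prob(worlds):
    """P(A): sum the probabilities of the worlds in the event A."""
    return sum(P[w] for w in worlds)

print(event("X2", "purrs"))        # the set {w2, w3}
print(prob(event("X1", "Shrek")))  # P({w1,w3}) = 0.5
```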
Set operations on events

X=x is a set, and can hence undergo set operations such as ∩ and ∪.

Example: (X1=Shrek) ∩ (X2=purrs) = {w1,w3} ∩ {w2,w3} = {w3}
Syntax: Intersections

For the intersection of events, we write
• a comma-separated list: P(X=x, Y=y) := P(X=x ∩ Y=y)
• vectors: P(X̄=x̄) := P(X1=x1, ..., Xn=xn)
• single values: P(x) := P(X=x), if the random variable is clear
  from the context
Def: Conditional probability

The conditional probability of an event A given an event B is
P(A|B) := P(A ∩ B) / P(B)
With random variables:
P(X=x | Y=y) = P(X=x ∩ Y=y) / P(Y=y)
This entails:
P(X=x ∩ Y=y) = P(Y=y) × P(X=x|Y=y)

Intuition: look only at the worlds where B happens. Among these,
compute the ratio of the probability mass where also A happens.
Example: Conditional probability

Look only at the worlds where someone purrs, i.e., w2 and w3:

world    X1       X2       P
w2       Puss     purrs    P(w2)=0.3
w3       Shrek    purrs    P(w3)=0.1

P(X1=Shrek | X2=purrs) = P(X1=Shrek ∩ X2=purrs) / P(X2=purrs)
                       = P({w3}) / P({w2,w3})
                       = 0.1 / 0.4 = 0.25

If someone purrs, it’s unlikely that it’s Shrek.
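The same computation can be sketched in Python on the running example (the helper names `event`, `prob`, and `cond_prob` are my own):

```python
WORLDS = {
    "w1": {"X1": "Shrek", "X2": "shouts"},
    "w2": {"X1": "Puss",  "X2": "purrs"},
    "w3": {"X1": "Shrek", "X2": "purrs"},
    "w4": {"X1": "Puss",  "X2": "shouts"},
}
P = {"w1": 0.4, "w2": 0.3, "w3": 0.1, "w4": 0.2}

def event(var, value):
    return {w for w, state in WORLDS.items() if state[var] == value}

def prob(worlds):
    return sum(P[w] for w in worlds)

def cond_prob(a, b):
    """P(A|B) = P(A ∩ B) / P(B): restrict attention to the worlds in B."""
    return prob(a & b) / prob(b)

# If someone purrs, it's unlikely that it's Shrek:
print(cond_prob(event("X1", "Shrek"), event("X2", "purrs")))  # 0.1 / 0.4 = 0.25
```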
Def: Independence

Two events A and B are independent if
P(A ∩ B) = P(A) × P(B)
This entails P(A|B) = P(A) and P(B|A) = P(B): knowing B does not
influence the probability of A, and vice versa.

In the example, the events X1=Shrek and X2=shouts are not independent
(shouting makes it likely that it’s Shrek):
P(X1=Shrek ∩ X2=shouts) = P({w1}) = 0.4
P(X1=Shrek) × P(X2=shouts) = 0.5 × 0.6 = 0.3 ≠ 0.4
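Independence can be checked numerically. A sketch on the running example (the helper names are my own, and the tolerance `eps` is a pragmatic choice for floating-point comparison):

```python
WORLDS = {
    "w1": {"X1": "Shrek", "X2": "shouts"},
    "w2": {"X1": "Puss",  "X2": "purrs"},
    "w3": {"X1": "Shrek", "X2": "purrs"},
    "w4": {"X1": "Puss",  "X2": "shouts"},
}
P = {"w1": 0.4, "w2": 0.3, "w3": 0.1, "w4": 0.2}

def event(var, value):
    return {w for w, state in WORLDS.items() if state[var] == value}

def prob(worlds):
    return sum(P[w] for w in worlds)

def independent(a, b, eps=1e-9):
    """A and B are independent iff P(A ∩ B) = P(A) * P(B)."""
    return abs(prob(a & b) - prob(a) * prob(b)) < eps

shrek, shouts = event("X1", "Shrek"), event("X2", "shouts")
print(prob(shrek & shouts))        # P({w1}) = 0.4
print(prob(shrek) * prob(shouts))  # 0.5 * 0.6 = 0.3
print(independent(shrek, shouts))  # False
```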
Conditional Independence

Two events A and B are conditionally independent given C, if
P(A ∩ B | C) = P(A|C) × P(B|C)
This means: once C is known, knowing B does not change the
probability of A, and vice versa.

We extend the example by a third variable X3:

world    X1       X2        X3         P
w1       Shrek    shouts    loudly     P(w1)=0.4
w2       Puss     purrs     quietly    P(w2)=0.3
w3       Shrek    purrs     quietly    P(w3)=0.1
w4       Puss     shouts    loudly     P(w4)=0.2

In the example, all events with X1 and X2 are independent given X3.
Chain Rule

From the definition of P(A|B), we can prove the chain rule:
P(A ∩ B ∩ C) = P(A | B ∩ C) × P(B|C) × P(C)

In the example:
P(X1=Shrek ∩ X2=shouts ∩ X3=loudly)
  = P(X1=Shrek | X2=shouts ∩ X3=loudly) × P(X2=shouts | X3=loudly) × P(X3=loudly)
  = (0.4 / (0.4+0.2)) × 1 × 0.6
  = 0.4
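The chain rule can be verified on the extended example. A sketch in Python (the helper names are my own):

```python
# Extended running example with the third variable X3.
WORLDS = {
    "w1": {"X1": "Shrek", "X2": "shouts", "X3": "loudly"},
    "w2": {"X1": "Puss",  "X2": "purrs",  "X3": "quietly"},
    "w3": {"X1": "Shrek", "X2": "purrs",  "X3": "quietly"},
    "w4": {"X1": "Puss",  "X2": "shouts", "X3": "loudly"},
}
P = {"w1": 0.4, "w2": 0.3, "w3": 0.1, "w4": 0.2}

def event(var, value):
    return {w for w, state in WORLDS.items() if state[var] == value}

def prob(worlds):
    return sum(P[w] for w in worlds)

def cond_prob(a, b):
    return prob(a & b) / prob(b)

a = event("X1", "Shrek")
b = event("X2", "shouts")
c = event("X3", "loudly")

lhs = prob(a & b & c)                                  # P(A ∩ B ∩ C) = P({w1}) = 0.4
rhs = cond_prob(a, b & c) * cond_prob(b, c) * prob(c)  # (0.4/0.6) * 1 * 0.6
print(abs(lhs - rhs) < 1e-9)  # True: both sides equal 0.4
```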
Chain Rule + Independence

If A and B are independent given C, then
P(A ∩ B ∩ C) = P(A | B ∩ C) × P(B|C) × P(C)
             = P(A|C) × P(B|C) × P(C)

In the example, the events with X1 and X2 are independent given X3:
P(X1=Shrek ∩ X2=shouts ∩ X3=loudly)
  = P(X1=Shrek | X3=loudly) × P(X2=shouts | X3=loudly) × P(X3=loudly)
  = (0.4 / (0.4+0.2)) × 1 × 0.6
  = 0.4
Probabilistic Classification

Classification is the task of predicting the most likely value for
one variable (the class) given the values for the others (the
features): choose the x3 that maximizes P(X3=x3 | X1=x1, X2=x2).

world    X1       X2        X3 (class)    P
w1       Shrek    shouts    loudly        P(w1)=0.4
w2       Puss     purrs     quietly       P(w2)=0.3
w3       Shrek    purrs     quietly       P(w3)=0.1
w4       Puss     shouts    ?             P(w4)=?

Here, X1 and X2 are the features, and X3 is the class.
Probabilistic Classification

How do we compute P(X3=x3 | X1=x1, X2=x2)?
-> The expression is monolithic; we need an independence assumption
   to break it up.

Can we assume that all variables are independent?
-> No, because then no feature would be allowed to correlate with
   (= be able to predict) the class variable.
Probabilistic Classification

Let’s assume instead that the features are independent given the class:
P(X1=x1 ∩ X2=x2 | X3=x3) = P(X1=x1 | X3=x3) × P(X2=x2 | X3=x3)
Naïve Bayes Classification

Naïve Bayes Classification assumes that all features are independent
given the class, and applies the chain rule: choose the x3 that
maximizes
P(X1=x1 | X3=x3) × P(X2=x2 | X3=x3) × P(X3=x3)

In the example, for the new data point (Puss, shouts):
• x3=loudly:  P(Puss|loudly) × P(shouts|loudly) × P(loudly)
• x3=quietly: P(Puss|quietly) × P(shouts|quietly) × P(quietly)
(The zero factor P(shouts|quietly)=0 is replaced by a small value
such as 0.1, see later.) The classifier predicts x3=loudly, and the
example assigns P(w4)=0.05 to the completed world.
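The whole classification step can be sketched as a toy implementation over the running example. All function names are my own; the 0.1 floor for zero probabilities mirrors the crude fix above (Laplace smoothing is the proper one):

```python
WORLDS = {
    "w1": {"X1": "Shrek", "X2": "shouts", "X3": "loudly"},
    "w2": {"X1": "Puss",  "X2": "purrs",  "X3": "quietly"},
    "w3": {"X1": "Shrek", "X2": "purrs",  "X3": "quietly"},
    "w4": {"X1": "Puss",  "X2": "shouts", "X3": "loudly"},
}
P = {"w1": 0.4, "w2": 0.3, "w3": 0.1, "w4": 0.2}

def event(var, value):
    return {w for w, state in WORLDS.items() if state[var] == value}

def prob(worlds):
    return sum(P[w] for w in worlds)

def cond_prob(a, b, floor=0.1):
    """P(A|B), with zero factors replaced by a small floor value."""
    p = prob(a & b) / prob(b)
    return p if p > 0 else floor

def naive_bayes(features, class_var="X3"):
    """Choose the class value c that maximizes P(c) * prod_i P(f_i | c)."""
    best, best_score = None, -1.0
    for c in {state[class_var] for state in WORLDS.values()}:
        c_event = event(class_var, c)
        score = prob(c_event)
        for var, value in features.items():
            score *= cond_prob(event(var, value), c_event)
        if score > best_score:
            best, best_score = c, score
    return best

print(naive_bayes({"X1": "Puss", "X2": "shouts"}))  # predicts "loudly"
```

Here, x3=loudly scores 0.6 × (1/3) × 1 = 0.2, while x3=quietly scores 0.4 × 0.75 × 0.1 = 0.03 after flooring, so "loudly" wins.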
Naïve Bayes Classification

Another way to see it (writing P(X) for P(X=x) for brevity):
Naïve Bayes chooses the class c that maximizes
P(F1|c) × ... × P(Fn|c) × P(c)

Weird: we predict the class c from P(Fi|c), the probability of the
features given the class?
Shouldn’t probable features get more weight?
-> P(Fi) is the same for every candidate class, so it would not
   change the prediction.
Shouldn’t the weight be proportional to the size of the class?
-> The factor P(c) does exactly that.
This is equivalent to the above: under the independence assumption,
the product equals P(F1, ..., Fn, c), which is proportional to
P(c | F1, ..., Fn) for fixed features.
More about Naïve Bayes

Naïve Bayes Classification assumes that all features are independent
given the class, and applies the chain rule.

• Naïve Bayes is derailed if some features are dependent
  (e.g., if a feature appears twice).
• If one of the features of the new data point is unknown
  (e.g., “shouts” is missing from the new line), its factor is omitted.
• Instead of choosing a small value for the zero probabilities, one
  uses Laplace Smoothing, where P(F=f|C=c) is estimated as
  (count(F=f ∩ C=c) + 1) / (count(C=c) + |values of F|)
• If a feature F is numeric and distributed as N(μ,σ), the
  probability P(F=f|C=c) is estimated by the density
  (1 / (σ √(2π))) × exp(−(f−μ)² / (2σ²))
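A sketch of the Laplace-smoothed estimate on a small, made-up training table (the rows and the function name are purely illustrative):

```python
# Hypothetical training data: one row per observed (feature, class) pair.
rows = [
    {"F": "shouts", "C": "loudly"},
    {"F": "shouts", "C": "loudly"},
    {"F": "purrs",  "C": "quietly"},
    {"F": "purrs",  "C": "quietly"},
    {"F": "shouts", "C": "quietly"},
]

def laplace(f, c, rows):
    """Estimate P(F=f|C=c) as (count(F=f ∩ C=c) + 1) / (count(C=c) + |values of F|)."""
    n_values = len({r["F"] for r in rows})                       # |values of F| = 2 here
    n_c  = sum(1 for r in rows if r["C"] == c)                   # count(C=c)
    n_fc = sum(1 for r in rows if r["F"] == f and r["C"] == c)   # count(F=f ∩ C=c)
    return (n_fc + 1) / (n_c + n_values)

# The unseen combination (purrs, loudly) gets a small but nonzero probability:
print(laplace("purrs",  "loudly", rows))   # (0+1) / (2+2) = 0.25
print(laplace("shouts", "loudly", rows))   # (2+1) / (2+2) = 0.75
```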
Syntax: Distribution

We define for a random variable X: P(X) := λv. P(X=v),
i.e., P(X)(v) = P(X=v).
P(X) is a function that takes a value v and returns the probability
P(X=v).

Example: P(X2)
shouts    0.6
purrs     0.4
Syntax: Joint Distribution

We define for random variables X, Y: P(X,Y) := λv,w. P(X=v ∩ Y=w).
P(X,Y) is a function that takes a value v for X and a value w for Y
and returns P(X=v ∩ Y=w) (“joint distribution”).

Example: P(X1,X2)
Shrek,shouts    0.4
Puss,purrs      0.3
Shrek,purrs     0.1
Puss,shouts     0.2
Syntax: Conditional Distribution

We define for random variables X, Y: P(X|Y) := λw. λv. P(X=v | Y=w).
P(X|Y) is a function that takes a value w for Y and returns a
distribution over X (“conditional distribution”).

Example: P(X1|X2) for X2=shouts
Shrek    0.66
Puss     0.33
Theorem: Marginals

For any random variables X, Y, we have
P(X=x) = Σ_y P(X=x ∩ Y=y) = Σ_y P(X=x | Y=y) × P(Y=y)

The probability of X=x can thus be computed if we have the
conditional probability P(X=x|Y=y) and P(Y=y).
P(X) is called the marginal distribution of the joint distribution P(X,Y).
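The theorem can be checked on the running example. A sketch (the helper names, including `marginal`, are my own):

```python
WORLDS = {
    "w1": {"X1": "Shrek", "X2": "shouts"},
    "w2": {"X1": "Puss",  "X2": "purrs"},
    "w3": {"X1": "Shrek", "X2": "purrs"},
    "w4": {"X1": "Puss",  "X2": "shouts"},
}
P = {"w1": 0.4, "w2": 0.3, "w3": 0.1, "w4": 0.2}

def event(var, value):
    return {w for w, state in WORLDS.items() if state[var] == value}

def prob(worlds):
    return sum(P[w] for w in worlds)

def cond_prob(a, b):
    return prob(a & b) / prob(b)

def marginal(var, value, other):
    """P(X=x) = sum over y of P(X=x|Y=y) * P(Y=y)."""
    total = 0.0
    for y in {state[other] for state in WORLDS.values()}:
        y_event = event(other, y)
        total += cond_prob(event(var, value), y_event) * prob(y_event)
    return total

# Marginalizing out X2 recovers P(X1=Shrek) = 0.5:
print(marginal("X1", "Shrek", "X2"))
```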
Summary: Prob’s, MRFs and CRFs

• An event is a set of possible worlds
• Probabilities are defined on events
• MRFs model limited dependencies
• CRFs are MRFs with hidden variables