Probability — A Complete Reference

From sample spaces to Bayes and beyond · 표본공간에서 베이즈까지

1. What is Probability?

확률이란 무엇인가?

Probability is the branch of mathematics that quantifies uncertainty. It assigns a number between 0 and 1 to every event, where 0 means impossible, 1 means certain, and values in between measure how strongly we expect the event. This simple idea underlies every model of risk, every statistical inference, and every machine-learning prediction.

Two interpretations dominate. The frequentist view defines P(A) as the long-run relative frequency: if you flip a fair coin forever, the proportion of heads tends to 0.5. The Bayesian view treats probability as a degree of belief that updates when evidence arrives — useful when no repeatable experiment exists, like “what is the probability it will rain tomorrow?” Both interpretations obey the same mathematical rules.

The modern foundation was laid by Andrey Kolmogorov in 1933 with three axioms. First, P(A) ≥ 0 for every event A. Second, P(Ω) = 1, where Ω is the sample space of all possible outcomes. Third, for any sequence of mutually exclusive events A₁, A₂, …, P(A₁ ∪ A₂ ∪ …) = P(A₁) + P(A₂) + …. Every theorem in probability — Bayes, expectation, the law of large numbers, the central limit theorem — flows from these three lines.

Why bother? Because every real-world model that touches uncertainty leans on probability: weather forecasts, insurance premiums, drug-trial results, search rankings, fraud detection, and the calibrated outputs of modern AI systems. Without probability, “uncertain” stays a vague word; with it, uncertainty becomes a quantity you can compute and compare.

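Probability is the mathematics of quantifying uncertainty as a number between 0 and 1. Frequentists interpret P(A) as the relative frequency in an infinitely repeated experiment and Bayesians as a degree of belief, but both operate identically on top of Kolmogorov’s three axioms (P ≥ 0, P(Ω) = 1, additivity over unions of disjoint events). Every field that deals with uncertainty, including weather forecasting, insurance, clinical drug trials, search ranking, autonomous driving, and AI model calibration, starts from these axioms.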

2. Sample Spaces & Events

표본공간과 사건

A sample space, written Ω (omega), is the set of every possible outcome of an experiment. For a single coin flip, Ω = {H, T}. For a six-sided die, Ω = {1, 2, 3, 4, 5, 6}. For two coins flipped together, Ω = {HH, HT, TH, TT} — four outcomes, not three, because HT and TH are distinct. Getting the sample space right is half of every probability problem; most student errors trace back to a wrong Ω.

An event is any subset of Ω. “Rolling an even number” on a die is E = {2, 4, 6}; “at least one head in two flips” is E = {HH, HT, TH}. Because events are sets, the language of set theory transfers directly. The union A ∪ B is “A or B (or both),” the intersection A ∩ B is “both A and B,” and the complement Aᶜ is “not A.” Their probabilities follow: P(Aᶜ) = 1 − P(A), and P(A ∪ B) = P(A) + P(B) − P(A ∩ B), where the last term removes the overlap counted twice.
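Because events are sets, the complement and addition rules above can be checked by brute force on a small sample space. The following is a minimal sketch, assuming a fair six-sided die; the helper `prob` and the second event `B` ("rolling more than 3") are illustrative names that do not come from the text.

```python
from fractions import Fraction

# Sample space for one fair die (equally likely outcomes are assumed).
omega = frozenset(range(1, 7))

def prob(event):
    """Classical probability |A| / |Omega| for an event A inside omega."""
    return Fraction(len(event & omega), len(omega))

A = {2, 4, 6}   # "rolling an even number"
B = {4, 5, 6}   # hypothetical second event: "rolling more than 3"

complement_A = omega - A
lhs = prob(A | B)                       # P(A or B)
rhs = prob(A) + prob(B) - prob(A & B)   # addition rule with the overlap removed

print(prob(complement_A) == 1 - prob(A))   # True
print(lhs, rhs)                            # 2/3 2/3
```

Exact fractions are used instead of floats so that the two sides of each identity can be compared with `==`.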

Two terms are routinely confused. Mutually exclusive (disjoint) events cannot happen at the same time: P(A ∩ B) = 0. Rolling a 2 and rolling a 5 on a single die are disjoint. Independent events can happen together but do not influence each other: P(A ∩ B) = P(A)·P(B). These are two different relationships, and §8 is devoted to disentangling them.

When Ω is finite and every outcome is equally likely, the classical formula P(A) = |A| / |Ω| applies. The probability of drawing an ace from a shuffled standard deck is 4/52 = 1/13 ≈ 7.7%. The formula works as long as the equally-likely assumption holds — a condition that quietly fails more often than students realize.

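The sample space Ω is the set of all possible outcomes, and an event is a subset of Ω. Union, intersection, and complement carry over directly to probability, giving identities such as P(A ∪ B) = P(A) + P(B) − P(A ∩ B). Disjoint (P(A ∩ B) = 0) and independent (P(A ∩ B) = P(A)P(B)) are entirely different concepts. When every outcome is equally likely, use the classical definition P(A) = |A| / |Ω|.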

3. Counting Principles

경우의 수 원리

When outcomes are equally likely, P(A) = |A| / |Ω| reduces to a counting problem — and counting cleanly is harder than it looks. The starting point is the multiplication rule: if a process has k stages with n₁, …, n_k options each, the total number of outcomes is n₁ · n₂ · … · n_k. A 6-character lowercase password has 26⁶ ≈ 309 million possibilities.
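The multiplication rule is easy to confirm numerically. This sketch computes the password count from the paragraph above and cross-checks the rule by explicit enumeration on a smaller, hypothetical 2-character case (enumerating all 6-character strings would be needlessly slow).

```python
import math
import string
from itertools import product

# Six stages, 26 options at each stage.
stage_sizes = [26] * 6
total_passwords = math.prod(stage_sizes)
print(total_passwords)   # 308915776, about 309 million

# Cross-check the rule by listing every 2-character lowercase string.
two_char_count = sum(1 for _ in product(string.ascii_lowercase, repeat=2))
print(two_char_count == 26 * 26)   # True
```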

A permutation counts ordered arrangements of k items chosen from n: P(n, k) = n! / (n − k)!. The number of ways to arrange 3 books from a shelf of 10 is P(10, 3) = 10·9·8 = 720. A combination counts unordered selections: C(n, k) = n! / (k!·(n − k)!). The number of 5-card poker hands from a 52-card deck is C(52, 5) = 2,598,960. The difference is whether order matters; nearly every counting error comes from mixing these two up.
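Python’s standard library exposes both counts directly as `math.perm` and `math.comb` (available from Python 3.8). The sketch below reproduces the two figures above and, for the book example, verifies by enumeration that each unordered selection corresponds to 3! ordered arrangements.

```python
import math
from itertools import combinations, permutations

ordered_books = math.perm(10, 3)   # ordered arrangements of 3 books from 10
poker_hands = math.comb(52, 5)     # unordered 5-card hands
print(ordered_books, poker_hands)  # 720 2598960

# Enumerate the book example both ways.
books = range(10)
n_ordered = len(list(permutations(books, 3)))
n_unordered = len(list(combinations(books, 3)))

# Each unordered choice of 3 books can be arranged in 3! = 6 orders.
print(n_ordered == n_unordered * math.factorial(3))   # True
```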

For overlapping events, inclusion-exclusion restores correctness: |A ∪ B| = |A| + |B| − |A ∩ B|. The same identity holds for probabilities: P(A ∪ B) = P(A) + P(B) − P(A ∩ B). Without subtracting the intersection, every element of A ∩ B would be counted twice.

Worked example: poker flush. A flush is 5 cards of the same suit. Counting straight and royal flushes as flushes, the number of flushes equals (suits) × (ways to choose 5 cards from 13 in that suit) = 4 · C(13, 5) = 4 · 1287 = 5148. The total number of 5-card hands is C(52, 5) = 2,598,960. So P(flush) = 5148 / 2,598,960 ≈ 0.00198, or about 1 in 505. Notice how cleanly the problem splits into “choose a suit” then “choose 5 cards within it,” and how every quantity is a combination, never a permutation, because the order of cards in a hand does not matter.
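The flush calculation can be reproduced in a few lines. This sketch follows the same "choose a suit, then choose 5 cards" split and, like the count of 5148 above, includes straight flushes among the flushes.

```python
import math
from fractions import Fraction

flushes = 4 * math.comb(13, 5)    # 4 suits x C(13, 5) hands within a suit
all_hands = math.comb(52, 5)      # every unordered 5-card hand

p_flush = Fraction(flushes, all_hands)
print(flushes, all_hands)         # 5148 2598960
print(p_flush)                    # 33/16660
print(round(float(p_flush), 5))   # 0.00198
print(round(1 / p_flush))         # 505, i.e. about 1 in 505
```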

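The multiplication rule says that the total number of outcomes of a multi-stage process is the product of the number of options at each stage. The permutation P(n, k) = n!/(n − k)! counts arrangements where order matters, and the combination C(n, k) = n!/(k!(n − k)!) counts selections where order does not. The probability of a flush in poker is 4·C(13, 5)/C(52, 5) ≈ 1/505. Unions of overlapping events are counted with the inclusion-exclusion principle |A ∪ B| = |A| + |B| − |A ∩ B|.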

4. Conditional Probability

조건부 확률

Conditional probability measures how the probability of A changes once we know B has occurred. The defining formula is P(A | B) = P(A ∩ B) / P(B), valid whenever P(B) > 0. Knowing B has happened restricts the sample space from Ω to B, and the probability of A inside that smaller world is the fraction of B that overlaps A. Drawing a king from a deck is 4/52; given that the card is a face card, the probability jumps to 4/12 = 1/3.
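The "restricted sample space" reading of P(A | B) can be made concrete by enumerating a deck. This is a sketch under the assumption of a standard 52-card deck with equally likely cards; `prob`, `is_king`, and `is_face` are illustrative helper names.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = list(product(ranks, suits))   # 52 (rank, suit) pairs

def is_king(card):
    return card[0] == "K"

def is_face(card):
    return card[0] in ("J", "Q", "K")

def prob(event, given=None):
    """P(event), or P(event | given) by shrinking the sample space to `given`."""
    space = deck if given is None else [c for c in deck if given(c)]
    return Fraction(sum(1 for c in space if event(c)), len(space))

p_king = prob(is_king)
p_king_given_face = prob(is_king, given=is_face)

# The same number from the defining formula P(A and B) / P(B).
by_definition = prob(lambda c: is_king(c) and is_face(c)) / prob(is_face)

print(p_king, p_king_given_face, by_definition)   # 1/13 1/3 1/3
```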

Rearranging produces the multiplication rule: P(A ∩ B) = P(A | B) · P(B) = P(B | A) · P(A). This computes the probability of two events both happening when one depends on the other. Drawing two aces without replacement has probability (4/52) · (3/51) ≈ 0.0045. Probability trees, where each branch is multiplied by the conditional probability along it, formalize this sequential reasoning.
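The two-aces figure can be checked two ways: by chaining conditional probabilities as in the text, and by counting ordered pairs of distinct cards. The sketch below labels cards only as ace ("A") or other ("x"), which is an illustrative simplification of the deck.

```python
from fractions import Fraction
from itertools import permutations

# Multiplication rule: P(first is ace) * P(second is ace | first is ace).
p_chain = Fraction(4, 52) * Fraction(3, 51)

# Brute force: every ordered draw of two distinct cards is equally likely.
deck = ["A"] * 4 + ["x"] * 48
draws = list(permutations(range(52), 2))   # 52 * 51 ordered pairs of positions
hits = sum(1 for i, j in draws if deck[i] == "A" and deck[j] == "A")
p_enum = Fraction(hits, len(draws))

print(p_chain, p_enum)            # 1/221 1/221
print(round(float(p_chain), 4))   # 0.0045
```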

Two events are independent precisely when the conditional collapses to the unconditional: P(A | B) = P(A), or equivalently P(A ∩ B) = P(A) · P(B). Flips of separate coins are independent. Checking independence requires a probability calculation or a domain argument — never an assumption.

P(A | B) and P(B | A) are not the same number. P(spots | measles) is close to 1; P(measles | spots) is usually far smaller because many diseases cause spots. The whole point of Bayes’ theorem next is to convert one into the other without losing your way.

5. Bayes’ Theorem

베이즈 정리

Bayes’ theorem lets you flip a conditional: it converts P(B | A), often known directly, into P(A | B), which is usually what you actually want. The statement is P(A | B) = P(B | A) · P(A) / P(B). The derivation: by the multiplication rule, P(A | B) · P(B) = P(B | A) · P(A); divide by P(B). The denominator is usually expanded via the law of total probability: P(B) = P(B | A)·P(A) + P(B | Aᶜ)·P(Aᶜ).

The terminology matters. P(A) is the prior — what you believed about A before seeing B. P(B | A) is the likelihood — how probable the evidence is under A. P(A | B) is the posterior — your updated belief. P(B) is the normalizing constant. Together they form the engine of statistical learning: prior + evidence → posterior.

Medical test example. A disease has prevalence 1%, and a test is 95% accurate in both directions: P(positive | disease) = 0.95 and P(negative | no disease) = 0.95. You test positive. What is P(disease | positive)? Set A = “has disease,” B = “tests positive.” Then P(A) = 0.01, P(B | A) = 0.95, P(B | Aᶜ) = 0.05. The denominator is P(B) = 0.95 · 0.01 + 0.05 · 0.99 = 0.059. So P(A | B) = (0.95 · 0.01) / 0.059 ≈ 0.161 — only about 16%.

That figure shocks people who expected something close to 95%. The reason is the base rate: 99 healthy people for every 1 sick person, and 5% of the 99 also test positive, drowning out the true positives. When priors are extreme, even a very accurate test cannot overcome them in a single shot. This is why screening programs require confirmatory follow-ups and why doctors interpret results in light of prevalence.
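The medical-test calculation is short enough to write as a function. This sketch uses exact fractions; the function name `posterior` and its parameter names are illustrative. The second call shows why a confirmatory test helps, under the added assumption (not stated in the text) that the second test has the same accuracy and is independent of the first given the patient’s true status.

```python
from fractions import Fraction

def posterior(prior, sensitivity, specificity):
    """P(disease | positive) from Bayes' theorem and total probability."""
    true_positive = sensitivity * prior                # P(B | A) * P(A)
    false_positive = (1 - specificity) * (1 - prior)   # P(B | not A) * P(not A)
    return true_positive / (true_positive + false_positive)

accuracy = Fraction(95, 100)
after_one = posterior(Fraction(1, 100), accuracy, accuracy)
print(after_one, round(float(after_one), 3))   # 19/118 0.161

# Assumption: an independent second positive test, using the first
# posterior as the new prior.
after_two = posterior(after_one, accuracy, accuracy)
print(after_two, round(float(after_two), 3))   # 361/460 0.785
```

Feeding the posterior back in as the next prior is exactly the "prior + evidence → posterior" loop described above.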

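Bayes’ theorem P(A | B) = P(B | A)·P(A)/P(B) computes the posterior P(A | B) from the prior P(A), the likelihood P(B | A), and the normalizing constant P(B). With 1% prevalence and a 95% accurate test, a positive result means only about a 16% chance of actually having the disease. The base rate (the prior) heavily drives the result, which is why screening programs call for confirmatory tests.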

6. Random Variables & Distributions

확률변수와 분포

A random variable is a function that assigns a number to every outcome. If you flip three coins, the number of heads X takes values in {0, 1, 2, 3}. Random variables split into two camps. Discrete variables take values in a finite or countably infinite set: coin counts, die rolls, typos on a page. Continuous variables take values in an interval of real numbers: heights, waiting times, temperatures.

Discrete variables are described by a probability mass function (PMF) p(x) = P(X = x), which lists each value’s probability and sums to 1. Continuous variables use a probability density function (PDF) f(x), with probabilities from integration: P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx. A PDF is not itself a probability — f(x) can exceed 1 — only its integral is. Both kinds share a cumulative distribution function (CDF) F(x) = P(X ≤ x), non-decreasing from 0 at −∞ to 1 at +∞.

A handful of named distributions show up everywhere. Bernoulli(p) models a single yes/no trial. Binomial(n, p) counts successes in n trials with PMF P(X = k) = C(n, k)·pᵏ·(1 − p)ⁿ⁻ᵏ. Geometric(p) counts trials until the first success. Poisson(λ) models counts of rare independent events — arrivals per hour, mutations per genome. The uniform spreads probability evenly. The normal(μ, σ²) or Gaussian is the bell curve that dominates the continuous world via the central limit theorem.
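The binomial PMF above translates directly into code. This is a minimal sketch with hypothetical parameters n = 10 and p = 0.5 (ten fair coin flips); it checks that the PMF sums to 1 and that the mean equals n·p.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.5   # hypothetical example: ten fair coin flips
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

total = sum(pmf)                                  # a PMF must sum to 1
mean = sum(k * pk for k, pk in enumerate(pmf))    # should equal n * p

print(pmf[5])        # 0.24609375, the chance of exactly 5 heads
print(total, mean)
```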

Normal distributions are standardized by Z = (X − μ) / σ, producing a standard normal with mean 0 and variance 1. Z-scores let any normal measurement be compared on a common scale. Roughly 68% of values fall within ±1σ of the mean, 95% within ±2σ, and 99.7% within ±3σ — the famous “68-95-99.7 rule.”
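The 68-95-99.7 rule can be recovered from the standard normal CDF, which the standard library can express through `math.erf`. In this sketch the helper names and the measurement (X = 130 with μ = 100 and σ = 15) are hypothetical, chosen only to illustrate a z-score.

```python
import math

def std_normal_cdf(z):
    """CDF of the standard normal, written in terms of the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return std_normal_cdf(k) - std_normal_cdf(-k)

for k in (1, 2, 3):
    print(k, round(within(k), 4))   # 0.6827, 0.9545, 0.9973

# Hypothetical measurement standardized to a z-score.
x, mu, sigma = 130, 100, 15
z = (x - mu) / sigma
print(z)   # 2.0
```

The exact two-sigma figure is about 95.45%, which is why the rule is stated as "roughly" 95%.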

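A random variable is a function that maps each outcome in the sample space to a number. Discrete variables define probabilities through the PMF p(x) = P(X = x) and continuous variables through the PDF f(x) and integration, while the cumulative distribution function F(x) = P(X ≤ x) is used in both cases. Bernoulli, binomial(n, p), geometric, Poisson(λ), uniform, and normal(μ, σ²) are the most frequently encountered named distributions, and standardizing with Z = (X − μ)/σ lets you use standard normal tables directly.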

7. Expected Value & Variance

기댓값과 분산

The expected value E[X] is the probability-weighted average of every value a random variable can take. For a discrete variable, E[X] = Σ xᵢ · p(xᵢ); for a continuous variable, E[X] = ∫ x · f(x) dx. The expected value of a single die roll is (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5 — a value you can never actually roll, hinting that “expected” means “long-run average,” not “most likely.” Over a million rolls, the sample mean will be extremely close to 3.5.
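For a discrete variable the expected value is a single weighted sum. This sketch stores the PMF of a fair die as a dictionary (an illustrative representation) and evaluates E[X] = Σ xᵢ · p(xᵢ) exactly.

```python
from fractions import Fraction

# PMF of a fair six-sided die: each face has probability 1/6.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

expected = sum(x * p for x, p in pmf.items())
print(expected, float(expected))   # 7/2 3.5

# 3.5 is not a face of the die: "expected" is not "most likely".
print(expected in pmf)             # False
```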

The most useful property is linearity: E[X + Y] = E[X] + E[Y] and E[aX + b] = a·E[X] + b, with no independence required. Even when X and Y are tightly correlated, their expected values still add. Linearity makes hard problems trivial: the expected number of fixed points in a random permutation of n items is exactly 1, because each position contributes 1/n and there are n positions.
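The fixed-point claim can be verified exhaustively for small n. This sketch enumerates every permutation of n items and averages the number of positions i with p[i] == i; the function name is illustrative, and n is kept small because the number of permutations grows as n!.

```python
from fractions import Fraction
from itertools import permutations

def expected_fixed_points(n):
    """Average number of fixed points over all n! permutations of range(n)."""
    perms = list(permutations(range(n)))
    total = sum(sum(1 for i, v in enumerate(p) if i == v) for p in perms)
    return Fraction(total, len(perms))

results = {n: expected_fixed_points(n) for n in range(1, 7)}
print(results)   # every value is 1, as linearity predicts
```

The indicator variables for different positions are not independent, yet their expectations still add to n · (1/n) = 1.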

The variance Var(X) = E[(X − μ)²] measures how spread out X is around μ = E[X]. The computational form Var(X) = E[X²] − (E[X])² is what you usually use. The standard deviation σ = √Var(X) is in the units of X. Variance is non-negative and scales quadratically: Var(aX + b) = a²·Var(X). For independent X and Y, variances add: Var(X + Y) = Var(X) + Var(Y); for dependent variables the general form is Var(X + Y) = Var(X) + Var(Y) + 2·Cov(X, Y).
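The variance identities can be checked exactly on dice. This sketch assumes equally likely outcomes, so a plain average over the outcome list is the expectation; the helpers `mean` and `variance` and the constants a = 2, b = 3 are illustrative.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)

def mean(values):
    """Exact average of equally likely values."""
    values = list(values)
    return Fraction(sum(values), len(values))

def variance(values):
    """Var(X) = E[(X - mu)^2] for equally likely values."""
    values = list(values)
    mu = mean(values)
    return mean((v - mu) ** 2 for v in values)

var_die = variance(faces)
shortcut = mean(v * v for v in faces) - mean(faces) ** 2   # E[X^2] - (E[X])^2
var_scaled = variance(2 * v + 3 for v in faces)            # Var(2X + 3)
var_two_dice = variance(x + y for x, y in product(faces, repeat=2))

print(var_die, shortcut)             # 35/12 35/12
print(var_scaled == 4 * var_die)     # True: the shift b drops out, a is squared
print(var_two_dice == 2 * var_die)   # True: independent dice, variances add
```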

The law of large numbers (LLN) binds expectation to reality: as the number of independent, identically distributed samples grows, the sample mean (X₁ + … + X_n)/n converges to μ. The central limit theorem (CLT) goes further: regardless of the underlying distribution (provided its variance is finite), the standardized sample mean approaches a standard normal, which is why bell curves appear everywhere in statistics.
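A short simulation makes the law of large numbers visible. This is a sketch, not a proof: it rolls a simulated fair die with a fixed seed (an arbitrary choice, made only for reproducibility) and prints the sample mean for increasing sample sizes.

```python
import random
import statistics

rng = random.Random(0)   # fixed seed so repeated runs give the same numbers

def sample_mean(n):
    """Mean of n simulated rolls of a fair six-sided die."""
    return statistics.fmean(rng.randint(1, 6) for _ in range(n))

means = {n: sample_mean(n) for n in (10, 1_000, 100_000)}
for n, m in means.items():
    # The gap to 3.5 tends to shrink as n grows.
    print(n, round(m, 4), round(abs(m - 3.5), 4))
```

With 100,000 rolls the standard deviation of the sample mean is about 0.005, so the final value should sit very close to 3.5.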

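The expected value E[X] = Σ xᵢ·p(xᵢ) is the probability-weighted average of all possible values, and it can be a value that never actually occurs, such as 3.5 for a die. Linearity of expectation, E[X + Y] = E[X] + E[Y], is a powerful property that holds even without independence. The variance Var(X) = E[(X − μ)²] measures spread, and Var(X + Y) = Var(X) + Var(Y) is guaranteed when X and Y are independent (otherwise a covariance term is needed). The law of large numbers guarantees that the sample mean converges to the true expected value, and the central limit theorem says that the fluctuations around that limit are approximately normal.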

8. Independence vs Disjoint Events

독립과 서로소

No pair of concepts is confused more often than independence and disjointness. They sound similar, since both suggest “separation,” but they describe completely different relationships, and two events that each have positive probability cannot be both disjoint and independent.

Disjoint (mutually exclusive) means the events cannot both happen on the same trial: A ∩ B = ∅, so P(A ∩ B) = 0. Rolling a 2 and rolling a 5 on a single die are disjoint, and so are “today is Monday” and “today is Tuesday.” Disjointness is a statement about the sample space itself: the events occupy non-overlapping regions of Ω.

Independent means one event’s occurrence does not change the other’s probability: P(A ∩ B) = P(A) · P(B), equivalently P(A | B) = P(A) when P(B) > 0. Flipping a coin in Korea and one in Brazil are independent. Independence is a statement about probability, not geometry — two events can overlap heavily yet still be independent if the overlap is exactly P(A)·P(B).

Here is the punchline. If A and B are disjoint, then P(A ∩ B) = 0; if they were also independent, then P(A)·P(B) would equal 0, forcing P(A) = 0 or P(B) = 0. So disjoint events with positive probability are never independent. They are maximally dependent, because knowing A occurred immediately tells you B did not. The correct rule for disjoint events is P(A ∪ B) = P(A) + P(B), not P(A ∩ B) = P(A)·P(B).
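Both definitions are one-line checks on a finite sample space, which makes the contrast easy to see. This sketch assumes a fair die; the events `even` and `low` (rolling 1 or 2) are hypothetical examples chosen because they overlap yet are independent.

```python
from fractions import Fraction

omega = set(range(1, 7))   # one fair die

def prob(event):
    return Fraction(len(event), len(omega))

def is_disjoint(a, b):
    return not (a & b)     # empty intersection

def is_independent(a, b):
    return prob(a & b) == prob(a) * prob(b)

two, five = {2}, {5}              # disjoint events from the text
even, low = {2, 4, 6}, {1, 2}     # hypothetical overlapping events

print(is_disjoint(two, five), is_independent(two, five))   # True False
print(is_disjoint(even, low), is_independent(even, low))   # False True
```

For `even` and `low` the overlap {2} has probability 1/6, which is exactly (1/2)·(1/3).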

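Disjoint means the two events cannot occur at the same time, so P(A ∩ B) = 0; independent means one event does not change the probability of the other, so P(A ∩ B) = P(A)·P(B). Disjoint events that both have nonzero probability can never be independent. In fact they are maximally dependent, because the moment you know A occurred you know for certain that B did not. Use P(A ∪ B) = P(A) + P(B) for disjoint events and P(A ∩ B) = P(A)·P(B) for independent events.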

9. Common Mistakes & Pitfalls

흔한 실수와 함정

Base-rate fallacy. Ignoring the prior and reading P(disease | positive) as if it were P(positive | disease). The 16% answer in §5 looks shocking only because readers anchor on test accuracy and forget how rare the disease is. The same fallacy makes “matched a partial fingerprint in a database of millions” far weaker evidence than it sounds.
Conjunction fallacy. Believing P(A ∩ B) can exceed P(A). It cannot: A ∩ B is a subset of A. The “Linda the bank teller” experiment showed people judge “Linda is a bank teller and a feminist” more probable than “Linda is a bank teller” — a logical impossibility.
Gambler’s fallacy. Believing past independent outcomes influence the future. After five heads in a row on a fair coin, the next flip is still 50/50. Coins have no memory; thinking heads are “due” after a tail streak has bankrupted serious people.
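Monty Hall confusion. Three doors, one prize. You pick door 1; the host, who knows where the prize is and always opens a goat door, opens door 3 to reveal a goat. Should you switch to door 2? Yes. Switching wins 2/3 of the time, because your first pick is right only 1/3 of the time and the host’s informed choice leaves the prize behind the other closed door in the remaining 2/3 of cases.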
Confusing P(A | B) with P(B | A). “Most accident victims were wearing seatbelts” does not mean seatbelts cause accidents; it reflects that almost everyone wears seatbelts. Flipping the direction without Bayes is the single most common error in statistics reporting.
Treating “rare” as “impossible.” A 1-in-10,000 event happens about 8,000 times a day when there are 80 million trials a day. Probability zero is a far stronger claim than “very low probability”; reasoning that drops the distinction goes badly wrong at scale.
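The Monty Hall answer in the list above can be computed exactly by enumerating every case rather than trusting intuition. This sketch makes the standard assumptions explicit: the prize location and the first pick are uniform and independent, the host always opens a goat door that was not picked, and the host chooses uniformly when two such doors are available.

```python
from fractions import Fraction

doors = (0, 1, 2)
win_stay = Fraction(0)
win_switch = Fraction(0)

for prize in doors:
    for pick in doors:
        weight = Fraction(1, 9)   # P(prize location) * P(first pick)
        # Assumption: the host never opens the picked door or the prize door.
        host_options = [d for d in doors if d != pick and d != prize]
        for opened in host_options:
            w = weight / len(host_options)   # host picks uniformly if free to choose
            switched = next(d for d in doors if d != pick and d != opened)
            win_stay += w * (pick == prize)
            win_switch += w * (switched == prize)

print(win_stay, win_switch)   # 1/3 2/3
```

If the host opened a door at random and merely happened to reveal a goat, the assumptions above would not hold and the answer would change.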

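The common mistakes are base-rate neglect (looking only at test accuracy and forgetting the prior), the conjunction fallacy (P(A ∩ B) > P(A) is impossible), the gambler’s fallacy (believing past outcomes affect the next trial), Monty Hall confusion, confusing P(A | B) with P(B | A), and treating “rare” as “impossible.” All of them can be avoided by thinking carefully about priors, the direction of conditioning, and independence.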

10. Applications & Related Concepts

응용과 관련 개념

Probability is the language any field that takes uncertainty seriously eventually speaks:

Machine learning & AI. Bayesian models update posteriors as data arrives; naive Bayes, logistic regression, Gaussian mixtures, and variational autoencoders are all probabilistic. Dropout randomly zeroes neurons during training to prevent over-fitting.
Finance. Black-Scholes option pricing computes fair price as the expected discounted payoff under a risk-neutral measure. Value-at-risk and credit-default modeling rest on distributions of returns and joint risk.
Insurance. Premiums are set so expected claims plus expenses stay below expected revenue — expected value applied across millions of policies. Reinsurance pools tail risk so no single catastrophe sinks the insurer.
Epidemiology. Sensitivity P(positive | disease) and specificity P(negative | no disease) are conditional probabilities; Bayes converts them into the predictive values clinicians actually use.
Cryptography. The random oracle model treats hash functions as ideal random functions to prove security bounds. Information-theoretic security uses probability over keys and plaintexts.
AI safety & calibration. A well-calibrated model that says “80% confident” should be correct 80% of the time. Probability is how we compare model confidence to reality.

Probability underpins statistics, the inverse problem of going from data back to a probabilistic model, and expected values are probability-weighted averages. Random matrices and Markov chains connect to matrix theory through transition matrices and spectral decompositions.

Ready to drill the mechanics? Practice → C:Prob offers interactive problems on counting, conditional probability, Bayes, expected value, and distributions.

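Probability appears in every field that deals with uncertainty: machine learning (Bayesian models, classifiers, dropout), finance (option pricing, VaR), insurance (risk pooling), epidemiology (test sensitivity and specificity), cryptography (random oracles), and AI safety (model calibration). Statistics is the inverse problem of probability, expected values are probability-weighted averages, and Markov chains and random matrices connect directly to matrix theory. Hands-on practice is available at C:Prob.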

11. Explore Each Probability Topic in Depth

주제별 심화 가이드

The reference above is the conceptual overview. Each topic below has its own focused page with a step-by-step worked example, an SVG diagram, common-mistake notes, and an FAQ:

Counting — the multiplication rule and sizing a sample space.
Basic probability — P(A) = |A|/|Ω|, the complement, and the addition rule.
Conditional probability — P(A|B) = P(A∩B)/P(B) and independence.
Bayes’ theorem — flipping a conditional and the base-rate effect.
Binomial distribution — P(X=k) = C(n,k)pᵏ(1−p)ⁿ⁻ᵏ, mean np.
Expected value — E[X] = Σ x·P(x) and the linearity of expectation.

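The reference above is the overall overview; each topic listed here leads to a dedicated page with step-by-step examples, diagrams, mistake notes, and an FAQ: counting, basic probability, conditional probability, Bayes’ theorem, the binomial distribution, and expected value.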
