Statistics — A Complete Reference

Descriptive measures, counting, and inference · A guide to descriptive statistics, counting, and inference

1. What is Statistics?

What is statistics?

Statistics is the science of collecting, summarizing, and drawing conclusions from data. The discipline splits into two halves. Descriptive statistics summarizes what is already in front of us — averages, spreads, charts, percentages. Inferential statistics uses a sample to make claims about a larger, often unobservable population, attaching a measure of uncertainty to every conclusion.

The distinction between a population and a sample is the conceptual core of the field. A population is the entire set of entities we care about — every voter in a country, every product rolling off a factory line. A sample is the much smaller subset we actually measure. Population parameters, written with Greek letters like μ (mean) and σ (standard deviation), describe the truth; sample statistics, written with Latin letters like x̄ and s, are our best estimates from the data we managed to collect.

Modern life runs on statistical reasoning, often invisibly. Search engines rank results using statistical models of relevance. Streaming services recommend films from user-rating samples. Public-health agencies decide which vaccines to deploy based on clinical trial evidence. Mastering statistics is less about memorizing formulas and more about thinking clearly under uncertainty — a skill that pays dividends across software, finance, science, and daily judgment.

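Statistics is the discipline of collecting, summarizing, and analyzing data to draw conclusions. It divides into descriptive statistics, which organizes the data already in hand, and inferential statistics, which estimates a population from a sample. The key is to distinguish the parameters μ and σ, which describe the population (the whole), from the statistics x̄ and s, which describe the sample. From search ranking and recommendation systems to clinical trials and autocomplete, nearly every decision in modern society runs on statistical inference.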

2. Measures of Central Tendency

Representative values: mean, median, and mode

When summarizing a dataset with a single number, statisticians use one of three classic measures of center. The arithmetic mean x̄ = Σxᵢ / n is the familiar average: add everything up and divide by the count. The median is the middle value once the data are sorted. The mode is the most frequently occurring value, and is the only measure that makes sense for categorical data like favorite colors or product brands. A less common but important variant is the geometric mean, the n-th root of the product of n values, which is the right average for growth rates and compounded returns where multiplication, not addition, links the observations.
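
To make the geometric mean concrete, here is a minimal sketch using Python's standard statistics module. The three yearly growth factors are hypothetical values chosen for illustration; they do not come from the text.

```python
import math
import statistics

# Hypothetical yearly growth factors: +10 %, +20 %, -10 % (illustrative only).
growth_factors = [1.10, 1.20, 0.90]

arithmetic = statistics.mean(growth_factors)
geometric = statistics.geometric_mean(growth_factors)

# Total growth over the three years is the product of the factors.
total_growth = math.prod(growth_factors)

print(f"arithmetic mean factor: {arithmetic:.4f}")
print(f"geometric mean factor:  {geometric:.4f}")
print(f"total growth: {total_growth:.4f}, geometric ** 3: {geometric ** 3:.4f}")
```

Compounding the geometric mean over three periods reproduces the actual total growth, whereas compounding the arithmetic mean would overstate it.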

Each measure shines in a different situation. The mean uses every data point and has the cleanest mathematical properties, which is why almost all of inferential statistics builds on it. The median is robust to outliers: a single billionaire entering a room does not move the median income, although it moves the mean dramatically. The mode is invaluable for discrete data, where “the average smartphone color” makes no arithmetic sense but “the most popular color” does.

Worked example with skewed data. Seven households on a street earn (in thousands of dollars per year) 40, 42, 45, 50, 55, 60, and 850 (a retired tech founder). The mean is (40 + 42 + 45 + 50 + 55 + 60 + 850) / 7 ≈ 163, which describes none of the seven households well. The median is the 4th sorted value, 50, which describes six of the seven much better. The mode does not apply because every income is unique. Politicians who report “average income” on skewed distributions often mean the mean, while economists studying the typical experience prefer the median. Always check the shape of the distribution before choosing.
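
The household example can be checked in a few lines. This is a small sketch with Python's standard statistics module; the variable names are illustrative.

```python
import statistics

# Household incomes in thousands of dollars, taken from the worked example.
incomes = [40, 42, 45, 50, 55, 60, 850]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)
# multimode returns every value tied for most frequent; here each value occurs once.
modes = statistics.multimode(incomes)

print(f"mean   = {mean_income:.1f}")
print(f"median = {median_income}")
print(f"values tied for mode: {len(modes)}")
```

Because all seven incomes are tied at one occurrence each, the mode carries no information for this dataset.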

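The measures of center are the mean (x̄ = Σxᵢ/n), the median (the middle value after sorting), and the mode (the most frequent value); growth rates call for the geometric mean. The mean uses every data point but is vulnerable to outliers, while the median resists them and so suits skewed distributions such as income. For seven households earning 40, 42, 45, 50, 55, 60, and 850 (thousand dollars), the mean is about 163 but the median is 50, so the right measure depends on which notion of "typical" you need.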

x̄ = Σxᵢ / n

3. Measures of Spread

Measuring spread: variance and standard deviation

A center alone never tells the whole story. Two datasets can share an identical mean yet feel completely different — one tightly clustered, the other wildly scattered. Measures of spread quantify that scatter. The simplest is the range, max minus min, but a single extreme observation can blow it up. The interquartile range (IQR), the gap between the 75th and 25th percentiles, captures the middle 50 % of the data and is naturally robust to outliers. Most of statistics, however, is built on the variance and its square root.

Variance σ² = Σ(xᵢ − μ)² / n is the average of squared deviations from the mean. Standard deviation σ = √(variance) brings that quantity back into the original units, which is why we usually report σ rather than σ². Why square the deviations instead of taking absolute values? Squaring always returns a non-negative number, amplifies large deviations to match our intuition that a far-from-mean point is “extra wrong,” and is differentiable everywhere, which makes calculus-based inference like least-squares regression and maximum likelihood tractable.

A subtle but critical detail: sample variance divides by n − 1, not n. This is Bessel’s correction. Deviations measured from the sample mean x̄ are, on average, smaller than deviations from the unknown true mean μ, so dividing by n slightly underestimates variability; dividing by n − 1 instead produces an unbiased estimator of the population variance. In NumPy, var and std divide by n by default (ddof=0), so pass ddof=1 to get the sample version; spreadsheets offer both VAR.P (÷n) and VAR.S (÷(n−1)). Using the wrong one is among the most common silent bugs in beginner data analyses.
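
The difference between the two divisors is easy to see on a small dataset. The sketch below uses Python's standard statistics module, where pvariance divides by n and variance divides by n − 1; the eight data values are illustrative and chosen so that the mean is exactly 5.

```python
import statistics

# Illustrative data: the mean is 5 and the squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

population_var = statistics.pvariance(data)  # divides by n
sample_var = statistics.variance(data)       # divides by n - 1 (Bessel's correction)
population_sd = statistics.pstdev(data)
sample_sd = statistics.stdev(data)

print(f"population variance = {population_var}, standard deviation = {population_sd}")
print(f"sample variance     = {sample_var:.4f}, standard deviation = {sample_sd:.4f}")
```

With only eight observations the two variances differ by a factor of 8/7; the gap shrinks as n grows.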

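Datasets with the same mean can behave very differently when their spread differs. Range, IQR (interquartile range), variance, and standard deviation all measure spread. In the variance σ² = Σ(xᵢ − μ)²/n, deviations are squared to remove their sign and to give a differentiable form, which makes inference techniques such as regression and maximum likelihood workable. The sample variance divides by n − 1 rather than n (Bessel's correction) to compensate for using the sample mean in place of the population mean, which removes the bias of the estimator of the population variance.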

σ² = Σ(xᵢ − μ)² / n, σ = √σ² (standard deviation)

4. Quartiles, Percentiles & Box Plots

Quartiles, percentiles, and box plots

Percentiles generalize the median. The p-th percentile is the value below which p % of the data lie: the 50th percentile is the median, the 90th is the value that 90 % of observations fall below. Quartiles are the three percentiles that split the data into four equal parts — Q1 at the 25th percentile, Q2 at the 50th (median), and Q3 at the 75th. The interquartile range is simply Q3 − Q1, the spread of the middle half, an outlier-resistant cousin of the standard deviation.

A box plot (also called a box-and-whisker plot) is the canonical visual summary of these quantities. The “box” runs from Q1 to Q3 with a line at the median, instantly communicating the central 50 % of the data. The whiskers extend to the smallest and largest observations that are not flagged as outliers. Standard practice flags any point further than 1.5 × IQR below Q1 or above Q3 as an outlier and draws it individually. This 1.5·IQR rule is conventional rather than universal — Tukey, who invented the box plot, chose it as a useful default for roughly normal data.
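
As a rough sketch of the 1.5 × IQR rule, the snippet below reuses the seven household incomes from section 2. It relies on statistics.quantiles with method="inclusive"; that choice is an assumption, since several quartile conventions exist and they give slightly different Q1 and Q3 on small samples.

```python
import statistics

# Household incomes in thousands of dollars (the example from section 2).
incomes = [40, 42, 45, 50, 55, 60, 850]

# "inclusive" interpolates between the observed minimum and maximum;
# other conventions shift Q1 and Q3 slightly on small samples.
q1, q2, q3 = statistics.quantiles(incomes, n=4, method="inclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in incomes if x < lower_fence or x > upper_fence]

print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {iqr}")
print(f"fences = ({lower_fence}, {upper_fence}), outliers = {outliers}")
```

Only the 850 income falls outside the fences, so a box plot of this data would draw it as an individual point and end the upper whisker at 60.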

Box plots are powerful precisely because they show shape and spread without assuming a distribution. Comparing several groups side by side — exam scores across class sections, response times across servers, sales across regions — reveals at a glance which groups have similar centers, which have wider spread, and which contain extreme values. They also pair beautifully with raw points overlaid as a jittered strip, giving readers both the summary and the underlying data.

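Percentiles generalize the median: the p-th percentile is the value below which p% of the data lie. The quartiles Q1, Q2, and Q3 split the data into four equal parts, and the box plot is the standard visualization that draws Q1 to Q3 as a box and marks values beyond 1.5·IQR as outliers. Because it assumes no distribution and shows the center, spread, and outliers of several groups at a glance, it is especially useful for group comparisons such as exam scores, response times, and sales.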

5. Permutations & Combinations

Permutations and combinations

Counting feels elementary until you need to count something subtle, at which point it quickly becomes the most error-prone branch of mathematics. Two basic principles do most of the work. The multiplication principle says that if a process has k independent stages with n₁, n₂, …, nₖ outcomes respectively, the total number of outcomes is n₁ · n₂ · … · nₖ. The factorial n! = n · (n − 1) · … · 2 · 1 then counts the number of ways to arrange n distinct objects in a row: 5! = 120, and 10! is already over three million.

From the factorial we derive two named formulas. A permutation P(n, k) = n! / (n − k)! counts the number of ordered arrangements of k items drawn from n distinct items — choosing a president, vice-president, and secretary from a club of ten gives P(10, 3) = 10 · 9 · 8 = 720. A combination C(n, k) = n! / (k! · (n − k)!) counts the number of unordered selections — choosing any three people for a committee gives C(10, 3) = 120, exactly six times smaller because each committee can be arranged in 3! = 6 orders. The mental check “does order matter?” decides the formula every time.

Worked example: five-card poker hands. We draw five cards from a standard 52-card deck, and order does not matter (a hand is the same set of cards whatever sequence it was dealt in). The count is therefore C(52, 5) = 52! / (5! · 47!) = (52 · 51 · 50 · 49 · 48) / 120 = 2 598 960. Out of those 2.6 million hands, exactly four are royal flushes (one per suit), giving probability 4 / 2 598 960 ≈ 1.54 × 10⁻⁶, or one in roughly 650 000. Most probability calculations in card games and lotteries share this structure: count the favorable outcomes with combinations, count the total outcomes with combinations, and divide the first by the second.
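
The counts in this section can be reproduced with math.perm and math.comb (available in Python 3.8 and later). The variable names below are illustrative.

```python
import math

# Ordered selection: president, vice-president, secretary from ten members.
officer_slates = math.perm(10, 3)
# Unordered selection: any three members for a committee.
committees = math.comb(10, 3)

# Five-card hands from a 52-card deck, and the royal flush probability.
hands = math.comb(52, 5)
p_royal_flush = 4 / hands

print(f"P(10, 3) = {officer_slates}, C(10, 3) = {committees}")
print(f"C(52, 5) = {hands}")
print(f"royal flush: {p_royal_flush:.3e}, about 1 in {1 / p_royal_flush:,.0f}")
```

The ratio of the first two counts is 3! = 6, matching the observation that each committee can be ordered in six ways.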

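The multiplication principle and the factorial n! underlie every counting calculation. The permutation P(n, k) = n!/(n − k)! counts ordered selections and the combination C(n, k) = n!/(k!(n − k)!) counts unordered ones; the question "does order matter?" decides the formula. There are C(52, 5) = 2 598 960 five-card poker hands and only 4 royal flushes, so the probability is about 1/650 000.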

P(n, r) = n! / (n − r)! (order matters); C(n, r) = n! / [r! (n − r)!] (order does not matter). Example: P(5, 2) = 20 versus C(5, 2) = 10.

6. Correlation & Causation

Correlation and causation

Correlation measures how two variables move together. The most common quantity is the Pearson correlation coefficient r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √(Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)²), a value between −1 and +1. A coefficient of +1 means the points lie on a perfectly increasing line, −1 perfectly decreasing, and 0 no linear relationship at all. Values around ±0.7 are strong in most empirical contexts, ±0.3 modest, and below ±0.1 essentially noise. Always pair the coefficient with a scatter plot: Anscombe’s quartet shows four datasets with identical r ≈ 0.82 that look radically different on inspection.

Pearson r captures only linear relationships and is sensitive to outliers. Spearman’s rank correlation replaces each value with its rank and computes Pearson r on the ranks, which detects any monotonic relationship and shrugs off outliers. Kendall’s tau counts concordant versus discordant pairs and behaves similarly. When variables are non-linear or contain extremes, the rank-based versions are usually a safer first look.
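
The contrast between Pearson and Spearman shows up clearly on data that is monotonic but not linear. The following is a minimal sketch written from the formulas in the text; the helper names are hypothetical, and the simple ranks function assumes there are no tied values (a full implementation would assign average ranks to ties).

```python
import math

def pearson_r(xs, ys):
    """Pearson r computed directly from the sum-of-products formula."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator

def ranks(values):
    """Ranks from 1 to n; assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rho(xs, ys):
    """Spearman's rank correlation: Pearson r applied to the ranks."""
    return pearson_r(ranks(xs), ranks(ys))

xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]  # monotonic but strongly non-linear

print(f"Pearson r    = {pearson_r(xs, ys):.4f}")
print(f"Spearman rho = {spearman_rho(xs, ys):.4f}")
```

Pearson r stays below 1 because the points do not lie on a line, while Spearman's coefficient reaches 1 because the ordering of the two variables agrees perfectly.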

The single most repeated warning in all of statistics is that correlation does not imply causation. Ice cream sales and drowning deaths are strongly correlated, but neither causes the other — both rise in summer because of the confounding variable, hot weather. The number of films Nicolas Cage releases in a year correlates with swimming-pool drownings in the United States; the correlation is real but completely meaningless. Always hunt for confounders, reverse causation, and selection effects before reading a correlation as evidence of cause. Only randomized experiments, or carefully designed causal-inference techniques, can promote a correlation into a credible causal claim.

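The Pearson correlation coefficient r expresses the linear relationship between two variables on a scale from −1 to +1. r = ±1 means a perfect linear relationship and r = 0 means no linear relationship; when the data are non-linear or contain outliers, the rank-based Spearman and Kendall coefficients are safer. The most important warning is that correlation is not causation. Ice cream sales and drownings rise together only because both respond to a confounding variable, hot weather, and a causal claim requires a randomized experiment or causal-inference techniques.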

Figure: scatter plot with r ≈ +0.98 (strong positive correlation)

7. Probability Distributions (Bridge to /docs/prob/)

Probability distributions: normal, binomial, and Poisson

Inferential statistics needs a probabilistic model of how the data could have been generated. Three distributions cover most practical applications. The normal distribution N(μ, σ²) is the famous bell curve — symmetric, unbounded, parameterized by its mean μ and standard deviation σ. It models heights, measurement errors, exam scores, and the sums of many small independent effects.

The normal distribution comes with one of the most cited rules of thumb in statistics: the 68-95-99.7 rule, or empirical rule. About 68 % of values lie within ±1σ of the mean, 95 % within ±2σ, and 99.7 % within ±3σ. A measurement more than four standard deviations from the mean, in either direction, is so unlikely (roughly one in 16 000) that it usually points to a measurement error or a genuinely interesting outlier. “Six Sigma” manufacturing aims for fewer than 3.4 defects per million opportunities. That figure corresponds to specification limits six standard deviations from the process mean once the customary allowance for a 1.5σ long-term drift of the mean is applied; without that allowance, six-sigma limits would imply only about two defects per billion.
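
The percentages in the empirical rule follow from the normal cumulative distribution function. Here is a short sketch using statistics.NormalDist from the standard library; the coverage helper is a hypothetical name introduced for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def coverage(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within ±{k}σ: {coverage(k):.4f}")

# Two-sided tail: probability of landing more than 4σ from the mean in either direction.
two_sided_tail_4 = 1 - coverage(4)
print(f"beyond ±4σ: about 1 in {1 / two_sided_tail_4:,.0f}")
```

The rounded values 68, 95, and 99.7 are approximations of these probabilities, and the four-sigma figure counts both tails together.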

The binomial distribution counts successes in a fixed number of independent yes/no trials with constant success probability p: vote counts, click-through rates, defect counts. The Poisson distribution counts rare independent events in a fixed interval: phone calls per hour, typos per page, ER arrivals per minute. Both are well approximated by the normal distribution when the expected counts are large, which is a consequence of the central limit theorem: the distribution of the sample mean of many independent observations approaches a normal distribution as n grows, regardless of the shape of the source distribution, provided its variance is finite. The CLT is why normality keeps appearing even when underlying data are skewed or discrete, and it justifies most of the statistical machinery in the next sections. For probability axioms and named distributions in full, see probability.
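
A small simulation can illustrate the central limit theorem. In this sketch the exponential source distribution, the sample size of 50, the 2000 trials, and the fixed seed are all arbitrary assumptions chosen for illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample_size = 50
trials = 2000

# Source distribution: exponential with mean 1 and standard deviation 1,
# which is strongly right-skewed and far from normal.
sample_means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(sample_size))
    for _ in range(trials)
]

center = statistics.fmean(sample_means)
spread = statistics.stdev(sample_means)
predicted_spread = 1.0 / sample_size ** 0.5  # sigma / sqrt(n)

print(f"mean of sample means:  {center:.3f} (source mean is 1)")
print(f"stdev of sample means: {spread:.3f} (theory predicts {predicted_spread:.3f})")
```

The sample means cluster around the source mean with a spread close to σ/√n, and a histogram of sample_means would look roughly bell-shaped even though the source data is skewed.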

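The normal distribution N(μ, σ²) is the bell-shaped distribution defined by its mean and standard deviation, and the empirical rule says that about 68%/95%/99.7% of the data fall within ±1σ/±2σ/±3σ of the mean. The binomial distribution models the number of successes in a fixed number of trials, and the Poisson distribution models the count of rare events per unit of time. The central limit theorem states that, whatever the shape of the original distribution, the distribution of the sample mean approaches a normal distribution as the sample size grows, and it is the foundation of inferential statistics. For the distributions in detail, see the probability document.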

8. Sampling & Bias

Sampling and bias

Every inferential claim depends on the quality of the sample behind it. A simple random sample gives each member of the population an equal chance of selection and is the gold standard, but in practice true randomness requires a complete list of the population, which few populations come with. Stratified sampling splits the population into homogeneous subgroups and samples independently from each, guaranteeing representation of small but important groups. Cluster sampling randomly selects whole groups (schools, neighborhoods) and surveys everyone in them, trading some efficiency for vastly lower cost. Convenience sampling — surveying whoever is easiest to reach — is fast and almost always biased.

Sampling bias occurs whenever certain members of the population are systematically more likely to be sampled than others. The 1936 Literary Digest poll predicted that Alf Landon would defeat Franklin Roosevelt in a landslide; it was built on 2.4 million responses, but drawn from telephone directories and automobile registrations during the Great Depression — so the sample skewed strongly toward wealthier Republican-leaning households. Roosevelt won 46 of 48 states. Sample size cannot fix a biased frame: bigger samples just produce more confidently wrong answers.

Survivorship bias is another frequent trap. Statistician Abraham Wald famously recommended armoring the areas of aircraft that showed the fewest bullet holes on returning planes, because planes hit in those areas had mostly failed to return and were therefore absent from the sample. Sample size and margin of error also matter: for a survey of n people, the 95 % margin of error for a proportion is roughly 1/√n, so quadrupling the sample only halves the error. Past a few thousand respondents, reducing bias usually beats scaling up.
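
The 1/√n rule of thumb can be compared with the usual normal-approximation formula for a proportion. In this sketch the function name is hypothetical, and the defaults p = 0.5 (the worst case) and z = 1.96 (about 95 % confidence) are assumptions.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a sample proportion (normal approximation)."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (250, 1000, 4000):
    formula = margin_of_error(n)
    rule_of_thumb = 1 / math.sqrt(n)
    print(f"n = {n:>5}: formula {formula:.4f}, rule of thumb {rule_of_thumb:.4f}")
```

Going from 1000 to 4000 respondents halves the margin of error, and the rule of thumb stays within a few percent of the formula because 1.96 × 0.5 is close to 1.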

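Simple random sampling is the ideal, but in practice stratified and cluster sampling are more realistic, and convenience sampling almost always introduces bias. The 1936 Literary Digest poll surveyed 2.4 million people yet failed to predict Roosevelt's victory because the sample itself was dominated by the wealthy, the classic demonstration that sample size does not correct bias. The analysis of bullet holes on wartime aircraft (survivorship bias) is the same trap. Because error shrinks roughly as 1/√n in the sample size n, removing bias takes priority over simply enlarging the sample.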

9. Common Mistakes & Pitfalls

Common mistakes and pitfalls

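Typical pitfalls include confusing correlation with causation, the false confidence that comes from a large but biased sample, base-rate neglect (positive predictive value falls when a condition is rare), Simpson's paradox (the trend in the parts reverses in the whole), comparing rates with different denominators, and omitting confidence intervals. Rather than drawing firm conclusions from a point estimate alone, check how the sample was drawn, what the base rate is, and how wide the confidence interval is; doing so reduces the chance of being misled by statistics or misleading others with them.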
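
Base-rate neglect is easiest to appreciate with numbers. The sketch below applies Bayes' rule to a screening test; the prevalence, sensitivity, and specificity are hypothetical values picked for illustration, not figures from any real test.

```python
# Hypothetical screening scenario (all three numbers are assumptions).
prevalence = 0.01    # 1 % of people have the condition
sensitivity = 0.95   # P(test positive | condition present)
specificity = 0.95   # P(test negative | condition absent)

true_positive_rate = prevalence * sensitivity
false_positive_rate = (1 - prevalence) * (1 - specificity)

# Positive predictive value: P(condition present | test positive).
ppv = true_positive_rate / (true_positive_rate + false_positive_rate)

print(f"share of all people who are true positives:  {true_positive_rate:.4f}")
print(f"share of all people who are false positives: {false_positive_rate:.4f}")
print(f"positive predictive value: {ppv:.3f}")
```

Under these assumptions only about one positive result in six reflects a real case, because false positives from the large healthy group outnumber true positives from the small affected group.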

10. Real-World Applications & Related Concepts

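Statistics underpins decision-making across nearly every quantitative field: A/B testing in software experiments, opinion polling, quality control (Six Sigma), sports analytics (sabermetrics, xG), and machine-learning evaluation (accuracy, precision, recall, AUC, cross-validation).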

Statistics relies on probability; covariance matrices are central to multivariate stats (see matrices); count-based calculations sit on arithmetic. Vectors and geometry appear once you generalize correlation to higher dimensions and projection — see vectors and geometry — and the central limit theorem ties all of these strands together by guaranteeing that, with enough data, the normal distribution emerges from almost anywhere.

Practice → C:Stat.

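Statistics supports decision-making in nearly every quantitative field, from A/B testing (software experiments), opinion polling, and quality control (Six Sigma) to sports analytics (sabermetrics, xG) and machine-learning evaluation metrics (accuracy, precision, recall, AUC, cross-validation). Statistics is built on probability theory, covariance matrices are an application of matrices, and counting rests on arithmetic. Generalizing to higher dimensions connects naturally to vectors and geometry, and the central limit theorem ties all of these strands together through the normal distribution. Hands-on practice is available in C:Stat.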

11. Explore Each Statistics Topic in Depth

In-depth guides by topic

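The reference above is the conceptual overview. Each of the following topics has its own focused page with a step-by-step worked example, an SVG diagram, common-mistake notes, and an FAQ: mean, median, variance, standard deviation, permutations, combinations, probability, and correlation coefficient.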

Prefer a quick computation? Try the Standard Deviation Calculator or browse all math calculators.

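The reference above is the overall overview, and each topic leads to a dedicated page with step-by-step examples, diagrams, common-mistake notes, and an FAQ: mean, median, variance, standard deviation, permutations, combinations, probability, and correlation coefficient. For quick calculations, use the Standard Deviation Calculator or the calculator collection.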
