Statistics for Machine Learning (Required statistics for Machine learning)

Tejas Kamble
January 15, 2025
20 min read
Machine learning,Statistics

The most basic and most important part of machine learning and data analysis is understanding the data and analyzing its patterns, or in technical terms, the distribution of the data.

To understand machine learning and data science well, you need to know statistics. This article covers the most important statistical techniques.

Types of Statistics:

Descriptive
Inferential

Descriptive Statistics:

1) Measure of Central Tendencies (Mean, Median, Mode)

Measures of central tendency are statistical metrics that summarize a dataset by identifying its central point. They provide insights into the “typical” or “average” value in a dataset, making them essential for analyzing and comparing data distributions.

There are three main measures of central tendency: Mean, Median, and Mode.

1. Mean (Arithmetic Average)

The mean is the sum of all values in a dataset divided by the number of values. It is the most commonly used measure of central tendency.

Formula

For a dataset with $n$ values:

$$\text{Mean}(\bar{x}) = \frac{\sum x_i}{n}$$

where:

$x_i$ represents individual values,
$n$ is the total number of values.

Example

Consider the dataset: {4, 8, 10, 12, 14}

$$\bar{x} = \frac{4 + 8 + 10 + 12 + 14}{5} = \frac{48}{5} = 9.6$$

Thus, the mean is 9.6.

Pros and Cons

✅ Easy to calculate and widely used.
✅ Considers all values in the dataset.
❌ Sensitive to outliers (e.g., extreme values can skew the mean).

2. Median (Middle Value of a Sorted Dataset)

The median is the middle value when the dataset is arranged in ascending or descending order. If there are an odd number of values, the median is the middle one. If there are an even number of values, the median is the average of the two middle values.

Steps to Find the Median

Sort the dataset in ascending order.
If $n$ is odd, the median is the middle value: $\text{Median} = x_{\left(\frac{n+1}{2}\right)}$
If $n$ is even, the median is the average of the two middle values: $\text{Median} = \frac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2} + 1\right)}}{2}$

Example 1 (Odd Number of Values)

Dataset: {3, 7, 9, 11, 15}

Middle value: 9 (third value in sorted order).
Median = 9

Example 2 (Even Number of Values)

Dataset: {2, 5, 8, 12, 14, 18}

Middle two values: 8 and 12
Median: $\frac{8 + 12}{2} = 10$
Median = 10
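
The steps above can be sketched in Python as follows. The function name `median_by_steps` is illustrative and not part of any library; it simply mirrors the sort-then-pick-the-middle procedure for the two example datasets.

```python
def median_by_steps(values):
    """Median via the sort-then-pick-the-middle steps described above."""
    ordered = sorted(values)              # Step 1: sort in ascending order
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:                        # odd n: the single middle value
        return ordered[mid]
    # even n: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_median = median_by_steps([3, 7, 9, 11, 15])         # Example 1 -> 9
even_median = median_by_steps([2, 5, 8, 12, 14, 18])    # Example 2 -> 10.0
```

In practice, `statistics.median` from the standard library does the same job.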

Pros and Cons

✅ Largely unaffected by outliers, making it a good measure for skewed distributions.
✅ Works well for ordinal data.
❌ Ignores extreme values in the dataset.

3. Mode (Most Frequent Value)

The mode is the value that appears most frequently in a dataset. A dataset can have:

No mode (if all values appear once).
One mode (Unimodal dataset).
Two modes (Bimodal dataset).
More than two modes (Multimodal dataset).

Example 1 (Unimodal Dataset)

Dataset: {2, 4, 4, 6, 8, 10}

The number 4 appears twice, making it the mode.
Mode = 4

Example 2 (Bimodal Dataset)

Dataset: {1, 3, 3, 5, 7, 7, 9}

Two numbers appear most frequently: 3 and 7.
Modes = 3 and 7 (Bimodal)

Example 3 (No Mode)

Dataset: {2, 5, 8, 11, 14}

No repeated values, so no mode exists.
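
A minimal sketch of the three cases above, assuming the convention used in this article that a dataset with no repeated value has no mode. The helper name `modes` is hypothetical. Note that the standard library's `statistics.multimode` follows a different convention and returns every value when nothing repeats.

```python
from collections import Counter

def modes(values):
    """Return the most frequent values, or [] when no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:                          # every value appears once: no mode
        return []
    return sorted(v for v, c in counts.items() if c == top)

unimodal = modes([2, 4, 4, 6, 8, 10])        # [4]
bimodal = modes([1, 3, 3, 5, 7, 7, 9])       # [3, 7]
no_mode = modes([2, 5, 8, 11, 14])           # []
```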

Pros and Cons

✅ Useful for categorical data (e.g., finding the most common brand, product, or category).
✅ Works well for non-numeric data.
❌ May not exist or may be multiple, making interpretation difficult.

Choosing the Best Measure of Central Tendency

Scenario	Best Measure
Symmetric data with no outliers	Mean
Skewed data with outliers	Median
Categorical or qualitative data	Mode
Bimodal/multimodal distribution	Mode

Example: House Prices

Consider the house prices (in $1000s):
{120, 130, 135, 140, 500}

Mean = $\frac{120+130+135+140+500}{5} = 205$ (highly affected by the outlier).
Median = 135 (middle value, better representation).
Mode = No mode (no repetition).

In this case, the median is the best measure because it is unaffected by the extreme value (500).
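
To make the effect of the outlier concrete, here is a small sketch using Python's standard `statistics` module on the house prices above. Removing the 500 value is done only for illustration, to show how far the mean moves compared with the median.

```python
import statistics

prices = [120, 130, 135, 140, 500]          # house prices in $1000s

mean_price = statistics.mean(prices)         # pulled upward by the 500 outlier
median_price = statistics.median(prices)     # middle value of the sorted list

# Drop the outlier and recompute both measures.
trimmed = [p for p in prices if p != 500]
mean_without_outlier = statistics.mean(trimmed)
median_without_outlier = statistics.median(trimmed)

print(mean_price, median_price)                        # 205 135
print(mean_without_outlier, median_without_outlier)    # 131.25 132.5
```

The mean drops by more than 70 when the outlier is removed, while the median shifts by only 2.5.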

2) Measure of Dispersion (Standard Deviation, Variance)

1. Measures of Dispersion

While measures of central tendency (mean, median, mode) tell us about the center of the data, measures of dispersion describe how spread out the data points are. The most common measures of dispersion include Variance and Standard Deviation.

1.1 Variance (𝜎² or s²)

Variance is the average of the squared deviations of the data points from the mean.

Formulas for Variance

Population Variance ($\sigma^2$):

$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$$

where:
- $\sigma^2$ = population variance
- $x_i$ = individual data points
- $\mu$ = population mean
- $N$ = total number of data points in the population

Sample Variance ($s^2$):

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

where:
- $s^2$ = sample variance
- $\bar{x}$ = sample mean
- $n$ = sample size

🔹 Why use (n-1) in the sample variance formula?
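The denominator $(n-1)$ is known as Bessel’s correction. Deviations measured from the sample mean are, on average, slightly smaller than deviations from the true population mean, so dividing by $n$ would underestimate the population variance; dividing by $n-1$ corrects this bias.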

1.2 Standard Deviation (𝜎 or s)

Standard deviation is simply the square root of variance. It provides a measure of spread in the same units as the data.

Formulas for Standard Deviation

Population Standard Deviation: $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$
Sample Standard Deviation: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$

Example Calculation: Consider the dataset {10, 12, 14, 18, 20}

Find the mean: $\bar{x} = \frac{10+12+14+18+20}{5} = 14.8$
Calculate squared deviations from the mean: $(10 - 14.8)^2 = 23.04$, $(12 - 14.8)^2 = 7.84$, $(14 - 14.8)^2 = 0.64$, $(18 - 14.8)^2 = 10.24$, $(20 - 14.8)^2 = 27.04$
Find the variance: $s^2 = \frac{23.04 + 7.84 + 0.64 + 10.24 + 27.04}{5 - 1} = \frac{68.8}{4} = 17.2$
Find the standard deviation: $s = \sqrt{17.2} \approx 4.15$

Thus, the sample standard deviation is 4.15.
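
The same calculation can be sketched in Python. The block computes both the sample version (dividing by n - 1) and the population version (dividing by N), so the effect of the denominator is visible. The variable names are illustrative.

```python
import math

data = [10, 12, 14, 18, 20]
mean = sum(data) / len(data)                          # 14.8
squared_devs = [(x - mean) ** 2 for x in data]        # sums to about 68.8

sample_var = sum(squared_devs) / (len(data) - 1)      # divide by n - 1
population_var = sum(squared_devs) / len(data)        # divide by N
sample_std = math.sqrt(sample_var)
population_std = math.sqrt(population_var)

print(round(sample_var, 2), round(sample_std, 2))          # 17.2 4.15
print(round(population_var, 2), round(population_std, 2))  # 13.76 3.71
```

The standard library offers the same results through `statistics.variance` / `statistics.stdev` (sample) and `statistics.pvariance` / `statistics.pstdev` (population).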

2. Population vs. Sample

2.1 Population

A population includes all members of a defined group. When we calculate statistics for an entire population, we use N in the denominator.

Example: The average height of all people in a country.

2.2 Sample

A sample is a subset of a population. Since we usually cannot collect data from the entire population, we estimate statistics using a sample and use n-1 in the denominator.

Example: Measuring the height of 1,000 randomly selected people to estimate the national average.

3. Types of Variables

Variables are classified into categorical and numerical types.

3.1 Categorical Variables (Qualitative)

Categorical variables represent distinct groups or categories.

Nominal: No order or ranking. (e.g., Colors: {Red, Blue, Green})
Ordinal: Categories have a meaningful order. (e.g., Education level: {High School, Bachelor’s, Master’s})

3.2 Numerical Variables (Quantitative)

Numerical variables represent measurable quantities.

Discrete: Can take only specific values, usually integers. (e.g., Number of children: {0, 1, 2, 3})
Continuous: Can take any value within a range. (e.g., Height: {5.4 ft, 5.5 ft, 6.2 ft})

4. Data Visualization: Histograms and KDE

4.1 Histograms

A histogram is a graphical representation of the frequency distribution of numerical data. It consists of bins, where each bin represents a range of values, and the height of the bar represents the frequency.

🔹 Example of a Histogram:

If we have test scores {50, 55, 60, 60, 65, 70, 75, 80, 85, 90}, a histogram would show how often scores fall within certain ranges (bins like 50-60, 60-70, etc.).

🔹 Advantages of Histograms:

✅ Shows the distribution of data
✅ Helps detect skewness and outliers
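
As a rough sketch of what a histogram computes, the block below counts the test scores from the example into bins of width 10 and prints a text version of the bars. It assumes half-open bins [lo, lo + 10), so a score of 60 falls into the 60-70 bin; plotting libraries may treat bin edges slightly differently.

```python
scores = [50, 55, 60, 60, 65, 70, 75, 80, 85, 90]
bin_width = 10
start = 50

# Count how many scores fall into each half-open bin [lo, lo + bin_width).
counts = {}
for s in scores:
    lo = start + ((s - start) // bin_width) * bin_width
    counts[lo] = counts.get(lo, 0) + 1

# Print one bar of '#' characters per bin.
for lo in sorted(counts):
    print(f"{lo}-{lo + bin_width}: {'#' * counts[lo]}")
```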

4.2 Kernel Density Estimation (KDE)

A Kernel Density Estimate (KDE) is a smoothed version of a histogram. Instead of using bars, KDE uses a smooth curve to estimate the probability density function of a dataset.

🔹 Why Use KDE?

Unlike a histogram, a KDE does not depend on bin width or bin placement, so it often gives a smoother view of the data distribution. It does, however, depend on the choice of kernel bandwidth.

🔹 Example: If we have a dataset of student heights, a KDE plot would give a smooth curve that helps visualize the probability density of different height ranges.
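
A minimal sketch of the idea behind KDE, assuming a Gaussian kernel: each data point contributes a small bell curve, and the estimate is their average. The height values and the bandwidth of 4.0 are made up for illustration; real libraries typically choose the bandwidth with a rule of thumb.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function x -> estimated density, using a Gaussian kernel."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum one Gaussian bump per sample, centered on that sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

heights_cm = [160, 165, 168, 170, 171, 173, 175, 180]   # hypothetical sample
kde = gaussian_kde(heights_cm, bandwidth=4.0)            # bandwidth picked by hand

# Riemann-sum check over 120..220 cm: the density should integrate to about 1.
step = 0.1
grid = [120 + i * step for i in range(1001)]
area = sum(kde(x) for x in grid) * step
```

A smaller bandwidth makes the curve more wiggly and a larger one makes it smoother, which is the KDE counterpart of choosing a bin width.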

🔹 Difference Between Histogram & KDE

Feature	Histogram	KDE
Representation	Binned Bars	Smooth Curve
Sensitivity	Depends on bin width	Depends on kernel bandwidth
Use Case	Discrete counts	Probability density estimation

Understanding Percentiles, Quartiles & the 5-Number Summary

The Foundation of Exploratory Data Analysis (EDA)

When you’re trying to understand a dataset or detect outliers, few tools are more powerful and intuitive than percentiles, quartiles, and the 5-number summary. These help you explore how your data is distributed and identify extreme values with precision.

🔢 1. Percentiles

Percentiles divide the dataset into 100 equal parts. Each percentile tells you the value below which a certain percentage of data falls.

Example:

25th percentile (P25) → 25% of data lies below this value.
90th percentile (P90) → 90% of data lies below this value.

Percentiles help describe the relative standing of a value in the dataset.

🧮 2. Quartiles

Quartiles are specific percentiles that divide your data into four equal parts.

Quartile	Percentile Equivalent	Meaning
Q1 (1st Quartile)	25th Percentile	25% of data below
Q2 (Median)	50th Percentile	Middle value
Q3 (3rd Quartile)	75th Percentile	75% of data below

🧰 3. The 5-Number Summary

The 5-number summary is one of the most important techniques in descriptive statistics. It gives a quick overview of the distribution of your data and is crucial for visualizations like box plots.

✅ It includes:

Minimum – Smallest value in the dataset
Q1 (1st Quartile) – 25% of data lies below this
Median (Q2) – 50% of data lies below this
Q3 (3rd Quartile) – 75% of data lies below this
Maximum – Largest value in the dataset

📐 4. Interquartile Range (IQR)

The Interquartile Range (IQR) measures the spread of the middle 50% of the data:

IQR = Q3 − Q1

This is a robust measure of variability that is not affected by outliers.

🚨 5. Detecting Outliers using IQR

The IQR can be used to create fences beyond which data points are considered outliers.

🔻 Lower Fence:

Lower Fence=Q1−1.5×IQR

🔺 Upper Fence:

Upper Fence=Q3+1.5×IQR

Any value below the lower fence or above the upper fence is typically classified as an outlier.

📊 6. Example: 5-Number Summary + IQR Calculation

📘 Dataset:

data = [7, 8, 8, 10, 12, 13, 14, 15, 16, 22, 40]

🔍 Step-by-Step:

Minimum = 7
Maximum = 40
Median (Q2) = 13 (the 6th of the 11 sorted values)
Q1 (25th percentile) = 8 (median of the lower half: 7, 8, 8, 10, 12)
Q3 (75th percentile) = 16 (median of the upper half: 14, 15, 16, 22, 40)

Quartile conventions differ slightly between textbooks and software. The values here use the median-of-halves method, which leaves the overall median out of both halves.

So the 5-number summary is:

[Minimum = 7, Q1 = 8, Median = 13, Q3 = 16, Maximum = 40]

📏 Calculate IQR:

IQR = Q3 − Q1 = 16 − 8 = 8

🔎 Determine fences:

Lower Fence = 8 − (1.5 × 8) = 8 − 12 = −4
Upper Fence = 16 + (1.5 × 8) = 16 + 12 = 28

🧨 Outlier Detection:

Any value > 28 or < −4 is an outlier.
✅ So, in this dataset, 40 is an outlier.
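
The calculation can be sketched with Python's standard `statistics.quantiles`. Quartile conventions differ between tools: the default `"exclusive"` method gives Q1 = 8 and Q3 = 16 for this dataset, while the `"inclusive"` method (linear interpolation over the data range) gives Q1 = 9 and Q3 = 15.5. Both flag 40 as the only outlier.

```python
import statistics

data = [7, 8, 8, 10, 12, 13, 14, 15, 16, 22, 40]

# Default method="exclusive" interpolates at positions (n + 1) * p.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]
five_number_summary = (min(data), q1, q2, q3, max(data))

# Same data under the "inclusive" convention, for comparison.
inclusive_quartiles = statistics.quantiles(data, n=4, method="inclusive")

print(five_number_summary)        # (7, 8.0, 13.0, 16.0, 40)
print(lower_fence, upper_fence)   # -4.0 28.0
print(outliers)                   # [40]
print(inclusive_quartiles)        # [9.0, 13.0, 15.5]
```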

📦 7. Visual Representation: Box Plot

The 5-number summary is used in box plots to visually summarize data.

The box spans from Q1 to Q3
A line marks the median (Q2)
“Whiskers” extend to the smallest/largest non-outlier values
Outliers are shown as dots or stars beyond the whiskers

🌟 Why Is This Important?

Feature	Reason
Exploratory Data Analysis (EDA)	Quickly understand the spread and central tendency
Outlier Detection	IQR-based fences help detect anomalies
Feature Scaling & Normalization	Useful for feature engineering
Robust Statistics	Median and IQR are unaffected by extreme values, unlike mean & standard deviation

📌 Summary Table

Term	Description	Formula
Q1	25th percentile	–
Q2 (Median)	50th percentile	–
Q3	75th percentile	–
IQR	Interquartile range	Q3 − Q1
Lower Fence	Threshold for low outliers	Q1 − 1.5 × IQR
Upper Fence	Threshold for high outliers	Q3 + 1.5 × IQR
Outliers	Values outside fences	x < LF or x > UF

Correlation and Covariance: A Comprehensive Guide

Covariance and correlation are fundamental statistical concepts used to measure the relationship between variables. While they serve similar purposes, they differ in important ways. Let’s explore both concepts in depth.

Covariance

Covariance measures how two variables change together. It indicates the direction of the linear relationship between variables.

Mathematical Definition

The sample covariance formula is:

Cov(X,Y) = Σ[(x_i – x̄)(y_i – ȳ)] / (n-1)

Where:

x_i and y_i are individual data points
x̄ and ȳ are the means of X and Y
n is the number of data pairs

Interpretation

Positive covariance: Variables tend to move in the same direction
Negative covariance: Variables tend to move in opposite directions
Zero covariance: No linear relationship between variables

Example

Let’s consider stock prices for two companies, A and B, over 5 days:

Day | Company A | Company B
----|-----------|----------
1 | $10 | $5
2 | $12 | $6
3 | $11 | $4
4 | $13 | $7
5 | $14 | $8

Step 1: Calculate means

Mean of A = (10+12+11+13+14)/5 = $12
Mean of B = (5+6+4+7+8)/5 = $6

Step 2: Calculate deviations and their products

Day | A-Mean(A) | B-Mean(B) | Product
----|-----------|-----------|--------
1 | -2 | -1 | 2
2 | 0 | 0 | 0
3 | -1 | -2 | 2
4 | 1 | 1 | 1
5 | 2 | 2 | 4

Step 3: Sum products and divide by (n-1)

Cov(A,B) = (2+0+2+1+4)/(5-1) = 9/4 = 2.25

The positive covariance indicates that when stock A increases, stock B tends to increase as well.
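
The three steps above can be sketched as a short Python function. The name `sample_covariance` is illustrative; it implements the sample formula with n - 1 in the denominator.

```python
a = [10, 12, 11, 13, 14]    # Company A prices over 5 days
b = [5, 6, 4, 7, 8]         # Company B prices over 5 days

def sample_covariance(xs, ys):
    """Sample covariance: sum of deviation products divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    products = [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

cov_ab = sample_covariance(a, b)     # 9 / 4 = 2.25
var_a = sample_covariance(a, a)      # covariance of A with itself = sample variance = 2.5
```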

Types of Covariance

1. Positive Covariance

Indicates that variables tend to move in the same direction.

Example: Height and weight in humans typically have positive covariance because taller people generally weigh more than shorter people.

2. Negative Covariance

Indicates that variables tend to move in opposite directions.

Example: Hours spent studying and number of errors on a test typically have negative covariance because more study time usually results in fewer errors.

3. Zero Covariance

Indicates no linear relationship between variables.

Example: Shoe size and intelligence would likely have zero covariance because there’s no reason one would affect the other.

4. Autocovariance

Measures the covariance of a variable with itself at different time points.

Example: In time series analysis, the price of gold today might have a high autocovariance with its price yesterday.

5. Cross-covariance

Measures the similarity between two different time series at different time lags.

Example: Rainfall amounts and reservoir levels might show cross-covariance with a lag, as it takes time for rainfall to affect reservoir levels.

Correlation

Correlation is a standardized version of covariance that measures both the strength and direction of a linear relationship between variables. It always falls between -1 and 1.

Mathematical Definition

The Pearson correlation coefficient is:

ρ(X,Y) = Cov(X,Y) / (σ_X × σ_Y)

Where:

Cov(X,Y) is the covariance
σ_X is the standard deviation of X
σ_Y is the standard deviation of Y

Interpretation

Correlation of 1: Perfect positive correlation
Correlation of -1: Perfect negative correlation
Correlation of 0: No linear correlation
Correlation between 0 and 1: Positive correlation
Correlation between -1 and 0: Negative correlation

Example

Continuing with our stock price example:

Step 1: Calculate standard deviations

Because the covariance above uses the sample formula (dividing by n-1), the standard deviations must use n-1 as well.

For Company A: σ_A = sqrt([((-2)² + 0² + (-1)² + 1² + 2²)/(5-1)]) = sqrt(10/4) = sqrt(2.5) ≈ 1.58
For Company B: σ_B = sqrt([((-1)² + 0² + (-2)² + 1² + 2²)/(5-1)]) = sqrt(10/4) = sqrt(2.5) ≈ 1.58

Step 2: Calculate correlation

ρ(A,B) = Cov(A,B) / (σ_A × σ_B) = 2.25 / (sqrt(2.5) × sqrt(2.5)) = 2.25/2.5 = 0.9

A correlation of 0.9 indicates a strong positive linear relationship between the two stock prices. (Note: Mixing denominators, for example n-1 for the covariance and n for the standard deviations, would give 2.25/2 = 1.125, which is impossible for a correlation.)
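
A short sketch of this calculation in Python, showing why the covariance and the standard deviations must use the same denominator. With n - 1 used throughout, the stock example gives 0.9; mixing n - 1 for the covariance with n for the standard deviations gives the impossible value 1.125. Variable names are illustrative.

```python
import math

a = [10, 12, 11, 13, 14]
b = [5, 6, 4, 7, 8]
n = len(a)
mean_a, mean_b = sum(a) / n, sum(b) / n

sum_ab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))   # 9
sum_aa = sum((x - mean_a) ** 2 for x in a)                         # 10
sum_bb = sum((y - mean_b) ** 2 for y in b)                         # 10

# Consistent: sample covariance with sample standard deviations.
cov = sum_ab / (n - 1)
r_consistent = cov / (math.sqrt(sum_aa / (n - 1)) * math.sqrt(sum_bb / (n - 1)))

# Inconsistent: sample covariance with population standard deviations.
r_mixed = cov / (math.sqrt(sum_aa / n) * math.sqrt(sum_bb / n))

print(round(r_consistent, 3))   # 0.9
print(round(r_mixed, 3))        # 1.125
```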

Types of Correlation

1. Pearson Correlation

Measures linear relationships between variables with continuous, normally distributed data.

Example: The relationship between height and weight typically follows a linear pattern suitable for Pearson correlation.

2. Spearman Rank Correlation

Measures monotonic relationships, where variables tend to change together but not necessarily at a constant rate.

Example: The relationship between age and reading ability in children might be monotonic (generally increasing) but not strictly linear.

Pearson Correlation Coefficient and Spearman Rank Correlation Coefficient

This section covers both correlation methods in detail, including their formulas, interpretations, and when to use each one.

Pearson Correlation Coefficient

The Pearson correlation coefficient (r) measures the linear relationship between two continuous variables. It’s the most commonly used correlation measure in statistics.

Formula

For two variables X and Y with n observations, the Pearson correlation coefficient is calculated as:

r = Σ[(xᵢ – x̄)(yᵢ – ȳ)] / √[Σ(xᵢ – x̄)² × Σ(yᵢ – ȳ)²]

Alternatively, it can be expressed as:

r = Cov(X,Y) / (σₓ × σᵧ)

Where:

xᵢ and yᵢ are individual data points
x̄ and ȳ are the means of X and Y
Cov(X,Y) is the covariance between X and Y
σₓ and σᵧ are the standard deviations of X and Y

Properties

Range: Always between -1 and +1
Interpretation:
- r = 1: Perfect positive linear relationship
- r = -1: Perfect negative linear relationship
- r = 0: No linear relationship
- 0 < |r| < 0.3: Weak correlation
- 0.3 ≤ |r| < 0.7: Moderate correlation
- 0.7 ≤ |r| < 1: Strong correlation
Symmetry: r(X,Y) = r(Y,X)
Invariance to linear transformations: r is unchanged when either variable is rescaled as aX + b with a > 0. As a special case, if Y = aX + b (where a > 0), then r = 1

Example Calculation

Consider these data points:

X: 1, 2, 3, 4, 5
Y: 2, 3, 5, 7, 11

Step 1: Calculate means

x̄ = (1+2+3+4+5)/5 = 3
ȳ = (2+3+5+7+11)/5 = 5.6

Step 2: Calculate deviations, squares, and products

X | Y | X-x̄ | Y-ȳ | (X-x̄)² | (Y-ȳ)² | (X-x̄)(Y-ȳ)
--|---|------|------|---------|---------|------------
1 | 2 | -2 | -3.6 | 4 | 12.96 | 7.2
2 | 3 | -1 | -2.6 | 1 | 6.76 | 2.6
3 | 5 | 0 | -0.6 | 0 | 0.36 | 0
4 | 7 | 1 | 1.4 | 1 | 1.96 | 1.4
5 | 11 | 2 | 5.4 | 4 | 29.16 | 10.8
Sum | | | | 10 | 51.2 | 22.0

Step 3: Calculate r

r = 22.0 / √(10 × 51.2) = 22.0 / √512 = 22.0 / 22.63 ≈ 0.972

This indicates a very strong positive linear relationship between X and Y.
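
The worked example can be reproduced with a small function built directly from the sums-of-deviations formula. The name `pearson_r` is illustrative. The extra lines check two of the properties listed earlier: symmetry, and invariance under a positive linear rescaling of X.

```python
import math

def pearson_r(xs, ys):
    """Pearson r from sums of deviation products and squared deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]

r = pearson_r(x, y)                                  # 22 / sqrt(512)
r_swapped = pearson_r(y, x)                          # symmetry: same value
r_rescaled = pearson_r([2 * v + 3 for v in x], y)    # rescaling X leaves r unchanged

print(round(r, 3))   # 0.972
```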

Assumptions and Limitations

Assumes variables have a linear relationship
Sensitive to outliers
Both variables should be normally distributed for hypothesis testing
Measures only linear relationships

Spearman Rank Correlation Coefficient

The Spearman rank correlation coefficient (ρ or rₛ) measures the monotonic relationship between two variables by using their ranks rather than actual values.

Formula

For two variables X and Y with n observations, the Spearman correlation is calculated as:

ρ = 1 – (6 × Σd²) / (n(n² – 1))

Where:

d is the difference between the ranks of corresponding values of X and Y
n is the number of observations

This shortcut formula is exact only when there are no tied ranks. In general, ρ is defined as:

ρ = Pearson correlation coefficient between the ranks of X and Y

Properties

Range: Always between -1 and +1
Interpretation:
- ρ = 1: Perfect monotonic increasing relationship
- ρ = -1: Perfect monotonic decreasing relationship
- ρ = 0: No monotonic relationship
- Similar magnitude interpretation as Pearson (weak, moderate, strong)
Invariant to any monotonic transformation of the variables
Less sensitive to outliers than Pearson

Example Calculation

Consider these data points:

X: 5, 10, 15, 20, 25
Y: 2, 4, 5, 9, 12

Step 1: Rank the values in each variable (1 = lowest, n = highest)

X | Y | Rank X | Rank Y | d (difference) | d²
--|---|--------|--------|----------------|----
5 | 2 | 1 | 1 | 0 | 0
10 | 4 | 2 | 2 | 0 | 0
15 | 5 | 3 | 3 | 0 | 0
20 | 9 | 4 | 4 | 0 | 0
25 | 12 | 5 | 5 | 0 | 0

Σd² = 0

Step 2: Calculate ρ

ρ = 1 – (6 × 0) / (5(5² – 1)) = 1 – 0/120 = 1

This indicates a perfect monotonic relationship between X and Y.

Handling Tied Ranks

When ties occur, each tied value is assigned the average of the ranks they would have received if they were distinct. For example, if the 2nd and 3rd positions are tied, both receive rank 2.5.

With tied ranks, the shortcut formula based on Σd² is no longer exact, so compute the Pearson correlation on the ranks directly:

ρ = (Σ(xᵢ – x̄)(yᵢ – ȳ)) / √(Σ(xᵢ – x̄)² × Σ(yᵢ – ȳ)²)

Where xᵢ and yᵢ are now the ranks.
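
A minimal sketch of Spearman's ρ as "Pearson on the ranks", with tied values receiving the average of their ranks as described above. The helper names `average_ranks`, `pearson_r`, and `spearman_rho` are illustrative, not a library API.

```python
import math

def average_ranks(values):
    """Rank values 1..n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1             # average of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(xs, ys):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

rho_example = spearman_rho([5, 10, 15, 20, 25], [2, 4, 5, 9, 12])   # 1.0
tied_ranks = average_ranks([10, 20, 20, 30])        # [1.0, 2.5, 2.5, 4.0]
rho_curved = spearman_rho([1, 2, 3, 4, 5], [2, 3, 5, 7, 11])        # 1.0
```

The last line uses the data from the Pearson example: the relationship is increasing but not linear, so Pearson's r is about 0.972 while Spearman's ρ is 1.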

Assumptions and Limitations

Only assumes variables have a monotonic relationship (not necessarily linear)
Less powerful than Pearson when data is truly linear and normally distributed
Computationally more intensive for large datasets

When to Use Each Coefficient

Use Pearson When:

Data is continuous
The relationship appears linear
Both variables are approximately normally distributed
Outliers are minimal or have been addressed
You need to measure the strength of strictly linear relationships

Use Spearman When:

Data is ordinal or ranked
The relationship appears monotonic but not necessarily linear
Variables are not normally distributed
Outliers are present
You want to capture any monotonic relationship, not just linear ones
You’re analyzing variables where exact values are less important than relative ordering

Practical Examples

Pearson Correlation: Relationship between height and weight in a population (typically linear)
Spearman Correlation: Relationship between customer satisfaction ratings (1-5 scale) and likelihood to repurchase (where the relationship might be monotonic but not strictly linear)

Key Differences Between Covariance and Correlation

Scale: Covariance is affected by the scale of the variables; correlation is standardized between -1 and 1.
Units: Covariance has units (product of the units of the two variables); correlation is unitless.
Comparability: Correlations can be compared across different datasets; covariances generally cannot.
Interpretation: Correlation provides both direction and strength; covariance only reliably indicates direction.

Applications in Data Analysis

Portfolio Management

Positive covariance between assets increases portfolio risk
Negative covariance helps diversify risk

Machine Learning

Correlation analysis helps identify relevant features
Principal Component Analysis uses covariance matrices to reduce dimensionality

Quality Control

Correlation between process variables helps identify root causes of defects

Economic Analysis

Correlation between GDP and unemployment rate helps understand economic cycles

Probability Distribution

The Relationship Between PDF, PMF, and CDF

Probability Mass Function (PMF)

The PMF applies to discrete random variables and gives the probability that a random variable X equals a specific value x.

Mathematical Definition: P(X = x) = PMF(x)

Properties:

Non-negative: PMF(x) ≥ 0 for all x
Sum to 1: Σ PMF(x) = 1 (over all possible values)
Range: 0 ≤ PMF(x) ≤ 1

Example: For a fair six-sided die, the PMF is:

P(X = 1) = P(X = 2) = … = P(X = 6) = 1/6

Probability Density Function (PDF)

The PDF applies to continuous random variables and represents the relative likelihood of the random variable taking on a specific value.

Mathematical Definition: f(x) = dF(x)/dx, where F(x) is the CDF

Properties:

Non-negative: f(x) ≥ 0 for all x
Area equals 1: ∫f(x)dx = 1 (integrated over all possible values)
P(a ≤ X ≤ b) = ∫(from a to b) f(x)dx
Unlike PMF, PDF can exceed 1 at specific points

Example: The PDF of a standard normal distribution is: f(x) = (1/√(2π)) * e^(-x²/2)

Cumulative Distribution Function (CDF)

The CDF applies to both discrete and continuous random variables and gives the probability that X is less than or equal to x.

Mathematical Definition: F(x) = P(X ≤ x)

For discrete variables: F(x) = Σ PMF(t) for all t ≤ x
For continuous variables: F(x) = ∫(from -∞ to x) f(t)dt

Properties:

Non-decreasing: F(x₁) ≤ F(x₂) if x₁ < x₂
Limits: lim(x→-∞) F(x) = 0 and lim(x→∞) F(x) = 1
Range: 0 ≤ F(x) ≤ 1
P(a < X ≤ b) = F(b) – F(a)

Example: For a standard normal distribution, the CDF doesn’t have a simple closed form but is often denoted as Φ(x).
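
The relationship between the three functions can be sketched in Python. The discrete part builds the CDF of the fair die by summing its PMF; the continuous part uses `statistics.NormalDist` from the standard library for the standard normal PDF and CDF. The helper name `die_cdf` is illustrative.

```python
from fractions import Fraction
from statistics import NormalDist

# Discrete case: fair six-sided die.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def die_cdf(x):
    """F(x) = P(X <= x): sum the PMF over all faces t <= x."""
    return sum(p for face, p in pmf.items() if face <= x)

total = sum(pmf.values())               # the PMF sums to 1
p_at_most_4 = die_cdf(4)                # 4/6 = 2/3
p_3_to_5 = die_cdf(5) - die_cdf(2)      # P(2 < X <= 5) = F(5) - F(2) = 1/2

# Continuous case: standard normal distribution.
z = NormalDist(mu=0, sigma=1)
density_at_0 = z.pdf(0)                 # 1/sqrt(2*pi): a density, not a probability
p_within_1 = z.cdf(1) - z.cdf(-1)       # P(-1 < X <= 1), about 0.683
```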

Types of Probability Distributions

Probability distributions come in two main categories:

Discrete Probability Distributions

Bernoulli Distribution: Models binary outcomes (success/failure)
Binomial Distribution: Sum of independent Bernoulli trials
Poisson Distribution: Models rare events in fixed intervals
Geometric Distribution: Number of trials until first success
Negative Binomial: Number of trials until k successes
Hypergeometric: Sampling without replacement
Discrete Uniform: Equal probability for all outcomes

Continuous Probability Distributions

Normal/Gaussian Distribution: Bell-shaped curve
Standard Normal Distribution: Normal with μ=0, σ=1
Uniform Distribution: Equal probability density over interval
Exponential Distribution: Time between Poisson events
Log-Normal Distribution: When logarithm follows normal distribution
Chi-Square Distribution: Sum of squared standard normal variables
Student’s t-Distribution: Used for small sample statistics
F-Distribution: Ratio of two chi-squared variables, each divided by its degrees of freedom
Beta Distribution: Models probabilities or proportions
Gamma Distribution: Generalizes exponential and chi-squared
Weibull Distribution: Models failure rates and reliability
Pareto Distribution: Power-law probability distribution
Cauchy Distribution: Heavy-tailed distribution

Bernoulli Distribution

The Bernoulli distribution is the simplest discrete probability distribution, modeling a single binary outcome.

Properties

PMF: P(X = x) = p^x × (1-p)^(1-x) for x ∈ {0, 1}
Mean: E[X] = p
Variance: Var(X) = p(1-p)
Parameter: p = probability of success (0 ≤ p ≤ 1)

Example

Flipping a coin once with p = 0.5 probability of heads:

P(X = 1) = 0.5 (heads)
P(X = 0) = 0.5 (tails)

[Figure: Bernoulli distribution]

Applications

Quality control (defective/non-defective)
Medical tests (positive/negative)
Elections (win/lose)
Any yes/no, success/failure scenario

Binomial Distribution

The binomial distribution models the number of successes in a fixed number of independent Bernoulli trials.

Properties

PMF: P(X = k) = (n choose k) × p^k × (1-p)^(n-k)
Mean: E[X] = np
Variance: Var(X) = np(1-p)
Parameters:
- n = number of trials
- p = probability of success on each trial

Example

Flipping a fair coin 10 times and counting heads:

n = 10, p = 0.5
P(X = 5) = (10 choose 5) × 0.5^5 × 0.5^5 = 252 × 0.0009766 ≈ 0.246
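
The PMF formula translates almost directly into Python using `math.comb`. The sketch below recomputes the coin example and checks the mean and variance formulas by summing over all possible outcomes. The function name `binomial_pmf` is illustrative.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

p_five_heads = binomial_pmf(5, n=10, p=0.5)     # 252 / 1024

# Mean and variance computed from the full PMF over k = 0..10.
pmf_values = [binomial_pmf(k, 10, 0.5) for k in range(11)]
mean = sum(k * pk for k, pk in enumerate(pmf_values))                     # n*p = 5
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf_values))   # n*p*(1-p) = 2.5

print(round(p_five_heads, 3))   # 0.246
```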

[Figure: Binomial distribution]

Applications

Quality control (number of defects in a batch)
Medical studies (number of patients recovering)
Sports statistics (number of successful shots)
Election polling (number of voters supporting a candidate)

Poisson Distribution

The Poisson distribution models the number of events occurring in a fixed interval of time or space, when these events happen at a constant average rate.

Properties

PMF: P(X = k) = (e^(-λ) × λ^k) / k!
Mean: E[X] = λ
Variance: Var(X) = λ
Parameter: λ = average number of events per interval

Example

If emails arrive at an average rate of 5 per hour:

λ = 5
P(X = 3) = (e^(-5) × 5^3) / 3! = (0.0067 × 125) / 6 = 0.140
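
A small sketch of the Poisson PMF for the email example. Summing the PMF over k = 0..3 also gives the probability of receiving at most three emails in an hour, which is the CDF at 3. The function name `poisson_pmf` is illustrative.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

p_three_emails = poisson_pmf(3, lam=5)                         # about 0.140
p_at_most_three = sum(poisson_pmf(k, 5) for k in range(4))     # P(X <= 3), about 0.265

print(round(p_three_emails, 3))    # 0.14
print(round(p_at_most_three, 3))   # 0.265
```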

[Figure: Poisson distribution]

Applications

Customer arrivals at a service counter
Number of calls to a call center
Number of defects in manufacturing
Number of accidents at an intersection
Radioactive decay events

Normal/Gaussian Distribution

The normal distribution is the most important continuous probability distribution, characterized by its bell-shaped curve.

Properties

PDF: f(x) = (1/(σ√(2π))) × e^(-(x-μ)²/(2σ²))
Mean: E[X] = μ
Variance: Var(X) = σ²
Parameters:
- μ = mean (location parameter)
- σ = standard deviation (scale parameter)

Example

Human heights often follow a normal distribution. If adult male heights have μ = 175 cm and σ = 7 cm:

About 68% of men have heights between 168-182 cm (μ ± σ)
About 95% of men have heights between 161-189 cm (μ ± 2σ)
About 99.7% of men have heights between 154-196 cm (μ ± 3σ)
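
These percentages can be checked with `statistics.NormalDist` from the standard library, using the μ = 175 cm and σ = 7 cm from the example. The helper `share_within` and the 190 cm threshold in the last line are illustrative additions.

```python
from statistics import NormalDist

heights = NormalDist(mu=175, sigma=7)

def share_within(k):
    """Probability that a height lies within k standard deviations of the mean."""
    return heights.cdf(175 + k * 7) - heights.cdf(175 - k * 7)

within_1 = share_within(1)    # 168-182 cm
within_2 = share_within(2)    # 161-189 cm
within_3 = share_within(3)    # 154-196 cm

# Tail probability: share of men taller than 190 cm under this model.
p_taller_than_190 = 1 - heights.cdf(190)

print(round(within_1, 4), round(within_2, 4), round(within_3, 4))
# 0.6827 0.9545 0.9973
```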