Stratified Sampling in Pandas

Stratified sampling is a statistical method used to select a sample from a population in a way that the different subgroups or strata within the population are proportionally represented in the sample. This technique is particularly useful when we want to ensure that the sample has the same distribution of certain characteristics as the population. In this article, we will explore how to perform stratified sampling in Python using the Pandas library.

Before diving into the code, let’s first understand the statistical idea behind stratified sampling. Suppose we have a population with two distinct subgroups or strata, and we want to draw a sample of size n from this population. Under proportional allocation, each stratum should make up the same share of the sample as it does of the population. That share is the probability that a randomly chosen member of the population belongs to the stratum, which for the first stratum is the ratio of its size to the total population size. Mathematically, this can be represented as:

P(X_i \in S_1) = \frac{N_1}{N}

where N_1 is the size of the first stratum, and N is the total population size.

Similarly, the probability of selecting an individual from the second stratum is:

P(X_i \in S_2) = \frac{N_2}{N}

where N_2 is the size of the second stratum.
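To make the formulas concrete, here is a minimal sketch of proportional allocation: each stratum receives n * N_h / N of the n sampled units. The function name proportional_allocation and the example sizes are hypothetical, chosen only for illustration. Rounding can make the allocations sum to slightly more or less than n when the shares are not whole numbers.

```python
def proportional_allocation(stratum_sizes, n):
    """Return per-stratum sample sizes n_h = n * N_h / N, rounded to integers.

    stratum_sizes maps a stratum label to its population size N_h.
    (Hypothetical helper, not part of pandas.)
    """
    total = sum(stratum_sizes.values())  # N, the total population size
    return {name: round(n * size / total) for name, size in stratum_sizes.items()}


# Two equally sized strata, total sample of 2000
allocation = proportional_allocation({"A": 10000, "B": 10000}, n=2000)
print(allocation)  # {'A': 1000, 'B': 1000}
```

With two strata of 10,000 each and a total sample of 2,000, each stratum contributes 1,000 units, which is the allocation used in the rest of this article.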

Now, let’s see how to perform stratified sampling in Pandas. First, let’s create a sample dataset with two strata or subgroups.

import pandas as pd
import numpy as np

np.random.seed(0)

# Create a sample dataset with two strata
data = {'Stratum': ['A' if i < 10000 else 'B' for i in range(20000)],
        'Value': np.random.normal(loc=50, scale=10, size=20000)}

df = pd.DataFrame(data)

# Split the dataset into two strata
stratum_A = df[df['Stratum'] == 'A']
stratum_B = df[df['Stratum'] == 'B']

Next, we will randomly sample 1000 individuals from each stratum, without replacement. Because the two strata are the same size, drawing the same number from each preserves their proportional representation in the population. Note that the Pandas sample function has no stratify parameter, so we call sample on each stratum separately.

# Sample 1000 individuals from each stratum
sample_A = stratum_A.sample(n=1000, random_state=0)
sample_B = stratum_B.sample(n=1000, random_state=0)

# Combine the samples from both strata
sample = pd.concat([sample_A, sample_B])
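When the strata differ in size, splitting and sampling each one by hand becomes tedious. One alternative, sketched here under the assumption of pandas 1.1 or later, is to group by the stratum column and sample the same fraction from every group. The 15,000 / 5,000 split, the 10% fraction, and the variable names are illustrative assumptions, not part of the example above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Illustrative population with unequal strata: 15,000 in A and 5,000 in B
population = pd.DataFrame({
    "Stratum": ["A"] * 15000 + ["B"] * 5000,
    "Value": rng.normal(loc=50, scale=10, size=20000),
})

# Draw 10% of each stratum without replacement, so the sample
# keeps the population's 75% / 25% split between the strata
stratified = population.groupby("Stratum").sample(frac=0.1, random_state=0)

counts = stratified["Stratum"].value_counts().sort_index()
print(counts)
```

Sampling by fraction rather than by a fixed n per group is what keeps the allocation proportional: larger strata automatically contribute more rows.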

Finally, let’s verify that our sample has the same distribution of subgroups as the population.

# Calculate the proportion of each stratum in the population
proportion_A = len(stratum_A) / len(df)
proportion_B = len(stratum_B) / len(df)

# Calculate the proportion of each stratum in the sample
proportion_A_sample = len(sample_A) / len(sample)
proportion_B_sample = len(sample_B) / len(sample)

# Print the results
print(f"Proportion of Stratum A in the population: {proportion_A:.3f}")
print(f"Proportion of Stratum A in the sample: {proportion_A_sample:.3f}")
print(f"Proportion of Stratum B in the population: {proportion_B:.3f}")
print(f"Proportion of Stratum B in the sample: {proportion_B_sample:.3f}")

Output:

Proportion of Stratum A in the population: 0.500
Proportion of Stratum A in the sample: 0.500
Proportion of Stratum B in the population: 0.500
Proportion of Stratum B in the sample: 0.500

As we can see, our sample has the same distribution of subgroups as the population, so it is representative of the population with respect to the stratification variable.
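The same check can be written more compactly with value_counts(normalize=True), which returns the share of each stratum directly. The sketch below uses a tiny made-up dataset (6 rows in A, 4 in B) and assumes pandas 1.1 or later for the grouped sample call. The variable names are illustrative.

```python
import pandas as pd

# Tiny illustrative population: 60% stratum A, 40% stratum B
population = pd.DataFrame({"Stratum": ["A"] * 6 + ["B"] * 4})

# Take half of each stratum (3 rows from A, 2 rows from B)
sample = population.groupby("Stratum").sample(frac=0.5, random_state=0)

# Side-by-side proportions of each stratum in the population and the sample
comparison = pd.DataFrame({
    "population": population["Stratum"].value_counts(normalize=True),
    "sample": sample["Stratum"].value_counts(normalize=True),
}).sort_index()
print(comparison)
```

If the two columns agree for every stratum, the sample preserves the population's stratum proportions.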

Stratified sampling is a powerful tool in data science and statistics, and with the help of Pandas, it can be easily implemented in Python.
