Cluster Sampling in Pandas

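Cluster sampling is a type of probability sampling in which the population is first divided into clusters, or groups, and then a random sample of those clusters is selected. Either every member of each selected cluster is surveyed (one-stage cluster sampling), or a random sample is drawn within each selected cluster (two-stage cluster sampling). This differs from stratified sampling, where a sample is drawn from every group. Cluster sampling is often used when surveying the entire population is not feasible or cost-effective. In this tutorial, we will learn how to perform cluster sampling using Python and the Pandas library.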

Before we dive into the code, it’s important to understand the underlying statistical concept of cluster sampling. Let’s consider an example to illustrate this.

Example: Suppose we want to estimate the average income of all adults in a large city. However, surveying every adult in the city would be too time-consuming and expensive. Instead, we can divide the city into clusters based on geographical areas, such as neighborhoods. We can then randomly select a few neighborhoods and survey all the adults in those neighborhoods to estimate the average income of all adults in the city.

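Assuming the clusters are chosen by simple random sampling and the population size is known, the estimator of the population mean under two-stage cluster sampling is:

 \bar{X}_{cs} = \frac{1}{N} \cdot \frac{K}{k} \sum_{i=1}^{k} \frac{N_i}{n_i} \sum_{j=1}^{n_i} x_{ij}

Where:

  •  \bar{X}_{cs} is the estimated population mean for the cluster sampling method.
  •  N is the total population size.
  •  K is the total number of clusters in the population.
  •  k is the number of clusters selected for the sample.
  •  n_i is the number of individuals sampled from the  i^{th} selected cluster.
  •  N_i is the population size of the  i^{th} selected cluster.
  •  x_{ij} is the  j^{th} observation sampled from the  i^{th} selected cluster.

When every individual in a selected cluster is surveyed, as in the neighborhood example, n_i = N_i and the weight  \frac{N_i}{n_i} equals 1.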

Now, let’s see how to perform cluster sampling using Python and Pandas.

Setup

First, we need to import the necessary libraries and create a sample dataset.

import pandas as pd
import numpy as np

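Next, create a sample dataset with 1000 observations and 2 variables: cluster (a label from 1 to 10) and income.

# Fix the random seed so the example is reproducible
np.random.seed(0)
# Assign each observation to one of 10 clusters and draw a normally distributed income
data = {'cluster': np.random.randint(1, 11, 1000),
        'income': np.random.normal(50000, 10000, 1000)}
df = pd.DataFrame(data)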
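Before sampling, it can help to check how many observations landed in each cluster: the labels are drawn at random, so the clusters will not all be exactly the same size. The following is a minimal sketch that rebuilds the same dataset so it runs on its own; the name cluster_sizes is only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the sample dataset from the setup step
np.random.seed(0)
df = pd.DataFrame({'cluster': np.random.randint(1, 11, 1000),
                   'income': np.random.normal(50000, 10000, 1000)})

# Count the observations in each cluster, ordered by cluster label
cluster_sizes = df['cluster'].value_counts().sort_index()
print(cluster_sizes)
```

Each cluster should hold roughly 100 observations, which is comfortably more than the handful sampled from a cluster later on.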

Cluster Sampling

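To perform two-stage cluster sampling, we will first randomly select k of the clusters and then randomly sample n observations from each selected cluster. Clusters that are not selected contribute no observations.

# Define the number of clusters to select and the number of observations to sample from each selected cluster
k = 5
n = 5

# Stage 1: randomly select k of the available clusters, without replacement
clusters = df['cluster'].unique()
selected = np.random.choice(clusters, size=k, replace=False)

# Group only the rows that belong to the selected clusters
grouped = df[df['cluster'].isin(selected)].groupby('cluster')

# Initialize an empty list to store the results
results = []

# Stage 2: sample n observations from each selected cluster
for name, group in grouped:
    cluster_size = len(group)
    sample = group.sample(n, replace=False)
    mean = sample['income'].mean()
    results.append({'cluster': name, 'cluster_size': cluster_size, 'mean_income': mean})

# Combine the results into a single dataframe
output = pd.DataFrame(results)

The output dataframe contains one row per selected cluster, with the cluster number, the cluster size, and the mean income of the observations sampled from that cluster.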
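Per-cluster means can also be combined into a single estimate of the population mean. The sketch below shows one possible approach rather than a definitive implementation: it selects k clusters at random, samples n observations within each, and then weights each cluster's sample mean by that cluster's size (a ratio-type estimator, which does not require knowing the total population size). The names summary and estimate are illustrative, and the fixed seed is an assumption made only for reproducibility.

```python
import numpy as np
import pandas as pd

# Rebuild the sample dataset from the setup step
np.random.seed(0)
df = pd.DataFrame({'cluster': np.random.randint(1, 11, 1000),
                   'income': np.random.normal(50000, 10000, 1000)})

k = 5  # number of clusters to select
n = 5  # observations to sample from each selected cluster

# Stage 1: randomly select k clusters without replacement
selected = np.random.choice(df['cluster'].unique(), size=k, replace=False)

# Stage 2: sample n observations within each selected cluster
rows = []
for name, group in df[df['cluster'].isin(selected)].groupby('cluster'):
    sample = group.sample(n, replace=False)
    rows.append({'cluster': name,
                 'cluster_size': len(group),
                 'mean_income': sample['income'].mean()})
summary = pd.DataFrame(rows)

# Weight each cluster's sample mean by the size of that cluster
estimate = np.average(summary['mean_income'], weights=summary['cluster_size'])
print(f"Estimated mean income: {estimate:.2f}")
print(f"True mean income:      {df['income'].mean():.2f}")
```

With only 25 sampled observations the estimate will be noisy. Increasing n or k should, on average, bring it closer to the true mean of the simulated data.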
