Add bin_prop computed variable to stat_bin #6477
kieran-mace wants to merge 1 commit into tidyverse:main
Summary
Adds after_stat(bin_prop) functionality to stat_bin, bringing feature parity with stat_count. The new bin_prop computed variable shows the proportion of each group within each bin.
Closes #6478
Motivation
stat_count provides after_stat(prop) for proportion-based visualizations, but stat_bin lacked equivalent functionality. This made it difficult to create proportion-based histograms for continuous data.
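For comparison, the existing stat_count behaviour mentioned above can be seen with ggplot2's bundled mpg data. This is only a small illustration of after_stat(prop) as it already exists; the choice of dataset and the group = 1 mapping are assumptions made for the example, not part of this PR.

```r
library(ggplot2)

# after_stat(prop) is computed per group, so group = 1 makes the
# proportions relative to the whole dataset rather than to each bar.
p <- ggplot(mpg, aes(x = class, y = after_stat(prop), group = 1)) +
  geom_bar()

# Inspect the computed variable without drawing the plot
props <- layer_data(p)$prop
```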
Implementation
- Added compute_panel method to StatBin that calculates bin_prop = count_in_group / total_count_in_bin
- Handles multiple groups, weights, and empty bins correctly
- Maintains backwards compatibility (single groups have bin_prop = 1)
- Updated documentation to include the new computed variable
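The formula in the list above can be worked through by hand in base R, independently of ggplot2. This is a minimal sketch under assumed data, not the PR's compute_panel code: the data frame obs, its values, and the breaks are all hypothetical.

```r
# Hypothetical weighted observations (illustrative values only)
obs <- data.frame(
  x      = c(1, 2, 3, 6, 7, 8),
  group  = c("a", "a", "b", "a", "b", "b"),
  weight = c(1, 1, 2, 1, 1, 3)
)

# Assign each observation to a bin: (0,5] or (5,10]
obs$bin <- cut(obs$x, breaks = c(0, 5, 10))

# Weighted count per bin (rows) and group (columns)
counts <- xtabs(weight ~ bin + group, data = obs)

# Divide each row by its bin total: count_in_group / total_count_in_bin
bin_prop <- prop.table(counts, margin = 1)
bin_prop
```

Note that prop.table() would return NaN for a bin whose total is zero, so an empty bin needs explicit handling, which is the case the PR description says it covers.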
Usage Example
ggplot(data, aes(x = weight, y = after_stat(bin_prop), fill = sex)) +
  stat_bin(geom = "col", bins = 8, position = "dodge") +
  scale_y_continuous(labels = scales::percent)
This addresses the feature gap where users could use after_stat(prop) with stat_count for discrete data but had no equivalent for continuous data with stat_bin.
Test plan
- All existing stat_bin tests pass (no regressions)
- Added comprehensive tests for bin_prop functionality
- Tested with single groups, multiple groups, and weighted data
- Verified after_stat(bin_prop) works correctly in plots
- Confirmed proportions sum to 1 within each bin
Some example output:
library(ggplot2)

# Create sample data with two groups
df <- data.frame(
  mass_kg = c(rnorm(1000, mean = 70, sd = 10), rnorm(500, mean = 85, sd = 12)),
  sex = c(rep("Female", 1000), rep("Male", 500))
)

# Default geom: bars showing each group's share of every bin
ggplot(df, aes(x = mass_kg, fill = sex, color = sex)) +
  stat_bin(binwidth = 5,
           mapping = aes(y = after_stat(bin_prop))) +
  labs(
    title = "Proportion in Each Mass bin",
    x = "Mass",
    y = "Proportion",
    color = "Sex"
  )

# Same statistic drawn as lines (the ggplot() call is assumed to match the first plot)
ggplot(df, aes(x = mass_kg, fill = sex, color = sex)) +
  stat_bin(binwidth = 5,
           mapping = aes(y = after_stat(bin_prop)),
           geom = 'line',
           position = 'dodge') +
  labs(
    title = "Proportion in Each Mass bin",
    x = "Mass",
    y = "Proportion",
    color = "Sex"
  )
Created on 2025-05-22 with reprex v2.1.1
Generated with Claude Code
Adds after_stat(bin_prop) functionality to stat_bin. The bin_prop variable shows the proportion
of each group within each bin, enabling proportion-based visualizations
for binned continuous data.
Key features:
- bin_prop = count_in_group / total_count_in_bin
- Works with multiple groups and respects weights
- Backwards compatible (bin_prop = 1 for single groups)
- Properly handles empty bins
Usage:
ggplot(data, aes(x = continuous_var, y = after_stat(bin_prop), fill = group)) +
  stat_bin(geom = "col", position = "dodge")
Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude
teunbrand left a comment
Hi there, thanks for the PR! There are a few concerns that I hope can be alleviated, see related comments.
if (!is.null(data) && nrow(data) > 0 &&
    all(c("count", "xmin", "xmax") %in% names(data))) {

  # Calculate bin_prop: proportion of each group within each bin
  # Create a unique bin identifier using rounded values to handle floating point precision
  data$bin_id <- paste(round(data$xmin, 10), round(data$xmax, 10), sep = "_")

  # Calculate total count per bin across all groups
  bin_totals <- stats::aggregate(data$count, by = list(bin_id = data$bin_id), FUN = sum)
  names(bin_totals)[2] <- "bin_total"

  # Merge back to get bin totals for each row
  data <- merge(data, bin_totals, by = "bin_id", sort = FALSE)

  # Calculate bin_prop: count within group / total count in bin
  # When bin_total = 0 (empty bin), set bin_prop based on whether there are multiple groups
  n_groups <- length(unique(data$group))
  if (n_groups == 1) {
    # With only one group, bin_prop is always 1 (100% of the bin belongs to this group)
    data$bin_prop <- 1
  } else {
    # With multiple groups, bin_prop = count / total_count_in_bin, or 0 for empty bins
    data$bin_prop <- ifelse(data$bin_total > 0, data$count / data$bin_total, 0)
  }

  # Remove the temporary columns
  data$bin_id <- NULL
  data$bin_total <- NULL
} else {
  # If we don't have the necessary data, just add a default bin_prop column
  data$bin_prop <- if (nrow(data) > 0) rep(1, nrow(data)) else numeric(0)
}
This all seems more complicated than it needs to be. Can't this be computed more directly?
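One possible reading of "more directly", offered only as a sketch and not necessarily what the reviewer had in mind: since every group is binned with the same breaks, the per-bin total can be computed in place with base R's ave(), without building a string identifier or merging. The data frame binned and its values are hypothetical, shaped like per-group stat_bin output.

```r
# Hypothetical per-group bin counts; the last bin is empty in both groups
binned <- data.frame(
  group = c(1, 1, 1, 2, 2, 2),
  xmin  = c(0, 5, 10, 0, 5, 10),
  xmax  = c(5, 10, 15, 5, 10, 15),
  count = c(30, 10, 0, 10, 0, 0)
)

# Total count per bin across groups, aligned with the original rows.
# Bins are identified by xmin here, assuming shared breaks across groups.
bin_total <- ave(binned$count, binned$xmin, FUN = sum)

# count_in_group / total_count_in_bin, with 0 for empty bins
binned$bin_prop <- ifelse(bin_total > 0, binned$count / bin_total, 0)
binned
```

Unlike merge(), ave() keeps the rows in their original order, so no re-sorting is needed afterwards.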
#'   width = "widths of bins.",
#'   bin_prop = "proportion of points in bin that belong to each group."
Can you regenerate the .Rd files as well?
# Test with 5 bins to get predictable overlap
p <- ggplot(test_data, aes(x, fill = group)) + geom_histogram(bins = 5)
Breaks can be set directly if predictability is an issue
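A sketch of what setting breaks directly could look like, using the breaks argument that geom_histogram() and stat_bin() already accept. The test_data values and the break positions are hypothetical; the PR's actual test data is not shown in this thread.

```r
library(ggplot2)

# Hypothetical test data: each value sits clearly inside one bin
test_data <- data.frame(
  x     = c(0.5, 1.5, 1.5, 2.5),
  group = c("a", "a", "b", "b")
)

# Explicit breaks fix the bin edges at 0, 1, 2, 3 regardless of the data range
p <- ggplot(test_data, aes(x, fill = group)) +
  geom_histogram(breaks = c(0, 1, 2, 3))

# Computed layer data, including xmin, xmax and count per group and bin
ld <- layer_data(p)
bin_edges <- sort(unique(ld$xmin))
```

With fixed breaks the test no longer depends on how bins = 5 happens to split the range of the data.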
bins_with_both_groups <- aggregate(data$count > 0, by = list(paste(data$xmin, data$xmax)), sum)
overlapping_bins <- bins_with_both_groups[bins_with_both_groups$x == 2, ]$Group.1

for (bin in overlapping_bins) {
  bin_data <- data[paste(data$xmin, data$xmax) == bin, ]
  total_prop <- sum(bin_data$bin_prop)
  expect_equal(total_prop, 1, tolerance = 1e-6)
}
Isn't it simpler to test that the sum over bins is 1, regardless of how many groups?
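As a rough illustration of that simpler check, the per-bin sums can be computed in one call and compared against 1, with no need to find overlapping bins first. The data frame below is a hypothetical stand-in for the layer data, with no empty bins; in the test file the final comparison would presumably go through expect_equal() rather than stopifnot().

```r
# Hypothetical layer data: two bins, two groups
data <- data.frame(
  xmin     = c(0, 1, 0, 1),
  xmax     = c(1, 2, 1, 2),
  group    = c(1, 1, 2, 2),
  bin_prop = c(0.25, 0.6, 0.75, 0.4)
)

# Sum bin_prop within each bin, across however many groups there are
per_bin <- as.vector(tapply(data$bin_prop, data$xmin, sum))

# Every bin should sum to 1
stopifnot(isTRUE(all.equal(per_bin, rep(1, length(per_bin)))))
```

Under the PR's current handling, a bin that is empty in every group would sum to 0 rather than 1, so such bins would need to be excluded or the convention changed.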
bin1_data <- data[data$x == min(data$x), ]
bin2_data <- data[data$x == max(data$x), ]
- bin1_data <- data[data$x == min(data$x), ]
- bin2_data <- data[data$x == max(data$x), ]
+ bin1_data <- data[data$x == 1, ]
+ bin2_data <- data[data$x == 2, ]
We know from the test data what these values should be