6. Selecting the optimal bin width

library(quollr)
library(dplyr)
library(ggplot2)

set.seed(20240110)

We demonstrate how to identify the optimal bin width for hexagonal binning in the 2-D embedding space. Selecting an appropriate bin width is crucial for balancing model complexity and prediction accuracy when comparing structures between high-dimensional data and their 2-D layout.

We begin by computing model errors across a range of bin width values using the gen_diffbin1_errors() function. This function fits models for multiple bin widths and returns root mean squared error (RMSE) values for each configuration.


error_df_all <- gen_diffbin1_errors(highd_data = scurve, 
                                    nldr_data = scurve_umap)

error_df_all <- error_df_all |>
  mutate(a1 = round(a1, 2)) |>
  filter(b1 >= 5) |>
  group_by(a1) |>
  filter(RMSE == min(RMSE)) |>
  ungroup()

We round the bin width values (a1), filter for sufficient bin resolution (b1 >= 5), and select the configuration with the lowest RMSE for each unique bin width.
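To make the filtering step concrete, here is a minimal sketch on a small made-up data frame. The values in toy_errors are hypothetical and are not output from gen_diffbin1_errors(); only the dplyr pipeline mirrors the one above.

```r
library(dplyr)

# Hypothetical toy results: two configurations round to the same bin width (0.20)
toy_errors <- tibble(
  a1   = c(0.204, 0.198, 0.161, 0.262),
  b1   = c(7L, 7L, 8L, 4L),
  RMSE = c(0.35, 0.33, 0.27, 0.45)
)

toy_best <- toy_errors |>
  mutate(a1 = round(a1, 2)) |>   # 0.204 and 0.198 both become 0.20
  filter(b1 >= 5) |>             # drops the configuration with b1 = 4
  group_by(a1) |>
  filter(RMSE == min(RMSE)) |>   # keeps the lowest-RMSE row within each bin width
  ungroup()

toy_best
```

Two rows remain, one per rounded bin width. Note that filter(RMSE == min(RMSE)) keeps every row tied for the minimum, so a bin width can still appear more than once if two configurations have exactly the same RMSE.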

# Show the five largest bin widths first
error_df_all |>
  arrange(desc(a1)) |>
  head(5)
#> # A tibble: 5 × 9
#>   Error  RMSE    b1    b2     b     m    a1    a2  d_bar
#>   <dbl> <dbl> <int> <dbl> <dbl> <int> <dbl> <dbl>  <dbl>
#> 1  629. 0.410     5     7    35    22  0.26  0.23 0.0409
#> 2  563. 0.367     6     8    48    27  0.23  0.2  0.0369
#> 3  520. 0.336     7     9    63    33  0.2   0.17 0.0323
#> 4  430. 0.272     8    11    88    46  0.16  0.14 0.0260
#> 5  407. 0.254     9    12   108    51  0.14  0.12 0.0255

The plot below shows the relationship between bin width (a1) and RMSE. The goal is to identify a bin width that keeps RMSE low without making the binning overly coarse or overly fine.

# Plot RMSE against bin width, one point per retained configuration
ggplot(error_df_all, aes(x = a1, y = RMSE)) +
  geom_point(size = 0.8) +
  geom_line(linewidth = 0.3) +
  ylab("RMSE") +
  xlab(expression(paste("binwidth (", a[1], ")"))) +
  theme_minimal() +
  theme(panel.border = element_rect(fill = 'transparent'),
        plot.title = element_text(size = 12, hjust = 0.5, vjust = -0.5),
        axis.ticks.x = element_line(),
        axis.ticks.y = element_line(),
        legend.position = "none",
        axis.text.x = element_text(size = 7),
        axis.text.y = element_text(size = 7),
        axis.title.x = element_text(size = 7),
        axis.title.y = element_text(size = 7))

RMSE vs. bin width.
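
One way to read such a curve numerically is to look at how much RMSE improves at each step from a coarser to a finer bin width. The sketch below uses only the five rows printed in the table above (not the full error_df_all), and the stopping rule with the threshold tol is a hypothetical heuristic for illustration, not a method provided by quollr.

```r
# Five largest bin widths and their RMSE values, copied from the table above
err <- data.frame(
  a1   = c(0.26, 0.23, 0.20, 0.16, 0.14),
  RMSE = c(0.410, 0.367, 0.336, 0.272, 0.254)
)

# Order from coarse to fine and compute the RMSE reduction gained at each step
err <- err[order(-err$a1), ]
err$rmse_drop <- c(NA, -diff(err$RMSE))

# Hypothetical rule: stop refining once a step improves RMSE by less than `tol`
tol <- 0.02
flat <- which(err$rmse_drop < tol)   # which() skips the leading NA

# Keep the bin width just before the first small improvement;
# if every step improves by at least `tol`, fall back to the finest bin width
chosen_a1 <- if (length(flat) > 0) err$a1[flat[1] - 1] else err$a1[nrow(err)]

err
chosen_a1
```

With these five rows and tol = 0.02, the rule selects a1 = 0.16, because the step from 0.16 to 0.14 lowers RMSE by only 0.018. The threshold is arbitrary, so the result should be checked against the plot rather than taken as definitive.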