6 Validation Helper Functions

6.1 Panel Map

6.1.1 Generate Equal Weights

In the most simple case, to make a panel map, we need only a correspondence between a source nomenclature (std_A) and target nomenclature (std_B), which doesn’t have any duplicate rows.

6.1.1.1 Functions

This is a helper function for making a valid Panel Map with equal Mapping Weights from a concordance table.

#' Helper to build equal split panel map
#'
#' Generate panel map using all *distinct* correspondences between two classifications.
#'
#' @param code_dict Data frame containing correspondence between source and destination codes
#' @inheritParams check_weights
#' @param .weights_to (optional) new column name for storing weights that will be applied to. The default name is `split_<<code_in>>`.
#' input values.
#'
#' @return Returns panel map as tibble
#' @export
#'
#' @examples
make_pm_equal <- function(code_dict, code_in, code_out, .weights_to = NULL){
  ## check and remove for duplicates
  n_dups <- sum(duplicated(code_dict))
  no_dup_links <- n_dups == 0
  if (!no_dup_links) {
    message("Removing duplicate code_in/code_out rows")
    code_dict <- code_dict |>
    dplyr::distinct({{code_in}}, {{code_out}})
  }

  ## make column name for weights
  .weights_to <- .weights_to %||% paste("split", deparse(substitute(code_in)), sep = "_")

  ## make panel map
  panel_map <- code_dict |>
    dplyr::group_by({{code_in}}) |>
    dplyr::mutate(n_dest = dplyr::n(),
                  !!.weights_to := 1 / n_dest) |>
    dplyr::ungroup() |>
    dplyr::select(-n_dest)

  return(panel_map)
}

#' @rdname make_pm_equal
#' @export
make_panel_map_equal <- make_pm_equal

Use this helper on the concordance table defined above:

make_pm_equal(codes_BA, std_A, std_B, "weights")

## # A tibble: 10 × 3
##    std_B std_A weights
##    <chr> <chr>   <dbl>
##  1 A1    x1111    1   
##  2 B2    x2222    1   
##  3 B2    x3333    1   
##  4 C3    x4444    0.25
##  5 C4    x4444    0.25
##  6 C4    x6666    1   
##  7 C5    x4444    0.25
##  8 C6    x4444    0.25
##  9 C7    x5555    0.5 
## 10 C8    x5555    0.5

This function uses the no_dup_links flag to removes any duplicate instructions/links, to avoid assigning unequal shares to each target code/category (shown as naive_share):

library(dplyr)
codes <- tribble(~code_in, ~code_out,
                 "cake", "piece_01",
                 "cake", "piece_02",
                 "cake", "piece_03",
                 "cake", "piece_03" ## duplicated row
                 )

codes |>
  ## equal share by code_out
  mutate(equal_share = 1 / n_distinct(code_out)) |>
  ## without duplicates removed
  group_by(code_in) |>
  mutate("n_dest" = n(),
         weight := 1 / n_dest) |>
  ungroup() |>
  select(-n_dest) |>
  group_by(code_out) |>
  summarise(
    weights = paste(weight, collapse = "+"),
    naive_share = sum(weight),
    equal_share = unique(equal_share)
)

## # A tibble: 3 × 4
##   code_out weights   naive_share equal_share
##   <chr>    <chr>           <dbl>       <dbl>
## 1 piece_01 0.25             0.25       0.333
## 2 piece_02 0.25             0.25       0.333
## 3 piece_03 0.25+0.25        0.5        0.333

6.1.1.2 Tests

testthat::test_that(
  "make_pm_equal() works",
  {
    testthat::expect_identical(
      make_pm_equal(equal_pm$codes_BA, std_A, std_B, .weights_to = "weight"), equal_pm$pm_BA)
    testthat::expect_no_message(
      make_pm_equal(equal_pm$codes_BA, std_A, std_B, .weights_to = "weight"))
  }
)
testthat::test_that(
  "make_pm_equal() handles duplicate link correctly",
  {
    dup_codes_BA <- rbind(equal_pm$codes_BA, equal_pm$codes_BA[1, ])
    testthat::expect_message(
      make_pm_equal(dup_codes_BA, std_A, std_B)
    )
    testthat::expect_identical(
      make_pm_equal(dup_codes_BA, std_A, std_B, .weights_to = "weight"), equal_pm$pm_BA
    )
  }
)

## Test passed 🌈

## Removing duplicate code_in/code_out rows

## Test passed 🥳

6.1.2 Get Bad Weights (TBC)

Helper for errors in panel map weights

6.2 Data Preparation for Concordance Transformation

Things that are checked for:

missing values
code_in and value_in columns

Probably good practice things, but too much hassle to check, so maybe put in vignette?

data_in should only have one row/obs per code_in,
ideally data_in has only the code_in and values_in columns… the rest get dropped (like with dplyr::summarise())