Title: List Balancing for Reweighting and Population Synthesis
Version: 1.0.2
Description: Performs iterative proportional updating given a seed table and an arbitrary number of marginal distributions. This is commonly used in population synthesis, survey raking, matrix rebalancing, and other applications. For example, a household survey may be weighted to match the known distribution of households by size from the census. An origin/destination trip matrix might be balanced to match traffic counts. The approach used by this package is based on a paper from Arizona State University (Ye, Xin, et al. (2009) http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.537.723&rep=rep1&type=pdf). Some enhancements have been made to their work, including primary and secondary target balance/importance, general marginal agreement, and weight restriction.
License: Apache License (== 2.0)
URL: https://github.com/dkyleward/ipfr
BugReports: https://github.com/dkyleward/ipfr/issues
Depends: R (≥ 3.2.0)
Imports: dplyr (≥ 0.7.3), ggplot2 (≥ 2.2.1), magrittr (≥ 1.5), tidyr (≥ 0.5.1), mlr (≥ 2.11)
LazyData: true
Suggests: knitr, rmarkdown, testthat (≥ 2.1.0), covr
VignetteBuilder: knitr
RoxygenNote: 7.0.2
NeedsCompilation: no
Packaged: 2020-04-01 19:42:58 UTC; kyle
Author: Kyle Ward [aut, cre, cph], Greg Macfarlane [ctb]
Maintainer: Kyle Ward <kyleward084@gmail.com>
Repository: CRAN
Date/Publication: 2020-04-01 20:20:02 UTC

ipfr: A package to perform iterative proportional fitting

Description

The main function is ipu(). For a 2D/matrix problem, the ipu_matrix() function is easier to use. The resulting weight_tbl from ipu() can be fed into synthesize() to generate a synthetic population.

Author(s)

Maintainer: Kyle Ward kyleward084@gmail.com [copyright holder]

Other contributors:

Greg Macfarlane [contributor]

See Also

Useful links:

https://github.com/dkyleward/ipfr

Report bugs at https://github.com/dkyleward/ipfr/issues


Applies an importance weight to an ipfr factor

Description

At lower values of importance, the factor is moved closer to 1.

Usage

adjust_factor(factor, importance)

Arguments

factor

A correction factor that is calculated using target/current.

importance

A real between 0 and 1 signifying the importance of the factor. An importance of 1 does not modify the factor. An importance of 0.5 would shrink the factor closer to 1.0 by 50 percent.

Value

The adjusted factor.
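
The shrinkage described above can be illustrated with a minimal sketch. It assumes a simple linear interpolation between 1 and the raw factor, which may not be the exact formula used inside the package. The name shrink_factor_sketch is hypothetical and not part of ipfr.

```r
# Hypothetical helper (not part of ipfr): pull a correction factor toward 1.
# Assumes linear shrinkage, which may differ from the package's own formula.
shrink_factor_sketch <- function(factor, importance) {
  stopifnot(importance >= 0, importance <= 1)
  1 + (factor - 1) * importance
}

raw  <- 150 / 100                        # target / current = 1.5
full <- shrink_factor_sketch(raw, 1)     # importance 1 leaves the factor as is
half <- shrink_factor_sketch(raw, 0.5)   # halfway between 1.5 and 1
```

Under this assumption, an importance of 0.5 moves a factor of 1.5 to 1.25, and an importance of 0 would return 1 (no correction at all).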


Balances secondary targets to primary

Description

The average weight per record needed to satisfy targets is computed for both primary and secondary targets. Often, these can be very different, which leads to poor performance. The algorithm must use extremely large or small weights to match the competing goals. The secondary targets are scaled so that they are consistent with the primary targets on this measurement.

Usage

balance_secondary_targets(
  primary_targets,
  primary_seed,
  secondary_targets,
  secondary_seed,
  secondary_importance,
  primary_id
)

Arguments

primary_targets

A named list of data frames. Each name in the list defines a marginal dimension and must match a column from the primary_seed table. The data frame associated with each named list element can contain a geography field (starting with "geo_"). If so, each row in the target table defines a new geography (these could be TAZs, tracts, clusters, etc.). The other column names define the marginal categories that targets are provided for. The vignette provides more detail.

primary_seed

In population synthesis or household survey expansion, this would be the household seed table (each record would represent a household). It could also be a trip table, where each row represents an origin-destination pair.

secondary_targets

Same format as primary_targets, but they constrain the secondary_seed table.

secondary_seed

Most commonly, if the primary_seed describes households, the secondary seed table would describe the persons in each household. Must contain the same primary_id column that links each person to their respective household in primary_seed.

secondary_importance

A real between 0 and 1 signifying the importance of the secondary targets. At an importance of 1, the function will try to match the secondary targets exactly. At 0, only the percentage distributions are used (see the vignette section "Target Agreement".)

primary_id

The field used to join the primary and secondary seed tables. Only necessary if secondary_seed is provided.

Details

If multiple geographies are present in the secondary_targets tables, then balancing is done for each geography separately.

Value

named list of the secondary targets
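
The average-weight comparison in the Description can be sketched in base R. This is an illustration for a single geography with made-up numbers, not the package's implementation. It shows only full rescaling (the end of the range where just the percentage distribution of the secondary targets is kept) and ignores secondary_importance.

```r
# Toy problem: 4 household records and 8 person records, one geography.
hh_seed_n  <- 4
per_seed_n <- 8
hh_target_total <- 400                     # households implied by primary targets
per_targets <- c(young = 300, old = 900)   # persons implied by secondary targets

# Average weight per record needed to satisfy each set of targets.
hh_avg_weight  <- hh_target_total / hh_seed_n     # 100
per_avg_weight <- sum(per_targets) / per_seed_n   # 150

# Scale the secondary targets so both levels imply the same average weight.
# Assumption: full rescaling, so the secondary total is not preserved.
scale_factor <- hh_avg_weight / per_avg_weight
balanced_per_targets <- per_targets * scale_factor
```

After scaling, the person targets (200 and 600) keep their 25/75 split but now imply the same average weight of 100 per record as the household targets.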


Check geo fields

Description

Helper function for check_tables. Makes sure that geographies in a seed and target table line up properly.

Usage

check_geo_fields(seed, target, target_name)

Arguments

seed

seed table to check

target

data.frame of a single target table

target_name

the name of the target (e.g. size)

Value

The seed and target table (which may be modified)


Check for missing categories in seed

Description

Helper function for check_tables.

Usage

check_missing_categories(seed, target, target_name, geo_colname)

Arguments

seed

seed table to check

target

data.frame of a single target table

target_name

the name of the target (e.g. size)

geo_colname

the name of the geo column in both the seed and target (e.g. geo_taz)

Value

Nothing. Throws an error if one is found.


Check seed and target tables for completeness

Description

Given seed and targets, checks to make sure that at least one observation of each marginal category exists in the seed table. Otherwise, ipf/ipu would produce wrong answers without throwing errors.

Usage

check_tables(
  primary_seed,
  primary_targets,
  secondary_seed = NULL,
  secondary_targets = NULL,
  primary_id
)

Arguments

primary_seed

In population synthesis or household survey expansion, this would be the household seed table (each record would represent a household). It could also be a trip table, where each row represents an origin-destination pair.

primary_targets

A named list of data frames. Each name in the list defines a marginal dimension and must match a column from the primary_seed table. The data frame associated with each named list element can contain a geography field (starting with "geo_"). If so, each row in the target table defines a new geography (these could be TAZs, tracts, clusters, etc.). The other column names define the marginal categories that targets are provided for. The vignette provides more detail.

secondary_seed

Most commonly, if the primary_seed describes households, the secondary seed table would describe the persons in each household. Must contain the same primary_id column that links each person to their respective household in primary_seed.

secondary_targets

Same format as primary_targets, but they constrain the secondary_seed table.

primary_id

The field used to join the primary and secondary seed tables. Only necessary if secondary_seed is provided.

Value

both seed tables and target lists
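
The completeness check can be pictured with a short base R sketch. It is a simplified stand-in for what the helper functions verify, assuming one marginal (siz) and a geography column named geo_cluster. The package itself throws an error rather than returning the missing categories.

```r
seed <- data.frame(
  id = 1:4,
  siz = c(1, 2, 2, 1),
  geo_cluster = c(1, 1, 2, 2)
)

# Target table with a category ("3") that never appears in the seed.
target <- data.frame(
  geo_cluster = c(1, 2),
  `1` = c(75, 100),
  `2` = c(25, 150),
  `3` = c(10, 10),
  check.names = FALSE
)

# Marginal categories are the non-geography column names of the target table.
categories <- setdiff(names(target), "geo_cluster")

# A category with a target but no seed record cannot be matched by reweighting.
missing_categories <- setdiff(categories, as.character(unique(seed$siz)))
```

Here missing_categories holds "3", the situation that would otherwise produce wrong answers without any error.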


Compare results to targets

Description

Compare results to targets

Usage

compare_results(seed, targets)

Arguments

seed

data.frame Seed table with a weight column in the same format required by ipu().

targets

named list of data.frames in the same format required by ipu().

Value

data frame comparing balanced results to targets
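
As a rough picture of the kind of comparison produced, the sketch below sums seed weights by category and sets them against the targets. It uses named vectors for a single geography, and the column names (category, target, result, diff, pct_diff) are assumptions for illustration rather than the function's documented output.

```r
seed <- data.frame(siz = c(1, 2, 2, 1), weight = c(60, 30, 0, 15))
target <- c(`1` = 75, `2` = 25)

# Sum the weights of the seed records in each marginal category.
result <- vapply(split(seed$weight, as.character(seed$siz)), sum, numeric(1))

comparison <- data.frame(
  category = names(target),
  target = unname(target),
  result = unname(result[names(target)]),
  stringsAsFactors = FALSE
)
comparison$diff <- comparison$result - comparison$target
comparison$pct_diff <- 100 * comparison$diff / comparison$target
```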


Create a named list of target priority levels.

Description

Create a named list of target priority levels.

Usage

create_target_priority(target_priority, targets)

Arguments

target_priority

This argument controls how quickly each set of targets is relaxed. In other words: how important it is to match the target exactly. Defaults to 10,000,000, which means that all targets should be matched exactly. The value can take one of three forms:

real: This priority value will be used for each target table.

named list: Each named entry must match an entry in either primary_targets or secondary_targets and have a real. This priority will be applied to that target table. Any targets not in the list will default to 10,000,000.

data.frame: Column target must have values that match an entry in either primary_targets or secondary_targets. Column priority contains the values to use for priority. Any targets not in the table will default to 10,000,000.

targets

The complete list of targets (both primary and secondary)
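
The three accepted forms of target_priority can be normalized into one named list along the following lines. This is a sketch under stated assumptions: expand_priority_sketch is a hypothetical name, and the package's own function may validate its inputs differently.

```r
# Hypothetical helper (not part of ipfr): expand a priority specification
# into a named list with one priority per target table.
expand_priority_sketch <- function(target_priority, targets, default = 1e7) {
  out <- setNames(as.list(rep(default, length(targets))), names(targets))
  if (is.data.frame(target_priority)) {
    # Columns `target` and `priority`, as described in the Arguments.
    target_priority <- setNames(
      as.list(target_priority$priority),
      as.character(target_priority$target)
    )
  }
  if (is.list(target_priority)) {
    out[names(target_priority)] <- target_priority   # named entries override
  } else {
    out[] <- target_priority                         # one real for every table
  }
  out
}

targets <- list(size = data.frame(), age = data.frame())
p_real <- expand_priority_sketch(5, targets)
p_list <- expand_priority_sketch(list(age = 100), targets)
p_df   <- expand_priority_sketch(data.frame(target = "size", priority = 50), targets)
```

In each case, tables that are not mentioned keep the default of 10,000,000.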


Re-weight a Seed Table to Marginal Controls

Description

Re-weight a Seed Table to Marginal Controls

Usage

ipf(
  seed,
  targets,
  relative_gap = 0.01,
  absolute_gap = 1,
  max_iterations = 50,
  min_weight = 1e-04,
  verbose = FALSE
)

Arguments

seed

A data frame including a weight field and necessary columns for matching to marginal targets.

targets

A named list of data frames. Each name in the list defines a marginal dimension and must match a column from the seed table. The data frame associated with each name must start with an identical column named cluster. Each row in the target table defines a new cluster (these could be TAZs, tracts, districts, etc.), and every target table must have the same number of rows/clusters. The other column names define the marginal categories that targets are provided for.

relative_gap

Target for convergence. Maximum percent change to allow any seed weight to move by while considering the process converged. By default, if no weights change by more than 1%, the process has converged. The process is said to be converged if either the relative_gap or absolute_gap parameter has been met.

absolute_gap

Target for convergence. Maximum absolute change to allow any seed weight to move by while considering the process converged. By default, if no weights change by more than 1, the process has converged. The process is said to be converged if either the relative_gap or absolute_gap parameter has been met.

max_iterations

maximum number of iterations to perform, even if convergence is not reached.

min_weight

Minimum weight to allow in any cell to prevent zero weights. Set to .0001 by default. Should be arbitrarily small compared to your seed table weights.

verbose

Print details on the maximum expansion factor with each iteration? Default FALSE.

Value

the seed data frame with a column of weights appended for each row in the target data.frames
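
The fitting loop and its two convergence tests can be sketched in base R as follows. This is a simplified illustration, not the package's code: targets are named vectors for a single cluster rather than data frames with a cluster column, and min_weight is not applied.

```r
# Toy seed: one record per size/autos combination, all starting at weight 1.
seed <- data.frame(size = c(1, 1, 2, 2), autos = c(0, 1, 0, 1), weight = 1)

# Simplified targets (assumption): named vectors for a single cluster.
targets <- list(
  size  = c(`1` = 60, `2` = 40),
  autos = c(`0` = 30, `1` = 70)
)

relative_gap <- 0.01
absolute_gap <- 1
max_iterations <- 50

for (iter in seq_len(max_iterations)) {
  old_weight <- seed$weight
  for (margin in names(targets)) {
    key <- as.character(seed[[margin]])
    # Current weighted total in each category of this margin.
    current <- vapply(split(seed$weight, key), sum, numeric(1))
    # Correction factor: target / current, applied to every matching record.
    adj <- targets[[margin]][names(current)] / current
    seed$weight <- seed$weight * unname(adj[key])
  }
  change <- abs(seed$weight - old_weight)
  # Converged if either the relative or the absolute criterion is met.
  if (max(change / old_weight) < relative_gap || max(change) < absolute_gap) break
}
```

With these toy numbers the weights settle at 18, 42, 12, and 28 after the first pass, and the second pass only confirms convergence.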


Iterative Proportional Updating

Description

A general case of iterative proportional fitting. It can satisfy two disparate sets of marginals that do not agree on a single total. A common example is balancing population data using household- and person-level marginal controls. This could be for survey expansion or synthetic population creation. The second set of marginal/seed data is optional, meaning it can also be used for more basic IPF tasks.

Usage

ipu(
  primary_seed,
  primary_targets,
  secondary_seed = NULL,
  secondary_targets = NULL,
  primary_id = "id",
  secondary_importance = 1,
  relative_gap = 0.01,
  max_iterations = 100,
  absolute_diff = 10,
  weight_floor = 1e-05,
  verbose = FALSE,
  max_ratio = 10000,
  min_ratio = 1e-04
)

Arguments

primary_seed

In population synthesis or household survey expansion, this would be the household seed table (each record would represent a household). It could also be a trip table, where each row represents an origin-destination pair.

primary_targets

A named list of data frames. Each name in the list defines a marginal dimension and must match a column from the primary_seed table. The data frame associated with each named list element can contain a geography field (starting with "geo_"). If so, each row in the target table defines a new geography (these could be TAZs, tracts, clusters, etc.). The other column names define the marginal categories that targets are provided for. The vignette provides more detail.

secondary_seed

Most commonly, if the primary_seed describes households, the secondary seed table would describe the persons in each household. Must contain the same primary_id column that links each person to their respective household in primary_seed.

secondary_targets

Same format as primary_targets, but they constrain the secondary_seed table.

primary_id

The field used to join the primary and secondary seed tables. Only necessary if secondary_seed is provided.

secondary_importance

A real between 0 and 1 signifying the importance of the secondary targets. At an importance of 1, the function will try to match the secondary targets exactly. At 0, only the percentage distributions are used (see the vignette section "Target Agreement".)

relative_gap

After each iteration, the weights are compared to the previous weights. If the relative change is less than the relative_gap threshold, then the process terminates.

max_iterations

maximum number of iterations to perform, even if relative_gap is not reached.

absolute_diff

Upon completion, the ipu() function will report the worst-performing marginal category and geography based on the percent difference from the target. absolute_diff is a threshold below which percent differences don't matter.

For example, if a target value was 2, and the expanded weights equaled 1, that's a 100% difference, but the absolute difference is only 1.

Defaults to 10.

weight_floor

Minimum weight to allow in any cell to prevent zero weights. Set to 0.00001 by default. Should be arbitrarily small compared to your seed table weights.

verbose

Print iteration details and worst marginal stats upon completion? Default FALSE.

max_ratio

real number. The average weight per seed record is calculated by dividing the total of the targets by the number of records. The max_ratio caps the maximum weight at a multiple of that average. Defaults to 10000 (basically turned off).

min_ratio

real number. The average weight per seed record is calculated by dividing the total of the targets by the number of records. The min_ratio limits the minimum weight to a multiple of that average. Defaults to 0.0001 (basically turned off).

Value

a named list with the primary_seed with weight, a histogram of the weight distribution, and two comparison tables to aid in reporting.
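
The weight restriction controlled by max_ratio and min_ratio can be pictured with a short base R sketch. The numbers are made up, and the point at which the package applies the limits during iteration is not shown. Only the arithmetic of the bounds is illustrated.

```r
n_records <- 4
target_total <- 400
avg_weight <- target_total / n_records   # 100 per seed record

max_ratio <- 5      # illustrative values, far tighter than the defaults
min_ratio <- 0.2

weights <- c(5, 80, 120, 900)

# Keep every weight between min_ratio and max_ratio times the average weight.
restricted <- pmin(pmax(weights, min_ratio * avg_weight), max_ratio * avg_weight)
```

With these values the bounds are 20 and 500, so only the first and last weights are changed.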

References

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.537.723&rep=rep1&type=pdf

Examples

hh_seed <- dplyr::tibble(
  id = c(1, 2, 3, 4),
  siz = c(1, 2, 2, 1),
  weight = c(1, 1, 1, 1),
  geo_cluster = c(1, 1, 2, 2)
)

hh_targets <- list()
hh_targets$siz <- dplyr::tibble(
  geo_cluster = c(1, 2),
  `1` = c(75, 100),
  `2` = c(25, 150)
)

result <- ipu(hh_seed, hh_targets, max_iterations = 5)
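
The role of absolute_diff when picking the worst-performing marginal can be sketched in base R. The comparison table and its column names are hypothetical, and the package's own reporting may be organized differently.

```r
comparison <- data.frame(
  category = c("a", "b", "c"),
  target = c(4, 1000, 500),
  result = c(6, 900, 520),
  stringsAsFactors = FALSE
)
comparison$diff <- comparison$result - comparison$target
comparison$pct_diff <- 100 * comparison$diff / comparison$target

absolute_diff <- 10

# Ignore categories whose absolute miss is too small to matter.
candidates <- comparison[abs(comparison$diff) > absolute_diff, ]

# Among the rest, report the largest percent difference.
worst <- candidates[which.max(abs(candidates$pct_diff)), ]
```

Category "a" is off by 50 percent but only 2 units, so it is screened out and "b" (off by 10 percent, or 100 units) is reported instead.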

Balance a matrix given row and column targets

Description

This function simplifies the call to 'ipu()' for the simple case of a matrix and row/column targets.

Usage

ipu_matrix(mtx, row_targets, column_targets, ...)

Arguments

mtx

a matrix

row_targets

a vector of targets that the row sums must match

column_targets

a vector of targets that the column sums must match

...

additional arguments that are passed to 'ipu()'. See ipu for details.

Value

A matrix that matches row and column targets

Examples

mtx <- matrix(data = runif(9), nrow = 3, ncol = 3)
row_targets <- c(3, 4, 5)
column_targets <- c(5, 4, 3)
ipu_matrix(mtx, row_targets, column_targets)

Iterative Proportional Updating (Newton-Raphson)

Description

List balancing similar to ipu, but using the Newton-Raphson approach to optimization. Created primarily as a point of comparison for ipu.

Usage

ipu_nr(
  primary_seed,
  primary_targets,
  secondary_seed = NULL,
  secondary_targets = NULL,
  target_priority = 1e+07,
  relative_gap = 0.01,
  max_iterations = 100,
  absolute_diff = 10,
  weight_floor = 1e-05,
  verbose = FALSE,
  max_ratio = 10000,
  min_ratio = 1e-04
)

Arguments

primary_seed

In population synthesis or household survey expansion, this would be the household seed table (each record would represent a household). It could also be a trip table, where each row represents an origin-destination pair. Must contain a pid ("primary ID") field that is unique for each row. Must also contain a geography field that starts with "geo_".

primary_targets

A named list of data frames. Each name in the list defines a marginal dimension and must match a column from the primary_seed table. The data frame associated with each named list element must contain a geography field (starts with "geo_"). Each row in the target table defines a new geography (these could be TAZs, tracts, clusters, etc.). The other column names define the marginal categories that targets are provided for. The vignette provides more detail.

secondary_seed

Most commonly, if the primary_seed describes households, the secondary seed table would describe a unique person with each row. Must also contain the pid column that links each person to their respective household in primary_seed. Must not contain any geography fields (starting with "geo_").

secondary_targets

Same format as primary_targets, but they constrain the secondary_seed table.

target_priority

This argument controls how quickly each set of targets is relaxed. In other words: how important it is to match the target exactly. Defaults to 10,000,000, which means that all targets should be matched exactly. The value can take one of three forms:

real: This priority value will be used for each target table.

named list: Each named entry must match an entry in either primary_targets or secondary_targets and have a real. This priority will be applied to that target table. Any targets not in the list will default to 10,000,000.

data.frame: Column target must have values that match an entry in either primary_targets or secondary_targets. Column priority contains the values to use for priority. Any targets not in the table will default to 10,000,000.

relative_gap

After each iteration, the weights are compared to the previous weights. If the relative change is less than the relative_gap threshold, then the process terminates.

max_iterations

maximum number of iterations to perform, even if relative_gap is not reached.

absolute_diff

Upon completion, the ipu() function will report the worst-performing marginal category and geography based on the percent difference from the target. absolute_diff is a threshold below which percent differences don't matter.

For example, if a target value was 2, and the expanded weights equaled 1, that's a 100% difference, but the absolute difference is only 1.

Defaults to 10.

weight_floor

Minimum weight to allow in any cell to prevent zero weights. Set to 0.00001 by default. Should be arbitrarily small compared to your seed table weights.

verbose

Print iteration details and worst marginal stats upon completion? Default FALSE.

max_ratio

real number. The average weight per seed record is calculated by dividing the total of the targets by the number of records. The max_ratio caps the maximum weight at a multiple of that average. Defaults to 10000 (basically turned off).

min_ratio

real number. The average weight per seed record is calculated by dividing the total of the targets by the number of records. The min_ratio limits the minimum weight to a multiple of that average. Defaults to 0.0001 (basically turned off).

Value

a named list with the primary_seed with weight, a histogram of the weight distribution, and two comparison tables to aid in reporting.


Helper function to process a seed table

Description

Helper for ipu(). Strips columns from the seed table except for the primary id and the marginal columns (as reflected in the target tables). Also identifies factor columns with one level and processes them before mlr::createDummyFeatures() is called.

Usage

process_seed_table(df, primary_id, marginal_columns)

Arguments

df

the data.frame as processed by ipu() before this function is called.

primary_id

the name of the primary ID column.

marginal_columns

The vector of column names in the seed table that have matching targets.
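
The column stripping and dummy-variable step can be approximated in base R. This sketch does not call mlr::createDummyFeatures(). It builds the 0/1 indicator columns by hand, and the "siz.1" naming pattern is an assumption made for illustration.

```r
df <- data.frame(
  id = 1:4,
  siz = c(1, 2, 2, 1),
  income = c(10, 20, 30, 40),   # no matching target, so it is dropped
  geo_cluster = c(1, 1, 2, 2)
)
primary_id <- "id"
marginal_columns <- "siz"

# Keep only the primary id and the columns that have targets.
kept <- df[, c(primary_id, marginal_columns), drop = FALSE]

# One 0/1 indicator column per category of the marginal.
categories <- sort(unique(kept$siz))
dummies <- sapply(categories, function(v) as.integer(kept$siz == v))
colnames(dummies) <- paste0("siz.", categories)
```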


Scale targets to ensure consistency

Description

Often, different marginals may disagree on the total number of units. In the context of household survey expansion, for example, one marginal might say there are 100k households while another says there are 101k. This function solves the problem by scaling all target tables to match the first target table provided.

Usage

scale_targets(targets, verbose = FALSE)

Arguments

targets

named list of data.frames in the same format required by ipu.

verbose

logical Show a warning for each target scaled? Defaults to FALSE.

Value

A named list with the scaled targets
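
The scaling rule can be sketched in base R using the 100k versus 101k situation from the Description. The tables below have a single row and no geography column, which is a simplification. With geographies, the same ratio would presumably be computed per row.

```r
targets <- list(
  size  = data.frame(`1` = 60000, `2` = 40000, check.names = FALSE),  # 100k total
  autos = data.frame(`0` = 30300, `1` = 70700, check.names = FALSE)   # 101k total
)

# The first target table sets the total that all others must match.
base_total <- sum(targets[[1]])

# Scale each table by the ratio of the base total to its own total.
scaled <- lapply(targets, function(tbl) tbl * (base_total / sum(tbl)))
```

The autos table is scaled to 30,000 and 70,000, so both marginals now agree on 100,000 households while the autos shares are unchanged.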


Create the ASU example

Description

Sets up the Arizona example IPU problem and is used in multiple places throughout the package (vignettes/tests).

Usage

setup_arizona()

Value

A list of four variables: hh_seed, hh_targets, per_seed, and per_targets. These can be used directly by ipu.

Examples

setup_arizona()

Creates a synthetic population based on ipu results

Description

A simple function that takes the weight_tbl output from ipu and randomly samples based on the weight.

Usage

synthesize(weight_tbl, group_by = NULL, primary_id = "id")

Arguments

weight_tbl

the data.frame of the same name output by ipu.

group_by

if provided, the data.frame will be grouped by this variable before sampling. If not provided, tidyverse/dplyr groupings will be respected. If no grouping info is present, samples are drawn from the entire table.

primary_id

The field used to join the primary and secondary seed tables. Only necessary if secondary_seed is provided.

Value

A data.frame with one record for each synthesized member of the population (e.g. household). A new_id column is created, but the previous primary_id column is maintained to facilitate joining back to other data sources (e.g. a person attribute table).

Examples

hh_seed <- dplyr::tibble(
  id = c(1, 2, 3, 4),
  siz = c(1, 2, 2, 1),
  weight = c(1, 1, 1, 1),
  geo_cluster = c(1, 1, 2, 2)
)
hh_targets <- list()
hh_targets$siz <- dplyr::tibble(
  geo_cluster = c(1, 2),
  `1` = c(75, 100),
  `2` = c(25, 150)
)
result <- ipu(hh_seed, hh_targets, max_iterations = 5)
synthesize(result$weight_tbl, "geo_cluster")
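
The idea of sampling records in proportion to their weights can also be sketched without the package. This base R version assumes sampling with replacement and a population size equal to the rounded sum of the weights. Both are assumptions, and synthesize() may handle these details differently.

```r
set.seed(42)

weight_tbl <- data.frame(id = 1:3, weight = c(2.4, 0.6, 7.0))

# Assumed population size: the rounded total weight.
n <- round(sum(weight_tbl$weight))

# Draw seed rows with probability proportional to their weight.
rows <- sample(nrow(weight_tbl), size = n, replace = TRUE, prob = weight_tbl$weight)

synthetic <- weight_tbl[rows, "id", drop = FALSE]
synthetic$new_id <- seq_len(n)   # fresh id; the original id is kept for joins
```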