Type: Package
Title: Bayesian Inference from the Conditional Genetic Stock Identification Model
Version: 0.3.4
Maintainer: Eric C. Anderson <eric.anderson@noaa.gov>
Description: Implements Bayesian inference for the conditional genetic stock identification model. It allows inference of mixed fisheries and also simulation of mixtures to predict accuracy. A full description of the underlying methods is available in a recently published article in the Canadian Journal of Fisheries and Aquatic Sciences: <doi:10.1139/cjfas-2018-0016>.
License: CC0
LazyData: TRUE
Depends: R (≥ 3.3.0)
Imports: dplyr, gtools, magrittr, Rcpp (≥ 0.12.5), readr, rlang, stringr, tibble, tidyr, RcppParallel
LinkingTo: Rcpp, RcppParallel
SystemRequirements: GNU make
RoxygenNote: 7.2.3
Suggests: knitr, rmarkdown, ggplot2
VignetteBuilder: knitr
Encoding: UTF-8
NeedsCompilation: yes
Packaged: 2024-01-23 21:49:33 UTC; eriq
Author: Eric C. Anderson [aut, cre], Ben Moran [aut]
Repository: CRAN
Date/Publication: 2024-01-24 14:40:05 UTC

Pipe operator

Description

Pipe operator

Usage

lhs %>% rhs

Generate a random population structure and mixture sample, as in Hasselman et al. 2015

Description

Creates random reporting unit (rho) and collection (omega) proportions, and a sim_colls vector for simulation of individual genotypes, based on the methods used in Hasselman et al. (2015)

Usage

Hasselman_sim_colls(RU_starts, RU_vec, size = 100)

Arguments

RU_starts

a vector delineating the reporting units in RU_vec; generated by tcf2param_list

RU_vec

a vector of collection indices, grouped by reporting unit; generated by tcf2param_list

Details

This function is designed specifically to recreate the simulations in Hasselman et al. (2015), to check for the bias that was observed therein. Rho (reporting unit proportions) is chosen with alphas of 1.5, and omega (collection proportions) chosen with the same alpha, then scaled by the corresponding rho.

Value

Hasselman_sim_colls returns a list with three elements. The first two are a rho vector and an omega vector, respectively, both with alpha parameters = 1.5. The third is a vector of origins for simulated individuals, sampled from the collections with probabilities = omega


Convert data frame of allele frequencies to nested lists

Description

List-izes the output of reference_allele_counts into a usable format for allelic_list

Usage

a_freq_list(D, pop_level = "collection")

Arguments

D

the long-format dataframe of counts by collection, locus, and allele, output by reference_allele_counts, to be made into a nested list

Value

a_freq_list returns a list named by loci, each element of which is a matrix containing that locus's allele count data. Rows in the matrix mark alleles, and columns collections

Examples


 # Generate a list of individual genotypes by allele from
 # the alewife data's reference allele count tables
 example(reference_allele_counts)
 ale_ac <- a_freq_list(ale_rac)

Microsat data from alewife herring reference populations

Description

Standard two-column genetic data with lots of other columns preceding it. Can be fed directly into rubias because it has at least the columns sample_type, collection, repunit and indiv.

Format

A tibble.

Source

https://datadryad.org/stash/dataset/doi:10.5061/dryad.80f4f


Create genotype lists for each locus

Description

Uses the allele counts from a_freq_list and the cleaned short-format output of tcf2long to generate a nested list of individual genotypes for each locus

Usage

allelic_list(cs, ac, samp_type = "both")

Arguments

cs

a clean short genetic data matrix; the second element of the output from tcf2long. Must have a column of individual identifiers, named "indiv"

ac

allele counts from a_freq_list

samp_type

choose which sample types of individuals to include in output: "mixture", "both", or "reference"

Value

allelic_list returns a two-component nested list, with data stored as character names of alleles ($chr) or as integer indices for the alleles ($int). Both forms contain lists representing to loci, with two component vectors corresponding to gene copies a and b.

Examples

example(a_freq_list)
ale_cs <- ale_long$clean_short
# Get the vectors of gene copies a and b for all loci in integer index form
ale_alle_list <- allelic_list(ale_cs, ale_ac)$int


Test the effects of the parametric bootstrap bias correction on a reference dataset through cross-validation

Description

This is a rewrite of bias_comparison(). Eric didn't want the plotting to be wrapped up in a function, and wanted to return a more informative data frame.

Usage

assess_pb_bias_correction(
  reference,
  gen_start_col,
  seed = 5,
  nreps = 50,
  mixsize = 100,
  alle_freq_prior = list(const_scaled = 1)
)

Arguments

reference

a two-column format genetic dataset, with a "repunit" column specifying each individual's reporting unit of origin, a "collection" column specifying the collection (population or time of sampling) and "indiv" providing a unique name

gen_start_col

the first column containing genetic data in reference. All columns should be genetic format following this column, and gene copies from the same locus should be adjacent

seed

the random seed for simulations

nreps

The number of reps to do.

mixsize

The size of each simulated mixture sample.

alle_freq_prior

a one-element named list specifying the prior to be used when generating Dirichlet parameters for genotype likelihood calculations. Valid methods include "const", "scaled_const", and "empirical". See ?list_diploid_params for method details.

Details

Takes a reference two-column genetic dataset, pulls a series of random "mixture" datasets with varying reporting unit proportions from this reference, and compares the results of GSI through standard MCMC vs. parametric-bootstrap MCMC bias correction

The amount of bias in reporting unit proportion calculations increases with the rate of misassignment between reporting units (decreases with genetic differentiation), and increases as the number of collections within reporting units becomes more uneven.

Output from the standard Bayesian MCMC method demonstrates the level of bias to be expected for the input data set, and parametric bootstrapping is an empirical method for the removal of any existing bias.

Value

bias_comparison returns a list; the first element is a list of the relevant rho values generated on each iteration of the random "mixture" creation. This includes the true rho value, the standard result rho_mcmc, and the parametric bootstrapped rho_pb.

The second element is a dataframe listing summary statistics for each reporting unit and estimation method. mse, the mean squared error, summarizes the deviation of the rho estimates from their true value, including both bias and other variance. mean_prop_bias is the average ratio of residual to true value, which gives greater weight to deviations at smaller values. mean_bias is simply the average residual; unlike mse, this demonstrates the direction of the bias.

Examples

## Not run: 
## This takes too long to run in R CMD CHECK
ale_bias <- assess_pb_bias_correction(alewife, 17)

## End(Not run)


Simulate mixtures and estimate reporting group and collection proportions.

Description

From a reference dataset, this creates a genotype-logL matrix based on simulation-by-individual with randomly drawn population proportions, then uses this in two different estimates of population mixture proportions: maximum likelihood via EM-algorithm and posterior mean from MCMC.

Usage

assess_reference_loo(
  reference,
  gen_start_col,
  reps = 50,
  mixsize = 100,
  seed = 5,
  alpha_repunit = 1.5,
  alpha_collection = 1.5,
  resampling_unit = "individual",
  alle_freq_prior = list(const_scaled = 1),
  printSummary = FALSE,
  return_indiv_posteriors = FALSE
)

Arguments

reference

a two-column format genetic dataset, with "repunit", "collection", and "indiv" columns, as well as a "sample_type" column that has some "reference" entries

gen_start_col

the first column of genetic data in reference

reps

number of reps of mixture simulation and MCMC to do

mixsize

the number of individuals in each simulated mixture

seed

a random seed for simulations

alpha_repunit

If a vector, this is the dirichlet parameter for simulating the proportions of reporting units. Gets recycled to the number of reporting units. Default is 1.5. Otherwise, this could be a two-column data frame. The first column must be named "repunit" and the second one must be one of "dirichlet", "ppn", or "cnt", according to whether you wish to specify dirichlet parameters, or proportions, or exact counts, respectively, for each population. If you want to make multiple simulations, pass in a list of data frames or of individual dirichlet parameters. For examples, see sim_spec_examples.

alpha_collection

The dirichlet parameter for simulating proportions of collections within reporting units. Default = 1.5. If this is a data frame then the first column must be "collection" and the second must be one of "dirichlet", "ppn", "cnt", "sub_dirichlet", "sub_ppn". If you want to provide multiple different scenarios. You can pass them in as a list. If alpha_repunit or alpha_collection is a list with length greater than 1, the shorter will be recycled. For examples, see sim_spec_examples.

resampling_unit

what unit should be resampled. Currently the choices are "individuals" (the default) and "gene_copies". Using "individuals" preserves missing data patterns available in the reference data set. We also have "gene_copies_with_missing" capability, but it is not yet linked into this function.

alle_freq_prior

a one-element named list specifying the prior to be used when generating Dirichlet parameters for genotype likelihood calculations. Valid methods include "const", "scaled_const", and "empirical". See ?list_diploid_params for method details.

printSummary

if TRUE a summary of the reference samples will be printed to stdout.

return_indiv_posteriors

if TRUE, output is a list of 2. The first entry, mixing_proportions, contains the true (simulated) and estimated mixture proportions for each scenario, iteration, and collection. The second, indiv_posteriors, contains the posterior probability of assignment to each collection for each scenario, iteration, and individual. If FALSE, output is a single data frame, mixing_proportions

Examples

# very small number of reps so it is quick enough for example
ale_dev <- assess_reference_loo(alewife, 17, reps = 5)


Partition a reference dataset and estimate reporting group and collection proportions

Description

From a reference dataset, this draws (without replacement) a simulated mixture dataset with randomly drawn population proportions, then uses this in two different estimates of population mixture proportions: maximum likelihood via EM-algorithm and posterior mean from MCMC.

Usage

assess_reference_mc(
  reference,
  gen_start_col,
  reps = 50,
  mixsize = 100,
  seed = 5,
  alpha_repunit = 1.5,
  alpha_collection = 1.5,
  min_remaining = 5,
  alle_freq_prior = list(const_scaled = 1)
)

Arguments

reference

a two-column format genetic dataset, with "repunit", "collection", and "indiv" columns, as well as a "sample_type" column that has some "reference" entries.

gen_start_col

the first column of genetic data in reference

reps

number of reps to do

mixsize

the number of individuals in each simulated mixture.

seed

a random seed for simulations

alpha_repunit

The dirichlet parameter for simulating the proportions of reporting units. Default = 1.5

alpha_collection

The dirichlet parameter for simulating proportions of collections within reporting units. Default = 1.5

min_remaining

the minimum number of individuals which should be conserved in each reference collection during sampling without replacement to form the simulated mixture

alle_freq_prior

a one-element named list specifying the prior to be used when generating Dirichlet parameters for genotype likelihood calculations. Valid methods include "const", "scaled_const", and "empirical". See ?list_diploid_params for method details.

Details

This method is referred to as "Monte Carlo cross-validation". The input parameters for assess_reference_mc are more restrictive than those of assess_reference_loo. Rather than allowing a data.frame to specify Dirichlet parameters, proportions, or counts for specific reporting units and collections, assess_reference_mc only allows vector input (default = 1.5) for alpha_repunit and alpha_collection. These inputs specify the uniform Dirichlet parameters for all reporting units and collections, respectively.

For mixture proportion generation, the rho values are first drawn using a stick-breaking model of the Dirichlet distribution, but with proportions capped by min_remaining. Stick-breaking is then used to subdivide each reporting unit into collections. In addition to the constraint that mixture sampling without replacement cannot deplete the number of individuals in each collection below min_remaining, a similar constraint is placed upon the number of individuals left in reporting units, determined as min_remaining * (# collections in reporting unit).

Note that this implies that the data are only truly Dirichlet distributed when no rejections based on min_remaining occur. This is a reasonable certainty with sufficient reference collection sizes relative to the desired mixture size.

Examples

# only 5 reps, so it doesn't take too long.  Typically you would
# do many more
ale_dev <- assess_reference_mc(alewife, 17, 5)

Get the average within-RU assignment rate for each collection

Description

This function takes a matrix of scaled genotype likelihoods for a group of individuals of known origin, and calculates the average rate at which individuals in a particular collection are assigned to the correct reporting unit.

Usage

avg_coll2correctRU(SL, coll, RU_starts, RU_vec)

Arguments

SL

a scaled likelihood matrix; each column should sum to one, and represent the probability of assignments to each collection (row) for a particular individual

coll

a vector of the collection of origin indices of the individuals (length = ncol(SL))

RU_starts

a vector delineating starting indices of different reporting units in RU_vec

RU_vec

a vector of collection indices, organized by reporting unit

Details

The average rate of correct within-reporting unit assignment is proportional to reporting-unit-level bias in the posterior probability for this collection; if the correct assignment rate is high relative to other collections, it will be upwardly biased, and vice versa. The inverse of this vector is used to scale Dirichlet draws of omega during misassignment-scaled MCMC.

Value

avg_coll2correctRU returns a vector of length nrow(SL), where each element represents the average proportion of fish from the corresponding collection which are correctly assigned to the proper collection, or misassigned to another collection within the same reporting unit. This is distinct from the rate of correct assignment at the collection level, which is too low and variable to serve as a stable metric for omega scaling.

Examples

locnames <- names(alewife)[-(1:16)][c(TRUE, FALSE)]
ploidies <- rep(2, length(locnames))
names(ploidies) <- locnames

params <- tcf2param_list(alewife, 17, ploidies = ploidies)
SL <- geno_logL(params) %>% exp() %>% apply(2, function(x) x/sum(x))
correct <- avg_coll2correctRU(SL, params$coll, params$RU_starts, params$RU_vec)

Microsat data from blueback herring reference populations

Description

Standard two-column genetic data with lots of other columns preceding it. Can be fed directly into rubias because it has at least the columns sample_type, collection, repunit and indiv.

Format

A tibble.

Source

https://datadryad.org/stash/dataset/doi:10.5061/dryad.80f4f


Perform a parametric bootstrapping correction on an estimated rho vector

Description

Takes an estimate of rho, and a two-column format genetic data frame containing both reference and mixture data. Returns a new rho corrected by parametric bootstrapping

Usage

bootstrap_rho(
  rho_est,
  pi_est,
  D,
  gen_start_col,
  niter = 100,
  reps = 2000,
  burn_in = 100,
  pi_prior = NA,
  pi_prior_sum = 1
)

Arguments

rho_est

the rho value previously estimated from MCMC GSI from the provided reference and mixture data

pi_est

the pi value previously estimated from MCMC GSI from the provided reference and mixture data

D

a two-column genetic dataframe containing the reference and mixture data from which rho_est was computed; with "repunit", "collection", and "indiv" columns

gen_start_col

the first column of genetic data in D. All columns after gen_start_col must be genetic data

pi_prior

The prior to be added to the collection allocations, in order to generate pseudo-count Dirichlet parameters for the simulation of a new pi vector. Non-default values should be a vector of length equal to the number of populations in the reference dataset. Default value of NA leads to the calculation of a symmetrical prior based on pi_prior_sum.

pi_prior_sum

total weight on default symmetrical prior for pi.

In parametric bootstrapping, niter new mixture datasets are simulated by individual from the reference with reporting unit proportions rho_est, and the mean of their MCMC GSI outputs is used to calculate an average bias. This bias is subtracted from rho_est to give the output. The number of individuals in each simulated bootstrap dataset is equal to the number of "mixture" individuals in D.

Value

bootstrap_rho returns a new rho value, corrected by parametric bootstrapping.


check a baseline and mixture file together to ensure the known_collections are valid if they exist

Description

Simple function that checks for known_collections columns in a reference and mixture and makes sure that they are compliant. If there is a non-NA entry in the Mixture frame's known_collection column this function returns TRUE. Otherwise it returns FALSE.

Usage

check_known_collections(R, M)

Arguments

R

reference data frame

M

mixture data frame


A helper function to check that the input data frame is OK

Description

Just checks to make sure that column types are correct.

Usage

check_refmix(D, gen_start_col, type = "reference")

Arguments

D

the data frame

gen_start_col

the column in which the genetic data starts

type

For writing errors, supply "mixture" or "reference" as appropriate.

Details

It also checks the patterns of missing data, and from that infers whether markers are haploid or diploid.


SNP data from chinook reference populations

Description

Chinook salmon baseline data similar to that which can be downloaded from https://datadryad.org/stash/dataset/doi:10.5061/dryad.574sv. This data set includes 91 SNPs and 7301 fish and is what the Dryad data became after we converted from TaqMan to SNPtype assays (being forced to toss some loci) and tossed out a bunch of lousy historical samples from Trinity River.

Format

A tbl_df-ed (from dplyr) data frame with 7,301 rows and 185 variables. The first three columns are

repunit (chr)

the reporting unit that the individual is in

pop (chr)

the population from which the individual was sampled

ID (chr)

Unique identifier of the individual fish

The remaining columns are two columns for each locus. These columns are named like, "Locus.1" and "Locus.2" for the first and second gene copies at that locus. For example, "Ots_104569-86.1" and "Ots_104569-86.2". The locus columns are ints and missing data is denoted by NA.

Source

https://datadryad.org/stash/dataset/doi:10.5061/dryad.574sv


a vector that gives a desired sort order of the chinook collections

Description

This is just an example of what one would use as levels in order to get the chinook collections in a desired sort order after analysis. The issue here is collection in the input data frame to most functions must be a character vector, not a factor. But, after analysis you can always make them a factor again and use a vector like this one to specify the levels.

Source

Made it up!


SNP data from Chinook salmon taken in May/August 2015 from California fisheries

Description

This has data from 91 SNP markers (a subset of the 95 markers in the chinook baseline data set).

Format

A tbl_df-ed (from dplyr) data frame with 2256 rows and 193 variables. The first four columns are meta data. The remaining columns are two columns for each locus. These columns are named like, "Locus.1" and "Locus.2" for the first and second gene copies at that locus. For example, "Ots_104569-86.1" and "Ots_104569-86.2". The locus columns are ints and missing data is denoted by NA.

Source

Southwest Fisheries Science Center, Santa Cruz, CA


a vector that gives a desired sort order of the chinook repunits

Description

This is just an example of what one would use as levels in order to get the chinook repunits in a desired sort order after analysis. The issue here is that repunit in the input data frame to most functions must be a character vector, not a factor. But, after analysis you can always make them a factor again and use a vector like this one to specify the levels.

Source

Made it up!


check for matching (or close to matching) genotypes in a data frame

Description

Super simple function that looks at all pairs of fish from the data frame and returns a tibble that includes those which shared a fraction >= than min_frac_non_miss of the genotypes not missing in either fish, and which were matching at a fraction >= min_frac_matching of those non-missing pairs of genotypes.

Usage

close_matching_samples(
  D,
  gen_start_col,
  min_frac_non_miss = 0.7,
  min_frac_matching = 0.9
)

Arguments

D

a two-column format genetic dataset, with "repunit", "collection", and "indiv" columns, as well as a "sample_type" column that has entried either of "reference" or "mixture" or both.

gen_start_col

the first column of genetic data in reference

min_frac_non_miss

the fraction of loci that the pair must share non missing in order to be reported

min_frac_matching

the fraction of shared non-missing loci that must be shared between the indivdiuals to be reported as a matching pair.

Value

a tibble ...

Examples

# one pair found in the interal alewife data set:
close_matching_samples(alewife, 17)

for each individual count the number or loci missing and non_missing

Description

Takes a rubias genetic data frame that must have column "indiv". Haploids have the second column at each locus totally missing. Diploids with missing data will have both gene copies missing.

Usage

count_missing_data(D, gen_start_col)

Arguments

D

the data frame

gen_start_col

the column in which the genetic data starts

Value

returns a data frame with indiv (as characters), n_non_miss_loci, n_miss_loci (as numeric) and missing_loci (as a list-column of named integer vectors)


Create a vector of pi Dirichlet priors with specified values for one or more collections

Description

This handles a case in which the user provides a data frame for pi_prior. The data frame lists desired Dirichlet parameter priors for at least one reference collection, and/or a default value for all unspecified collections.

Usage

custom_pi_prior(P, C)

Arguments

P

A data frame of one or more desired pi prior parameters. One column, "collection", is a character vector, with valid values including the names of any reference collections, or the special value "DEFAULT_PI". The second column, "pi_param" is the prior value to be used for each collection.

C

a tibble with a column "collection" collection names

Details

Input checking is currently done in the early stages of infer_mixture in order to throw errors before long processing times, and avoid re-checking during bootstrap_rho.


Given a vector of different categories in 1...n and a prior, simulate a Dirichlet random vector

Description

Takes a vector of collection indices to which individuals (vector elements) were assigned, and returns a Dirichlet random variable generated by adding the prior to the sum of each collection's occurrences, and simulating an alpha from a gamma distribution with this shape parameter.

Usage

dirch_from_allocations(C, lambda)

Arguments

C

a vector giving different categories of individual (not counts of categories - untabulated)

lambda

priors for the categories

Details

The categories are labeled in C from 1 up to n. n is the length of lambda, which is a vector of priors. Note that all elements of lambda must be strictly greater than 0.


Given a vector of counts for different categories in 1...n and a prior, simulate a Dirichlet random vector

Description

Takes a vector of counts for 1:n collections, and returns a Dirichlet random variable generated by adding the prior to each collection value, and simulating an alpha from a gamma distribution with this shape parameter.

Usage

dirch_from_counts(C, lambda)

Arguments

C

a vector giving counts of categories

lambda

priors for the categories

Details

The categories are labeled in C from 1 up to n. n is the length of lambda, which is a vector of priors. Note that all elements of lambda must be strictly greater than 0.


Calculate a matrix of genotype log-likelihoods for a genetic dataset

Description

Takes a list of key parameters from a genetic dataset, and calculates the log-likelihood of each individual's genotype, given the allele counts in each collection

Usage

geno_logL(par_list)

Arguments

par_list

genetic data converted to the param_list format by tcf2param_list

Details

Leave-One-Out cross-validation is used to avoid bias in log-likelihood for an individual's known collection of origin

Value

geno_logL returns a matrix with C rows and I columns. Each column represents an individual, and each row the log-likelihood of that individual's genotype, given the allele counts in that collection

Examples

example(tcf2param_list)
ale_glL <- geno_logL(ale_par_list)

Calculate a matrix of sum-of-squares of genotype log-likelihoods for a genetic dataset

Description

Takes a list of key parameters from a genetic dataset, and calculates the sum of squared log-likelihood of each individual's genotype, given the allele counts in each collection. This is used for the quick-and-dirty Z-score calculations.

Usage

geno_logL_ssq(par_list)

Arguments

par_list

genetic data converted to the param_list format by tcf2param_list

Details

Leave-One-Out cross-validation is used to avoid bias in log-likelihood for an individual's known collection of origin

Value

geno_logL returns a matrix with C rows and I columns. Each column represents an individual, and each row the log-likelihood of that individual's genotype, given the allele counts in that collection

Examples

example(tcf2param_list)
ale_glL <- geno_logL(ale_par_list)

infer the ploidy from the pattern of NAs in the columns of data

Description

This is strictly internal

Usage

get_ploidy_from_frame(tmp, type)

Arguments

tmp

a data frame with 2 * L columns (two for each locus)


Simulate genotype log-likelihoods from a population by gene copy

Description

Takes a list of parameters from a genetic dataset, and returns a genotype log-likelihood matrix for individuals simulated by gene copy from the specified collections

Usage

gprob_sim_gc(par_list, sim_colls)

Arguments

par_list

genetic data converted to the param_list format by tcf2param_list

sim_colls

a vector of indices for the collections desired for simulation; each element of the list corresponds to an individual

Details

In simulation by gene copy, the genotype at a locus for any individual is the result of two random draws from the allele count matrix of that locus. Draws within an individual are performed without replacement, but allele counts are replaced between individuals.

Value

gprob_sim returns a matrix of the summed log-likelihoods for all loci of a simulated population mixture; columns represent individuals, with each row containing their log-likelihood of belonging to the collection of the same index, given the selection of two independent gene copies from the desired collection of origin's reference allele frequencies

Examples

example(tcf2param_list)
sim_colls <- sample(ale_par_list$C, 1070, replace = TRUE)
ale_sim_gprobs_gc <- gprob_sim_gc(ale_par_list, sim_colls)

Simulate genotypes by gene copy, with missing data from chosen individuals

Description

Takes a list of parameters from a genetic dataset, and returns a genotype log-likelihood matrix for individuals simulated by gene copy from the specified collections, with genotypes masked by missing data patterns from reference individuals

Usage

gprob_sim_gc_missing(par_list, sim_colls, sim_missing)

Arguments

par_list

genetic data converted to the param_list format by tcf2param_list

sim_colls

a vector; element i specifies the collection from which to sample the genotypes for individual i

sim_missing

a vector; element i specifies the index for the individual in params$I whose missing data should be copied for individual i

Details

In simulation by gene copy, the genotype at a locus for any individual is the result of two random draws from the allele count matrix of that locus. Draws within an individual are performed without replacement, but allele counts are replaced between individuals. If the data at a particular locus is missing for individual i in sim_missing, this data will also be missing in simulated individual i for the log-likelihood calculation.

Examples


# If one wanted to simulate the missing data patterns
# of a troublesome mixture dataset, one would run tcf2param_list,
# selecting samp_type = "mixture", then draw sim_miss from
# the mixture individual genotype list

# make a fake mixture data set to demonstrate...
drawn <- mixture_draw(alewife, rhos = c(1/3, 1/3, 1/3),N = 100)
ref <- drawn$reference
mix <- drawn$mix

# then run it...
# we have to get the ploidies to pass to tcf2param_list
locnames <- names(alewife)[-(1:16)][c(TRUE, FALSE)]
ploidies <- rep(2, length(locnames))
names(ploidies) <- locnames
params <- tcf2param_list(rbind(ref,mix), 17, samp_type = "mixture", ploidies = ploidies)
sim_colls <- sample(params$C, 1070, replace = TRUE)
sim_miss <- sample(length(params$coll), 1070, replace = TRUE)
ale_sim_gprobs_miss <- gprob_sim_gc_missing(params, sim_colls, sim_miss)

Simulate genotype log-likelihoods from a population by individual

Description

Takes a list of parameters from a genetic dataset, and returns a genotype log-likelihood matrix for individuals simulated by individual from the specified collections

Usage

gprob_sim_ind(par_list, sim_colls)

Arguments

par_list

genetic data converted to the param_list format by tcf2param_list

sim_colls

a vector of indices for the collections desired for simulation; each element of the list corresponds to an individual

Details

In simulation by individual, the genotype for any simulated individual is the result of a single random draw from the genotypes of all individuals in the collection. Gene copies and loci are therefore not independent.

Value

gprob_sim returns a matrix of the summed log-likelihoods for all loci of a simulated population mixture; columns represent individuals, with each row containing their log-likelihood of belonging to the collection of the same index, given the selection of an individual's genotype from the reference collection of interest. Selection at the locus and gene copy level are not independent, and missing data is included in selection.

Examples

example(tcf2param_list)
sim_colls <- sample(ale_par_list$C, 1070, replace = TRUE)
ale_sim_gprobs_ind <- gprob_sim_ind(ale_par_list, sim_colls)

EM algorithm from the simplest GSI model for pi and the individual posterior probabilities

Description

Using a matrix of scaled likelihoods, this function does an EM algorithm to climb the likelihood surface for pi, and computes the plug-in estimate of the posteriors for all the individuals. It returns the output in a list.

Usage

gsi_em_1(SL, Pi_init, max_iterations, tolerance, return_progression)

Arguments

SL

a matrix of the scaled likelihoods. This is should have values for each individual in a column (going down in the rows are values for different collections).

Pi_init

Starting value for the pi (collection mixture proportion) vector.

max_iterations

the maximum total number of reps iterations to do.

tolerance

the EM-algorithm will be considered converged when the sum over the elements of pi of the absolute value of the difference between the previous and the current estimate is less than tolerance.

return_progression

If true, then the pi_trace component of the output shows the value of pi visited en route to the end.

Value

gsi_em_1 returns a final Maximum-Likelihood estimate for pi and PofZ, as well as the number of iterations needed to reach convergence ("iterations_performed"), and traces of the pi values and change in pi in each iteration

Examples

# this is shown with a scaled likelihood matrix from self-assignment
# of the reference individuals

# we have to get the ploidies to pass to tcf2param_list
locnames <- names(alewife)[-(1:16)][c(TRUE, FALSE)]
ploidies <- rep(2, length(locnames))
names(ploidies) <- locnames
params <- tcf2param_list(alewife, 17, ploidies = ploidies)
logl <- geno_logL(params)
SL <- apply(exp(logl), 2, function(x) x/sum(x))
test_em <- gsi_em_1(SL,
                    rep(1/params$C, params$C),
                    max_iterations = 10^6,
                    tolerance = 10^-7,
                    return_progression = TRUE)

MCMC from the simplest GSI model for pi and the individual posterior probabilities

Description

Using a matrix of scaled likelihoods, this function samples values of pi and the posteriors for all the individuals. It returns the output in a list.

Usage

gsi_mcmc_1(SL, Pi_init, lambda, reps, burn_in, sample_int_Pi, sample_int_PofZ)

Arguments

SL

matrix of the scaled likelihoods. This is should have values for each individual in a column (going down in the rows are values for different populations).

Pi_init

Starting value for the pi (collection mixture proportion) vector.

lambda

the prior to be added to the collection allocations, in order to generate pseudo-count Dirichlet parameters for the simulation of a new pi vector

reps

total number of reps (sweeps) to do.

burn_in

how many reps to discard in the beginning when doing the mean calculation. They will still be returned in the traces if desired

sample_int_Pi

the number of reps between samples being taken for Pi traces. If 0 no trace samples are taken

sample_int_PofZ

the number of reps between samples being taken for the traces of posterior of each individual's origin. If 0 no trace samples are taken.

Value

gsi_mcmc_1 returns a list of three. $mean lists the posterior means for collection proportions pi and for the individual posterior probabilities of assignment PofZ. $sd returns the posterior standard deviations for the same values.

If the corresponding sample_int variables are not 0, $trace contains samples taken from the Markov chain at intervals of sample_int_(variable) steps.

Examples

# this demonstrates it with scaled likelihoods computed from
# assignment of the reference samples

# we have to get the ploidies to pass to tcf2param_list
locnames <- names(alewife)[-(1:16)][c(TRUE, FALSE)]
ploidies <- rep(2, length(locnames))
names(ploidies) <- locnames

params <- tcf2param_list(alewife, 17, ploidies = ploidies)
logl <- geno_logL(params)
SL <- apply(exp(logl), 2, function(x) x/sum(x))
lambda <- rep(1/params$C, params$C)
mcmc <- gsi_mcmc_1(SL, lambda, lambda, 200, 50, 5, 5)

MCMC from the fully Bayesian GSI model for pi and the individual posterior probabilities

Description

Given a list of key parameters from a genetic dataset, this function samples values of pi and the posteriors for all the individuals. Each MCMC iteration includes a recalculation of the scaled genotype likelihood matrix, with baseline allele frequencies updated based on the previous iteration's allocations. It returns the output in a list.

Usage

gsi_mcmc_fb(
  par_list,
  Pi_init,
  lambda,
  reps,
  burn_in,
  sample_int_Pi,
  sample_int_PofZ
)

Arguments

par_list

genetic data converted to the param_list format by tcf2param_list

Pi_init

Starting value for the pi (collection mixture proportion) vector.

lambda

the prior to be added to the collection allocations, in order to generate pseudo-count Dirichlet parameters for the simulation of a new pi vector

reps

total number of reps (sweeps) to do.

burn_in

how many reps to discard in the beginning when doing the mean calculation. They will still be returned in the traces if desired

sample_int_Pi

the number of reps between samples being taken for Pi traces. If 0 no trace samples are taken

sample_int_PofZ

the number of reps between samples being taken for the traces of posterior of each individual's origin. If 0 no trace samples are taken.

Value

gsi_mcmc_fb returns a list of three. $mean lists the posterior means for collection proportions pi, for the individual posterior probabilities of assignment PofZ, and for the allele frequencies theta. $sd returns the posterior standard deviations for the same values.

If the corresponding sample_int variables are not 0, $trace contains samples taken from the Markov chain at intervals of sample_int_(variable) steps.

Examples

# this demonstrates it with scaled likelihoods computed from
# assignment of the reference samples

# we have to get the ploidies to pass to tcf2param_list
locnames <- names(alewife)[-(1:16)][c(TRUE, FALSE)]
ploidies <- rep(2, length(locnames))
names(ploidies) <- locnames

params <- tcf2param_list(alewife, 17, ploidies = ploidies)
lambda <- rep(1/params$C, params$C)
# use very short run and burn in so it doesn't take too long
# when checking on CRAN
mcmc <- gsi_mcmc_fb(params, lambda, lambda, 20, 5, 4, 4)

Estimate mixing proportions and origin probabilities from one or several mixtures

Description

Takes a mixture and reference dataframe of two-column genetic data, and a desired method of estimation for the population mixture proportions (MCMC, PB, BR). Returns the output of the chosen estimation method

Usage

infer_mixture(
  reference,
  mixture,
  gen_start_col,
  method = "MCMC",
  alle_freq_prior = list(const_scaled = 1),
  pi_prior = NA,
  pi_init = NULL,
  reps = 2000,
  burn_in = 100,
  pb_iter = 100,
  prelim_reps = NULL,
  prelim_burn_in = NULL,
  sample_int_Pi = 1,
  sample_theta = TRUE,
  pi_prior_sum = 1
)

Arguments

reference

a dataframe of two-column genetic format data, proceeded by "repunit", "collection", and "indiv" columns. Does not need "sample_type" column, and will be overwritten if provided

mixture

a dataframe of two-column genetic format data. Must have the same structure as reference dataframe, but "collection" and "repunit" columns are ignored. Does not need "sample_type" column, and will be overwritten if provided

gen_start_col

the first column of genetic data in both data frames

method

a choice between "MCMC", "PB", "BR" methods for estimating mixture proportions

alle_freq_prior

a one-element named list specifying the prior to be used when generating Dirichlet parameters for genotype likelihood calculations. Valid methods include "const", "scaled_const", and "empirical". See ?list_diploid_params for method details.

pi_prior

The prior to be added to the collection allocations, in order to generate pseudo-count Dirichlet parameters for the simulation of new pi vectors in MCMC. Default value of NA leads to the calculation of a symmetrical prior based on pi_prior_sum. To provide other values to certain collections, you can pass in a data frame with two columns, "collection" listing the relevant collection, and "pi_param" listing the desired prior for that collection. Specific priors may be listed for as few as one collection. The special collection name "DEFAULT_PI" is used to set the prior for all collections not explicitly listed; if no "DEFAULT_PI" is given, it is taken to be 1/(# collections).

pi_init

The initial value to use for the mixing proportion of collections. This lets the user start the chain from a specific value of the mixing proportion vector. If pi_init is NULL (the default) then the mixing proportions are all initialized to be equal. Otherwise, you pass in a data frame with one column named "collection" and the other named "pi_init". Every value in the pi_init column must be strictly positive (> 0), and a value must be given for every collection. If they sum to more than one the values will be normalized to sum to one.

reps

the number of iterations to be performed in MCMC

burn_in

how many reps to discard in the beginning of MCMC when doing the mean calculation. They will still be returned in the traces if desired.

pb_iter

how many bootstrapped data sets to do for bootstrap correction using method PB. Default is 100.

prelim_reps

for method "BR", the number of reps of conditional MCMC (as in method "MCMC") to perform prior to MCMC with baseline resampling. The posterior mean of mixing proportions from this conditional MCMC is then used as pi_init in the baseline resampling MCMC.

prelim_burn_in

for method "BR", this sets the number of sweeps out of prelim_reps that should be discarded as burn in when preparing the posterior means of the mixing proportions to be set as pi_init in the baseline resampling MCMC.

sample_int_Pi

how many iterations between storing the mixing proportions trace. Default is 1. Can't be 0. Can't be so large that fewer than 10 samples are taken from the burn in and the sweeps.

sample_theta

for method "BR", whether or not the function should store the posterior mean of the updated allele frequences. Default is TRUE

pi_prior_sum

For pi_prior = NA, the prior on the mixing proportions is set as a Dirichlet vector of length C, with each element being W/C, where W is the pi_prior_sum and C is the number of collections. By default this is 1. If it is made much smaller than 1, things could start to mix more poorly.

Details

"MCMC" estimates mixing proportions and individual posterior probabilities of assignment through Markov-chain Monte Carlo conditional on the reference allele frequencies, while "PB" does the same with a parametric bootstrapping correction, and "BR" runs MCMC sweeps while simulating reference allele frequencies using the genotypes of mixture individuals and allocations from the previous sweep. All methods default to a uniform 1/(# collections or RUs) prior for the mixing proportions.

Value

Tidy data frames in a list with the following components: mixing_proportions: the estimated mixing proportions of the different collections. indiv_posteriors: the posterior probs of fish being from each of the collections. mix_prop_traces: the traces of the mixing proportions. Useful for computing credible intervals. bootstrapped_proportions: If using method "PB" this returns the bootstrap corrected reporting unit proportions.

Examples

mcmc <- infer_mixture(reference = small_chinook_ref,
                      mixture = small_chinook_mix,
                      gen_start_col = 5,
                      method = "MCMC",
                      reps  = 200)

Collect essential data values before mixture proportion estimation

Description

Takes all relevant information created in previous steps of data conversion pipeline, and combines into a single list which serves as input for further calculations

Usage

list_diploid_params(
  AC_list,
  I_list,
  PO,
  coll_N,
  RU_vec,
  RU_starts,
  alle_freq_prior = list(const_scaled = 1)
)

Arguments

AC_list

a list of allele count matrices; output from a_freq_list

I_list

a list of genotype vectors; output from allelic_list

PO

a vector of collection (population of origin) indices for every individual in the sample, in order identical to I_list

coll_N

a vector of the total number of individuals in each collection, in order of appearance in the dataset

RU_vec

a vector of collection indices, sorted by reporting unit

RU_starts

a vector of indices, designating the first collection for each reporting unit in RU_vec

alle_freq_prior

a one-element named list specifying the prior to be used when generating Dirichlet parameters for genotype likelihood calculations. The name of the list item determines the type of prior used, with options "const", "scaled_const", and "empirical". If "const", the listed number will be taken as a constant added to the count for each allele, locus, and collection. If "scaled_const", the listed number will be divided by the number of alleles at a locus, then added to the allele counts. If "empirical", the listed number will be multiplied by the relative frequency of each allele across all populations, then added to the allele counts.

Details

Genotypes represented in I_list are converted into a single long vector, ordered by locus, individual, and gene copy, with NA values represented as 0s. Similarly, AC_list is unlisted to AC, ordered by locus, collection, and allele. DP is a list of Dirichlet priors for likelihood calculations, created by adding the values calculated from alle_freq_prior to each allele sum_AC and sum_DP are the summed allele values for each locus of their parent vectors, ordered by locus and collection.

Value

list_diploid_params returns a list of the information necessary for the calculation of genotype likelihoods in MCMC:

L, N, and C represent the number of loci, individual genotypes, and collections, respectively. A is a vector of the number of alleles at each locus, and CA is the cumulative sum of A. coll, coll_N, RU_vec, and RU_starts are copied directly from input.

I, AC, sum_AC, DP, and sum_DP are vectorized versions of data previously represented as lists and matrices; indexing macros use L, N, C, A, and CA to access these vectors in later Rcpp-based calculations.

Examples

example(allelic_list)
PO <- as.integer(factor(ale_long$clean_short$collection))
coll_N <- as.vector(table(PO))

Colls_by_RU <- dplyr::count(ale_long$clean_short, repunit, collection) %>%
   dplyr::filter(n > 0) %>%
   dplyr::select(-n)
 PC <- rep(0, length(unique((Colls_by_RU$repunit))))
 for(i in 1:nrow(Colls_by_RU)) {
   PC[Colls_by_RU$repunit[i]] <- PC[Colls_by_RU$repunit[i]] + 1
 }
RU_starts <- c(0, cumsum(PC))
RU_vec <- as.integer(Colls_by_RU$collection)
param_list <- list_diploid_params(ale_ac, ale_alle_list, PO, coll_N, RU_vec, RU_starts)


Separate a chosen proportion of a reference dataset into a mixture with known population proportions

Description

Takes a reference dataset and a set of population proportions, either at the collection or reporting unit level. Randomly samples individuals to satisfy these desired proportions, and splits them into a new "mixture" dataframe.

Usage

mixture_draw(D, rhos = NULL, omegas = NULL, N, min_remaining = 0)

Arguments

D

a two-column genetic dataframe with "indiv", "repunit", and "collection" columns

rhos

a vector of the desired reporting unit proportions in the mixture set; if not named, will be assumed to be ordered by order of appearance in the dataset

omegas

the desired collection proportions in the mixture set

N

the total size of the mixture set

min_remaining

the fraction of any collection in the reference dataset which must remain at the end of the draw

Value

mixture_draw returns a list of two data frames, "mixture" being the random sample taken, and "reference" being the remaining samples

Examples

rhos <- as.vector(gtools::rdirichlet(1, table(alewife$repunit)))
cross_val <- mixture_draw(D = alewife, rhos = rhos, N = 100, min_remaining = .005)


for individuals of known origin in the mixture, put all their weight on their known collection

Description

This is used internally.

Usage

modify_scaled_likelihoods_for_known_mixture_fish(SL, KC, CFL)

Arguments

SL

the matrix of scaled likelihoods.

KC

a character vector of collections that the individuals belong to (or NA if you don't know where they come from). If this is NULL, then SL just gets returned untouched.

CFL

the levels of the collections factor (which is within clean$short)


Compute the mean and variance of the single-locus genotype likelihoods for each collection

Description

This assumes that you have compiled params for a reference data set and then it just calls rcpp_per_locus and then summarizes the results.

Usage

per_locus_means_and_vars(par_list)

Arguments

par_list

genetic data converted to the param_list format by tcf2param_list. This should be include genotypes only for the reference individuals.

Value

Returns a list with two components, mean and var, each one a matrix that has C (number of collections) rows and L (number of loci) columns, giving the mean (or variance) of the genotype likelihoods in the individuals in that collection at that locus.


perfect-assignment genetic data for chinook.

Description

This is just like the chinook data, but only has 7 loci and all loci are fixed in fortuitous patterns so that every single collection is easily resolved. This is primarily useful for testing purposes.

Source

Made it up!


perfect-assignment mixture genetic data for chinook.

Description

This is similar to the chinook_mix data, but only has 7 loci and all loci are fixed in fortuitous patterns so that every single collection is easily resolved. This is primarily useful for testing purposes. The name of the individual has its collection inside the colons.

Source

Made it up!


Find all pairs that have close matches

Description

Find all pairs that have close matches

Usage

rcpp_close_matchers(par_list, non_miss_fract, match_fract)

Arguments

par_list

genetic data converted to the param_list format by tcf2param_list

Value

Gotta say more

Examples

# gotta do something here too

From the pattern of missing data at each individual, compute the expected mean and variance of the logl

Description

This takes a param_list so that it has access to individual genotypes (and hence can cycle through them and know which are missing and which are not.) It also takes a matrix of per-locus logl means and variances like what is computed by per_locus_means_and_vars.

Usage

rcpp_indiv_specific_logl_means_and_vars(par_list, MV)

Arguments

par_list

genetic data converted to the param_list format by tcf2param_list

MV

a list of two matrices, one of means and the other of variances, which are C x L matrices. This is basically the list that is returned by per_locus_means_and_vars.

Details

This function doesn't do any checking to assure that the par_list and the per-locus logl means matrix are made for one another. (i.e. use the same collections in the same order.)

Value

a matrix with C rows and I columns. Each row represents a collection, and each column an individual.


Return a matrix of locus-specific self-assignment logls

Description

Takes a list of key parameters from a genetic dataset, and calculates the log-likelihood of each individual's single-locus genotype, given the allele counts in the individual's collection.

Usage

rcpp_per_locus_logls(par_list)

Arguments

par_list

genetic data converted to the param_list format by tcf2param_list

Details

This uses Leave-One-Out cross-validation is used to avoid bias in log-likelihood for an individual's known collection of origin

Value

rcpp_per_locus_logls returns a matrix with I rows and L columns. Each row represents an individual, and each column a locus. Note that missing data at a locus returns a zero. That should be changed to NA later.


read a gsi_sim formatted input file into a tibble that rubias can use

Description

Note that this relies on a system call to awk. It probably won't work on Windows.

Usage

read_gsi_sim(path, sample_type, repunits = NULL)

Arguments

path

path to the gsi_sim file

sample_type

should be "reference" or "mixture" depending on what kind of file it is

repunits

the gsi_sim reporting units file. Has no effect if sample_type is "mixture". If sample_type is "reference" and this is left as NULL, then all collections will be put in the the "default_repu".


Estimate mixing proportions from reference and mixture datasets

Description

Takes a mixture and reference dataframe of two-column genetic data, and a desired method of estimation for the population mixture proportions (MCMC, PB, or BH MCMC) Returns the output of the chosen estimation method

Usage

ref_and_mix_pipeline(
  reference,
  mixture,
  gen_start_col,
  method = "MCMC",
  reps = 2000,
  burn_in = 100,
  sample_int_Pi = 0,
  sample_int_PofZ = 0,
  sample_int_omega = 0,
  sample_int_rho = 0,
  sample_int_PofR = 0
)

Arguments

reference

a dataframe of two-column genetic format data, proceeded by "repunit", "collection", and "indiv" columns. Does not need "sample_type" column, and will be overwritten if provided

mixture

a dataframe of two-column genetic format data. Must have the same structure as reference dataframe, but "collection" and "repunit" columns are ignored. Does not need "sample_type" column, and will be overwritten if provided

gen_start_col

the first column of genetic data in both data frames

method

this must be "MCMC". "PB" and "BH" are no longer supported in this function.

reps

the number of iterations to be performed in MCMC

burn_in

how many reps to discard in the beginning of MCMC when doing the mean calculation. They will still be returned in the traces if desired.

sample_int_Pi

the number of reps between samples being taken for pi traces. If 0 no traces are taken. Only used in methods "MCMC" and "PB".

sample_int_PofZ

the number of reps between samples being taken for the posterior traces of each individual's collection of origin. If 0 no trace samples are taken. Used in all methods

sample_int_omega

the number of reps between samples being taken for collection proportion traces. If 0 no traces are taken. Only used in method "BH"

sample_int_rho

the number of reps between samples being taken for reporting unit proportion traces. If 0 no traces are taken. Only used in method "BH"

sample_int_PofR

the number of reps between samples being taken for the posterior traces of each individual's reporting unit of origin. If 0 no trace samples are taken. Only used in method "BH".

Details

"MCMC" estimates mixing proportions and individual posterior probabilities of assignment through Markov-chain Monte Carlo, while "PB" does the same with a parametric bootstrapping correction, and "BH" uses the misassignment-scaled, hierarchical MCMC. All methods use a uniform 1/(# collections or RUs) prior for pi/omega and rho.

Value

mix_proportion_pipeline returns the standard output of the chosen mixing proportion estimation method (always a list). For method "PB", returns the standard MCMC results, as well as the bootstrap-corrected collection proportions under $mean$bootstrap

Examples

reference <- small_chinook_ref
mixture <- small_chinook_mix
gen_start_col <- 5

# this function expects things as factors.  This function is old and needs
# to be replaced and deprecated.

reference$repunit <- factor(reference$repunit, levels = unique(reference$repunit))
reference$collection <- factor(reference$collection, levels = unique(reference$collection))
mixture$repunit <- factor(mixture$repunit, levels = unique(mixture$repunit))
mixture$collection <- factor(mixture$collection, levels = unique(mixture$collection))

mcmc <- ref_and_mix_pipeline(reference, mixture, gen_start_col, method = "MCMC")


Tabulate occurrences of all observed alleles in reference genetic data

Description

Takes the first output of tcf2long, along with two columns named "collection" and "sample_type", and returns a data frame of allele counts for each locus within each reference population. Alleles to be counted are identified from both reference and mixture populations.

Usage

reference_allele_counts(D, pop_level = "collection")

Arguments

D

A data frame containing, at minimum, a column of sample group identifiers named "collection", a column designating each row as "reference" or "mixture", named "sample_type", and (from tcf2long output) locus, gene copy, and observed alleles. If higher-level reporting unit counts are desired, must have a column of reporting unit identifiers named "repunit"

pop_level

a character vector expressing the population level for which allele counts should be tabulated. Set to "collection" for collection/underlying sample group (default), or "repunit" for reporting unit/overlying sample groups

Details

The "collection" column should be a key assigning samples to the desired groups, e.g. collection site, run time, year. The "sample_type" column must contain either "reference" or "mixture" for each sample.

Value

reference_allele_counts returns a long-format dataframe, with count data for each collection, locus, and allele. Counts are only drawn from "reference" samples; alleles unique to the "mixture" samples will still appear in the list, but will have 0s for all groups.

Examples

## count alleles in alewife reference populations
example(tcf2long)  # gets variable ale_long
ale_rac <- reference_allele_counts(ale_long$long)


Round a given number, with 5 always rounded up

Description

Given a number and a digit to round to, returns the rounded number, with 5 always rounded upwards.

Usage

round2(x, n)

Arguments

x

the data to be rounded

n

the number of digits to round to

Details

This function differs from round, which rounds 5 "towards the even number". Rounding 5s up leads to bias when positive and negative numbers are expected, but can be desired in some cases.


rubias: Bayesian inference from the conditional genetic stock identification model

Description

Read the "rubias-overview" vignette for information on data input formats and how to use the package

the rubias main high-level functions

The following functions are wrappers, designed for user-friendly input and useful output:

infer_mixture is used to perform genetic stock identification. Options include standard MCMC and the parametric bootstrap bias correction.

self_assign does simple self-assignment of individuals in a reference data set to the various collections in the reference data set.

assess_reference_loo does leave-one-out based simulations to predict how accurately GSI can be done.

assess_reference_mc uses Monte-Carlo cross-validation based simulations to predict how accurately GSI can be done.

assess_pb_bias_correction attempts to demonstrate how much (or little) improvement can be expected from the parametric bootstrap correction given a particular reference data set.

genetic data format

See the vignette.

example data

alewife, blueback, and chinook are genetic data sets that are useful for playing around with rubias and testing it out.


Sample 1 observation from cell probabilities that are columns of a matrix

Description

Takes a matrix in which columns sum to one. For each column, performs a single multinomial draw from the rows, weighted by their values in that column

Usage

samp_from_mat(M)

Arguments

M

a matrix whose columns are reals summing to one

Value

a vector length = ncol(M) of indices, with each element being the row that was chosen in that column's sampling


Do leave-one-out self-assignment of individuals in a reference baseline

Description

Returns a tidy data frame

Usage

self_assign(
  reference,
  gen_start_col,
  preCompiledParams = NULL,
  alle_freq_prior = list(const_scaled = 1)
)

Arguments

reference

a two-column format genetic dataset, with "repunit", "collection", and "indiv" columns, as well as a "sample_type" column that has some "reference" entries

gen_start_col

the first column of genetic data in reference

preCompiledParams

Users should never use this option. It is here only so that this function can be called on a precompiled set of parameters with infer_mixture. Don't use this, unless you are one of the package developers...

alle_freq_prior

a one-element named list specifying the prior to be used when generating Dirichlet parameters for genotype likelihood calculations. Valid methods include "const", "scaled_const", and "empirical". See ?list_diploid_params for method details.

Value

a tibble ...

Examples

ale_sa <- self_assign(alewife, 17)

List of example ways of specifying repunit and collection quantities in simulations

Description

This is just a list of tibbles that can be passed to the alpha_repunit or the alpha_collection parameters in, for example, assess_reference_loo.

Source

Made it up!


Generate a samples for a mixture.

Description

Creates random reporting unit (rho) and collection (omega) proportions, and a sim_colls vector for simulation of individual genotypes, based on the methods used in Hasselman et al. (2015)

Usage

simulate_random_samples(
  RU_starts,
  RU_vec,
  size = 100,
  alpha_repunit = 1.5,
  alpha_collection = 1.5,
  coll_sub_dirichlet_default = 1.5
)

Arguments

RU_starts

a vector delineating the reporting units in RU_vec; generated by tcf2param_list

RU_vec

a vector of collection indices, grouped by reporting unit; generated by tcf2param_list

size

The number of individuals desired in the mixture sample. Default = 100. This is ignored without a warning if alpha_repunit is specified with counts (a cnt column) or if alpha_collection is specified with counts (a cnt column).

alpha_repunit

If a vector, this is the dirichlet parameter for simulating the proportions of reporting units. Gets recycled to the number of reporting units. Default is 1.5. Otherwise, this could be a two-column data frame. The first column must be named "repunit" and the second one must be one of "dirichlet", "ppn", or "cnt", according to whether you wish to specify dirichlet parameters, or proportions, or exact counts, respectively, for each population.

alpha_collection

The dirichlet parameter for simulating proportions of collections within reporting units. Default = 1.5

coll_sub_dirichlet_default

If you are providing a data frame with requested sub_dirichlet pars for collections, and you don't specifically list one, this is the value it gets.

Value

a list with three elements. The first two are a rho vector and an omega vector, respectively. The third is a vector of origins for simulated individuals, sampled from the collections with probabilities = omega


Small sample of SNP data from Chinook salmon taken in May/August 2015 from California fisheries

Description

This is simply a sample of 100 fish from chinook.

Source

Southwest Fisheries Science Center, Santa Cruz, CA


SNP data from selected chinook reference populations

Description

A small number of poulations from the Chinook salmon baseline data similar to that which can be downloaded from https://datadryad.org/stash/dataset/doi:10.5061/dryad.574sv. This data set includes 91 SNPs and 909 fish.

Format

A tbl_df-ed (from dplyr) data frame with 909 rows and 185 variables. The first three columns are

repunit (chr)

the reporting unit that the individual is in

pop (chr)

the population from which the individual was sampled

ID (chr)

Unique identifier of the individual fish

The remaining columns are two columns for each locus. These columns are named like, "Locus.1" and "Locus.2" for the first and second gene copies at that locus. For example, "Ots_104569-86.1" and "Ots_104569-86.2". The locus columns are ints and missing data is denoted by NA.

Source

https://datadryad.org/stash/dataset/doi:10.5061/dryad.574sv


Convert Two-Column Genetic Data to Long Format

Description

Takes a data frame consisting of metadata followed by paired columns of genetic data, with each column in the pair representing a gene copy at a locus. Returns a list of two data frames: one with genetic data condensed into one column, and the other with two-column structure intact, but with cleaned allele names.

Usage

tcf2long(D, gen_start_col)

Arguments

D

A data frame containing two-column genetic data, optionally preceded by metadata. The header of the first genetic data column in each pair lists the locus name, the second is ignored. Locus names must not have spaces in them!

gen_start_col

The index (number) of the column in which genetic data starts. Columns must be only genetic data after genetic data starts.

Value

tcf2long returns a list of two data frames: in the first, "long", the rightmost column is the genetic data. Two new columns, "locus" and "gene copy", duplicate the original column name provided in the first of each pair, and designate copies "a" and "b", respectively. Metadata is duplicated as necessary for each locus. The second, "clean_short", replicates the input dataset, but with column names replaced by "(locus name) a" and "(locus name) b" in each pair. In other words the locus name has an "a" or a "b" added to it after a space.

Examples

## Convert the alewife dataset for further processing
# the data frame passed into this function must have had
# character collections and repunits converted to factors
reference <- alewife
reference$repunit <- factor(reference$repunit, levels = unique(reference$repunit))
reference$collection <- factor(reference$collection, levels = unique(reference$collection))
ale_long <- tcf2long(reference, 17)

Generate MCMC parameter list from two-column genetic data & print summary

Description

This function is a wrapper for all steps to create the parameter list necessary for genotype log-likelihood calculation from the starting two-column genetic data

Usage

tcf2param_list(
  D,
  gen_start_col,
  samp_type = "both",
  alle_freq_prior = list(const_scaled = 1),
  summ = T,
  ploidies
)

Arguments

D

A data frame containing two-column genetic data, preceded by metadata. The header of the first genetic data column in each pair lists the locus name, the second is ignored. Locus names must not have spaces in them! Required metadata includes a column of unique individual identifiers named "indiv", a column named "collection" designating the sample groups, a column "repunit" designating the reporting unit of origin of each fish, and a "sample_type" column denoting each individual as a "reference" or "mixture" sample. No NAs should be present in metadata

gen_start_col

The index (number) of the column in which genetic data starts. Columns must be only genetic data after genetic data starts.

samp_type

the sample groups to be include in the individual genotype list, whose likelihoods will be used in MCMC. Options "reference", "mixture", and "both"

alle_freq_prior

a one-element named list specifying the prior to be used when generating Dirichlet parameters for genotype likelihood calculations. Valid methods include "const", "scaled_const", and "empirical". See ?list_diploid_params for method details.

summ

logical indicating whether summary descriptions of the formatted data be provided

ploidies

a named vector of ploidies (1 or 2) for each locus. The names must the the locus names.

Details

In order for all steps in conversion to be carried out successfully, the dataset must have "repunit", "collection", "indiv", and "sample_type" columns preceding two-column genetic data. If summ == TRUE, the function prints summary statistics describing the structure of the dataset, as well as the presence of missing data, enabling verification of proper data conversion.

Value

tcf2param_list returns the output of list_diploid_params, after the original dataset is converted to a usable format and all relevant values are extracted. See ?list_diploid_params for details

Examples

# after adding support for haploid markers we need to pass
# in the ploidies vector.  These markers are all diploid...
locnames <- names(alewife)[-(1:16)][c(TRUE, FALSE)]
ploidies <- rep(2, length(locnames))
names(ploidies) <- locnames
ale_par_list <- tcf2param_list(alewife, 17, ploidies = ploidies)


A helper function to tidy up the output from the gsi_mcmc functions

Description

This makes a tidy data frame of stuff, and also changes things back to factors, if the levels are provided.

Usage

tidy_mcmc_coll_rep_stuff(field, p, pname, car_tib)

Arguments

field

The output to tidy (i.e.. out$mean)

p

the name of the parameter whose values you want to extract (like "pi")

pname

the name that you want the parameter to be called in the output

car_tib

a tibble with repunit and collection in the order they appear in the output


A helper function to tidy up the PofZ-like output from the gsi_mcmc functions

Description

This makes a tidy data frame of stuff, and also changes things back to factors, if the levels are provided.

Usage

tidy_mcmc_pofz(input, pname, car_tib, mix_indiv_tib)

Arguments

input

The output to tidy (i.e.. out$mean$PofZ)

pname

the name that you want the parameter to be called in the output

car_tib

a tibble with repunit and collection in the order they appear in the output

mix_indiv_tib

a tibble with the individuals in the order they appear in the output


a helper function to tidy up the pi-traces that come out of the mcmc functions

Description

This makes a tidy data frame of stuff, and also changes things back to factors, if the levels are provided.

Usage

tidy_pi_traces(input, pname, car_tib, interval)

Arguments

input

The output to tidy (i.e.. out$trace$pi)

pname

the name that you want the parameter to be called in the output

car_tib

a tibble with repunit and collection in the order they appear in the output

interval

the thinning interval that was used


Write a mixture data frame to gsi_sim format baseline and repunits file

Description

Note, this is only intended to work with integer-valued alleles, at the moment. It was just written for testing and verifying that things are working correctly.

Usage

write_gsi_sim_mixture(mix, gen_start_col, mixprefix)

Arguments

mix

mixture data frame

gen_start_col

column in which the genetic data start

mixprefix

path to write the mixture file to. The mixture collection name + .txt will be appended to this. This path can include directories if they exist. An example would be "./my_gsi_data/mixture". This is a required argument.

Examples

# this writes to file prefix "mixfile" in a temporary directory
dd <- tempdir()
prefix <- file.path(dd, "mixfile")

# print that
prefix

# note that in practice you will probably want to specify
# your own directory...

# run the function
write_gsi_sim_mixture(chinook_mix, 5, prefix)

# see where those files live:
dir(dd, pattern = "mixfile*", full.names = TRUE)

Write a reference data frame to gsi_sim format baseline and repunits file

Description

Note, this is only intended to work with integer-valued alleles, at the moment. It was just written for testing and verifying that things are working correctly.

Usage

write_gsi_sim_reference(
  ref,
  gen_start_col,
  baseout = "baseline.txt",
  repout = "repunits.txt"
)

Arguments

ref

reference data frame

gen_start_col

column in which the genetic data start

baseout

path to write the baseline file to. Required.

repout

path to write the repunits file to. Required.

Examples

# create a temp directory to put example outputs
dd <- tempdir()
basefile <- file.path(dd, "baseline.txt")
repunitsfile <- file.path(dd, "repunits.txt")

# print those
basefile
repunitsfile

# note that in practice you will probably want to specify
# your own filepaths...

# run the function
write_gsi_sim_reference(alewife, 17, basefile, repunitsfile)