| Title: | Dealing with Binary Replicates |
|---|---|
| Description: | Statistical methods for analyzing binary replicates, which are noisy binary measurements of latent binary states. Provides scoring functions (average, median, likelihood-based, and Bayesian) to estimate the probability that an individual is in the positive state. Includes maximum a posteriori estimation via the EM algorithm and full Bayesian inference via Stan. Supports classification with inconclusive decisions and prevalence estimation. |
| Authors: | Pierre Pudlo [aut, cre] (ORCID: <https://orcid.org/0000-0003-0995-716X>), Manuela Royez-Carenzi [aut] (ORCID: <https://orcid.org/0000-0003-2392-1120>), Hadrien Lorenzo [aut] (ORCID: <https://orcid.org/0000-0002-0253-8539>) |
| Maintainer: | Pierre Pudlo <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 1.0.0 |
| Built: | 2026-05-03 16:01:48 UTC |
| Source: | https://github.com/pierrepudlo/binaryreplicates |
Get credible intervals, Bayesian scores, and prevalence estimate
from the stanfit object returned by BayesianFit.
credint(fit, level = 0.9) bayesian_scoring(ni, si, fit) bayesian_prevalence_estimate(fit)credint(fit, level = 0.9) bayesian_scoring(ni, si, fit) bayesian_prevalence_estimate(fit)
fit |
The |
level |
Posterior probability of the credible intervals |
ni |
Numeric vector of |
si |
Numeric vector of |
See BayesianFit for details on the Bayesian model.
The credint function returns the credible interval bounds for the fixed
parameters of the Bayesian model. The default posterior probability is 90%.
The bayesian_scoring function returns the Bayesian scores.
These scores are the posterior probabilities
that the true latent 's are equal to 1.
The bayesian_prevalence_estimate function returns the posterior mean of the
posterior distribution on the prevalence of .
classify_with_scores, BayesianFit
data("periodontal") theta <- mean(periodontal$ti) fit <- BayesianFit(periodontal$ni, periodontal$si, chains = 2, iter = 5000) credint(fit) Y_B <- bayesian_scoring(periodontal$ni, periodontal$si, fit) T_B <- classify_with_scores(Y_B, .4, .6) theta_B <- bayesian_prevalence_estimate(fit) cat("The Bayesian prevalence estimate is ", theta_B, "\n") cat("The prevalence in the data is ", theta, "\n")data("periodontal") theta <- mean(periodontal$ti) fit <- BayesianFit(periodontal$ni, periodontal$si, chains = 2, iter = 5000) credint(fit) Y_B <- bayesian_scoring(periodontal$ni, periodontal$si, fit) T_B <- classify_with_scores(Y_B, .4, .6) theta_B <- bayesian_prevalence_estimate(fit) cat("The Bayesian prevalence estimate is ", theta_B, "\n") cat("The prevalence in the data is ", theta, "\n")
Fit the Bayesian model for Binary Replicates
BayesianFit( ni, si, prior = list(a_FP = 2, b_FP = 2, a_FN = 2, b_FN = 2, a_T = 0.5, b_T = 0.5), ... )BayesianFit( ni, si, prior = list(a_FP = 2, b_FP = 2, a_FN = 2, b_FN = 2, a_T = 0.5, b_T = 0.5), ... )
ni |
Numeric vector of |
si |
Numeric vector of |
prior |
A list of prior parameters for the model, see Details. |
... |
Arguments passed to |
The model is a Bayesian model for binary replicates. The prior distribution is as follows:
The false positive rate:
The false negative rate:
The prevalence:
The statistical model considers that the true statuses are the latent
And, given the true status , the number of positive replicates is
An object of class stanfit returned by rstan::sampling.
The Stan model samples the posterior distribution of the fixed parameters ,
and . It also generates the latent variables according
to their predictive distribution.
credint, bayesian_scoring, classify_with_scores, bayesian_prevalence_estimate
Classification based on a thresholding of the scores
classify_with_scores(scores, vL, vU)classify_with_scores(scores, vL, vU)
scores |
Numeric vector of the scores, computed with average_scoring, median_scoring, MAP_scoring or bayesian_scoring |
vL |
The lower threshold |
vU |
The upper threshold |
Each decision is taken according to the following rule:
where is the score for individual .
A numeric vector of the classification (where 0.5 = inconclusive)
Note that if is provided, the empirical risk is estimated with
.
classify_with_scores_cvEM(object, ti = NULL, vL = 0.5, vU = 0.5)classify_with_scores_cvEM(object, ti = NULL, vL = 0.5, vU = 0.5)
object |
An object of class cvEM |
ti |
Numeric vector of |
vL |
The lower threshold for classification. Defaults to 0.5. |
vU |
The upper threshold for classification. Defaults to 0.5. |
A cvEM object with the following components:
A list of the predictions for each fold
The empirical risk if is provided
data(periodontal) modelCV <- cvEM(periodontal$ni, periodontal$si) modelCV2 <- classify_with_scores_cvEM(modelCV, vL = 0.4)data(periodontal) modelCV <- cvEM(periodontal$ni, periodontal$si) modelCV2 <- classify_with_scores_cvEM(modelCV, vL = 0.4)
Cross-validation for the EM algorithm
cvEM( ni, si, ti = NULL, N_cv = NULL, N_init = 20, maxIter = 1000, errorMin = 1e-07, prior = list(a_FP = 2, b_FP = 2, a_FN = 2, b_FN = 2) )cvEM( ni, si, ti = NULL, N_cv = NULL, N_init = 20, maxIter = 1000, errorMin = 1e-07, prior = list(a_FP = 2, b_FP = 2, a_FN = 2, b_FN = 2) )
ni |
Numeric vector of |
si |
Numeric vector of |
ti |
Numeric vector of |
N_cv |
The number of folds. Defaults to 20. |
N_init |
The number of initializations if |
maxIter |
The maximum number of iterations if the EM algorithm is used. Defaults to 1e3. |
errorMin |
The minimum error for convergence if the EM algorithm is used. Defaults to 1e-7. |
prior |
A list of prior parameters for the model. |
This function chooses its algorithm according to what is provided in the ti argument:
ti is fully providedThe function computes the Maximum-A-Posteriori estimate, with an explicit formula.
ti is not providedThe function uses the EM algorithm to estimate the parameters.
ti is partially providedThe function uses the EM algorithm to estimate the parameters.
A list with the following components:
A list of the models for each fold
A list of the predictions for each fold
data("periodontal") modelCV <- cvEM(periodontal$ni, periodontal$si)data("periodontal") modelCV <- cvEM(periodontal$ni, periodontal$si)
Compute the Maximum-A-Posteriori estimate with the EM algorithm
EMFit( ni, si, ti = NULL, prior = list(a_FP = 2, b_FP = 2, a_FN = 2, b_FN = 2), N_init = 20, maxIter = 1000, errorMin = 1e-07 )EMFit( ni, si, ti = NULL, prior = list(a_FP = 2, b_FP = 2, a_FN = 2, b_FN = 2), N_init = 20, maxIter = 1000, errorMin = 1e-07 )
ni |
Numeric vector of |
si |
Numeric vector of |
ti |
Numeric vector of |
prior |
A list of prior parameters for the model. The prior distribution is as follows:
|
N_init |
The number of initializations if |
maxIter |
The maximum number of iterations if the EM algorithm is used. Defaults to 1e3. |
errorMin |
The minimum error for convergence if the EM algorithm is used. Defaults to 1e-7. |
This function chooses its algorithm according to what is provided in the ti argument:
ti is fully providedThe function computes the Maximum-A-Posteriori estimate, with an explicit formula.
ti is not providedThe function uses the EM algorithm to estimate the parameters.
ti is partially providedThe function uses the EM algorithm to estimate the parameters.
A list with the following components:
The estimated values of the scores
The estimated values of the parameters , and
data("periodontal") # Get ML estimate knowing the true values of the latent ti's periodontal_ml <- EMFit(periodontal$ni, periodontal$si, periodontal$ti) # Get MAP estimate without knowing the true values of the latent ti's periodontal_EM <- EMFit(periodontal$ni, periodontal$si, ti = NULL)data("periodontal") # Get ML estimate knowing the true values of the latent ti's periodontal_ml <- EMFit(periodontal$ni, periodontal$si, periodontal$ti) # Get MAP estimate without knowing the true values of the latent ti's periodontal_EM <- EMFit(periodontal$ni, periodontal$si, ti = NULL)
Data from a mammography screening program. The dataset is not the original dataset, but an imputed dataset based on the summary statistics available publicly.
data(mammography) data(observed)data(mammography) data(observed)
The data frame mammography has 148 rows and 3 variables
True latent state of the individual (1=positive, 0=negative)
Number of mammography tests performed for each individual
Number of positive mammography tests for each individual
The matrix observed is a 110x148 array with the observed replicates, one column
for each individual. The rows correspond to the replicates, and the values are either 0 or 1.
Beam C, Conant E, Sickles E. Association of volume and volume-independent factors with accuracy in screening mammogram interpretation. JNCI. 2003;95:282-290.
Compute the average-, median- and likelihood-based scores
average_scoring(ni, si) median_scoring(ni, si) likelihood_scoring(ni, si, param) MAP_scoring(ni, si, fit)average_scoring(ni, si) median_scoring(ni, si) likelihood_scoring(ni, si, param) MAP_scoring(ni, si, fit)
ni |
Numeric vector of |
si |
Numeric vector of |
param |
A list with 3 entries:
|
fit |
The object returned by EMFit containing the results of the EM algorithm |
A numeric vector of the scores
For likelihood-based scores, the values of , and
are required. Consequently, likelihood scoring is not directly applicable
in practice without parameter estimates.
data("periodontal") Y_A <- average_scoring(periodontal$ni, periodontal$si) Y_M <- median_scoring(periodontal$ni, periodontal$si) # In order to compute the likelihood-based scores, we need to know theta, # p and q which can be estimated in this example as follows: theta_hat <- mean(periodontal$ti) cat("The prevalence in the data is ", theta_hat, "\n") p_hat <- with(periodontal, sum(si[ti == 0]) / sum(ni[ti == 0])) q_hat <- with(periodontal, 1 - sum(si[ti == 1]) / sum(ni[ti == 1])) Y_L <- likelihood_scoring(periodontal$ni, periodontal$si, list(theta = theta_hat, p = p_hat, q = q_hat)) data("periodontal") Y_M <- median_scoring(periodontal$ni, periodontal$si) data("periodontal") fit <- EMFit(periodontal$ni, periodontal$si) Y_MAP <- MAP_scoring(periodontal$ni, periodontal$si, fit)data("periodontal") Y_A <- average_scoring(periodontal$ni, periodontal$si) Y_M <- median_scoring(periodontal$ni, periodontal$si) # In order to compute the likelihood-based scores, we need to know theta, # p and q which can be estimated in this example as follows: theta_hat <- mean(periodontal$ti) cat("The prevalence in the data is ", theta_hat, "\n") p_hat <- with(periodontal, sum(si[ti == 0]) / sum(ni[ti == 0])) q_hat <- with(periodontal, 1 - sum(si[ti == 1]) / sum(ni[ti == 1])) Y_L <- likelihood_scoring(periodontal$ni, periodontal$si, list(theta = theta_hat, p = p_hat, q = q_hat)) data("periodontal") Y_M <- median_scoring(periodontal$ni, periodontal$si) data("periodontal") fit <- EMFit(periodontal$ni, periodontal$si) Y_MAP <- MAP_scoring(periodontal$ni, periodontal$si, fit)
Data from enzymatic diagnostic tests to detect two organisms Treponema denticola and Bacteroides gingivalis.
data(periodontal)data(periodontal)
A data frame with 50 rows and 3 variables:
Total numbers of sites tested for each individual
Number of positive tests for each individual
Status of the individual (1=infected, 0=non-infected)
Hujoel PP, Moulton LH, Loesche WJ. Estimation of sensitivity and specificity of site-specific diagnostic tests. J Periodontal Res. 1990 Jul;25(4):193-6. doi: 10.1111/j.1600-0765.1990.tb00903.x. Erratum in: J Periodontal Res 1990 Nov;25(6):377. PMID: 2197400. Moore et al. (2013) Genetics 195:1077-1086 (PubMed)
data(periodontal) hat_prevalence <- mean(periodontal$si/periodontal$ni) hat_prevalence # should be compared to: mean(periodontal$ti)data(periodontal) hat_prevalence <- mean(periodontal$si/periodontal$ni) hat_prevalence # should be compared to: mean(periodontal$ti)
Compute predictive Bayesian scores
predict_scores(newdata_ni, newdata_si, fit)predict_scores(newdata_ni, newdata_si, fit)
newdata_ni |
Numeric vector of the total numbers of replicates per individuals |
newdata_si |
Numeric vector of the numbers of positive replicates per individuals |
fit |
The |
The predict_scores function computes the predictive Bayesian scores.
It computes the empirical estimator, for a new individual , of the following integral:
where is the posterior distribution
of the parameters , and given the data
and is given
by the function likelihood_scoring, such as
Thus the estimator is given by
where each parameter is sampled from the
posterior distribution, output of the function BayesianFit. is the total number of sampled parameters.
The predict_scores function returns the predictive Bayesian scores in a numeric vector.
The predictive Bayesian scores are the posterior probabilities that the true
latent 's are equal to 1 on new data, averaged over the posterior distribution.
data("periodontal") theta <- mean(periodontal$ti) fitBay <- BayesianFit(periodontal$ni, periodontal$si, chains = 2, iter = 500) fitMAP <- EMFit(periodontal$ni, periodontal$si) ## Comparison Bayesian <--> MAP ni <- 200 Ni <- rep(ni,ni+1) Si <- 0:ni scores <- cbind(predict_scores(Ni,Si,fitBay), likelihood_scoring(Ni,Si,fitMAP$parameters_hat)) matplot(Si,scores,type = "l",lty = 1,col = 1:2, ylab = "Scores",xlab = "Number of Successes",main = "")data("periodontal") theta <- mean(periodontal$ti) fitBay <- BayesianFit(periodontal$ni, periodontal$si, chains = 2, iter = 500) fitMAP <- EMFit(periodontal$ni, periodontal$si) ## Comparison Bayesian <--> MAP ni <- 200 Ni <- rep(ni,ni+1) Si <- 0:ni scores <- cbind(predict_scores(Ni,Si,fitBay), likelihood_scoring(Ni,Si,fitMAP$parameters_hat)) matplot(Si,scores,type = "l",lty = 1,col = 1:2, ylab = "Scores",xlab = "Number of Successes",main = "")
Compute the average-/median- or MAP-based prevalence estimates based on the scores
prevalence_estimate(scores)prevalence_estimate(scores)
scores |
Numeric vector of the scores, computed with average_scoring, median_scoring or MAP_scoring |
A numeric value of the prevalence estimate
We have shown that the median-based prevalence estimator is better
than the average-based prevalence estimator in terms of bias, except when
the prevalence is in an interval . The length of is small when the
number of replicates is always large.
data("periodontal") theta <- mean(periodontal$ti) Y_A <- average_scoring(periodontal$ni, periodontal$si) Y_M <- median_scoring(periodontal$ni, periodontal$si) fit <- EMFit(periodontal$ni, periodontal$si) Y_MAP <- MAP_scoring(periodontal$ni, periodontal$si, fit) hat_theta_A <- prevalence_estimate(Y_A) hat_theta_M <- prevalence_estimate(Y_M) hat_theta_MAP <- prevalence_estimate(Y_MAP) cat("The average-based prevalence estimate is ", hat_theta_A, "\n") cat("The median-based prevalence estimate is ", hat_theta_M, "\n") cat("The MAP-based prevalence estimate is ", hat_theta_MAP, "\n") cat("The prevalence in the dataset is ", theta, "\n")data("periodontal") theta <- mean(periodontal$ti) Y_A <- average_scoring(periodontal$ni, periodontal$si) Y_M <- median_scoring(periodontal$ni, periodontal$si) fit <- EMFit(periodontal$ni, periodontal$si) Y_MAP <- MAP_scoring(periodontal$ni, periodontal$si, fit) hat_theta_A <- prevalence_estimate(Y_A) hat_theta_M <- prevalence_estimate(Y_M) hat_theta_MAP <- prevalence_estimate(Y_MAP) cat("The average-based prevalence estimate is ", hat_theta_A, "\n") cat("The median-based prevalence estimate is ", hat_theta_M, "\n") cat("The MAP-based prevalence estimate is ", hat_theta_MAP, "\n") cat("The prevalence in the dataset is ", theta, "\n")