
1. Overall Workflow

The overall workflow of the package is intentionally straightforward (we assume you have pre-processed your data but have not calibrated it yet):
1. Build an xgboost model (classification based on your outcome label)
2. Estimate feature importance (based on the xgboost model and SHAP analysis)
3. Extract cutoffs (used by the xgboost model)
4. Calibrate data (with the cutoffs extracted in Step 3)
5. Necessity analysis for individual conditions (using the calibrated data)
6. Form condition-ranking table (by tallying feature importance and necessity metrics)
7. Iterative QCA for sufficiency analysis (by choosing k explanatory conditions from a given pool of n conditions)

## 1. Build an xgboost model
xgb <- buildXGboost(voteData, "vote")

## 2. Estimate feature importance
fMetricModel <- getFeatureImp(xgb, voteData, "vote")

## 3. Extract cutoffs
cutDf <- getXGboostCut(xgb, full = F)

## 4. Calibrate data
voteDataCalibrated <- reCalibrateXGCut(data = voteData, cutoffs = cutDf, outcome = "vote", na.rm = F)

## 5. Necessity analysis for individual conditions
fMetricQCA <- getQCAMetric(voteDataCalibrated, "vote")

## 6. Form condition-ranking table
fRankTable <- joinMetrics(fMetricModel, fMetricQCA)

## 7. Iterative QCA for sufficiency analysis
selectFeatures <- head(fMetricModel$feature, 10)
Res <- iterQCA(features = selectFeatures, k=4, dataCali = voteDataCalibrated, outcome = "vote", incl.cut = 0.8)

2. Detailed Steps

2.1 Build an xgboost model

We assume you have pre-processed your data (including data cleaning, renaming, filtering, etc.), and we start by building an xgboost model. Specifically, we build a tree ensemble classification model to classify each observation into one of your outcome categories.

In our example data voteData, our outcome condition is "vote", indicating whether the interviewee voted or not. Each row is an interviewee, and the features are various aspects of the interviewee (e.g. education level, parent education level, etc.). Refer to the codebook for the feature names.

dim(voteData) # 427 interviewees; 77 columns: 76 features plus the outcome "vote"
#> [1] 427  77
table(voteData$vote) # 319 voted; 108 did not vote;
#> 
#>   0   1 
#> 108 319

We then build an xgboost model to predict the "vote" outcome. Under the hood, the model is built using the caret and xgboost packages. To see details of the default parameters used, call ?buildXGboost.

## this can take a few minutes
set.seed(201) ## ensure reproducibility
xgb <- buildXGboost(voteData, "vote")

We can estimate the accuracy of the model to get an idea of how well it differentiates the voters from the non-voters. Note that our goal here is not to predict new data, but rather to understand the data at hand. Even an accuracy that is modest by general classification standards may be enough to show that the given features can tell the voters and non-voters apart.

getAccXGboost(xgb, voteData, "vote") 
#> [1] 1
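
Beyond the single accuracy number, a quick confusion matrix can show where any misclassifications fall. A minimal sketch, assuming xgb behaves like a caret train object (the model is built with caret and xgboost under the hood) whose predict method returns class labels:

## hedged sketch: cross-tabulate predicted vs. observed classes
## (assumes predict() on the caret-trained model returns class labels)
pred <- predict(xgb, newdata = voteData)
table(predicted = pred, observed = voteData$vote)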

2.2 Estimate feature importance

We next use the built xgboost model to estimate the relative importance of each feature in differentiating the voters from the non-voters. More specifically, we look at four metrics: three extracted directly from the xgboost tree ensemble (gain, cover, frequency) and one additional calculated measure (Shapley value). Briefly, the meanings of the four metrics are:

  • Gain: improvement in accuracy brought by the feature to the branches it is on
  • Cover: relative quantity of observations concerned by a feature
  • Frequency: the relative number of times a feature is used across all generated trees
  • SHAP value (average): the average contribution of the feature to the prediction of the model

Detailed explanations of these measures, particularly Shapley values, are beyond the scope of this vignette, but many easily searchable online sources explain them well.
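
For the curious, the sketch below shows roughly how such metrics could be pulled straight from the booster; getFeatureImp wraps this kind of work for you. The sketch assumes the underlying xgboost booster is stored in xgb$finalModel (as is typical for caret-trained models) and is illustrative only, not the exact implementation.

## hedged sketch (assumes xgb$finalModel holds the underlying xgboost booster)
booster <- xgb$finalModel
xgboost::xgb.importance(model = booster)  # per-feature Gain, Cover, Frequency
## per-observation SHAP contributions could then be obtained with, e.g.,
## predict(booster, newdata = featureMatrix, predcontrib = TRUE),
## where featureMatrix is a numeric matrix of the predictors (hypothetical name)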

fMetricModel <- getFeatureImp(xgb, voteData, "vote")
head(fMetricModel)
#>   feature       Gain      Cover  Frequency      shap
#> 1    V648 0.03230976 0.03431853 0.02008457 0.3786536
#> 2  P_V195 0.04122095 0.04229647 0.05073996 0.3667945
#> 3  P_V530 0.04155585 0.04892724 0.01479915 0.3489758
#> 4    V191 0.01959714 0.03495970 0.01374207 0.3418239
#> 5    V207 0.04996828 0.04374676 0.04228330 0.3047783
#> 6   P_V44 0.01707990 0.01998959 0.01374207 0.2811981

2.3 Extract cutoffs

Since an xgboost tree ensemble is a collection of individual tree structures, we can extract and summarize the cutoffs used across them and leverage these cutoffs to calibrate our data. The default behavior is to look at each feature's cutoffs across all trees and choose the most frequently used one. Admittedly, these might not be the most appropriate cutoffs depending on the context, although we have shown that, among all possible cutoffs, they produce QCA solutions ranking in the top 10%. Still, we encourage users to treat them as guidance, use prior/domain knowledge to evaluate their relevance and appropriateness, and modify them accordingly.

Another potential caveat of this approach is "missing features". When constructing the model, there is certain built-in randomness (in selecting features for each tree), and consequently some features (usually the less informative ones) may not be used. This is normally okay, as the "missing features" are very likely to be uninformative and thus least useful for downstream QCA analysis. However, if certain features are still desired, we can manually add them to the cutoff table and set a cutoff for them. Alternatively, we can set a different seed and retrain the xgboost model to see whether those features get selected.

cutDf <- getXGboostCut(xgb, full = F)
head(cutDf)
#> # A tibble: 6 × 2
#>   feature cutoff
#>   <chr>    <dbl>
#> 1 P_V136     3.5
#> 2 P_V168     2.5
#> 3 P_V185     0.5
#> 4 P_V195    52.5
#> 5 P_V242     3.5
#> 6 P_V243     3.5
nrow(cutDf) # 73 of the 76 features selected; 3 "missing" features
#> [1] 73
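
If prior knowledge suggests a different threshold, or if one of the "missing" features is still needed, the cutoff table is just a regular table and can be edited before calibration. A minimal sketch (the overridden value and the added feature name are purely illustrative):

## hedged sketch: override an existing cutoff and append a "missing" feature
## (the value 50 and the name "V_missing" are illustrative placeholders)
cutDfEdited <- cutDf
cutDfEdited$cutoff[cutDfEdited$feature == "P_V195"] <- 50
cutDfEdited <- rbind(cutDfEdited, data.frame(feature = "V_missing", cutoff = 1.5))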

2.4 Calibrate data

When building the xgboost model, we neither require nor encourage the input data to be calibrated. Keeping the raw values preserves the maximal detail/information that can be instrumental for classification. However, to proceed with downstream QCA tasks, we have to calibrate the data. Since we extracted the cutoffs above, we can use them directly for calibration. Note that the current version only supports crisp-set calibration; fuzzy-set calibration will be supported in future versions.

## before calibration
unique(voteData$V188)
#>  [1] 16 11 10 14  5 18  8  7  6  4 25  3  0  2 15 30 13 12 27 20  9 22 24  1 35
#> [26] 28 17 23
## calibrate
voteDataCalibrated <- reCalibrateXGCut(data = voteData, cutoffs = cutDf, outcome = "vote", na.rm = F)
## after calibration
unique(voteDataCalibrated$V188)
#> [1] 1 0
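
For intuition, crisp calibration against a single cutoff is essentially a threshold comparison. A rough equivalent for one feature is sketched below; the exact comparison direction and missing-value handling inside reCalibrateXGCut may differ:

## hedged sketch: roughly what crisp calibration of one feature amounts to
cut188 <- cutDf$cutoff[cutDf$feature == "V188"]
head(as.integer(voteData$V188 > cut188))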

2.5 Necessity analysis for individual conditions

Now that we have the (re)calibrated data, we can perform a QCA necessity test to get an idea of the necessity relationship between each explanatory condition and the outcome.

fMetricQCA <- getQCAMetric(voteDataCalibrated, "vote")
head(fMetricQCA)
#>        Cons.Nec Cov.Nec   RoN feature
#> P_V136    0.574   0.769 0.775  P_V136
#> P_V168    0.473   0.737 0.804  P_V168
#> P_V185    0.429   0.721 0.817  P_V185
#> P_V195    0.411   0.780 0.875  P_V195
#> P_V242    0.545   0.753 0.775  P_V242
#> P_V243    0.473   0.702 0.768  P_V243
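
For the crisp-set case, the consistency and coverage of necessity reported above follow the standard QCA definitions and can be reproduced by hand; a minimal sketch for a single condition (exact values may differ slightly depending on missing-value handling):

## hedged sketch: crisp-set necessity consistency and coverage for one condition
X <- voteDataCalibrated$P_V136
Y <- voteDataCalibrated$vote
sum(X == 1 & Y == 1, na.rm = TRUE) / sum(Y == 1, na.rm = TRUE)  # Cons.Nec: outcome cases also showing the condition
sum(X == 1 & Y == 1, na.rm = TRUE) / sum(X == 1, na.rm = TRUE)  # Cov.Nec: condition cases also showing the outcome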

2.6 Form condition-ranking table

We have constructed per-feature metrics from both the xgboost model and the QCA necessity tests, and we can combine them into a single feature-importance table and save it for later reference. Note that the final ranking fRank is still based on the "shap" value, which best captures what the xgboost model has learned.

fRankTable <- joinMetrics(fMetricModel, fMetricQCA)
head(fRankTable)
#>   feature       Gain      Cover  Frequency      shap Cons.Nec Cov.Nec   RoN
#> 1    V648 0.03230976 0.03431853 0.02008457 0.3786536   0.7490  0.8100 0.702
#> 2  P_V195 0.04122095 0.04229647 0.05073996 0.3667945   0.4110  0.7800 0.875
#> 3  P_V530 0.04155585 0.04892724 0.01479915 0.3489758   0.0846  0.4737 0.925
#> 4    V191 0.01959714 0.03495970 0.01374207 0.3418239   0.2100  0.8590 0.969
#> 5    V207 0.04996828 0.04374676 0.04228330 0.3047783   0.3790  0.8580 0.935
#> 6   P_V44 0.01707990 0.01998959 0.01374207 0.2811981   0.3860  0.8260 0.914
#>   fRank
#> 1     1
#> 2     2
#> 3     3
#> 4     4
#> 5     5
#> 6     6
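
The table can then be saved in whatever format you prefer for later reference, for example:

## save the condition-ranking table for later reference
write.csv(fRankTable, "fRankTable.csv", row.names = FALSE)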

2.7 Iterative QCA

The condition ranking is great for reference, and we have shown that the top features on average provide much better QCA solutions than the bottom ones. However, even among the top 10, we normally want to choose only 3 to 4 features to analyze with QCA. This presents a problem, as we do not know which 3 or 4 of the 10 features might give us the best solution. To address this, we perform iterative QCA sufficiency tests and summarize the results.

Specifically, as an example, let's choose the top 10 features from the feature-ranking table as our "pool" of features. We then iteratively choose 4 out of the 10 features for QCA analysis, exhaustively examining all "10-choose-4" combinations. The resulting table includes each particular combination of 4 features, the corresponding solution, and the metric scores for that solution. In addition, the results are ranked based on coverage as a guide. Users are then encouraged to examine this table to determine the best features to use. Note that the features to choose from are determined manually by the user, and they need not be the top features from the ranking table.
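
With a pool of 10 features and k = 4, this exhaustive search amounts to choose(10, 4) = 210 separate QCA runs, which is why the call below can take a few minutes.

## number of feature combinations examined with a pool of 10 and k = 4
choose(10, 4)
#> [1] 210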

selectFeatures <- head(fMetricModel$feature, 10)
## this can take a few minutes
Res <- iterQCA(features = selectFeatures, k=4, dataCali = voteDataCalibrated, outcome = "vote", incl.cut = 0.8)
head(Res)
#>                    features                                   path     inclS
#> 1   V648,P_V195,P_V530,V191      V648*~P_V530+~P_V195*~P_V530*V191 0.8380282
#> 2 P_V195,P_V530,P_V44,P_V59           ~P_V530*P_V44+~P_V530*~P_V59 0.8292683
#> 3  P_V195,P_V530,V189,P_V59     ~P_V530*~P_V59+P_V195*~P_V530*V189 0.8233216
#> 4   P_V195,P_V530,V141,V204 ~P_V195*~P_V530*V141+~P_V530*V141*V204 0.8376384
#> 5    V648,P_V530,P_V44,V141   V648*~P_V530*V141+~P_V530*P_V44*V141 0.8401487
#> 6   V648,P_V195,P_V530,V189                           V648*~P_V530 0.8383459
#>         PRI     convS rank
#> 1 0.8380282 0.7460815    1
#> 2 0.8292683 0.7460815    2
#> 3 0.8233216 0.7304075    3
#> 4 0.8376384 0.7115987    4
#> 5 0.8401487 0.7084639    5
#> 6 0.8383459 0.6990596    6
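
Since Res is a regular data frame, it is easy to slice when examining the results, for instance to keep only the solutions that involve a particular condition (a small illustration, assuming the feature combinations are stored as comma-separated strings as printed above):

## hedged sketch: keep only solutions whose feature set includes V648
head(Res[grepl("V648", Res$features), ])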