Enhance PCGR MSI Classifier With VAF/AD Filtering

Oct 25, 2025 by SLV Team

Introduction

Hey guys! Let's dive into how we can make the MSI (Microsatellite Instability) classifier in PCGR (Personal Cancer Genome Reporter) even better. When we estimate Tumour Mutational Burden (TMB) and MSI status from tumour samples, filtering by Variant Allele Fraction (VAF) or Allelic Depth (AD) is super important. Think of VAF as the fraction of reads at a site that support a particular variant, and AD as the actual number of reads that support it. Getting this right can seriously affect the accuracy of our results. So, what's the issue, and how can we fix it? Let's break it down.

Problem Statement: The Need for Better Filtering

The main problem is that PCGR already has parameters to filter the input variant callset by VAF or depth (tumor_af_min, tumor_dp_min, control_af_min, control_dp_min) plus TMB-specific ones (tmb_af_min, tmb_dp_min), but none of these seem to change the variant callset that the MSI classifier uses. This can be a big deal: if the TMB is artificially high due to variants with low VAF (maybe because of subclonality or just sequencing errors), it throws off the MSI calculations. Specifically, it messes with:

MSI-based TMB calculations: This is different from PCGR's regular calculate_tmb() function.
Fraction of INDELs: Calculated as #INDELs/(#SNVs+#INDELs), which is a significant factor in the MSI classifier.

Both of these features are heavily weighted in the MSI classifier, so any inaccuracies here can lead to wrong calls. Imagine you're trying to diagnose a patient, and the results are skewed because of some low-quality variants. Not ideal, right?
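
To make the INDEL-fraction point concrete, here is a small worked example in Python. The counts are invented purely for illustration and the helper name indel_fraction is hypothetical (it is not a PCGR function); it only applies the #INDELs/(#SNVs+#INDELs) formula from above.

```python
# Toy counts, invented for illustration only (not real data).

def indel_fraction(n_snvs: int, n_indels: int) -> float:
    """Fraction of INDELs among all calls: #INDELs / (#SNVs + #INDELs)."""
    total = n_snvs + n_indels
    # Guard against an empty callset to avoid dividing by zero.
    return n_indels / total if total else 0.0

# Unfiltered callset: 80 SNVs and 20 INDELs.
unfiltered = indel_fraction(80, 20)

# Suppose 5 of the SNVs and 15 of the INDELs fall below the VAF cut-off.
filtered = indel_fraction(80 - 5, 20 - 15)

print(f"unfiltered: {unfiltered:.4f}")
print(f"filtered:   {filtered:.4f}")
```

In this toy case the INDEL fraction drops from 0.20 to 0.0625 once the low-VAF calls are removed, which is the kind of shift that could move a borderline sample across a classification boundary.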

Proposed Solution: Implementing VAF/AD Filters for the MSI Classifier

So, what's the solution? Simple! We need to add parameters for minimum VAF and AD to filter the variant callset that goes into the MSI classifier. It should work similarly to how calculate_tmb() in pcgr/variant.py handles filtering. These new filters should also play nice with the global minimums set by the existing parameters like tumor_af_min, tumor_dp_min, control_af_min, and control_dp_min. Here’s a more detailed breakdown:

New Parameters: Introduce new parameters specifically for the MSI classifier, such as msi_af_min and msi_dp_min. These let users set the minimum VAF and the minimum read depth for variants considered by the MSI classifier.
Integration with Existing Filters: Ensure that the new MSI-specific filters are resolved against the existing global minimums (tumor_af_min, tumor_dp_min, control_af_min, control_dp_min). This means that the MSI classifier will only consider variants that meet both the global and MSI-specific VAF/AD criteria.
Code Modification: Modify the MSI classifier code to incorporate these filters. This involves updating the variant parsing logic to apply the VAF/AD thresholds before any MSI-related calculations are performed.
Testing and Validation: Thoroughly test the updated MSI classifier with various datasets to ensure that the new filters improve the accuracy of MSI calls. Compare the results with and without the filters to quantify the impact.

By implementing these changes, we keep low-VAF and low-depth variants out of the MSI classifier, which should lead to more accurate and reliable results. This is crucial for making informed clinical decisions based on PCGR reports.
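
One way to read "resolved against the global minimums" is that the stricter of the two thresholds wins. The sketch below assumes exactly that; resolve_min is a hypothetical helper, the values are made up, and msi_af_min / msi_dp_min are the proposed parameters, not existing PCGR options. The control-side parameters are left out to keep it short.

```python
def resolve_min(global_min: float, msi_min: float) -> float:
    """Effective threshold: the stricter (larger) of the two minimums."""
    return max(global_min, msi_min)

# Hypothetical settings, invented for illustration.
settings = {
    "tumor_af_min": 0.05,
    "tumor_dp_min": 10,
    "msi_af_min": 0.10,  # proposed parameter (assumption)
    "msi_dp_min": 0,     # proposed parameter (assumption), left at "no filter"
}

effective_af = resolve_min(settings["tumor_af_min"], settings["msi_af_min"])
effective_dp = resolve_min(settings["tumor_dp_min"], settings["msi_dp_min"])

print(effective_af, effective_dp)
```

With these values the MSI classifier would see variants with VAF of at least 0.10 and depth of at least 10, so an MSI-specific setting can tighten the global filter but never loosen it.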

Why This Matters: Benefits of Improved Filtering

Adding these filters is a game-changer for several reasons:

More Accurate MSI Calls: By filtering out low-quality variants, we reduce the noise in the data and get a clearer signal for MSI status. This means fewer false positives and more confidence in the results. The increased accuracy is especially important in clinical settings, where treatment decisions are based on these results.
Better TMB Estimates: When the MSI classifier uses a cleaner variant callset, the TMB estimates become more reliable. This is particularly important for MSI-based TMB calculations, which are distinct from PCGR's standard TMB calculation method. Accurate TMB estimates are crucial for identifying patients who may benefit from immunotherapy.
Improved INDEL Fraction Calculation: The fraction of INDELs among all calls (#INDELs/(#SNVs+#INDELs)) is a key feature in the MSI classifier. By filtering out low-VAF variants, we get a fraction that better reflects the INDELs actually present in the tumour, leading to better MSI classification.
Enhanced Clinical Utility: Ultimately, better MSI calls and TMB estimates translate to improved clinical utility. Clinicians can make more informed decisions about treatment strategies, leading to better patient outcomes. The overall goal is to provide the best possible information to guide clinical decision-making.

Alternative Considered: Pre-filtering Variant Callsets

Now, you might be thinking, "Why not just pre-filter the variant callsets before running PCGR?" That's a valid point! Pre-filtering is definitely an option, but it has its downsides. If you pre-filter, you're essentially removing variants that might be interesting or relevant in other parts of the report. For example, a variant with low VAF might still be important for understanding the overall genomic landscape of the tumour, even if it's not directly relevant to MSI status. The advantage of filtering within the MSI classifier is that it allows us to selectively filter variants for MSI analysis while still retaining them for other analyses.

Flexibility: Filtering within the MSI classifier provides more flexibility, as it allows you to apply different filtering criteria for different analyses.
Comprehensive Reporting: By retaining all variants in the initial callset, you can generate a more comprehensive report that includes all potentially relevant information.
Reduced Data Loss: Pre-filtering can lead to the loss of valuable data that might be relevant for other analyses.
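
The "filter inside the classifier" idea can be sketched as analysis-specific views over one untouched callset. Everything here is illustrative: the variant records, the thresholds table, and view_for are hypothetical and do not mirror PCGR's internal data structures.

```python
from typing import Dict, List

Variant = Dict[str, float]

# Toy callset, invented for illustration only.
callset: List[Variant] = [
    {"af": 0.03, "dp": 40},
    {"af": 0.12, "dp": 25},
    {"af": 0.45, "dp": 8},
]

# Hypothetical per-analysis thresholds: the report keeps everything,
# the MSI classifier applies its own stricter cut-offs.
thresholds = {
    "report": {"af_min": 0.0, "dp_min": 0},
    "msi": {"af_min": 0.10, "dp_min": 10},
}

def view_for(analysis: str) -> List[Variant]:
    """Return a filtered copy of the callset; the callset itself is not modified."""
    t = thresholds[analysis]
    return [v for v in callset if v["af"] >= t["af_min"] and v["dp"] >= t["dp_min"]]

report_variants = view_for("report")
msi_variants = view_for("msi")

print(len(callset), len(report_variants), len(msi_variants))
```

The low-VAF and low-depth variants stay available to the rest of the report, while only the middle variant reaches the MSI view.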

Real-World Context: Comparing PCGR with MSIsensor-pro

To really drive home the importance of this, let's talk about a real-world comparison. In a comparison of PCGR-based MSI calls against MSIsensor-pro, some samples that PCGR called MSI-high came out as microsatellite stable (MSS) in MSIsensor-pro. This suggests that PCGR might be a bit too sensitive to low-quality variants, leading to false positives. By implementing VAF/AD filtering in the MSI classifier, we can bring PCGR's results more in line with other MSI detection methods, improving the overall reliability of the tool. Keep in mind that the two tools look at different evidence: MSIsensor-pro scores instability from the sequencing reads at microsatellite sites, while PCGR's classifier works from features of the variant callset, so the quality of that callset matters a lot.

Benchmarking: Comparing PCGR with other MSI detection methods like MSIsensor-pro helps identify areas for improvement.
Calibration: Implementing VAF/AD filtering can help calibrate PCGR's MSI calls to be more consistent with other methods.
Confidence: More consistent results across different MSI detection methods increase confidence in the overall findings.
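
A benchmarking run like this boils down to a cross-tabulation of per-sample calls. Here is a minimal sketch; the sample IDs and labels are placeholders invented for illustration (not results from any real comparison), and it assumes both tools' outputs have already been reduced to an MSI-H / MSS label per sample.

```python
from collections import Counter
from typing import Dict, Tuple

def concordance(calls_a: Dict[str, str], calls_b: Dict[str, str]) -> Tuple[Counter, float]:
    """Cross-tabulate two sets of per-sample calls and return the agreement rate."""
    # Only samples called by both tools can be compared.
    shared = sorted(set(calls_a) & set(calls_b))
    table = Counter((calls_a[s], calls_b[s]) for s in shared)
    agree = sum(n for (a, b), n in table.items() if a == b)
    rate = agree / len(shared) if shared else float("nan")
    return table, rate

# Placeholder labels, invented for illustration only.
pcgr_calls = {"S1": "MSI-H", "S2": "MSI-H", "S3": "MSS", "S4": "MSS"}
msisensor_calls = {"S1": "MSI-H", "S2": "MSS", "S3": "MSS", "S4": "MSS"}

table, rate = concordance(pcgr_calls, msisensor_calls)
for (a, b), n in sorted(table.items()):
    print(f"PCGR={a:5s} MSIsensor-pro={b:5s} n={n}")
print(f"agreement: {rate:.2f}")
```

Running the same tabulation before and after enabling the filters would show whether the discordant MSI-H/MSS cell shrinks.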

Step-by-Step Implementation Plan

To successfully implement this enhancement, we can follow a structured approach. This ensures that the changes are well-integrated, thoroughly tested, and aligned with the existing PCGR framework. Here's a step-by-step plan:

1. Define New Parameters

Start by defining the new parameters in the PCGR configuration. This includes:

msi_af_min: Minimum allele fraction for variants used in the MSI classifier.
msi_dp_min: Minimum read depth for variants used in the MSI classifier.

These parameters should be clearly documented and explained in the PCGR documentation.
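
As a rough idea of what the parameter definitions could look like, here is a standalone argparse sketch. The flag names follow the proposal above, but the defaults, help texts, and validation helpers are assumptions; this is not PCGR's actual argument parser.

```python
import argparse

def unit_interval(value: str) -> float:
    """Parse an allele fraction and require it to lie in [0, 1]."""
    x = float(value)
    if not 0.0 <= x <= 1.0:
        raise argparse.ArgumentTypeError(f"{value} is not in [0, 1]")
    return x

def non_negative_int(value: str) -> int:
    """Parse a read depth and require it to be zero or larger."""
    n = int(value)
    if n < 0:
        raise argparse.ArgumentTypeError(f"{value} is negative")
    return n

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sketch of MSI-specific filter options")
    # Defaults of 0 mean "no additional filtering" (an assumption of this sketch).
    parser.add_argument("--msi_af_min", type=unit_interval, default=0.0,
                        help="Minimum allele fraction for variants used in the MSI classifier")
    parser.add_argument("--msi_dp_min", type=non_negative_int, default=0,
                        help="Minimum read depth for variants used in the MSI classifier")
    return parser

args = build_parser().parse_args(["--msi_af_min", "0.1", "--msi_dp_min", "20"])
print(args.msi_af_min, args.msi_dp_min)
```

Defaulting both options to 0 would keep existing behaviour unchanged for users who never set them.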

2. Modify Variant Parsing Logic

Update the variant parsing logic in the PCGR code to incorporate the new VAF/AD filters. This involves:

Adding conditional statements to filter variants based on msi_af_min and msi_dp_min.
Ensuring that these filters are applied after the global minimums (tumor_af_min, tumor_dp_min, control_af_min, control_dp_min) are applied.
Optimizing the code for performance to minimize the impact on processing time.
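
The filtering step itself could be sketched like this. The record layout, passes, and filter_for_msi are hypothetical, and dropping variants with a missing VAF or depth is a policy choice made for this sketch, not something PCGR is known to do.

```python
from typing import Iterable, List, Optional, TypedDict

class Variant(TypedDict):
    af: Optional[float]  # variant allele fraction, None if not annotated
    dp: Optional[int]    # read depth at the site, None if not annotated

def passes(v: Variant, af_min: float, dp_min: int) -> bool:
    """True if the variant meets both minimums; missing values fail (assumption)."""
    if v["af"] is None or v["dp"] is None:
        return False
    return v["af"] >= af_min and v["dp"] >= dp_min

def filter_for_msi(variants: Iterable[Variant],
                   tumor_af_min: float, tumor_dp_min: int,
                   msi_af_min: float, msi_dp_min: int) -> List[Variant]:
    # Stage 1: global minimums, as for the rest of the report.
    globally_kept = [v for v in variants if passes(v, tumor_af_min, tumor_dp_min)]
    # Stage 2: MSI-specific minimums, applied after the global ones.
    return [v for v in globally_kept if passes(v, msi_af_min, msi_dp_min)]

# Toy records, invented for illustration only.
variants: List[Variant] = [
    {"af": 0.04, "dp": 50},   # fails the global VAF minimum
    {"af": 0.20, "dp": 30},   # passes both stages
    {"af": None, "dp": 30},   # missing VAF, dropped
    {"af": 0.30, "dp": 12},   # passes global, fails the MSI depth minimum
]

kept = filter_for_msi(variants, tumor_af_min=0.05, tumor_dp_min=10,
                      msi_af_min=0.10, msi_dp_min=20)
print(kept)
```

Both stages are single passes over the list, so the extra cost on processing time should be small compared with annotation.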

3. Update MSI Classifier Code

Modify the MSI classifier code to use the filtered variant callset. This includes:

Updating the input to the MSI classifier to be the filtered variant list.
Adjusting any calculations or thresholds that may be affected by the filtered data.
Ensuring that the MSI classifier still produces accurate and reliable results with the filtered data.

4. Implement Comprehensive Testing

Conduct thorough testing to validate the new filters and ensure that they improve the accuracy of MSI calls. This includes:

Testing with diverse datasets that include both MSI-high and MSS samples.
Comparing the results with and without the filters to quantify the impact on MSI calls.
Benchmarking against other MSI detection methods like MSIsensor-pro to ensure consistency.

5. Document Changes

Update the PCGR documentation to reflect the changes. This includes:

Documenting the new parameters and their usage.
Explaining the impact of the filters on MSI calls.
Providing examples of how to use the filters in different scenarios.

6. Release and Monitor

Release the updated PCGR version with the new filters and monitor its performance in real-world scenarios. This includes:

Tracking the accuracy of MSI calls over time.
Gathering feedback from users to identify any issues or areas for improvement.
Continuously optimizing the filters and the MSI classifier to improve its performance.

Conclusion

So, there you have it! By adding VAF/AD filtering to the MSI classifier in PCGR, we should be able to improve the accuracy of MSI calls and of the TMB estimates that feed into them. This not only enhances the clinical utility of PCGR but also brings it more in line with other MSI detection methods. Let's get to work and make PCGR even better!