PhD Scientific Days 2023

Budapest, 22-23 June 2023

Pathology - Posters D

Computational Strategies to Increase Confidence in Variant Peptide Identifications in Proteomic Datasets

Beáta Szeitz1, Nicole Woldmar2,3, Zoltán G. Páhi4, Zsolt Horvath2, Fábio C.S. Nogueira3, Lazaro H. Betancourt2, Tibor Pankotai4, David Fenyö5, György Marko-Varga2, A. Marcell Szász1, Melinda Rezeli2, Peter L. Horvatovich6
1 Semmelweis University, Budapest, Hungary
2 Lund University, Lund, Sweden
3 Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
4 University of Szeged, Hungarian Centre of Excellence for Molecular Medicine (HCEMM), Szeged, Hungary
5 New York University, New York, USA
6 University of Groningen, Groningen, The Netherlands


Introduction: Mutations are important drivers of cancer, so identifying proteotypic peptides that carry sequence variants by mass spectrometry (MS)-based proteomics has become a relevant part of cancer research. However, identifying such peptides, e.g. those carrying single amino acid variants (SAAVs), is challenging because they are highly similar in sequence to canonical peptides, which increases the risk of false-positive identifications.
Aims: Our goal was to develop and test a computational workflow that reduces false identifications of SAAV peptides in proteomic datasets.
Methods: We built a bioinformatic workflow consisting of DIA-Umpire for pseudo-spectra extraction from MS data collected in data-independent acquisition (DIA) mode, MSFragger for database search, and Percolator to control the false discovery rate. This was followed by the SpectrumAI, PepQuery and MS2PIP tools, connected via in-house R scripts, to remove low-quality peptide-spectrum matches (PSMs) of SAAV peptides. We tested the workflow by searching for SAAV peptides in a proteomic dataset of 26 human small cell lung cancer (SCLC) cell lines. Prior to the database search, we built a sequence database containing canonical protein sequences as well as SAAVs previously described in the genomes of SCLC cell lines by the Cancer Cell Line Encyclopedia. We annotated each SAAV PSM according to whether it passed validation by SpectrumAI, PepQuery and MS2PIP, and whether it was supported by the genomic data of the given cell line.
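The annotation step described above can be illustrated with a minimal Python sketch. It is not the authors' in-house R code: the `SaavPsm` record, its field names and the `is_validated` helper are hypothetical, and the boolean flags stand in for the outputs of SpectrumAI and PepQuery rather than calling those tools. The default cutoff of 0.75 is the Pearson threshold quoted in the Results.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Sequence


@dataclass
class SaavPsm:
    """One SAAV PSM with its validation annotations (hypothetical layout)."""
    peptide: str
    spectrumai_pass: bool          # assumed: outcome reported by SpectrumAI
    pepquery_pass: bool            # assumed: outcome reported by PepQuery
    predicted: Sequence[float]     # MS2PIP-predicted fragment intensities
    observed: Sequence[float]      # matched experimental fragment intensities
    genomics_supported: bool       # variant present in the cell line's genomic data


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length intensity vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # Correlation is undefined for a constant vector; NaN fails any cutoff.
        return float("nan")
    return sxy / sqrt(sxx * syy)


def is_validated(psm: SaavPsm, min_r: float = 0.75) -> bool:
    """True only if the PSM passes all three quality checks."""
    return (
        psm.spectrumai_pass
        and psm.pepquery_pass
        and pearson(psm.predicted, psm.observed) > min_r
    )
```

Keeping `genomics_supported` as a separate annotation rather than a filter criterion mirrors the abstract, where genomic support is used to judge the validated set, not to select it.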
Results: A total of 2828 PSMs of 581 unique SAAV peptides were identified, of which 574 PSMs (20.3%) were supported by genomics. Only 213 PSMs of 114 unique SAAV peptides passed both SpectrumAI and PepQuery validation and showed a Pearson correlation coefficient > 0.75 between the MS2PIP-predicted and experimental spectra. Among these validated PSMs, however, 189 (88.7%) were supported by genomic evidence.
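The two percentages above follow directly from the reported PSM counts. A short Python check, in which the `percent` helper and the variable names are illustrative only, reproduces them:

```python
def percent(part: int, whole: int) -> float:
    """Share of `part` in `whole`, as a percentage rounded to one decimal."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return round(100.0 * part / whole, 1)


# Counts taken from the Results paragraph.
all_psms, all_supported = 2828, 574
validated_psms, validated_supported = 213, 189

support_before = percent(all_supported, all_psms)             # 20.3
support_after = percent(validated_supported, validated_psms)  # 88.7

print(f"genomics-supported before validation: {support_before}%")
print(f"genomics-supported after validation:  {support_after}%")
```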
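Conclusion: Our results demonstrate that thorough quality control of SAAV PSMs in proteomic datasets can substantially reduce the rate of false positives: the share of PSMs supported by genomic evidence rose from 20.3% before validation to 88.7% after it.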
Funding: This study was supported by the ÚNKP-22-3-II New National Excellence Program of the Ministry for Culture and Innovation from the source of the National Research, Development and Innovation Fund, and the Semmelweis 250+ Excellence PhD Scholarship.