The question of whether it is appropriate to attribute authorship to deceased individuals of original studies in the biomedical literature is contentious. Authorship guidelines used by journals do not provide a clear consensus framework that is binding on those in the field. To guide and inform the implementation of authorship frameworks it would be useful to understand the extent of the practice in the scientific literature, but no study to date has systematically quantified the prevalence of this phenomenon in the biomedical literature. To address this issue, we quantified the prevalence of publications by deceased authors in the biomedical literature from the period 1990–2020. We screened 2,601,457 peer-reviewed papers from the full text Europe PubMed Central database. We applied natural language processing, stringent filtering and manual curation to identify a final set of 1,439 deceased authors. We then determined that these authors published a total of 38,907 papers over their careers, with 5,477 published after death. The number of deceased publications has been growing rapidly, a 146-fold increase since the year 2000. This rate of increase was still significant when accounting for the growing total number of publications and pool of authors. We found that more than 50% of deceased author papers were first submitted after the death of the author and that over 60% of these papers failed to acknowledge the deceased author's status. Most deceased authors published fewer than 10 papers after death but a small pool of 30 authors published significantly more. A pool of 266 authors published more than 90% of their total publications after death. Our analysis indicates that the attribution of deceased authorship in the literature is not an occasional occurrence but a burgeoning trend. A consensus framework to address authorship by deceased scientists is warranted.
Carriers of germline biallelic pathogenic variants in the MUTYH gene have a high risk of colorectal cancer. We test 5649 colorectal cancers to evaluate the discriminatory potential of a tumor mutational signature specific to MUTYH for identifying biallelic carriers and classifying variants of uncertain clinical significance (VUS). Using a tumor and matched germline targeted multi-gene panel approach, our classifier identifies all biallelic MUTYH carriers and all known non-carriers in an independent test set of 3019 colorectal cancers (accuracy = 100% (95% confidence interval 99.87–100%)). All monoallelic MUTYH carriers are classified with the non-MUTYH carriers. The classifier provides evidence for a pathogenic classification for two VUS and a benign classification for five VUS. Somatic hotspot mutations KRAS p.G12C and PIK3CA p.Q546K are associated with colorectal cancers from biallelic MUTYH carriers compared with non-carriers (p = 2 × 10⁻²³ and p = 6 × 10⁻¹¹, respectively). Here, we demonstrate the potential application of mutational signatures to tumor sequencing workflows to improve the identification of biallelic MUTYH carriers.
Germline variants explain more than a third of prostate cancer (PrCa) risk, but very few associations have been identified between heritable factors and clinical progression.
To find rare germline variants that predict time to biochemical recurrence (BCR) after radical treatment in men with PrCa and understand the genetic factors associated with such progression.
Design, setting, and participants
Whole-genome sequencing data from blood DNA were analysed for 850 PrCa patients with radical treatment from the Pan Prostate Cancer Group (PPCG) consortium from the UK, Canada, Germany, Australia, and France. Findings were validated using 383 patients from The Cancer Genome Atlas (TCGA) dataset.
Outcome measurements and statistical analysis
A total of 15,822 rare (MAF <1%) predicted-deleterious coding germline mutations were identified. Optimal multifactor and univariate Cox regression models were built to predict time to BCR after radical treatment, using germline variants grouped by functionally annotated gene sets. Models were tested for robustness using bootstrap resampling.
Results and limitations
Optimal Cox regression multifactor models showed that rare predicted-deleterious germline variants in “Hallmark” gene sets were consistently associated with altered time to BCR. Three gene sets had a statistically significant association with risk-elevated outcome when modelling all samples: PI3K/AKT/mTOR, Inflammatory response, and KRAS signalling (up). PI3K/AKT/mTOR and KRAS signalling (up) were also associated among patients with higher-grade cancer, as were Pancreas-beta cells, TNFA signalling via NFKB, and Hypoxia, the latter of which was validated in the independent TCGA dataset.
We demonstrate for the first time that rare deleterious coding germline variants robustly associate with time to BCR after radical treatment, including cohort-independent validation. Our findings suggest that germline testing at diagnosis could aid clinical decisions by stratifying patients for differential clinical management.
Disease recurrence is common following prostatectomy in patients with localised prostate cancer with high-risk features. Although androgen deprivation therapy increases the rates of organ-confined disease and negative surgical margins, there is no significant benefit on disease recurrence. Multiple lines of evidence suggest that FGF/FGFR signalling is important in supporting prostate epithelial cell survival in hostile conditions, including acute androgen deprivation. Given the recent availability of oral FGFR inhibitors, we investigated whether combination therapy could improve tumour response in the neo-adjuvant setting.
We conducted an open-label phase II study of the combination of erdafitinib (3 months) and androgen deprivation therapy (4 months) in men with localised prostate cancer with high-risk features prior to prostatectomy using a Simon's two-stage design. The co-primary endpoints were safety and tolerability and pathological response in the prostatectomy specimen. The effect of treatment on residual tumours was explored by global transcriptional profiling with RNA-sequencing.
Nine patients were enrolled in the first stage of the trial. The treatment combination was poorly tolerated. Erdafitinib treatment was discontinued early in six patients, three of whom also required dose interruptions/reductions. Androgen deprivation therapy for 4 months was completed in all patients. The most common adverse events were hyperphosphataemia, taste disturbance, dry mouth and nail changes. No patients achieved a complete pathological response, although patients who tolerated erdafitinib for longer had smaller residual tumours, associated with reduced transcriptional signatures of epithelial cell proliferation.
Although there was a possible enhanced anti-tumour effect of androgen deprivation therapy in combination with erdafitinib in treatment-naïve prostate cancer, the poor tolerability in this patient population prohibits the use of this combination in this setting.
Clinical Practice Points
Disease recurrence is common following prostatectomy in patients with localised prostate cancer with high-risk features. Although androgen deprivation therapy increases the rates of organ-confined disease and negative surgical margins, there is no significant benefit on disease recurrence. Multiple lines of evidence suggest that FGF/FGFR signalling is important in supporting prostate epithelial cell survival in hostile conditions, including acute androgen deprivation. We conducted an open-label phase II study of the combination of erdafitinib (3 months) and androgen deprivation therapy (4 months) in men with localised prostate cancer with high-risk features prior to prostatectomy using a Simon's two-stage design. The co-primary endpoints were safety and tolerability and pathological response in the prostatectomy specimen. The treatment combination was poorly tolerated. The most common adverse events were hyperphosphataemia, taste disturbance, dry mouth and nail changes. No patients achieved a complete pathological response, although patients who tolerated erdafitinib for longer had smaller residual tumours, associated with reduced transcriptional signatures of epithelial cell proliferation. Although there was a possible enhanced anti-tumour effect of androgen deprivation therapy in combination with erdafitinib in treatment-naïve prostate cancer, the poor tolerability in this patient population prohibits the use of this combination in this setting.
Recent publications have shown patients with defects in the DNA mismatch repair (MMR) pathway driven by either MSH2 or MSH6 loss experience a significant increase in the incidence of prostate cancer. Moreover, this increased incidence of prostate cancer is accompanied by rapid disease progression and poor clinical outcomes.
Methods and results
We show that androgen-receptor activation, a key driver of prostate carcinogenesis, can disrupt the MSH2 gene in prostate cancer. We screened tumours from two cohorts (recurrent/non-recurrent) of prostate cancer patients to confirm the loss of MSH2 protein expression and identified decreased MSH2 expression in recurrent cases. Stratifying the independent TCGA prostate cancer cohort for MSH2/6 expression revealed that patients with lower levels of MSH2/6 had significantly worse outcomes, in contrast to endometrial and colorectal cancer patients with lower MSH2/6 levels. MMRd endometrial and colorectal tumours showed the expected increase in mutational burden, microsatellite instability and enhanced immune cell mobilisation, but this was not evident in prostate tumours.
We have shown that loss or reduced levels of MSH2/MSH6 protein in prostate cancer is associated with poor outcome. However, our data indicate that this is not associated with a statistically significant increase in mutational burden, microsatellite instability or immune cell mobilisation in a cohort of primary prostate cancers.
Germline pathogenic variants (PVs) in the DNA mismatch repair (MMR) genes and in the base excision repair gene MUTYH underlie hereditary colorectal cancer (CRC) and polyposis syndromes. We evaluated the robustness and discriminatory potential of tumour mutational signatures in CRCs for identifying germline PV carriers.
Whole-exome sequencing of formalin-fixed paraffin-embedded (FFPE) CRC tissue was performed on 33 MMR germline PV carriers, 12 biallelic MUTYH germline PV carriers, 25 sporadic MLH1 methylated MMR-deficient CRCs (MMRd controls) and 160 sporadic MMR-proficient CRCs (MMRp controls) and included 498 TCGA CRC tumours. COSMIC V3 single base substitution (SBS) and indel (ID) mutational signatures were assessed for their ability to differentiate CRCs that developed in carriers from non-carriers.
The combination of mutational signatures SBS18 and SBS36 contributing >30% of a CRC’s signature profile was able to discriminate biallelic MUTYH carriers from all other non-carrier control CRCs with 100% accuracy (area under the curve (AUC) 1.0). SBS18 and SBS36 were associated with specific MUTYH variants p.Gly396Asp (p = 0.025) and p.Tyr179Cys (p = 5 × 10⁻⁵), respectively. The combination of ID2 and ID7 could discriminate the 33 MMR PV carrier CRCs from the MMRp control CRCs (AUC 0.99); however, SBS and ID signatures, alone or in combination, could not provide complete discrimination (AUC 0.79) between CRCs from MMR PV carriers and sporadic MMRd controls.
Assessment of SBS and ID signatures can discriminate CRCs from biallelic MUTYH carriers and MMR PV carriers from non-carriers with high accuracy, demonstrating utility as a potential diagnostic and variant classification tool.
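The decision rule reported above (SBS18 plus SBS36 contributing more than 30% of a tumour's signature profile) can be illustrated with a minimal Python sketch. The function names and the two example profiles are hypothetical and invented for illustration; only the 30% threshold and the signature names come from the abstract, and the sketch assumes signature contributions have already been estimated by an upstream tool.

```python
from typing import Dict

# Threshold from the abstract: SBS18 + SBS36 contributing more than 30%
# of a tumour's signature profile.
MUTYH_THRESHOLD = 0.30


def combined_sbs18_sbs36(profile: Dict[str, float]) -> float:
    """Return the fraction of the profile attributed to SBS18 plus SBS36."""
    total = sum(profile.values())
    if total <= 0:
        raise ValueError("signature profile has no positive contributions")
    return (profile.get("SBS18", 0.0) + profile.get("SBS36", 0.0)) / total


def looks_like_biallelic_mutyh(profile: Dict[str, float],
                               threshold: float = MUTYH_THRESHOLD) -> bool:
    """Apply the >30% rule; a hypothetical helper, not the authors' code."""
    return combined_sbs18_sbs36(profile) > threshold


# Hypothetical profiles (invented numbers, for illustration only).
example_carrier = {"SBS1": 20.0, "SBS5": 30.0, "SBS18": 15.0, "SBS36": 35.0}
example_control = {"SBS1": 45.0, "SBS5": 50.0, "SBS18": 5.0}

for name, profile in [("carrier-like", example_carrier),
                      ("control-like", example_control)]:
    share = combined_sbs18_sbs36(profile)
    print(f"{name}: SBS18+SBS36 = {share:.0%}, "
          f"flagged = {looks_like_biallelic_mutyh(profile)}")
```

Normalising by the profile total lets the same rule accept either raw mutation counts or percentages per signature.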
People who develop mismatch repair (MMR) deficient cancer in the absence of a germline MMR gene pathogenic variant or somatic hypermethylation of the MLH1 gene promoter are classified as having suspected Lynch syndrome (SLS). Germline whole genome sequencing (WGS) and targeted and genome-wide tumor sequencing were applied to identify the underlying cause of tumor MMR-deficiency in SLS. Germline WGS was performed on 14 cancer-affected people with SLS, including two sets of first-degree relatives. Germline pathogenic variants, including complex structural rearrangements and non-coding variants, were assessed for the MMR genes. Tumor tissue was sequenced for somatic MMR gene mutations by targeted, whole exome sequencing (WES) or WGS. Germline WGS identified pathogenic MMR variants in 3 of the 14 (21.4%) SLS cases including a 9.5 Mb inversion disrupting MSH2 in a mother and daughter. Excluding these 3 MMR carriers, tumor sequencing identified at least two somatic MMR gene mutations in 8/11 (72.7%) tumors tested. In a second mother-daughter pair, a somatic cause of their tumor MMR-deficiency was supported by the presence of double somatic MSH2 mutations in their respective tumors. More than 70% of SLS were resolved as having double somatic MMR mutations in the absence of germline pathogenic variants in the MMR or other DNA repair-related genes as determined by WGS and, therefore, confidently assigned a non-inherited cause for their tumor MMR-deficiency.
DNA originating from degenerate tumour cells can be detected in the circulation in many tumour types, where it can be used as a marker of disease burden as well as to monitor treatment response. Although circulating tumour DNA (ctDNA) measurement has prognostic/predictive value in metastatic prostate cancer, its utility in localised disease is unknown.
We performed whole-genome sequencing of tumour-normal pairs in eight patients with clinically localised disease undergoing prostatectomy, identifying high confidence genomic aberrations. A bespoke DNA capture and amplification panel against the highest prevalence, highest confidence aberrations for each individual was designed and used to interrogate ctDNA isolated from plasma prospectively obtained pre- and post- (24 h and 6 weeks) surgery. In a separate cohort (n = 189), we identified the presence of ctDNA TP53 mutations in preoperative plasma in a retrospective cohort and determined its association with biochemical- and metastasis-free survival.
Results
Tumour variants in ctDNA were positively identified pre-treatment in two of eight patients, which in both cases remained detectable postoperatively. Patients with tumour variants in ctDNA had extremely rapid disease recurrence and progression compared to those where variants could not be detected. In terms of aberrations targeted, single nucleotide and structural variants outperformed indels and copy number aberrations. Detection of ctDNA TP53 mutations was associated with a significantly shorter metastasis-free survival (6.2 vs. 9.5 years; HR 2.4, 95% CI 1.2–4.8, p = 0.014).
CtDNA is uncommonly detected in localised prostate cancer, but its presence portends more rapidly progressive disease.
To characterize the spectrum of BRCA1 and BRCA2 pathogenic germline variants in women from south-west Poland and west Ukraine affected with breast or ovarian cancer. Testing in women at high risk of breast and ovarian cancer in these regions is currently mainly limited to founder mutations.
Unrelated women affected with breast and/or ovarian cancer from Poland (n = 337) and Ukraine (n = 123) were screened by targeted sequencing. Excluded from targeted sequencing were 34 Polish women who had previously been identified as carrying a founder mutation in BRCA1. No prior testing had been conducted among the Ukrainian women. Thus, this study screened BRCA1 and BRCA2 in the germline DNA of 426 women in total.
We identified 31 and 18 women as carriers of pathogenic/likely pathogenic (P/LP) genetic variants in BRCA1 and BRCA2, respectively. We observed five BRCA1 and eight BRCA2 P/LP variants (13/337, 3.9%) in the Polish women. Combined with the 34/337 (10.1%) founder variants identified prior to this study, the overall P/LP variant frequency in the Polish women was thus 14% (47/337). Among the Ukrainian women, 16/123 (13%) women were identified as carrying a founder mutation and 20/123 (16.3%) were found to carry non-founder P/LP variants (10 in BRCA1 and 10 in BRCA2).
These results indicate that genetic testing in women at high risk of breast and ovarian cancer in Poland and Ukraine should not be limited to founder mutations. Extended testing will enhance risk stratification and management for these women and their families.
The identification of metabolites plays an important role in understanding drug efficacy and safety; however, these compounds are often difficult to identify in complex mixtures. One approach to identify drug metabolites involves utilising differentially isotopically labelled drug compounds to create unique isotopic signals that can be detected by liquid chromatography-mass spectrometry (LC-MS). User-friendly, efficient computational tools that allow selective detection of these signals are lacking. We have developed an efficient open-source software tool called HiTIME (High-Resolution Twin-Ion Metabolite Extraction) which filters twin-ion signals in LC-MS data. The intensity of each data point in the input is replaced by a Z-score describing how well the point matches an idealised twin-ion signal versus alternative ion signatures. Here we provide a detailed description of the algorithm and demonstrate its performance on simulated and experimental data.
Few genetic risk factors have been demonstrated to be specifically associated with aggressive prostate cancer (PrCa). Here, we report a case-case study of PrCa comparing the prevalence of germline pathogenic/likely pathogenic (P/LP) genetic variants in 787 men with aggressive disease and 769 with non-aggressive disease. Overall, we observed P/LP variants in 11.4% of men with aggressive PrCa and 9.8% of men with non-aggressive PrCa (two-tailed Fisher's exact tests, P = 0.28). The proportion of BRCA2 and ATM P/LP variant carriers in men with aggressive PrCa exceeded that observed in men with non-aggressive PrCa; 18/787 carriers (2.3%) and 4/769 carriers (0.5%), P = 0.004, and 14/787 carriers (1.8%) and 5/769 carriers (0.7%), P = 0.06, respectively. Our findings contribute to the extensive international effort to interpret the genetic variation identified in genes included on gene-panel tests, for which there is currently an insufficient evidence-base for clinical translation in the context of PrCa risk.
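The carrier comparisons above rest on two-tailed Fisher's exact tests of 2×2 tables. The following is a minimal, standard-library Python sketch of that test applied to the BRCA2 and ATM counts quoted in the abstract. The function name is hypothetical and this is not the authors' analysis code; it uses the common convention of summing the probabilities of all tables no more likely than the observed one, which may differ slightly from the software the study used.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Probability that the top-left cell equals x, given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # The small relative tolerance guards against floating-point ties.
    total = sum(p for p in map(prob, range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


# Counts from the abstract: carriers among 787 aggressive and
# 769 non-aggressive cases.
N_AGGRESSIVE, N_NON_AGGRESSIVE = 787, 769
brca2_p = fisher_exact_two_sided(18, N_AGGRESSIVE - 18, 4, N_NON_AGGRESSIVE - 4)
atm_p = fisher_exact_two_sided(14, N_AGGRESSIVE - 14, 5, N_NON_AGGRESSIVE - 5)

for gene, k_agg, k_non, p in [("BRCA2", 18, 4, brca2_p), ("ATM", 14, 5, atm_p)]:
    print(f"{gene}: {100 * k_agg / N_AGGRESSIVE:.1f}% vs "
          f"{100 * k_non / N_NON_AGGRESSIVE:.1f}%, P = {p:.2g}")
```

Computing the percentages directly from the counts alongside the P values gives a quick consistency check on the reported carrier proportions.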
The advent of gene panel testing is challenging the previous practice of using clinically defined cancer family syndromes to inform single-gene genetic screening. Individual and family cancer histories that would have previously indicated testing of a single gene or a small number of related genes are now, increasingly, leading to screening across gene panels that contain larger numbers of genes. We have applied a gene panel test that included four DNA mismatch repair (MMR) genes (MLH1, MSH2, MSH6 and PMS2) to an Australian population-based case–control-family study of breast cancer. Altogether, eight pathogenic variants in MMR genes were identified: six in 1421 case-families (0.4%, 4 MSH6 and 2 PMS2) and two in 833 control-families (0.2%, one each of MLH1 and MSH2). This testing highlights the current and future challenges for clinical genetics in the context of anticipated gene panel-based population-based screening that includes the MMR genes. This testing is likely to provide additional opportunities for cancer prevention via cascade testing for Lynch syndrome and precision medicine for breast cancer treatment.
Background. Bioinformatics software tools are often created ad hoc, frequently by people without extensive training in software development. In particular, for beginners, the barrier to entry in bioinformatics software development is high, especially if they want to adopt good programming practices. Even experienced developers do not always follow best practices. This results in the proliferation of poorer-quality bioinformatics software, leading to limited scalability and inefficient use of resources; lack of reproducibility, usability, adaptability, and interoperability; and erroneous or inaccurate results.
Findings. We have developed Bionitio, a tool that automates the process of starting new bioinformatics software projects following recommended best practices. With a single command, the user can create a new well-structured project in 1 of 12 programming languages. The resulting software is functional, carrying out a prototypical bioinformatics task, and thus serves as both a working example and a template for building new tools. Key features include command-line argument parsing, error handling, progress logging, defined exit status values, a test suite, a version number, standardized building and packaging, user documentation, code documentation, a standard open source software license, software revision control, and containerization.
Conclusions. Bionitio serves as a learning aid for beginner-to-intermediate bioinformatics programmers and provides an excellent starting point for new projects. This helps developers adopt good programming practices from the beginning of a project and encourages high-quality tools to be developed more rapidly. This also benefits users because tools are more easily installed and consistent in their usage. Bionitio is released as open source software under the MIT License and is available at https://github.com/bionitio-team/bionitio
BACKGROUND: Muir-Torre syndrome is defined by the development of sebaceous skin lesions in individuals who carry a germline mismatch repair (MMR) gene mutation. Loss of expression of MMR proteins is frequently observed in sebaceous skin lesions, but MMR-deficiency alone is not diagnostic for carrying a germline MMR gene mutation.
METHODS: Whole exome sequencing was performed on three MMR-deficient sebaceous lesions from individuals with MSH2 gene mutations (Lynch syndrome) and three MMR-proficient sebaceous lesions from individuals without Lynch syndrome with the aim of characterizing the tumor mutational signatures, somatic mutation burden, and microsatellite instability status. Thirty predefined somatic mutational signatures were calculated for each lesion.
RESULTS: Signature 1 was ubiquitous across the six lesions tested. Signatures 6 and 15, associated with defective DNA MMR, were significantly more prevalent in the MMR-deficient lesions from the MSH2 carriers compared with the MMR-proficient non-Lynch sebaceous lesions (mean ± SD=41.0 ± 8.2% vs. 2.3 ± 4.0%, p = 0.0018). Tumor mutation burden was, on average, significantly higher in the MMR-deficient lesions compared with the MMR-proficient lesions (23.3 ± 11.4 vs. 1.8 ± 0.8 mutations/Mb, p = 0.03). All four sebaceous lesions observed in sun exposed areas of the body demonstrated signature 7 related to ultraviolet light exposure.
CONCLUSION: Tumor mutational signatures 6 and 15 and somatic mutation burden were effective in differentiating Lynch-related from non-Lynch sebaceous lesions.
Large-scale computational prediction of protein structures represents a cost-effective alternative to empirical structure determination with particular promise for non-model organisms and neglected pathogens. Conventional sequence-based tools are insufficient to annotate the genomes of such divergent biological systems. Conversely, protein structure tolerates substantial variation in primary amino acid sequence, and is thus a robust indicator of biochemical function. Structural proteomics is poised to become a standard part of pathogen genomics research; however, informatic methods are now required to assign confidence in large volumes of predicted structures.
To predict the proteome of a neglected human pathogen, Giardia duodenalis, and stratify predicted structures into high- and lower-confidence categories using a variety of metrics in isolation and combination.
We used the I-TASSER suite to predict structural models for ∼5000 proteins encoded in Giardia duodenalis and identify their closest empirically determined structural homologues in the Protein Data Bank. Models were assigned to high or lower-confidence categories depending on the presence of matching PFAM domains in query and reference peptides. Metrics output from the suite and derived metrics were assessed for their ability to predict the high confidence category individually, and in combination through development of a random forest classifier.
We identified 1095 high confidence models including 212 hypothetical proteins. Amino acid identity between query and reference peptides was the greatest individual predictor of high confidence status; however, the random forest classifier out-performed any metric in isolation (AUC = 0.977), and identified a subset of 305 high confidence-like models, corresponding to false positive predictions. High confidence models exhibited higher transcriptional abundance, and the classifier generalized across species, indicating the broad utility of this approach for automatically stratifying predicted structures. Additional structure-based clustering was used to cross-check confidence predictions in an expanded family of Nek kinases. Several high confidence-like proteins yielded substantial new insight into mechanisms of redox balance in Giardia duodenalis, a system central to the efficacy of limited anti-giardial drugs.
Structural proteomics combined with machine learning can aid genome annotation for genetically divergent organisms including human pathogens, and stratify predicted structures to promote efficient allocation of limited resources for experimental investigation.
Breast cancer risk for BRCA1 and BRCA2 pathogenic mutation carriers is modified by risk factors that cluster in families, including genetic modifiers of risk. We considered genetic modifiers of risk for carriers of high-risk mutations in other breast cancer susceptibility genes.
In a family known to carry the high-risk mutation PALB2:c.3113G>A (p.Trp1038*), whole-exome sequencing was performed on germline DNA from four affected women, three of whom were mutation carriers.
RNASEL:p.Glu265* was identified in one of the PALB2 carriers who had two primary invasive breast cancer diagnoses before 50 years. Gene-panel testing of BRCA1, BRCA2, PALB2 and RNASEL in the Australian Breast Cancer Family Registry identified five carriers of RNASEL:p.Glu265* in 591 early onset breast cancer cases. Three of the five women (60%) carrying RNASEL:p.Glu265* also carried a pathogenic mutation in a breast cancer susceptibility gene compared with 30 carriers of pathogenic mutations in the 586 non-carriers of RNASEL:p.Glu265* (5%) (p < 0.002). Taqman genotyping demonstrated that the allele frequency of RNASEL:p.Glu265* was similar in affected and unaffected Australian women, consistent with other populations.
Our study suggests that RNASEL:p.Glu265* may be a genetic modifier of risk for early-onset breast cancer predisposition in carriers of high-risk mutations. Much larger case-case and case-control studies are warranted to test the association observed in this report.
Neural injury triggers swift responses from glia, including glial migration and phagocytic clearance of damaged neurons. The transcriptional programs governing these complex innate glial immune responses are still unclear. Here, we describe a novel injury assay in adult Drosophila that elicits widespread glial responses in the ventral nerve cord (VNC). We profiled injury-induced changes in VNC gene expression by RNA sequencing (RNA-seq) and found that responsive genes fall into diverse signaling classes. One factor, matrix metalloproteinase-1 (MMP-1), is induced in Drosophila ensheathing glia responding to severed axons. Interestingly, glial induction of MMP-1 requires the highly conserved engulfment receptor Draper, as well as AP-1 and STAT92E. In MMP-1 depleted flies, glia do not properly infiltrate neuropil regions after axotomy and, as a consequence, fail to clear degenerating axonal debris. This work identifies Draper-dependent activation of MMP-1 as a novel cascade required for proper glial clearance of severed axons.
Throughout history, the life sciences have been revolutionised by technological advances; in our era this is manifested by advances in instrumentation for data generation, and consequently researchers now routinely handle large amounts of heterogeneous data in digital formats. The simultaneous transitions towards biology as a data science and towards a ‘life cycle’ view of research data pose new challenges. Researchers face a bewildering landscape of data management requirements, recommendations and regulations, without necessarily being able to access data management training or possessing a clear understanding of practical approaches that can assist in data management in their particular research domain.
Here we provide an overview of best practice data life cycle approaches for researchers in the life sciences/bioinformatics space with a particular focus on ‘omics’ datasets and computer-based data processing and analysis. We discuss the different stages of the data life cycle and provide practical suggestions for useful tools and resources to improve data management practices.
Background: Previously, we described ROVER, a DNA variant caller which identifies genetic variants from PCR-targeted massively parallel sequencing (MPS) datasets generated by the Hi-Plex protocol. ROVER permits stringent filtering of sequencing chemistry-induced errors by requiring reported variants to appear in both reads of overlapping pairs above certain thresholds of occurrence. ROVER was developed in tandem with Hi-Plex and has been used successfully to screen for genetic mutations in the breast cancer predisposition gene PALB2.
ROVER is applied to MPS data in BAM format and, therefore, relies on sequence reads being mapped to a reference genome. In this paper, we describe an improvement to ROVER, called UNDR ROVER (Unmapped primer-Directed ROVER), which accepts MPS data in FASTQ format, avoiding the need for a computationally expensive mapping stage. It does so by taking advantage of the location-specific nature of PCR-targeted MPS data.
Results: The UNDR ROVER algorithm achieves the same stringent variant calling as its predecessor with a significant runtime performance improvement. In one indicative sequencing experiment, UNDR ROVER (in its fastest mode) required 8-fold less sequential computation time than the ROVER pipeline and 13-fold less sequential computation time than a variant calling pipeline based on the popular GATK tool.
UNDR ROVER is implemented in Python and runs on all popular POSIX-like operating systems (Linux, OS X). It requires as input a tab-delimited format file containing primer sequence information, a FASTA format file containing the reference genome sequence, and paired FASTQ files containing sequence reads. Primer sequences at the 5′ end of reads associate read-pairs with their targeted amplicon and, thus, their expected corresponding coordinates in the reference genome. The primer-intervening sequence of each read is compared against the reference sequence from the same location and variants are identified using the same algorithm as ROVER. Specifically, for a variant to be ‘called’ it must appear at the same location in both of the overlapping reads above user-defined thresholds of minimum number of reads and proportion of reads.
Conclusions: UNDR ROVER provides the same rapid and accurate genetic variant calling as its predecessor with greatly reduced computational costs.
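The calling rule described above (primers assign read pairs to amplicons, and a variant is called only if it appears in both reads of overlapping pairs above thresholds on the number and proportion of reads) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not UNDR ROVER's implementation: the names `assign_amplicon`, `call_variants`, `min_pairs` and `min_proportion` are hypothetical, primer matching is reduced to an exact prefix match, and the per-read variants are assumed to have been derived already by comparing each read with the reference.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Set, Tuple

# (reference position, reference allele, alternate allele)
Variant = Tuple[int, str, str]


def assign_amplicon(read: str, primers: Dict[str, str]) -> Optional[str]:
    """Return the amplicon whose primer starts the read (exact prefix match)."""
    for name, primer in primers.items():
        if read.startswith(primer):
            return name
    return None


def call_variants(pairs: Iterable[Tuple[Set[Variant], Set[Variant]]],
                  min_pairs: int,
                  min_proportion: float) -> List[Variant]:
    """Call variants seen in BOTH reads of enough pairs for one amplicon."""
    pairs = list(pairs)
    if not pairs:
        return []
    support: Counter = Counter()
    for read1_variants, read2_variants in pairs:
        # A pair supports a variant only if both of its reads carry it.
        support.update(read1_variants & read2_variants)
    total = len(pairs)
    return sorted(v for v, n in support.items()
                  if n >= min_pairs and n / total >= min_proportion)


# Hypothetical data for a single amplicon (invented for illustration).
primers = {"amplicon_1": "ACGTAC", "amplicon_2": "TTGGCA"}
snv: Variant = (101, "A", "G")
seq_error: Variant = (105, "C", "T")  # present in only one read of one pair
pairs = [({snv}, {snv}), ({snv, seq_error}, {snv}), ({snv}, {snv}), (set(), set())]

print(assign_amplicon("ACGTACGGATTC", primers))
print(call_variants(pairs, min_pairs=2, min_proportion=0.5))
```

Requiring agreement between the two reads of a pair is what filters out the single-read error in this toy example, while the two thresholds control how much supporting evidence is needed before a variant is reported.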
Background: The NCBI Entrez Gene and PubMed databases contain a wealth of high-quality information about genes for many different organisms. The NCBI Entrez online web-search interface is convenient for simple manual search for a small number of genes but impractical for the kinds of outputs seen in typical genomics projects.
Results: We have developed an efficient open source tool implemented in Python called Annokey, which annotates gene lists with the results of a keyword search of the NCBI Entrez Gene database and linked PubMed article information. The user steers the search by specifying a ranked list of keywords (including multi-word phrases and regular expressions) that are correlated with their topic of interest. Rank information of matched terms allows the user to guide further investigation.
We applied Annokey to the entire human Entrez Gene database using the key-term “DNA repair” and assessed its performance in identifying the 176 members of a published “gold standard” list of genes established to be involved in this pathway. For this test case we observed a sensitivity and specificity of 97% and 96%, respectively.
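For readers unfamiliar with the two measures, the following sketch shows how sensitivity and specificity can be computed from a set of predicted genes, a gold-standard set, and the full set of genes screened. The function name and the toy integers standing in for genes are assumptions; the numbers are not the study's data.

```python
def sensitivity_specificity(predicted, gold, universe):
    """Compute (sensitivity, specificity) from sets of gene identifiers.

    Assumes both the gold-standard set and its complement are non-empty.
    """
    true_pos = len(predicted & gold)
    false_neg = len(gold - predicted)
    false_pos = len(predicted - gold)
    true_neg = len(universe - predicted - gold)
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


# Toy example with ten placeholder "genes" labelled 0..9.
universe = set(range(10))
gold = {0, 1, 2, 3}
predicted = {0, 1, 2, 4}
sens, spec = sensitivity_specificity(predicted, gold, universe)
print(sens, spec)
```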
Conclusions: Annokey facilitates the identification of genes related to an area of interest, a task which can be onerous if performed manually on a large number of genes. Annokey provides a way to capitalize on the high-quality information provided by the Entrez Gene database, allowing both scalability and compatibility with automated analysis pipelines, thus offering the potential to significantly enhance research productivity.
This thesis is about the design and implementation of a debugging tool which helps Haskell programmers understand why their programs do not work as intended. The traditional debugging technique of examining the program execution step-by-step, popular with imperative languages, is less suitable for Haskell because its unorthodox evaluation strategy is difficult to relate to the structure of the original program source code. We build a debugger which focuses on the high-level logical meaning of a program rather than its evaluation order. This style of debugging is called declarative debugging, and it originated in logic programming languages. At the heart of the debugger is a tree which records information about the evaluation of the program in a manner which is easy to relate to the structure of the program. Links between nodes in the tree reflect logical relationships between entities in the source code. An error diagnosis algorithm is applied to the tree in a top-down fashion, searching for causes of bugs. The search is guided by an oracle, who knows how each part of the program should behave. The oracle is normally a human — typically the person who wrote the program — however, much of its behaviour can be encoded in software.
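The top-down diagnosis can be sketched in a few lines. This is an illustration of the general declarative debugging idea in Python, not the debugger's own code: the `Node` layout, the function names, and the software oracle built from reference implementations are assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    """One function application in the evaluation tree."""
    name: str
    args: tuple
    result: object
    children: List["Node"] = field(default_factory=list)


def find_buggy_node(node: Node, oracle: Callable[[Node], bool]) -> Node:
    """Given a node already judged erroneous, descend to a buggy node.

    A node is buggy if it is erroneous but all of its children are correct.
    """
    for child in node.children:
        if not oracle(child):
            return find_buggy_node(child, oracle)
    return node


def diagnose(root: Node, oracle: Callable[[Node], bool]) -> Optional[Node]:
    """Return a buggy node, or None if the oracle accepts the root."""
    if oracle(root):
        return None
    return find_buggy_node(root, oracle)


# An oracle encoded in software: compare each recorded result against a
# trusted reference implementation (hypothetical, for illustration only).
REFERENCE = {
    "sum_squares": lambda xs: sum(x * x for x in xs),
    "square": lambda x: x * x,
}


def software_oracle(node: Node) -> bool:
    return node.result == REFERENCE[node.name](*node.args)


# Tree recorded from a faulty program in which square(x) returned x + x.
tree = Node("sum_squares", ([1, 2],), 6,
            [Node("square", (1,), 2), Node("square", (2,), 4)])
buggy = diagnose(tree, software_oracle)
print(buggy.name, buggy.args, buggy.result)
```

In an interactive session the oracle would instead ask a person whether each recorded application is correct; the search itself is unchanged.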
An interesting aspect of this work is that the debugger is implemented by means of a program transformation. That is, the program which is to be debugged is transformed into a new one, which when evaluated, behaves like the original program but also produces the evaluation tree as a side-effect. The transformed program is augmented with code to perform the error diagnosis on the tree. Running the transformed program constitutes the evaluation of the original program plus a debugging session. The use of program transformation allows the debugger to take advantage of existing compiler technology (a whole new compiler and runtime environment does not need to be written), which saves much work and enhances portability.
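A loose analogy in Python may help convey the idea of a program that computes its normal result while building an evaluation tree as a side-effect. The sketch below uses a runtime decorator rather than a source-to-source transformation, and Python evaluates strictly, so it does not reflect how the Haskell transformation copes with lazy evaluation. The `Trace` class and its dictionary-based node layout are assumptions.

```python
import functools


class Trace:
    """Collects a tree of function applications as instrumented code runs."""

    def __init__(self):
        self.roots = []    # top-level applications
        self._stack = []   # applications currently being evaluated

    def instrument(self, fn):
        """Wrap fn so every call is recorded as a node under its caller."""
        @functools.wraps(fn)
        def wrapper(*args):
            node = {"name": fn.__name__, "args": args,
                    "result": None, "children": []}
            siblings = self._stack[-1]["children"] if self._stack else self.roots
            siblings.append(node)
            self._stack.append(node)
            try:
                node["result"] = fn(*args)
            finally:
                self._stack.pop()
            return node["result"]
        return wrapper


trace = Trace()


@trace.instrument
def square(x):
    return x + x  # deliberate bug: should be x * x


@trace.instrument
def sum_squares(xs):
    return sum(square(x) for x in xs)


value = sum_squares([1, 2])   # behaves like the original (buggy) program
print(value)
print(trace.roots[0])         # the evaluation tree produced as a side-effect
```

The instrumented functions return exactly what the originals would, so the surrounding program is unaffected; the tree is simply left behind for a diagnosis step to inspect.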
The technology described in this thesis is well-tested by an implementation in software. The result is a useful tool, called buddha, which is publicly available and supports all of the Haskell 98 standard.
Haskell is a very safe language, particularly because of its type system. However, there will always be programs that do the wrong thing. Programmer fallibility, partial or incorrect specifications and typographic errors are but a few of the reasons that make bugs a fact of life. This paper is about the use and implementation of a debugger, called Buddha, which helps Haskell programmers understand why their programs misbehave. Traditional debugging tools that examine the program execution step-by-step are not suitable for Haskell because of its unorthodox evaluation strategy. Instead, a different approach is taken which abstracts away the evaluation order of the program and focuses on its high-level logical meaning.
This style of debugging is called Declarative Debugging, and it has its roots in the Logic Programming community. At the heart of the debugger is a tree which records information about the evaluation of the program in a manner which is easy to relate to the structure of the source code. It resembles a call graph annotated with the arguments and results of function applications, shown in their most evaluated form. Logical relationships between entities in the source are reflected in the links between nodes in the tree. An error diagnosis algorithm is applied to the tree in a top-down fashion in the search for causes of bugs.