flowcraft.templates.assembly_report module

Purpose

This module generates a summary report for a given assembly in Fasta format.

Expected input

The following variables are expected, whether the template is run through Nextflow or through the main() executor.

  • sample_id : Sample Identification string.
    • e.g.: 'SampleA'
  • assembly : Path to assembly file in Fasta format.
    • e.g.: 'assembly.fasta'

Generated output

  • ${sample_id}_assembly_report.csv : CSV with summary information of the assembly.
    • e.g.: 'SampleA_assembly_report.csv'

Code documentation

class flowcraft.templates.assembly_report.Assembly(assembly_file, sample_id)[source]

Class that parses and filters an assembly file in Fasta format.

This class parses an assembly file, collects summary statistics and metadata from the contigs, and reports them.

Parameters:
assembly_file : str

Path to assembly file.

sample_id : str

Name of the sample for the current assembly.

Methods

get_coverage_sliding(self, coverage_file[, …])
get_gc_sliding(self[, window]) Calculates a sliding window of the GC content for the assembly
get_summary_stats(self[, output_csv]) Generates a CSV report with summary statistics about the assembly

summary_info = None

OrderedDict: Summary information dictionary. Contains the keys:

  • ncontigs: Number of contigs
  • avg_contig_size: Average size of contigs
  • n50: N50 metric
  • total_len: Total assembly length
  • avg_gc: Average GC proportion
  • missing_data: Count of missing data characters

contigs = None

OrderedDict: Object that maps the contig headers to the corresponding sequence
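
As an illustration of the header-to-sequence mapping described above, the following is a minimal sketch of a Fasta parser. It is not the module's own code: the function name parse_fasta and the choice to strip only the leading '>' from headers are assumptions made for the example.

```python
import io
from collections import OrderedDict


def parse_fasta(handle):
    """Map each contig header (without the leading '>') to its sequence.

    Hypothetical helper for illustration; not part of flowcraft.
    """
    chunks = OrderedDict()
    header = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            chunks[header] = []
        elif header is not None:
            # Sequence lines may be wrapped, so collect them per contig
            chunks[header].append(line)
    # Join the wrapped lines once per contig, keeping file order
    return OrderedDict((h, "".join(parts)) for h, parts in chunks.items())


example = io.StringIO(">contig_1\nACGT\nACGT\n>contig_2\nGGCC\n")
contigs = parse_fasta(example)
```

An OrderedDict keeps the contigs in file order, which matters when later steps treat the assembly as one concatenated sequence.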

contig_coverage = None

OrderedDict: Object that maps the contig headers to the corresponding list of per-base coverage

sample = None

str: Sample id

contig_boundaries = None

dict: Maps the boundaries of each contig in the genome
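
One possible reading of this attribute is that each contig is assigned a start and end offset within the concatenated assembly. The sketch below follows that reading. The half-open (start, end) convention, the zero-based offsets, and the function name contig_boundaries are all assumptions; the module may store the boundaries differently.

```python
from collections import OrderedDict


def contig_boundaries(contigs):
    """Return {header: (start, end)} offsets in the concatenated assembly.

    Assumption: zero-based, half-open intervals, contigs taken in order.
    """
    boundaries = {}
    offset = 0
    for header, sequence in contigs.items():
        boundaries[header] = (offset, offset + len(sequence))
        offset += len(sequence)
    return boundaries


contigs = OrderedDict([("contig_1", "ACGTACGT"), ("contig_2", "GGCC")])
bounds = contig_boundaries(contigs)
```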

get_summary_stats(self, output_csv=None)[source]

Generates a CSV report with summary statistics about the assembly

The calculated statistics are:

  • Number of contigs
  • Average contig size
  • N50
  • Total assembly length
  • Average GC content
  • Amount of missing data

Parameters:
output_csv : str

Name of the output CSV file.
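
The statistics listed above could be computed along the following lines. This is a sketch under stated assumptions, not the module's implementation: the function names summary_stats and write_report are hypothetical, missing data is assumed to mean 'N' characters, the GC proportion is assumed to be taken over the full assembly length, and the CSV layout (one header row, one value row) is a guess.

```python
import csv
import io
from collections import OrderedDict


def summary_stats(contigs):
    """Compute the six statistics from a header -> sequence mapping."""
    lengths = sorted((len(seq) for seq in contigs.values()), reverse=True)
    total_len = sum(lengths)

    # N50: length of the contig at which the running total, taken from the
    # longest contig downwards, first reaches half of the assembly length.
    n50, running = 0, 0
    for length in lengths:
        running += length
        if running >= total_len / 2:
            n50 = length
            break

    joined = "".join(contigs.values()).upper()
    gc_count = joined.count("G") + joined.count("C")

    stats = OrderedDict()
    stats["ncontigs"] = len(lengths)
    stats["avg_contig_size"] = total_len / len(lengths) if lengths else 0
    stats["n50"] = n50
    stats["total_len"] = total_len
    # Assumption: GC proportion is taken over the full assembly length
    stats["avg_gc"] = gc_count / total_len if total_len else 0
    # Assumption: missing data is counted as 'N' characters
    stats["missing_data"] = joined.count("N")
    return stats


def write_report(stats, handle):
    """Write one header row and one value row (assumed layout)."""
    writer = csv.writer(handle)
    writer.writerow(stats.keys())
    writer.writerow(stats.values())


contigs = OrderedDict([("c1", "ACGTACGTNN"), ("c2", "GGCC"), ("c3", "AT")])
stats = summary_stats(contigs)
buffer = io.StringIO()
write_report(stats, buffer)
```

Writing to a file handle rather than a path keeps the sketch easy to try in memory; opening '${sample_id}_assembly_report.csv' and passing the handle would produce a file instead.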

get_gc_sliding(self, window=2000)[source]

Calculates a sliding window of the GC content for the assembly

Parameters:
window : int

Size of sliding window

Returns:
gc_res : list

List of GC proportion floats for each data point in the sliding window
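
A windowed GC calculation of this kind might look like the sketch below. It assumes consecutive, non-overlapping windows over a single concatenated sequence, with a shorter final window. The actual method may step through the assembly differently, and the standalone function gc_sliding here is a hypothetical stand-in for the method.

```python
def gc_sliding(sequence, window=2000):
    """Return the GC proportion of each consecutive window.

    Assumption: windows do not overlap; the last one may be shorter.
    """
    sequence = sequence.upper()
    gc_res = []
    for start in range(0, len(sequence), window):
        chunk = sequence[start:start + window]
        gc_count = chunk.count("G") + chunk.count("C")
        gc_res.append(gc_count / len(chunk))
    return gc_res


# A small window makes the behaviour easy to follow by eye
gc_res = gc_sliding("GGCCAATTGC", window=4)
```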

get_coverage_sliding(self, coverage_file, window=2000)[source]
Parameters:
coverage_file : str

Path to the file containing per-base coverage information (as generated by samtools depth)

window : int

Size of sliding window
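
To make the expected input concrete: samtools depth writes tab-separated lines with the reference name, the 1-based position, and the depth. The sketch below reads that format and averages the depth over consecutive windows. The helper names read_depth and coverage_sliding, the use of the mean, and the non-overlapping windows are assumptions for illustration, not the method's documented behaviour.

```python
import io
from collections import OrderedDict


def read_depth(handle):
    """Map each contig to its list of per-base depths.

    Assumption: every position is present in the file (samtools depth
    skips zero-coverage positions unless run with -a).
    """
    coverage = OrderedDict()
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        contig, depth = fields[0], int(fields[2])
        coverage.setdefault(contig, []).append(depth)
    return coverage


def coverage_sliding(coverage, window=2000):
    """Return the mean depth of each consecutive window (assumed metric)."""
    values = [d for depths in coverage.values() for d in depths]
    result = []
    for start in range(0, len(values), window):
        chunk = values[start:start + window]
        result.append(sum(chunk) / len(chunk))
    return result


depth_text = "c1\t1\t10\nc1\t2\t20\nc1\t3\t30\nc2\t1\t40\n"
coverage = read_depth(io.StringIO(depth_text))
cov_res = coverage_sliding(coverage, window=2)
```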