flowcraft.templates.fastqc_report module

Purpose

This module is intended to parse the results of FastQC for paired-end FastQ samples. It parses two reports:

  • Categorical report
  • Nucleotide level report.

Expected input

The following variables are expected whether the module is run through Nextflow or through the main() executor.

  • sample_id : Sample identification string
    • e.g.: 'SampleA'
  • result_p1 : Path to both FastQC result files for pair 1
    • e.g.: 'SampleA_1_data SampleA_1_summary'
  • result_p2 : Path to both FastQC result files for pair 2
    • e.g.: 'SampleA_2_data SampleA_2_summary'
  • opts : Additional arguments for executing fastqc_report, given as a string of command line arguments. The accepted arguments are:
    • '--ignore-tests' : Ignores test results from FastQC categorical summary. This is used in the first run of FastQC.
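
The variables above are plain strings, so they have to be split before use. A minimal sketch of that step follows. The helper name parse_inputs and the returned dictionary layout are hypothetical, and the order "data file first, summary file second" is an assumption taken from the example values above.

```python
def parse_inputs(sample_id, result_p1, result_p2, opts=""):
    """Split the space-separated template variables into usable values.

    Hypothetical helper for illustration; not part of the module's API.
    """
    # Assumption: each result string holds the data file, then the summary file.
    p1_data, p1_summary = result_p1.split()
    p2_data, p2_summary = result_p2.split()
    # opts is a string of command line style arguments.
    flags = opts.split()
    return {
        "sample_id": sample_id,
        "data": (p1_data, p2_data),
        "summary": (p1_summary, p2_summary),
        "ignore_tests": "--ignore-tests" in flags,
    }


parsed = parse_inputs(
    "SampleA",
    "SampleA_1_data SampleA_1_summary",
    "SampleA_2_data SampleA_2_summary",
    "--ignore-tests",
)
```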

Generated output

The generated outputs are files, each containing a single object, usually a string.

  • fastqc_health : Stores the health check for the current sample. If it passes all checks, it contains only the string ‘pass’. Otherwise, it contains the summary categories and their respective results.
    • e.g.: 'pass'
  • optimal_trim : Stores a tuple with the optimal trimming positions for the 5’ and 3’ ends of the reads.
    • e.g.: '15 151'

Code documentation

flowcraft.templates.fastqc_report.write_json_report(sample_id, data1, data2)

Writes the report

Parameters:
data1
data2

flowcraft.templates.fastqc_report.get_trim_index(biased_list)

Returns the trim index from a bool list

Provided with a list of bool elements ([False, False, True, True]), this function will assess the index of the list that minimizes the number of True elements (biased positions) at the extremities. To do so, it will iterate over the boolean list and find an index position where there are two consecutive False elements after a True element. This will be considered as an optimal trim position. For example, in the following list:

[True, True, False, True, True, False, False, False, False, ...]

The optimal trim index will be position 4 (counting from 0), since it is the first occurrence of a True element with two False elements after it.

If the provided bool list has no True elements, then the 0 index is returned.

Parameters:
biased_list: list

List of bool elements, where True means a biased site.

Returns:
x : index position of the biased list for the optimal trim.
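
The search described above can be sketched as follows. This is an illustration under stated assumptions, not the module's code: the function name is hypothetical, it returns the index of the True element itself (counting from 0), and the fallback of returning the list length when no such position exists is an assumption, since that case is not described here.

```python
def get_trim_index_sketch(biased_list):
    """Return the index of the first True followed by two False values."""
    # No biased positions at all: nothing to trim.
    if True not in biased_list:
        return 0
    for i, biased in enumerate(biased_list):
        # The two elements right after position i.
        following = biased_list[i + 1:i + 3]
        if biased and following == [False, False]:
            return i
    # Assumption: no suitable position found, treat the whole list as biased.
    return len(biased_list)


example = [True, True, False, True, True, False, False, False, False]
trim_index = get_trim_index_sketch(example)
```

For the example list, the first True element followed by two False elements sits at index 4, so that is the value stored in trim_index.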

flowcraft.templates.fastqc_report.trim_range(data_file)

Assess the optimal trim range for a given FastQC data file.

This function will parse a single FastQC data file, namely the ‘Per base sequence content’ category. It will retrieve the A/T and G/C content for each nucleotide position in the reads, and check whether the G/C and A/T proportions are between 80% and 120%. If they are not, that nucleotide position is marked as biased for future removal.

Parameters:
data_file: str

Path to FastQC data file.

Returns:
trim_nt: list

List containing the range with the best trimming positions for the corresponding FastQ file. The first element is the 5’ end trim index and the second element is the 3’ end trim index.
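
The flagging step can be sketched as below. This is a simplified illustration, not the module's implementation: it works on a list of lines instead of a file path, the function name and the pseudo parameter (a small pseudocount that avoids division by zero) are hypothetical, and it reads the 80% to 120% window as the balanced range, so positions outside it are flagged. It assumes the usual tab-separated FastQC module layout with G, A, T and C percentage columns.

```python
def flag_biased_positions(lines, low=0.8, high=1.2, pseudo=0.1):
    """Return one bool per position row; True marks a biased position."""
    biased = []
    in_module = False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">>Per base sequence content"):
            in_module = True
            continue
        if in_module and line.startswith(">>END_MODULE"):
            break
        # Skip everything outside the module, plus its header row.
        if not in_module or line.startswith("#"):
            continue
        # Columns: base position, then G, A, T and C percentages.
        g, a, t, c = (float(x) for x in line.split("\t")[1:5])
        gc_ratio = (g + pseudo) / (c + pseudo)
        at_ratio = (a + pseudo) / (t + pseudo)
        balanced = low <= gc_ratio <= high and low <= at_ratio <= high
        biased.append(not balanced)
    return biased


example_lines = [
    ">>Per base sequence content\tpass",
    "#Base\tG\tA\tT\tC",
    "1\t40.0\t10.0\t30.0\t20.0",
    "2\t25.0\t25.0\t25.0\t25.0",
    ">>END_MODULE",
]
flags = flag_biased_positions(example_lines)
```

In this made-up input, position 1 has roughly twice as much G as C and is flagged, while position 2 is perfectly balanced and is not.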

flowcraft.templates.fastqc_report.get_sample_trim(p1_data, p2_data)

Get the optimal read trim range from data files of paired FastQ reads.

Given the FastQC data report files for paired-end FastQ reads, this function will assess the optimal trim range for the 5’ and 3’ ends of the paired-end reads. This assessment is based on the ‘Per base sequence content’ category, as parsed by trim_range.

Parameters:
p1_data: str

Path to FastQC data report file from pair 1

p2_data: str

Path to FastQC data report file from pair 2

Returns:
optimal_5trim: int

Optimal trim index for the 5’ end of the reads

optimal_3trim: int

Optimal trim index for the 3’ end of the reads

See also

trim_range
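
One plausible way to merge the two per-file ranges into a single range for the pair is sketched below. How the module actually combines them is not stated in this documentation, so the rule used here (keep the most conservative value at each end) and the function name are assumptions made for illustration.

```python
def combine_trim_ranges(p1_range, p2_range):
    """Merge two [5' index, 3' index] ranges into one range for the pair."""
    # Assumption: trim as far in as the worse of the two files requires.
    optimal_5trim = max(p1_range[0], p2_range[0])
    optimal_3trim = min(p1_range[1], p2_range[1])
    return optimal_5trim, optimal_3trim


# Hypothetical per-file ranges, in the [5' index, 3' index] form
# described for trim_range.
pair_trim = combine_trim_ranges([12, 151], [15, 149])
```

With these made-up ranges, the pair would be trimmed from position 15 to position 149.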

flowcraft.templates.fastqc_report.get_summary(summary_file)

Parses a FastQC summary report file and returns it as a dictionary.

This function parses a typical FastQC summary report file, retrieving only the information on the first two columns. For instance, a line could be:

'PASS   Basic Statistics        SH10762A_1.fastq.gz'

This parser will build a dictionary with the string in the second column as a key and the QC result as the value. In this case, the returned dict would be something like:

{"Basic Statistics": "PASS"}

Parameters:
summary_file: str

Path to FastQC summary report.

Returns:
summary_info: OrderedDict

Returns the information of the FastQC summary report as an ordered dictionary, with the categories as keys and the QC results as values.
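
The parsing described above can be sketched as follows. This illustration takes a list of lines instead of a file path, and the function name is hypothetical. It assumes the summary columns are tab-separated, as in a typical FastQC summary file.

```python
from collections import OrderedDict


def parse_summary_lines(lines):
    """Map each FastQC category (column 2) to its QC result (column 1)."""
    summary_info = OrderedDict()
    for line in lines:
        # Skip empty lines, such as a trailing newline at the end of the file.
        if not line.strip():
            continue
        fields = [x.strip() for x in line.split("\t")]
        # Only the first two columns are used; the file name is ignored.
        summary_info[fields[1]] = fields[0]
    return summary_info


summary = parse_summary_lines([
    "PASS\tBasic Statistics\tSH10762A_1.fastq.gz\n",
    "WARN\tPer base sequence content\tSH10762A_1.fastq.gz\n",
    "\n",
])
```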

flowcraft.templates.fastqc_report.check_summary_health(summary_file, **kwargs)

Checks the health of a sample from the FastQC summary file.

Parses the FastQC summary file and tests whether the sample is good or not. There are four categories that cannot fail, and two that must pass, in order for the sample to pass this check. If the sample fails the quality checks, a list with the failing categories is also returned.

Categories that cannot fail:

fail_sensitive = [
    "Per base sequence quality",
    "Overrepresented sequences",
    "Sequence Length Distribution",
    "Per sequence GC content"
]

Categories that must pass:

must_pass = [
    "Per base N content",
    "Adapter Content"
]

Parameters:
summary_file: str

Path to FastQC summary file.

Returns:
x : bool

Returns True if the sample passes all tests. False if not.

summary_info : list

A list with the FastQC categories that failed the tests. It is empty if the sample passes all tests.
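
The two rules can be sketched on an already parsed summary dictionary, as shown below. The function name is hypothetical and this is not the module's implementation. It assumes that "cannot fail" means a WARN result is tolerated, that "must pass" means only PASS is accepted, and that a must-pass category missing from the summary counts as a failure.

```python
FAIL_SENSITIVE = [
    "Per base sequence quality",
    "Overrepresented sequences",
    "Sequence Length Distribution",
    "Per sequence GC content",
]
MUST_PASS = [
    "Per base N content",
    "Adapter Content",
]


def check_health_sketch(summary_info):
    """Return (passed, failed_categories) for a parsed summary dict."""
    failed = []
    for category in FAIL_SENSITIVE:
        # Assumption: only an explicit FAIL breaks these categories.
        if summary_info.get(category) == "FAIL":
            failed.append(category)
    for category in MUST_PASS:
        # Assumption: anything other than PASS (including absence) fails.
        if summary_info.get(category) != "PASS":
            failed.append(category)
    return (not failed), failed


healthy = {category: "PASS" for category in FAIL_SENSITIVE + MUST_PASS}
# Same summary, with one tolerated WARN and one WARN in a must-pass category.
unhealthy = dict(healthy)
unhealthy["Per base sequence quality"] = "WARN"
unhealthy["Adapter Content"] = "WARN"

healthy_result = check_health_sketch(healthy)
unhealthy_result = check_health_sketch(unhealthy)
```

Under these assumptions the first summary passes with an empty list, while the second fails only on "Adapter Content", because the WARN in "Per base sequence quality" is tolerated.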