Prepare and configure the pipeline
Creating the configuration file: inputs and other parameters
The configuration file config.yaml lists all the input files required by the pipeline, along with a few other parameters.
Prepare the query genome data
Parameter in config.yaml | Format | Description | Example |
---|---|---|---|
annotationQuery | GFF3 | GFF file of the annotation to transfer onto the new target genome | annotationQuery: '/path/to/IWGSC_RefSeqv1_annotation.gff3' |
featureType | [STRING] | Feature used to anchor the annotation. Here we use the gene feature of the GFF. | featureType: 'gene' |
queryFasta | FASTA | FASTA file of the reference genome sequence. Must match the GFF file used in annotationQuery. | queryFasta: '/path/to/IWGSC_RefSeqv1_annotation.fasta' |
blastdb | [blast database] | BLAST database of all mRNAs (all isoforms) of the query annotation. Used to rescue genes whose transfer failed. | blastdb: 'data/IWGSCv1.1_all_mrna' |
chromosomes | python list | List of all the chromosomes in the query reference genome. Used to split the data per chromosome and speed up the analysis. | chromosomes: ['1A', '2A', '3A', '4A', '5A', '6A', '7A', '1B', '2B', '3B', '4B', '5B', '6B', '7B', '1D', '2D', '3D', '4D', '5D', '6D', '7D', 'U'] |
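The per-chromosome split driven by featureType and chromosomes can be pictured with the minimal sketch below. It is not the pipeline's own code: the function name split_features_by_chromosome and the toy GFF records are hypothetical, and it assumes that the sequence ID in the first GFF column is written exactly like the entries of the chromosomes list.

```python
from collections import defaultdict


def split_features_by_chromosome(gff_lines, feature_type, chromosomes):
    """Group GFF3 records of one feature type by chromosome name."""
    wanted = set(chromosomes)
    per_chrom = defaultdict(list)
    for line in gff_lines:
        line = line.rstrip("\n")
        # Skip blank lines, directives and comments.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a well-formed GFF3 record
        seqid, feature = fields[0], fields[2]
        # Assumption: seqid is spelled exactly like the config entry.
        if feature == feature_type and seqid in wanted:
            per_chrom[seqid].append(line)
    return dict(per_chrom)


# Toy records (hypothetical IDs), one of them on an unlisted chromosome.
example_gff = [
    "##gff-version 3",
    "1A\tsrc\tgene\t100\t900\t.\t+\t.\tID=geneA",
    "1A\tsrc\tmRNA\t100\t900\t.\t+\t.\tID=geneA.1;Parent=geneA",
    "2B\tsrc\tgene\t50\t400\t.\t-\t.\tID=geneB",
    "9Z\tsrc\tgene\t1\t10\t.\t+\t.\tID=geneC",
]

groups = split_features_by_chromosome(example_gff, "gene", ["1A", "2B"])
for chrom, records in sorted(groups.items()):
    print(chrom, len(records))
```

Only records of the requested feature type on a listed chromosome are kept, so each chromosome can then be processed independently.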
Prepare the markers/ISBPs input data
The pipeline uses markers/ISBPs as anchors to narrow the search to the restricted region of the target genome where each gene is expected to be located.
Parameter in config.yaml | Format | Description | Example |
---|---|---|---|
isbpBed | BED | Initial coordinates of the markers/ISBPs on the query genome | isbpBed: 'data/ISBP_refseqv1.bed' |
mapq | [INT] | Minimum mapping quality for a marker/ISBP to be kept in the anchoring process | mapq: 30 |
mismatches | [INT] | Maximum number of mismatches allowed for a marker to be kept in the analysis | mismatches: 0 |
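As a rough illustration of how mapq and mismatches act together, the sketch below filters a list of marker alignments. The MarkerHit type, the filter_markers function and the sample values are hypothetical, and the thresholds are assumed to be inclusive (mapping quality at least mapq, mismatches at most mismatches); the pipeline's actual filtering may differ.

```python
from typing import NamedTuple


class MarkerHit(NamedTuple):
    """Hypothetical summary of one marker/ISBP alignment on the target genome."""
    name: str
    mapq: int
    mismatches: int


def filter_markers(hits, min_mapq, max_mismatches):
    """Keep hits with MAPQ >= min_mapq and mismatches <= max_mismatches."""
    return [
        hit for hit in hits
        if hit.mapq >= min_mapq and hit.mismatches <= max_mismatches
    ]


hits = [
    MarkerHit("isbp_1", 60, 0),  # kept
    MarkerHit("isbp_2", 12, 0),  # dropped: mapping quality too low
    MarkerHit("isbp_3", 60, 3),  # dropped: too many mismatches
    MarkerHit("isbp_4", 30, 0),  # kept: exactly at the MAPQ threshold
]

kept = filter_markers(hits, min_mapq=30, max_mismatches=0)
print([hit.name for hit in kept])
```

Stricter values leave fewer anchors but make each remaining anchor more reliable.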
Prepare the target genome data
Here we need the FASTA file of the new genome assembly, along with its BWA and GMAP indexes.
Parameter in config.yaml | Format | Description | Example |
---|---|---|---|
targetFasta | FASTA | FASTA file of the target genome assembly onto which the annotation is transferred | targetFasta: 'data/CS_pesudo_v2.1.fa' |
targetBwaIdx | BWA index | Prefix of the BWA index files | targetBwaIdx: '/data/db/triticum_aestivum/julius/current/bwa/all' |
targetGmapIndex | PATH | Name of the GMAP index directory. This will be used with the -d option of GMAP. | targetGmapIndex: 'ensembl_Triticum_aestivum_julius_2022-9-16' |
targetGmapIndexPath | PATH | Full path to the directory in which the GMAP index is found. This will be used with the -D option of GMAP. | targetGmapIndexPath: '/data/db/triticum_aestivum/julius/gmapdb/all/' |
Output parameters/settings
Parameter in config.yaml | Format | Description | Example |
---|---|---|---|
results | [STRING] | Directory in which all the Snakemake rules will be executed | results: 'results' |
finalPrefix | [STRING] | Prefix for all the final output files (annotation, mRNA/pep FASTA sequences, etc.) | finalPrefix: 'IWGSC_refseqv2.0_annotv2.0' |
chromMapID | CSV | Mapping file that sets the correspondence between the chromosome names in the GFF and the chromosome IDs used in the newly generated gene IDs | chromMapID: 'data/chromosomeMappingID.csv' |
transferType | [STRING] | Transfer all isoforms, or only the representative transcript (.1) of each gene in the reference genome | transferType: 'all' or transferType: 'first' |
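The difference between the two transferType values can be sketched as follows. This is only an illustration under the assumption that the representative transcript is recognised by an ID ending in ".1", as the description above suggests; select_transcripts and the transcript IDs are hypothetical.

```python
def select_transcripts(transcript_ids, transfer_type):
    """Return the transcripts to transfer for a given transferType value."""
    if transfer_type == "all":
        return list(transcript_ids)
    if transfer_type == "first":
        # Assumption: the representative isoform carries the ".1" suffix.
        return [t for t in transcript_ids if t.rsplit(".", 1)[-1] == "1"]
    raise ValueError(f"unsupported transferType: {transfer_type!r}")


# Hypothetical transcript IDs; "geneC.11" must not be mistaken for ".1".
ids = ["geneA.1", "geneA.2", "geneB.1", "geneC.11"]

print(select_transcripts(ids, "all"))
print(select_transcripts(ids, "first"))
```

Splitting on the last dot, rather than testing the final characters of the ID, avoids confusing isoform 11 with isoform 1.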
Example of a chromosomeMappingID.csv file:
$ cat data/chromosomeMappingID.csv
Chr1A 1A
Chr1B 1B
Chr1D 1D
Chr2A 2A
Chr2B 2B
Chr2D 2D
Chr3A 3A
Chr3B 3B
Chr3D 3D
Chr4A 4A
Chr4B 4B
Chr4D 4D
Chr5A 5A
Chr5B 5B
Chr5D 5D
Chr6A 6A
Chr6B 6B
Chr6D 6D
Chr7A 7A
Chr7B 7B
Chr7D 7D
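A mapping file like the one above could be loaded into a dictionary as in this sketch. The parse_chrom_map function is hypothetical, and because the example shows whitespace between the two columns while the file is named .csv, the sketch accepts either whitespace or a comma as separator; that leniency is an assumption, not documented behaviour.

```python
def parse_chrom_map(lines):
    """Map chromosome names (column 1) to the IDs used in gene names (column 2)."""
    mapping = {}
    for number, raw in enumerate(lines, start=1):
        # Assumption: columns are separated by whitespace or by a comma.
        fields = raw.replace(",", " ").split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) != 2:
            raise ValueError(
                f"line {number}: expected 2 columns, got {len(fields)}"
            )
        source_name, gene_id_name = fields
        mapping[source_name] = gene_id_name
    return mapping


example = ["Chr1A 1A", "Chr1B 1B", "Chr1D 1D"]
chrom_map = parse_chrom_map(example)
print(chrom_map["Chr1B"])
```

To read the real file, an open file handle can be passed directly, for instance `with open('data/chromosomeMappingID.csv') as handle: parse_chrom_map(handle)`.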
Once all these parameters have been set, the final configuration file may look like this:
##### QUERY related files/parameters (refseqv1.0)
# GFF annotation to transfer
annotationQuery: 'data/IWGSC_v1.1_20170706.gff3'
# feature type used for anchoring on target genome
featureType: 'gene'
# FASTA of the query (used to check the sequences after the coordinates are calculated on the target genome)
queryFasta: 'data/161010_Chinese_Spring_v1.0_pseudomolecules.fasta'
# blastdb of all mRNAs, used to rescue genes whose transfer failed with the targeted approach
blastdb: 'data/IWGSCv1.1_all_mrna'
# map of all chromosome ids --> NEEDS TO BE UPDATED in another version WITH ONE ARRAY FOR THE QUERY AND ONE ARRAY FOR THE TARGET GENOME ASSEMBLY
chromosomes: ['1A', '2A', '3A', '4A', '5A', '6A', '7A', '1B', '2B', '3B', '4B', '5B', '6B', '7B', '1D', '2D', '3D', '4D', '5D', '6D', '7D', 'U']
refChrom: ['chr1A', 'chr1B', 'chr1D', 'chr2A', 'chr2B', 'chr2D', 'chr3A', 'chr3B', 'chr3D', 'chr4A', 'chr4B', 'chr4D', 'chr5A', 'chr5B', 'chr5D', 'chr6A', 'chr6B', 'chr6D', 'chr7A', 'chr7B', 'chr7D', 'chrUn']
##### TARGET related files/parameters (refseqv2.1)
targetFasta: 'data/CS_pesudo_v2.1.fa'
#GMAP index of the genome for -d option
targetGmapIndex: 'ensembl_Triticum_aestivum_julius_2022-9-16'
#GMAP index: path to the gmapindex directory, for -D option
targetGmapIndexPath: '/data/db/triticum_aestivum/julius/current/gmapdb/all/'
##### ISBP/markers related config and parameters
# BAM file of markers/ISBPs mapped on the target genome
# isbpBam: '/home/masirvent/wheat10plus-pangenome/data/mappingISBP/session2/arina/arina_CS_ISBP.bam'
# BED file of coordinates on the query genome (REFSEQ v2.1)
isbpBed: 'data/Tae.Chinese_Spring.refSeqv2.1.ISBPs.bed'
# BWA threads for mapping
bwaThreads: 16
# FLAG : F flag for samtools
flag_F: 3844
# minimum mapping quality of markers on the target genome
mapq: 30
# max mismatches per ISBP/marker
mismatches: 2
##### OUTPUT directory
results: 'resultsDEV'
finalPrefix: 'IWGSC_refseqv2.0_annotv2.0'
# this file contains two columns: the first is the chromosome name as it appears in the genome.fasta of the new reference,
# and the second is the chromosome name as it will appear in the new gene names
chromMapID: 'data/chromosomeMappingID.csv'
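The value flag_F: 3844 in the configuration above is easier to read once decoded into its individual SAM flag bits. The sketch below does that decoding. It assumes the value is handed to the -F option of samtools view, which discards any alignment that has at least one of the listed bits set; the helper names decode_sam_flag and is_excluded are hypothetical.

```python
# Bit meanings as defined in the SAM format specification.
SAM_FLAG_BITS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "not primary alignment",
    0x200: "read fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_sam_flag(value):
    """List the SAM flag bits that are set in value."""
    return [label for bit, label in SAM_FLAG_BITS.items() if value & bit]


def is_excluded(read_flag, exclude_mask):
    """Mimic an exclusion mask: drop a read if it shares any bit with the mask."""
    return (read_flag & exclude_mask) != 0


for label in decode_sam_flag(3844):
    print(label)
```

With this mask, unmapped reads, secondary and supplementary alignments, duplicates and reads failing quality checks are discarded, while a primary alignment on the reverse strand (flag 16) is kept.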