本文总结利用RNA-seq数据进行snp calling的流程。
1. 构建参考基因组index
1 2 3
| bwa index Dmel.dna.fa samtools faidx Dmel.dna.fa picard CreateSequenceDictionary -R ~/00.reference/01.Dmel/Dmel.dna.fa -O Dmel.dna.dict
|
2. bam文件预处理
对bam文件进行排序。
1 2
| samtools merge -@ 10 -l 9 merged.bam sample1.bam sample2.bam sample3.bam samtools sort -@ 10 -l 9 merged.bam -o merged.sort.bam
|
3. 去重复mark duplicates
去除掉建库、测序过程中偏好性导致的一些”假阳性”地增高的reads。
1 2 3 4
| picard MarkDuplicates -I merged.sort.bam \ -M merged_mark_metrics.txt \ --REMOVE_DUPLICATES 1 \ -O merged.sort.removed.bam
|
4-1. GATK snp calling
1. SplitNCigarReads
这一步是针对RNA-seq数据call snp特有的步骤。mRNA转录本主要是由DNA的外显子exon的可变剪切,不包括intron部分。因此,在RNA-seq数据call snp时,如果直接比对到基因组上,若比对到intron的N区域,就会有假阳性的snp,为了减少假阳性,需要将reads进行切割,跨越intron区域。若为n个intron的read片段,则切割为n+1个子read片段。
1 2 3
| gatk SplitNCigarReads -R ~/00.reference/01.Dmel/Dmel.dna.fa \ -I merged.sort.removed.bam \ -O merged.sort.removed.splited.bam
|
2. GATK snp calling
HaplotypeCaller进行变异检测,之后,只保留SNP位点,并对SNP进行过滤:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
| gatk HaplotypeCaller -R ~/00.reference/01.Dmel/Dmel.dna.fa \ --ploidy 2 -I 03.bam/merged.sort.removed.bam \ -O 04.call_snp/all.merged.vcf.gz
gatk SelectVariants -R ~/00.reference/01.Dmel/Dmel.dna.fa \ -O 04.call_snp/all.merged.snp.vcf \ --variant 04.call_snp/all.merged.vcf.gz \ --select-type-to-include SNP
gatk VariantFiltration -R ~/00.reference/01.Dmel/Dmel.dna.fa \ -V 04.call_snp/all.merged.snp.vcf.gz \ --cluster-size 4 --cluster-window-size 10 \ --mask-name aroundIndel --mask 04.call_snp/all.merged.indel.vcf.gz -mask-extension 3 \ --filter-name "lowMQRankSum" --filter-expression "MQRankSum < 12.5" \ --filter-name "highFS" --filter-expression "FS > 60.0" \ --filter-name "lowReadPosRankSum" --filter-expression "ReadPosRankSum < -8.0" \ --filter-name "lowMQ" --filter-expression "MQ < 40.0" \ --filter-name "lowQD" --filter-expression "QD < 2.0" \ --genotype-filter-name "lowDP" --genotype-filter-expression "DP < 8.0" \ -O 05.filter_snp/all.merged.filered.snp.vcf.gz
|
去重复之后,也可利用bcftools进行snp calling,并进行过滤。
1 2 3
| bcftools mpileup -Ou -f ~/00.reference/01.Dmel/Dmel.dna.fa 03.bam/merged.sort.removed.bam | bcftools call -mv -o all.merged.vcf.gz bcftools filter all.merged.vcf.gz -e 'QUAL<30' --SnpGap 5 -Oz -o all.filtered.merged.vcf.gz; bcftools filter all.filtered.merged.vcf.gz -e 'DP<10' -Oz -o all.filtered2.merged.vcf.gz;
|