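Bioinformatics uses many file formats, and they do not all number chromosome coordinates the same way. The difference is easy to get wrong when processing files, so this note records it.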
1. 0-based vs. 1-based
In a 0-based system the first base in the file has coordinate 0; in a 1-based system the first base has coordinate 1.
Feature | 0-based | 1-based |
---|---|---|
Coordinate of the first base | 0 | 1 |
Interval | half-open [start,end) | closed [start,end] |
Interval length | end-start | end-start+1 |
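In short, the two systems differ as follows:
- 1-based: each number refers directly to a base
- 0-based: each number marks a boundary, and a base lies between two consecutive numbers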
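The two conventions can be checked on a toy sequence. The sketch below is only an illustration: the sequence and the variable names are made up, and it relies on the fact that Python slicing is itself 0-based and half-open.

```python
# Toy sequence; positions counted from 1 are A(1) C(2) G(3) T(4) A(5) C(6) G(7) T(8).
seq = "ACGTACGT"

# The same three bases (G, T, A) described in both systems.
start_1, end_1 = 3, 5   # 1-based, closed interval [start, end]
start_0, end_0 = 2, 5   # 0-based, half-open interval [start, end)

# Interval length follows the formulas in the table.
len_1 = end_1 - start_1 + 1
len_0 = end_0 - start_0

# Python slices are 0-based and half-open, so the 0-based pair is used as is,
# while the 1-based start has to be shifted down by one.
sub_from_0 = seq[start_0:end_0]
sub_from_1 = seq[start_1 - 1:end_1]

print(len_0, len_1, sub_from_0, sub_from_1)
```

Both descriptions select the same bases and report the same length, which is a quick sanity check when a file's convention is in doubt.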
2. Converting between 0-based and 1-based coordinates
- 0-based to 1-based
  - Start(1-based) = Start(0-based) + 1
  - End(1-based) = End(0-based)
- 1-based to 0-based
  - Start(0-based) = Start(1-based) - 1
  - End(0-based) = End(1-based)
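The conversion rules above can be sketched as two small helpers. The function names are hypothetical, and the sketch assumes that 0-based intervals are half-open, that 1-based intervals are closed, and that intervals are non-empty.

```python
from typing import Tuple


def zero_to_one(start: int, end: int) -> Tuple[int, int]:
    """Convert a 0-based half-open [start, end) to a 1-based closed [start, end]."""
    # Assumption: empty intervals are rejected, because a 1-based closed
    # interval cannot represent them.
    if start < 0 or end <= start:
        raise ValueError(f"invalid 0-based half-open interval: {start}, {end}")
    return start + 1, end


def one_to_zero(start: int, end: int) -> Tuple[int, int]:
    """Convert a 1-based closed [start, end] to a 0-based half-open [start, end)."""
    if start < 1 or end < start:
        raise ValueError(f"invalid 1-based closed interval: {start}, {end}")
    return start - 1, end


# The first ten bases of a chromosome in both systems.
print(zero_to_one(0, 10))   # (1, 10)
print(one_to_zero(1, 10))   # (0, 10)
```

Only the start changes in either direction. The end stays the same because the exclusive 0-based end and the inclusive 1-based end happen to be the same number.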
3. Summary of the coordinate systems used by file formats
0-based formats:
Format | Type |
---|---|
BED | 0-based |
BAM | 0-based |
BCF | 0-based |
narrowPeak(MACS2) | 0-based |
SAF (featureCounts) | 0-based |
bedGraph | 0-based |
UCSC Genome Browser tables | 0-based |
1-based formats:
Format | Type |
---|---|
GTF | 1-based |
GFF | 1-based |
SAM | 1-based |
VCF | 1-based |
Wiggle | 1-based |
GenomicRanges | 1-based |
IGV | 1-based |
BLAST | 1-based |
GenBank/EMBL Feature Table | 1-based |
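When intervals from several formats are mixed in one script, one option is to normalize everything to a single convention first. The sketch below is a hypothetical helper, not part of any library: it covers only a few interval formats from the tables above and converts their start and end to 0-based half-open coordinates.

```python
from typing import Tuple

# Assumption: only a small subset of the formats listed above is covered here.
ZERO_BASED_FORMATS = {"BED", "bedGraph", "narrowPeak"}
ONE_BASED_FORMATS = {"GTF", "GFF"}


def to_zero_based_half_open(fmt: str, start: int, end: int) -> Tuple[int, int]:
    """Return (start, end) as a 0-based half-open interval for a known format."""
    if fmt in ZERO_BASED_FORMATS:
        # Already 0-based half-open, nothing to change.
        return start, end
    if fmt in ONE_BASED_FORMATS:
        # 1-based closed: shift the start down by one, keep the end.
        return start - 1, end
    raise ValueError(f"unknown format: {fmt}")


# The same ten bases written in a GTF record and in a BED record.
print(to_zero_based_half_open("GTF", 11, 20))   # (10, 20)
print(to_zero_based_half_open("BED", 10, 20))   # (10, 20)
```

Raising an error for an unlisted format is a deliberate choice in this sketch: silently assuming a convention is how off-by-one errors usually slip in.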