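Bioinformatics has many file formats, and they do not all use the same convention for chromosome coordinates. The conventions are easy to mix up when processing files, so this note records them.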
1. 0-based vs. 1-based
In a 0-based system the first base has coordinate 0; in a 1-based system the first base has coordinate 1.
| Property | 0-based | 1-based |
|---|---|---|
| Coordinate of the first base | 0 | 1 |
| Interval | half-open [start, end) | closed [start, end] |
| Interval length | end - start | end - start + 1 |
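In short, the difference between 0-based and 1-based is as follows:
- 1-based: each number directly denotes a base.
- 0-based: a base lies between two consecutive numbers.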
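
As a minimal sketch of the two conventions, the Python snippet below addresses the same bases of a toy sequence in both systems. The sequence and the helper names `slice_0based` and `slice_1based` are made up for illustration; Python's own slicing happens to be 0-based and half-open, so only the 1-based helper needs a shift.

```python
seq = "ACGTACGT"  # toy sequence, for illustration only


def slice_0based(s, start, end):
    """0-based, half-open [start, end): identical to Python slicing."""
    return s[start:end]


def slice_1based(s, start, end):
    """1-based, closed [start, end]: shift only the start down by one."""
    return s[start - 1:end]


# Both calls select the 2nd through 4th bases of seq.
print(slice_0based(seq, 1, 4))
print(slice_1based(seq, 2, 4))

# Interval length: end - start (0-based) vs. end - start + 1 (1-based).
print(4 - 1, 4 - 2 + 1)
```

Both calls return the same three bases, and both length formulas give 3, which matches the table above.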

2. Converting between 0-based and 1-based coordinates
- 0-based to 1-based
  - Start(1-based) = Start(0-based) + 1
  - End(1-based) = End(0-based)
- 1-based to 0-based
  - Start(0-based) = Start(1-based) - 1
  - End(0-based) = End(1-based)
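
The conversion rules above can be sketched as two small Python functions. The function names are hypothetical, and the sketch assumes the 0-based interval is half-open and the 1-based interval is closed, as in the table in section 1.

```python
def zero_to_one(start, end):
    """0-based half-open [start, end) -> 1-based closed [start, end]."""
    return start + 1, end


def one_to_zero(start, end):
    """1-based closed [start, end] -> 0-based half-open [start, end)."""
    return start - 1, end


s1, e1 = zero_to_one(999, 2000)
print(s1, e1)

# A round trip gives back the original interval.
print(one_to_zero(s1, e1))

# The interval length is the same in both systems.
print(2000 - 999, e1 - s1 + 1)
```

Only the start coordinate changes; the end coordinate is numerically identical in both systems, which is why the length formulas differ by one.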
3. Summary of the coordinate systems used by file formats
0-based formats:
| Format | Type |
|---|---|
| BED | 0-based |
| BAM | 0-based |
| BCF | 0-based |
| narrowPeak (MACS2) | 0-based |
| bedGraph | 0-based |
| UCSC Genome Browser tables | 0-based |
1-based formats:
| Format | Type |
|---|---|
| GTF | 1-based |
| GFF | 1-based |
| SAM | 1-based |
| VCF | 1-based |
| Wiggle | 1-based |
| SAF (featureCounts) | 1-based |
| GenomicRanges | 1-based |
| IGV | 1-based |
| BLAST | 1-based |
| GenBank/EMBL Feature Table | 1-based |
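
As an illustration of why the distinction matters in practice, here is a minimal sketch that moves an interval between a BED line (0-based, half-open) and GTF/GFF-style coordinates (1-based, closed). The function names and the example record are hypothetical, and the sketch reads only the first three tab-separated BED columns.

```python
def bed_to_gff_coords(bed_line):
    """Parse chrom/start/end from a BED line and return 1-based closed coordinates."""
    chrom, start, end = bed_line.rstrip("\n").split("\t")[:3]
    return chrom, int(start) + 1, int(end)


def gff_to_bed_coords(chrom, start, end):
    """Convert 1-based closed coordinates back to a 0-based half-open BED interval."""
    return chrom, start - 1, end


bed_line = "chr1\t999\t2000\tpeak1\n"  # made-up example record
chrom, start, end = bed_to_gff_coords(bed_line)
print(chrom, start, end)
print(gff_to_bed_coords(chrom, start, end))
```

Forgetting the shift when converting between formats moves every feature by one base, which is easy to miss because the resulting files remain syntactically valid.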