How to Sort BED Files Effectively
BED (Browser Extensible Data) files are text-formatted files widely used in genomic interval analysis and visualization.
The BED format represents genomic features such as genes, exons, CDS, and other custom regions.
Bioinformatics tools such as bedtools and the UCSC Genome Browser use the BED format for tasks such as finding overlapping genomic intervals, merging genomic intervals, and visualizing genomic features.
Several of these analyses, such as merging intervals and visualization, require BED files sorted by chromosome and then by start position.
BED files can be sorted using various methods, such as the bedtools sort command and the UNIX sort command.
The following examples explain how to use the bedtools sort and sort commands to sort BED files.
bedtools sort
bedtools has a sort function that sorts BED files by chromosome and start position.
The example BED file with genomic intervals:
# BED file
cat file.bed
chr1 1 10
chr1 30 35
chr1 20 25
chr1 38 50
This example BED file is not sorted, so we will sort it using the bedtools sort command.
bedtools sort -i file.bed
# output
chr1 1 10
chr1 20 25
chr1 30 35
chr1 38 50
You can see that the BED file is sorted by the chromosome and start position.
The bedtools sort command can be slow for large BED files. If you have a large BED file, it is recommended to use the UNIX sort command, which is faster and consumes less memory.
UNIX sort
In addition to bedtools sort, you can also use the UNIX sort command to sort a BED file by chromosome and start position.
The example BED file with genomic intervals:
# BED file
cat file.bed
chr1 1 10
chr1 30 35
chr1 20 25
chr1 38 50
This example BED file is not sorted, so we will sort it using the UNIX sort command.
sort -k1,1 -k2,2n file.bed
# output
chr1 1 10
chr1 20 25
chr1 30 35
chr1 38 50
You can see that the UNIX sort command sorted the BED file by chromosome and start position.
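For larger inputs, the same sort keys can be combined with sort's buffer and temporary-directory options, and the result can be verified with sort -c. The sketch below is only illustrative: it assumes GNU sort, and the scratch directory, the 64M buffer size, and the variable names are hypothetical choices, not recommendations.

```shell
# Scratch copy of the example intervals (hypothetical temporary directory).
workdir=$(mktemp -d)
printf 'chr1\t1\t10\nchr1\t30\t35\nchr1\t20\t25\nchr1\t38\t50\n' > "$workdir/file.bed"

# Same keys as in the example; -S caps the in-memory buffer and -T chooses
# where temporary spill files are written (both values are illustrative).
sort -k1,1 -k2,2n -S 64M -T "$workdir" "$workdir/file.bed" > "$workdir/sorted.bed"

# sort -c exits with status 0 only if the input is already ordered
# according to the given keys, so it can serve as a quick sanity check.
if sort -c -k1,1 -k2,2n "$workdir/sorted.bed" 2>/dev/null; then
    sorted_status="sorted"
else
    sorted_status="unsorted"
fi

if sort -c -k1,1 -k2,2n "$workdir/file.bed" 2>/dev/null; then
    input_status="sorted"
else
    input_status="unsorted"
fi

echo "input: $input_status, output: $sorted_status"
```

Checking the order before running a tool that expects sorted input is cheaper than re-sorting a file that is already in order.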
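The following example shows how to sort a BED file in natural (version) order of chromosome names, so that chr2 comes before chr10, and then by start and end coordinates. The V modifier used here is a GNU sort extension.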
The example BED file with genomic intervals:
# BED file
cat file.bed
chr1 1 10
chr10 30 99
chr10 30 60
chr2 40 50
chrX 60 80
chrX 60 70
Sort the BED file by chromosome in natural order, then by start and end coordinates:
sort -k1,1V -k2,2n -k3,3n file.bed
# output
chr1 1 10
chr2 40 50
chr10 30 60
chr10 30 99
chrX 60 70
chrX 60 80
In the above command, -k1,1V sorts the chr column in natural (version) order, -k2,2n sorts the start position numerically, and -k3,3n sorts the end position numerically.
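To see what the V modifier changes, the following sketch compares a plain key on the chromosome column with a version-sort key, using the same example intervals. It assumes GNU sort (for -V); the temporary file and the variable names are hypothetical and used only for illustration.

```shell
# Scratch copy of the example intervals (hypothetical temporary file).
bed=$(mktemp)
printf 'chr1\t1\t10\nchr10\t30\t99\nchr10\t30\t60\nchr2\t40\t50\nchrX\t60\t80\nchrX\t60\t70\n' > "$bed"

# Plain key on column 1: byte-wise comparison in the C locale.
# cut/uniq/paste reduce the result to the order of chromosome names.
lexical_order=$(LC_ALL=C sort -k1,1 -k2,2n -k3,3n "$bed" | cut -f1 | uniq | paste -sd' ' -)

# Version-sort key on column 1: digit runs are compared as numbers.
natural_order=$(LC_ALL=C sort -k1,1V -k2,2n -k3,3n "$bed" | cut -f1 | uniq | paste -sd' ' -)

echo "plain key:   $lexical_order"
echo "version key: $natural_order"
```

With this data, the plain key places chr10 before chr2, while the version key places chr2 before chr10. Whichever order is chosen, it generally needs to match the chromosome order expected by the downstream tool.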