How to Merge BED Files And Retain Other Columns
bedtools merge is a useful bioinformatics tool for merging overlapping or book-ended genomic intervals in a BED file.
Most of the time, BED files contain only the three required columns (chrom, start, end), but often there is a fourth name column for feature annotation.
When you merge a BED file with four or more columns, the information in the fourth column is not retained in the merged output by default.
However, bedtools merge has additional parameters (-c and -o) for retaining the information from additional columns.
The following example explains how to use bedtools merge to merge the overlapping intervals from BED files while retaining the information in the additional columns.
The example BED files below have an additional fourth column for feature annotation:
# BED file
cat file1.bed
chr1 10 100 exon1_f1
chr1 400 500 exon3_f1
cat file2.bed
chr1 50 200 exon1_f2
chr1 600 700 exon4_f2
These example BED files contain overlapping genomic intervals.
Now, merge the overlapping genomic intervals from both BED files into single intervals using bedtools merge.
The bedtools merge command requires a BED file sorted by chromosome and start position, so the concatenated files are piped through bedtools sort first. Please read this article on how to sort a BED file effectively.
cat file1.bed file2.bed | bedtools sort | bedtools merge
# output
chr1 10 200
chr1 400 500
chr1 600 700
You can see that the overlapping genomic intervals are merged into a single interval. However, the name column (fourth column) information is not retained in the merged output.
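To make the merging step concrete, here is a minimal pure-Python sketch of the idea. It is not how bedtools is implemented, and the function name merge_intervals is hypothetical. It assumes the input is already sorted by chromosome and start, and it tracks only the coordinates, which is why nothing from the name column survives.

```python
def merge_intervals(records):
    """Merge overlapping or book-ended intervals.

    `records` must already be sorted by chromosome and start position,
    mirroring the requirement of bedtools merge. Each record is a
    (chrom, start, end) tuple. This is an illustrative sketch only.
    """
    merged = []
    for chrom, start, end in records:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps or touches the current interval: extend its end.
            merged[-1][2] = max(merged[-1][2], end)
        else:
            # Starts a new merged interval.
            merged.append([chrom, start, end])
    return [tuple(m) for m in merged]


# Coordinates taken from file1.bed and file2.bed (name column left out).
file1 = [("chr1", 10, 100), ("chr1", 400, 500)]
file2 = [("chr1", 50, 200), ("chr1", 600, 700)]

# Concatenate and sort by chromosome, then start, before merging.
records = sorted(file1 + file2)
result = merge_intervals(records)

for chrom, start, end in result:
    print(chrom, start, end, sep="\t")
```

The single pass works only because the records are sorted: each interval needs to be compared with just the most recent merged interval.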
To retain the name column (fourth column) information, you can use the -c parameter with bedtools merge. The -c parameter specifies which columns of the input BED file to operate on, and the -o parameter specifies the operation to apply to each of those columns.
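To retain the name column (fourth column) information in the output, we will use the collapse operator as the value for the -o parameter. For each merged interval, collapse lists the values from all contributing intervals, separated by commas.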
cat file1.bed file2.bed | bedtools sort | bedtools merge -c 4 -o collapse
# output
chr1 10 200 exon1_f1,exon1_f2
chr1 400 500 exon3_f1
chr1 600 700 exon4_f2
You can see that the output contains the merged intervals and feature annotation information from both BED files.
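The way -c and -o work together can be sketched in plain Python. This is an illustration under stated assumptions, not bedtools code: the names OPERATIONS and merge_with_column are hypothetical, only three operations are modelled, and the ordering of values returned by the distinct stand-in is an assumption of this sketch.

```python
# Stand-ins for a few -o operations; each receives the list of values
# gathered from one merged interval and returns a single string.
OPERATIONS = {
    "collapse": lambda values: ",".join(values),              # keep every value
    "distinct": lambda values: ",".join(sorted(set(values))),  # unique values (order assumed)
    "count": lambda values: str(len(values)),                  # number of merged records
}


def merge_with_column(records, column, operation):
    """Merge sorted records and summarise one extra column.

    `column` is 1-based, like the -c parameter, and `operation` is a key
    of OPERATIONS, like the -o parameter.
    """
    summarise = OPERATIONS[operation]
    clusters = []  # each entry: [chrom, start, end, [values]]
    for row in records:
        chrom, start, end = row[0], row[1], row[2]
        value = str(row[column - 1])
        if clusters and clusters[-1][0] == chrom and start <= clusters[-1][2]:
            clusters[-1][2] = max(clusters[-1][2], end)
            clusters[-1][3].append(value)
        else:
            clusters.append([chrom, start, end, [value]])
    return [(c, s, e, summarise(v)) for c, s, e, v in clusters]


# Rows of file1.bed and file2.bed, concatenated and sorted.
records = sorted([
    ("chr1", 10, 100, "exon1_f1"),
    ("chr1", 400, 500, "exon3_f1"),
    ("chr1", 50, 200, "exon1_f2"),
    ("chr1", 600, 700, "exon4_f2"),
])

collapsed = merge_with_column(records, column=4, operation="collapse")
counted = merge_with_column(records, column=4, operation="count")

for row in collapsed:
    print(*row, sep="\t")
```

Swapping the operation name changes only how the collected values are summarised; the merged coordinates stay the same.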
Similarly, you can use bedtools merge to merge the overlapping intervals within a single BED file.
# BED file
cat file3.bed
chr1 100 200 exon1
chr1 150 300 exon1
chr1 500 600 exon2
bedtools merge -i file3.bed -c 4 -o collapse
# output
chr1 100 300 exon1,exon1
chr1 500 600 exon2
You can also use the bedtools merge command to filter out the overlapping regions and keep only the non-overlapping regions of a BED file.
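The article does not show the command for this, so the following is only one possible approach, stated as an assumption: count how many input intervals went into each merged interval and keep the ones with a count of 1. With bedtools this would roughly correspond to using count as the -o operation and then filtering on the resulting column. The pure-Python sketch below illustrates the logic; the function name non_overlapping is hypothetical.

```python
def non_overlapping(records):
    """Return intervals that do not overlap (or touch) any other interval.

    `records` must be sorted by chromosome and start. Each record starts
    with (chrom, start, end); any further columns are ignored.
    Illustrative sketch only, not bedtools code.
    """
    clusters = []  # each entry: [chrom, start, end, number_of_records]
    for chrom, start, end, *_ in records:
        if clusters and clusters[-1][0] == chrom and start <= clusters[-1][2]:
            clusters[-1][2] = max(clusters[-1][2], end)
            clusters[-1][3] += 1
        else:
            clusters.append([chrom, start, end, 1])
    # A count of 1 means nothing was merged into this interval.
    return [(c, s, e) for c, s, e, n in clusters if n == 1]


# Rows of file3.bed (already sorted).
file3 = [
    ("chr1", 100, 200, "exon1"),
    ("chr1", 150, 300, "exon1"),
    ("chr1", 500, 600, "exon2"),
]

kept = non_overlapping(file3)
for chrom, start, end in kept:
    print(chrom, start, end, sep="\t")
```

In this sketch, book-ended intervals are treated as merged, so they are filtered out along with truly overlapping ones.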