Merge Regions in BED File by Feature Name
In bioinformatics, analyzing BED files often involves merging genomic intervals (regions) that share a common feature name (the fourth column of the BED file) into a single region.
In this case, you can use the bedtools groupby function to merge genomic intervals based on the name column.
For example, suppose you have the following BED file with feature names in the fourth column:
cat file1.bed
chr1 10 100 exon1
chr1 60 200 exon1
chr2 200 500 exon2
chr2 350 450 exon2
chr2 600 700 exon2
Now, merge the regions in the BED file by feature name using the bedtools groupby function:
bedtools groupby -i file1.bed -g 1,4 -c 2,3,4 -o min,max,count
# output
chr1 exon1 10 200 2
chr2 exon2 200 700 3
In the above example, we merged the genomic intervals from the BED file based on the feature name.
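Note that bedtools groupby expects rows of the same group to be adjacent in the input, so an unsorted file can produce the same feature name on several output lines. The following is a minimal sketch of sorting before grouping; the file names unsorted.bed and sorted.bed are assumptions made for illustration, and the groupby call runs only if bedtools is installed.

```shell
# Hypothetical unsorted input (tab-separated); file names are illustrative only.
printf 'chr2\t600\t700\texon2\nchr1\t60\t200\texon1\nchr2\t200\t500\texon2\nchr1\t10\t100\texon1\nchr2\t350\t450\texon2\n' > unsorted.bed

# Sort by chromosome (column 1), then name (column 4), then numeric start (column 2),
# so that rows belonging to the same group end up next to each other.
sort -t "$(printf '\t')" -k1,1 -k4,4 -k2,2n unsorted.bed > sorted.bed

# Group the sorted file (skipped when bedtools is not on the PATH).
if command -v bedtools >/dev/null 2>&1; then
    bedtools groupby -i sorted.bed -g 1,4 -c 2,3,4 -o min,max,count
fi
```

After sorting, sorted.bed holds the same rows in the same order as file1.bed above.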
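The -g parameter specifies the columns used to group the intervals (here columns 1 and 4, the chromosome and the feature name). The -c parameter specifies the columns to summarize (here columns 2, 3, and 4).
The -o parameter specifies the operation applied to each column listed in -c, in the same order: min on column 2 (start), max on column 3 (end), and count on column 4 (the number of intervals in each group).
Each merged region therefore spans from the smallest start to the largest end of its group, even when the intervals in the group do not overlap (for example, chr2 600-700 is included in the exon2 region).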
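To make the min, max, and count operations concrete, here is a minimal awk sketch that computes the same summary for the sample data. It is only an illustration of the logic, not how bedtools is implemented, and the output file name grouped.tsv is an assumption. Unlike bedtools groupby, this sketch also groups rows that are not adjacent in the input.

```shell
# Sample input from the example above, written with tab separators.
printf 'chr1\t10\t100\texon1\nchr1\t60\t200\texon1\nchr2\t200\t500\texon2\nchr2\t350\t450\texon2\nchr2\t600\t700\texon2\n' > file1.bed

# Emulate: -g 1,4 -c 2,3,4 -o min,max,count
awk 'BEGIN { FS = OFS = "\t" }
{
    key = $1 FS $4                       # group key: chromosome + feature name
    if (!(key in count)) {               # first row seen for this group
        order[++n] = key                 # remember the order of first appearance
        min[key] = $2 + 0
        max[key] = $3 + 0
    }
    if ($2 + 0 < min[key]) min[key] = $2 + 0   # min on column 2 (start)
    if ($3 + 0 > max[key]) max[key] = $3 + 0   # max on column 3 (end)
    count[key]++                               # count on column 4
}
END {
    for (i = 1; i <= n; i++) {
        key = order[i]
        print key, min[key], max[key], count[key]
    }
}' file1.bed > grouped.tsv

cat grouped.tsv
```

For the sample file, grouped.tsv contains the same two lines as the groupby output shown above.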
The merged output is not in BED format. To get BED output, pipe the result to awk to rearrange the columns into chromosome, start, end, and name (the count column is dropped).
bedtools groupby -i file1.bed -g 1,4 -c 2,3,4 -o min,max,count | awk 'BEGIN{FS=OFS="\t"} {print $1,$3,$4,$2}'
# output
chr1 10 200 exon1
chr2 200 700 exon2
To save the merged output to a file, redirect it as shown below:
bedtools groupby -i file1.bed -g 1,4 -c 2,3,4 -o min,max,count | awk 'BEGIN{FS=OFS="\t"} {print $1,$3,$4,$2}' > merged.bed
cat merged.bed
chr1 10 200 exon1
chr2 200 700 exon2
In addition to merging regions by feature name, you can also merge genomic intervals based on whether they overlap in the BED file.
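A minimal sketch of the overlap-based approach uses bedtools merge, which requires input sorted by chromosome and start position. The file name file1.sorted.bed is an assumption for illustration, and the merge call runs only if bedtools is installed. The -c 4 -o distinct options are used here to carry the feature names through to the output.

```shell
# Sample input from the example above, written with tab separators.
printf 'chr1\t10\t100\texon1\nchr1\t60\t200\texon1\nchr2\t200\t500\texon2\nchr2\t350\t450\texon2\nchr2\t600\t700\texon2\n' > file1.bed

# bedtools merge needs the intervals sorted by chromosome, then by numeric start.
sort -k1,1 -k2,2n file1.bed > file1.sorted.bed

# Merge overlapping intervals and report the distinct names in each merged region
# (skipped when bedtools is not on the PATH).
if command -v bedtools >/dev/null 2>&1; then
    bedtools merge -i file1.sorted.bed -c 4 -o distinct
fi
```

Unlike the name-based grouping, this approach would keep chr2 600-700 as a separate region, because it does not overlap chr2 200-500. It also merges overlapping intervals regardless of their names.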