Merge Regions in BED File by Feature Name

2024-08-07

In bioinformatics, analysing BED files often involves merging genomic intervals (regions) that share a common feature name (the fourth column of a BED file) into a single region.

To do this, you can use the bedtools groupby command to merge genomic intervals based on the name column.

For example, suppose you have the following BED file with feature names in the fourth column:

cat file1.bed

chr1    10      100     exon1
chr1    60      200     exon1
chr2    200     500     exon2
chr2    350     450     exon2
chr2    600     700     exon2

Now, merge the regions in the BED file by feature name using bedtools groupby:

bedtools groupby -i file1.bed -g 1,4 -c 2,3,4 -o min,max,count

# output

chr1    exon1   10      200     2
chr2    exon2   200     700     3

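In the above example, we merged the genomic intervals from the BED file based on the feature name. Each merged region runs from the smallest start to the largest end within its group, so any gap between intervals (such as 500 to 600 on chr2) ends up inside the merged region.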

The -g parameter specifies the columns used to group the intervals (columns 1 and 4). The -c parameter specifies the columns to summarise (columns 2, 3, and 4).

The -o parameter specifies the operation applied to each column given by -c, in the same order: min on column 2, max on column 3, and count on column 4.
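To make the grouping and summarising steps concrete, the following is a rough awk sketch that computes the same min, max, and count summary for input that is already ordered by columns 1 and 4. It is only an illustration of the logic, not how bedtools implements groupby, and the variable names (bed_input, summary) are made up for this example.

```shell
# Example rows from file1.bed, written with explicit tabs.
bed_input=$(printf 'chr1\t10\t100\texon1\nchr1\t60\t200\texon1\nchr2\t200\t500\texon2\nchr2\t350\t450\texon2\nchr2\t600\t700\texon2\n')

# Emulate: -g 1,4 -c 2,3,4 -o min,max,count
# A new group starts whenever the (column 1, column 4) pair changes.
summary=$(printf '%s\n' "$bed_input" | awk 'BEGIN { FS = OFS = "\t" }
function emit() { if (n > 0) print chrom, name, lo, hi, n }
{
    if ($1 != chrom || $4 != name) {   # key changed: emit previous group
        emit()
        chrom = $1; name = $4; lo = $2; hi = $3; n = 0
    }
    if ($2 + 0 < lo + 0) lo = $2       # min of column 2
    if ($3 + 0 > hi + 0) hi = $3       # max of column 3
    n++                                # count of rows in the group
}
END { emit() }')

printf '%s\n' "$summary"
```

The printed columns follow the same order as the groupby output: the grouping columns first, then one column per summary operation.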

The merged output is not in BED format. If you want BED output, you can pipe the result to awk to rearrange the columns.

bedtools groupby -i file1.bed -g 1,4 -c 2,3,4 -o min,max,count | awk 'BEGIN{OFS="\t"}{print $1,$3,$4,$2}'

# output

chr1    10      200     exon1
chr2    200     700     exon2
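The commands above rely on file1.bed already being ordered so that rows sharing columns 1 and 4 sit next to each other, since bedtools groupby expects its input to be sorted by the grouping columns. If a file is not ordered that way, a sort step along the following lines could come first. The shuffled rows and the file name unsorted.bed are hypothetical, used only for illustration.

```shell
# Hypothetical unsorted input: the same rows as file1.bed, shuffled.
unsorted=$(printf 'chr2\t600\t700\texon2\nchr1\t60\t200\texon1\nchr2\t200\t500\texon2\nchr1\t10\t100\texon1\nchr2\t350\t450\texon2\n')

# Sort by chromosome (column 1), then name (column 4), then numeric
# start (column 2), so rows with the same columns 1 and 4 are adjacent.
tab=$(printf '\t')
sorted=$(printf '%s\n' "$unsorted" | sort -t "$tab" -k1,1 -k4,4 -k2,2n)

printf '%s\n' "$sorted"

# With a real file, the sorted rows could be piped straight to groupby,
# which reads standard input when -i is omitted:
#   sort -k1,1 -k4,4 -k2,2n unsorted.bed | bedtools groupby -g 1,4 -c 2,3,4 -o min,max,count
```

Sorting by start position as a third key is not required for min and max, but it keeps the rows in a predictable order for inspection.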

To save the merged output to a file, redirect it as shown below:

bedtools groupby -i file1.bed -g 1,4 -c 2,3,4 -o min,max,count | awk 'BEGIN{OFS="\t"}{print $1,$3,$4,$2}' > merged.bed

cat merged.bed

chr1    10      200     exon1
chr2    200     700     exon2

In addition to merging regions by feature name, you can also merge genomic intervals based on where they overlap, for example with bedtools merge.
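As a rough illustration of how the two ideas can be combined, the awk sketch below merges intervals only when they share a chromosome and feature name and also overlap (or touch). It is a stand-in written for this post under those assumptions, not the bedtools implementation, and the variable names are made up.

```shell
# Example rows from file1.bed, written with explicit tabs.
bed_input=$(printf 'chr1\t10\t100\texon1\nchr1\t60\t200\texon1\nchr2\t200\t500\texon2\nchr2\t350\t450\texon2\nchr2\t600\t700\texon2\n')

tab=$(printf '\t')
merged=$(printf '%s\n' "$bed_input" |
    sort -t "$tab" -k1,1 -k4,4 -k2,2n |
    awk 'BEGIN { FS = OFS = "\t" }
    function emit() { if (seen) print chrom, lo, hi, name }
    {
        # Extend the current region only if chromosome and name match
        # and this interval starts at or before the current end.
        if (seen && $1 == chrom && $4 == name && $2 + 0 <= hi + 0) {
            if ($3 + 0 > hi + 0) hi = $3
            next
        }
        emit()                          # close the previous region
        chrom = $1; lo = $2; hi = $3; name = $4; seen = 1
    }
    END { emit() }')

printf '%s\n' "$merged"
```

Unlike the groupby approach, this keeps chr2 600 to 700 as a separate region, because it does not overlap the merged 200 to 500 interval.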