What are overlapping entries?

The overlapping entries section represents the relationship between homologous superfamilies and other InterPro entries. This is calculated by analysing the overlap between matched sequence sets.

An InterPro entry (IPR) is considered related to a homologous superfamily if:

  • their sequence matches overlap (i.e. the match positions fall within the homologous superfamily boundaries)
  • the Jaccard index (equivalent) or containment index (parent/child) of the matching sequence sets is equal to or higher than 0.75

The Jaccard index formula is J(IPR1, IPR2) = |IPR1 ∩ IPR2| / |IPR1 ∪ IPR2|, and the containment index formula is C(IPR1, IPR2) = |IPR1 ∩ IPR2| / |IPR1|.
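As a rough illustration of the two indexes, the sketch below computes them from counts. It assumes the standard definitions (intersection divided by union for the Jaccard index, intersection divided by the size of one dataset for the containment index); the function names and the example numbers are hypothetical and not taken from InterPro.

```python
def jaccard_index(intersection: int, union: int) -> float:
    """Standard Jaccard index: intersection size divided by union size."""
    if union == 0:
        return 0.0  # assumption: two empty datasets score 0
    return intersection / union


def containment_index(intersection: int, size: int) -> float:
    """Standard containment index: intersection size divided by the size of one dataset."""
    if size == 0:
        return 0.0  # assumption: an empty dataset scores 0
    return intersection / size


# Hypothetical counts, for illustration only
print(jaccard_index(80, 100))
print(containment_index(80, 90))
```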

What are the union and the intersection of two datasets?

  • The union (IPR1 ∪ IPR2) is the number of unique proteins found in the two datasets
  • The intersection (IPR1 ∩ IPR2 or IPR2 ∩ IPR1) is the number of overlapping domains in the proteins common to both datasets
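The two counts can be sketched in Python as follows. This is only an illustration under stated assumptions: each dataset is a hypothetical dictionary mapping a protein accession to a list of (start, end) match positions, the intersection counts each common protein once if any pair of its matches overlaps, and `simple_overlap` is a placeholder predicate that can be swapped for whichever overlap rule applies.

```python
def union_size(matches1, matches2):
    """Number of unique proteins matched by either entry."""
    return len(set(matches1) | set(matches2))


def simple_overlap(a, b):
    """Placeholder rule: two (start, end) matches share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def intersection_size(matches1, matches2, overlaps=simple_overlap):
    """Number of common proteins with at least one overlapping pair of matches.

    Assumption: each common protein is counted at most once.
    """
    common = set(matches1) & set(matches2)
    return sum(
        1
        for protein in common
        if any(overlaps(a, b) for a in matches1[protein] for b in matches2[protein])
    )


# Hypothetical accessions and positions, for illustration only
ipr1 = {"P1": [(10, 50)], "P2": [(5, 40)], "P3": [(100, 200)]}
ipr2 = {"P2": [(20, 60)], "P3": [(300, 400)], "P4": [(1, 30)]}

print(union_size(ipr1, ipr2))         # 4 unique proteins: P1, P2, P3, P4
print(intersection_size(ipr1, ipr2))  # 1: only P2 has overlapping matches
```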

How do we know if the protein domains are intersecting?

For the proteins common to the two entries of interest, their domains are considered to intersect if the matches overlap. This is determined by checking whether the midpoint of the match from one entry lies within the boundaries of the match from the other entry.

There are three possible scenarios:

Figure 12 Three possible overlapping scenarios.
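The midpoint check described above can be sketched as a small Python function. The sketch makes two assumptions that the text does not state: boundaries are treated as inclusive, and the test is applied in both directions (the midpoint of either match may fall within the other). The function names and positions are hypothetical.

```python
def midpoint(start: int, end: int) -> float:
    """Midpoint of a match given its start and end positions."""
    return (start + end) / 2


def matches_overlap(match1, match2) -> bool:
    """True if the midpoint of either match lies within the other match.

    Assumptions: inclusive boundaries, test applied in both directions.
    """
    start1, end1 = match1
    start2, end2 = match2
    return (
        start2 <= midpoint(start1, end1) <= end2
        or start1 <= midpoint(start2, end2) <= end1
    )


# Hypothetical match positions, for illustration only
print(matches_overlap((10, 50), (20, 100)))  # True: midpoint 30 lies in 20..100
print(matches_overlap((10, 20), (18, 100)))  # False: neither midpoint lies in the other match
print(matches_overlap((1, 100), (40, 60)))   # True: midpoint 50 of the second match lies in 1..100
```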

Determining overlapping entries

Once the intersection and union of the two datasets have been determined, the Jaccard and containment indexes can be calculated.

  • Two entries are equivalent if the Jaccard index score is equal to or higher than 0.75
  • Two entries have a parent/child relation if the containment index score is equal to or higher than 0.75

Both equivalent and parent/child entries are shown in the Overlapping entries section.
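The decision rule can be sketched as follows. This is a minimal illustration, not InterPro's implementation: it assumes the standard index definitions, checks the Jaccard index first, and reports a parent/child relation if the containment index of either entry reaches the threshold. The function name, labels and counts are hypothetical.

```python
THRESHOLD = 0.75  # threshold stated in the text


def classify(intersection: int, union: int, size1: int, size2: int) -> str:
    """Label the relationship between two entries from their counts.

    Assumptions: Jaccard is tested before containment, and containment
    is tested against the size of each entry in turn.
    """
    jaccard = intersection / union if union else 0.0
    containment1 = intersection / size1 if size1 else 0.0
    containment2 = intersection / size2 if size2 else 0.0

    if jaccard >= THRESHOLD:
        return "equivalent"
    if max(containment1, containment2) >= THRESHOLD:
        return "parent/child"
    return "not related"


# Hypothetical counts, for illustration only
print(classify(80, 100, 90, 90))  # equivalent (Jaccard 0.8)
print(classify(40, 100, 45, 95))  # parent/child (containment 40/45)
print(classify(10, 100, 50, 60))  # not related
```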

The next page works through a guided example to help explain the calculations involved in understanding overlapping entries.