Network Clustering

The clustering method used in BioModule is named CAM (Clique Aggregation Method). CAM is based on two properties of biological modules: connectivity and core-attachment structure. The subgraph of a biologcial network induced by a module should be connected because the members of the module work in tight cooperation; a module is connected if there exists a reliable path, whose existence probabilities is greater than a "confidence score threshold", for any two genes (or proteins) in the module. On the otherhand according to Gavin et al. (2006), a complex consists of a core and an attachment. Therefore a biological module should be a core-attachment structure.

CAM find biological modules by mergeing maximal reliable cliques. A reliable clique is a clique in which every weight of an edge is greater than the confidence score threshold, and a maximal reliable clique is a reliable clique that cannot be contained in any other reliable cliques. The pseudo code for CAM is shown as follows, and There are two major parts: Lines 4-47 depicts that how CAM generates candiated modules, and Lines 49-60 shows how CAM computes finial modules by removing overlapping modules. The measures that CAM used in lines 13 and 50 are clique score and module score, respectively. Given a clique C, the score of the clique is shown as following equation.

Given a module M, the score of the module is shown as following equation.


Algorithm CAM (Clique Aggregation Method)
Input: An input network G=(V, E, w), where V is the vertex set, E is the edge set, and w is the edge weight function; confidence_score_threshold is the confidence score threshold; maximum_module_size is the maximum module size; module_overlapping_threshold is the module overlapping threshold. Output: A set of modules; Procedure 1: Generate the set of maximal reliable cliques from G 2: s(G) = the size of a maximumn reliable clique 3: 4: Let I be an empty set 5: for k=3 to s(G) do 6: S maximal reliable cliques with sizes greater than k-1 8: unmarked all clique in S 12: 14: for each clique C in S in descending order do 15: if C is unmarked then 16: R {C}. Mark C. 17: Create a priority queue Q. 18: Enqueue D in S, which shares k-1 vertices with C, onto Q, and mark D. 19: while Q is not empty do 20: X <- Q.dequeue() 21: if there exists a reliable path between any two nodes x and y, 22: where x in X, y in Y, and Y in R then 23: R = R {X} 24: for each clique Z which shares k-1 vertices with X do 25: if Z is not marked then 26: Mark Z and enqueue Z onto Q 27: end if 28: end for 29: else 30: for each clique Z in Q do. 31: Let Z be unmarked. 32: end for 33: break 34: end if 35: end while 36: S { C | C in R} 42: if |S| < maximum_module_size then 43: I = I {S} 44: end if 45: end if 46: end for 47: end for 48: 49: Let J be an empty set. 51: for all S in I in descending order do 52: overlap_cnt 0 53: for all T in J do 54: overlap_cnt overlap_cnt + |ST| 55: end for 56: if overlap_cnt/|S| < module_overlapping_threshold then 57: J = J {S} 58: end if 59: end for 60: Output J.

Parameter setting

  • Confidence score threshold: When an input network is weighted and this threshold is increased, the number of reliable paths is decreased and the modules shrink. We recommend that use higher value for this threshold if the network contains many false positives or false negatives. CAM sets this threshold to 0.5 when user submits a job without setting it.

  • Maximum module size: This parameter sets the maximum size of the modules generated. The suggested value for this parameter is 150, because the sizes of most of protein complexes are less than 150.

  • Module overlapping threshold: The threshold used to remove highly overlapped modules. When the threshold is increased, the overlapping modules is also increased. We think the safe choice for this parameter is 0.5.

References


© 2012, Laboratory of Systems Biology and Network Biology ,

Institute of Information Science, Academic Sinica. All rights reserved. 

Lasted updated 2012/10/19