An efficient algorithm for sampling directed graphs

About the code

The code is an implementation of the algorithm described in [1]. It provides an efficient way to sample the realizations of any given bi-degree sequence. Previously existing graph sampling methods were either link-swap based (Markov chain Monte Carlo algorithms) or stub-matching based (the Configuration Model). Both types are ill-controlled, with typically unknown mixing times for link-swap methods and uncontrolled rejections for the Configuration Model. In contrast, this is an efficient, polynomial-time algorithm that generates statistically independent directed graph samples with a given, arbitrary bi-degree sequence. Unlike other algorithms, this degree-based method [2] always produces a sample, without backtracking or rejections.

If you use this code for your research, I kindly ask you to cite Ref. 1 in your publications.

Download the code

How to use

The code consists of the files DiGSampler.c, DiGSampler.h and DiGSamp.h. To use it, just include DiGSamp.h in your code, and compile DiGSampler.c together with your other source files, remembering to pass the option -lm, as the code needs to link against the math library.

The code defines two new data structures, called bds and digraph. The members of the structure bds, used internally by the sampling algorithm, are

- int *indegree
- int *outdegree
- int *label
- int *forbidden

The members of the structure digraph, which holds the samples returned to the user, are

- int **list
- double weight

Before creating samples of a given bi-degree sequence, the sampler must be initialized by invoking the function digsaminit. The prototype of the function is

void digsaminit(int **seq, const int n)

where n is the number of nodes in the sequence, and seq is a pointer to an n×2 matrix containing the bi-degree sequence. The sequence must be stored so that, given an index i, seq[i][0] and seq[i][1] are the in-degree and the out-degree of the ith node, respectively.
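As an illustration of this memory layout, the matrix can be allocated and filled as follows (make_seq is a hypothetical helper written for this example, not part of the library):

```c
#include <stdlib.h>

/* Hypothetical helper (not part of the library): allocates and fills
   the n x 2 matrix in the layout digsaminit expects, with
   seq[i][0] = in-degree and seq[i][1] = out-degree of node i. */
int **make_seq(const int *in, const int *out, const int n)
{
    int **seq = malloc(n * sizeof *seq);
    for (int i = 0; i < n; i++) {
        seq[i] = malloc(2 * sizeof *seq[i]);
        seq[i][0] = in[i];
        seq[i][1] = out[i];
    }
    return seq;
}
```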

To create and store a sample realization of a given bi-degree sequence, the user should invoke the function digsam. The prototype of the function is

digraph digsam(double (*rng)(void), const int stfl)

The function returns a realization of the bi-degree sequence for which it has been initialized, using the user-specified random number generator rng.

**The sequence contained in seq must be lexicographically ordered**. The random number generator must be a function taking no input parameters and returning a double-precision floating-point number between 0 and 1. The generator must already be seeded; this leaves the choice of generator to the user. The variable stfl is a flag governing how target nodes are chosen for connection: if set to 0, the nodes are chosen uniformly at random amongst those allowed; if set to anything other than 0, the nodes are chosen with probability proportional to their residual in-degree. Given a sequence and a random number generator, the user creates a sample by declaring a variable G of type digraph, and then assigning it the return value of digsam:

G = digsam(rng,0);
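For instance, a generator with the required signature can be sketched by wrapping the C standard library's rand(). This is only an illustration of the expected interface; as the proof-of-concept section below also warns, an established, tested generator should be used for actual research:

```c
#include <stdlib.h>

/* Sketch of a generator with the signature digsam expects: no
   parameters, returning a double in [0,1). rand() is used purely
   for illustration and should not be used in publication-level
   work. Seed once with srand() before sampling. */
double rng(void)
{
    return rand() / (RAND_MAX + 1.0);
}
```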

After the assignment, G.list is a densely allocated matrix containing the out-adjacency list, and G.weight is the logarithm of the weight associated with that particular sample. New samples of the sequence can be obtained by invoking digsam again.

Please note that the out-adjacency list of a previous sample is destroyed by further calls to the sampler, **even if the sample is assigned to a different variable**. Thus, for instance, the lines

G = digsam(rng,1);

H = digsam(rng,1);

will result in the same out-adjacency list stored in G.list and in H.list. Also note that while the user can switch at will between uniform and degree-proportional choice of the target nodes, samples and weights of only one kind should be used for statistical averages.

After finishing sampling a given sequence, the memory used should be cleaned by invoking digsamclean(). Afterwards, the sampler is ready to be initialized again with another bi-degree sequence.

A minimal proof of concept program is included, in the file poc.c. The code can be compiled on a standard GNU/Linux distribution with the command

gcc -std=c99 -o poc DiGSampler.c poc.c -lm

The program invokes the directed graph sampling algorithm to produce 10 realizations of the sequence {(3, 0), (3, 0), (1, 2), (1, 2), (1, 2), (1, 2), (1, 2), (1, 2)}, using a simple random number generator. The first 5 samples are created using uniform choice of nodes, and the remaining ones using degree-proportional selection. After the generation of each sample, the out-adjacency list and the logarithm of the weight are displayed on screen. Please note that (pseudo) random number generation for scientific or cryptographic applications is a complex subject, and the actual generator used in publication-level sampling should be an established, tested one. In the proof of concept, a simple one is used just for the sake of simplicity. It should probably not be used otherwise, and definitely not for cryptographic applications, as far more appropriate and better generators exist.

Suggestions

One of the return values provided by the code is the logarithm of the weight associated with each sample, to be used in an expression for the weighted mean of some observable. To avoid dealing with numbers of substantially different orders of magnitude, a useful trick is to employ a formula for the logarithm of a sum. This way, one can find log(a+b) directly, knowing log(a) and log(b). To see how this works, call x=log(a), y=log(b), and result=log(a+b). Then the following chain of identities holds:

log(a+b) =

= log(a*(1+b/a))

= log(a) + log(1+b/a)

= x + log(1+b/a)

= x + log(1+exp(y)/exp(x))

= x + log(1+exp(y-x))

Now notice that, if y>x, then y-x>0, and therefore the exponential in the expression above can grow without control. However, if y≤x, then the argument of that same exponential is negative or 0, and the exponential itself is a real number between 0 and 1. In that case, the second term in the sum is log(1+ε), with ε between 0 and 1. This is a very easily computed quantity, as it can be expanded in series comfortably and precisely, so much so that the C programming language even has a function for it (log1p). Then, knowing x and y, all one needs to do is make sure that y≤x. Since the sum is a symmetric operation, all this can be easily written in C as

result = fmax(x,y) + log1p(exp(-fabs(x-y)));

fmax returns whichever of x and y is greater, fabs returns the absolute value of their difference, and the minus sign in front of it ensures that the argument of the exponential is never positive. Since the exponential is then always between 0 and 1, the whole formula is particularly stable.

Aside from this, there are still some caveats. The first is that, in the weighted mean formula for a series of observable measurements Q_i with weights w_i, it is probably better not to compute the sum of w_iQ_i and the sum of w_i independently and then subtract the logarithms. Instead, one can use a stable algorithm to compute the ensemble average of Q directly, on the fly. A particularly well-suited algorithm is West's algorithm [3], which is very straightforward. An easy explanation of the algorithm can be found in the Wikipedia article on algorithms for calculating variance, under the section "Weighted incremental algorithm". As a good side effect, the algorithm also provides the uncertainty associated with the ensemble average of Q. Notice that, as discussed in West's original paper, this algorithm should be used only when one cannot save all the data and analyse them later; if one can, the best choice is a two-pass algorithm.
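A minimal sketch of the weighted incremental update is given below; the names wstats and wstats_add are illustrative, not part of the library:

```c
/* West's weighted incremental algorithm: one-pass update of the
   weighted mean of an observable q with weight w. After all updates,
   mean holds the weighted average and m2 / wsum the (biased)
   weighted variance. */
typedef struct {
    double wsum;  /* running sum of weights                      */
    double mean;  /* running weighted mean                       */
    double m2;    /* running sum of weighted squared deviations  */
} wstats;

void wstats_add(wstats *s, double w, double q)
{
    s->wsum += w;
    double delta = q - s->mean;
    s->mean += (w / s->wsum) * delta;
    s->m2 += w * delta * (q - s->mean);
}
```

In practice the weights here would be the exponentials of the log-weights returned by the sampler, suitably rescaled to avoid overflow.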

A second point of caution is that when computing a mean and standard deviation, one often ends up not just summing, but also subtracting; in fact, subtractions are carried out about 50% of the time. The above formula for the logarithmic sum can be adapted for subtraction too, becoming

log(a-b) = x + log(1-exp(y-x)),

or, in C,

result = x + log1p(-exp(y-x));

The formula is always valid in the general case a>b, but it is not as stable as that of the sum. The reason is that log(1+ε) changes relatively slowly for ε>0, but it changes quite quickly for ε<0, diverging as ε approaches -1, that is, as b approaches a. However, this is not too big a concern in the case of a mean and standard deviation calculation. In fact, West's algorithm converges relatively quickly to the correct value, which means that the amplitude of the oscillations around the actual ensemble average will quickly decrease. Thus, the only potentially problematic situations can happen for the first few terms in the calculation of the mean (or rather, for half of them), but typically this is not a problem.

Finally, the last thing to be aware of is that the logarithmic formulae above work only when one is dealing with positive numbers. However, some observables could very well be negative. For instance, one might be measuring one of the assortativity coefficients of a graph, in which case the range of possible values for the observable is from -1 to 1. Problems such as this are easily solved if one knows the theoretical range of the measurements: one can add the same constant to all the measurements, and average over these "shifted" results. In the assortativity example, one could add 2 to all the measurements, thus making sure that the averages are over the range 1 to 3, and then subtract 2 from the final result. This guarantees that one never tries to take the logarithm of a negative number and, importantly, that one can always say, a priori, which is the greater of the two numbers involved at every step.

References

[1] Kim, Del Genio, Bassler and Toroczkai, New J. Phys. 14, 023012 (2012)

[2] Del Genio, Kim, Toroczkai and Bassler, PLoS ONE 5 (4), e10012 (2010)

[3] West, Commun. ACM 22, 532-535 (1979)

Release information

Current version

**2.4**: fixed function prototyping.

Old versions

**2.3**: fixed compliance with C99 standard.

**2.2**: fixed unchecked buffer overrun, which could have been triggered if the first node in the (reordered) derivative sequence built to find the fail in-degree had an out-degree equal to the number of nodes in the sequence minus 1.

**2.1**: fixed potential memory leak condition, triggered if the code was used to generate samples from many different sequences. Thanks to Lyle Muller for signaling the bug.