I saw many times on this list the question about the performance of wait() vs. next_trigger, and in general, I think that when speaking about system performance, it is important to know what execution speed can be expected.I decided to measure the performance, the result see in the attachment.
The starting point was the 'next_trigger' example. I slightly modified it to make altogether 10 calls to the respective function, and added an external repetition cycle, i.e. altogether 1000 "actions" are executed by a module. After that I prepared the equivalent functionality using 'wait()'. In the main test program I create a vector of modules with length 0,1,2,5,10, etc, and measure the execution time using Linux facilities, at the beginning of elaboration, immediately before starting and after stopping. From those data I determined the MAPS value (in analogy with MIPS: how many "actions" are executed per seconds). The modules (as elements of the vector) work on the same time scale, i.e. SystemC receives requests from all modules at the same time. The computer has limited computing capacity and memory bandwidth, so I expected to see those limitations in the execution times.
I think I can see two "roofline" effects (http://doi.acm.org/10.1145/1498765.1498785).
Considering the elaboration phase only, the memory subsystem limits the performance. For very low number of elements, the memory bandwidth is not saturated, so the performance initially increases proportionally, and after a while it becomes constant. (interestingly, with a slight difference for the two actions).
Considering the simulation phase, the major limitation is the available computing capacity; when it is reached, the apparent execution times gets longer, and the performance of the simulator starts to decrease. Correspondingly, from the total execution time, the effect of both 'rooflines' can be seen, although the memory limitation is less expressed. The data confirm what was told earlier on the list that the implementation is more or less the same, and so is their performance.
I also see some difference in the behavior of the two methods; even it looks like that the next_trigger may have !two! rooflines; it may be the sign of some SW and/or HW cache. This is something where I would need the help of a core developer (for co-authorship): probably the effect of some configuration parameters can be seen from outside.
This simple benchmark has a minimum amount of non-kernel functionality, no i/o ports, binding, etc; so in my eyes it can be for SystemC what Linpack is for supercomputers. I want to measure also the effect of tracing, logging, etc. Any meas ideas welcome.
I think that in TLM simulation (and other, non-circuit simulations) it is an important factor to find out the proper computer size for the design. As shown in the figure, the efficiency drops for the too large (compared to ??? HW parameters) designs, so it might be better to invest into a larger configuration. (BTW: notice how the number of context switches correlate with the degradation of the performance)