
SystemC performance


Vegh, Janos


Hello,

I have seen the question about the performance of wait() vs. next_trigger() on this list many times, and in general I think that, when speaking about system performance, it is important to know what execution speed can be expected. I decided to measure the performance; the results are in the attachment.

The starting point was the 'next_trigger' example. I slightly modified it to make altogether 10 calls to the respective function and added an external repetition cycle, so that altogether 1000 "actions" are executed by a module. After that I prepared the equivalent functionality using 'wait()'. In the main test program I create a vector of modules of length 0, 1, 2, 5, 10, etc., and measure the execution time using Linux facilities at the beginning of elaboration, immediately before starting the simulation, and after stopping it. From those data I determine the MAPS value (in analogy with MIPS: how many "actions" are executed per second). The modules (the elements of the vector) work on the same time scale, i.e. SystemC receives requests from all modules at the same time. The computer has limited computing capacity and memory bandwidth, so I expected to see those limitations in the execution times.
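Roughly, the two module variants look like the sketch below. This is only a simplified illustration, not the actual benchmark code; the module names and the constants are my own placeholders.

```cpp
#include <systemc>
using namespace sc_core;

static const int ACTIONS_PER_CYCLE = 10;   // placeholder: calls per repetition cycle
static const int REPETITIONS       = 100;  // placeholder: outer cycle, 10 x 100 = 1000 "actions"

// Variant 1: timed waits with wait() inside an SC_THREAD
SC_MODULE(wait_module) {
    SC_CTOR(wait_module) { SC_THREAD(run); }
    void run() {
        for (int r = 0; r < REPETITIONS; ++r)
            for (int i = 0; i < ACTIONS_PER_CYCLE; ++i)
                wait(10, SC_NS);              // one "action": a timed wait
    }
};

// Variant 2: the equivalent behavior with next_trigger() inside an SC_METHOD
SC_MODULE(trigger_module) {
    int count;
    SC_CTOR(trigger_module) : count(0) { SC_METHOD(step); }
    void step() {
        if (++count < REPETITIONS * ACTIONS_PER_CYCLE)
            next_trigger(10, SC_NS);          // one "action": re-arm the method
    }
};
```

In sc_main, N copies of one variant are instantiated (for the vector lengths listed above) and sc_start() is called; the wall-clock time divided by the total number of actions then gives the MAPS value.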

I think I can see two "roofline" effects (http://doi.acm.org/10.1145/1498765.1498785).

Considering the elaboration phase only, the memory subsystem limits the performance. For a very low number of elements the memory bandwidth is not saturated, so the performance initially increases proportionally, and after a while it becomes constant (interestingly, with a slight difference between the two actions).

Considering the simulation phase, the major limitation is the available computing capacity; when it is reached, the apparent execution times get longer and the performance of the simulator starts to decrease. Correspondingly, in the total execution time the effect of both 'rooflines' can be seen, although the memory limitation is less pronounced. The data confirm what was said earlier on the list: the implementation of the two mechanisms is more or less the same, and so is their performance.

I also see some difference in the behavior of the two methods; it even looks like next_trigger() may have *two* rooflines, which may be the sign of some software and/or hardware cache. This is something where I would need the help of a core developer (for co-authorship): probably the effect of some configuration parameters can be seen from the outside.

This simple benchmark has a minimum amount of non-kernel functionality: no I/O ports, no binding, etc. So, in my eyes, it can be for SystemC what Linpack is for supercomputers. I also want to measure the effect of tracing, logging, etc. Any measurement ideas are welcome.

I think that in TLM simulation (and other, non-circuit simulations) it is an important factor to find the proper computer size for the design. As shown in the figure, the efficiency drops for designs that are too large (compared to some, yet to be identified, HW parameters), so it might be better to invest in a larger configuration. (BTW: notice how the number of context switches correlates with the degradation of the performance.)

Best regards

Janos

RooflineSystemC.pdf


Perhaps you would like to share your code for measurements via GitHub?

Measuring performance can be tricky, to say the least. How you compile (compiler, version, SystemC version) and what you measure can really change the results. It would also help to specify your computer's specifications (processor, RAM, cache, OS version):

  • Processor (vendor, version)
  • L1 cache size
  • L2 cache size
  • L3 cache size
  • RAM
  • OS (name, version)
  • Compiler (name, version)
  • Compiler switches (--std, -O)
  • SystemC version
  • SystemC installation switches
  • How time is measured and from what point (e.g. start_of_simulation to end_of_simulation; see the sketch after this list)
  • Memory consumption information if possible
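
To illustrate the last two points: timing can be hooked into the simulation-phase callbacks. The sketch below uses std::chrono and made-up names (perf_probe, total_actions); it is not a statement about how your benchmark actually measures.

```cpp
#include <systemc>
#include <chrono>
#include <iostream>

// Hypothetical helper module: records wall-clock time between
// start_of_simulation() and end_of_simulation() and prints a MAPS figure.
struct perf_probe : sc_core::sc_module {
    std::chrono::steady_clock::time_point t_start;
    unsigned long long total_actions;   // made-up: total "actions" the benchmark executes

    perf_probe(sc_core::sc_module_name nm, unsigned long long actions)
        : sc_core::sc_module(nm), total_actions(actions) {}

    void start_of_simulation() override {
        t_start = std::chrono::steady_clock::now();
    }
    void end_of_simulation() override {
        double dt = std::chrono::duration<double>(
                        std::chrono::steady_clock::now() - t_start).count();
        std::cout << "wall time: " << dt << " s, MAPS: "
                  << (total_actions / dt) / 1e6 << std::endl;
    }
};
```

Memory consumption can be sampled at the same two points, e.g. from /proc/self/status on Linux.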

This will help to make meaningful statements about the measurements and allow others to reproduce/verify your results. It is also important to understand how these results should be interpreted (taken advantage of) and compared.

With respect to TLM, it will get a lot more challenging. For example, which coding style is used: Loosely Timed or Approximately Timed? Are sc_clocks involved?


Real-life simulation performance usually depends a lot on modeling style. For high-level TLM-2.0 models, the share of simulation time consumed by SystemC primitives is usually much lower than the time consumed by the "business logic" of the models. The efficiency of the simulation kernel (context switches, channels) is much more important for low-level RTL simulations.
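
To illustrate: in a loosely-timed TLM-2.0 model a b_transport call is essentially a plain C++ function call with an annotated delay, while an RTL-style process wakes the scheduler on every clock edge. A minimal sketch (made up for illustration, not taken from any real model):

```cpp
#include <systemc>
#include <tlm>
#include <tlm_utils/simple_target_socket.h>
using namespace sc_core;

// Loosely-timed target: the "business logic" runs inside b_transport,
// entirely outside the scheduler; the delay is only annotated.
struct lt_target : sc_module {
    tlm_utils::simple_target_socket<lt_target> socket;
    SC_CTOR(lt_target) : socket("socket") {
        socket.register_b_transport(this, &lt_target::b_transport);
    }
    void b_transport(tlm::tlm_generic_payload& gp, sc_time& delay) {
        delay += sc_time(10, SC_NS);                  // model time, no kernel interaction
        gp.set_response_status(tlm::TLM_OK_RESPONSE);
    }
};

// RTL-style module: the kernel evaluates the process on every clock edge.
struct rtl_style : sc_module {
    sc_in<bool> clk;
    SC_CTOR(rtl_style) : clk("clk") {
        SC_METHOD(step);
        sensitive << clk.pos();
        dont_initialize();
    }
    void step() { /* per-cycle logic */ }
};
```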


22 hours ago, Roman Popov said:

Real-life simulation performance usually depends a lot on modeling style. For high-level TLM-2.0 models, the share of simulation time consumed by SystemC primitives is usually much lower than the time consumed by the "business logic" of the models. The efficiency of the simulation kernel (context switches, channels) is much more important for low-level RTL simulations.

Yes, this is exactly what I originally wanted to do: just to measure how the _relative_ efficiency changes when wait()/next_trigger() event handling is used. I do not claim to measure something absolute; as with supercomputers, the important parameter is the time needed for a given benchmark. BTW: how the efficiency of the simulation depends on the size of the design is also important. The preliminary tests show that large designs eat up the computing resources and the simulation time strongly increases.

I have uploaded the 'benchmark system' to

https://github.com/jvegh/SystemCperformance

The primary goal was to introduce as little overhead as possible, and I hope I made it applicable to other 'benchmarking' too. The easy-to-change CMake setup makes it possible to play with compilers, versions, etc., and the provided .tex files make it quick to produce publication-quality diagrams. My experience is that at low "module" counts the measurement does not necessarily provide reliable results: the resource-measurement facility used is not designed for such utilization. It still works, just with larger scatter.

At first glance, however, it looks like the tool is sensitive to the internals of the computer. My first idea was that I can see the "memory-bound" and "computation-bound" behavior of the SystemC kernel, of course on top of the cache behavior. In the data measured on my laptop, I see some strange effects near (I guess) the cache capacity bounds. The OS is the same (Ubuntu 18.04), but the processor belongs to a different generation.

