Roman Popov

Benchmarking RTL simulation with SystemC


It is common knowledge that RTL simulations in SystemC are slow compared to HDL simulators, because SystemC is just a library, while HDL simulators are optimizing compilers. I decided to experiment a little and measure the difference in simulation performance quantitatively.

The simulated design is a pipeline of registers, without any logic between them. Each register is modeled as a separate process (SC_METHOD, SC_CTHREAD or always_ff). This design is artificial, since there is no designer-written logic, but it should make it possible to measure simulator kernel efficiency.

I've used 4 simulators:

  1. Accellera SystemC with registers modeled as SC_METHODs
  2. Accellera SystemC with registers modeled as SC_CTHREADs
  3. Commercial optimizing Verilog simulator that compiles HDL to C/C++. As far as I know, this simulator is discrete-event, the same as SystemC.
  4. Verilator, a cycle-based simulator that compiles HDL to optimized C/C++.

Here are the results: 20 ms of simulated time, 20 MHz clock. All simulation runtimes are in seconds.

Registers in pipeline | SystemC / SC_METHODs | SystemC / SC_CTHREADs | Commercial Verilog simulator | Verilator
 100                  |   1.3                |   2.4                 |   0.85                       |  0.2
 200                  |   2.7                |   5.5                 |   1.75                       |  0.28
 500                  |   8.9                |  17                   |   6.5                        |  0.49
1000                  |  18                  |  46                   |  15.6                        |  0.96
2000                  |  65                  | 159                   |  37                          |  1.8
4000                  | 180                  | 428                   |  73                          |  3.7
8000                  | 920                  |   -*                  | 133                          |  7.4

*(I didn't run the SC_CTHREAD simulation for 8000 registers.)

Here is the source code for the DUTs:

SystemC:

SC_MODULE(dff_reg) {

    sc_in<bool>     clk{"clk"};
    sc_in<bool>     rstn{"rstn"};

    sc_in<uint32_t>      data_in{"data_in"};
    sc_out<uint32_t>     data_out{"data_out"};

    SC_CTOR(dff_reg) {
#ifndef CTHREAD_DUT
        SC_METHOD(proc);
        sensitive << clk.pos() << rstn.neg();
#else
        SC_CTHREAD(proc_thread, clk.pos());
        async_reset_signal_is(rstn, false);
#endif
    }

    void proc() {
        if (!rstn) {
            data_out = 0;
        } else {
            data_out = data_in;
        }
    }

    void proc_thread() {
        data_out = 0;
        wait();
        while (1) {
            data_out = data_in;
            wait();
        }
    }

};

SC_MODULE(dut) {
    sc_in<bool> clk{"clk"};
    sc_in<bool> rstn{"rstn"};
    sc_in<uint32_t>  data_in{"data_in"};
    sc_out<uint32_t> data_out{"data_out"};

    SC_CTOR(dut) {
        for (size_t i = 0; i < N_REGS; ++i) {
            dff_insts[i].clk(clk);
            dff_insts[i].rstn(rstn);
            dff_insts[i].data_in(data_io[i]);
            dff_insts[i].data_out(data_io[i+1]);
        }

        SC_METHOD(in_method); sensitive << data_in;
        SC_METHOD(out_method); sensitive << data_io[N_REGS];
    }

private:

    void in_method() { data_io[0] = data_in; }
    void out_method() { data_out = data_io[N_REGS]; }

    sc_vector<dff_reg>        dff_insts{"dff_insts", N_REGS};
    sc_vector<sc_signal<uint32_t>> data_io{"data_io", N_REGS + 1};
};

 

Verilog:

module dff_reg (
    input  bit  clk,
    input  bit  rstn,
    input  int  data_in,
    output int  data_out
);

always_ff @(posedge clk or negedge rstn) begin
    if (~rstn) begin
        data_out <= 0;
    end else begin
        data_out <= data_in;
    end
end

endmodule

module dut
(
    input  bit  clk,
    input  bit  rstn,
    input  int  data_in,
    output int  data_out
);

int  data_io[N_REGS + 1];

assign data_io[0] = data_in;
assign data_out = data_io[N_REGS];

genvar i;

generate
    for(i = 0; i < N_REGS; i = i + 1) begin: DFF_INST
        dff_reg D (
            .clk(clk),
            .rstn(rstn),
            .data_in(data_io[i]),
            .data_out(data_io[i+1])
        );
    end
endgenerate

endmodule

 


What is interesting here is that SystemC performance is close to the commercial Verilog simulator for small to medium designs. But as the design grows beyond 1000 processes, SystemC starts to lose significantly.

I've profiled the 4000-register SystemC design in VTune to see what happens inside.

The hotspot analysis is not surprising: 90% of CPU time is consumed by sc_event::trigger, port reads/writes through virtual sc_interface methods, and sc_prim_channel_registry::perform_update.

What is interesting is the microarchitectural analysis: 90% of the time, performance is bound by data cache misses. So as the design grows large, the only thing that matters is cache efficiency. And this is a weakness of SystemC, since SystemC design primitives are very expensive. For example, sizeof(sc_signal<bool>) is 216 bytes!


Is it really the size of the primitive, or the size of the data in play that's the issue? Also, if you could relocate the data into close proximity to one another, then the caching would potentially be more efficient, since it is the cache line size and utilization that has a large impact on performance.

Perhaps a redesign of primitives would change this for RTL; although it might take some serious rethinking to make it work. If construction of sc_signal<bool> invoked some pooling and data aggregation, locality could improve. I would bet a savvy C++ programmer could improve this.

For what it's worth, I have had some ideas that might make gate-level sims possible in SystemC too.


One more experiment: adding a combinational method to each register module. In the SystemC case this creates an additional N events inside the signals (vs. only one clock event used in the original experiment).

SystemC source:

SC_MODULE(dff_reg) {

    sc_in<bool>     clk{"clk"};
    sc_in<bool>     rstn{"rstn"};

    sc_in<uint32_t>      data_in{"data_in"};
    sc_out<uint32_t>     data_out{"data_out"};

    SC_CTOR(dff_reg) {
        SC_METHOD(proc);
        sensitive << clk.pos() << rstn.neg();
        SC_METHOD(preproc);
        sensitive << data_in;

    }
private:
    sc_signal<int> data_imp{"data_imp"};

    void preproc() {
        data_imp = data_in * 3 + 1;
    }

    void proc() {
        if (!rstn) {
            data_out = 0;
        } else {
            data_out = data_imp;
        }
    }

};

 

Verilog code:

module dff_reg (
    input  bit  clk,
    input  bit  rstn,
    input  int  data_in,
    output int  data_out
);

int data_imp;

always_comb begin
    data_imp = data_in * 3 + 1;
end

always_ff @(posedge clk or negedge rstn) begin
    if (~rstn) begin
        data_out <= 0;
    end else begin
        data_out <= data_imp;
    end
end

endmodule

 

Simulation results  (runtime in seconds):

Registers | SystemC (2 SC_METHODs) | Commercial Verilog simulator
 100      |   2.8                  |   1.1
 200      |   7                    |   2.3
 300      |  10                    |   4.5
 500      |  20                    |  10.5
1000      |  40                    |  31
2000      | 164                    |  52

 

In this case the Verilog compiler leads considerably even on small designs. Probably, like Verilator, it does some static scheduling to reduce the number of events in the optimized build.

On 11/2/2018 at 7:39 PM, David Black said:

Is it really the size of the primitive, or the size of the data in play that's the issue? Also, if you could relocate the data into close proximity to one another, then the caching would potentially be more efficient, since it is the cache line size and utilization that has a large impact on performance.

Yes, intuitively both locality and cache line utilization should have a significant impact on performance. 

Quote

Perhaps a redesign of primitives would change this for RTL; although it might take some serious rethinking to make it work. If construction of sc_signal<bool> invoked some pooling and data aggregation, locality could improve. I would bet a savvy C++ programmer could improve this.

I think it would be possible to make a faster C++ library for RTL simulation if we were to design it from scratch today. However, in the SystemC case it would be hard to improve performance without breaking backwards compatibility.

But the question is, do we really want to simulate large-scale RTL designs using a library like SystemC? I don't think there is much demand for this. If you are using SystemC for design, then you have a tool that converts it into Verilog, and after this conversion you can use a Verilog compiler or Verilator for large simulations. This flow is already utilized, for example, by Chisel and RocketChip (https://github.com/freechipsproject/rocket-chip#emulator), which generate Verilog from Scala and use Verilator for simulations.

Using Verilator together with SystemC is even easier, since Verilator can generate a SystemC wrapper that integrates easily into a SystemC-based verification environment.


One more experiment, this time with locality: I've replaced the sc_vectors in the original example with contiguous storage:

SC_MODULE(dut) {
    sc_in<bool> clk{"clk"};
    sc_in<bool> rstn{"rstn"};
    sc_in<uint32_t>  data_in{"data_in"};
    sc_out<uint32_t> data_out{"data_out"};

    SC_CTOR(dut) {

        dff_insts = static_cast<dff_reg *>(::operator new(sizeof(dff_reg) * N_REGS));
        for (size_t i = 0; i < N_REGS; ++i) {
            new (dff_insts + i) dff_reg(sc_gen_unique_name("dff_reg"));
        }

        data_io = static_cast<sc_signal<uint32_t> *>(::operator new(sizeof(sc_signal<uint32_t>) * (N_REGS + 1)));
        for (size_t i = 0; i < N_REGS + 1; ++i) {
            new (data_io + i) sc_signal<uint32_t>(sc_gen_unique_name("data_io"));
        }

        for (size_t i = 0; i < N_REGS; ++i) {
            dff_insts[i].clk(clk);
            dff_insts[i].rstn(rstn);
            dff_insts[i].data_in(data_io[i]);
            dff_insts[i].data_out(data_io[i+1]);
        }

        SC_METHOD(in_method); sensitive << data_in;
        SC_METHOD(out_method); sensitive << data_io[N_REGS];
    }

    ~dut() {
        // Placement-new'ed objects are never destroyed automatically:
        // run their destructors and release the raw storage explicitly.
        for (size_t i = 0; i < N_REGS + 1; ++i)
            data_io[i].~sc_signal();
        ::operator delete(data_io);
        for (size_t i = 0; i < N_REGS; ++i)
            dff_insts[i].~dff_reg();
        ::operator delete(dff_insts);
    }

private:

    void in_method() { data_io[0] = data_in; }
    void out_method() { data_out = data_io[N_REGS]; }

    dff_reg *dff_insts;
    sc_signal<uint32_t> *data_io;
//    sc_vector<dff_reg>        dff_insts{"dff_insts", N_REGS};
//    sc_vector<sc_signal<uint32_t>> data_io{"data_io", N_REGS + 1};
};

 

Here are results (runtime in seconds):

Registers | Original (sc_vector) | Contiguous storage | Improvement over original
 100      |   1.3 sec            |   1.0 sec          | 23 %
 200      |   2.7 sec            |   2.3 sec          | 14 %
 500      |   8.9 sec            |   7.1 sec          | 20 %
1000      |  18 sec              |  15.4 sec          | 14 %
2000      |  65 sec              |  45 sec            | 30 %
4000      | 180 sec              | 117 sec            | 35 %

So even without improving the primitives themselves, it is possible to gain some performance.

I think this technique could be applied in some places inside the SystemC kernel. There are places in the kernel that already use arena allocators, but the technique is not applied systematically. (This is just my observation; probably the SystemC authors did benchmarking and implemented arenas only where it makes sense.)

