Roman Popov Posted November 2, 2018

It is common knowledge that RTL simulations in SystemC are slow compared to HDLs, because SystemC is just a library, while HDL simulators are optimizing compilers. I decided to run a small experiment to measure the difference in simulation performance quantitatively.

The simulated design is a pipeline of registers, without any logic between them. Each register is modeled as a separate process (SC_METHOD, SC_CTHREAD or always_ff). The design is artificial, since there is no designer-written logic, but it should allow measuring the efficiency of the simulator kernel.

I used 4 simulators:
- Accellera SystemC with registers modeled as SC_METHODs
- Accellera SystemC with registers modeled as SC_CTHREADs
- A commercial optimizing Verilog simulator that compiles HDL to C/C++. Afaik, this simulator is discrete-event, the same as SystemC.
- Verilator, a cycle-based simulator that compiles HDL to optimized C/C++.

Here are the results, simulating 20 ms with a 20 MHz clock. All simulation runtimes are in seconds.
Number of registers in pipeline | SystemC/SC_METHODs | SystemC/SC_CTHREADs | Commercial Verilog simulator | Verilator
                            100 |                1.3 |                 2.4 |                         0.85 |      0.2
                            200 |                2.7 |                 5.5 |                         1.75 |      0.28
                            500 |                8.9 |                17   |                         6.5  |      0.49
                           1000 |               18   |                46   |                        15.6  |      0.96
                           2000 |               65   |               159   |                        37    |      1.8
                           4000 |              180   |               428   |                        73    |      3.7
                           8000 |              920   |                 -   |                       133    |      7.4

*(I didn't run an SC_CTHREAD simulation for 8000 regs)

Here is the source code for the DUTs.

SystemC:

SC_MODULE(dff_reg) {
    sc_in<bool>      clk{"clk"};
    sc_in<bool>      rstn{"rstn"};
    sc_in<uint32_t>  data_in{"data_in"};
    sc_out<uint32_t> data_out{"data_out"};

    SC_CTOR(dff_reg) {
#ifndef CTHREAD_DUT
        SC_METHOD(proc);
        sensitive << clk.pos() << rstn.neg();
#else
        SC_CTHREAD(proc_thread, clk.pos());
        async_reset_signal_is(rstn, false);
#endif
    }

    void proc() {
        if (!rstn) {
            data_out = 0;
        } else {
            data_out = data_in;
        }
    }

    void proc_thread() {
        data_out = 0;
        wait();
        while (1) {
            data_out = data_in;
            wait();
        }
    }
};

SC_MODULE(dut) {
    sc_in<bool>      clk{"clk"};
    sc_in<bool>      rstn{"rstn"};
    sc_in<uint32_t>  data_in{"data_in"};
    sc_out<uint32_t> data_out{"data_out"};

    SC_CTOR(dut) {
        for (size_t i = 0; i < N_REGS; ++i) {
            dff_insts[i].clk(clk);
            dff_insts[i].rstn(rstn);
            dff_insts[i].data_in(data_io[i]);
            dff_insts[i].data_out(data_io[i+1]);
        }

        SC_METHOD(in_method);
        sensitive << data_in;

        SC_METHOD(out_method);
        sensitive << data_io[N_REGS];
    }

private:
    void in_method()  { data_io[0] = data_in; }
    void out_method() { data_out = data_io[N_REGS]; }

    sc_vector<dff_reg>             dff_insts{"dff_insts", N_REGS};
    sc_vector<sc_signal<uint32_t>> data_io{"data_io", N_REGS + 1};
};

Verilog:

module dff_reg (
    input  bit clk,
    input  bit rstn,
    input  int data_in,
    output int data_out
);
    always_ff @(posedge clk or negedge rstn) begin
        if (~rstn) begin
            data_out <= 0;
        end else begin
            data_out <= data_in;
        end
    end
endmodule

module dut (
    input  bit clk,
    input  bit rstn,
    input  int data_in,
    output int data_out
);
    int data_io[N_REGS + 1];

    assign data_io[0] = data_in;
    assign data_out = data_io[N_REGS];

    genvar i;
    generate
        for (i = 0; i < N_REGS; i = i + 1) begin: DFF_INST
            dff_reg D (
                .clk(clk),
                .rstn(rstn),
                .data_in(data_io[i]),
                .data_out(data_io[i+1])
            );
        end
    endgenerate
endmodule
Roman Popov Posted November 2, 2018

What is interesting here is that SystemC performance is close to the commercial HDL compiler for small and medium-size designs. But as the design grows beyond 1000 processes, SystemC starts to lose significantly.

I profiled the 4000-register SystemC design in VTune to see what happens inside. The hotspot analysis is not surprising: 90% of CPU time is consumed by sc_event::trigger, port reads/writes through virtual sc_interface methods, and sc_prim_channel_registry::perform_update. What is interesting is the microarchitectural analysis: 90% of the time, performance is bound by data cache misses. So as the design grows large, the only thing that matters is cache efficiency. And this is the weakness of SystemC, since SystemC design primitives are very expensive. For example, sizeof(sc_signal<bool>) is 216 bytes!
David Black Posted November 3, 2018

Is it really the size of the primitive, or the size of the data in play, that's the issue? Also, if you could relocate the data into close proximity, the caching would potentially be more efficient, since it is cache line size and utilization that have a large impact on performance. Perhaps a redesign of the primitives would change this for RTL, although it might take some serious rethinking to make it work, e.g. if construction of sc_signal<bool> invoked some pooling and data aggregation. I would bet a savvy C++ programmer could improve this. For what it's worth, I have had some ideas that might make gate-level sims possible in SystemC too.
Roman Popov Posted November 4, 2018

One more experiment, adding a combinational method to each register module. In the SystemC case this creates N additional events inside the signals (vs only the single clock event used in the original experiment).

SystemC source:

SC_MODULE(dff_reg) {
    sc_in<bool>      clk{"clk"};
    sc_in<bool>      rstn{"rstn"};
    sc_in<uint32_t>  data_in{"data_in"};
    sc_out<uint32_t> data_out{"data_out"};

    SC_CTOR(dff_reg) {
        SC_METHOD(proc);
        sensitive << clk.pos() << rstn.neg();

        SC_METHOD(preproc);
        sensitive << data_in;
    }

private:
    sc_signal<int> data_imp{"data_imp"};

    void preproc() {
        data_imp = data_in * 3 + 1;
    }

    void proc() {
        if (!rstn) {
            data_out = 0;
        } else {
            data_out = data_imp;
        }
    }

    void proc_thread() {
        data_out = 0;
        wait();
        while (1) {
            data_out = data_imp;
            wait();
        }
    }
};

Verilog code:

module dff_reg (
    input  bit clk,
    input  bit rstn,
    input  int data_in,
    output int data_out
);
    int data_imp;

    always_comb begin
        data_imp = data_in * 3 + 1;
    end

    always_ff @(posedge clk or negedge rstn) begin
        if (~rstn) begin
            data_out <= 0;
        end else begin
            data_out <= data_imp;
        end
    end
endmodule

Simulation results (runtime in seconds):

Number of registers | SystemC (2 SC_METHODs) | Commercial Verilog simulator
                100 |                    2.8 |                          1.1
                200 |                    7   |                          2.3
                300 |                   10   |                          4.5
                500 |                   20   |                         10.5
               1000 |                   40   |                         31
               2000 |                  164   |                         52

In this case the Verilog compiler leads considerably even on small designs. Probably, like Verilator, it does some static scheduling to reduce the number of events in the optimized build.
Roman Popov Posted November 4, 2018

On 11/2/2018 at 7:39 PM, David Black said:
> Is it really the size of the primitive, or the size of the data in play that's the issue? Also, if you could relocate the data to in close proximity to one another, then the caching would be potentially more efficient since it is the cache line size and utilization that has a large impact on performance.

Yes, intuitively both locality and cache line utilization should have a significant impact on performance.

> Perhaps a redesign of primitives would change this for RTL; although, it might take some serious rethink to make it work. If construction of sc_signal<bool> invoked some pooling and data aggregation. I would bet a saavy C++ programmer could improve this.

I think it would be possible to make a faster C++ library for RTL simulation if we designed it from scratch today. However, in SystemC's case it would be hard to improve performance without breaking backwards compatibility.

But the question is, do we really want to simulate large-scale RTL designs using a library like SystemC? I don't think there is much demand for this. If you are using SystemC for design, then you have a tool that converts it into Verilog, and after that conversion you can use a Verilog compiler or Verilator for large simulations. This flow is already utilized, for example, by Chisel and RocketChip (https://github.com/freechipsproject/rocket-chip#emulator): they generate Verilog from Scala and use Verilator for simulations. Using Verilator with SystemC is even easier, since Verilator can generate a SystemC wrapper that is easily integrated into a SystemC-based verification environment.
Roman Popov Posted November 4, 2018

One more experiment with locality: I replaced the sc_vectors in the original example with contiguous storage:

SC_MODULE(dut) {
    sc_in<bool>      clk{"clk"};
    sc_in<bool>      rstn{"rstn"};
    sc_in<uint32_t>  data_in{"data_in"};
    sc_out<uint32_t> data_out{"data_out"};

    SC_CTOR(dut) {
        dff_insts = static_cast<dff_reg *>(
            ::operator new(sizeof(dff_reg) * N_REGS));
        for (size_t i = 0; i < N_REGS; ++i) {
            new (dff_insts + i) dff_reg(sc_gen_unique_name("dff_reg"));
        }

        data_io = static_cast<sc_signal<uint32_t> *>(
            ::operator new(sizeof(sc_signal<uint32_t>) * (N_REGS + 1)));
        for (size_t i = 0; i < N_REGS + 1; ++i) {
            new (data_io + i) sc_signal<uint32_t>(sc_gen_unique_name("data_io"));
        }

        for (size_t i = 0; i < N_REGS; ++i) {
            dff_insts[i].clk(clk);
            dff_insts[i].rstn(rstn);
            dff_insts[i].data_in(data_io[i]);
            dff_insts[i].data_out(data_io[i+1]);
        }

        SC_METHOD(in_method);
        sensitive << data_in;

        SC_METHOD(out_method);
        sensitive << data_io[N_REGS];
    }

private:
    void in_method()  { data_io[0] = data_in; }
    void out_method() { data_out = data_io[N_REGS]; }

    dff_reg *dff_insts;
    sc_signal<uint32_t> *data_io;
    // sc_vector<dff_reg> dff_insts{"dff_insts", N_REGS};
    // sc_vector<sc_signal<uint32_t>> data_io{"data_io", N_REGS + 1};
};

Here are the results (runtime in seconds):

Number of regs | Original (sc_vector) | Contiguous storage | Improvement over original
           100 |              1.3 sec |            1.0 sec |                      23 %
           200 |              2.7 sec |            2.3 sec |                      14 %
           500 |              8.9 sec |            7.1 sec |                      20 %
          1000 |               18 sec |           15.4 sec |                      14 %
          2000 |               65 sec |             45 sec |                      30 %
          4000 |              180 sec |            117 sec |                      35 %

So even without improving the primitives it is possible to gain some performance. I think this technique could be applied internally to some places inside the SystemC kernel. There are places inside the SystemC kernel that already utilize arena allocators, but the technique is not applied systematically. (This is just my observation; probably the SystemC authors did benchmarking and implemented arenas only where it makes sense.)