Roman Popov

Everything posted by Roman Popov

  1. Roman Popov

    What is SystemC library special in?

    In that case you will have to specify library paths manually.
  2. Roman Popov

    SC_THREADS not starting?

    This is a definition:

```cpp
void StimGen(void) {}
```

    A declaration should look like this:

```cpp
void StimGen(void);
```
  3. Roman Popov

    SC_THREADS not starting?

    Your code is wrong, and you should get a compiler error, because you have two definitions of stimuli::StimGen. My guess is that you have not added stimuli.cpp to your project file, so it is just ignored. As a result you have an empty process:

```cpp
void StimGen(void) { /*Do nothing*/ }
```
  4. Roman Popov

    SC_THREADS not starting?

    Can you provide the complete source code?
  5. The link on the main page is wrong: it leads to UVM downloads...
  6. It is common knowledge that RTL simulations in SystemC are slow compared to HDLs, because SystemC is just a library, while HDL simulators are optimizing compilers. I decided to experiment a little to measure the difference in simulation performance quantitatively.

    The simulated design is a pipeline of registers, without any logic between them. Each register is modeled as a separate process (SC_METHOD, SC_CTHREAD or always_ff). This design is artificial, since there is no designer-written logic, but it should allow measuring simulator kernel efficiency.

    I used 4 simulators:

    - Accellera SystemC with registers modeled as SC_METHODs
    - Accellera SystemC with registers modeled as SC_CTHREADs
    - A commercial optimizing Verilog simulator that compiles HDL to C/C++. Afaik this simulator is discrete-event, the same as SystemC.
    - Verilator, a cycle-based simulator that compiles HDL to optimized C/C++.

    Here are the results. Simulating 20 ms, clock is 20 MHz. All simulation runtimes are in seconds.

Number of registers in pipeline | SystemC/SC_METHODs | SystemC/SC_CTHREADs | Commercial Verilog simulator | Verilator
100  | 1.3 | 2.4 | 0.85 | 0.2
200  | 2.7 | 5.5 | 1.75 | 0.28
500  | 8.9 | 17  | 6.5  | 0.49
1000 | 18  | 46  | 15.6 | 0.96
2000 | 65  | 159 | 37   | 1.8
4000 | 180 | 428 | 73   | 3.7
8000 | 920 | -   | 133  | 7.4

    (I didn't run an SC_CTHREAD simulation for 8000 regs.)

    Here is the source code for the DUTs.

    SystemC:

```cpp
SC_MODULE(dff_reg) {
    sc_in<bool>      clk{"clk"};
    sc_in<bool>      rstn{"rstn"};
    sc_in<uint32_t>  data_in{"data_in"};
    sc_out<uint32_t> data_out{"data_out"};

    SC_CTOR(dff_reg) {
#ifndef CTHREAD_DUT
        SC_METHOD(proc);
        sensitive << clk.pos() << rstn.neg();
#else
        SC_CTHREAD(proc_thread, clk.pos());
        async_reset_signal_is(rstn, false);
#endif
    }

    void proc() {
        if (!rstn) {
            data_out = 0;
        } else {
            data_out = data_in;
        }
    }

    void proc_thread() {
        data_out = 0;
        wait();
        while (1) {
            data_out = data_in;
            wait();
        }
    }
};

SC_MODULE(dut) {
    sc_in<bool>      clk{"clk"};
    sc_in<bool>      rstn{"rstn"};
    sc_in<uint32_t>  data_in{"data_in"};
    sc_out<uint32_t> data_out{"data_out"};

    SC_CTOR(dut) {
        for (size_t i = 0; i < N_REGS; ++i) {
            dff_insts[i].clk(clk);
            dff_insts[i].rstn(rstn);
            dff_insts[i].data_in(data_io[i]);
            dff_insts[i].data_out(data_io[i + 1]);
        }

        SC_METHOD(in_method);
        sensitive << data_in;

        SC_METHOD(out_method);
        sensitive << data_io[N_REGS];
    }

private:
    void in_method()  { data_io[0] = data_in; }
    void out_method() { data_out = data_io[N_REGS]; }

    sc_vector<dff_reg>             dff_insts{"dff_insts", N_REGS};
    sc_vector<sc_signal<uint32_t>> data_io{"data_io", N_REGS + 1};
};
```

    Verilog:

```verilog
module dff_reg (
    input  bit clk,
    input  bit rstn,
    input  int data_in,
    output int data_out
);

    always_ff @(posedge clk or negedge rstn) begin
        if (~rstn) begin
            data_out <= 0;
        end else begin
            data_out <= data_in;
        end
    end

endmodule

module dut (
    input  bit clk,
    input  bit rstn,
    input  int data_in,
    output int data_out
);

    int data_io[N_REGS + 1];

    assign data_io[0] = data_in;
    assign data_out = data_io[N_REGS];

    genvar i;
    generate
        for (i = 0; i < N_REGS; i = i + 1) begin: DFF_INST
            dff_reg D (
                .clk(clk),
                .rstn(rstn),
                .data_in(data_io[i]),
                .data_out(data_io[i + 1])
            );
        end
    endgenerate

endmodule
```
  7. Roman Popov

    Benchmarking RTL simulation with SystemC

    One experiment with locality: I replaced the sc_vectors in the original example with contiguous storage:

```cpp
SC_MODULE(dut) {
    sc_in<bool>      clk{"clk"};
    sc_in<bool>      rstn{"rstn"};
    sc_in<uint32_t>  data_in{"data_in"};
    sc_out<uint32_t> data_out{"data_out"};

    SC_CTOR(dut) {
        dff_insts = static_cast<dff_reg *>(
            ::operator new(sizeof(dff_reg) * N_REGS));
        for (size_t i = 0; i < N_REGS; ++i) {
            new (dff_insts + i) dff_reg(sc_gen_unique_name("dff_reg"));
        }

        data_io = static_cast<sc_signal<uint32_t> *>(
            ::operator new(sizeof(sc_signal<uint32_t>) * (N_REGS + 1)));
        for (size_t i = 0; i < N_REGS + 1; ++i) {
            new (data_io + i) sc_signal<uint32_t>(sc_gen_unique_name("data_io"));
        }

        for (size_t i = 0; i < N_REGS; ++i) {
            dff_insts[i].clk(clk);
            dff_insts[i].rstn(rstn);
            dff_insts[i].data_in(data_io[i]);
            dff_insts[i].data_out(data_io[i + 1]);
        }

        SC_METHOD(in_method);
        sensitive << data_in;

        SC_METHOD(out_method);
        sensitive << data_io[N_REGS];
    }

private:
    void in_method()  { data_io[0] = data_in; }
    void out_method() { data_out = data_io[N_REGS]; }

    dff_reg             *dff_insts;
    sc_signal<uint32_t> *data_io;
    // sc_vector<dff_reg>             dff_insts{"dff_insts", N_REGS};
    // sc_vector<sc_signal<uint32_t>> data_io{"data_io", N_REGS + 1};
};
```

    Here are the results (runtime in seconds):

Number of regs | Original (sc_vector) | Contiguous storage | Improvement over original
100  | 1.3 | 1.0  | 23 %
200  | 2.7 | 2.3  | 14 %
500  | 8.9 | 7.1  | 20 %
1000 | 18  | 15.4 | 14 %
2000 | 65  | 45   | 30 %
4000 | 180 | 117  | 35 %

    So even without improving the primitives themselves it is possible to gain some performance. I think this technique could be applied in some places inside the SystemC kernel. There are places inside the SystemC kernel that utilize arena allocators, but the technique is not applied systematically. (This is my observation; probably the SystemC authors did benchmarking and implemented arenas only where it makes sense.)
  8. Roman Popov

    Benchmarking RTL simulation with SystemC

    Yes, intuitively both locality and cache line utilization should have a significant impact on performance. I think it would be possible to make a faster C++ library for RTL simulation if we were to design it from scratch today. In SystemC's case, however, it would be hard to improve performance without breaking backwards compatibility.

    But the question is: do we really want to simulate large-scale RTL designs using a library like SystemC? I don't think there is much demand for this. If you are using SystemC for design, then you have a tool that converts it into Verilog, and after that conversion you can use a Verilog compiler or Verilator for large simulations. This flow is already utilized, for example, by Chisel and RocketChip (https://github.com/freechipsproject/rocket-chip#emulator): they generate Verilog from Scala and use Verilator for simulations. Using Verilator with SystemC is even easier, since Verilator can generate a SystemC wrapper that is easily integrated into a SystemC-based verification environment.
  9. Roman Popov

    Benchmarking RTL simulation with SystemC

    One more experiment: adding a combinational method to each register module. In the SystemC case this creates N additional events inside the signals (vs only the 1 clock event used in the original experiment).

    SystemC source:

```cpp
SC_MODULE(dff_reg) {
    sc_in<bool>      clk{"clk"};
    sc_in<bool>      rstn{"rstn"};
    sc_in<uint32_t>  data_in{"data_in"};
    sc_out<uint32_t> data_out{"data_out"};

    SC_CTOR(dff_reg) {
        SC_METHOD(proc);
        sensitive << clk.pos() << rstn.neg();

        SC_METHOD(preproc);
        sensitive << data_in;
    }

private:
    sc_signal<int> data_imp{"data_imp"};

    void preproc() { data_imp = data_in * 3 + 1; }

    void proc() {
        if (!rstn) {
            data_out = 0;
        } else {
            data_out = data_imp;
        }
    }

    void proc_thread() {
        data_out = 0;
        wait();
        while (1) {
            data_out = data_imp;
            wait();
        }
    }
};
```

    Verilog code:

```verilog
module dff_reg (
    input  bit clk,
    input  bit rstn,
    input  int data_in,
    output int data_out
);

    int data_imp;

    always_comb begin
        data_imp = data_in * 3 + 1;
    end

    always_ff @(posedge clk or negedge rstn) begin
        if (~rstn) begin
            data_out <= 0;
        end else begin
            data_out <= data_imp;
        end
    end

endmodule
```

    Simulation results (runtime in seconds):

Number of registers | SystemC (2 SC_METHODs) | Commercial Verilog simulator
100  | 2.8 | 1.1
200  | 7   | 2.3
300  | 10  | 4.5
500  | 20  | 10.5
1000 | 40  | 31
2000 | 164 | 52

    In this case the Verilog compiler leads considerably even on small designs. Probably, like Verilator, it does some static scheduling to reduce the number of events in an optimized build.
  10. Roman Popov

    Benchmarking RTL simulation with SystemC

    What is interesting here is that SystemC performance is close to the commercial HDL compiler for small to medium size designs. But as the design grows beyond 1000 processes, SystemC starts to lose significantly. I profiled the 4000-register SystemC design in VTune to see what happens inside. The hotspot analysis is not surprising: 90% of CPU time is consumed by sc_event::trigger, port reads/writes through virtual sc_interface methods, and sc_prim_channel_registry::perform_update. What is interesting is the micro-architectural analysis: 90% of the time, performance is bound by data cache misses. So as the design grows large, the only important thing is cache efficiency. And this is the weakness of SystemC, since SystemC design primitives are very expensive. For example, sizeof(sc_signal<bool>) is 216 bytes!
  11. Roman Popov

    Sensitive List

    I usually have a project-wide header that includes systemc.h together with some utilities like this one, and I include it everywhere instead of systemc.h. You should not modify systemc.h itself, since it is third-party code.
  12. Roman Popov

    Sensitive List

    The compiler tells you exactly what is wrong: you can't make a process sensitive to a plain (non-channel) variable, because a plain variable has no value-changed event. Here is the list of overloads:

```cpp
sc_sensitive& operator << ( const sc_event& );
sc_sensitive& operator << ( const sc_interface& );
sc_sensitive& operator << ( const sc_port_base& );
sc_sensitive& operator << ( sc_event_finder& );
```

    So you can make a process sensitive to an event, to sc_interface and its derived classes (sc_signal for example), to a port, and to an sc_event_finder (which is what sc_in<bool>::pos() returns).

    Probably what you really want is an sc_vector of signals:

```cpp
sc_vector<sc_signal<sc_fixed<din_size, din_int>>> in_val{"in_val", din_num_samples};
```

    This will work. Also, making a process sensitive to all elements in a vector is such a common thing that I recommend creating a special overload for this case. For example:

```cpp
template <typename T>
sc_sensitive& operator << ( sc_sensitive& sensitive, const sc_vector<T>& vec ) {
    for (auto & el : vec)
        sensitive << el;
    return sensitive;
}

struct dut : sc_module {
    sc_vector<sc_in<int>> in_vector{"in_vector", N_PORTS};

    SC_CTOR(dut) {
        SC_METHOD(test_method);
        sensitive << in_vector;
    }

    void test_method();
};
```
  13. Roman Popov

    How to connect array of ports?

    Usually you should use sc_vector instead of a plain array. The problem with arrays is that the names of the signals and ports in an array will be initialized to meaningless values (port_0, port_1, ...). If you still want to use arrays, then bind them in a loop. Here is an example, both for an array and for sc_vector:

```cpp
#include <systemc.h>

static const int N_PORTS = 4;

struct dut : sc_module {
    sc_in<int> in_array[N_PORTS];
    sc_vector<sc_in<int>> in_vector{"in_vector", N_PORTS};
    SC_CTOR(dut) {}
};

struct test : sc_module {
    dut d0{"d0"};

    sc_signal<int> sig_array[N_PORTS];
    sc_vector<sc_signal<int>> sig_vector{"sig_vector", N_PORTS};

    SC_CTOR(test) {
        for (size_t i = 0; i < N_PORTS; ++i) {
            d0.in_array[i](sig_array[i]);
        }
        d0.in_vector(sig_vector);
    }
};
```
  14. There are a couple of commercial implementations, but I have not seen any open-source ones.
  15. Roman Popov

    Method sensitive with sc_inout port

    In your code sample both the Initiator and Target SC_METHODs are sensitive to the value-changed event of the signal. So when the signal value changes, both of them are triggered. Why did you expect a different outcome?
  16. Neither do I. From my point of view, making the stack for SC_THREAD executable makes no sense. Probably this was a requirement for some outdated systems and needs to be revisited.
  17. This is strange. Does it give any hint where execution jumps to the heap? The forkjoin test is very simple and should not generate any code on the heap.
  18. Roman Popov

    Seeking Feedback on Datatypes

    Same experience on our side. We are using AC types for CNN inference engine modeling. The problem with sc_dt:: types is not only performance, but also semantics. For example, the sc_int and sc_uint types are limited to 64 bits in width. Probably that makes sense for software modeling, but for HLS it is not useful at all. And then you get these weird things happening with wrap-around at 64 bits:

```cpp
sc_int<64> a, b;
sc_bigint<128> m;
m = a * b; // result is computed with 64-bit wraparound and then cast to 128 bits
```

    It's the Apache license (https://github.com/hlslibs/ac_types/blob/master/LICENSE), the same as SystemC. And it was proposed for synthesis at SystemC Evolution Day in 2017: http://www.accellera.org/images/activities/working-groups/S3._Datatypes.1.pdf. So probably the only thing still required is tool support from vendors other than Mentor.
  19. Roman Popov

    Method sensitive with sc_inout port

    I don't quite understand what your problem is. I suggest you run the simulation and check whether it works as you expected. If it does not, try to reformulate the problem as "I expected (something), but I got (something else) in simulation".
  20. SystemC is single-threaded; you don't need std::mutex. However, SystemC does not guarantee any order for processes executed in the same delta cycle. SystemC is a discrete-event simulator, and I suggest you learn the concepts of its simulation semantics from Clause 4, "Elaboration and simulation semantics", of the IEEE 1666 SystemC standard. The purpose of primitive channels like sc_signal is to remove non-determinism by separating the channel update request from the actual update of the value into two different simulation phases. But this only works if there is a single writing process during a delta cycle. In your case, if the initiator and responder threads are executed in the same delta cycle, they both will read the same "current value" of the signal. The initiator will request to set the signal value to (valid = 1, ready = 0) and the responder will request to set it to (valid = 0, ready = 1). Since there is no guarantee on order between processes, you will get either (1,0) or (0,1) on the next cycle.
  21. Roman Popov

    tlm_fifo nb_peek

    2.3.3 will be released next week without the fix, so most likely it will go into the next release after 2.3.3.
  22. Those questions are covered in detail in sections 14.1 and 14.2 of the SystemC standard; I can't answer in a better way. TLM-2.0 simulations are not cycle-accurate, so you don't have clock edge events. In the AT modeling style you should call wait(delay) after each transport call. In the LT modeling style all initiators are temporally decoupled and can run ahead of simulation time, usually by a globally specified time quantum. For debugging you can use the same techniques as in cycle-accurate modeling:

    - Source-level debugging with breakpoints and stepping
    - Transaction and signal tracing
    - Logging

    Compared with RTL, debugging using waveforms won't be that effective, because in AT/LT modeling the state of the model can change significantly in a single simulator/waveform step. Usually the preferred way is a combination of logging and source-level debugging. Debugging TLM models is harder than debugging RTL. Also, C++ is much more complex and error-prone compared to VHDL/Verilog.
  23. The TLM payload is used for untyped raw data transfers; the data format is usually a property of the device. Let's consider an example: the initiator is a CPU model, and the target is a convolution filter accelerator. The accelerator accepts a 2-D matrix (2-D array) of coefficients as an input. The documentation of the accelerator must specify a binary format for the data, for example: coefficients are stored in row-major order, and each coefficient is an 8-byte signed integer. Using this documentation, the initiator converts the 2-D array into the raw data of the TLM payload, and the device model converts the raw data back into a 2-D array. This is how it is usually done.
  24. Roman Popov

    tlm_fifo nb_peek

    Thanks for the report, I will put this into the SystemC bug tracker. Until it is fixed, you can use a workaround:

```cpp
target_port->tlm_get_peek_if<int>::nb_peek(b);
```
  25. Hi johnmac,

    What you are trying to do is conceptually wrong: Ready and Valid should be two separate signals. The initiator drives the Valid signal, and the responder drives the Ready. There is an example of a ready-valid channel that comes together with SystemC; check systemc/examples/sysc/2.3/sc_rvd. If your final goal is synthesizable code, then both Mentor and Cadence synthesis tools already have it implemented in their libraries; check the vendor documentation. If you still want to simulate your design as-is, you can try using SC_UNCHECKED_WRITERS instead of SC_MANY_WRITERS. This will disable the check for multiple drivers in a single delta cycle.