MehdiT
Posted July 29, 2015

My top level is a Network-on-Chip (NoC) with a grid of 15*15 nodes (router + PE). I have been trying to simulate it on different machines and configurations, but the simulation keeps stopping at different points. I can't see what causes the problem, so I am stuck.

1) Running only the SystemC / C++ executable (the compiler output) on an LSF cluster. The simulation runs normally with the expected output, then stops at around cycle 55,000 (cycle-accurate model) with this error message:

    noc_exe: ../../../../src/sysc/kernel/sc_cor_qt.cpp:107: virtual void sc_core::sc_cor_qt::stack_protect(bool): Assertion `ret == 0' failed.
    /home/#######/.lsbatch/1438183854.772657: line 8: 15547 Aborted (core dumped) ./noc_exe

2) Running with the Cadence irun command on an LSF cluster. The simulation was many times slower but reached around 100,000 cycles before generating this error message:

    Simulation interrupted at 1025080 NS + 0
    ncsim> ncsim: *W,NCTERM: Simulation received SIGTERM signal from process 22268, user id 0 (/env/seki/app/lsf/8.0/etc/sbatchd).
    make: *** [run] Error 15

I looked up the NCTERM error with nchelp and got:

    nchelp: 14.20-s010: (c) Copyright 1995-2015 Cadence Design Systems, Inc.
    ncsim/NCTERM =
        A SIGTERM signal was received by the running simulation. This signal may have been issued due to various reasons:
        * sent by the user using the kill command
        * machine on which the job was running went down
        * sent by LSF (Load Sharing Facility) to enforce certain user specified job control limits (memory, CPU, swap, etc.)

I suspected that the stack size might not be enough for my threads, but the failures in 1) and 2) still occurred after I increased the stack size. In 2), it is enough to add the -SC_THREAD_STACKSIZE 0x80000 switch to the irun command. In 1), I had to go to every thread registration in my constructors and append another line:

    SC_THREAD(controller_thread);
    set_stack_size(NOC_THREAD_STACK_SIZE); // 0x80000

I'd appreciate any prompt reply.

When I run the same test with a smaller number of nodes, the issue does not occur.
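For a rough sense of scale, the figures in the post above (15*15 nodes, 0x80000-byte stacks, failure near cycle 55,000) can be turned into a back-of-envelope estimate of how much address space thread stacks would reserve. This is only a sketch: one static thread per node and one spawned thread per node per cycle are assumptions made for illustration, not measurements from the model, and the second figure only applies if the stacks of finished threads were never reclaimed.

```cpp
#include <cstdint>

// Figures taken from the post above.
constexpr std::uint64_t kNodes      = 15 * 15;   // 225 routers
constexpr std::uint64_t kStackBytes = 0x80000;   // 512 KiB per thread
constexpr std::uint64_t kCycles     = 55000;     // approximate failure point

// Bytes of stack reserved by `threads` threads of `stack_bytes` each.
constexpr std::uint64_t stack_budget_bytes(std::uint64_t threads,
                                           std::uint64_t stack_bytes) {
    return threads * stack_bytes;
}

// Assumption: one static thread per node.
constexpr std::uint64_t kStaticBytes =
    stack_budget_bytes(kNodes, kStackBytes);

// Assumption: one spawn per node per cycle, and no stack is ever released.
constexpr std::uint64_t kUnreclaimedBytes =
    stack_budget_bytes(kNodes * kCycles, kStackBytes);
```

Under these assumptions the static threads reserve 112.5 MiB, while unreclaimed dynamic threads would reserve roughly 5.9 TiB by cycle 55,000, which is why it matters whether finished threads actually give their stacks back.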
apfitch
Posted July 29, 2015

It might just be the program's stack. Are you declaring any large arrays? If so, it might help to replace them with dynamically allocated memory. Are you also allocating the modules on the stack? I would try to use dynamic allocation wherever possible and see if it helps. Also, are there any limits set by your OS? Try using ulimit or limits to find out.

Alan
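The limits that ulimit reports can also be read from inside the executable, which is handy on a batch cluster where the job's environment may differ from an interactive shell. The following is a minimal sketch using the POSIX getrlimit call; the helper names soft_limit, print_limit and print_limits are made up for this example.

```cpp
#include <sys/resource.h>
#include <cstdio>

// Reads the soft limit for `resource` into `out`. Returns false if the
// query fails. A value of RLIM_INFINITY means "unlimited".
bool soft_limit(int resource, rlim_t& out) {
    struct rlimit lim {};
    if (getrlimit(resource, &lim) != 0) return false;
    out = lim.rlim_cur;
    return true;
}

// Prints one limit (in bytes) in a ulimit-like format.
void print_limit(const char* name, int resource) {
    rlim_t value = 0;
    if (!soft_limit(resource, value)) {
        std::printf("%-13s <query failed>\n", name);
    } else if (value == RLIM_INFINITY) {
        std::printf("%-13s unlimited\n", name);
    } else {
        std::printf("%-13s %llu bytes\n", name,
                    static_cast<unsigned long long>(value));
    }
}

// Prints the limits most relevant to stack and memory exhaustion.
void print_limits() {
    print_limit("RLIMIT_STACK", RLIMIT_STACK);  // main program stack
    print_limit("RLIMIT_AS", RLIMIT_AS);        // total address space
    print_limit("RLIMIT_DATA", RLIMIT_DATA);    // data segment / heap
}
```

Calling print_limits() at the start of sc_main would put the effective limits of the LSF job into the simulation log.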
MehdiT (Author)
Posted July 29, 2015

Thanks Alan. In fact, all my modules are instantiated with "new", and I allocate memory dynamically everywhere to avoid using large arrays. I investigated the problem further with gdb and found that it may be related to the fact that I spawn a lot of dynamic threads. Here is the output of (gdb) backtrace:

    noc_exe: ../../../../src/sysc/kernel/sc_cor_qt.cpp:107: virtual void sc_core::sc_cor_qt::stack_protect(bool): Assertion `ret == 0' failed.
    [Thread debugging using libthread_db enabled]
    Program received signal SIGABRT, Aborted.
    0x00000038b5e328a5 in raise () from /lib64/libc.so.6
    Missing separate debuginfos, use: debuginfo-install boost-program-options-1.41.0-17.el6_4.x86_64 glibc-2.12-1.107.el6.x86_64 libgcc-4.4.7-3.el6.x86_64 libstdc++-4.4.7-3.el6.x86_64
    (gdb) backtrace
    #0 0x00000038b5e328a5 in raise () from /lib64/libc.so.6
    #1 0x00000038b5e34085 in abort () from /lib64/libc.so.6
    #2 0x00000038b5e2ba1e in __assert_fail_base () from /lib64/libc.so.6
    #3 0x00000038b5e2bae0 in __assert_fail () from /lib64/libc.so.6
    #4 0x00002aaaaadaa5b0 in sc_core::sc_cor_qt::stack_protect(bool) () from /home/########/systemc-2.3.1/lib-linux64/libsystemc-2.3.1.so
    #5 0x00002aaaaadbf9dc in sc_core::sc_simcontext::create_thread_process(char const*, bool, void (sc_core::sc_process_host::*)(), sc_core::sc_process_host*, sc_core::sc_spawn_options const*) () from /home/emehtaa/systemc-2.3.1/lib-linux64/libsystemc-2.3.1.so
    #6 0x0000000000428bb6 in sc_core::sc_spawn<sc_boost::_bi::bind_t<int, sc_boost::_mfi::mf2<int, credit_ctrl, unsigned int, noc_trans_t*>, sc_boost::_bi::list3<sc_boost::_bi::value<credit_ctrl*>, sc_boost::_bi::value<unsigned int>, sc_boost::_bi::value<noc_trans_t*> > > > (object=..., name_p=0x47dca6 "credit_spawned_f", opt_p=0x129ac77f0) at /home/emehtaa/systemc-2.3.1/include/sysc/kernel/sc_spawn.h:118
    #7 0x0000000000424c5d in credit_ctrl::credit_init_thread (this=0x29fe350) at ./src/credit_ctrl.cpp:240
    #8 0x00002aaaaadc5d36 in sc_core::sc_thread_cor_fn(void*) () from /home/########/systemc-2.3.1/lib-linux64/libsystemc-2.3.1.so
    #9 0x00002aaaaadcd2c1 in qt_blocki () from /home/#########/systemc-2.3.1/lib-linux64/libsystemc-2.3.1.so

At each node (router), there is a credit_controller module that spawns dynamic threads almost every cycle from a main thread, credit_init_thread():

    sc_spawn(sc_bind(&credit_ctrl::transaction_track_thread, this, current_phase_index, trans_ptr));

After spawning this child thread, the main thread continues processing the next transaction. The child thread transaction_track_thread is defined as a member function of the same module, credit_ctrl:

    void credit_ctrl::transaction_track_thread(unsigned phase_index, noc_trans_t* noc_trans)
    {
        // do some stuff
    }

Although I would expect a child thread to terminate once it finishes executing, I suspect the threads are still around. After some simulation time, the huge number of unterminated threads would then cause the above issue. It is only a guess though. I tried passing a name for each spawned thread as an argument to sc_spawn, only to get runtime warnings saying that an object with the same name already exists and that the new one will be renamed. Not good. If the child thread terminates, its name should disappear as well. I don't know what I am missing here.
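The backtrace ends in sc_cor_qt::stack_protect, which, as far as I can tell from the SystemC 2.3.1 sources, marks a guard page ("red zone") at the end of each thread stack with mprotect and asserts that the call returned 0. One possible cause, which is a hypothesis and not something confirmed in this thread, is that mprotect fails with ENOMEM once the process holds too many distinct memory mappings (on Linux the cap is the vm.max_map_count sysctl), because every guarded stack splits a mapping into several pieces. The sketch below is standalone code written for this illustration, not the library's implementation: it creates one guarded stack and counts the process's mappings so the growth can be observed.

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cerrno>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <fstream>
#include <string>

// Reserves `stack_bytes` of anonymous memory and turns its lowest page into
// an inaccessible guard page. Returns the base address, or nullptr on failure.
void* alloc_guarded_stack(std::size_t stack_bytes) {
    void* base = mmap(nullptr, stack_bytes, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        std::fprintf(stderr, "mmap failed: %s\n", std::strerror(errno));
        return nullptr;
    }
    const long page = sysconf(_SC_PAGESIZE);
    // This is the kind of call whose return value the failing assertion checks.
    if (mprotect(base, static_cast<std::size_t>(page), PROT_NONE) != 0) {
        std::fprintf(stderr, "mprotect failed: %s\n", std::strerror(errno));
        munmap(base, stack_bytes);
        return nullptr;
    }
    return base;
}

// Releases a stack obtained from alloc_guarded_stack.
void free_guarded_stack(void* base, std::size_t stack_bytes) {
    munmap(base, stack_bytes);
}

// Linux-specific: number of memory mappings currently held by this process
// (one line per mapping in /proc/self/maps). Returns 0 if /proc is absent.
std::size_t count_mappings() {
    std::ifstream maps("/proc/self/maps");
    std::size_t n = 0;
    for (std::string line; std::getline(maps, line);) ++n;
    return n;
}
```

Logging count_mappings() periodically during the simulation and comparing it with the value in /proc/sys/vm/max_map_count would show whether the failure coincides with that limit.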
apfitch
Posted July 29, 2015

I don't know if your threads have to be SC_THREADs (i.e. if they have to wait), but if they don't, you could spawn SC_METHODs instead. The other thing that might be worth trying is to explicitly kill the threads when they are no longer needed. I've no idea if that will help, but perhaps it might cause the kernel to properly remove the threads?

regards
Alan
MehdiT (Author)
Posted July 30, 2015

Hi Alan,

There are many reasons why I can't just migrate to SC_METHODs, among them the use of wait statements. Killing the spawned threads sounds reasonable, although I thought this would be taken care of automatically. Do you mean I should use an sc_process_handle, wait for the thread to terminate, and then kill it with the handle's kill() method?
apfitch
Posted July 30, 2015

Yes, but I'm just speculating that it might help!

Alan
MehdiT (Author)
Posted August 4, 2015

Terminating the threads after they finish executing didn't help. I went back to the part of my code where I spawn dynamic threads and rewrote it. No processes are created dynamically in my code any more, and the problem seems to be solved. From this, I gather that it is perhaps not a good idea to spawn dynamic threads in large SystemC simulations.

I also noticed that RAM usage grows exponentially when I use dynamic threads (with and without manual termination). It could be because I haven't done it correctly, though (see code above). Without dynamic threads, RAM usage is stable (it scales linearly with the number of static threads in the simulation).

However, the question of what actually makes the code crash remains unanswered. I am posting this as a workaround for others who face a similar issue in the future.
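To compare memory growth with and without dynamic threads using numbers rather than impressions, the process can log its own peak resident set size at regular intervals. This is a minimal sketch using the POSIX getrusage call; on Linux ru_maxrss is reported in kilobytes (other systems may use different units), and the helper names peak_rss_kib and log_memory are made up for this example.

```cpp
#include <sys/resource.h>
#include <cstdio>

// Peak resident set size of this process in KiB (Linux units),
// or -1 if the query fails.
long peak_rss_kib() {
    struct rusage usage {};
    if (getrusage(RUSAGE_SELF, &usage) != 0) return -1;
    return usage.ru_maxrss;
}

// Hypothetical helper: call it every N cycles from a monitoring process
// and plot the output to see whether growth is linear or faster.
void log_memory(unsigned long cycle) {
    std::printf("cycle %lu: peak RSS %ld KiB\n", cycle, peak_rss_kib());
}
```

Calling log_memory from a single static monitoring thread, for example every 1000 cycles, keeps the overhead negligible and gives a curve that can be compared between the two versions of the model.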
sudha
Posted August 5, 2015

I guess one of the arguments passed to sc_spawn was destroyed or no longer available. It is important that the arguments passed to sc_spawn remain valid throughout the lifetime of the spawned process. If not, it will cause the return value to be zero or a segmentation fault (because sc_spawn attempts to access invalid storage).
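The lifetime concern can be illustrated without SystemC. Frame #6 of the backtrace shows the bound arguments stored as sc_boost::_bi::value<noc_trans_t*>, that is, the raw pointer is copied into the bound object but the transaction it points to is not. The sketch below is a plain C++ analogue using std::bind, not SystemC code: Transaction, track, pending and schedule are hypothetical names, and the shared_ptr is one possible way to keep the payload alive until the deferred call runs.

```cpp
#include <functional>
#include <memory>
#include <vector>

// Hypothetical stand-in for noc_trans_t.
struct Transaction {
    unsigned id;
};

// Stand-in for a process body that runs some time after it was scheduled.
unsigned track(unsigned phase_index, const std::shared_ptr<Transaction>& trans) {
    return phase_index + trans->id;
}

// Deferred calls, standing in for processes the simulator has yet to run.
std::vector<std::function<unsigned()>> pending;

void schedule(unsigned phase_index) {
    // The shared_ptr is copied into the bound callable, so the Transaction
    // outlives this function. Binding the address of a local Transaction
    // instead would leave the deferred call with a dangling pointer.
    auto trans = std::make_shared<Transaction>(Transaction{42});
    pending.push_back(std::bind(&track, phase_index, trans));
}
```

After schedule(1) returns, invoking pending[0]() still reads a valid Transaction. With a raw pointer, the same guarantee would have to come from whoever owns the transaction object.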