MehdiT

SystemC threads stack overflow?


My top-level design is a Network-on-Chip (NoC) with a 15*15 grid of nodes (Router+PE).

I have been trying to simulate it on different machines/configurations, but the simulation keeps stopping at different points. I can't see what causes the problem, and hence I am stuck.

1) Running only the SystemC/C++ executable (the compiler output) on an LSF cluster. The simulation runs normally with the expected output, then stops at around cycle 55,000 (cycle-accurate model) with this error message:

noc_exe: ../../../../src/sysc/kernel/sc_cor_qt.cpp:107: virtual void sc_core::sc_cor_qt::stack_protect(bool): Assertion `ret == 0' failed.
/home/#######/.lsbatch/1438183854.772657: line 8: 15547 Aborted                 (core dumped) ./noc_exe

2) Running with the Cadence irun command on an LSF cluster. The simulation was many times slower but managed to reach about 100,000 cycles before generating this error message:

Simulation interrupted at 1025080 NS + 0                              
ncsim> ncsim: *W,NCTERM: Simulation received SIGTERM signal from process 22268, user id 0 (/env/seki/app/lsf/8.0/etc/sbatchd).                                                                                                                
make: *** [run] Error 15

I have investigated the error NCTERM with nchelp and got:

nchelp: 14.20-s010: (c) Copyright 1995-2015 Cadence Design Systems, Inc.
ncsim/NCTERM =
        A SIGTERM signal was received by the running simulation. This signal
        may have been issued due to various reasons:

          * sent by the user using the kill command
          * machine on which the job was running went down
          * sent by LSF (Load Sharing Facility) to enforce certain user
            specified job control limits (memory, CPU, swap, etc.)

I suspected that the stack size might not be enough for my threads, but the errors in 1) and 2) occurred even after I increased the stack size.

In 2), it is enough to add the

-SC_THREAD_STACKSIZE 0x80000

switch to the irun command.

In 1), I had to go to every thread registration in my constructors and follow it with another line:

SC_THREAD(controller_thread);
set_stack_size(NOC_THREAD_STACK_SIZE); // 0x80000

I'd appreciate any prompt reply :)

 

When I run the same test with a smaller number of nodes, the issue does not occur.


It might just be the stack of the program itself. Are you declaring any large arrays? If so, it might help to replace them with dynamically allocated memory.

Also, are you allocating the modules on the stack? I would try to use dynamic allocation wherever possible and see if it helps.

Finally, are there any limits imposed by your OS? Try ulimit or limits to find out.
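The same limits that ulimit reports can also be read from inside the process, which is handy when a batch system such as LSF launches the job with a different environment than an interactive shell. The following is a minimal sketch using the POSIX getrlimit call; the helper names soft_limit and print_limits are made up for illustration and are not part of SystemC.

```cpp
#include <sys/resource.h>  // getrlimit, rlimit, RLIMIT_*
#include <cassert>
#include <cstdio>
#include <string>

// Return the soft value of a resource limit as text,
// or "unlimited" / "error" (hypothetical helper).
std::string soft_limit(int resource) {
    rlimit lim{};
    if (getrlimit(resource, &lim) != 0) return "error";
    assert(lim.rlim_cur <= lim.rlim_max);  // soft limit never exceeds hard limit
    if (lim.rlim_cur == RLIM_INFINITY) return "unlimited";
    return std::to_string(static_cast<unsigned long long>(lim.rlim_cur));
}

// Print the limits most likely to matter for a simulation with many thread stacks.
void print_limits() {
    std::printf("stack (bytes):         %s\n", soft_limit(RLIMIT_STACK).c_str());
    std::printf("address space (bytes): %s\n", soft_limit(RLIMIT_AS).c_str());
    std::printf("data segment (bytes):  %s\n", soft_limit(RLIMIT_DATA).c_str());
}
```

Calling print_limits() at the start of sc_main would show what the job actually runs with on the cluster node.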

Alan


Thanks Alan.

In fact all my modules are instantiated with "new" and I do allocate memory dynamically everywhere to avoid using large arrays. 

I investigated the problem further with gdb and found that it may be related to the fact that I spawn a lot of dynamic threads. Here is the output of (gdb) backtrace:

noc_exe: ../../../../src/sysc/kernel/sc_cor_qt.cpp:107: virtual void sc_core::sc_cor_qt::stack_protect(bool): Assertion `ret == 0' failed.
[Thread debugging using libthread_db enabled]

Program received signal SIGABRT, Aborted.
0x00000038b5e328a5 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install boost-program-options-1.41.0-17.el6_4.x86_64 glibc-2.12-1.107.el6.x86_64 libgcc-4.4.7-3.el6.x86_64 libstdc++-4.4.7-3.el6.x86_64
(gdb) backtrace
#0  0x00000038b5e328a5 in raise () from /lib64/libc.so.6
#1  0x00000038b5e34085 in abort () from /lib64/libc.so.6
#2  0x00000038b5e2ba1e in __assert_fail_base () from /lib64/libc.so.6
#3  0x00000038b5e2bae0 in __assert_fail () from /lib64/libc.so.6
#4  0x00002aaaaadaa5b0 in sc_core::sc_cor_qt::stack_protect(bool) () from /home/########/systemc-2.3.1/lib-linux64/libsystemc-2.3.1.so
#5  0x00002aaaaadbf9dc in sc_core::sc_simcontext::create_thread_process(char const*, bool, void (sc_core::sc_process_host::*)(), sc_core::sc_process_host*, sc_core::sc_spawn_options const*) () from /home/emehtaa/systemc-2.3.1/lib-linux64/libsystemc-2.3.1.so
#6  0x0000000000428bb6 in sc_core::sc_spawn<sc_boost::_bi::bind_t<int, sc_boost::_mfi::mf2<int, credit_ctrl, unsigned int, noc_trans_t*>, sc_boost::_bi::list3<sc_boost::_bi::value<credit_ctrl*>, sc_boost::_bi::value<unsigned int>, sc_boost::_bi::value<noc_trans_t*> > > > (object=..., name_p=0x47dca6 "credit_spawned_f", opt_p=0x129ac77f0) at /home/emehtaa/systemc-2.3.1/include/sysc/kernel/sc_spawn.h:118
#7  0x0000000000424c5d in credit_ctrl::credit_init_thread (this=0x29fe350) at ./src/credit_ctrl.cpp:240
#8  0x00002aaaaadc5d36 in sc_core::sc_thread_cor_fn(void*) () from /home/########/systemc-2.3.1/lib-linux64/libsystemc-2.3.1.so
#9  0x00002aaaaadcd2c1 in qt_blocki () from /home/#########/systemc-2.3.1/lib-linux64/libsystemc-2.3.1.so
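The trace shows the assertion firing in stack_protect while a new thread process is being created. As far as I understand the QuickThreads coroutine code, that function marks a page of the new stack as inaccessible with mprotect and asserts that the call returned 0. This reading of the library is an assumption, not something confirmed in this thread. On Linux, mprotect can fail with ENOMEM when the process would exceed its allowed number of memory mappings (the vm.max_map_count sysctl), so printing errno is a cheap way to narrow things down. The sketch below is standalone POSIX code, not the SystemC source, and make_guarded_stack is a hypothetical name.

```cpp
#include <sys/mman.h>  // mmap, mprotect, munmap
#include <unistd.h>    // sysconf
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Map a stack-like region and make its lowest page an inaccessible guard page.
// Returns 0 on success, or the errno value of the call that failed.
int make_guarded_stack(std::size_t stack_bytes, void** out_base) {
    const long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    const std::size_t page_bytes = static_cast<std::size_t>(page);
    if (stack_bytes < 2 * page_bytes) return EINVAL;  // guard page plus usable stack

    void* base = mmap(nullptr, stack_bytes, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) return errno;

    if (mprotect(base, page_bytes, PROT_NONE) != 0) {
        const int err = errno;  // save before other calls can overwrite it
        std::fprintf(stderr, "mprotect failed: %s\n", std::strerror(err));
        munmap(base, stack_bytes);
        return err;
    }
    *out_base = base;
    return 0;
}
```

Each guarded region costs the process extra memory mappings, so a large number of live thread stacks can hit the mapping limit well before physical memory runs out.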

At each node (router), there is a module credit_controller that spawns dynamic threads almost every cycle from a main thread, credit_init_thread():

sc_spawn(sc_bind(&credit_ctrl::transaction_track_thread, this, current_phase_index, trans_ptr));

After spawning this child thread, the main thread continues with processing the next transaction.

The child thread transaction_track_thread is defined as a member function inside the same module credit_ctrl.

void credit_ctrl::transaction_track_thread(unsigned phase_index, noc_trans_t* noc_trans) {
    // do some stuff
}

Although I think that a child thread terminates after finishing its execution, I suspect the threads are still there. After some simulation time, the huge number of unterminated threads may be causing the above issue. It is only a guess though.

I tried passing a name for each spawned thread as an argument to sc_spawn, only to get runtime warnings saying that an object with the same name already exists and that the new one will be renamed. Not good. If the child thread terminates, its name should disappear as well. I don't know what I am missing here.


I don't know if your threads have to be SC_THREADs (i.e. if they have to wait) but if they don't, you could spawn SC_METHODs instead.

 

The other thing that might be worth trying is to explicitly kill the threads when they are no longer needed. I've no idea if that will help, but perhaps it might cause the kernel to properly remove the threads?

 

regards

Alan


Hi Alan,

 

There are many reasons why I can't just migrate to SC_METHODs, among them the use of wait statements.

Killing the spawned threads sounds reasonable, although I thought this would be taken care of automatically. Do you mean I should use an sc_process_handle, wait for the thread to terminate, and then kill it with the handle's kill() method?


Terminating the threads after they finish executing didn't help. I went back to the part of my code where I spawn dynamic threads and rewrote it. No processes are dynamically created in my code now, and the problem seems to be solved.

From this, I get that it is perhaps not a good idea to spawn dynamic threads in large SystemC simulations.

I also noticed that RAM usage grows exponentially when I use dynamic threads (with and without manual termination). It could be because I haven't done it correctly though (see code above).

Without dynamic threads, RAM usage is stable (it scales linearly with the number of static threads in the simulation).
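A back-of-the-envelope calculation shows how quickly thread stacks add up. The sketch below uses the 15*15 grid and the 0x80000 stack size mentioned earlier. It assumes one spawn per node per cycle and, as a worst case, that no stack is ever reclaimed. Both are assumptions for illustration, not measurements, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Bytes of stack reserved when `live_threads` processes each own `stack_bytes`.
inline std::uint64_t reserved_stack_bytes(std::uint64_t live_threads,
                                          std::uint64_t stack_bytes) {
    // Guard against 64-bit overflow in the multiplication.
    assert(stack_bytes == 0 || live_threads <= UINT64_MAX / stack_bytes);
    return live_threads * stack_bytes;
}

// Threads alive after `cycles` if every node spawns one per cycle
// and none is ever reclaimed (worst-case assumption).
inline std::uint64_t leaked_threads(std::uint64_t nodes, std::uint64_t cycles) {
    assert(cycles == 0 || nodes <= UINT64_MAX / cycles);
    return nodes * cycles;
}

// One static thread per node:
//   reserved_stack_bytes(225, 0x80000) == 117,964,800 bytes (112.5 MiB)
// Worst case after only 100 cycles of spawning:
//   leaked_threads(225, 100) == 22,500 threads
//   reserved_stack_bytes(22500, 0x80000) == 11,796,480,000 bytes (about 11 GiB)
```

Under these assumptions the reserved address space passes 10 GiB within 100 cycles, so any delay in releasing finished threads is enough to trip a memory limit enforced by the batch system.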

However, the underlying reason for the crash remains unknown. I am posting this as a workaround for others who may face a similar issue in the future.


My guess is that one of the arguments passed to sc_spawn was destroyed or no longer available. It is important that every argument passed to sc_spawn remains valid throughout the lifetime of the spawned process. If not, it can cause a zero return value or a segmentation fault, because sc_spawn attempts to access invalid storage.
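The lifetime concern can be shown in plain C++ without SystemC. The sketch below uses std::bind as a stand-in for sc_bind; Transaction, track, bind_raw and bind_owned are hypothetical names chosen for illustration. Binding a raw pointer only copies the pointer, so the pointed-to object must outlive every call, whereas a shared_ptr captured by value keeps the object alive for as long as the callable exists.

```cpp
#include <cassert>
#include <functional>
#include <memory>

struct Transaction { int id; };  // hypothetical stand-in for noc_trans_t

// Work done by the deferred call; dereferences the transaction pointer.
int track(unsigned phase, const Transaction* t) {
    assert(t != nullptr);
    return static_cast<int>(phase) + t->id;
}

// Risky: only the pointer is copied. The caller must keep *t alive
// until the returned callable has finished running.
std::function<int()> bind_raw(unsigned phase, const Transaction* t) {
    return std::bind(&track, phase, t);
}

// Safer: the callable shares ownership of the transaction,
// so the object cannot be destroyed while the callable still exists.
std::function<int()> bind_owned(unsigned phase,
                                std::shared_ptr<const Transaction> t) {
    return [phase, t] { return track(phase, t.get()); };
}
```

With bind_owned the transaction can go out of scope in the spawning code and the deferred call still sees valid data. With bind_raw the same pattern would read freed memory.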
