
SystemC TLM: comparing DMI with no DMI



Good day.

 

I am new to SystemC and TLM.

I am implementing a producer-consumer model with a bus and a memory model. The synchronization between producer and consumer is done using a semaphore. I have added the DMI feature to the simulation: the idea is to give the producer direct access to the memory region, and the same goes for the consumer. What I expect from this is faster simulation time.

 

However, at some point in my experiments, using DMI results in a longer simulation time.

Is this normal, or have I implemented DMI the wrong way?

 

Thank you.

Best regards,

 

Li.


When using DMI, the initiator is responsible for modelling read/write latencies, since no explicit transactions are generated for each access. Unless your processor model adds additional delays for the direct memory accesses internally, it is expected that DMI does not consume simulation time. 

 

To approximate the access delays, the target (and the interconnect in-between) can fill the latency fields in the tlm_dmi structure during the get_direct_mem_ptr call, when granting the DMI access to the initiator:

class tlm_dmi
{
public:
  // ...
  sc_core::sc_time get_read_latency() const;
  sc_core::sc_time get_write_latency() const;
  // ...
  void set_read_latency(sc_core::sc_time t);
  void set_write_latency(sc_core::sc_time t);
};

For more information, see IEEE 1666-2011, Section 11.2.
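
For illustration, here is a minimal sketch of how a memory target could grant DMI and annotate these latencies. The class and member names (simple_memory, m_mem, m_size) and the latency values are placeholders, not taken from your model:

#include <systemc>
#include <tlm>

struct simple_memory
{
  unsigned char* m_mem;   // host storage backing the memory
  sc_dt::uint64  m_size;  // memory size in bytes

  bool get_direct_mem_ptr(tlm::tlm_generic_payload& trans, tlm::tlm_dmi& dmi_data)
  {
    (void) trans;                          // the whole memory is granted, so the transaction is not inspected
    dmi_data.allow_read_write();           // grant both read and write access
    dmi_data.set_dmi_ptr(m_mem);           // raw pointer into the storage
    dmi_data.set_start_address(0);
    dmi_data.set_end_address(m_size - 1);

    // Annotate per-access latencies; the initiator is expected to account
    // for these when it performs direct accesses.
    dmi_data.set_read_latency(sc_core::sc_time(10, sc_core::SC_NS));
    dmi_data.set_write_latency(sc_core::sc_time(15, sc_core::SC_NS));

    return true;                           // DMI granted
  }
};

In a real model this function would of course be registered on the target socket (e.g. via register_get_direct_mem_ptr on a tlm_utils::simple_target_socket), and any interconnect on the path may further restrict the range or add to the latencies.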

Greetings from Oldenburg,
  Philipp


Hi Philipp,
 
I am aware that there are several kinds of time information for a simulation: one is the real simulation time and the other is the total simulation time (sys + user).
It is true that the simulated time is always 0 (zero). My coding style is LT and I am creating the platform model in PV.

Attached below is the write function within the producer class.

// Write using DMI
bool producer::write_block(const sc_dt::uint64 & addr, unsigned int length, unsigned int *data) {
    vector<tlm_dmi>::iterator i;

    for (i = p_dmi.begin(); (i != p_dmi.end()) && p_enable_dmi; i++) {
        if ((i->get_start_address() <= addr)
            && (i->get_end_address() >= addr + (length * sizeof(data[0])))
            && i->is_write_allowed()) {

            /* DMI access found */
            memcpy(i->get_dmi_ptr() + (addr - i->get_start_address()),
                   data,
                   length * sizeof(data[0]));
            return true;
        }
    }

    /* Perform regular transaction */
    amba_pv_transaction trans;
    amba_pv_extension ex(length, sizeof(data[0]), NULL, AMBA_PV_INCR);
    sc_time t = SC_ZERO_TIME;

    trans.set_write();
    trans.set_address(addr);
    trans.set_data_ptr(reinterpret_cast<unsigned char *>(data));
    trans.set_data_length(sizeof(data[0]) * length);
    trans.set_streaming_width(trans.get_data_length());
    trans.set_extension(&ex);
    p_socket.b_transport(trans, t);
    //wait(t);
    if (ex.get_resp() != amba_pv::AMBA_PV_OKAY) {
        cout << "ERROR\t" << name()
             << ": write_block() memory failure at "
             << showbase << hex << addr << endl;
        trans.clear_extension(&ex);
        return false;
    }

    /* DMI may be requested */
    if (trans.is_dmi_allowed() && p_enable_dmi) {
        tlm_dmi dmi;

        /* Request DMI access for further writes */
        trans.set_address(addr);
        if (p_socket.get_direct_mem_ptr(trans, dmi)) {

            /* DMI access available */
            if ((dmi.get_start_address() <= addr)
                && (dmi.get_end_address() >= addr + (length * sizeof(data[0])))) {
                p_dmi.push_back(dmi);
            }
        }
    }
    trans.clear_extension(&ex);
    return true;
}

I would like to try set_write_latency, but I am not sure where to implement the code.

Kindly advise me if there is any mistake in the DMI implementation.

 

Thank you.

Best regards,

 

Li. 


I would like to try set_write_latency, but I am not sure where to implement the code.

 

I didn't review the full source code.

 

To add a delay to the DMI access, you can add a wait after the memcpy call:

if( i->get_write_latency() != sc_core::SC_ZERO_TIME )
  wait( length * sizeof(*data) * i->get_write_latency() );

The write latency needs to be set by the target and/or interconnect modules that fill in the tlm_dmi information.

As said before, see IEEE 1666-2011, Section 11.2 for details, especially 11.2.5 (ab)-(ad).
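
To make the placement concrete, the DMI hit branch of your write_block could then look roughly like this (a sketch only, reusing the names from your posted code; scaling the latency per byte is just one possible modelling choice):

/* DMI access found */
memcpy(i->get_dmi_ptr() + (addr - i->get_start_address()),
       data,
       length * sizeof(data[0]));

/* account for the write latency announced by the target */
if (i->get_write_latency() != sc_core::SC_ZERO_TIME)
    wait(length * sizeof(data[0]) * i->get_write_latency());

return true;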

 

Generally speaking, you should respect the time annotation received from the target/interconnect in the non-DMI case as well.

The corresponding wait(t) call is currently disabled in your code.

 

hth,

  Philipp


Thank you Philipp.  :) 

I appreciate your advice on setting the latency for the DMI, but I am not sure whether it is the answer to my problem.

 

I know that DMI is meant to make the simulation run faster (by simulating less).

What I still cannot understand is that, at some point in my platform simulation, using DMI is much slower than using no DMI (in terms of the real simulation time, i.e. the actual "wall clock" time). This anomaly occurs when I increase the memory size (from 2 MB to 8 MB, 16 MB, etc.) and/or the number of runs (from 1,000 to 1,000,000). Somehow the DMI itself becomes an overhead to the simulation performance. It works just fine for a 2 MB memory, but not beyond that.

 

For more information: I am using the bus and memory models from AMBA-PV. The producer and consumer are implemented using SystemC 2.3.0 with TLM-2.0 (LT). I have also attached a Cortex-M3 processor to the platform, and the platform itself is coded for the programmer's view use case (no timing). However, I have not implemented any application/firmware/OS for the processor, so it is just left hanging there.

 

Philipp, if adding a delay to the DMI access is the answer, kindly provide me with some more details about it.

 

Thank you.

Best regards.

 

Li


Thank you Alan.  :)

 

I profiled my code using gprof. It shows that lines 62 and 63 consume the most time.

Both lines basically search the DMI access list, which I keep inside a vector, and this is causing the DMI simulation overhead.

I will try to update my code, and see how it goes.

 

Thank you.

Best regards,

 

Li.


The problem in those two lines is probably that you never clean up your p_dmi vector? 
Does it help to empty it when invalidating the DMI access in invalidate_direct_mem_ptr?

/* e.g. in producer::invalidate_direct_mem_ptr() */
p_enable_dmi = false;
p_dmi.clear();

Otherwise, you may need to implement a more efficient way to handle multiple active DMI accesses: a custom DMI table wrapped around p_dmi that allows faster address-range handling and lookup, e.g. by combining and sorting address ranges and avoiding duplicates.

 

NB: Obviously, modelling a memory latency in the initiator does not improve the simulation speed of the model.  I probably misunderstood your original question here.


Thank you Philipp.  :)

 

I do erase DMI access entries inside the p_dmi vector, and this is implemented in the invalidate_direct_mem_ptr function.

However, after some debugging, I came to know that it is not working properly.

After I changed it to p_dmi.clear(), the anomaly is gone. The DMI simulation is now much faster for any simulation parameters.

I think that not properly clearing DMI accesses inside the vector (in my code) causes the vector to grow bigger, making lines 63 & 64 consume more time. That is why the anomaly occurs only under some conditions (larger memory size, longer simulation runs).

 

I am going to fix this part of the code. I think it is still better to invalidate specific DMI accesses inside the list than to destroy the whole vector by calling clear().

Kindly advise me if there is any mistake.

 

Thank you.

Best regards,

 

Li.


I am going to fix this part of the code. I think it is still better to invalidate specific DMI accesses inside the list than to destroy the whole vector by calling clear().

 

It depends on your particular scenario whether there is a significant benefit in maintaining a complex set of DMI allowances.

If you think it is necessary, I would recommend that you:

  • implement a dedicated data structure to store the DMI access information, which you can reuse and test separately
  • keep the ranges non-overlapping and sorted by start-address (and probably have two sets for reads and writes in the initiator)
  • use std::lower_bound to efficiently search the sorted vector internally (see the sketch after this list)
  • merge adjacent/overlapping entries to keep the number of entries small
  • forward the invalidate_direct_mem_ptr call to update the ranges in the (internal) vector
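
To illustrate the points above, here is a minimal sketch of such a table. The class and member names are placeholders, write-only bookkeeping is assumed, and merging of adjacent ranges is omitted for brevity:

#include <systemc>
#include <tlm>
#include <algorithm>
#include <vector>

class dmi_table
{
  // entries kept sorted by start address and non-overlapping
  std::vector<tlm::tlm_dmi> m_entries;

  // comparison for std::lower_bound: entry start address vs. lookup address
  static bool start_less(const tlm::tlm_dmi& e, sc_dt::uint64 addr)
    { return e.get_start_address() < addr; }

public:
  // Return an entry whose granted range covers [addr, addr + size - 1],
  // or NULL if no matching DMI access is cached.
  const tlm::tlm_dmi* find(sc_dt::uint64 addr, unsigned int size) const
  {
    std::vector<tlm::tlm_dmi>::const_iterator it =
      std::lower_bound(m_entries.begin(), m_entries.end(), addr, start_less);
    if (it == m_entries.end() || it->get_start_address() > addr) {
      if (it == m_entries.begin())
        return NULL;                     // nothing starts at or before addr
      --it;                              // candidate: last entry starting before addr
    }
    if (it->get_end_address() >= addr + size - 1 && it->is_write_allowed())
      return &*it;
    return NULL;
  }

  // Insert a newly granted region, keeping the vector sorted by start address.
  void insert(const tlm::tlm_dmi& dmi)
  {
    std::vector<tlm::tlm_dmi>::iterator it =
      std::lower_bound(m_entries.begin(), m_entries.end(),
                       dmi.get_start_address(), start_less);
    m_entries.insert(it, dmi);
  }

  // Remove every cached entry overlapping the invalidated range;
  // to be called from invalidate_direct_mem_ptr().
  void invalidate(sc_dt::uint64 start, sc_dt::uint64 end)
  {
    std::vector<tlm::tlm_dmi>::iterator it = m_entries.begin();
    while (it != m_entries.end()) {
      if (it->get_start_address() <= end && it->get_end_address() >= start)
        it = m_entries.erase(it);        // overlapping entry: drop it
      else
        ++it;
    }
  }
};

With something like this, the lookup loop in write_block reduces to a single find() call, and invalidate_direct_mem_ptr() simply forwards its address range to invalidate().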

/Philipp

