Shared Data in Fetches
The current invocation method of worklets has each value of the execution signature independently fetched. Normally this is fine, but there are circumstances where multiple execution arguments come from related data and these fetches make redundant computations.
This is currently most notable when fetching "from" fields in a topology map on a structured cell set. This operation requires (in a point to cell map) taking a flat cell index, converting that to a 3D ijk cell index, converting that back to a flat point index for the lower left point, then adding offsets to get the other 7 point indices. Because the fetches are independent and there is no shared thread-local place to store such information, this whole operation is repeated for every fetch that needs the indices.
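To make the cost concrete, here is a minimal sketch of the index arithmetic described above. The names `Id`, `Id3`, `FlatToLogicalCell`, and `CellPointIndices` are hypothetical stand-ins rather than VTK-m classes, and the point ordering within the cell (bottom face counterclockwise, then top face) is an assumption made for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for illustration; these are not the VTK-m types.
using Id = std::int64_t;
struct Id3 { Id i, j, k; };

// Convert a flat cell index to a logical ijk cell index.
// cellDims holds the number of cells along each axis.
inline Id3 FlatToLogicalCell(Id flatCell, const Id3 &cellDims)
{
  return Id3{ flatCell % cellDims.i,
              (flatCell / cellDims.i) % cellDims.j,
              flatCell / (cellDims.i * cellDims.j) };
}

// Compute the 8 flat point indices of a hexahedral cell in a structured grid.
// The ordering of the 8 points is an assumption, not taken from the design.
inline std::array<Id, 8> CellPointIndices(Id flatCell, const Id3 &cellDims)
{
  const Id3 ijk = FlatToLogicalCell(flatCell, cellDims);
  const Id rowStride = cellDims.i + 1;                      // points per row
  const Id sliceStride = rowStride * (cellDims.j + 1);      // points per slice
  // Flat index of the "lower left" point of the cell.
  const Id p0 = ijk.i + ijk.j * rowStride + ijk.k * sliceStride;
  return { { p0,
             p0 + 1,
             p0 + 1 + rowStride,
             p0 + rowStride,
             p0 + sliceStride,
             p0 + 1 + sliceStride,
             p0 + 1 + rowStride + sliceStride,
             p0 + rowStride + sliceStride } };
}
```

Under the current design, every Fetch that needs these indices would perform the equivalent of `CellPointIndices` on its own.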
This design document presents two proposals to enable us to share this information. The proposals are mutually exclusive, so we need to choose one or the other.
- After a bit of thought and discussion, we are going to go with the second approach labeled "Special Index Loading." --Kmorel (talk) 18:22, 6 October 2015 (EDT)
Thread-Local Storage Argument List
The reason there is no way to cache information like computed indices when fetching data is that the execution objects transported from the control environment are shared among all the threads. We do not want a separate copy of each execution object for each thread, because that could get computationally prohibitive.
Instead, we could define a special thread-local storage object for each argument. There are several ways that we could use to specify this set of objects, but a straightforward way is to have another sub-tag in the control signature tag to specify it. This thread-local storage tag would behave very similarly to the transport tag.
Because they will have to be stored in a function interface, the thread-local storage objects must have a default constructor. They will also have an overloaded parenthesis operator that takes the input index and associated execution object and loads or computes any information that could be shared. This function interface becomes part of the Invocation object, which in turn gets sent to all the Fetches.
Most arguments will not need shared thread-local storage, so there will be a default thread-local storage object that takes no memory and has a no-op for its parenthesis operator.
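As a rough sketch of what such objects might look like, the following shows an empty no-op storage object next to one that caches a computed index. All of the names here (`ThreadLocalStorageNone`, `ThreadLocalStorageFirstPoint`, `ToyConnectivity`) are hypothetical and exist only for illustration; the toy connectivity describes a 2D grid to keep the arithmetic short.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

using Id = std::int64_t;  // hypothetical stand-in for vtkm::Id

// Default storage: holds no data and does nothing when invoked.
struct ThreadLocalStorageNone
{
  template<typename ExecObjectType>
  void operator()(Id, const ExecObjectType &) { }
};
static_assert(std::is_empty<ThreadLocalStorageNone>::value,
              "The default storage should carry no data.");

// Hypothetical execution object for a 2D structured grid.
struct ToyConnectivity
{
  Id PointsPerRow;
  // Flat index of the first (lower left) point of a cell.
  Id FirstPointOfCell(Id cellIndex) const
  {
    return cellIndex + cellIndex / (this->PointsPerRow - 1);
  }
};

// Storage that computes a shared value once per thread and keeps it.
struct ThreadLocalStorageFirstPoint
{
  Id FirstPoint;
  ThreadLocalStorageFirstPoint() : FirstPoint(-1) { }  // default constructible
  void operator()(Id inputIndex, const ToyConnectivity &connectivity)
  {
    this->FirstPoint = connectivity.FirstPointOfCell(inputIndex);
  }
};
```

Both objects are default constructible, so either could be held in a function interface, and both are invoked the same way regardless of whether they store anything.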
Pro: This is a very general method that should work for any future shared thread-local storage needs.
Con: This mechanism adds quite a bit of complexity to a system that is already quite complicated. There is already a bit of magic required in the Fetch objects to find arguments and hope the types match up.
Con: This mechanism relies on the compiler quite a bit to avoid unnecessary computation. Most of the thread-local storage objects are empty no-ops. The mechanism to load the thread-local storage will naively iteratively load all of them. If the compiler fails to optimize them away, that is a lot of useless function calls in the inner loop of the scheduler.
Special Index Loading
Currently when a Fetch is called it is passed an invocation index and the entire Invocation object. The Fetch is then left to fend for itself. The idea here is that any execution argument can reference any control arguments to track any data that is needed.
For most Fetches, this is overkill. Most Fetches get their data from a single input object. Thus, this added lookup to find the correct input object is really just a hassle.
Right now the only time a Fetch needs to look at an object other than its own associated parameter from the control signature is to follow index lookups such as to find the “from” indices in a topology map. If this is the only reasonable use case, we could simplify things quite a bit by computing these indices beforehand and just using those.
The first step is to create a special execution object to hold the indices for each thread. Different worklet types will provide different thread indices, but the basic thread indices are as follows.
class ThreadIndicesBasic
{
public:
  template<typename Invocation>
  VTKM_EXEC_EXPORT
  ThreadIndicesBasic(vtkm::Id threadIndex, const Invocation &invocation);
  VTKM_EXEC_EXPORT
  vtkm::Id GetInputIndex() const;
  VTKM_EXEC_EXPORT
  vtkm::Id3 GetInputIndex3D() const;
  VTKM_EXEC_EXPORT
  vtkm::Id GetOutputIndex() const;
  VTKM_EXEC_EXPORT
  vtkm::Id GetVisitIndex() const;
};
Worklet types could provide thread indices with more information. For example, the thread indices for topology map worklets could look like this.
template<typename InputDomainType>
class ThreadIndicesTopologyMap : public ThreadIndicesBasic
{
public:
  template<typename Invocation>
  VTKM_EXEC_EXPORT
  ThreadIndicesTopologyMap(vtkm::Id threadIndex, const Invocation &invocation);
  typedef typename InputDomainType::IndicesType IndicesFromType;
  VTKM_EXEC_EXPORT
  IndicesFromType GetIndicesFrom() const;
};
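The point of such a class is that the "from" indices are computed once, when the thread indices are constructed, and then reused by every Fetch. A simplified sketch of that behavior follows. `ToyConnectivity2D` and `ToyThreadIndicesTopologyMap` are hypothetical; unlike the declaration above, the constructor here takes the input domain directly rather than an Invocation, and the `LookupCount` member exists only to demonstrate how many times the indices are computed.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Id = std::int64_t;  // hypothetical stand-in for vtkm::Id

// Hypothetical 2D structured connectivity (quads); not the VTK-m class.
struct ToyConnectivity2D
{
  typedef std::array<Id, 4> IndicesType;
  Id CellsX;         // number of cells along x
  int *LookupCount;  // counts index computations, for demonstration only

  IndicesType GetIndices(Id cellIndex) const
  {
    ++(*this->LookupCount);
    const Id pointsX = this->CellsX + 1;
    const Id p0 =
      (cellIndex % this->CellsX) + (cellIndex / this->CellsX) * pointsX;
    return { { p0, p0 + 1, p0 + 1 + pointsX, p0 + pointsX } };
  }
};

// Simplified thread indices: the "from" indices are computed in the
// constructor and stored so that later queries do no further work.
template<typename InputDomainType>
class ToyThreadIndicesTopologyMap
{
public:
  typedef typename InputDomainType::IndicesType IndicesFromType;

  ToyThreadIndicesTopologyMap(Id threadIndex,
                              const InputDomainType &inputDomain)
    : InputIndex(threadIndex),
      IndicesFrom(inputDomain.GetIndices(threadIndex))
  { }

  Id GetInputIndex() const { return this->InputIndex; }
  const IndicesFromType &GetIndicesFrom() const { return this->IndicesFrom; }

private:
  Id InputIndex;
  IndicesFromType IndicesFrom;
};
```

Any number of Fetches can then call `GetIndicesFrom()` on the same thread indices object without triggering another lookup.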
Now let us see what Fetch would look like. Here is the current implementation of Fetch for directly reading from an array. This is about as simple as it gets, but there are still some template gymnastics that have to be performed.
template<typename Invocation, vtkm::IdComponent ParameterIndex>
struct Fetch<
    vtkm::exec::arg::FetchTagArrayDirectIn,
    vtkm::exec::arg::AspectTagDefault,
    Invocation,
    ParameterIndex>
{
  typedef typename Invocation::ParameterInterface::
      template ParameterType<ParameterIndex>::type ExecObjectType;
  typedef typename ExecObjectType::ValueType ValueType;
  VTKM_EXEC_EXPORT
  ValueType Load(vtkm::Id index, const Invocation &invocation) const
  {
    return invocation.Parameters.template GetParameter<ParameterIndex>().
        Get(index);
  }
  VTKM_EXEC_EXPORT
  void Store(vtkm::Id, const Invocation &, const ValueType &) const
  {
    // Store is a no-op for this fetch.
  }
};
With the proposed changes, the Fetch would look like this.
template<typename ThreadIndicesType, typename ExecutionObjectType>
struct Fetch<
    vtkm::exec::arg::FetchTagArrayDirectIn,
    vtkm::exec::arg::AspectTagDefault,
    ThreadIndicesType,
    ExecutionObjectType>
{
  typedef typename ExecutionObjectType::ValueType ValueType;
  VTKM_EXEC_EXPORT
  ValueType Load(const ThreadIndicesType &indices,
                 const ExecutionObjectType &execObject) const
  {
    // The input index comes from the precomputed thread indices.
    return execObject.Get(indices.GetInputIndex());
  }
  VTKM_EXEC_EXPORT
  void Store(const ThreadIndicesType &,
             const ExecutionObjectType &,
             const ValueType &) const
  {
    // Store is a no-op for this fetch.
  }
};
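To check that this shape hangs together outside of VTK-m, here is a small self-contained mock-up of the same pattern. Everything prefixed with `Toy` is hypothetical: the fetch tag, the thread indices, and the array portal (backed by a `std::vector`) are simplified stand-ins, and the aspect tag is left out for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Id = std::int64_t;  // hypothetical stand-in for vtkm::Id

// Minimal thread indices exposing only the input index.
struct ToyThreadIndices
{
  Id InputIndex;
  Id GetInputIndex() const { return this->InputIndex; }
};

// Minimal read-only array portal backed by a std::vector.
struct ToyArrayPortal
{
  typedef double ValueType;
  const std::vector<double> *Data;
  ValueType Get(Id index) const
  {
    return (*this->Data)[static_cast<std::size_t>(index)];
  }
};

struct ToyFetchTagArrayDirectIn { };

// Primary template; specialized per fetch tag.
template<typename FetchTag,
         typename ThreadIndicesType,
         typename ExecutionObjectType>
struct ToyFetch;

// The fetch receives its own execution object directly, so no lookup
// through an Invocation object is needed.
template<typename ThreadIndicesType, typename ExecutionObjectType>
struct ToyFetch<ToyFetchTagArrayDirectIn,
                ThreadIndicesType,
                ExecutionObjectType>
{
  typedef typename ExecutionObjectType::ValueType ValueType;

  ValueType Load(const ThreadIndicesType &indices,
                 const ExecutionObjectType &execObject) const
  {
    return execObject.Get(indices.GetInputIndex());
  }

  void Store(const ThreadIndicesType &,
             const ExecutionObjectType &,
             const ValueType &) const
  {
    // Store is a no-op for an input-only fetch.
  }
};
```

Note that the Fetch no longer needs the parameter index or the `ParameterInterface` typedefs; it only needs the type of the one execution object it reads from.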
Pro: This method should be easier to implement than the other one both in terms of the underlying architecture and for every developer using it now and into the future. In fact, it makes the overall implementation of Fetch objects simpler.
Con: All sharing of parameters is now limited to indices. You would no longer be able to create an aspect tag that brought in data from two control signature parameters. That said, we do not have a use case for that, nor can I even think of a potential future use case (that would not be more easily implemented with fancy array handles).
SAND 2015-8686 O