Difference between revisions of "Virtual Methods in the Execution Environment"
(Created page with "Currently VTK-m only supports direct function and method calls in the execution environment. Specifically we do not support virtual methods. Part of the rational is that there...") |
|||
(7 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
Currently VTK-m only supports direct function and method calls in the execution environment. Specifically we do not support virtual methods. Part of the rational is that there is often a cost associated with calling a virtual method over a known method, particularly if that method can be inlined. Also, early versions of CUDA did not support virtual methods. | Currently VTK-m only supports direct function and method calls in the execution environment. Specifically we do not support virtual methods. Part of the rational is that there is often a cost associated with calling a virtual method over a known method, particularly if that method can be inlined. Also, early versions of CUDA did not support virtual methods. | ||
− | However, requiring statically typed methods has its costs. Often in practical code it cannot be determined what specific type will be used when a particular method is called. The way we currently handle that is by using template tricks to compile an implementation for all supported types. This works, but can lead to excessively long compile times and executable | + | However, requiring statically typed methods has its costs. Often in practical code it cannot be determined what specific type will be used when a particular method is called. The way we currently handle that is by using template tricks to compile an implementation for all supported types. This works, but can lead to excessively long compile times and large executable sizes. |
There are several types of objects for which we would like to introduce virtual methods (or something like it). One big example is a [[Polymorphic ArrayHandle]] that would hide the actual storage implementation of an array. Other examples include a virtual implicit function for use with operations like clip, execution objects for dynamic cell set classes, and virtual methods for various cell operations (like interpolate) for different cell shapes. | There are several types of objects for which we would like to introduce virtual methods (or something like it). One big example is a [[Polymorphic ArrayHandle]] that would hide the actual storage implementation of an array. Other examples include a virtual implicit function for use with operations like clip, execution objects for dynamic cell set classes, and virtual methods for various cell operations (like interpolate) for different cell shapes. | ||
Line 9: | Line 9: | ||
At first blush, implementing virtual methods is just a matter of introducing the <tt>virtual</tt> keyword to the method implementation and create a shared superclass with an anonymous interface. For a standard CPU implementation (and the latest Xeon Phi) that is perfectly feasible. Things get interesting though for CUDA. | At first blush, implementing virtual methods is just a matter of introducing the <tt>virtual</tt> keyword to the method implementation and create a shared superclass with an anonymous interface. For a standard CPU implementation (and the latest Xeon Phi) that is perfectly feasible. Things get interesting though for CUDA. | ||
− | Virtual methods are supported on modern versions of CUDA, so on the surface you can implement it there, too. There are some complications that you have to work through to make sure the virtual table is correct for the device. | + | Virtual methods are supported on modern versions of CUDA, so on the surface you can implement it there, too. There are some complications that you have to work through to make sure the virtual table is correct for the device. In C++ when you create an object with virtual methods under the covers a [https://en.wikipedia.org/wiki/Virtual_method_table virtual table] is built for that object. A virtual table is an array of function pointers that point to the correct implementation of each virtual method (which cannot be resolved until compile time). When a virtual method is called, under the covers the program pulls the appropriate function pointer for that method from the virtual table and then calls that method. In CUDA the complication is that the method pointers are different for the host and device versions of the method. |
+ | |||
+ | To ensure that the virtual table gets built with the function pointers appropriate for the device, you need to construct the object on that device. For technical reasons we won't get into here, the structure of the VTK-m code dictates that the object needs to be constructed in a different kernel call than the kernel call that uses the object. (Likewise for destroying the object.) It adds a level of complication, but when it is all said and done it works. | ||
Well, it at least works functionally. From a practical standpoint there is a huge problem: the virtual method code takes a ''huge'' performance hit. Here are some experimental measurements taken for a simple array operation using a direct call and a virtual method vtable call. | Well, it at least works functionally. From a practical standpoint there is a huge problem: the virtual method code takes a ''huge'' performance hit. Here are some experimental measurements taken for a simple array operation using a direct call and a virtual method vtable call. | ||
Line 22: | Line 24: | ||
As you can see, the virtual table method is about 10x slower. This is not acceptable. | As you can see, the virtual table method is about 10x slower. This is not acceptable. | ||
+ | |||
+ | A big part of the cost that was measured is the time it takes to malloc the object. (A malloc is necessary because the object must be created and used in different kernel launches. An object on the stack would go out of scope, so a malloc is necessary to keep the object around.) Malloc is very slow on CUDA (at least on the FERMI card we were using for profiling). | ||
=== Avoiding the Virtual Table === | === Avoiding the Virtual Table === | ||
Line 28: | Line 32: | ||
<source lang="cpp"> | <source lang="cpp"> | ||
+ | typedef void FooSignature(); | ||
+ | |||
void FooA(); | void FooA(); | ||
void FooB(); | void FooB(); | ||
Line 33: | Line 39: | ||
struct Bar | struct Bar | ||
{ | { | ||
− | |||
− | |||
FooSignature *Foo; | FooSignature *Foo; | ||
}; | }; | ||
Line 53: | Line 57: | ||
So, for example, we could adjust the code from the previous example as follows. | So, for example, we could adjust the code from the previous example as follows. | ||
− | To adjust the code from above, assuming the function <tt>FooA</tt> is | + | To adjust the code from above, assuming the function <tt>FooA</tt> is changed to include the <tt>__device__</tt> modifier, you can replace <tt>bar.Foo = &FooA;</tt> with |
<source lang="cpp"> | <source lang="cpp"> | ||
− | cudaMemcpyFromSymbol((void**)&bar.Foo, | + | __device__ FooSignature *fooPointer = &FooA; |
+ | cudaMemcpyFromSymbol((void**)&bar.Foo, fooPointer, sizeof(FooSignature)); | ||
</source> | </source> | ||
In doing so, the object <tt>bar</tt> can be passed to a CUDA kernel and a call to the <tt>Foo</tt> "method" will correctly call <tt>FooA</tt>. | In doing so, the object <tt>bar</tt> can be passed to a CUDA kernel and a call to the <tt>Foo</tt> "method" will correctly call <tt>FooA</tt>. | ||
Line 75: | Line 80: | ||
=== Making it Device Independent === | === Making it Device Independent === | ||
− | Although the approach of using function pointers is motivated by running on CUDA, there is no reason you shouldn't be able to | + | Although the approach of using function pointers is motivated by running on CUDA, there is no reason you shouldn't be able to use it on all types of devices. The only real difference is how we get the function pointer. Thus we need a device-independent way to copy a function pointer for a particular device into a function pointer. |
This can be done simply by creating a function that is overloaded by a device adapter tag. Let's call it <tt>CopyFunctionPointer</tt>. The implementation for CUDA would look something like this. | This can be done simply by creating a function that is overloaded by a device adapter tag. Let's call it <tt>CopyFunctionPointer</tt>. The implementation for CUDA would look something like this. | ||
Line 91: | Line 96: | ||
} | } | ||
</source> | </source> | ||
+ | |||
+ | : <font color="forestgreen">It turns out this implementation will not quite work. The problem is that CUDA host code cannot access a device function pointer. To implement this, we'll have to launch a kernel function that copies the device pointer. Thus, the interface will be different. --[[User:Kmorel|Kmorel]] ([[User talk:Kmorel|talk]]) 17:03, 30 March 2017 (EDT)</font> | ||
We overload the last argument for other devices as well. For devices that use a CPU (like Serial and TBB), the implementation is trivial. | We overload the last argument for other devices as well. For devices that use a CPU (like Serial and TBB), the implementation is trivial. | ||
Line 192: | Line 199: | ||
template<typename PortalType, typename Device> | template<typename PortalType, typename Device> | ||
+ | VTKM_CONT | ||
ArrayPortalVirtual(PortalType *portal, Device) | ArrayPortalVirtual(PortalType *portal, Device) | ||
: SrcPortal(portal) | : SrcPortal(portal) | ||
Line 231: | Line 239: | ||
}; | }; | ||
</source> | </source> | ||
+ | |||
+ | : <font color="forestgreen">A technical issue that is not addressed here is that the delegate object needs to be copied from the host to the device. In the example above, the constructor to <tt>ArrayPortalVirtual</tt>, run in the control environment, takes a pointer to the actual portal managing the array. Whoever is calling this constructor has to be careful about allocating space for this object in the execution environment and copying the data over. That caller also needs to be responsible for releasing the data at some point in the future. We have some examples of this, but a general design still needs to be thought out. --[[User:Kmorel|Kmorel]] ([[User talk:Kmorel|talk]]) 14:35, 29 March 2017 (EDT)</font> | ||
+ | |||
+ | |||
+ | <small>SAND 2017-3296 O</small> |
Latest revision as of 16:30, 30 March 2017
Currently VTK-m only supports direct function and method calls in the execution environment. Specifically we do not support virtual methods. Part of the rational is that there is often a cost associated with calling a virtual method over a known method, particularly if that method can be inlined. Also, early versions of CUDA did not support virtual methods.
However, requiring statically typed methods has its costs. Often in practical code it cannot be determined what specific type will be used when a particular method is called. The way we currently handle that is by using template tricks to compile an implementation for all supported types. This works, but can lead to excessively long compile times and large executable sizes.
There are several types of objects for which we would like to introduce virtual methods (or something like it). One big example is a Polymorphic ArrayHandle that would hide the actual storage implementation of an array. Other examples include a virtual implicit function for use with operations like clip, execution objects for dynamic cell set classes, and virtual methods for various cell operations (like interpolate) for different cell shapes.
Contents
Virtual Methods in CUDA
At first blush, implementing virtual methods is just a matter of introducing the virtual keyword to the method implementation and create a shared superclass with an anonymous interface. For a standard CPU implementation (and the latest Xeon Phi) that is perfectly feasible. Things get interesting though for CUDA.
Virtual methods are supported on modern versions of CUDA, so on the surface you can implement it there, too. There are some complications that you have to work through to make sure the virtual table is correct for the device. In C++ when you create an object with virtual methods under the covers a virtual table is built for that object. A virtual table is an array of function pointers that point to the correct implementation of each virtual method (which cannot be resolved until compile time). When a virtual method is called, under the covers the program pulls the appropriate function pointer for that method from the virtual table and then calls that method. In CUDA the complication is that the method pointers are different for the host and device versions of the method.
To ensure that the virtual table gets built with the function pointers appropriate for the device, you need to construct the object on that device. For technical reasons we won't get into here, the structure of the VTK-m code dictates that the object needs to be constructed in a different kernel call than the kernel call that uses the object. (Likewise for destroying the object.) It adds a level of complication, but when it is all said and done it works.
Well, it at least works functionally. From a practical standpoint there is a huge problem: the virtual method code takes a huge performance hit. Here are some experimental measurements taken for a simple array operation using a direct call and a virtual method vtable call.
Method Type | Time |
---|---|
direct (with restrict keyword) | 0.009888 ms |
virtual method (through vtable) | 0.099744 ms |
As you can see, the virtual table method is about 10x slower. This is not acceptable.
A big part of the cost that was measured is the time it takes to malloc the object. (A malloc is necessary because the object must be created and used in different kernel launches. An object on the stack would go out of scope, so a malloc is necessary to keep the object around.) Malloc is very slow on CUDA (at least on the FERMI card we were using for profiling).
Avoiding the Virtual Table
After closer inspection, it was determined that the added time was principally the overhead required to create and destroy these classes with virtual methods/virtual tables. As it happens there is a low-tech way of building a structure with functions whose type is not determined until runtime. It is using the old C trick of declaring a function pointer and setting it to the address of a function. Here is a simple example of how that would work.
typedef void FooSignature();
void FooA();
void FooB();
struct Bar
{
FooSignature *Foo;
};
void DoIt()
{
Bar bar;
bar.Foo = &FooA;
bar.Foo(); // Calls FooA
bar.Foo = &FooB;
bar.Foo(); // Calls FooB
}
So far so good, but we still need to make sure we handle host vs. device functions correctly in CUDA. In particular, we really need to be able to set the function pointers for the device from the host. The CUDA runtime comes with a function to do just that: cudaMemcpyFromSymbol.
So, for example, we could adjust the code from the previous example as follows. To adjust the code from above, assuming the function FooA is changed to include the __device__ modifier, you can replace bar.Foo = &FooA; with
__device__ FooSignature *fooPointer = &FooA;
cudaMemcpyFromSymbol((void**)&bar.Foo, fooPointer, sizeof(FooSignature));
In doing so, the object bar can be passed to a CUDA kernel and a call to the Foo "method" will correctly call FooA.
The important aspect of this approach is that it is no longer necessary to execute kernels on the device to establish virtual tables or function pointers; it is all done on the host. Running the same test as before, we get the following additional result.
Method Type | Time |
---|---|
direct (with restrict keyword) | 0.009888 ms |
virtual method (through vtable) | 0.099744 ms |
function pointer | 0.014528 ms |
This version of indirect function calling is much faster than building virtual tables and is within 50% the performance of direct calling.
Making it Device Independent
Although the approach of using function pointers is motivated by running on CUDA, there is no reason you shouldn't be able to use it on all types of devices. The only real difference is how we get the function pointer. Thus we need a device-independent way to copy a function pointer for a particular device into a function pointer.
This can be done simply by creating a function that is overloaded by a device adapter tag. Let's call it CopyFunctionPointer. The implementation for CUDA would look something like this.
template<typename Signature>
inline
void CopyFunctionPointer(const Signature srcFunction,
Signature &destFunctionPointer,
vtkm::cont::DeviceAdapterTagCuda)
{
cudaMemcpyFromSymbol((void**)&destFunctionPointer,
&srcFunction,
sizeof(Signature));
}
- It turns out this implementation will not quite work. The problem is that CUDA host code cannot access a device function pointer. To implement this, we'll have to launch a kernel function that copies the device pointer. Thus, the interface will be different. --Kmorel (talk) 17:03, 30 March 2017 (EDT)
We overload the last argument for other devices as well. For devices that use a CPU (like Serial and TBB), the implementation is trivial.
template<typename Signature>
inline
void CopyFunctionPointer(const Signature srcFunction,
Signature &destFunctionPointer,
vtkm::cont::DeviceAdapterTagSerial)
{
destFunctionPointer = srcFunction;
}
Putting it all Together
Now that we have all the pieces, let us establish a standard method to implement a class with virtual-like methods. Here is a proposed set of steps to do so.
- For each virtual-like method create a function (in a detail namespace) that takes a void pointer to a concrete class and then calls the method on the class after performing a reinterpret_cast. The function will certainly be templated, but should not be overloaded. You need to be able to fully specify the function without arguments.
- Create the struct or class that has virtual-like methods as normal. There should be no actual virtual methods.
- For each virtual-like method, create a typedef for the signature of the function accessor you created earlier in the class.
- For each virtual-like method, create a member variable in the class with the type of the associated signature.
- The class should also have a void pointer member variable for a reference to the concrete class.
- Create a constructor for the class. The constructor should be templated on the concrete class you are calling methods on as well as a device adapter tag. One of the arguments should be a pointer to an instance of the concrete class. The constructor should initialize its function pointer members using CopyFunctionPointer.
- For each virtual-like method, create an implementation of that method that calls the associated function pointer.
Here is an example that implements these steps with a reasonable implementation of an array portal with virtual-like methods.
// Step 1: For each virtual-like method create a function (in a detail
// namespace) that takes a void pointer to a concrete class and then calls the
// method on the class after performing a reinterpret_cast. The function will
// certainly be templated, but should not be overloaded. You need to be able
// to fully specify the function without arguments.
namespace detail {
template<typename PortalType>
VTKM_EXEC
vtkm::Id ArrayPortalVirtualGetNumberOfValues(void *portalPointer)
{
PortalType &portalRef = *reinterpret_cast<PortalType &>(portalPointer);
return portalRef.GetNumberOfValues();
}
template<typename PortalType, typename ValueType>
VTKM_EXEC
ValueType ArrayPortalVirtualGet(void *portalPointer, vtkm::Id index)
{
PortalType &portalRef = *reinterpret_cast<PortalType &>(portalPointer);
return static_cast<ValueType>(portalRef.Get(index));
}
template<typename PortalType, typename ValueType>
VTKM_EXEC
void ArrayPortalVirtualSet(void *portalPointer,
vtkm::Id index,
const ValueType &value)
{
PortalType &portalRef = *reinterpret_cast<PortalType &>(portalPointer);
portalRef.Set(index, value);
}
} // namespace detail
// Step 2: Create the struct or class that has virtual-like methods as normal.
// There should be no actual virtual methods.
template<typename ValueType>
class ArrayPortalVirtual
{
// Step 3: For each virtual-like method, create a typedef for the signature
// of the function accessor you created earlier in the class.
typename vtkm::Id GetNumberOfValuesSignature(void *);
typename ValueType GetSignature(void *, vtkm::Id);
typename void SetSignature(void *, vtkm::Id, const ValueType &);
// Step 4: For each virtual-like method, create a member variable in the
// class with the type of the associated signature.
GetNumberOfValuesSignature *GetNumberOfValuesFunction;
GetSignature *GetFunction;
SetSignature *SetFunction;
// Step 5: The class should also have a void* member variable for a
// reference to the concrete class.
void *SrcPortal;
public:
// Step 6: Create a constructor for the class. The constructor should be
// templated on the concrete class you are calling methods on as well as a
// device adapter tag. One of the arguments should be a pointer to an
// instance of the concrete class. The constructor should initialize its
// function pointer members using CopyFunctionPointer.
template<typename PortalType, typename Device>
VTKM_CONT
ArrayPortalVirtual(PortalType *portal, Device)
: SrcPortal(portal)
{
CopyFunctionPointer(detail::ArrayPortalVirtualGetNumberOfValues<PortalType>,
this->GetNumberOfValuesFunction,
Device());
CopyFunctionPointer(detail::ArrayPortalVirtualGet<PortalType, ValueType>,
this->GetFunction,
Device());
CopyFunctionPointer(detail::ArrayPortalVirtualSet<PortalType, ValueType>,
this->SetFunction,
Device());
}
// Step 7: For each virtual-like method, create an implementation of that
// method that calls the associated function pointer.
VTKM_SUPPRESS_EXEC_WARNINGS
VTKM_EXEC
vtkm::Id GetNumberOfValues() const
{
return this->GetNumberOfValuesFunction(this->SrcPortal);
}
VTKM_SUPPRESS_EXEC_WARNINGS
VTKM_EXEC
ValueType Get(vtkm::Id index) const
{
return this->GetFunction(this->SrcPortal, index);
}
VTKM_SUPPRESS_EXEC_WARNINGS
VTKM_EXEC
void Set(vtkm::Id index, const ValueType &value) const
{
this->SetFunction(this->SrcPortal, index, value);
}
};
- A technical issue that is not addressed here is that the delegate object needs to be copied from the host to the device. In the example above, the constructor to ArrayPortalVirtual, run in the control environment, takes a pointer to the actual portal managing the array. Whoever is calling this constructor has to be careful about allocating space for this object in the execution environment and copying the data over. That caller also needs to be responsible for releasing the data at some point in the future. We have some examples of this, but a general design still needs to be thought out. --Kmorel (talk) 14:35, 29 March 2017 (EDT)
SAND 2017-3296 O