Simplifying the Use of CellSets in DataSets

From VTKM
Revision as of 10:38, 23 July 2015 by Kmorel (talk | contribs)

This design has been rejected. See Simplifying the Use of CellSets in DataSets 2 for the new design.

The current DataSet class in VTK-m comprises a group of cell sets, a group of fields, and a group of coordinate systems. Managing fields and coordinate systems is straightforward because they behave similarly across different types of cell models. The cell set, however, is problematic because its type depends on the underlying data structure (for example, structured vs. explicit cell connections). Currently, DataSet stores a pointer to an abstract CellSet superclass with some unknown specialization for the particular data structure.

This is currently causing two problems. The first problem is that memory management is not well established. The abstract CellSet object must be stored as a pointer. It is typically allocated outside of DataSet and then passed to that class. Who is responsible for deleting it and when? Currently this is being solved by wrapping the pointer in a Boost smart pointer, but we really want to keep Boost out of the VTK-m API.

The second problem is more serious. Once the CellSet is stored in the DataSet, its type is essentially lost. So whenever you use a DataSet, you have to somehow magically know what the CellSet actually is and do a dynamic_cast to the appropriate concrete type. It's a silly system since if you have to know the type anyway, you may as well store it in a static type. If you don't know the type, you have to write a lot of code to try all types it might be.
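The casting problem can be illustrated with a minimal sketch. The classes below are hypothetical stand-ins in a made-up sketch namespace, not the actual VTK-m classes; the point is only that code holding the abstract pointer has to probe each concrete type it knows about.

```cpp
#include <cassert>
#include <string>

namespace sketch {

// Hypothetical stand-ins for the CellSet superclass and some concrete cell
// set types. These are assumptions for illustration, not VTK-m classes.
class CellSet
{
public:
  virtual ~CellSet() {}
};
class CellSetStructured : public CellSet {};
class CellSetExplicit : public CellSet {};
class CellSetOther : public CellSet {};

// Code that receives only the superclass pointer must try every concrete
// type it knows about. Any type missing from the chain falls through.
inline std::string DescribeCellSet(const CellSet *cells)
{
  assert(cells != nullptr);
  if (dynamic_cast<const CellSetStructured *>(cells) != nullptr)
  {
    return "structured";
  }
  if (dynamic_cast<const CellSetExplicit *>(cells) != nullptr)
  {
    return "explicit";
  }
  return "unknown";
}

} // namespace sketch
```

Every function that consumes the data set would need a chain like this, and adding a new cell set type means revisiting each of them.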

Ultimately, data sets (and in particular, the cell set part) should be designed in one of two ways. Either the data set should be strongly typed such that any code using it knows specifically what the type should be, or the data set should hold a polymorphic cell set with a means to dynamically dispatch on it (much like a dynamic array handle). The current implementation sits in the middle of these two approaches and does a good job at neither.

Strongly Typing Data Sets

After having a lengthy discussion with Dave P. and giving this some thought, I think the most effective approach will be to statically type the vtkm::cont::DataSet class to the type of cell set being used. The reason for this is twofold. First, unlike arrays, which can have different types and storages simultaneously assigned to fields and other features, the data structure is often known. If it is not explicitly known, it is often limited to a small set of possibilities. Thus, I expect it to be easier to manage data sets with explicit types than to deal with unknown types. Second, as we chase the rabbit down the hole of possible data structures, we will likely find a huge combination of possibilities (with a small set of common structures). It would be wasteful to manage all possibilities when only a few are likely.

I propose making DataSet templated on the CellSet being stored in it. The implementation could look something like the following. (Details are left out.)

template<typename CellSetType>
class DataSet
{
public:
  DataSet(const CellSetType &cellSet) : Cells(cellSet) {  }

  // Fields and coordinate systems are handled the same way for every cell set type.
  void AddField(vtkm::cont::Field field) { .. }
  vtkm::cont::Field GetField(vtkm::IdComponent index) const { .. }

  void AddCoordinateSystem(vtkm::cont::CoordinateSystem cs) { .. }
  vtkm::cont::CoordinateSystem GetCoordinateSystem(vtkm::IdComponent index) const { .. }

  // The cell set is stored by value with its concrete type, so no cast is needed.
  void SetCellSet(CellSetType cellSet) { .. }
  CellSetType &GetCellSet() { .. }

private:
  std::vector<vtkm::cont::CoordinateSystem> CoordSystems;
  std::vector<vtkm::cont::Field> Fields;
  CellSetType Cells;
};

Now all typing problems with the CellSet classes go away (although they do get offloaded to whoever is using DataSet).

After looking at this a bit, it is apparent that this design makes for verbose names such as vtkm::cont::DataSet<vtkm::cont::CellSetStructured<3> >. It may be worthwhile to create some subclasses for common types such as vtkm::cont::DataSetStructured.
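As a rough sketch of how a shorter name could be provided, the example below uses a typedef rather than a subclass. That is one possible choice, not necessarily the intended one. All types live in a hypothetical sketch namespace, Field is reduced to a name, and the alias DataSetStructured3D is an invented name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

namespace sketch {

// Hypothetical, heavily simplified stand-ins for the VTK-m types.
template<int Dimension>
struct CellSetStructured
{
  int GetDimension() const { return Dimension; }
};

struct Field
{
  std::string Name;
};

// A data set templated on the concrete cell set type it stores.
template<typename CellSetType>
class DataSet
{
public:
  explicit DataSet(const CellSetType &cellSet) : Cells(cellSet) {}

  void AddField(const Field &field) { this->Fields.push_back(field); }
  const Field &GetField(std::size_t index) const
  {
    assert(index < this->Fields.size());
    return this->Fields[index];
  }
  std::size_t GetNumberOfFields() const { return this->Fields.size(); }

  // The concrete type comes back directly; no cast is required.
  const CellSetType &GetCellSet() const { return this->Cells; }

private:
  std::vector<Field> Fields;
  CellSetType Cells;
};

// A short alias for a common instantiation (the name is an assumption).
typedef DataSet<CellSetStructured<3> > DataSetStructured3D;

} // namespace sketch
```

A typedef keeps the alias and the full template name interchangeable, whereas a subclass would introduce a distinct type that other templates may need to know about.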

Reintroducing Dynamic Data Sets

Once the DataSet class is defined to have a static cell set type, what if we then need to manage a data set whose cell set type is not known until run time? Perhaps this would be because our original prediction of the use of known cell sets is wrong. Or there may be (will likely be?) examples where it is inconvenient to statically type the cell set. In either case, we should have a plan for it, although I expect the implementation to be secondary to the initial rollout.

A dynamic data set can be created with a simple specialized version of DataSet. We could do this by creating a new class with a different name (such as DataSetAbstract) or simply create a specialization of DataSet for the CellSet abstract superclass. The implementation could look something like this.

template<>
class DataSet<vtkm::cont::CellSet>
{
public:
  template<typename CellSetType>
  DataSet(const CellSetType &cells)
    : Cells(new CellSetType(cells)) {  }
private:
  boost::shared_ptr<vtkm::cont::CellSet> Cells;
};

This abstract-like version of DataSet must somehow return the CellSet structure. The easiest way to do this is to simply return a pointer to the CellSet.

  vtkm::cont::CellSet *GetCellSet() { return this->Cells.get(); }

Returning an abstract pointer this way means we are back to having to "know" the correct cell set type and cast to it or manually try lots of casts. It may be more convenient to have a DynamicCellSet class that works essentially like DynamicArrayHandle in that it acts like a smart pointer and contains a method to try to cast to some arbitrary set of types.
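A minimal sketch of what such a DynamicCellSet could look like follows. It assumes C++11 (variadic templates and std::shared_ptr in place of the Boost pointer), and every name in it, including TypeList and CastAndCall, is hypothetical rather than an existing VTK-m interface.

```cpp
#include <cassert>
#include <memory>
#include <string>

namespace sketch {

// Hypothetical stand-ins for the CellSet hierarchy.
class CellSet
{
public:
  virtual ~CellSet() {}
};
class CellSetStructured : public CellSet {};
class CellSetExplicit : public CellSet {};
class CellSetOther : public CellSet {};

// Compile-time list of the concrete types worth trying.
template<typename... Candidates>
struct TypeList {};

class DynamicCellSet
{
public:
  // Store a copy of any concrete cell set behind a shared pointer.
  template<typename CellSetType>
  explicit DynamicCellSet(const CellSetType &cells)
    : Cells(std::make_shared<CellSetType>(cells)) {}

  // No candidates left: report that nothing matched.
  template<typename Functor>
  bool CastAndCall(Functor &, TypeList<>) const { return false; }

  // Try the first candidate; on failure recurse on the remaining ones.
  template<typename Functor, typename First, typename... Rest>
  bool CastAndCall(Functor &functor, TypeList<First, Rest...>) const
  {
    const First *typed = dynamic_cast<const First *>(this->Cells.get());
    if (typed != nullptr)
    {
      functor(*typed);
      return true;
    }
    return this->CastAndCall(functor, TypeList<Rest...>());
  }

private:
  std::shared_ptr<const CellSet> Cells;
};

// Example functor with one overload per candidate type.
struct NameFunctor
{
  std::string Name;
  void operator()(const CellSetStructured &) { this->Name = "structured"; }
  void operator()(const CellSetExplicit &) { this->Name = "explicit"; }
};

} // namespace sketch
```

The casts still happen, but they are written once inside DynamicCellSet instead of at every call site, and the caller only supplies the list of types it is prepared to handle.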

Multiple Cell Sets

A data set with a single cell set of a static structure is the 80% solution. However, there are cases when a single cell set cannot capture the full topology. There are two approaches one can take for data sets with multiple cell sets. We can either statically define the number and type of cell sets, or we can dynamically set them at run time.

Below I list three options for composite data sets. I am listing them here as possible implementations. It is up for debate which ones should be implemented.

Dynamically Defined Composite Cell Sets

A simple composite data set could hold a collection of abstract data sets of unknown type. That would allow both the types and the number of cell sets to be specified at run time. This is usually the case, for example, for data with node sets, edge sets, face sets, etc. It is hard to predict in advance how many such sets there will be and of what type.

I propose storing the multiple cell sets as a list of DataSet classes rather than CellSet classes. This will simplify associating field and coordinate data to the correct data set. It is common for at least some fields to be associated with one particular cell set and not others, and managing which goes where is tricky. If each cell set is contained in its own (sub) data set, then the association of fields to cell sets is clear. In the case where a field or coordinate system applies to multiple cell sets, they can be shallow-copied among them.

class DataSetComposite
{
public:
  void AddDataSet(vtkm::cont::DataSetAbstract dataSet) { .. }
  vtkm::cont::DataSetAbstract GetDataSet(vtkm::IdComponent index) { .. }
  vtkm::IdComponent GetNumberOfDataSets() const { .. }
private:
  std::vector<vtkm::cont::DataSetAbstract> SubDataSet;
};
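The shallow copying of a shared field among sub data sets can be sketched as follows. This is only an illustration under simplifying assumptions: Field here is a hypothetical class whose copies share one buffer through std::shared_ptr, and SubDataSet stands in for the abstract data set type.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

namespace sketch {

// Hypothetical field whose copies share a single underlying buffer.
class Field
{
public:
  Field(const std::string &name, const std::vector<double> &values)
    : Name(name), Values(std::make_shared<std::vector<double> >(values)) {}

  const std::string &GetName() const { return this->Name; }
  std::vector<double> &GetValues() { return *this->Values; }

private:
  std::string Name;
  std::shared_ptr<std::vector<double> > Values;
};

// Stand-in for a sub data set; the cell set itself is omitted.
class SubDataSet
{
public:
  void AddField(const Field &field) { this->Fields.push_back(field); }
  Field &GetField(std::size_t index)
  {
    assert(index < this->Fields.size());
    return this->Fields[index];
  }

private:
  std::vector<Field> Fields;
};

// Composite holding any number of sub data sets, chosen at run time.
class DataSetComposite
{
public:
  void AddDataSet(const SubDataSet &dataSet) { this->SubDataSets.push_back(dataSet); }
  SubDataSet &GetDataSet(std::size_t index)
  {
    assert(index < this->SubDataSets.size());
    return this->SubDataSets[index];
  }
  std::size_t GetNumberOfDataSets() const { return this->SubDataSets.size(); }

private:
  std::vector<SubDataSet> SubDataSets;
};

} // namespace sketch
```

Adding the same Field object to two sub data sets copies only the handle, so a change made through one sub data set is visible through the other.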

The DataSetComposite could mimic the behavior of a structured DataSet by implementing methods to get fields and coordinate systems. It could do this by enumerating over all sub data sets, although I am not sure how useful this would be. You could also implement a similar GetCellSet method. This would match better if the other data set classes pretended to have a list that was always size 1. Once again, I'm not sure how important or useful that would be.

Statically Defined Composite Cell Sets

Another possible implementation for composite data sets is to statically (at compile time) define the number of sub data sets and the cell types of each. An example of such a use case is a molecular data set that has exactly two cell sets of known types: one for atoms and one for bonds.

This can be done with variadic templates, which are directly supported in C++11 and can be implemented with hacks in ANSI C++.
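A minimal C++11 sketch of the variadic approach, using std::tuple to hold one sub data set per cell set type, might look like the following. The class name DataSetCompositeStatic and the atom and bond cell set types are hypothetical, chosen to mirror the molecular example.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>

namespace sketch {

// Hypothetical cell set types for the molecular example.
struct CellSetAtoms { int NumberOfAtoms; };
struct CellSetBonds { int NumberOfBonds; };

// Simplified statically typed data set (fields omitted).
template<typename CellSetType>
class DataSet
{
public:
  explicit DataSet(const CellSetType &cellSet) : Cells(cellSet) {}
  const CellSetType &GetCellSet() const { return this->Cells; }

private:
  CellSetType Cells;
};

// Composite whose number and types of sub data sets are fixed at compile time.
template<typename... CellSetTypes>
class DataSetCompositeStatic
{
  typedef std::tuple<DataSet<CellSetTypes>...> TupleType;

public:
  explicit DataSetCompositeStatic(const DataSet<CellSetTypes> &... dataSets)
    : SubDataSets(dataSets...) {}

  // The index is a template argument, so the return type is known statically.
  template<std::size_t Index>
  const typename std::tuple_element<Index, TupleType>::type &GetDataSet() const
  {
    return std::get<Index>(this->SubDataSets);
  }

  static std::size_t GetNumberOfDataSets() { return sizeof...(CellSetTypes); }

private:
  TupleType SubDataSets;
};

// Example use: a molecule with one atom cell set and one bond cell set.
inline void ExampleMolecule()
{
  CellSetAtoms atoms;
  atoms.NumberOfAtoms = 3;
  CellSetBonds bonds;
  bonds.NumberOfBonds = 2;
  DataSet<CellSetAtoms> atomData(atoms);
  DataSet<CellSetBonds> bondData(bonds);

  DataSetCompositeStatic<CellSetAtoms, CellSetBonds> molecule(atomData, bondData);
  assert(molecule.GetDataSet<0>().GetCellSet().NumberOfAtoms == 3);
  assert(molecule.GetDataSet<1>().GetCellSet().NumberOfBonds == 2);
}

} // namespace sketch
```

Because each sub data set keeps its concrete type, no casting is needed, at the cost of the composite type itself growing with every cell set type it holds.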

Variable Cell Set Lists of the Same Type

A third use case is a data set comprising a list of cell sets that are all of the same type but whose number can vary. The node/edge/face set example suggested before might be a good example of this. All could likely be explicit cell sets, but the number of them is unknown until run time.

template<typename CellSetType>
class DataSetComposite
{
public:
  void AddDataSet(vtkm::cont::DataSet<CellSetType> dataSet) { .. }
  vtkm::cont::DataSet<CellSetType> GetDataSet(vtkm::IdComponent index) { .. }
  vtkm::IdComponent GetNumberOfDataSets() const { .. }
private:
  std::vector<vtkm::cont::DataSet<CellSetType> > CellSets;
};

Cleaning up the CellSet Classes

This is a somewhat different topic, but the class structure underneath the cell set classes is a bit inconsistent, which makes it a bit confusing. Here is a breakdown of the classes and what I think I understand about them.

  • CellSet A CellSet holds all the information about the topology. It contains all the information about how topological elements are connected (nodes to edges to faces to cells and all possible combinations). Some of these connections are directly accessible. Others might need some processing. For example, you might need to build the links from cells to nodes in an explicit cell set.
  • Connectivity A Connectivity class describes the connections from one topological element to another. For example, the connectivity from nodes to cells describes how cells are connected and fields are interpolated. The connectivity from cells to nodes describes all the cells incident on a node and allows cell to point operations.
  • Structure There is really only a Structure class for structured grids and as far as I can tell, it is a convenience class to define methods for different dimensionality and different connectivity directions.

For sake of clarity and convention I suggest the following:

  • The name of the connectivity classes be ConnectivityExplicit and ConnectivityRegular rather than ExplicitConnectivity and RegularConnectivity. This follows the VTK-m convention.
  • The connectivity classes should all be in the vtkm::cont package. Functionality that is identical in both control and execution environment (such as what you find in the current RegularConnectivity and RegularStructure classes) should be in the vtkm::internal package (unless there is a reason to use the functionality outside of the cell set support objects).
  • For each connectivity type in the control environment, there is a matching one in the execution environment (vtkm::exec). This will be used internally by the worklet to get connection information.
    Should the control and execution classes have the same name or different names? Which way is more clear? Kmorel (talk) 10:56, 20 July 2015 (EDT)
  • The TopologyType enum should be changed to a collection of tag structures (e.g. struct TopologyTypeCellTag {};). Tags are a bit less awkward for templates and overloading, and as far as I can tell these enumerations are not used for anything other than template resolution.
  • The user code should not have to call the get connectivity method on the cell set. This should be handled within the worklet (more specifically in the transport). Which type of connectivity to get (node to cell vs. cell to node or any other combination) should be defined by the worklet type. This means there will be a different worklet type for each from-to combination. That might lead to worklets templated on the from-to pair.
  • There is inconsistency in the class names for structured grids. The cell set class was named structured but the connectivity class is named regular. I think the appropriate name is structured since if you apply an explicit coordinate system it becomes irregular. So the names should be CellSetStructured and ConnectivityStructured. However, the Make*RegularDataSet method names in MakeTestDataSet.h are correct since that is making it with a uniform coordinate system to make it regular.
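The tag suggestion and the idea of templating on a from-to pair can be sketched together. Only TopologyTypeCellTag is named in the list above; TopologyTypeNodeTag, WorkletMapTopology, and DescribeConnectivity are hypothetical names used purely for illustration.

```cpp
#include <cassert>
#include <string>

namespace sketch {

// Empty tag structures replacing the values of a TopologyType enum.
struct TopologyTypeCellTag {};
struct TopologyTypeNodeTag {};

// One overload per from-to pair, selected at compile time by the tag types.
inline std::string DescribeConnectivity(TopologyTypeNodeTag, TopologyTypeCellTag)
{
  return "node to cell";
}
inline std::string DescribeConnectivity(TopologyTypeCellTag, TopologyTypeNodeTag)
{
  return "cell to node";
}

// Hypothetical worklet base templated on the from-to pair. The worklet type
// alone determines which connectivity is requested.
template<typename FromTopology, typename ToTopology>
struct WorkletMapTopology
{
  std::string GetConnectivityName() const
  {
    return DescribeConnectivity(FromTopology(), ToTopology());
  }
};

// Example use: the connectivity follows from the worklet's template arguments.
inline void ExampleTagDispatch()
{
  WorkletMapTopology<TopologyTypeNodeTag, TopologyTypeCellTag> nodeToCell;
  assert(nodeToCell.GetConnectivityName() == "node to cell");
}

} // namespace sketch
```

With tags, an unsupported from-to pair fails at compile time because no overload matches, rather than surfacing as a run-time error.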
Should we add the concept of extents to the structured cell sets? VTK uses a 6 id min/max extent rather than a 3 id dimension for structured 3D grids. Dax adopted this simply because that is what VTK did. But everyone finds extents confusing and I don't know if they are important. Do we need them in VTK-m? Kmorel (talk) 10:56, 20 July 2015 (EDT)

SAND 2015-6011 O