Introduction to ragged tensors
- By Torben Windler
Introduction
Problem statement
With the standardization of traditional machine learning problems, many models are very easy to implement: read a table of features from a database, use pandas and numpy for preprocessing, build a model with one of the well-known libraries, and type model.fit(). In many cases – that’s it! Done!
But wait – what if the data does not come in tabular form and is irregular by nature? What if there are instances with varying dimensions? Consider a scenario for time series classification and suppose we have a dataset consisting of four different short time series:
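To make this concrete, here is a minimal sketch of such a dataset in plain Python (the numbers are the same ones used for the dataframe later in this post); each series is a pair of timestamps and measured values:
series = [
    ([0, 3, 6, 8, 10], [3, 1, 8, 0, 9]),
    ([0, 5, 8], [15, 11, 7]),
    ([0, 2, 4, 9], [12, 7, 8, 2]),
    ([0, 4, 6, 10], [9, 0, 13, 4]),
]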
As you can see, the series differ in both the number and the timing of their measurements. Since machine learning models typically require a fixed input size, it is a bit more complicated to fit such data into our models.
There are a number of possibilities to handle this type of input; for example we could interpolate the series and take virtual measurements at the same timestamps for each series:
Here we take the values at timestamps 0, 2, 4, 6, 8, and 10 so that every series consists of 6 values. However, at this stage we already have to choose hyperparameters such as the type of interpolation, the number of virtual measurements, etc. Moreover, we cannot rely on the accuracy of the interpolation, especially for extrapolated values and for values within large gaps between successive measurements (see the orange and green series at time 10).
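As a rough sketch of this resampling step (reusing the series list from above and plain linear interpolation with numpy; note that np.interp simply clamps to the edge values outside the measured range, which is exactly where it becomes unreliable):
import numpy as np

# Virtual measurements at the common timestamps 0, 2, 4, 6, 8, 10.
common_t = np.arange(0, 11, 2)
resampled = np.stack([np.interp(common_t, t, v) for t, v in series])
print(resampled.shape)
# (4, 6)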
From the technical side, when we feed the data into a TensorFlow Keras model and do not want to use interpolation techniques, a common practice is to pad the series, e.g. with zeros at the end. This is necessary because TensorFlow groups data into batches which must have the same shape in every dimension. A batch of the 4 interpolated series above would have the shape (4, 6), with 4 being the number of series (the batch dimension) and 6 the number of measurements per series; padding the original series to the length of the longest one would give a batch of shape (4, 5) instead.
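A minimal padding sketch, assuming tf.keras.preprocessing.sequence.pad_sequences and the series list from above, could look like this:
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Pad the measured values of every series with zeros at the end so that
# all series share the length of the longest one (5 measurements).
padded = pad_sequences([v for _, v in series], padding='post', value=0)
print(padded.shape)
# (4, 5)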
Either way, part of the input consists of artificial data, be it interpolated measurements or padding values. To avoid the inaccuracy and the overhead of both of these techniques, we can use ragged tensors to work with the original data.
Concept of ragged tensors
The concept of ragged tensors is surprisingly easy after understanding the intention behind them. Let’s stick with our above example with 4 time series. As you can see, the minimum number of measurements per series is 3, while the maximum is 5. With padding we would have to fill every series with zeros at the end (or sometimes at the beginning) to achieve a common length of 5.
In contrast, a ragged tensor consists of the concatenation of all values from all series together with metadata specifying where to split the concatenation into the individual series. Let’s define our dataframe df and then our ragged tensor rt:
| time | value |
| --- | --- |
| 0 | 3 |
| 3 | 1 |
| 6 | 8 |
| 8 | 0 |
| 10 | 9 |
| 0 | 15 |
| 5 | 11 |
| 8 | 7 |
| 0 | 12 |
| 2 | 7 |
| 4 | 8 |
| 9 | 2 |
| 0 | 9 |
| 4 | 0 |
| 6 | 13 |
| 10 | 4 |
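One possible way to build this dataframe with pandas (a sketch: all four series concatenated into one table of (time, value) rows, in the order shown above), together with the TensorFlow import used in the following snippets:
import pandas as pd
import tensorflow as tf

df = pd.DataFrame({
    "time": [0, 3, 6, 8, 10, 0, 5, 8, 0, 2, 4, 9, 0, 4, 6, 10],
    "value": [3, 1, 8, 0, 9, 15, 11, 7, 12, 7, 8, 2, 9, 0, 13, 4],
})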
row_splits = [0, 5, 8, 12, 16]
rt = tf.RaggedTensor.from_row_splits(values=df.values, row_splits=row_splits)
rt
<tf.RaggedTensor [[[0, 3], [3, 1], [6, 8], [8, 0], [10, 9]], [[0, 15], [5, 11], [8, 7]], [[0, 12], [2, 7], [4, 8], [9, 2]], [[0, 9], [4, 0], [6, 13], [10, 4]]]>
As we can see, the row_splits array defines the boundaries of the individual series: series i consists of the rows from row_splits[i] (inclusive) up to row_splits[i+1] (exclusive).
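For a quick look at the structure (a small sketch), indexing gives back the individual series and row_lengths() returns the number of measurements per series:
print(rt[0].shape)
# (5, 2)
print(rt.row_lengths())
# tf.Tensor([5 3 4 4], shape=(4,), dtype=int64)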
That’s it. This is the really simple structure of ragged tensors. As an alternative to specifying the row_splits we can also create the same ragged tensor with one of the following methods:
- value_rowids: for every row in the concatenated values we specify the index of the series it belongs to:
value_rowids = [0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
rt_1 = tf.RaggedTensor.from_value_rowids(values=df.values, value_rowids=value_rowids)
rt_1
<tf.RaggedTensor [[[0, 3], [3, 1], [6, 8], [8, 0], [10, 9]], [[0, 15], [5, 11], [8, 7]], [[0, 12], [2, 7], [4, 8], [9, 2]], [[0, 9], [4, 0], [6, 13], [10, 4]]]>
- row_lengths: we state the length of every individual series:
row_lengths = [5, 3, 4, 4]
rt_2 = tf.RaggedTensor.from_row_lengths(values=df.values, row_lengths=row_lengths)
rt_2
<tf.RaggedTensor [[[0, 3], [3, 1], [6, 8], [8, 0], [10, 9]], [[0, 15], [5, 11], [8, 7]], [[0, 12], [2, 7], [4, 8], [9, 2]], [[0, 9], [4, 0], [6, 13], [10, 4]]]>
- constant: we can define the ragged tensor as a “constant” by directly specifying a list of arrays:
rt_3 = tf.ragged.constant([df.loc[0:4, :].values,
                           df.loc[5:7, :].values,
                           df.loc[8:11, :].values,
                           df.loc[12:15, :].values])
rt_3
<tf.RaggedTensor [[[0, 3], [3, 1], [6, 8], [8, 0], [10, 9]], [[0, 15], [5, 11], [8, 7]], [[0, 12], [2, 7], [4, 8], [9, 2]], [[0, 9], [4, 0], [6, 13], [10, 4]]]>
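A quick sanity check, sketched with .to_list() (which converts a ragged tensor into nested Python lists), confirms that all four constructions hold the same data:
print(rt.to_list() == rt_1.to_list() == rt_2.to_list() == rt_3.to_list())
# True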
In the end, it does not matter which method we choose to create a ragged tensor; the results are all equivalent. Next we'll see how to perform mathematical operations on ragged tensors.
Working with ragged tensors
TensorFlow provides a very handy function to perform operations on ragged tensors: tf.ragged.map_flat_values(op, *args, **kwargs). It does what the name says: every ragged tensor in args is substituted by its concatenated (flat) version, i.e. the batch dimension is dropped. In our example, this is the same as operating on df.values directly. The only difference is that the output of the operation is again a ragged tensor with the same metadata about where to split. Let's consider an example where we compute the matrix product of the ragged tensor with a matrix m of shape (2, 5). Each individual series in our ragged tensor has shape (k, 2), where k is the number of measurements in that series. We take care to cast the values to floats first:
m = tf.random.uniform(shape=[2, 5])
print(m.shape)
(2, 5)
rt = tf.cast(rt, tf.float32)
result = tf.ragged.map_flat_values(tf.matmul, rt, m)
print(*(t.shape for t in result), sep='\n')
(5, 5)
(3, 5)
(4, 5)
(4, 5)
The same mechanism also works with tf.einsum and a higher-dimensional operand, for example a tensor m of shape (2, 5, 4):
m = tf.random.uniform(shape=[2, 5, 4])
print(m.shape)
(2, 5, 4)
rt = tf.cast(rt, tf.float32)
result = tf.ragged.map_flat_values(tf.einsum, "bi, ijk -> bjk", rt, m)
print(*(t.shape for t in result), sep='\n')
(5, 5, 4)
(3, 5, 4)
(4, 5, 4)
(4, 5, 4)
As expected, the batch dimension b corresponds to the length of the individual series, while the other dimensions originate from m. By the way, tf.einsum refers to the Einstein summation convention, which is extremely handy when working with higher-dimensional tensors. Read more about it here.
One last thing: it is also very easy to perform aggregations over ragged tensors. For example, if we want the column-wise sum per series, we can use the standard reduction functions:
tf.reduce_sum(rt, axis=1)
<tf.Tensor: shape=(4, 2), dtype=float32, numpy=
array([[27., 21.],
[13., 33.],
[15., 29.],
[20., 26.]], dtype=float32)>
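Other reductions work the same way; for example, the mean time and mean value per series (a sketch, expected output rounded to two decimals):
tf.reduce_mean(rt, axis=1)
# <tf.Tensor: shape=(4, 2), dtype=float32, numpy=
# array([[5.4 , 4.2 ],
#        [4.33, 11.  ],
#        [3.75, 7.25],
#        [5.  , 6.5 ]], dtype=float32)>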
There exist many more operations for ragged tensors, which are listed here.
Conclusion
We learned about the structure of TensorFlow ragged tensors and how to perform basic mathematical operations on them. They make it unnecessary to apply unnatural preprocessing techniques like interpolation or padding. This is especially useful for irregular time series datasets, but there are many other applications as well. Imagine a dataset with images of various sizes: ragged tensors can even handle multiple ragged dimensions, which makes them a perfect fit for such data, as the tiny sketch below illustrates.
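As a tiny illustration with toy numbers (nothing to do with real images): each "image" below has its own number of rows, and each row its own length, so the tensor has two ragged dimensions:
images = tf.ragged.constant([
    [[1, 2, 3], [4, 5]],
    [[6], [7, 8], [9, 10, 11]],
])
print(images.shape)
# (2, None, None)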
In a subsequent post I will dive a bit deeper into how to work with ragged tensors as input types for a Keras model by treating the individual time series as sets and performing attention directly on the ragged tensors. Stay tuned!