## The CoRa Tensor Compiler: Compilation for Ragged Tensors With Minimal Padding

Pratik Fegade<sup>1</sup>, Tianqi Chen<sup>12</sup>, Phillip B. Gibbons<sup>1</sup>, Todd C. Mowry<sup>1</sup>

<sup>1</sup>Carnegie Mellon University, <sup>2</sup>OctoML

**Carnegie Mellon University** Computer Science Department





## Ragged Tensors in Deep Learning

Natural language processing



Image processing







Ragged Tensor





## **Ragged Tensors**

- A ragged tensor is a tensor where the slices corresponding to one or more dimensions have varying lengths
- Example: rows (slices of the inner dimension) have varying lengths



## Limited Support for Ragged Tensor Operators

- Limited support for operations on ragged tensors
- Extensive support for dense tensors



## And Padding Leads to Wasted Computation





#### 1.07-2.41× wasted computation!
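One simple way to quantify padding overhead is to compare the number of elements a fully padded kernel touches with the number that carry real data. The sketch below does this for a hypothetical batch; the helper name `padding_overhead` and the sequence lengths are illustrative assumptions and are not meant to reproduce the measured range above.

```python
def padding_overhead(lengths):
    """Ratio of elements processed with padding to elements actually needed."""
    padded = len(lengths) * max(lengths)  # every row padded to the longest row
    useful = sum(lengths)                 # elements that carry real data
    return padded / useful


# Hypothetical batch of four sequence lengths (illustrative values only).
lengths = [5, 3, 8, 2]
ratio = padding_overhead(lengths)
```

A batch whose rows all have the same length gives a ratio of exactly 1, meaning no work is wasted.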







## Ideal Execution: Compilation Without Padding



## **CoRa Enables Ragged Tensor Execution for Higher Frameworks**











## **CoRa Enables Transformer Implementation Without Padding**





- Motivation: Inefficient Support for Ragged Tensors
- CoRa: Our Compiler Based Solution
  - Scheduling and lowering
  - API and overview
- Evaluation
- Wrapping up



- Densely packed data with no holes, like dense tensors
- Ragged computations are similar to dense tensor computations

    for i in 0:32:
      for j in 0:s(i):
        B[i,j] = 2*A[i,j]

#### Reuse abstractions and techniques from dense tensor compilers

#### Variable loop extents: generalize

- Compiler's loop representations
- Scheduling primitives and their implementations

#### Ragged tensor accesses: generalize

- Tensor storage scheme
- Tensor access lowering
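The ragged loop nest can be mimicked in plain Python to show how a variable loop extent visits only real elements. This is a minimal sketch, not CoRa code: `double_ragged` is a hypothetical helper, the tensor is a nested list, and three rows stand in for the 32 in the loop nest.

```python
def double_ragged(A, s):
    """Compute B[i][j] = 2 * A[i][j] for j in range(s(i)); no padding is visited."""
    B = []
    for i in range(len(A)):
        row = []
        for j in range(s(i)):        # variable loop extent: depends on i
            row.append(2 * A[i][j])  # ragged tensor access
        B.append(row)
    return B


# Hypothetical ragged input with row lengths 3, 1 and 2.
A = [[1, 2, 3], [4], [5, 6]]
B = double_ragged(A, lambda i: len(A[i]))
```

Here the inner bound is a function of the outer iteration variable, which is the property a dense tensor compiler's loop representation has to be generalized to express.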





## Need to precompute dimension offsets before kernel execution

Offset(1, 3) = RowStart(1) + 3

Once precomputed, we have cheap random accesses, similar to dense tensors!
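The offset rule above can be sketched as an exclusive prefix sum over row lengths. The names `row_starts` and `offset` and the row lengths are assumptions made for illustration; they are not CoRa's API.

```python
from itertools import accumulate


def row_starts(lengths):
    """Exclusive prefix sum of row lengths: RowStart(i) in a packed layout."""
    return [0] + list(accumulate(lengths))[:-1]


def offset(starts, i, j):
    """Offset(i, j) = RowStart(i) + j."""
    return starts[i] + j


# Hypothetical row lengths; the prefix sum is computed once, before the kernel runs.
lengths = [5, 4, 8, 2]
starts = row_starts(lengths)
position = offset(starts, 1, 3)  # RowStart(1) + 3
```

After the one-time prefix sum, each access costs one table lookup and one addition, which is what keeps random access cheap.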

## CoRa's API Is Similar to That of Dense Compilers



    i, j = B.axis
    f = fuse(i, j)
    fo, fi = split(f, 64)
    bind(fo, 'blockIdx.x')
    bind(fi, 'threadIdx.x')

Other scheduling primitives for load balancing, operation splitting, and tensor dimension scheduling are also available.



## CoRa's Compilation and Runtime Pipeline





- Pre-computation for
  - Fused loop extents and iteration variable relationships
  - Memory offsets for access lowering
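The pre-computation listed above can be sketched for a fused and split ragged loop nest. This is an illustrative emulation under stated assumptions, not CoRa's generated code: `precompute_fused` and `run_fused_split` are hypothetical names, the lookup tables are plain Python lists, and the loops that would be bound to GPU block and thread indices run sequentially.

```python
def precompute_fused(lengths):
    """Return the fused loop extent and tables mapping fused index f to (i, j)."""
    f_to_i, f_to_j = [], []
    for i, n in enumerate(lengths):
        for j in range(n):
            f_to_i.append(i)
            f_to_j.append(j)
    return len(f_to_i), f_to_i, f_to_j


def run_fused_split(A_flat, lengths, factor=64):
    """Emulate B[i,j] = 2*A[i,j] after fuse(i, j) and split(f, factor)."""
    extent, f_to_i, f_to_j = precompute_fused(lengths)
    # Memory offsets for access lowering: exclusive prefix sum of row lengths.
    starts, total = [], 0
    for n in lengths:
        starts.append(total)
        total += n
    B_flat = [0] * extent
    n_outer = (extent + factor - 1) // factor
    for fo in range(n_outer):        # would be bound to blockIdx.x
        for fi in range(factor):     # would be bound to threadIdx.x
            f = fo * factor + fi
            if f < extent:           # guard the partially filled last block
                i, j = f_to_i[f], f_to_j[f]
                B_flat[starts[i] + j] = 2 * A_flat[starts[i] + j]
    return B_flat


# Hypothetical packed input for rows of length 2, 0 and 3.
lengths = [2, 0, 3]
result = run_fused_split([1, 2, 3, 4, 5], lengths, factor=2)
```

Because the tables are built once on the host, the fused loop needs only lookups to recover `i` and `j`, and the work is spread evenly over blocks regardless of row length.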

## Layer Forward Pass Latencies on Nvidia V100 GPU

Chart: normalized execution time (lower is better).

## Wrapping Up CoRa



- CoRa is a tensor compiler for operations on ragged tensors
- CoRa provides a familiar API similar to that of dense tensor compilers
- CoRa generates code as performant as hand-written code for transformer models