[wip] resnet batchnorm backward fusion spec #4370
thanks - adding this to the scheduler roadmap!
```python
# easy case: merge 4 reduces in backward into 1
# double reduce case: merge stat calculations from 2 to 1 (be careful of long reduces!)
# sum(x - \bar{x}): one kernel just calculates this, can be eliminated
# pre-expand fusion: is it fast? -2 kernels possible, 1 fw, 1 bw
```
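A hedged sketch (not from this PR) of the shape of the "easy case": several parallel reduces over the same batchnorm-sized tensor that share an input and a reduce shape, so a fused scheduler could emit one kernel instead of four. The exact four reduces in the real bn backward may differ; `check_schedule` is the helper from test/test_schedule.py.

```python
from tinygrad import Tensor

# illustrative stand-ins for activations and incoming gradients (shapes from this discussion)
x = Tensor.empty(2, 16, 8, 8)
dy = Tensor.empty(2, 16, 8, 8)

# four parallel reduces over the same (N, C, H, W) input, all reducing over (0, 2, 3)
outs = [dy.sum((0, 2, 3)), (dy * x).sum((0, 2, 3)), x.sum((0, 2, 3)), (x * x).sum((0, 2, 3))]

# today these schedule as 4 reduce kernels; under the spec they would become 1, e.g.
# check_schedule(outs, 1)
```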
does this refer to E_2_16_64n1 + E_2048 (graph ref: https://tiny-tools-client.vercel.app/?id=f7b72a41bad14974970329924c89b2c0)?
#4235 could do this, but it won't, because <LB METAL (2, 16, 8, 8) float (<UnaryOps.CAST: 3>, None)> is forced_realize. I think it breaks the API if we fuse a forced_realize parent with its child.
I am referring to E2_16_64n1 (full forward with relu) and E2_16_64 (full backward through batchnorm). The first can be fused with the next conv, and the latter can be fused with the next backward conv. (E_2048 simulates the backward from the next layer, plus relu backward)
This test case does not have the convs (to focus on batchnorm), so it cannot happen here. Will add more cases.
Added detailed behavior spec. The fusion decision for the parallel reduces should be straightforward and "free" performance-wise, but fusing conv(a + b) may be bad in some cases. Need a heuristic to decide when a buffer counts as a "big" buffer and when one counts as a "small" buffer. The specs so far can remove 8 out of 14 extraneous memory passes in bn(conv2d).relu(), with an estimated time saving of 33ms on BS=256 resnet. (Edited because I posted fake news)
the scheduler change is a little tricky, since you need to make sure that each grouping is a contiguous sub-DAG. My current solution is to do the grouping while toposorting, which should work for the specific bn training case, but is it possible to make it clean? Probably deferring contiguous reduces until you run out of nodes in the queue and then grouping them would work.
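A minimal sketch of that deferral idea, over a hypothetical node type with `parents` and an `is_reduce` flag (this is not the PR's scheduler code): run a Kahn-style toposort, hold back reduce nodes as they become ready, and when the queue of non-reduce nodes runs dry, emit the held-back reduces as one fusion group. Every node in such a group has all of its parents already scheduled, so merging the group cannot create a scheduling cycle.

```python
from collections import deque

def toposort_with_reduce_groups(nodes):
  # hypothetical node objects: n.parents is a list of nodes, n.is_reduce is a bool
  indegree = {n: len(n.parents) for n in nodes}
  children = {n: [] for n in nodes}
  for n in nodes:
    for p in n.parents: children[p].append(n)
  ready = deque(n for n in nodes if indegree[n] == 0)
  deferred, order, groups = [], [], []
  while ready or deferred:
    if not ready:
      # no more non-reduce work: the deferred reduces become one fusion group
      groups.append(list(deferred))
      ready.extend(deferred); deferred.clear()
      continue
    n = ready.popleft()
    order.append(n)
    for c in children[n]:
      indegree[c] -= 1
      if indegree[c] == 0:
        (deferred if c.is_reduce else ready).append(c)
  return order, groups
```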
I need to think about the scheduler change a bit more, but in general we don't wanna do merge schedules; if there is grouping to be done it should be here https://github.com/tinygrad/tinygrad/blob/master/tinygrad/engine/schedule.py#L225-L228
```python
new_arg = MemBuffer(new_lbs.index(old_lbs[ast.arg.idx]), ast.arg.dtype, ast.arg.st) if ast.op in [BufferOps.LOAD, BufferOps.STORE] else ast.arg
return LazyOp(ast.op, tuple(_replace_bufis(x, old_lbs, new_lbs) for x in ast.src), new_arg)

def _merge_prescheduled(prescheduled: List[_LBScheduleItem]):
```
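For reference, a hedged sketch of what a `_merge_prescheduled` built on the `_replace_bufis` above could look like; the `_LBScheduleItem` field names used here (`ast` as a tuple of per-output LazyOps, `outputs`, `inputs`) are assumptions for illustration, not necessarily the PR's actual structure. The idea: concatenate the outputs, keep only inputs produced outside the group, and reindex every MemBuffer against the merged buffer list.

```python
from typing import List

def _merge_prescheduled(prescheduled: List["_LBScheduleItem"]):  # field names are assumed
  # merged buffer order: all group outputs first, then deduplicated external inputs
  outs = [lb for lsi in prescheduled for lb in lsi.outputs]
  ins = list(dict.fromkeys(lb for lsi in prescheduled for lb in lsi.inputs if lb not in outs))
  new_lbs = outs + ins
  # rewrite each AST so its LOAD/STORE MemBuffer indices point into the merged list
  new_ast = []
  for lsi in prescheduled:
    old_lbs = list(lsi.outputs) + list(lsi.inputs)
    new_ast.extend(_replace_bufis(op, old_lbs, new_lbs) for op in lsi.ast)
  return tuple(new_ast), outs, ins
```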
I've gone through this route in multioutput, tinygrad/tinygrad/engine/schedule.py, line 86 in 6c2cb8e:
def _schedule_outputs(outs:List[_LBScheduleItem], reduce_for_op:Dict[LazyBuffer, LazyBuffer]) -> ScheduleItem:
I think you need to rebuild the entire AST.
I am prototyping with merge_prescheduled because I need to toposort to find these fusion opportunities (I don't see a way to analyze the graph locally to find them), and I need shapetracker information to match (lazybuffer, st) read pairs, conveniently provided by preschedule. The rules as implemented are a little in the style of a "performance heuristic" though, which is a little different from the other rules we have. Is it possible to move back to pure scheduling land?
I think all of your fusion targets are children of https://tiny-tools-client.vercel.app/?id=3ef8c4a72b0c4999acca0dff9288b2fa. Could traversing its local graph work?
Some of them are also children of the forward pass. How can we tell if there is a path forward -> BN forward -> stuff -> fusion targets, so that we don't fuse bn forward and backward? The first attempt did toposort + local children, but if you don't have all inputs before E_2048 (with BN we are lucky), you have to get lucky with the toposort order (most of the tests will not pass).
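One possible answer, sketched over the same hypothetical graph with `children` pointers (an assumption, not tinygrad's API): a reachability check that asks whether a candidate reaches one of the fusion targets through at least one node outside the group. If it does, including it would make the group depend on its own output, so it has to stay out.

```python
from collections import deque

def reaches_targets_through_outside(src, targets, group, children):
  # BFS from src's children that are outside the group; hitting any fusion target
  # means there is a src -> ... -> stuff -> target path, so src must not be fused
  seen = set()
  q = deque(c for c in children[src] if c not in group)
  while q:
    n = q.popleft()
    if n in seen: continue
    seen.add(n)
    if n in targets: return True
    q.extend(children[n])
  return False
```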
```python
# match by input + ST and two shapes? start with contiguous input only, check shapes (should determine reduces)

# what if same input + st but one is early and another is late?
check_schedule([x.sum(0, keepdim=True) + a, (a + b).sum()], 2)
```
is this a real-world case?
Could be... maybe if you have a bias weight, with out.sum(0) + bias -> next layer and (bias**2).sum() -> LARS?
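A hedged sketch of that hypothetical case (shapes and names made up), where the same bias buffer feeds both an elementwise consumer and a separate reduce, matching the "same input, one early and one late" pattern in the test above:

```python
from tinygrad import Tensor

out = Tensor.empty(256, 64)
bias = Tensor.empty(64)

next_in = out.sum(0) + bias       # late (elementwise) use of bias feeding the next layer
trust = (bias * bias).sum()       # early (reduced) use of bias, LARS-style norm
# check_schedule([next_in, trust], 2)
```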
```python
check_schedule([sum1, (x + sum1).sum()], 2)
del sum1

# super tricky crossing dag case
```
The (conservative) heuristic I am using is that this fusion should never add extra loads from bijective shapetrackers. If a shapetracker is bijective, then its size matches the full_shape of the kernel, and all non-bijective loads must be from smaller buffer(region)s. In the normal case, the non-bijective "small" buffers are from expands and are very small compared to the bijective ones (here it's 1/16), so adding these won't hurt. Here, fusing the diagonal will save 1 memory pass over a big buffer.
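A minimal sketch of that rule with a hypothetical load record (not tinygrad's ShapeTracker API): a load is "big" when it is bijective, i.e. its element count matches both the underlying buffer and the kernel's full_shape, and a fusion is accepted only if it adds no new big loads.

```python
from math import prod
from typing import NamedTuple, Tuple, Set

class Load(NamedTuple):            # hypothetical (buffer, shapetracker) read pair
  buf_id: int
  shape: Tuple[int, ...]           # shape the kernel indexes this buffer with
  buf_size: int                    # element count of the underlying buffer

def is_bijective(load: Load, full_shape: Tuple[int, ...]) -> bool:
  # a bijective load reads every buffer element exactly once, so all three sizes agree
  return prod(load.shape) == load.buf_size == prod(full_shape)

def fusion_adds_no_big_loads(existing: Set[Load], fused: Set[Load], full_shape) -> bool:
  # conservative rule: extra small (expand-style) loads are fine, extra big ones are not
  return all(not is_bijective(l, full_shape) for l in fused - existing)
```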
In fact, for simple reduces like these from bijective shapetrackers, it should be fine to fuse many unrelated reduces. Simple reduces don't really need a lot of cache -- the cache really helps when you have expands like (1, a) * (b, 1), since you can do an nm-sized tile with only n + m loads.
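A quick bit of arithmetic for that point, under made-up tile sizes: for an expand-style product (1, a) * (b, 1), an n x m output tile can be computed from n + m cached inputs instead of re-reading inputs per output element.

```python
n, m = 32, 32
per_element_loads = 2 * n * m   # 2048 loads if each of the n*m outputs re-reads both inputs
tiled_loads = n + m             # 64 loads if the tile keeps n row values and m column values in cache
print(per_element_loads, tiled_loads)
```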
I think this may even be a real-world case -- consider x and y to be the forward outputs of different layers.
If you fuse those targets, doesn't the cache fill up with a bunch of the "stuff" bufs? We wanna fuse if they're sharing parents.
We need to allow small "stuff"s (the bn backward takes some inputs from bn forward). See the argument for the bijective heuristic above.
Hm, I think one of these kernels has a superset of "stuffs" across the rest of the fusion targets. I think that makes it safe to not check the "stuffs" 🤔 Actually no, it doesn't, since one of the "stuffs" that only the superset kernel has could be a descendant of the rest of the fusion targets.
This reverts commit 7875b26.
This is cool. Can we get some of these tests merged? (even if they are disabled for now)
small example for easy inspection for now