Add support for context parallelism #174
base: v2
Conversation
I don't see any reason why this wouldn't work, but some parts might be less simple/natural than they could be. Approving so you can be unblocked for now.
for b in attn_buffers.values():
    mark_dynamic(b, 0)

with self._model_forward_context():
It looks like PP doesn't support CP yet. Maybe raise a config error in that case.
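A minimal sketch of that kind of guard, assuming hypothetical `pp_config` / `cp_config` fields on the train module config (the field names and exception type are illustrative, not the project's actual API):

```python
# Illustrative config-time guard: reject PP + CP until the combination is supported.
def validate_parallelism_config(train_module_config) -> None:
    pp = getattr(train_module_config, "pp_config", None)  # hypothetical field
    cp = getattr(train_module_config, "cp_config", None)  # hypothetical field
    if pp is not None and cp is not None:
        raise ValueError(
            "Pipeline parallelism (PP) does not support context parallelism (CP) yet; "
            "disable one of them."
        )
```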
@@ -560,9 +592,14 @@ def train_batch(self, batch: Dict[str, Any], dry_run: bool = False):

# Train one micro-batch at a time.
for micro_batch_idx, micro_batch in enumerate(micro_batches):
    with self._train_microbatch_context(micro_batch_idx, num_micro_batches):
        attn_buffers = self.model.get_attn_buffers(
Is there a need to pass the attention buffers back through `forward` in this manner? I thought the context parallelism would modify them in-place. I would prefer not doing this if it's not needed.
This API needs some improvement, but there are a couple of reasons I chose to pass these buffers through the `forward` method:
- It allows the `TransformerTrainModule` to mark their sizes as dynamic for `torch.compile()` (a rough sketch of this follows below). This is irrelevant to context parallelism, but it's a good thing to do in general. Without this I think we were getting some unnecessary re-compilations during in-loop evals.
- It guarantees we don't create duplicate versions of these buffers, which could happen if the `BufferCache` owned by each `Attention` module were a different instance. In our usual constructors we make sure the cache is shared, but if someone instantiated their model differently, the cache might not be shared.
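A minimal sketch of that first point, assuming the shared buffers come back as a dict of tensors (`get_attn_buffers` appears in the diff above, but its signature and the exact layout here are assumptions):

```python
import torch
from torch._dynamo import mark_dynamic

# Rough sketch, not the PR's exact code: fetch the model's shared attention buffers
# and mark their first dimension as dynamic so torch.compile() does not re-specialize
# (and recompile) when the buffer size changes, e.g. between training and in-loop evals.
def prepare_attn_buffers(model, *args) -> dict[str, torch.Tensor]:
    attn_buffers = model.get_attn_buffers(*args)  # hypothetical signature
    for buf in attn_buffers.values():
        mark_dynamic(buf, 0)  # dim 0 assumed to be the size that varies across batches
    return attn_buffers
```

Passing the same dict through `forward` then also means every `Attention` module sees the same tensor instances rather than per-module copies.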
Alright, refactored this so that the buffers are managed completely by the model.
Co-authored-by: Shane A <[email protected]>
To enable, run with `--train_module.cp_config='{degree: 2}'` (adjust the degree as needed). Probably don't try this with TP at the same time for now.
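For intuition, a small back-of-the-envelope sketch (not from this PR) of what the degree means for rank layout, assuming CP is the only parallelism besides data parallelism:

```python
# Illustrative arithmetic only: with CP degree 2, ranks are grouped in pairs that
# split each sequence between them, so the number of data-parallel replicas shrinks
# by the CP degree.
world_size = 8            # total GPUs (example value)
cp_degree = 2             # from --train_module.cp_config='{degree: 2}'
seq_len = 4096            # example sequence length

dp_size = world_size // cp_degree          # 4 data-parallel groups
tokens_per_cp_rank = seq_len // cp_degree  # each CP rank handles 2048 tokens

print(f"{dp_size=} {tokens_per_cp_rank=}")
```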