Learn2code | Review of the love2code configs #164
Didn't get all the way through. Will do more later.
return AdamWConfig(
    lr=4e-4,
    weight_decay=0.1,
    betas=(0.9, 0.95),
    group_overrides=[
        OptimGroupOverride(params=["embeddings.weight"], opts=dict(weight_decay=0.0))
    ],
    fused=True,
)
Does Granite have any guidelines here?
LRs are still a black art. 7B was fine with up to 12e-4, so I think you can go higher here. Let's do 12e-4 for this one as well. You're going to train these to the end before thinking about the results, right?
For the phase-1 pretraining, the learning rate follows a cosine schedule starting from 3 × 10⁻⁴ which decays to 3 × 10⁻⁵ with an initial linear warmup step of 2k iterations.
From Granite. So I guess we just replicate?
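For reference, the schedule described in the Granite quote could be sketched roughly as below. This is a minimal illustration, not the Granite or love2code implementation: the function name `cosine_lr`, the `total_steps` value, and the choice to warm up from zero are assumptions, since the quote only gives the peak, the final value, and the 2k-step warmup.

```python
import math


def cosine_lr(step, peak_lr=3e-4, final_lr=3e-5,
              warmup_steps=2000, total_steps=100_000):
    """Linear warmup followed by cosine decay.

    total_steps is a hypothetical placeholder; warmup starting at 0 is
    also an assumption, as the quoted text does not state either.
    """
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr over the warmup period.
        return peak_lr * step / warmup_steps
    # Fraction of the post-warmup phase completed, clamped at 1.
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    # Cosine interpolation from peak_lr (progress=0) to final_lr (progress=1).
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))
```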
Okay, switched to 12e-4, linearly decaying to 12e-5, with 2k steps of warmup.
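To make the agreed numbers concrete, here is a rough sketch of a schedule with 2k steps of linear warmup to 12e-4 followed by linear decay to 12e-5. The helper name `linear_lr`, the `total_steps` value, and warming up from zero are assumptions for illustration only. The actual config in this PR presumably uses the training library's own scheduler class rather than a hand-written function.

```python
def linear_lr(step, peak_lr=12e-4, final_lr=12e-5,
              warmup_steps=2000, total_steps=100_000):
    """Linear warmup to peak_lr, then linear decay to final_lr.

    total_steps is a hypothetical placeholder; the thread does not
    state the run length or the warmup starting value.
    """
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr over the warmup period.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped at 1.
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    # Straight-line interpolation from peak_lr down to final_lr.
    return peak_lr + (final_lr - peak_lr) * progress
```

With these defaults the rate ends at one tenth of the peak, matching the 10x ratio between start and end values in the Granite recipe.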
Actually, the rest of the script was just boilerplate, so I guess I'm done after all.
Dirk, help?