Building & training a transformer on transcripts of the first 325 episodes of the Lex Fridman Podcast so it can answer questions.
The text is encoded with byte pair encoding (BPE) to get a vocabulary of 1,000 tokens.
After encoding, the number of tokens is roughly 60% of the original character count. A sketch of how such a vocabulary can be trained follows the example below.
Here's an example of the encoding process:
tokens = encode("I think this is going to be awesome.")
>>>
tensor([360, 237, 153, 61, 158, 61, 158, 253, 194, 186, 280, 53, 75, 169,
67, 183, 11], device='cuda:0')
len("I think this is going to be awesome.") # 36
len(tokens) # 17
decode(tokens)
>>>
"I think this is going to be awesome."
It's not very good yet, but it can mimic some English (a sketch of the top-k sampling behind prompt_model follows the sample output below).
prompt = "What do you think about language models?"
answer = prompt_model(model, prompt, max_new_tokens=800, topk=2)
print(answer)
>>>
I think that the sort that.
But know?
And there's a lot one the because the but the comple to the of the somether and of comple
of of the because a look, the so the blange,
but I don't some the sort of an and that the be there any had the to,
but I'm unders to don't there there to the some of the sorther.
And that the some that the bractive,
but that.
But the because actory the be the because this to that start of the some the call the of the
and there's they're going the be exconce,
the same that the some to through an that and of it
of they're good, when the ARLOL the good the bedher a conver of of a conver the be of the see
of they're good on That think to, I don't going of,
the can the say, they like,
they they world, you can toper one of the becople
freed that the sorld?
Yeah, they
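With topk=2, only the two most likely next tokens are ever considered at each step, which partly explains the repetitive output. A hedged sketch of what a generation function like prompt_model can look like: the signature, block_size, and the assumption that the model returns raw logits are all illustrative.
import torch

@torch.no_grad()
def prompt_model(model, prompt, max_new_tokens=800, topk=2, block_size=256):
    # illustrative sketch: model is assumed to return logits of shape (batch, time, vocab_size)
    idx = encode(prompt).unsqueeze(0)                 # (1, T) token ids
    for _ in range(max_new_tokens):
        logits = model(idx[:, -block_size:])          # crop context to the last block_size tokens
        logits = logits[:, -1, :]                     # keep only the final position
        v, _ = torch.topk(logits, topk)
        logits[logits < v[:, [-1]]] = float("-inf")   # mask everything outside the top k
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)        # append the sampled token and continue
    return decode(idx[0])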
You can find my notes on the implementation details here: 🤖 Transformer blogpost.
The implementation is based on the "Attention Is All You Need" paper and the "Let's build GPT" tutorial by Andrej Karpathy.
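As an illustration of the core building block from those sources, here is a minimal single head of causal self-attention; the names and dimensions are illustrative, not necessarily the exact module used in this project.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionHead(nn.Module):
    """One head of masked (causal) self-attention, as in decoder-only GPT models."""
    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):                                         # x: (batch, time, n_embd)
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5       # scaled dot-product scores
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))  # causal mask
        wei = F.softmax(wei, dim=-1)
        return wei @ v                                            # (batch, time, head_size)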
The transcribed subtitles for the first 325 episodes of the Lex Fridman Podcast come from Andrej Karpathy's Lexicap project, which used OpenAI's Whisper model to transcribe them. I cleaned the data with some regular expressions to get one big corpus of text for training the transformer model.
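The exact cleanup steps aren't documented here, but a hypothetical example of this kind of regex-based cleaning (stripping WebVTT-style subtitle timestamps and collapsing whitespace) could look like this:
import re

def clean_transcript(raw):
    # hypothetical cleanup: drop timestamp markers like "00:01:23.456 --> 00:01:25.000"
    text = re.sub(r"\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}", " ", raw)
    text = re.sub(r"\s+", " ", text)                  # collapse runs of whitespace
    return text.strip()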
The model was trained for ~5 hours on a GPU.
Vaswani et al.: Attention Is All You Need - Link
Andrej Karpathy: Let's build GPT: from scratch, in code, spelled out - Link
Rasa: Rasa Algorithm Whiteboard - Transformers & Attention 1: Self Attention - Link
Thumbnail: Link
AI Coffee Break with Letitia: Positional embeddings in transformers EXPLAINED - Demystifying positional encodings - Link