Training and Prediction with Fine-Tuned LLMs on Replicate

This tutorial shows how to prepare data and fine-tune a language model on Replicate. Replicate is a commercial service, so we’ll use an API key and an authentication proxy server, much as with OpenAI.

1. Getting an API key from Replicate

At present, this tutorial uses credits generously made available to NYU by Replicate.

2. Setting up an authentication proxy server

Follow the same steps as documented for OpenAI, except that the repository URL is https://github.com/gohai/replicate-auth-proxy and the environment variable to create on Glitch is named REPLICATE_API_TOKEN.

3. Preparing the dataset

Open the Text to JSONL (Autocompleting model) sketch.
Navigate to input.txt in the sketch files, and paste your training data there.
Make sure to save the sketch at this point. This might take a short time.
Run the sketch. This should download a file dataset.jsonl to your Downloads folder.
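
For readers who want to see what such a conversion involves, here is a minimal Python sketch of it. It is not the code of the sketch above: the function name text_to_jsonl, the "text" field name, and the choice to emit one record per paragraph are all assumptions made for illustration, so check the output of the actual sketch before relying on this format.

```python
import json
import re


def text_to_jsonl(raw_text: str, field: str = "text") -> str:
    """Turn plain text into JSONL: one JSON object per non-empty paragraph.

    Assumptions (not taken from the tutorial's sketch): paragraphs are
    separated by blank lines, and each record has a single "text" field.
    """
    # Normalize Windows line endings so the paragraph split behaves the same.
    normalized = raw_text.replace("\r\n", "\n")
    # Split wherever one or more blank lines separate two paragraphs.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", normalized)]
    records = [{field: p} for p in paragraphs if p]
    # json.dumps escapes quotes and newlines, so each record stays on one line.
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


if __name__ == "__main__":
    sample = "First paragraph of training text.\n\nSecond paragraph."
    print(text_to_jsonl(sample))
```

To process a real file, read input.txt into a string, pass it to text_to_jsonl, and write the result to dataset.jsonl.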

4. Uploading the dataset to GitHub

Open the generated dataset.jsonl file in a text editor (e.g. VS Code), and copy its contents.
Go to https://gist.github.com/ to generate a new Gist (you might need to log in to GitHub for this, if you aren’t already).
Paste the content into the textbox, and set the filename to dataset.jsonl.
Click Create secret gist.
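
Before pasting the file into the Gist, it can be worth checking that every line of dataset.jsonl parses as a JSON object, since a copy-and-paste slip can silently break a line. The following is a minimal sketch of such a check; the helper name find_invalid_lines is hypothetical, and the check only covers JSON syntax, not whatever fields the training job expects.

```python
import json


def find_invalid_lines(jsonl_text: str) -> list[int]:
    """Return the 1-based numbers of lines that are not valid JSON objects."""
    bad = []
    # Split on "\n" only, since JSONL records are newline-delimited.
    for number, line in enumerate(jsonl_text.split("\n"), start=1):
        if not line.strip():
            continue  # ignore blank lines, e.g. after a trailing newline
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            bad.append(number)
            continue
        if not isinstance(record, dict):
            bad.append(number)  # valid JSON, but not an object
    return bad


if __name__ == "__main__":
    example = '{"text": "fine"}\nthis line is broken\n'
    print(find_invalid_lines(example))
```

An empty list means every non-blank line parsed as a JSON object.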