This tutorial shows how to prepare data and fine-tune a language model on Replicate. Replicate is a commercial service, so we’ll be using an API key and an authentication proxy server, much as with OpenAI.
At present, this uses credits generously made available to NYU by Replicate.
Follow the same steps as documented for OpenAI, except that the repository URL is https://github.com/gohai/replicate-auth-proxy and the environment variable to create on Glitch is named REPLICATE_API_TOKEN.
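To make the role of the environment variable more concrete, here is a minimal Python sketch of how a server-side program might read REPLICATE_API_TOKEN and attach it to a request to Replicate's HTTP API as a Bearer token. The function name replicate_auth_headers is hypothetical, and this is not the code of the proxy in the repository above, which may work differently.

```python
import os


def replicate_auth_headers(env=os.environ):
    """Build HTTP headers carrying the Replicate API token.

    Hypothetical helper for illustration only; the actual proxy may
    be implemented differently.
    """
    token = env.get("REPLICATE_API_TOKEN", "").strip()
    if not token:
        # Fail early so a missing variable is easy to diagnose.
        raise RuntimeError("REPLICATE_API_TOKEN is not set")
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
```

Keeping the token in an environment variable on the server means it never has to appear in the sketch code that runs in the browser.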
Open the Text to JSONL (Autocompleting model) sketch.
Navigate to input.txt in the sketch files, and paste your training data there.
Make sure to save the sketch at this point. Saving might take a moment.
Run the sketch. This should download a file dataset.jsonl to your Downloads folder.
Open the generated dataset.jsonl file in a text editor (e.g. VS Code), and copy its contents.
Go to https://gist.github.com/ to create a new Gist (you might need to log in to GitHub first, if you aren’t logged in already).
Paste the content into the textbox, and set the filename to dataset.jsonl.
Click Create secret gist.
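For readers curious what a text-to-JSONL conversion like the one in the steps above involves, here is a rough Python sketch. It rests on assumptions: that passages in input.txt are separated by blank lines, and that each output line is a JSON object with a single "text" field. The function names text_to_jsonl and validate_jsonl are hypothetical, and the actual sketch may split and format the text differently. The second function can be used to check a dataset.jsonl file before pasting it into a Gist.

```python
import json


def text_to_jsonl(text):
    """Turn blank-line-separated passages into JSONL.

    Assumption: one {"text": ...} object per line; the real sketch
    may chunk the input differently.
    """
    text = text.replace("\r\n", "\n")  # normalize Windows line endings
    records = []
    for chunk in text.split("\n\n"):
        passage = chunk.strip()
        if passage:
            records.append(json.dumps({"text": passage}, ensure_ascii=False))
    return "\n".join(records) + "\n" if records else ""


def validate_jsonl(jsonl):
    """Return the number of records; raise ValueError on a bad line."""
    count = 0
    for number, line in enumerate(jsonl.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate empty lines
        record = json.loads(line)  # JSONDecodeError is a ValueError subclass
        if not isinstance(record, dict) or not isinstance(record.get("text"), str):
            raise ValueError(f"line {number}: expected an object with a string 'text' field")
        count += 1
    return count


if __name__ == "__main__":
    sample = "First passage.\n\nSecond passage,\nspanning two lines."
    jsonl = text_to_jsonl(sample)
    print(jsonl, end="")
    print("records:", validate_jsonl(jsonl))
```

Because json.dumps escapes newlines inside a passage, every record stays on a single line, which is what makes the file valid JSONL.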