

Running LLM AIs on Your Own: Next Steps for the Mildly Interested Passerby
Poking and prodding them properly
Large language models like GPT-4 are pretty darn neat. I’m not going to make you the big pitch about how they’re going to entirely upend society (frankly I’d bet that they won’t, although it’s certainly not impossible). But I do think they’re going to be quite useful, and will make pretty heavy impacts on some things. So they’re worth getting into at least a little bit.
However, I think the way that many people have used them thus far is just to open up ChatGPT or Bing Chat, ask some questions, then post the funniest or most surprising responses online. This is well and good, but it’s only one window into how you can explore these strange lil black boxes, or use them to automate tasks for you.
I’ve recently been working on a project where I want to get responses to a whole long list of prompts from multiple different LLMs for research purposes. In doing so I’ve had to learn some things about how to use them, and have picked up some tricks. This is far from a post for the power user.
It’s not about writing an app that calls an LLM. It’s also not about prompt engineering to get the best responses from ChatGPT. This is for the basic user, who has used a chat-based version of an LLM before and who wants to take a single next step. All of this, by the way, will require some coding, but not too much. Also, I’m writing many of these things not long after learning them myself. Hopefully my relative-beginner status means I can explain these things without getting too technical. It also means please don’t ask me to troubleshoot for you.
This is a brief post about how you can (1) access OpenAI models (GPT-3.5 and GPT-4) programmatically, (2) fine-tune the kinds of responses you want, (3) embed additional information into OpenAI models, and (4) run other kinds of LLMs that don’t have public-access frontends.
1. Using OpenAI Programmatically
Want to ask GPT, like, a hundred questions? A thousand? Want to ask it just a few questions but adjust some of the controls I’ll talk about in Section 2? You probably don’t want to type each one into ChatGPT and copy/paste the answers back out.
Thankfully, OpenAI has an API you can use to iterate through all the prompts you like, and store the outputs.
You’ll need an API key to do this. So the first step here is to head into your OpenAI account and get an API key. At the moment of writing, you should be able to automatically get a GPT-3.5 key, but may have to get on a waiting list for a GPT-4 key. Subscribing to their paid service may also be necessary.
Once you’ve got the key, you’re good to go. Both R and Python (and some other languages and platforms) have openai
packages available on standard platforms for download. Once you’ve got them, it’s just a matter of putting in your API key and sending in some prompts. Send ‘em in a loop and store the output, and then do what you like with it.
The package has similar syntax and options in R and Python, as you might expect. I’ll discuss some of the options below. One thing I’ll point out is that you are not limited to ChatGPT-style chat responses (although those are what you get by asking for “chat completions”: create_chat_completion() in R and openai.ChatCompletion.create() in Python). Note that, at least by default, this won’t let you carry on a conversation like you typically can in a ChatGPT window. Each API call is a new conversation.
The API gives you access to a bunch of different OpenAI models, including those for image creation, translation, and speech-to-text.
Here’s an example of getting some chat completions. It uses the R version of the package because I’ll use that over Python when I can, and I won’t be able to (or at least it’s not nearly as supported) for most of my LLM experience.
library(openai)
Sys.setenv('OPENAI_API_KEY' = 'YOUR API KEY HERE')
my_prompts = c('Pass the peas.','Gimme some more.')
# model could also be 'gpt-4'
get_openai_chat_response = function(prompt, model = 'gpt-3.5-turbo') {
  resp = create_chat_completion(model = model,
                                messages = list(list(role = 'user',
                                                     content = prompt)))
  just_the_response = resp$choices$message.content
  return(just_the_response)
}
# loop over the prompts and get responses for each
responses = sapply(my_prompts, get_openai_chat_response)
# If you prefer it in a data.frame
responses_data_frame = data.frame(prompt = my_prompts,
                                  responses = responses)
responses
# Pass the peas.
# "I'm sorry, as an AI language model, I do not have actual peas to pass. But I can help you answer any questions or strike up a conversation if that would help!"
# Gimme some more.
# "More what? Please provide more context or information for me to provide a relevant response."
2. Fine-Tuning Responses
One illusion you can pick up from ChatGPT is the idea that there is such a thing as the LLM response to a given prompt. Like “I asked ChatGPT and here’s what LLMs say!” This is because ChatGPT does not let you adjust the model’s parameters or access its uncertainty in any obvious way. Instead, they’ve picked a single set of parameters that they think makes for a good chat experience.
There are a few things to know about getting LLM responses.
The first is that LLMs don’t just have one way of returning a response. They have different settings, for one. A prominent one you should know about is temperature, which is a setting you’ll find in a lot of LLMs. Temperature, roughly, is “how creative should the response be?”
When an LLM is predicting what the next word is, it generates a probability distribution of what word is likely to come next. What follows “I went to the” in the training data? There’s a decent chance that it’s “store” and a much lower chance that it’s “moon.” But the LLM will allow itself at random to pick options other than the most likely. A low-temperature text generation will bias itself heavily towards only the most-likely option, while a high-temperature text generation will pick low-probability options more often, making the responses more creative, but less likely to make sense or be “correct.”
You should think carefully about what temperature you might want for your application. Are good responses ones that really accurately map out the training data? Go low-temperature. Want something more creative and generative? High temperature.
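To make that concrete, here’s a toy illustration (with made-up numbers, not output from any real model) of how temperature reshapes a next-word distribution: the model’s raw scores get divided by the temperature before being turned into probabilities.
import numpy as np
# Hypothetical raw scores for the next word after "I went to the"
words = ['store', 'park', 'moon']
scores = np.array([4.0, 2.5, 0.5])
def next_word_probs(scores, temperature):
    # Divide the scores by the temperature, then softmax them into probabilities
    scaled = scores / temperature
    exps = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    return exps / exps.sum()
for t in [0.2, 1.0, 1.8]:
    print(t, dict(zip(words, next_word_probs(scores, t).round(3))))
# At temperature 0.2 nearly all the probability sits on "store";
# at 1.8, "moon" gets sampled a meaningful fraction of the time.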
The other thing to consider, along these lines, is that generation is probabilistic. You’ll get different responses to the same prompts by random chance alone. So when someone takes a ChatGPT screenshot of a single conversation they had and posts it to social media, that’s an indication of what ChatGPT can do, but it’s misrepresentative to say that that’s what it does. It does lots of different things!
If what you’re doing is trying to explore how LLMs perform and what they do, you’re likely going to want to get multiple responses to each prompt. Don’t limit yourself to N = 1 in your explorations. You wouldn’t do that in other analyses!
The OpenAI API makes it easy to both change temperature, and ask for multiple attempts at responding to the same prompt.
library(openai)
# For rbindlist
library(data.table)
Sys.setenv('OPENAI_API_KEY' = 'YOUR API KEY HERE')
my_prompt = 'What\'s a funny thing I can wear on my head?'
# model could also be 'gpt-4'
# temperature = 1 is the default. Higher = more creative, lower = less
get_openai_chat_response = function(prompt, model = 'gpt-3.5-turbo',
                                    temperature = 1, n_tries = 3) {
  resp = create_chat_completion(model = model,
                                temperature = temperature,
                                n = n_tries,
                                messages = list(list(role = 'user',
                                                     content = prompt)))
  just_the_responses = data.table(prompt = prompt,
                                  temperature = temperature,
                                  responses = resp$choices$message.content)
  print(just_the_responses)
  return(just_the_responses)
}
# Loop over the temperature settings and get responses for each
responses = c(.2, 1, 1.8) |>
  lapply(function(x) get_openai_chat_response(my_prompt, temperature = x)) |>
  rbindlist()
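And here’s the same temperature-and-multiple-tries idea in Python, again a sketch assuming the pre-1.0 openai package and its n argument to openai.ChatCompletion.create():
import openai
import pandas as pd
openai.api_key = 'YOUR API KEY HERE'
my_prompt = "What's a funny thing I can wear on my head?"
# temperature = 1 is the default; n asks for several responses in one call
def get_openai_chat_response(prompt, model='gpt-3.5-turbo', temperature=1, n_tries=3):
    resp = openai.ChatCompletion.create(
        model=model,
        temperature=temperature,
        n=n_tries,
        messages=[{'role': 'user', 'content': prompt}]
    )
    return pd.DataFrame({
        'prompt': prompt,
        'temperature': temperature,
        'responses': [choice['message']['content'] for choice in resp['choices']]
    })
# One batch of responses at each temperature setting
responses = pd.concat(
    get_openai_chat_response(my_prompt, temperature=t) for t in [0.2, 1, 1.8]
)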
3. Embedding Additional Information
LLMs are trained on a large set of text data. Often, though, you’ll want one to work with additional data of your own, like the documentation for your favorite software package or a long document you want summarized. That’s where embeddings come in. Embeddings are where you take your own set of text data and transform it into the numeric vector representation of that text that the LLM understands.
At that point, you can do things like query the text you’ve uploaded to produce a summary. Or, perhaps of special interest to researchers, you can perform cluster analysis or topic modeling on your LLM-generated embeddings, looking for patterns in your data set. It’s a different approach from traditional topic modeling, one that can take into account a much broader base of text than just your own data.
“Querying the text you’ve uploaded” is something I haven’t needed to do myself yet, so I don’t know how to do it, although there’s a rapidly growing set of consumer-facing products that claim to do this, so go hunting for something that does it for you.
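That said, the basic trick underneath most of those products is roughly: embed your query the same way you embed your documents, then find the stored chunks whose embedding vectors are closest to the query’s. Here’s a rough sketch of that idea (my own illustration, not code from any particular product; it reuses the get_embedding helper and text-embedding-ada-002 model from the code below and assumes you’ve set your API key):
import numpy as np
import openai
from openai.embeddings_utils import get_embedding
openai.api_key = "YOUR API KEY HERE"
# Your stored text chunks and their embeddings
chunks = ['I hated my meal.', 'I loved the meal!', 'French fries were bomb.']
chunk_embeddings = [np.array(get_embedding(c, engine='text-embedding-ada-002'))
                    for c in chunks]
# Embed the query the same way
query = 'Which review is about fries?'
query_embedding = np.array(get_embedding(query, engine='text-embedding-ada-002'))
def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
# The closest chunk is the one you'd hand to the LLM as context for its answer
scores = [cosine_similarity(query_embedding, e) for e in chunk_embeddings]
print(chunks[int(np.argmax(scores))])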
The openai package has a tool for creating embeddings, but you may want to wrap it in a tool that makes the whole process easier: LangChain. We’re fully in Python now. I stole and modified much of this code from somewhere, I believe the LangChain website:
# Load in LangChain and OpenAI tools
from langchain.agents import create_pandas_dataframe_agent
import tiktoken
from openai.embeddings_utils import get_embedding
import openai
# Standard data-processing
import numpy as np
import pandas as pd
# We'll do cluster analysis on these
from sklearn.cluster import KMeans
openai.api_key = "YOUR API KEY HERE"
# Use an OpenAI model specific to embedding
embedding_model = "text-embedding-ada-002"
embedding_encoding = "cl100k_base" # this is the encoding for text-embedding-ada-002
# Max token length, set very high for longer strings
max_tokens = 8000
# My text data to embed, in this case it's many small strings
# But could be one big long string
df = pd.DataFrame({'my_text': ['I hated my meal.', 'I loved the meal!',
                               'The meal was ok.', 'French fries were bomb.']})
# Get our text encoding
encoding = tiktoken.get_encoding(embedding_encoding)
# omit any strings too long to embed
# Based on our maximum token length
df["n_tokens"] = df.my_text.apply(lambda x: len(encoding.encode(x)))
df = df[df.n_tokens <= max_tokens]
# Actually embed the text
df["embedding"] = df.my_text.apply(lambda x: get_embedding(x, engine=embedding_model))
# Now to do cluster analysis on the results
# Granted, with four obs this won't mean much
matrix = np.vstack(df.embedding.values)
n_clusters = 2
kmeans = KMeans(n_clusters = n_clusters, init='k-means++', random_state=42)
kmeans.fit(matrix)
df['Cluster'] = kmeans.labels_
df[['my_text','n_tokens','Cluster']]
# my_text n_tokens Cluster
# 0 I hated my meal. 5 0
# 1 I loved the meal! 5 0
# 2 The meal was ok. 5 0
# 3 French fries were bomb. 5 1
The other neat thing about this is that you can also use AI to label the clusters you’ve identified, which text-analysis researchers (or factor analysis researchers, for that matter) will know is a pain sometimes.
# Pick this many random reviews from each cluster
# to determine what the cluster's theme is
# (note with n = 1 here this is indeed useless, but you can see the code)
rev_per_cluster = 1
# Use AI to label the clusters for us!
for i in range(n_clusters):
    print(f"Cluster {i} Theme:", end=" ")
    reviews = "\n".join(
        df[df.Cluster == i]
        .my_text
        .sample(rev_per_cluster, random_state=42)
        .values
    )
    response = openai.Completion.create(
        engine="text-davinci-003",
        prompt=f'What do the following customer reviews have in common?\n\nCustomer reviews:\n"""\n{reviews}\n"""\n\nTheme:',
        temperature=0,
        max_tokens=64,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )
    print(response["choices"][0]["text"].replace("\n", ""))
    sample_cluster_rows = df[df.Cluster == i].sample(rev_per_cluster, random_state=42)
    for j in range(rev_per_cluster):
        print(sample_cluster_rows.my_text.values[j][0:100], end="\n")
    print("-" * 55)
# Cluster 0 Theme: Unsatisfied customer
# I hated my meal.
# -------------------------------------------------------
# Cluster 1 Theme: Positive feedback about the French fries.
# French fries were bomb.
# -------------------------------------------------------
4. Running Other LLMs Locally
OpenAI and other chat-based models (some of which are built on OpenAI) take up a big part of the LLM discussion, but they are not the only game in town! There are many smaller LLMs that don’t have a public-facing chat interface and must be run on your own machine (or perhaps just can be run on your own machine). There’s LLaMA, for example. Much more interesting to me is the growing number of niche LLMs trained on specialized sets of data. In particular, it was a desire to use the BioMedLM and Galactica LLMs that led me to want to do this. These are trained on PubMed articles and on scientific articles generally, respectively. This kind of thing seems like the way to use LLMs if you want them to have some sort of specialized knowledge, rather than just popping open ChatGPT and hoping for the best (although, who knows!).
Many thanks, by the way, to Prodhi on Upwork for helping walk me through some of the finer details here.
Many of these models are built on top of the Hugging Face transformers architecture and package. And, as far as I can tell, they all provide handy example code that’s easy to adjust to ask for the completions you want. Just replace the prompts and hit run, right?
Right!
Well, except for the “run” part.
The way these models work, they typically have to first download the model, and then hold it in memory in order to run it.
No problem. Except these models are many gigabytes in size. Most computers can't run them, certainly not smoothly, at least not unless you've got a kickin' graphics card.
So let’s not run them on your computer. Instead, let’s use one of the many online services that let you rent space on a GPU cluster (quite cheaply, or for free), and have it run the LLM for you.
Our process is going to be this:
Take the example code that the LLM’s guide (hopefully) provides you.
Stick that code in a Jupyter notebook and modify it with the prompts/etc. that you want.
Rent an instance of a GPU cluster.
Upload your notebook, run it, and download the results.
I’ll show you some example code for some BioMedLM queries I ran (in a more complex version of the code below, each batch of ten responses took about 140 seconds). This started with their example code, adjusted for what I wanted. Note this does “completions”, not “chat completions”; i.e., it expects to finish the prompt you started, not to have a conversation with you.
Note that I’ve taken the code out of the Jupyter notebook and am just showing raw code since Substack doesn’t really have a way to present Jupyter:
!pip install pandas
!pip install transformers
# Load packages
import pandas as pd
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from datetime import datetime
# CUDA is an architecture for making use of NVIDIA graphics card GPUs
device = torch.device("cuda")
# Load model
tokenizer = GPT2Tokenizer.from_pretrained("stanford-crfm/BioMedLM")
model = GPT2LMHeadModel.from_pretrained("stanford-crfm/BioMedLM").to(device)
tokenizer.pad_token = tokenizer.eos_token #for gpt2 base models
# prompt: the text you want the model to complete
# temperature: how creative the responses should be (see Section 2)
# tries: number of responses to generate per prompt
def query_biomedlm(prompt, temperature, tries = 10):
    # Encode the prompt
    input_dat = tokenizer.encode(prompt, return_tensors = 'pt').to(device)
    # Get the response, adding on at most 200 new tokens to the response
    resp = model.generate(input_dat, do_sample = True, temperature = temperature,
                          max_new_tokens = 200, num_return_sequences = tries)
    # Pull out the response text
    resp = tokenizer.batch_decode(resp, skip_special_tokens=True)
    results = pd.DataFrame({
        'model': 'BioMedLM',
        'prompt': prompt,
        'temperature': temperature,
        'responses': resp
    })
    return results
# Just get some responses, this leaves out the process of for-looping through prompts which is easy enough to put in
query_biomedlm('Does cold medicine reduce cold length?', temperature = .2)
query_biomedlm('The most common causes of heart failure are', temperature = .2)
Note this uses CUDA, which only works on machines that have the appropriate NVIDIA graphics cards. Some other machines (like the free Paperspace option I mention below) don’t have these, and instead use Graphcore IPUs (Intelligence Processing Units), which I found to be a similar speed to the CUDA setup. The preamble code to run on an IPU instead would look like this:
!pip install pandas
!pip install optimum[onnxruntime]
# Load packages
import pandas as pd
import torch
from transformers import GPT2Tokenizer #, AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM
# Load model: BioMedLM
tokenizer = GPT2Tokenizer.from_pretrained("stanford-crfm/BioMedLM")
model = ORTModelForCausalLM.from_pretrained("prodm93/BioMedLM_IPU")
That’s our code, then. What can we actually run that code on? In my limited experience I know of two solid places.
The first is Paperspace, which is of particular interest because you can use it for free. If you tell it to launch a Notebook project and select the Graphcore option for running Hugging Face transformers on an IPU, it will offer you the FREE-IPU-POD4 machine. You can run your notebook on this for up to six hours before it cuts you off (and then you can do it again on a different machine). So if your job isn’t huge, this is a great option. Paperspace also offers paid machine rentals (a couple bucks an hour) without the time restriction.
What if six hours doesn’t do it? Another option is vast.ai, which is what I ended up using for my project (which would have taken longer than six hours). This doesn’t have a free option, but you can pick from a wide range of pretty cheap ($.20-$.50/hour) machines. If you go to Edit Image and Configuration, you can ask for a PyTorch setup with a Jupyter frontend. Add some additional disk space, then run the instance!
In both cases, you can then upload your notebook, run it, and add a line to write out the responses you get to a CSV, which you can then download. Done!
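For example, if you collected your responses into a pandas DataFrame (as the query_biomedlm() function above does), the write-out step is just a couple of extra lines at the bottom of the notebook. The variable and file names here are my own arbitrary choices:
import pandas as pd
# Gather the results from each prompt into one DataFrame...
all_results = pd.concat([
    query_biomedlm('Does cold medicine reduce cold length?', temperature = .2),
    query_biomedlm('The most common causes of heart failure are', temperature = .2),
])
# ...and write them out as a CSV you can download from the notebook interface
all_results.to_csv('responses.csv', index=False)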
Have fun
There’s lots to check out. Who knows what will turn out to be useful in all of this, but it seems likely that something will be. It’s both cool to know the ins and outs for its own sake, and heck maybe you’ll find something really useful to do with all of it.
And “explore these strange little black boxes” really is a good way to think about it. These systems aren’t magic and they’re not an oracle. They’re probabilistic autocomplete. A really good probabilistic autocomplete, when they’re good. The surprising thing is that probabilistic autocomplete turns out to be actually a pretty general tool that can solve a lot of problems and mimic some behavior we might have expected to require deeper human-level cognition.
That said, I haven’t been wildly impressed by any of these yet. So (a) I doubt I could do better anyway, so why bother, and (b) this will probably improve later; give it some time.
There are finer details here and ways around this, like making use of hard drive space, partitioning models to make them less of a strain, and so on. I won’t even attempt to go into all that detail here.