The pg_tiktoken extension
Efficiently tokenize data in your Postgres database using OpenAI's `tiktoken` library
The pg_tiktoken extension enables fast and efficient tokenization of data in your Postgres database using OpenAI's tiktoken library.
This topic provides guidance on installing the extension, utilizing its features for tokenization and token management, and integrating the extension with ChatGPT models.
What is a token?
Language models process text in units called tokens. A token can be as short as a single character or as long as a complete word, such as "a" or "apple." In some languages, tokens may comprise less than a single character or even extend beyond a single word.
For example, consider the sentence "Neon is serverless Postgres." It can be divided into seven tokens: ["Ne", "on", "is", "server", "less", "Post", "gres"].
pg_tiktoken offers two functions:
tiktoken_encode: Accepts text inputs and returns tokenized output, allowing you to seamlessly tokenize your text data.
tiktoken_count: Counts the number of tokens in a given text. This feature helps you adhere to text length limits, such as those set by OpenAI's language models.
You can install the pg_tiktoken extension by running the following CREATE EXTENSION statement in the Neon SQL Editor or from a client such as psql that is connected to Neon.
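A minimal form of the statement, assuming the extension is available on your Postgres instance. The `IF NOT EXISTS` clause is optional and only makes the statement safe to run more than once:

```sql
-- Install the extension in the current database.
CREATE EXTENSION IF NOT EXISTS pg_tiktoken;
```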
The tiktoken_encode function tokenizes text input and returns a tokenized output. The function accepts an encoding name or an OpenAI model name as the first argument and the text you want to tokenize as the second argument, as shown:
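A minimal sketch of such a call, using the `cl100k_base` encoding. The sample sentence is illustrative only, and the result is expected to be an array of token IDs, one per token:

```sql
-- First argument: encoding name (or OpenAI model name).
-- Second argument: the text to tokenize.
SELECT tiktoken_encode('cl100k_base', 'A long time ago in a galaxy far, far away');
```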
The function tokenizes text using the Byte Pair Encoding (BPE) algorithm.
The tiktoken_count function counts the number of tokens in a text. The function accepts an encoding name or an OpenAI model name as the first argument and the text as the second argument, as shown:
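A minimal sketch of such a call, again with the `cl100k_base` encoding and an illustrative sentence. The `n_tokens` alias is only a label chosen for this example:

```sql
-- Returns the number of tokens the text would occupy under the given encoding.
SELECT tiktoken_count('cl100k_base', 'A long time ago in a galaxy far, far away') AS n_tokens;
```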
The tiktoken_count and tiktoken_encode functions accept both encoding names and OpenAI model names as the first argument.
The following models are supported:
| Encoding name | OpenAI models |
|---|---|
| cl100k_base | ChatGPT models, text-embedding-ada-002 |
| p50k_base | Code models, text-davinci-002, text-davinci-003 |
| p50k_edit | Edit models such as text-davinci-edit-001, code-davinci-edit-001 |
| r50k_base (or gpt2) | GPT-3 models such as davinci |
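As a quick way to see the two kinds of first argument side by side, the sketch below counts the same sentence once by encoding name and once by model name. If `text-embedding-ada-002` resolves to `cl100k_base` as the table indicates, both columns should report the same count. The column aliases are illustrative:

```sql
SELECT
    -- Selected by encoding name
    tiktoken_count('cl100k_base', 'Neon is serverless Postgres') AS by_encoding,
    -- Selected by OpenAI model name
    tiktoken_count('text-embedding-ada-002', 'Neon is serverless Postgres') AS by_model;
```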
Integrate pg_tiktoken with ChatGPT models
The pg_tiktoken extension allows you to store chat message history in a Postgres database and retrieve messages that comply with OpenAI's model limitations.
For example, consider the message table below:
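The table definition is not reproduced here, so the following is a plausible sketch rather than the exact schema. It assumes one column each for the sender role, the message content, and the token count, plus a hypothetical `created` timestamp for ordering the history:

```sql
CREATE TABLE message (
    role     VARCHAR(50) NOT NULL,             -- 'system', 'user', or 'assistant'
    content  TEXT NOT NULL,                    -- the message text
    created  TIMESTAMP NOT NULL DEFAULT NOW(), -- assumed column, used to order the history
    n_tokens INTEGER                           -- number of tokens in content
);
```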
The gpt-3.5-turbo chat model requires specific parameters:
The messages parameter is an array of message objects, each containing two pieces of information: the role of the message sender (system, user, or assistant) and the actual message content. Conversations can be brief, with just one message, or span multiple pages, as long as the combined message tokens do not exceed the 4096-token limit.
To insert a message's role, content, and number of tokens into the database, use the following query:
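A sketch of such an insert, assuming a `message` table with `role`, `content`, and `n_tokens` columns. The message text is illustrative, and `cl100k_base` is used because it is the encoding listed for ChatGPT models:

```sql
INSERT INTO message (role, content, n_tokens)
VALUES (
    'user',
    'Hello, how are you?',
    -- Count the tokens at insert time so the total can be summed cheaply later.
    tiktoken_count('cl100k_base', 'Hello, how are you?')
);
```

Storing the count alongside the message avoids re-tokenizing the whole history on every request.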
Manage text tokens
When a conversation contains more tokens than a model can process (e.g., over 4096 tokens for gpt-3.5-turbo), you will need to truncate the text to fit within the model's limit.
Additionally, lengthy conversations may result in incomplete replies. For example, if a gpt-3.5-turbo conversation spans 4090 tokens, the response will be limited to just six tokens.
The following query retrieves messages up to your desired token limits:
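One way to write such a query is with a running total over the stored token counts, newest message first. This is a sketch under assumptions: it relies on a `message` table with `n_tokens` and a `created` timestamp column, and the literal `3990` is only an example value standing in for `<MAX_HISTORY_TOKENS>`:

```sql
WITH history AS (
    SELECT role, content, created, n_tokens,
           -- Running token total, accumulated from the newest message backwards
           SUM(n_tokens) OVER (ORDER BY created DESC) AS cumulative_tokens
    FROM message
)
SELECT role, content, created, n_tokens, cumulative_tokens
FROM history
WHERE cumulative_tokens <= 3990  -- replace with your <MAX_HISTORY_TOKENS>
ORDER BY created;                -- return the kept messages in chronological order
```

Accumulating from the newest message keeps the most recent part of the conversation and drops the oldest messages once the budget is used up.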
<MAX_HISTORY_TOKENS> represents the conversation history you want to keep for chat completion, following this formula:
For example, assume the desired completion length is 100 tokens.
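The budget arithmetic can be sketched as follows, assuming the history budget is the model limit minus the tokens reserved for a system prompt and for the completion. The 4096 and 100 values come from the text above. The 6-token system prompt and the three column names are hypothetical, chosen for illustration:

```sql
SELECT model_max_tokens - num_system_tokens - num_completion_tokens AS max_history_tokens
FROM (VALUES (4096, 6, 100))
     AS budget(model_max_tokens, num_system_tokens, num_completion_tokens);
-- 4096 - 6 - 100 = 3990
```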
In conclusion, the pg_tiktoken extension is a valuable tool for tokenizing text data and managing tokens within Postgres databases. By leveraging OpenAI's tiktoken library, it simplifies tokenization and working with token limits, enabling you to integrate more easily with OpenAI's language models.
As you explore the capabilities of the pg_tiktoken extension, we encourage you to provide feedback and suggest features you'd like to see added in future updates. We look forward to seeing the innovative natural language processing applications you create using pg_tiktoken.