Announcing pg_tiktoken: A Postgres Extension for Fast BPE Tokenization
Analyze and Process Text Data in Neon with pg_tiktoken's Fast BPE Tokenization
We’re excited to announce the release of the
pg_tiktoken extension on Neon. This new Postgres extension provides fast and efficient tokenization using the BPE (Byte Pair Encoding) algorithm.
pg_tiktoken is a wrapper around OpenAI’s tokenizer, known for its speed and performance in handling natural language processing tasks.
pg_tiktoken solves the problem of tokenizing text data within a Postgres database. The
tiktoken_encode function allows you to tokenize text inputs and returns a tokenized output, making it easier to analyze and process text data for various applications. The
tiktoken_count function enables users to return the number of tokens in a text, which is useful for checking text length limits, like those imposed by OpenAI’s language models.
What are text tokens
Language models process text in chunks known as tokens. In English, a token can range from a single character to a complete word such as “a” or “apple.” In certain languages, tokens can even be shorter than one character or longer than one word.
For instance, the sentence “Neon is Serverless Postgres” is divided into seven tokens: [“Ne”, “on”, ” is”, ” Server”, “less”, ” Post”, “gres”].
Get started with
pg_tiktoken, you can tokenize text data within your Postgres database, making it easier to analyze and process the data for various applications. The extension supports various text inputs, including multiple languages and special characters, and is optimized for speed and efficiency.
To start with
pg_tiktoken, you need to install the extension:
Once the extension is installed, you can use the pg_tiktoken function within your queries to tokenize text data within your database. The function takes a text input and returns a tokenized output:
This will tokenize the input text using the BPE algorithm and return the tokenized output.
You can also return the number of tokens in a text using the `tiktoken_count` function:
tiktoken_encode functions accept both encoding and OpenAI model names as the first argument:
Here is the list of supported models:
|Encoding name||OpenAI models|
|ChatGPT models, |
|Code models, |
|Use for edit models like |
|GPT-3 models like davinci|
pg_tiktoken with the ChatGPT model
You can persist the message history in Postgres and query the database to get messages that fit within OpenAI’s model limits.
Here is an example of the “message” table:
gpt-3.5-turbo model requires the following parameters:
The “messages” parameter consists of an array of message objects. Each object within the array should contain two key pieces of information: the role of the message sender (either “system,” “user,” or “assistant”) and the actual message content. Conversations can be brief, consisting of just one message, or they may continue for multiple pages as long as the sum of the message tokens is within the 4096 limit.
You can use the following query to insert role, content, and the number of tokens into the database:
Manage your text tokens
When you have a conversation with more tokens than a model can handle (like more than 4096 tokens for
gpt-3.5-turbo), you’ll need to shorten your text to fit the limit. But be careful. If you remove a message from the conversation, the model will not know about it anymore.
Also, you might receive incomplete replies if your conversation is too long. For instance, if your
gpt-3.5-turbo conversation is 4090 tokens long, you’ll only get a reply of 6 tokens.
The query below allows you to get messages up to your desired token limits:
`MAX_HISTORY_TOKENS` represents the conversation history you want to keep for the chat completion and follow the following formula:
MAX_HISTORY_TOKENS = MODEL_MAX_TOKENS – NUM_SYSTEM_TOKENS – NUM_COMPLETION_TOKENS
Let’s use the example seen above and assume that the desired completion length is 100 tokens (`NUM_COMPLETION_TOKENS=90`).
MAX_HISTORY_TOKENS = 4096 – 6 – 90 = 4000
In summary, the
pg_tiktoken extension provides fast and efficient tokenization using the BPE algorithm within Postgres, making it easier to analyze and process text data for various applications. We explored how to use the
tiktoken_count function to query the database and retrieve messages that fit within OpenAI’s model limits and how to avoid failing API calls.
What is your use case for the
pg_tiktoken extension and AI? We would love to hear your feedback and learn about features you would like to see added to the
pg_tiktoken in future updates.
We look forward to seeing the natural language processing applications you will build with