šŸ‡¦šŸ‡ŗ Hello Australia! Sydney region is officially live on Neon.Sign Up
Engineering

Parsing JSON from Postgres in JS? Donā€™t get the wrong number

Introducing json-custom-numbers: a conformant, performant JavaScript JSON library

Post image

Back in 2010, my first production Postgres database used a couple of JSON columns to provide some flexibility around the main schema. From the point of view of the database, those columns were actually justĀ text. But Postgres got a native JSON type soon after (with version 9.2Ā in 2012), andĀ its JSON supportĀ has become steadily more powerful since then.

As you probably know, you can now use Postgres not just to store and retrieve JSON, but also to build, transform, index and query it. My own TypeScript/Postgres library uses Postgresā€™s JSON functionsĀ to create and return handy nested structures out of lateral joins. This all represents a valuableĀ hybrid of relational and NoSQLĀ database capabilities.

Trouble with numbers

JSON, of course, has its origins in JavaScript. But thereā€™s a potential problem when we use JSON to communicate between Postgres and JavaScript.

JavaScript has one kind of number: anĀ IEEE 754Ā float64. Postgres, of course, has many kinds. Some of these, likeĀ bigintĀ orĀ numeric, can represent larger and/or more precise numbers than aĀ float64.

JavaScript Postgres drivers typically parse these large or precise values into strings. For example:

await { rows } = pool.query('SELECT (1e16 + 1)::bigint AS big');
 // -> [{ big: '10000000000000001' }]

That leaves you to choose how to deal with them in your code. In this case, youā€™d probably pass the stringified PostgresĀ bigintĀ value to JavaScriptā€™sĀ BigInt().

Now: what if Postgres were to return that sameĀ bigintĀ to JavaScript as a JSON value?

TheĀ JSON specĀ ā€œallows implementations to set limits on the range and precision of numbers acceptedā€, and notes ā€œpotential interoperability problemsā€ when numbers larger or more precise than aĀ float64Ā are used. But it sets no limits of its own on what can be parsed or serialized. So we canā€™t really blame Postgres for representing aĀ bigintĀ in JSON as an appropriately long sequence of digits. Itā€™s not obvious what else it could be expected to do.

This long JSON number value from Postgres then gets parsed with JavaScriptā€™sĀ JSON.parse()Ā and, if itā€™s bigger than JavaScriptā€™sĀ Number.MAX_SAFE_INTEGERĀ (or more negative thanĀ Number.MIN_SAFE_INTEGER), bad things happen.

await { rows } = pool.query('SELECT to_json((1e16 + 1)::bigint) AS big'); 
// -> [{ big: 10000000000000000 }]

Compare the last two results above. Thatā€™s right: without any warning, the number we got out of the second query is not the same number Postgres sent.

Imagine this was theĀ idĀ value of a table row. Well, now itā€™s theĀ idĀ value of a different table row.

[Sinister music plays].

The solution: custom JSON parsing

A solution to this nastiness is to get hold of a custom JSON parser that can handle big numbers, and to tell your Postgres driver to use it. For bothĀ node-postgresĀ andĀ @neondatabase/serverless, that looks like this:

import { types } from '@neondatabase/serverless'; // or from 'pg' 
function myJSONParse(json) { /* implementation */ } 
types.setTypeParser(types.builtins.JSON, myJSONParse); 
types.setTypeParser(types.builtins.JSONB, myJSONParse);

(You might have thought that you could useĀ theĀ reviverĀ argument to nativeĀ JSON.parse()Ā to avoid implementing a complete JSON parser. Sadly, you canā€™t: by the time the function you supply as theĀ reviverĀ sees a number, itā€™s already been parsed to a JavaScriptĀ float64, and the damage has been done).

As I see it, there are three key things weā€™re going to want from a custom JSON parser:

  1. First, conformance: to avoid any surprises or complications, it should be a perfect drop-in replacement forĀ JSON.parse(). That means the same API and, critically, the same result for every input (aside from the numbers it handles better).
  2. Second, performance: itā€™s never going to match the optimised C++ of nativeĀ JSON.parse(), but it should be the fastest gosh-darn JavaScript implementation we can come up with. In some common contexts (such as an API that mediates between Postgres and a website or app) it may haveĀ a lotĀ of data flowing through it, and CPU cycles mean time, electricity and money.
  3. And third, flexibility: when it comes across a large number (or indeed any number) in the JSON input, it should give us the chance to deal with it however we want. That could mean using aĀ BigInt, a string, orĀ some other number library.

So: weā€™re looking for a conformant, performant JSON parser that can deal flexibly with large numbers.

Searching npm turns up two candidate packages:Ā json-bigintĀ andĀ lossless-json. Are they up to the job?

Conformance and performance testing

Behaving the same way asĀ JSON.parse()Ā means our custom JSON parser should throw errors on the same documents, and return the same parsed results for the rest. So we need a set of well-chosen JSON documents, including all the edge cases we can think of, to test against. Happily, theĀ JSON Parsing Test SuiteĀ has our back here, with hundreds of test files of valid, invalid, and ambiguous (by the spec) JSON.

Assessing performance againstĀ JSON.parse()Ā will also call for one or more JSON documents we can test against. Exactly what to use here is a judgment call, but certainly we want to benchmark the parsing of a wide range of JSON values.

Here, Iā€™ve plumped for: long strings (such as a blog post or a product description); short strings (such as object keys); strings full of backslash escapes (likeĀ \u03B1Ā andĀ \n); long numbers (such as high-resolution latitudes and longitudes); short numbers (such as an id or count); andĀ true,Ā falseĀ andĀ null. Iā€™ve combined these values into objects and arrays, so that we also capture speed on the two JSON container types.

For a headline comparison, Iā€™ve then brought all these types together into one large object:Ā { "longStrings": ..., "shortStrings": ..., ... }.

The final piece of the puzzle is: how do we run the performance tests? Performance benchmarking JavaScript seems to have gone way out of fashion in recent years. jsperf.com is long since defunct.Ā benchmark.jsĀ (which powered it) hasnā€™t had a commit in five years, and consequently doesnā€™t even know aboutĀ performance.now().

Iā€™ve therefore put together a simple head-to-head performance function of my own. It evaluatesĀ performance.now()Ā timer resolution, estimates how many iterationsĀ NĀ of the provided functions are needed to get an accurate reading, and then runs 100 trials ofĀ NĀ iterations each. Finally, it plots a simple histogram to compareĀ operations/secondĀ in the two cases, and calculates an appropriate statistic (theĀ Mann-Whitney U) to establish whether the two distributions are significantly different.

json-bigint

First up: json-bigint. The widgets below tell the full story.

For conformance, the summary is that json-bigint correctly parses all valid documents, except those that are very deeply nested. Very deeply nested structures overflow the call stack of its recursive implementation.

json-bigint is then significantly more lax in what else it accepts thanĀ JSON.parse(). It permits numbers in various illegal formats (such asĀ .1,Ā 1.,Ā 01), isnā€™t bothered by unescaped newlines or invalid Unicode escapes in strings, and allows all sorts (character codes 0 ā€“ 32) as whitespace.

For performance, the headline number is that itā€™s 6 ā€“ 11Ɨ slower thanĀ JSON.parse()Ā on my mixed JSON test document, depending on the browser and wind direction.

Regarding flexibility, json-bigint offers various options, but not the one I really want, which is simply to allow me to supply a custom number-parsing function.

lossless-json

Next: lossless-json. How does it compare?

Conformance-wise, lossless-jsonā€™s big thing is that it throws errors on duplicate object keys. It calls this a feature and, to be fair, itā€™s fully in line with its ā€œlosslessā€ branding. But itā€™s also a definite point of difference fromĀ JSON.parse().

Like json-bigint, and for the same reason, lossless-json fails on deeply nested structures. Elsewhere, itā€™s not as lax as json-bigint, but itā€™s still a touch more relaxed thanĀ JSON.parse()Ā on number formats, allowing a leading decimal point with no zero (.1).

Regarding performance, lossless-json does a bit better than json-bigint, with a headline factor of 4 ā€“ 7Ɨ slower thanĀ JSON.parse.

Finally, lossless-json scores points on flexibility by taking a custom number-parsing function as one of its options.

Can we do better?

Overall, neither package exactly matches the behaviour ofĀ JSON.parse(), and neither seems blisteringly quick. Donā€™t think Iā€™m looking a gift-horse in the mouth here. Iā€™m grateful to the maintainers of both packages for doing the hard work of making useful code and documentation available for free.

But we can do better on all three criteria I set out above: conformance, performance, and flexibility:

  • We can, of course, choose to fully match the behaviour ofĀ JSON.parse(), and to provide wholly customisable number parsing.
  • Less obviously, we can also improve performance substantially.

Presenting: json-custom-numbers

To cut to the chase:Ā json-custom-numbersĀ is a conformant, performant, flexible new custom JSON parser (and stringifier too).

Todayā€™s take-home message is: if you need custom parsing of numbers in JSON, useĀ json-custom-numbers. It is (I believe) a perfect match for the behaviour of nativeĀ JSON.parse(), and itā€™s usually only 1.5 ā€“ 3Ɨ slower, which is substantially quicker than the alternatives.

Speed varies according to the JavaScript engine and what youā€™re parsing, so there are someĀ more detailed comparisons in the project README.

To use json-custom-numbers with Neonā€™s serverless driver (or node-postgres) to parse PostgresĀ bigintĀ values as JavaScriptĀ BigIntĀ values, you can do this:

import { types } from '@neondatabase/serverless';  // or from 'pg'
import { parse } from 'json-custom-numbers';

function parseJSONWithBigInts(str) {
  return parse(str, undefined, function (k, str) {
    const n = +str;
    const safe = n >= Number.MIN_SAFE_INTEGER && n <= Number.MAX_SAFE_INTEGER;
    return safe || /[.eE]/.test(str) ? n : BigInt(str);
  });
}

types.setTypeParser(types.builtins.JSON, parseJSONWithBigInts);
types.setTypeParser(types.builtins.JSONB, parseJSONWithBigInts);

This code sample usesĀ BigIntĀ only for integers that donā€™t fit in an ordinary number value. That means a PostgresĀ bigintĀ value can end up as either an ordinary number or aĀ BigInt, depending on its size. For sanity, youā€™ll probably want to ensure that anything thatĀ mightĀ be aĀ BigIntĀ is treated as one, by subsequently manually converting it:Ā BigInt(bigintValueFromPostgres).

This is a fine place to stop reading. Carry on if youā€™d like me to point out a few things I learned in the process of writing and optimising the library.

What I learned

Sticky RegExps rock

Discovery number one is thatĀ ā€˜stickyā€™ RegExpsĀ plus theĀ RegExpĀ test()Ā methodĀ are a parser-writerā€™s best friend.

A sticky RegExp is one created with theĀ yĀ flag. It has aĀ lastIndexĀ property. You can setĀ lastIndexĀ to the string index where you want your RegExp to begin matching. RegExp methods likeĀ test()Ā then setĀ lastIndexĀ to the index where matching ended.

The json-custom-numbersĀ parse()Ā function parses all primitive values (strings, numbers,Ā true,Ā false, andĀ null) using sticky RegExps. This gives a major performance boost compared to the other implementations, which step through string input character-by-character.

Experiment

Itā€™s an obvious point, but thereā€™s no substitute for running experiments and seeing whatā€™s quicker.

For example,Ā function inlining is a well-known optimizationĀ applied by language compilers of all stripes,Ā including JavaScript engines like V8. You might therefore think that manually inlining would have little performance impact. But some empirical testing showed that inlining functions to read the next character and to skip whitespace ā€” which my original recursive parsing code had inherited fromĀ Crockfordā€™s reference implementationĀ ā€” led to overall performance gains of 10 ā€“ 20%.

As another example, I had an idea that switching from processing single-character strings (extracted withĀ charAt()) to processing integer character codes (extracted withĀ charCodeAt()) might speed things up in some of the places sticky RegExps couldnā€™t help. Experimentation showed this was true, but the scale of the gains is strongly dependent on the JavaScript engine. The change reduced parsing time by about 10% in Safari (JavaScriptCore), 20% in Chrome (V8), and over 50% in Firefox (SpiderMonkey).

CodeĀ remembers

Probably the nastiest and most maddening thing I learned is that JavaScript code has memory! It matters how much your code has been run already. It also mattersĀ what input itā€™s seen.

JavaScript engines optimise code progressively, as they discover which functions are ā€˜hotā€™ and where the potential optimisation gains might be highest. Optimisation depends heavily on data types and code paths, and code can also be de-optimised if assumptions made by the engine turn out false. I knew this in principle, but I hadnā€™t thought through the implications for benchmarking.

This issue reared its head when I was trying to optimiseĀ \uXXXXĀ (Unicode) escape parsing code. In Safari, every approach I could think of benchmarked worse than what Iā€™d started with, which was essentially the Crockford reference implementation. I was surprised by this.

I eventually resorted to benchmarking Crockford against Crockford ā€” and found that one copy of an identical implementation was significantly slower than the other (pĀ < 0.001). I then realised that my parsing conformance tests involve lots of invalid JSON input, throwing repeated errors in every possible location.

Being exposed to the tests therefore appears to reduce the performance of any particular parsing code. Skipping the prior conformance check (or running it on a different copy of the same code) could turn 20% slower into 10% faster when I then tested performance differences.

You can actually see this effect in action using the conformance/performance widget pairs further up this page. For any pair, youā€™ll generally find that the performance figure is substantially better if youĀ havenā€™tĀ tested conformance since page load than if you have.

The good news is that if youā€™re using json-custom-numbers to parse JSON thatā€™s coming back from Postgres, everything it sees should be valid JSON, and performance will be best-case.

Writing aids thinking

I didnā€™tĀ planĀ to throw one away. But in the end it was writing about the code that led me to do just that.

I thought Iā€™d finished the package, Iā€™d already written most of this post, and I was in the middle of claiming that the json-custom-numbers parser perfectly matches nativeĀ JSON.parse()Ā behaviour. A caveat the occurred to me, and duly wrote a section about how my implementation was recursive, meaning that really-unusually-deeply-nested JSON structures would overflow the call stack.

Seeing it written down, and attempting to justify it, this then seemed kind of lame: if you can see the problem, why not fix it? So I went back and rewrote the parser asĀ a nice big state machineĀ instead. Since this was slightly slower than the recursive implementation had been, I then wrote a slightly fasterĀ final non-recursive implementation.

You canā€™t nativelyĀ stringifyĀ everything you canĀ parse

The major JavaScript engines all now have non-recursiveĀ JSON.parse()Ā implementations. For example,Ā V8 became non-recursive in 2019.

So I was surprised to discover (after writing non-recursive implementations for bothĀ parseĀ andĀ stringify) that the nativeĀ JSON.stringify()Ā implementations still appear to be recursive. Given a deeply-enough nested structure,Ā JSON.stringify()Ā will give youĀ RangeError: Maximum call stack size exceededĀ orĀ InternalError: too much recursion.

This means there are values ofĀ nĀ for whichĀ let deepObj = JSON.parse('['.repeat(n) + ']'.repeat(n))Ā succeeds, butĀ let deepJSON = JSON.stringify(deepObj)Ā then fails. The smallest value ofĀ nĀ where this happens indicates your JavaScript engineā€™s call stack size (today, on my laptop, that smallestĀ nĀ seems to be 3,375 for Firefox, 3,920 for Chrome, 4,639 for Node, and 40,001 for Safari or Bun).

You might argue that this is a feature, in that it prevents a circular reference leading to an infinite loop. (Circular references are usually detected, but certainĀ replacerĀ functions can thwart this: for example,Ā let obj = {}; obj.obj = obj; JSON.stringify(obj)Ā gets you a complaint about the circular reference, butĀ JSON.stringify(obj, (k, v) => [v])Ā on the same object overflows the call stack instead).

Anyway, for json-custom-numbers I decided to keep my non-recursiveĀ stringifyĀ implementation, but also to provide aĀ maxDepthĀ option for bothĀ stringifyĀ andĀ parse. ForĀ stringify,Ā maxDepthĀ defaults to 50,000 ā€” a bit higher than you get in Safari and Bun ā€” but can be set to anything up toĀ Infinity. ForĀ parse, it defaults toĀ Infinity, which matches the native implementations and means you can go as deep as available memory allows.

The code behind this post is atĀ https://github.com/jawj/perfcompare/.