[PATCH] json_lex_string: don't overread on bad UTF8

  From: jacob.champion@enterprisedb.com
  Date: 2024-04-30T17:39:25+00:00
Hi all,

When json_lex_string() hits certain types of invalid input, it calls pg_encoding_mblen_bounded(), which assumes that its input is null-terminated and calls strnlen(). But the JSON lexer is constructed with an explicit string length, and we don't ensure that the string is null-terminated in all cases, so we can walk off the end of the buffer. This isn't really relevant on the server side, where you'd have to get a superuser to help you break string encodings, but for client-side usage on untrusted input (such as my OAuth patch) it would be more important.

Attached is a draft patch that explicitly checks against the end-of-string pointer and clamps the token_terminator to it. Note that this removes the only caller of pg_encoding_mblen_bounded(), and I'm not sure what we should do with that function. It seems like a reasonable API, just not here.

The new test needs to record two versions of the error message, one for invalid token and one for invalid escape sequence. This is because, for smaller chunk sizes, the partial-token logic in the incremental JSON parser skips the affected code entirely when it can't find an ending double-quote.

Tangentially: should we maybe rethink pieces of the json_lex_string() error handling? For example, do we really want to echo an incomplete multibyte sequence once we know it's bad? It also looks like there are places where the FAIL_AT_CHAR_END macro is called after the `s` pointer has already advanced past the code point of interest. I'm not sure whether that's intentional.

Thanks,
--Jacob
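
A minimal sketch of the clamping idea described above (illustrative only, not the attached patch; the helper name clamped_char_end and its placement are assumptions). The point is that pg_encoding_mblen() determines the character's length from its first byte alone, so computing the tentative end never reads past the buffer, and the clamp keeps token_terminator from pointing past it either:

    #include "mb/pg_wchar.h"		/* pg_encoding_mblen() */

    /*
     * Hypothetical helper: given the start of a (possibly invalid) multibyte
     * character "s" and the end of the lexer's input "end", return where
     * token_terminator should point when reporting an error at that
     * character, without ever pointing beyond the caller-supplied buffer.
     */
    static const char *
    clamped_char_end(int encoding, const char *s, const char *end)
    {
        const char *term = s + pg_encoding_mblen(encoding, s);

        return (term <= end) ? term : end;
    }
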