Importing CSV with a json array of strings in one of the column

All we need is an easy explanation of the problem, so here it is.

The columns of the CSV are of the format text (can include empty lines), date and a JSON array of strings (something like ['a', 'b', 'c']. I’ve been trying to copy that CSV to a PostgreSQL table (using psycopg2‘s copy_expert, which simply executes the given SQL COPY command, if that matters)

Table is created with

CREATE TABLE posts(
    id SERIAL PRIMARY KEY,
    text TEXT NOT NULL,
    created_date TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    rubrics jsonb[] NOT NULL
);

and the copy command is

COPY posts(text, created_date, rubrics)
FROM STDIN
WITH CSV HEADER

where STDIN is the CSV file.

The error I get is

malformed array literal: "['a', 'b', 'c']"
DETAIL:  "[" must introduce explicitly-specified array dimensions.
CONTEXT:  COPY posts, line 15, column rubrics: "['a', 'b', 'c']"

I’ve tried all 4 JSON-related data types (json and jsonb with and without [] or [3]), including square brackets produces the error above, while omitting them (rubrics jsonb NOT NULL when creating) gives a new one:

invalid input syntax for type json
DETAIL:  Token "'" is invalid.
CONTEXT:  JSON data, line 1: ['...
COPY posts, line 15, column rubrics: "['a', 'b', 'c']"

Do I have any recourse besides manually fixing the .csv to use {} instead of [] before copying it? It feels like I do, yet I haven’t really been able to found anything besides a couple somewhat, but not fully relevant questions.

Update regarding the comments

I’ve changed the table creation to

CREATE TABLE posts(
    id SERIAL PRIMARY KEY,
    text TEXT NOT NULL,
    created_date TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    rubrics jsonb NOT NULL
);

and left a single entry in the CSV to test, so it looks like this

text,created_date,rubrics
"Lorem

Ipsum

Test",2019-07-25 12:42:13,'["f", "o", "o"]'

Things I’ve tried and errors I got

Using the CSV as is:

 extra data after last expected column
CONTEXT:  COPY posts, line 6: ""Lorem

Ipsum

Test",2019-07-25 12:42:13,'["f", "o", "o"]'"

An additional pair of "" around the array ("'["f", "o", "o"]'")

 invalid input syntax for type json
DETAIL:  Token "'" is invalid.
CONTEXT:  JSON data, line 1: '...
COPY posts, line 6, column rubrics: "'[f, o, o]'"

No quotes at all (just the ["f", "o", "o"])

extra data after last expected column
CONTEXT:  COPY posts, line 6: ""Lorem

Ipsum

Test",2019-07-25 12:42:13,["f", "o", "o"]"

Double quotes instead of single quotes ("["f", "o", "o"]")

invalid input syntax for type json
DETAIL:  Token "f" is invalid.
CONTEXT:  JSON data, line 1: [f...
COPY posts, line 6, column rubrics: "[f, o, o]"

Double quotes outside, single quotes inside ("['f', 'o', 'o']")

invalid input syntax for type json
DETAIL:  Token "'" is invalid.
CONTEXT:  JSON data, line 1: ['...
COPY posts, line 6, column rubrics: "['f', 'o', 'o']"

Could it be an issue with the Python library I’m using after all?

How to solve :

I know you bored from this bug, So we are here to help you! Take a deep breath and look at the explanation of your problem. We have many solutions to this problem, But we recommend you to use the first method because it is tested & true method that will 100% work for you.

Method 1

JSON is a layer of encapsulation with a standardized syntax that the JSON field in your input must respect.

CSV is another layer of encapsulation that applies on top, with a syntax that is more of less standardized in rfc 4180. As mentioned in the RFC, programs that read or write CSV do not always conform strictly to all these rules, but concerning the quoting rules, Postgres is conformant.

JSON uses double quotes around string literals, so all the attempts in the question to use single quotes around strings are invalid for the JSON parser

CSV requires that double quotes are doubled, see rules #5 and #7 of rfc 4180, so all attempts to use double quotes in JSON without doubling them for CSV or without enclosing the entire field in double quotes are invalid for the CSV parser.

  1. Each field may or may not be enclosed in double quotes (however
    some programs, such as Microsoft Excel, do not use double quotes
    at all). If fields are not enclosed with double quotes, then
    double quotes may not appear inside the fields
  1. If double-quotes are used to enclose fields, then a double-quote
    appearing inside a field must be escaped by preceding it with
    another double quote

Your example slightly edited to conform to both CSV and JSON does work:

CREATE TABLE posts(
    id SERIAL PRIMARY KEY,
    text TEXT NOT NULL,
    created_date TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    rubrics jsonb NOT NULL
);

COPY posts(text, created_date, rubrics)
FROM STDIN WITH CSV;
foo,2022-05-17,"[""a"", ""b"", ""c""]"
\.

select * from posts;
 id | text |    created_date     |     rubrics     
----+------+---------------------+-----------------
  1 | foo  | 2022-05-17 00:00:00 | ["a", "b", "c"]

If you wanted a Postgres array of JSON objects in rubrics, that would be more complicated, as expressing the Postgres array as text requires another layer of encapsulation with its own syntax rules, so the production of valid data to import with COPY would have to do

JSON quoting -> postgres array quoting -> CSV quoting

But as seen in your example, it seems that you don’t need a Postgres array of JSON objects as a column, but rather a single JSON value that contains a JSON array.

Note: Use and implement method 1 because this method fully tested our system.
Thank you 🙂

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

Leave a Reply