相关文章推荐
风流倜傥的麦片  ·  Dependency ...·  1 年前    · 
很酷的手套  ·  python ...·  1 年前    · 
卖萌的佛珠  ·  WinForm 的checkbox控件 ...·  1 年前    · 
Collectives™ on Stack Overflow

Find centralized, trusted content and collaborate around the technologies you use most.

Learn more about Collectives

Teams

Q&A for work

Connect and share knowledge within a single location that is structured and easy to search.

Learn more about Teams

I need to insert multiple rows with one query (number of rows is not constant), so I need to execute query like this one:

INSERT INTO t (a, b) VALUES (1, 2), (3, 4), (5, 6);

The only way I know is

args = [(1,2), (3,4), (5,6)]
args_str = ','.join(cursor.mogrify("%s", (x, )) for x in args)
cursor.execute("INSERT INTO t (a, b) VALUES "+args_str)

but I want some simpler way.

I built a program that inserts multiple lines to a server that was located in another city.

I found out that using this method was about 10 times faster than executemany. In my case tup is a tuple containing about 2000 rows. It took about 10 seconds when using this method:

args_str = ','.join(cur.mogrify("(%s,%s,%s,%s,%s,%s,%s,%s,%s)", x) for x in tup)
cur.execute("INSERT INTO table VALUES " + args_str) 

and 2 minutes when using this method:

cur.executemany("INSERT INTO table VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s)", tup)
                Still very relevant almost two years later. An experience today suggests that as the number of rows you want to push increases, the better it is to use the execute strategy. I saw speedup of around 100x thanks to this!
– Rob Watts
                Jan 22, 2014 at 21:16
                Perhaps executemany runs a commit after each insert. If you instead wrap the whole thing in a transaction, maybe that would expedite things?
– Richard
                Apr 12, 2015 at 17:06
                Just confirmed this improvement myself. From what I've read psycopg2's executemany doesn't do anything optimal, just loops and does many execute statements. Using this method, a 700 row insert to a remote server went from 60s to <2s.
– Nelson
                Apr 27, 2015 at 23:22
                Maybe I'm being paranoid, but concatenating the query with a + seems like it could open up to sql injection, I feel like @Clodoaldo Neto execute_values() solution is safer.
– Will Munn
                Jan 17, 2018 at 11:55
                in case someone encounters the following error: [TypeError: sequence item 0: expected str instance, bytes found] run this command instead [args_str = ','.join(cur.mogrify("(%s,%s)", x).decode("utf-8") for x in tup)]
– mrt
                Sep 3, 2018 at 23:08

New execute_values method in Psycopg 2.7:

data = [(1,'x'), (2,'y')]
insert_query = 'insert into t (a, b) values %s'
psycopg2.extras.execute_values (
    cursor, insert_query, data, template=None, page_size=100

The pythonic way of doing it in Psycopg 2.6:

data = [(1,'x'), (2,'y')]
records_list_template = ','.join(['%s'] * len(data))
insert_query = 'insert into t (a, b) values {}'.format(records_list_template)
cursor.execute(insert_query, data)

Explanation: If the data to be inserted is given as a list of tuples like in

data = [(1,'x'), (2,'y')]

then it is already in the exact required format as

  • the values syntax of the insert clause expects a list of records as in

    insert into t (a, b) values (1, 'x'),(2, 'y')

  • Psycopg adapts a Python tuple to a Postgresql record.

    The only necessary work is to provide a records list template to be filled by psycopg

    # We use the data list to be sure of the template length
    records_list_template = ','.join(['%s'] * len(data))
    

    and place it in the insert query

    insert_query = 'insert into t (a, b) values {}'.format(records_list_template)
    

    Printing the insert_query outputs

    insert into t (a, b) values %s,%s
    

    Now to the usual Psycopg arguments substitution

    cursor.execute(insert_query, data)
    

    Or just testing what will be sent to the server

    print (cursor.mogrify(insert_query, data).decode('utf8'))
    

    Output:

    insert into t (a, b) values (1, 'x'),(2, 'y')
                    Here's a gist with a benchmark.  copy_from scales to about 6.5X faster on my machine with 10M records.
    – Joseph Sheedy
                    Jul 13, 2016 at 0:28
                    Looks nice - I think you have a stray , at the end of your initial definition of insert_query (unless you were trying to make it a tuple?) and are missing a s after the % for %s also in the initial definition of insert_query.
    – deadcode
                    May 18, 2017 at 1:22
                    using execute_values I was able to get my system running at 1k records a minute up to 128k records a minute
    – Conrad.Dean
                    Nov 15, 2018 at 15:17
                    @Phillipp that's normal with every execute statement, unless you're in auto commmit mode.
    – Chris
                    Sep 3, 2020 at 10:17
    

    Update with psycopg2 2.7:

    The classic executemany() is about 60 times slower than @ant32 's implementation (called "folded") as explained in this thread: https://www.postgresql.org/message-id/20170130215151.GA7081%40deb76.aryehleib.com

    This implementation was added to psycopg2 in version 2.7 and is called execute_values():

    from psycopg2.extras import execute_values
    execute_values(cur,
        "INSERT INTO test (id, v1, v2) VALUES %s",
        [(1, 2, 3), (4, 5, 6), (7, 8, 9)])
    

    Previous Answer:

    To insert multiple rows, using the multirow VALUES syntax with execute() is about 10x faster than using psycopg2 executemany(). Indeed, executemany() just runs many individual INSERT statements.

    @ant32 's code works perfectly in Python 2. But in Python 3, cursor.mogrify() returns bytes, cursor.execute() takes either bytes or strings, and ','.join() expects str instance.

    So in Python 3 you may need to modify @ant32 's code, by adding .decode('utf-8'):

    args_str = ','.join(cur.mogrify("(%s,%s,%s,%s,%s,%s,%s,%s,%s)", x).decode('utf-8') for x in tup)
    cur.execute("INSERT INTO table VALUES " + args_str)
    

    Or by using bytes (with b'' or b"") only:

    args_bytes = b','.join(cur.mogrify("(%s,%s,%s,%s,%s,%s,%s,%s,%s)", x) for x in tup)
    cur.execute(b"INSERT INTO table VALUES " + args_bytes) 
                    Thanks, the updated answer works good. Please, don't forget to conn.commit() to persist changes.
    – RicHincapie
                    Oct 17, 2021 at 21:52
                    execute_many may no longer be the dog it once was: as of 3.1, it uses pycopg's pipeline mode to batch commands into a Postgresql single send/receive sequence.
    – Erik Knowles
                    May 8 at 18:05
    

    cursor.copy_from is the fastest solution I've found for bulk inserts by far. Here's a gist I made containing a class named IteratorFile which allows an iterator yielding strings to be read like a file. We can convert each input record to a string using a generator expression. So the solution would be

    args = [(1,2), (3,4), (5,6)]
    f = IteratorFile(("{}\t{}".format(x[0], x[1]) for x in args))
    cursor.copy_from(f, 'table_name', columns=('a', 'b'))
    

    For this trivial size of args it won't make much of a speed difference, but I see big speedups when dealing with thousands+ of rows. It will also be more memory efficient than building a giant query string. An iterator would only ever hold one input record in memory at a time, where at some point you'll run out of memory in your Python process or in Postgres by building the query string.

    Here is a benchmark comparing copy_from/IteratorFile with a query builder solution. copy_from scales to about 6.5X faster on my machine with 10M records. – Joseph Sheedy Sep 30, 2016 at 17:49 Hi, still relevant after 5 years !! I am new to Python and Postgres and I am trying to find out the most efficient and fastest way to load data (extracted from XML) to Postgres. So I found @Joseph Sheedy's answer and want to try and test it. But I couldn't understand where or how should the function def readline(self): be used because I didn't find any reference to the same in the code ? Can someone give a complete usage example please. Pls excuse my naivety. Thanks. – marie20 Jul 2, 2022 at 5:06 @marie20, readline is called implicitly by copy_from. Here's the docs: psycopg.org/docs/cursor.html#cursor.copy_from – Joseph Sheedy Jul 13, 2022 at 16:16

    A last item I would like to show you is how to insert multiple rows using a dictionary. If you had the following:

    namedict = ({"first_name":"Joshua", "last_name":"Drake"},
                {"first_name":"Steven", "last_name":"Foo"},
                {"first_name":"David", "last_name":"Bar"})
      

    You could easily insert all three rows within the dictionary by using:

    cur = conn.cursor()
    cur.executemany("""INSERT INTO bar(first_name,last_name) VALUES (%(first_name)s, %(last_name)s)""", namedict)
    

    It doesn't save much code, but it definitively looks better.

    This will run many individual INSERT statements. Useful, but not the same as a single multi-VALUEd insert. – Craig Ringer Apr 9, 2013 at 2:26 And in the same document it is written The cur.executemany statement will automatically iterate through the dictionary and execute the INSERT query for each row. – sp1rs Jan 21, 2021 at 5:09

    Security vulnerabilities

    As of 2022-11-16, the answers by @Clodoaldo Neto (for Psycopg 2.6), @Joseph Sheedy, @J.J, @Bart Jonk, @kevo Njoki, @TKoutny and @Nihal Sharma contain SQL injection vulnerabilities and should not be used.

    The fastest proposal so far (copy_from) should not be used either because it is difficult to escape the data correctly. This is easily apparent when trying to insert characters like ', ", \n, \, \t or \n.

    The author of psycopg2 also recommends against copy_from:

    copy_from() and copy_to() are really just ancient and incomplete methods

    The fastest method

    The fastest method is cursor.copy_expert, which can insert data straight from CSV files.

    with open("mydata.csv") as f:
        cursor.copy_expert("COPY mytable (my_id, a, b) FROM STDIN WITH csv", f)
    

    copy_expert is also the fastest method when generating the CSV file on-the-fly. For reference, see the following CSVFile class, which takes care to limit memory usage.

    import io, csv
    class CSVFile(io.TextIOBase):
        # Create a CSV file from rows. Can only be read once.
        def __init__(self, rows, size=8192):
            self.row_iter = iter(rows)
            self.buf = io.StringIO()
            self.available = 0
            self.size = size
        def read(self, n):
            # Buffer new CSV rows until enough data is available
            buf = self.buf
            writer = csv.writer(buf)
            while self.available < n:
                    row_length = writer.writerow(next(self.row_iter))
                    self.available += row_length
                    self.size = max(self.size, row_length)
                except StopIteration:
                    break
            # Read requested amount of data from buffer
            write_pos = buf.tell()
            read_pos = write_pos - self.available
            buf.seek(read_pos)
            data = buf.read(n)
            self.available -= len(data)
            # Shrink buffer if it grew very large
            if read_pos > 2 * self.size:
                remaining = buf.read()
                buf.seek(0)
                buf.write(remaining)
                buf.truncate()
            else:
                buf.seek(write_pos)
            return data
    

    This class can then be used like:

    rows = [(1, "a", "b"), (2, "c", "d")]
    cursor.copy_expert("COPY mytable (my_id, a, b) FROM STDIN WITH csv", CSVFile(rows))
    

    If all your data fits into memory, you can also generate the entire CSV data directly without the CSVFile class, but if you do not know how much data you are going to insert in the future, you probably should not do that.

    f = io.StringIO()
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)
    f.seek(0)
    cursor.copy_expert("COPY mytable (my_id, a, b) FROM STDIN WITH csv", f)
    

    Benchmark results

  • 914 milliseconds - many calls to cursor.execute
  • 846 milliseconds - cursor.executemany
  • 362 milliseconds - psycopg2.extras.execute_batch
  • 346 milliseconds - execute_batch with page_size=1000
  • 265 milliseconds - execute_batch with prepared statement
  • 161 milliseconds - psycopg2.extras.execute_values
  • 127 milliseconds - cursor.execute with string-concatenated values
  • 39 milliseconds - copy_expert generating the entire CSV file at once
  • 32 milliseconds - copy_expert with CSVFile
  • All of these techniques are called 'Extended Inserts" in Postgres terminology, and as of the 24th of November 2016, it's still a ton faster than psychopg2's executemany() and all the other methods listed in this thread (which i tried before coming to this answer).

    Here's some code which doesnt use cur.mogrify and is nice and simply to get your head around:

    valueSQL = [ '%s', '%s', '%s', ... ] # as many as you have columns.
    sqlrows = []
    rowsPerInsert = 3 # more means faster, but with diminishing returns..
    for row in getSomeData:
            # row == [1, 'a', 'yolo', ... ]
            sqlrows += row
            if ( len(sqlrows)/len(valueSQL) ) % rowsPerInsert == 0:
                    # sqlrows == [ 1, 'a', 'yolo', 2, 'b', 'swag', 3, 'c', 'selfie' ]
                    insertSQL = 'INSERT INTO "twitter" VALUES ' + ','.join(['(' + ','.join(valueSQL) + ')']*rowsPerInsert)
                    cur.execute(insertSQL, sqlrows)
                    con.commit()
                    sqlrows = []
    insertSQL = 'INSERT INTO "twitter" VALUES ' + ','.join(['(' + ','.join(valueSQL) + ')']*len(sqlrows))
    cur.execute(insertSQL, sqlrows)
    con.commit()
    

    But it should be noted that if you can use copy_from(), you should use copy_from ;)

    Bringing up from the dead, but what happens in the situation of the last few rows? I assume you actually run that final clause again on the last remaining rows, in the case you have an even number of rows? – mcpeterson Nov 23, 2016 at 23:03 Correct, sorry I must have forgotten to do that when i wrote the example - thats pretty stupid of me. Not doing so wouldn't have given people an error, which makes me worried how many people copy/pasted the solution and went about their business..... Anyway, very grateful mcpeterson - thank you! – J.J Nov 24, 2016 at 1:49

    I've been using ant32's answer above for several years. However I've found that is thorws an error in python 3 because mogrify returns a byte string.

    Converting explicitly to bytse strings is a simple solution for making code python 3 compatible.

    args_str = b','.join(cur.mogrify("(%s,%s,%s,%s,%s,%s,%s,%s,%s)", x) for x in tup) 
    cur.execute(b"INSERT INTO table VALUES " + args_str)
    

    executemany accept array of tuples

    https://www.postgresqltutorial.com/postgresql-python/insert/

        """ array of tuples """
        vendor_list = [(value1,)]
        """ insert multiple vendors into the vendors table  """
        sql = "INSERT INTO vendors(vendor_name) VALUES(%s)"
        conn = None
            # read database configuration
            params = config()
            # connect to the PostgreSQL database
            conn = psycopg2.connect(**params)
            # create a new cursor
            cur = conn.cursor()
            # execute the INSERT statement
            cur.executemany(sql,vendor_list)
            # commit the changes to the database
            conn.commit()
            # close communication with the database
            cur.close()
        except (Exception, psycopg2.DatabaseError) as error:
            print(error)
        finally:
            if conn is not None:
                conn.close()
    

    The cursor.copyfrom solution as provided by @jopseph.sheedy (https://stackoverflow.com/users/958118/joseph-sheedy) above (https://stackoverflow.com/a/30721460/11100064) is indeed lightning fast.

    However, the example he gives are not generically usable for a record with any number of fields and it took me while to figure out how to use it correctly.

    The IteratorFile needs to be instantiated with tab-separated fields like this (r is a list of dicts where each dict is a record):

        f = IteratorFile("{0}\t{1}\t{2}\t{3}\t{4}".format(r["id"],
            r["type"],
            r["item"],
            r["month"],
            r["revenue"]) for r in records)
    

    To generalise for an arbitrary number of fields we will first create a line string with the correct amount of tabs and field placeholders : "{}\t{}\t{}....\t{}" and then use .format() to fill in the field values : *list(r.values())) for r in records:

            line = "\t".join(["{}"] * len(records[0]))
            f = IteratorFile(line.format(*list(r.values())) for r in records)
    

    complete function in gist here.

    Another nice and efficient approach - is to pass rows for insertion as 1 argument, which is array of json objects.

    E.g. you passing argument:

    [ {id: 18, score: 1}, { id: 19, score: 5} ]
    

    It is array, which may contain any amount of objects inside. Then your SQL looks like:

    INSERT INTO links (parent_id, child_id, score) 
    SELECT 123, (r->>'id')::int, (r->>'score')::int 
    FROM unnest($1::json[]) as r 
    

    Notice: Your postgress must be new enough, to support json

    If you're using SQLAlchemy, you don't need to mess with hand-crafting the string because SQLAlchemy supports generating a multi-row VALUES clause for a single INSERT statement:

    rows = []
    for i, name in enumerate(rawdata):
        row = {
            'id': i,
            'name': name,
            'valid': True,
        rows.append(row)
    if len(rows) > 0:  # INSERT fails if no rows
        insert_query = SQLAlchemyModelName.__table__.insert().values(rows)
        session.execute(insert_query)
                    Under the hood SQLAlchemy uses psychopg2's executemany() for calls like this and therefore this answer will have severe performance issues for large queries. See the execute method docs.sqlalchemy.org/en/latest/orm/session_api.html.
    – sage88
                    Jan 5, 2017 at 22:48
                    I don't think that's the case. It's been a bit since I looked at this, but IIRC, this is actually building a single insertion statement in the insert_query line. Then, session.execute() is just calling psycopg2's execute() statement with a single massive string. So the "trick" is building the entire insertion statement object first. I am using this to insert 200,000 rows at a time and saw massive performance increases using this code compared to the normal executemany().
    – Jeff Widman
                    Jan 6, 2017 at 1:13
                    The SQLAlchemy doc you linked to has a section that shows exactly how this works and even says: "It is essential to note that passing multiple values is NOT the same as using traditional executemany() form". So it is explicitly calling out that this works.
    – Jeff Widman
                    Jan 6, 2017 at 1:23
                    I stand corrected. I didn't notice your usage of the values() method (without it SQLAlchemy just does executemany). I would say edit the answer to include a link to that doc so that I can change my vote, but obviously you've already included it. Perhaps mention that this is not the same thing as calling an insert() with execute() with a list of dicts?
    – sage88
                    Jan 6, 2017 at 6:15
    

    From @ant32

    def myInsertManyTuples(connection, table, tuple_of_tuples):
        cursor = connection.cursor()
            insert_len = len(tuple_of_tuples[0])
            insert_template = "("
            for i in range(insert_len):
                insert_template += "%s,"
            insert_template = insert_template[:-1] + ")"
            args_str = ",".join(
                cursor.mogrify(insert_template, x).decode("utf-8")
                for x in tuple_of_tuples
            cursor.execute("INSERT INTO " + table + " VALUES " + args_str)
            connection.commit()
        except psycopg2.Error as e:
            print(f"psycopg2.Error in myInsertMany = {e}")
            connection.rollback()
    

    psycopg2 2.9.3

    data = "(1, 2), (3, 4), (5, 6)"
    query = "INSERT INTO t (a, b) VALUES {0}".format(data)
    cursor.execute(query)
    
    data = [(1, 2), (3, 4), (5, 6)]
    data = ",".join(map(str, data))
    query = "INSERT INTO t (a, b) VALUES {0}".format(data)
    cursor.execute(query)
    

    If you want to insert multiple rows within one insert statemens (assuming you are not using ORM) the easiest way so far for me would be to use list of dictionaries. Here is an example:

     t = [{'id':1, 'start_date': '2015-07-19 00:00:00', 'end_date': '2015-07-20 00:00:00', 'campaignid': 6},
          {'id':2, 'start_date': '2015-07-19 00:00:00', 'end_date': '2015-07-20 00:00:00', 'campaignid': 7},
          {'id':3, 'start_date': '2015-07-19 00:00:00', 'end_date': '2015-07-20 00:00:00', 'campaignid': 8}]
    conn.execute("insert into campaign_dates
                 (id, start_date, end_date, campaignid) 
                  values (%(id)s, %(start_date)s, %(end_date)s, %(campaignid)s);",
    

    As you can see only one query will be executed:

    INFO sqlalchemy.engine.base.Engine insert into campaign_dates (id, start_date, end_date, campaignid) values (%(id)s, %(start_date)s, %(end_date)s, %(campaignid)s);
    INFO sqlalchemy.engine.base.Engine [{'campaignid': 6, 'id': 1, 'end_date': '2015-07-20 00:00:00', 'start_date': '2015-07-19 00:00:00'}, {'campaignid': 7, 'id': 2, 'end_date': '2015-07-20 00:00:00', 'start_date': '2015-07-19 00:00:00'}, {'campaignid': 8, 'id': 3, 'end_date': '2015-07-20 00:00:00', 'start_date': '2015-07-19 00:00:00'}]
    INFO sqlalchemy.engine.base.Engine COMMIT
                    Showing logging from the sqlalchemy engine is NOT a demonstration of only running a single query, it just means that the sqlalchemy engine ran one command. Under the hood this is using psychopg2's executemany which is very inefficient. See the execute method docs.sqlalchemy.org/en/latest/orm/session_api.html.
    – sage88
                    Jan 5, 2017 at 23:06
    

    The Solution am using can insert like 8000 records in 1 millisecond

    curtime = datetime.datetime.now()
    postData = dict()
    postData["title"] = "This is Title Text"
    postData["body"] = "This a Body Text it Can be Long Text"
    postData['created_at'] = curtime.isoformat()
    postData['updated_at'] = curtime.isoformat()
    data = []
    for x in range(8000):
        data.append(((postData)))
    vals = []
    for d in postData:
        vals.append(tuple(d.values())) #Here we extract the Values from the Dict
    flds = ",".join(map(str, postData[0]))
    tableFlds =  ",".join(map(str, vals))
    sqlStr = f"INSERT INTO posts ({flds}) VALUES {tableFlds}"
    db.execute(sqlStr)
    connection.commit()
    rowsAffected = db.rowcount
    print(f'{rowsAffected} Rows Affected')
                    Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.
    – CheTesta
                    Oct 6, 2022 at 14:07
    

    Finally in SQLalchemy1.2 version, this new implementation is added to use psycopg2.extras.execute_batch() instead of executemany when you initialize your engine with use_batch_mode=True like:

    engine = create_engine(
        "postgresql+psycopg2://scott:tiger@host/dbname",
        use_batch_mode=True)
    

    http://docs.sqlalchemy.org/en/latest/changelog/migration_12.html#change-4109

    Then someone would have to use SQLalchmey won't bother to try different combinations of sqla and psycopg2 and direct SQL together..

    tup = [(gid, pid) for pid in items] args_str = ",".join([str(s) for s in tup]) # insert into group values (1, 10), (1, 11), (1, 12), (1, 13) yield from cur.execute("INSERT INTO group VALUES " + args_str) This method is not safe from SQL injections. As psycopg2 documentation states (that aiopg2 links to): 'Never, never, NEVER use Python string concatenation (+) or string parameters interpolation (%) to pass variables to a SQL query string. Not even at gunpoint.' – Michał Pawłowski Sep 10, 2015 at 16:13

    Thanks for contributing an answer to Stack Overflow!

    • Please be sure to answer the question. Provide details and share your research!

    But avoid

    • Asking for help, clarification, or responding to other answers.
    • Making statements based on opinion; back them up with references or personal experience.

    To learn more, see our tips on writing great answers.

  •