3 Data-Backed Ways To Significantly Speed Up Your MySQL Bulk Inserts

See what the data says about the best ways to optimize MySQL bulk inserts.

Ecky Putrady
Towards Data Science

--

Photo by Clay Banks on Unsplash

In Data Science projects, a common last step in the data pipeline is to persist the result into a database (e.g. MySQL). The result of this data pipeline is usually big, so optimizing the writes into the database is important to achieve the acceptable pipeline latency.

I’ve benchmarked and analyzed many MySQL bulk insert setups to get insights on what really matters to significantly speed up such workload. In this article, I’ll share them so you can save time on researching and immediately start to implement the right setup. Interested? Read on.

1. Defer Building The Index

Index is extremely bad for inserts. In the graph above, index=EARLY creates the index before any of the inserts are done, while index=LATE creates the index after all of the inserts are done. We can see that:

  1. As the input size gets bigger, inserting into tables with index doesn’t scale as well as tables without index.
  2. The gap between the two is 1 hour when the input size is 50 million rows.

The graph also suggests that the table with index has non-linear latency growth. Does writing to tables with index gets slower over time?

Oh yes, it does. As the input size is bigger, the write rate is getting slower. Contrast this to the table without index where the write rate is consistent.

2. Inserts Concurrently

You can cut up to 50% latency from your bulk inserts by inserting concurrently to tables without index.

From the figure above, we also see another downside of index: concurrent inserts into tables with index have no advantage or even worsen the latency.

Concurrent inserts are surprisingly better compared to LOAD DATA. This sounds unbelievable given the endorsement LOAD DATA got in the MySQL reference, but it is what it is.

Well, to be fair, LOAD DATA is still a lot faster compared to single-threaded inserts. However, it’s just not as fast as 4 concurrent inserts.

3. Use Larger Batch Size With The Right Frequency of Commits

Batching is the use of extended inserts (insert into table values (1, 2), (3, 4)). The impact? Up to 83% latency reduction!

Now, let’s talk about COMMIT operation. When you ask the database to COMMIT, the database will perform a log flush to disk, which “formally” persists the data in the database.

Ideally, we only commit once: At the end of the workload. However, it isn’t always practical. Consider this: If you have so many data that your bulk inserts take 6 hours to run, you can’t always guarantee that the database connection is stable on that long 6 hours. Once it disconnects, you lose the whole progress and need to restart from the beginning. So, a more practical way is to COMMIT every few rows.

Since log flush operation is slow, then fewer commits should speed up the overall workload, right?

Surprisingly, No. More frequent commits are slightly faster than less frequent commits. However, the trend doesn’t hold if we tune down the batch size from 30,000 rows to 300 rows as we can see in the next figure.

The takeaway is that there’s an ideal number of rows (or, to be precise, bytes) before a COMMIT should be performed.

4. Bonus: Use Auto-Incrementing Primary Key

One tip that I’ve received is that the auto-increment primary key is faster than no primary key at all. But really, how much faster it is?

Answer: Not much at all, but still is the fastest.

The slowest setup is when the primary key is supplied along with the data in an unordered manner. Even then, it’s not significantly worse.

It kinda makes sense though, supplying the primary key in the data means you send more bytes to the database. The small extra bytes per row stack up once you have millions of rows.

Conclusion

In this article, we’ve learned 3 ways that give you the biggest ROI for reducing latency on bulk inserts workload. They are:

  1. Defer building the index
  2. Inserts concurrently
  3. Use larger batch size with the right frequency of commits

--

--