A client complained of mysterious connection spikes. "We think we're getting hit by bots, but we're not sure," they said. "Can you look into it?"
So I edited postgresql.conf, turned on log_connections and log_disconnections, did a 'pg_ctl -D MY_DATA_DIR reload', and waited. (In my case I also set log_min_duration_statement = 0, but I don't recommend that unless you can watch to make sure the logs aren't filling the log partition, and unless you're sure you're not going to hammer a slow log device.)
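For reference, the changes look roughly like this (the data directory path is a placeholder, and the statement-duration line is the optional part I just mentioned):

    # postgresql.conf
    log_connections = on
    log_disconnections = on
    log_min_duration_statement = 0   # optional: logs every statement; watch your disk

    # then tell the running server to re-read its configuration
    pg_ctl -D MY_DATA_DIR reload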
A while later, I did
ls -t data/pg_log/*.csv | head -5 | xargs grep -c connection
(That says "List log files in order from most recent to least recent, take the five most recent, and count the lines in each file containing 'connection'.")
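If you want more than a raw count, the same logs can tell you where the connections are coming from. Here's a rough sketch that assumes the default csvlog format, where the fifth field is the client host:port; a quoted field with an embedded comma earlier in the line would throw off the cut, so treat it as a quick check, not a parser:

    # count connection attempts per client address in the newest log file
    grep "connection received" "$(ls -t data/pg_log/*.csv | head -1)" \
        | cut -d, -f5 | sort | uniq -c | sort -rn | head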
I got lucky. A spike was obviously in progress; the current log file showed a count two orders of magnitude higher than the other log files'. Since the problem was in progress, no post mortem was required; I just mailed the client and asked "Do your web logs show bot traffic right now?"
The client mailed right back saying, "Yes, we see a bunch of connections from Baidu's web crawler; it's going nuts hitting our site."
Baidu is a Chinese search engine. It turns out its crawling bot is well-known for being aggressive. As that link explains, the bot doesn't respect Crawl-delay in robots.txt (that's annoying but unsurprising, since Crawl-delay is a non-standard extension).
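If you haven't seen it, Crawl-delay is just a line in robots.txt asking crawlers to pause between requests, something like this (well-behaved crawlers honor it; Baidu's reportedly does not):

    User-agent: *
    Crawl-delay: 10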
So if this happens to you, you have these options:
- Register at Baidu's site (Chinese-only) to request less-aggressive crawling.
- Put a connection pooler (such as PgBouncer) in front of PostgreSQL, or tune the one you have, to absorb the onslaught.
- Block Baidu entirely (see the sketch after this list). This is an extreme option, not advised if you do business in China, or if you have Chinese-speaking customers elsewhere who are reaching you through Baidu; the latter is something your web analytics can tell you.
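If you do decide to block and a web server like nginx sits in front of the site, a user-agent match is one simple way to do it. This is only a sketch: adapt it for whatever actually fronts your site, and keep in mind that user-agent strings can be spoofed.

    # nginx: refuse requests that identify themselves as Baidu's crawler
    if ($http_user_agent ~* "Baiduspider") {
        return 403;
    }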