Tuesday, June 9, 2015

Baidu Bot Blues

A client complained of mysterious connection spikes. "We think we're getting hit by bots, but we're not sure," they said; "Can you look into it?"

So I edited postgresql.conf, turned on log_connections and log_disconnections, did a 'pg_ctl -D MY_DATA_DIR reload', and waited (in my case I also set log_min_duration_statment = 0, but I don't recommend that unless you can watch to make sure the logs aren't filling the log partition — and make sure you're not going to hammer a slow log device).

A while later, I did

ls -t data/pg_log/*.csv | head -5 | xargs grep -c connection

(That says "List log files in order from most recent to least recent, take the five most recent, and count the lines in each file containing 'connection'.")

I got lucky. A spike was obviously in progress; the current log file showed a count two orders of magnitude higher than the other log files'. Since the problem was in progress, no post mortem was required; I just mailed the client and asked "Do your web logs show bot traffic right now?"

The client mailed right back saying, "Yes, we see a bunch of connections from Baidu's web crawler; it's going nuts hitting our site."

Baidu is a Chinese search engine. It turns out its crawling bot is well-known for being aggressive. As that link explains, the bot doesn't respect Crawl-delay in robots.txt (that's annoying but unsurprising, since Crawl-delay is a non-standard extension).

So if this happens to you, you have these options:

  1. Register at Baidu's site (Chinese-only) to request less-aggressive crawling.
  2. Try to tune PostgreSQL connection pooling to handle the onslaught.
  3. Block Baidu entirely. This is an extreme option — not advised if you do business in China, or if you have Chinese-speaking customers elsewhere who are reaching you through Baidu. The latter requires analytics to figure out.

No comments:

Post a Comment