服务器慢,cpu忽高,内存报警,问题排查

有个很奇怪的现象,show processlist 列表里有给unauthenticated user 账号链接。不多只有一个,而且服务器反应很慢。
Id      	User                	Host            	db	Command	Time	State	Info
138264504 unauthenticated user 10.xx.xx.xx:xxxx \N Connect 0 login \N
 
-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf: size 1, free list len 0, seg size 2, 1 merges
merged operations:
insert 2, delete mark 0, delete 0
discarded operations:
insert 0, delete mark 0, delete 0
Hash table size 553193, node heap has 615 buffer(s)
Hash table size 553193, node heap has 107 buffer(s)
Hash table size 553193, node heap has 104 buffer(s)
Hash table size 553193, node heap has 61 buffer(s)
Hash table size 553193, node heap has 1063 buffer(s)
Hash table size 553193, node heap has 9 buffer(s)
Hash table size 553193, node heap has 33 buffer(s)
Hash table size 553193, node heap has 49 buffer(s)
15904.09 hash searches/s, 510.66 non-hash searches/s

这里也很奇怪,Ibuf: size 1, free list len 0, seg size 2, 1 merges
innodb_buffer_pool_size  设置为2G,但是engine status 显示没有空闲page,正在使用为1,那我的buffer 都哪儿去了?
 
mysql> show variables like 'innodb_buffer_pool_size';
+-------------------------+------------+
| Variable_name | Value |
+-------------------------+------------+
| innodb_buffer_pool_size | 2147483648 |
+-------------------------+------------+
1 row in set (0.07 sec)

mysql> select 2147483648/1024/1024/1024;
+---------------------------+
| 2147483648/1024/1024/1024 |
+---------------------------+
| 2.000000000000 |
+---------------------------+
1 row in set (0.00 sec)


top.png

 
cpu忽高忽低,有时也抓不到慢SQL。
 
整个服务器反应很慢。
 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
接着说后半段
发现备份作业失败
180925 04:05:01 innobackupex: Starting the backup operation

IMPORTANT: Please check that the backup run completes successfully.
At the end of a successful backup run innobackupex
prints "completed OK!".

180925 04:05:01 version_check Connecting to MySQL server with DSN 'dbi:mysql:;mysql_read_default_group=xtrabackup;host=localhost;port=3306;mysql_socket=/opt/mysql/mysql.sock' as 'baker' (using password: YES).
Failed to connect to MySQL server: DBI connect(';mysql_read_default_group=xtrabackup;host=localhost;port=3306;mysql_socket=/opt/mysql/mysql.sock','baker',...) failed: Too many connections at - line 1314
180925 04:05:01 Connecting to MySQL server host: localhost, user: baker, password: set, port: 3306, socket: /opt/mysql/mysql.sock
Failed to connect to MySQL server: Too many connections.
因此查看了连接数,发现果然,连接数已经达到上限。max_connections=3000,当前连接数3002。
之前发现cpu忽高,os级内存也报警超过90%,所以之前查看 show processlist 只关注了info<>'' 慢事务。
发现连接数超了,但是大多数都是sleep 的。同时,错误日志也在不停的刷note
2018-09-25T03:05:17.270824+08:00 133909894 [Note] Aborted connection 133909894 to db: 'dbname' user: 'user_name' host: '10.xx.xx.xx' (Got an error reading communication packets)
 为了防止error log 一直刷连接失败的【note】
修改了  log_warnings=1 取消了note 提示。
同时,修改了等待超时时间。
SET GLOBAL wait_timeout=100

SET GLOBAL interactive_timeout = 100;
这样新进来的连接如果无操作100秒就会释放了。
kill 掉之前sleep 的连接。因为老的连接在修改timeout之前就连进来了,不能被释放。
SELECT CONCAT('kill ',id,';') FROM information_schema.PROCESSLIST WHERE command='Sleep' AND USER='user_name'

 做了上面的处理后,连接数也会到达上限,但是会随着100s的释放,有个周期波动,暂时不会出现不可控的登陆异常。
因为os级内存也高到90%
因此把最大连接数也降低了一些
SET GLOBAL max_connections=2000;

 数据库做了这么多临时处理,只是缓解问题。联系了架构师对连接池进行排查,架构师反馈连接池设置并无异常,max=50 单应用,同时4台tomcat。到这里,仍然没头绪。
 
因为没有官方审计和第三方审计功能,因此开启了general log,观察log内容。general log 记录所有在数据库中的操作,日志庞大,因此只开启了两分钟,收集了60M的log。
 
通过排查,发现一段可疑的代码

2018-09-25T15:28:18.288024+08:00 140091406 Connect dbname@10.xx.xx.xx on dbname using TCP/IP
2018-09-25T15:28:18.291119+08:00 140091406 Query /* mysql-connector-java-5.1.38 ( Revision: fe541c166cec739c74cc727c5da96c1028b4834a ) */SELECT  @@session.auto_increment_increment AS auto_increment_increment, @@character_set_client AS character_set_client, @@character_set_connection AS character_set_connection, @@character_set_results AS character_set_results, @@character_set_server AS character_set_server, @@init_connect AS init_connect, @@interactive_timeout AS interactive_timeout, @@license AS license, @@lower_case_table_names AS lower_case_table_names, @@max_allowed_packet AS max_allowed_packet, @@net_buffer_length AS net_buffer_length, @@net_write_timeout AS net_write_timeout, @@query_cache_size AS query_cache_size, @@query_cache_type AS query_cache_type, @@sql_mode AS sql_mode, @@system_time_zone AS system_time_zone, @@time_zone AS time_zone, @@tx_isolation AS tx_isolation, @@wait_timeout AS wait_timeout
2018-09-25T15:28:18.291559+08:00 140091406 Query SET NAMES utf8
2018-09-25T15:28:18.291811+08:00 140091406 Query SET character_set_results = NULL
2018-09-25T15:28:18.292087+08:00 140091406 Query SET autocommit=1
2018-09-25T15:28:18.292342+08:00 140091406 Query SET autocommit=0
2018-09-25T15:28:18.292682+08:00 140091406 Query select msg_id,client_no as client_no  ,id as id,msg_ctx  as msg_ctx,subject as subject,bus_type  as bus_type,auths as auths,domain_no as domain_no,service_id as service_id,mem_no,mem_co,mem_name,create_time,send_time,service_type,msg_no from TAB_SMS_MSG_LOG where order_type=0 and status=0 order by auths desc

 这段代码,一直在general log 里重复,在申请链接。把这段代码提交给了架构师,定位到了问题。是程序代码和并的问题,架构师处理后,问题解决。
 
总结:
1. 发现问题,内存os级报警90%==》进而查看备份作业,发现备份失败由于连接数过多==》查看连接数,查看top 监控系统负载 cpu 忽高==》查看processlist ,查看慢事务,查看innodb_trx 是否有阻塞==》查看error log 内容,查看binlog 大小==》iostat 查看io读写 ==》查看general log 发现问题代码段
 
2. 通过一系列的排查,搜集系统信息,缩小影响范围。最终锁定,除了连接数过多,MySQL没有阻塞事务,没有大事务卡住,没有读写压力。最后通过general log(或审计日志发现也有问的代码段)
 
3. 通过这次的问题,总结处理故障的思路,根据现行排查问题,在不能确定问题点的时候,先排查整个系统和数据的整体情况,缩小影响范围,这也是排除法,最后根据现象逐一推进。及时联系相关人员,比如本次的架构师。问题开始时联系厂商的开发并未发现问题,所以应该及时联系又能力处理问题的人,加快解决速度。
 

 
已邀请:

liyh

赞同来自:

内存情况:
                total           used       free     shared    buffers     cached
Mem:            15             14          1          0          0          8
-/+ buffers/cache:          5         10
Swap:            3                0          3

yejr

赞同来自:

perf top利器用起来

要回复问题请先登录注册