
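If you find that the code on the current branch is broken but do not know when someone broke it, git bisect can binary-search the history and pin down the offending commit quickly.
For example, the file common/payment.class.php used to contain this line: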

curl_setopt ($ch, CURLOPT_TIMEOUT, 15);

It was mysteriously deleted, and nobody knew why. All we knew was that a revision from 2018-12-06, rev=29b40048e41, still contained the line.
So git bisect can be used to find out who deleted it:

git bisect start

git bisect good 29b40048e41
git bisect bad HEAD
git bisect run grep 'CURLOPT_TIMEOUT, 15' common/payment.class.php

After a few seconds the commit that made the change is found, which feels a bit like black magic:

running grep CURLOPT_TIMEOUT, 15 common/payment.class.php
Bisecting: 4859 revisions left to test after this (roughly 12 steps)
[204bf3a98ca58c6df72f89ef9f9c0ef67761080b] maint:@lixuan 906-新手任务-接口格式
running grep CURLOPT_TIMEOUT, 15 common/payment.class.php
Bisecting: 2429 revisions left to test after this (roughly 11 steps)
[10c9d05322fb7e945e189cae9d094baa24eb3371] maint:@dengxiaochao 关注关系redis切库
running grep CURLOPT_TIMEOUT, 15 common/payment.class.php
Bisecting: 1214 revisions left to test after this (roughly 10 steps)
[221ff56686d1839b68b5be5d725ea5f6dc36ac31] MAINT: fix 高分榜翻页page@dengxiaochao
running grep CURLOPT_TIMEOUT, 15 common/payment.class.php
Bisecting: 606 revisions left to test after this (roughly 9 steps)
[779847710f9cfa729a9714ed1d1ee8cdbb4d2a70] maint:push后台-兼容老后台,不用测  @Yuanchangjun
running grep CURLOPT_TIMEOUT, 15 common/payment.class.php
        curl_setopt ($ch, CURLOPT_TIMEOUT, 15);
Bisecting: 303 revisions left to test after this (roughly 8 steps)
[f044ded3dd8bac233927bfc94b4a2079575ae605] maint: 钱包数据库du xie账号配置 不用测试 @zhouhui
running grep CURLOPT_TIMEOUT, 15 common/payment.class.php
        curl_setopt ($ch, CURLOPT_TIMEOUT, 15);
Bisecting: 151 revisions left to test after this (roughly 7 steps)
[1f65e6403afda43a8490087d855dcd13aebe8d95] MAINT: 添加支付宝sdk,不用测试@wangsanchao
running grep CURLOPT_TIMEOUT, 15 common/payment.class.php
        curl_setopt ($ch, CURLOPT_TIMEOUT, 15);
Bisecting: 75 revisions left to test after this (roughly 6 steps)
[9be1623fcd6b01543a7bbd4588111c9799a4ae64] maint:钱包修改生成财务打款记录 不用测试 @zhouhui
running grep CURLOPT_TIMEOUT, 15 common/payment.class.php
        curl_setopt ($ch, CURLOPT_TIMEOUT, 15);
Bisecting: 37 revisions left to test after this (roughly 5 steps)
[8e14a6349e87b5cfbc1377c9490c7aa3b05e222d] maint: 钱包相关 财务后台 不用测试 @zhouhui
running grep CURLOPT_TIMEOUT, 15 common/payment.class.php
Bisecting: 18 revisions left to test after this (roughly 4 steps)
[4c531389bfcce12ba3ea051983f0fb98b3d87085] MAINT: 高分榜一审@yumingkun
running grep CURLOPT_TIMEOUT, 15 common/payment.class.php
        curl_setopt ($ch, CURLOPT_TIMEOUT, 15);
Bisecting: 9 revisions left to test after this (roughly 3 steps)
[83d2badcd475c9524483ad6ffa66427442af6676] maint:钱包相关 后台打款明细添加按userid查询字段 不用测试 @zhouhui
running grep CURLOPT_TIMEOUT, 15 common/payment.class.php
Bisecting: 4 revisions left to test after this (roughly 2 steps)
[ab555514ca0dd18e74bdd4432e36e18d8dba2afe] maint:@lixuan 任务中心-完善日志
running grep CURLOPT_TIMEOUT, 15 common/payment.class.php
        curl_setopt ($ch, CURLOPT_TIMEOUT, 15);
Bisecting: 2 revisions left to test after this (roughly 1 step)
[05a2a5c5c8f2b84fa8a12d458bdbe904e862a0e6] MAINT:回退之前的修改,校验超时设置15秒长还是会有问题,看日志苹果会超时抽风 不用测 @liusurong
running grep CURLOPT_TIMEOUT, 15 common/payment.class.php
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[6404ea9662090469f77a46886614bc1582ea1fe6] maint:@lixuan 紧急fix bug 任务中心-完成明日再来任务,会漏发做完全部新手任务的奖励
running grep CURLOPT_TIMEOUT, 15 common/payment.class.php
        curl_setopt ($ch, CURLOPT_TIMEOUT, 15);
05a2a5c5c8f2b84fa8a12d458bdbe904e862a0e6 is the first bad commit
commit 05a2a5c5c8f2b84fa8a12d458bdbe904e862a0e6
Author: liusurong <liusurong@bbcd940b-3153-4649-9eaf-8d73ff426239>
Date:   Wed Jan 16 02:32:48 2019 +0000

    MAINT:回退之前的修改,校验超时设置15秒长还是会有问题,看日志苹果会超时抽风 不用测 @liusurong
    

:040000 040000 bf91e2ad602db0b551452bc6536e7ed39bcb602d 526200888c2c21194cdaaf0bd430950e836d15f5 M    common
bisect run success
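
The grep above works because git bisect run only looks at the exit code of the command: 0 marks the revision as good, 125 means "skip this revision", and any other code from 1 to 127 marks it as bad. When a plain grep is not enough, the same check can be written as a small script. Below is a minimal sketch of that idea; the function name bisect_exit_code and the temporary demo files are made up for illustration and are not part of the original workflow.

```python
import tempfile
from pathlib import Path

# Exit codes as interpreted by `git bisect run`.
GOOD, BAD, SKIP = 0, 1, 125


def bisect_exit_code(path, needle):
    """Return the exit code a bisect script should use for the current checkout."""
    file = Path(path)
    if not file.is_file():
        # The file does not exist in this revision, so it cannot be judged.
        return SKIP
    text = file.read_text(encoding="utf-8", errors="replace")
    return GOOD if needle in text else BAD


# Demo on throwaway files, so no repository is needed to try it out.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.php"
    good.write_text("curl_setopt ($ch, CURLOPT_TIMEOUT, 15);\n", encoding="utf-8")
    bad = Path(tmp) / "bad.php"
    bad.write_text("curl_setopt ($ch, CURLOPT_URL, $url);\n", encoding="utf-8")
    demo_codes = (
        bisect_exit_code(good, "CURLOPT_TIMEOUT, 15"),
        bisect_exit_code(bad, "CURLOPT_TIMEOUT, 15"),
        bisect_exit_code(Path(tmp) / "missing.php", "CURLOPT_TIMEOUT, 15"),
    )

print(demo_codes)
```

In a real bisect the script would end with sys.exit(bisect_exit_code("common/payment.class.php", "CURLOPT_TIMEOUT, 15")) instead of the demo, and would be started with something like git bisect run python3 check.py (check.py being whatever name the script is saved under).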

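Here is a pitfall that shows up when switching a Redis master and its replica. I first read about it in the change log while upgrading Redis, and I suspect quite a few people will run into it, so I am writing it down.
The relevant passage: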

Also note that since Redis 4.0 replica writes are only local, and are not 
propagated to sub-replicas attached to the instance. Sub-replicas instead 
will always receive the replication stream identical to the one sent by 
the top-level master to the intermediate replicas. So for example in the 
following setup:

A ---> B ---> C
Even if B is writable, C will not see B writes and will instead have 
identical dataset as the master instance A.

In the past, migrating a Redis master was commonly done like this:

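  • Start with A-->B
  • Add a new slave: A-->B-->C
  • Make B writable: config set slave-read-only no
  • Switch write traffic to B and read traffic to C
  • Break the replication between A and B by running slaveof no one on B
  • The result is B-->C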

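Before Redis 4.0 this procedure worked fine: after the fourth step (switching writes to B and reads to C), writes to both A and B were replicated to C. From Redis 4.0 on, only writes to A reach C; writes to B are applied locally on B and are not replicated to C. In other words, writes made directly on a node whose role is slave are not propagated to its own slaves.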

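A fun example: suppose the current topology is A-->B-->C, B is writable, and Redis holds a key test whose value is 100.
Run incr test once on A and once on B. Afterwards the values of test on A, B and C are 101, 102 and 101 respectively.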
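
To make the example easier to follow, here is a toy simulation of the A-->B-->C chain. It is only a sketch of the behavior described above, not how Redis is implemented: the Node class, the run helper and the propagate_replica_writes switch (True imitates the pre-4.0 behavior as described in this post, False imitates 4.0 and later) are all invented for illustration.

```python
class Node:
    """A toy Redis instance that only supports INCR on integer keys."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.master = None      # None means this node is a top-level master
        self.replicas = []

    def attach(self, replica):
        """Make `replica` a slave of this node, starting with a full copy."""
        replica.data = dict(self.data)
        replica.master = self
        self.replicas.append(replica)

    def _apply(self, key):
        self.data[key] = self.data.get(key, 0) + 1

    def receive_from_master(self, key):
        """Apply a command from the replication stream and pass it downstream."""
        self._apply(key)
        for replica in self.replicas:
            replica.receive_from_master(key)

    def incr(self, key, propagate_replica_writes):
        """A client write sent directly to this node."""
        self._apply(key)
        if self.master is None or propagate_replica_writes:
            for replica in self.replicas:
                replica.receive_from_master(key)


def run(propagate_replica_writes):
    a, b, c = Node("A"), Node("B"), Node("C")
    a.data["test"] = 100
    a.attach(b)
    b.attach(c)
    a.incr("test", propagate_replica_writes)
    b.incr("test", propagate_replica_writes)
    return a.data["test"], b.data["test"], c.data["test"]


print("4.0 and later:", run(propagate_replica_writes=False))
print("before 4.0:   ", run(propagate_replica_writes=True))
```

With the switch off, C only ever sees the write that came from A, which is exactly why the old migration procedure loses the writes sent to B.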

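Today I noticed that Samba throughput on one of the servers on our internal network was fluctuating between 10 and 30 MB/s. I first assumed smb.conf was misconfigured, but after tweaking it for a long time I still had not found the cause. Tests with scp and netperf both reached the full speed of about 100 MB/s, which was very strange.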

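Searching turned up nothing useful. Samba in a virtual machine on the same host was just as slow, yet while a netperf stress test was running, copying files over Samba would suddenly speed up. All very odd. I then suspected that an incompatible change in kernel parameters was hurting TCP performance, and remembered that I had recently upgraded the Linux kernel to 5.0.6.arch1-1. Downgrading to 4.17.4-1-ARCH made everything normal again. Looking back, the netperf test with 64-byte packets had only reached 800+ Mb/s, a hint I had overlooked.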

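Since the 4.10+ kernels I have already hit three network problems and one graphics driver problem. The pace of Arch Linux updates pretty much guarantees an unstable system...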

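The hostnames of machines on our internal network went wrong: every MacBook's hostname turned into bogon, and many services that track network connections resolved internal IPs to bogon. The cause is reverse DNS resolution.
For example: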

dig -x 192.168.11.12

; <<>> DiG 9.13.4 <<>> -x 192.168.11.12
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 44486
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1280
;; QUESTION SECTION:
;12.11.168.192.in-addr.arpa.    IN    PTR

;; ANSWER SECTION:
12.11.168.192.in-addr.arpa. 38934 IN    PTR    bogon.

;; Query time: 3 msec
;; SERVER: 192.168.10.1#53(192.168.10.1)
;; WHEN: 一 1月 07 16:46:11 CST 2019
;; MSG SIZE  rcvd: 74

Looking closer, the public DNS services 223.5.5.5 and 119.29.29.29 both resolve such addresses to bogon, while 8.8.8.8 does not have this problem.

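The final fix was to add the following option to dnsmasq on the gateway:

bogus-priv

That is all it takes.
The official documentation describes it as follows: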

-b, --bogus-priv
Bogus private reverse lookups. All reverse lookups for private IP ranges (ie 192.168.x.x, etc) which are not found in /etc/hosts or the DHCP leases file are answered with "no such domain" rather than being forwarded upstream. The set of prefixes affected is the list given in RFC6303, for IPv4 and IPv6.
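
The decision described in that paragraph can be sketched in a few lines. This is a simplified model and not dnsmasq code: it only knows the three RFC 1918 ranges (the real option uses the longer prefix list from RFC 6303), local_names stands in for /etc/hosts plus the DHCP leases file, and the "FORWARDED" marker and the macbook.lan name are made up for the example.

```python
import ipaddress

# Simplification: only the RFC 1918 ranges, not the full RFC 6303 list.
PRIVATE_NETS = [
    ipaddress.ip_network(net)
    for net in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def reverse_lookup(ip, local_names, bogus_priv):
    """Return (PTR query name, outcome, answer) for a reverse lookup of `ip`."""
    addr = ipaddress.ip_address(ip)
    qname = addr.reverse_pointer  # e.g. 12.11.168.192.in-addr.arpa
    if ip in local_names:
        # Known locally, so it is answered directly.
        return qname, "NOERROR", local_names[ip]
    if bogus_priv and any(addr in net for net in PRIVATE_NETS):
        # Private address that is unknown locally: answer "no such domain".
        return qname, "NXDOMAIN", None
    # Everything else goes to the upstream resolver (where bogon came from).
    return qname, "FORWARDED", None


print(reverse_lookup("192.168.11.12", {}, bogus_priv=False))
print(reverse_lookup("192.168.11.12", {}, bogus_priv=True))
print(reverse_lookup("192.168.11.12", {"192.168.11.12": "macbook.lan"}, bogus_priv=True))
```

Without the option the private address is sent upstream, which is how the bogon answer gets back into the internal network; with it the query never leaves the gateway.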

Query again:

dig -x 192.168.11.12

; <<>> DiG 9.13.4 <<>> -x 192.168.11.12
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 33874
;; flags: qr aa rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1280
;; QUESTION SECTION:
;12.11.168.192.in-addr.arpa.    IN    PTR

;; Query time: 2 msec
;; SERVER: 192.168.10.1#53(192.168.10.1)
;; WHEN: 一 1月 07 18:11:25 CST 2019
;; MSG SIZE  rcvd: 55

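My Arch Linux machine recently froze twice because too much memory was in use. The first time I assumed Chrome had too many windows open and did not think much of it. The second time I managed to kill the processes in time, took a look at free along the way, and found something unexpected.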

free
              total        used        free      shared  buff/cache   available
Mem:           7666        5616        1305         228         743        1561
Swap:           958         448         509

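With Chrome already killed, there is no way more than 5 GB of memory should be in use, so something had to be wrong...
htop showed that all processes together used no more than a few hundred MB, and dropping caches through /proc/sys/vm/drop_caches did not help either. So I suspected slab and checked /proc/meminfo: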

cat /proc/meminfo |grep ^S
SwapCached:          500 kB
SwapTotal:        981276 kB
SwapFree:         513248 kB
Shmem:            227892 kB
Slab:            5145568 kB
SReclaimable:      44640 kB
SUnreclaim:      5100928 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
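
The slab counters in this output can be cross-checked with a few lines of code: Slab should equal SReclaimable plus SUnreclaim, and the share of SUnreclaim tells how much of the slab the kernel cannot give back. The sketch below parses the sample shown above; the helper names parse_meminfo and slab_summary are my own.

```python
SAMPLE = """\
SwapCached:          500 kB
SwapTotal:        981276 kB
SwapFree:         513248 kB
Shmem:            227892 kB
Slab:            5145568 kB
SReclaimable:      44640 kB
SUnreclaim:      5100928 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
"""


def parse_meminfo(text):
    """Parse /proc/meminfo style text into {name: value as shown (kB)}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def slab_summary(info):
    """Summarise slab usage; sizes are converted from kB to GiB."""
    slab = info["Slab"]
    unreclaim = info["SUnreclaim"]
    return {
        "slab_gib": round(slab / 1024 ** 2, 2),
        "unreclaim_gib": round(unreclaim / 1024 ** 2, 2),
        "unreclaim_share": round(unreclaim / slab, 3),
    }


info = parse_meminfo(SAMPLE)
summary = slab_summary(info)
print(info["SReclaimable"] + info["SUnreclaim"] == info["Slab"])
print(summary)
```

On a live Linux system the same functions can be fed with the contents of /proc/meminfo instead of SAMPLE.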

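As shown above, SUnreclaim takes up about 5 GB, which makes no sense. Given that this only started recently and the memory problem occurred twice in a row, I suspect the newly upgraded kernel has a memory leak bug.
My version is: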

Linux NAS-Arch 4.18.12-arch1-1-ARCH #1 SMP PREEMPT Thu Oct 4 01:01:27 UTC 2018 x86_64 GNU/Linux

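A quick search suggests it really is a memory leak:
Huge memory leak on linux kernel 4.18
The Kernel Memory Leak Detector (kmemleak) can be used to check for it,
and there are also some guesses about the cause: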

I have updated the gist with the output of kmemleak after clearing and scanning (done after about 45 mins of runtime)
https://gist.github.com/coolsidd/d8a1d5addafd6a2367b68e6a6b243dc4/revisions

As for the amdgpu I don't believe that it is the cause (of atleast the major part) of the leak. It was the first module I removed while checking so I can confirm that the leak persists after removing amdgpu.

As for the rtl8723be (my network driver) , the lts version is very different doesn't properly work (it does not have the antenna select option). However I have been using this module since a year and it is also present of 4.17.x. Were there any changes in this version (their github page does not show any major changes since last 6 months). I will build for 4.17.x tomorrow to confirm whether the leak is due to the rtl drivers.
--
There were a few commits for rtlwifi between 4.17.14 and 4.18.6
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/log/drivers/net/wireless/realtek/rtlwifi?h=v4.17.1&qt=range&q=v4.17.14..v4.18.6

I have not read it closely. I will upgrade to the latest kernel first, and if the problem comes back, I will roll back to 4.17.11.