Quick way to list down all the packages installed by user on Fedora:

sudo dnf history userinstalled | less

The big hard Sun

“So many people live within unhappy circumstances and yet will not take the initiative to change their situation because they are conditioned to a life of security, conformity, and conservatism, all of which may appear to give one peace of mind, but in reality nothing is more damaging to the adventurous spirit within a man than a secure future. The very basic core of a man’s living spirit is his passion for adventure. The joy of life comes from our encounters with new experiences and hence there is no greater joy than to have an endlessly changing horizon, for each day to have a new and different sun.”
― Christopher McCandless

Ankho Dekhi

After a long time, watched a movie that is so fine. Ankho Dekhi. Such a deep philosophy said in such a simple way and what a brilliant brilliant acting performance by everyone in the film. At times it made me feel that I am listening to Osho’s philosophy. It has really made my weekend. But now I am going to watch JHMS this weekend and i have no hopes from the movie. SRK is usually larger than life and never can connect with real life personal. Leta see.

kexec on iMX7d

So I get to work on some more on Kernel, yay! But not so yay after almost a week struggling with the problem. The problem is with the kexec on a custom iMX7d based platform. This is not the first time I am dealing with the kexec problem on this platform. Earlier two problems were fixed by me and I learned quite a few things while debugging those. But those were kind of simple problems compared to this. 1st one was about kexec not working at all. There were no logs after kexec runs and prints the famous “Bye!” message before booting into new kernel image. By enabling the earlyprintks and init level debugging (as I explained one of my earlier posts), I could figure out the problem which was related to a failure whilre reading a special OTP entry. And that was caused due to missing clock for that particular module. Good that it was fixed relatively quickly. The 2nd problem appeared when we updated the kernel version. The kexec just stopped working. Now I could not even see the “Bye!” message. Debugging for quite some time, it was found that the kernel was crashing when executing some “IRQ migrating” routine before doing the soft restart. It took a while to figure out what was the cause of the problem and I dug quite a lot of code in kernel’s IRQ subsystem. Eventually it turned out that an MFD based IRQ chip driver was trying to register itself and was failing due to missing IRQ entry in DTS. I just never paid attention to the boot logs where it clearly mentioned that failing to request IRQ. However, the driver for that MFD based irq chip simply ignored that warning to continue booting – which was done intentionally, but troubled me a lot. In the end I was sort of relieved when the issue was fixed. But while working on these issues, I sometimes encountered the problem where kexec just simply refused to do soft restart. But it happened so rarely that I never paid quite attention and considered them as low priority.

Suddenly this behavior of kexec failing to reboot appears quite frequently and I have to debug this. I hate such issues which fail to reproduce that easily because I unnecessarily waste the time to create an environment to reproduce the problem and it eats a lot of time. So far I have not managed to consistently reproduce the problem, but by enabling the “debug” and “earlyprintk” I hit another problem, which would put device into endless boot loop. The “dmesg” shows something similar to what is seen in this thread:

I had doubted on serial driver code (driver/tty/serial/imx.c) since I thought it looks like TTY driver has some buffer corruption and due to which I see the same booting messages again and again. I googled a bit and found a patch which was supposed to fix the TTY related problem for during kexec but that did not help either. The problem persisted. I kept searching on internet about TTY related problems, but it was in vain. I was clearing going on wrong direction. Then searching for quite some time on google, I stumbled upon the above thread, which has a similar debug message which I was observing. And I now cannot remember how did I reach to that email thread but that corrected my direction. By closely reading the thread multiple times, I learned that the problem is related to one of the CPUs (out of two Cortex-A7 core in iMX7d SoC) not coming online. I also noticed that when the kexec executes successfully, it always runs on single core. The 2nd core just fails to come online – which is good, because the other core can just do its work. And when kexec fails, it just keeps on trying to boot up the other core and ends up in a boot loop.

So I was disheartened initially that I don’t know anything about booting sequence, how would I debug. But since I had no other options, I started looking at the codes and putting prints. But then I remembered that I have seen this “CPU1: failed to come line” message earlier as well, when I used the sysfs interface to bring a core down and make it up again. It was when we were doing power consumption measurements. So it was easy to figure out what part of code gets executed when the CPU fails to come online. I traced it back using cscope, ag combination and landed into the assembly code. Wow. I never thought that I would debug assembly – it was a mixed feeling, exciting and disheartening. The ARM assembly is not that difficult, so I just started looking at the code to figure out what exactly happens. The branching instructions and special register MCore handling gave me some insights on the flow. I looked at the ARM linker script on what actually calls the start function and prints “Booting Linux on physical CPU…” and looked at the iMX7d Reset Controller registers and Clock gating registers to figure out if all the required settings are done or not. Everything looks okay as per the datasheet. I just have no clue what the hell is happening. It reached to a point where the CPU1 is powered up, it successfully calls the assembly routine and eventually ends up in secondary_start_kernel() in arch/arm/kernel/smp.c and in the process of figuring this out, I learned how to print ascii messages from ARM assembly (some yay moments). (There is an assembly file at arch/arm/kernel/debug.S which contains a lots of helper routines to print ascii, chars and hex digits from assembly). But then from secondary_start_kernel(), the code reports a completion (by calling complete(&cpu_running)) and wakes up a blocking process on calling processor (usually CPU0). But for whatever reason, this completion sometimes never reaches to CPU0. The data structure of “struct completion” somehow seems to be messed up between two CPUs (some caching issue as far as I can imagine). And even if sometimes the completion wakes up the blocking process on CPU0, the CPU online mask bits (cpu_onlne_mask) is not properly updated. At the moment I have no idea how do I debug this issue, and I am sad about it :-(

As a workaround to the original kexec problem, now we are calling the kexec with “nosmp” kernel command line so that the secondary core is never booted and causes any troubles. But I would keep an eye on this problem and definitely try to figure out what’s going on.

The only good thing that happened that happened while debugging this problem is that now I have become more familiar with the part of kernel code which I had never touched upon and did some debugging on assembly for the first time. Yay!

Happy debugging!

Vim Tricks

I wanted to try a decent IDE to browse the source code for huge projects, like the Linux Kernel, but did not find any. I was happy with Vim + cscope vim plugin. But now the same requirement arose for a large C++ project. C++ source browsing is no fun, nevertheless I asked a colleague of mine and found that he is also in the same boat – not satisfied with existing IDEs for Linux. He is also a power user for Vim, so he shared the following link:


And it contains some amazing plugins. I liked the FixWhiteSpace, and Unite the most. Grepping the code with “ag” is also fun (especially when cscope gives up in finding C++ symbols). Worth a try guys!

connman conflict with NetworkManager

I had installed connman (Connection Manager package) on my Fedora 23, and did not pay attention to the installation dialogues, which would have replaced NetworkManager as default connection management tool. That was the biggest mistake I could have made. Connman did not trouble me then, but as soon as I upgraded my distribution to latest, Fedora 25, the internet stopped working. The ethernet link was active though. I had a similar problem a while ago on my home laptop, but since I was in a hurry and had a separate /home partition, I just did a fresh install (debugging connman without internet took 25 minutes without solution, and re-installation took 15 minutes, poor me!). But on my office work station, there was no option for re-installation, so I had to debug the issue. From my past experience I knew the problem was related to connman, since I could see the /etc/resolv.conf was not taking DNS values which I set in NetworkManager GUI. By checking at the file, I found it as a dangling symbolic link to /var/run/connman/… I uninstalled connman in a hope that the link would be re-created to a proper file but that did not happen. So I searched for resolv.conf file in /usr/ and found one under systemd directory, which already had DNS information stored to, which is not correct. Still I pointed /etc/resolv.conf to that location, and there was no success either. In the desperation I just removed the link, and restarted the NetworkManager systemd service, and to my surprise, this fixed the problem.

For someone, who has a similar problem, here is the correct link for the /etc/resolv.conf

➜  ~ ls -l /etc/resolv.conf 
lrwxrwxrwx. 1 root root 35 Jun  7 07:09 /etc/resolv.conf -> /var/run/NetworkManager/resolv.conf