kexec on iMX7d

So I get to work on some more on Kernel, yay! But not so yay after almost a week struggling with the problem. The problem is with the kexec on a custom iMX7d based platform. This is not the first time I am dealing with the kexec problem on this platform. Earlier two problems were fixed by me and I learned quite a few things while debugging those. But those were kind of simple problems compared to this. 1st one was about kexec not working at all. There were no logs after kexec runs and prints the famous “Bye!” message before booting into new kernel image. By enabling the earlyprintks and init level debugging (as I explained one of my earlier posts), I could figure out the problem which was related to a failure whilre reading a special OTP entry. And that was caused due to missing clock for that particular module. Good that it was fixed relatively quickly. The 2nd problem appeared when we updated the kernel version. The kexec just stopped working. Now I could not even see the “Bye!” message. Debugging for quite some time, it was found that the kernel was crashing when executing some “IRQ migrating” routine before doing the soft restart. It took a while to figure out what was the cause of the problem and I dug quite a lot of code in kernel’s IRQ subsystem. Eventually it turned out that an MFD based IRQ chip driver was trying to register itself and was failing due to missing IRQ entry in DTS. I just never paid attention to the boot logs where it clearly mentioned that failing to request IRQ. However, the driver for that MFD based irq chip simply ignored that warning to continue booting – which was done intentionally, but troubled me a lot. In the end I was sort of relieved when the issue was fixed. But while working on these issues, I sometimes encountered the problem where kexec just simply refused to do soft restart. But it happened so rarely that I never paid quite attention and considered them as low priority.

Suddenly this behavior of kexec failing to reboot appears quite frequently and I have to debug this. I hate such issues which fail to reproduce that easily because I unnecessarily waste the time to create an environment to reproduce the problem and it eats a lot of time. So far I have not managed to consistently reproduce the problem, but by enabling the “debug” and “earlyprintk” I hit another problem, which would put device into endless boot loop. The “dmesg” shows something similar to what is seen in this thread:
http://lists.infradead.org/pipermail/linux-arm-kernel/2016-February/409324.html

I had doubted on serial driver code (driver/tty/serial/imx.c) since I thought it looks like TTY driver has some buffer corruption and due to which I see the same booting messages again and again. I googled a bit and found a patch which was supposed to fix the TTY related problem for during kexec but that did not help either. The problem persisted. I kept searching on internet about TTY related problems, but it was in vain. I was clearing going on wrong direction. Then searching for quite some time on google, I stumbled upon the above thread, which has a similar debug message which I was observing. And I now cannot remember how did I reach to that email thread but that corrected my direction. By closely reading the thread multiple times, I learned that the problem is related to one of the CPUs (out of two Cortex-A7 core in iMX7d SoC) not coming online. I also noticed that when the kexec executes successfully, it always runs on single core. The 2nd core just fails to come online – which is good, because the other core can just do its work. And when kexec fails, it just keeps on trying to boot up the other core and ends up in a boot loop.

So I was disheartened initially that I don’t know anything about booting sequence, how would I debug. But since I had no other options, I started looking at the codes and putting prints. But then I remembered that I have seen this “CPU1: failed to come line” message earlier as well, when I used the sysfs interface to bring a core down and make it up again. It was when we were doing power consumption measurements. So it was easy to figure out what part of code gets executed when the CPU fails to come online. I traced it back using cscope, ag combination and landed into the assembly code. Wow. I never thought that I would debug assembly – it was a mixed feeling, exciting and disheartening. The ARM assembly is not that difficult, so I just started looking at the code to figure out what exactly happens. The branching instructions and special register MCore handling gave me some insights on the flow. I looked at the ARM linker script on what actually calls the start function and prints “Booting Linux on physical CPU…” and looked at the iMX7d Reset Controller registers and Clock gating registers to figure out if all the required settings are done or not. Everything looks okay as per the datasheet. I just have no clue what the hell is happening. It reached to a point where the CPU1 is powered up, it successfully calls the assembly routine and eventually ends up in secondary_start_kernel() in arch/arm/kernel/smp.c and in the process of figuring this out, I learned how to print ascii messages from ARM assembly (some yay moments). (There is an assembly file at arch/arm/kernel/debug.S which contains a lots of helper routines to print ascii, chars and hex digits from assembly). But then from secondary_start_kernel(), the code reports a completion (by calling complete(&cpu_running)) and wakes up a blocking process on calling processor (usually CPU0). But for whatever reason, this completion sometimes never reaches to CPU0. The data structure of “struct completion” somehow seems to be messed up between two CPUs (some caching issue as far as I can imagine). And even if sometimes the completion wakes up the blocking process on CPU0, the CPU online mask bits (cpu_onlne_mask) is not properly updated. At the moment I have no idea how do I debug this issue, and I am sad about it :-(

As a workaround to the original kexec problem, now we are calling the kexec with “nosmp” kernel command line so that the secondary core is never booted and causes any troubles. But I would keep an eye on this problem and definitely try to figure out what’s going on.

The only good thing that happened that happened while debugging this problem is that now I have become more familiar with the part of kernel code which I had never touched upon and did some debugging on assembly for the first time. Yay!

Happy debugging!

Vim Tricks

I wanted to try a decent IDE to browse the source code for huge projects, like the Linux Kernel, but did not find any. I was happy with Vim + cscope vim plugin. But now the same requirement arose for a large C++ project. C++ source browsing is no fun, nevertheless I asked a colleague of mine and found that he is also in the same boat – not satisfied with existing IDEs for Linux. He is also a power user for Vim, so he shared the following link:

http://yannesposito.com/Scratch/en/blog/Vim-as-IDE/

And it contains some amazing plugins. I liked the FixWhiteSpace, and Unite the most. Grepping the code with “ag” is also fun (especially when cscope gives up in finding C++ symbols). Worth a try guys!

connman conflict with NetworkManager

I had installed connman (Connection Manager package) on my Fedora 23, and did not pay attention to the installation dialogues, which would have replaced NetworkManager as default connection management tool. That was the biggest mistake I could have made. Connman did not trouble me then, but as soon as I upgraded my distribution to latest, Fedora 25, the internet stopped working. The ethernet link was active though. I had a similar problem a while ago on my home laptop, but since I was in a hurry and had a separate /home partition, I just did a fresh install (debugging connman without internet took 25 minutes without solution, and re-installation took 15 minutes, poor me!). But on my office work station, there was no option for re-installation, so I had to debug the issue. From my past experience I knew the problem was related to connman, since I could see the /etc/resolv.conf was not taking DNS values which I set in NetworkManager GUI. By checking at the file, I found it as a dangling symbolic link to /var/run/connman/… I uninstalled connman in a hope that the link would be re-created to a proper file but that did not happen. So I searched for resolv.conf file in /usr/ and found one under systemd directory, which already had DNS information stored to 127.0.0.53, which is not correct. Still I pointed /etc/resolv.conf to that location, and there was no success either. In the desperation I just removed the link, and restarted the NetworkManager systemd service, and to my surprise, this fixed the problem.

For someone, who has a similar problem, here is the correct link for the /etc/resolv.conf

➜  ~ ls -l /etc/resolv.conf 
lrwxrwxrwx. 1 root root 35 Jun  7 07:09 /etc/resolv.conf -> /var/run/NetworkManager/resolv.conf

Enjoy!

Debugging again…

I got one more kernel related problem which made me happier, that finally I was debugging something which I love. But the fading of love is directly proportional to the time it takes to solve the problem. And then, a sudden surge of satisfaction, a kind of kick you get, once you figured out the root cause.

This time the problem was showing the symptom at user-space. We recently upgraded the kernel version, and already working code at userspace broke after the upgrade. The basic logic would say that something is broken at kernel space. Since the problem was appearing when running our complex audio application, I wanted to get rid of all the complex stuff and wanted to reproduce it with basic ALSA utilities (aplay, arecord). And as usual, the problem did not appear when trying to reproduce with aplay and arecord. This was enough to force us to think that something is wrong with audio application. The exception reported by the application was about failing to set the ALSA hardware params. I had faces similar problem few years, where hardware param was failing due to passing invalid arguments. But how to identify what parameter was wrong? On the other hand, there was no change in codec driver and application between kernel upgrade. Nevertheless I brushed off the dust from memory and found the actual place in kernel (sound/core/pcm_native.c – snd_pcm_hw_refine()) where the hardware params are refined and could potentially fail if passed an invalid value. The entire calculation and the parameter refinement code is still a puzzle for me. Anyway, I could see that the channel count passed by the audio application was 4 for playback and it turned out be the culprit. More obvious and logical next step was to check whether codec supports 4 channels or not, and found codec actually supports 6 channels for playback. Hmm, strange. Then I tried to run aplay with 4 channels and that worked, without any problem. Looking at the application logs and comments from other engineer who worked on the same problem before me (but could not continue further due to other tasks), I figured the audio application makes use of an ALSA plugin. The ALSA plugin takes 2 channels as input and converts into 4 and passes to actual sound card. I thought something may be broken in the plugin code, but was not fully convinced, since nothing changed at user-space, and the code works just fine with old kernel.

And I entered into the usual printk-debugging method and spent rest of the day without any luck. Spent some time next day to understand the refinement code in pcm_native.c but little or no documentation makes it difficult. Then suddenly thought of taking a look at McASP platform driver (sound/soc/davinci/davinci-mcasp.c) and did a quick diff between two kernel versions. There were significant changes. I glanced through them and found many changes related to imposing channel related constraints. That gave me a direction for debugging further, and found the piece of code which actually overwriting the channel count, and setting the new channel constraint based on overwritten channel count, which I thought was the source of problem. I added some code to bypass that logic and whoa, the user-space application worked like a charm. I also figured why the standalone ALSA apps (aplay and arecord) failed to reproduce the problem. The problem was showing up with the audio application because it was opening playback and capture device at the same time. The order also mattered, because the application was opening capture device prior to playback device. Had I done the same, I could have encountered the problem. Now it is on my task list for Tuesday. Anyway I thought the channel constraint code can be fixed by having per stream channel count and prepared the patch for the same, in anticipation of upstreaming one more kernel patch. While writing the commit message, I thought of taking a look at the code once more to better formulate the message. That’s when I stumbled upon the code commentary which actually describes why the logic was done in that way; and I saved myself from embarrassment. Well, but I learnt something and at the end of the day it was greatly satisfying.

Happy debugging!

Summertime

Weather crossed 30°C today after so long. And it was such a bright sunny day to go out. Spazieren Gehen as people say in Deutch. Bijal did not want to join me today since I wanted to visit a palace within the city with a huge garden and would have to walk for quite some time. So it was a solo outing for me today probably first time after a year or so. I quite enjoyed it. 3-4 hours without a word, just me and nature and long talks from OSHO on my handy.

I always feel surprised how wonderfully these monuments are maintained along with huge gardens. I truly feel the taxpayers money is well spent. So many people were enjoying the sun bath in the vast lawns and lushy green slopes on the banks of man made lake. Literally in Bikini. Many people were with kids playing games and having real fun without mobile phones. And I got to click some stunning pictures from my “mobile phone” :-/. Next visit would definitely be with my camera. I feel content.

Moving Out

Finally after a long wait, we found a decent apartment. Actually the apartment found us. We stopped searching after a while since nobody responded to our requests. And out of the blue a friend messaged in the morning that he is vacating his apartment and offered us to take over. What a relief it was! Now after completing most of the formalities and an informal meeting with the actual landlord, we started the planning to move out. It’s a fully furnished flat and we thought moving out would be easy. But nothing is easy here as I’ve learned in past year and a half. Fortunately the same friend provided a contact who can help us with movement. We had to pack most of the stuff apart from bigger furniture like Sofas and Beds and Kitchen platform etc. By end of three days hard work, we packed most of the items in around 40 boxes/bags. That was too much. We were anticipating that the helper guy would carry all the stuff and the moving would be a piece of cake. Finally the day came, Saturday as we decided. The morning was great, since I had to rush again to buy some more bags to pack some remaining stuff. The scheduled time was 12.00 noon for relocation, and I got an message around 11.00 from Alton (the moving helper) that he is running behind the schedule and would not be able to make it at 12.00 noon, but moving at 14.00 was still possible. We were okay with that since we also got a breathing time. I got a little nap and got up at 13.45 in an anticipation of his message, but at 14.15 got another message from him that he has a problem with the earlier shifting and would not be possible to leave them for some time. At 15.45 another message came that the other clients had trouble finding the keys to the apartment and since their stuff is still lying in his truck, they are going to make a decision what to do at 16.15. I messaged him at 16.30, no reply, the biggest WTF moment since we were all packed and nothing to do. I started sending messages to the local whatsapp group to get help with relocation. A nice guy replied with couple of contacts who can help with relocation, but i was little skeptical working with them since I had no idea what would they charge and weather they would help me immediately (or worse the next day – Sunday, when mostly nobody works). Nevertheless, I made a call to Tania – the freelancer who helps in relocation, and she was speaking very nice English (step-1 was easy ;-). Then we estimated the time to relocate based on our stuff and distance. She offered a helper and let me know the charges. I consulted with Bijal and thought to appoint her. Meanwhile I received an apology message from Alton that he was really sorry for keeping us waited. I was frustrated and but had a hope that next day will be better. Meanwhile I got a message from Sebastian – the new tenant, that we do not need to do the cleaning and painting of the apartment as they are anyway going to do that. So nice of him, but out of respect, we wanted to clean the apartment if not paint, so we started off with Bathroom and cleaned it just like new. And then we were really tired.

So to celebrate our last evening at this home, we went to Burger King and had my favorite Country Veggie burger – and it is delicious as always. Went back home in time and started unpacking few of the boxes to get minimal bedding setup and some clothes, such a hassle.

Next day we all woke up in time, and cleaned most of the windows, which was much harder than I thought before. And then we were waiting with hopes that Tania would not disappoint us, and she did not. She gave my address to Pete, a helper who would do the heavy lifting stuff and I can give him a hand to free him faster. Pete rang the bell sharp at 10.00 as discussed. Wow, the moment we have been waiting for since a day. He climbed till the forth floor and one of the first questions he asked was: at what floor we are moving in? I answered that at 1st floor and he said: Jackpot. I can understand that. So both of us started moving the boxes and stuff, and by 40 minutes, all stuff was on the ground floor at the entrance. Meanwhile Tania had arrived at 10.30 on appointed time and started loading everything on truck. That was very professional. I fetched the new apartment keys and we were off to the new destination.

I also called Sebastian if he could come to the apartment and collect the keys from Bijal, since I was in the truck with Tania and Pete and asked Bijal to stay at home, hand over the keys then take S-Bahn to the new apartment. The ride in truck reminded me of my hostel commute days – me sitting in the middle seat (not actually a seat but a sponge sheet) between Tania and Pete, saving myself from the gear shifting. And we discussed the demonetization (?) to my surprise that both Tania and Pete were very well informed about the stuff going on in India. And Pete was extremely against that move, and sounded like Anti-Modi, that he had no trust in him and considered the demonetization as a open theft from the poor people. I tried to reason him with all my efforts but I could not convince him that it was a bitter good move. So with our small talks (not really) we reached to the destination, started unloading and carrying stuff to the apartment. That was quicker than I anticipated and we departed after I made the payment. It reduced down to half the price that I had decided with Alton – and I was happy about it.

While carrying the stuff to the apartment, I met first few new neighbors and flaunted my basic German skills (by saying that my German is not good – in German :() and they reply that their English is worse than my German :) Then I met an interesting man at the front garden, who was working on his laptop and looked to me in his early 50s. I just introduced myself in a bit German and he spoke very softly and slowly, just like my teacher in my German class and later on during our conversation I figured that he is actually a pysicst, working as a software engineer and teaching German to the refugees in Berlin. Wow. Few moments later, while I was lying on the bed, Bijal and her mom arrived at the nearest station and I went to pick them up. And finally we moved in, officially.

Such a busy day! – Dr Mashoor Gulati ;-)

Adventure JS

I got introduced to Node.JS from one of my colleagues from Electronics department. He likes to hack around in his spare time and has built a really cool web based tool which can interact with a device via serial console (which makes his day to day life easier). And the tool was developed using Node.JS. It was pretty impressive to see what Node.JS can do in very short time. So this is going to be my next adventure. And I found a really really nice free e-book which can be found here.

I have some ideas where I want to use this. Let’s hope my affair lasts longer than it usually last with new stuffs.

Happy hacking!