Wednesday, October 26, 2005

Ethics of writing a Signal Handler

Signal handlers are generally used incorrectly in the code. I am writing some guidelines for writing a signal handler (This involved comments from Boris and also some hacking with apache.)

1. A signal should do the minimum, like just setting/unsetting a variable. Most often signal handlers interupt execution of a syscal (if its in a interuptable sleep) , and this syscal restarts when signal handler finishes. Now If this signal handler calls the same syscal which was interupted .. we have a "futex" ( I have had some real bad experiences with this .. trust me).

Boris - "On the other hand, if we expect some code to execute right away,
we need to also arrange any blocking system calls to be interrupted by the
signal. By default, some calls like read() are automatically restarted
after the signal handler returns but this behaviour can be changed with
sa_flags in sigaction()."

2. Earlier the implementation of exclog had same sig-handler for both SIGTERM and SIGHUP but If apache calls SIGHUP and then SIGTERM (which it does) in a very short span of time, sig handler for SIGHUP will get interupetd by sig handler of SIGTERM !! .. and as boris explains it "BOOM!! Anything from lock up to infinite loop to smoke and fire may happen."

3. Linux automatically blocks the signal which caused a signal handler to be invoked during the execution of that handler BUT it doesn't block any other signals that may be using the same handler. Luckily, this is easy to achieve with sa_mask in sigaction().

4. In addition to point 1. Sometime when you modify a variable in a signal handler which is later being checked in the main loop for any changes, it is always advisable to declare this global variable as volatile. Coz the compiler seeing that variable is only check in main loop, may optimize so that the vaiable is not loaded from memory everytime its checked. Hence changes by signal handler may not be visible in the loop.


From Apache's point of view -
When Apache is stopping it sends a SIGHUP first and then SIGTERM to all its children ( thats what was also visible in apache modules). And most of the modules catch both these signals to clean up stuff. So, I am not sure why SIGHUP is handled my modules, it can be handled using SIG_ING. For that matter why SIGHUP and SIGTERM are handled when the process is going to die anyways .. and you can't do anything about it.

Why buy a life insurance for self .. if you are going to die ;-).

What if two SIGTERMs appear together, it would cause a futex, and the process will never die.

Friday, October 21, 2005

Veritas Interview Part II

I am just summing up some more qus which veritas asked me .. during all those technicle rounds ... Some of these were what i ans irrespective of whether they were asked or not :D

They told me that they just want my approach .. Even if my ans are incorrect.

here it goes ...

Q1. Would the TLBs be synchronized in SMP environment. And why.
Ans: I struggled and then said No. Obviously both CPU would get different set of instructions and there is not sense in syncronizing them.

Q2 How does computer boot up. ( I myself pulled the guy to this qus ).
This was interesting ..
Ans: Algo for x86
1. BIOS dos POST (some hardware checks) and Loads MBR (512Kb) at some fixed addr i think .. 0xfffffff0
2. sends RESET on the BUS. CPU starts execution from 0xfffffff0
3. lilo or whatever is loaded into MBR searches for Linux kernel's compressed image.
4. Loads kernel and the TRICK is.. that BIOS can only address physical addresses and even the CPU while booting knows only physical addresses (Since its registers like GDTR/LDTR and all are not set to traslate virtual-physical addresses)
At one point linux has to set the resgisters make CPU start executing in "protected mode" i.e 32bit virtaul addr. mode .. but since after that point the kernel execution path has to change coz kernel is %eip is poining to kernel code as per physical addresses. So before booting CPU into virtual addr mode .. kernel copies itself to another location from where execution would start when CPU comes to protected mode.

Q3. What do you call a daemon process. Who cleans up all the zombie processes.
Ans: Daemon is a program whose parent is already dead and init owns it now. And also it doesn't hold any tty/pts. Daemon is issentially a zombie process. And since it doesn't know if its parent would is dead .. its waiting for its parent to call wait on it. (Cos if parent calls wait and child is dead .. there would be kernel OOPs).

So init cleans all listed zombie processes.

Q4. What is cache snopping.
Ans: No idea ...

Q5. So if i say Its syncing the L1/L2 caches in SMP. So why and how is the done.
Ans: (I could only answered it when the guy asked me in second interview) You sync them coz in SMP even if the OS changes something in memory its no sure if the CPU is realy writing to the mem or just L1/L2 cache. This syncing is done through inter processor interupts. These interrupts can be raised writing to cr0 (control) register.

Q6. How are interrupts delivered to different CPUs in SMP.
Ans: APIC - advanced progm Inter. Contr. or something like that is persent in x86 (p3 onwards) you can program it to mask of some int. for some CPU. OR you can program it to give int. to CPU with lowest priority of task.

I dont know about 8259 or whatvere the int. controller was that in 386-P1.

Q7. Explain what is the layout of Ext2 filesystem.
Ans. Harddisk is divided into parts ( i dont know the actual term) and each part has itz own superblock, disk bitmap, inode bitmap, intode table , data blocks. Same copy of superblock is maintained by each segment so that if the first one gets corrupt then the others can be read and recovered. Bitmap cntains the bitmap of the data/inode. So 0 in bitmap would mean that corresponding block in the inode/data blocks is valid. It is used to quickly delete files.


Q8.How would you write a recursive function in assembly.
Ans: In normal functions you
push %ebp
mov %esp, %ebp .. so that you acn access the arguments pushed to the stack using %ebp
So Arg1 can be 8(%ebp) and Arg2 at 12(%ebp).
I told him about the recursive function .. The only trick there was that you needed to pop %ebp on every return and move %ebp into %esp.

Q9. How would you query for a graph in another graph. ( This he asked after i told him about Glassbox stuff).
Ans: Well given you have starting-points/root of the graph same you can identify the graph in O(n2) but if thats not give its a NP complete problem.


Finally i realised that they are just testing for basic skills ... I mean if these guys really wanted to screw me .. they could easily have.

Finally .. today i got the offer. Itz good. I am joining the filesystem group .. most probabbly the AIX/Linux team. This group has arround 35 dev members and most of them Phd. So wish me luck.

Monday, October 17, 2005

Veritas Interview Part I

This is bloody cool .. believe me ... And this time i didnt have to use "Towers of Honoi" Trick

Interview -

Q1: So you have worked with filesystems, tell me about the work.
Ans: I have just sent out patches for Unionfs, a vfs-layer fanout file-system. But thats my hobby My work at Ensim is in system level.

Q2. What entry points into the VFS you would have to trap, to handle this situation.
And: Everything including open, create, readdir .. etc. The only execeptions are sendfile and mmap and some others.

Q3. Can't you just pass the lower level dentries to mounted filesystem.
And: Its a union so you have to union directory entries hence create new interposed inodes and dentries. And Also coz you have to maintain different caches at different levels. The the files you craete/delete over the union are not there on the lower branches. And you don't want writes to the lower branches to affect the union.

Q4. What could be the implecations of nfs over unionfs. Tell me how mmap works over NFS.
Ans: mmap i am not sure, But in general NFS relies upon persistant inodes. i.e the inode numbers should be same over remounts. Since unionfs creates vnodes on the fly it can't ensure same inode numbers over the remounts. Although I had thought of a patch for computing inode numbers as a hash of the file path. but it would be a problem for hardlinks.

Q5. How would you write a disk defragment software for ext3. Are there any benifits in doing that.
Ans: It would be good for any filesystem to have disk defragmentation as it brings all the file blocks together. It saves -
1. reading time
2. inode table space. As inodes would not be dublicated over multiple inode blocks.

Algo:
1. Lock superblock
2. Aquire semaphore over inode.
2. Everytime you move the blocks ensure that file cache is distoyed. This can be done by increasing the superblock generation number. or otherwise...
3. Update inode tables
4. Release locks.

Q6. How does a fs say ext3 serializes the read/write on same byte over same file.
Ans: In uniprocessor it already serialized. In SMP I guess the bus controller ensures that. I think so ....

Q7. Whats the difference between Semaphores and Spin Locks ?
Semaphores sleep while waiting. Spin locks spin in a while(1) loop. Spin locks make no sence in uniprocessor as there is only on thread running.

In SMP .. you cant implement semaphores without using spin locks. In SMP data-structure of semaphore itself becomes a critical section. ..You have to ensure local interupts are disabled and SMP waits over spin lock.

There is an overhead in semaphores of acquiring a spin lock and then enqueuing the task struct and the dequeuing it. Better use a spin lock if you know wait would not be much.

Q8. In which parts of the kernel code you can't use semaphores.
Ans: Interrupt handlers. They obviously can't sleep over locks.

Q9. In SMP environment would the CPU cache cause a problem in spin locks. How would you avoid it.
Ans: Cache in L1/L2 may become dirty in one CPU. So other CPU must be informed about it. x86 uses interprocessor interrupts to flush the cache. I think c0 control register is used for it.

Typical use case is after using a spinlock. The spinning condition is almost always in cache and a spinlock changes the value for which the other CPU is waiting/spinning for.

Q10. How would you make sure the instructions are serialized and there is no optimization by the compiler/CPU over the instructions.
Ans: Use something called memory barriers. In x86 use a prefix instruction called "lock". Which requests the CPU to do all thats written before this point before executing the instruction.

In compiler you could append volatile before asm qualifier.


Q11. How do you insure such things in C. Give me an example of a possible scenario.
Ans: Same volatile. Typical use case is when a global variable changes in signal handler. If that variable doesn't directly change in the function. The compiler makes it constant i.e reads the variable value from the memory and copies to the DS (Data segment) and never refers back to the changed value.

Putting a volatile qualifier tells gcc not to make such assumptions.

Q12. Do you know about itanium architechture? In a project in veritas, We had problems with spin locks over different itanium CPUs in same machine. One lock was increasing the value and the other decreasing the value. How would you avoid race conditions in this case. Are there any instructions x86/ia64 insures for this.
Ans: x86 gives atomic instructions to test_and_set the value in the memory. I guess BTS, BTC inctrutions can work well over SMP... and also those 'X' instruction for byte manipulation like CMPXCHG etc.

Q13. How POSIX recomends locks in system level. Tell me the rules.
Ans: Posix recomends mandatory and advisory locks. In mandatory locks every system call has to wait for a lock (if it is acquired). Generally syscall send a signal to the lock owner and waits for a lease time. After that they forcefully take the lock.

Q14. Explain segmentation in Linux wrt to x86.
Ans: Linux doesn't use segmentation of x86. The 16 segment selectors (DS,SS, ES) all point to GDT tables with have initialized the segment descripter with base addreess as 0x0000

Q15. Explain paging with respect to x86.
Ans: Explained.

Q16. Can I do everything in software or hardware support is necessary.
Ans: Everything but the page cache misses. (Answered after a lot of struggle)

Q17. We wanted to have more than 8KB serialized physical memory allocation for a process in Linux. Is it easy/difficult.
Ans: Sorry :D


Q18. I have a machine with infinite RAM. Do i still near virtual memory.
Ans: Yes, to have on demand pagging.

Q19. And if don't i want on demand fecthing.
Ans: When you would have to statically compile all the code. There would be no dinamic compilation.

Q20. Have you heard about. Paging extentions in x86 (P4 onwards). Explain.
Ans: Fortunately yes. You basically use a PDTR to address 64 GB of ram with page size as 2MB. P4 gives 36 address lines inspite of 32 in normal case.

Q21. Will that have good bad effect on TLB hits.
Ans: (With lots of struggle) TLB will improve as it can adress moe mem.

Q22. I have an adjacency list of a graph. And for each node i want to sort the neighbours.
1 -> 5 , 3 , 2
2 -> 1, 4, 3
3 -> 2 , 1 , 5

becomes ..
1-> 2, 3, 5
2 -> 1, 3, 4
3-> 1, 2, 5
Solve.

Ans: (With lotz of struggle) After sorting the first list you can make out that if
1-> 2, 3, 5 then all 2,3,&5 will have first element as 1. And so on.


Q23. I have overlapping line segments and then i give you a point. You gotto tell me how many lines would have this point.

Ans: Used R-trees ( the guy gave some hints )


Just waiting for results now .... :D