A new site

I have just set up a new site: http://en.chys.info.

All existing posts here have been copied over.

Linux’s vsyscall

Querying the current time obviously cannot be done entirely in userspace. However, strace does not record any system call made by the time function on Linux x86_64.

Let’s disassemble glibc:
$ objdump -d /lib64/libc-2.9.so | fgrep -A5 '<time>:'
000000000008a510 <time>:
   8a510:       48 83 ec 08             sub    $0x8,%rsp
   8a514:       48 c7 c0 00 04 60 ff    mov    $0xffffffffff600400,%rax
   8a51b:       ff d0                   callq  *%rax
   8a51d:       48 83 c4 08             add    $0x8,%rsp
   8a521:       c3                      retq

It seems glibc is redirecting the function call to something fixed at virtual address 0xffffffffff600400. But what is there?

It turned out to be the so-called vsyscall (virtual system call) mechanism, which Linux uses to make certain system calls as fast as possible. A vsyscall does not execute the syscall instruction, so strace never sees it.

The vsyscalls are part of the kernel, but the kernel pages containing them are executable with userspace privileges. And they’re mapped to fixed addresses in the virtual memory[1].

There are currently 3 vsyscalls in Linux x86_64: gettimeofday, time and getcpu. Their locations in the virtual memory can be found with the VSYSCALL_ADDR macro defined in /usr/include/asm/vsyscall.h:
#ifndef _ASM_X86_VSYSCALL_H
#define _ASM_X86_VSYSCALL_H

enum vsyscall_num {
    __NR_vgettimeofday,
    __NR_vtime,
    __NR_vgetcpu,
};

#define VSYSCALL_START (-10UL << 20)
#define VSYSCALL_SIZE 1024
#define VSYSCALL_END (-2UL << 20)
#define VSYSCALL_MAPPED_PAGES 1
#define VSYSCALL_ADDR(vsyscall_nr) (VSYSCALL_START+VSYSCALL_SIZE*(vsyscall_nr))


#endif /* _ASM_X86_VSYSCALL_H */
NOTE: We do not need to use vsyscalls explicitly. The corresponding glibc wrappers (for getcpu, it’s sched_getcpu) already take advantage of them.
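
As a sanity check, the macro arithmetic can be reproduced in a few lines of C. This is only a sketch: vsyscall_addr is a name made up for this example, and the constants are copied from the header quoted above, so it says nothing about kernels that lay out the vsyscall page differently.

```c
#include <stdint.h>

/* Constants copied from the asm/vsyscall.h quoted above. */
#define VSYSCALL_START ((uint64_t)-10 << 20)   /* 0xffffffffff600000 */
#define VSYSCALL_SIZE  1024

/* Hypothetical helper: address of vsyscall number nr
   (0 = gettimeofday, 1 = time, 2 = getcpu). */
uint64_t vsyscall_addr(unsigned nr)
{
    return VSYSCALL_START + (uint64_t)VSYSCALL_SIZE * nr;
}
```

vsyscall_addr(1) evaluates to 0xffffffffff600400, which is exactly the address that glibc's time calls in the disassembly.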


[1] I really hate Microsoft’s use of the term ‘virtual memory’ to refer to swap files on disk! It once confused me a lot.

Difference between dup(0) and open("/dev/fd/0",...);

I believe APUE (2nd ed.; Sec. 3.16) is not correct.

APUE says fd = open("/dev/fd/0", mode); is equivalent to fd = dup (0);, and mode is completely ignored. It seems this is the case in Solaris, but wrong in Linux. (I don’t have access to other Unices at this moment.)

A test program:
01 #include <stdio.h>
02 #include <unistd.h>
03 #include <fcntl.h>
04 int main ()
05 {
06     close (0);
07     printf ("%d\n", open ("a.txt", O_RDONLY)); // Should be 0
08     //int f2 = open ("/dev/fd/0", O_WRONLY);
09     int f2 = dup(0);
10     printf ("%d\n", f2);
11     write (f2, "Hello world\n", 12);
12     return 0;
13 }

Let’s run the program with an empty a.txt. The write call in Line 11 is certain to fail, because the duplicated descriptor is still read-only.

Now, let’s comment out Line 9, uncomment Line 8, and try again.

First I ran it on Solaris: the write call still failed, which is the behavior APUE describes.

Then I tried it on Linux: it succeeded!

It seems that on Linux, open treats /dev/fd/0 as nothing but a normal symlink to a.txt, so it returns a completely new descriptor instead of a duplicate of the old one.
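
One way to tell a duplicate from a fresh open is to compare access modes and to check whether two descriptors share a file offset. The following is a minimal sketch, not a definitive test: access_mode and shares_offset are names invented here, and the offset check only makes sense for seekable files.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: access mode (O_RDONLY, O_WRONLY or O_RDWR) of fd,
   or -1 on error. */
int access_mode(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    return flags == -1 ? -1 : (flags & O_ACCMODE);
}

/* Hypothetical helper: 1 if fd_a and fd_b share a file offset (as
   descriptors created by dup do), 0 if they do not, -1 on error. */
int shares_offset(int fd_a, int fd_b)
{
    off_t saved = lseek(fd_a, 0, SEEK_CUR);
    off_t before = lseek(fd_b, 0, SEEK_CUR);
    if (saved == (off_t)-1 || before == (off_t)-1)
        return -1;

    /* Move fd_a forward by one byte and see whether fd_b follows. */
    if (lseek(fd_a, saved + 1, SEEK_SET) == (off_t)-1)
        return -1;
    off_t after = lseek(fd_b, 0, SEEK_CUR);

    /* Put fd_a back where it was. */
    lseek(fd_a, saved, SEEK_SET);

    if (after == (off_t)-1)
        return -1;
    return after != before;
}
```

Called as shares_offset(0, f2) in the test program, this should return 1 for the dup variant. If Linux really opens the file afresh, it should return 0 for the /dev/fd/0 variant, and access_mode(f2) should report O_WRONLY rather than O_RDONLY.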

Let’s try it again with a shell script:
rm -f a.txt
touch a.txt
exec 0<a.txt
exec 3>/dev/fd/0
echo 'Hello world' >&3
cat a.txt

Run it in Linux (with DASH or BASH): both printed ‘Hello world’.

Run it in Solaris (with Bourne shell and BASH): Both failed, outputting nothing (Bourne shell) or failing with ‘Bad file number’ (BASH).

Conclusion:

(1) Solaris handles /dev/fd/.. specially, as APUE tells us;

(2) Linux simply considers /dev/fd/0 a symlink to the actual file.

(I’ll try later how Linux handles open("/dev/fd/0",mode) if the descriptor is an anonymous pipe or socket or something else that a normal symlink is unable to link to.)



Kernels used in the above tests:

Linux: Linux desktop 2.6.28-gentoo #4 SMP Mon Jan 12 17:39:23 CST 2009 x86_64 Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz GenuineIntel GNU/Linux

Solaris: SunOS caesar 5.8 Generic_117350-51 sun4u sparc SUNW,Ultra-80 Solaris

gspca in Linux 2.6.27

The in-kernel gspca driver in Linux 2.6.27 works only with v4l2, not v4l. This can lead to problems: programs using v4l show strange pictures and print annoying error messages.

My webcam worked with gspcav1 and Linux 2.6.26, but it failed in Linux 2.6.27 (with its in-kernel gspca drivers):
>>cmcapture err -1
cvsync err
: Invalid argument
cmcapture: Invalid argument

Solution: Install libv4l, and use a command like this: LD_PRELOAD=/usr/lib/libv4l/v4l2convert.so skype.

Reference
Linux kernel bug #11860.

BASH’s compat31 option

Several of my bash scripts failed when I migrated from Debian to Gentoo almost one year ago, because the two systems' versions of bash interpret commands like this differently: [[ "$x" =~ '^[0-9]$' ]]. This command succeeded in Debian when $x was a single digit, but failed in Gentoo. I had to remove the single quotes surrounding the regular expression to make it work in Gentoo.

Today I finally found the reason: I was using bash 3.1 in Debian and 3.2 in Gentoo. Bash 3.2 by default matches a quoted pattern as a literal string, so regular expressions must not be surrounded by quotes; however, the old behavior can be restored using shopt -s compat31.
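
The difference can be modelled in C with POSIX regcomp. This is only an illustration of the two interpretations, not bash's actual code: match_as_regex and match_as_literal are hypothetical names, and the literal case is approximated with a plain substring search.

```c
#include <regex.h>
#include <string.h>

/* Model of the bash 3.1 behavior: the quoted text is compiled as an
   extended regular expression. Returns 1 on match, 0 on no match,
   -1 if the pattern does not compile. */
int match_as_regex(const char *s, const char *pattern)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int rc = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* Model of the bash 3.2 behavior for a quoted pattern: the text is
   searched for literally, with no special meaning for ^, [ or $. */
int match_as_literal(const char *s, const char *pattern)
{
    return strstr(s, pattern) != NULL;
}
```

With s set to "5" and pattern set to "^[0-9]$", match_as_regex returns 1 while match_as_literal returns 0, which mirrors the Debian and Gentoo results.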

Leap year bug crashes Zune

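Microsoft’s 30GB Zune players stopped working today (Dec 31).

The problem has been identified: a bug in the Freescale firmware leads to an infinite loop on the last day of a leap year. When year is a leap year and days is exactly 366, neither branch of the loop changes days or year, so the loop never exits.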
year = ORIGINYEAR; /* = 1980 */

while (days > 365)
{
   if (IsLeapYear(year))
   {
      if (days > 366)
      {
         days -= 366;
         year += 1;
      }
   }
   else
   {
      days -= 365;
      year += 1;
   }
}

If code this poor were found in an airplane or a medical device, the consequences would be terrible.
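
For comparison, here is a minimal sketch of a loop that terminates. It is not the vendor's fix; year_from_days and is_leap_year are names chosen for this example, and it assumes days is 1-based, so day 1 is January 1, 1980.

```c
#define ORIGINYEAR 1980

/* Gregorian leap year rule. */
static int is_leap_year(int year)
{
    return (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
}

/* Hypothetical helper: converts a 1-based day count since January 1 of
   ORIGINYEAR into a year, storing the 1-based day within that year in
   *day_of_year. */
int year_from_days(int days, int *day_of_year)
{
    int year = ORIGINYEAR;
    for (;;) {
        int length = is_leap_year(year) ? 366 : 365;
        if (days <= length)
            break;          /* days now falls inside this year */
        days -= length;
        year += 1;
    }
    *day_of_year = days;
    return year;
}
```

Every iteration either exits or subtracts a whole year, so day 366 of a leap year simply ends the loop instead of hanging it.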

Migrating to EXT4

Ext4, the successor to ext3 (formerly known as ext4dev), is marked stable in Linux kernel 2.6.28, meaning the kernel developers now consider it ready for production use.

To convert a file system from ext3 to ext4, use
tune2fs -O extents /dev/DEV
and remount the file system as ext4. (Running e2fsck both before and after tune2fs is recommended.) Some documentation also mentions the -E test_fs option. This is no longer necessary, since ext4 is no longer experimental.

Finally, do not forget to modify /etc/fstab.

An ext4 file system created this way is not a “true” ext4: the extents feature, the main advantage of ext4 compared to ext3, is not automatically applied to old files. Only files created afterwards use the extents format.
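
To see which files already use extents, one can read the per-file flags, the same ones lsattr prints. The sketch below is Linux-only and assumes <linux/fs.h> provides FS_IOC_GETFLAGS and FS_EXTENT_FL; file_uses_extents is a hypothetical name.

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>   /* FS_IOC_GETFLAGS, FS_EXTENT_FL (Linux only) */

/* Hypothetical helper: 1 if the file at path is stored with extents,
   0 if it is not, -1 if the flags cannot be read (for example, the
   file does not exist or the file system does not support the ioctl). */
int file_uses_extents(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    int flags = 0;
    int rc = ioctl(fd, FS_IOC_GETFLAGS, &flags);
    close(fd);
    if (rc == -1)
        return -1;

    return (flags & FS_EXTENT_FL) != 0;
}
```

On a converted file system, files that existed before the conversion would be expected to return 0 and newly created ones 1.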

Unlike the 100% backward compatibility of ext3 with ext2, an ext4 file system can no longer be mounted as if it were an ext3, unless the extents feature is disabled. (If you want to disable extents, why not simply use ext3?)