Writing a Linux kernel module (3)

This post is an automatic translation from French. You can read the original version here.

C From Scratch Episode 25

A new stream from Imil means new notes… So here we are again with a new article on creating kernel modules! This time, we will revisit the concept of char devices, and dig deeper into the notions of Major and Minor, which we glossed over a bit too quickly last time.

Warning: The notes below do not always follow the plan Imil used in his stream, even though – I hope – everything will be covered here as well!

A driver? /dev? What’s that?

A process does not have access to everything. As we discussed extensively on Imil’s stream, it only “sees” virtual memory (thanks to the MMU!) and can neither directly access disks, nor directly access files, nor access anything other than memory without going through a request to the all-powerful kernel. And these requests are syscalls.

Our machine’s hardware is therefore only accessible to the kernel, and we, poor peasants that we are, are condemned to call upon the services responsible for managing that hardware: the drivers. In short, our code issues a syscall, which the kernel processes. Depending on our request, it triggers an appropriate function in the driver responsible for managing that particular piece of hardware. So far so good!

However, there is a wide variety of peripherals: screens, webcams, fingerprint readers, brrr, hard drives, network cards… Just a few examples, yet so many differences. The approach taken by UNIX systems was therefore to offer a unified interface for communicating with all of them: we talk to them as if we were opening files. The enormous advantage of this idea is that the basic functions for manipulating files are few, and we know them well: open, close, read, write, seek… This reduces the number of different syscalls needed!

The pseudo-files that allow communication with drivers are neatly organized in /dev.

Let’s take an example; that will surely be clearer. As you probably know, the hexdump command displays the contents of a file in hexadecimal:

$sudo hexdump /dev/input/mice

Well, nothing shows up… But if we play around with the mouse a bit:

$sudo hexdump /dev/input/mice
0000000 0108 0800 0001 0108 0800 0002 0208 0801
0000010 0002 0208 0800 0003 0208 0801 0003 0308
0000020 0801 0102 0308 0802 0103 0208 0801 0202
0000030 0208 0802 0202 0208 0802 0202 0208 0803
0000040 0202 0108 0802 0202 0108 0803 0201 0108
0000050 0802 0201 0008 0802 0200 0108 0802 0100
0000060 0008 0801 0101 0008 0801 0101 0008 0801
0000070 0100 0008 0801 0100 0008 0801 0200 0008
0000080 0801 0200 0008 1801 02ff 0008 0802 0100
0000090 0008 1802 01ff 0008 0801 0100 ff18 1800
00000a0 00ff 0008 1801 00ff ff18 1800 00ff 0028
00000b0 38ff ffff 0028 38ff ffff 0028 38fe ffff
00000c0 ff38 28ff fe00 ff38 38fe feff ff38 38ff
00000d0 feff 0028 38fe feff ff38 38fe feff ff38
00000e0 28fe fe00 ff38 38fe fdff ff38 38fe ffff
00000f0 ff38 38fe feff 0028 38fe ffff ff38 28ff
0000100 fe00 ff38 38ff feff ff38 28ff fe00 ff38
0000110 38ff feff ff38 28ff ff00 ff38 38ff ffff
0000120 0028 38ff ffff ff38 38ff ffff ff38 28ff
0000130 ff00 ff18 1800 00ff 0028 18ff 00ff 0028

Magical, isn’t it? By viewing this file, we are receiving bytes coming from the mouse: we are actually talking to the mouse driver!!!!! The files in /dev are not real files, but pseudo-files that serve to interact with drivers.

By the way, if we look at them a bit more closely, these pseudo-files are quite peculiar:

$sudo ls -al /etc
total 1828
drwxr-xr-x 120 root  root   12288  9 avril 19:33 .
drwxr-xr-x  24 root  root    4096 12 sept.  2021 ..
drwxr-xr-x   2 root  root    4096 13 févr.  2020 a2ps
drwxr-xr-x   4 root  root    4096 28 nov.  19:23 acpi
drwxr-xr-x   3 root  root    4096 19 avril  2019 alsa
-rw-r--r--   1 root  root     541 28 nov.  15:31 anacrontab
drwxr-xr-x   3 root  root    4096 12 sept.  2021 audit
[...]

$sudo ls -al /dev/input/
total 0
drwxr-xr-x  4 root root     720  9 avril 19:33 .
drwxr-xr-x 19 root root    4580  9 avril 19:33 ..
[...]
crw-rw----  1 root input 13, 63  9 avril 19:33 mice
crw-rw----  1 root input 13, 32  9 avril 19:33 mouse0
crw-rw----  1 root input 13, 33  9 avril 19:33 mouse1
crw-rw----  1 root input 13, 34  9 avril 19:33 mouse2

Unlike the regular entries in /etc, the nodes in /dev/input show no file size: in its place we find a pair of numbers, the char major and the char minor. The first character of the permissions block changes too: it shows c for char devices, and b for block devices.

What sorcery is this?

Devices, drivers, and nodes

The services we can access through a driver are most often hardware peripherals. Behind them lies an electronic circuit that the driver manages for us. However, this is not always the case. The famous /dev/zero and /dev/null, for example, are purely software. Does that make a difference? Fundamentally, not really. You can think of it as emulated or virtual hardware: it changes absolutely nothing about our discussion. It remains a resource that we access through the driver. /dev/zero provides us with zeros. Whether that’s thanks to a chip or a program, it doesn’t matter. The device(s) accessible through the driver are opaque to us: they provide a service, regardless of how.

On the other hand, the way we exchange data with a device differs depending on its nature. For this reason, we distinguish two major families of peripherals:

  • Block devices
  • Char devices

When communicating with a char device, data exchange happens byte by byte. We could see this with the mouse example above – the conversation takes the form of a byte stream that we receive or send one at a time.

When communicating with a block device, data exchange happens in blocks. The most obvious example, already mentioned in the “Linux From Scratch” series streams, is hard drives, which are traditionally accessed in 512-byte sectors.

We can view the char and block peripherals in /sys/dev:

$ls -l /sys/dev/block/
total 0
lrwxrwxrwx 1 root root 0 10 avril 14:08 254:0 -> ../../devices/virtual/block/dm-0
lrwxrwxrwx 1 root root 0 10 avril 14:08 254:1 -> ../../devices/virtual/block/dm-1
lrwxrwxrwx 1 root root 0 10 avril 14:08 254:2 -> ../../devices/virtual/block/dm-2
lrwxrwxrwx 1 root root 0 10 avril 14:07 259:0 -> ../../devices/pci0000:00/0000:00:1b.0/0000:02:00.0/nvme/nvme0/nvme0n1
lrwxrwxrwx 1 root root 0 10 avril 14:07 259:1 -> ../../devices/pci0000:00/0000:00:1b.0/0000:02:00.0/nvme/nvme0/nvme0n1/nvme0n1p1
lrwxrwxrwx 1 root root 0 10 avril 14:07 259:2 -> ../../devices/pci0000:00/0000:00:1b.4/0000:03:00.0/nvme/nvme1/nvme1n1
lrwxrwxrwx 1 root root 0 10 avril 14:07 259:3 -> ../../devices/pci0000:00/0000:00:1b.4/0000:03:00.0/nvme/nvme1/nvme1n1/nvme1n1p1
lrwxrwxrwx 1 root root 0 10 avril 14:07 259:4 -> ../../devices/pci0000:00/0000:00:1b.4/0000:03:00.0/nvme/nvme1/nvme1n1/nvme1n1p2
lrwxrwxrwx 1 root root 0 10 avril 14:07 259:5 -> ../../devices/pci0000:00/0000:00:1b.4/0000:03:00.0/nvme/nvme1/nvme1n1/nvme1n1p3

$ls -l /sys/dev/char/
total 0
lrwxrwxrwx 1 root root 0 10 avril 14:36 10:121 -> ../../devices/virtual/misc/vboxnetctl
lrwxrwxrwx 1 root root 0 10 avril 14:36 10:122 -> ../../devices/virtual/misc/vboxdrvu
lrwxrwxrwx 1 root root 0 10 avril 14:36 10:123 -> ../../devices/virtual/misc/vboxdrv
lrwxrwxrwx 1 root root 0 10 avril 14:36 10:124 -> ../../devices/virtual/misc/acpi_thermal_rel
lrwxrwxrwx 1 root root 0 10 avril 14:36 10:125 -> ../../devices/virtual/misc/cpu_dma_latency
lrwxrwxrwx 1 root root 0 10 avril 14:36 10:126 -> ../../devices/virtual/misc/udmabuf
lrwxrwxrwx 1 root root 0 10 avril 14:36 10:127 -> ../../devices/virtual/misc/vga_arbiter
[...]

Who declares the existence of all these peripherals to the kernel? The drivers do. In the case of char devices, they do so using a function such as register_chrdev, which we saw last time.

And this is where the char major and minor come in. The char major, as we will see below, identifies a driver to the kernel: by registering it, the driver declares that it manages the corresponding devices. But a driver commonly manages several peripherals, so it also registers one minor per device it manages.

Whoa… I can see you turning pale; we may have gone through that a bit too fast. Let’s go back to our mice from earlier:

$ls -l /dev/input/mouse*
crw-rw---- 1 root input 13, 32 10 avril 14:08 /dev/input/mouse0
crw-rw---- 1 root input 13, 33 10 avril 14:08 /dev/input/mouse1
crw-rw---- 1 root input 13, 34 10 avril 14:08 /dev/input/mouse2
crw-rw---- 1 root input 13, 35 10 avril 14:08 /dev/input/mouse3

All our mice share the same char major, 13, but each has a different char minor (32, 33, 34, and 35 respectively): they are different peripherals, but all are managed by the same device driver.

By the way, we can find out which driver registered char major 10 very simply:

$grep 10 /proc/devices
10 misc

And there you go!!!

What about /dev in all this?

When a driver declares a device, the kernel knows it must now route any associated syscall (read, write, etc.) to that driver. Except… creating a pseudo-file in /dev is not the kernel’s responsibility.

Indeed, /dev belongs to userspace: permissions, owners, file names… All of that is associated with the poor peasants that we are, not with kernel space!

We can manually create a node in /dev using the following command:

$mknod /dev/prout c 246 0

With this command, we have just created a pseudo-file, a “node”, that corresponds to a char device (c) whose char major is 246 and char minor is 0. Any attempt to read or write to this node will trigger syscalls that the kernel will forward to the corresponding driver.

As a small experiment, we can verify that it is entirely possible to create another node for the mouse driver:

$mknod /dev/rancune c 13 35

$ls -al /dev/rancune
crw-r--r-- 1 root root 13, 35 10 avril 15:19 /dev/rancune

$hexdump /dev/rancune
0000000 0108 0800 0103 0408 0801 0205 0408 0800
0000010 0105 0508 0801 0004 0508 0800 0006 0508
0000020 0800 0006 0528 28ff ff06 0528 28ff ff05
0000030 0528 28ff ff04 0428 28ff ff04 0328 28ff
0000040 ff03 0208 0800 0002 0108 0800 0001 ff18
0000050 1800 00ff ff18 1801 00ff fe18 1801 01fd
0000060 fd18 1801 01fc fd18 1802 01fc fc18 1801
0000070 01fd fd18 1801 01fd fd18 1800 01fd fd18
0000080 1801 01fd fd18 1800 01fc fc18 1801 01fd
0000090 fc18 1801 01fd fc18 1801 00fd fd18 1801
00000a0 00fd fd18 1800 00fd fe18 1800 00fe fe18
00000b0 1800 00ff fe18 1800 00ff ff18 1800 00ff

Our node, /dev/rancune, has the same major and minor as /dev/input/mouse3. Using one or the other makes strictly no difference: the syscalls will be handled by the same driver regardless!

I know what you’re going to say: How is /dev populated on my PC? Well, that’s the job of udev, a daemon that listens to kernel notifications (transmitted via a special socket called netlink) and watches /sys. When necessary, udev creates a node in /dev based on a set of configurable rules for managing the node’s permissions, owners, etc.

A small example?

Well, since all of this is a bit tough to grasp, let’s take a small example: the mem driver, whose source code is found in the Linux kernel sources.

Mem is a driver that provides very simple peripherals:

  • /dev/zero, an infinite source of zeros
  • /dev/null, a bottomless pit you can write to
  • /dev/random, a source of random numbers
  • etc.

If we read its source code, we find a structure similar to the driver that thrilled us last week!

At line 756 of /usr/src/linux/drivers/char/mem.c:

static int __init chr_dev_init(void)
{
	int minor;

	if (register_chrdev(MEM_MAJOR, "mem", &memory_fops))
		printk("unable to get major %d for memory devs\n", MEM_MAJOR);

	mem_class = class_create(THIS_MODULE, "mem");
	if (IS_ERR(mem_class))
		return PTR_ERR(mem_class);

	mem_class->devnode = mem_devnode;
	for (minor = 1; minor < ARRAY_SIZE(devlist); minor++) {
		if (!devlist[minor].name)
			continue;

		/*
		 * Create /dev/port?
		 */
		if ((minor == DEVPORT_MINOR) && !arch_has_dev_port())
			continue;

		device_create(mem_class, NULL, MKDEV(MEM_MAJOR, minor),
			      NULL, devlist[minor].name);
	}

	return tty_init();
}

As we can see above, the driver first registers the char Major MEM_MAJOR using the register_chrdev function in the same way we did last week.

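We then find a loop that calls device_create for each of the char minors the driver needs: this declares the device to the kernel, so that it shows up under /sys and a node can be created for it in /dev. The fops structure matching each minor is not passed here; it is picked when the node is opened, by memory_open, the open function of memory_fops.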

The definition of the array containing each of the devices to declare to the kernel is found a bit higher in the code, at line 716:

static const struct memdev {
        const char *name;
        umode_t mode;
        const struct file_operations *fops;
        fmode_t fmode;
} devlist[] = {
#ifdef CONFIG_DEVMEM
         [DEVMEM_MINOR] = { "mem", 0, &mem_fops, FMODE_UNSIGNED_OFFSET },
#endif
         [3] = { "null", 0666, &null_fops, 0 },
#ifdef CONFIG_DEVPORT
         [4] = { "port", 0, &port_fops, 0 },
#endif
         [5] = { "zero", 0666, &zero_fops, 0 },
         [7] = { "full", 0666, &full_fops, 0 },
         [8] = { "random", 0666, &random_fops, 0 },
         [9] = { "urandom", 0666, &urandom_fops, 0 },
#ifdef CONFIG_PRINTK
        [11] = { "kmsg", 0644, &kmsg_fops, 0 },
#endif
};

The sharp coder that you are (What? You can be a mere peasant before the kernel and still be good at C!) will have noticed that each peripheral has its own file_operations structure… and therefore its own functions for responding to the read, write, open, etc. syscalls.

A small example? Here is zero_fops, corresponding to the /dev/zero peripheral:

static const struct file_operations zero_fops = {
        .llseek         = zero_lseek,
        .write          = write_zero,
        .read_iter      = read_iter_zero,
        .read           = read_zero,
        .write_iter     = write_iter_zero,
        .mmap           = mmap_zero,
        .get_unmapped_area = get_unmapped_area_zero,
#ifndef CONFIG_MMU
        .mmap_capabilities = zero_mmap_capabilities,
#endif
};

And the corresponding functions are found a bit further down in the code…

Exactly the same as ours, I tell you!

In summary:

  • One driver, one char major
  • For each peripheral managed by that driver:
    • a char minor,
    • a file_operations structure (fops)
    • … and therefore a set of functions to respond to each syscall!

In the end, it’s not that complicated!

Implementing our driver

To prepare for the next session, we started a new kernel module, the famous “prout”. I will not comment on this code, since it largely follows the last stream, but I invite the interested reader to watch the previous stream and its summary – everything is there!!!

File prout.c

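#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/init.h>
#include <linux/fs.h>
#include <linux/string.h>   /* strlen() */
#include <linux/uaccess.h>  /* copy_to_user() */

MODULE_DESCRIPTION("Prout Prout");
MODULE_AUTHOR("CFS/LFS");
/* "PPL" is not a license string the kernel recognizes: loading taints it. */
MODULE_LICENSE("PPL");

#define DEVNAME "prout"

static char proutdev[] = "proutproutproutprout\n" ;
static size_t proutlen ;    /* size_t, so that min() compares like with like */
static int major ;

static ssize_t prout_read( struct file *, char __user *, size_t, loff_t * );

static struct file_operations fops = {
    .read = prout_read
};

static int __init
prout_init(void)
{
    printk("coucou la voila\n") ;
    /* Compute the length before the device can be opened and read. */
    proutlen = strlen(proutdev) ;
    /* Major 0 asks the kernel to pick a free major and return it. */
    major=register_chrdev(0, DEVNAME, &fops) ;
    if (major<0) {
        printk("nacasse!!\n");
        return major ;
    }
    return 0 ;
}


static void __exit
prout_exit(void)
{
    if (major != 0 )
        unregister_chrdev(major, DEVNAME ) ;
    printk("napuuuuuuuuuuuuuuuu\n") ;
}

/* Always serves the start of the string and never returns 0 (end of file),
 * so a reader such as cat keeps receiving data until it is interrupted. */
static ssize_t
prout_read( struct file *filep, char __user *buf, size_t len, loff_t *off )
{
    size_t minlen = min( proutlen, len ) ;
    if ( copy_to_user( buf, proutdev ,minlen ) != 0 ) {
        printk("nacasse\n");
        return -EFAULT;
    }
    return minlen ;
}

module_init(prout_init);
module_exit(prout_exit);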

File Makefile

KDIR=/lib/modules/`uname -r`/build

kbuild:
	make -C $(KDIR) M=`pwd`
clean:
	make -C $(KDIR) M=`pwd` clean

File Kbuild

obj-m = prout.o

To be continued!!!!!!

Rancune.

Note: As Lea very rightly pointed out to me, it is worth highlighting here the use of min() in our prout_read function. It is actually a macro, reachable through kernel.h (recent kernels define it in linux/minmax.h), that calls __careful_cmp() and returns the smaller of two elements. This macro notably ensures that there is no type mismatch between the two elements being compared, which is why both operands should have the same type.

This post is licensed under CC BY 4.0 by the author.