
Writing a Linux kernel module (2)


This post is an automatic translation from French. You can read the original version here.

C From Scratch Episode 24 (2/2)

In the last article, a simple kernel module was written and compiled. This time, we will see how this module can interact with userspace through a node in /dev. This chapter follows the structure of Imil’s stream in two parts: the first explains how function pointers can be used within a structure, and the second explains how to put all of that into practice in our kernel module.

Note: I took the liberty of slightly changing the examples compared to Imil’s stream.

Introduction

Under Unix, we generally interact with hardware through a device node, that is, a pseudo-file located in /dev. The tools for doing so are the ones we already use for ordinary files: cat, echo, and so on.

But if we look a bit more closely, all of this still relies on syscalls:

$strace cat /etc/passwd
[...]
read( [...]
[...]
$strace cat /dev/input/mouse0
[...]
read( [...]
[...]

What may be surprising here is that both commands use the same syscall (read) even though these are two very different objects:

  • In the first case, /etc/passwd is a file.

  • In the second case, /dev/input/mouse0 is a pseudo file that allows us to talk to the mouse driver.

These are entirely different in nature… And yet we always access them via the read syscall. Magical, isn’t it?

In fact, we should remember that under Unix, as the famous saying goes, everything is a file. In practice, each driver provides its own implementation of the read() function, and when we access the node in /dev the kernel dispatches the read syscall to the right one.

Thanks to this little trick, there is no need to multiply syscalls. We use the API that allows access to a file and that’s it!

Function pointers and structures

To understand how the kernel does it, we can start with the following program:

int dummy_open(int) ;
int dummy_close(int) ;

struct dummy{
    int (*open)(int) ;
    int (*close)(int) ;
};

int
main()
{
    int r ;
    struct dummy d = {
        .open = dummy_open,
        .close = dummy_close
    } ;

    r = d.open(1);
    r = d.close(r) ;

    return r ;
}

int
dummy_open( int i )
{
    return --i ;
}

int
dummy_close( int i )
{
    return ++i ;
}

The dummy structure contains two function pointers, named “open” and “close” respectively. When initializing the variable d, these two pointers are set to point to dummy_open and dummy_close respectively.

From now on, when we call d.open(), it is actually dummy_open() that gets called. Similarly, d.close() actually executes dummy_close.

To illustrate this clearly, let’s do the same thing with an array of two dummy structures, each pointing to its own pair of functions, and call open() on both:

#include <stdio.h>

int dummy_open(int) ;
int dummy_close(int) ;
int yeah_open(int) ;
int yeah_close(int) ;

struct dummy{
    int (*open)(int) ;
    int (*close)(int) ;
};

int
main()
{
    struct dummy d[2] ;             // An array of 2 dummy structures

    d[0].open = dummy_open ;        // Initialize d[0]
    d[0].close = dummy_close ;

    d[1].open = yeah_open ;         // Initialize d[1]
    d[1].close = yeah_close ;


    d[0].open(1) ;                  // Call the open function of d[0]
    d[1].open(1) ;                  // Call the open function of d[1]

    return 0;
}

int
dummy_open( int i )
{
    printf("dummy_open\n");
    return 0 ;
}

int
dummy_close( int i )
{
    printf("dummy_close\n");
    return 0 ;
}

int
yeah_open( int i )
{
    printf("yeah_open\n");
    return 0 ;
}

int
yeah_close( int i )
{
    printf("yeah_close\n");
    return 0 ;
}

Let’s compile and test our program:

$gcc -o test main.c

$./test
dummy_open
yeah_open

As we can see, it is the dummy_open() function that gets called for d[0], and the yeah_open() function for d[1]. It works!!

Now let’s see how this is used in the kernel.

Back to our module!

There is a structure in the kernel whose behavior is very similar. It hides the inner workings of a device behind a fixed set of operations, so that userspace only ever sees the usual syscalls, such as read.

This structure is found in the kernel sources, in the file /usr/src/linux/include/linux/fs.h, and is called file_operations. Here is its declaration:

struct file_operations {
	struct module *owner;
	loff_t (*llseek) (struct file *, loff_t, int);
	ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
	ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
	int (*iopoll)(struct kiocb *kiocb, struct io_comp_batch *,
			unsigned int flags);
	int (*iterate) (struct file *, struct dir_context *);
	int (*iterate_shared) (struct file *, struct dir_context *);
	__poll_t (*poll) (struct file *, struct poll_table_struct *);
	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
	int (*mmap) (struct file *, struct vm_area_struct *);
	unsigned long mmap_supported_flags;
	int (*open) (struct inode *, struct file *);
	int (*flush) (struct file *, fl_owner_t id);
	int (*release) (struct inode *, struct file *);
	int (*fsync) (struct file *, loff_t, loff_t, int datasync);
	int (*fasync) (int, struct file *, int);
	int (*lock) (struct file *, int, struct file_lock *);
	ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
	unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
	int (*check_flags)(int);
	int (*flock) (struct file *, int, struct file_lock *);
	ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
	ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
	int (*setlease)(struct file *, long, struct file_lock **, void **);
	long (*fallocate)(struct file *file, int mode, loff_t offset,
			  loff_t len);
	void (*show_fdinfo)(struct seq_file *m, struct file *f);
#ifndef CONFIG_MMU
	unsigned (*mmap_capabilities)(struct file *);
#endif
	ssize_t (*copy_file_range)(struct file *, loff_t, struct file *,
			loff_t, size_t, unsigned int);
	loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
				   struct file *file_out, loff_t pos_out,
				   loff_t len, unsigned int remap_flags);
	int (*fadvise)(struct file *, loff_t, loff_t, int);
} __randomize_layout;

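As we can see, this structure consists almost entirely of function pointers (the only exceptions here are owner and mmap_supported_flags). Each pointer is a placeholder that a driver may fill in with its own implementation. It works in the same way as our dummy structure.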

To build our driver, we create the file brrr.c:

#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/module.h>
#include <linux/fs.h>           // Required for the file_operations structure
#include <linux/uaccess.h>      // Required for the copy_to_user() function

MODULE_DESCRIPTION("BRRRRRRRRRRRR");
MODULE_AUTHOR("CFS");
MODULE_LICENSE("BRRRRRRRRRRRRRRRRR");   // Joke string: not a recognized license, so the kernel is marked as tainted

#define DEVNAME "brrr"

static char brrr[]="brrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr\n" ;
static int brrrlen ;
static int major ;

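static ssize_t brrr_read( struct file*, char __user*, size_t, loff_t* ) ;

static struct file_operations fops = {
	.owner = THIS_MODULE,           // Keeps the module loaded while the device is open
	.read = brrr_read
};

static int
brrr_init(void) {
	major = register_chrdev(0, DEVNAME, &fops ) ;
	if ( major < 0 ) {
		printk(KERN_ERR "not happy :(\n" ) ;
		return major;
	}
	brrrlen=strlen(brrr) ;
	return 0 ;                      // Report success to the module loader
}

static void
brrr_exit(void)
{
	if (major != 0 ) {
		unregister_chrdev( major, DEVNAME ) ;
	}
}

static ssize_t
brrr_read( struct file* fp, char __user* buf, size_t len, loff_t* off)
{
	// Never copy more than the caller's buffer can hold
	size_t n = len < (size_t)brrrlen ? len : (size_t)brrrlen ;

	if ( copy_to_user( buf, brrr, n ) != 0 ) {
		printk(KERN_ERR "Oh no !!! \n" ) ;
		return -EFAULT ;            // Errors are returned as negative errno values
	}
	return n ;                      // Never 0, so the reader never sees end-of-file
}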

module_init(brrr_init);
module_exit(brrr_exit);

The overall structure of the program is the same as the module from the previous article. This time, however, the module has three functions:

  • brrr_init(): the module initialization
  • brrr_exit(): the function called when the module is unloaded
  • brrr_read(): which implements the read() function.

This module corresponds to a char device. It registers with the kernel as “brrr” and must provide its char major. Since we don’t have an assigned char major, we let the kernel choose by passing 0 to the register_chrdev function. The major assigned to us is returned by the function and stored in the global variable major.

To indicate which of our functions is called during a read syscall, we also pass the address of the fops variable, of type struct file_operations. This global variable is initialized at the top of the code.

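It is ultimately the brrr_read function that does the work, through a call to copy_to_user, which copies data into userspace. Why not write directly through the buf pointer? Because we are in kernel space and buf is a userspace address supplied by the calling process: it may be invalid or not currently mapped. copy_to_user checks the address, handles any fault safely, and returns the number of bytes it could not copy (0 on success).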

To compile all this, nothing special! We use the same method as before!

After verifying that the module loads correctly into memory with

$grep brrr /proc/modules

we can retrieve its char major:

$sudo grep brrr /proc/devices
509 brrr

And create the node in /dev:

$sudo mknod /dev/brrr c 509 0

All that’s left is to test:

$sudo cat /dev/brrr
brrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
brrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
brrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
brrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
brrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
brrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
brrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
brrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
brrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
brrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
brrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
brrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
brrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
brrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
brrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
brrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr

It wooooooorks! Thank you Imil!!!!!!!!!

* Walks away vibrating :)

Rancune


This post is licensed under CC BY 4.0 by the author.