Linux Kernel Study Notes: An Analysis of the poll() System Call

1. Background

Kernel version: Linux 4.19

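poll and select differ very little. The main difference is the number of descriptors that can be monitored: select is capped at 1024 by default (the limit is imposed by glibc through the macro __FD_SETSIZE), whereas poll has no such fixed cap. The only bound on nfds is the process's RLIMIT_NOFILE, which do_sys_poll() checks.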
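The RLIMIT_NOFILE bound can be observed from user space. The following is a minimal sketch, not taken from the original notes: the helper name poll_with_nofile_limit() is hypothetical, and it assumes the process is allowed to lower and then restore its own soft limit. It relies on two documented behaviours of poll(2): entries with a negative fd are ignored, and nfds above RLIMIT_NOFILE fails with EINVAL.

```c
#include <errno.h>
#include <poll.h>
#include <stdlib.h>
#include <sys/resource.h>

/*
 * Hypothetical helper: temporarily lowers the soft RLIMIT_NOFILE to `limit`,
 * calls poll() with `nfds` ignored entries (fd = -1) and a zero timeout,
 * then restores the limit. Returns poll()'s return value and stores its
 * errno in *err. Returns -2 if the setup itself fails.
 */
static int poll_with_nofile_limit(rlim_t limit, nfds_t nfds, int *err)
{
	struct rlimit saved, lowered;
	struct pollfd *fds;
	nfds_t i;
	int ret;

	if (getrlimit(RLIMIT_NOFILE, &saved) != 0)
		return -2;
	lowered = saved;
	if (limit < lowered.rlim_cur)
		lowered.rlim_cur = limit;	/* lowering the soft limit is always allowed */
	if (setrlimit(RLIMIT_NOFILE, &lowered) != 0)
		return -2;

	fds = calloc(nfds ? nfds : 1, sizeof(*fds));
	if (!fds) {
		setrlimit(RLIMIT_NOFILE, &saved);
		return -2;
	}
	for (i = 0; i < nfds; i++)
		fds[i].fd = -1;			/* negative fds are skipped by poll() */

	errno = 0;
	ret = poll(fds, nfds, 0);
	*err = errno;

	free(fds);
	setrlimit(RLIMIT_NOFILE, &saved);	/* put the original soft limit back */
	return ret;
}
```

With a soft limit of 64, a call with nfds = 64 should return 0 (nothing ready, zero timeout), while nfds = 65 should fail with EINVAL. So poll is bounded, but by an adjustable resource limit rather than by a compile-time constant.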

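There is also the ppoll system call. Its timeout is a struct timespec, so it can be given with nanosecond resolution, and its fourth argument, sigmask, is a signal mask installed for the duration of the call, so the wait cannot be interrupted by the signals in that mask. For a process waiting in ppoll whose sigmask contains SIGINT, Ctrl+C cannot end the wait: the signal stays pending until ppoll returns, for example on timeout.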
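The effect of sigmask can be checked without pressing Ctrl+C. The sketch below is an illustration under assumptions, not code from the original notes: the helper names are hypothetical, it uses a pipe instead of a device, and it makes SIGINT pending with raise() while the signal is blocked. It then calls ppoll() once with SIGINT in sigmask and once with an empty sigmask.

```c
#define _GNU_SOURCE		/* ppoll() is a Linux/glibc extension */
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static volatile sig_atomic_t got_sigint;

static void on_sigint(int signo)
{
	(void)signo;
	got_sigint = 1;
}

/* Wait on fd with ppoll(); SIGINT is in the temporary mask only if block_sigint != 0. */
static int wait_readable(int fd, long timeout_ms, int block_sigint)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	struct timespec ts = {
		.tv_sec = timeout_ms / 1000,
		.tv_nsec = (timeout_ms % 1000) * 1000000L,
	};
	sigset_t mask;

	sigemptyset(&mask);
	if (block_sigint)
		sigaddset(&mask, SIGINT);
	return ppoll(&pfd, 1, &ts, &mask);
}

/*
 * Returns 1 if a pending SIGINT (a) did not interrupt ppoll() while it was in
 * sigmask and (b) interrupted it with EINTR once it was left out; 0 otherwise.
 */
static int ppoll_sigmask_demo(void)
{
	struct sigaction sa, old_sa;
	sigset_t block, old_mask;
	int p[2], r1, seen1, r2, e2;

	if (pipe(p) != 0)
		return 0;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sigint;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGINT, &sa, &old_sa);

	sigemptyset(&block);
	sigaddset(&block, SIGINT);
	sigprocmask(SIG_BLOCK, &block, &old_mask);

	got_sigint = 0;
	raise(SIGINT);				/* blocked, so it just stays pending */

	r1 = wait_readable(p[0], 50, 1);	/* SIGINT masked: runs to the timeout */
	seen1 = got_sigint;
	r2 = wait_readable(p[0], 50, 0);	/* SIGINT unmasked: interrupted */
	e2 = errno;

	sigprocmask(SIG_SETMASK, &old_mask, NULL);
	sigaction(SIGINT, &old_sa, NULL);
	close(p[0]);
	close(p[1]);

	return r1 == 0 && !seen1 && r2 == -1 && e2 == EINTR && got_sigint;
}
```

Note that sigmask replaces the whole signal mask for the duration of the call, so signals that are not in it become deliverable even if the process had blocked them before.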

This article is mainly a set of notes for myself; most of the analysis is written as comments in the code.

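2. Using poll in driver development

A user-space program would look like the following pseudocode:

	#include <fcntl.h>
	#include <poll.h>
	#include <unistd.h>

	// Open the device file
	int fd = open("/dev/xxx", O_RDWR);

	// Create a struct pollfd and fill in its members
	struct pollfd fds[1];
	fds[0].fd = fd;			// descriptor to watch
	fds[0].events = POLLIN;		// return when there is data to read

	ret = poll(fds, 1, 5000);

	// Case 1: only one descriptor was passed in, so there is nothing to tell apart; a return value > 0 means it is ready
	// = 0 means timeout, < 0 means failure
	if (ret > 0) {			// poll succeeded and the driver reported an event
		ret = read(fd, &data, sizeof(data));
	}

	// Case 2: several descriptors are monitored, so find out which device the event occurred on
	if (ret > 0) {
		for (i = 0; i < sizeof(fds) / sizeof(fds[0]); i++) {	// every descriptor has to be checked
			if (fds[i].revents & POLLIN) {
				// handle it
			}
		}
	}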

The driver must implement the corresponding poll:

unsigned int imx6uirq_poll(struct file *filp, struct poll_table_struct *wait)
{
	unsigned int mask = 0;
	struct imx6uirq_dev *dev = (struct imx6uirq_dev *)filp->private_data;

	// g, poll_wait() ends up calling the pt->_qproc that was passed in, i.e. __pollwait(), which adds current to the wait queue dev->r_wait
	// g, do_poll() then calls schedule to switch to another process and go to sleep
	// g, until wake_up_interruptible(&dev->r_wait) is called somewhere and wakes the process up
	// g, the actual wake-up is done in pollwake(), the callback bound to the wait queue entry, via default_wake_function()->..->try_to_wake_up()
	poll_wait(filp, &dev->r_wait, wait);	/* add the wait queue head to the poll_table */

	if (atomic_read(&dev->releasekey)) {	/* key pressed */
		mask = POLLIN | POLLRDNORM;	/* return POLLIN */
	}
	return mask;
}
...
...
/* Device operations */
static struct file_operations imx6uirq_fops = {
	.owner = THIS_MODULE,
	.open = imx6uirq_open,
	.read = imx6uirq_read,
	.poll = imx6uirq_poll,
};

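The driver's poll has to return the monitored event, here POLLIN, once some device event has occurred. The event is whatever you define: it can be waiting for a key press, and returning immediately without waiting for anything also works, but normally it is used to check whether data is ready. Returning the event tells user space that it happened, so the poll call stops waiting and returns.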

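3. The poll() system call

On system calls in general (still being written up): Linux Kernel Study Notes: System Calls on the ARMv8 Architecture

The system call is implemented as follows: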

fs/select.c:
SYSCALL_DEFINE3(poll, struct pollfd __user *, ufds, unsigned int, nfds,
		int, timeout_msecs)
{
	struct timespec64 end_time, *to = NULL;
	int ret;

	if (timeout_msecs >= 0) {
		to = &end_time;
		// g, convert timeout_msecs into a struct timespec64 that holds the absolute end time
		poll_select_set_timeout(to, timeout_msecs / MSEC_PER_SEC,
			NSEC_PER_MSEC * (timeout_msecs % MSEC_PER_SEC));
	}

	ret = do_sys_poll(ufds, nfds, to);

	// g, the system call was interrupted by a signal (the call itself did not fail, it was interrupted)
	if (ret == -EINTR) {
		struct restart_block *restart_block;

		restart_block = &current->restart_block;
		restart_block->fn = do_restart_poll;
		restart_block->poll.ufds = ufds;
		restart_block->poll.nfds = nfds;

		if (timeout_msecs >= 0) {
			restart_block->poll.tv_sec = end_time.tv_sec;
			restart_block->poll.tv_nsec = end_time.tv_nsec;
			restart_block->poll.has_timeout = 1;
		} else
			restart_block->poll.has_timeout = 0;

		// g, returning ERESTART_RESTARTBLOCK tells the kernel that this system call may have to be restarted; user space never sees this value
		// g, the restart information is stored in current->restart_block
		// g, note: when does the restart happen? As far as I can tell, on the system call exit path in do_notify_resume()->do_signal(). If a handler is going to run for the signal, the call fails with EINTR (for this error code SA_RESTART makes no difference). If no handler runs, regs->pc is moved back onto the system call instruction and the call number is replaced with restart_syscall, which calls restart_block->fn, i.e. do_restart_poll().
		ret = -ERESTART_RESTARTBLOCK;
	}
	return ret;
}

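The system call runs in three steps:

1. Convert the timeout passed in by the user into a struct timespec64.
2. Call do_sys_poll(), the main function that handles poll.
3. If the call was interrupted by a signal, set up the restart.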
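The arithmetic of the first step is easy to reproduce. This is a small user-space sketch of the split into seconds and nanoseconds; the function name msecs_to_timespec() is hypothetical, and the two constants are redefined locally with the values the kernel uses. The kernel additionally adds the current time to obtain an absolute end time, which is left out here.

```c
#include <time.h>

#define MSEC_PER_SEC	1000L
#define NSEC_PER_MSEC	1000000L

/* Split a non-negative millisecond timeout the same way sys_poll() does. */
static struct timespec msecs_to_timespec(int timeout_msecs)
{
	struct timespec ts;

	ts.tv_sec = timeout_msecs / MSEC_PER_SEC;			/* whole seconds */
	ts.tv_nsec = NSEC_PER_MSEC * (timeout_msecs % MSEC_PER_SEC);	/* remainder in ns */
	return ts;
}
```

For example, the 5000 ms used in the user-space pseudocode becomes 5 s and 0 ns, and 1500 ms becomes 1 s and 500000000 ns.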

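The first step is easy to understand. The third step concerns restarting the system call and involves do_notify_resume(), a function that entry.S calls when a system call exits. That function is fairly complex: whether to reschedule (need_resched), whether to restart the system call, and stopping for a debugger (intercepting the signal, stopping here and notifying the debugger) are all handled there. I will not analyze it for now and will write a separate note when I have time.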

The real work is done by do_sys_poll():

fs/select.c:
static int do_sys_poll(struct pollfd __user *ufds, unsigned int nfds,
		struct timespec64 *end_time)
{
	struct poll_wqueues table;
	int err = -EFAULT, fdcount, len, size;
	/* Allocate small arguments on the stack to save memory and be
	   faster - use long to make sure the buffer is aligned properly
	   on 64 bit archs to avoid unaligned access */
	long stack_pps[POLL_STACK_ALLOC/sizeof(long)];	// g, 256 / sizeof(long) = 256 / 8 = 32, so this is a long[32]: a 256-byte buffer allocated on the kernel stack
	struct poll_list *const head = (struct poll_list *)stack_pps;	// g, the header of a struct poll_list is a pointer plus an int, 16 bytes on 64-bit including padding
	struct poll_list *walk = head;
	unsigned long todo = nfds;	// g, second argument of poll() in user space: the number of files to monitor

	if (nfds > rlimit(RLIMIT_NOFILE))	// g, this clearly limits how many files poll() can monitor to the maximum number of files the process may open. So why does everyone say poll() has no limit? Presumably because there is no fixed FD_SETSIZE-style cap, only this adjustable resource limit.
		return -EINVAL;

	// g, the copy below can be summarized as:
	// 1. Build a linked list of struct poll_list.
	// 2. Every node stores part of the struct pollfd array passed in by the user, and no node may take more than one PAGE_SIZE (kernel page). When a node is full, a new node is created, linked in, and given the memory it needs.
	// 3. The head node is the exception: its memory comes from the kernel stack and is only long[32]. All other nodes are allocated from the kernel heap.
	// 4. After the copy, everything the user passed in (most importantly the struct pollfd entries) is in kernel space and reachable through the struct poll_list list.
	len = min_t(unsigned int, nfds, N_STACK_PPS);	// g, N_STACK_PPS is how many struct pollfd fit into the stack_pps[] buffer allocated above
	for (;;) {
		walk->next = NULL;
		walk->len = len;
		if (!len)	// g, leave once len is 0
			break;

		// g, ufds is the first argument passed in by the user, a struct pollfd *
		if (copy_from_user(walk->entries, ufds + nfds-todo,	// g, a non-zero return value is the number of bytes that could not be copied
					sizeof(struct pollfd) * walk->len))	// g, this poll_list node has room for len entries, so copy len of them in one go
			goto out_fds;

		todo -= walk->len;	// g, how many are left
		if (!todo)		// g, nothing left, so we are done
			break;

		len = min(todo, POLLFD_PER_PAGE);	// g, this macro is how many struct pollfd fit into one kernel page together with the node header. In other words, everything in one poll_list node must fit into a single kernel page and never straddle pages
		size = sizeof(struct poll_list) + sizeof(struct pollfd) * len;
		walk = walk->next = kmalloc(size, GFP_KERNEL);
		if (!walk) {
			err = -ENOMEM;
			goto out_fds;
		}
	}

	// g, initializes a few fields; the important one is table.pt._qproc = __pollwait
	poll_initwait(&table);
	fdcount = do_poll(head, &table, end_time);	// g, call do_poll; this is where the polling happens, and it ends up calling the poll methods in the drivers
	poll_freewait(&table);

	for (walk = head; walk; walk = walk->next) {	// g, copy the results back to user memory
		struct pollfd *fds = walk->entries;
		int j;

		for (j = 0; j < walk->len; j++, ufds++)
			if (__put_user(fds[j].revents, &ufds->revents))
				goto out_fds;
	}

	err = fdcount;
out_fds:
	walk = head->next;
	while (walk) {
		struct poll_list *pos = walk;
		walk = walk->next;
		kfree(pos);
	}

	return err;
}

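This function also has three steps:

1. Copy the information (struct pollfd) from user space into the kernel.
2. Initialize a struct poll_wqueues, bind its function pointer to __pollwait, and run the handler do_poll().
3. Copy the kernel's information (the processed struct pollfd) back to user space.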

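3.1 User space -> kernel

The relevant part is this code:

fs/select.c:
static int do_sys_poll(struct pollfd __user *ufds, unsigned int nfds,
		struct timespec64 *end_time)
{
	struct poll_wqueues table;
	int err = -EFAULT, fdcount, len, size;
	/* Allocate small arguments on the stack to save memory and be
	   faster - use long to make sure the buffer is aligned properly
	   on 64 bit archs to avoid unaligned access */
	long stack_pps[POLL_STACK_ALLOC/sizeof(long)];	// g, 256 / sizeof(long) = 256 / 8 = 32, so this is a long[32]: a 256-byte buffer allocated on the kernel stack
	struct poll_list *const head = (struct poll_list *)stack_pps;	// g, the header of a struct poll_list is a pointer plus an int, 16 bytes on 64-bit including padding
	struct poll_list *walk = head;
	unsigned long todo = nfds;	// g, second argument of poll() in user space: the number of files to monitor

	if (nfds > rlimit(RLIMIT_NOFILE))	// g, this clearly limits how many files poll() can monitor to the maximum number of files the process may open. So why does everyone say poll() has no limit? Presumably because there is no fixed FD_SETSIZE-style cap, only this adjustable resource limit.
		return -EINVAL;

	// g, the copy below can be summarized as:
	// 1. Build a linked list of struct poll_list.
	// 2. Every node stores part of the struct pollfd array passed in by the user, and no node may take more than one PAGE_SIZE (kernel page). When a node is full, a new node is created, linked in, and given the memory it needs.
	// 3. The head node is the exception: its memory comes from the kernel stack and is only long[32]. All other nodes are allocated from the kernel heap.
	// 4. After the copy, everything the user passed in (most importantly the struct pollfd entries) is in kernel space and reachable through the struct poll_list list.
	len = min_t(unsigned int, nfds, N_STACK_PPS);	// g, N_STACK_PPS is how many struct pollfd fit into the stack_pps[] buffer allocated above
	for (;;) {
		walk->next = NULL;
		walk->len = len;
		if (!len)	// g, leave once len is 0
			break;

		// g, ufds is the first argument passed in by the user, a struct pollfd *
		if (copy_from_user(walk->entries, ufds + nfds-todo,	// g, a non-zero return value is the number of bytes that could not be copied
					sizeof(struct pollfd) * walk->len))	// g, this poll_list node has room for len entries, so copy len of them in one go
			goto out_fds;

		todo -= walk->len;	// g, how many are left
		if (!todo)		// g, nothing left, so we are done
			break;

		len = min(todo, POLLFD_PER_PAGE);	// g, this macro is how many struct pollfd fit into one kernel page together with the node header. In other words, everything in one poll_list node must fit into a single kernel page and never straddle pages
		size = sizeof(struct poll_list) + sizeof(struct pollfd) * len;
		walk = walk->next = kmalloc(size, GFP_KERNEL);
		if (!walk) {
			err = -ENOMEM;
			goto out_fds;
		}
	}
	......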

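This code does several things:

1. It builds a linked list of struct poll_list. The head node comes first; it is created as a local variable and so lives on the kernel stack.
2. Every node stores part of the struct pollfd array passed in by the user, and no node may take more than one PAGE_SIZE (kernel page). When a node is full, a new struct poll_list node is created, linked into the list, and given the memory it needs.
3. The head node is the exception: its memory comes from the kernel stack and is only long[32]. All other nodes are allocated from the kernel heap.
4. After the copy, everything the user passed in (most importantly the struct pollfd entries) is in kernel space and reachable through the struct poll_list list. The entries field of every node holds one or more struct pollfd.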

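The node structure is defined as follows:

fs/select.c:
struct poll_list {
	struct poll_list *next;		// g, points to the next node
	int len;			// g, how many struct pollfd this node holds
	struct pollfd entries[0];	// g, zero-length array: it covers however much memory was allocated behind the header
};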
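To get a feel for how the list is sized, the arithmetic can be mirrored in user space. The following is a sketch under stated assumptions, not kernel code: struct poll_list_mirror is a hypothetical copy of the node header used only for sizeof, POLL_STACK_ALLOC is taken as 256, and the page size is assumed to be 4 KiB. The function count_nodes() follows the same min/subtract loop as do_sys_poll().

```c
#include <poll.h>
#include <stddef.h>

#define POLL_STACK_ALLOC	256	/* value used in the 4.19 source */
#define ASSUMED_PAGE_SIZE	4096	/* assumption: 4 KiB pages */

/* User-space mirror of the kernel's node header, for size arithmetic only. */
struct poll_list_mirror {
	struct poll_list_mirror *next;
	int len;
	struct pollfd entries[];
};

/* Mirrors N_STACK_PPS: pollfd slots available in the on-stack head node. */
static size_t stack_capacity(void)
{
	return (POLL_STACK_ALLOC - sizeof(struct poll_list_mirror)) /
		sizeof(struct pollfd);
}

/* Mirrors POLLFD_PER_PAGE: pollfd slots available in one kmalloc'ed node. */
static size_t page_capacity(void)
{
	return (ASSUMED_PAGE_SIZE - sizeof(struct poll_list_mirror)) /
		sizeof(struct pollfd);
}

/* Number of poll_list nodes do_sys_poll() would build for nfds entries. */
static size_t count_nodes(size_t nfds)
{
	size_t nodes = 1;	/* the on-stack head always exists */
	size_t len = nfds < stack_capacity() ? nfds : stack_capacity();
	size_t todo = nfds - len;

	while (todo) {
		len = todo < page_capacity() ? todo : page_capacity();
		todo -= len;
		nodes++;
	}
	return nodes;
}
```

On a typical 64-bit build the header is 16 bytes and a struct pollfd is 8 bytes, so the head holds (256 - 16) / 8 = 30 entries and each further node holds (4096 - 16) / 8 = 510. Polling 31 descriptors therefore already needs one kmalloc().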

3.2 poll_initwait() and do_poll()

First, poll_initwait(). This function initializes a struct poll_wqueues:

fs/select.c:
void poll_initwait(struct poll_wqueues *pwq)
{
	init_poll_funcptr(&pwq->pt, __pollwait);	// g, sets pwq->pt._qproc = __pollwait
	pwq->polling_task = current;			// g, set to current; its wait entries are added to the wait queues later
	pwq->triggered = 0;
	pwq->error = 0;
	pwq->table = NULL;
	pwq->inline_index = 0;
}

--->

include/linux/poll.h:
static inline void init_poll_funcptr(poll_table *pt, poll_queue_proc qproc)
{
	pt->_qproc = qproc;
	pt->_key   = ~(__poll_t)0; /* all events enabled */
}

A key function appears here: __pollwait(). It is used later to put the current process (pwq->polling_task = current;) on a wait queue. More on that below; set it aside for now.
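The pattern behind this is a callback pointer stored in a small table that is embedded in a larger structure. The following user-space mock is only an illustration of that pattern: all mock_ names are hypothetical, an int counter stands in for a real wait queue, and mock_container_of is a simplified version of the kernel macro.

```c
#include <stddef.h>

struct mock_poll_table;
typedef void (*mock_queue_proc)(struct mock_poll_table *pt, int *wait_queue);

struct mock_poll_table {
	mock_queue_proc qproc;		/* plays the role of poll_table._qproc */
};

struct mock_poll_wqueues {
	int registered;			/* how many wait queues we joined */
	struct mock_poll_table pt;	/* embedded, like poll_wqueues.pt */
};

#define mock_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Plays the role of __pollwait(): recover the outer struct and register. */
static void mock_pollwait(struct mock_poll_table *pt, int *wait_queue)
{
	struct mock_poll_wqueues *pwq =
		mock_container_of(pt, struct mock_poll_wqueues, pt);

	(*wait_queue)++;		/* stands in for add_wait_queue() */
	pwq->registered++;
}

/* Plays the role of poll_initwait(). */
static void mock_poll_initwait(struct mock_poll_wqueues *pwq)
{
	pwq->registered = 0;
	pwq->pt.qproc = mock_pollwait;
}

/* Plays the role of poll_wait(): a no-op once qproc has been cleared. */
static void mock_poll_wait(int *wait_queue, struct mock_poll_table *pt)
{
	if (pt && pt->qproc && wait_queue)
		pt->qproc(pt, wait_queue);
}
```

A "driver" only ever calls mock_poll_wait(). Whether that registers anything is decided by the caller, which can clear qproc to turn later calls into no-ops. do_poll() uses the same switch by setting pt->_qproc to NULL.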

Next, do_poll(head, &table, end_time) is executed. It is implemented as follows:

fs/select.c:
static int do_poll(struct poll_list *list, struct poll_wqueues *wait,
		   struct timespec64 *end_time)
{
	poll_table* pt = &wait->pt;	// g, wait was initialized in the previous step
	ktime_t expire, *to = NULL;
	int timed_out = 0, count = 0;
	u64 slack = 0;
	__poll_t busy_flag = net_busy_loop_on() ? POLL_BUSY_LOOP : 0;
	unsigned long busy_start = 0;

	/* Optimise the no-wait case */
	if (end_time && !end_time->tv_sec && !end_time->tv_nsec) {
		pt->_qproc = NULL;
		timed_out = 1;
	}

	if (end_time && !timed_out)
		slack = select_estimate_accuracy(end_time);

	for (;;) {
		struct poll_list *walk;
		bool can_busy_loop = false;

		// g, list was also set up by the caller: copy_from_user copied the user's struct pollfd, and all struct poll_list nodes form a linked list
		// g, now walk the list: visit every node, then every entry of the node's pollfd array
		for (walk = list; walk != NULL; walk = walk->next) {	// g, walk the whole list
			struct pollfd * pfd, * pfd_end;

			pfd = walk->entries;
			pfd_end = pfd + walk->len;
			for (; pfd != pfd_end; pfd++) {	// g, visit every struct pollfd in this poll_list node
				/*
				 * Fish for events. If we found one, record it
				 * and kill poll_table->_qproc, so we don't
				 * needlessly register any other waiters after
				 * this. They'll get immediately deregistered
				 * when we break out and return.
				 */
				// g, pt->_qproc was initialized to __pollwait in an earlier step. The driver's poll calls this pt->_qproc (i.e. __pollwait), which adds the current process to a wait queue
				// g, with several fds (i.e. several device drivers), each driver should own its own wait queue and current is added to each of them; whichever fd's condition becomes true wakes the process through its queue
				// g, __pollwait sets the wake-up callback of the wait queue entry to pollwake(), which ends up in __pollwake(). That function sets pwq->triggered = 1
				// g, check whether an expected event fired, i.e. AND the mask returned by the driver with pfd->events (POLLERR and POLLHUP are always included)
				// g, if an expected event occurred, pfd->revents is updated with the events that really happened
				if (do_pollfd(pfd, pt, &can_busy_loop,
					      busy_flag)) {
					count++;
					pt->_qproc = NULL;	// g, once an event has been found there is no need to join any more wait queues, so the pointer is no longer __pollwait
					/* found something, stop busy polling */
					busy_flag = 0;
					can_busy_loop = false;	// g, an expected event fired, so stop busy looping
				}
			}
		}
		/*
		 * All waiters have already been registered, so don't provide
		 * a poll_table->_qproc to them on the next loop iteration.
		 */
		pt->_qproc = NULL;	// g, during the whole poll, each fd is added to its wait queue at most once, on the first pass. After that we either get woken up or time out
		if (!count) {		// g, count == 0 means no fd has an expected event, so check for pending signals
			count = wait->error;
			if (signal_pending(current))	// g, only checks whether the current process has a pending signal (nothing is handled here). Non-zero means a signal has to be handled, so the system call must exit
				count = -EINTR;		// g, the system call was interrupted by a signal
		}
		// g, if count is non-zero, something happened (an event on a device, or a signal that needs handling; either way the system call has to return), so poll_schedule_timeout is not called and we do not sleep
		// g, in other words, although do_pollfd has already put current on the wait queues, we never call schedule, so we do not sleep
		if (count || timed_out)
			break;

		/* only if found POLL_BUSY_LOOP sockets && not out of time */
		if (can_busy_loop && !need_resched()) {	// g, busy polling is possible and the need_resched flag is not set (the kernel sets it explicitly when the task should be rescheduled)
			if (!busy_start) {
				busy_start = busy_loop_current_time();
				continue;
			}
			if (!busy_loop_timeout(busy_start))
				continue;
		}
		busy_flag = 0;

		/*
		 * If this is the first loop and we have a timeout
		 * given, then we convert to ktime_t and set the to
		 * pointer to the expiry value.
		 */
		if (end_time && !to) {
			expire = timespec64_to_ktime(*end_time);
			to = &expire;
		}

		if (!poll_schedule_timeout(wait, TASK_INTERRUPTIBLE, to, slack))	// g, sleep and wait to be woken by an event
			timed_out = 1;	// g, set to 1 after a timeout
	}
	return count;
}

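This function mainly does the following:

1. Walk over every struct pollfd.
2. For each struct pollfd, call do_pollfd() once. It determines whether an event has occurred, and it calls into the dev_poll() function implemented by the device driver.
3. If no event is detected after all pollfds have been scanned, call poll_schedule_timeout() to block and give up the CPU.

The key piece is do_pollfd(), which is implemented as follows: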

fs/select.c:
static inline __poll_t do_pollfd(struct pollfd *pollfd, poll_table *pwait,
				     bool *can_busy_poll,
				     __poll_t busy_flag)
{
	int fd = pollfd->fd;	// g, normally an fd that user space has already opened
	__poll_t mask = 0, filter;
	struct fd f;

	if (fd < 0)
		goto out;
	mask = EPOLLNVAL;
	f = fdget(fd);		// g, look up the struct file of the open fd (returned wrapped in a struct fd; f.file is the struct file)
	if (!f.file)
		goto out;

	/* userland u16 ->events contains POLL... bitmap */
	filter = demangle_poll(pollfd->events) | EPOLLERR | EPOLLHUP;	// g, pollfd->events holds the events user space asked for; only these (plus error and hang-up) are of interest
	pwait->_key = filter | busy_flag;
	// g, this ends up calling file->f_op->poll. file->f_op is set at open() time from inode->i_fop; for a character device chrdev_open() then swaps in the cdev's ops, i.e. the file_operations the driver registered with cdev_init()/cdev_add()
	// g, the mask is finally returned by the ops of the character device driver
	// g, one more important point: the driver's poll calls poll_wait()->[pwait->_qproc()], i.e. the __pollwait function registered earlier
	mask = vfs_poll(f.file, pwait);
	if (mask & busy_flag)
		*can_busy_poll = true;
	mask &= filter;		// g, keep only the requested events
	fdput(f);

out:
	/* ... and so does ->revents */
	pollfd->revents = mangle_poll(mask);	// g, set the result; user space checks revents to find out which fd had an event
	return mask;
}
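Three branches of do_pollfd() are visible from user space: a negative fd is skipped, an fd that is not open yields POLLNVAL, and POLLERR/POLLHUP are reported even when events does not ask for them. The sketch below illustrates this with pipes; it is an illustration with hypothetical helper names, not code from the original notes.

```c
#include <poll.h>
#include <unistd.h>

/* poll() a single fd with a zero timeout; returns revents, *nready gets poll()'s result. */
static short revents_of(int fd, short events, int *nready)
{
	struct pollfd pfd = { .fd = fd, .events = events, .revents = 0 };

	*nready = poll(&pfd, 1, 0);
	return pfd.revents;
}

/* Returns the read end of a pipe whose write end is already closed, or -1. */
static int hung_up_pipe(void)
{
	int p[2];

	if (pipe(p) != 0)
		return -1;
	close(p[1]);
	return p[0];
}

/* Returns a descriptor number that was just closed, i.e. is not open. */
static int closed_fd(void)
{
	int p[2];

	if (pipe(p) != 0)
		return -1;
	close(p[0]);
	close(p[1]);
	return p[0];
}
```

Polling fd = -1 should give revents == 0 and a return value of 0. Polling the hung-up pipe with events = 0 should still report POLLHUP, because the filter always contains EPOLLERR | EPOLLHUP. Polling the closed descriptor should report POLLNVAL, and in both of the last two cases poll() counts the entry and returns 1.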

This function calls vfs_poll() in the virtual file system, which ends up in the dev_poll() function bound by the driver:

include/linux/poll.h:
static inline __poll_t vfs_poll(struct file *file, struct poll_table_struct *pt)
{
	if (unlikely(!file->f_op->poll))
		return DEFAULT_POLLMASK;
	return file->f_op->poll(file, pt);
}
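The DEFAULT_POLLMASK branch means that a file whose driver provides no ->poll is always reported as readable and writable. A quick user-space check, as a sketch: it assumes /dev/null exists and that its driver has no ->poll method, which is my reading of drivers/char/mem.c and not something stated in these notes. The helper name poll_dev_null() is hypothetical.

```c
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/*
 * Polls /dev/null for POLLIN | POLLOUT with a zero timeout.
 * Returns revents, or -1 if the file cannot be opened or poll() fails.
 */
static int poll_dev_null(void)
{
	struct pollfd pfd;
	int fd, ret;

	fd = open("/dev/null", O_RDWR);
	if (fd < 0)
		return -1;

	pfd.fd = fd;
	pfd.events = POLLIN | POLLOUT;
	pfd.revents = 0;
	ret = poll(&pfd, 1, 0);
	close(fd);

	return ret < 0 ? -1 : pfd.revents;
}
```

The returned revents is expected to contain both POLLIN and POLLOUT right away, without any waiting.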

At this point it is worth revisiting what dev_poll() in the driver has to do. Going back to the beginning of the article, the dev_poll() implemented in the driver looks like this:

...
...
struct imx6uirq_dev {
	...
	wait_queue_head_t r_wait;	/* read wait queue head */
	...
};
struct imx6uirq_dev imx6uirq;	/* irq device */
...
...
void timer_function(unsigned long arg)
{
	......
	/* wake up the process */
	if (atomic_read(&dev->releasekey)) {	/* one complete key press */
		/* wake_up(&dev->r_wait); */
		wake_up_interruptible(&dev->r_wait);	// when the event occurs, wake the processes on the wait queue
	}
	......
}
...
...
unsigned int imx6uirq_poll(struct file *filp, struct poll_table_struct *wait)
{
	unsigned int mask = 0;
	struct imx6uirq_dev *dev = (struct imx6uirq_dev *)filp->private_data;

	// g, poll_wait() ends up calling the pt->_qproc that was passed in, i.e. __pollwait(), which adds current to the wait queue dev->r_wait
	// g, do_poll() then calls schedule to switch to another process and go to sleep
	// g, until wake_up_interruptible(&dev->r_wait) is called somewhere and wakes the process up
	// g, the actual wake-up is done in pollwake(), the callback bound to the wait queue entry, via default_wake_function()->..->try_to_wake_up()
	poll_wait(filp, &dev->r_wait, wait);	/* add the wait queue head to the poll_table */

	if (atomic_read(&dev->releasekey)) {	/* key pressed */
		mask = POLLIN | POLLRDNORM;	/* return POLLIN */
	}
	return mask;
}
...
...
static int xx_init(void)
{
	......
	/* initialize the wait queue head */
	init_waitqueue_head(&imx6uirq.r_wait);
	......
	return 0;
}
module_init(xx_init);

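So what does the driver do?

1. It creates and initializes a wait queue head, r_wait.
2. It calls poll_wait(), passing in the wait queue head r_wait and the poll_table handed down by do_pollfd().
3. It returns a mask to do_pollfd(). Whether an event counts as having occurred therefore depends entirely on the mask that dev_poll() returns.
4. At the right moment it calls wake_up_interruptible(&dev->r_wait) to wake the wait queue.

Why does the driver have to do all this? My guess is that it is simply the convention of the poll system call. Leaving aside for now what poll_wait() does, let's look at what do_poll() does after it has called the driver's dev_poll() through do_pollfd():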

fs/select.c:
static int do_poll(struct poll_list *list, struct poll_wqueues *wait,
		   struct timespec64 *end_time)
{
	.....
	for (;;) {
		......
		for (walk = list; walk != NULL; walk = walk->next) {	// g, walk the whole list
			struct pollfd * pfd, * pfd_end;

			pfd = walk->entries;
			pfd_end = pfd + walk->len;
			for (; pfd != pfd_end; pfd++) {	// g, visit every struct pollfd in this poll_list node
				if (do_pollfd(pfd, pt, &can_busy_loop,
					      busy_flag)) {
					count++;
					pt->_qproc = NULL;	// g, once an event has been found there is no need to join any more wait queues, so the pointer is no longer __pollwait
					/* found something, stop busy polling */
					busy_flag = 0;
					can_busy_loop = false;	// g, an expected event fired, so stop busy looping
				}
			}
		}
		......
		if (end_time && !to) {
			expire = timespec64_to_ktime(*end_time);
			to = &expire;
		}

		if (!poll_schedule_timeout(wait, TASK_INTERRUPTIBLE, to, slack))	// g, sleep and wait to be woken by an event
			timed_out = 1;	// g, set to 1 after a timeout
	}
	return count;
}

It goes to sleep by calling poll_schedule_timeout(). Once asleep it obviously cannot wake itself up, so something else has to wake it. How exactly is that done? The key is the poll_wait() called by our driver's dev_poll().

3.3 __pollwait() and pollwake()

First, the poll_wait() called from dev_poll:

static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_address, poll_table *p)
{
	if (p && p->_qproc && wait_address)
		p->_qproc(filp, wait_address, p);
}

As you can see, it ends up calling p->_qproc, and that pointer already points to __pollwait():

static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
				poll_table *p)
{
	struct poll_wqueues *pwq = container_of(p, struct poll_wqueues, pt);
	struct poll_table_entry *entry = poll_get_entry(pwq);	// g, each call returns &pwq->inline_entries[pwq->inline_index++] while the inline array still has room. Once inline_entries is used up, a new page is allocated to hold further struct poll_table_entry
	if (!entry)
		return;
	entry->filp = get_file(filp);
	entry->wait_address = wait_address;
	entry->key = p->_key;
	init_waitqueue_func_entry(&entry->wait, pollwake);	// g, pollwake becomes the callback run when this wait queue entry is woken
	entry->wait.private = pwq;
	// g, join the wait queue wait_address. schedule is not called here, so we are not asleep yet. wait_address is the wait queue head defined in the driver
	// g, later an interrupt or similar in the driver wakes this wait queue; the bound function pollwake() then runs and ends up in default_wake_function()->try_to_wake_up()
	add_wait_queue(wait_address, &entry->wait);
}

This function adds pwq->inline_entries[index].wait to the wait queue head registered in our driver. That lets us wake the waiting process from elsewhere. With that in mind, let's look at pollwake(), the function executed when the wait queue is woken:

An event becomes ready in the driver -> wake_up_interruptible(wait queue) -> __wake_up(x, TASK_INTERRUPTIBLE, 1, NULL) -> __wake_up_common_lock() -> __wake_up_common() -> curr->func(curr, mode, wake_flags, key). curr->func is the callback bound to the wait queue entry; for the entries that __pollwait() added to the wait queue, it is pollwake().

fs/select.c:
static int pollwake(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
{
	struct poll_table_entry *entry;

	entry = container_of(wait, struct poll_table_entry, wait);
	if (key && !(key_to_poll(key) & entry->key))
		return 0;
	return __pollwake(wait, mode, sync, key);
}
...
...
static int __pollwake(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
{
	struct poll_wqueues *pwq = wait->private;
	DECLARE_WAITQUEUE(dummy_wait, pwq->polling_task);
	/*
	 * Expands to:
	 * struct wait_queue_entry dummy_wait = {
	 *	.private = pwq->polling_task,	// set to current long ago
	 *	.func = default_wake_function,
	 *	.entry = { NULL, NULL }
	 * }
	 */
	smp_wmb();
	pwq->triggered = 1;
	return default_wake_function(&dummy_wait, mode, sync, key);
}

-> default_wake_function() -> try_to_wake_up(pwq->polling_task, mode, wake_flags). try_to_wake_up() wakes the task; pwq->polling_task was set to current earlier.
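The wake-up path is essentially "walk the queue and run each entry's callback; the callback filters by key and sets a flag". The single-threaded mock below sketches that idea in user space. Every mock_ name is hypothetical, keys are plain bit masks, and setting the flag stands in for the real wake-up of the task.

```c
#include <stddef.h>

struct mock_wait_entry;
typedef int (*mock_wake_func)(struct mock_wait_entry *we, unsigned long key);

struct mock_wait_entry {
	mock_wake_func func;		/* callback run by the waker */
	unsigned long key;		/* events this waiter cares about */
	int *triggered;			/* stands in for pwq->triggered */
	struct mock_wait_entry *next;
};

struct mock_wait_queue_head {
	struct mock_wait_entry *first;
};

/* Plays the role of add_wait_queue(). */
static void mock_add_wait_queue(struct mock_wait_queue_head *q,
				struct mock_wait_entry *we)
{
	we->next = q->first;
	q->first = we;
}

/* Plays the role of pollwake(): ignore wake-ups for events nobody asked for. */
static int mock_pollwake(struct mock_wait_entry *we, unsigned long key)
{
	if (key && !(key & we->key))
		return 0;
	*we->triggered = 1;	/* the real code then calls default_wake_function() */
	return 1;
}

/* Plays the role of __wake_up_common(): run every entry's callback. */
static int mock_wake_up(struct mock_wait_queue_head *q, unsigned long key)
{
	struct mock_wait_entry *curr;
	int woken = 0;

	for (curr = q->first; curr; curr = curr->next)
		woken += curr->func(curr, key);
	return woken;
}
```

A wake-up with a key that does not overlap the waiter's key leaves the flag at 0. A key of 0 corresponds to the NULL key in the __wake_up(x, TASK_INTERRUPTIBLE, 1, NULL) call above and always passes the filter.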

After the wake-up, do_poll() continues from where it went to sleep, that is, from poll_schedule_timeout() (strictly speaking, from the point where schedule() returns). Here is how the sleep function is implemented:

static int poll_schedule_timeout(struct poll_wqueues *pwq, int state,
			  ktime_t *expires, unsigned long slack)
{
	int rc = -EINTR;

	set_current_state(state);
	if (!pwq->triggered)	// g, sleep only if triggered == 0. __pollwake sets it to 1; that function is called from the callback run when the wait queue is woken
		rc = schedule_hrtimeout_range(expires, slack, HRTIMER_MODE_ABS);	// g, put the current process to sleep for the given time range, measured with CLOCK_MONOTONIC. Returns 0 on timeout and -EINTR if woken before the time is up. It uses a high-resolution timer; everything starting with hrtimer is high resolution
	// g, the previous step is where we sleep (it calls schedule). The next line only runs once we have been woken up
	__set_current_state(TASK_RUNNING);

	/*
	 * Prepare for the next iteration.
	 *
	 * The following smp_store_mb() serves two purposes.  First, it's
	 * the counterpart rmb of the wmb in pollwake() such that data
	 * written before wake up is always visible after wake up.
	 * Second, the full barrier guarantees that triggered clearing
	 * doesn't pass event check of the next iteration.  Note that
	 * this problem doesn't exist for the first iteration as
	 * add_wait_queue() has full barrier semantics.
	 */
	smp_store_mb(pwq->triggered, 0);

	return rc;
}

-> schedule_hrtimeout_range_clock() -> schedule(), which performs the task switch.

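After waking up, if the cause was a timeout, the next iteration of the for loop still scans all struct pollfd one last time and then breaks.

If the cause was not a timeout, then our driver woke us up deliberately, which means the event is ready in the driver. In that case, when the next scan calls the driver's dev_poll() again through do_pollfd(), it should get the corresponding mask. If it does not, the driver is buggy: it woke the process up and then failed to return the correct mask.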
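The "woken by an event before the timeout, then the rescan finds the mask" case can be watched from user space. The sketch below uses a pipe and a child process in place of a device and its interrupt; the helper name and the timing values are illustrative assumptions, not part of the original notes.

```c
#include <poll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * The parent polls the read end of a pipe while a child writes one byte
 * after delay_ms. Returns the revents seen by the parent, or -1 on error
 * or timeout.
 */
static int poll_until_child_writes(int delay_ms, int timeout_ms)
{
	struct pollfd pfd;
	int p[2], ret;
	pid_t pid;

	if (pipe(p) != 0)
		return -1;

	pid = fork();
	if (pid < 0) {
		close(p[0]);
		close(p[1]);
		return -1;
	}
	if (pid == 0) {				/* child: the "event source" */
		struct timespec d = { delay_ms / 1000,
				      (delay_ms % 1000) * 1000000L };

		close(p[0]);
		nanosleep(&d, NULL);
		if (write(p[1], "x", 1) != 1)
			_exit(1);
		_exit(0);
	}

	close(p[1]);				/* parent keeps only the read end */
	pfd.fd = p[0];
	pfd.events = POLLIN;
	pfd.revents = 0;
	ret = poll(&pfd, 1, timeout_ms);	/* sleeps until the write wakes it */
	close(p[0]);
	waitpid(pid, NULL, 0);

	return ret == 1 ? pfd.revents : -1;
}
```

Calling poll_until_child_writes(50, 2000) should return well before the 2 s timeout with POLLIN set (POLLHUP may be set as well if the child has already exited by the time of the rescan).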

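4. About try_to_wake_up()

This function touches on Linux scheduling policy: it picks a suitable CPU for the task being woken and puts the task on that CPU's run queue. How to pick a suitable CPU is a question in its own right. While reading this function I came across a policy called EAS (Energy Aware Scheduling), but the 4.19 kernel I am reading does not have it. EAS is not part of how CFS picks the next task; it is a CPU-selection policy, and it was only added in 5.0.

Beyond that, there is another thing worth studying in this function: it uses four memory barriers. Understanding these four barriers helps a lot in understanding ARM cache coherency and what memory barriers do. When I have time I will write notes on consistency models and the EAS scheduler.

try_to_wake_up() -> select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags) -> p->sched_class->select_task_rq(). If p is a SCHED_NORMAL task, its sched_class is the CFS one, fair_sched_class, and in kernels that have EAS the CFS select_task_rq() goes on to call find_energy_efficient_cpu().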