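Table of contents
- Process scheduling
- Dual-kernel task switching
- In-band -> out-of-band
- Out-of-band -> in-band
- OOB task switching
- Event notifications
- OOB exception handling
- System calls
- In-band events
- Alternate task context
- Extended memory context
- Intercepting the return from in-band to user mode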
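Process scheduling
To support specific use cases that need reliable, ultra-low-latency response, we want an EVL-based scheduler that controls regular Linux tasks, is fully separate from the Linux scheduler, and has absolute priority over all Linux activity.
Dovetail provides EVL with the following support:
- moving Linux tasks between the in-band and out-of-band stages in an orderly and safe way, keeping the schedulers of both kernels in sync;
- notifying EVL of in-band events;
- notifying EVL of events such as faults, system calls and other exceptions raised from the out-of-band execution stage;
- integrated support for context switching between out-of-band tasks, including memory context and floating-point unit management.
A Linux task (user-space task or kernel thread) must initialize the alternate scheduling feature by calling dovetail_init_altsched(), then enable it with dovetail_start_altsched(). **dovetail_stop_altsched()** disables event notifications for the current task. Once the feature is enabled, the task:
- can switch freely between the in-band and out-of-band execution stages;
- can issue out-of-band system calls that EVL handles;
- is tracked by Dovetail's notification system, so that in-band and out-of-band events involving the task are dispatched to EVL;
- takes part in the Dovetail-based context switches triggered by EVL, independently of Linux scheduling.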
void dovetail_init_altsched(struct dovetail_altsched_context *p)
{
	struct task_struct *tsk = current;
	struct mm_struct *mm = tsk->mm;

	check_inband_stage();

	/* Record the current task and its mm in the alternate scheduling context. */
	p->task = tsk;
	p->active_mm = mm;
	p->borrowed_mm = false;

	/*
	 * Make sure the current process will not share any private
	 * page with its child upon fork(), sparing it the random
	 * latency induced by COW. MMF_DOVETAILED is never cleared
	 * once set. We serialize with dup_mmap() which holds the mm
	 * write lock. CAUTION: the boot context has no mm and does
	 * not bear the PF_KTHREAD bit either.
	 *
	 * In short: if the task has an mm, is not a kernel thread and
	 * MMF_DOVETAILED is not set yet, set the flag.
	 */
	if (mm && !(tsk->flags & PF_KTHREAD) &&
	    !test_bit(MMF_DOVETAILED, &mm->flags)) {
		mmap_write_lock(mm);
		__set_bit(MMF_DOVETAILED, &mm->flags);
		mmap_write_unlock(mm);
	}
}

void dovetail_start_altsched(void)
{
	check_inband_stage();
	/* Set the thread-local flag that enables alternate scheduling. */
	set_thread_local_flags(_TLF_DOVETAIL);
}
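Dual-kernel task switching
At any point in time, a Linux task is controlled either by the Linux kernel or by EVL; the two never overlap:
- while the task runs or sleeps under EVL, the Linux kernel sees it in the TASK_INTERRUPTIBLE sleep state;
- while the task runs or sleeps under the Linux kernel, EVL sees it in the T_INBAND blocked state.
EVL yields the CPU to the in-band stage only when no out-of-band task is runnable on that CPU. It does so by scheduling a lowest-priority placeholder task, which stands for the Linux kernel and its in-band context and is linked into the per-CPU runqueue maintained by EVL. For example, the EVL core assigns this placeholder task to its SCHED_IDLE policy, which is picked when no other policy has a runnable task on the CPU.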
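The fallback to the placeholder can be pictured with a small user-space model. This is only a sketch under stated assumptions: `struct model_thread`, `model_pick_next()` and the "higher value wins" priority convention are hypothetical names introduced for illustration, not EVL's actual `pick_next_thread()` logic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified thread descriptor (not struct evl_thread). */
struct model_thread {
	const char *name;
	int prio;	/* assumption: higher value = higher priority */
	int runnable;	/* non-zero if the thread can run */
};

/*
 * Pick the highest-priority runnable out-of-band thread. If none is
 * runnable, fall back to the root placeholder, which hands the CPU
 * back to the in-band stage.
 */
static struct model_thread *model_pick_next(struct model_thread *oob,
					    size_t nr,
					    struct model_thread *root)
{
	struct model_thread *best = root;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (!oob[i].runnable)
			continue;
		if (best == root || oob[i].prio > best->prio)
			best = &oob[i];
	}

	return best;
}
```

In this model the placeholder never competes on priority: it is returned only when the out-of-band array holds no runnable entry, which mirrors the "lowest priority" role described above.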
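Once the in-band context resumes, pending in-band interrupt events are delivered to the Linux kernel, provided the virtual masking state allows it.
Switching a task between the in-band and out-of-band stages is a complex and costly process: it takes at least two scheduling operations (one to suspend the task on the stage it leaves, one to resume it on the opposite stage).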
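In-band -> out-of-band
An out-of-band switch is the process by which a Linux task moves from the control of the main kernel scheduler to the alternate scheduler added by the autonomous core. It can be triggered in two ways:
- the task explicitly requests the switch through an EVL call such as evl_switch_oob();
- EVL forces the transition in response to a user request, for example when the task issues a system call that can only be handled from the out-of-band stage.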
int evl_switch_oob(void)
{
	struct evl_thread *curr = evl_current();
	struct task_struct *p = current;
	unsigned long flags;
	int ret;

	inband_context_only();

	if (curr == NULL)
		return -EPERM;

	if (signal_pending(p))
		return -ERESTARTSYS;

	trace_evl_switch_oob(curr);

	evl_clear_sync_uwindow(curr, EVL_T_INBAND);

	ret = dovetail_leave_inband();
	if (ret) {
		evl_test_cancel();
		evl_set_sync_uwindow(curr, EVL_T_INBAND);
		return ret;
	}

	/*
	 * On success, dovetail_leave_inband() stalls the oob stage
	 * before returning to us: clear this stall bit since we don't
	 * use it for protection but keep hard irqs off.
	 */
	unstall_oob();

	/*
	 * The current task is now running on the out-of-band
	 * execution stage, scheduled in by the latest call to
	 * __evl_schedule() on this CPU: we must be holding the
	 * runqueue lock and hard irqs must be off.
	 */
	oob_context_only();

	finish_rq_switch_from_inband();

	trace_evl_switched_oob(curr);

	/*
	 * In case check_cpu_affinity() caught us resuming oob from a
	 * wrong CPU (i.e. outside of the oob set), we have EVL_T_CANCELD
	 * set. Check and bail out if so.
	 */
	if (curr->info & EVL_T_CANCELD)
		evl_test_cancel();

	/*
	 * Since handle_sigwake_event()->evl_kick_thread() won't set
	 * EVL_T_KICKED unless EVL_T_INBAND is cleared, a signal received
	 * during the stage transition process might have gone
	 * unnoticed. Recheck for signals here and raise EVL_T_KICKED if
	 * some are pending, so that we switch back in-band asap for
	 * handling them.
	 */
	if (signal_pending(p)) {
		raw_spin_lock_irqsave(&curr->rq->lock, flags);
		curr->info |= EVL_T_KICKED;
		raw_spin_unlock_irqrestore(&curr->rq->lock, flags);
	}

	return 0;
}
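With Dovetail, a task running on the in-band stage switches out-of-band through the following sequence of operations. In the figure, blue is the in-band stage, light red the transition stage and orange the out-of-band stage; each dashed line is a context switch:
- dovetail_leave_inband() is called to prepare the switch (evl_switch_oob() calls it). It puts the caller in the TASK_INTERRUPTIBLE sleep state, then reschedules immediately. At this point the task is in flight to the out-of-band stage, and schedule() resumes the next in-band task that should run on the current CPU.
- When the context of the next in-band task is restored (in context_switch()), Linux calls inband_switch_tail() to check whether a task is in flight to the out-of-band stage. If so (the test is done in finalize_oob_transition()), resume_oob_task() is called to resume the oob task. If no task is in flight, the in-band switch simply proceeds. When inband_switch_tail() runs over the oob context instead, it updates the oob context markers to indicate that the transition is complete.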
int dovetail_leave_inband(void)
{
	struct task_struct *p = current;
	struct irq_pipeline_data *pd;
	unsigned long flags;

	/* Disable preemption. */
	preempt_disable();

	pd = raw_cpu_ptr(&irq_pipeline);

	if (WARN_ON_ONCE(dovetail_debug() && pd->task_inflight))
		goto out;	/* Paranoid. */

	/* Take the task's pi_lock and mark the task as in flight to the oob stage. */
	raw_spin_lock_irqsave(&p->pi_lock, flags);
	pd->task_inflight = p;
	/*
	 * The scope of the off-stage state is broader than _TLF_OOB,
	 * in that it includes the transition path from the in-band
	 * context to the oob stage.
	 */
	/* Set the off-stage flag and put the task in the TASK_INTERRUPTIBLE state. */
	set_thread_local_flags(_TLF_OFFSTAGE);
	set_current_state(TASK_INTERRUPTIBLE);
	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
	sched_submit_work(p);
	/*
	 * The current task is scheduled out from the inband stage,
	 * before resuming on the oob stage. Since this code stands
	 * for the scheduling tail of the oob scheduler,
	 * arch_dovetail_switch_finish() is called to perform
	 * architecture-specific fixups (e.g. fpu context reload).
	 */
	/* Call the Linux scheduler. */
	if (likely(__schedule(SM_NONE))) {
		arch_dovetail_switch_finish(false);
		return 0;
	}

	/* The transition did not take place: undo the off-stage marking. */
	clear_thread_local_flags(_TLF_OFFSTAGE);
	pd->task_inflight = NULL;
out:
	preempt_enable();

	return -ERESTARTSYS;
}

bool inband_switch_tail(void)
{
	bool oob;

	check_hard_irqs_disabled();

	/*
	 * We may run this code either over the inband or oob
	 * contexts. If inband, we may have a thread blocked in
	 * dovetail_leave_inband(), waiting for the companion core to
	 * schedule it back in over the oob context, in which case
	 * finalize_oob_transition() should take care of it. If oob,
	 * the core just switched us back, and we may update the
	 * context markers before returning to context_switch().
	 *
	 * Since the preemption count does not reflect the active
	 * stage yet upon inband -> oob transition, we figure out
	 * which one we are on by testing _TLF_OFFSTAGE. Having this
	 * bit set when running the inband switch tail code means that
	 * we are completing such transition for the current task,
	 * switched in by dovetail_context_switch() over the oob
	 * stage. If so, update the context markers appropriately.
	 */
	/* Use the thread-local flag _TLF_OFFSTAGE to tell whether we run over the oob context. */
	oob = test_thread_local_flags(_TLF_OFFSTAGE);
	if (oob) {
		/*
		 * The companion core assumes a stalled stage on exit
		 * from dovetail_leave_inband().
		 */
		/* Stall the oob stage and set the oob context marker: the transition is complete. */
		stall_oob();
		set_thread_local_flags(_TLF_OOB);
		if (!IS_ENABLED(CONFIG_HAVE_PERCPU_PREEMPT_COUNT)) {
			WARN_ON_ONCE(dovetail_debug() &&
				     (preempt_count() & STAGE_MASK));
			preempt_count_add(STAGE_OFFSET);
		}
	} else {
		/* Check whether a task is in flight to the oob stage. */
		finalize_oob_transition();
		hard_local_irq_enable();
	}

	return oob;
}

static void finalize_oob_transition(void) /* hard IRQs off */
{
	struct irq_pipeline_data *pd;
	struct irq_stage_data *p;
	struct task_struct *t;

	/* Check whether a task is waiting to be switched to the oob stage. */
	pd = raw_cpu_ptr(&irq_pipeline);
	t = pd->task_inflight;
	if (t == NULL)
		return;

	/*
	 * @t which is in flight to the oob stage might have received
	 * a signal while waiting in off-stage state to be actually
	 * scheduled out. We can't act upon that signal safely from
	 * here, we simply let the task complete the migration process
	 * to the oob stage. The pending signal will be handled when
	 * the task eventually exits the out-of-band context by the
	 * converse migration.
	 */
	pd->task_inflight = NULL;

	/*
	 * The transition handler in the companion core assumes the
	 * oob stage is stalled, fix this up.
	 */
	/* Perform the switch. */
	stall_oob();
	resume_oob_task(t);
	unstall_oob();
	p = this_oob_staged();
	if (stage_irqs_pending(p))
		/* Current stage (in-band) != p->stage (oob). */
		sync_irq_stage(p->stage);
}
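Out-of-band -> in-band
A task may switch from the out-of-band stage to the in-band stage for the following reasons:
- The task explicitly requests the switch through a call provided by EVL, such as evl_switch_inband().
- EVL forces the task to switch in-band, for example because:
  - the task issues a regular system call that can only be handled from the in-band stage;
  - the task receives a synchronous fault or exception, such as a memory access violation or a floating-point exception. Handling these events directly from the out-of-band stage would require a lot of duplicated code and could conflict with the in-band subsystems, so the task switches in-band first;
  - the task has a pending signal to handle. The main kernel logic may require the task to acknowledge the signal from the in-band stage, so the task has to switch in-band.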
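A task switches from the out-of-band stage to the in-band stage as follows:
- wake_up_process() is called to unblock the task on the Linux side. This is typically done through the irq_work mechanism (Dovetail extends irq_work using synthetic interrupts), which makes it possible to queue a piece of work safely so that it runs at a later point. Linux checks the irq_work queue at a suitable time (for example when an interrupt is handled on that CPU) and runs the queued entries. (softirq and workqueue are similar deferral mechanisms: irq_work and softirq are per-CPU, whereas workqueue items may run on any CPU. irq_work fits lightweight, low-overhead work; softirq fits heavy, longer post-interrupt processing, and the set of softirqs is fixed at kernel build time; workqueue fits complex asynchronous work and runs in process context.)
- EVL blocks the task being switched by setting its T_INBAND blocking bit, then reschedules immediately.
- Once no oob task is left to run on the current CPU, Linux resumes from the point where it was last preempted.
- The switching task resumes execution and calls **dovetail_resume_inband()** to complete the transition from the out-of-band stage to the in-band stage.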
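The ordering of these steps can be sketched as a tiny user-space state model. It is an illustration only: `struct model_task` and the `model_*()` helpers are hypothetical names, each standing for the kernel-side step named in its comment, and the "no oob task left on the CPU" step is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

enum model_stage { MODEL_STAGE_OOB, MODEL_STAGE_INBAND };

/* Hypothetical task model holding both schedulers' view of one task. */
struct model_task {
	bool linux_sleeping;	/* Linux view: sleeping in TASK_INTERRUPTIBLE */
	bool evl_inband;	/* EVL view: T_INBAND blocking bit is set */
	enum model_stage stage;
};

/* Stands for wake_up_process(), typically run from an irq_work handler. */
static void model_wake_up_inband(struct model_task *t)
{
	t->linux_sleeping = false;
}

/* Stands for the core blocking the task on T_INBAND before rescheduling. */
static void model_block_on_inband(struct model_task *t)
{
	t->evl_inband = true;
}

/*
 * Stands for the point where dovetail_resume_inband() runs. Returns 0
 * on success, -1 if one of the earlier steps has not happened yet.
 */
static int model_resume_inband(struct model_task *t)
{
	if (t->linux_sleeping || !t->evl_inband)
		return -1;

	t->stage = MODEL_STAGE_INBAND;
	return 0;
}
```

The model makes the invariant from the previous section explicit: the task only counts as in-band once Linux no longer sees it sleeping and EVL sees it blocked on T_INBAND.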
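OOB task switching
The figure below shows the steps of an EVL context switch from the outgoing task PREV to the task NEXT:
- **evl_schedule()** is called in the context of PREV and picks the next task to run by priority order. If PREV is still the highest-priority runnable task on the current CPU, the sequence stops there.
- dovetail_context_switch() is called to switch the memory context and to switch the CPU register file to that of NEXT, saving the register file of PREV. If PREV is the out-of-band placeholder of the in-band context, Linux is being preempted, and the kernel must be told so by passing leave_inband=true to dovetail_context_switch().
- NEXT resumes from its own switch point. If NEXT was running out-of-band before it went to sleep, this is the switch tail code in evl_schedule(); if NEXT is completing a transition from the in-band stage (it was scheduled out in dovetail_leave_inband()), this is the switch tail code in schedule().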
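The decision about `leave_inband` depends only on whether PREV or NEXT is the in-band placeholder. A minimal sketch of that classification follows; `enum model_switch_kind` and `model_classify_switch()` are hypothetical names, and the model assumes a single placeholder per CPU, so PREV and NEXT are never both the placeholder.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical classification of a context switch done by the core. */
enum model_switch_kind {
	MODEL_OOB_TO_OOB,	/* plain switch between two oob tasks */
	MODEL_LEAVING_INBAND,	/* Linux is preempted: leave_inband=true */
	MODEL_ENTERING_INBAND,	/* CPU is handed back to Linux */
};

/*
 * prev_is_root/next_is_root tell whether PREV/NEXT is the in-band
 * placeholder. Assumption: PREV != NEXT, one placeholder per CPU.
 */
static enum model_switch_kind model_classify_switch(bool prev_is_root,
						    bool next_is_root)
{
	if (prev_is_root)
		return MODEL_LEAVING_INBAND;
	if (next_is_root)
		return MODEL_ENTERING_INBAND;

	return MODEL_OOB_TO_OOB;
}
```

Only the `MODEL_LEAVING_INBAND` case corresponds to passing leave_inband=true to dovetail_context_switch().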
static inline void evl_schedule(void)
{
	/* Get the runqueue of the current CPU. */
	struct evl_rq *this_rq = this_evl_rq();

	/*
	 * If we race here reading the rq state locklessly because of
	 * a CPU migration, we must be running over the in-band stage,
	 * in which case the call to __evl_schedule() will be
	 * escalated to the oob stage where migration cannot happen,
	 * ensuring safe access to the runqueue state.
	 *
	 * Remote RQ_SCHED requests are paired with out-of-band IPIs
	 * running on the oob stage by definition, so we can't miss
	 * them here.
	 *
	 * Finally, RQ_IRQ is always tested from the CPU which handled
	 * an out-of-band interrupt, there is no coherence issue.
	 */
	/* Proceed only if a reschedule is pending (RQ_SCHED) and RQ_IRQ is not set. */
	if (((this_rq->flags|this_rq->local_flags) & (RQ_IRQ|RQ_SCHED)) != RQ_SCHED)
		return;

	/* Already on the oob stage: reschedule directly. */
	if (likely(running_oob())) {
		__evl_schedule();
		return;
	}

	/* On the in-band stage: escalate to the oob stage, then reschedule. */
	run_oob_call((int (*)(void *))__evl_schedule, NULL);
}

/*
 * CAUTION: curr->altsched.task may be unsynced and even stale if
 * (this_rq->curr == &this_rq->root_thread), since the task logged by
 * dovetail_context_switch() may not still be the current one. Always
 * use "current" for disambiguating if you intend to refer to the
 * running inband task.
 */
void __evl_schedule(void) /* oob or/and hard irqs off (CPU migration-safe) */
{
	struct evl_rq *this_rq = this_evl_rq();
	struct evl_thread *prev, *next, *curr;
	bool leaving_inband, inband_tail;
	unsigned long flags;

	/* Running in-band with hard irqs enabled is a bug: bail out. */
	if (EVL_WARN_ON_ONCE(CORE, running_inband() && !hard_irqs_disabled()))
		return;

	trace_evl_schedule(this_rq);

	flags = hard_local_irq_save();

	/*
	 * Check whether we have a pending priority ceiling request to
	 * commit before putting the current thread to sleep.
	 * evl_current() may differ from rq->curr only if rq->curr ==
	 * &rq->root_thread. Testing EVL_T_USER eliminates this case since
	 * a root thread never bears this bit.
	 */
	curr = this_rq->curr;
	/*
	 * Deferred WCHAN requeuing must be handled prior to
	 * rescheduling.
	 */
	EVL_WARN_ON(CORE, curr->info & EVL_T_WCHAN);
	/*
	 * Priority protection for mutexes is only available to
	 * applications. Kernel users stick with the priority
	 * inheritance protocol (see evl_init_kmutex()).
	 */
	if (curr->state & EVL_T_USER)
		evl_commit_monitor_ceiling();

	/*
	 * Only holding this_rq->lock is required for test_resched(),
	 * but we grab curr->lock in advance in order to keep the
	 * locking order safe from ABBA deadlocking.
	 */
	raw_spin_lock(&curr->lock);

	/*
	 * Detect any lingering priority boost which should not be in
	 * effect anymore. Since this situation is likely to stick,
	 * warn only once. If that message ever appears, something
	 * would be really wrong in the PI/PP implementation anyway.
	 */
	/* Debug check: a boosted thread with no booster left is inconsistent. */
	if (IS_ENABLED(CONFIG_EVL_DEBUG_CORE) &&
	    curr->state & EVL_T_BOOST && list_empty(&curr->boosters)) {
		raw_spin_unlock(&curr->lock);
		EVL_WARN_ON_ONCE(CORE, 1);
		raw_spin_lock(&curr->lock);
	}

	raw_spin_lock(&this_rq->lock);

	if (unlikely(!test_resched(this_rq))) {
		raw_spin_unlock(&this_rq->lock);
		raw_spin_unlock_irqrestore(&curr->lock, flags);
		return;
	}

	/* Pick the next thread. */
	next = pick_next_thread(this_rq);
	trace_evl_pick_thread(next);
	/* If the next thread is the current one, handle the special cases and return. */
	if (next == curr) {
		/* EVL_T_ROOT marks the root thread (in-band kernel placeholder): we stay in-band. */
		if (unlikely(next->state & EVL_T_ROOT)) {
			if (this_rq->local_flags & RQ_TPROXY)
				evl_notify_proxy_tick(this_rq);
			if (this_rq->local_flags & RQ_TDEFER)
				evl_program_local_tick(&evl_mono_clock);
		}
		raw_spin_unlock(&this_rq->lock);
		raw_spin_unlock_irqrestore(&curr->lock, flags);
		return;
	}

	prev = curr;
	this_rq->curr = next;
	leaving_inband = false;

	if (prev->state & EVL_T_ROOT) {
		leave_inband(prev);
		leaving_inband = true;
	} else if (next->state & EVL_T_ROOT) {
		if (this_rq->local_flags & RQ_TPROXY)
			evl_notify_proxy_tick(this_rq);
		if (this_rq->local_flags & RQ_TDEFER)
			evl_program_local_tick(&evl_mono_clock);
		enter_inband(next);
	}

	evl_switch_account(this_rq, &next->stat.account);
	evl_inc_counter(&next->stat.csw);
	raw_spin_unlock(&prev->lock);

	prepare_rq_switch(this_rq, prev, next);
	inband_tail = dovetail_context_switch(&prev->altsched,
					      &next->altsched, leaving_inband);
	finish_rq_switch(inband_tail, flags);
}

bool dovetail_context_switch(struct dovetail_altsched_context *out,
			     struct dovetail_altsched_context *in,
			     bool leave_inband)
{
	unsigned long pc __maybe_unused, lockdep_irqs;
	struct task_struct *next, *prev, *last;
	struct mm_struct *prev_mm, *next_mm;
	bool inband_tail = false;

	WARN_ON_ONCE(dovetail_debug() && on_pipeline_entry());

	/* The Linux kernel is being preempted. */
	if (leave_inband) {
		struct task_struct *tsk = current;
		/*
		 * We are about to leave the current inband context
		 * for switching to an out-of-band task, save the
		 * preempted context information.
		 */
		out->task = tsk;
		out->active_mm = tsk->active_mm;
		/*
		 * Switching out-of-band may require some housekeeping
		 * from a kernel VM which might currently run guest
		 * code, notify it about the upcoming preemption.
		 */
		notify_guest_preempt();
	}

	arch_dovetail_switch_prepare(leave_inband);

	/* Gather the context carried across the switch: task_struct, mm, preemption count, irq state. */
	next = in->task;
	prev = out->task;
	/*
	 * Memory context switch optimization: a kernel thread, or any
	 * task without an mm, borrows the mm of the previous task,
	 * which saves a page table switch.
	 */
	prev_mm = out->active_mm;
	next_mm = in->active_mm;

	if (next_mm == NULL) {
		in->active_mm = prev_mm;
		in->borrowed_mm = true;
		enter_lazy_tlb(prev_mm, next);
	} else {
		switch_oob_mm(prev_mm, next_mm, next);
		/*
		 * We might be switching back to the inband context
		 * which we preempted earlier, shortly after "current"
		 * dropped its mm context in the do_exit() path
		 * (next->mm == NULL). In such a case, a lazy TLB
		 * state is expected when leaving the mm.
		 */
		if (next->mm == NULL)
			enter_lazy_tlb(prev_mm, next);
	}

	if (out->borrowed_mm) {
		out->borrowed_mm = false;
		out->active_mm = NULL;
	}

	/*
	 * Tasks running out-of-band may alter the (in-band)
	 * preemption count as long as they don't trigger an in-band
	 * rescheduling, which Dovetail properly blocks.
	 *
	 * If the preemption count is not stack-based but a global
	 * per-cpu variable instead, changing it has a globally
	 * visible side-effect though, which is a problem if the
	 * out-of-band task is preempted and schedules away before the
	 * change is rolled back: this may cause the in-band context
	 * to later resume with a broken preemption count.
	 *
	 * For this reason, the preemption count of any context which
	 * blocks from the out-of-band stage is carried over and
	 * restored across switches, emulating a stack-based
	 * storage.
	 *
	 * Eventually, the count is reset to FORK_PREEMPT_COUNT upon
	 * transition from out-of-band to in-band stage, reinstating
	 * the value in effect when the converse transition happened
	 * at some point before.
	 */
	if (IS_ENABLED(CONFIG_HAVE_PERCPU_PREEMPT_COUNT))
		pc = preempt_count();

	/*
	 * Like the preemption count and for the same reason, the irq
	 * state maintained by lockdep must be preserved across
	 * switches.
	 */
	lockdep_irqs = lockdep_read_irqs_state();

	/* Switch the context. */
	switch_to(prev, next, last);
	barrier();

	if (check_hard_irqs_disabled())
		hard_local_irq_disable();

	/*
	 * If we entered this routine for switching to an out-of-band
	 * task but don't have _TLF_OOB set for the current context
	 * when resuming, this portion of code is the switch tail of
	 * the inband schedule() routine, finalizing a transition to
	 * the inband stage for the current task. Update the stage
	 * level as/if required.
	 */
	if (unlikely(!leave_inband && !test_thread_local_flags(_TLF_OOB))) {
		if (IS_ENABLED(CONFIG_HAVE_PERCPU_PREEMPT_COUNT))
			preempt_count_set(FORK_PREEMPT_COUNT);
		else if (unlikely(dovetail_debug() &&
				  !(preempt_count() & STAGE_MASK)))
			WARN_ON_ONCE(1);
		else
			preempt_count_sub(STAGE_OFFSET);

		lockdep_write_irqs_state(lockdep_irqs);

		/*
		 * Fixup the interrupt state conversely to what
		 * inband_switch_tail() does for the opposite stage
		 * switching direction.
		 */
		stall_inband();
		trace_hardirqs_off();
		inband_tail = true;
	} else {
		if (IS_ENABLED(CONFIG_HAVE_PERCPU_PREEMPT_COUNT))
			preempt_count_set(pc);

		lockdep_write_irqs_state(lockdep_irqs);
	}

	arch_dovetail_switch_finish(leave_inband);

	/*
	 * inband_tail is true whenever we are finalizing a transition
	 * to the inband stage from the oob context for current. See
	 * above.
	 */
	return inband_tail;
}
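Event notifications
OOB exception handling
If a processor exception is raised on the oob stage (for example an invalid memory access, a bad instruction, a floating-point unit fault or an alignment error), the task must switch to the in-band stage to handle it. Dovetail calls handle_oob_trap_entry() for such an exception. Before the in-band trap handler eventually exits, handle_oob_trap_exit() is called; if the core needs to perform any fixup before the trap context is left, it should override this handler (it is weakly bound).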
/* hard irqs off. */
/* trapnr is the exception code (architecture-specific); regs points to the register frame of the faulting context. */
void handle_oob_trap_entry(unsigned int trapnr, struct pt_regs *regs)
{
	struct evl_thread *curr;
	bool is_bp = false;
	int diag;

	trace_evl_thread_fault(trapnr, regs);

	/*
	 * We may not demote the current task if running in NMI
	 * context. Just bail out if so.
	 */
	if (in_nmi())
		return;

	/*
	 * We might be running oob over a non-dovetailed task context
	 * (e.g. taking a trap on top of evl_schedule() ->
	 * run_oob_call()). In this case, there is nothing we
	 * can/should do, just bail out.
	 */
	curr = evl_current();
	if (curr == NULL)
		return;

	/* Guard against handling a fault recursively. */
	if (curr->local_info & EVL_T_INFAULT) {
		note_trap(curr, trapnr, regs, "recursive fault");
		return;
	}

	oob_context_only();

	curr->local_info |= EVL_T_INFAULT;

	/* Check whether the trap is a debugger breakpoint. */
	if (current->ptrace & PT_PTRACED)
		is_bp = evl_is_breakpoint(trapnr);

	/* Log the trap event. */
	if ((EVL_DEBUG(CORE) || (curr->state & EVL_T_WOSS)) && !is_bp)
		note_trap(curr, trapnr, regs, "switching in-band");

	/*
	 * We received a trap on the oob stage, switch to in-band
	 * before handling the exception.
	 */
	diag = is_bp ? EVL_HMDIAG_TRAP : EVL_HMDIAG_EXDEMOTE;
	if (user_mode(regs))
		evl_switch_inband_details(diag, evl_intval(instruction_pointer(regs)));
	else
		evl_switch_inband(diag);
}
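System calls
EVL also provides its own set of system calls to applications, so Dovetail has to handle Linux and EVL system calls separately. Dovetail intercepts system calls early in the kernel entry code (on RISC-V, for example: do_trap_ecall_u -> syscall_enter_from_user_mode -> __syscall_enter_from_user_work -> pipeline_syscall -> handle_oob_syscall or handle_pipelined_syscall) and routes each call to one of two paths:
- Fast path: if the system call number is outside the valid range of the in-band kernel and the caller is currently running on the oob stage, the call is passed to the **handle_oob_syscall()** handler.
- Slow path: in every other case, **handle_pipelined_syscall()** is called to handle the request from the appropriate execution stage. (Figure: slow-path processing flow.)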
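The routing rule above boils down to one predicate. The sketch below is a user-space illustration under stated assumptions: `MODEL_NR_INBAND_SYSCALLS` is a made-up bound standing for the end of the in-band system call range, and `model_route_syscall()` is a hypothetical helper, not the kernel's pipeline_syscall().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical upper bound of the in-band system call numbers. */
#define MODEL_NR_INBAND_SYSCALLS 450u

enum model_route {
	MODEL_ROUTE_FAST_OOB,	/* stands for handle_oob_syscall() */
	MODEL_ROUTE_SLOW,	/* stands for handle_pipelined_syscall() */
};

/*
 * Fast path only when the number is not a valid in-band syscall AND
 * the caller already runs on the oob stage; slow path otherwise.
 */
static enum model_route model_route_syscall(unsigned int nr, bool caller_oob)
{
	bool inband_nr = nr < MODEL_NR_INBAND_SYSCALLS;

	if (!inband_nr && caller_oob)
		return MODEL_ROUTE_FAST_OOB;

	return MODEL_ROUTE_SLOW;
}
```

Note that an out-of-range number issued from the in-band stage still takes the slow path, where the request can be handled from the appropriate stage.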
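In-band events
In-band events are events that occur only in the Linux context but may affect how EVL manages tasks. Dovetail notifies EVL of these events through handle_inband_event(), so that EVL can synchronize its state or act accordingly. This handler always runs in-band. On entry, neither the out-of-band nor the in-band stage is stalled. Except for INBAND_TASK_EXIT and INBAND_PROCESS_CLEANUP, which are raised for any exiting user-space task, notifications are only sent for tasks that have alternate scheduling enabled (see dovetail_start_altsched()). With these notifications, EVL can manage task lifetimes while staying in sync with the in-band logic.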
void handle_inband_event(enum inband_event_type event, void *data)
{
	switch (event) {
	case INBAND_TASK_SIGNAL:
		/* The target task is about to receive a signal. */
		handle_sigwake_event(data);
		break;
	case INBAND_TASK_EXIT:
		/* Raised from do_exit() when a task exits, before it drops its files and memory mappings. */
		evl_drop_subscriptions(evl_get_subscriber());
		if (evl_current())
			put_current_thread();
		break;
	case INBAND_TASK_MIGRATION:
		/* Task p->task is about to migrate to CPU p->dest_cpu. */
		handle_migration_event(data);
		break;
	case INBAND_TASK_RETUSER:
		/* The task is about to return from the kernel to user space. */
		handle_retuser_event();
		break;
	case INBAND_TASK_PTSTOP:
		/* The current task is about to enter ptrace_stop(), where it waits for the debugger to let it continue. */
		handle_ptstop_event();
		break;
	case INBAND_TASK_PTCONT:
		/* The task wakes up from ptrace_stop(); it may return to user space or handle pending signals. */
		handle_ptcont_event();
		break;
	case INBAND_TASK_PTSTEP:
		/* The ptrace(2) implementation is about to single-step the target task. */
		handle_ptstep_event(data);
		break;
	case INBAND_PROCESS_CLEANUP:
		/* The memory context (mm_struct) of a process is about to be fully released. */
		handle_cleanup_event(data);
		break;
	}
}
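Alternate task context
EVL defines extra per-task information for every task with alternate scheduling. Dovetail adds a member named oob_state, of type struct oob_thread_state, to the thread information block (struct thread_info); EVL retrieves it by calling dovetail_current_state():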
/* arm64 as an example */
struct oob_thread_state {
	struct evl_thread *thread;		/* pointer to the EVL thread control block */
	struct evl_subscriber *subscriber;	/* tracks the thread's subscriptions to observables */
	int preempt_count;			/* EVL-specific preemption counter */
};

/*
 * low level task data that entry.S needs immediate access to.
 */
struct thread_info {
	unsigned long flags;		/* low level flags */
	unsigned long local_flags;	/* local (synchronous) flags */
#ifdef CONFIG_ARM64_SW_TTBR0_PAN
	u64 ttbr0;			/* saved TTBR0_EL1 */
#endif
	union {
		u64 preempt_count;	/* 0 => preemptible, <0 => bug */
		struct {
#ifdef CONFIG_CPU_BIG_ENDIAN
			u32 need_resched;
			u32 count;
#else
			u32 count;
			u32 need_resched;
#endif
		} preempt;
	};
#ifdef CONFIG_SHADOW_CALL_STACK
	void *scs_base;
	void *scs_sp;
#endif
	u32 cpu;
	struct oob_thread_state oob_state;
};
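task_struct describes a process and is architecture-independent. thread_info is architecture-specific: it is a low-level per-thread structure that complements task_struct and gives fast access to the thread state, and it is embedded in task_struct when CONFIG_THREAD_INFO_IN_TASK is set. Otherwise thread_info can share a memory area with the kernel stack, so that each can be located quickly from the other.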
union thread_union {
#ifndef CONFIG_ARCH_TASK_STRUCT_ON_STACK
	struct task_struct task;
#endif
#ifndef CONFIG_THREAD_INFO_IN_TASK
	struct thread_info thread_info;
#endif
	unsigned long stack[THREAD_SIZE/sizeof(long)];
};

struct task_struct {
#ifdef CONFIG_THREAD_INFO_IN_TASK
	/*
	 * For reasons of header soup (see current_thread_info()), this
	 * must be the first element of task_struct.
	 */
	struct thread_info thread_info;
#endif
	...
};
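Why the "first element" requirement matters can be shown with plain C. The sketch below uses hypothetical stand-in types (`struct model_thread_info`, `struct model_task_struct`) rather than the kernel structures; it relies only on the C rule that a pointer to a structure and a pointer to its first member convert to each other.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for thread_info and task_struct. */
struct model_thread_info {
	unsigned long flags;
};

struct model_task_struct {
	struct model_thread_info thread_info;	/* must stay the first member */
	int pid;
};

/* Task -> thread_info: just take the address of the embedded member. */
static struct model_thread_info *
model_task_thread_info(struct model_task_struct *tsk)
{
	return &tsk->thread_info;
}

/*
 * thread_info -> task: a plain cast is enough, but only because
 * thread_info sits at offset 0 of the enclosing structure.
 */
static struct model_task_struct *
model_task_of(struct model_thread_info *ti)
{
	return (struct model_task_struct *)ti;
}
```

If the member were moved away from offset 0, the cast in `model_task_of()` would silently return a wrong pointer, which is the kind of breakage the kernel comment warns about.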
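Extended memory context
EVL needs to maintain a private data set for each process. Dovetail adds a member named oob_state, of type struct oob_mm_state, to the generic struct mm_struct descriptor. Because kernel threads only borrow a memory context temporarily and do not own one, this Dovetail extension is only available to EVL threads running in user space, not to threads created with evl_run_kthread():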
struct oob_mm_state {
	unsigned long flags;	/* Guaranteed zero initially. */
	struct list_head ptrace_sync;
	struct evl_wait_queue ptsync_barrier;
};
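Intercepting the return from in-band to user mode
EVL may need to intercept a thread right before it returns from the in-band stage to user mode, for example to force that thread back to the oob stage before it leaves the Linux kernel and resumes user-mode execution. To this end, the thread must call back into EVL code on its way back to user mode, so that EVL can decide what to do next. **dovetail_request_ucall()** does this: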
/*
 * Post a request for the target task so that the INBAND_TASK_RETUSER
 * event is raised when it returns from the in-band stage to user
 * mode, i.e. right before it resumes user-mode execution.
 */
static inline void dovetail_request_ucall(struct task_struct *task)
{
	struct thread_info *ti = task_thread_info(task);

	if (test_ti_local_flags(ti, _TLF_DOVETAIL))
		set_ti_thread_flag(ti, TIF_RETUSER);
}
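The gating logic of this helper can be exercised with a user-space model. Everything here is hypothetical: the `MODEL_*` bit values are arbitrary, and `struct model_ti` only mimics the two flag words involved.

```c
#include <assert.h>
#include <stdbool.h>

/* Arbitrary bit values chosen for this model only. */
#define MODEL_TLF_DOVETAIL 0x1u	/* stands for _TLF_DOVETAIL */
#define MODEL_TIF_RETUSER  0x2u	/* stands for TIF_RETUSER */

/* Hypothetical stand-in for the two flag words of thread_info. */
struct model_ti {
	unsigned int local_flags;
	unsigned int flags;
};

/* The request is honoured only for tasks with alternate scheduling on. */
static void model_request_ucall(struct model_ti *ti)
{
	if (ti->local_flags & MODEL_TLF_DOVETAIL)
		ti->flags |= MODEL_TIF_RETUSER;
}

/* True if a return-to-user notification is pending for the task. */
static bool model_ucall_pending(const struct model_ti *ti)
{
	return (ti->flags & MODEL_TIF_RETUSER) != 0;
}
```

A task that never called dovetail_start_altsched() corresponds to a `model_ti` without `MODEL_TLF_DOVETAIL`, for which the request is silently ignored.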