Copy-on-Write Fork

Concepts: copy_on_write, fork, upcall, page_fault, scheduler_activation.

Notes on Part B

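Copy-on-Write Fork. The basic idea: when a parent env forks a child env, the contents of the parent's address space are not copied page by page into newly allocated physical pages for the child. Instead, only the va->pa mappings are copied, so two virtual addresses (one in each env) map to, and share, the same physical page. The shared page must be marked read-only (in the PTEs of both envs). When either env tries to modify the page, a PGFLT is raised, and the trap handler then allocates a new physical page for the writer and copies the contents into it.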
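
The idea can be tried outside JOS with a toy model. The sketch below is only an illustration under stated assumptions: sim_page, sim_mapping, sim_fork and sim_write are hypothetical names, the "page fault" is an ordinary if-test, and the reference-count shortcut (the last sharer skips the copy) is an addition that the JOS user-level pgfault shown later does not make.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SIM_PGSIZE 64   /* tiny page size, for illustration only */

/* Hypothetical "physical page" with a reference count. */
struct sim_page {
    int refs;
    uint8_t data[SIM_PGSIZE];
};

/* Hypothetical one-entry "page table": a mapping plus a writable flag. */
struct sim_mapping {
    struct sim_page *page;
    int writable;
};

/* Fork: copy the mapping, not the page; both sides become read-only. */
static void sim_fork(struct sim_mapping *parent, struct sim_mapping *child)
{
    parent->writable = 0;
    *child = *parent;
    parent->page->refs++;
}

/* Write one byte. A write through a read-only mapping plays the role
 * of the page fault and triggers the copy. */
static void sim_write(struct sim_mapping *m, size_t off, uint8_t val)
{
    assert(off < SIM_PGSIZE);
    if (!m->writable) {
        if (m->page->refs > 1) {
            /* "Fault": allocate a private page and copy the shared one. */
            struct sim_page *copy = malloc(sizeof(*copy));
            assert(copy != NULL);
            memcpy(copy->data, m->page->data, SIM_PGSIZE);
            copy->refs = 1;
            m->page->refs--;
            m->page = copy;
        }
        m->writable = 1;
    }
    m->page->data[off] = val;
}
```

After sim_fork both mappings point at the same page; the first sim_write through either mapping gives that side a private copy and leaves the other side's data untouched.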

Exercise 8

Things to know

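Page faults (PGFLT) raised in user mode fall into the following cases:

  • PGFLT caused by COW.
  • PGFLT caused by dynamic stack growth: initially only one page is allocated for the stack; when that page is not enough, a PGFLT is raised and the stack is extended.
  • PGFLT in the BSS segment: a physical page has to be allocated and initialized to 0.
  • PGFLT in the TEXT segment: the corresponding page of the binary has to be read from disk into memory and then mapped into the address space.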

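These user-mode page faults can be handled in the kernel or in user space. JOS handles them in user space: the kernel modifies the user env's stack and registers so that, on return to user mode, the env runs the upcall function it registered.
On kernel upcalls, here is a passage quoted from lkml: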

An upcall is a mechanism that allows the kernel to execute a function in userspace, and potentially be returned information as a result.

An upcall is like a signal, except that the kernel may use it at any time, for any purpose, including in an interrupt handler.

A process asks to use upcalls, and passes the kernel the addresses of a series of stacks to execute upcalls on. The kernel wires down down the stacks. The process registers functions associated with a set of predefined events (such as a page fault or blocking I/O). When such an event happens, the thread for which the event occured to doesn’t call schedule(), but instead switches to an upcall stack, constructs a dummy trap return so that on return to user space it will execute the upcall, and returns to user space via a trap return.

Even Larry will, I hope, admit that this is a pretty fast process, much faster than a context switch, and way faster than a call to any schedule().

Note however that the function NEVER RETURNS TO THE KERNEL.

The same mailing-list post also explains that upcalls can be used to implement scheduler activations and precise timing in user-space code:

Why would you want upcalls ? Well, we implemented upcalls specifically for a thread package that uses an idea called scheduler activations; every time a kernel thread blocks on I/O or suffers a page fault, the kernel “activates” the user level thread scheduler and tells it what happened. This way, the user level thread scheduler can continue to use the processor by deciding to run some other thread.

It would also allow much more precise timing for Linux user space code, because a process could register a function (and yes, it has to be a very carefully designed process) to be executed by the timer interrupt (probably the timer code BH), not whenever the process gets woken by the timer interrupt and then run.

Exercise 8

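Implement the sys_env_set_pgfault_upcall system call, through which a user env registers its upcall function. In lab4, struct Env gains a member env_pgfault_upcall that records the address, in the user address space, of the registered upcall. Take care to check all the arguments.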

static int
sys_env_set_pgfault_upcall(envid_t envid, void *func)
{
  int ret;
  struct Env * env;

  // check envid
  if ((ret = envid2env(envid, &env, 1)) < 0)
    return -E_BAD_ENV;
  
  // check func is in user space
  // For passing the test of user/faultevilhandler, this checking should be 
  // removed.
  /*
  if ((uintptr_t)func >= UTOP)
    return -E_INVAL;
  */

  // set upcall
  env->env_pgfault_upcall = func;

  return 0;
}

Exercise 9,10,11

Things to know

In JOS, a user env has two stacks:

  • normal user stack: the stack used during normal user execution.
  • exception stack: the stack on which the env runs its registered handler after a trap (here, a page fault).
/*
 * UTOP,UENVS ------>  +------------------------------+ 0xeec00000
 * UXSTACKTOP -/       |     User Exception Stack     | RW/RW  PGSIZE
 *                     +------------------------------+ 0xeebff000
 *                     |       Empty Memory (*)       | --/--  PGSIZE
 *    USTACKTOP  --->  +------------------------------+ 0xeebfe000
 *                     |      Normal User Stack       | RW/RW  PGSIZE
 *                     +------------------------------+ 0xeebfd000
 *                     |                              |
 *                     |                              |
 *                     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 *                     .                              .
 *                     .                              .
 *                     .                              .
 *                     |~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|
 *                     |     Program Data & Heap      |
 *    UTEXT -------->  +------------------------------+ 0x00800000
 */

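The library wraps the user's upcall function. The user's handler is stored in _pgfault_handler; the upcall that lib actually registers is _pgfault_upcall, which does the following:

  1. Calls the function pointed to by _pgfault_handler.
  2. When that call returns, uses the state saved in the UTrapframe to resume at the point where the user code trapped.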

Exercise 9

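Implement page_fault_handler. If the user env has registered an upcall, build a UTrapframe for it so that, on return to user mode, the upcall runs on the exception stack with the UTrapframe at the top of that stack.

The UTrapframe holds the user env's state at the time of the trap, so that execution can resume from the pre-trap state once the upcall has finished.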

struct UTrapframe {
  /* information about the fault */
  uint32_t utf_fault_va;  /* va for T_PGFLT, 0 otherwise */
  uint32_t utf_err;
  /* trap-time return state */
  struct PushRegs utf_regs;
  uintptr_t utf_eip;
  uint32_t utf_eflags;
  /* the trap-time stack to return to */
  uintptr_t utf_esp;
} __attribute__((packed));

/*
 *                     <-- UXSTACKTOP
 * trap-time esp
 * trap-time eflags
 * trap-time eip
 * trap-time eax       start of struct PushRegs
 * trap-time ecx
 * trap-time edx
 * trap-time ebx
 * trap-time esp
 * trap-time ebp
 * trap-time esi
 * trap-time edi       end of struct PushRegs
 * tf_err (error code)
 * fault_va            <-- %esp when handler is run
 * 
 */

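Notes:

  • If the env was running on the normal user stack before the trap, place the UTrapframe directly below UXSTACKTOP.
  • If the env was already running on the exception stack before the trap, first push an empty 32-bit word below the trap-time esp and only then push the UTrapframe. The reason becomes clear later.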
void
page_fault_handler(struct Trapframe *tf)
{
  uint32_t fault_va;

  // Read processor's CR2 register to find the faulting address
  fault_va = rcr2();

  // Handle kernel-mode page faults.

  if ((tf->tf_cs & 3 ) == 0){ 
    panic("page_fault_handler : page fault in kernel mode.\n");
    return ;
  }

  struct PageInfo *pp;
  int ret;
  // if an upcall is registered
  if (curenv->env_pgfault_upcall) {
    // Check that the user exception stack exists: env_create calls
    // load_icode, and load_icode does not allocate the user exception
    // stack. Without this check, if the PTE of the exception stack does
    // not have PTE_P set, another PGFLT would be generated.

    // To pass the user/faultnostack test, this exception stack check
    // should be removed.
    /* 
    pp = page_lookup(curenv->env_pgdir, (void*)(UXSTACKTOP-PGSIZE), 0);
    if (!pp) {
      // destroy this env.
      cprintf("[%08x] physical page of user exception stack not allocated.\n", 
        curenv->env_id);
      cprintf("[%08x] user fault va %08x ip %08x\n",
        curenv->env_id, fault_va, tf->tf_eip);
      print_trapframe(tf);
      env_destroy(curenv);
    }
    */
    
    // push UTrapframe into exception stack
    struct UTrapframe *utf;
    if ((uintptr_t)(UXSTACKTOP - PGSIZE) <= tf->tf_esp && 
      tf->tf_esp < (uintptr_t)UXSTACKTOP ) {
      utf = (struct UTrapframe *)(tf->tf_esp - 4 - sizeof(struct UTrapframe));
    } else {
      utf = (struct UTrapframe *)(UXSTACKTOP - sizeof(struct UTrapframe));
    }
    
    user_mem_assert(curenv, utf, sizeof(struct UTrapframe), 
      PTE_U | PTE_W | PTE_P);
    utf->utf_esp = tf->tf_esp;
    utf->utf_eflags = tf->tf_eflags;
    utf->utf_eip = tf->tf_eip;
    utf->utf_regs = tf->tf_regs;
    utf->utf_err = tf->tf_err;
    utf->utf_fault_va = fault_va;
    
    // in trap(), if the fault came from user mode, tf == &curenv->env_tf;
    // arrange for the upcall to run when we return to user mode.
    tf->tf_esp = (uintptr_t)utf;
    tf->tf_eip = (uintptr_t)curenv->env_pgfault_upcall;
    env_run(curenv); // never return 
  }

  // Destroy the environment that caused the fault.
  cprintf("[%08x] user fault va %08x ip %08x\n",
    curenv->env_id, fault_va, tf->tf_eip);
  print_trapframe(tf);
  env_destroy(curenv);
}
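
The placement rule for the UTrapframe can be checked in isolation. The following is a minimal sketch, not kernel code: utf_address is a hypothetical helper, the struct mirrors the layout shown above with plain 32-bit fields, and UXSTACKTOP is taken from the memory layout diagram.

```c
#include <assert.h>
#include <stdint.h>

#define PGSIZE     4096u
#define UXSTACKTOP 0xeec00000u   /* from the memory layout above */

/* Same layout as JOS's struct UTrapframe, using plain 32-bit fields. */
struct UTrapframe {
    uint32_t utf_fault_va;
    uint32_t utf_err;
    uint32_t utf_regs[8];        /* stands in for struct PushRegs */
    uint32_t utf_eip;
    uint32_t utf_eflags;
    uint32_t utf_esp;
} __attribute__((packed));

/* Hypothetical helper: where the kernel places the UTrapframe,
 * given the trap-time stack pointer. */
static uint32_t utf_address(uint32_t trap_esp)
{
    assert(sizeof(struct UTrapframe) == 52);
    if (trap_esp >= UXSTACKTOP - PGSIZE && trap_esp < UXSTACKTOP) {
        /* Recursive fault: leave one blank word, then the frame. */
        return trap_esp - 4 - (uint32_t)sizeof(struct UTrapframe);
    }
    /* First fault: the frame sits directly below UXSTACKTOP. */
    return UXSTACKTOP - (uint32_t)sizeof(struct UTrapframe);
}
```

For a fault taken on the normal stack, for example with esp = 0xeebfdf00, the frame lands at 0xeebfffcc. If the handler then faults again with esp = 0xeebfffcc, the nested frame lands at 0xeebfff94, leaving the word at 0xeebfffc8 free.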

Exercise 10

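The user registers a handler by calling lib's set_pgfault_handler. The user's handler is _pgfault_handler; the upcall that lib actually registers is _pgfault_upcall. _pgfault_upcall calls _pgfault_handler and, once it returns, restores the trap-time state saved in the UTrapframe.
Where the trap-time eip is stored:

  • If the trap happened on the normal stack, the trap-time eip is stored at trap-time esp - 4.
  • If the trap happened on the exception stack, the trap-time eip is stored in the 4-byte gap left between the two UTrapframes.

This way, once %esp has been switched to the adjusted trap-time esp, the first word popped is guaranteed to be the trap-time eip.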

.text
.globl _pgfault_upcall
_pgfault_upcall:
  // Call the C page fault handler.
  pushl %esp      // function argument: pointer to UTF
  movl _pgfault_handler, %eax
  call *%eax
  addl $4, %esp     // pop function argument
  
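  // The ret instruction pops %eip off the stack and continues execution there.
  // A picture of the exception stack helps in understanding this.
  // The instructions below turn state (1) into state (2) and, at the same
  // time, restore trap-time-regs, trap-time-esp, and trap-time-eflags.
  //  
  //  +----------UXSTACKTOP-----+   high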
  //  |            ...          |
  //  +-------------------------+
  //  |                         |
  //  +-------------------------+   
  //  |   trap-time-esp    (4B) |
  //  +-------------------------+   
  //  |   trap-time-eflags (4B) |
  //  +-------------------------+   
  //  |   trap-time-eip    (4B) |
  //  +-------------------------|   low
  //  |   trap-time-regs   (32B)|
  //  |   ...                   |
  //  |   ...                   |
  //  +-------------------------+   
  //  |   err              (4B) |
  //  +-------------------------+   
  //  |   fault_va         (4B) | 
  //  +-------------------------+   <-- cur_esp 
  //            (1)
  //  
  //  +----trap-time-stack------+
  //  |            ...          |
  //  +-------------------------+
  //  |   trap-time-eip    (4B) |
  //  +-------------------------+   <-- trap_time_esp
  //
  //            (2)
  //  
  //  
  // Now the C page fault handler has returned and you must return
  // to the trap time state.
  // Push trap-time %eip onto the trap-time stack.
  //
  // Explanation:
  //   We must prepare the trap-time stack for our eventual return to
  //   re-execute the instruction that faulted.
  //   Unfortunately, we can't return directly from the exception stack:
  //   We can't call 'jmp', since that requires that we load the address
  //   into a register, and all registers must have their trap-time
  //   values after the return.
  //   We can't call 'ret' from the exception stack either, since if we
  //   did, %esp would have the wrong value.
  //   So instead, we push the trap-time %eip onto the *trap-time* stack!
  //   Below we'll switch to that stack and call 'ret', which will
  //   restore %eip to its pre-fault value.
  //
  //   In the case of a recursive fault on the exception stack,
  //   note that the word we're pushing now will fit in the
  //   blank word that the kernel reserved for us.
  //
  // Throughout the remaining code, think carefully about what
  // registers are available for intermediate calculations.  You
  // may find that you have to rearrange your code in non-obvious
  // ways as registers become unavailable as scratch space.
  //
  // LAB 4: Your code here.
  // new_trap_time_esp = trap_time_esp - 4, for storing trap_time_eip.
  movl 0x30(%esp), %eax
  subl $0x4, %eax
  movl %eax, 0x30(%esp)
  
  // move trap_time_eip to new_trap_time_esp; the exception stack and the
  // user stack live in the same address space, so this works whether the
  // trap-time stack is the user stack or the exception stack
  movl 0x28(%esp), %ebx
  movl %ebx, (%eax)

  // Restore the trap-time registers.  After you do this, you
  // can no longer modify any general-purpose registers.
  // LAB 4: Your code here.
  addl $0x8, %esp
  popal

  // Restore eflags from the stack.  After you do this, you can
  // no longer use arithmetic operations or anything else that
  // modifies eflags.
  // LAB 4: Your code here.
  // skip trap_time_eip
  addl $0x4, %esp
  popfl

  // Switch back to the adjusted trap-time stack.
  // LAB 4: Your code here.
  popl %esp

  // Return to re-execute the instruction that faulted.
  // LAB 4: Your code here.
  ret
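
To convince yourself that the sequence above ends with %esp and %eip at their trap-time values, the stack manipulation can be replayed on a small array. This is only a sketch of the bookkeeping: sim_cpu, word and sim_upcall_return are hypothetical names, memory is 256 words, and general-purpose registers are skipped rather than restored.

```c
#include <assert.h>
#include <stdint.h>

#define SIM_WORDS 256   /* simulated memory: byte addresses 0..1023 */

/* Byte offsets into the UTrapframe, matching the assembly above. */
enum { UTF_EIP = 0x28, UTF_EFLAGS = 0x2c, UTF_ESP = 0x30 };

struct sim_cpu { uint32_t esp, eip, eflags; };

/* Access the 32-bit word at a simulated byte address. */
static uint32_t *word(uint32_t *mem, uint32_t addr)
{
    assert(addr % 4 == 0 && addr / 4 < SIM_WORDS);
    return &mem[addr / 4];
}

/* Replays the return path; on entry cpu->esp points at the UTrapframe. */
static void sim_upcall_return(uint32_t *mem, struct sim_cpu *cpu)
{
    /* Reserve one word on the trap-time stack and store eip there. */
    uint32_t new_esp = *word(mem, cpu->esp + UTF_ESP) - 4;
    *word(mem, cpu->esp + UTF_ESP) = new_esp;
    *word(mem, new_esp) = *word(mem, cpu->esp + UTF_EIP);

    cpu->esp += 8;                        /* addl $0x8: skip fault_va, err */
    cpu->esp += 32;                       /* popal: 8 registers            */
    cpu->esp += 4;                        /* addl $0x4: skip eip           */
    cpu->eflags = *word(mem, cpu->esp);   /* popfl                         */
    cpu->esp += 4;
    cpu->esp = *word(mem, cpu->esp);      /* popl %esp                     */
    cpu->eip = *word(mem, cpu->esp);      /* ret                           */
    cpu->esp += 4;
}
```

In the recursive case the frame sits at trap-time esp - 56, so the word written by the first step is exactly the blank word the kernel reserved and no saved state is overwritten.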

Exercise 11

Implement lib's set_pgfault_handler wrapper. handler is the function supplied by the user; _pgfault_upcall is the upcall wrapped by lib.

void
set_pgfault_handler(void (*handler)(struct UTrapframe *utf))
{
  int r;

  if (_pgfault_handler == 0) {
    // First time through!
    r = sys_page_alloc(thisenv->env_id, 
          (void *)(UXSTACKTOP-PGSIZE), PTE_U | PTE_P | PTE_W);
    if (r < 0) 
      panic("set_pgfault_handler : sys_page_alloc failed. %e.\n",r);
    // set page fault upcall
    r = sys_env_set_pgfault_upcall(thisenv->env_id, (void*)_pgfault_upcall);
    if (r < 0)
      panic("set_pgfault_handler : sys_env_set_pgfault_upcall failed. %e.\n",r);
  }

  // Save handler pointer for assembly to call.
  _pgfault_handler = handler;
}

Exercise 12

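Things to know

JOS uses a page-table trick. mem_init() (defined in kern/pmap.c and called from kern/init.c) initializes kern_pgdir with the following line, shown after the relevant part of the memory layout:

/* About UVPT: PTSIZE = (PGSIZE*NPTENTRIES)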
 * ULIM, MMIOBASE -->  +------------------------------+ 0xef800000
 *                     |  Cur. Page Table (User R-)   | R-/R-  PTSIZE
 *    UVPT      ---->  +------------------------------+ 0xef400000
 *                     |          RO PAGES            | R-/R-  PTSIZE
 *    UPAGES    ---->  +------------------------------+ 0xef000000
 *                     |           RO ENVS            | R-/R-  PTSIZE
 * UTOP,UENVS ------>  +------------------------------+ 0xeec00000
 */ 


kern_pgdir[PDX(UVPT)] = PADDR(kern_pgdir) | PTE_U | PTE_P;

env_setup_vm copies the initialized part of kern_pgdir (the mappings above UTOP) into the new page directory and then points the UVPT entry at env_pgdir itself.

memmove(e->env_pgdir, kern_pgdir, PGSIZE);
// UVPT maps the env's own page table read-only.
// Permissions: kernel R, user R
e->env_pgdir[PDX(UVPT)] = PADDR(e->env_pgdir) | PTE_P | PTE_U;  // (1)

lib/entry.S defines the following global symbols:

.data
// Define the global symbols 'envs', 'pages', 'uvpt', and 'uvpd'
// so that they can be used in C as if they were ordinary global arrays.
  .globl envs
  .set envs, UENVS
  .globl pages
  .set pages, UPAGES
  .globl uvpt
  .set uvpt, UVPT
  .globl uvpd
  .set uvpd, (UVPT+(UVPT>>12)*4)

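Both env_pgdir and kern_pgdir are PGSIZE bytes, so each can be viewed either as a page table or as a page directory:

  • Viewed as a page table: the PTE of the page with PGNUM(va) == N is stored in uvpt[N].
  • Viewed as a page directory: the virtual address uvpd maps to the physical page of env_pgdir, so uvpd[PDX(va)] is the PDE for va.

Why is uvpt[PGNUM(va)] the PTE for va?
uvpt has type pte_t *, so the byte address of uvpt[PGNUM(va)] is uvpt + PGNUM(va) * sizeof(pte_t), that is,
&uvpt[PGNUM(va)] = UVPT + (PGNUM(va) << 2). UVPT = 0xef400000, so its low 22 bits are all 0 and the sum splits into the following fields: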

31            22|21             12 | 11       2| 1 0 | 
   PDX(UVPT)    |    PDX(va)       |  PTX(va)  |     |
      a1        |        a2        |     a3    |     |

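During address translation:

  1. The a1 part selects the entry that points back at env_pgdir's own physical page, ppn1 (the MMU treats ppn1 as a page table).
  2. The a2 part, PDX(va), is used as the index into ppn1 and yields ppn2, the page table that covers va (the MMU treats ppn2 as an ordinary physical page, not as a page table).
  3. Within ppn2, the page table for va, PTX(va) selects the PTE for va (to the MMU this is not a PTE, only the value at some offset inside a page).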
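
The field split can be verified with a few lines of arithmetic. This is a small sketch: the macros are re-declared here with the values JOS uses in inc/mmu.h and inc/memlayout.h, and uvpt_entry_addr is a hypothetical helper name.

```c
#include <assert.h>   /* for callers that want to assert on the results */
#include <stdint.h>

#define UVPT      0xef400000u
#define PDX(va)   (((uint32_t)(va) >> 22) & 0x3ff)
#define PTX(va)   (((uint32_t)(va) >> 12) & 0x3ff)
#define PGNUM(va) ((uint32_t)(va) >> 12)
#define PGOFF(va) ((uint32_t)(va) & 0xfff)

/* Virtual address of uvpt[PGNUM(va)], i.e. of the PTE that maps va. */
static uint32_t uvpt_entry_addr(uint32_t va)
{
    return UVPT + (PGNUM(va) << 2);
}
```

For va = 0xeebfd000 (the normal user stack page) this gives 0xef7baff4, whose three fields are PDX(UVPT) = 0x3bd, PDX(va) = 0x3ba and PTX(va) * 4 = 0xff4, exactly the a1/a2/a3 layout drawn above.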

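Why does the virtual address uvpd map to the physical page of env_pgdir?
UVPT+(UVPT>>12)*4 differs from UVPT+(UVPT>>10) in that the former drops bits 10 and 11 of UVPT, so the two low bits of the result are always zero.
The value of uvpd is: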

31            22|21             12 | 11         0 | 
   PDX(UVPT)    |    PDX(UVPT)     |       0      |
      a1        |        a2        |              |

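During address translation:

  1. In ppn1, the physical page of env_pgdir, the a1 part selects ppn2, and ppn2 == ppn1.
  2. In ppn2, the a2 part selects ppn3, and ppn3 == ppn1.
  3. The offset into ppn3 is 0, so the access reads entry 0 of ppn1, that is, entry 0 of env_pgdir.

In my opinion, once these points are understood, the rest of the code can be written by following the hints.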
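
Both walks can be replayed with a software model of two-level paging. The sketch below is an illustration only: sim_phys, sim_read and sim_setup are hypothetical names, "physical memory" is four pages, and permission bits other than PTE_P are ignored during the walk.

```c
#include <assert.h>
#include <stdint.h>

#define UVPT    0xef400000u
#define PTE_P   0x001u
#define PTE_W   0x002u
#define PDX(va) (((uint32_t)(va) >> 22) & 0x3ff)
#define PTX(va) (((uint32_t)(va) >> 12) & 0x3ff)

#define SIM_NPAGES 4
/* Simulated physical memory: SIM_NPAGES pages of 1024 words each. */
static uint32_t sim_phys[SIM_NPAGES][1024];

/* Read the word at virtual address va with a two-level walk that starts
 * at page directory pgdir_ppn. Returns 0 if a level is not present. */
static uint32_t sim_read(uint32_t pgdir_ppn, uint32_t va)
{
    uint32_t pde = sim_phys[pgdir_ppn][PDX(va)];
    if (!(pde & PTE_P))
        return 0;
    assert((pde >> 12) < SIM_NPAGES);
    uint32_t pte = sim_phys[pde >> 12][PTX(va)];
    if (!(pte & PTE_P))
        return 0;
    assert((pte >> 12) < SIM_NPAGES);
    return sim_phys[pte >> 12][(va & 0xfff) >> 2];
}

/* Page 1 = page directory (with the self-map), page 2 = page table
 * covering va, page 3 = the data page mapped at va. */
static void sim_setup(uint32_t va)
{
    sim_phys[1][PDX(UVPT)] = (1u << 12) | PTE_P;   /* the self-map entry */
    sim_phys[1][PDX(va)]   = (2u << 12) | PTE_P;
    sim_phys[2][PTX(va)]   = (3u << 12) | PTE_P | PTE_W;
}
```

After sim_setup(va), reading UVPT + PGNUM(va) * 4 through the walk returns the PTE stored in page 2, and reading uvpd + PDX(va) * 4 (with uvpd = UVPT + (UVPT >> 12) * 4 = 0xef7bd000) returns the PDE stored in page 1, without any special-case code in the walker.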

Exercise 12

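Implement COW fork. fork calls duppage to mark PTEs as COW; a later write to a COW page raises a PGFLT, and the pgfault function performs the copy.
Remember to check that the PDE has PTE_P set. Remember to mark the parent env's page as PTE_COW as well, which in effect clears PTE_W. PTE_COW is a software-defined bit.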

// lib/fork.c
static int
duppage(envid_t envid, unsigned pn)
{
  int r;
  void * va = (void *) (pn << PGSHIFT);

  // check page dir PTE_P exist
  if (!(uvpd[PDX(pn << PGSHIFT)] & PTE_P )) 
    panic("duppage : page dir PTE_P is not set.\n");

  // check page is PTE_W or PTE_COW
  if (!(uvpt[pn] & ( PTE_W | PTE_COW )))
    panic("duppage : page is not PTE_W or PTE_COW.\n");

  // map child's page as PTE_COW
  r = sys_page_map(0, va, envid, va, PTE_U | PTE_COW | PTE_P);
  if (r < 0)
    panic("duppage : sys_page_map error : %e.\n",r);
 
  // remap parent's page as PTE_COW, make PTE_W invalid.
  r = sys_page_map(0, va, 0, va, PTE_U | PTE_COW | PTE_P);
  if (r < 0) 
    panic("duppage : sys_page_map error : %e.\n", r);

  return 0;
}
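
The duppage above panics on a page that is neither PTE_W nor PTE_COW, and fork skips such pages. One possible variant, sketched here as an assumption and not as what the code above does, is to share present read-only pages with the child read-only. dup_perm is a hypothetical helper; the flag values are the ones JOS defines in inc/mmu.h and inc/lib.h.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_P   0x001u
#define PTE_W   0x002u
#define PTE_U   0x004u
#define PTE_COW 0x800u   /* software-defined bit */

/* Hypothetical helper: permission to use when mapping a present page
 * into the child. *remap_parent is set to 1 when the parent's own
 * mapping must also be changed to COW. */
static uint32_t dup_perm(uint32_t pte, int *remap_parent)
{
    assert(pte & PTE_P);
    if (pte & (PTE_W | PTE_COW)) {
        *remap_parent = 1;
        return PTE_U | PTE_P | PTE_COW;   /* writable or COW: share as COW */
    }
    *remap_parent = 0;
    return PTE_U | PTE_P;                 /* read-only: share as-is */
}
```

Whichever variant is used, the order in duppage matters: the child is mapped first and the parent is remapped afterwards, so the parent cannot write to the page between the two calls while the child still expects the old contents.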

On a page fault on a COW page, a page is copied from env1 to env2 (env1 and env2 must be parent and child, or the same env). In env2 this takes the following steps:

  1. Allocate a new page in env2 and map it at env2_tmp_va.
  2. memmove(env2_tmp_va, va, PGSIZE)
  3. page_map(env2_tmp_va, va)
  4. page_unmap(env2_tmp_va)
static void
pgfault(struct UTrapframe *utf)
{
  void *addr = (void *) utf->utf_fault_va;
  uint32_t err = utf->utf_err;
  int r;

  if ( !(uvpd[PDX(addr)] & PTE_P ) ) 
    panic("pgfault : page dir PTE_P not set.\n");
  if (((err & FEC_WR) != FEC_WR) || !(uvpt[PGNUM(addr)] & PTE_COW) ) 
    panic("pgfault : pagefault %08x not FEC_WR or PTE_COW.\n",err);

  // allocate  PFTEMP -> new page
  r = sys_page_alloc(0, PFTEMP, PTE_U | PTE_P | PTE_W);
  if (r < 0)
    panic("pgfault : sys_page_alloc error : %e.\n",r);

  // copy old page -> new page; addr must be page-aligned first
  // (I was stuck on this for a long time)
  addr = ROUNDDOWN(addr, PGSIZE);
  memmove(PFTEMP, addr, PGSIZE);

  // make addr -> new page
  r = sys_page_map(0, PFTEMP, 0, addr, PTE_U | PTE_P | PTE_W);
  if (r < 0)
    panic("pgfault : sys_page_map error : %e.\n",r);

  // delete map of PFTEMP -> new page
  r = sys_page_unmap(0, PFTEMP);
  if (r < 0)
    panic("pgfault : sys_page_unmap error : %e.\n",r);

  return ;
}
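
The two details that are easy to get wrong in pgfault, the fault check and the page alignment, are shown below in isolation. This is a sketch with hypothetical helper names (page_base, is_cow_write_fault); the constants match inc/mmu.h and inc/lib.h, and page_base computes the same value as ROUNDDOWN(addr, PGSIZE) for a power-of-two page size.

```c
#include <assert.h>
#include <stdint.h>

#define PGSIZE  4096u
#define FEC_WR  0x2u     /* page fault error code bit: caused by a write */
#define PTE_P   0x001u
#define PTE_COW 0x800u

/* Start address of the page that contains addr. */
static uint32_t page_base(uint32_t addr)
{
    assert((PGSIZE & (PGSIZE - 1)) == 0);   /* mask trick needs 2^n */
    return addr & ~(PGSIZE - 1);
}

/* Nonzero only for a write fault on a present page marked COW. */
static int is_cow_write_fault(uint32_t err, uint32_t pte)
{
    return (err & FEC_WR) && (pte & PTE_P) && (pte & PTE_COW);
}
```

utf_fault_va is the exact faulting address, for example 0x00801234, while sys_page_map and the memmove need the page start 0x00801000. Passing the unaligned address copies the wrong 4096 bytes or makes the mapping call fail.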

So fork ends up looking like this:

envid_t
fork(void)
{
  envid_t envid;
  uintptr_t va;
  int r;

  // set pagefault handler
  set_pgfault_handler(pgfault);

  // allocate child env
  envid = sys_exofork();
  if (envid < 0) 
    panic("fork : sys_exofork error, %e.\n", envid);

  // Executing at child 
  if (envid == 0) {
    thisenv = &envs[ENVX(sys_getenvid())];
    return 0;
  }

  // Child's env initialization : 
  // 1. fields of struct Env itself.
  // 2. child env's page table initialization (address space)
  // 
  // For 1. part of struct Env is initialized in sys_exofork; the
  // exception stack and pgfault_upcall remain to be initialized.
  // For 2. create envid's address space

  // 2.1. Duppage the pages in [UTEXT, USTACKTOP) that are PTE_P and
  // either PTE_W or PTE_COW; first check that the PDE has PTE_P set
  for (va = UTEXT ; va < USTACKTOP; va += PGSIZE){
    if ((uvpd[PDX(va)] & PTE_P) && (uvpt[PGNUM(va)] & PTE_P) && 
        (uvpt[PGNUM(va)] & PTE_U) && (uvpt[PGNUM(va)] & (PTE_W | PTE_COW)))
      duppage(envid, PGNUM(va));
    
    // Pages that are neither PTE_W nor PTE_COW are skipped here; some
    // of them are read-only for protection reasons.
  }

  // 1.2. Create the exception stack. The parent's exception stack cannot
  // be duppage'd: the parent's page fault handler is running on it
  // right now, so it must stay writable.
  r = sys_page_alloc(envid, (void*)(UXSTACKTOP-PGSIZE), PTE_U | PTE_P | PTE_W);
  if (r < 0)
    panic("[%08x] fork : sys_page_alloc error : %e.\n", thisenv->env_id, r);

  // 1.1 Set child's page fault handler -- initialize 
  // child_env->env_pgfault_upcall
  r = sys_env_set_pgfault_upcall(envid, (void*)_pgfault_upcall);
  if (r < 0)
    panic("[%08x] fork : sys_env_set_pgfault_upcall error : %e.\n", 
      thisenv->env_id, r);

  // Child is ready to run, make it RUNNABLE
  r = sys_env_set_status(envid, ENV_RUNNABLE);
  if (r < 0)
    panic("[%08x] fork : sys_env_set_status error : %e",thisenv->env_id, r);

  return envid;
}
