From matthew at wil.cx Wed Oct 1 09:07:07 2008 From: matthew at wil.cx (Matthew Wilcox) Date: Wed, 1 Oct 2008 10:07:07 -0600 Subject: [PATCH 6/6 v3] PCI: document the change In-Reply-To: References: Message-ID: <20081001160706.GI13822@parisc-linux.org> On Sat, Sep 27, 2008 at 04:28:45PM +0800, Zhao, Yu wrote: > +++ b/Documentation/DocBook/kernel-api.tmpl > @@ -239,6 +239,7 @@ X!Ekernel/module.c > > > PCI Support Library > +!Iinclude/linux/pci.h Why do you need to do this? Thus far, all the documentation has been with the implementation, not in the header file. > +1.2 What is ARI > + > +Alternative Routing-ID Interpretation (ARI) allows a PCI Express Endpoint > +to use its device number field as part of function number. Traditionally, > +an Endpoint can only have 8 functions, and the device number of all > +Endpoints is zero. With ARI enabled, an Endpoint can have up to 256 > +functions by using device number in conjunction with function number to > +indicate a function in the device. This is almost transparent to the Linux > +kernel because the Linux kernel still can use 8-bit bus number field plus > +8-bit devfn number field to locate a function. ARI is managed via the ARI > +Forwarding bit in the Device Capabilities 2 register of the PCI Express > +Capability on the Root Port or the Downstream Port and a new ARI Capability > +on the Endpoint. I don't think this section actually helps a software developer use SR-IOV, does it? > +2. User Guide > + > +2.1 How can I manage SR-IOV > + > +If a device supports SR-IOV, then there should be some entries under > +Physical Function's PCI device directory. These entries are in directory: > + - /sys/bus/pci/devices/BB:DD.F/iov/ > + (BB:DD:F is bus:dev:fun) The 'domain:' prefix has been there for a long time now. > +and > + - /sys/bus/pci/devices/BB:DD.F/iov/N > + (N is VF number from 0 to initialvfs-1) > + > +To enable or disable SR-IOV: > + - /sys/bus/pci/devices/BB:DD.F/iov/enable > + (writing 1/0 means enable/disable VFs, state change will > + notify PF driver) > + > +To change number of Virtual Functions: > + - /sys/bus/pci/devices/BB:DD.F/iov/numvfs > + (writing positive integer to this file will change NumVFs) > + > +The total and initial number of VFs can get from: > + - /sys/bus/pci/devices/BB:DD.F/iov/totalvfs > + - /sys/bus/pci/devices/BB:DD.F/iov/initialvfs > + > +The identifier of a VF that belongs to this PF can get from: > + - /sys/bus/pci/devices/BB:DD.F/iov/N/rid > + (for all class of devices) Wouldn't it be more useful to have the iov/N directories be a symlink to the actual pci_dev used by the virtual function? > +For network device, there are: > + - /sys/bus/pci/devices/BB:DD.F/iov/N/mac > + - /sys/bus/pci/devices/BB:DD.F/iov/N/vlan > + (value update will notify PF driver) We already have tools to set the MAC and VLAN parameters for network devices. > +To register SR-IOV service, Physical Function device driver needs to call: > + int pci_iov_register(struct pci_dev *dev, > + int (*notify)(struct pci_dev *, u32), char **entries) I think a better interface would put the 'notify' into the struct pci_driver. That would make 'notify' a bad name .... how about 'virtual'? There's also no documentation for the second parameter to 'notify'. > +Note: entries could be NULL if PF driver doesn't want to create new entries > +under /sys/bus/pci/devices/BB:DD.F/iov/N/. So 'entries' is a list of names to create sysfs entries for? 
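(For reference, a minimal sketch of how a Physical Function driver might call the interface as documented in the quoted patch; the callback name, the NULL termination of the entries array, and the meaning of the undocumented u32 argument are illustrative assumptions, not taken from the patch:)

    static int pf_sriov_notify(struct pci_dev *dev, u32 event)
    {
            /* Assumed: 'event' reports VF enable/disable or a change to
             * one of the per-VF sysfs entries; the patch leaves this
             * parameter undocumented. */
            return 0;
    }

    /* Extra per-VF sysfs entries to create under .../iov/N/; may be NULL. */
    static char *pf_sriov_entries[] = { "mac", "vlan", NULL };

    static int pf_probe(struct pci_dev *dev, const struct pci_device_id *id)
    {
            return pci_iov_register(dev, pf_sriov_notify, pf_sriov_entries);
    }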
> +To enable SR-IOV, Physical Function device driver needs to call: > + int pci_iov_enable(struct pci_dev *dev, int numvfs) > + > +To disable SR-IOV, Physical Function device driver needs to call: > + void pci_iov_disable(struct pci_dev *dev) I'm not 100% convinced about this API. The assumption here is that the driver will do it, but I think it should probably be in the core. The driver probably wants to be notified that the PCI core is going to create a virtual function, and would it please prepare to do so, but I'm not convinced this should be triggered by the driver. How would the driver decide to create a new virtual function? > +To read or write VFs configuration: > + - int pci_iov_read_config(struct pci_dev *dev, int id, > + char *entry, char *buf, int size); > + - int pci_iov_write_config(struct pci_dev *dev, int id, > + char *entry, char *buf); I think we'd be better off having the driver create its own sysfs entries if it really needs to. > +3.2 Usage example > + > +Following piece of code illustrates the usage of APIs above. > + [...] > +static struct pci_driver dev_driver = { > + .name = "SR-IOV Physical Function driver", > + .id_table = dev_id_table, > + .probe = dev_probe, > + .remove = __devexit_p(dev_remove), > +#ifdef CONFIG_PM > + .suspend = dev_suspend, > + .resume = dev_resume, > +#endif > +}; From jbarnes at virtuousgeek.org Wed Oct 1 09:52:42 2008 From: jbarnes at virtuousgeek.org (Jesse Barnes) Date: Wed, 1 Oct 2008 09:52:42 -0700 Subject: [PATCH 1/6 v3] PCI: export some functions and macros In-Reply-To: References: Message-ID: <200810010952.44016.jbarnes@virtuousgeek.org> On Saturday, September 27, 2008 1:27 am Zhao, Yu wrote: > Export some functions and move some macros from c file to header file. > diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c > index 9c71858..f99160d 100644 > --- a/drivers/pci/pci-sysfs.c > +++ b/drivers/pci/pci-sysfs.c > @@ -696,7 +696,7 @@ static struct bin_attribute pci_config_attr = { > .name = "config", > .mode = S_IRUGO | S_IWUSR, > }, > - .size = 256, > + .size = PCI_CFG_SPACE_SIZE, > .read = pci_read_config, > .write = pci_write_config, I just pushed Yanmin's cleanups here, can you separate out the rest of your config space size changes and push them separately? > extern int pci_uevent(struct device *dev, struct kobj_uevent_env *env); > @@ -144,3 +150,17 @@ struct pci_slot_attribute { > }; > #define to_pci_slot_attr(s) container_of(s, struct pci_slot_attribute, > attr) > > +enum pci_bar_type { > + pci_bar_unknown, /* Standard PCI BAR probe */ > + pci_bar_io, /* An io port BAR */ > + pci_bar_mem32, /* A 32-bit memory BAR */ > + pci_bar_mem64, /* A 64-bit memory BAR */ > + pci_bar_rom, /* A ROM BAR */ > +}; > + > +extern int pci_read_base(struct pci_dev *dev, enum pci_bar_type type, > + struct resource *res, unsigned int reg); > +extern struct pci_bus *pci_alloc_child_bus(struct pci_bus *parent, > + struct pci_dev *bridge, int busnr); > + > +#endif /* DRIVERS_PCI_H */ See Matthew's comments here; the pci_read_base changes should be part of a separate patch too. 
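(For context, a hedged sketch of the kind of caller that motivates exporting pci_read_base(): probing a BAR that lives at a non-standard config-space offset, such as a VF BAR inside the SR-IOV capability. The 0x24 offset and the helper name are illustrative assumptions, not taken from this patch:)

    /* Illustrative only: size one VF BAR that sits inside the SR-IOV
     * capability instead of at the standard BAR offsets. */
    static void probe_vf_bar(struct pci_dev *dev, int sriov_cap, int bar)
    {
            struct resource res = {};
            unsigned int reg = sriov_cap + 0x24 + bar * 4; /* assumed layout */

            pci_read_base(dev, pci_bar_unknown, &res, reg);
    }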
> a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c > index 3abbfad..6c78cf8 100644 > --- a/drivers/pci/setup-bus.c > +++ b/drivers/pci/setup-bus.c > @@ -299,7 +299,7 @@ static void pbus_size_io(struct pci_bus *bus) > > if (r->parent || !(r->flags & IORESOURCE_IO)) > continue; > - r_size = r->end - r->start + 1; > + r_size = resource_size(r); > > if (r_size < 0x400) > /* Might be re-aligned for ISA */ > @@ -350,7 +350,7 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned > long mask, unsigned long > > if (r->parent || (r->flags & mask) != type) > continue; > - r_size = r->end - r->start + 1; > + r_size = resource_size(r); > /* For bridges size != alignment */ > align = resource_alignment(r); > order = __ffs(align) - 20; > diff --git a/drivers/pci/setup-res.c b/drivers/pci/setup-res.c > index 1a5fc83..56e4042 100644 > --- a/drivers/pci/setup-res.c > +++ b/drivers/pci/setup-res.c > @@ -133,7 +133,7 @@ int pci_assign_resource(struct pci_dev *dev, int resno) > resource_size_t size, min, align; > int ret; > > - size = res->end - res->start + 1; > + size = resource_size(res); > min = (res->flags & IORESOURCE_IO) ? PCIBIOS_MIN_IO : > PCIBIOS_MIN_MEM; > > align = resource_alignment(res); These resource_size changes seem like good cleanups by themselves, can you separate them out into a separate patch? > diff --git a/include/linux/pci.h b/include/linux/pci.h > index 98dc624..cc78be6 100644 > --- a/include/linux/pci.h > +++ b/include/linux/pci.h > @@ -456,8 +456,8 @@ struct pci_driver { > > /** > * PCI_VDEVICE - macro used to describe a specific pci device in short > form - * @vend: the vendor name > - * @dev: the 16 bit PCI Device ID > + * @vendor: the vendor name > + * @device: the 16 bit PCI Device ID > * > * This macro is used to create a struct pci_device_id that matches a > * specific PCI device. The subvendor, and subdevice fields will be set Another good standalone cleanup, please submit separately. Thanks, -- Jesse Barnes, Intel Open Source Technology Center From akataria at vmware.com Wed Oct 1 10:14:02 2008 From: akataria at vmware.com (Alok Kataria) Date: Wed, 01 Oct 2008 10:14:02 -0700 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. Message-ID: <1222881242.9381.17.camel@alok-dev1> Hi, Please find below the proposal for the generic use of cpuid space allotted for hypervisors. Apart from this cpuid space another thing worth noting would be that, Intel & AMD reserve the MSRs from 0x40000000 - 0x400000FF for software use. Though the proposal doesn't talk about MSR's right now, we should be aware of these reservations as we may want to extend the way we use CPUID to MSR usage as well. While we are at it, we also think we should form a group which has at least one person representing each of the hypervisors interested in generalizing the hypervisor CPUID space for Linux guest OS. This group will be informed whenever a new CPUID leaf from the generic space is to be used. This would help avoid any duplicate definitions for a CPUID semantic by two different hypervisors. I think most of the people are subscribed to LKML or the virtualization lists and we should use these lists as a platform to decide on things. Thanks, Alok --- Hypervisor CPUID Interface Proposal ----------------------------------- Intel & AMD have reserved cpuid levels 0x40000000 - 0x400000FF for software use. Hypervisors can use these levels to provide an interface to pass information from the hypervisor to the guest running inside a virtual machine. 
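(For illustration, a minimal sketch of how a guest might probe this range, using the Linux cpuid() helper; the register layout shown follows the information leaf defined later in this proposal, and the function name is just an example:)

    static void probe_hypervisor_range(void)
    {
            unsigned int eax, ebx, ecx, edx;
            char sig[13];

            /* Leaf 0x40000000: EAX = highest supported leaf in this range,
             * EBX/ECX/EDX = hypervisor vendor ID signature. */
            cpuid(0x40000000, &eax, &ebx, &ecx, &edx);

            memcpy(sig + 0, &ebx, 4);
            memcpy(sig + 4, &ecx, 4);
            memcpy(sig + 8, &edx, 4);
            sig[12] = '\0';
    }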
This proposal defines a standard framework for the way in which the Linux and hypervisor communities incrementally define this CPUID space. (This proposal may be adopted by other guest OSes. However, that is not a requirement because a hypervisor can expose a different CPUID interface depending on the guest OS type that is specified by the VM configuration.) Hypervisor Present Bit: Bit 31 of ECX of CPUID leaf 0x1. This bit has been reserved by Intel & AMD for use by hypervisors, and indicates the presence of a hypervisor. Virtual CPU's (hypervisors) set this bit to 1 and physical CPU's (all existing and future cpu's) set this bit to zero. This bit can be probed by the guest software to detect whether they are running inside a virtual machine. Hypervisor CPUID Information Leaf: Leaf 0x40000000. This leaf returns the CPUID leaf range supported by the hypervisor and the hypervisor vendor signature. # EAX: The maximum input value for CPUID supported by the hypervisor. # EBX, ECX, EDX: Hypervisor vendor ID signature. Hypervisor Specific Leaves: Leaf range 0x40000001 - 0x4000000F. These cpuid leaves are reserved as hypervisor specific leaves. The semantics of these 15 leaves depend on the signature read from the "Hypervisor Information Leaf". Generic Leaves: Leaf range 0x40000010 - 0x4000000FF. The semantics of these leaves are consistent across all hypervisors. This allows the guest kernel to probe and interpret these leaves without checking for a hypervisor signature. A hypervisor can indicate that a leaf or a leaf's field is unsupported by returning zero when that leaf or field is probed. To avoid the situation where multiple hypervisors attempt to define the semantics for the same leaf during development, we can partition the generic leaf space to allow each hypervisor to define a part of the generic space. For instance: VMware could define 0x4000001X Xen could define 0x4000002X KVM could define 0x4000003X and so on... Note that hypervisors can implement any leaves that have been defined in the generic leaf space whenever common features can be found. For example, VMware hypervisors can implement leafs that have been defined in the KVM area 0x4000003X and vice versa. The kernel can detect the support for a generic field inside leaf 0x400000XY using the following algorithm: 1. Get EAX from Leaf 0x400000000, Hypervisor CPUID information. EAX returns the maximum input value for the hypervisor CPUID space. If EAX < 0x400000XY, then the field is not available. 2. Else, extract the field from the target Leaf 0x400000XY by doing cpuid(0x400000XY). If (field == 0), this feature is unsupported/unimplemented by the hypervisor. The kernel should handle this case gracefully so that a hypervisor is never required to support or implement any particular generic leaf. -------------------------------------------------------------------------------- Definition of the Generic CPUID space. Leaf 0x40000010, Timing Information. VMware has defined the first generic leaf to provide timing information. This leaf returns the current TSC frequency and current Bus frequency in kHz. # EAX: (Virtual) TSC frequency in kHz. # EBX: (Virtual) Bus (local apic timer) frequency in kHz. # ECX, EDX: RESERVED (Per above, reserved fields are set to zero). -------------------------------------------------------------------------------- Written By, Alok N Kataria Dan Hecht Inputs from, Jun Nakajima From hpa at zytor.com Wed Oct 1 10:21:33 2008 From: hpa at zytor.com (H. 
Peter Anvin) Date: Wed, 01 Oct 2008 10:21:33 -0700 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. In-Reply-To: <1222881242.9381.17.camel@alok-dev1> References: <1222881242.9381.17.camel@alok-dev1> Message-ID: <48E3B19D.6060905@zytor.com> Alok Kataria wrote: > > (This proposal may be adopted by other guest OSes. However, that is not > a requirement because a hypervisor can expose a different CPUID > interface depending on the guest OS type that is specified by the VM > configuration.) > Excuse me, but that is blatantly idiotic. Expecting the user having to configure a VM to match the target OS is *exactly* as stupid as expecting the user to reconfigure the BIOS. It's totally the wrong thing to do. -hpa From akataria at vmware.com Wed Oct 1 10:33:51 2008 From: akataria at vmware.com (Alok Kataria) Date: Wed, 01 Oct 2008 10:33:51 -0700 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. In-Reply-To: <48E3B19D.6060905@zytor.com> References: <1222881242.9381.17.camel@alok-dev1> <48E3B19D.6060905@zytor.com> Message-ID: <1222882431.9381.23.camel@alok-dev1> On Wed, 2008-10-01 at 10:21 -0700, H. Peter Anvin wrote: > Alok Kataria wrote: > > > > (This proposal may be adopted by other guest OSes. However, that is not > > a requirement because a hypervisor can expose a different CPUID > > interface depending on the guest OS type that is specified by the VM > > configuration.) > > > > Excuse me, but that is blatantly idiotic. Expecting the user having to > configure a VM to match the target OS is *exactly* as stupid as > expecting the user to reconfigure the BIOS. It's totally the wrong > thing to do. Hi Peter, Its not a user who has to do anything special here. There are *intelligent* VM developers out there who can export a different CPUid interface depending on the guest OS type. And this is what most of the hypervisors do (not necessarily for CPUID, but for other things right now). Alok. > > -hpa From hpa at zytor.com Wed Oct 1 10:45:55 2008 From: hpa at zytor.com (H. Peter Anvin) Date: Wed, 01 Oct 2008 10:45:55 -0700 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. In-Reply-To: <1222882431.9381.23.camel@alok-dev1> References: <1222881242.9381.17.camel@alok-dev1> <48E3B19D.6060905@zytor.com> <1222882431.9381.23.camel@alok-dev1> Message-ID: <48E3B753.7000309@zytor.com> Alok Kataria wrote: > > Hi Peter, > > Its not a user who has to do anything special here. > There are *intelligent* VM developers out there who can export a > different CPUid interface depending on the guest OS type. And this is > what most of the hypervisors do (not necessarily for CPUID, but for > other things right now). > It doesn't matter, really; it's still the wrong thing to do, for the same reason it's the wrong thing in -- for example -- ACPI, which has similar "cleverness". If we want to have a "Linux standard CPUID interface" suite we should just put them on a different set of numbers and let a hypervisor export all the interfaces. -hpa From hpa at zytor.com Wed Oct 1 10:47:48 2008 From: hpa at zytor.com (H. Peter Anvin) Date: Wed, 01 Oct 2008 10:47:48 -0700 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. In-Reply-To: <1222881242.9381.17.camel@alok-dev1> References: <1222881242.9381.17.camel@alok-dev1> Message-ID: <48E3B7C4.2020707@zytor.com> Alok Kataria wrote: > > Hypervisor CPUID Interface Proposal > ----------------------------------- > > Intel & AMD have reserved cpuid levels 0x40000000 - 0x400000FF for > software use. 
Hypervisors can use these levels to provide an interface > to pass information from the hypervisor to the guest running inside a > virtual machine. > > This proposal defines a standard framework for the way in which the > Linux and hypervisor communities incrementally define this CPUID space. > I also observe that your proposal provides no mean of positive identification, i.e. that a hypervisor actually conforms to your proposal. -hpa From jeremy at goop.org Wed Oct 1 11:04:49 2008 From: jeremy at goop.org (Jeremy Fitzhardinge) Date: Wed, 01 Oct 2008 11:04:49 -0700 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. In-Reply-To: <1222881242.9381.17.camel@alok-dev1> References: <1222881242.9381.17.camel@alok-dev1> Message-ID: <48E3BBC1.2050607@goop.org> Alok Kataria wrote: > Hi, > > Please find below the proposal for the generic use of cpuid space > allotted for hypervisors. Apart from this cpuid space another thing > worth noting would be that, Intel & AMD reserve the MSRs from 0x40000000 > - 0x400000FF for software use. Though the proposal doesn't talk about > MSR's right now, we should be aware of these reservations as we may want > to extend the way we use CPUID to MSR usage as well. > > While we are at it, we also think we should form a group which has at > least one person representing each of the hypervisors interested in > generalizing the hypervisor CPUID space for Linux guest OS. This group > will be informed whenever a new CPUID leaf from the generic space is to > be used. This would help avoid any duplicate definitions for a CPUID > semantic by two different hypervisors. I think most of the people are > subscribed to LKML or the virtualization lists and we should use these > lists as a platform to decide on things. > > Thanks, > Alok > > --- > > Hypervisor CPUID Interface Proposal > ----------------------------------- > > Intel & AMD have reserved cpuid levels 0x40000000 - 0x400000FF for > software use. Hypervisors can use these levels to provide an interface > to pass information from the hypervisor to the guest running inside a > virtual machine. > > This proposal defines a standard framework for the way in which the > Linux and hypervisor communities incrementally define this CPUID space. > > (This proposal may be adopted by other guest OSes. However, that is not > a requirement because a hypervisor can expose a different CPUID > interface depending on the guest OS type that is specified by the VM > configuration.) > > Hypervisor Present Bit: > Bit 31 of ECX of CPUID leaf 0x1. > > This bit has been reserved by Intel & AMD for use by > hypervisors, and indicates the presence of a hypervisor. > > Virtual CPU's (hypervisors) set this bit to 1 and physical CPU's > (all existing and future cpu's) set this bit to zero. This bit > can be probed by the guest software to detect whether they are > running inside a virtual machine. > > Hypervisor CPUID Information Leaf: > Leaf 0x40000000. > > This leaf returns the CPUID leaf range supported by the > hypervisor and the hypervisor vendor signature. > > # EAX: The maximum input value for CPUID supported by the hypervisor. > # EBX, ECX, EDX: Hypervisor vendor ID signature. > > Hypervisor Specific Leaves: > Leaf range 0x40000001 - 0x4000000F. > > These cpuid leaves are reserved as hypervisor specific leaves. > The semantics of these 15 leaves depend on the signature read > from the "Hypervisor Information Leaf". > > Generic Leaves: > Leaf range 0x40000010 - 0x4000000FF. 
> > The semantics of these leaves are consistent across all > hypervisors. This allows the guest kernel to probe and > interpret these leaves without checking for a hypervisor > signature. > > A hypervisor can indicate that a leaf or a leaf's field is > unsupported by returning zero when that leaf or field is probed. > > To avoid the situation where multiple hypervisors attempt to define the > semantics for the same leaf during development, we can partition > the generic leaf space to allow each hypervisor to define a part > of the generic space. > > For instance: > VMware could define 0x4000001X > Xen could define 0x4000002X > KVM could define 0x4000003X > and so on... > No, we're not getting anywhere. This is an outright broken idea. The space is too small to be able to chop up in this way, and the number of vendors too large to be able to do it without having a central oversight. The only way this can work is by having explicit positive identification of each group of leaves with a signature. If there's a recognizable signature, then you can inspect the rest of the group; if not, then you can't. That way, you can avoid any leaf usage which doesn't conform to this model, and you can also simultaneously support multiple hypervisor ABIs. It also accommodates existing hypervisor use of this leaf space, even if they currently use a fixed location within it. A concrete counter-proposal: The space 0x40000000-0x400000ff is reserved for hypervisor usage. This region is divided into 16 16-leaf blocks. Each block has the structure: 0x400000x0: eax: max used leaf within the leaf block (max 0x400000xf) e[bcd]x: leaf block signature. This may be a hypervisor-specific signature, or a generic signature, depending on the contents of the block A guest may search for any supported Hypervisor ABIs by inspecting each leaf at 0x400000x0 for a known signature, and then may choose its mode of operation accordingly. It must ignore any unknown signatures, and not touch any of the leaves within an unknown leaf block. Hypervisor vendors who want to add a hypervisor-specific leaf block must choose a signature which is recognizably related to their or their hypervisor's name. Signatures starting with "Generic" are reserved for generic leaf blocks. A guest may scan leaf blocks to enumerate what hypervisor ABIs/hypercall interfaces are available to it. It may mix and match any information from leaves it understands. However, once it starts using a specific hypervisor ABI by making hypercalls or doing other operations with side-effects, it must commit to using that ABI exclusively (a specific hypervisor ABI may include the generic ABI by reference, however). Correspondingly, a hypervisor must treat any cpuid accesses as side-effect free. Definition of specific blocks: Generic hypervisor leaf block: 0x400000x0 signature is "GenericVMMIF" (or something) 0x400000x1 tsc leaf as you've described J From jeremy at goop.org Wed Oct 1 11:06:25 2008 From: jeremy at goop.org (Jeremy Fitzhardinge) Date: Wed, 01 Oct 2008 11:06:25 -0700 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. In-Reply-To: <1222882431.9381.23.camel@alok-dev1> References: <1222881242.9381.17.camel@alok-dev1> <48E3B19D.6060905@zytor.com> <1222882431.9381.23.camel@alok-dev1> Message-ID: <48E3BC21.4080803@goop.org> Alok Kataria wrote: > Its not a user who has to do anything special here. > There are *intelligent* VM developers out there who can export a > different CPUid interface depending on the guest OS type. 
And this is > what most of the hypervisors do (not necessarily for CPUID, but for > other things right now). > No, that's always a terrible idea. Sure, its necessary to deal with some backward-compatibility issues, but we should even consider a new interface which assumes this kind of thing. We want properly enumerable interfaces. J From hpa at zytor.com Wed Oct 1 11:07:03 2008 From: hpa at zytor.com (H. Peter Anvin) Date: Wed, 01 Oct 2008 11:07:03 -0700 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. In-Reply-To: <48E3BBC1.2050607@goop.org> References: <1222881242.9381.17.camel@alok-dev1> <48E3BBC1.2050607@goop.org> Message-ID: <48E3BC47.60900@zytor.com> Jeremy Fitzhardinge wrote: > > No, we're not getting anywhere. This is an outright broken idea. The > space is too small to be able to chop up in this way, and the number of > vendors too large to be able to do it without having a central oversight. > I suspect we can get a larger number space if we ask Intel & AMD. In fact, I think we should request that the entire 0x40xxxxxx numberspace is assigned to virtualization *anyway*. -hpa From jeremy at goop.org Wed Oct 1 11:12:19 2008 From: jeremy at goop.org (Jeremy Fitzhardinge) Date: Wed, 01 Oct 2008 11:12:19 -0700 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. In-Reply-To: <48E3BC47.60900@zytor.com> References: <1222881242.9381.17.camel@alok-dev1> <48E3BBC1.2050607@goop.org> <48E3BC47.60900@zytor.com> Message-ID: <48E3BD83.2090801@goop.org> H. Peter Anvin wrote: > Jeremy Fitzhardinge wrote: >> >> No, we're not getting anywhere. This is an outright broken idea. >> The space is too small to be able to chop up in this way, and the >> number of vendors too large to be able to do it without having a >> central oversight. >> > > I suspect we can get a larger number space if we ask Intel & AMD. In > fact, I think we should request that the entire 0x40xxxxxx numberspace > is assigned to virtualization *anyway*. Yes, that would be good. In that case I'd revise my proposal to back each leaf block 256 leaves instead of 16. But it still needs to be a proper enumeration with signatures, rather than assigning fixed points in that space to specific interfaces. J From hpa at zytor.com Wed Oct 1 11:16:05 2008 From: hpa at zytor.com (H. Peter Anvin) Date: Wed, 01 Oct 2008 11:16:05 -0700 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. In-Reply-To: <48E3BD83.2090801@goop.org> References: <1222881242.9381.17.camel@alok-dev1> <48E3BBC1.2050607@goop.org> <48E3BC47.60900@zytor.com> <48E3BD83.2090801@goop.org> Message-ID: <48E3BE65.2050909@zytor.com> Jeremy Fitzhardinge wrote: >> >> I suspect we can get a larger number space if we ask Intel & AMD. In >> fact, I think we should request that the entire 0x40xxxxxx numberspace >> is assigned to virtualization *anyway*. > > Yes, that would be good. In that case I'd revise my proposal to back > each leaf block 256 leaves instead of 16. But it still needs to be a > proper enumeration with signatures, rather than assigning fixed points > in that space to specific interfaces. > With a sufficiently large block, we could use fixed points, e.g. by having each vendor create interfaces in the 0x40SSSSXX range, where SSSS is the PCI ID they use for PCI devices. Note that I said "create interfaces". It's important that all about this is who specified the interface -- for "what hypervisor is this" just use 0x40000000 and disambiguate based on that. 
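(As a concrete illustration of the 0x40SSSSXX idea, a hypothetical helper that maps a 16-bit PCI vendor ID to the base of that vendor's CPUID leaf range; 0x15AD, VMware's PCI vendor ID, is used purely as an example:)

    static inline unsigned int vendor_cpuid_base(unsigned short vendor_id)
    {
            /* e.g. vendor_cpuid_base(0x15AD) == 0x4015AD00 */
            return 0x40000000 | ((unsigned int)vendor_id << 8);
    }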
-hpa From jeremy at goop.org Wed Oct 1 11:36:27 2008 From: jeremy at goop.org (Jeremy Fitzhardinge) Date: Wed, 01 Oct 2008 11:36:27 -0700 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. In-Reply-To: <48E3BE65.2050909@zytor.com> References: <1222881242.9381.17.camel@alok-dev1> <48E3BBC1.2050607@goop.org> <48E3BC47.60900@zytor.com> <48E3BD83.2090801@goop.org> <48E3BE65.2050909@zytor.com> Message-ID: <48E3C32B.3090701@goop.org> H. Peter Anvin wrote: > With a sufficiently large block, we could use fixed points, e.g. by > having each vendor create interfaces in the 0x40SSSSXX range, where > SSSS is the PCI ID they use for PCI devices. Sure, you could do that, but you'd still want to have a signature in 0x40SSSS00 to positively identify the chunk. And what if you wanted more than 256 leaves? > Note that I said "create interfaces". It's important that all about > this is who specified the interface -- for "what hypervisor is this" > just use 0x40000000 and disambiguate based on that. "What hypervisor is this?" isn't a very interesting question; if you're even asking it then it suggests that something has gone wrong. Its much more useful to ask "what interfaces does this hypervisor support?", and enumerating a smallish range of well-known leaves looking for signatures is the simplest way to do that. (We could use signatures derived from the PCI vendor IDs which would help with managing that namespace.) J From hpa at zytor.com Wed Oct 1 11:43:15 2008 From: hpa at zytor.com (H. Peter Anvin) Date: Wed, 01 Oct 2008 11:43:15 -0700 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. In-Reply-To: <48E3C32B.3090701@goop.org> References: <1222881242.9381.17.camel@alok-dev1> <48E3BBC1.2050607@goop.org> <48E3BC47.60900@zytor.com> <48E3BD83.2090801@goop.org> <48E3BE65.2050909@zytor.com> <48E3C32B.3090701@goop.org> Message-ID: <48E3C4C3.1050903@zytor.com> Jeremy Fitzhardinge wrote: > H. Peter Anvin wrote: >> With a sufficiently large block, we could use fixed points, e.g. by >> having each vendor create interfaces in the 0x40SSSSXX range, where >> SSSS is the PCI ID they use for PCI devices. > > Sure, you could do that, but you'd still want to have a signature in > 0x40SSSS00 to positively identify the chunk. And what if you wanted > more than 256 leaves? What you'd want, at least, is a standard CPUID identification and range leaf at the top. 256 leaves is a *lot*, though; I'm not saying one couldn't run out, but it'd be hard. Keep in mind that for large objects there are "counting" CPUID levels, as much as I personally dislike them, and one could easily argue that if you're doing something that would require anywhere near 256 leaves you probably are storing bulk data that belongs elsewhere. Of course, if we had some kind of central authority assigning 8-bit IDs that would be even better, especially since there are tools in the field which already scan on 64K boundaries. I don't know, though, how likely it is that we'll have to deal with 256 hypervisors. >> Note that I said "create interfaces". It's important that all about >> this is who specified the interface -- for "what hypervisor is this" >> just use 0x40000000 and disambiguate based on that. > > "What hypervisor is this?" isn't a very interesting question; if you're > even asking it then it suggests that something has gone wrong. 
Its much > more useful to ask "what interfaces does this hypervisor support?", and > enumerating a smallish range of well-known leaves looking for signatures > is the simplest way to do that. (We could use signatures derived from > the PCI vendor IDs which would help with managing that namespace.) > I agree completely, of course (except that "what hypervisor is this" still has limited usage, especially when it comes to dealing with bug workarounds. Similar to the way we use CPU vendor IDs and stepping numbers for physical CPUs.) -hpa From jeremy at goop.org Wed Oct 1 12:56:34 2008 From: jeremy at goop.org (Jeremy Fitzhardinge) Date: Wed, 01 Oct 2008 12:56:34 -0700 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. In-Reply-To: <48E3C4C3.1050903@zytor.com> References: <1222881242.9381.17.camel@alok-dev1> <48E3BBC1.2050607@goop.org> <48E3BC47.60900@zytor.com> <48E3BD83.2090801@goop.org> <48E3BE65.2050909@zytor.com> <48E3C32B.3090701@goop.org> <48E3C4C3.1050903@zytor.com> Message-ID: <48E3D5F2.4090708@goop.org> H. Peter Anvin wrote: > What you'd want, at least, is a standard CPUID identification and > range leaf at the top. 256 leaves is a *lot*, though; I'm not saying > one couldn't run out, but it'd be hard. Keep in mind that for large > objects there are "counting" CPUID levels, as much as I personally > dislike them, and one could easily argue that if you're doing > something that would require anywhere near 256 leaves you probably are > storing bulk data that belongs elsewhere. I agree, but it just makes the proposal a bit more brittle. > Of course, if we had some kind of central authority assigning 8-bit > IDs that would be even better, especially since there are tools in the > field which already scan on 64K boundaries. I don't know, though, how > likely it is that we'll have to deal with 256 hypervisors. I'm assuming that the likelihood of getting all possible vendors - current and future - to agree to a scheme like this is pretty small. We need to come up with something that will work well when there are non-cooperative parties to deal with. > I agree completely, of course (except that "what hypervisor is this" > still has limited usage, especially when it comes to dealing with bug > workarounds. Similar to the way we use CPU vendor IDs and stepping > numbers for physical CPUs.) I guess. Its certainly useful to be able to identify the hypervisor for bug reporting and just general status information. But making functional changes on that basis should be a last resort. J From anthony at codemonkey.ws Wed Oct 1 13:03:44 2008 From: anthony at codemonkey.ws (Anthony Liguori) Date: Wed, 01 Oct 2008 15:03:44 -0500 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. In-Reply-To: <48E3BBC1.2050607__35819.6151479662$1222884502$gmane$org@goop.org> References: <1222881242.9381.17.camel@alok-dev1> <48E3BBC1.2050607__35819.6151479662$1222884502$gmane$org@goop.org> Message-ID: <48E3D7A0.3000403@codemonkey.ws> Jeremy Fitzhardinge wrote: > Alok Kataria wrote: > > No, we're not getting anywhere. This is an outright broken idea. The > space is too small to be able to chop up in this way, and the number of > vendors too large to be able to do it without having a central oversight. > > The only way this can work is by having explicit positive identification > of each group of leaves with a signature. If there's a recognizable > signature, then you can inspect the rest of the group; if not, then you > can't. 
That way, you can avoid any leaf usage which doesn't conform to > this model, and you can also simultaneously support multiple hypervisor > ABIs. It also accommodates existing hypervisor use of this leaf space, > even if they currently use a fixed location within it. > > A concrete counter-proposal: Mmm, cpuid bikeshedding :-) > The space 0x40000000-0x400000ff is reserved for hypervisor usage. > > This region is divided into 16 16-leaf blocks. Each block has the > structure: > > 0x400000x0: > eax: max used leaf within the leaf block (max 0x400000xf) Why even bother with this? It doesn't seem necessary in your proposal. Regards, Anthony Liguori From jeremy at goop.org Wed Oct 1 13:08:08 2008 From: jeremy at goop.org (Jeremy Fitzhardinge) Date: Wed, 01 Oct 2008 13:08:08 -0700 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. In-Reply-To: <48E3D7A0.3000403@codemonkey.ws> References: <1222881242.9381.17.camel@alok-dev1> <48E3BBC1.2050607__35819.6151479662$1222884502$gmane$org@goop.org> <48E3D7A0.3000403@codemonkey.ws> Message-ID: <48E3D8A8.604@goop.org> Anthony Liguori wrote: > Mmm, cpuid bikeshedding :-) My shade of blue is better. >> The space 0x40000000-0x400000ff is reserved for hypervisor usage. >> >> This region is divided into 16 16-leaf blocks. Each block has the >> structure: >> >> 0x400000x0: >> eax: max used leaf within the leaf block (max 0x400000xf) > > Why even bother with this? It doesn't seem necessary in your proposal. It allows someone to incrementally add things to their block in a fairly orderly way. But more importantly, its the prevailing idiom, and the existing and proposed cpuid schemes already do this, so they'd fit in as-is. J From chrisw at sous-sol.org Wed Oct 1 13:38:03 2008 From: chrisw at sous-sol.org (Chris Wright) Date: Wed, 1 Oct 2008 13:38:03 -0700 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. In-Reply-To: <48E3C32B.3090701@goop.org> References: <1222881242.9381.17.camel@alok-dev1> <48E3BBC1.2050607@goop.org> <48E3BC47.60900@zytor.com> <48E3BD83.2090801@goop.org> <48E3BE65.2050909@zytor.com> <48E3C32B.3090701@goop.org> Message-ID: <20081001203803.GB634@sequoia.sous-sol.org> * Jeremy Fitzhardinge (jeremy at goop.org) wrote: > "What hypervisor is this?" isn't a very interesting question; if you're > even asking it then it suggests that something has gone wrong. It's essentially already happening. Everyone wants to be a better hyperv than hyperv ;-) From akataria at vmware.com Wed Oct 1 14:01:18 2008 From: akataria at vmware.com (Alok Kataria) Date: Wed, 01 Oct 2008 14:01:18 -0700 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. In-Reply-To: <48E3BBC1.2050607@goop.org> References: <1222881242.9381.17.camel@alok-dev1> <48E3BBC1.2050607@goop.org> Message-ID: <1222894878.9381.63.camel@alok-dev1> On Wed, 2008-10-01 at 11:04 -0700, Jeremy Fitzhardinge wrote: > No, we're not getting anywhere. This is an outright broken idea. The > space is too small to be able to chop up in this way, and the number of > vendors too large to be able to do it without having a central oversight. > > The only way this can work is by having explicit positive identification > of each group of leaves with a signature. If there's a recognizable > signature, then you can inspect the rest of the group; if not, then you > can't. That way, you can avoid any leaf usage which doesn't conform to > this model, and you can also simultaneously support multiple hypervisor > ABIs. 
It also accommodates existing hypervisor use of this leaf space, > even if they currently use a fixed location within it. > > A concrete counter-proposal: > > The space 0x40000000-0x400000ff is reserved for hypervisor usage. > > This region is divided into 16 16-leaf blocks. Each block has the > structure: > > 0x400000x0: > eax: max used leaf within the leaf block (max 0x400000xf) > e[bcd]x: leaf block signature. This may be a hypervisor-specific > signature, or a generic signature, depending on the contents of the block > > A guest may search for any supported Hypervisor ABIs by inspecting each > leaf at 0x400000x0 for a known signature, and then may choose its mode > of operation accordingly. It must ignore any unknown signatures, and > not touch any of the leaves within an unknown leaf block. > Hypervisor vendors who want to add a hypervisor-specific leaf block must > choose a signature which is recognizably related to their or their > hypervisor's name. > > Signatures starting with "Generic" are reserved for generic leaf blocks. > > A guest may scan leaf blocks to enumerate what hypervisor ABIs/hypercall > interfaces are available to it. It may mix and match any information > from leaves it understands. However, once it starts using a specific > hypervisor ABI by making hypercalls or doing other operations with > side-effects, it must commit to using that ABI exclusively (a specific > hypervisor ABI may include the generic ABI by reference, however). > > Correspondingly, a hypervisor must treat any cpuid accesses as > side-effect free. > > Definition of specific blocks: > > Generic hypervisor leaf block: > 0x400000x0 signature is "GenericVMMIF" (or something) > 0x400000x1 tsc leaf as you've described > I see following issues with this proposal, 1. Kernel complexity : Just thinking about the complexity that this will put in the kernel to handle these multiple ABI signatures and scanning all of these leaf block's is difficult to digest. 2. Divergence in the interface provided by the hypervisors : The reason we brought up a flat hierarchy is because we think we should be moving towards a approach where the guest code doesn't diverge too much when running under different hypervisors. That is the guest essentially does the same thing if its running on say Xen or VMware. This design IMO, will take us a step backward to what we already have seen with para virt ops. Each hypervisor (mostly) defines its own cpuid block, the guest correspondingly needs to have code to handle each of these cpuid blocks, with these blocks will mostly being exclusive. 3. Is their a need to do all this over engineering : Aren't we over engineering a simple interface over here. The point is, there are right now 256 cpuid leafs do we realistically think we are ever going to exhaust all these leafs. We are really surprised to know that people may think this space is small enough. It would be interesting to know what all use you might want to put cpuid for. Thanks, Alok > J From anthony at codemonkey.ws Wed Oct 1 14:03:23 2008 From: anthony at codemonkey.ws (Anthony Liguori) Date: Wed, 01 Oct 2008 16:03:23 -0500 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. 
In-Reply-To: <48E3D8A8.604__13396.6479487301$1222891831$gmane$org@goop.org> References: <1222881242.9381.17.camel@alok-dev1> <48E3BBC1.2050607__35819.6151479662$1222884502$gmane$org@goop.org> <48E3D7A0.3000403@codemonkey.ws> <48E3D8A8.604__13396.6479487301$1222891831$gmane$org@goop.org> Message-ID: <48E3E59B.8060900@codemonkey.ws> Jeremy Fitzhardinge wrote: > Anthony Liguori wrote: >> Mmm, cpuid bikeshedding :-) > > My shade of blue is better. > >>> The space 0x40000000-0x400000ff is reserved for hypervisor usage. >>> >>> This region is divided into 16 16-leaf blocks. Each block has the >>> structure: >>> >>> 0x400000x0: >>> eax: max used leaf within the leaf block (max 0x400000xf) >> Why even bother with this? It doesn't seem necessary in your proposal. > > It allows someone to incrementally add things to their block in a fairly > orderly way. But more importantly, its the prevailing idiom, and the > existing and proposed cpuid schemes already do this, so they'd fit in as-is. We just leave eax as zero. It wouldn't be that upsetting to change this as it would only keep new guests from working on older KVMs. However, I see little incentive to change anything unless there's something compelling that we would get in return. Since we're only talking about Linux guests, it's just as easy for us to add things to our paravirt_ops implementation as it would be to add things using this new model. If this was something that other guests were all agreeing to support (even if it was just the BSDs and OpenSolaris), then there may be value to it. Right now, I see no real value in changing the status quo. Regards, Anthony Liguori > J From anthony at codemonkey.ws Wed Oct 1 14:03:23 2008 From: anthony at codemonkey.ws (Anthony Liguori) Date: Wed, 01 Oct 2008 16:03:23 -0500 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. In-Reply-To: <48E3D8A8.604__13396.6479487301$1222891831$gmane$org@goop.org> References: <1222881242.9381.17.camel@alok-dev1> <48E3BBC1.2050607__35819.6151479662$1222884502$gmane$org@goop.org> <48E3D7A0.3000403@codemonkey.ws> <48E3D8A8.604__13396.6479487301$1222891831$gmane$org@goop.org> Message-ID: <48E3E59B.8060900@codemonkey.ws> Jeremy Fitzhardinge wrote: > Anthony Liguori wrote: >> Mmm, cpuid bikeshedding :-) > > My shade of blue is better. > >>> The space 0x40000000-0x400000ff is reserved for hypervisor usage. >>> >>> This region is divided into 16 16-leaf blocks. Each block has the >>> structure: >>> >>> 0x400000x0: >>> eax: max used leaf within the leaf block (max 0x400000xf) >> Why even bother with this? It doesn't seem necessary in your proposal. > > It allows someone to incrementally add things to their block in a fairly > orderly way. But more importantly, its the prevailing idiom, and the > existing and proposed cpuid schemes already do this, so they'd fit in as-is. We just leave eax as zero. It wouldn't be that upsetting to change this as it would only keep new guests from working on older KVMs. However, I see little incentive to change anything unless there's something compelling that we would get in return. Since we're only talking about Linux guests, it's just as easy for us to add things to our paravirt_ops implementation as it would be to add things using this new model. If this was something that other guests were all agreeing to support (even if it was just the BSDs and OpenSolaris), then there may be value to it. Right now, I see no real value in changing the status quo. 
Regards, Anthony Liguori > J From akataria at vmware.com Wed Oct 1 14:05:53 2008 From: akataria at vmware.com (Alok Kataria) Date: Wed, 01 Oct 2008 14:05:53 -0700 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. In-Reply-To: <48E3BC21.4080803@goop.org> References: <1222881242.9381.17.camel@alok-dev1> <48E3B19D.6060905@zytor.com> <1222882431.9381.23.camel@alok-dev1> <48E3BC21.4080803@goop.org> Message-ID: <1222895153.9381.69.camel@alok-dev1> On Wed, 2008-10-01 at 11:06 -0700, Jeremy Fitzhardinge wrote: > Alok Kataria wrote: > > Its not a user who has to do anything special here. > > There are *intelligent* VM developers out there who can export a > > different CPUid interface depending on the guest OS type. And this is > > what most of the hypervisors do (not necessarily for CPUID, but for > > other things right now). > > > > No, that's always a terrible idea. Sure, its necessary to deal with > some backward-compatibility issues, but we should even consider a new > interface which assumes this kind of thing. We want properly enumerable > interfaces. The reason we still have to do this is because, Microsoft has already defined a CPUID format which is way different than what you or I are proposing ( with the current case of 256 leafs being available). And I doubt they would change the way they deal with it on their OS. Any proposal that we go with, we will have to export different CPUID interface from the hypervisor for the 2 OS in question. So i think this is something that we anyways will have to do and not worth binging about in the discussion. -- Alok > J From anthony at codemonkey.ws Wed Oct 1 14:08:47 2008 From: anthony at codemonkey.ws (Anthony Liguori) Date: Wed, 01 Oct 2008 16:08:47 -0500 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. In-Reply-To: <1222894878.9381.63.camel@alok-dev1> References: <1222881242.9381.17.camel@alok-dev1> <48E3BBC1.2050607@goop.org> <1222894878.9381.63.camel@alok-dev1> Message-ID: <48E3E6DF.8060400@codemonkey.ws> Alok Kataria wrote: > On Wed, 2008-10-01 at 11:04 -0700, Jeremy Fitzhardinge wrote: > > 2. Divergence in the interface provided by the hypervisors : > The reason we brought up a flat hierarchy is because we think we should > be moving towards a approach where the guest code doesn't diverge too > much when running under different hypervisors. That is the guest > essentially does the same thing if its running on say Xen or VMware. > > This design IMO, will take us a step backward to what we already have > seen with para virt ops. Each hypervisor (mostly) defines its own cpuid > block, the guest correspondingly needs to have code to handle each of > these cpuid blocks, with these blocks will mostly being exclusive. > What's wrong with what we have in paravirt_ops? Just agreeing on CPUID doesn't help very much. You still need a mechanism for doing hypercalls to implement anything meaningful. We aren't going to agree on a hypercall mechanism. KVM uses direct hypercall instructions, Xen uses a hypercall page, VMware uses VMI, Hyper-V uses MSR writes. We all have already defined the hypercall namespace in a certain way. We've already gone down the road of trying to make standard paravirtual interfaces (via virtio). No one was sufficiently interested in collaborating. I don't see why other paravirtualizations are going to be much different. 
Regards, Anthony Liguori From chrisw at sous-sol.org Wed Oct 1 14:15:32 2008 From: chrisw at sous-sol.org (Chris Wright) Date: Wed, 1 Oct 2008 14:15:32 -0700 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. In-Reply-To: <48E3E6DF.8060400@codemonkey.ws> References: <1222881242.9381.17.camel@alok-dev1> <48E3BBC1.2050607@goop.org> <1222894878.9381.63.camel@alok-dev1> <48E3E6DF.8060400@codemonkey.ws> Message-ID: <20081001211532.GC634@sequoia.sous-sol.org> * Anthony Liguori (anthony at codemonkey.ws) wrote: > We've already gone down the road of trying to make standard paravirtual > interfaces (via virtio). No one was sufficiently interested in > collaborating. I don't see why other paravirtualizations are going to > be much different. The point is to be able to support those interfaces. Presently a Linux guest will test and find out which HV it's running on, and adapt. Another guest will fail to enlighten itself, and perf will suffer...yadda, yadda. thanks, -chris From akataria at vmware.com Wed Oct 1 14:23:35 2008 From: akataria at vmware.com (Alok Kataria) Date: Wed, 01 Oct 2008 14:23:35 -0700 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. In-Reply-To: <48E3E6DF.8060400@codemonkey.ws> References: <1222881242.9381.17.camel@alok-dev1> <48E3BBC1.2050607@goop.org> <1222894878.9381.63.camel@alok-dev1> <48E3E6DF.8060400@codemonkey.ws> Message-ID: <1222896215.9381.79.camel@alok-dev1> On Wed, 2008-10-01 at 14:08 -0700, Anthony Liguori wrote: > Alok Kataria wrote: > > On Wed, 2008-10-01 at 11:04 -0700, Jeremy Fitzhardinge wrote: > > > > 2. Divergence in the interface provided by the hypervisors : > > The reason we brought up a flat hierarchy is because we think we should > > be moving towards a approach where the guest code doesn't diverge too > > much when running under different hypervisors. That is the guest > > essentially does the same thing if its running on say Xen or VMware. > > > > This design IMO, will take us a step backward to what we already have > > seen with para virt ops. Each hypervisor (mostly) defines its own cpuid > > block, the guest correspondingly needs to have code to handle each of > > these cpuid blocks, with these blocks will mostly being exclusive. > > > > What's wrong with what we have in paravirt_ops? Your explanation below answers the question you raised, the problem being we need to have support for each of these different hypercall mechanisms in the kernel. I understand that this was the correct thing to do at that moment. But do we want to go the same way again for CPUID when we can make it generic (flat enough) for anybody to use it in the same manner and expose a generic interface to the kernel. > Just agreeing on CPUID > doesn't help very much. Yeah, nobody is removing any of the paravirt ops support. > You still need a mechanism for doing hypercalls > to implement anything meaningful. We aren't going to agree on a > hypercall mechanism. KVM uses direct hypercall instructions, Xen uses a > hypercall page, VMware uses VMI, Hyper-V uses MSR writes. We all have > already defined the hypercall namespace in a certain way. Thanks, Alok From jeremy at goop.org Wed Oct 1 14:17:18 2008 From: jeremy at goop.org (Jeremy Fitzhardinge) Date: Wed, 01 Oct 2008 14:17:18 -0700 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. 
In-Reply-To: <1222894878.9381.63.camel@alok-dev1> References: <1222881242.9381.17.camel@alok-dev1> <48E3BBC1.2050607@goop.org> <1222894878.9381.63.camel@alok-dev1> Message-ID: <48E3E8DE.1080602@goop.org> Alok Kataria wrote: > 1. Kernel complexity : Just thinking about the complexity that this will > put in the kernel to handle these multiple ABI signatures and scanning > all of these leaf block's is difficult to digest. > The scanning for the signatures is trivial; it's not a significant amount of code. Actually implementing them is a different matter, but that's the same regardless of where they are placed or how they're discovered. After discovery its the same either way: there's a leaf base with offsets from it. > 2. Divergence in the interface provided by the hypervisors : > The reason we brought up a flat hierarchy is because we think we should > be moving towards a approach where the guest code doesn't diverge too > much when running under different hypervisors. That is the guest > essentially does the same thing if its running on say Xen or VMware. > I guess, but the bulk of the uses of this stuff are going to be hypervisor-specific. You're hard-pressed to come up with any other generic uses beyond tsc. In general, if a hypervisor is going to put something in a special cpuid leaf, its because there's no other good way to represent it. Generic things are generally going to appear as an emulated piece of the virtualized platform, in ACPI, DMI, a hardware-defined cpuid leaf, etc... > 3. Is their a need to do all this over engineering : > Aren't we over engineering a simple interface over here. The point is, > there are right now 256 cpuid leafs do we realistically think we are > ever going to exhaust all these leafs. We are really surprised to know > that people may think this space is small enough. It would be > interesting to know what all use you might want to put cpuid for. > Look, if you want to propose a way to use that cpuid space in a reasonably flexible way that allows it to be used as the need arises, then we can talk about it. But I think your proposal is a poor way to achieve those ends If you want blessing for something that you've already implemented and shipped, well, you don't need anyone's blessing for that. J From anthony at codemonkey.ws Wed Oct 1 14:29:48 2008 From: anthony at codemonkey.ws (Anthony Liguori) Date: Wed, 01 Oct 2008 16:29:48 -0500 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. In-Reply-To: <1222896215.9381.79.camel@alok-dev1> References: <1222881242.9381.17.camel@alok-dev1> <48E3BBC1.2050607@goop.org> <1222894878.9381.63.camel@alok-dev1> <48E3E6DF.8060400@codemonkey.ws> <1222896215.9381.79.camel@alok-dev1> Message-ID: <48E3EBCC.70502@codemonkey.ws> Alok Kataria wrote: > Your explanation below answers the question you raised, the problem > being we need to have support for each of these different hypercall > mechanisms in the kernel. > I understand that this was the correct thing to do at that moment. > But do we want to go the same way again for CPUID when we can make it > generic (flat enough) for anybody to use it in the same manner and > expose a generic interface to the kernel. > But what sort of information can be stored in cpuid that's actually useful? Right now we just it in KVM for feature bits. Most of the stuff that's interesting is stored in shared memory because a guest can read that without taking a vmexit or via a hypercall. 
We can all agree upon a common mechanism for doing something but if no one is using that mechanism to do anything significant, what purpose does it serve? Regards, Anthony Liguori From anthony at codemonkey.ws Wed Oct 1 14:31:30 2008 From: anthony at codemonkey.ws (Anthony Liguori) Date: Wed, 01 Oct 2008 16:31:30 -0500 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. In-Reply-To: <20081001211532.GC634@sequoia.sous-sol.org> References: <1222881242.9381.17.camel@alok-dev1> <48E3BBC1.2050607@goop.org> <1222894878.9381.63.camel@alok-dev1> <48E3E6DF.8060400@codemonkey.ws> <20081001211532.GC634@sequoia.sous-sol.org> Message-ID: <48E3EC32.4030603@codemonkey.ws> Chris Wright wrote: > * Anthony Liguori (anthony at codemonkey.ws) wrote: > >> We've already gone down the road of trying to make standard paravirtual >> interfaces (via virtio). No one was sufficiently interested in >> collaborating. I don't see why other paravirtualizations are going to >> be much different. >> > > The point is to be able to support those interfaces. Presently a Linux guest > will test and find out which HV it's running on, and adapt. Another > guest will fail to enlighten itself, and perf will suffer...yadda, yadda. > Agreeing on CPUID does not get us close at all to having shared interfaces for paravirtualization. As I said in another note, there are more fundamental things that we differ on (like hypercall mechanism) that's going to make that challenging. We already are sharing code, when appropriate (see the Xen/KVM PV clock interface). Regards, Anthony Liguori > thanks, > -chris > From anthony at codemonkey.ws Wed Oct 1 14:34:09 2008 From: anthony at codemonkey.ws (Anthony Liguori) Date: Wed, 01 Oct 2008 16:34:09 -0500 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. In-Reply-To: <48E3E8DE.1080602@goop.org> References: <1222881242.9381.17.camel@alok-dev1> <48E3BBC1.2050607@goop.org> <1222894878.9381.63.camel@alok-dev1> <48E3E8DE.1080602@goop.org> Message-ID: <48E3ECD1.30809@codemonkey.ws> Jeremy Fitzhardinge wrote: > Alok Kataria wrote: > > I guess, but the bulk of the uses of this stuff are going to be > hypervisor-specific. You're hard-pressed to come up with any other > generic uses beyond tsc. And arguably, storing TSC frequency in CPUID is a terrible interface because the TSC frequency can change any time a guest is entered. It really should be a shared memory area so that a guest doesn't have to vmexit to read it (like it is with the Xen/KVM paravirt clock). Regards, Anthony Liguori > In general, if a hypervisor is going to put something in a special > cpuid leaf, its because there's no other good way to represent it. > Generic things are generally going to appear as an emulated piece of > the virtualized platform, in ACPI, DMI, a hardware-defined cpuid leaf, > etc... From chrisw at sous-sol.org Wed Oct 1 14:43:38 2008 From: chrisw at sous-sol.org (Chris Wright) Date: Wed, 1 Oct 2008 14:43:38 -0700 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. In-Reply-To: <48E3ECD1.30809@codemonkey.ws> References: <1222881242.9381.17.camel@alok-dev1> <48E3BBC1.2050607@goop.org> <1222894878.9381.63.camel@alok-dev1> <48E3E8DE.1080602@goop.org> <48E3ECD1.30809@codemonkey.ws> Message-ID: <20081001214338.GD634@sequoia.sous-sol.org> * Anthony Liguori (anthony at codemonkey.ws) wrote: > And arguably, storing TSC frequency in CPUID is a terrible interface > because the TSC frequency can change any time a guest is entered. 
True for older hardware; newer hardware should fix this. I guess the point is, these are numbers that are easy to measure incorrectly in a guest. Doesn't justify the whole thing... From hpa at zytor.com Wed Oct 1 15:38:10 2008 From: hpa at zytor.com (H. Peter Anvin) Date: Wed, 01 Oct 2008 15:38:10 -0700 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. In-Reply-To: <20081001203803.GB634@sequoia.sous-sol.org> References: <1222881242.9381.17.camel@alok-dev1> <48E3BBC1.2050607@goop.org> <48E3BC47.60900@zytor.com> <48E3BD83.2090801@goop.org> <48E3BE65.2050909@zytor.com> <48E3C32B.3090701@goop.org> <20081001203803.GB634@sequoia.sous-sol.org> Message-ID: <48E3FBD2.5010007@zytor.com> Chris Wright wrote: > * Jeremy Fitzhardinge (jeremy at goop.org) wrote: >> "What hypervisor is this?" isn't a very interesting question; if you're >> even asking it then it suggests that something has gone wrong. > > It's essentially already happening. Everyone wants to be a better > hyperv than hyperv ;-) That's a hy-perv? ;) -hpa From hpa at zytor.com Wed Oct 1 15:46:45 2008 From: hpa at zytor.com (H. Peter Anvin) Date: Wed, 01 Oct 2008 15:46:45 -0700 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. In-Reply-To: <1222895153.9381.69.camel@alok-dev1> References: <1222881242.9381.17.camel@alok-dev1> <48E3B19D.6060905@zytor.com> <1222882431.9381.23.camel@alok-dev1> <48E3BC21.4080803@goop.org> <1222895153.9381.69.camel@alok-dev1> Message-ID: <48E3FDD5.7040106@zytor.com> Alok Kataria wrote: >> No, that's always a terrible idea. Sure, it's necessary to deal with >> some backward-compatibility issues, but we shouldn't even consider a new >> interface which assumes this kind of thing. We want properly enumerable >> interfaces. > > The reason we still have to do this is because Microsoft has already > defined a CPUID format which is way different than what you or I are > proposing (with the current case of 256 leafs being available). And I > doubt they would change the way they deal with it on their OS. > Any proposal that we go with, we will have to export a different CPUID > interface from the hypervisor for the 2 OSes in question. > > So I think this is something that we will have to do anyway and it is not > worth bringing up in the discussion. No, that's a good hint that what "you and I" are proposing is utterly broken and exactly underscores what I have been stressing about noncompliant hypervisors. All I have seen out of Microsoft only covers CPUID levels 0x40000000 as a vendor identification leaf and 0x40000001 as a "hypervisor identification leaf", but you might have access to other information. This further underscores my belief that using 0x400000xx for anything "standards-based" at all is utterly futile, and that this space should be treated as vendor identification and the rest as vendor-specific. Any hope of creating a standard that's actually usable needs to be outside this space, e.g. in the 0x40SSSSxx space I proposed earlier. -hpa From zach at vmware.com Wed Oct 1 16:47:04 2008 From: zach at vmware.com (Zachary Amsden) Date: Wed, 01 Oct 2008 16:47:04 -0700 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux.
In-Reply-To: <48E3ECD1.30809@codemonkey.ws> References: <1222881242.9381.17.camel@alok-dev1> <48E3BBC1.2050607@goop.org> <1222894878.9381.63.camel@alok-dev1> <48E3E8DE.1080602@goop.org> <48E3ECD1.30809@codemonkey.ws> Message-ID: <1222904824.7330.83.camel@bodhitayantram.eng.vmware.com> On Wed, 2008-10-01 at 14:34 -0700, Anthony Liguori wrote: > Jeremy Fitzhardinge wrote: > > Alok Kataria wrote: > > > > I guess, but the bulk of the uses of this stuff are going to be > > hypervisor-specific. You're hard-pressed to come up with any other > > generic uses beyond tsc. > > And arguably, storing TSC frequency in CPUID is a terrible interface > because the TSC frequency can change any time a guest is entered. It > really should be a shared memory area so that a guest doesn't have to > vmexit to read it (like it is with the Xen/KVM paravirt clock). It's not terrible, it's actually brilliant. TSC is part of the processor architecture, so the processor should have a way to tell us what speed it is. Having a TSC with no interface to determine the frequency is a terrible design flaw. This is what caused the problem in the first place. And now we're trying to fiddle around with software wizardry to do what should be done in hardware in the first place. Once again, para-virtualization is basically useless. We can't agree on a solution without over-designing some complex system with interface signatures and multi-vendor cooperation and nonsense. Solve the non-virtualized problem and the virtualized problem goes away. Jun, you work at Intel. Can you ask for a new architecturally defined MSR that returns the TSC frequency? Not a virtualization specific MSR. A real MSR that would exist on physical processors. The TSC started as an MSR anyway. There should be another MSR that tells the frequency. If it's hard to do in hardware, it can be a write-once MSR that gets initialized by the BIOS. It's really a very simple solution to a very common problem. Other MSRs are dedicated to bus speed and so on, this seems remarkably similar. Once the physical problem is solved, the virtualized problem doesn't even exist. We simply add support for the newly defined MSR and voilà. Other chipmakers probably agree it's a good idea and go along with it too, and in the meantime, reading a non-existent MSR is a fairly harmless, easily handled #GP. I realize it's the wrong thing for us now, but long term, it's the only architecturally 'correct' approach. You can even extend it to have visible TSC frequency changes clocked via performance counter events (and then get interrupts on those events if you so wish), solving the dynamic problem too. Paravirtualization is a symptom of an architectural problem. We should always be trying to fix the architecture first. Zach From anthony at codemonkey.ws Wed Oct 1 17:41:18 2008 From: anthony at codemonkey.ws (Anthony Liguori) Date: Wed, 01 Oct 2008 19:41:18 -0500 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. In-Reply-To: <1222904824.7330.83.camel@bodhitayantram.eng.vmware.com> References: <1222881242.9381.17.camel@alok-dev1> <48E3BBC1.2050607@goop.org> <1222894878.9381.63.camel@alok-dev1> <48E3E8DE.1080602@goop.org> <48E3ECD1.30809@codemonkey.ws> <1222904824.7330.83.camel@bodhitayantram.eng.vmware.com> Message-ID: <48E418AE.1090306@codemonkey.ws> Zachary Amsden wrote: > On Wed, 2008-10-01 at 14:34 -0700, Anthony Liguori wrote: > >> Jeremy Fitzhardinge wrote: >> >>> Alok Kataria wrote: >>> >>> I guess, but the bulk of the uses of this stuff are going to be >>> hypervisor-specific.
You're hard-pressed to come up with any other >>> generic uses beyond tsc. >>> >> And arguably, storing TSC frequency in CPUID is a terrible interface >> because the TSC frequency can change any time a guest is entered. It >> really should be a shared memory area so that a guest doesn't have to >> vmexit to read it (like it is with the Xen/KVM paravirt clock). >> > > It's not terrible, it's actually brilliant. But of course! Okay, not really :-) > TSC is part of the > processor architecture, the processor should a way to tell us what speed > it is. > It does. 1 tick == 1 tick. The processor doesn't have a concept of wall clock time so wall clock units don't make much sense. If it did, I'd say, screw the TSC, just give me a ns granular time stamp and let's all forget that the TSC even exists. > And now we're trying to fiddle around with software wizardry what should > be done in hardware in the first place. Once again, para-virtualization > is basically useless. We can't agree on a solution without > over-designing some complex system with interface signatures and > multi-vendor cooperation and nonsense. Solve the non-virtualized > problem and the virtualized problem goes away. > > Jun, you work at Intel. Can you ask for a new architecturally defined > MSR that returns the TSC frequency? Not a virtualization specific MSR. > A real MSR that would exist on physical processors. The TSC started as > an MSR anyway. There should be another MSR that tells the frequency. > If it's hard to do in hardware, it can be a write-once MSR that gets > initialized by the BIOS. rdtscp sort of gives you this. But still, just give me my rdnsc and I'll be happy. > I realize it's the wrong thing for us now, but long term, it's the only > architecturally 'correct' approach. You can even extend it to have > visible TSC frequency changes clocked via performance counter events > (and then get interrupts on those events if you so wish), solving the > dynamic problem too. > So a solution is needed that works for now. Anything that requires a vmexit is bad because the TSC frequency can change quite often. Even if you ignore the troubles with frequency scaling on older processors and VCPU migration across NUMA nodes, there will be a very visible change in TSC frequency after a live migration. So there are two possible solutions. Have a shared memory area that the guest can consult that has the latest TSC frequency (this is what KVM and Xen do) or have some sort of interrupt mechanism that notifies the guest when the TSC frequency changes after which, software can do something that vmexits to get the TSC frequency. The proposed solution doesn't include a TSC frequency change notification mechanism. This is part of the problem with this sort of approach to standardization. It's hard to come up with the best interface at first. You have to try a couple ways, and then everyone can eventually standardize on the best one if one ever emerges. Regards, Anthony Liguori > Paravirtualization is a symptom of an architectural problem. We should > always be trying to fix the architecture first. > > Zach > > From hpa at zytor.com Wed Oct 1 17:39:22 2008 From: hpa at zytor.com (H. Peter Anvin) Date: Wed, 01 Oct 2008 17:39:22 -0700 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. 
In-Reply-To: <1222904824.7330.83.camel@bodhitayantram.eng.vmware.com> References: <1222881242.9381.17.camel@alok-dev1> <48E3BBC1.2050607@goop.org> <1222894878.9381.63.camel@alok-dev1> <48E3E8DE.1080602@goop.org> <48E3ECD1.30809@codemonkey.ws> <1222904824.7330.83.camel@bodhitayantram.eng.vmware.com> Message-ID: <48E4183A.7090208@zytor.com> Zachary Amsden wrote: > > Jun, you work at Intel. Can you ask for a new architecturally defined > MSR that returns the TSC frequency? Not a virtualization specific MSR. > A real MSR that would exist on physical processors. The TSC started as > an MSR anyway. There should be another MSR that tells the frequency. > If it's hard to do in hardware, it can be a write-once MSR that gets > initialized by the BIOS. It's really a very simple solution to a very > common problem. Other MSRs are dedicated to bus speed and so on, this > seems remarkably similar. > Ah, if it was only that simple. Transmeta actually did this, but it's not as useful as you think. There are at least three crystals in modern PCs: one at 32.768 kHz (for the RTC), one at 14.31818 MHz (PIT, PMTMR and HPET), and one at a higher frequency (often 200 MHz.) All the main data distribution clocks in the system are derived from the third, which is subject to spread-spectrum modulation due to RFI concerns. Therefore, relying on the *nominal* frequency of this clock is vastly incorrect; often by as much as 2%. Spread-spectrum modulation is supposed to vary around zero enough that the spreading averages out, but the only way to know what the center frequency actually is is to average. Furthermore, this high-frequency clock is generally not calibrated anywhere near as well as the 14 MHz clock; in good designs the 14 MHz is actually a TCXO (temperature compensated crystal oscillator), which is accurate to something like ±2 ppm. -hpa From jun.nakajima at intel.com Wed Oct 1 18:11:11 2008 From: jun.nakajima at intel.com (Nakajima, Jun) Date: Wed, 1 Oct 2008 18:11:11 -0700 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux.
> > All I have seen out of Microsoft only covers CPUID levels 0x40000000 > as an vendor identification leaf and 0x40000001 as a "hypervisor > identification leaf", but you might have access to other information. No, it says "Leaf 0x40000001 as hypervisor vendor-neutral interface identification, which determines the semantics of leaves from 0x40000002 through 0x400000FF." The Leaf 0x40000000 returns vendor identifier signature (i.e. hypervisor identification) and the hypervisor CPUID leaf range, as in the proposal. > > This further underscores my belief that using 0x400000xx for anything > "standards-based" at all is utterly futile, and that this space should > be treated as vendor identification and the rest as vendor-specific. > Any hope of creating a standard that's actually usable needs to be > outside this space, e.g. in the 0x40SSSSxx space I proposed earlier. > Actually I'm not sure I'm following your logic. Are you saying using that 0x400000xx for anything "standards-based" is utterly futile because Microsoft said "the range is hypervisor vendor-neutral"? Or you were not sure what they meant there. If we are not clear, we can ask them. > -hpa . Jun Nakajima | Intel Open Source Technology Center From zach at vmware.com Wed Oct 1 18:11:40 2008 From: zach at vmware.com (Zachary Amsden) Date: Wed, 01 Oct 2008 18:11:40 -0700 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. In-Reply-To: <48E4183A.7090208@zytor.com> References: <1222881242.9381.17.camel@alok-dev1> <48E3BBC1.2050607@goop.org> <1222894878.9381.63.camel@alok-dev1> <48E3E8DE.1080602@goop.org> <48E3ECD1.30809@codemonkey.ws> <1222904824.7330.83.camel@bodhitayantram.eng.vmware.com> <48E4183A.7090208@zytor.com> Message-ID: <1222909900.7330.106.camel@bodhitayantram.eng.vmware.com> On Wed, 2008-10-01 at 17:39 -0700, H. Peter Anvin wrote: > third, which is subject to spread-spectrum modulation due to RFI > concerns. Therefore, relying on the *nominal* frequency of this clock I'm not suggesting using the nominal value. I'm suggesting the measurement be done in the one and only place where there is perfect control of the system, the processor boot-strapping in the BIOS. Only the platform designers themselves know the speed of the oscillator which is modulating the clock and so only they should be calibrating the speed of the TSC. If this modulation really does alter the frequency by +/- 2% (seems high to me, but hey, I don't design motherboards), using an LFO, then basically all the calibration done in Linux is broken and has been for some time. You can't calibrate only once, or risk being off by 2%, you can't calibrate repeatedly and take the fastest estimate, or you are off by 2%, and you can't calibrate repeatedly and take the average without risking SMI noise affecting the lowest clock speed measurement, contributing unknown error. Hmm. Re-reading your e-mail, I see you are saying the nominal frequency may be off by 2% (and I easily believe that), not necessarily that the frequency modulation may be 2% (which I still think is high). Does anyone know what the actual bounds on spread spectrum modulation are or how fast the clock is modulated? Zach From hpa at zytor.com Wed Oct 1 18:21:56 2008 From: hpa at zytor.com (H. Peter Anvin) Date: Wed, 01 Oct 2008 18:21:56 -0700 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. 
In-Reply-To: <1222909900.7330.106.camel@bodhitayantram.eng.vmware.com> References: <1222881242.9381.17.camel@alok-dev1> <48E3BBC1.2050607@goop.org> <1222894878.9381.63.camel@alok-dev1> <48E3E8DE.1080602@goop.org> <48E3ECD1.30809@codemonkey.ws> <1222904824.7330.83.camel@bodhitayantram.eng.vmware.com> <48E4183A.7090208@zytor.com> <1222909900.7330.106.camel@bodhitayantram.eng.vmware.com> Message-ID: <48E42234.9000808@zytor.com> Zachary Amsden wrote: > > I'm not suggesting using the nominal value. I'm suggesting the > measurement be done in the one and only place where there is perfect > control of the system, the processor boot-strapping in the BIOS. > > Only the platform designers themselves know the speed of the oscillator > which is modulating the clock and so only they should be calibrating the > speed of the TSC. > No. *Noone*, including the manufacturers, know the speed of the oscillator which is modulating the clock. What you have to do is average over a timespan which is long enough that the SSM averages out (a relatively small fraction of a second.) As for trusting the BIOS on this, that's a total joke. Firmware vendors can't get the most basic details right. > If this modulation really does alter the frequency by +/- 2% (seems high > to me, but hey, I don't design motherboards), using an LFO, then > basically all the calibration done in Linux is broken and has been for > some time. You can't calibrate only once, or risk being off by 2%, you > can't calibrate repeatedly and take the fastest estimate, or you are off > by 2%, and you can't calibrate repeatedly and take the average without > risking SMI noise affecting the lowest clock speed measurement, > contributing unknown error. You have to calibrate over a sample interval long enough that the SSM averages out. > Hmm. Re-reading your e-mail, I see you are saying the nominal frequency > may be off by 2% (and I easily believe that), not necessarily that the > frequency modulation may be 2% (which I still think is high). Does > anyone know what the actual bounds on spread spectrum modulation are or > how fast the clock is modulated? No, I'm saying the frequency modulation may be up to 2%. Typically it is something like [-2%,+0%]. -hpa From hpa at zytor.com Wed Oct 1 18:24:26 2008 From: hpa at zytor.com (H. Peter Anvin) Date: Wed, 01 Oct 2008 18:24:26 -0700 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. In-Reply-To: <0B53E02A2965CE4F9ADB38B34501A3A15D927EA4@orsmsx505.amr.corp.intel.com> References: <1222881242.9381.17.camel@alok-dev1> <48E3B19D.6060905@zytor.com> <1222882431.9381.23.camel@alok-dev1> <48E3BC21.4080803@goop.org> <1222895153.9381.69.camel@alok-dev1> <48E3FDD5.7040106@zytor.com> <0B53E02A2965CE4F9ADB38B34501A3A15D927EA4@orsmsx505.amr.corp.intel.com> Message-ID: <48E422CA.2010606@zytor.com> Nakajima, Jun wrote: >> >> All I have seen out of Microsoft only covers CPUID levels 0x40000000 >> as an vendor identification leaf and 0x40000001 as a "hypervisor >> identification leaf", but you might have access to other information. > > No, it says "Leaf 0x40000001 as hypervisor vendor-neutral interface identification, which determines the semantics of leaves from 0x40000002 through 0x400000FF." The Leaf 0x40000000 returns vendor identifier signature (i.e. hypervisor identification) and the hypervisor CPUID leaf range, as in the proposal. 
> In other words, 0x40000002+ is vendor-specific space, based on the hypervisor specified in 0x40000001 (in theory); in practice both 0x40000000:0x40000001 since M$ seem to use clever identifiers as "Hypervisor 1". >> This further underscores my belief that using 0x400000xx for anything >> "standards-based" at all is utterly futile, and that this space should >> be treated as vendor identification and the rest as vendor-specific. >> Any hope of creating a standard that's actually usable needs to be >> outside this space, e.g. in the 0x40SSSSxx space I proposed earlier. > > Actually I'm not sure I'm following your logic. Are you saying using that 0x400000xx for anything "standards-based" is utterly futile because Microsoft said "the range is hypervisor vendor-neutral"? Or you were not sure what they meant there. If we are not clear, we can ask them. > What I'm saying is that Microsoft is effectively squatting on the 0x400000xx space with their definition. As written, it's not even clear that it will remain consistent between *their own* hypervisors, even less anyone else's. -hpa From hpa at kernel.org Wed Oct 1 17:57:56 2008 From: hpa at kernel.org (H. Peter Anvin) Date: Wed, 01 Oct 2008 17:57:56 -0700 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. In-Reply-To: <48E4183A.7090208@zytor.com> References: <1222881242.9381.17.camel@alok-dev1> <48E3BBC1.2050607@goop.org> <1222894878.9381.63.camel@alok-dev1> <48E3E8DE.1080602@goop.org> <48E3ECD1.30809@codemonkey.ws> <1222904824.7330.83.camel@bodhitayantram.eng.vmware.com> <48E4183A.7090208@zytor.com> Message-ID: <48E41C94.7040904@kernel.org> H. Peter Anvin wrote: > > Ah, if it was only that simple. Transmeta actually did this, but it's > not as useful as you think. > For what it's worth, Transmeta's implementation used CPUID leaf 0x80860001.ECX to give the TSC frequency rounded to the nearest MHz. The caveat of spread-spectrum modulation applies. -hpa From avi at redhat.com Thu Oct 2 04:29:00 2008 From: avi at redhat.com (Avi Kivity) Date: Thu, 02 Oct 2008 14:29:00 +0300 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. In-Reply-To: <20081001214338.GD634@sequoia.sous-sol.org> References: <1222881242.9381.17.camel@alok-dev1> <48E3BBC1.2050607@goop.org> <1222894878.9381.63.camel@alok-dev1> <48E3E8DE.1080602@goop.org> <48E3ECD1.30809@codemonkey.ws> <20081001214338.GD634@sequoia.sous-sol.org> Message-ID: <48E4B07C.60605@redhat.com> Chris Wright wrote: > * Anthony Liguori (anthony at codemonkey.ws) wrote: > >> And arguably, storing TSC frequency in CPUID is a terrible interface >> because the TSC frequency can change any time a guest is entered. It >> > > True for older hardware, newer hardware should fix this. I guess the > point is, the are numbers that are easy to measure incorrectly in guest. > Doesn't justify the whole thing.. > It's not fixed for newer hardware. Larger systems still have multiple tsc frequencies. -- error compiling committee.c: too many arguments to function From jbarnes at virtuousgeek.org Thu Oct 2 09:03:15 2008 From: jbarnes at virtuousgeek.org (Jesse Barnes) Date: Thu, 2 Oct 2008 09:03:15 -0700 Subject: [PATCH 3/6 v3] PCI: support ARI capability In-Reply-To: References: Message-ID: <200810020903.16385.jbarnes@virtuousgeek.org> On Saturday, September 27, 2008 1:28 am Zhao, Yu wrote: > Add Alternative Routing-ID Interpretation (ARI) support. 
> > Cc: Jesse Barnes > Cc: Randy Dunlap > Cc: Grant Grundler > Cc: Alex Chiang > Cc: Matthew Wilcox > Cc: Roland Dreier > Cc: Greg KH > Signed-off-by: Yu Zhao > > --- > drivers/pci/pci.c | 31 +++++++++++++++++++++++++++++++ > drivers/pci/pci.h | 12 ++++++++++++ > drivers/pci/probe.c | 3 +++ > include/linux/pci.h | 1 + > include/linux/pci_regs.h | 14 ++++++++++++++ > 5 files changed, 61 insertions(+), 0 deletions(-) > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c > index 400d3b3..fe9efc4 100644 > --- a/drivers/pci/pci.c > +++ b/drivers/pci/pci.c > @@ -1260,6 +1260,37 @@ void pci_pm_init(struct pci_dev *dev) > } > } > > +/** > + * pci_ari_init - turn on ARI forwarding if it's supported > + * @dev: the PCI device > + */ > +void pci_ari_init(struct pci_dev *dev) > +{ > + int pos; > + u32 cap; > + u16 ctrl; > + > + if (!dev->is_pcie || (dev->pcie_type != PCI_EXP_TYPE_ROOT_PORT && > + dev->pcie_type != PCI_EXP_TYPE_DOWNSTREAM)) > + return; > + > + pos = pci_find_capability(dev, PCI_CAP_ID_EXP); > + if (!pos) > + return; > + > + pci_read_config_dword(dev, pos + PCI_EXP_DEVCAP2, &cap); > + > + if (!(cap & PCI_EXP_DEVCAP2_ARI)) > + return; > + > + pci_read_config_word(dev, pos + PCI_EXP_DEVCTL2, &ctrl); > + ctrl |= PCI_EXP_DEVCTL2_ARI; > + pci_write_config_word(dev, pos + PCI_EXP_DEVCTL2, ctrl); > + > + dev->ari_enabled = 1; > + dev_info(&dev->dev, "ARI forwarding enabled.\n"); > +} > + Maybe we should be consistent with the other APIs and call it pci_enable_ari (like we do for wake & msi). Looks pretty good otherwise. Jesse From matthew at wil.cx Thu Oct 2 09:17:01 2008 From: matthew at wil.cx (Matthew Wilcox) Date: Thu, 2 Oct 2008 10:17:01 -0600 Subject: [PATCH 3/6 v3] PCI: support ARI capability In-Reply-To: <200810020903.16385.jbarnes@virtuousgeek.org> References: <200810020903.16385.jbarnes@virtuousgeek.org> Message-ID: <20081002161701.GO13822@parisc-linux.org> On Thu, Oct 02, 2008 at 09:03:15AM -0700, Jesse Barnes wrote: > Maybe we should be consistent with the other APIs and call it pci_enable_ari > (like we do for wake & msi). Looks pretty good otherwise. Those APIs are for drivers ... this is internal. I don't have any objection of my own, though I agree with Alex's remark that the printk is unnecessary and just adds clutter to the boot process. -- Matthew Wilcox Intel Open Source Technology Centre "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." From jbarnes at virtuousgeek.org Thu Oct 2 09:21:31 2008 From: jbarnes at virtuousgeek.org (Jesse Barnes) Date: Thu, 2 Oct 2008 09:21:31 -0700 Subject: [PATCH 2/6 v3] PCI: add new general functions In-Reply-To: References: Message-ID: <200810020921.32335.jbarnes@virtuousgeek.org> On Saturday, September 27, 2008 1:27 am Zhao, Yu wrote: > Centralize capability related functions into several new functions and put > PCI resource definitions into an enum. > diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c > index f99160d..f2feebc 100644 > --- a/drivers/pci/pci-sysfs.c > +++ b/drivers/pci/pci-sysfs.c The sysfs changes look fine, they should be submitted separately. 
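(Picking up the two ARI review comments above, namely Jesse's suggestion to rename the helper for consistency with pci_enable_wake()/pci_enable_msi() and the point that the dev_info() is just boot-log clutter, the respin would presumably end up looking roughly like this. A sketch only, not the actual follow-up patch.)

void pci_enable_ari(struct pci_dev *dev)
{
	int pos;
	u32 cap;
	u16 ctrl;

	/* ARI forwarding only makes sense on root and downstream ports */
	if (!dev->is_pcie || (dev->pcie_type != PCI_EXP_TYPE_ROOT_PORT &&
	    dev->pcie_type != PCI_EXP_TYPE_DOWNSTREAM))
		return;

	pos = pci_find_capability(dev, PCI_CAP_ID_EXP);
	if (!pos)
		return;

	pci_read_config_dword(dev, pos + PCI_EXP_DEVCAP2, &cap);
	if (!(cap & PCI_EXP_DEVCAP2_ARI))
		return;

	pci_read_config_word(dev, pos + PCI_EXP_DEVCTL2, &ctrl);
	ctrl |= PCI_EXP_DEVCTL2_ARI;
	pci_write_config_word(dev, pos + PCI_EXP_DEVCTL2, ctrl);

	/* no dev_info() here: the message only added boot-log clutter */
	dev->ari_enabled = 1;
}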
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c > index 259eaff..400d3b3 100644 > --- a/drivers/pci/pci.c > +++ b/drivers/pci/pci.c > @@ -356,25 +356,10 @@ pci_find_parent_resource(const struct pci_dev *dev, > struct resource *res) static void > pci_restore_bars(struct pci_dev *dev) > { > - int i, numres; > - > - switch (dev->hdr_type) { > - case PCI_HEADER_TYPE_NORMAL: > - numres = 6; > - break; > - case PCI_HEADER_TYPE_BRIDGE: > - numres = 2; > - break; > - case PCI_HEADER_TYPE_CARDBUS: > - numres = 1; > - break; > - default: > - /* Should never get here, but just in case... */ > - return; > - } > + int i; > > - for (i = 0; i < numres; i ++) > - pci_update_resource(dev, &dev->resource[i], i); > + for (i = 0; i < PCI_BRIDGE_RESOURCES; i++) > + pci_update_resource(dev, i); > } This confused me for a minute until I saw that the new pci_update_resource ignores invalid BAR numbers. Not sure if that's clearer than the current code... > +/** > + * pci_resource_bar - get position of the BAR associated with a resource > + * @dev: the PCI device > + * @resno: the resource number > + * @type: the BAR type to be filled in > + * > + * Returns BAR position in config space, or 0 if the BAR is invalid. > + */ > +int pci_resource_bar(struct pci_dev *dev, int resno, enum pci_bar_type > *type) +{ > + if (resno < PCI_ROM_RESOURCE) { > + *type = pci_bar_unknown; > + return PCI_BASE_ADDRESS_0 + 4 * resno; > + } else if (resno == PCI_ROM_RESOURCE) { > + *type = pci_bar_rom; > + return dev->rom_base_reg; > + } > + > + dev_err(&dev->dev, "BAR: invalid resource #%d\n", resno); > + return 0; > +} It looks like this will spew an error even under normal circumstances since pci_restore_bars gets called at resume time, right? You could make this into a debug message or just get rid of it. Also now that I look at this, I don't think it'll provide equivalent functionality to the old restore_bars code, won't a cardbus bridge end up getting pci_update_resource called on invalid BARs? > +static void pci_init_capabilities(struct pci_dev *dev) > +{ > + /* MSI/MSI-X list */ > + pci_msi_init_pci_dev(dev); > + > + /* Power Management */ > + pci_pm_init(dev); > + > + /* Vital Product Data */ > + pci_vpd_pci22_init(dev); > +} > + These capabilities changes look good, care to separate them out? Let's see if we can whittle down this patchset by extracting and applying all the various cleanups; that should make the core bits easier to review. Thanks, -- Jesse Barnes, Intel Open Source Technology Center From ryov at valinux.co.jp Thu Oct 2 22:37:49 2008 From: ryov at valinux.co.jp (Ryo Tsuruta) Date: Fri, 03 Oct 2008 14:37:49 +0900 (JST) Subject: [PATCH 0/2] dm-ioband: I/O bandwidth controller v1.7.0: Introduction Message-ID: <20081003.143749.193701570.ryov@valinux.co.jp> Hi everyone, This is the dm-ioband version 1.7.0 release. Dm-ioband is an I/O bandwidth controller implemented as a device-mapper driver, which gives specified bandwidth to each job running on the same physical device. - Can be applied to the kernel 2.6.27-rc5-mm1. - Changes from 1.6.0 (posted on Sep 24, 2008): - Fix a problem that processes issuing I/Os are permanently blocked when I/O requests to reclaim pages are consecutively issued. You can apply the latest bio-cgroup patch to this dm-ioband version. The bio-cgroup provides a BIO tracking mechanism with dm-ioband. 
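(As a concrete illustration of how such an ioband device might be created: the table format below is inferred from ioband_ctr()'s argument parsing in the patch that follows, so the "none" group type, the "weight" policy and the trailing policy/group arguments are assumptions matching dm-ioband's documentation rather than anything quoted in this mail.)

# table: <start> <length> ioband <device> <name> <io_throttle> <io_limit> <type> <policy> [policy/group args...]
# io_throttle and io_limit of 0 mean "use the defaults / the queue's nr_requests".
echo "0 $(blockdev --getsize /dev/sdb1) ioband /dev/sdb1 share1 0 0 none weight 0 :80" | dmsetup create ioband1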
Please see the following site for more information: Block I/O tracking http://people.valinux.co.jp/~ryov/bio-cgroup/ Thanks, Ryo Tsuruta From ryov at valinux.co.jp Thu Oct 2 22:38:24 2008 From: ryov at valinux.co.jp (Ryo Tsuruta) Date: Fri, 03 Oct 2008 14:38:24 +0900 (JST) Subject: [PATCH 1/2] dm-ioband: I/O bandwidth controller v1.7.0: Source code and patch In-Reply-To: <20081003.143749.193701570.ryov@valinux.co.jp> References: <20081003.143749.193701570.ryov@valinux.co.jp> Message-ID: <20081003.143824.226785638.ryov@valinux.co.jp> This patch is the dm-ioband version 1.7.0 release. Based on 2.6.27-rc5-mm1 Signed-off-by: Ryo Tsuruta Signed-off-by: Hirokazu Takahashi diff -uprN linux-2.6.27-rc5-mm1.orig/drivers/md/Kconfig linux-2.6.27-rc5-mm1/drivers/md/Kconfig --- linux-2.6.27-rc5-mm1.orig/drivers/md/Kconfig 2008-08-29 07:52:02.000000000 +0900 +++ linux-2.6.27-rc5-mm1/drivers/md/Kconfig 2008-10-02 19:56:48.000000000 +0900 @@ -275,4 +275,17 @@ config DM_UEVENT ---help--- Generate udev events for DM events. +config DM_IOBAND + tristate "I/O bandwidth control (EXPERIMENTAL)" + depends on BLK_DEV_DM && EXPERIMENTAL + ---help--- + This device-mapper target allows to define how the + available bandwidth of a storage device should be + shared between processes, cgroups, the partitions or the LUNs. + + Information on how to use dm-ioband is available in: + . + + If unsure, say N. + endif # MD diff -uprN linux-2.6.27-rc5-mm1.orig/drivers/md/Makefile linux-2.6.27-rc5-mm1/drivers/md/Makefile --- linux-2.6.27-rc5-mm1.orig/drivers/md/Makefile 2008-08-29 07:52:02.000000000 +0900 +++ linux-2.6.27-rc5-mm1/drivers/md/Makefile 2008-10-02 19:56:48.000000000 +0900 @@ -7,6 +7,7 @@ dm-mod-objs := dm.o dm-table.o dm-target dm-multipath-objs := dm-path-selector.o dm-mpath.o dm-snapshot-objs := dm-snap.o dm-exception-store.o dm-mirror-objs := dm-raid1.o +dm-ioband-objs := dm-ioband-ctl.o dm-ioband-policy.o dm-ioband-type.o md-mod-objs := md.o bitmap.o raid456-objs := raid5.o raid6algos.o raid6recov.o raid6tables.o \ raid6int1.o raid6int2.o raid6int4.o \ @@ -36,6 +37,7 @@ obj-$(CONFIG_DM_MULTIPATH) += dm-multipa obj-$(CONFIG_DM_SNAPSHOT) += dm-snapshot.o obj-$(CONFIG_DM_MIRROR) += dm-mirror.o dm-log.o obj-$(CONFIG_DM_ZERO) += dm-zero.o +obj-$(CONFIG_DM_IOBAND) += dm-ioband.o quiet_cmd_unroll = UNROLL $@ cmd_unroll = $(PERL) $(srctree)/$(src)/unroll.pl $(UNROLL) \ diff -uprN linux-2.6.27-rc5-mm1.orig/drivers/md/dm-ioband-ctl.c linux-2.6.27-rc5-mm1/drivers/md/dm-ioband-ctl.c --- linux-2.6.27-rc5-mm1.orig/drivers/md/dm-ioband-ctl.c 1970-01-01 09:00:00.000000000 +0900 +++ linux-2.6.27-rc5-mm1/drivers/md/dm-ioband-ctl.c 2008-10-02 19:56:48.000000000 +0900 @@ -0,0 +1,1328 @@ +/* + * Copyright (C) 2008 VA Linux Systems Japan K.K. + * Authors: Hirokazu Takahashi + * Ryo Tsuruta + * + * I/O bandwidth control + * + * This file is released under the GPL. 
+ */ +#include +#include +#include +#include +#include +#include +#include +#include "dm.h" +#include "dm-bio-list.h" +#include "dm-ioband.h" + +#define DM_MSG_PREFIX "ioband" +#define POLICY_PARAM_START 6 +#define POLICY_PARAM_DELIM "=:," + +static LIST_HEAD(ioband_device_list); +/* to protect ioband_device_list */ +static DEFINE_SPINLOCK(ioband_devicelist_lock); + +static void suspend_ioband_device(struct ioband_device *, unsigned long, int); +static void resume_ioband_device(struct ioband_device *); +static void ioband_conduct(struct work_struct *); +static void ioband_hold_bio(struct ioband_group *, struct bio *); +static struct bio *ioband_pop_bio(struct ioband_group *); +static int ioband_set_param(struct ioband_group *, char *, char *); +static int ioband_group_attach(struct ioband_group *, int, char *); +static int ioband_group_type_select(struct ioband_group *, char *); + +long ioband_debug; /* just for debugging */ + +static void do_nothing(void) {} + +static int policy_init(struct ioband_device *dp, char *name, + int argc, char **argv) +{ + struct policy_type *p; + struct ioband_group *gp; + unsigned long flags; + int r; + + for (p = dm_ioband_policy_type; p->p_name; p++) { + if (!strcmp(name, p->p_name)) + break; + } + if (!p->p_name) + return -EINVAL; + + spin_lock_irqsave(&dp->g_lock, flags); + if (dp->g_policy == p) { + /* do nothing if the same policy is already set */ + spin_unlock_irqrestore(&dp->g_lock, flags); + return 0; + } + + suspend_ioband_device(dp, flags, 1); + list_for_each_entry(gp, &dp->g_groups, c_list) + dp->g_group_dtr(gp); + + /* switch to the new policy */ + dp->g_policy = p; + r = p->p_policy_init(dp, argc, argv); + if (!dp->g_hold_bio) + dp->g_hold_bio = ioband_hold_bio; + if (!dp->g_pop_bio) + dp->g_pop_bio = ioband_pop_bio; + + list_for_each_entry(gp, &dp->g_groups, c_list) + dp->g_group_ctr(gp, NULL); + resume_ioband_device(dp); + spin_unlock_irqrestore(&dp->g_lock, flags); + return r; +} + +static struct ioband_device *alloc_ioband_device(char *name, + int io_throttle, int io_limit) + +{ + struct ioband_device *dp, *new; + unsigned long flags; + + new = kzalloc(sizeof(struct ioband_device), GFP_KERNEL); + if (!new) + return NULL; + + spin_lock_irqsave(&ioband_devicelist_lock, flags); + list_for_each_entry(dp, &ioband_device_list, g_list) { + if (!strcmp(dp->g_name, name)) { + dp->g_ref++; + spin_unlock_irqrestore(&ioband_devicelist_lock, flags); + kfree(new); + return dp; + } + } + + /* + * Prepare its own workqueue as generic_make_request() may + * potentially block the workqueue when submitting BIOs. 
+ */ + new->g_ioband_wq = create_workqueue("kioband"); + if (!new->g_ioband_wq) { + spin_unlock_irqrestore(&ioband_devicelist_lock, flags); + kfree(new); + return NULL; + } + + INIT_DELAYED_WORK(&new->g_conductor, ioband_conduct); + INIT_LIST_HEAD(&new->g_groups); + INIT_LIST_HEAD(&new->g_list); + spin_lock_init(&new->g_lock); + mutex_init(&new->g_lock_device); + bio_list_init(&new->g_urgent_bios); + new->g_io_throttle = io_throttle; + new->g_io_limit[0] = io_limit; + new->g_io_limit[1] = io_limit; + new->g_issued[0] = 0; + new->g_issued[1] = 0; + new->g_blocked = 0; + new->g_ref = 1; + new->g_flags = 0; + strlcpy(new->g_name, name, sizeof(new->g_name)); + new->g_policy = NULL; + new->g_hold_bio = NULL; + new->g_pop_bio = NULL; + init_waitqueue_head(&new->g_waitq); + init_waitqueue_head(&new->g_waitq_suspend); + init_waitqueue_head(&new->g_waitq_flush); + list_add_tail(&new->g_list, &ioband_device_list); + + spin_unlock_irqrestore(&ioband_devicelist_lock, flags); + return new; +} + +static void release_ioband_device(struct ioband_device *dp) +{ + unsigned long flags; + + spin_lock_irqsave(&ioband_devicelist_lock, flags); + dp->g_ref--; + if (dp->g_ref > 0) { + spin_unlock_irqrestore(&ioband_devicelist_lock, flags); + return; + } + list_del(&dp->g_list); + spin_unlock_irqrestore(&ioband_devicelist_lock, flags); + destroy_workqueue(dp->g_ioband_wq); + kfree(dp); +} + +static int is_ioband_device_flushed(struct ioband_device *dp, + int wait_completion) +{ + struct ioband_group *gp; + + if (wait_completion && dp->g_issued[0] + dp->g_issued[1] > 0) + return 0; + if (dp->g_blocked || waitqueue_active(&dp->g_waitq)) + return 0; + list_for_each_entry(gp, &dp->g_groups, c_list) + if (waitqueue_active(&gp->c_waitq)) + return 0; + return 1; +} + +static void suspend_ioband_device(struct ioband_device *dp, + unsigned long flags, int wait_completion) +{ + struct ioband_group *gp; + + /* block incoming bios */ + set_device_suspended(dp); + + /* wake up all blocked processes and go down all ioband groups */ + wake_up_all(&dp->g_waitq); + list_for_each_entry(gp, &dp->g_groups, c_list) { + if (!is_group_down(gp)) { + set_group_down(gp); + set_group_need_up(gp); + } + wake_up_all(&gp->c_waitq); + } + + /* flush the already mapped bios */ + spin_unlock_irqrestore(&dp->g_lock, flags); + queue_delayed_work(dp->g_ioband_wq, &dp->g_conductor, 0); + flush_workqueue(dp->g_ioband_wq); + + /* wait for all processes to wake up and bios to release */ + spin_lock_irqsave(&dp->g_lock, flags); + wait_event_lock_irq(dp->g_waitq_flush, + is_ioband_device_flushed(dp, wait_completion), + dp->g_lock, do_nothing()); +} + +static void resume_ioband_device(struct ioband_device *dp) +{ + struct ioband_group *gp; + + /* go up ioband groups */ + list_for_each_entry(gp, &dp->g_groups, c_list) { + if (group_need_up(gp)) { + clear_group_need_up(gp); + clear_group_down(gp); + } + } + + /* accept incoming bios */ + wake_up_all(&dp->g_waitq_suspend); + clear_device_suspended(dp); +} + +static struct ioband_group *ioband_group_find( + struct ioband_group *head, int id) +{ + struct rb_node *node = head->c_group_root.rb_node; + + while (node) { + struct ioband_group *p = + container_of(node, struct ioband_group, c_group_node); + + if (p->c_id == id || id == IOBAND_ID_ANY) + return p; + node = (id < p->c_id) ? 
node->rb_left : node->rb_right; + } + return NULL; +} + +static void ioband_group_add_node(struct rb_root *root, + struct ioband_group *gp) +{ + struct rb_node **new = &root->rb_node, *parent = NULL; + struct ioband_group *p; + + while (*new) { + p = container_of(*new, struct ioband_group, c_group_node); + parent = *new; + new = (gp->c_id < p->c_id) ? + &(*new)->rb_left : &(*new)->rb_right; + } + + rb_link_node(&gp->c_group_node, parent, new); + rb_insert_color(&gp->c_group_node, root); +} + +static int ioband_group_init(struct ioband_group *gp, + struct ioband_group *head, struct ioband_device *dp, int id, char *param) +{ + unsigned long flags; + int r; + + INIT_LIST_HEAD(&gp->c_list); + bio_list_init(&gp->c_blocked_bios); + bio_list_init(&gp->c_prio_bios); + gp->c_id = id; /* should be verified */ + gp->c_blocked = 0; + gp->c_prio_blocked = 0; + memset(gp->c_stat, 0, sizeof(gp->c_stat)); + init_waitqueue_head(&gp->c_waitq); + gp->c_flags = 0; + gp->c_group_root = RB_ROOT; + gp->c_banddev = dp; + + spin_lock_irqsave(&dp->g_lock, flags); + if (head && ioband_group_find(head, id)) { + spin_unlock_irqrestore(&dp->g_lock, flags); + DMWARN("ioband_group: id=%d already exists.", id); + return -EEXIST; + } + + list_add_tail(&gp->c_list, &dp->g_groups); + + r = dp->g_group_ctr(gp, param); + if (r) { + list_del(&gp->c_list); + spin_unlock_irqrestore(&dp->g_lock, flags); + return r; + } + + if (head) { + ioband_group_add_node(&head->c_group_root, gp); + gp->c_dev = head->c_dev; + gp->c_target = head->c_target; + } + + spin_unlock_irqrestore(&dp->g_lock, flags); + + return 0; +} + +static void ioband_group_release(struct ioband_group *head, + struct ioband_group *gp) +{ + struct ioband_device *dp = gp->c_banddev; + + list_del(&gp->c_list); + if (head) + rb_erase(&gp->c_group_node, &head->c_group_root); + dp->g_group_dtr(gp); + kfree(gp); +} + +static void ioband_group_destroy_all(struct ioband_group *gp) +{ + struct ioband_device *dp = gp->c_banddev; + struct ioband_group *group; + unsigned long flags; + + spin_lock_irqsave(&dp->g_lock, flags); + while ((group = ioband_group_find(gp, IOBAND_ID_ANY))) + ioband_group_release(gp, group); + ioband_group_release(NULL, gp); + spin_unlock_irqrestore(&dp->g_lock, flags); +} + +static void ioband_group_stop_all(struct ioband_group *head, int suspend) +{ + struct ioband_device *dp = head->c_banddev; + struct ioband_group *p; + struct rb_node *node; + unsigned long flags; + + spin_lock_irqsave(&dp->g_lock, flags); + for (node = rb_first(&head->c_group_root); node; node = rb_next(node)) { + p = rb_entry(node, struct ioband_group, c_group_node); + set_group_down(p); + if (suspend) { + set_group_suspended(p); + dprintk(KERN_ERR "ioband suspend: gp(%p)\n", p); + } + } + set_group_down(head); + if (suspend) { + set_group_suspended(head); + dprintk(KERN_ERR "ioband suspend: gp(%p)\n", head); + } + spin_unlock_irqrestore(&dp->g_lock, flags); + queue_delayed_work(dp->g_ioband_wq, &dp->g_conductor, 0); + flush_workqueue(dp->g_ioband_wq); +} + +static void ioband_group_resume_all(struct ioband_group *head) +{ + struct ioband_device *dp = head->c_banddev; + struct ioband_group *p; + struct rb_node *node; + unsigned long flags; + + spin_lock_irqsave(&dp->g_lock, flags); + for (node = rb_first(&head->c_group_root); node; + node = rb_next(node)) { + p = rb_entry(node, struct ioband_group, c_group_node); + clear_group_down(p); + clear_group_suspended(p); + dprintk(KERN_ERR "ioband resume: gp(%p)\n", p); + } + clear_group_down(head); + clear_group_suspended(head); + 
dprintk(KERN_ERR "ioband resume: gp(%p)\n", head); + spin_unlock_irqrestore(&dp->g_lock, flags); +} + +static int split_string(char *s, long *id, char **v) +{ + char *p, *q; + int r = 0; + + *id = IOBAND_ID_ANY; + p = strsep(&s, POLICY_PARAM_DELIM); + q = strsep(&s, POLICY_PARAM_DELIM); + if (!q) { + *v = p; + } else { + r = strict_strtol(p, 0, id); + *v = q; + } + return r; +} + +/* + * Create a new band device: + * parameters: + * + */ +static int ioband_ctr(struct dm_target *ti, unsigned int argc, char **argv) +{ + struct ioband_group *gp; + struct ioband_device *dp; + struct dm_dev *dev; + int io_throttle; + int io_limit; + int i, r, start; + long val, id; + char *param; + + if (argc < POLICY_PARAM_START) { + ti->error = "Requires " __stringify(POLICY_PARAM_START) + " or more arguments"; + return -EINVAL; + } + + if (strlen(argv[1]) > IOBAND_NAME_MAX) { + ti->error = "Ioband device name is too long"; + return -EINVAL; + } + dprintk(KERN_ERR "ioband_ctr ioband device name:%s\n", argv[1]); + + r = strict_strtol(argv[2], 0, &val); + if (r || val < 0) { + ti->error = "Invalid io_throttle"; + return -EINVAL; + } + io_throttle = (val == 0) ? DEFAULT_IO_THROTTLE : val; + + r = strict_strtol(argv[3], 0, &val); + if (r || val < 0) { + ti->error = "Invalid io_limit"; + return -EINVAL; + } + io_limit = val; + + r = dm_get_device(ti, argv[0], 0, ti->len, + dm_table_get_mode(ti->table), &dev); + if (r) { + ti->error = "Device lookup failed"; + return r; + } + + if (io_limit == 0) { + struct request_queue *q; + + q = bdev_get_queue(dev->bdev); + if (!q) { + ti->error = "Can't get queue size"; + r = -ENXIO; + goto release_dm_device; + } + dprintk(KERN_ERR "ioband_ctr nr_requests:%lu\n", + q->nr_requests); + io_limit = q->nr_requests; + } + + if (io_limit < io_throttle) + io_limit = io_throttle; + dprintk(KERN_ERR "ioband_ctr io_throttle:%d io_limit:%d\n", + io_throttle, io_limit); + + dp = alloc_ioband_device(argv[1], io_throttle, io_limit); + if (!dp) { + ti->error = "Cannot create ioband device"; + r = -EINVAL; + goto release_dm_device; + } + + mutex_lock(&dp->g_lock_device); + r = policy_init(dp, argv[POLICY_PARAM_START - 1], + argc - POLICY_PARAM_START, &argv[POLICY_PARAM_START]); + if (r) { + ti->error = "Invalid policy parameter"; + goto release_ioband_device; + } + + gp = kzalloc(sizeof(struct ioband_group), GFP_KERNEL); + if (!gp) { + ti->error = "Cannot allocate memory for ioband group"; + r = -ENOMEM; + goto release_ioband_device; + } + + ti->private = gp; + gp->c_target = ti; + gp->c_dev = dev; + + /* Find a default group parameter */ + for (start = POLICY_PARAM_START; start < argc; start++) + if (argv[start][0] == ':') + break; + param = (start < argc) ? 
&argv[start][1] : NULL; + + /* Create a default ioband group */ + r = ioband_group_init(gp, NULL, dp, IOBAND_ID_ANY, param); + if (r) { + kfree(gp); + ti->error = "Cannot create default ioband group"; + goto release_ioband_device; + } + + r = ioband_group_type_select(gp, argv[4]); + if (r) { + ti->error = "Cannot set ioband group type"; + goto release_ioband_group; + } + + /* Create sub ioband groups */ + for (i = start + 1; i < argc; i++) { + r = split_string(argv[i], &id, ¶m); + if (r) { + ti->error = "Invalid ioband group parameter"; + goto release_ioband_group; + } + r = ioband_group_attach(gp, id, param); + if (r) { + ti->error = "Cannot create ioband group"; + goto release_ioband_group; + } + } + mutex_unlock(&dp->g_lock_device); + return 0; + +release_ioband_group: + ioband_group_destroy_all(gp); +release_ioband_device: + mutex_unlock(&dp->g_lock_device); + release_ioband_device(dp); +release_dm_device: + dm_put_device(ti, dev); + return r; +} + +static void ioband_dtr(struct dm_target *ti) +{ + struct ioband_group *gp = ti->private; + struct ioband_device *dp = gp->c_banddev; + + mutex_lock(&dp->g_lock_device); + ioband_group_stop_all(gp, 0); + cancel_delayed_work_sync(&dp->g_conductor); + dm_put_device(ti, gp->c_dev); + ioband_group_destroy_all(gp); + mutex_unlock(&dp->g_lock_device); + release_ioband_device(dp); +} + +static void ioband_hold_bio(struct ioband_group *gp, struct bio *bio) +{ + /* Todo: The list should be split into a read list and a write list */ + bio_list_add(&gp->c_blocked_bios, bio); +} + +static struct bio *ioband_pop_bio(struct ioband_group *gp) +{ + return bio_list_pop(&gp->c_blocked_bios); +} + +static int is_urgent_bio(struct bio *bio) +{ + struct page *page = bio_iovec_idx(bio, 0)->bv_page; + /* + * ToDo: A new flag should be added to struct bio, which indicates + * it contains urgent I/O requests. + */ + if (!PageReclaim(page)) + return 0; + if (PageSwapCache(page)) + return 2; + return 1; +} + +static inline int device_should_block(struct ioband_group *gp) +{ + struct ioband_device *dp = gp->c_banddev; + + if (is_group_down(gp)) + return 0; + if (is_device_blocked(dp)) + return 1; + if (dp->g_blocked >= dp->g_io_limit[0] + dp->g_io_limit[1]) { + set_device_blocked(dp); + return 1; + } + return 0; +} + +static inline int group_should_block(struct ioband_group *gp) +{ + struct ioband_device *dp = gp->c_banddev; + + if (is_group_down(gp)) + return 0; + if (is_group_blocked(gp)) + return 1; + if (dp->g_should_block(gp)) { + set_group_blocked(gp); + return 1; + } + return 0; +} + +static void prevent_burst_bios(struct ioband_group *gp, struct bio *bio) +{ + struct ioband_device *dp = gp->c_banddev; + + if (current->flags & PF_KTHREAD || is_urgent_bio(bio)) { + /* + * Kernel threads shouldn't be blocked easily since each of + * them may handle BIOs for several groups on several + * partitions. 
+ */ + wait_event_lock_irq(dp->g_waitq, !device_should_block(gp), + dp->g_lock, do_nothing()); + } else { + wait_event_lock_irq(gp->c_waitq, !group_should_block(gp), + dp->g_lock, do_nothing()); + } +} + +static inline int should_pushback_bio(struct ioband_group *gp) +{ + return is_group_suspended(gp) && dm_noflush_suspending(gp->c_target); +} + +static inline int prepare_to_issue(struct ioband_group *gp, struct bio *bio) +{ + struct ioband_device *dp = gp->c_banddev; + + dp->g_issued[bio_data_dir(bio)]++; + return dp->g_prepare_bio(gp, bio, 0); +} + +static inline int room_for_bio(struct ioband_device *dp) +{ + return dp->g_issued[0] < dp->g_io_limit[0] + || dp->g_issued[1] < dp->g_io_limit[1]; +} + +static void hold_bio(struct ioband_group *gp, struct bio *bio) +{ + struct ioband_device *dp = gp->c_banddev; + + dp->g_blocked++; + if (is_urgent_bio(bio)) { + /* + * ToDo: + * When barrier mode is supported, write bios sharing the same + * file system with the currnt one would be all moved + * to g_urgent_bios list. + * You don't have to care about barrier handling if the bio + * is for swapping. + */ + dp->g_prepare_bio(gp, bio, IOBAND_URGENT); + bio_list_add(&dp->g_urgent_bios, bio); + } else { + gp->c_blocked++; + dp->g_hold_bio(gp, bio); + } +} + +static inline int room_for_bio_rw(struct ioband_device *dp, int direct) +{ + return dp->g_issued[direct] < dp->g_io_limit[direct]; +} + +static void push_prio_bio(struct ioband_group *gp, struct bio *bio, int direct) +{ + if (bio_list_empty(&gp->c_prio_bios)) + set_prio_queue(gp, direct); + bio_list_add(&gp->c_prio_bios, bio); + gp->c_prio_blocked++; +} + +static struct bio *pop_prio_bio(struct ioband_group *gp) +{ + struct bio *bio = bio_list_pop(&gp->c_prio_bios); + + if (bio_list_empty(&gp->c_prio_bios)) + clear_prio_queue(gp); + + if (bio) + gp->c_prio_blocked--; + return bio; +} + +static int make_issue_list(struct ioband_group *gp, struct bio *bio, + struct bio_list *issue_list, struct bio_list *pushback_list) +{ + struct ioband_device *dp = gp->c_banddev; + + dp->g_blocked--; + gp->c_blocked--; + if (!gp->c_blocked && is_group_blocked(gp)) { + clear_group_blocked(gp); + wake_up_all(&gp->c_waitq); + } + if (should_pushback_bio(gp)) + bio_list_add(pushback_list, bio); + else { + int rw = bio_data_dir(bio); + + gp->c_stat[rw].deferred++; + gp->c_stat[rw].sectors += bio_sectors(bio); + bio_list_add(issue_list, bio); + } + return prepare_to_issue(gp, bio); +} + +static void release_urgent_bios(struct ioband_device *dp, + struct bio_list *issue_list, struct bio_list *pushback_list) +{ + struct bio *bio; + + if (bio_list_empty(&dp->g_urgent_bios)) + return; + while (room_for_bio_rw(dp, 1)) { + bio = bio_list_pop(&dp->g_urgent_bios); + if (!bio) + return; + dp->g_blocked--; + dp->g_issued[bio_data_dir(bio)]++; + bio_list_add(issue_list, bio); + } +} + +static int release_prio_bios(struct ioband_group *gp, + struct bio_list *issue_list, struct bio_list *pushback_list) +{ + struct ioband_device *dp = gp->c_banddev; + struct bio *bio; + int direct; + int ret; + + if (bio_list_empty(&gp->c_prio_bios)) + return R_OK; + direct = prio_queue_direct(gp); + while (gp->c_prio_blocked) { + if (!dp->g_can_submit(gp)) + return R_BLOCK; + if (!room_for_bio_rw(dp, direct)) + return R_OK; + bio = pop_prio_bio(gp); + if (!bio) + return R_OK; + ret = make_issue_list(gp, bio, issue_list, pushback_list); + if (ret) + return ret; + } + return R_OK; +} + +static int release_norm_bios(struct ioband_group *gp, + struct bio_list *issue_list, struct bio_list 
*pushback_list) +{ + struct ioband_device *dp = gp->c_banddev; + struct bio *bio; + int direct; + int ret; + + while (gp->c_blocked - gp->c_prio_blocked) { + if (!dp->g_can_submit(gp)) + return R_BLOCK; + if (!room_for_bio(dp)) + return R_OK; + bio = dp->g_pop_bio(gp); + if (!bio) + return R_OK; + + direct = bio_data_dir(bio); + if (!room_for_bio_rw(dp, direct)) { + push_prio_bio(gp, bio, direct); + continue; + } + ret = make_issue_list(gp, bio, issue_list, pushback_list); + if (ret) + return ret; + } + return R_OK; +} + +static inline int release_bios(struct ioband_group *gp, + struct bio_list *issue_list, struct bio_list *pushback_list) +{ + int ret = release_prio_bios(gp, issue_list, pushback_list); + if (ret) + return ret; + return release_norm_bios(gp, issue_list, pushback_list); +} + +static struct ioband_group *ioband_group_get(struct ioband_group *head, + struct bio *bio) +{ + struct ioband_group *gp; + + if (!head->c_type->t_getid) + return head; + + gp = ioband_group_find(head, head->c_type->t_getid(bio)); + + if (!gp) + gp = head; + return gp; +} + +/* + * Start to control the bandwidth once the number of uncompleted BIOs + * exceeds the value of "io_throttle". + */ +static int ioband_map(struct dm_target *ti, struct bio *bio, + union map_info *map_context) +{ + struct ioband_group *gp = ti->private; + struct ioband_device *dp = gp->c_banddev; + unsigned long flags; + int rw; + + spin_lock_irqsave(&dp->g_lock, flags); + + /* + * The device is suspended while some of the ioband device + * configurations are being changed. + */ + if (is_device_suspended(dp)) + wait_event_lock_irq(dp->g_waitq_suspend, + !is_device_suspended(dp), dp->g_lock, do_nothing()); + + gp = ioband_group_get(gp, bio); + prevent_burst_bios(gp, bio); + if (should_pushback_bio(gp)) { + spin_unlock_irqrestore(&dp->g_lock, flags); + return DM_MAPIO_REQUEUE; + } + + bio->bi_bdev = gp->c_dev->bdev; + bio->bi_sector -= ti->begin; + rw = bio_data_dir(bio); + + if (!gp->c_blocked && room_for_bio_rw(dp, rw)) { + if (dp->g_can_submit(gp)) { + prepare_to_issue(gp, bio); + gp->c_stat[rw].immediate++; + gp->c_stat[rw].sectors += bio_sectors(bio); + spin_unlock_irqrestore(&dp->g_lock, flags); + return DM_MAPIO_REMAPPED; + } else if (!dp->g_blocked + && dp->g_issued[0] + dp->g_issued[1] == 0) { + dprintk(KERN_ERR "ioband_map: token expired " + "gp:%p bio:%p\n", gp, bio); + queue_delayed_work(dp->g_ioband_wq, + &dp->g_conductor, 1); + } + } + hold_bio(gp, bio); + spin_unlock_irqrestore(&dp->g_lock, flags); + + return DM_MAPIO_SUBMITTED; +} + +/* + * Select the best group to resubmit its BIOs. + */ +static struct ioband_group *choose_best_group(struct ioband_device *dp) +{ + struct ioband_group *gp; + struct ioband_group *best = NULL; + int highest = 0; + int pri; + + /* Todo: The algorithm should be optimized. + * It would be better to use rbtree. + */ + list_for_each_entry(gp, &dp->g_groups, c_list) { + if (!gp->c_blocked || !room_for_bio(dp)) + continue; + if (gp->c_blocked == gp->c_prio_blocked + && !room_for_bio_rw(dp, prio_queue_direct(gp))) { + continue; + } + pri = dp->g_can_submit(gp); + if (pri > highest) { + highest = pri; + best = gp; + } + } + + return best; +} + +/* + * This function is called right after it becomes able to resubmit BIOs. + * It selects the best BIOs and passes them to the underlying layer. 
+ */ +static void ioband_conduct(struct work_struct *work) +{ + struct ioband_device *dp = + container_of(work, struct ioband_device, g_conductor.work); + struct ioband_group *gp = NULL; + struct bio *bio; + unsigned long flags; + struct bio_list issue_list, pushback_list; + + bio_list_init(&issue_list); + bio_list_init(&pushback_list); + + spin_lock_irqsave(&dp->g_lock, flags); + release_urgent_bios(dp, &issue_list, &pushback_list); + if (dp->g_blocked) { + gp = choose_best_group(dp); + if (gp && release_bios(gp, &issue_list, &pushback_list) + == R_YIELD) + queue_delayed_work(dp->g_ioband_wq, + &dp->g_conductor, 0); + } + + if (is_device_blocked(dp) + && dp->g_blocked < dp->g_io_limit[0]+dp->g_io_limit[1]) { + clear_device_blocked(dp); + wake_up_all(&dp->g_waitq); + } + + if (dp->g_blocked && room_for_bio_rw(dp, 0) && room_for_bio_rw(dp, 1) && + bio_list_empty(&issue_list) && bio_list_empty(&pushback_list) && + dp->g_restart_bios(dp)) { + dprintk(KERN_ERR "ioband_conduct: token expired dp:%p " + "issued(%d,%d) g_blocked(%d)\n", dp, + dp->g_issued[0], dp->g_issued[1], dp->g_blocked); + queue_delayed_work(dp->g_ioband_wq, &dp->g_conductor, 0); + } + + + spin_unlock_irqrestore(&dp->g_lock, flags); + + while ((bio = bio_list_pop(&issue_list))) + generic_make_request(bio); + while ((bio = bio_list_pop(&pushback_list))) + bio_endio(bio, -EIO); +} + +static int ioband_end_io(struct dm_target *ti, struct bio *bio, + int error, union map_info *map_context) +{ + struct ioband_group *gp = ti->private; + struct ioband_device *dp = gp->c_banddev; + unsigned long flags; + int r = error; + + /* + * XXX: A new error code for device mapper devices should be used + * rather than EIO. + */ + if (error == -EIO && should_pushback_bio(gp)) { + /* This ioband device is suspending */ + r = DM_ENDIO_REQUEUE; + } + /* + * Todo: The algorithm should be optimized to eliminate the spinlock. + */ + spin_lock_irqsave(&dp->g_lock, flags); + dp->g_issued[bio_data_dir(bio)]--; + + /* + * Todo: It would be better to introduce high/low water marks here + * not to kick the workqueues so often. 
+ */ + if (dp->g_blocked) + queue_delayed_work(dp->g_ioband_wq, &dp->g_conductor, 0); + else if (is_device_suspended(dp) + && dp->g_issued[0] + dp->g_issued[1] == 0) + wake_up_all(&dp->g_waitq_flush); + spin_unlock_irqrestore(&dp->g_lock, flags); + return r; +} + +static void ioband_presuspend(struct dm_target *ti) +{ + struct ioband_group *gp = ti->private; + struct ioband_device *dp = gp->c_banddev; + + mutex_lock(&dp->g_lock_device); + ioband_group_stop_all(gp, 1); + mutex_unlock(&dp->g_lock_device); +} + +static void ioband_resume(struct dm_target *ti) +{ + struct ioband_group *gp = ti->private; + struct ioband_device *dp = gp->c_banddev; + + mutex_lock(&dp->g_lock_device); + ioband_group_resume_all(gp); + mutex_unlock(&dp->g_lock_device); +} + + +static void ioband_group_status(struct ioband_group *gp, int *szp, + char *result, unsigned int maxlen) +{ + struct ioband_group_stat *stat; + int i, sz = *szp; /* used in DMEMIT() */ + + DMEMIT(" %d", gp->c_id); + for (i = 0; i < 2; i++) { + stat = &gp->c_stat[i]; + DMEMIT(" %lu %lu %lu", + stat->immediate + stat->deferred, stat->deferred, + stat->sectors); + } + *szp = sz; +} + +static int ioband_status(struct dm_target *ti, status_type_t type, + char *result, unsigned int maxlen) +{ + struct ioband_group *gp = ti->private, *p; + struct ioband_device *dp = gp->c_banddev; + struct rb_node *node; + int sz = 0; /* used in DMEMIT() */ + unsigned long flags; + + mutex_lock(&dp->g_lock_device); + + switch (type) { + case STATUSTYPE_INFO: + spin_lock_irqsave(&dp->g_lock, flags); + DMEMIT("%s", dp->g_name); + ioband_group_status(gp, &sz, result, maxlen); + for (node = rb_first(&gp->c_group_root); node; + node = rb_next(node)) { + p = rb_entry(node, struct ioband_group, c_group_node); + ioband_group_status(p, &sz, result, maxlen); + } + spin_unlock_irqrestore(&dp->g_lock, flags); + break; + + case STATUSTYPE_TABLE: + spin_lock_irqsave(&dp->g_lock, flags); + DMEMIT("%s %s %d %d %s %s", + gp->c_dev->name, dp->g_name, + dp->g_io_throttle, dp->g_io_limit[0], + gp->c_type->t_name, dp->g_policy->p_name); + dp->g_show(gp, &sz, result, maxlen); + spin_unlock_irqrestore(&dp->g_lock, flags); + break; + } + + mutex_unlock(&dp->g_lock_device); + return 0; +} + +static int ioband_group_type_select(struct ioband_group *gp, char *name) +{ + struct ioband_device *dp = gp->c_banddev; + struct group_type *t; + unsigned long flags; + + for (t = dm_ioband_group_type; (t->t_name); t++) { + if (!strcmp(name, t->t_name)) + break; + } + if (!t->t_name) { + DMWARN("ioband type select: %s isn't supported.", name); + return -EINVAL; + } + spin_lock_irqsave(&dp->g_lock, flags); + if (!RB_EMPTY_ROOT(&gp->c_group_root)) { + spin_unlock_irqrestore(&dp->g_lock, flags); + return -EBUSY; + } + gp->c_type = t; + spin_unlock_irqrestore(&dp->g_lock, flags); + + return 0; +} + +static int ioband_set_param(struct ioband_group *gp, char *cmd, char *value) +{ + struct ioband_device *dp = gp->c_banddev; + char *val_str; + long id; + unsigned long flags; + int r; + + r = split_string(value, &id, &val_str); + if (r) + return r; + + spin_lock_irqsave(&dp->g_lock, flags); + if (id != IOBAND_ID_ANY) { + gp = ioband_group_find(gp, id); + if (!gp) { + spin_unlock_irqrestore(&dp->g_lock, flags); + DMWARN("ioband_set_param: id=%ld not found.", id); + return -EINVAL; + } + } + r = dp->g_set_param(gp, cmd, val_str); + spin_unlock_irqrestore(&dp->g_lock, flags); + return r; +} + +static int ioband_group_attach(struct ioband_group *gp, int id, char *param) +{ + struct ioband_device *dp = gp->c_banddev; 
+ struct ioband_group *sub_gp; + int r; + + if (id < 0) { + DMWARN("ioband_group_attach: invalid id:%d", id); + return -EINVAL; + } + if (!gp->c_type->t_getid) { + DMWARN("ioband_group_attach: " + "no ioband group type is specified"); + return -EINVAL; + } + + sub_gp = kzalloc(sizeof(struct ioband_group), GFP_KERNEL); + if (!sub_gp) + return -ENOMEM; + + r = ioband_group_init(sub_gp, gp, dp, id, param); + if (r < 0) { + kfree(sub_gp); + return r; + } + return 0; +} + +static int ioband_group_detach(struct ioband_group *gp, int id) +{ + struct ioband_device *dp = gp->c_banddev; + struct ioband_group *sub_gp; + unsigned long flags; + + if (id < 0) { + DMWARN("ioband_group_detach: invalid id:%d", id); + return -EINVAL; + } + spin_lock_irqsave(&dp->g_lock, flags); + sub_gp = ioband_group_find(gp, id); + if (!sub_gp) { + spin_unlock_irqrestore(&dp->g_lock, flags); + DMWARN("ioband_group_detach: invalid id:%d", id); + return -EINVAL; + } + + /* + * Todo: Calling suspend_ioband_device() before releasing the + * ioband group has a large overhead. Need improvement. + */ + suspend_ioband_device(dp, flags, 0); + ioband_group_release(gp, sub_gp); + resume_ioband_device(dp); + spin_unlock_irqrestore(&dp->g_lock, flags); + return 0; +} + +/* + * Message parameters: + * "policy" + * ex) + * "policy" "weight" + * "type" "none"|"pid"|"pgrp"|"node"|"cpuset"|"cgroup"|"user"|"gid" + * "io_throttle" + * "io_limit" + * "attach" + * "detach" + * "any-command" : + * ex) + * "weight" 0: + * "token" 24: + */ +static int __ioband_message(struct dm_target *ti, + unsigned int argc, char **argv) +{ + struct ioband_group *gp = ti->private, *p; + struct ioband_device *dp = gp->c_banddev; + struct rb_node *node; + long val; + int r = 0; + unsigned long flags; + + if (argc == 1 && !strcmp(argv[0], "reset")) { + spin_lock_irqsave(&dp->g_lock, flags); + memset(gp->c_stat, 0, sizeof(gp->c_stat)); + for (node = rb_first(&gp->c_group_root); node; + node = rb_next(node)) { + p = rb_entry(node, struct ioband_group, c_group_node); + memset(p->c_stat, 0, sizeof(p->c_stat)); + } + spin_unlock_irqrestore(&dp->g_lock, flags); + return 0; + } + + if (argc != 2) { + DMWARN("Unrecognised band message received."); + return -EINVAL; + } + if (!strcmp(argv[0], "debug")) { + r = strict_strtol(argv[1], 0, &val); + if (r || val < 0) + return -EINVAL; + ioband_debug = val; + return 0; + } else if (!strcmp(argv[0], "io_throttle")) { + r = strict_strtol(argv[1], 0, &val); + spin_lock_irqsave(&dp->g_lock, flags); + if (r || val < 0 || + val > dp->g_io_limit[0] || val > dp->g_io_limit[1]) { + spin_unlock_irqrestore(&dp->g_lock, flags); + return -EINVAL; + } + dp->g_io_throttle = (val == 0) ? 
DEFAULT_IO_THROTTLE : val; + spin_unlock_irqrestore(&dp->g_lock, flags); + ioband_set_param(gp, argv[0], argv[1]); + return 0; + } else if (!strcmp(argv[0], "io_limit")) { + r = strict_strtol(argv[1], 0, &val); + if (r || val < 0) + return -EINVAL; + spin_lock_irqsave(&dp->g_lock, flags); + if (val == 0) { + struct request_queue *q; + + q = bdev_get_queue(gp->c_dev->bdev); + if (!q) { + spin_unlock_irqrestore(&dp->g_lock, flags); + return -ENXIO; + } + val = q->nr_requests; + } + if (val < dp->g_io_throttle) { + spin_unlock_irqrestore(&dp->g_lock, flags); + return -EINVAL; + } + dp->g_io_limit[0] = dp->g_io_limit[1] = val; + spin_unlock_irqrestore(&dp->g_lock, flags); + ioband_set_param(gp, argv[0], argv[1]); + return 0; + } else if (!strcmp(argv[0], "type")) { + return ioband_group_type_select(gp, argv[1]); + } else if (!strcmp(argv[0], "attach")) { + r = strict_strtol(argv[1], 0, &val); + if (r) + return r; + return ioband_group_attach(gp, val, NULL); + } else if (!strcmp(argv[0], "detach")) { + r = strict_strtol(argv[1], 0, &val); + if (r) + return r; + return ioband_group_detach(gp, val); + } else if (!strcmp(argv[0], "policy")) { + r = policy_init(dp, argv[1], 0, &argv[2]); + return r; + } else { + /* message anycommand : */ + r = ioband_set_param(gp, argv[0], argv[1]); + if (r < 0) + DMWARN("Unrecognised band message received."); + return r; + } + return 0; +} + +static int ioband_message(struct dm_target *ti, unsigned int argc, char **argv) +{ + struct ioband_group *gp = ti->private; + struct ioband_device *dp = gp->c_banddev; + int r; + + mutex_lock(&dp->g_lock_device); + r = __ioband_message(ti, argc, argv); + mutex_unlock(&dp->g_lock_device); + return r; +} + +static int ioband_merge(struct dm_target *ti, struct bvec_merge_data *bvm, + struct bio_vec *biovec, int max_size) +{ + struct ioband_group *gp = ti->private; + struct request_queue *q = bdev_get_queue(gp->c_dev->bdev); + + if (!q->merge_bvec_fn) + return max_size; + + bvm->bi_bdev = gp->c_dev->bdev; + bvm->bi_sector -= ti->begin; + + return min(max_size, q->merge_bvec_fn(q, bvm, biovec)); +} + +static struct target_type ioband_target = { + .name = "ioband", + .module = THIS_MODULE, + .version = {1, 7, 0}, + .ctr = ioband_ctr, + .dtr = ioband_dtr, + .map = ioband_map, + .end_io = ioband_end_io, + .presuspend = ioband_presuspend, + .resume = ioband_resume, + .status = ioband_status, + .message = ioband_message, + .merge = ioband_merge, +}; + +static int __init dm_ioband_init(void) +{ + int r; + + r = dm_register_target(&ioband_target); + if (r < 0) { + DMERR("register failed %d", r); + return r; + } + return r; +} + +static void __exit dm_ioband_exit(void) +{ + int r; + + r = dm_unregister_target(&ioband_target); + if (r < 0) + DMERR("unregister failed %d", r); +} + +module_init(dm_ioband_init); +module_exit(dm_ioband_exit); + +MODULE_DESCRIPTION(DM_NAME " I/O bandwidth control"); +MODULE_AUTHOR("Hirokazu Takahashi , " + "Ryo Tsuruta +#include +#include +#include "dm.h" +#include "dm-bio-list.h" +#include "dm-ioband.h" + +/* + * The following functions determine when and which BIOs should + * be submitted to control the I/O flow. + * It is possible to add a new BIO scheduling policy with it. + */ + + +/* + * Functions for weight balancing policy based on the number of I/Os. 
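+ * Each group gets a share of the device-wide token bucket in proportion
+ * to its weight (see set_weight()).  As an illustrative example with
+ * made-up numbers: given a 2048-token bucket and two groups weighted 80
+ * and 20, the groups start an epoch with 2048*80/100+1 = 1639 and
+ * 2048*20/100+1 = 410 tokens respectively; only the formula comes from
+ * set_weight().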
+ */ +#define DEFAULT_WEIGHT 100 +#define DEFAULT_TOKENPOOL 2048 +#define DEFAULT_BUCKET 2 +#define IOBAND_IOPRIO_BASE 100 +#define TOKEN_BATCH_UNIT 20 +#define PROCEED_THRESHOLD 8 +#define LOCAL_ACTIVE_RATIO 8 +#define GLOBAL_ACTIVE_RATIO 16 +#define OVERCOMMIT_RATE 4 + +/* + * Calculate the effective number of tokens this group has. + */ +static int get_token(struct ioband_group *gp) +{ + struct ioband_device *dp = gp->c_banddev; + int token = gp->c_token; + int allowance = dp->g_epoch - gp->c_my_epoch; + + if (allowance) { + if (allowance > dp->g_carryover) + allowance = dp->g_carryover; + token += gp->c_token_initial * allowance; + } + if (is_group_down(gp)) + token += gp->c_token_initial * dp->g_carryover * 2; + + return token; +} + +/* + * Calculate the priority of a given group. + */ +static int iopriority(struct ioband_group *gp) +{ + return get_token(gp) * IOBAND_IOPRIO_BASE / gp->c_token_initial + 1; +} + +/* + * This function is called when all the active group on the same ioband + * device has used up their tokens. It makes a new global epoch so that + * all groups on this device will get freshly assigned tokens. + */ +static int make_global_epoch(struct ioband_device *dp) +{ + struct ioband_group *gp = dp->g_dominant; + + /* + * Don't make a new epoch if the dominant group still has a lot of + * tokens, except when the I/O load is low. + */ + if (gp) { + int iopri = iopriority(gp); + if (iopri * PROCEED_THRESHOLD > IOBAND_IOPRIO_BASE && + dp->g_issued[0] + dp->g_issued[1] >= dp->g_io_throttle) + return 0; + } + + dp->g_epoch++; + dprintk(KERN_ERR "make_epoch %d --> %d\n", + dp->g_epoch-1, dp->g_epoch); + + /* The leftover tokens will be used in the next epoch. */ + dp->g_token_extra = dp->g_token_left; + if (dp->g_token_extra < 0) + dp->g_token_extra = 0; + dp->g_token_left = dp->g_token_bucket; + + dp->g_expired = NULL; + dp->g_dominant = NULL; + + return 1; +} + +/* + * This function is called when this group has used up its own tokens. + * It will check whether it's possible to make a new epoch of this group. + */ +static inline int make_epoch(struct ioband_group *gp) +{ + struct ioband_device *dp = gp->c_banddev; + int allowance = dp->g_epoch - gp->c_my_epoch; + + if (!allowance) + return 0; + if (allowance > dp->g_carryover) + allowance = dp->g_carryover; + gp->c_my_epoch = dp->g_epoch; + return allowance; +} + +/* + * Check whether this group has tokens to issue an I/O. Return 0 if it + * doesn't have any, otherwise return the priority of this group. + */ +static int is_token_left(struct ioband_group *gp) +{ + struct ioband_device *dp = gp->c_banddev; + int allowance; + int delta; + int extra; + + if (gp->c_token > 0) + return iopriority(gp); + + if (is_group_down(gp)) { + gp->c_token = gp->c_token_initial; + return iopriority(gp); + } + allowance = make_epoch(gp); + if (!allowance) + return 0; + /* + * If this group has the right to get tokens for several epochs, + * give all of them to the group here. + */ + delta = gp->c_token_initial * allowance; + dp->g_token_left -= delta; + /* + * Give some extra tokens to this group when there have left unused + * tokens on this ioband device from the previous epoch. 
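+ * The leftover is shared out in proportion to the group's initial share.
+ * Illustrative numbers only: with a 2048-token bucket, 200 leftover
+ * tokens and an initial share of 512 tokens, the group receives an extra
+ * 200 * 512 / (2048 - 200/2) = 52 tokens on top of its refill.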
+ */ + extra = dp->g_token_extra * gp->c_token_initial / + (dp->g_token_bucket - dp->g_token_extra/2); + delta += extra; + gp->c_token += delta; + gp->c_consumed = 0; + + if (gp == dp->g_current) + dp->g_yield_mark += delta; + dprintk(KERN_ERR "refill token: " + "gp:%p token:%d->%d extra(%d) allowance(%d)\n", + gp, gp->c_token - delta, gp->c_token, extra, allowance); + if (gp->c_token > 0) + return iopriority(gp); + dprintk(KERN_ERR "refill token: yet empty gp:%p token:%d\n", + gp, gp->c_token); + return 0; +} + +/* + * Use tokens to issue an I/O. After the operation, the number of tokens left + * on this group may become negative value, which will be treated as debt. + */ +static int consume_token(struct ioband_group *gp, int count, int flag) +{ + struct ioband_device *dp = gp->c_banddev; + + if (gp->c_consumed * LOCAL_ACTIVE_RATIO < gp->c_token_initial && + gp->c_consumed * GLOBAL_ACTIVE_RATIO < dp->g_token_bucket) { + ; /* Do nothing unless this group is really active. */ + } else if (!dp->g_dominant || + get_token(gp) > get_token(dp->g_dominant)) { + /* + * Regard this group as the dominant group on this + * ioband device when it has larger number of tokens + * than those of the previous one. + */ + dp->g_dominant = gp; + } + if (dp->g_epoch == gp->c_my_epoch && + gp->c_token > 0 && gp->c_token - count <= 0) { + /* Remember the last group which used up its own tokens. */ + dp->g_expired = gp; + if (dp->g_dominant == gp) + dp->g_dominant = NULL; + } + + if (gp != dp->g_current) { + /* This group is the current already. */ + dp->g_current = gp; + dp->g_yield_mark = + gp->c_token - (TOKEN_BATCH_UNIT << dp->g_token_unit); + } + gp->c_token -= count; + gp->c_consumed += count; + if (gp->c_token <= dp->g_yield_mark && !(flag & IOBAND_URGENT)) { + /* + * Return-value 1 means that this policy requests dm-ioband + * to give a chance to another group to be selected since + * this group has already issued enough amount of I/Os. + */ + dp->g_current = NULL; + return R_YIELD; + } + /* + * Return-value 0 means that this policy allows dm-ioband to select + * this group to issue I/Os without a break. + */ + return R_OK; +} + +/* + * Consume one token on each I/O. + */ +static int prepare_token(struct ioband_group *gp, struct bio *bio, int flag) +{ + return consume_token(gp, 1, flag); +} + +/* + * Check if this group is able to receive a new bio. 
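+ * A group counts as full once the number of BIOs it has blocked reaches
+ * c_limit, which set_weight() derives from the device-wide io_limit
+ * values and the group's share of the total weight.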
+ */ +static int is_queue_full(struct ioband_group *gp) +{ + return gp->c_blocked >= gp->c_limit; +} + +static void set_weight(struct ioband_group *gp, int new) +{ + struct ioband_device *dp = gp->c_banddev; + struct ioband_group *p; + + dp->g_weight_total += (new - gp->c_weight); + gp->c_weight = new; + + if (dp->g_weight_total == 0) { + list_for_each_entry(p, &dp->g_groups, c_list) + p->c_token = p->c_token_initial = p->c_limit = 1; + } else { + list_for_each_entry(p, &dp->g_groups, c_list) { + p->c_token = p->c_token_initial = + dp->g_token_bucket * p->c_weight / + dp->g_weight_total + 1; + p->c_limit = (dp->g_io_limit[0] + dp->g_io_limit[1]) * + p->c_weight / dp->g_weight_total / + OVERCOMMIT_RATE + 1; + } + } +} + +static void init_token_bucket(struct ioband_device *dp, int val) +{ + dp->g_token_bucket = ((dp->g_io_limit[0] + dp->g_io_limit[1]) * + DEFAULT_BUCKET) << dp->g_token_unit; + if (!val) + val = DEFAULT_TOKENPOOL << dp->g_token_unit; + else if (val < dp->g_token_bucket) + val = dp->g_token_bucket; + dp->g_carryover = val/dp->g_token_bucket; + dp->g_token_left = 0; +} + +static int policy_weight_param(struct ioband_group *gp, char *cmd, char *value) +{ + struct ioband_device *dp = gp->c_banddev; + long val; + int r = 0, err; + + err = strict_strtol(value, 0, &val); + if (!strcmp(cmd, "weight")) { + if (!err && 0 < val && val <= SHORT_MAX) + set_weight(gp, val); + else + r = -EINVAL; + } else if (!strcmp(cmd, "token")) { + if (!err && val > 0) { + init_token_bucket(dp, val); + set_weight(gp, gp->c_weight); + dp->g_token_extra = 0; + } else + r = -EINVAL; + } else if (!strcmp(cmd, "io_limit")) { + init_token_bucket(dp, dp->g_token_bucket * dp->g_carryover); + set_weight(gp, gp->c_weight); + } else { + r = -EINVAL; + } + return r; +} + +static int policy_weight_ctr(struct ioband_group *gp, char *arg) +{ + struct ioband_device *dp = gp->c_banddev; + + if (!arg) + arg = __stringify(DEFAULT_WEIGHT); + gp->c_my_epoch = dp->g_epoch; + gp->c_weight = 0; + gp->c_consumed = 0; + return policy_weight_param(gp, "weight", arg); +} + +static void policy_weight_dtr(struct ioband_group *gp) +{ + struct ioband_device *dp = gp->c_banddev; + set_weight(gp, 0); + dp->g_dominant = NULL; + dp->g_expired = NULL; +} + +static void policy_weight_show(struct ioband_group *gp, int *szp, + char *result, unsigned int maxlen) +{ + struct ioband_group *p; + struct ioband_device *dp = gp->c_banddev; + struct rb_node *node; + int sz = *szp; /* used in DMEMIT() */ + + DMEMIT(" %d :%d", dp->g_token_bucket * dp->g_carryover, gp->c_weight); + + for (node = rb_first(&gp->c_group_root); node; node = rb_next(node)) { + p = rb_entry(node, struct ioband_group, c_group_node); + DMEMIT(" %d:%d", p->c_id, p->c_weight); + } + *szp = sz; +} + +/* + * + * g_can_submit : To determine whether a given group has the right to + * submit BIOs. The larger the return value the higher the + * priority to submit. Zero means it has no right. + * g_prepare_bio : Called right before submitting each BIO. + * g_restart_bios : Called if this ioband device has some BIOs blocked but none + * of them can be submitted now. This method has to + * reinitialize the data to restart to submit BIOs and return + * 0 or 1. + * The return value 0 means that it has become able to submit + * them now so that this ioband device will continue its work. + * The return value 1 means that it is still unable to submit + * them so that this device will stop its work. 
And this + * policy module has to reactivate the device when it gets + * to be able to submit BIOs. + * g_hold_bio : To hold a given BIO until it is submitted. + * The default function is used when this method is undefined. + * g_pop_bio : To select and get the best BIO to submit. + * g_group_ctr : To initalize the policy own members of struct ioband_group. + * g_group_dtr : Called when struct ioband_group is removed. + * g_set_param : To update the policy own date. + * The parameters can be passed through "dmsetup message" + * command. + * g_should_block : Called every time this ioband device receive a BIO. + * Return 1 if a given group can't receive any more BIOs, + * otherwise return 0. + * g_show : Show the configuration. + */ +static int policy_weight_init(struct ioband_device *dp, int argc, char **argv) +{ + long val; + int r = 0; + + if (argc < 1) + val = 0; + else { + r = strict_strtol(argv[0], 0, &val); + if (r || val < 0) + return -EINVAL; + } + + dp->g_can_submit = is_token_left; + dp->g_prepare_bio = prepare_token; + dp->g_restart_bios = make_global_epoch; + dp->g_group_ctr = policy_weight_ctr; + dp->g_group_dtr = policy_weight_dtr; + dp->g_set_param = policy_weight_param; + dp->g_should_block = is_queue_full; + dp->g_show = policy_weight_show; + + dp->g_epoch = 0; + dp->g_weight_total = 0; + dp->g_current = NULL; + dp->g_dominant = NULL; + dp->g_expired = NULL; + dp->g_token_extra = 0; + dp->g_token_unit = 0; + init_token_bucket(dp, val); + dp->g_token_left = dp->g_token_bucket; + + return 0; +} +/* weight balancing policy based on the number of I/Os. --- End --- */ + + +/* + * Functions for weight balancing policy based on I/O size. + * It just borrows a lot of functions from the regular weight balancing policy. + */ +static int w2_prepare_token(struct ioband_group *gp, struct bio *bio, int flag) +{ + /* Consume tokens depending on the size of a given bio. */ + return consume_token(gp, bio_sectors(bio), flag); +} + +static int w2_policy_weight_init(struct ioband_device *dp, + int argc, char **argv) +{ + long val; + int r = 0; + + if (argc < 1) + val = 0; + else { + r = strict_strtol(argv[0], 0, &val); + if (r || val < 0) + return -EINVAL; + } + + r = policy_weight_init(dp, argc, argv); + if (r < 0) + return r; + + dp->g_prepare_bio = w2_prepare_token; + dp->g_token_unit = PAGE_SHIFT - 9; + init_token_bucket(dp, val); + dp->g_token_left = dp->g_token_bucket; + return 0; +} +/* weight balancing policy based on I/O size. --- End --- */ + + +static int policy_default_init(struct ioband_device *dp, + int argc, char **argv) +{ + return policy_weight_init(dp, argc, argv); +} + +struct policy_type dm_ioband_policy_type[] = { + {"default", policy_default_init}, + {"weight", policy_weight_init}, + {"weight-iosize", w2_policy_weight_init}, + {NULL, policy_default_init} +}; diff -uprN linux-2.6.27-rc5-mm1.orig/drivers/md/dm-ioband-type.c linux-2.6.27-rc5-mm1/drivers/md/dm-ioband-type.c --- linux-2.6.27-rc5-mm1.orig/drivers/md/dm-ioband-type.c 1970-01-01 09:00:00.000000000 +0900 +++ linux-2.6.27-rc5-mm1/drivers/md/dm-ioband-type.c 2008-10-02 19:56:48.000000000 +0900 @@ -0,0 +1,76 @@ +/* + * Copyright (C) 2008 VA Linux Systems Japan K.K. + * + * I/O bandwidth control + * + * This file is released under the GPL. + */ +#include +#include "dm.h" +#include "dm-bio-list.h" +#include "dm-ioband.h" + +/* + * Any I/O bandwidth can be divided into several bandwidth groups, each of which + * has its own unique ID. 
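+ * (For instance, the "pid" type keys groups on the submitting task's
+ * thread group ID and the "user" type on its UID, as implemented below.)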
The following functions are called to determine + * which group a given BIO belongs to and return the ID of the group. + */ + +/* ToDo: unsigned long value would be better for group ID */ + +static int ioband_process_id(struct bio *bio) +{ + /* + * This function will work for KVM and Xen. + */ + return (int)current->tgid; +} + +static int ioband_process_group(struct bio *bio) +{ + return (int)task_pgrp_nr(current); +} + +static int ioband_uid(struct bio *bio) +{ + return (int)current_uid(); +} + +static int ioband_gid(struct bio *bio) +{ + return (int)current_gid(); +} + +static int ioband_cpuset(struct bio *bio) +{ + return 0; /* not implemented yet */ +} + +static int ioband_node(struct bio *bio) +{ + return 0; /* not implemented yet */ +} + +static int ioband_cgroup(struct bio *bio) +{ + /* + * This function should return the ID of the cgroup which issued "bio". + * The ID of the cgroup which the current process belongs to won't be + * suitable ID for this purpose, since some BIOs will be handled by kernel + * threads like aio or pdflush on behalf of the process requesting the BIOs. + */ + return 0; /* not implemented yet */ +} + +struct group_type dm_ioband_group_type[] = { + {"none", NULL}, + {"pgrp", ioband_process_group}, + {"pid", ioband_process_id}, + {"node", ioband_node}, + {"cpuset", ioband_cpuset}, + {"cgroup", ioband_cgroup}, + {"user", ioband_uid}, + {"uid", ioband_uid}, + {"gid", ioband_gid}, + {NULL, NULL} +}; diff -uprN linux-2.6.27-rc5-mm1.orig/drivers/md/dm-ioband.h linux-2.6.27-rc5-mm1/drivers/md/dm-ioband.h --- linux-2.6.27-rc5-mm1.orig/drivers/md/dm-ioband.h 1970-01-01 09:00:00.000000000 +0900 +++ linux-2.6.27-rc5-mm1/drivers/md/dm-ioband.h 2008-10-02 19:56:48.000000000 +0900 @@ -0,0 +1,190 @@ +/* + * Copyright (C) 2008 VA Linux Systems Japan K.K. + * + * I/O bandwidth control + * + * This file is released under the GPL. 
+ */ + +#include +#include + +#define DEFAULT_IO_THROTTLE 4 +#define DEFAULT_IO_LIMIT 128 +#define IOBAND_NAME_MAX 31 +#define IOBAND_ID_ANY (-1) + +struct ioband_group; + +struct ioband_device { + struct list_head g_groups; + struct delayed_work g_conductor; + struct workqueue_struct *g_ioband_wq; + struct bio_list g_urgent_bios; + int g_io_throttle; + int g_io_limit[2]; + int g_issued[2]; + int g_blocked; + spinlock_t g_lock; + struct mutex g_lock_device; + wait_queue_head_t g_waitq; + wait_queue_head_t g_waitq_suspend; + wait_queue_head_t g_waitq_flush; + + int g_ref; + struct list_head g_list; + int g_flags; + char g_name[IOBAND_NAME_MAX + 1]; + struct policy_type *g_policy; + + /* policy dependent */ + int (*g_can_submit)(struct ioband_group *); + int (*g_prepare_bio)(struct ioband_group *, struct bio *, int); + int (*g_restart_bios)(struct ioband_device *); + void (*g_hold_bio)(struct ioband_group *, struct bio *); + struct bio * (*g_pop_bio)(struct ioband_group *); + int (*g_group_ctr)(struct ioband_group *, char *); + void (*g_group_dtr)(struct ioband_group *); + int (*g_set_param)(struct ioband_group *, char *cmd, char *value); + int (*g_should_block)(struct ioband_group *); + void (*g_show)(struct ioband_group *, int *, char *, unsigned int); + + /* members for weight balancing policy */ + int g_epoch; + int g_weight_total; + /* the number of tokens which can be used in every epoch */ + int g_token_bucket; + /* how many epochs tokens can be carried over */ + int g_carryover; + /* how many tokens should be used for one page-sized I/O */ + int g_token_unit; + /* the last group which used a token */ + struct ioband_group *g_current; + /* give another group a chance to be scheduled when the rest + of tokens of the current group reaches this mark */ + int g_yield_mark; + /* the latest group which used up its tokens */ + struct ioband_group *g_expired; + /* the group which has the largest number of tokens in the + active groups */ + struct ioband_group *g_dominant; + /* the number of unused tokens in this epoch */ + int g_token_left; + /* left-over tokens from the previous epoch */ + int g_token_extra; +}; + +struct ioband_group_stat { + unsigned long sectors; + unsigned long immediate; + unsigned long deferred; +}; + +struct ioband_group { + struct list_head c_list; + struct ioband_device *c_banddev; + struct dm_dev *c_dev; + struct dm_target *c_target; + struct bio_list c_blocked_bios; + struct bio_list c_prio_bios; + struct rb_root c_group_root; + struct rb_node c_group_node; + int c_id; /* should be unsigned long or unsigned long long */ + char c_name[IOBAND_NAME_MAX + 1]; /* rfu */ + int c_blocked; + int c_prio_blocked; + wait_queue_head_t c_waitq; + int c_flags; + struct ioband_group_stat c_stat[2]; /* hold rd/wr status */ + struct group_type *c_type; + + /* members for weight balancing policy */ + int c_weight; + int c_my_epoch; + int c_token; + int c_token_initial; + int c_limit; + int c_consumed; + + /* rfu */ + /* struct bio_list c_ordered_tag_bios; */ +}; + +#define IOBAND_URGENT 1 + +#define DEV_BIO_BLOCKED 1 +#define DEV_SUSPENDED 2 + +#define set_device_blocked(dp) ((dp)->g_flags |= DEV_BIO_BLOCKED) +#define clear_device_blocked(dp) ((dp)->g_flags &= ~DEV_BIO_BLOCKED) +#define is_device_blocked(dp) ((dp)->g_flags & DEV_BIO_BLOCKED) + +#define set_device_suspended(dp) ((dp)->g_flags |= DEV_SUSPENDED) +#define clear_device_suspended(dp) ((dp)->g_flags &= ~DEV_SUSPENDED) +#define is_device_suspended(dp) ((dp)->g_flags & DEV_SUSPENDED) + +#define IOG_PRIO_BIO_WRITE 1 +#define 
IOG_PRIO_QUEUE 2 +#define IOG_BIO_BLOCKED 4 +#define IOG_GOING_DOWN 8 +#define IOG_SUSPENDED 16 +#define IOG_NEED_UP 32 + +#define R_OK 0 +#define R_BLOCK 1 +#define R_YIELD 2 + +#define set_group_blocked(gp) ((gp)->c_flags |= IOG_BIO_BLOCKED) +#define clear_group_blocked(gp) ((gp)->c_flags &= ~IOG_BIO_BLOCKED) +#define is_group_blocked(gp) ((gp)->c_flags & IOG_BIO_BLOCKED) + +#define set_group_down(gp) ((gp)->c_flags |= IOG_GOING_DOWN) +#define clear_group_down(gp) ((gp)->c_flags &= ~IOG_GOING_DOWN) +#define is_group_down(gp) ((gp)->c_flags & IOG_GOING_DOWN) + +#define set_group_suspended(gp) ((gp)->c_flags |= IOG_SUSPENDED) +#define clear_group_suspended(gp) ((gp)->c_flags &= ~IOG_SUSPENDED) +#define is_group_suspended(gp) ((gp)->c_flags & IOG_SUSPENDED) + +#define set_group_need_up(gp) ((gp)->c_flags |= IOG_NEED_UP) +#define clear_group_need_up(gp) ((gp)->c_flags &= ~IOG_NEED_UP) +#define group_need_up(gp) ((gp)->c_flags & IOG_NEED_UP) + +#define set_prio_read(gp) ((gp)->c_flags |= IOG_PRIO_QUEUE) +#define clear_prio_read(gp) ((gp)->c_flags &= ~IOG_PRIO_QUEUE) +#define is_prio_read(gp) \ + ((gp)->c_flags & (IOG_PRIO_QUEUE|IOG_PRIO_BIO_WRITE) == IOG_PRIO_QUEUE) + +#define set_prio_write(gp) \ + ((gp)->c_flags |= (IOG_PRIO_QUEUE|IOG_PRIO_BIO_WRITE)) +#define clear_prio_write(gp) \ + ((gp)->c_flags &= ~(IOG_PRIO_QUEUE|IOG_PRIO_BIO_WRITE)) +#define is_prio_write(gp) \ + ((gp)->c_flags & (IOG_PRIO_QUEUE|IOG_PRIO_BIO_WRITE) == \ + (IOG_PRIO_QUEUE|IOG_PRIO_BIO_WRITE)) + +#define set_prio_queue(gp, direct) \ + ((gp)->c_flags |= (IOG_PRIO_QUEUE|direct)) +#define clear_prio_queue(gp) clear_prio_write(gp) +#define is_prio_queue(gp) ((gp)->c_flags & IOG_PRIO_QUEUE) +#define prio_queue_direct(gp) ((gp)->c_flags & IOG_PRIO_BIO_WRITE) + + +struct policy_type { + const char *p_name; + int (*p_policy_init)(struct ioband_device *, int, char **); +}; + +extern struct policy_type dm_ioband_policy_type[]; + +struct group_type { + const char *t_name; + int (*t_getid)(struct bio *); +}; + +extern struct group_type dm_ioband_group_type[]; + +/* Just for debugging */ +extern long ioband_debug; +#define dprintk(format, a...) \ + if (ioband_debug > 0) ioband_debug--, printk(format, ##a) From ryov at valinux.co.jp Thu Oct 2 22:39:31 2008 From: ryov at valinux.co.jp (Ryo Tsuruta) Date: Fri, 03 Oct 2008 14:39:31 +0900 (JST) Subject: [PATCH 2/2] dm-ioband: I/O bandwidth controller v1.7.0: Document In-Reply-To: <20081003.143824.226785638.ryov@valinux.co.jp> References: <20081003.143749.193701570.ryov@valinux.co.jp> <20081003.143824.226785638.ryov@valinux.co.jp> Message-ID: <20081003.143931.71107132.ryov@valinux.co.jp> This patch is the documentation of dm-ioband, design overview, installation, command, reference and examples. Based on 2.6.27-rc5-mm1 Signed-off-by: Ryo Tsuruta Signed-off-by: Hirokazu Takahashi diff -uprN linux-2.6.27-rc5-mm1.orig/Documentation/device-mapper/ioband.txt linux-2.6.27-rc5-mm1/Documentation/device-mapper/ioband.txt --- linux-2.6.27-rc5-mm1.orig/Documentation/device-mapper/ioband.txt 1970-01-01 09:00:00.000000000 +0900 +++ linux-2.6.27-rc5-mm1/Documentation/device-mapper/ioband.txt 2008-10-02 19:56:48.000000000 +0900 @@ -0,0 +1,938 @@ + Block I/O bandwidth control: dm-ioband + + ------------------------------------------------------- + + Table of Contents + + [1]What's dm-ioband all about? + + [2]Differences from the CFQ I/O scheduler + + [3]How dm-ioband works. + + [4]Setup and Installation + + [5]Getting started + + [6]Command Reference + + [7]Examples + +What's dm-ioband all about? 
+ + dm-ioband is an I/O bandwidth controller implemented as a device-mapper + driver. Several jobs using the same physical device have to share the + bandwidth of the device. dm-ioband gives bandwidth to each job according + to its weight, which each job can set its own value to. + + A job is a group of processes with the same pid or pgrp or uid or a + virtual machine such as KVM or Xen. A job can also be a cgroup by applying + the bio-cgroup patch, which can be found at + [8]http://people.valinux.co.jp/~ryov/bio-cgroup/. + + +------+ +------+ +------+ +------+ +------+ +------+ + |cgroup| |cgroup| | the | | pid | | pid | | the | jobs + | A | | B | |others| | X | | Y | |others| + +--|---+ +--|---+ +--|---+ +--|---+ +--|---+ +--|---+ + +--V----+---V---+----V---+ +--V----+---V---+----V---+ + | group | group | default| | group | group | default| ioband groups + | | | group | | | | group | + +-------+-------+--------+ +-------+-------+--------+ + | ioband1 | | ioband2 | ioband devices + +-----------|------------+ +-----------|------------+ + +-----------V--------------+-------------V------------+ + | | | + | sdb1 | sdb2 | physical devices + +--------------------------+--------------------------+ + + + -------------------------------------------------------------------------- + +Differences from the CFQ I/O scheduler + + Dm-ioband is flexible to configure the bandwidth settings. + + Dm-ioband can work with any type of I/O scheduler such as the NOOP + scheduler, which is often chosen for high-end storages, since it is + implemented outside the I/O scheduling layer. It allows both of partition + based bandwidth control and job --- a group of processes --- based + control. In addition, it can set different configuration on each physical + device to control its bandwidth. + + Meanwhile the current implementation of the CFQ scheduler has 8 IO + priority levels and all jobs whose processes have the same IO priority + share the bandwidth assigned to this level between them. And IO priority + is an attribute of a process so that it equally effects to all block + devices. + + -------------------------------------------------------------------------- + +How dm-ioband works. + + Every ioband device has one ioband group, which by default is called the + default group. + + Ioband devices can also have extra ioband groups in them. Each ioband + group has a job to support and a weight. Proportional to the weight, + dm-ioband gives tokens to the group. + + A group passes on I/O requests that its job issues to the underlying + layer so long as it has tokens left, while requests are blocked if there + aren't any tokens left in the group. Tokens are refilled once all of + groups that have requests on a given physical device use up their tokens. + + There are two policies for token consumption. One is that a token is + consumed for each I/O request. The other is that a token is consumed for + each I/O sector, for example, one I/O request which consists of + 4Kbytes(512bytes * 8 sectors) read consumes 8 tokens. A user can choose + either policy. + + With this approach, a job running on an ioband group with large weight + is guaranteed a wide I/O bandwidth. + + -------------------------------------------------------------------------- + +Setup and Installation + + Build a kernel with these options enabled: + + CONFIG_MD + CONFIG_BLK_DEV_DM + CONFIG_DM_IOBAND + + + If compiled as module, use modprobe to load dm-ioband. 
+ + # make modules + # make modules_install + # depmod -a + # modprobe dm-ioband + + + "dmsetup targets" command shows all available device-mapper targets. + "ioband" and the version number are displayed when dm-ioband has been + loaded. + + # dmsetup targets | grep ioband + ioband v1.7.0 + + + -------------------------------------------------------------------------- + +Getting started + + The following is a brief description how to control the I/O bandwidth of + disks. In this description, we'll take one disk with two partitions as an + example target. + + -------------------------------------------------------------------------- + + Create and map ioband devices + + Create two ioband devices "ioband1" and "ioband2". "ioband1" is mapped + to "/dev/sda1" and has a weight of 40. "ioband2" is mapped to "/dev/sda2" + and has a weight of 10. "ioband1" can use 80% --- 40/(40+10)*100 --- of + the bandwidth of the physical disk "/dev/sda" while "ioband2" can use 20%. + + # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1 1 0 0 none" \ + "weight 0 :40" | dmsetup create ioband1 + # echo "0 $(blockdev --getsize /dev/sda2) ioband /dev/sda2 1 0 0 none" \ + "weight 0 :10" | dmsetup create ioband2 + + + If the commands are successful then the device files + "/dev/mapper/ioband1" and "/dev/mapper/ioband2" will have been created. + + -------------------------------------------------------------------------- + + Additional bandwidth control + + In this example two extra ioband groups are created on "ioband1". The + first group consists of all the processes with user-id 1000 and the second + group consists of all the processes with user-id 2000. Their weights are + 30 and 20 respectively. + + # dmsetup message ioband1 0 type user + # dmsetup message ioband1 0 attach 1000 + # dmsetup message ioband1 0 attach 2000 + # dmsetup message ioband1 0 weight 1000:30 + # dmsetup message ioband1 0 weight 2000:20 + + + Now the processes in the user-id 1000 group can use 30% --- + 30/(30+20+40+10)*100 --- of the bandwidth of the physical disk. + + Table 1. Weight assignments + + +----------------------------------------------------------------+ + | ioband device | ioband group | ioband weight | + |---------------+--------------------------------+---------------| + | ioband1 | user id 1000 | 30 | + |---------------+--------------------------------+---------------| + | ioband1 | user id 2000 | 20 | + |---------------+--------------------------------+---------------| + | ioband1 | default group(the other users) | 40 | + |---------------+--------------------------------+---------------| + | ioband2 | default group | 10 | + +----------------------------------------------------------------+ + + -------------------------------------------------------------------------- + + Remove the ioband devices + + Remove the ioband devices when no longer used. + + # dmsetup remove ioband1 + # dmsetup remove ioband2 + + + -------------------------------------------------------------------------- + +Command Reference + + Create an ioband device + + SYNOPSIS + + dmsetup create IOBAND_DEVICE + + DESCRIPTION + + Create an ioband device with the given name IOBAND_DEVICE. + Generally, dmsetup reads a table from standard input. Each line of + the table specifies a single target and is of the form: + + start_sector num_sectors "ioband" device_file ioband_device_id \ + io_throttle io_limit ioband_group_type policy token_base \ + :weight [ioband_group_id:weight...] 
+ + + start_sector, num_sectors + + The sector range of the underlying device where + dm-ioband maps. + + ioband + + Specify the string "ioband" as a target type. + + device_file + + Underlying device name. + + ioband_device_id + + The ID number for an ioband device. The same ID + must be set among the ioband devices that share the + same bandwidth, which means they work on the same + physical disk. + + io_throttle + + Dm-ioband starts to control the bandwidth when the + number of BIOs in progress exceeds this value. If 0 + is specified, dm-ioband uses the default value. + + io_limit + + Dm-ioband blocks all I/O requests for the + IOBAND_DEVICE when the number of BIOs in progress + exceeds this value. If 0 is specified, dm-ioband uses + the default value. + + ioband_group_type + + Specify how to evaluate the ioband group ID. The + type must be one of "none", "user", "gid", "pid" or + "pgrp." The type "cgroup" is enabled by applying the + bio-cgroup patch. Specify "none" if you don't need + any ioband groups other than the default ioband + group. + + policy + + Specify bandwidth control policy. A user can choose + either policy "weight" or "weight-iosize." + + weight + + This policy controls bandwidth + according to the proportional to the + weight of each ioband group based on the + number of I/O requests. + + weight-iosize + + This policy controls bandwidth + according to the proportional to the + weight of each ioband group based on the + number of I/O sectors. + + token_base + + The number of tokens which specified by token_base + will be distributed to all ioband groups according to + the proportional to the weight of each ioband group. + If 0 is specified, dm-ioband uses the default value. + + ioband_group_id:weight + + Set the weight of the ioband group specified by + ioband_group_id. If ioband_group_id is omitted, the + weight is assigned to the default ioband group. + + EXAMPLE + + Create an ioband device with the following parameters: + + * Starting sector = "0" + + * The number of sectors = "$(blockdev --getsize /dev/sda1)" + + * Target type = "ioband" + + * Underlying device name = "/dev/sda1" + + * Ioband device ID = "128" + + * I/O throttle = "10" + + * I/O limit = "400" + + * Ioband group type = "user" + + * Bandwidth control policy = "weight" + + * Token base = "2048" + + * Weight for the default ioband group = "100" + + * Weight for the ioband group 1000 = "80" + + * Weight for the ioband group 2000 = "20" + + * Ioband device name = "ioband1" + + # echo "0 $(blockdev --getsize /dev/sda1) ioband" \ + "/dev/sda1 128 10 400 user weight 2048 :100 1000:80 2000:20" \ + | dmsetup create ioband1 + + + Create two device groups (ID=1,2). The bandwidths of these + device groups will be individually controlled. + + # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1 1" \ + "0 0 none weight 0 :80" | dmsetup create ioband1 + # echo "0 $(blockdev --getsize /dev/sda2) ioband /dev/sda2 1" \ + "0 0 none weight 0 :20" | dmsetup create ioband2 + # echo "0 $(blockdev --getsize /dev/sdb3) ioband /dev/sdb3 2" \ + "0 0 none weight 0 :60" | dmsetup create ioband3 + # echo "0 $(blockdev --getsize /dev/sdb4) ioband /dev/sdb4 2" \ + "0 0 none weight 0 :40" | dmsetup create ioband4 + + + -------------------------------------------------------------------------- + + Remove the ioband device + + SYNOPSIS + + dmsetup remove IOBAND_DEVICE + + DESCRIPTION + + Remove the specified ioband device IOBAND_DEVICE. All the band + groups attached to the ioband device are also removed + automatically. 
+ + EXAMPLE + + Remove ioband device "ioband1." + + # dmsetup remove ioband1 + + + -------------------------------------------------------------------------- + + Set an ioband group type + + SYNOPSIS + + dmsetup message IOBAND_DEVICE 0 type TYPE + + DESCRIPTION + + Set the ioband group type of the specified ioband device + IOBAND_DEVICE. TYPE must be one of "none", "user", "gid", "pid" or + "pgrp." The type "cgroup" is enabled by applying the bio-cgroup + patch. Once the type is set, new ioband groups can be created on + IOBAND_DEVICE. + + EXAMPLE + + Set the ioband group type of ioband device "ioband1" to "user." + + # dmsetup message ioband1 0 type user + + + -------------------------------------------------------------------------- + + Create an ioband group + + SYNOPSIS + + dmsetup message IOBAND_DEVICE 0 attach ID + + DESCRIPTION + + Create an ioband group and attach it to IOBAND_DEVICE. ID + specifies user-id, group-id, process-id or process-group-id + depending the ioband group type of IOBAND_DEVICE. + + EXAMPLE + + Create an ioband group which consists of all processes with + user-id 1000 and attach it to ioband device "ioband1." + + # dmsetup message ioband1 0 type user + # dmsetup message ioband1 0 attach 1000 + + + -------------------------------------------------------------------------- + + Detach the ioband group + + SYNOPSIS + + dmsetup message IOBAND_DEVICE 0 detach ID + + DESCRIPTION + + Detach the ioband group specified by ID from ioband device + IOBAND_DEVICE. + + EXAMPLE + + Detach the ioband group with ID "2000" from ioband device + "ioband2." + + # dmsetup message ioband2 0 detach 1000 + + + -------------------------------------------------------------------------- + + Set bandwidth control policy + + SYNOPSIS + + dmsetup message IOBAND_DEVICE 0 policy policy + + DESCRIPTION + + Set bandwidth control policy. This command applies to all ioband + devices which have the same ioband device ID as IOBAND_DEVICE. A + user can choose either policy "weight" or "weight-iosize." + + weight + + This policy controls bandwidth according to the + proportional to the weight of each ioband group based + on the number of I/O requests. + + weight-iosize + + This policy controls bandwidth according to the + proportional to the weight of each ioband group based + on the number of I/O sectors. + + EXAMPLE + + Set bandwidth control policy of ioband devices which have the + same ioband device ID as "ioband1" to "weight-iosize." + + # dmsetup message ioband1 0 policy weight-iosize + + + -------------------------------------------------------------------------- + + Set the weight of an ioband group + + SYNOPSIS + + dmsetup message IOBAND_DEVICE 0 weight VAL + + dmsetup message IOBAND_DEVICE 0 weight ID:VAL + + DESCRIPTION + + Set the weight of the ioband group specified by ID. Set the + weight of the default ioband group of IOBAND_DEVICE if ID isn't + specified. + + The following example means that "ioband1" can use 80% --- + 40/(40+10)*100 --- of the bandwidth of the physical disk while + "ioband2" can use 20%. + + # dmsetup message ioband1 0 weight 40 + # dmsetup message ioband2 0 weight 10 + + + The following lines have the same effect as the above: + + # dmsetup message ioband1 0 weight 4 + # dmsetup message ioband2 0 weight 1 + + + VAL must be an integer larger than 0. The default value, which + is assigned to newly created ioband groups, is 100. + + EXAMPLE + + Set the weight of the default ioband group of "ioband1" to 40. 
+ + # dmsetup message ioband1 0 weight 40 + + + Set the weight of the ioband group of "ioband1" with ID "1000" + to 10. + + # dmsetup message ioband1 0 weight 1000:10 + + + -------------------------------------------------------------------------- + + Set the number of tokens + + SYNOPSIS + + dmsetup message IOBAND_DEVICE 0 token VAL + + DESCRIPTION + + Set the number of tokens to VAL. According to their weight, this + number of tokens will be distributed to all the ioband groups on + the physical device to which ioband device IOBAND_DEVICE belongs + when they use up their tokens. + + VAL must be an integer greater than 0. The default is 2048. + + EXAMPLE + + Set the number of tokens of the physical device to which + "ioband1" belongs to 256. + + # dmsetup message ioband1 0 token 256 + + + -------------------------------------------------------------------------- + + Set I/O throttling + + SYNOPSIS + + dmsetup message IOBAND_DEVICE 0 io_throttle VAL + + DESCRIPTION + + Set the I/O throttling value of the physical disk to which + ioband device IOBAND_DEVICE belongs to VAL. Dm-ioband start to + control the bandwidth when the number of BIOs in progress on the + physical disk exceeds this value. + + EXAMPLE + + Set the I/O throttling value of "ioband1" to 16. + + # dmsetup message ioband1 0 io_throttle 16 + + + -------------------------------------------------------------------------- + + Set I/O limiting + + SYNOPSIS + + dmsetup message IOBAND_DEVICE 0 io_limit VAL + + DESCRIPTION + + Set the I/O limiting value of the physical disk to which ioband + device IOBAND_DEVICE belongs to VAL. Dm-ioband will block all I/O + requests for the physical device if the number of BIOs in progress + on the physical disk exceeds this value. + + EXAMPLE + + Set the I/O limiting value of "ioband1" to 128. + + # dmsetup message ioband1 0 io_limit 128 + + + -------------------------------------------------------------------------- + + Display settings + + SYNOPSIS + + dmsetup table --target ioband + + DESCRIPTION + + Display the current table for the ioband device in a format. See + "dmsetup create" command for information on the table format. + + EXAMPLE + + The following output shows the current table of "ioband1." + + # dmsetup table --target ioband + ioband: 0 32129937 ioband1 8:29 128 10 400 user weight \ + 2048 :100 1000:80 2000:20 + + + -------------------------------------------------------------------------- + + Display Statistics + + SYNOPSIS + + dmsetup status --target ioband + + DESCRIPTION + + Display the statistics of all the ioband devices whose target + type is "ioband." + + The output format is as below. the first five columns shows: + + * ioband device name + + * logical start sector of the device (must be 0) + + * device size in sectors + + * target type (must be "ioband") + + * device group ID + + The remaining columns show the statistics of each ioband group + on the band device. Each group uses seven columns for its + statistics. + + * ioband group ID (-1 means default) + + * total read requests + + * delayed read requests + + * total read sectors + + * total write requests + + * delayed write requests + + * total write sectors + + EXAMPLE + + The following output shows the statistics of two ioband devices. + Ioband2 only has the default ioband group and ioband1 has three + (default, 1001, 1002) ioband groups. 
+ + # dmsetup status + ioband2: 0 44371467 ioband 128 -1 143 90 424 122 78 352 + ioband1: 0 44371467 ioband 128 -1 223 172 408 211 136 600 1001 \ + 166 107 472 139 95 352 1002 211 146 520 210 147 504 + + + -------------------------------------------------------------------------- + + Reset status counter + + SYNOPSIS + + dmsetup message IOBAND_DEVICE 0 reset + + DESCRIPTION + + Reset the statistics of ioband device IOBAND_DEVICE. + + EXAMPLE + + Reset the statistics of "ioband1." + + # dmsetup message ioband1 0 reset + + + -------------------------------------------------------------------------- + +Examples + + Example #1: Bandwidth control on Partitions + + This example describes how to control the bandwidth with disk + partitions. The following diagram illustrates the configuration of this + example. You may want to run a database on /dev/mapper/ioband1 and web + applications on /dev/mapper/ioband2. + + /mnt1 /mnt2 mount points + | | + +-------------V------------+ +-------------V------------+ + | /dev/mapper/ioband1 | | /dev/mapper/ioband2 | ioband devices + +--------------------------+ +--------------------------+ + | default group | | default group | ioband groups + | (80) | | (40) | (weight) + +-------------|------------+ +-------------|------------+ + | | + +-------------V-------------+--------------V------------+ + | /dev/sda1 | /dev/sda2 | physical devices + +---------------------------+---------------------------+ + + + To setup the above configuration, follow these steps: + + 1. Create ioband devices with the same device group ID and assign + weights of 80 and 40 to the default ioband groups respectively. + + # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1 1 0 0" \ + "none weight 0 :80" | dmsetup create ioband1 + # echo "0 $(blockdev --getsize /dev/sda2) ioband /dev/sda2 1 0 0" \ + "none weight 0 :40" | dmsetup create ioband2 + + + 2. Create filesystems on the ioband devices and mount them. + + # mkfs.ext3 /dev/mapper/ioband1 + # mount /dev/mapper/ioband1 /mnt1 + + # mkfs.ext3 /dev/mapper/ioband2 + # mount /dev/mapper/ioband2 /mnt2 + + + -------------------------------------------------------------------------- + + Example #2: Bandwidth control on Logical Volumes + + This example is similar to the example #1 but it uses LVM logical + volumes instead of disk partitions. This example shows how to configure + ioband devices on two striped logical volumes. + + /mnt1 /mnt2 mount points + | | + +-------------V------------+ +-------------V------------+ + | /dev/mapper/ioband1 | | /dev/mapper/ioband2 | ioband devices + +--------------------------+ +--------------------------+ + | default group | | default group | ioband groups + | (80) | | (40) | (weight) + +-------------|------------+ +-------------|------------+ + | | + +-------------V------------+ +-------------V------------+ + | /dev/mapper/lv0 | | /dev/mapper/lv1 | striped logical + | | | | volumes + +-------------------------------------------------------+ + | vg0 | volume group + +-------------|----------------------------|------------+ + | | + +-------------V------------+ +-------------V------------+ + | /dev/sdb | | /dev/sdc | physical devices + +--------------------------+ +--------------------------+ + + + To setup the above configuration, follow these steps: + + 1. Initialize the partitions for use by LVM. + + # pvcreate /dev/sdb + # pvcreate /dev/sdc + + + 2. Create a new volume group named "vg0" with /dev/sdb and /dev/sdc. + + # vgcreate vg0 /dev/sdb /dev/sdc + + + 3. Create two logical volumes in "vg0." 
The volumes have to be striped. + + # lvcreate -n lv0 -i 2 -I 64 vg0 -L 1024M + # lvcreate -n lv1 -i 2 -I 64 vg0 -L 1024M + + + The rest is the same as the example #1. + + 4. Create ioband devices corresponding to each logical volume and + assign weights of 80 and 40 to the default ioband groups respectively. + + # echo "0 $(blockdev --getsize /dev/mapper/vg0-lv0)" \ + "ioband /dev/mapper/vg0-lv0 1 0 0 none weight 0 :80" | \ + dmsetup create ioband1 + # echo "0 $(blockdev --getsize /dev/mapper/vg0-lv1)" \ + "ioband /dev/mapper/vg0-lv1 1 0 0 none weight 0 :40" | \ + dmsetup create ioband2 + + + 5. Create filesystems on the ioband devices and mount them. + + # mkfs.ext3 /dev/mapper/ioband1 + # mount /dev/mapper/ioband1 /mnt1 + + # mkfs.ext3 /dev/mapper/ioband2 + # mount /dev/mapper/ioband2 /mnt2 + + + -------------------------------------------------------------------------- + + Example #3: Bandwidth control on processes + + This example describes how to control the bandwidth with groups of + processes. You may also want to run an additional application on the same + machine described in the example #1. This example shows how to add a new + ioband group for this application. + + /mnt1 /mnt2 mount points + | | + +-------------V------------+ +-------------V------------+ + | /dev/mapper/ioband1 | | /dev/mapper/ioband2 | ioband devices + +-------------+------------+ +-------------+------------+ + | default | | user=1000 | default | ioband groups + | (80) | | (20) | (40) | (weight) + +-------------+------------+ +-------------+------------+ + | | + +-------------V-------------+--------------V------------+ + | /dev/sda1 | /dev/sda2 | physical device + +---------------------------+---------------------------+ + + + The following shows to set up a new ioband group on the machine that is + already configured as the example #1. The application will have a weight + of 20 and run with user-id 1000 on /dev/mapper/ioband2. + + 1. Set the type of ioband2 to "user." + + # dmsetup message ioband2 0 type user. + + + 2. Create a new ioband group on ioband2. + + # dmsetup message ioband2 0 attach 1000 + + + 3. Assign weight of 10 to this newly created ioband group. + + # dmsetup message ioband2 0 weight 1000:20 + + + -------------------------------------------------------------------------- + + Example #4: Bandwidth control for Xen virtual block devices + + This example describes how to control the bandwidth for Xen virtual + block devices. The following diagram illustrates the configuration of this + example. + + Virtual Machine 1 Virtual Machine 2 virtual machines + | | + +-------------V------------+ +-------------V------------+ + | /dev/xvda1 | | /dev/xvda1 | virtual block + +-------------|------------+ +-------------|------------+ devices + | | + +-------------V------------+ +-------------V------------+ + | /dev/mapper/ioband1 | | /dev/mapper/ioband2 | ioband devices + +--------------------------+ +--------------------------+ + | default group | | default group | ioband groups + | (80) | | (40) | (weight) + +-------------|------------+ +-------------|------------+ + | | + +-------------V-------------+--------------V------------+ + | /dev/sda1 | /dev/sda2 | physical device + +---------------------------+---------------------------+ + + + The followings shows how to map ioband device "ioband1" and "ioband2" to + virtual block device "/dev/xvda1 on Virtual Machine 1" and "/dev/xvda1 on + Virtual Machine 2" respectively on the machine configured as the example + #1. 
Add the following lines to the configuration files that are referenced + when creating "Virtual Machine 1" and "Virtual Machine 2." + + For "Virtual Machine 1" + disk = [ 'phy:/dev/mapper/ioband1,xvda,w' ] + + For "Virtual Machine 2" + disk = [ 'phy:/dev/mapper/ioband2,xvda,w' ] + + + -------------------------------------------------------------------------- + + Example #5: Bandwidth control for Xen blktap devices + + This example describes how to control the bandwidth for Xen virtual + block devices when Xen blktap devices are used. The following diagram + illustrates the configuration of this example. + + Virtual Machine 1 Virtual Machine 2 virtual machines + | | + +-------------V------------+ +-------------V------------+ + | /dev/xvda1 | | /dev/xvda1 | virtual block + +-------------|------------+ +-------------|------------+ devices + | | + +-------------V----------------------------V------------+ + | /dev/mapper/ioband1 | ioband device + +---------------------------+---------------------------+ + | default group | default group | ioband groups + | (80) | (40) | (weight) + +-------------|-------------+--------------|------------+ + | | + +-------------|----------------------------|------------+ + | +----------V----------+ +----------V---------+ | + | | vm1.img | | vm2.img | | disk image files + | +---------------------+ +--------------------+ | + | /vmdisk | mount point + +---------------------------|---------------------------+ + | + +---------------------------V---------------------------+ + | /dev/sda1 | physical device + +-------------------------------------------------------+ + + + To setup the above configuration, follow these steps: + + 1. Create an ioband device. + + # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1" \ + "1 0 0 none weight 0 :100" | dmsetup create ioband1 + + + 2. Add the following lines to the configuration files that are + referenced when creating "Virtual Machine 1" and "Virtual Machine 2." + Disk image files "/vmdisk/vm1.img" and "/vmdisk/vm2.img" will be used. + + For "Virtual Machine 1" + disk = [ 'tap:aio:/vmdisk/vm1.img,xvda,w', ] + + For "Virtual Machine 1" + disk = [ 'tap:aio:/vmdisk/vm2.img,xvda,w', ] + + + 3. Run the virtual machines. + + # xm create vm1 + # xm create vm2 + + + 4. Find out the process IDs of the daemons which control the blktap + devices. + + # lsof /vmdisk/disk[12].img + COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME + tapdisk 15011 root 11u REG 253,0 2147483648 48961 /vmdisk/vm1.img + tapdisk 15276 root 13u REG 253,0 2147483648 48962 /vmdisk/vm2.img + + + 5. Create new ioband groups of pid 15011 and pid 15276, which are + process IDs of the tapdisks, and assign weight of 80 and 40 to the + groups respectively. + + # dmsetup message ioband1 0 type pid + # dmsetup message ioband1 0 attach 15011 + # dmsetup message ioband1 0 weight 15011:80 + # dmsetup message ioband1 0 attach 15276 + # dmsetup message ioband1 0 weight 15276:40 From jun.nakajima at intel.com Fri Oct 3 15:33:30 2008 From: jun.nakajima at intel.com (Nakajima, Jun) Date: Fri, 3 Oct 2008 15:33:30 -0700 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. 
In-Reply-To: <48E422CA.2010606@zytor.com> References: <1222881242.9381.17.camel@alok-dev1> <48E3B19D.6060905@zytor.com> <1222882431.9381.23.camel@alok-dev1> <48E3BC21.4080803@goop.org> <1222895153.9381.69.camel@alok-dev1> <48E3FDD5.7040106@zytor.com> <0B53E02A2965CE4F9ADB38B34501A3A15D927EA4@orsmsx505.amr.corp.intel.com> <48E422CA.2010606@zytor.com> Message-ID: <0B53E02A2965CE4F9ADB38B34501A3A15DCBA221@orsmsx505.amr.corp.intel.com> On 10/1/2008 6:24:26 PM, H. Peter Anvin wrote: > Nakajima, Jun wrote: > > > > > > All I have seen out of Microsoft only covers CPUID levels > > > 0x40000000 as an vendor identification leaf and 0x40000001 as a > > > "hypervisor identification leaf", but you might have access to other information. > > > > No, it says "Leaf 0x40000001 as hypervisor vendor-neutral interface > > identification, which determines the semantics of leaves from > > 0x40000002 through 0x400000FF." The Leaf 0x40000000 returns vendor > > identifier signature (i.e. hypervisor identification) and the > > hypervisor CPUID leaf range, as in the proposal. > > > Resuming the thread :-) > In other words, 0x40000002+ is vendor-specific space, based on the > hypervisor specified in 0x40000001 (in theory); in practice both > 0x40000000:0x40000001 since M$ seem to use clever identifiers as > "Hypervisor 1". What it means their hypervisor returns the interface signature (i.e. "Hv#1"), and that defines the interface. If we use "Lv_1", for example, we can define the interface 0x40000002 through 0x400000FF for Linux. Since leaf 0x40000000 and 0x40000001 are separate, we can decouple the hypervisor vender from the interface it supports. This also allows a hypervisor to support multiple interfaces. And whether a guest wants to use the interface without checking the vender id is a different thing. For Linux, we don't want to hardcode the vender ids in the upstream code at least for such a generic interface. So I think we need to modify the proposal: Hypervisor interface identification Leaf: Leaf 0x40000001. This leaf returns the interface signature that the hypervisor implements. # EAX: "Lv_1" (or something) # EBX, ECX, EDX: Reserved. Lv_1 interface Leaves: Leaf range 0x40000002 - 0x4000000FF. In fact, both Xen and KVM are using the leaf 0x40000001 for different purposes today (Xen: Xen version number, KVM: KVM para-virtualization features). But I don't think this would break their existing binaries mainly because they would need to expose the interface explicitly now. > > > > This further underscores my belief that using 0x400000xx for > > > anything "standards-based" at all is utterly futile, and that this > > > space should be treated as vendor identification and the rest as > > > vendor-specific. Any hope of creating a standard that's actually > > > usable needs to be outside this space, e.g. in the 0x40SSSSxx > > > space I proposed earlier. > > > > Actually I'm not sure I'm following your logic. Are you saying using > > that 0x400000xx for anything "standards-based" is utterly futile > > because Microsoft said "the range is hypervisor vendor-neutral"? Or > > you were not sure what they meant there. If we are not clear, we can > > ask them. > > > > What I'm saying is that Microsoft is effectively squatting on the > 0x400000xx space with their definition. As written, it's not even > clear that it will remain consistent between *their own* hypervisors, > even less anyone else's. I hope the above clarified your concern. You can google-search a more detailed public spec. 
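As a rough, self-contained sketch of how a guest might consume the leaf layout proposed above (note the assumptions: the "Lv_1" signature is only the placeholder suggested in this thread, the leaf 0x40000001 semantics follow the proposal rather than any settled standard, and the cpuid() helper is hypothetical user-space code, not an existing kernel API):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: issue CPUID for the given leaf. */
static void cpuid(uint32_t leaf, uint32_t *eax, uint32_t *ebx,
                  uint32_t *ecx, uint32_t *edx)
{
        asm volatile("cpuid"
                     : "=a" (*eax), "=b" (*ebx), "=c" (*ecx), "=d" (*edx)
                     : "a" (leaf), "c" (0));
}

int main(void)
{
        uint32_t eax, ebx, ecx, edx;
        char vendor[13], iface[5];

        /* Leaf 0x40000000: hypervisor vendor signature and maximum leaf. */
        cpuid(0x40000000, &eax, &ebx, &ecx, &edx);
        memcpy(vendor, &ebx, 4);
        memcpy(vendor + 4, &ecx, 4);
        memcpy(vendor + 8, &edx, 4);
        vendor[12] = '\0';
        printf("hypervisor vendor: %s, max leaf: 0x%08x\n", vendor, eax);

        /* Leaf 0x40000001: interface signature ("Hv#1", "Lv_1", ...). */
        cpuid(0x40000001, &eax, &ebx, &ecx, &edx);
        memcpy(iface, &eax, 4);
        iface[4] = '\0';
        printf("interface signature: %s\n", iface);

        return 0;
}

A guest following this proposal would only interpret leaves 0x40000002 and above after finding the interface signature it expects in leaf 0x40000001.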
Let me know if you want to know a specific URL. > > -hpa > . Jun Nakajima | Intel Open Source Technology Center From hpa at zytor.com Fri Oct 3 16:30:29 2008 From: hpa at zytor.com (H. Peter Anvin) Date: Fri, 03 Oct 2008 16:30:29 -0700 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. In-Reply-To: <0B53E02A2965CE4F9ADB38B34501A3A15DCBA221@orsmsx505.amr.corp.intel.com> References: <1222881242.9381.17.camel@alok-dev1> <48E3B19D.6060905@zytor.com> <1222882431.9381.23.camel@alok-dev1> <48E3BC21.4080803@goop.org> <1222895153.9381.69.camel@alok-dev1> <48E3FDD5.7040106@zytor.com> <0B53E02A2965CE4F9ADB38B34501A3A15D927EA4@orsmsx505.amr.corp.intel.com> <48E422CA.2010606@zytor.com> <0B53E02A2965CE4F9ADB38B34501A3A15DCBA221@orsmsx505.amr.corp.intel.com> Message-ID: <48E6AB15.8060405@zytor.com> Nakajima, Jun wrote: > What it means their hypervisor returns the interface signature (i.e. "Hv#1"), and that defines the interface. If we use "Lv_1", for example, we can define the interface 0x40000002 through 0x400000FF for Linux. Since leaf 0x40000000 and 0x40000001 are separate, we can decouple the hypervisor vender from the interface it supports. Right so far. > This also allows a hypervisor to support multiple interfaces. Wrong. This isn't a two-way interface. It's a one-way interface, and it *SHOULD BE*; exposing different information depending on what is running is a hack that is utterly tortorous at best. > > In fact, both Xen and KVM are using the leaf 0x40000001 for different purposes today (Xen: Xen version number, KVM: KVM para-virtualization features). But I don't think this would break their existing binaries mainly because they would need to expose the interface explicitly now. > >>>> This further underscores my belief that using 0x400000xx for >>>> anything "standards-based" at all is utterly futile, and that this >>>> space should be treated as vendor identification and the rest as >>>> vendor-specific. Any hope of creating a standard that's actually >>>> usable needs to be outside this space, e.g. in the 0x40SSSSxx >>>> space I proposed earlier. >>> Actually I'm not sure I'm following your logic. Are you saying using >>> that 0x400000xx for anything "standards-based" is utterly futile >>> because Microsoft said "the range is hypervisor vendor-neutral"? Or >>> you were not sure what they meant there. If we are not clear, we can >>> ask them. >>> >> What I'm saying is that Microsoft is effectively squatting on the >> 0x400000xx space with their definition. As written, it's not even >> clear that it will remain consistent between *their own* hypervisors, >> even less anyone else's. > > I hope the above clarified your concern. You can google-search a more detailed public spec. Let me know if you want to know a specific URL. > No, it hasn't "clarified my concern" in any way. It's exactly *underscoring* it. In other words, I consider 0x400000xx unusable for anything that is standards-based. The interfaces everyone is currently using aren't designed to export multiple interfaces; they're designed to tell the guest which *one* interface is exported. That is fine, we just need to go elsewhere. -hpa From jun.nakajima at intel.com Fri Oct 3 17:27:53 2008 From: jun.nakajima at intel.com (Nakajima, Jun) Date: Fri, 3 Oct 2008 17:27:53 -0700 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. 
In-Reply-To: <48E6AB15.8060405@zytor.com> References: <1222881242.9381.17.camel@alok-dev1> <48E3B19D.6060905@zytor.com> <1222882431.9381.23.camel@alok-dev1> <48E3BC21.4080803@goop.org> <1222895153.9381.69.camel@alok-dev1> <48E3FDD5.7040106@zytor.com> <0B53E02A2965CE4F9ADB38B34501A3A15D927EA4@orsmsx505.amr.corp.intel.com> <48E422CA.2010606@zytor.com> <0B53E02A2965CE4F9ADB38B34501A3A15DCBA221@orsmsx505.amr.corp.intel.com> <48E6AB15.8060405@zytor.com> Message-ID: <0B53E02A2965CE4F9ADB38B34501A3A15DCBA325@orsmsx505.amr.corp.intel.com> On 10/3/2008 4:30:29 PM, H. Peter Anvin wrote: > Nakajima, Jun wrote: > > What it means their hypervisor returns the interface signature (i.e. > > "Hv#1"), and that defines the interface. If we use "Lv_1", for > > example, we can define the interface 0x40000002 through 0x400000FF for Linux. > > Since leaf 0x40000000 and 0x40000001 are separate, we can decouple > > the hypervisor vender from the interface it supports. > > Right so far. > > > This also allows a hypervisor to support multiple interfaces. > > Wrong. > > This isn't a two-way interface. It's a one-way interface, and it > *SHOULD BE*; exposing different information depending on what is > running is a hack that is utterly tortorous at best. What I mean is that a hypervisor (with a single vender id) can support multiple interfaces, exposing a single interface to each guest that would expect a specific interface at runtime. > > > > > In fact, both Xen and KVM are using the leaf 0x40000001 for > > different purposes today (Xen: Xen version number, KVM: KVM > > para-virtualization features). But I don't think this would break > > their existing binaries mainly because they would need to expose the interface explicitly now. > > > > > > > This further underscores my belief that using 0x400000xx for > > > > > anything "standards-based" at all is utterly futile, and that > > > > > this space should be treated as vendor identification and the > > > > > rest as vendor-specific. Any hope of creating a standard > > > > > that's actually usable needs to be outside this space, e.g. in > > > > > the 0x40SSSSxx space I proposed earlier. > > > > Actually I'm not sure I'm following your logic. Are you saying > > > > using that 0x400000xx for anything "standards-based" is utterly > > > > futile because Microsoft said "the range is hypervisor > > > > vendor-neutral"? Or you were not sure what they meant there. If > > > > we are not clear, we can ask them. > > > > > > > What I'm saying is that Microsoft is effectively squatting on the > > > 0x400000xx space with their definition. As written, it's not even > > > clear that it will remain consistent between *their own* > > > hypervisors, even less anyone else's. > > > > I hope the above clarified your concern. You can google-search a > > more detailed public spec. Let me know if you want to know a specific URL. > > > > No, it hasn't "clarified my concern" in any way. It's exactly > *underscoring* it. In other words, I consider 0x400000xx unusable for > anything that is standards-based. The interfaces everyone is > currently using aren't designed to export multiple interfaces; they're > designed to tell the guest which *one* interface is exported. That is > fine, we just need to go elsewhere. > > -hpa What's the significance of supporting multiple interfaces to the same guest simultaneously, i.e. _runtime_? We don't want the guests to run on such a literarily Frankenstein machine. And practically, such testing/debugging would be good only for Halloween :-). 
The interface space can be distinct, but the contents are defined and implemented independently, thus you might find overlaps, inconsistency, etc. among the interfaces. And why is runtime "multiple interfaces" required for a standards-based interface? . Jun Nakajima | Intel Open Source Technology Center From hpa at zytor.com Fri Oct 3 17:35:39 2008 From: hpa at zytor.com (H. Peter Anvin) Date: Fri, 03 Oct 2008 17:35:39 -0700 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. In-Reply-To: <0B53E02A2965CE4F9ADB38B34501A3A15DCBA325@orsmsx505.amr.corp.intel.com> References: <1222881242.9381.17.camel@alok-dev1> <48E3B19D.6060905@zytor.com> <1222882431.9381.23.camel@alok-dev1> <48E3BC21.4080803@goop.org> <1222895153.9381.69.camel@alok-dev1> <48E3FDD5.7040106@zytor.com> <0B53E02A2965CE4F9ADB38B34501A3A15D927EA4@orsmsx505.amr.corp.intel.com> <48E422CA.2010606@zytor.com> <0B53E02A2965CE4F9ADB38B34501A3A15DCBA221@orsmsx505.amr.corp.intel.com> <48E6AB15.8060405@zytor.com> <0B53E02A2965CE4F9ADB38B34501A3A15DCBA325@orsmsx505.amr.corp.intel.com> Message-ID: <48E6BA5B.2090804@zytor.com> Nakajima, Jun wrote: > > What I mean is that a hypervisor (with a single vender id) can support multiple interfaces, exposing a single interface to each guest that would expect a specific interface at runtime. > Yes, and for the reasons outlined in a previous post in this thread, this is an incredibly bad idea. We already hate the guts of the ACPI people for this reason. > > What's the significance of supporting multiple interfaces to the same guest simultaneously, i.e. _runtime_? We don't want the guests to run on such a literarily Frankenstein machine. And practically, such testing/debugging would be good only for Halloween :-). > By that notion, EVERY CPU currently shipped is a "Frankenstein" CPU, since at very least they export Intel-derived and AMD-derived interfaces. This is in other words, a ridiculous claim. > The interface space can be distinct, but the contents are defined and implemented independently, thus you might find overlaps, inconsistency, etc. among the interfaces. And why is runtime "multiple interfaces" required for a standards-based interface? That is the whole point -- without a central coordinating authority, you're going to have to accommodate many definition sources. Otherwise, you're just back to where we started -- each hypervisor exports an interface and that's just that. If there are multiple interface specifications, they should be exported simulateously in non-conflicting numberspaces, and the *GUEST* gets to choose what to believe. We already do this for *all kinds* of information, including CPUID. It's the right thing to do. -hpa From avi at redhat.com Sat Oct 4 01:53:19 2008 From: avi at redhat.com (Avi Kivity) Date: Sat, 04 Oct 2008 11:53:19 +0300 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. 
In-Reply-To: <0B53E02A2965CE4F9ADB38B34501A3A15DCBA325@orsmsx505.amr.corp.intel.com> References: <1222881242.9381.17.camel@alok-dev1> <48E3B19D.6060905@zytor.com> <1222882431.9381.23.camel@alok-dev1> <48E3BC21.4080803@goop.org> <1222895153.9381.69.camel@alok-dev1> <48E3FDD5.7040106@zytor.com> <0B53E02A2965CE4F9ADB38B34501A3A15D927EA4@orsmsx505.amr.corp.intel.com> <48E422CA.2010606@zytor.com> <0B53E02A2965CE4F9ADB38B34501A3A15DCBA221@orsmsx505.amr.corp.intel.com> <48E6AB15.8060405@zytor.com> <0B53E02A2965CE4F9ADB38B34501A3A15DCBA325@orsmsx505.amr.corp.intel.com> Message-ID: <48E72EFF.5040101@redhat.com> Nakajima, Jun wrote: > What's the significance of supporting multiple interfaces to the same guest simultaneously, i.e. _runtime_? We don't want the guests to run on such a literarily Frankenstein machine. And practically, such testing/debugging would be good only for Halloween :-). > > If you can only expose one interface, you need to have the user choose. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. From jun.nakajima at intel.com Tue Oct 7 15:30:10 2008 From: jun.nakajima at intel.com (Nakajima, Jun) Date: Tue, 7 Oct 2008 15:30:10 -0700 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. In-Reply-To: <48E6BA5B.2090804@zytor.com> References: <1222881242.9381.17.camel@alok-dev1> <48E3B19D.6060905@zytor.com> <1222882431.9381.23.camel@alok-dev1> <48E3BC21.4080803@goop.org> <1222895153.9381.69.camel@alok-dev1> <48E3FDD5.7040106@zytor.com> <0B53E02A2965CE4F9ADB38B34501A3A15D927EA4@orsmsx505.amr.corp.intel.com> <48E422CA.2010606@zytor.com> <0B53E02A2965CE4F9ADB38B34501A3A15DCBA221@orsmsx505.amr.corp.intel.com> <48E6AB15.8060405@zytor.com> <0B53E02A2965CE4F9ADB38B34501A3A15DCBA325@orsmsx505.amr.corp.intel.com> <48E6BA5B.2090804@zytor.com> Message-ID: <0B53E02A2965CE4F9ADB38B34501A3A15DE4F934@orsmsx505.amr.corp.intel.com> On 10/3/2008 5:35:39 PM, H. Peter Anvin wrote: > Nakajima, Jun wrote: > > > > What's the significance of supporting multiple interfaces to the > > same guest simultaneously, i.e. _runtime_? We don't want the guests > > to run on such a literarily Frankenstein machine. And practically, > > such testing/debugging would be good only for Halloween :-). > > > > By that notion, EVERY CPU currently shipped is a "Frankenstein" CPU, > since at very least they export Intel-derived and AMD-derived interfaces. > This is in other words, a ridiculous claim. The big difference here is that you could create a VM at runtime (by combining the existing interfaces) that did not exist before (or was not tested before). For example, a hypervisor could show hyper-v, osx-v (if any), linux-v, etc., and a guest could create a VM with hyper-v MMU, osx-v interrupt handling, Linux-v timer, etc. And such combinations/variations can grow exponentially. Or are you suggesting that multiple interfaces be _available_ to guests at runtime but the guest chooses one of them? > -hpa > . Jun Nakajima | Intel Open Source Technology Center From hpa at zytor.com Tue Oct 7 15:37:13 2008 From: hpa at zytor.com (H. Peter Anvin) Date: Tue, 07 Oct 2008 15:37:13 -0700 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. 
In-Reply-To: <0B53E02A2965CE4F9ADB38B34501A3A15DE4F934@orsmsx505.amr.corp.intel.com> References: <1222881242.9381.17.camel@alok-dev1> <48E3B19D.6060905@zytor.com> <1222882431.9381.23.camel@alok-dev1> <48E3BC21.4080803@goop.org> <1222895153.9381.69.camel@alok-dev1> <48E3FDD5.7040106@zytor.com> <0B53E02A2965CE4F9ADB38B34501A3A15D927EA4@orsmsx505.amr.corp.intel.com> <48E422CA.2010606@zytor.com> <0B53E02A2965CE4F9ADB38B34501A3A15DCBA221@orsmsx505.amr.corp.intel.com> <48E6AB15.8060405@zytor.com> <0B53E02A2965CE4F9ADB38B34501A3A15DCBA325@orsmsx505.amr.corp.intel.com> <48E6BA5B.2090804@zytor.com> <0B53E02A2965CE4F9ADB38B34501A3A15DE4F934@orsmsx505.amr.corp.intel.com> Message-ID: <48EBE499.5000304@zytor.com> Nakajima, Jun wrote: > On 10/3/2008 5:35:39 PM, H. Peter Anvin wrote: >> Nakajima, Jun wrote: >>> What's the significance of supporting multiple interfaces to the >>> same guest simultaneously, i.e. _runtime_? We don't want the guests >>> to run on such a literarily Frankenstein machine. And practically, >>> such testing/debugging would be good only for Halloween :-). >>> >> By that notion, EVERY CPU currently shipped is a "Frankenstein" CPU, >> since at very least they export Intel-derived and AMD-derived interfaces. >> This is in other words, a ridiculous claim. > > The big difference here is that you could create a VM at runtime (by combining the existing interfaces) that did not exist before (or was not tested before). For example, a hypervisor could show hyper-v, osx-v (if any), linux-v, etc., and a guest could create a VM with hyper-v MMU, osx-v interrupt handling, Linux-v timer, etc. And such combinations/variations can grow exponentially. > > Or are you suggesting that multiple interfaces be _available_ to guests at runtime but the guest chooses one of them? > The guest chooses what it wants to use. We already do this: for example, we use CPUID leaf 0x80000006 preferentially to CPUID leaf 2, simply because it is a better interface. And you're absolutely right that the guest may end up picking and choosing different parts of the interfaces. That's how it is supposed to work. -hpa From jeremy at goop.org Tue Oct 7 16:41:10 2008 From: jeremy at goop.org (Jeremy Fitzhardinge) Date: Tue, 07 Oct 2008 16:41:10 -0700 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. In-Reply-To: <0B53E02A2965CE4F9ADB38B34501A3A15DE4F934@orsmsx505.amr.corp.intel.com> References: <1222881242.9381.17.camel@alok-dev1> <48E3B19D.6060905@zytor.com> <1222882431.9381.23.camel@alok-dev1> <48E3BC21.4080803@goop.org> <1222895153.9381.69.camel@alok-dev1> <48E3FDD5.7040106@zytor.com> <0B53E02A2965CE4F9ADB38B34501A3A15D927EA4@orsmsx505.amr.corp.intel.com> <48E422CA.2010606@zytor.com> <0B53E02A2965CE4F9ADB38B34501A3A15DCBA221@orsmsx505.amr.corp.intel.com> <48E6AB15.8060405@zytor.com> <0B53E02A2965CE4F9ADB38B34501A3A15DCBA325@orsmsx505.amr.corp.intel.com> <48E6BA5B.2090804@zytor.com> <0B53E02A2965CE4F9ADB38B34501A3A15DE4F934@orsmsx505.amr.corp.intel.com> Message-ID: <48EBF396.8000502@goop.org> Nakajima, Jun wrote: > On 10/3/2008 5:35:39 PM, H. Peter Anvin wrote: > >> Nakajima, Jun wrote: >> >>> What's the significance of supporting multiple interfaces to the >>> same guest simultaneously, i.e. _runtime_? We don't want the guests >>> to run on such a literarily Frankenstein machine. And practically, >>> such testing/debugging would be good only for Halloween :-). 
>>> >>> >> By that notion, EVERY CPU currently shipped is a "Frankenstein" CPU, >> since at very least they export Intel-derived and AMD-derived interfaces. >> This is in other words, a ridiculous claim. >> > > The big difference here is that you could create a VM at runtime (by combining the existing interfaces) that did not exist before (or was not tested before). For example, a hypervisor could show hyper-v, osx-v (if any), linux-v, etc., and a guest could create a VM with hyper-v MMU, osx-v interrupt handling, Linux-v timer, etc. And such combinations/variations can grow exponentially. > That would be crazy. > Or are you suggesting that multiple interfaces be _available_ to guests at runtime but the guest chooses one of them? > Right, that's what I've been suggesting. I think hypervisors should be able to offer multiple ABIs to guests, but a guest has to commit to using one exclusively (ie, once they start to use one then the others turn themselves off, kill the domain, etc). J From jeremy at goop.org Tue Oct 7 16:45:43 2008 From: jeremy at goop.org (Jeremy Fitzhardinge) Date: Tue, 07 Oct 2008 16:45:43 -0700 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. In-Reply-To: <48EBE499.5000304@zytor.com> References: <1222881242.9381.17.camel@alok-dev1> <48E3B19D.6060905@zytor.com> <1222882431.9381.23.camel@alok-dev1> <48E3BC21.4080803@goop.org> <1222895153.9381.69.camel@alok-dev1> <48E3FDD5.7040106@zytor.com> <0B53E02A2965CE4F9ADB38B34501A3A15D927EA4@orsmsx505.amr.corp.intel.com> <48E422CA.2010606@zytor.com> <0B53E02A2965CE4F9ADB38B34501A3A15DCBA221@orsmsx505.amr.corp.intel.com> <48E6AB15.8060405@zytor.com> <0B53E02A2965CE4F9ADB38B34501A3A15DCBA325@orsmsx505.amr.corp.intel.com> <48E6BA5B.2090804@zytor.com> <0B53E02A2965CE4F9ADB38B34501A3A15DE4F934@orsmsx505.amr.corp.intel.com> <48EBE499.5000304@zytor.com> Message-ID: <48EBF4A7.3080704@goop.org> H. Peter Anvin wrote: > And you're absolutely right that the guest may end up picking and > choosing different parts of the interfaces. That's how it is supposed > to work. No, that would be a horrible, horrible mistake. There's no sane way to implement that; it would mean that the hypervisor would have to have some kind of state model that incorporates all the ABIs in a consistent way. Any guest using multiple ABIs would effectively end up being dependent on a particular hypervisor via a frankensteinian interface that no other hypervisor would implement in the same way, even if they claim to implement the same set of interfaces. If the hypervisor just needs to deal with one at a time then it can have relatively simple ABI<->internal state translation. However, if you have the notion of hypervisor-agnostic or common interfaces, then you can include those as part of the rest of the ABI and make it sane (so Xen+common, hyperv+common, etc). J From hpa at zytor.com Tue Oct 7 16:45:34 2008 From: hpa at zytor.com (H. Peter Anvin) Date: Tue, 07 Oct 2008 16:45:34 -0700 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. 
In-Reply-To: <48EBF396.8000502@goop.org> References: <1222881242.9381.17.camel@alok-dev1> <48E3B19D.6060905@zytor.com> <1222882431.9381.23.camel@alok-dev1> <48E3BC21.4080803@goop.org> <1222895153.9381.69.camel@alok-dev1> <48E3FDD5.7040106@zytor.com> <0B53E02A2965CE4F9ADB38B34501A3A15D927EA4@orsmsx505.amr.corp.intel.com> <48E422CA.2010606@zytor.com> <0B53E02A2965CE4F9ADB38B34501A3A15DCBA221@orsmsx505.amr.corp.intel.com> <48E6AB15.8060405@zytor.com> <0B53E02A2965CE4F9ADB38B34501A3A15DCBA325@orsmsx505.amr.corp.intel.com> <48E6BA5B.2090804@zytor.com> <0B53E02A2965CE4F9ADB38B34501A3A15DE4F934@orsmsx505.amr.corp.intel.com> <48EBF396.8000502@goop.org> Message-ID: <48EBF49E.9030803@zytor.com> Jeremy Fitzhardinge wrote: >> >> The big difference here is that you could create a VM at runtime (by >> combining the existing interfaces) that did not exist before (or was >> not tested before). For example, a hypervisor could show hyper-v, >> osx-v (if any), linux-v, etc., and a guest could create a VM with >> hyper-v MMU, osx-v interrupt handling, Linux-v timer, etc. And such >> combinations/variations can grow exponentially. > > That would be crazy. > Not necessarily, although the example above is extreme. Redundant interfaces is the norm in an evolving platform. >> Or are you suggesting that multiple interfaces be _available_ to >> guests at runtime but the guest chooses one of them? > > Right, that's what I've been suggesting. I think hypervisors should > be able to offer multiple ABIs to guests, but a guest has to commit to > using one exclusively (ie, once they start to use one then the others > turn themselves off, kill the domain, etc). Not inherently. Of course, there may be interfaces which are interently or by policy mutually exclusive, but a hypervisor should only export the interfaces it wants a guest to be able to use. This is particularly so with CPUID, which is a *data export* interface, it doesn't perform any action. -hpa From jeremy at goop.org Tue Oct 7 17:40:14 2008 From: jeremy at goop.org (Jeremy Fitzhardinge) Date: Tue, 07 Oct 2008 17:40:14 -0700 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. In-Reply-To: <48EBF49E.9030803@zytor.com> References: <1222881242.9381.17.camel@alok-dev1> <48E3B19D.6060905@zytor.com> <1222882431.9381.23.camel@alok-dev1> <48E3BC21.4080803@goop.org> <1222895153.9381.69.camel@alok-dev1> <48E3FDD5.7040106@zytor.com> <0B53E02A2965CE4F9ADB38B34501A3A15D927EA4@orsmsx505.amr.corp.intel.com> <48E422CA.2010606@zytor.com> <0B53E02A2965CE4F9ADB38B34501A3A15DCBA221@orsmsx505.amr.corp.intel.com> <48E6AB15.8060405@zytor.com> <0B53E02A2965CE4F9ADB38B34501A3A15DCBA325@orsmsx505.amr.corp.intel.com> <48E6BA5B.2090804@zytor.com> <0B53E02A2965CE4F9ADB38B34501A3A15DE4F934@orsmsx505.amr.corp.intel.com> <48EBF396.8000502@goop.org> <48EBF49E.9030803@zytor.com> Message-ID: <48EC016E.4040708@goop.org> H. Peter Anvin wrote: > Jeremy Fitzhardinge wrote: >>> >>> The big difference here is that you could create a VM at runtime (by >>> combining the existing interfaces) that did not exist before (or was >>> not tested before). For example, a hypervisor could show hyper-v, >>> osx-v (if any), linux-v, etc., and a guest could create a VM with >>> hyper-v MMU, osx-v interrupt handling, Linux-v timer, etc. And such >>> combinations/variations can grow exponentially. >> >> That would be crazy. >> > > Not necessarily, although the example above is extreme. Redundant > interfaces is the norm in an evolving platform. Sure. 
A common feature across all hypervisor-specific ABIs may get subsumed into a generic interface which is equivalent to all the others. That's fine. But nobody should expect to be able to mix hyperV's lazy tlb interface with KVM's pv mmu updates and expect to get a working result. >>> Or are you suggesting that multiple interfaces be _available_ to >>> guests at runtime but the guest chooses one of them? >> >> Right, that's what I've been suggesting. I think hypervisors >> should be able to offer multiple ABIs to guests, but a guest has to >> commit to using one exclusively (ie, once they start to use one then >> the others turn themselves off, kill the domain, etc). > > Not inherently. Of course, there may be interfaces which are > interently or by policy mutually exclusive, but a hypervisor should > only export the interfaces it wants a guest to be able to use. It should export any interface that it implements fully, but those interfaces may have contradictory or inconsistent semantics which prevent them from being used concurrently. > This is particularly so with CPUID, which is a *data export* > interface, it doesn't perform any action. Well, sure. There's two distinct issues: 1. Using cpuid to get information about the kernel's environment. If the environment is sane, then cpuid is a read-only, side-effect free way of getting information, and any information gathered is fair game. 2. One of the pieces of information you can get with cpuid is a discovery of what paravirtual hypercall interfaces the environment supports, which the guest can compare against its list of interfaces that it supports. If there's some amount of intersection, it can decide to use one of those interfaces. I'm saying that *in general* a guest should expect to be able to use one and only one of those interfaces. There will be explicitly defined exceptions to that - such as using generic ABIs in addition to hypervisor specific ABIs - but a guest can't expect to to be able to mix and match. A tricky issue with selecting an ABI is if two hypervisors end up using exactly the same mechanism for implementing hypercalls (or whatever), so that there needs to be some explicit way for the guest to nominate which interface its actually using... J From hpa at zytor.com Tue Oct 7 18:09:57 2008 From: hpa at zytor.com (H. Peter Anvin) Date: Tue, 07 Oct 2008 18:09:57 -0700 Subject: [RFC] CPUID usage for interaction between Hypervisors and Linux. In-Reply-To: <48EBF4A7.3080704@goop.org> References: <1222881242.9381.17.camel@alok-dev1> <48E3B19D.6060905@zytor.com> <1222882431.9381.23.camel@alok-dev1> <48E3BC21.4080803@goop.org> <1222895153.9381.69.camel@alok-dev1> <48E3FDD5.7040106@zytor.com> <0B53E02A2965CE4F9ADB38B34501A3A15D927EA4@orsmsx505.amr.corp.intel.com> <48E422CA.2010606@zytor.com> <0B53E02A2965CE4F9ADB38B34501A3A15DCBA221@orsmsx505.amr.corp.intel.com> <48E6AB15.8060405@zytor.com> <0B53E02A2965CE4F9ADB38B34501A3A15DCBA325@orsmsx505.amr.corp.intel.com> <48E6BA5B.2090804@zytor.com> <0B53E02A2965CE4F9ADB38B34501A3A15DE4F934@orsmsx505.amr.corp.intel.com> <48EBE499.5000304@zytor.com> <48EBF4A7.3080704@goop.org> Message-ID: <48EC0865.7050400@zytor.com> Jeremy Fitzhardinge wrote: > H. Peter Anvin wrote: >> And you're absolutely right that the guest may end up picking and >> choosing different parts of the interfaces. That's how it is supposed >> to work. > > No, that would be a horrible, horrible mistake. 
There's no sane way to > implement that; it would mean that the hypervisor would have to have > some kind of state model that incorporates all the ABIs in a consistent > way. Any guest using multiple ABIs would effectively end up being > dependent on a particular hypervisor via a frankensteinian interface > that no other hypervisor would implement in the same way, even if they > claim to implement the same set of interfaces. > > If the hypervisor just needs to deal with one at a time then it can have > relatively simple ABI<->internal state translation. > > However, if you have the notion of hypervisor-agnostic or common > interfaces, then you can include those as part of the rest of the ABI > and make it sane (so Xen+common, hyperv+common, etc). > It depends on what classes of interfaces you're talking about. I think you and Jun have a bit narrow definition of "ABI" in this context. This is functionally equivalent to hardware interfaces (after all, that is what the hypervisor ABI *is* as far as the kernel is concerned) -- noone expects, say, a SATA controller that can run in legacy IDE mode to also take AHCI commands at the same time, but the kernel *does* expect that a chipset which exports LAPIC, HPET, PMTMR and TSC clock sources can use all four at the same time. In the latter case the interfaces are inherently independent and refer to different chunks of hardware which just happen to be related in that they all are related to timing. In the former case, we're dealing with *one* piece of hardware which can operate in one of two modes. For hypervisors, you will end up with cases where you have both types -- for example, KVM will happily use VMware's video interface, but that doesn't mean KVM wants to use VMware's interfaces for storage. This is exactly how it should be: the extent this kind of mix and match that is possible is a matter of the definition of the individual interfaces themselves, not of the overall architecture. -hpa From yu.zhao at intel.com Tue Oct 7 19:23:18 2008 From: yu.zhao at intel.com (Zhao, Yu) Date: Wed, 8 Oct 2008 10:23:18 +0800 Subject: [PATCH 1/6 v3] PCI: export some functions and macros In-Reply-To: <20080927125927.GL27204@parisc-linux.org> References: <20080927125927.GL27204@parisc-linux.org> Message-ID: On Saturday, September 27, 2008 8:59 PM, Matthew Wilcox wrote: >On Sat, Sep 27, 2008 at 04:27:44PM +0800, Zhao, Yu wrote: >> Export some functions and move some macros from c file to header file. > >That's absolutely not everything this patch does. You need to split >this into smaller pieces and explain what you're doing and why for each >of them. Sure, I'll split it. > >> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h >> index d807cd7..596efa6 100644 >> --- a/drivers/pci/pci.h >> +++ b/drivers/pci/pci.h >> @@ -1,3 +1,9 @@ >> +#ifndef DRIVERS_PCI_H >> +#define DRIVERS_PCI_H > >Do we really need header guards on this file? Maybe it's not necessary, but we use guards in almost all private headers. So I added this to make this file look not so different. > >> -/* >> - * If the type is not unknown, we assume that the lowest bit is 'enable'. >> - * Returns 1 if the BAR was 64-bit and 0 if it was 32-bit. >> +/** >> + * pci_read_base - read a PCI BAR >> + * @dev: the PCI device >> + * @type: type of the BAR >> + * @res: resource buffer to be filled in >> + * @pos: BAR position in the config space >> + * >> + * Returns 1 if the BAR is 64-bit, or 0 if 32-bit. 
>> */ >> -static int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type, >> +int pci_read_base(struct pci_dev *dev, enum pci_bar_type type, > >The original intent here was to have a pci_read_base() that called >__pci_read_base() and then did things like translate physical BAR >addresses to virtual ones. That patch is in the archives somewhere. >We ended up not including that patch because my user found out he could >get the address he wanted from elsewhere. I'm not sure we want to >remove the __ at this point. I've studied your patch that adds wrapper of __pci_read_base. If you are going to push it again, I'm ok with keeping the name unchanged. > >The eventual goal is to fix up the BARs at this point, but there's >several architectures that will break if we do this now. It's on my >long-term todo list. > >> struct resource *res, unsigned int pos) >> { >> u32 l, sz, mask; >> >> - mask = type ? ~PCI_ROM_ADDRESS_ENABLE : ~0; >> + mask = (type == pci_bar_rom) ? ~PCI_ROM_ADDRESS_ENABLE : ~0; > >What's going on here? Why are you adding pci_bar_rom? For the rom we >use pci_bar_mem32. Take a look at, for example, the MCHBAR in the 965 >spec (313053.pdf). That's something that uses the pci_bar_mem64 type >and definitely wants to use the PCI_ROM_ADDRESS_ENABLE mask. Thanks for pointing this out. I wonder how the PC_ROM_ADDRESS_ENABLE mask is set for those non-standard BARs like MCHBAR after the probe stage -- I don't think pci_update_resource will take care of them. So how about adding BAR type checking again pci_bar_mem32 in pci_update_resource so we can set the bit there? > >> >> - if (type == pci_bar_unknown) { >> + if (type == pci_bar_rom) { >> + res->flags |= (l & IORESOURCE_ROM_ENABLE); >> + l &= PCI_ROM_ADDRESS_MASK; >> + mask = (u32)PCI_ROM_ADDRESS_MASK; >> + } else { > >This looks wrong too. > >> if (rom) { >> @@ -344,7 +340,7 @@ static void pci_read_bases(struct pci_dev *dev, unsigned >int howmany, int rom) >> res->flags = IORESOURCE_MEM | IORESOURCE_PREFETCH | >> IORESOURCE_READONLY | IORESOURCE_CACHEABLE >| >> IORESOURCE_SIZEALIGN; >> - __pci_read_base(dev, pci_bar_mem32, res, rom); >> + pci_read_base(dev, pci_bar_mem32, res, rom); >> } > >And you don't even change the type here ... have you tested this code on >a system which has a ROM? Oh, you caught it. > >> >> - for(i=0; i<3; i++) >> - child->resource[i] = >&dev->resource[PCI_BRIDGE_RESOURCES+i]; >> - > >Er, this is rather important. Why can you just delete it? I guess pci_alloc_child_bus has done this so we don't have to do it again. > >-- >Matthew Wilcox Intel Open Source Technology Centre >"Bill, look, we understand that you're interested in selling us this >operating system, but compare it to ours. We can't possibly take such >a retrograde step." >-- >To unsubscribe from this list: send the line "unsubscribe linux-pci" in >the body of a message to majordomo at vger.kernel.org >More majordomo info at http://vger.kernel.org/majordomo-info.html From yu.zhao at intel.com Tue Oct 7 19:32:35 2008 From: yu.zhao at intel.com (Zhao, Yu) Date: Wed, 8 Oct 2008 10:32:35 +0800 Subject: [PATCH 0/6 v3] PCI: Linux kernel SR-IOV support In-Reply-To: <200809271308.03547.javier@guerrag.com> References: <200809271308.03547.javier@guerrag.com> Message-ID: On Sunday, September 28, 2008 2:08 AM, Javier Guerra Giraldez wrote: >On Saturday 27 September 2008, Zhao, Yu wrote: >> Greetings, >> >> Following patches are intended to support SR-IOV capability in the Linux >> kernel. 
With these patches, people can turn a PCI device with the >> capability into multiple ones from software perspective, which can benefit >> KVM and achieve other purposes such as QoS, security, etc. > >sounds great, i think some Infiniband HBAs have this capability; they even >suggested using on Xen for faster (no hypervisor intervention) communication >between DomU's on the same box. (and transparently to out of the box, of >course) Thanks. We will also push Xen patches soon. And please feel free to let us know if you have any question when integrating these patches with your products. > >does it need an IOMMU (VT-d), or the whole magic is done by the PCI device? For native Linux, we can use Virtual Function without IOMMU. For KVM, it requires IOMMU so the guest can use VFs. For Xen HVM, it also requires IOMMU. For Xen PV, it doesn't need IOMMU. > >-- >Javier From yu.zhao at intel.com Tue Oct 7 19:49:27 2008 From: yu.zhao at intel.com (Zhao, Yu) Date: Wed, 8 Oct 2008 10:49:27 +0800 Subject: [PATCH 4/6 v3] PCI: support SR-IOV capability In-Reply-To: References: Message-ID: On Wednesday, October 01, 2008 6:40 AM, Roland Dreier wrote: > > + ctrl = pci_ari_enabled(dev) ? PCI_IOV_CTRL_ARI : 0; > > + pci_write_config_word(dev, pos + PCI_IOV_CTRL, ctrl); > > + ssleep(1); > >You seem to sleep for 1 second wherever you write the IOV_CTRL >register. Why is this? Is this specified by PCI, or is it coming from >somewhere else? This is specified by on pp. 39 PCI SR-IOV specification 1.0. You can find it at: http://www.pcisig.com/members/downloads/specifications/iov/sr-iov1.0_11Sep07.pdf Thanks. > > - R. From yu.zhao at intel.com Tue Oct 7 19:56:18 2008 From: yu.zhao at intel.com (Zhao, Yu) Date: Wed, 8 Oct 2008 10:56:18 +0800 Subject: [PATCH 3/6 v3] PCI: support ARI capability In-Reply-To: <20081002161701.GO13822@parisc-linux.org> References: <200810020903.16385.jbarnes@virtuousgeek.org> <20081002161701.GO13822@parisc-linux.org> Message-ID: On Friday, October 03, 2008 12:17 AM, Matthew Wilcox wrote: >On Thu, Oct 02, 2008 at 09:03:15AM -0700, Jesse Barnes wrote: >> Maybe we should be consistent with the other APIs and call it pci_enable_ari >> (like we do for wake & msi). Looks pretty good otherwise. > >Those APIs are for drivers ... this is internal. I don't have any >objection of my own, though I agree with Alex's remark that the printk >is unnecessary and just adds clutter to the boot process. Will rename the function to pci_enable_ari, and remove the printk. Thanks. > >-- >Matthew Wilcox Intel Open Source Technology Centre >"Bill, look, we understand that you're interested in selling us this >operating system, but compare it to ours. We can't possibly take such >a retrograde step." >-- >To unsubscribe from this list: send the line "unsubscribe kvm" in >the body of a message to majordomo at vger.kernel.org >More majordomo info at http://vger.kernel.org/majordomo-info.html From yu.zhao at intel.com Tue Oct 7 20:25:00 2008 From: yu.zhao at intel.com (Zhao, Yu) Date: Wed, 8 Oct 2008 11:25:00 +0800 Subject: [PATCH 2/6 v3] PCI: add new general functions In-Reply-To: <200810020921.32335.jbarnes@virtuousgeek.org> References: <200810020921.32335.jbarnes@virtuousgeek.org> Message-ID: On Friday, October 03, 2008 12:22 AM, Jesse Barnes wrote: >On Saturday, September 27, 2008 1:27 am Zhao, Yu wrote: >> Centralize capability related functions into several new functions and put >> PCI resource definitions into an enum. 
> >> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c >> index f99160d..f2feebc 100644 >> --- a/drivers/pci/pci-sysfs.c >> +++ b/drivers/pci/pci-sysfs.c > >The sysfs changes look fine, they should be submitted separately. Will do. > >> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c >> index 259eaff..400d3b3 100644 >> --- a/drivers/pci/pci.c >> +++ b/drivers/pci/pci.c >> @@ -356,25 +356,10 @@ pci_find_parent_resource(const struct pci_dev *dev, >> struct resource *res) static void >> pci_restore_bars(struct pci_dev *dev) >> { >> - int i, numres; >> - >> - switch (dev->hdr_type) { >> - case PCI_HEADER_TYPE_NORMAL: >> - numres = 6; >> - break; >> - case PCI_HEADER_TYPE_BRIDGE: >> - numres = 2; >> - break; >> - case PCI_HEADER_TYPE_CARDBUS: >> - numres = 1; >> - break; >> - default: >> - /* Should never get here, but just in case... */ >> - return; >> - } >> + int i; >> >> - for (i = 0; i < numres; i ++) >> - pci_update_resource(dev, &dev->resource[i], i); >> + for (i = 0; i < PCI_BRIDGE_RESOURCES; i++) >> + pci_update_resource(dev, i); >> } > >This confused me for a minute until I saw that the new pci_update_resource >ignores invalid BAR numbers. Not sure if that's clearer than the current >code... When device has its own specific BARs, we have to add more 'case' statement in this function and may mass this function up. Simply ignoring the unused resources in pci_update_resource looks concise to me. Anyway, I can use keep the old structure if you feel the change brought more confusion than concision. > >> +/** >> + * pci_resource_bar - get position of the BAR associated with a resource >> + * @dev: the PCI device >> + * @resno: the resource number >> + * @type: the BAR type to be filled in >> + * >> + * Returns BAR position in config space, or 0 if the BAR is invalid. >> + */ >> +int pci_resource_bar(struct pci_dev *dev, int resno, enum pci_bar_type >> *type) +{ >> + if (resno < PCI_ROM_RESOURCE) { >> + *type = pci_bar_unknown; >> + return PCI_BASE_ADDRESS_0 + 4 * resno; >> + } else if (resno == PCI_ROM_RESOURCE) { >> + *type = pci_bar_rom; >> + return dev->rom_base_reg; >> + } >> + >> + dev_err(&dev->dev, "BAR: invalid resource #%d\n", resno); >> + return 0; >> +} > >It looks like this will spew an error even under normal circumstances since >pci_restore_bars gets called at resume time, right? You could make this into It won't print the message unless there is something wrong with the system. pci_update_resource is only called when the resource # is less than PCI_BRIDGE_RESOURCES and it will ignore unused resource. So when pci_resource_bar gets involved, all resource # shouldn't big than PCI_ROM_RESOURCE (PCI_BRIDGE_RESOURCE = PCI_ROM_RESOURCE + 1) >a debug message or just get rid of it. Also now that I look at this, I don't >think it'll provide equivalent functionality to the old restore_bars code, >won't a cardbus bridge end up getting pci_update_resource called on invalid >BARs? The cardbus uses 1 BAR resource plus 4 (max) bridge resources. The pci_update_resource is only called when restoring the BAR resource. It won't be called to update the bridge resources for the reason I mentioned above. > >> +static void pci_init_capabilities(struct pci_dev *dev) >> +{ >> + /* MSI/MSI-X list */ >> + pci_msi_init_pci_dev(dev); >> + >> + /* Power Management */ >> + pci_pm_init(dev); >> + >> + /* Vital Product Data */ >> + pci_vpd_pci22_init(dev); >> +} >> + > >These capabilities changes look good, care to separate them out? Will do. 
> >Let's see if we can whittle down this patchset by extracting and applying all >the various cleanups; that should make the core bits easier to review. Thanks for the careful reviewing and the comments. I'll send the updated version soon according to all the comments I've got. > >Thanks, >-- >Jesse Barnes, Intel Open Source Technology Center >-- >To unsubscribe from this list: send the line "unsubscribe linux-pci" in >the body of a message to majordomo at vger.kernel.org >More majordomo info at http://vger.kernel.org/majordomo-info.html From baramsori72 at gmail.com Wed Oct 8 01:29:34 2008 From: baramsori72 at gmail.com (Dong-Jae Kang) Date: Wed, 8 Oct 2008 17:29:34 +0900 Subject: [PATCH 0/2] dm-ioband: I/O bandwidth controller v1.7.0: Introduction In-Reply-To: <20081003.143749.193701570.ryov@valinux.co.jp> References: <20081003.143749.193701570.ryov@valinux.co.jp> Message-ID: <2891419e0810080129sb6b35y11362f4bef71c174@mail.gmail.com> Hi, Ryo Tsuruta I tested dm-ioband( the latest release, ver 1.7.0 ) IO controller, but I had a strange result from it. I have something wrong in test process? The test process and results are in attached file. Can you check my testing result and give me a helpful advices and comments? As you can show in attached file, I tested 4 cases in dm-ioband with tiobench() IO testing tool like as below - 1) 3 cgroups with different weight in same ioband device(ioband1) : Buffered IO - 2) 3 cgroups with different weight in same ioband device(ioband1) : Direct IO - 3) 3 cgroups with different weight in each ioband divice(ioband1, 2, 3) : Buffered IO - 4) 3 cgroups with different weight in each ioband divice(ioband1, 2, 3) : Direct IO But, IO bandwidth was not nearly controlled by dm-ioband You can refer the testing tool, tiobench, in http://sourceforge.net/projects/tiobench/ Originally, tiobench don't support the direct IO mode testing, so I added the O_DIRECT option to tiobench source code and recompile it to test the Direct IO cases Thanks, Dong-Jae, Kang --------------------------------------------------------------------- 2008/10/3 Ryo Tsuruta : > Hi everyone, > > This is the dm-ioband version 1.7.0 release. > > Dm-ioband is an I/O bandwidth controller implemented as a device-mapper > driver, which gives specified bandwidth to each job running on the same > physical device. > > - Can be applied to the kernel 2.6.27-rc5-mm1. > - Changes from 1.6.0 (posted on Sep 24, 2008): > - Fix a problem that processes issuing I/Os are permanently blocked > when I/O requests to reclaim pages are consecutively issued. > > You can apply the latest bio-cgroup patch to this dm-ioband version. > The bio-cgroup provides a BIO tracking mechanism with dm-ioband. > Please see the following site for more information: > Block I/O tracking > http://people.valinux.co.jp/~ryov/bio-cgroup/ > > Thanks, > Ryo Tsuruta > _______________________________________________ > Containers mailing list > Containers at lists.linux-foundation.org > https://lists.linux-foundation.org/mailman/listinfo/containers > -------------- next part -------------- A non-text attachment was scrubbed... 
Name: dm-ioband_test_result.pdf Type: application/pdf Size: 69827 bytes Desc: not available Url : http://lists.linux-foundation.org/pipermail/virtualization/attachments/20081008/cd6e9f76/attachment-0001.pdf From ryov at valinux.co.jp Wed Oct 8 03:40:22 2008 From: ryov at valinux.co.jp (Ryo Tsuruta) Date: Wed, 08 Oct 2008 19:40:22 +0900 (JST) Subject: [PATCH 0/2] dm-ioband: I/O bandwidth controller v1.7.0: Introduction In-Reply-To: <2891419e0810080129sb6b35y11362f4bef71c174@mail.gmail.com> References: <20081003.143749.193701570.ryov@valinux.co.jp> <2891419e0810080129sb6b35y11362f4bef71c174@mail.gmail.com> Message-ID: <20081008.194022.226783199.ryov@valinux.co.jp> Hi Dong-Jae, Thanks for being intersted in dm-ioband. > I tested dm-ioband( the latest release, ver 1.7.0 ) IO controller, but > I had a strange result from it. > I have something wrong in test process? > The test process and results are in attached file. > Can you check my testing result and give me a helpful advices and comments? There are some suggestions for you. 1. you have to specify a dm-ioband device at the command line to control bandwidth. # tiotest -R -d /dev/mapper/ioband1 -f 300 2. tiotest is not an appropriate tool to see how bandwith is shared among devices, becasue those three tiotests don't finish at the same time, a process which issues I/Os to a device with the highest weight finishes first, so you can't see how bandwidth is shared from the results of each tiotest. I use iostat to see the time variation of bandiwdth. The followings are the outputs of iostat just after starting three tiotests on the same setting as yours. # iostat -p dm-0 -p dm-1 -p dm-2 1 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn dm-0 5430.00 0.00 10860.00 0 10860 dm-1 16516.00 0.00 16516.00 0 16516 dm-2 32246.00 0.00 32246.00 0 32246 avg-cpu: %user %nice %system %iowait %steal %idle 0.51 0.00 21.83 76.14 0.00 1.52 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn dm-0 5720.00 0.00 11440.00 0 11440 dm-1 16138.00 0.00 16138.00 0 16138 dm-2 32734.00 0.00 32734.00 0 32734 ... > You can refer the testing tool, tiobench, in > http://sourceforge.net/projects/tiobench/ > Originally, tiobench don't support the direct IO mode testing, so I > added the O_DIRECT option to tiobench source code and recompile it to > test the Direct IO cases Could you give me the O_DIRECT patch? Thanks, Ryo Tsuruta From vbusireddy at nextio.com Wed Oct 8 08:25:47 2008 From: vbusireddy at nextio.com (Venugopal Busireddy) Date: Wed, 8 Oct 2008 10:25:47 -0500 Subject: [PATCH 1/6 v3] PCI: export some functions and macros Message-ID: Hi, Is the SR-IOV support available in the form a patch? Where can I get it from? Thanks, Venu -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.linux-foundation.org/pipermail/virtualization/attachments/20081008/dc4f789e/attachment.htm From rusty at rustcorp.com.au Wed Oct 8 17:55:59 2008 From: rusty at rustcorp.com.au (Rusty Russell) Date: Thu, 9 Oct 2008 11:55:59 +1100 Subject: [PATCH 2/2] virtio_net: Improve the recv buffer allocation scheme In-Reply-To: <1223494499-18732-2-git-send-email-markmc@redhat.com> References: <> <1223494499-18732-1-git-send-email-markmc@redhat.com> <1223494499-18732-2-git-send-email-markmc@redhat.com> Message-ID: <200810091155.59731.rusty@rustcorp.com.au> On Thursday 09 October 2008 06:34:59 Mark McLoughlin wrote: > From: Herbert Xu > > If segmentation offload is enabled by the host, we currently allocate > maximum sized packet buffers and pass them to the host. 
This uses up > 20 ring entries, allowing us to supply only 20 packet buffers to the > host with a 256 entry ring. This is a huge overhead when receiving > small packets, and is most keenly felt when receiving MTU sized > packets from off-host. Hi Mark! There are three approaches we should investigate before adding YA feature. Obviously, we can simply increase the number of ring entries. Secondly, we can put the virtio_net_hdr at the head of the skb data (this is also worth considering for xmit I think if we have headroom) and drop MAX_SKB_FRAGS which contains a gratuitous +2. Thirdly, we can try to coalesce contiguous buffers. The page caching scheme we have might help here, I don't know. Maybe we should be explicitly trying to allocate higher orders. Now, that said, we might need this anyway. But let's try the easy things first? (Or as well...) > The size of the logical buffer is > returned to the guest rather than the size of the individual smaller > buffers. That's a virtio transport breakage: can you use the standard virtio mechanism, just put the extended length or number of extra buffers inside the virtio_net_hdr? That makes more sense to me. > Make use of this support by supplying single page receive buffers to > the host. On receive, we extract the virtio_net_hdr, copy 128 bytes of > the payload to the skb's linear data buffer and adjust the fragment > offset to point to the remaining data. This ensures proper alignment > and allows us to not use any paged data for small packets. If the > payload occupies multiple pages, we simply append those pages as > fragments and free the associated skbs. > + char *p = page_address(skb_shinfo(skb)->frags[0].page); ... > + memcpy(hdr, p, sizeof(*hdr)); > + p += sizeof(*hdr); I think you need kmap_atomic() here to access the page. And yes, that will effect performance :( A few more comments moved from the patch header into the source wouldn't go astray, but I'm happy to do that myself (it's been on my TODO for a while). Thanks! Rusty. From baramsori72 at gmail.com Wed Oct 8 23:15:18 2008 From: baramsori72 at gmail.com (Dong-Jae Kang) Date: Thu, 9 Oct 2008 15:15:18 +0900 Subject: [PATCH 0/2] dm-ioband: I/O bandwidth controller v1.7.0: Introduction In-Reply-To: <20081008.194022.226783199.ryov@valinux.co.jp> References: <20081003.143749.193701570.ryov@valinux.co.jp> <2891419e0810080129sb6b35y11362f4bef71c174@mail.gmail.com> <20081008.194022.226783199.ryov@valinux.co.jp> Message-ID: <2891419e0810082315v28f2f4cbu5f95230db3be0bc1@mail.gmail.com> Hi, Ryo tsuruta. Thank you for your fast reply. Your comments was very helpful for me ^^ > 2. tiotest is not an appropriate tool to see how bandwith is shared > among devices, becasue those three tiotests don't finish at the > same time, a process which issues I/Os to a device with the highest > weight finishes first, so you can't see how bandwidth is shared > from the results of each tiotest. Yes, you are right, and it is good point for correct IO testing in dm-ioband and other controllers. So, I tested dm-ioband and bio-cgroup patches with another IO testing tool, xdd ver6.5(http://www.ioperformance.com/), after your reply. Xdd supports O_DIRECT mode and time limit options. I think, personally, it is proper tool for testing of IO controllers in Linux Container ML. And I found some strange points in test results. In fact, it will be not strange for other persons^^ 1. dm-ioband can control IO bandwidth well in O_DIRECT mode(read and write), I think the result is very reasonable. 
but it can't control it in Buffered mode when I checked just only output of xdd. I think bio-cgroup patches is for solving the problems, is it right? If so, how can I check or confirm the role of bio-cgroup patches? 2. As showed in test results, the IO performance in Buffered IO mode is very low compared with it in O_DIRECT mode. In my opinion, the reverse case is more natural in real life. Can you give me a answer about it? 3. Compared with physical bandwidth(when it is checked with one process and without dm-ioband device), the sum of the bandwidth by dm-ioband has very considerable gap with the physical bandwidth. I wonder the reason?. Is it overhead of dm-ioband or bio-cgroup patches? or Are there any another reasons? the new testing result is like below. - Testing target : the patches of dm-ioband ver1.7.0 and bio-cgroup latest version - Testing Cases 1.Read and write stress test in O_DIRECT IO mode 2.Read and write stress test in not Buffered IO mode - Testing tool : xdd ver6.5 ( http://www.ioperformance.com/ ) * Total bandwidth Read IO T Q Bytes Ops Time Rate IOPS Latency %CPU OP_Type ReqSize 0 1 1048576000 128000 15.700 66.790 8153.04 0.0001 0.00 read 8192 0 1 1048576000 128000 15.700 66.790 8153.04 0.0001 0.00 read 8192 1 1 1048576000 128000 15.700 66.790 8153.04 0.0001 0.00 read 8192 Write IO T Q Bytes Ops Time Rate IOPS Latency %CPU OP_Type ReqSize 0 1 1048576000 128000 14.730 71.185 8689.59 0.0001 0.00 write 8192 0 1 1048576000 128000 14.730 71.185 8689.59 0.0001 0.00 write 8192 1 1 1048576000 128000 14.730 71.185 8689.59 0.0001 0.00 write 8192 * Read IO test in O_DIRECT mode Command : xdd.linux -op read -targets 1 /dev/mapper/ioband1 -reqsize 8 -numreqs 128000 -verbose -timelimit 30 ?dio Result : cgroup1 (weight : 10) T Q Bytes Ops Time Rate IOPS Latency %CPU OP_Type ReqSize 0 1 84549632 10321 30.086 2.810 343.05 0.0029 0.00 read 8192 0 1 84549632 10321 30.086 2.810 343.05 0.0029 0.00 read 8192 1 1 84549632 10321 30.086 2.810 343.05 0.0029 0.00 read 8192 cgroup1 (weight : 30) T Q Bytes Ops Time Rate IOPS Latency %CPU OP_Type ReqSize 0 1 256425984 31302 30.089 8.522 1040.31 0.0010 0.00 read 8192 0 1 256425984 31302 30.089 8.522 1040.31 0.0010 0.00 read 8192 1 1 256425984 31302 30.089 8.522 1040.31 0.0010 0.00 read 8192 cgroup1 (weight : 60) T Q Bytes Ops Time Rate IOPS Latency %CPU OP_Type ReqSize 0 1 483467264 59017 30.000 16.116 1967.22 0.0005 0.00 read 8192 0 1 483467264 59017 30.000 16.116 1967.22 0.0005 0.00 read 8192 1 1 483467264 59017 30.000 16.116 1967.22 0.0005 0.00 read 8192 * Write IO test in O_DIRECT mode Command : xdd.linux -op write -targets 1 /dev/mapper/ioband1 -reqsize 8 -numreqs 128000 -verbose -timelimit 30 ?dio Result : cgroup1 (weight : 10) T Q Bytes Ops Time Rate IOPS Latency %CPU OP_Type ReqSize 0 1 106790912 13036 30.034 3.556 434.04 0.0023 0.00 write 8192 0 1 106790912 13036 30.034 3.556 434.04 0.0023 0.00 write 8192 1 1 106790912 13036 30.034 3.556 434.04 0.0023 0.00 write 8192 cgroup1 (weight : 30) T Q Bytes Ops Time Rate IOPS Latency %CPU OP_Type ReqSize 0 1 347176960 42380 30.006 11.570 1412.40 0.0007 0.00 write 8192 0 1 347176960 42380 30.006 11.570 1412.40 0.0007 0.00 write 8192 1 1 347176960 42380 30.006 11.570 1412.40 0.0007 0.00 write 8192 cgroup1 (weight : 60) T Q Bytes Ops Time Rate IOPS Latency %CPU OP_Type ReqSize 0 1 636256256 77668 30.000 21.209 2588.93 0.0004 0.00 write 8192 0 1 636256256 77668 30.000 21.209 2588.93 0.0004 0.00 write 8192 1 1 636256256 77668 30.000 21.209 2588.93 0.0004 0.00 write 8192 * Read IO test in Buffered IO mode 
Command : xdd.linux -op read -targets 1 /dev/mapper/ioband1 -reqsize 8 -numreqs 128000 -verbose -timelimit 30

Result :
cgroup1 (weight : 10)
T Q Bytes Ops Time Rate IOPS Latency %CPU OP_Type ReqSize
0 1 161284096 19688 30.012 5.374 656.00 0.0015 0.00 read 8192
0 1 161284096 19688 30.012 5.374 656.00 0.0015 0.00 read 8192
1 1 161284096 19688 30.012 5.374 656.00 0.0015 0.00 read 8192

cgroup1 (weight : 30)
T Q Bytes Ops Time Rate IOPS Latency %CPU OP_Type ReqSize
0 1 162816000 19875 30.005 5.426 662.38 0.0015 0.00 read 8192
0 1 162816000 19875 30.005 5.426 662.38 0.0015 0.00 read 8192
1 1 162816000 19875 30.005 5.426 662.38 0.0015 0.00 read 8192

cgroup1 (weight : 60)
T Q Bytes Ops Time Rate IOPS Latency %CPU OP_Type ReqSize
0 1 167198720 20410 30.002 5.573 680.29 0.0015 0.00 read 8192
0 1 167198720 20410 30.002 5.573 680.29 0.0015 0.00 read 8192
1 1 167198720 20410 30.002 5.573 680.29 0.0015 0.00 read 8192

* Write IO test in Buffered IO mode

Command : xdd.linux -op write -targets 1 /dev/mapper/ioband1 -reqsize 8 -numreqs 128000 -verbose -timelimit 30

Result :
cgroup1 (weight : 10)
T Q Bytes Ops Time Rate IOPS Latency %CPU OP_Type ReqSize
0 1 550633472 67216 30.017 18.344 2239.30 0.0004 0.00 write 8192
0 1 550633472 67216 30.017 18.344 2239.30 0.0004 0.00 write 8192
1 1 550633472 67216 30.017 18.344 2239.30 0.0004 0.00 write 8192

cgroup1 (weight : 30)
T Q Bytes Ops Time Rate IOPS Latency %CPU OP_Type ReqSize
0 1 32768 4 32.278 0.001 0.12 8.0694 0.00 write 8192
0 1 32768 4 32.278 0.001 0.12 8.0694 0.00 write 8192
1 1 32768 4 32.278 0.001 0.12 8.0694 0.00 write 8192

cgroup1 (weight : 60)
T Q Bytes Ops Time Rate IOPS Latency %CPU OP_Type ReqSize
0 1 4505600 550 31.875 0.141 17.25 0.0580 0.00 write 8192
0 1 4505600 550 31.875 0.141 17.25 0.0580 0.00 write 8192
1 1 4505600 550 31.875 0.141 17.25 0.0580 0.00 write 8192

> I use iostat to see the time variation of bandwidth. The following
> are the outputs of iostat just after starting three tiotests on the
> same setting as yours.
>
> # iostat -p dm-0 -p dm-1 -p dm-2 1
> Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
> dm-0 5430.00 0.00 10860.00 0 10860
> dm-1 16516.00 0.00 16516.00 0 16516
> dm-2 32246.00 0.00 32246.00 0 32246
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 0.51 0.00 21.83 76.14 0.00 1.52
>
> Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
> dm-0 5720.00 0.00 11440.00 0 11440
> dm-1 16138.00 0.00 16138.00 0 16138
> dm-2 32734.00 0.00 32734.00 0 32734
> ...
>
Thank you for your kindness ^^
>
> Could you give me the O_DIRECT patch?
>
Of course, if you want. But it is nothing. The tiobench tool is very simple and light source code, so I just added the O_DIRECT option in tiotest.c of the tiobench testing tool. Anyway, after I make a patch file, I will send it to you.

Best Regards,
Dong-Jae Kang

From ryov at valinux.co.jp Thu Oct 9 05:14:14 2008 From: ryov at valinux.co.jp (Ryo Tsuruta) Date: Thu, 09 Oct 2008 21:14:14 +0900 (JST) Subject: [PATCH 0/2] dm-ioband: I/O bandwidth controller v1.7.0: Introduction In-Reply-To: <2891419e0810082315v28f2f4cbu5f95230db3be0bc1@mail.gmail.com> References: <2891419e0810080129sb6b35y11362f4bef71c174@mail.gmail.com> <20081008.194022.226783199.ryov@valinux.co.jp> <2891419e0810082315v28f2f4cbu5f95230db3be0bc1@mail.gmail.com> Message-ID: <20081009.211414.193713198.ryov@valinux.co.jp>

Hi Dong-Jae,

> So, I tested dm-ioband and bio-cgroup patches with another IO testing
> tool, xdd ver6.5(http://www.ioperformance.com/), after your reply.
> Xdd supports O_DIRECT mode and time limit options.
> I think, personally, it is proper tool for testing of IO controllers
> in Linux Container ML.

Xdd is really useful for me. Thanks for letting me know.

> And I found some strange points in test results. In fact, it will be
> not strange for other persons^^
>
> 1. dm-ioband can control IO bandwidth well in O_DIRECT mode(read and
> write), I think the result is very reasonable. But it can't control it
> in Buffered mode when I checked only the output of xdd. I think the
> bio-cgroup patches are for solving this problem, is that right? If so,
> how can I check or confirm the role of the bio-cgroup patches?
>
> 2. As shown in the test results, the IO performance in Buffered IO mode
> is very low compared with that in O_DIRECT mode. In my opinion, the
> reverse case is more natural in real life.
> Can you give me an answer about it?

Your results show all xdd programs belong to the same cgroup; could you explain your test procedure to me in detail?

To know how many I/Os are actually issued to a physical device in buffered mode within a measurement period, you should check the /sys/block/<dev>/stat file just before starting a test program and just after the end of the test program. The contents of the stat file are described in the following document: kernel/Documentation/block/stat.txt

> 3. Compared with the physical bandwidth (when it is checked with one
> process and without a dm-ioband device), the sum of the bandwidth under
> dm-ioband has a very considerable gap from the physical bandwidth. I
> wonder why. Is it overhead of dm-ioband or the bio-cgroup patches,
> or are there other reasons?

The following are the results on my PC with a SATA disk, and there is no big difference between with and without dm-ioband. Please try the same thing if you have time.

without dm-ioband
=================
# xdd.linux -op write -queuedepth 16 -targets 1 /dev/sdb1 \
  -reqsize 8 -numreqs 128000 -verbose -timelimit 30 -dio -randomize
T Q Bytes Ops Time Rate IOPS Latency %CPU OP_Type ReqSize
0 16 140001280 17090 30.121 4.648 567.38 0.0018 0.01 write 8192

with dm-ioband
==============
* cgroup1 (weight 10)
# cat /cgroup/1/bio.id
1
# echo $$ > /cgroup/1/tasks
# xdd.linux -op write -queuedepth 16 -targets 1 /dev/mapper/ioband1 -reqsize 8 -numreqs 128000 -verbose -timelimit 30 -dio -randomize
T Q Bytes Ops Time Rate IOPS Latency %CPU OP_Type ReqSize
0 16 14393344 1757 30.430 0.473 57.74 0.0173 0.00 write 8192

* cgroup2 (weight 20)
# cat /cgroup/2/bio.id
2
# echo $$ > /cgroup/2/tasks
# xdd.linux -op write -queuedepth 16 -targets 1 /dev/mapper/ioband1 -reqsize 8 -numreqs 128000 -verbose -timelimit 30 -dio -randomize
T Q Bytes Ops Time Rate IOPS Latency %CPU OP_Type ReqSize
0 16 44113920 5385 30.380 1.452 177.25 0.0056 0.00 write 8192

* cgroup3 (weight 60)
# cat /cgroup/3/bio.id
3
# echo $$ > /cgroup/3/tasks
# xdd.linux -op write -queuedepth 16 -targets 1 /dev/mapper/ioband1 -reqsize 8 -numreqs 128000 -verbose -timelimit 30 -dio -randomize
T Q Bytes Ops Time Rate IOPS Latency %CPU OP_Type ReqSize
0 16 82485248 10069 30.256 2.726 332.79 0.0030 0.00 write 8192

Total
=====
              Bytes      Ops    Rate   IOPS
w/o dm-ioband 140001280  17090  4.648  567.38
w/  dm-ioband 140992512  17211  4.651  567.78

> > Could you give me the O_DIRECT patch?
>
> Of course, if you want. But it is nothing. The tiobench tool is very
> simple and light source code, so I just added the O_DIRECT option in
> tiotest.c of the tiobench testing tool.
> Anyway, after I make a patch file, I will send it to you.

Thank you very much!
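As a concrete illustration of the before/after sampling suggested above, here is a minimal user-space sketch; the device name "sdb" and the helper name are only assumptions, and the field positions follow Documentation/block/stat.txt (sectors read is the 3rd field, sectors written the 7th).

#include <stdio.h>

/* Illustrative only: print the sector counters of a block device so they
 * can be sampled once before and once after a test run and subtracted. */
static void print_sectors(const char *dev)
{
	unsigned long long v[11] = { 0 };
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/block/%s/stat", dev);
	f = fopen(path, "r");
	if (!f)
		return;
	if (fscanf(f, "%llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
		   &v[0], &v[1], &v[2], &v[3], &v[4], &v[5],
		   &v[6], &v[7], &v[8], &v[9], &v[10]) == 11)
		printf("%s: sectors read %llu, sectors written %llu\n",
		       dev, v[2], v[6]);
	fclose(f);
}

int main(void)
{
	print_sectors("sdb");	/* run before and after the xdd run */
	return 0;
}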
Ryo Tsuruta From herbert at gondor.apana.org.au Thu Oct 9 08:30:35 2008 From: herbert at gondor.apana.org.au (Herbert Xu) Date: Thu, 9 Oct 2008 23:30:35 +0800 Subject: [PATCH 2/2] virtio_net: Improve the recv buffer allocation scheme In-Reply-To: <200810091155.59731.rusty@rustcorp.com.au> References: <1223494499-18732-1-git-send-email-markmc@redhat.com> <1223494499-18732-2-git-send-email-markmc@redhat.com> <200810091155.59731.rusty@rustcorp.com.au> Message-ID: <20081009153035.GA21542@gondor.apana.org.au> On Thu, Oct 09, 2008 at 11:55:59AM +1100, Rusty Russell wrote: > > There are three approaches we should investigate before adding YA feature. > Obviously, we can simply increase the number of ring entries. That's not going to work so well as you need to increase the ring size by MAX_SKB_FRAGS times to achieve the same level of effect. Basically the current scheme is either going to suck at non-TSO traffic or it's going to chew too much resources. > Secondly, we can put the virtio_net_hdr at the head of the skb data (this is > also worth considering for xmit I think if we have headroom) and drop > MAX_SKB_FRAGS which contains a gratuitous +2. That's fine but having skb->data in the ring still means two different kinds of memory in there and it sucks when you only have 1500-byte packets. > Thirdly, we can try to coalesce contiguous buffers. The page caching scheme > we have might help here, I don't know. Maybe we should be explicitly trying > to allocate higher orders. That's not really the key problem here. The problem here is that the scheme we're currently using in virtio-net is simply broken when it comes to 1500-byte sized packets. Most of the entries on the ring buffer go to waste. We need a scheme that handles both 1500-byte packets as well as 64K-byte size ones, and without holding down 16M of memory per guest. > > The size of the logical buffer is > > returned to the guest rather than the size of the individual smaller > > buffers. > > That's a virtio transport breakage: can you use the standard virtio mechanism, > just put the extended length or number of extra buffers inside the > virtio_net_hdr? Sure that sounds reasonable. > > Make use of this support by supplying single page receive buffers to > > the host. On receive, we extract the virtio_net_hdr, copy 128 bytes of > > the payload to the skb's linear data buffer and adjust the fragment > > offset to point to the remaining data. This ensures proper alignment > > and allows us to not use any paged data for small packets. If the > > payload occupies multiple pages, we simply append those pages as > > fragments and free the associated skbs. > > > + char *p = page_address(skb_shinfo(skb)->frags[0].page); > ... > > + memcpy(hdr, p, sizeof(*hdr)); > > + p += sizeof(*hdr); > > I think you need kmap_atomic() here to access the page. And yes, that will > effect performance :( No we don't. kmap would only be necessary for highmem which we did not request. 
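To illustrate the point about highmem (a sketch only, not part of the patch): a page obtained with plain GFP_KERNEL comes from lowmem and already has a permanent kernel mapping, so page_address() is enough; kmap_atomic() would only be needed if the page had been allocated with __GFP_HIGHMEM.

#include <linux/gfp.h>
#include <linux/mm.h>

/* Sketch: read the first byte of a freshly allocated lowmem page.
 * Because __GFP_HIGHMEM was not requested, the page can be dereferenced
 * directly through page_address() without any kmap. */
static char peek_first_byte(void)
{
	struct page *page = alloc_page(GFP_KERNEL);
	char c;

	if (!page)
		return 0;
	c = *(char *)page_address(page);
	__free_page(page);
	return c;
}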
Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt From chrisw at sous-sol.org Thu Oct 9 08:35:46 2008 From: chrisw at sous-sol.org (Chris Wright) Date: Thu, 9 Oct 2008 08:35:46 -0700 Subject: [PATCH 2/2] virtio_net: Improve the recv buffer allocation scheme In-Reply-To: <200810091155.59731.rusty@rustcorp.com.au> References: <1223494499-18732-1-git-send-email-markmc@redhat.com> <1223494499-18732-2-git-send-email-markmc@redhat.com> <200810091155.59731.rusty@rustcorp.com.au> Message-ID: <20081009153546.GA6912@sequoia.sous-sol.org> * Rusty Russell (rusty at rustcorp.com.au) wrote: > On Thursday 09 October 2008 06:34:59 Mark McLoughlin wrote: > > From: Herbert Xu > > > > If segmentation offload is enabled by the host, we currently allocate > > maximum sized packet buffers and pass them to the host. This uses up > > 20 ring entries, allowing us to supply only 20 packet buffers to the > > host with a 256 entry ring. This is a huge overhead when receiving > > small packets, and is most keenly felt when receiving MTU sized > > packets from off-host. > > There are three approaches we should investigate before adding YA feature. > Obviously, we can simply increase the number of ring entries. Tried that, it didn't help much. I don't have my numbers handy, but levelled off at about 512 and was a modest boost. It's still wasteful to preallocate like that on the off-chance it's a large packet. thanks, -chris From markmc at redhat.com Thu Oct 9 10:40:13 2008 From: markmc at redhat.com (Mark McLoughlin) Date: Thu, 09 Oct 2008 18:40:13 +0100 Subject: [PATCH 2/2] virtio_net: Improve the recv buffer allocation scheme In-Reply-To: <20081009153035.GA21542@gondor.apana.org.au> References: <1223494499-18732-1-git-send-email-markmc@redhat.com> <1223494499-18732-2-git-send-email-markmc@redhat.com> <200810091155.59731.rusty@rustcorp.com.au> <20081009153035.GA21542@gondor.apana.org.au> Message-ID: <1223574013.13792.23.camel@blaa> On Thu, 2008-10-09 at 23:30 +0800, Herbert Xu wrote: > On Thu, Oct 09, 2008 at 11:55:59AM +1100, Rusty Russell wrote: > > > > There are three approaches we should investigate before adding YA feature. > > Obviously, we can simply increase the number of ring entries. > > That's not going to work so well as you need to increase the ring > size by MAX_SKB_FRAGS times to achieve the same level of effect. > > Basically the current scheme is either going to suck at non-TSO > traffic or it's going to chew too much resources. Yeah ... to put some numbers on it, assume we have a 256 entry ring now. Currently, with GSO enabled in the host the guest will fill this with 12 buffer heads with 20 buffers per head (a 10 byte buffer, an MTU sized buffer and 18 page sized buffers). That means we allocate ~900k for receive buffers, 12k for the ring, fail to use 16 ring entries and the ring ends up with a capacity of 12 packets. In the case of MTU sized packets from an off-host source, that's a huge amount of overhead for ~17k of data. If we wanted to match the packet capacity that Herbert's suggestion enables (i.e. 256 packets), we'd need to bump the ring size to 4k entries (assuming we reduce it to 19 entries per packet). This would mean we'd need to allocate ~200k for the ring and ~18M in receive buffers. Again, assuming MTU sized packets, that's massive overhead for ~400k of data. 
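For what it's worth, the ~900k figure above can be reproduced directly; a quick sketch of the arithmetic, assuming 4096-byte pages, a 1500-byte MTU buffer and the 10-byte header buffer:

#include <stdio.h>

int main(void)
{
	/* 2 + MAX_SKB_FRAGS = 20 buffers per packet: one 10-byte header
	 * buffer, one MTU-sized buffer and 18 page-sized buffers. */
	long per_packet_entries = 20;
	long per_packet_bytes = 10 + 1500 + 18 * 4096;		/* 75238 */
	long ring_entries = 256;
	long packets = ring_entries / per_packet_entries;	/* 12 */

	printf("packets per ring: %ld\n", packets);
	printf("unused entries:   %ld\n",
	       ring_entries - packets * per_packet_entries);	/* 16 */
	printf("receive buffers:  %ld bytes (~%ldk)\n",
	       packets * per_packet_bytes,
	       packets * per_packet_bytes / 1000);		/* ~902k */
	return 0;
}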
> > Secondly, we can put the virtio_net_hdr at the head of the skb data (this is > > also worth considering for xmit I think if we have headroom) and drop > > MAX_SKB_FRAGS which contains a gratuitous +2. > > That's fine but having skb->data in the ring still means two > different kinds of memory in there and it sucks when you only > have 1500-byte packets. Also, including virtio_net_hdr in the data buffer would need another feature flag. Rightly or wrongly, KVM's implementation requires virtio_net_hdr to be the first buffer: if (elem.in_num < 1 || elem.in_sg[0].iov_len != sizeof(*hdr)) { fprintf(stderr, "virtio-net header not in first element\n"); exit(1); } i.e. it's part of the ABI ... at least as KVM sees it :-) > > > The size of the logical buffer is > > > returned to the guest rather than the size of the individual smaller > > > buffers. > > > > That's a virtio transport breakage: can you use the standard virtio mechanism, > > just put the extended length or number of extra buffers inside the > > virtio_net_hdr? > > Sure that sounds reasonable. I'll give that a shot. Cheers, Mark. From anthony at codemonkey.ws Thu Oct 9 12:26:25 2008 From: anthony at codemonkey.ws (Anthony Liguori) Date: Thu, 09 Oct 2008 14:26:25 -0500 Subject: [PATCH 2/2] virtio_net: Improve the recv buffer allocation scheme In-Reply-To: <1223574013.13792.23.camel@blaa> References: <1223494499-18732-1-git-send-email-markmc@redhat.com> <1223494499-18732-2-git-send-email-markmc@redhat.com> <200810091155.59731.rusty@rustcorp.com.au> <20081009153035.GA21542@gondor.apana.org.au> <1223574013.13792.23.camel@blaa> Message-ID: <48EE5AE1.5030002@codemonkey.ws> Mark McLoughlin wrote: > > Also, including virtio_net_hdr in the data buffer would need another > feature flag. Rightly or wrongly, KVM's implementation requires > virtio_net_hdr to be the first buffer: > > if (elem.in_num < 1 || elem.in_sg[0].iov_len != sizeof(*hdr)) { > fprintf(stderr, "virtio-net header not in first element\n"); > exit(1); > } > > i.e. it's part of the ABI ... at least as KVM sees it :-) This is actually something that's broken in a nasty way. Having the header in the first element is not supposed to be part of the ABI but it sort of has to be ATM. If an older version of QEMU were to use a newer kernel, and the newer kernel had a larger header size, then if we just made the header be the first X bytes, QEMU has no way of knowing how many bytes that should be. Instead, the guest actually has to allocate the virtio-net header in such a way that it only presents the size depending on the features that the host supports. We don't use a simple versioning scheme, so you'd have to check for a combination of features advertised by the host but that's not good enough because the host may disable certain features. Perhaps the header size is whatever the longest element that has been commonly negotiated? So that's why this aggressive check is here. Not to necessarily cement this into the ABI but as a way to make someone figure out how to sanitize this all. 
Regards, Anthony Liguori From yu.zhao at intel.com Fri Oct 10 00:24:49 2008 From: yu.zhao at intel.com (Zhao, Yu) Date: Fri, 10 Oct 2008 15:24:49 +0800 Subject: [PATCH 4/6 v3] PCI: support SR-IOV capability In-Reply-To: <20080930223851.GC13611@ldl.fc.hp.com> References: <20080930223851.GC13611@ldl.fc.hp.com> Message-ID: <48EF0341.9040703@intel.com> Alex Chiang wrote: > Do you want to emit a kobject_uevent here after success? > > Alternatively, have you investigated making these virtual > functions into real struct device's? You get a lot of sysfs stuff > for free if you do so, including correct place in sysfs hierarchy > and uevents, etc. The virtual functions are represented by 'struct pci_dev', so, same as a real device, the standard sysfs entries are created by pci_bus_add_device. > > My major complaints from last round (more documentation, > shouldn't be a PCI hotplug driver) have been addressed. I'll let > others comment about the other parts of your patch series. Thanks, Alex. 
> > /ac > From markmc at redhat.com Fri Oct 10 01:30:31 2008 From: markmc at redhat.com (Mark McLoughlin) Date: Fri, 10 Oct 2008 09:30:31 +0100 Subject: [PATCH 2/2] virtio_net: Improve the recv buffer allocation scheme In-Reply-To: <48EE5AE1.5030002@codemonkey.ws> References: <1223494499-18732-1-git-send-email-markmc@redhat.com> <1223494499-18732-2-git-send-email-markmc@redhat.com> <200810091155.59731.rusty@rustcorp.com.au> <20081009153035.GA21542@gondor.apana.org.au> <1223574013.13792.23.camel@blaa> <48EE5AE1.5030002@codemonkey.ws> Message-ID: <1223627431.3618.41.camel@blaa> On Thu, 2008-10-09 at 14:26 -0500, Anthony Liguori wrote: > Mark McLoughlin wrote: > > > > Also, including virtio_net_hdr in the data buffer would need another > > feature flag. Rightly or wrongly, KVM's implementation requires > > virtio_net_hdr to be the first buffer: > > > > if (elem.in_num < 1 || elem.in_sg[0].iov_len != sizeof(*hdr)) { > > fprintf(stderr, "virtio-net header not in first element\n"); > > exit(1); > > } > > > > i.e. it's part of the ABI ... at least as KVM sees it :-) > > This is actually something that's broken in a nasty way. Having the > header in the first element is not supposed to be part of the ABI but it > sort of has to be ATM. > > If an older version of QEMU were to use a newer kernel, and the newer > kernel had a larger header size, then if we just made the header be the > first X bytes, QEMU has no way of knowing how many bytes that should be. > Instead, the guest actually has to allocate the virtio-net header in > such a way that it only presents the size depending on the features that > the host supports. We don't use a simple versioning scheme, so you'd > have to check for a combination of features advertised by the host but > that's not good enough because the host may disable certain features. > > Perhaps the header size is whatever the longest element that has been > commonly negotiated? > > So that's why this aggressive check is here. Not to necessarily cement > this into the ABI but as a way to make someone figure out how to > sanitize this all. Well, features may be orthogonal but they are still added sequentially to the ABI. So, you would have a kind of implicit ABI versioning, while still allowing individual selection of features. e.g. if NET_F_FOO adds "int foo" to the header and then NET_F_BAR adds "int bar" to the header then if NET_F_FOO is negotiated, the guest should only send a header with "foo" and if NET_F_FOO|NET_F_BAR or NET_F_BAR is negotiated, then the guest sends a header with both "foo" and "bar". Or put it another way, a host or guest may not implement NET_F_FOO but knowledge of the "foo" header field is part of the ABI of NET_F_BAR. That knowledge would be as simple as knowing that the field exists and that it should be ignored if the feature isn't used. Cheers, Mark. 
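To make that concrete, a rough sketch of the implicit-versioning idea, using the hypothetical NET_F_FOO / NET_F_BAR feature bits and "foo"/"bar" fields from the paragraph above (stand-in types, not real virtio code):

#include <stddef.h>

/* Hypothetical feature bits and extended header layout; fields are
 * appended in the order the features were added to the ABI. */
#define NET_F_FOO	(1u << 30)
#define NET_F_BAR	(1u << 31)

struct vnet_hdr_base { unsigned char flags, gso_type; };	/* stand-in */

struct vnet_hdr_ext {
	struct vnet_hdr_base hdr;
	int foo;	/* added together with NET_F_FOO */
	int bar;	/* added together with NET_F_BAR */
};

/* The header length is implied by the newest negotiated feature: a guest
 * that negotiates NET_F_BAR must also know about (and send) "foo", even
 * if NET_F_FOO itself was not negotiated. */
static size_t vnet_hdr_len(unsigned int features)
{
	if (features & NET_F_BAR)
		return offsetof(struct vnet_hdr_ext, bar) + sizeof(int);
	if (features & NET_F_FOO)
		return offsetof(struct vnet_hdr_ext, foo) + sizeof(int);
	return sizeof(struct vnet_hdr_base);
}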
From markmc at redhat.com Fri Oct 10 05:56:55 2008 From: markmc at redhat.com (Mark McLoughlin) Date: Fri, 10 Oct 2008 13:56:55 +0100 Subject: [PATCH 2/2] virtio_net: Improve the recv buffer allocation scheme In-Reply-To: <20081009153035.GA21542@gondor.apana.org.au> References: <1223494499-18732-1-git-send-email-markmc@redhat.com> <1223494499-18732-2-git-send-email-markmc@redhat.com> <200810091155.59731.rusty@rustcorp.com.au> <20081009153035.GA21542@gondor.apana.org.au> Message-ID: <1223643415.22246.3.camel@blaa> On Thu, 2008-10-09 at 23:30 +0800, Herbert Xu wrote: > On Thu, Oct 09, 2008 at 11:55:59AM +1100, Rusty Russell wrote: > > > The size of the logical buffer is > > > returned to the guest rather than the size of the individual smaller > > > buffers. > > > > That's a virtio transport breakage: can you use the standard virtio mechanism, > > just put the extended length or number of extra buffers inside the > > virtio_net_hdr? > > Sure that sounds reasonable. Okay, here we go. The new header is lamely called virtio_net_hdr2 - I've added some padding in there so we can extend it further in future. It gets messy for lguest because tun/tap isn't using the same header format anymore. Rusty - let me know if this looks reasonable and, if so, I'll merge it back into the original patches and resend. Cheers, Mark. diff --git a/Documentation/lguest/lguest.c b/Documentation/lguest/lguest.c index da934c2..0f840f2 100644 --- a/Documentation/lguest/lguest.c +++ b/Documentation/lguest/lguest.c @@ -940,14 +940,21 @@ static void handle_net_output(int fd, struct virtqueue *vq, bool timeout) { unsigned int head, out, in, num = 0; int len; - struct iovec iov[vq->vring.num]; + struct iovec iov[vq->vring.num + 1]; static int last_timeout_num; /* Keep getting output buffers from the Guest until we run out. */ - while ((head = get_vq_desc(vq, iov, &out, &in)) != vq->vring.num) { + while ((head = get_vq_desc(vq, &iov[1], &out, &in)) != vq->vring.num) { if (in) errx(1, "Input buffers in output queue?"); - len = writev(vq->dev->fd, iov, out); + + /* tapfd needs a virtio_net_hdr, not virtio_net_hdr2 */ + iov[0].iov_base = iov[1].iov_base; + iov[0].iov_len = sizeof(struct virtio_net_hdr); + iov[1].iov_base += sizeof(struct virtio_net_hdr2); + iov[1].iov_len -= sizeof(struct virtio_net_hdr2); + + len = writev(vq->dev->fd, iov, out + 1); if (len < 0) err(1, "Writing network packet to tun"); add_used_and_trigger(fd, vq, head, len); @@ -998,18 +1005,24 @@ static unsigned int get_net_recv_head(struct device *dev, struct iovec *iov, /* Here we add used recv buffers to the used queue but, also, return unused * buffers to the avail queue. 
*/ -static void add_net_recv_used(struct device *dev, unsigned int *heads, - int *bufsizes, int nheads, int used_len) +static void add_net_recv_used(struct device *dev, struct virtio_net_hdr2 *hdr2, + unsigned int *heads, int *bufsizes, + int nheads, int used_len) { int len, idx; /* Add the buffers we've actually used to the used queue */ len = idx = 0; while (len < used_len) { - add_used(dev->vq, heads[idx], used_len, idx); + if (bufsizes[idx] > (used_len - len)) + bufsizes[idx] = used_len - len; + add_used(dev->vq, heads[idx], bufsizes[idx], idx); len += bufsizes[idx++]; } + /* The guest needs to know how many buffers to fetch */ + hdr2->num_buffers = idx; + /* Return the rest of them back to the avail queue */ lg_last_avail(dev->vq) -= nheads - idx; dev->vq->inflight -= nheads - idx; @@ -1022,12 +1035,17 @@ static void add_net_recv_used(struct device *dev, unsigned int *heads, * Guest. */ static bool handle_tun_input(int fd, struct device *dev) { - struct iovec iov[dev->vq->vring.num]; + struct virtio_net_hdr hdr; + struct virtio_net_hdr2 *hdr2; + struct iovec iov[dev->vq->vring.num + 1]; unsigned int heads[NET_MAX_RECV_PAGES]; int bufsizes[NET_MAX_RECV_PAGES]; int nheads, len, iovcnt; - nheads = len = iovcnt = 0; + nheads = len = 0; + + /* First iov is for the header */ + iovcnt = 1; /* First we need enough network buffers from the Guests's recv * virtqueue for the largest possible packet. */ @@ -1056,13 +1074,26 @@ static bool handle_tun_input(int fd, struct device *dev) len += bufsizes[nheads++]; } + /* Read virtio_net_hdr from tapfd */ + iov[0].iov_base = &hdr; + iov[0].iov_len = sizeof(hdr); + + /* Read data into buffer after virtio_net_hdr2 */ + hdr2 = iov[1].iov_base; + iov[1].iov_base += sizeof(*hdr2); + iov[1].iov_len -= sizeof(*hdr2); + /* Read the packet from the device directly into the Guest's buffer. */ len = readv(dev->fd, iov, iovcnt); if (len <= 0) err(1, "reading network"); + /* Copy the virtio_net_hdr into the virtio_net_hdr2 */ + hdr2->hdr = hdr; + len += sizeof(*hdr2) - sizeof(hdr); + /* Return unused buffers to the recv queue */ - add_net_recv_used(dev, heads, bufsizes, nheads, len); + add_net_recv_used(dev, hdr2, heads, bufsizes, nheads, len); /* Fire in the hole ! */ trigger_irq(fd, dev->vq); diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 1780d6d..719e9dc 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -61,6 +61,7 @@ struct virtnet_info /* Host will merge rx buffers for big packets (shake it! shake it!) */ bool mergeable_rx_bufs; + bool use_vnet_hdr2; /* Receive & send queues. 
*/ struct sk_buff_head recv; @@ -70,11 +71,21 @@ struct virtnet_info struct page *pages; }; +static inline struct virtio_net_hdr2 *skb_vnet_hdr2(struct sk_buff *skb) +{ + return (struct virtio_net_hdr2 *)skb->cb; +} + static inline struct virtio_net_hdr *skb_vnet_hdr(struct sk_buff *skb) { return (struct virtio_net_hdr *)skb->cb; } +static inline void vnet_hdr2_to_sg(struct scatterlist *sg, struct sk_buff *skb) +{ + sg_init_one(sg, skb_vnet_hdr2(skb), sizeof(struct virtio_net_hdr2)); +} + static inline void vnet_hdr_to_sg(struct scatterlist *sg, struct sk_buff *skb) { sg_init_one(sg, skb_vnet_hdr(skb), sizeof(struct virtio_net_hdr)); @@ -135,43 +146,40 @@ static void receive_skb(struct net_device *dev, struct sk_buff *skb, dev->stats.rx_length_errors++; goto drop; } - len -= sizeof(struct virtio_net_hdr); if (vi->mergeable_rx_bufs) { + struct virtio_net_hdr2 *hdr2 = skb_vnet_hdr2(skb); unsigned int copy; - unsigned int plen; char *p = page_address(skb_shinfo(skb)->frags[0].page); - memcpy(hdr, p, sizeof(*hdr)); - p += sizeof(*hdr); + if (len > PAGE_SIZE) + len = PAGE_SIZE; + len -= sizeof(struct virtio_net_hdr2); - plen = PAGE_SIZE - sizeof(*hdr); - if (plen > len) - plen = len; + memcpy(hdr2, p, sizeof(*hdr2)); + p += sizeof(*hdr2); - copy = plen; + copy = len; if (copy > skb_tailroom(skb)) copy = skb_tailroom(skb); memcpy(skb_put(skb, copy), p, copy); len -= copy; - plen -= copy; - if (!plen) { + if (!len) { give_a_page(vi, skb_shinfo(skb)->frags[0].page); skb_shinfo(skb)->nr_frags--; } else { skb_shinfo(skb)->frags[0].page_offset += - sizeof(*hdr) + copy; - skb_shinfo(skb)->frags[0].size = plen; - skb->data_len += plen; - skb->len += plen; + sizeof(*hdr2) + copy; + skb_shinfo(skb)->frags[0].size = len; + skb->data_len += len; + skb->len += len; } - while ((len -= plen)) { + while (--hdr2->num_buffers) { struct sk_buff *nskb; - unsigned nlen; i = skb_shinfo(skb)->nr_frags; if (i >= MAX_SKB_FRAGS) { @@ -181,10 +189,10 @@ static void receive_skb(struct net_device *dev, struct sk_buff *skb, goto drop; } - nskb = vi->rvq->vq_ops->get_buf(vi->rvq, &nlen); + nskb = vi->rvq->vq_ops->get_buf(vi->rvq, &len); if (!nskb) { - pr_debug("%s: packet length error %d < %d\n", - dev->name, skb->len, len); + pr_debug("%s: rx error: %d buffers missing\n", + dev->name, hdr2->num_buffers); dev->stats.rx_length_errors++; goto drop; } @@ -196,16 +204,17 @@ static void receive_skb(struct net_device *dev, struct sk_buff *skb, skb_shinfo(nskb)->nr_frags = 0; kfree_skb(nskb); - plen = PAGE_SIZE; - if (plen > len) - plen = len; + if (len > PAGE_SIZE) + len = PAGE_SIZE; - skb_shinfo(skb)->frags[i].size = plen; + skb_shinfo(skb)->frags[i].size = len; skb_shinfo(skb)->nr_frags++; - skb->data_len += plen; - skb->len += plen; + skb->data_len += len; + skb->len += len; } } else { + len -= sizeof(struct virtio_net_hdr); + if (len <= MAX_PACKET_LEN) trim_pages(vi, skb); @@ -451,6 +460,7 @@ static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb) { int num, err; struct scatterlist sg[2+MAX_SKB_FRAGS]; + struct virtio_net_hdr2 *hdr2; struct virtio_net_hdr *hdr; const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest; @@ -461,7 +471,9 @@ static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb) dest[3], dest[4], dest[5]); /* Encode metadata header at front. 
*/ - hdr = skb_vnet_hdr(skb); + hdr2 = skb_vnet_hdr2(skb); + hdr = &hdr2->hdr; + if (skb->ip_summed == CHECKSUM_PARTIAL) { hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; hdr->csum_start = skb->csum_start - skb_headroom(skb); @@ -489,7 +501,13 @@ static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb) hdr->gso_size = hdr->hdr_len = 0; } - vnet_hdr_to_sg(sg, skb); + hdr2->num_buffers = 0; + + if (vi->use_vnet_hdr2) + vnet_hdr2_to_sg(sg, skb); + else + vnet_hdr_to_sg(sg, skb); + num = skb_to_sgvec(skb, sg+1, 0, skb->len) + 1; err = vi->svq->vq_ops->add_buf(vi->svq, sg, num, 0, skb); @@ -678,8 +696,10 @@ static int virtnet_probe(struct virtio_device *vdev) || virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_ECN)) vi->big_packets = true; - if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF)) + if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF)) { vi->mergeable_rx_bufs = true; + vi->use_vnet_hdr2 = true; + } /* We expect two virtqueues, receive then send. */ vi->rvq = vdev->config->find_vq(vdev, 0, skb_recv_done); diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h index 8f376a7..59a5079 100644 --- a/include/linux/virtio_net.h +++ b/include/linux/virtio_net.h @@ -45,4 +45,14 @@ struct virtio_net_hdr __u16 csum_start; /* Position to start checksumming from */ __u16 csum_offset; /* Offset after that to place checksum */ }; + +/* This is the version of the header to use when the MRG_RXBUF + * feature (or any later feature) has been negotiated. */ +struct virtio_net_hdr2 +{ + struct virtio_net_hdr hdr; + __u8 num_buffers; /* Number of merged rx buffers */ + __u8 pad[21]; /* Pad to 32 bytes */ +}; + #endif /* _LINUX_VIRTIO_NET_H */ From benh at kernel.crashing.org Sun Oct 12 22:16:53 2008 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Mon, 13 Oct 2008 16:16:53 +1100 Subject: [RFC 1/3] hvc_console: rework setup to replace irq functions with callbacks In-Reply-To: <200806031445.22561.borntraeger@de.ibm.com> References: <200806031444.21945.borntraeger@de.ibm.com> <200806031445.22561.borntraeger@de.ibm.com> Message-ID: <1223875013.8157.230.camel@pasglop> On Tue, 2008-06-03 at 14:45 +0200, Christian Borntraeger wrote: > This patch tries to change hvc_console to not use request_irq/free_irq if > the backend does not use irqs. This allows virtio_console to use hvc_console > without having a linker reference to request_irq/free_irq. > > The irq specific code is moved to hvc_irq.c and selected by the drivers that > use irqs (System p, System i, XEN). > > I replaced "irq" with the opaque name "data". The request_irq and free_irq > calls are replaced with notifier_add and notifier_del. I have also changed > the code a bit to call the notifier_add and notifier_del inside the spinlock > area as the callbacks are found via hp->ops. That's causing lockdep to scream, though I have a hard time figuring out what it thinks is wrong... Ingo, would you mind giving a hand parsing that output ? Thanks ! Cheers, Ben. console [udbg0] enabled pSeries detected, looking for LPAR capability... -> fw_feature_init() <- fw_feature_init() Machine is LPAR ! Using pSeries machine description Page orders: linear mapping = 24, virtual = 12, io = 12, vmemmap = 24 -> pSeries_init_early() -> fw_cmo_feature_init() CMO not available <- fw_cmo_feature_init() <- pSeries_init_early() Partition configured for 32 cpus. 
CPU maps initialized for 2 threads per core (thread shift is 1) Starting Linux PPC64 #22 SMP Mon Oct 13 14:18:44 EST 2008 ----------------------------------------------------- ppc64_pft_size = 0x1b physicalMemorySize = 0x80000000 htab_hash_mask = 0xfffff ----------------------------------------------------- Linux version 2.6.27-rc5-test (benh at grosgo) (gcc version 4.2.3 (Ubuntu 4.2.3-2ubuntu7)) #22 SMP Mon Oct 13 14:18:44 EST 2008 [boot]0012 Setup Arch Node 0 Memory: 0x0-0x44000000 Node 1 Memory: 0x44000000-0x80000000 -> smp_init_pSeries() <- smp_init_pSeries() EEH: No capable adapters found PPC64 nvram contains 7168 bytes Using shared processor idle loop Zone PFN ranges: DMA 0x00000000 -> 0x00008000 Normal 0x00008000 -> 0x00008000 Movable zone start PFN for each node early_node_map[2] active PFN ranges 0: 0x00000000 -> 0x00004400 1: 0x00004400 -> 0x00008000 On node 0 totalpages: 17408 DMA zone: 17382 pages, LIFO batch:1 On node 1 totalpages: 15360 DMA zone: 15337 pages, LIFO batch:1 [boot]0015 Setup Done Built 2 zonelists in Node order, mobility grouping on. Total pages: 32719 Policy zone: DMA Kernel command line: root=/dev/sdb1 [boot]0020 XICS Init [boot]0021 XICS Done pic: no ISA interrupt controller PID hash table entries: 4096 (order: 12, 32768 bytes) time_init: decrementer frequency = 188.046000 MHz time_init: processor frequency = 1502.496000 MHz clocksource: timebase mult[154579e] shift[22] registered clockevent: decrementer mult[3023] shift[16] cpu[0] Console: colour dummy device 80x25 console handover: boot [udbg0] -> real [hvc0] Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar ... MAX_LOCKDEP_SUBCLASSES: 8 ... MAX_LOCK_DEPTH: 48 ... MAX_LOCKDEP_KEYS: 8191 ... CLASSHASH_SIZE: 4096 ... MAX_LOCKDEP_ENTRIES: 8192 ... MAX_LOCKDEP_CHAINS: 16384 ... 
CHAINHASH_SIZE: 8192 memory used by lock dependency info: 4095 kB per task-struct memory footprint: 2688 bytes ------------------------ | Locking API testsuite: ---------------------------------------------------------------------------- | spin |wlock |rlock |mutex | wsem | rsem | -------------------------------------------------------------------------- A-A deadlock: ok | ok | ok | ok | ok | ok | A-B-B-A deadlock: ok | ok | ok | ok | ok | ok | A-B-B-C-C-A deadlock: ok | ok | ok | ok | ok | ok | A-B-C-A-B-C deadlock: ok | ok | ok | ok | ok | ok | A-B-B-C-C-D-D-A deadlock: ok | ok | ok | ok | ok | ok | A-B-C-D-B-D-D-A deadlock: ok | ok | ok | ok | ok | ok | A-B-C-D-B-C-D-A deadlock: ok | ok | ok | ok | ok | ok | double unlock: ok | ok | ok | ok | ok | ok | initialize held: ok | ok | ok | ok | ok | ok | bad unlock order: ok | ok | ok | ok | ok | ok | -------------------------------------------------------------------------- recursive read-lock: | ok | | ok | recursive read-lock #2: | ok | | ok | mixed read-write-lock: | ok | | ok | mixed write-read-lock: | ok | | ok | -------------------------------------------------------------------------- hard-irqs-on + irq-safe-A/12: ok | ok | ok | soft-irqs-on + irq-safe-A/12: ok | ok | ok | hard-irqs-on + irq-safe-A/21: ok | ok | ok | soft-irqs-on + irq-safe-A/21: ok | ok | ok | sirq-safe-A => hirqs-on/12: ok | ok | ok | sirq-safe-A => hirqs-on/21: ok | ok | ok | hard-safe-A + irqs-on/12: ok | ok | ok | soft-safe-A + irqs-on/12: ok | ok | ok | hard-safe-A + irqs-on/21: ok | ok | ok | soft-safe-A + irqs-on/21: ok | ok | ok | hard-safe-A + unsafe-B #1/123: ok | ok | ok | soft-safe-A + unsafe-B #1/123: ok | ok | ok | hard-safe-A + unsafe-B #1/132: ok | ok | ok | soft-safe-A + unsafe-B #1/132: ok | ok | ok | hard-safe-A + unsafe-B #1/213: ok | ok | ok | soft-safe-A + unsafe-B #1/213: ok | ok | ok | hard-safe-A + unsafe-B #1/231: ok | ok | ok | soft-safe-A + unsafe-B #1/231: ok | ok | ok | hard-safe-A + unsafe-B #1/312: ok | ok | ok | soft-safe-A + unsafe-B #1/312: ok | ok | ok | hard-safe-A + unsafe-B #1/321: ok | ok | ok | soft-safe-A + unsafe-B #1/321: ok | ok | ok | hard-safe-A + unsafe-B #2/123: ok | ok | ok | soft-safe-A + unsafe-B #2/123: ok | ok | ok | hard-safe-A + unsafe-B #2/132: ok | ok | ok | soft-safe-A + unsafe-B #2/132: ok | ok | ok | hard-safe-A + unsafe-B #2/213: ok | ok | ok | soft-safe-A + unsafe-B #2/213: ok | ok | ok | hard-safe-A + unsafe-B #2/231: ok | ok | ok | soft-safe-A + unsafe-B #2/231: ok | ok | ok | hard-safe-A + unsafe-B #2/312: ok | ok | ok | soft-safe-A + unsafe-B #2/312: ok | ok | ok | hard-safe-A + unsafe-B #2/321: ok | ok | ok | soft-safe-A + unsafe-B #2/321: ok | ok | ok | hard-irq lock-inversion/123: ok | ok | ok | soft-irq lock-inversion/123: ok | ok | ok | hard-irq lock-inversion/132: ok | ok | ok | soft-irq lock-inversion/132: ok | ok | ok | hard-irq lock-inversion/213: ok | ok | ok | soft-irq lock-inversion/213: ok | ok | ok | hard-irq lock-inversion/231: ok | ok | ok | soft-irq lock-inversion/231: ok | ok | ok | hard-irq lock-inversion/312: ok | ok | ok | soft-irq lock-inversion/312: ok | ok | ok | hard-irq lock-inversion/321: ok | ok | ok | soft-irq lock-inversion/321: ok | ok | ok | hard-irq read-recursion/123: ok | soft-irq read-recursion/123: ok | hard-irq read-recursion/132: ok | soft-irq read-recursion/132: ok | hard-irq read-recursion/213: ok | soft-irq read-recursion/213: ok | hard-irq read-recursion/231: ok | soft-irq read-recursion/231: ok | hard-irq read-recursion/312: ok | soft-irq read-recursion/312: 
ok | hard-irq read-recursion/321: ok | soft-irq read-recursion/321: ok | ------------------------------------------------------- Good, all 218 testcases passed! | --------------------------------- Dentry cache hash table entries: 262144 (order: 5, 2097152 bytes) Inode-cache hash table entries: 131072 (order: 4, 1048576 bytes) freeing bootmem node 0 freeing bootmem node 1 Memory: 1995968k/2097152k available (9280k kernel code, 101184k reserved, 1216k data, 7639k bss, 1984k init) SLUB: Genslabs=17, HWalign=128, Order=0-3, MinObjects=0, CPUs=32, Nodes=16 Calibrating delay loop... 374.78 BogoMIPS (lpj=749568) Mount-cache hash table entries: 4096 xics: map virq 16, hwirq 0x2 xics: unmask virq 16 -> map to hwirq 0x2 clockevent: decrementer mult[3023] shift[16] cpu[1] Processor 1 found. clockevent: decrementer mult[3023] shift[16] cpu[2] Processor 2 found. clockevent: decrementer mult[3023] shift[16] cpu[3] Processor 3 found. clockevent: decrementer mult[3023] shift[16] cpu[4] Processor 4 found. clockevent: decrementer mult[3023] shift[16] cpu[5] Processor 5 found. clockevent: decrementer mult[3023] shift[16] cpu[6] Processor 6 found. clockevent: decrementer mult[3023] shift[16] cpu[7] Processor 7 found. Brought up 8 CPUs Node 0 CPUs: 0-7 Node 1 CPUs: CPU0 attaching sched-domain: domain 0: span 0-1 level SIBLING groups: 0 1 domain 1: span 0-7 level CPU groups: 0-1 2-3 4-5 6-7 domain 2: span 0-7 level NODE groups: 0-7 CPU1 attaching sched-domain: domain 0: span 0-1 level SIBLING groups: 1 0 domain 1: span 0-7 level CPU groups: 0-1 2-3 4-5 6-7 domain 2: span 0-7 level NODE groups: 0-7 CPU2 attaching sched-domain: domain 0: span 2-3 level SIBLING groups: 2 3 domain 1: span 0-7 level CPU groups: 2-3 4-5 6-7 0-1 domain 2: span 0-7 level NODE groups: 0-7 CPU3 attaching sched-domain: domain 0: span 2-3 level SIBLING groups: 3 2 domain 1: span 0-7 level CPU groups: 2-3 4-5 6-7 0-1 domain 2: span 0-7 level NODE groups: 0-7 CPU4 attaching sched-domain: domain 0: span 4-5 level SIBLING groups: 4 5 domain 1: span 0-7 level CPU groups: 4-5 6-7 0-1 2-3 domain 2: span 0-7 level NODE groups: 0-7 CPU5 attaching sched-domain: domain 0: span 4-5 level SIBLING groups: 5 4 domain 1: span 0-7 level CPU groups: 4-5 6-7 0-1 2-3 domain 2: span 0-7 level NODE groups: 0-7 CPU6 attaching sched-domain: domain 0: span 6-7 level SIBLING groups: 6 7 domain 1: span 0-7 level CPU groups: 6-7 0-1 2-3 4-5 domain 2: span 0-7 level NODE groups: 0-7 CPU7 attaching sched-domain: domain 0: span 6-7 level SIBLING groups: 7 6 domain 1: span 0-7 level CPU groups: 6-7 0-1 2-3 4-5 domain 2: span 0-7 level NODE groups: 0-7 khelper used greatest stack depth: 10464 bytes left net_namespace: 1280 bytes NET: Registered protocol family 16 IBM eBus Device Driver PCI: Probing PCI hardware PCI: Probing PCI hardware done SCSI subsystem initialized usbcore: registered new interface driver usbfs usbcore: registered new interface driver hub usbcore: registered new device driver usb NET: Registered protocol family 2 IP route cache hash table entries: 16384 (order: 1, 131072 bytes) TCP established hash table entries: 65536 (order: 4, 1048576 bytes) TCP bind hash table entries: 65536 (order: 6, 4194304 bytes) TCP: Hash tables configured (established 65536 bind 65536) TCP reno registered NET: Registered protocol family 1 xics: map virq 17, hwirq 0xa0000 xics: map virq 18, hwirq 0xa0002 IOMMU table initialized, virtual merging enabled xics: map virq 20, hwirq 0xa0014 xics: map virq 19, hwirq 0x90001 xics: unmask virq 19 -> map to hwirq 0x90001 RTAS daemon 
started rtasd: will sleep for 7500 milliseconds rtasd: logging event RTAS: event: 14, Type: Platform Error, Severity: 2 audit: initializing netlink socket (disabled) type=2000 audit(1223873798.428:1): initialized HugeTLB registered 16 MB page size, pre-allocated 0 pages Installing knfsd (copyright (C) 1996 okir at monad.swb.de). msgmni has been set to 3896 Block layer SCSI generic (bsg) driver version 0.4 loaded (major 254) io scheduler noop registered io scheduler anticipatory registered (default) io scheduler deadline registered io scheduler cfq registered vio_register_driver: driver hvc_console registering HVSI: registered 0 devices Generic RTC Driver v1.07 Serial: 8250/16550 driver4 ports, IRQ sharing disabled brd: module loaded loop: module loaded Intel(R) PRO/1000 Network Driver - version 7.3.20-k3-NAPI Copyright (c) 1999-2006 Intel Corporation. IBM eHEA ethernet device driver (Release EHEA_0092) pcnet32.c:v1.35 21.Apr.2008 tsbogend at alpha.franken.de e100: Intel(R) PRO/100 Network Driver, 3.5.23-k4-NAPI e100: Copyright(c) 1999-2006 Intel Corporation /home/benh/kernels/linux-test-powerpc/drivers/net/ibmveth.c: ibmveth: IBM i/pSeries Virtual Ethernet Driver 1.03 vio_register_driver: driver ibmveth registering console [netcon0] enabled netconsole: network logging started Uniform Multi-Platform E-IDE driver vio_register_driver: driver ibmvscsi registering ibmvscsi 30000014: SRP_VERSION: 16.a xics: unmask virq 20 -> map to hwirq 0xa0014 scsi0 : IBM POWER Virtual SCSI Adapter 1.5.8 ibmvscsi 30000014: partner initialization complete ibmvscsi 30000014: sent SRP login ibmvscsi 30000014: SRP_LOGIN succeeded ibmvscsi 30000014: host srp version: 16.a, host partition 1-Diego-VIOS (1), OS 3, max io 262144 scsi 0:0:1:0: Direct-Access AIX VDASD 0001 PQ: 0 ANSI: 3 scsi 0:0:2:0: Direct-Access AIX VDASD 0001 PQ: 0 ANSI: 3 scsi 0:0:3:0: Direct-Access AIX VDASD 0001 PQ: 0 ANSI: 3 st: Version 20080504, fixed bufsize 32768, s/g segs 256 Driver 'st' needs updating - please use bus_type methods Driver 'sd' needs updating - please use bus_type methods sd 0:0:1:0: [sda] 20971520 512-byte hardware sectors (10737 MB) sd 0:0:1:0: [sda] Write Protect is off sd 0:0:1:0: [sda] Mode Sense: 17 00 00 08 sd 0:0:1:0: [sda] Cache data unavailable sd 0:0:1:0: [sda] Assuming drive cache: write through sd 0:0:1:0: [sda] 20971520 512-byte hardware sectors (10737 MB) sd 0:0:1:0: [sda] Write Protect is off sd 0:0:1:0: [sda] Mode Sense: 17 00 00 08 sd 0:0:1:0: [sda] Cache data unavailable sd 0:0:1:0: [sda] Assuming drive cache: write through sda: sda1 sda2 sda3 sd 0:0:1:0: [sda] Attached SCSI disk sd 0:0:2:0: [sdb] 20971520 512-byte hardware sectors (10737 MB) sd 0:0:2:0: [sdb] Write Protect is off sd 0:0:2:0: [sdb] Mode Sense: 17 00 00 08 sd 0:0:2:0: [sdb] Cache data unavailable sd 0:0:2:0: [sdb] Assuming drive cache: write through sd 0:0:2:0: [sdb] 20971520 512-byte hardware sectors (10737 MB) sd 0:0:2:0: [sdb] Write Protect is off sd 0:0:2:0: [sdb] Mode Sense: 17 00 00 08 sd 0:0:2:0: [sdb] Cache data unavailable sd 0:0:2:0: [sdb] Assuming drive cache: write through sdb: sdb1 sd 0:0:2:0: [sdb] Attached SCSI disk sd 0:0:3:0: [sdc] 20971520 512-byte hardware sectors (10737 MB) sd 0:0:3:0: [sdc] Write Protect is off sd 0:0:3:0: [sdc] Mode Sense: 17 00 00 08 sd 0:0:3:0: [sdc] Cache data unavailable sd 0:0:3:0: [sdc] Assuming drive cache: write through sd 0:0:3:0: [sdc] 20971520 512-byte hardware sectors (10737 MB) sd 0:0:3:0: [sdc] Write Protect is off sd 0:0:3:0: [sdc] Mode Sense: 17 00 00 08 sd 0:0:3:0: [sdc] Cache data 
unavailable sd 0:0:3:0: [sdc] Assuming drive cache: write through sdc: sdc1 sdc2 sd 0:0:3:0: [sdc] Attached SCSI disk Driver 'sr' needs updating - please use bus_type methods sd 0:0:1:0: Attached scsi generic sg0 type 0 sd 0:0:2:0: Attached scsi generic sg1 type 0 sd 0:0:3:0: Attached scsi generic sg2 type 0 ohci_hcd: 2006 August 04 USB 1.1 'Open' Host Controller (OHCI) Driver Initializing USB Mass Storage driver... usbcore: registered new interface driver usb-storage USB Mass Storage support registered. mice: PS/2 mouse device common for all mice md: linear personality registered for level -1 md: raid0 personality registered for level 0 md: raid1 personality registered for level 1 device-mapper: ioctl: 4.14.0-ioctl (2008-04-23) initialised: dm-devel at redhat.com usbcore: registered new interface driver hiddev usbcore: registered new interface driver usbhid usbhid: v2.6:USB HID core driver oprofile: using ppc64/power5 performance monitoring. IPv4 over IPv4 tunneling driver TCP cubic registered NET: Registered protocol family 17 RPC: Registered udp transport module. RPC: Registered tcp transport module. registered taskstats version 1 md: Autodetecting RAID arrays. md: Scanned 0 and added 0 devices. md: autorun ... md: ... autorun DONE. EXT3-fs: INFO: recovery required on readonly filesystem. EXT3-fs: write access will be enabled during recovery. kjournald starting. Commit interval 5 seconds EXT3-fs: recovery complete. EXT3-fs: mounted filesystem with ordered data mode. VFS: Mounted root (ext3 filesystem) readonly. Freeing unused kernel memory: 1984k freed xics: unmask virq 17 -> map to hwirq 0xa0000 xics: mask virq 17 xics: unmask virq 17 -> map to hwirq 0xa0000 runlevel used greatest stack depth: 8816 bytes left grep used greatest stack depth: 8160 bytes left mount used greatest stack depth: 7344 bytes left ckbcomp used greatest stack depth: 5920 bytes left EXT3 FS on sdb1, internal journal kjournald starting. Commit interval 5 seconds EXT3 FS on sdc2, internal journal EXT3-fs: mounted filesystem with ordered data mode. Adding 1015680k swap on /dev/mapper/VolGroup00-LogVol01. Priority:-1 extents:1 across:1015680k xics: unmask virq 18 -> map to hwirq 0xa0002 warning: `dhclient3' uses 32-bit capabilities (legacy support in use) ========================================================= [ INFO: possible irq lock inversion dependency detected ] 2.6.27-rc5-test #22 --------------------------------------------------------- swapper/0 just changed the state of lock: (&hp->lock){+...}, at: [] .hvc_poll+0x50/0x2f0 but this lock took another, hard-irq-unsafe lock in the past: (proc_subdir_lock){--..} and interrupts could create inverse lock ordering between them. other info that might help us debug this: no locks held by swapper/0. 
the first lock's dependencies: -> (&hp->lock){+...} ops: 1563368095744 { initial-use at: [] .lock_acquire+0xa4/0xec [] ._spin_lock_irqsave+0x5c/0xc0 [] .hvc_get_by_index+0x54/0x110 [] .hvc_open+0x34/0x14c [] .tty_open+0x250/0x3b4 [] .chrdev_open+0x1c4/0x204 [] .__dentry_open+0x190/0x308 [] .do_filp_open+0x400/0x84c [] .do_sys_open+0x80/0x140 [] .init_post+0x4c/0x108 [] .kernel_init+0x2a8/0x2cc [] .kernel_thread+0x54/0x70 in-hardirq-W at: [] .lock_acquire+0xa4/0xec [] ._spin_lock_irqsave+0x5c/0xc0 [] .hvc_poll+0x50/0x2f0 [] .hvc_handle_interrupt+0x14/0x3c [] .handle_IRQ_event+0x50/0xc8 [] .handle_fasteoi_irq+0x120/0x1bc [] .call_handle_irq+0x1c/0x2c [] .do_IRQ+0x128/0x1fc [] hardware_interrupt_entry+0x1c/0x98 [] .cpu_idle+0x124/0x1f8 [] .rest_init+0x7c/0x94 [] .start_kernel+0x48c/0x4b4 [] .start_here_common+0x1c/0x34 } ... key at: [] __key.17726+0x0/0x8 -> (&tty->buf.lock){....} ops: 146028888064 { initial-use at: [] .lock_acquire+0xa4/0xec [] ._spin_lock_irqsave+0x5c/0xc0 [] .tty_buffer_request_room+0x40/0x198 [] .hvc_poll+0xd0/0x2f0 [] .khvcd+0x84/0x18c [] .kthread+0x78/0xc4 [] .kernel_thread+0x54/0x70 } ... key at: [] __key.20956+0x0/0x8 -> (&zone->lock){.+..} ops: 167834437025792 { initial-use at: [] .lock_acquire+0xa4/0xec [] ._spin_lock+0x44/0xa0 [] .free_pages_bulk+0x60/0x318 [] .free_hot_cold_page+0x20c/0x278 [] .free_all_bootmem_core+0x12c/0x240 [] .mem_init+0x9c/0x218 [] .start_kernel+0x39c/0x4b4 [] .start_here_common+0x1c/0x34 in-softirq-W at: [] .lock_acquire+0xa4/0xec [] ._spin_lock+0x44/0xa0 [] .__free_pages_ok+0x1a4/0x414 [] .__free_slab+0x140/0x16c [] .kmem_cache_free+0xec/0x148 [] .free_thread_info+0x24/0x3c [] .free_task+0x30/0x60 [] .delayed_put_task_struct+0x38/0x4c [] .__rcu_process_callbacks+0x1e4/0x2bc [] .rcu_process_callbacks+0x3c/0x64 [] .__do_softirq+0xc8/0x198 [] .call_do_softirq+0x14/0x24 [] .do_softirq+0x94/0x114 [] .irq_exit+0x70/0x88 [] .timer_interrupt+0xd4/0x100 [] decrementer_common+0x104/0x180 [] .cpu_idle+0x124/0x1f8 [] .start_secondary+0x350/0x388 [] .start_secondary_prolog+0x10/0x14 } ... key at: [] __key.26488+0x0/0x8 ... acquired at: [] .__lock_acquire+0x814/0x8ec [] .lock_acquire+0xa4/0xec [] ._spin_lock+0x44/0xa0 [] .rmqueue_bulk+0x50/0xe0 [] .get_page_from_freelist+0x2d4/0x758 [] .__alloc_pages_internal+0x158/0x4dc [] .alloc_pages_current+0xcc/0xf4 [] .new_slab+0x88/0x32c [] .__slab_alloc+0x29c/0x51c [] .__kmalloc+0xc4/0x160 [] .tty_buffer_request_room+0xe8/0x198 [] .hvc_poll+0xd0/0x2f0 [] .khvcd+0x84/0x18c [] .kthread+0x78/0xc4 [] .kernel_thread+0x54/0x70 ... 
acquired at: [] .__lock_acquire+0x814/0x8ec [] .lock_acquire+0xa4/0xec [] ._spin_lock_irqsave+0x5c/0xc0 [] .tty_buffer_request_room+0x40/0x198 [] .hvc_poll+0xd0/0x2f0 [] .khvcd+0x84/0x18c [] .kthread+0x78/0xc4 [] .kernel_thread+0x54/0x70 -> (&irq_desc_lock_class){++..} ops: 18859201396736 { initial-use at: [] .lock_acquire+0xa4/0xec [] ._spin_lock_irqsave+0x5c/0xc0 [] .set_irq_chip+0x60/0xac [] .set_irq_chip_and_handler+0x20/0x48 [] .xics_host_map+0x6c/0x90 [] .irq_setup_virq+0x6c/0xa8 [] .irq_create_mapping+0x108/0x13c [] .smp_xics_probe+0x24/0xb4 [] .smp_prepare_cpus+0x98/0x188 [] .kernel_init+0x64/0x2cc [] .kernel_thread+0x54/0x70 in-hardirq-W at: [] .lock_acquire+0xa4/0xec [] ._spin_lock+0x44/0xa0 [] .handle_fasteoi_irq+0x3c/0x1bc [] .call_handle_irq+0x1c/0x2c [] .do_IRQ+0x128/0x1fc [] hardware_interrupt_entry+0x1c/0x98 [] .cpu_idle+0x124/0x1f8 [] .rest_init+0x7c/0x94 [] .start_kernel+0x48c/0x4b4 [] .start_here_common+0x1c/0x34 in-softirq-W at: [] .lock_acquire+0xa4/0xec [] ._spin_lock+0x44/0xa0 [] .handle_fasteoi_irq+0x3c/0x1bc [] .call_handle_irq+0x1c/0x2c [] .do_IRQ+0x128/0x1fc [] hardware_interrupt_entry+0x1c/0x98 [] .__do_softirq+0x8c/0x198 [] .call_do_softirq+0x14/0x24 [] .do_softirq+0x94/0x114 [] .irq_exit+0x70/0x88 [] .timer_interrupt+0xd4/0x100 [] decrementer_common+0x104/0x180 [] .cpu_idle+0x124/0x1f8 [] .rest_init+0x7c/0x94 [] .start_kernel+0x48c/0x4b4 [] .start_here_common+0x1c/0x34 } ... key at: [] irq_desc_lock_class+0x0/0x8 -> (old_style_spin_init){....} ops: 63750199574528 { initial-use at: [] .lock_acquire+0xa4/0xec [] ._spin_lock_irqsave+0x5c/0xc0 [] .rtas_call+0x80/0x288 [] .pSeries_cmo_feature_init+0x84/0x290 [] .pSeries_init_early+0x64/0x8c [] .setup_system+0x20c/0x3b0 [] .start_here_common+0xc/0x34 } ... key at: [] rtas+0x30/0xb0 ... acquired at: [] .__lock_acquire+0x814/0x8ec [] .lock_acquire+0xa4/0xec [] ._spin_lock_irqsave+0x5c/0xc0 [] .rtas_call+0x80/0x288 [] .xics_unmask_irq+0x90/0x114 [] .xics_startup+0x10/0x24 [] .setup_irq+0x228/0x35c [] .request_irq+0xc4/0x114 [] .request_ras_irqs+0x184/0x1fc [] .init_ras_IRQ+0x94/0xb8 [] .do_one_initcall+0x8c/0x1c8 [] .kernel_init+0x254/0x2cc [] .kernel_thread+0x54/0x70 ... acquired at: [] .__lock_acquire+0x814/0x8ec [] .lock_acquire+0xa4/0xec [] ._spin_lock_irqsave+0x5c/0xc0 [] .free_irq+0x6c/0x18c [] .notifier_del_irq+0x28/0x48 [] .hvc_close+0xa0/0x110 [] .release_dev+0x244/0x580 [] .tty_release+0x24/0x44 [] .__fput+0xf8/0x1dc [] .filp_close+0xb4/0xdc [] .sys_close+0xac/0x100 [] syscall_exit+0x0/0x40 -> (proc_subdir_lock){--..} ops: 15307263442944 { initial-use at: [] .lock_acquire+0xa4/0xec [] ._spin_lock+0x44/0xa0 [] .xlate_proc_name+0x50/0xf8 [] .__proc_create+0x6c/0x15c [] .create_proc_entry+0x6c/0xb0 [] .proc_misc_init+0x3c/0x2f0 [] .proc_root_init+0x78/0x104 [] .start_kernel+0x474/0x4b4 [] .start_here_common+0x1c/0x34 softirq-on-W at: [] .lock_acquire+0xa4/0xec [] ._spin_lock+0x44/0xa0 [] .xlate_proc_name+0x50/0xf8 [] .__proc_create+0x6c/0x15c [] .create_proc_entry+0x6c/0xb0 [] .proc_misc_init+0x3c/0x2f0 [] .proc_root_init+0x78/0x104 [] .start_kernel+0x474/0x4b4 [] .start_here_common+0x1c/0x34 hardirq-on-W at: [] .lock_acquire+0xa4/0xec [] ._spin_lock+0x44/0xa0 [] .xlate_proc_name+0x50/0xf8 [] .__proc_create+0x6c/0x15c [] .create_proc_entry+0x6c/0xb0 [] .proc_misc_init+0x3c/0x2f0 [] .proc_root_init+0x78/0x104 [] .start_kernel+0x474/0x4b4 [] .start_here_common+0x1c/0x34 } ... key at: [] proc_subdir_lock+0x18/0x38 ... 
acquired at: [] .__lock_acquire+0x814/0x8ec [] .lock_acquire+0xa4/0xec [] ._spin_lock+0x44/0xa0 [] .xlate_proc_name+0x50/0xf8 [] .remove_proc_entry+0x44/0x298 [] .unregister_handler_proc+0x40/0x58 [] .free_irq+0x124/0x18c [] .notifier_del_irq+0x28/0x48 [] .hvc_close+0xa0/0x110 [] .release_dev+0x244/0x580 [] .tty_release+0x24/0x44 [] .__fput+0xf8/0x1dc [] .filp_close+0xb4/0xdc [] .sys_close+0xac/0x100 [] syscall_exit+0x0/0x40 -> (&ent->pde_unload_lock){--..} ops: 178743653957632 { initial-use at: [] .lock_acquire+0xa4/0xec [] ._spin_lock+0x44/0xa0 [] .remove_proc_entry+0xd4/0x298 [] .unregister_handler_proc+0x40/0x58 [] .free_irq+0x124/0x18c [] .notifier_del_irq+0x28/0x48 [] .hvc_close+0xa0/0x110 [] .release_dev+0x244/0x580 [] .tty_release+0x24/0x44 [] .__fput+0xf8/0x1dc [] .filp_close+0xb4/0xdc [] .sys_close+0xac/0x100 [] syscall_exit+0x0/0x40 softirq-on-W at: [] .lock_acquire+0xa4/0xec [] ._spin_lock+0x44/0xa0 [] .proc_reg_open+0x68/0x1a4 [] .__dentry_open+0x190/0x308 [] .do_filp_open+0x400/0x84c [] .do_sys_open+0x80/0x140 [] .compat_sys_open+0x24/0x38 [] syscall_exit+0x0/0x40 hardirq-on-W at: [] .lock_acquire+0xa4/0xec [] ._spin_lock+0x44/0xa0 [] .proc_reg_open+0x68/0x1a4 [] .__dentry_open+0x190/0x308 [] .do_filp_open+0x400/0x84c [] .do_sys_open+0x80/0x140 [] .compat_sys_open+0x24/0x38 [] syscall_exit+0x0/0x40 } ... key at: [] __key.16461+0x0/0xc ... acquired at: [] .__lock_acquire+0x814/0x8ec [] .lock_acquire+0xa4/0xec [] ._spin_lock+0x44/0xa0 [] .remove_proc_entry+0xd4/0x298 [] .unregister_handler_proc+0x40/0x58 [] .free_irq+0x124/0x18c [] .notifier_del_irq+0x28/0x48 [] .hvc_close+0xa0/0x110 [] .release_dev+0x244/0x580 [] .tty_release+0x24/0x44 [] .__fput+0xf8/0x1dc [] .filp_close+0xb4/0xdc [] .sys_close+0xac/0x100 [] syscall_exit+0x0/0x40 -> (proc_inum_lock){--..} ops: 3435973836800 { initial-use at: [] .lock_acquire+0xa4/0xec [] ._spin_lock+0x44/0xa0 [] .proc_register+0x64/0x224 [] .create_proc_entry+0x80/0xb0 [] .proc_misc_init+0x3c/0x2f0 [] .proc_root_init+0x78/0x104 [] .start_kernel+0x474/0x4b4 [] .start_here_common+0x1c/0x34 softirq-on-W at: [] .lock_acquire+0xa4/0xec [] ._spin_lock+0x44/0xa0 [] .proc_register+0x64/0x224 [] .create_proc_entry+0x80/0xb0 [] .proc_misc_init+0x3c/0x2f0 [] .proc_root_init+0x78/0x104 [] .start_kernel+0x474/0x4b4 [] .start_here_common+0x1c/0x34 hardirq-on-W at: [] .lock_acquire+0xa4/0xec [] ._spin_lock+0x44/0xa0 [] .proc_register+0x64/0x224 [] .create_proc_entry+0x80/0xb0 [] .proc_misc_init+0x3c/0x2f0 [] .proc_root_init+0x78/0x104 [] .start_kernel+0x474/0x4b4 [] .start_here_common+0x1c/0x34 } ... key at: [] proc_inum_lock+0x18/0x38 -> (proc_inum_ida.lock){....} ops: 6932077215744 { initial-use at: [] .lock_acquire+0xa4/0xec [] ._spin_lock_irqsave+0x5c/0xc0 [] .idr_pre_get+0x5c/0xc0 [] .ida_pre_get+0x28/0xa8 [] .proc_register+0x50/0x224 [] .create_proc_entry+0x80/0xb0 [] .proc_misc_init+0x3c/0x2f0 [] .proc_root_init+0x78/0x104 [] .start_kernel+0x474/0x4b4 [] .start_here_common+0x1c/0x34 } ... key at: [] proc_inum_ida+0x30/0x58 ... acquired at: [] .__lock_acquire+0x814/0x8ec [] .lock_acquire+0xa4/0xec [] ._spin_lock_irqsave+0x5c/0xc0 [] .get_from_free_list+0x28/0x84 [] .idr_get_empty_slot+0x5c/0x2ec [] .ida_get_new_above+0x80/0x28c [] .proc_register+0x74/0x224 [] .create_proc_entry+0x80/0xb0 [] .proc_misc_init+0x3c/0x2f0 [] .proc_root_init+0x78/0x104 [] .start_kernel+0x474/0x4b4 [] .start_here_common+0x1c/0x34 ... 
acquired at: [] .__lock_acquire+0x814/0x8ec [] .lock_acquire+0xa4/0xec [] ._spin_lock+0x44/0xa0 [] .free_proc_entry+0x4c/0xc4 [] .remove_proc_entry+0x268/0x298 [] .unregister_handler_proc+0x40/0x58 [] .free_irq+0x124/0x18c [] .notifier_del_irq+0x28/0x48 [] .hvc_close+0xa0/0x110 [] .release_dev+0x244/0x580 [] .tty_release+0x24/0x44 [] .__fput+0xf8/0x1dc [] .filp_close+0xb4/0xdc [] .sys_close+0xac/0x100 [] syscall_exit+0x0/0x40 -> (&q->lock){.+..} ops: 265274360070144 { initial-use at: [] .lock_acquire+0xa4/0xec [] ._spin_lock_irq+0x58/0xb4 [] .wait_for_common+0x48/0x1d4 [] .kthread_create+0xb8/0x110 [] .migration_call+0xc8/0x660 [] .migration_init+0x34/0x90 [] .do_one_initcall+0x8c/0x1c8 [] .kernel_init+0x80/0x2cc [] .kernel_thread+0x54/0x70 in-softirq-W at: [] .lock_acquire+0xa4/0xec [] ._spin_lock_irqsave+0x5c/0xc0 [] .complete+0x28/0x80 [] .wakeme_after_rcu+0x14/0x28 [] .__rcu_process_callbacks+0x1e4/0x2bc [] .rcu_process_callbacks+0x3c/0x64 [] .__do_softirq+0xc8/0x198 [] .call_do_softirq+0x14/0x24 [] .do_softirq+0x94/0x114 [] .irq_exit+0x70/0x88 [] .timer_interrupt+0xd4/0x100 [] decrementer_common+0x104/0x180 [] .cpu_idle+0x124/0x1f8 [] .rest_init+0x7c/0x94 [] .start_kernel+0x48c/0x4b4 [] .start_here_common+0x1c/0x34 } ... key at: [] __key.14384+0x0/0x8 -> (&rq->lock){++..} ops: 19113192877719552 { initial-use at: [] .lock_acquire+0xa4/0xec [] ._spin_lock_irqsave+0x5c/0xc0 [] .rq_attach_root+0x30/0x1a8 [] .sched_init+0x334/0x408 [] .start_kernel+0x1dc/0x4b4 [] .start_here_common+0x1c/0x34 in-hardirq-W at: [] .lock_acquire+0xa4/0xec [] ._spin_lock+0x44/0xa0 [] .scheduler_tick+0x48/0x154 [] .update_process_times+0x60/0x8c [] .tick_periodic+0x9c/0xc4 [] .tick_handle_periodic+0x38/0xbc [] .timer_interrupt+0xac/0x100 [] decrementer_common+0x104/0x180 [] .dotest+0x4dc/0x544 [] .locking_selftest+0x124/0x17ec [] .start_kernel+0x32c/0x4b4 [] .start_here_common+0x1c/0x34 in-softirq-W at: [] .lock_acquire+0xa4/0xec [] ._spin_lock_irqsave+0x5c/0xc0 [] .tg_shares_up+0x108/0x20c [] .walk_tg_tree+0xf0/0x140 [] .run_rebalance_domains+0x1ac/0x5fc [] .__do_softirq+0xc8/0x198 [] .call_do_softirq+0x14/0x24 [] .do_softirq+0x94/0x114 [] .irq_exit+0x70/0x88 [] .timer_interrupt+0xd4/0x100 [] decrementer_common+0x104/0x180 [] .cpu_idle+0x124/0x1f8 [] .rest_init+0x7c/0x94 [] .start_kernel+0x48c/0x4b4 [] .start_here_common+0x1c/0x34 } ... 
key at: [] __key.39338+0x0/0x8 -> (&vec->lock){++..} ops: 11192684773376 { initial-use at: [] .lock_acquire+0xa4/0xec [] ._spin_lock_irqsave+0x5c/0xc0 [] .cpupri_set+0x12c/0x1f0 [] .rq_online_rt+0xac/0xc4 [] .set_rq_online+0xa8/0xd4 [] .rq_attach_root+0x174/0x1a8 [] .sched_init+0x334/0x408 [] .start_kernel+0x1dc/0x4b4 [] .start_here_common+0x1c/0x34 in-hardirq-W at: [] .lock_acquire+0xa4/0xec [] ._spin_lock_irqsave+0x5c/0xc0 [] .cpupri_set+0x78/0x1f0 [] .__enqueue_rt_entity+0x114/0x1d0 [] .enqueue_task_rt+0x4c/0x84 [] .enqueue_task+0x84/0xac [] .activate_task+0x30/0x50 [] .try_to_wake_up+0x1a0/0x27c [] .softlockup_tick+0x130/0x214 [] .run_local_timers+0x24/0x38 [] .update_process_times+0x38/0x8c [] .tick_periodic+0x9c/0xc4 [] .tick_handle_periodic+0x38/0xbc [] .timer_interrupt+0xac/0x100 [] decrementer_common+0x104/0x180 [] .cpu_idle+0x124/0x1f8 [] .start_secondary+0x350/0x388 [] .start_secondary_prolog+0x10/0x14 in-softirq-W at: [] .lock_acquire+0xa4/0xec [] ._spin_lock_irqsave+0x5c/0xc0 [] .cpupri_set+0x78/0x1f0 [] .__enqueue_rt_entity+0x114/0x1d0 [] .enqueue_task_rt+0x4c/0x84 [] .enqueue_task+0x84/0xac [] .activate_task+0x30/0x50 [] .try_to_wake_up+0x1a0/0x27c [] .run_rebalance_domains+0x424/0x5fc [] .__do_softirq+0xc8/0x198 [] .call_do_softirq+0x14/0x24 [] .do_softirq+0x94/0x114 [] .irq_exit+0x70/0x88 [] .timer_interrupt+0xd4/0x100 [] decrementer_common+0x104/0x180 [] .cpu_idle+0x124/0x1f8 [] .start_secondary+0x350/0x388 [] .start_secondary_prolog+0x10/0x14 } ... key at: [] __key.11749+0x0/0x10 ... acquired at: [] .__lock_acquire+0x814/0x8ec [] .lock_acquire+0xa4/0xec [] ._spin_lock_irqsave+0x5c/0xc0 [] .cpupri_set+0x12c/0x1f0 [] .rq_online_rt+0xac/0xc4 [] .set_rq_online+0xa8/0xd4 [] .rq_attach_root+0x174/0x1a8 [] .sched_init+0x334/0x408 [] .start_kernel+0x1dc/0x4b4 [] .start_here_common+0x1c/0x34 -> (&rt_b->rt_runtime_lock){++..} ops: 210453397504 { initial-use at: [] .lock_acquire+0xa4/0xec [] ._spin_lock+0x44/0xa0 [] .__enqueue_rt_entity+0x164/0x1d0 [] .enqueue_task_rt+0x4c/0x84 [] .enqueue_task+0x84/0xac [] .activate_task+0x30/0x50 [] .try_to_wake_up+0x1a0/0x27c [] .migration_call+0x164/0x660 [] .migration_init+0x60/0x90 [] .do_one_initcall+0x8c/0x1c8 [] .kernel_init+0x80/0x2cc [] .kernel_thread+0x54/0x70 in-hardirq-W at: [] .lock_acquire+0xa4/0xec [] ._spin_lock+0x44/0xa0 [] .__enqueue_rt_entity+0x164/0x1d0 [] .enqueue_task_rt+0x4c/0x84 [] .enqueue_task+0x84/0xac [] .activate_task+0x30/0x50 [] .try_to_wake_up+0x1a0/0x27c [] .softlockup_tick+0x130/0x214 [] .run_local_timers+0x24/0x38 [] .update_process_times+0x38/0x8c [] .tick_periodic+0x9c/0xc4 [] .tick_handle_periodic+0x38/0xbc [] .timer_interrupt+0xac/0x100 [] decrementer_common+0x104/0x180 [] .cpu_idle+0x124/0x1f8 [] .start_secondary+0x350/0x388 [] .start_secondary_prolog+0x10/0x14 in-softirq-W at: [] .lock_acquire+0xa4/0xec [] ._spin_lock+0x44/0xa0 [] .__enqueue_rt_entity+0x164/0x1d0 [] .enqueue_task_rt+0x4c/0x84 [] .enqueue_task+0x84/0xac [] .activate_task+0x30/0x50 [] .try_to_wake_up+0x1a0/0x27c [] .run_rebalance_domains+0x424/0x5fc [] .__do_softirq+0xc8/0x198 [] .call_do_softirq+0x14/0x24 [] .do_softirq+0x94/0x114 [] .irq_exit+0x70/0x88 [] .timer_interrupt+0xd4/0x100 [] decrementer_common+0x104/0x180 [] .cpu_idle+0x124/0x1f8 [] .start_secondary+0x350/0x388 [] .start_secondary_prolog+0x10/0x14 } ... 
key at: [] __key.31617+0x0/0x8 -> (&cpu_base->lock){++..} ops: 700504871010304 { initial-use at: [] .lock_acquire+0xa4/0xec [] ._spin_lock_irq+0x58/0xb4 [] .hrtimer_run_pending+0x3c/0x10c [] .run_timer_softirq+0x54/0x268 [] .__do_softirq+0xc8/0x198 [] .call_do_softirq+0x14/0x24 [] .do_softirq+0x94/0x114 [] .irq_exit+0x70/0x88 [] .timer_interrupt+0xd4/0x100 [] decrementer_common+0x104/0x180 [] .dotest+0x4dc/0x544 [] .locking_selftest+0x124/0x17ec [] .start_kernel+0x32c/0x4b4 [] .start_here_common+0x1c/0x34 in-hardirq-W at: [] .lock_acquire+0xa4/0xec [] ._spin_lock+0x44/0xa0 [] .hrtimer_run_queues+0x178/0x318 [] .run_local_timers+0x10/0x38 [] .update_process_times+0x38/0x8c [] .tick_periodic+0x9c/0xc4 [] .tick_handle_periodic+0x38/0xbc [] .timer_interrupt+0xac/0x100 [] decrementer_common+0x104/0x180 [] ._spin_unlock_irqrestore+0x60/0x88 [] .rtas_call+0x1ec/0x288 [] .smp_pSeries_kick_cpu+0xc0/0x10c [] .__cpu_up+0x114/0x254 [] .cpu_up+0x11c/0x1f0 [] .kernel_init+0x15c/0x2cc [] .kernel_thread+0x54/0x70 in-softirq-W at: [] .lock_acquire+0xa4/0xec [] ._spin_lock_irq+0x58/0xb4 [] .hrtimer_run_pending+0x3c/0x10c [] .run_timer_softirq+0x54/0x268 [] .__do_softirq+0xc8/0x198 [] .call_do_softirq+0x14/0x24 [] .do_softirq+0x94/0x114 [] .irq_exit+0x70/0x88 [] .timer_interrupt+0xd4/0x100 [] decrementer_common+0x104/0x180 [] .dotest+0x4dc/0x544 [] .locking_selftest+0x124/0x17ec [] .start_kernel+0x32c/0x4b4 [] .start_here_common+0x1c/0x34 } ... key at: [] __key.18246+0x0/0x8 ... acquired at: [] .__lock_acquire+0x814/0x8ec [] .lock_acquire+0xa4/0xec [] ._spin_lock_irqsave+0x5c/0xc0 [] .lock_hrtimer_base+0x34/0x8c [] .hrtimer_start+0x4c/0x1b0 [] .__enqueue_rt_entity+0x19c/0x1d0 [] .enqueue_task_rt+0x4c/0x84 [] .enqueue_task+0x84/0xac [] .activate_task+0x30/0x50 [] .try_to_wake_up+0x1a0/0x27c [] .migration_call+0x164/0x660 [] .migration_init+0x60/0x90 [] .do_one_initcall+0x8c/0x1c8 [] .kernel_init+0x80/0x2cc [] .kernel_thread+0x54/0x70 -> (&rt_rq->rt_runtime_lock){+...} ops: 6090263625728 { initial-use at: [] .lock_acquire+0xa4/0xec [] ._spin_lock+0x44/0xa0 [] .update_curr_rt+0xac/0x144 [] .dequeue_task_rt+0x20/0x5c [] .dequeue_task+0xdc/0x108 [] .deactivate_task+0x30/0x50 [] .schedule+0x1b8/0x804 [] .migration_thread+0x230/0x324 [] .kthread+0x78/0xc4 [] .kernel_thread+0x54/0x70 in-hardirq-W at: [] .lock_acquire+0xa4/0xec [] ._spin_lock+0x44/0xa0 [] .sched_rt_period_timer+0xd4/0x1f8 [] .hrtimer_run_queues+0x21c/0x318 [] .run_local_timers+0x10/0x38 [] .update_process_times+0x38/0x8c [] .tick_periodic+0x9c/0xc4 [] .tick_handle_periodic+0x38/0xbc [] .timer_interrupt+0xac/0x100 [] decrementer_common+0x104/0x180 [] .cpu_idle+0x124/0x1f8 [] .rest_init+0x7c/0x94 [] .start_kernel+0x48c/0x4b4 [] .start_here_common+0x1c/0x34 } ... key at: [] __key.39293+0x0/0x8 ... acquired at: [] .__lock_acquire+0x814/0x8ec [] .lock_acquire+0xa4/0xec [] ._spin_lock+0x44/0xa0 [] .__enable_runtime+0x60/0xb0 [] .rq_online_rt+0x98/0xc4 [] .set_rq_online+0xa8/0xd4 [] .migration_call+0x1c8/0x660 [] .notifier_call_chain+0x68/0xdc [] .cpu_up+0x188/0x1f0 [] .kernel_init+0x15c/0x2cc [] .kernel_thread+0x54/0x70 ... acquired at: [] .__lock_acquire+0x814/0x8ec [] .lock_acquire+0xa4/0xec [] ._spin_lock+0x44/0xa0 [] .__enqueue_rt_entity+0x164/0x1d0 [] .enqueue_task_rt+0x4c/0x84 [] .enqueue_task+0x84/0xac [] .activate_task+0x30/0x50 [] .try_to_wake_up+0x1a0/0x27c [] .migration_call+0x164/0x660 [] .migration_init+0x60/0x90 [] .do_one_initcall+0x8c/0x1c8 [] .kernel_init+0x80/0x2cc [] .kernel_thread+0x54/0x70 ... 
acquired at: [] .__lock_acquire+0x814/0x8ec [] .lock_acquire+0xa4/0xec [] ._spin_lock+0x44/0xa0 [] .update_curr_rt+0xac/0x144 [] .dequeue_task_rt+0x20/0x5c [] .dequeue_task+0xdc/0x108 [] .deactivate_task+0x30/0x50 [] .schedule+0x1b8/0x804 [] .migration_thread+0x230/0x324 [] .kthread+0x78/0xc4 [] .kernel_thread+0x54/0x70 -> (&rq->lock/1){.+..} ops: 5330054414336 { initial-use at: [] .lock_acquire+0xa4/0xec [] ._spin_lock_nested+0x44/0xa0 [] .double_rq_lock+0x78/0xc8 [] .__migrate_task+0xa8/0x194 [] .migration_thread+0x278/0x324 [] .kthread+0x78/0xc4 [] .kernel_thread+0x54/0x70 in-softirq-W at: [] .lock_acquire+0xa4/0xec [] ._spin_lock_nested+0x44/0xa0 [] .double_rq_lock+0x78/0xc8 [] .run_rebalance_domains+0x23c/0x5fc [] .__do_softirq+0xc8/0x198 [] .call_do_softirq+0x14/0x24 [] .do_softirq+0x94/0x114 [] .irq_exit+0x70/0x88 [] .timer_interrupt+0xd4/0x100 [] decrementer_common+0x104/0x180 [] .cpu_idle+0x124/0x1f8 [] .start_secondary+0x350/0x388 [] .start_secondary_prolog+0x10/0x14 } ... key at: [] __key.39338+0x1/0x8 ... acquired at: [] .__lock_acquire+0x814/0x8ec [] .lock_acquire+0xa4/0xec [] ._spin_lock_nested+0x44/0xa0 [] .double_rq_lock+0x78/0xc8 [] .__migrate_task+0xa8/0x194 [] .migration_thread+0x278/0x324 [] .kthread+0x78/0xc4 [] .kernel_thread+0x54/0x70 ... acquired at: [] .__lock_acquire+0x814/0x8ec [] .lock_acquire+0xa4/0xec [] ._spin_lock+0x44/0xa0 [] .task_rq_lock+0x70/0xd4 [] .try_to_wake_up+0xd4/0x27c [] .__wake_up_common+0x6c/0xe0 [] .complete+0x54/0x80 [] .kthread+0x38/0xc4 [] .kernel_thread+0x54/0x70 ... acquired at: [] .__lock_acquire+0x814/0x8ec [] .lock_acquire+0xa4/0xec [] ._spin_lock_irqsave+0x5c/0xc0 [] .__wake_up+0x34/0x88 [] .tty_wakeup+0x88/0xa4 [] .hvc_poll+0x270/0x2f0 [] .khvcd+0x84/0x18c [] .kthread+0x78/0xc4 [] .kernel_thread+0x54/0x70 the second lock's dependencies: -> (proc_subdir_lock){--..} ops: 15307263442944 { initial-use at: [] .lock_acquire+0xa4/0xec [] ._spin_lock+0x44/0xa0 [] .xlate_proc_name+0x50/0xf8 [] .__proc_create+0x6c/0x15c [] .create_proc_entry+0x6c/0xb0 [] .proc_misc_init+0x3c/0x2f0 [] .proc_root_init+0x78/0x104 [] .start_kernel+0x474/0x4b4 [] .start_here_common+0x1c/0x34 softirq-on-W at: [] .lock_acquire+0xa4/0xec [] ._spin_lock+0x44/0xa0 [] .xlate_proc_name+0x50/0xf8 [] .__proc_create+0x6c/0x15c [] .create_proc_entry+0x6c/0xb0 [] .proc_misc_init+0x3c/0x2f0 [] .proc_root_init+0x78/0x104 [] .start_kernel+0x474/0x4b4 [] .start_here_common+0x1c/0x34 hardirq-on-W at: [] .lock_acquire+0xa4/0xec [] ._spin_lock+0x44/0xa0 [] .xlate_proc_name+0x50/0xf8 [] .__proc_create+0x6c/0x15c [] .create_proc_entry+0x6c/0xb0 [] .proc_misc_init+0x3c/0x2f0 [] .proc_root_init+0x78/0x104 [] .start_kernel+0x474/0x4b4 [] .start_here_common+0x1c/0x34 } ... 
key at: [] proc_subdir_lock+0x18/0x38 stack backtrace: Call Trace: [c00000000fffb890] [c00000000000fa14] .show_stack+0x78/0x17c (unreliable) [c00000000fffb940] [c00000000007ee54] .print_irq_inversion_bug+0x1a4/0x1d4 [c00000000fffb9e0] [c000000000080a80] .mark_lock+0x320/0xa1c [c00000000fffba80] [c00000000008310c] .__lock_acquire+0x638/0x8ec [c00000000fffbb70] [c000000000083464] .lock_acquire+0xa4/0xec [c00000000fffbc30] [c0000000004e7614] ._spin_lock_irqsave+0x5c/0xc0 [c00000000fffbcd0] [c0000000002e61e0] .hvc_poll+0x50/0x2f0 [c00000000fffbdd0] [c0000000002e66d8] .hvc_handle_interrupt+0x14/0x3c [c00000000fffbe50] [c00000000009f140] .handle_IRQ_event+0x50/0xc8 [c00000000fffbef0] [c0000000000a15e4] .handle_fasteoi_irq+0x120/0x1bc [c00000000fffbf90] [c000000000025e00] .call_handle_irq+0x1c/0x2c [c000000000a43a40] [c00000000000d118] .do_IRQ+0x128/0x1fc [c000000000a43ae0] [c000000000004804] hardware_interrupt_entry+0x1c/0x98 --- Exception: 501 at .raw_local_irq_restore+0x3c/0x40 LR = .cpu_idle+0x130/0x1f8 [c000000000a43dd0] [c000000000011ffc] .cpu_idle+0x124/0x1f8 (unreliable) [c000000000a43e60] [c0000000004e7c6c] .rest_init+0x7c/0x94 [c000000000a43ee0] [c000000000720a58] .start_kernel+0x48c/0x4b4 [c000000000a43f90] [c000000000008368] .start_here_common+0x1c/0x34 From borntraeger at de.ibm.com Mon Oct 13 00:51:31 2008 From: borntraeger at de.ibm.com (Christian Borntraeger) Date: Mon, 13 Oct 2008 09:51:31 +0200 Subject: [RFC 1/3] hvc_console: rework setup to replace irq functions with callbacks In-Reply-To: <1223875013.8157.230.camel@pasglop> References: <200806031444.21945.borntraeger@de.ibm.com> <200806031445.22561.borntraeger@de.ibm.com> <1223875013.8157.230.camel@pasglop> Message-ID: <200810130951.31733.borntraeger@de.ibm.com> Am Montag, 13. Oktober 2008 schrieb Benjamin Herrenschmidt: > ? ... key ? ? ?at: [] proc_subdir_lock+0x18/0x38 > ?... acquired at: > ? ?[] .__lock_acquire+0x814/0x8ec > ? ?[] .lock_acquire+0xa4/0xec > ? ?[] ._spin_lock+0x44/0xa0 > ? ?[] .xlate_proc_name+0x50/0xf8 > ? ?[] .remove_proc_entry+0x44/0x298 > ? ?[] .unregister_handler_proc+0x40/0x58 > ? ?[] .free_irq+0x124/0x18c > ? ?[] .notifier_del_irq+0x28/0x48 > ? ?[] .hvc_close+0xa0/0x110 > ? ?[] .release_dev+0x244/0x580 > ? ?[] .tty_release+0x24/0x44 > ? ?[] .__fput+0xf8/0x1dc > ? ?[] .filp_close+0xb4/0xdc > ? ?[] .sys_close+0xac/0x100 > ? ?[] syscall_exit+0x0/0x40 Hmmm. Can you try if this patch fixes the lockdep trace? This would be analog to commit b1b135c8d619cb2c7045d6ee4e48375882518bb5 Author: Christian Borntraeger Date: Thu Aug 7 09:18:34 2008 +0200 fix spinlock recursion in hvc_console commit 611e097d7707741a336a0677d9d69bec40f29f3d Author: Christian Borntraeger hvc_console: rework setup to replace irq functions with callbacks introduced a spinlock recursion problem. Signed-off-by: Christian Borntraeger --- drivers/char/hvc_console.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) Index: linux-2.6/drivers/char/hvc_console.c =================================================================== --- linux-2.6.orig/drivers/char/hvc_console.c +++ linux-2.6/drivers/char/hvc_console.c @@ -367,13 +367,13 @@ static void hvc_close(struct tty_struct spin_lock_irqsave(&hp->lock, flags); if (--hp->count == 0) { - if (hp->ops->notifier_del) - hp->ops->notifier_del(hp, hp->data); - /* We are done with the tty pointer now. 
*/ hp->tty = NULL; spin_unlock_irqrestore(&hp->lock, flags); + if (hp->ops->notifier_del) + hp->ops->notifier_del(hp, hp->data); + /* * Chain calls chars_in_buffer() and returns immediately if * there is no buffered data otherwise sleeps on a wait queue @@ -416,11 +416,11 @@ static void hvc_hangup(struct tty_struct hp->n_outbuf = 0; hp->tty = NULL; + spin_unlock_irqrestore(&hp->lock, flags); + if (hp->ops->notifier_del) hp->ops->notifier_del(hp, hp->data); - spin_unlock_irqrestore(&hp->lock, flags); - while(temp_open_count) { --temp_open_count; kref_put(&hp->kref, destroy_hvc_struct); From benh at kernel.crashing.org Mon Oct 13 01:36:12 2008 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Mon, 13 Oct 2008 19:36:12 +1100 Subject: [RFC 1/3] hvc_console: rework setup to replace irq functions with callbacks In-Reply-To: <200810130951.31733.borntraeger@de.ibm.com> References: <200806031444.21945.borntraeger@de.ibm.com> <200806031445.22561.borntraeger@de.ibm.com> <1223875013.8157.230.camel@pasglop> <200810130951.31733.borntraeger@de.ibm.com> Message-ID: <1223886972.8157.241.camel@pasglop> > if (--hp->count == 0) { > - if (hp->ops->notifier_del) > - hp->ops->notifier_del(hp, hp->data); > - > /* We are done with the tty pointer now. */ > hp->tty = NULL; > spin_unlock_irqrestore(&hp->lock, flags); > > + if (hp->ops->notifier_del) > + hp->ops->notifier_del(hp, hp->data); > + I will try. Of course the risk here is that the interrupt happens after we set hp->tty to NULL, so we probably need to check within the interrupt handler for a NULL tty. I haven't checked if that's the case (I'm not in front of the code right now). Ben. From borntraeger at de.ibm.com Mon Oct 13 01:47:12 2008 From: borntraeger at de.ibm.com (Christian Borntraeger) Date: Mon, 13 Oct 2008 10:47:12 +0200 Subject: [RFC 1/3] hvc_console: rework setup to replace irq functions with callbacks In-Reply-To: <1223886972.8157.241.camel@pasglop> References: <200806031444.21945.borntraeger@de.ibm.com> <200810130951.31733.borntraeger@de.ibm.com> <1223886972.8157.241.camel@pasglop> Message-ID: <200810131047.12748.borntraeger@de.ibm.com> Am Montag, 13. Oktober 2008 schrieb Benjamin Herrenschmidt: > > > if (--hp->count == 0) { > > - if (hp->ops->notifier_del) > > - hp->ops->notifier_del(hp, hp->data); > > - > > /* We are done with the tty pointer now. */ > > hp->tty = NULL; > > spin_unlock_irqrestore(&hp->lock, flags); > > > > + if (hp->ops->notifier_del) > > + hp->ops->notifier_del(hp, hp->data); > > + > > I will try. Of course the risk here is that the interrupt happens > after we set hp->tty to NULL, so we probably need to check within the > interrupt handler for a NULL tty. I haven't checked if that's the case > (I'm not in front of the code right now). Even the old code (without my patch) was setting hp->tty to NULL before doing the irq_free, so that should be ok. Christian From benh at kernel.crashing.org Mon Oct 13 02:52:59 2008 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Mon, 13 Oct 2008 20:52:59 +1100 Subject: [RFC 1/3] hvc_console: rework setup to replace irq functions with callbacks In-Reply-To: <200810131047.12748.borntraeger@de.ibm.com> References: <200806031444.21945.borntraeger@de.ibm.com> <200810130951.31733.borntraeger@de.ibm.com> <1223886972.8157.241.camel@pasglop> <200810131047.12748.borntraeger@de.ibm.com> Message-ID: <1223891579.8157.251.camel@pasglop> On Mon, 2008-10-13 at 10:47 +0200, Christian Borntraeger wrote: > > I will try. 
Of course the risk here is that the interrupt happens
> after we set hp->tty to NULL, so we probably need to check within the
> interrupt handler for a NULL tty. I haven't checked if that's the case
> (I'm not in front of the code right now).
>
> Even the old code (without my patch) was setting hp->tty to NULL before doing
> the irq_free, so that should be ok.

Yup, just checked, it should be all right as long as it's cleared with the spinlock held, which seems to be the case with your patch.

I'll test your fixup patch tomorrow to see if it clears the lockdep error.

Thanks !
Ben.

From eddie.dong at intel.com Mon Oct 13 17:23:34 2008
From: eddie.dong at intel.com (Dong, Eddie)
Date: Tue, 14 Oct 2008 08:23:34 +0800
Subject: [PATCH 6/6 v3] PCI: document the change
In-Reply-To: <20081001160706.GI13822@parisc-linux.org>
References: <20081001160706.GI13822@parisc-linux.org>
Message-ID: <08DF4D958216244799FC84F3514D70F00235CC69@pdsmsx415.ccr.corp.intel.com>

Matthew Wilcox wrote:
> Wouldn't it be more useful to have the iov/N directories
> be a symlink to the actual pci_dev used by the virtual
> function?

The main concern here is that a VF may be disabled, such as when the PF enters D3 state or undergoes a reset and is thus unplugged, but the user won't re-configure the VF in case the PF returns to a working state.

>
>> +For network device, there are:
>> + - /sys/bus/pci/devices/BB:DD.F/iov/N/mac
>> + - /sys/bus/pci/devices/BB:DD.F/iov/N/vlan
>> + (value update will notify PF driver)
>
> We already have tools to set the MAC and VLAN parameters
> for network devices.

Do you mean Ethtool? If yes, it is impossible for SR-IOV since the configuration has to be done on the PF side, rather than the VF.

>
> I'm not 100% convinced about this API. The assumption
> here is that the driver will do it, but I think it should
> probably be in the core. The driver probably wants to be

Our concern is that the PF driver may put in a default state when it is loaded so that SR-IOV can work without any user-level configuration, but of course the driver won't dynamically change it. Do you mean we should remove this ability?

> notified that the PCI core is going to create a virtual
> function, and would it please prepare to do so, but I'm
> not convinced this should be triggered by the driver.
> How would the driver decide to create a new virtual
> function?
>
>
> From my reading of the SR-IOV spec, this isn't how it's
> supposed to work. The device is supposed to be a fully
> functional PCI device that on demand can start peeling
> off virtual functions; it's not supposed to boot up and
> initialise all its virtual functions at once.

The spec defines that we either enable all VFs or disable them all; per-VF enabling is not supported. Is this what concerns you?

Thanks,
eddie

From benh at kernel.crashing.org Mon Oct 13 17:38:34 2008
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Tue, 14 Oct 2008 11:38:34 +1100
Subject: [RFC 1/3] hvc_console: rework setup to replace irq functions with callbacks
In-Reply-To: <200810130951.31733.borntraeger@de.ibm.com>
References: <200806031444.21945.borntraeger@de.ibm.com> <200806031445.22561.borntraeger@de.ibm.com> <1223875013.8157.230.camel@pasglop> <200810130951.31733.borntraeger@de.ibm.com>
Message-ID: <1223944714.8157.300.camel@pasglop>

> Hmmm.
> Can you try if this patch fixes the lockdep trace?

Yup, the patch fixes it, I'll commit it via the powerpc.git tree if you don't have any objection.

Cheers, Ben.
> This would be analog to > commit b1b135c8d619cb2c7045d6ee4e48375882518bb5 > Author: Christian Borntraeger > Date: Thu Aug 7 09:18:34 2008 +0200 > > fix spinlock recursion in hvc_console > > commit 611e097d7707741a336a0677d9d69bec40f29f3d > Author: Christian Borntraeger > hvc_console: rework setup to replace irq functions with callbacks > introduced a spinlock recursion problem. > > > Signed-off-by: Christian Borntraeger > --- > drivers/char/hvc_console.c | 10 +++++----- > 1 file changed, 5 insertions(+), 5 deletions(-) > > Index: linux-2.6/drivers/char/hvc_console.c > =================================================================== > --- linux-2.6.orig/drivers/char/hvc_console.c > +++ linux-2.6/drivers/char/hvc_console.c > @@ -367,13 +367,13 @@ static void hvc_close(struct tty_struct > spin_lock_irqsave(&hp->lock, flags); > > if (--hp->count == 0) { > - if (hp->ops->notifier_del) > - hp->ops->notifier_del(hp, hp->data); > - > /* We are done with the tty pointer now. */ > hp->tty = NULL; > spin_unlock_irqrestore(&hp->lock, flags); > > + if (hp->ops->notifier_del) > + hp->ops->notifier_del(hp, hp->data); > + > /* > * Chain calls chars_in_buffer() and returns immediately if > * there is no buffered data otherwise sleeps on a wait queue > @@ -416,11 +416,11 @@ static void hvc_hangup(struct tty_struct > hp->n_outbuf = 0; > hp->tty = NULL; > > + spin_unlock_irqrestore(&hp->lock, flags); > + > if (hp->ops->notifier_del) > hp->ops->notifier_del(hp, hp->data); > > - spin_unlock_irqrestore(&hp->lock, flags); > - > while(temp_open_count) { > --temp_open_count; > kref_put(&hp->kref, destroy_hvc_struct); From matthew at wil.cx Mon Oct 13 18:08:27 2008 From: matthew at wil.cx (Matthew Wilcox) Date: Mon, 13 Oct 2008 19:08:27 -0600 Subject: [PATCH 6/6 v3] PCI: document the change In-Reply-To: <08DF4D958216244799FC84F3514D70F00235CC69@pdsmsx415.ccr.corp.intel.com> References: <20081001160706.GI13822@parisc-linux.org> <08DF4D958216244799FC84F3514D70F00235CC69@pdsmsx415.ccr.corp.intel.com> Message-ID: <20081014010827.GX25780@parisc-linux.org> On Tue, Oct 14, 2008 at 08:23:34AM +0800, Dong, Eddie wrote: > Matthew Wilcox wrote: > > Wouldn't it be more useful to have the iov/N directories > > be a symlink to the actual pci_dev used by the virtual > > function? > > The main concern here is that a VF may be disabed such as when PF enter > D3 state or undergo an reset and thus be plug-off, but user won't > re-configure the VF in case the PF return back to working state. If we're relying on the user to reconfigure virtual functions on return to D0 from D3, that's a very fragile system. > >> +For network device, there are: > >> + - /sys/bus/pci/devices/BB:DD.F/iov/N/mac > >> + - /sys/bus/pci/devices/BB:DD.F/iov/N/vlan > >> + (value update will notify PF driver) > > > > We already have tools to set the MAC and VLAN parameters > > for network devices. > > Do you mean Ethtool? If yes, it is impossible for SR-IOV since the > configuration has to be done in PF side, rather than VF. I don't think ethtool has that ability; ip(8) can set mac addresses and vconfig(8) sets vlan parameters. The device driver already has to be aware of SR-IOV. If it's going to support the standard tools (and it damn well ought to), then it should call the PF driver to set these kinds of parameters. > > I'm not 100% convinced about this API. The assumption > > here is that the driver will do it, but I think it should > > probably be in the core. 
The driver probably wants to be > > Our concern is that the PF driver may put an default state when it is > loaded so that SR-IOV can work without any user level configuration, but > of course the driver won't dynamically change it. > Do u mean we remove this ability? > > > notified that the PCI core is going to create a virtual > > function, and would it please prepare to do so, but I'm > > not convinced this should be triggered by the driver. > > How would the driver decide to create a new virtual > > function? Let me try to explain this a bit better. The user decides they want a new ethernet virtual function. In the scheme as you have set up: 1. User communicates to ethernet driver "I want a new VF" 2. Ethernet driver tells PCI core "create new VF". I propose: 1. User tells PCI core "I want a new VF on PCI device 0000:01:03.0" 2. PCI core tells driver "User wants a new VF" My scheme gives us a unified way of creating new VFs, yours requires each driver to invent a way for the user to tell them to create a new VF. Unless I've misunderstood your code and docs. > > From my reading of the SR-IOV spec, this isn't how it's > > supposed to work. The device is supposed to be a fully > > functional PCI device that on demand can start peeling > > off virtual functions; it's not supposed to boot up and > > initialise all its virtual functions at once. > > The spec defines either we enable all VFs or Disable. Per VF enabling is > not supported. > Is this what you concern? I don't think that's true. The spec requires you to enable all the VFs from 0 to NumVFs, but NumVFs can be lower than TotalVFs. At least, that's how I read it. But no, that isn't my concern. My concern is that you've written a driver here that seems to be a stub driver. That doesn't seem to be how SR-IOV is supposed to work; it's supposed to be a fully-functional driver that has SR-IOV knowledge added to it. -- Matthew Wilcox Intel Open Source Technology Centre "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." From eddie.dong at intel.com Mon Oct 13 19:31:03 2008 From: eddie.dong at intel.com (Dong, Eddie) Date: Tue, 14 Oct 2008 10:31:03 +0800 Subject: [PATCH 6/6 v3] PCI: document the change In-Reply-To: <20081014010827.GX25780@parisc-linux.org> References: <20081001160706.GI13822@parisc-linux.org> <08DF4D958216244799FC84F3514D70F00235CC69@pdsmsx415.ccr.corp.intel.com> <20081014010827.GX25780@parisc-linux.org> Message-ID: <08DF4D958216244799FC84F3514D70F00235CE27@pdsmsx415.ccr.corp.intel.com> Matthew Wilcox wrote: > On Tue, Oct 14, 2008 at 08:23:34AM +0800, Dong, Eddie > wrote: >> Matthew Wilcox wrote: >>> Wouldn't it be more useful to have the iov/N directories >>> be a symlink to the actual pci_dev used by the virtual >>> function? >> >> The main concern here is that a VF may be disabed such >> as when PF enter D3 state or undergo an reset and thus >> be plug-off, but user won't re-configure the VF in case >> the PF return back to working state. > > If we're relying on the user to reconfigure virtual > functions on return to D0 from D3, that's a very fragile > system. No. that is the concern we don't put those configuration under VF nodes because it will disappear. Do I miss something? 
> >>>> +For network device, there are: >>>> + - /sys/bus/pci/devices/BB:DD.F/iov/N/mac >>>> + - /sys/bus/pci/devices/BB:DD.F/iov/N/vlan >>>> + (value update will notify PF driver) >>> >>> We already have tools to set the MAC and VLAN parameters >>> for network devices. >> >> Do you mean Ethtool? If yes, it is impossible for SR-IOV >> since the configuration has to be done in PF side, >> rather than VF. > > I don't think ethtool has that ability; ip(8) can set mac > addresses and vconfig(8) sets vlan parameters. > > The device driver already has to be aware of SR-IOV. If > it's going to support the standard tools (and it damn > well ought to), then it should call the PF driver to set > these kinds of parameters. OK, as if it has the VF parameter, will look into details. BTW, the SR-IOV patch is not only for network, some other devices such as IDE will use same code base as well and we image it could have other parameter to set such as starting LBA of a IDE VF. > >>> I'm not 100% convinced about this API. The assumption >>> here is that the driver will do it, but I think it >>> should probably be in the core. The driver probably >>> wants to be >> >> Our concern is that the PF driver may put an default >> state when it is loaded so that SR-IOV can work without >> any user level configuration, but of course the driver >> won't dynamically change it. >> Do u mean we remove this ability? >> >>> notified that the PCI core is going to create a virtual >>> function, and would it please prepare to do so, but I'm >>> not convinced this should be triggered by the driver. >>> How would the driver decide to create a new virtual >>> function? > > Let me try to explain this a bit better. > > The user decides they want a new ethernet virtual > function. In the scheme as you have set up: > > 1. User communicates to ethernet driver "I want a new VF" > 2. Ethernet driver tells PCI core "create new VF". > > I propose: > > 1. User tells PCI core "I want a new VF on PCI device > 0000:01:03.0" > 2. PCI core tells driver "User wants a new VF" If user need a new VF, the VF must be already enabled or existed in OS. Otherwise, we need to disable all VFs first and then change NumVFs to re-enable VFs. Spec says: "NumVFs may only be written while VF Enable is Clear" > > My scheme gives us a unified way of creating new VFs, > yours requires each driver to invent a way for the user > to tell them to create a new VF. Unless I've > misunderstood your code and docs. Assign a VF is kind of user & core job. > >>> From my reading of the SR-IOV spec, this isn't how it's >>> supposed to work. The device is supposed to be a fully >>> functional PCI device that on demand can start peeling >>> off virtual functions; it's not supposed to boot up and >>> initialise all its virtual functions at once. >> >> The spec defines either we enable all VFs or Disable. >> Per VF enabling is not supported. Is this what you >> concern? > > I don't think that's true. The spec requires you to > enable all the > VFs from 0 to NumVFs, but NumVFs can be lower than > TotalVFs. At least, that's how I read it. Yes, but setting NumVFs can only occur when VFs are disabled. Following are from spec. NumVFs may only be written while VF Enable is Clear. If NumVFs is written when VF Enable is Set, the results are undefined. The initial value of NumVFs is undefined. > > But no, that isn't my concern. My concern is that you've > written a driver here that seems to be a stub driver. 
> That doesn't seem to be > how SR-IOV is supposed to work; it's supposed to be a > fully-functional driver that has SR-IOV knowledge added > to it. Yes, it is a full feature driver as if PF has resource in, for example not all queues are assigned to VFs. Thx, eddie From yu.zhao at intel.com Mon Oct 13 19:14:35 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Tue, 14 Oct 2008 10:14:35 +0800 Subject: [PATCH 6/6 v3] PCI: document the change In-Reply-To: <08DF4D958216244799FC84F3514D70F00235CE27@pdsmsx415.ccr.corp.intel.com> References: <20081001160706.GI13822@parisc-linux.org> <08DF4D958216244799FC84F3514D70F00235CC69@pdsmsx415.ccr.corp.intel.com> <20081014010827.GX25780@parisc-linux.org> <08DF4D958216244799FC84F3514D70F00235CE27@pdsmsx415.ccr.corp.intel.com> Message-ID: <20081014021435.GA1482@yzhao12-linux.sh.intel.com> On Tue, Oct 14, 2008 at 10:31:03AM +0800, Dong, Eddie wrote: > Matthew Wilcox wrote: > > On Tue, Oct 14, 2008 at 08:23:34AM +0800, Dong, Eddie > > wrote: > >> Matthew Wilcox wrote: > >>> Wouldn't it be more useful to have the iov/N directories > >>> be a symlink to the actual pci_dev used by the virtual > >>> function? > >> > >> The main concern here is that a VF may be disabed such > >> as when PF enter D3 state or undergo an reset and thus > >> be plug-off, but user won't re-configure the VF in case > >> the PF return back to working state. > > > > If we're relying on the user to reconfigure virtual > > functions on return to D0 from D3, that's a very fragile > > system. > > No. that is the concern we don't put those configuration under VF nodes because it will disappear. > Do I miss something? > > > > >>>> +For network device, there are: > >>>> + - /sys/bus/pci/devices/BB:DD.F/iov/N/mac > >>>> + - /sys/bus/pci/devices/BB:DD.F/iov/N/vlan > >>>> + (value update will notify PF driver) > >>> > >>> We already have tools to set the MAC and VLAN parameters > >>> for network devices. > >> > >> Do you mean Ethtool? If yes, it is impossible for SR-IOV > >> since the configuration has to be done in PF side, > >> rather than VF. > > > > I don't think ethtool has that ability; ip(8) can set mac > > addresses and vconfig(8) sets vlan parameters. > > > > The device driver already has to be aware of SR-IOV. If > > it's going to support the standard tools (and it damn > > well ought to), then it should call the PF driver to set > > these kinds of parameters. > > OK, as if it has the VF parameter, will look into details. Neither ip(8) nor vconfig(8) can set MAC and VLAN address for VF when the VF driver is not loaded. > BTW, the SR-IOV patch is not only for network, some other devices such as IDE will use same code base as well and we image it could have other parameter to set such as starting LBA of a IDE VF. As Eddie said, we have two problems here: 1) User has to set device specific parameters of a VF when he wants to use this VF with KVM (assign this device to KVM guest). In this case, VF driver is not loaded in the host environment. So operations which are implemented as driver callback (e.g. set_mac_address()) are not supported. 2) For security reason, some SR-IOV devices prohibit the VF driver configuring the VF via its own register space. Instead, the configurations must be done through the PF which the VF is associated with. This means PF driver has to receive parameters that are used to configure its VFs. These parameters obviously can be passed by traditional tools, if without modification for SR-IOV. 
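To make the second point concrete, the following is a minimal, purely illustrative sketch of a PF driver programming a VF's MAC address through PF-owned registers, which is the kind of operation that cannot go through the VF's own driver when that driver is not loaded in the host. The structure, function name, and register offsets here are hypothetical; how the request actually reaches the PF driver (a sysfs entry, a callback from the PCI core, or a networking hook) is exactly what is being debated in this thread.

#include <linux/io.h>
#include <linux/spinlock.h>
#include <linux/types.h>

/* Hypothetical PF driver helper: apply a MAC address to VF 'vf' using
 * registers owned by the PF, because the VF itself may be assigned to a
 * guest or have no driver bound in the host. */
struct pf_priv {
	void __iomem *regs;	/* PF BAR mapping */
	spinlock_t lock;
};

/* made-up register layout: 16 bytes of per-VF configuration space */
#define PF_VF_MAC_LO(vf)	(0x1000 + (vf) * 0x10)
#define PF_VF_MAC_HI(vf)	(0x1004 + (vf) * 0x10)

static void pf_set_vf_mac(struct pf_priv *priv, unsigned int vf, const u8 *mac)
{
	u32 lo = mac[0] | (mac[1] << 8) | (mac[2] << 16) | ((u32)mac[3] << 24);
	u32 hi = mac[4] | (mac[5] << 8);
	unsigned long flags;

	spin_lock_irqsave(&priv->lock, flags);
	writel(lo, priv->regs + PF_VF_MAC_LO(vf));
	writel(hi, priv->regs + PF_VF_MAC_HI(vf));
	spin_unlock_irqrestore(&priv->lock, flags);
}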
From matthew at wil.cx Mon Oct 13 21:01:05 2008 From: matthew at wil.cx (Matthew Wilcox) Date: Mon, 13 Oct 2008 22:01:05 -0600 Subject: [PATCH 6/6 v3] PCI: document the change In-Reply-To: <20081014021435.GA1482@yzhao12-linux.sh.intel.com> References: <20081001160706.GI13822@parisc-linux.org> <08DF4D958216244799FC84F3514D70F00235CC69@pdsmsx415.ccr.corp.intel.com> <20081014010827.GX25780@parisc-linux.org> <08DF4D958216244799FC84F3514D70F00235CE27@pdsmsx415.ccr.corp.intel.com> <20081014021435.GA1482@yzhao12-linux.sh.intel.com> Message-ID: <20081014040105.GA25780@parisc-linux.org> On Tue, Oct 14, 2008 at 10:14:35AM +0800, Yu Zhao wrote: > > BTW, the SR-IOV patch is not only for network, some other devices such as IDE will use same code base as well and we image it could have other parameter to set such as starting LBA of a IDE VF. > > As Eddie said, we have two problems here: > 1) User has to set device specific parameters of a VF when he wants to > use this VF with KVM (assign this device to KVM guest). In this case, > VF driver is not loaded in the host environment. So operations which > are implemented as driver callback (e.g. set_mac_address()) are not > supported. I suspect what you want to do is create, then configure the device in the host, then assign it to the guest. > 2) For security reason, some SR-IOV devices prohibit the VF driver > configuring the VF via its own register space. Instead, the configurations > must be done through the PF which the VF is associated with. This means PF > driver has to receive parameters that are used to configure its VFs. These > parameters obviously can be passed by traditional tools, if without > modification for SR-IOV. I think that idea also covers this point. -- Matthew Wilcox Intel Open Source Technology Centre "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." From eddie.dong at intel.com Mon Oct 13 21:18:40 2008 From: eddie.dong at intel.com (Dong, Eddie) Date: Tue, 14 Oct 2008 12:18:40 +0800 Subject: [PATCH 6/6 v3] PCI: document the change In-Reply-To: <20081014040105.GA25780@parisc-linux.org> References: <20081001160706.GI13822@parisc-linux.org> <08DF4D958216244799FC84F3514D70F00235CC69@pdsmsx415.ccr.corp.intel.com> <20081014010827.GX25780@parisc-linux.org> <08DF4D958216244799FC84F3514D70F00235CE27@pdsmsx415.ccr.corp.intel.com> <20081014021435.GA1482@yzhao12-linux.sh.intel.com> <20081014040105.GA25780@parisc-linux.org> Message-ID: <08DF4D958216244799FC84F3514D70F00235CF5E@pdsmsx415.ccr.corp.intel.com> Matthew Wilcox wrote: > On Tue, Oct 14, 2008 at 10:14:35AM +0800, Yu Zhao wrote: >>> BTW, the SR-IOV patch is not only for network, some >>> other devices such as IDE will use same code base as >>> well and we image it could have other parameter to set >>> such as starting LBA of a IDE VF. >> >> As Eddie said, we have two problems here: >> 1) User has to set device specific parameters of a VF >> when he wants to use this VF with KVM (assign this >> device to KVM guest). In this case, >> VF driver is not loaded in the host environment. So >> operations which >> are implemented as driver callback (e.g. >> set_mac_address()) are not supported. > > I suspect what you want to do is create, then configure > the device in the host, then assign it to the guest. That is not true. Rememver the created VFs will be destroyed no matter for PF power event or error recovery conducted reset. 
So what we want is:

Config, create, assign, and then deassign and destroy and then
recreate...

>
>> 2) For security reason, some SR-IOV devices prohibit the
>> VF driver configuring the VF via its own register space.
>> Instead, the configurations must be done through the PF
>> which the VF is associated with. This means PF driver
>> has to receive parameters that are used to configure its
>> VFs. These parameters obviously can be passed by
>> traditional tools, if without modification for SR-IOV.
>
> I think that idea also covers this point.
>

Sorry, can you explain a little bit more? The SR-IOV patch won't define what kind of entries should be created or not; we leave it to the network subsystem to decide what to do. Same for the disk subsystem, etc.

Thx, eddie

From matthew at wil.cx Mon Oct 13 21:46:26 2008
From: matthew at wil.cx (Matthew Wilcox)
Date: Mon, 13 Oct 2008 22:46:26 -0600
Subject: [PATCH 6/6 v3] PCI: document the change
In-Reply-To: <08DF4D958216244799FC84F3514D70F00235CF5E@pdsmsx415.ccr.corp.intel.com>
References: <20081001160706.GI13822@parisc-linux.org> <08DF4D958216244799FC84F3514D70F00235CC69@pdsmsx415.ccr.corp.intel.com> <20081014010827.GX25780@parisc-linux.org> <08DF4D958216244799FC84F3514D70F00235CE27@pdsmsx415.ccr.corp.intel.com> <20081014021435.GA1482@yzhao12-linux.sh.intel.com> <20081014040105.GA25780@parisc-linux.org> <08DF4D958216244799FC84F3514D70F00235CF5E@pdsmsx415.ccr.corp.intel.com>
Message-ID: <20081014044626.GB25780@parisc-linux.org>

On Tue, Oct 14, 2008 at 12:18:40PM +0800, Dong, Eddie wrote:
> Matthew Wilcox wrote:
> > On Tue, Oct 14, 2008 at 10:14:35AM +0800, Yu Zhao wrote:
> >> As Eddie said, we have two problems here:
> >> 1) User has to set device specific parameters of a VF
> >> when he wants to use this VF with KVM (assign this
> >> device to KVM guest). In this case,
> >> VF driver is not loaded in the host environment. So
> >> operations which
> >> are implemented as driver callback (e.g.
> >> set_mac_address()) are not supported.
> >
> > I suspect what you want to do is create, then configure
> > the device in the host, then assign it to the guest.
>
> That is not true. Rememver the created VFs will be destroyed no matter
> for PF power event or error recovery conducted reset.
> So what we want is:
>
> Config, create, assign, and then deassign and destroy and then
> recreate...

Yes, but my point is this all happens in the _host_, not in the _guest_.

> Sorry, can you explain a little bit more? The SR-IOV patch won't define
> what kind of entries should be created or not; we leave it to the network
> subsystem to decide what to do. Same for the disk subsystem, etc.

No entries should be created. This needs to be not SR-IOV specific.

--
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
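As a concrete illustration of the flow Matthew is arguing for (the user asks the PCI core for VFs, and the core then calls into the PF driver before enabling them), a sketch of what the core side might look like is below. The sriov_prepare() hook and pci_core_enable_vfs() helper are hypothetical names invented for this example; neither exists in the posted patches, and the sysfs file is only a stand-in for whatever core-owned trigger would be chosen.

#include <linux/device.h>
#include <linux/pci.h>

/* Hypothetical sysfs store routine owned by the PCI core, not by any
 * individual PF driver. The user writes the number of VFs they want;
 * the core notifies the PF driver, then enables the VFs itself. */
static ssize_t sriov_numvfs_store(struct device *dev,
				  struct device_attribute *attr,
				  const char *buf, size_t count)
{
	struct pci_dev *pdev = to_pci_dev(dev);
	int nr_virtfn, ret;

	if (sscanf(buf, "%d", &nr_virtfn) != 1 || nr_virtfn < 0)
		return -EINVAL;

	/* 1. the user asked the core for nr_virtfn VFs;
	 * 2. the core asks the PF driver to prepare (hypothetical hook
	 *    in struct pci_driver, not a real field); */
	ret = pdev->driver->sriov_prepare(pdev, nr_virtfn);
	if (ret)
		return ret;

	/* 3. the core itself writes NumVFs and sets VF Enable in the
	 *    SR-IOV capability (hypothetical helper). */
	ret = pci_core_enable_vfs(pdev, nr_virtfn);
	return ret ? ret : count;
}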
From yu.zhao at intel.com Mon Oct 13 21:06:59 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Tue, 14 Oct 2008 12:06:59 +0800 Subject: [PATCH 6/6 v3] PCI: document the change In-Reply-To: <20081014040105.GA25780@parisc-linux.org> References: <20081001160706.GI13822@parisc-linux.org> <08DF4D958216244799FC84F3514D70F00235CC69@pdsmsx415.ccr.corp.intel.com> <20081014010827.GX25780@parisc-linux.org> <08DF4D958216244799FC84F3514D70F00235CE27@pdsmsx415.ccr.corp.intel.com> <20081014021435.GA1482@yzhao12-linux.sh.intel.com> <20081014040105.GA25780@parisc-linux.org> Message-ID: <20081014040659.GB1482@yzhao12-linux.sh.intel.com> On Tue, Oct 14, 2008 at 12:01:05PM +0800, Matthew Wilcox wrote: > On Tue, Oct 14, 2008 at 10:14:35AM +0800, Yu Zhao wrote: > > > BTW, the SR-IOV patch is not only for network, some other devices such as IDE will use same code base as well and we image it could have other parameter to set such as starting LBA of a IDE VF. > > > > As Eddie said, we have two problems here: > > 1) User has to set device specific parameters of a VF when he wants to > > use this VF with KVM (assign this device to KVM guest). In this case, > > VF driver is not loaded in the host environment. So operations which > > are implemented as driver callback (e.g. set_mac_address()) are not > > supported. > > I suspect what you want to do is create, then configure the device in > the host, then assign it to the guest. > > > 2) For security reason, some SR-IOV devices prohibit the VF driver > > configuring the VF via its own register space. Instead, the configurations > > must be done through the PF which the VF is associated with. This means PF > > driver has to receive parameters that are used to configure its VFs. These > > parameters obviously can be passed by traditional tools, if without ^^^ Sorry, here I meant to say 'can not'. > > modification for SR-IOV. > > I think that idea also covers this point. Can you please elaborate this? Thanks a lot. > > -- > Matthew Wilcox Intel Open Source Technology Centre > "Bill, look, we understand that you're interested in selling us this > operating system, but compare it to ours. We can't possibly take such > a retrograde step." > -- > To unsubscribe from this list: send the line "unsubscribe linux-pci" in > the body of a message to majordomo at vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html From yamahata at valinux.co.jp Mon Oct 13 22:51:16 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Tue, 14 Oct 2008 14:51:16 +0900 Subject: [PATCH 01/32] ia64/pv_ops: avoid name conflict of get_irq_chip(). In-Reply-To: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> References: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1223963507-28056-2-git-send-email-yamahata@valinux.co.jp> The macro get_irq_chip() is defined in linux/include/linux/irq.h which cause name conflict with one in linux/arch/ia64/include/asm/paravirt.h. rename the latter to __get_irq_chip(). 
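The breakage is ordinary function-like macro expansion: the member declaration itself is harmless, but any call through the member is rewritten by the preprocessor once linux/irq.h is in scope. A small illustration follows; the macro body shown is only indicative of the kind of definition found in linux/irq.h, not a quote of it.

/* If a header in scope defines something like: */
#define get_irq_chip(irq)	(irq_desc[irq].chip)	/* illustrative body only */

/* ...then this call through the ops structure: */
static inline struct irq_chip *iosapic_get_irq_chip(unsigned long trigger)
{
	return pv_iosapic_ops.get_irq_chip(trigger);
}
/* ...is macro-expanded into
 *	return pv_iosapic_ops.(irq_desc[trigger].chip);
 * which no longer compiles. Renaming the member to __get_irq_chip()
 * keeps the identifier out of the macro's way. */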
Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/paravirt.h | 4 ++-- arch/ia64/kernel/paravirt.c | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/ia64/include/asm/paravirt.h b/arch/ia64/include/asm/paravirt.h index 660cab0..2bf3636 100644 --- a/arch/ia64/include/asm/paravirt.h +++ b/arch/ia64/include/asm/paravirt.h @@ -117,7 +117,7 @@ static inline void paravirt_post_smp_prepare_boot_cpu(void) struct pv_iosapic_ops { void (*pcat_compat_init)(void); - struct irq_chip *(*get_irq_chip)(unsigned long trigger); + struct irq_chip *(*__get_irq_chip)(unsigned long trigger); unsigned int (*__read)(char __iomem *iosapic, unsigned int reg); void (*__write)(char __iomem *iosapic, unsigned int reg, u32 val); @@ -135,7 +135,7 @@ iosapic_pcat_compat_init(void) static inline struct irq_chip* iosapic_get_irq_chip(unsigned long trigger) { - return pv_iosapic_ops.get_irq_chip(trigger); + return pv_iosapic_ops.__get_irq_chip(trigger); } static inline unsigned int diff --git a/arch/ia64/kernel/paravirt.c b/arch/ia64/kernel/paravirt.c index afaf5b9..de35d8e 100644 --- a/arch/ia64/kernel/paravirt.c +++ b/arch/ia64/kernel/paravirt.c @@ -332,7 +332,7 @@ ia64_native_iosapic_write(char __iomem *iosapic, unsigned int reg, u32 val) struct pv_iosapic_ops pv_iosapic_ops = { .pcat_compat_init = ia64_native_iosapic_pcat_compat_init, - .get_irq_chip = ia64_native_iosapic_get_irq_chip, + .__get_irq_chip = ia64_native_iosapic_get_irq_chip, .__read = ia64_native_iosapic_read, .__write = ia64_native_iosapic_write, -- 1.6.0.2 From yamahata at valinux.co.jp Mon Oct 13 22:51:17 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Tue, 14 Oct 2008 14:51:17 +0900 Subject: [PATCH 02/32] ia64/pv_ops: update native/inst.h to clobber predicate. In-Reply-To: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> References: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1223963507-28056-3-git-send-email-yamahata@valinux.co.jp> add CLOBBER_PRED() to clobber predicate register. Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/native/inst.h | 10 ++++++++-- 1 files changed, 8 insertions(+), 2 deletions(-) diff --git a/arch/ia64/include/asm/native/inst.h b/arch/ia64/include/asm/native/inst.h index c8efbf7..0a1026c 100644 --- a/arch/ia64/include/asm/native/inst.h +++ b/arch/ia64/include/asm/native/inst.h @@ -36,8 +36,13 @@ ;; \ movl clob = PARAVIRT_POISON; \ ;; +# define CLOBBER_PRED(pred_clob) \ + ;; \ + cmp.eq pred_clob, p0 = r0, r0 \ + ;; #else -# define CLOBBER(clob) /* nothing */ +# define CLOBBER(clob) /* nothing */ +# define CLOBBER_PRED(pred_clob) /* nothing */ #endif #define MOV_FROM_IFA(reg) \ @@ -136,7 +141,8 @@ #define SSM_PSR_I(pred, pred_clob, clob) \ (pred) ssm psr.i \ - CLOBBER(clob) + CLOBBER(clob) \ + CLOBBER_PRED(pred_clob) #define RSM_PSR_I(pred, clob0, clob1) \ (pred) rsm psr.i \ -- 1.6.0.2 From yamahata at valinux.co.jp Mon Oct 13 22:51:22 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Tue, 14 Oct 2008 14:51:22 +0900 Subject: [PATCH 07/32] ia64/xen: introduce definitions necessary for ia64/xen hypercalls. In-Reply-To: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> References: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1223963507-28056-8-git-send-email-yamahata@valinux.co.jp> import arch/ia64/include/asm/xen/interface.h to introduce definitions necessary for ia64/xen hypercalls. They are basic structures to communicate with xen hypervisor and will be used later. 
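For readers unfamiliar with these headers, the guest-handle macros defined in the file below wrap raw pointers in a distinct type before they are handed to the hypervisor. A rough usage sketch, assuming a hypercall wrapper whose name is invented for this example:

/* Sketch only: example_hypercall() stands in for whatever wrapper
 * eventually passes the argument structure to the hypervisor. */
struct example_args {
	GUEST_HANDLE(ulong) buffer;	/* typed handle, i.e. { unsigned long *p; } */
	unsigned int nr_entries;
};

static int example_query(unsigned long *buf, unsigned int nr)
{
	struct example_args args;

	set_xen_guest_handle(args.buffer, buf);	/* expands to args.buffer.p = buf */
	args.nr_entries = nr;

	return example_hypercall(&args);	/* hypothetical */
}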
Cc: Robin Holt Cc: Jeremy Fitzhardinge Signed-off-by: Isaku Yamahata Cc: "Luck, Tony" --- arch/ia64/include/asm/xen/interface.h | 346 +++++++++++++++++++++++++++++++++ 1 files changed, 346 insertions(+), 0 deletions(-) create mode 100644 arch/ia64/include/asm/xen/interface.h diff --git a/arch/ia64/include/asm/xen/interface.h b/arch/ia64/include/asm/xen/interface.h new file mode 100644 index 0000000..f00fab4 --- /dev/null +++ b/arch/ia64/include/asm/xen/interface.h @@ -0,0 +1,346 @@ +/****************************************************************************** + * arch-ia64/hypervisor-if.h + * + * Guest OS interface to IA64 Xen. + * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this software and associated documentation files (the "Software"), to + * deal in the Software without restriction, including without limitation the + * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or + * sell copies of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER + * DEALINGS IN THE SOFTWARE. + * + * Copyright by those who contributed. (in alphabetical order) + * + * Anthony Xu + * Eddie Dong + * Fred Yang + * Kevin Tian + * Alex Williamson + * Chris Wright + * Christian Limpach + * Dietmar Hahn + * Hollis Blanchard + * Isaku Yamahata + * Jan Beulich + * John Levon + * Kazuhiro Suzuki + * Keir Fraser + * Kouya Shimura + * Masaki Kanno + * Matt Chapman + * Matthew Chapman + * Samuel Thibault + * Tomonari Horikoshi + * Tristan Gingold + * Tsunehisa Doi + * Yutaka Ezaki + * Zhang Xin + * Zhang xiantao + * dan.magenheimer at hp.com + * ian.pratt at cl.cam.ac.uk + * michael.fetterman at cl.cam.ac.uk + */ + +#ifndef _ASM_IA64_XEN_INTERFACE_H +#define _ASM_IA64_XEN_INTERFACE_H + +#define __DEFINE_GUEST_HANDLE(name, type) \ + typedef struct { type *p; } __guest_handle_ ## name + +#define DEFINE_GUEST_HANDLE_STRUCT(name) \ + __DEFINE_GUEST_HANDLE(name, struct name) +#define DEFINE_GUEST_HANDLE(name) __DEFINE_GUEST_HANDLE(name, name) +#define GUEST_HANDLE(name) __guest_handle_ ## name +#define GUEST_HANDLE_64(name) GUEST_HANDLE(name) +#define set_xen_guest_handle(hnd, val) do { (hnd).p = val; } while (0) + +#ifndef __ASSEMBLY__ +/* Guest handles for primitive C types. */ +__DEFINE_GUEST_HANDLE(uchar, unsigned char); +__DEFINE_GUEST_HANDLE(uint, unsigned int); +__DEFINE_GUEST_HANDLE(ulong, unsigned long); +__DEFINE_GUEST_HANDLE(u64, unsigned long); +DEFINE_GUEST_HANDLE(char); +DEFINE_GUEST_HANDLE(int); +DEFINE_GUEST_HANDLE(long); +DEFINE_GUEST_HANDLE(void); + +typedef unsigned long xen_pfn_t; +DEFINE_GUEST_HANDLE(xen_pfn_t); +#define PRI_xen_pfn "lx" +#endif + +/* Arch specific VIRQs definition */ +#define VIRQ_ITC VIRQ_ARCH_0 /* V. 
Virtual itc timer */ +#define VIRQ_MCA_CMC VIRQ_ARCH_1 /* MCA cmc interrupt */ +#define VIRQ_MCA_CPE VIRQ_ARCH_2 /* MCA cpe interrupt */ + +/* Maximum number of virtual CPUs in multi-processor guests. */ +/* keep sizeof(struct shared_page) <= PAGE_SIZE. + * this is checked in arch/ia64/xen/hypervisor.c. */ +#define MAX_VIRT_CPUS 64 + +#ifndef __ASSEMBLY__ + +#define INVALID_MFN (~0UL) + +union vac { + unsigned long value; + struct { + int a_int:1; + int a_from_int_cr:1; + int a_to_int_cr:1; + int a_from_psr:1; + int a_from_cpuid:1; + int a_cover:1; + int a_bsw:1; + long reserved:57; + }; +}; + +union vdc { + unsigned long value; + struct { + int d_vmsw:1; + int d_extint:1; + int d_ibr_dbr:1; + int d_pmc:1; + int d_to_pmd:1; + int d_itm:1; + long reserved:58; + }; +}; + +struct mapped_regs { + union vac vac; + union vdc vdc; + unsigned long virt_env_vaddr; + unsigned long reserved1[29]; + unsigned long vhpi; + unsigned long reserved2[95]; + union { + unsigned long vgr[16]; + unsigned long bank1_regs[16]; /* bank1 regs (r16-r31) + when bank0 active */ + }; + union { + unsigned long vbgr[16]; + unsigned long bank0_regs[16]; /* bank0 regs (r16-r31) + when bank1 active */ + }; + unsigned long vnat; + unsigned long vbnat; + unsigned long vcpuid[5]; + unsigned long reserved3[11]; + unsigned long vpsr; + unsigned long vpr; + unsigned long reserved4[76]; + union { + unsigned long vcr[128]; + struct { + unsigned long dcr; /* CR0 */ + unsigned long itm; + unsigned long iva; + unsigned long rsv1[5]; + unsigned long pta; /* CR8 */ + unsigned long rsv2[7]; + unsigned long ipsr; /* CR16 */ + unsigned long isr; + unsigned long rsv3; + unsigned long iip; + unsigned long ifa; + unsigned long itir; + unsigned long iipa; + unsigned long ifs; + unsigned long iim; /* CR24 */ + unsigned long iha; + unsigned long rsv4[38]; + unsigned long lid; /* CR64 */ + unsigned long ivr; + unsigned long tpr; + unsigned long eoi; + unsigned long irr[4]; + unsigned long itv; /* CR72 */ + unsigned long pmv; + unsigned long cmcv; + unsigned long rsv5[5]; + unsigned long lrr0; /* CR80 */ + unsigned long lrr1; + unsigned long rsv6[46]; + }; + }; + union { + unsigned long reserved5[128]; + struct { + unsigned long precover_ifs; + unsigned long unat; /* not sure if this is needed + until NaT arch is done */ + int interrupt_collection_enabled; /* virtual psr.ic */ + + /* virtual interrupt deliverable flag is + * evtchn_upcall_mask in shared info area now. + * interrupt_mask_addr is the address + * of evtchn_upcall_mask for current vcpu + */ + unsigned char *interrupt_mask_addr; + int pending_interruption; + unsigned char vpsr_pp; + unsigned char vpsr_dfh; + unsigned char hpsr_dfh; + unsigned char hpsr_mfh; + unsigned long reserved5_1[4]; + int metaphysical_mode; /* 1 = use metaphys mapping + 0 = use virtual */ + int banknum; /* 0 or 1, which virtual + register bank is active */ + unsigned long rrs[8]; /* region registers */ + unsigned long krs[8]; /* kernel registers */ + unsigned long tmp[16]; /* temp registers + (e.g. for hyperprivops) */ + }; + }; +}; + +struct arch_vcpu_info { + /* nothing */ +}; + +/* + * This structure is used for magic page in domain pseudo physical address + * space and the result of XENMEM_machine_memory_map. + * As the XENMEM_machine_memory_map result, + * xen_memory_map::nr_entries indicates the size in bytes + * including struct xen_ia64_memmap_info. Not the number of entries. 
+ */ +struct xen_ia64_memmap_info { + uint64_t efi_memmap_size; /* size of EFI memory map */ + uint64_t efi_memdesc_size; /* size of an EFI memory map + * descriptor */ + uint32_t efi_memdesc_version; /* memory descriptor version */ + void *memdesc[0]; /* array of efi_memory_desc_t */ +}; + +struct arch_shared_info { + /* PFN of the start_info page. */ + unsigned long start_info_pfn; + + /* Interrupt vector for event channel. */ + int evtchn_vector; + + /* PFN of memmap_info page */ + unsigned int memmap_info_num_pages; /* currently only = 1 case is + supported. */ + unsigned long memmap_info_pfn; + + uint64_t pad[31]; +}; + +struct xen_callback { + unsigned long ip; +}; +typedef struct xen_callback xen_callback_t; + +#endif /* !__ASSEMBLY__ */ + +/* Size of the shared_info area (this is not related to page size). */ +#define XSI_SHIFT 14 +#define XSI_SIZE (1 << XSI_SHIFT) +/* Log size of mapped_regs area (64 KB - only 4KB is used). */ +#define XMAPPEDREGS_SHIFT 12 +#define XMAPPEDREGS_SIZE (1 << XMAPPEDREGS_SHIFT) +/* Offset of XASI (Xen arch shared info) wrt XSI_BASE. */ +#define XMAPPEDREGS_OFS XSI_SIZE + +/* Hyperprivops. */ +#define HYPERPRIVOP_START 0x1 +#define HYPERPRIVOP_RFI (HYPERPRIVOP_START + 0x0) +#define HYPERPRIVOP_RSM_DT (HYPERPRIVOP_START + 0x1) +#define HYPERPRIVOP_SSM_DT (HYPERPRIVOP_START + 0x2) +#define HYPERPRIVOP_COVER (HYPERPRIVOP_START + 0x3) +#define HYPERPRIVOP_ITC_D (HYPERPRIVOP_START + 0x4) +#define HYPERPRIVOP_ITC_I (HYPERPRIVOP_START + 0x5) +#define HYPERPRIVOP_SSM_I (HYPERPRIVOP_START + 0x6) +#define HYPERPRIVOP_GET_IVR (HYPERPRIVOP_START + 0x7) +#define HYPERPRIVOP_GET_TPR (HYPERPRIVOP_START + 0x8) +#define HYPERPRIVOP_SET_TPR (HYPERPRIVOP_START + 0x9) +#define HYPERPRIVOP_EOI (HYPERPRIVOP_START + 0xa) +#define HYPERPRIVOP_SET_ITM (HYPERPRIVOP_START + 0xb) +#define HYPERPRIVOP_THASH (HYPERPRIVOP_START + 0xc) +#define HYPERPRIVOP_PTC_GA (HYPERPRIVOP_START + 0xd) +#define HYPERPRIVOP_ITR_D (HYPERPRIVOP_START + 0xe) +#define HYPERPRIVOP_GET_RR (HYPERPRIVOP_START + 0xf) +#define HYPERPRIVOP_SET_RR (HYPERPRIVOP_START + 0x10) +#define HYPERPRIVOP_SET_KR (HYPERPRIVOP_START + 0x11) +#define HYPERPRIVOP_FC (HYPERPRIVOP_START + 0x12) +#define HYPERPRIVOP_GET_CPUID (HYPERPRIVOP_START + 0x13) +#define HYPERPRIVOP_GET_PMD (HYPERPRIVOP_START + 0x14) +#define HYPERPRIVOP_GET_EFLAG (HYPERPRIVOP_START + 0x15) +#define HYPERPRIVOP_SET_EFLAG (HYPERPRIVOP_START + 0x16) +#define HYPERPRIVOP_RSM_BE (HYPERPRIVOP_START + 0x17) +#define HYPERPRIVOP_GET_PSR (HYPERPRIVOP_START + 0x18) +#define HYPERPRIVOP_SET_RR0_TO_RR4 (HYPERPRIVOP_START + 0x19) +#define HYPERPRIVOP_MAX (0x1a) + +/* Fast and light hypercalls. */ +#define __HYPERVISOR_ia64_fast_eoi __HYPERVISOR_arch_1 + +/* Xencomm macros. */ +#define XENCOMM_INLINE_MASK 0xf800000000000000UL +#define XENCOMM_INLINE_FLAG 0x8000000000000000UL + +#ifndef __ASSEMBLY__ + +/* + * Optimization features. + * The hypervisor may do some special optimizations for guests. This hypercall + * can be used to switch on/of these special optimizations. + */ +#define __HYPERVISOR_opt_feature 0x700UL + +#define XEN_IA64_OPTF_OFF 0x0 +#define XEN_IA64_OPTF_ON 0x1 + +/* + * If this feature is switched on, the hypervisor inserts the + * tlb entries without calling the guests traphandler. + * This is useful in guests using region 7 for identity mapping + * like the linux kernel does. + */ +#define XEN_IA64_OPTF_IDENT_MAP_REG7 1 + +/* Identity mapping of region 4 addresses in HVM. 
*/ +#define XEN_IA64_OPTF_IDENT_MAP_REG4 2 + +/* Identity mapping of region 5 addresses in HVM. */ +#define XEN_IA64_OPTF_IDENT_MAP_REG5 3 + +#define XEN_IA64_OPTF_IDENT_MAP_NOT_SET (0) + +struct xen_ia64_opt_feature { + unsigned long cmd; /* Which feature */ + unsigned char on; /* Switch feature on/off */ + union { + struct { + /* The page protection bit mask of the pte. + * This will be or'ed with the pte. */ + unsigned long pgprot; + unsigned long key; /* A protection key for itir.*/ + }; + }; +}; + +#endif /* __ASSEMBLY__ */ + +#endif /* _ASM_IA64_XEN_INTERFACE_H */ -- 1.6.0.2 From yamahata at valinux.co.jp Mon Oct 13 22:51:20 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Tue, 14 Oct 2008 14:51:20 +0900 Subject: [PATCH 05/32] ia64/xen: introduce sync bitops which is necessary for ia64/xen support. In-Reply-To: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> References: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1223963507-28056-6-git-send-email-yamahata@valinux.co.jp> Define sync bitops, which are necessary for ia64/xen. These bit operations are used to communicate with the VMM or with other guest kernels. Even when this kernel is built for UP, the VMM might be SMP, so these operations must always be atomic. Cc: Robin Holt Cc: Jeremy Fitzhardinge Signed-off-by: Isaku Yamahata Cc: "Luck, Tony" --- arch/ia64/include/asm/sync_bitops.h | 51 +++++++++++++++++++++++++++++++++++ 1 files changed, 51 insertions(+), 0 deletions(-) create mode 100644 arch/ia64/include/asm/sync_bitops.h diff --git a/arch/ia64/include/asm/sync_bitops.h b/arch/ia64/include/asm/sync_bitops.h new file mode 100644 index 0000000..593c12e --- /dev/null +++ b/arch/ia64/include/asm/sync_bitops.h @@ -0,0 +1,51 @@ +#ifndef _ASM_IA64_SYNC_BITOPS_H +#define _ASM_IA64_SYNC_BITOPS_H + +/* + * Copyright (C) 2008 Isaku Yamahata + * + * Based on synch_bitops.h which Dan Magenheimer wrote. + * + * bit operations which provide guaranteed strong synchronisation + * when communicating with Xen or other guest OSes running on other CPUs. + */ + +static inline void sync_set_bit(int nr, volatile void *addr) +{ + set_bit(nr, addr); +} + +static inline void sync_clear_bit(int nr, volatile void *addr) +{ + clear_bit(nr, addr); +} + +static inline void sync_change_bit(int nr, volatile void *addr) +{ + change_bit(nr, addr); +} + +static inline int sync_test_and_set_bit(int nr, volatile void *addr) +{ + return test_and_set_bit(nr, addr); +} + +static inline int sync_test_and_clear_bit(int nr, volatile void *addr) +{ + return test_and_clear_bit(nr, addr); +} + +static inline int sync_test_and_change_bit(int nr, volatile void *addr) +{ + return test_and_change_bit(nr, addr); +} + +static inline int sync_test_bit(int nr, const volatile void *addr) +{ + return test_bit(nr, addr); +} + +#define sync_cmpxchg(ptr, old, new) \ + ((__typeof__(*(ptr)))cmpxchg_acq((ptr), (old), (new))) + +#endif /* _ASM_IA64_SYNC_BITOPS_H */ -- 1.6.0.2 From yamahata at valinux.co.jp Mon Oct 13 22:51:15 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Tue, 14 Oct 2008 14:51:15 +0900 Subject: [PATCH 00/32] ia64/xen domU take 11 Message-ID: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> This patchset is ia64/xen domU patch take 11. Tony, please commit these patches. They are ready to commit because all the issues that were pointed out have been addressed and the patches have received enough review. This patchset does the following. - Some preparation work. Mainly importing header files to define related structures.
- Then, define functions related to hypercalls, which are the way to communicate with the Xen hypervisor. - Add some helper functions which are necessary to utilize the xen arch-generic portion. - Next, implement the xen instance of pv_ops, introducing pv_info, pv_init_ops, pv_cpu_ops and its assembler counterpart, pv_iosapic_ops, pv_irq_ops and pv_time_ops step by step (a brief sketch of this pattern follows the changelog below). - Introduce a xen machine vector to describe the xen platform. By using a machine vector, the xen domU implementation can be simplified. - Lastly, update Kconfig to allow paravirtualization support and xen domU support to compile. For convenience, the full working source is available from http://people.valinux.co.jp/~yamahata/xen-ia64/for_eagl/linux-2.6-ia64-pv-ops.git/ branch: ia64-pv-ops-2008oct14-xen-ia64 For the status of this patch series, see http://wiki.xensource.com/xenwiki/XenIA64/UpstreamMerge At this phase, we don't address the following issues. Those will be addressed after the first merge. - optimization by binary patch In fact, we had the patch to do that, but we intentionally dropped it for patch size/readability/cleanness. - freeing the unused pages, i.e. pages for unused ivt.S. - complete save/restore support - ar.itc paravirtualization which is necessary for save/restore support Changes from take 10: - rebased to 2.6.27 - renamed pv_iosapic_ops::get_irq_chip to pv_iosapic_ops::__get_irq_chip. - improved SSM_PSR_I to detect invalid register usage. - fixed consider_steal_time() of pv_time_ops. Changes from take 9: - rebased to 2.6.27-rc4 - caught up for moving header files. - caught up for x86 xen changes (mainly xen mode predicate) - enhanced pv checker to detect inappropriate register usage. - typo Changes from take 8: - rebased to 2.6.26 - updated pvclock-abi.h Changes from take 7: - various typos - clean up sync_bitops.h - style fix on include/asm-ia64/xen/interface.h - reserve the "break" numbers in include/asm-ia64/break.h - xencomm clean up - dropped NET_SKB_PAD patch. It was a bug in xen-netfront.c. - CONFIG_IA64_XEN -> CONFIG_IA64_XEN_GUEST - catch up for x86 pvclock-abi.h - work around for IPI with IA64_TIME_VECTOR - add pv checker Changes from take 6: - rebased to linux ia64 test tree - xen bsw_1 simplification. - add documentation. Documentation/ia64/xen.txt - preliminary support for save/restore. - network fix. NET_SKB_PAD. Changes from take 5: - rebased to Linux 2.6.26-rc3 - fix ivt.S paravirtualization. One instruction was wrongly paravirtualized. It wasn't revealed with a Xen HVM domain so far, but was with real hw - multi entry point support. - revised changelog to add CCs. Changes from take 4: - fix synch bit ops definitions to prevent accidental namespace clashes. - rebased and fixed breakages due to the upstream change. Changes from take 3: - split the patch set into pv_op part and xen domU part. - many clean ups. - introduced pv_ops: pv_cpu_ops and pv_time_ops. Changes from take 2: - many clean ups following comments. - clean up: assembly instruction macro. - introduced pv_ops: pv_info, pv_init_ops, pv_iosapic_ops, pv_irq_ops. Changes from take 1: Single IVT source code, compiled multiple times using assembler macros.
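As a rough illustration of the pv_ops indirection used by this series (a sketch only, not the actual kernel definitions: the pv_info fields match xen_pv_ops.c later in this series, while the is_paravirt() consumer is a hypothetical example):

struct pv_info {
	int kernel_rpl;			/* privilege level the kernel runs at */
	int paravirt_enabled;		/* 0 on bare metal, nonzero under a hypervisor */
	const char *name;		/* e.g. "Xen/ia64" */
};

extern struct pv_info pv_info;		/* holds native values by default; boot code
					 * overwrites it with the xen instance */

/* Hypothetical consumer: generic code asks pv_info instead of testing
 * "am I running on xen?" at every call site. */
static inline int is_paravirt(void)
{
	return pv_info.paravirt_enabled;
}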
thanks, Diffstat: Documentation/ia64/xen.txt | 183 +++++++++++ arch/ia64/Kconfig | 32 ++ arch/ia64/Makefile | 2 arch/ia64/include/asm/break.h | 9 arch/ia64/include/asm/machvec.h | 2 arch/ia64/include/asm/machvec_xen.h | 22 + arch/ia64/include/asm/meminit.h | 3 arch/ia64/include/asm/native/inst.h | 10 arch/ia64/include/asm/native/pvchk_inst.h | 263 +++++++++++++++++ arch/ia64/include/asm/paravirt.h | 4 arch/ia64/include/asm/pvclock-abi.h | 5 arch/ia64/include/asm/sync_bitops.h | 51 +++ arch/ia64/include/asm/timex.h | 2 arch/ia64/include/asm/xen/events.h | 50 +++ arch/ia64/include/asm/xen/grant_table.h | 29 + arch/ia64/include/asm/xen/hypercall.h | 265 +++++++++++++++++ arch/ia64/include/asm/xen/hypervisor.h | 89 +++++ arch/ia64/include/asm/xen/inst.h | 458 ++++++++++++++++++++++++++++++ arch/ia64/include/asm/xen/interface.h | 346 ++++++++++++++++++++++ arch/ia64/include/asm/xen/irq.h | 44 ++ arch/ia64/include/asm/xen/minstate.h | 134 ++++++++ arch/ia64/include/asm/xen/page.h | 65 ++++ arch/ia64/include/asm/xen/privop.h | 129 ++++++++ arch/ia64/include/asm/xen/xcom_hcall.h | 51 +++ arch/ia64/include/asm/xen/xencomm.h | 42 ++ arch/ia64/kernel/Makefile | 18 + arch/ia64/kernel/acpi.c | 5 arch/ia64/kernel/asm-offsets.c | 31 ++ arch/ia64/kernel/nr-irqs.c | 1 arch/ia64/kernel/paravirt.c | 2 arch/ia64/kernel/paravirt_inst.h | 4 arch/ia64/kernel/process.c | 1 arch/ia64/scripts/pvcheck.sed | 32 ++ arch/ia64/xen/Kconfig | 26 + arch/ia64/xen/Makefile | 42 ++ arch/ia64/xen/grant-table.c | 155 ++++++++++ arch/ia64/xen/hypercall.S | 91 +++++ arch/ia64/xen/hypervisor.c | 96 ++++++ arch/ia64/xen/irq_xen.c | 435 ++++++++++++++++++++++++++++ arch/ia64/xen/irq_xen.h | 34 ++ arch/ia64/xen/machvec.c | 4 arch/ia64/xen/suspend.c | 45 ++ arch/ia64/xen/time.c | 213 +++++++++++++ arch/ia64/xen/time.h | 24 + arch/ia64/xen/xcom_hcall.c | 441 ++++++++++++++++++++++++++++ arch/ia64/xen/xen_pv_ops.c | 364 +++++++++++++++++++++++ arch/ia64/xen/xencomm.c | 105 ++++++ arch/ia64/xen/xenivt.S | 52 +++ arch/ia64/xen/xensetup.S | 83 +++++ 49 files changed, 4574 insertions(+), 20 deletions(-) From yamahata at valinux.co.jp Mon Oct 13 22:51:30 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Tue, 14 Oct 2008 14:51:30 +0900 Subject: [PATCH 15/32] ia64/xen: add definitions necessary for xen event channel. In-Reply-To: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> References: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1223963507-28056-16-git-send-email-yamahata@valinux.co.jp> Xen paravirtualizes interrupt as event channel. This patch defines arch specific part of xen event channel. Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/xen/events.h | 50 ++++++++++++++++++++++++++++++++++++ 1 files changed, 50 insertions(+), 0 deletions(-) create mode 100644 arch/ia64/include/asm/xen/events.h diff --git a/arch/ia64/include/asm/xen/events.h b/arch/ia64/include/asm/xen/events.h new file mode 100644 index 0000000..7324878 --- /dev/null +++ b/arch/ia64/include/asm/xen/events.h @@ -0,0 +1,50 @@ +/****************************************************************************** + * arch/ia64/include/asm/xen/events.h + * + * Copyright (c) 2008 Isaku Yamahata + * VA Linux Systems Japan K.K. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. 
+ * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + */ +#ifndef _ASM_IA64_XEN_EVENTS_H +#define _ASM_IA64_XEN_EVENTS_H + +enum ipi_vector { + XEN_RESCHEDULE_VECTOR, + XEN_IPI_VECTOR, + XEN_CMCP_VECTOR, + XEN_CPEP_VECTOR, + + XEN_NR_IPIS, +}; + +static inline int xen_irqs_disabled(struct pt_regs *regs) +{ + return !(ia64_psr(regs)->i); +} + +static inline void xen_do_IRQ(int irq, struct pt_regs *regs) +{ + struct pt_regs *old_regs; + old_regs = set_irq_regs(regs); + irq_enter(); + __do_IRQ(irq); + irq_exit(); + set_irq_regs(old_regs); +} +#define irq_ctx_init(cpu) do { } while (0) + +#endif /* _ASM_IA64_XEN_EVENTS_H */ -- 1.6.0.2 From yamahata at valinux.co.jp Mon Oct 13 22:51:26 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Tue, 14 Oct 2008 14:51:26 +0900 Subject: [PATCH 11/32] ia64/xen: define helper functions for xen hypercalls. In-Reply-To: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> References: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1223963507-28056-12-git-send-email-yamahata@valinux.co.jp> introduce helper functions for xen hypercalls which traps to hypervisor. Signed-off-by: Yaozu (Eddie) Dong Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/xen/hypercall.h | 265 +++++++++++++++++++++++++++++++++ arch/ia64/include/asm/xen/privop.h | 129 ++++++++++++++++ arch/ia64/xen/Makefile | 5 + arch/ia64/xen/hypercall.S | 91 +++++++++++ 4 files changed, 490 insertions(+), 0 deletions(-) create mode 100644 arch/ia64/include/asm/xen/hypercall.h create mode 100644 arch/ia64/include/asm/xen/privop.h create mode 100644 arch/ia64/xen/Makefile create mode 100644 arch/ia64/xen/hypercall.S diff --git a/arch/ia64/include/asm/xen/hypercall.h b/arch/ia64/include/asm/xen/hypercall.h new file mode 100644 index 0000000..96fc623 --- /dev/null +++ b/arch/ia64/include/asm/xen/hypercall.h @@ -0,0 +1,265 @@ +/****************************************************************************** + * hypercall.h + * + * Linux-specific hypervisor handling. + * + * Copyright (c) 2002-2004, K A Fraser + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation; or, when distributed + * separately from the Linux kernel or incorporated into other + * software packages, subject to the following license: + * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this source file (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, modify, + * merge, publish, distribute, sublicense, and/or sell copies of the Software, + * and to permit persons to whom the Software is furnished to do so, subject to + * the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS + * IN THE SOFTWARE. + */ + +#ifndef _ASM_IA64_XEN_HYPERCALL_H +#define _ASM_IA64_XEN_HYPERCALL_H + +#include +#include +#include +#include +struct xencomm_handle; +extern unsigned long __hypercall(unsigned long a1, unsigned long a2, + unsigned long a3, unsigned long a4, + unsigned long a5, unsigned long cmd); + +/* + * Assembler stubs for hyper-calls. + */ + +#define _hypercall0(type, name) \ +({ \ + long __res; \ + __res = __hypercall(0, 0, 0, 0, 0, __HYPERVISOR_##name);\ + (type)__res; \ +}) + +#define _hypercall1(type, name, a1) \ +({ \ + long __res; \ + __res = __hypercall((unsigned long)a1, \ + 0, 0, 0, 0, __HYPERVISOR_##name); \ + (type)__res; \ +}) + +#define _hypercall2(type, name, a1, a2) \ +({ \ + long __res; \ + __res = __hypercall((unsigned long)a1, \ + (unsigned long)a2, \ + 0, 0, 0, __HYPERVISOR_##name); \ + (type)__res; \ +}) + +#define _hypercall3(type, name, a1, a2, a3) \ +({ \ + long __res; \ + __res = __hypercall((unsigned long)a1, \ + (unsigned long)a2, \ + (unsigned long)a3, \ + 0, 0, __HYPERVISOR_##name); \ + (type)__res; \ +}) + +#define _hypercall4(type, name, a1, a2, a3, a4) \ +({ \ + long __res; \ + __res = __hypercall((unsigned long)a1, \ + (unsigned long)a2, \ + (unsigned long)a3, \ + (unsigned long)a4, \ + 0, __HYPERVISOR_##name); \ + (type)__res; \ +}) + +#define _hypercall5(type, name, a1, a2, a3, a4, a5) \ +({ \ + long __res; \ + __res = __hypercall((unsigned long)a1, \ + (unsigned long)a2, \ + (unsigned long)a3, \ + (unsigned long)a4, \ + (unsigned long)a5, \ + __HYPERVISOR_##name); \ + (type)__res; \ +}) + + +static inline int +xencomm_arch_hypercall_sched_op(int cmd, struct xencomm_handle *arg) +{ + return _hypercall2(int, sched_op_new, cmd, arg); +} + +static inline long +HYPERVISOR_set_timer_op(u64 timeout) +{ + unsigned long timeout_hi = (unsigned long)(timeout >> 32); + unsigned long timeout_lo = (unsigned long)timeout; + return _hypercall2(long, set_timer_op, timeout_lo, timeout_hi); +} + +static inline int +xencomm_arch_hypercall_multicall(struct xencomm_handle *call_list, + int nr_calls) +{ + return _hypercall2(int, multicall, call_list, nr_calls); +} + +static inline int +xencomm_arch_hypercall_memory_op(unsigned int cmd, struct xencomm_handle *arg) +{ + return _hypercall2(int, memory_op, cmd, arg); +} + +static inline int +xencomm_arch_hypercall_event_channel_op(int cmd, struct xencomm_handle *arg) +{ + return _hypercall2(int, event_channel_op, cmd, arg); +} + +static inline int +xencomm_arch_hypercall_xen_version(int cmd, struct xencomm_handle *arg) +{ + return _hypercall2(int, xen_version, cmd, arg); +} + +static inline int +xencomm_arch_hypercall_console_io(int cmd, int count, + struct xencomm_handle *str) +{ + return _hypercall3(int, console_io, cmd, count, str); +} + +static inline int +xencomm_arch_hypercall_physdev_op(int cmd, struct xencomm_handle *arg) +{ + return _hypercall2(int, physdev_op, cmd, arg); +} + +static inline int +xencomm_arch_hypercall_grant_table_op(unsigned int cmd, + struct xencomm_handle *uop, + unsigned int count) +{ + return _hypercall3(int, 
grant_table_op, cmd, uop, count); +} + +int HYPERVISOR_grant_table_op(unsigned int cmd, void *uop, unsigned int count); + +extern int xencomm_arch_hypercall_suspend(struct xencomm_handle *arg); + +static inline int +xencomm_arch_hypercall_callback_op(int cmd, struct xencomm_handle *arg) +{ + return _hypercall2(int, callback_op, cmd, arg); +} + +static inline long +xencomm_arch_hypercall_vcpu_op(int cmd, int cpu, void *arg) +{ + return _hypercall3(long, vcpu_op, cmd, cpu, arg); +} + +static inline int +HYPERVISOR_physdev_op(int cmd, void *arg) +{ + switch (cmd) { + case PHYSDEVOP_eoi: + return _hypercall1(int, ia64_fast_eoi, + ((struct physdev_eoi *)arg)->irq); + default: + return xencomm_hypercall_physdev_op(cmd, arg); + } +} + +static inline long +xencomm_arch_hypercall_opt_feature(struct xencomm_handle *arg) +{ + return _hypercall1(long, opt_feature, arg); +} + +/* for balloon driver */ +#define HYPERVISOR_update_va_mapping(va, new_val, flags) (0) + +/* Use xencomm to do hypercalls. */ +#define HYPERVISOR_sched_op xencomm_hypercall_sched_op +#define HYPERVISOR_event_channel_op xencomm_hypercall_event_channel_op +#define HYPERVISOR_callback_op xencomm_hypercall_callback_op +#define HYPERVISOR_multicall xencomm_hypercall_multicall +#define HYPERVISOR_xen_version xencomm_hypercall_xen_version +#define HYPERVISOR_console_io xencomm_hypercall_console_io +#define HYPERVISOR_memory_op xencomm_hypercall_memory_op +#define HYPERVISOR_suspend xencomm_hypercall_suspend +#define HYPERVISOR_vcpu_op xencomm_hypercall_vcpu_op +#define HYPERVISOR_opt_feature xencomm_hypercall_opt_feature + +/* to compile gnttab_copy_grant_page() in drivers/xen/core/gnttab.c */ +#define HYPERVISOR_mmu_update(req, count, success_count, domid) ({ BUG(); 0; }) + +static inline int +HYPERVISOR_shutdown( + unsigned int reason) +{ + struct sched_shutdown sched_shutdown = { + .reason = reason + }; + + int rc = HYPERVISOR_sched_op(SCHEDOP_shutdown, &sched_shutdown); + + return rc; +} + +/* for netfront.c, netback.c */ +#define MULTI_UVMFLAGS_INDEX 0 /* XXX any value */ + +static inline void +MULTI_update_va_mapping( + struct multicall_entry *mcl, unsigned long va, + pte_t new_val, unsigned long flags) +{ + mcl->op = __HYPERVISOR_update_va_mapping; + mcl->result = 0; +} + +static inline void +MULTI_grant_table_op(struct multicall_entry *mcl, unsigned int cmd, + void *uop, unsigned int count) +{ + mcl->op = __HYPERVISOR_grant_table_op; + mcl->args[0] = cmd; + mcl->args[1] = (unsigned long)uop; + mcl->args[2] = count; +} + +static inline void +MULTI_mmu_update(struct multicall_entry *mcl, struct mmu_update *req, + int count, int *success_count, domid_t domid) +{ + mcl->op = __HYPERVISOR_mmu_update; + mcl->args[0] = (unsigned long)req; + mcl->args[1] = count; + mcl->args[2] = (unsigned long)success_count; + mcl->args[3] = domid; +} + +#endif /* _ASM_IA64_XEN_HYPERCALL_H */ diff --git a/arch/ia64/include/asm/xen/privop.h b/arch/ia64/include/asm/xen/privop.h new file mode 100644 index 0000000..71ec754 --- /dev/null +++ b/arch/ia64/include/asm/xen/privop.h @@ -0,0 +1,129 @@ +#ifndef _ASM_IA64_XEN_PRIVOP_H +#define _ASM_IA64_XEN_PRIVOP_H + +/* + * Copyright (C) 2005 Hewlett-Packard Co + * Dan Magenheimer + * + * Paravirtualizations of privileged operations for Xen/ia64 + * + * + * inline privop and paravirt_alt support + * Copyright (c) 2007 Isaku Yamahata + * VA Linux Systems Japan K.K. 
+ * + */ + +#ifndef __ASSEMBLY__ +#include /* arch-ia64.h requires uint64_t */ +#endif +#include + +/* At 1 MB, before per-cpu space but still addressable using addl instead + of movl. */ +#define XSI_BASE 0xfffffffffff00000 + +/* Address of mapped regs. */ +#define XMAPPEDREGS_BASE (XSI_BASE + XSI_SIZE) + +#ifdef __ASSEMBLY__ +#define XEN_HYPER_RFI break HYPERPRIVOP_RFI +#define XEN_HYPER_RSM_PSR_DT break HYPERPRIVOP_RSM_DT +#define XEN_HYPER_SSM_PSR_DT break HYPERPRIVOP_SSM_DT +#define XEN_HYPER_COVER break HYPERPRIVOP_COVER +#define XEN_HYPER_ITC_D break HYPERPRIVOP_ITC_D +#define XEN_HYPER_ITC_I break HYPERPRIVOP_ITC_I +#define XEN_HYPER_SSM_I break HYPERPRIVOP_SSM_I +#define XEN_HYPER_GET_IVR break HYPERPRIVOP_GET_IVR +#define XEN_HYPER_THASH break HYPERPRIVOP_THASH +#define XEN_HYPER_ITR_D break HYPERPRIVOP_ITR_D +#define XEN_HYPER_SET_KR break HYPERPRIVOP_SET_KR +#define XEN_HYPER_GET_PSR break HYPERPRIVOP_GET_PSR +#define XEN_HYPER_SET_RR0_TO_RR4 break HYPERPRIVOP_SET_RR0_TO_RR4 + +#define XSI_IFS (XSI_BASE + XSI_IFS_OFS) +#define XSI_PRECOVER_IFS (XSI_BASE + XSI_PRECOVER_IFS_OFS) +#define XSI_IFA (XSI_BASE + XSI_IFA_OFS) +#define XSI_ISR (XSI_BASE + XSI_ISR_OFS) +#define XSI_IIM (XSI_BASE + XSI_IIM_OFS) +#define XSI_ITIR (XSI_BASE + XSI_ITIR_OFS) +#define XSI_PSR_I_ADDR (XSI_BASE + XSI_PSR_I_ADDR_OFS) +#define XSI_PSR_IC (XSI_BASE + XSI_PSR_IC_OFS) +#define XSI_IPSR (XSI_BASE + XSI_IPSR_OFS) +#define XSI_IIP (XSI_BASE + XSI_IIP_OFS) +#define XSI_B1NAT (XSI_BASE + XSI_B1NATS_OFS) +#define XSI_BANK1_R16 (XSI_BASE + XSI_BANK1_R16_OFS) +#define XSI_BANKNUM (XSI_BASE + XSI_BANKNUM_OFS) +#define XSI_IHA (XSI_BASE + XSI_IHA_OFS) +#endif + +#ifndef __ASSEMBLY__ + +/************************************************/ +/* Instructions paravirtualized for correctness */ +/************************************************/ + +/* "fc" and "thash" are privilege-sensitive instructions, meaning they + * may have different semantics depending on whether they are executed + * at PL0 vs PL!=0. When paravirtualized, these instructions mustn't + * be allowed to execute directly, lest incorrect semantics result. */ +extern void xen_fc(unsigned long addr); +extern unsigned long xen_thash(unsigned long addr); + +/* Note that "ttag" and "cover" are also privilege-sensitive; "ttag" + * is not currently used (though it may be in a long-format VHPT system!) + * and the semantics of cover only change if psr.ic is off which is very + * rare (and currently non-existent outside of assembly code */ + +/* There are also privilege-sensitive registers. These registers are + * readable at any privilege level but only writable at PL0. */ +extern unsigned long xen_get_cpuid(int index); +extern unsigned long xen_get_pmd(int index); + +extern unsigned long xen_get_eflag(void); /* see xen_ia64_getreg */ +extern void xen_set_eflag(unsigned long); /* see xen_ia64_setreg */ + +/************************************************/ +/* Instructions paravirtualized for performance */ +/************************************************/ + +/* Xen uses memory-mapped virtual privileged registers for access to many + * performance-sensitive privileged registers. Some, like the processor + * status register (psr), are broken up into multiple memory locations. + * Others, like "pend", are abstractions based on privileged registers. + * "Pend" is guaranteed to be set if reading cr.ivr would return a + * (non-spurious) interrupt. 
*/ +#define XEN_MAPPEDREGS ((struct mapped_regs *)XMAPPEDREGS_BASE) + +#define XSI_PSR_I \ + (*XEN_MAPPEDREGS->interrupt_mask_addr) +#define xen_get_virtual_psr_i() \ + (!XSI_PSR_I) +#define xen_set_virtual_psr_i(_val) \ + ({ XSI_PSR_I = (uint8_t)(_val) ? 0 : 1; }) +#define xen_set_virtual_psr_ic(_val) \ + ({ XEN_MAPPEDREGS->interrupt_collection_enabled = _val ? 1 : 0; }) +#define xen_get_virtual_pend() \ + (*(((uint8_t *)XEN_MAPPEDREGS->interrupt_mask_addr) - 1)) + +/* Although all privileged operations can be left to trap and will + * be properly handled by Xen, some are frequent enough that we use + * hyperprivops for performance. */ +extern unsigned long xen_get_psr(void); +extern unsigned long xen_get_ivr(void); +extern unsigned long xen_get_tpr(void); +extern void xen_hyper_ssm_i(void); +extern void xen_set_itm(unsigned long); +extern void xen_set_tpr(unsigned long); +extern void xen_eoi(unsigned long); +extern unsigned long xen_get_rr(unsigned long index); +extern void xen_set_rr(unsigned long index, unsigned long val); +extern void xen_set_rr0_to_rr4(unsigned long val0, unsigned long val1, + unsigned long val2, unsigned long val3, + unsigned long val4); +extern void xen_set_kr(unsigned long index, unsigned long val); +extern void xen_ptcga(unsigned long addr, unsigned long size); + +#endif /* !__ASSEMBLY__ */ + +#endif /* _ASM_IA64_XEN_PRIVOP_H */ diff --git a/arch/ia64/xen/Makefile b/arch/ia64/xen/Makefile new file mode 100644 index 0000000..c200704 --- /dev/null +++ b/arch/ia64/xen/Makefile @@ -0,0 +1,5 @@ +# +# Makefile for Xen components +# + +obj-y := hypercall.o diff --git a/arch/ia64/xen/hypercall.S b/arch/ia64/xen/hypercall.S new file mode 100644 index 0000000..d4ff0b9 --- /dev/null +++ b/arch/ia64/xen/hypercall.S @@ -0,0 +1,91 @@ +/* + * Support routines for Xen hypercalls + * + * Copyright (C) 2005 Dan Magenheimer + * Copyright (C) 2008 Yaozu (Eddie) Dong + */ + +#include +#include +#include + +/* + * Hypercalls without parameter. + */ +#define __HCALL0(name,hcall) \ + GLOBAL_ENTRY(name); \ + break hcall; \ + br.ret.sptk.many rp; \ + END(name) + +/* + * Hypercalls with 1 parameter. + */ +#define __HCALL1(name,hcall) \ + GLOBAL_ENTRY(name); \ + mov r8=r32; \ + break hcall; \ + br.ret.sptk.many rp; \ + END(name) + +/* + * Hypercalls with 2 parameters. 
+ */ +#define __HCALL2(name,hcall) \ + GLOBAL_ENTRY(name); \ + mov r8=r32; \ + mov r9=r33; \ + break hcall; \ + br.ret.sptk.many rp; \ + END(name) + +__HCALL0(xen_get_psr, HYPERPRIVOP_GET_PSR) +__HCALL0(xen_get_ivr, HYPERPRIVOP_GET_IVR) +__HCALL0(xen_get_tpr, HYPERPRIVOP_GET_TPR) +__HCALL0(xen_hyper_ssm_i, HYPERPRIVOP_SSM_I) + +__HCALL1(xen_set_tpr, HYPERPRIVOP_SET_TPR) +__HCALL1(xen_eoi, HYPERPRIVOP_EOI) +__HCALL1(xen_thash, HYPERPRIVOP_THASH) +__HCALL1(xen_set_itm, HYPERPRIVOP_SET_ITM) +__HCALL1(xen_get_rr, HYPERPRIVOP_GET_RR) +__HCALL1(xen_fc, HYPERPRIVOP_FC) +__HCALL1(xen_get_cpuid, HYPERPRIVOP_GET_CPUID) +__HCALL1(xen_get_pmd, HYPERPRIVOP_GET_PMD) + +__HCALL2(xen_ptcga, HYPERPRIVOP_PTC_GA) +__HCALL2(xen_set_rr, HYPERPRIVOP_SET_RR) +__HCALL2(xen_set_kr, HYPERPRIVOP_SET_KR) + +#ifdef CONFIG_IA32_SUPPORT +__HCALL1(xen_get_eflag, HYPERPRIVOP_GET_EFLAG) +__HCALL1(xen_set_eflag, HYPERPRIVOP_SET_EFLAG) // refer SDM vol1 3.1.8 +#endif /* CONFIG_IA32_SUPPORT */ + +GLOBAL_ENTRY(xen_set_rr0_to_rr4) + mov r8=r32 + mov r9=r33 + mov r10=r34 + mov r11=r35 + mov r14=r36 + XEN_HYPER_SET_RR0_TO_RR4 + br.ret.sptk.many rp + ;; +END(xen_set_rr0_to_rr4) + +GLOBAL_ENTRY(xen_send_ipi) + mov r14=r32 + mov r15=r33 + mov r2=0x400 + break 0x1000 + ;; + br.ret.sptk.many rp + ;; +END(xen_send_ipi) + +GLOBAL_ENTRY(__hypercall) + mov r2=r37 + break 0x1000 + br.ret.sptk.many b0 + ;; +END(__hypercall) -- 1.6.0.2 From yamahata at valinux.co.jp Mon Oct 13 22:51:32 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Tue, 14 Oct 2008 14:51:32 +0900 Subject: [PATCH 17/32] ia64/pv_ops/xen: elf note based xen startup. In-Reply-To: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> References: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1223963507-28056-18-git-send-email-yamahata@valinux.co.jp> This patch enables ELF note based xen startup for IA-64, which gives the kernel an early hint that it is running on xen, as in the x86 case. In order to avoid multiple entry points, extending the boot protocol (i.e. extending struct ia64_boot_param) would presumably be necessary. That probably means that elilo also needs modification.
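For reference, the C-side view of the xen_domain_type flag that xen_setup_hook stores below might look roughly like this (a sketch, not the actual header contents: the enum values mirror XEN_NATIVE_ASM/XEN_PV_DOMAIN_ASM from asm-offsets.c, and the running_on_xen() helper is made up for illustration):

enum xen_domain_type {
	XEN_NATIVE,		/* running on bare metal */
	XEN_PV_DOMAIN,		/* running as a xen paravirtualized domU */
};

extern enum xen_domain_type xen_domain_type;	/* set very early by xen_setup_hook */

static inline int running_on_xen(void)		/* hypothetical helper */
{
	return xen_domain_type == XEN_PV_DOMAIN;
}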
Signed-off-by: Qing He Signed-off-by: Isaku Yamahata --- arch/ia64/kernel/asm-offsets.c | 4 ++ arch/ia64/xen/Makefile | 3 +- arch/ia64/xen/xen_pv_ops.c | 65 +++++++++++++++++++++++++++++++ arch/ia64/xen/xensetup.S | 83 ++++++++++++++++++++++++++++++++++++++++ 4 files changed, 154 insertions(+), 1 deletions(-) create mode 100644 arch/ia64/xen/xen_pv_ops.c create mode 100644 arch/ia64/xen/xensetup.S diff --git a/arch/ia64/kernel/asm-offsets.c b/arch/ia64/kernel/asm-offsets.c index eaa988b..742dbb1 100644 --- a/arch/ia64/kernel/asm-offsets.c +++ b/arch/ia64/kernel/asm-offsets.c @@ -17,6 +17,7 @@ #include #include +#include #include "../kernel/sigframe.h" #include "../kernel/fsyscall_gtod_data.h" @@ -292,6 +293,9 @@ void foo(void) #ifdef CONFIG_XEN BLANK(); + DEFINE(XEN_NATIVE_ASM, XEN_NATIVE); + DEFINE(XEN_PV_DOMAIN_ASM, XEN_PV_DOMAIN); + #define DEFINE_MAPPED_REG_OFS(sym, field) \ DEFINE(sym, (XMAPPEDREGS_OFS + offsetof(struct mapped_regs, field))) diff --git a/arch/ia64/xen/Makefile b/arch/ia64/xen/Makefile index eb59563..abc356f 100644 --- a/arch/ia64/xen/Makefile +++ b/arch/ia64/xen/Makefile @@ -2,4 +2,5 @@ # Makefile for Xen components # -obj-y := hypercall.o xencomm.o xcom_hcall.o grant-table.o +obj-y := hypercall.o xensetup.o xen_pv_ops.o \ + xencomm.o xcom_hcall.o grant-table.o diff --git a/arch/ia64/xen/xen_pv_ops.c b/arch/ia64/xen/xen_pv_ops.c new file mode 100644 index 0000000..77db214 --- /dev/null +++ b/arch/ia64/xen/xen_pv_ops.c @@ -0,0 +1,65 @@ +/****************************************************************************** + * arch/ia64/xen/xen_pv_ops.c + * + * Copyright (c) 2008 Isaku Yamahata + * VA Linux Systems Japan K.K. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + */ + +#include +#include +#include +#include + +#include +#include +#include + +/*************************************************************************** + * general info + */ +static struct pv_info xen_info __initdata = { + .kernel_rpl = 2, /* or 1: determin at runtime */ + .paravirt_enabled = 1, + .name = "Xen/ia64", +}; + +#define IA64_RSC_PL_SHIFT 2 +#define IA64_RSC_PL_BIT_SIZE 2 +#define IA64_RSC_PL_MASK \ + (((1UL << IA64_RSC_PL_BIT_SIZE) - 1) << IA64_RSC_PL_SHIFT) + +static void __init +xen_info_init(void) +{ + /* Xenified Linux/ia64 may run on pl = 1 or 2. + * determin at run time. 
*/ + unsigned long rsc = ia64_getreg(_IA64_REG_AR_RSC); + unsigned int rpl = (rsc & IA64_RSC_PL_MASK) >> IA64_RSC_PL_SHIFT; + xen_info.kernel_rpl = rpl; +} + +/*************************************************************************** + * pv_ops initialization + */ + +void __init +xen_setup_pv_ops(void) +{ + xen_info_init(); + pv_info = xen_info; +} diff --git a/arch/ia64/xen/xensetup.S b/arch/ia64/xen/xensetup.S new file mode 100644 index 0000000..28fed1f --- /dev/null +++ b/arch/ia64/xen/xensetup.S @@ -0,0 +1,83 @@ +/* + * Support routines for Xen + * + * Copyright (C) 2005 Dan Magenheimer + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + + .section .data.read_mostly + .align 8 + .global xen_domain_type +xen_domain_type: + data4 XEN_NATIVE_ASM + .previous + + __INIT +ENTRY(startup_xen) + // Calculate load offset. + // The constant, LOAD_OFFSET, can't be used because the boot + // loader doesn't always load to the LMA specified by the vmlinux.lds. + mov r9=ip // must be the first instruction to make sure + // that r9 = the physical address of startup_xen. + // Usually r9 = startup_xen - LOAD_OFFSET + movl r8=startup_xen + ;; + sub r9=r9,r8 // Usually r9 = -LOAD_OFFSET. + + mov r10=PARAVIRT_HYPERVISOR_TYPE_XEN + movl r11=_start + ;; + add r11=r11,r9 + movl r8=hypervisor_type + ;; + add r8=r8,r9 + mov b0=r11 + ;; + st8 [r8]=r10 + br.cond.sptk.many b0 + ;; +END(startup_xen) + + ELFNOTE(Xen, XEN_ELFNOTE_GUEST_OS, .asciz "linux") + ELFNOTE(Xen, XEN_ELFNOTE_GUEST_VERSION, .asciz "2.6") + ELFNOTE(Xen, XEN_ELFNOTE_XEN_VERSION, .asciz "xen-3.0") + ELFNOTE(Xen, XEN_ELFNOTE_ENTRY, data8.ua startup_xen - LOAD_OFFSET) + +#define isBP p3 // are we the Bootstrap Processor? + + .text + +GLOBAL_ENTRY(xen_setup_hook) + mov r8=XEN_PV_DOMAIN_ASM +(isBP) movl r9=xen_domain_type;; +(isBP) st4 [r9]=r8 + movl r10=xen_ivt;; + + mov cr.iva=r10 + + /* Set xsi base. */ +#define FW_HYPERCALL_SET_SHARED_INFO_VA 0x600 +(isBP) mov r2=FW_HYPERCALL_SET_SHARED_INFO_VA +(isBP) movl r28=XSI_BASE;; +(isBP) break 0x1000;; + + /* setup pv_ops */ +(isBP) mov r4=rp + ;; +(isBP) br.call.sptk.many rp=xen_setup_pv_ops + ;; +(isBP) mov rp=r4 + ;; + + br.ret.sptk.many rp + ;; +END(xen_setup_hook) -- 1.6.0.2 From yamahata at valinux.co.jp Mon Oct 13 22:51:29 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Tue, 14 Oct 2008 14:51:29 +0900 Subject: [PATCH 14/32] ia64/xen: implement arch specific part of xen grant table. In-Reply-To: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> References: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1223963507-28056-15-git-send-email-yamahata@valinux.co.jp> Xen implements grant tables, which are for sharing pages with guest domains. This patch implements the arch-specific part of grant table initialization, and xen_alloc_vm_area()/xen_free_vm_area(), which are helper functions for the xen grant table.
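To make the intended use of the two helpers concrete, a minimal caller could look like the sketch below. It is an illustration only: the example_* functions are hypothetical, the real users are the grant-table/xenbus ring-mapping paths, and error handling is trimmed.

#include <linux/errno.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>
#include <asm/xen/grant_table.h>

static struct vm_struct *ring_area;

static int example_map_ring(void)
{
	ring_area = xen_alloc_vm_area(PAGE_SIZE);	/* one scrubbed page */
	if (!ring_area)
		return -ENOMEM;
	/* ring_area->addr can now be handed to GNTTABOP_map_grant_ref */
	return 0;
}

static void example_unmap_ring(void)
{
	if (ring_area) {
		/* repopulates the machine pages via XENMEM_populate_physmap
		 * and then frees the area */
		xen_free_vm_area(ring_area);
		ring_area = NULL;
	}
}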
Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/xen/grant_table.h | 29 ++++++ arch/ia64/xen/Makefile | 2 +- arch/ia64/xen/grant-table.c | 155 +++++++++++++++++++++++++++++++ 3 files changed, 185 insertions(+), 1 deletions(-) create mode 100644 arch/ia64/include/asm/xen/grant_table.h create mode 100644 arch/ia64/xen/grant-table.c diff --git a/arch/ia64/include/asm/xen/grant_table.h b/arch/ia64/include/asm/xen/grant_table.h new file mode 100644 index 0000000..2b1fae0 --- /dev/null +++ b/arch/ia64/include/asm/xen/grant_table.h @@ -0,0 +1,29 @@ +/****************************************************************************** + * arch/ia64/include/asm/xen/grant_table.h + * + * Copyright (c) 2008 Isaku Yamahata + * VA Linux Systems Japan K.K. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + */ + +#ifndef _ASM_IA64_XEN_GRANT_TABLE_H +#define _ASM_IA64_XEN_GRANT_TABLE_H + +struct vm_struct *xen_alloc_vm_area(unsigned long size); +void xen_free_vm_area(struct vm_struct *area); + +#endif /* _ASM_IA64_XEN_GRANT_TABLE_H */ diff --git a/arch/ia64/xen/Makefile b/arch/ia64/xen/Makefile index ae08822..eb59563 100644 --- a/arch/ia64/xen/Makefile +++ b/arch/ia64/xen/Makefile @@ -2,4 +2,4 @@ # Makefile for Xen components # -obj-y := hypercall.o xencomm.o xcom_hcall.o +obj-y := hypercall.o xencomm.o xcom_hcall.o grant-table.o diff --git a/arch/ia64/xen/grant-table.c b/arch/ia64/xen/grant-table.c new file mode 100644 index 0000000..777dd9a --- /dev/null +++ b/arch/ia64/xen/grant-table.c @@ -0,0 +1,155 @@ +/****************************************************************************** + * arch/ia64/xen/grant-table.c + * + * Copyright (c) 2006 Isaku Yamahata + * VA Linux Systems Japan K.K. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + */ + +#include +#include +#include + +#include +#include +#include + +#include + +struct vm_struct *xen_alloc_vm_area(unsigned long size) +{ + int order; + unsigned long virt; + unsigned long nr_pages; + struct vm_struct *area; + + order = get_order(size); + virt = __get_free_pages(GFP_KERNEL, order); + if (virt == 0) + goto err0; + nr_pages = 1 << order; + scrub_pages(virt, nr_pages); + + area = kmalloc(sizeof(*area), GFP_KERNEL); + if (area == NULL) + goto err1; + + area->flags = VM_IOREMAP; + area->addr = (void *)virt; + area->size = size; + area->pages = NULL; + area->nr_pages = nr_pages; + area->phys_addr = 0; /* xenbus_map_ring_valloc uses this field! */ + + return area; + +err1: + free_pages(virt, order); +err0: + return NULL; +} +EXPORT_SYMBOL_GPL(xen_alloc_vm_area); + +void xen_free_vm_area(struct vm_struct *area) +{ + unsigned int order = get_order(area->size); + unsigned long i; + unsigned long phys_addr = __pa(area->addr); + + /* This area is used for foreign page mappping. + * So underlying machine page may not be assigned. */ + for (i = 0; i < (1 << order); i++) { + unsigned long ret; + unsigned long gpfn = (phys_addr >> PAGE_SHIFT) + i; + struct xen_memory_reservation reservation = { + .nr_extents = 1, + .address_bits = 0, + .extent_order = 0, + .domid = DOMID_SELF + }; + set_xen_guest_handle(reservation.extent_start, &gpfn); + ret = HYPERVISOR_memory_op(XENMEM_populate_physmap, + &reservation); + BUG_ON(ret != 1); + } + free_pages((unsigned long)area->addr, order); + kfree(area); +} +EXPORT_SYMBOL_GPL(xen_free_vm_area); + + +/**************************************************************************** + * grant table hack + * cmd: GNTTABOP_xxx + */ + +int arch_gnttab_map_shared(unsigned long *frames, unsigned long nr_gframes, + unsigned long max_nr_gframes, + struct grant_entry **__shared) +{ + *__shared = __va(frames[0] << PAGE_SHIFT); + return 0; +} + +void arch_gnttab_unmap_shared(struct grant_entry *shared, + unsigned long nr_gframes) +{ + /* nothing */ +} + +static void +gnttab_map_grant_ref_pre(struct gnttab_map_grant_ref *uop) +{ + uint32_t flags; + + flags = uop->flags; + + if (flags & GNTMAP_host_map) { + if (flags & GNTMAP_application_map) { + printk(KERN_DEBUG + "GNTMAP_application_map is not supported yet: " + "flags 0x%x\n", flags); + BUG(); + } + if (flags & GNTMAP_contains_pte) { + printk(KERN_DEBUG + "GNTMAP_contains_pte is not supported yet: " + "flags 0x%x\n", flags); + BUG(); + } + } else if (flags & GNTMAP_device_map) { + printk("GNTMAP_device_map is not supported yet 0x%x\n", flags); + BUG(); /* not yet. actually this flag is not used. */ + } else { + BUG(); + } +} + +int +HYPERVISOR_grant_table_op(unsigned int cmd, void *uop, unsigned int count) +{ + if (cmd == GNTTABOP_map_grant_ref) { + unsigned int i; + for (i = 0; i < count; i++) { + gnttab_map_grant_ref_pre( + (struct gnttab_map_grant_ref *)uop + i); + } + } + return xencomm_hypercall_grant_table_op(cmd, uop, count); +} + +EXPORT_SYMBOL(HYPERVISOR_grant_table_op); -- 1.6.0.2 From yamahata at valinux.co.jp Mon Oct 13 22:51:27 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Tue, 14 Oct 2008 14:51:27 +0900 Subject: [PATCH 12/32] ia64/xen: implement the arch specific part of xencomm. 
In-Reply-To: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> References: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1223963507-28056-13-git-send-email-yamahata@valinux.co.jp> On ia64/xen, pointer argument for the hypercall is passed by pseudo physical address (guest physical address.) So it is necessary to convert virtual address into pseudo physical address right before issuing hypercall. The frame work is called xencomm. This patch implements arch specific part. Signed-off-by: Alex Williamson Signed-off-by: Isaku Yamahata Cc: "Luck, Tony" Cc: Akio Takebe --- arch/ia64/include/asm/xen/xencomm.h | 41 +++++++++++++++ arch/ia64/xen/Makefile | 2 +- arch/ia64/xen/xencomm.c | 94 +++++++++++++++++++++++++++++++++++ 3 files changed, 136 insertions(+), 1 deletions(-) create mode 100644 arch/ia64/include/asm/xen/xencomm.h create mode 100644 arch/ia64/xen/xencomm.c diff --git a/arch/ia64/include/asm/xen/xencomm.h b/arch/ia64/include/asm/xen/xencomm.h new file mode 100644 index 0000000..28732cd --- /dev/null +++ b/arch/ia64/include/asm/xen/xencomm.h @@ -0,0 +1,41 @@ +/* + * Copyright (C) 2006 Hollis Blanchard , IBM Corporation + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef _ASM_IA64_XEN_XENCOMM_H +#define _ASM_IA64_XEN_XENCOMM_H + +#include +#include + +/* Must be called before any hypercall. */ +extern void xencomm_initialize(void); + +/* Check if virtual contiguity means physical contiguity + * where the passed address is a pointer value in virtual address. + * On ia64, identity mapping area in region 7 or the piece of region 5 + * that is mapped by itr[IA64_TR_KERNEL]/dtr[IA64_TR_KERNEL] + */ +static inline int xencomm_is_phys_contiguous(unsigned long addr) +{ + return (PAGE_OFFSET <= addr && + addr < (PAGE_OFFSET + (1UL << IA64_MAX_PHYS_BITS))) || + (KERNEL_START <= addr && + addr < KERNEL_START + KERNEL_TR_PAGE_SIZE); +} + +#endif /* _ASM_IA64_XEN_XENCOMM_H */ diff --git a/arch/ia64/xen/Makefile b/arch/ia64/xen/Makefile index c200704..ad0c9f7 100644 --- a/arch/ia64/xen/Makefile +++ b/arch/ia64/xen/Makefile @@ -2,4 +2,4 @@ # Makefile for Xen components # -obj-y := hypercall.o +obj-y := hypercall.o xencomm.o diff --git a/arch/ia64/xen/xencomm.c b/arch/ia64/xen/xencomm.c new file mode 100644 index 0000000..3dc307f --- /dev/null +++ b/arch/ia64/xen/xencomm.c @@ -0,0 +1,94 @@ +/* + * Copyright (C) 2006 Hollis Blanchard , IBM Corporation + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. 
+ * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include + +static unsigned long kernel_virtual_offset; + +void +xencomm_initialize(void) +{ + kernel_virtual_offset = KERNEL_START - ia64_tpa(KERNEL_START); +} + +/* Translate virtual address to physical address. */ +unsigned long +xencomm_vtop(unsigned long vaddr) +{ + struct page *page; + struct vm_area_struct *vma; + + if (vaddr == 0) + return 0UL; + + if (REGION_NUMBER(vaddr) == 5) { + pgd_t *pgd; + pud_t *pud; + pmd_t *pmd; + pte_t *ptep; + + /* On ia64, TASK_SIZE refers to current. It is not initialized + during boot. + Furthermore the kernel is relocatable and __pa() doesn't + work on addresses. */ + if (vaddr >= KERNEL_START + && vaddr < (KERNEL_START + KERNEL_TR_PAGE_SIZE)) + return vaddr - kernel_virtual_offset; + + /* In kernel area -- virtually mapped. */ + pgd = pgd_offset_k(vaddr); + if (pgd_none(*pgd) || pgd_bad(*pgd)) + return ~0UL; + + pud = pud_offset(pgd, vaddr); + if (pud_none(*pud) || pud_bad(*pud)) + return ~0UL; + + pmd = pmd_offset(pud, vaddr); + if (pmd_none(*pmd) || pmd_bad(*pmd)) + return ~0UL; + + ptep = pte_offset_kernel(pmd, vaddr); + if (!ptep) + return ~0UL; + + return (pte_val(*ptep) & _PFN_MASK) | (vaddr & ~PAGE_MASK); + } + + if (vaddr > TASK_SIZE) { + /* percpu variables */ + if (REGION_NUMBER(vaddr) == 7 && + REGION_OFFSET(vaddr) >= (1ULL << IA64_MAX_PHYS_BITS)) + ia64_tpa(vaddr); + + /* kernel address */ + return __pa(vaddr); + } + + /* XXX double-check (lack of) locking */ + vma = find_extend_vma(current->mm, vaddr); + if (!vma) + return ~0UL; + + /* We assume the page is modified. */ + page = follow_page(vma, vaddr, FOLL_WRITE | FOLL_TOUCH); + if (!page) + return ~0UL; + + return (page_to_pfn(page) << PAGE_SHIFT) | (vaddr & ~PAGE_MASK); +} -- 1.6.0.2 From yamahata at valinux.co.jp Mon Oct 13 22:51:18 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Tue, 14 Oct 2008 14:51:18 +0900 Subject: [PATCH 03/32] ia64: move function declaration, ia64_cpu_local_tick() from .c to .h In-Reply-To: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> References: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1223963507-28056-4-git-send-email-yamahata@valinux.co.jp> eliminate the function declaration ia64_cpu_local_tick() in process.c by defining in arch/ia64/include/asm/timex.h The same function will be used in a different .c file later. Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/timex.h | 2 ++ arch/ia64/kernel/process.c | 1 - 2 files changed, 2 insertions(+), 1 deletions(-) diff --git a/arch/ia64/include/asm/timex.h b/arch/ia64/include/asm/timex.h index 05a6baf..4e03cfe 100644 --- a/arch/ia64/include/asm/timex.h +++ b/arch/ia64/include/asm/timex.h @@ -39,4 +39,6 @@ get_cycles (void) return ret; } +extern void ia64_cpu_local_tick (void); + #endif /* _ASM_IA64_TIMEX_H */ diff --git a/arch/ia64/kernel/process.c b/arch/ia64/kernel/process.c index 3ab8373..8de0f46 100644 --- a/arch/ia64/kernel/process.c +++ b/arch/ia64/kernel/process.c @@ -251,7 +251,6 @@ default_idle (void) /* We don't actually take CPU down, just spin without interrupts. 
*/ static inline void play_dead(void) { - extern void ia64_cpu_local_tick (void); unsigned int this_cpu = smp_processor_id(); /* Ack it */ -- 1.6.0.2 From yamahata at valinux.co.jp Mon Oct 13 22:51:28 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Tue, 14 Oct 2008 14:51:28 +0900 Subject: [PATCH 13/32] ia64/xen: xencomm conversion functions for hypercalls In-Reply-To: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> References: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1223963507-28056-14-git-send-email-yamahata@valinux.co.jp> On ia64/xen, pointer arguments for hypercall is passed by pseudo physical address(guest physical address.) So such hypercalls needs address conversion functions. This patch implements concrete conversion functions for such hypercalls. Signed-off-by: Akio Takebe Signed-off-by: Yaozu (Eddie) Dong Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/xen/xcom_hcall.h | 51 ++++ arch/ia64/include/asm/xen/xencomm.h | 1 + arch/ia64/xen/Makefile | 2 +- arch/ia64/xen/xcom_hcall.c | 441 ++++++++++++++++++++++++++++++++ arch/ia64/xen/xencomm.c | 11 + 5 files changed, 505 insertions(+), 1 deletions(-) create mode 100644 arch/ia64/include/asm/xen/xcom_hcall.h create mode 100644 arch/ia64/xen/xcom_hcall.c diff --git a/arch/ia64/include/asm/xen/xcom_hcall.h b/arch/ia64/include/asm/xen/xcom_hcall.h new file mode 100644 index 0000000..20b2950 --- /dev/null +++ b/arch/ia64/include/asm/xen/xcom_hcall.h @@ -0,0 +1,51 @@ +/* + * Copyright (C) 2006 Tristan Gingold , Bull SAS + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef _ASM_IA64_XEN_XCOM_HCALL_H +#define _ASM_IA64_XEN_XCOM_HCALL_H + +/* These function creates inline or mini descriptor for the parameters and + calls the corresponding xencomm_arch_hypercall_X. + Architectures should defines HYPERVISOR_xxx as xencomm_hypercall_xxx unless + they want to use their own wrapper. 
*/ +extern int xencomm_hypercall_console_io(int cmd, int count, char *str); + +extern int xencomm_hypercall_event_channel_op(int cmd, void *op); + +extern int xencomm_hypercall_xen_version(int cmd, void *arg); + +extern int xencomm_hypercall_physdev_op(int cmd, void *op); + +extern int xencomm_hypercall_grant_table_op(unsigned int cmd, void *op, + unsigned int count); + +extern int xencomm_hypercall_sched_op(int cmd, void *arg); + +extern int xencomm_hypercall_multicall(void *call_list, int nr_calls); + +extern int xencomm_hypercall_callback_op(int cmd, void *arg); + +extern int xencomm_hypercall_memory_op(unsigned int cmd, void *arg); + +extern int xencomm_hypercall_suspend(unsigned long srec); + +extern long xencomm_hypercall_vcpu_op(int cmd, int cpu, void *arg); + +extern long xencomm_hypercall_opt_feature(void *arg); + +#endif /* _ASM_IA64_XEN_XCOM_HCALL_H */ diff --git a/arch/ia64/include/asm/xen/xencomm.h b/arch/ia64/include/asm/xen/xencomm.h index 28732cd..cded677 100644 --- a/arch/ia64/include/asm/xen/xencomm.h +++ b/arch/ia64/include/asm/xen/xencomm.h @@ -24,6 +24,7 @@ /* Must be called before any hypercall. */ extern void xencomm_initialize(void); +extern int xencomm_is_initialized(void); /* Check if virtual contiguity means physical contiguity * where the passed address is a pointer value in virtual address. diff --git a/arch/ia64/xen/Makefile b/arch/ia64/xen/Makefile index ad0c9f7..ae08822 100644 --- a/arch/ia64/xen/Makefile +++ b/arch/ia64/xen/Makefile @@ -2,4 +2,4 @@ # Makefile for Xen components # -obj-y := hypercall.o xencomm.o +obj-y := hypercall.o xencomm.o xcom_hcall.o diff --git a/arch/ia64/xen/xcom_hcall.c b/arch/ia64/xen/xcom_hcall.c new file mode 100644 index 0000000..ccaf743 --- /dev/null +++ b/arch/ia64/xen/xcom_hcall.c @@ -0,0 +1,441 @@ +/* + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. + * + * Tristan Gingold + * + * Copyright (c) 2007 + * Isaku Yamahata + * VA Linux Systems Japan K.K. + * consolidate mini and inline version. + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +/* Xencomm notes: + * This file defines hypercalls to be used by xencomm. The hypercalls simply + * create inlines or mini descriptors for pointers and then call the raw arch + * hypercall xencomm_arch_hypercall_XXX + * + * If the arch wants to directly use these hypercalls, simply define macros + * in asm/xen/hypercall.h, eg: + * #define HYPERVISOR_sched_op xencomm_hypercall_sched_op + * + * The arch may also define HYPERVISOR_xxx as a function and do more operations + * before/after doing the hypercall. + * + * Note: because only inline or mini descriptors are created these functions + * must only be called with in kernel memory parameters. 
+ */ + +int +xencomm_hypercall_console_io(int cmd, int count, char *str) +{ + /* xen early printk uses console io hypercall before + * xencomm initialization. In that case, we just ignore it. + */ + if (!xencomm_is_initialized()) + return 0; + + return xencomm_arch_hypercall_console_io + (cmd, count, xencomm_map_no_alloc(str, count)); +} +EXPORT_SYMBOL_GPL(xencomm_hypercall_console_io); + +int +xencomm_hypercall_event_channel_op(int cmd, void *op) +{ + struct xencomm_handle *desc; + desc = xencomm_map_no_alloc(op, sizeof(struct evtchn_op)); + if (desc == NULL) + return -EINVAL; + + return xencomm_arch_hypercall_event_channel_op(cmd, desc); +} +EXPORT_SYMBOL_GPL(xencomm_hypercall_event_channel_op); + +int +xencomm_hypercall_xen_version(int cmd, void *arg) +{ + struct xencomm_handle *desc; + unsigned int argsize; + + switch (cmd) { + case XENVER_version: + /* do not actually pass an argument */ + return xencomm_arch_hypercall_xen_version(cmd, 0); + case XENVER_extraversion: + argsize = sizeof(struct xen_extraversion); + break; + case XENVER_compile_info: + argsize = sizeof(struct xen_compile_info); + break; + case XENVER_capabilities: + argsize = sizeof(struct xen_capabilities_info); + break; + case XENVER_changeset: + argsize = sizeof(struct xen_changeset_info); + break; + case XENVER_platform_parameters: + argsize = sizeof(struct xen_platform_parameters); + break; + case XENVER_get_features: + argsize = (arg == NULL) ? 0 : sizeof(struct xen_feature_info); + break; + + default: + printk(KERN_DEBUG + "%s: unknown version op %d\n", __func__, cmd); + return -ENOSYS; + } + + desc = xencomm_map_no_alloc(arg, argsize); + if (desc == NULL) + return -EINVAL; + + return xencomm_arch_hypercall_xen_version(cmd, desc); +} +EXPORT_SYMBOL_GPL(xencomm_hypercall_xen_version); + +int +xencomm_hypercall_physdev_op(int cmd, void *op) +{ + unsigned int argsize; + + switch (cmd) { + case PHYSDEVOP_apic_read: + case PHYSDEVOP_apic_write: + argsize = sizeof(struct physdev_apic); + break; + case PHYSDEVOP_alloc_irq_vector: + case PHYSDEVOP_free_irq_vector: + argsize = sizeof(struct physdev_irq); + break; + case PHYSDEVOP_irq_status_query: + argsize = sizeof(struct physdev_irq_status_query); + break; + + default: + printk(KERN_DEBUG + "%s: unknown physdev op %d\n", __func__, cmd); + return -ENOSYS; + } + + return xencomm_arch_hypercall_physdev_op + (cmd, xencomm_map_no_alloc(op, argsize)); +} + +static int +xencommize_grant_table_op(struct xencomm_mini **xc_area, + unsigned int cmd, void *op, unsigned int count, + struct xencomm_handle **desc) +{ + struct xencomm_handle *desc1; + unsigned int argsize; + + switch (cmd) { + case GNTTABOP_map_grant_ref: + argsize = sizeof(struct gnttab_map_grant_ref); + break; + case GNTTABOP_unmap_grant_ref: + argsize = sizeof(struct gnttab_unmap_grant_ref); + break; + case GNTTABOP_setup_table: + { + struct gnttab_setup_table *setup = op; + + argsize = sizeof(*setup); + + if (count != 1) + return -EINVAL; + desc1 = __xencomm_map_no_alloc + (xen_guest_handle(setup->frame_list), + setup->nr_frames * + sizeof(*xen_guest_handle(setup->frame_list)), + *xc_area); + if (desc1 == NULL) + return -EINVAL; + (*xc_area)++; + set_xen_guest_handle(setup->frame_list, (void *)desc1); + break; + } + case GNTTABOP_dump_table: + argsize = sizeof(struct gnttab_dump_table); + break; + case GNTTABOP_transfer: + argsize = sizeof(struct gnttab_transfer); + break; + case GNTTABOP_copy: + argsize = sizeof(struct gnttab_copy); + break; + case GNTTABOP_query_size: + argsize = sizeof(struct gnttab_query_size); 
+ break; + default: + printk(KERN_DEBUG "%s: unknown hypercall grant table op %d\n", + __func__, cmd); + BUG(); + } + + *desc = __xencomm_map_no_alloc(op, count * argsize, *xc_area); + if (*desc == NULL) + return -EINVAL; + (*xc_area)++; + + return 0; +} + +int +xencomm_hypercall_grant_table_op(unsigned int cmd, void *op, + unsigned int count) +{ + int rc; + struct xencomm_handle *desc; + XENCOMM_MINI_ALIGNED(xc_area, 2); + + rc = xencommize_grant_table_op(&xc_area, cmd, op, count, &desc); + if (rc) + return rc; + + return xencomm_arch_hypercall_grant_table_op(cmd, desc, count); +} +EXPORT_SYMBOL_GPL(xencomm_hypercall_grant_table_op); + +int +xencomm_hypercall_sched_op(int cmd, void *arg) +{ + struct xencomm_handle *desc; + unsigned int argsize; + + switch (cmd) { + case SCHEDOP_yield: + case SCHEDOP_block: + argsize = 0; + break; + case SCHEDOP_shutdown: + argsize = sizeof(struct sched_shutdown); + break; + case SCHEDOP_poll: + { + struct sched_poll *poll = arg; + struct xencomm_handle *ports; + + argsize = sizeof(struct sched_poll); + ports = xencomm_map_no_alloc(xen_guest_handle(poll->ports), + sizeof(*xen_guest_handle(poll->ports))); + + set_xen_guest_handle(poll->ports, (void *)ports); + break; + } + default: + printk(KERN_DEBUG "%s: unknown sched op %d\n", __func__, cmd); + return -ENOSYS; + } + + desc = xencomm_map_no_alloc(arg, argsize); + if (desc == NULL) + return -EINVAL; + + return xencomm_arch_hypercall_sched_op(cmd, desc); +} +EXPORT_SYMBOL_GPL(xencomm_hypercall_sched_op); + +int +xencomm_hypercall_multicall(void *call_list, int nr_calls) +{ + int rc; + int i; + struct multicall_entry *mce; + struct xencomm_handle *desc; + XENCOMM_MINI_ALIGNED(xc_area, nr_calls * 2); + + for (i = 0; i < nr_calls; i++) { + mce = (struct multicall_entry *)call_list + i; + + switch (mce->op) { + case __HYPERVISOR_update_va_mapping: + case __HYPERVISOR_mmu_update: + /* No-op on ia64. 
*/ + break; + case __HYPERVISOR_grant_table_op: + rc = xencommize_grant_table_op + (&xc_area, + mce->args[0], (void *)mce->args[1], + mce->args[2], &desc); + if (rc) + return rc; + mce->args[1] = (unsigned long)desc; + break; + case __HYPERVISOR_memory_op: + default: + printk(KERN_DEBUG + "%s: unhandled multicall op entry op %lu\n", + __func__, mce->op); + return -ENOSYS; + } + } + + desc = xencomm_map_no_alloc(call_list, + nr_calls * sizeof(struct multicall_entry)); + if (desc == NULL) + return -EINVAL; + + return xencomm_arch_hypercall_multicall(desc, nr_calls); +} +EXPORT_SYMBOL_GPL(xencomm_hypercall_multicall); + +int +xencomm_hypercall_callback_op(int cmd, void *arg) +{ + unsigned int argsize; + switch (cmd) { + case CALLBACKOP_register: + argsize = sizeof(struct callback_register); + break; + case CALLBACKOP_unregister: + argsize = sizeof(struct callback_unregister); + break; + default: + printk(KERN_DEBUG + "%s: unknown callback op %d\n", __func__, cmd); + return -ENOSYS; + } + + return xencomm_arch_hypercall_callback_op + (cmd, xencomm_map_no_alloc(arg, argsize)); +} + +static int +xencommize_memory_reservation(struct xencomm_mini *xc_area, + struct xen_memory_reservation *mop) +{ + struct xencomm_handle *desc; + + desc = __xencomm_map_no_alloc(xen_guest_handle(mop->extent_start), + mop->nr_extents * + sizeof(*xen_guest_handle(mop->extent_start)), + xc_area); + if (desc == NULL) + return -EINVAL; + + set_xen_guest_handle(mop->extent_start, (void *)desc); + return 0; +} + +int +xencomm_hypercall_memory_op(unsigned int cmd, void *arg) +{ + GUEST_HANDLE(xen_pfn_t) extent_start_va[2] = { {NULL}, {NULL} }; + struct xen_memory_reservation *xmr = NULL; + int rc; + struct xencomm_handle *desc; + unsigned int argsize; + XENCOMM_MINI_ALIGNED(xc_area, 2); + + switch (cmd) { + case XENMEM_increase_reservation: + case XENMEM_decrease_reservation: + case XENMEM_populate_physmap: + xmr = (struct xen_memory_reservation *)arg; + set_xen_guest_handle(extent_start_va[0], + xen_guest_handle(xmr->extent_start)); + + argsize = sizeof(*xmr); + rc = xencommize_memory_reservation(xc_area, xmr); + if (rc) + return rc; + xc_area++; + break; + + case XENMEM_maximum_ram_page: + argsize = 0; + break; + + case XENMEM_add_to_physmap: + argsize = sizeof(struct xen_add_to_physmap); + break; + + default: + printk(KERN_DEBUG "%s: unknown memory op %d\n", __func__, cmd); + return -ENOSYS; + } + + desc = xencomm_map_no_alloc(arg, argsize); + if (desc == NULL) + return -EINVAL; + + rc = xencomm_arch_hypercall_memory_op(cmd, desc); + + switch (cmd) { + case XENMEM_increase_reservation: + case XENMEM_decrease_reservation: + case XENMEM_populate_physmap: + set_xen_guest_handle(xmr->extent_start, + xen_guest_handle(extent_start_va[0])); + break; + } + + return rc; +} +EXPORT_SYMBOL_GPL(xencomm_hypercall_memory_op); + +int +xencomm_hypercall_suspend(unsigned long srec) +{ + struct sched_shutdown arg; + + arg.reason = SHUTDOWN_suspend; + + return xencomm_arch_hypercall_sched_op( + SCHEDOP_shutdown, xencomm_map_no_alloc(&arg, sizeof(arg))); +} + +long +xencomm_hypercall_vcpu_op(int cmd, int cpu, void *arg) +{ + unsigned int argsize; + switch (cmd) { + case VCPUOP_register_runstate_memory_area: { + struct vcpu_register_runstate_memory_area *area = + (struct vcpu_register_runstate_memory_area *)arg; + argsize = sizeof(*arg); + set_xen_guest_handle(area->addr.h, + (void *)xencomm_map_no_alloc(area->addr.v, + sizeof(area->addr.v))); + break; + } + + default: + printk(KERN_DEBUG "%s: unknown vcpu op %d\n", __func__, cmd); + return 
-ENOSYS; + } + + return xencomm_arch_hypercall_vcpu_op(cmd, cpu, + xencomm_map_no_alloc(arg, argsize)); +} + +long +xencomm_hypercall_opt_feature(void *arg) +{ + return xencomm_arch_hypercall_opt_feature( + xencomm_map_no_alloc(arg, + sizeof(struct xen_ia64_opt_feature))); +} diff --git a/arch/ia64/xen/xencomm.c b/arch/ia64/xen/xencomm.c index 3dc307f..1f5d7ac 100644 --- a/arch/ia64/xen/xencomm.c +++ b/arch/ia64/xen/xencomm.c @@ -19,11 +19,22 @@ #include static unsigned long kernel_virtual_offset; +static int is_xencomm_initialized; + +/* for xen early printk. It uses console io hypercall which uses xencomm. + * However early printk may use it before xencomm initialization. + */ +int +xencomm_is_initialized(void) +{ + return is_xencomm_initialized; +} void xencomm_initialize(void) { kernel_virtual_offset = KERNEL_START - ia64_tpa(KERNEL_START); + is_xencomm_initialized = 1; } /* Translate virtual address to physical address. */ -- 1.6.0.2 From yamahata at valinux.co.jp Mon Oct 13 22:51:44 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Tue, 14 Oct 2008 14:51:44 +0900 Subject: [PATCH 29/32] ia64/xen: preliminary support for save/restore. In-Reply-To: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> References: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1223963507-28056-30-git-send-email-yamahata@valinux.co.jp> preliminary support for save/restore. Although Save/restore isn't fully working yet, this patch is necessary to compile. Signed-off-by: Isaku Yamahata --- arch/ia64/xen/Makefile | 2 +- arch/ia64/xen/{time.h => suspend.c} | 45 +++++++++++++++++++++++++++++++++- arch/ia64/xen/time.c | 33 +++++++++++++++++++++++++ arch/ia64/xen/time.h | 1 + 4 files changed, 78 insertions(+), 3 deletions(-) copy arch/ia64/xen/{time.h => suspend.c} (64%) diff --git a/arch/ia64/xen/Makefile b/arch/ia64/xen/Makefile index 972d085..0ad0224 100644 --- a/arch/ia64/xen/Makefile +++ b/arch/ia64/xen/Makefile @@ -3,7 +3,7 @@ # obj-y := hypercall.o xenivt.o xensetup.o xen_pv_ops.o irq_xen.o \ - hypervisor.o xencomm.o xcom_hcall.o grant-table.o time.o + hypervisor.o xencomm.o xcom_hcall.o grant-table.o time.o suspend.o obj-$(CONFIG_IA64_GENERIC) += machvec.o diff --git a/arch/ia64/xen/time.h b/arch/ia64/xen/suspend.c similarity index 64% copy from arch/ia64/xen/time.h copy to arch/ia64/xen/suspend.c index b9c7ec5..fd66b04 100644 --- a/arch/ia64/xen/time.h +++ b/arch/ia64/xen/suspend.c @@ -1,5 +1,5 @@ /****************************************************************************** - * arch/ia64/xen/time.h + * arch/ia64/xen/suspend.c * * Copyright (c) 2008 Isaku Yamahata * VA Linux Systems Japan K.K. 
@@ -18,6 +18,47 @@ * along with this program; if not, write to the Free Software * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA * + * suspend/resume */ -extern struct pv_time_ops xen_time_ops __initdata; +#include +#include +#include "time.h" + +void +xen_mm_pin_all(void) +{ + /* nothing */ +} + +void +xen_mm_unpin_all(void) +{ + /* nothing */ +} + +void xen_pre_device_suspend(void) +{ + /* nothing */ +} + +void +xen_pre_suspend() +{ + /* nothing */ +} + +void +xen_post_suspend(int suspend_cancelled) +{ + if (suspend_cancelled) + return; + + xen_ia64_enable_opt_feature(); + /* add more if necessary */ +} + +void xen_arch_resume(void) +{ + xen_timer_resume_on_aps(); +} diff --git a/arch/ia64/xen/time.c b/arch/ia64/xen/time.c index ec168ec..d15a94c 100644 --- a/arch/ia64/xen/time.c +++ b/arch/ia64/xen/time.c @@ -26,6 +26,8 @@ #include #include +#include + #include #include @@ -178,3 +180,34 @@ struct pv_time_ops xen_time_ops __initdata = { .do_steal_accounting = xen_do_steal_accounting, .clocksource_resume = xen_itc_jitter_data_reset, }; + +/* Called after suspend, to resume time. */ +static void xen_local_tick_resume(void) +{ + /* Just trigger a tick. */ + ia64_cpu_local_tick(); + touch_softlockup_watchdog(); +} + +void +xen_timer_resume(void) +{ + unsigned int cpu; + + xen_local_tick_resume(); + + for_each_online_cpu(cpu) + xen_init_missing_ticks_accounting(cpu); +} + +static void ia64_cpu_local_tick_fn(void *unused) +{ + xen_local_tick_resume(); + xen_init_missing_ticks_accounting(smp_processor_id()); +} + +void +xen_timer_resume_on_aps(void) +{ + smp_call_function(&ia64_cpu_local_tick_fn, NULL, 1); +} diff --git a/arch/ia64/xen/time.h b/arch/ia64/xen/time.h index b9c7ec5..f98d7e1 100644 --- a/arch/ia64/xen/time.h +++ b/arch/ia64/xen/time.h @@ -21,3 +21,4 @@ */ extern struct pv_time_ops xen_time_ops __initdata; +void xen_timer_resume_on_aps(void); -- 1.6.0.2 From yamahata at valinux.co.jp Mon Oct 13 22:51:40 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Tue, 14 Oct 2008 14:51:40 +0900 Subject: [PATCH 25/32] ia64/pv_ops/xen: define the nubmer of irqs which xen needs. In-Reply-To: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> References: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1223963507-28056-26-git-send-email-yamahata@valinux.co.jp> define arch/ia64/include/asm/xen/irq.h to define the number of irqs which xen needs. Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/xen/irq.h | 44 +++++++++++++++++++++++++++++++++++++++ arch/ia64/kernel/nr-irqs.c | 1 + 2 files changed, 45 insertions(+), 0 deletions(-) create mode 100644 arch/ia64/include/asm/xen/irq.h diff --git a/arch/ia64/include/asm/xen/irq.h b/arch/ia64/include/asm/xen/irq.h new file mode 100644 index 0000000..a904509 --- /dev/null +++ b/arch/ia64/include/asm/xen/irq.h @@ -0,0 +1,44 @@ +/****************************************************************************** + * arch/ia64/include/asm/xen/irq.h + * + * Copyright (c) 2008 Isaku Yamahata + * VA Linux Systems Japan K.K. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + */ + +#ifndef _ASM_IA64_XEN_IRQ_H +#define _ASM_IA64_XEN_IRQ_H + +/* + * The flat IRQ space is divided into two regions: + * 1. A one-to-one mapping of real physical IRQs. This space is only used + * if we have physical device-access privilege. This region is at the + * start of the IRQ space so that existing device drivers do not need + * to be modified to translate physical IRQ numbers into our IRQ space. + * 3. A dynamic mapping of inter-domain and Xen-sourced virtual IRQs. These + * are bound using the provided bind/unbind functions. + */ + +#define XEN_PIRQ_BASE 0 +#define XEN_NR_PIRQS 256 + +#define XEN_DYNIRQ_BASE (XEN_PIRQ_BASE + XEN_NR_PIRQS) +#define XEN_NR_DYNIRQS (NR_CPUS * 8) + +#define XEN_NR_IRQS (XEN_NR_PIRQS + XEN_NR_DYNIRQS) + +#endif /* _ASM_IA64_XEN_IRQ_H */ diff --git a/arch/ia64/kernel/nr-irqs.c b/arch/ia64/kernel/nr-irqs.c index 8273afc..ee56457 100644 --- a/arch/ia64/kernel/nr-irqs.c +++ b/arch/ia64/kernel/nr-irqs.c @@ -10,6 +10,7 @@ #include #include #include +#include void foo(void) { -- 1.6.0.2 From yamahata at valinux.co.jp Mon Oct 13 22:51:39 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Tue, 14 Oct 2008 14:51:39 +0900 Subject: [PATCH 24/32] ia64/pv_ops/xen: implement xen pv_iosapic_ops. In-Reply-To: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> References: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1223963507-28056-25-git-send-email-yamahata@valinux.co.jp> implement xen pv_iosapic_ops for xen paravirtualized iosapic. Signed-off-by: Isaku Yamahata --- arch/ia64/xen/xen_pv_ops.c | 52 ++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 52 insertions(+), 0 deletions(-) diff --git a/arch/ia64/xen/xen_pv_ops.c b/arch/ia64/xen/xen_pv_ops.c index 5b23cd5..41a6cbf 100644 --- a/arch/ia64/xen/xen_pv_ops.c +++ b/arch/ia64/xen/xen_pv_ops.c @@ -292,6 +292,57 @@ const struct pv_cpu_asm_switch xen_cpu_asm_switch = { }; /*************************************************************************** + * pv_iosapic_ops + * iosapic read/write hooks. 
+ */ +static void +xen_pcat_compat_init(void) +{ + /* nothing */ +} + +static struct irq_chip* +xen_iosapic_get_irq_chip(unsigned long trigger) +{ + return NULL; +} + +static unsigned int +xen_iosapic_read(char __iomem *iosapic, unsigned int reg) +{ + struct physdev_apic apic_op; + int ret; + + apic_op.apic_physbase = (unsigned long)iosapic - + __IA64_UNCACHED_OFFSET; + apic_op.reg = reg; + ret = HYPERVISOR_physdev_op(PHYSDEVOP_apic_read, &apic_op); + if (ret) + return ret; + return apic_op.value; +} + +static void +xen_iosapic_write(char __iomem *iosapic, unsigned int reg, u32 val) +{ + struct physdev_apic apic_op; + + apic_op.apic_physbase = (unsigned long)iosapic - + __IA64_UNCACHED_OFFSET; + apic_op.reg = reg; + apic_op.value = val; + HYPERVISOR_physdev_op(PHYSDEVOP_apic_write, &apic_op); +} + +static const struct pv_iosapic_ops xen_iosapic_ops __initdata = { + .pcat_compat_init = xen_pcat_compat_init, + .__get_irq_chip = xen_iosapic_get_irq_chip, + + .__read = xen_iosapic_read, + .__write = xen_iosapic_write, +}; + +/*************************************************************************** * pv_ops initialization */ @@ -302,6 +353,7 @@ xen_setup_pv_ops(void) pv_info = xen_info; pv_init_ops = xen_init_ops; pv_cpu_ops = xen_cpu_ops; + pv_iosapic_ops = xen_iosapic_ops; paravirt_cpu_asm_init(&xen_cpu_asm_switch); } -- 1.6.0.2 From yamahata at valinux.co.jp Mon Oct 13 22:51:42 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Tue, 14 Oct 2008 14:51:42 +0900 Subject: [PATCH 27/32] ia64/pv_ops/xen: implement xen pv_time_ops. In-Reply-To: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> References: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1223963507-28056-28-git-send-email-yamahata@valinux.co.jp> implement xen pv_time_ops to account steal time. Cc: Jeremy Fitzhardinge Signed-off-by: Alex Williamson Signed-off-by: Isaku Yamahata --- arch/ia64/xen/Makefile | 2 +- arch/ia64/xen/time.c | 180 ++++++++++++++++++++++++++++++++++++++++++++ arch/ia64/xen/time.h | 23 ++++++ arch/ia64/xen/xen_pv_ops.c | 2 + 4 files changed, 206 insertions(+), 1 deletions(-) create mode 100644 arch/ia64/xen/time.c create mode 100644 arch/ia64/xen/time.h diff --git a/arch/ia64/xen/Makefile b/arch/ia64/xen/Makefile index 01c4289..ed31c76 100644 --- a/arch/ia64/xen/Makefile +++ b/arch/ia64/xen/Makefile @@ -3,7 +3,7 @@ # obj-y := hypercall.o xenivt.o xensetup.o xen_pv_ops.o irq_xen.o \ - hypervisor.o xencomm.o xcom_hcall.o grant-table.o + hypervisor.o xencomm.o xcom_hcall.o grant-table.o time.o AFLAGS_xenivt.o += -D__IA64_ASM_PARAVIRTUALIZED_XEN diff --git a/arch/ia64/xen/time.c b/arch/ia64/xen/time.c new file mode 100644 index 0000000..ec168ec --- /dev/null +++ b/arch/ia64/xen/time.c @@ -0,0 +1,180 @@ +/****************************************************************************** + * arch/ia64/xen/time.c + * + * Copyright (c) 2008 Isaku Yamahata + * VA Linux Systems Japan K.K. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + */ + +#include +#include +#include +#include +#include + +#include + +#include + +#include "../kernel/fsyscall_gtod_data.h" + +DEFINE_PER_CPU(struct vcpu_runstate_info, runstate); +DEFINE_PER_CPU(unsigned long, processed_stolen_time); +DEFINE_PER_CPU(unsigned long, processed_blocked_time); + +/* taken from i386/kernel/time-xen.c */ +static void xen_init_missing_ticks_accounting(int cpu) +{ + struct vcpu_register_runstate_memory_area area; + struct vcpu_runstate_info *runstate = &per_cpu(runstate, cpu); + int rc; + + memset(runstate, 0, sizeof(*runstate)); + + area.addr.v = runstate; + rc = HYPERVISOR_vcpu_op(VCPUOP_register_runstate_memory_area, cpu, + &area); + WARN_ON(rc && rc != -ENOSYS); + + per_cpu(processed_blocked_time, cpu) = runstate->time[RUNSTATE_blocked]; + per_cpu(processed_stolen_time, cpu) = runstate->time[RUNSTATE_runnable] + + runstate->time[RUNSTATE_offline]; +} + +/* + * Runstate accounting + */ +/* stolen from arch/x86/xen/time.c */ +static void get_runstate_snapshot(struct vcpu_runstate_info *res) +{ + u64 state_time; + struct vcpu_runstate_info *state; + + BUG_ON(preemptible()); + + state = &__get_cpu_var(runstate); + + /* + * The runstate info is always updated by the hypervisor on + * the current CPU, so there's no need to use anything + * stronger than a compiler barrier when fetching it. + */ + do { + state_time = state->state_entry_time; + rmb(); + *res = *state; + rmb(); + } while (state->state_entry_time != state_time); +} + +#define NS_PER_TICK (1000000000LL/HZ) + +static unsigned long +consider_steal_time(unsigned long new_itm) +{ + unsigned long stolen, blocked; + unsigned long delta_itm = 0, stolentick = 0; + int cpu = smp_processor_id(); + struct vcpu_runstate_info runstate; + struct task_struct *p = current; + + get_runstate_snapshot(&runstate); + + /* + * Check for vcpu migration effect + * In this case, itc value is reversed. + * This causes huge stolen value. + * This function just checks and reject this effect. 
+ */ + if (!time_after_eq(runstate.time[RUNSTATE_blocked], + per_cpu(processed_blocked_time, cpu))) + blocked = 0; + + if (!time_after_eq(runstate.time[RUNSTATE_runnable] + + runstate.time[RUNSTATE_offline], + per_cpu(processed_stolen_time, cpu))) + stolen = 0; + + if (!time_after(delta_itm + new_itm, ia64_get_itc())) + stolentick = ia64_get_itc() - new_itm; + + do_div(stolentick, NS_PER_TICK); + stolentick++; + + do_div(stolen, NS_PER_TICK); + + if (stolen > stolentick) + stolen = stolentick; + + stolentick -= stolen; + do_div(blocked, NS_PER_TICK); + + if (blocked > stolentick) + blocked = stolentick; + + if (stolen > 0 || blocked > 0) { + account_steal_time(NULL, jiffies_to_cputime(stolen)); + account_steal_time(idle_task(cpu), jiffies_to_cputime(blocked)); + run_local_timers(); + + if (rcu_pending(cpu)) + rcu_check_callbacks(cpu, user_mode(get_irq_regs())); + + scheduler_tick(); + run_posix_cpu_timers(p); + delta_itm += local_cpu_data->itm_delta * (stolen + blocked); + + if (cpu == time_keeper_id) { + write_seqlock(&xtime_lock); + do_timer(stolen + blocked); + local_cpu_data->itm_next = delta_itm + new_itm; + write_sequnlock(&xtime_lock); + } else { + local_cpu_data->itm_next = delta_itm + new_itm; + } + per_cpu(processed_stolen_time, cpu) += NS_PER_TICK * stolen; + per_cpu(processed_blocked_time, cpu) += NS_PER_TICK * blocked; + } + return delta_itm; +} + +static int xen_do_steal_accounting(unsigned long *new_itm) +{ + unsigned long delta_itm; + delta_itm = consider_steal_time(*new_itm); + *new_itm += delta_itm; + if (time_after(*new_itm, ia64_get_itc()) && delta_itm) + return 1; + + return 0; +} + +static void xen_itc_jitter_data_reset(void) +{ + u64 lcycle, ret; + + do { + lcycle = itc_jitter_data.itc_lastcycle; + ret = cmpxchg(&itc_jitter_data.itc_lastcycle, lcycle, 0); + } while (unlikely(ret != lcycle)); +} + +struct pv_time_ops xen_time_ops __initdata = { + .init_missing_ticks_accounting = xen_init_missing_ticks_accounting, + .do_steal_accounting = xen_do_steal_accounting, + .clocksource_resume = xen_itc_jitter_data_reset, +}; diff --git a/arch/ia64/xen/time.h b/arch/ia64/xen/time.h new file mode 100644 index 0000000..b9c7ec5 --- /dev/null +++ b/arch/ia64/xen/time.h @@ -0,0 +1,23 @@ +/****************************************************************************** + * arch/ia64/xen/time.h + * + * Copyright (c) 2008 Isaku Yamahata + * VA Linux Systems Japan K.K. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + */ + +extern struct pv_time_ops xen_time_ops __initdata; diff --git a/arch/ia64/xen/xen_pv_ops.c b/arch/ia64/xen/xen_pv_ops.c index 4fe4e62..04cd123 100644 --- a/arch/ia64/xen/xen_pv_ops.c +++ b/arch/ia64/xen/xen_pv_ops.c @@ -30,6 +30,7 @@ #include #include "irq_xen.h" +#include "time.h" /*************************************************************************** * general info @@ -357,6 +358,7 @@ xen_setup_pv_ops(void) pv_cpu_ops = xen_cpu_ops; pv_iosapic_ops = xen_iosapic_ops; pv_irq_ops = xen_irq_ops; + pv_time_ops = xen_time_ops; paravirt_cpu_asm_init(&xen_cpu_asm_switch); } -- 1.6.0.2 From yamahata at valinux.co.jp Mon Oct 13 22:51:41 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Tue, 14 Oct 2008 14:51:41 +0900 Subject: [PATCH 26/32] ia64/pv_ops/xen: implement xen pv_irq_ops. In-Reply-To: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> References: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1223963507-28056-27-git-send-email-yamahata@valinux.co.jp> implement xen pv_irq_ops to paravirtualize irq handling with xen event channel. Cc: Jeremy Fitzhardinge Signed-off-by: Akio Takebe Signed-off-by: Alex Williamson Signed-off-by: Isaku Yamahata --- arch/ia64/xen/Makefile | 2 +- arch/ia64/xen/irq_xen.c | 435 ++++++++++++++++++++++++++++++++++++++++++++ arch/ia64/xen/irq_xen.h | 34 ++++ arch/ia64/xen/xen_pv_ops.c | 3 + 4 files changed, 473 insertions(+), 1 deletions(-) create mode 100644 arch/ia64/xen/irq_xen.c create mode 100644 arch/ia64/xen/irq_xen.h diff --git a/arch/ia64/xen/Makefile b/arch/ia64/xen/Makefile index 9b77e8a..01c4289 100644 --- a/arch/ia64/xen/Makefile +++ b/arch/ia64/xen/Makefile @@ -2,7 +2,7 @@ # Makefile for Xen components # -obj-y := hypercall.o xenivt.o xensetup.o xen_pv_ops.o \ +obj-y := hypercall.o xenivt.o xensetup.o xen_pv_ops.o irq_xen.o \ hypervisor.o xencomm.o xcom_hcall.o grant-table.o AFLAGS_xenivt.o += -D__IA64_ASM_PARAVIRTUALIZED_XEN diff --git a/arch/ia64/xen/irq_xen.c b/arch/ia64/xen/irq_xen.c new file mode 100644 index 0000000..af93aad --- /dev/null +++ b/arch/ia64/xen/irq_xen.c @@ -0,0 +1,435 @@ +/****************************************************************************** + * arch/ia64/xen/irq_xen.c + * + * Copyright (c) 2008 Isaku Yamahata + * VA Linux Systems Japan K.K. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + */ + +#include + +#include +#include +#include + +#include + +#include "irq_xen.h" + +/*************************************************************************** + * pv_irq_ops + * irq operations + */ + +static int +xen_assign_irq_vector(int irq) +{ + struct physdev_irq irq_op; + + irq_op.irq = irq; + if (HYPERVISOR_physdev_op(PHYSDEVOP_alloc_irq_vector, &irq_op)) + return -ENOSPC; + + return irq_op.vector; +} + +static void +xen_free_irq_vector(int vector) +{ + struct physdev_irq irq_op; + + if (vector < IA64_FIRST_DEVICE_VECTOR || + vector > IA64_LAST_DEVICE_VECTOR) + return; + + irq_op.vector = vector; + if (HYPERVISOR_physdev_op(PHYSDEVOP_free_irq_vector, &irq_op)) + printk(KERN_WARNING "%s: xen_free_irq_vecotr fail vector=%d\n", + __func__, vector); +} + + +static DEFINE_PER_CPU(int, timer_irq) = -1; +static DEFINE_PER_CPU(int, ipi_irq) = -1; +static DEFINE_PER_CPU(int, resched_irq) = -1; +static DEFINE_PER_CPU(int, cmc_irq) = -1; +static DEFINE_PER_CPU(int, cmcp_irq) = -1; +static DEFINE_PER_CPU(int, cpep_irq) = -1; +#define NAME_SIZE 15 +static DEFINE_PER_CPU(char[NAME_SIZE], timer_name); +static DEFINE_PER_CPU(char[NAME_SIZE], ipi_name); +static DEFINE_PER_CPU(char[NAME_SIZE], resched_name); +static DEFINE_PER_CPU(char[NAME_SIZE], cmc_name); +static DEFINE_PER_CPU(char[NAME_SIZE], cmcp_name); +static DEFINE_PER_CPU(char[NAME_SIZE], cpep_name); +#undef NAME_SIZE + +struct saved_irq { + unsigned int irq; + struct irqaction *action; +}; +/* 16 should be far optimistic value, since only several percpu irqs + * are registered early. + */ +#define MAX_LATE_IRQ 16 +static struct saved_irq saved_percpu_irqs[MAX_LATE_IRQ]; +static unsigned short late_irq_cnt; +static unsigned short saved_irq_cnt; +static int xen_slab_ready; + +#ifdef CONFIG_SMP +/* Dummy stub. Though we may check XEN_RESCHEDULE_VECTOR before __do_IRQ, + * it ends up to issue several memory accesses upon percpu data and + * thus adds unnecessary traffic to other paths. + */ +static irqreturn_t +xen_dummy_handler(int irq, void *dev_id) +{ + + return IRQ_HANDLED; +} + +static struct irqaction xen_ipi_irqaction = { + .handler = handle_IPI, + .flags = IRQF_DISABLED, + .name = "IPI" +}; + +static struct irqaction xen_resched_irqaction = { + .handler = xen_dummy_handler, + .flags = IRQF_DISABLED, + .name = "resched" +}; + +static struct irqaction xen_tlb_irqaction = { + .handler = xen_dummy_handler, + .flags = IRQF_DISABLED, + .name = "tlb_flush" +}; +#endif + +/* + * This is xen version percpu irq registration, which needs bind + * to xen specific evtchn sub-system. One trick here is that xen + * evtchn binding interface depends on kmalloc because related + * port needs to be freed at device/cpu down. So we cache the + * registration on BSP before slab is ready and then deal them + * at later point. For rest instances happening after slab ready, + * we hook them to xen evtchn immediately. + * + * FIXME: MCA is not supported by far, and thus "nomca" boot param is + * required. 
+ */ +static void +__xen_register_percpu_irq(unsigned int cpu, unsigned int vec, + struct irqaction *action, int save) +{ + irq_desc_t *desc; + int irq = 0; + + if (xen_slab_ready) { + switch (vec) { + case IA64_TIMER_VECTOR: + snprintf(per_cpu(timer_name, cpu), + sizeof(per_cpu(timer_name, cpu)), + "%s%d", action->name, cpu); + irq = bind_virq_to_irqhandler(VIRQ_ITC, cpu, + action->handler, action->flags, + per_cpu(timer_name, cpu), action->dev_id); + per_cpu(timer_irq, cpu) = irq; + break; + case IA64_IPI_RESCHEDULE: + snprintf(per_cpu(resched_name, cpu), + sizeof(per_cpu(resched_name, cpu)), + "%s%d", action->name, cpu); + irq = bind_ipi_to_irqhandler(XEN_RESCHEDULE_VECTOR, cpu, + action->handler, action->flags, + per_cpu(resched_name, cpu), action->dev_id); + per_cpu(resched_irq, cpu) = irq; + break; + case IA64_IPI_VECTOR: + snprintf(per_cpu(ipi_name, cpu), + sizeof(per_cpu(ipi_name, cpu)), + "%s%d", action->name, cpu); + irq = bind_ipi_to_irqhandler(XEN_IPI_VECTOR, cpu, + action->handler, action->flags, + per_cpu(ipi_name, cpu), action->dev_id); + per_cpu(ipi_irq, cpu) = irq; + break; + case IA64_CMC_VECTOR: + snprintf(per_cpu(cmc_name, cpu), + sizeof(per_cpu(cmc_name, cpu)), + "%s%d", action->name, cpu); + irq = bind_virq_to_irqhandler(VIRQ_MCA_CMC, cpu, + action->handler, + action->flags, + per_cpu(cmc_name, cpu), + action->dev_id); + per_cpu(cmc_irq, cpu) = irq; + break; + case IA64_CMCP_VECTOR: + snprintf(per_cpu(cmcp_name, cpu), + sizeof(per_cpu(cmcp_name, cpu)), + "%s%d", action->name, cpu); + irq = bind_ipi_to_irqhandler(XEN_CMCP_VECTOR, cpu, + action->handler, + action->flags, + per_cpu(cmcp_name, cpu), + action->dev_id); + per_cpu(cmcp_irq, cpu) = irq; + break; + case IA64_CPEP_VECTOR: + snprintf(per_cpu(cpep_name, cpu), + sizeof(per_cpu(cpep_name, cpu)), + "%s%d", action->name, cpu); + irq = bind_ipi_to_irqhandler(XEN_CPEP_VECTOR, cpu, + action->handler, + action->flags, + per_cpu(cpep_name, cpu), + action->dev_id); + per_cpu(cpep_irq, cpu) = irq; + break; + case IA64_CPE_VECTOR: + case IA64_MCA_RENDEZ_VECTOR: + case IA64_PERFMON_VECTOR: + case IA64_MCA_WAKEUP_VECTOR: + case IA64_SPURIOUS_INT_VECTOR: + /* No need to complain, these aren't supported. */ + break; + default: + printk(KERN_WARNING "Percpu irq %d is unsupported " + "by xen!\n", vec); + break; + } + BUG_ON(irq < 0); + + if (irq > 0) { + /* + * Mark percpu. Without this, migrate_irqs() will + * mark the interrupt for migrations and trigger it + * on cpu hotplug. + */ + desc = irq_desc + irq; + desc->status |= IRQ_PER_CPU; + } + } + + /* For BSP, we cache registered percpu irqs, and then re-walk + * them when initializing APs + */ + if (!cpu && save) { + BUG_ON(saved_irq_cnt == MAX_LATE_IRQ); + saved_percpu_irqs[saved_irq_cnt].irq = vec; + saved_percpu_irqs[saved_irq_cnt].action = action; + saved_irq_cnt++; + if (!xen_slab_ready) + late_irq_cnt++; + } +} + +static void +xen_register_percpu_irq(ia64_vector vec, struct irqaction *action) +{ + __xen_register_percpu_irq(smp_processor_id(), vec, action, 1); +} + +static void +xen_bind_early_percpu_irq(void) +{ + int i; + + xen_slab_ready = 1; + /* There's no race when accessing this cached array, since only + * BSP will face with such step shortly + */ + for (i = 0; i < late_irq_cnt; i++) + __xen_register_percpu_irq(smp_processor_id(), + saved_percpu_irqs[i].irq, + saved_percpu_irqs[i].action, 0); +} + +/* FIXME: There's no obvious point to check whether slab is ready. So + * a hack is used here by utilizing a late time hook. 
+ */ + +#ifdef CONFIG_HOTPLUG_CPU +static int __devinit +unbind_evtchn_callback(struct notifier_block *nfb, + unsigned long action, void *hcpu) +{ + unsigned int cpu = (unsigned long)hcpu; + + if (action == CPU_DEAD) { + /* Unregister evtchn. */ + if (per_cpu(cpep_irq, cpu) >= 0) { + unbind_from_irqhandler(per_cpu(cpep_irq, cpu), NULL); + per_cpu(cpep_irq, cpu) = -1; + } + if (per_cpu(cmcp_irq, cpu) >= 0) { + unbind_from_irqhandler(per_cpu(cmcp_irq, cpu), NULL); + per_cpu(cmcp_irq, cpu) = -1; + } + if (per_cpu(cmc_irq, cpu) >= 0) { + unbind_from_irqhandler(per_cpu(cmc_irq, cpu), NULL); + per_cpu(cmc_irq, cpu) = -1; + } + if (per_cpu(ipi_irq, cpu) >= 0) { + unbind_from_irqhandler(per_cpu(ipi_irq, cpu), NULL); + per_cpu(ipi_irq, cpu) = -1; + } + if (per_cpu(resched_irq, cpu) >= 0) { + unbind_from_irqhandler(per_cpu(resched_irq, cpu), + NULL); + per_cpu(resched_irq, cpu) = -1; + } + if (per_cpu(timer_irq, cpu) >= 0) { + unbind_from_irqhandler(per_cpu(timer_irq, cpu), NULL); + per_cpu(timer_irq, cpu) = -1; + } + } + return NOTIFY_OK; +} + +static struct notifier_block unbind_evtchn_notifier = { + .notifier_call = unbind_evtchn_callback, + .priority = 0 +}; +#endif + +void xen_smp_intr_init_early(unsigned int cpu) +{ +#ifdef CONFIG_SMP + unsigned int i; + + for (i = 0; i < saved_irq_cnt; i++) + __xen_register_percpu_irq(cpu, saved_percpu_irqs[i].irq, + saved_percpu_irqs[i].action, 0); +#endif +} + +void xen_smp_intr_init(void) +{ +#ifdef CONFIG_SMP + unsigned int cpu = smp_processor_id(); + struct callback_register event = { + .type = CALLBACKTYPE_event, + .address = { .ip = (unsigned long)&xen_event_callback }, + }; + + if (cpu == 0) { + /* Initialization was already done for boot cpu. */ +#ifdef CONFIG_HOTPLUG_CPU + /* Register the notifier only once. */ + register_cpu_notifier(&unbind_evtchn_notifier); +#endif + return; + } + + /* This should be piggyback when setup vcpu guest context */ + BUG_ON(HYPERVISOR_callback_op(CALLBACKOP_register, &event)); +#endif /* CONFIG_SMP */ +} + +void __init +xen_irq_init(void) +{ + struct callback_register event = { + .type = CALLBACKTYPE_event, + .address = { .ip = (unsigned long)&xen_event_callback }, + }; + + xen_init_IRQ(); + BUG_ON(HYPERVISOR_callback_op(CALLBACKOP_register, &event)); + late_time_init = xen_bind_early_percpu_irq; +} + +void +xen_platform_send_ipi(int cpu, int vector, int delivery_mode, int redirect) +{ +#ifdef CONFIG_SMP + /* TODO: we need to call vcpu_up here */ + if (unlikely(vector == ap_wakeup_vector)) { + /* XXX + * This should be in __cpu_up(cpu) in ia64 smpboot.c + * like x86. But don't want to modify it, + * keep it untouched. 
+ */ + xen_smp_intr_init_early(cpu); + + xen_send_ipi(cpu, vector); + /* vcpu_prepare_and_up(cpu); */ + return; + } +#endif + + switch (vector) { + case IA64_IPI_VECTOR: + xen_send_IPI_one(cpu, XEN_IPI_VECTOR); + break; + case IA64_IPI_RESCHEDULE: + xen_send_IPI_one(cpu, XEN_RESCHEDULE_VECTOR); + break; + case IA64_CMCP_VECTOR: + xen_send_IPI_one(cpu, XEN_CMCP_VECTOR); + break; + case IA64_CPEP_VECTOR: + xen_send_IPI_one(cpu, XEN_CPEP_VECTOR); + break; + case IA64_TIMER_VECTOR: { + /* this is used only once by check_sal_cache_flush() + at boot time */ + static int used = 0; + if (!used) { + xen_send_ipi(cpu, IA64_TIMER_VECTOR); + used = 1; + break; + } + /* fallthrough */ + } + default: + printk(KERN_WARNING "Unsupported IPI type 0x%x\n", + vector); + notify_remote_via_irq(0); /* defaults to 0 irq */ + break; + } +} + +static void __init +xen_register_ipi(void) +{ +#ifdef CONFIG_SMP + register_percpu_irq(IA64_IPI_VECTOR, &xen_ipi_irqaction); + register_percpu_irq(IA64_IPI_RESCHEDULE, &xen_resched_irqaction); + register_percpu_irq(IA64_IPI_LOCAL_TLB_FLUSH, &xen_tlb_irqaction); +#endif +} + +static void +xen_resend_irq(unsigned int vector) +{ + (void)resend_irq_on_evtchn(vector); +} + +const struct pv_irq_ops xen_irq_ops __initdata = { + .register_ipi = xen_register_ipi, + + .assign_irq_vector = xen_assign_irq_vector, + .free_irq_vector = xen_free_irq_vector, + .register_percpu_irq = xen_register_percpu_irq, + + .resend_irq = xen_resend_irq, +}; diff --git a/arch/ia64/xen/irq_xen.h b/arch/ia64/xen/irq_xen.h new file mode 100644 index 0000000..26110f3 --- /dev/null +++ b/arch/ia64/xen/irq_xen.h @@ -0,0 +1,34 @@ +/****************************************************************************** + * arch/ia64/xen/irq_xen.h + * + * Copyright (c) 2008 Isaku Yamahata + * VA Linux Systems Japan K.K. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + */ + +#ifndef IRQ_XEN_H +#define IRQ_XEN_H + +extern void (*late_time_init)(void); +extern char xen_event_callback; +void __init xen_init_IRQ(void); + +extern const struct pv_irq_ops xen_irq_ops __initdata; +extern void xen_smp_intr_init(void); +extern void xen_send_ipi(int cpu, int vec); + +#endif /* IRQ_XEN_H */ diff --git a/arch/ia64/xen/xen_pv_ops.c b/arch/ia64/xen/xen_pv_ops.c index 41a6cbf..4fe4e62 100644 --- a/arch/ia64/xen/xen_pv_ops.c +++ b/arch/ia64/xen/xen_pv_ops.c @@ -29,6 +29,8 @@ #include #include +#include "irq_xen.h" + /*************************************************************************** * general info */ @@ -354,6 +356,7 @@ xen_setup_pv_ops(void) pv_init_ops = xen_init_ops; pv_cpu_ops = xen_cpu_ops; pv_iosapic_ops = xen_iosapic_ops; + pv_irq_ops = xen_irq_ops; paravirt_cpu_asm_init(&xen_cpu_asm_switch); } -- 1.6.0.2 From yamahata at valinux.co.jp Mon Oct 13 22:51:43 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Tue, 14 Oct 2008 14:51:43 +0900 Subject: [PATCH 28/32] ia64/xen: define xen machine vector for domU. In-Reply-To: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> References: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1223963507-28056-29-git-send-email-yamahata@valinux.co.jp> define xen machine vector for domU. Signed-off-by: Isaku Yamahata Cc: "Luck, Tony" --- arch/ia64/Makefile | 2 ++ arch/ia64/include/asm/machvec.h | 2 ++ arch/ia64/include/asm/machvec_xen.h | 22 ++++++++++++++++++++++ arch/ia64/kernel/acpi.c | 5 +++++ arch/ia64/xen/Makefile | 2 ++ arch/ia64/xen/machvec.c | 4 ++++ 6 files changed, 37 insertions(+), 0 deletions(-) create mode 100644 arch/ia64/include/asm/machvec_xen.h create mode 100644 arch/ia64/xen/machvec.c diff --git a/arch/ia64/Makefile b/arch/ia64/Makefile index 905d25b..4024250 100644 --- a/arch/ia64/Makefile +++ b/arch/ia64/Makefile @@ -56,9 +56,11 @@ core-$(CONFIG_IA64_DIG) += arch/ia64/dig/ core-$(CONFIG_IA64_GENERIC) += arch/ia64/dig/ core-$(CONFIG_IA64_HP_ZX1) += arch/ia64/dig/ core-$(CONFIG_IA64_HP_ZX1_SWIOTLB) += arch/ia64/dig/ +core-$(CONFIG_IA64_XEN_GUEST) += arch/ia64/dig/ core-$(CONFIG_IA64_SGI_SN2) += arch/ia64/sn/ core-$(CONFIG_IA64_SGI_UV) += arch/ia64/uv/ core-$(CONFIG_KVM) += arch/ia64/kvm/ +core-$(CONFIG_XEN) += arch/ia64/xen/ drivers-$(CONFIG_PCI) += arch/ia64/pci/ drivers-$(CONFIG_IA64_HP_SIM) += arch/ia64/hp/sim/ diff --git a/arch/ia64/include/asm/machvec.h b/arch/ia64/include/asm/machvec.h index 2b850cc..de99cb2 100644 --- a/arch/ia64/include/asm/machvec.h +++ b/arch/ia64/include/asm/machvec.h @@ -128,6 +128,8 @@ extern void machvec_tlb_migrate_finish (struct mm_struct *); # include # elif defined (CONFIG_IA64_SGI_UV) # include +# elif defined (CONFIG_IA64_XEN_GUEST) +# include # elif defined (CONFIG_IA64_GENERIC) # ifdef MACHVEC_PLATFORM_HEADER diff --git a/arch/ia64/include/asm/machvec_xen.h b/arch/ia64/include/asm/machvec_xen.h new file mode 100644 index 0000000..55f9228 --- /dev/null +++ b/arch/ia64/include/asm/machvec_xen.h @@ -0,0 +1,22 @@ +#ifndef _ASM_IA64_MACHVEC_XEN_h +#define _ASM_IA64_MACHVEC_XEN_h + +extern ia64_mv_setup_t dig_setup; +extern ia64_mv_cpu_init_t xen_cpu_init; +extern ia64_mv_irq_init_t xen_irq_init; +extern ia64_mv_send_ipi_t xen_platform_send_ipi; + +/* + * This stuff has dual use! 
+ * + * For a generic kernel, the macros are used to initialize the + * platform's machvec structure. When compiling a non-generic kernel, + * the macros are used directly. + */ +#define platform_name "xen" +#define platform_setup dig_setup +#define platform_cpu_init xen_cpu_init +#define platform_irq_init xen_irq_init +#define platform_send_ipi xen_platform_send_ipi + +#endif /* _ASM_IA64_MACHVEC_XEN_h */ diff --git a/arch/ia64/kernel/acpi.c b/arch/ia64/kernel/acpi.c index 5d1eb7e..0093649 100644 --- a/arch/ia64/kernel/acpi.c +++ b/arch/ia64/kernel/acpi.c @@ -52,6 +52,7 @@ #include #include #include +#include #define BAD_MADT_ENTRY(entry, end) ( \ (!entry) || (unsigned long)entry + sizeof(*entry) > end || \ @@ -121,6 +122,8 @@ acpi_get_sysname(void) return "uv"; else return "sn2"; + } else if (xen_pv_domain() && !strcmp(hdr->oem_id, "XEN")) { + return "xen"; } return "dig"; @@ -137,6 +140,8 @@ acpi_get_sysname(void) return "uv"; # elif defined (CONFIG_IA64_DIG) return "dig"; +# elif defined (CONFIG_IA64_XEN_GUEST) + return "xen"; # else # error Unknown platform. Fix acpi.c. # endif diff --git a/arch/ia64/xen/Makefile b/arch/ia64/xen/Makefile index ed31c76..972d085 100644 --- a/arch/ia64/xen/Makefile +++ b/arch/ia64/xen/Makefile @@ -5,6 +5,8 @@ obj-y := hypercall.o xenivt.o xensetup.o xen_pv_ops.o irq_xen.o \ hypervisor.o xencomm.o xcom_hcall.o grant-table.o time.o +obj-$(CONFIG_IA64_GENERIC) += machvec.o + AFLAGS_xenivt.o += -D__IA64_ASM_PARAVIRTUALIZED_XEN # xen multi compile diff --git a/arch/ia64/xen/machvec.c b/arch/ia64/xen/machvec.c new file mode 100644 index 0000000..4ad588a --- /dev/null +++ b/arch/ia64/xen/machvec.c @@ -0,0 +1,4 @@ +#define MACHVEC_PLATFORM_NAME xen +#define MACHVEC_PLATFORM_HEADER +#include + -- 1.6.0.2 From yamahata at valinux.co.jp Mon Oct 13 22:51:47 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Tue, 14 Oct 2008 14:51:47 +0900 Subject: [PATCH 32/32] ia64/pv_ops: paravirtualized instruction checker. In-Reply-To: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> References: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1223963507-28056-33-git-send-email-yamahata@valinux.co.jp> This patch implements a checker to detect instructions which should be paravirtualized instead of being written directly as raw instructions. The check is rough, so it doesn't fully cover all cases, but it detects most cases of paravirtualization breakage in hand-written assembly code. Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/native/pvchk_inst.h | 263 +++++++++++++++++++++++++++++ arch/ia64/kernel/Makefile | 18 ++ arch/ia64/kernel/paravirt_inst.h | 4 +- arch/ia64/scripts/pvcheck.sed | 32 ++++ 4 files changed, 316 insertions(+), 1 deletions(-) create mode 100644 arch/ia64/include/asm/native/pvchk_inst.h create mode 100644 arch/ia64/scripts/pvcheck.sed diff --git a/arch/ia64/include/asm/native/pvchk_inst.h b/arch/ia64/include/asm/native/pvchk_inst.h new file mode 100644 index 0000000..b8e6eb1 --- /dev/null +++ b/arch/ia64/include/asm/native/pvchk_inst.h @@ -0,0 +1,263 @@ +#ifndef _ASM_NATIVE_PVCHK_INST_H +#define _ASM_NATIVE_PVCHK_INST_H + +/****************************************************************************** + * arch/ia64/include/asm/native/pvchk_inst.h + * Checker for paravirtualizations of privileged operations. + * + * Copyright (C) 2005 Hewlett-Packard Co + * Dan Magenheimer + * + * Copyright (c) 2008 Isaku Yamahata + * VA Linux Systems Japan K.K.
+ * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + */ + +/********************************************** + * Instructions paravirtualized for correctness + **********************************************/ + +/* "fc" and "thash" are privilege-sensitive instructions, meaning they + * may have different semantics depending on whether they are executed + * at PL0 vs PL!=0. When paravirtualized, these instructions mustn't + * be allowed to execute directly, lest incorrect semantics result. + */ + +#define fc .error "fc should not be used directly." +#define thash .error "thash should not be used directly." + +/* Note that "ttag" and "cover" are also privilege-sensitive; "ttag" + * is not currently used (though it may be in a long-format VHPT system!) + * and the semantics of cover only change if psr.ic is off which is very + * rare (and currently non-existent outside of assembly code + */ +#define ttag .error "ttag should not be used directly." +#define cover .error "cover should not be used directly." + +/* There are also privilege-sensitive registers. These registers are + * readable at any privilege level but only writable at PL0. + */ +#define cpuid .error "cpuid should not be used directly." +#define pmd .error "pmd should not be used directly." + +/* + * mov ar.eflag = + * mov = ar.eflag + */ + +/********************************************** + * Instructions paravirtualized for performance + **********************************************/ +/* + * Those instructions include '.' which can't be handled by cpp. + * or can't be handled by cpp easily. + * They are handled by sed instead of cpp. + */ + +/* for .S + * itc.i + * itc.d + * + * bsw.0 + * bsw.1 + * + * ssm psr.ic | PSR_DEFAULT_BITS + * ssm psr.ic + * rsm psr.ic + * ssm psr.i + * rsm psr.i + * rsm psr.i | psr.ic + * rsm psr.dt + * ssm psr.dt + * + * mov = cr.ifa + * mov = cr.itir + * mov = cr.isr + * mov = cr.iha + * mov = cr.ipsr + * mov = cr.iim + * mov = cr.iip + * mov = cr.ivr + * mov = psr + * + * mov cr.ifa = + * mov cr.itir = + * mov cr.iha = + * mov cr.ipsr = + * mov cr.ifs = + * mov cr.iip = + * mov cr.kr = + */ + +/* for intrinsics + * ssm psr.i + * rsm psr.i + * mov = psr + * mov = ivr + * mov = tpr + * mov cr.itm = + * mov eoi = + * mov rr[] = + * mov = rr[] + * mov = kr + * mov kr = + * ptc.ga + */ + +/************************************************************* + * define paravirtualized instrcution macros as nop to ingore. + * and check whether arguments are appropriate. 
+ *************************************************************/ + +/* check whether reg is a regular register */ +.macro is_rreg_in reg + .ifc "\reg", "r0" + nop 0 + .exitm + .endif + ;; + mov \reg = r0 + ;; +.endm +#define IS_RREG_IN(reg) is_rreg_in reg ; + +#define IS_RREG_OUT(reg) \ + ;; \ + mov reg = r0 \ + ;; + +#define IS_RREG_CLOB(reg) IS_RREG_OUT(reg) + +/* check whether pred is a predicate register */ +#define IS_PRED_IN(pred) \ + ;; \ + (pred) nop 0 \ + ;; + +#define IS_PRED_OUT(pred) \ + ;; \ + cmp.eq pred, p0 = r0, r0 \ + ;; + +#define IS_PRED_CLOB(pred) IS_PRED_OUT(pred) + + +#define DO_SAVE_MIN(__COVER, SAVE_IFS, EXTRA, WORKAROUND) \ + nop 0 +#define MOV_FROM_IFA(reg) \ + IS_RREG_OUT(reg) +#define MOV_FROM_ITIR(reg) \ + IS_RREG_OUT(reg) +#define MOV_FROM_ISR(reg) \ + IS_RREG_OUT(reg) +#define MOV_FROM_IHA(reg) \ + IS_RREG_OUT(reg) +#define MOV_FROM_IPSR(pred, reg) \ + IS_PRED_IN(pred) \ + IS_RREG_OUT(reg) +#define MOV_FROM_IIM(reg) \ + IS_RREG_OUT(reg) +#define MOV_FROM_IIP(reg) \ + IS_RREG_OUT(reg) +#define MOV_FROM_IVR(reg, clob) \ + IS_RREG_OUT(reg) \ + IS_RREG_CLOB(clob) +#define MOV_FROM_PSR(pred, reg, clob) \ + IS_PRED_IN(pred) \ + IS_RREG_OUT(reg) \ + IS_RREG_CLOB(clob) +#define MOV_TO_IFA(reg, clob) \ + IS_RREG_IN(reg) \ + IS_RREG_CLOB(clob) +#define MOV_TO_ITIR(pred, reg, clob) \ + IS_PRED_IN(pred) \ + IS_RREG_IN(reg) \ + IS_RREG_CLOB(clob) +#define MOV_TO_IHA(pred, reg, clob) \ + IS_PRED_IN(pred) \ + IS_RREG_IN(reg) \ + IS_RREG_CLOB(clob) +#define MOV_TO_IPSR(pred, reg, clob) \ + IS_PRED_IN(pred) \ + IS_RREG_IN(reg) \ + IS_RREG_CLOB(clob) +#define MOV_TO_IFS(pred, reg, clob) \ + IS_PRED_IN(pred) \ + IS_RREG_IN(reg) \ + IS_RREG_CLOB(clob) +#define MOV_TO_IIP(reg, clob) \ + IS_RREG_IN(reg) \ + IS_RREG_CLOB(clob) +#define MOV_TO_KR(kr, reg, clob0, clob1) \ + IS_RREG_IN(reg) \ + IS_RREG_CLOB(clob0) \ + IS_RREG_CLOB(clob1) +#define ITC_I(pred, reg, clob) \ + IS_PRED_IN(pred) \ + IS_RREG_IN(reg) \ + IS_RREG_CLOB(clob) +#define ITC_D(pred, reg, clob) \ + IS_PRED_IN(pred) \ + IS_RREG_IN(reg) \ + IS_RREG_CLOB(clob) +#define ITC_I_AND_D(pred_i, pred_d, reg, clob) \ + IS_PRED_IN(pred_i) \ + IS_PRED_IN(pred_d) \ + IS_RREG_IN(reg) \ + IS_RREG_CLOB(clob) +#define THASH(pred, reg0, reg1, clob) \ + IS_PRED_IN(pred) \ + IS_RREG_OUT(reg0) \ + IS_RREG_IN(reg1) \ + IS_RREG_CLOB(clob) +#define SSM_PSR_IC_AND_DEFAULT_BITS_AND_SRLZ_I(clob0, clob1) \ + IS_RREG_CLOB(clob0) \ + IS_RREG_CLOB(clob1) +#define SSM_PSR_IC_AND_SRLZ_D(clob0, clob1) \ + IS_RREG_CLOB(clob0) \ + IS_RREG_CLOB(clob1) +#define RSM_PSR_IC(clob) \ + IS_RREG_CLOB(clob) +#define SSM_PSR_I(pred, pred_clob, clob) \ + IS_PRED_IN(pred) \ + IS_PRED_CLOB(pred_clob) \ + IS_RREG_CLOB(clob) +#define RSM_PSR_I(pred, clob0, clob1) \ + IS_PRED_IN(pred) \ + IS_RREG_CLOB(clob0) \ + IS_RREG_CLOB(clob1) +#define RSM_PSR_I_IC(clob0, clob1, clob2) \ + IS_RREG_CLOB(clob0) \ + IS_RREG_CLOB(clob1) \ + IS_RREG_CLOB(clob2) +#define RSM_PSR_DT \ + nop 0 +#define SSM_PSR_DT_AND_SRLZ_I \ + nop 0 +#define BSW_0(clob0, clob1, clob2) \ + IS_RREG_CLOB(clob0) \ + IS_RREG_CLOB(clob1) \ + IS_RREG_CLOB(clob2) +#define BSW_1(clob0, clob1) \ + IS_RREG_CLOB(clob0) \ + IS_RREG_CLOB(clob1) +#define COVER \ + nop 0 +#define RFI \ + br.ret.sptk.many rp /* defining nop causes dependency error */ + +#endif /* _ASM_NATIVE_PVCHK_INST_H */ diff --git a/arch/ia64/kernel/Makefile b/arch/ia64/kernel/Makefile index 87fea11..55e6ca8 100644 --- a/arch/ia64/kernel/Makefile +++ b/arch/ia64/kernel/Makefile @@ -112,5 +112,23 @@ clean-files += 
$(objtree)/include/asm-ia64/nr-irqs.h ASM_PARAVIRT_OBJS = ivt.o entry.o define paravirtualized_native AFLAGS_$(1) += -D__IA64_ASM_PARAVIRTUALIZED_NATIVE +AFLAGS_pvchk-sed-$(1) += -D__IA64_ASM_PARAVIRTUALIZED_PVCHECK +extra-y += pvchk-$(1) endef $(foreach obj,$(ASM_PARAVIRT_OBJS),$(eval $(call paravirtualized_native,$(obj)))) + +# +# Checker for paravirtualizations of privileged operations. +# +quiet_cmd_pv_check_sed = PVCHK $@ +define cmd_pv_check_sed + sed -f $(srctree)/arch/$(SRCARCH)/scripts/pvcheck.sed $< > $@ +endef + +$(obj)/pvchk-sed-%.s: $(src)/%.S $(srctree)/arch/$(SRCARCH)/scripts/pvcheck.sed FORCE + $(call if_changed_dep,as_s_S) +$(obj)/pvchk-%.s: $(obj)/pvchk-sed-%.s FORCE + $(call if_changed,pv_check_sed) +$(obj)/pvchk-%.o: $(obj)/pvchk-%.s FORCE + $(call if_changed,as_o_S) +.PRECIOUS: $(obj)/pvchk-sed-%.s $(obj)/pvchk-%.s $(obj)/pvchk-%.o diff --git a/arch/ia64/kernel/paravirt_inst.h b/arch/ia64/kernel/paravirt_inst.h index 5cad6fb..64d6d81 100644 --- a/arch/ia64/kernel/paravirt_inst.h +++ b/arch/ia64/kernel/paravirt_inst.h @@ -20,7 +20,9 @@ * */ -#ifdef __IA64_ASM_PARAVIRTUALIZED_XEN +#ifdef __IA64_ASM_PARAVIRTUALIZED_PVCHECK +#include +#elif defined(__IA64_ASM_PARAVIRTUALIZED_XEN) #include #include #else diff --git a/arch/ia64/scripts/pvcheck.sed b/arch/ia64/scripts/pvcheck.sed new file mode 100644 index 0000000..abdaca7 --- /dev/null +++ b/arch/ia64/scripts/pvcheck.sed @@ -0,0 +1,32 @@ +# +# Checker for paravirtualizations of privileged operations. +# +s/ssm.*psr\.ic/.warning \"ssm psr.ic should not be used directly\"/g +s/rsm.*psr\.ic/.warning \"rsm psr.ic should not be used directly\"/g +s/ssm.*psr\.i/.warning \"ssm psr.i should not be used directly\"/g +s/rsm.*psr\.i/.warning \"rsm psr.i should not be used directly\"/g +s/ssm.*psr\.dt/.warning \"ssm psr.dt should not be used directly\"/g +s/rsm.*psr\.dt/.warning \"rsm psr.dt should not be used directly\"/g +s/mov.*=.*cr\.ifa/.warning \"cr.ifa should not used directly\"/g +s/mov.*=.*cr\.itir/.warning \"cr.itir should not used directly\"/g +s/mov.*=.*cr\.isr/.warning \"cr.isr should not used directly\"/g +s/mov.*=.*cr\.iha/.warning \"cr.iha should not used directly\"/g +s/mov.*=.*cr\.ipsr/.warning \"cr.ipsr should not used directly\"/g +s/mov.*=.*cr\.iim/.warning \"cr.iim should not used directly\"/g +s/mov.*=.*cr\.iip/.warning \"cr.iip should not used directly\"/g +s/mov.*=.*cr\.ivr/.warning \"cr.ivr should not used directly\"/g +s/mov.*=[^\.]*psr/.warning \"psr should not used directly\"/g # avoid ar.fpsr +s/mov.*=.*ar\.eflags/.warning \"ar.eflags should not used directly\"/g +s/mov.*cr\.ifa.*=/.warning \"cr.ifa should not used directly\"/g +s/mov.*cr\.itir.*=/.warning \"cr.itir should not used directly\"/g +s/mov.*cr\.iha.*=/.warning \"cr.iha should not used directly\"/g +s/mov.*cr\.ipsr.*=/.warning \"cr.ipsr should not used directly\"/g +s/mov.*cr\.ifs.*=/.warning \"cr.ifs should not used directly\"/g +s/mov.*cr\.iip.*=/.warning \"cr.iip should not used directly\"/g +s/mov.*cr\.kr.*=/.warning \"cr.kr should not used directly\"/g +s/mov.*ar\.eflags.*=/.warning \"ar.eflags should not used directly\"/g +s/itc\.i/.warning \"itc.i should not be used directly.\"/g +s/itc\.d/.warning \"itc.d should not be used directly.\"/g +s/bsw\.0/.warning \"bsw.0 should not be used directly.\"/g +s/bsw\.1/.warning \"bsw.1 should not be used directly.\"/g +s/ptc\.ga/.warning \"ptc.ga should not be used directly.\"/g -- 1.6.0.2 From yamahata at valinux.co.jp Mon Oct 13 22:51:34 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: 
Tue, 14 Oct 2008 14:51:34 +0900 Subject: [PATCH 19/32] ia64/pv_ops/xen: define xen pv_cpu_ops. In-Reply-To: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> References: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1223963507-28056-20-git-send-email-yamahata@valinux.co.jp> define xen pv_cpu_ops which implementes xen paravirtualized privileged instructions. Signed-off-by: Yaozu (Eddie) Dong Signed-off-by: Isaku Yamahata --- arch/ia64/xen/xen_pv_ops.c | 114 ++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 114 insertions(+), 0 deletions(-) diff --git a/arch/ia64/xen/xen_pv_ops.c b/arch/ia64/xen/xen_pv_ops.c index fc9d599..c236f04 100644 --- a/arch/ia64/xen/xen_pv_ops.c +++ b/arch/ia64/xen/xen_pv_ops.c @@ -163,6 +163,119 @@ static const struct pv_init_ops xen_init_ops __initdata = { }; /*************************************************************************** + * pv_cpu_ops + * intrinsics hooks. + */ + +static void xen_setreg(int regnum, unsigned long val) +{ + switch (regnum) { + case _IA64_REG_AR_KR0 ... _IA64_REG_AR_KR7: + xen_set_kr(regnum - _IA64_REG_AR_KR0, val); + break; +#ifdef CONFIG_IA32_SUPPORT + case _IA64_REG_AR_EFLAG: + xen_set_eflag(val); + break; +#endif + case _IA64_REG_CR_TPR: + xen_set_tpr(val); + break; + case _IA64_REG_CR_ITM: + xen_set_itm(val); + break; + case _IA64_REG_CR_EOI: + xen_eoi(val); + break; + default: + ia64_native_setreg_func(regnum, val); + break; + } +} + +static unsigned long xen_getreg(int regnum) +{ + unsigned long res; + + switch (regnum) { + case _IA64_REG_PSR: + res = xen_get_psr(); + break; +#ifdef CONFIG_IA32_SUPPORT + case _IA64_REG_AR_EFLAG: + res = xen_get_eflag(); + break; +#endif + case _IA64_REG_CR_IVR: + res = xen_get_ivr(); + break; + case _IA64_REG_CR_TPR: + res = xen_get_tpr(); + break; + default: + res = ia64_native_getreg_func(regnum); + break; + } + return res; +} + +/* turning on interrupts is a bit more complicated.. write to the + * memory-mapped virtual psr.i bit first (to avoid race condition), + * then if any interrupts were pending, we have to execute a hyperprivop + * to ensure the pending interrupt gets delivered; else we're done! */ +static void +xen_ssm_i(void) +{ + int old = xen_get_virtual_psr_i(); + xen_set_virtual_psr_i(1); + barrier(); + if (!old && xen_get_virtual_pend()) + xen_hyper_ssm_i(); +} + +/* turning off interrupts can be paravirtualized simply by writing + * to a memory-mapped virtual psr.i bit (implemented as a 16-bit bool) */ +static void +xen_rsm_i(void) +{ + xen_set_virtual_psr_i(0); + barrier(); +} + +static unsigned long +xen_get_psr_i(void) +{ + return xen_get_virtual_psr_i() ? 
IA64_PSR_I : 0; +} + +static void +xen_intrin_local_irq_restore(unsigned long mask) +{ + if (mask & IA64_PSR_I) + xen_ssm_i(); + else + xen_rsm_i(); +} + +static const struct pv_cpu_ops xen_cpu_ops __initdata = { + .fc = xen_fc, + .thash = xen_thash, + .get_cpuid = xen_get_cpuid, + .get_pmd = xen_get_pmd, + .getreg = xen_getreg, + .setreg = xen_setreg, + .ptcga = xen_ptcga, + .get_rr = xen_get_rr, + .set_rr = xen_set_rr, + .set_rr0_to_rr4 = xen_set_rr0_to_rr4, + .ssm_i = xen_ssm_i, + .rsm_i = xen_rsm_i, + .get_psr_i = xen_get_psr_i, + .intrin_local_irq_restore + = xen_intrin_local_irq_restore, +}; + +/*************************************************************************** * pv_ops initialization */ @@ -172,4 +285,5 @@ xen_setup_pv_ops(void) xen_info_init(); pv_info = xen_info; pv_init_ops = xen_init_ops; + pv_cpu_ops = xen_cpu_ops; } -- 1.6.0.2 From yamahata at valinux.co.jp Mon Oct 13 22:51:46 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Tue, 14 Oct 2008 14:51:46 +0900 Subject: [PATCH 31/32] ia64/xen: a recipe for using xen/ia64 with pv_ops. In-Reply-To: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> References: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1223963507-28056-32-git-send-email-yamahata@valinux.co.jp> Recipe for useing xen/ia64 with pv_ops domU. Signed-off-by: Akio Takebe Signed-off-by: Isaku Yamahata --- Documentation/ia64/xen.txt | 183 ++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 183 insertions(+), 0 deletions(-) create mode 100644 Documentation/ia64/xen.txt diff --git a/Documentation/ia64/xen.txt b/Documentation/ia64/xen.txt new file mode 100644 index 0000000..a5c6993 --- /dev/null +++ b/Documentation/ia64/xen.txt @@ -0,0 +1,183 @@ + Recipe for getting/building/running Xen/ia64 with pv_ops + -------------------------------------------------------- + +This recipe discribes how to get xen-ia64 source and build it, +and run domU with pv_ops. + +=========== +Requirement +=========== + + - python + - mercurial + it (aka "hg") is a open-source source code + management software. See the below. + http://www.selenic.com/mercurial/wiki/ + - git + - bridge-utils + +================================= +Getting and Building Xen and Dom0 +================================= + + My enviroment is; + Machine : Tiger4 + Domain0 OS : RHEL5 + DomainU OS : RHEL5 + + 1. Download source + # hg clone http://xenbits.xensource.com/ext/ia64/xen-unstable.hg + # cd xen-unstable.hg + # hg clone http://xenbits.xensource.com/ext/ia64/linux-2.6.18-xen.hg + + 2. # make world + + 3. # make install-tools + + 4. copy kernels and xen + # cp xen/xen.gz /boot/efi/efi/redhat/ + # cp build-linux-2.6.18-xen_ia64/vmlinux.gz \ + /boot/efi/efi/redhat/vmlinuz-2.6.18.8-xen + + 5. make initrd for Dom0/DomU + # make -C linux-2.6.18-xen.hg ARCH=ia64 modules_install \ + O=$(/bin/pwd)/build-linux-2.6.18-xen_ia64 + # mkinitrd -f /boot/efi/efi/redhat/initrd-2.6.18.8-xen.img \ + 2.6.18.8-xen --builtin mptspi --builtin mptbase \ + --builtin mptscsih --builtin uhci-hcd --builtin ohci-hcd \ + --builtin ehci-hcd + +================================ +Making a disk image for guest OS +================================ + + 1. make file + # dd if=/dev/zero of=/root/rhel5.img bs=1M seek=4096 count=0 + # mke2fs -F -j /root/rhel5.img + # mount -o loop /root/rhel5.img /mnt + # cp -ax /{dev,var,etc,usr,bin,sbin,lib} /mnt + # mkdir /mnt/{root,proc,sys,home,tmp} + + Note: You may miss some device files. If so, please create them + with mknod. Or you can use tar intead of cp. 
+ + 2. modify DomU's fstab + # vi /mnt/etc/fstab + /dev/xvda1 / ext3 defaults 1 1 + none /dev/pts devpts gid=5,mode=620 0 0 + none /dev/shm tmpfs defaults 0 0 + none /proc proc defaults 0 0 + none /sys sysfs defaults 0 0 + + 3. modify inittab + set runlevel to 3 to avoid X trying to start + # vi /mnt/etc/inittab + id:3:initdefault: + Start a getty on the hvc0 console + X0:2345:respawn:/sbin/mingetty hvc0 + tty1-6 mingetty can be commented out + + 4. add hvc0 into /etc/securetty + # vi /mnt/etc/securetty (add hvc0) + + 5. umount + # umount /mnt + +FYI, virt-manager can also make a disk image for guest OS. +It's GUI tools and easy to make it. + +================== +Boot Xen & Domain0 +================== + + 1. replace elilo + elilo of RHEL5 can boot Xen and Dom0. + If you use old elilo (e.g RHEL4), please download from the below + http://elilo.sourceforge.net/cgi-bin/blosxom + and copy into /boot/efi/efi/redhat/ + # cp elilo-3.6-ia64.efi /boot/efi/efi/redhat/elilo.efi + + 2. modify elilo.conf (like the below) + # vi /boot/efi/efi/redhat/elilo.conf + prompt + timeout=20 + default=xen + relocatable + + image=vmlinuz-2.6.18.8-xen + label=xen + vmm=xen.gz + initrd=initrd-2.6.18.8-xen.img + read-only + append=" -- rhgb root=/dev/sda2" + +The append options before "--" are for xen hypervisor, +the options after "--" are for dom0. + +FYI, your machine may need console options like +"com1=19200,8n1 console=vga,com1". For example, +append="com1=19200,8n1 console=vga,com1 -- rhgb console=tty0 \ +console=ttyS0 root=/dev/sda2" + +===================================== +Getting and Building domU with pv_ops +===================================== + + 1. get pv_ops tree + # git clone http://people.valinux.co.jp/~yamahata/xen-ia64/linux-2.6-xen-ia64.git/ + + 2. git branch (if necessary) + # cd linux-2.6-xen-ia64/ + # git checkout -b your_branch origin/xen-ia64-domu-minimal-2008may19 + (Note: The current branch is xen-ia64-domu-minimal-2008may19. + But you would find the new branch. You can see with + "git branch -r" to get the branch lists. + http://people.valinux.co.jp/~yamahata/xen-ia64/for_eagl/linux-2.6-ia64-pv-ops.git/ + is also available. The tree is based on + git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6 test) + + + 3. copy .config for pv_ops of domU + # cp arch/ia64/configs/xen_domu_wip_defconfig .config + + 4. make kernel with pv_ops + # make oldconfig + # make + + 5. install the kernel and initrd + # cp vmlinux.gz /boot/efi/efi/redhat/vmlinuz-2.6-pv_ops-xenU + # make modules_install + # mkinitrd -f /boot/efi/efi/redhat/initrd-2.6-pv_ops-xenU.img \ + 2.6.26-rc3xen-ia64-08941-g1b12161 --builtin mptspi \ + --builtin mptbase --builtin mptscsih --builtin uhci-hcd \ + --builtin ohci-hcd --builtin ehci-hcd + +======================== +Boot DomainU with pv_ops +======================== + + 1. make config of DomU + # vi /etc/xen/rhel5 + kernel = "/boot/efi/efi/redhat/vmlinuz-2.6-pv_ops-xenU" + ramdisk = "/boot/efi/efi/redhat/initrd-2.6-pv_ops-xenU.img" + vcpus = 1 + memory = 512 + name = "rhel5" + disk = [ 'file:/root/rhel5.img,xvda1,w' ] + root = "/dev/xvda1 ro" + extra= "rhgb console=hvc0" + + 2. After boot xen and dom0, start xend + # /etc/init.d/xend start + ( In the debugging case, # XEND_DEBUG=1 xend trace_start ) + + 3. 
start domU + # xm create -c rhel5 + +========= +Reference +========= +- Wiki of Xen/IA64 upstream merge + http://wiki.xensource.com/xenwiki/XenIA64/UpstreamMerge + +Witten by Akio Takebe on 28 May 2008 -- 1.6.0.2 From yamahata at valinux.co.jp Mon Oct 13 22:51:45 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Tue, 14 Oct 2008 14:51:45 +0900 Subject: [PATCH 30/32] ia64/pv_ops: update Kconfig for paravirtualized guest and xen. In-Reply-To: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> References: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1223963507-28056-31-git-send-email-yamahata@valinux.co.jp> introduce CONFIG_PARAVIRT_GUEST, CONFIG_PARAVIRT for paravirtualized guest. introduce CONFIG_XEN, CONFIG_IA64_XEN_GUEST for xen. Signed-off-by: Alex Williamson Signed-off-by: Isaku Yamahata Cc: "Luck, Tony" --- arch/ia64/Kconfig | 32 ++++++++++++++++++++++++++++++++ arch/ia64/xen/Kconfig | 26 ++++++++++++++++++++++++++ 2 files changed, 58 insertions(+), 0 deletions(-) create mode 100644 arch/ia64/xen/Kconfig diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig index 48e496f..34639f1 100644 --- a/arch/ia64/Kconfig +++ b/arch/ia64/Kconfig @@ -116,6 +116,33 @@ config AUDIT_ARCH bool default y +menuconfig PARAVIRT_GUEST + bool "Paravirtualized guest support" + help + Say Y here to get to see options related to running Linux under + various hypervisors. This option alone does not add any kernel code. + + If you say N, all options in this submenu will be skipped and disabled. + +if PARAVIRT_GUEST + +config PARAVIRT + bool "Enable paravirtualization code" + depends on PARAVIRT_GUEST + default y + bool + default y + help + This changes the kernel so it can modify itself when it is run + under a hypervisor, potentially improving performance significantly + over full virtualization. However, when run without a hypervisor + the kernel is theoretically slower and slightly larger. + + +source "arch/ia64/xen/Kconfig" + +endif + choice prompt "System type" default IA64_GENERIC @@ -137,6 +164,7 @@ config IA64_GENERIC SGI-SN2 For SGI Altix systems SGI-UV For SGI UV systems Ski-simulator For the HP simulator + Xen-domU For xen domU system If you don't know what to do, choose "generic". @@ -187,6 +215,10 @@ config IA64_HP_SIM bool "Ski-simulator" select SWIOTLB +config IA64_XEN_GUEST + bool "Xen guest" + depends on XEN + endchoice choice diff --git a/arch/ia64/xen/Kconfig b/arch/ia64/xen/Kconfig new file mode 100644 index 0000000..f1683a2 --- /dev/null +++ b/arch/ia64/xen/Kconfig @@ -0,0 +1,26 @@ +# +# This Kconfig describes xen/ia64 options +# + +config XEN + bool "Xen hypervisor support" + default y + depends on PARAVIRT && MCKINLEY && IA64_PAGE_SIZE_16KB && EXPERIMENTAL + select XEN_XENCOMM + select NO_IDLE_HZ + + # those are required to save/restore. + select ARCH_SUSPEND_POSSIBLE + select SUSPEND + select PM_SLEEP + help + Enable Xen hypervisor support. Resulting kernel runs + both as a guest OS on Xen and natively on hardware. + +config XEN_XENCOMM + depends on XEN + bool + +config NO_IDLE_HZ + depends on XEN + bool -- 1.6.0.2 From yamahata at valinux.co.jp Mon Oct 13 22:51:19 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Tue, 14 Oct 2008 14:51:19 +0900 Subject: [PATCH 04/32] ia64/xen: reserve "break" numbers used for xen hypercalls. 
In-Reply-To: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> References: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1223963507-28056-5-git-send-email-yamahata@valinux.co.jp> reserve "break" numbers used for xen hypercalls to avoid reuse for something else. Cc: "Luck, Tony" Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/break.h | 9 +++++++++ 1 files changed, 9 insertions(+), 0 deletions(-) diff --git a/arch/ia64/include/asm/break.h b/arch/ia64/include/asm/break.h index f034020..e90c40e 100644 --- a/arch/ia64/include/asm/break.h +++ b/arch/ia64/include/asm/break.h @@ -20,4 +20,13 @@ */ #define __IA64_BREAK_SYSCALL 0x100000 +/* + * Xen specific break numbers: + */ +#define __IA64_XEN_HYPERCALL 0x1000 +/* [__IA64_XEN_HYPERPRIVOP_START, __IA64_XEN_HYPERPRIVOP_MAX] is used + for xen hyperprivops */ +#define __IA64_XEN_HYPERPRIVOP_START 0x1 +#define __IA64_XEN_HYPERPRIVOP_MAX 0x1a + #endif /* _ASM_IA64_BREAK_H */ -- 1.6.0.2 From yamahata at valinux.co.jp Mon Oct 13 22:51:35 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Tue, 14 Oct 2008 14:51:35 +0900 Subject: [PATCH 20/32] ia64/pv_ops/xen: define xen paravirtualized instructions for hand written assembly code In-Reply-To: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> References: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1223963507-28056-21-git-send-email-yamahata@valinux.co.jp> define xen paravirtualized instructions for hand written assembly code. Signed-off-by: Yaozu (Eddie) Dong Signed-off-by: Isaku Yamahata Cc: Akio Takebe --- arch/ia64/include/asm/xen/inst.h | 447 ++++++++++++++++++++++++++++++++++++++ 1 files changed, 447 insertions(+), 0 deletions(-) create mode 100644 arch/ia64/include/asm/xen/inst.h diff --git a/arch/ia64/include/asm/xen/inst.h b/arch/ia64/include/asm/xen/inst.h new file mode 100644 index 0000000..03895e9 --- /dev/null +++ b/arch/ia64/include/asm/xen/inst.h @@ -0,0 +1,447 @@ +/****************************************************************************** + * arch/ia64/include/asm/xen/inst.h + * + * Copyright (c) 2008 Isaku Yamahata + * VA Linux Systems Japan K.K. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + */ + +#include + +#define MOV_FROM_IFA(reg) \ + movl reg = XSI_IFA; \ + ;; \ + ld8 reg = [reg] + +#define MOV_FROM_ITIR(reg) \ + movl reg = XSI_ITIR; \ + ;; \ + ld8 reg = [reg] + +#define MOV_FROM_ISR(reg) \ + movl reg = XSI_ISR; \ + ;; \ + ld8 reg = [reg] + +#define MOV_FROM_IHA(reg) \ + movl reg = XSI_IHA; \ + ;; \ + ld8 reg = [reg] + +#define MOV_FROM_IPSR(pred, reg) \ +(pred) movl reg = XSI_IPSR; \ + ;; \ +(pred) ld8 reg = [reg] + +#define MOV_FROM_IIM(reg) \ + movl reg = XSI_IIM; \ + ;; \ + ld8 reg = [reg] + +#define MOV_FROM_IIP(reg) \ + movl reg = XSI_IIP; \ + ;; \ + ld8 reg = [reg] + +.macro __MOV_FROM_IVR reg, clob + .ifc "\reg", "r8" + XEN_HYPER_GET_IVR + .exitm + .endif + .ifc "\clob", "r8" + XEN_HYPER_GET_IVR + ;; + mov \reg = r8 + .exitm + .endif + + mov \clob = r8 + ;; + XEN_HYPER_GET_IVR + ;; + mov \reg = r8 + ;; + mov r8 = \clob +.endm +#define MOV_FROM_IVR(reg, clob) __MOV_FROM_IVR reg, clob + +.macro __MOV_FROM_PSR pred, reg, clob + .ifc "\reg", "r8" + (\pred) XEN_HYPER_GET_PSR; + .exitm + .endif + .ifc "\clob", "r8" + (\pred) XEN_HYPER_GET_PSR + ;; + (\pred) mov \reg = r8 + .exitm + .endif + + (\pred) mov \clob = r8 + (\pred) XEN_HYPER_GET_PSR + ;; + (\pred) mov \reg = r8 + (\pred) mov r8 = \clob +.endm +#define MOV_FROM_PSR(pred, reg, clob) __MOV_FROM_PSR pred, reg, clob + + +#define MOV_TO_IFA(reg, clob) \ + movl clob = XSI_IFA; \ + ;; \ + st8 [clob] = reg \ + +#define MOV_TO_ITIR(pred, reg, clob) \ +(pred) movl clob = XSI_ITIR; \ + ;; \ +(pred) st8 [clob] = reg + +#define MOV_TO_IHA(pred, reg, clob) \ +(pred) movl clob = XSI_IHA; \ + ;; \ +(pred) st8 [clob] = reg + +#define MOV_TO_IPSR(pred, reg, clob) \ +(pred) movl clob = XSI_IPSR; \ + ;; \ +(pred) st8 [clob] = reg; \ + ;; + +#define MOV_TO_IFS(pred, reg, clob) \ +(pred) movl clob = XSI_IFS; \ + ;; \ +(pred) st8 [clob] = reg; \ + ;; + +#define MOV_TO_IIP(reg, clob) \ + movl clob = XSI_IIP; \ + ;; \ + st8 [clob] = reg + +.macro ____MOV_TO_KR kr, reg, clob0, clob1 + .ifc "\clob0", "r9" + .error "clob0 \clob0 must not be r9" + .endif + .ifc "\clob1", "r8" + .error "clob1 \clob1 must not be r8" + .endif + + .ifnc "\reg", "r9" + .ifnc "\clob1", "r9" + mov \clob1 = r9 + .endif + mov r9 = \reg + .endif + .ifnc "\clob0", "r8" + mov \clob0 = r8 + .endif + mov r8 = \kr + ;; + XEN_HYPER_SET_KR + + .ifnc "\reg", "r9" + .ifnc "\clob1", "r9" + mov r9 = \clob1 + .endif + .endif + .ifnc "\clob0", "r8" + mov r8 = \clob0 + .endif +.endm + +.macro __MOV_TO_KR kr, reg, clob0, clob1 + .ifc "\clob0", "r9" + ____MOV_TO_KR \kr, \reg, \clob1, \clob0 + .exitm + .endif + .ifc "\clob1", "r8" + ____MOV_TO_KR \kr, \reg, \clob1, \clob0 + .exitm + .endif + + ____MOV_TO_KR \kr, \reg, \clob0, \clob1 +.endm + +#define MOV_TO_KR(kr, reg, clob0, clob1) \ + __MOV_TO_KR IA64_KR_ ## kr, reg, clob0, clob1 + + +.macro __ITC_I pred, reg, clob + .ifc "\reg", "r8" + (\pred) XEN_HYPER_ITC_I + .exitm + .endif + .ifc "\clob", "r8" + (\pred) mov r8 = \reg + ;; + (\pred) XEN_HYPER_ITC_I + .exitm + .endif + + (\pred) mov \clob = r8 + (\pred) mov r8 = \reg + ;; + (\pred) XEN_HYPER_ITC_I + ;; + (\pred) mov r8 = \clob + ;; +.endm +#define ITC_I(pred, reg, clob) __ITC_I pred, reg, clob + +.macro __ITC_D pred, reg, clob + .ifc "\reg", "r8" + (\pred) XEN_HYPER_ITC_D + ;; + .exitm + .endif + .ifc "\clob", "r8" + (\pred) mov r8 = \reg + ;; + (\pred) 
XEN_HYPER_ITC_D + ;; + .exitm + .endif + + (\pred) mov \clob = r8 + (\pred) mov r8 = \reg + ;; + (\pred) XEN_HYPER_ITC_D + ;; + (\pred) mov r8 = \clob + ;; +.endm +#define ITC_D(pred, reg, clob) __ITC_D pred, reg, clob + +.macro __ITC_I_AND_D pred_i, pred_d, reg, clob + .ifc "\reg", "r8" + (\pred_i)XEN_HYPER_ITC_I + ;; + (\pred_d)XEN_HYPER_ITC_D + ;; + .exitm + .endif + .ifc "\clob", "r8" + mov r8 = \reg + ;; + (\pred_i)XEN_HYPER_ITC_I + ;; + (\pred_d)XEN_HYPER_ITC_D + ;; + .exitm + .endif + + mov \clob = r8 + mov r8 = \reg + ;; + (\pred_i)XEN_HYPER_ITC_I + ;; + (\pred_d)XEN_HYPER_ITC_D + ;; + mov r8 = \clob + ;; +.endm +#define ITC_I_AND_D(pred_i, pred_d, reg, clob) \ + __ITC_I_AND_D pred_i, pred_d, reg, clob + +.macro __THASH pred, reg0, reg1, clob + .ifc "\reg0", "r8" + (\pred) mov r8 = \reg1 + (\pred) XEN_HYPER_THASH + .exitm + .endc + .ifc "\reg1", "r8" + (\pred) XEN_HYPER_THASH + ;; + (\pred) mov \reg0 = r8 + ;; + .exitm + .endif + .ifc "\clob", "r8" + (\pred) mov r8 = \reg1 + (\pred) XEN_HYPER_THASH + ;; + (\pred) mov \reg0 = r8 + ;; + .exitm + .endif + + (\pred) mov \clob = r8 + (\pred) mov r8 = \reg1 + (\pred) XEN_HYPER_THASH + ;; + (\pred) mov \reg0 = r8 + (\pred) mov r8 = \clob + ;; +.endm +#define THASH(pred, reg0, reg1, clob) __THASH pred, reg0, reg1, clob + +#define SSM_PSR_IC_AND_DEFAULT_BITS_AND_SRLZ_I(clob0, clob1) \ + mov clob0 = 1; \ + movl clob1 = XSI_PSR_IC; \ + ;; \ + st4 [clob1] = clob0 \ + ;; + +#define SSM_PSR_IC_AND_SRLZ_D(clob0, clob1) \ + ;; \ + srlz.d; \ + mov clob1 = 1; \ + movl clob0 = XSI_PSR_IC; \ + ;; \ + st4 [clob0] = clob1 + +#define RSM_PSR_IC(clob) \ + movl clob = XSI_PSR_IC; \ + ;; \ + st4 [clob] = r0; \ + ;; + +/* pred will be clobbered */ +#define MASK_TO_PEND_OFS (-1) +#define SSM_PSR_I(pred, pred_clob, clob) \ +(pred) movl clob = XSI_PSR_I_ADDR \ + ;; \ +(pred) ld8 clob = [clob] \ + ;; \ + /* if (pred) vpsr.i = 1 */ \ + /* if (pred) (vcpu->vcpu_info->evtchn_upcall_mask)=0 */ \ +(pred) st1 [clob] = r0, MASK_TO_PEND_OFS \ + ;; \ + /* if (vcpu->vcpu_info->evtchn_upcall_pending) */ \ +(pred) ld1 clob = [clob] \ + ;; \ +(pred) cmp.ne.unc pred_clob, p0 = clob, r0 \ + ;; \ +(pred_clob)XEN_HYPER_SSM_I /* do areal ssm psr.i */ + +#define RSM_PSR_I(pred, clob0, clob1) \ + movl clob0 = XSI_PSR_I_ADDR; \ + mov clob1 = 1; \ + ;; \ + ld8 clob0 = [clob0]; \ + ;; \ +(pred) st1 [clob0] = clob1 + +#define RSM_PSR_I_IC(clob0, clob1, clob2) \ + movl clob0 = XSI_PSR_I_ADDR; \ + movl clob1 = XSI_PSR_IC; \ + ;; \ + ld8 clob0 = [clob0]; \ + mov clob2 = 1; \ + ;; \ + /* note: clears both vpsr.i and vpsr.ic! 
*/ \ + st1 [clob0] = clob2; \ + st4 [clob1] = r0; \ + ;; + +#define RSM_PSR_DT \ + XEN_HYPER_RSM_PSR_DT + +#define SSM_PSR_DT_AND_SRLZ_I \ + XEN_HYPER_SSM_PSR_DT + +#define BSW_0(clob0, clob1, clob2) \ + ;; \ + /* r16-r31 all now hold bank1 values */ \ + mov clob2 = ar.unat; \ + movl clob0 = XSI_BANK1_R16; \ + movl clob1 = XSI_BANK1_R16 + 8; \ + ;; \ +.mem.offset 0, 0; st8.spill [clob0] = r16, 16; \ +.mem.offset 8, 0; st8.spill [clob1] = r17, 16; \ + ;; \ +.mem.offset 0, 0; st8.spill [clob0] = r18, 16; \ +.mem.offset 8, 0; st8.spill [clob1] = r19, 16; \ + ;; \ +.mem.offset 0, 0; st8.spill [clob0] = r20, 16; \ +.mem.offset 8, 0; st8.spill [clob1] = r21, 16; \ + ;; \ +.mem.offset 0, 0; st8.spill [clob0] = r22, 16; \ +.mem.offset 8, 0; st8.spill [clob1] = r23, 16; \ + ;; \ +.mem.offset 0, 0; st8.spill [clob0] = r24, 16; \ +.mem.offset 8, 0; st8.spill [clob1] = r25, 16; \ + ;; \ +.mem.offset 0, 0; st8.spill [clob0] = r26, 16; \ +.mem.offset 8, 0; st8.spill [clob1] = r27, 16; \ + ;; \ +.mem.offset 0, 0; st8.spill [clob0] = r28, 16; \ +.mem.offset 8, 0; st8.spill [clob1] = r29, 16; \ + ;; \ +.mem.offset 0, 0; st8.spill [clob0] = r30, 16; \ +.mem.offset 8, 0; st8.spill [clob1] = r31, 16; \ + ;; \ + mov clob1 = ar.unat; \ + movl clob0 = XSI_B1NAT; \ + ;; \ + st8 [clob0] = clob1; \ + mov ar.unat = clob2; \ + movl clob0 = XSI_BANKNUM; \ + ;; \ + st4 [clob0] = r0 + + + /* FIXME: THIS CODE IS NOT NaT SAFE! */ +#define XEN_BSW_1(clob) \ + mov clob = ar.unat; \ + movl r30 = XSI_B1NAT; \ + ;; \ + ld8 r30 = [r30]; \ + mov r31 = 1; \ + ;; \ + mov ar.unat = r30; \ + movl r30 = XSI_BANKNUM; \ + ;; \ + st4 [r30] = r31; \ + movl r30 = XSI_BANK1_R16; \ + movl r31 = XSI_BANK1_R16+8; \ + ;; \ + ld8.fill r16 = [r30], 16; \ + ld8.fill r17 = [r31], 16; \ + ;; \ + ld8.fill r18 = [r30], 16; \ + ld8.fill r19 = [r31], 16; \ + ;; \ + ld8.fill r20 = [r30], 16; \ + ld8.fill r21 = [r31], 16; \ + ;; \ + ld8.fill r22 = [r30], 16; \ + ld8.fill r23 = [r31], 16; \ + ;; \ + ld8.fill r24 = [r30], 16; \ + ld8.fill r25 = [r31], 16; \ + ;; \ + ld8.fill r26 = [r30], 16; \ + ld8.fill r27 = [r31], 16; \ + ;; \ + ld8.fill r28 = [r30], 16; \ + ld8.fill r29 = [r31], 16; \ + ;; \ + ld8.fill r30 = [r30]; \ + ld8.fill r31 = [r31]; \ + ;; \ + mov ar.unat = clob + +#define BSW_1(clob0, clob1) XEN_BSW_1(clob1) + + +#define COVER \ + XEN_HYPER_COVER + +#define RFI \ + XEN_HYPER_RFI; \ + dv_serialize_data -- 1.6.0.2 From yamahata at valinux.co.jp Mon Oct 13 22:51:33 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Tue, 14 Oct 2008 14:51:33 +0900 Subject: [PATCH 18/32] ia64/pv_ops/xen: define xen pv_init_ops for various xen initialization. In-Reply-To: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> References: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1223963507-28056-19-git-send-email-yamahata@valinux.co.jp> This patch implements xen version of pv_init_ops to do various xen initialization. This patch also includes ia64 counter part of x86 xen early printk support patches. 
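For readers new to the ia64 pv_ops plumbing, a minimal sketch of how these
hooks get reached (the wrapper name below is made up for illustration and is
not part of this patch; the real dispatch sites live in the common
paravirtualization code):

/* Illustration only: generic setup code calls through pv_init_ops, so on a
 * Xen PV guest the hooks installed by this patch run instead of the native
 * defaults. */
static void __init paravirt_early_setup_example(void)
{
	if (pv_init_ops.arch_setup_early)
		pv_init_ops.arch_setup_early();	/* map start_info, init xencomm, ... */
	if (pv_init_ops.banner)
		pv_init_ops.banner();		/* "Running on Xen! ..." */
}
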
Signed-off-by: Akio Takebe Signed-off-by: Alex Williamson Signed-off-by: Yaozu (Eddie) Dong Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/xen/hypervisor.h | 14 ++++ arch/ia64/xen/Makefile | 2 +- arch/ia64/xen/hypervisor.c | 96 ++++++++++++++++++++++++++++ arch/ia64/xen/xen_pv_ops.c | 110 ++++++++++++++++++++++++++++++++ 4 files changed, 221 insertions(+), 1 deletions(-) create mode 100644 arch/ia64/xen/hypervisor.c diff --git a/arch/ia64/include/asm/xen/hypervisor.h b/arch/ia64/include/asm/xen/hypervisor.h index d1f84e1..7a804e8 100644 --- a/arch/ia64/include/asm/xen/hypervisor.h +++ b/arch/ia64/include/asm/xen/hypervisor.h @@ -59,8 +59,22 @@ extern enum xen_domain_type xen_domain_type; /* deprecated. remove this */ #define is_running_on_xen() (xen_domain_type == XEN_PV_DOMAIN) +extern struct shared_info *HYPERVISOR_shared_info; extern struct start_info *xen_start_info; +void __init xen_setup_vcpu_info_placement(void); +void force_evtchn_callback(void); + +/* for drivers/xen/balloon/balloon.c */ +#ifdef CONFIG_XEN_SCRUB_PAGES +#define scrub_pages(_p, _n) memset((void *)(_p), 0, (_n) << PAGE_SHIFT) +#else +#define scrub_pages(_p, _n) ((void)0) +#endif + +/* For setup_arch() in arch/ia64/kernel/setup.c */ +void xen_ia64_enable_opt_feature(void); + #else /* CONFIG_XEN */ #define xen_domain() (0) diff --git a/arch/ia64/xen/Makefile b/arch/ia64/xen/Makefile index abc356f..7cb4247 100644 --- a/arch/ia64/xen/Makefile +++ b/arch/ia64/xen/Makefile @@ -3,4 +3,4 @@ # obj-y := hypercall.o xensetup.o xen_pv_ops.o \ - xencomm.o xcom_hcall.o grant-table.o + hypervisor.o xencomm.o xcom_hcall.o grant-table.o diff --git a/arch/ia64/xen/hypervisor.c b/arch/ia64/xen/hypervisor.c new file mode 100644 index 0000000..cac4d97 --- /dev/null +++ b/arch/ia64/xen/hypervisor.c @@ -0,0 +1,96 @@ +/****************************************************************************** + * arch/ia64/xen/hypervisor.c + * + * Copyright (c) 2006 Isaku Yamahata + * VA Linux Systems Japan K.K. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + */ + +#include +#include +#include + +#include "irq_xen.h" + +struct shared_info *HYPERVISOR_shared_info __read_mostly = + (struct shared_info *)XSI_BASE; +EXPORT_SYMBOL(HYPERVISOR_shared_info); + +DEFINE_PER_CPU(struct vcpu_info *, xen_vcpu); + +struct start_info *xen_start_info; +EXPORT_SYMBOL(xen_start_info); + +EXPORT_SYMBOL(xen_domain_type); + +EXPORT_SYMBOL(__hypercall); + +/* Stolen from arch/x86/xen/enlighten.c */ +/* + * Flag to determine whether vcpu info placement is available on all + * VCPUs. We assume it is to start with, and then set it to zero on + * the first failure. 
This is because it can succeed on some VCPUs + * and not others, since it can involve hypervisor memory allocation, + * or because the guest failed to guarantee all the appropriate + * constraints on all VCPUs (ie buffer can't cross a page boundary). + * + * Note that any particular CPU may be using a placed vcpu structure, + * but we can only optimise if the all are. + * + * 0: not available, 1: available + */ + +static void __init xen_vcpu_setup(int cpu) +{ + /* + * WARNING: + * before changing MAX_VIRT_CPUS, + * check that shared_info fits on a page + */ + BUILD_BUG_ON(sizeof(struct shared_info) > PAGE_SIZE); + per_cpu(xen_vcpu, cpu) = &HYPERVISOR_shared_info->vcpu_info[cpu]; +} + +void __init xen_setup_vcpu_info_placement(void) +{ + int cpu; + + for_each_possible_cpu(cpu) + xen_vcpu_setup(cpu); +} + +void __cpuinit +xen_cpu_init(void) +{ + xen_smp_intr_init(); +} + +/************************************************************************** + * opt feature + */ +void +xen_ia64_enable_opt_feature(void) +{ + /* Enable region 7 identity map optimizations in Xen */ + struct xen_ia64_opt_feature optf; + + optf.cmd = XEN_IA64_OPTF_IDENT_MAP_REG7; + optf.on = XEN_IA64_OPTF_ON; + optf.pgprot = pgprot_val(PAGE_KERNEL); + optf.key = 0; /* No key on linux. */ + HYPERVISOR_opt_feature(&optf); +} diff --git a/arch/ia64/xen/xen_pv_ops.c b/arch/ia64/xen/xen_pv_ops.c index 77db214..fc9d599 100644 --- a/arch/ia64/xen/xen_pv_ops.c +++ b/arch/ia64/xen/xen_pv_ops.c @@ -54,6 +54,115 @@ xen_info_init(void) } /*************************************************************************** + * pv_init_ops + * initialization hooks. + */ + +static void +xen_panic_hypercall(struct unw_frame_info *info, void *arg) +{ + current->thread.ksp = (__u64)info->sw - 16; + HYPERVISOR_shutdown(SHUTDOWN_crash); + /* we're never actually going to get here... */ +} + +static int +xen_panic_event(struct notifier_block *this, unsigned long event, void *ptr) +{ + unw_init_running(xen_panic_hypercall, NULL); + /* we're never actually going to get here... */ + return NOTIFY_DONE; +} + +static struct notifier_block xen_panic_block = { + xen_panic_event, NULL, 0 /* try to go last */ +}; + +static void xen_pm_power_off(void) +{ + local_irq_disable(); + HYPERVISOR_shutdown(SHUTDOWN_poweroff); +} + +static void __init +xen_banner(void) +{ + printk(KERN_INFO + "Running on Xen! pl = %d start_info_pfn=0x%lx nr_pages=%ld " + "flags=0x%x\n", + xen_info.kernel_rpl, + HYPERVISOR_shared_info->arch.start_info_pfn, + xen_start_info->nr_pages, xen_start_info->flags); +} + +static int __init +xen_reserve_memory(struct rsvd_region *region) +{ + region->start = (unsigned long)__va( + (HYPERVISOR_shared_info->arch.start_info_pfn << PAGE_SHIFT)); + region->end = region->start + PAGE_SIZE; + return 1; +} + +static void __init +xen_arch_setup_early(void) +{ + struct shared_info *s; + BUG_ON(!xen_pv_domain()); + + s = HYPERVISOR_shared_info; + xen_start_info = __va(s->arch.start_info_pfn << PAGE_SHIFT); + + /* Must be done before any hypercall. */ + xencomm_initialize(); + + xen_setup_features(); + /* Register a call for panic conditions. 
*/ + atomic_notifier_chain_register(&panic_notifier_list, + &xen_panic_block); + pm_power_off = xen_pm_power_off; + + xen_ia64_enable_opt_feature(); +} + +static void __init +xen_arch_setup_console(char **cmdline_p) +{ + add_preferred_console("xenboot", 0, NULL); + add_preferred_console("tty", 0, NULL); + /* use hvc_xen */ + add_preferred_console("hvc", 0, NULL); + +#if !defined(CONFIG_VT) || !defined(CONFIG_DUMMY_CONSOLE) + conswitchp = NULL; +#endif +} + +static int __init +xen_arch_setup_nomca(void) +{ + return 1; +} + +static void __init +xen_post_smp_prepare_boot_cpu(void) +{ + xen_setup_vcpu_info_placement(); +} + +static const struct pv_init_ops xen_init_ops __initdata = { + .banner = xen_banner, + + .reserve_memory = xen_reserve_memory, + + .arch_setup_early = xen_arch_setup_early, + .arch_setup_console = xen_arch_setup_console, + .arch_setup_nomca = xen_arch_setup_nomca, + + .post_smp_prepare_boot_cpu = xen_post_smp_prepare_boot_cpu, +}; + +/*************************************************************************** * pv_ops initialization */ @@ -62,4 +171,5 @@ xen_setup_pv_ops(void) { xen_info_init(); pv_info = xen_info; + pv_init_ops = xen_init_ops; } -- 1.6.0.2 From yamahata at valinux.co.jp Mon Oct 13 22:51:21 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Tue, 14 Oct 2008 14:51:21 +0900 Subject: [PATCH 06/32] ia64/xen: increase IA64_MAX_RSVD_REGIONS. In-Reply-To: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> References: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1223963507-28056-7-git-send-email-yamahata@valinux.co.jp> Xenlinux/ia64 needs to reserve one more region passed from xen hypervisor as start info. Cc: Robin Holt Cc: Bjorn Helgaas Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/meminit.h | 3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/arch/ia64/include/asm/meminit.h b/arch/ia64/include/asm/meminit.h index 7245a57..6bc96ee 100644 --- a/arch/ia64/include/asm/meminit.h +++ b/arch/ia64/include/asm/meminit.h @@ -18,10 +18,11 @@ * - crash dumping code reserved region * - Kernel memory map built from EFI memory map * - ELF core header + * - xen start info if CONFIG_XEN * * More could be added if necessary */ -#define IA64_MAX_RSVD_REGIONS 8 +#define IA64_MAX_RSVD_REGIONS 9 struct rsvd_region { unsigned long start; /* virtual address of beginning of element */ -- 1.6.0.2 From borntraeger at de.ibm.com Mon Oct 13 23:42:31 2008 From: borntraeger at de.ibm.com (Christian Borntraeger) Date: Tue, 14 Oct 2008 08:42:31 +0200 Subject: [RFC 1/3] hvc_console: rework setup to replace irq functions with callbacks In-Reply-To: <1223944714.8157.300.camel@pasglop> References: <200806031444.21945.borntraeger@de.ibm.com> <200810130951.31733.borntraeger@de.ibm.com> <1223944714.8157.300.camel@pasglop> Message-ID: <200810140842.31806.borntraeger@de.ibm.com> Am Dienstag, 14. Oktober 2008 schrieb Benjamin Herrenschmidt: > > > Hmmm. > > Can you try if this patch fixes the lockdep trace? > > Yup, the patch fixes it, I'll commit it via the powerpc.git tree if you > don't have any objection. Sure, go ahead. 
From yamahata at valinux.co.jp Mon Oct 13 22:51:24 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Tue, 14 Oct 2008 14:51:24 +0900 Subject: [PATCH 09/32] ia64/xen: add a necessary header file to compile include/xen/interface/xen.h In-Reply-To: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> References: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1223963507-28056-10-git-send-email-yamahata@valinux.co.jp> Create include/asm-ia64/pvclock-abi.h to compile which includes include/asm-x86/pvclock-abi.h because ia64/xen uses same structure. Hopefully include/asm-x86/pvclock-abi.h would be moved to somewhere more generic. Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/pvclock-abi.h | 5 +++++ 1 files changed, 5 insertions(+), 0 deletions(-) create mode 100644 arch/ia64/include/asm/pvclock-abi.h diff --git a/arch/ia64/include/asm/pvclock-abi.h b/arch/ia64/include/asm/pvclock-abi.h new file mode 100644 index 0000000..63cbcca --- /dev/null +++ b/arch/ia64/include/asm/pvclock-abi.h @@ -0,0 +1,5 @@ +/* + * use same structure to x86's + * Hopefully asm-x86/pvclock-abi.h would be moved to somewhere more generic. + */ +#include -- 1.6.0.2 From yamahata at valinux.co.jp Mon Oct 13 22:51:36 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Tue, 14 Oct 2008 14:51:36 +0900 Subject: [PATCH 21/32] ia64/pv_ops/xen: paravirtualize DO_SAVE_MIN for xen. In-Reply-To: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> References: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1223963507-28056-22-git-send-email-yamahata@valinux.co.jp> paravirtualize DO_SAVE_MIN in minstate.h for xen. Signed-off-by: Yaozu (Eddie) Dong Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/xen/inst.h | 2 + arch/ia64/include/asm/xen/minstate.h | 134 ++++++++++++++++++++++++++++++++++ 2 files changed, 136 insertions(+), 0 deletions(-) create mode 100644 arch/ia64/include/asm/xen/minstate.h diff --git a/arch/ia64/include/asm/xen/inst.h b/arch/ia64/include/asm/xen/inst.h index 03895e9..1e92ed0 100644 --- a/arch/ia64/include/asm/xen/inst.h +++ b/arch/ia64/include/asm/xen/inst.h @@ -22,6 +22,8 @@ #include +#define DO_SAVE_MIN XEN_DO_SAVE_MIN + #define MOV_FROM_IFA(reg) \ movl reg = XSI_IFA; \ ;; \ diff --git a/arch/ia64/include/asm/xen/minstate.h b/arch/ia64/include/asm/xen/minstate.h new file mode 100644 index 0000000..4d92d9b --- /dev/null +++ b/arch/ia64/include/asm/xen/minstate.h @@ -0,0 +1,134 @@ +/* + * DO_SAVE_MIN switches to the kernel stacks (if necessary) and saves + * the minimum state necessary that allows us to turn psr.ic back + * on. + * + * Assumed state upon entry: + * psr.ic: off + * r31: contains saved predicates (pr) + * + * Upon exit, the state is as follows: + * psr.ic: off + * r2 = points to &pt_regs.r16 + * r8 = contents of ar.ccv + * r9 = contents of ar.csd + * r10 = contents of ar.ssd + * r11 = FPSR_DEFAULT + * r12 = kernel sp (kernel virtual address) + * r13 = points to current task_struct (kernel virtual address) + * p15 = TRUE if psr.i is set in cr.ipsr + * predicate registers (other than p2, p3, and p15), b6, r3, r14, r15: + * preserved + * CONFIG_XEN note: p6/p7 are not preserved + * + * Note that psr.ic is NOT turned on by this macro. This is so that + * we can pass interruption state as arguments to a handler. 
+ */ +#define XEN_DO_SAVE_MIN(__COVER,SAVE_IFS,EXTRA,WORKAROUND) \ + mov r16=IA64_KR(CURRENT); /* M */ \ + mov r27=ar.rsc; /* M */ \ + mov r20=r1; /* A */ \ + mov r25=ar.unat; /* M */ \ + MOV_FROM_IPSR(p0,r29); /* M */ \ + MOV_FROM_IIP(r28); /* M */ \ + mov r21=ar.fpsr; /* M */ \ + mov r26=ar.pfs; /* I */ \ + __COVER; /* B;; (or nothing) */ \ + adds r16=IA64_TASK_THREAD_ON_USTACK_OFFSET,r16; \ + ;; \ + ld1 r17=[r16]; /* load current->thread.on_ustack flag */ \ + st1 [r16]=r0; /* clear current->thread.on_ustack flag */ \ + adds r1=-IA64_TASK_THREAD_ON_USTACK_OFFSET,r16 \ + /* switch from user to kernel RBS: */ \ + ;; \ + invala; /* M */ \ + /* SAVE_IFS;*/ /* see xen special handling below */ \ + cmp.eq pKStk,pUStk=r0,r17; /* are we in kernel mode already? */ \ + ;; \ +(pUStk) mov ar.rsc=0; /* set enforced lazy mode, pl 0, little-endian, loadrs=0 */ \ + ;; \ +(pUStk) mov.m r24=ar.rnat; \ +(pUStk) addl r22=IA64_RBS_OFFSET,r1; /* compute base of RBS */ \ +(pKStk) mov r1=sp; /* get sp */ \ + ;; \ +(pUStk) lfetch.fault.excl.nt1 [r22]; \ +(pUStk) addl r1=IA64_STK_OFFSET-IA64_PT_REGS_SIZE,r1; /* compute base of memory stack */ \ +(pUStk) mov r23=ar.bspstore; /* save ar.bspstore */ \ + ;; \ +(pUStk) mov ar.bspstore=r22; /* switch to kernel RBS */ \ +(pKStk) addl r1=-IA64_PT_REGS_SIZE,r1; /* if in kernel mode, use sp (r12) */ \ + ;; \ +(pUStk) mov r18=ar.bsp; \ +(pUStk) mov ar.rsc=0x3; /* set eager mode, pl 0, little-endian, loadrs=0 */ \ + adds r17=2*L1_CACHE_BYTES,r1; /* really: biggest cache-line size */ \ + adds r16=PT(CR_IPSR),r1; \ + ;; \ + lfetch.fault.excl.nt1 [r17],L1_CACHE_BYTES; \ + st8 [r16]=r29; /* save cr.ipsr */ \ + ;; \ + lfetch.fault.excl.nt1 [r17]; \ + tbit.nz p15,p0=r29,IA64_PSR_I_BIT; \ + mov r29=b0 \ + ;; \ + WORKAROUND; \ + adds r16=PT(R8),r1; /* initialize first base pointer */ \ + adds r17=PT(R9),r1; /* initialize second base pointer */ \ +(pKStk) mov r18=r0; /* make sure r18 isn't NaT */ \ + ;; \ +.mem.offset 0,0; st8.spill [r16]=r8,16; \ +.mem.offset 8,0; st8.spill [r17]=r9,16; \ + ;; \ +.mem.offset 0,0; st8.spill [r16]=r10,24; \ + movl r8=XSI_PRECOVER_IFS; \ +.mem.offset 8,0; st8.spill [r17]=r11,24; \ + ;; \ + /* xen special handling for possibly lazy cover */ \ + /* SAVE_MIN case in dispatch_ia32_handler: mov r30=r0 */ \ + ld8 r30=[r8]; \ +(pUStk) sub r18=r18,r22; /* r18=RSE.ndirty*8 */ \ + st8 [r16]=r28,16; /* save cr.iip */ \ + ;; \ + st8 [r17]=r30,16; /* save cr.ifs */ \ + mov r8=ar.ccv; \ + mov r9=ar.csd; \ + mov r10=ar.ssd; \ + movl r11=FPSR_DEFAULT; /* L-unit */ \ + ;; \ + st8 [r16]=r25,16; /* save ar.unat */ \ + st8 [r17]=r26,16; /* save ar.pfs */ \ + shl r18=r18,16; /* compute ar.rsc to be used for "loadrs" */ \ + ;; \ + st8 [r16]=r27,16; /* save ar.rsc */ \ +(pUStk) st8 [r17]=r24,16; /* save ar.rnat */ \ +(pKStk) adds r17=16,r17; /* skip over ar_rnat field */ \ + ;; /* avoid RAW on r16 & r17 */ \ +(pUStk) st8 [r16]=r23,16; /* save ar.bspstore */ \ + st8 [r17]=r31,16; /* save predicates */ \ +(pKStk) adds r16=16,r16; /* skip over ar_bspstore field */ \ + ;; \ + st8 [r16]=r29,16; /* save b0 */ \ + st8 [r17]=r18,16; /* save ar.rsc value for "loadrs" */ \ + cmp.eq pNonSys,pSys=r0,r0 /* initialize pSys=0, pNonSys=1 */ \ + ;; \ +.mem.offset 0,0; st8.spill [r16]=r20,16; /* save original r1 */ \ +.mem.offset 8,0; st8.spill [r17]=r12,16; \ + adds r12=-16,r1; /* switch to kernel memory stack (with 16 bytes of scratch) */ \ + ;; \ +.mem.offset 0,0; st8.spill [r16]=r13,16; \ +.mem.offset 8,0; st8.spill [r17]=r21,16; /* save ar.fpsr */ \ + mov r13=IA64_KR(CURRENT); /* 
establish `current' */ \ + ;; \ +.mem.offset 0,0; st8.spill [r16]=r15,16; \ +.mem.offset 8,0; st8.spill [r17]=r14,16; \ + ;; \ +.mem.offset 0,0; st8.spill [r16]=r2,16; \ +.mem.offset 8,0; st8.spill [r17]=r3,16; \ + ACCOUNT_GET_STAMP \ + adds r2=IA64_PT_REGS_R16_OFFSET,r1; \ + ;; \ + EXTRA; \ + movl r1=__gp; /* establish kernel global pointer */ \ + ;; \ + ACCOUNT_SYS_ENTER \ + BSW_1(r3,r14); /* switch back to bank 1 (must be last in insn group) */ \ + ;; -- 1.6.0.2 From yamahata at valinux.co.jp Mon Oct 13 22:51:23 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Tue, 14 Oct 2008 14:51:23 +0900 Subject: [PATCH 08/32] ia64/xen: define several constants for ia64/xen. In-Reply-To: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> References: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1223963507-28056-9-git-send-email-yamahata@valinux.co.jp> define several constants for ia64/xen. Signed-off-by: Isaku Yamahata --- arch/ia64/kernel/asm-offsets.c | 27 +++++++++++++++++++++++++++ 1 files changed, 27 insertions(+), 0 deletions(-) diff --git a/arch/ia64/kernel/asm-offsets.c b/arch/ia64/kernel/asm-offsets.c index 94c44b1..eaa988b 100644 --- a/arch/ia64/kernel/asm-offsets.c +++ b/arch/ia64/kernel/asm-offsets.c @@ -16,6 +16,8 @@ #include #include +#include + #include "../kernel/sigframe.h" #include "../kernel/fsyscall_gtod_data.h" @@ -286,4 +288,29 @@ void foo(void) offsetof (struct itc_jitter_data_t, itc_jitter)); DEFINE(IA64_ITC_LASTCYCLE_OFFSET, offsetof (struct itc_jitter_data_t, itc_lastcycle)); + +#ifdef CONFIG_XEN + BLANK(); + +#define DEFINE_MAPPED_REG_OFS(sym, field) \ + DEFINE(sym, (XMAPPEDREGS_OFS + offsetof(struct mapped_regs, field))) + + DEFINE_MAPPED_REG_OFS(XSI_PSR_I_ADDR_OFS, interrupt_mask_addr); + DEFINE_MAPPED_REG_OFS(XSI_IPSR_OFS, ipsr); + DEFINE_MAPPED_REG_OFS(XSI_IIP_OFS, iip); + DEFINE_MAPPED_REG_OFS(XSI_IFS_OFS, ifs); + DEFINE_MAPPED_REG_OFS(XSI_PRECOVER_IFS_OFS, precover_ifs); + DEFINE_MAPPED_REG_OFS(XSI_ISR_OFS, isr); + DEFINE_MAPPED_REG_OFS(XSI_IFA_OFS, ifa); + DEFINE_MAPPED_REG_OFS(XSI_IIPA_OFS, iipa); + DEFINE_MAPPED_REG_OFS(XSI_IIM_OFS, iim); + DEFINE_MAPPED_REG_OFS(XSI_IHA_OFS, iha); + DEFINE_MAPPED_REG_OFS(XSI_ITIR_OFS, itir); + DEFINE_MAPPED_REG_OFS(XSI_PSR_IC_OFS, interrupt_collection_enabled); + DEFINE_MAPPED_REG_OFS(XSI_BANKNUM_OFS, banknum); + DEFINE_MAPPED_REG_OFS(XSI_BANK0_R16_OFS, bank0_regs[0]); + DEFINE_MAPPED_REG_OFS(XSI_BANK1_R16_OFS, bank1_regs[0]); + DEFINE_MAPPED_REG_OFS(XSI_B0NATS_OFS, vbnat); + DEFINE_MAPPED_REG_OFS(XSI_B1NATS_OFS, vnat); +#endif /* CONFIG_XEN */ } -- 1.6.0.2 From yamahata at valinux.co.jp Mon Oct 13 22:51:31 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Tue, 14 Oct 2008 14:51:31 +0900 Subject: [PATCH 16/32] ia64/xen: introduce helper function to identify domain mode. 
In-Reply-To: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> References: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1223963507-28056-17-git-send-email-yamahata@valinux.co.jp> There are four operating modes Xen code may find itself running in: - native - hvm domain - pv dom0 - pv domU Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/xen/hypervisor.h | 75 ++++++++++++++++++++++++++++++++ 1 files changed, 75 insertions(+), 0 deletions(-) create mode 100644 arch/ia64/include/asm/xen/hypervisor.h diff --git a/arch/ia64/include/asm/xen/hypervisor.h b/arch/ia64/include/asm/xen/hypervisor.h new file mode 100644 index 0000000..d1f84e1 --- /dev/null +++ b/arch/ia64/include/asm/xen/hypervisor.h @@ -0,0 +1,75 @@ +/****************************************************************************** + * hypervisor.h + * + * Linux-specific hypervisor handling. + * + * Copyright (c) 2002-2004, K A Fraser + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation; or, when distributed + * separately from the Linux kernel or incorporated into other + * software packages, subject to the following license: + * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this source file (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, modify, + * merge, publish, distribute, sublicense, and/or sell copies of the Software, + * and to permit persons to whom the Software is furnished to do so, subject to + * the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS + * IN THE SOFTWARE. + */ + +#ifndef _ASM_IA64_XEN_HYPERVISOR_H +#define _ASM_IA64_XEN_HYPERVISOR_H + +#ifdef CONFIG_XEN + +#include +#include +#include /* to compile feature.c */ +#include /* to comiple xen-netfront.c */ +#include + +/* xen_domain_type is set before executing any C code by early_xen_setup */ +enum xen_domain_type { + XEN_NATIVE, + XEN_PV_DOMAIN, + XEN_HVM_DOMAIN, +}; + +extern enum xen_domain_type xen_domain_type; + +#define xen_domain() (xen_domain_type != XEN_NATIVE) +#define xen_pv_domain() (xen_domain_type == XEN_PV_DOMAIN) +#define xen_initial_domain() (xen_pv_domain() && \ + (xen_start_info->flags & SIF_INITDOMAIN)) +#define xen_hvm_domain() (xen_domain_type == XEN_HVM_DOMAIN) + +/* deprecated. remove this */ +#define is_running_on_xen() (xen_domain_type == XEN_PV_DOMAIN) + +extern struct start_info *xen_start_info; + +#else /* CONFIG_XEN */ + +#define xen_domain() (0) +#define xen_pv_domain() (0) +#define xen_initial_domain() (0) +#define xen_hvm_domain() (0) +#define is_running_on_xen() (0) /* deprecated. remove this */ +#endif + +#define is_initial_xendomain() (0) /* deprecated. 
remove this */ + +#endif /* _ASM_IA64_XEN_HYPERVISOR_H */ -- 1.6.0.2 From yamahata at valinux.co.jp Mon Oct 13 22:51:25 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Tue, 14 Oct 2008 14:51:25 +0900 Subject: [PATCH 10/32] ia64/xen: define helper functions for xen related address conversion. In-Reply-To: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> References: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1223963507-28056-11-git-send-email-yamahata@valinux.co.jp> Xen needs some address conversions between pseudo physical address (guest phsyical address), guest machine address (real machine address) and dma address. Define helper functions for those address conversion. Cc: Jeremy Fitzhardinge Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/xen/page.h | 65 ++++++++++++++++++++++++++++++++++++++ 1 files changed, 65 insertions(+), 0 deletions(-) create mode 100644 arch/ia64/include/asm/xen/page.h diff --git a/arch/ia64/include/asm/xen/page.h b/arch/ia64/include/asm/xen/page.h new file mode 100644 index 0000000..03441a7 --- /dev/null +++ b/arch/ia64/include/asm/xen/page.h @@ -0,0 +1,65 @@ +/****************************************************************************** + * arch/ia64/include/asm/xen/page.h + * + * Copyright (c) 2008 Isaku Yamahata + * VA Linux Systems Japan K.K. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + */ + +#ifndef _ASM_IA64_XEN_PAGE_H +#define _ASM_IA64_XEN_PAGE_H + +#define INVALID_P2M_ENTRY (~0UL) + +static inline unsigned long mfn_to_pfn(unsigned long mfn) +{ + return mfn; +} + +static inline unsigned long pfn_to_mfn(unsigned long pfn) +{ + return pfn; +} + +#define phys_to_machine_mapping_valid(_x) (1) + +static inline void *mfn_to_virt(unsigned long mfn) +{ + return __va(mfn << PAGE_SHIFT); +} + +static inline unsigned long virt_to_mfn(void *virt) +{ + return __pa(virt) >> PAGE_SHIFT; +} + +/* for tpmfront.c */ +static inline unsigned long virt_to_machine(void *virt) +{ + return __pa(virt); +} + +static inline void set_phys_to_machine(unsigned long pfn, unsigned long mfn) +{ + /* nothing */ +} + +#define pte_mfn(_x) pte_pfn(_x) +#define mfn_pte(_x, _y) __pte_ma(0) /* unmodified use */ +#define __pte_ma(_x) ((pte_t) {(_x)}) /* unmodified use */ + +#endif /* _ASM_IA64_XEN_PAGE_H */ -- 1.6.0.2 From yamahata at valinux.co.jp Mon Oct 13 22:51:38 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Tue, 14 Oct 2008 14:51:38 +0900 Subject: [PATCH 23/32] ia64/pv_ops/xen: paravirtualize entry.S for ia64/xen. In-Reply-To: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> References: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1223963507-28056-24-git-send-email-yamahata@valinux.co.jp> paravirtualize entry.S for ia64/xen by multi compile. 
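For reference, the counterpart of the table added below: the Xen-compiled
entry.S exports its own copies of the hand-written entry points, and a
per-flavor pv_cpu_asm_switch installed at boot decides which set the kernel
branches to.  The native-side table here is an assumed illustration (the
ia64_native_* symbol names are not taken from this patch); only the Xen half
is actually added by the diff:

/* Assumed illustration of the native counterpart of xen_cpu_asm_switch. */
extern char ia64_native_switch_to;
extern char ia64_native_leave_syscall;
extern char ia64_native_work_processed_syscall;
extern char ia64_native_leave_kernel;

static const struct pv_cpu_asm_switch native_cpu_asm_switch_example = {
	.switch_to		= (unsigned long)&ia64_native_switch_to,
	.leave_syscall		= (unsigned long)&ia64_native_leave_syscall,
	.work_processed_syscall	= (unsigned long)&ia64_native_work_processed_syscall,
	.leave_kernel		= (unsigned long)&ia64_native_leave_kernel,
};
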
Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/xen/inst.h | 8 ++++++++ arch/ia64/xen/Makefile | 2 +- arch/ia64/xen/xen_pv_ops.c | 18 ++++++++++++++++++ 3 files changed, 27 insertions(+), 1 deletions(-) diff --git a/arch/ia64/include/asm/xen/inst.h b/arch/ia64/include/asm/xen/inst.h index e6a25c3..19c2ae1 100644 --- a/arch/ia64/include/asm/xen/inst.h +++ b/arch/ia64/include/asm/xen/inst.h @@ -25,6 +25,14 @@ #define ia64_ivt xen_ivt #define DO_SAVE_MIN XEN_DO_SAVE_MIN +#define __paravirt_switch_to xen_switch_to +#define __paravirt_leave_syscall xen_leave_syscall +#define __paravirt_work_processed_syscall xen_work_processed_syscall +#define __paravirt_leave_kernel xen_leave_kernel +#define __paravirt_pending_syscall_end xen_work_pending_syscall_end +#define __paravirt_work_processed_syscall_target \ + xen_work_processed_syscall + #define MOV_FROM_IFA(reg) \ movl reg = XSI_IFA; \ ;; \ diff --git a/arch/ia64/xen/Makefile b/arch/ia64/xen/Makefile index 5c87e4a..9b77e8a 100644 --- a/arch/ia64/xen/Makefile +++ b/arch/ia64/xen/Makefile @@ -8,7 +8,7 @@ obj-y := hypercall.o xenivt.o xensetup.o xen_pv_ops.o \ AFLAGS_xenivt.o += -D__IA64_ASM_PARAVIRTUALIZED_XEN # xen multi compile -ASM_PARAVIRT_MULTI_COMPILE_SRCS = ivt.S +ASM_PARAVIRT_MULTI_COMPILE_SRCS = ivt.S entry.S ASM_PARAVIRT_OBJS = $(addprefix xen-,$(ASM_PARAVIRT_MULTI_COMPILE_SRCS:.S=.o)) obj-y += $(ASM_PARAVIRT_OBJS) define paravirtualized_xen diff --git a/arch/ia64/xen/xen_pv_ops.c b/arch/ia64/xen/xen_pv_ops.c index c236f04..5b23cd5 100644 --- a/arch/ia64/xen/xen_pv_ops.c +++ b/arch/ia64/xen/xen_pv_ops.c @@ -275,6 +275,22 @@ static const struct pv_cpu_ops xen_cpu_ops __initdata = { = xen_intrin_local_irq_restore, }; +/****************************************************************************** + * replacement of hand written assembly codes. + */ + +extern char xen_switch_to; +extern char xen_leave_syscall; +extern char xen_work_processed_syscall; +extern char xen_leave_kernel; + +const struct pv_cpu_asm_switch xen_cpu_asm_switch = { + .switch_to = (unsigned long)&xen_switch_to, + .leave_syscall = (unsigned long)&xen_leave_syscall, + .work_processed_syscall = (unsigned long)&xen_work_processed_syscall, + .leave_kernel = (unsigned long)&xen_leave_kernel, +}; + /*************************************************************************** * pv_ops initialization */ @@ -286,4 +302,6 @@ xen_setup_pv_ops(void) pv_info = xen_info; pv_init_ops = xen_init_ops; pv_cpu_ops = xen_cpu_ops; + + paravirt_cpu_asm_init(&xen_cpu_asm_switch); } -- 1.6.0.2 From yamahata at valinux.co.jp Mon Oct 13 22:51:37 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Tue, 14 Oct 2008 14:51:37 +0900 Subject: [PATCH 22/32] ia64/pv_ops/xen: paravirtualize ivt.S for xen. In-Reply-To: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> References: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1223963507-28056-23-git-send-email-yamahata@valinux.co.jp> paravirtualize ivt.S for xen by multi compile. 
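The "multi compile" idiom, condensed: the unmodified ivt.S source is
assembled a second time as xen-ivt.o with -D__IA64_ASM_PARAVIRTUALIZED_XEN,
and a selector header decides which instruction-macro set gets pulled in.
Simplified sketch (the native header name is assumed for illustration; the
Xen header is the asm/xen/inst.h added earlier in this series):

/* Simplified view of the selector, see arch/ia64/kernel/paravirt_inst.h. */
#if defined(__IA64_ASM_PARAVIRTUALIZED_XEN)
# include <asm/xen/inst.h>	/* MOV_FROM_IFA etc. expand to XSI_* accesses/hypercalls */
#else
# include <asm/native/inst.h>	/* assumed: same macros emit the raw privileged instructions */
#endif
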
Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/xen/inst.h | 1 + arch/ia64/xen/Makefile | 16 +++++++++++- arch/ia64/xen/xenivt.S | 52 ++++++++++++++++++++++++++++++++++++++ 3 files changed, 68 insertions(+), 1 deletions(-) create mode 100644 arch/ia64/xen/xenivt.S diff --git a/arch/ia64/include/asm/xen/inst.h b/arch/ia64/include/asm/xen/inst.h index 1e92ed0..e6a25c3 100644 --- a/arch/ia64/include/asm/xen/inst.h +++ b/arch/ia64/include/asm/xen/inst.h @@ -22,6 +22,7 @@ #include +#define ia64_ivt xen_ivt #define DO_SAVE_MIN XEN_DO_SAVE_MIN #define MOV_FROM_IFA(reg) \ diff --git a/arch/ia64/xen/Makefile b/arch/ia64/xen/Makefile index 7cb4247..5c87e4a 100644 --- a/arch/ia64/xen/Makefile +++ b/arch/ia64/xen/Makefile @@ -2,5 +2,19 @@ # Makefile for Xen components # -obj-y := hypercall.o xensetup.o xen_pv_ops.o \ +obj-y := hypercall.o xenivt.o xensetup.o xen_pv_ops.o \ hypervisor.o xencomm.o xcom_hcall.o grant-table.o + +AFLAGS_xenivt.o += -D__IA64_ASM_PARAVIRTUALIZED_XEN + +# xen multi compile +ASM_PARAVIRT_MULTI_COMPILE_SRCS = ivt.S +ASM_PARAVIRT_OBJS = $(addprefix xen-,$(ASM_PARAVIRT_MULTI_COMPILE_SRCS:.S=.o)) +obj-y += $(ASM_PARAVIRT_OBJS) +define paravirtualized_xen +AFLAGS_$(1) += -D__IA64_ASM_PARAVIRTUALIZED_XEN +endef +$(foreach o,$(ASM_PARAVIRT_OBJS),$(eval $(call paravirtualized_xen,$(o)))) + +$(obj)/xen-%.o: $(src)/../kernel/%.S FORCE + $(call if_changed_dep,as_o_S) diff --git a/arch/ia64/xen/xenivt.S b/arch/ia64/xen/xenivt.S new file mode 100644 index 0000000..3e71d50 --- /dev/null +++ b/arch/ia64/xen/xenivt.S @@ -0,0 +1,52 @@ +/* + * arch/ia64/xen/ivt.S + * + * Copyright (C) 2005 Hewlett-Packard Co + * Dan Magenheimer + * + * Copyright (c) 2008 Isaku Yamahata + * VA Linux Systems Japan K.K. + * pv_ops. + */ + +#include +#include +#include + +#include "../kernel/minstate.h" + + .section .text,"ax" +GLOBAL_ENTRY(xen_event_callback) + mov r31=pr // prepare to save predicates + ;; + SAVE_MIN_WITH_COVER // uses r31; defines r2 and r3 + ;; + movl r3=XSI_PSR_IC + mov r14=1 + ;; + st4 [r3]=r14 + ;; + adds r3=8,r2 // set up second base pointer for SAVE_REST + srlz.i // ensure everybody knows psr.ic is back on + ;; + SAVE_REST + ;; +1: + alloc r14=ar.pfs,0,0,1,0 // must be first in an insn group + add out0=16,sp // pass pointer to pt_regs as first arg + ;; + br.call.sptk.many b0=xen_evtchn_do_upcall + ;; + movl r20=XSI_PSR_I_ADDR + ;; + ld8 r20=[r20] + ;; + adds r20=-1,r20 // vcpu_info->evtchn_upcall_pending + ;; + ld1 r20=[r20] + ;; + cmp.ne p6,p0=r20,r0 // if there are pending events, + (p6) br.spnt.few 1b // call evtchn_do_upcall again. + br.sptk.many xen_leave_kernel // we know ia64_leave_kernel is + // paravirtualized as xen_leave_kernel +END(xen_event_callback) -- 1.6.0.2 From benh at kernel.crashing.org Tue Oct 14 01:18:51 2008 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Tue, 14 Oct 2008 19:18:51 +1100 Subject: [RFC 1/3] hvc_console: rework setup to replace irq functions with callbacks In-Reply-To: <200810140842.31806.borntraeger@de.ibm.com> References: <200806031444.21945.borntraeger@de.ibm.com> <200810130951.31733.borntraeger@de.ibm.com> <1223944714.8157.300.camel@pasglop> <200810140842.31806.borntraeger@de.ibm.com> Message-ID: <1223972331.8157.331.camel@pasglop> On Tue, 2008-10-14 at 08:42 +0200, Christian Borntraeger wrote: > Am Dienstag, 14. Oktober 2008 schrieb Benjamin Herrenschmidt: > > > > > Hmmm. > > > Can you try if this patch fixes the lockdep trace? 
> > > > Yup, the patch fixes it, I'll commit it via the powerpc.git tree if you > > don't have any objection. > > Sure, go ahead. Allright, I have a batch about ready to go to Linus, so I'll add that and ask him to pull tomorrow. Thanks, Ben. From yu.zhao at intel.com Tue Oct 14 03:34:24 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Tue, 14 Oct 2008 18:34:24 +0800 Subject: [PATCH 0/8 v4] PCI: Linux kernel SR-IOV support Message-ID: <20081014103424.GA1704@yzhao12-linux.sh.intel.com> Greetings, Following patches are intended to support SR-IOV capability in the Linux kernel. With these patches, people can turn a PCI device with the capability into multiple ones from software perspective, which will benefit KVM and achieve other purposes such as QoS, security, and etc. [PATCH 1/8 v4] PCI: define PCI resource names in a 'enum' [PATCH 2/8 v4] PCI: export __pci_read_base [PATCH 3/8 v4] PCI: export pci_alloc_child_bus [PATCH 4/8 v4] PCI: add a wrapper for resource_alignment [PATCH 5/8 v4] PCI: add a new function to map BAR offset [PATCH 6/8 v4] PCI: support the SR-IOV capability [PATCH 7/8 v4] PCI: reserve bus range for the SR-IOV device [PATCH 8/8 v4] PCI: document the changes --- b/Documentation/DocBook/kernel-api.tmpl | 1 b/Documentation/PCI/pci-iov-howto.txt | 223 ++++++++ b/drivers/pci/Kconfig | 12 b/drivers/pci/Makefile | 2 b/drivers/pci/iov.c | 853 ++++++++++++++++++++++++++++++++ b/drivers/pci/pci-sysfs.c | 4 b/drivers/pci/pci.c | 19 b/drivers/pci/pci.h | 9 b/drivers/pci/probe.c | 2 b/drivers/pci/proc.c | 7 b/drivers/pci/setup-bus.c | 4 b/drivers/pci/setup-res.c | 8 b/include/linux/pci.h | 38 - b/include/linux/pci_regs.h | 22 drivers/pci/iov.c | 24 drivers/pci/pci-sysfs.c | 4 drivers/pci/pci.c | 61 ++ drivers/pci/pci.h | 65 ++ drivers/pci/probe.c | 39 - drivers/pci/setup-res.c | 14 include/linux/pci.h | 57 ++ 21 files changed, 1397 insertions(+), 71 deletions(-) --- Single Root I/O Virtualization (SR-IOV) capability defined by PCI-SIG is intended to enable multiple system software to share PCI hardware resources. PCI device that supports this capability can be extended to one Physical Functions plus multiple Virtual Functions. Physical Function, which could be considered as the "real" PCI device, reflects the hardware instance and manages all physical resources. Virtual Functions are associated with a Physical Function and shares physical resources with the Physical Function.Software can control allocation of Virtual Functions via registers encapsulated in the capability structure. SR-IOV specification can be found at http://www.pcisig.com/members/downloads/specifications/iov/sr-iov1.0_11Sep07.pdf Devices that support SR-IOV are available from following vendors: http://download.intel.com/design/network/ProdBrf/320025.pdf http://www.netxen.com/products/chipsolutions/NX3031.html http://www.neterion.com/products/x3100.html From yu.zhao at intel.com Tue Oct 14 03:46:34 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Tue, 14 Oct 2008 18:46:34 +0800 Subject: [PATCH 1/8 v4] PCI: define PCI resource names in an 'enum' In-Reply-To: <20081014103424.GA1704@yzhao12-linux.sh.intel.com> References: <20081014103424.GA1704@yzhao12-linux.sh.intel.com> Message-ID: <20081014104634.GA1734@yzhao12-linux.sh.intel.com> This patch moves all definitions of PCI resource names to an 'enum', and also replaces some hard-coded resource variables with symbol names. This change eases the introduction of device specific resources. 
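A minimal sketch of what the symbolic names buy callers (illustrative helper,
not part of this patch):

#include <linux/pci.h>

/* Walk the standard BARs by name instead of the old hard-coded 0..6 bounds. */
static void show_std_bars(struct pci_dev *dev)
{
	int i;

	for (i = PCI_STD_RESOURCES; i <= PCI_STD_RESOURCES_END; i++) {
		struct resource *res = &dev->resource[i];

		if (!res->flags)
			continue;
		dev_info(&dev->dev, "BAR %d: %#llx-%#llx\n", i,
			 (unsigned long long)res->start,
			 (unsigned long long)res->end);
	}
}
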
Signed-off-by: Yu Zhao --- drivers/pci/pci-sysfs.c | 4 +++- drivers/pci/pci.c | 19 ++----------------- drivers/pci/probe.c | 2 +- drivers/pci/proc.c | 7 ++++--- include/linux/pci.h | 37 ++++++++++++++++++++++++------------- 5 files changed, 34 insertions(+), 35 deletions(-) diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c index 2cad6da..c41b783 100644 --- a/drivers/pci/pci-sysfs.c +++ b/drivers/pci/pci-sysfs.c @@ -101,11 +101,13 @@ resource_show(struct device * dev, struct device_attribute *attr, char * buf) struct pci_dev * pci_dev = to_pci_dev(dev); char * str = buf; int i; - int max = 7; + int max; resource_size_t start, end; if (pci_dev->subordinate) max = DEVICE_COUNT_RESOURCE; + else + max = PCI_BRIDGE_RESOURCES; for (i = 0; i < max; i++) { struct resource *res = &pci_dev->resource[i]; diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 5ecd2d7..a9c64b0 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -359,24 +359,9 @@ pci_find_parent_resource(const struct pci_dev *dev, struct resource *res) static void pci_restore_bars(struct pci_dev *dev) { - int i, numres; - - switch (dev->hdr_type) { - case PCI_HEADER_TYPE_NORMAL: - numres = 6; - break; - case PCI_HEADER_TYPE_BRIDGE: - numres = 2; - break; - case PCI_HEADER_TYPE_CARDBUS: - numres = 1; - break; - default: - /* Should never get here, but just in case... */ - return; - } + int i; - for (i = 0; i < numres; i++) + for (i = 0; i < PCI_BRIDGE_RESOURCES; i++) pci_update_resource(dev, i); } diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index dcd6bf1..03ddfee 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -492,7 +492,7 @@ static struct pci_bus *pci_alloc_child_bus(struct pci_bus *parent, child->subordinate = 0xff; /* Set up default resource pointers and names.. 
*/ - for (i = 0; i < 4; i++) { + for (i = 0; i < PCI_BRIDGE_RES_NUM; i++) { child->resource[i] = &bridge->resource[PCI_BRIDGE_RESOURCES+i]; child->resource[i]->name = child->name; } diff --git a/drivers/pci/proc.c b/drivers/pci/proc.c index e1098c3..f6f2a59 100644 --- a/drivers/pci/proc.c +++ b/drivers/pci/proc.c @@ -352,15 +352,16 @@ static int show_device(struct seq_file *m, void *v) dev->vendor, dev->device, dev->irq); - /* Here should be 7 and not PCI_NUM_RESOURCES as we need to preserve compatibility */ - for (i=0; i<7; i++) { + + /* only print standard and ROM resources to preserve compatibility */ + for (i = 0; i <= PCI_ROM_RESOURCE; i++) { resource_size_t start, end; pci_resource_to_user(dev, i, &dev->resource[i], &start, &end); seq_printf(m, "\t%16llx", (unsigned long long)(start | (dev->resource[i].flags & PCI_REGION_FLAG_MASK))); } - for (i=0; i<7; i++) { + for (i = 0; i <= PCI_ROM_RESOURCE; i++) { resource_size_t start, end; pci_resource_to_user(dev, i, &dev->resource[i], &start, &end); seq_printf(m, "\t%16llx", diff --git a/include/linux/pci.h b/include/linux/pci.h index f280783..497d639 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -76,7 +76,30 @@ enum pci_mmap_state { #define PCI_DMA_FROMDEVICE 2 #define PCI_DMA_NONE 3 -#define DEVICE_COUNT_RESOURCE 12 +/* + * For PCI devices, the region numbers are assigned this way: + */ +enum { + /* #0-5: standard PCI regions */ + PCI_STD_RESOURCES, + PCI_STD_RESOURCES_END = 5, + + /* #6: expansion ROM */ + PCI_ROM_RESOURCE, + + /* address space assigned to buses behind the bridge */ +#ifndef PCI_BRIDGE_RES_NUM +#define PCI_BRIDGE_RES_NUM 4 +#endif + PCI_BRIDGE_RESOURCES, + PCI_BRIDGE_RES_END = PCI_BRIDGE_RESOURCES + PCI_BRIDGE_RES_NUM - 1, + + /* total resources associated with a PCI device */ + PCI_NUM_RESOURCES, + + /* preserve this for compatibility */ + DEVICE_COUNT_RESOURCE +}; typedef int __bitwise pci_power_t; @@ -262,18 +285,6 @@ static inline void pci_add_saved_cap(struct pci_dev *pci_dev, hlist_add_head(&new_cap->next, &pci_dev->saved_cap_space); } -/* - * For PCI devices, the region numbers are assigned this way: - * - * 0-5 standard PCI regions - * 6 expansion ROM - * 7-10 bridges: address space assigned to buses behind the bridge - */ - -#define PCI_ROM_RESOURCE 6 -#define PCI_BRIDGE_RESOURCES 7 -#define PCI_NUM_RESOURCES 11 - #ifndef PCI_BUS_NUM_RESOURCES #define PCI_BUS_NUM_RESOURCES 16 #endif -- 1.5.6.4 From yu.zhao at intel.com Tue Oct 14 03:48:37 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Tue, 14 Oct 2008 18:48:37 +0800 Subject: [PATCH 2/8 v4] PCI: export __pci_read_base In-Reply-To: <20081014103424.GA1704@yzhao12-linux.sh.intel.com> References: <20081014103424.GA1704@yzhao12-linux.sh.intel.com> Message-ID: <20081014104837.GB1734@yzhao12-linux.sh.intel.com> Export __pci_read_base() so it can be used by whole PCI subsystem. 
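As a rough sketch (not part of this patch set), a caller inside the PCI core
could use the newly exported function to probe a BAR whose location is given
by a capability structure rather than by the standard header; probe_cap_bar()
and its parameters are hypothetical:

#include <linux/pci.h>
#include "pci.h"

static void probe_cap_bar(struct pci_dev *dev, struct resource *res,
			  unsigned int pos)
{
	int is_64;

	/* __pci_read_base() returns 1 for a 64-bit BAR, 0 for 32-bit */
	is_64 = __pci_read_base(dev, pci_bar_unknown, res, pos);
	if (res->flags)
		dev_info(&dev->dev, "capability BAR%s at %#llx\n",
			 is_64 ? " (64-bit)" : "",
			 (unsigned long long)res->start);
}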
Signed-off-by: Yu Zhao --- drivers/pci/pci.h | 9 +++++++++ drivers/pci/probe.c | 20 +++++++++----------- 2 files changed, 18 insertions(+), 11 deletions(-) diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 69b6365..922b742 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -150,6 +150,15 @@ struct pci_slot_attribute { }; #define to_pci_slot_attr(s) container_of(s, struct pci_slot_attribute, attr) +enum pci_bar_type { + pci_bar_unknown, /* Standard PCI BAR probe */ + pci_bar_io, /* An io port BAR */ + pci_bar_mem32, /* A 32-bit memory BAR */ + pci_bar_mem64, /* A 64-bit memory BAR */ +}; + +extern int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type, + struct resource *res, unsigned int reg); extern void pci_enable_ari(struct pci_dev *dev); /** * pci_ari_enabled - query ARI forwarding status diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index 03ddfee..2326609 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -201,13 +201,6 @@ static u64 pci_size(u64 base, u64 maxbase, u64 mask) return size; } -enum pci_bar_type { - pci_bar_unknown, /* Standard PCI BAR probe */ - pci_bar_io, /* An io port BAR */ - pci_bar_mem32, /* A 32-bit memory BAR */ - pci_bar_mem64, /* A 64-bit memory BAR */ -}; - static inline enum pci_bar_type decode_bar(struct resource *res, u32 bar) { if ((bar & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO) { @@ -222,11 +215,16 @@ static inline enum pci_bar_type decode_bar(struct resource *res, u32 bar) return pci_bar_mem32; } -/* - * If the type is not unknown, we assume that the lowest bit is 'enable'. - * Returns 1 if the BAR was 64-bit and 0 if it was 32-bit. +/** + * pci_read_base - read a PCI BAR + * @dev: the PCI device + * @type: type of the BAR + * @res: resource buffer to be filled in + * @pos: BAR position in the config space + * + * Returns 1 if the BAR is 64-bit, or 0 if 32-bit. */ -static int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type, +int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type, struct resource *res, unsigned int pos) { u32 l, sz, mask; -- 1.5.6.4 From yu.zhao at intel.com Tue Oct 14 03:53:23 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Tue, 14 Oct 2008 18:53:23 +0800 Subject: [PATCH 3/8 v4] PCI: export pci_alloc_child_bus In-Reply-To: <20081014103424.GA1704@yzhao12-linux.sh.intel.com> References: <20081014103424.GA1704@yzhao12-linux.sh.intel.com> Message-ID: <20081014105323.GC1734@yzhao12-linux.sh.intel.com> Export pci_alloc_child_bus(), and make it be able to handle buses without bridge devices. Some devices such as SR-IOV devices use more than one bus number while there is no explicit bridge devices since they have internal routing mechanism. 
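A minimal sketch (not part of the patch) of how a caller might use the relaxed
interface to add a bus that has no explicit bridge device, similar to what the
SR-IOV code later in this series does; add_bridgeless_bus() is a hypothetical
helper that omits locking, registration and error handling:

#include <linux/pci.h>
#include "pci.h"

static struct pci_bus *add_bridgeless_bus(struct pci_bus *parent, int busnr)
{
	struct pci_bus *child;

	/* NULL bridge: the device routes to this bus internally */
	child = pci_alloc_child_bus(parent, NULL, busnr);
	if (!child)
		return NULL;

	child->subordinate = busnr;	/* covers a single bus number */
	child->dev.parent = parent->bridge;
	return child;
}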
Signed-off-by: Yu Zhao --- drivers/pci/pci.h | 2 ++ drivers/pci/probe.c | 9 ++++++--- 2 files changed, 8 insertions(+), 3 deletions(-) diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 922b742..c6fa8ab 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -159,6 +159,8 @@ enum pci_bar_type { extern int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type, struct resource *res, unsigned int reg); +extern struct pci_bus *pci_alloc_child_bus(struct pci_bus *parent, + struct pci_dev *bridge, int busnr); extern void pci_enable_ari(struct pci_dev *dev); /** * pci_ari_enabled - query ARI forwarding status diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index 2326609..9c680b8 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -454,7 +454,7 @@ static struct pci_bus * pci_alloc_bus(void) return b; } -static struct pci_bus *pci_alloc_child_bus(struct pci_bus *parent, +struct pci_bus *pci_alloc_child_bus(struct pci_bus *parent, struct pci_dev *bridge, int busnr) { struct pci_bus *child; @@ -467,12 +467,10 @@ static struct pci_bus *pci_alloc_child_bus(struct pci_bus *parent, if (!child) return NULL; - child->self = bridge; child->parent = parent; child->ops = parent->ops; child->sysdata = parent->sysdata; child->bus_flags = parent->bus_flags; - child->bridge = get_device(&bridge->dev); /* initialize some portions of the bus device, but don't register it * now as the parent is not properly set up yet. This device will get @@ -489,6 +487,11 @@ static struct pci_bus *pci_alloc_child_bus(struct pci_bus *parent, child->primary = parent->secondary; child->subordinate = 0xff; + if (!bridge) + return child; + + child->self = bridge; + child->bridge = get_device(&bridge->dev); /* Set up default resource pointers and names.. */ for (i = 0; i < PCI_BRIDGE_RES_NUM; i++) { child->resource[i] = &bridge->resource[PCI_BRIDGE_RESOURCES+i]; -- 1.5.6.4 From yu.zhao at intel.com Tue Oct 14 03:55:08 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Tue, 14 Oct 2008 18:55:08 +0800 Subject: [PATCH 4/8 v4] PCI: add a wrapper for resource_alignment In-Reply-To: <20081014103424.GA1704@yzhao12-linux.sh.intel.com> References: <20081014103424.GA1704@yzhao12-linux.sh.intel.com> Message-ID: <20081014105508.GD1734@yzhao12-linux.sh.intel.com> Add a wrap of resource_alignment so it can handle device specific resource alignment. Signed-off-by: Yu Zhao --- drivers/pci/pci.c | 25 +++++++++++++++++++++++++ drivers/pci/pci.h | 1 + drivers/pci/setup-bus.c | 4 ++-- drivers/pci/setup-res.c | 7 ++++--- 4 files changed, 32 insertions(+), 5 deletions(-) diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index a9c64b0..381e958 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -1884,6 +1884,31 @@ int pci_select_bars(struct pci_dev *dev, unsigned long flags) return bars; } +/** + * pci_resource_alignment - get a PCI BAR resource alignment + * @dev: the PCI device + * @resno: the resource number + * + * Returns alignment size on success, or 0 on error. 
+ */ +int pci_resource_alignment(struct pci_dev *dev, int resno) +{ + resource_size_t align; + struct resource *res = dev->resource + resno; + + align = resource_alignment(res); + if (align) + return align; + + if (resno <= PCI_ROM_RESOURCE) + return resource_size(res); + else if (resno <= PCI_BRIDGE_RES_END) + return res->start; + + dev_err(&dev->dev, "alignment: invalid resource #%d\n", resno); + return 0; +} + static void __devinit pci_no_domains(void) { #ifdef CONFIG_PCI_DOMAINS diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index c6fa8ab..720b7d6 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -161,6 +161,7 @@ extern int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type, struct resource *res, unsigned int reg); extern struct pci_bus *pci_alloc_child_bus(struct pci_bus *parent, struct pci_dev *bridge, int busnr); +extern int pci_resource_alignment(struct pci_dev *dev, int resno); extern void pci_enable_ari(struct pci_dev *dev); /** * pci_ari_enabled - query ARI forwarding status diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c index 6c78cf8..d454ec3 100644 --- a/drivers/pci/setup-bus.c +++ b/drivers/pci/setup-bus.c @@ -25,6 +25,7 @@ #include #include #include +#include "pci.h" static void pbus_assign_resources_sorted(struct pci_bus *bus) @@ -351,8 +352,7 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask, unsigned long if (r->parent || (r->flags & mask) != type) continue; r_size = resource_size(r); - /* For bridges size != alignment */ - align = resource_alignment(r); + align = pci_resource_alignment(dev, i); order = __ffs(align) - 20; if (order > 11) { dev_warn(&dev->dev, "BAR %d bad alignment %llx: " diff --git a/drivers/pci/setup-res.c b/drivers/pci/setup-res.c index a81caac..ecff483 100644 --- a/drivers/pci/setup-res.c +++ b/drivers/pci/setup-res.c @@ -137,7 +137,7 @@ int pci_assign_resource(struct pci_dev *dev, int resno) size = resource_size(res); min = (res->flags & IORESOURCE_IO) ? PCIBIOS_MIN_IO : PCIBIOS_MIN_MEM; - align = resource_alignment(res); + align = pci_resource_alignment(dev, resno); if (!align) { dev_err(&dev->dev, "BAR %d: can't allocate resource (bogus " "alignment) [%#llx-%#llx] flags %#lx\n", @@ -235,7 +235,7 @@ void pdev_sort_resources(struct pci_dev *dev, struct resource_list *head) if (!(r->flags) || r->parent) continue; - r_align = resource_alignment(r); + r_align = pci_resource_alignment(dev, i); if (!r_align) { dev_warn(&dev->dev, "BAR %d: bogus alignment " "[%#llx-%#llx] flags %#lx\n", @@ -248,7 +248,8 @@ void pdev_sort_resources(struct pci_dev *dev, struct resource_list *head) struct resource_list *ln = list->next; if (ln) - align = resource_alignment(ln->res); + align = pci_resource_alignment(ln->dev, + ln->res - ln->dev->resource); if (r_align > align) { tmp = kmalloc(sizeof(*tmp), GFP_KERNEL); -- 1.5.6.4 From yu.zhao at intel.com Tue Oct 14 03:57:52 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Tue, 14 Oct 2008 18:57:52 +0800 Subject: [PATCH 5/8 v4] PCI: add a new function to map BAR offset In-Reply-To: <20081014103424.GA1704@yzhao12-linux.sh.intel.com> References: <20081014103424.GA1704@yzhao12-linux.sh.intel.com> Message-ID: <20081014105752.GE1734@yzhao12-linux.sh.intel.com> Add a new function to map resource number to base register (offset and type). 
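A hedged sketch (not part of this patch) of how a caller could map a resource
number back to its config-space register, similar to the pci_update_resource()
change later in the patch; show_bar_register() is a hypothetical helper:

#include <linux/pci.h>
#include "pci.h"

static void show_bar_register(struct pci_dev *dev, int resno)
{
	enum pci_bar_type type;
	int reg;

	reg = pci_resource_bar(dev, resno, &type);
	if (!reg) {
		/* e.g. a bridge window has no BAR register of its own */
		dev_info(&dev->dev, "resource %d has no BAR register\n", resno);
		return;
	}

	dev_info(&dev->dev, "resource %d lives at config offset %#x (%s)\n",
		 resno, reg,
		 type == pci_bar_unknown ? "standard BAR" : "ROM or typed BAR");
}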
Signed-off-by: Yu Zhao --- drivers/pci/pci.c | 22 ++++++++++++++++++++++ drivers/pci/pci.h | 2 ++ drivers/pci/setup-res.c | 13 +++++-------- 3 files changed, 29 insertions(+), 8 deletions(-) diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 381e958..3575124 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -1909,6 +1909,28 @@ int pci_resource_alignment(struct pci_dev *dev, int resno) return 0; } +/** + * pci_resource_bar - get position of the BAR associated with a resource + * @dev: the PCI device + * @resno: the resource number + * @type: the BAR type to be filled in + * + * Returns BAR position in config space, or 0 if the BAR is invalid. + */ +int pci_resource_bar(struct pci_dev *dev, int resno, enum pci_bar_type *type) +{ + if (resno < PCI_ROM_RESOURCE) { + *type = pci_bar_unknown; + return PCI_BASE_ADDRESS_0 + 4 * resno; + } else if (resno == PCI_ROM_RESOURCE) { + *type = pci_bar_mem32; + return dev->rom_base_reg; + } + + dev_err(&dev->dev, "BAR: invalid resource #%d\n", resno); + return 0; +} + static void __devinit pci_no_domains(void) { #ifdef CONFIG_PCI_DOMAINS diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 720b7d6..e2237ad 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -162,6 +162,8 @@ extern int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type, extern struct pci_bus *pci_alloc_child_bus(struct pci_bus *parent, struct pci_dev *bridge, int busnr); extern int pci_resource_alignment(struct pci_dev *dev, int resno); +extern int pci_resource_bar(struct pci_dev *dev, int resno, + enum pci_bar_type *type); extern void pci_enable_ari(struct pci_dev *dev); /** * pci_ari_enabled - query ARI forwarding status diff --git a/drivers/pci/setup-res.c b/drivers/pci/setup-res.c index ecff483..c3585a0 100644 --- a/drivers/pci/setup-res.c +++ b/drivers/pci/setup-res.c @@ -31,6 +31,7 @@ void pci_update_resource(struct pci_dev *dev, int resno) struct pci_bus_region region; u32 new, check, mask; int reg; + enum pci_bar_type type; struct resource *res = dev->resource + resno; /* @@ -64,17 +65,13 @@ void pci_update_resource(struct pci_dev *dev, int resno) else mask = (u32)PCI_BASE_ADDRESS_MEM_MASK; - if (resno < 6) { - reg = PCI_BASE_ADDRESS_0 + 4 * resno; - } else if (resno == PCI_ROM_RESOURCE) { + reg = pci_resource_bar(dev, resno, &type); + if (!reg) + return; + if (type != pci_bar_unknown) { if (!(res->flags & IORESOURCE_ROM_ENABLE)) return; new |= PCI_ROM_ADDRESS_ENABLE; - reg = dev->rom_base_reg; - } else { - /* Hmm, non-standard resource. */ - - return; /* kill uninitialised var warning */ } pci_write_config_dword(dev, reg, new); -- 1.5.6.4 From yu.zhao at intel.com Tue Oct 14 04:00:23 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Tue, 14 Oct 2008 19:00:23 +0800 Subject: [PATCH 7/8 v4] PCI: reserve bus range for the SR-IOV device In-Reply-To: <20081014103424.GA1704@yzhao12-linux.sh.intel.com> References: <20081014103424.GA1704@yzhao12-linux.sh.intel.com> Message-ID: <20081014110023.GG1734@yzhao12-linux.sh.intel.com> Reserve bus range for SR-IOV at device scanning stage. 
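For context (a sketch restating the probe.c hunk below, not additional code to
apply), the hook sits at the end of slot scanning in pci_scan_child_bus(),
where 'max' tracks the highest bus number seen so far:

	/* scan all slots on this bus first */
	for (devfn = 0; devfn < 0x100; devfn += 8)
		pci_scan_slot(bus, devfn);

	/* then widen 'max' by the bus numbers the VFs behind this bus need */
	max += pci_iov_bus_range(bus);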
Signed-off-by: Yu Zhao --- drivers/pci/iov.c | 24 ++++++++++++++++++++++++ drivers/pci/pci.h | 5 +++++ drivers/pci/probe.c | 3 +++ 3 files changed, 32 insertions(+), 0 deletions(-) diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c index 3cf9709..7685c6b 100644 --- a/drivers/pci/iov.c +++ b/drivers/pci/iov.c @@ -603,6 +603,30 @@ void pci_iov_remove_sysfs(struct pci_dev *dev) kfree(iov->ve); } +/** + * pci_iov_bus_range - find bus range used by SR-IOV capability + * @bus: the PCI bus + * + * Returns max number of buses (exclude current one) used by Virtual + * Functions. + */ +int pci_iov_bus_range(struct pci_bus *bus) +{ + int max = 0; + u8 busnr, devfn; + struct pci_dev *dev; + + list_for_each_entry(dev, &bus->devices, bus_list) { + if (!dev->iov) + continue; + vf_rid(dev, dev->iov->totalvfs - 1, &busnr, &devfn); + if (busnr > max) + max = busnr; + } + + return max ? max - bus->number : 0; +} + int pci_iov_resource_align(struct pci_dev *dev, int resno) { if (resno < PCI_IOV_RESOURCES || resno > PCI_IOV_RESOURCES_END) diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index c66a4bd..71149b5 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -206,6 +206,7 @@ void pci_iov_remove_sysfs(struct pci_dev *dev); extern int pci_iov_resource_align(struct pci_dev *dev, int resno); extern int pci_iov_resource_bar(struct pci_dev *dev, int resno, enum pci_bar_type *type); +extern int pci_iov_bus_range(struct pci_bus *bus); #else static inline int pci_iov_init(struct pci_dev *dev) { @@ -229,6 +230,10 @@ static inline int pci_iov_resource_bar(struct pci_dev *dev, int resno, { return 0; } +extern inline int pci_iov_bus_range(struct pci_bus *bus) +{ + return 0; +} #endif /* CONFIG_PCI_IOV */ #endif /* DRIVERS_PCI_H */ diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index 831d8d0..b11f4b8 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -1129,6 +1129,9 @@ unsigned int __devinit pci_scan_child_bus(struct pci_bus *bus) for (devfn = 0; devfn < 0x100; devfn += 8) pci_scan_slot(bus, devfn); + /* Reserve buses for SR-IOV capability. */ + max += pci_iov_bus_range(bus); + /* * After performing arch-dependent fixup of the bus, look behind * all PCI-to-PCI bridges on this bus. -- 1.5.6.4 From yu.zhao at intel.com Tue Oct 14 03:59:28 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Tue, 14 Oct 2008 18:59:28 +0800 Subject: [PATCH 6/8 v4] PCI: support the SR-IOV capability In-Reply-To: <20081014103424.GA1704@yzhao12-linux.sh.intel.com> References: <20081014103424.GA1704@yzhao12-linux.sh.intel.com> Message-ID: <20081014105928.GF1734@yzhao12-linux.sh.intel.com> Support Single Root I/O Virtualization (SR-IOV) capability. Signed-off-by: Yu Zhao --- drivers/pci/Kconfig | 12 + drivers/pci/Makefile | 2 + drivers/pci/iov.c | 853 ++++++++++++++++++++++++++++++++++++++++++++++ drivers/pci/pci-sysfs.c | 4 + drivers/pci/pci.c | 14 +- drivers/pci/pci.h | 55 +++ drivers/pci/probe.c | 4 + include/linux/pci.h | 57 +++ include/linux/pci_regs.h | 21 ++ 9 files changed, 1021 insertions(+), 1 deletions(-) create mode 100644 drivers/pci/iov.c diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig index e1ca425..e7c0836 100644 --- a/drivers/pci/Kconfig +++ b/drivers/pci/Kconfig @@ -50,3 +50,15 @@ config HT_IRQ This allows native hypertransport devices to use interrupts. If unsure say Y. + +config PCI_IOV + bool "PCI SR-IOV support" + depends on PCI + select PCI_MSI + default n + help + This option allows device drivers to enable Single Root I/O + Virtualization. 
Each Virtual Function's PCI configuration + space can be accessed using its own Bus, Device and Function + Number (Routing ID). Each Virtual Function also has PCI Memory + Space, which is used to map its own register set. diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile index 7d63f8c..47bb456 100644 --- a/drivers/pci/Makefile +++ b/drivers/pci/Makefile @@ -53,3 +53,5 @@ obj-$(CONFIG_PCI_SYSCALL) += syscall.o ifeq ($(CONFIG_PCI_DEBUG),y) EXTRA_CFLAGS += -DDEBUG endif + +obj-$(CONFIG_PCI_IOV) += iov.o diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c new file mode 100644 index 0000000..3cf9709 --- /dev/null +++ b/drivers/pci/iov.c @@ -0,0 +1,853 @@ +/* + * drivers/pci/iov.c + * + * Copyright (C) 2008 Intel Corporation + * + * PCI Express Single Root I/O Virtualization capability support. + */ + +#include +#include +#include +#include +#include +#include "pci.h" + +#define VF_NAME_LEN 8 + + +struct iov_attr { + struct attribute attr; + ssize_t (*show)(struct kobject *, + struct iov_attr *, char *); + ssize_t (*store)(struct kobject *, + struct iov_attr *, const char *, size_t); +}; + +#define iov_config_attr(field) \ +static ssize_t field##_show(struct kobject *kobj, \ + struct iov_attr *attr, char *buf) \ +{ \ + struct pci_iov *iov = container_of(kobj, struct pci_iov, kobj); \ + \ + return sprintf(buf, "%d\n", iov->field); \ +} + +iov_config_attr(is_enabled); +iov_config_attr(totalvfs); +iov_config_attr(initialvfs); +iov_config_attr(numvfs); + +struct vf_entry { + int vfn; + struct kobject kobj; + struct pci_iov *iov; + struct iov_attr *attr; + char name[VF_NAME_LEN]; + char (*param)[PCI_IOV_PARAM_LEN]; +}; + +static ssize_t iov_attr_show(struct kobject *kobj, + struct attribute *attr, char *buf) +{ + struct iov_attr *ia = container_of(attr, struct iov_attr, attr); + + return ia->show ? ia->show(kobj, ia, buf) : -EIO; +} + +static ssize_t iov_attr_store(struct kobject *kobj, + struct attribute *attr, const char *buf, size_t len) +{ + struct iov_attr *ia = container_of(attr, struct iov_attr, attr); + + return ia->store ? 
ia->store(kobj, ia, buf, len) : -EIO; +} + +static struct sysfs_ops iov_attr_ops = { + .show = iov_attr_show, + .store = iov_attr_store, +}; + +static struct kobj_type iov_ktype = { + .sysfs_ops = &iov_attr_ops, +}; + +static inline void vf_rid(struct pci_dev *dev, int vfn, u8 *busnr, u8 *devfn) +{ + u16 rid; + + rid = (dev->bus->number << 8) + dev->devfn + + dev->iov->offset + dev->iov->stride * vfn; + *busnr = rid >> 8; + *devfn = rid & 0xff; +} + +static int vf_add(struct pci_dev *dev, int vfn) +{ + int i; + int rc; + u8 busnr, devfn; + unsigned long size; + struct pci_dev *new; + struct pci_bus *bus; + struct resource *res; + + vf_rid(dev, vfn, &busnr, &devfn); + + new = alloc_pci_dev(); + if (!new) + return -ENOMEM; + + if (dev->bus->number == busnr) + new->bus = bus = dev->bus; + else { + list_for_each_entry(bus, &dev->bus->children, node) + if (bus->number == busnr) { + new->bus = bus; + break; + } + BUG_ON(!new->bus); + } + + new->sysdata = bus->sysdata; + new->dev.parent = dev->dev.parent; + new->dev.bus = dev->dev.bus; + new->devfn = devfn; + new->hdr_type = PCI_HEADER_TYPE_NORMAL; + new->multifunction = 0; + new->vendor = dev->vendor; + pci_read_config_word(dev, dev->iov->cap + PCI_IOV_VF_DID, &new->device); + new->cfg_size = PCI_CFG_SPACE_EXP_SIZE; + new->error_state = pci_channel_io_normal; + new->is_pcie = 1; + new->pcie_type = PCI_EXP_TYPE_ENDPOINT; + new->dma_mask = 0xffffffff; + + dev_set_name(&new->dev, "%04x:%02x:%02x.%d", pci_domain_nr(bus), + busnr, PCI_SLOT(devfn), PCI_FUNC(devfn)); + + pci_read_config_byte(new, PCI_REVISION_ID, &new->revision); + new->class = dev->class; + new->current_state = PCI_UNKNOWN; + new->irq = 0; + + for (i = 0; i < PCI_IOV_NUM_BAR; i++) { + res = dev->resource + PCI_IOV_RESOURCES + i; + if (!res->parent) + continue; + new->resource[i].name = pci_name(new); + new->resource[i].flags = res->flags; + size = resource_size(res) / dev->iov->totalvfs; + new->resource[i].start = res->start + size * vfn; + new->resource[i].end = new->resource[i].start + size - 1; + rc = request_resource(res, &new->resource[i]); + BUG_ON(rc); + } + + new->subsystem_vendor = dev->subsystem_vendor; + pci_read_config_word(new, PCI_SUBSYSTEM_ID, &new->subsystem_device); + + pci_device_add(new, bus); + return pci_bus_add_device(new); +} + +static void vf_remove(struct pci_dev *dev, int vfn) +{ + u8 busnr, devfn; + struct pci_dev *tmp; + + vf_rid(dev, vfn, &busnr, &devfn); + + tmp = pci_get_bus_and_slot(busnr, devfn); + if (!tmp) + return; + + pci_dev_put(tmp); + pci_remove_bus_device(tmp); +} + +static int iov_enable(struct pci_iov *iov) +{ + int rc; + int i, j; + u16 ctrl; + + if (!iov->notify) + return -ENODEV; + + if (iov->is_enabled) + return 0; + + iov->notify(iov->dev, iov->numvfs | PCI_IOV_ENABLE); + pci_read_config_word(iov->dev, iov->cap + PCI_IOV_CTRL, &ctrl); + ctrl |= (PCI_IOV_CTRL_VFE | PCI_IOV_CTRL_MSE); + pci_write_config_word(iov->dev, iov->cap + PCI_IOV_CTRL, ctrl); + ssleep(1); + + for (i = 0; i < iov->numvfs; i++) { + rc = vf_add(iov->dev, i); + if (rc) + goto failed; + } + + iov->notify(iov->dev, iov->numvfs | + PCI_IOV_ENABLE | PCI_IOV_POST_EVENT); + iov->is_enabled = 1; + return 0; + +failed: + for (j = 0; j < i; j++) + vf_remove(iov->dev, j); + + pci_read_config_word(iov->dev, iov->cap + PCI_IOV_CTRL, &ctrl); + ctrl &= ~(PCI_IOV_CTRL_VFE | PCI_IOV_CTRL_MSE); + pci_write_config_word(iov->dev, iov->cap + PCI_IOV_CTRL, ctrl); + ssleep(1); + + return rc; +} + +static int iov_disable(struct pci_iov *iov) +{ + int i; + u16 ctrl; + + if (!iov->notify) + 
return -ENODEV; + + if (!iov->is_enabled) + return 0; + + iov->notify(iov->dev, PCI_IOV_DISABLE); + for (i = 0; i < iov->numvfs; i++) + vf_remove(iov->dev, i); + + pci_read_config_word(iov->dev, iov->cap + PCI_IOV_CTRL, &ctrl); + ctrl &= ~(PCI_IOV_CTRL_VFE | PCI_IOV_CTRL_MSE); + pci_write_config_word(iov->dev, iov->cap + PCI_IOV_CTRL, ctrl); + ssleep(1); + + iov->notify(iov->dev, PCI_IOV_DISABLE | PCI_IOV_POST_EVENT); + iov->is_enabled = 0; + return 0; +} + +static int iov_set_numvfs(struct pci_iov *iov, int numvfs) +{ + u16 offset, stride; + + if (!iov->notify) + return -ENODEV; + + if (numvfs == iov->numvfs) + return 0; + + if (numvfs < 0 || numvfs > iov->initialvfs || iov->is_enabled) + return -EINVAL; + + pci_write_config_word(iov->dev, iov->cap + PCI_IOV_NUM_VF, numvfs); + pci_read_config_word(iov->dev, iov->cap + PCI_IOV_VF_OFFSET, &offset); + pci_read_config_word(iov->dev, iov->cap + PCI_IOV_VF_STRIDE, &stride); + if ((numvfs && !offset) || (numvfs > 1 && !stride)) + return -EIO; + + iov->offset = offset; + iov->stride = stride; + iov->numvfs = numvfs; + return 0; +} + +static ssize_t is_enabled_store(struct kobject *kobj, struct iov_attr *attr, + const char *buf, size_t count) +{ + int rc; + long enable; + struct pci_iov *iov = container_of(kobj, struct pci_iov, kobj); + + rc = strict_strtol(buf, 0, &enable); + if (rc) + return rc; + + mutex_lock(&iov->mutex); + switch (enable) { + case 0: + rc = iov_disable(iov); + break; + case 1: + rc = iov_enable(iov); + break; + default: + rc = -EINVAL; + } + mutex_unlock(&iov->mutex); + + return rc ? rc : count; +} + +static ssize_t numvfs_store(struct kobject *kobj, struct iov_attr *attr, + const char *buf, size_t count) +{ + int rc; + long numvfs; + struct pci_iov *iov = container_of(kobj, struct pci_iov, kobj); + + rc = strict_strtol(buf, 0, &numvfs); + if (rc) + return rc; + + mutex_lock(&iov->mutex); + rc = iov_set_numvfs(iov, numvfs); + mutex_unlock(&iov->mutex); + + return rc ? 
rc : count; +} + + +static struct iov_attr iov_attr[] = { + __ATTR_RO(totalvfs), + __ATTR_RO(initialvfs), + __ATTR(numvfs, S_IWUSR | S_IRUGO, numvfs_show, numvfs_store), + __ATTR(enable, S_IWUSR | S_IRUGO, is_enabled_show, is_enabled_store), +}; + +static ssize_t vf_show(struct kobject *kobj, struct iov_attr *attr, + char *buf) +{ + int vfn; + struct vf_entry *ve = container_of(kobj, struct vf_entry, kobj); + + vfn = attr - ve->attr; + ve->iov->notify(ve->iov->dev, vfn | PCI_IOV_RD_CONF); + + return sprintf(buf, "%s\n", ve->param[vfn]); +} + +static ssize_t vf_store(struct kobject *kobj, struct iov_attr *attr, + const char *buf, size_t count) +{ + int vfn; + struct vf_entry *ve = container_of(kobj, struct vf_entry, kobj); + + vfn = attr - ve->attr; + sscanf(buf, "%63s", ve->param[vfn]); + ve->iov->notify(ve->iov->dev, vfn | PCI_IOV_WR_CONF); + + return count; +} + +static ssize_t rid_show(struct kobject *kobj, struct iov_attr *attr, + char *buf) +{ + u8 busnr, devfn; + struct vf_entry *ve = container_of(kobj, struct vf_entry, kobj); + + vf_rid(ve->iov->dev, ve->vfn, &busnr, &devfn); + + return sprintf(buf, "%04x:%02x:%02x.%d\n", + pci_domain_nr(ve->iov->dev->bus), + busnr, PCI_SLOT(devfn), PCI_FUNC(devfn)); +} + +static struct iov_attr vf_attr = __ATTR_RO(rid); + +int iov_alloc_bus(struct pci_bus *bus, int busnr) +{ + int i; + int rc = 0; + struct pci_bus *child, *next; + struct list_head head; + + INIT_LIST_HEAD(&head); + + down_write(&pci_bus_sem); + + for (i = bus->number + 1; i <= busnr; i++) { + list_for_each_entry(child, &bus->children, node) + if (child->number == i) + break; + if (child->number == i) + continue; + child = pci_alloc_child_bus(bus, NULL, i); + if (!child) { + rc = -ENOMEM; + break; + } + child->subordinate = i; + child->dev.parent = bus->bridge; + rc = device_register(&child->dev); + if (rc) { + kfree(child); + break; + } + child->is_added = 1; + list_add_tail(&child->node, &head); + } + + if (rc) + list_for_each_entry_safe(child, next, &head, node) { + device_unregister(&child->dev); + kfree(child); + } + else + list_for_each_entry_safe(child, next, &head, node) + list_move_tail(&child->node, &bus->children); + + up_write(&pci_bus_sem); + + return rc; +} + +void iov_release_bus(struct pci_bus *bus) +{ + struct pci_dev *dev; + struct pci_bus *child, *next; + struct list_head head; + + INIT_LIST_HEAD(&head); + + down_write(&pci_bus_sem); + + list_for_each_entry(dev, &bus->devices, bus_list) + if (dev->iov && dev->iov->notify) + goto done; + + list_for_each_entry_safe(child, next, &bus->children, node) + if (!child->bridge) + list_move(&child->node, &head); +done: + up_write(&pci_bus_sem); + + list_for_each_entry_safe(child, next, &head, node) + pci_remove_bus(child); +} + +/** + * pci_iov_init - initialize device's SR-IOV capability + * @dev: the PCI device + * + * Returns 0 on success, or negative on failure. + * + * The major differences between Virtual Function and PCI device are: + * 1) the device with multiple bus numbers uses internal routing, so + * there is no explicit bridge device in this case. + * 2) Virtual Function memory spaces are designated by BARs encapsulated + * in the capability structure, and the BARs in Virtual Function PCI + * configuration space are read-only zero. 
+ */ +int pci_iov_init(struct pci_dev *dev) +{ + int i; + int pos; + u32 pgsz; + u16 ctrl, total, initial, offset, stride; + struct pci_iov *iov; + struct resource *res; + + if (!dev->is_pcie || (dev->pcie_type != PCI_EXP_TYPE_RC_END && + dev->pcie_type != PCI_EXP_TYPE_ENDPOINT)) + return -ENODEV; + + pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_IOV); + if (!pos) + return -ENODEV; + + ctrl = pci_ari_enabled(dev) ? PCI_IOV_CTRL_ARI : 0; + pci_write_config_word(dev, pos + PCI_IOV_CTRL, ctrl); + ssleep(1); + + pci_read_config_word(dev, pos + PCI_IOV_TOTAL_VF, &total); + pci_read_config_word(dev, pos + PCI_IOV_INITIAL_VF, &initial); + pci_write_config_word(dev, pos + PCI_IOV_NUM_VF, initial); + pci_read_config_word(dev, pos + PCI_IOV_VF_OFFSET, &offset); + pci_read_config_word(dev, pos + PCI_IOV_VF_STRIDE, &stride); + if (!total || initial > total || (initial && !offset) || + (initial > 1 && !stride)) + return -EIO; + + pci_read_config_dword(dev, pos + PCI_IOV_SUP_PGSIZE, &pgsz); + i = PAGE_SHIFT > 12 ? PAGE_SHIFT - 12 : 0; + pgsz &= ~((1 << i) - 1); + if (!pgsz) + return -EIO; + + pgsz &= ~(pgsz - 1); + pci_write_config_dword(dev, pos + PCI_IOV_SYS_PGSIZE, pgsz); + + iov = kzalloc(sizeof(*iov), GFP_KERNEL); + if (!iov) + return -ENOMEM; + + iov->dev = dev; + iov->cap = pos; + iov->totalvfs = total; + iov->initialvfs = initial; + iov->offset = offset; + iov->stride = stride; + iov->align = pgsz << 12; + mutex_init(&iov->mutex); + + for (i = 0; i < PCI_IOV_NUM_BAR; i++) { + res = dev->resource + PCI_IOV_RESOURCES + i; + pos = iov->cap + PCI_IOV_BAR_0 + i * 4; + i += __pci_read_base(dev, pci_bar_unknown, res, pos); + if (!res->flags) + continue; + res->flags &= ~IORESOURCE_SIZEALIGN; + res->end = res->start + resource_size(res) * total - 1; + } + + dev->iov = iov; + + return 0; +} + +/** + * pci_iov_release - release resources used by SR-IOV capability + * @dev: the PCI device + */ +void pci_iov_release(struct pci_dev *dev) +{ + if (!dev->iov) + return; + + mutex_destroy(&dev->iov->mutex); + kfree(dev->iov); + dev->iov = NULL; +} + +/** + * pci_iov_create_sysfs - create sysfs for SR-IOV capability + * @dev: the PCI device + */ +void pci_iov_create_sysfs(struct pci_dev *dev) +{ + int rc; + int i, j; + struct pci_iov *iov = dev->iov; + + if (!iov) + return; + + iov->ve = kzalloc(sizeof(*iov->ve) * iov->totalvfs, GFP_KERNEL); + if (!iov->ve) + return; + + for (i = 0; i < iov->totalvfs; i++) { + iov->ve[i].vfn = i; + iov->ve[i].iov = iov; + } + + rc = kobject_init_and_add(&iov->kobj, &iov_ktype, + &dev->dev.kobj, "iov"); + if (rc) + goto failed1; + + for (i = 0; i < ARRAY_SIZE(iov_attr); i++) { + rc = sysfs_create_file(&iov->kobj, &iov_attr[i].attr); + if (rc) + goto failed2; + } + + for (i = 0; i < iov->totalvfs; i++) { + sprintf(iov->ve[i].name, "%d", i); + rc = kobject_init_and_add(&iov->ve[i].kobj, &iov_ktype, + &iov->kobj, iov->ve[i].name); + if (rc) + goto failed3; + rc = sysfs_create_file(&iov->ve[i].kobj, &vf_attr.attr); + if (rc) { + kobject_put(&iov->ve[i].kobj); + goto failed3; + } + } + + return; + +failed3: + for (j = 0; j < i; j++) { + sysfs_remove_file(&iov->ve[j].kobj, &vf_attr.attr); + kobject_put(&iov->ve[j].kobj); + } +failed2: + for (j = 0; j < i; j++) + sysfs_remove_file(&dev->iov->kobj, &iov_attr[j].attr); + kobject_put(&iov->kobj); +failed1: + kfree(iov->ve); + iov->ve = NULL; + + dev_err(&dev->dev, "can't create sysfs for SR-IOV.\n"); +} + +/** + * pci_iov_remove_sysfs - remove sysfs of SR-IOV capability + * @dev: the PCI device + */ +void pci_iov_remove_sysfs(struct 
pci_dev *dev) +{ + int i; + struct pci_iov *iov = dev->iov; + + if (!iov || !iov->ve) + return; + + for (i = 0; i < iov->totalvfs; i++) { + sysfs_remove_file(&iov->ve[i].kobj, &vf_attr.attr); + kobject_put(&iov->ve[i].kobj); + } + + for (i = 0; i < ARRAY_SIZE(iov_attr); i++) + sysfs_remove_file(&dev->iov->kobj, &iov_attr[i].attr); + + kobject_put(&iov->kobj); + kfree(iov->ve); +} + +int pci_iov_resource_align(struct pci_dev *dev, int resno) +{ + if (resno < PCI_IOV_RESOURCES || resno > PCI_IOV_RESOURCES_END) + return 0; + + BUG_ON(!dev->iov); + + return dev->iov->align; +} + +int pci_iov_resource_bar(struct pci_dev *dev, int resno, + enum pci_bar_type *type) +{ + if (resno < PCI_IOV_RESOURCES || resno > PCI_IOV_RESOURCES_END) + return 0; + + BUG_ON(!dev->iov); + + *type = pci_bar_unknown; + return dev->iov->cap + PCI_IOV_BAR_0 + + 4 * (resno - PCI_IOV_RESOURCES); +} + +/** + * pci_iov_register - register SR-IOV service + * @dev: the PCI device + * @notify: callback function for SR-IOV events + * @entries: sysfs entries used by Physical Function driver + * + * Returns 0 on success, or negative on failure. + */ +int pci_iov_register(struct pci_dev *dev, int (*notify)(struct pci_dev *, u32), + char **entries) +{ + int rc; + int n, i, j, k; + u8 busnr, devfn; + struct iov_attr *attr; + struct pci_iov *iov = dev->iov; + + if (!iov || !iov->ve) + return -ENODEV; + + if (!notify) + return -EINVAL; + + vf_rid(dev, iov->totalvfs - 1, &busnr, &devfn); + if (busnr > dev->bus->subordinate) + return -EIO; + + iov->notify = notify; + rc = iov_alloc_bus(dev->bus, busnr); + if (rc) + return rc; + + for (n = 0; entries && entries[n] && *entries[n]; n++) + ; + if (!n) + return 0; + + for (i = 0; i < iov->totalvfs; i++) { + rc = -ENOMEM; + iov->ve[i].param = kzalloc(PCI_IOV_PARAM_LEN * n, GFP_KERNEL); + if (!iov->ve[i].param) + goto failed; + attr = kzalloc(sizeof(*attr) * n, GFP_KERNEL); + if (!attr) { + kfree(iov->ve[i].param); + goto failed; + } + iov->ve[i].attr = attr; + for (j = 0; j < n; j++) { + attr[j].attr.name = entries[j]; + attr[j].attr.mode = S_IWUSR | S_IRUGO; + attr[j].show = vf_show; + attr[j].store = vf_store; + rc = sysfs_create_file(&iov->ve[i].kobj, &attr[j].attr); + if (rc) { + while (j--) + sysfs_remove_file(&iov->ve[i].kobj, + &attr[j].attr); + kfree(iov->ve[i].attr); + kfree(iov->ve[i].param); + goto failed; + } + } + } + + iov->nentries = n; + return 0; + +failed: + for (k = 0; k < i; k++) { + for (j = 0; j < n; j++) + sysfs_remove_file(&iov->ve[k].kobj, + &iov->ve[k].attr[j].attr); + kfree(iov->ve[k].attr); + kfree(iov->ve[k].param); + } + + return rc; +} +EXPORT_SYMBOL_GPL(pci_iov_register); + +/** + * pci_iov_unregister - unregister SR-IOV service + * @dev: the PCI device + */ +void pci_iov_unregister(struct pci_dev *dev) +{ + int i, j; + struct pci_iov *iov = dev->iov; + + BUG_ON(!iov || !iov->notify); + + if (!iov->nentries) + return; + + for (i = 0; i < iov->totalvfs; i++) { + for (j = 0; j < iov->nentries; j++) + sysfs_remove_file(&iov->ve[i].kobj, + &iov->ve[i].attr[j].attr); + kfree(iov->ve[i].attr); + kfree(iov->ve[i].param); + } + iov->notify = NULL; + iov_release_bus(dev->bus); +} +EXPORT_SYMBOL_GPL(pci_iov_unregister); + +/** + * pci_iov_enable - enable SR-IOV capability + * @dev: the PCI device + * @numvfs: number of VFs to be available + * + * Returns 0 on success, or negative on failure. 
+ */ +int pci_iov_enable(struct pci_dev *dev, int numvfs) +{ + int rc; + struct pci_iov *iov = dev->iov; + + if (!iov) + return -ENODEV; + + if (!iov->notify) + return -EINVAL; + + mutex_lock(&iov->mutex); + rc = iov_set_numvfs(iov, numvfs); + if (rc) + goto done; + rc = iov_enable(iov); +done: + mutex_unlock(&iov->mutex); + + return rc; +} +EXPORT_SYMBOL_GPL(pci_iov_enable); + +/** + * pci_iov_disable - disable SR-IOV capability + * @dev: the PCI device + * + * Should be called upon Physical Function driver removal, and power + * state change. All previous allocated Virtual Functions are reclaimed. + */ +void pci_iov_disable(struct pci_dev *dev) +{ + struct pci_iov *iov = dev->iov; + + BUG_ON(!iov || !iov->notify); + mutex_lock(&iov->mutex); + iov_disable(iov); + mutex_unlock(&iov->mutex); +} +EXPORT_SYMBOL_GPL(pci_iov_disable); + +/** + * pci_iov_read_config - read SR-IOV configurations + * @dev: the PCI device + * @vfn: Virtual Function Number + * @entry: the entry to be read + * @buf: the buffer to be filled + * @size: size of the buffer + * + * Returns 0 on success, or negative on failure. + */ +int pci_iov_read_config(struct pci_dev *dev, int vfn, + char *entry, char *buf, int size) +{ + int i; + struct pci_iov *iov = dev->iov; + + if (!iov) + return -ENODEV; + + if (!iov->notify || !iov->ve || !iov->nentries) + return -EINVAL; + + if (vfn < 0 || vfn >= iov->totalvfs) + return -EINVAL; + + for (i = 0; i < iov->nentries; i++) + if (!strcmp(iov->ve[vfn].attr[i].attr.name, entry)) { + strncpy(buf, iov->ve[vfn].param[i], size); + buf[size - 1] = '\0'; + return 0; + } + + return -EINVAL; +} +EXPORT_SYMBOL_GPL(pci_iov_read_config); + +/** + * pci_iov_write_config - write SR-IOV configurations + * @dev: the PCI device + * @vfn: Virtual Function Number + * @entry: the entry to be written + * @buf: the buffer contains configurations + * + * Returns 0 on success, or negative on failure. 
+ */ +int pci_iov_write_config(struct pci_dev *dev, int vfn, + char *entry, char *buf) +{ + int i; + struct pci_iov *iov = dev->iov; + + if (!iov) + return -ENODEV; + + if (!iov->notify || !iov->ve || !iov->nentries) + return -EINVAL; + + if (vfn < 0 || vfn >= iov->totalvfs) + return -EINVAL; + + for (i = 0; i < iov->nentries; i++) + if (!strcmp(iov->ve[vfn].attr[i].attr.name, entry)) { + strncpy(iov->ve[vfn].param[i], buf, PCI_IOV_PARAM_LEN); + iov->ve[vfn].param[i][PCI_IOV_PARAM_LEN - 1] = '\0'; + return 0; + } + + return -EINVAL; +} +EXPORT_SYMBOL_GPL(pci_iov_write_config); diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c index c41b783..9494659 100644 --- a/drivers/pci/pci-sysfs.c +++ b/drivers/pci/pci-sysfs.c @@ -764,6 +764,9 @@ static int pci_create_capabilities_sysfs(struct pci_dev *dev) /* Active State Power Management */ pcie_aspm_create_sysfs_dev_files(dev); + /* Single Root I/O Virtualization */ + pci_iov_create_sysfs(dev); + return 0; } @@ -849,6 +852,7 @@ static void pci_remove_capabilities_sysfs(struct pci_dev *dev) } pcie_aspm_remove_sysfs_dev_files(dev); + pci_iov_remove_sysfs(dev); } /** diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 3575124..4cfdbdb 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -1902,7 +1902,12 @@ int pci_resource_alignment(struct pci_dev *dev, int resno) if (resno <= PCI_ROM_RESOURCE) return resource_size(res); - else if (resno <= PCI_BRIDGE_RES_END) + else if (resno < PCI_BRIDGE_RESOURCES) { + /* may be device specific resource */ + align = pci_iov_resource_align(dev, resno); + if (align) + return align; + } else if (resno <= PCI_BRIDGE_RES_END) return res->start; dev_err(&dev->dev, "alignment: invalid resource #%d\n", resno); @@ -1919,12 +1924,19 @@ int pci_resource_alignment(struct pci_dev *dev, int resno) */ int pci_resource_bar(struct pci_dev *dev, int resno, enum pci_bar_type *type) { + int reg; + if (resno < PCI_ROM_RESOURCE) { *type = pci_bar_unknown; return PCI_BASE_ADDRESS_0 + 4 * resno; } else if (resno == PCI_ROM_RESOURCE) { *type = pci_bar_mem32; return dev->rom_base_reg; + } else if (resno < PCI_BRIDGE_RESOURCES) { + /* may be device specific resource */ + reg = pci_iov_resource_bar(dev, resno, type); + if (reg) + return reg; } dev_err(&dev->dev, "BAR: invalid resource #%d\n", resno); diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index e2237ad..c66a4bd 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -176,4 +176,59 @@ static inline int pci_ari_enabled(struct pci_dev *dev) return dev->ari_enabled; } +/* Single Root I/O Virtualization */ +#define PCI_IOV_PARAM_LEN 64 + +struct vf_entry; + +struct pci_iov { + int cap; /* capability position */ + int align; /* page size used to map memory space */ + int is_enabled; /* status of SR-IOV */ + int nentries; /* number of sysfs entries used by PF driver */ + u16 totalvfs; /* total VFs associated with the PF */ + u16 initialvfs; /* initial VFs associated with the PF */ + u16 numvfs; /* number of VFs available */ + u16 offset; /* first VF Routing ID offset */ + u16 stride; /* following VF stride */ + struct mutex mutex; /* lock for SR-IOV */ + struct kobject kobj; /* koject for IOV */ + struct pci_dev *dev; /* Physical Function */ + struct vf_entry *ve; /* Virtual Function related */ + int (*notify)(struct pci_dev *, u32); /* event callback function */ +}; + +#ifdef CONFIG_PCI_IOV +extern int pci_iov_init(struct pci_dev *dev); +extern void pci_iov_release(struct pci_dev *dev); +void pci_iov_create_sysfs(struct pci_dev *dev); +void 
pci_iov_remove_sysfs(struct pci_dev *dev); +extern int pci_iov_resource_align(struct pci_dev *dev, int resno); +extern int pci_iov_resource_bar(struct pci_dev *dev, int resno, + enum pci_bar_type *type); +#else +static inline int pci_iov_init(struct pci_dev *dev) +{ + return -EIO; +} +static inline void pci_iov_release(struct pci_dev *dev) +{ +} +static inline void pci_iov_create_sysfs(struct pci_dev *dev) +{ +} +static inline void pci_iov_remove_sysfs(struct pci_dev *dev) +{ +} +static inline int pci_iov_resource_align(struct pci_dev *dev, int resno) +{ + return 0; +} +static inline int pci_iov_resource_bar(struct pci_dev *dev, int resno, + enum pci_bar_type *type) +{ + return 0; +} +#endif /* CONFIG_PCI_IOV */ + #endif /* DRIVERS_PCI_H */ diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index 9c680b8..831d8d0 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -845,6 +845,7 @@ static int pci_setup_device(struct pci_dev * dev) static void pci_release_capabilities(struct pci_dev *dev) { pci_vpd_release(dev); + pci_iov_release(dev); } /** @@ -1023,6 +1024,9 @@ static void pci_init_capabilities(struct pci_dev *dev) /* Alternative Routing-ID Forwarding */ pci_enable_ari(dev); + + /* Single Root I/O Virtualization */ + pci_iov_init(dev); } void pci_device_add(struct pci_dev *dev, struct pci_bus *bus) diff --git a/include/linux/pci.h b/include/linux/pci.h index 497d639..a7d2fd4 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -87,6 +87,12 @@ enum { /* #6: expansion ROM */ PCI_ROM_RESOURCE, + /* device specific resources */ +#ifdef CONFIG_PCI_IOV + PCI_IOV_RESOURCES, + PCI_IOV_RESOURCES_END = PCI_IOV_RESOURCES + PCI_IOV_NUM_BAR - 1, +#endif + /* address space assigned to buses behind the bridge */ #ifndef PCI_BRIDGE_RES_NUM #define PCI_BRIDGE_RES_NUM 4 @@ -165,6 +171,7 @@ struct pci_cap_saved_state { struct pcie_link_state; struct pci_vpd; +struct pci_iov; /* * The pci_dev structure is used to describe PCI devices. 
@@ -253,6 +260,7 @@ struct pci_dev { struct list_head msi_list; #endif struct pci_vpd *vpd; + struct pci_iov *iov; }; extern struct pci_dev *alloc_pci_dev(void); @@ -1128,5 +1136,54 @@ static inline void pci_mmcfg_early_init(void) { } static inline void pci_mmcfg_late_init(void) { } #endif +/* SR-IOV events masks */ +#define PCI_IOV_VIRTFN_ID 0x0000FFFFU /* Virtual Function Number */ +#define PCI_IOV_NUM_VIRTFN 0x0000FFFFU /* num of Virtual Functions */ +#define PCI_IOV_EVENT_TYPE 0x80000000U /* event type (pre/post) */ +/* SR-IOV events values */ +#define PCI_IOV_ENABLE 0x00010000U /* SR-IOV enable request */ +#define PCI_IOV_DISABLE 0x00020000U /* SR-IOV disable request */ +#define PCI_IOV_RD_CONF 0x00040000U /* read configuration */ +#define PCI_IOV_WR_CONF 0x00080000U /* write configuration */ +#define PCI_IOV_POST_EVENT 0x80000000U /* post event */ + +#ifdef CONFIG_PCI_IOV +extern int pci_iov_enable(struct pci_dev *dev, int numvfs); +extern void pci_iov_disable(struct pci_dev *dev); +extern int pci_iov_register(struct pci_dev *dev, + int (*notify)(struct pci_dev *dev, u32 event), char **entries); +extern void pci_iov_unregister(struct pci_dev *dev); +extern int pci_iov_read_config(struct pci_dev *dev, int id, + char *entry, char *buf, int size); +extern int pci_iov_write_config(struct pci_dev *dev, int id, + char *entry, char *buf); +#else +static inline int pci_iov_enable(struct pci_dev *dev, int numvfs) +{ + return -EIO; +} +static inline void pci_iov_disable(struct pci_dev *dev) +{ +} +static inline int pci_iov_register(struct pci_dev *dev, + int (*notify)(struct pci_dev *dev, u32 event), char **entries) +{ + return -EIO; +} +static inline void pci_iov_unregister(struct pci_dev *dev) +{ +} +static inline int pci_iov_read_config(struct pci_dev *dev, int id, + char *entry, char *buf, int size) +{ + return -EIO; +} +static inline int pci_iov_write_config(struct pci_dev *dev, int id, + char *entry, char *buf) +{ + return -EIO; +} +#endif /* CONFIG_PCI_IOV */ + #endif /* __KERNEL__ */ #endif /* LINUX_PCI_H */ diff --git a/include/linux/pci_regs.h b/include/linux/pci_regs.h index eb6686b..1b28b3f 100644 --- a/include/linux/pci_regs.h +++ b/include/linux/pci_regs.h @@ -363,6 +363,7 @@ #define PCI_EXP_TYPE_UPSTREAM 0x5 /* Upstream Port */ #define PCI_EXP_TYPE_DOWNSTREAM 0x6 /* Downstream Port */ #define PCI_EXP_TYPE_PCI_BRIDGE 0x7 /* PCI/PCI-X Bridge */ +#define PCI_EXP_TYPE_RC_END 0x9 /* Root Complex Integrated Endpoint */ #define PCI_EXP_FLAGS_SLOT 0x0100 /* Slot implemented */ #define PCI_EXP_FLAGS_IRQ 0x3e00 /* Interrupt message number */ #define PCI_EXP_DEVCAP 4 /* Device capabilities */ @@ -434,6 +435,7 @@ #define PCI_EXT_CAP_ID_DSN 3 #define PCI_EXT_CAP_ID_PWR 4 #define PCI_EXT_CAP_ID_ARI 14 +#define PCI_EXT_CAP_ID_IOV 16 /* Advanced Error Reporting */ #define PCI_ERR_UNCOR_STATUS 4 /* Uncorrectable Error Status */ @@ -551,4 +553,23 @@ #define PCI_ARI_CTRL_ACS 0x0002 /* ACS Function Groups Enable */ #define PCI_ARI_CTRL_FG(x) (((x) >> 4) & 7) /* Function Group */ +/* Single Root I/O Virtualization */ +#define PCI_IOV_CAP 0x04 /* SR-IOV Capabilities */ +#define PCI_IOV_CTRL 0x08 /* SR-IOV Control */ +#define PCI_IOV_CTRL_VFE 0x01 /* VF Enable */ +#define PCI_IOV_CTRL_MSE 0x08 /* VF Memory Space Enable */ +#define PCI_IOV_CTRL_ARI 0x10 /* ARI Capable Hierarchy */ +#define PCI_IOV_STATUS 0x0a /* SR-IOV Status */ +#define PCI_IOV_INITIAL_VF 0x0c /* Initial VFs */ +#define PCI_IOV_TOTAL_VF 0x0e /* Total VFs */ +#define PCI_IOV_NUM_VF 0x10 /* Number of VFs */ +#define PCI_IOV_FUNC_LINK 
0x12 /* Function Dependency Link */ +#define PCI_IOV_VF_OFFSET 0x14 /* First VF Offset */ +#define PCI_IOV_VF_STRIDE 0x16 /* Following VF Stride */ +#define PCI_IOV_VF_DID 0x1a /* VF Device ID */ +#define PCI_IOV_SUP_PGSIZE 0x1c /* Supported Page Sizes */ +#define PCI_IOV_SYS_PGSIZE 0x20 /* System Page Size */ +#define PCI_IOV_BAR_0 0x24 /* VF BAR0 */ +#define PCI_IOV_NUM_BAR 6 /* Number of VF BARs */ + #endif /* LINUX_PCI_REGS_H */ -- 1.5.6.4 From yu.zhao at intel.com Tue Oct 14 04:01:57 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Tue, 14 Oct 2008 19:01:57 +0800 Subject: [PATCH 8/8 v4] PCI: document the changes In-Reply-To: <20081014103424.GA1704@yzhao12-linux.sh.intel.com> References: <20081014103424.GA1704@yzhao12-linux.sh.intel.com> Message-ID: <20081014110157.GH1734@yzhao12-linux.sh.intel.com> Create how-to for SR-IOV user and device driver developer. Signed-off-by: Yu Zhao --- Documentation/DocBook/kernel-api.tmpl | 1 + Documentation/PCI/pci-iov-howto.txt | 222 +++++++++++++++++++++++++++++++++ 2 files changed, 223 insertions(+), 0 deletions(-) create mode 100644 Documentation/PCI/pci-iov-howto.txt diff --git a/Documentation/DocBook/kernel-api.tmpl b/Documentation/DocBook/kernel-api.tmpl index b7b1482..5cb6491 100644 --- a/Documentation/DocBook/kernel-api.tmpl +++ b/Documentation/DocBook/kernel-api.tmpl @@ -251,6 +251,7 @@ X!Edrivers/pci/hotplug.c --> !Edrivers/pci/probe.c !Edrivers/pci/rom.c +!Edrivers/pci/iov.c PCI Hotplug Support Library !Edrivers/pci/hotplug/pci_hotplug_core.c diff --git a/Documentation/PCI/pci-iov-howto.txt b/Documentation/PCI/pci-iov-howto.txt new file mode 100644 index 0000000..15d846d --- /dev/null +++ b/Documentation/PCI/pci-iov-howto.txt @@ -0,0 +1,222 @@ + PCI Express Single Root I/O Virtualization HOWTO + Copyright (C) 2008 Intel Corporation + + +1. Overview + +1.1 What is SR-IOV + +Single Root I/O Virtualization (SR-IOV) is a PCI Express Extended +capability which makes one physical device appear as multiple virtual +devices. The physical device is referred to as Physical Function while +the virtual devices are referred to as Virtual Functions. Allocation +of Virtual Functions can be dynamically controlled by Physical Function +via registers encapsulated in the capability. By default, this feature +is not enabled and the Physical Function behaves as traditional PCIe +device. Once it's turned on, each Virtual Function's PCI configuration +space can be accessed by its own Bus, Device and Function Number (Routing +ID). And each Virtual Function also has PCI Memory Space, which is used +to map its register set. Virtual Function device driver operates on the +register set so it can be functional and appear as a real existing PCI +device. + +2. User Guide + +2.1 How can I manage SR-IOV + +If a device supports SR-IOV, then there should be some entries under +Physical Function's PCI device directory. 
These entries are in directory:
+	- /sys/bus/pci/devices/XXXX:BB:DD.F/iov/
+	  (XXXX:BB:DD.F is domain:bus:dev.fn)
+and
+	- /sys/bus/pci/devices/XXXX:BB:DD.F/iov/N
+	  (N is the VF number, from 0 to initialvfs-1)
+
+To enable or disable SR-IOV:
+	- /sys/bus/pci/devices/XXXX:BB:DD.F/iov/enable
+	  (writing 1/0 enables/disables the VFs; the state change will
+	  notify the PF driver)
+
+To change the number of Virtual Functions:
+	- /sys/bus/pci/devices/XXXX:BB:DD.F/iov/numvfs
+	  (writing a positive integer to this file will change NumVFs)
+
+The total and initial number of VFs can be read from:
+	- /sys/bus/pci/devices/XXXX:BB:DD.F/iov/totalvfs
+	- /sys/bus/pci/devices/XXXX:BB:DD.F/iov/initialvfs
+
+The identifier of a VF that belongs to this PF can be read from:
+	- /sys/bus/pci/devices/XXXX:BB:DD.F/iov/N/rid
+
+2.2 How can I use Virtual Functions
+
+Virtual Functions are treated as hot-plugged PCI devices in the kernel,
+so they should be able to work in the same way as real PCI devices.
+NOTE: the Virtual Function device driver must be loaded for the VF to work.
+
+
+3. Developer Guide
+
+3.1 SR-IOV APIs
+
+To register the SR-IOV service, the Physical Function device driver needs
+to call:
+	int pci_iov_register(struct pci_dev *dev,
+		int (*notify)(struct pci_dev *, u32), char **entries)
+	The 'notify' argument is a callback function that the SR-IOV code
+	invokes when events related to VFs happen (e.g. the user reads or
+	writes the sysfs entries). The first argument is the PF itself;
+	the second argument is the event type and value. For now, the
+	following event types are supported:
+	- PCI_IOV_ENABLE: SR-IOV enable request
+	- PCI_IOV_DISABLE: SR-IOV disable request
+	- PCI_IOV_RD_CONF: read configuration
+	- PCI_IOV_WR_CONF: write configuration
+	- PCI_IOV_POST_EVENT: post event
+	Event values can be extracted using the following masks:
+	- PCI_IOV_VIRTFN_ID: Virtual Function Number
+	- PCI_IOV_NUM_VIRTFN: number of Virtual Functions
+	- PCI_IOV_EVENT_TYPE: event type (pre/post)
+	The 'entries' argument is a list of sysfs entry names that will be
+	created by the SR-IOV code.
+
+Note: 'entries' can be NULL if the PF driver doesn't want to create new
+entries under /sys/bus/pci/devices/XXXX:BB:DD.F/iov/N/.
+
+To unregister the SR-IOV service, the Physical Function device driver needs
+to call:
+	void pci_iov_unregister(struct pci_dev *dev)
+
+To enable SR-IOV, the Physical Function device driver needs to call:
+	int pci_iov_enable(struct pci_dev *dev, int numvfs)
+	'numvfs' is the number of VFs that the PF wants to enable.
+
+To disable SR-IOV, the Physical Function device driver needs to call:
+	void pci_iov_disable(struct pci_dev *dev)
+
+Note: the above two functions sleep for 1 second to wait for hardware
+transaction completion, as required by the SR-IOV specification.
+
+To read or write a VF's configuration:
+	- int pci_iov_read_config(struct pci_dev *dev, int vfn,
+			char *entry, char *buf, int size);
+	- int pci_iov_write_config(struct pci_dev *dev, int vfn,
+			char *entry, char *buf);
+
+3.2 Usage example
+
+The following piece of code illustrates the usage of the APIs above.
+
+static char *entries[] = { "foo", "bar", NULL };
+
+static int callback(struct pci_dev *dev, u32 event)
+{
+	int err;
+	int vfn;
+	int numvfs;
+
+	if (event & PCI_IOV_ENABLE) {
+		/*
+		 * request to enable SR-IOV, NumVFs is available.
+		 * Note: if the PF wants to support PM, it has to
+		 * check the device power state here to see if
+		 * the request is allowed or not.
+		 */
+
+		numvfs = event & PCI_IOV_NUM_VIRTFN;
+
+	} else if (event & PCI_IOV_DISABLE) {
+		/*
+		 * request to disable SR-IOV.
+		 */
+		...
+ + } else if (event & PCI_IOV_RD_CONF) { + /* + * request to read VF configuration, Virtual + * Function Number is available. + */ + + vfn = event & PCI_IOV_VIRTFN_ID; + + /* pass the config to SR-IOV code so user can read it */ + err = pci_iov_write_config(dev, vfn, entry, buf); + + } else if (event & PCI_IOV_WR_CONF) { + /* + * request to write VF configuration, Virtual + * Function Number is available. + */ + + vfn = event & PCI_IOV_VIRTFN_ID; + + /* read the config that has been written by user */ + err = pci_iov_read_config(dev, vfn, entry, buf, size); + + } else + return -EINVAL; + + return err; +} + +static int __devinit dev_probe(struct pci_dev *dev, + const struct pci_device_id *id) +{ + int err; + + err = pci_iov_register(dev, callback, entries); + ... + + err = pci_iov_enable(dev, nr_virtfn, callback); + + ... + + return err; +} + +static void __devexit dev_remove(struct pci_dev *dev) +{ + ... + + pci_iov_disable(dev); + + ... + + pci_iov_unregister(dev); + + ... +} + +#ifdef CONFIG_PM +/* + * If Physical Function supports the power management, then the + * SR-IOV needs to be disabled before the adapter goes to sleep, + * because Virtual Functions will not work when the adapter is in + * the power-saving mode. + * The SR-IOV can be enabled again after the adapter wakes up. + */ +static int dev_suspend(struct pci_dev *dev, pm_message_t state) +{ + ... + + pci_iov_disable(dev); + + ... +} + +static int dev_resume(struct pci_dev *dev) +{ + ... + + pci_iov_enable(dev, numvfs); + + ... +} +#endif + +static struct pci_driver dev_driver = { + .name = "SR-IOV Physical Function driver", + .id_table = dev_id_table, + .probe = dev_probe, + .remove = __devexit_p(dev_remove), +#ifdef CONFIG_PM + .suspend = dev_suspend, + .resume = dev_resume, +#endif +}; -- 1.5.6.4 From matthew at wil.cx Tue Oct 14 05:30:02 2008 From: matthew at wil.cx (Matthew Wilcox) Date: Tue, 14 Oct 2008 06:30:02 -0600 Subject: [PATCH 6/8 v4] PCI: support the SR-IOV capability In-Reply-To: <20081014105928.GF1734@yzhao12-linux.sh.intel.com> References: <20081014103424.GA1704@yzhao12-linux.sh.intel.com> <20081014105928.GF1734@yzhao12-linux.sh.intel.com> Message-ID: <20081014123002.GA15064@parisc-linux.org> On Tue, Oct 14, 2008 at 06:59:28PM +0800, Yu Zhao wrote: > +++ b/drivers/pci/pci.h > @@ -176,4 +176,59 @@ static inline int pci_ari_enabled(struct pci_dev *dev) > +struct pci_iov { > + int cap; /* capability position */ > + int align; /* page size used to map memory space */ > + int is_enabled; /* status of SR-IOV */ > + int nentries; /* number of sysfs entries used by PF driver */ > + u16 totalvfs; /* total VFs associated with the PF */ > + u16 initialvfs; /* initial VFs associated with the PF */ > + u16 numvfs; /* number of VFs available */ > + u16 offset; /* first VF Routing ID offset */ > + u16 stride; /* following VF stride */ > + struct mutex mutex; /* lock for SR-IOV */ > + struct kobject kobj; /* koject for IOV */ > + struct pci_dev *dev; /* Physical Function */ > + struct vf_entry *ve; /* Virtual Function related */ > + int (*notify)(struct pci_dev *, u32); /* event callback function */ > +}; > +++ b/include/linux/pci.h > @@ -87,6 +87,12 @@ enum { > /* #6: expansion ROM */ > PCI_ROM_RESOURCE, > > + /* device specific resources */ > +#ifdef CONFIG_PCI_IOV > + PCI_IOV_RESOURCES, > + PCI_IOV_RESOURCES_END = PCI_IOV_RESOURCES + PCI_IOV_NUM_BAR - 1, > +#endif > + > /* address space assigned to buses behind the bridge */ > #ifndef PCI_BRIDGE_RES_NUM > #define PCI_BRIDGE_RES_NUM 4 Why expand the number of resources in 
struct pci_dev instead of putting the new resources in struct pci_iov? -- Matthew Wilcox Intel Open Source Technology Centre "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." From greg at kroah.com Tue Oct 14 07:37:44 2008 From: greg at kroah.com (Greg KH) Date: Tue, 14 Oct 2008 07:37:44 -0700 Subject: [PATCH 6/8 v4] PCI: support the SR-IOV capability In-Reply-To: <20081014105928.GF1734@yzhao12-linux.sh.intel.com> References: <20081014103424.GA1704@yzhao12-linux.sh.intel.com> <20081014105928.GF1734@yzhao12-linux.sh.intel.com> Message-ID: <20081014143744.GA12251@kroah.com> On Tue, Oct 14, 2008 at 06:59:28PM +0800, Yu Zhao wrote: > +struct pci_iov { > + int cap; /* capability position */ > + int align; /* page size used to map memory space */ > + int is_enabled; /* status of SR-IOV */ > + int nentries; /* number of sysfs entries used by PF driver */ > + u16 totalvfs; /* total VFs associated with the PF */ > + u16 initialvfs; /* initial VFs associated with the PF */ > + u16 numvfs; /* number of VFs available */ > + u16 offset; /* first VF Routing ID offset */ > + u16 stride; /* following VF stride */ > + struct mutex mutex; /* lock for SR-IOV */ > + struct kobject kobj; /* koject for IOV */ Why isn't this a real struct device? That way you get all of the proper userspace notification and the like, with kobjects, you do not. thanks, greg k-h From anthony at codemonkey.ws Tue Oct 14 11:16:19 2008 From: anthony at codemonkey.ws (Anthony Liguori) Date: Tue, 14 Oct 2008 13:16:19 -0500 Subject: [PATCH][RFC] vmchannel a data channel between host and guest. In-Reply-To: <20081014175900.GA18344@redhat.com> References: <20081012124534.GK11435@redhat.com> <48F39443.4070203@codemonkey.ws> <20081014090540.GB13153@redhat.com> <48F4A3B8.8050603@us.ibm.com> <20081014175900.GA18344@redhat.com> Message-ID: <48F4E1F3.3050606@codemonkey.ws> Gleb Natapov wrote: > On Tue, Oct 14, 2008 at 08:50:48AM -0500, Anthony Liguori wrote: > >> Gleb Natapov wrote: >> >>> On Mon, Oct 13, 2008 at 01:32:35PM -0500, Anthony Liguori wrote: >>> >>> >>> netlink was designed to be interface to userspace and is used like this >>> by different subsystems (not just network). What full blown socket (and >>> by that I presume you mean new address family) will give you over netlink? >>> File system? We need a simple stream semantics is this justify another >>> virtual file system? The choice was between char device and netlink. >>> Nelink was simpler and gives broadcast as a bonus. >>> >>> >> The problem that you aren't solving, that IMHO is critical to solve, is >> the namespace issue. How do you determine who gets to use what channel >> in userspace and in the host? >> > Management software determines this. You have an image that is managed > by particular software and management daemons on the guest knows what > channels to use to communicate with their counterparts on the host. > Is there a need to provide more then that? > No, I don't think this is enough. I think we need to support multiple independent management agents using these interfaces so we can't rely on a single management too orchestrating who uses what channel. For instance, we'll have something that's open that does copy/paste and DnD, but then CIM people may want to have monitor agents that use the interface for monitoring. 
>> It's not a problem when you just have one >> tool, but if you expect people to start using this interface, >> arbitrating it quickly becomes a problem. >> > I expect that software on the host and on the guest will belong to the > same management solution. > There won't always be one set of tools that use this functionality. >> sockets have a concept of addressing and a vfs has a natural namespace. >> That's what I was suggesting those interfaces. >> >> > What address should look like if we will choose to use new address family? > An example will help me understand what problem you are trying to point out > easily. > One thing that's been discussed is to use something that looked much like struct sockaddr_un. As long as the strings were unique, they could be in whatever format people wanted. Of course, you should also take a look at VMware's VMCI. If we're going to have a socket interface, if we can have a compatible userspace interface, that would probably be a good thing. >>> >>> >>>> Having a limit of only 4 links seems like a problem to me too. >>>> >>>> >>>> >>> This can be easily extended. >>> >>> >> There shouldn't be an inherent limit in the userspace interface. >> >> > Well, qemu has those limits for all other interfaces (like number of > nics, serial ports, parallel ports), but if vmchannels are somehow > different in this regards there is no problem to dynamically grow their > number. > Having a limit in QEMU is fine, we just don't want the limit to be in the guest driver. It's relatively easy to increase the limit or make it dynamic in QEMU but if it requires guest-visible changes, that's much more difficult to fix. Regards, Anthony Liguori > > -- > Gleb. > From tony.luck at intel.com Tue Oct 14 14:46:41 2008 From: tony.luck at intel.com (Luck, Tony) Date: Tue, 14 Oct 2008 14:46:41 -0700 Subject: [PATCH 09/32] ia64/xen: add a necessary header file to compile include/xen/interface/xen.h In-Reply-To: <1223963507-28056-10-git-send-email-yamahata@valinux.co.jp> References: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> <1223963507-28056-10-git-send-email-yamahata@valinux.co.jp> Message-ID: <57C9024A16AD2D4C97DC78E552063EA353274D0A@orsmsx505.amr.corp.intel.com> +++ b/arch/ia64/include/asm/pvclock-abi.h @@ -0,0 +1,5 @@ +/* + * use same structure to x86's + * Hopefully asm-x86/pvclock-abi.h would be moved to somewhere more generic. + */ +#include I will trade out this patch for one that just makes a copy of the x86 include file. This #include will break if/when x86 moves their include files to arch/x86/include/asm -Tony From tony.luck at intel.com Tue Oct 14 14:58:48 2008 From: tony.luck at intel.com (Luck, Tony) Date: Tue, 14 Oct 2008 14:58:48 -0700 Subject: [PATCH 32/32] ia64/pv_ops: paravirtualized istruction checker. In-Reply-To: <1223963507-28056-33-git-send-email-yamahata@valinux.co.jp> References: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> <1223963507-28056-33-git-send-email-yamahata@valinux.co.jp> Message-ID: <57C9024A16AD2D4C97DC78E552063EA353274D37@orsmsx505.amr.corp.intel.com> > This patch implements a checker to detect instructions which > should be paravirtualized instead of direct writing raw instruction. > This patch does rough check so that it doesn't fully cover all cases, > but it can detects most cases of paravirtualization breakage of hand > written assembly codes. There are still some "itc.d" instructions in ivt.S (in the #ifndef CONFIG_SMP code). This checker caught them ... 
but the error messages from the build were not as elegant as they might be AS arch/ia64/kernel/pvchk-ivt.o arch/ia64/kernel/ivt.S: Assembler messages: arch/ia64/kernel/ivt.S:583: Warning: itc.d should not be used directly. arch/ia64/kernel/ivt.S:583: Error: junk at end of line, first unrecognized character is `r' arch/ia64/kernel/ivt.S:649: Warning: itc.i should not be used directly. arch/ia64/kernel/ivt.S:649: Error: junk at end of line, first unrecognized character is `r' arch/ia64/kernel/ivt.S:701: Warning: itc.d should not be used directly. arch/ia64/kernel/ivt.S:701: Error: junk at end of line, first unrecognized character is `r' make[1]: *** [arch/ia64/kernel/pvchk-ivt.o] Error 1 make: *** [arch/ia64/kernel] Error 2 I'll hold off on applying part32/32 until the CONFIG_SMP=n case is fixed. -Tony From jeremy at goop.org Tue Oct 14 15:42:46 2008 From: jeremy at goop.org (Jeremy Fitzhardinge) Date: Tue, 14 Oct 2008 15:42:46 -0700 Subject: [PATCH 09/32] ia64/xen: add a necessary header file to compile include/xen/interface/xen.h In-Reply-To: <57C9024A16AD2D4C97DC78E552063EA353274D0A@orsmsx505.amr.corp.intel.com> References: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> <1223963507-28056-10-git-send-email-yamahata@valinux.co.jp> <57C9024A16AD2D4C97DC78E552063EA353274D0A@orsmsx505.amr.corp.intel.com> Message-ID: <48F52066.5040604@goop.org> Luck, Tony wrote: > +++ b/arch/ia64/include/asm/pvclock-abi.h > @@ -0,0 +1,5 @@ > +/* > + * use same structure to x86's > + * Hopefully asm-x86/pvclock-abi.h would be moved to somewhere more generic. > + */ > +#include > > > I will trade out this patch for one that just makes > a copy of the x86 include file. This #include will > break if/when x86 moves their include files to > arch/x86/include/asm > I think Ingo is planning it fairly soon. J From yu.zhao at intel.com Tue Oct 14 19:04:31 2008 From: yu.zhao at intel.com (Zhao, Yu) Date: Wed, 15 Oct 2008 10:04:31 +0800 Subject: [PATCH 6/8 v4] PCI: support the SR-IOV capability In-Reply-To: <20081014123002.GA15064@parisc-linux.org> References: <20081014103424.GA1704@yzhao12-linux.sh.intel.com> <20081014105928.GF1734@yzhao12-linux.sh.intel.com> <20081014123002.GA15064@parisc-linux.org> Message-ID: <48F54FAF.7060204@intel.com> Matthew Wilcox wrote: > On Tue, Oct 14, 2008 at 06:59:28PM +0800, Yu Zhao wrote: >> +++ b/include/linux/pci.h >> @@ -87,6 +87,12 @@ enum { >> /* #6: expansion ROM */ >> PCI_ROM_RESOURCE, >> >> + /* device specific resources */ >> +#ifdef CONFIG_PCI_IOV >> + PCI_IOV_RESOURCES, >> + PCI_IOV_RESOURCES_END = PCI_IOV_RESOURCES + PCI_IOV_NUM_BAR - 1, >> +#endif >> + >> /* address space assigned to buses behind the bridge */ >> #ifndef PCI_BRIDGE_RES_NUM >> #define PCI_BRIDGE_RES_NUM 4 > > Why expand the number of resources in struct pci_dev instead of putting > the new resources in struct pci_iov? Yes, it's supposed to be in the 'struct pci_iov', and the resources used to be there in early version. But later I found all resource related functions such as pci_assign_resource, pdev_sort_resources, pbus_size_mem, etc. assume the resources are bundled with 'struct pci_dev' and address them using their indexes. Encapsulating resources into 'pci_iov' will impact all these functions. And I think we can postpone the change of these functions until the PCIM comes out, if the IOV is the only one who uses non-standard resources. 
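To make the indexing point concrete, here is a rough sketch (not code from the posted series; the helper name is invented, and it assumes the PCI_IOV_RESOURCES enum added by the quoted hunk) of how anything that walks the pci_dev resource table by index also reaches the VF BARs once they occupy the PCI_IOV_RESOURCES..PCI_IOV_RESOURCES_END slots:

/* Sketch only: assumes CONFIG_PCI_IOV and the enum from the patch above. */
static void pf_show_vf_bars(struct pci_dev *dev)
{
	int i;

	for (i = PCI_IOV_RESOURCES; i <= PCI_IOV_RESOURCES_END; i++) {
		struct resource *res = &dev->resource[i];

		if (!res->flags)	/* BAR not implemented */
			continue;

		dev_info(&dev->dev, "VF BAR%d: %#llx-%#llx\n",
			 i - PCI_IOV_RESOURCES,
			 (unsigned long long)res->start,
			 (unsigned long long)res->end);
	}
}

Keeping the VF BARs in dev->resource[] is what lets pci_assign_resource() and the bus sizing code handle them without interface changes; the trade-off, as noted above, is a larger resource array in every pci_dev.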
> > -- > Matthew Wilcox Intel Open Source Technology Centre > "Bill, look, we understand that you're interested in selling us this > operating system, but compare it to ours. We can't possibly take such > a retrograde step." From yamahata at valinux.co.jp Tue Oct 14 20:18:26 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Wed, 15 Oct 2008 12:18:26 +0900 Subject: [PATCH 09/32] ia64/xen: add a necessary header file to compile include/xen/interface/xen.h In-Reply-To: <57C9024A16AD2D4C97DC78E552063EA353274D0A@orsmsx505.amr.corp.intel.com> References: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> <1223963507-28056-10-git-send-email-yamahata@valinux.co.jp> <57C9024A16AD2D4C97DC78E552063EA353274D0A@orsmsx505.amr.corp.intel.com> Message-ID: <20081015031826.GC6061%yamahata@valinux.co.jp> On Tue, Oct 14, 2008 at 02:46:41PM -0700, Luck, Tony wrote: > +++ b/arch/ia64/include/asm/pvclock-abi.h > @@ -0,0 +1,5 @@ > +/* > + * use same structure to x86's > + * Hopefully asm-x86/pvclock-abi.h would be moved to somewhere more generic. > + */ > +#include > > > I will trade out this patch for one that just makes > a copy of the x86 include file. This #include will > break if/when x86 moves their include files to > arch/x86/include/asm I see. Then the first version is preferable. Here is the one. From yamahata at valinux.co.jp Tue Oct 14 20:20:41 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Wed, 15 Oct 2008 12:20:41 +0900 Subject: [PATCH 32/32] ia64/pv_ops: paravirtualized istruction checker. In-Reply-To: <57C9024A16AD2D4C97DC78E552063EA353274D37@orsmsx505.amr.corp.intel.com> References: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> <1223963507-28056-33-git-send-email-yamahata@valinux.co.jp> <57C9024A16AD2D4C97DC78E552063EA353274D37@orsmsx505.amr.corp.intel.com> Message-ID: <20081015032041.GD6061%yamahata@valinux.co.jp> On Tue, Oct 14, 2008 at 02:58:48PM -0700, Luck, Tony wrote: > > This patch implements a checker to detect instructions which > > should be paravirtualized instead of direct writing raw instruction. > > This patch does rough check so that it doesn't fully cover all cases, > > but it can detects most cases of paravirtualization breakage of hand > > written assembly codes. > > There are still some "itc.d" instructions in ivt.S (in the #ifndef > CONFIG_SMP code). This checker caught them ... but the error messages > from the build were not as elegant as they might be > > AS arch/ia64/kernel/pvchk-ivt.o > arch/ia64/kernel/ivt.S: Assembler messages: > arch/ia64/kernel/ivt.S:583: Warning: itc.d should not be used directly. > arch/ia64/kernel/ivt.S:583: Error: junk at end of line, first unrecognized character is `r' > arch/ia64/kernel/ivt.S:649: Warning: itc.i should not be used directly. > arch/ia64/kernel/ivt.S:649: Error: junk at end of line, first unrecognized character is `r' > arch/ia64/kernel/ivt.S:701: Warning: itc.d should not be used directly. > arch/ia64/kernel/ivt.S:701: Error: junk at end of line, first unrecognized character is `r' > make[1]: *** [arch/ia64/kernel/pvchk-ivt.o] Error 1 > make: *** [arch/ia64/kernel] Error 2 > > > I'll hold off on applying part32/32 until the CONFIG_SMP=n case > is fixed. Here is the updated pv checker patch. sed script is updated such that the trailing strings are eliminated. From yamahata at valinux.co.jp Tue Oct 14 20:22:43 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Wed, 15 Oct 2008 12:22:43 +0900 Subject: [PATCH 32/32] ia64/pv_ops: paravirtualized istruction checker. 
In-Reply-To: <57C9024A16AD2D4C97DC78E552063EA353274D37@orsmsx505.amr.corp.intel.com> References: <1223963507-28056-1-git-send-email-yamahata@valinux.co.jp> <1223963507-28056-33-git-send-email-yamahata@valinux.co.jp> <57C9024A16AD2D4C97DC78E552063EA353274D37@orsmsx505.amr.corp.intel.com> Message-ID: <20081015032243.GE6061%yamahata@valinux.co.jp> On Tue, Oct 14, 2008 at 02:58:48PM -0700, Luck, Tony wrote: > > This patch implements a checker to detect instructions which > > should be paravirtualized instead of direct writing raw instruction. > > This patch does rough check so that it doesn't fully cover all cases, > > but it can detects most cases of paravirtualization breakage of hand > > written assembly codes. > > There are still some "itc.d" instructions in ivt.S (in the #ifndef > CONFIG_SMP code). This checker caught them ... but the error messages > from the build were not as elegant as they might be > > AS arch/ia64/kernel/pvchk-ivt.o > arch/ia64/kernel/ivt.S: Assembler messages: > arch/ia64/kernel/ivt.S:583: Warning: itc.d should not be used directly. > arch/ia64/kernel/ivt.S:583: Error: junk at end of line, first unrecognized character is `r' > arch/ia64/kernel/ivt.S:649: Warning: itc.i should not be used directly. > arch/ia64/kernel/ivt.S:649: Error: junk at end of line, first unrecognized character is `r' > arch/ia64/kernel/ivt.S:701: Warning: itc.d should not be used directly. > arch/ia64/kernel/ivt.S:701: Error: junk at end of line, first unrecognized character is `r' > make[1]: *** [arch/ia64/kernel/pvchk-ivt.o] Error 1 > make: *** [arch/ia64/kernel] Error 2 > > > I'll hold off on applying part32/32 until the CONFIG_SMP=n case > is fixed. > > -Tony > Here is the patch to fix the CONFIG_SMP=n case. BTW with CONFIG_SMP=n, I got the following error so that I had to disable kvm. arch/ia64/kvm/vmm.c: In function 'vmm_spin_unlock': arch/ia64/kvm/vmm.c:63: error: 'raw_spinlock_t' has no member named 'lock' From yamahata at valinux.co.jp Tue Oct 14 19:48:39 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Wed, 15 Oct 2008 11:48:39 +0900 Subject: [PATCH] ia64/pv_ops: fix paraviatualization of ivt.S with CONFIG_SMP=n Message-ID: When CONFIG_SMP=n, three instruction in ivt.S were missed to paravirtualize. paravirtualize them. Cc: "Luck, Tony" Signed-off-by: Isaku Yamahata --- arch/ia64/kernel/ivt.S | 6 +++--- 1 files changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/ia64/kernel/ivt.S b/arch/ia64/kernel/ivt.S index 416a952..f675d8e 100644 --- a/arch/ia64/kernel/ivt.S +++ b/arch/ia64/kernel/ivt.S @@ -580,7 +580,7 @@ ENTRY(dirty_bit) mov b0=r29 // restore b0 ;; st8 [r17]=r18 // store back updated PTE - itc.d r18 // install updated PTE + ITC_D(p0, r18, r16) // install updated PTE #endif mov pr=r31,-1 // restore pr RFI @@ -646,7 +646,7 @@ ENTRY(iaccess_bit) mov b0=r29 // restore b0 ;; st8 [r17]=r18 // store back updated PTE - itc.i r18 // install updated PTE + ITC_I(p0, r18, r16) // install updated PTE #endif /* !CONFIG_SMP */ mov pr=r31,-1 RFI @@ -698,7 +698,7 @@ ENTRY(daccess_bit) or r18=_PAGE_A,r18 // set the accessed bit ;; st8 [r17]=r18 // store back updated PTE - itc.d r18 // install updated PTE + ITC_D(p0, r18, r16) // install updated PTE #endif mov b0=r29 // restore b0 mov pr=r31,-1 -- 1.6.0.2 -- yamahata From anthony at codemonkey.ws Wed Oct 15 07:02:30 2008 From: anthony at codemonkey.ws (Anthony Liguori) Date: Wed, 15 Oct 2008 09:02:30 -0500 Subject: [PATCH][RFC] vmchannel a data channel between host and guest. 
In-Reply-To: <20081015125837.GQ11435@redhat.com> References: <20081012124534.GK11435@redhat.com> <48F39443.4070203@codemonkey.ws> <20081014090540.GB13153@redhat.com> <48F4A3B8.8050603@us.ibm.com> <20081014175900.GA18344@redhat.com> <48F4E1F3.3050606@codemonkey.ws> <20081015125837.GQ11435@redhat.com> Message-ID: <48F5F7F6.5010208@codemonkey.ws> Gleb Natapov wrote: > On Tue, Oct 14, 2008 at 01:16:19PM -0500, Anthony Liguori wrote: > >> One thing that's been discussed is to use something that looked much >> > Where is has been discussed? Was it on a public mailing list with online > archive? > Probably? This subject has been discussed to death in various places (including within the Xen community). >> like struct sockaddr_un. As long as the strings were unique, they could >> be in whatever format people wanted. >> >> > So basically what you are saying is that you want to use string IDs instead of > numerical IDs in a hope that the chance of colliding IDs will be smaller? (in the > current implementation ID length is 32 bit, so the chances for two services to > pick the same ID is small). > But people don't choose random 32-bit integers and since your implementation only supports channels 0..4 obviously, the intention isn't to choose random integers. When using integers, it would be necessary to have some sort of authority that assigns out the integers to avoid conflict. A protocol like this scales better if such an authority isn't necessary. > But why pick constant ID for a service at all? Management software can > assign unique IDs for each service during image preparation. First, not everyone has "management software". Even so, it's not the center of the world. If I want to add a new feature to QEMU that makes use of one of these channels (say Unity/Coherence), does that mean I now have to tell every management tool (libvirt et al) about this interface so they can assign an ID to it? How does the guest software know what channel to use? You basically assume yet another host<=>guest communication mechanism to tell the guest software what channel to use. That seems to defeat the purpose. > So one > management software will use channel 42 for DnD and 22 for CIM and another > will use 13 for DnD and 42 for CIM. All is need is to not hard code > channel IDs into agents. You will not be able to move such images from one > management software to another easily of cause, but this task is not so easy > today too. > It's so much simpler to just use unique identifiers for each service. Be it UUID, a string in the form of a reverse fqdn or URL, or whatever. >> Of course, you should also take a look at VMware's VMCI. If we're going >> to have a socket interface, if we can have a compatible userspace >> interface, that would probably be a good thing. >> >> > As good as VMware backdoor interface that we chose not to use because we > can't control it? > I suggested you look at VMCI mainly to see the addressing mechanism. AF_IUCV is something else to look at although there's a lot of legacy there. I'm not suggesting we be binary compatible with VMCI, but if their addressing mechanism is sufficiently general (like a string), then I see no reason not to use the same addressing mechanism or something similar to it. > If you like string IDs better > than numerical IDs and you are OK with "lookup by name" way of doing > things in VMCI I can easily add channel 0 (will be implemented by qemu > itself and thus always present) that will do name to ID mapping. > It's not a bad idea to have a bootstrap channel. 
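As a purely illustrative sketch of the string-keyed addressing being argued for here (AF_VMCHANNEL and everything below are hypothetical and appear in none of the posted patches), a sockaddr_un-style address lets each service pick a unique name instead of a centrally assigned integer:

/* Hypothetical userspace view; nothing here exists in the posted code. */
#include <sys/socket.h>

#define VMCHANNEL_NAME_MAX 108

struct sockaddr_vmchannel {
	sa_family_t svmc_family;              /* would be AF_VMCHANNEL */
	char svmc_name[VMCHANNEL_NAME_MAX];   /* e.g. "org.qemu.clipboard" */
};

A guest agent would then fill in svmc_name and call an ordinary connect(), and no authority has to hand out channel numbers.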
Do channel exist during the entirely life time of the guest? Can disconnects/reconnects happen on a channel? Can a guest listen on a channel? Certainly, sockets are a pretty well established model so it makes a certain amount of sense to have these things behave like traditional sockets. Regards, Anthony Liguori From gleb at redhat.com Wed Oct 15 05:58:37 2008 From: gleb at redhat.com (Gleb Natapov) Date: Wed, 15 Oct 2008 14:58:37 +0200 Subject: [PATCH][RFC] vmchannel a data channel between host and guest. In-Reply-To: <48F4E1F3.3050606@codemonkey.ws> References: <20081012124534.GK11435@redhat.com> <48F39443.4070203@codemonkey.ws> <20081014090540.GB13153@redhat.com> <48F4A3B8.8050603@us.ibm.com> <20081014175900.GA18344@redhat.com> <48F4E1F3.3050606@codemonkey.ws> Message-ID: <20081015125837.GQ11435@redhat.com> On Tue, Oct 14, 2008 at 01:16:19PM -0500, Anthony Liguori wrote: >>> sockets have a concept of addressing and a vfs has a natural >>> namespace. That's what I was suggesting those interfaces. >>> >>> >> What address should look like if we will choose to use new address family? >> An example will help me understand what problem you are trying to point out >> easily. >> > > One thing that's been discussed is to use something that looked much Where is has been discussed? Was it on a public mailing list with online archive? > like struct sockaddr_un. As long as the strings were unique, they could > be in whatever format people wanted. > So basically what you are saying is that you want to use string IDs instead of numerical IDs in a hope that the chance of colliding IDs will be smaller? (in the current implementation ID length is 32 bit, so the chances for two services to pick the same ID is small). But why pick constant ID for a service at all? Management software can assign unique IDs for each service during image preparation. So one management software will use channel 42 for DnD and 22 for CIM and another will use 13 for DnD and 42 for CIM. All is need is to not hard code channel IDs into agents. You will not be able to move such images from one management software to another easily of cause, but this task is not so easy today too. > Of course, you should also take a look at VMware's VMCI. If we're going > to have a socket interface, if we can have a compatible userspace > interface, that would probably be a good thing. > As good as VMware backdoor interface that we chose not to use because we can't control it? I looked at what I could find about VMCI (http://pubs.vmware.com/vmci-sdk/index.html). Wow, it looks like somebody was assigned a task to design a cross platform communication layer that should accommodate every past, present and future requirement no matter how likely or weird it may be :) But seriously if we drop all the cross platform craft from there and add name resolution server to proposed vmchannel we will have something very similar (sans shared memory of cause). If you like string IDs better than numerical IDs and you are OK with "lookup by name" way of doing things in VMCI I can easily add channel 0 (will be implemented by qemu itself and thus always present) that will do name to ID mapping. But why stop there. Lets run CORBA name service on channel 0 and run CORBA objects on others (joke). >>>> >>>>> Having a limit of only 4 links seems like a problem to me too. >>>>> >>>>> >>>> This can be easily extended. >>>> >>> There shouldn't be an inherent limit in the userspace interface. 
>>> >>> >> Well, qemu has those limits for all other interfaces (like number of >> nics, serial ports, parallel ports), but if vmchannels are somehow >> different in this regards there is no problem to dynamically grow their >> number. >> > > Having a limit in QEMU is fine, we just don't want the limit to be in > the guest driver. It's relatively easy to increase the limit or make it > dynamic in QEMU but if it requires guest-visible changes, that's much > more difficult to fix. > That's OK then. There is no any compile time limits on a number of channels in the current Linux driver. The number is only limited by PCI configuration space as I pass available channel IDs there. -- Gleb. From biggadike at vmware.com Wed Oct 15 07:18:52 2008 From: biggadike at vmware.com (Andrew Biggadike) Date: Wed, 15 Oct 2008 07:18:52 -0700 Subject: [PATCH][RFC] vmchannel a data channel between host and guest. In-Reply-To: <20081015125837.GQ11435@redhat.com> References: <20081012124534.GK11435@redhat.com> <48F39443.4070203@codemonkey.ws> <20081014090540.GB13153@redhat.com> <48F4A3B8.8050603@us.ibm.com> <20081014175900.GA18344@redhat.com> <48F4E1F3.3050606@codemonkey.ws> <20081015125837.GQ11435@redhat.com> Message-ID: <20081015141852.GA19554@vmware.com> Gleb Natapov wrote: > > Of course, you should also take a look at VMware's VMCI. If we're going > > to have a socket interface, if we can have a compatible userspace > > interface, that would probably be a good thing. > > I looked at what I could find about VMCI (http://pubs.vmware.com/vmci-sdk/index.html). I believe Anthony intended for you to look at the sockets interface to VMCI: http://www.vmware.com/pdf/ws65_s2_vmci_sockets.pdf. The link you referenced was experimental and is actually now deprecated in favor of the sockets interface. From gleb at redhat.com Wed Oct 15 07:30:29 2008 From: gleb at redhat.com (Gleb Natapov) Date: Wed, 15 Oct 2008 16:30:29 +0200 Subject: [PATCH][RFC] vmchannel a data channel between host and guest. In-Reply-To: <20081015141852.GA19554@vmware.com> References: <20081012124534.GK11435@redhat.com> <48F39443.4070203@codemonkey.ws> <20081014090540.GB13153@redhat.com> <48F4A3B8.8050603@us.ibm.com> <20081014175900.GA18344@redhat.com> <48F4E1F3.3050606@codemonkey.ws> <20081015125837.GQ11435@redhat.com> <20081015141852.GA19554@vmware.com> Message-ID: <20081015143029.GR11435@redhat.com> On Wed, Oct 15, 2008 at 07:18:52AM -0700, Andrew Biggadike wrote: > Gleb Natapov wrote: > > > Of course, you should also take a look at VMware's VMCI. If we're going > > > to have a socket interface, if we can have a compatible userspace > > > interface, that would probably be a good thing. > > > > I looked at what I could find about VMCI (http://pubs.vmware.com/vmci-sdk/index.html). > > I believe Anthony intended for you to look at the sockets interface to > VMCI: http://www.vmware.com/pdf/ws65_s2_vmci_sockets.pdf. > Thanks for the link! Are you going to push this interface into upstream kernel? > The link you referenced was experimental and is actually now > deprecated in favor of the sockets interface. You should influence google somehow to drop old link. That the first search result for "vmci vmware" ;) -- Gleb. From biggadike at vmware.com Wed Oct 15 08:00:51 2008 From: biggadike at vmware.com (Andrew Biggadike) Date: Wed, 15 Oct 2008 08:00:51 -0700 Subject: [PATCH][RFC] vmchannel a data channel between host and guest. 
In-Reply-To: <20081015143029.GR11435@redhat.com> References: <20081012124534.GK11435@redhat.com> <48F39443.4070203@codemonkey.ws> <20081014090540.GB13153@redhat.com> <48F4A3B8.8050603@us.ibm.com> <20081014175900.GA18344@redhat.com> <48F4E1F3.3050606@codemonkey.ws> <20081015125837.GQ11435@redhat.com> <20081015141852.GA19554@vmware.com> <20081015143029.GR11435@redhat.com> Message-ID: <20081015150051.GB19554@vmware.com> Gleb Natapov wrote: > On Wed, Oct 15, 2008 at 07:18:52AM -0700, Andrew Biggadike wrote: > > Gleb Natapov wrote: > > > > Of course, you should also take a look at VMware's VMCI. If we're going > > > > to have a socket interface, if we can have a compatible userspace > > > > interface, that would probably be a good thing. > > > > > > I looked at what I could find about VMCI (http://pubs.vmware.com/vmci-sdk/index.html). > > > I believe Anthony intended for you to look at the sockets interface to > > VMCI: http://www.vmware.com/pdf/ws65_s2_vmci_sockets.pdf. > > > Thanks for the link! Are you going to push this interface into upstream kernel? Pushing it upstream is definitely something we want to do, since it will make our development easier and allow us to clean up some unnecessary code in the module. That said, before we can start that effort we have some issues to resolve internally (for example, getting our upgrade story straight in this model and ensuring our infrastructure / installer can handle it) so we don't have an estimate for when this will happen just yet. There are people here whose jobs are to resolve these things and they are working on it, so I am confident this will happen even if slower than we'd all like. > > The link you referenced was experimental and is actually now > > deprecated in favor of the sockets interface. > You should influence google somehow to drop old link. That the first > search result for "vmci vmware" ;) You know, we talked a while ago about putting a notice at the top of that page stating what I said and redirecting to the VMCI Sockets programming guide, but somehow that seems to have not gotten done. From anthony at codemonkey.ws Wed Oct 15 08:56:02 2008 From: anthony at codemonkey.ws (Anthony Liguori) Date: Wed, 15 Oct 2008 10:56:02 -0500 Subject: [PATCH][RFC] vmchannel a data channel between host and guest. In-Reply-To: <20081015154212.GS11435@redhat.com> References: <20081012124534.GK11435@redhat.com> <48F39443.4070203@codemonkey.ws> <20081014090540.GB13153@redhat.com> <48F4A3B8.8050603@us.ibm.com> <20081014175900.GA18344@redhat.com> <48F4E1F3.3050606@codemonkey.ws> <20081015125837.GQ11435@redhat.com> <20081015141852.GA19554@vmware.com> <20081015154212.GS11435@redhat.com> Message-ID: <48F61292.9020504@codemonkey.ws> Gleb Natapov wrote: > Andrew, > > On Wed, Oct 15, 2008 at 07:18:52AM -0700, Andrew Biggadike wrote: > >> Gleb Natapov wrote: >> >>>> Of course, you should also take a look at VMware's VMCI. If we're going >>>> to have a socket interface, if we can have a compatible userspace >>>> interface, that would probably be a good thing. >>>> >>> I looked at what I could find about VMCI (http://pubs.vmware.com/vmci-sdk/index.html). >>> >> I believe Anthony intended for you to look at the sockets interface to >> VMCI: http://www.vmware.com/pdf/ws65_s2_vmci_sockets.pdf. >> >> > Using VMCI socket requires loading kernel module in a guest and in a host. > Is this correct? > Note that their addressing scheme uses a CID/port pair. I think it's interesting and somewhat safe because it basically mirrors an IP/port pair. 
That makes it relatively safe because that addressing mechanism is well known (with it's advantages and flaws). For instance, you need some sort of authority to assign out ports. It doesn't really help with discovery either. Another possibility would be to have the address be like sockaddr_un. You could actually have it be file paths. The effect would be that any VMs that can communicate with each other could have a common namespace. You could extend the analogy and actually create controllable permissions that could be used to control who can talk to who. You could even create a synthetic filesystem in the guest that could mount this namespace allowing very sophisticated enumeration/permission control. This is probably the complete opposite end in terms of having a novel interface. The best solution is probably somewhere between the two. Regards, Anthony Liguori > -- > Gleb. > From biggadike at vmware.com Wed Oct 15 09:59:36 2008 From: biggadike at vmware.com (Andrew Biggadike) Date: Wed, 15 Oct 2008 09:59:36 -0700 Subject: [PATCH][RFC] vmchannel a data channel between host and guest. In-Reply-To: <20081015154212.GS11435@redhat.com> References: <20081012124534.GK11435@redhat.com> <48F39443.4070203@codemonkey.ws> <20081014090540.GB13153@redhat.com> <48F4A3B8.8050603@us.ibm.com> <20081014175900.GA18344@redhat.com> <48F4E1F3.3050606@codemonkey.ws> <20081015125837.GQ11435@redhat.com> <20081015141852.GA19554@vmware.com> <20081015154212.GS11435@redhat.com> Message-ID: <20081015165936.GA1770@vmware.com> Gleb Natapov wrote: > On Wed, Oct 15, 2008 at 07:18:52AM -0700, Andrew Biggadike wrote: > > Gleb Natapov wrote: > > > > Of course, you should also take a look at VMware's VMCI. If we're going > > > > to have a socket interface, if we can have a compatible userspace > > > > interface, that would probably be a good thing. > > > > > > I looked at what I could find about VMCI (http://pubs.vmware.com/vmci-sdk/index.html). > > > > I believe Anthony intended for you to look at the sockets interface to > > VMCI: http://www.vmware.com/pdf/ws65_s2_vmci_sockets.pdf. > > > Using VMCI socket requires loading kernel module in a guest and in a host. > Is this correct? Yes, any context (in VMCI terms) that wants to allow for VMCI Socket endpoints needs both the vmci and the vsock kernel modules loaded. In case you're asking because you're going to try it out, note that our currently released version of VMCI Sockets (with Workstation 6.5) does not yet support SOCK_STREAM on the host, just guests. That gets a lot of people at the moment. From rusty at rustcorp.com.au Wed Oct 15 21:43:38 2008 From: rusty at rustcorp.com.au (Rusty Russell) Date: Thu, 16 Oct 2008 15:43:38 +1100 Subject: [PATCH 2/2] virtio_net: Improve the recv buffer allocation scheme In-Reply-To: <20081009153035.GA21542@gondor.apana.org.au> References: <1223494499-18732-1-git-send-email-markmc@redhat.com> <200810091155.59731.rusty@rustcorp.com.au> <20081009153035.GA21542@gondor.apana.org.au> Message-ID: <200810161543.39812.rusty@rustcorp.com.au> On Friday 10 October 2008 02:30:35 Herbert Xu wrote: > On Thu, Oct 09, 2008 at 11:55:59AM +1100, Rusty Russell wrote: > > Secondly, we can put the virtio_net_hdr at the head of the skb data (this > > is also worth considering for xmit I think if we have headroom) and drop > > MAX_SKB_FRAGS which contains a gratuitous +2. > > That's fine but having skb->data in the ring still means two > different kinds of memory in there and it sucks when you only > have 1500-byte packets. 
No, you really want to do this for 1500 byte packets since it increases the effective space in the ring. Unfortunately, Mark points out that kvm assumes the header is standalone: Anthony and I discussed this a while back and decided it *wasn't* a good assumption. TODO: YA feature bit... > We need a scheme that handles both 1500-byte packets as well > as 64K-byte size ones, and without holding down 16M of memory > per guest. Ah, thanks for that. It's not so much ring entries, as guest memory you're trying to save. That makes much more sense. > > > + char *p = page_address(skb_shinfo(skb)->frags[0].page); > > > > ... > > > > > + memcpy(hdr, p, sizeof(*hdr)); > > > + p += sizeof(*hdr); > > > > I think you need kmap_atomic() here to access the page. And yes, that > > will effect performance :( > > No we don't. kmap would only be necessary for highmem which we > did not request. Good point. Could you humor me with a comment to that effect? Prevents me making the same mistake again. Thanks! Rusty. PS. Laptop broke, was MIA for a week. Working overtime now. From rusty at rustcorp.com.au Wed Oct 15 22:08:53 2008 From: rusty at rustcorp.com.au (Rusty Russell) Date: Thu, 16 Oct 2008 16:08:53 +1100 Subject: [PATCH 1/2] virtio_net: Recycle some more rx buffer pages In-Reply-To: <1223494499-18732-1-git-send-email-markmc@redhat.com> References: <> <1223494499-18732-1-git-send-email-markmc@redhat.com> Message-ID: <200810161608.54017.rusty@rustcorp.com.au> On Thursday 09 October 2008 06:34:58 Mark McLoughlin wrote: > Each time we re-fill the recv queue with buffers, we allocate > one too many skbs and free it again when adding fails. We should > recycle the pages allocated in this case. > > A previous version of this patch made trim_pages() trim trailing > unused pages from skbs with some paged data, but this actually > caused a barely measurable slowdown. Yes, I noticed a similar effect. Not quite sure why though. Applied. Thanks! Rusty. From rusty at rustcorp.com.au Thu Oct 16 02:15:49 2008 From: rusty at rustcorp.com.au (Rusty Russell) Date: Thu, 16 Oct 2008 20:15:49 +1100 Subject: [PATCH 2/2] virtio_net: Improve the recv buffer allocation scheme In-Reply-To: <48EE5AE1.5030002@codemonkey.ws> References: <1223494499-18732-1-git-send-email-markmc@redhat.com> <1223574013.13792.23.camel@blaa> <48EE5AE1.5030002@codemonkey.ws> Message-ID: <200810162015.49654.rusty@rustcorp.com.au> On Friday 10 October 2008 06:26:25 Anthony Liguori wrote: > Mark McLoughlin wrote: > > Also, including virtio_net_hdr in the data buffer would need another > > feature flag. Rightly or wrongly, KVM's implementation requires > > virtio_net_hdr to be the first buffer: > > > > if (elem.in_num < 1 || elem.in_sg[0].iov_len != sizeof(*hdr)) { > > fprintf(stderr, "virtio-net header not in first element\n"); > > exit(1); > > } > > > > i.e. it's part of the ABI ... at least as KVM sees it :-) > > This is actually something that's broken in a nasty way. Having the > header in the first element is not supposed to be part of the ABI but it > sort of has to be ATM. > > If an older version of QEMU were to use a newer kernel, and the newer > kernel had a larger header size, then if we just made the header be the > first X bytes, QEMU has no way of knowing how many bytes that should be. > Instead, the guest actually has to allocate the virtio-net header in > such a way that it only presents the size depending on the features that > the host supports. 
We don't use a simple versioning scheme, so you'd > have to check for a combination of features advertised by the host but > that's not good enough because the host may disable certain features. > > Perhaps the header size is whatever the longest element that has been > commonly negotiated? Yes. The feature implies the header extension. Not knowing implies no extension is possible. Rusty.
From yamahata at valinux.co.jp Thu Oct 16 19:17:41 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Fri, 17 Oct 2008 11:17:41 +0900 Subject: [PATCH 01/33] ia64/pv_ops: fix paraviatualization of ivt.S with CONFIG_SMP=n In-Reply-To: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> References: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224209893-2032-2-git-send-email-yamahata@valinux.co.jp> When CONFIG_SMP=n, three instruction in ivt.S were missed to paravirtualize. paravirtualize them.
Cc: "Luck, Tony" Signed-off-by: Isaku Yamahata --- arch/ia64/kernel/ivt.S | 6 +++--- 1 files changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/ia64/kernel/ivt.S b/arch/ia64/kernel/ivt.S index 416a952..f675d8e 100644 --- a/arch/ia64/kernel/ivt.S +++ b/arch/ia64/kernel/ivt.S @@ -580,7 +580,7 @@ ENTRY(dirty_bit) mov b0=r29 // restore b0 ;; st8 [r17]=r18 // store back updated PTE - itc.d r18 // install updated PTE + ITC_D(p0, r18, r16) // install updated PTE #endif mov pr=r31,-1 // restore pr RFI @@ -646,7 +646,7 @@ ENTRY(iaccess_bit) mov b0=r29 // restore b0 ;; st8 [r17]=r18 // store back updated PTE - itc.i r18 // install updated PTE + ITC_I(p0, r18, r16) // install updated PTE #endif /* !CONFIG_SMP */ mov pr=r31,-1 RFI @@ -698,7 +698,7 @@ ENTRY(daccess_bit) or r18=_PAGE_A,r18 // set the accessed bit ;; st8 [r17]=r18 // store back updated PTE - itc.d r18 // install updated PTE + ITC_D(p0, r18, r16) // install updated PTE #endif mov b0=r29 // restore b0 mov pr=r31,-1 -- 1.6.0.2 From yamahata at valinux.co.jp Thu Oct 16 19:17:43 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Fri, 17 Oct 2008 11:17:43 +0900 Subject: [PATCH 03/33] ia64/pv_ops: update native/inst.h to clobber predicate. In-Reply-To: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> References: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224209893-2032-4-git-send-email-yamahata@valinux.co.jp> add CLOBBER_PRED() to clobber predicate register. Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/native/inst.h | 10 ++++++++-- 1 files changed, 8 insertions(+), 2 deletions(-) diff --git a/arch/ia64/include/asm/native/inst.h b/arch/ia64/include/asm/native/inst.h index c8efbf7..0a1026c 100644 --- a/arch/ia64/include/asm/native/inst.h +++ b/arch/ia64/include/asm/native/inst.h @@ -36,8 +36,13 @@ ;; \ movl clob = PARAVIRT_POISON; \ ;; +# define CLOBBER_PRED(pred_clob) \ + ;; \ + cmp.eq pred_clob, p0 = r0, r0 \ + ;; #else -# define CLOBBER(clob) /* nothing */ +# define CLOBBER(clob) /* nothing */ +# define CLOBBER_PRED(pred_clob) /* nothing */ #endif #define MOV_FROM_IFA(reg) \ @@ -136,7 +141,8 @@ #define SSM_PSR_I(pred, pred_clob, clob) \ (pred) ssm psr.i \ - CLOBBER(clob) + CLOBBER(clob) \ + CLOBBER_PRED(pred_clob) #define RSM_PSR_I(pred, clob0, clob1) \ (pred) rsm psr.i \ -- 1.6.0.2 From yamahata at valinux.co.jp Thu Oct 16 19:17:44 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Fri, 17 Oct 2008 11:17:44 +0900 Subject: [PATCH 04/33] ia64: move function declaration, ia64_cpu_local_tick() from .c to .h In-Reply-To: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> References: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224209893-2032-5-git-send-email-yamahata@valinux.co.jp> eliminate the function declaration ia64_cpu_local_tick() in process.c by defining in arch/ia64/include/asm/timex.h The same function will be used in a different .c file later. 
Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/timex.h | 2 ++ arch/ia64/kernel/process.c | 1 - 2 files changed, 2 insertions(+), 1 deletions(-) diff --git a/arch/ia64/include/asm/timex.h b/arch/ia64/include/asm/timex.h index 05a6baf..4e03cfe 100644 --- a/arch/ia64/include/asm/timex.h +++ b/arch/ia64/include/asm/timex.h @@ -39,4 +39,6 @@ get_cycles (void) return ret; } +extern void ia64_cpu_local_tick (void); + #endif /* _ASM_IA64_TIMEX_H */ diff --git a/arch/ia64/kernel/process.c b/arch/ia64/kernel/process.c index 3ab8373..8de0f46 100644 --- a/arch/ia64/kernel/process.c +++ b/arch/ia64/kernel/process.c @@ -251,7 +251,6 @@ default_idle (void) /* We don't actually take CPU down, just spin without interrupts. */ static inline void play_dead(void) { - extern void ia64_cpu_local_tick (void); unsigned int this_cpu = smp_processor_id(); /* Ack it */ -- 1.6.0.2 From yamahata at valinux.co.jp Thu Oct 16 19:17:40 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Fri, 17 Oct 2008 11:17:40 +0900 Subject: [PATCH 00/33] ia64/xen domU take 12 Message-ID: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> This patchset is ia64/xen domU patch take 12. Tony, please commit those patches. They are ready to commit because all the issues which were pointed out had been addressed and got enough reviews. This patchset does the followings. - Some preparation work. Mainly importing header files to define related structures. - Then, define functions related to hypercall which is the way to communicate with Xen hypervisor. - Add some helper functions which is necessary to utilize xen arch generic portion. - Next implements the xen instance of pv_ops introducing pv_info, pv_init_ops, pv_cpu_ops and its assembler counter part, pv_iosapic_ops, pv_irq_ops and, pv_time_ops step by step. - Introduce xen machine vector to describe xen platform. By using machine vector, xen domU implementation can be simplified. - Lastly update Kconfig to allow paravirtualization support and xen domU support to compile. For convenience the working full source is available from http://people.valinux.co.jp/~yamahata/xen-ia64/for_eagl/linux-2.6-ia64-pv-ops.git/ branch: ia64-pv-ops-2008oct17-xen-ia64 For the status of this patch series http://wiki.xensource.com/xenwiki/XenIA64/UpstreamMerge Changes from take 11: - fixed ivt.S paravirtualization when CONFIG_SMP=n pointed by Tony Luck - improved paravirtualiztion checker. - pvclock-abi.h: duplicates definitions instead of including x86 ones as suggested by Tony Luck. Changes from take 10: - rebased to 2.6.27 - renamed pv_iosapic_ops::get_irq_chip to pv_iosapic_ops::__get_irq_chip. - improved SSM_PSR_I to detect invalid register usage. - fixed consider_steal_time() of pv_time_ops. Changes from take 9: - rebased to 2.6.27-rc4 - caught up for moving header files. - caught up for x86 xen changes (mainly xen mode predicate) - enhanced pv checker to detect inappropriate register usage. - typo Changes from take 8: - rebased to 2.6.26 - updated pvclock-abi.h Changes from take 7: - various typos - clean up sync_bitops.h - style fix on include/asm-ia64/xen/interface.h - reserve the "break" numbers in include/asm-ia64/break.h - xencomm clean up - dropped NET_SKB_PAD patch. It was a bug in xen-netfront.c. - CONFIG_IA64_XEN -> CONFIG_IA64_XEN_GUEST - catch up for x86 pvclock-abi.h - work around for IPI with IA64_TIME_VECTOR - add pv checker Changes from take 6: - rebased to linux ia64 test tree - xen bsw_1 simplification. - add documentation. 
Documentation/ia64/xen.txt - preliminary support for save/restore. - network fix. NET_SKB_PAD. Changes from take 5: - rebased to Linux 2.6.26-rc3 - fix ivt.S paravirtualization. One instruction was wrongly paravirtualized. It wasn't revealed with Xen HVM domain so far, but with real hw - multi entry point support. - revised changelog to add CCs. Changes from take 4: - fix synch bit ops definitions to prevent accidental namespace clashes. - rebased and fixed breakages due to the upstream change. Changes from take 3: - split the patch set into pv_op part and xen domU part. - many clean ups. - introduced pv_ops: pv_cpu_ops and pv_time_ops. Changes from take 2: - many clean ups following to comments. - clean up:assembly instruction macro. - introduced pv_ops: pv_info, pv_init_ops, pv_iosapic_ops, pv_irq_ops. Changes from take 1: Single IVT source code. compile multitimes using assembler macros. thanks, Diffstat: Documentation/ia64/xen.txt | 183 +++++++++++ arch/ia64/Kconfig | 32 ++ arch/ia64/Makefile | 2 arch/ia64/include/asm/break.h | 9 arch/ia64/include/asm/machvec.h | 2 arch/ia64/include/asm/machvec_xen.h | 22 + arch/ia64/include/asm/meminit.h | 3 arch/ia64/include/asm/native/inst.h | 10 arch/ia64/include/asm/native/pvchk_inst.h | 263 +++++++++++++++++ arch/ia64/include/asm/paravirt.h | 4 arch/ia64/include/asm/pvclock-abi.h | 48 +++ arch/ia64/include/asm/sync_bitops.h | 51 +++ arch/ia64/include/asm/timex.h | 2 arch/ia64/include/asm/xen/events.h | 50 +++ arch/ia64/include/asm/xen/grant_table.h | 29 + arch/ia64/include/asm/xen/hypercall.h | 265 +++++++++++++++++ arch/ia64/include/asm/xen/hypervisor.h | 89 +++++ arch/ia64/include/asm/xen/inst.h | 458 ++++++++++++++++++++++++++++++ arch/ia64/include/asm/xen/interface.h | 346 ++++++++++++++++++++++ arch/ia64/include/asm/xen/irq.h | 44 ++ arch/ia64/include/asm/xen/minstate.h | 134 ++++++++ arch/ia64/include/asm/xen/page.h | 65 ++++ arch/ia64/include/asm/xen/privop.h | 129 ++++++++ arch/ia64/include/asm/xen/xcom_hcall.h | 51 +++ arch/ia64/include/asm/xen/xencomm.h | 42 ++ arch/ia64/kernel/Makefile | 18 + arch/ia64/kernel/acpi.c | 5 arch/ia64/kernel/asm-offsets.c | 31 ++ arch/ia64/kernel/ivt.S | 6 arch/ia64/kernel/nr-irqs.c | 1 arch/ia64/kernel/paravirt.c | 2 arch/ia64/kernel/paravirt_inst.h | 4 arch/ia64/kernel/process.c | 1 arch/ia64/scripts/pvcheck.sed | 32 ++ arch/ia64/xen/Kconfig | 26 + arch/ia64/xen/Makefile | 42 ++ arch/ia64/xen/grant-table.c | 155 ++++++++++ arch/ia64/xen/hypercall.S | 91 +++++ arch/ia64/xen/hypervisor.c | 96 ++++++ arch/ia64/xen/irq_xen.c | 435 ++++++++++++++++++++++++++++ arch/ia64/xen/irq_xen.h | 34 ++ arch/ia64/xen/machvec.c | 4 arch/ia64/xen/suspend.c | 45 ++ arch/ia64/xen/time.c | 213 +++++++++++++ arch/ia64/xen/time.h | 24 + arch/ia64/xen/xcom_hcall.c | 441 ++++++++++++++++++++++++++++ arch/ia64/xen/xen_pv_ops.c | 364 +++++++++++++++++++++++ arch/ia64/xen/xencomm.c | 105 ++++++ arch/ia64/xen/xenivt.S | 52 +++ arch/ia64/xen/xensetup.S | 83 +++++ 50 files changed, 4620 insertions(+), 23 deletions(-) From yamahata at valinux.co.jp Thu Oct 16 19:17:42 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Fri, 17 Oct 2008 11:17:42 +0900 Subject: [PATCH 02/33] ia64/pv_ops: avoid name conflict of get_irq_chip(). 
In-Reply-To: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> References: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224209893-2032-3-git-send-email-yamahata@valinux.co.jp> The macro get_irq_chip() is defined in linux/include/linux/irq.h which cause name conflict with one in linux/arch/ia64/include/asm/paravirt.h. rename the latter to __get_irq_chip(). Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/paravirt.h | 4 ++-- arch/ia64/kernel/paravirt.c | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/ia64/include/asm/paravirt.h b/arch/ia64/include/asm/paravirt.h index 660cab0..2bf3636 100644 --- a/arch/ia64/include/asm/paravirt.h +++ b/arch/ia64/include/asm/paravirt.h @@ -117,7 +117,7 @@ static inline void paravirt_post_smp_prepare_boot_cpu(void) struct pv_iosapic_ops { void (*pcat_compat_init)(void); - struct irq_chip *(*get_irq_chip)(unsigned long trigger); + struct irq_chip *(*__get_irq_chip)(unsigned long trigger); unsigned int (*__read)(char __iomem *iosapic, unsigned int reg); void (*__write)(char __iomem *iosapic, unsigned int reg, u32 val); @@ -135,7 +135,7 @@ iosapic_pcat_compat_init(void) static inline struct irq_chip* iosapic_get_irq_chip(unsigned long trigger) { - return pv_iosapic_ops.get_irq_chip(trigger); + return pv_iosapic_ops.__get_irq_chip(trigger); } static inline unsigned int diff --git a/arch/ia64/kernel/paravirt.c b/arch/ia64/kernel/paravirt.c index afaf5b9..de35d8e 100644 --- a/arch/ia64/kernel/paravirt.c +++ b/arch/ia64/kernel/paravirt.c @@ -332,7 +332,7 @@ ia64_native_iosapic_write(char __iomem *iosapic, unsigned int reg, u32 val) struct pv_iosapic_ops pv_iosapic_ops = { .pcat_compat_init = ia64_native_iosapic_pcat_compat_init, - .get_irq_chip = ia64_native_iosapic_get_irq_chip, + .__get_irq_chip = ia64_native_iosapic_get_irq_chip, .__read = ia64_native_iosapic_read, .__write = ia64_native_iosapic_write, -- 1.6.0.2 From yamahata at valinux.co.jp Thu Oct 16 19:17:45 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Fri, 17 Oct 2008 11:17:45 +0900 Subject: [PATCH 05/33] ia64/xen: reserve "break" numbers used for xen hypercalls. In-Reply-To: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> References: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224209893-2032-6-git-send-email-yamahata@valinux.co.jp> reserve "break" numbers used for xen hypercalls to avoid reuse for something else. Cc: "Luck, Tony" Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/break.h | 9 +++++++++ 1 files changed, 9 insertions(+), 0 deletions(-) diff --git a/arch/ia64/include/asm/break.h b/arch/ia64/include/asm/break.h index f034020..e90c40e 100644 --- a/arch/ia64/include/asm/break.h +++ b/arch/ia64/include/asm/break.h @@ -20,4 +20,13 @@ */ #define __IA64_BREAK_SYSCALL 0x100000 +/* + * Xen specific break numbers: + */ +#define __IA64_XEN_HYPERCALL 0x1000 +/* [__IA64_XEN_HYPERPRIVOP_START, __IA64_XEN_HYPERPRIVOP_MAX] is used + for xen hyperprivops */ +#define __IA64_XEN_HYPERPRIVOP_START 0x1 +#define __IA64_XEN_HYPERPRIVOP_MAX 0x1a + #endif /* _ASM_IA64_BREAK_H */ -- 1.6.0.2 From yamahata at valinux.co.jp Thu Oct 16 19:17:46 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Fri, 17 Oct 2008 11:17:46 +0900 Subject: [PATCH 06/33] ia64/xen: introduce sync bitops which is necessary for ia64/xen support. 
In-Reply-To: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> References: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224209893-2032-7-git-send-email-yamahata@valinux.co.jp> define sync bitops which is necessary for ia64/xen. This bit operation is used to communicate with VMM or other guest kernel Even when this kernel is built for UP, VMM might be SMP so that those operation must always use atomic operation. Cc: Robin Holt Cc: Jeremy Fitzhardinge Signed-off-by: Isaku Yamahata Cc: "Luck, Tony" --- arch/ia64/include/asm/sync_bitops.h | 51 +++++++++++++++++++++++++++++++++++ 1 files changed, 51 insertions(+), 0 deletions(-) create mode 100644 arch/ia64/include/asm/sync_bitops.h diff --git a/arch/ia64/include/asm/sync_bitops.h b/arch/ia64/include/asm/sync_bitops.h new file mode 100644 index 0000000..593c12e --- /dev/null +++ b/arch/ia64/include/asm/sync_bitops.h @@ -0,0 +1,51 @@ +#ifndef _ASM_IA64_SYNC_BITOPS_H +#define _ASM_IA64_SYNC_BITOPS_H + +/* + * Copyright (C) 2008 Isaku Yamahata + * + * Based on synch_bitops.h which Dan Magenhaimer wrote. + * + * bit operations which provide guaranteed strong synchronisation + * when communicating with Xen or other guest OSes running on other CPUs. + */ + +static inline void sync_set_bit(int nr, volatile void *addr) +{ + set_bit(nr, addr); +} + +static inline void sync_clear_bit(int nr, volatile void *addr) +{ + clear_bit(nr, addr); +} + +static inline void sync_change_bit(int nr, volatile void *addr) +{ + change_bit(nr, addr); +} + +static inline int sync_test_and_set_bit(int nr, volatile void *addr) +{ + return test_and_set_bit(nr, addr); +} + +static inline int sync_test_and_clear_bit(int nr, volatile void *addr) +{ + return test_and_clear_bit(nr, addr); +} + +static inline int sync_test_and_change_bit(int nr, volatile void *addr) +{ + return test_and_change_bit(nr, addr); +} + +static inline int sync_test_bit(int nr, const volatile void *addr) +{ + return test_bit(nr, addr); +} + +#define sync_cmpxchg(ptr, old, new) \ + ((__typeof__(*(ptr)))cmpxchg_acq((ptr), (old), (new))) + +#endif /* _ASM_IA64_SYNC_BITOPS_H */ -- 1.6.0.2 From yamahata at valinux.co.jp Thu Oct 16 19:17:58 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Fri, 17 Oct 2008 11:17:58 +0900 Subject: [PATCH 18/33] ia64/pv_ops/xen: elf note based xen startup. In-Reply-To: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> References: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224209893-2032-19-git-send-email-yamahata@valinux.co.jp> This patch enables elf note based xen startup for IA-64, which gives the kernel an early hint for running on xen like x86 case. In order to avoid the multi entry point, presumably extending booting protocol(i.e. extending struct ia64_boot_param) would be necessary. It probably means that elilo also needs modification. 
Signed-off-by: Qing He Signed-off-by: Isaku Yamahata --- arch/ia64/kernel/asm-offsets.c | 4 ++ arch/ia64/xen/Makefile | 3 +- arch/ia64/xen/xen_pv_ops.c | 65 +++++++++++++++++++++++++++++++ arch/ia64/xen/xensetup.S | 83 ++++++++++++++++++++++++++++++++++++++++ 4 files changed, 154 insertions(+), 1 deletions(-) create mode 100644 arch/ia64/xen/xen_pv_ops.c create mode 100644 arch/ia64/xen/xensetup.S diff --git a/arch/ia64/kernel/asm-offsets.c b/arch/ia64/kernel/asm-offsets.c index eaa988b..742dbb1 100644 --- a/arch/ia64/kernel/asm-offsets.c +++ b/arch/ia64/kernel/asm-offsets.c @@ -17,6 +17,7 @@ #include #include +#include #include "../kernel/sigframe.h" #include "../kernel/fsyscall_gtod_data.h" @@ -292,6 +293,9 @@ void foo(void) #ifdef CONFIG_XEN BLANK(); + DEFINE(XEN_NATIVE_ASM, XEN_NATIVE); + DEFINE(XEN_PV_DOMAIN_ASM, XEN_PV_DOMAIN); + #define DEFINE_MAPPED_REG_OFS(sym, field) \ DEFINE(sym, (XMAPPEDREGS_OFS + offsetof(struct mapped_regs, field))) diff --git a/arch/ia64/xen/Makefile b/arch/ia64/xen/Makefile index eb59563..abc356f 100644 --- a/arch/ia64/xen/Makefile +++ b/arch/ia64/xen/Makefile @@ -2,4 +2,5 @@ # Makefile for Xen components # -obj-y := hypercall.o xencomm.o xcom_hcall.o grant-table.o +obj-y := hypercall.o xensetup.o xen_pv_ops.o \ + xencomm.o xcom_hcall.o grant-table.o diff --git a/arch/ia64/xen/xen_pv_ops.c b/arch/ia64/xen/xen_pv_ops.c new file mode 100644 index 0000000..77db214 --- /dev/null +++ b/arch/ia64/xen/xen_pv_ops.c @@ -0,0 +1,65 @@ +/****************************************************************************** + * arch/ia64/xen/xen_pv_ops.c + * + * Copyright (c) 2008 Isaku Yamahata + * VA Linux Systems Japan K.K. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + */ + +#include +#include +#include +#include + +#include +#include +#include + +/*************************************************************************** + * general info + */ +static struct pv_info xen_info __initdata = { + .kernel_rpl = 2, /* or 1: determin at runtime */ + .paravirt_enabled = 1, + .name = "Xen/ia64", +}; + +#define IA64_RSC_PL_SHIFT 2 +#define IA64_RSC_PL_BIT_SIZE 2 +#define IA64_RSC_PL_MASK \ + (((1UL << IA64_RSC_PL_BIT_SIZE) - 1) << IA64_RSC_PL_SHIFT) + +static void __init +xen_info_init(void) +{ + /* Xenified Linux/ia64 may run on pl = 1 or 2. + * determin at run time. 
*/ + unsigned long rsc = ia64_getreg(_IA64_REG_AR_RSC); + unsigned int rpl = (rsc & IA64_RSC_PL_MASK) >> IA64_RSC_PL_SHIFT; + xen_info.kernel_rpl = rpl; +} + +/*************************************************************************** + * pv_ops initialization + */ + +void __init +xen_setup_pv_ops(void) +{ + xen_info_init(); + pv_info = xen_info; +} diff --git a/arch/ia64/xen/xensetup.S b/arch/ia64/xen/xensetup.S new file mode 100644 index 0000000..28fed1f --- /dev/null +++ b/arch/ia64/xen/xensetup.S @@ -0,0 +1,83 @@ +/* + * Support routines for Xen + * + * Copyright (C) 2005 Dan Magenheimer + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + + .section .data.read_mostly + .align 8 + .global xen_domain_type +xen_domain_type: + data4 XEN_NATIVE_ASM + .previous + + __INIT +ENTRY(startup_xen) + // Calculate load offset. + // The constant, LOAD_OFFSET, can't be used because the boot + // loader doesn't always load to the LMA specified by the vmlinux.lds. + mov r9=ip // must be the first instruction to make sure + // that r9 = the physical address of startup_xen. + // Usually r9 = startup_xen - LOAD_OFFSET + movl r8=startup_xen + ;; + sub r9=r9,r8 // Usually r9 = -LOAD_OFFSET. + + mov r10=PARAVIRT_HYPERVISOR_TYPE_XEN + movl r11=_start + ;; + add r11=r11,r9 + movl r8=hypervisor_type + ;; + add r8=r8,r9 + mov b0=r11 + ;; + st8 [r8]=r10 + br.cond.sptk.many b0 + ;; +END(startup_xen) + + ELFNOTE(Xen, XEN_ELFNOTE_GUEST_OS, .asciz "linux") + ELFNOTE(Xen, XEN_ELFNOTE_GUEST_VERSION, .asciz "2.6") + ELFNOTE(Xen, XEN_ELFNOTE_XEN_VERSION, .asciz "xen-3.0") + ELFNOTE(Xen, XEN_ELFNOTE_ENTRY, data8.ua startup_xen - LOAD_OFFSET) + +#define isBP p3 // are we the Bootstrap Processor? + + .text + +GLOBAL_ENTRY(xen_setup_hook) + mov r8=XEN_PV_DOMAIN_ASM +(isBP) movl r9=xen_domain_type;; +(isBP) st4 [r9]=r8 + movl r10=xen_ivt;; + + mov cr.iva=r10 + + /* Set xsi base. */ +#define FW_HYPERCALL_SET_SHARED_INFO_VA 0x600 +(isBP) mov r2=FW_HYPERCALL_SET_SHARED_INFO_VA +(isBP) movl r28=XSI_BASE;; +(isBP) break 0x1000;; + + /* setup pv_ops */ +(isBP) mov r4=rp + ;; +(isBP) br.call.sptk.many rp=xen_setup_pv_ops + ;; +(isBP) mov rp=r4 + ;; + + br.ret.sptk.many rp + ;; +END(xen_setup_hook) -- 1.6.0.2 From yamahata at valinux.co.jp Thu Oct 16 19:17:48 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Fri, 17 Oct 2008 11:17:48 +0900 Subject: [PATCH 08/33] ia64/xen: introduce definitions necessary for ia64/xen hypercalls. In-Reply-To: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> References: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224209893-2032-9-git-send-email-yamahata@valinux.co.jp> import arch/ia64/include/asm/xen/interface.h to introduce definitions necessary for ia64/xen hypercalls. They are basic structures to communicate with xen hypervisor and will be used later. Cc: Robin Holt Cc: Jeremy Fitzhardinge Signed-off-by: Isaku Yamahata Cc: "Luck, Tony" --- arch/ia64/include/asm/xen/interface.h | 346 +++++++++++++++++++++++++++++++++ 1 files changed, 346 insertions(+), 0 deletions(-) create mode 100644 arch/ia64/include/asm/xen/interface.h diff --git a/arch/ia64/include/asm/xen/interface.h b/arch/ia64/include/asm/xen/interface.h new file mode 100644 index 0000000..f00fab4 --- /dev/null +++ b/arch/ia64/include/asm/xen/interface.h @@ -0,0 +1,346 @@ +/****************************************************************************** + * arch-ia64/hypervisor-if.h + * + * Guest OS interface to IA64 Xen. 
+ * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this software and associated documentation files (the "Software"), to + * deal in the Software without restriction, including without limitation the + * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or + * sell copies of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER + * DEALINGS IN THE SOFTWARE. + * + * Copyright by those who contributed. (in alphabetical order) + * + * Anthony Xu + * Eddie Dong + * Fred Yang + * Kevin Tian + * Alex Williamson + * Chris Wright + * Christian Limpach + * Dietmar Hahn + * Hollis Blanchard + * Isaku Yamahata + * Jan Beulich + * John Levon + * Kazuhiro Suzuki + * Keir Fraser + * Kouya Shimura + * Masaki Kanno + * Matt Chapman + * Matthew Chapman + * Samuel Thibault + * Tomonari Horikoshi + * Tristan Gingold + * Tsunehisa Doi + * Yutaka Ezaki + * Zhang Xin + * Zhang xiantao + * dan.magenheimer at hp.com + * ian.pratt at cl.cam.ac.uk + * michael.fetterman at cl.cam.ac.uk + */ + +#ifndef _ASM_IA64_XEN_INTERFACE_H +#define _ASM_IA64_XEN_INTERFACE_H + +#define __DEFINE_GUEST_HANDLE(name, type) \ + typedef struct { type *p; } __guest_handle_ ## name + +#define DEFINE_GUEST_HANDLE_STRUCT(name) \ + __DEFINE_GUEST_HANDLE(name, struct name) +#define DEFINE_GUEST_HANDLE(name) __DEFINE_GUEST_HANDLE(name, name) +#define GUEST_HANDLE(name) __guest_handle_ ## name +#define GUEST_HANDLE_64(name) GUEST_HANDLE(name) +#define set_xen_guest_handle(hnd, val) do { (hnd).p = val; } while (0) + +#ifndef __ASSEMBLY__ +/* Guest handles for primitive C types. */ +__DEFINE_GUEST_HANDLE(uchar, unsigned char); +__DEFINE_GUEST_HANDLE(uint, unsigned int); +__DEFINE_GUEST_HANDLE(ulong, unsigned long); +__DEFINE_GUEST_HANDLE(u64, unsigned long); +DEFINE_GUEST_HANDLE(char); +DEFINE_GUEST_HANDLE(int); +DEFINE_GUEST_HANDLE(long); +DEFINE_GUEST_HANDLE(void); + +typedef unsigned long xen_pfn_t; +DEFINE_GUEST_HANDLE(xen_pfn_t); +#define PRI_xen_pfn "lx" +#endif + +/* Arch specific VIRQs definition */ +#define VIRQ_ITC VIRQ_ARCH_0 /* V. Virtual itc timer */ +#define VIRQ_MCA_CMC VIRQ_ARCH_1 /* MCA cmc interrupt */ +#define VIRQ_MCA_CPE VIRQ_ARCH_2 /* MCA cpe interrupt */ + +/* Maximum number of virtual CPUs in multi-processor guests. */ +/* keep sizeof(struct shared_page) <= PAGE_SIZE. + * this is checked in arch/ia64/xen/hypervisor.c. 
*/ +#define MAX_VIRT_CPUS 64 + +#ifndef __ASSEMBLY__ + +#define INVALID_MFN (~0UL) + +union vac { + unsigned long value; + struct { + int a_int:1; + int a_from_int_cr:1; + int a_to_int_cr:1; + int a_from_psr:1; + int a_from_cpuid:1; + int a_cover:1; + int a_bsw:1; + long reserved:57; + }; +}; + +union vdc { + unsigned long value; + struct { + int d_vmsw:1; + int d_extint:1; + int d_ibr_dbr:1; + int d_pmc:1; + int d_to_pmd:1; + int d_itm:1; + long reserved:58; + }; +}; + +struct mapped_regs { + union vac vac; + union vdc vdc; + unsigned long virt_env_vaddr; + unsigned long reserved1[29]; + unsigned long vhpi; + unsigned long reserved2[95]; + union { + unsigned long vgr[16]; + unsigned long bank1_regs[16]; /* bank1 regs (r16-r31) + when bank0 active */ + }; + union { + unsigned long vbgr[16]; + unsigned long bank0_regs[16]; /* bank0 regs (r16-r31) + when bank1 active */ + }; + unsigned long vnat; + unsigned long vbnat; + unsigned long vcpuid[5]; + unsigned long reserved3[11]; + unsigned long vpsr; + unsigned long vpr; + unsigned long reserved4[76]; + union { + unsigned long vcr[128]; + struct { + unsigned long dcr; /* CR0 */ + unsigned long itm; + unsigned long iva; + unsigned long rsv1[5]; + unsigned long pta; /* CR8 */ + unsigned long rsv2[7]; + unsigned long ipsr; /* CR16 */ + unsigned long isr; + unsigned long rsv3; + unsigned long iip; + unsigned long ifa; + unsigned long itir; + unsigned long iipa; + unsigned long ifs; + unsigned long iim; /* CR24 */ + unsigned long iha; + unsigned long rsv4[38]; + unsigned long lid; /* CR64 */ + unsigned long ivr; + unsigned long tpr; + unsigned long eoi; + unsigned long irr[4]; + unsigned long itv; /* CR72 */ + unsigned long pmv; + unsigned long cmcv; + unsigned long rsv5[5]; + unsigned long lrr0; /* CR80 */ + unsigned long lrr1; + unsigned long rsv6[46]; + }; + }; + union { + unsigned long reserved5[128]; + struct { + unsigned long precover_ifs; + unsigned long unat; /* not sure if this is needed + until NaT arch is done */ + int interrupt_collection_enabled; /* virtual psr.ic */ + + /* virtual interrupt deliverable flag is + * evtchn_upcall_mask in shared info area now. + * interrupt_mask_addr is the address + * of evtchn_upcall_mask for current vcpu + */ + unsigned char *interrupt_mask_addr; + int pending_interruption; + unsigned char vpsr_pp; + unsigned char vpsr_dfh; + unsigned char hpsr_dfh; + unsigned char hpsr_mfh; + unsigned long reserved5_1[4]; + int metaphysical_mode; /* 1 = use metaphys mapping + 0 = use virtual */ + int banknum; /* 0 or 1, which virtual + register bank is active */ + unsigned long rrs[8]; /* region registers */ + unsigned long krs[8]; /* kernel registers */ + unsigned long tmp[16]; /* temp registers + (e.g. for hyperprivops) */ + }; + }; +}; + +struct arch_vcpu_info { + /* nothing */ +}; + +/* + * This structure is used for magic page in domain pseudo physical address + * space and the result of XENMEM_machine_memory_map. + * As the XENMEM_machine_memory_map result, + * xen_memory_map::nr_entries indicates the size in bytes + * including struct xen_ia64_memmap_info. Not the number of entries. + */ +struct xen_ia64_memmap_info { + uint64_t efi_memmap_size; /* size of EFI memory map */ + uint64_t efi_memdesc_size; /* size of an EFI memory map + * descriptor */ + uint32_t efi_memdesc_version; /* memory descriptor version */ + void *memdesc[0]; /* array of efi_memory_desc_t */ +}; + +struct arch_shared_info { + /* PFN of the start_info page. */ + unsigned long start_info_pfn; + + /* Interrupt vector for event channel. 
*/ + int evtchn_vector; + + /* PFN of memmap_info page */ + unsigned int memmap_info_num_pages; /* currently only = 1 case is + supported. */ + unsigned long memmap_info_pfn; + + uint64_t pad[31]; +}; + +struct xen_callback { + unsigned long ip; +}; +typedef struct xen_callback xen_callback_t; + +#endif /* !__ASSEMBLY__ */ + +/* Size of the shared_info area (this is not related to page size). */ +#define XSI_SHIFT 14 +#define XSI_SIZE (1 << XSI_SHIFT) +/* Log size of mapped_regs area (64 KB - only 4KB is used). */ +#define XMAPPEDREGS_SHIFT 12 +#define XMAPPEDREGS_SIZE (1 << XMAPPEDREGS_SHIFT) +/* Offset of XASI (Xen arch shared info) wrt XSI_BASE. */ +#define XMAPPEDREGS_OFS XSI_SIZE + +/* Hyperprivops. */ +#define HYPERPRIVOP_START 0x1 +#define HYPERPRIVOP_RFI (HYPERPRIVOP_START + 0x0) +#define HYPERPRIVOP_RSM_DT (HYPERPRIVOP_START + 0x1) +#define HYPERPRIVOP_SSM_DT (HYPERPRIVOP_START + 0x2) +#define HYPERPRIVOP_COVER (HYPERPRIVOP_START + 0x3) +#define HYPERPRIVOP_ITC_D (HYPERPRIVOP_START + 0x4) +#define HYPERPRIVOP_ITC_I (HYPERPRIVOP_START + 0x5) +#define HYPERPRIVOP_SSM_I (HYPERPRIVOP_START + 0x6) +#define HYPERPRIVOP_GET_IVR (HYPERPRIVOP_START + 0x7) +#define HYPERPRIVOP_GET_TPR (HYPERPRIVOP_START + 0x8) +#define HYPERPRIVOP_SET_TPR (HYPERPRIVOP_START + 0x9) +#define HYPERPRIVOP_EOI (HYPERPRIVOP_START + 0xa) +#define HYPERPRIVOP_SET_ITM (HYPERPRIVOP_START + 0xb) +#define HYPERPRIVOP_THASH (HYPERPRIVOP_START + 0xc) +#define HYPERPRIVOP_PTC_GA (HYPERPRIVOP_START + 0xd) +#define HYPERPRIVOP_ITR_D (HYPERPRIVOP_START + 0xe) +#define HYPERPRIVOP_GET_RR (HYPERPRIVOP_START + 0xf) +#define HYPERPRIVOP_SET_RR (HYPERPRIVOP_START + 0x10) +#define HYPERPRIVOP_SET_KR (HYPERPRIVOP_START + 0x11) +#define HYPERPRIVOP_FC (HYPERPRIVOP_START + 0x12) +#define HYPERPRIVOP_GET_CPUID (HYPERPRIVOP_START + 0x13) +#define HYPERPRIVOP_GET_PMD (HYPERPRIVOP_START + 0x14) +#define HYPERPRIVOP_GET_EFLAG (HYPERPRIVOP_START + 0x15) +#define HYPERPRIVOP_SET_EFLAG (HYPERPRIVOP_START + 0x16) +#define HYPERPRIVOP_RSM_BE (HYPERPRIVOP_START + 0x17) +#define HYPERPRIVOP_GET_PSR (HYPERPRIVOP_START + 0x18) +#define HYPERPRIVOP_SET_RR0_TO_RR4 (HYPERPRIVOP_START + 0x19) +#define HYPERPRIVOP_MAX (0x1a) + +/* Fast and light hypercalls. */ +#define __HYPERVISOR_ia64_fast_eoi __HYPERVISOR_arch_1 + +/* Xencomm macros. */ +#define XENCOMM_INLINE_MASK 0xf800000000000000UL +#define XENCOMM_INLINE_FLAG 0x8000000000000000UL + +#ifndef __ASSEMBLY__ + +/* + * Optimization features. + * The hypervisor may do some special optimizations for guests. This hypercall + * can be used to switch on/of these special optimizations. + */ +#define __HYPERVISOR_opt_feature 0x700UL + +#define XEN_IA64_OPTF_OFF 0x0 +#define XEN_IA64_OPTF_ON 0x1 + +/* + * If this feature is switched on, the hypervisor inserts the + * tlb entries without calling the guests traphandler. + * This is useful in guests using region 7 for identity mapping + * like the linux kernel does. + */ +#define XEN_IA64_OPTF_IDENT_MAP_REG7 1 + +/* Identity mapping of region 4 addresses in HVM. */ +#define XEN_IA64_OPTF_IDENT_MAP_REG4 2 + +/* Identity mapping of region 5 addresses in HVM. */ +#define XEN_IA64_OPTF_IDENT_MAP_REG5 3 + +#define XEN_IA64_OPTF_IDENT_MAP_NOT_SET (0) + +struct xen_ia64_opt_feature { + unsigned long cmd; /* Which feature */ + unsigned char on; /* Switch feature on/off */ + union { + struct { + /* The page protection bit mask of the pte. + * This will be or'ed with the pte. 
*/ + unsigned long pgprot; + unsigned long key; /* A protection key for itir.*/ + }; + }; +}; + +#endif /* __ASSEMBLY__ */ + +#endif /* _ASM_IA64_XEN_INTERFACE_H */ -- 1.6.0.2 From yamahata at valinux.co.jp Thu Oct 16 19:17:47 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Fri, 17 Oct 2008 11:17:47 +0900 Subject: [PATCH 07/33] ia64/xen: increase IA64_MAX_RSVD_REGIONS. In-Reply-To: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> References: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224209893-2032-8-git-send-email-yamahata@valinux.co.jp> Xenlinux/ia64 needs to reserve one more region passed from xen hypervisor as start info. Cc: Robin Holt Cc: Bjorn Helgaas Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/meminit.h | 3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/arch/ia64/include/asm/meminit.h b/arch/ia64/include/asm/meminit.h index 7245a57..6bc96ee 100644 --- a/arch/ia64/include/asm/meminit.h +++ b/arch/ia64/include/asm/meminit.h @@ -18,10 +18,11 @@ * - crash dumping code reserved region * - Kernel memory map built from EFI memory map * - ELF core header + * - xen start info if CONFIG_XEN * * More could be added if necessary */ -#define IA64_MAX_RSVD_REGIONS 8 +#define IA64_MAX_RSVD_REGIONS 9 struct rsvd_region { unsigned long start; /* virtual address of beginning of element */ -- 1.6.0.2 From yamahata at valinux.co.jp Thu Oct 16 19:17:51 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Fri, 17 Oct 2008 11:17:51 +0900 Subject: [PATCH 11/33] ia64/xen: define helper functions for xen related address conversion. In-Reply-To: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> References: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224209893-2032-12-git-send-email-yamahata@valinux.co.jp> Xen needs some address conversions between pseudo physical address (guest phsyical address), guest machine address (real machine address) and dma address. Define helper functions for those address conversion. Cc: Jeremy Fitzhardinge Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/xen/page.h | 65 ++++++++++++++++++++++++++++++++++++++ 1 files changed, 65 insertions(+), 0 deletions(-) create mode 100644 arch/ia64/include/asm/xen/page.h diff --git a/arch/ia64/include/asm/xen/page.h b/arch/ia64/include/asm/xen/page.h new file mode 100644 index 0000000..03441a7 --- /dev/null +++ b/arch/ia64/include/asm/xen/page.h @@ -0,0 +1,65 @@ +/****************************************************************************** + * arch/ia64/include/asm/xen/page.h + * + * Copyright (c) 2008 Isaku Yamahata + * VA Linux Systems Japan K.K. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + */ + +#ifndef _ASM_IA64_XEN_PAGE_H +#define _ASM_IA64_XEN_PAGE_H + +#define INVALID_P2M_ENTRY (~0UL) + +static inline unsigned long mfn_to_pfn(unsigned long mfn) +{ + return mfn; +} + +static inline unsigned long pfn_to_mfn(unsigned long pfn) +{ + return pfn; +} + +#define phys_to_machine_mapping_valid(_x) (1) + +static inline void *mfn_to_virt(unsigned long mfn) +{ + return __va(mfn << PAGE_SHIFT); +} + +static inline unsigned long virt_to_mfn(void *virt) +{ + return __pa(virt) >> PAGE_SHIFT; +} + +/* for tpmfront.c */ +static inline unsigned long virt_to_machine(void *virt) +{ + return __pa(virt); +} + +static inline void set_phys_to_machine(unsigned long pfn, unsigned long mfn) +{ + /* nothing */ +} + +#define pte_mfn(_x) pte_pfn(_x) +#define mfn_pte(_x, _y) __pte_ma(0) /* unmodified use */ +#define __pte_ma(_x) ((pte_t) {(_x)}) /* unmodified use */ + +#endif /* _ASM_IA64_XEN_PAGE_H */ -- 1.6.0.2 From yamahata at valinux.co.jp Thu Oct 16 19:17:55 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Fri, 17 Oct 2008 11:17:55 +0900 Subject: [PATCH 15/33] ia64/xen: implement arch specific part of xen grant table. In-Reply-To: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> References: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224209893-2032-16-git-send-email-yamahata@valinux.co.jp> Xen implements grant tables which is for sharing pages with guest domains. This patch implements arch specific part of grant table initialization. and xen_alloc_vm_area()/xen_free_vm_area() which are helper functions for xen grant table. Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/xen/grant_table.h | 29 ++++++ arch/ia64/xen/Makefile | 2 +- arch/ia64/xen/grant-table.c | 155 +++++++++++++++++++++++++++++++ 3 files changed, 185 insertions(+), 1 deletions(-) create mode 100644 arch/ia64/include/asm/xen/grant_table.h create mode 100644 arch/ia64/xen/grant-table.c diff --git a/arch/ia64/include/asm/xen/grant_table.h b/arch/ia64/include/asm/xen/grant_table.h new file mode 100644 index 0000000..2b1fae0 --- /dev/null +++ b/arch/ia64/include/asm/xen/grant_table.h @@ -0,0 +1,29 @@ +/****************************************************************************** + * arch/ia64/include/asm/xen/grant_table.h + * + * Copyright (c) 2008 Isaku Yamahata + * VA Linux Systems Japan K.K. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + */ + +#ifndef _ASM_IA64_XEN_GRANT_TABLE_H +#define _ASM_IA64_XEN_GRANT_TABLE_H + +struct vm_struct *xen_alloc_vm_area(unsigned long size); +void xen_free_vm_area(struct vm_struct *area); + +#endif /* _ASM_IA64_XEN_GRANT_TABLE_H */ diff --git a/arch/ia64/xen/Makefile b/arch/ia64/xen/Makefile index ae08822..eb59563 100644 --- a/arch/ia64/xen/Makefile +++ b/arch/ia64/xen/Makefile @@ -2,4 +2,4 @@ # Makefile for Xen components # -obj-y := hypercall.o xencomm.o xcom_hcall.o +obj-y := hypercall.o xencomm.o xcom_hcall.o grant-table.o diff --git a/arch/ia64/xen/grant-table.c b/arch/ia64/xen/grant-table.c new file mode 100644 index 0000000..777dd9a --- /dev/null +++ b/arch/ia64/xen/grant-table.c @@ -0,0 +1,155 @@ +/****************************************************************************** + * arch/ia64/xen/grant-table.c + * + * Copyright (c) 2006 Isaku Yamahata + * VA Linux Systems Japan K.K. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + */ + +#include +#include +#include + +#include +#include +#include + +#include + +struct vm_struct *xen_alloc_vm_area(unsigned long size) +{ + int order; + unsigned long virt; + unsigned long nr_pages; + struct vm_struct *area; + + order = get_order(size); + virt = __get_free_pages(GFP_KERNEL, order); + if (virt == 0) + goto err0; + nr_pages = 1 << order; + scrub_pages(virt, nr_pages); + + area = kmalloc(sizeof(*area), GFP_KERNEL); + if (area == NULL) + goto err1; + + area->flags = VM_IOREMAP; + area->addr = (void *)virt; + area->size = size; + area->pages = NULL; + area->nr_pages = nr_pages; + area->phys_addr = 0; /* xenbus_map_ring_valloc uses this field! */ + + return area; + +err1: + free_pages(virt, order); +err0: + return NULL; +} +EXPORT_SYMBOL_GPL(xen_alloc_vm_area); + +void xen_free_vm_area(struct vm_struct *area) +{ + unsigned int order = get_order(area->size); + unsigned long i; + unsigned long phys_addr = __pa(area->addr); + + /* This area is used for foreign page mappping. + * So underlying machine page may not be assigned. 
*/ + for (i = 0; i < (1 << order); i++) { + unsigned long ret; + unsigned long gpfn = (phys_addr >> PAGE_SHIFT) + i; + struct xen_memory_reservation reservation = { + .nr_extents = 1, + .address_bits = 0, + .extent_order = 0, + .domid = DOMID_SELF + }; + set_xen_guest_handle(reservation.extent_start, &gpfn); + ret = HYPERVISOR_memory_op(XENMEM_populate_physmap, + &reservation); + BUG_ON(ret != 1); + } + free_pages((unsigned long)area->addr, order); + kfree(area); +} +EXPORT_SYMBOL_GPL(xen_free_vm_area); + + +/**************************************************************************** + * grant table hack + * cmd: GNTTABOP_xxx + */ + +int arch_gnttab_map_shared(unsigned long *frames, unsigned long nr_gframes, + unsigned long max_nr_gframes, + struct grant_entry **__shared) +{ + *__shared = __va(frames[0] << PAGE_SHIFT); + return 0; +} + +void arch_gnttab_unmap_shared(struct grant_entry *shared, + unsigned long nr_gframes) +{ + /* nothing */ +} + +static void +gnttab_map_grant_ref_pre(struct gnttab_map_grant_ref *uop) +{ + uint32_t flags; + + flags = uop->flags; + + if (flags & GNTMAP_host_map) { + if (flags & GNTMAP_application_map) { + printk(KERN_DEBUG + "GNTMAP_application_map is not supported yet: " + "flags 0x%x\n", flags); + BUG(); + } + if (flags & GNTMAP_contains_pte) { + printk(KERN_DEBUG + "GNTMAP_contains_pte is not supported yet: " + "flags 0x%x\n", flags); + BUG(); + } + } else if (flags & GNTMAP_device_map) { + printk("GNTMAP_device_map is not supported yet 0x%x\n", flags); + BUG(); /* not yet. actually this flag is not used. */ + } else { + BUG(); + } +} + +int +HYPERVISOR_grant_table_op(unsigned int cmd, void *uop, unsigned int count) +{ + if (cmd == GNTTABOP_map_grant_ref) { + unsigned int i; + for (i = 0; i < count; i++) { + gnttab_map_grant_ref_pre( + (struct gnttab_map_grant_ref *)uop + i); + } + } + return xencomm_hypercall_grant_table_op(cmd, uop, count); +} + +EXPORT_SYMBOL(HYPERVISOR_grant_table_op); -- 1.6.0.2 From yamahata at valinux.co.jp Thu Oct 16 19:17:53 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Fri, 17 Oct 2008 11:17:53 +0900 Subject: [PATCH 13/33] ia64/xen: implement the arch specific part of xencomm. In-Reply-To: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> References: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224209893-2032-14-git-send-email-yamahata@valinux.co.jp> On ia64/xen, pointer argument for the hypercall is passed by pseudo physical address (guest physical address.) So it is necessary to convert virtual address into pseudo physical address right before issuing hypercall. The frame work is called xencomm. This patch implements arch specific part. 
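To make the calling convention above concrete, a minimal sketch is shown below. It is not part of the patch and the helper name is hypothetical; the real code additionally wraps the address in a struct xencomm_handle and tags physically contiguous buffers with XENCOMM_INLINE_FLAG, but the essential step is the same translation from a virtual address to a pseudo physical address just before the hypercall is issued.

static int example_issue_hypercall(int cmd, void *buf)
{
	/* xencomm_vtop(), added by this patch, translates a kernel virtual
	 * address into a pseudo physical (guest physical) address */
	unsigned long gpaddr = xencomm_vtop((unsigned long)buf);

	if (gpaddr == ~0UL)
		return -EINVAL;		/* the address could not be translated */

	/* the hypervisor is handed the guest physical address, never the raw
	 * pointer; xen_version is used only as an example hypercall */
	return _hypercall2(int, xen_version, cmd, gpaddr);
}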
Signed-off-by: Alex Williamson Signed-off-by: Isaku Yamahata Cc: "Luck, Tony" Cc: Akio Takebe --- arch/ia64/include/asm/xen/xencomm.h | 41 +++++++++++++++ arch/ia64/xen/Makefile | 2 +- arch/ia64/xen/xencomm.c | 94 +++++++++++++++++++++++++++++++++++ 3 files changed, 136 insertions(+), 1 deletions(-) create mode 100644 arch/ia64/include/asm/xen/xencomm.h create mode 100644 arch/ia64/xen/xencomm.c diff --git a/arch/ia64/include/asm/xen/xencomm.h b/arch/ia64/include/asm/xen/xencomm.h new file mode 100644 index 0000000..28732cd --- /dev/null +++ b/arch/ia64/include/asm/xen/xencomm.h @@ -0,0 +1,41 @@ +/* + * Copyright (C) 2006 Hollis Blanchard , IBM Corporation + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef _ASM_IA64_XEN_XENCOMM_H +#define _ASM_IA64_XEN_XENCOMM_H + +#include +#include + +/* Must be called before any hypercall. */ +extern void xencomm_initialize(void); + +/* Check if virtual contiguity means physical contiguity + * where the passed address is a pointer value in virtual address. + * On ia64, identity mapping area in region 7 or the piece of region 5 + * that is mapped by itr[IA64_TR_KERNEL]/dtr[IA64_TR_KERNEL] + */ +static inline int xencomm_is_phys_contiguous(unsigned long addr) +{ + return (PAGE_OFFSET <= addr && + addr < (PAGE_OFFSET + (1UL << IA64_MAX_PHYS_BITS))) || + (KERNEL_START <= addr && + addr < KERNEL_START + KERNEL_TR_PAGE_SIZE); +} + +#endif /* _ASM_IA64_XEN_XENCOMM_H */ diff --git a/arch/ia64/xen/Makefile b/arch/ia64/xen/Makefile index c200704..ad0c9f7 100644 --- a/arch/ia64/xen/Makefile +++ b/arch/ia64/xen/Makefile @@ -2,4 +2,4 @@ # Makefile for Xen components # -obj-y := hypercall.o +obj-y := hypercall.o xencomm.o diff --git a/arch/ia64/xen/xencomm.c b/arch/ia64/xen/xencomm.c new file mode 100644 index 0000000..3dc307f --- /dev/null +++ b/arch/ia64/xen/xencomm.c @@ -0,0 +1,94 @@ +/* + * Copyright (C) 2006 Hollis Blanchard , IBM Corporation + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include + +static unsigned long kernel_virtual_offset; + +void +xencomm_initialize(void) +{ + kernel_virtual_offset = KERNEL_START - ia64_tpa(KERNEL_START); +} + +/* Translate virtual address to physical address. 
*/ +unsigned long +xencomm_vtop(unsigned long vaddr) +{ + struct page *page; + struct vm_area_struct *vma; + + if (vaddr == 0) + return 0UL; + + if (REGION_NUMBER(vaddr) == 5) { + pgd_t *pgd; + pud_t *pud; + pmd_t *pmd; + pte_t *ptep; + + /* On ia64, TASK_SIZE refers to current. It is not initialized + during boot. + Furthermore the kernel is relocatable and __pa() doesn't + work on addresses. */ + if (vaddr >= KERNEL_START + && vaddr < (KERNEL_START + KERNEL_TR_PAGE_SIZE)) + return vaddr - kernel_virtual_offset; + + /* In kernel area -- virtually mapped. */ + pgd = pgd_offset_k(vaddr); + if (pgd_none(*pgd) || pgd_bad(*pgd)) + return ~0UL; + + pud = pud_offset(pgd, vaddr); + if (pud_none(*pud) || pud_bad(*pud)) + return ~0UL; + + pmd = pmd_offset(pud, vaddr); + if (pmd_none(*pmd) || pmd_bad(*pmd)) + return ~0UL; + + ptep = pte_offset_kernel(pmd, vaddr); + if (!ptep) + return ~0UL; + + return (pte_val(*ptep) & _PFN_MASK) | (vaddr & ~PAGE_MASK); + } + + if (vaddr > TASK_SIZE) { + /* percpu variables */ + if (REGION_NUMBER(vaddr) == 7 && + REGION_OFFSET(vaddr) >= (1ULL << IA64_MAX_PHYS_BITS)) + ia64_tpa(vaddr); + + /* kernel address */ + return __pa(vaddr); + } + + /* XXX double-check (lack of) locking */ + vma = find_extend_vma(current->mm, vaddr); + if (!vma) + return ~0UL; + + /* We assume the page is modified. */ + page = follow_page(vma, vaddr, FOLL_WRITE | FOLL_TOUCH); + if (!page) + return ~0UL; + + return (page_to_pfn(page) << PAGE_SHIFT) | (vaddr & ~PAGE_MASK); +} -- 1.6.0.2 From yamahata at valinux.co.jp Thu Oct 16 19:17:56 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Fri, 17 Oct 2008 11:17:56 +0900 Subject: [PATCH 16/33] ia64/xen: add definitions necessary for xen event channel. In-Reply-To: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> References: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224209893-2032-17-git-send-email-yamahata@valinux.co.jp> Xen paravirtualizes interrupt as event channel. This patch defines arch specific part of xen event channel. Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/xen/events.h | 50 ++++++++++++++++++++++++++++++++++++ 1 files changed, 50 insertions(+), 0 deletions(-) create mode 100644 arch/ia64/include/asm/xen/events.h diff --git a/arch/ia64/include/asm/xen/events.h b/arch/ia64/include/asm/xen/events.h new file mode 100644 index 0000000..7324878 --- /dev/null +++ b/arch/ia64/include/asm/xen/events.h @@ -0,0 +1,50 @@ +/****************************************************************************** + * arch/ia64/include/asm/xen/events.h + * + * Copyright (c) 2008 Isaku Yamahata + * VA Linux Systems Japan K.K. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + */ +#ifndef _ASM_IA64_XEN_EVENTS_H +#define _ASM_IA64_XEN_EVENTS_H + +enum ipi_vector { + XEN_RESCHEDULE_VECTOR, + XEN_IPI_VECTOR, + XEN_CMCP_VECTOR, + XEN_CPEP_VECTOR, + + XEN_NR_IPIS, +}; + +static inline int xen_irqs_disabled(struct pt_regs *regs) +{ + return !(ia64_psr(regs)->i); +} + +static inline void xen_do_IRQ(int irq, struct pt_regs *regs) +{ + struct pt_regs *old_regs; + old_regs = set_irq_regs(regs); + irq_enter(); + __do_IRQ(irq); + irq_exit(); + set_irq_regs(old_regs); +} +#define irq_ctx_init(cpu) do { } while (0) + +#endif /* _ASM_IA64_XEN_EVENTS_H */ -- 1.6.0.2 From yamahata at valinux.co.jp Thu Oct 16 19:17:52 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Fri, 17 Oct 2008 11:17:52 +0900 Subject: [PATCH 12/33] ia64/xen: define helper functions for xen hypercalls. In-Reply-To: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> References: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224209893-2032-13-git-send-email-yamahata@valinux.co.jp> introduce helper functions for xen hypercalls which traps to hypervisor. Signed-off-by: Yaozu (Eddie) Dong Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/xen/hypercall.h | 265 +++++++++++++++++++++++++++++++++ arch/ia64/include/asm/xen/privop.h | 129 ++++++++++++++++ arch/ia64/xen/Makefile | 5 + arch/ia64/xen/hypercall.S | 91 +++++++++++ 4 files changed, 490 insertions(+), 0 deletions(-) create mode 100644 arch/ia64/include/asm/xen/hypercall.h create mode 100644 arch/ia64/include/asm/xen/privop.h create mode 100644 arch/ia64/xen/Makefile create mode 100644 arch/ia64/xen/hypercall.S diff --git a/arch/ia64/include/asm/xen/hypercall.h b/arch/ia64/include/asm/xen/hypercall.h new file mode 100644 index 0000000..96fc623 --- /dev/null +++ b/arch/ia64/include/asm/xen/hypercall.h @@ -0,0 +1,265 @@ +/****************************************************************************** + * hypercall.h + * + * Linux-specific hypervisor handling. + * + * Copyright (c) 2002-2004, K A Fraser + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation; or, when distributed + * separately from the Linux kernel or incorporated into other + * software packages, subject to the following license: + * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this source file (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, modify, + * merge, publish, distribute, sublicense, and/or sell copies of the Software, + * and to permit persons to whom the Software is furnished to do so, subject to + * the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS + * IN THE SOFTWARE. + */ + +#ifndef _ASM_IA64_XEN_HYPERCALL_H +#define _ASM_IA64_XEN_HYPERCALL_H + +#include +#include +#include +#include +struct xencomm_handle; +extern unsigned long __hypercall(unsigned long a1, unsigned long a2, + unsigned long a3, unsigned long a4, + unsigned long a5, unsigned long cmd); + +/* + * Assembler stubs for hyper-calls. + */ + +#define _hypercall0(type, name) \ +({ \ + long __res; \ + __res = __hypercall(0, 0, 0, 0, 0, __HYPERVISOR_##name);\ + (type)__res; \ +}) + +#define _hypercall1(type, name, a1) \ +({ \ + long __res; \ + __res = __hypercall((unsigned long)a1, \ + 0, 0, 0, 0, __HYPERVISOR_##name); \ + (type)__res; \ +}) + +#define _hypercall2(type, name, a1, a2) \ +({ \ + long __res; \ + __res = __hypercall((unsigned long)a1, \ + (unsigned long)a2, \ + 0, 0, 0, __HYPERVISOR_##name); \ + (type)__res; \ +}) + +#define _hypercall3(type, name, a1, a2, a3) \ +({ \ + long __res; \ + __res = __hypercall((unsigned long)a1, \ + (unsigned long)a2, \ + (unsigned long)a3, \ + 0, 0, __HYPERVISOR_##name); \ + (type)__res; \ +}) + +#define _hypercall4(type, name, a1, a2, a3, a4) \ +({ \ + long __res; \ + __res = __hypercall((unsigned long)a1, \ + (unsigned long)a2, \ + (unsigned long)a3, \ + (unsigned long)a4, \ + 0, __HYPERVISOR_##name); \ + (type)__res; \ +}) + +#define _hypercall5(type, name, a1, a2, a3, a4, a5) \ +({ \ + long __res; \ + __res = __hypercall((unsigned long)a1, \ + (unsigned long)a2, \ + (unsigned long)a3, \ + (unsigned long)a4, \ + (unsigned long)a5, \ + __HYPERVISOR_##name); \ + (type)__res; \ +}) + + +static inline int +xencomm_arch_hypercall_sched_op(int cmd, struct xencomm_handle *arg) +{ + return _hypercall2(int, sched_op_new, cmd, arg); +} + +static inline long +HYPERVISOR_set_timer_op(u64 timeout) +{ + unsigned long timeout_hi = (unsigned long)(timeout >> 32); + unsigned long timeout_lo = (unsigned long)timeout; + return _hypercall2(long, set_timer_op, timeout_lo, timeout_hi); +} + +static inline int +xencomm_arch_hypercall_multicall(struct xencomm_handle *call_list, + int nr_calls) +{ + return _hypercall2(int, multicall, call_list, nr_calls); +} + +static inline int +xencomm_arch_hypercall_memory_op(unsigned int cmd, struct xencomm_handle *arg) +{ + return _hypercall2(int, memory_op, cmd, arg); +} + +static inline int +xencomm_arch_hypercall_event_channel_op(int cmd, struct xencomm_handle *arg) +{ + return _hypercall2(int, event_channel_op, cmd, arg); +} + +static inline int +xencomm_arch_hypercall_xen_version(int cmd, struct xencomm_handle *arg) +{ + return _hypercall2(int, xen_version, cmd, arg); +} + +static inline int +xencomm_arch_hypercall_console_io(int cmd, int count, + struct xencomm_handle *str) +{ + return _hypercall3(int, console_io, cmd, count, str); +} + +static inline int +xencomm_arch_hypercall_physdev_op(int cmd, struct xencomm_handle *arg) +{ + return _hypercall2(int, physdev_op, cmd, arg); +} + +static inline int +xencomm_arch_hypercall_grant_table_op(unsigned int cmd, + struct xencomm_handle *uop, + unsigned int count) +{ + return _hypercall3(int, grant_table_op, cmd, uop, count); +} + +int HYPERVISOR_grant_table_op(unsigned int cmd, void *uop, unsigned int count); + +extern int xencomm_arch_hypercall_suspend(struct xencomm_handle *arg); + +static inline int 
+xencomm_arch_hypercall_callback_op(int cmd, struct xencomm_handle *arg) +{ + return _hypercall2(int, callback_op, cmd, arg); +} + +static inline long +xencomm_arch_hypercall_vcpu_op(int cmd, int cpu, void *arg) +{ + return _hypercall3(long, vcpu_op, cmd, cpu, arg); +} + +static inline int +HYPERVISOR_physdev_op(int cmd, void *arg) +{ + switch (cmd) { + case PHYSDEVOP_eoi: + return _hypercall1(int, ia64_fast_eoi, + ((struct physdev_eoi *)arg)->irq); + default: + return xencomm_hypercall_physdev_op(cmd, arg); + } +} + +static inline long +xencomm_arch_hypercall_opt_feature(struct xencomm_handle *arg) +{ + return _hypercall1(long, opt_feature, arg); +} + +/* for balloon driver */ +#define HYPERVISOR_update_va_mapping(va, new_val, flags) (0) + +/* Use xencomm to do hypercalls. */ +#define HYPERVISOR_sched_op xencomm_hypercall_sched_op +#define HYPERVISOR_event_channel_op xencomm_hypercall_event_channel_op +#define HYPERVISOR_callback_op xencomm_hypercall_callback_op +#define HYPERVISOR_multicall xencomm_hypercall_multicall +#define HYPERVISOR_xen_version xencomm_hypercall_xen_version +#define HYPERVISOR_console_io xencomm_hypercall_console_io +#define HYPERVISOR_memory_op xencomm_hypercall_memory_op +#define HYPERVISOR_suspend xencomm_hypercall_suspend +#define HYPERVISOR_vcpu_op xencomm_hypercall_vcpu_op +#define HYPERVISOR_opt_feature xencomm_hypercall_opt_feature + +/* to compile gnttab_copy_grant_page() in drivers/xen/core/gnttab.c */ +#define HYPERVISOR_mmu_update(req, count, success_count, domid) ({ BUG(); 0; }) + +static inline int +HYPERVISOR_shutdown( + unsigned int reason) +{ + struct sched_shutdown sched_shutdown = { + .reason = reason + }; + + int rc = HYPERVISOR_sched_op(SCHEDOP_shutdown, &sched_shutdown); + + return rc; +} + +/* for netfront.c, netback.c */ +#define MULTI_UVMFLAGS_INDEX 0 /* XXX any value */ + +static inline void +MULTI_update_va_mapping( + struct multicall_entry *mcl, unsigned long va, + pte_t new_val, unsigned long flags) +{ + mcl->op = __HYPERVISOR_update_va_mapping; + mcl->result = 0; +} + +static inline void +MULTI_grant_table_op(struct multicall_entry *mcl, unsigned int cmd, + void *uop, unsigned int count) +{ + mcl->op = __HYPERVISOR_grant_table_op; + mcl->args[0] = cmd; + mcl->args[1] = (unsigned long)uop; + mcl->args[2] = count; +} + +static inline void +MULTI_mmu_update(struct multicall_entry *mcl, struct mmu_update *req, + int count, int *success_count, domid_t domid) +{ + mcl->op = __HYPERVISOR_mmu_update; + mcl->args[0] = (unsigned long)req; + mcl->args[1] = count; + mcl->args[2] = (unsigned long)success_count; + mcl->args[3] = domid; +} + +#endif /* _ASM_IA64_XEN_HYPERCALL_H */ diff --git a/arch/ia64/include/asm/xen/privop.h b/arch/ia64/include/asm/xen/privop.h new file mode 100644 index 0000000..71ec754 --- /dev/null +++ b/arch/ia64/include/asm/xen/privop.h @@ -0,0 +1,129 @@ +#ifndef _ASM_IA64_XEN_PRIVOP_H +#define _ASM_IA64_XEN_PRIVOP_H + +/* + * Copyright (C) 2005 Hewlett-Packard Co + * Dan Magenheimer + * + * Paravirtualizations of privileged operations for Xen/ia64 + * + * + * inline privop and paravirt_alt support + * Copyright (c) 2007 Isaku Yamahata + * VA Linux Systems Japan K.K. + * + */ + +#ifndef __ASSEMBLY__ +#include /* arch-ia64.h requires uint64_t */ +#endif +#include + +/* At 1 MB, before per-cpu space but still addressable using addl instead + of movl. */ +#define XSI_BASE 0xfffffffffff00000 + +/* Address of mapped regs. 
*/ +#define XMAPPEDREGS_BASE (XSI_BASE + XSI_SIZE) + +#ifdef __ASSEMBLY__ +#define XEN_HYPER_RFI break HYPERPRIVOP_RFI +#define XEN_HYPER_RSM_PSR_DT break HYPERPRIVOP_RSM_DT +#define XEN_HYPER_SSM_PSR_DT break HYPERPRIVOP_SSM_DT +#define XEN_HYPER_COVER break HYPERPRIVOP_COVER +#define XEN_HYPER_ITC_D break HYPERPRIVOP_ITC_D +#define XEN_HYPER_ITC_I break HYPERPRIVOP_ITC_I +#define XEN_HYPER_SSM_I break HYPERPRIVOP_SSM_I +#define XEN_HYPER_GET_IVR break HYPERPRIVOP_GET_IVR +#define XEN_HYPER_THASH break HYPERPRIVOP_THASH +#define XEN_HYPER_ITR_D break HYPERPRIVOP_ITR_D +#define XEN_HYPER_SET_KR break HYPERPRIVOP_SET_KR +#define XEN_HYPER_GET_PSR break HYPERPRIVOP_GET_PSR +#define XEN_HYPER_SET_RR0_TO_RR4 break HYPERPRIVOP_SET_RR0_TO_RR4 + +#define XSI_IFS (XSI_BASE + XSI_IFS_OFS) +#define XSI_PRECOVER_IFS (XSI_BASE + XSI_PRECOVER_IFS_OFS) +#define XSI_IFA (XSI_BASE + XSI_IFA_OFS) +#define XSI_ISR (XSI_BASE + XSI_ISR_OFS) +#define XSI_IIM (XSI_BASE + XSI_IIM_OFS) +#define XSI_ITIR (XSI_BASE + XSI_ITIR_OFS) +#define XSI_PSR_I_ADDR (XSI_BASE + XSI_PSR_I_ADDR_OFS) +#define XSI_PSR_IC (XSI_BASE + XSI_PSR_IC_OFS) +#define XSI_IPSR (XSI_BASE + XSI_IPSR_OFS) +#define XSI_IIP (XSI_BASE + XSI_IIP_OFS) +#define XSI_B1NAT (XSI_BASE + XSI_B1NATS_OFS) +#define XSI_BANK1_R16 (XSI_BASE + XSI_BANK1_R16_OFS) +#define XSI_BANKNUM (XSI_BASE + XSI_BANKNUM_OFS) +#define XSI_IHA (XSI_BASE + XSI_IHA_OFS) +#endif + +#ifndef __ASSEMBLY__ + +/************************************************/ +/* Instructions paravirtualized for correctness */ +/************************************************/ + +/* "fc" and "thash" are privilege-sensitive instructions, meaning they + * may have different semantics depending on whether they are executed + * at PL0 vs PL!=0. When paravirtualized, these instructions mustn't + * be allowed to execute directly, lest incorrect semantics result. */ +extern void xen_fc(unsigned long addr); +extern unsigned long xen_thash(unsigned long addr); + +/* Note that "ttag" and "cover" are also privilege-sensitive; "ttag" + * is not currently used (though it may be in a long-format VHPT system!) + * and the semantics of cover only change if psr.ic is off which is very + * rare (and currently non-existent outside of assembly code */ + +/* There are also privilege-sensitive registers. These registers are + * readable at any privilege level but only writable at PL0. */ +extern unsigned long xen_get_cpuid(int index); +extern unsigned long xen_get_pmd(int index); + +extern unsigned long xen_get_eflag(void); /* see xen_ia64_getreg */ +extern void xen_set_eflag(unsigned long); /* see xen_ia64_setreg */ + +/************************************************/ +/* Instructions paravirtualized for performance */ +/************************************************/ + +/* Xen uses memory-mapped virtual privileged registers for access to many + * performance-sensitive privileged registers. Some, like the processor + * status register (psr), are broken up into multiple memory locations. + * Others, like "pend", are abstractions based on privileged registers. + * "Pend" is guaranteed to be set if reading cr.ivr would return a + * (non-spurious) interrupt. */ +#define XEN_MAPPEDREGS ((struct mapped_regs *)XMAPPEDREGS_BASE) + +#define XSI_PSR_I \ + (*XEN_MAPPEDREGS->interrupt_mask_addr) +#define xen_get_virtual_psr_i() \ + (!XSI_PSR_I) +#define xen_set_virtual_psr_i(_val) \ + ({ XSI_PSR_I = (uint8_t)(_val) ? 0 : 1; }) +#define xen_set_virtual_psr_ic(_val) \ + ({ XEN_MAPPEDREGS->interrupt_collection_enabled = _val ? 
1 : 0; }) +#define xen_get_virtual_pend() \ + (*(((uint8_t *)XEN_MAPPEDREGS->interrupt_mask_addr) - 1)) + +/* Although all privileged operations can be left to trap and will + * be properly handled by Xen, some are frequent enough that we use + * hyperprivops for performance. */ +extern unsigned long xen_get_psr(void); +extern unsigned long xen_get_ivr(void); +extern unsigned long xen_get_tpr(void); +extern void xen_hyper_ssm_i(void); +extern void xen_set_itm(unsigned long); +extern void xen_set_tpr(unsigned long); +extern void xen_eoi(unsigned long); +extern unsigned long xen_get_rr(unsigned long index); +extern void xen_set_rr(unsigned long index, unsigned long val); +extern void xen_set_rr0_to_rr4(unsigned long val0, unsigned long val1, + unsigned long val2, unsigned long val3, + unsigned long val4); +extern void xen_set_kr(unsigned long index, unsigned long val); +extern void xen_ptcga(unsigned long addr, unsigned long size); + +#endif /* !__ASSEMBLY__ */ + +#endif /* _ASM_IA64_XEN_PRIVOP_H */ diff --git a/arch/ia64/xen/Makefile b/arch/ia64/xen/Makefile new file mode 100644 index 0000000..c200704 --- /dev/null +++ b/arch/ia64/xen/Makefile @@ -0,0 +1,5 @@ +# +# Makefile for Xen components +# + +obj-y := hypercall.o diff --git a/arch/ia64/xen/hypercall.S b/arch/ia64/xen/hypercall.S new file mode 100644 index 0000000..d4ff0b9 --- /dev/null +++ b/arch/ia64/xen/hypercall.S @@ -0,0 +1,91 @@ +/* + * Support routines for Xen hypercalls + * + * Copyright (C) 2005 Dan Magenheimer + * Copyright (C) 2008 Yaozu (Eddie) Dong + */ + +#include +#include +#include + +/* + * Hypercalls without parameter. + */ +#define __HCALL0(name,hcall) \ + GLOBAL_ENTRY(name); \ + break hcall; \ + br.ret.sptk.many rp; \ + END(name) + +/* + * Hypercalls with 1 parameter. + */ +#define __HCALL1(name,hcall) \ + GLOBAL_ENTRY(name); \ + mov r8=r32; \ + break hcall; \ + br.ret.sptk.many rp; \ + END(name) + +/* + * Hypercalls with 2 parameters. 
+ */ +#define __HCALL2(name,hcall) \ + GLOBAL_ENTRY(name); \ + mov r8=r32; \ + mov r9=r33; \ + break hcall; \ + br.ret.sptk.many rp; \ + END(name) + +__HCALL0(xen_get_psr, HYPERPRIVOP_GET_PSR) +__HCALL0(xen_get_ivr, HYPERPRIVOP_GET_IVR) +__HCALL0(xen_get_tpr, HYPERPRIVOP_GET_TPR) +__HCALL0(xen_hyper_ssm_i, HYPERPRIVOP_SSM_I) + +__HCALL1(xen_set_tpr, HYPERPRIVOP_SET_TPR) +__HCALL1(xen_eoi, HYPERPRIVOP_EOI) +__HCALL1(xen_thash, HYPERPRIVOP_THASH) +__HCALL1(xen_set_itm, HYPERPRIVOP_SET_ITM) +__HCALL1(xen_get_rr, HYPERPRIVOP_GET_RR) +__HCALL1(xen_fc, HYPERPRIVOP_FC) +__HCALL1(xen_get_cpuid, HYPERPRIVOP_GET_CPUID) +__HCALL1(xen_get_pmd, HYPERPRIVOP_GET_PMD) + +__HCALL2(xen_ptcga, HYPERPRIVOP_PTC_GA) +__HCALL2(xen_set_rr, HYPERPRIVOP_SET_RR) +__HCALL2(xen_set_kr, HYPERPRIVOP_SET_KR) + +#ifdef CONFIG_IA32_SUPPORT +__HCALL1(xen_get_eflag, HYPERPRIVOP_GET_EFLAG) +__HCALL1(xen_set_eflag, HYPERPRIVOP_SET_EFLAG) // refer SDM vol1 3.1.8 +#endif /* CONFIG_IA32_SUPPORT */ + +GLOBAL_ENTRY(xen_set_rr0_to_rr4) + mov r8=r32 + mov r9=r33 + mov r10=r34 + mov r11=r35 + mov r14=r36 + XEN_HYPER_SET_RR0_TO_RR4 + br.ret.sptk.many rp + ;; +END(xen_set_rr0_to_rr4) + +GLOBAL_ENTRY(xen_send_ipi) + mov r14=r32 + mov r15=r33 + mov r2=0x400 + break 0x1000 + ;; + br.ret.sptk.many rp + ;; +END(xen_send_ipi) + +GLOBAL_ENTRY(__hypercall) + mov r2=r37 + break 0x1000 + br.ret.sptk.many b0 + ;; +END(__hypercall) -- 1.6.0.2 From yamahata at valinux.co.jp Thu Oct 16 19:18:06 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Fri, 17 Oct 2008 11:18:06 +0900 Subject: [PATCH 26/33] ia64/pv_ops/xen: define the nubmer of irqs which xen needs. In-Reply-To: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> References: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224209893-2032-27-git-send-email-yamahata@valinux.co.jp> define arch/ia64/include/asm/xen/irq.h to define the number of irqs which xen needs. Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/xen/irq.h | 44 +++++++++++++++++++++++++++++++++++++++ arch/ia64/kernel/nr-irqs.c | 1 + 2 files changed, 45 insertions(+), 0 deletions(-) create mode 100644 arch/ia64/include/asm/xen/irq.h diff --git a/arch/ia64/include/asm/xen/irq.h b/arch/ia64/include/asm/xen/irq.h new file mode 100644 index 0000000..a904509 --- /dev/null +++ b/arch/ia64/include/asm/xen/irq.h @@ -0,0 +1,44 @@ +/****************************************************************************** + * arch/ia64/include/asm/xen/irq.h + * + * Copyright (c) 2008 Isaku Yamahata + * VA Linux Systems Japan K.K. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + */ + +#ifndef _ASM_IA64_XEN_IRQ_H +#define _ASM_IA64_XEN_IRQ_H + +/* + * The flat IRQ space is divided into two regions: + * 1. A one-to-one mapping of real physical IRQs. This space is only used + * if we have physical device-access privilege. 
This region is at the + * start of the IRQ space so that existing device drivers do not need + * to be modified to translate physical IRQ numbers into our IRQ space. + * 3. A dynamic mapping of inter-domain and Xen-sourced virtual IRQs. These + * are bound using the provided bind/unbind functions. + */ + +#define XEN_PIRQ_BASE 0 +#define XEN_NR_PIRQS 256 + +#define XEN_DYNIRQ_BASE (XEN_PIRQ_BASE + XEN_NR_PIRQS) +#define XEN_NR_DYNIRQS (NR_CPUS * 8) + +#define XEN_NR_IRQS (XEN_NR_PIRQS + XEN_NR_DYNIRQS) + +#endif /* _ASM_IA64_XEN_IRQ_H */ diff --git a/arch/ia64/kernel/nr-irqs.c b/arch/ia64/kernel/nr-irqs.c index 8273afc..ee56457 100644 --- a/arch/ia64/kernel/nr-irqs.c +++ b/arch/ia64/kernel/nr-irqs.c @@ -10,6 +10,7 @@ #include #include #include +#include void foo(void) { -- 1.6.0.2 From yamahata at valinux.co.jp Thu Oct 16 19:18:03 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Fri, 17 Oct 2008 11:18:03 +0900 Subject: [PATCH 23/33] ia64/pv_ops/xen: paravirtualize ivt.S for xen. In-Reply-To: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> References: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224209893-2032-24-git-send-email-yamahata@valinux.co.jp> paravirtualize ivt.S for xen by multi compile. Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/xen/inst.h | 1 + arch/ia64/xen/Makefile | 16 +++++++++++- arch/ia64/xen/xenivt.S | 52 ++++++++++++++++++++++++++++++++++++++ 3 files changed, 68 insertions(+), 1 deletions(-) create mode 100644 arch/ia64/xen/xenivt.S diff --git a/arch/ia64/include/asm/xen/inst.h b/arch/ia64/include/asm/xen/inst.h index 1e92ed0..e6a25c3 100644 --- a/arch/ia64/include/asm/xen/inst.h +++ b/arch/ia64/include/asm/xen/inst.h @@ -22,6 +22,7 @@ #include +#define ia64_ivt xen_ivt #define DO_SAVE_MIN XEN_DO_SAVE_MIN #define MOV_FROM_IFA(reg) \ diff --git a/arch/ia64/xen/Makefile b/arch/ia64/xen/Makefile index 7cb4247..5c87e4a 100644 --- a/arch/ia64/xen/Makefile +++ b/arch/ia64/xen/Makefile @@ -2,5 +2,19 @@ # Makefile for Xen components # -obj-y := hypercall.o xensetup.o xen_pv_ops.o \ +obj-y := hypercall.o xenivt.o xensetup.o xen_pv_ops.o \ hypervisor.o xencomm.o xcom_hcall.o grant-table.o + +AFLAGS_xenivt.o += -D__IA64_ASM_PARAVIRTUALIZED_XEN + +# xen multi compile +ASM_PARAVIRT_MULTI_COMPILE_SRCS = ivt.S +ASM_PARAVIRT_OBJS = $(addprefix xen-,$(ASM_PARAVIRT_MULTI_COMPILE_SRCS:.S=.o)) +obj-y += $(ASM_PARAVIRT_OBJS) +define paravirtualized_xen +AFLAGS_$(1) += -D__IA64_ASM_PARAVIRTUALIZED_XEN +endef +$(foreach o,$(ASM_PARAVIRT_OBJS),$(eval $(call paravirtualized_xen,$(o)))) + +$(obj)/xen-%.o: $(src)/../kernel/%.S FORCE + $(call if_changed_dep,as_o_S) diff --git a/arch/ia64/xen/xenivt.S b/arch/ia64/xen/xenivt.S new file mode 100644 index 0000000..3e71d50 --- /dev/null +++ b/arch/ia64/xen/xenivt.S @@ -0,0 +1,52 @@ +/* + * arch/ia64/xen/ivt.S + * + * Copyright (C) 2005 Hewlett-Packard Co + * Dan Magenheimer + * + * Copyright (c) 2008 Isaku Yamahata + * VA Linux Systems Japan K.K. + * pv_ops. 
+ */ + +#include +#include +#include + +#include "../kernel/minstate.h" + + .section .text,"ax" +GLOBAL_ENTRY(xen_event_callback) + mov r31=pr // prepare to save predicates + ;; + SAVE_MIN_WITH_COVER // uses r31; defines r2 and r3 + ;; + movl r3=XSI_PSR_IC + mov r14=1 + ;; + st4 [r3]=r14 + ;; + adds r3=8,r2 // set up second base pointer for SAVE_REST + srlz.i // ensure everybody knows psr.ic is back on + ;; + SAVE_REST + ;; +1: + alloc r14=ar.pfs,0,0,1,0 // must be first in an insn group + add out0=16,sp // pass pointer to pt_regs as first arg + ;; + br.call.sptk.many b0=xen_evtchn_do_upcall + ;; + movl r20=XSI_PSR_I_ADDR + ;; + ld8 r20=[r20] + ;; + adds r20=-1,r20 // vcpu_info->evtchn_upcall_pending + ;; + ld1 r20=[r20] + ;; + cmp.ne p6,p0=r20,r0 // if there are pending events, + (p6) br.spnt.few 1b // call evtchn_do_upcall again. + br.sptk.many xen_leave_kernel // we know ia64_leave_kernel is + // paravirtualized as xen_leave_kernel +END(xen_event_callback) -- 1.6.0.2 From yamahata at valinux.co.jp Thu Oct 16 19:18:05 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Fri, 17 Oct 2008 11:18:05 +0900 Subject: [PATCH 25/33] ia64/pv_ops/xen: implement xen pv_iosapic_ops. In-Reply-To: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> References: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224209893-2032-26-git-send-email-yamahata@valinux.co.jp> implement xen pv_iosapic_ops for xen paravirtualized iosapic. Signed-off-by: Isaku Yamahata --- arch/ia64/xen/xen_pv_ops.c | 52 ++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 52 insertions(+), 0 deletions(-) diff --git a/arch/ia64/xen/xen_pv_ops.c b/arch/ia64/xen/xen_pv_ops.c index 5b23cd5..41a6cbf 100644 --- a/arch/ia64/xen/xen_pv_ops.c +++ b/arch/ia64/xen/xen_pv_ops.c @@ -292,6 +292,57 @@ const struct pv_cpu_asm_switch xen_cpu_asm_switch = { }; /*************************************************************************** + * pv_iosapic_ops + * iosapic read/write hooks. 
+ */ +static void +xen_pcat_compat_init(void) +{ + /* nothing */ +} + +static struct irq_chip* +xen_iosapic_get_irq_chip(unsigned long trigger) +{ + return NULL; +} + +static unsigned int +xen_iosapic_read(char __iomem *iosapic, unsigned int reg) +{ + struct physdev_apic apic_op; + int ret; + + apic_op.apic_physbase = (unsigned long)iosapic - + __IA64_UNCACHED_OFFSET; + apic_op.reg = reg; + ret = HYPERVISOR_physdev_op(PHYSDEVOP_apic_read, &apic_op); + if (ret) + return ret; + return apic_op.value; +} + +static void +xen_iosapic_write(char __iomem *iosapic, unsigned int reg, u32 val) +{ + struct physdev_apic apic_op; + + apic_op.apic_physbase = (unsigned long)iosapic - + __IA64_UNCACHED_OFFSET; + apic_op.reg = reg; + apic_op.value = val; + HYPERVISOR_physdev_op(PHYSDEVOP_apic_write, &apic_op); +} + +static const struct pv_iosapic_ops xen_iosapic_ops __initdata = { + .pcat_compat_init = xen_pcat_compat_init, + .__get_irq_chip = xen_iosapic_get_irq_chip, + + .__read = xen_iosapic_read, + .__write = xen_iosapic_write, +}; + +/*************************************************************************** * pv_ops initialization */ @@ -302,6 +353,7 @@ xen_setup_pv_ops(void) pv_info = xen_info; pv_init_ops = xen_init_ops; pv_cpu_ops = xen_cpu_ops; + pv_iosapic_ops = xen_iosapic_ops; paravirt_cpu_asm_init(&xen_cpu_asm_switch); } -- 1.6.0.2 From yamahata at valinux.co.jp Thu Oct 16 19:17:59 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Fri, 17 Oct 2008 11:17:59 +0900 Subject: [PATCH 19/33] ia64/pv_ops/xen: define xen pv_init_ops for various xen initialization. In-Reply-To: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> References: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224209893-2032-20-git-send-email-yamahata@valinux.co.jp> This patch implements xen version of pv_init_ops to do various xen initialization. This patch also includes ia64 counter part of x86 xen early printk support patches. Signed-off-by: Akio Takebe Signed-off-by: Alex Williamson Signed-off-by: Yaozu (Eddie) Dong Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/xen/hypervisor.h | 14 ++++ arch/ia64/xen/Makefile | 2 +- arch/ia64/xen/hypervisor.c | 96 ++++++++++++++++++++++++++++ arch/ia64/xen/xen_pv_ops.c | 110 ++++++++++++++++++++++++++++++++ 4 files changed, 221 insertions(+), 1 deletions(-) create mode 100644 arch/ia64/xen/hypervisor.c diff --git a/arch/ia64/include/asm/xen/hypervisor.h b/arch/ia64/include/asm/xen/hypervisor.h index d1f84e1..7a804e8 100644 --- a/arch/ia64/include/asm/xen/hypervisor.h +++ b/arch/ia64/include/asm/xen/hypervisor.h @@ -59,8 +59,22 @@ extern enum xen_domain_type xen_domain_type; /* deprecated. 
remove this */ #define is_running_on_xen() (xen_domain_type == XEN_PV_DOMAIN) +extern struct shared_info *HYPERVISOR_shared_info; extern struct start_info *xen_start_info; +void __init xen_setup_vcpu_info_placement(void); +void force_evtchn_callback(void); + +/* for drivers/xen/balloon/balloon.c */ +#ifdef CONFIG_XEN_SCRUB_PAGES +#define scrub_pages(_p, _n) memset((void *)(_p), 0, (_n) << PAGE_SHIFT) +#else +#define scrub_pages(_p, _n) ((void)0) +#endif + +/* For setup_arch() in arch/ia64/kernel/setup.c */ +void xen_ia64_enable_opt_feature(void); + #else /* CONFIG_XEN */ #define xen_domain() (0) diff --git a/arch/ia64/xen/Makefile b/arch/ia64/xen/Makefile index abc356f..7cb4247 100644 --- a/arch/ia64/xen/Makefile +++ b/arch/ia64/xen/Makefile @@ -3,4 +3,4 @@ # obj-y := hypercall.o xensetup.o xen_pv_ops.o \ - xencomm.o xcom_hcall.o grant-table.o + hypervisor.o xencomm.o xcom_hcall.o grant-table.o diff --git a/arch/ia64/xen/hypervisor.c b/arch/ia64/xen/hypervisor.c new file mode 100644 index 0000000..cac4d97 --- /dev/null +++ b/arch/ia64/xen/hypervisor.c @@ -0,0 +1,96 @@ +/****************************************************************************** + * arch/ia64/xen/hypervisor.c + * + * Copyright (c) 2006 Isaku Yamahata + * VA Linux Systems Japan K.K. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + */ + +#include +#include +#include + +#include "irq_xen.h" + +struct shared_info *HYPERVISOR_shared_info __read_mostly = + (struct shared_info *)XSI_BASE; +EXPORT_SYMBOL(HYPERVISOR_shared_info); + +DEFINE_PER_CPU(struct vcpu_info *, xen_vcpu); + +struct start_info *xen_start_info; +EXPORT_SYMBOL(xen_start_info); + +EXPORT_SYMBOL(xen_domain_type); + +EXPORT_SYMBOL(__hypercall); + +/* Stolen from arch/x86/xen/enlighten.c */ +/* + * Flag to determine whether vcpu info placement is available on all + * VCPUs. We assume it is to start with, and then set it to zero on + * the first failure. This is because it can succeed on some VCPUs + * and not others, since it can involve hypervisor memory allocation, + * or because the guest failed to guarantee all the appropriate + * constraints on all VCPUs (ie buffer can't cross a page boundary). + * + * Note that any particular CPU may be using a placed vcpu structure, + * but we can only optimise if the all are. 
+ * + * 0: not available, 1: available + */ + +static void __init xen_vcpu_setup(int cpu) +{ + /* + * WARNING: + * before changing MAX_VIRT_CPUS, + * check that shared_info fits on a page + */ + BUILD_BUG_ON(sizeof(struct shared_info) > PAGE_SIZE); + per_cpu(xen_vcpu, cpu) = &HYPERVISOR_shared_info->vcpu_info[cpu]; +} + +void __init xen_setup_vcpu_info_placement(void) +{ + int cpu; + + for_each_possible_cpu(cpu) + xen_vcpu_setup(cpu); +} + +void __cpuinit +xen_cpu_init(void) +{ + xen_smp_intr_init(); +} + +/************************************************************************** + * opt feature + */ +void +xen_ia64_enable_opt_feature(void) +{ + /* Enable region 7 identity map optimizations in Xen */ + struct xen_ia64_opt_feature optf; + + optf.cmd = XEN_IA64_OPTF_IDENT_MAP_REG7; + optf.on = XEN_IA64_OPTF_ON; + optf.pgprot = pgprot_val(PAGE_KERNEL); + optf.key = 0; /* No key on linux. */ + HYPERVISOR_opt_feature(&optf); +} diff --git a/arch/ia64/xen/xen_pv_ops.c b/arch/ia64/xen/xen_pv_ops.c index 77db214..fc9d599 100644 --- a/arch/ia64/xen/xen_pv_ops.c +++ b/arch/ia64/xen/xen_pv_ops.c @@ -54,6 +54,115 @@ xen_info_init(void) } /*************************************************************************** + * pv_init_ops + * initialization hooks. + */ + +static void +xen_panic_hypercall(struct unw_frame_info *info, void *arg) +{ + current->thread.ksp = (__u64)info->sw - 16; + HYPERVISOR_shutdown(SHUTDOWN_crash); + /* we're never actually going to get here... */ +} + +static int +xen_panic_event(struct notifier_block *this, unsigned long event, void *ptr) +{ + unw_init_running(xen_panic_hypercall, NULL); + /* we're never actually going to get here... */ + return NOTIFY_DONE; +} + +static struct notifier_block xen_panic_block = { + xen_panic_event, NULL, 0 /* try to go last */ +}; + +static void xen_pm_power_off(void) +{ + local_irq_disable(); + HYPERVISOR_shutdown(SHUTDOWN_poweroff); +} + +static void __init +xen_banner(void) +{ + printk(KERN_INFO + "Running on Xen! pl = %d start_info_pfn=0x%lx nr_pages=%ld " + "flags=0x%x\n", + xen_info.kernel_rpl, + HYPERVISOR_shared_info->arch.start_info_pfn, + xen_start_info->nr_pages, xen_start_info->flags); +} + +static int __init +xen_reserve_memory(struct rsvd_region *region) +{ + region->start = (unsigned long)__va( + (HYPERVISOR_shared_info->arch.start_info_pfn << PAGE_SHIFT)); + region->end = region->start + PAGE_SIZE; + return 1; +} + +static void __init +xen_arch_setup_early(void) +{ + struct shared_info *s; + BUG_ON(!xen_pv_domain()); + + s = HYPERVISOR_shared_info; + xen_start_info = __va(s->arch.start_info_pfn << PAGE_SHIFT); + + /* Must be done before any hypercall. */ + xencomm_initialize(); + + xen_setup_features(); + /* Register a call for panic conditions. 
*/ + atomic_notifier_chain_register(&panic_notifier_list, + &xen_panic_block); + pm_power_off = xen_pm_power_off; + + xen_ia64_enable_opt_feature(); +} + +static void __init +xen_arch_setup_console(char **cmdline_p) +{ + add_preferred_console("xenboot", 0, NULL); + add_preferred_console("tty", 0, NULL); + /* use hvc_xen */ + add_preferred_console("hvc", 0, NULL); + +#if !defined(CONFIG_VT) || !defined(CONFIG_DUMMY_CONSOLE) + conswitchp = NULL; +#endif +} + +static int __init +xen_arch_setup_nomca(void) +{ + return 1; +} + +static void __init +xen_post_smp_prepare_boot_cpu(void) +{ + xen_setup_vcpu_info_placement(); +} + +static const struct pv_init_ops xen_init_ops __initdata = { + .banner = xen_banner, + + .reserve_memory = xen_reserve_memory, + + .arch_setup_early = xen_arch_setup_early, + .arch_setup_console = xen_arch_setup_console, + .arch_setup_nomca = xen_arch_setup_nomca, + + .post_smp_prepare_boot_cpu = xen_post_smp_prepare_boot_cpu, +}; + +/*************************************************************************** * pv_ops initialization */ @@ -62,4 +171,5 @@ xen_setup_pv_ops(void) { xen_info_init(); pv_info = xen_info; + pv_init_ops = xen_init_ops; } -- 1.6.0.2 From yamahata at valinux.co.jp Thu Oct 16 19:18:04 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Fri, 17 Oct 2008 11:18:04 +0900 Subject: [PATCH 24/33] ia64/pv_ops/xen: paravirtualize entry.S for ia64/xen. In-Reply-To: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> References: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224209893-2032-25-git-send-email-yamahata@valinux.co.jp> paravirtualize entry.S for ia64/xen by multi compile. Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/xen/inst.h | 8 ++++++++ arch/ia64/xen/Makefile | 2 +- arch/ia64/xen/xen_pv_ops.c | 18 ++++++++++++++++++ 3 files changed, 27 insertions(+), 1 deletions(-) diff --git a/arch/ia64/include/asm/xen/inst.h b/arch/ia64/include/asm/xen/inst.h index e6a25c3..19c2ae1 100644 --- a/arch/ia64/include/asm/xen/inst.h +++ b/arch/ia64/include/asm/xen/inst.h @@ -25,6 +25,14 @@ #define ia64_ivt xen_ivt #define DO_SAVE_MIN XEN_DO_SAVE_MIN +#define __paravirt_switch_to xen_switch_to +#define __paravirt_leave_syscall xen_leave_syscall +#define __paravirt_work_processed_syscall xen_work_processed_syscall +#define __paravirt_leave_kernel xen_leave_kernel +#define __paravirt_pending_syscall_end xen_work_pending_syscall_end +#define __paravirt_work_processed_syscall_target \ + xen_work_processed_syscall + #define MOV_FROM_IFA(reg) \ movl reg = XSI_IFA; \ ;; \ diff --git a/arch/ia64/xen/Makefile b/arch/ia64/xen/Makefile index 5c87e4a..9b77e8a 100644 --- a/arch/ia64/xen/Makefile +++ b/arch/ia64/xen/Makefile @@ -8,7 +8,7 @@ obj-y := hypercall.o xenivt.o xensetup.o xen_pv_ops.o \ AFLAGS_xenivt.o += -D__IA64_ASM_PARAVIRTUALIZED_XEN # xen multi compile -ASM_PARAVIRT_MULTI_COMPILE_SRCS = ivt.S +ASM_PARAVIRT_MULTI_COMPILE_SRCS = ivt.S entry.S ASM_PARAVIRT_OBJS = $(addprefix xen-,$(ASM_PARAVIRT_MULTI_COMPILE_SRCS:.S=.o)) obj-y += $(ASM_PARAVIRT_OBJS) define paravirtualized_xen diff --git a/arch/ia64/xen/xen_pv_ops.c b/arch/ia64/xen/xen_pv_ops.c index c236f04..5b23cd5 100644 --- a/arch/ia64/xen/xen_pv_ops.c +++ b/arch/ia64/xen/xen_pv_ops.c @@ -275,6 +275,22 @@ static const struct pv_cpu_ops xen_cpu_ops __initdata = { = xen_intrin_local_irq_restore, }; +/****************************************************************************** + * replacement of hand written assembly codes. 
+ */ + +extern char xen_switch_to; +extern char xen_leave_syscall; +extern char xen_work_processed_syscall; +extern char xen_leave_kernel; + +const struct pv_cpu_asm_switch xen_cpu_asm_switch = { + .switch_to = (unsigned long)&xen_switch_to, + .leave_syscall = (unsigned long)&xen_leave_syscall, + .work_processed_syscall = (unsigned long)&xen_work_processed_syscall, + .leave_kernel = (unsigned long)&xen_leave_kernel, +}; + /*************************************************************************** * pv_ops initialization */ @@ -286,4 +302,6 @@ xen_setup_pv_ops(void) pv_info = xen_info; pv_init_ops = xen_init_ops; pv_cpu_ops = xen_cpu_ops; + + paravirt_cpu_asm_init(&xen_cpu_asm_switch); } -- 1.6.0.2 From yamahata at valinux.co.jp Thu Oct 16 19:18:07 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Fri, 17 Oct 2008 11:18:07 +0900 Subject: [PATCH 27/33] ia64/pv_ops/xen: implement xen pv_irq_ops. In-Reply-To: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> References: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224209893-2032-28-git-send-email-yamahata@valinux.co.jp> implement xen pv_irq_ops to paravirtualize irq handling with xen event channel. Cc: Jeremy Fitzhardinge Signed-off-by: Akio Takebe Signed-off-by: Alex Williamson Signed-off-by: Isaku Yamahata --- arch/ia64/xen/Makefile | 2 +- arch/ia64/xen/irq_xen.c | 435 ++++++++++++++++++++++++++++++++++++++++++++ arch/ia64/xen/irq_xen.h | 34 ++++ arch/ia64/xen/xen_pv_ops.c | 3 + 4 files changed, 473 insertions(+), 1 deletions(-) create mode 100644 arch/ia64/xen/irq_xen.c create mode 100644 arch/ia64/xen/irq_xen.h diff --git a/arch/ia64/xen/Makefile b/arch/ia64/xen/Makefile index 9b77e8a..01c4289 100644 --- a/arch/ia64/xen/Makefile +++ b/arch/ia64/xen/Makefile @@ -2,7 +2,7 @@ # Makefile for Xen components # -obj-y := hypercall.o xenivt.o xensetup.o xen_pv_ops.o \ +obj-y := hypercall.o xenivt.o xensetup.o xen_pv_ops.o irq_xen.o \ hypervisor.o xencomm.o xcom_hcall.o grant-table.o AFLAGS_xenivt.o += -D__IA64_ASM_PARAVIRTUALIZED_XEN diff --git a/arch/ia64/xen/irq_xen.c b/arch/ia64/xen/irq_xen.c new file mode 100644 index 0000000..af93aad --- /dev/null +++ b/arch/ia64/xen/irq_xen.c @@ -0,0 +1,435 @@ +/****************************************************************************** + * arch/ia64/xen/irq_xen.c + * + * Copyright (c) 2008 Isaku Yamahata + * VA Linux Systems Japan K.K. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + */ + +#include + +#include +#include +#include + +#include + +#include "irq_xen.h" + +/*************************************************************************** + * pv_irq_ops + * irq operations + */ + +static int +xen_assign_irq_vector(int irq) +{ + struct physdev_irq irq_op; + + irq_op.irq = irq; + if (HYPERVISOR_physdev_op(PHYSDEVOP_alloc_irq_vector, &irq_op)) + return -ENOSPC; + + return irq_op.vector; +} + +static void +xen_free_irq_vector(int vector) +{ + struct physdev_irq irq_op; + + if (vector < IA64_FIRST_DEVICE_VECTOR || + vector > IA64_LAST_DEVICE_VECTOR) + return; + + irq_op.vector = vector; + if (HYPERVISOR_physdev_op(PHYSDEVOP_free_irq_vector, &irq_op)) + printk(KERN_WARNING "%s: xen_free_irq_vecotr fail vector=%d\n", + __func__, vector); +} + + +static DEFINE_PER_CPU(int, timer_irq) = -1; +static DEFINE_PER_CPU(int, ipi_irq) = -1; +static DEFINE_PER_CPU(int, resched_irq) = -1; +static DEFINE_PER_CPU(int, cmc_irq) = -1; +static DEFINE_PER_CPU(int, cmcp_irq) = -1; +static DEFINE_PER_CPU(int, cpep_irq) = -1; +#define NAME_SIZE 15 +static DEFINE_PER_CPU(char[NAME_SIZE], timer_name); +static DEFINE_PER_CPU(char[NAME_SIZE], ipi_name); +static DEFINE_PER_CPU(char[NAME_SIZE], resched_name); +static DEFINE_PER_CPU(char[NAME_SIZE], cmc_name); +static DEFINE_PER_CPU(char[NAME_SIZE], cmcp_name); +static DEFINE_PER_CPU(char[NAME_SIZE], cpep_name); +#undef NAME_SIZE + +struct saved_irq { + unsigned int irq; + struct irqaction *action; +}; +/* 16 should be far optimistic value, since only several percpu irqs + * are registered early. + */ +#define MAX_LATE_IRQ 16 +static struct saved_irq saved_percpu_irqs[MAX_LATE_IRQ]; +static unsigned short late_irq_cnt; +static unsigned short saved_irq_cnt; +static int xen_slab_ready; + +#ifdef CONFIG_SMP +/* Dummy stub. Though we may check XEN_RESCHEDULE_VECTOR before __do_IRQ, + * it ends up to issue several memory accesses upon percpu data and + * thus adds unnecessary traffic to other paths. + */ +static irqreturn_t +xen_dummy_handler(int irq, void *dev_id) +{ + + return IRQ_HANDLED; +} + +static struct irqaction xen_ipi_irqaction = { + .handler = handle_IPI, + .flags = IRQF_DISABLED, + .name = "IPI" +}; + +static struct irqaction xen_resched_irqaction = { + .handler = xen_dummy_handler, + .flags = IRQF_DISABLED, + .name = "resched" +}; + +static struct irqaction xen_tlb_irqaction = { + .handler = xen_dummy_handler, + .flags = IRQF_DISABLED, + .name = "tlb_flush" +}; +#endif + +/* + * This is xen version percpu irq registration, which needs bind + * to xen specific evtchn sub-system. One trick here is that xen + * evtchn binding interface depends on kmalloc because related + * port needs to be freed at device/cpu down. So we cache the + * registration on BSP before slab is ready and then deal them + * at later point. For rest instances happening after slab ready, + * we hook them to xen evtchn immediately. + * + * FIXME: MCA is not supported by far, and thus "nomca" boot param is + * required. 
+ */ +static void +__xen_register_percpu_irq(unsigned int cpu, unsigned int vec, + struct irqaction *action, int save) +{ + irq_desc_t *desc; + int irq = 0; + + if (xen_slab_ready) { + switch (vec) { + case IA64_TIMER_VECTOR: + snprintf(per_cpu(timer_name, cpu), + sizeof(per_cpu(timer_name, cpu)), + "%s%d", action->name, cpu); + irq = bind_virq_to_irqhandler(VIRQ_ITC, cpu, + action->handler, action->flags, + per_cpu(timer_name, cpu), action->dev_id); + per_cpu(timer_irq, cpu) = irq; + break; + case IA64_IPI_RESCHEDULE: + snprintf(per_cpu(resched_name, cpu), + sizeof(per_cpu(resched_name, cpu)), + "%s%d", action->name, cpu); + irq = bind_ipi_to_irqhandler(XEN_RESCHEDULE_VECTOR, cpu, + action->handler, action->flags, + per_cpu(resched_name, cpu), action->dev_id); + per_cpu(resched_irq, cpu) = irq; + break; + case IA64_IPI_VECTOR: + snprintf(per_cpu(ipi_name, cpu), + sizeof(per_cpu(ipi_name, cpu)), + "%s%d", action->name, cpu); + irq = bind_ipi_to_irqhandler(XEN_IPI_VECTOR, cpu, + action->handler, action->flags, + per_cpu(ipi_name, cpu), action->dev_id); + per_cpu(ipi_irq, cpu) = irq; + break; + case IA64_CMC_VECTOR: + snprintf(per_cpu(cmc_name, cpu), + sizeof(per_cpu(cmc_name, cpu)), + "%s%d", action->name, cpu); + irq = bind_virq_to_irqhandler(VIRQ_MCA_CMC, cpu, + action->handler, + action->flags, + per_cpu(cmc_name, cpu), + action->dev_id); + per_cpu(cmc_irq, cpu) = irq; + break; + case IA64_CMCP_VECTOR: + snprintf(per_cpu(cmcp_name, cpu), + sizeof(per_cpu(cmcp_name, cpu)), + "%s%d", action->name, cpu); + irq = bind_ipi_to_irqhandler(XEN_CMCP_VECTOR, cpu, + action->handler, + action->flags, + per_cpu(cmcp_name, cpu), + action->dev_id); + per_cpu(cmcp_irq, cpu) = irq; + break; + case IA64_CPEP_VECTOR: + snprintf(per_cpu(cpep_name, cpu), + sizeof(per_cpu(cpep_name, cpu)), + "%s%d", action->name, cpu); + irq = bind_ipi_to_irqhandler(XEN_CPEP_VECTOR, cpu, + action->handler, + action->flags, + per_cpu(cpep_name, cpu), + action->dev_id); + per_cpu(cpep_irq, cpu) = irq; + break; + case IA64_CPE_VECTOR: + case IA64_MCA_RENDEZ_VECTOR: + case IA64_PERFMON_VECTOR: + case IA64_MCA_WAKEUP_VECTOR: + case IA64_SPURIOUS_INT_VECTOR: + /* No need to complain, these aren't supported. */ + break; + default: + printk(KERN_WARNING "Percpu irq %d is unsupported " + "by xen!\n", vec); + break; + } + BUG_ON(irq < 0); + + if (irq > 0) { + /* + * Mark percpu. Without this, migrate_irqs() will + * mark the interrupt for migrations and trigger it + * on cpu hotplug. + */ + desc = irq_desc + irq; + desc->status |= IRQ_PER_CPU; + } + } + + /* For BSP, we cache registered percpu irqs, and then re-walk + * them when initializing APs + */ + if (!cpu && save) { + BUG_ON(saved_irq_cnt == MAX_LATE_IRQ); + saved_percpu_irqs[saved_irq_cnt].irq = vec; + saved_percpu_irqs[saved_irq_cnt].action = action; + saved_irq_cnt++; + if (!xen_slab_ready) + late_irq_cnt++; + } +} + +static void +xen_register_percpu_irq(ia64_vector vec, struct irqaction *action) +{ + __xen_register_percpu_irq(smp_processor_id(), vec, action, 1); +} + +static void +xen_bind_early_percpu_irq(void) +{ + int i; + + xen_slab_ready = 1; + /* There's no race when accessing this cached array, since only + * BSP will face with such step shortly + */ + for (i = 0; i < late_irq_cnt; i++) + __xen_register_percpu_irq(smp_processor_id(), + saved_percpu_irqs[i].irq, + saved_percpu_irqs[i].action, 0); +} + +/* FIXME: There's no obvious point to check whether slab is ready. So + * a hack is used here by utilizing a late time hook. 
+ */ + +#ifdef CONFIG_HOTPLUG_CPU +static int __devinit +unbind_evtchn_callback(struct notifier_block *nfb, + unsigned long action, void *hcpu) +{ + unsigned int cpu = (unsigned long)hcpu; + + if (action == CPU_DEAD) { + /* Unregister evtchn. */ + if (per_cpu(cpep_irq, cpu) >= 0) { + unbind_from_irqhandler(per_cpu(cpep_irq, cpu), NULL); + per_cpu(cpep_irq, cpu) = -1; + } + if (per_cpu(cmcp_irq, cpu) >= 0) { + unbind_from_irqhandler(per_cpu(cmcp_irq, cpu), NULL); + per_cpu(cmcp_irq, cpu) = -1; + } + if (per_cpu(cmc_irq, cpu) >= 0) { + unbind_from_irqhandler(per_cpu(cmc_irq, cpu), NULL); + per_cpu(cmc_irq, cpu) = -1; + } + if (per_cpu(ipi_irq, cpu) >= 0) { + unbind_from_irqhandler(per_cpu(ipi_irq, cpu), NULL); + per_cpu(ipi_irq, cpu) = -1; + } + if (per_cpu(resched_irq, cpu) >= 0) { + unbind_from_irqhandler(per_cpu(resched_irq, cpu), + NULL); + per_cpu(resched_irq, cpu) = -1; + } + if (per_cpu(timer_irq, cpu) >= 0) { + unbind_from_irqhandler(per_cpu(timer_irq, cpu), NULL); + per_cpu(timer_irq, cpu) = -1; + } + } + return NOTIFY_OK; +} + +static struct notifier_block unbind_evtchn_notifier = { + .notifier_call = unbind_evtchn_callback, + .priority = 0 +}; +#endif + +void xen_smp_intr_init_early(unsigned int cpu) +{ +#ifdef CONFIG_SMP + unsigned int i; + + for (i = 0; i < saved_irq_cnt; i++) + __xen_register_percpu_irq(cpu, saved_percpu_irqs[i].irq, + saved_percpu_irqs[i].action, 0); +#endif +} + +void xen_smp_intr_init(void) +{ +#ifdef CONFIG_SMP + unsigned int cpu = smp_processor_id(); + struct callback_register event = { + .type = CALLBACKTYPE_event, + .address = { .ip = (unsigned long)&xen_event_callback }, + }; + + if (cpu == 0) { + /* Initialization was already done for boot cpu. */ +#ifdef CONFIG_HOTPLUG_CPU + /* Register the notifier only once. */ + register_cpu_notifier(&unbind_evtchn_notifier); +#endif + return; + } + + /* This should be piggyback when setup vcpu guest context */ + BUG_ON(HYPERVISOR_callback_op(CALLBACKOP_register, &event)); +#endif /* CONFIG_SMP */ +} + +void __init +xen_irq_init(void) +{ + struct callback_register event = { + .type = CALLBACKTYPE_event, + .address = { .ip = (unsigned long)&xen_event_callback }, + }; + + xen_init_IRQ(); + BUG_ON(HYPERVISOR_callback_op(CALLBACKOP_register, &event)); + late_time_init = xen_bind_early_percpu_irq; +} + +void +xen_platform_send_ipi(int cpu, int vector, int delivery_mode, int redirect) +{ +#ifdef CONFIG_SMP + /* TODO: we need to call vcpu_up here */ + if (unlikely(vector == ap_wakeup_vector)) { + /* XXX + * This should be in __cpu_up(cpu) in ia64 smpboot.c + * like x86. But don't want to modify it, + * keep it untouched. 
+ */ + xen_smp_intr_init_early(cpu); + + xen_send_ipi(cpu, vector); + /* vcpu_prepare_and_up(cpu); */ + return; + } +#endif + + switch (vector) { + case IA64_IPI_VECTOR: + xen_send_IPI_one(cpu, XEN_IPI_VECTOR); + break; + case IA64_IPI_RESCHEDULE: + xen_send_IPI_one(cpu, XEN_RESCHEDULE_VECTOR); + break; + case IA64_CMCP_VECTOR: + xen_send_IPI_one(cpu, XEN_CMCP_VECTOR); + break; + case IA64_CPEP_VECTOR: + xen_send_IPI_one(cpu, XEN_CPEP_VECTOR); + break; + case IA64_TIMER_VECTOR: { + /* this is used only once by check_sal_cache_flush() + at boot time */ + static int used = 0; + if (!used) { + xen_send_ipi(cpu, IA64_TIMER_VECTOR); + used = 1; + break; + } + /* fallthrough */ + } + default: + printk(KERN_WARNING "Unsupported IPI type 0x%x\n", + vector); + notify_remote_via_irq(0); /* defaults to 0 irq */ + break; + } +} + +static void __init +xen_register_ipi(void) +{ +#ifdef CONFIG_SMP + register_percpu_irq(IA64_IPI_VECTOR, &xen_ipi_irqaction); + register_percpu_irq(IA64_IPI_RESCHEDULE, &xen_resched_irqaction); + register_percpu_irq(IA64_IPI_LOCAL_TLB_FLUSH, &xen_tlb_irqaction); +#endif +} + +static void +xen_resend_irq(unsigned int vector) +{ + (void)resend_irq_on_evtchn(vector); +} + +const struct pv_irq_ops xen_irq_ops __initdata = { + .register_ipi = xen_register_ipi, + + .assign_irq_vector = xen_assign_irq_vector, + .free_irq_vector = xen_free_irq_vector, + .register_percpu_irq = xen_register_percpu_irq, + + .resend_irq = xen_resend_irq, +}; diff --git a/arch/ia64/xen/irq_xen.h b/arch/ia64/xen/irq_xen.h new file mode 100644 index 0000000..26110f3 --- /dev/null +++ b/arch/ia64/xen/irq_xen.h @@ -0,0 +1,34 @@ +/****************************************************************************** + * arch/ia64/xen/irq_xen.h + * + * Copyright (c) 2008 Isaku Yamahata + * VA Linux Systems Japan K.K. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + */ + +#ifndef IRQ_XEN_H +#define IRQ_XEN_H + +extern void (*late_time_init)(void); +extern char xen_event_callback; +void __init xen_init_IRQ(void); + +extern const struct pv_irq_ops xen_irq_ops __initdata; +extern void xen_smp_intr_init(void); +extern void xen_send_ipi(int cpu, int vec); + +#endif /* IRQ_XEN_H */ diff --git a/arch/ia64/xen/xen_pv_ops.c b/arch/ia64/xen/xen_pv_ops.c index 41a6cbf..4fe4e62 100644 --- a/arch/ia64/xen/xen_pv_ops.c +++ b/arch/ia64/xen/xen_pv_ops.c @@ -29,6 +29,8 @@ #include #include +#include "irq_xen.h" + /*************************************************************************** * general info */ @@ -354,6 +356,7 @@ xen_setup_pv_ops(void) pv_init_ops = xen_init_ops; pv_cpu_ops = xen_cpu_ops; pv_iosapic_ops = xen_iosapic_ops; + pv_irq_ops = xen_irq_ops; paravirt_cpu_asm_init(&xen_cpu_asm_switch); } -- 1.6.0.2 From yamahata at valinux.co.jp Thu Oct 16 19:18:00 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Fri, 17 Oct 2008 11:18:00 +0900 Subject: [PATCH 20/33] ia64/pv_ops/xen: define xen pv_cpu_ops. In-Reply-To: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> References: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224209893-2032-21-git-send-email-yamahata@valinux.co.jp> define xen pv_cpu_ops which implementes xen paravirtualized privileged instructions. Signed-off-by: Yaozu (Eddie) Dong Signed-off-by: Isaku Yamahata --- arch/ia64/xen/xen_pv_ops.c | 114 ++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 114 insertions(+), 0 deletions(-) diff --git a/arch/ia64/xen/xen_pv_ops.c b/arch/ia64/xen/xen_pv_ops.c index fc9d599..c236f04 100644 --- a/arch/ia64/xen/xen_pv_ops.c +++ b/arch/ia64/xen/xen_pv_ops.c @@ -163,6 +163,119 @@ static const struct pv_init_ops xen_init_ops __initdata = { }; /*************************************************************************** + * pv_cpu_ops + * intrinsics hooks. + */ + +static void xen_setreg(int regnum, unsigned long val) +{ + switch (regnum) { + case _IA64_REG_AR_KR0 ... _IA64_REG_AR_KR7: + xen_set_kr(regnum - _IA64_REG_AR_KR0, val); + break; +#ifdef CONFIG_IA32_SUPPORT + case _IA64_REG_AR_EFLAG: + xen_set_eflag(val); + break; +#endif + case _IA64_REG_CR_TPR: + xen_set_tpr(val); + break; + case _IA64_REG_CR_ITM: + xen_set_itm(val); + break; + case _IA64_REG_CR_EOI: + xen_eoi(val); + break; + default: + ia64_native_setreg_func(regnum, val); + break; + } +} + +static unsigned long xen_getreg(int regnum) +{ + unsigned long res; + + switch (regnum) { + case _IA64_REG_PSR: + res = xen_get_psr(); + break; +#ifdef CONFIG_IA32_SUPPORT + case _IA64_REG_AR_EFLAG: + res = xen_get_eflag(); + break; +#endif + case _IA64_REG_CR_IVR: + res = xen_get_ivr(); + break; + case _IA64_REG_CR_TPR: + res = xen_get_tpr(); + break; + default: + res = ia64_native_getreg_func(regnum); + break; + } + return res; +} + +/* turning on interrupts is a bit more complicated.. write to the + * memory-mapped virtual psr.i bit first (to avoid race condition), + * then if any interrupts were pending, we have to execute a hyperprivop + * to ensure the pending interrupt gets delivered; else we're done! 
*/ +static void +xen_ssm_i(void) +{ + int old = xen_get_virtual_psr_i(); + xen_set_virtual_psr_i(1); + barrier(); + if (!old && xen_get_virtual_pend()) + xen_hyper_ssm_i(); +} + +/* turning off interrupts can be paravirtualized simply by writing + * to a memory-mapped virtual psr.i bit (implemented as a 16-bit bool) */ +static void +xen_rsm_i(void) +{ + xen_set_virtual_psr_i(0); + barrier(); +} + +static unsigned long +xen_get_psr_i(void) +{ + return xen_get_virtual_psr_i() ? IA64_PSR_I : 0; +} + +static void +xen_intrin_local_irq_restore(unsigned long mask) +{ + if (mask & IA64_PSR_I) + xen_ssm_i(); + else + xen_rsm_i(); +} + +static const struct pv_cpu_ops xen_cpu_ops __initdata = { + .fc = xen_fc, + .thash = xen_thash, + .get_cpuid = xen_get_cpuid, + .get_pmd = xen_get_pmd, + .getreg = xen_getreg, + .setreg = xen_setreg, + .ptcga = xen_ptcga, + .get_rr = xen_get_rr, + .set_rr = xen_set_rr, + .set_rr0_to_rr4 = xen_set_rr0_to_rr4, + .ssm_i = xen_ssm_i, + .rsm_i = xen_rsm_i, + .get_psr_i = xen_get_psr_i, + .intrin_local_irq_restore + = xen_intrin_local_irq_restore, +}; + +/*************************************************************************** * pv_ops initialization */ @@ -172,4 +285,5 @@ xen_setup_pv_ops(void) xen_info_init(); pv_info = xen_info; pv_init_ops = xen_init_ops; + pv_cpu_ops = xen_cpu_ops; } -- 1.6.0.2 From yamahata at valinux.co.jp Thu Oct 16 19:18:11 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Fri, 17 Oct 2008 11:18:11 +0900 Subject: [PATCH 31/33] ia64/pv_ops: update Kconfig for paravirtualized guest and xen. In-Reply-To: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> References: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224209893-2032-32-git-send-email-yamahata@valinux.co.jp> introduce CONFIG_PARAVIRT_GUEST, CONFIG_PARAVIRT for paravirtualized guest. introduce CONFIG_XEN, CONFIG_IA64_XEN_GUEST for xen. Signed-off-by: Alex Williamson Signed-off-by: Isaku Yamahata Cc: "Luck, Tony" --- arch/ia64/Kconfig | 32 ++++++++++++++++++++++++++++++++ arch/ia64/xen/Kconfig | 26 ++++++++++++++++++++++++++ 2 files changed, 58 insertions(+), 0 deletions(-) create mode 100644 arch/ia64/xen/Kconfig diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig index 48e496f..34639f1 100644 --- a/arch/ia64/Kconfig +++ b/arch/ia64/Kconfig @@ -116,6 +116,33 @@ config AUDIT_ARCH bool default y +menuconfig PARAVIRT_GUEST + bool "Paravirtualized guest support" + help + Say Y here to get to see options related to running Linux under + various hypervisors. This option alone does not add any kernel code. + + If you say N, all options in this submenu will be skipped and disabled. + +if PARAVIRT_GUEST + +config PARAVIRT + bool "Enable paravirtualization code" + depends on PARAVIRT_GUEST + default y + bool + default y + help + This changes the kernel so it can modify itself when it is run + under a hypervisor, potentially improving performance significantly + over full virtualization. However, when run without a hypervisor + the kernel is theoretically slower and slightly larger. + + +source "arch/ia64/xen/Kconfig" + +endif + choice prompt "System type" default IA64_GENERIC @@ -137,6 +164,7 @@ config IA64_GENERIC SGI-SN2 For SGI Altix systems SGI-UV For SGI UV systems Ski-simulator For the HP simulator + Xen-domU For xen domU system If you don't know what to do, choose "generic". 
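(A rough sketch, not part of this patch: with the Kconfig entries above, an ia64 domU .config would typically end up containing something like the following fragment. XEN_XENCOMM and NO_IDLE_HZ are selected automatically by XEN, IA64_XEN_GUEST is the system type added in the next hunk, and the exact set may differ from tree to tree.

    CONFIG_PARAVIRT_GUEST=y
    CONFIG_PARAVIRT=y
    CONFIG_XEN=y
    CONFIG_XEN_XENCOMM=y
    CONFIG_NO_IDLE_HZ=y
    CONFIG_IA64_XEN_GUEST=y
)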
@@ -187,6 +215,10 @@ config IA64_HP_SIM bool "Ski-simulator" select SWIOTLB +config IA64_XEN_GUEST + bool "Xen guest" + depends on XEN + endchoice choice diff --git a/arch/ia64/xen/Kconfig b/arch/ia64/xen/Kconfig new file mode 100644 index 0000000..f1683a2 --- /dev/null +++ b/arch/ia64/xen/Kconfig @@ -0,0 +1,26 @@ +# +# This Kconfig describes xen/ia64 options +# + +config XEN + bool "Xen hypervisor support" + default y + depends on PARAVIRT && MCKINLEY && IA64_PAGE_SIZE_16KB && EXPERIMENTAL + select XEN_XENCOMM + select NO_IDLE_HZ + + # those are required to save/restore. + select ARCH_SUSPEND_POSSIBLE + select SUSPEND + select PM_SLEEP + help + Enable Xen hypervisor support. Resulting kernel runs + both as a guest OS on Xen and natively on hardware. + +config XEN_XENCOMM + depends on XEN + bool + +config NO_IDLE_HZ + depends on XEN + bool -- 1.6.0.2 From yamahata at valinux.co.jp Thu Oct 16 19:18:10 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Fri, 17 Oct 2008 11:18:10 +0900 Subject: [PATCH 30/33] ia64/xen: preliminary support for save/restore. In-Reply-To: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> References: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224209893-2032-31-git-send-email-yamahata@valinux.co.jp> preliminary support for save/restore. Although Save/restore isn't fully working yet, this patch is necessary to compile. Signed-off-by: Isaku Yamahata --- arch/ia64/xen/Makefile | 2 +- arch/ia64/xen/{time.h => suspend.c} | 45 +++++++++++++++++++++++++++++++++- arch/ia64/xen/time.c | 33 +++++++++++++++++++++++++ arch/ia64/xen/time.h | 1 + 4 files changed, 78 insertions(+), 3 deletions(-) copy arch/ia64/xen/{time.h => suspend.c} (64%) diff --git a/arch/ia64/xen/Makefile b/arch/ia64/xen/Makefile index 972d085..0ad0224 100644 --- a/arch/ia64/xen/Makefile +++ b/arch/ia64/xen/Makefile @@ -3,7 +3,7 @@ # obj-y := hypercall.o xenivt.o xensetup.o xen_pv_ops.o irq_xen.o \ - hypervisor.o xencomm.o xcom_hcall.o grant-table.o time.o + hypervisor.o xencomm.o xcom_hcall.o grant-table.o time.o suspend.o obj-$(CONFIG_IA64_GENERIC) += machvec.o diff --git a/arch/ia64/xen/time.h b/arch/ia64/xen/suspend.c similarity index 64% copy from arch/ia64/xen/time.h copy to arch/ia64/xen/suspend.c index b9c7ec5..fd66b04 100644 --- a/arch/ia64/xen/time.h +++ b/arch/ia64/xen/suspend.c @@ -1,5 +1,5 @@ /****************************************************************************** - * arch/ia64/xen/time.h + * arch/ia64/xen/suspend.c * * Copyright (c) 2008 Isaku Yamahata * VA Linux Systems Japan K.K. 
@@ -18,6 +18,47 @@ * along with this program; if not, write to the Free Software * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA * + * suspend/resume */ -extern struct pv_time_ops xen_time_ops __initdata; +#include +#include +#include "time.h" + +void +xen_mm_pin_all(void) +{ + /* nothing */ +} + +void +xen_mm_unpin_all(void) +{ + /* nothing */ +} + +void xen_pre_device_suspend(void) +{ + /* nothing */ +} + +void +xen_pre_suspend() +{ + /* nothing */ +} + +void +xen_post_suspend(int suspend_cancelled) +{ + if (suspend_cancelled) + return; + + xen_ia64_enable_opt_feature(); + /* add more if necessary */ +} + +void xen_arch_resume(void) +{ + xen_timer_resume_on_aps(); +} diff --git a/arch/ia64/xen/time.c b/arch/ia64/xen/time.c index ec168ec..d15a94c 100644 --- a/arch/ia64/xen/time.c +++ b/arch/ia64/xen/time.c @@ -26,6 +26,8 @@ #include #include +#include + #include #include @@ -178,3 +180,34 @@ struct pv_time_ops xen_time_ops __initdata = { .do_steal_accounting = xen_do_steal_accounting, .clocksource_resume = xen_itc_jitter_data_reset, }; + +/* Called after suspend, to resume time. */ +static void xen_local_tick_resume(void) +{ + /* Just trigger a tick. */ + ia64_cpu_local_tick(); + touch_softlockup_watchdog(); +} + +void +xen_timer_resume(void) +{ + unsigned int cpu; + + xen_local_tick_resume(); + + for_each_online_cpu(cpu) + xen_init_missing_ticks_accounting(cpu); +} + +static void ia64_cpu_local_tick_fn(void *unused) +{ + xen_local_tick_resume(); + xen_init_missing_ticks_accounting(smp_processor_id()); +} + +void +xen_timer_resume_on_aps(void) +{ + smp_call_function(&ia64_cpu_local_tick_fn, NULL, 1); +} diff --git a/arch/ia64/xen/time.h b/arch/ia64/xen/time.h index b9c7ec5..f98d7e1 100644 --- a/arch/ia64/xen/time.h +++ b/arch/ia64/xen/time.h @@ -21,3 +21,4 @@ */ extern struct pv_time_ops xen_time_ops __initdata; +void xen_timer_resume_on_aps(void); -- 1.6.0.2 From yamahata at valinux.co.jp Thu Oct 16 19:18:12 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Fri, 17 Oct 2008 11:18:12 +0900 Subject: [PATCH 32/33] ia64/xen: a recipe for using xen/ia64 with pv_ops. In-Reply-To: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> References: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224209893-2032-33-git-send-email-yamahata@valinux.co.jp> Recipe for using xen/ia64 with pv_ops domU. Signed-off-by: Akio Takebe Signed-off-by: Isaku Yamahata --- Documentation/ia64/xen.txt | 183 ++++++++++++++++++++++++++++++++++++++++ 1 files changed, 183 insertions(+), 0 deletions(-) create mode 100644 Documentation/ia64/xen.txt diff --git a/Documentation/ia64/xen.txt b/Documentation/ia64/xen.txt new file mode 100644 index 0000000..a5c6993 --- /dev/null +++ b/Documentation/ia64/xen.txt @@ -0,0 +1,183 @@ + Recipe for getting/building/running Xen/ia64 with pv_ops + -------------------------------------------------------- + +This recipe describes how to get the xen-ia64 source and build it, +and run domU with pv_ops. + +=========== +Requirement +=========== + + - python + - mercurial + it (aka "hg") is an open-source source code + management tool. See the site below. + http://www.selenic.com/mercurial/wiki/ + - git + - bridge-utils + +================================= +Getting and Building Xen and Dom0 +================================= + + My environment is: + Machine : Tiger4 + Domain0 OS : RHEL5 + DomainU OS : RHEL5 + + 1. 
Download source + # hg clone http://xenbits.xensource.com/ext/ia64/xen-unstable.hg + # cd xen-unstable.hg + # hg clone http://xenbits.xensource.com/ext/ia64/linux-2.6.18-xen.hg + + 2. # make world + + 3. # make install-tools + + 4. copy kernels and xen + # cp xen/xen.gz /boot/efi/efi/redhat/ + # cp build-linux-2.6.18-xen_ia64/vmlinux.gz \ + /boot/efi/efi/redhat/vmlinuz-2.6.18.8-xen + + 5. make initrd for Dom0/DomU + # make -C linux-2.6.18-xen.hg ARCH=ia64 modules_install \ + O=$(/bin/pwd)/build-linux-2.6.18-xen_ia64 + # mkinitrd -f /boot/efi/efi/redhat/initrd-2.6.18.8-xen.img \ + 2.6.18.8-xen --builtin mptspi --builtin mptbase \ + --builtin mptscsih --builtin uhci-hcd --builtin ohci-hcd \ + --builtin ehci-hcd + +================================ +Making a disk image for guest OS +================================ + + 1. make file + # dd if=/dev/zero of=/root/rhel5.img bs=1M seek=4096 count=0 + # mke2fs -F -j /root/rhel5.img + # mount -o loop /root/rhel5.img /mnt + # cp -ax /{dev,var,etc,usr,bin,sbin,lib} /mnt + # mkdir /mnt/{root,proc,sys,home,tmp} + + Note: Some device files may be missing. If so, please create them + with mknod. Or you can use tar instead of cp. + + 2. modify DomU's fstab + # vi /mnt/etc/fstab + /dev/xvda1 / ext3 defaults 1 1 + none /dev/pts devpts gid=5,mode=620 0 0 + none /dev/shm tmpfs defaults 0 0 + none /proc proc defaults 0 0 + none /sys sysfs defaults 0 0 + + 3. modify inittab + set runlevel to 3 to avoid X trying to start + # vi /mnt/etc/inittab + id:3:initdefault: + Start a getty on the hvc0 console + X0:2345:respawn:/sbin/mingetty hvc0 + tty1-6 mingetty can be commented out + + 4. add hvc0 into /etc/securetty + # vi /mnt/etc/securetty (add hvc0) + + 5. umount + # umount /mnt + +FYI, virt-manager can also make a disk image for the guest OS. +It is a GUI tool and makes this easy. + +================== +Boot Xen & Domain0 +================== + + 1. replace elilo + elilo of RHEL5 can boot Xen and Dom0. + If you use an old elilo (e.g. RHEL4), please download it from the site below + http://elilo.sourceforge.net/cgi-bin/blosxom + and copy it into /boot/efi/efi/redhat/ + # cp elilo-3.6-ia64.efi /boot/efi/efi/redhat/elilo.efi + + 2. modify elilo.conf (like the example below) + # vi /boot/efi/efi/redhat/elilo.conf + prompt + timeout=20 + default=xen + relocatable + + image=vmlinuz-2.6.18.8-xen + label=xen + vmm=xen.gz + initrd=initrd-2.6.18.8-xen.img + read-only + append=" -- rhgb root=/dev/sda2" + +The append options before "--" are for the xen hypervisor, +the options after "--" are for dom0. + +FYI, your machine may need console options like +"com1=19200,8n1 console=vga,com1". For example, +append="com1=19200,8n1 console=vga,com1 -- rhgb console=tty0 \ +console=ttyS0 root=/dev/sda2" + +===================================== +Getting and Building domU with pv_ops +===================================== + + 1. get pv_ops tree + # git clone http://people.valinux.co.jp/~yamahata/xen-ia64/linux-2.6-xen-ia64.git/ + + 2. git branch (if necessary) + # cd linux-2.6-xen-ia64/ + # git checkout -b your_branch origin/xen-ia64-domu-minimal-2008may19 + (Note: The current branch is xen-ia64-domu-minimal-2008may19, + but a newer branch may exist. You can run + "git branch -r" to list the available branches. + http://people.valinux.co.jp/~yamahata/xen-ia64/for_eagl/linux-2.6-ia64-pv-ops.git/ + is also available. The tree is based on + git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6 test) + + + 3. copy .config for pv_ops of domU + # cp arch/ia64/configs/xen_domu_wip_defconfig .config + + 4. 
make kernel with pv_ops + # make oldconfig + # make + + 5. install the kernel and initrd + # cp vmlinux.gz /boot/efi/efi/redhat/vmlinuz-2.6-pv_ops-xenU + # make modules_install + # mkinitrd -f /boot/efi/efi/redhat/initrd-2.6-pv_ops-xenU.img \ + 2.6.26-rc3xen-ia64-08941-g1b12161 --builtin mptspi \ + --builtin mptbase --builtin mptscsih --builtin uhci-hcd \ + --builtin ohci-hcd --builtin ehci-hcd + +======================== +Boot DomainU with pv_ops +======================== + + 1. make config of DomU + # vi /etc/xen/rhel5 + kernel = "/boot/efi/efi/redhat/vmlinuz-2.6-pv_ops-xenU" + ramdisk = "/boot/efi/efi/redhat/initrd-2.6-pv_ops-xenU.img" + vcpus = 1 + memory = 512 + name = "rhel5" + disk = [ 'file:/root/rhel5.img,xvda1,w' ] + root = "/dev/xvda1 ro" + extra= "rhgb console=hvc0" + + 2. After boot xen and dom0, start xend + # /etc/init.d/xend start + ( In the debugging case, # XEND_DEBUG=1 xend trace_start ) + + 3. start domU + # xm create -c rhel5 + +========= +Reference +========= +- Wiki of Xen/IA64 upstream merge + http://wiki.xensource.com/xenwiki/XenIA64/UpstreamMerge + +Witten by Akio Takebe on 28 May 2008 -- 1.6.0.2 From yamahata at valinux.co.jp Thu Oct 16 19:17:57 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Fri, 17 Oct 2008 11:17:57 +0900 Subject: [PATCH 17/33] ia64/xen: introduce helper function to identify domain mode. In-Reply-To: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> References: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224209893-2032-18-git-send-email-yamahata@valinux.co.jp> There are four operating modes Xen code may find itself running in: - native - hvm domain - pv dom0 - pv domU Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/xen/hypervisor.h | 75 ++++++++++++++++++++++++++++++++ 1 files changed, 75 insertions(+), 0 deletions(-) create mode 100644 arch/ia64/include/asm/xen/hypervisor.h diff --git a/arch/ia64/include/asm/xen/hypervisor.h b/arch/ia64/include/asm/xen/hypervisor.h new file mode 100644 index 0000000..d1f84e1 --- /dev/null +++ b/arch/ia64/include/asm/xen/hypervisor.h @@ -0,0 +1,75 @@ +/****************************************************************************** + * hypervisor.h + * + * Linux-specific hypervisor handling. + * + * Copyright (c) 2002-2004, K A Fraser + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation; or, when distributed + * separately from the Linux kernel or incorporated into other + * software packages, subject to the following license: + * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this source file (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, copy, modify, + * merge, publish, distribute, sublicense, and/or sell copies of the Software, + * and to permit persons to whom the Software is furnished to do so, subject to + * the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS + * IN THE SOFTWARE. + */ + +#ifndef _ASM_IA64_XEN_HYPERVISOR_H +#define _ASM_IA64_XEN_HYPERVISOR_H + +#ifdef CONFIG_XEN + +#include +#include +#include /* to compile feature.c */ +#include /* to compile xen-netfront.c */ +#include + +/* xen_domain_type is set before executing any C code by early_xen_setup */ +enum xen_domain_type { + XEN_NATIVE, + XEN_PV_DOMAIN, + XEN_HVM_DOMAIN, +}; + +extern enum xen_domain_type xen_domain_type; + +#define xen_domain() (xen_domain_type != XEN_NATIVE) +#define xen_pv_domain() (xen_domain_type == XEN_PV_DOMAIN) +#define xen_initial_domain() (xen_pv_domain() && \ + (xen_start_info->flags & SIF_INITDOMAIN)) +#define xen_hvm_domain() (xen_domain_type == XEN_HVM_DOMAIN) + +/* deprecated. remove this */ +#define is_running_on_xen() (xen_domain_type == XEN_PV_DOMAIN) + +extern struct start_info *xen_start_info; + +#else /* CONFIG_XEN */ + +#define xen_domain() (0) +#define xen_pv_domain() (0) +#define xen_initial_domain() (0) +#define xen_hvm_domain() (0) +#define is_running_on_xen() (0) /* deprecated. remove this */ +#endif + +#define is_initial_xendomain() (0) /* deprecated. remove this */ + +#endif /* _ASM_IA64_XEN_HYPERVISOR_H */ -- 1.6.0.2 From yamahata at valinux.co.jp Thu Oct 16 19:18:13 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Fri, 17 Oct 2008 11:18:13 +0900 Subject: [PATCH 33/33] ia64/pv_ops: paravirtualized instruction checker. In-Reply-To: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> References: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224209893-2032-34-git-send-email-yamahata@valinux.co.jp> This patch implements a checker to detect instructions which should be paravirtualized instead of being written directly as raw instructions. The check is rough and doesn't fully cover all cases, but it can detect most cases of paravirtualization breakage in hand-written assembly code. Signed-off-by: Isaku Yamahata Cc: "Luck, Tony" --- arch/ia64/include/asm/native/pvchk_inst.h | 263 +++++++++++++++++++++++++++++ arch/ia64/kernel/Makefile | 18 ++ arch/ia64/kernel/paravirt_inst.h | 4 +- arch/ia64/scripts/pvcheck.sed | 32 ++++ 4 files changed, 316 insertions(+), 1 deletions(-) create mode 100644 arch/ia64/include/asm/native/pvchk_inst.h create mode 100644 arch/ia64/scripts/pvcheck.sed diff --git a/arch/ia64/include/asm/native/pvchk_inst.h b/arch/ia64/include/asm/native/pvchk_inst.h new file mode 100644 index 0000000..b8e6eb1 --- /dev/null +++ b/arch/ia64/include/asm/native/pvchk_inst.h @@ -0,0 +1,263 @@ +#ifndef _ASM_NATIVE_PVCHK_INST_H +#define _ASM_NATIVE_PVCHK_INST_H + +/****************************************************************************** + * arch/ia64/include/asm/native/pvchk_inst.h + * Checker for paravirtualizations of privileged operations. + * + * Copyright (C) 2005 Hewlett-Packard Co + * Dan Magenheimer + * + * Copyright (c) 2008 Isaku Yamahata + * VA Linux Systems Japan K.K. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. 
+ * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + */ + +/********************************************** + * Instructions paravirtualized for correctness + **********************************************/ + +/* "fc" and "thash" are privilege-sensitive instructions, meaning they + * may have different semantics depending on whether they are executed + * at PL0 vs PL!=0. When paravirtualized, these instructions mustn't + * be allowed to execute directly, lest incorrect semantics result. + */ + +#define fc .error "fc should not be used directly." +#define thash .error "thash should not be used directly." + +/* Note that "ttag" and "cover" are also privilege-sensitive; "ttag" + * is not currently used (though it may be in a long-format VHPT system!) + * and the semantics of cover only change if psr.ic is off which is very + * rare (and currently non-existent outside of assembly code + */ +#define ttag .error "ttag should not be used directly." +#define cover .error "cover should not be used directly." + +/* There are also privilege-sensitive registers. These registers are + * readable at any privilege level but only writable at PL0. + */ +#define cpuid .error "cpuid should not be used directly." +#define pmd .error "pmd should not be used directly." + +/* + * mov ar.eflag = + * mov = ar.eflag + */ + +/********************************************** + * Instructions paravirtualized for performance + **********************************************/ +/* + * Those instructions include '.' which can't be handled by cpp. + * or can't be handled by cpp easily. + * They are handled by sed instead of cpp. + */ + +/* for .S + * itc.i + * itc.d + * + * bsw.0 + * bsw.1 + * + * ssm psr.ic | PSR_DEFAULT_BITS + * ssm psr.ic + * rsm psr.ic + * ssm psr.i + * rsm psr.i + * rsm psr.i | psr.ic + * rsm psr.dt + * ssm psr.dt + * + * mov = cr.ifa + * mov = cr.itir + * mov = cr.isr + * mov = cr.iha + * mov = cr.ipsr + * mov = cr.iim + * mov = cr.iip + * mov = cr.ivr + * mov = psr + * + * mov cr.ifa = + * mov cr.itir = + * mov cr.iha = + * mov cr.ipsr = + * mov cr.ifs = + * mov cr.iip = + * mov cr.kr = + */ + +/* for intrinsics + * ssm psr.i + * rsm psr.i + * mov = psr + * mov = ivr + * mov = tpr + * mov cr.itm = + * mov eoi = + * mov rr[] = + * mov = rr[] + * mov = kr + * mov kr = + * ptc.ga + */ + +/************************************************************* + * define paravirtualized instrcution macros as nop to ingore. + * and check whether arguments are appropriate. 
+ *************************************************************/ + +/* check whether reg is a regular register */ +.macro is_rreg_in reg + .ifc "\reg", "r0" + nop 0 + .exitm + .endif + ;; + mov \reg = r0 + ;; +.endm +#define IS_RREG_IN(reg) is_rreg_in reg ; + +#define IS_RREG_OUT(reg) \ + ;; \ + mov reg = r0 \ + ;; + +#define IS_RREG_CLOB(reg) IS_RREG_OUT(reg) + +/* check whether pred is a predicate register */ +#define IS_PRED_IN(pred) \ + ;; \ + (pred) nop 0 \ + ;; + +#define IS_PRED_OUT(pred) \ + ;; \ + cmp.eq pred, p0 = r0, r0 \ + ;; + +#define IS_PRED_CLOB(pred) IS_PRED_OUT(pred) + + +#define DO_SAVE_MIN(__COVER, SAVE_IFS, EXTRA, WORKAROUND) \ + nop 0 +#define MOV_FROM_IFA(reg) \ + IS_RREG_OUT(reg) +#define MOV_FROM_ITIR(reg) \ + IS_RREG_OUT(reg) +#define MOV_FROM_ISR(reg) \ + IS_RREG_OUT(reg) +#define MOV_FROM_IHA(reg) \ + IS_RREG_OUT(reg) +#define MOV_FROM_IPSR(pred, reg) \ + IS_PRED_IN(pred) \ + IS_RREG_OUT(reg) +#define MOV_FROM_IIM(reg) \ + IS_RREG_OUT(reg) +#define MOV_FROM_IIP(reg) \ + IS_RREG_OUT(reg) +#define MOV_FROM_IVR(reg, clob) \ + IS_RREG_OUT(reg) \ + IS_RREG_CLOB(clob) +#define MOV_FROM_PSR(pred, reg, clob) \ + IS_PRED_IN(pred) \ + IS_RREG_OUT(reg) \ + IS_RREG_CLOB(clob) +#define MOV_TO_IFA(reg, clob) \ + IS_RREG_IN(reg) \ + IS_RREG_CLOB(clob) +#define MOV_TO_ITIR(pred, reg, clob) \ + IS_PRED_IN(pred) \ + IS_RREG_IN(reg) \ + IS_RREG_CLOB(clob) +#define MOV_TO_IHA(pred, reg, clob) \ + IS_PRED_IN(pred) \ + IS_RREG_IN(reg) \ + IS_RREG_CLOB(clob) +#define MOV_TO_IPSR(pred, reg, clob) \ + IS_PRED_IN(pred) \ + IS_RREG_IN(reg) \ + IS_RREG_CLOB(clob) +#define MOV_TO_IFS(pred, reg, clob) \ + IS_PRED_IN(pred) \ + IS_RREG_IN(reg) \ + IS_RREG_CLOB(clob) +#define MOV_TO_IIP(reg, clob) \ + IS_RREG_IN(reg) \ + IS_RREG_CLOB(clob) +#define MOV_TO_KR(kr, reg, clob0, clob1) \ + IS_RREG_IN(reg) \ + IS_RREG_CLOB(clob0) \ + IS_RREG_CLOB(clob1) +#define ITC_I(pred, reg, clob) \ + IS_PRED_IN(pred) \ + IS_RREG_IN(reg) \ + IS_RREG_CLOB(clob) +#define ITC_D(pred, reg, clob) \ + IS_PRED_IN(pred) \ + IS_RREG_IN(reg) \ + IS_RREG_CLOB(clob) +#define ITC_I_AND_D(pred_i, pred_d, reg, clob) \ + IS_PRED_IN(pred_i) \ + IS_PRED_IN(pred_d) \ + IS_RREG_IN(reg) \ + IS_RREG_CLOB(clob) +#define THASH(pred, reg0, reg1, clob) \ + IS_PRED_IN(pred) \ + IS_RREG_OUT(reg0) \ + IS_RREG_IN(reg1) \ + IS_RREG_CLOB(clob) +#define SSM_PSR_IC_AND_DEFAULT_BITS_AND_SRLZ_I(clob0, clob1) \ + IS_RREG_CLOB(clob0) \ + IS_RREG_CLOB(clob1) +#define SSM_PSR_IC_AND_SRLZ_D(clob0, clob1) \ + IS_RREG_CLOB(clob0) \ + IS_RREG_CLOB(clob1) +#define RSM_PSR_IC(clob) \ + IS_RREG_CLOB(clob) +#define SSM_PSR_I(pred, pred_clob, clob) \ + IS_PRED_IN(pred) \ + IS_PRED_CLOB(pred_clob) \ + IS_RREG_CLOB(clob) +#define RSM_PSR_I(pred, clob0, clob1) \ + IS_PRED_IN(pred) \ + IS_RREG_CLOB(clob0) \ + IS_RREG_CLOB(clob1) +#define RSM_PSR_I_IC(clob0, clob1, clob2) \ + IS_RREG_CLOB(clob0) \ + IS_RREG_CLOB(clob1) \ + IS_RREG_CLOB(clob2) +#define RSM_PSR_DT \ + nop 0 +#define SSM_PSR_DT_AND_SRLZ_I \ + nop 0 +#define BSW_0(clob0, clob1, clob2) \ + IS_RREG_CLOB(clob0) \ + IS_RREG_CLOB(clob1) \ + IS_RREG_CLOB(clob2) +#define BSW_1(clob0, clob1) \ + IS_RREG_CLOB(clob0) \ + IS_RREG_CLOB(clob1) +#define COVER \ + nop 0 +#define RFI \ + br.ret.sptk.many rp /* defining nop causes dependency error */ + +#endif /* _ASM_NATIVE_PVCHK_INST_H */ diff --git a/arch/ia64/kernel/Makefile b/arch/ia64/kernel/Makefile index 87fea11..55e6ca8 100644 --- a/arch/ia64/kernel/Makefile +++ b/arch/ia64/kernel/Makefile @@ -112,5 +112,23 @@ clean-files += 
$(objtree)/include/asm-ia64/nr-irqs.h ASM_PARAVIRT_OBJS = ivt.o entry.o define paravirtualized_native AFLAGS_$(1) += -D__IA64_ASM_PARAVIRTUALIZED_NATIVE +AFLAGS_pvchk-sed-$(1) += -D__IA64_ASM_PARAVIRTUALIZED_PVCHECK +extra-y += pvchk-$(1) endef $(foreach obj,$(ASM_PARAVIRT_OBJS),$(eval $(call paravirtualized_native,$(obj)))) + +# +# Checker for paravirtualizations of privileged operations. +# +quiet_cmd_pv_check_sed = PVCHK $@ +define cmd_pv_check_sed + sed -f $(srctree)/arch/$(SRCARCH)/scripts/pvcheck.sed $< > $@ +endef + +$(obj)/pvchk-sed-%.s: $(src)/%.S $(srctree)/arch/$(SRCARCH)/scripts/pvcheck.sed FORCE + $(call if_changed_dep,as_s_S) +$(obj)/pvchk-%.s: $(obj)/pvchk-sed-%.s FORCE + $(call if_changed,pv_check_sed) +$(obj)/pvchk-%.o: $(obj)/pvchk-%.s FORCE + $(call if_changed,as_o_S) +.PRECIOUS: $(obj)/pvchk-sed-%.s $(obj)/pvchk-%.s $(obj)/pvchk-%.o diff --git a/arch/ia64/kernel/paravirt_inst.h b/arch/ia64/kernel/paravirt_inst.h index 5cad6fb..64d6d81 100644 --- a/arch/ia64/kernel/paravirt_inst.h +++ b/arch/ia64/kernel/paravirt_inst.h @@ -20,7 +20,9 @@ * */ -#ifdef __IA64_ASM_PARAVIRTUALIZED_XEN +#ifdef __IA64_ASM_PARAVIRTUALIZED_PVCHECK +#include +#elif defined(__IA64_ASM_PARAVIRTUALIZED_XEN) #include #include #else diff --git a/arch/ia64/scripts/pvcheck.sed b/arch/ia64/scripts/pvcheck.sed new file mode 100644 index 0000000..ba66ac2 --- /dev/null +++ b/arch/ia64/scripts/pvcheck.sed @@ -0,0 +1,32 @@ +# +# Checker for paravirtualizations of privileged operations. +# +s/ssm.*psr\.ic.*/.warning \"ssm psr.ic should not be used directly\"/g +s/rsm.*psr\.ic.*/.warning \"rsm psr.ic should not be used directly\"/g +s/ssm.*psr\.i.*/.warning \"ssm psr.i should not be used directly\"/g +s/rsm.*psr\.i.*/.warning \"rsm psr.i should not be used directly\"/g +s/ssm.*psr\.dt.*/.warning \"ssm psr.dt should not be used directly\"/g +s/rsm.*psr\.dt.*/.warning \"rsm psr.dt should not be used directly\"/g +s/mov.*=.*cr\.ifa/.warning \"cr.ifa should not used directly\"/g +s/mov.*=.*cr\.itir/.warning \"cr.itir should not used directly\"/g +s/mov.*=.*cr\.isr/.warning \"cr.isr should not used directly\"/g +s/mov.*=.*cr\.iha/.warning \"cr.iha should not used directly\"/g +s/mov.*=.*cr\.ipsr/.warning \"cr.ipsr should not used directly\"/g +s/mov.*=.*cr\.iim/.warning \"cr.iim should not used directly\"/g +s/mov.*=.*cr\.iip/.warning \"cr.iip should not used directly\"/g +s/mov.*=.*cr\.ivr/.warning \"cr.ivr should not used directly\"/g +s/mov.*=[^\.]*psr/.warning \"psr should not used directly\"/g # avoid ar.fpsr +s/mov.*=.*ar\.eflags/.warning \"ar.eflags should not used directly\"/g +s/mov.*cr\.ifa.*=.*/.warning \"cr.ifa should not used directly\"/g +s/mov.*cr\.itir.*=.*/.warning \"cr.itir should not used directly\"/g +s/mov.*cr\.iha.*=.*/.warning \"cr.iha should not used directly\"/g +s/mov.*cr\.ipsr.*=.*/.warning \"cr.ipsr should not used directly\"/g +s/mov.*cr\.ifs.*=.*/.warning \"cr.ifs should not used directly\"/g +s/mov.*cr\.iip.*=.*/.warning \"cr.iip should not used directly\"/g +s/mov.*cr\.kr.*=.*/.warning \"cr.kr should not used directly\"/g +s/mov.*ar\.eflags.*=.*/.warning \"ar.eflags should not used directly\"/g +s/itc\.i.*/.warning \"itc.i should not be used directly.\"/g +s/itc\.d.*/.warning \"itc.d should not be used directly.\"/g +s/bsw\.0/.warning \"bsw.0 should not be used directly.\"/g +s/bsw\.1/.warning \"bsw.1 should not be used directly.\"/g +s/ptc\.ga.*/.warning \"ptc.ga should not be used directly.\"/g -- 1.6.0.2 From yamahata at valinux.co.jp Thu Oct 16 19:18:08 2008 From: yamahata at 
valinux.co.jp (Isaku Yamahata) Date: Fri, 17 Oct 2008 11:18:08 +0900 Subject: [PATCH 28/33] ia64/pv_ops/xen: implement xen pv_time_ops. In-Reply-To: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> References: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224209893-2032-29-git-send-email-yamahata@valinux.co.jp> implement xen pv_time_ops to account steal time. Cc: Jeremy Fitzhardinge Signed-off-by: Alex Williamson Signed-off-by: Isaku Yamahata --- arch/ia64/xen/Makefile | 2 +- arch/ia64/xen/time.c | 180 ++++++++++++++++++++++++++++++++++++++++++++ arch/ia64/xen/time.h | 23 ++++++ arch/ia64/xen/xen_pv_ops.c | 2 + 4 files changed, 206 insertions(+), 1 deletions(-) create mode 100644 arch/ia64/xen/time.c create mode 100644 arch/ia64/xen/time.h diff --git a/arch/ia64/xen/Makefile b/arch/ia64/xen/Makefile index 01c4289..ed31c76 100644 --- a/arch/ia64/xen/Makefile +++ b/arch/ia64/xen/Makefile @@ -3,7 +3,7 @@ # obj-y := hypercall.o xenivt.o xensetup.o xen_pv_ops.o irq_xen.o \ - hypervisor.o xencomm.o xcom_hcall.o grant-table.o + hypervisor.o xencomm.o xcom_hcall.o grant-table.o time.o AFLAGS_xenivt.o += -D__IA64_ASM_PARAVIRTUALIZED_XEN diff --git a/arch/ia64/xen/time.c b/arch/ia64/xen/time.c new file mode 100644 index 0000000..ec168ec --- /dev/null +++ b/arch/ia64/xen/time.c @@ -0,0 +1,180 @@ +/****************************************************************************** + * arch/ia64/xen/time.c + * + * Copyright (c) 2008 Isaku Yamahata + * VA Linux Systems Japan K.K. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + */ + +#include +#include +#include +#include +#include + +#include + +#include + +#include "../kernel/fsyscall_gtod_data.h" + +DEFINE_PER_CPU(struct vcpu_runstate_info, runstate); +DEFINE_PER_CPU(unsigned long, processed_stolen_time); +DEFINE_PER_CPU(unsigned long, processed_blocked_time); + +/* taken from i386/kernel/time-xen.c */ +static void xen_init_missing_ticks_accounting(int cpu) +{ + struct vcpu_register_runstate_memory_area area; + struct vcpu_runstate_info *runstate = &per_cpu(runstate, cpu); + int rc; + + memset(runstate, 0, sizeof(*runstate)); + + area.addr.v = runstate; + rc = HYPERVISOR_vcpu_op(VCPUOP_register_runstate_memory_area, cpu, + &area); + WARN_ON(rc && rc != -ENOSYS); + + per_cpu(processed_blocked_time, cpu) = runstate->time[RUNSTATE_blocked]; + per_cpu(processed_stolen_time, cpu) = runstate->time[RUNSTATE_runnable] + + runstate->time[RUNSTATE_offline]; +} + +/* + * Runstate accounting + */ +/* stolen from arch/x86/xen/time.c */ +static void get_runstate_snapshot(struct vcpu_runstate_info *res) +{ + u64 state_time; + struct vcpu_runstate_info *state; + + BUG_ON(preemptible()); + + state = &__get_cpu_var(runstate); + + /* + * The runstate info is always updated by the hypervisor on + * the current CPU, so there's no need to use anything + * stronger than a compiler barrier when fetching it. + */ + do { + state_time = state->state_entry_time; + rmb(); + *res = *state; + rmb(); + } while (state->state_entry_time != state_time); +} + +#define NS_PER_TICK (1000000000LL/HZ) + +static unsigned long +consider_steal_time(unsigned long new_itm) +{ + unsigned long stolen, blocked; + unsigned long delta_itm = 0, stolentick = 0; + int cpu = smp_processor_id(); + struct vcpu_runstate_info runstate; + struct task_struct *p = current; + + get_runstate_snapshot(&runstate); + + /* + * Check for vcpu migration effect + * In this case, itc value is reversed. + * This causes huge stolen value. + * This function just checks and reject this effect. 
+ */ + if (!time_after_eq(runstate.time[RUNSTATE_blocked], + per_cpu(processed_blocked_time, cpu))) + blocked = 0; + + if (!time_after_eq(runstate.time[RUNSTATE_runnable] + + runstate.time[RUNSTATE_offline], + per_cpu(processed_stolen_time, cpu))) + stolen = 0; + + if (!time_after(delta_itm + new_itm, ia64_get_itc())) + stolentick = ia64_get_itc() - new_itm; + + do_div(stolentick, NS_PER_TICK); + stolentick++; + + do_div(stolen, NS_PER_TICK); + + if (stolen > stolentick) + stolen = stolentick; + + stolentick -= stolen; + do_div(blocked, NS_PER_TICK); + + if (blocked > stolentick) + blocked = stolentick; + + if (stolen > 0 || blocked > 0) { + account_steal_time(NULL, jiffies_to_cputime(stolen)); + account_steal_time(idle_task(cpu), jiffies_to_cputime(blocked)); + run_local_timers(); + + if (rcu_pending(cpu)) + rcu_check_callbacks(cpu, user_mode(get_irq_regs())); + + scheduler_tick(); + run_posix_cpu_timers(p); + delta_itm += local_cpu_data->itm_delta * (stolen + blocked); + + if (cpu == time_keeper_id) { + write_seqlock(&xtime_lock); + do_timer(stolen + blocked); + local_cpu_data->itm_next = delta_itm + new_itm; + write_sequnlock(&xtime_lock); + } else { + local_cpu_data->itm_next = delta_itm + new_itm; + } + per_cpu(processed_stolen_time, cpu) += NS_PER_TICK * stolen; + per_cpu(processed_blocked_time, cpu) += NS_PER_TICK * blocked; + } + return delta_itm; +} + +static int xen_do_steal_accounting(unsigned long *new_itm) +{ + unsigned long delta_itm; + delta_itm = consider_steal_time(*new_itm); + *new_itm += delta_itm; + if (time_after(*new_itm, ia64_get_itc()) && delta_itm) + return 1; + + return 0; +} + +static void xen_itc_jitter_data_reset(void) +{ + u64 lcycle, ret; + + do { + lcycle = itc_jitter_data.itc_lastcycle; + ret = cmpxchg(&itc_jitter_data.itc_lastcycle, lcycle, 0); + } while (unlikely(ret != lcycle)); +} + +struct pv_time_ops xen_time_ops __initdata = { + .init_missing_ticks_accounting = xen_init_missing_ticks_accounting, + .do_steal_accounting = xen_do_steal_accounting, + .clocksource_resume = xen_itc_jitter_data_reset, +}; diff --git a/arch/ia64/xen/time.h b/arch/ia64/xen/time.h new file mode 100644 index 0000000..b9c7ec5 --- /dev/null +++ b/arch/ia64/xen/time.h @@ -0,0 +1,23 @@ +/****************************************************************************** + * arch/ia64/xen/time.h + * + * Copyright (c) 2008 Isaku Yamahata + * VA Linux Systems Japan K.K. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + */ + +extern struct pv_time_ops xen_time_ops __initdata; diff --git a/arch/ia64/xen/xen_pv_ops.c b/arch/ia64/xen/xen_pv_ops.c index 4fe4e62..04cd123 100644 --- a/arch/ia64/xen/xen_pv_ops.c +++ b/arch/ia64/xen/xen_pv_ops.c @@ -30,6 +30,7 @@ #include #include "irq_xen.h" +#include "time.h" /*************************************************************************** * general info @@ -357,6 +358,7 @@ xen_setup_pv_ops(void) pv_cpu_ops = xen_cpu_ops; pv_iosapic_ops = xen_iosapic_ops; pv_irq_ops = xen_irq_ops; + pv_time_ops = xen_time_ops; paravirt_cpu_asm_init(&xen_cpu_asm_switch); } -- 1.6.0.2 From yamahata at valinux.co.jp Thu Oct 16 19:18:01 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Fri, 17 Oct 2008 11:18:01 +0900 Subject: [PATCH 21/33] ia64/pv_ops/xen: define xen paravirtualized instructions for hand written assembly code In-Reply-To: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> References: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224209893-2032-22-git-send-email-yamahata@valinux.co.jp> define xen paravirtualized instructions for hand written assembly code. Signed-off-by: Yaozu (Eddie) Dong Signed-off-by: Isaku Yamahata Cc: Akio Takebe --- arch/ia64/include/asm/xen/inst.h | 447 ++++++++++++++++++++++++++++++++++++++ 1 files changed, 447 insertions(+), 0 deletions(-) create mode 100644 arch/ia64/include/asm/xen/inst.h diff --git a/arch/ia64/include/asm/xen/inst.h b/arch/ia64/include/asm/xen/inst.h new file mode 100644 index 0000000..03895e9 --- /dev/null +++ b/arch/ia64/include/asm/xen/inst.h @@ -0,0 +1,447 @@ +/****************************************************************************** + * arch/ia64/include/asm/xen/inst.h + * + * Copyright (c) 2008 Isaku Yamahata + * VA Linux Systems Japan K.K. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + */ + +#include + +#define MOV_FROM_IFA(reg) \ + movl reg = XSI_IFA; \ + ;; \ + ld8 reg = [reg] + +#define MOV_FROM_ITIR(reg) \ + movl reg = XSI_ITIR; \ + ;; \ + ld8 reg = [reg] + +#define MOV_FROM_ISR(reg) \ + movl reg = XSI_ISR; \ + ;; \ + ld8 reg = [reg] + +#define MOV_FROM_IHA(reg) \ + movl reg = XSI_IHA; \ + ;; \ + ld8 reg = [reg] + +#define MOV_FROM_IPSR(pred, reg) \ +(pred) movl reg = XSI_IPSR; \ + ;; \ +(pred) ld8 reg = [reg] + +#define MOV_FROM_IIM(reg) \ + movl reg = XSI_IIM; \ + ;; \ + ld8 reg = [reg] + +#define MOV_FROM_IIP(reg) \ + movl reg = XSI_IIP; \ + ;; \ + ld8 reg = [reg] + +.macro __MOV_FROM_IVR reg, clob + .ifc "\reg", "r8" + XEN_HYPER_GET_IVR + .exitm + .endif + .ifc "\clob", "r8" + XEN_HYPER_GET_IVR + ;; + mov \reg = r8 + .exitm + .endif + + mov \clob = r8 + ;; + XEN_HYPER_GET_IVR + ;; + mov \reg = r8 + ;; + mov r8 = \clob +.endm +#define MOV_FROM_IVR(reg, clob) __MOV_FROM_IVR reg, clob + +.macro __MOV_FROM_PSR pred, reg, clob + .ifc "\reg", "r8" + (\pred) XEN_HYPER_GET_PSR; + .exitm + .endif + .ifc "\clob", "r8" + (\pred) XEN_HYPER_GET_PSR + ;; + (\pred) mov \reg = r8 + .exitm + .endif + + (\pred) mov \clob = r8 + (\pred) XEN_HYPER_GET_PSR + ;; + (\pred) mov \reg = r8 + (\pred) mov r8 = \clob +.endm +#define MOV_FROM_PSR(pred, reg, clob) __MOV_FROM_PSR pred, reg, clob + + +#define MOV_TO_IFA(reg, clob) \ + movl clob = XSI_IFA; \ + ;; \ + st8 [clob] = reg \ + +#define MOV_TO_ITIR(pred, reg, clob) \ +(pred) movl clob = XSI_ITIR; \ + ;; \ +(pred) st8 [clob] = reg + +#define MOV_TO_IHA(pred, reg, clob) \ +(pred) movl clob = XSI_IHA; \ + ;; \ +(pred) st8 [clob] = reg + +#define MOV_TO_IPSR(pred, reg, clob) \ +(pred) movl clob = XSI_IPSR; \ + ;; \ +(pred) st8 [clob] = reg; \ + ;; + +#define MOV_TO_IFS(pred, reg, clob) \ +(pred) movl clob = XSI_IFS; \ + ;; \ +(pred) st8 [clob] = reg; \ + ;; + +#define MOV_TO_IIP(reg, clob) \ + movl clob = XSI_IIP; \ + ;; \ + st8 [clob] = reg + +.macro ____MOV_TO_KR kr, reg, clob0, clob1 + .ifc "\clob0", "r9" + .error "clob0 \clob0 must not be r9" + .endif + .ifc "\clob1", "r8" + .error "clob1 \clob1 must not be r8" + .endif + + .ifnc "\reg", "r9" + .ifnc "\clob1", "r9" + mov \clob1 = r9 + .endif + mov r9 = \reg + .endif + .ifnc "\clob0", "r8" + mov \clob0 = r8 + .endif + mov r8 = \kr + ;; + XEN_HYPER_SET_KR + + .ifnc "\reg", "r9" + .ifnc "\clob1", "r9" + mov r9 = \clob1 + .endif + .endif + .ifnc "\clob0", "r8" + mov r8 = \clob0 + .endif +.endm + +.macro __MOV_TO_KR kr, reg, clob0, clob1 + .ifc "\clob0", "r9" + ____MOV_TO_KR \kr, \reg, \clob1, \clob0 + .exitm + .endif + .ifc "\clob1", "r8" + ____MOV_TO_KR \kr, \reg, \clob1, \clob0 + .exitm + .endif + + ____MOV_TO_KR \kr, \reg, \clob0, \clob1 +.endm + +#define MOV_TO_KR(kr, reg, clob0, clob1) \ + __MOV_TO_KR IA64_KR_ ## kr, reg, clob0, clob1 + + +.macro __ITC_I pred, reg, clob + .ifc "\reg", "r8" + (\pred) XEN_HYPER_ITC_I + .exitm + .endif + .ifc "\clob", "r8" + (\pred) mov r8 = \reg + ;; + (\pred) XEN_HYPER_ITC_I + .exitm + .endif + + (\pred) mov \clob = r8 + (\pred) mov r8 = \reg + ;; + (\pred) XEN_HYPER_ITC_I + ;; + (\pred) mov r8 = \clob + ;; +.endm +#define ITC_I(pred, reg, clob) __ITC_I pred, reg, clob + +.macro __ITC_D pred, reg, clob + .ifc "\reg", "r8" + (\pred) XEN_HYPER_ITC_D + ;; + .exitm + .endif + .ifc "\clob", "r8" + (\pred) mov r8 = \reg + ;; + (\pred) 
XEN_HYPER_ITC_D + ;; + .exitm + .endif + + (\pred) mov \clob = r8 + (\pred) mov r8 = \reg + ;; + (\pred) XEN_HYPER_ITC_D + ;; + (\pred) mov r8 = \clob + ;; +.endm +#define ITC_D(pred, reg, clob) __ITC_D pred, reg, clob + +.macro __ITC_I_AND_D pred_i, pred_d, reg, clob + .ifc "\reg", "r8" + (\pred_i)XEN_HYPER_ITC_I + ;; + (\pred_d)XEN_HYPER_ITC_D + ;; + .exitm + .endif + .ifc "\clob", "r8" + mov r8 = \reg + ;; + (\pred_i)XEN_HYPER_ITC_I + ;; + (\pred_d)XEN_HYPER_ITC_D + ;; + .exitm + .endif + + mov \clob = r8 + mov r8 = \reg + ;; + (\pred_i)XEN_HYPER_ITC_I + ;; + (\pred_d)XEN_HYPER_ITC_D + ;; + mov r8 = \clob + ;; +.endm +#define ITC_I_AND_D(pred_i, pred_d, reg, clob) \ + __ITC_I_AND_D pred_i, pred_d, reg, clob + +.macro __THASH pred, reg0, reg1, clob + .ifc "\reg0", "r8" + (\pred) mov r8 = \reg1 + (\pred) XEN_HYPER_THASH + .exitm + .endc + .ifc "\reg1", "r8" + (\pred) XEN_HYPER_THASH + ;; + (\pred) mov \reg0 = r8 + ;; + .exitm + .endif + .ifc "\clob", "r8" + (\pred) mov r8 = \reg1 + (\pred) XEN_HYPER_THASH + ;; + (\pred) mov \reg0 = r8 + ;; + .exitm + .endif + + (\pred) mov \clob = r8 + (\pred) mov r8 = \reg1 + (\pred) XEN_HYPER_THASH + ;; + (\pred) mov \reg0 = r8 + (\pred) mov r8 = \clob + ;; +.endm +#define THASH(pred, reg0, reg1, clob) __THASH pred, reg0, reg1, clob + +#define SSM_PSR_IC_AND_DEFAULT_BITS_AND_SRLZ_I(clob0, clob1) \ + mov clob0 = 1; \ + movl clob1 = XSI_PSR_IC; \ + ;; \ + st4 [clob1] = clob0 \ + ;; + +#define SSM_PSR_IC_AND_SRLZ_D(clob0, clob1) \ + ;; \ + srlz.d; \ + mov clob1 = 1; \ + movl clob0 = XSI_PSR_IC; \ + ;; \ + st4 [clob0] = clob1 + +#define RSM_PSR_IC(clob) \ + movl clob = XSI_PSR_IC; \ + ;; \ + st4 [clob] = r0; \ + ;; + +/* pred will be clobbered */ +#define MASK_TO_PEND_OFS (-1) +#define SSM_PSR_I(pred, pred_clob, clob) \ +(pred) movl clob = XSI_PSR_I_ADDR \ + ;; \ +(pred) ld8 clob = [clob] \ + ;; \ + /* if (pred) vpsr.i = 1 */ \ + /* if (pred) (vcpu->vcpu_info->evtchn_upcall_mask)=0 */ \ +(pred) st1 [clob] = r0, MASK_TO_PEND_OFS \ + ;; \ + /* if (vcpu->vcpu_info->evtchn_upcall_pending) */ \ +(pred) ld1 clob = [clob] \ + ;; \ +(pred) cmp.ne.unc pred_clob, p0 = clob, r0 \ + ;; \ +(pred_clob)XEN_HYPER_SSM_I /* do areal ssm psr.i */ + +#define RSM_PSR_I(pred, clob0, clob1) \ + movl clob0 = XSI_PSR_I_ADDR; \ + mov clob1 = 1; \ + ;; \ + ld8 clob0 = [clob0]; \ + ;; \ +(pred) st1 [clob0] = clob1 + +#define RSM_PSR_I_IC(clob0, clob1, clob2) \ + movl clob0 = XSI_PSR_I_ADDR; \ + movl clob1 = XSI_PSR_IC; \ + ;; \ + ld8 clob0 = [clob0]; \ + mov clob2 = 1; \ + ;; \ + /* note: clears both vpsr.i and vpsr.ic! 
*/ \ + st1 [clob0] = clob2; \ + st4 [clob1] = r0; \ + ;; + +#define RSM_PSR_DT \ + XEN_HYPER_RSM_PSR_DT + +#define SSM_PSR_DT_AND_SRLZ_I \ + XEN_HYPER_SSM_PSR_DT + +#define BSW_0(clob0, clob1, clob2) \ + ;; \ + /* r16-r31 all now hold bank1 values */ \ + mov clob2 = ar.unat; \ + movl clob0 = XSI_BANK1_R16; \ + movl clob1 = XSI_BANK1_R16 + 8; \ + ;; \ +.mem.offset 0, 0; st8.spill [clob0] = r16, 16; \ +.mem.offset 8, 0; st8.spill [clob1] = r17, 16; \ + ;; \ +.mem.offset 0, 0; st8.spill [clob0] = r18, 16; \ +.mem.offset 8, 0; st8.spill [clob1] = r19, 16; \ + ;; \ +.mem.offset 0, 0; st8.spill [clob0] = r20, 16; \ +.mem.offset 8, 0; st8.spill [clob1] = r21, 16; \ + ;; \ +.mem.offset 0, 0; st8.spill [clob0] = r22, 16; \ +.mem.offset 8, 0; st8.spill [clob1] = r23, 16; \ + ;; \ +.mem.offset 0, 0; st8.spill [clob0] = r24, 16; \ +.mem.offset 8, 0; st8.spill [clob1] = r25, 16; \ + ;; \ +.mem.offset 0, 0; st8.spill [clob0] = r26, 16; \ +.mem.offset 8, 0; st8.spill [clob1] = r27, 16; \ + ;; \ +.mem.offset 0, 0; st8.spill [clob0] = r28, 16; \ +.mem.offset 8, 0; st8.spill [clob1] = r29, 16; \ + ;; \ +.mem.offset 0, 0; st8.spill [clob0] = r30, 16; \ +.mem.offset 8, 0; st8.spill [clob1] = r31, 16; \ + ;; \ + mov clob1 = ar.unat; \ + movl clob0 = XSI_B1NAT; \ + ;; \ + st8 [clob0] = clob1; \ + mov ar.unat = clob2; \ + movl clob0 = XSI_BANKNUM; \ + ;; \ + st4 [clob0] = r0 + + + /* FIXME: THIS CODE IS NOT NaT SAFE! */ +#define XEN_BSW_1(clob) \ + mov clob = ar.unat; \ + movl r30 = XSI_B1NAT; \ + ;; \ + ld8 r30 = [r30]; \ + mov r31 = 1; \ + ;; \ + mov ar.unat = r30; \ + movl r30 = XSI_BANKNUM; \ + ;; \ + st4 [r30] = r31; \ + movl r30 = XSI_BANK1_R16; \ + movl r31 = XSI_BANK1_R16+8; \ + ;; \ + ld8.fill r16 = [r30], 16; \ + ld8.fill r17 = [r31], 16; \ + ;; \ + ld8.fill r18 = [r30], 16; \ + ld8.fill r19 = [r31], 16; \ + ;; \ + ld8.fill r20 = [r30], 16; \ + ld8.fill r21 = [r31], 16; \ + ;; \ + ld8.fill r22 = [r30], 16; \ + ld8.fill r23 = [r31], 16; \ + ;; \ + ld8.fill r24 = [r30], 16; \ + ld8.fill r25 = [r31], 16; \ + ;; \ + ld8.fill r26 = [r30], 16; \ + ld8.fill r27 = [r31], 16; \ + ;; \ + ld8.fill r28 = [r30], 16; \ + ld8.fill r29 = [r31], 16; \ + ;; \ + ld8.fill r30 = [r30]; \ + ld8.fill r31 = [r31]; \ + ;; \ + mov ar.unat = clob + +#define BSW_1(clob0, clob1) XEN_BSW_1(clob1) + + +#define COVER \ + XEN_HYPER_COVER + +#define RFI \ + XEN_HYPER_RFI; \ + dv_serialize_data -- 1.6.0.2 From yamahata at valinux.co.jp Thu Oct 16 19:17:54 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Fri, 17 Oct 2008 11:17:54 +0900 Subject: [PATCH 14/33] ia64/xen: xencomm conversion functions for hypercalls In-Reply-To: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> References: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224209893-2032-15-git-send-email-yamahata@valinux.co.jp> On ia64/xen, pointer arguments for hypercall is passed by pseudo physical address(guest physical address.) So such hypercalls needs address conversion functions. This patch implements concrete conversion functions for such hypercalls. 
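To make the calling convention concrete, here is a minimal sketch (illustration only, not part of the patch) of how a caller reaches the hypervisor through one of these wrappers. Only the #define convention and xencomm_hypercall_event_channel_op() come from this patch; the include line, the unmask operation and the example caller are assumptions picked for the illustration:

	#include <xen/interface/event_channel.h>   /* struct evtchn_unmask, EVTCHNOP_unmask */

	/* asm/xen/hypercall.h maps the generic name onto the xencomm wrapper */
	#define HYPERVISOR_event_channel_op	xencomm_hypercall_event_channel_op

	/* hypothetical caller, shown only to illustrate the flow */
	static int example_unmask_evtchn(int port)
	{
		struct evtchn_unmask unmask = { .port = port };

		/*
		 * The wrapper builds an inline xencomm descriptor for &unmask
		 * (a pseudo-physical view of the buffer) and only then issues
		 * the raw xencomm_arch_hypercall_event_channel_op(), so the
		 * hypervisor never sees a guest virtual address.
		 */
		return HYPERVISOR_event_channel_op(EVTCHNOP_unmask, &unmask);
	}

Because only inline or mini descriptors are created, this only works when the argument buffers live in kernel memory, as the comment at the top of xcom_hcall.c notes.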
Signed-off-by: Akio Takebe Signed-off-by: Yaozu (Eddie) Dong Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/xen/xcom_hcall.h | 51 ++++ arch/ia64/include/asm/xen/xencomm.h | 1 + arch/ia64/xen/Makefile | 2 +- arch/ia64/xen/xcom_hcall.c | 441 ++++++++++++++++++++++++++++++++ arch/ia64/xen/xencomm.c | 11 + 5 files changed, 505 insertions(+), 1 deletions(-) create mode 100644 arch/ia64/include/asm/xen/xcom_hcall.h create mode 100644 arch/ia64/xen/xcom_hcall.c diff --git a/arch/ia64/include/asm/xen/xcom_hcall.h b/arch/ia64/include/asm/xen/xcom_hcall.h new file mode 100644 index 0000000..20b2950 --- /dev/null +++ b/arch/ia64/include/asm/xen/xcom_hcall.h @@ -0,0 +1,51 @@ +/* + * Copyright (C) 2006 Tristan Gingold , Bull SAS + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef _ASM_IA64_XEN_XCOM_HCALL_H +#define _ASM_IA64_XEN_XCOM_HCALL_H + +/* These function creates inline or mini descriptor for the parameters and + calls the corresponding xencomm_arch_hypercall_X. + Architectures should defines HYPERVISOR_xxx as xencomm_hypercall_xxx unless + they want to use their own wrapper. */ +extern int xencomm_hypercall_console_io(int cmd, int count, char *str); + +extern int xencomm_hypercall_event_channel_op(int cmd, void *op); + +extern int xencomm_hypercall_xen_version(int cmd, void *arg); + +extern int xencomm_hypercall_physdev_op(int cmd, void *op); + +extern int xencomm_hypercall_grant_table_op(unsigned int cmd, void *op, + unsigned int count); + +extern int xencomm_hypercall_sched_op(int cmd, void *arg); + +extern int xencomm_hypercall_multicall(void *call_list, int nr_calls); + +extern int xencomm_hypercall_callback_op(int cmd, void *arg); + +extern int xencomm_hypercall_memory_op(unsigned int cmd, void *arg); + +extern int xencomm_hypercall_suspend(unsigned long srec); + +extern long xencomm_hypercall_vcpu_op(int cmd, int cpu, void *arg); + +extern long xencomm_hypercall_opt_feature(void *arg); + +#endif /* _ASM_IA64_XEN_XCOM_HCALL_H */ diff --git a/arch/ia64/include/asm/xen/xencomm.h b/arch/ia64/include/asm/xen/xencomm.h index 28732cd..cded677 100644 --- a/arch/ia64/include/asm/xen/xencomm.h +++ b/arch/ia64/include/asm/xen/xencomm.h @@ -24,6 +24,7 @@ /* Must be called before any hypercall. */ extern void xencomm_initialize(void); +extern int xencomm_is_initialized(void); /* Check if virtual contiguity means physical contiguity * where the passed address is a pointer value in virtual address. 
diff --git a/arch/ia64/xen/Makefile b/arch/ia64/xen/Makefile index ad0c9f7..ae08822 100644 --- a/arch/ia64/xen/Makefile +++ b/arch/ia64/xen/Makefile @@ -2,4 +2,4 @@ # Makefile for Xen components # -obj-y := hypercall.o xencomm.o +obj-y := hypercall.o xencomm.o xcom_hcall.o diff --git a/arch/ia64/xen/xcom_hcall.c b/arch/ia64/xen/xcom_hcall.c new file mode 100644 index 0000000..ccaf743 --- /dev/null +++ b/arch/ia64/xen/xcom_hcall.c @@ -0,0 +1,441 @@ +/* + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. + * + * Tristan Gingold + * + * Copyright (c) 2007 + * Isaku Yamahata + * VA Linux Systems Japan K.K. + * consolidate mini and inline version. + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +/* Xencomm notes: + * This file defines hypercalls to be used by xencomm. The hypercalls simply + * create inlines or mini descriptors for pointers and then call the raw arch + * hypercall xencomm_arch_hypercall_XXX + * + * If the arch wants to directly use these hypercalls, simply define macros + * in asm/xen/hypercall.h, eg: + * #define HYPERVISOR_sched_op xencomm_hypercall_sched_op + * + * The arch may also define HYPERVISOR_xxx as a function and do more operations + * before/after doing the hypercall. + * + * Note: because only inline or mini descriptors are created these functions + * must only be called with in kernel memory parameters. + */ + +int +xencomm_hypercall_console_io(int cmd, int count, char *str) +{ + /* xen early printk uses console io hypercall before + * xencomm initialization. In that case, we just ignore it. + */ + if (!xencomm_is_initialized()) + return 0; + + return xencomm_arch_hypercall_console_io + (cmd, count, xencomm_map_no_alloc(str, count)); +} +EXPORT_SYMBOL_GPL(xencomm_hypercall_console_io); + +int +xencomm_hypercall_event_channel_op(int cmd, void *op) +{ + struct xencomm_handle *desc; + desc = xencomm_map_no_alloc(op, sizeof(struct evtchn_op)); + if (desc == NULL) + return -EINVAL; + + return xencomm_arch_hypercall_event_channel_op(cmd, desc); +} +EXPORT_SYMBOL_GPL(xencomm_hypercall_event_channel_op); + +int +xencomm_hypercall_xen_version(int cmd, void *arg) +{ + struct xencomm_handle *desc; + unsigned int argsize; + + switch (cmd) { + case XENVER_version: + /* do not actually pass an argument */ + return xencomm_arch_hypercall_xen_version(cmd, 0); + case XENVER_extraversion: + argsize = sizeof(struct xen_extraversion); + break; + case XENVER_compile_info: + argsize = sizeof(struct xen_compile_info); + break; + case XENVER_capabilities: + argsize = sizeof(struct xen_capabilities_info); + break; + case XENVER_changeset: + argsize = sizeof(struct xen_changeset_info); + break; + case XENVER_platform_parameters: + argsize = sizeof(struct xen_platform_parameters); + break; + case XENVER_get_features: + argsize = (arg == NULL) ? 
0 : sizeof(struct xen_feature_info); + break; + + default: + printk(KERN_DEBUG + "%s: unknown version op %d\n", __func__, cmd); + return -ENOSYS; + } + + desc = xencomm_map_no_alloc(arg, argsize); + if (desc == NULL) + return -EINVAL; + + return xencomm_arch_hypercall_xen_version(cmd, desc); +} +EXPORT_SYMBOL_GPL(xencomm_hypercall_xen_version); + +int +xencomm_hypercall_physdev_op(int cmd, void *op) +{ + unsigned int argsize; + + switch (cmd) { + case PHYSDEVOP_apic_read: + case PHYSDEVOP_apic_write: + argsize = sizeof(struct physdev_apic); + break; + case PHYSDEVOP_alloc_irq_vector: + case PHYSDEVOP_free_irq_vector: + argsize = sizeof(struct physdev_irq); + break; + case PHYSDEVOP_irq_status_query: + argsize = sizeof(struct physdev_irq_status_query); + break; + + default: + printk(KERN_DEBUG + "%s: unknown physdev op %d\n", __func__, cmd); + return -ENOSYS; + } + + return xencomm_arch_hypercall_physdev_op + (cmd, xencomm_map_no_alloc(op, argsize)); +} + +static int +xencommize_grant_table_op(struct xencomm_mini **xc_area, + unsigned int cmd, void *op, unsigned int count, + struct xencomm_handle **desc) +{ + struct xencomm_handle *desc1; + unsigned int argsize; + + switch (cmd) { + case GNTTABOP_map_grant_ref: + argsize = sizeof(struct gnttab_map_grant_ref); + break; + case GNTTABOP_unmap_grant_ref: + argsize = sizeof(struct gnttab_unmap_grant_ref); + break; + case GNTTABOP_setup_table: + { + struct gnttab_setup_table *setup = op; + + argsize = sizeof(*setup); + + if (count != 1) + return -EINVAL; + desc1 = __xencomm_map_no_alloc + (xen_guest_handle(setup->frame_list), + setup->nr_frames * + sizeof(*xen_guest_handle(setup->frame_list)), + *xc_area); + if (desc1 == NULL) + return -EINVAL; + (*xc_area)++; + set_xen_guest_handle(setup->frame_list, (void *)desc1); + break; + } + case GNTTABOP_dump_table: + argsize = sizeof(struct gnttab_dump_table); + break; + case GNTTABOP_transfer: + argsize = sizeof(struct gnttab_transfer); + break; + case GNTTABOP_copy: + argsize = sizeof(struct gnttab_copy); + break; + case GNTTABOP_query_size: + argsize = sizeof(struct gnttab_query_size); + break; + default: + printk(KERN_DEBUG "%s: unknown hypercall grant table op %d\n", + __func__, cmd); + BUG(); + } + + *desc = __xencomm_map_no_alloc(op, count * argsize, *xc_area); + if (*desc == NULL) + return -EINVAL; + (*xc_area)++; + + return 0; +} + +int +xencomm_hypercall_grant_table_op(unsigned int cmd, void *op, + unsigned int count) +{ + int rc; + struct xencomm_handle *desc; + XENCOMM_MINI_ALIGNED(xc_area, 2); + + rc = xencommize_grant_table_op(&xc_area, cmd, op, count, &desc); + if (rc) + return rc; + + return xencomm_arch_hypercall_grant_table_op(cmd, desc, count); +} +EXPORT_SYMBOL_GPL(xencomm_hypercall_grant_table_op); + +int +xencomm_hypercall_sched_op(int cmd, void *arg) +{ + struct xencomm_handle *desc; + unsigned int argsize; + + switch (cmd) { + case SCHEDOP_yield: + case SCHEDOP_block: + argsize = 0; + break; + case SCHEDOP_shutdown: + argsize = sizeof(struct sched_shutdown); + break; + case SCHEDOP_poll: + { + struct sched_poll *poll = arg; + struct xencomm_handle *ports; + + argsize = sizeof(struct sched_poll); + ports = xencomm_map_no_alloc(xen_guest_handle(poll->ports), + sizeof(*xen_guest_handle(poll->ports))); + + set_xen_guest_handle(poll->ports, (void *)ports); + break; + } + default: + printk(KERN_DEBUG "%s: unknown sched op %d\n", __func__, cmd); + return -ENOSYS; + } + + desc = xencomm_map_no_alloc(arg, argsize); + if (desc == NULL) + return -EINVAL; + + return 
xencomm_arch_hypercall_sched_op(cmd, desc); +} +EXPORT_SYMBOL_GPL(xencomm_hypercall_sched_op); + +int +xencomm_hypercall_multicall(void *call_list, int nr_calls) +{ + int rc; + int i; + struct multicall_entry *mce; + struct xencomm_handle *desc; + XENCOMM_MINI_ALIGNED(xc_area, nr_calls * 2); + + for (i = 0; i < nr_calls; i++) { + mce = (struct multicall_entry *)call_list + i; + + switch (mce->op) { + case __HYPERVISOR_update_va_mapping: + case __HYPERVISOR_mmu_update: + /* No-op on ia64. */ + break; + case __HYPERVISOR_grant_table_op: + rc = xencommize_grant_table_op + (&xc_area, + mce->args[0], (void *)mce->args[1], + mce->args[2], &desc); + if (rc) + return rc; + mce->args[1] = (unsigned long)desc; + break; + case __HYPERVISOR_memory_op: + default: + printk(KERN_DEBUG + "%s: unhandled multicall op entry op %lu\n", + __func__, mce->op); + return -ENOSYS; + } + } + + desc = xencomm_map_no_alloc(call_list, + nr_calls * sizeof(struct multicall_entry)); + if (desc == NULL) + return -EINVAL; + + return xencomm_arch_hypercall_multicall(desc, nr_calls); +} +EXPORT_SYMBOL_GPL(xencomm_hypercall_multicall); + +int +xencomm_hypercall_callback_op(int cmd, void *arg) +{ + unsigned int argsize; + switch (cmd) { + case CALLBACKOP_register: + argsize = sizeof(struct callback_register); + break; + case CALLBACKOP_unregister: + argsize = sizeof(struct callback_unregister); + break; + default: + printk(KERN_DEBUG + "%s: unknown callback op %d\n", __func__, cmd); + return -ENOSYS; + } + + return xencomm_arch_hypercall_callback_op + (cmd, xencomm_map_no_alloc(arg, argsize)); +} + +static int +xencommize_memory_reservation(struct xencomm_mini *xc_area, + struct xen_memory_reservation *mop) +{ + struct xencomm_handle *desc; + + desc = __xencomm_map_no_alloc(xen_guest_handle(mop->extent_start), + mop->nr_extents * + sizeof(*xen_guest_handle(mop->extent_start)), + xc_area); + if (desc == NULL) + return -EINVAL; + + set_xen_guest_handle(mop->extent_start, (void *)desc); + return 0; +} + +int +xencomm_hypercall_memory_op(unsigned int cmd, void *arg) +{ + GUEST_HANDLE(xen_pfn_t) extent_start_va[2] = { {NULL}, {NULL} }; + struct xen_memory_reservation *xmr = NULL; + int rc; + struct xencomm_handle *desc; + unsigned int argsize; + XENCOMM_MINI_ALIGNED(xc_area, 2); + + switch (cmd) { + case XENMEM_increase_reservation: + case XENMEM_decrease_reservation: + case XENMEM_populate_physmap: + xmr = (struct xen_memory_reservation *)arg; + set_xen_guest_handle(extent_start_va[0], + xen_guest_handle(xmr->extent_start)); + + argsize = sizeof(*xmr); + rc = xencommize_memory_reservation(xc_area, xmr); + if (rc) + return rc; + xc_area++; + break; + + case XENMEM_maximum_ram_page: + argsize = 0; + break; + + case XENMEM_add_to_physmap: + argsize = sizeof(struct xen_add_to_physmap); + break; + + default: + printk(KERN_DEBUG "%s: unknown memory op %d\n", __func__, cmd); + return -ENOSYS; + } + + desc = xencomm_map_no_alloc(arg, argsize); + if (desc == NULL) + return -EINVAL; + + rc = xencomm_arch_hypercall_memory_op(cmd, desc); + + switch (cmd) { + case XENMEM_increase_reservation: + case XENMEM_decrease_reservation: + case XENMEM_populate_physmap: + set_xen_guest_handle(xmr->extent_start, + xen_guest_handle(extent_start_va[0])); + break; + } + + return rc; +} +EXPORT_SYMBOL_GPL(xencomm_hypercall_memory_op); + +int +xencomm_hypercall_suspend(unsigned long srec) +{ + struct sched_shutdown arg; + + arg.reason = SHUTDOWN_suspend; + + return xencomm_arch_hypercall_sched_op( + SCHEDOP_shutdown, xencomm_map_no_alloc(&arg, sizeof(arg))); +} 
+ +long +xencomm_hypercall_vcpu_op(int cmd, int cpu, void *arg) +{ + unsigned int argsize; + switch (cmd) { + case VCPUOP_register_runstate_memory_area: { + struct vcpu_register_runstate_memory_area *area = + (struct vcpu_register_runstate_memory_area *)arg; + argsize = sizeof(*arg); + set_xen_guest_handle(area->addr.h, + (void *)xencomm_map_no_alloc(area->addr.v, + sizeof(area->addr.v))); + break; + } + + default: + printk(KERN_DEBUG "%s: unknown vcpu op %d\n", __func__, cmd); + return -ENOSYS; + } + + return xencomm_arch_hypercall_vcpu_op(cmd, cpu, + xencomm_map_no_alloc(arg, argsize)); +} + +long +xencomm_hypercall_opt_feature(void *arg) +{ + return xencomm_arch_hypercall_opt_feature( + xencomm_map_no_alloc(arg, + sizeof(struct xen_ia64_opt_feature))); +} diff --git a/arch/ia64/xen/xencomm.c b/arch/ia64/xen/xencomm.c index 3dc307f..1f5d7ac 100644 --- a/arch/ia64/xen/xencomm.c +++ b/arch/ia64/xen/xencomm.c @@ -19,11 +19,22 @@ #include static unsigned long kernel_virtual_offset; +static int is_xencomm_initialized; + +/* for xen early printk. It uses console io hypercall which uses xencomm. + * However early printk may use it before xencomm initialization. + */ +int +xencomm_is_initialized(void) +{ + return is_xencomm_initialized; +} void xencomm_initialize(void) { kernel_virtual_offset = KERNEL_START - ia64_tpa(KERNEL_START); + is_xencomm_initialized = 1; } /* Translate virtual address to physical address. */ -- 1.6.0.2 From yamahata at valinux.co.jp Thu Oct 16 19:18:09 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Fri, 17 Oct 2008 11:18:09 +0900 Subject: [PATCH 29/33] ia64/xen: define xen machine vector for domU. In-Reply-To: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> References: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224209893-2032-30-git-send-email-yamahata@valinux.co.jp> define xen machine vector for domU. 
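For readers not familiar with the ia64 machine vector mechanism, here is a short sketch (illustration only, not part of the patch) of how a CONFIG_IA64_GENERIC kernel ends up on the new vector; acpi_get_sysname() and machvec_init() are existing ia64 boot-path functions, and only the "xen" case in acpi_get_sysname() is added by this patch:

	/* conceptual flow on a generic kernel (the real call sites are in
	 * setup_arch() and arch/ia64/kernel/machvec.c) */
	const char *name = acpi_get_sysname();	/* returns "xen" when running as
						 * a PV domain whose ACPI OEM id
						 * is "XEN"                      */
	machvec_init(name);			/* binds ia64_mv to the "xen"
						 * vector that xen/machvec.c
						 * instantiates from
						 * machvec_xen.h                 */

A non-generic CONFIG_IA64_XEN_GUEST kernel skips the lookup and uses the platform_* macros from machvec_xen.h directly, as the dual-use comment in that header explains.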
Signed-off-by: Isaku Yamahata Cc: "Luck, Tony" --- arch/ia64/Makefile | 2 ++ arch/ia64/include/asm/machvec.h | 2 ++ arch/ia64/include/asm/machvec_xen.h | 22 ++++++++++++++++++++++ arch/ia64/kernel/acpi.c | 5 +++++ arch/ia64/xen/Makefile | 2 ++ arch/ia64/xen/machvec.c | 4 ++++ 6 files changed, 37 insertions(+), 0 deletions(-) create mode 100644 arch/ia64/include/asm/machvec_xen.h create mode 100644 arch/ia64/xen/machvec.c diff --git a/arch/ia64/Makefile b/arch/ia64/Makefile index 905d25b..4024250 100644 --- a/arch/ia64/Makefile +++ b/arch/ia64/Makefile @@ -56,9 +56,11 @@ core-$(CONFIG_IA64_DIG) += arch/ia64/dig/ core-$(CONFIG_IA64_GENERIC) += arch/ia64/dig/ core-$(CONFIG_IA64_HP_ZX1) += arch/ia64/dig/ core-$(CONFIG_IA64_HP_ZX1_SWIOTLB) += arch/ia64/dig/ +core-$(CONFIG_IA64_XEN_GUEST) += arch/ia64/dig/ core-$(CONFIG_IA64_SGI_SN2) += arch/ia64/sn/ core-$(CONFIG_IA64_SGI_UV) += arch/ia64/uv/ core-$(CONFIG_KVM) += arch/ia64/kvm/ +core-$(CONFIG_XEN) += arch/ia64/xen/ drivers-$(CONFIG_PCI) += arch/ia64/pci/ drivers-$(CONFIG_IA64_HP_SIM) += arch/ia64/hp/sim/ diff --git a/arch/ia64/include/asm/machvec.h b/arch/ia64/include/asm/machvec.h index 2b850cc..de99cb2 100644 --- a/arch/ia64/include/asm/machvec.h +++ b/arch/ia64/include/asm/machvec.h @@ -128,6 +128,8 @@ extern void machvec_tlb_migrate_finish (struct mm_struct *); # include # elif defined (CONFIG_IA64_SGI_UV) # include +# elif defined (CONFIG_IA64_XEN_GUEST) +# include # elif defined (CONFIG_IA64_GENERIC) # ifdef MACHVEC_PLATFORM_HEADER diff --git a/arch/ia64/include/asm/machvec_xen.h b/arch/ia64/include/asm/machvec_xen.h new file mode 100644 index 0000000..55f9228 --- /dev/null +++ b/arch/ia64/include/asm/machvec_xen.h @@ -0,0 +1,22 @@ +#ifndef _ASM_IA64_MACHVEC_XEN_h +#define _ASM_IA64_MACHVEC_XEN_h + +extern ia64_mv_setup_t dig_setup; +extern ia64_mv_cpu_init_t xen_cpu_init; +extern ia64_mv_irq_init_t xen_irq_init; +extern ia64_mv_send_ipi_t xen_platform_send_ipi; + +/* + * This stuff has dual use! + * + * For a generic kernel, the macros are used to initialize the + * platform's machvec structure. When compiling a non-generic kernel, + * the macros are used directly. + */ +#define platform_name "xen" +#define platform_setup dig_setup +#define platform_cpu_init xen_cpu_init +#define platform_irq_init xen_irq_init +#define platform_send_ipi xen_platform_send_ipi + +#endif /* _ASM_IA64_MACHVEC_XEN_h */ diff --git a/arch/ia64/kernel/acpi.c b/arch/ia64/kernel/acpi.c index 5d1eb7e..0093649 100644 --- a/arch/ia64/kernel/acpi.c +++ b/arch/ia64/kernel/acpi.c @@ -52,6 +52,7 @@ #include #include #include +#include #define BAD_MADT_ENTRY(entry, end) ( \ (!entry) || (unsigned long)entry + sizeof(*entry) > end || \ @@ -121,6 +122,8 @@ acpi_get_sysname(void) return "uv"; else return "sn2"; + } else if (xen_pv_domain() && !strcmp(hdr->oem_id, "XEN")) { + return "xen"; } return "dig"; @@ -137,6 +140,8 @@ acpi_get_sysname(void) return "uv"; # elif defined (CONFIG_IA64_DIG) return "dig"; +# elif defined (CONFIG_IA64_XEN_GUEST) + return "xen"; # else # error Unknown platform. Fix acpi.c. 
# endif diff --git a/arch/ia64/xen/Makefile b/arch/ia64/xen/Makefile index ed31c76..972d085 100644 --- a/arch/ia64/xen/Makefile +++ b/arch/ia64/xen/Makefile @@ -5,6 +5,8 @@ obj-y := hypercall.o xenivt.o xensetup.o xen_pv_ops.o irq_xen.o \ hypervisor.o xencomm.o xcom_hcall.o grant-table.o time.o +obj-$(CONFIG_IA64_GENERIC) += machvec.o + AFLAGS_xenivt.o += -D__IA64_ASM_PARAVIRTUALIZED_XEN # xen multi compile diff --git a/arch/ia64/xen/machvec.c b/arch/ia64/xen/machvec.c new file mode 100644 index 0000000..4ad588a --- /dev/null +++ b/arch/ia64/xen/machvec.c @@ -0,0 +1,4 @@ +#define MACHVEC_PLATFORM_NAME xen +#define MACHVEC_PLATFORM_HEADER +#include + -- 1.6.0.2 From yamahata at valinux.co.jp Thu Oct 16 19:17:50 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Fri, 17 Oct 2008 11:17:50 +0900 Subject: [PATCH 10/33] ia64/xen: add a necessary header file to compile include/xen/interface/xen.h In-Reply-To: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> References: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224209893-2032-11-git-send-email-yamahata@valinux.co.jp> Create include/asm-ia64/pvclock-abi.h to compile which contains the same definitions of include/asm-x86/pvclock-abi.h because ia64/xen uses same structure. Hopefully include/asm-x86/pvclock-abi.h would be moved to somewhere more generic. Another approach is to include include/asm-x86/pvclock-abi.h from include/asm-ia64/pvclock-abi.h. But this would break if x86 header files are moved under arch/x86. So for now, same definitions are duplicated as suggested by Tony. Signed-off-by: Isaku Yamahata Cc: "Luck, Tony" --- arch/ia64/include/asm/pvclock-abi.h | 48 +++++++++++++++++++++++++++++++++++ 1 files changed, 48 insertions(+), 0 deletions(-) create mode 100644 arch/ia64/include/asm/pvclock-abi.h diff --git a/arch/ia64/include/asm/pvclock-abi.h b/arch/ia64/include/asm/pvclock-abi.h new file mode 100644 index 0000000..38a7a9e --- /dev/null +++ b/arch/ia64/include/asm/pvclock-abi.h @@ -0,0 +1,48 @@ +/* + * same structure to x86's + * Hopefully asm-x86/pvclock-abi.h would be moved to somewhere more generic. + * For now, define same duplicated definitions. + */ + +#ifndef ASM_IA64__PVCLOCK_ABI_H +#define ASM_IA64__PVCLOCK_ABI_H +#ifndef __ASSEMBLY__ + +/* + * These structs MUST NOT be changed. + * They are the ABI between hypervisor and guest OS. + * Both Xen and KVM are using this. + * + * pvclock_vcpu_time_info holds the system time and the tsc timestamp + * of the last update. So the guest can use the tsc delta to get a + * more precise system time. There is one per virtual cpu. + * + * pvclock_wall_clock references the point in time when the system + * time was zero (usually boot time), thus the guest calculates the + * current wall clock by adding the system time. + * + * Protocol for the "version" fields is: hypervisor raises it (making + * it uneven) before it starts updating the fields and raises it again + * (making it even) when it is done. Thus the guest can make sure the + * time values it got are consistent by checking the version before + * and after reading them. 
+ */ + +struct pvclock_vcpu_time_info { + u32 version; + u32 pad0; + u64 tsc_timestamp; + u64 system_time; + u32 tsc_to_system_mul; + s8 tsc_shift; + u8 pad[3]; +} __attribute__((__packed__)); /* 32 bytes */ + +struct pvclock_wall_clock { + u32 version; + u32 sec; + u32 nsec; +} __attribute__((__packed__)); + +#endif /* __ASSEMBLY__ */ +#endif /* ASM_IA64__PVCLOCK_ABI_H */ -- 1.6.0.2 From yamahata at valinux.co.jp Thu Oct 16 19:17:49 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Fri, 17 Oct 2008 11:17:49 +0900 Subject: [PATCH 09/33] ia64/xen: define several constants for ia64/xen. In-Reply-To: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> References: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224209893-2032-10-git-send-email-yamahata@valinux.co.jp> define several constants for ia64/xen. Signed-off-by: Isaku Yamahata --- arch/ia64/kernel/asm-offsets.c | 27 +++++++++++++++++++++++++++ 1 files changed, 27 insertions(+), 0 deletions(-) diff --git a/arch/ia64/kernel/asm-offsets.c b/arch/ia64/kernel/asm-offsets.c index 94c44b1..eaa988b 100644 --- a/arch/ia64/kernel/asm-offsets.c +++ b/arch/ia64/kernel/asm-offsets.c @@ -16,6 +16,8 @@ #include #include +#include + #include "../kernel/sigframe.h" #include "../kernel/fsyscall_gtod_data.h" @@ -286,4 +288,29 @@ void foo(void) offsetof (struct itc_jitter_data_t, itc_jitter)); DEFINE(IA64_ITC_LASTCYCLE_OFFSET, offsetof (struct itc_jitter_data_t, itc_lastcycle)); + +#ifdef CONFIG_XEN + BLANK(); + +#define DEFINE_MAPPED_REG_OFS(sym, field) \ + DEFINE(sym, (XMAPPEDREGS_OFS + offsetof(struct mapped_regs, field))) + + DEFINE_MAPPED_REG_OFS(XSI_PSR_I_ADDR_OFS, interrupt_mask_addr); + DEFINE_MAPPED_REG_OFS(XSI_IPSR_OFS, ipsr); + DEFINE_MAPPED_REG_OFS(XSI_IIP_OFS, iip); + DEFINE_MAPPED_REG_OFS(XSI_IFS_OFS, ifs); + DEFINE_MAPPED_REG_OFS(XSI_PRECOVER_IFS_OFS, precover_ifs); + DEFINE_MAPPED_REG_OFS(XSI_ISR_OFS, isr); + DEFINE_MAPPED_REG_OFS(XSI_IFA_OFS, ifa); + DEFINE_MAPPED_REG_OFS(XSI_IIPA_OFS, iipa); + DEFINE_MAPPED_REG_OFS(XSI_IIM_OFS, iim); + DEFINE_MAPPED_REG_OFS(XSI_IHA_OFS, iha); + DEFINE_MAPPED_REG_OFS(XSI_ITIR_OFS, itir); + DEFINE_MAPPED_REG_OFS(XSI_PSR_IC_OFS, interrupt_collection_enabled); + DEFINE_MAPPED_REG_OFS(XSI_BANKNUM_OFS, banknum); + DEFINE_MAPPED_REG_OFS(XSI_BANK0_R16_OFS, bank0_regs[0]); + DEFINE_MAPPED_REG_OFS(XSI_BANK1_R16_OFS, bank1_regs[0]); + DEFINE_MAPPED_REG_OFS(XSI_B0NATS_OFS, vbnat); + DEFINE_MAPPED_REG_OFS(XSI_B1NATS_OFS, vnat); +#endif /* CONFIG_XEN */ } -- 1.6.0.2 From yamahata at valinux.co.jp Thu Oct 16 19:18:02 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Fri, 17 Oct 2008 11:18:02 +0900 Subject: [PATCH 22/33] ia64/pv_ops/xen: paravirtualize DO_SAVE_MIN for xen. In-Reply-To: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> References: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224209893-2032-23-git-send-email-yamahata@valinux.co.jp> paravirtualize DO_SAVE_MIN in minstate.h for xen. 
Signed-off-by: Yaozu (Eddie) Dong Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/xen/inst.h | 2 + arch/ia64/include/asm/xen/minstate.h | 134 ++++++++++++++++++++++++++++++++++ 2 files changed, 136 insertions(+), 0 deletions(-) create mode 100644 arch/ia64/include/asm/xen/minstate.h diff --git a/arch/ia64/include/asm/xen/inst.h b/arch/ia64/include/asm/xen/inst.h index 03895e9..1e92ed0 100644 --- a/arch/ia64/include/asm/xen/inst.h +++ b/arch/ia64/include/asm/xen/inst.h @@ -22,6 +22,8 @@ #include +#define DO_SAVE_MIN XEN_DO_SAVE_MIN + #define MOV_FROM_IFA(reg) \ movl reg = XSI_IFA; \ ;; \ diff --git a/arch/ia64/include/asm/xen/minstate.h b/arch/ia64/include/asm/xen/minstate.h new file mode 100644 index 0000000..4d92d9b --- /dev/null +++ b/arch/ia64/include/asm/xen/minstate.h @@ -0,0 +1,134 @@ +/* + * DO_SAVE_MIN switches to the kernel stacks (if necessary) and saves + * the minimum state necessary that allows us to turn psr.ic back + * on. + * + * Assumed state upon entry: + * psr.ic: off + * r31: contains saved predicates (pr) + * + * Upon exit, the state is as follows: + * psr.ic: off + * r2 = points to &pt_regs.r16 + * r8 = contents of ar.ccv + * r9 = contents of ar.csd + * r10 = contents of ar.ssd + * r11 = FPSR_DEFAULT + * r12 = kernel sp (kernel virtual address) + * r13 = points to current task_struct (kernel virtual address) + * p15 = TRUE if psr.i is set in cr.ipsr + * predicate registers (other than p2, p3, and p15), b6, r3, r14, r15: + * preserved + * CONFIG_XEN note: p6/p7 are not preserved + * + * Note that psr.ic is NOT turned on by this macro. This is so that + * we can pass interruption state as arguments to a handler. + */ +#define XEN_DO_SAVE_MIN(__COVER,SAVE_IFS,EXTRA,WORKAROUND) \ + mov r16=IA64_KR(CURRENT); /* M */ \ + mov r27=ar.rsc; /* M */ \ + mov r20=r1; /* A */ \ + mov r25=ar.unat; /* M */ \ + MOV_FROM_IPSR(p0,r29); /* M */ \ + MOV_FROM_IIP(r28); /* M */ \ + mov r21=ar.fpsr; /* M */ \ + mov r26=ar.pfs; /* I */ \ + __COVER; /* B;; (or nothing) */ \ + adds r16=IA64_TASK_THREAD_ON_USTACK_OFFSET,r16; \ + ;; \ + ld1 r17=[r16]; /* load current->thread.on_ustack flag */ \ + st1 [r16]=r0; /* clear current->thread.on_ustack flag */ \ + adds r1=-IA64_TASK_THREAD_ON_USTACK_OFFSET,r16 \ + /* switch from user to kernel RBS: */ \ + ;; \ + invala; /* M */ \ + /* SAVE_IFS;*/ /* see xen special handling below */ \ + cmp.eq pKStk,pUStk=r0,r17; /* are we in kernel mode already? 
*/ \ + ;; \ +(pUStk) mov ar.rsc=0; /* set enforced lazy mode, pl 0, little-endian, loadrs=0 */ \ + ;; \ +(pUStk) mov.m r24=ar.rnat; \ +(pUStk) addl r22=IA64_RBS_OFFSET,r1; /* compute base of RBS */ \ +(pKStk) mov r1=sp; /* get sp */ \ + ;; \ +(pUStk) lfetch.fault.excl.nt1 [r22]; \ +(pUStk) addl r1=IA64_STK_OFFSET-IA64_PT_REGS_SIZE,r1; /* compute base of memory stack */ \ +(pUStk) mov r23=ar.bspstore; /* save ar.bspstore */ \ + ;; \ +(pUStk) mov ar.bspstore=r22; /* switch to kernel RBS */ \ +(pKStk) addl r1=-IA64_PT_REGS_SIZE,r1; /* if in kernel mode, use sp (r12) */ \ + ;; \ +(pUStk) mov r18=ar.bsp; \ +(pUStk) mov ar.rsc=0x3; /* set eager mode, pl 0, little-endian, loadrs=0 */ \ + adds r17=2*L1_CACHE_BYTES,r1; /* really: biggest cache-line size */ \ + adds r16=PT(CR_IPSR),r1; \ + ;; \ + lfetch.fault.excl.nt1 [r17],L1_CACHE_BYTES; \ + st8 [r16]=r29; /* save cr.ipsr */ \ + ;; \ + lfetch.fault.excl.nt1 [r17]; \ + tbit.nz p15,p0=r29,IA64_PSR_I_BIT; \ + mov r29=b0 \ + ;; \ + WORKAROUND; \ + adds r16=PT(R8),r1; /* initialize first base pointer */ \ + adds r17=PT(R9),r1; /* initialize second base pointer */ \ +(pKStk) mov r18=r0; /* make sure r18 isn't NaT */ \ + ;; \ +.mem.offset 0,0; st8.spill [r16]=r8,16; \ +.mem.offset 8,0; st8.spill [r17]=r9,16; \ + ;; \ +.mem.offset 0,0; st8.spill [r16]=r10,24; \ + movl r8=XSI_PRECOVER_IFS; \ +.mem.offset 8,0; st8.spill [r17]=r11,24; \ + ;; \ + /* xen special handling for possibly lazy cover */ \ + /* SAVE_MIN case in dispatch_ia32_handler: mov r30=r0 */ \ + ld8 r30=[r8]; \ +(pUStk) sub r18=r18,r22; /* r18=RSE.ndirty*8 */ \ + st8 [r16]=r28,16; /* save cr.iip */ \ + ;; \ + st8 [r17]=r30,16; /* save cr.ifs */ \ + mov r8=ar.ccv; \ + mov r9=ar.csd; \ + mov r10=ar.ssd; \ + movl r11=FPSR_DEFAULT; /* L-unit */ \ + ;; \ + st8 [r16]=r25,16; /* save ar.unat */ \ + st8 [r17]=r26,16; /* save ar.pfs */ \ + shl r18=r18,16; /* compute ar.rsc to be used for "loadrs" */ \ + ;; \ + st8 [r16]=r27,16; /* save ar.rsc */ \ +(pUStk) st8 [r17]=r24,16; /* save ar.rnat */ \ +(pKStk) adds r17=16,r17; /* skip over ar_rnat field */ \ + ;; /* avoid RAW on r16 & r17 */ \ +(pUStk) st8 [r16]=r23,16; /* save ar.bspstore */ \ + st8 [r17]=r31,16; /* save predicates */ \ +(pKStk) adds r16=16,r16; /* skip over ar_bspstore field */ \ + ;; \ + st8 [r16]=r29,16; /* save b0 */ \ + st8 [r17]=r18,16; /* save ar.rsc value for "loadrs" */ \ + cmp.eq pNonSys,pSys=r0,r0 /* initialize pSys=0, pNonSys=1 */ \ + ;; \ +.mem.offset 0,0; st8.spill [r16]=r20,16; /* save original r1 */ \ +.mem.offset 8,0; st8.spill [r17]=r12,16; \ + adds r12=-16,r1; /* switch to kernel memory stack (with 16 bytes of scratch) */ \ + ;; \ +.mem.offset 0,0; st8.spill [r16]=r13,16; \ +.mem.offset 8,0; st8.spill [r17]=r21,16; /* save ar.fpsr */ \ + mov r13=IA64_KR(CURRENT); /* establish `current' */ \ + ;; \ +.mem.offset 0,0; st8.spill [r16]=r15,16; \ +.mem.offset 8,0; st8.spill [r17]=r14,16; \ + ;; \ +.mem.offset 0,0; st8.spill [r16]=r2,16; \ +.mem.offset 8,0; st8.spill [r17]=r3,16; \ + ACCOUNT_GET_STAMP \ + adds r2=IA64_PT_REGS_R16_OFFSET,r1; \ + ;; \ + EXTRA; \ + movl r1=__gp; /* establish kernel global pointer */ \ + ;; \ + ACCOUNT_SYS_ENTER \ + BSW_1(r3,r14); /* switch back to bank 1 (must be last in insn group) */ \ + ;; -- 1.6.0.2 From randy.dunlap at oracle.com Thu Oct 16 21:45:32 2008 From: randy.dunlap at oracle.com (Randy Dunlap) Date: Thu, 16 Oct 2008 21:45:32 -0700 Subject: [PATCH 32/33] ia64/xen: a recipe for using xen/ia64 with pv_ops. 
In-Reply-To: <1224209893-2032-33-git-send-email-yamahata@valinux.co.jp> References: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> <1224209893-2032-33-git-send-email-yamahata@valinux.co.jp> Message-ID: <20081016214532.7fa106a2.randy.dunlap@oracle.com> On Fri, 17 Oct 2008 11:18:12 +0900 Isaku Yamahata wrote: > diff --git a/Documentation/ia64/xen.txt b/Documentation/ia64/xen.txt > new file mode 100644 > index 0000000..a5c6993 > --- /dev/null > +++ b/Documentation/ia64/xen.txt > @@ -0,0 +1,183 @@ > + Recipe for getting/building/running Xen/ia64 with pv_ops > + -------------------------------------------------------- > + > +This recipe discribes how to get xen-ia64 source and build it, describes > +and run domU with pv_ops. > + > +=========== > +Requirement Requirements (?) > +=========== > + > + - python > + - mercurial > + it (aka "hg") is a open-source source code an > + management software. See the below. > + http://www.selenic.com/mercurial/wiki/ > + - git > + - bridge-utils > + > +================================= > +Getting and Building Xen and Dom0 > +================================= > + > + My enviroment is; My environment is: > + Machine : Tiger4 > + Domain0 OS : RHEL5 > + DomainU OS : RHEL5 > + > + 1. Download source > + # hg clone http://xenbits.xensource.com/ext/ia64/xen-unstable.hg > + # cd xen-unstable.hg > + # hg clone http://xenbits.xensource.com/ext/ia64/linux-2.6.18-xen.hg > + > + 2. # make world > + > + 3. # make install-tools > + > + 4. copy kernels and xen > + # cp xen/xen.gz /boot/efi/efi/redhat/ > + # cp build-linux-2.6.18-xen_ia64/vmlinux.gz \ > + /boot/efi/efi/redhat/vmlinuz-2.6.18.8-xen > + > + 5. make initrd for Dom0/DomU > + # make -C linux-2.6.18-xen.hg ARCH=ia64 modules_install \ > + O=$(/bin/pwd)/build-linux-2.6.18-xen_ia64 > + # mkinitrd -f /boot/efi/efi/redhat/initrd-2.6.18.8-xen.img \ > + 2.6.18.8-xen --builtin mptspi --builtin mptbase \ > + --builtin mptscsih --builtin uhci-hcd --builtin ohci-hcd \ > + --builtin ehci-hcd > + > +================================ > +Making a disk image for guest OS > +================================ > + > + 1. make file > + # dd if=/dev/zero of=/root/rhel5.img bs=1M seek=4096 count=0 > + # mke2fs -F -j /root/rhel5.img > + # mount -o loop /root/rhel5.img /mnt > + # cp -ax /{dev,var,etc,usr,bin,sbin,lib} /mnt > + # mkdir /mnt/{root,proc,sys,home,tmp} > + > + Note: You may miss some device files. If so, please create them > + with mknod. Or you can use tar intead of cp. > + > + 2. modify DomU's fstab > + # vi /mnt/etc/fstab > + /dev/xvda1 / ext3 defaults 1 1 > + none /dev/pts devpts gid=5,mode=620 0 0 > + none /dev/shm tmpfs defaults 0 0 > + none /proc proc defaults 0 0 > + none /sys sysfs defaults 0 0 > + > + 3. modify inittab > + set runlevel to 3 to avoid X trying to start > + # vi /mnt/etc/inittab > + id:3:initdefault: > + Start a getty on the hvc0 console > + X0:2345:respawn:/sbin/mingetty hvc0 > + tty1-6 mingetty can be commented out > + > + 4. add hvc0 into /etc/securetty > + # vi /mnt/etc/securetty (add hvc0) > + > + 5. umount > + # umount /mnt > + > +FYI, virt-manager can also make a disk image for guest OS. > +It's GUI tools and easy to make it. > + > +================== > +Boot Xen & Domain0 > +================== > + > + 1. replace elilo > + elilo of RHEL5 can boot Xen and Dom0. > + If you use old elilo (e.g RHEL4), please download from the below > + http://elilo.sourceforge.net/cgi-bin/blosxom > + and copy into /boot/efi/efi/redhat/ > + # cp elilo-3.6-ia64.efi /boot/efi/efi/redhat/elilo.efi > + > + 2. 
modify elilo.conf (like the below) > + # vi /boot/efi/efi/redhat/elilo.conf > + prompt > + timeout=20 > + default=xen > + relocatable > + > + image=vmlinuz-2.6.18.8-xen > + label=xen > + vmm=xen.gz > + initrd=initrd-2.6.18.8-xen.img > + read-only > + append=" -- rhgb root=/dev/sda2" > + > +The append options before "--" are for xen hypervisor, > +the options after "--" are for dom0. > + > +FYI, your machine may need console options like > +"com1=19200,8n1 console=vga,com1". For example, > +append="com1=19200,8n1 console=vga,com1 -- rhgb console=tty0 \ > +console=ttyS0 root=/dev/sda2" > + > +===================================== > +Getting and Building domU with pv_ops > +===================================== > + > + 1. get pv_ops tree > + # git clone http://people.valinux.co.jp/~yamahata/xen-ia64/linux-2.6-xen-ia64.git/ > + > + 2. git branch (if necessary) > + # cd linux-2.6-xen-ia64/ > + # git checkout -b your_branch origin/xen-ia64-domu-minimal-2008may19 > + (Note: The current branch is xen-ia64-domu-minimal-2008may19. > + But you would find the new branch. You can see with > + "git branch -r" to get the branch lists. > + http://people.valinux.co.jp/~yamahata/xen-ia64/for_eagl/linux-2.6-ia64-pv-ops.git/ > + is also available. The tree is based on > + git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6 test) > + > + > + 3. copy .config for pv_ops of domU > + # cp arch/ia64/configs/xen_domu_wip_defconfig .config > + > + 4. make kernel with pv_ops > + # make oldconfig > + # make > + > + 5. install the kernel and initrd > + # cp vmlinux.gz /boot/efi/efi/redhat/vmlinuz-2.6-pv_ops-xenU > + # make modules_install > + # mkinitrd -f /boot/efi/efi/redhat/initrd-2.6-pv_ops-xenU.img \ > + 2.6.26-rc3xen-ia64-08941-g1b12161 --builtin mptspi \ > + --builtin mptbase --builtin mptscsih --builtin uhci-hcd \ > + --builtin ohci-hcd --builtin ehci-hcd > + > +======================== > +Boot DomainU with pv_ops > +======================== > + > + 1. make config of DomU > + # vi /etc/xen/rhel5 > + kernel = "/boot/efi/efi/redhat/vmlinuz-2.6-pv_ops-xenU" > + ramdisk = "/boot/efi/efi/redhat/initrd-2.6-pv_ops-xenU.img" > + vcpus = 1 > + memory = 512 > + name = "rhel5" > + disk = [ 'file:/root/rhel5.img,xvda1,w' ] > + root = "/dev/xvda1 ro" > + extra= "rhgb console=hvc0" > + > + 2. After boot xen and dom0, start xend > + # /etc/init.d/xend start > + ( In the debugging case, # XEND_DEBUG=1 xend trace_start ) > + > + 3. start domU > + # xm create -c rhel5 > + > +========= > +Reference > +========= > +- Wiki of Xen/IA64 upstream merge > + http://wiki.xensource.com/xenwiki/XenIA64/UpstreamMerge > + > +Witten by Akio Takebe on 28 May 2008 Written > -- --- ~Randy From yamahata at valinux.co.jp Thu Oct 16 22:46:19 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Fri, 17 Oct 2008 14:46:19 +0900 Subject: [PATCH 32/33] ia64/xen: a recipe for using xen/ia64 with pv_ops. In-Reply-To: <20081016214532.7fa106a2.randy.dunlap@oracle.com> References: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> <1224209893-2032-33-git-send-email-yamahata@valinux.co.jp> <20081016214532.7fa106a2.randy.dunlap@oracle.com> Message-ID: <20081017054619.GB17594%yamahata@valinux.co.jp> Thank you for the review. Here is the updated one. I also found one more typo with a spell checker. 
From anirban.chakraborty at qlogic.com Thu Oct 16 22:48:56 2008 From: anirban.chakraborty at qlogic.com (Anirban Chakraborty) Date: Thu, 16 Oct 2008 22:48:56 -0700 Subject: [PATCH 6/6 v3] PCI: document the change In-Reply-To: <20081014044626.GB25780@parisc-linux.org> References: <20081001160706.GI13822@parisc-linux.org> <08DF4D958216244799FC84F3514D70F00235CC69@pdsmsx415.ccr.corp.intel.com> <20081014010827.GX25780@parisc-linux.org> <08DF4D958216244799FC84F3514D70F00235CE27@pdsmsx415.ccr.corp.intel.com> <20081014021435.GA1482@yzhao12-linux.sh.intel.com> <20081014040105.GA25780@parisc-linux.org> <08DF4D958216244799FC84F3514D70F00235CF5E@pdsmsx415.ccr.corp.intel.com> <20081014044626.GB25780@parisc-linux.org> Message-ID: <5F847B2D-D033-4A40-B132-A900E28EB36A@qlogic.com> On Oct 13, 2008, at 9:46 PM, Matthew Wilcox wrote: > On Tue, Oct 14, 2008 at 12:18:40PM +0800, Dong, Eddie wrote: >> Matthew Wilcox wrote: >>> On Tue, Oct 14, 2008 at 10:14:35AM +0800, Yu Zhao wrote: >>>> As Eddie said, we have two problems here: >>>> 1) User has to set device specific parameters of a VF >>>> when he wants to use this VF with KVM (assign this >>>> device to KVM guest). In this case, >>>> VF driver is not loaded in the host environment. So >>>> operations which >>>> are implemented as driver callback (e.g. >>>> set_mac_address()) are not supported. >>> >>> I suspect what you want to do is create, then configure >>> the device in the host, then assign it to the guest. >> >> That is not true. Rememver the created VFs will be destroyed no >> matter >> for PF power event or error recovery conducted reset. >> So what we want is: >> >> Config, create, assign, and then deassign and destroy and then >> recreate... > > Yes, but my point is this all happens in the _host_, not in the > _guest_. > >> Sorry can u explain a little bit more? The SR-IOV patch won't define >> what kind of entries should be created or not, we leave network >> subsystem to decide what to do. Same for disk subsstem etc. > > No entries should be created. This needs to be not SR-IOV specific. I think we need to cover both the scenarios here, virtualization and non virtualization. In the absence of virtualization, the VF and PF driver should be identical. In this context, how does the PF driver allocates a VF? Is dynamic allocation of VFs possible, or does it have to allocate all the VFs that the device supports when the PF driver loads? Also, will the probe function be called for the VFs, or does the PF driver handle only the probe for the physical function? In virtualization context things get bit more complex as the the VF driver in guest would like to treat the VF as a physical function but that may not be possible from the device perspective as the control registers may well be shared between VF and PF. I would think that the VF allocation is the job of SR PCIM. PCIM may well ask the PF driver to configure a VF upon user request. Thanks much, Anirban Chakraborty > -- > Matthew Wilcox Intel Open Source Technology Centre > "Bill, look, we understand that you're interested in selling us this > operating system, but compare it to ours. We can't possibly take such > a retrograde step." 
> -- > To unsubscribe from this list: send the line "unsubscribe linux- > kernel" in > the body of a message to majordomo at vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ From ryov at valinux.co.jp Fri Oct 17 00:09:50 2008 From: ryov at valinux.co.jp (Ryo Tsuruta) Date: Fri, 17 Oct 2008 16:09:50 +0900 (JST) Subject: [PATCH 0/2] dm-ioband: I/O bandwidth controller v1.8.0: Introduction Message-ID: <20081017.160950.71109894.ryov@valinux.co.jp> Hi Alasdair and all, This is the dm-ioband version 1.8.0 release. Dm-ioband is an I/O bandwidth controller implemented as a device-mapper driver, which gives specified bandwidth to each job running on the same physical device. This release is a minor bug fix and confirmed running on the latest stable kernel 2.6.27.1. - Can be applied to the kernel 2.6.27.1 and 2.6.27-rc5-mm1. - Changes from 1.7.0 (posted on Oct 3, 2008): - Fix a minor bug in io_limit setting that causes dm-ioband to stop issuing I/O requests when a large value is set to io_limit. Alasdair, could you please review this patch and give me any comments? Thanks, Ryo Tsuruta From ryov at valinux.co.jp Fri Oct 17 00:10:29 2008 From: ryov at valinux.co.jp (Ryo Tsuruta) Date: Fri, 17 Oct 2008 16:10:29 +0900 (JST) Subject: [PATCH 1/2] dm-ioband: I/O bandwidth controller v1.8.0: Source code and patch In-Reply-To: <20081017.160950.71109894.ryov@valinux.co.jp> References: <20081017.160950.71109894.ryov@valinux.co.jp> Message-ID: <20081017.161029.104053860.ryov@valinux.co.jp> This patch is the dm-ioband version 1.8.0 release. Signed-off-by: Ryo Tsuruta Signed-off-by: Hirokazu Takahashi diff -uprN linux-2.6.27.1.orig/drivers/md/Kconfig linux-2.6.27.1/drivers/md/Kconfig --- linux-2.6.27.1.orig/drivers/md/Kconfig 2008-10-16 08:02:53.000000000 +0900 +++ linux-2.6.27.1/drivers/md/Kconfig 2008-10-17 12:33:13.000000000 +0900 @@ -275,4 +275,17 @@ config DM_UEVENT ---help--- Generate udev events for DM events. +config DM_IOBAND + tristate "I/O bandwidth control (EXPERIMENTAL)" + depends on BLK_DEV_DM && EXPERIMENTAL + ---help--- + This device-mapper target allows to define how the + available bandwidth of a storage device should be + shared between processes, cgroups, the partitions or the LUNs. + + Information on how to use dm-ioband is available in: + . + + If unsure, say N. 
+ endif # MD diff -uprN linux-2.6.27.1.orig/drivers/md/Makefile linux-2.6.27.1/drivers/md/Makefile --- linux-2.6.27.1.orig/drivers/md/Makefile 2008-10-16 08:02:53.000000000 +0900 +++ linux-2.6.27.1/drivers/md/Makefile 2008-10-17 12:33:13.000000000 +0900 @@ -7,6 +7,7 @@ dm-mod-objs := dm.o dm-table.o dm-target dm-multipath-objs := dm-path-selector.o dm-mpath.o dm-snapshot-objs := dm-snap.o dm-exception-store.o dm-mirror-objs := dm-raid1.o +dm-ioband-objs := dm-ioband-ctl.o dm-ioband-policy.o dm-ioband-type.o md-mod-objs := md.o bitmap.o raid456-objs := raid5.o raid6algos.o raid6recov.o raid6tables.o \ raid6int1.o raid6int2.o raid6int4.o \ @@ -36,6 +37,7 @@ obj-$(CONFIG_DM_MULTIPATH) += dm-multipa obj-$(CONFIG_DM_SNAPSHOT) += dm-snapshot.o obj-$(CONFIG_DM_MIRROR) += dm-mirror.o dm-log.o obj-$(CONFIG_DM_ZERO) += dm-zero.o +obj-$(CONFIG_DM_IOBAND) += dm-ioband.o quiet_cmd_unroll = UNROLL $@ cmd_unroll = $(PERL) $(srctree)/$(src)/unroll.pl $(UNROLL) \ diff -uprN linux-2.6.27.1.orig/drivers/md/dm-ioband-ctl.c linux-2.6.27.1/drivers/md/dm-ioband-ctl.c --- linux-2.6.27.1.orig/drivers/md/dm-ioband-ctl.c 1970-01-01 09:00:00.000000000 +0900 +++ linux-2.6.27.1/drivers/md/dm-ioband-ctl.c 2008-10-17 12:33:13.000000000 +0900 @@ -0,0 +1,1328 @@ +/* + * Copyright (C) 2008 VA Linux Systems Japan K.K. + * Authors: Hirokazu Takahashi + * Ryo Tsuruta + * + * I/O bandwidth control + * + * This file is released under the GPL. + */ +#include +#include +#include +#include +#include +#include +#include +#include "dm.h" +#include "dm-bio-list.h" +#include "dm-ioband.h" + +#define DM_MSG_PREFIX "ioband" +#define POLICY_PARAM_START 6 +#define POLICY_PARAM_DELIM "=:," + +static LIST_HEAD(ioband_device_list); +/* to protect ioband_device_list */ +static DEFINE_SPINLOCK(ioband_devicelist_lock); + +static void suspend_ioband_device(struct ioband_device *, unsigned long, int); +static void resume_ioband_device(struct ioband_device *); +static void ioband_conduct(struct work_struct *); +static void ioband_hold_bio(struct ioband_group *, struct bio *); +static struct bio *ioband_pop_bio(struct ioband_group *); +static int ioband_set_param(struct ioband_group *, char *, char *); +static int ioband_group_attach(struct ioband_group *, int, char *); +static int ioband_group_type_select(struct ioband_group *, char *); + +long ioband_debug; /* just for debugging */ + +static void do_nothing(void) {} + +static int policy_init(struct ioband_device *dp, char *name, + int argc, char **argv) +{ + struct policy_type *p; + struct ioband_group *gp; + unsigned long flags; + int r; + + for (p = dm_ioband_policy_type; p->p_name; p++) { + if (!strcmp(name, p->p_name)) + break; + } + if (!p->p_name) + return -EINVAL; + + spin_lock_irqsave(&dp->g_lock, flags); + if (dp->g_policy == p) { + /* do nothing if the same policy is already set */ + spin_unlock_irqrestore(&dp->g_lock, flags); + return 0; + } + + suspend_ioband_device(dp, flags, 1); + list_for_each_entry(gp, &dp->g_groups, c_list) + dp->g_group_dtr(gp); + + /* switch to the new policy */ + dp->g_policy = p; + r = p->p_policy_init(dp, argc, argv); + if (!dp->g_hold_bio) + dp->g_hold_bio = ioband_hold_bio; + if (!dp->g_pop_bio) + dp->g_pop_bio = ioband_pop_bio; + + list_for_each_entry(gp, &dp->g_groups, c_list) + dp->g_group_ctr(gp, NULL); + resume_ioband_device(dp); + spin_unlock_irqrestore(&dp->g_lock, flags); + return r; +} + +static struct ioband_device *alloc_ioband_device(char *name, + int io_throttle, int io_limit) + +{ + struct ioband_device *dp, *new; + unsigned long flags; + + 
new = kzalloc(sizeof(struct ioband_device), GFP_KERNEL); + if (!new) + return NULL; + + spin_lock_irqsave(&ioband_devicelist_lock, flags); + list_for_each_entry(dp, &ioband_device_list, g_list) { + if (!strcmp(dp->g_name, name)) { + dp->g_ref++; + spin_unlock_irqrestore(&ioband_devicelist_lock, flags); + kfree(new); + return dp; + } + } + + /* + * Prepare its own workqueue as generic_make_request() may + * potentially block the workqueue when submitting BIOs. + */ + new->g_ioband_wq = create_workqueue("kioband"); + if (!new->g_ioband_wq) { + spin_unlock_irqrestore(&ioband_devicelist_lock, flags); + kfree(new); + return NULL; + } + + INIT_DELAYED_WORK(&new->g_conductor, ioband_conduct); + INIT_LIST_HEAD(&new->g_groups); + INIT_LIST_HEAD(&new->g_list); + spin_lock_init(&new->g_lock); + mutex_init(&new->g_lock_device); + bio_list_init(&new->g_urgent_bios); + new->g_io_throttle = io_throttle; + new->g_io_limit[0] = io_limit; + new->g_io_limit[1] = io_limit; + new->g_issued[0] = 0; + new->g_issued[1] = 0; + new->g_blocked = 0; + new->g_ref = 1; + new->g_flags = 0; + strlcpy(new->g_name, name, sizeof(new->g_name)); + new->g_policy = NULL; + new->g_hold_bio = NULL; + new->g_pop_bio = NULL; + init_waitqueue_head(&new->g_waitq); + init_waitqueue_head(&new->g_waitq_suspend); + init_waitqueue_head(&new->g_waitq_flush); + list_add_tail(&new->g_list, &ioband_device_list); + + spin_unlock_irqrestore(&ioband_devicelist_lock, flags); + return new; +} + +static void release_ioband_device(struct ioband_device *dp) +{ + unsigned long flags; + + spin_lock_irqsave(&ioband_devicelist_lock, flags); + dp->g_ref--; + if (dp->g_ref > 0) { + spin_unlock_irqrestore(&ioband_devicelist_lock, flags); + return; + } + list_del(&dp->g_list); + spin_unlock_irqrestore(&ioband_devicelist_lock, flags); + destroy_workqueue(dp->g_ioband_wq); + kfree(dp); +} + +static int is_ioband_device_flushed(struct ioband_device *dp, + int wait_completion) +{ + struct ioband_group *gp; + + if (wait_completion && dp->g_issued[0] + dp->g_issued[1] > 0) + return 0; + if (dp->g_blocked || waitqueue_active(&dp->g_waitq)) + return 0; + list_for_each_entry(gp, &dp->g_groups, c_list) + if (waitqueue_active(&gp->c_waitq)) + return 0; + return 1; +} + +static void suspend_ioband_device(struct ioband_device *dp, + unsigned long flags, int wait_completion) +{ + struct ioband_group *gp; + + /* block incoming bios */ + set_device_suspended(dp); + + /* wake up all blocked processes and go down all ioband groups */ + wake_up_all(&dp->g_waitq); + list_for_each_entry(gp, &dp->g_groups, c_list) { + if (!is_group_down(gp)) { + set_group_down(gp); + set_group_need_up(gp); + } + wake_up_all(&gp->c_waitq); + } + + /* flush the already mapped bios */ + spin_unlock_irqrestore(&dp->g_lock, flags); + queue_delayed_work(dp->g_ioband_wq, &dp->g_conductor, 0); + flush_workqueue(dp->g_ioband_wq); + + /* wait for all processes to wake up and bios to release */ + spin_lock_irqsave(&dp->g_lock, flags); + wait_event_lock_irq(dp->g_waitq_flush, + is_ioband_device_flushed(dp, wait_completion), + dp->g_lock, do_nothing()); +} + +static void resume_ioband_device(struct ioband_device *dp) +{ + struct ioband_group *gp; + + /* go up ioband groups */ + list_for_each_entry(gp, &dp->g_groups, c_list) { + if (group_need_up(gp)) { + clear_group_need_up(gp); + clear_group_down(gp); + } + } + + /* accept incoming bios */ + wake_up_all(&dp->g_waitq_suspend); + clear_device_suspended(dp); +} + +static struct ioband_group *ioband_group_find( + struct ioband_group *head, int id) +{ + struct 
rb_node *node = head->c_group_root.rb_node; + + while (node) { + struct ioband_group *p = + container_of(node, struct ioband_group, c_group_node); + + if (p->c_id == id || id == IOBAND_ID_ANY) + return p; + node = (id < p->c_id) ? node->rb_left : node->rb_right; + } + return NULL; +} + +static void ioband_group_add_node(struct rb_root *root, + struct ioband_group *gp) +{ + struct rb_node **new = &root->rb_node, *parent = NULL; + struct ioband_group *p; + + while (*new) { + p = container_of(*new, struct ioband_group, c_group_node); + parent = *new; + new = (gp->c_id < p->c_id) ? + &(*new)->rb_left : &(*new)->rb_right; + } + + rb_link_node(&gp->c_group_node, parent, new); + rb_insert_color(&gp->c_group_node, root); +} + +static int ioband_group_init(struct ioband_group *gp, + struct ioband_group *head, struct ioband_device *dp, int id, char *param) +{ + unsigned long flags; + int r; + + INIT_LIST_HEAD(&gp->c_list); + bio_list_init(&gp->c_blocked_bios); + bio_list_init(&gp->c_prio_bios); + gp->c_id = id; /* should be verified */ + gp->c_blocked = 0; + gp->c_prio_blocked = 0; + memset(gp->c_stat, 0, sizeof(gp->c_stat)); + init_waitqueue_head(&gp->c_waitq); + gp->c_flags = 0; + gp->c_group_root = RB_ROOT; + gp->c_banddev = dp; + + spin_lock_irqsave(&dp->g_lock, flags); + if (head && ioband_group_find(head, id)) { + spin_unlock_irqrestore(&dp->g_lock, flags); + DMWARN("ioband_group: id=%d already exists.", id); + return -EEXIST; + } + + list_add_tail(&gp->c_list, &dp->g_groups); + + r = dp->g_group_ctr(gp, param); + if (r) { + list_del(&gp->c_list); + spin_unlock_irqrestore(&dp->g_lock, flags); + return r; + } + + if (head) { + ioband_group_add_node(&head->c_group_root, gp); + gp->c_dev = head->c_dev; + gp->c_target = head->c_target; + } + + spin_unlock_irqrestore(&dp->g_lock, flags); + + return 0; +} + +static void ioband_group_release(struct ioband_group *head, + struct ioband_group *gp) +{ + struct ioband_device *dp = gp->c_banddev; + + list_del(&gp->c_list); + if (head) + rb_erase(&gp->c_group_node, &head->c_group_root); + dp->g_group_dtr(gp); + kfree(gp); +} + +static void ioband_group_destroy_all(struct ioband_group *gp) +{ + struct ioband_device *dp = gp->c_banddev; + struct ioband_group *group; + unsigned long flags; + + spin_lock_irqsave(&dp->g_lock, flags); + while ((group = ioband_group_find(gp, IOBAND_ID_ANY))) + ioband_group_release(gp, group); + ioband_group_release(NULL, gp); + spin_unlock_irqrestore(&dp->g_lock, flags); +} + +static void ioband_group_stop_all(struct ioband_group *head, int suspend) +{ + struct ioband_device *dp = head->c_banddev; + struct ioband_group *p; + struct rb_node *node; + unsigned long flags; + + spin_lock_irqsave(&dp->g_lock, flags); + for (node = rb_first(&head->c_group_root); node; node = rb_next(node)) { + p = rb_entry(node, struct ioband_group, c_group_node); + set_group_down(p); + if (suspend) { + set_group_suspended(p); + dprintk(KERN_ERR "ioband suspend: gp(%p)\n", p); + } + } + set_group_down(head); + if (suspend) { + set_group_suspended(head); + dprintk(KERN_ERR "ioband suspend: gp(%p)\n", head); + } + spin_unlock_irqrestore(&dp->g_lock, flags); + queue_delayed_work(dp->g_ioband_wq, &dp->g_conductor, 0); + flush_workqueue(dp->g_ioband_wq); +} + +static void ioband_group_resume_all(struct ioband_group *head) +{ + struct ioband_device *dp = head->c_banddev; + struct ioband_group *p; + struct rb_node *node; + unsigned long flags; + + spin_lock_irqsave(&dp->g_lock, flags); + for (node = rb_first(&head->c_group_root); node; + node = rb_next(node)) { 
+ p = rb_entry(node, struct ioband_group, c_group_node); + clear_group_down(p); + clear_group_suspended(p); + dprintk(KERN_ERR "ioband resume: gp(%p)\n", p); + } + clear_group_down(head); + clear_group_suspended(head); + dprintk(KERN_ERR "ioband resume: gp(%p)\n", head); + spin_unlock_irqrestore(&dp->g_lock, flags); +} + +static int split_string(char *s, long *id, char **v) +{ + char *p, *q; + int r = 0; + + *id = IOBAND_ID_ANY; + p = strsep(&s, POLICY_PARAM_DELIM); + q = strsep(&s, POLICY_PARAM_DELIM); + if (!q) { + *v = p; + } else { + r = strict_strtol(p, 0, id); + *v = q; + } + return r; +} + +/* + * Create a new band device: + * parameters: + * + */ +static int ioband_ctr(struct dm_target *ti, unsigned int argc, char **argv) +{ + struct ioband_group *gp; + struct ioband_device *dp; + struct dm_dev *dev; + int io_throttle; + int io_limit; + int i, r, start; + long val, id; + char *param; + + if (argc < POLICY_PARAM_START) { + ti->error = "Requires " __stringify(POLICY_PARAM_START) + " or more arguments"; + return -EINVAL; + } + + if (strlen(argv[1]) > IOBAND_NAME_MAX) { + ti->error = "Ioband device name is too long"; + return -EINVAL; + } + dprintk(KERN_ERR "ioband_ctr ioband device name:%s\n", argv[1]); + + r = strict_strtol(argv[2], 0, &val); + if (r || val < 0) { + ti->error = "Invalid io_throttle"; + return -EINVAL; + } + io_throttle = (val == 0) ? DEFAULT_IO_THROTTLE : val; + + r = strict_strtol(argv[3], 0, &val); + if (r || val < 0) { + ti->error = "Invalid io_limit"; + return -EINVAL; + } + io_limit = val; + + r = dm_get_device(ti, argv[0], 0, ti->len, + dm_table_get_mode(ti->table), &dev); + if (r) { + ti->error = "Device lookup failed"; + return r; + } + + if (io_limit == 0) { + struct request_queue *q; + + q = bdev_get_queue(dev->bdev); + if (!q) { + ti->error = "Can't get queue size"; + r = -ENXIO; + goto release_dm_device; + } + dprintk(KERN_ERR "ioband_ctr nr_requests:%lu\n", + q->nr_requests); + io_limit = q->nr_requests; + } + + if (io_limit < io_throttle) + io_limit = io_throttle; + dprintk(KERN_ERR "ioband_ctr io_throttle:%d io_limit:%d\n", + io_throttle, io_limit); + + dp = alloc_ioband_device(argv[1], io_throttle, io_limit); + if (!dp) { + ti->error = "Cannot create ioband device"; + r = -EINVAL; + goto release_dm_device; + } + + mutex_lock(&dp->g_lock_device); + r = policy_init(dp, argv[POLICY_PARAM_START - 1], + argc - POLICY_PARAM_START, &argv[POLICY_PARAM_START]); + if (r) { + ti->error = "Invalid policy parameter"; + goto release_ioband_device; + } + + gp = kzalloc(sizeof(struct ioband_group), GFP_KERNEL); + if (!gp) { + ti->error = "Cannot allocate memory for ioband group"; + r = -ENOMEM; + goto release_ioband_device; + } + + ti->private = gp; + gp->c_target = ti; + gp->c_dev = dev; + + /* Find a default group parameter */ + for (start = POLICY_PARAM_START; start < argc; start++) + if (argv[start][0] == ':') + break; + param = (start < argc) ? 
&argv[start][1] : NULL; + + /* Create a default ioband group */ + r = ioband_group_init(gp, NULL, dp, IOBAND_ID_ANY, param); + if (r) { + kfree(gp); + ti->error = "Cannot create default ioband group"; + goto release_ioband_device; + } + + r = ioband_group_type_select(gp, argv[4]); + if (r) { + ti->error = "Cannot set ioband group type"; + goto release_ioband_group; + } + + /* Create sub ioband groups */ + for (i = start + 1; i < argc; i++) { + r = split_string(argv[i], &id, ¶m); + if (r) { + ti->error = "Invalid ioband group parameter"; + goto release_ioband_group; + } + r = ioband_group_attach(gp, id, param); + if (r) { + ti->error = "Cannot create ioband group"; + goto release_ioband_group; + } + } + mutex_unlock(&dp->g_lock_device); + return 0; + +release_ioband_group: + ioband_group_destroy_all(gp); +release_ioband_device: + mutex_unlock(&dp->g_lock_device); + release_ioband_device(dp); +release_dm_device: + dm_put_device(ti, dev); + return r; +} + +static void ioband_dtr(struct dm_target *ti) +{ + struct ioband_group *gp = ti->private; + struct ioband_device *dp = gp->c_banddev; + + mutex_lock(&dp->g_lock_device); + ioband_group_stop_all(gp, 0); + cancel_delayed_work_sync(&dp->g_conductor); + dm_put_device(ti, gp->c_dev); + ioband_group_destroy_all(gp); + mutex_unlock(&dp->g_lock_device); + release_ioband_device(dp); +} + +static void ioband_hold_bio(struct ioband_group *gp, struct bio *bio) +{ + /* Todo: The list should be split into a read list and a write list */ + bio_list_add(&gp->c_blocked_bios, bio); +} + +static struct bio *ioband_pop_bio(struct ioband_group *gp) +{ + return bio_list_pop(&gp->c_blocked_bios); +} + +static int is_urgent_bio(struct bio *bio) +{ + struct page *page = bio_iovec_idx(bio, 0)->bv_page; + /* + * ToDo: A new flag should be added to struct bio, which indicates + * it contains urgent I/O requests. + */ + if (!PageReclaim(page)) + return 0; + if (PageSwapCache(page)) + return 2; + return 1; +} + +static inline int device_should_block(struct ioband_group *gp) +{ + struct ioband_device *dp = gp->c_banddev; + + if (is_group_down(gp)) + return 0; + if (is_device_blocked(dp)) + return 1; + if (dp->g_blocked >= dp->g_io_limit[0] + dp->g_io_limit[1]) { + set_device_blocked(dp); + return 1; + } + return 0; +} + +static inline int group_should_block(struct ioband_group *gp) +{ + struct ioband_device *dp = gp->c_banddev; + + if (is_group_down(gp)) + return 0; + if (is_group_blocked(gp)) + return 1; + if (dp->g_should_block(gp)) { + set_group_blocked(gp); + return 1; + } + return 0; +} + +static void prevent_burst_bios(struct ioband_group *gp, struct bio *bio) +{ + struct ioband_device *dp = gp->c_banddev; + + if (current->flags & PF_KTHREAD || is_urgent_bio(bio)) { + /* + * Kernel threads shouldn't be blocked easily since each of + * them may handle BIOs for several groups on several + * partitions. 
+ */ + wait_event_lock_irq(dp->g_waitq, !device_should_block(gp), + dp->g_lock, do_nothing()); + } else { + wait_event_lock_irq(gp->c_waitq, !group_should_block(gp), + dp->g_lock, do_nothing()); + } +} + +static inline int should_pushback_bio(struct ioband_group *gp) +{ + return is_group_suspended(gp) && dm_noflush_suspending(gp->c_target); +} + +static inline int prepare_to_issue(struct ioband_group *gp, struct bio *bio) +{ + struct ioband_device *dp = gp->c_banddev; + + dp->g_issued[bio_data_dir(bio)]++; + return dp->g_prepare_bio(gp, bio, 0); +} + +static inline int room_for_bio(struct ioband_device *dp) +{ + return dp->g_issued[0] < dp->g_io_limit[0] + || dp->g_issued[1] < dp->g_io_limit[1]; +} + +static void hold_bio(struct ioband_group *gp, struct bio *bio) +{ + struct ioband_device *dp = gp->c_banddev; + + dp->g_blocked++; + if (is_urgent_bio(bio)) { + /* + * ToDo: + * When barrier mode is supported, write bios sharing the same + * file system with the currnt one would be all moved + * to g_urgent_bios list. + * You don't have to care about barrier handling if the bio + * is for swapping. + */ + dp->g_prepare_bio(gp, bio, IOBAND_URGENT); + bio_list_add(&dp->g_urgent_bios, bio); + } else { + gp->c_blocked++; + dp->g_hold_bio(gp, bio); + } +} + +static inline int room_for_bio_rw(struct ioband_device *dp, int direct) +{ + return dp->g_issued[direct] < dp->g_io_limit[direct]; +} + +static void push_prio_bio(struct ioband_group *gp, struct bio *bio, int direct) +{ + if (bio_list_empty(&gp->c_prio_bios)) + set_prio_queue(gp, direct); + bio_list_add(&gp->c_prio_bios, bio); + gp->c_prio_blocked++; +} + +static struct bio *pop_prio_bio(struct ioband_group *gp) +{ + struct bio *bio = bio_list_pop(&gp->c_prio_bios); + + if (bio_list_empty(&gp->c_prio_bios)) + clear_prio_queue(gp); + + if (bio) + gp->c_prio_blocked--; + return bio; +} + +static int make_issue_list(struct ioband_group *gp, struct bio *bio, + struct bio_list *issue_list, struct bio_list *pushback_list) +{ + struct ioband_device *dp = gp->c_banddev; + + dp->g_blocked--; + gp->c_blocked--; + if (!gp->c_blocked && is_group_blocked(gp)) { + clear_group_blocked(gp); + wake_up_all(&gp->c_waitq); + } + if (should_pushback_bio(gp)) + bio_list_add(pushback_list, bio); + else { + int rw = bio_data_dir(bio); + + gp->c_stat[rw].deferred++; + gp->c_stat[rw].sectors += bio_sectors(bio); + bio_list_add(issue_list, bio); + } + return prepare_to_issue(gp, bio); +} + +static void release_urgent_bios(struct ioband_device *dp, + struct bio_list *issue_list, struct bio_list *pushback_list) +{ + struct bio *bio; + + if (bio_list_empty(&dp->g_urgent_bios)) + return; + while (room_for_bio_rw(dp, 1)) { + bio = bio_list_pop(&dp->g_urgent_bios); + if (!bio) + return; + dp->g_blocked--; + dp->g_issued[bio_data_dir(bio)]++; + bio_list_add(issue_list, bio); + } +} + +static int release_prio_bios(struct ioband_group *gp, + struct bio_list *issue_list, struct bio_list *pushback_list) +{ + struct ioband_device *dp = gp->c_banddev; + struct bio *bio; + int direct; + int ret; + + if (bio_list_empty(&gp->c_prio_bios)) + return R_OK; + direct = prio_queue_direct(gp); + while (gp->c_prio_blocked) { + if (!dp->g_can_submit(gp)) + return R_BLOCK; + if (!room_for_bio_rw(dp, direct)) + return R_OK; + bio = pop_prio_bio(gp); + if (!bio) + return R_OK; + ret = make_issue_list(gp, bio, issue_list, pushback_list); + if (ret) + return ret; + } + return R_OK; +} + +static int release_norm_bios(struct ioband_group *gp, + struct bio_list *issue_list, struct bio_list 
*pushback_list) +{ + struct ioband_device *dp = gp->c_banddev; + struct bio *bio; + int direct; + int ret; + + while (gp->c_blocked - gp->c_prio_blocked) { + if (!dp->g_can_submit(gp)) + return R_BLOCK; + if (!room_for_bio(dp)) + return R_OK; + bio = dp->g_pop_bio(gp); + if (!bio) + return R_OK; + + direct = bio_data_dir(bio); + if (!room_for_bio_rw(dp, direct)) { + push_prio_bio(gp, bio, direct); + continue; + } + ret = make_issue_list(gp, bio, issue_list, pushback_list); + if (ret) + return ret; + } + return R_OK; +} + +static inline int release_bios(struct ioband_group *gp, + struct bio_list *issue_list, struct bio_list *pushback_list) +{ + int ret = release_prio_bios(gp, issue_list, pushback_list); + if (ret) + return ret; + return release_norm_bios(gp, issue_list, pushback_list); +} + +static struct ioband_group *ioband_group_get(struct ioband_group *head, + struct bio *bio) +{ + struct ioband_group *gp; + + if (!head->c_type->t_getid) + return head; + + gp = ioband_group_find(head, head->c_type->t_getid(bio)); + + if (!gp) + gp = head; + return gp; +} + +/* + * Start to control the bandwidth once the number of uncompleted BIOs + * exceeds the value of "io_throttle". + */ +static int ioband_map(struct dm_target *ti, struct bio *bio, + union map_info *map_context) +{ + struct ioband_group *gp = ti->private; + struct ioband_device *dp = gp->c_banddev; + unsigned long flags; + int rw; + + spin_lock_irqsave(&dp->g_lock, flags); + + /* + * The device is suspended while some of the ioband device + * configurations are being changed. + */ + if (is_device_suspended(dp)) + wait_event_lock_irq(dp->g_waitq_suspend, + !is_device_suspended(dp), dp->g_lock, do_nothing()); + + gp = ioband_group_get(gp, bio); + prevent_burst_bios(gp, bio); + if (should_pushback_bio(gp)) { + spin_unlock_irqrestore(&dp->g_lock, flags); + return DM_MAPIO_REQUEUE; + } + + bio->bi_bdev = gp->c_dev->bdev; + bio->bi_sector -= ti->begin; + rw = bio_data_dir(bio); + + if (!gp->c_blocked && room_for_bio_rw(dp, rw)) { + if (dp->g_can_submit(gp)) { + prepare_to_issue(gp, bio); + gp->c_stat[rw].immediate++; + gp->c_stat[rw].sectors += bio_sectors(bio); + spin_unlock_irqrestore(&dp->g_lock, flags); + return DM_MAPIO_REMAPPED; + } else if (!dp->g_blocked + && dp->g_issued[0] + dp->g_issued[1] == 0) { + dprintk(KERN_ERR "ioband_map: token expired " + "gp:%p bio:%p\n", gp, bio); + queue_delayed_work(dp->g_ioband_wq, + &dp->g_conductor, 1); + } + } + hold_bio(gp, bio); + spin_unlock_irqrestore(&dp->g_lock, flags); + + return DM_MAPIO_SUBMITTED; +} + +/* + * Select the best group to resubmit its BIOs. + */ +static struct ioband_group *choose_best_group(struct ioband_device *dp) +{ + struct ioband_group *gp; + struct ioband_group *best = NULL; + int highest = 0; + int pri; + + /* Todo: The algorithm should be optimized. + * It would be better to use rbtree. + */ + list_for_each_entry(gp, &dp->g_groups, c_list) { + if (!gp->c_blocked || !room_for_bio(dp)) + continue; + if (gp->c_blocked == gp->c_prio_blocked + && !room_for_bio_rw(dp, prio_queue_direct(gp))) { + continue; + } + pri = dp->g_can_submit(gp); + if (pri > highest) { + highest = pri; + best = gp; + } + } + + return best; +} + +/* + * This function is called right after it becomes able to resubmit BIOs. + * It selects the best BIOs and passes them to the underlying layer. 
+ */ +static void ioband_conduct(struct work_struct *work) +{ + struct ioband_device *dp = + container_of(work, struct ioband_device, g_conductor.work); + struct ioband_group *gp = NULL; + struct bio *bio; + unsigned long flags; + struct bio_list issue_list, pushback_list; + + bio_list_init(&issue_list); + bio_list_init(&pushback_list); + + spin_lock_irqsave(&dp->g_lock, flags); + release_urgent_bios(dp, &issue_list, &pushback_list); + if (dp->g_blocked) { + gp = choose_best_group(dp); + if (gp && release_bios(gp, &issue_list, &pushback_list) + == R_YIELD) + queue_delayed_work(dp->g_ioband_wq, + &dp->g_conductor, 0); + } + + if (is_device_blocked(dp) + && dp->g_blocked < dp->g_io_limit[0]+dp->g_io_limit[1]) { + clear_device_blocked(dp); + wake_up_all(&dp->g_waitq); + } + + if (dp->g_blocked && room_for_bio_rw(dp, 0) && room_for_bio_rw(dp, 1) && + bio_list_empty(&issue_list) && bio_list_empty(&pushback_list) && + dp->g_restart_bios(dp)) { + dprintk(KERN_ERR "ioband_conduct: token expired dp:%p " + "issued(%d,%d) g_blocked(%d)\n", dp, + dp->g_issued[0], dp->g_issued[1], dp->g_blocked); + queue_delayed_work(dp->g_ioband_wq, &dp->g_conductor, 0); + } + + + spin_unlock_irqrestore(&dp->g_lock, flags); + + while ((bio = bio_list_pop(&issue_list))) + generic_make_request(bio); + while ((bio = bio_list_pop(&pushback_list))) + bio_endio(bio, -EIO); +} + +static int ioband_end_io(struct dm_target *ti, struct bio *bio, + int error, union map_info *map_context) +{ + struct ioband_group *gp = ti->private; + struct ioband_device *dp = gp->c_banddev; + unsigned long flags; + int r = error; + + /* + * XXX: A new error code for device mapper devices should be used + * rather than EIO. + */ + if (error == -EIO && should_pushback_bio(gp)) { + /* This ioband device is suspending */ + r = DM_ENDIO_REQUEUE; + } + /* + * Todo: The algorithm should be optimized to eliminate the spinlock. + */ + spin_lock_irqsave(&dp->g_lock, flags); + dp->g_issued[bio_data_dir(bio)]--; + + /* + * Todo: It would be better to introduce high/low water marks here + * not to kick the workqueues so often. 
+ */ + if (dp->g_blocked) + queue_delayed_work(dp->g_ioband_wq, &dp->g_conductor, 0); + else if (is_device_suspended(dp) + && dp->g_issued[0] + dp->g_issued[1] == 0) + wake_up_all(&dp->g_waitq_flush); + spin_unlock_irqrestore(&dp->g_lock, flags); + return r; +} + +static void ioband_presuspend(struct dm_target *ti) +{ + struct ioband_group *gp = ti->private; + struct ioband_device *dp = gp->c_banddev; + + mutex_lock(&dp->g_lock_device); + ioband_group_stop_all(gp, 1); + mutex_unlock(&dp->g_lock_device); +} + +static void ioband_resume(struct dm_target *ti) +{ + struct ioband_group *gp = ti->private; + struct ioband_device *dp = gp->c_banddev; + + mutex_lock(&dp->g_lock_device); + ioband_group_resume_all(gp); + mutex_unlock(&dp->g_lock_device); +} + + +static void ioband_group_status(struct ioband_group *gp, int *szp, + char *result, unsigned int maxlen) +{ + struct ioband_group_stat *stat; + int i, sz = *szp; /* used in DMEMIT() */ + + DMEMIT(" %d", gp->c_id); + for (i = 0; i < 2; i++) { + stat = &gp->c_stat[i]; + DMEMIT(" %lu %lu %lu", + stat->immediate + stat->deferred, stat->deferred, + stat->sectors); + } + *szp = sz; +} + +static int ioband_status(struct dm_target *ti, status_type_t type, + char *result, unsigned int maxlen) +{ + struct ioband_group *gp = ti->private, *p; + struct ioband_device *dp = gp->c_banddev; + struct rb_node *node; + int sz = 0; /* used in DMEMIT() */ + unsigned long flags; + + mutex_lock(&dp->g_lock_device); + + switch (type) { + case STATUSTYPE_INFO: + spin_lock_irqsave(&dp->g_lock, flags); + DMEMIT("%s", dp->g_name); + ioband_group_status(gp, &sz, result, maxlen); + for (node = rb_first(&gp->c_group_root); node; + node = rb_next(node)) { + p = rb_entry(node, struct ioband_group, c_group_node); + ioband_group_status(p, &sz, result, maxlen); + } + spin_unlock_irqrestore(&dp->g_lock, flags); + break; + + case STATUSTYPE_TABLE: + spin_lock_irqsave(&dp->g_lock, flags); + DMEMIT("%s %s %d %d %s %s", + gp->c_dev->name, dp->g_name, + dp->g_io_throttle, dp->g_io_limit[0], + gp->c_type->t_name, dp->g_policy->p_name); + dp->g_show(gp, &sz, result, maxlen); + spin_unlock_irqrestore(&dp->g_lock, flags); + break; + } + + mutex_unlock(&dp->g_lock_device); + return 0; +} + +static int ioband_group_type_select(struct ioband_group *gp, char *name) +{ + struct ioband_device *dp = gp->c_banddev; + struct group_type *t; + unsigned long flags; + + for (t = dm_ioband_group_type; (t->t_name); t++) { + if (!strcmp(name, t->t_name)) + break; + } + if (!t->t_name) { + DMWARN("ioband type select: %s isn't supported.", name); + return -EINVAL; + } + spin_lock_irqsave(&dp->g_lock, flags); + if (!RB_EMPTY_ROOT(&gp->c_group_root)) { + spin_unlock_irqrestore(&dp->g_lock, flags); + return -EBUSY; + } + gp->c_type = t; + spin_unlock_irqrestore(&dp->g_lock, flags); + + return 0; +} + +static int ioband_set_param(struct ioband_group *gp, char *cmd, char *value) +{ + struct ioband_device *dp = gp->c_banddev; + char *val_str; + long id; + unsigned long flags; + int r; + + r = split_string(value, &id, &val_str); + if (r) + return r; + + spin_lock_irqsave(&dp->g_lock, flags); + if (id != IOBAND_ID_ANY) { + gp = ioband_group_find(gp, id); + if (!gp) { + spin_unlock_irqrestore(&dp->g_lock, flags); + DMWARN("ioband_set_param: id=%ld not found.", id); + return -EINVAL; + } + } + r = dp->g_set_param(gp, cmd, val_str); + spin_unlock_irqrestore(&dp->g_lock, flags); + return r; +} + +static int ioband_group_attach(struct ioband_group *gp, int id, char *param) +{ + struct ioband_device *dp = gp->c_banddev; 
+ struct ioband_group *sub_gp; + int r; + + if (id < 0) { + DMWARN("ioband_group_attach: invalid id:%d", id); + return -EINVAL; + } + if (!gp->c_type->t_getid) { + DMWARN("ioband_group_attach: " + "no ioband group type is specified"); + return -EINVAL; + } + + sub_gp = kzalloc(sizeof(struct ioband_group), GFP_KERNEL); + if (!sub_gp) + return -ENOMEM; + + r = ioband_group_init(sub_gp, gp, dp, id, param); + if (r < 0) { + kfree(sub_gp); + return r; + } + return 0; +} + +static int ioband_group_detach(struct ioband_group *gp, int id) +{ + struct ioband_device *dp = gp->c_banddev; + struct ioband_group *sub_gp; + unsigned long flags; + + if (id < 0) { + DMWARN("ioband_group_detach: invalid id:%d", id); + return -EINVAL; + } + spin_lock_irqsave(&dp->g_lock, flags); + sub_gp = ioband_group_find(gp, id); + if (!sub_gp) { + spin_unlock_irqrestore(&dp->g_lock, flags); + DMWARN("ioband_group_detach: invalid id:%d", id); + return -EINVAL; + } + + /* + * Todo: Calling suspend_ioband_device() before releasing the + * ioband group has a large overhead. Need improvement. + */ + suspend_ioband_device(dp, flags, 0); + ioband_group_release(gp, sub_gp); + resume_ioband_device(dp); + spin_unlock_irqrestore(&dp->g_lock, flags); + return 0; +} + +/* + * Message parameters: + * "policy" + * ex) + * "policy" "weight" + * "type" "none"|"pid"|"pgrp"|"node"|"cpuset"|"cgroup"|"user"|"gid" + * "io_throttle" + * "io_limit" + * "attach" + * "detach" + * "any-command" : + * ex) + * "weight" 0: + * "token" 24: + */ +static int __ioband_message(struct dm_target *ti, + unsigned int argc, char **argv) +{ + struct ioband_group *gp = ti->private, *p; + struct ioband_device *dp = gp->c_banddev; + struct rb_node *node; + long val; + int r = 0; + unsigned long flags; + + if (argc == 1 && !strcmp(argv[0], "reset")) { + spin_lock_irqsave(&dp->g_lock, flags); + memset(gp->c_stat, 0, sizeof(gp->c_stat)); + for (node = rb_first(&gp->c_group_root); node; + node = rb_next(node)) { + p = rb_entry(node, struct ioband_group, c_group_node); + memset(p->c_stat, 0, sizeof(p->c_stat)); + } + spin_unlock_irqrestore(&dp->g_lock, flags); + return 0; + } + + if (argc != 2) { + DMWARN("Unrecognised band message received."); + return -EINVAL; + } + if (!strcmp(argv[0], "debug")) { + r = strict_strtol(argv[1], 0, &val); + if (r || val < 0) + return -EINVAL; + ioband_debug = val; + return 0; + } else if (!strcmp(argv[0], "io_throttle")) { + r = strict_strtol(argv[1], 0, &val); + spin_lock_irqsave(&dp->g_lock, flags); + if (r || val < 0 || + val > dp->g_io_limit[0] || val > dp->g_io_limit[1]) { + spin_unlock_irqrestore(&dp->g_lock, flags); + return -EINVAL; + } + dp->g_io_throttle = (val == 0) ? 
DEFAULT_IO_THROTTLE : val; + spin_unlock_irqrestore(&dp->g_lock, flags); + ioband_set_param(gp, argv[0], argv[1]); + return 0; + } else if (!strcmp(argv[0], "io_limit")) { + r = strict_strtol(argv[1], 0, &val); + if (r || val < 0) + return -EINVAL; + spin_lock_irqsave(&dp->g_lock, flags); + if (val == 0) { + struct request_queue *q; + + q = bdev_get_queue(gp->c_dev->bdev); + if (!q) { + spin_unlock_irqrestore(&dp->g_lock, flags); + return -ENXIO; + } + val = q->nr_requests; + } + if (val < dp->g_io_throttle) { + spin_unlock_irqrestore(&dp->g_lock, flags); + return -EINVAL; + } + dp->g_io_limit[0] = dp->g_io_limit[1] = val; + spin_unlock_irqrestore(&dp->g_lock, flags); + ioband_set_param(gp, argv[0], argv[1]); + return 0; + } else if (!strcmp(argv[0], "type")) { + return ioband_group_type_select(gp, argv[1]); + } else if (!strcmp(argv[0], "attach")) { + r = strict_strtol(argv[1], 0, &val); + if (r) + return r; + return ioband_group_attach(gp, val, NULL); + } else if (!strcmp(argv[0], "detach")) { + r = strict_strtol(argv[1], 0, &val); + if (r) + return r; + return ioband_group_detach(gp, val); + } else if (!strcmp(argv[0], "policy")) { + r = policy_init(dp, argv[1], 0, &argv[2]); + return r; + } else { + /* message anycommand : */ + r = ioband_set_param(gp, argv[0], argv[1]); + if (r < 0) + DMWARN("Unrecognised band message received."); + return r; + } + return 0; +} + +static int ioband_message(struct dm_target *ti, unsigned int argc, char **argv) +{ + struct ioband_group *gp = ti->private; + struct ioband_device *dp = gp->c_banddev; + int r; + + mutex_lock(&dp->g_lock_device); + r = __ioband_message(ti, argc, argv); + mutex_unlock(&dp->g_lock_device); + return r; +} + +static int ioband_merge(struct dm_target *ti, struct bvec_merge_data *bvm, + struct bio_vec *biovec, int max_size) +{ + struct ioband_group *gp = ti->private; + struct request_queue *q = bdev_get_queue(gp->c_dev->bdev); + + if (!q->merge_bvec_fn) + return max_size; + + bvm->bi_bdev = gp->c_dev->bdev; + bvm->bi_sector -= ti->begin; + + return min(max_size, q->merge_bvec_fn(q, bvm, biovec)); +} + +static struct target_type ioband_target = { + .name = "ioband", + .module = THIS_MODULE, + .version = {1, 8, 0}, + .ctr = ioband_ctr, + .dtr = ioband_dtr, + .map = ioband_map, + .end_io = ioband_end_io, + .presuspend = ioband_presuspend, + .resume = ioband_resume, + .status = ioband_status, + .message = ioband_message, + .merge = ioband_merge, +}; + +static int __init dm_ioband_init(void) +{ + int r; + + r = dm_register_target(&ioband_target); + if (r < 0) { + DMERR("register failed %d", r); + return r; + } + return r; +} + +static void __exit dm_ioband_exit(void) +{ + int r; + + r = dm_unregister_target(&ioband_target); + if (r < 0) + DMERR("unregister failed %d", r); +} + +module_init(dm_ioband_init); +module_exit(dm_ioband_exit); + +MODULE_DESCRIPTION(DM_NAME " I/O bandwidth control"); +MODULE_AUTHOR("Hirokazu Takahashi , " + "Ryo Tsuruta +#include +#include +#include "dm.h" +#include "dm-bio-list.h" +#include "dm-ioband.h" + +/* + * The following functions determine when and which BIOs should + * be submitted to control the I/O flow. + * It is possible to add a new BIO scheduling policy with it. + */ + + +/* + * Functions for weight balancing policy based on the number of I/Os. 
+ */ +#define DEFAULT_WEIGHT 100 +#define DEFAULT_TOKENPOOL 2048 +#define DEFAULT_BUCKET 2 +#define IOBAND_IOPRIO_BASE 100 +#define TOKEN_BATCH_UNIT 20 +#define PROCEED_THRESHOLD 8 +#define LOCAL_ACTIVE_RATIO 8 +#define GLOBAL_ACTIVE_RATIO 16 +#define OVERCOMMIT_RATE 4 + +/* + * Calculate the effective number of tokens this group has. + */ +static int get_token(struct ioband_group *gp) +{ + struct ioband_device *dp = gp->c_banddev; + int token = gp->c_token; + int allowance = dp->g_epoch - gp->c_my_epoch; + + if (allowance) { + if (allowance > dp->g_carryover) + allowance = dp->g_carryover; + token += gp->c_token_initial * allowance; + } + if (is_group_down(gp)) + token += gp->c_token_initial * dp->g_carryover * 2; + + return token; +} + +/* + * Calculate the priority of a given group. + */ +static int iopriority(struct ioband_group *gp) +{ + return get_token(gp) * IOBAND_IOPRIO_BASE / gp->c_token_initial + 1; +} + +/* + * This function is called when all the active group on the same ioband + * device has used up their tokens. It makes a new global epoch so that + * all groups on this device will get freshly assigned tokens. + */ +static int make_global_epoch(struct ioband_device *dp) +{ + struct ioband_group *gp = dp->g_dominant; + + /* + * Don't make a new epoch if the dominant group still has a lot of + * tokens, except when the I/O load is low. + */ + if (gp) { + int iopri = iopriority(gp); + if (iopri * PROCEED_THRESHOLD > IOBAND_IOPRIO_BASE && + dp->g_issued[0] + dp->g_issued[1] >= dp->g_io_throttle) + return 0; + } + + dp->g_epoch++; + dprintk(KERN_ERR "make_epoch %d --> %d\n", + dp->g_epoch-1, dp->g_epoch); + + /* The leftover tokens will be used in the next epoch. */ + dp->g_token_extra = dp->g_token_left; + if (dp->g_token_extra < 0) + dp->g_token_extra = 0; + dp->g_token_left = dp->g_token_bucket; + + dp->g_expired = NULL; + dp->g_dominant = NULL; + + return 1; +} + +/* + * This function is called when this group has used up its own tokens. + * It will check whether it's possible to make a new epoch of this group. + */ +static inline int make_epoch(struct ioband_group *gp) +{ + struct ioband_device *dp = gp->c_banddev; + int allowance = dp->g_epoch - gp->c_my_epoch; + + if (!allowance) + return 0; + if (allowance > dp->g_carryover) + allowance = dp->g_carryover; + gp->c_my_epoch = dp->g_epoch; + return allowance; +} + +/* + * Check whether this group has tokens to issue an I/O. Return 0 if it + * doesn't have any, otherwise return the priority of this group. + */ +static int is_token_left(struct ioband_group *gp) +{ + struct ioband_device *dp = gp->c_banddev; + int allowance; + int delta; + int extra; + + if (gp->c_token > 0) + return iopriority(gp); + + if (is_group_down(gp)) { + gp->c_token = gp->c_token_initial; + return iopriority(gp); + } + allowance = make_epoch(gp); + if (!allowance) + return 0; + /* + * If this group has the right to get tokens for several epochs, + * give all of them to the group here. + */ + delta = gp->c_token_initial * allowance; + dp->g_token_left -= delta; + /* + * Give some extra tokens to this group when there have left unused + * tokens on this ioband device from the previous epoch. 
+ */ + extra = dp->g_token_extra * gp->c_token_initial / + (dp->g_token_bucket - dp->g_token_extra/2); + delta += extra; + gp->c_token += delta; + gp->c_consumed = 0; + + if (gp == dp->g_current) + dp->g_yield_mark += delta; + dprintk(KERN_ERR "refill token: " + "gp:%p token:%d->%d extra(%d) allowance(%d)\n", + gp, gp->c_token - delta, gp->c_token, extra, allowance); + if (gp->c_token > 0) + return iopriority(gp); + dprintk(KERN_ERR "refill token: yet empty gp:%p token:%d\n", + gp, gp->c_token); + return 0; +} + +/* + * Use tokens to issue an I/O. After the operation, the number of tokens left + * on this group may become negative value, which will be treated as debt. + */ +static int consume_token(struct ioband_group *gp, int count, int flag) +{ + struct ioband_device *dp = gp->c_banddev; + + if (gp->c_consumed * LOCAL_ACTIVE_RATIO < gp->c_token_initial && + gp->c_consumed * GLOBAL_ACTIVE_RATIO < dp->g_token_bucket) { + ; /* Do nothing unless this group is really active. */ + } else if (!dp->g_dominant || + get_token(gp) > get_token(dp->g_dominant)) { + /* + * Regard this group as the dominant group on this + * ioband device when it has larger number of tokens + * than those of the previous one. + */ + dp->g_dominant = gp; + } + if (dp->g_epoch == gp->c_my_epoch && + gp->c_token > 0 && gp->c_token - count <= 0) { + /* Remember the last group which used up its own tokens. */ + dp->g_expired = gp; + if (dp->g_dominant == gp) + dp->g_dominant = NULL; + } + + if (gp != dp->g_current) { + /* This group is the current already. */ + dp->g_current = gp; + dp->g_yield_mark = + gp->c_token - (TOKEN_BATCH_UNIT << dp->g_token_unit); + } + gp->c_token -= count; + gp->c_consumed += count; + if (gp->c_token <= dp->g_yield_mark && !(flag & IOBAND_URGENT)) { + /* + * Return-value 1 means that this policy requests dm-ioband + * to give a chance to another group to be selected since + * this group has already issued enough amount of I/Os. + */ + dp->g_current = NULL; + return R_YIELD; + } + /* + * Return-value 0 means that this policy allows dm-ioband to select + * this group to issue I/Os without a break. + */ + return R_OK; +} + +/* + * Consume one token on each I/O. + */ +static int prepare_token(struct ioband_group *gp, struct bio *bio, int flag) +{ + return consume_token(gp, 1, flag); +} + +/* + * Check if this group is able to receive a new bio. 
+ */ +static int is_queue_full(struct ioband_group *gp) +{ + return gp->c_blocked >= gp->c_limit; +} + +static void set_weight(struct ioband_group *gp, int new) +{ + struct ioband_device *dp = gp->c_banddev; + struct ioband_group *p; + + dp->g_weight_total += (new - gp->c_weight); + gp->c_weight = new; + + if (dp->g_weight_total == 0) { + list_for_each_entry(p, &dp->g_groups, c_list) + p->c_token = p->c_token_initial = p->c_limit = 1; + } else { + list_for_each_entry(p, &dp->g_groups, c_list) { + p->c_token = p->c_token_initial = + dp->g_token_bucket * p->c_weight / + dp->g_weight_total + 1; + p->c_limit = (dp->g_io_limit[0] + dp->g_io_limit[1]) * + p->c_weight / dp->g_weight_total / + OVERCOMMIT_RATE + 1; + } + } +} + +static void init_token_bucket(struct ioband_device *dp, int val) +{ + dp->g_token_bucket = ((dp->g_io_limit[0] + dp->g_io_limit[1]) * + DEFAULT_BUCKET) << dp->g_token_unit; + if (!val) + val = DEFAULT_TOKENPOOL << dp->g_token_unit; + if (val < dp->g_token_bucket) + val = dp->g_token_bucket; + dp->g_carryover = val/dp->g_token_bucket; + dp->g_token_left = 0; +} + +static int policy_weight_param(struct ioband_group *gp, char *cmd, char *value) +{ + struct ioband_device *dp = gp->c_banddev; + long val; + int r = 0, err; + + err = strict_strtol(value, 0, &val); + if (!strcmp(cmd, "weight")) { + if (!err && 0 < val && val <= SHORT_MAX) + set_weight(gp, val); + else + r = -EINVAL; + } else if (!strcmp(cmd, "token")) { + if (!err && val > 0) { + init_token_bucket(dp, val); + set_weight(gp, gp->c_weight); + dp->g_token_extra = 0; + } else + r = -EINVAL; + } else if (!strcmp(cmd, "io_limit")) { + init_token_bucket(dp, dp->g_token_bucket * dp->g_carryover); + set_weight(gp, gp->c_weight); + } else { + r = -EINVAL; + } + return r; +} + +static int policy_weight_ctr(struct ioband_group *gp, char *arg) +{ + struct ioband_device *dp = gp->c_banddev; + + if (!arg) + arg = __stringify(DEFAULT_WEIGHT); + gp->c_my_epoch = dp->g_epoch; + gp->c_weight = 0; + gp->c_consumed = 0; + return policy_weight_param(gp, "weight", arg); +} + +static void policy_weight_dtr(struct ioband_group *gp) +{ + struct ioband_device *dp = gp->c_banddev; + set_weight(gp, 0); + dp->g_dominant = NULL; + dp->g_expired = NULL; +} + +static void policy_weight_show(struct ioband_group *gp, int *szp, + char *result, unsigned int maxlen) +{ + struct ioband_group *p; + struct ioband_device *dp = gp->c_banddev; + struct rb_node *node; + int sz = *szp; /* used in DMEMIT() */ + + DMEMIT(" %d :%d", dp->g_token_bucket * dp->g_carryover, gp->c_weight); + + for (node = rb_first(&gp->c_group_root); node; node = rb_next(node)) { + p = rb_entry(node, struct ioband_group, c_group_node); + DMEMIT(" %d:%d", p->c_id, p->c_weight); + } + *szp = sz; +} + +/* + * + * g_can_submit : To determine whether a given group has the right to + * submit BIOs. The larger the return value the higher the + * priority to submit. Zero means it has no right. + * g_prepare_bio : Called right before submitting each BIO. + * g_restart_bios : Called if this ioband device has some BIOs blocked but none + * of them can be submitted now. This method has to + * reinitialize the data to restart to submit BIOs and return + * 0 or 1. + * The return value 0 means that it has become able to submit + * them now so that this ioband device will continue its work. + * The return value 1 means that it is still unable to submit + * them so that this device will stop its work. And this + * policy module has to reactivate the device when it gets + * to be able to submit BIOs. 
+ * g_hold_bio : To hold a given BIO until it is submitted. + * The default function is used when this method is undefined. + * g_pop_bio : To select and get the best BIO to submit. + * g_group_ctr : To initalize the policy own members of struct ioband_group. + * g_group_dtr : Called when struct ioband_group is removed. + * g_set_param : To update the policy own date. + * The parameters can be passed through "dmsetup message" + * command. + * g_should_block : Called every time this ioband device receive a BIO. + * Return 1 if a given group can't receive any more BIOs, + * otherwise return 0. + * g_show : Show the configuration. + */ +static int policy_weight_init(struct ioband_device *dp, int argc, char **argv) +{ + long val; + int r = 0; + + if (argc < 1) + val = 0; + else { + r = strict_strtol(argv[0], 0, &val); + if (r || val < 0) + return -EINVAL; + } + + dp->g_can_submit = is_token_left; + dp->g_prepare_bio = prepare_token; + dp->g_restart_bios = make_global_epoch; + dp->g_group_ctr = policy_weight_ctr; + dp->g_group_dtr = policy_weight_dtr; + dp->g_set_param = policy_weight_param; + dp->g_should_block = is_queue_full; + dp->g_show = policy_weight_show; + + dp->g_epoch = 0; + dp->g_weight_total = 0; + dp->g_current = NULL; + dp->g_dominant = NULL; + dp->g_expired = NULL; + dp->g_token_extra = 0; + dp->g_token_unit = 0; + init_token_bucket(dp, val); + dp->g_token_left = dp->g_token_bucket; + + return 0; +} +/* weight balancing policy based on the number of I/Os. --- End --- */ + + +/* + * Functions for weight balancing policy based on I/O size. + * It just borrows a lot of functions from the regular weight balancing policy. + */ +static int w2_prepare_token(struct ioband_group *gp, struct bio *bio, int flag) +{ + /* Consume tokens depending on the size of a given bio. */ + return consume_token(gp, bio_sectors(bio), flag); +} + +static int w2_policy_weight_init(struct ioband_device *dp, + int argc, char **argv) +{ + long val; + int r = 0; + + if (argc < 1) + val = 0; + else { + r = strict_strtol(argv[0], 0, &val); + if (r || val < 0) + return -EINVAL; + } + + r = policy_weight_init(dp, argc, argv); + if (r < 0) + return r; + + dp->g_prepare_bio = w2_prepare_token; + dp->g_token_unit = PAGE_SHIFT - 9; + init_token_bucket(dp, val); + dp->g_token_left = dp->g_token_bucket; + return 0; +} +/* weight balancing policy based on I/O size. --- End --- */ + + +static int policy_default_init(struct ioband_device *dp, + int argc, char **argv) +{ + return policy_weight_init(dp, argc, argv); +} + +struct policy_type dm_ioband_policy_type[] = { + {"default", policy_default_init}, + {"weight", policy_weight_init}, + {"weight-iosize", w2_policy_weight_init}, + {NULL, policy_default_init} +}; diff -uprN linux-2.6.27.1.orig/drivers/md/dm-ioband-type.c linux-2.6.27.1/drivers/md/dm-ioband-type.c --- linux-2.6.27.1.orig/drivers/md/dm-ioband-type.c 1970-01-01 09:00:00.000000000 +0900 +++ linux-2.6.27.1/drivers/md/dm-ioband-type.c 2008-10-17 12:33:13.000000000 +0900 @@ -0,0 +1,76 @@ +/* + * Copyright (C) 2008 VA Linux Systems Japan K.K. + * + * I/O bandwidth control + * + * This file is released under the GPL. + */ +#include +#include "dm.h" +#include "dm-bio-list.h" +#include "dm-ioband.h" + +/* + * Any I/O bandwidth can be divided into several bandwidth groups, each of which + * has its own unique ID. The following functions are called to determine + * which group a given BIO belongs to and return the ID of the group. 
+ */ + +/* ToDo: unsigned long value would be better for group ID */ + +static int ioband_process_id(struct bio *bio) +{ + /* + * This function will work for KVM and Xen. + */ + return (int)current->tgid; +} + +static int ioband_process_group(struct bio *bio) +{ + return (int)task_pgrp_nr(current); +} + +static int ioband_uid(struct bio *bio) +{ + return (int)current_uid(); +} + +static int ioband_gid(struct bio *bio) +{ + return (int)current_gid(); +} + +static int ioband_cpuset(struct bio *bio) +{ + return 0; /* not implemented yet */ +} + +static int ioband_node(struct bio *bio) +{ + return 0; /* not implemented yet */ +} + +static int ioband_cgroup(struct bio *bio) +{ + /* + * This function should return the ID of the cgroup which issued "bio". + * The ID of the cgroup which the current process belongs to won't be + * suitable ID for this purpose, since some BIOs will be handled by kernel + * threads like aio or pdflush on behalf of the process requesting the BIOs. + */ + return 0; /* not implemented yet */ +} + +struct group_type dm_ioband_group_type[] = { + {"none", NULL}, + {"pgrp", ioband_process_group}, + {"pid", ioband_process_id}, + {"node", ioband_node}, + {"cpuset", ioband_cpuset}, + {"cgroup", ioband_cgroup}, + {"user", ioband_uid}, + {"uid", ioband_uid}, + {"gid", ioband_gid}, + {NULL, NULL} +}; diff -uprN linux-2.6.27.1.orig/drivers/md/dm-ioband.h linux-2.6.27.1/drivers/md/dm-ioband.h --- linux-2.6.27.1.orig/drivers/md/dm-ioband.h 1970-01-01 09:00:00.000000000 +0900 +++ linux-2.6.27.1/drivers/md/dm-ioband.h 2008-10-17 12:33:13.000000000 +0900 @@ -0,0 +1,190 @@ +/* + * Copyright (C) 2008 VA Linux Systems Japan K.K. + * + * I/O bandwidth control + * + * This file is released under the GPL. + */ + +#include +#include + +#define DEFAULT_IO_THROTTLE 4 +#define DEFAULT_IO_LIMIT 128 +#define IOBAND_NAME_MAX 31 +#define IOBAND_ID_ANY (-1) + +struct ioband_group; + +struct ioband_device { + struct list_head g_groups; + struct delayed_work g_conductor; + struct workqueue_struct *g_ioband_wq; + struct bio_list g_urgent_bios; + int g_io_throttle; + int g_io_limit[2]; + int g_issued[2]; + int g_blocked; + spinlock_t g_lock; + struct mutex g_lock_device; + wait_queue_head_t g_waitq; + wait_queue_head_t g_waitq_suspend; + wait_queue_head_t g_waitq_flush; + + int g_ref; + struct list_head g_list; + int g_flags; + char g_name[IOBAND_NAME_MAX + 1]; + struct policy_type *g_policy; + + /* policy dependent */ + int (*g_can_submit)(struct ioband_group *); + int (*g_prepare_bio)(struct ioband_group *, struct bio *, int); + int (*g_restart_bios)(struct ioband_device *); + void (*g_hold_bio)(struct ioband_group *, struct bio *); + struct bio * (*g_pop_bio)(struct ioband_group *); + int (*g_group_ctr)(struct ioband_group *, char *); + void (*g_group_dtr)(struct ioband_group *); + int (*g_set_param)(struct ioband_group *, char *cmd, char *value); + int (*g_should_block)(struct ioband_group *); + void (*g_show)(struct ioband_group *, int *, char *, unsigned int); + + /* members for weight balancing policy */ + int g_epoch; + int g_weight_total; + /* the number of tokens which can be used in every epoch */ + int g_token_bucket; + /* how many epochs tokens can be carried over */ + int g_carryover; + /* how many tokens should be used for one page-sized I/O */ + int g_token_unit; + /* the last group which used a token */ + struct ioband_group *g_current; + /* give another group a chance to be scheduled when the rest + of tokens of the current group reaches this mark */ + int g_yield_mark; + /* the latest 
group which used up its tokens */ + struct ioband_group *g_expired; + /* the group which has the largest number of tokens in the + active groups */ + struct ioband_group *g_dominant; + /* the number of unused tokens in this epoch */ + int g_token_left; + /* left-over tokens from the previous epoch */ + int g_token_extra; +}; + +struct ioband_group_stat { + unsigned long sectors; + unsigned long immediate; + unsigned long deferred; +}; + +struct ioband_group { + struct list_head c_list; + struct ioband_device *c_banddev; + struct dm_dev *c_dev; + struct dm_target *c_target; + struct bio_list c_blocked_bios; + struct bio_list c_prio_bios; + struct rb_root c_group_root; + struct rb_node c_group_node; + int c_id; /* should be unsigned long or unsigned long long */ + char c_name[IOBAND_NAME_MAX + 1]; /* rfu */ + int c_blocked; + int c_prio_blocked; + wait_queue_head_t c_waitq; + int c_flags; + struct ioband_group_stat c_stat[2]; /* hold rd/wr status */ + struct group_type *c_type; + + /* members for weight balancing policy */ + int c_weight; + int c_my_epoch; + int c_token; + int c_token_initial; + int c_limit; + int c_consumed; + + /* rfu */ + /* struct bio_list c_ordered_tag_bios; */ +}; + +#define IOBAND_URGENT 1 + +#define DEV_BIO_BLOCKED 1 +#define DEV_SUSPENDED 2 + +#define set_device_blocked(dp) ((dp)->g_flags |= DEV_BIO_BLOCKED) +#define clear_device_blocked(dp) ((dp)->g_flags &= ~DEV_BIO_BLOCKED) +#define is_device_blocked(dp) ((dp)->g_flags & DEV_BIO_BLOCKED) + +#define set_device_suspended(dp) ((dp)->g_flags |= DEV_SUSPENDED) +#define clear_device_suspended(dp) ((dp)->g_flags &= ~DEV_SUSPENDED) +#define is_device_suspended(dp) ((dp)->g_flags & DEV_SUSPENDED) + +#define IOG_PRIO_BIO_WRITE 1 +#define IOG_PRIO_QUEUE 2 +#define IOG_BIO_BLOCKED 4 +#define IOG_GOING_DOWN 8 +#define IOG_SUSPENDED 16 +#define IOG_NEED_UP 32 + +#define R_OK 0 +#define R_BLOCK 1 +#define R_YIELD 2 + +#define set_group_blocked(gp) ((gp)->c_flags |= IOG_BIO_BLOCKED) +#define clear_group_blocked(gp) ((gp)->c_flags &= ~IOG_BIO_BLOCKED) +#define is_group_blocked(gp) ((gp)->c_flags & IOG_BIO_BLOCKED) + +#define set_group_down(gp) ((gp)->c_flags |= IOG_GOING_DOWN) +#define clear_group_down(gp) ((gp)->c_flags &= ~IOG_GOING_DOWN) +#define is_group_down(gp) ((gp)->c_flags & IOG_GOING_DOWN) + +#define set_group_suspended(gp) ((gp)->c_flags |= IOG_SUSPENDED) +#define clear_group_suspended(gp) ((gp)->c_flags &= ~IOG_SUSPENDED) +#define is_group_suspended(gp) ((gp)->c_flags & IOG_SUSPENDED) + +#define set_group_need_up(gp) ((gp)->c_flags |= IOG_NEED_UP) +#define clear_group_need_up(gp) ((gp)->c_flags &= ~IOG_NEED_UP) +#define group_need_up(gp) ((gp)->c_flags & IOG_NEED_UP) + +#define set_prio_read(gp) ((gp)->c_flags |= IOG_PRIO_QUEUE) +#define clear_prio_read(gp) ((gp)->c_flags &= ~IOG_PRIO_QUEUE) +#define is_prio_read(gp) \ + ((gp)->c_flags & (IOG_PRIO_QUEUE|IOG_PRIO_BIO_WRITE) == IOG_PRIO_QUEUE) + +#define set_prio_write(gp) \ + ((gp)->c_flags |= (IOG_PRIO_QUEUE|IOG_PRIO_BIO_WRITE)) +#define clear_prio_write(gp) \ + ((gp)->c_flags &= ~(IOG_PRIO_QUEUE|IOG_PRIO_BIO_WRITE)) +#define is_prio_write(gp) \ + ((gp)->c_flags & (IOG_PRIO_QUEUE|IOG_PRIO_BIO_WRITE) == \ + (IOG_PRIO_QUEUE|IOG_PRIO_BIO_WRITE)) + +#define set_prio_queue(gp, direct) \ + ((gp)->c_flags |= (IOG_PRIO_QUEUE|direct)) +#define clear_prio_queue(gp) clear_prio_write(gp) +#define is_prio_queue(gp) ((gp)->c_flags & IOG_PRIO_QUEUE) +#define prio_queue_direct(gp) ((gp)->c_flags & IOG_PRIO_BIO_WRITE) + + +struct policy_type { + const char *p_name; + int 
(*p_policy_init)(struct ioband_device *, int, char **); +}; + +extern struct policy_type dm_ioband_policy_type[]; + +struct group_type { + const char *t_name; + int (*t_getid)(struct bio *); +}; + +extern struct group_type dm_ioband_group_type[]; + +/* Just for debugging */ +extern long ioband_debug; +#define dprintk(format, a...) \ + if (ioband_debug > 0) ioband_debug--, printk(format, ##a) From ryov at valinux.co.jp Fri Oct 17 00:11:02 2008 From: ryov at valinux.co.jp (Ryo Tsuruta) Date: Fri, 17 Oct 2008 16:11:02 +0900 (JST) Subject: [PATCH 2/2] dm-ioband: I/O bandwidth controller v1.8.0: Document In-Reply-To: <20081017.161029.104053860.ryov@valinux.co.jp> References: <20081017.160950.71109894.ryov@valinux.co.jp> <20081017.161029.104053860.ryov@valinux.co.jp> Message-ID: <20081017.161102.189704880.ryov@valinux.co.jp> This patch is the documentation of dm-ioband, design overview, installation, command, reference and examples. Signed-off-by: Ryo Tsuruta Signed-off-by: Hirokazu Takahashi diff -uprN linux-2.6.27.1.orig/Documentation/device-mapper/ioband.txt linux-2.6.27.1/Documentation/device-mapper/ioband.txt --- linux-2.6.27.1.orig/Documentation/device-mapper/ioband.txt 1970-01-01 09:00:00.000000000 +0900 +++ linux-2.6.27.1/Documentation/device-mapper/ioband.txt 2008-10-17 12:33:13.000000000 +0900 @@ -0,0 +1,938 @@ + Block I/O bandwidth control: dm-ioband + + ------------------------------------------------------- + + Table of Contents + + [1]What's dm-ioband all about? + + [2]Differences from the CFQ I/O scheduler + + [3]How dm-ioband works. + + [4]Setup and Installation + + [5]Getting started + + [6]Command Reference + + [7]Examples + +What's dm-ioband all about? + + dm-ioband is an I/O bandwidth controller implemented as a device-mapper + driver. Several jobs using the same physical device have to share the + bandwidth of the device. dm-ioband gives bandwidth to each job according + to its weight, which each job can set its own value to. + + A job is a group of processes with the same pid or pgrp or uid or a + virtual machine such as KVM or Xen. A job can also be a cgroup by applying + the bio-cgroup patch, which can be found at + [8]http://people.valinux.co.jp/~ryov/bio-cgroup/. + + +------+ +------+ +------+ +------+ +------+ +------+ + |cgroup| |cgroup| | the | | pid | | pid | | the | jobs + | A | | B | |others| | X | | Y | |others| + +--|---+ +--|---+ +--|---+ +--|---+ +--|---+ +--|---+ + +--V----+---V---+----V---+ +--V----+---V---+----V---+ + | group | group | default| | group | group | default| ioband groups + | | | group | | | | group | + +-------+-------+--------+ +-------+-------+--------+ + | ioband1 | | ioband2 | ioband devices + +-----------|------------+ +-----------|------------+ + +-----------V--------------+-------------V------------+ + | | | + | sdb1 | sdb2 | physical devices + +--------------------------+--------------------------+ + + + -------------------------------------------------------------------------- + +Differences from the CFQ I/O scheduler + + Dm-ioband is flexible to configure the bandwidth settings. + + Dm-ioband can work with any type of I/O scheduler such as the NOOP + scheduler, which is often chosen for high-end storages, since it is + implemented outside the I/O scheduling layer. It allows both of partition + based bandwidth control and job --- a group of processes --- based + control. In addition, it can set different configuration on each physical + device to control its bandwidth. 
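    As a rough illustration of this independence (a sketch only: /dev/sda is
    the sample disk used elsewhere in this document, and the exact output
    depends on which I/O schedulers are compiled into the kernel), the
    scheduler of the underlying disk can be inspected and changed through
    sysfs without touching any ioband device:

      # cat /sys/block/sda/queue/scheduler
      noop anticipatory deadline [cfq]
      # echo noop > /sys/block/sda/queue/scheduler
      # cat /sys/block/sda/queue/scheduler
      [noop] anticipatory deadline cfq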
+ + Meanwhile the current implementation of the CFQ scheduler has 8 IO + priority levels and all jobs whose processes have the same IO priority + share the bandwidth assigned to this level between them. And IO priority + is an attribute of a process so that it equally effects to all block + devices. + + -------------------------------------------------------------------------- + +How dm-ioband works. + + Every ioband device has one ioband group, which by default is called the + default group. + + Ioband devices can also have extra ioband groups in them. Each ioband + group has a job to support and a weight. Proportional to the weight, + dm-ioband gives tokens to the group. + + A group passes on I/O requests that its job issues to the underlying + layer so long as it has tokens left, while requests are blocked if there + aren't any tokens left in the group. Tokens are refilled once all of + groups that have requests on a given physical device use up their tokens. + + There are two policies for token consumption. One is that a token is + consumed for each I/O request. The other is that a token is consumed for + each I/O sector, for example, one I/O request which consists of + 4Kbytes(512bytes * 8 sectors) read consumes 8 tokens. A user can choose + either policy. + + With this approach, a job running on an ioband group with large weight + is guaranteed a wide I/O bandwidth. + + -------------------------------------------------------------------------- + +Setup and Installation + + Build a kernel with these options enabled: + + CONFIG_MD + CONFIG_BLK_DEV_DM + CONFIG_DM_IOBAND + + + If compiled as module, use modprobe to load dm-ioband. + + # make modules + # make modules_install + # depmod -a + # modprobe dm-ioband + + + "dmsetup targets" command shows all available device-mapper targets. + "ioband" and the version number are displayed when dm-ioband has been + loaded. + + # dmsetup targets | grep ioband + ioband v1.8.0 + + + -------------------------------------------------------------------------- + +Getting started + + The following is a brief description how to control the I/O bandwidth of + disks. In this description, we'll take one disk with two partitions as an + example target. + + -------------------------------------------------------------------------- + + Create and map ioband devices + + Create two ioband devices "ioband1" and "ioband2". "ioband1" is mapped + to "/dev/sda1" and has a weight of 40. "ioband2" is mapped to "/dev/sda2" + and has a weight of 10. "ioband1" can use 80% --- 40/(40+10)*100 --- of + the bandwidth of the physical disk "/dev/sda" while "ioband2" can use 20%. + + # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1 1 0 0 none" \ + "weight 0 :40" | dmsetup create ioband1 + # echo "0 $(blockdev --getsize /dev/sda2) ioband /dev/sda2 1 0 0 none" \ + "weight 0 :10" | dmsetup create ioband2 + + + If the commands are successful then the device files + "/dev/mapper/ioband1" and "/dev/mapper/ioband2" will have been created. + + -------------------------------------------------------------------------- + + Additional bandwidth control + + In this example two extra ioband groups are created on "ioband1". The + first group consists of all the processes with user-id 1000 and the second + group consists of all the processes with user-id 2000. Their weights are + 30 and 20 respectively. 
+ + # dmsetup message ioband1 0 type user + # dmsetup message ioband1 0 attach 1000 + # dmsetup message ioband1 0 attach 2000 + # dmsetup message ioband1 0 weight 1000:30 + # dmsetup message ioband1 0 weight 2000:20 + + + Now the processes in the user-id 1000 group can use 30% --- + 30/(30+20+40+10)*100 --- of the bandwidth of the physical disk. + + Table 1. Weight assignments + + +----------------------------------------------------------------+ + | ioband device | ioband group | ioband weight | + |---------------+--------------------------------+---------------| + | ioband1 | user id 1000 | 30 | + |---------------+--------------------------------+---------------| + | ioband1 | user id 2000 | 20 | + |---------------+--------------------------------+---------------| + | ioband1 | default group(the other users) | 40 | + |---------------+--------------------------------+---------------| + | ioband2 | default group | 10 | + +----------------------------------------------------------------+ + + -------------------------------------------------------------------------- + + Remove the ioband devices + + Remove the ioband devices when no longer used. + + # dmsetup remove ioband1 + # dmsetup remove ioband2 + + + -------------------------------------------------------------------------- + +Command Reference + + Create an ioband device + + SYNOPSIS + + dmsetup create IOBAND_DEVICE + + DESCRIPTION + + Create an ioband device with the given name IOBAND_DEVICE. + Generally, dmsetup reads a table from standard input. Each line of + the table specifies a single target and is of the form: + + start_sector num_sectors "ioband" device_file ioband_device_id \ + io_throttle io_limit ioband_group_type policy token_base \ + :weight [ioband_group_id:weight...] + + + start_sector, num_sectors + + The sector range of the underlying device where + dm-ioband maps. + + ioband + + Specify the string "ioband" as a target type. + + device_file + + Underlying device name. + + ioband_device_id + + The ID number for an ioband device. The same ID + must be set among the ioband devices that share the + same bandwidth, which means they work on the same + physical disk. + + io_throttle + + Dm-ioband starts to control the bandwidth when the + number of BIOs in progress exceeds this value. If 0 + is specified, dm-ioband uses the default value. + + io_limit + + Dm-ioband blocks all I/O requests for the + IOBAND_DEVICE when the number of BIOs in progress + exceeds this value. If 0 is specified, dm-ioband uses + the default value. + + ioband_group_type + + Specify how to evaluate the ioband group ID. The + type must be one of "none", "user", "gid", "pid" or + "pgrp." The type "cgroup" is enabled by applying the + bio-cgroup patch. Specify "none" if you don't need + any ioband groups other than the default ioband + group. + + policy + + Specify bandwidth control policy. A user can choose + either policy "weight" or "weight-iosize." + + weight + + This policy controls bandwidth + according to the proportional to the + weight of each ioband group based on the + number of I/O requests. + + weight-iosize + + This policy controls bandwidth + according to the proportional to the + weight of each ioband group based on the + number of I/O sectors. + + token_base + + The number of tokens which specified by token_base + will be distributed to all ioband groups according to + the proportional to the weight of each ioband group. + If 0 is specified, dm-ioband uses the default value. 
+ + ioband_group_id:weight + + Set the weight of the ioband group specified by + ioband_group_id. If ioband_group_id is omitted, the + weight is assigned to the default ioband group. + + EXAMPLE + + Create an ioband device with the following parameters: + + * Starting sector = "0" + + * The number of sectors = "$(blockdev --getsize /dev/sda1)" + + * Target type = "ioband" + + * Underlying device name = "/dev/sda1" + + * Ioband device ID = "128" + + * I/O throttle = "10" + + * I/O limit = "400" + + * Ioband group type = "user" + + * Bandwidth control policy = "weight" + + * Token base = "2048" + + * Weight for the default ioband group = "100" + + * Weight for the ioband group 1000 = "80" + + * Weight for the ioband group 2000 = "20" + + * Ioband device name = "ioband1" + + # echo "0 $(blockdev --getsize /dev/sda1) ioband" \ + "/dev/sda1 128 10 400 user weight 2048 :100 1000:80 2000:20" \ + | dmsetup create ioband1 + + + Create two device groups (ID=1,2). The bandwidths of these + device groups will be individually controlled. + + # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1 1" \ + "0 0 none weight 0 :80" | dmsetup create ioband1 + # echo "0 $(blockdev --getsize /dev/sda2) ioband /dev/sda2 1" \ + "0 0 none weight 0 :20" | dmsetup create ioband2 + # echo "0 $(blockdev --getsize /dev/sdb3) ioband /dev/sdb3 2" \ + "0 0 none weight 0 :60" | dmsetup create ioband3 + # echo "0 $(blockdev --getsize /dev/sdb4) ioband /dev/sdb4 2" \ + "0 0 none weight 0 :40" | dmsetup create ioband4 + + + -------------------------------------------------------------------------- + + Remove the ioband device + + SYNOPSIS + + dmsetup remove IOBAND_DEVICE + + DESCRIPTION + + Remove the specified ioband device IOBAND_DEVICE. All the band + groups attached to the ioband device are also removed + automatically. + + EXAMPLE + + Remove ioband device "ioband1." + + # dmsetup remove ioband1 + + + -------------------------------------------------------------------------- + + Set an ioband group type + + SYNOPSIS + + dmsetup message IOBAND_DEVICE 0 type TYPE + + DESCRIPTION + + Set the ioband group type of the specified ioband device + IOBAND_DEVICE. TYPE must be one of "none", "user", "gid", "pid" or + "pgrp." The type "cgroup" is enabled by applying the bio-cgroup + patch. Once the type is set, new ioband groups can be created on + IOBAND_DEVICE. + + EXAMPLE + + Set the ioband group type of ioband device "ioband1" to "user." + + # dmsetup message ioband1 0 type user + + + -------------------------------------------------------------------------- + + Create an ioband group + + SYNOPSIS + + dmsetup message IOBAND_DEVICE 0 attach ID + + DESCRIPTION + + Create an ioband group and attach it to IOBAND_DEVICE. ID + specifies user-id, group-id, process-id or process-group-id + depending the ioband group type of IOBAND_DEVICE. + + EXAMPLE + + Create an ioband group which consists of all processes with + user-id 1000 and attach it to ioband device "ioband1." + + # dmsetup message ioband1 0 type user + # dmsetup message ioband1 0 attach 1000 + + + -------------------------------------------------------------------------- + + Detach the ioband group + + SYNOPSIS + + dmsetup message IOBAND_DEVICE 0 detach ID + + DESCRIPTION + + Detach the ioband group specified by ID from ioband device + IOBAND_DEVICE. + + EXAMPLE + + Detach the ioband group with ID "2000" from ioband device + "ioband2." 
+ + # dmsetup message ioband2 0 detach 1000 + + + -------------------------------------------------------------------------- + + Set bandwidth control policy + + SYNOPSIS + + dmsetup message IOBAND_DEVICE 0 policy policy + + DESCRIPTION + + Set bandwidth control policy. This command applies to all ioband + devices which have the same ioband device ID as IOBAND_DEVICE. A + user can choose either policy "weight" or "weight-iosize." + + weight + + This policy controls bandwidth according to the + proportional to the weight of each ioband group based + on the number of I/O requests. + + weight-iosize + + This policy controls bandwidth according to the + proportional to the weight of each ioband group based + on the number of I/O sectors. + + EXAMPLE + + Set bandwidth control policy of ioband devices which have the + same ioband device ID as "ioband1" to "weight-iosize." + + # dmsetup message ioband1 0 policy weight-iosize + + + -------------------------------------------------------------------------- + + Set the weight of an ioband group + + SYNOPSIS + + dmsetup message IOBAND_DEVICE 0 weight VAL + + dmsetup message IOBAND_DEVICE 0 weight ID:VAL + + DESCRIPTION + + Set the weight of the ioband group specified by ID. Set the + weight of the default ioband group of IOBAND_DEVICE if ID isn't + specified. + + The following example means that "ioband1" can use 80% --- + 40/(40+10)*100 --- of the bandwidth of the physical disk while + "ioband2" can use 20%. + + # dmsetup message ioband1 0 weight 40 + # dmsetup message ioband2 0 weight 10 + + + The following lines have the same effect as the above: + + # dmsetup message ioband1 0 weight 4 + # dmsetup message ioband2 0 weight 1 + + + VAL must be an integer larger than 0. The default value, which + is assigned to newly created ioband groups, is 100. + + EXAMPLE + + Set the weight of the default ioband group of "ioband1" to 40. + + # dmsetup message ioband1 0 weight 40 + + + Set the weight of the ioband group of "ioband1" with ID "1000" + to 10. + + # dmsetup message ioband1 0 weight 1000:10 + + + -------------------------------------------------------------------------- + + Set the number of tokens + + SYNOPSIS + + dmsetup message IOBAND_DEVICE 0 token VAL + + DESCRIPTION + + Set the number of tokens to VAL. According to their weight, this + number of tokens will be distributed to all the ioband groups on + the physical device to which ioband device IOBAND_DEVICE belongs + when they use up their tokens. + + VAL must be an integer greater than 0. The default is 2048. + + EXAMPLE + + Set the number of tokens of the physical device to which + "ioband1" belongs to 256. + + # dmsetup message ioband1 0 token 256 + + + -------------------------------------------------------------------------- + + Set I/O throttling + + SYNOPSIS + + dmsetup message IOBAND_DEVICE 0 io_throttle VAL + + DESCRIPTION + + Set the I/O throttling value of the physical disk to which + ioband device IOBAND_DEVICE belongs to VAL. Dm-ioband start to + control the bandwidth when the number of BIOs in progress on the + physical disk exceeds this value. + + EXAMPLE + + Set the I/O throttling value of "ioband1" to 16. + + # dmsetup message ioband1 0 io_throttle 16 + + + -------------------------------------------------------------------------- + + Set I/O limiting + + SYNOPSIS + + dmsetup message IOBAND_DEVICE 0 io_limit VAL + + DESCRIPTION + + Set the I/O limiting value of the physical disk to which ioband + device IOBAND_DEVICE belongs to VAL. 
Dm-ioband will block all I/O + requests for the physical device if the number of BIOs in progress + on the physical disk exceeds this value. + + EXAMPLE + + Set the I/O limiting value of "ioband1" to 128. + + # dmsetup message ioband1 0 io_limit 128 + + + -------------------------------------------------------------------------- + + Display settings + + SYNOPSIS + + dmsetup table --target ioband + + DESCRIPTION + + Display the current table for the ioband device in a format. See + "dmsetup create" command for information on the table format. + + EXAMPLE + + The following output shows the current table of "ioband1." + + # dmsetup table --target ioband + ioband: 0 32129937 ioband1 8:29 128 10 400 user weight \ + 2048 :100 1000:80 2000:20 + + + -------------------------------------------------------------------------- + + Display Statistics + + SYNOPSIS + + dmsetup status --target ioband + + DESCRIPTION + + Display the statistics of all the ioband devices whose target + type is "ioband." + + The output format is as below. the first five columns shows: + + * ioband device name + + * logical start sector of the device (must be 0) + + * device size in sectors + + * target type (must be "ioband") + + * device group ID + + The remaining columns show the statistics of each ioband group + on the band device. Each group uses seven columns for its + statistics. + + * ioband group ID (-1 means default) + + * total read requests + + * delayed read requests + + * total read sectors + + * total write requests + + * delayed write requests + + * total write sectors + + EXAMPLE + + The following output shows the statistics of two ioband devices. + Ioband2 only has the default ioband group and ioband1 has three + (default, 1001, 1002) ioband groups. + + # dmsetup status + ioband2: 0 44371467 ioband 128 -1 143 90 424 122 78 352 + ioband1: 0 44371467 ioband 128 -1 223 172 408 211 136 600 1001 \ + 166 107 472 139 95 352 1002 211 146 520 210 147 504 + + + -------------------------------------------------------------------------- + + Reset status counter + + SYNOPSIS + + dmsetup message IOBAND_DEVICE 0 reset + + DESCRIPTION + + Reset the statistics of ioband device IOBAND_DEVICE. + + EXAMPLE + + Reset the statistics of "ioband1." + + # dmsetup message ioband1 0 reset + + + -------------------------------------------------------------------------- + +Examples + + Example #1: Bandwidth control on Partitions + + This example describes how to control the bandwidth with disk + partitions. The following diagram illustrates the configuration of this + example. You may want to run a database on /dev/mapper/ioband1 and web + applications on /dev/mapper/ioband2. + + /mnt1 /mnt2 mount points + | | + +-------------V------------+ +-------------V------------+ + | /dev/mapper/ioband1 | | /dev/mapper/ioband2 | ioband devices + +--------------------------+ +--------------------------+ + | default group | | default group | ioband groups + | (80) | | (40) | (weight) + +-------------|------------+ +-------------|------------+ + | | + +-------------V-------------+--------------V------------+ + | /dev/sda1 | /dev/sda2 | physical devices + +---------------------------+---------------------------+ + + + To setup the above configuration, follow these steps: + + 1. Create ioband devices with the same device group ID and assign + weights of 80 and 40 to the default ioband groups respectively. 
+ + # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1 1 0 0" \ + "none weight 0 :80" | dmsetup create ioband1 + # echo "0 $(blockdev --getsize /dev/sda2) ioband /dev/sda2 1 0 0" \ + "none weight 0 :40" | dmsetup create ioband2 + + + 2. Create filesystems on the ioband devices and mount them. + + # mkfs.ext3 /dev/mapper/ioband1 + # mount /dev/mapper/ioband1 /mnt1 + + # mkfs.ext3 /dev/mapper/ioband2 + # mount /dev/mapper/ioband2 /mnt2 + + + -------------------------------------------------------------------------- + + Example #2: Bandwidth control on Logical Volumes + + This example is similar to the example #1 but it uses LVM logical + volumes instead of disk partitions. This example shows how to configure + ioband devices on two striped logical volumes. + + /mnt1 /mnt2 mount points + | | + +-------------V------------+ +-------------V------------+ + | /dev/mapper/ioband1 | | /dev/mapper/ioband2 | ioband devices + +--------------------------+ +--------------------------+ + | default group | | default group | ioband groups + | (80) | | (40) | (weight) + +-------------|------------+ +-------------|------------+ + | | + +-------------V------------+ +-------------V------------+ + | /dev/mapper/lv0 | | /dev/mapper/lv1 | striped logical + | | | | volumes + +-------------------------------------------------------+ + | vg0 | volume group + +-------------|----------------------------|------------+ + | | + +-------------V------------+ +-------------V------------+ + | /dev/sdb | | /dev/sdc | physical devices + +--------------------------+ +--------------------------+ + + + To setup the above configuration, follow these steps: + + 1. Initialize the partitions for use by LVM. + + # pvcreate /dev/sdb + # pvcreate /dev/sdc + + + 2. Create a new volume group named "vg0" with /dev/sdb and /dev/sdc. + + # vgcreate vg0 /dev/sdb /dev/sdc + + + 3. Create two logical volumes in "vg0." The volumes have to be striped. + + # lvcreate -n lv0 -i 2 -I 64 vg0 -L 1024M + # lvcreate -n lv1 -i 2 -I 64 vg0 -L 1024M + + + The rest is the same as the example #1. + + 4. Create ioband devices corresponding to each logical volume and + assign weights of 80 and 40 to the default ioband groups respectively. + + # echo "0 $(blockdev --getsize /dev/mapper/vg0-lv0)" \ + "ioband /dev/mapper/vg0-lv0 1 0 0 none weight 0 :80" | \ + dmsetup create ioband1 + # echo "0 $(blockdev --getsize /dev/mapper/vg0-lv1)" \ + "ioband /dev/mapper/vg0-lv1 1 0 0 none weight 0 :40" | \ + dmsetup create ioband2 + + + 5. Create filesystems on the ioband devices and mount them. + + # mkfs.ext3 /dev/mapper/ioband1 + # mount /dev/mapper/ioband1 /mnt1 + + # mkfs.ext3 /dev/mapper/ioband2 + # mount /dev/mapper/ioband2 /mnt2 + + + -------------------------------------------------------------------------- + + Example #3: Bandwidth control on processes + + This example describes how to control the bandwidth with groups of + processes. You may also want to run an additional application on the same + machine described in the example #1. This example shows how to add a new + ioband group for this application. 
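    Before creating the group, it may help to confirm the numeric user-id
    the application actually runs under; the dmsetup steps that follow the
    diagram below assume it is 1000 (a sketch only: the process name
    "webapp" is a placeholder):

      # ps -C webapp -o user,uid,comm
      USER       UID COMMAND
      webuser   1000 webapp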
+
+             /mnt1                        /mnt2            mount points
+                |                            |
+  +-------------V------------+ +-------------V------------+
+  |   /dev/mapper/ioband1    | |   /dev/mapper/ioband2    |  ioband devices
+  +-------------+------------+ +-------------+------------+
+  |          default         | |  user=1000  |  default   |  ioband groups
+  |           (80)           | |    (20)     |    (40)    |  (weight)
+  +-------------+------------+ +-------------+------------+
+                |                            |
+  +-------------V-------------+--------------V------------+
+  |         /dev/sda1         |         /dev/sda2         |  physical device
+  +---------------------------+---------------------------+
+
+
+    The following shows how to set up a new ioband group on the machine
+  that is already configured as in example #1. The application will have a
+  weight of 20 and run with user-id 1000 on /dev/mapper/ioband2.
+
+   1. Set the type of ioband2 to "user."
+
+        # dmsetup message ioband2 0 type user
+
+
+   2. Create a new ioband group on ioband2.
+
+        # dmsetup message ioband2 0 attach 1000
+
+
+   3. Assign a weight of 20 to this newly created ioband group.
+
+        # dmsetup message ioband2 0 weight 1000:20
+
+
+  --------------------------------------------------------------------------
+
+  Example #4: Bandwidth control for Xen virtual block devices
+
+    This example describes how to control the bandwidth for Xen virtual
+  block devices. The following diagram illustrates the configuration of this
+  example.
+
+        Virtual Machine 1           Virtual Machine 2       virtual machines
+                |                            |
+  +-------------V------------+ +-------------V------------+
+  |        /dev/xvda1        | |        /dev/xvda1        |  virtual block
+  +-------------|------------+ +-------------|------------+  devices
+                |                            |
+  +-------------V------------+ +-------------V------------+
+  |   /dev/mapper/ioband1    | |   /dev/mapper/ioband2    |  ioband devices
+  +--------------------------+ +--------------------------+
+  |      default group       | |      default group       |  ioband groups
+  |          (80)            | |          (40)            |  (weight)
+  +-------------|------------+ +-------------|------------+
+                |                            |
+  +-------------V-------------+--------------V------------+
+  |         /dev/sda1         |         /dev/sda2         |  physical device
+  +---------------------------+---------------------------+
+
+
+    The following shows how to map ioband devices "ioband1" and "ioband2" to
+  virtual block device "/dev/xvda1 on Virtual Machine 1" and "/dev/xvda1 on
+  Virtual Machine 2" respectively on the machine configured as in example
+  #1. Add the following lines to the configuration files that are referenced
+  when creating "Virtual Machine 1" and "Virtual Machine 2."
+
+       For "Virtual Machine 1"
+       disk = [ 'phy:/dev/mapper/ioband1,xvda,w' ]
+
+       For "Virtual Machine 2"
+       disk = [ 'phy:/dev/mapper/ioband2,xvda,w' ]
+
+
+  --------------------------------------------------------------------------
+
+  Example #5: Bandwidth control for Xen blktap devices
+
+    This example describes how to control the bandwidth for Xen virtual
+  block devices when Xen blktap devices are used. The following diagram
+  illustrates the configuration of this example.
+ + Virtual Machine 1 Virtual Machine 2 virtual machines + | | + +-------------V------------+ +-------------V------------+ + | /dev/xvda1 | | /dev/xvda1 | virtual block + +-------------|------------+ +-------------|------------+ devices + | | + +-------------V----------------------------V------------+ + | /dev/mapper/ioband1 | ioband device + +---------------------------+---------------------------+ + | default group | default group | ioband groups + | (80) | (40) | (weight) + +-------------|-------------+--------------|------------+ + | | + +-------------|----------------------------|------------+ + | +----------V----------+ +----------V---------+ | + | | vm1.img | | vm2.img | | disk image files + | +---------------------+ +--------------------+ | + | /vmdisk | mount point + +---------------------------|---------------------------+ + | + +---------------------------V---------------------------+ + | /dev/sda1 | physical device + +-------------------------------------------------------+ + + + To setup the above configuration, follow these steps: + + 1. Create an ioband device. + + # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1" \ + "1 0 0 none weight 0 :100" | dmsetup create ioband1 + + + 2. Add the following lines to the configuration files that are + referenced when creating "Virtual Machine 1" and "Virtual Machine 2." + Disk image files "/vmdisk/vm1.img" and "/vmdisk/vm2.img" will be used. + + For "Virtual Machine 1" + disk = [ 'tap:aio:/vmdisk/vm1.img,xvda,w', ] + + For "Virtual Machine 1" + disk = [ 'tap:aio:/vmdisk/vm2.img,xvda,w', ] + + + 3. Run the virtual machines. + + # xm create vm1 + # xm create vm2 + + + 4. Find out the process IDs of the daemons which control the blktap + devices. + + # lsof /vmdisk/disk[12].img + COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME + tapdisk 15011 root 11u REG 253,0 2147483648 48961 /vmdisk/vm1.img + tapdisk 15276 root 13u REG 253,0 2147483648 48962 /vmdisk/vm2.img + + + 5. Create new ioband groups of pid 15011 and pid 15276, which are + process IDs of the tapdisks, and assign weight of 80 and 40 to the + groups respectively. + + # dmsetup message ioband1 0 type pid + # dmsetup message ioband1 0 attach 15011 + # dmsetup message ioband1 0 weight 15011:80 + # dmsetup message ioband1 0 attach 15276 + # dmsetup message ioband1 0 weight 15276:40 From tony.luck at intel.com Fri Oct 17 13:16:28 2008 From: tony.luck at intel.com (Luck, Tony) Date: Fri, 17 Oct 2008 13:16:28 -0700 Subject: [PATCH 00/33] ia64/xen domU take 12 In-Reply-To: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> References: <1224209893-2032-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <57C9024A16AD2D4C97DC78E552063EA3533459D6@orsmsx505.amr.corp.intel.com> > This patchset is ia64/xen domU patch take 12. > Tony, please commit those patches. Ok. Committed. -Tony From pavel at suse.cz Fri Oct 17 15:54:41 2008 From: pavel at suse.cz (Pavel Machek) Date: Sat, 18 Oct 2008 00:54:41 +0200 Subject: [PATCH 8/8 v4] PCI: document the changes In-Reply-To: <20081014110157.GH1734@yzhao12-linux.sh.intel.com> References: <20081014103424.GA1704@yzhao12-linux.sh.intel.com> <20081014110157.GH1734@yzhao12-linux.sh.intel.com> Message-ID: <20081017225440.GA1467@ucw.cz> Hi! > Create how-to for SR-IOV user and device driver developer. > > Signed-off-by: Yu Zhao > +1.1 What is SR-IOV > + > +Single Root I/O Virtualization (SR-IOV) is a PCI Express Extended > +capability which makes one physical device appear as multiple virtual > +devices. 
The physical device is referred to as Physical Function while > +the virtual devices are referred to as Virtual Functions. Allocation > +of Virtual Functions can be dynamically controlled by Physical Function > +via registers encapsulated in the capability. By default, this feature > +is not enabled and the Physical Function behaves as traditional PCIe > +device. Once it's turned on, each Virtual Function's PCI configuration > +space can be accessed by its own Bus, Device and Function Number (Routing > +ID). And each Virtual Function also has PCI Memory Space, which is > used Ok, why is this optional? If intel cares about virtualization, it should enable this by default. I dont see why this should be configurable. > +#ifdef CONFIG_PM > +/* > + * If Physical Function supports the power management, then the > + * SR-IOV needs to be disabled before the adapter goes to sleep, > + * because Virtual Functions will not work when the adapter is in > + * the power-saving mode. > + * The SR-IOV can be enabled again after the adapter wakes up. > + */ How beatiful :-(. -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html From yamahata at valinux.co.jp Sun Oct 19 20:55:11 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Mon, 20 Oct 2008 12:55:11 +0900 Subject: [PATCH 09/13] ia64/pv_ops: gate page paravirtualization. In-Reply-To: <1224474915-17171-1-git-send-email-yamahata@valinux.co.jp> References: <1224474915-17171-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224474915-17171-10-git-send-email-yamahata@valinux.co.jp> paravirtualize gate page by allowing each pv_ops instances to define its own gate page. Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/native/patchlist.h | 38 ++++++++++++++ arch/ia64/include/asm/paravirt.h | 35 +++++++++++++ arch/ia64/kernel/Makefile | 32 ++---------- arch/ia64/kernel/Makefile.gate | 27 ++++++++++ arch/ia64/kernel/gate.lds.S | 17 ++++--- arch/ia64/kernel/paravirt_patchlist.c | 78 ++++++++++++++++++++++++++++++ arch/ia64/kernel/paravirt_patchlist.h | 28 +++++++++++ arch/ia64/kernel/patch.c | 12 ++-- arch/ia64/mm/init.c | 6 ++- 9 files changed, 230 insertions(+), 43 deletions(-) create mode 100644 arch/ia64/include/asm/native/patchlist.h create mode 100644 arch/ia64/kernel/Makefile.gate create mode 100644 arch/ia64/kernel/paravirt_patchlist.c create mode 100644 arch/ia64/kernel/paravirt_patchlist.h diff --git a/arch/ia64/include/asm/native/patchlist.h b/arch/ia64/include/asm/native/patchlist.h new file mode 100644 index 0000000..be16ca9 --- /dev/null +++ b/arch/ia64/include/asm/native/patchlist.h @@ -0,0 +1,38 @@ +/****************************************************************************** + * arch/ia64/include/asm/native/inst.h + * + * Copyright (c) 2008 Isaku Yamahata + * VA Linux Systems Japan K.K. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + */ + +#define __paravirt_start_gate_fsyscall_patchlist \ + __ia64_native_start_gate_fsyscall_patchlist +#define __paravirt_end_gate_fsyscall_patchlist \ + __ia64_native_end_gate_fsyscall_patchlist +#define __paravirt_start_gate_brl_fsys_bubble_down_patchlist \ + __ia64_native_start_gate_brl_fsys_bubble_down_patchlist +#define __paravirt_end_gate_brl_fsys_bubble_down_patchlist \ + __ia64_native_end_gate_brl_fsys_bubble_down_patchlist +#define __paravirt_start_gate_vtop_patchlist \ + __ia64_native_start_gate_vtop_patchlist +#define __paravirt_end_gate_vtop_patchlist \ + __ia64_native_end_gate_vtop_patchlist +#define __paravirt_start_gate_mckinley_e9_patchlist \ + __ia64_native_start_gate_mckinley_e9_patchlist +#define __paravirt_end_gate_mckinley_e9_patchlist \ + __ia64_native_end_gate_mckinley_e9_patchlist diff --git a/arch/ia64/include/asm/paravirt.h b/arch/ia64/include/asm/paravirt.h index a73e77a..fc433f6 100644 --- a/arch/ia64/include/asm/paravirt.h +++ b/arch/ia64/include/asm/paravirt.h @@ -35,6 +35,41 @@ extern struct pv_fsys_data pv_fsys_data; unsigned long *paravirt_get_fsyscall_table(void); char *paravirt_get_fsys_bubble_down(void); + +/****************************************************************************** + * patchlist addresses for gate page + */ +enum pv_gate_patchlist { + PV_GATE_START_FSYSCALL, + PV_GATE_END_FSYSCALL, + + PV_GATE_START_BRL_FSYS_BUBBLE_DOWN, + PV_GATE_END_BRL_FSYS_BUBBLE_DOWN, + + PV_GATE_START_VTOP, + PV_GATE_END_VTOP, + + PV_GATE_START_MCKINLEY_E9, + PV_GATE_END_MCKINLEY_E9, +}; + +struct pv_patchdata { + unsigned long start_fsyscall_patchlist; + unsigned long end_fsyscall_patchlist; + unsigned long start_brl_fsys_bubble_down_patchlist; + unsigned long end_brl_fsys_bubble_down_patchlist; + unsigned long start_vtop_patchlist; + unsigned long end_vtop_patchlist; + unsigned long start_mckinley_e9_patchlist; + unsigned long end_mckinley_e9_patchlist; + + void *gate_section; +}; + +extern struct pv_patchdata pv_patchdata; + +unsigned long paravirt_get_gate_patchlist(enum pv_gate_patchlist type); +void *paravirt_get_gate_section(void); #endif #ifdef CONFIG_PARAVIRT_GUEST diff --git a/arch/ia64/kernel/Makefile b/arch/ia64/kernel/Makefile index 1ab150e..8dc9df8 100644 --- a/arch/ia64/kernel/Makefile +++ b/arch/ia64/kernel/Makefile @@ -5,7 +5,7 @@ extra-y := head.o init_task.o vmlinux.lds obj-y := acpi.o entry.o efi.o efi_stub.o gate-data.o fsys.o ia64_ksyms.o irq.o irq_ia64.o \ - irq_lsapic.o ivt.o machvec.o pal.o patch.o process.o perfmon.o ptrace.o sal.o \ + irq_lsapic.o ivt.o machvec.o pal.o paravirt_patchlist.o patch.o process.o perfmon.o ptrace.o sal.o \ salinfo.o setup.o signal.o sys_ia64.o time.o traps.o unaligned.o \ unwind.o mca.o mca_asm.o topology.o @@ -47,35 +47,13 @@ ifeq ($(CONFIG_DMAR), y) obj-$(CONFIG_SWIOTLB) += pci-swiotlb.o endif -# The gate DSO image is built using a special linker script. -targets += gate.so gate-syms.o - -extra-y += gate.so gate-syms.o gate.lds gate.o - # fp_emulate() expects f2-f5,f16-f31 to contain the user-level state. 
CFLAGS_traps.o += -mfixed-range=f2-f5,f16-f31 -CPPFLAGS_gate.lds := -P -C -U$(ARCH) - -quiet_cmd_gate = GATE $@ - cmd_gate = $(CC) -nostdlib $(GATECFLAGS_$(@F)) -Wl,-T,$(filter-out FORCE,$^) -o $@ - -GATECFLAGS_gate.so = -shared -s -Wl,-soname=linux-gate.so.1 \ - $(call ld-option, -Wl$(comma)--hash-style=sysv) -$(obj)/gate.so: $(obj)/gate.lds $(obj)/gate.o FORCE - $(call if_changed,gate) - -$(obj)/built-in.o: $(obj)/gate-syms.o -$(obj)/built-in.o: ld_flags += -R $(obj)/gate-syms.o - -GATECFLAGS_gate-syms.o = -r -$(obj)/gate-syms.o: $(obj)/gate.lds $(obj)/gate.o FORCE - $(call if_changed,gate) - -# gate-data.o contains the gate DSO image as data in section .data.gate. -# We must build gate.so before we can assemble it. -# Note: kbuild does not track this dependency due to usage of .incbin -$(obj)/gate-data.o: $(obj)/gate.so +# The gate DSO image is built using a special linker script. +include $(srctree)/arch/ia64/kernel/Makefile.gate +# tell compiled for native +CPPFLAGS_gate.lds += -D__IA64_GATE_PARAVIRTUALIZED_NATIVE # Calculate NR_IRQ = max(IA64_NATIVE_NR_IRQS, XEN_NR_IRQS, ...) based on config define sed-y diff --git a/arch/ia64/kernel/Makefile.gate b/arch/ia64/kernel/Makefile.gate new file mode 100644 index 0000000..ee6ef03 --- /dev/null +++ b/arch/ia64/kernel/Makefile.gate @@ -0,0 +1,27 @@ +# The gate DSO image is built using a special linker script. + +Targets += gate.so gate-syms.o + +extra-y += gate.so gate-syms.o gate.lds gate.o + +CPPFLAGS_gate.lds := -P -C -U$(ARCH) + +quiet_cmd_gate = GATE $@ + cmd_gate = $(CC) -nostdlib $(GATECFLAGS_$(@F)) -Wl,-T,$(filter-out FORCE,$^) -o $@ + +GATECFLAGS_gate.so = -shared -s -Wl,-soname=linux-gate.so.1 \ + $(call ld-option, -Wl$(comma)--hash-style=sysv) +$(obj)/gate.so: $(obj)/gate.lds $(obj)/gate.o FORCE + $(call if_changed,gate) + +$(obj)/built-in.o: $(obj)/gate-syms.o +$(obj)/built-in.o: ld_flags += -R $(obj)/gate-syms.o + +GATECFLAGS_gate-syms.o = -r +$(obj)/gate-syms.o: $(obj)/gate.lds $(obj)/gate.o FORCE + $(call if_changed,gate) + +# gate-data.o contains the gate DSO image as data in section .data.gate. +# We must build gate.so before we can assemble it. +# Note: kbuild does not track this dependency due to usage of .incbin +$(obj)/gate-data.o: $(obj)/gate.so diff --git a/arch/ia64/kernel/gate.lds.S b/arch/ia64/kernel/gate.lds.S index 3cb1abc..88c64ed 100644 --- a/arch/ia64/kernel/gate.lds.S +++ b/arch/ia64/kernel/gate.lds.S @@ -7,6 +7,7 @@ #include +#include "paravirt_patchlist.h" SECTIONS { @@ -33,21 +34,21 @@ SECTIONS . 
= GATE_ADDR + 0x600; .data.patch : { - __start_gate_mckinley_e9_patchlist = .; + __paravirt_start_gate_mckinley_e9_patchlist = .; *(.data.patch.mckinley_e9) - __end_gate_mckinley_e9_patchlist = .; + __paravirt_end_gate_mckinley_e9_patchlist = .; - __start_gate_vtop_patchlist = .; + __paravirt_start_gate_vtop_patchlist = .; *(.data.patch.vtop) - __end_gate_vtop_patchlist = .; + __paravirt_end_gate_vtop_patchlist = .; - __start_gate_fsyscall_patchlist = .; + __paravirt_start_gate_fsyscall_patchlist = .; *(.data.patch.fsyscall_table) - __end_gate_fsyscall_patchlist = .; + __paravirt_end_gate_fsyscall_patchlist = .; - __start_gate_brl_fsys_bubble_down_patchlist = .; + __paravirt_start_gate_brl_fsys_bubble_down_patchlist = .; *(.data.patch.brl_fsys_bubble_down) - __end_gate_brl_fsys_bubble_down_patchlist = .; + __paravirt_end_gate_brl_fsys_bubble_down_patchlist = .; } :readable .IA_64.unwind_info : { *(.IA_64.unwind_info*) } diff --git a/arch/ia64/kernel/paravirt_patchlist.c b/arch/ia64/kernel/paravirt_patchlist.c new file mode 100644 index 0000000..bdefe9a --- /dev/null +++ b/arch/ia64/kernel/paravirt_patchlist.c @@ -0,0 +1,78 @@ +/****************************************************************************** + * Copyright (c) 2008 Isaku Yamahata + * VA Linux Systems Japan K.K. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + */ + +#include + +#define DECLARE(name) \ + extern unsigned long \ + __ia64_native_start_gate_##name##_patchlist[]; \ + extern unsigned long \ + __ia64_native_end_gate_##name##_patchlist[] + +DECLARE(fsyscall); +DECLARE(brl_fsys_bubble_down); +DECLARE(vtop); +DECLARE(mckinley_e9); + +extern unsigned long __start_gate_section[]; + +#define ASSIGN(name) \ + .start_##name##_patchlist = \ + (unsigned long)__ia64_native_start_gate_##name##_patchlist, \ + .end_##name##_patchlist = \ + (unsigned long)__ia64_native_end_gate_##name##_patchlist + +struct pv_patchdata pv_patchdata __initdata = { + ASSIGN(fsyscall), + ASSIGN(brl_fsys_bubble_down), + ASSIGN(vtop), + ASSIGN(mckinley_e9), + + .gate_section = (void*)__start_gate_section, +}; + + +unsigned long __init +paravirt_get_gate_patchlist(enum pv_gate_patchlist type) +{ + +#define CASE(NAME, name) \ + case PV_GATE_START_##NAME: \ + return pv_patchdata.start_##name##_patchlist; \ + case PV_GATE_END_##NAME: \ + return pv_patchdata.end_##name##_patchlist; \ + + switch (type) { + CASE(FSYSCALL, fsyscall); + CASE(BRL_FSYS_BUBBLE_DOWN, brl_fsys_bubble_down); + CASE(VTOP, vtop); + CASE(MCKINLEY_E9, mckinley_e9); + default: + BUG(); + break; + } + return 0; +} + +void * __init +paravirt_get_gate_section(void) +{ + return pv_patchdata.gate_section; +} diff --git a/arch/ia64/kernel/paravirt_patchlist.h b/arch/ia64/kernel/paravirt_patchlist.h new file mode 100644 index 0000000..0684aa6 --- /dev/null +++ b/arch/ia64/kernel/paravirt_patchlist.h @@ -0,0 +1,28 @@ +/****************************************************************************** + * linux/arch/ia64/xen/paravirt_patchlist.h + * + * Copyright (c) 2008 Isaku Yamahata + * VA Linux Systems Japan K.K. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + */ + +#if defined(__IA64_GATE_PARAVIRTUALIZED_XEN) +#include +#else +#include +#endif + diff --git a/arch/ia64/kernel/patch.c b/arch/ia64/kernel/patch.c index 02dd977..64c6f95 100644 --- a/arch/ia64/kernel/patch.c +++ b/arch/ia64/kernel/patch.c @@ -227,13 +227,13 @@ patch_brl_fsys_bubble_down (unsigned long start, unsigned long end) void __init ia64_patch_gate (void) { -# define START(name) ((unsigned long) __start_gate_##name##_patchlist) -# define END(name) ((unsigned long)__end_gate_##name##_patchlist) +# define START(name) paravirt_get_gate_patchlist(PV_GATE_START_##name) +# define END(name) paravirt_get_gate_patchlist(PV_GATE_END_##name) - patch_fsyscall_table(START(fsyscall), END(fsyscall)); - patch_brl_fsys_bubble_down(START(brl_fsys_bubble_down), END(brl_fsys_bubble_down)); - ia64_patch_vtop(START(vtop), END(vtop)); - ia64_patch_mckinley_e9(START(mckinley_e9), END(mckinley_e9)); + patch_fsyscall_table(START(FSYSCALL), END(FSYSCALL)); + patch_brl_fsys_bubble_down(START(BRL_FSYS_BUBBLE_DOWN), END(BRL_FSYS_BUBBLE_DOWN)); + ia64_patch_vtop(START(VTOP), END(VTOP)); + ia64_patch_mckinley_e9(START(MCKINLEY_E9), END(MCKINLEY_E9)); } void ia64_patch_phys_stack_reg(unsigned long val) diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c index 59e6851..6dfd895 100644 --- a/arch/ia64/mm/init.c +++ b/arch/ia64/mm/init.c @@ -259,6 +259,7 @@ put_kernel_page (struct page *page, unsigned long address, pgprot_t pgprot) static void __init setup_gate (void) { + void *gate_section; struct page *page; /* @@ -266,10 +267,11 @@ setup_gate (void) * headers etc. and once execute-only page to enable * privilege-promotion via "epc": */ - page = virt_to_page(ia64_imva(__start_gate_section)); + gate_section = paravirt_get_gate_section(); + page = virt_to_page(ia64_imva(gate_section)); put_kernel_page(page, GATE_ADDR, PAGE_READONLY); #ifdef HAVE_BUGGY_SEGREL - page = virt_to_page(ia64_imva(__start_gate_section + PAGE_SIZE)); + page = virt_to_page(ia64_imva(gate_section + PAGE_SIZE)); put_kernel_page(page, GATE_ADDR + PAGE_SIZE, PAGE_GATE); #else put_kernel_page(page, GATE_ADDR + PERCPU_PAGE_SIZE, PAGE_GATE); -- 1.6.0.2 From yamahata at valinux.co.jp Sun Oct 19 20:55:08 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Mon, 20 Oct 2008 12:55:08 +0900 Subject: [PATCH 06/13] ia64/pv_ops: paravirtualize mov = ar.itc. In-Reply-To: <1224474915-17171-1-git-send-email-yamahata@valinux.co.jp> References: <1224474915-17171-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224474915-17171-7-git-send-email-yamahata@valinux.co.jp> paravirtualize mov reg = ar.itc. 
Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/native/inst.h | 5 +++++ arch/ia64/kernel/entry.S | 4 ++-- arch/ia64/kernel/fsys.S | 4 ++-- arch/ia64/kernel/ivt.S | 2 +- 4 files changed, 10 insertions(+), 5 deletions(-) diff --git a/arch/ia64/include/asm/native/inst.h b/arch/ia64/include/asm/native/inst.h index 5e4e151..ad59fc6 100644 --- a/arch/ia64/include/asm/native/inst.h +++ b/arch/ia64/include/asm/native/inst.h @@ -77,6 +77,11 @@ (pred) mov reg = psr \ CLOBBER(clob) +#define MOV_FROM_ITC(pred, pred_clob, reg, clob) \ +(pred) mov reg = ar.itc \ + CLOBBER(clob) \ + CLOBBER_PRED(pred_clob) + #define MOV_TO_IFA(reg, clob) \ mov cr.ifa = reg \ CLOBBER(clob) diff --git a/arch/ia64/kernel/entry.S b/arch/ia64/kernel/entry.S index 7ef0c59..d8462bd 100644 --- a/arch/ia64/kernel/entry.S +++ b/arch/ia64/kernel/entry.S @@ -734,7 +734,7 @@ GLOBAL_ENTRY(__paravirt_leave_syscall) __paravirt_work_processed_syscall: #ifdef CONFIG_VIRT_CPU_ACCOUNTING adds r2=PT(LOADRS)+16,r12 -(pUStk) mov.m r22=ar.itc // fetch time at leave + MOV_FROM_ITC(pUStk, p9, r22, r19) // fetch time at leave adds r18=TI_FLAGS+IA64_TASK_SIZE,r13 ;; (p6) ld4 r31=[r18] // load current_thread_info()->flags @@ -983,7 +983,7 @@ GLOBAL_ENTRY(__paravirt_leave_kernel) #ifdef CONFIG_VIRT_CPU_ACCOUNTING .pred.rel.mutex pUStk,pKStk MOV_FROM_PSR(pKStk, r22, r29) // M2 read PSR now that interrupts are disabled -(pUStk) mov.m r22=ar.itc // M fetch time at leave + MOV_FROM_ITC(pUStk, p9, r22, r29) // M fetch time at leave nop.i 0 ;; #else diff --git a/arch/ia64/kernel/fsys.S b/arch/ia64/kernel/fsys.S index 3544d75..3567d54 100644 --- a/arch/ia64/kernel/fsys.S +++ b/arch/ia64/kernel/fsys.S @@ -280,7 +280,7 @@ ENTRY(fsys_gettimeofday) (p9) cmp.eq p13,p0 = 0,r30 // if mmio_ptr, clear p13 jitter control ;; .pred.rel.mutex p8,p9 -(p8) mov r2 = ar.itc // CPU_TIMER. 36 clocks latency!!! + MOV_FROM_ITC(p8, p6, r2, r10) // CPU_TIMER. 36 clocks latency!!! (p9) ld8 r2 = [r30] // MMIO_TIMER. Could also have latency issues.. (p13) ld8 r25 = [r19] // get itc_lastcycle value ld8 r9 = [r22],IA64_TIMESPEC_TV_NSEC_OFFSET // tv_sec @@ -684,7 +684,7 @@ GLOBAL_ENTRY(paravirt_fsys_bubble_down) ;; mov ar.rsc=0 // M2 set enforced lazy mode, pl 0, LE, loadrs=0 #ifdef CONFIG_VIRT_CPU_ACCOUNTING - mov.m r30=ar.itc // M get cycle for accounting + MOV_FROM_ITC(p0, p6, r30, r23) // M get cycle for accounting #else nop.m 0 #endif diff --git a/arch/ia64/kernel/ivt.S b/arch/ia64/kernel/ivt.S index f675d8e..ec9a5fd 100644 --- a/arch/ia64/kernel/ivt.S +++ b/arch/ia64/kernel/ivt.S @@ -804,7 +804,7 @@ ENTRY(break_fault) /////////////////////////////////////////////////////////////////////// st1 [r16]=r0 // M2|3 clear current->thread.on_ustack flag #ifdef CONFIG_VIRT_CPU_ACCOUNTING - mov.m r30=ar.itc // M get cycle for accounting + MOV_FROM_ITC(p0, p14, r30, r18) // M get cycle for accounting #else mov b6=r30 // I0 setup syscall handler branch reg early #endif -- 1.6.0.2 From yamahata at valinux.co.jp Sun Oct 19 20:55:12 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Mon, 20 Oct 2008 12:55:12 +0900 Subject: [PATCH 10/13] ia64/pv_ops/xen: define xen specific gate page. In-Reply-To: <1224474915-17171-1-git-send-email-yamahata@valinux.co.jp> References: <1224474915-17171-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224474915-17171-11-git-send-email-yamahata@valinux.co.jp> define xen specific gate page. At this phase bits in the gate page is same to native. At the next phase, it will be paravirtualized. 
Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/xen/patchlist.h | 38 +++++++++++++++++++++++++++++++++ arch/ia64/kernel/vmlinux.lds.S | 6 +++++ arch/ia64/xen/Makefile | 16 +++++++++++++- arch/ia64/xen/gate-data.S | 3 ++ arch/ia64/xen/xen_pv_ops.c | 32 +++++++++++++++++++++++++++ 5 files changed, 94 insertions(+), 1 deletions(-) create mode 100644 arch/ia64/include/asm/xen/patchlist.h create mode 100644 arch/ia64/xen/gate-data.S diff --git a/arch/ia64/include/asm/xen/patchlist.h b/arch/ia64/include/asm/xen/patchlist.h new file mode 100644 index 0000000..eae944e --- /dev/null +++ b/arch/ia64/include/asm/xen/patchlist.h @@ -0,0 +1,38 @@ +/****************************************************************************** + * arch/ia64/include/asm/xen/patchlist.h + * + * Copyright (c) 2008 Isaku Yamahata + * VA Linux Systems Japan K.K. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + */ + +#define __paravirt_start_gate_fsyscall_patchlist \ + __xen_start_gate_fsyscall_patchlist +#define __paravirt_end_gate_fsyscall_patchlist \ + __xen_end_gate_fsyscall_patchlist +#define __paravirt_start_gate_brl_fsys_bubble_down_patchlist \ + __xen_start_gate_brl_fsys_bubble_down_patchlist +#define __paravirt_end_gate_brl_fsys_bubble_down_patchlist \ + __xen_end_gate_brl_fsys_bubble_down_patchlist +#define __paravirt_start_gate_vtop_patchlist \ + __xen_start_gate_vtop_patchlist +#define __paravirt_end_gate_vtop_patchlist \ + __xen_end_gate_vtop_patchlist +#define __paravirt_start_gate_mckinley_e9_patchlist \ + __xen_start_gate_mckinley_e9_patchlist +#define __paravirt_end_gate_mckinley_e9_patchlist \ + __xen_end_gate_mckinley_e9_patchlist diff --git a/arch/ia64/kernel/vmlinux.lds.S b/arch/ia64/kernel/vmlinux.lds.S index 10a7d47..92ae7e8 100644 --- a/arch/ia64/kernel/vmlinux.lds.S +++ b/arch/ia64/kernel/vmlinux.lds.S @@ -201,6 +201,12 @@ SECTIONS __start_gate_section = .; *(.data.gate) __stop_gate_section = .; +#ifdef CONFIG_XEN + . = ALIGN(PAGE_SIZE); + __xen_start_gate_section = .; + *(.data.gate.xen) + __xen_stop_gate_section = .; +#endif } . = ALIGN(PAGE_SIZE); /* make sure the gate page doesn't expose * kernel data diff --git a/arch/ia64/xen/Makefile b/arch/ia64/xen/Makefile index b4ca2e6..94f0d8e 100644 --- a/arch/ia64/xen/Makefile +++ b/arch/ia64/xen/Makefile @@ -3,10 +3,24 @@ # obj-y := hypercall.o xenivt.o xensetup.o xen_pv_ops.o irq_xen.o \ - hypervisor.o xencomm.o xcom_hcall.o grant-table.o time.o suspend.o + hypervisor.o xencomm.o xcom_hcall.o grant-table.o time.o suspend.o \ + gate-data.o obj-$(CONFIG_IA64_GENERIC) += machvec.o +# The gate DSO image is built using a special linker script. +include $(srctree)/arch/ia64/kernel/Makefile.gate + +# tell compiled for xen +CPPFLAGS_gate.lds += -D__IA64_GATE_PARAVIRTUALIZED_XEN + +# use same file of native. 
+$(obj)/gate.o: $(src)/../kernel/gate.S FORCE + $(call if_changed_dep,as_o_S) +$(obj)/gate.lds: $(src)/../kernel/gate.lds.S FORCE + $(call if_changed_dep,cpp_lds_S) + + AFLAGS_xenivt.o += -D__IA64_ASM_PARAVIRTUALIZED_XEN # xen multi compile diff --git a/arch/ia64/xen/gate-data.S b/arch/ia64/xen/gate-data.S new file mode 100644 index 0000000..7d4830a --- /dev/null +++ b/arch/ia64/xen/gate-data.S @@ -0,0 +1,3 @@ + .section .data.gate.xen, "aw" + + .incbin "arch/ia64/xen/gate.so" diff --git a/arch/ia64/xen/xen_pv_ops.c b/arch/ia64/xen/xen_pv_ops.c index 2515d8f..f53b48c 100644 --- a/arch/ia64/xen/xen_pv_ops.c +++ b/arch/ia64/xen/xen_pv_ops.c @@ -179,6 +179,37 @@ struct pv_fsys_data xen_fsys_data __initdata = { }; /*************************************************************************** + * pv_patchdata + * patchdata addresses + */ + +#define DECLARE(name) \ + extern unsigned long __xen_start_gate_##name##_patchlist[]; \ + extern unsigned long __xen_end_gate_##name##_patchlist[] + +DECLARE(fsyscall); +DECLARE(brl_fsys_bubble_down); +DECLARE(vtop); +DECLARE(mckinley_e9); + +extern unsigned long __xen_start_gate_section[]; + +#define ASSIGN(name) \ + .start_##name##_patchlist = \ + (unsigned long)__xen_start_gate_##name##_patchlist, \ + .end_##name##_patchlist = \ + (unsigned long)__xen_end_gate_##name##_patchlist + +static struct pv_patchdata xen_patchdata __initdata = { + ASSIGN(fsyscall), + ASSIGN(brl_fsys_bubble_down), + ASSIGN(vtop), + ASSIGN(mckinley_e9), + + .gate_section = (void*)__xen_start_gate_section, +}; + +/*************************************************************************** * pv_cpu_ops * intrinsics hooks. */ @@ -447,6 +478,7 @@ xen_setup_pv_ops(void) pv_info = xen_info; pv_init_ops = xen_init_ops; pv_fsys_data = xen_fsys_data; + pv_patchdata = xen_patchdata; pv_cpu_ops = xen_cpu_ops; pv_iosapic_ops = xen_iosapic_ops; pv_irq_ops = xen_irq_ops; -- 1.6.0.2 From yamahata at valinux.co.jp Sun Oct 19 20:55:10 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Mon, 20 Oct 2008 12:55:10 +0900 Subject: [PATCH 08/13] ia64/pv_ops/xen/pv_time_ops: implement sched_clock. In-Reply-To: <1224474915-17171-1-git-send-email-yamahata@valinux.co.jp> References: <1224474915-17171-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224474915-17171-9-git-send-email-yamahata@valinux.co.jp> paravirtualize sched_clock. Signed-off-by: Isaku Yamahata --- arch/ia64/xen/Kconfig | 1 + arch/ia64/xen/time.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 49 insertions(+), 0 deletions(-) diff --git a/arch/ia64/xen/Kconfig b/arch/ia64/xen/Kconfig index f1683a2..48839da 100644 --- a/arch/ia64/xen/Kconfig +++ b/arch/ia64/xen/Kconfig @@ -8,6 +8,7 @@ config XEN depends on PARAVIRT && MCKINLEY && IA64_PAGE_SIZE_16KB && EXPERIMENTAL select XEN_XENCOMM select NO_IDLE_HZ + select HAVE_UNSTABLE_SCHED_CLOCK # those are required to save/restore. select ARCH_SUSPEND_POSSIBLE diff --git a/arch/ia64/xen/time.c b/arch/ia64/xen/time.c index d15a94c..c85d319 100644 --- a/arch/ia64/xen/time.c +++ b/arch/ia64/xen/time.c @@ -175,10 +175,58 @@ static void xen_itc_jitter_data_reset(void) } while (unlikely(ret != lcycle)); } +/* based on xen_sched_clock() in arch/x86/xen/time.c. */ +/* + * This relies on HAVE_UNSTABLE_SCHED_CLOCK. If it can't be defined, + * something similar logic should be implemented here. + */ +/* + * Xen sched_clock implementation. Returns the number of unstolen + * nanoseconds, which is nanoseconds the VCPU spent in RUNNING+BLOCKED + * states. 
+ */ +static unsigned long long xen_sched_clock(void) +{ + struct vcpu_runstate_info runstate; + + unsigned long long now; + unsigned long long offset; + unsigned long long ret; + + /* + * Ideally sched_clock should be called on a per-cpu basis + * anyway, so preempt should already be disabled, but that's + * not current practice at the moment. + */ + preempt_disable(); + + /* + * both ia64_native_sched_clock() and xen's runstate are + * based on mAR.ITC. So difference of them makes sense. + */ + now = ia64_native_sched_clock(); + + get_runstate_snapshot(&runstate); + + WARN_ON(runstate.state != RUNSTATE_running); + + offset = 0; + if (now > runstate.state_entry_time) + offset = now - runstate.state_entry_time; + ret = runstate.time[RUNSTATE_blocked] + + runstate.time[RUNSTATE_running] + + offset; + + preempt_enable(); + + return ret; +} + struct pv_time_ops xen_time_ops __initdata = { .init_missing_ticks_accounting = xen_init_missing_ticks_accounting, .do_steal_accounting = xen_do_steal_accounting, .clocksource_resume = xen_itc_jitter_data_reset, + .sched_clock = xen_sched_clock, }; /* Called after suspend, to resume time. */ -- 1.6.0.2 From yamahata at valinux.co.jp Sun Oct 19 20:55:09 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Mon, 20 Oct 2008 12:55:09 +0900 Subject: [PATCH 07/13] ia64/pv_ops/pv_time_ops: add sched_clock hook. In-Reply-To: <1224474915-17171-1-git-send-email-yamahata@valinux.co.jp> References: <1224474915-17171-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224474915-17171-8-git-send-email-yamahata@valinux.co.jp> add sched_clock() hook to paravirtualize sched_clock(). ia64 sched_clock() is based on ar.itc which isn't stable on virtualized environment because vcpu may move around on pcpus. So it needs paravirtualization. Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/paravirt.h | 7 +++++++ arch/ia64/include/asm/timex.h | 1 + arch/ia64/kernel/head.S | 4 ++-- arch/ia64/kernel/paravirt.c | 1 + arch/ia64/kernel/time.c | 12 ++++++++++++ 5 files changed, 23 insertions(+), 2 deletions(-) diff --git a/arch/ia64/include/asm/paravirt.h b/arch/ia64/include/asm/paravirt.h index 56f69f9..a73e77a 100644 --- a/arch/ia64/include/asm/paravirt.h +++ b/arch/ia64/include/asm/paravirt.h @@ -225,6 +225,8 @@ struct pv_time_ops { int (*do_steal_accounting)(unsigned long *new_itm); void (*clocksource_resume)(void); + + unsigned long long (*sched_clock)(void); }; extern struct pv_time_ops pv_time_ops; @@ -242,6 +244,11 @@ paravirt_do_steal_accounting(unsigned long *new_itm) return pv_time_ops.do_steal_accounting(new_itm); } +static inline unsigned long long paravirt_sched_clock(void) +{ + return pv_time_ops.sched_clock(); +} + #endif /* !__ASSEMBLY__ */ #else diff --git a/arch/ia64/include/asm/timex.h b/arch/ia64/include/asm/timex.h index 4e03cfe..86c7db8 100644 --- a/arch/ia64/include/asm/timex.h +++ b/arch/ia64/include/asm/timex.h @@ -40,5 +40,6 @@ get_cycles (void) } extern void ia64_cpu_local_tick (void); +extern unsigned long long ia64_native_sched_clock (void); #endif /* _ASM_IA64_TIMEX_H */ diff --git a/arch/ia64/kernel/head.S b/arch/ia64/kernel/head.S index 66e491d..ca1336a 100644 --- a/arch/ia64/kernel/head.S +++ b/arch/ia64/kernel/head.S @@ -1050,7 +1050,7 @@ END(ia64_delay_loop) * except that the multiplication and the shift are done with 128-bit * intermediate precision so that we can produce a full 64-bit result. 
*/ -GLOBAL_ENTRY(sched_clock) +GLOBAL_ENTRY(ia64_native_sched_clock) addl r8=THIS_CPU(cpu_info) + IA64_CPUINFO_NSEC_PER_CYC_OFFSET,r0 mov.m r9=ar.itc // fetch cycle-counter (35 cyc) ;; @@ -1066,7 +1066,7 @@ GLOBAL_ENTRY(sched_clock) ;; shrp r8=r9,r8,IA64_NSEC_PER_CYC_SHIFT br.ret.sptk.many rp -END(sched_clock) +END(ia64_native_sched_clock) #ifdef CONFIG_VIRT_CPU_ACCOUNTING GLOBAL_ENTRY(cycle_to_cputime) diff --git a/arch/ia64/kernel/paravirt.c b/arch/ia64/kernel/paravirt.c index de35d8e..6ebbcc1 100644 --- a/arch/ia64/kernel/paravirt.c +++ b/arch/ia64/kernel/paravirt.c @@ -366,4 +366,5 @@ ia64_native_do_steal_accounting(unsigned long *new_itm) struct pv_time_ops pv_time_ops = { .do_steal_accounting = ia64_native_do_steal_accounting, + .sched_clock = ia64_native_sched_clock, }; diff --git a/arch/ia64/kernel/time.c b/arch/ia64/kernel/time.c index 65c10a4..6f6ca42 100644 --- a/arch/ia64/kernel/time.c +++ b/arch/ia64/kernel/time.c @@ -50,6 +50,18 @@ EXPORT_SYMBOL(last_cli_ip); #endif #ifdef CONFIG_PARAVIRT +/* We need to define a real function for sched_clock, to override the + weak default version */ +unsigned long long sched_clock(void) +{ + return paravirt_sched_clock(); +} +#else +unsigned long long +sched_clock(void) __attribute__((alias("ia64_native_sched_clock"))); +#endif + +#ifdef CONFIG_PARAVIRT static void paravirt_clocksource_resume(void) { -- 1.6.0.2 From yamahata at valinux.co.jp Sun Oct 19 20:55:05 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Mon, 20 Oct 2008 12:55:05 +0900 Subject: [PATCH 03/13] ia64/pv_ops: paravirtualize fsys.S. In-Reply-To: <1224474915-17171-1-git-send-email-yamahata@valinux.co.jp> References: <1224474915-17171-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224474915-17171-4-git-send-email-yamahata@valinux.co.jp> paravirtualize fsys.S. Signed-off-by: Isaku Yamahata --- arch/ia64/kernel/fsys.S | 14 +++++++------- 1 files changed, 7 insertions(+), 7 deletions(-) diff --git a/arch/ia64/kernel/fsys.S b/arch/ia64/kernel/fsys.S index 788319f..3544d75 100644 --- a/arch/ia64/kernel/fsys.S +++ b/arch/ia64/kernel/fsys.S @@ -419,7 +419,7 @@ EX(.fail_efault, ld8 r14=[r33]) // r14 <- *set mov r17=(1 << (SIGKILL - 1)) | (1 << (SIGSTOP - 1)) ;; - rsm psr.i // mask interrupt delivery + RSM_PSR_I(p0, r18, r19) // mask interrupt delivery mov ar.ccv=0 andcm r14=r14,r17 // filter out SIGKILL & SIGSTOP @@ -492,7 +492,7 @@ EX(.fail_efault, ld8 r14=[r33]) // r14 <- *set #ifdef CONFIG_SMP st4.rel [r31]=r0 // release the lock #endif - ssm psr.i + SSM_PSR_I(p0, p9, r31) ;; srlz.d // ensure psr.i is set again @@ -514,7 +514,7 @@ EX(.fail_efault, (p15) st8 [r34]=r3) #ifdef CONFIG_SMP st4.rel [r31]=r0 // release the lock #endif - ssm psr.i + SSM_PSR_I(p0, p9, r17) ;; srlz.d br.sptk.many fsys_fallback_syscall // with signal pending, do the heavy-weight syscall @@ -522,7 +522,7 @@ EX(.fail_efault, (p15) st8 [r34]=r3) #ifdef CONFIG_SMP .lock_contention: /* Rather than spinning here, fall back on doing a heavy-weight syscall. 
*/ - ssm psr.i + SSM_PSR_I(p0, p9, r17) ;; srlz.d br.sptk.many fsys_fallback_syscall @@ -593,11 +593,11 @@ ENTRY(fsys_fallback_syscall) adds r17=-1024,r15 movl r14=sys_call_table ;; - rsm psr.i + RSM_PSR_I(p0, r26, r27) shladd r18=r17,3,r14 ;; ld8 r18=[r18] // load normal (heavy-weight) syscall entry-point - mov r29=psr // read psr (12 cyc load latency) + MOV_FROM_PSR(p0, r29, r26) // read psr (12 cyc load latency) mov r27=ar.rsc mov r21=ar.fpsr mov r26=ar.pfs @@ -735,7 +735,7 @@ GLOBAL_ENTRY(paravirt_fsys_bubble_down) mov rp=r14 // I0 set the real return addr and r3=_TIF_SYSCALL_TRACEAUDIT,r3 // A ;; - ssm psr.i // M2 we're on kernel stacks now, reenable irqs + SSM_PSR_I(p0, p6, r22) // M2 we're on kernel stacks now, reenable irqs cmp.eq p8,p0=r3,r0 // A (p10) br.cond.spnt.many ia64_ret_from_syscall // B return if bad call-frame or r15 is a NaT -- 1.6.0.2 From yamahata at valinux.co.jp Sun Oct 19 20:55:13 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Mon, 20 Oct 2008 12:55:13 +0900 Subject: [PATCH 11/13] ia64/pv_ops: move down __kernel_syscall_via_epc. In-Reply-To: <1224474915-17171-1-git-send-email-yamahata@valinux.co.jp> References: <1224474915-17171-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224474915-17171-12-git-send-email-yamahata@valinux.co.jp> Move down __kernel_syscall_via_epc to the end of the page. We want to paravirtualize only __kernel_syscall_via_epc because it includes privileged instructions. Its paravirtualization increases its symbols size. On the other hand, each paravirtualized gate must have e symbols of same value and size to native's because the page is mapped to GATE_ADDR and GATE_ADDR + PERCPU_PAGE_SIZE and vmlinux is linked to those symbols. Later to have the same symbol size, we pads NOPs at the end of __kernel_syscall_via_epc. Move it after other functions to keep symbols of other functions have same values and sizes. Signed-off-by: Isaku Yamahata --- arch/ia64/kernel/gate.S | 162 +++++++++++++++++++++++----------------------- 1 files changed, 81 insertions(+), 81 deletions(-) diff --git a/arch/ia64/kernel/gate.S b/arch/ia64/kernel/gate.S index 74b1ccc..c957228 100644 --- a/arch/ia64/kernel/gate.S +++ b/arch/ia64/kernel/gate.S @@ -48,87 +48,6 @@ GLOBAL_ENTRY(__kernel_syscall_via_break) } END(__kernel_syscall_via_break) -/* - * On entry: - * r11 = saved ar.pfs - * r15 = system call # - * b0 = saved return address - * b6 = return address - * On exit: - * r11 = saved ar.pfs - * r15 = system call # - * b0 = saved return address - * all other "scratch" registers: undefined - * all "preserved" registers: same as on entry - */ - -GLOBAL_ENTRY(__kernel_syscall_via_epc) - .prologue - .altrp b6 - .body -{ - /* - * Note: the kernel cannot assume that the first two instructions in this - * bundle get executed. The remaining code must be safe even if - * they do not get executed. - */ - adds r17=-1024,r15 // A - mov r10=0 // A default to successful syscall execution - epc // B causes split-issue -} - ;; - rsm psr.be | psr.i // M2 (5 cyc to srlz.d) - LOAD_FSYSCALL_TABLE(r14) // X - ;; - mov r16=IA64_KR(CURRENT) // M2 (12 cyc) - shladd r18=r17,3,r14 // A - mov r19=NR_syscalls-1 // A - ;; - lfetch [r18] // M0|1 - mov r29=psr // M2 (12 cyc) - // If r17 is a NaT, p6 will be zero - cmp.geu p6,p7=r19,r17 // A (sysnr > 0 && sysnr < 1024+NR_syscalls)? - ;; - mov r21=ar.fpsr // M2 (12 cyc) - tnat.nz p10,p9=r15 // I0 - mov.i r26=ar.pfs // I0 (would stall anyhow due to srlz.d...) 
- ;; - srlz.d // M0 (forces split-issue) ensure PSR.BE==0 -(p6) ld8 r18=[r18] // M0|1 - nop.i 0 - ;; - nop.m 0 -(p6) tbit.z.unc p8,p0=r18,0 // I0 (dual-issues with "mov b7=r18"!) - nop.i 0 - ;; -(p8) ssm psr.i -(p6) mov b7=r18 // I0 -(p8) br.dptk.many b7 // B - - mov r27=ar.rsc // M2 (12 cyc) -/* - * brl.cond doesn't work as intended because the linker would convert this branch - * into a branch to a PLT. Perhaps there will be a way to avoid this with some - * future version of the linker. In the meantime, we just use an indirect branch - * instead. - */ -#ifdef CONFIG_ITANIUM -(p6) add r14=-8,r14 // r14 <- addr of fsys_bubble_down entry - ;; -(p6) ld8 r14=[r14] // r14 <- fsys_bubble_down - ;; -(p6) mov b7=r14 -(p6) br.sptk.many b7 -#else - BRL_COND_FSYS_BUBBLE_DOWN(p6) -#endif - ssm psr.i - mov r10=-1 -(p10) mov r8=EINVAL -(p9) mov r8=ENOSYS - FSYS_RETURN -END(__kernel_syscall_via_epc) - # define ARG0_OFF (16 + IA64_SIGFRAME_ARG0_OFFSET) # define ARG1_OFF (16 + IA64_SIGFRAME_ARG1_OFFSET) # define ARG2_OFF (16 + IA64_SIGFRAME_ARG2_OFFSET) @@ -374,3 +293,84 @@ restore_rbs: // invala not necessary as that will happen when returning to user-mode br.cond.sptk back_from_restore_rbs END(__kernel_sigtramp) + +/* + * On entry: + * r11 = saved ar.pfs + * r15 = system call # + * b0 = saved return address + * b6 = return address + * On exit: + * r11 = saved ar.pfs + * r15 = system call # + * b0 = saved return address + * all other "scratch" registers: undefined + * all "preserved" registers: same as on entry + */ + +GLOBAL_ENTRY(__kernel_syscall_via_epc) + .prologue + .altrp b6 + .body +{ + /* + * Note: the kernel cannot assume that the first two instructions in this + * bundle get executed. The remaining code must be safe even if + * they do not get executed. + */ + adds r17=-1024,r15 // A + mov r10=0 // A default to successful syscall execution + epc // B causes split-issue +} + ;; + rsm psr.be | psr.i // M2 (5 cyc to srlz.d) + LOAD_FSYSCALL_TABLE(r14) // X + ;; + mov r16=IA64_KR(CURRENT) // M2 (12 cyc) + shladd r18=r17,3,r14 // A + mov r19=NR_syscalls-1 // A + ;; + lfetch [r18] // M0|1 + mov r29=psr // M2 (12 cyc) + // If r17 is a NaT, p6 will be zero + cmp.geu p6,p7=r19,r17 // A (sysnr > 0 && sysnr < 1024+NR_syscalls)? + ;; + mov r21=ar.fpsr // M2 (12 cyc) + tnat.nz p10,p9=r15 // I0 + mov.i r26=ar.pfs // I0 (would stall anyhow due to srlz.d...) + ;; + srlz.d // M0 (forces split-issue) ensure PSR.BE==0 +(p6) ld8 r18=[r18] // M0|1 + nop.i 0 + ;; + nop.m 0 +(p6) tbit.z.unc p8,p0=r18,0 // I0 (dual-issues with "mov b7=r18"!) + nop.i 0 + ;; +(p8) ssm psr.i +(p6) mov b7=r18 // I0 +(p8) br.dptk.many b7 // B + + mov r27=ar.rsc // M2 (12 cyc) +/* + * brl.cond doesn't work as intended because the linker would convert this branch + * into a branch to a PLT. Perhaps there will be a way to avoid this with some + * future version of the linker. In the meantime, we just use an indirect branch + * instead. 
+ */ +#ifdef CONFIG_ITANIUM +(p6) add r14=-8,r14 // r14 <- addr of fsys_bubble_down entry + ;; +(p6) ld8 r14=[r14] // r14 <- fsys_bubble_down + ;; +(p6) mov b7=r14 +(p6) br.sptk.many b7 +#else + BRL_COND_FSYS_BUBBLE_DOWN(p6) +#endif + ssm psr.i + mov r10=-1 +(p10) mov r8=EINVAL +(p9) mov r8=ENOSYS + FSYS_RETURN +END(__kernel_syscall_via_epc) -- 1.6.0.2 From yamahata at valinux.co.jp Sun Oct 19 20:55:15 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Mon, 20 Oct 2008 12:55:15 +0900 Subject: [PATCH 13/13] ia64/pv_ops/xen/gate.S: xen gate page paravirtualization In-Reply-To: <1224474915-17171-1-git-send-email-yamahata@valinux.co.jp> References: <1224474915-17171-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224474915-17171-14-git-send-email-yamahata@valinux.co.jp> xen gate page paravirtualization Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/xen/inst.h | 4 ++++ arch/ia64/xen/Makefile | 1 + 2 files changed, 5 insertions(+), 0 deletions(-) diff --git a/arch/ia64/include/asm/xen/inst.h b/arch/ia64/include/asm/xen/inst.h index 90537dc..c53a476 100644 --- a/arch/ia64/include/asm/xen/inst.h +++ b/arch/ia64/include/asm/xen/inst.h @@ -386,6 +386,10 @@ #define RSM_PSR_DT \ XEN_HYPER_RSM_PSR_DT +#define RSM_PSR_BE_I(clob0, clob1) \ + RSM_PSR_I(p0, clob0, clob1); \ + rum psr.be + #define SSM_PSR_DT_AND_SRLZ_I \ XEN_HYPER_SSM_PSR_DT diff --git a/arch/ia64/xen/Makefile b/arch/ia64/xen/Makefile index 94f0d8e..e6f4a0a 100644 --- a/arch/ia64/xen/Makefile +++ b/arch/ia64/xen/Makefile @@ -13,6 +13,7 @@ include $(srctree)/arch/ia64/kernel/Makefile.gate # tell compiled for xen CPPFLAGS_gate.lds += -D__IA64_GATE_PARAVIRTUALIZED_XEN +AFLAGS_gate.o += -D__IA64_ASM_PARAVIRTUALIZED_XEN -D__IA64_GATE_PARAVIRTUALIZED_XEN # use same file of native. $(obj)/gate.o: $(src)/../kernel/gate.S FORCE -- 1.6.0.2 From yamahata at valinux.co.jp Sun Oct 19 20:55:04 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Mon, 20 Oct 2008 12:55:04 +0900 Subject: [PATCH 02/13] ia64/pv_ops/xen: preliminary to paravirtualizing fsys.S for xen. In-Reply-To: <1224474915-17171-1-git-send-email-yamahata@valinux.co.jp> References: <1224474915-17171-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224474915-17171-3-git-send-email-yamahata@valinux.co.jp> This is a preliminary patch to paravirtualizing fsys.S. compile fsys.S twice one for native and one for xen, and switch them at run tine. Later fsys.S will be paravirtualized. 
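A short illustrative sketch of the run-time switch (not part of the patch; the symbol and
function names below are made up):

	/* Sketch: pick one of the two assembled fsys.S variants at boot.
	 * 'running_on_xen', 'native_fsyscall_table' and 'xen_fsyscall_table'
	 * are illustrative stand-ins for the real symbols.
	 */
	extern unsigned long native_fsyscall_table[];
	extern unsigned long xen_fsyscall_table[];

	static unsigned long *fsyscall_table_in_use;

	static void pick_fsys_variant(int running_on_xen)
	{
		fsyscall_table_in_use = running_on_xen ?
			xen_fsyscall_table : native_fsyscall_table;
	}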
Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/xen/inst.h | 3 +++ arch/ia64/xen/Makefile | 2 +- arch/ia64/xen/xen_pv_ops.c | 14 ++++++++++++++ 3 files changed, 18 insertions(+), 1 deletions(-) diff --git a/arch/ia64/include/asm/xen/inst.h b/arch/ia64/include/asm/xen/inst.h index 19c2ae1..e8e01b2 100644 --- a/arch/ia64/include/asm/xen/inst.h +++ b/arch/ia64/include/asm/xen/inst.h @@ -33,6 +33,9 @@ #define __paravirt_work_processed_syscall_target \ xen_work_processed_syscall +#define paravirt_fsyscall_table xen_fsyscall_table +#define paravirt_fsys_bubble_down xen_fsys_bubble_down + #define MOV_FROM_IFA(reg) \ movl reg = XSI_IFA; \ ;; \ diff --git a/arch/ia64/xen/Makefile b/arch/ia64/xen/Makefile index 0ad0224..b4ca2e6 100644 --- a/arch/ia64/xen/Makefile +++ b/arch/ia64/xen/Makefile @@ -10,7 +10,7 @@ obj-$(CONFIG_IA64_GENERIC) += machvec.o AFLAGS_xenivt.o += -D__IA64_ASM_PARAVIRTUALIZED_XEN # xen multi compile -ASM_PARAVIRT_MULTI_COMPILE_SRCS = ivt.S entry.S +ASM_PARAVIRT_MULTI_COMPILE_SRCS = ivt.S entry.S fsys.S ASM_PARAVIRT_OBJS = $(addprefix xen-,$(ASM_PARAVIRT_MULTI_COMPILE_SRCS:.S=.o)) obj-y += $(ASM_PARAVIRT_OBJS) define paravirtualized_xen diff --git a/arch/ia64/xen/xen_pv_ops.c b/arch/ia64/xen/xen_pv_ops.c index 04cd123..17eed4f 100644 --- a/arch/ia64/xen/xen_pv_ops.c +++ b/arch/ia64/xen/xen_pv_ops.c @@ -24,6 +24,7 @@ #include #include #include +#include #include #include @@ -166,6 +167,18 @@ static const struct pv_init_ops xen_init_ops __initdata = { }; /*************************************************************************** + * pv_fsys_data + * addresses for fsys + */ + +extern unsigned long xen_fsyscall_table[NR_syscalls]; +extern char xen_fsys_bubble_down[]; +struct pv_fsys_data xen_fsys_data __initdata = { + .fsyscall_table = (unsigned long *)xen_fsyscall_table, + .fsys_bubble_down = (void *)xen_fsys_bubble_down, +}; + +/*************************************************************************** * pv_cpu_ops * intrinsics hooks. */ @@ -355,6 +368,7 @@ xen_setup_pv_ops(void) xen_info_init(); pv_info = xen_info; pv_init_ops = xen_init_ops; + pv_fsys_data = xen_fsys_data; pv_cpu_ops = xen_cpu_ops; pv_iosapic_ops = xen_iosapic_ops; pv_irq_ops = xen_irq_ops; -- 1.6.0.2 From yamahata at valinux.co.jp Sun Oct 19 20:55:02 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Mon, 20 Oct 2008 12:55:02 +0900 Subject: [PATCH 00/13] ia64/pv_ops, xen: more paravirtualization. Message-ID: <1224474915-17171-1-git-send-email-yamahata@valinux.co.jp> This patchset is for more paravirtualization on ia64/pv_ops. - paravirtualize fsys call (fsys.S) by multi compile - paravirtualize gate page (gate.S) by multi compile - support save/restore For this purpose, the followings needs to be paravirtualized - ar.itc instruction - sched_clock() This is because timer may changed before/after saving/restoring. 
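An illustrative sketch of the idea, not taken from the patches themselves: the guest's view
of the cycle counter is the host counter plus a per-domain offset, clamped so it never runs
backwards after a save/restore. All names below are made up for illustration.

	/* host_itc stands in for mAR.ITC, itc_offset is recomputed on
	 * restore, itc_last is the last value already handed to the guest.
	 */
	static unsigned long host_itc;
	static unsigned long itc_offset;
	static unsigned long itc_last;

	static unsigned long guest_read_itc(void)
	{
		unsigned long now = host_itc + itc_offset;

		if (now <= itc_last)		/* would go backwards */
			now = itc_last + 1;	/* keep the guest view monotonic */
		itc_last = now;
		return now;
	}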
For convenience the working full source is available from http://people.valinux.co.jp/~yamahata/xen-ia64/for_eagl/linux-2.6-ia64-pv-ops.git/ branch: ia64-pv-ops-2008oct20-xen-ia64-optimized-domu For the status of this patch series http://wiki.xensource.com/xenwiki/XenIA64/UpstreamMerge thanks, Diffstat: arch/ia64/include/asm/native/inst.h | 13 ++ arch/ia64/include/asm/native/patchlist.h | 38 ++++++ arch/ia64/include/asm/native/pvchk_inst.h | 8 + arch/ia64/include/asm/paravirt.h | 57 +++++++++ arch/ia64/include/asm/timex.h | 1 arch/ia64/include/asm/xen/inst.h | 28 ++++ arch/ia64/include/asm/xen/interface.h | 9 + arch/ia64/include/asm/xen/minstate.h | 11 + arch/ia64/include/asm/xen/patchlist.h | 38 ++++++ arch/ia64/include/asm/xen/privop.h | 2 arch/ia64/kernel/Makefile | 36 +----- arch/ia64/kernel/Makefile.gate | 27 ++++ arch/ia64/kernel/asm-offsets.c | 2 arch/ia64/kernel/entry.S | 4 arch/ia64/kernel/fsys.S | 35 +++-- arch/ia64/kernel/gate.S | 179 +++++++++++++++--------------- arch/ia64/kernel/gate.lds.S | 17 +- arch/ia64/kernel/head.S | 4 arch/ia64/kernel/ivt.S | 2 arch/ia64/kernel/paravirt.c | 1 arch/ia64/kernel/paravirt_patchlist.c | 78 +++++++++++++ arch/ia64/kernel/paravirt_patchlist.h | 28 ++++ arch/ia64/kernel/patch.c | 38 ++++-- arch/ia64/kernel/time.c | 12 ++ arch/ia64/kernel/vmlinux.lds.S | 6 + arch/ia64/mm/init.c | 8 - arch/ia64/scripts/pvcheck.sed | 1 arch/ia64/xen/Kconfig | 1 arch/ia64/xen/Makefile | 19 ++- arch/ia64/xen/gate-data.S | 3 arch/ia64/xen/time.c | 48 ++++++++ arch/ia64/xen/xen_pv_ops.c | 126 ++++++++++++++++++++- 32 files changed, 720 insertions(+), 160 deletions(-) From yamahata at valinux.co.jp Sun Oct 19 20:55:14 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Mon, 20 Oct 2008 12:55:14 +0900 Subject: [PATCH 12/13] ia64/pv_ops: paravirtualize gate.S. In-Reply-To: <1224474915-17171-1-git-send-email-yamahata@valinux.co.jp> References: <1224474915-17171-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224474915-17171-13-git-send-email-yamahata@valinux.co.jp> paravirtualize gate.S. Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/native/inst.h | 5 +++++ arch/ia64/include/asm/native/pvchk_inst.h | 3 +++ arch/ia64/kernel/gate.S | 17 +++++++++++++---- 3 files changed, 21 insertions(+), 4 deletions(-) diff --git a/arch/ia64/include/asm/native/inst.h b/arch/ia64/include/asm/native/inst.h index ad59fc6..d2d46ef 100644 --- a/arch/ia64/include/asm/native/inst.h +++ b/arch/ia64/include/asm/native/inst.h @@ -166,6 +166,11 @@ #define RSM_PSR_DT \ rsm psr.dt +#define RSM_PSR_BE_I(clob0, clob1) \ + rsm psr.be | psr.i \ + CLOBBER(clob0) \ + CLOBBER(clob1) + #define SSM_PSR_DT_AND_SRLZ_I \ ssm psr.dt \ ;; \ diff --git a/arch/ia64/include/asm/native/pvchk_inst.h b/arch/ia64/include/asm/native/pvchk_inst.h index 13b289e..8d72962 100644 --- a/arch/ia64/include/asm/native/pvchk_inst.h +++ b/arch/ia64/include/asm/native/pvchk_inst.h @@ -251,6 +251,9 @@ IS_RREG_CLOB(clob2) #define RSM_PSR_DT \ nop 0 +#define RSM_PSR_BE_I(clob0, clob1) \ + IS_RREG_CLOB(clob0) \ + IS_RREG_CLOB(clob1) #define SSM_PSR_DT_AND_SRLZ_I \ nop 0 #define BSW_0(clob0, clob1, clob2) \ diff --git a/arch/ia64/kernel/gate.S b/arch/ia64/kernel/gate.S index c957228..cf5e0a1 100644 --- a/arch/ia64/kernel/gate.S +++ b/arch/ia64/kernel/gate.S @@ -13,6 +13,7 @@ #include #include #include +#include "paravirt_inst.h" /* * We can't easily refer to symbols inside the kernel. 
To avoid full runtime relocation, @@ -323,7 +324,7 @@ GLOBAL_ENTRY(__kernel_syscall_via_epc) epc // B causes split-issue } ;; - rsm psr.be | psr.i // M2 (5 cyc to srlz.d) + RSM_PSR_BE_I(r20, r22) // M2 (5 cyc to srlz.d) LOAD_FSYSCALL_TABLE(r14) // X ;; mov r16=IA64_KR(CURRENT) // M2 (12 cyc) @@ -331,7 +332,7 @@ GLOBAL_ENTRY(__kernel_syscall_via_epc) mov r19=NR_syscalls-1 // A ;; lfetch [r18] // M0|1 - mov r29=psr // M2 (12 cyc) + MOV_FROM_PSR(p0, r29, r8) // M2 (12 cyc) // If r17 is a NaT, p6 will be zero cmp.geu p6,p7=r19,r17 // A (sysnr > 0 && sysnr < 1024+NR_syscalls)? ;; @@ -347,7 +348,7 @@ GLOBAL_ENTRY(__kernel_syscall_via_epc) (p6) tbit.z.unc p8,p0=r18,0 // I0 (dual-issues with "mov b7=r18"!) nop.i 0 ;; -(p8) ssm psr.i + SSM_PSR_I(p8, p14, r25) (p6) mov b7=r18 // I0 (p8) br.dptk.many b7 // B @@ -368,9 +369,17 @@ GLOBAL_ENTRY(__kernel_syscall_via_epc) #else BRL_COND_FSYS_BUBBLE_DOWN(p6) #endif - ssm psr.i + SSM_PSR_I(p0, p14, r10) mov r10=-1 (p10) mov r8=EINVAL (p9) mov r8=ENOSYS FSYS_RETURN + +#ifdef CONFIG_PARAVIRT + /* + * padd to make the size of this symbol constant + * independent of paravirtualization. + */ + .align PAGE_SIZE / 8 +#endif END(__kernel_syscall_via_epc) -- 1.6.0.2 From yamahata at valinux.co.jp Sun Oct 19 20:55:03 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Mon, 20 Oct 2008 12:55:03 +0900 Subject: [PATCH 01/13] ia64/pv_ops: add hooks to paravirtualize fsyscall implementation. In-Reply-To: <1224474915-17171-1-git-send-email-yamahata@valinux.co.jp> References: <1224474915-17171-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224474915-17171-2-git-send-email-yamahata@valinux.co.jp> Add two hooks, paravirt_get_fsyscall_table() and paravirt_get_fsys_bubble_doen() to paravirtualize fsyscall implementation. This patch just add the hooks fsyscall and don't paravirtualize it. 
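A usage sketch (not from the patch itself): code that used to reference the fsyscall_table
symbol directly goes through the accessor instead, so the table address is no longer a
link-time constant. 'fixup_one_entry' is a made-up helper for illustration; it mirrors what
patch_fsyscall_table() does in this patch.

	#include <asm/paravirt.h>
	#include <asm/patch.h>

	static void fixup_one_entry(unsigned long ip)
	{
		unsigned long *table = paravirt_get_fsyscall_table();

		/* patch the 64-bit immediate at 'ip' with the table address */
		ia64_patch_imm64(ip, (unsigned long)table);
	}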
Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/native/inst.h | 3 +++ arch/ia64/include/asm/paravirt.h | 15 +++++++++++++++ arch/ia64/kernel/Makefile | 4 ++-- arch/ia64/kernel/fsys.S | 17 +++++++++-------- arch/ia64/kernel/patch.c | 26 +++++++++++++++++++++++--- arch/ia64/mm/init.c | 2 +- 6 files changed, 53 insertions(+), 14 deletions(-) diff --git a/arch/ia64/include/asm/native/inst.h b/arch/ia64/include/asm/native/inst.h index 0a1026c..5e4e151 100644 --- a/arch/ia64/include/asm/native/inst.h +++ b/arch/ia64/include/asm/native/inst.h @@ -30,6 +30,9 @@ #define __paravirt_work_processed_syscall_target \ ia64_work_processed_syscall +#define paravirt_fsyscall_table ia64_native_fsyscall_table +#define paravirt_fsys_bubble_down ia64_native_fsys_bubble_down + #ifdef CONFIG_PARAVIRT_GUEST_ASM_CLOBBER_CHECK # define PARAVIRT_POISON 0xdeadbeefbaadf00d # define CLOBBER(clob) \ diff --git a/arch/ia64/include/asm/paravirt.h b/arch/ia64/include/asm/paravirt.h index 2bf3636..56f69f9 100644 --- a/arch/ia64/include/asm/paravirt.h +++ b/arch/ia64/include/asm/paravirt.h @@ -22,6 +22,21 @@ #ifndef __ASM_PARAVIRT_H #define __ASM_PARAVIRT_H +#ifndef __ASSEMBLY__ +/****************************************************************************** + * fsys related addresses + */ +struct pv_fsys_data { + unsigned long *fsyscall_table; + void *fsys_bubble_down; +}; + +extern struct pv_fsys_data pv_fsys_data; + +unsigned long *paravirt_get_fsyscall_table(void); +char *paravirt_get_fsys_bubble_down(void); +#endif + #ifdef CONFIG_PARAVIRT_GUEST #define PARAVIRT_HYPERVISOR_TYPE_DEFAULT 0 diff --git a/arch/ia64/kernel/Makefile b/arch/ia64/kernel/Makefile index c381ea9..1ab150e 100644 --- a/arch/ia64/kernel/Makefile +++ b/arch/ia64/kernel/Makefile @@ -111,9 +111,9 @@ include/asm-ia64/nr-irqs.h: arch/$(SRCARCH)/kernel/nr-irqs.s clean-files += $(objtree)/include/asm-ia64/nr-irqs.h # -# native ivt.S and entry.S +# native ivt.S, entry.S and fsys.S # -ASM_PARAVIRT_OBJS = ivt.o entry.o +ASM_PARAVIRT_OBJS = ivt.o entry.o fsys.o define paravirtualized_native AFLAGS_$(1) += -D__IA64_ASM_PARAVIRTUALIZED_NATIVE AFLAGS_pvchk-sed-$(1) += -D__IA64_ASM_PARAVIRTUALIZED_PVCHECK diff --git a/arch/ia64/kernel/fsys.S b/arch/ia64/kernel/fsys.S index c1625c7..788319f 100644 --- a/arch/ia64/kernel/fsys.S +++ b/arch/ia64/kernel/fsys.S @@ -25,6 +25,7 @@ #include #include "entry.h" +#include "paravirt_inst.h" /* * See Documentation/ia64/fsys.txt for details on fsyscalls. @@ -602,7 +603,7 @@ ENTRY(fsys_fallback_syscall) mov r26=ar.pfs END(fsys_fallback_syscall) /* FALL THROUGH */ -GLOBAL_ENTRY(fsys_bubble_down) +GLOBAL_ENTRY(paravirt_fsys_bubble_down) .prologue .altrp b6 .body @@ -640,7 +641,7 @@ GLOBAL_ENTRY(fsys_bubble_down) * * PSR.BE : already is turned off in __kernel_syscall_via_epc() * PSR.AC : don't care (kernel normally turns PSR.AC on) - * PSR.I : already turned off by the time fsys_bubble_down gets + * PSR.I : already turned off by the time paravirt_fsys_bubble_down gets * invoked * PSR.DFL: always 0 (kernel never turns it on) * PSR.DFH: don't care --- kernel never touches f32-f127 on its own @@ -650,7 +651,7 @@ GLOBAL_ENTRY(fsys_bubble_down) * PSR.DB : don't care --- kernel never enables kernel-level * breakpoints * PSR.TB : must be 0 already; if it wasn't zero on entry to - * __kernel_syscall_via_epc, the branch to fsys_bubble_down + * __kernel_syscall_via_epc, the branch to paravirt_fsys_bubble_down * will trigger a taken branch; the taken-trap-handler then * converts the syscall into a break-based system-call. 
*/ @@ -741,14 +742,14 @@ GLOBAL_ENTRY(fsys_bubble_down) nop.m 0 (p8) br.call.sptk.many b6=b6 // B (ignore return address) br.cond.spnt ia64_trace_syscall // B -END(fsys_bubble_down) +END(paravirt_fsys_bubble_down) .rodata .align 8 - .globl fsyscall_table + .globl paravirt_fsyscall_table - data8 fsys_bubble_down -fsyscall_table: + data8 paravirt_fsys_bubble_down +paravirt_fsyscall_table: data8 fsys_ni_syscall data8 0 // exit // 1025 data8 0 // read @@ -1033,4 +1034,4 @@ fsyscall_table: // fill in zeros for the remaining entries .zero: - .space fsyscall_table + 8*NR_syscalls - .zero, 0 + .space paravirt_fsyscall_table + 8*NR_syscalls - .zero, 0 diff --git a/arch/ia64/kernel/patch.c b/arch/ia64/kernel/patch.c index b83b2c5..02dd977 100644 --- a/arch/ia64/kernel/patch.c +++ b/arch/ia64/kernel/patch.c @@ -7,6 +7,7 @@ #include #include +#include #include #include #include @@ -169,16 +170,35 @@ ia64_patch_mckinley_e9 (unsigned long start, unsigned long end) ia64_srlz_i(); } +extern unsigned long ia64_native_fsyscall_table[NR_syscalls]; +extern char ia64_native_fsys_bubble_down[]; +struct pv_fsys_data pv_fsys_data __initdata = { + .fsyscall_table = (unsigned long *)ia64_native_fsyscall_table, + .fsys_bubble_down = (void *)ia64_native_fsys_bubble_down, +}; + +unsigned long * __init +paravirt_get_fsyscall_table(void) +{ + return pv_fsys_data.fsyscall_table; +} + +char * __init +paravirt_get_fsys_bubble_down(void) +{ + return pv_fsys_data.fsys_bubble_down; +} + static void __init patch_fsyscall_table (unsigned long start, unsigned long end) { - extern unsigned long fsyscall_table[NR_syscalls]; + u64 fsyscall_table = (u64)paravirt_get_fsyscall_table(); s32 *offp = (s32 *) start; u64 ip; while (offp < (s32 *) end) { ip = (u64) ia64_imva((char *) offp + *offp); - ia64_patch_imm64(ip, (u64) fsyscall_table); + ia64_patch_imm64(ip, fsyscall_table); ia64_fc((void *) ip); ++offp; } @@ -189,7 +209,7 @@ patch_fsyscall_table (unsigned long start, unsigned long end) static void __init patch_brl_fsys_bubble_down (unsigned long start, unsigned long end) { - extern char fsys_bubble_down[]; + u64 fsys_bubble_down = (u64)paravirt_get_fsys_bubble_down(); s32 *offp = (s32 *) start; u64 ip; diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c index f482a90..59e6851 100644 --- a/arch/ia64/mm/init.c +++ b/arch/ia64/mm/init.c @@ -667,8 +667,8 @@ mem_init (void) * code can tell them apart. */ for (i = 0; i < NR_syscalls; ++i) { - extern unsigned long fsyscall_table[NR_syscalls]; extern unsigned long sys_call_table[NR_syscalls]; + unsigned long *fsyscall_table = paravirt_get_fsyscall_table(); if (!fsyscall_table[i] || nolwsys) fsyscall_table[i] = sys_call_table[i] | 1; -- 1.6.0.2 From yamahata at valinux.co.jp Sun Oct 19 20:55:06 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Mon, 20 Oct 2008 12:55:06 +0900 Subject: [PATCH 04/13] ia64/pv_ops/pvchecker: support mov = ar.itc paravirtualization In-Reply-To: <1224474915-17171-1-git-send-email-yamahata@valinux.co.jp> References: <1224474915-17171-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224474915-17171-5-git-send-email-yamahata@valinux.co.jp> add suport for mov = ar.itc to pvchecker. 
Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/native/pvchk_inst.h | 5 +++++ arch/ia64/scripts/pvcheck.sed | 1 + 2 files changed, 6 insertions(+), 0 deletions(-) diff --git a/arch/ia64/include/asm/native/pvchk_inst.h b/arch/ia64/include/asm/native/pvchk_inst.h index b8e6eb1..13b289e 100644 --- a/arch/ia64/include/asm/native/pvchk_inst.h +++ b/arch/ia64/include/asm/native/pvchk_inst.h @@ -180,6 +180,11 @@ IS_PRED_IN(pred) \ IS_RREG_OUT(reg) \ IS_RREG_CLOB(clob) +#define MOV_FROM_ITC(pred, pred_clob, reg, clob) \ + IS_PRED_IN(pred) \ + IS_PRED_CLOB(pred_clob) \ + IS_RREG_OUT(reg) \ + IS_RREG_CLOB(clob) #define MOV_TO_IFA(reg, clob) \ IS_RREG_IN(reg) \ IS_RREG_CLOB(clob) diff --git a/arch/ia64/scripts/pvcheck.sed b/arch/ia64/scripts/pvcheck.sed index ba66ac2..e59809a 100644 --- a/arch/ia64/scripts/pvcheck.sed +++ b/arch/ia64/scripts/pvcheck.sed @@ -17,6 +17,7 @@ s/mov.*=.*cr\.iip/.warning \"cr.iip should not used directly\"/g s/mov.*=.*cr\.ivr/.warning \"cr.ivr should not used directly\"/g s/mov.*=[^\.]*psr/.warning \"psr should not used directly\"/g # avoid ar.fpsr s/mov.*=.*ar\.eflags/.warning \"ar.eflags should not used directly\"/g +s/mov.*=.*ar\.itc.*/.warning \"ar.itc should not used directly\"/g s/mov.*cr\.ifa.*=.*/.warning \"cr.ifa should not used directly\"/g s/mov.*cr\.itir.*=.*/.warning \"cr.itir should not used directly\"/g s/mov.*cr\.iha.*=.*/.warning \"cr.iha should not used directly\"/g -- 1.6.0.2 From yamahata at valinux.co.jp Sun Oct 19 20:55:07 2008 From: yamahata at valinux.co.jp (Isaku Yamahata) Date: Mon, 20 Oct 2008 12:55:07 +0900 Subject: [PATCH 05/13] ia64/pv_ops/xen: paravirtualize read/write ar.itc and ar.itm In-Reply-To: <1224474915-17171-1-git-send-email-yamahata@valinux.co.jp> References: <1224474915-17171-1-git-send-email-yamahata@valinux.co.jp> Message-ID: <1224474915-17171-6-git-send-email-yamahata@valinux.co.jp> paravirtualize ar.itc and ar.itm in order to support save/restore. Signed-off-by: Isaku Yamahata --- arch/ia64/include/asm/xen/inst.h | 21 +++++++++ arch/ia64/include/asm/xen/interface.h | 9 ++++ arch/ia64/include/asm/xen/minstate.h | 11 ++++- arch/ia64/include/asm/xen/privop.h | 2 + arch/ia64/kernel/asm-offsets.c | 2 + arch/ia64/xen/xen_pv_ops.c | 80 ++++++++++++++++++++++++++++++++- 6 files changed, 123 insertions(+), 2 deletions(-) diff --git a/arch/ia64/include/asm/xen/inst.h b/arch/ia64/include/asm/xen/inst.h index e8e01b2..90537dc 100644 --- a/arch/ia64/include/asm/xen/inst.h +++ b/arch/ia64/include/asm/xen/inst.h @@ -113,6 +113,27 @@ .endm #define MOV_FROM_PSR(pred, reg, clob) __MOV_FROM_PSR pred, reg, clob +/* assuming ar.itc is read with interrupt disabled. */ +#define MOV_FROM_ITC(pred, pred_clob, reg, clob) \ +(pred) movl clob = XSI_ITC_OFFSET; \ + ;; \ +(pred) ld8 clob = [clob]; \ +(pred) mov reg = ar.itc; \ + ;; \ +(pred) add reg = reg, clob; \ + ;; \ +(pred) movl clob = XSI_ITC_LAST; \ + ;; \ +(pred) ld8 clob = [clob]; \ + ;; \ +(pred) cmp.geu.unc pred_clob, p0 = clob, reg; \ + ;; \ +(pred_clob) add reg = 1, clob; \ + ;; \ +(pred) movl clob = XSI_ITC_LAST; \ + ;; \ +(pred) st8 [clob] = reg + #define MOV_TO_IFA(reg, clob) \ movl clob = XSI_IFA; \ diff --git a/arch/ia64/include/asm/xen/interface.h b/arch/ia64/include/asm/xen/interface.h index f00fab4..e951e74 100644 --- a/arch/ia64/include/asm/xen/interface.h +++ b/arch/ia64/include/asm/xen/interface.h @@ -209,6 +209,15 @@ struct mapped_regs { unsigned long krs[8]; /* kernel registers */ unsigned long tmp[16]; /* temp registers (e.g. 
for hyperprivops) */ + + /* itc paravirtualization + * vAR.ITC = mAR.ITC + itc_offset + * itc_last is one which was lastly passed to + * the guest OS in order to prevent it from + * going backwords. + */ + unsigned long itc_offset; + unsigned long itc_last; }; }; }; diff --git a/arch/ia64/include/asm/xen/minstate.h b/arch/ia64/include/asm/xen/minstate.h index 4d92d9b..c57fa91 100644 --- a/arch/ia64/include/asm/xen/minstate.h +++ b/arch/ia64/include/asm/xen/minstate.h @@ -1,3 +1,12 @@ + +#ifdef CONFIG_VIRT_CPU_ACCOUNTING +/* read ar.itc in advance, and use it before leaving bank 0 */ +#define XEN_ACCOUNT_GET_STAMP \ + MOV_FROM_ITC(pUStk, p6, r20, r2); +#else +#define XEN_ACCOUNT_GET_STAMP +#endif + /* * DO_SAVE_MIN switches to the kernel stacks (if necessary) and saves * the minimum state necessary that allows us to turn psr.ic back @@ -123,7 +132,7 @@ ;; \ .mem.offset 0,0; st8.spill [r16]=r2,16; \ .mem.offset 8,0; st8.spill [r17]=r3,16; \ - ACCOUNT_GET_STAMP \ + XEN_ACCOUNT_GET_STAMP \ adds r2=IA64_PT_REGS_R16_OFFSET,r1; \ ;; \ EXTRA; \ diff --git a/arch/ia64/include/asm/xen/privop.h b/arch/ia64/include/asm/xen/privop.h index 71ec754..2261dda 100644 --- a/arch/ia64/include/asm/xen/privop.h +++ b/arch/ia64/include/asm/xen/privop.h @@ -55,6 +55,8 @@ #define XSI_BANK1_R16 (XSI_BASE + XSI_BANK1_R16_OFS) #define XSI_BANKNUM (XSI_BASE + XSI_BANKNUM_OFS) #define XSI_IHA (XSI_BASE + XSI_IHA_OFS) +#define XSI_ITC_OFFSET (XSI_BASE + XSI_ITC_OFFSET_OFS) +#define XSI_ITC_LAST (XSI_BASE + XSI_ITC_LAST_OFS) #endif #ifndef __ASSEMBLY__ diff --git a/arch/ia64/kernel/asm-offsets.c b/arch/ia64/kernel/asm-offsets.c index 742dbb1..af56501 100644 --- a/arch/ia64/kernel/asm-offsets.c +++ b/arch/ia64/kernel/asm-offsets.c @@ -316,5 +316,7 @@ void foo(void) DEFINE_MAPPED_REG_OFS(XSI_BANK1_R16_OFS, bank1_regs[0]); DEFINE_MAPPED_REG_OFS(XSI_B0NATS_OFS, vbnat); DEFINE_MAPPED_REG_OFS(XSI_B1NATS_OFS, vnat); + DEFINE_MAPPED_REG_OFS(XSI_ITC_OFFSET_OFS, itc_offset); + DEFINE_MAPPED_REG_OFS(XSI_ITC_LAST_OFS, itc_last); #endif /* CONFIG_XEN */ } diff --git a/arch/ia64/xen/xen_pv_ops.c b/arch/ia64/xen/xen_pv_ops.c index 17eed4f..2515d8f 100644 --- a/arch/ia64/xen/xen_pv_ops.c +++ b/arch/ia64/xen/xen_pv_ops.c @@ -183,6 +183,75 @@ struct pv_fsys_data xen_fsys_data __initdata = { * intrinsics hooks. */ +static void +xen_set_itm_with_offset(unsigned long val) +{ + /* ia64_cpu_local_tick() calls this with interrupt enabled. */ + /* WARN_ON(!irqs_disabled()); */ + xen_set_itm(val - XEN_MAPPEDREGS->itc_offset); +} + +static unsigned long +xen_get_itm_with_offset(void) +{ + /* unused at this moment */ + printk(KERN_DEBUG "%s is called.\n", __func__); + + WARN_ON(!irqs_disabled()); + return ia64_native_getreg(_IA64_REG_CR_ITM) + + XEN_MAPPEDREGS->itc_offset; +} + +/* ia64_set_itc() is only called by + * cpu_init() with ia64_set_itc(0) and ia64_sync_itc(). + * So XEN_MAPPEDRESG->itc_offset cal be considered as almost constant. 
+ */ +static void +xen_set_itc(unsigned long val) +{ + unsigned long mitc; + + WARN_ON(!irqs_disabled()); + mitc = ia64_native_getreg(_IA64_REG_AR_ITC); + XEN_MAPPEDREGS->itc_offset = val - mitc; + XEN_MAPPEDREGS->itc_last = val; +} + +static unsigned long +xen_get_itc(void) +{ + unsigned long res; + unsigned long itc_offset; + unsigned long itc_last; + unsigned long ret_itc_last; + + itc_offset = XEN_MAPPEDREGS->itc_offset; + do { + itc_last = XEN_MAPPEDREGS->itc_last; + res = ia64_native_getreg(_IA64_REG_AR_ITC); + res += itc_offset; + if (itc_last >= res) + res = itc_last + 1; + ret_itc_last = cmpxchg(&XEN_MAPPEDREGS->itc_last, + itc_last, res); + } while (unlikely(ret_itc_last != itc_last)); + return res; + +#if 0 + /* ia64_itc_udelay() calls ia64_get_itc() with interrupt enabled. + Should it be paravirtualized instead? */ + WARN_ON(!irqs_disabled()); + itc_offset = XEN_MAPPEDREGS->itc_offset; + itc_last = XEN_MAPPEDREGS->itc_last; + res = ia64_native_getreg(_IA64_REG_AR_ITC); + res += itc_offset; + if (itc_last >= res) + res = itc_last + 1; + XEN_MAPPEDREGS->itc_last = res; + return res; +#endif +} + static void xen_setreg(int regnum, unsigned long val) { switch (regnum) { @@ -194,11 +263,14 @@ static void xen_setreg(int regnum, unsigned long val) xen_set_eflag(val); break; #endif + case _IA64_REG_AR_ITC: + xen_set_itc(val); + break; case _IA64_REG_CR_TPR: xen_set_tpr(val); break; case _IA64_REG_CR_ITM: - xen_set_itm(val); + xen_set_itm_with_offset(val); break; case _IA64_REG_CR_EOI: xen_eoi(val); @@ -222,6 +294,12 @@ static unsigned long xen_getreg(int regnum) res = xen_get_eflag(); break; #endif + case _IA64_REG_AR_ITC: + res = xen_get_itc(); + break; + case _IA64_REG_CR_ITM: + res = xen_get_itm_with_offset(); + break; case _IA64_REG_CR_IVR: res = xen_get_ivr(); break; -- 1.6.0.2 From baramsori72 at gmail.com Mon Oct 20 01:40:49 2008 From: baramsori72 at gmail.com (Dong-Jae Kang) Date: Mon, 20 Oct 2008 17:40:49 +0900 Subject: [PATCH 0/2] dm-ioband: I/O bandwidth controller v1.8.0: Introduction In-Reply-To: <20081017.160950.71109894.ryov@valinux.co.jp> References: <20081017.160950.71109894.ryov@valinux.co.jp> Message-ID: <2891419e0810200140s3cf9c0a3q228620519ae5f4af@mail.gmail.com> Hi, Ryo Tsuruta I am trying to install your new released dm-ioband v1.8.0 I/O bandwidth controller v1.8.0 could be adjusted to the latest stable kernel 2.6.27.1 it was good news for me but, I had a problem when I try to patch bio-cgroup files to stable kernel 2.6.27.1 many patch failure and hunk messages were occured. Can you check bio-cgroup patch files against stable kernel 2.6.27.1? If the dm-ioband and bio-cgroup patches had different base kernel, i think it is not natural. I think it is more reasonable to release dm-ioband and bio-cgroup patches in pair. In my situation, both direct IO and buffered IO are same important. thank you Regard, Dong-Jae Kang 2008/10/17 Ryo Tsuruta : > Hi Alasdair and all, > > This is the dm-ioband version 1.8.0 release. > > Dm-ioband is an I/O bandwidth controller implemented as a device-mapper > driver, which gives specified bandwidth to each job running on the same > physical device. > > This release is a minor bug fix and confirmed running on the latest > stable kernel 2.6.27.1. > > - Can be applied to the kernel 2.6.27.1 and 2.6.27-rc5-mm1. > - Changes from 1.7.0 (posted on Oct 3, 2008): > - Fix a minor bug in io_limit setting that causes dm-ioband to stop > issuing I/O requests when a large value is set to io_limit. 
> > Alasdair, could you please review this patch and give me any comments? > > Thanks, > Ryo Tsuruta > _______________________________________________ > Containers mailing list > Containers at lists.linux-foundation.org > https://lists.linux-foundation.org/mailman/listinfo/containers > -- ------------------------------------------------------------------------------------------------- DONG-JAE, KANG Senior Member of Engineering Staff Internet Platform Research Dept, S/W Content Research Lab Electronics and Telecommunications Research Institute(ETRI) 138 Gajeongno, Yuseong-gu, Daejeon, 305-700 KOREA Phone : 82-42-860-1561 Fax : 82-42-860-6699 Mobile : 82-10-9919-2353 E-mail : djkang at etri.re.kr (MSN) ------------------------------------------------------------------------------------------------- From ryov at valinux.co.jp Mon Oct 20 02:01:30 2008 From: ryov at valinux.co.jp (Ryo Tsuruta) Date: Mon, 20 Oct 2008 18:01:30 +0900 (JST) Subject: [PATCH 0/2] dm-ioband: I/O bandwidth controller v1.8.0: Introduction In-Reply-To: <2891419e0810200140s3cf9c0a3q228620519ae5f4af@mail.gmail.com> References: <20081017.160950.71109894.ryov@valinux.co.jp> <2891419e0810200140s3cf9c0a3q228620519ae5f4af@mail.gmail.com> Message-ID: <20081020.180129.193689104.ryov@valinux.co.jp> Hi Dong-Jae, > I am trying to install your new released dm-ioband v1.8.0 > I/O bandwidth controller v1.8.0 could be adjusted to the latest > stable kernel 2.6.27.1 > it was good news for me > but, I had a problem when I try to patch bio-cgroup files to stable > kernel 2.6.27.1 > many patch failure and hunk messages were occured. > > Can you check bio-cgroup patch files against stable kernel 2.6.27.1? bio-cgroup patch can apply only to 2.6.27-rc5-mm1 so far. Please use 2.6.27-rc5-mm1 to try both bio-cgroup and dm-ioband, dm-ioband v1.8.0 can also apply to 2.6.27-rc5-mm1. Thanks, Ryo Tsuruta From baramsori72 at gmail.com Mon Oct 20 03:13:14 2008 From: baramsori72 at gmail.com (Dong-Jae Kang) Date: Mon, 20 Oct 2008 19:13:14 +0900 Subject: [PATCH 0/2] dm-ioband: I/O bandwidth controller v1.8.0: Introduction In-Reply-To: <20081020.180129.193689104.ryov@valinux.co.jp> References: <20081017.160950.71109894.ryov@valinux.co.jp> <2891419e0810200140s3cf9c0a3q228620519ae5f4af@mail.gmail.com> <20081020.180129.193689104.ryov@valinux.co.jp> Message-ID: <2891419e0810200313j5cac8541qff80614e3d784b1b@mail.gmail.com> Hi, Ryo Tsuruta Thank you for fast reply. 2008/10/20 Ryo Tsuruta : > Hi Dong-Jae, > >> I am trying to install your new released dm-ioband v1.8.0 >> I/O bandwidth controller v1.8.0 could be adjusted to the latest >> stable kernel 2.6.27.1 >> it was good news for me >> but, I had a problem when I try to patch bio-cgroup files to stable >> kernel 2.6.27.1 >> many patch failure and hunk messages were occured. >> >> Can you check bio-cgroup patch files against stable kernel 2.6.27.1? > > bio-cgroup patch can apply only to 2.6.27-rc5-mm1 so far. > Please use 2.6.27-rc5-mm1 to try both bio-cgroup and dm-ioband, > dm-ioband v1.8.0 can also apply to 2.6.27-rc5-mm1. OK, I will use kernel-2.6.27-rc5-mm1 as your comments Well, do you have any plan to upgrade bio-cgroup patches for stable latest kernel? 
but, I think it will be not easy job ^^ Thank you Regards, Dong-Jae Kang -- ------------------------------------------------------------------------------------------------- DONG-JAE, KANG Senior Member of Engineering Staff Internet Platform Research Dept, S/W Content Research Lab Electronics and Telecommunications Research Institute(ETRI) 138 Gajeongno, Yuseong-gu, Daejeon, 305-700 KOREA Phone : 82-42-860-1561 Fax : 82-42-860-6699 Mobile : 82-10-9919-2353 E-mail : djkang at etri.re.kr (MSN) ------------------------------------------------------------------------------------------------- From ryov at valinux.co.jp Mon Oct 20 05:48:58 2008 From: ryov at valinux.co.jp (Ryo Tsuruta) Date: Mon, 20 Oct 2008 21:48:58 +0900 (JST) Subject: [PATCH 0/2] dm-ioband: I/O bandwidth controller v1.8.0: Introduction In-Reply-To: <2891419e0810200313j5cac8541qff80614e3d784b1b@mail.gmail.com> References: <2891419e0810200140s3cf9c0a3q228620519ae5f4af@mail.gmail.com> <20081020.180129.193689104.ryov@valinux.co.jp> <2891419e0810200313j5cac8541qff80614e3d784b1b@mail.gmail.com> Message-ID: <20081020.214858.193685328.ryov@valinux.co.jp> Hi Dong-Jae, > Well, do you have any plan to upgrade bio-cgroup patches for stable > latest kernel? > but, I think it will be not easy job ^^ I think the memory cgroup will have a major change sometime soon. So the new patch will be released after that. Thanks, Ryo Tsuruta From kamezawa.hiroyu at jp.fujitsu.com Mon Oct 20 19:00:48 2008 From: kamezawa.hiroyu at jp.fujitsu.com (KAMEZAWA Hiroyuki) Date: Tue, 21 Oct 2008 11:00:48 +0900 Subject: [PATCH 0/2] dm-ioband: I/O bandwidth controller v1.8.0: Introduction In-Reply-To: <20081020.214858.193685328.ryov@valinux.co.jp> References: <2891419e0810200140s3cf9c0a3q228620519ae5f4af@mail.gmail.com> <20081020.180129.193689104.ryov@valinux.co.jp> <2891419e0810200313j5cac8541qff80614e3d784b1b@mail.gmail.com> <20081020.214858.193685328.ryov@valinux.co.jp> Message-ID: <20081021110048.949704bd.kamezawa.hiroyu@jp.fujitsu.com> On Mon, 20 Oct 2008 21:48:58 +0900 (JST) Ryo Tsuruta wrote: > Hi Dong-Jae, > > > Well, do you have any plan to upgrade bio-cgroup patches for stable > > latest kernel? > > but, I think it will be not easy job ^^ > > I think the memory cgroup will have a major change sometime soon. > So the new patch will be released after that. > the newest mmotm has the newest *big* change. enjoy it ;) I think no major change around page_cgroup will not occur for a while but I myself have plan/patches to modify memcg itself's charge/uncharge callpath. Thanks, -Kame From baramsori72 at gmail.com Mon Oct 20 19:54:31 2008 From: baramsori72 at gmail.com (Dong-Jae Kang) Date: Tue, 21 Oct 2008 11:54:31 +0900 Subject: [Question] power management related with cgroup based resource management Message-ID: <2891419e0810201954q57087fc8ufcaa0e42f3ca99e2@mail.gmail.com> Hi, all These days, I am interested in green IT area for low power OS So, I have a question about it. Is there any good idea or comments about power management related with cgroup based resource management? I have no idea about that, but it seems to be possible to find a good concept. And I hope so Is it some strange question? ^^ Regards, Dong-Jae Kang From yu.zhao at intel.com Tue Oct 21 04:40:56 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Tue, 21 Oct 2008 19:40:56 +0800 Subject: [PATCH 0/15 v5] PCI: Linux kernel SR-IOV support Message-ID: <20081021114056.GA3185@yzhao12-linux.sh.intel.com> Greetings, Following patches are intended to support SR-IOV capability in the Linux kernel. 
With these patches, people can turn a PCI device with the capability into multiple ones from software perspective, which will benefit KVM and achieve other purposes such as QoS, security, and etc. Major changes between v4 -> v5: 1, remove interfaces for PF driver to create sysfs entries (Matthew Wilcox) 2, get ride of 'struct kobject' used in 'struct pci_iov' (Greg KH) 3, split big chunk of code into more patches 4, add boot options to reassign resources under a bus 5, add boot option to align MMIO resources of a device --- [PATCH 1/15 v5] PCI: remove unnecessary arg of pci_update_resource() [PATCH 2/15 v5] PCI: define PCI resource names in an 'enum' [PATCH 3/15 v5] PCI: export __pci_read_base [PATCH 4/15 v5] PCI: make pci_alloc_child_bus() be able to handle bridge device [PATCH 5/15 v5] PCI: add a wrapper for resource_alignment() [PATCH 6/15 v5] PCI: add a new function to map BAR offset [PATCH 7/15 v5] PCI: cleanup pcibios_allocate_resources() [PATCH 8/15 v5] PCI: add boot option to reassign resources [PATCH 9/15 v5] PCI: add boot option to align MMIO resource [PATCH 10/15 v5] PCI: cleanup pci_bus_add_devices() [PATCH 11/15 v5] PCI: split a new function from pci_bus_add_devices() [PATCH 12/15 v5] PCI: support the SR-IOV capability [PATCH 13/15 v5] PCI: reserve bus range for the SR-IOV device [PATCH 14/15 v5] PCI: document the changes [PATCH 15/15 v5] PCI: document the new PCI boot parameters --- Single Root I/O Virtualization (SR-IOV) capability defined by PCI-SIG is intended to enable multiple system software to share PCI hardware resources. PCI device that supports this capability can be extended to one Physical Functions plus multiple Virtual Functions. Physical Function, which could be considered as the "real" PCI device, reflects the hardware instance and manages all physical resources. Virtual Functions are associated with a Physical Function and shares physical resources with the Physical Function.Software can control allocation of Virtual Functions via registers encapsulated in the capability structure. SR-IOV specification can be found at http://www.pcisig.com/members/downloads/specifications/iov/sr-iov1.0_11Sep07.pdf Devices that support SR-IOV are available from following vendors: http://download.intel.com/design/network/ProdBrf/320025.pdf http://www.netxen.com/products/chipsolutions/NX3031.html http://www.neterion.com/products/x3100.html From yu.zhao at intel.com Tue Oct 21 04:44:03 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Tue, 21 Oct 2008 19:44:03 +0800 Subject: [PATCH 1/15 v5] PCI: remove unnecessary arg of pci_update_resource() In-Reply-To: <20081021114056.GA3185@yzhao12-linux.sh.intel.com> References: <20081021114056.GA3185@yzhao12-linux.sh.intel.com> Message-ID: <20081021114403.GB3185@yzhao12-linux.sh.intel.com> This cleanup removes unnecessary argument 'struct resource *res' in pci_update_resource(), so it takes same arguments as other companion functions (pci_assign_resource(), etc.). 
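The change at each call site is mechanical; a sketch:

	/* before: resource pointer and index were both passed */
	pci_update_resource(dev, &dev->resource[resno], resno);

	/* after: the index alone identifies the resource */
	pci_update_resource(dev, resno);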
Cc: Jesse Barnes Cc: Randy Dunlap Cc: Grant Grundler Cc: Alex Chiang Cc: Matthew Wilcox Cc: Roland Dreier Cc: Greg KH Signed-off-by: Yu Zhao --- drivers/pci/pci.c | 4 ++-- drivers/pci/setup-res.c | 7 ++++--- include/linux/pci.h | 2 +- 3 files changed, 7 insertions(+), 6 deletions(-) diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 4db261e..ae62f01 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -376,8 +376,8 @@ pci_restore_bars(struct pci_dev *dev) return; } - for (i = 0; i < numres; i ++) - pci_update_resource(dev, &dev->resource[i], i); + for (i = 0; i < numres; i++) + pci_update_resource(dev, i); } static struct pci_platform_pm_ops *pci_platform_pm; diff --git a/drivers/pci/setup-res.c b/drivers/pci/setup-res.c index 2dbd96c..b7ca679 100644 --- a/drivers/pci/setup-res.c +++ b/drivers/pci/setup-res.c @@ -26,11 +26,12 @@ #include "pci.h" -void pci_update_resource(struct pci_dev *dev, struct resource *res, int resno) +void pci_update_resource(struct pci_dev *dev, int resno) { struct pci_bus_region region; u32 new, check, mask; int reg; + struct resource *res = dev->resource + resno; /* * Ignore resources for unimplemented BARs and unused resource slots @@ -162,7 +163,7 @@ int pci_assign_resource(struct pci_dev *dev, int resno) } else { res->flags &= ~IORESOURCE_STARTALIGN; if (resno < PCI_BRIDGE_RESOURCES) - pci_update_resource(dev, res, resno); + pci_update_resource(dev, resno); } return ret; @@ -197,7 +198,7 @@ int pci_assign_resource_fixed(struct pci_dev *dev, int resno) dev_err(&dev->dev, "BAR %d: can't allocate %s resource %pR\n", resno, res->flags & IORESOURCE_IO ? "I/O" : "mem", res); } else if (resno < PCI_BRIDGE_RESOURCES) { - pci_update_resource(dev, res, resno); + pci_update_resource(dev, resno); } return ret; diff --git a/include/linux/pci.h b/include/linux/pci.h index 085187b..43e1fc1 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -626,7 +626,7 @@ int pcix_get_mmrbc(struct pci_dev *dev); int pcix_set_mmrbc(struct pci_dev *dev, int mmrbc); int pcie_get_readrq(struct pci_dev *dev); int pcie_set_readrq(struct pci_dev *dev, int rq); -void pci_update_resource(struct pci_dev *dev, struct resource *res, int resno); +void pci_update_resource(struct pci_dev *dev, int resno); int __must_check pci_assign_resource(struct pci_dev *dev, int i); int pci_select_bars(struct pci_dev *dev, unsigned long flags); -- 1.5.6.4 From yu.zhao at intel.com Tue Oct 21 04:44:50 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Tue, 21 Oct 2008 19:44:50 +0800 Subject: [PATCH 2/15 v5] PCI: define PCI resource names in an 'enum' In-Reply-To: <20081021114056.GA3185@yzhao12-linux.sh.intel.com> References: <20081021114056.GA3185@yzhao12-linux.sh.intel.com> Message-ID: <20081021114450.GC3185@yzhao12-linux.sh.intel.com> This patch moves all definitions of PCI resource names to an 'enum', and also replaces some hard-coded resource variables with symbol names. This change eases the introduction of device specific resources. 
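A usage sketch (not part of the patch): with the symbolic names a loop can state which
range it walks instead of hard-coding 6 or 7.

	int i;

	/* iterate only the standard BARs of 'dev' */
	for (i = PCI_STD_RESOURCES; i <= PCI_STD_RESOURCES_END; i++) {
		struct resource *res = &dev->resource[i];

		if (!res->flags)
			continue;
		/* ... handle this BAR ... */
	}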
Cc: Jesse Barnes Cc: Randy Dunlap Cc: Grant Grundler Cc: Alex Chiang Cc: Matthew Wilcox Cc: Roland Dreier Cc: Greg KH Signed-off-by: Yu Zhao --- drivers/pci/pci-sysfs.c | 4 +++- drivers/pci/pci.c | 19 ++----------------- drivers/pci/probe.c | 2 +- drivers/pci/proc.c | 7 ++++--- include/linux/pci.h | 37 ++++++++++++++++++++++++------------- 5 files changed, 34 insertions(+), 35 deletions(-) diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c index 110022d..5c456ab 100644 --- a/drivers/pci/pci-sysfs.c +++ b/drivers/pci/pci-sysfs.c @@ -101,11 +101,13 @@ resource_show(struct device * dev, struct device_attribute *attr, char * buf) struct pci_dev * pci_dev = to_pci_dev(dev); char * str = buf; int i; - int max = 7; + int max; resource_size_t start, end; if (pci_dev->subordinate) max = DEVICE_COUNT_RESOURCE; + else + max = PCI_BRIDGE_RESOURCES; for (i = 0; i < max; i++) { struct resource *res = &pci_dev->resource[i]; diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index ae62f01..40284dc 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -359,24 +359,9 @@ pci_find_parent_resource(const struct pci_dev *dev, struct resource *res) static void pci_restore_bars(struct pci_dev *dev) { - int i, numres; - - switch (dev->hdr_type) { - case PCI_HEADER_TYPE_NORMAL: - numres = 6; - break; - case PCI_HEADER_TYPE_BRIDGE: - numres = 2; - break; - case PCI_HEADER_TYPE_CARDBUS: - numres = 1; - break; - default: - /* Should never get here, but just in case... */ - return; - } + int i; - for (i = 0; i < numres; i++) + for (i = 0; i < PCI_BRIDGE_RESOURCES; i++) pci_update_resource(dev, i); } diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index aaaf0a1..a52784c 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -426,7 +426,7 @@ static struct pci_bus *pci_alloc_child_bus(struct pci_bus *parent, child->subordinate = 0xff; /* Set up default resource pointers and names.. 
*/ - for (i = 0; i < 4; i++) { + for (i = 0; i < PCI_BRIDGE_RES_NUM; i++) { child->resource[i] = &bridge->resource[PCI_BRIDGE_RESOURCES+i]; child->resource[i]->name = child->name; } diff --git a/drivers/pci/proc.c b/drivers/pci/proc.c index e1098c3..f6f2a59 100644 --- a/drivers/pci/proc.c +++ b/drivers/pci/proc.c @@ -352,15 +352,16 @@ static int show_device(struct seq_file *m, void *v) dev->vendor, dev->device, dev->irq); - /* Here should be 7 and not PCI_NUM_RESOURCES as we need to preserve compatibility */ - for (i=0; i<7; i++) { + + /* only print standard and ROM resources to preserve compatibility */ + for (i = 0; i <= PCI_ROM_RESOURCE; i++) { resource_size_t start, end; pci_resource_to_user(dev, i, &dev->resource[i], &start, &end); seq_printf(m, "\t%16llx", (unsigned long long)(start | (dev->resource[i].flags & PCI_REGION_FLAG_MASK))); } - for (i=0; i<7; i++) { + for (i = 0; i <= PCI_ROM_RESOURCE; i++) { resource_size_t start, end; pci_resource_to_user(dev, i, &dev->resource[i], &start, &end); seq_printf(m, "\t%16llx", diff --git a/include/linux/pci.h b/include/linux/pci.h index 43e1fc1..2ada2b6 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -76,7 +76,30 @@ enum pci_mmap_state { #define PCI_DMA_FROMDEVICE 2 #define PCI_DMA_NONE 3 -#define DEVICE_COUNT_RESOURCE 12 +/* + * For PCI devices, the region numbers are assigned this way: + */ +enum { + /* #0-5: standard PCI regions */ + PCI_STD_RESOURCES, + PCI_STD_RESOURCES_END = 5, + + /* #6: expansion ROM */ + PCI_ROM_RESOURCE, + + /* address space assigned to buses behind the bridge */ +#ifndef PCI_BRIDGE_RES_NUM +#define PCI_BRIDGE_RES_NUM 4 +#endif + PCI_BRIDGE_RESOURCES, + PCI_BRIDGE_RES_END = PCI_BRIDGE_RESOURCES + PCI_BRIDGE_RES_NUM - 1, + + /* total resources associated with a PCI device */ + PCI_NUM_RESOURCES, + + /* preserve this for compatibility */ + DEVICE_COUNT_RESOURCE +}; typedef int __bitwise pci_power_t; @@ -262,18 +285,6 @@ static inline void pci_add_saved_cap(struct pci_dev *pci_dev, hlist_add_head(&new_cap->next, &pci_dev->saved_cap_space); } -/* - * For PCI devices, the region numbers are assigned this way: - * - * 0-5 standard PCI regions - * 6 expansion ROM - * 7-10 bridges: address space assigned to buses behind the bridge - */ - -#define PCI_ROM_RESOURCE 6 -#define PCI_BRIDGE_RESOURCES 7 -#define PCI_NUM_RESOURCES 11 - #ifndef PCI_BUS_NUM_RESOURCES #define PCI_BUS_NUM_RESOURCES 16 #endif -- 1.5.6.4 From yu.zhao at intel.com Tue Oct 21 04:45:27 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Tue, 21 Oct 2008 19:45:27 +0800 Subject: [PATCH 3/15 v5] PCI: export __pci_read_base In-Reply-To: <20081021114056.GA3185@yzhao12-linux.sh.intel.com> References: <20081021114056.GA3185@yzhao12-linux.sh.intel.com> Message-ID: <20081021114527.GD3185@yzhao12-linux.sh.intel.com> Export __pci_read_base() so it can be used by whole PCI subsystem. 
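A usage sketch only; the real user is the SR-IOV code later in this series. 'dev' and 'pos'
are assumed to be a PCI device and the config-space offset of the BAR to probe.

	struct resource res = {};
	int is_64bit;

	/* fill 'res' from the BAR at config offset 'pos';
	 * returns 1 for a 64-bit BAR, 0 for a 32-bit one */
	is_64bit = __pci_read_base(dev, pci_bar_unknown, &res, pos);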
Cc: Jesse Barnes Cc: Randy Dunlap Cc: Grant Grundler Cc: Alex Chiang Cc: Matthew Wilcox Cc: Roland Dreier Cc: Greg KH Signed-off-by: Yu Zhao --- drivers/pci/pci.h | 9 +++++++++ drivers/pci/probe.c | 20 +++++++++----------- 2 files changed, 18 insertions(+), 11 deletions(-) diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index b205ab8..fbbc6ad 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -157,6 +157,15 @@ struct pci_slot_attribute { }; #define to_pci_slot_attr(s) container_of(s, struct pci_slot_attribute, attr) +enum pci_bar_type { + pci_bar_unknown, /* Standard PCI BAR probe */ + pci_bar_io, /* An io port BAR */ + pci_bar_mem32, /* A 32-bit memory BAR */ + pci_bar_mem64, /* A 64-bit memory BAR */ +}; + +extern int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type, + struct resource *res, unsigned int reg); extern void pci_enable_ari(struct pci_dev *dev); /** * pci_ari_enabled - query ARI forwarding status diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index a52784c..db3e5a7 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -135,13 +135,6 @@ static u64 pci_size(u64 base, u64 maxbase, u64 mask) return size; } -enum pci_bar_type { - pci_bar_unknown, /* Standard PCI BAR probe */ - pci_bar_io, /* An io port BAR */ - pci_bar_mem32, /* A 32-bit memory BAR */ - pci_bar_mem64, /* A 64-bit memory BAR */ -}; - static inline enum pci_bar_type decode_bar(struct resource *res, u32 bar) { if ((bar & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO) { @@ -156,11 +149,16 @@ static inline enum pci_bar_type decode_bar(struct resource *res, u32 bar) return pci_bar_mem32; } -/* - * If the type is not unknown, we assume that the lowest bit is 'enable'. - * Returns 1 if the BAR was 64-bit and 0 if it was 32-bit. +/** + * pci_read_base - read a PCI BAR + * @dev: the PCI device + * @type: type of the BAR + * @res: resource buffer to be filled in + * @pos: BAR position in the config space + * + * Returns 1 if the BAR is 64-bit, or 0 if 32-bit. */ -static int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type, +int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type, struct resource *res, unsigned int pos) { u32 l, sz, mask; -- 1.5.6.4 From yu.zhao at intel.com Tue Oct 21 04:47:35 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Tue, 21 Oct 2008 19:47:35 +0800 Subject: [PATCH 4/15 v5] PCI: make pci_alloc_child_bus() be able to handle NULL bridge In-Reply-To: <20081021114056.GA3185@yzhao12-linux.sh.intel.com> References: <20081021114056.GA3185@yzhao12-linux.sh.intel.com> Message-ID: <20081021114735.GE3185@yzhao12-linux.sh.intel.com> Make pci_alloc_child_bus() be able to handle buses without bridge devices. Some devices such as SR-IOV devices use more than one bus number while there is no explicit bridge devices since they have internal routing mechanism. 
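For illustration, the SR-IOV code later in this series creates such bridgeless buses through pci_add_new_bus(), which hands the NULL bridge down to pci_alloc_child_bus(). A condensed sketch of that pattern (function name illustrative; locking, sysfs registration and error paths trimmed):

	/* Sketch: make bus numbers consumed by a device with internal
	 * routing visible as child buses that have no bridge device. */
	static int sketch_add_virtual_buses(struct pci_bus *bus, int last_busnr)
	{
		int i;
		struct pci_bus *child;

		for (i = bus->number + 1; i <= last_busnr; i++) {
			child = pci_add_new_bus(bus, NULL, i);	/* no bridge */
			if (!child)
				return -ENOMEM;
			child->subordinate = i;
		}
		return 0;
	}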
Cc: Jesse Barnes Cc: Randy Dunlap Cc: Grant Grundler Cc: Alex Chiang Cc: Matthew Wilcox Cc: Roland Dreier Cc: Greg KH Signed-off-by: Yu Zhao --- drivers/pci/probe.c | 7 +++++-- 1 files changed, 5 insertions(+), 2 deletions(-) diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index db3e5a7..4b12b58 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -401,12 +401,10 @@ static struct pci_bus *pci_alloc_child_bus(struct pci_bus *parent, if (!child) return NULL; - child->self = bridge; child->parent = parent; child->ops = parent->ops; child->sysdata = parent->sysdata; child->bus_flags = parent->bus_flags; - child->bridge = get_device(&bridge->dev); /* initialize some portions of the bus device, but don't register it * now as the parent is not properly set up yet. This device will get @@ -423,6 +421,11 @@ static struct pci_bus *pci_alloc_child_bus(struct pci_bus *parent, child->primary = parent->secondary; child->subordinate = 0xff; + if (!bridge) + return child; + + child->self = bridge; + child->bridge = get_device(&bridge->dev); /* Set up default resource pointers and names.. */ for (i = 0; i < PCI_BRIDGE_RES_NUM; i++) { child->resource[i] = &bridge->resource[PCI_BRIDGE_RESOURCES+i]; -- 1.5.6.4 From yu.zhao at intel.com Tue Oct 21 04:48:03 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Tue, 21 Oct 2008 19:48:03 +0800 Subject: [PATCH 5/15 v5] PCI: add a wrapper for resource_alignment() In-Reply-To: <20081021114056.GA3185@yzhao12-linux.sh.intel.com> References: <20081021114056.GA3185@yzhao12-linux.sh.intel.com> Message-ID: <20081021114803.GF3185@yzhao12-linux.sh.intel.com> Add a wrap of resource_alignment() so it can handle device specific resource alignment. Cc: Jesse Barnes Cc: Randy Dunlap Cc: Grant Grundler Cc: Alex Chiang Cc: Matthew Wilcox Cc: Roland Dreier Cc: Greg KH Signed-off-by: Yu Zhao --- drivers/pci/pci.c | 20 ++++++++++++++++++++ drivers/pci/pci.h | 1 + drivers/pci/setup-bus.c | 4 ++-- drivers/pci/setup-res.c | 7 ++++--- 4 files changed, 27 insertions(+), 5 deletions(-) diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 40284dc..a9b554e 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -1904,6 +1904,26 @@ int pci_select_bars(struct pci_dev *dev, unsigned long flags) return bars; } +/** + * pci_resource_alignment - get a PCI BAR resource alignment + * @dev: the PCI device + * @resno: the resource number + * + * Returns alignment size on success, or 0 on error. 
+ */ +int pci_resource_alignment(struct pci_dev *dev, int resno) +{ + resource_size_t align; + struct resource *res = dev->resource + resno; + + align = resource_alignment(res); + if (align) + return align; + + dev_err(&dev->dev, "alignment: invalid resource #%d\n", resno); + return 0; +} + static void __devinit pci_no_domains(void) { #ifdef CONFIG_PCI_DOMAINS diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index fbbc6ad..baa3d23 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -166,6 +166,7 @@ enum pci_bar_type { extern int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type, struct resource *res, unsigned int reg); +extern int pci_resource_alignment(struct pci_dev *dev, int resno); extern void pci_enable_ari(struct pci_dev *dev); /** * pci_ari_enabled - query ARI forwarding status diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c index ea979f2..90a9c0a 100644 --- a/drivers/pci/setup-bus.c +++ b/drivers/pci/setup-bus.c @@ -25,6 +25,7 @@ #include #include #include +#include "pci.h" static void pbus_assign_resources_sorted(struct pci_bus *bus) @@ -351,8 +352,7 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask, unsigned long if (r->parent || (r->flags & mask) != type) continue; r_size = resource_size(r); - /* For bridges size != alignment */ - align = resource_alignment(r); + align = pci_resource_alignment(dev, i); order = __ffs(align) - 20; if (order > 11) { dev_warn(&dev->dev, "BAR %d bad alignment %llx: " diff --git a/drivers/pci/setup-res.c b/drivers/pci/setup-res.c index b7ca679..88a9c70 100644 --- a/drivers/pci/setup-res.c +++ b/drivers/pci/setup-res.c @@ -133,7 +133,7 @@ int pci_assign_resource(struct pci_dev *dev, int resno) size = resource_size(res); min = (res->flags & IORESOURCE_IO) ? PCIBIOS_MIN_IO : PCIBIOS_MIN_MEM; - align = resource_alignment(res); + align = pci_resource_alignment(dev, resno); if (!align) { dev_err(&dev->dev, "BAR %d: can't allocate resource (bogus " "alignment) %pR flags %#lx\n", @@ -224,7 +224,7 @@ void pdev_sort_resources(struct pci_dev *dev, struct resource_list *head) if (!(r->flags) || r->parent) continue; - r_align = resource_alignment(r); + r_align = pci_resource_alignment(dev, i); if (!r_align) { dev_warn(&dev->dev, "BAR %d: bogus alignment " "%pR flags %#lx\n", @@ -236,7 +236,8 @@ void pdev_sort_resources(struct pci_dev *dev, struct resource_list *head) struct resource_list *ln = list->next; if (ln) - align = resource_alignment(ln->res); + align = pci_resource_alignment(ln->dev, + ln->res - ln->dev->resource); if (r_align > align) { tmp = kmalloc(sizeof(*tmp), GFP_KERNEL); -- 1.5.6.4 From yu.zhao at intel.com Tue Oct 21 04:48:27 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Tue, 21 Oct 2008 19:48:27 +0800 Subject: [PATCH 6/15 v5] PCI: add a new function to map BAR offset In-Reply-To: <20081021114056.GA3185@yzhao12-linux.sh.intel.com> References: <20081021114056.GA3185@yzhao12-linux.sh.intel.com> Message-ID: <20081021114827.GG3185@yzhao12-linux.sh.intel.com> Add a new function to map resource number to base register (offset and type). 
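The intended call pattern is the one used by the pci_update_resource() hunk in this patch; a condensed sketch follows (function name illustrative, and the ROM enable-bit handling is simplified relative to the real code, which also checks IORESOURCE_ROM_ENABLE):

	/* Sketch: program the BAR backing resource 'resno' with 'val'. */
	static void sketch_program_bar(struct pci_dev *dev, int resno, u32 val)
	{
		enum pci_bar_type type;
		int reg;

		reg = pci_resource_bar(dev, resno, &type);
		if (!reg)
			return;		/* not backed by a config-space BAR */

		if (type != pci_bar_unknown)	/* ROM BAR */
			val |= PCI_ROM_ADDRESS_ENABLE;

		pci_write_config_dword(dev, reg, val);
	}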
Cc: Jesse Barnes Cc: Randy Dunlap Cc: Grant Grundler Cc: Alex Chiang Cc: Matthew Wilcox Cc: Roland Dreier Cc: Greg KH Signed-off-by: Yu Zhao --- drivers/pci/pci.c | 22 ++++++++++++++++++++++ drivers/pci/pci.h | 2 ++ drivers/pci/setup-res.c | 13 +++++-------- 3 files changed, 29 insertions(+), 8 deletions(-) diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index a9b554e..b02167a 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -1924,6 +1924,28 @@ int pci_resource_alignment(struct pci_dev *dev, int resno) return 0; } +/** + * pci_resource_bar - get position of the BAR associated with a resource + * @dev: the PCI device + * @resno: the resource number + * @type: the BAR type to be filled in + * + * Returns BAR position in config space, or 0 if the BAR is invalid. + */ +int pci_resource_bar(struct pci_dev *dev, int resno, enum pci_bar_type *type) +{ + if (resno < PCI_ROM_RESOURCE) { + *type = pci_bar_unknown; + return PCI_BASE_ADDRESS_0 + 4 * resno; + } else if (resno == PCI_ROM_RESOURCE) { + *type = pci_bar_mem32; + return dev->rom_base_reg; + } + + dev_err(&dev->dev, "BAR: invalid resource #%d\n", resno); + return 0; +} + static void __devinit pci_no_domains(void) { #ifdef CONFIG_PCI_DOMAINS diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index baa3d23..d707477 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -167,6 +167,8 @@ enum pci_bar_type { extern int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type, struct resource *res, unsigned int reg); extern int pci_resource_alignment(struct pci_dev *dev, int resno); +extern int pci_resource_bar(struct pci_dev *dev, int resno, + enum pci_bar_type *type); extern void pci_enable_ari(struct pci_dev *dev); /** * pci_ari_enabled - query ARI forwarding status diff --git a/drivers/pci/setup-res.c b/drivers/pci/setup-res.c index 88a9c70..5812f4b 100644 --- a/drivers/pci/setup-res.c +++ b/drivers/pci/setup-res.c @@ -31,6 +31,7 @@ void pci_update_resource(struct pci_dev *dev, int resno) struct pci_bus_region region; u32 new, check, mask; int reg; + enum pci_bar_type type; struct resource *res = dev->resource + resno; /* @@ -62,17 +63,13 @@ void pci_update_resource(struct pci_dev *dev, int resno) else mask = (u32)PCI_BASE_ADDRESS_MEM_MASK; - if (resno < 6) { - reg = PCI_BASE_ADDRESS_0 + 4 * resno; - } else if (resno == PCI_ROM_RESOURCE) { + reg = pci_resource_bar(dev, resno, &type); + if (!reg) + return; + if (type != pci_bar_unknown) { if (!(res->flags & IORESOURCE_ROM_ENABLE)) return; new |= PCI_ROM_ADDRESS_ENABLE; - reg = dev->rom_base_reg; - } else { - /* Hmm, non-standard resource. */ - - return; /* kill uninitialised var warning */ } pci_write_config_dword(dev, reg, new); -- 1.5.6.4 From yu.zhao at intel.com Tue Oct 21 04:48:52 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Tue, 21 Oct 2008 19:48:52 +0800 Subject: [PATCH 7/15 v5] PCI: cleanup pcibios_allocate_resources() In-Reply-To: <20081021114056.GA3185@yzhao12-linux.sh.intel.com> References: <20081021114056.GA3185@yzhao12-linux.sh.intel.com> Message-ID: <20081021114852.GH3185@yzhao12-linux.sh.intel.com> This cleanup makes pcibios_allocate_resources() easier to be read. 
Cc: Jesse Barnes Cc: Randy Dunlap Cc: Grant Grundler Cc: Alex Chiang Cc: Matthew Wilcox Cc: Roland Dreier Cc: Greg KH Signed-off-by: Yu Zhao --- arch/x86/pci/i386.c | 28 ++++++++++++++-------------- 1 files changed, 14 insertions(+), 14 deletions(-) diff --git a/arch/x86/pci/i386.c b/arch/x86/pci/i386.c index 844df0c..8729bde 100644 --- a/arch/x86/pci/i386.c +++ b/arch/x86/pci/i386.c @@ -147,7 +147,7 @@ static void __init pcibios_allocate_bus_resources(struct list_head *bus_list) static void __init pcibios_allocate_resources(int pass) { struct pci_dev *dev = NULL; - int idx, disabled; + int idx, enabled; u16 command; struct resource *r, *pr; @@ -160,22 +160,22 @@ static void __init pcibios_allocate_resources(int pass) if (!r->start) /* Address not assigned at all */ continue; if (r->flags & IORESOURCE_IO) - disabled = !(command & PCI_COMMAND_IO); + enabled = command & PCI_COMMAND_IO; else - disabled = !(command & PCI_COMMAND_MEMORY); - if (pass == disabled) { - dev_dbg(&dev->dev, "resource %#08llx-%#08llx (f=%lx, d=%d, p=%d)\n", + enabled = command & PCI_COMMAND_MEMORY; + if (pass == enabled) + continue; + dev_dbg(&dev->dev, "resource %#08llx-%#08llx (f=%lx, d=%d, p=%d)\n", (unsigned long long) r->start, (unsigned long long) r->end, - r->flags, disabled, pass); - pr = pci_find_parent_resource(dev, r); - if (!pr || request_resource(pr, r) < 0) { - dev_err(&dev->dev, "BAR %d: can't allocate resource\n", idx); - /* We'll assign a new address later */ - r->end -= r->start; - r->start = 0; - } - } + r->flags, enabled, pass); + pr = pci_find_parent_resource(dev, r); + if (pr && !request_resource(pr, r)) + continue; + dev_err(&dev->dev, "BAR %d: can't allocate resource\n", idx); + /* We'll assign a new address later */ + r->end -= r->start; + r->start = 0; } if (!pass) { r = &dev->resource[PCI_ROM_RESOURCE]; -- 1.5.6.4 From yu.zhao at intel.com Tue Oct 21 04:49:59 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Tue, 21 Oct 2008 19:49:59 +0800 Subject: [PATCH 8/15 v5] PCI: add boot options to reassign resources In-Reply-To: <20081021114056.GA3185@yzhao12-linux.sh.intel.com> References: <20081021114056.GA3185@yzhao12-linux.sh.intel.com> Message-ID: <20081021114959.GI3185@yzhao12-linux.sh.intel.com> This patch adds boot options so user can reassign device resources of all devices under a bus. The boot options can be used as: pci=assign-mmio=0000:01;0000:02,assign-pio=0000:03 '0000' and '01/02/03' are domain and bus number. Cc: Jesse Barnes Cc: Randy Dunlap Cc: Grant Grundler Cc: Alex Chiang Cc: Matthew Wilcox Cc: Roland Dreier Cc: Greg KH Signed-off-by: Yu Zhao --- arch/x86/pci/common.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++++ arch/x86/pci/i386.c | 10 ++++--- arch/x86/pci/pci.h | 3 ++ 3 files changed, 80 insertions(+), 4 deletions(-) diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c index b67732b..0774a67 100644 --- a/arch/x86/pci/common.c +++ b/arch/x86/pci/common.c @@ -137,6 +137,70 @@ static void __devinit pcibios_fixup_device_resources(struct pci_dev *dev) } } +static char *pci_assign_pio; +static char *pci_assign_mmio; + +static int pcibios_bus_resource_needs_fixup(struct pci_bus *bus) +{ + int i; + int type = 0; + int domain, busnr; + + if (!bus->self) + return 0; + + for (i = 0; i < 2; i++) { + char *str = i ? pci_assign_pio : pci_assign_mmio; + while (str && *str) { + if (sscanf(str, "%04x:%02x", &domain, &busnr) != 2) { + if (sscanf(str, "%02x", &busnr) != 1) + break; + domain = 0; + } + + if (pci_domain_nr(bus) == domain && + bus->number == busnr) { + type |= i ? 
IORESOURCE_IO : IORESOURCE_MEM; + break; + } + + str = strchr(str, ';'); + if (str) + str++; + } + } + + return type; +} + +static void __devinit pcibios_fixup_bus_resources(struct pci_bus *bus) +{ + int i; + int type = pcibios_bus_resource_needs_fixup(bus); + + if (!type) + return; + + for (i = 0; i < PCI_BUS_NUM_RESOURCES; i++) { + struct resource *res = bus->resource[i]; + if (!res) + continue; + if (res->flags & type) + res->flags = 0; + } +} + +int pcibios_resource_needs_fixup(struct pci_dev *dev, int resno) +{ + struct pci_bus *bus; + + for (bus = dev->bus; bus && bus != pci_root_bus; bus = bus->parent) + if (pcibios_bus_resource_needs_fixup(bus)) + return 1; + + return 0; +} + /* * Called after each bus is probed, but before its children * are examined. @@ -147,6 +211,7 @@ void __devinit pcibios_fixup_bus(struct pci_bus *b) struct pci_dev *dev; pci_read_bridge_bases(b); + pcibios_fixup_bus_resources(b); list_for_each_entry(dev, &b->devices, bus_list) pcibios_fixup_device_resources(dev); } @@ -519,6 +584,12 @@ char * __devinit pcibios_setup(char *str) } else if (!strcmp(str, "skip_isa_align")) { pci_probe |= PCI_CAN_SKIP_ISA_ALIGN; return NULL; + } else if (!strncmp(str, "assign-pio=", 11)) { + pci_assign_pio = str + 11; + return NULL; + } else if (!strncmp(str, "assign-mmio=", 12)) { + pci_assign_mmio = str + 12; + return NULL; } return str; } diff --git a/arch/x86/pci/i386.c b/arch/x86/pci/i386.c index 8729bde..ea82a5b 100644 --- a/arch/x86/pci/i386.c +++ b/arch/x86/pci/i386.c @@ -169,10 +169,12 @@ static void __init pcibios_allocate_resources(int pass) (unsigned long long) r->start, (unsigned long long) r->end, r->flags, enabled, pass); - pr = pci_find_parent_resource(dev, r); - if (pr && !request_resource(pr, r)) - continue; - dev_err(&dev->dev, "BAR %d: can't allocate resource\n", idx); + if (!pcibios_resource_needs_fixup(dev, idx)) { + pr = pci_find_parent_resource(dev, r); + if (pr && !request_resource(pr, r)) + continue; + dev_err(&dev->dev, "BAR %d: can't allocate resource\n", idx); + } /* We'll assign a new address later */ r->end -= r->start; r->start = 0; diff --git a/arch/x86/pci/pci.h b/arch/x86/pci/pci.h index 15b9cf6..f22737d 100644 --- a/arch/x86/pci/pci.h +++ b/arch/x86/pci/pci.h @@ -117,6 +117,9 @@ extern int __init pcibios_init(void); extern int __init pci_mmcfg_arch_init(void); extern void __init pci_mmcfg_arch_free(void); +/* pci-common.c */ +extern int pcibios_resource_needs_fixup(struct pci_dev *dev, int resno); + /* * AMD Fam10h CPUs are buggy, and cannot access MMIO config space * on their northbrige except through the * %eax register. As such, you MUST -- 1.5.6.4 From yu.zhao at intel.com Tue Oct 21 04:50:34 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Tue, 21 Oct 2008 19:50:34 +0800 Subject: [PATCH 9/15 v5] PCI: add boot option to align MMIO resource In-Reply-To: <20081021114056.GA3185@yzhao12-linux.sh.intel.com> References: <20081021114056.GA3185@yzhao12-linux.sh.intel.com> Message-ID: <20081021115034.GJ3185@yzhao12-linux.sh.intel.com> This patch adds boot option to align MMIO resource for a device. The alignment is a bigger value between the PAGE_SIZE and the resource size. The boot option can be used as: pci=align-mmio=0000:01:02.3;04:05.6 '0000:01:02.3' and '04:05.6' are domain, bus, device and function numbers. 
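Concretely, combined with the pci_resource_alignment() change below, the effective alignment of each MMIO BAR of a listed device becomes (shorthand for the ternary expression in the patch):

	align = max(resource_alignment(res), PAGE_SIZE);

so, assuming a 4 KB page size, a 1 KB BAR is placed on a 4 KB boundary while a 64 KB BAR keeps its natural 64 KB alignment.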
Cc: Jesse Barnes Cc: Randy Dunlap Cc: Grant Grundler Cc: Alex Chiang Cc: Matthew Wilcox Cc: Roland Dreier Cc: Greg KH Signed-off-by: Yu Zhao --- arch/x86/pci/common.c | 37 +++++++++++++++++++++++++++++++++++++ drivers/pci/pci.c | 20 ++++++++++++++++++-- include/linux/pci.h | 1 + 3 files changed, 56 insertions(+), 2 deletions(-) diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c index 0774a67..c1821e3 100644 --- a/arch/x86/pci/common.c +++ b/arch/x86/pci/common.c @@ -139,6 +139,7 @@ static void __devinit pcibios_fixup_device_resources(struct pci_dev *dev) static char *pci_assign_pio; static char *pci_assign_mmio; +static char *pci_align_mmio; static int pcibios_bus_resource_needs_fixup(struct pci_bus *bus) { @@ -190,6 +191,36 @@ static void __devinit pcibios_fixup_bus_resources(struct pci_bus *bus) } } +int pcibios_resource_alignment(struct pci_dev *dev, int resno) +{ + int domain, busnr, slot, func; + char *str = pci_align_mmio; + + if (dev->resource[resno].flags & IORESOURCE_IO) + return 0; + + while (str && *str) { + if (sscanf(str, "%04x:%02x:%02x.%d", + &domain, &busnr, &slot, &func) != 4) { + if (sscanf(str, "%02x:%02x.%d", + &busnr, &slot, &func) != 3) + break; + domain = 0; + } + + if (pci_domain_nr(dev->bus) == domain && + dev->bus->number == busnr && + dev->devfn == PCI_DEVFN(slot, func)) + return PAGE_SIZE; + + str = strchr(str, ';'); + if (str) + str++; + } + + return 0; +} + int pcibios_resource_needs_fixup(struct pci_dev *dev, int resno) { struct pci_bus *bus; @@ -198,6 +229,9 @@ int pcibios_resource_needs_fixup(struct pci_dev *dev, int resno) if (pcibios_bus_resource_needs_fixup(bus)) return 1; + if (pcibios_resource_alignment(dev, resno)) + return 1; + return 0; } @@ -590,6 +624,9 @@ char * __devinit pcibios_setup(char *str) } else if (!strncmp(str, "assign-mmio=", 12)) { pci_assign_mmio = str + 12; return NULL; + } else if (!strncmp(str, "align-mmio=", 11)) { + pci_align_mmio = str + 11; + return NULL; } return str; } diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index b02167a..11ecd6f 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -1015,6 +1015,20 @@ int __attribute__ ((weak)) pcibios_set_pcie_reset_state(struct pci_dev *dev, } /** + * pcibios_resource_alignment - get resource alignment requirement + * @dev: the PCI device + * @resno: resource number + * + * Queries the resource alignment from PCI low level code. Returns positive + * if there is alignment requirement of the resource, or 0 otherwise. + */ +int __attribute__ ((weak)) pcibios_resource_alignment(struct pci_dev *dev, + int resno) +{ + return 0; +} + +/** * pci_set_pcie_reset_state - set reset state for device dev * @dev: the PCI-E device reset * @state: Reset state to enter into @@ -1913,12 +1927,14 @@ int pci_select_bars(struct pci_dev *dev, unsigned long flags) */ int pci_resource_alignment(struct pci_dev *dev, int resno) { - resource_size_t align; + resource_size_t align, bios_align; struct resource *res = dev->resource + resno; + bios_align = pcibios_resource_alignment(dev, resno); + align = resource_alignment(res); if (align) - return align; + return align > bios_align ? 
align : bios_align; dev_err(&dev->dev, "alignment: invalid resource #%d\n", resno); return 0; diff --git a/include/linux/pci.h b/include/linux/pci.h index 2ada2b6..6ac69af 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -1121,6 +1121,7 @@ int pcibios_add_platform_entries(struct pci_dev *dev); void pcibios_disable_device(struct pci_dev *dev); int pcibios_set_pcie_reset_state(struct pci_dev *dev, enum pcie_reset_state state); +int pcibios_resource_alignment(struct pci_dev *dev, int resno); #ifdef CONFIG_PCI_MMCONFIG extern void __init pci_mmcfg_early_init(void); -- 1.5.6.4 From yu.zhao at intel.com Tue Oct 21 04:51:05 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Tue, 21 Oct 2008 19:51:05 +0800 Subject: [PATCH 10/15 v5] PCI: cleanup pci_bus_add_devices() In-Reply-To: <20081021114056.GA3185@yzhao12-linux.sh.intel.com> References: <20081021114056.GA3185@yzhao12-linux.sh.intel.com> Message-ID: <20081021115105.GK3185@yzhao12-linux.sh.intel.com> This cleanup makes pci_bus_add_devices() easier to be read. And it checks if a bus has been added before removing it. Cc: Jesse Barnes Cc: Randy Dunlap Cc: Grant Grundler Cc: Alex Chiang Cc: Matthew Wilcox Cc: Roland Dreier Cc: Greg KH Signed-off-by: Yu Zhao --- drivers/pci/bus.c | 56 +++++++++++++++++++++++++------------------------ drivers/pci/remove.c | 2 + 2 files changed, 31 insertions(+), 27 deletions(-) diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c index 999cc40..7a21602 100644 --- a/drivers/pci/bus.c +++ b/drivers/pci/bus.c @@ -71,7 +71,7 @@ pci_bus_alloc_resource(struct pci_bus *bus, struct resource *res, } /** - * add a single device + * pci_bus_add_device - add a single device * @dev: device to add * * This adds a single pci device to the global @@ -105,7 +105,7 @@ int pci_bus_add_device(struct pci_dev *dev) void pci_bus_add_devices(struct pci_bus *bus) { struct pci_dev *dev; - struct pci_bus *child_bus; + struct pci_bus *child; int retval; list_for_each_entry(dev, &bus->devices, bus_list) { @@ -120,39 +120,41 @@ void pci_bus_add_devices(struct pci_bus *bus) list_for_each_entry(dev, &bus->devices, bus_list) { BUG_ON(!dev->is_added); + child = dev->subordinate; /* * If there is an unattached subordinate bus, attach * it and then scan for unattached PCI devices. */ - if (dev->subordinate) { - if (list_empty(&dev->subordinate->node)) { - down_write(&pci_bus_sem); - list_add_tail(&dev->subordinate->node, - &dev->bus->children); - up_write(&pci_bus_sem); - } - pci_bus_add_devices(dev->subordinate); - - /* register the bus with sysfs as the parent is now - * properly registered. */ - child_bus = dev->subordinate; - if (child_bus->is_added) - continue; - child_bus->dev.parent = child_bus->bridge; - retval = device_register(&child_bus->dev); - if (retval) - dev_err(&dev->dev, "Error registering pci_bus," - " continuing...\n"); - else { - child_bus->is_added = 1; - retval = device_create_file(&child_bus->dev, - &dev_attr_cpuaffinity); - } + if (!child) + continue; + if (list_empty(&child->node)) { + down_write(&pci_bus_sem); + list_add_tail(&child->node, + &dev->bus->children); + up_write(&pci_bus_sem); + } + pci_bus_add_devices(child); + + /* + * register the bus with sysfs as the parent is now + * properly registered. 
+ */ + if (child->is_added) + continue; + child->dev.parent = child->bridge; + retval = device_register(&child->dev); + if (retval) + dev_err(&dev->dev, "Error registering pci_bus," + " continuing...\n"); + else { + child->is_added = 1; + retval = device_create_file(&child->dev, + &dev_attr_cpuaffinity); if (retval) dev_err(&dev->dev, "Error creating cpuaffinity" " file, continuing...\n"); - retval = device_create_file(&child_bus->dev, + retval = device_create_file(&child->dev, &dev_attr_cpulistaffinity); if (retval) dev_err(&dev->dev, diff --git a/drivers/pci/remove.c b/drivers/pci/remove.c index 042e089..bfa0869 100644 --- a/drivers/pci/remove.c +++ b/drivers/pci/remove.c @@ -72,6 +72,8 @@ void pci_remove_bus(struct pci_bus *pci_bus) list_del(&pci_bus->node); up_write(&pci_bus_sem); pci_remove_legacy_files(pci_bus); + if (!pci_bus->is_added) + return; device_remove_file(&pci_bus->dev, &dev_attr_cpuaffinity); device_remove_file(&pci_bus->dev, &dev_attr_cpulistaffinity); device_unregister(&pci_bus->dev); -- 1.5.6.4 From yu.zhao at intel.com Tue Oct 21 04:52:42 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Tue, 21 Oct 2008 19:52:42 +0800 Subject: [PATCH 11/15 v5] PCI: split a new function from pci_bus_add_devices() In-Reply-To: <20081021114056.GA3185@yzhao12-linux.sh.intel.com> References: <20081021114056.GA3185@yzhao12-linux.sh.intel.com> Message-ID: <20081021115242.GL3185@yzhao12-linux.sh.intel.com> This patch splits a new function from pci_bus_add_devices(). The new function can be used to register a PCI bus to device core and create its sysfs entries. Cc: Jesse Barnes Cc: Randy Dunlap Cc: Grant Grundler Cc: Alex Chiang Cc: Matthew Wilcox Cc: Roland Dreier Cc: Greg KH Signed-off-by: Yu Zhao --- drivers/pci/bus.c | 48 +++++++++++++++++++++++++++++------------------- include/linux/pci.h | 1 + 2 files changed, 30 insertions(+), 19 deletions(-) diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c index 7a21602..303ca74 100644 --- a/drivers/pci/bus.c +++ b/drivers/pci/bus.c @@ -91,6 +91,33 @@ int pci_bus_add_device(struct pci_dev *dev) } /** + * pci_bus_add_child - add a child bus + * @bus: bus to add + * + * This adds sysfs entries for a single bus + */ +int pci_bus_add_child(struct pci_bus *bus) +{ + int retval; + + if (bus->bridge) + bus->dev.parent = bus->bridge; + + retval = device_register(&bus->dev); + if (retval) + return retval; + + bus->is_added = 1; + + retval = device_create_file(&bus->dev, &dev_attr_cpuaffinity); + if (retval) + return retval; + + retval = device_create_file(&bus->dev, &dev_attr_cpulistaffinity); + return retval; +} + +/** * pci_bus_add_devices - insert newly discovered PCI devices * @bus: bus to check for new devices * @@ -141,26 +168,9 @@ void pci_bus_add_devices(struct pci_bus *bus) */ if (child->is_added) continue; - child->dev.parent = child->bridge; - retval = device_register(&child->dev); + retval = pci_bus_add_child(child); if (retval) - dev_err(&dev->dev, "Error registering pci_bus," - " continuing...\n"); - else { - child->is_added = 1; - retval = device_create_file(&child->dev, - &dev_attr_cpuaffinity); - if (retval) - dev_err(&dev->dev, "Error creating cpuaffinity" - " file, continuing...\n"); - - retval = device_create_file(&child->dev, - &dev_attr_cpulistaffinity); - if (retval) - dev_err(&dev->dev, - "Error creating cpulistaffinity" - " file, continuing...\n"); - } + dev_err(&dev->dev, "Error adding bus, continuing\n"); } } diff --git a/include/linux/pci.h b/include/linux/pci.h index 6ac69af..80d88f8 100644 --- a/include/linux/pci.h +++ 
b/include/linux/pci.h @@ -528,6 +528,7 @@ struct pci_dev *pci_scan_single_device(struct pci_bus *bus, int devfn); void pci_device_add(struct pci_dev *dev, struct pci_bus *bus); unsigned int pci_scan_child_bus(struct pci_bus *bus); int __must_check pci_bus_add_device(struct pci_dev *dev); +int pci_bus_add_child(struct pci_bus *bus); void pci_read_bridge_bases(struct pci_bus *child); struct resource *pci_find_parent_resource(const struct pci_dev *dev, struct resource *res); -- 1.5.6.4 From yu.zhao at intel.com Tue Oct 21 04:53:08 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Tue, 21 Oct 2008 19:53:08 +0800 Subject: [PATCH 12/15 v5] PCI: support the SR-IOV capability In-Reply-To: <20081021114056.GA3185@yzhao12-linux.sh.intel.com> References: <20081021114056.GA3185@yzhao12-linux.sh.intel.com> Message-ID: <20081021115308.GM3185@yzhao12-linux.sh.intel.com> Support Single Root I/O Virtualization (SR-IOV) capability. Cc: Jesse Barnes Cc: Randy Dunlap Cc: Grant Grundler Cc: Alex Chiang Cc: Matthew Wilcox Cc: Roland Dreier Cc: Greg KH Signed-off-by: Yu Zhao --- drivers/pci/Kconfig | 12 + drivers/pci/Makefile | 2 + drivers/pci/iov.c | 616 ++++++++++++++++++++++++++++++++++++++++++++++ drivers/pci/pci-sysfs.c | 4 + drivers/pci/pci.c | 14 + drivers/pci/pci.h | 48 ++++ drivers/pci/probe.c | 4 + include/linux/pci.h | 40 +++ include/linux/pci_regs.h | 21 ++ 9 files changed, 761 insertions(+), 0 deletions(-) create mode 100644 drivers/pci/iov.c diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig index e1ca425..e7c0836 100644 --- a/drivers/pci/Kconfig +++ b/drivers/pci/Kconfig @@ -50,3 +50,15 @@ config HT_IRQ This allows native hypertransport devices to use interrupts. If unsure say Y. + +config PCI_IOV + bool "PCI SR-IOV support" + depends on PCI + select PCI_MSI + default n + help + This option allows device drivers to enable Single Root I/O + Virtualization. Each Virtual Function's PCI configuration + space can be accessed using its own Bus, Device and Function + Number (Routing ID). Each Virtual Function also has PCI Memory + Space, which is used to map its own register set. diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile index 4b47f4e..abbfcfa 100644 --- a/drivers/pci/Makefile +++ b/drivers/pci/Makefile @@ -55,3 +55,5 @@ obj-$(CONFIG_PCI_SYSCALL) += syscall.o ifeq ($(CONFIG_PCI_DEBUG),y) EXTRA_CFLAGS += -DDEBUG endif + +obj-$(CONFIG_PCI_IOV) += iov.o diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c new file mode 100644 index 0000000..571a46c --- /dev/null +++ b/drivers/pci/iov.c @@ -0,0 +1,616 @@ +/* + * drivers/pci/iov.c + * + * Copyright (C) 2008 Intel Corporation + * + * PCI Express Single Root I/O Virtualization capability support. 
+ */ + +#include +#include +#include +#include +#include +#include "pci.h" + + +#define iov_config_attr(field) \ +static ssize_t field##_show(struct device *dev, \ + struct device_attribute *attr, char *buf) \ +{ \ + struct pci_dev *pdev = to_pci_dev(dev); \ + return sprintf(buf, "%d\n", pdev->iov->field); \ +} + +iov_config_attr(status); +iov_config_attr(totalvfs); +iov_config_attr(initialvfs); +iov_config_attr(numvfs); + +static inline void vf_rid(struct pci_dev *dev, int vfn, u8 *busnr, u8 *devfn) +{ + u16 rid; + + rid = (dev->bus->number << 8) + dev->devfn + + dev->iov->offset + dev->iov->stride * vfn; + *busnr = rid >> 8; + *devfn = rid & 0xff; +} + +static int vf_add(struct pci_dev *dev, int vfn) +{ + int i; + int rc; + u8 busnr, devfn; + struct pci_dev *vf; + struct pci_bus *bus; + struct resource *res; + resource_size_t size; + + vf_rid(dev, vfn, &busnr, &devfn); + + vf = alloc_pci_dev(); + if (!vf) + return -ENOMEM; + + if (dev->bus->number == busnr) + vf->bus = bus = dev->bus; + else { + list_for_each_entry(bus, &dev->bus->children, node) + if (bus->number == busnr) { + vf->bus = bus; + break; + } + BUG_ON(!vf->bus); + } + + vf->sysdata = bus->sysdata; + vf->dev.parent = dev->dev.parent; + vf->dev.bus = dev->dev.bus; + vf->devfn = devfn; + vf->hdr_type = PCI_HEADER_TYPE_NORMAL; + vf->multifunction = 0; + vf->vendor = dev->vendor; + pci_read_config_word(dev, dev->iov->cap + PCI_IOV_VF_DID, &vf->device); + vf->cfg_size = PCI_CFG_SPACE_EXP_SIZE; + vf->error_state = pci_channel_io_normal; + vf->is_pcie = 1; + vf->pcie_type = PCI_EXP_TYPE_ENDPOINT; + vf->dma_mask = 0xffffffff; + + dev_set_name(&vf->dev, "%04x:%02x:%02x.%d", pci_domain_nr(bus), + busnr, PCI_SLOT(devfn), PCI_FUNC(devfn)); + + pci_read_config_byte(vf, PCI_REVISION_ID, &vf->revision); + vf->class = dev->class; + vf->current_state = PCI_UNKNOWN; + vf->irq = 0; + + for (i = 0; i < PCI_IOV_NUM_BAR; i++) { + res = dev->resource + PCI_IOV_RESOURCES + i; + if (!res->parent) + continue; + vf->resource[i].name = pci_name(vf); + vf->resource[i].flags = res->flags; + size = resource_size(res); + do_div(size, dev->iov->totalvfs); + vf->resource[i].start = res->start + size * vfn; + vf->resource[i].end = vf->resource[i].start + size - 1; + rc = request_resource(res, &vf->resource[i]); + BUG_ON(rc); + } + + vf->subsystem_vendor = dev->subsystem_vendor; + pci_read_config_word(vf, PCI_SUBSYSTEM_ID, &vf->subsystem_device); + + pci_device_add(vf, bus); + return pci_bus_add_device(vf); +} + +static void vf_remove(struct pci_dev *dev, int vfn) +{ + u8 busnr, devfn; + struct pci_dev *vf; + + vf_rid(dev, vfn, &busnr, &devfn); + + vf = pci_get_bus_and_slot(busnr, devfn); + if (!vf) + return; + + pci_dev_put(vf); + pci_remove_bus_device(vf); +} + +static int iov_enable(struct pci_dev *dev) +{ + int rc; + int i, j; + u16 ctrl; + struct pci_iov *iov = dev->iov; + + if (!iov->callback) + return -ENODEV; + + if (!iov->numvfs) + return -EINVAL; + + if (iov->status) + return 0; + + rc = iov->callback(dev, PCI_IOV_ENABLE); + if (rc) + return rc; + + pci_read_config_word(dev, iov->cap + PCI_IOV_CTRL, &ctrl); + ctrl |= (PCI_IOV_CTRL_VFE | PCI_IOV_CTRL_MSE); + pci_write_config_word(dev, iov->cap + PCI_IOV_CTRL, ctrl); + ssleep(1); + + for (i = 0; i < iov->numvfs; i++) { + rc = vf_add(dev, i); + if (rc) + goto failed; + } + + iov->status = 1; + return 0; + +failed: + for (j = 0; j < i; j++) + vf_remove(dev, j); + + pci_read_config_word(dev, iov->cap + PCI_IOV_CTRL, &ctrl); + ctrl &= ~(PCI_IOV_CTRL_VFE | PCI_IOV_CTRL_MSE); + pci_write_config_word(dev, 
iov->cap + PCI_IOV_CTRL, ctrl); + ssleep(1); + + return rc; +} + +static int iov_disable(struct pci_dev *dev) +{ + int i; + int rc; + u16 ctrl; + struct pci_iov *iov = dev->iov; + + if (!iov->callback) + return -ENODEV; + + if (!iov->status) + return 0; + + rc = iov->callback(dev, PCI_IOV_DISABLE); + if (rc) + return rc; + + for (i = 0; i < iov->numvfs; i++) + vf_remove(dev, i); + + pci_read_config_word(dev, iov->cap + PCI_IOV_CTRL, &ctrl); + ctrl &= ~(PCI_IOV_CTRL_VFE | PCI_IOV_CTRL_MSE); + pci_write_config_word(dev, iov->cap + PCI_IOV_CTRL, ctrl); + ssleep(1); + + iov->status = 0; + return 0; +} + +static int iov_set_numvfs(struct pci_dev *dev, int numvfs) +{ + int rc; + u16 offset, stride; + struct pci_iov *iov = dev->iov; + + if (!iov->callback) + return -ENODEV; + + if (numvfs < 0 || numvfs > iov->initialvfs || iov->status) + return -EINVAL; + + if (numvfs == iov->numvfs) + return 0; + + rc = iov->callback(dev, PCI_IOV_NUMVFS | iov->numvfs); + if (rc) + return rc; + + pci_write_config_word(dev, iov->cap + PCI_IOV_NUM_VF, numvfs); + pci_read_config_word(dev, iov->cap + PCI_IOV_VF_OFFSET, &offset); + pci_read_config_word(dev, iov->cap + PCI_IOV_VF_STRIDE, &stride); + if ((numvfs && !offset) || (numvfs > 1 && !stride)) + return -EIO; + + iov->offset = offset; + iov->stride = stride; + iov->numvfs = numvfs; + return 0; +} + +static ssize_t status_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + int rc; + long enable; + struct pci_dev *pdev = to_pci_dev(dev); + + rc = strict_strtol(buf, 0, &enable); + if (rc) + return rc; + + mutex_lock(&pdev->iov->ops_lock); + switch (enable) { + case 0: + rc = iov_disable(pdev); + break; + case 1: + rc = iov_enable(pdev); + break; + default: + rc = -EINVAL; + } + mutex_unlock(&pdev->iov->ops_lock); + + return rc ? rc : count; +} + +static ssize_t numvfs_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + int rc; + long numvfs; + struct pci_dev *pdev = to_pci_dev(dev); + + rc = strict_strtol(buf, 0, &numvfs); + if (rc) + return rc; + + mutex_lock(&pdev->iov->ops_lock); + rc = iov_set_numvfs(pdev, numvfs); + mutex_unlock(&pdev->iov->ops_lock); + + return rc ? 
rc : count; +} + +static DEVICE_ATTR(totalvfs, S_IRUGO, totalvfs_show, NULL); +static DEVICE_ATTR(initialvfs, S_IRUGO, initialvfs_show, NULL); +static DEVICE_ATTR(numvfs, S_IWUSR | S_IRUGO, numvfs_show, numvfs_store); +static DEVICE_ATTR(enable, S_IWUSR | S_IRUGO, status_show, status_store); + +static struct attribute *iov_attrs[] = { + &dev_attr_totalvfs.attr, + &dev_attr_initialvfs.attr, + &dev_attr_numvfs.attr, + &dev_attr_enable.attr, + NULL +}; + +static struct attribute_group iov_attr_group = { + .attrs = iov_attrs, + .name = "iov", +}; + +static int iov_alloc_bus(struct pci_bus *bus, int busnr) +{ + int i; + int rc; + struct pci_dev *dev; + struct pci_bus *child; + + list_for_each_entry(dev, &bus->devices, bus_list) + if (dev->iov) + break; + + BUG_ON(!dev->iov); + pci_dev_get(dev); + mutex_lock(&dev->iov->bus_lock); + + for (i = bus->number + 1; i <= busnr; i++) { + list_for_each_entry(child, &bus->children, node) + if (child->number == i) + break; + if (child->number == i) + continue; + child = pci_add_new_bus(bus, NULL, i); + if (!child) + return -ENOMEM; + + child->subordinate = i; + child->dev.parent = bus->bridge; + rc = pci_bus_add_child(child); + if (rc) + return rc; + } + + mutex_unlock(&dev->iov->bus_lock); + + return 0; +} + +static void iov_release_bus(struct pci_bus *bus) +{ + struct pci_dev *dev, *tmp; + struct pci_bus *child, *next; + + list_for_each_entry(dev, &bus->devices, bus_list) + if (dev->iov) + break; + + BUG_ON(!dev->iov); + mutex_lock(&dev->iov->bus_lock); + + list_for_each_entry(tmp, &bus->devices, bus_list) + if (tmp->iov && tmp->iov->callback) + goto done; + + list_for_each_entry_safe(child, next, &bus->children, node) + if (!child->bridge) + pci_remove_bus(child); +done: + mutex_unlock(&dev->iov->bus_lock); + pci_dev_put(dev); +} + +/** + * pci_iov_init - initialize device's SR-IOV capability + * @dev: the PCI device + * + * Returns 0 on success, or negative on failure. + * + * The major differences between Virtual Function and PCI device are: + * 1) the device with multiple bus numbers uses internal routing, so + * there is no explicit bridge device in this case. + * 2) Virtual Function memory spaces are designated by BARs encapsulated + * in the capability structure, and the BARs in Virtual Function PCI + * configuration space are read-only zero. + */ +int pci_iov_init(struct pci_dev *dev) +{ + int i; + int pos; + u32 pgsz; + u16 ctrl, total, initial, offset, stride; + struct pci_iov *iov; + struct resource *res; + + if (!dev->is_pcie || (dev->pcie_type != PCI_EXP_TYPE_RC_END && + dev->pcie_type != PCI_EXP_TYPE_ENDPOINT)) + return -ENODEV; + + pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_IOV); + if (!pos) + return -ENODEV; + + ctrl = pci_ari_enabled(dev) ? PCI_IOV_CTRL_ARI : 0; + pci_write_config_word(dev, pos + PCI_IOV_CTRL, ctrl); + ssleep(1); + + pci_read_config_word(dev, pos + PCI_IOV_TOTAL_VF, &total); + pci_read_config_word(dev, pos + PCI_IOV_INITIAL_VF, &initial); + pci_write_config_word(dev, pos + PCI_IOV_NUM_VF, initial); + pci_read_config_word(dev, pos + PCI_IOV_VF_OFFSET, &offset); + pci_read_config_word(dev, pos + PCI_IOV_VF_STRIDE, &stride); + if (!total || initial > total || (initial && !offset) || + (initial > 1 && !stride)) + return -EIO; + + pci_read_config_dword(dev, pos + PCI_IOV_SUP_PGSIZE, &pgsz); + i = PAGE_SHIFT > 12 ? 
PAGE_SHIFT - 12 : 0; + pgsz &= ~((1 << i) - 1); + if (!pgsz) + return -EIO; + + pgsz &= ~(pgsz - 1); + pci_write_config_dword(dev, pos + PCI_IOV_SYS_PGSIZE, pgsz); + + iov = kzalloc(sizeof(*iov), GFP_KERNEL); + if (!iov) + return -ENOMEM; + + iov->cap = pos; + iov->totalvfs = total; + iov->initialvfs = initial; + iov->offset = offset; + iov->stride = stride; + iov->align = pgsz << 12; + + for (i = 0; i < PCI_IOV_NUM_BAR; i++) { + res = dev->resource + PCI_IOV_RESOURCES + i; + pos = iov->cap + PCI_IOV_BAR_0 + i * 4; + i += __pci_read_base(dev, pci_bar_unknown, res, pos); + if (!res->flags) + continue; + res->flags &= ~IORESOURCE_SIZEALIGN; + res->end = res->start + resource_size(res) * total - 1; + } + + mutex_init(&iov->ops_lock); + mutex_init(&iov->bus_lock); + + dev->iov = iov; + + return 0; +} + +/** + * pci_iov_release - release resources used by SR-IOV capability + * @dev: the PCI device + */ +void pci_iov_release(struct pci_dev *dev) +{ + if (!dev->iov) + return; + + mutex_destroy(&dev->iov->ops_lock); + mutex_destroy(&dev->iov->bus_lock); + kfree(dev->iov); + dev->iov = NULL; +} + +/** + * pci_iov_create_sysfs - create sysfs for SR-IOV capability + * @dev: the PCI device + */ +void pci_iov_create_sysfs(struct pci_dev *dev) +{ + if (!dev->iov) + return; + + sysfs_create_group(&dev->dev.kobj, &iov_attr_group); +} + +/** + * pci_iov_remove_sysfs - remove sysfs of SR-IOV capability + * @dev: the PCI device + */ +void pci_iov_remove_sysfs(struct pci_dev *dev) +{ + if (!dev->iov) + return; + + sysfs_remove_group(&dev->dev.kobj, &iov_attr_group); +} + +/** + * pci_iov_bus_range - find bus range used by SR-IOV capability + * @bus: the PCI bus + * + * Returns max number of buses (exclude current one) used by Virtual + * Functions. + */ +int pci_iov_bus_range(struct pci_bus *bus) +{ + int max = 0; + u8 busnr, devfn; + struct pci_dev *dev; + + list_for_each_entry(dev, &bus->devices, bus_list) { + if (!dev->iov) + continue; + vf_rid(dev, dev->iov->totalvfs - 1, &busnr, &devfn); + if (busnr > max) + max = busnr; + } + + return max ? max - bus->number : 0; +} + +int pci_iov_resource_align(struct pci_dev *dev, int resno) +{ + if (resno < PCI_IOV_RESOURCES || resno > PCI_IOV_RESOURCES_END) + return 0; + + BUG_ON(!dev->iov); + + return dev->iov->align; +} + +int pci_iov_resource_bar(struct pci_dev *dev, int resno, + enum pci_bar_type *type) +{ + if (resno < PCI_IOV_RESOURCES || resno > PCI_IOV_RESOURCES_END) + return 0; + + BUG_ON(!dev->iov); + + *type = pci_bar_unknown; + return dev->iov->cap + PCI_IOV_BAR_0 + + 4 * (resno - PCI_IOV_RESOURCES); +} + +/** + * pci_iov_register - register SR-IOV service + * @dev: the PCI device + * @callback: callback function for SR-IOV events + * + * Returns 0 on success, or negative on failure. 
+ */ +int pci_iov_register(struct pci_dev *dev, + int (*callback)(struct pci_dev *, u32)) +{ + u8 busnr, devfn; + struct pci_iov *iov = dev->iov; + + if (!iov) + return -ENODEV; + + if (!callback || iov->callback) + return -EINVAL; + + vf_rid(dev, iov->totalvfs - 1, &busnr, &devfn); + if (busnr > dev->bus->subordinate) + return -EIO; + + iov->callback = callback; + return iov_alloc_bus(dev->bus, busnr); +} +EXPORT_SYMBOL_GPL(pci_iov_register); + +/** + * pci_iov_unregister - unregister SR-IOV service + * @dev: the PCI device + */ +void pci_iov_unregister(struct pci_dev *dev) +{ + struct pci_iov *iov = dev->iov; + + if (!iov || !iov->callback) + return; + + iov->callback = NULL; + iov_release_bus(dev->bus); +} +EXPORT_SYMBOL_GPL(pci_iov_unregister); + +/** + * pci_iov_enable - enable SR-IOV capability + * @dev: the PCI device + * @numvfs: number of VFs to be available + * + * Returns 0 on success, or negative on failure. + */ +int pci_iov_enable(struct pci_dev *dev, int numvfs) +{ + int rc; + struct pci_iov *iov = dev->iov; + + if (!iov) + return -ENODEV; + + if (!iov->callback) + return -EINVAL; + + mutex_lock(&iov->ops_lock); + rc = iov_set_numvfs(dev, numvfs); + if (rc) + goto done; + rc = iov_enable(dev); +done: + mutex_unlock(&iov->ops_lock); + + return rc; +} +EXPORT_SYMBOL_GPL(pci_iov_enable); + +/** + * pci_iov_disable - disable SR-IOV capability + * @dev: the PCI device + * + * Should be called upon Physical Function driver removal, and power + * state change. All previous allocated Virtual Functions are reclaimed. + */ +void pci_iov_disable(struct pci_dev *dev) +{ + struct pci_iov *iov = dev->iov; + + if (!iov || !iov->callback) + return; + + mutex_lock(&iov->ops_lock); + iov_disable(dev); + mutex_unlock(&iov->ops_lock); +} +EXPORT_SYMBOL_GPL(pci_iov_disable); diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c index 5c456ab..18881f2 100644 --- a/drivers/pci/pci-sysfs.c +++ b/drivers/pci/pci-sysfs.c @@ -847,6 +847,9 @@ static int pci_create_capabilities_sysfs(struct pci_dev *dev) /* Active State Power Management */ pcie_aspm_create_sysfs_dev_files(dev); + /* Single Root I/O Virtualization */ + pci_iov_create_sysfs(dev); + return 0; } @@ -932,6 +935,7 @@ static void pci_remove_capabilities_sysfs(struct pci_dev *dev) } pcie_aspm_remove_sysfs_dev_files(dev); + pci_iov_remove_sysfs(dev); } /** diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 11ecd6f..10a43b2 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -1936,6 +1936,13 @@ int pci_resource_alignment(struct pci_dev *dev, int resno) if (align) return align > bios_align ? align : bios_align; + if (resno > PCI_ROM_RESOURCE && resno < PCI_BRIDGE_RESOURCES) { + /* device specific resource */ + align = pci_iov_resource_align(dev, resno); + if (align) + return align > bios_align ? 
align : bios_align; + } + dev_err(&dev->dev, "alignment: invalid resource #%d\n", resno); return 0; } @@ -1950,12 +1957,19 @@ int pci_resource_alignment(struct pci_dev *dev, int resno) */ int pci_resource_bar(struct pci_dev *dev, int resno, enum pci_bar_type *type) { + int reg; + if (resno < PCI_ROM_RESOURCE) { *type = pci_bar_unknown; return PCI_BASE_ADDRESS_0 + 4 * resno; } else if (resno == PCI_ROM_RESOURCE) { *type = pci_bar_mem32; return dev->rom_base_reg; + } else if (resno < PCI_BRIDGE_RESOURCES) { + /* device specific resource */ + reg = pci_iov_resource_bar(dev, resno, type); + if (reg) + return reg; } dev_err(&dev->dev, "BAR: invalid resource #%d\n", resno); diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index d707477..7735d92 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -181,4 +181,52 @@ static inline int pci_ari_enabled(struct pci_dev *dev) return dev->ari_enabled; } +/* Single Root I/O Virtualization */ +struct pci_iov { + int cap; /* capability position */ + int align; /* page size used to map memory space */ + int status; /* status of SR-IOV */ + u16 totalvfs; /* total VFs associated with the PF */ + u16 initialvfs; /* initial VFs associated with the PF */ + u16 numvfs; /* number of VFs available */ + u16 offset; /* first VF Routing ID offset */ + u16 stride; /* following VF stride */ + struct mutex ops_lock; /* lock for SR-IOV operations */ + struct mutex bus_lock; /* lock for VF bus */ + int (*callback)(struct pci_dev *, u32); /* event callback function */ +}; + +#ifdef CONFIG_PCI_IOV +extern int pci_iov_init(struct pci_dev *dev); +extern void pci_iov_release(struct pci_dev *dev); +void pci_iov_create_sysfs(struct pci_dev *dev); +void pci_iov_remove_sysfs(struct pci_dev *dev); +extern int pci_iov_resource_align(struct pci_dev *dev, int resno); +extern int pci_iov_resource_bar(struct pci_dev *dev, int resno, + enum pci_bar_type *type); +#else +static inline int pci_iov_init(struct pci_dev *dev) +{ + return -EIO; +} +static inline void pci_iov_release(struct pci_dev *dev) +{ +} +static inline void pci_iov_create_sysfs(struct pci_dev *dev) +{ +} +static inline void pci_iov_remove_sysfs(struct pci_dev *dev) +{ +} +static inline int pci_iov_resource_align(struct pci_dev *dev, int resno) +{ + return 0; +} +static inline int pci_iov_resource_bar(struct pci_dev *dev, int resno, + enum pci_bar_type *type) +{ + return 0; +} +#endif /* CONFIG_PCI_IOV */ + #endif /* DRIVERS_PCI_H */ diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index 4b12b58..18ce9c0 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -779,6 +779,7 @@ static int pci_setup_device(struct pci_dev * dev) static void pci_release_capabilities(struct pci_dev *dev) { pci_vpd_release(dev); + pci_iov_release(dev); } /** @@ -962,6 +963,9 @@ static void pci_init_capabilities(struct pci_dev *dev) /* Alternative Routing-ID Forwarding */ pci_enable_ari(dev); + + /* Single Root I/O Virtualization */ + pci_iov_init(dev); } void pci_device_add(struct pci_dev *dev, struct pci_bus *bus) diff --git a/include/linux/pci.h b/include/linux/pci.h index 80d88f8..e64ffa2 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -87,6 +87,12 @@ enum { /* #6: expansion ROM */ PCI_ROM_RESOURCE, + /* device specific resources */ +#ifdef CONFIG_PCI_IOV + PCI_IOV_RESOURCES, + PCI_IOV_RESOURCES_END = PCI_IOV_RESOURCES + PCI_IOV_NUM_BAR - 1, +#endif + /* address space assigned to buses behind the bridge */ #ifndef PCI_BRIDGE_RES_NUM #define PCI_BRIDGE_RES_NUM 4 @@ -165,6 +171,7 @@ struct pci_cap_saved_state { 
struct pcie_link_state; struct pci_vpd; +struct pci_iov; /* * The pci_dev structure is used to describe PCI devices. @@ -253,6 +260,7 @@ struct pci_dev { struct list_head msi_list; #endif struct pci_vpd *vpd; + struct pci_iov *iov; }; extern struct pci_dev *alloc_pci_dev(void); @@ -1147,5 +1155,37 @@ static inline void * pci_ioremap_bar(struct pci_dev *pdev, int bar) } #endif +/* SR-IOV events masks */ +#define PCI_IOV_VIRTFN_ID 0x0000FFFFU /* Virtual Function Number */ +#define PCI_IOV_NUM_VIRTFN 0x0000FFFFU /* NumVFs to be set */ +/* SR-IOV events values */ +#define PCI_IOV_ENABLE 0x00010000U /* SR-IOV enable request */ +#define PCI_IOV_DISABLE 0x00020000U /* SR-IOV disable request */ +#define PCI_IOV_NUMVFS 0x00040000U /* SR-IOV disable request */ + +#ifdef CONFIG_PCI_IOV +extern int pci_iov_enable(struct pci_dev *dev, int numvfs); +extern void pci_iov_disable(struct pci_dev *dev); +extern int pci_iov_register(struct pci_dev *dev, + int (*callback)(struct pci_dev *dev, u32 event)); +extern void pci_iov_unregister(struct pci_dev *dev); +#else +static inline int pci_iov_enable(struct pci_dev *dev, int numvfs) +{ + return -EIO; +} +static inline void pci_iov_disable(struct pci_dev *dev) +{ +} +static inline int pci_iov_register(struct pci_dev *dev, + int (*callback)(struct pci_dev *dev, u32 event)) +{ + return -EIO; +} +static inline void pci_iov_unregister(struct pci_dev *dev) +{ +} +#endif /* CONFIG_PCI_IOV */ + #endif /* __KERNEL__ */ #endif /* LINUX_PCI_H */ diff --git a/include/linux/pci_regs.h b/include/linux/pci_regs.h index eb6686b..1b28b3f 100644 --- a/include/linux/pci_regs.h +++ b/include/linux/pci_regs.h @@ -363,6 +363,7 @@ #define PCI_EXP_TYPE_UPSTREAM 0x5 /* Upstream Port */ #define PCI_EXP_TYPE_DOWNSTREAM 0x6 /* Downstream Port */ #define PCI_EXP_TYPE_PCI_BRIDGE 0x7 /* PCI/PCI-X Bridge */ +#define PCI_EXP_TYPE_RC_END 0x9 /* Root Complex Integrated Endpoint */ #define PCI_EXP_FLAGS_SLOT 0x0100 /* Slot implemented */ #define PCI_EXP_FLAGS_IRQ 0x3e00 /* Interrupt message number */ #define PCI_EXP_DEVCAP 4 /* Device capabilities */ @@ -434,6 +435,7 @@ #define PCI_EXT_CAP_ID_DSN 3 #define PCI_EXT_CAP_ID_PWR 4 #define PCI_EXT_CAP_ID_ARI 14 +#define PCI_EXT_CAP_ID_IOV 16 /* Advanced Error Reporting */ #define PCI_ERR_UNCOR_STATUS 4 /* Uncorrectable Error Status */ @@ -551,4 +553,23 @@ #define PCI_ARI_CTRL_ACS 0x0002 /* ACS Function Groups Enable */ #define PCI_ARI_CTRL_FG(x) (((x) >> 4) & 7) /* Function Group */ +/* Single Root I/O Virtualization */ +#define PCI_IOV_CAP 0x04 /* SR-IOV Capabilities */ +#define PCI_IOV_CTRL 0x08 /* SR-IOV Control */ +#define PCI_IOV_CTRL_VFE 0x01 /* VF Enable */ +#define PCI_IOV_CTRL_MSE 0x08 /* VF Memory Space Enable */ +#define PCI_IOV_CTRL_ARI 0x10 /* ARI Capable Hierarchy */ +#define PCI_IOV_STATUS 0x0a /* SR-IOV Status */ +#define PCI_IOV_INITIAL_VF 0x0c /* Initial VFs */ +#define PCI_IOV_TOTAL_VF 0x0e /* Total VFs */ +#define PCI_IOV_NUM_VF 0x10 /* Number of VFs */ +#define PCI_IOV_FUNC_LINK 0x12 /* Function Dependency Link */ +#define PCI_IOV_VF_OFFSET 0x14 /* First VF Offset */ +#define PCI_IOV_VF_STRIDE 0x16 /* Following VF Stride */ +#define PCI_IOV_VF_DID 0x1a /* VF Device ID */ +#define PCI_IOV_SUP_PGSIZE 0x1c /* Supported Page Sizes */ +#define PCI_IOV_SYS_PGSIZE 0x20 /* System Page Size */ +#define PCI_IOV_BAR_0 0x24 /* VF BAR0 */ +#define PCI_IOV_NUM_BAR 6 /* Number of VF BARs */ + #endif /* LINUX_PCI_REGS_H */ -- 1.5.6.4 From yu.zhao at intel.com Tue Oct 21 04:53:32 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Tue, 21 Oct 2008 
19:53:32 +0800 Subject: [PATCH 13/15 v5] PCI: reserve bus range for the SR-IOV device In-Reply-To: <20081021114056.GA3185@yzhao12-linux.sh.intel.com> References: <20081021114056.GA3185@yzhao12-linux.sh.intel.com> Message-ID: <20081021115332.GN3185@yzhao12-linux.sh.intel.com> Reserve bus range for SR-IOV at device scanning stage. Cc: Jesse Barnes Cc: Randy Dunlap Cc: Grant Grundler Cc: Alex Chiang Cc: Matthew Wilcox Cc: Roland Dreier Cc: Greg KH Signed-off-by: Yu Zhao --- drivers/pci/pci.h | 5 +++++ drivers/pci/probe.c | 3 +++ 2 files changed, 8 insertions(+), 0 deletions(-) diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 7735d92..5206ae7 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -204,6 +204,7 @@ void pci_iov_remove_sysfs(struct pci_dev *dev); extern int pci_iov_resource_align(struct pci_dev *dev, int resno); extern int pci_iov_resource_bar(struct pci_dev *dev, int resno, enum pci_bar_type *type); +extern int pci_iov_bus_range(struct pci_bus *bus); #else static inline int pci_iov_init(struct pci_dev *dev) { @@ -227,6 +228,10 @@ static inline int pci_iov_resource_bar(struct pci_dev *dev, int resno, { return 0; } +extern inline int pci_iov_bus_range(struct pci_bus *bus) +{ + return 0; +} #endif /* CONFIG_PCI_IOV */ #endif /* DRIVERS_PCI_H */ diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index 18ce9c0..50a1380 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -1068,6 +1068,9 @@ unsigned int __devinit pci_scan_child_bus(struct pci_bus *bus) for (devfn = 0; devfn < 0x100; devfn += 8) pci_scan_slot(bus, devfn); + /* Reserve buses for SR-IOV capability. */ + max += pci_iov_bus_range(bus); + /* * After performing arch-dependent fixup of the bus, look behind * all PCI-to-PCI bridges on this bus. -- 1.5.6.4 From yu.zhao at intel.com Tue Oct 21 04:54:06 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Tue, 21 Oct 2008 19:54:06 +0800 Subject: [PATCH 14/15 v5] PCI: document the SR-IOV In-Reply-To: <20081021114056.GA3185@yzhao12-linux.sh.intel.com> References: <20081021114056.GA3185@yzhao12-linux.sh.intel.com> Message-ID: <20081021115406.GO3185@yzhao12-linux.sh.intel.com> Create how-to for SR-IOV user and device driver developer. Cc: Jesse Barnes Cc: Randy Dunlap Cc: Grant Grundler Cc: Alex Chiang Cc: Matthew Wilcox Cc: Roland Dreier Cc: Greg KH Signed-off-by: Yu Zhao --- Documentation/PCI/pci-iov-howto.txt | 181 +++++++++++++++++++++++++++++++++++ 1 files changed, 181 insertions(+), 0 deletions(-) create mode 100644 Documentation/PCI/pci-iov-howto.txt diff --git a/Documentation/PCI/pci-iov-howto.txt b/Documentation/PCI/pci-iov-howto.txt new file mode 100644 index 0000000..5632723 --- /dev/null +++ b/Documentation/PCI/pci-iov-howto.txt @@ -0,0 +1,181 @@ + PCI Express Single Root I/O Virtualization HOWTO + Copyright (C) 2008 Intel Corporation + + +1. Overview + +1.1 What is SR-IOV + +Single Root I/O Virtualization (SR-IOV) is a PCI Express Extended +capability which makes one physical device appear as multiple virtual +devices. The physical device is referred to as Physical Function while +the virtual devices are referred to as Virtual Functions. Allocation +of Virtual Functions can be dynamically controlled by Physical Function +via registers encapsulated in the capability. By default, this feature +is not enabled and the Physical Function behaves as traditional PCIe +device. Once it's turned on, each Virtual Function's PCI configuration +space can be accessed by its own Bus, Device and Function Number (Routing +ID). 
And each Virtual Function also has PCI Memory Space, which is used +to map its register set. Virtual Function device driver operates on the +register set so it can be functional and appear as a real existing PCI +device. + +2. User Guide + +2.1 How can I manage SR-IOV + +If a device supports SR-IOV, then there should be some entries under +Physical Function's PCI device directory. These entries are in directory: + - /sys/bus/pci/devices/XXXX:BB:DD.F/iov/ + (XXXX:BB:DD:F is the domain, bus, device and function number) + +To enable or disable SR-IOV: + - /sys/bus/pci/devices/XXXX:BB:DD.F/iov/enable + (writing 1/0 means enable/disable VFs, state change will + notify PF driver) + +To change number of Virtual Functions: + - /sys/bus/pci/devices/XXXX:BB:DD.F/iov/numvfs + (writing positive integer to this file will change NumVFs) + +The total and initial number of VFs can get from: + - /sys/bus/pci/devices/XXXX:BB:DD.F/iov/totalvfs + - /sys/bus/pci/devices/XXXX:BB:DD.F/iov/initialvfs + +2.2 How can I use Virtual Functions + +Virtual Functions are treated as hot-plugged PCI devices in the kernel, +so they should be able to work in the same way as real PCI devices. +NOTE: Virtual Function device driver must be loaded to make it work. + + +3. Developer Guide + +3.1 SR-IOV APIs + +To register SR-IOV service, Physical Function device driver needs to call: + int pci_iov_register(struct pci_dev *dev, + int (*callback)(struct pci_dev *, u32)) + The 'callback' is a callback function that the SR-IOV code will invoke + it when events related to VFs happen (e.g. user enable/disable VFs). + The first argument is PF itself, the second argument is event type and + value. For now, following events type are supported: + - PCI_IOV_ENABLE: SR-IOV enable request + - PCI_IOV_DISABLE: SR-IOV disable request + - PCI_IOV_NUMVFS: changing Number of VFs request + And event values can be extract using following masks: + - PCI_IOV_NUM_VIRTFN: Number of Virtual Functions + +To unregister SR-IOV service, Physical Function device driver needs to call: + void pci_iov_unregister(struct pci_dev *dev) + +To enable SR-IOV, Physical Function device driver needs to call: + int pci_iov_enable(struct pci_dev *dev, int numvfs) + 'numvfs' is the number of VFs that PF wants to enable. + +To disable SR-IOV, Physical Function device driver needs to call: + void pci_iov_disable(struct pci_dev *dev) + +Note: above two functions sleeps 1 second waiting on hardware transaction +completion according to SR-IOV specification. + +3.2 Usage example + +Following piece of code illustrates the usage of APIs above. + +static int callback(struct pci_dev *dev, u32 event) +{ + int numvfs; + + if (event & PCI_IOV_ENABLE) { + /* + * request to enable SR-IOV. + * Note: if the PF driver want to support PM, it has + * to check the device power state here to see if this + * request is allowed or not. + */ + ... + + } else if (event & PCI_IOV_DISABLE) { + /* + * request to disable SR-IOV. + */ + ... + + } else if (event & PCI_IOV_NUMVFS) { + /* + * request to change NumVFs. + */ + numvfs = event & PCI_IOV_NUM_VIRTFN; + ... + + } else + return -EINVAL; + + return 0; +} + +static int __devinit dev_probe(struct pci_dev *dev, + const struct pci_device_id *id) +{ + int err; + int numvfs; + + ... + err = pci_iov_register(dev, callback); + ... + err = pci_iov_enable(dev, numvfs); + ... + + return err; +} + +static void __devexit dev_remove(struct pci_dev *dev) +{ + ... + pci_iov_disable(dev); + ... + pci_iov_unregister(dev); + ... 
+} + +#ifdef CONFIG_PM +/* + * If the Physical Function supports power management, then + * SR-IOV needs to be disabled before the adapter goes to sleep, + * because the Virtual Functions will not work while the adapter is + * in a power-saving mode. + * SR-IOV can be enabled again after the adapter wakes up. + */ +static int dev_suspend(struct pci_dev *dev, pm_message_t state) +{ + ... + pci_iov_disable(dev); + ... + + return 0; +} + +static int dev_resume(struct pci_dev *dev) +{ + int err; + int numvfs; + + ... + err = pci_iov_enable(dev, numvfs); + ... + + return 0; +} +#endif + +static struct pci_driver dev_driver = { + .name = "SR-IOV Physical Function driver", + .id_table = dev_id_table, + .probe = dev_probe, + .remove = __devexit_p(dev_remove), +#ifdef CONFIG_PM + .suspend = dev_suspend, + .resume = dev_resume, +#endif +}; -- 1.5.6.4 From yu.zhao at intel.com Tue Oct 21 04:54:29 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Tue, 21 Oct 2008 19:54:29 +0800 Subject: [PATCH 15/15 v5] PCI: document the new PCI boot parameters In-Reply-To: <20081021114056.GA3185@yzhao12-linux.sh.intel.com> References: <20081021114056.GA3185@yzhao12-linux.sh.intel.com> Message-ID: <20081021115429.GP3185@yzhao12-linux.sh.intel.com> Complete the documentation for the new PCI boot parameters. --- Documentation/kernel-parameters.txt | 10 ++++++++++ 1 files changed, 10 insertions(+), 0 deletions(-) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 53ba7c7..5482ae0 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -1677,6 +1677,16 @@ and is between 256 and 4096 characters. It is defined in the file cbmemsize=nn[KMG] The fixed amount of bus space which is reserved for the CardBus bridge's memory window. The default value is 64 megabytes. + assign-mmio=[dddd:]bb [X86] reassign memory resources of all + devices under bus [dddd:]bb (dddd is the domain + number and bb is the bus number). + assign-pio=[dddd:]bb [X86] reassign I/O port resources of all + devices under bus [dddd:]bb (dddd is the domain + number and bb is the bus number). + align-mmio=[dddd:]bb:dd.f [X86] relocate memory resources of a + device with minimum PAGE_SIZE alignment (dddd + is the domain number and bb, dd and f are the + bus, device and function numbers). pcie_aspm= [PCIE] Forcibly enable or disable PCIe Active State Power Management. -- 1.5.6.4 From zumeng.chen at windriver.com Tue Oct 21 04:10:36 2008 From: zumeng.chen at windriver.com (Chen Zumeng) Date: Tue, 21 Oct 2008 19:10:36 +0800 Subject: [PATCH 0/2] dm-ioband: I/O bandwidth controller v1.8.0: Introduction In-Reply-To: <20081017.160950.71109894.ryov@valinux.co.jp> References: <20081017.160950.71109894.ryov@valinux.co.jp> Message-ID: <48FDB8AC.9020707@windriver.com> Hi, Ryo Tsuruta I applied your patches (both) into the latest kernel (2.6.27), and dm-ioband looks to work well (other than schedule_timeout in alloc_ioband_device). But I think you are the author of bio_tracking, so it is highly appreciated if you can give your comments and advice on the potential difference between 27-rc5-mm1 and 2.6.27 to me. And our test team want to test bio_tracking as your benchmark reports, so would you please send me your test codes? Thanks in advance. Regards, Zumeng P.S. 
The following are my changes to avoid schedule_timeout: diff --git a/drivers/md/dm-ioband-ctl.c b/drivers/md/dm-ioband-ctl.c index a792620..643ca4e 100644 --- a/drivers/md/dm-ioband-ctl.c +++ b/drivers/md/dm-ioband-ctl.c @@ -100,6 +100,7 @@ static struct ioband_device *alloc_ioband_device(char *name, return dp; } } + spin_unlock_irqrestore(&ioband_devicelist_lock, flags); /* * Prepare its own workqueue as generic_make_request() may @@ -133,9 +134,11 @@ static struct ioband_device *alloc_ioband_device(char *name, init_waitqueue_head(&new->g_waitq); init_waitqueue_head(&new->g_waitq_suspend); init_waitqueue_head(&new->g_waitq_flush); - list_add_tail(&new->g_list, &ioband_device_list); + spin_lock_irqsave(&ioband_devicelist_lock, flags); + list_add_tail(&new->g_list, &ioband_device_list); spin_unlock_irqrestore(&ioband_devicelist_lock, flags); + return new; } --- Ryo Tsuruta wrote: > Hi Alasdair and all, > > This is the dm-ioband version 1.8.0 release. > > Dm-ioband is an I/O bandwidth controller implemented as a device-mapper > driver, which gives specified bandwidth to each job running on the same > physical device. > > This release is a minor bug fix and confirmed running on the latest > stable kernel 2.6.27.1. > > - Can be applied to the kernel 2.6.27.1 and 2.6.27-rc5-mm1. > - Changes from 1.7.0 (posted on Oct 3, 2008): > - Fix a minor bug in io_limit setting that causes dm-ioband to stop > issuing I/O requests when a large value is set to io_limit. > > Alasdair, could you please review this patch and give me any comments? > > Thanks, > Ryo Tsuruta > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo at vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > From greg at kroah.com Tue Oct 21 09:50:54 2008 From: greg at kroah.com (Greg KH) Date: Tue, 21 Oct 2008 09:50:54 -0700 Subject: [PATCH 12/15 v5] PCI: support the SR-IOV capability In-Reply-To: <20081021115308.GM3185@yzhao12-linux.sh.intel.com> References: <20081021114056.GA3185@yzhao12-linux.sh.intel.com> <20081021115308.GM3185@yzhao12-linux.sh.intel.com> Message-ID: <20081021165054.GA24795@kroah.com> On Tue, Oct 21, 2008 at 07:53:08PM +0800, Yu Zhao wrote: > Support Single Root I/O Virtualization (SR-IOV) capability. > > Cc: Jesse Barnes > Cc: Randy Dunlap > Cc: Grant Grundler > Cc: Alex Chiang > Cc: Matthew Wilcox > Cc: Roland Dreier > Cc: Greg KH > Signed-off-by: Yu Zhao > > --- > drivers/pci/Kconfig | 12 + > drivers/pci/Makefile | 2 + > drivers/pci/iov.c | 616 ++++++++++++++++++++++++++++++++++++++++++++++ > drivers/pci/pci-sysfs.c | 4 + > drivers/pci/pci.c | 14 + > drivers/pci/pci.h | 48 ++++ > drivers/pci/probe.c | 4 + > include/linux/pci.h | 40 +++ > include/linux/pci_regs.h | 21 ++ > 9 files changed, 761 insertions(+), 0 deletions(-) > create mode 100644 drivers/pci/iov.c > > diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig > index e1ca425..e7c0836 100644 > --- a/drivers/pci/Kconfig > +++ b/drivers/pci/Kconfig > @@ -50,3 +50,15 @@ config HT_IRQ > This allows native hypertransport devices to use interrupts. > > If unsure say Y. > + > +config PCI_IOV > + bool "PCI SR-IOV support" > + depends on PCI > + select PCI_MSI > + default n > + help > + This option allows device drivers to enable Single Root I/O > + Virtualization. Each Virtual Function's PCI configuration > + space can be accessed using its own Bus, Device and Function > + Number (Routing ID). 
Each Virtual Function also has PCI Memory > + Space, which is used to map its own register set. > diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile > index 4b47f4e..abbfcfa 100644 > --- a/drivers/pci/Makefile > +++ b/drivers/pci/Makefile > @@ -55,3 +55,5 @@ obj-$(CONFIG_PCI_SYSCALL) += syscall.o > ifeq ($(CONFIG_PCI_DEBUG),y) > EXTRA_CFLAGS += -DDEBUG > endif > + > +obj-$(CONFIG_PCI_IOV) += iov.o > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c > new file mode 100644 > index 0000000..571a46c > --- /dev/null > +++ b/drivers/pci/iov.c > @@ -0,0 +1,616 @@ > +/* > + * drivers/pci/iov.c > + * > + * Copyright (C) 2008 Intel Corporation > + * > + * PCI Express Single Root I/O Virtualization capability support. > + */ > + > +#include > +#include > +#include > +#include > +#include > +#include "pci.h" > + > + > +#define iov_config_attr(field) \ > +static ssize_t field##_show(struct device *dev, \ > + struct device_attribute *attr, char *buf) \ > +{ \ > + struct pci_dev *pdev = to_pci_dev(dev); \ > + return sprintf(buf, "%d\n", pdev->iov->field); \ > +} > + > +iov_config_attr(status); > +iov_config_attr(totalvfs); > +iov_config_attr(initialvfs); > +iov_config_attr(numvfs); As you are adding new sysfs entries, can you also create the proper documentation in Documentation/ABI/ so that people can understand how to use them? Yes, I see you added a stand-alone document, but putting it in the "standard" format is also necessary. thanks, greg k-h From yu.zhao at intel.com Tue Oct 21 20:05:26 2008 From: yu.zhao at intel.com (Zhao, Yu) Date: Wed, 22 Oct 2008 11:05:26 +0800 Subject: [PATCH 12/15 v5] PCI: support the SR-IOV capability In-Reply-To: <20081021165054.GA24795@kroah.com> References: <20081021114056.GA3185@yzhao12-linux.sh.intel.com> <20081021115308.GM3185@yzhao12-linux.sh.intel.com> <20081021165054.GA24795@kroah.com> Message-ID: <48FE9876.9040003@intel.com> Greg KH wrote: > On Tue, Oct 21, 2008 at 07:53:08PM +0800, Yu Zhao wrote: >> Support Single Root I/O Virtualization (SR-IOV) capability. >> >> Cc: Jesse Barnes >> Cc: Randy Dunlap >> Cc: Grant Grundler >> Cc: Alex Chiang >> Cc: Matthew Wilcox >> Cc: Roland Dreier >> Cc: Greg KH >> Signed-off-by: Yu Zhao >> >> +#define iov_config_attr(field) \ >> +static ssize_t field##_show(struct device *dev, \ >> + struct device_attribute *attr, char *buf) \ >> +{ \ >> + struct pci_dev *pdev = to_pci_dev(dev); \ >> + return sprintf(buf, "%d\n", pdev->iov->field); \ >> +} >> + >> +iov_config_attr(status); >> +iov_config_attr(totalvfs); >> +iov_config_attr(initialvfs); >> +iov_config_attr(numvfs); > > As you are adding new sysfs entries, can you also create the proper > documentation in Documentation/ABI/ so that people can understand how to > use them? Yes, I see you added a stand-alone document, but putting it > in the "standard" format is also necessary. Thanks for reminding me about this. I used to update ABI doc in earlier versions, but somehow forgot to do this after several carry-forwards... Will complete it in next version. 
Regards, Yu From mingo at elte.hu Wed Oct 22 00:19:09 2008 From: mingo at elte.hu (Ingo Molnar) Date: Wed, 22 Oct 2008 09:19:09 +0200 Subject: [PATCH 8/15 v5] PCI: add boot options to reassign resources In-Reply-To: <20081021114959.GI3185@yzhao12-linux.sh.intel.com> References: <20081021114056.GA3185@yzhao12-linux.sh.intel.com> <20081021114959.GI3185@yzhao12-linux.sh.intel.com> Message-ID: <20081022071909.GB27637@elte.hu> * Yu Zhao wrote: > +static char *pci_assign_pio; > +static char *pci_assign_mmio; > + > +static int pcibios_bus_resource_needs_fixup(struct pci_bus *bus) > +{ > + int i; > + int type = 0; > + int domain, busnr; > + > + if (!bus->self) > + return 0; > + > + for (i = 0; i < 2; i++) { > + char *str = i ? pci_assign_pio : pci_assign_mmio; > + while (str && *str) { please put a newline after local variable defitions. > + for (i = 0; i < PCI_BUS_NUM_RESOURCES; i++) { > + struct resource *res = bus->resource[i]; > + if (!res) ditto. nice debug feature! Ingo From zumeng.chen at windriver.com Wed Oct 22 00:55:15 2008 From: zumeng.chen at windriver.com (Chen Zumeng) Date: Wed, 22 Oct 2008 15:55:15 +0800 Subject: [PATCH 0/2] dm-ioband: I/O bandwidth controller v1.8.0: Introduction In-Reply-To: <48FDB8AC.9020707@windriver.com> References: <20081017.160950.71109894.ryov@valinux.co.jp> <48FDB8AC.9020707@windriver.com> Message-ID: <48FEDC63.308@windriver.com> Chen Zumeng wrote: > Hi, Ryo Tsuruta > > And our test team want to test bio_tracking as your benchmark reports, > so would you please send me your test codes? Thanks in advance. Hi Ryo Tsuruta, I wonder if you received last email, so I reply this email to ask for your bio_tracking test codes to generate your benchmark reports as shown in your website. Thanks in advance :) Regards, Zumeng > > Regards, > Zumeng > > P.S. The following are my changes to avoid schedule_timeout: > > > diff --git a/drivers/md/dm-ioband-ctl.c b/drivers/md/dm-ioband-ctl.c > index a792620..643ca4e 100644 > --- a/drivers/md/dm-ioband-ctl.c > +++ b/drivers/md/dm-ioband-ctl.c > @@ -100,6 +100,7 @@ static struct ioband_device > *alloc_ioband_device(char *name, > return dp; > } > } > + spin_unlock_irqrestore(&ioband_devicelist_lock, flags); > > /* > * Prepare its own workqueue as generic_make_request() may > @@ -133,9 +134,11 @@ static struct ioband_device > *alloc_ioband_device(char *name, > init_waitqueue_head(&new->g_waitq); > init_waitqueue_head(&new->g_waitq_suspend); > init_waitqueue_head(&new->g_waitq_flush); > - list_add_tail(&new->g_list, &ioband_device_list); > > + spin_lock_irqsave(&ioband_devicelist_lock, flags); > + list_add_tail(&new->g_list, &ioband_device_list); > spin_unlock_irqrestore(&ioband_devicelist_lock, flags); > + > return new; > } > --- > > Ryo Tsuruta wrote: >> Hi Alasdair and all, >> >> This is the dm-ioband version 1.8.0 release. >> >> Dm-ioband is an I/O bandwidth controller implemented as a device-mapper >> driver, which gives specified bandwidth to each job running on the same >> physical device. >> >> This release is a minor bug fix and confirmed running on the latest >> stable kernel 2.6.27.1. >> >> - Can be applied to the kernel 2.6.27.1 and 2.6.27-rc5-mm1. >> - Changes from 1.7.0 (posted on Oct 3, 2008): >> - Fix a minor bug in io_limit setting that causes dm-ioband to stop >> issuing I/O requests when a large value is set to io_limit. >> >> Alasdair, could you please review this patch and give me any comments? 
>> >> Thanks, >> Ryo Tsuruta >> -- >> To unsubscribe from this list: send the line "unsubscribe >> linux-kernel" in >> the body of a message to majordomo at vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> Please read the FAQ at http://www.tux.org/lkml/ >> > > From ryov at valinux.co.jp Wed Oct 22 01:05:36 2008 From: ryov at valinux.co.jp (Ryo Tsuruta) Date: Wed, 22 Oct 2008 17:05:36 +0900 (JST) Subject: [PATCH 0/2] dm-ioband: I/O bandwidth controller v1.8.0: Introduction In-Reply-To: <48FEDC63.308@windriver.com> References: <20081017.160950.71109894.ryov@valinux.co.jp> <48FDB8AC.9020707@windriver.com> <48FEDC63.308@windriver.com> Message-ID: <20081022.170536.193712541.ryov@valinux.co.jp> Hi Chen, > Chen Zumeng wrote: > > Hi, Ryo Tsuruta > > And our test team want to test bio_tracking as your benchmark reports, > > so would you please send me your test codes? Thanks in advance. > Hi Ryo Tsuruta, > > I wonder if you received last email, so I reply this email to ask > for your bio_tracking test codes to generate your benchmark reports > as shown in your website. Thanks in advance :) I've uploaded two scripts here: http://people.valinux.co.jp/~ryov/dm-ioband/scripts/xdd-count.sh http://people.valinux.co.jp/~ryov/dm-ioband/scripts/xdd-size.sh xdd-count.sh controls bandwidth based on the number of I/O requests, and xdd-size.sh controls bandwidth based onthe number of I/O sectors. Theses scritpts require xdd disk I/O testing tool which can be downloaded from here: http://www.ioperformance.com/products.htm Please feel free to ask me questions if you have any questions. > > P.S. The following are my changes to avoid schedule_timeout: Thanks, but your patch seems to cause a problem when ioband devices which have the same name are created at the same time. I will fix the issue in the next release. Thanks, Ryo Tsuruta From zumeng.chen at windriver.com Wed Oct 22 01:12:21 2008 From: zumeng.chen at windriver.com (Chen Zumeng) Date: Wed, 22 Oct 2008 16:12:21 +0800 Subject: [PATCH 0/2] dm-ioband: I/O bandwidth controller v1.8.0: Introduction In-Reply-To: <20081022.170536.193712541.ryov@valinux.co.jp> References: <20081017.160950.71109894.ryov@valinux.co.jp> <48FDB8AC.9020707@windriver.com> <48FEDC63.308@windriver.com> <20081022.170536.193712541.ryov@valinux.co.jp> Message-ID: <48FEE065.20406@windriver.com> Ryo Tsuruta wrote: > Hi Chen, > >> Chen Zumeng wrote: >>> Hi, Ryo Tsuruta >>> And our test team want to test bio_tracking as your benchmark reports, >>> so would you please send me your test codes? Thanks in advance. >> Hi Ryo Tsuruta, >> >> I wonder if you received last email, so I reply this email to ask >> for your bio_tracking test codes to generate your benchmark reports >> as shown in your website. Thanks in advance :) Thanks for you quick reply :) Regards, Zumeng > > I've uploaded two scripts here: > http://people.valinux.co.jp/~ryov/dm-ioband/scripts/xdd-count.sh > http://people.valinux.co.jp/~ryov/dm-ioband/scripts/xdd-size.sh > > xdd-count.sh controls bandwidth based on the number of I/O requests, > and xdd-size.sh controls bandwidth based onthe number of I/O sectors. > Theses scritpts require xdd disk I/O testing tool which can be > downloaded from here: > http://www.ioperformance.com/products.htm > > Please feel free to ask me questions if you have any questions. OK thanks. > >>> P.S. 
The following are my changes to avoid schedule_timeout: > > Thanks, but your patch seems to cause a problem when ioband devices > which have the same name are created at the same time. I will fix the > issue in the next release. Maybe, hoping your next release :) Zumeng > > Thanks, > Ryo Tsuruta > From yu.zhao at intel.com Wed Oct 22 01:38:09 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Wed, 22 Oct 2008 16:38:09 +0800 Subject: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support Message-ID: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> Greetings, Following patches are intended to support SR-IOV capability in the Linux kernel. With these patches, people can turn a PCI device with the capability into multiple ones from software perspective, which will benefit KVM and achieve other purposes such as QoS, security, and etc. Changes from v5 to v6: 1, update ABI document to include SR-IOV sysfs entries (Greg KH) 2, fix two coding style problems (Ingo Molnar) --- [PATCH 1/16 v6] PCI: remove unnecessary arg of pci_update_resource() [PATCH 2/16 v6] PCI: define PCI resource names in an 'enum' [PATCH 3/16 v6] PCI: export __pci_read_base [PATCH 4/16 v6] PCI: make pci_alloc_child_bus() be able to handle NULL bridge [PATCH 5/16 v6] PCI: add a wrapper for resource_alignment() [PATCH 6/16 v6] PCI: add a new function to map BAR offset [PATCH 7/16 v6] PCI: cleanup pcibios_allocate_resources() [PATCH 8/16 v6] PCI: add boot options to reassign resources [PATCH 9/16 v6] PCI: add boot option to align MMIO resources [PATCH 10/16 v6] PCI: cleanup pci_bus_add_devices() [PATCH 11/16 v6] PCI: split a new function from pci_bus_add_devices() [PATCH 12/16 v6] PCI: support the SR-IOV capability [PATCH 13/16 v6] PCI: reserve bus range for SR-IOV device [PATCH 14/16 v6] PCI: document for SR-IOV user and developer [PATCH 15/16 v6] PCI: document the SR-IOV sysfs entries [PATCH 16/16 v6] PCI: document the new PCI boot parameters --- Single Root I/O Virtualization (SR-IOV) capability defined by PCI-SIG is intended to enable multiple system software to share PCI hardware resources. PCI device that supports this capability can be extended to one Physical Functions plus multiple Virtual Functions. Physical Function, which could be considered as the "real" PCI device, reflects the hardware instance and manages all physical resources. Virtual Functions are associated with a Physical Function and shares physical resources with the Physical Function.Software can control allocation of Virtual Functions via registers encapsulated in the capability structure. SR-IOV specification can be found at http://www.pcisig.com/members/downloads/specifications/iov/sr-iov1.0_11Sep07.pdf Devices that support SR-IOV are available from following vendors: http://download.intel.com/design/network/ProdBrf/320025.pdf http://www.netxen.com/products/chipsolutions/NX3031.html http://www.neterion.com/products/x3100.html From yu.zhao at intel.com Wed Oct 22 01:40:41 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Wed, 22 Oct 2008 16:40:41 +0800 Subject: [PATCH 2/16 v6] PCI: define PCI resource names in an 'enum' In-Reply-To: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> References: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> Message-ID: <20081022084041.GB3773@yzhao12-linux.sh.intel.com> This patch moves all definitions of the PCI resource names to an 'enum', and also replaces some hard-coded resource variables with symbol names. This change eases introduction of device specific resources. 
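As a quick illustration of what the symbolic names buy us (a sketch only, not part of the patch; the helper name dump_std_bars() is invented here), a loop over the standard BARs can now use the enum bounds instead of the literals 0 and 6:

static void dump_std_bars(struct pci_dev *dev)
{
	int i;

	/* PCI_STD_RESOURCES..PCI_STD_RESOURCES_END replace the hard-coded 0..5 */
	for (i = PCI_STD_RESOURCES; i <= PCI_STD_RESOURCES_END; i++) {
		struct resource *res = &dev->resource[i];

		if (!res->flags)
			continue;	/* BAR not implemented */
		dev_info(&dev->dev, "BAR %d: %pR\n", i, res);
	}
}

Device specific ranges (such as the SR-IOV VF BARs added later in this series) can then be slotted into the enum without touching any user of these symbolic bounds.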
Cc: Alex Chiang Cc: Grant Grundler Cc: Greg KH Cc: Ingo Molnar Cc: Jesse Barnes Cc: Matthew Wilcox Cc: Randy Dunlap Cc: Roland Dreier Signed-off-by: Yu Zhao --- drivers/pci/pci-sysfs.c | 4 +++- drivers/pci/pci.c | 19 ++----------------- drivers/pci/probe.c | 2 +- drivers/pci/proc.c | 7 ++++--- include/linux/pci.h | 37 ++++++++++++++++++++++++------------- 5 files changed, 34 insertions(+), 35 deletions(-) diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c index 110022d..5c456ab 100644 --- a/drivers/pci/pci-sysfs.c +++ b/drivers/pci/pci-sysfs.c @@ -101,11 +101,13 @@ resource_show(struct device * dev, struct device_attribute *attr, char * buf) struct pci_dev * pci_dev = to_pci_dev(dev); char * str = buf; int i; - int max = 7; + int max; resource_size_t start, end; if (pci_dev->subordinate) max = DEVICE_COUNT_RESOURCE; + else + max = PCI_BRIDGE_RESOURCES; for (i = 0; i < max; i++) { struct resource *res = &pci_dev->resource[i]; diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index ae62f01..40284dc 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -359,24 +359,9 @@ pci_find_parent_resource(const struct pci_dev *dev, struct resource *res) static void pci_restore_bars(struct pci_dev *dev) { - int i, numres; - - switch (dev->hdr_type) { - case PCI_HEADER_TYPE_NORMAL: - numres = 6; - break; - case PCI_HEADER_TYPE_BRIDGE: - numres = 2; - break; - case PCI_HEADER_TYPE_CARDBUS: - numres = 1; - break; - default: - /* Should never get here, but just in case... */ - return; - } + int i; - for (i = 0; i < numres; i++) + for (i = 0; i < PCI_BRIDGE_RESOURCES; i++) pci_update_resource(dev, i); } diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index aaaf0a1..a52784c 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -426,7 +426,7 @@ static struct pci_bus *pci_alloc_child_bus(struct pci_bus *parent, child->subordinate = 0xff; /* Set up default resource pointers and names.. 
*/ - for (i = 0; i < 4; i++) { + for (i = 0; i < PCI_BRIDGE_RES_NUM; i++) { child->resource[i] = &bridge->resource[PCI_BRIDGE_RESOURCES+i]; child->resource[i]->name = child->name; } diff --git a/drivers/pci/proc.c b/drivers/pci/proc.c index e1098c3..f6f2a59 100644 --- a/drivers/pci/proc.c +++ b/drivers/pci/proc.c @@ -352,15 +352,16 @@ static int show_device(struct seq_file *m, void *v) dev->vendor, dev->device, dev->irq); - /* Here should be 7 and not PCI_NUM_RESOURCES as we need to preserve compatibility */ - for (i=0; i<7; i++) { + + /* only print standard and ROM resources to preserve compatibility */ + for (i = 0; i <= PCI_ROM_RESOURCE; i++) { resource_size_t start, end; pci_resource_to_user(dev, i, &dev->resource[i], &start, &end); seq_printf(m, "\t%16llx", (unsigned long long)(start | (dev->resource[i].flags & PCI_REGION_FLAG_MASK))); } - for (i=0; i<7; i++) { + for (i = 0; i <= PCI_ROM_RESOURCE; i++) { resource_size_t start, end; pci_resource_to_user(dev, i, &dev->resource[i], &start, &end); seq_printf(m, "\t%16llx", diff --git a/include/linux/pci.h b/include/linux/pci.h index 43e1fc1..2ada2b6 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -76,7 +76,30 @@ enum pci_mmap_state { #define PCI_DMA_FROMDEVICE 2 #define PCI_DMA_NONE 3 -#define DEVICE_COUNT_RESOURCE 12 +/* + * For PCI devices, the region numbers are assigned this way: + */ +enum { + /* #0-5: standard PCI regions */ + PCI_STD_RESOURCES, + PCI_STD_RESOURCES_END = 5, + + /* #6: expansion ROM */ + PCI_ROM_RESOURCE, + + /* address space assigned to buses behind the bridge */ +#ifndef PCI_BRIDGE_RES_NUM +#define PCI_BRIDGE_RES_NUM 4 +#endif + PCI_BRIDGE_RESOURCES, + PCI_BRIDGE_RES_END = PCI_BRIDGE_RESOURCES + PCI_BRIDGE_RES_NUM - 1, + + /* total resources associated with a PCI device */ + PCI_NUM_RESOURCES, + + /* preserve this for compatibility */ + DEVICE_COUNT_RESOURCE +}; typedef int __bitwise pci_power_t; @@ -262,18 +285,6 @@ static inline void pci_add_saved_cap(struct pci_dev *pci_dev, hlist_add_head(&new_cap->next, &pci_dev->saved_cap_space); } -/* - * For PCI devices, the region numbers are assigned this way: - * - * 0-5 standard PCI regions - * 6 expansion ROM - * 7-10 bridges: address space assigned to buses behind the bridge - */ - -#define PCI_ROM_RESOURCE 6 -#define PCI_BRIDGE_RESOURCES 7 -#define PCI_NUM_RESOURCES 11 - #ifndef PCI_BUS_NUM_RESOURCES #define PCI_BUS_NUM_RESOURCES 16 #endif -- 1.5.6.4 From yu.zhao at intel.com Wed Oct 22 01:40:11 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Wed, 22 Oct 2008 16:40:11 +0800 Subject: [PATCH 1/16 v6] PCI: remove unnecessary arg of pci_update_resource() In-Reply-To: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> References: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> Message-ID: <20081022084011.GA3773@yzhao12-linux.sh.intel.com> This cleanup removes unnecessary argument 'struct resource *res' in pci_update_resource(), so it takes same arguments as other companion functions (pci_assign_resource(), etc.). 
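A sketch of what a caller looks like with the new signature (hypothetical code, not from the patch; move_bar() is only an illustration): since pci_update_resource() now derives the resource from the index itself, a caller can no longer pass a 'res' pointer that disagrees with 'resno':

static void move_bar(struct pci_dev *dev, int resno, resource_size_t new_start)
{
	struct resource *res = &dev->resource[resno];
	resource_size_t size = resource_size(res);

	res->start = new_start;
	res->end = new_start + size - 1;

	/* dev->resource[resno] is looked up internally now */
	pci_update_resource(dev, resno);
}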
Cc: Alex Chiang Cc: Grant Grundler Cc: Greg KH Cc: Ingo Molnar Cc: Jesse Barnes Cc: Matthew Wilcox Cc: Randy Dunlap Cc: Roland Dreier Signed-off-by: Yu Zhao --- drivers/pci/pci.c | 4 ++-- drivers/pci/setup-res.c | 7 ++++--- include/linux/pci.h | 2 +- 3 files changed, 7 insertions(+), 6 deletions(-) diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 4db261e..ae62f01 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -376,8 +376,8 @@ pci_restore_bars(struct pci_dev *dev) return; } - for (i = 0; i < numres; i ++) - pci_update_resource(dev, &dev->resource[i], i); + for (i = 0; i < numres; i++) + pci_update_resource(dev, i); } static struct pci_platform_pm_ops *pci_platform_pm; diff --git a/drivers/pci/setup-res.c b/drivers/pci/setup-res.c index 2dbd96c..b7ca679 100644 --- a/drivers/pci/setup-res.c +++ b/drivers/pci/setup-res.c @@ -26,11 +26,12 @@ #include "pci.h" -void pci_update_resource(struct pci_dev *dev, struct resource *res, int resno) +void pci_update_resource(struct pci_dev *dev, int resno) { struct pci_bus_region region; u32 new, check, mask; int reg; + struct resource *res = dev->resource + resno; /* * Ignore resources for unimplemented BARs and unused resource slots @@ -162,7 +163,7 @@ int pci_assign_resource(struct pci_dev *dev, int resno) } else { res->flags &= ~IORESOURCE_STARTALIGN; if (resno < PCI_BRIDGE_RESOURCES) - pci_update_resource(dev, res, resno); + pci_update_resource(dev, resno); } return ret; @@ -197,7 +198,7 @@ int pci_assign_resource_fixed(struct pci_dev *dev, int resno) dev_err(&dev->dev, "BAR %d: can't allocate %s resource %pR\n", resno, res->flags & IORESOURCE_IO ? "I/O" : "mem", res); } else if (resno < PCI_BRIDGE_RESOURCES) { - pci_update_resource(dev, res, resno); + pci_update_resource(dev, resno); } return ret; diff --git a/include/linux/pci.h b/include/linux/pci.h index 085187b..43e1fc1 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -626,7 +626,7 @@ int pcix_get_mmrbc(struct pci_dev *dev); int pcix_set_mmrbc(struct pci_dev *dev, int mmrbc); int pcie_get_readrq(struct pci_dev *dev); int pcie_set_readrq(struct pci_dev *dev, int rq); -void pci_update_resource(struct pci_dev *dev, struct resource *res, int resno); +void pci_update_resource(struct pci_dev *dev, int resno); int __must_check pci_assign_resource(struct pci_dev *dev, int i); int pci_select_bars(struct pci_dev *dev, unsigned long flags); -- 1.5.6.4 From yu.zhao at intel.com Wed Oct 22 01:41:02 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Wed, 22 Oct 2008 16:41:02 +0800 Subject: [PATCH 3/16 v6] PCI: export __pci_read_base In-Reply-To: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> References: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> Message-ID: <20081022084102.GC3773@yzhao12-linux.sh.intel.com> Export __pci_read_base() so it can be used by whole PCI subsystem. 
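To make the intent concrete, here is a rough sketch of the kind of caller the export enables elsewhere in drivers/pci (hypothetical, not code from the series); 'pos' is assumed to be the config-space offset of a non-standard BAR register, for example one located inside an extended capability via pci_find_ext_capability():

static void probe_extra_bar(struct pci_dev *dev, int resno, unsigned int pos)
{
	struct resource *res = &dev->resource[resno];

	/* __pci_read_base() returns 1 for a 64-bit BAR, 0 for a 32-bit one */
	if (__pci_read_base(dev, pci_bar_unknown, res, pos))
		dev_info(&dev->dev, "resource %d is a 64-bit BAR\n", resno);
}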
Cc: Alex Chiang Cc: Grant Grundler Cc: Greg KH Cc: Ingo Molnar Cc: Jesse Barnes Cc: Matthew Wilcox Cc: Randy Dunlap Cc: Roland Dreier Signed-off-by: Yu Zhao --- drivers/pci/pci.h | 9 +++++++++ drivers/pci/probe.c | 20 +++++++++----------- 2 files changed, 18 insertions(+), 11 deletions(-) diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index b205ab8..fbbc6ad 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -157,6 +157,15 @@ struct pci_slot_attribute { }; #define to_pci_slot_attr(s) container_of(s, struct pci_slot_attribute, attr) +enum pci_bar_type { + pci_bar_unknown, /* Standard PCI BAR probe */ + pci_bar_io, /* An io port BAR */ + pci_bar_mem32, /* A 32-bit memory BAR */ + pci_bar_mem64, /* A 64-bit memory BAR */ +}; + +extern int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type, + struct resource *res, unsigned int reg); extern void pci_enable_ari(struct pci_dev *dev); /** * pci_ari_enabled - query ARI forwarding status diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index a52784c..db3e5a7 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -135,13 +135,6 @@ static u64 pci_size(u64 base, u64 maxbase, u64 mask) return size; } -enum pci_bar_type { - pci_bar_unknown, /* Standard PCI BAR probe */ - pci_bar_io, /* An io port BAR */ - pci_bar_mem32, /* A 32-bit memory BAR */ - pci_bar_mem64, /* A 64-bit memory BAR */ -}; - static inline enum pci_bar_type decode_bar(struct resource *res, u32 bar) { if ((bar & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO) { @@ -156,11 +149,16 @@ static inline enum pci_bar_type decode_bar(struct resource *res, u32 bar) return pci_bar_mem32; } -/* - * If the type is not unknown, we assume that the lowest bit is 'enable'. - * Returns 1 if the BAR was 64-bit and 0 if it was 32-bit. +/** + * pci_read_base - read a PCI BAR + * @dev: the PCI device + * @type: type of the BAR + * @res: resource buffer to be filled in + * @pos: BAR position in the config space + * + * Returns 1 if the BAR is 64-bit, or 0 if 32-bit. */ -static int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type, +int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type, struct resource *res, unsigned int pos) { u32 l, sz, mask; -- 1.5.6.4 From yu.zhao at intel.com Wed Oct 22 01:41:48 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Wed, 22 Oct 2008 16:41:48 +0800 Subject: [PATCH 5/16 v6] PCI: add a wrapper for resource_alignment() In-Reply-To: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> References: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> Message-ID: <20081022084148.GE3773@yzhao12-linux.sh.intel.com> Add a wrapper for resource_alignment() so it can handle device specific resource alignment. Cc: Alex Chiang Cc: Grant Grundler Cc: Greg KH Cc: Ingo Molnar Cc: Jesse Barnes Cc: Matthew Wilcox Cc: Randy Dunlap Cc: Roland Dreier Signed-off-by: Yu Zhao --- drivers/pci/pci.c | 20 ++++++++++++++++++++ drivers/pci/pci.h | 1 + drivers/pci/setup-bus.c | 4 ++-- drivers/pci/setup-res.c | 7 ++++--- 4 files changed, 27 insertions(+), 5 deletions(-) diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 40284dc..a9b554e 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -1904,6 +1904,26 @@ int pci_select_bars(struct pci_dev *dev, unsigned long flags) return bars; } +/** + * pci_resource_alignment - get a PCI BAR resource alignment + * @dev: the PCI device + * @resno: the resource number + * + * Returns alignment size on success, or 0 on error. 
+ */ +int pci_resource_alignment(struct pci_dev *dev, int resno) +{ + resource_size_t align; + struct resource *res = dev->resource + resno; + + align = resource_alignment(res); + if (align) + return align; + + dev_err(&dev->dev, "alignment: invalid resource #%d\n", resno); + return 0; +} + static void __devinit pci_no_domains(void) { #ifdef CONFIG_PCI_DOMAINS diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index fbbc6ad..baa3d23 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -166,6 +166,7 @@ enum pci_bar_type { extern int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type, struct resource *res, unsigned int reg); +extern int pci_resource_alignment(struct pci_dev *dev, int resno); extern void pci_enable_ari(struct pci_dev *dev); /** * pci_ari_enabled - query ARI forwarding status diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c index ea979f2..90a9c0a 100644 --- a/drivers/pci/setup-bus.c +++ b/drivers/pci/setup-bus.c @@ -25,6 +25,7 @@ #include #include #include +#include "pci.h" static void pbus_assign_resources_sorted(struct pci_bus *bus) @@ -351,8 +352,7 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask, unsigned long if (r->parent || (r->flags & mask) != type) continue; r_size = resource_size(r); - /* For bridges size != alignment */ - align = resource_alignment(r); + align = pci_resource_alignment(dev, i); order = __ffs(align) - 20; if (order > 11) { dev_warn(&dev->dev, "BAR %d bad alignment %llx: " diff --git a/drivers/pci/setup-res.c b/drivers/pci/setup-res.c index b7ca679..88a9c70 100644 --- a/drivers/pci/setup-res.c +++ b/drivers/pci/setup-res.c @@ -133,7 +133,7 @@ int pci_assign_resource(struct pci_dev *dev, int resno) size = resource_size(res); min = (res->flags & IORESOURCE_IO) ? PCIBIOS_MIN_IO : PCIBIOS_MIN_MEM; - align = resource_alignment(res); + align = pci_resource_alignment(dev, resno); if (!align) { dev_err(&dev->dev, "BAR %d: can't allocate resource (bogus " "alignment) %pR flags %#lx\n", @@ -224,7 +224,7 @@ void pdev_sort_resources(struct pci_dev *dev, struct resource_list *head) if (!(r->flags) || r->parent) continue; - r_align = resource_alignment(r); + r_align = pci_resource_alignment(dev, i); if (!r_align) { dev_warn(&dev->dev, "BAR %d: bogus alignment " "%pR flags %#lx\n", @@ -236,7 +236,8 @@ void pdev_sort_resources(struct pci_dev *dev, struct resource_list *head) struct resource_list *ln = list->next; if (ln) - align = resource_alignment(ln->res); + align = pci_resource_alignment(ln->dev, + ln->res - ln->dev->resource); if (r_align > align) { tmp = kmalloc(sizeof(*tmp), GFP_KERNEL); -- 1.5.6.4 From yu.zhao at intel.com Wed Oct 22 01:41:24 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Wed, 22 Oct 2008 16:41:24 +0800 Subject: [PATCH 4/16 v6] PCI: make pci_alloc_child_bus() be able to handle NULL bridge In-Reply-To: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> References: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> Message-ID: <20081022084124.GD3773@yzhao12-linux.sh.intel.com> Make pci_alloc_child_bus() be able to allocate buses without bridge devices. Some SR-IOV devices can occupy more than one bus number, but there is no explicit bridges because that have internal routing mechanism. 
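A sketch of the kind of use this change allows (illustrative only, and assuming the function is visible to the caller; it stays internal to the PCI core): a virtual child bus can be hung off the PF's bus even though no P2P bridge device exists to own it:

static struct pci_bus *add_virtual_bus(struct pci_bus *parent, int busnr)
{
	struct pci_bus *child;

	/* NULL bridge: the routing is internal to the SR-IOV device */
	child = pci_alloc_child_bus(parent, NULL, busnr);
	if (!child)
		return NULL;

	/* a leaf bus holding only VFs, nothing is bridged behind it */
	child->subordinate = busnr;
	return child;
}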
Cc: Alex Chiang Cc: Grant Grundler Cc: Greg KH Cc: Ingo Molnar Cc: Jesse Barnes Cc: Matthew Wilcox Cc: Randy Dunlap Cc: Roland Dreier Signed-off-by: Yu Zhao --- drivers/pci/probe.c | 7 +++++-- 1 files changed, 5 insertions(+), 2 deletions(-) diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index db3e5a7..4b12b58 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -401,12 +401,10 @@ static struct pci_bus *pci_alloc_child_bus(struct pci_bus *parent, if (!child) return NULL; - child->self = bridge; child->parent = parent; child->ops = parent->ops; child->sysdata = parent->sysdata; child->bus_flags = parent->bus_flags; - child->bridge = get_device(&bridge->dev); /* initialize some portions of the bus device, but don't register it * now as the parent is not properly set up yet. This device will get @@ -423,6 +421,11 @@ static struct pci_bus *pci_alloc_child_bus(struct pci_bus *parent, child->primary = parent->secondary; child->subordinate = 0xff; + if (!bridge) + return child; + + child->self = bridge; + child->bridge = get_device(&bridge->dev); /* Set up default resource pointers and names.. */ for (i = 0; i < PCI_BRIDGE_RES_NUM; i++) { child->resource[i] = &bridge->resource[PCI_BRIDGE_RESOURCES+i]; -- 1.5.6.4 From yu.zhao at intel.com Wed Oct 22 01:42:18 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Wed, 22 Oct 2008 16:42:18 +0800 Subject: [PATCH 6/16 v6] PCI: add a new function to map BAR offset In-Reply-To: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> References: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> Message-ID: <20081022084218.GF3773@yzhao12-linux.sh.intel.com> Add a function to map resource number to corresponding register so people can get the offset and type of device specific BARs. Cc: Alex Chiang Cc: Grant Grundler Cc: Greg KH Cc: Ingo Molnar Cc: Jesse Barnes Cc: Matthew Wilcox Cc: Randy Dunlap Cc: Roland Dreier Signed-off-by: Yu Zhao --- drivers/pci/pci.c | 22 ++++++++++++++++++++++ drivers/pci/pci.h | 2 ++ drivers/pci/setup-res.c | 13 +++++-------- 3 files changed, 29 insertions(+), 8 deletions(-) diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index a9b554e..b02167a 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -1924,6 +1924,28 @@ int pci_resource_alignment(struct pci_dev *dev, int resno) return 0; } +/** + * pci_resource_bar - get position of the BAR associated with a resource + * @dev: the PCI device + * @resno: the resource number + * @type: the BAR type to be filled in + * + * Returns BAR position in config space, or 0 if the BAR is invalid. 
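As an illustration of how the mapping is meant to be consumed (a hypothetical caller, not taken from the series):

static void show_bar_position(struct pci_dev *dev, int resno)
{
	enum pci_bar_type type;
	int reg = pci_resource_bar(dev, resno, &type);

	if (!reg)
		return;	/* no BAR register backs this resource */

	dev_info(&dev->dev, "resource %d maps to config offset %#x (%s)\n",
		 resno, reg,
		 type == pci_bar_unknown ? "standard BAR" : "other BAR");
}

The same lookup is what lets pci_update_resource() in the diff below handle the ROM register and, later in the series, device specific BARs without open-coding their offsets.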
+ */ +int pci_resource_bar(struct pci_dev *dev, int resno, enum pci_bar_type *type) +{ + if (resno < PCI_ROM_RESOURCE) { + *type = pci_bar_unknown; + return PCI_BASE_ADDRESS_0 + 4 * resno; + } else if (resno == PCI_ROM_RESOURCE) { + *type = pci_bar_mem32; + return dev->rom_base_reg; + } + + dev_err(&dev->dev, "BAR: invalid resource #%d\n", resno); + return 0; +} + static void __devinit pci_no_domains(void) { #ifdef CONFIG_PCI_DOMAINS diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index baa3d23..d707477 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -167,6 +167,8 @@ enum pci_bar_type { extern int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type, struct resource *res, unsigned int reg); extern int pci_resource_alignment(struct pci_dev *dev, int resno); +extern int pci_resource_bar(struct pci_dev *dev, int resno, + enum pci_bar_type *type); extern void pci_enable_ari(struct pci_dev *dev); /** * pci_ari_enabled - query ARI forwarding status diff --git a/drivers/pci/setup-res.c b/drivers/pci/setup-res.c index 88a9c70..5812f4b 100644 --- a/drivers/pci/setup-res.c +++ b/drivers/pci/setup-res.c @@ -31,6 +31,7 @@ void pci_update_resource(struct pci_dev *dev, int resno) struct pci_bus_region region; u32 new, check, mask; int reg; + enum pci_bar_type type; struct resource *res = dev->resource + resno; /* @@ -62,17 +63,13 @@ void pci_update_resource(struct pci_dev *dev, int resno) else mask = (u32)PCI_BASE_ADDRESS_MEM_MASK; - if (resno < 6) { - reg = PCI_BASE_ADDRESS_0 + 4 * resno; - } else if (resno == PCI_ROM_RESOURCE) { + reg = pci_resource_bar(dev, resno, &type); + if (!reg) + return; + if (type != pci_bar_unknown) { if (!(res->flags & IORESOURCE_ROM_ENABLE)) return; new |= PCI_ROM_ADDRESS_ENABLE; - reg = dev->rom_base_reg; - } else { - /* Hmm, non-standard resource. */ - - return; /* kill uninitialised var warning */ } pci_write_config_dword(dev, reg, new); -- 1.5.6.4 From yu.zhao at intel.com Wed Oct 22 01:42:41 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Wed, 22 Oct 2008 16:42:41 +0800 Subject: [PATCH 7/16 v6] PCI: cleanup pcibios_allocate_resources() In-Reply-To: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> References: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> Message-ID: <20081022084241.GG3773@yzhao12-linux.sh.intel.com> This cleanup makes pcibios_allocate_resources() easier to read. 
Cc: Alex Chiang Cc: Grant Grundler Cc: Greg KH Cc: Ingo Molnar Cc: Jesse Barnes Cc: Matthew Wilcox Cc: Randy Dunlap Cc: Roland Dreier Signed-off-by: Yu Zhao --- arch/x86/pci/i386.c | 28 ++++++++++++++-------------- 1 files changed, 14 insertions(+), 14 deletions(-) diff --git a/arch/x86/pci/i386.c b/arch/x86/pci/i386.c index 844df0c..8729bde 100644 --- a/arch/x86/pci/i386.c +++ b/arch/x86/pci/i386.c @@ -147,7 +147,7 @@ static void __init pcibios_allocate_bus_resources(struct list_head *bus_list) static void __init pcibios_allocate_resources(int pass) { struct pci_dev *dev = NULL; - int idx, disabled; + int idx, enabled; u16 command; struct resource *r, *pr; @@ -160,22 +160,22 @@ static void __init pcibios_allocate_resources(int pass) if (!r->start) /* Address not assigned at all */ continue; if (r->flags & IORESOURCE_IO) - disabled = !(command & PCI_COMMAND_IO); + enabled = command & PCI_COMMAND_IO; else - disabled = !(command & PCI_COMMAND_MEMORY); - if (pass == disabled) { - dev_dbg(&dev->dev, "resource %#08llx-%#08llx (f=%lx, d=%d, p=%d)\n", + enabled = command & PCI_COMMAND_MEMORY; + if (pass == enabled) + continue; + dev_dbg(&dev->dev, "resource %#08llx-%#08llx (f=%lx, d=%d, p=%d)\n", (unsigned long long) r->start, (unsigned long long) r->end, - r->flags, disabled, pass); - pr = pci_find_parent_resource(dev, r); - if (!pr || request_resource(pr, r) < 0) { - dev_err(&dev->dev, "BAR %d: can't allocate resource\n", idx); - /* We'll assign a new address later */ - r->end -= r->start; - r->start = 0; - } - } + r->flags, enabled, pass); + pr = pci_find_parent_resource(dev, r); + if (pr && !request_resource(pr, r)) + continue; + dev_err(&dev->dev, "BAR %d: can't allocate resource\n", idx); + /* We'll assign a new address later */ + r->end -= r->start; + r->start = 0; } if (!pass) { r = &dev->resource[PCI_ROM_RESOURCE]; -- 1.5.6.4 From yu.zhao at intel.com Wed Oct 22 01:43:03 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Wed, 22 Oct 2008 16:43:03 +0800 Subject: [PATCH 8/16 v6] PCI: add boot options to reassign resources In-Reply-To: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> References: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> Message-ID: <20081022084303.GH3773@yzhao12-linux.sh.intel.com> This patch adds boot options so user can reassign device resources of all devices under a bus. The boot options can be used as: pci=assign-mmio=0000:01,assign-pio=0000:02 '[dddd:]bb' is the domain and bus number. Cc: Alex Chiang Cc: Grant Grundler Cc: Greg KH Cc: Ingo Molnar Cc: Jesse Barnes Cc: Matthew Wilcox Cc: Randy Dunlap Cc: Roland Dreier Signed-off-by: Yu Zhao --- arch/x86/pci/common.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++++ arch/x86/pci/i386.c | 10 ++++--- arch/x86/pci/pci.h | 3 ++ 3 files changed, 82 insertions(+), 4 deletions(-) diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c index b67732b..06e1ce0 100644 --- a/arch/x86/pci/common.c +++ b/arch/x86/pci/common.c @@ -137,6 +137,72 @@ static void __devinit pcibios_fixup_device_resources(struct pci_dev *dev) } } +static char *pci_assign_pio; +static char *pci_assign_mmio; + +static int pcibios_bus_resource_needs_fixup(struct pci_bus *bus) +{ + int i; + int type = 0; + int domain, busnr; + + if (!bus->self) + return 0; + + for (i = 0; i < 2; i++) { + char *str = i ? 
pci_assign_pio : pci_assign_mmio; + + while (str && *str) { + if (sscanf(str, "%04x:%02x", &domain, &busnr) != 2) { + if (sscanf(str, "%02x", &busnr) != 1) + break; + domain = 0; + } + + if (pci_domain_nr(bus) == domain && + bus->number == busnr) { + type |= i ? IORESOURCE_IO : IORESOURCE_MEM; + break; + } + + str = strchr(str, ';'); + if (str) + str++; + } + } + + return type; +} + +static void __devinit pcibios_fixup_bus_resources(struct pci_bus *bus) +{ + int i; + int type = pcibios_bus_resource_needs_fixup(bus); + + if (!type) + return; + + for (i = 0; i < PCI_BUS_NUM_RESOURCES; i++) { + struct resource *res = bus->resource[i]; + + if (!res) + continue; + if (res->flags & type) + res->flags = 0; + } +} + +int pcibios_resource_needs_fixup(struct pci_dev *dev, int resno) +{ + struct pci_bus *bus; + + for (bus = dev->bus; bus && bus != pci_root_bus; bus = bus->parent) + if (pcibios_bus_resource_needs_fixup(bus)) + return 1; + + return 0; +} + /* * Called after each bus is probed, but before its children * are examined. @@ -147,6 +213,7 @@ void __devinit pcibios_fixup_bus(struct pci_bus *b) struct pci_dev *dev; pci_read_bridge_bases(b); + pcibios_fixup_bus_resources(b); list_for_each_entry(dev, &b->devices, bus_list) pcibios_fixup_device_resources(dev); } @@ -519,6 +586,12 @@ char * __devinit pcibios_setup(char *str) } else if (!strcmp(str, "skip_isa_align")) { pci_probe |= PCI_CAN_SKIP_ISA_ALIGN; return NULL; + } else if (!strncmp(str, "assign-pio=", 11)) { + pci_assign_pio = str + 11; + return NULL; + } else if (!strncmp(str, "assign-mmio=", 12)) { + pci_assign_mmio = str + 12; + return NULL; } return str; } diff --git a/arch/x86/pci/i386.c b/arch/x86/pci/i386.c index 8729bde..ea82a5b 100644 --- a/arch/x86/pci/i386.c +++ b/arch/x86/pci/i386.c @@ -169,10 +169,12 @@ static void __init pcibios_allocate_resources(int pass) (unsigned long long) r->start, (unsigned long long) r->end, r->flags, enabled, pass); - pr = pci_find_parent_resource(dev, r); - if (pr && !request_resource(pr, r)) - continue; - dev_err(&dev->dev, "BAR %d: can't allocate resource\n", idx); + if (!pcibios_resource_needs_fixup(dev, idx)) { + pr = pci_find_parent_resource(dev, r); + if (pr && !request_resource(pr, r)) + continue; + dev_err(&dev->dev, "BAR %d: can't allocate resource\n", idx); + } /* We'll assign a new address later */ r->end -= r->start; r->start = 0; diff --git a/arch/x86/pci/pci.h b/arch/x86/pci/pci.h index 15b9cf6..f22737d 100644 --- a/arch/x86/pci/pci.h +++ b/arch/x86/pci/pci.h @@ -117,6 +117,9 @@ extern int __init pcibios_init(void); extern int __init pci_mmcfg_arch_init(void); extern void __init pci_mmcfg_arch_free(void); +/* pci-common.c */ +extern int pcibios_resource_needs_fixup(struct pci_dev *dev, int resno); + /* * AMD Fam10h CPUs are buggy, and cannot access MMIO config space * on their northbrige except through the * %eax register. As such, you MUST -- 1.5.6.4 From yu.zhao at intel.com Wed Oct 22 01:43:24 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Wed, 22 Oct 2008 16:43:24 +0800 Subject: [PATCH 9/16 v6] PCI: add boot option to align MMIO resources In-Reply-To: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> References: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> Message-ID: <20081022084324.GI3773@yzhao12-linux.sh.intel.com> This patch adds boot option to align MMIO resource for a device. The alignment is a bigger value between the PAGE_SIZE and the resource size. 
The boot option can be used as: pci=align-mmio=0000:01:02.3 '[0000:]01:02.3' is the domain, bus, device and function number of the device. Cc: Alex Chiang Cc: Grant Grundler Cc: Greg KH Cc: Ingo Molnar Cc: Jesse Barnes Cc: Matthew Wilcox Cc: Randy Dunlap Cc: Roland Dreier Signed-off-by: Yu Zhao --- arch/x86/pci/common.c | 37 +++++++++++++++++++++++++++++++++++++ drivers/pci/pci.c | 20 ++++++++++++++++++-- include/linux/pci.h | 1 + 3 files changed, 56 insertions(+), 2 deletions(-) diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c index 06e1ce0..3c5d230 100644 --- a/arch/x86/pci/common.c +++ b/arch/x86/pci/common.c @@ -139,6 +139,7 @@ static void __devinit pcibios_fixup_device_resources(struct pci_dev *dev) static char *pci_assign_pio; static char *pci_assign_mmio; +static char *pci_align_mmio; static int pcibios_bus_resource_needs_fixup(struct pci_bus *bus) { @@ -192,6 +193,36 @@ static void __devinit pcibios_fixup_bus_resources(struct pci_bus *bus) } } +int pcibios_resource_alignment(struct pci_dev *dev, int resno) +{ + int domain, busnr, slot, func; + char *str = pci_align_mmio; + + if (dev->resource[resno].flags & IORESOURCE_IO) + return 0; + + while (str && *str) { + if (sscanf(str, "%04x:%02x:%02x.%d", + &domain, &busnr, &slot, &func) != 4) { + if (sscanf(str, "%02x:%02x.%d", + &busnr, &slot, &func) != 3) + break; + domain = 0; + } + + if (pci_domain_nr(dev->bus) == domain && + dev->bus->number == busnr && + dev->devfn == PCI_DEVFN(slot, func)) + return PAGE_SIZE; + + str = strchr(str, ';'); + if (str) + str++; + } + + return 0; +} + int pcibios_resource_needs_fixup(struct pci_dev *dev, int resno) { struct pci_bus *bus; @@ -200,6 +231,9 @@ int pcibios_resource_needs_fixup(struct pci_dev *dev, int resno) if (pcibios_bus_resource_needs_fixup(bus)) return 1; + if (pcibios_resource_alignment(dev, resno)) + return 1; + return 0; } @@ -592,6 +626,9 @@ char * __devinit pcibios_setup(char *str) } else if (!strncmp(str, "assign-mmio=", 12)) { pci_assign_mmio = str + 12; return NULL; + } else if (!strncmp(str, "align-mmio=", 11)) { + pci_align_mmio = str + 11; + return NULL; } return str; } diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index b02167a..11ecd6f 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -1015,6 +1015,20 @@ int __attribute__ ((weak)) pcibios_set_pcie_reset_state(struct pci_dev *dev, } /** + * pcibios_resource_alignment - get resource alignment requirement + * @dev: the PCI device + * @resno: resource number + * + * Queries the resource alignment from PCI low level code. Returns positive + * if there is alignment requirement of the resource, or 0 otherwise. + */ +int __attribute__ ((weak)) pcibios_resource_alignment(struct pci_dev *dev, + int resno) +{ + return 0; +} + +/** * pci_set_pcie_reset_state - set reset state for device dev * @dev: the PCI-E device reset * @state: Reset state to enter into @@ -1913,12 +1927,14 @@ int pci_select_bars(struct pci_dev *dev, unsigned long flags) */ int pci_resource_alignment(struct pci_dev *dev, int resno) { - resource_size_t align; + resource_size_t align, bios_align; struct resource *res = dev->resource + resno; + bios_align = pcibios_resource_alignment(dev, resno); + align = resource_alignment(res); if (align) - return align; + return align > bios_align ? 
align : bios_align; dev_err(&dev->dev, "alignment: invalid resource #%d\n", resno); return 0; diff --git a/include/linux/pci.h b/include/linux/pci.h index 2ada2b6..6ac69af 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -1121,6 +1121,7 @@ int pcibios_add_platform_entries(struct pci_dev *dev); void pcibios_disable_device(struct pci_dev *dev); int pcibios_set_pcie_reset_state(struct pci_dev *dev, enum pcie_reset_state state); +int pcibios_resource_alignment(struct pci_dev *dev, int resno); #ifdef CONFIG_PCI_MMCONFIG extern void __init pci_mmcfg_early_init(void); -- 1.5.6.4 From yu.zhao at intel.com Wed Oct 22 01:43:46 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Wed, 22 Oct 2008 16:43:46 +0800 Subject: [PATCH 10/16 v6] PCI: cleanup pci_bus_add_devices() In-Reply-To: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> References: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> Message-ID: <20081022084346.GJ3773@yzhao12-linux.sh.intel.com> This cleanup makes pci_bus_add_devices() easier to read. Cc: Alex Chiang Cc: Grant Grundler Cc: Greg KH Cc: Ingo Molnar Cc: Jesse Barnes Cc: Matthew Wilcox Cc: Randy Dunlap Cc: Roland Dreier Signed-off-by: Yu Zhao --- drivers/pci/bus.c | 56 +++++++++++++++++++++++++------------------------ drivers/pci/remove.c | 2 + 2 files changed, 31 insertions(+), 27 deletions(-) diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c index 999cc40..7a21602 100644 --- a/drivers/pci/bus.c +++ b/drivers/pci/bus.c @@ -71,7 +71,7 @@ pci_bus_alloc_resource(struct pci_bus *bus, struct resource *res, } /** - * add a single device + * pci_bus_add_device - add a single device * @dev: device to add * * This adds a single pci device to the global @@ -105,7 +105,7 @@ int pci_bus_add_device(struct pci_dev *dev) void pci_bus_add_devices(struct pci_bus *bus) { struct pci_dev *dev; - struct pci_bus *child_bus; + struct pci_bus *child; int retval; list_for_each_entry(dev, &bus->devices, bus_list) { @@ -120,39 +120,41 @@ void pci_bus_add_devices(struct pci_bus *bus) list_for_each_entry(dev, &bus->devices, bus_list) { BUG_ON(!dev->is_added); + child = dev->subordinate; /* * If there is an unattached subordinate bus, attach * it and then scan for unattached PCI devices. */ - if (dev->subordinate) { - if (list_empty(&dev->subordinate->node)) { - down_write(&pci_bus_sem); - list_add_tail(&dev->subordinate->node, - &dev->bus->children); - up_write(&pci_bus_sem); - } - pci_bus_add_devices(dev->subordinate); - - /* register the bus with sysfs as the parent is now - * properly registered. */ - child_bus = dev->subordinate; - if (child_bus->is_added) - continue; - child_bus->dev.parent = child_bus->bridge; - retval = device_register(&child_bus->dev); - if (retval) - dev_err(&dev->dev, "Error registering pci_bus," - " continuing...\n"); - else { - child_bus->is_added = 1; - retval = device_create_file(&child_bus->dev, - &dev_attr_cpuaffinity); - } + if (!child) + continue; + if (list_empty(&child->node)) { + down_write(&pci_bus_sem); + list_add_tail(&child->node, + &dev->bus->children); + up_write(&pci_bus_sem); + } + pci_bus_add_devices(child); + + /* + * register the bus with sysfs as the parent is now + * properly registered. 
+ */ + if (child->is_added) + continue; + child->dev.parent = child->bridge; + retval = device_register(&child->dev); + if (retval) + dev_err(&dev->dev, "Error registering pci_bus," + " continuing...\n"); + else { + child->is_added = 1; + retval = device_create_file(&child->dev, + &dev_attr_cpuaffinity); if (retval) dev_err(&dev->dev, "Error creating cpuaffinity" " file, continuing...\n"); - retval = device_create_file(&child_bus->dev, + retval = device_create_file(&child->dev, &dev_attr_cpulistaffinity); if (retval) dev_err(&dev->dev, diff --git a/drivers/pci/remove.c b/drivers/pci/remove.c index 042e089..bfa0869 100644 --- a/drivers/pci/remove.c +++ b/drivers/pci/remove.c @@ -72,6 +72,8 @@ void pci_remove_bus(struct pci_bus *pci_bus) list_del(&pci_bus->node); up_write(&pci_bus_sem); pci_remove_legacy_files(pci_bus); + if (!pci_bus->is_added) + return; device_remove_file(&pci_bus->dev, &dev_attr_cpuaffinity); device_remove_file(&pci_bus->dev, &dev_attr_cpulistaffinity); device_unregister(&pci_bus->dev); -- 1.5.6.4 From yu.zhao at intel.com Wed Oct 22 01:44:05 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Wed, 22 Oct 2008 16:44:05 +0800 Subject: [PATCH 11/16 v6] PCI: split a new function from pci_bus_add_devices() In-Reply-To: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> References: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> Message-ID: <20081022084405.GK3773@yzhao12-linux.sh.intel.com> This patch splits a new function from pci_bus_add_devices(). The new function can be used to register PCI bus to the device core and create its sysfs entries. Cc: Alex Chiang Cc: Grant Grundler Cc: Greg KH Cc: Ingo Molnar Cc: Jesse Barnes Cc: Matthew Wilcox Cc: Randy Dunlap Cc: Roland Dreier Signed-off-by: Yu Zhao --- drivers/pci/bus.c | 47 ++++++++++++++++++++++++++++------------------- include/linux/pci.h | 1 + 2 files changed, 29 insertions(+), 19 deletions(-) diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c index 7a21602..1713c35 100644 --- a/drivers/pci/bus.c +++ b/drivers/pci/bus.c @@ -91,6 +91,32 @@ int pci_bus_add_device(struct pci_dev *dev) } /** + * pci_bus_add_child - add a child bus + * @bus: bus to add + * + * This adds sysfs entries for a single bus + */ +int pci_bus_add_child(struct pci_bus *bus) +{ + int retval; + + if (bus->bridge) + bus->dev.parent = bus->bridge; + + retval = device_register(&bus->dev); + if (retval) + return retval; + + bus->is_added = 1; + + retval = device_create_file(&bus->dev, &dev_attr_cpuaffinity); + if (retval) + return retval; + + return device_create_file(&bus->dev, &dev_attr_cpulistaffinity); +} + +/** * pci_bus_add_devices - insert newly discovered PCI devices * @bus: bus to check for new devices * @@ -141,26 +167,9 @@ void pci_bus_add_devices(struct pci_bus *bus) */ if (child->is_added) continue; - child->dev.parent = child->bridge; - retval = device_register(&child->dev); + retval = pci_bus_add_child(child); if (retval) - dev_err(&dev->dev, "Error registering pci_bus," - " continuing...\n"); - else { - child->is_added = 1; - retval = device_create_file(&child->dev, - &dev_attr_cpuaffinity); - if (retval) - dev_err(&dev->dev, "Error creating cpuaffinity" - " file, continuing...\n"); - - retval = device_create_file(&child->dev, - &dev_attr_cpulistaffinity); - if (retval) - dev_err(&dev->dev, - "Error creating cpulistaffinity" - " file, continuing...\n"); - } + dev_err(&dev->dev, "Error adding bus, continuing\n"); } } diff --git a/include/linux/pci.h b/include/linux/pci.h index 6ac69af..80d88f8 100644 --- a/include/linux/pci.h +++ 
b/include/linux/pci.h @@ -528,6 +528,7 @@ struct pci_dev *pci_scan_single_device(struct pci_bus *bus, int devfn); void pci_device_add(struct pci_dev *dev, struct pci_bus *bus); unsigned int pci_scan_child_bus(struct pci_bus *bus); int __must_check pci_bus_add_device(struct pci_dev *dev); +int pci_bus_add_child(struct pci_bus *bus); void pci_read_bridge_bases(struct pci_bus *child); struct resource *pci_find_parent_resource(const struct pci_dev *dev, struct resource *res); -- 1.5.6.4 From yu.zhao at intel.com Wed Oct 22 01:44:23 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Wed, 22 Oct 2008 16:44:23 +0800 Subject: [PATCH 12/16 v6] PCI: support the SR-IOV capability In-Reply-To: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> References: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> Message-ID: <20081022084423.GL3773@yzhao12-linux.sh.intel.com> Support Single Root I/O Virtualization (SR-IOV) capability. Cc: Alex Chiang Cc: Grant Grundler Cc: Greg KH Cc: Ingo Molnar Cc: Jesse Barnes Cc: Matthew Wilcox Cc: Randy Dunlap Cc: Roland Dreier Signed-off-by: Yu Zhao --- drivers/pci/Kconfig | 12 + drivers/pci/Makefile | 2 + drivers/pci/iov.c | 592 ++++++++++++++++++++++++++++++++++++++++++++++ drivers/pci/pci-sysfs.c | 4 + drivers/pci/pci.c | 14 + drivers/pci/pci.h | 48 ++++ drivers/pci/probe.c | 4 + include/linux/pci.h | 39 +++ include/linux/pci_regs.h | 21 ++ 9 files changed, 736 insertions(+), 0 deletions(-) create mode 100644 drivers/pci/iov.c diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig index e1ca425..e7c0836 100644 --- a/drivers/pci/Kconfig +++ b/drivers/pci/Kconfig @@ -50,3 +50,15 @@ config HT_IRQ This allows native hypertransport devices to use interrupts. If unsure say Y. + +config PCI_IOV + bool "PCI SR-IOV support" + depends on PCI + select PCI_MSI + default n + help + This option allows device drivers to enable Single Root I/O + Virtualization. Each Virtual Function's PCI configuration + space can be accessed using its own Bus, Device and Function + Number (Routing ID). Each Virtual Function also has PCI Memory + Space, which is used to map its own register set. diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile index 4b47f4e..abbfcfa 100644 --- a/drivers/pci/Makefile +++ b/drivers/pci/Makefile @@ -55,3 +55,5 @@ obj-$(CONFIG_PCI_SYSCALL) += syscall.o ifeq ($(CONFIG_PCI_DEBUG),y) EXTRA_CFLAGS += -DDEBUG endif + +obj-$(CONFIG_PCI_IOV) += iov.o diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c new file mode 100644 index 0000000..dd299aa --- /dev/null +++ b/drivers/pci/iov.c @@ -0,0 +1,592 @@ +/* + * drivers/pci/iov.c + * + * Copyright (C) 2008 Intel Corporation + * + * PCI Express Single Root I/O Virtualization capability support. 
+ */ + +#include +#include +#include +#include +#include +#include "pci.h" + + +#define iov_config_attr(field) \ +static ssize_t field##_show(struct device *dev, \ + struct device_attribute *attr, char *buf) \ +{ \ + struct pci_dev *pdev = to_pci_dev(dev); \ + return sprintf(buf, "%d\n", pdev->iov->field); \ +} + +iov_config_attr(status); +iov_config_attr(totalvfs); +iov_config_attr(initialvfs); +iov_config_attr(numvfs); + +static inline void vf_rid(struct pci_dev *dev, int vfn, u8 *busnr, u8 *devfn) +{ + u16 rid; + + rid = (dev->bus->number << 8) + dev->devfn + + dev->iov->offset + dev->iov->stride * vfn; + *busnr = rid >> 8; + *devfn = rid & 0xff; +} + +static int vf_add(struct pci_dev *dev, int vfn) +{ + int i; + int rc; + u8 busnr, devfn; + struct pci_dev *vf; + struct pci_bus *bus; + struct resource *res; + resource_size_t size; + + vf_rid(dev, vfn, &busnr, &devfn); + + vf = alloc_pci_dev(); + if (!vf) + return -ENOMEM; + + if (dev->bus->number == busnr) + vf->bus = bus = dev->bus; + else { + list_for_each_entry(bus, &dev->bus->children, node) + if (bus->number == busnr) { + vf->bus = bus; + break; + } + BUG_ON(!vf->bus); + } + + vf->sysdata = bus->sysdata; + vf->dev.parent = dev->dev.parent; + vf->dev.bus = dev->dev.bus; + vf->devfn = devfn; + vf->hdr_type = PCI_HEADER_TYPE_NORMAL; + vf->multifunction = 0; + vf->vendor = dev->vendor; + pci_read_config_word(dev, dev->iov->cap + PCI_IOV_VF_DID, &vf->device); + vf->cfg_size = PCI_CFG_SPACE_EXP_SIZE; + vf->error_state = pci_channel_io_normal; + vf->is_pcie = 1; + vf->pcie_type = PCI_EXP_TYPE_ENDPOINT; + vf->dma_mask = 0xffffffff; + + dev_set_name(&vf->dev, "%04x:%02x:%02x.%d", pci_domain_nr(bus), + busnr, PCI_SLOT(devfn), PCI_FUNC(devfn)); + + pci_read_config_byte(vf, PCI_REVISION_ID, &vf->revision); + vf->class = dev->class; + vf->current_state = PCI_UNKNOWN; + vf->irq = 0; + + for (i = 0; i < PCI_IOV_NUM_BAR; i++) { + res = dev->resource + PCI_IOV_RESOURCES + i; + if (!res->parent) + continue; + vf->resource[i].name = pci_name(vf); + vf->resource[i].flags = res->flags; + size = resource_size(res); + do_div(size, dev->iov->totalvfs); + vf->resource[i].start = res->start + size * vfn; + vf->resource[i].end = vf->resource[i].start + size - 1; + rc = request_resource(res, &vf->resource[i]); + BUG_ON(rc); + } + + vf->subsystem_vendor = dev->subsystem_vendor; + pci_read_config_word(vf, PCI_SUBSYSTEM_ID, &vf->subsystem_device); + + pci_device_add(vf, bus); + return pci_bus_add_device(vf); +} + +static void vf_remove(struct pci_dev *dev, int vfn) +{ + u8 busnr, devfn; + struct pci_dev *vf; + + vf_rid(dev, vfn, &busnr, &devfn); + + vf = pci_get_bus_and_slot(busnr, devfn); + if (!vf) + return; + + pci_dev_put(vf); + pci_remove_bus_device(vf); +} + +static int iov_enable(struct pci_dev *dev) +{ + int rc; + int i, j; + u16 ctrl; + struct pci_iov *iov = dev->iov; + + if (!iov->callback) + return -ENODEV; + + if (!iov->numvfs) + return -EINVAL; + + if (iov->status) + return 0; + + rc = iov->callback(dev, PCI_IOV_ENABLE); + if (rc) + return rc; + + pci_read_config_word(dev, iov->cap + PCI_IOV_CTRL, &ctrl); + ctrl |= (PCI_IOV_CTRL_VFE | PCI_IOV_CTRL_MSE); + pci_write_config_word(dev, iov->cap + PCI_IOV_CTRL, ctrl); + ssleep(1); + + for (i = 0; i < iov->numvfs; i++) { + rc = vf_add(dev, i); + if (rc) + goto failed; + } + + iov->status = 1; + return 0; + +failed: + for (j = 0; j < i; j++) + vf_remove(dev, j); + + pci_read_config_word(dev, iov->cap + PCI_IOV_CTRL, &ctrl); + ctrl &= ~(PCI_IOV_CTRL_VFE | PCI_IOV_CTRL_MSE); + pci_write_config_word(dev, 
iov->cap + PCI_IOV_CTRL, ctrl); + ssleep(1); + + return rc; +} + +static int iov_disable(struct pci_dev *dev) +{ + int i; + int rc; + u16 ctrl; + struct pci_iov *iov = dev->iov; + + if (!iov->callback) + return -ENODEV; + + if (!iov->status) + return 0; + + rc = iov->callback(dev, PCI_IOV_DISABLE); + if (rc) + return rc; + + for (i = 0; i < iov->numvfs; i++) + vf_remove(dev, i); + + pci_read_config_word(dev, iov->cap + PCI_IOV_CTRL, &ctrl); + ctrl &= ~(PCI_IOV_CTRL_VFE | PCI_IOV_CTRL_MSE); + pci_write_config_word(dev, iov->cap + PCI_IOV_CTRL, ctrl); + ssleep(1); + + iov->status = 0; + return 0; +} + +static int iov_set_numvfs(struct pci_dev *dev, int numvfs) +{ + int rc; + u16 offset, stride; + struct pci_iov *iov = dev->iov; + + if (!iov->callback) + return -ENODEV; + + if (numvfs < 0 || numvfs > iov->initialvfs || iov->status) + return -EINVAL; + + if (numvfs == iov->numvfs) + return 0; + + rc = iov->callback(dev, PCI_IOV_NUMVFS | iov->numvfs); + if (rc) + return rc; + + pci_write_config_word(dev, iov->cap + PCI_IOV_NUM_VF, numvfs); + pci_read_config_word(dev, iov->cap + PCI_IOV_VF_OFFSET, &offset); + pci_read_config_word(dev, iov->cap + PCI_IOV_VF_STRIDE, &stride); + if ((numvfs && !offset) || (numvfs > 1 && !stride)) + return -EIO; + + iov->offset = offset; + iov->stride = stride; + iov->numvfs = numvfs; + return 0; +} + +static ssize_t status_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + int rc; + long enable; + struct pci_dev *pdev = to_pci_dev(dev); + + rc = strict_strtol(buf, 0, &enable); + if (rc) + return rc; + + mutex_lock(&pdev->iov->ops_lock); + switch (enable) { + case 0: + rc = iov_disable(pdev); + break; + case 1: + rc = iov_enable(pdev); + break; + default: + rc = -EINVAL; + } + mutex_unlock(&pdev->iov->ops_lock); + + return rc ? rc : count; +} + +static ssize_t numvfs_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + int rc; + long numvfs; + struct pci_dev *pdev = to_pci_dev(dev); + + rc = strict_strtol(buf, 0, &numvfs); + if (rc) + return rc; + + mutex_lock(&pdev->iov->ops_lock); + rc = iov_set_numvfs(pdev, numvfs); + mutex_unlock(&pdev->iov->ops_lock); + + return rc ? 
rc : count; +} + +static DEVICE_ATTR(totalvfs, S_IRUGO, totalvfs_show, NULL); +static DEVICE_ATTR(initialvfs, S_IRUGO, initialvfs_show, NULL); +static DEVICE_ATTR(numvfs, S_IWUSR | S_IRUGO, numvfs_show, numvfs_store); +static DEVICE_ATTR(enable, S_IWUSR | S_IRUGO, status_show, status_store); + +static struct attribute *iov_attrs[] = { + &dev_attr_totalvfs.attr, + &dev_attr_initialvfs.attr, + &dev_attr_numvfs.attr, + &dev_attr_enable.attr, + NULL +}; + +static struct attribute_group iov_attr_group = { + .attrs = iov_attrs, + .name = "iov", +}; + +static int iov_alloc_bus(struct pci_bus *bus, int busnr) +{ + int i; + int rc; + struct pci_dev *dev; + struct pci_bus *child; + + list_for_each_entry(dev, &bus->devices, bus_list) + if (dev->iov) + break; + + BUG_ON(!dev->iov); + pci_dev_get(dev); + mutex_lock(&dev->iov->bus_lock); + + for (i = bus->number + 1; i <= busnr; i++) { + list_for_each_entry(child, &bus->children, node) + if (child->number == i) + break; + if (child->number == i) + continue; + child = pci_add_new_bus(bus, NULL, i); + if (!child) + return -ENOMEM; + + child->subordinate = i; + child->dev.parent = bus->bridge; + rc = pci_bus_add_child(child); + if (rc) + return rc; + } + + mutex_unlock(&dev->iov->bus_lock); + + return 0; +} + +static void iov_release_bus(struct pci_bus *bus) +{ + struct pci_dev *dev, *tmp; + struct pci_bus *child, *next; + + list_for_each_entry(dev, &bus->devices, bus_list) + if (dev->iov) + break; + + BUG_ON(!dev->iov); + mutex_lock(&dev->iov->bus_lock); + + list_for_each_entry(tmp, &bus->devices, bus_list) + if (tmp->iov && tmp->iov->callback) + goto done; + + list_for_each_entry_safe(child, next, &bus->children, node) + if (!child->bridge) + pci_remove_bus(child); +done: + mutex_unlock(&dev->iov->bus_lock); + pci_dev_put(dev); +} + +/** + * pci_iov_init - initialize device's SR-IOV capability + * @dev: the PCI device + * + * Returns 0 on success, or negative on failure. + * + * The major differences between Virtual Function and PCI device are: + * 1) the device with multiple bus numbers uses internal routing, so + * there is no explicit bridge device in this case. + * 2) Virtual Function memory spaces are designated by BARs encapsulated + * in the capability structure, and the BARs in Virtual Function PCI + * configuration space are read-only zero. + */ +int pci_iov_init(struct pci_dev *dev) +{ + int i; + int pos; + u32 pgsz; + u16 ctrl, total, initial, offset, stride; + struct pci_iov *iov; + struct resource *res; + + if (!dev->is_pcie || (dev->pcie_type != PCI_EXP_TYPE_RC_END && + dev->pcie_type != PCI_EXP_TYPE_ENDPOINT)) + return -ENODEV; + + pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_IOV); + if (!pos) + return -ENODEV; + + ctrl = pci_ari_enabled(dev) ? PCI_IOV_CTRL_ARI : 0; + pci_write_config_word(dev, pos + PCI_IOV_CTRL, ctrl); + ssleep(1); + + pci_read_config_word(dev, pos + PCI_IOV_TOTAL_VF, &total); + pci_read_config_word(dev, pos + PCI_IOV_INITIAL_VF, &initial); + pci_write_config_word(dev, pos + PCI_IOV_NUM_VF, initial); + pci_read_config_word(dev, pos + PCI_IOV_VF_OFFSET, &offset); + pci_read_config_word(dev, pos + PCI_IOV_VF_STRIDE, &stride); + if (!total || initial > total || (initial && !offset) || + (initial > 1 && !stride)) + return -EIO; + + pci_read_config_dword(dev, pos + PCI_IOV_SUP_PGSIZE, &pgsz); + i = PAGE_SHIFT > 12 ? 
PAGE_SHIFT - 12 : 0; + pgsz &= ~((1 << i) - 1); + if (!pgsz) + return -EIO; + + pgsz &= ~(pgsz - 1); + pci_write_config_dword(dev, pos + PCI_IOV_SYS_PGSIZE, pgsz); + + iov = kzalloc(sizeof(*iov), GFP_KERNEL); + if (!iov) + return -ENOMEM; + + iov->cap = pos; + iov->totalvfs = total; + iov->initialvfs = initial; + iov->offset = offset; + iov->stride = stride; + iov->align = pgsz << 12; + + for (i = 0; i < PCI_IOV_NUM_BAR; i++) { + res = dev->resource + PCI_IOV_RESOURCES + i; + pos = iov->cap + PCI_IOV_BAR_0 + i * 4; + i += __pci_read_base(dev, pci_bar_unknown, res, pos); + if (!res->flags) + continue; + res->flags &= ~IORESOURCE_SIZEALIGN; + res->end = res->start + resource_size(res) * total - 1; + } + + mutex_init(&iov->ops_lock); + mutex_init(&iov->bus_lock); + + dev->iov = iov; + + return 0; +} + +/** + * pci_iov_release - release resources used by SR-IOV capability + * @dev: the PCI device + */ +void pci_iov_release(struct pci_dev *dev) +{ + if (!dev->iov) + return; + + mutex_destroy(&dev->iov->ops_lock); + mutex_destroy(&dev->iov->bus_lock); + kfree(dev->iov); + dev->iov = NULL; +} + +/** + * pci_iov_create_sysfs - create sysfs for SR-IOV capability + * @dev: the PCI device + */ +void pci_iov_create_sysfs(struct pci_dev *dev) +{ + if (!dev->iov) + return; + + sysfs_create_group(&dev->dev.kobj, &iov_attr_group); +} + +/** + * pci_iov_remove_sysfs - remove sysfs of SR-IOV capability + * @dev: the PCI device + */ +void pci_iov_remove_sysfs(struct pci_dev *dev) +{ + if (!dev->iov) + return; + + sysfs_remove_group(&dev->dev.kobj, &iov_attr_group); +} + +int pci_iov_resource_align(struct pci_dev *dev, int resno) +{ + if (resno < PCI_IOV_RESOURCES || resno > PCI_IOV_RESOURCES_END) + return 0; + + BUG_ON(!dev->iov); + + return dev->iov->align; +} + +int pci_iov_resource_bar(struct pci_dev *dev, int resno, + enum pci_bar_type *type) +{ + if (resno < PCI_IOV_RESOURCES || resno > PCI_IOV_RESOURCES_END) + return 0; + + BUG_ON(!dev->iov); + + *type = pci_bar_unknown; + return dev->iov->cap + PCI_IOV_BAR_0 + + 4 * (resno - PCI_IOV_RESOURCES); +} + +/** + * pci_iov_register - register SR-IOV service + * @dev: the PCI device + * @callback: callback function for SR-IOV events + * + * Returns 0 on success, or negative on failure. + */ +int pci_iov_register(struct pci_dev *dev, + int (*callback)(struct pci_dev *, u32)) +{ + u8 busnr, devfn; + struct pci_iov *iov = dev->iov; + + if (!iov) + return -ENODEV; + + if (!callback || iov->callback) + return -EINVAL; + + vf_rid(dev, iov->totalvfs - 1, &busnr, &devfn); + if (busnr > dev->bus->subordinate) + return -EIO; + + iov->callback = callback; + return iov_alloc_bus(dev->bus, busnr); +} +EXPORT_SYMBOL_GPL(pci_iov_register); + +/** + * pci_iov_unregister - unregister SR-IOV service + * @dev: the PCI device + */ +void pci_iov_unregister(struct pci_dev *dev) +{ + struct pci_iov *iov = dev->iov; + + if (!iov || !iov->callback) + return; + + iov->callback = NULL; + iov_release_bus(dev->bus); +} +EXPORT_SYMBOL_GPL(pci_iov_unregister); + +/** + * pci_iov_enable - enable SR-IOV capability + * @dev: the PCI device + * @numvfs: number of VFs to be available + * + * Returns 0 on success, or negative on failure. 
+ */ +int pci_iov_enable(struct pci_dev *dev, int numvfs) +{ + int rc; + struct pci_iov *iov = dev->iov; + + if (!iov) + return -ENODEV; + + if (!iov->callback) + return -EINVAL; + + mutex_lock(&iov->ops_lock); + rc = iov_set_numvfs(dev, numvfs); + if (rc) + goto done; + rc = iov_enable(dev); +done: + mutex_unlock(&iov->ops_lock); + + return rc; +} +EXPORT_SYMBOL_GPL(pci_iov_enable); + +/** + * pci_iov_disable - disable SR-IOV capability + * @dev: the PCI device + * + * Should be called upon Physical Function driver removal, and power + * state change. All previous allocated Virtual Functions are reclaimed. + */ +void pci_iov_disable(struct pci_dev *dev) +{ + struct pci_iov *iov = dev->iov; + + if (!iov || !iov->callback) + return; + + mutex_lock(&iov->ops_lock); + iov_disable(dev); + mutex_unlock(&iov->ops_lock); +} +EXPORT_SYMBOL_GPL(pci_iov_disable); diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c index 5c456ab..18881f2 100644 --- a/drivers/pci/pci-sysfs.c +++ b/drivers/pci/pci-sysfs.c @@ -847,6 +847,9 @@ static int pci_create_capabilities_sysfs(struct pci_dev *dev) /* Active State Power Management */ pcie_aspm_create_sysfs_dev_files(dev); + /* Single Root I/O Virtualization */ + pci_iov_create_sysfs(dev); + return 0; } @@ -932,6 +935,7 @@ static void pci_remove_capabilities_sysfs(struct pci_dev *dev) } pcie_aspm_remove_sysfs_dev_files(dev); + pci_iov_remove_sysfs(dev); } /** diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 11ecd6f..10a43b2 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -1936,6 +1936,13 @@ int pci_resource_alignment(struct pci_dev *dev, int resno) if (align) return align > bios_align ? align : bios_align; + if (resno > PCI_ROM_RESOURCE && resno < PCI_BRIDGE_RESOURCES) { + /* device specific resource */ + align = pci_iov_resource_align(dev, resno); + if (align) + return align > bios_align ? 
align : bios_align; + } + dev_err(&dev->dev, "alignment: invalid resource #%d\n", resno); return 0; } @@ -1950,12 +1957,19 @@ int pci_resource_alignment(struct pci_dev *dev, int resno) */ int pci_resource_bar(struct pci_dev *dev, int resno, enum pci_bar_type *type) { + int reg; + if (resno < PCI_ROM_RESOURCE) { *type = pci_bar_unknown; return PCI_BASE_ADDRESS_0 + 4 * resno; } else if (resno == PCI_ROM_RESOURCE) { *type = pci_bar_mem32; return dev->rom_base_reg; + } else if (resno < PCI_BRIDGE_RESOURCES) { + /* device specific resource */ + reg = pci_iov_resource_bar(dev, resno, type); + if (reg) + return reg; } dev_err(&dev->dev, "BAR: invalid resource #%d\n", resno); diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index d707477..7735d92 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -181,4 +181,52 @@ static inline int pci_ari_enabled(struct pci_dev *dev) return dev->ari_enabled; } +/* Single Root I/O Virtualization */ +struct pci_iov { + int cap; /* capability position */ + int align; /* page size used to map memory space */ + int status; /* status of SR-IOV */ + u16 totalvfs; /* total VFs associated with the PF */ + u16 initialvfs; /* initial VFs associated with the PF */ + u16 numvfs; /* number of VFs available */ + u16 offset; /* first VF Routing ID offset */ + u16 stride; /* following VF stride */ + struct mutex ops_lock; /* lock for SR-IOV operations */ + struct mutex bus_lock; /* lock for VF bus */ + int (*callback)(struct pci_dev *, u32); /* event callback function */ +}; + +#ifdef CONFIG_PCI_IOV +extern int pci_iov_init(struct pci_dev *dev); +extern void pci_iov_release(struct pci_dev *dev); +void pci_iov_create_sysfs(struct pci_dev *dev); +void pci_iov_remove_sysfs(struct pci_dev *dev); +extern int pci_iov_resource_align(struct pci_dev *dev, int resno); +extern int pci_iov_resource_bar(struct pci_dev *dev, int resno, + enum pci_bar_type *type); +#else +static inline int pci_iov_init(struct pci_dev *dev) +{ + return -EIO; +} +static inline void pci_iov_release(struct pci_dev *dev) +{ +} +static inline void pci_iov_create_sysfs(struct pci_dev *dev) +{ +} +static inline void pci_iov_remove_sysfs(struct pci_dev *dev) +{ +} +static inline int pci_iov_resource_align(struct pci_dev *dev, int resno) +{ + return 0; +} +static inline int pci_iov_resource_bar(struct pci_dev *dev, int resno, + enum pci_bar_type *type) +{ + return 0; +} +#endif /* CONFIG_PCI_IOV */ + #endif /* DRIVERS_PCI_H */ diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index 4b12b58..18ce9c0 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -779,6 +779,7 @@ static int pci_setup_device(struct pci_dev * dev) static void pci_release_capabilities(struct pci_dev *dev) { pci_vpd_release(dev); + pci_iov_release(dev); } /** @@ -962,6 +963,9 @@ static void pci_init_capabilities(struct pci_dev *dev) /* Alternative Routing-ID Forwarding */ pci_enable_ari(dev); + + /* Single Root I/O Virtualization */ + pci_iov_init(dev); } void pci_device_add(struct pci_dev *dev, struct pci_bus *bus) diff --git a/include/linux/pci.h b/include/linux/pci.h index 80d88f8..77af7e0 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -87,6 +87,12 @@ enum { /* #6: expansion ROM */ PCI_ROM_RESOURCE, + /* device specific resources */ +#ifdef CONFIG_PCI_IOV + PCI_IOV_RESOURCES, + PCI_IOV_RESOURCES_END = PCI_IOV_RESOURCES + PCI_IOV_NUM_BAR - 1, +#endif + /* address space assigned to buses behind the bridge */ #ifndef PCI_BRIDGE_RES_NUM #define PCI_BRIDGE_RES_NUM 4 @@ -165,6 +171,7 @@ struct pci_cap_saved_state { 
struct pcie_link_state; struct pci_vpd; +struct pci_iov; /* * The pci_dev structure is used to describe PCI devices. @@ -253,6 +260,7 @@ struct pci_dev { struct list_head msi_list; #endif struct pci_vpd *vpd; + struct pci_iov *iov; }; extern struct pci_dev *alloc_pci_dev(void); @@ -1147,5 +1155,36 @@ static inline void * pci_ioremap_bar(struct pci_dev *pdev, int bar) } #endif +/* SR-IOV events masks */ +#define PCI_IOV_NUM_VIRTFN 0x0000FFFFU /* NumVFs to be set */ +/* SR-IOV events values */ +#define PCI_IOV_ENABLE 0x00010000U /* SR-IOV enable request */ +#define PCI_IOV_DISABLE 0x00020000U /* SR-IOV disable request */ +#define PCI_IOV_NUMVFS 0x00040000U /* SR-IOV disable request */ + +#ifdef CONFIG_PCI_IOV +extern int pci_iov_enable(struct pci_dev *dev, int numvfs); +extern void pci_iov_disable(struct pci_dev *dev); +extern int pci_iov_register(struct pci_dev *dev, + int (*callback)(struct pci_dev *dev, u32 event)); +extern void pci_iov_unregister(struct pci_dev *dev); +#else +static inline int pci_iov_enable(struct pci_dev *dev, int numvfs) +{ + return -EIO; +} +static inline void pci_iov_disable(struct pci_dev *dev) +{ +} +static inline int pci_iov_register(struct pci_dev *dev, + int (*callback)(struct pci_dev *dev, u32 event)) +{ + return -EIO; +} +static inline void pci_iov_unregister(struct pci_dev *dev) +{ +} +#endif /* CONFIG_PCI_IOV */ + #endif /* __KERNEL__ */ #endif /* LINUX_PCI_H */ diff --git a/include/linux/pci_regs.h b/include/linux/pci_regs.h index eb6686b..1b28b3f 100644 --- a/include/linux/pci_regs.h +++ b/include/linux/pci_regs.h @@ -363,6 +363,7 @@ #define PCI_EXP_TYPE_UPSTREAM 0x5 /* Upstream Port */ #define PCI_EXP_TYPE_DOWNSTREAM 0x6 /* Downstream Port */ #define PCI_EXP_TYPE_PCI_BRIDGE 0x7 /* PCI/PCI-X Bridge */ +#define PCI_EXP_TYPE_RC_END 0x9 /* Root Complex Integrated Endpoint */ #define PCI_EXP_FLAGS_SLOT 0x0100 /* Slot implemented */ #define PCI_EXP_FLAGS_IRQ 0x3e00 /* Interrupt message number */ #define PCI_EXP_DEVCAP 4 /* Device capabilities */ @@ -434,6 +435,7 @@ #define PCI_EXT_CAP_ID_DSN 3 #define PCI_EXT_CAP_ID_PWR 4 #define PCI_EXT_CAP_ID_ARI 14 +#define PCI_EXT_CAP_ID_IOV 16 /* Advanced Error Reporting */ #define PCI_ERR_UNCOR_STATUS 4 /* Uncorrectable Error Status */ @@ -551,4 +553,23 @@ #define PCI_ARI_CTRL_ACS 0x0002 /* ACS Function Groups Enable */ #define PCI_ARI_CTRL_FG(x) (((x) >> 4) & 7) /* Function Group */ +/* Single Root I/O Virtualization */ +#define PCI_IOV_CAP 0x04 /* SR-IOV Capabilities */ +#define PCI_IOV_CTRL 0x08 /* SR-IOV Control */ +#define PCI_IOV_CTRL_VFE 0x01 /* VF Enable */ +#define PCI_IOV_CTRL_MSE 0x08 /* VF Memory Space Enable */ +#define PCI_IOV_CTRL_ARI 0x10 /* ARI Capable Hierarchy */ +#define PCI_IOV_STATUS 0x0a /* SR-IOV Status */ +#define PCI_IOV_INITIAL_VF 0x0c /* Initial VFs */ +#define PCI_IOV_TOTAL_VF 0x0e /* Total VFs */ +#define PCI_IOV_NUM_VF 0x10 /* Number of VFs */ +#define PCI_IOV_FUNC_LINK 0x12 /* Function Dependency Link */ +#define PCI_IOV_VF_OFFSET 0x14 /* First VF Offset */ +#define PCI_IOV_VF_STRIDE 0x16 /* Following VF Stride */ +#define PCI_IOV_VF_DID 0x1a /* VF Device ID */ +#define PCI_IOV_SUP_PGSIZE 0x1c /* Supported Page Sizes */ +#define PCI_IOV_SYS_PGSIZE 0x20 /* System Page Size */ +#define PCI_IOV_BAR_0 0x24 /* VF BAR0 */ +#define PCI_IOV_NUM_BAR 6 /* Number of VF BARs */ + #endif /* LINUX_PCI_REGS_H */ -- 1.5.6.4 From yu.zhao at intel.com Wed Oct 22 01:44:38 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Wed, 22 Oct 2008 16:44:38 +0800 Subject: [PATCH 13/16 v6] PCI: reserve bus range for 
SR-IOV device In-Reply-To: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> References: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> Message-ID: <20081022084438.GM3773@yzhao12-linux.sh.intel.com> Reserve bus range for SR-IOV at device scanning stage. Cc: Alex Chiang Cc: Grant Grundler Cc: Greg KH Cc: Ingo Molnar Cc: Jesse Barnes Cc: Matthew Wilcox Cc: Randy Dunlap Cc: Roland Dreier Signed-off-by: Yu Zhao --- drivers/pci/iov.c | 24 ++++++++++++++++++++++++ drivers/pci/pci.h | 5 +++++ drivers/pci/probe.c | 3 +++ 3 files changed, 32 insertions(+), 0 deletions(-) diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c index dd299aa..c86bd54 100644 --- a/drivers/pci/iov.c +++ b/drivers/pci/iov.c @@ -498,6 +498,30 @@ int pci_iov_resource_bar(struct pci_dev *dev, int resno, } /** + * pci_iov_bus_range - find bus range used by SR-IOV capability + * @bus: the PCI bus + * + * Returns max number of buses (exclude current one) used by Virtual + * Functions. + */ +int pci_iov_bus_range(struct pci_bus *bus) +{ + int max = 0; + u8 busnr, devfn; + struct pci_dev *dev; + + list_for_each_entry(dev, &bus->devices, bus_list) { + if (!dev->iov) + continue; + vf_rid(dev, dev->iov->totalvfs - 1, &busnr, &devfn); + if (busnr > max) + max = busnr; + } + + return max ? max - bus->number : 0; +} + +/** * pci_iov_register - register SR-IOV service * @dev: the PCI device * @callback: callback function for SR-IOV events diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 7735d92..5206ae7 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -204,6 +204,7 @@ void pci_iov_remove_sysfs(struct pci_dev *dev); extern int pci_iov_resource_align(struct pci_dev *dev, int resno); extern int pci_iov_resource_bar(struct pci_dev *dev, int resno, enum pci_bar_type *type); +extern int pci_iov_bus_range(struct pci_bus *bus); #else static inline int pci_iov_init(struct pci_dev *dev) { @@ -227,6 +228,10 @@ static inline int pci_iov_resource_bar(struct pci_dev *dev, int resno, { return 0; } +extern inline int pci_iov_bus_range(struct pci_bus *bus) +{ + return 0; +} #endif /* CONFIG_PCI_IOV */ #endif /* DRIVERS_PCI_H */ diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index 18ce9c0..50a1380 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -1068,6 +1068,9 @@ unsigned int __devinit pci_scan_child_bus(struct pci_bus *bus) for (devfn = 0; devfn < 0x100; devfn += 8) pci_scan_slot(bus, devfn); + /* Reserve buses for SR-IOV capability. */ + max += pci_iov_bus_range(bus); + /* * After performing arch-dependent fixup of the bus, look behind * all PCI-to-PCI bridges on this bus. -- 1.5.6.4 From yu.zhao at intel.com Wed Oct 22 01:45:00 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Wed, 22 Oct 2008 16:45:00 +0800 Subject: [PATCH 14/16 v6] PCI: document for SR-IOV user and developer In-Reply-To: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> References: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> Message-ID: <20081022084500.GN3773@yzhao12-linux.sh.intel.com> Create HOW-TO for SR-IOV user and driver developer. 
Cc: Alex Chiang Cc: Grant Grundler Cc: Greg KH Cc: Ingo Molnar Cc: Jesse Barnes Cc: Matthew Wilcox Cc: Randy Dunlap Cc: Roland Dreier Signed-off-by: Yu Zhao --- Documentation/DocBook/kernel-api.tmpl | 1 + Documentation/PCI/pci-iov-howto.txt | 181 +++++++++++++++++++++++++++ 2 files changed, 182 insertions(+), 0 deletions(-) create mode 100644 Documentation/PCI/pci-iov-howto.txt diff --git a/Documentation/DocBook/kernel-api.tmpl b/Documentation/DocBook/kernel-api.tmpl index 9d0058e..9a15c50 100644 --- a/Documentation/DocBook/kernel-api.tmpl +++ b/Documentation/DocBook/kernel-api.tmpl @@ -251,6 +251,7 @@ X!Edrivers/pci/hotplug.c --> !Edrivers/pci/probe.c !Edrivers/pci/rom.c +!Edrivers/pci/iov.c PCI Hotplug Support Library !Edrivers/pci/hotplug/pci_hotplug_core.c diff --git a/Documentation/PCI/pci-iov-howto.txt b/Documentation/PCI/pci-iov-howto.txt new file mode 100644 index 0000000..5632723 --- /dev/null +++ b/Documentation/PCI/pci-iov-howto.txt @@ -0,0 +1,181 @@ + PCI Express Single Root I/O Virtualization HOWTO + Copyright (C) 2008 Intel Corporation + + +1. Overview + +1.1 What is SR-IOV + +Single Root I/O Virtualization (SR-IOV) is a PCI Express Extended +capability which makes one physical device appear as multiple virtual +devices. The physical device is referred to as the Physical Function while +the virtual devices are referred to as Virtual Functions. Allocation +of Virtual Functions can be dynamically controlled by the Physical Function +via registers encapsulated in the capability. By default, this feature +is not enabled and the Physical Function behaves as a traditional PCIe +device. Once it is turned on, each Virtual Function's PCI configuration +space can be accessed using its own Bus, Device and Function Number (Routing +ID). Each Virtual Function also has PCI Memory Space, which is used +to map its register set. The Virtual Function device driver operates on this +register set so the Virtual Function can work and appear as a real PCI +device. + +2. User Guide + +2.1 How can I manage SR-IOV + +If a device supports SR-IOV, then there should be some entries under the +Physical Function's PCI device directory. These entries are in the directory: + - /sys/bus/pci/devices/XXXX:BB:DD.F/iov/ + (XXXX:BB:DD.F is the domain, bus, device and function number) + +To enable or disable SR-IOV: + - /sys/bus/pci/devices/XXXX:BB:DD.F/iov/enable + (writing 1/0 enables/disables the VFs; the state change will + notify the PF driver) + +To change the number of Virtual Functions: + - /sys/bus/pci/devices/XXXX:BB:DD.F/iov/numvfs + (writing a positive integer to this file will change NumVFs) + +The total and initial number of VFs can be read from: + - /sys/bus/pci/devices/XXXX:BB:DD.F/iov/totalvfs + - /sys/bus/pci/devices/XXXX:BB:DD.F/iov/initialvfs + +2.2 How can I use Virtual Functions + +Virtual Functions are treated as hot-plugged PCI devices in the kernel, +so they should be able to work in the same way as real PCI devices. +NOTE: the Virtual Function device driver must be loaded for the VF to work. + + +3. Developer Guide + +3.1 SR-IOV APIs + +To register the SR-IOV service, the Physical Function device driver needs to call: + int pci_iov_register(struct pci_dev *dev, + int (*callback)(struct pci_dev *, u32)) + The 'callback' is a callback function that the SR-IOV code will invoke + when events related to VFs happen (e.g. the user enables/disables VFs). + The first argument is the PF itself; the second argument carries the event type and + value.
For now, the following event types are supported: + - PCI_IOV_ENABLE: SR-IOV enable request + - PCI_IOV_DISABLE: SR-IOV disable request + - PCI_IOV_NUMVFS: changing Number of VFs request + Event values can be extracted using the following masks: + - PCI_IOV_NUM_VIRTFN: Number of Virtual Functions + +To unregister the SR-IOV service, the Physical Function device driver needs to call: + void pci_iov_unregister(struct pci_dev *dev) + +To enable SR-IOV, the Physical Function device driver needs to call: + int pci_iov_enable(struct pci_dev *dev, int numvfs) + 'numvfs' is the number of VFs that the PF wants to enable. + +To disable SR-IOV, the Physical Function device driver needs to call: + void pci_iov_disable(struct pci_dev *dev) + +Note: the above two functions sleep for 1 second to wait for hardware transaction +completion, as required by the SR-IOV specification. + +3.2 Usage example + +The following piece of code illustrates the usage of the APIs above. + +static int callback(struct pci_dev *dev, u32 event) +{ + int numvfs; + + if (event & PCI_IOV_ENABLE) { + /* + * request to enable SR-IOV. + * Note: if the PF driver wants to support PM, it has + * to check the device power state here to see if this + * request is allowed or not. + */ + ... + + } else if (event & PCI_IOV_DISABLE) { + /* + * request to disable SR-IOV. + */ + ... + + } else if (event & PCI_IOV_NUMVFS) { + /* + * request to change NumVFs. + */ + numvfs = event & PCI_IOV_NUM_VIRTFN; + ... + + } else + return -EINVAL; + + return 0; +} + +static int __devinit dev_probe(struct pci_dev *dev, + const struct pci_device_id *id) +{ + int err; + int numvfs; + + ... + err = pci_iov_register(dev, callback); + ... + err = pci_iov_enable(dev, numvfs); + ... + + return err; +} + +static void __devexit dev_remove(struct pci_dev *dev) +{ + ... + pci_iov_disable(dev); + ... + pci_iov_unregister(dev); + ... +} + +#ifdef CONFIG_PM +/* + * If the Physical Function supports power management, then + * SR-IOV needs to be disabled before the adapter goes to sleep, + * because Virtual Functions will not work when the adapter is in + * the power-saving mode. + * SR-IOV can be enabled again after the adapter wakes up. + */ +static int dev_suspend(struct pci_dev *dev, pm_message_t state) +{ + ... + pci_iov_disable(dev); + ... + + return 0; +} + +static int dev_resume(struct pci_dev *dev) +{ + int err; + int numvfs; + + ... + err = pci_iov_enable(dev, numvfs); + ... + + return 0; +} +#endif + +static struct pci_driver dev_driver = { + .name = "SR-IOV Physical Function driver", + .id_table = dev_id_table, + .probe = dev_probe, + .remove = __devexit_p(dev_remove), +#ifdef CONFIG_PM + .suspend = dev_suspend, + .resume = dev_resume, +#endif +}; -- 1.5.6.4 From yu.zhao at intel.com Wed Oct 22 01:45:31 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Wed, 22 Oct 2008 16:45:31 +0800 Subject: [PATCH 16/16 v6] PCI: document the new PCI boot parameters In-Reply-To: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> References: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> Message-ID: <20081022084531.GP3773@yzhao12-linux.sh.intel.com> Document the new PCI[x86] boot parameters.
Cc: Alex Chiang Cc: Grant Grundler Cc: Greg KH Cc: Ingo Molnar Cc: Jesse Barnes Cc: Matthew Wilcox Cc: Randy Dunlap Cc: Roland Dreier Signed-off-by: Yu Zhao --- Documentation/kernel-parameters.txt | 10 ++++++++++ 1 files changed, 10 insertions(+), 0 deletions(-) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 53ba7c7..5482ae0 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -1677,6 +1677,16 @@ and is between 256 and 4096 characters. It is defined in the file cbmemsize=nn[KMG] The fixed amount of bus space which is reserved for the CardBus bridge's memory window. The default value is 64 megabytes. + assign-mmio=[dddd:]bb [X86] reassign memory resources of all + devices under bus [dddd:]bb (dddd is the domain + number and bb is the bus number). + assign-pio=[dddd:]bb [X86] reassign io port resources of all + devices under bus [dddd:]bb (dddd is the domain + number and bb is the bus number). + align-mmio=[dddd:]bb:dd.f [X86] relocate memory resources of a + device to minimum PAGE_SIZE alignment (dddd is + the domain number and bb, dd and f is the bus, + device and function number). pcie_aspm= [PCIE] Forcibly enable or disable PCIe Active State Power Management. -- 1.5.6.4 From yu.zhao at intel.com Wed Oct 22 01:45:15 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Wed, 22 Oct 2008 16:45:15 +0800 Subject: [PATCH 15/16 v6] PCI: document the SR-IOV sysfs entries In-Reply-To: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> References: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> Message-ID: <20081022084515.GO3773@yzhao12-linux.sh.intel.com> Document the SR-IOV sysfs entries. Cc: Alex Chiang Cc: Grant Grundler Cc: Greg KH Cc: Ingo Molnar Cc: Jesse Barnes Cc: Matthew Wilcox Cc: Randy Dunlap Cc: Roland Dreier Signed-off-by: Yu Zhao --- Documentation/ABI/testing/sysfs-bus-pci | 33 +++++++++++++++++++++++++++++++ 1 files changed, 33 insertions(+), 0 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci index ceddcff..41cce8f 100644 --- a/Documentation/ABI/testing/sysfs-bus-pci +++ b/Documentation/ABI/testing/sysfs-bus-pci @@ -9,3 +9,36 @@ Description: that some devices may have malformatted data. If the underlying VPD has a writable section then the corresponding section of this file will be writable. + +What: /sys/bus/pci/devices/.../iov/enable +Date: October 2008 +Contact: Yu Zhao +Description: + This file appears when a device has the SR-IOV capability. + It holds the status of the capability, and could be written + (0/1) to disable and enable the capability if the PF driver + supports this operation. + +What: /sys/bus/pci/devices/.../iov/initialvfs +Date: October 2008 +Contact: Yu Zhao +Description: + This file appears when a device has the SR-IOV capability. + It holds the number of initial Virtual Functions (read-only). + +What: /sys/bus/pci/devices/.../iov/totalvfs +Date: October 2008 +Contact: Yu Zhao +Description: + This file appears when a device has the SR-IOV capability. + It holds the number of total Virtual Functions (read-only). + + +What: /sys/bus/pci/devices/.../iov/numvfs +Date: October 2008 +Contact: Yu Zhao +Description: + This file appears when a device has the SR-IOV capability. + It holds the number of available Virtual Functions, and + could be written (1 ~ InitialVFs) to change the number if + the PF driver supports this operation. 
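For illustration, a minimal user-space C sketch that drives these attributes; the PF address 0000:01:00.0 is hypothetical and a PF driver that supports both operations is assumed:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical Physical Function; adjust to the real device address. */
#define PF_IOV "/sys/bus/pci/devices/0000:01:00.0/iov/"

static int write_attr(const char *name, const char *val)
{
	char path[128];
	ssize_t len = strlen(val);
	int fd, ret;

	snprintf(path, sizeof(path), PF_IOV "%s", name);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	ret = (write(fd, val, len) == len) ? 0 : -1;
	close(fd);
	return ret;
}

int main(void)
{
	/* Set NumVFs first (only allowed while SR-IOV is disabled), then enable. */
	if (write_attr("numvfs", "2") || write_attr("enable", "1")) {
		perror("SR-IOV setup");
		return 1;
	}
	return 0;
}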
-- 1.5.6.4 From ryov at valinux.co.jp Wed Oct 22 03:38:31 2008 From: ryov at valinux.co.jp (Ryo Tsuruta) Date: Wed, 22 Oct 2008 19:38:31 +0900 (JST) Subject: [PATCH 0/2] dm-ioband: I/O bandwidth controller v1.8.0: Introduction In-Reply-To: <48FDB8AC.9020707@windriver.com> References: <20081017.160950.71109894.ryov@valinux.co.jp> <48FDB8AC.9020707@windriver.com> Message-ID: <20081022.193831.193716981.ryov@valinux.co.jp> Hi Chen, > I applied your patches(both) into the latest kernel(27), and dm-ioband > looks work well(other than schedule_timeout in alloc_ioband_device); > But I think you are the author of bio_tracking, so it is high > appreciated if you can give your comments and advices for potential > difference between 27-rc5-mm1 and 2.6.27 to me. There is no major difference between both version, but I have no time to port and test bio-cgroup to/on 2.6.27. Kamezawa-san said > the newest mmotm has the newest *big* change. enjoy it ;) The next bio-cgroup patch will be based on this newest change. Thanks, Ryo Tsuruta From bjorn.helgaas at hp.com Wed Oct 22 07:24:19 2008 From: bjorn.helgaas at hp.com (Bjorn Helgaas) Date: Wed, 22 Oct 2008 08:24:19 -0600 Subject: [PATCH 2/16 v6] PCI: define PCI resource names in an 'enum' In-Reply-To: <20081022084041.GB3773@yzhao12-linux.sh.intel.com> References: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> <20081022084041.GB3773@yzhao12-linux.sh.intel.com> Message-ID: <200810220824.21356.bjorn.helgaas@hp.com> On Wednesday 22 October 2008 02:40:41 am Yu Zhao wrote: > This patch moves all definitions of the PCI resource names to an 'enum', > and also replaces some hard-coded resource variables with symbol > names. This change eases introduction of device specific resources. Thanks for removing a bunch of magic numbers from the code. > static void > pci_restore_bars(struct pci_dev *dev) > { > - int i, numres; > - > - switch (dev->hdr_type) { > - case PCI_HEADER_TYPE_NORMAL: > - numres = 6; > - break; > - case PCI_HEADER_TYPE_BRIDGE: > - numres = 2; > - break; > - case PCI_HEADER_TYPE_CARDBUS: > - numres = 1; > - break; > - default: > - /* Should never get here, but just in case... */ > - return; > - } > + int i; > > - for (i = 0; i < numres; i++) > + for (i = 0; i < PCI_BRIDGE_RESOURCES; i++) > pci_update_resource(dev, i); > } The behavior of this function used to depend on dev->hdr_type. Now we don't look at hdr_type at all, so we do the same thing for all devices. For example, for a CardBus device, we used to call pci_update_resource() only for BAR 0; now we call it for BARs 0-6. Maybe this is safe, but I can't tell from the patch, so I think you should explain *why* it's safe in the changelog. > +/* > + * For PCI devices, the region numbers are assigned this way: > + */ > +enum { > + /* #0-5: standard PCI regions */ > + PCI_STD_RESOURCES, > + PCI_STD_RESOURCES_END = 5, > + > + /* #6: expansion ROM */ > + PCI_ROM_RESOURCE, > + > + /* address space assigned to buses behind the bridge */ > +#ifndef PCI_BRIDGE_RES_NUM > +#define PCI_BRIDGE_RES_NUM 4 > +#endif > + PCI_BRIDGE_RESOURCES, > + PCI_BRIDGE_RES_END = PCI_BRIDGE_RESOURCES + PCI_BRIDGE_RES_NUM - 1, Since you used "PCI_STD_RESOURCES_END" above, maybe you should use "PCI_BRIDGE_RESOURCES_END" instead of "PCI_BRIDGE_RES_END". 
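For illustration, a sketch of the same enum with the suffix made consistent (only the bridge entries renamed, everything else as in the patch):

enum {
	/* #0-5: standard PCI regions */
	PCI_STD_RESOURCES,
	PCI_STD_RESOURCES_END = 5,

	/* #6: expansion ROM */
	PCI_ROM_RESOURCE,

	/* address space assigned to buses behind the bridge */
#ifndef PCI_BRIDGE_RES_NUM
#define PCI_BRIDGE_RES_NUM 4
#endif
	PCI_BRIDGE_RESOURCES,
	PCI_BRIDGE_RESOURCES_END = PCI_BRIDGE_RESOURCES + PCI_BRIDGE_RES_NUM - 1,
};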
Bjorn From bjorn.helgaas at hp.com Wed Oct 22 07:27:32 2008 From: bjorn.helgaas at hp.com (Bjorn Helgaas) Date: Wed, 22 Oct 2008 08:27:32 -0600 Subject: [PATCH 16/16 v6] PCI: document the new PCI boot parameters In-Reply-To: <20081022084531.GP3773@yzhao12-linux.sh.intel.com> References: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> <20081022084531.GP3773@yzhao12-linux.sh.intel.com> Message-ID: <200810220827.33758.bjorn.helgaas@hp.com> On Wednesday 22 October 2008 02:45:31 am Yu Zhao wrote: > Document the new PCI[x86] boot parameters. > > Cc: Alex Chiang > Cc: Grant Grundler > Cc: Greg KH > Cc: Ingo Molnar > Cc: Jesse Barnes > Cc: Matthew Wilcox > Cc: Randy Dunlap > Cc: Roland Dreier > Signed-off-by: Yu Zhao > > --- > Documentation/kernel-parameters.txt | 10 ++++++++++ > 1 files changed, 10 insertions(+), 0 deletions(-) > > diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt > index 53ba7c7..5482ae0 100644 > --- a/Documentation/kernel-parameters.txt > +++ b/Documentation/kernel-parameters.txt > @@ -1677,6 +1677,16 @@ and is between 256 and 4096 characters. It is defined in the file > cbmemsize=nn[KMG] The fixed amount of bus space which is > reserved for the CardBus bridge's memory > window. The default value is 64 megabytes. > + assign-mmio=[dddd:]bb [X86] reassign memory resources of all > + devices under bus [dddd:]bb (dddd is the domain > + number and bb is the bus number). > + assign-pio=[dddd:]bb [X86] reassign io port resources of all > + devices under bus [dddd:]bb (dddd is the domain > + number and bb is the bus number). > + align-mmio=[dddd:]bb:dd.f [X86] relocate memory resources of a > + device to minimum PAGE_SIZE alignment (dddd is > + the domain number and bb, dd and f is the bus, > + device and function number). > > pcie_aspm= [PCIE] Forcibly enable or disable PCIe Active State Power > Management. I think it's nicer to have the documentation change included in the patch that implements the change. For example, I think this and patch 9/16 "add boot option to align ..." should be folded into a single patch. And similarly for the other documentation patches. Bjorn From bjorn.helgaas at hp.com Wed Oct 22 07:34:05 2008 From: bjorn.helgaas at hp.com (Bjorn Helgaas) Date: Wed, 22 Oct 2008 08:34:05 -0600 Subject: [PATCH 9/16 v6] PCI: add boot option to align MMIO resources In-Reply-To: <20081022084324.GI3773@yzhao12-linux.sh.intel.com> References: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> <20081022084324.GI3773@yzhao12-linux.sh.intel.com> Message-ID: <200810220834.07222.bjorn.helgaas@hp.com> On Wednesday 22 October 2008 02:43:24 am Yu Zhao wrote: > This patch adds boot option to align MMIO resource for a device. > The alignment is a bigger value between the PAGE_SIZE and the > resource size. It looks like this forces alignment on PAGE_SIZE, not "a bigger value between the PAGE_SIZE and the resource size." Can you clarify the changelog to specify exactly what alignment this option forces? > The boot option can be used as: > pci=align-mmio=0000:01:02.3 > '[0000:]01:02.3' is the domain, bus, device and function number > of the device. I think you also support using multiple "align-mmio=DDDD:BB:dd.f" options separated by ";", but I had to read the code to figure that out. Can you give an example of this in the changelog and the kernel-parameters.txt patch? 
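For illustration (hypothetical device addresses), such an example could read: pci=align-mmio=0000:00:19.0;0000:01:00.1 with multiple devices separated by ';', matching the strchr(str, ';') parsing in the patch.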
Bjorn > Cc: Alex Chiang > Cc: Grant Grundler > Cc: Greg KH > Cc: Ingo Molnar > Cc: Jesse Barnes > Cc: Matthew Wilcox > Cc: Randy Dunlap > Cc: Roland Dreier > Signed-off-by: Yu Zhao > > --- > arch/x86/pci/common.c | 37 +++++++++++++++++++++++++++++++++++++ > drivers/pci/pci.c | 20 ++++++++++++++++++-- > include/linux/pci.h | 1 + > 3 files changed, 56 insertions(+), 2 deletions(-) > > diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c > index 06e1ce0..3c5d230 100644 > --- a/arch/x86/pci/common.c > +++ b/arch/x86/pci/common.c > @@ -139,6 +139,7 @@ static void __devinit pcibios_fixup_device_resources(struct pci_dev *dev) > > static char *pci_assign_pio; > static char *pci_assign_mmio; > +static char *pci_align_mmio; > > static int pcibios_bus_resource_needs_fixup(struct pci_bus *bus) > { > @@ -192,6 +193,36 @@ static void __devinit pcibios_fixup_bus_resources(struct pci_bus *bus) > } > } > > +int pcibios_resource_alignment(struct pci_dev *dev, int resno) > +{ > + int domain, busnr, slot, func; > + char *str = pci_align_mmio; > + > + if (dev->resource[resno].flags & IORESOURCE_IO) > + return 0; > + > + while (str && *str) { > + if (sscanf(str, "%04x:%02x:%02x.%d", > + &domain, &busnr, &slot, &func) != 4) { > + if (sscanf(str, "%02x:%02x.%d", > + &busnr, &slot, &func) != 3) > + break; > + domain = 0; > + } > + > + if (pci_domain_nr(dev->bus) == domain && > + dev->bus->number == busnr && > + dev->devfn == PCI_DEVFN(slot, func)) > + return PAGE_SIZE; > + > + str = strchr(str, ';'); > + if (str) > + str++; > + } > + > + return 0; > +} > + > int pcibios_resource_needs_fixup(struct pci_dev *dev, int resno) > { > struct pci_bus *bus; > @@ -200,6 +231,9 @@ int pcibios_resource_needs_fixup(struct pci_dev *dev, int resno) > if (pcibios_bus_resource_needs_fixup(bus)) > return 1; > > + if (pcibios_resource_alignment(dev, resno)) > + return 1; > + > return 0; > } > > @@ -592,6 +626,9 @@ char * __devinit pcibios_setup(char *str) > } else if (!strncmp(str, "assign-mmio=", 12)) { > pci_assign_mmio = str + 12; > return NULL; > + } else if (!strncmp(str, "align-mmio=", 11)) { > + pci_align_mmio = str + 11; > + return NULL; > } > return str; > } > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c > index b02167a..11ecd6f 100644 > --- a/drivers/pci/pci.c > +++ b/drivers/pci/pci.c > @@ -1015,6 +1015,20 @@ int __attribute__ ((weak)) pcibios_set_pcie_reset_state(struct pci_dev *dev, > } > > /** > + * pcibios_resource_alignment - get resource alignment requirement > + * @dev: the PCI device > + * @resno: resource number > + * > + * Queries the resource alignment from PCI low level code. Returns positive > + * if there is alignment requirement of the resource, or 0 otherwise. > + */ > +int __attribute__ ((weak)) pcibios_resource_alignment(struct pci_dev *dev, > + int resno) > +{ > + return 0; > +} > + > +/** > * pci_set_pcie_reset_state - set reset state for device dev > * @dev: the PCI-E device reset > * @state: Reset state to enter into > @@ -1913,12 +1927,14 @@ int pci_select_bars(struct pci_dev *dev, unsigned long flags) > */ > int pci_resource_alignment(struct pci_dev *dev, int resno) > { > - resource_size_t align; > + resource_size_t align, bios_align; > struct resource *res = dev->resource + resno; > > + bios_align = pcibios_resource_alignment(dev, resno); > + > align = resource_alignment(res); > if (align) > - return align; > + return align > bios_align ? 
align : bios_align; > > dev_err(&dev->dev, "alignment: invalid resource #%d\n", resno); > return 0; > diff --git a/include/linux/pci.h b/include/linux/pci.h > index 2ada2b6..6ac69af 100644 > --- a/include/linux/pci.h > +++ b/include/linux/pci.h > @@ -1121,6 +1121,7 @@ int pcibios_add_platform_entries(struct pci_dev *dev); > void pcibios_disable_device(struct pci_dev *dev); > int pcibios_set_pcie_reset_state(struct pci_dev *dev, > enum pcie_reset_state state); > +int pcibios_resource_alignment(struct pci_dev *dev, int resno); > > #ifdef CONFIG_PCI_MMCONFIG > extern void __init pci_mmcfg_early_init(void); From bjorn.helgaas at hp.com Wed Oct 22 07:35:34 2008 From: bjorn.helgaas at hp.com (Bjorn Helgaas) Date: Wed, 22 Oct 2008 08:35:34 -0600 Subject: [PATCH 8/16 v6] PCI: add boot options to reassign resources In-Reply-To: <20081022084303.GH3773@yzhao12-linux.sh.intel.com> References: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> <20081022084303.GH3773@yzhao12-linux.sh.intel.com> Message-ID: <200810220835.35866.bjorn.helgaas@hp.com> On Wednesday 22 October 2008 02:43:03 am Yu Zhao wrote: > This patch adds boot options so user can reassign device resources > of all devices under a bus. > > The boot options can be used as: > pci=assign-mmio=0000:01,assign-pio=0000:02 > '[dddd:]bb' is the domain and bus number. I think this example is incorrect because you look for ";" to separate options, not ",". Bjorn > Cc: Alex Chiang > Cc: Grant Grundler > Cc: Greg KH > Cc: Ingo Molnar > Cc: Jesse Barnes > Cc: Matthew Wilcox > Cc: Randy Dunlap > Cc: Roland Dreier > Signed-off-by: Yu Zhao > > --- > arch/x86/pci/common.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++++ > arch/x86/pci/i386.c | 10 ++++--- > arch/x86/pci/pci.h | 3 ++ > 3 files changed, 82 insertions(+), 4 deletions(-) > > diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c > index b67732b..06e1ce0 100644 > --- a/arch/x86/pci/common.c > +++ b/arch/x86/pci/common.c > @@ -137,6 +137,72 @@ static void __devinit pcibios_fixup_device_resources(struct pci_dev *dev) > } > } > > +static char *pci_assign_pio; > +static char *pci_assign_mmio; > + > +static int pcibios_bus_resource_needs_fixup(struct pci_bus *bus) > +{ > + int i; > + int type = 0; > + int domain, busnr; > + > + if (!bus->self) > + return 0; > + > + for (i = 0; i < 2; i++) { > + char *str = i ? pci_assign_pio : pci_assign_mmio; > + > + while (str && *str) { > + if (sscanf(str, "%04x:%02x", &domain, &busnr) != 2) { > + if (sscanf(str, "%02x", &busnr) != 1) > + break; > + domain = 0; > + } > + > + if (pci_domain_nr(bus) == domain && > + bus->number == busnr) { > + type |= i ? IORESOURCE_IO : IORESOURCE_MEM; > + break; > + } > + > + str = strchr(str, ';'); > + if (str) > + str++; > + } > + } > + > + return type; > +} > + > +static void __devinit pcibios_fixup_bus_resources(struct pci_bus *bus) > +{ > + int i; > + int type = pcibios_bus_resource_needs_fixup(bus); > + > + if (!type) > + return; > + > + for (i = 0; i < PCI_BUS_NUM_RESOURCES; i++) { > + struct resource *res = bus->resource[i]; > + > + if (!res) > + continue; > + if (res->flags & type) > + res->flags = 0; > + } > +} > + > +int pcibios_resource_needs_fixup(struct pci_dev *dev, int resno) > +{ > + struct pci_bus *bus; > + > + for (bus = dev->bus; bus && bus != pci_root_bus; bus = bus->parent) > + if (pcibios_bus_resource_needs_fixup(bus)) > + return 1; > + > + return 0; > +} > + > /* > * Called after each bus is probed, but before its children > * are examined. 
> @@ -147,6 +213,7 @@ void __devinit pcibios_fixup_bus(struct pci_bus *b) > struct pci_dev *dev; > > pci_read_bridge_bases(b); > + pcibios_fixup_bus_resources(b); > list_for_each_entry(dev, &b->devices, bus_list) > pcibios_fixup_device_resources(dev); > } > @@ -519,6 +586,12 @@ char * __devinit pcibios_setup(char *str) > } else if (!strcmp(str, "skip_isa_align")) { > pci_probe |= PCI_CAN_SKIP_ISA_ALIGN; > return NULL; > + } else if (!strncmp(str, "assign-pio=", 11)) { > + pci_assign_pio = str + 11; > + return NULL; > + } else if (!strncmp(str, "assign-mmio=", 12)) { > + pci_assign_mmio = str + 12; > + return NULL; > } > return str; > } > diff --git a/arch/x86/pci/i386.c b/arch/x86/pci/i386.c > index 8729bde..ea82a5b 100644 > --- a/arch/x86/pci/i386.c > +++ b/arch/x86/pci/i386.c > @@ -169,10 +169,12 @@ static void __init pcibios_allocate_resources(int pass) > (unsigned long long) r->start, > (unsigned long long) r->end, > r->flags, enabled, pass); > - pr = pci_find_parent_resource(dev, r); > - if (pr && !request_resource(pr, r)) > - continue; > - dev_err(&dev->dev, "BAR %d: can't allocate resource\n", idx); > + if (!pcibios_resource_needs_fixup(dev, idx)) { > + pr = pci_find_parent_resource(dev, r); > + if (pr && !request_resource(pr, r)) > + continue; > + dev_err(&dev->dev, "BAR %d: can't allocate resource\n", idx); > + } > /* We'll assign a new address later */ > r->end -= r->start; > r->start = 0; > diff --git a/arch/x86/pci/pci.h b/arch/x86/pci/pci.h > index 15b9cf6..f22737d 100644 > --- a/arch/x86/pci/pci.h > +++ b/arch/x86/pci/pci.h > @@ -117,6 +117,9 @@ extern int __init pcibios_init(void); > extern int __init pci_mmcfg_arch_init(void); > extern void __init pci_mmcfg_arch_free(void); > > +/* pci-common.c */ > +extern int pcibios_resource_needs_fixup(struct pci_dev *dev, int resno); > + > /* > * AMD Fam10h CPUs are buggy, and cannot access MMIO config space > * on their northbrige except through the * %eax register. As such, you MUST From yu.zhao at uniscape.net Wed Oct 22 07:44:24 2008 From: yu.zhao at uniscape.net (Yu Zhao) Date: Wed, 22 Oct 2008 22:44:24 +0800 Subject: [PATCH 2/16 v6] PCI: define PCI resource names in an 'enum' In-Reply-To: <200810220824.21356.bjorn.helgaas@hp.com> References: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> <20081022084041.GB3773@yzhao12-linux.sh.intel.com> <200810220824.21356.bjorn.helgaas@hp.com> Message-ID: <48FF3C48.9060809@uniscape.net> Bjorn Helgaas wrote: > On Wednesday 22 October 2008 02:40:41 am Yu Zhao wrote: >> This patch moves all definitions of the PCI resource names to an 'enum', >> and also replaces some hard-coded resource variables with symbol >> names. This change eases introduction of device specific resources. > > Thanks for removing a bunch of magic numbers from the code. > >> static void >> pci_restore_bars(struct pci_dev *dev) >> { >> - int i, numres; >> - >> - switch (dev->hdr_type) { >> - case PCI_HEADER_TYPE_NORMAL: >> - numres = 6; >> - break; >> - case PCI_HEADER_TYPE_BRIDGE: >> - numres = 2; >> - break; >> - case PCI_HEADER_TYPE_CARDBUS: >> - numres = 1; >> - break; >> - default: >> - /* Should never get here, but just in case... */ >> - return; >> - } >> + int i; >> >> - for (i = 0; i < numres; i++) >> + for (i = 0; i < PCI_BRIDGE_RESOURCES; i++) >> pci_update_resource(dev, i); >> } > > The behavior of this function used to depend on dev->hdr_type. Now > we don't look at hdr_type at all, so we do the same thing for all > devices. 
> > For example, for a CardBus device, we used to call pci_update_resource() > only for BAR 0; now we call it for BARs 0-6. > > Maybe this is safe, but I can't tell from the patch, so I think you > should explain *why* it's safe in the changelog. It's safe because pci_update_resource() will ignore unused resources. E.g., for a Cardbus, only BAR 0 is used and its 'flags' is set, then pci_update_resource() only updates it. BAR 1-6 are ignored since their 'flags' are 0. I'll put more explanation in the changelog. Thanks, Yu From yu.zhao at uniscape.net Wed Oct 22 07:49:56 2008 From: yu.zhao at uniscape.net (Yu Zhao) Date: Wed, 22 Oct 2008 22:49:56 +0800 Subject: [PATCH 8/16 v6] PCI: add boot options to reassign resources In-Reply-To: <200810220835.35866.bjorn.helgaas@hp.com> References: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> <20081022084303.GH3773@yzhao12-linux.sh.intel.com> <200810220835.35866.bjorn.helgaas@hp.com> Message-ID: <48FF3D94.5090001@uniscape.net> Bjorn Helgaas wrote: > On Wednesday 22 October 2008 02:43:03 am Yu Zhao wrote: >> This patch adds boot options so user can reassign device resources >> of all devices under a bus. >> >> The boot options can be used as: >> pci=assign-mmio=0000:01,assign-pio=0000:02 >> '[dddd:]bb' is the domain and bus number. > > I think this example is incorrect because you look for ";" to > separate options, not ",". The semicolon is used to separate multiple parameters for assign-mmio and assign-pio. E.g., 'pci=assign-mmio=0000:01;0001:02;0004:03'. And the comma separates different parameters for 'pci='. From bjorn.helgaas at hp.com Wed Oct 22 07:51:11 2008 From: bjorn.helgaas at hp.com (Bjorn Helgaas) Date: Wed, 22 Oct 2008 08:51:11 -0600 Subject: [PATCH 2/16 v6] PCI: define PCI resource names in an 'enum' In-Reply-To: <48FF3C48.9060809@uniscape.net> References: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> <200810220824.21356.bjorn.helgaas@hp.com> <48FF3C48.9060809@uniscape.net> Message-ID: <200810220851.13413.bjorn.helgaas@hp.com> On Wednesday 22 October 2008 08:44:24 am Yu Zhao wrote: > Bjorn Helgaas wrote: > > On Wednesday 22 October 2008 02:40:41 am Yu Zhao wrote: > >> This patch moves all definitions of the PCI resource names to an 'enum', > >> and also replaces some hard-coded resource variables with symbol > >> names. This change eases introduction of device specific resources. > > > > Thanks for removing a bunch of magic numbers from the code. > > > >> static void > >> pci_restore_bars(struct pci_dev *dev) > >> { > >> - int i, numres; > >> - > >> - switch (dev->hdr_type) { > >> - case PCI_HEADER_TYPE_NORMAL: > >> - numres = 6; > >> - break; > >> - case PCI_HEADER_TYPE_BRIDGE: > >> - numres = 2; > >> - break; > >> - case PCI_HEADER_TYPE_CARDBUS: > >> - numres = 1; > >> - break; > >> - default: > >> - /* Should never get here, but just in case... */ > >> - return; > >> - } > >> + int i; > >> > >> - for (i = 0; i < numres; i++) > >> + for (i = 0; i < PCI_BRIDGE_RESOURCES; i++) > >> pci_update_resource(dev, i); > >> } > > > > The behavior of this function used to depend on dev->hdr_type. Now > > we don't look at hdr_type at all, so we do the same thing for all > > devices. > > > > For example, for a CardBus device, we used to call pci_update_resource() > > only for BAR 0; now we call it for BARs 0-6. > > > > Maybe this is safe, but I can't tell from the patch, so I think you > > should explain *why* it's safe in the changelog. > > It's safe because pci_update_resource() will ignore unused resources. 
> E.g., for a Cardbus, only BAR 0 is used and its 'flags' is set, then > pci_update_resource() only updates it. BAR 1-6 are ignored since their > 'flags' are 0. > > I'll put more explanation in the changelog. This is a logically separate change from merely substituting enum names for magic numbers, so you might even consider splitting it into a separate patch. Better bisection and all that, you know :-) Bjorn From yu.zhao at uniscape.net Wed Oct 22 07:52:43 2008 From: yu.zhao at uniscape.net (Yu Zhao) Date: Wed, 22 Oct 2008 22:52:43 +0800 Subject: [PATCH 9/16 v6] PCI: add boot option to align MMIO resources In-Reply-To: <200810220834.07222.bjorn.helgaas@hp.com> References: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> <20081022084324.GI3773@yzhao12-linux.sh.intel.com> <200810220834.07222.bjorn.helgaas@hp.com> Message-ID: <48FF3E3B.3050006@uniscape.net> Bjorn Helgaas wrote: > On Wednesday 22 October 2008 02:43:24 am Yu Zhao wrote: >> This patch adds boot option to align MMIO resource for a device. >> The alignment is a bigger value between the PAGE_SIZE and the >> resource size. > > It looks like this forces alignment on PAGE_SIZE, not "a bigger > value between the PAGE_SIZE and the resource size." Can you > clarify the changelog to specify exactly what alignment this > option forces? I guess following would explain your question. >> int pci_resource_alignment(struct pci_dev *dev, int resno) >> { >> - resource_size_t align; >> + resource_size_t align, bios_align; >> struct resource *res = dev->resource + resno; >> >> + bios_align = pcibios_resource_alignment(dev, resno); >> + >> align = resource_alignment(res); >> if (align) >> - return align; >> + return align > bios_align ? align : bios_align; >> >> dev_err(&dev->dev, "alignment: invalid resource #%d\n", resno); >> return 0; From yu.zhao at uniscape.net Wed Oct 22 07:53:59 2008 From: yu.zhao at uniscape.net (Yu Zhao) Date: Wed, 22 Oct 2008 22:53:59 +0800 Subject: [PATCH 2/16 v6] PCI: define PCI resource names in an 'enum' In-Reply-To: <200810220851.13413.bjorn.helgaas@hp.com> References: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> <200810220824.21356.bjorn.helgaas@hp.com> <48FF3C48.9060809@uniscape.net> <200810220851.13413.bjorn.helgaas@hp.com> Message-ID: <48FF3E87.4090407@uniscape.net> Bjorn Helgaas wrote: > On Wednesday 22 October 2008 08:44:24 am Yu Zhao wrote: >> Bjorn Helgaas wrote: >>> On Wednesday 22 October 2008 02:40:41 am Yu Zhao wrote: >>>> This patch moves all definitions of the PCI resource names to an 'enum', >>>> and also replaces some hard-coded resource variables with symbol >>>> names. This change eases introduction of device specific resources. >>> Thanks for removing a bunch of magic numbers from the code. >>> >>>> static void >>>> pci_restore_bars(struct pci_dev *dev) >>>> { >>>> - int i, numres; >>>> - >>>> - switch (dev->hdr_type) { >>>> - case PCI_HEADER_TYPE_NORMAL: >>>> - numres = 6; >>>> - break; >>>> - case PCI_HEADER_TYPE_BRIDGE: >>>> - numres = 2; >>>> - break; >>>> - case PCI_HEADER_TYPE_CARDBUS: >>>> - numres = 1; >>>> - break; >>>> - default: >>>> - /* Should never get here, but just in case... */ >>>> - return; >>>> - } >>>> + int i; >>>> >>>> - for (i = 0; i < numres; i++) >>>> + for (i = 0; i < PCI_BRIDGE_RESOURCES; i++) >>>> pci_update_resource(dev, i); >>>> } >>> The behavior of this function used to depend on dev->hdr_type. Now >>> we don't look at hdr_type at all, so we do the same thing for all >>> devices. 
>>> >>> For example, for a CardBus device, we used to call pci_update_resource() >>> only for BAR 0; now we call it for BARs 0-6. >>> >>> Maybe this is safe, but I can't tell from the patch, so I think you >>> should explain *why* it's safe in the changelog. >> It's safe because pci_update_resource() will ignore unused resources. >> E.g., for a Cardbus, only BAR 0 is used and its 'flags' is set, then >> pci_update_resource() only updates it. BAR 1-6 are ignored since their >> 'flags' are 0. >> >> I'll put more explanation in the changelog. > > This is a logically separate change from merely substituting enum > names for magic numbers, so you might even consider splitting it > into a separate patch. Better bisection and all that, you know :-) Will do. Thanks, Yu From markmc at redhat.com Wed Oct 22 08:32:26 2008 From: markmc at redhat.com (Mark McLoughlin) Date: Wed, 22 Oct 2008 16:32:26 +0100 Subject: [PATCH] virtio_net: hook up the set-tso ethtool op Message-ID: <1224689546.30669.4.camel@blaa> Seems like an oversight that we have set-tx-csum and set-sg hooked up, but not set-tso. Also leads to the strange situation that if you e.g. disable tx-csum, then tso doesn't get disabled. Signed-off-by: Mark McLoughlin --- drivers/net/virtio_net.c | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index cca6435..79b59cc 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -612,6 +612,7 @@ static int virtnet_set_tx_csum(struct net_device *dev, u32 data) static struct ethtool_ops virtnet_ethtool_ops = { .set_tx_csum = virtnet_set_tx_csum, .set_sg = ethtool_op_set_sg, + .set_tso = ethtool_op_set_tso, }; static int virtnet_probe(struct virtio_device *vdev) -- 1.6.0.1 From randy.dunlap at oracle.com Wed Oct 22 10:01:22 2008 From: randy.dunlap at oracle.com (Randy Dunlap) Date: Wed, 22 Oct 2008 10:01:22 -0700 Subject: [PATCH 16/16 v6] PCI: document the new PCI boot parameters In-Reply-To: <200810220827.33758.bjorn.helgaas@hp.com> References: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> <20081022084531.GP3773@yzhao12-linux.sh.intel.com> <200810220827.33758.bjorn.helgaas@hp.com> Message-ID: <48FF5C62.3020008@oracle.com> Bjorn Helgaas wrote: > On Wednesday 22 October 2008 02:45:31 am Yu Zhao wrote: >> Document the new PCI[x86] boot parameters. >> >> Cc: Alex Chiang >> Cc: Grant Grundler >> Cc: Greg KH >> Cc: Ingo Molnar >> Cc: Jesse Barnes >> Cc: Matthew Wilcox >> Cc: Randy Dunlap >> Cc: Roland Dreier >> Signed-off-by: Yu Zhao >> >> --- >> Documentation/kernel-parameters.txt | 10 ++++++++++ >> 1 files changed, 10 insertions(+), 0 deletions(-) >> >> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt >> index 53ba7c7..5482ae0 100644 >> --- a/Documentation/kernel-parameters.txt >> +++ b/Documentation/kernel-parameters.txt >> @@ -1677,6 +1677,16 @@ and is between 256 and 4096 characters. It is defined in the file >> cbmemsize=nn[KMG] The fixed amount of bus space which is >> reserved for the CardBus bridge's memory >> window. The default value is 64 megabytes. >> + assign-mmio=[dddd:]bb [X86] reassign memory resources of all >> + devices under bus [dddd:]bb (dddd is the domain >> + number and bb is the bus number). >> + assign-pio=[dddd:]bb [X86] reassign io port resources of all "io" in text should be "IO" or "I/O". (Small "io" is OK as a parameter placeholder.) >> + devices under bus [dddd:]bb (dddd is the domain >> + number and bb is the bus number). 
>> + align-mmio=[dddd:]bb:dd.f [X86] relocate memory resources of a >> + device to minimum PAGE_SIZE alignment (dddd is >> + the domain number and bb, dd and f is the bus, are the bus, >> + device and function number). >> >> pcie_aspm= [PCIE] Forcibly enable or disable PCIe Active State Power >> Management. > > I think it's nicer to have the documentation change included in the > patch that implements the change. For example, I think this and > patch 9/16 "add boot option to align ..." should be folded into > a single patch. And similarly for the other documentation patches. > > Bjorn From yu.zhao at intel.com Wed Oct 22 23:50:57 2008 From: yu.zhao at intel.com (Yu Zhao) Date: Thu, 23 Oct 2008 14:50:57 +0800 Subject: [PATCH 7/16 v6] PCI: cleanup pcibios_allocate_resources() In-Reply-To: <86802c440810230010redfbbe7oaf94bf2077ccdcdf@mail.gmail.com> References: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> <20081022084241.GG3773@yzhao12-linux.sh.intel.com> <86802c440810230010redfbbe7oaf94bf2077ccdcdf@mail.gmail.com> Message-ID: <20081023065057.GA4340@yzhao12-linux.sh.intel.com> On Thu, Oct 23, 2008 at 03:10:26PM +0800, Yinghai Lu wrote: > On Wed, Oct 22, 2008 at 1:42 AM, Yu Zhao wrote: > > diff --git a/arch/x86/pci/i386.c b/arch/x86/pci/i386.c > > index 844df0c..8729bde 100644 > > --- a/arch/x86/pci/i386.c > > +++ b/arch/x86/pci/i386.c > > @@ -147,7 +147,7 @@ static void __init pcibios_allocate_bus_resources(struct list_head *bus_list) > > static void __init pcibios_allocate_resources(int pass) > > { > > struct pci_dev *dev = NULL; > > - int idx, disabled; > > + int idx, enabled; > > u16 command; > > struct resource *r, *pr; > > > > @@ -160,22 +160,22 @@ static void __init pcibios_allocate_resources(int pass) > > if (!r->start) /* Address not assigned at all */ > > continue; > > if (r->flags & IORESOURCE_IO) > > - disabled = !(command & PCI_COMMAND_IO); > > + enabled = command & PCI_COMMAND_IO; > > else > > - disabled = !(command & PCI_COMMAND_MEMORY); > > - if (pass == disabled) { > > - dev_dbg(&dev->dev, "resource %#08llx-%#08llx (f=%lx, d=%d, p=%d)\n", > > + enabled = command & PCI_COMMAND_MEMORY; > > + if (pass == enabled) > > + continue; > > it seems you change the flow here for MMIO > because PCI_COMMAND_MEMORY is 2. > > YH Nice finding! Will change it back to 'disable' next version. Thanks, Yu From ryov at valinux.co.jp Thu Oct 23 04:28:51 2008 From: ryov at valinux.co.jp (Ryo Tsuruta) Date: Thu, 23 Oct 2008 20:28:51 +0900 (JST) Subject: [PATCH 0/2] dm-ioband: I/O bandwidth controller v1.8.0: Introduction In-Reply-To: <1224756158.8286.36.camel@pek-hzhang-d1> References: <48FEDC63.308@windriver.com> <20081022.170536.193712541.ryov@valinux.co.jp> <1224756158.8286.36.camel@pek-hzhang-d1> Message-ID: <20081023.202851.112594221.ryov@valinux.co.jp> Hi Haotian, > The results are almost the same. I can not see any change of Direct I/O > performance for this bio-cgroup kernel feature with dm-ioband support! > > Does the methord to caculate throughout should be the Rate of xdd.linux > output? > Dose my testing approach should be correct? If not, please help me point > out. Could you try to run the xdd programs simultaneously? dm-ioband controls bandwidth while I/O requests are issued simultaneously from processes which belong to different cgroup. If I/O requests are only issued from processes which belong to one cgroup, the processes can use the whole bandwidth. The following URL is an example of how bandwidth is shared to I/O load change. 
http://people.valinux.co.jp/~ryov/dm-ioband/benchmark/partition1.html Thanks, Ryo Tsuruta From yinghai at kernel.org Thu Oct 23 00:10:26 2008 From: yinghai at kernel.org (Yinghai Lu) Date: Thu, 23 Oct 2008 00:10:26 -0700 Subject: [PATCH 7/16 v6] PCI: cleanup pcibios_allocate_resources() In-Reply-To: <20081022084241.GG3773@yzhao12-linux.sh.intel.com> References: <20081022083809.GA3757@yzhao12-linux.sh.intel.com> <20081022084241.GG3773@yzhao12-linux.sh.intel.com> Message-ID: <86802c440810230010redfbbe7oaf94bf2077ccdcdf@mail.gmail.com> On Wed, Oct 22, 2008 at 1:42 AM, Yu Zhao wrote: > This cleanup makes pcibios_allocate_resources() easier to read. > > Cc: Alex Chiang > Cc: Grant Grundler > Cc: Greg KH > Cc: Ingo Molnar > Cc: Jesse Barnes > Cc: Matthew Wilcox > Cc: Randy Dunlap > Cc: Roland Dreier > Signed-off-by: Yu Zhao > > --- > arch/x86/pci/i386.c | 28 ++++++++++++++-------------- > 1 files changed, 14 insertions(+), 14 deletions(-) > > diff --git a/arch/x86/pci/i386.c b/arch/x86/pci/i386.c > index 844df0c..8729bde 100644 > --- a/arch/x86/pci/i386.c > +++ b/arch/x86/pci/i386.c > @@ -147,7 +147,7 @@ static void __init pcibios_allocate_bus_resources(struct list_head *bus_list) > static void __init pcibios_allocate_resources(int pass) > { > struct pci_dev *dev = NULL; > - int idx, disabled; > + int idx, enabled; > u16 command; > struct resource *r, *pr; > > @@ -160,22 +160,22 @@ static void __init pcibios_allocate_resources(int pass) > if (!r->start) /* Address not assigned at all */ > continue; > if (r->flags & IORESOURCE_IO) > - disabled = !(command & PCI_COMMAND_IO); > + enabled = command & PCI_COMMAND_IO; > else > - disabled = !(command & PCI_COMMAND_MEMORY); > - if (pass == disabled) { > - dev_dbg(&dev->dev, "resource %#08llx-%#08llx (f=%lx, d=%d, p=%d)\n", > + enabled = command & PCI_COMMAND_MEMORY; > + if (pass == enabled) > + continue; it seems you change the flow here for MMIO because PCI_COMMAND_MEMORY is 2. YH From haotian.zhang at windriver.com Thu Oct 23 03:02:38 2008 From: haotian.zhang at windriver.com (haotian) Date: Thu, 23 Oct 2008 18:02:38 +0800 Subject: [PATCH 0/2] dm-ioband: I/O bandwidth controller v1.8.0: Introduction In-Reply-To: <20081022.170536.193712541.ryov@valinux.co.jp> References: <20081017.160950.71109894.ryov@valinux.co.jp> <48FDB8AC.9020707@windriver.com> <48FEDC63.308@windriver.com> <20081022.170536.193712541.ryov@valinux.co.jp> Message-ID: <1224756158.8286.36.camel@pek-hzhang-d1> Hi Ryo Tsuruta: This is Haotian Zhang, I am testing the bio_tracking as your benchmark reports. I follow the test procedure as you describe in web page with the xdd-count.sh, but the results are not to indicate the implementation of IO count-based bandwidth control on per bio-cgroups. The testing approach is as follow: 1, mount bio-cgroup on /cgroup #mount -t cgroup -bio none /cgroup 2, create 3 cgroup #cd /cgroup \ mkdir 1 2 3 3, create 3 ioband device on each partiton, a.I have 3 ext2 partition /dev/sda5 sda6 sda7: # cat /proc/partitions major minor #blocks name 8 0 58605120 sda 8 1 32901088 sda1 8 2 1 sda2 8 5 8152956 sda5 8 6 8924076 sda6 8 7 8626873 sda7 b. 
Give weights of 40, 20 and 10 to cgroup1, cgroup2 and cgroup3 respectively, and create ioband device: #echo "0 $DEVSIZE1 ioband $DEV1 1 0 0" \ "cgroup weight 0 :100 1:40 2:20 3:10" | dmsetup create ioband1 #echo "0 $DEVSIZE2 ioband $DEV2 1 0 0" \ "cgroup weight 0 :100 1:40 2:20 3:10" | dmsetup create ioband2 #echo "0 $DEVSIZE3 ioband $DEV3 1 0 0" \ "cgroup weight 0 :100 1:40 2:20 3:10" | dmsetup create ioband3 /*============================================================================ "NOTE" The variables are exported as: DEV1=/dev/sda5 DEV2=/dev/sda6 DEV3=/dev/sda7 DEVSIZE1=$(blockdev --getsize $DEV1) DEVSIZE2=$(blockdev --getsize $DEV2) DEVSIZE3=$(blockdev --getsize $DEV3) RANGE=10240 XDDOPT="-op write -queuedepth 32 -blocksize 512 -reqsize 64 -seek random -datapattern random -dio -timelimit 60 -mbytes $RANGE -seek range $((RANGE * 1048576 / 512))" ?============================================================================*/ c. Check out the ioband: #ls /dev/mapper/ control ioband1 ioband2 ioband3 4, Run 32 processes random direct I/O with data on each ioband device in 60 seconds: //******* THE first device ioband1 /dev/sda5 ***************// #export XDDOPT="-op write -queuedepth 32 -blocksize 512 -reqsize 64 -seek random -datapattern random -dio -timelimit 60 -mbytes 10240 -seek range 20971520" #echo $$ > /cgroup/1/tasks #xdd.linux -targets 1 /dev/mapper/ioband1 $XDDOPT -output cgroup1.txt #tail -4 /root/cgroup1.txt T Q Bytes Ops Time Rate IOPS Latency %CPU OP_Type ReqSize Combined 1 32 262766592 8019 60.203 4.365 133.20 0.0075 0.01 write 32768 Ending time for this run, Thu Oct 23 01:46:38 2008 ?//******* THE second device ioband2 /dev/sda6 ***************// Using another ssh terminal ?#export XDDOPT="-op write -queuedepth 32 -blocksize 512 -reqsize 64 -seek random -datapattern random -dio -timelimit 60 -mbytes 10240 -seek range 20971520" #echo $$ > /cgroup/2/tasks #xdd.linux -targets 1 /dev/mapper/ioband2 $XDDOPT -output cgroup2.txt #tail -4 /root/cgroup2.txt T Q Bytes Ops Time Rate IOPS Latency %CPU OP_Type ReqSize Combined 1 32 243662848 7436 60.263 4.043 123.39 0.0081 0.01 write 32768 Ending time for this run, Thu Oct 23 01:50:55 2008 ??//******* THE third device ioband3 /dev/sda7 ***************// Using another ssh terminal ?#export XDDOPT="-op write -queuedepth 32 -blocksize 512 -reqsize 64 -seek random -datapattern random -dio -timelimit 60 -mbytes 10240 -seek range 20971520" #echo $$ > /cgroup/3/tasks #xdd.linux -targets 1 /dev/mapper/ioband3 $XDDOPT -output cgroup3.txt #tail -4 /root/cgroup3.txt T Q Bytes Ops Time Rate IOPS Latency %CPU OP_Type ReqSize Combined 1 32 222986240 6805 60.073 3.712 113.28 0.0088 0.01 write 32768 Ending time for this run, Thu Oct 23 01:58:31 2008 The results are almost the same. I can not see any change of Direct I/O performance for this bio-cgroup kernel feature with dm-ioband support! Does the methord to caculate throughout should be the Rate of xdd.linux output? Dose my testing approach should be correct? If not, please help me point out. Thanks, Haotian. On Wed, 2008-10-22 at 17:05 +0900, Ryo Tsuruta wrote: > Hi Chen, > > > Chen Zumeng wrote: > > > Hi, Ryo Tsuruta > > > And our test team want to test bio_tracking as your benchmark reports, > > > so would you please send me your test codes? Thanks in advance. > > Hi Ryo Tsuruta, > > > > I wonder if you received last email, so I reply this email to ask > > for your bio_tracking test codes to generate your benchmark reports > > as shown in your website. 
Thanks in advance :) > > I've uploaded two scripts here: > http://people.valinux.co.jp/~ryov/dm-ioband/scripts/xdd-count.sh > http://people.valinux.co.jp/~ryov/dm-ioband/scripts/xdd-size.sh > > xdd-count.sh controls bandwidth based on the number of I/O requests, > and xdd-size.sh controls bandwidth based onthe number of I/O sectors. > Theses scritpts require xdd disk I/O testing tool which can be > downloaded from here: > http://www.ioperformance.com/products.htm > > Please feel free to ask me questions if you have any questions. > > > > P.S. The following are my changes to avoid schedule_timeout: > > Thanks, but your patch seems to cause a problem when ioband devices > which have the same name are created at the same time. I will fix the > issue in the next release. > > Thanks, > Ryo Tsuruta From rusty at rustcorp.com.au Thu Oct 23 08:33:24 2008 From: rusty at rustcorp.com.au (Rusty Russell) Date: Fri, 24 Oct 2008 02:33:24 +1100 Subject: [PATCH] virtio_net: hook up the set-tso ethtool op In-Reply-To: <1224689546.30669.4.camel@blaa> References: <1224689546.30669.4.camel@blaa> Message-ID: <200810240233.24387.rusty@rustcorp.com.au> On Thursday 23 October 2008 02:32:26 Mark McLoughlin wrote: > Seems like an oversight that we have set-tx-csum and set-sg hooked > up, but not set-tso. > > Also leads to the strange situation that if you e.g. disable tx-csum, > then tso doesn't get disabled. Thanks, applied. Rusty. From balbir at linux.vnet.ibm.com Fri Oct 24 04:47:14 2008 From: balbir at linux.vnet.ibm.com (Balbir Singh) Date: Fri, 24 Oct 2008 17:17:14 +0530 Subject: [Question] power management related with cgroup based resource management In-Reply-To: <2891419e0810201954q57087fc8ufcaa0e42f3ca99e2@mail.gmail.com> References: <2891419e0810201954q57087fc8ufcaa0e42f3ca99e2@mail.gmail.com> Message-ID: <4901B5C2.9070108@linux.vnet.ibm.com> Dong-Jae Kang wrote: > Hi, all > > These days, I am interested in green IT area for low power OS > So, I have a question about it. > Is there any good idea or comments about power management related with > cgroup based resource management? > I have no idea about that, but it seems to be possible to find a good concept. > And I hope so > Is it some strange question? ^^ lesswatts.org, linux-pm (mailing list) are good sources on Power Management. I would recommend asking at those mailing lists (there are several new features like range timers, no idle hertz, sched_mc consolidation and much more). Could you be specific about what you are looking for? Are you looking at Server/Desktop power management? -- Balbir From baramsori72 at gmail.com Fri Oct 24 05:24:21 2008 From: baramsori72 at gmail.com (Dong-Jae Kang) Date: Fri, 24 Oct 2008 21:24:21 +0900 Subject: [Question] power management related with cgroup based resource management In-Reply-To: <4901B5C2.9070108@linux.vnet.ibm.com> References: <2891419e0810201954q57087fc8ufcaa0e42f3ca99e2@mail.gmail.com> <4901B5C2.9070108@linux.vnet.ibm.com> Message-ID: <2891419e0810240524u746b731cjcbd127fa4204bd4d@mail.gmail.com> Hi, Balbir Singh 2008/10/24 Balbir Singh : > Dong-Jae Kang wrote: >> Hi, all >> >> These days, I am interested in green IT area for low power OS >> So, I have a question about it. >> Is there any good idea or comments about power management related with >> cgroup based resource management? >> I have no idea about that, but it seems to be possible to find a good concept. >> And I hope so >> Is it some strange question? ^^ > > lesswatts.org, linux-pm (mailing list) are good sources on Power Management. 
I > would recommend asking at those mailing lists (there are several new features > like range timers, no idle hertz, sched_mc consolidation and much more). Could > you be specific about what you are looking for? Are you looking at > Server/Desktop power management? > Thank you very much for your kind recommendation.^^ this site information will be helpful for me. As your recommendation, I will try to contact to lesswatts.org, linux-pm (mailing list) I am interested in power management in server side. and I just wonder this question, "Is there any related point between cgroup framework and power management?" Thanks again Regards, Dong-Jae Kang From mjg at redhat.com Fri Oct 24 16:25:03 2008 From: mjg at redhat.com (Matthew Garrett) Date: Sat, 25 Oct 2008 00:25:03 +0100 Subject: [Question] power management related with cgroup based resource management In-Reply-To: <2891419e0810240524u746b731cjcbd127fa4204bd4d@mail.gmail.com> References: <2891419e0810201954q57087fc8ufcaa0e42f3ca99e2@mail.gmail.com> <4901B5C2.9070108@linux.vnet.ibm.com> <2891419e0810240524u746b731cjcbd127fa4204bd4d@mail.gmail.com> Message-ID: <20081024232503.GA20140@srcf.ucam.org> On Fri, Oct 24, 2008 at 09:24:21PM +0900, Dong-Jae Kang wrote: > I am interested in power management in server side. > and I just wonder this question, "Is there any related point between > cgroup framework and power management?" Not currently, though it would certainly be possible to use cgroups as a mechanism for providing application-specific power management. -- Matthew Garrett | mjg59 at srcf.ucam.org From baramsori72 at gmail.com Sat Oct 25 01:05:40 2008 From: baramsori72 at gmail.com (Dong-Jae Kang) Date: Sat, 25 Oct 2008 17:05:40 +0900 Subject: [Question] power management related with cgroup based resource management In-Reply-To: <20081024232503.GA20140@srcf.ucam.org> References: <2891419e0810201954q57087fc8ufcaa0e42f3ca99e2@mail.gmail.com> <4901B5C2.9070108@linux.vnet.ibm.com> <2891419e0810240524u746b731cjcbd127fa4204bd4d@mail.gmail.com> <20081024232503.GA20140@srcf.ucam.org> Message-ID: <2891419e0810250105j3e210df0pa3ba9665bd35313a@mail.gmail.com> Thank you for your positive opinion about my question I also hope cgroup framework has good point related with power management I think I need more re-consideration for it. ^^ How do you think about cgroup based management of new HW devices, for example, SSD, NVRAM and so on. Is there any requirement for it ? and is there any required work for it? I didn't seriously consider about that until now.^^ so I don't have cool idea but, I think it is worthy to find new domain to be applied by existing technology thank you. Best Regards, Dong-Jae Kang 2008/10/25 Matthew Garrett : > On Fri, Oct 24, 2008 at 09:24:21PM +0900, Dong-Jae Kang wrote: > >> I am interested in power management in server side. >> and I just wonder this question, "Is there any related point between >> cgroup framework and power management?" > > Not currently, though it would certainly be possible to use cgroups as a > mechanism for providing application-specific power management. 
> > -- > Matthew Garrett | mjg59 at srcf.ucam.org > -- ------------------------------------------------------------------------------------------------- DONG-JAE, KANG Senior Member of Engineering Staff Internet Platform Research Dept, S/W Content Research Lab Electronics and Telecommunications Research Institute(ETRI) 138 Gajeongno, Yuseong-gu, Daejeon, 305-700 KOREA Phone : 82-42-860-1561 Fax : 82-42-860-6699 Mobile : 82-10-9919-2353 E-mail : djkang at etri.re.kr (MSN) ------------------------------------------------------------------------------------------------- From menage at google.com Sat Oct 25 08:43:53 2008 From: menage at google.com (Paul Menage) Date: Sat, 25 Oct 2008 08:43:53 -0700 Subject: [Question] power management related with cgroup based resource management In-Reply-To: <2891419e0810250105j3e210df0pa3ba9665bd35313a@mail.gmail.com> References: <2891419e0810201954q57087fc8ufcaa0e42f3ca99e2@mail.gmail.com> <4901B5C2.9070108@linux.vnet.ibm.com> <2891419e0810240524u746b731cjcbd127fa4204bd4d@mail.gmail.com> <20081024232503.GA20140@srcf.ucam.org> <2891419e0810250105j3e210df0pa3ba9665bd35313a@mail.gmail.com> Message-ID: <6599ad830810250843u10f65917x3388276211e90316@mail.gmail.com> On Sat, Oct 25, 2008 at 1:05 AM, Dong-Jae Kang wrote: > Thank you for your positive opinion about my question > > I also hope cgroup framework has good point related with power management > I think I need more re-consideration for it. ^^ > > How do you think about cgroup based management of new HW devices, for > example, SSD, NVRAM and so on. > Is there any requirement for it ? > and is there any required work for it? > I didn't seriously consider about that until now.^^ so I don't have cool idea > but, I think it is worthy to find new domain to be applied by existing > technology Control Groups is just a framework for associating state with (user-created) groups of processes. So if you have a problem to solve that involves tracking state for different processes, or applying different behaviour to groups of processes based on that group's state, then cgroups may well be an appropriate tool. In the case you mention (management of new devices) that's already somewhat covered by the existing device isolation subsystem - you can create a cgroup that has (or doesn't have) access to particular HW devices. Paul From baramsori72 at gmail.com Sun Oct 26 00:54:45 2008 From: baramsori72 at gmail.com (Dong-Jae Kang) Date: Sun, 26 Oct 2008 16:54:45 +0900 Subject: [Question] power management related with cgroup based resource management In-Reply-To: <6599ad830810250843u10f65917x3388276211e90316@mail.gmail.com> References: <2891419e0810201954q57087fc8ufcaa0e42f3ca99e2@mail.gmail.com> <4901B5C2.9070108@linux.vnet.ibm.com> <2891419e0810240524u746b731cjcbd127fa4204bd4d@mail.gmail.com> <20081024232503.GA20140@srcf.ucam.org> <2891419e0810250105j3e210df0pa3ba9665bd35313a@mail.gmail.com> <6599ad830810250843u10f65917x3388276211e90316@mail.gmail.com> Message-ID: <2891419e0810260054p777a0602ndb4242628a7503d2@mail.gmail.com> Hi, Paul Menage Thank you for your comments > Control Groups is just a framework for associating state with > (user-created) groups of processes. So if you have a problem to solve > that involves tracking state for different processes, or applying > different behaviour to groups of processes based on that group's > state, then cgroups may well be an appropriate tool. 
> > In the case you mention (management of new devices) that's already > somewhat covered by the existing device isolation subsystem - you can > create a cgroup that has (or doesn't have) access to particular HW > devices. > In some aspect, your opinion is right. Existing controller(ex. disk IO controllers) can be run on new HW devices(ex. SSD), existing block layer and so on. but, what I mean is that such controllers can support more performance if the controllers are rewrited with reconsideration of the features of new HW devices. in other words, what I mean can be optimization of controllers for new devices For example, In case of SSD, current IO scheduler layer is needed ? although i can not sure about it ^^ or process sleep is needed after throwing the IO requests to storage ? the role of page cache in SSD or NVRAM is less important than in normal HDD and .... I heard that many research centers in comanies and universities have studied about smiliar research of course, it can be OS itself, device drivers, block layer, file systems and memory management Under this trend, I just wonder whether the trend can be reflected to cgroup based controllers or not. and whether it is meaningful or not? How do you think about this? My opinion may be some humble ^^ Thank you -- Best Regards, Dong-Jae Kang From menage at google.com Sun Oct 26 01:21:21 2008 From: menage at google.com (Paul Menage) Date: Sun, 26 Oct 2008 01:21:21 -0700 Subject: [Question] power management related with cgroup based resource management In-Reply-To: <2891419e0810260054p777a0602ndb4242628a7503d2@mail.gmail.com> References: <2891419e0810201954q57087fc8ufcaa0e42f3ca99e2@mail.gmail.com> <4901B5C2.9070108@linux.vnet.ibm.com> <2891419e0810240524u746b731cjcbd127fa4204bd4d@mail.gmail.com> <20081024232503.GA20140@srcf.ucam.org> <2891419e0810250105j3e210df0pa3ba9665bd35313a@mail.gmail.com> <6599ad830810250843u10f65917x3388276211e90316@mail.gmail.com> <2891419e0810260054p777a0602ndb4242628a7503d2@mail.gmail.com> Message-ID: <6599ad830810260121v1f0b2a0el28ae227ef107e185@mail.gmail.com> On Sun, Oct 26, 2008 at 12:54 AM, Dong-Jae Kang wrote: > > Under this trend, > I just wonder whether the trend can be reflected to cgroup based > controllers or not. Potentially, but I'm not sure that anyone is looking at the kind of thing that you're describing. Feel free to post a design for it if you have some concrete ideas. Paul From minchan.kim at gmail.com Sun Oct 26 19:34:34 2008 From: minchan.kim at gmail.com (MinChan Kim) Date: Mon, 27 Oct 2008 11:34:34 +0900 Subject: [Question] power management related with cgroup based resource management In-Reply-To: <2891419e0810260054p777a0602ndb4242628a7503d2@mail.gmail.com> References: <2891419e0810201954q57087fc8ufcaa0e42f3ca99e2@mail.gmail.com> <4901B5C2.9070108@linux.vnet.ibm.com> <2891419e0810240524u746b731cjcbd127fa4204bd4d@mail.gmail.com> <20081024232503.GA20140@srcf.ucam.org> <2891419e0810250105j3e210df0pa3ba9665bd35313a@mail.gmail.com> <6599ad830810250843u10f65917x3388276211e90316@mail.gmail.com> <2891419e0810260054p777a0602ndb4242628a7503d2@mail.gmail.com> Message-ID: <28c262360810261934g12aa6f15mbd133d39eb57b683@mail.gmail.com> Hi, Dong-Jae. > In some aspect, your opinion is right. > Existing controller(ex. disk IO controllers) can be run on new HW > devices(ex. SSD), existing block layer and so on. > > but, what I mean is that such controllers can support more performance > if the controllers are rewrited with reconsideration of the features > of new HW devices. 
in other words, what I mean can be optimization of > controllers for new devices > For example, > In case of SSD, current IO scheduler layer is needed ? although i can > not sure about it ^^ > or process sleep is needed after throwing the IO requests to storage ? > the role of page cache in SSD or NVRAM is less important than in > normal HDD and .... What you mention is already included in 2.6.28 merge window. I think we can use this feature on NVRAM, too. http://lwn.net/Articles/303270/ > I heard that many research centers in comanies and universities have > studied about smiliar research > of course, it can be OS itself, device drivers, block layer, file > systems and memory management > > Under this trend, > I just wonder whether the trend can be reflected to cgroup based > controllers or not. > and whether it is meaningful or not? > How do you think about this? > My opinion may be some humble ^^ I think it's not cgroup controller's role but each subsystem's one. As you can see above article, Many mainline guys try to improve performance in each subsystems. Do you have a scenario or idea how to use cgroup frame work to manage devices like NVRAM, SSD ?? > Thank you > -- > Best Regards, > Dong-Jae Kang > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo at vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > -- Kinds regards, MinChan Kim From s-uchida at ap.jp.nec.com Wed Oct 29 02:12:51 2008 From: s-uchida at ap.jp.nec.com (Satoshi UCHIDA) Date: Wed, 29 Oct 2008 18:12:51 +0900 Subject: [PATCH][cfq-cgroups] Introduce cgroups structure with ioprio entry. Message-ID: This patch introcude cfq_cgroup structure which is type for group control within expanded CFQ scheduler. In addition, the cfq_cgroup structure has "ioprio" entry which is preference of group for I/O. 
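As a usage sketch (assuming the hierarchy is mounted at /cgroup and the default subsystem prefix is in effect, so the file shows up as "cfq.ioprio"; the mount point and group name here are illustrative only), the per-group priority can be set from userspace like this:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write an ioprio value (0..7, default 3 per the patch) into a cfq cgroup. */
static int set_cfq_group_ioprio(const char *group, int ioprio)
{
	char path[256], buf[16];
	int fd, len;

	snprintf(path, sizeof(path), "/cgroup/%s/cfq.ioprio", group);
	len = snprintf(buf, sizeof(buf), "%d", ioprio);

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	if (write(fd, buf, len) != len) {
		close(fd);
		return -1;
	}
	return close(fd);
}

Values outside the 0-7 range are rejected by the write handler with -EINVAL.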
Signed-off-by: Satoshi UCHIDA --- block/cfq-cgroup.c | 148 +++++++++++++++++++++++++++++++++++++++++ include/linux/cgroup_subsys.h | 6 ++ 2 files changed, 154 insertions(+), 0 deletions(-) diff --git a/block/cfq-cgroup.c b/block/cfq-cgroup.c index aaa00ef..733980d 100644 --- a/block/cfq-cgroup.c +++ b/block/cfq-cgroup.c @@ -15,6 +15,154 @@ #include #include +#define CFQ_CGROUP_MAX_IOPRIO (7) + + +struct cfq_cgroup { + struct cgroup_subsys_state css; + unsigned int ioprio; +}; + +static inline struct cfq_cgroup *cgroup_to_cfq_cgroup(struct cgroup *cont) +{ + return container_of(cgroup_subsys_state(cont, cfq_subsys_id), + struct cfq_cgroup, css); +} + +static inline struct cfq_cgroup *task_to_cfq_cgroup(struct task_struct *tsk) +{ + return container_of(task_subsys_state(tsk, cfq_subsys_id), + struct cfq_cgroup, css); +} + + +static struct cgroup_subsys_state * +cfq_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) +{ + struct cfq_cgroup *cfqc; + + if (!capable(CAP_SYS_ADMIN)) + return ERR_PTR(-EPERM); + + if (!cgroup_is_descendant(cont)) + return ERR_PTR(-EPERM); + + cfqc = kzalloc(sizeof(struct cfq_cgroup), GFP_KERNEL); + if (unlikely(!cfqc)) + return ERR_PTR(-ENOMEM); + + cfqc->ioprio = 3; + + return &cfqc->css; +} + +static void cfq_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cont) +{ + kfree(cgroup_to_cfq_cgroup(cont)); +} + +static ssize_t cfq_cgroup_read(struct cgroup *cont, struct cftype *cft, + struct file *file, char __user *userbuf, + size_t nbytes, loff_t *ppos) +{ + struct cfq_cgroup *cfqc; + char *page; + ssize_t ret; + + page = (char *)__get_free_page(GFP_TEMPORARY); + if (!page) + return -ENOMEM; + + cgroup_lock(); + if (cgroup_is_removed(cont)) { + cgroup_unlock(); + ret = -ENODEV; + goto out; + } + + cfqc = cgroup_to_cfq_cgroup(cont); + + cgroup_unlock(); + + /* print priority */ + ret = snprintf(page, PAGE_SIZE, "%d \n", cfqc->ioprio); + + ret = simple_read_from_buffer(userbuf, nbytes, ppos, page, ret); + +out: + free_page((unsigned long)page); + return ret; +} + +static ssize_t cfq_cgroup_write(struct cgroup *cont, struct cftype *cft, + struct file *file, const char __user *userbuf, + size_t nbytes, loff_t *ppos) +{ + struct cfq_cgroup *cfqc; + ssize_t ret; + long new_prio; + int err; + char *buffer = NULL; + + cgroup_lock(); + if (cgroup_is_removed(cont)) { + cgroup_unlock(); + ret = -ENODEV; + goto out; + } + + cfqc = cgroup_to_cfq_cgroup(cont); + cgroup_unlock(); + + /* set priority */ + buffer = kmalloc(nbytes + 1, GFP_KERNEL); + if (buffer == NULL) + return -ENOMEM; + + if (copy_from_user(buffer, userbuf, nbytes)) { + ret = -EFAULT; + goto out; + } + buffer[nbytes] = 0; + + err = strict_strtoul(buffer, 10, &new_prio); + if ((err) || ((new_prio < 0) || (new_prio > CFQ_CGROUP_MAX_IOPRIO))) { + ret = -EINVAL; + goto out; + } + + cfqc->ioprio = new_prio; + + ret = nbytes; + +out: + kfree(buffer); + + return ret; +} + +static struct cftype files[] = { + { + .name = "ioprio", + .read = cfq_cgroup_read, + .write = cfq_cgroup_write, + }, +}; + +static int cfq_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cont) +{ + return cgroup_add_files(cont, ss, files, ARRAY_SIZE(files)); +} + +struct cgroup_subsys cfq_subsys = { + .name = "cfq", + .create = cfq_cgroup_create, + .destroy = cfq_cgroup_destroy, + .populate = cfq_cgroup_populate, + .subsys_id = cfq_subsys_id, +}; + + /* * sysfs parts below --> */ diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h index 9c22396..a9482aa 100644 --- a/include/linux/cgroup_subsys.h +++ 
b/include/linux/cgroup_subsys.h @@ -54,3 +54,9 @@ SUBSYS(freezer) #endif /* */ + +#ifdef CONFIG_IOSCHED_CFQ_CGROUP +SUBSYS(cfq) +#endif + +/* */ -- 1.5.6.5 From s-uchida at ap.jp.nec.com Fri Oct 31 04:49:57 2008 From: s-uchida at ap.jp.nec.com (Satoshi UCHIDA) Date: Fri, 31 Oct 2008 20:49:57 +0900 Subject: [PATCH][cfq-cgroups] Interface for parameter of cfq driver data Message-ID: This patch add a interface for parameter of cfq driver data. Signed-off-by: Satoshi UCHIDA --- block/cfq-cgroup.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++++- 1 files changed, 58 insertions(+), 1 deletions(-) diff --git a/block/cfq-cgroup.c b/block/cfq-cgroup.c index 4938fa0..776874d 100644 --- a/block/cfq-cgroup.c +++ b/block/cfq-cgroup.c @@ -1028,6 +1028,62 @@ STORE_FUNCTION(cfq_cgroup_slice_async_rq_store, cfq_slice_async_rq, 1, STORE_FUNCTION(cfq_cgroup_ioprio_store, ioprio, 0, CFQ_CGROUP_MAX_IOPRIO, 0); #undef STORE_FUNCTION +static ssize_t +cfq_cgroup_var_show2(unsigned int var, char *page) +{ + return snprintf(page, PAGE_SIZE, "%d\n", var); +} + +static ssize_t +cfq_cgroup_var_store2(unsigned int *var, const char *page, size_t count) +{ + int err; + char *p = (char *) page; + unsigned long new_var; + + err = strict_strtoul(p, 10, &new_var); + if (err) + count = 0; + + *var = new_var; + + return count; +} + +#define SHOW_FUNCTION2(__FUNC, __VAR, __CONV) \ +static ssize_t __FUNC(elevator_t *e, char *page) \ +{ \ + struct cfq_data *cfqd = e->elevator_data; \ + struct cfq_driver_data *cfqdd = cfqd->cfqdd; \ + unsigned int __data = __VAR; \ + if (__CONV) \ + __data = jiffies_to_msecs(__data); \ + return cfq_cgroup_var_show2(__data, (page)); \ +} +SHOW_FUNCTION2(cfq_cgroup_slice_cgroup_show, cfqdd->cfq_cgroup_slice, 1); +#undef SHOW_FUNCTION2 + +#define STORE_FUNCTION2(__FUNC, __PTR, MIN, MAX, __CONV) \ +static ssize_t __FUNC(elevator_t *e, const char *page, size_t count) \ +{ \ + struct cfq_data *cfqd = e->elevator_data; \ + struct cfq_driver_data *cfqdd = cfqd->cfqdd; \ + unsigned int __data; \ + int ret = cfq_cgroup_var_store2(&__data, (page), count); \ + if (__data < (MIN)) \ + __data = (MIN); \ + else if (__data > (MAX)) \ + __data = (MAX); \ + if (__CONV) \ + *(__PTR) = msecs_to_jiffies(__data); \ + else \ + *(__PTR) = __data; \ + return ret; \ +} +STORE_FUNCTION2(cfq_cgroup_slice_cgroup_store, &cfqdd->cfq_cgroup_slice, 1, + UINT_MAX, 1); +#undef STORE_FUNCTION2 + #define CFQ_CGROUP_ATTR(name) \ __ATTR(name, S_IRUGO|S_IWUSR, cfq_cgroup_##name##_show, \ cfq_cgroup_##name##_store) @@ -1043,6 +1099,7 @@ static struct elv_fs_entry cfq_cgroup_attrs[] = { CFQ_CGROUP_ATTR(slice_async_rq), CFQ_CGROUP_ATTR(slice_idle), CFQ_CGROUP_ATTR(ioprio), + CFQ_CGROUP_ATTR(slice_cgroup), __ATTR_NULL }; -- 1.5.6.5
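Expanded by hand, STORE_FUNCTION2(cfq_cgroup_slice_cgroup_store, &cfqdd->cfq_cgroup_slice, 1, UINT_MAX, 1) generates roughly the following store handler (simplified: the upper-bound clamp against UINT_MAX is dropped here because it can never trigger for an unsigned int):

static ssize_t cfq_cgroup_slice_cgroup_store(elevator_t *e, const char *page,
					     size_t count)
{
	struct cfq_data *cfqd = e->elevator_data;
	struct cfq_driver_data *cfqdd = cfqd->cfqdd;
	unsigned int __data;
	int ret = cfq_cgroup_var_store2(&__data, page, count);

	/* clamp to the minimum of 1 requested by the instantiation */
	if (__data < 1)
		__data = 1;

	/* the conversion flag is 1, so the value is stored in jiffies */
	cfqdd->cfq_cgroup_slice = msecs_to_jiffies(__data);

	return ret;
}

The matching CFQ_CGROUP_ATTR(slice_cgroup) entry then exposes this as a read-write attribute in the elevator's sysfs directory (e.g. /sys/block/<dev>/queue/iosched/slice_cgroup when the expanded CFQ scheduler is the active elevator; the exact path is an assumption, not something stated in the patch).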