Re: [00/17] Large Blocksize Support V3

From: Nick Piggin
Date: Thu Apr 26 2007 - 11:38:57 EST


David Chinner wrote:
On Thu, Apr 26, 2007 at 04:10:32AM -0600, Eric W. Biederman wrote:
Ok. Now why are high end hardware manufacturers building crippled
hardware? Or is there only an 8bit field in SCSI for describing
scatter gather entries? Although I would think this would be
more of a controller rather than a drive issue.


scsi.h:

/*
* The maximum sg list length SCSI can cope with
* (currently must be a power of 2 between 32 and 256)
*/
#define SCSI_MAX_PHYS_SEGMENTS MAX_PHYS_SEGMENTS

And from blkdev.h:

#define MAX_PHYS_SEGMENTS 128
#define MAX_HW_SEGMENTS 128

So currently on SCSI we are limited to 128 s/g entries, and the
maximum is 256. So I'd say we've got good grounds for needing
contiguous pages to go beyond 1MB I/O size on x86_64.
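To make the arithmetic behind that claim explicit, here is a minimal sketch. It assumes the worst case where no two pages of a request are physically contiguous, so each s/g entry covers exactly one page. The helper name max_io_bytes is hypothetical and is not a kernel function; the 128 and 256 figures are the ones quoted from blkdev.h and scsi.h above.

```c
#include <assert.h>
#include <stddef.h>

/* Limits quoted above: current value from blkdev.h, ceiling from scsi.h. */
#define MAX_PHYS_SEGMENTS 128
#define SCSI_SG_CEILING   256

/*
 * Hypothetical helper, for illustration only: the largest I/O that fits
 * in one request when every s/g entry describes a single page, i.e. the
 * pages are not physically contiguous and cannot be merged.
 */
size_t max_io_bytes(size_t sg_entries, size_t page_size)
{
    assert(page_size > 0);
    return sg_entries * page_size;
}
```

With 4 KiB pages, max_io_bytes(MAX_PHYS_SEGMENTS, 4096) is 512 KiB, and raising the limit to SCSI_SG_CEILING gives 1 MiB. Getting past that needs either more s/g entries or entries that each cover more than one page.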

Or good grounds to increase the sg limit and push for io controller
manufacturers to do the same. If we have a hack in the kernel that
mostly works, they won't.

Page colouring was always rejected, and lots of people who knew
better got upset because it was the only way the hardware would go
fast...


And what do we do for arches that can't do multiple page sizes, or
only have a limited and mostly useless set of page sizes to choose
from?

You have HW_PAGE_SIZE != PAGE_SIZE.


That's rather wasteful, though. Better to only use the large pages
when the filesystem needs them rather than penalise all filesystems.

But 16k pages are fine for ia64. While you're talking about special
casing stuff, surely a bigger page size could be the config option
instead of higher order pagecache.


That is, you hide from the bulk of the kernel the fact that
struct page manages 2 or more real hardware pages.
But you expose it to the handful of places that actually care.
Partly this is a path you are starting down in your patches, with
larger page cache support.
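A rough sketch of the HW_PAGE_SIZE != PAGE_SIZE idea, and of why it was called wasteful, might look like the following. Everything here is an assumption for illustration: the 4 KiB hardware page, and the helper names soft_page_size and cache_slack, which are not kernel interfaces. The "order" parameter is the log2 of how many hardware pages one struct page would cover.

```c
#include <stddef.h>

/* Assumed for illustration: 4 KiB hardware pages. */
#define HW_PAGE_SHIFT 12
#define HW_PAGE_SIZE  ((size_t)1 << HW_PAGE_SHIFT)

/* Software page size when one struct page covers 2^order hardware pages. */
size_t soft_page_size(unsigned order)
{
    return HW_PAGE_SIZE << order;
}

/*
 * Bytes of page cache left unused when a file of file_bytes is cached
 * in whole software pages. This is the cost paid by every filesystem
 * if the larger page size is applied globally.
 */
size_t cache_slack(size_t file_bytes, unsigned order)
{
    size_t psize = soft_page_size(order);
    size_t rounded = (file_bytes + psize - 1) / psize * psize;

    return rounded - file_bytes;
}
```

Under these assumptions, a 1000-byte file leaves 3096 bytes unused at order 0 and 15384 bytes unused at order 2, which is the penalty on small files being argued about here.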


Right, exactly. So apart from the contiguous allocation issue, you think
we are doing the right thing?

You could put it that way. Or that it is wrong because of the
fragmentation problem. Realise that it is somewhat fundamental
considering that it is basically an unsolvable problem with our
current kernel assumptions of unconstrained kernel allocations and
a 1:1 kernel mapping.

--
SUSE Labs, Novell Inc.