Re: [patch 1/2] mm: fincore()

From: Johannes Weiner
Date: Fri Feb 15 2013 - 18:13:34 EST

On Fri, Feb 15, 2013 at 01:27:38PM -0800, Andrew Morton wrote:
> On Fri, 15 Feb 2013 01:34:50 -0500
> Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > + * The status is returned in a vector of bytes. The least significant
> > + * bit of each byte is 1 if the referenced page is in memory, otherwise
> > + * it is zero.
> Also, this is going to be dreadfully inefficient for some obvious cases.
> We could address that by returning the info in some more efficient
> representation. That will be run-length encoded in some fashion.
> The obvious way would be to populate an array of
> struct page_status {
> u32 present:1;
> u32 count:31;
> };
> or whatever.

I'm having a hard time seeing how this could be extended to more
status bits without stifling the optimization too much. If we just
add more status bits to one page_status, the likelihood of long runs
where all bits are in agreement decreases. But as the optimization
becomes less and less effective, we are stuck with an interface that
is more PITA than just using mmap and mincore again.

The user has to supply a worst-case-sized vector with one struct
page_status per page in the range, but the per-page item will be
bigger than with the byte vector because of the additional run length

> Another way would be to define the syscall so it returns "number of
> pages present/absent starting at offset `start'". In other words, one
> call to fincore() will return a single `struct page_status'. Userspace
> can then walk through the file and generate the full picture, if needed.
> This also gets inefficient in obvious cases, but it's not as obviously
> bad?

Any run-length encoding will have a problem with multiple status bits,
I guess.

Maybe with a mask of bits the user is interested in?

struct page_status {
unsigned long states;
unsigned long count;

int fincore(int fd, loff_t start, loff_t len,
unsigned long states_mask,
struct page_status *status)

However, one struct page_status per run leaves you with a worst case
of one syscall per page in the range.

I dunno. The byte vector might not be optimal but its worst cases
seem more attractive, is just as extensible, and dead simple to use.
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at