Re: [patch 11/13] don't hold i_sem during O_DIRECT writes to blockdevs

From: Linus Torvalds (torvalds@transmeta.com)
Date: Sun Jul 28 2002 - 19:04:53 EST


On Sun, 28 Jul 2002, Andrew Morton wrote:
>
> We're moving in the direction of deprecating the raw driver and
> recommending that applications use O_DIRECT reads and writes against
> blockdevs.

This should probably be done unconditionally or not at all.

We've worked very hard on making block devices more "normal" in 2.5.x, and
I don't want to start diverging again.

If this is really a scalability issue, I would suggest that people who
care look into just getting rid of "i_sem", and replacing it with a
read-write semaphore that explicitly protects only "i_size". Then you make
reads and non-extending writes take that semaphore for reading, and
extending writes and truncates taking it for writing.

[ The "nonextending writes" case is somewhat interesting, a write probably
  needs to actually take the semaphore for writing, and then downgrading
  it to reading after it has checked that it doesn't end up extending the
  file.

  What makes this even more interesting is that depending on the semaphore
  implementation you can actually split up the "take write lock" into
  "prepare to take write lock" and "turn it into a read lock" or "confirm
  write lock", where the "prepare to take write lock" allows existing
  readers but not new write-lockers, so that if you downgrade to a read
  lock you never had to synchronize with anybody else who was already
  reading. ]

I'd much rather do this _right_ than have some ugly blockdev-only hack,
since the problem certainly would happen with files too. A lot of people
want to do databases on a filesystem, just because it is so much easier to
administer.

                Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Tue Jul 30 2002 - 14:00:31 EST