Re: [problem] raid performance loss with 2.6.26-rc8 on 32-bit x86 (bisected)

From: Mel Gorman
Date: Tue Jul 01 2008 - 04:09:34 EST


(Christoph's address corrected and Andy's added to cc)

On (30/06/08 18:57), Dan Williams didst pronounce:
> Hello,
>
> Prompted by a report from a user I have bisected a performance loss
> apparently introduced by commit 54a6eb5c (mm: use two zonelist that are
> filtered by GFP mask). The test is simple sequential writes to a 4 disk
> raid5 array. Performance should be about 20% greater than 2.6.25 due to
> commit 8b3e6cdc (md: introduce get_priority_stripe() to improve raid456
> write performance). The sample data below shows sporadic performance
> starting at 54a6eb5c. The '+' indicates where I hand applied 8b3e6cdc.
>
> revision     2.6.25.8-fc8  2.6.25.9+  dac1d27b+  18ea7e71+  54a6eb5c+  2.6.26-rc1  2.6.26-rc8
>              138           168        169        167        177        149         144
>              140           168        172        170        109        138         142
>              142           165        169        164        119        138         129
>              144           168        169        171        120        139         135
>              142           165        174        166        165        122         154
> MB/s (avg)   141           167        171        168        138        137         141
> % change     0%            18%        21%        19%        -2%        -3%         0%
> result       base          good       good       good       [bad]      bad         bad
>

That is not good at all, as this patch is not a straightforward revert and
this is the second time it has come under suspicion.

> Notable observations:
> 1/ This problem does not reproduce when ARCH=x86_64, i.e. 2.6.26-rc8 and
> 54a6eb5c show consistent performance at 170MB/s.

I'm very curious as to why this doesn't affect x86_64. HIGHMEM is one
possibility if GFP_KERNEL allocations are a major factor, as the allocator
then has to scan over the unusable HIGHMEM zone a lot. However, another
remote possibility is that function calls are more expensive on x86 than on
x86_64 (this is a wild guess based on the number of registers available).
A speculative patch is below.
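
To illustrate where the cost might land, the allocator hot path walks the
zonelist on every allocation with something like the following (a
simplified sketch of the loop in get_page_from_freelist(), not the exact
code):

	struct zoneref *z;
	struct zone *zone;

	/*
	 * Zonelists are ordered highest zone first, so on 32-bit with
	 * HIGHMEM a GFP_KERNEL allocation (high_zoneidx == ZONE_NORMAL)
	 * must step over the ZONE_HIGHMEM entry every time. With
	 * next_zones_zonelist() out of line, each step is an extra
	 * function call on this fast path.
	 */
	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
		if (!zone_watermark_ok(zone, order, mark,
					classzone_idx, alloc_flags))
			continue;
		/* ... try to allocate from this zone ... */
	}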

If 8b3e6cdc is reverted from 2.6.26-rc8, what do the figures look like?
i.e. does the zonelist filtering look like a performance regression in its
own right, or is it just somehow negating the benefit of the raid patch?

> 2/ Single drive performance appears to be unaffected
> 3/ A quick test shows that raid0 performance is also sporadic:
> 2147483648 bytes (2.1 GB) copied, 7.72408 s, 278 MB/s
> 2147483648 bytes (2.1 GB) copied, 7.78478 s, 276 MB/s
> 2147483648 bytes (2.1 GB) copied, 11.0323 s, 195 MB/s
> 2147483648 bytes (2.1 GB) copied, 8.41244 s, 255 MB/s
> 2147483648 bytes (2.1 GB) copied, 30.7649 s, 69.8 MB/s
>

Are these synced writes? i.e. is it possible that performance drops towards
the end because memory has filled up with dirty pages by that point?
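
If not, it may be worth taking the page cache out of the picture to see if
the sporadic results persist, with something like (assuming your dd
supports O_DIRECT):

for i in `seq 1 5`; do dd if=/dev/zero of=/dev/md0 bs=1024k count=2048 oflag=direct; done

Alternatively, conv=fdatasync would at least include the final flush in
the timing.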

> System/Test configuration:
> (2) Intel(R) Xeon(R) CPU 5150
> mem=1024M
> CONFIG_HIGHMEM4G=y (full config attached)
> mdadm --create /dev/md0 /dev/sd[b-e] -n 4 -l 5 --assume-clean
> for i in `seq 1 5`; do dd if=/dev/zero of=/dev/md0 bs=1024k count=2048; done
>
> Neil suggested CONFIG_NOHIGHMEM=y, I will give that a shot tomorrow.
> Other suggestions / experiments?
>

There was a deporkify patch which replaced inline functions with normal
functions. I worried at the time that all the function calls in an
iterator might cause a performance problem, but I couldn't measure it so
I assumed the reduction in text size was a plus. Below is a partial
reporkify patch that should reduce the number of function calls that
take place, at the cost of larger text. Can you try it applied against
2.6.26-rc8 please?

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.26-rc8-clean/include/linux/mmzone.h linux-2.6.26-rc8-repork/include/linux/mmzone.h
--- linux-2.6.26-rc8-clean/include/linux/mmzone.h 2008-06-24 18:58:20.000000000 -0700
+++ linux-2.6.26-rc8-repork/include/linux/mmzone.h 2008-07-01 00:49:17.000000000 -0700
@@ -742,6 +742,15 @@ static inline int zonelist_node_idx(stru
#endif /* CONFIG_NUMA */
}

+static inline int zref_in_nodemask(struct zoneref *zref, nodemask_t *nodes)
+{
+#ifdef CONFIG_NUMA
+ return node_isset(zonelist_node_idx(zref), *nodes);
+#else
+ return 1;
+#endif /* CONFIG_NUMA */
+}
+
/**
* next_zones_zonelist - Returns the next zone at or below highest_zoneidx within the allowed nodemask using a cursor within a zonelist as a starting point
* @z - The cursor used as a starting point for the search
@@ -754,10 +763,26 @@ static inline int zonelist_node_idx(stru
* search. The zoneref returned is a cursor that is used as the next starting
* point for future calls to next_zones_zonelist().
*/
-struct zoneref *next_zones_zonelist(struct zoneref *z,
+static inline struct zoneref *next_zones_zonelist(struct zoneref *z,
enum zone_type highest_zoneidx,
nodemask_t *nodes,
- struct zone **zone);
+ struct zone **zone)
+{
+ /*
+ * Find the next suitable zone to use for the allocation.
+ * Only filter based on nodemask if it's set
+ */
+ if (likely(nodes == NULL))
+ while (zonelist_zone_idx(z) > highest_zoneidx)
+ z++;
+ else
+ while (zonelist_zone_idx(z) > highest_zoneidx ||
+ (z->zone && !zref_in_nodemask(z, nodes)))
+ z++;
+
+ *zone = zonelist_zone(z++);
+ return z;
+}

/**
* first_zones_zonelist - Returns the first zone at or below highest_zoneidx within the allowed nodemask in a zonelist
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.26-rc8-clean/mm/mmzone.c linux-2.6.26-rc8-repork/mm/mmzone.c
--- linux-2.6.26-rc8-clean/mm/mmzone.c 2008-06-24 18:58:20.000000000 -0700
+++ linux-2.6.26-rc8-repork/mm/mmzone.c 2008-07-01 00:48:19.000000000 -0700
@@ -42,33 +42,3 @@ struct zone *next_zone(struct zone *zone
return zone;
}

-static inline int zref_in_nodemask(struct zoneref *zref, nodemask_t *nodes)
-{
-#ifdef CONFIG_NUMA
- return node_isset(zonelist_node_idx(zref), *nodes);
-#else
- return 1;
-#endif /* CONFIG_NUMA */
-}
-
-/* Returns the next zone at or below highest_zoneidx in a zonelist */
-struct zoneref *next_zones_zonelist(struct zoneref *z,
- enum zone_type highest_zoneidx,
- nodemask_t *nodes,
- struct zone **zone)
-{
- /*
- * Find the next suitable zone to use for the allocation.
- * Only filter based on nodemask if it's set
- */
- if (likely(nodes == NULL))
- while (zonelist_zone_idx(z) > highest_zoneidx)
- z++;
- else
- while (zonelist_zone_idx(z) > highest_zoneidx ||
- (z->zone && !zref_in_nodemask(z, nodes)))
- z++;
-
- *zone = zonelist_zone(z++);
- return z;
-}
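
For reference, the text-size cost of the patch can be eyeballed by
comparing clean and patched builds with something like:

	size vmlinux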