Open64 on MIPS: Porting and enhancing Open64 for Loongson II. Loongcc Group, ICT, Beijing. Seattle, Mar. 21, 2009.


1 Open64 on MIPS: Porting and enhancing Open64 for Loongson II. Loongcc Group, ICT, Beijing. Seattle, Mar. 21, 2009

2 Outline: What’s Loongson II? What’s Loongcc? How Loongcc works, with art as the example. The porting process and performance evaluation.

3 The chip: Loongson 2F in the Loongson II family. Features: 64-bit, out-of-order, 4-issue (0.8~1 GHz); MIPS III-compatible; on-chip 64K/64K L1 cache, 512K L2 cache; on-chip MMU supporting DDR2 (533 MHz).

4 The chip

5

6 Loongcc: Yet another Open64 branch. Targets the Loongson family. Aims: good performance, robustness, open source.

7 Loongcc

8 Loongcc’s transformation of art

9 Transformation of art Structure peeling produces temporary arrays

10 Structure peeling

11 50% cache line utilization

12 Structure peeling

13 100% cache line utilization
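
The peeling shown on slides 9 to 13 can be sketched in C as follows. This is a minimal illustration, not Loongcc's actual output: the struct `neuron`, its field names, and the hot/cold split are hypothetical. They are chosen so that, before peeling, the hot loop reads one 8-byte field out of each 16-byte element (half of every fetched cache line is unused), and after peeling it walks a dense array.

```c
#include <stddef.h>

/* Before peeling: array of structures. A loop that reads only `hot`
 * still pulls `cold` into the cache, so half of each line is wasted. */
struct neuron {
    double hot;   /* hypothetical field read in the hot loop */
    double cold;  /* hypothetical field that is rarely touched */
};

double sum_hot_aos(const struct neuron *n, size_t count)
{
    double sum = 0.0;
    for (size_t i = 0; i < count; i++)
        sum += n[i].hot;
    return sum;
}

/* After peeling: each field lives in its own array, so the hot loop
 * uses every byte of each cache line it fetches. */
struct neurons_peeled {
    double *hot;
    double *cold;
};

double sum_hot_peeled(const struct neurons_peeled *n, size_t count)
{
    double sum = 0.0;
    for (size_t i = 0; i < count; i++)
        sum += n->hot[i];
    return sum;
}

/* Copy an array of structures into the peeled layout.
 * The caller provides out->hot and out->cold with room for `count` items. */
void peel(const struct neuron *in, struct neurons_peeled *out, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        out->hot[i] = in[i].hot;
        out->cold[i] = in[i].cold;
    }
}
```

Both sum functions return the same value; only the memory layout, and therefore the cache line utilization, differs.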

14 Temporary arrays: special pattern of visit. In a loop: A := i + C; t := ||A||; B := t^(-1) A; … next iteration of loop.

15 Temporary arrays: special pattern of visit. In a loop: A := i + C; t := ||A||; B := t^(-1) A; … next iteration of loop.

16 Temporary arrays: special pattern of visit. In a loop: A := i + C; t := ||A||; B := t^(-1) A; … next iteration of loop.

17 Temporary arrays: special pattern of visit. In a loop: A := i + C; t := ||A||; B := t^(-1) A; … next iteration of loop.

18 Temporary arrays: special pattern of visit. In a loop: A := i + C; t := ||A||; B := t^(-1) A; … next iteration of loop. Old values are always killed. No need to write dirty cache lines of A to memory after use.

19 Temporary arrays

20

21 Need to write them to memory!

22 Temporary arrays

23 Write more to memory.

24 Temporary arrays

25 Write misses!

26 Problems with temporary arrays Unnecessary writes to memory Large cache footprint

27 Temporary arrays Solution?

28 Problems with temporary arrays Solution? Contraction?

29 Temporary arrays: special pattern of visit. In a loop: A := i + C; t := ||A||; B := t^(-1) A; … next iteration of loop. This prevents array contraction: all of A must be ready before any of B.
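
The access pattern on this slide can be written out in C to show where contraction is blocked. This is a sketch under assumptions: the slide's `i` is read as an input vector (named `in` here), the norm is taken to be the 1-norm to avoid a math-library dependency, and the function name is hypothetical.

```c
#include <stddef.h>

/* Absolute value without <math.h>, to keep the sketch dependency-free. */
static double abs_val(double x)
{
    return x < 0.0 ? -x : x;
}

/* One iteration of the outer loop from the slide:
 *   A := I + C;  t := ||A||;  B := t^(-1) A
 * The caller must ensure the norm t is nonzero. */
void normalize_step(const double *in, const double *c,
                    double *a, double *b, size_t n)
{
    double t = 0.0;

    /* Every element of A is overwritten: old values are always killed. */
    for (size_t j = 0; j < n; j++) {
        a[j] = in[j] + c[j];
        t += abs_val(a[j]);
    }

    /* t depends on all of A, so no b[j] can be computed until the loop
     * above has finished. This reduction is what stops A from being
     * contracted to a scalar. */
    for (size_t j = 0; j < n; j++)
        b[j] = a[j] / t;
}
```

Because the second loop cannot be fused into the first, A must exist as a full array between the two loops.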

30 Temporary arrays Solution?

31 Temporary arrays Solution? Overlay

32 Array overlay

33

34 No write misses. Not even cold misses.

35 Array overlay: Nothing is evicted from cache. No memory writes.

36 Array overlay

37 A is still in cache.

38 Array overlay: No writes to memory (as long as the data stays in cache)!

39 Smaller cache footprint!
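
One way to read the overlay slides is that the temporary arrays A and B share the same storage. The C sketch below illustrates that reading for the normalization pattern of slide 29. It is an assumption about the transformation, not Loongcc's actual generated code: the function name, the buffer name `buf`, and the choice of the 1-norm are all hypothetical.

```c
#include <stddef.h>

/* Absolute value without <math.h>, to keep the sketch dependency-free. */
static double abs_val(double x)
{
    return x < 0.0 ? -x : x;
}

/* A and B overlaid on one buffer. The caller must ensure the norm t is
 * nonzero. */
void normalize_step_overlaid(const double *in, const double *c,
                             double *buf, size_t n)
{
    double t = 0.0;

    for (size_t j = 0; j < n; j++) {
        buf[j] = in[j] + c[j];   /* buf plays the role of A */
        t += abs_val(buf[j]);
    }

    /* A[j] is dead as soon as B[j] is produced, so B can overwrite it.
     * The store hits a line that was just read, and only one array's
     * worth of cache is occupied. */
    for (size_t j = 0; j < n; j++)
        buf[j] /= t;             /* the same storage now holds B */
}
```

The overlay is valid here because each element of A is read exactly once after the norm is known, at the point where the matching element of B is written.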

40 Effect of Overlay On Loongson 2F, for art

41 Effect of Overlay On Loongson 2F, for art

42 Effect of Overlay On Loongson 2F, for art

43 Effect of Overlay On Loongson 2F, for art

44 Effect of Overlay On Loongson 2F, for art

45 Other source-to-source transformations: array transposition; flattening of multi-dimensional arrays to one dimension; structure splitting; special loop patterns.
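
As an illustration of the flattening item, the C sketch below converts a two-dimensional array reached through a table of row pointers into one contiguous row-major block. The function names and the pointer-table starting layout are assumptions made for the example; the slides do not show the layouts Loongcc actually handles.

```c
#include <stddef.h>

/* Before: rows are reached through a pointer table, so each element
 * access needs two dependent loads and rows may be scattered in memory. */
double sum_rows(double *const *rows, size_t nrows, size_t ncols)
{
    double sum = 0.0;
    for (size_t i = 0; i < nrows; i++)
        for (size_t j = 0; j < ncols; j++)
            sum += rows[i][j];
    return sum;
}

/* Flatten: copy into one contiguous row-major block.
 * `flat` must have room for nrows * ncols elements. */
void flatten(double *const *rows, size_t nrows, size_t ncols, double *flat)
{
    for (size_t i = 0; i < nrows; i++)
        for (size_t j = 0; j < ncols; j++)
            flat[i * ncols + j] = rows[i][j];
}

/* After: element (i, j) lives at flat[i * ncols + j], one load per access. */
double sum_flat(const double *flat, size_t nrows, size_t ncols)
{
    double sum = 0.0;
    for (size_t i = 0; i < nrows; i++)
        for (size_t j = 0; j < ncols; j++)
            sum += flat[i * ncols + j];
    return sum;
}
```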

46 Effect of source-to-source transformation of art.

47 Effect of source-to-source transformation: Works well when special patterns exist, such as a hot, large structure array. It works well for art and equake. Applying it to other benchmarks (SPEC2000INT) does not yield good gains (yet). It can only process C sources.

48 Source-to-source transformation. Pros: complete source-level information; human-readable intermediate results; natural representation of data structure transformations. Cons: must redo dataflow analysis, alias analysis, and collection of frequency information; interferes with all subsequent optimization passes.

49 Constructing Loongcc and its performance

50 Porting process: Merge the front/middle-end from PathScale® with our team’s ORC®-based back-end. Supports full SPEC2000. SPEC2006 support is in progress.

51 Porting process

52 Performance: We measure the contribution of an optimization by the performance loss when that optimization is disabled.

53 Performance comparison: Loongcc base = -O3 -ipa; Loongcc peak = follows the SPEC peak rules; GCC base = -O3 -march=loongson2f -mtune=loongson2f; GCC peak = mild tuning of flags. GFortran used.

54 Performance: Loongcc base outperforms GCC base by 13%/35%. Loongcc peak outperforms GCC peak by 28%/78%. Apologies: we are not real GCC experts.

55 SPEC2000INT

56

57 Loongcc base has delay slot filling. It is enhanced in Loongcc peak (bug fix and more arcs in the CG dependency graph). Forward scheduling in IGLS improves gap by 8%.

58 Prefetch: Stride prefetch improves mcf by 27%, parser by 4%, and gap by 6.3%.

59 Prefetch: Loongson 2F has only a “pseudo prefetch”: lbu %0,addr. Illegal address exceptions are suppressed. Higher cost. No effect for SPEC2000FP cases yet.
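
The idea of a load-based pseudo prefetch can be imitated in portable C by reading one byte and discarding it. The sketch below is only an analogy to the slide's instruction, not what Loongcc emits: the helper names and the `PREFETCH_AHEAD` distance are hypothetical, and, unlike the Loongson 2F form, a plain C load is not exempt from faults, so the code checks bounds before touching anything.

```c
#include <stddef.h>

/* Hypothetical prefetch distance in loop steps; a real value needs tuning. */
#define PREFETCH_AHEAD 16

/* Touch one byte so its cache line is brought in; the value is discarded.
 * The volatile qualifier keeps the compiler from deleting the load.
 * A C load can fault, so callers must pass only valid addresses. */
static void pseudo_prefetch(const void *addr)
{
    (void)*(const volatile unsigned char *)addr;
}

/* Strided sum that touches the element PREFETCH_AHEAD steps ahead of the
 * one being consumed. `stride` must be nonzero. */
long strided_sum(const long *data, size_t n, size_t stride)
{
    long sum = 0;
    for (size_t i = 0; i < n; i += stride) {
        size_t ahead = i + PREFETCH_AHEAD * stride;
        if (ahead < n)              /* stay in bounds: nothing suppresses faults here */
            pseudo_prefetch(&data[ahead]);
        sum += data[i];
    }
    return sum;
}
```

The extra load and bounds check are real instructions in the loop body, which matches the slide's remark that this form of prefetch has a higher cost than a dedicated prefetch instruction.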

60 Other optimizations: use of conditional move instructions; placing affine global data near each other; peephole optimizations in EBO.

61 SPEC2000FP

62 Loongcc compared to GCC: flush-to-zero mode; inlining.

63 SPEC2000FP: array contraction; source-to-source transformation; optimizing cache behavior.

64 Thank you! Questions please.

65 Answer to Questions: What’s the take-home message? We developed a working, open-source Open64 branch for MIPS with good performance. We showcase that source-to-source transformation is a good way to express some optimizations.

66 Answer to Questions: Why not CPU2006? Support is in progress.

67 Performance comparison: The GCC peak numbers are the best results from our tests of GCC 4.4, GCC 4.3, and a special branch for Loongson 2F from STMicroelectronics®. The GFortran of the corresponding version is used.

68 Question about source-to-source transformation: The source-to-source transformation is implemented as a plugin to CIL. It can only process C sources due to a restriction of the front-end. The frequency information has to be collected independently.

69 Source-to-source transformation

70

71 Recover index variable to avoid confusing Loongcc

72 Source-to-source transformation: CIL, the C Intermediate Language, is a source-to-source transformation framework. It provides dataflow analysis etc. and canonicalizes the C source.

73 Array contraction. Loop 1: def of A B C D; use of A B C D. Loop 2: def of A B C; use of A B C D. The missing def of D in loop 2 prevents direct contraction.

74 Array contraction. Loop 1: def of A B C D; use of A B C D. Loop 2: def of A B C D; use of A B C D. The missing def of D in loop 2 prevents direct contraction. Fix: rematerialize D.
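
The rematerialization step can be sketched in C with a reduced version of the slide's situation. Only two temporaries, A and D, are kept, and the arithmetic is invented for the example; the function names are hypothetical too. In the first function, loop 2 uses D without defining it, so D must survive as an array. In the second, D is recomputed in loop 2, after which both temporaries shrink to scalars.

```c
#include <stddef.h>

/* Before: loop 2 reuses d[] written by loop 1, so d[] cannot be
 * contracted, and a[] is kept as an array alongside it. */
double contract_before(const double *x, size_t n, double *a, double *d)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {   /* loop 1: def and use of A, D */
        a[i] = x[i] + 1.0;
        d[i] = x[i] * 2.0;
        acc += a[i] * d[i];
    }
    for (size_t i = 0; i < n; i++) {   /* loop 2: def of A only, use of A, D */
        a[i] = x[i] - 1.0;
        acc += a[i] * d[i];
    }
    return acc;
}

/* After: D is rematerialized (recomputed) in loop 2. Every use now has a
 * def in the same iteration, so A and D contract to scalars. */
double contract_after(const double *x, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        double a = x[i] + 1.0;
        double d = x[i] * 2.0;
        acc += a * d;
    }
    for (size_t i = 0; i < n; i++) {
        double a = x[i] - 1.0;
        double d = x[i] * 2.0;         /* rematerialized D */
        acc += a * d;
    }
    return acc;
}
```

Rematerialization trades a recomputation for the removal of an entire temporary array, which pays off only when D is cheap to recompute relative to the memory traffic saved.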

