A Method for Accurate High-Level Performance Evaluation of MPSoC Architectures Using Fine-Grained Generated Traces

Roman Plyaskin and Andreas Herkersdorf

Institute for Integrated Systems, Technische Universität München, Arcisstr. 21, 80290 Munich, Germany
{roman.plyaskin,herkersdorf}@tum.de
http://www.lis.ei.tum.de

Abstract. Performance evaluation at system level has become a prerequisite in the design process of modern System-on-Chip (SoC) architectures, and the research community has proposed many simulation-based methods in response. In trace-based simulations, the performance of SoC architectures is evaluated using abstracted traces. This paper presents an approach for generating traces at the instruction level from target SW code executed on a cycle-accurate CPU simulator. We show that fine-grained traces provide accuracy above 95% while increasing simulation performance by a factor of 1.3 to 3.8 compared to the reference cycle-accurate simulator. The resulting traces are used during high-level explorations in our trace-driven SystemC TLM simulator, in which the performance of MPSoC (Multiprocessor SoC) architectures with a variable number of CPUs, diverse memory hierarchies and on-chip interconnect can be evaluated.
1 Introduction

The constantly increasing complexity of System-on-Chip (SoC) architectures, driven by the rising number of transistors on a single chip, poses new challenges in the design of integrated circuits. In order to shorten product development cycles under high time-to-market pressure, early system-level modeling and simulation have become a necessary part of the design process. Because of their higher complexity and many low-level details, cycle-accurate system-level simulations are not feasible from the perspective of simulation time. At system level, components are typically modeled at a high level of abstraction, allowing faster and more flexible design space exploration. The main challenge is therefore to perform system-level simulations quickly and accurately at the same time.

For the performance evaluation, which is addressed in this paper, trace-based simulations have been widely used for general-purpose computer systems as well as for systems-on-chip [7,8,10,11,13,15]. The trace-based approach represents hardware components as black-box modules that either perform internal processing or issue read or write requests on the communication infrastructure.

C. Müller-Schloer, W. Karl, and S. Yehia (Eds.): ARCS 2010, LNCS 5974, pp. 199–210, 2010. © Springer-Verlag Berlin Heidelberg 2010
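As a rough illustration of the trace-based approach described in the introduction, the sketch below models a component's trace as a sequence of events that are either internal processing or a read or write request, and replays it to estimate a cycle count. The event classes, the `replay` function, the example trace and the fixed latencies are all hypothetical names and values chosen for this illustration; they are not the trace format or the simulator of the paper.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Compute:
    """Internal processing: the component is busy for `cycles` cycles."""
    cycles: int


@dataclass(frozen=True)
class Access:
    """A read or write request issued on the communication infrastructure."""
    kind: str      # "read" or "write"
    address: int


TraceEvent = Union[Compute, Access]


def replay(trace: List[TraceEvent], read_latency: int, write_latency: int) -> int:
    """Return the total cycles needed to replay `trace` on one component.

    Assumption: every request has a fixed latency. A system-level model
    would instead obtain the latency from the simulated memory hierarchy
    and interconnect.
    """
    total = 0
    for event in trace:
        if isinstance(event, Compute):
            total += event.cycles
        elif event.kind == "read":
            total += read_latency
        elif event.kind == "write":
            total += write_latency
        else:
            raise ValueError(f"unknown access kind: {event.kind!r}")
    return total


# Hypothetical trace: compute, load, compute, store.
example_trace = [
    Compute(cycles=12),
    Access(kind="read", address=0x1000),
    Compute(cycles=5),
    Access(kind="write", address=0x1004),
]

if __name__ == "__main__":
    print(replay(example_trace, read_latency=10, write_latency=4))
```

Because the component is a black box, only the request latencies need to change to explore a different memory hierarchy or interconnect; the trace itself stays the same.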