15618 Multi-Core Cache Simulator
Links
- See here for the proposal.
- See here for the milestone report.
- See here for the final report.
- See here for the project website.
- See here for the project repository.
Summary
We are going to implement a trace-driven multi-core cache simulator supporting both snooping-based and directory-based cache coherence protocols. We further want to perform workload analysis for programs with different access patterns, locality, and sharing, and to study the effect of different interconnect topologies on cache performance.
Background
We studied multiple cache coherence protocols in lecture, such as MSI, MESI, and MOESI, along with two different implementation styles, namely snooping-based and directory-based. We are curious about their practical implications and their effect on the performance of a multi-core cache system.
We are also excited to study the effect of different cache line sizes and replacement policies on the performance of the multi-core cache system.
State diagram for MESI:
Design of snooping-based cache coherence:
Design of directory-based cache coherence:
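To make the MESI diagram concrete, below is a minimal C++ sketch (our own illustration, not SST code) of the stable states and the processor-read transition out of Invalid:

```cpp
// MESI stable states for a single cache line.
enum class MesiState { Modified, Exclusive, Shared, Invalid };

// Bus transactions a cache can issue or observe.
enum class BusMsg { None, BusRd, BusRdX, BusUpgr, Flush };

// Transition on a processor read (PrRd), following the state diagram above.
// `sharersExist` models the "shared" wire other caches assert on a BusRd.
inline MesiState onProcessorRead(MesiState s, bool sharersExist, BusMsg& out) {
    out = BusMsg::None;
    switch (s) {
    case MesiState::Invalid:
        out = BusMsg::BusRd;  // read miss: fetch the line from memory or a peer
        return sharersExist ? MesiState::Shared : MesiState::Exclusive;
    default:
        return s;  // M, E, and S all satisfy a read locally with no bus traffic
    }
}
```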
The challenges
- Correctly translating what we learned about cache coherence protocols in lecture into an actual implementation requires a firm understanding of each protocol's implications.
- Understanding the core APIs of [SST](https://github.com/sstsimulator), which has limited documentation and a small active community.
- Recording traces from program execution on a multi-core machine and feeding those traces to our implementation is something none of us has experience with.
- Devising appropriate test plans, programs, and workloads to stress the simulator and extract valuable insights is also a challenge.
- Measuring the performance of directory-based protocols depends heavily on accurate modeling of the interconnect and arbitration.
Resources
- We'll start implementing the cache system from scratch, building upon the core APIs exposed by [SST](https://github.com/sstsimulator).
- We plan to do development on local machines, then gather traces and run tests and benchmarks on the PSC machines. Another reason for using the PSC machines is that we want to study the effect of the number of cores on the scalability of different cache coherence implementations.
- We'll also use Intel Pin to record memory access traces on the PSC machines; a sketch of such a Pin tool appears below.
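As a rough sketch of the kind of Pin tool we have in mind (built on the standard Pin instrumentation API; the trace format and output file name are our own choices, not our final tool):

```cpp
// Minimal Pin tool sketch: log <thread id, R/W, effective address> per access.
#include "pin.H"
#include <cstdio>

static FILE* trace;

static VOID RecordRead(THREADID tid, VOID* addr) {
    fprintf(trace, "%u R %p\n", tid, addr);
}

static VOID RecordWrite(THREADID tid, VOID* addr) {
    fprintf(trace, "%u W %p\n", tid, addr);
}

// Called once per static instruction; insert analysis calls on memory ops.
static VOID Instruction(INS ins, VOID*) {
    if (INS_IsMemoryRead(ins))
        INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordRead,
                                 IARG_THREAD_ID, IARG_MEMORYREAD_EA, IARG_END);
    if (INS_IsMemoryWrite(ins))
        INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordWrite,
                                 IARG_THREAD_ID, IARG_MEMORYWRITE_EA, IARG_END);
}

int main(int argc, char* argv[]) {
    if (PIN_Init(argc, argv)) return 1;
    trace = fopen("mem_trace.out", "w");
    INS_AddInstrumentFunction(Instruction, nullptr);
    // (File cleanup via PIN_AddFiniFunction omitted for brevity.)
    PIN_StartProgram();  // never returns
    return 0;
}
```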
Goals and deliverables
PLAN TO ACHIEVE
We plan to achieve a fully functional multi-core cache coherence simulator that can be configured with:
- the number of cores
- cache block size
- cache replacement policy
- coherence protocol
- implementation style
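A minimal sketch of how these knobs could be read inside an SST component, assuming SST's `Params::find` interface; the parameter names and defaults below are hypothetical:

```cpp
#include <sst/core/params.h>
#include <string>

// Hypothetical configuration knobs for our cache component, read from the
// SST Python config via SST::Params (names and defaults are our own choices).
struct CacheConfig {
    int         numCores;
    int         blockSize;    // bytes per cache line
    std::string replacement;  // e.g. "LRU" or "FIFO"
    std::string protocol;     // e.g. "MSI" or "MESI"
    std::string style;        // "snooping" or "directory"

    explicit CacheConfig(SST::Params& params)
        : numCores(params.find<int>("num_cores", 4)),
          blockSize(params.find<int>("block_size", 64)),
          replacement(params.find<std::string>("replacement", "LRU")),
          protocol(params.find<std::string>("protocol", "MSI")),
          style(params.find<std::string>("style", "snooping")) {}
};
```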
The advantages of building our own cache coherence simulator on top of the core APIs exposed by SST are that we can:
- Gain hands-on experience in using and extending an industrial toolkit
- Simplify and abstract the behaviour and workloads we want to study in a controlled environment
- Enable trace-based analysis in SST in addition to the artificial address generators available in SST
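A trace front end only needs a simple record per memory access. Below is a minimal sketch, assuming a hypothetical `<thread id> <R|W> <hex address>` line format like the one produced by the Pin sketch above:

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// One memory access from a recorded trace (hypothetical format).
struct TraceRecord {
    unsigned tid;      // issuing thread / core
    bool     isWrite;  // R or W
    uint64_t addr;     // effective address
};

// Parse lines of the form "<tid> <R|W> <hex address>".
std::vector<TraceRecord> loadTrace(const std::string& path) {
    std::vector<TraceRecord> trace;
    std::ifstream in(path);
    unsigned tid; char op; std::string addr;
    while (in >> tid >> op >> addr) {
        trace.push_back({tid, op == 'W', std::stoull(addr, nullptr, 16)});
    }
    return trace;
}
```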
HOPE TO ACHIEVE
If time permits, we also aim to implement a directory-based coherence protocol on top of the SST APIs. This will allow us to perform scalability studies and compare directory-based against snooping-based protocols. We aim to analyse and present concrete data on which kinds of programs, access patterns, and sharing benefit from and scale with directory-based protocols.
ANALYSIS
We aim to analyze and develop a concrete understanding of programs with different memory access patterns, sharing, and locality, backed by concrete numbers and statistics, in order to answer the following questions:
- Performance and traffic generated by different lock implementations (Test-and-Set, Test-and-Test-and-Set); see the sketch after this list
- Effect of artifactual communication on performance
- Directory-based vs. snooping-based scalability (125% goal)
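For reference, here is a minimal sketch of the two lock variants we plan to compare, written with `std::atomic` (simplified, without backoff):

```cpp
#include <atomic>

// Test-and-Set: every acquire attempt is a read-modify-write, which generates
// a bus transaction (BusRdX / invalidation) even while the lock is held.
struct TasLock {
    std::atomic<bool> locked{false};
    void lock()   { while (locked.exchange(true, std::memory_order_acquire)) {} }
    void unlock() { locked.store(false, std::memory_order_release); }
};

// Test-and-Test-and-Set: spin on a plain read first, so waiters hit in their
// own cache (Shared state) and only attempt the RMW when the lock looks free.
struct TtasLock {
    std::atomic<bool> locked{false};
    void lock() {
        for (;;) {
            while (locked.load(std::memory_order_relaxed)) {}  // local spin
            if (!locked.exchange(true, std::memory_order_acquire)) return;
        }
    }
    void unlock() { locked.store(false, std::memory_order_release); }
};
```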
DEMO
We aim to have an interactive demo showing the capabilities of our simulator. We plan to present insights such as the following statistics:
- Different types of cache statistics:
  - Miss rate
  - Number of invalidations due to the coherence protocol
- Bus traffic classification:
  - Memory traffic: requests served directly from memory
  - Coherence traffic: requests served from one of the caches due to sharing
- Latency of different cache events
- Effect of cache block size on performance, producing a plot of miss rate vs. cache block size
- Study of the effect of different types of memory access patterns and sharing (such as ocean simulation, stencils, etc.) to gather insights from the simulator
- Scalability of cache coherence implementation styles (125%):
  - Snooping-based
  - Directory-based
- We will be doing most of our development and execution locally or on the GHC machines. However, for the scalability study of directory-based vs. snooping-based coherence implementations, we will use the PSC machines to gather traces on larger numbers of cores.
- We will develop our cache component on top of the SST architecture, and multi-core communication will be supported by SST's built-in OpenMPI-based APIs.
- We will use C++ as our development language.
SCHEDULE
| Week Number | Checkpoint |
| --- | --- |
| 1 | Study the SST API and start building a cache component |
| 2 | Complete the implementation of the cache |
| 3 | Gather traces using the Pin tool |
| 4 | Perform analysis and gather data using the simulator |
| 5 | Work on the report and extend the simulator to a directory-based protocol |
Assumptions
As we began the development process, we made the following assumptions for our multi-core cache coherence simulator:
- At any time, each processor has only one outstanding request
- The bus only supports atomic transactions
- We do not support reads and writes of actual data; we are concerned only with the addresses issued by each processor
As part of this project we plan to remove assumption 1 by enhancing the cache implementation to incorporate non-blocking semantics.
Updated Schedule for Project Milestone
We’ve been working diligently to keep to the schedule. So far we’ve completed the following portions:
- Study SST Core API
- We went through the SST-Core documentation explaining the basic primitives, components, and APIs
- We also checked out the online tutorials and the simple examples given in the SST-Elements repository
- The next step was building the SST-Core and SST-Elements repositories locally and executing a few examples to get hands-on experience and understand the process of building our own components on top of SST-Core
- Completed development of a simulated CPU load/store generator
- Completed building a cache component
  - We have completed the basic development of a multi-core cache
  - The cache has three ports (see the sketch after this list):
    - One to receive requests from, and transmit responses back to, the processor
    - A second to submit requests to the bus arbitrator for access to the bus
    - A third to transmit the actual request on the bus and receive the response
  - We have a working implementation of the broadcast-based MSI cache coherence protocol
  - Tested the complete implementation on a trace computing the sum of an array in parallel, exercising:
    - The broadcast-based interconnect
    - The arbiter with round-robin and FIFO policies
    - The multi-core cache with the MSI coherence protocol
- Understood how to gather multi-threaded memory traces using the Pin tool
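A simplified sketch of the three-port structure, based on SST's component and link APIs as we understand them; the class, port, and handler names below are hypothetical:

```cpp
#include <sst/core/component.h>
#include <sst/core/event.h>
#include <sst/core/link.h>

// Sketch of the cache component's three ports (registration macros omitted).
class XTCache : public SST::Component {
public:
    XTCache(SST::ComponentId_t id, SST::Params&) : SST::Component(id) {
        // Port 1: receive requests from, and send responses back to, the CPU.
        cpuLink = configureLink("cpu_port",
            new SST::Event::Handler<XTCache>(this, &XTCache::handleCpuRequest));
        // Port 2: request access to the shared bus from the arbitrator.
        arbLink = configureLink("arb_port",
            new SST::Event::Handler<XTCache>(this, &XTCache::handleArbGrant));
        // Port 3: drive transactions onto the bus and snoop other caches.
        busLink = configureLink("bus_port",
            new SST::Event::Handler<XTCache>(this, &XTCache::handleBusEvent));
    }

private:
    void handleCpuRequest(SST::Event*) { /* look up line; on a miss, request bus */ }
    void handleArbGrant(SST::Event*)   { /* bus granted: issue BusRd/BusRdX */ }
    void handleBusEvent(SST::Event*)   { /* snoop and respond per MSI */ }

    SST::Link* cpuLink;
    SST::Link* arbLink;
    SST::Link* busLink;
};
```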
Below is the updated schedule for the two weeks between the milestone and the project deadline, at half-week granularity.
| Week Number | Checkpoint | Assignee | Status |
| --- | --- | --- | --- |
| 0.5 | Complete implementation of the cache component | Tanay | Done |
| 1.0 | Complete implementation of the bus and arbitrator | Xuan | Done |
| 1.5 | Enhance the cache component with the MESI protocol and additional statistics | Tanay | Done |
| 1.75 | Enhance the implementation to incorporate non-blocking cache semantics and additional statistics | Both | Done |
| 2.0 | Devise characteristic multi-threaded programs to stress-test the simulator and study workload patterns | Both | Done |
| 2.0 | Generate characteristic cache traces using the Pin tool | Both | Done |
| 2.25 | Perform analysis and gather data using our simulator | Both | Done |
| 2.5 | Complete the extended implementation of the directory component | Xuan | Cancelled |
| 2.75 | Incorporate changes to the cache and bus for the directory-based protocol | Tanay | Cancelled |
| 3.0 | Work on the report and poster session prep | Both | Done |
Updated Goals and Deliverables
We have been sticking fairly well to the planned schedule and targeted development goals. We already have a working multi-core cache coherence simulator with the following features:
- Variable cache block size
- Variable total cache size
- Configurable associativity
- Configurable replacement policy
- Configurable cache coherence protocol
- Configurable arbitration policy (see the sketch below)
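A minimal standalone sketch of the two arbitration policies (simplified; our actual arbiter is event-driven inside SST):

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Round-robin: scan requestors starting just after the last grantee.
class RoundRobinArbiter {
    std::size_t last = 0;
public:
    // `requests[i]` is true if core i wants the bus; returns grantee or -1.
    int grant(const std::vector<bool>& requests) {
        if (requests.empty()) return -1;
        for (std::size_t i = 1; i <= requests.size(); ++i) {
            std::size_t cand = (last + i) % requests.size();
            if (requests[cand]) { last = cand; return static_cast<int>(cand); }
        }
        return -1;
    }
};

// FIFO: grant the bus in strict arrival order of requests.
class FifoArbiter {
    std::deque<int> queue;
public:
    void request(int core) { queue.push_back(core); }
    int grant() {
        if (queue.empty()) return -1;
        int core = queue.front();
        queue.pop_front();
        return core;
    }
};
```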
In the upcoming weeks we plan to enhance the implementation to incorporate non-blocking cache semantics, after discussing it with and taking feedback from Professor Skarlatos. After that, we will carry out performance studies using our simulator to gain insight into the behaviour of shared-memory parallel programs with different communication patterns, and we plan to reproduce what we learned in class regarding artifactual communication with actual data and statistics collected using our simulator (XTSim).
Since our poster session is on December 9th, leaving us with less than 10 days, we are skeptical of achieving our 125% goal, but we will try our best to stay in line with the original goals and deliverables.
We aim to present our deliverables as the following graphs:
- Miss rate vs. programs with different access patterns
- Number of invalidations vs. programs with different access patterns
- Miss rate vs. cache block size
- Miss rate vs. total cache size
- Coherence traffic vs. programs with different access patterns
- Memory traffic vs. programs with different access patterns
Outstanding Concerns
We primarily have the following tasks remaining:
- Enhance the cache implementation to incorporate non-blocking semantics
- Generate a test plan for carrying out the performance study using our simulator (XTSim)
Our main concern is the limited time remaining, since we are doing an early poster session. We are confident of achieving the 100% goal but are not completely sure of completing the 125% goal of enhancing the implementation to incorporate a directory-based coherence protocol.