Big Data Analysis with Linear Regression

Joao Saavedra on 14 Jul 2021
Answered: Alvaro on 29 Dec 2022
Hi all,
I am doing a project to predict how many CPUs will be needed to process a huge file (.nc) of climate data in less than 2 hours (7200 s). Run sequentially, it takes more than 100,000 seconds.
I have the entire program written to process the data both sequentially and in parallel, with up to 8 workers (the limit of my CPU). The program takes the data file, which holds an entire day of climate data, and divides it into hours (25) so it can be processed hour by hour. I used a stopwatch timer in the code to record the time taken for each number of workers.
To make the parallel processing easier to run and test, I am using a subset of the data (the entire file has more than 270,000 blocks of data).
How can I use the times measured on a subset of the data to extrapolate the number of CPUs needed for the entire data file? I have been stuck on this problem for the entire day...
Thanks in advance!

Answers (1)

Alvaro on 29 Dec 2022
It's not straightforward to calculate the number of workers that you would need to process your data in less than 2 hours.
Amdahl's law might give you a bit of a formal approach if you are looking to write down some rough calculations.
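As a rough illustration of what Amdahl's law implies here, the sketch below computes the speedup for a given parallel fraction and the number of workers needed to reach a target speedup. It is written in Python only for illustration; the same arithmetic ports directly to MATLAB. The function names and the parallel fractions tried (1.0, 0.95, 0.9) are assumptions, not values measured from the program in the question.

```python
import math

def amdahl_speedup(p, n):
    """Speedup predicted by Amdahl's law for parallel fraction p on n workers."""
    return 1.0 / ((1.0 - p) + p / n)

def workers_for_speedup(p, target):
    """Smallest worker count reaching `target` speedup, or None if unreachable.

    Hypothetical helper: solves 1 / ((1 - p) + p / n) >= target for n.
    """
    max_speedup = float("inf") if p >= 1.0 else 1.0 / (1.0 - p)
    if target >= max_speedup:
        return None  # the serial part alone already exceeds the time budget
    return max(1, math.ceil(p / (1.0 / target - (1.0 - p))))

# Target from the question: 100,000 s serial time, 7200 s limit.
target = 100000 / 7200

for p in (1.0, 0.95, 0.9):  # assumed parallel fractions, for illustration only
    print(p, workers_for_speedup(p, target))
```

Under these assumptions, a perfectly parallel job needs 14 workers, a job that is 95% parallel needs 44, and a job that is 90% parallel can never reach the target, because its speedup is capped at 10x regardless of the number of workers.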
I would try to determine the number of cores you need by trial and error. Since you are looking for approximately a 14x speedup over the serial computation (100,000 s / 7200 s ≈ 13.9), a rough first guess is to start with 14 workers in your cluster and time the run. This assumes that your computation parallelizes almost perfectly, but, as noted above, it is likely not that simple. From there, try more or fewer cores until you reach the time you are looking for. A more thorough experiment to determine the optimal number of workers could be worthwhile if you need to analyze a large number of those data files in less than 2 hours.
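One way to turn the timings you already have into a starting guess for that experiment is to fit them to a simple model, T(n) = a + b/n, where a is the part of the run that does not speed up with more workers and b is the part that does. This is a linear regression of time against 1/n. The sketch below (Python, for illustration only) makes two assumptions that you would need to check: that runtime grows in proportion to the number of data blocks, and that the trend seen for 1 to 8 workers continues beyond 8. The timings and the scale factor are made-up placeholder values, not measurements.

```python
import math

def fit_serial_parallel(workers, times):
    """Least-squares fit of T(n) = a + b / n; returns (a, b)."""
    x = [1.0 / n for n in workers]
    m = len(x)
    x_mean = sum(x) / m
    t_mean = sum(times) / m
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxt = sum((xi - x_mean) * (ti - t_mean) for xi, ti in zip(x, times))
    b = sxt / sxx
    a = t_mean - b * x_mean
    return a, b

def min_workers(a, b, scale, limit_s):
    """Smallest n with scale * (a + b / n) <= limit_s, or None if unreachable."""
    serial = scale * a
    if serial >= limit_s:
        return None  # the non-parallel part alone exceeds the limit
    return max(1, math.ceil(scale * b / (limit_s - serial)))

# Placeholder timings (seconds) on a subset, for 1 to 8 workers.
workers = list(range(1, 9))
times = [2.0 + 98.0 / n for n in workers]  # synthetic, not measured

a, b = fit_serial_parallel(workers, times)

# Assumed scale factor: full-file blocks divided by subset blocks.
scale = 1000
print(min_workers(a, b, scale, 7200))
```

Treat the result only as the first worker count to try. Overheads such as file I/O and data transfer between workers often grow with the number of workers and are not captured by this model.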

Release

R2020a
