To me, Big Data is the little brother of High Performance Computing. In HPC we use algorithms and data to gain new knowledge through simulation; Big Data looks for knowledge within existing data. The two have one more thing in common: both areas focus on the computation, the simulation and the analysis, while the data itself, the literal "big data", is neglected.
The data has to be stored and transported before it can be used for simulation or analysis, and although storage space seems cheap at the moment, we are facing more than exponential growth and the challenges of not just Peta- but Exa- or even Zettabytes.
Since around 2000, disks have not been getting much faster: their latency has stayed flat, and density growth has slowed as well. Still, the target remains the same: grow capacity and performance along a "Moore's Law" curve. All of this is well known in the HPC community, but it will be a very new topic for the people now working with Big Data, who usually come from the software and algorithm side.
To get to a storage capacity of one Exabyte we will need somewhere between 200,000 and 1,000,000 disk drives. Two major issues with this number of disks are data availability and data integrity. All the compute power and all the clever algorithms are useless if the data is either not there or wrong.
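A quick back-of-the-envelope check of that range; the per-drive capacities below are my assumption, since the post does not state which drive sizes the estimate is based on:

```python
EXABYTE = 10**18  # bytes

# Assumed per-drive capacities in TB (not stated in the post);
# the 200,000 to 1,000,000 range falls out of capacities in this ballpark.
for capacity_tb in (1, 2, 5):
    drives = EXABYTE // (capacity_tb * 10**12)
    print(f"{capacity_tb} TB drives: {drives:,} drives for 1 EB")
```

With 1 TB drives you land at the upper end of the range (1,000,000 drives), with 5 TB drives at the lower end (200,000 drives).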
With 200,000 disks and a per-disk MTBF of 600,000 hours, we should expect eight or more disks to fail every day. In effect, we have a system that is continuously rebuilding. With larger disks these rebuilds will take longer, and two-disk redundancy will be rendered useless. The conclusion is that we have to use solutions that minimize the performance impact of rebuilds and that provide flexible redundancy beyond today's RAID technologies.
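The expected failure count follows directly from the fleet size and the MTBF. A minimal sketch of that arithmetic, assuming a simple steady-state failure rate (a simplification; real fleets show bathtub-curve behavior):

```python
# Expected disk failures per day, estimated as fleet size divided by
# MTBF expressed in days. A steady-state estimate, not a full
# reliability model.
def failures_per_day(num_disks: int, mtbf_hours: float) -> float:
    return num_disks * 24 / mtbf_hours

print(failures_per_day(200_000, 600_000))  # -> 8.0
```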
As I wrote, hard drives as we know them have hit a certain performance barrier. At the same time, it is financially impossible to replace all of them with flash and SSDs. Then again, transporting Petabytes back and forth takes its time and is expensive in both technology and power. All of this leads to the point that we are done with homogeneous data systems. Different data must reside in different places and be accessible at different times and speeds. The "information management logic" that manages the placement and migration of data so that it is optimally available for computation or analysis will make the difference here.
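To make that idea concrete, here is a deliberately toy sketch of what such placement logic decides; the tiers, thresholds, and names are invented purely for illustration (real systems, e.g. the GPFS policy engine, express this far more richly):

```python
# Hypothetical tiered-placement decision: hot data on flash, warm data
# on disk, cold data on archive media. Thresholds are made up.
from dataclasses import dataclass

@dataclass
class DataSet:
    name: str
    accesses_per_day: float

def choose_tier(ds: DataSet) -> str:
    if ds.accesses_per_day > 100:
        return "flash"
    if ds.accesses_per_day > 1:
        return "disk"
    return "tape/archive"

print(choose_tier(DataSet("simulation-output", 250)))  # -> flash
print(choose_tier(DataSet("last-year-results", 0.1)))  # -> tape/archive
```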
Data is worthless if it is not correct. Silent data errors, dropped I/Os, or plain bit rot can destroy valuable data without even being detected until it is too late. A proven way to address this problem is checksums and versioning. To really be on the safe side, those checksums cannot live somewhere in the storage stack just above the disks; they should instead reach end to end, from the client on one side all the way down to the data on the disk.
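A minimal sketch of the end-to-end idea: the client computes the checksum before the data leaves its hands and verifies it again after reading back, so corruption anywhere along the path is caught. The store and function names are hypothetical; this only illustrates the principle:

```python
import hashlib

# Checksum is computed on the client side, stored alongside the data,
# and re-verified on every read. Any silent corruption in between
# (controller, cable, disk) surfaces as a mismatch.
def write_with_checksum(store: dict, key: str, data: bytes) -> None:
    store[key] = (data, hashlib.sha256(data).hexdigest())

def read_verified(store: dict, key: str) -> bytes:
    data, expected = store[key]
    if hashlib.sha256(data).hexdigest() != expected:
        raise IOError(f"checksum mismatch for {key}: silent data error")
    return data

store = {}
write_with_checksum(store, "block-42", b"valuable simulation data")
assert read_verified(store, "block-42") == b"valuable simulation data"
```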
Oh, btw: all of this is part of GPFS Native RAID.