Imagine an ML training process built on data from *multiple* sources. Here is a rough scheme of what the data pipeline might look like: [https://i.postimg.cc/s2ghLxcg/07-11-2021-104457.jpg](https://i.postimg.cc/s2ghLxcg/07-11-2021-104457.jpg)
However, this scheme assumes that all the data is aggregated on the ML process owner's side, which is not secure (e.g. if their servers get hacked).
I’ve heard there are scenarios/technologies where the data, though aggregated on one side, is stored in a kind of “isolated sandbox”, which makes unauthorized access to the “full picture” harder. Could you please name these technologies so that I know which direction to move in?
Finally, I’ve heard it’s possible that even the ML process *owner* has only limited access to the data. For example, if database A says John has a university degree, database B says he is 35, and database C says he lives in Boston, the ML algorithms should be able to process this data without exposing John’s name even to the ML process owner. This in turn means the ML model owner can return ML results (e.g. a financial reliability score) to the owners of databases A, B, and C as “John’s score” without *themselves* knowing that it’s John’s. I would highly appreciate it if you could point me in the right direction to study more about this scenario for keeping the data secure.
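To make the scenario concrete, here is a minimal Python sketch of the pseudonymization idea as I understand it (everything here is hypothetical: the shared salt, the databases, and the toy scoring function are my own illustration, and real systems would use something like private set intersection or secure multi-party computation instead of a plain shared salt):

```python
import hashlib

# Hypothetical secret agreed among the data owners only; the ML process
# owner never learns it, so it cannot reverse the pseudonyms.
SHARED_SALT = b"secret-agreed-among-data-owners"

def pseudonymize(name: str) -> str:
    """Salted hash that stands in for the real identifier."""
    return hashlib.sha256(SHARED_SALT + name.encode()).hexdigest()

# What each database owner sends to the ML process owner:
db_a = {pseudonymize("John"): {"degree": "university"}}
db_b = {pseudonymize("John"): {"age": 35}}
db_c = {pseudonymize("John"): {"city": "Boston"}}

# The ML owner joins the three feeds on the pseudonym alone:
joined = {}
for db in (db_a, db_b, db_c):
    for pid, features in db.items():
        joined.setdefault(pid, {}).update(features)

# Toy "scoring model" (placeholder for the real ML pipeline):
def score(features: dict) -> float:
    s = 0.5
    if features.get("degree"):
        s += 0.2
    if features.get("age", 0) >= 30:
        s += 0.1
    return s

# Results are keyed by pseudonym; each database owner can map the
# pseudonym back to "John" locally, but the ML owner never can.
results = {pid: score(f) for pid, f in joined.items()}
```

So the ML owner sees a complete feature vector and produces a score, yet only the parties holding the salt can tie that score back to “John”. Is this roughly the mechanism behind the technologies you’d recommend, or do they work differently?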
Thank you in advance, guys!