Hi all,

imagine ML training built on data from *multiple* sources. Here is a rough scheme of what the data pipeline might look like: [https://i.postimg.cc/s2ghLxcg/07-11-2021-104457.jpg](https://i.postimg.cc/s2ghLxcg/07-11-2021-104457.jpg)

However, this scheme assumes that all the data is aggregated on the ML process owner’s side, which is not secure (e.g. if their servers get hacked).

I’ve heard there are scenarios/technologies where the data, though aggregated on one side, is stored in a kind of “isolated sandboxes”, which makes unauthorized access to the “full picture” harder. Could you please name these technologies so that I know which direction to explore?

Finally, I’ve heard it’s possible for even the ML process *owner* to have only limited access to the data. For example, if database A says John has a university degree, database B says he is 35, and database C says he lives in Boston, the ML algorithms would be able to process this data without exposing John’s name even to the ML process owner. This means the ML model owner could return ML results (e.g. a financial reliability score) to the owners of databases A, B, and C as “John’s score” without *themselves* knowing that it is John’s. I will highly appreciate it if you point me in the right direction to study more about this scenario of keeping the data secure.
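To make the third scenario concrete, here is a toy Python sketch of what I mean. Everything in it is an illustrative assumption, not a real protocol: it assumes the data owners A, B, C share a secret key that the ML process owner never sees, and each owner replaces “John” with a keyed hash before sharing, so the ML owner can join and score the records without learning the identity. (Real systems would use proper protocols such as secure multi-party computation or private set intersection instead of a shared key, which is exactly what I’m asking about.)

```python
# Toy sketch: pseudonymous record linkage across databases A, B, C.
# Assumption: the data owners share SHARED_KEY; the ML owner does not.
import hashlib
import hmac

SHARED_KEY = b"secret-known-only-to-data-owners"  # hypothetical key

def pseudonymize(name: str) -> str:
    """Deterministic keyed hash: the same name maps to the same pseudonym."""
    return hmac.new(SHARED_KEY, name.encode(), hashlib.sha256).hexdigest()

# Each data owner pseudonymizes locally before sharing its records.
db_a = {pseudonymize("John"): {"degree": True}}
db_b = {pseudonymize("John"): {"age": 35}}
db_c = {pseudonymize("John"): {"city": "Boston"}}

# The ML process owner joins the records on pseudonyms it cannot reverse...
records = {}
for db in (db_a, db_b, db_c):
    for pid, attrs in db.items():
        records.setdefault(pid, {}).update(attrs)

# ...and returns a score per pseudonym (made-up toy scoring rule).
scores = {pid: 0.5 + 0.3 * r["degree"] + 0.01 * (r["age"] - 30)
          for pid, r in records.items()}

# A data owner, who knows the key, maps the score back to John.
print(scores[pseudonymize("John")])  # John's score (~0.85 under the toy rule)
```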

Thank you in advance guys!
