My previous project used Microsoft Purview Information Protection for end users and data discovery. It was a new implementation, and the customer was unsure how to proceed with Data Loss Prevention (DLP) due to new security policies. I discussed the requirements with the customer to understand where encryption was needed and what labels were required. After identifying these, I created data security labels and applied them to a small group of IT department users for testing. Once testing was complete, we rolled it out to a limited set of users from different departments and gathered feedback.
For example, HR needed a specific label for their data, accessible only to HR users. We then created and applied labels globally to all users. Initially, we didn't force label application or content-based detection; users applied labels based on their own judgment. Next, we implemented content-based detection by creating Sensitive Information Types (SITs) with keywords and regex patterns, based on discussions with the data security team. We created four main labels: public, internal, confidential, and restricted. SITs were mapped to specific labels, such as financial data to the restricted label and internal confidential information to the confidential label. The system would detect content and recommend a label but not force its application.
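To illustrate how keyword- and regex-based detection can drive a label recommendation, here is a minimal Python sketch. It is a conceptual illustration only: the pattern names, regexes, and SIT-to-label mapping are assumptions for the example, not our actual SIT definitions or Purview's internal matching logic.

```python
import re

# Hypothetical patterns standing in for custom Sensitive Information Types (SITs).
# Real SITs are defined in Purview as rule packages with confidence levels;
# these regexes are illustrative only.
SIT_PATTERNS = {
    "financial_data": [
        re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b"),   # card-like number
        re.compile(r"\b(invoice|iban|swift)\b", re.IGNORECASE),
    ],
    "internal_confidential": [
        re.compile(r"\b(internal only|do not distribute)\b", re.IGNORECASE),
    ],
}

# The four labels we used, ordered from least to most sensitive,
# and an assumed mapping of detected SITs to labels.
LABEL_ORDER = ["public", "internal", "confidential", "restricted"]
SIT_TO_LABEL = {
    "financial_data": "restricted",
    "internal_confidential": "confidential",
}

def recommend_label(text: str, default: str = "internal") -> str:
    """Return the most sensitive label whose SIT patterns match the text.

    The result is only a recommendation; nothing here enforces it,
    mirroring the 'recommend, don't force' approach described above.
    """
    best = default
    for sit, patterns in SIT_PATTERNS.items():
        if any(p.search(text) for p in patterns):
            candidate = SIT_TO_LABEL[sit]
            if LABEL_ORDER.index(candidate) > LABEL_ORDER.index(best):
                best = candidate
    return best

print(recommend_label("Internal only: Q3 forecast"))           # confidential
print(recommend_label("Card 4111 1111 1111 1111 attached"))    # restricted
```

In the real deployment this logic lived in the Purview label configuration; the sketch only shows the principle that matched content maps to the most sensitive applicable label while the final decision stays with the user.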
After applying labels to end users and achieving some maturity in the process, including fine-tuning and adding appropriate Sensitive Information Types (SITs), we moved on to existing data in on-premises file shares. This data volume was huge. We used the Azure Information Protection (AIP) scanner to conduct a proof of concept. After the POC, we performed performance testing in a complex environment with many devices in the path, such as antivirus, firewall, proxy, and other network components. I had to ensure the scanning wouldn't impact the environment's performance.
We took a sample of about one terabyte of data for scanning, recording timings both during production hours and off-hours. This helped determine the best time to scan and how long it would take to scan one terabyte of unstructured data. Once that was complete, we defined an approach to handle the full 375 terabytes of data that needed scanning and, where appropriate, labeling.
We calculated the number of servers needed for the scan, since this wasn't just a one-time process; we also had to plan for ongoing scans of new files. After deciding on the approach, we began scanning the data. Once the scanning process was underway, we handed it over to the Business As Usual (BAU) team for ongoing management.
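To make the sizing reasoning concrete, here is a rough Python sketch that extrapolates from the one-terabyte sample to the full 375 TB and estimates how many scanner servers would be needed to finish the initial pass within a given window. The sample duration, nightly scan window, and deadline below are placeholder assumptions, not the project's actual measurements.

```python
# Hypothetical sizing sketch: the throughput and window figures below are
# placeholder assumptions, not the project's real numbers.
SAMPLE_TB = 1.0
SAMPLE_HOURS = 8.0             # assumed: time to scan the 1 TB sample off-hours
TOTAL_TB = 375.0               # volume of unstructured data on the file shares
WINDOW_HOURS_PER_NIGHT = 10.0  # assumed nightly off-production scan window
TARGET_NIGHTS = 30.0           # assumed deadline for the initial full scan

throughput_tb_per_hour = SAMPLE_TB / SAMPLE_HOURS
hours_for_full_scan = TOTAL_TB / throughput_tb_per_hour

# Number of scanner servers running in parallel so the initial scan
# fits into the target number of nightly windows (ceiling division).
available_hours_per_server = WINDOW_HOURS_PER_NIGHT * TARGET_NIGHTS
servers_needed = -(-hours_for_full_scan // available_hours_per_server)

print(f"Estimated full-scan effort: {hours_for_full_scan:.0f} scanner-hours")
print(f"Scanner servers for a {TARGET_NIGHTS:.0f}-night window: {servers_needed:.0f}")
```

Because subsequent scans typically only need to cover newly added or changed files, the initial full pass tends to dominate a sizing exercise like this.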
I created training materials for end users covering how to apply the different labels, what each label is for, and examples of each. This was crucial because end users would be applying the labels themselves in different protection scenarios. The implementation helped the organization in several ways.
Firstly, it ensured compliance with regional laws and regulations by giving us a way to identify data and apply the right label, which was the main objective. Secondly, it educated end users about the data they were handling: when the system recommended a label, users learned what kind of data was in their files and which label was appropriate, raising awareness across the organization.
Thirdly, it improved the overall data protection ecosystem. For example, it made our Data Loss Prevention (DLP) system more effective: instead of starting from scratch, DLP could work with data that was already classified, which made the whole process more efficient.
The most valuable aspect of this product is its ability to protect data across boundaries; currently, the feature I value most is encryption that applies across those boundaries.