This thesis is about creating acceptance data by anonymizing production data without removing the relations in that data. This matters because the law forbids developing and testing your solutions on personally identifiable data from your customers.

Required interest(s)

  • Relation modeling
  • Statistics
  • Machine Learning
  • Big Data
  • Data Mining
  • Data visualisation

What do you get

  • A challenging assignment within a practical environment
  • Compensation of € 1000, or € 500 plus a lease car, or € 600 plus living space
  • Professional guidance
  • Courses aimed at your graduation period
  • Support from our academic Research Center
  • Two vacation days per month

What you will do

  • 65% Research
  • 10% Analysis, design and realization
  • 25% Documentation

Creating synthetic data is difficult for two reasons: it is hard to model the relations present in the production data when generating ‘fake’ data, and it is hard to generate a volume of data comparable to the existing production data. It would be far easier to anonymize the existing production data so that it cannot be linked to unique persons. However, when removing everything that can identify a person, it is a high priority not to destroy interesting relationships within the data, for example relationships that can be used to train ML models. The anonymized data should still be representative of production data, so that you can accurately test your solutions, but it should not be traceable to individuals. And that’s where you come in!

The research question defined for this assignment is: ‘What is the optimal procedure to anonymize a large dataset, and how do we prove that the existing relationships within the data remain intact?’ This area has been researched intensively in recent years, driven by stricter EU laws on personally identifiable information and by the desire to analyze the data gathered from customers rather than just store it. Although many researchers are working on this problem, it remains largely unsolved. A highly challenging assignment on the cutting edge of data science awaits you. Your results, when promising, will be shared with our customers in the Mobility and Public domain, who have a lot of data that they want to use, but in many cases cannot yet use, in developing novel AI solutions. You will analyze large data sets, visualize relationships, create data mappings and prove that the relationships in the data are preserved. You will study the existing literature but develop your own creative solution to this problem.
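To make the research question concrete, here is a minimal Python sketch of one possible starting point. It is not the procedure this assignment is looking for. It replaces a direct identifier with a keyed hash (HMAC-SHA256), so the same customer receives the same token in every table, and then checks that a join between two tables gives the same result before and after. The tables, the field names and the key are all hypothetical. The sketch does not handle indirect identifiers such as age, which on their own or in combination may still point to a person.

```python
import hashlib
import hmac

# Assumption: a secret key that is never copied to the acceptance environment.
SECRET_KEY = b"hypothetical-secret-key"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Map an identifier to a stable token with keyed hashing (HMAC-SHA256)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def anonymize(rows, id_field="customer_id"):
    """Return copies of the rows with the identifier field replaced by its token."""
    return [{**row, id_field: pseudonymize(row[id_field])} for row in rows]


def join_age_distance(customer_rows, trip_rows):
    """Join trips to customers on customer_id; return sorted (age, distance) pairs."""
    age_by_id = {c["customer_id"]: c["age"] for c in customer_rows}
    return sorted((age_by_id[t["customer_id"]], t["distance_km"]) for t in trip_rows)


# Hypothetical production tables.
customers = [
    {"customer_id": "alice@example.com", "age": 34},
    {"customer_id": "bob@example.com", "age": 51},
]
trips = [
    {"customer_id": "alice@example.com", "distance_km": 12.5},
    {"customer_id": "bob@example.com", "distance_km": 48.0},
    {"customer_id": "alice@example.com", "distance_km": 7.0},
]

anon_customers = anonymize(customers)
anon_trips = anonymize(trips)

# The relation between the two tables is intact if the join result is unchanged.
relations_intact = (
    join_age_distance(customers, trips)
    == join_age_distance(anon_customers, anon_trips)
)
print("relations intact:", relations_intact)
```

A check like this only shows that one specific relation survives. Proving that all relevant relationships remain intact, while also ruling out re-identification, is the open part of the question.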

About Info Support Research Center

We anticipate upcoming and future challenges and ensure that our engineers develop cutting-edge solutions based on the latest scientific insights. Our research community proactively tackles emerging technologies. We do this in cooperation with renowned scientists, making sure that research teams are positioned and embedded throughout our organisation and our community, so that their insights are directly applied to our business. We truly believe in sharing knowledge, so we want to do this without any restrictions.

Read more about Info Support Research here.