It is most often created by funneling real-world data through a noise-adding algorithm to construct a new data set. The resulting data set captures the statistical features of the original information without the ability to identify individual contributions.
“If you’re trying to train an algorithm to detect fraud, you don’t care about specific transactions and who made them,” he says. “You care about the statistics, like whether the amounts are just below the limit needed to trigger an audit, or if they tend to occur close to the end of the quarter.” Those kinds of numbers can be shaken out of synthetic data as well as from the original.