Podcast

AI in Procurement | Synthetic data: Customised solutions through vertical AI

In the fifth episode of our AI in Procurement podcast, Fabian Heinrich (CEO of Mercanis) and Dr Klaus Iffländer (Head of AI at Mercanis) talk about the central role of synthetic data in the further development of AI in procurement.

What is it all about?
Why is progress in LLM development slowing down and what role does synthetic data play as a new solution?
How can synthetic data be generated and used specifically for procurement?
What does the verticalisation of AI mean - and how do companies benefit from it?
Can vertical AI models with specially generated data revolutionise the purchasing process?

Synthetic data makes it possible to train large language models (LLMs) with customised, industry-specific information - and thus increase the performance of vertical AI agents. An essential step in supporting procurement teams with intelligent automation.

Is this the future of AI in procurement? We shed light on how synthetic data forms the basis for specialised AI applications - and what this means for the world of procurement.

A quick note in our own interest: there is an email newsletter for the Procurement Unplugged by Mercanis podcast. Subscribe HERE now!

Our Speakers
Fabian Heinrich
CEO & Co-Founder of Mercanis
Dr. Klaus Iffländer
AI Expert & Head of AI at Mercanis

Fabian Heinrich (00:01)
Dear listeners, welcome to another episode of Procurement Unplugged, today once again with Dr AI, Dr Klaus Iffländer. We have already discussed AI and generative AI in several episodes. In the last episode, we took a deep dive into vertical AI agents and how they could disrupt the whole topic of software, or even cannibalise all of today's software at some point.

Today's topic is synthetic data and how this data can become a booster for all LLMs and vertical agents. That's a lot of new words, and it might get a bit technical here and there, but that's why we have our Dr AI with us. Welcome Klaus.

Dr. Klaus Iffländer (00:54)
Hello, I'm delighted to be back.

Fabian Heinrich (00:57)
And then, before we start talking about LLMs or all sorts of things, let's get straight into it: what is synthetic data?

Dr. Klaus Iffländer (01:07)
Synthetic data is data that is generated artificially. Traditionally, data comes from other systems, is collected in surveys, or is drawn from other data sources: databases, emails, in other words, wherever digital data is produced.

That's where it comes from and that's where it can be used. And sometimes the data that you actually need for certain purposes, even if only for training, is simply not available. Those are exactly the cases where you can help yourself with synthetic data. For example, I have heard that it is used extensively in robotics, because robots that move through real environments, for example through the home, always have to be trained to find their way around.

And since you can't send robots endlessly through different homes to train them to get around, synthetic data is now generated there as well. This means that flats and houses are simply generated at random, so to speak, through which the robots...

Fabian Heinrich (02:28)
That would have been my next question, where do I get this synthetic data from? That makes a lot of sense, but where do I get the synthetic data?

Dr. Klaus Iffländer (02:38)
You don't get it from just anywhere. Of course, you can also buy data to some extent, but traditionally, synthetic data is simply generated, that is, created by algorithms to which you give certain rules. Take a price trend in procurement: from a data perspective, that is simply a time series of prices. And if you now want to train certain cases with it, for example sudden price drops or...

...slow price rises, then you give these rules to the algorithm and it, so to speak, makes the data up and creates exactly such data sets according to these rules, and you can then use them to train LLMs, for example.
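
To make the idea concrete, here is a minimal Python sketch of what such rule-based generation could look like. It only illustrates the principle described above; the price level, drift, volatility and the timing and size of the sudden drop are invented parameters, not figures from the episode.

```python
import numpy as np

def synthetic_price_series(n_days=365, start_price=100.0, drift=0.0005,
                           volatility=0.01, shock_day=200, shock_pct=-0.15,
                           seed=42):
    """Generate a synthetic price time series: slow drift plus one sudden drop.

    All parameters are illustrative assumptions, not values from the episode.
    """
    rng = np.random.default_rng(seed)
    daily_returns = drift + volatility * rng.standard_normal(n_days)
    daily_returns[shock_day] += shock_pct   # the rule: inject a sudden price drop on one day
    return start_price * np.cumprod(1 + daily_returns)

prices = synthetic_price_series()
print(prices[:5])        # ordinary days
print(prices[198:203])   # around the injected shock
```

Generating many such series with varying rules would give a model plenty of examples of both slow rises and abrupt drops.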

Fabian Heinrich (03:30)
Exactly, I mean with the robots, that makes sense, it sounds a bit like those training sets that I used to use to further develop machine learning algorithms. Why is this synthetic data now so irreplaceably important for the LLMs?

Dr. Klaus Iffländer (03:50)
Yes, there has now been a lot of discussion about the fact that progress in LLM development has slowed down. And one of the reasons that has been blamed for this is that the training data has simply run out all over the internet. There are only a limited number of high-quality publications that can be used as training data. For example, all of Wikipedia, all of Twitter or all of Reddit.

Such things are used to train LLMs. And there is only one Wikipedia. In addition, the data sources eventually become a little thin. One way to get round this and still improve LLMs is to use synthetic data. We already have LLMs that can deliver quite good texts, for example. These can also be used to generate new training data, which then covers certain cases.

Fabian Heinrich (04:51)
Of course, I also get this verticalisation with synthetic data. So not only do I get a second or third Wikipedia via synthetic data, with which I can train better, but I can also obtain or create synthetic data for my vertical topic, keyword purchasing.

Dr. Klaus Iffländer (05:11)
Exactly, it just makes a lot of sense. Of course, you don't need multiple Wikipedias. The data source doesn't have to be very, very broadly distributed, but if you want to build vertical agents, for example, and want to bring in certain cases or certain background knowledge, then it's a valid way to create exactly this kind of training data, simply as synthetic data.

Fabian Heinrich (05:41)
And now another very stupid question: if I wanted to train my LLMs, or my vertical agents, where would I generate the synthetic data or, as you mentioned earlier, where would I buy it? I mean, staying with the example of Wikipedia, I can't really buy a second Wikipedia now.

Dr. Klaus Iffländer (06:04)
No, by buying I meant things like price trends, for example. There are established companies that hold such data and also maintain and archive it. You can source it from such places. But that is real data; synthetic data would simply be generated. So you would create corresponding algorithms or...

Fabian Heinrich (06:16)
Well, that would be real data, not synthetic data.

Dr. Klaus Iffländer (06:29)
set up specialised LLMs for this, which then generate precisely this type of data, such as supplier failures. Imagine you have a history of events and for certain of them, the supplier simply fails at the time of delivery and you have a problem. How should the agent react to this? And to test such cases in advance or to train them at all, you would have to generate this data.
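
As a rough sketch of what such a generated event history could look like, the following Python snippet builds an order history in which a small share of deliveries fails. The field names and the 5% failure rate are purely hypothetical choices for illustration, not details from the episode.

```python
import random

def synthetic_delivery_events(n_orders=1000, n_suppliers=50, failure_rate=0.05, seed=7):
    """Build a synthetic order history in which a small share of deliveries fails.

    Field names and the failure rate are illustrative assumptions only.
    """
    random.seed(seed)
    events = []
    for order_id in range(n_orders):
        failed = random.random() < failure_rate
        events.append({
            "order_id": order_id,
            "supplier_id": random.randint(1, n_suppliers),
            "status": "failed_at_delivery" if failed else "delivered_on_time",
        })
    return events

events = synthetic_delivery_events()
failures = sum(e["status"] == "failed_at_delivery" for e in events)
print(f"{failures} failure cases out of {len(events)} orders")
```

Because the failure rate is a parameter, the rare case can be deliberately overrepresented so an agent sees enough examples of it during training or testing.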

Fabian Heinrich (07:04)
Okay, so now let's break it down for the audience: I want to build an agent, so to speak, or perhaps software like Mercu AI is now building an agent, and the new agent might have the task of risk assessment...

...for polymers. And then I have a real data set somewhere, perhaps my historical default data, historical price data and existing default probabilities, which I have stored somewhere in the ERP. Based on this data I can then generate synthetic data with various LLMs and algorithms, producing, so to speak, my Wikipedia 2, 3, 4, 5, up to N Wikipedias. And with these N Wikipedias, which I have generated based on the real data, I can then train this vertical agent on exactly this use case.

Dr. Klaus Iffländer (08:10)
Exactly, and above all it's about mapping underrepresented cases, such as supplier failures. Because you hopefully have a history of very reliable suppliers. And if that is almost always the case, then it's difficult for the agent to know what to do...

...if the case arises that suppliers cancel at short notice, or a larger number of them do. And precisely this data would then be generated so that the agent knows about it.

Fabian Heinrich (08:43)
If you now look at the technical side, what would be the use cases where you would say a synthetic data booster makes a lot of sense and where I could generate a lot of value with it?

Dr. Klaus Iffländer (09:00)
Yes, there are a few. For example, imagine you have a chatbot in Procurement where certain enquiries are processed or certain orders are handled. And it's an interaction between the buyer and the supplier.

Then you could take this data and generate more. Or maybe you don't have any right at the start. Then you could generate such dialogues. Because imagine, just as you would do in ChatGPT, that you say: dear ChatGPT, imagine that you are now a buyer and you are in dialogue with a supplier. How does the dialogue go? And then you simply generate hundreds of such dialogues and feed them back to your new agent, which is currently being developed and trained, to teach it what happens in certain situations or how it should answer certain questions.
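
A minimal sketch of this prompt-based generation, assuming the OpenAI Python client is available; the model name, the prompt wording and the number of dialogues are placeholders, and any chat-capable LLM API would work in much the same way.

```python
from openai import OpenAI

client = OpenAI()  # assumes an API key in the OPENAI_API_KEY environment variable

PROMPT = (
    "Imagine you are a buyer in procurement talking to a supplier about a delayed "
    "delivery. Write a short, realistic dialogue of six to eight turns."
)

def generate_dialogues(n=100, model="gpt-4o-mini"):
    """Generate n synthetic buyer-supplier dialogues to use as training examples."""
    dialogues = []
    for _ in range(n):
        response = client.chat.completions.create(
            model=model,                       # placeholder model name
            messages=[{"role": "user", "content": PROMPT}],
            temperature=1.0,                   # higher temperature -> more varied dialogues
        )
        dialogues.append(response.choices[0].message.content)
    return dialogues
```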

Fabian Heinrich (09:57)
Yes, I mean, this is of course extremely helpful for us as a software or agent provider, because we don't need to collect millions of training data points somehow. Based on various real data, we can of course...

We can create replications using the synthetic data and train these agents in a very powerful way. That's why using this synthetic data to generate very intelligent vertical agents is of course an incredible value driver for us, or as you called it at the beginning, a booster.

Dr. Klaus Iffländer (10:45)
Exactly, exactly. Of course, you already have various data available, but it is often the case that certain cases are underrepresented or it is difficult to obtain product data for them.

You already have a lot of products on the platform, and yet in certain areas there are few stored product specifications or the like. In such cases, you can also generate data, or compliance documents. Especially when new regulations are introduced. Last year, for example, we had the German Supply Chain Due Diligence Act, and there is always a certain amount of uncertainty in the industry as to what exactly the documents should look like and how exactly they should be covered.

And you can also generate data for things like this in order to ensure software compliance or even coverage of use cases.

Fabian Heinrich (11:41)
Yes, if you think about it a bit, you could also argue that there is no longer such a David-versus-Goliath difference in terms of data. Until now, people have always said: okay, these are providers who have been on the market for 20 or 30 years, they have all the data sovereignty, they benefit from the data. Can I now help myself with synthetic data and, even though I haven't been on the market for 25 years, train my agents just as well?

Dr. Klaus Iffländer (12:16)
Yes, exactly. So in principle, you can compensate for this difference in size. Of course, it's possible that in 20 years something will have happened that you couldn't anticipate using an algorithm. But I think these differences are marginal.

Fabian Heinrich (12:37)
Yes, very exciting. I mean, of course, this also comes with technical challenges, risk factors or, let's say, problems in implementation, in how I now bring a vertical agent up to that level via synthetic data. Perhaps you could shed some more light on that.

Dr. Klaus Iffländer (13:03)
Yes, so of course the requirements for creating this synthetic data differ a lot. In the simplest case, as I said, you just enter a prompt in ChatGPT and out comes the data you wanted to have generated. But it's perhaps different with quantitative data, where certain patterns or distributions play a role and represent the added value.

Then you really have to programme and create certain algorithms that pay close attention to such patterns and distributions when generating data, so that the generated data exhibits them too and can then be used as training data for precisely this purpose. And then there are even more sophisticated methods such as generative adversarial networks. Here, the first step is to take a generator and generate random data.

Fabian Heinrich (14:09)
So the generator generates my encyclopaedias from random data, so to speak.

Dr. Klaus Iffländer (14:14)
Exactly, if you now synthetically... Exactly, but then there is a counterpart that tries to distinguish whether this data is really synthetic or whether it is real. It classifies the generated data accordingly. And if...

If it recognises the generated data as fake, the generator receives this feedback and gets better as a result. And the opponent, which tries to tell the two apart, also gets better, because it recognises the differences more and more reliably. So both systems keep improving. And as a result, very realistic data is generated, because the initially very random output becomes more and more precise over time through this feedback, and thus closer and closer to the real data, so to speak. It's a very sophisticated model and a bit more complex to set up, but it delivers very good results.
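
For readers who want to see the mechanics, here is a compact sketch of that generator/discriminator loop, assuming PyTorch is available. The layer sizes, learning rates and the stand-in "real" data (random 30-day return windows) are invented for illustration; a real setup would train on actual procurement data.

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 30                   # e.g. 30-day windows of price returns

generator = nn.Sequential(                      # turns random noise into candidate data
    nn.Linear(latent_dim, 64), nn.ReLU(),
    nn.Linear(64, data_dim),
)
discriminator = nn.Sequential(                  # the "opponent": real vs. synthetic
    nn.Linear(data_dim, 64), nn.LeakyReLU(0.2),
    nn.Linear(64, 1), nn.Sigmoid(),
)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
loss_fn = nn.BCELoss()

real_data = torch.randn(512, data_dim) * 0.01   # stand-in for real return windows

for step in range(1000):
    # Train the discriminator: label real windows 1, generated windows 0.
    noise = torch.randn(64, latent_dim)
    fake = generator(noise).detach()
    real = real_data[torch.randint(0, len(real_data), (64,))]
    d_loss = loss_fn(discriminator(real), torch.ones(64, 1)) + \
             loss_fn(discriminator(fake), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Train the generator: try to make the discriminator label fakes as real.
    noise = torch.randn(64, latent_dim)
    g_loss = loss_fn(discriminator(generator(noise)), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

synthetic_windows = generator(torch.randn(10, latent_dim)).detach()  # freshly generated samples
```

The feedback loop described in the episode is exactly these two alternating training steps: each one makes the other's job harder, which is what pushes the generated data closer and closer to the real distribution.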

Fabian Heinrich (15:25)
Yes, and if you project that into the future, what do you think the implications are? What implications does that have for the software market, but also, above all, for our topic of purchasing?

Dr. Klaus Iffländer (15:40)
Yes, as you say, it equalises the differences in size of the software providers, because many companies are now able to do so.

Fabian Heinrich (15:49)
So the argument, okay, Ariba now has a data advantage, they've been around since 1994, that no longer exists, because Mercanis, which has been around since 2020, can also benefit from a data pool via synthetic data that trains the agents to the same extent.

Dr. Klaus Iffländer (16:11)
Exactly, and the question is rather, which company makes the most of it? So who is thinking about the future and developing contemporary vertical agents that really scale the purchasing function digitally, as if they were additional digital employees? I believe that the competition will definitely move in that direction.

Fabian Heinrich (16:34)
That's actually quite exciting in connection with our last episode, where we said that sooner or later, in the next few years, most software players will be cannibalised, because all I really have is the system of record, or the databases, and then the user interface. Now in that context, if I have the mindset to cannibalise myself, I've got a kind of...

...level playing field, because the data advantage that the competition has built up over the years is something I can equalise with the booster of synthetic data.

Dr. Klaus Iffländer (17:10)
Exactly. So that's a development that I think will come, that it will be a different kind of competition. For the users, of course, it's a huge advantage because the agents are becoming more and more capable. They are being trained with better and better data and are then really in a position to conduct price negotiations, for example.

This is of course a huge step forward for the entire industry and the entire purchasing function.

Fabian Heinrich (17:41)
Yes, I think it's just as exciting for all Chief Procurement Officers or anyone who wants to digitise to look at who is already using synthetic data, who, let's say, is taking their agents to the next level.

And actually, what you can take away here is that the old-established players, even if they have two or three decades of experience, no longer have an experience advantage, because that experience advantage was, or is, the data, and the new players can now equalise it with the technology of synthetic data. So from that point of view, Klaus, once again extremely exciting, thank you for bringing the topic closer to us.

Thank you very much, we are already looking forward to the next episode with you, which will be about inference-time compute and reasoning. Thank you very much.

Dr. Klaus Iffländer (18:39)
Thank you, see you soon.

Also available on Spotify and Apple Podcasts.
