Disruption
If, in the age of AI, data is the new oil, then having lots of well-labelled data is clearly a competitive advantage. But if Transformer-based models such as BERT can produce highly valuable embeddings from widely available data in an unsupervised fashion, and can then be customized with relatively small volumes of domain-specific training data, those dynamics change dramatically. If you can keep these models to yourself and offer them as a service, you might skim off enough value to justify a business, assuming the underlying training data is free and widely available (i.e. “free” on the internet).
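To make that premise concrete, here is a minimal sketch of pulling embeddings out of a pre-trained BERT with no labelled data at all. It assumes the Hugging Face transformers library, PyTorch, and the publicly released bert-base-uncased checkpoint; the example sentences and the mean-pooling step are illustrative choices, not anything prescribed by the model.

```python
# Minimal sketch: sentence embeddings from a pre-trained BERT checkpoint.
# Assumes `transformers` and `torch` are installed; `bert-base-uncased`
# is one of the freely downloadable pre-trained models.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

sentences = ["Data is the new oil.", "Embeddings capture meaning."]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the per-token vectors into one embedding per sentence,
# masking out padding tokens so they don't dilute the average.
mask = inputs["attention_mask"].unsqueeze(-1)            # (batch, seq_len, 1)
summed = (outputs.last_hidden_state * mask).sum(dim=1)   # (batch, hidden)
embeddings = summed / mask.sum(dim=1)                    # (batch, hidden)
print(embeddings.shape)                                  # e.g. torch.Size([2, 768])
```

Domain-specific customization is then a matter of fine-tuning that same checkpoint on a relatively small labelled set, rather than training anything from scratch.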
What, then, is the secret sauce? Retrieving the underlying data at scale (in a compliant fashion)? Training the model? IP around the tasks used to pre-train the model (e.g. BERT’s masked-word prediction, and predicting whether two sentences are consecutive)?
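The first of those pre-training tasks is easy to see in action. The sketch below again assumes the Hugging Face transformers library and the bert-base-uncased checkpoint; the prompt sentence is made up for illustration.

```python
# Sketch of BERT's masked-word pre-training objective in action, using the
# `fill-mask` pipeline from Hugging Face `transformers` (an assumed setup,
# not part of the original text). The model predicts the token behind [MASK].
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill_mask("In the age of AI, data is the new [MASK]."):
    print(f"{candidate['token_str']:>12}  score={candidate['score']:.3f}")
```

The second task, next-sentence prediction, works analogously: the model is shown pairs of sentences and learns to classify whether the second genuinely follows the first.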