When looking to invest in a big data project you need to consider a wide range of issues. The following draws on my experience as a database designer, a developer of large-scale data-driven systems and an entrepreneur utilising the data.
From my experience, for the investment you're looking at, you need to answer the following:
- The source of the data. How do you get it, and do you get to keep it? This can be a key cost factor in your business model.
- Collection time frame. How up to date is the information you're getting and keeping? This is a key factor in how much, and where, you need to invest.
- What does the solution do with the data? There are many things you can do with data; making it accessible and picking the right access strategy is critical for early adoption and long-term growth.
- How is it being monetised? The path from capturing and owning a large amount of data to making money from it is not always straightforward. Picking the right business model, supported by the three previous questions, will be the key deciding factor in the success or otherwise of the venture.
So, let's look at these in a little more detail.
Source of data.
Where does the solution get its data from? Does it have its own “probes” and devices capturing the data itself?
Does it harvest from the public internet? Where are all the retail outlets in a city? Google captures this now from every website it indexes.
Do you need to be handed the data? An associate of ours takes patient data from large hospital systems to analyse it. The same group used to analyse sonar data to find good oil drilling sites in the ocean. This is good because the collection isn't your concern; it's an issue because you have to make sense of whatever is available, and each client is different, so your investment considerations change.
Do you need to buy the data on an ongoing basis? With paywalls and private data there is a huge market in having access to private knowledge. Hedge funds, medical research and other industries spend a lot of money compiling datasets that no one else has. This is a good model IF you have deep pockets to buy the data and a high-value-add monetisation scheme. Usually it involves key people who know what to look for, and enabling them to "follow their gut" by providing tools that help them understand what is below the surface. One of the more obvious issues is that most people wouldn't know what to do with the data if they got their hands on it; it is only valuable to a small number of people, so you need to employ them and keep them long term.
You need a lot of upfront investment and good ongoing cashflow to keep ensuring the supply of data. This is costly if your first monetisation model doesn't work or your adoption rate is too slow, so be selective in the markets you target.
Collect the raw data itself? These systems typically collect details from users, or have sensors or probes that submit data back to the system to be collated. This is generally a cheaper model, though you do have to pay for the transfer and storage of the data; at significant scale this requires a lot of good engineering work put in early to ensure a reduced operating cost long term.
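One of the early engineering decisions that drives that long-term transfer and storage cost is batching: shipping readings in compressed batches rather than one message per probe reading. A minimal sketch of the idea (the collector class, batch size and in-memory "storage" are all illustrative assumptions, not a description of any particular system):

```python
import gzip
import json

class ProbeCollector:
    """Buffers probe readings and ships them in compressed batches,
    cutting per-message transfer overhead. Illustrative sketch only."""

    def __init__(self, batch_size=100):
        self.batch_size = batch_size
        self.buffer = []
        self.shipped = []  # stands in for remote storage

    def record(self, probe_id, value, timestamp):
        self.buffer.append({"probe": probe_id, "value": value, "ts": timestamp})
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # Compressing a whole batch is far cheaper than sending
        # many tiny payloads, one per reading.
        payload = gzip.compress(json.dumps(self.buffer).encode())
        self.shipped.append(payload)
        self.buffer = []

collector = ProbeCollector(batch_size=3)
for i in range(7):
    collector.record("sensor-1", 20.0 + i, 1700000000 + i)
collector.flush()  # ship the partial final batch
```

The same shape applies whether the "probes" are IoT sensors or user-facing apps; what matters is making the batching decision before the data volume makes retrofitting expensive.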
Harvest the data?
These systems can be amazing: there is a hedge fund that monitors Twitter and mines the tweets to gauge "market sentiment" about its specific investments.
Alternatively, the system works like Google, crawling the existing web or selected websites to capture data from other parties. This is good if there are many parties; if there are only one or two, then you have to consider the risks around what you do if the source of data dries up. E.g. taking data from an established source like Google search or a government weather station is fine, but if you're relying on less stable sources then you will have a few problems:
- Over time you will run out of data sources as the trends and interests of the population change.
- Each supplier may change, upgrade or modify their data provision without warning; there is a constant update and maintenance process to factor into your model.
- If the data is too available and too easy to get, the chance of competitors is high, and success becomes a matter of your service achieving a good adoption rate.
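The second risk above, suppliers changing their data provision without warning, is one you can at least detect early. A small sketch of the idea: validate each harvested record against the shape you built for, and fail loudly on drift rather than quietly storing bad data (the field names are hypothetical; real harvesters version these checks per source):

```python
import json

EXPECTED_FIELDS = {"id", "name", "price"}  # the shape we built against

def harvest_record(raw):
    """Parse one record from an upstream source, refusing to guess
    when the supplier silently changes their format."""
    record = json.loads(raw)
    missing = EXPECTED_FIELDS - record.keys()
    if missing:
        # Surface the drift immediately so the maintenance process
        # the bullet above describes can kick in.
        raise ValueError(f"upstream schema changed, missing: {sorted(missing)}")
    return {k: record[k] for k in EXPECTED_FIELDS}

good = harvest_record('{"id": 1, "name": "Widget", "price": 9.95}')
try:
    harvest_record('{"id": 2, "title": "Renamed field"}')
except ValueError as e:
    drift = str(e)
```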
What is the collection time frame?
The timeframe will impact the amount of investment and the focus of that investment.
The quicker the collection timeframe, the higher the investment in automation and tooling you generally need to budget for; the counterpoint to this is the investment in key people, knowledge and processes around working with data and clients.
- Is it "Live"? Be careful: this has many different meanings depending on who is talking about "live". When a software developer says "Live" they are referring to "to the millisecond" reaction-time live processing: think aircraft control, self-driving cars, call centre phone routing systems, patient monitoring systems, etc. It's happening live; don't drop the ball. These are typically highly complex and take a large amount of specialist engineering. Expect a high price tag for the cost of development and support, but the upside can be equally large.
- If it is not one of these "truly live" systems then it is typically a "within a few seconds or even minutes is good enough" fault-tolerant or non-mission-critical system. Twitter is a good example here: if a tweet is 2 to 5 seconds late, no one notices. These are much more approachable for a generalist developer, but may have data scaling problems (much like Twitter). There are engineering patterns that can be applied to manage the risks, and knowing you have these covered is important.
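One such pattern is a bounded buffer that sheds the oldest items under load, so a slow consumer degrades gracefully ("a few seconds late") instead of falling ever further behind. A minimal sketch, not any specific system's implementation:

```python
from collections import deque

class BoundedFeed:
    """Near-live feed with a bounded buffer: under overload the
    oldest events are dropped and counted, rather than letting
    the backlog (and latency) grow without limit."""

    def __init__(self, capacity=1000):
        self.buffer = deque(maxlen=capacity)  # deque drops oldest on overflow
        self.dropped = 0

    def publish(self, event):
        if len(self.buffer) == self.buffer.maxlen:
            self.dropped += 1  # track shed load for monitoring
        self.buffer.append(event)

    def consume(self):
        return self.buffer.popleft() if self.buffer else None

feed = BoundedFeed(capacity=3)
for i in range(5):
    feed.publish(f"tweet-{i}")
# capacity is 3, so the two oldest events were shed
```

Knowing which events you are allowed to drop (and measuring how many you do drop) is exactly the kind of risk coverage the paragraph above is pointing at.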
- The far end of the spectrum is being sent large chunks of one-off datasets. An example we have is every transaction line item for two years from every store of a large retail chain; their goal was to work out "who is my best customer, who should I include in our top-tier loyalty program?". This end of the spectrum is much harder to plan a consistent set of high-end technology for; your investment is in a generalist data manipulation and visualisation system and, primarily, in your staff's capability to work with the data and the client in order to answer the questions that arise.
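The core of that "who is my best customer" question is an aggregate-and-rank over the transaction line items. A toy sketch with invented data (the real dataset was two years of line items; customer IDs, stores and amounts here are made up):

```python
from collections import defaultdict

# Hypothetical line items: (customer_id, store, amount).
line_items = [
    ("cust-1", "store-A", 25.00),
    ("cust-2", "store-A", 120.00),
    ("cust-1", "store-B", 75.00),
    ("cust-3", "store-B", 10.00),
    ("cust-2", "store-C", 60.00),
]

def top_customers(items, n=2):
    """Total spend per customer across all stores, ranked highest
    first: the candidates for a top-tier loyalty program."""
    totals = defaultdict(float)
    for customer, _store, amount in items:
        totals[customer] += amount
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

best = top_customers(line_items)
```

The hard part in practice is not this aggregation but the surrounding work: cleaning whatever format the client hands over and iterating on the question with them, which is why the investment lands on people rather than a fixed technology stack.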
In the middle of the spectrum, between live and single large chunks, there are standard data collection systems (like online stores, membership management, accounting systems, etc.) and the use of data warehouses, dashboards and other KPI tracking, process and alerting technologies to extract ongoing value from the ongoing collection of your normal daily business data. These are generally what the bulk of business intelligence / big data systems are targeting. Such systems can have a huge impact on how a business is run, and investment in the right combination of technologies can make a difference, like the recent News Ltd end-of-year results where they made a higher profit out of lower overall revenues by utilising the right set of technologies.
What does it do with the data?
Collecting and storing data is all very well, but generally the benefits are only seen when you use the data for business decisions, feed it into business processes, or gain a deeper understanding of what is actually happening.
Getting this aspect of the equation right can make or break an investment; this is the value proposition to the customer.
So what can you do with it? Really, anything is the answer; the sky is the limit, but generally you will see patterns within the spectrum.
- Provide it for others to build on top of? Governments are starting to do this with the Open Data policies in the US and Australia (along with elsewhere). Providing access to machine-readable data enables app developers to build services for customers. Done well, the provider of the data can make money; done poorly (e.g. Twitter's early model), the apps sitting on top of the platform made more money than the provider did from the base data.
- Run its own private, ever-improving algorithm? Google Search is the obvious one here: Google harvests all the websites it can, indexes them and allows people to find what they need. No one else knows how they do it; it just works, so customers flock to it. Another example, from our own development, is the automated algorithm that creates and lays out 80% of Australian newsprint daily. A core group of developers know how it works (and that is what we sell), and the companies pay for the service because it saves them a huge amount of time and optimises to their exact requirements.
- Build a product ecosystem on top of the underlying data. Atlassian have done this with JIRA, providing a marketplace of add-ons that utilise the core existing content. Our own development has seen 12 modules making up the newspaper solution, a suite of 8 products being sold in the medical monitoring field around a single data source, and shortly a marketplace of add-ons for tracking, logging and managing your time.
- At the other end of the spectrum, utilise the right tooling to enable an extremely valuable high-end consultancy that can answer key questions about a topic, be it customer segmentation, drug discovery, drug trial analysis, or growth patterns of cities for property investment. The big data enables you to charge each client for an answer to their specific question.
How is the data being monetised?
In the majority of cases the data itself is not the thing that will bring in the money. Usually there is more to do than just capture and store the data; it's what you do with it that counts.
- Advertising? The de facto answer for most solutions I have seen, but unless you have Google's user base it doesn't tend to work so well, so be careful about advertising as the primary revenue stream. As even the likes of Facebook and Twitter discovered, unless you capture a huge market that doesn't mind seeing the ads, you are going to struggle. The alternative, like Stack Exchange, is to capture a very targeted market for whom the ads you show are highly relevant and just what they want to see (sitting side by side with the reason they are at the site).
- Building a product or platform of your own that customers pay for. Every online service you subscribe to and pay money for monthly is a big data system (or simply collects a lot of data) and is subject to the considerations in this article.
- Paying for access to build on top of it or be part of it. App developers sitting on top of the Google ecosystem, the Atlassian Marketplace or the Apple App Store, for example. These are very good models for both the marketplace owner and the contributors to the marketplace; the risk is between launch and getting traction: too many apps and you get lost, too few and people lose interest.
- Consulting on insights and analytics. Skilled and knowledgeable people turning the raw data into knowledge that can be actioned. Clients of ours can tell you which side of the street to open a new store on in a new city, based on the data captured in your home city.
- Use in optimisation algorithms. We automate the layout of every newspaper around Australia using the advertising data and a private algorithm that dramatically improves the revenue and profit per page of newsprint; our live call centre management system had a call routing and placement engine; our work management system distributes work items to the appropriate people. Other companies we know have been able to calculate the exact tonnage of shipping freighters, enabling them to take more cargo and not run aground.
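To give a feel for the shape of these optimisation problems, here is a toy greedy sketch of "maximise revenue within a tonnage limit". The cargo figures are invented, and a real loading engine must also model stability and draught; this only illustrates the knapsack-style core:

```python
# Hypothetical cargo lots: (name, tonnes, revenue).
cargo = [
    ("steel", 400, 20000),
    ("grain", 300, 9000),
    ("cars", 250, 17500),
    ("timber", 500, 15000),
]

def load_ship(items, max_tonnes):
    """Greedy by revenue per tonne, a classic knapsack heuristic:
    take the densest-value lots first while they still fit."""
    chosen, used, revenue = [], 0, 0
    for name, tonnes, value in sorted(
        items, key=lambda x: x[2] / x[1], reverse=True
    ):
        if used + tonnes <= max_tonnes:
            chosen.append(name)
            used += tonnes
            revenue += value
    return chosen, used, revenue

plan = load_ship(cargo, max_tonnes=700)
```

The value in the real systems comes from the data feeding the algorithm, exact weights, exact constraints, so the plan can safely run much closer to the limit than a rule of thumb allows.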
- Selling the data. If you have the medical history of everyone, and analysts can find answers to common ailments, then the data is worth selling on its own (if you legally can sell it), but your list of buyers who can utilise and profit from it themselves is much smaller than the other markets open to you. See Kaggle and kaggle.com/competitions for an example of some of these style engagements.
So, all of the above and more is wrapped up in the term "Big Data". When it comes to an investment, it pays to understand where the data is coming from, how it is being stored, what it is being used for and, most importantly, how it is going to pay for itself and make a return on your investment.