Interview with Michał Mach, entrepreneur, trainer, consultant, co-owner at caltha.eu, president of advisory board at ePaństwo Foundation.
FPP: What is TDM from the point of view of a practitioner?
Michał Mach: My opinion is that everyone was doing TDM before they knew what it was called. Numeric and textual data have been analyzed for patterns ever since they started to be processed by computers, and one could argue that even in those early years this deserved the name data mining. In recent years many new methods using machine learning, artificial intelligence and statistical analysis have made a great leap forward, but good old relational databases (last millennium's 1970s technology!) are still a large part of the whole picture. It comes down to using different tools and different expertise to achieve the same or similar goals.
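To make the point concrete, here is a minimal sketch of the kind of pattern-finding that relational databases have supported since the 1970s: a simple frequency analysis with one SQL aggregation. The table and column names are invented for illustration only.

```python
import sqlite3

# Hypothetical purchase log; in practice this would be a real database table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE purchases (customer TEXT, product TEXT)")
con.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [("alice", "milk"), ("alice", "bread"), ("bob", "milk"),
     ("carol", "milk"), ("bob", "bread"), ("alice", "milk")],
)

# One GROUP BY query answers "which products sell most often?" --
# a basic data-mining question, no machine learning required.
top = con.execute(
    "SELECT product, COUNT(*) AS n FROM purchases "
    "GROUP BY product ORDER BY n DESC"
).fetchall()
print(top)  # [('milk', 4), ('bread', 2)]
```

Newer TDM tools answer far more sophisticated questions, but the goal is the same: finding patterns in stored data.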
And what has changed thanks to the spread of those new tools?
The first thing is completely new possibilities for analyzing enormous data sets. With traditional tools, it would be too complicated to obtain the same results as with TDM tools. The second thing is time – in the old days it was fine to wait a few days or even months for results. Now, analyses often need to be delivered in real time: the very moment data is created, decisions are automatically made based on analytical findings. TDM has made businesses more reactive, closer to what's actually happening in their environment.
For TDM to make sense, large data sets are required – otherwise findings may lead to incorrect conclusions. All industries that traditionally generate or gather large quantities of data (e.g. the Internet, telecommunications, health care, retail) profit from TDM. This kind of analysis allows them to adjust better to their clients' needs, and at lower expense.
But does that mean that in business TDM is an opportunity only for large players?
Not necessarily. In my opinion, TDM has allowed the creation of new business models in which the basic service is free for the end user, while revenue is generated from the analysis of users' data. Somebody has a good idea for a service, it's made available to the public, and when it proves good, the clients and the data come – and so it goes. There are also other models. However, there is a significant risk that in the long run the large players will become larger, and the small ones will have to overcome very difficult barriers to get to the point where they can compete. It comes down to the amount of information one can use for analysis and decision-making, as well as the stability of the data stream that's "flowing in". That is an economic argument for making as much data as possible available and easy to acquire.
And how is the data for analysis gathered?
In business, especially for companies aimed at the mass market, it is quite easy – companies generate the needed data themselves or purchase it from one another. However, large companies rarely want to share their data, because it is part of their assets, so it is not available for new players to use. In the majority of cases that is a good thing, since the data concerns various aspects of their clients' privacy.
The case is completely different when it comes to public data. The administration generates enormous amounts of data but rarely makes it available willingly. And even when it does, the data is often of poor quality and in poor formats. It is also difficult to find comprehensive data sets – the data is often fragmentary and covers only narrow aspects of the whole picture. In Poland, the usual scenario is that a person or an institution files a public data disclosure request concerning specific data in a specific case. The administration shares that information, and it becomes available for further use. However, it is usually only a small portion of a much larger data set. Building a large, comprehensive data set from information scattered across many sources requires a lot of effort.
There are useful data sets being made available, but if you assume that making data available should be the administration's responsibility, the situation is really bad.
Any examples of how the administration shares public data?
Good examples can be found in local administration, e.g. in Gdańsk, a Polish city. Some time ago they made a strategic decision to start opening up the public data they create and own, and they have been consistently implementing this idea ever since. Both the size and the variety of the data sets are growing steadily.
I don't really want to name specific bad examples – everyone can see for themselves. Just look for any public data and there is a 90% chance that you'll encounter at least a few common problems: an unusable format, incomplete data, and so on.
What changes should be implemented to improve the situation?
It would be best if there were a legal obligation to share public data in most cases, without the need to file an application; the application process would be required only in specific cases. Imposing a fixed catalogue of formats to be used when making data available would probably also be useful. However, I feel a bit torn here – personally, I would be very careful about implementing central format standardisation...
It's impossible to have one standard for all types of data. If something is meant to serve every case, it's usually not useful for anything. In turn, if we implement many different standards, the resulting chaos may cause more problems than advantages. On reflection, raw data is sometimes much better for analysis than processed data converted into an imposed standard. In my experience, the first stage of any analytical project is preparing and cleaning the data anyway, so maybe data should simply be made available in the plain format in which the machines that collect it store it? It's hard to tell what the best, holistic solution would be here.
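That first preparation-and-cleaning stage can be sketched briefly. The sample below is invented but typical of raw public-data exports: inconsistent whitespace, mixed date formats, and empty fields; all field names are hypothetical.

```python
import csv
import io

# Hypothetical raw export, as an administration might publish it.
RAW = """name;date;value
 Office A ;2015-03-01; 42
Office B;01.03.2015;
office a;2015-03-02;17
"""

def clean_rows(raw_text):
    """First-stage cleaning: trim whitespace, normalise letter case,
    unify date formats, and turn empty strings into None."""
    reader = csv.DictReader(io.StringIO(raw_text), delimiter=";")
    for row in reader:
        name = row["name"].strip().title()
        date = row["date"].strip()
        if "." in date:  # convert DD.MM.YYYY to ISO YYYY-MM-DD
            day, month, year = date.split(".")
            date = f"{year}-{month}-{day}"
        value = row["value"].strip()
        yield {
            "name": name,
            "date": date,
            "value": int(value) if value else None,
        }

rows = list(clean_rows(RAW))
print(rows[0])  # {'name': 'Office A', 'date': '2015-03-01', 'value': 42}
```

Every analytical project repeats some variant of this step, which is why publishing raw data is often no worse than publishing data forced into an imposed standard.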
Is it the case today that basically anyone can quickly learn to perform TDM? Does TDM cost much and require specialist skills?
It's easier today than in the past, for sure. Good tools are usually expensive, but there are free alternatives that keep getting better. They used to be really cumbersome and complicated, but the situation is improving. Today's tools are much simpler than before and do not require extremely deep expertise, although basic maths and computer science knowledge helps. The good news is that basic text and data mining can be carried out at relatively low cost and effort nowadays, and my opinion is that it will only become more accessible with time.
And what about access to data from a legal point of view?
People don't think much about data ownership and legal aspects. Maybe the thought that any laws might apply here never even crosses their minds? Especially in the case of public data available for download: nobody considers whether it is legal to download and analyze it, they simply do it. It probably comes from a belief shared by many that if something is available on the Internet for download, then it can be downloaded and used – the presumption of permission.
Personally, when I come across some data that is downloadable and that I would want to use, I simply contact the author and ask for permission. I don't think many people do it this way – not out of ill will, but rather out of a lack of knowledge. It's definitely worth spreading knowledge about the various legal aspects of TDM before it becomes more popular.
Is people's knowledge still insufficient?
Yes. Of course, I do not mean actual practitioners who work with TDM on a daily basis and make their living from it – they are mostly aware of the legal aspects that need to be dealt with. My feeling is that the majority of society has never heard of text and data mining and has no idea that such mechanisms are quite broadly used and have a significant influence on their lives. The key is making the concept of TDM clearer and more transparent to broader groups of people, educating them on what it's used for and how it relates to their everyday choices. My personal belief is that the whole TDM industry can benefit from spreading such knowledge.