Once again, ISIT brings the latest research and business practices from around the world to the region. Once again, ISIT offers excellent workshops and forums, to be held in Dolenjske Toplice, Slovenia. And once again, ISIT welcomes you with this year's conference interview.
Connectivity is the buzzword, the word that is rapidly changing our world. Social networks, new societies, ever-growing amounts of data, global information networks, access to diverse production models, communication paths, interactive communities… these are only a few of the phenomena through which ICT technologies are leading us into the future.
But what kind of future?
We talked about this with our invited keynote speakers at the ISIT 2011 event. This year we warmly welcome Simon Fischer, research director of Rapid-I, a company that grew out of academia and today produces one of the most recognized open source tools for data mining; Marko Bohanec, senior researcher at the Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana; and Natasa Przulj, researcher in computing at Imperial College London.
Simon, Marko and Natasa, a warm welcome to the interview. As we all know, modern technologies produce new challenges. We are facing challenges that were impossible to imagine a decade ago, and at the same time we are creating new ones that we can hardly imagine today. Consider only the amount of data that we as a society collect and store: it doubles every eighteen months. Or consider social networks, which have redefined the way people find and share information; they have provided a platform for a new wave of applications, and their impact has even spilled over into this year's revolutions in the Middle East. Not to mention the Internet itself, which has evolved as a layered system, where work on its reasoning layer has already started with the initial efforts on the Semantic Web. All of this is closely connected with your research and your everyday activities, so let me ask you some questions that might interest us before we have the pleasure of meeting you at the conference.
Let's begin with you, Simon. You established the innovative high-tech company Rapid-I through your research work at the University of Dortmund. Could you tell us more about how it all began, and how Rapid-I is facing the growth of digital data today? Your business model is, if I may say so, one of the most modern and sustainable: an open source model built around your solution, which has a large community and is gaining popularity in many research areas.
Simon: Well, RapidMiner began as “Yale”, a project started at the AI group of Katharina Morik back in 2001 to replace the then usual practice of running data mining experiments as a collection of Perl scripts, a rather tedious endeavour implemented over and over again by researchers around the world. At that time, only Ingo Mierswa, Ralf Klinkenberg and myself were working on this project. It soon grew into something bigger, extending the functionality required for a typical scientific workflow – load data, cross-validate a learning algorithm, output performance – to include operators that transform data in manifold ways and create very complex workflows, e.g. loops, branches, and optimization operators. In fact, these ETL operators account for 95% of the work in real-world workflows, where data is never in a format you could feed directly to a learning algorithm.
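The workflow Simon describes – load data, transform it, cross-validate a learning algorithm, report performance – can be illustrated with a short, self-contained Python sketch. The raw records, the min-max ETL step and the nearest-mean "learner" below are all invented for illustration; a real process would use RapidMiner operators rather than hand-written code:

```python
# Illustrative raw records: an ETL step parses and normalizes them
# before any learning happens (the bulk of the work Simon mentions).
raw = [("5.1;3.5", "A"), ("4.9;3.0", "A"), ("6.7;3.1", "B"),
       ("6.3;2.5", "B"), ("5.0;3.4", "A"), ("6.5;3.0", "B"),
       ("4.6;3.1", "A"), ("6.9;3.1", "B")]

def etl(records):
    """Parse semicolon-separated strings and min-max normalize each feature."""
    rows = [([float(v) for v in feats.split(";")], label) for feats, label in records]
    dims = len(rows[0][0])
    lo = [min(r[0][d] for r in rows) for d in range(dims)]
    hi = [max(r[0][d] for r in rows) for d in range(dims)]
    return [([(x[d] - lo[d]) / (hi[d] - lo[d]) for d in range(dims)], y)
            for x, y in rows]

def nearest_mean_predict(train, x):
    """Classify x by the closest per-class mean (a stand-in learner)."""
    by_class = {}
    for feats, label in train:
        by_class.setdefault(label, []).append(feats)
    cents = {label: [sum(col) / len(vs) for col in zip(*vs)]
             for label, vs in by_class.items()}
    return min(cents, key=lambda c: sum((a - b) ** 2 for a, b in zip(cents[c], x)))

def cross_validate(data, k=4):
    """k-fold cross-validation: train on k-1 folds, test on the held-out fold."""
    folds = [data[i::k] for i in range(k)]
    correct = total = 0
    for i in range(k):
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        for x, y in folds[i]:
            correct += nearest_mean_predict(train, x) == y
            total += 1
    return correct / total

data = etl(raw)
print("cross-validated accuracy:", cross_validate(data))
```

The point of the sketch is the shape of the process, not the learner: the ETL stage and the evaluation loop are exactly the pieces that RapidMiner packages as reusable operators.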
It soon became evident that there was a need for training, consulting, and support around Yale, leading to the founding of Rapid-I as a company in 2006 – and also to the change to the somewhat more search-engine-friendly name RapidMiner. As you said, the open source business model is actually a very sustainable one, and it is good to have such a great community as RapidMiner and RapidAnalytics have. Rapid-I also did very well during the recent crisis, partly because the open source model is very attractive to our customers. Furthermore, people are starting to realize that real data mining can add significant value to their business processes – and in particular that there is much more they can do on top of what many of the more standard business intelligence tools do under the label of data mining.
How large is your community, and how do you, as the company's research director, collaborate with it? I visited your conference this year, and it is probably quite an interesting event for Rapid-I, as you gather many researchers from academia around your main software product. This is probably quite unusual for a regular business model, and the phrase "be connected, be open and be local" is probably something you are betting on.
Simon: It's difficult to tell how many users RapidMiner and RapidAnalytics have; we only see the download numbers and the usage statistics. From these we can conclude that we have tens of thousands of users, probably around 100,000. And yes, the RapidMiner Conference and Community Meetings, which we have now organized twice, were both great successes. They are a great opportunity for us to get in touch with our community; it is exciting to talk with them, learn about their ideas and plans, and look at the amazing things they do with RapidMiner and RapidAnalytics. I'm very much looking forward to next year's RCOMM, which will take place in Budapest.
As a research-performing enterprise, we are also in close contact with many researchers through various projects, and we try to keep in touch with academia by regularly attending scientific conferences, both international – like ECML/PKDD or now ISIT – and local, like the German KDML. We are also building new means for our users to share their results with others. One of them is the Rapid-I Marketplace (marketplace.rapid-i.com), where developers can host their RapidMiner extensions and make them available directly from within RapidMiner. Another is our connection to myExperiment, where you can share your processes.
Your talk at ISIT will be very interesting, as you will explain how the processes created by the community can be used: sharing knowledge and experience within the community, where that experience can in turn be used to build better tools for designing processes. I believe that effects like this, growing inside a community, can influence research as well as business.
Simon: Yes. One way of sharing knowledge and expertise is via the processes you design in RapidMiner. If shared on myExperiment, these processes can be re-used by other researchers who face similar problems. However, there's more to it than looking at other people's processes one by one and seeing if there's anything you can learn from them. In fact, much of the knowledge contained therein can be extracted and used for meta mining and, eventually, for building assistants and recommender systems that help you design your own processes, tailored to the specific needs of your data set. The purpose of these assistants is that you don't need to know, or try out, whether Naïve Bayes or an SVM is better suited for your problem: the system will know it.
As data mining is increasingly used by domain experts and scientists who have little or no training in data mining and statistics, it becomes more and more important to lower the barrier to using data mining as a technique for these users. And that is where the community helps – partly through assisting each other, and partly simply by allowing us to use their shared implicit knowledge and turn it into tools that help others use RapidMiner.
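The core idea behind such an assistant can be sketched in miniature: evaluate candidate learners by cross-validated accuracy and recommend the winner automatically. The two stand-in learners and the toy data below are invented for illustration (real assistants compare learners like Naïve Bayes and SVMs, and exploit meta-knowledge mined from many shared processes rather than re-running everything):

```python
def majority(train, x):
    """Baseline: always predict the most frequent training label."""
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)

def one_nn(train, x):
    """1-nearest-neighbour by squared Euclidean distance."""
    return min(train, key=lambda r: sum((a - b) ** 2 for a, b in zip(r[0], x)))[1]

def cv_accuracy(learner, data, k=5):
    """Cross-validated accuracy of a learner on a labelled data set."""
    folds = [data[i::k] for i in range(k)]
    hits = [learner([r for j, f in enumerate(folds) if j != i for r in f], x) == y
            for i in range(k) for x, y in folds[i]]
    return sum(hits) / len(hits)

# Simple, clearly separable toy data (illustrative only).
data = [([i, i % 3], "low") for i in range(10)] + \
       [([i + 20, i % 3], "high") for i in range(10)]

# The "assistant": score every candidate and recommend the best one.
candidates = {"majority baseline": majority, "1-NN": one_nn}
scores = {name: cv_accuracy(f, data) for name, f in candidates.items()}
best = max(scores, key=scores.get)
print("recommended learner:", best)
```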
Marko, let's stay a little longer with data analysis and data mining. It is probably not necessary to write that you are one of the best-known researchers in this area, in Slovenia as well as abroad. You have developed a number of decision support tools and systems during your rich research career; the DEXi tool, to mention only one example, is very well known among our students, and I dare say among Slovenian students in general. What is your view on the information analytics tools mentioned earlier, and on open source, which has emerged as one of the most important IT movements in recent times? I am sure that researchers at least can hardly imagine today's world without this phenomenon, which was started by Richard Stallman.
Marko: If you mean data mining and data analysis tools, I believe they are indeed indispensable problem-solving and research tools in today's complex world, which is overwhelmed with data and information. Among these, RapidMiner is one of the most prominent and widely used. I am happy to say that we at the Jožef Stefan Institute also contribute in some way to the development of RapidMiner: in the context of the EU project e-LICO (An e-Laboratory for Interdisciplinary Collaborative Research in Data Mining and Data-Intensive Science, http://www.e-lico.eu/), we develop methods for the evaluation and ranking of RapidMiner's workflows, that is, the sequences of sub-tasks needed to carry out a given data-mining task.
This brings us directly to the question about open source. Without open source, such large-scale international collaboration would be very difficult, if not impossible. The development of complex software such as RapidMiner requires an integration of hundreds of methods and algorithms developed worldwide. In my opinion, international collaboration based on open source development has, at least in this area, performed much better than centralized and controlled development carried out by commercial companies. Collaboratively developed tools in the research community, such as WEKA, Orange and RapidMiner, are very successful: all of them are powerful, free, easily accessible and growing.
Your talk at ISIT will also present some interesting research projects, for example cross-domain literature mining, implemented at your institute. Can you imagine future services that will integrate this kind of research, and how do you think this could influence new ways of communicating? Already today, early semantic web search engines offer new ways of searching for facts in the digital world. Can we expect a new way of searching for pieces of "knowledge" instead of pieces of "information"?
Marko: Sure, as a researcher I am expecting a lot of new services in the future. I still remember how difficult it was, not so long ago, even to find and review research literature – spending hours in the library, waiting weeks for requested journals, photocopying papers, etc. Now this is all searchable and clickable on the Internet, which is an immense improvement. And it continues to improve, for example through better and more “intelligent” search methods, automatic translation, etc. And yes, methods such as the mentioned cross-domain literature mining will bring more and more meaning, semantics, knowledge, “intelligence” – you name it – into the process. However, I would also be cautious and not expect too much. As it seems today, new approaches mainly rely on the sheer computing power of machines, which is useful and helps in dealing with a lot of data (for example, through carrying out a word-by-word comparison of thousands of research papers), but dealing with “real knowledge” in human terms remains inherently difficult and is still a long way off.
Last but not least, you are strongly involved with machine learning algorithms. There are predictions that by 2030 the power of computers will exceed the power of the human brain. At least to me this prediction does not seem serious; I cannot take it for real, especially (this will perhaps not sound very scientific) when we think of the popular claim from neuroscience that we use only 3% of our brains. Now seriously, what is your vision of machine learning algorithms in our everyday life in the near future? Can you give us some fresh examples that we might see alongside existing services, for instance on smart phones?
Marko: I have always been reluctant to compare the human brain with computers. It is like comparing the human arm with a bulldozer; I believe they are incomparable. In problem-solving, I prefer to view the brain and the computer as complementary partners that together can solve a problem much better than each one alone. Furthermore, the human brain is an awesome organ which, in my opinion, has “something” that is still unachievable by computers, regardless of their computing power. All the promising developments of artificial intelligence over the last 50 years, such as expert systems, neural networks, machine learning algorithms and many others, are still unable to deliver that “something” which we would recognize and understand as true intelligence. Computers are powerful, they are useful, they are better than the human brain in many respects, but they are not as flexible and intelligent as the human brain. Also, in relation to machine learning, the ability of people, especially children, to learn is amazing and far outperforms any artificial system.
Machine learning is already part of existing services and will continue to contribute to new ones. As a stand-alone tool or product, I see machine learning algorithms as somewhat limited to research and data analysis; nevertheless, these applications can have a huge impact on science, economy and management, to name just a few. For common everyday use, however, machine learning algorithms are and will be mostly embedded in other systems and services, not directly visible to the users. Computer games, for example, are already full of machine learning and other artificial intelligence algorithms, which improve their flexibility and adaptability. As for more useful examples for the future, I expect better and better solutions for monitoring and managing complex systems, from car driving to exploring deep space with space probes. I am aware of many current projects that attempt to use machine learning to create better services in medicine and health care, for example in health monitoring and alerting systems for elderly people. There is also huge interest in real-time analysis of vast amounts of unstructured data, for example in the banking sector, in order to follow and analyze worldwide news and detect important trends and events. The list is almost endless. I also see no real limitation on using all these advanced new methods and services on smart phones and handheld devices: any lack of computing power is easily compensated by current cloud computing technology and increasingly better network connections.
Natasa, your research in ICT is applied to biology. As we all know, computer science made possible the breakthrough of the Human Genome Project. Almost a decade later, scientists are finding that the genome holds more complexity than many of them had imagined, making it difficult to isolate the functions of the three billion DNA units, or base pairs, whose sequence the project determined. Can you explain some of the milestones that have been achieved in this area since 2003, when the Human Genome Project was concluded?
Natasa: We have seen an explosion of genomic as well as post-genomic data within the past decade. Genomes have been sequenced for a number of organisms. They have been compared and aligned to infer phylogenetic relationships between genes and species. We are building computational models and tools to make sense of these data and extract new biological knowledge. In addition, new biotechnologies have made large amounts of post-genomic data, including transcriptomic, metabolomic and proteomic data, increasingly available. Both genomic and post-genomic data sets are too vast to comprehend without the use of modeling and computation. In simple terms, genes produce proteins, and it is the proteins that are the main workhorses of the cell; they are involved in all cellular processes and are vital for the normal functioning of a cell and an organism. While sequence data can be viewed as “linear,” post-genomic data deals with relationships between biological entities, such as genes or proteins, which are most often represented using networks. We are still in search of new data mining algorithms that will uncover new biology even from relatively simple-structured, “linear,” genomic data. Mining network data is far more computationally difficult. For example, while aligning sequences is computationally easy (solvable in polynomial time), aligning biological networks is provably computationally intractable (NP-hard) and hence has
to rely on approximate solutions (heuristics) that are currently being sought. Another challenge is the integration of many different and large genomic and post-genomic data sets to get a complete picture of a cell. The goal in addressing all these challenges is to improve biological understanding and contribute to the design of new drugs.
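The contrast Natasa draws can be made concrete. Global sequence alignment has a classic polynomial-time dynamic programming solution (Needleman-Wunsch), sketched below for two short DNA strings; no analogous exact, efficient algorithm is known for aligning networks. The scoring scheme (+1 match, -1 mismatch or gap) is illustrative:

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global sequence alignment score in O(len(a) * len(b)) time."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            score[i][j] = max(diag,                 # align a[i-1] with b[j-1]
                              score[i-1][j] + gap,  # gap in b
                              score[i][j-1] + gap)  # gap in a
    return score[n][m]

print(needleman_wunsch("GATTACA", "GATCA"))  # -> 3
```

Each cell is filled once from three neighbours, which is what makes the sequence problem tractable; for network alignment no comparable subproblem structure is known, which is why heuristics are used instead.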
Your talk at ISIT will address network algorithms. As we all know, "network" has been the buzzword of recent years, mostly due to the social network trends that are changing our society. There is probably a lot of common ground in the algorithms that science uses for social computing and for bioinformatics. Can you explain briefly which problems you use network algorithms for when analyzing bioinformatics data, and what network topology could mean for research in biology in the future?
Natasa: A “network” (also called a “graph”) is a mathematical object that is commonly used to represent data sets with binary relationships between objects, such as friendships between individuals, human contacts that influence the spread of infectious disease, collaborations between scientists, social hierarchies in organizations, etc. A field of mathematics called “graph theory”, which studies networks in a mathematically rigorous way, has existed for about three centuries, and hence the mathematics of networks is often well understood. As I mentioned above, in the past decade large amounts of biological network data became available due to advances in experimental biology. One such data set consists of protein-protein interaction (PPI) networks, in which nodes represent proteins and links between nodes represent interactions between them. Since many serious diseases, including cancer, are caused by disruptions in PPI networks, and since drugs act on proteins, it is of crucial importance to understand these networks. At present we are still collecting the data, and our experimental observations are noisy and incomplete, but we have already started constructing algorithms and models to infer biological knowledge from these data. The expectation is that network data will be at least as useful as genetic sequence data in uncovering new biology and improving biological understanding and therapeutics.
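A minimal sketch of the representation Natasa describes: a PPI network stored as an adjacency list, with two of the simplest topology statistics (node degree and per-node triangle counts, the smallest instance of the local subgraph counts used in network biology). The protein names are made-up placeholders, not real PPI data, and actual topology analyses are far more elaborate:

```python
from itertools import combinations

# Toy interaction list (protein names are hypothetical placeholders).
interactions = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"),
                ("P3", "P4"), ("P4", "P5"), ("P1", "P4")]

# Adjacency-list representation: each protein maps to its interaction partners.
adj = {}
for u, v in interactions:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

# Node degree: the simplest topology statistic.
degree = {node: len(nbrs) for node, nbrs in adj.items()}

# Triangles per node: how many pairs of a node's neighbours also interact.
triangles = {node: sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
             for node, nbrs in adj.items()}

print("degrees:  ", degree)
print("triangles:", triangles)
```

Highly connected or triangle-rich proteins are exactly the kind of topological signal that has been linked to essentiality and disease in real PPI networks.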
Thank you all for being with us in this interview. We are looking forward to many more discussions and opportunities at ISIT 2011.