FedCSIS 2013 Panel: Data Mining in Cyber Age – Opportunities and Limitations
Tuesday, September 10th, 14:00 - 15:30, Room 1.38
Abstract—Data mining can be defined as an automated or semi-automated process of searching through, extracting and analyzing large amounts of data and their relationships in order to obtain useful information and discover new knowledge. Data mining endeavors to analyze data of many different types (structured and unstructured) and from heterogeneous sources. Moreover, contemporary data mining approaches embrace various application and business domains, such as software repositories, business processes, bioinformatics, medical diagnosis, customer behavior, online marketing, etc. This panel addresses the opportunities and limitations of data mining in the Cyber Age.
Keywords—data mining, knowledge discovery, process mining, software mining, text mining, web mining, multimedia mining, IoT mining, big data analytics, business intelligence, soft computing, artificial intelligence, domain knowledge in data mining
I. Background and Relevance
In recent years the significance of data mining has substantially increased, as today’s businesses, governments and society take advantage of information and knowledge in the vast volumes of data available on the Internet, on top of enterprise databases and other data repositories. In today’s Cyber Age the quantity of data created daily and transferred over the Internet oscillates around one exabyte (about 245 million 4.38 GB DVDs), and mobile data traffic has been catching up. Separating the wheat from the chaff and turning this “big data” into useful information, in order to build up new knowledge and contribute to common wisdom, is a foremost challenge. Data mining is a major technology and facilitator in addressing this challenge.
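The DVD comparison can be checked with a few lines of arithmetic. The sketch below assumes that the figure of about 245 million was obtained with binary units (1 EB taken as 2^60 bytes and 1 GB as 2^30 bytes); that reading is an assumption, not something the text states. With decimal units the same calculation gives roughly 228 million.

```python
# Back-of-the-envelope check of "one exabyte is about 245 million 4.38 GB DVDs".
# Assumption (not stated in the text): binary units were used.

# Binary interpretation: 1 EB = 2**60 bytes, 1 GB = 2**30 bytes.
EXABYTE_BINARY = 2**60
DVD_BINARY = 4.38 * 2**30
dvds_binary = EXABYTE_BINARY / DVD_BINARY

# Decimal interpretation: 1 EB = 10**18 bytes, 1 GB = 10**9 bytes.
EXABYTE_DECIMAL = 10**18
DVD_DECIMAL = 4.38 * 10**9
dvds_decimal = EXABYTE_DECIMAL / DVD_DECIMAL

print(f"binary units:  {dvds_binary / 1e6:.0f} million DVDs")
print(f"decimal units: {dvds_decimal / 1e6:.0f} million DVDs")
```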
The 2009 panel of experts on Future Internet 2020, commissioned by the European Commission, identified a few overriding scenarios in the data-driven, Internet-facilitated world of today:
- “Our everyday environments will be context‐aware: systems and devices will be able to sense how, where and why information is being accessed and respond accordingly. The Internet will be our personal global network.” [1, p.4]
- “Relying closely on the Internet with Things, the new Web‐based Service Economy will merge the digital and physical worlds opening up a multitude of niches and value propositions.” [1, p.4]
- “The complex web of services created in the Future Internet requires that privacy and security be built into each service. New businesses will evolve to leverage personal information, making data tracking and ownership a key concern. Important issues also arise in relation to accountability, governance and ethics. All of these create new regulatory spaces.” [1, p.5]
From these three scenarios a more concrete research and development agenda emerges. From the broad perspective of the FedCSIS 2013 Panel, the background questions are:
- How to extend human memory with digital memories of various intelligent assistants to create new business opportunities and to support everyday personal activities?
- How to take advantage of network architectures and data interconnectivity to facilitate multipoint-to-multipoint communication and provision of services?
- How to compete, interoperate and innovate through open platforms?
- How to deal with the dynamic, continuously evolving nature of processes, organizations, and networks?
- How to approach the data ownership and related issues of security, trust, accountability, identification and privacy to gain public acceptance?
- How to leverage data/process mining methods and semantic technologies for distributed and heterogeneous data, information and knowledge management?
II. Agenda and Topics
While the above questions constitute the background for the panel, they are too broad to be exhaustively addressed in a single meeting of experts. Moreover, for the best outcome it is only wise to take advantage of specific expertise of the panel members to set up the agenda and topics. Accordingly, the goals of the panel have been set to discuss the following questions:
Connecting processes and data. [Wil van der Aalst]
- Why are the data mining and machine learning communities not focusing on end-to-end processes?
- How to reveal the desire lines showing what people and organizations are actually doing?
- Should systems and processes be evidence based?
- How to deal with processes that change while being analyzed?
- How to distribute process mining techniques?
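One common starting point for revealing such "desire lines" is to count how often one activity directly follows another within the same case of an event log. The sketch below is only an illustration under assumptions: the function name `directly_follows`, the `(case_id, activity)` log layout and the sample activities are hypothetical, and the log is assumed to be already ordered by time.

```python
from collections import Counter, defaultdict


def directly_follows(event_log):
    """Count directly-follows pairs (a, b) per case.

    event_log: iterable of (case_id, activity) pairs, assumed to be
    ordered by timestamp (hypothetical layout for illustration).
    """
    # Group activities into one trace per case, preserving order.
    traces = defaultdict(list)
    for case_id, activity in event_log:
        traces[case_id].append(activity)

    # Count every pair of consecutive activities within each trace.
    counts = Counter()
    for trace in traces.values():
        counts.update(zip(trace, trace[1:]))
    return counts


# Toy event log with three cases (made-up data).
log = [
    ("c1", "register"), ("c1", "check"), ("c1", "pay"),
    ("c2", "register"), ("c2", "pay"),
    ("c3", "register"), ("c3", "check"), ("c3", "pay"),
]
dfg = directly_follows(log)
for (a, b), n in sorted(dfg.items()):
    print(f"{a} -> {b}: {n}")
```

The resulting counts show which paths are actually taken, including the shortcut from "register" straight to "pay" that a designed process model might not contain.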
Mining software repositories. [Barrett R. Bryant]
- Besides a large amount of textual data (e.g., requirements specifications) that may be mined, there are also many visual data artifacts, such as models expressed in a visual modeling language.
- An interesting subfield of machine learning is grammar inference, the inference of a grammar from strings that the grammar generates.
- Grammar inference may be used to infer meta-models describing software models, when such a meta-model has been lost to software evolution. Furthermore, this type of learning can evolve the models to make the system consistent.
- Grammar inference may also be used to learn domain-specific languages (DSLs) from example programs written in the DSL. This facilitates end-user programming, where a domain expert not versed in programming languages may express the type of programs he/she would like to write and the underlying DSL infrastructure may be inferred.
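As a minimal illustration of inferring a language model from example strings, the sketch below builds a prefix tree acceptor, the automaton that state-merging grammar inference algorithms typically start from before generalizing. It is not the method of the works cited by the panelist; the names `build_pta` and `accepts` and the sample strings are hypothetical.

```python
def build_pta(samples):
    """Build a prefix tree acceptor from positive example strings.

    Returns (transitions, accepting), where transitions maps
    (state, symbol) -> state and state 0 is the initial state.
    """
    transitions = {}
    accepting = set()
    next_state = 1
    for sample in samples:
        state = 0
        for symbol in sample:
            key = (state, symbol)
            if key not in transitions:
                # Unseen prefix: allocate a fresh state for it.
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
        accepting.add(state)
    return transitions, accepting


def accepts(pta, string):
    """Return True if the acceptor recognizes the string."""
    transitions, accepting = pta
    state = 0
    for symbol in string:
        state = transitions.get((state, symbol))
        if state is None:
            return False
    return state in accepting


# Made-up positive examples.
pta = build_pta(["ab", "abab", "abb"])
print(accepts(pta, "abab"))
print(accepts(pta, "ba"))
```

The acceptor recognizes exactly the training strings; a real inference algorithm would then merge compatible states so that unseen but similar strings are accepted as well.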
Towards human consistent and comprehensible data mining. [Janusz Kacprzyk]
- The notion of comprehensibility introduced by Prof. Ryszard Michalski in the early 1980s.
- The need for human consistency in modern data mining systems, where synergistic cooperation between human and machine is one of the most important components.
- The need for natural language in the representation of the results of data mining and knowledge discovery; some natural language technology tools that can be used: computational linguistics, natural language processing / understanding / generation, systemic functional linguistics, etc.
- The need to handle imprecision in natural language; some AI / soft computing solutions.
Evolutionary data mining. [Halina Kwaśnicka]
- Along with media storage becoming cheaper and sensors being widely disseminated, the amount of data collected has been increasing exponentially. However, mere collection and storage of data is much easier than using them as a source of useful information. Advanced data mining techniques are increasingly used for scientific and commercial purposes.
- Due to the complexity of data mining processes, scientists use different heuristics, such as evolutionary computation (EC), which can help at the stages of feature extraction, classification and data clustering, to mention a few.
- EC can be used in isolation, as the main technique, or in combination with other algorithms, such as neural networks or decision trees.
- The question is what the potential uses of EC are at different steps of data mining, such as data preprocessing, dependence modelling, rule-based systems, etc., up to postprocessing of discovered knowledge.
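One way EC can enter the preprocessing stage is feature selection with a simple genetic algorithm over bit masks. The sketch below is illustrative only: `evolve_feature_mask`, `toy_fitness` and the set of "informative" features are hypothetical, and in practice the fitness would come from evaluating a classifier or clustering on the selected features.

```python
import random


def evolve_feature_mask(fitness, n_features, pop_size=20, generations=30,
                        mutation_rate=0.1, seed=0):
    """Search for a good 0/1 feature mask with a basic genetic algorithm.

    Requires n_features >= 2. Returns (best_mask, history), where history
    holds the best fitness seen at each generation.
    """
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        # Elitism: the current best individual survives unchanged.
        next_gen = [population[0][:]]
        while len(next_gen) < pop_size:
            # Binary tournament selection of two parents.
            p1 = max(rng.sample(population, 2), key=fitness)
            p2 = max(rng.sample(population, 2), key=fitness)
            # One-point crossover followed by bit-flip mutation.
            cut = rng.randrange(1, n_features)
            child = p1[:cut] + p2[cut:]
            child = [1 - bit if rng.random() < mutation_rate else bit
                     for bit in child]
            next_gen.append(child)
        population = next_gen
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


# Hypothetical stand-in for a real evaluation: reward three "informative"
# features and penalize the total number of selected features.
INFORMATIVE = {0, 3, 5}


def toy_fitness(mask):
    hits = sum(mask[i] for i in INFORMATIVE)
    return hits - 0.1 * sum(mask)


best_mask, history = evolve_feature_mask(toy_fitness, n_features=8)
print(best_mask, history[-1])
```

Because the best individual is always carried over, the best fitness never decreases from one generation to the next; the fixed seed makes the run repeatable.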
Mining big data. [Pedro José Marrón]
- Data mining is one of the key challenges to solve in the coming years. The amount of information produced every day increases exponentially, and with more devices added every day (Android registers more than 1 million new mobile phone activations per day), things are only going to get worse.
- Without approaches that take into account crowdsourcing, collaborative algorithms and in-network processing, the world will effectively drown in data, making it increasingly difficult to separate noise from real information.
New scientific discoveries obtained by looking up seemingly fully explored empirical data. [Ryszard Tadeusiewicz]
- The discovery process includes research planning, acquiring empirical data, conducting experiments and determining the value of scientific research results. Most of the time, money and effort is spent on data acquisition and experiments.
- Thus, the large amounts of valuable data produced during a given discovery process should not be treated as outdated documentation that will never be analyzed again. Other, less easily visible discoveries can be hidden in such data. They may be quite unrelated to the original aim of the data analysis but still very important.
- This means that in repositories of seemingly fully explored empirical data resting in archives of scientific institutions there exist materials for many new discoveries. Such discoveries can be obtained without full research planning. This is important because in case of truly innovative discoveries we often cannot formulate precise questions.
- Such discoveries need more sophisticated exploration methods in order to deal with data originally selected for answering totally different questions. However, such advanced data mining applications are worth the effort they require.
III. Organization
The panel will be organized as a plenary open-door session with participants of all FedCSIS events in the room. The panel experts will sit behind a long table on the central stage with audiovisual equipment at their disposal as needed. The session will proceed as follows:
- The moderator will define the theme and the aims of the panel and will introduce the panel experts (10 min.).
- The experts will go in sequence and present their panel statements (10 min., 2-4 slides), hopefully touching on difficult, controversial and visionary questions to generate public interest. No questions will be taken from the participants during these presentations.
- In the Questions & Answers session (30 min.) the moderator will request questions from the audience to one or more experts. Each question will be answered by one or more experts, and the remaining experts can also comment.
- At the conclusion the moderator will provide a short summary (5 min.) of the discussion and will invite the panel experts as well as interested members of the audience to write (within one month) short contributions based on the discussion.
- The moderator will collect all contributions and will prepare a draft for a joint paper based on the panel, to be published in a reputable venue chosen by the experts. The moderator can, in consultation with the experts, invite other opinion makers to contribute to the paper. The paper is to be published by the time of the next FedCSIS conference at the latest.
IV. Panel Membership
- Leszek A. Maciaszek is a Professor at Wrocław University of Economics and the Director of its Institute of Business Informatics. He also holds an Honorary Research Fellow position at Macquarie University, Sydney. He is internationally recognized mostly for his work in database technology, software engineering, and systems analysis and design. In each of these three fields he has published books (Prentice-Hall and Addison-Wesley, some editions translated into Chinese, Russian and Italian). He has served as an expert, reviewer, evaluator and advisor to the European Commission, international corporations and government bodies.
- Dominik Ślęzak is a co-founder of Infobright Inc., where he currently works as chief scientist. He is also with the Institute of Mathematics, University of Warsaw. He serves as an associate editor for several international scientific journals, including Information Sciences, Intelligent Information Systems, and Knowledge and Information Systems. He also serves on the editorial board of Springer’s Communications in Computer and Information Science. His interests include Rough Sets, Knowledge Discovery, and Database Architectures. In 2012-2014, he serves as the President of the International Rough Set Society.
Panel Speakers (in alphabetical order):
- Wil van der Aalst is a Full Professor of Information Systems at the Technische Universiteit Eindhoven. He is also the Academic Supervisor of the International Laboratory of Process-Aware Information Systems of the National Research University, Higher School of Economics in Moscow. Moreover, since 2003 he has held a part-time appointment at Queensland University of Technology (QUT). His research interests include workflow management, process mining, Petri nets, business process management, process modeling, and process analysis. Wil van der Aalst has published extensively and many of his papers are highly cited (he has an H-index of more than 103 according to Google Scholar, making him the European computer scientist with the highest H-index). His ideas have influenced researchers, software developers, and standardization committees working on process support.
- Barrett R. Bryant is a Professor and Chair of the Department of Computer Science and Engineering at the University of North Texas. His research interests span theory and implementation of programming languages, formal specification of software systems, model-driven software engineering, as well as mining and integration of distributed heterogeneous software components. He is a member of EAPLS, and a senior member of ACM and IEEE.
- Janusz Kacprzyk is a Professor of Computer Science at the Systems Research Institute, Polish Academy of Sciences. He is a Full Member of the Polish Academy of Sciences and a Foreign Member of the Spanish Royal Academy of Economic and Financial Sciences (RACEF). He is a Fellow of IEEE and IFSA. His main research interests include the use of computational intelligence in decisions, optimization, control, data analysis and data mining. He is the editor in chief of 5 book series at Springer and of 2 journals, and a member of the editorial boards of more than 40 journals. Currently he is President of the Polish Operational and Systems Research Society and Past President of IFSA.
- Halina Kwaśnicka is a Full Professor of Computer Science in the Institute of Informatics and Head of the Artificial Intelligence Division at Wrocław University of Technology, Poland. In 2004-2012 she was Deputy Director for Scientific Research, Institute of Informatics. Her research interests include artificial intelligence, evolutionary computation and hybrid systems. During the last decade, intelligent methods of image analysis have become a very important area of her work. She is the founder and chair of the series of annual international symposia on Advances in Artificial Intelligence and Applications (AAIA), which have been organized regularly since 2006.
- Pedro José Marrón is a Professor and Head of the Networked Embedded Systems group at the University of Duisburg-Essen and since 2012 Director of the European Center for Ubiquitous Computing and Smart Cities (UBICITEC). He is also a Lead Scientist at Fraunhofer FKIE in Wachtberg. His research interests are distributed systems, mobile data management, location-aware computing, sensor networks and pervasive systems. Among several projects, he is coordinator of CONET, the Cooperating Objects Network of Excellence and coordinator of PLANET, an Integrated Project that deals with the deployment of large-scale heterogeneous networks.
- Ryszard Tadeusiewicz is a Professor at the AGH University of Science and Technology in Krakow, Poland. His research interests cover neural networks, computer vision, biomedical engineering, and distance learning. He was elected three times as the Rector of AGH. He is a Member of the Polish Academy of Sciences, a Foreign Member of the Russian Academy of Natural Sciences, a Titular Member of the European Academy of Sciences, Arts and Literature in Paris, a Fellow of the World Academy of Art and Science, and a Member of the European Academy of Sciences and Arts in Salzburg.
References
[1] Future Internet 2020. Visions of an Industry Expert Group, May 2009. European Commission, 2009.
[2] W. M. P. van der Aalst: Process Mining: Discovery, Conformance and Enhancement of Business Processes. Springer-Verlag, Berlin, 2011.
[3] W. M. P. van der Aalst, H. A. Reijers, A. J. M. M. Weijters, B. F. van Dongen, A. K. Alves de Medeiros, M. Song, H. M. W. Verbeek: Business process mining: an industrial application. Information Systems 32(5) (2007) 713-732.
[4] M. Handte, G. Schiele, V. Majuntke, C. Becker, P. J. Marrón: 3PC: System support for adaptive peer-to-peer pervasive computing. TAAS 7(1) (2012) 10.
[5] C. de la Higuera: Grammatical Inference: Learning Automata and Grammars. Cambridge Univ. Press, 2010.
[6] F. Javed, M. Mernik, J. Gray, B. R. Bryant: MARS: A Metamodel Recovery System Using Grammar Inference. Information and Software Technology 50(9-10) (2008) 948-968.
[7] D. Hrnčič, M. Mernik, B. R. Bryant, F. Javed: A Memetic Grammar Inference Algorithm for Language Learning. Applied Soft Computing 12(3) (2012) 1006-1020.
[8] J. Kacprzyk, S. Zadrożny: Protoforms of Linguistic Database Summaries as a Human Consistent Tool for Using Natural Language in Data Mining. IJSSCI 1(1) (2009) 100-111.
[9] H. Kwaśnicka, M. Przewoźniczek: Multi Population Pattern Searching Algorithm: A New Evolutionary Method Based on the Idea of Messy Genetic Algorithm. IEEE Trans. Evolutionary Computation 15(5) (2011) 715-734.
[10] S.-H. Liao, P.-H. Chu, P.-Y. Hsiao: Data mining techniques and applications – A decade review from 2000 to 2011. Expert Systems with Applications 39 (2012) 11303-11311.
[11] G. Piatetsky-Shapiro, C. Djeraba, L. Getoor, R. Grossman, R. Feldman, M. Zaki: What are the Grand Challenges for Data Mining? KDD-2006 Panel Report. SIGKDD Explorations 8(2) (2006) 70-77.
[12] R. Tadeusiewicz: Introduction to Intelligent Systems. In: B. M. Wilamowski, J. D. Irwin (Eds.): The Industrial Electronics Handbook, Vol. 3 (Intelligent Systems). CRC Press, Boca Raton (2011) 1-12.
[13] T. Xie, S. Thummalapenta, D. Lo, C. Liu: Data mining for software engineering. Computer 42 (2009) 55-62.