SciELO - Scientific Electronic Library Online

 


Ingeniería e Investigación

Print version ISSN 0120-5609

Ing. Investig. vol.30 no.3 Bogotá Sept./Dec. 2010

 

CMIN - a CRISP-DM-based case tool for supporting data mining projects

Carlos Cobos1, Jhon Zuñiga2, Juan Guarin3, Elizabeth León4 and Martha Mendoza5

1 Systems Engineer. M.Sc. in Computer Science, Universidad Industrial de Santander, Colombia. Ph.D. candidate in Computer and Systems Engineering, Universidad Nacional de Colombia, Bogotá, Colombia. Full-time tenured professor (Titular category), Universidad del Cauca, Colombia. Researcher, Information Technology R&D Group (GTI), Universidad del Cauca, Colombia. ccobos@unicauca.edu.co

2 Systems Engineer, Universidad del Cauca, Colombia. Programmer, Informática y Gestión S.A., Colombia. Research assistant, Information Technology R&D Group (GTI), Universidad del Cauca, Colombia. jzunigaparedes@unicauca.edu.co

3 Systems Engineer, Universidad del Cauca, Colombia. Programmer, Solsoft S.A., Colombia. Research assistant, Information Technology R&D Group (GTI), Universidad del Cauca, Colombia. jguarin@unicauca.edu.co

4 Systems Engineer. M.Sc. in Systems Engineering, Universidad Nacional de Colombia, Colombia. M.Sc. in Electrical and Computer Engineering, University of Memphis, USA. Ph.D. in Computer Science and Computer Engineering, University of Louisville, USA. Full-time professor (Assistant category), Universidad Nacional de Colombia, Bogotá, Colombia. Researcher, Intelligent Systems Research Laboratory (LISI), Universidad Nacional de Colombia, Bogotá, Colombia. eleonguz@unal.edu.co

5 Systems Engineer. M.Sc. in Computer Science, Universidad Industrial de Santander, Colombia. Ph.D. student in Systems and Computer Engineering, Universidad Nacional de Colombia, Bogotá, Colombia. Full-time tenured professor (Titular category), Universidad del Cauca, Colombia. GTI researcher, Universidad del Cauca, Colombia. mmendoza@unicauca.edu.co


ABSTRACT

This paper introduces CMIN, an integrated computer aided software engineering (CASE) tool based on cross-industry standard process for data mining (CRISP-DM) 1.0 designed to support carrying out data mining projects. It is “integrated” in the sense that it supports all phases of a process. A general overview of how CMIN works is presented first, including a treatment of processes, templates and project management. CMIN's capacity for easily and intuitively monitoring projects is highlighted, as is the manner in which CMIN allows a user to increase knowledge regarding using CRISP-DM or any other process defined in the CASE tool through the help and information presented in each step. Next, it is shown how CMIN can bind new data mining algorithms in runtime (without the need to recompile the tool) to support modelling tasks (based on a Workflow) and evaluate data mining projects. Finally, the results of two evaluations of the tool, some conclusions and suggestions for future work are presented.

Keywords: Data mining, CRISP-DM, CASE tools, workflow, reflection.


Received: July 21st, 2009

Accepted: November 15th, 2010

Introduction

A variety of processes, methodologies and tools have been established in software engineering to standardise software product development and make it simpler. CASE tools are among them; they automate some or all of the steps of these methodologies, and the discipline as a whole is known as computer aided software engineering (CASE) (INEI, 1999). CASE tools help reduce the time required for developing a system, in turn helping to stabilise costs and contributing to quality enhancement (Miren Begoña, 2000). CASE tools further allow an analyst to document and model a system, from the initial definition of requirements through to design, implementation and testing (Miren Begoña, 2000).

A range of software tools is available today to help in carrying out data mining projects (Britos et al., 2005; Kdnuggets, 2005; MetaGroup, 2004). Based on the lists of such tools that appear in MetaGroup (MetaGroup, 2004) and Kdnuggets (Kdnuggets, 2005), the most representative were evaluated, including: Clementine (Khabaza & Shearer, 1995; SPSS-Inc., 2009), Insightful Miner (Insightful-Corporation), WEKA (Holmes, Donkin, & Witten, 1994; University-of-Waikato, 2009), CART (Salford-System, 2009), PolyAnalyst (Mai, Krishna, & Reddy, 2005; Megaputer, 2009; Rippa & Lendyuk, 2007) and SAS Enterprise Miner (SAS, 2009a). The general evaluation criteria were: access (the cost of the tool), user interface (how easy or complex the tool was to use according to the user), the process (or methodology) on which the tool was based, extensibility (the capacity to easily and dynamically expand the set of algorithms offered) and support for individuals working together in groups during project development. The evaluation showed that none of the tools fully complied with the cross-industry standard process for data mining (CRISP-DM) (CRISP-DM, 2006; Chapman et al., 2000), a process for carrying out data mining projects that is at once iterative, open, customisable and widely recognised by industry and academia. It also emerged that none of the tools allowed the set of algorithms initially provided to be expanded dynamically at runtime (without recompiling the tool) and that, although some of the tools boasted an easy user interface, none of them properly guided the user through a project, much less helped the user to learn and deepen their knowledge of the process followed in a data mining project.

As such, the research group (GTI) decided to develop an integrated CASE tool based on CRISP-DM (CRISP-DM, 2006; Chapman et al., 2000) that is easily extensible at runtime, easy to use and helps users to increase their knowledge and abilities in carrying out data mining projects.

Cross-industry standard process for data mining (CRISP-DM)

A variety of methodologies exists for directing data mining projects. These aim to facilitate new projects having similar characteristics, optimise their planning and management, reduce their complexity and allow smoother execution (Gondar Nores, 2004). Two of these methodologies stand out: CRISP-DM (CRISP-DM, 2006) and sample, explore, modify, model, assess (SEMMA) (SAS, 2009b). The latter concerns itself with the technical characteristics of process development, while CRISP-DM mainly focuses on a project's business objectives. CRISP-DM begins by analysing a business problem in order to transform it into a technical data mining problem. CRISP-DM can also be integrated with a specific project management methodology, complementing administrative and technical tasks. Unlike SEMMA (SAS, 2009b), it is also widely distributed at no cost. CRISP-DM defines a structure for data mining projects and provides orientation for their execution. It serves both as a reference model and a user guide (Chapman et al., 2000). The reference model gives a general view of a data mining project's life-cycle, containing each phase with its objective, the tasks, the relationships between them and the step-by-step instructions that must be carried out. The phases defined in the reference model are: business understanding, data understanding, data preparation, modelling, evaluation and deployment. Each phase (level 1) is composed of generic tasks (level 2), which are divided into specific tasks (level 3); an instance of the process is found at level 4, describing the specific activities to be done in a particular data mining project. The user guide offers detailed advice and hints for each phase and for each operation within a phase, and provides an example of how to do a data mining project. The user guide is an excellent option for researchers having little experience of data mining.
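The four-level breakdown described above can be pictured as a small tree. The following is only an illustrative Python sketch, not part of CMIN (which is a .NET tool): the `Node` class and `build_reference_model` function are hypothetical names, the phase names follow the CRISP-DM 1.0 guide, and the level 3 and level 4 entries in the test data are made-up examples.

```python
from dataclasses import dataclass, field
from typing import List

# Level numbering as described in the text (the labels are paraphrased).
LEVEL_NAMES = {1: "phase", 2: "generic task", 3: "specific task", 4: "process instance"}

# The six phases of the CRISP-DM 1.0 reference model.
PHASES = [
    "Business understanding",
    "Data understanding",
    "Data preparation",
    "Modelling",
    "Evaluation",
    "Deployment",
]


@dataclass
class Node:
    """One element of the hierarchy; `level` is 1..4."""
    name: str
    level: int
    children: List["Node"] = field(default_factory=list)

    def add(self, name: str) -> "Node":
        """Attach a child one level below this node and return it."""
        if self.level >= 4:
            raise ValueError("level 4 (process instance) is the deepest level")
        child = Node(name, self.level + 1)
        self.children.append(child)
        return child


def build_reference_model() -> List[Node]:
    """Return the six phases as level-1 nodes with no tasks attached yet."""
    return [Node(name, 1) for name in PHASES]


def depth(node: Node) -> int:
    """Number of levels in the subtree rooted at `node`."""
    return 1 + max((depth(child) for child in node.children), default=0)
```

A project-specific activity is then reached by descending from a phase through a generic task and a specific task, which is the same path a CASE tool has to present to the user.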

CMIN conceptual model

To understand better how CMIN works, its conceptual model is presented first, with its main concepts and the relationships amongst them (see Figure 1):

- Users: people who use the system. They may be experts or novices in data mining;

- Process module: the module that allows processes, CRISP-DM among them, to be managed. Process definition is the act of registering a process when it is added, defining the steps and the fields or activities required for carrying out a data mining project. Reports are the documents or deliverables that must be produced in the course of a project and which aid its execution;

- Processes: processes that have been added to CMIN and that serve as a basis for managing data mining projects in the tool;

- Project module: the module for managing data mining projects, based on one of the processes previously added to the process module. Projects represent the set of projects already created in CMIN and can be found in two stages (in progress or completed). The fields or activities of a step are the specific activities that have to be carried out to meet the objective of the step to which they belong. The results represent the products of carrying out an activity, which may comprise a suggestion, an explanatory text or an information template that needs to be observed;

- Workflow (WF): a graphical environment that allows users to manage data mining models based on mining tasks defined in CMIN;

- Adding dynamic link libraries (DLL): this module allows the management of objects (new algorithms) that serve to implement the workflow, using DLLs. Types of workflow objects (or types of objects) represent the set of object types recognised by CMIN that can be added and in turn used by the WF. Interfaces represent the set of software contracts (e.g. for classification, clustering or association rules) to be met by DLLs before being added to the set of objects to be used by the WF. DLLs represent the set of DLLs that CMIN currently holds in its array, or set of algorithms (WF objects);

- Workflow objects: the set of objects added to CMIN that can be used in the workflow. The set grows as users contribute new implementations of any of the WF object types specified in CMIN; and

- CMIN server: the server that hosts new process definitions and new implementations of workflow objects (algorithms) as DLLs, so that users can upgrade CMIN if they wish. CMIN is able to run independently of the server.

CMIN use cases

Two types of users (roles) are considered in CMIN: end users and expert editors (see Figure 2). The system's use cases are as follows: logging into the system (a pre-condition for using the tool), managing processes, managing projects, managing templates and managing DLLs. On logging into the system, users must configure the connection to a SQL Server database server (possibly an Express edition, which is free of charge) so that the information necessary for the system's operation can be loaded. When managing projects, users carry out the steps suggested by the process that the project is using, completing the fields defined for each step. In some fields, the workflow can be used if the user needs particular data mining techniques or algorithms.

Figure 2 also shows the expert editors' use cases. These users, as well as having the functionality available to an end user, are able to manage processes (create, modify and delete processes and their associated steps and fields), manage templates (customisations of a process for a specific area of application, eliminating steps that are not appropriate in that area) and manage the DLLs used in the system. The division of roles is a logical abstraction, since the tool allows any user to take on the role of expert editor, but such a user must have a good knowledge of mining processes to define and customise templates, and must learn the proper way to create and load new data mining algorithms in CMIN. CMIN has a set of XML web services that enable new processes and data mining algorithm DLLs to be centralised. These resources (processes and algorithms) can be synchronised to clients through a simple synchronisation option, making the expert's job that much easier.

CRISP-DM register in CMIN

The process management module allows new data mining processes to be defined. The following describes how CRISP-DM 1.0 is registered in CMIN. First, the expert editor registers the basic information regarding a process (name, status and description), then defines the steps and fields of the process. Figure 3 shows, on the left-hand side, how these steps (phases, generic tasks, specific tasks, etc.) are created using a shortcut menu. Four things are defined for each step: the name, the type of step in the process hierarchy, a description (which helps the CMIN user) and the set of fields (information that the person carrying out the data mining project must register in that step). The result of editing the steps of CRISP-DM 1.0 registered in CMIN is shown on the right-hand side of the figure.

Next, the fields of each step are edited. Figure 4 depicts a form in which the editor or mining expert registers the various fields (of which there can be many) for each step. For each field, a description must be registered: if the field is an activity, the description explains what needs to be done, and if it is a suggestion, the suggestion itself is described. The field type, which defines whether the field is an activity or a suggestion, is also registered, as is "uses workflow", which indicates whether or not the WF is needed to perform the activity or field.
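The information registered for a step and its fields can be summarised as a small data model. The sketch below is a hypothetical Python rendering of the attributes named in the text (name, step type, description, fields; and for each field a description, a type and a "uses workflow" flag). The class and attribute names are assumptions, and the article does not describe CMIN's actual database schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class FieldType(Enum):
    """The two field types mentioned in the text."""
    ACTIVITY = "activity"
    SUGGESTION = "suggestion"


@dataclass
class StepField:
    description: str             # what to do, or the suggestion itself
    field_type: FieldType
    uses_workflow: bool = False  # True if the workflow is needed for this field


@dataclass
class Step:
    name: str
    step_type: str    # position in the hierarchy, e.g. "phase" or "generic task"
    description: str  # help text shown to the CMIN user
    fields: List[StepField] = field(default_factory=list)

    def workflow_fields(self) -> List[StepField]:
        """Fields whose execution requires opening the workflow."""
        return [f for f in self.fields if f.uses_workflow]


# Hypothetical example data, not taken from the CMIN registration of CRISP-DM.
modelling = Step(
    name="Build model",
    step_type="generic task",
    description="Run the modelling tool on the prepared data set.",
    fields=[
        StepField("Run the selected algorithm", FieldType.ACTIVITY, uses_workflow=True),
        StepField("Record parameter settings and rationale", FieldType.SUGGESTION),
    ],
)
```

Keeping the "uses workflow" flag on the field rather than on the step matches the description above, where only some fields of a step call for a data mining algorithm.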

Management of a Project in CMIN

CMIN allows a data mining project to be carried out on the basis of a process. To do this, a project inherits the structure of the process that the user selected previously. The left-hand part of Figure 5 shows the addition of a new project to CMIN, which involves selecting a base process or template (if one has been defined previously). The right-hand part of Figure 5 shows how a project is conducted. At (1) the structure of the base process can be seen, which the user follows in order to conduct the mining project in CMIN; at (2) the fields or activities to be performed per [...] (4) shows how to create a cycle of any step in the process. This last point is very important because most projects need to re-process or repeat certain steps at a specific moment along the way; and (5) indicates how the cycles are displayed.

Data mining workflow in CMIN

Figure 6 shows the workflow of CMIN. The types of objects in the workflow are outlined at (1) (data sources, classification algorithms, data description algorithms, filters, displays, and grouping or clustering algorithms); (2) shows an offered object of the “Data Source” type; and (3) presents an object in execution within the workflow.

So that algorithms, or objects, can be added to the types of objects at runtime, a software interface or contract (Microsoft-Corporation, 2009a) must be defined for each type of object in the workflow; this groups the methods necessary for using the object and the methods for interacting with other types of workflow objects. When a new type of object is created, it must be reported to CMIN using the form seen on the left in Figure 7.

The interface of the new type is developed beforehand using Visual Studio .NET (Chand, 2000); it is compiled as an assembly and this assembly is loaded into CMIN. The information about the object type is stored in the database and the DLL file is copied and stored in the local CMIN folder called Assemblies_CMIN. After entering the type of object, the links that can be established must be defined, i.e. to which types of object the new type can give information and from which types of object it can receive information (see the right-hand side of Figure 7).
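The link definitions amount to a table of allowed (source type, target type) pairs that the workflow can consult before connecting two objects. The following Python sketch illustrates the idea only; the specific rules in `ALLOWED_LINKS` are invented for the example and are not the rules shown in Figure 7, and the function names are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical link rules: for each object type, the set of object types
# it is allowed to pass information to.
ALLOWED_LINKS: Dict[str, Set[str]] = {
    "data source": {"filter", "classification", "clustering", "display"},
    "filter": {"filter", "classification", "clustering", "display"},
    "classification": {"display"},
    "clustering": {"display"},
    "display": set(),
}


def can_link(source_type: str, target_type: str) -> bool:
    """True if an object of `source_type` may feed an object of `target_type`."""
    return target_type in ALLOWED_LINKS.get(source_type, set())


def invalid_links(edges: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return the edges of a workflow that violate the link rules."""
    return [(src, dst) for src, dst in edges if not can_link(src, dst)]


# A small workflow: source -> filter -> classifier -> display, plus one
# deliberately wrong link from a display back to a data source.
workflow = [
    ("data source", "filter"),
    ("filter", "classification"),
    ("classification", "display"),
    ("display", "data source"),
]
```

With these example rules, `invalid_links(workflow)` reports only the last edge, since a display is declared as giving information to no other type.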

Adding a new algorithm in CMIN

The process for adding a new object (algorithm) to a type of CMIN object is as follows:

- A developer creates a library project in Visual Studio .NET (Chand, 2000), adding as a reference the DLL that defines the contract or software interface (Microsoft-Corporation, 2009a) for the type of object that will be implemented. In other words, the developer adds clustering.dll to the project if the k-means algorithm is going to be implemented (see Figure 8);

- The developer implements the algorithm in the library project (fulfilling the contract), generates the new DLL and compresses it in a zip file (see Figure 9);

- When a user needs to use the new algorithm in CMIN, the zip file with the DLL is first selected and then verified for compliance with the contract. This comparison is done using reflection (System.Reflection) (Microsoft-Corporation, 2009b), loading the assemblies and comparing the methods. An image is then uploaded to represent the new algorithm, which is finally loaded into CMIN (see Figure 10); and

- If the new algorithm meets the requirements of the type of object interface, it is registered in the database and the zip file is decompressed and stored in the local CMIN folder called algorithms, ready to be used in the workflow (Figure 11).
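The contract check in the steps above, in which the methods of a candidate algorithm are compared with those of the interface, can be illustrated with a short sketch. CMIN performs this check on .NET assemblies with System.Reflection; the Python version below is only an analogue using the standard `inspect` module, and the contract `ClusteringContract`, its two methods and the example classes are all hypothetical.

```python
import inspect
from abc import ABC, abstractmethod
from typing import List


class ClusteringContract(ABC):
    """Hypothetical contract that every clustering algorithm must fulfil."""

    @abstractmethod
    def configure(self, params): ...

    @abstractmethod
    def run(self, data): ...


def contract_methods(contract: type) -> dict:
    """Public methods of the contract, mapped to their signatures."""
    return {
        name: inspect.signature(member)
        for name, member in inspect.getmembers(contract, inspect.isfunction)
        if not name.startswith("_")
    }


def unmet_methods(candidate: type, contract: type) -> List[str]:
    """Names of contract methods the candidate lacks or declares differently."""
    problems = []
    for name, expected in contract_methods(contract).items():
        impl = getattr(candidate, name, None)
        if not callable(impl):
            problems.append(name)
        elif list(inspect.signature(impl).parameters) != list(expected.parameters):
            problems.append(name)
    return problems


class KMeansSketch:
    """Placeholder algorithm: it has the right methods but no real logic."""

    def configure(self, params):
        self.k = params.get("k", 2)

    def run(self, data):
        return [i % self.k for i, _ in enumerate(data)]


class Incomplete:
    """Lacks configure(), so it should be rejected."""

    def run(self, data):
        return []
```

Here `unmet_methods(KMeansSketch, ClusteringContract)` is empty, so the class would be accepted, while `Incomplete` is reported as missing `configure`. Only method names and parameter names are compared in this sketch; a .NET implementation would also compare parameter and return types.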

Invoking algorithms in run-time

CMIN stores the algorithm assemblies or DLLs in local folders and it also stores the assemblies of the types of objects, i.e. the interfaces. The types of workflow objects are static; the dynamic part is made up of the algorithms or objects of each type, which can be extended at runtime. Taking this into account, the group first defined the software interfaces (contracts) that each type of object must fulfil, focusing on the methods that allow an algorithm to interact with the user and with the CMIN core. This means that the CMIN core (the nerve centre of the workflow) operates on the basis of the information in the software interfaces. Because the objects comply with the contract for their type, the core knows which methods it must invoke on them. For creating and loading objects and invoking methods, the core uses reflection (Microsoft-Corporation, 2009b). The core also validates the relationships that can occur between objects, based on the rules presented in the right-hand part of Figure 7. As a result, the workflow functions as shown in Figure 12.
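The load-then-invoke pattern described above can be sketched without any .NET tooling. The Python example below is an analogue only: it uses the standard `importlib` module to load a class from a file that did not exist when the host program was written, and `getattr` to call methods by name, which is roughly what loading an assembly and invoking its methods through reflection achieve. The plugin source, the `MajorityClassifier` class and the method names `train` and `predict` are assumptions made for the illustration.

```python
import importlib.util
import tempfile
from pathlib import Path

# Source code of a hypothetical plugin, written to disk at runtime to stand
# in for a DLL dropped into CMIN's local algorithms folder.
PLUGIN_SOURCE = '''
class MajorityClassifier:
    """Toy algorithm: always predicts the most frequent training label."""

    def train(self, labels):
        self.label = max(set(labels), key=labels.count)

    def predict(self, item):
        return self.label
'''


def load_class(path: Path, class_name: str) -> type:
    """Load a module from `path` and return the named class from it."""
    spec = importlib.util.spec_from_file_location(path.stem, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, class_name)


def invoke(obj, method_name: str, *args):
    """Call a method chosen by name, as a workflow core would do."""
    return getattr(obj, method_name)(*args)


def demo() -> str:
    with tempfile.TemporaryDirectory() as folder:
        plugin = Path(folder) / "majority.py"
        plugin.write_text(PLUGIN_SOURCE)
        algorithm_class = load_class(plugin, "MajorityClassifier")
        algorithm = algorithm_class()
        invoke(algorithm, "train", ["setosa", "setosa", "virginica"])
        return invoke(algorithm, "predict", None)
```

The host code in `demo` never names the plugin's class in its own source; it only relies on the agreed method names, which is the role the contracts play for the CMIN core.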

CMIN assessment

CMIN has undergone two evaluations:

- A preliminary assessment of process management and project management was held in February 2008 with sixteen students from the University of Cauca's elective Data Mining course. In this evaluation each CRISP-DM phase was assigned to two students on the course. Based on version 1.0 of CMIN, they made an overall assessment of the tool's compliance with the CRISP-DM phases and also evaluated its ease of use. As a general conclusion, the tool fulfilled CRISP-DM requirements 100%, although some templates for collecting information in some phases needed to be improved. Given the positive results of this evaluation, a description of the tool was sent to a project presentation meeting in March 2008 to be submitted to the Demofest of the Microsoft Research Academic Summit. The project was selected by Microsoft and a scientific poster on CMIN was presented in Panama City on May 16th 2008. The tool was demonstrated in person to the teachers and researchers who attended the event. Despite the fact that many projects presented at the Demofest boasted investments much higher than that of CMIN, the tool received excellent reviews and Microsoft decided to include it in publicity that appeared on CNN television (Spanish language) in their program ADVANCES (see a copy of the video at http://www.unicauca.edu.co/~ccobos/cnndelantos.wmv).

- An evaluation of the usability of the tool. This evaluation was carried out in March 2009 as a beta test with the participation of University of Cauca (UC) engineers and Systems Engineering students who work in data mining. This test had two objectives: a thorough review of CMIN in an environment different from that of its development, by way of a usability test, and the verification (through an experiment) of whether or not using CMIN could increase users' knowledge of CRISP-DM. The experiment was conducted in six steps, as follows: 1) a pre-test evaluated the group's initial knowledge of CRISP-DM; 2) a basic presentation of the CMIN tool was given; 3) a workshop on data mining was held (the aim of the workshop was to set a typical classification problem for the group to solve. The IRIS data set, available from the UCI repository (Asuncion & Newman, 2007), was selected for the workshop. The participants used the workflow and obtained the result shown in Figure 12); 4) the group interacted through questions and suggestions; 5) a further test was taken to evaluate the group's new level of knowledge regarding CRISP-DM (the content of this test was the same as that of the pre-test); and 6) a usability test was set, based on a questionnaire from the Universitat Politècnica de Catalunya (Borges de Barros Pereira, 2002).

Overall, the test was successful in that the tool did not throw up any errors while all participants were able to resolve the classification problem presented. The usability test results were very good. CMIN can be said to have a friendly interface that is understandable and through which - most importantly - the management of projects that may involve repetitive and somewhat complex aspects can be handled easily. The interface minimizes what the user needs to learn in the tool. At each step it provides guidance for successfully carrying out data mining project tasks. Figure 13 shows the main results of usability testing wherein, for each indicator, the users expressed an assessment mainly consisting of excellent and good.

As regards the CRISP-DM knowledge test, an increase of between 5% and 10% in knowledge of the process was achieved in the short period of the workshop (1 hour), noting that it was not intended that users memorise the CRISP-DM phases and their generic and specific tasks. Most important was the change seen in the terminology of the test users' responses. Compared to the pre-test, responses proved to be more accurate, more technical and more directly related to the phases of the process.

Conclusions and future work

CMIN is an integrated CASE tool that guides the carrying out of projects through processes, facilitates the integration of the process with the project and ensures compliance with the process in the execution of the project. CMIN is a tool with expandable functionality (capable of dynamic extension of the algorithm array at runtime) that encourages and facilitates cooperation within the development community, as new functionality can be programmed by community members, then tested and evaluated by a panel before finally being included and distributed to other members of the tool user community through the synchronisation option. Because detailed and appropriate information is provided in each step of any process or project in CMIN, it is likely that the user will progressively come to know more about any data mining process (for example, CRISP-DM).

Regarding future work, the research group plans to implement an improved version of the project monitoring component that takes into account the management of resources for each activity, so that cost reports can be produced for each step of the project; the group recognises the need to integrate a suitable project management methodology within CMIN. Additionally, the intention is to focus efforts on building up the tool development community. This ought to allow rapid growth in the existing battery of algorithms that can be used in CMIN and thus enhance workflow use.

References

Asuncion, A., Newman, D. J., UCI Machine Learning Repository 2008., 2007. from http://www.ics.uci.edu/~mlearn/MLRepository.html

Borges de Barros Pereira, H., Análisis experimental de los criterios de evaluación de usabilidad de aplicaciones multimedia en entornos de educación y formación a distancia., Unpublished doctoral thesis, Universitat Politècnica de Catalunya, Barcelona, 2002.

Britos, P., Fernández, E., Ochoa, M., Merlino, H., Diez, E., García, R., Metodología de Selección de Herramientas de Explotación de Datos., Paper presented at the II Workshop de Ingeniería del Software y Bases de Datos, XI Congreso Argentino de Ciencias de la Computación, 2005.

CRISP-DM., CRoss Industry Standard Process for Data Mining., 2006. from http://www.crisp-dm.org/

Chand, M., Creating C# Class Library (DLL) Using Visual Studio .NET [Electronic Version]., C# Corner, 2000. from http://www.c-harpcorner.com/UploadFile/mahesh/dll12222005064058AM/dll.aspx

Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., CRISP-DM 1.0: Step-by-step data mining guide., CRISP-DM Consortium, 2000.

Gondar Nores, J.-E., Metodologías para la Realización de Proyectos de Data Mining [Electronic Version]., 2004. from http://www.estadistico.com/arts.html?20040426

Holmes, G., Donkin, A., Witten, I. H., WEKA: a machine learning workbench., Paper presented at Intelligent Information Systems, Proceedings of the 1994 Second Australian and New Zealand Conference on, 1994.

INEI., Herramientas CASE., Lima, Perú: Instituto Nacional de Estadística e Informática, 1999.

Insightful-Corporation., Insightful Miner., from http://www.insightful.com/products/iminer/default.asp

Kdnuggets., Tools data mining., 2005. from http://www.kdnuggets.com/polls/2005/data_mining_tools.htm

Khabaza, T., Shearer, C., Data mining with Clementine., Paper presented at Knowledge Discovery in Databases, IEE Colloquium on, 1995.

Mai, C. K., Krishna, I. V. M., Reddy, A. V., Polyanalyst application for forest data mining., Paper presented at the Geoscience and Remote Sensing Symposium, IGARSS '05, Proceedings, 2005 IEEE International, 2005.

Megaputer., PolyAnalyst 6.0 - simplify your analytics., 2009. from http://www.megaputer.com/

MetaGroup., METAspectrum Market Summary., 2004. from http://www.oracle.com/technology/products/bi/odm/pdf/odm_metaspectrum_1004.pdf

Microsoft-Corporation., interface (C# Reference)., 2009a. from http://msdn.microsoft.com/en-us/library/87d83y5b.aspx

Microsoft-Corporation., Reflection Overview [Electronic Version]., .NET Framework Developer's Guide, 2009b. from http://msdn.microsoft.com/en-us/library/f7ykdhsy.aspx

Miren Begoña, A.-R., A retrospective view of CASE tools adoption., SIGSOFT Softw. Eng. Notes, 25(2), 2000, pp. 46-50.

Rippa, S., Lendyuk, T., Selection of Alternative Projects Using Data Mining., Paper presented at the 4th IEEE Workshop on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications, IDAACS, 2007.

Salford-System., Classification And Regression Trees (CART)., 2009. from http://www.salfordsystems.com/cart.php

SAS., Data mining with SAS® Enterprise Miner., 2009a. from http://www.sas.com/technologies/analytics/datamining/miner/

SAS., SAS Enterprise Miner - SEMMA., 2009b. from http://www.sas.com/offices/europe/uk/technologies/analytics/datamining/miner/semma.html

SPSS-Inc., Clementine., 2009. from http://www.spss.com/es/clementine/

University-of-Waikato., Weka 3: Data Mining Software in Java., 2009. from http://www.cs.waikato.ac.nz/ml/weka/

Creative Commons License All the contents of this journal, except where otherwise noted, is licensed under a Creative Commons Attribution License