Methodologies of Data Mining
May 7th, 2018
Cross-Industry Standard Process for Data Mining (CRISP-DM)
Six Sigma has become pivotal in business management because it provides a defined system, built on proven quality-control principles and techniques, for reducing deviations arising from defects. At present, enterprises are in search of effective, dependable processes; ideally, such processes should have no errors whatsoever, and with Six Sigma such defects are eliminated from the organization. According to Kolich, Fafandjel and Yao (2016), the Six Sigma methodology involves a combination of methods, techniques and tools. The Cross-Industry Standard Process for Data Mining (CRISP-DM), in turn, encompasses a structured method for developing a data-mining project.
As the 2007 KDnuggets poll suggests, the outcomes of CRISP-DM are reliable, given that they are analyzable and stable over time (Wang, 2011). The popularity of CRISP-DM has crossed borders, and it has remained the clear leader among data-mining methodologies, although it is used by fewer than half of practitioners, accounting for roughly 43 percent of reported methodology use. According to Wang (2011), CRISP-DM consists of six phases. Apart from being user-friendly, it is dependable, and each phase is indispensable to understanding the information at hand.
Business understanding is the first phase of CRISP-DM. It is important to comprehend the mission and objectives of the institution and what it seeks to achieve. The memorandum of association outlines the company's mission and objectives and states the reason for its incorporation. In this regard, a company should evaluate its objectives and ascertain that it is in a position to counter any constraints as it matches its resources accordingly. While doing this, the focus should be on the imperative components that have the greatest influence on the project outcomes (Sharma, 2013). It goes without saying that the most effective project outcomes are those aligned with the set objectives, the project plan and the description of business success.
Understanding the goals of data mining
CRISP-DM's second phase is data understanding. In this phase, the enterprise collects the data identified in the project resources. Usually, the primary resources contain the data essential to comprehending the given problem. A practical approach is to load the given data into a specific data-understanding tool (Karimi-Majd & Mahootchi, 2015). Where the organization has several data sources, it is best to consider the time required and the means of integrating those sources. Understanding the data requires reviewing the initial data collection, the data description, the data exploration and the data quality report.
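As a brief illustration, the data-understanding activities listed above can be sketched in Python with pandas. The dataset and column names here are purely hypothetical, chosen only to show the kind of review the phase calls for:

```python
import pandas as pd

# Hypothetical production records; the column names are assumptions
# made purely for illustration.
df = pd.DataFrame({
    "batch_id": [1, 2, 3, 4],
    "yield_pct": [92.1, 88.4, None, 95.0],
    "temperature": [37.0, 36.5, 37.2, 36.8],
})

print(df.describe())    # data description: summary statistics
print(df.isna().sum())  # data quality report: missing values per column
print(df.dtypes)        # initial data collection: recorded types
```

Each printed view corresponds to one of the review artifacts the phase produces: the description, the quality report and the collection record.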
The third phase is data preparation, in which the business selects the required data and decides which data to use for analysis. The selection criteria take into account the relevance of the data to the goals, the quality of the data and any technical constraints. After the data has been selected, the enterprise can clean the data and its subsets and ensure that it meets quality standards (Wang, 2011). Immediately after cleaning the data, the necessary data is constructed through data-preparation operations such as deriving attributes and generating records. Finally, the outcomes from multiple databases, tables and records may be integrated, merged and aggregated.
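The select–clean–construct–integrate sequence described above can be sketched as follows. The two data sources and their field names are hypothetical, standing in for whatever tables an enterprise actually holds:

```python
import pandas as pd

# Two hypothetical sources; names are illustrative only.
batches = pd.DataFrame({"batch_id": [1, 2, 3],
                        "input_kg": [100.0, 100.0, None]})
results = pd.DataFrame({"batch_id": [1, 2, 3],
                        "output_kg": [92.0, 88.5, 95.0]})

# Clean: drop records failing the quality check (missing inputs).
batches = batches.dropna(subset=["input_kg"])

# Integrate: merge the two sources on their shared key.
merged = batches.merge(results, on="batch_id", how="inner")

# Construct: a derived attribute computed from existing fields.
merged["yield_pct"] = 100 * merged["output_kg"] / merged["input_kg"]
print(merged)
```

The derived `yield_pct` column is an example of the "derived attributes" operation the phase prescribes.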
The modeling phase involves the selection of the actual modeling technique to use. Though candidate approaches may have been identified during the business understanding phase, in this phase the choice is narrowed to specific techniques such as decision-tree construction with C5.0 or neural-network generation (Wang, 2011). After selecting the specific modeling technique, the business may go ahead and design a test plan and build a model, which will then be assessed in later stages.
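A minimal sketch of this phase, using scikit-learn's decision tree as a stand-in for C5.0 (scikit-learn implements the CART algorithm rather than C5.0, but the fit-a-tree workflow is the same) and entirely synthetic process data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy process data (assumed): two sensor readings -> pass/fail label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Build the model selected in this phase; it is assessed in later stages.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X, y)
print(model.score(X, y))  # training accuracy only, not a final evaluation
```

The score printed here is only a training-set figure; proper assessment belongs to the evaluation phase.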
The evaluation phase involves ascertaining the accuracy and generality of the model. In this case, the enterprise points out the deficiencies of the model, which may include testing the model in a real setting. A review process is then initiated once the models have been shown to be satisfactory and in line with the business needs (Stolojescu-Crisan & Isar, 2013). From the results, the enterprise ought to determine the next move and plan future iterations, ideally by itemizing the plausible actions and decisions.
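One common way to estimate the "generality" this phase asks for is k-fold cross-validation, sketched below on synthetic data (the dataset and model settings are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the prepared dataset.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = (X[:, 0] > 0).astype(int)

# k-fold cross-validation estimates how well the model generalizes,
# exposing deficiencies before any real-world deployment test.
scores = cross_val_score(DecisionTreeClassifier(max_depth=2, random_state=0),
                         X, y, cv=5)
print(scores.mean())
```

A large gap between training accuracy and the cross-validated mean is one of the model deficiencies the review process would flag.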
The last phase is deployment. This phase builds on the previous phases, taking the evaluation results as its input, from which the enterprise can draw up a deployment plan. A good way of doing this is to summarize the proposed plan, incorporating the necessary steps and the strategies for performing them. A final report of the data-mining engagement is then drafted, summarizing and organizing the results together with the previous deliverables (Wang, 2011). The final step is the presentation of the project's conclusions, the results of which are delivered to the customer.
Example of how Six Sigma for Data Mining helps manufacturing organizations.
Manufacturing corporations look for every means of making operations more effective while containing production costs. These firms can take advantage of advanced analytics to reduce flaws in operations, which in the end saves time and money. Much of the output of production processes is wasted, and this affects the quality of process outcomes. Given the number of production tasks an organization may have, a granular approach is needed to diagnose and correct process flaws. According to Trnka (2012), the application of statistics and business data to improving operations is indispensable. Manufacturing organizations can use advanced analytics on historical process data to identify patterns and correlations among discrete process steps, after which the firm can optimize the factors that prove to have the greatest effect on yield. One of the best examples of Six Sigma at work is the present use of real-time shop-floor data and the capability to conduct sophisticated statistical assessments. Taking isolated data sets, aggregating them and using the combined data to improve insight has been the main undertaking of many manufacturing organizations applying Six Sigma.
Consider, for example, a biopharmaceutical firm dealing in vaccines, hormones and blood components manufactured live from genetically engineered cells. Its production teams must consistently monitor a large number of variables within the production flow to ensure the purity of the ingredients as well as of the substances being manufactured. Though the manufacturing process may be identical from batch to batch, its outcomes and yields can differ considerably. Such significant variability can be troublesome and can lead to discrepancies in decision making, especially with regard to the capacity, quality and quantity of product desired. The biopharmaceutical firm can, however, use advanced analytics to boost outcomes in vaccine production while keeping costs as low as possible.
Evaluation of the DMAIC methodology in Six Sigma for Data Mining.
Karimi-Majd and Mahootchi (2015) postulate that Six Sigma DMAIC is indeed a roadmap for problem solving as well as for product improvement. Six Sigma can also enable the organization to closely monitor production activities and, for each cluster, process the resources drawn from the main database. The define phase entails outlining the goals and objectives of the organization as well as the internal and external deliverables. Tools useful in this phase include the project charter, flowcharts, stakeholder analysis and the work breakdown structure. The measure phase, on the other hand, involves determining present performance; it is essentially about quantifying the problem. In this phase, a data-collection plan may be outlined and the measurement system validated. The data miner may then proceed to amass the data before finally defining the relationships among the variables. Essential tools in this phase include a data-collection schedule, a sigma computation and a measurement system analysis (Sharma, 2013).
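The sigma computation mentioned among the measure-phase tools can be sketched as follows. The function converts defects per million opportunities (DPMO) to a sigma level using the conventional 1.5-sigma long-term shift; only the Python standard library is needed:

```python
from statistics import NormalDist

def sigma_level(dpmo: float) -> float:
    """Convert defects per million opportunities to a sigma level,
    applying the conventional 1.5-sigma long-term shift."""
    return NormalDist().inv_cdf(1 - dpmo / 1_000_000) + 1.5

# A process with 3.4 defects per million corresponds to Six Sigma quality.
print(round(sigma_level(3.4), 1))  # → 6.0
```

This quantifies the problem exactly as the measure phase intends: a concrete defect rate becomes a single performance figure that can be tracked over time.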
The next phase is analysis, which involves determining the critical causes of defects. In this case, the data analyst seeks the reasons behind the variation between the expected and the actual outcomes (Stolojescu-Crisan & Isar, 2013). Time-series analysis, simulation software, regression analysis, hypothesis testing and scatter plots may all be used. The improve phase involves curbing the processes and operations that dwarf outcomes. An organization implementing the improve phase performs experiments and tests possible solutions to problems; usually this involves conducting pilot studies in which potential improvements are outlined. According to Kolich, Fafandjel and Yao (2016), the control phase is used to sustain and forecast future performance. Ideally, this involves developing standards and procedures as well as ascertaining process capability. Profits, cost savings and growth are computed using sigma calculations, cost-benefit analysis, control plans and control charts.
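A minimal sketch of the control charts mentioned above, assuming an individuals chart with conventional 3-sigma limits; the sample values are invented for illustration:

```python
import statistics

# Hypothetical daily defect-rate samples from a stabilized process.
samples = [2.1, 1.9, 2.3, 2.0, 1.8, 2.2, 2.1, 1.9]

mean = statistics.fmean(samples)
sd = statistics.stdev(samples)

# Conventional 3-sigma control limits.
ucl = mean + 3 * sd
lcl = mean - 3 * sd
print(f"center={mean:.2f}, UCL={ucl:.2f}, LCL={lcl:.2f}")

# Flag any out-of-control points for investigation.
out_of_control = [x for x in samples if not lcl <= x <= ucl]
print(out_of_control)  # → []
```

Points falling outside the limits are the signal the control phase watches for; an empty list here indicates the process is in statistical control.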
How Define, Measure, Analyze, Improve and Control (DMAIC) can be applied to a manufacturing organization.
DMAIC comes in handy in the production and processing of data; even extremely fragmented data can best be understood through it. To account for information gaps, an organization can first gather the data and reconcile inconsistencies. The data is then measured against a defined benchmark to ascertain whether it falls short of it; data parameters can be measured using metrics, and the underlying variables bringing about the inconsistencies can be quantified. Once the existence of variations has been ascertained, the outcomes are analyzed to determine their causes (Kolich, Fafandjel & Yao, 2016). Ideally, this involves examining the data metrics in depth; when the factors leading to the inconsistencies are established, the next step is to devise ways to improve performance. While doing this, control measures that regulate the means by which the model operates should be put in place, and the manufacturing firm should ensure that the control measures developed comply with industrial regulations.
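The measure step described above, checking fragmented data against a benchmark, can be sketched as follows. The sensor records and the 90% completeness benchmark are assumptions made only for illustration:

```python
import pandas as pd

# Hypothetical fragmented records with information gaps.
records = pd.DataFrame({
    "sensor": ["A", "A", "B", "B", "C"],
    "reading": [10.2, None, 9.8, 10.1, None],
})

# Measure: completeness per sensor against an assumed 90% benchmark.
BENCHMARK = 0.90
completeness = records.groupby("sensor")["reading"].apply(
    lambda s: s.notna().mean())
shortfalls = completeness[completeness < BENCHMARK]
print(shortfalls)  # sensors falling short of the benchmark
```

The sensors flagged here would feed the analyze step, where the causes of the gaps are investigated before improvements are designed.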
References

Karimi-Majd, A., & Mahootchi, M. (2015). A new data mining methodology for generating new service ideas. Information Systems and e-Business Management, 13(3), 421-443. doi:10.1007/s10257-014-0267-y
Kolich, D., Fafandjel, N., & Yao, Y. L. (2016). Data mining methodology for determining the optimal model of cost prediction in ship interim product assembly. Brodogradnja, 67(1), 1-18.
Sharma, A. K. (2013). Data mining based predictions for employees skill. International Journal of Advanced Research in Computer Science, 4(3)
Stolojescu-Crisan, C., & Isar, A. (2013). Forecasting WiMAX traffic by data mining methodology. EURASIP Journal on Wireless Communications and Networking, 2013, 280. doi:10.1186/1687-1499-2013-280
Trnka, A. (2012). Results of application data mining algorithms to (lean) Six Sigma methodology. Annals of the Faculty of Engineering Hunedoara, 10(1), 141.
Wang, J. (2011). The study on cross-industry standard process for data mining in E-marketing. Applied Mechanics and Materials, 66-68, 2298. doi:10.4028/www.scientific.net/AMM.66-68.2298
Woodside, J. M. (2016). BEMO: A parsimonious big data mining methodology. Ajit-e, 7(24), 113. doi:10.5824/1309-1581.2016.3.007.x