Monday, June 3, 2019

Use and Application of Data Mining

Data mining is the process of extracting patterns from data. It is becoming an increasingly important tool for transforming data into information. It is commonly used in a wide range of profiling practices, such as marketing, surveillance, fraud detection, and scientific discovery [1-3]. Data mining can be applied to a variety of data types. These include structured (relational) data, multimedia data, free text, and hypertext, as shown in Figure 1-1. Hypertext can be stripped of its XML/XHTML tags to obtain free text [4, 5].

Nowadays, text is the most common and convenient medium for information exchange. This is because much of the world's data is contained in text documents (newspaper articles, emails, literature, web pages, etc.). The importance of this medium has led many researchers to look for suitable methods of analyzing natural language texts in order to extract important and useful information. In comparison with data stored in a structured format (databases), text stored in documents is unstructured, and dealing with such data requires a preprocessing step that transforms the textual data into a format suitable for automatic processing [6].

Text mining is a new and exciting area of computer science research that tries to solve the problem of information overload by combining techniques from data mining, machine learning, natural language processing, information retrieval, and knowledge management. Text mining, also known as text data mining [7] or knowledge discovery from textual databases [8], refers generally to the automatic process of extracting interesting and high-quality information or knowledge from unstructured text documents by using a suite of analysis tools [9].

Text mining takes much of its inspiration and direction from core research on data mining. Therefore, text mining and data mining systems share many high-level architectural similarities.
For example, both text mining and data mining systems depend on preprocessing routines, pattern-discovery algorithms, and presentation-layer elements [1]. Furthermore, text mining adopts in its core knowledge discovery operations many of the specific types of patterns that were first introduced and vetted in data mining research [9].

The difference between data mining and text mining lies in the specific stages of data preparation and in the difficulty of finding the important patterns, which is due to the semi-structured or unstructured nature of the textual documents being processed.

Data mining systems assume that the data have already been stored in a structured format. Therefore, their preprocessing stage focuses on two critical tasks: scrubbing and normalizing data, and creating extensive numbers of table joins. In contrast, for text mining systems, preprocessing focuses on the identification and extraction of representative features for natural language documents. These preprocessing tasks are responsible for transforming the unstructured, original-format content of document collections into a more explicitly structured intermediate format, a concern that is not relevant for most data mining systems. Text mining preprocessing tasks include a variety of techniques culled and adapted from information retrieval, information extraction, and computational linguistics research (such as tokenization, stop word removal, normalization, and stemming) [9].

Typical text mining tasks include text extraction and representation, information retrieval, document summarization, document clustering, and document classification.

Text representation is concerned with the problem of how to represent text data in a format suitable for automatic processing.
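As a rough illustration of the preprocessing tasks named above (tokenization, stop word removal, normalization, and stemming), here is a minimal Python sketch. The stop word list and the suffix-stripping rule are assumptions made up for this example. In particular, `crude_stem` is a toy stand-in and not the Porter algorithm or any other published stemmer.

```python
import re

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def tokenize(text):
    """Normalize to lowercase and split the text into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop very common words that carry little content."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Strip a few common English suffixes (toy rule, not a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Run the full pipeline: tokenize, remove stop words, stem."""
    return [crude_stem(t) for t in remove_stop_words(tokenize(text))]


print(preprocess("The miners are mining texts"))
```

The output of such a pipeline is the kind of structured intermediate format that later steps, such as building a document representation, can work on.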
In general, documents can be represented in two ways: as a bag of words, in which context and word order are ignored, or by finding common phrases in the text and treating them as single terms [10].

In information retrieval, the information to be retrieved is expressed as a query, and the task of the information retrieval system is to find and return the documents that contain the information most relevant to the given query. To achieve this, text mining techniques are used to analyse the text data and to compare the extracted information with the given query in order to find the documents that contain answers [10, 11].

The idea of text summarization is to automatically detect the most important phrases in a given text document and to create a condensed version of the input text for human use [10]. Text summarization can be done for a single document or for a document collection (multi-document summarization). Most approaches in this area focus on extracting informative sentences from texts and building summaries from the extracted information. Recently, many approaches have tried to create summaries based on semantic information extracted from the given text documents [10, 11].

Document clustering is a machine learning technique used to identify similarities between text documents based on their content. Unlike document classification, document clustering is an unsupervised method in which there are no pre-defined categories. The idea of document clustering is to create links between similar documents in a document collection so that they can be retrieved together [10-12].

Document classification is the assignment of text documents to one or more pre-defined categories based on their content [10, 13]. It is a supervised learning problem in which the categories are known in advance [10].
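To make the bag-of-words representation and the retrieval task described above more concrete, the following Python sketch ranks a few documents against a query. Cosine similarity over raw term counts is one common choice and is an assumption here, since the text does not name a specific similarity measure. The sample documents and the function names are likewise invented for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Term-frequency vector; word order and context are discarded."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query, documents):
    """Return (document index, score) pairs, most similar first."""
    q = bag_of_words(query)
    scores = [
        (i, cosine_similarity(q, bag_of_words(d)))
        for i, d in enumerate(documents)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical document collection for the example.
docs = [
    "fraud detection in credit card data",
    "clustering of news articles",
    "text mining of news articles and emails",
]
ranking = rank_documents("mining news text", docs)
for index, score in ranking:
    print(f"{score:.3f}  {docs[index]}")
```

The same similarity scores could also serve as a starting point for document clustering, where similar documents are grouped without any pre-defined categories.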
For the document classification problem, many machine learning techniques, including decision trees, K-nearest neighbour, support vector machines (SVM), and the Naive Bayes algorithm, have been used to build document classification models. More details about document classification are given in the next section.
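As a small illustration of one of the techniques listed above, here is a sketch of a multinomial Naive Bayes document classifier in plain Python. The use of add-one (Laplace) smoothing, the class name `NaiveBayesClassifier`, and the four training sentences are assumptions made for this example and do not come from the cited work.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one smoothing (illustrative sketch)."""

    def __init__(self):
        self.class_doc_counts = Counter()        # documents seen per category
        self.word_counts = defaultdict(Counter)  # word counts per category
        self.vocabulary = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_doc_counts[label] += 1
            for token in tokenize(text):
                self.word_counts[label][token] += 1
                self.vocabulary.add(token)

    def predict(self, text):
        total_docs = sum(self.class_doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_doc_counts.items():
            # Log prior of the category plus log likelihood of each known word.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokenize(text):
                if token not in self.vocabulary:
                    continue  # ignore words never seen during training
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labelled training documents (pre-defined categories).
train_texts = [
    "stock market prices fall",
    "bank raises interest rates",
    "team wins football match",
    "player scores winning goal",
]
train_labels = ["finance", "finance", "sport", "sport"]

classifier = NaiveBayesClassifier()
classifier.fit(train_texts, train_labels)
print(classifier.predict("interest rates and stock prices"))
print(classifier.predict("football player scores"))
```

Because the categories are fixed in advance and learned from labelled examples, this is a supervised method, in contrast to document clustering.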
