Facilities


Server: 4 × Intel Xeon @ 900 MHz, 4 GB RAM, 3 × 36 GB RAID
Workstations: Pentium 4 @ 1.8 GHz, 512 MB RAM, 60 GB hard drive


   Links


  • NASA
    National Aeronautics and Space Administration.

  • IEEE
    Institute of Electrical and Electronics Engineers.

  • ACM
    ACM Digital Library.

  • SIGKDD
    ACM Special Interest Group on Knowledge Discovery and Data Mining.

  • KDD Website
    Information about Data Mining, Knowledge Discovery, Text Mining, and Web Mining.

  • NSF
    National Science Foundation.

  • KSTC
    Kentucky Science & Engineering Foundation.

   Datasets


  • UCI Knowledge Discovery in Databases Archive

    This is an online repository of large data sets encompassing a wide variety of data types, analysis tasks, and application areas. Its primary role is to enable researchers in knowledge discovery and data mining to scale existing and future data analysis algorithms to very large and complex data sets.

  • The Machine Learning network Online Information Service

    This site is dedicated to the fields of machine learning, knowledge discovery, case-based reasoning, knowledge acquisition, and data mining. It provides information about research groups and people within the community, along with lists of software and data sets.

   Talks

   Tools


This web page is a quick reference for the following programs. It is not intended to be full documentation or a tutorial for these programs; it is merely meant to help you get started with the available tools.

   Tiberius


Tiberius is a visual neural data-mining tool that predicts a desired output from selected inputs. According to its authors, Tiberius is a multi-layered perceptron neural network that discovers relationships between input variables and output variables.

To use Tiberius, open the program and click the Create button. Tiberius accepts many file formats, including the following:

  • Excel Files
  • Access Files
  • Tab Delimited Files
  • Comma Delimited Text
  • Universal Data Link

Only numerical data is accepted. Preprocessing can be done on the next screen, which allows the user to select the desired inputs and outputs and to exclude unwanted fields. Once the user selects the inputs, outputs, and “junk” data to use, the network can be trained to predict an outcome. See the online help for a more detailed explanation of how to load files and use Tiberius.
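
For illustration, a small comma-delimited input file for Tiberius might look like the following (the column names and values are hypothetical; here two numeric inputs are used to predict one numeric output):

x1,x2,y
0.5,1.2,3.1
0.9,0.7,2.4
1.4,2.0,5.3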

Quick Facts

Name of Software

Tiberius (Authors: Phil Brierley and Andrew Lewis)

Type

Multi-Layered Perceptron Neural Network

File Formats Accepted

Excel, Access, Tab Delimited, Comma Delimited, Universal Data Link

Limitations

Only accepts numerical data.

Online Help (Web)

http://www.philbrierley.com/

 

   Roc (Robust Bayesian Classifier)


Roc is an implementation of the Robust Bayesian estimator, described by its authors as a Bayesian classifier that can handle missing data and is not affected by the patterns of the missing data. The program uses a wizard-style interface that prompts the user at every step: loading data files, setting parameters, and selecting among different algorithms. The classifier is specially formulated to handle cases with missing data; it uses Bayes' theorem and classical Bayesian techniques (as described in many texts) to classify a set of data with unknown classes.

To use Roc, open the program and follow the on-screen instructions. Roc's simple, wizard-like interface makes it easy to manipulate parameters. After the welcome screen, the user can load the training file by clicking the Browse button. Roc accepts text files only, either tab delimited (*.txt) or space delimited (*.prn). The user can also load a classifier file, which is a file that has already been “trained” on previous data.

Once the training file is loaded, the user can step through the dialogs of the interface to set parameters and load the test file. The test file must also be a tab or space delimited text file.
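
As an illustration only (the attribute names and values are hypothetical, and the marker used for missing entries is an assumption; consult Roc's documentation for the exact conventions), a tab delimited training file might look like this:

age	income	class
34	52000	yes
?	61000	no
45	?	yes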

Quick Facts

Name of Software

Cviz (Multidimensional clustering software) – IBM Corporation

Type

Multidimensional clustering software, ability to handle large amounts of features

File Formats Accepted

Comma Separated Value (*.csv) format

Limitations

50,000 samples, up to 200 features

Online Help (Web)

http://www.lans.ece.utexas.edu/course/ee380l/share/soft/cviz/manual/cviz.html


   See 5


See 5 is a program that uses the C5.0 algorithm to produce decision trees. See 5 requires two types of files to run: a names file and a data file. The names file contains the names of the attributes in a special format readable by the program. The data file contains the raw data corresponding to each attribute. To load a file, you only need to open the data file; the names file must be in the same directory and is read in automatically after the data file is loaded.

Once a data file is loaded, click the Construct Classifier button (or choose it from the File menu) and set the parameters as necessary in the dialog box that appears. After clicking OK, the classifier runs and a decision tree is built from the data.

All file types used with See 5 are as follows:

  • Names file – as described above, contains the names of the attributes
  • Data file – contains the raw data
  • Test file – an optional file of unseen cases used to test the classifier built from the data file
  • Cost file – an optional file that assigns costs to different classification errors
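
For illustration only (the attribute names are hypothetical; see the online help for the exact syntax), a minimal names file might contain:

class.              | the attribute to be predicted
age:   continuous.
sex:   male, female.
class: yes, no.

and the matching data file would then contain one comma-separated case per line:

34,male,yes
51,female,no
62,male,no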

Name of Software

See 5 – RuleQuest Research

Type

Decision Tree software that uses the C5.0 algorithm

File Formats Accepted

Data files (*.data), comma separated values; names files (*.names), special format (see online help)

Limitations

400 cases (test and training) in demo version

Online Help (Web)

http://www.rulequest.com/see5-win.html


   WizWhy


WizWhy is a data mining software tool that predicts the output of a data set based on if-then rules within the data.

After starting the program in Microsoft Windows, open a file for analysis by clicking the “Open Data of Type” button on the Basic Data tab. You can choose from several database (file) types, including ASCII, dBase, MS Access, MS SQL, Oracle, ODBC, and OLE DB.

After loading the data set into the program, the data fields are displayed automatically. Click the field you want to be the dependent variable (the field to be predicted). After that, you can click the Rule Parameters tab to adjust any parameters, if necessary. Click the Issue Rules button to generate the reports. Five different types of reports are issued, as follows:

Rule Report – Lists the discovered if-then rules along with an analysis of each rule's explanatory power.

Trend Report – Summarizes the data by exhibiting the one-condition trends in the data.

Unexpected Rule Report – Displays rules that are unexpected relative to the more basic rules and trends (similar to interestingness analysis).

Comprehensive Rule Report – Lists the if-and-only-if rules.

Unexpected Cases Report – Highlights deviations of the dependent variable's value from the expected value, based on the discovered rules.

More information on setting the parameters and the types of reports generated can be found in the online help.

Name of Software

WizWhy (Wizsoft)

Type

Predictor / Interesting Analysis software

File Formats Accepted

ASCII, dBase, MS Access, MS SQL, Oracle, ODBC, and OLE DB.

Limitations

1000 Records

Online Help (Web)

www.wizsoft.com


   Sipina


Sipina is a software tool that produces an analysis of a data set similar to a decision tree, but in a lattice format. According to its authors, this is more general than the ID3 or C4.5 methods.

Sipina reads ASCII data files with a *.data extension. You can also use dBase format (*.dbf), Paradox format (*.db), or export data from a Lotus 123 (*.wks) spreadsheet. Open a file by choosing “Open Data” from the File menu; from there you can choose to open an entire database, a data file (text), a parameter file, or a validation data set. Once the desired file is opened, you can start the analysis immediately by choosing “Start Automatic Analysis” from the Analysis menu, which generates the lattice. Before or after doing this, you can adjust other parameters to suit your needs. See the online help for more information.

Name of Software

Sipina (D.A. Zighed and R. Rakotomolala)

Type

General Decision Tree analysis (Lattice structure)

File Formats Accepted

ASCII data files (*.data); also dBase (*.dbf), Paradox (*.db), and Lotus 123 (*.wks)

Limitations

Theoretically 2^32 cases and over 16,000 attributes

Online Help (Web)

http://eric.univ-lyon2.fr/~ricco/sipina.html


   OLAP


EASY OLAP is a software tool for displaying and analyzing data from any data source via multidimensional views. According to its authors, the application provides access to data in a way that makes obtaining information fast and efficient, and its interactive data visualization and data creation tools give an intuitive, multidimensional view of the data for accessing, identifying, and analyzing key information.

In OLAP, the user can input raw data directly into the software or import it from an Excel spreadsheet. The interface in OLAP is similar to an Excel spreadsheet, making it simple to import and export data items and to input data manually.

Name of Software

OLAP – (Peyo)

Type

Multidimensional data viewer and analyzer

File Formats Accepted

MS Excel files and manual input

Limitations

None Stated

Online Help (Web)

http://www.peyo-home.sk/main.html?household/index.htm~main


   Belief Network


The Belief Network package by Powersoft is three software tools in one: a Bayesian belief network constructor (using discrete data), a predictor using Bayesian techniques, and a preprocessor that can change the format of data and discretize continuous variables.

Each of the three tools uses a very simple wizard-like interface to preprocess data, load data, and adjust any parameters, if necessary. Simply follow the instructions at each step of the wizards to perform the desired tasks. The user will probably need to run the preprocessor first to transform the data into a form usable by the program. Both the Bayesian belief network and predictor tools accept the following file formats: Access; dBase 3, 4, and 5; FoxPro 2, 2.5, 2.6, and 3.0; Paradox 3.x, 4.x, and 5.x; Excel 3, 4, 5.0, 95, and 97; text files; and ODBC data sources.


Quick Facts

Name of Software

Belief Network – Powersoft

Type

Bayesian Belief Network analyzer and predictor

File Formats Accepted

See Above

Limitations

None Stated

Online Help (Web)

www.cs.ualberta.ca/~jcheng


   XMDV


XMDV is a visualization tool used for interpreting a variety of databases. It was created to analyze these databases through scatter plots, star glyphs, parallel coordinates, and dimensional stacking.

To use XMDV, open the program and open a data set. XMDV accepts files with the extension .okc; these are text files specially formatted so that XMDV can properly interpret the data.

The data that can be accepted is multivariate; both numbers and categories can appear in the database. Preprocessing the data is straightforward. The first line of the database file contains two numbers: the first is the number of dimensions (fields) in the database, and the second is the number of records in the file. The following lines are the field names; if a field is categorical, the line must be formatted as, for example, “Sex(1,3,'Infant','Male','Female')”, where the numbers indicate the number of choices in the field (so Infant would be represented by a one in the data). After the field lines come the data range lines; it is recommended that the ranges be padded a bit so that the data displays well in XMDV. After the ranges, the only thing that remains is the actual data, which must be separated by commas.
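
As an illustration only (the fields and values are hypothetical, following the description above; check the XMDV documentation for the exact contents expected on each range line), a tiny .okc file with three fields and two records might look like this:

3 2
Length
Sex(1,3,'Infant','Male','Female')
Weight
0.0 1.0
1 3
0.0 3.0
0.455,2,0.514
0.350,1,0.226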

Name of Software

XMDV

Type

Visualization

File Formats Accepted

Comma-delimited text in the *.okc format

Limitations

Only up to twenty dimensions, but can be changed

Online Help (Web)

http://davis.wpi.edu/~xmdv/documents.html


   Analog


Analog is a program that analyzes log files from World Wide Web servers. The tool is free and Windows compatible. The program is run from the command line; when it has finished its analysis, it generates a report in HTML format. A configuration file is available to customize the analysis. No preprocessing or other data files are required.
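
As a sketch only (the file names and host name below are placeholders; the Analog documentation lists all available configuration commands), a minimal configuration file might contain:

LOGFILE  /var/log/apache/access.log
OUTFILE  report.html
HOSTNAME "My Web Server"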

Name of Software

Analog

Type

World Wide Web Analyzer

File Formats Accepted

Web server log files

Limitations

None

Online Help (Web)

http://www.analog.cx/docs/Readme.html


   ARTool


ARTool is a rule-based analysis tool. It searches through a database file for rule-based associations. ARTool is written in Java and will run on any platform as long as a JDK is installed. Some preprocessing of the data needs to occur: the data is primarily categorical, and it is among these categories that rule associations are to be discovered. First, the categories must be entered in an ASCII file with the following format:

1 green apples

2 red apples

3 oranges

4 bananas

5 grapes

 

These lines define the categories, and the numbers are their representations. So when the actual data is entered, the format looks like:

1 5

3

3 5

The lines are the actual data, or transactions; the first line represents a consumer who bought green apples and grapes. This is the way the data file must be formatted. After the data is entered and formatted properly, the Java tool asc2db is used to convert the ASCII file into a database file. The data is then ready to be loaded into ARTool.

Name of Software

ARTool

Type

Rule-based associations

File Formats Accepted

Files with the extension .db

Limitations

None

Online Help (Web)

http://www.cs.umb.edu/~laur/ARtool/


   JAVA Neural Networks Simulator (JNNS)


The Java Neural Network Simulator (JNNS) is a Java program used for building and testing learning machines. It runs on any platform and has a nice graphical user interface, and it can be launched from the command line or by double-clicking the jar file. Example files can be opened to see how a network is set up. Importing data is possible, but the time and effort needed to prepare a network file externally is usually not worthwhile; in the long run, less total time is spent on the project if the network file is created by hand within the program. Once a new file is created, options such as link behavior and pattern type must be selected in order to train the new network. More extensive help and a tutorial can be found on the web.

Name of Software

JAVA Neural Networks Simulator (JNNS)

Type

Neural Networks Simulator

File Formats Accepted

Pattern Files (.pat) Network Files (.net) Configuration Files (.cfg) and Log Files (.log)

Limitations

None

Online Help (Web)

http://www-ra.informatik.uni-tuebingen.de/

   The Visual Statistics System (ViSta)


ViSta is a data-mining tool used to compute basic statistics. These basic statistics pave the way for other analyses such as regressions or ANOVA tables. The tool accepts proprietary file formats. The data can be multivariate – numerical, categorical, or both. The data can be entered into ViSta directly, or it can be imported from another ViSta file or an ASCII data file.

Name of Software

ViSta

Type

Statistical Analysis

File Formats Accepted

Proprietary Files (.LST) or ASCII Files (.asc)

Limitations

None

Online Help (Web)

http://www.visualstats.org/vista-frames/online/index.html


   WEKA


Weka is a collection of machine learning algorithms. Implemented schemes for classification include decision tree inducers, rule learners, naïve Bayes, decision tables, locally weighted regression, support vector machines, instance-based learners, logistic regression, voted perceptrons, and multi-layer perceptrons. Implemented schemes for numeric prediction include linear regression, model tree generators, locally weighted regression, instance-based learners, decision tables, and multi-layer perceptrons. Implemented “meta-schemes” include bagging, stacking, boosting, regression via classification, classification via regression, and cost-sensitive classification. The program is written in Java and can be run on any platform.

In Weka, the user can input raw data directly into the software or import it via a comma-delimited file. Special symbols are expected in order for the file to be read properly. The first line starts with the “@” symbol followed by the word relation; the next word on the line is the name of the data set. The next few lines are the attributes: each starts with the “@” symbol followed by the word attribute, then the field name, then either the word numeric or the actual choices for categorical data. After all of the field names are entered, the next line is @DATA, and the lines after it are the actual data values separated by commas, with each record on a new line. The file is then ready to be read into Weka.
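
For illustration (the relation and attribute names are hypothetical), a small file following this layout might look like this:

@RELATION weather

@ATTRIBUTE outlook {sunny, overcast, rainy}
@ATTRIBUTE temperature NUMERIC
@ATTRIBUTE play {yes, no}

@DATA
sunny,85,no
overcast,83,yes
rainy,70,yes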

Name of Software

WEKA

Type

Classification

File Formats Accepted

Files with the .arff extension

Limitations

Parameters have individual limits

Online Help (Web)

http://www.cs.waikato.ac.nz/~ml/weka/


   SPSS


SPSS is a data-mining tool consisting of a suite of algorithms. SPSS is similar to SAS in that there are many options for data analysis, including algorithms such as linear regression and k-means clustering. SPSS is primarily a Windows-based tool; it is commercial and not free. Not much preprocessing is involved when loading data into SPSS: data can be entered directly into SPSS, or a data file can be loaded. The default extension for a data file is .sav, and SPSS also accepts .txt files that are either comma or tab separated. Many other formats are supported as well: SAS, Excel, Lotus, dBase, SYLK, and Systat formats, and files with the extension .dat. A menu at the top offers the analysis tools and other options available to the user.

Name of Software

SPSS

Type

Suite of algorithms

File Formats Accepted

SAS, Systat, Dbase, Lotus, Excel, Text, and Data files

Limitations

None

Online Help (Web)

http://www.spss.com/spssmr/support/


   SAS


SAS is a data-mining tool consisting of a suite of algorithms. SAS has many algorithms to use for data mining, but it is still primarily statistically based; it offers features such as Enterprise Miner for data mining. Enterprise Miner is a graphical interface that allows different types of algorithms to be dragged and dropped. Excel files and other text data files can be imported into SAS for analysis. An extensive help system and tutorial are also included with SAS.

Quick Facts

Name of Software

SAS

Type

Suite of algorithms

File Formats Accepted

SAS, Systat, Dbase, Lotus, Excel, Text, and Data files

Limitations

None

Online Help (Web)

http://www.sas.com/service/edu/intro.html