Facilities
|
|
Server: 4X Intel Xeon @ 900 MHz, 4 GB RAM, RAID 3X36 GB
Workstations: Pentium 4, 1.8 GHz, 512 MB RAM, 60 GB Hard Drive
|
Links
|
- NASA
National Aeronautics and Space Administration.
- IEEE
Institute of Electrical and Electronics Engineers.
- ACM
ACM Digital Library.
- SIGKDD
ACM Special Interest Group on Knowledge Discovery and Data Mining.
- KDD Website
Information about Data Mining, Knowledge Discovery, Text Mining, Web Mining.
- NSF
National Science Foundation.
- KSTC
Kentucky Science & Engineering Foundation.
|
Datasets
|
- UCI Knowledge Discovery in Databases Archive
This is an online repository of large data sets which encompasses a wide variety of data types, analysis tasks, and application areas. The primary role of this repository is to enable researchers in knowledge discovery and data mining to scale existing and future data analysis algorithms to very large and complex data sets.
- The Machine Learning network Online Information Service
This site is dedicated to the field of machine learning, knowledge discovery, case-based reasoning, knowledge acquisition, and data mining. Get information about research groups and persons within the community. Browse through the list of software and data sets.
|
Talks
|
|
Tools
|
This web page is a quick reference for the following programs. It is not designed to be a full help file or tutorial on these programs, but merely to help you get started with the available tools.
|
Tiberius
|
Tiberius is a visual neural data-mining tool that uses many algorithms to predict a desired output given selected inputs. Tiberius, according to its authors, is a multi-layered perceptron neural network, which discovers relationships between input variables and output variables.
To use Tiberius, open up the program and click on the create button. Tiberius accepts many file formats, including the following:
- Excel Files
- Access Files
- Tab Delimited Files
- Comma Delimited Text
- Universal Data Link
Tiberius accepts only numerical data. Preprocessing can be done on the next screen, which lets the user select the inputs and outputs to use and the fields to exclude. Once the desired inputs, outputs, and “junk” data are selected, the user can train the network to predict an outcome. See the online help for a more detailed explanation of how to load files and use Tiberius.
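Because only numerical data is accepted, it can help to scan a comma-delimited file for non-numeric cells before loading it. The following is a minimal sketch and not part of Tiberius; the function name `non_numeric_cells` and the sample columns are hypothetical, and it assumes the file has a header row unless told otherwise.

```python
import csv
import io


def non_numeric_cells(csv_text, has_header=True):
    """Return (row_number, column_name, value) for every cell that is not a number."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    if has_header:
        header, body = rows[0], rows[1:]
        first_row_number = 2  # row 1 is the header
    else:
        header = [f"column{i + 1}" for i in range(len(rows[0]))]
        body = rows
        first_row_number = 1
    problems = []
    for offset, row in enumerate(body):
        for name, value in zip(header, row):
            try:
                float(value)
            except ValueError:
                problems.append((first_row_number + offset, name, value))
    return problems


# Hypothetical sample: the "region" column contains one text value.
sample = "age,income,region\n34,52000,3\n41,61000,north\n"
print(non_numeric_cells(sample))
```

Any cell reported here would need to be recoded as a number, or its field excluded, before training.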
|
Quick Facts
Name of Software | Tiberius (Authors: Phil Brierley and Andrew Lewis)
Type | Multi-Layered Perceptron Neural Network
File Formats Accepted | Excel, Access, Tab Delimited, Comma Delimited, Universal Data Link
Limitations | Only accepts numerical data.
Online Help (Web) | http://www.philbrierley.com/
|
|
Roc (Robust Bayesian Classifier)
|
Roc is an implementation of the Robust Bayesian Estimator, which its authors describe as a Bayesian classifier that can handle missing data and is not affected by the patterns of the missing data. The classifier uses a wizard-style interface that prompts the user at every step for the input data files, the parameter settings, and the choice of algorithm. It uses Bayes' theorem and classical Bayesian techniques (as described in many texts) to classify a set of data with unknown classes.
To use Roc, open the program and follow the instructions on the screen. Roc's simple, wizard-like interface makes it easy to manipulate parameters. After the welcome screen, the user can input the training file by clicking the Browse button. Roc accepts text files only, in either tab-delimited form (*.txt) or space-delimited form (*.prn). The user can also load a classifier file, which is a file that has already been “trained” on previous data.
Once the training file is loaded, the user can go through the different dialogs of the interface to set parameters and load the test file. The test file must also be a tab- or space-delimited text file.
|
|
See 5
|
See 5 is a program that uses the C5.0 algorithm to produce decision trees. See 5 requires two types of files to run: a names file and a data file. The names file contains the names of the attributes, written in a special format readable by the program. The data file contains the raw data values for each attribute. To load a data set, you only need to open the data file; the names file must be in the same directory and is read in automatically after the data file is loaded.
Once a data file is loaded, click the Construct Classifier button (or choose the same command from the File menu) and set the parameters as necessary in the dialog box that appears. After you click OK, the classifier runs and builds a decision tree from the data.
All file types used with See 5 are as follows:
- Names file – as described above, contains the names of the attributes
- Data file – contains the raw data
- Test file – an optional file that can be used to test the classifier built from the data file
- Cost file – an optional file that associates costs with the raw data
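The names and data files can be produced with a short script. This is a minimal sketch, assuming the usual C5.0 layout in which the first entry of the names file is the target attribute and each later line declares one attribute as `continuous` or as a comma-separated list of values; the helper names and the sample attributes are hypothetical, and the See 5 online help remains the authority on the format.

```python
def names_file_text(target, attributes):
    """Build the text of a .names file.

    attributes maps each attribute name to either the string "continuous"
    or a list of the discrete values it may take.
    """
    lines = [f"{target}."]  # first entry names the attribute to predict
    for name, spec in attributes.items():
        if spec == "continuous":
            lines.append(f"{name}: continuous.")
        else:
            lines.append(f"{name}: {', '.join(spec)}.")
    return "\n".join(lines) + "\n"


def data_file_text(rows):
    """Build the text of a .data file: one case per line, comma-separated."""
    return "\n".join(",".join(str(v) for v in row) for row in rows) + "\n"


# Hypothetical attributes; values in each row follow the same order.
attributes = {
    "age": "continuous",
    "sex": ["M", "F"],
    "diagnosis": ["sick", "healthy"],
}
cases = [[34, "M", "sick"], [51, "F", "healthy"]]

print(names_file_text("diagnosis", attributes))
print(data_file_text(cases))
```

Saving the two strings under the same base name (for example `patients.names` and `patients.data`, both hypothetical) in one directory matches the loading behavior described above.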
|
Name of Software | See 5 – Rulequest Research
Type | Decision Tree software that uses the C5.0 algorithm
File Formats Accepted | Data files (*.data), comma separated values. Names files (*.names), special format (see online help)
Limitations | 400 cases (test and training) in demo version
Online Help (Web) | http://www.rulequest.com/see5-win.html
|
WizWhy
|
WizWhy is a data mining software tool that predicts the output of a data set based on if-then rules within the data.
After starting the program in Microsoft Windows, open a file for analysis by clicking the “Open Data of Type” button on the Basic Data tab. You can choose from different database (file) types, including ASCII, dBase, MS Access, MS SQL, Oracle, ODBC, and OLE DB.
After the data set is loaded, the data fields are displayed automatically. Click on the field you want to be the dependent variable, that is, the one to be predicted. After that, you can click on the Rule Parameters tab to adjust any parameters, if necessary. Click on the Issue Rules button to issue the reports. Five types of reports are issued. They are as follows:
- Rule Report – lists the discovered if-then rules along with an analysis of the rules' explanatory power.
- Trend Report – summarizes the data by exhibiting the one-condition trends in the data.
- Unexpected Rule Report – displays the rules that are unexpected compared to more basic rules and trends (similar to interesting analysis).
- Comprehensive Rule Report – lists the if-and-only-if rules.
- Unexpected Cases Report – highlights deviations in the dependent variable's value from the expected value, based on the discovered rules.
More information on setting the parameters and the types of reports generated can be found in the online help.
|
Name of Software | WizWhy (Wizsoft)
Type | Predictor / Interesting Analysis software
File Formats Accepted | ASCII, dBase, MS Access, MS SQL, Oracle, ODBC, and OLE DB.
Limitations | 1000 Records
Online Help (Web) | www.wizsoft.com
|
Sipina
|
Sipina is a software tool that produces an analysis of a data set similar to that of a decision tree, but in a lattice format. According to its authors, this is more general than the ID3 or C4.5 methods.
Sipina reads ASCII data files with a (*.data) extension. You can also use dBase format (*.dbf), Paradox format (*.db), or data exported from a Lotus 123 (*.wks) spreadsheet. Open a file by choosing “Open Data” from the File menu. From there, you can choose to open an entire database, a data file (text), a parameter file, or a validation data set. Once the desired file is opened, you can immediately start the analysis by choosing “Start Automatic Analysis” from the Analysis menu, which automatically generates the lattice. Before or after doing this, you can adjust other parameters to suit your needs. See the online help for more information on how to do this.
|
Name of Software | Sipina (D.A. Zighed and R. Rakotomalala)
Type | General Decision Tree analysis (Lattice structure)
File Formats Accepted | ASCII (*.data), dBase (*.dbf), Paradox (*.db), Lotus 123 (*.wks)
Limitations | Theoretically 2^32 cases, over 16000 attributes
Online Help (Web) | http://eric.univ-lyon2.fr/~ricco/sipina.html
|
OLAP
|
EASY OLAP is a software tool for displaying and analyzing data from any data source via multidimensional views. The application provides access to data in a way that makes obtaining information fast and efficient. It provides interactive data visualization and data creation tools, offering an intuitive, multidimensional view of the data for accessing, identifying, and analyzing key information.
In OLAP, the user can enter raw data directly into the software or import it from an Excel spreadsheet. The interface is similar to an Excel spreadsheet, making it simple to import and export data items and to enter data manually.
|
Belief Network
|
The Belief Network by Powersoft is three software tools in one: a Bayesian belief network constructor (using discrete data), a predictor using Bayesian techniques, and a preprocessor that can change the formats of data and discretize continuous variables.
Each of the three tools uses a very simple wizard-like interface to preprocess data, load data, and adjust any parameters, if necessary. Simply follow the instructions at each step of the wizards to perform the desired tasks. The user will probably need to run the preprocessor first to transform the data into a form the program can use. Both the Bayesian belief network constructor and the predictor accept the following file formats: Access; dBase 3, 4, and 5; FoxPro 2, 2.5, 2.6, and 3.0; Paradox 3.x, 4.x, and 5.x; Excel 3, 4, 5.0, 95, and 97; text files; and ODBC data sources.
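To make the idea of discretizing a continuous variable concrete, here is a minimal sketch of equal-width binning. The strategy is an assumption chosen for illustration, since the method used by the preprocessor is not stated here, and the function name `equal_width_bins` is hypothetical.

```python
def equal_width_bins(values, bin_count):
    """Map each continuous value to a bin index from 0 to bin_count - 1."""
    low, high = min(values), max(values)
    if high == low:
        return [0 for _ in values]  # every value is identical: one bin
    width = (high - low) / bin_count
    # The maximum value would fall just past the last bin, so clamp it back in.
    return [min(int((v - low) / width), bin_count - 1) for v in values]


ages = [18, 22, 35, 47, 51, 64]
print(equal_width_bins(ages, 3))  # [0, 0, 1, 1, 2, 2]
```

Each bin index can then be treated as one discrete state of the variable.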
|
Quick Facts
Name of Software | Belief Network – Powersoft
Type | Bayesian Belief Network analyzer and predictor
File Formats Accepted | See Above
Limitations | None Stated
Online Help (Web) | www.cs.ualberta.ca/~jcheng
|
XMDV
|
XMDV is a visualization tool used for interpreting a variety of databases. XMDV was created to analyze these databases through scatter plots, star glyphs, parallel coordinates, and dimensional stacking.
To use XMDV, open the program and open a database file. XMDV accepts files with the extension .okc. These are text files that are specially formatted so that XMDV can interpret the data properly.
The data can be multivariate, and both numbers and categories are allowed. Preprocessing the data is straightforward. The first line of the database file holds two numbers: the first is the number of dimensions, or fields, in the database, and the second is the number of records in the file. The following lines are the field names. If a field is categorical, its line must be formatted like this: “Sex(1,3,'Infant','Male','Female')”. The numbers represent the total number of choices in the field. For example, Infant would be represented by a one in the database file. After the field lines come the data range lines. It is recommended that the ranges be padded a bit so that the data displays well in XMDV. After the ranges, the last thing that remains is the actual data, whose values must be separated by commas.
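The layout described above can be assembled programmatically. The following is a rough sketch for numeric fields only. It assumes that each range line holds a minimum and a maximum separated by a space, which is not specified here and should be checked against the XMDV documentation; the function name `build_okc_text` and the 5% padding are likewise assumptions.

```python
def build_okc_text(field_names, rows, pad_fraction=0.05):
    """Assemble file text in the layout described above, for numeric fields only."""
    lines = [f"{len(field_names)} {len(rows)}"]  # dimensions, then records
    lines.extend(field_names)                     # one field name per line
    for column in zip(*rows):
        low, high = min(column), max(column)
        pad = (high - low) * pad_fraction or 1.0   # avoid a zero-width range
        lines.append(f"{low - pad} {high + pad}")  # padded range (layout assumed)
    lines.extend(",".join(str(v) for v in row) for row in rows)
    return "\n".join(lines) + "\n"


# Hypothetical two-field data set.
fields = ["height", "weight"]
records = [(10.0, 100.0), (20.0, 300.0)]
print(build_okc_text(fields, records))
```

Padding each range slightly beyond the observed minimum and maximum follows the recommendation above.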
|
Name of Software | XMDV
Type | Visualization
File Formats Accepted | Comma delimited-like in a (*.okc) format
Limitations | Only up to twenty dimensions, but can be changed
Online Help (Web) | http://davis.wpi.edu/~xmdv/documents.html
|
|
Analog
|
Analog is a program that analyzes log files from World Wide Web servers. The tool is free and Windows-compatible. The program must be run from the command line. When it has finished its analysis, it generates a report in HTML format. A configuration file is available for customizing the analysis. No preprocessing or other data files are required.
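To illustrate the kind of summary a web log analysis produces, here is a minimal sketch that counts successful requests per page. It is not Analog's implementation; it assumes the server writes the Common Log Format, and the function name `requests_per_path` and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Matches the quoted request and the status code of a Common Log Format line.
LOG_LINE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) ')


def requests_per_path(log_lines):
    """Count successful (2xx) requests for each requested path."""
    counts = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if match and match.group("status").startswith("2"):
            counts[match.group("path")] += 1
    return counts


sample_log = [
    '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '127.0.0.1 - - [10/Oct/2000:13:56:01 -0700] "GET /tools.html HTTP/1.0" 200 1804',
    '127.0.0.1 - - [10/Oct/2000:13:57:12 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '127.0.0.1 - - [10/Oct/2000:13:58:40 -0700] "GET /missing.html HTTP/1.0" 404 209',
]
for path, count in requests_per_path(sample_log).most_common():
    print(path, count)
```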
|
ARTool
|
ARTool is a rule-based analysis tool. It searches through a database file for rule-based associations. ARTool is written in Java and will run on any platform on which the JDK is installed. Some preprocessing of the data is required. The data is primarily categorical data in which rule associations are to be discovered. First, the categories must be entered in an ASCII file in the following format:
1 green apples
2 red apples
3 oranges
4 bananas
5 grapes
These lines define the categories, and the numbers are their representations. When the actual data is entered, the format looks like:
1 5
3
3 5
These lines are the actual data, or transactions. The first line represents a consumer who bought green apples and grapes. This is the way the data file must be formatted. After the data is entered and formatted properly, the Java tool asc2db is used to convert the ASCII file into a database file. The data is then ready to be loaded into ARTool.
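The encoding shown above can be generated from item names rather than typed by hand. This is a minimal sketch; the function name `encode_transactions` is hypothetical, and writing the numbers of each transaction in ascending order is an assumption based on the example.

```python
def encode_transactions(catalog, baskets):
    """Return (category_lines, transaction_lines) in the layout shown above."""
    # Number the categories 1, 2, 3, ... in catalog order.
    ids = {item: number for number, item in enumerate(catalog, start=1)}
    category_lines = [f"{number} {item}" for item, number in ids.items()]
    # Each transaction becomes the space-separated numbers of its items.
    transaction_lines = [
        " ".join(str(number) for number in sorted({ids[item] for item in basket}))
        for basket in baskets
    ]
    return category_lines, transaction_lines


catalog = ["green apples", "red apples", "oranges", "bananas", "grapes"]
baskets = [["green apples", "grapes"], ["oranges"], ["grapes", "oranges"]]

category_lines, transaction_lines = encode_transactions(catalog, baskets)
print("\n".join(category_lines))
print("\n".join(transaction_lines))
```

The two printed sections reproduce the category and transaction lines of the example; the resulting ASCII file would still need to be converted with asc2db.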
|
JAVA Neural Networks Simulator (JNNS)
|
The Java Neural Network Simulator (JNNS) is a program written in Java for running tests to analyze learning machines. The program can run on any platform and has a graphical user interface. It can be launched from the command line or by double-clicking the jar file. Example files are included that show how to set up a network. Importing data is possible, but the time and effort needed to produce a network file that way is rarely worthwhile; in the long run, less total time is spent on a project if a new network file is created by hand. Once a new file is created, options such as link behavior and pattern type must be selected in order to train the new network. More extensive help and a tutorial can be found on the web.
|
Name of Software | JAVA Neural Networks Simulator (JNNS)
Type | Neural Networks Simulator
File Formats Accepted | Pattern Files (.pat), Network Files (.net), Configuration Files (.cfg), and Log Files (.log)
Limitations | None
Online Help (Web) | http://www-ra.informatik.uni-tuebingen.de/
|
The Visual Statistics System (ViSta)
|
ViSta is a data-mining tool used to compute basic statistics. These basic statistics pave the way for other analyses such as regressions or ANOVA tables. This tool accepts proprietary file formats. The data can be multivariate: numerical, categorical, or both. The data can be entered into ViSta directly, or it can be imported from another ViSta file or an ASCII data file.
|
WEKA
|
Weka is a collection of machine learning algorithms. Implemented schemes for classification are decision tree inducers, rule learners, naïve Bayes, decision tables, locally weighted regression, support vector machines, instance-based learners, logistic regression, voted perceptrons, and multi-layer perceptrons. Implemented schemes for numeric prediction are linear regression, model tree generators, locally weighted regression, instance-based learners, decision tables, and multi-layer perceptrons. Implemented “meta-schemes” are bagging, stacking, boosting, regression via classification, classification via regression, and cost-sensitive classification. The program is written in Java and can be run on any platform.
In Weka, the user can enter raw data directly into the software or import it from a comma-delimited file. Special symbols are expected in order for the file to be read properly. The first line has the “@” symbol followed by the word relation; the next word on the line is the name of the data set. The next few lines declare the attributes. Each begins with the “@” symbol followed by the word attribute, then the field name, and then either the word numeric or the list of choices for categorical data. After all of the field names are entered, the next line is @DATA. The remaining lines are the actual data values separated by commas, with each record on a new line. The file is then ready to be read into Weka.
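The file layout described above can be written out by a small helper. This is a minimal sketch, assuming attribute names and values contain no spaces or commas (which would need quoting); the function name `arff_text` and the sample weather data are hypothetical.

```python
def arff_text(relation, attributes, rows):
    """Build ARFF text.

    attributes is a list of (name, kind) pairs, where kind is the string
    "numeric" or a list of the allowed categorical choices.
    """
    lines = [f"@relation {relation}"]
    for name, kind in attributes:
        if kind == "numeric":
            lines.append(f"@attribute {name} numeric")
        else:
            choices = ",".join(kind)
            lines.append(f"@attribute {name} {{{choices}}}")  # choices in braces
    lines.append("@data")
    lines.extend(",".join(str(v) for v in row) for row in rows)  # one record per line
    return "\n".join(lines) + "\n"


weather_attributes = [
    ("outlook", ["sunny", "overcast", "rainy"]),
    ("temperature", "numeric"),
    ("play", ["yes", "no"]),
]
weather_rows = [("sunny", 85, "no"), ("overcast", 83, "yes")]
print(arff_text("weather", weather_attributes, weather_rows))
```

Saving the returned text in a file with the .arff extension gives a file in the form described above.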
|
Name of Software | WEKA
Type | Classification
File Formats Accepted | Files with the .arff extension
Limitations | Parameters have individual limits
Online Help (Web) | http://www.cs.waikato.ac.nz/~ml/weka/
|
|
SPSS
|
SPSS is a data-mining tool that is a suite of algorithms. SPSS is similar to SAS in that there are many options for data analysis, including algorithms such as linear regression and K-means clustering. SPSS is primarily a tool for Windows platforms. It is commercial software and is not free. Little preprocessing is involved in loading data into SPSS. Data can be entered directly into SPSS or loaded from a data file. The default extension for a data file is .sav. SPSS also accepts .txt files that are either comma- or tab-separated. Many other formats are supported, including SAS, Excel, Lotus, dBase, SYLK, and Systat formats and files with the extension .dat. A menu at the top offers a choice of analysis tools and other options available to the user.
|
Name of Software | SPSS
Type | Suite of algorithms
File Formats Accepted | SAS, Systat, Dbase, Lotus, Excel, Text, and Data files
Limitations | None
Online Help (Web) | http://www.spss.com/spssmr/support/
|
SAS
|
SAS is a data-mining tool that is a suite of algorithms. SAS has many algorithms to use for data mining. It is still primarily statistics-based and offers features such as the Enterprise Miner for data mining. The Enterprise Miner is a graphical interface that allows drag and drop of different types of algorithms. Files from Excel and other text data files can be imported into SAS for analysis. Extensive help and a tutorial are also included with SAS.
Quick Facts
Name of Software | SAS
Type | Suite of algorithms
File Formats Accepted | SAS, Systat, Dbase, Lotus, Excel, Text, and Data files
Limitations | None
Online Help (Web) | http://www.sas.com/service/edu/intro.html
|