Data Help
Loading data
Data can be loaded into StatCrunch from one of four ways:
- Import a file from a local machine or network.
- Type in the URL of a file on the WWW.
- Pasting data directly into StatCrunch.
- Choose a sample data set.
Load Data From File or URL
To load data from a file, select this option from the
Load Data submenu of the Data menu. StatCrunch can load
text and Excel files (must end with a .xls extension) from a local
computer or from a WWW address.
- To load a file from a local machine, either type the full path to the file in the File input box or click the browse button to locate the file on the local system.
- To load a file from a WWW address, type in the Internet location in the WWW address input box. (e.g., http://www.test123.com/mydata.xls)
- If the first line of the file contains column names, make sure the Use first line as variable names option is checked.
- When loading an ASCII text file, specify the observation delimiter to be used. Note that this step may be ignored if loading an Excel file. The delimiter options are whitespace (any whitespace character such as a space or tab), tab, comma (for .csv files) and semicolon.
- Click the Load Data button , and after a few moments, the data will appear in the data table.
From Paste
Data may also be pasted into StatCrunch. This
option is not as reliable across platforms, so loading data from a
file is preferred. To load data from paste, select this option from
the Load Data submenu of the Data menu.
- Copy the data to be loaded into StatCrunch, and then paste it into
the text box. Data may also be entered into the box manually, and a
formatting header may be added to the top of the
text input if necessary. On some platforms, it is impossible to insert
tabs when editing tab-delimited data that has been pasted into the
text box.
- If the first line of the file contains variable names, make sure
the Use first line as variable names option is checked.
- Specify the observation delimiter to be used. The delimiter
options are whitespace (any whitespace character such as a space or
tab), tab, comma (for .csv files) and semicolon. When pasting from
Excel or other spreadsheets, choose the tab delimiter.
- Click the Load Data button, and after a few moments, the
data will appear in the data table.
Sample Data
A number of sample data sets are provided for users to experiment with
in StatCrunch. Some data sets are not applicable to certain analysis or
graphics. Be careful to test an analysis/graphic with appropriate
data. To see a complete listing of sample sets, click the Sample
Data option under the Load Data submenu of the Data
menu.
Formatting Data
Text data can be specially formatted for StatCrunch by adding a header to
the data before importing it. This is typically not necessary
when loading data from spreadsheets. The header can be used to
provide additional information about data, to specify variable names
and to specify the number of observations per column. An example
appears below.
#Test Data
varnames = Control Treatment
nobs = 23 21
All lines that begin with # are ignored when the information is
loaded into StatCrunch. these lines can be used as comments that provide
general information about the data set.
The line beginning with varnames = is used to specify the names
of the variables, and the line beginning with nobs = specifies
the number of observations per variable. If a data set has the same
number of observations for each variable, the nobs statement
can be ignored. This option is only useful when there is an unequal
number of observations per column. In the example above, the first
column would contain 23 observations and the second column would
contain 21 observations. Also, most methods for loading data have the option of using the
first uncommented row of the data for variable names as a default, so
that in many cases the varnames = statement may also be
omitted.
Save Data
Saves the data in the data table to your My Data folder.
- Specify a name for the data file.
- Specify whether or not to overwrite an existing file.
- Specify the delimiter to use to separate values.
- Click Save to write the data file.
Data Table
The data table refers to the grid of cells used for entering and
editing data in StatCrunch. The top row of the data table is for column
names, and any data values entered into this row will not be used in
subsequent analyses. The left most column serves as row ids for the
data table. A row of data values may be highlighted by clicking
inside a row id cell. Clicking inside a highlighted row id cell will
deselect the row. The observations in highlighted rows will also be
highlighted in most graphics that StatCrunch produces. All highlighted
rows in the data table may be deselected by clicking on the Row
label in the top row of the row id column.
The data table is equipped with horizontal and vertical scroll bars at
the top and bottom of the data table, respectively, for quickly moving
through the table. The Page down and Page up keys can
also be used to quickly scroll through the table. The number of cells
presented in the table may not be increased by using these tools.
This can be done by using the keyboard actions (described below) or by
resizing the main window that contains the data table.
The cell that is highlighted receives keyboard input. StatCrunch will
first try to make a double precision number out of data entered, but
if this is not possible, StatCrunch will store the input as a string. If
"43StatCrunch" is entered, StatCrunch will read this as "43". In other
words, strings that start with numbers may be truncated after the
number. StatCrunch also uses scientific notation so that "0.000001" may
appear as "1E-6". All numbers are displayed with floating point
precision, but they are stored as double precision.
One should think of the cursor as occupying the position to the far
right in the highlighted cell. At this time, the cursor may not be
manually positioned inside the highlighted cell. The backspace key can
be used to delete values entered in a one by one fashion beginning
with the last value entered. The delete key can be used to remove all
values in the highlighted cell.
When a cell loses the highlight, the values entered in the cell may be
reformatted for better display. In the case of numbers for example,
the value "0.00000000012345" may be represented as "1.2345E-10". The
width of the column will adjust if the information entered exceeds the
column's width. See Column Widths below to
see how to change this behavior to accommodate fixed width columns.
When working with large data sets this auto-sizing feature should be
turned off for faster response times.
To change the cell that is highlighted, one may click the mouse inside
a new cell or by using the tab, return or arrow keys. If the user
tries to move to a cell that is not fully visible, the data table
will shift so that the cell becomes fully visible. The tab key moves
the highlight one cell to the right. The return key moves the
highlight one cell down. The up, down, right and left arrow keys shift
the highlighted cell in the direction of the arrow. All of these keys
can be used to increase the number of cells displayed in the table.
For example, if the data table currently displays 8 columns and the user
presses the tab key while in a highlighted cell in the 8th
column, StatCrunch will then display the 9th column with the highlighted
cell in the 9th column.
Export Data
The data stored in the data table can be exported to a new window for
downloading to the user's local machine by selecting the Export
data option under the Data Table submenu of the Data
menu. The data can be exported in either a text format or as an HTML
table. For the text option, one can specify the observation delimiter
to be used. The text option should be used to export data in a format
that can be reloaded in the future. The HTML option should be used to
export the data for publication purposes (web pages, pasting into word
processing programs, etc...). When the data appears in the new
window, use the File menu on the browser to save or print it.
Font
Choose the font and size from the drop downs provided and then click Update font!.
Column Widths
The width of a column can be adjusted by clicking on the right edge
of the column and dragging the mouse. One can also select the
Column Widths option under the Data menu to set
preferences for the default widths of columns.
- Specify the default width (in pixels) for columns. By default, a
column's width is set to 65 pixels.
- By default the column width will be automatically adjusted if
data entered exceeds the default width. To turn off this auto
resizing feature, check the Auto size columns to fit values
option. If the auto-sizing is turned off, only the first few values
in a cell that fit within the default width will be displayed and then
the ">" character is placed at the end of the cell to indicate there
is more information that can not be displayed. When working with
large data sets this auto-sizing feature should be turned off for
faster response times when doing data manipulations.
- Check the Apply defaults to all option to override any
custom sizing that has been done using the mouse.
Delete Columns
Simply select the columns to be deleted and click Delete. Please note that once a column is deleted it can not be recovered.
Move Columns
Select the columns to be moved. The columns will be relocated from
left to right in the order that they are selected. Choose the column
that the selected columns will be moved to the left of, or select the
Move after last column option.
Clear
This feature will create a new empty data table. All analysis results
and graphics appearing in separate windows will not be affected.
Row Selection
Rows in the data table may be highlighted by clicking the mouse inside
the cell containing the row id number. Clicking inside a highlighted
row id cell will deselect the row. Highlighted rows will also be
highlighted in most graphics that StatCrunch produces. All rows in the
data table may be deselected by clicking on the Row label in
the top row of the row id column. In addition to using the mouse in
this manner, the following options are available for selecting and
deselecting rows.
Interactive Tools
This option can be used to create sliders and category selectors that
allow a great deal of flexibility for data selection.
- Select the columns containing numeric values for which sliders are to be generated.
- Select the columns containing values (text or numeric) for which category
selectors are to be generated.
- Click the Create Tool button to generate the selection
tools. The appropriate numeric slide selectors and category selectors
will then be generated, and all rows in the data table will be
highlighted. For numeric sliders, the handles on each end of the
slider can be adjusted to control the range of values for the column
that are highlighted. For category selectors, the
categories for the column to be highlighted can be chosen from a list of distinct
values for the column. The selection behavior can be changed
from selecting rows that satisfy all selectors (equivalent to
intersecting the conditions set for all selectors) to
selecting rows that satisfy any selectors (equivalent to the
union of the conditions set for all selectors). Selectors
may be very useful when used in conjunction with interactive graphic
in exploratory analysis.
Select Where
This feature will highlight rows in the data table based on a boolean
expression using ANSI SQL syntax entered by the user. See the WHERE section for
more details on boolean expressions.
- Specify a boolean expression is entered that is to be evaluated
for each row in the data set.
- Click the Select Rows! button to
highlight the appropriate rows. If the expression is true, then the
row is highlighted. If the expression is false, no action is taken,
so if the row was already highlighted, it will still be highlighted
after the computation.
Deselect All
This option under the Data menu deselects all rows highlighted
in the data table. All highlighted rows may also be deselected by
clicking on the Row label in the top row of the row id column.
Create Column(s)
StatCrunch offers a number of procedures for adding new columns from
scratch or by performing a variety of data manipulations on existing
columns.
Transform data
This method is now deprecated. See compute expressions below.
Compute Expression
StatCrunch offers a great deal of flexibility for doing data transformations with this option.
One may
type an expression directly into the Transformation box. The following are examples of mathematical expressions that can be entered for the Hotdog Data. See Expressions for
a listing of operators and functions that can be used.
- Calories + Sodium
-
computes the sum of the Calories and Sodium columns
- Calories - Sodium
-
computes the difference of the Calories and Sodium columns
- Calories*Sodium
-
computes the product of the Calories and Sodium columns
- Calories/Sodium
-
computes the ratio of the Calories and Sodium columns
- log(Calories)
-
computes the natural logarithm of the Calories column
- (Calories - mean(calories))/std(calories)
-
computes the standard scores of the Calories column
One may also construct an expression from a menu
of common transformations. The typing option puts a great deal of
power in the hand of the user while the menu option provides direction
for those users less familiar with standard mathematical expressions.
The list of common transformations includes: log(Y), sqrt(Y), 1/Y,
Y-X, Y/X, a+b*Y, Y**b, sin(Y), arcsin(Y), log10(Y+a), etc.. In each
case, the necessary columns (Y and/or X) must be selected and the
values for constants (a and/or b) must be specified. When the Set
Transformation button is clicked, the proper mathematical syntax
will be placed in the Transformation input box. Click the
Transform button to compute the values for the new transformed
column. This column will be added to the right of any existing
columns.
Sequence Data
With this option, StatCrunch offers the user the ability to create a new column of patterned sequence data.
- In the From input box, specify the numeric starting value for the sequence.
- In the To input box, specify the numeric ending value for the sequence.
- In the By input box, specify the numeric increment value for the sequence.
- Specify the number of times to repeat each value in the sequence.
- Specify the number of times to repeat the sequence.
- Click the Create Sequence button to generate a new column containing the sequence values. As example, if From is set to 1, To is set to 10, By is set to 2, Repeat each value is set to 1, and Repeat the sequence is set to 2, the new column would contain the values: 1, 3, 5, 7, 9, 1, 3, 5, 7, 9.
Split Column
This option splits values in a column into separate columns based on the values of a grouping column. This option is useful when isolating particular subgroups within a column.
- Select the Column that contains the values to be split into multiple columns
- Select the Grouping Column. A new column will be created for each distinct value of this column.
- Click Create Groups to create the new column(s). As an example, suppose the data table contains six rows of data, where var1 contains the value "A" in the first three rows and the value "B" in the next three rows. Also, suppose var2 contains the values 1,2,3,4,5,6 in rows one through six respectively. If the user were to select var2 as the Column and var1
as the Group Column, then two new columns would be created with three values in each new column. The first new column would contain the values 1,2,3 and would be labeled "A var2". The second new column would contain the values 4,5,6 and would be labeled "B var2".
Stack Columns
This option stacks values in selected columns into a single column. A separate label column is also created containing the
column header for each value from the orignal column.
- Select the Columns to be stacked.
- Specify an optional value for Store labels in. This value will be the column header for the
new label column.
- Specify an optional value for Store values in. This value will be the column header for the
new column conatining the stacked values.
- Click Stack Columns! to create the new column(s).
Bin Column
This option bins data in a column based on a bining scheme specified
by the user. This option is useful when creating levels
for a numeric column to be used for further analysis.
- Select the column to be bined under Bin Column.
- Specify how to construct the bins. The bins are defined by a
sequence of values that define the bin edges. When bining the data
values, a bin will include the left edge and exclude the right edge.
The choices for defining bins are:
- Automatically define bins - This option uses the same
automated algorithm to define bins as is used to construct a
histogram. The algorithm tries to divide the range of the data values
into a set of bins of the equal width.
- Use fixed width bins - With this option, the user
specifies a starting point for the sequence of bins and defines the
bin width. As an example, a starting point of 40 and bin width of 5
will generate a sequence of bins that looks like 40,45,50,55,... Note
that if the starting point specified by the user is below the minimum
value for the column then the algorithm will set the starting point to
the minimum value.
- Use bins defined by cut points: This option allows the
user to specify a sequence of comma delimited values for the bin
edges. This option can be used to create bins of varying widths.
It is also useful when the user wants to define custom levels for a
set of numerical values.
- Click the Calculate button to compute a new column
containing the bin associated with each row of the selected column.
For example, if the bin edges are defined by 40,45,50 and 55, then for
a value of 47.5, a bined value of 45 to 50 will be produced.
Sort Columns
This option sorts numeric or text columns in ascending or descending order.
- Select the columns to be sorted.
- Select the column to be used to define the new ordering for Sort by.
- By default the values will be sorted in ascending order. To get the descending order, check the Descending box.
- Click the Sort columns button to create a new column for each column selected.
Rank Columns
This option ranks columns containing numeric values. Ties are handles
by taking the average of the tied rank values.
- Select the columns to be ranked.
- Click Rank columns button to create a new column of ranks for each column selected.
Sample Columns
This option samples single or multiple columns with or without replacement.
- Select the columns to be sampled.
- Specify the Sample size to be used. This will be
reflected in the number of rows for the new columns added to the data
table.
- Specify the Number of samples to be generated. This will
be reflected in the number of new columns added to the data table.
- By default the sampling is done without replacement. Check the
Sample with replacement option if desired.
- To sample in a multivariate manner where the
rows of the new sample columns match rows in the original data, check
the Sample all columns at one time option. Otherwise, a
separate sample will be taken from each column selected.
- To create an additional new column indicating the row id
number from the original data that was selected for the sample, check
the Save row ids for samples option. The row id column will
appear to the left of each sample column.
- Choose an option for storing the sample results:
- Splitting the values across separate columns will add each sample as a new column to the data table.
- Stacking the values will take each sample and stack them in sequential order. The stacked values will be
added to the data table along with a sample id indicating the original sample to which each corresponding stacked sample value belongs.
- Computing a statistic will calculate the value of a user-specified statistic formulated as a
StatCrunch expression for each
sample. The stacked statistic values will be added to the data table. When formulating the expression, the sample data
should be referred to as "Sample(column_name)". For example, when sampling from a column named Rank, one
would use the expression mean("Sample(Rank)") to compute the mean of each sample.
- Choose the generator you wish to use.
- The preset generator is initialized with two seeds each time StatCrunch loads.
- The default generator is seeded dynamically based on clock time is also an option. One can choose to
fix the seed on this second generator as well using the third seeding option.
- To get a different realization each time use the dynamic seed option.
- To control the realization generated use the fixed seed option.
- Click the Sample Column(s) button to generate the new sample columns.
Simulate Data
This option simulates data from any of the following distributions:
- Select the distribution of interest from the list of distributions .
- Specify the number of rows and columns to be simulated.
- Specify the appropriate parameters for the distribution.
- Choose an option for storing the simulation results:
- Splitting the values across separate columns will add each simulated column to the data table.
- Stacking the values will take each simulated column and stack them in sequential order. The stacked values will be
added to the data table along with a column id indicating the original column to which each corresponding stacked value belongs.
- Computing a statistic will calculate the value of a user-specified statistic formulated as a
StatCrunch expression for each
simulated column. The stacked statistic values will be added to the data table. When formulating the expression, the simulated data
should be referred to by using the name of the distribution being simulated. For example, when simulating a binomial distribution, one
would use the expression mean(Binomial) to compute the mean of each simulated column of data.
- Choose the generator you wish to use.
- The preset generator is initialized with two seeds each time StatCrunch loads.
- The default generator is seeded dynamically based on clock time is also an option. One can choose to
fix the seed on this second generator as well using the third seeding option.
- To get a different realization each time use the dynamic seed option.
- To control the realization generated use the fixed seed option.
- Click the Simulate button to generate the simulation results.
Indicator columns
This option creates a column of indicator values (0 or 1) for each unique value in a selected column.
This may be especially useful for regression analyses.
- Select the column(s) for which indicator columns are to be created.
- Click on Create Indicators.