Data Help

Loading data

Data can be loaded into StatCrunch from one of four ways:

Import a file from a local machine or network.
Type in the URL of a file on the WWW.
Pasting data directly into StatCrunch.
Choose a sample data set.

Load Data From File or URL

To load data from a file, select this option from the Load Data submenu of the Data menu. StatCrunch can load text and Excel files (must end with a .xls extension) from a local computer or from a WWW address.

To load a file from a local machine, either type the full path to the file in the File input box or click the browse button to locate the file on the local system.
To load a file from a WWW address, type in the Internet location in the WWW address input box. (e.g., http://www.test123.com/mydata.xls)
If the first line of the file contains column names, make sure the Use first line as variable names option is checked.
When loading an ASCII text file, specify the observation delimiter to be used. Note that this step may be ignored if loading an Excel file. The delimiter options are whitespace (any whitespace character such as a space or tab), tab, comma (for .csv files) and semicolon.
Click the Load Data button , and after a few moments, the data will appear in the data table.

From Paste

Data may also be pasted into StatCrunch. This option is not as reliable across platforms, so loading data from a file is preferred. To load data from paste, select this option from the Load Data submenu of the Data menu.

Copy the data to be loaded into StatCrunch, and then paste it into the text box. Data may also be entered into the box manually, and a formatting header may be added to the top of the text input if necessary. On some platforms, it is impossible to insert tabs when editing tab-delimited data that has been pasted into the text box.
If the first line of the file contains variable names, make sure the Use first line as variable names option is checked.
Specify the observation delimiter to be used. The delimiter options are whitespace (any whitespace character such as a space or tab), tab, comma (for .csv files) and semicolon. When pasting from Excel or other spreadsheets, choose the tab delimiter.
Click the Load Data button, and after a few moments, the data will appear in the data table.

Sample Data

A number of sample data sets are provided for users to experiment with in StatCrunch. Some data sets are not applicable to certain analysis or graphics. Be careful to test an analysis/graphic with appropriate data. To see a complete listing of sample sets, click the Sample Data option under the Load Data submenu of the Data menu.

Formatting Data

Text data can be specially formatted for StatCrunch by adding a header to the data before importing it. This is typically not necessary when loading data from spreadsheets. The header can be used to provide additional information about data, to specify variable names and to specify the number of observations per column. An example appears below.

#Test Data
varnames = Control Treatment
nobs = 23 21

All lines that begin with # are ignored when the information is loaded into StatCrunch. these lines can be used as comments that provide general information about the data set. The line beginning with varnames = is used to specify the names of the variables, and the line beginning with nobs = specifies the number of observations per variable. If a data set has the same number of observations for each variable, the nobs statement can be ignored. This option is only useful when there is an unequal number of observations per column. In the example above, the first column would contain 23 observations and the second column would contain 21 observations. Also, most methods for loading data have the option of using the first uncommented row of the data for variable names as a default, so that in many cases the varnames = statement may also be omitted.

Save Data

Saves the data in the data table to your My Data folder.

Specify a name for the data file.
Specify whether or not to overwrite an existing file.
Specify the delimiter to use to separate values.
Click Save to write the data file.

Data Table

The data table refers to the grid of cells used for entering and editing data in StatCrunch. The top row of the data table is for column names, and any data values entered into this row will not be used in subsequent analyses. The left most column serves as row ids for the data table. A row of data values may be highlighted by clicking inside a row id cell. Clicking inside a highlighted row id cell will deselect the row. The observations in highlighted rows will also be highlighted in most graphics that StatCrunch produces. All highlighted rows in the data table may be deselected by clicking on the Row label in the top row of the row id column.

The data table is equipped with horizontal and vertical scroll bars at the top and bottom of the data table, respectively, for quickly moving through the table. The Page down and Page up keys can also be used to quickly scroll through the table. The number of cells presented in the table may not be increased by using these tools. This can be done by using the keyboard actions (described below) or by resizing the main window that contains the data table.

The cell that is highlighted receives keyboard input. StatCrunch will first try to make a double precision number out of data entered, but if this is not possible, StatCrunch will store the input as a string. If "43StatCrunch" is entered, StatCrunch will read this as "43". In other words, strings that start with numbers may be truncated after the number. StatCrunch also uses scientific notation so that "0.000001" may appear as "1E-6". All numbers are displayed with floating point precision, but they are stored as double precision.

One should think of the cursor as occupying the position to the far right in the highlighted cell. At this time, the cursor may not be manually positioned inside the highlighted cell. The backspace key can be used to delete values entered in a one by one fashion beginning with the last value entered. The delete key can be used to remove all values in the highlighted cell.

When a cell loses the highlight, the values entered in the cell may be reformatted for better display. In the case of numbers for example, the value "0.00000000012345" may be represented as "1.2345E-10". The width of the column will adjust if the information entered exceeds the column's width. See Column Widths below to see how to change this behavior to accommodate fixed width columns. When working with large data sets this auto-sizing feature should be turned off for faster response times.

To change the cell that is highlighted, one may click the mouse inside a new cell or by using the tab, return or arrow keys. If the user tries to move to a cell that is not fully visible, the data table will shift so that the cell becomes fully visible. The tab key moves the highlight one cell to the right. The return key moves the highlight one cell down. The up, down, right and left arrow keys shift the highlighted cell in the direction of the arrow. All of these keys can be used to increase the number of cells displayed in the table. For example, if the data table currently displays 8 columns and the user presses the tab key while in a highlighted cell in the 8th column, StatCrunch will then display the 9th column with the highlighted cell in the 9th column.

Export Data

The data stored in the data table can be exported to a new window for downloading to the user's local machine by selecting the Export data option under the Data Table submenu of the Data menu. The data can be exported in either a text format or as an HTML table. For the text option, one can specify the observation delimiter to be used. The text option should be used to export data in a format that can be reloaded in the future. The HTML option should be used to export the data for publication purposes (web pages, pasting into word processing programs, etc...). When the data appears in the new window, use the File menu on the browser to save or print it.

Font

Choose the font and size from the drop downs provided and then click Update font!.

Column Widths

The width of a column can be adjusted by clicking on the right edge of the column and dragging the mouse. One can also select the Column Widths option under the Data menu to set preferences for the default widths of columns.

Specify the default width (in pixels) for columns. By default, a column's width is set to 65 pixels.
By default the column width will be automatically adjusted if data entered exceeds the default width. To turn off this auto resizing feature, check the Auto size columns to fit values option. If the auto-sizing is turned off, only the first few values in a cell that fit within the default width will be displayed and then the ">" character is placed at the end of the cell to indicate there is more information that can not be displayed. When working with large data sets this auto-sizing feature should be turned off for faster response times when doing data manipulations.
Check the Apply defaults to all option to override any custom sizing that has been done using the mouse.

Delete Columns

Simply select the columns to be deleted and click Delete. Please note that once a column is deleted it can not be recovered.

Move Columns

Select the columns to be moved. The columns will be relocated from left to right in the order that they are selected. Choose the column that the selected columns will be moved to the left of, or select the Move after last column option.

Clear

This feature will create a new empty data table. All analysis results and graphics appearing in separate windows will not be affected.

Row Selection

Rows in the data table may be highlighted by clicking the mouse inside the cell containing the row id number. Clicking inside a highlighted row id cell will deselect the row. Highlighted rows will also be highlighted in most graphics that StatCrunch produces. All rows in the data table may be deselected by clicking on the Row label in the top row of the row id column. In addition to using the mouse in this manner, the following options are available for selecting and deselecting rows.

Interactive Tools

This option can be used to create sliders and category selectors that allow a great deal of flexibility for data selection.

Select the columns containing numeric values for which sliders are to be generated.
Select the columns containing values (text or numeric) for which category selectors are to be generated.
Click the Create Tool button to generate the selection tools. The appropriate numeric slide selectors and category selectors will then be generated, and all rows in the data table will be highlighted. For numeric sliders, the handles on each end of the slider can be adjusted to control the range of values for the column that are highlighted. For category selectors, the categories for the column to be highlighted can be chosen from a list of distinct values for the column. The selection behavior can be changed from selecting rows that satisfy all selectors (equivalent to intersecting the conditions set for all selectors) to selecting rows that satisfy any selectors (equivalent to the union of the conditions set for all selectors). Selectors may be very useful when used in conjunction with interactive graphic in exploratory analysis.

Select Where

This feature will highlight rows in the data table based on a boolean expression using ANSI SQL syntax entered by the user. See the WHERE section for more details on boolean expressions.

Specify a boolean expression is entered that is to be evaluated for each row in the data set.
Click the Select Rows! button to highlight the appropriate rows. If the expression is true, then the row is highlighted. If the expression is false, no action is taken, so if the row was already highlighted, it will still be highlighted after the computation.

Deselect All

This option under the Data menu deselects all rows highlighted in the data table. All highlighted rows may also be deselected by clicking on the Row label in the top row of the row id column.

Create Column(s)

StatCrunch offers a number of procedures for adding new columns from scratch or by performing a variety of data manipulations on existing columns.

Transform data

This method is now deprecated. See compute expressions below.

Compute Expression

StatCrunch offers a great deal of flexibility for doing data transformations with this option. One may type an expression directly into the Transformation box. The following are examples of mathematical expressions that can be entered for the Hotdog Data. See Expressions for a listing of operators and functions that can be used.

Calories + Sodium: computes the sum of the Calories and Sodium columns
Calories - Sodium: computes the difference of the Calories and Sodium columns
Calories*Sodium: computes the product of the Calories and Sodium columns
Calories/Sodium: computes the ratio of the Calories and Sodium columns
log(Calories): computes the natural logarithm of the Calories column
(Calories - mean(calories))/std(calories): computes the standard scores of the Calories column

One may also construct an expression from a menu of common transformations. The typing option puts a great deal of power in the hand of the user while the menu option provides direction for those users less familiar with standard mathematical expressions. The list of common transformations includes: log(Y), sqrt(Y), 1/Y, Y-X, Y/X, a+b*Y, Y**b, sin(Y), arcsin(Y), log10(Y+a), etc.. In each case, the necessary columns (Y and/or X) must be selected and the values for constants (a and/or b) must be specified. When the Set Transformation button is clicked, the proper mathematical syntax will be placed in the Transformation input box. Click the Transform button to compute the values for the new transformed column. This column will be added to the right of any existing columns.

Sequence Data

With this option, StatCrunch offers the user the ability to create a new column of patterned sequence data.

In the From input box, specify the numeric starting value for the sequence.
In the To input box, specify the numeric ending value for the sequence.
In the By input box, specify the numeric increment value for the sequence.
Specify the number of times to repeat each value in the sequence.
Specify the number of times to repeat the sequence.
Click the Create Sequence button to generate a new column containing the sequence values. As example, if From is set to 1, To is set to 10, By is set to 2, Repeat each value is set to 1, and Repeat the sequence is set to 2, the new column would contain the values: 1, 3, 5, 7, 9, 1, 3, 5, 7, 9.

Split Column

This option splits values in a column into separate columns based on the values of a grouping column. This option is useful when isolating particular subgroups within a column.

Select the Column that contains the values to be split into multiple columns
Select the Grouping Column. A new column will be created for each distinct value of this column.
Click Create Groups to create the new column(s). As an example, suppose the data table contains six rows of data, where var1 contains the value "A" in the first three rows and the value "B" in the next three rows. Also, suppose var2 contains the values 1,2,3,4,5,6 in rows one through six respectively. If the user were to select var2 as the Column and var1 as the Group Column, then two new columns would be created with three values in each new column. The first new column would contain the values 1,2,3 and would be labeled "A var2". The second new column would contain the values 4,5,6 and would be labeled "B var2".

Stack Columns

This option stacks values in selected columns into a single column. A separate label column is also created containing the column header for each value from the orignal column.

Select the Columns to be stacked.
Specify an optional value for Store labels in. This value will be the column header for the new label column.
Specify an optional value for Store values in. This value will be the column header for the new column conatining the stacked values.
Click Stack Columns! to create the new column(s).

Bin Column

This option bins data in a column based on a bining scheme specified by the user. This option is useful when creating levels for a numeric column to be used for further analysis.

Select the column to be bined under Bin Column.
Specify how to construct the bins. The bins are defined by a sequence of values that define the bin edges. When bining the data values, a bin will include the left edge and exclude the right edge. The choices for defining bins are:
1. Automatically define bins - This option uses the same automated algorithm to define bins as is used to construct a histogram. The algorithm tries to divide the range of the data values into a set of bins of the equal width.
2. Use fixed width bins - With this option, the user specifies a starting point for the sequence of bins and defines the bin width. As an example, a starting point of 40 and bin width of 5 will generate a sequence of bins that looks like 40,45,50,55,... Note that if the starting point specified by the user is below the minimum value for the column then the algorithm will set the starting point to the minimum value.
3. Use bins defined by cut points: This option allows the user to specify a sequence of comma delimited values for the bin edges. This option can be used to create bins of varying widths. It is also useful when the user wants to define custom levels for a set of numerical values.
Click the Calculate button to compute a new column containing the bin associated with each row of the selected column. For example, if the bin edges are defined by 40,45,50 and 55, then for a value of 47.5, a bined value of 45 to 50 will be produced.

Sort Columns

This option sorts numeric or text columns in ascending or descending order.

Select the columns to be sorted.
Select the column to be used to define the new ordering for Sort by.
By default the values will be sorted in ascending order. To get the descending order, check the Descending box.
Click the Sort columns button to create a new column for each column selected.

Rank Columns

This option ranks columns containing numeric values. Ties are handles by taking the average of the tied rank values.

Select the columns to be ranked.
Click Rank columns button to create a new column of ranks for each column selected.

Sample Columns

This option samples single or multiple columns with or without replacement.

Select the columns to be sampled.
Specify the Sample size to be used. This will be reflected in the number of rows for the new columns added to the data table.
Specify the Number of samples to be generated. This will be reflected in the number of new columns added to the data table.
By default the sampling is done without replacement. Check the Sample with replacement option if desired.
To sample in a multivariate manner where the rows of the new sample columns match rows in the original data, check the Sample all columns at one time option. Otherwise, a separate sample will be taken from each column selected.
To create an additional new column indicating the row id number from the original data that was selected for the sample, check the Save row ids for samples option. The row id column will appear to the left of each sample column.
Choose an option for storing the sample results:
- Splitting the values across separate columns will add each sample as a new column to the data table.
- Stacking the values will take each sample and stack them in sequential order. The stacked values will be added to the data table along with a sample id indicating the original sample to which each corresponding stacked sample value belongs.
- Computing a statistic will calculate the value of a user-specified statistic formulated as a StatCrunch expression for each sample. The stacked statistic values will be added to the data table. When formulating the expression, the sample data should be referred to as "Sample(column_name)". For example, when sampling from a column named Rank, one would use the expression mean("Sample(Rank)") to compute the mean of each sample.
Choose the generator you wish to use.
- The preset generator is initialized with two seeds each time StatCrunch loads.
- The default generator is seeded dynamically based on clock time is also an option. One can choose to fix the seed on this second generator as well using the third seeding option.
- To get a different realization each time use the dynamic seed option.
- To control the realization generated use the fixed seed option.
Click the Sample Column(s) button to generate the new sample columns.

Simulate Data

This option simulates data from any of the following distributions:

Bernoulli http://en.wikipedia.org/wiki/Bernoulli_distribution
Beta http://en.wikipedia.org/wiki/Beta_distribution
Binomial http://en.wikipedia.org/wiki/Binomial_distribution
Cauchy http://en.wikipedia.org/wiki/Cauchy_distribution located at 0 with scale 1
Chi-Square http://en.wikipedia.org/wiki/Chi-square_distribution
Exponential http://en.wikipedia.org/wiki/Exponential_distribution
F http://en.wikipedia.org/wiki/F_distribution
Gamma http://en.wikipedia.org/wiki/Gamma_distribution
Normal http://en.wikipedia.org/wiki/Normal_distribution
Poisson http://en.wikipedia.org/wiki/Poisson_distribution
T http://en.wikipedia.org/wiki/T_distribution
Uniform http://en.wikipedia.org/wiki/Uniform_distribution_%28continuous%29
Weibull http://en.wikipedia.org/wiki/Weibull_distribution with a=k and b = l^k

Select the distribution of interest from the list of distributions .
Specify the number of rows and columns to be simulated.
Specify the appropriate parameters for the distribution.
Choose an option for storing the simulation results:
- Splitting the values across separate columns will add each simulated column to the data table.
- Stacking the values will take each simulated column and stack them in sequential order. The stacked values will be added to the data table along with a column id indicating the original column to which each corresponding stacked value belongs.
- Computing a statistic will calculate the value of a user-specified statistic formulated as a StatCrunch expression for each simulated column. The stacked statistic values will be added to the data table. When formulating the expression, the simulated data should be referred to by using the name of the distribution being simulated. For example, when simulating a binomial distribution, one would use the expression mean(Binomial) to compute the mean of each simulated column of data.
Choose the generator you wish to use.
- The preset generator is initialized with two seeds each time StatCrunch loads.
- The default generator is seeded dynamically based on clock time is also an option. One can choose to fix the seed on this second generator as well using the third seeding option.
- To get a different realization each time use the dynamic seed option.
- To control the realization generated use the fixed seed option.
Click the Simulate button to generate the simulation results.

Indicator columns

This option creates a column of indicator values (0 or 1) for each unique value in a selected column. This may be especially useful for regression analyses.

Select the column(s) for which indicator columns are to be created.
Click on Create Indicators.