Comprehensions

load required packages

In [1]:

using DataArrays
using DataFrames
using Base.Dates

comprehensions: easy way to build Arrays

In [2]:

[ii for ii=1:4]

Out[2]:

4-element Array{Int64,1}:
 1
 2
 3
 4

forcing type of individual entries through prepending type declaration

In [3]:

Float64[ii for ii=1:4]

Out[3]:

4-element Array{Float64,1}:
 1.0
 2.0
 3.0
 4.0

similar logic: collect elements of any type in Array of type Any

In [4]:

Any["Hello" 3 4.0 NA]

Out[4]:

1×4 Array{Any,2}:
 "Hello"  3  4.0  NA

comprehensions can also be used to capture more complex iterated output
for example: iteration over sample size or parameter values

In [5]:

[rand(nObs, 2) for nObs in [2, 10]]

Out[5]:

2-element Array{Array{Float64,2},1}:
 [0.938897 0.565; 0.220571 0.595751]                                              
 [0.348172 0.223396; 0.719497 0.744549; … ; 0.561501 0.0500655; 0.914669 0.105726]

with a single index ii, a comprehension cannot directly return an Array{T, 2}: each [1 2] becomes one entry of a one-dimensional Array

In [6]:

[[1 2] for ii=1:4]

Out[6]:

4-element Array{Array{Int64,2},1}:
 [1 2]
 [1 2]
 [1 2]
 [1 2]

Splicing

splicing (the ... operator) successively returns the components of a collection
can be used to paste the elements of a collection into function arguments
allows easy creation of Arrays

applying [ ] to collection only captures whole collection as single entry of an Array

In [7]:

kk = (1, 2, 3, 4)
[kk]

Out[7]:

1-element Array{NTuple{4,Int64},1}:
 (1, 2, 3, 4)

with splicing: each element of the collection gets its own entry within an Array

In [8]:

[kk...]

Out[8]:

4-element Array{Int64,1}:
 1
 2
 3
 4
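
splicing into function arguments, as mentioned at the top of this section, is not shown above; a minimal sketch follows. add3 is a hypothetical helper introduced only for illustration and is not part of the notebook.

```julia
# hypothetical helper taking exactly three positional arguments
add3(a, b, c) = a + b + c

kk = (1, 2, 3)

# splicing passes each element of kk as its own argument
total = add3(kk...)

# also works for functions accepting a variable number of arguments
largest = max(kk...)
```

the number of spliced elements has to match what the function accepts: add3 would fail for a tuple with four entries, while max accepts any number of arguments.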

works out of the box: even for new types

In [9]:

mutable struct foo
    value
end

In [10]:

fooObj = foo(3)

Out[10]:

foo(3)

In [11]:

kk = (fooObj, fooObj, fooObj)
[kk...]

Out[11]:

3-element Array{foo,1}:
 foo(3)
 foo(3)
 foo(3)

for some types, there might be more meaningful ways to vertically store successive values than inside of an Array
vcat: allows combination of objects in specified structure

In [12]:

vcat([1 2], [1 2])

Out[12]:

2×2 Array{Int64,2}:
 1  2
 1  2

vcat also works for variable number of input arguments:

In [13]:

kk = ([1 2], [3 4], [5 6])

Out[13]:

([1 2], [3 4], [5 6])

In [14]:

vcat(kk[1], kk[2], kk[3])

Out[14]:

3×2 Array{Int64,2}:
 1  2
 3  4
 5  6

together with splicing, vcat conveniently transforms tuple of values into concise Array:

In [15]:

vcat(kk...)

Out[15]:

3×2 Array{Int64,2}:
 1  2
 3  4
 5  6

In [16]:

vcat([[1 2] for ii=1:4]...)

Out[16]:

4×2 Array{Int64,2}:
 1  2
 1  2
 1  2
 1  2

[ ] applied to spliced elements does not implicitly call vcat anymore
applied to a collection of Array{Int,2}, this does not result in a two-dimensional Array

In [17]:

kk = [[1 2] for ii=1:4]
[kk...]

Out[17]:

4-element Array{Array{Int64,2},1}:
 [1 2]
 [1 2]
 [1 2]
 [1 2]

alternatively, two-dimensional result could be achieved without splicing through usage of two index variables

In [18]:

[jj for ii=1:4, jj=1:2]

Out[18]:

4×2 Array{Int64,2}:
 1  2
 1  2
 1  2
 1  2
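
the expression inside a two-index comprehension may depend on both indices; a small sketch (tbl is just an illustrative name), where the first index runs over rows and the second over columns:

```julia
# entry (ii, jj) of the result is the product of both indices
tbl = [ii * jj for ii = 1:3, jj = 1:2]

# the shape follows the index ranges: 3 rows, 2 columns
dims = size(tbl)
```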

in general, splicing and vcat work for data structures different to Array
for example, application of splicing and vcat to DataFrames will return a DataFrame again

In [19]:

df = DataFrame()
df[:a] = @data([5, 6, NA])
df[:b] = @data([8, NA, NA])

kk = (df, df)
xx = vcat(kk...)

Out[19]:

	a	b
1	5	8
2	6	NA
3	NA	NA
4	5	8
5	6	NA
6	NA	NA

In [20]:

typeof(xx)

Out[20]:

DataFrames.DataFrame

Iterators

under the hood, comprehensions make use of iterators:

iterators successively return values from a collection
iterators can be defined for any type

for example: column iterator of DataFrames, which returns a tuple with column name and values given as DataArray for each column

In [21]:

df = DataFrame()
df[:a] = @data([5, 6, NA])
df[:b] = @data([8, NA, NA])

[col for col in eachcol(df)]

Out[21]:

2-element Array{Tuple{Symbol,DataArrays.DataArray{Int64,1}},1}:
 (:a, [5, 6, NA]) 
 (:b, [8, NA, NA])

DataFrame iterator returns tuple, so that values only (without column name) are obtained through indexing

In [22]:

[col[2] for col in eachcol(df)]

Out[22]:

2-element Array{DataArrays.DataArray{Int64,1},1}:
 [5, 6, NA] 
 [8, NA, NA]
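
to illustrate that iterators can be defined for custom types as well, here is a minimal sketch. it assumes Julia 1.0 or later, where the protocol consists of the single function iterate; the Julia 0.6 session used in this notebook relies on start, next and done instead. Countdown is a hypothetical type made up for this example.

```julia
# hypothetical type: yields n, n-1, ..., 1
struct Countdown
    n::Int
end

# iteration protocol (Julia >= 1.0): return (value, next state), or nothing when done
Base.iterate(c::Countdown, state = c.n) = state < 1 ? nothing : (state, state - 1)

# length lets comprehensions and collect allocate the result up front
Base.length(c::Countdown) = max(c.n, 0)

# with these two methods the type can be used inside a comprehension
vals = [ii for ii in Countdown(3)]
```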

two applications of iterators come to mind immediately:

iteratively manipulating entries of a type
building a new object by iteratively using an existing object

Iteratively manipulating entries

example: iteratively manipulating columns of DataFrame

as seen above, values need to be referenced within column iterator tuple with subindex 2
setting first entry of each column to 10:

In [23]:

for col in eachcol(df)
    col[2][1] = 10
end
df

Out[23]:

	a	b
1	10	10
2	6	NA
3	NA	NA

let's try multiplying each column by 10:

In [24]:

try
    for col in eachcol(df)
        col[2] = col[2].*10
    end
catch e
    show(e)
end

MethodError(setindex!, ((:a, [10, 6, NA]), [100, 60, NA], 2), 0x0000000000005549)

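this fails because the iterator returns tuples, which are immutable: col[2] cannot be rebound to a new object
the correct way is to overwrite the column values in place: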

In [25]:

for col in eachcol(df)
    col[2][:] = col[2].*10
end
df

Out[25]:

	a	b
1	100	100
2	60	NA
3	NA	NA

applying similar logic to the entries of an Array{Int, 1} fails, as the iterator returns the Int values themselves, which do not support indexed assignment:

In [26]:

kk = [1, 2, 3, 4]
try
    for entry in kk
        entry[1] = entry[1]*5
    end
catch e
    show(e)
end
kk

MethodError(setindex!, (1, 5, 1), 0x000000000000554a)

Out[26]:

4-element Array{Int64,1}:
 1
 2
 3
 4
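
one way to actually modify the entries is to iterate over the indices instead of the values and to write back into the Array; a minimal sketch using only Base functions:

```julia
kk = [1, 2, 3, 4]

# eachindex returns the valid indices of kk; assignment goes through kk itself
for ii in eachindex(kk)
    kk[ii] = kk[ii] * 5
end
```

in contrast to the loop over values, kk is changed afterwards, because kk[ii] = ... writes into the Array rather than into a local copy of the entry.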

Creating new objects by iteratively manipulating existing objects

iterator protocols make element-wise data manipulation easy
combined with comprehension, this allows for easy creation of new objects
for example: creating Array of squared entries

In [27]:

kk = [1 2 3 4]
kk2 = [ii.^2 for ii in kk]

Out[27]:

1×4 Array{Int64,2}:
 1  4  9  16

again, there might be more meaningful ways to combine the individual parts than an Array as we get it from comprehension

In [28]:

df = DataFrame()
df[:a] = @data([5, 6, NA])
df[:b] = @data([8, NA, NA])
df

Out[28]:

	a	b
1	5	8
2	6	NA
3	NA	NA

In [29]:

kk = [col[2].*2 for col in eachcol(df)]

Out[29]:

2-element Array{DataArrays.DataArray{Int64,1},1}:
 [10, 12, NA]
 [16, NA, NA]

as we iterate over columns, results should be combined horizontally
splicing into [ ] is not sufficient here (it only returns the same Array of columns again), and vcat would stack the columns vertically

we need hcat instead:

In [30]:

hcat([col[2].*2 for col in eachcol(df)]...)

Out[30]:

3×2 DataArrays.DataArray{Int64,2}:
 10    16  
 12      NA
   NA    NA
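
the difference between vertical and horizontal combination can also be seen with plain Arrays; a small sketch, where cols is a stand-in for the manipulated columns (an assumption made to keep the example independent of DataFrames):

```julia
# two "columns" of length 3, built through a comprehension
cols = [[1, 2, 3] .* k for k in 1:2]

# vcat stacks the columns on top of each other: one vector of length 6
stacked = vcat(cols...)

# hcat puts the columns side by side: a 3×2 matrix
sideBySide = hcat(cols...)
```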

instead of manually combining the manipulated values from an iterator each time, function map can return a suitable default data structure directly

Map

through multiple dispatch, the output of map can be customized to the iterator type used
for example: multiplication of each DataFrame column could be done in two different ways
first way: iterating over entries of an Array (which contains the column names) will return an Array

In [31]:

df = DataFrame(a = [1, 2, 3], b = [4, 5, 6])
map(nam -> df[nam].*2, names(df))

Out[31]:

2-element Array{DataArrays.DataArray{Int64,1},1}:
 [2, 4, 6]  
 [8, 10, 12]

second way: using the map method for the DataFrame column iterator, which passes the column values (without the name) to the function and returns a DataFrame

In [32]:

df2 = map(col -> col.*2, eachcol(df))
df2

Out[32]:

	a	b
1	2	8
2	4	10
3	6	12

map can also be applied to two collections, passing one element of each to the function

In [33]:

vals1 = [10 20]
vals2 = [40 1]
map(+, vals1, vals2)

Out[33]:

1×2 Array{Int64,2}:
 50  21

Reduce

using function reduce, individual components of a collection can be aggregated
through multiple dispatch, reduce can have different implementations for each type
using map and reduce together, individual entries of iterable collections can be manipulated and aggregated to a single result

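example: calculating row means (the reduction below is only valid for two columns, because the division by the number of columns is applied in every reduction step)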

In [34]:

df

Out[34]:

	a	b
1	1	4
2	2	5
3	3	6

In [35]:

meanDf = reduce((x,y) -> (x[2].+y[2])./size(df, 2), eachcol(df))

Out[35]:

3-element DataArrays.DataArray{Float64,1}:
 2.5
 3.5
 4.5
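
the reduction above divides inside the reducing function, which equals the mean only for exactly two columns. a sketch for any number of columns: sum first, divide once at the end. plain vectors are used as a stand-in for DataFrame columns (an assumption to keep the example self-contained).

```julia
# three columns stored as plain vectors
cols = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# element-wise sum of all columns, then a single division by the number of columns
rowMeans = reduce((x, y) -> x .+ y, cols) ./ length(cols)
```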

example: calculating row sum with weighted columns

using map to calculate weighted columns
using reduce to sum up individual weighted columns

In [36]:

df = DataFrame(a = [1, 2, 3, 4], b = [4, 5, 6, 7], c = [2, 4, 8, 10])

Out[36]:

	a	b	c
1	1	4	2
2	2	5	4
3	3	6	8
4	4	7	10

getting weighted columns:

In [37]:

wgts = [0.4 0.2 0.4]
kk = map((x, y) -> x.*y[2], wgts, eachcol(df))

Out[37]:

3-element Array{DataArrays.DataArray{Float64,1},1}:
 [0.4, 0.8, 1.2, 1.6]
 [0.8, 1.0, 1.2, 1.4]
 [0.8, 1.6, 3.2, 4.0]

cross-check of the first weighted column:

In [38]:

wgts[1] * [1, 2, 3, 4]

Out[38]:

4-element Array{Float64,1}:
 0.4
 0.8
 1.2
 1.6

aggregation with reduce

In [39]:

reduce((x, y) -> (x .+ y), map((x, y) -> x.*y[2], wgts, eachcol(df)))

Out[39]:

4-element DataArrays.DataArray{Float64,1}:
 2.0
 3.4
 5.6
 7.0
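
the weighted row sum is the same as a matrix-vector product, which gives a simple cross-check. the sketch below uses plain vectors instead of DataFrame columns and stores the weights as a vector rather than a row matrix (both are assumptions made to keep it self-contained).

```julia
wgts = [0.4, 0.2, 0.4]
cols = [[1, 2, 3, 4], [4, 5, 6, 7], [2, 4, 8, 10]]

# map scales each column by its weight, reduce adds the scaled columns
viaMapReduce = reduce((x, y) -> x .+ y, map((w, col) -> w .* col, wgts, cols))

# same computation as a matrix-vector product: columns side by side times weights
viaProduct = hcat(cols...) * wgts
```

both results agree up to floating point rounding, so comparisons should use isapprox rather than ==.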

Session info

In [40]:

versioninfo()

Julia Version 0.6.0
Commit 9036443 (2017-06-19 13:05 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, skylake)

In [41]:

Pkg.status()

172 required packages:
 - AbstractFFTs                  0.2.0
 - Atom                          0.6.1
 - AutoGrad                      0.0.7
 - AutoHashEquals                0.1.1
 - AxisAlgorithms                0.1.6
 - AxisArrays                    0.1.4
 - BenchmarkTools                0.0.8
 - Blink                         0.5.3
 - Blosc                         0.3.0
 - BufferedStreams               0.3.3
 - BusinessDays                  0.7.1
 - CSV                           0.1.4
 - Calculus                      0.2.2
 - CatIndices                    0.0.2
 - CategoricalArrays             0.1.6
 - Clustering                    0.8.0
 - CodeTools                     0.4.6
 - Codecs                        0.3.0
 - ColorTypes                    0.5.2
 - ColorVectorSpace              0.4.4
 - Colors                        0.7.4
 - Combinatorics                 0.4.1
 - Compat                        0.28.0
 - Compose                       0.5.3
 - ComputationalResources        0.0.2
 - Conda                         0.5.3
 - Contour                       0.3.0
 - Convex                        0.5.0
 - CoordinateTransformations     0.4.1
 - CoupledFields                 0.0.1
 - CustomUnitRanges              0.0.4
 - DBAPI                         0.1.0
 - DSP                           0.3.2
 - Dagger                        0.2.0
 - DataArrays                    0.6.2
 - DataFrames                    0.10.0
 - DataStreams                   0.1.3
 - DataStructures                0.6.0
 - DecFP                         0.3.0
 - DecisionTree                  0.6.1
 - DiffBase                      0.2.0
 - Distances                     0.4.1
 - DistributedArrays             0.4.0
 - Distributions                 0.14.2
 - DualNumbers                   0.3.0
 - FFTViews                      0.0.2
 - FFTW                          0.0.3
 - FileIO                        0.5.1
 - FixedPointNumbers             0.3.9
 - Formatting                    0.2.1
 - ForwardDiff                   0.4.2
 - GLM                           0.7.0
 - GR                            0.23.0
 - GZip                          0.3.0
 - Gadfly                        0.6.3
 - Glob                          1.1.1
 - Graphics                      0.2.0
 - HDF5                          0.8.2
 - HTTPClient                    0.2.1
 - Hexagons                      0.1.0
 - Hiccup                        0.1.1
 - HttpCommon                    0.2.7
 - HttpParser                    0.3.0
 - HttpServer                    0.2.0
 - HypothesisTests               0.5.1
 - IJulia                        1.5.1
 - IdentityRanges                0.0.1
 - ImageAxes                     0.3.1
 - ImageCore                     0.4.0
 - ImageFiltering                0.1.4
 - ImageMetadata                 0.2.3
 - ImageTransformations          0.3.1
 - Images                        0.11.0
 - IndexedTables                 0.2.1
 - IndirectArrays                0.1.1
 - Interact                      0.4.5
 - Interpolations                0.6.2
 - IntervalSets                  0.1.1
 - IterTools                     0.1.0
 - Iterators                     0.3.1
 - JDBC                          0.2.0
 - JLD                           0.6.11
 - JSON                          0.13.0
 - JavaCall                      0.5.1
 - JuMP                          0.17.1
 - JuliaWebAPI                   0.3.1
 - Juno                          0.3.0
 - KernelDensity                 0.3.2
 - Knet                          0.8.3
 - LNR                           0.0.2
 - LaTeXStrings                  0.2.1
 - Lazy                          0.11.7
 - LegacyStrings                 0.2.2
 - Libz                          0.2.4
 - LightGraphs                   0.9.4
 - LightXML                      0.5.0
 - LineSearches                  0.1.5
 - Loess                         0.3.0
 - Logging                       0.3.1
 - MLBase                        0.7.0
 - MNIST                         0.0.2
 - MacroTools                    0.3.7
 - MappedArrays                  0.0.7
 - MathProgBase                  0.6.4
 - MbedTLS                       0.4.5
 - Measures                      0.1.0
 - Media                         0.3.0
 - Mustache                      0.1.4
 - Mux                           0.2.3
 - NaNMath                       0.2.6
 - NamedArrays                   0.6.1
 - NamedTuples                   4.0.0
 - NearestNeighbors              0.3.0
 - Nettle                        0.3.0
 - NullableArrays                0.1.1
 - ODBC                          0.5.2
 - OffsetArrays                  0.3.0
 - Optim                         0.7.8
 - PDMats                        0.7.0
 - PaddedViews                   0.1.0
 - Parameters                    0.7.2
 - ParserCombinator              1.7.11
 - PlotRecipes                   0.2.0
 - PlotlyJS                      0.6.4
 - Plots                         0.12.3+            master
 - Polynomials                   0.1.5
 - PooledArrays                  0.1.1
 - PositiveFactorizations        0.0.4
 - Primes                        0.1.3
 - ProtoBuf                      0.4.0
 - PyCall                        1.14.0
 - PyPlot                        2.3.2
 - QuadGK                        0.1.2
 - QuantEcon                     0.12.1
 - Query                         0.6.0
 - RCall                         0.7.3
 - RDatasets                     0.2.0
 - RangeArrays                   0.2.0
 - Ratios                        0.1.0
 - Reactive                      0.5.2
 - Reexport                      0.0.3
 - Requests                      0.5.0
 - Requires                      0.4.3
 - ReverseDiffSparse             0.7.3
 - Rmath                         0.1.7
 - Roots                         0.4.0
 - Rotations                     0.5.0
 - Rsvg                          0.1.0
 - SCS                           0.3.3
 - SHA                           0.3.3
 - SIUnits                       0.1.0
 - ScikitLearnBase               0.3.0
 - ShowItLikeYouBuildIt          0.0.1
 - Showoff                       0.1.1
 - SimpleTraits                  0.5.0
 - SortingAlgorithms             0.1.1
 - SpecialFunctions              0.2.0
 - StatPlots                     0.4.2
 - StaticArrays                  0.6.1
 - StatsBase                     0.17.0
 - StatsFuns                     0.5.0
 - TexExtensions                 0.1.0
 - TextParse                     0.1.6
 - TiledIteration                0.0.2
 - TimeSeries                    0.10.0
 - Tokenize                      0.1.8
 - URIParser                     0.1.8
 - UnicodePlots                  0.2.5
 - WeakRefStrings                0.2.0
 - WebSockets                    0.2.3
 - WoodburyMatrices              0.2.2
 - ZMQ                           0.4.3
17 additional packages:
 - BaseTestNext                  0.2.2
 - BinDeps                       0.6.0
 - Cairo                         0.3.1
 - DataValues                    0.2.0
 - DocStringExtensions           0.4.0
 - Documenter                    0.11.2
 - DynAssMgmt                    0.0.0-             master (unregistered)
 - EconDatasets                  0.0.2+             master
 - GeometryTypes                 0.4.2
 - Gtk                           0.13.0
 - IterableTables                0.4.2
 - LibCURL                       0.2.2
 - NetworkLayout                 0.1.1
 - PlotThemes                    0.1.4
 - PlotUtils                     0.4.3
 - RData                         0.1.0
 - RecipesBase                   0.2.2
