Comprehensions

  • load required packages
In [1]:
using DataArrays
using DataFrames
using Base.Dates
  • comprehensions: easy way to build Arrays
In [2]:
[ii for ii=1:4]
Out[2]:
4-element Array{Int64,1}:
 1
 2
 3
 4
  • forcing type of individual entries through prepending type declaration
In [3]:
Float64[ii for ii=1:4]
Out[3]:
4-element Array{Float64,1}:
 1.0
 2.0
 3.0
 4.0
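As a minimal sketch of what the type prefix changes (the variable names are illustrative and not part of the notebook), the same values can be stored with an inferred or a forced element type:

```julia
# Illustrative names; uses only Base.
untyped = [ii^2 for ii = 1:4]        # element type inferred from the values
typed = Float64[ii^2 for ii = 1:4]   # element type forced to Float64

# The values compare equal, but they are stored with different element types.
println(eltype(untyped), " vs ", eltype(typed))
println(untyped == typed)
```

The prefix is mostly useful when later code expects a specific element type, for example when floating point values will be written into the Array afterwards.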
  • similar logic: collect elements of any type in Array of type Any
In [4]:
Any["Hello" 3 4.0 NA]
Out[4]:
1×4 Array{Any,2}:
 "Hello"  3  4.0  NA
  • comprehensions also can be used to capture more complex iterated output
  • for example: iteration over sample size or parameter values
In [5]:
[rand(nObs, 2) for nObs in [2, 10]]
Out[5]:
2-element Array{Array{Float64,2},1}:
 [0.938897 0.565; 0.220571 0.595751]                                              
 [0.348172 0.223396; 0.719497 0.744549; … ; 0.561501 0.0500655; 0.914669 0.105726]
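A hedged sketch of the same idea with one summary value per sample size; `sampleSizes` and `sampleMeans` are hypothetical names introduced only for this illustration:

```julia
# Hypothetical example: one summary value per sample size.
sampleSizes = [2, 10, 100]

# Each entry is the overall mean of an nObs×2 matrix of uniform draws.
sampleMeans = [sum(rand(nObs, 2)) / (2 * nObs) for nObs in sampleSizes]

# One number per sample size, so the result is a plain vector.
println(length(sampleMeans))
println(eltype(sampleMeans))
```

Because each iteration returns a scalar here, the comprehension yields a flat vector instead of an Array of matrices.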
  • with a single index ii over a range, a comprehension returns a one-dimensional Array: row vectors are stored as separate entries instead of forming an Array{T, 2}
In [6]:
[[1 2] for ii=1:4]
Out[6]:
4-element Array{Array{Int64,2},1}:
 [1 2]
 [1 2]
 [1 2]
 [1 2]

Splicing

  • successively returning components of collection
  • could be used to paste elements of a collection into function arguments
  • allows easy creation of Arrays
  • wrapping a collection in [ ] stores the whole collection as a single entry of an Array
In [7]:
kk = (1, 2, 3, 4)
[kk]
Out[7]:
1-element Array{NTuple{4,Int64},1}:
 (1, 2, 3, 4)
  • with splicing: each element of the collection gets its own entry within an Array
In [8]:
[kk...]
Out[8]:
4-element Array{Int64,1}:
 1
 2
 3
 4
  • works out of the box: even for new types
In [9]:
type foo
    value
end
In [10]:
fooObj = foo(3)
Out[10]:
foo(3)
In [11]:
kk = (fooObj, fooObj, fooObj)
[kk...]
Out[11]:
3-element Array{foo,1}:
 foo(3)
 foo(3)
 foo(3)
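Splicing can also paste the elements of a collection into function arguments, as mentioned at the start of this section. A minimal sketch, where `addThree` is a hypothetical helper defined only for this illustration:

```julia
# Hypothetical helper, defined only for this illustration.
addThree(a, b, c) = a + b + c

args = (1, 2, 3)

# args... expands the tuple into three separate positional arguments,
# so this call is equivalent to addThree(1, 2, 3).
total = addThree(args...)
println(total)
```

The number of spliced elements has to match a method of the called function; a tuple of four values would raise a MethodError for this helper.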
  • for some types, there are more meaningful ways to stack successive values vertically than as separate entries of an Array
  • vcat: allows combination of objects in specified structure
In [12]:
vcat([1 2], [1 2])
Out[12]:
2×2 Array{Int64,2}:
 1  2
 1  2
  • vcat also works for variable number of input arguments:
In [13]:
kk = ([1 2], [3 4], [5 6])
Out[13]:
([1 2], [3 4], [5 6])
In [14]:
vcat(kk[1], kk[2], kk[3])
Out[14]:
3×2 Array{Int64,2}:
 1  2
 3  4
 5  6
  • together with splicing, vcat conveniently transforms tuple of values into concise Array:
In [15]:
vcat(kk...)
Out[15]:
3×2 Array{Int64,2}:
 1  2
 3  4
 5  6
In [16]:
vcat([[1 2] for ii=1:4]...)
Out[16]:
4×2 Array{Int64,2}:
 1  2
 1  2
 1  2
 1  2
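A possible alternative to splicing, sketched here with only Base functions, is to fold the collection with reduce. For the small example above both forms should give the same matrix:

```julia
rows = [[1 2] for ii = 1:4]

stackedSplat = vcat(rows...)         # every row becomes its own argument
stackedReduce = reduce(vcat, rows)   # rows are concatenated one after another

println(stackedSplat == stackedReduce)
println(size(stackedReduce))
```

Which form is preferable likely depends on the size of the collection; splicing passes every element as a separate function argument, which may become unwieldy for very long collections.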
  • [ ] applied to spliced elements no longer calls vcat implicitly
  • applied to entries of type Array{Int,2}, it therefore does not return a two-dimensional Array
In [17]:
kk = [[1 2] for ii=1:4]
[kk...]
Out[17]:
4-element Array{Array{Int64,2},1}:
 [1 2]
 [1 2]
 [1 2]
 [1 2]
  • alternatively, two-dimensional result could be achieved without splicing through usage of two index variables
In [18]:
[jj for ii=1:4, jj=1:2]
Out[18]:
4×2 Array{Int64,2}:
 1  2
 1  2
 1  2
 1  2
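A short sketch where both index variables enter the expression; `tbl` is an illustrative name. The first index runs over rows and the second over columns:

```julia
# Entry (ii, jj) is the product of its row index and its column index.
tbl = [ii * jj for ii = 1:3, jj = 1:4]

println(size(tbl))
println(tbl[2, 3])
```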
  • in general, splicing and vcat also work for data structures other than Array
  • for example, applying splicing and vcat to DataFrames returns a DataFrame again
In [19]:
df = DataFrame()
df[:a] = @data([5, 6, NA])
df[:b] = @data([8, NA, NA])

kk = (df, df)
xx = vcat(kk...)
Out[19]:
 Row  a   b
 1    5   8
 2    6   NA
 3    NA  NA
 4    5   8
 5    6   NA
 6    NA  NA
In [20]:
typeof(xx)
Out[20]:
DataFrames.DataFrame

Iterators

under the hood, comprehensions make use of iterators:

  • iterators successively return values from a collection
  • iterators can be defined for any type
  • for example: column iterator of DataFrames, which returns a tuple with column name and values given as DataArray for each column
In [21]:
df = DataFrame()
df[:a] = @data([5, 6, NA])
df[:b] = @data([8, NA, NA])

[col for col in eachcol(df)]
Out[21]:
2-element Array{Tuple{Symbol,DataArrays.DataArray{Int64,1}},1}:
 (:a, [5, 6, NA]) 
 (:b, [8, NA, NA])
  • DataFrame iterator returns tuple, so that values only (without column name) are obtained through indexing
In [22]:
[col[2] for col in eachcol(df)]
Out[22]:
2-element Array{DataArrays.DataArray{Int64,1},1}:
 [5, 6, NA] 
 [8, NA, NA]
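Iterators that yield tuples are common in Base as well. The following sketch uses zip on plain Arrays as a stand-in for the column iterator (it is not a DataFrame, and the names are illustrative) and shows that the tuple can be unpacked directly in the loop header instead of indexing with [2]:

```julia
# Stand-ins for column names and column values (not a DataFrame).
colNames = [:a, :b]
colVals = [[5, 6, 7], [8, 9, 10]]

# zip yields one (name, values) tuple per step.
valsViaIndex = [col[2] for col in zip(colNames, colVals)]

# Destructuring the tuple in the loop header avoids the [2] index.
valsViaDestructuring = [vals for (nam, vals) in zip(colNames, colVals)]

println(valsViaIndex == valsViaDestructuring)
```

The same destructuring should carry over to any iterator that returns tuples, although this is not tested against the DataFrames version used here.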

two applications of iterators come to mind immediately:

  • iteratively manipulating entries of a type
  • building a new object by iteratively using an existing object

Iteratively manipulating entries

example: iteratively manipulating columns of DataFrame

  • as seen above, values need to be referenced within column iterator tuple with subindex 2
  • setting first entry of each column to 10:
In [23]:
for col in eachcol(df)
    col[2][1] = 10
end
df
Out[23]:
 Row  a   b
 1    10  10
 2    6   NA
 3    NA  NA
  • let's try multiplying each column by 10:
In [24]:
try
    for col in eachcol(df)
        col[2] = col[2].*10
    end
catch e
    show(e)
end
MethodError(setindex!, ((:a, [10, 6, NA]), [100, 60, NA], 2), 0x0000000000005549)
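  • this fails because the iterator returns tuples, which are immutable; the correct way is to overwrite the values of the existing column in place: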
In [25]:
for col in eachcol(df)
    col[2][:] = col[2].*10
end
df
Out[25]:
 Row  a    b
 1    100  100
 2    60   NA
 3    NA   NA
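A minimal sketch of the difference between replacing a tuple element and writing into the array stored inside the tuple. A plain tuple is used as a stand-in for what the column iterator returns; the names are illustrative:

```julia
# Stand-in for one (name, values) tuple as yielded by a column iterator.
col = (:a, [1, 2, 3])

# Replacing a tuple element is not possible: tuples are immutable,
# so no setindex! method exists for them.
replaceFailed = try
    col[2] = col[2] .* 10
    false
catch err
    isa(err, MethodError)
end

# Writing into the array stored inside the tuple works, because the
# array itself is mutable.
col[2][:] = col[2] .* 10

println(replaceFailed)
println(col[2])
```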
  • applying similar logic to the entries of an Array{Int, 1} fails: each entry is an immutable Int, which cannot be indexed into and changed in place
In [26]:
kk = [1, 2, 3, 4]
try
    for entry in kk
        entry[1] = entry[1]*5
    end
catch e
    show(e)
end
kk
MethodError(setindex!, (1, 5, 1), 0x000000000000554a)
Out[26]:
4-element Array{Int64,1}:
 1
 2
 3
 4
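One way to change the entries of such an Array in place, sketched with only Base functions, is to iterate over the indices and write back through the Array itself:

```julia
kk = [1, 2, 3, 4]

# Iterate over the indices and assign through kk[ii], so the Array
# itself is modified instead of a copy of the entry.
for ii in eachindex(kk)
    kk[ii] = kk[ii] * 5
end

println(kk)
```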

Creating new objects by iteratively manipulating existing objects

  • the iteration protocol makes step-by-step data manipulation easy
  • combined with comprehension, this allows for easy creation of new objects
  • for example: creating Array of squared entries
In [27]:
kk = [1 2 3 4]
kk2 = [ii^2 for ii in kk]
Out[27]:
1×4 Array{Int64,2}:
 1  4  9  16
  • again, there can be more meaningful ways to combine the individual parts than the Array returned by a comprehension
In [28]:
df = DataFrame()
df[:a] = @data([5, 6, NA])
df[:b] = @data([8, NA, NA])
df
Out[28]:
 Row  a   b
 1    5   8
 2    6   NA
 3    NA  NA
In [29]:
kk = [col[2].*2 for col in eachcol(df)]
Out[29]:
2-element Array{DataArrays.DataArray{Int64,1},1}:
 [10, 12, NA]
 [16, NA, NA]
  • as we iterate over columns, results should be combined horizontally
  • vcat with splicing is not suitable here, as it would stack the columns vertically
  • we need hcat instead:
In [30]:
hcat([col[2].*2 for col in eachcol(df)]...)
Out[30]:
3×2 DataArrays.DataArray{Int64,2}:
 10    16  
 12      NA
   NA    NA
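The difference between the two ways of combining can be sketched with plain vectors as stand-ins for the columns (the names are illustrative):

```julia
cols = [[10, 12, 14], [16, 18, 20]]

sideBySide = hcat(cols...)   # each vector becomes one column of a matrix
stacked = vcat(cols...)      # the vectors are appended to one long vector

println(size(sideBySide))
println(length(stacked))
```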
  • instead of manually combining the manipulated values each time, the function map can return a suitable default data structure for a given iterator

Map

  • through multiple dispatch, the output of map can be customized to the iterator type used
  • for example: multiplication of each DataFrame column could be done in two different ways
  • first way: iterating over entries of an Array (which contains the column names) will return an Array
In [31]:
df = DataFrame(a = [1, 2, 3], b = [4, 5, 6])
map(nam -> df[nam].*2, names(df))
Out[31]:
2-element Array{DataArrays.DataArray{Int64,1},1}:
 [2, 4, 6]  
 [8, 10, 12]
  • second way: using method map for DataFrame column iterator
In [32]:
df2 = map(col -> col.*2, eachcol(df))
df2
Out[32]:
 Row  a  b
 1    2  8
 2    4  10
 3    6  12
  • map can also be applied to two collections
In [33]:
vals1 = [10 20]
vals2 = [40 1]
map(+, vals1, vals2)
Out[33]:
1×2 Array{Int64,2}:
 50  21
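With two collections, map is not limited to named functions such as +. A minimal sketch with an anonymous function of two arguments (values and names are illustrative):

```julia
vals1 = [10, 20, 30]
vals2 = [40, 1, 2]

# The anonymous function receives one entry from each collection per step.
combined = map((x, y) -> x + 2 * y, vals1, vals2)

println(combined)
```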

Reduce

  • using the function reduce, individual components of a collection can be aggregated
  • through multiple dispatch, reduce can have different implementations for each type
  • using map and reduce together, individual entries of iterable collections can be manipulated and aggregated to a single result

example: calculating row means

In [34]:
df
Out[34]:
 Row  a  b
 1    1  4
 2    2  5
 3    3  6
In [35]:
# sum all columns first, then divide once (works for any number of columns)
meanDf = reduce((x, y) -> x .+ y, [col[2] for col in eachcol(df)]) ./ size(df, 2)
Out[35]:
3-element DataArrays.DataArray{Float64,1}:
 2.5
 3.5
 4.5
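Dividing inside the reducing function would rescale the running sum at every step, so with more than two columns it seems safer to sum first and divide once. A sketch with plain vectors as stand-ins for three columns (names are illustrative):

```julia
cols = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Aggregate: elementwise sum over all columns.
colSum = reduce((x, y) -> x .+ y, cols)

# Divide once by the number of columns to get the row means.
rowMeans = colSum ./ length(cols)

println(rowMeans)
```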

example: calculating row sum with weighted columns

  • using map to calculate weighted columns
  • using reduce to sum up individual weighted columns
In [36]:
df = DataFrame(a = [1, 2, 3, 4], b = [4, 5, 6, 7], c = [2, 4, 8, 10])
Out[36]:
 Row  a  b  c
 1    1  4  2
 2    2  5  4
 3    3  6  8
 4    4  7  10
  • getting weighted columns:
In [37]:
wgts = [0.4 0.2 0.4]
kk = map((x, y) -> x.*y[2], wgts, eachcol(df))
Out[37]:
3-element Array{DataArrays.DataArray{Float64,1},1}:
 [0.4, 0.8, 1.2, 1.6]
 [0.8, 1.0, 1.2, 1.4]
 [0.8, 1.6, 3.2, 4.0]
In [38]:
wgts[1] * [1, 2, 3, 4]
Out[38]:
4-element Array{Float64,1}:
 0.4
 0.8
 1.2
 1.6
  • aggregation with reduce
In [39]:
reduce((x, y) -> (x .+ y), map((x, y) -> x.*y[2], wgts, eachcol(df)))
Out[39]:
4-element DataArrays.DataArray{Float64,1}:
 2.0
 3.4
 5.6
 7.0
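As a plausibility check, the same weighted row sums can be sketched with plain vectors and compared against a matrix-vector product. The vectors stand in for the DataFrame columns, and the weights are stored as a vector instead of a 1×3 Array:

```julia
cols = [[1, 2, 3, 4], [4, 5, 6, 7], [2, 4, 8, 10]]
wgts = [0.4, 0.2, 0.4]

# map: scale each column by its weight.
weighted = map((w, c) -> w .* c, wgts, cols)

# reduce: add up the weighted columns elementwise.
rowSums = reduce((x, y) -> x .+ y, weighted)

# Cross-check: the same numbers as a matrix-vector product.
viaMatrix = hcat(cols...) * wgts

println(maximum(abs.(rowSums .- viaMatrix)) < 1e-12)
```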

Session info

In [40]:
versioninfo()
Julia Version 0.6.0
Commit 9036443 (2017-06-19 13:05 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, skylake)
In [41]:
Pkg.status()
172 required packages:
 - AbstractFFTs                  0.2.0
 - Atom                          0.6.1
 - AutoGrad                      0.0.7
 - AutoHashEquals                0.1.1
 - AxisAlgorithms                0.1.6
 - AxisArrays                    0.1.4
 - BenchmarkTools                0.0.8
 - Blink                         0.5.3
 - Blosc                         0.3.0
 - BufferedStreams               0.3.3
 - BusinessDays                  0.7.1
 - CSV                           0.1.4
 - Calculus                      0.2.2
 - CatIndices                    0.0.2
 - CategoricalArrays             0.1.6
 - Clustering                    0.8.0
 - CodeTools                     0.4.6
 - Codecs                        0.3.0
 - ColorTypes                    0.5.2
 - ColorVectorSpace              0.4.4
 - Colors                        0.7.4
 - Combinatorics                 0.4.1
 - Compat                        0.28.0
 - Compose                       0.5.3
 - ComputationalResources        0.0.2
 - Conda                         0.5.3
 - Contour                       0.3.0
 - Convex                        0.5.0
 - CoordinateTransformations     0.4.1
 - CoupledFields                 0.0.1
 - CustomUnitRanges              0.0.4
 - DBAPI                         0.1.0
 - DSP                           0.3.2
 - Dagger                        0.2.0
 - DataArrays                    0.6.2
 - DataFrames                    0.10.0
 - DataStreams                   0.1.3
 - DataStructures                0.6.0
 - DecFP                         0.3.0
 - DecisionTree                  0.6.1
 - DiffBase                      0.2.0
 - Distances                     0.4.1
 - DistributedArrays             0.4.0
 - Distributions                 0.14.2
 - DualNumbers                   0.3.0
 - FFTViews                      0.0.2
 - FFTW                          0.0.3
 - FileIO                        0.5.1
 - FixedPointNumbers             0.3.9
 - Formatting                    0.2.1
 - ForwardDiff                   0.4.2
 - GLM                           0.7.0
 - GR                            0.23.0
 - GZip                          0.3.0
 - Gadfly                        0.6.3
 - Glob                          1.1.1
 - Graphics                      0.2.0
 - HDF5                          0.8.2
 - HTTPClient                    0.2.1
 - Hexagons                      0.1.0
 - Hiccup                        0.1.1
 - HttpCommon                    0.2.7
 - HttpParser                    0.3.0
 - HttpServer                    0.2.0
 - HypothesisTests               0.5.1
 - IJulia                        1.5.1
 - IdentityRanges                0.0.1
 - ImageAxes                     0.3.1
 - ImageCore                     0.4.0
 - ImageFiltering                0.1.4
 - ImageMetadata                 0.2.3
 - ImageTransformations          0.3.1
 - Images                        0.11.0
 - IndexedTables                 0.2.1
 - IndirectArrays                0.1.1
 - Interact                      0.4.5
 - Interpolations                0.6.2
 - IntervalSets                  0.1.1
 - IterTools                     0.1.0
 - Iterators                     0.3.1
 - JDBC                          0.2.0
 - JLD                           0.6.11
 - JSON                          0.13.0
 - JavaCall                      0.5.1
 - JuMP                          0.17.1
 - JuliaWebAPI                   0.3.1
 - Juno                          0.3.0
 - KernelDensity                 0.3.2
 - Knet                          0.8.3
 - LNR                           0.0.2
 - LaTeXStrings                  0.2.1
 - Lazy                          0.11.7
 - LegacyStrings                 0.2.2
 - Libz                          0.2.4
 - LightGraphs                   0.9.4
 - LightXML                      0.5.0
 - LineSearches                  0.1.5
 - Loess                         0.3.0
 - Logging                       0.3.1
 - MLBase                        0.7.0
 - MNIST                         0.0.2
 - MacroTools                    0.3.7
 - MappedArrays                  0.0.7
 - MathProgBase                  0.6.4
 - MbedTLS                       0.4.5
 - Measures                      0.1.0
 - Media                         0.3.0
 - Mustache                      0.1.4
 - Mux                           0.2.3
 - NaNMath                       0.2.6
 - NamedArrays                   0.6.1
 - NamedTuples                   4.0.0
 - NearestNeighbors              0.3.0
 - Nettle                        0.3.0
 - NullableArrays                0.1.1
 - ODBC                          0.5.2
 - OffsetArrays                  0.3.0
 - Optim                         0.7.8
 - PDMats                        0.7.0
 - PaddedViews                   0.1.0
 - Parameters                    0.7.2
 - ParserCombinator              1.7.11
 - PlotRecipes                   0.2.0
 - PlotlyJS                      0.6.4
 - Plots                         0.12.3+            master
 - Polynomials                   0.1.5
 - PooledArrays                  0.1.1
 - PositiveFactorizations        0.0.4
 - Primes                        0.1.3
 - ProtoBuf                      0.4.0
 - PyCall                        1.14.0
 - PyPlot                        2.3.2
 - QuadGK                        0.1.2
 - QuantEcon                     0.12.1
 - Query                         0.6.0
 - RCall                         0.7.3
 - RDatasets                     0.2.0
 - RangeArrays                   0.2.0
 - Ratios                        0.1.0
 - Reactive                      0.5.2
 - Reexport                      0.0.3
 - Requests                      0.5.0
 - Requires                      0.4.3
 - ReverseDiffSparse             0.7.3
 - Rmath                         0.1.7
 - Roots                         0.4.0
 - Rotations                     0.5.0
 - Rsvg                          0.1.0
 - SCS                           0.3.3
 - SHA                           0.3.3
 - SIUnits                       0.1.0
 - ScikitLearnBase               0.3.0
 - ShowItLikeYouBuildIt          0.0.1
 - Showoff                       0.1.1
 - SimpleTraits                  0.5.0
 - SortingAlgorithms             0.1.1
 - SpecialFunctions              0.2.0
 - StatPlots                     0.4.2
 - StaticArrays                  0.6.1
 - StatsBase                     0.17.0
 - StatsFuns                     0.5.0
 - TexExtensions                 0.1.0
 - TextParse                     0.1.6
 - TiledIteration                0.0.2
 - TimeSeries                    0.10.0
 - Tokenize                      0.1.8
 - URIParser                     0.1.8
 - UnicodePlots                  0.2.5
 - WeakRefStrings                0.2.0
 - WebSockets                    0.2.3
 - WoodburyMatrices              0.2.2
 - ZMQ                           0.4.3
17 additional packages:
 - BaseTestNext                  0.2.2
 - BinDeps                       0.6.0
 - Cairo                         0.3.1
 - DataValues                    0.2.0
 - DocStringExtensions           0.4.0
 - Documenter                    0.11.2
 - DynAssMgmt                    0.0.0-             master (unregistered)
 - EconDatasets                  0.0.2+             master
 - GeometryTypes                 0.4.2
 - Gtk                           0.13.0
 - IterableTables                0.4.2
 - LibCURL                       0.2.2
 - NetworkLayout                 0.1.1
 - PlotThemes                    0.1.4
 - PlotUtils                     0.4.3
 - RData                         0.1.0
 - RecipesBase                   0.2.2
In [42]:
scriptEndIsReached = true
Out[42]:
true