ASTROIDE V1.0
ASTROIDE: Astronomical In-memory Distributed Engine
ASTROIDE is a distributed data server for astronomical data. It enables efficient astronomical query execution on Spark through HEALPix-based data partitioning and a customized query optimizer. ASTROIDE offers a simple, expressive and unified interface through ADQL, a standard language for querying databases in astronomy.
We are in the process of making our source code open.
Requirements
- This version requires Spark 2.2.x and Hadoop 2.7+ installed.
- Initialize an environment variable called ASTROIDE_HOME to point to the directory where you cloned this repository:
$ git clone https://github.com/MBrahem/ASTROIDE
$ export ASTROIDE_HOME=<path-to-repository>
Example:
$ export ASTROIDE_HOME=/home/hadoop/ASTROIDE-master/
- Add the jars in conf/spark-defaults.conf by adding these lines (replace <ASTROIDE_HOME> with the absolute path of your clone, since Spark configuration files do not expand environment variables):
spark.driver.extraClassPath <ASTROIDE_HOME>/libs/healpix-1.0.jar:<ASTROIDE_HOME>/libs/adql1.3.jar
spark.executor.extraClassPath <ASTROIDE_HOME>/libs/healpix-1.0.jar:<ASTROIDE_HOME>/libs/adql1.3.jar
These libraries are already provided in the libs directory.
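Alternatively, if you prefer not to edit conf/spark-defaults.conf, the same jars can be supplied per submission with the standard --jars option of spark-submit:
$ spark-submit --jars $ASTROIDE_HOME/libs/healpix-1.0.jar,$ASTROIDE_HOME/libs/adql1.3.jar ...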
Input Data
ASTROIDE ONLY supports reading input files in CSV or compressed CSV format.
You can download an example of astronomical data here, or the full GAIA DR1 here, and put the downloaded files in HDFS.
Example:
$ cd $ASTROIDE_HOME/ExampleData
$ hdfs dfs -mkdir data
$ hdfs dfs -put * data/
ASTROIDE supports data whose coordinates are expressed in the International Celestial Reference System (ICRS).
Other coordinate systems will be supported in future versions.
Partitioning
Partitioning is a fundamental component for parallel data processing. It reduces resource consumption when only a subset of the data is relevant to a query, and distributes tasks evenly when a query spans a large number of partitions. This globally improves query performance.
Partitioning is a mandatory process for executing astronomical queries efficiently in ASTROIDE.
Using ASTROIDE, partitioning is executed as follows:
$ spark-submit --class fr.uvsq.adam.astroide.executor.BuildHealpixPartitioner [--master <master-url>] <jarFile> -fs <hdfs> [<schema>] <infile> <separator> <outfile> <partitionSize> <healpixOrder> <name_coordinates1> <name_coordinates2> <boundariesFile>
The arguments are:
- hdfs: HDFS path, e.g. hdfs://{NameNodeHost}:8020
- schema: text file containing the schema of the input file (separator: ","). If your data already contains a schema, you can skip this option
- infile: path to the input file on HDFS where the astronomical data are stored
- separator: separator used in the input file (e.g. "," or "|")
- outfile: path to the output parquet file on HDFS where the astronomical data will be stored after partitioning
- partitionSize: approximate partition size in MB (recommended: 256 MB)
- healpixOrder: the HEALPix order to use (recommended value: 12); see the sketch after this list for what it controls
- name_coordinates1: attribute name of the first spherical coordinate (usually ra)
- name_coordinates2: attribute name of the second spherical coordinate (usually dec)
- boundariesFile: text file where ASTROIDE will save the metadata (ranges of partitions). Please specify a path to a local file on your master node
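To make the healpixOrder parameter concrete, here is a minimal Scala sketch (not ASTROIDE's actual code, and assuming the bundled healpix jar exposes the standard healpix.essentials Java API) that maps an (ra, dec) pair in degrees to its nested HEALPix cell, which is conceptually what the partitioner computes for every input row:

import healpix.essentials.{HealpixBase, Pointing, Scheme}

object HealpixCellSketch {
  // Map ICRS coordinates in degrees to a nested HEALPix cell id at the given order.
  def cell(raDeg: Double, decDeg: Double, order: Int): Long = {
    val nside = 1L << order                    // NSIDE = 2^order; order 12 gives NSIDE = 4096
    val hp = new HealpixBase(nside, Scheme.NESTED)
    val theta = math.toRadians(90.0 - decDeg)  // HEALPix expects colatitude, not declination
    val phi = math.toRadians(raDeg)            // longitude in radians
    hp.ang2pix(new Pointing(theta, phi))
  }

  def main(args: Array[String]): Unit =
    println(cell(44.97845893652677, 0.09258081167082206, 12))
}

Higher orders give finer cells (12 * 4^order cells over the whole sphere), so the order bounds how finely the partitioner can split the sky.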
Example:
$ spark-submit --class fr.uvsq.adam.astroide.executor.BuildHealpixPartitioner $ASTROIDE_HOME/ProjectJar/astroide.jar -fs hdfs://localhost:8020 data/gaia.csv "," partitioned/gaia.parquet 32 12 ra dec $ASTROIDE_HOME/gaia.txt
Or, using a schema file:
$ spark-submit --class fr.uvsq.adam.astroide.executor.BuildHealpixPartitioner $ASTROIDE_HOME/ProjectJar/astroide.jar -fs hdfs://localhost:8020 $ASTROIDE_HOME/ExampleData/tycho2Schema.txt data/tycho2.gz "|" partitioned/tycho2.parquet 32 12 ra dec $ASTROIDE_HOME/tycho2.txt
ASTROIDE retrieves the partition boundaries and stores them as metadata. Note that all we need to store per partition is a triple (n, l, u), where n is the partition number, l is the first HEALPix cell of partition n, and u is the last HEALPix cell of partition n. The partitioned files are stored on HDFS and, together with the metadata, used for future queries.
Below is an example of a small boundaries file with 4 partitions:
$ cat $ASTROIDE_HOME/gaia.txt
[3,268313,340507]
[0,0,86480]
[1,86481,178805]
[2,178807,268308]
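To see how these ranges enable partition pruning, here is a tiny Scala sketch (illustrative only, not ASTROIDE code) that resolves a HEALPix cell to the partition covering it, using the boundaries above:

// Each entry (n, l, u): partition n covers HEALPix cells l..u (values from gaia.txt above).
val boundaries = Seq((0, 0L, 86480L), (1, 86481L, 178805L),
                     (2, 178807L, 268308L), (3, 268313L, 340507L))

// A query touching cell c only needs to read the partition whose range contains c.
def partitionOf(cell: Long): Option[Int] =
  boundaries.collectFirst { case (n, l, u) if l <= cell && cell <= u => n }

println(partitionOf(100000L))  // Some(1): only partition nump=1 must be scanned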
The partitioned file is stored as a parquet file, with one directory per partition number nump:
$ hdfs dfs -ls partitioned/gaia.parquet/
Found 5 items
-rw-r--r-- 1 user supergroup 0 2018-06-22 11:32 partitioned/gaia.parquet/_SUCCESS
drwxr-xr-x - user supergroup 0 2018-06-22 11:32 partitioned/gaia.parquet/nump=0
drwxr-xr-x - user supergroup 0 2018-06-22 11:32 partitioned/gaia.parquet/nump=1
drwxr-xr-x - user supergroup 0 2018-06-22 11:32 partitioned/gaia.parquet/nump=2
drwxr-xr-x - user supergroup 0 2018-06-22 11:32 partitioned/gaia.parquet/nump=3
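Since the output is ordinary Parquet partitioned by the nump column, it can also be inspected directly from spark-shell (a sketch independent of ASTROIDE; the column names are those of the example data):

// Read the partitioned catalogue like any Parquet dataset
val gaia = spark.read.parquet("partitioned/gaia.parquet")
gaia.printSchema()
// Filtering on nump reads only the matching partition directory
gaia.filter($"nump" === 1).select("source_id", "ra", "dec").show(5)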
Run astronomical queries
ASTROIDE focuses on three basic astronomical queries; these queries require more computation than ordinary selections or joins in traditional relational databases:
- Cone Search: one of the most frequent queries in the astronomical domain. It returns the set of stars whose positions lie within a circular region of the sky.
- Cross-Matching: identifies and compares astronomical objects belonging to different observations of the same sky region.
- kNN: returns the k closest sources to a query point q.
You can run astronomical queries using the object fr.uvsq.adam.astroide.executor.AstroideQueries, which executes ADQL queries using our framework.
Note: please make sure that your data has already been partitioned before executing astronomical queries.
After data partitioning, you can start executing astronomical queries using ADQL. ASTROIDE supports the ADQL standard and includes the three kinds of astronomical operators listed above. All these operators can be passed directly to ASTROIDE through a queryFile:
$ spark-submit --class fr.uvsq.adam.astroide.executor.AstroideQueries [--master <master-url>] <jarFile> -fs <hdfs> <file1> <file2> <healpixOrder> <queryFile> <action> [<outfile>]
For kNN & ConeSearch queries:
- file1: output file created after partitioning (parquet file)
- file2: boundaries file created after partitioning (text file)
- healpixOrder: the HEALPix order used during partitioning (12 in the example below)
- queryFile: ADQL query (text file), see examples below
- action: action to execute on the query result (show, count, save)
- outfile: output file where the result can be saved on HDFS
If no action is defined, ASTROIDE will only show the execution plan.
For CrossMatch queries:
- file1: first dataset (parquet file)
- file2: second dataset (parquet file)
- healpixOrder: the HEALPix order used during partitioning
- queryFile: ADQL query (text file), see examples below
- action: action to execute on the query result (show, count, save)
- outfile: output file where the result can be saved on HDFS
If no action is defined, ASTROIDE will only show the execution plan.
Example:
$ spark-submit --class fr.uvsq.adam.astroide.executor.AstroideQueries $ASTROIDE_HOME/ProjectJar/astroide.jar -fs hdfs://localhost:8020 partitioned/gaia.parquet $ASTROIDE_HOME/gaia.txt 12 $ASTROIDE_HOME/ExampleQuery/conesearch.txt show
...
2018-06-22 11:43:59 INFO DAGScheduler:54 - Job 2 finished: show at AstroideQueries.scala:87, took 0,508019 s
+-------------+------------------+-------------------+
| source_id| ra| dec|
+-------------+------------------+-------------------+
|5050881701504| 44.95293718578692|0.13388443523264248|
|1099511693312|44.966545443077436|0.04631022905873263|
|1614907863552|44.951154531610996|0.10530901086672136|
...
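A cross-match run looks similar. For example (a hypothetical invocation, assuming a cross-match query saved in crossmatch.txt and a partitioned tycho2 catalogue produced as above):
$ spark-submit --class fr.uvsq.adam.astroide.executor.AstroideQueries $ASTROIDE_HOME/ProjectJar/astroide.jar -fs hdfs://localhost:8020 partitioned/gaia.parquet partitioned/tycho2.parquet 12 $ASTROIDE_HOME/ExampleQuery/crossmatch.txt count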
Query Examples
Here are some valid ADQL queries that you can use to test ASTROIDE.
Save a single query in a text file and run your application using fr.uvsq.adam.astroide.executor.AstroideQueries.
ConeSearch Queries
SELECT * FROM table WHERE 1=CONTAINS(point('icrs',ra,dec),circle('icrs', 44.97845893652677, 0.09258081167082206, 0.05));
SELECT ra,dec,ipix FROM table WHERE 1=CONTAINS(point('icrs',ra,dec),circle('icrs', 44.97845893652677, 0.09258081167082206, 0.05)) ORDER BY ipix;
SELECT source_id,ra FROM table WHERE (1=CONTAINS(point('icrs',ra,dec),circle('icrs', 44.97845893652677, 0.09258081167082206, 0.05)) AND ra > 0);
SELECT ra,dec,ipix FROM (SELECT * FROM table WHERE 1=CONTAINS(point('icrs',ra,dec),circle('icrs', 44.97845893652677, 0.09258081167082206, 0.05))) As t;
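The examples above cover cone search only. Cross-matching and kNN queries are typically expressed in ADQL along the following lines (illustrative queries following the ADQL standard; the table names gaia and tycho2, and the exact syntax ASTROIDE accepts, are assumptions to adapt to your data):
SELECT g.source_id, t.ra, t.dec FROM gaia AS g JOIN tycho2 AS t ON 1=CONTAINS(point('icrs', g.ra, g.dec), circle('icrs', t.ra, t.dec, 2.0/3600.0));
SELECT TOP 10 source_id, ra, dec FROM gaia ORDER BY DISTANCE(point('icrs', ra, dec), point('icrs', 44.97845893652677, 0.09258081167082206));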