I want to add a date field to an existing, large Elasticsearch 5 index, holding the date on which each document was indexed. Afterwards I want to query this index to return all documents created in the last minute.
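To make the goal concrete, here is a minimal sketch of the two request bodies I have in mind, written as Python dictionaries so they can be printed as JSON. The field name `indexed_at` and the helper `recent_documents_query` are placeholders of mine, not anything that exists in the index yet. The sketch only covers the mapping addition and the range query with date math; how the field gets populated for the documents that are already indexed is the open part of the question.

```python
import json

# Hypothetical field name; nothing in the existing index is called this yet.
FIELD = "indexed_at"

# Body for adding a date field to the existing mapping.
mapping_body = {"properties": {FIELD: {"type": "date"}}}


def recent_documents_query(window="1m"):
    """Build a range query matching documents whose date field lies within
    the last `window` (Elasticsearch date math, e.g. "1m" for one minute)."""
    return {"query": {"range": {FIELD: {"gte": "now-" + window}}}}


if __name__ == "__main__":
    print(json.dumps(mapping_body, indent=2))
    print(json.dumps(recent_documents_query(), indent=2))
```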
I am running MongoDB (v3.4 Community Edition) on an AWS EC2 m4.large instance. I installed it according to this MongoDB tutorial. I haven’t modified any MongoDB config. I haven’t configured any replica set or shard. I have a Jersey API which interacts with MongoDB using org.mongodb.morphia (v1.3.2) Java Driver.
I have created a load test using SoapUI where I make 100 calls to the API, which in turn create 100 write operations (one document is created in one collection per write operation) for MongoDB. I run the test for 24 hours. At the end, I see my webserver trying to communicate with the Mongo server, but the connection is timing out and the host CPU is running at 100%. I conclude that the MongoDB server is choking.
I then tried this benchmarking exercise, still with a database hosted on an m4.large. 50,000,000 records were created in 923.133 seconds i.e. 54,113 insertions per second. That is over 500 times faster!
If MongoDB is performing so well, then why is it choking at 100 insertions per second when going through the Java driver? Is the Java driver slow? Is my use of the Java driver wrong? Is my EC2 instance size too small? Would it help to add replication (RAID and replica sets)?
I am new to MongoDB hosting and really appreciate your help learning.
I have installed MongoDB (v3.4) on an AWS EC2 instance (m4.large) using the following link. I haven’t modified any MongoDB config. I haven’t configured any replica set or shard. I have a Jersey API which interacts with MongoDB using the Java driver. I created a load test using SoapUI where I made 100 calls to the API, which in turn created 100 write operations (one document is created in one collection per write operation) for MongoDB. I ran the test for 24 hours. At the end I saw that my webserver was trying to communicate with the Mongo server, but the connection was timing out and the host CPU was running at 100%. That is why I concluded that the MongoDB server is choking.
Is my EC2 instance size too small?
Do I have to set up replication (RAID and replica sets)?
I am going to try the benchmarking exercise mentioned here. I will update the post afterwards. I am new to MongoDB hosting and really appreciate your help learning.
Update 1: I did the MongoDB benchmarking test above. The database hosted on an m4.large instance gave the following data: 50,000,000 records were created in 923.133 seconds, i.e. 54,113 insertions per second. If MongoDB is performing so well, then why is it choking at 100 insertions per second when done through the Java driver? Is the Java driver slow? Is my implementation of the Java driver wrong? I am using the org.mongodb.morphia (v1.3.2) Java driver.
My question: why does it use a temporary table and filesort, yet report only 1 row examined? It seems that, because a temporary table is used, it should process more than one row. How can I determine the real number of rows processed? How can I resolve this discrepancy in the number of processed rows?
Note that the task I have been assigned is to eliminate heavy queries (those involving too many rows), and I do not know how to do this.
I am interested in the pair-matching task. However, as I dug deeper, I found myself confused.
Here is a brief summary of how pair-matching performance is evaluated on the LFW dataset:
The LFW dataset is divided into View1 and View2. View1 is for algorithm development: you can use it to select a model, tune parameters and choose features. View2 is for reporting the accuracy of the model developed on View1.
For development purposes, we recommend using the below training/testing split, which was generated randomly and independently of the splits for 10-fold cross validation, to avoid unfairly overfitting to the sets above during development. For instance, these sets may be viewed as a model selection set and a validation set. See the tech report below for more details.
As a benchmark for comparison, we suggest reporting performance as 10-fold cross validation using splits we have randomly generated.
I also found an example of carrying out the experiment with PCA for face pair-matching in the LFW 2008 paper.
Eigenfaces for pair matching. We computed eigenvectors from the training set of View 1 and determined the threshold value for classifying pairs as matched or mismatched that gave the best performance on the test set of View 1. For each run of View 2, the training set was used to compute the eigenvectors, and pairs were classified using the threshold on Euclidian distance from View 1.
State of the art pair matching. To determine the current best performance on pair matching, we ran an implementation of the current state of the art recognition system of Nowak and Jurie [11]. The Nowak algorithm gives a similarity score to each pair, and View 1 was used to determine the threshold value for classifying pairs as matched or mismatched. For each of the 10 folds of View 2 of the database, we trained on 9 of the sets and computed similarity measures for the held out test set, and classified pairs using the threshold
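As far as I can tell, the protocol in the two quoted passages amounts to something like the sketch below. This is only my reading of it, not the authors' code. The distances are synthetic stand-ins generated by a helper I made up (`toy_pairs`), the set sizes are arbitrary, and the retraining step inside the loop is only a comment because the toy data has nothing to train. Reporting the mean and standard error over the ten folds is my assumption about how the final number is summarized.

```python
import math
import random
import statistics


def accuracy(distances, labels, threshold):
    """Fraction of pairs classified correctly; a pair counts as 'matched'
    when its distance is at or below the threshold."""
    hits = sum((d <= threshold) == is_match for d, is_match in zip(distances, labels))
    return hits / len(labels)


def best_threshold(distances, labels):
    """Return the candidate threshold with the highest accuracy on these pairs."""
    return max(sorted(set(distances)), key=lambda t: accuracy(distances, labels, t))


def toy_pairs(rng, n):
    """Synthetic stand-in for real pair distances: matched pairs tend to be closer."""
    labels = [i % 2 == 0 for i in range(n)]
    distances = [rng.gauss(1.0 if is_match else 2.0, 0.4) for is_match in labels]
    return distances, labels


rng = random.Random(0)

# View 1: choose the threshold once, on the development data.
view1_distances, view1_labels = toy_pairs(rng, 1000)
threshold = best_threshold(view1_distances, view1_labels)

# View 2: ten predefined folds, each held out exactly once.
folds = [toy_pairs(rng, 600) for _ in range(10)]
fold_accuracies = []
for held_out in range(10):
    # A real system would retrain here on the other nine folds
    # (for example, recompute the eigenvectors) before scoring the held-out fold.
    distances, labels = folds[held_out]
    fold_accuracies.append(accuracy(distances, labels, threshold))

mean_accuracy = statistics.mean(fold_accuracies)
standard_error = statistics.stdev(fold_accuracies) / math.sqrt(len(fold_accuracies))
print(f"mean accuracy {mean_accuracy:.3f} +/- {standard_error:.3f}")
```

In this reading, the threshold (a hyperparameter) is fixed on View 1, while the model itself is refit on nine folds of View 2 in each round.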
My questions are:
How to do training with View1 data using 10-fold cross validation?
The data is already split into pairsDevTrain.txt and pairsDevTest.txt. Does this mean that I need to merge these two files and then do a standard 10-fold cross validation to train my model?
Why is 10-fold cross validation required in View2?
Since the model and parameters are all determined using the View1 data, why not just use all of the View2 data to report performance?
Since 10-fold cross validation is required in View2, there must be a training process. Why retrain another model?
It is worth mentioning here that, in both View1 and View2, the train and test data don’t share identities, i.e. a person who appears in the training set will not appear in the test set.
10-fold cross validation is recommended for both View1 and View2. 10-fold splits are given for View2 but not View1. Is there a reason why?
Thank you beforehand for helping me understand the performance evaluation for LFW.
When running a benchmark with AS SSD on an NVMe drive with the second checkbox seen in the image below unchecked, the drive gets terrible write performance in AS SSD (see first benchmark screenshot below), but not in CrystalDiskMark (see last screenshot). However, if I check that box, AS SSD performs well. Does anyone know what’s going on here? My concern is that, according to that checkbox’s description, I should NOT have it checked since my drive doesn’t have its own power supply, but AS SSD is so slow that I’m concerned other programs may be affected.
I’m creating a site using the JointsWP Foundation 6 theme and have created a new fixed side menu which includes the logo and social links. My problem is that every time a user clicks on the menu, it reloads, causing a shift. Is there a way of stopping this? Is it a page load issue, or have I gone about it the wrong way? I tried adding a caching plugin but it doesn’t seem to have helped. Any suggestions appreciated.
Here are examples of my code:
<body <?php body_class(); ?>>
/wp-content/themes/JointsWP-CSS-master/assets/images/logo.png" alt="big green space"/>
t: +44 000 000000
and the page.php
<?php get_header(); ?>
<main id="main" class="large-9 medium-9 columns contentSection" role="main">
<?php if (have_posts()) : while (have_posts()) : the_post(); ?>
<?php get_template_part( 'parts/loop', 'page' ); ?>
<?php endwhile; endif; ?>
<!--</main> <!-- end #main -->
<!--</div> <!-- end #inner-content -->
<!--</div> <!-- end #content -->
I have added 2 test pages so that you can see: biggreenspace.com/test-page-1, from which you will be able to navigate to test page 2 (the other menu will take you to the maintenance screen). This primarily happens in Chrome and Firefox, not in IE Edge.
Running the regex substitution in Julia and wrapping Julia code in Python
The tokenize() function usually takes a single input, but if the same function is called 1,000,000,000 times it’s rather slow, and the GIL is going to lock up the core and process one sentence at a time.
The aim of the question is to ask for ways to speed up Python code that’s made up of regex substitutions, especially when running the tokenize() function 1,000,000,000+ times.
If Cython/Julia or any faster language + wrapper is suggested, it would be good if you could give one example of how the regex is written in Cython/Julia/other, and a suggestion of what the wrapper would look like.
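For reference, here is a minimal pure-Python baseline of the kind of setup I mean, so that suggestions have something to compare against. The two substitution rules are hypothetical stand-ins, not the real tokenize() rules, and `tokenize_parallel` is a name I made up. It only shows two standard ideas: compiling the patterns once at import time, and using `multiprocessing.Pool` so that separate processes (each with its own GIL) handle batches of sentences.

```python
import re
from multiprocessing import Pool

# Hypothetical substitution rules, compiled once instead of on every call.
_RULES = [
    (re.compile(r"([.,!?;:])"), r" \1 "),  # pad punctuation with spaces
    (re.compile(r"\s+"), " "),             # collapse runs of whitespace
]


def tokenize(text):
    """Apply each precompiled substitution in order and trim the result."""
    for pattern, replacement in _RULES:
        text = pattern.sub(replacement, text)
    return text.strip()


def tokenize_parallel(sentences, processes=4, chunksize=10_000):
    """Tokenize many sentences across worker processes.

    A large chunksize keeps inter-process overhead small relative to the
    regex work done per sentence.
    """
    with Pool(processes) as pool:
        return pool.map(tokenize, sentences, chunksize=chunksize)


if __name__ == "__main__":
    print(tokenize_parallel(["Hello, world!", "How  are you?"], processes=2, chunksize=1))
```

This does not answer the Cython/Julia part; it is only the baseline I would expect any faster approach to beat.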
Their read performance on Linux and Windows looks like this:
Crucial MX300 –> same on both OSs
sudo hdparm -tT /dev/sda # Crucial
Timing cached reads: 13700 MB in 2.00 seconds = 6854.30 MB/sec
Timing buffered disk reads: 1440 MB in 3.00 seconds = 479.58 MB/sec
SanDisk Plus –> way faster on Windows!
sudo hdparm -tT /dev/sdb # SanDisk
Timing cached reads: 7668 MB in 2.00 seconds = 3834.92 MB/sec
Timing buffered disk reads: 798 MB in 3.00 seconds = 265.78 MB/sec # TOO LOW !!
The sequential read performance of the SanDisk on Linux is about half of its performance on Windows!
My question is of course: why, and can that be fixed? Is this due to the SanDisk SSD Plus being handled as a SCSI drive?
~$ grep SDSSD /var/log/syslog
systemd: Found device SanDisk_SDSSDA240G
kernel: [ 2.152138] ata2.00: ATA-9: SanDisk SDSSDA240G, Z32070RL, max UDMA/133
kernel: [ 2.174689] scsi 1:0:0:0: Direct-Access ATA SanDisk SDSSDA24 70RL PQ: 0 ANSI: 5
smartd: Device: /dev/sdb [SAT], SanDisk SDSSDA240G, S/N:162783441004, WWN:5-001b44-4a404e4f0, FW:Z32070RL, 240 GB
smartd: Device: /dev/sdb [SAT], state read from /var/lib/smartmontools/smartd.SanDisk_SDSSDA240G-162783441004.ata.state
smartd: Device: /dev/sdb [SAT], state written to /var/lib/smartmontools/smartd.SanDisk_SDSSDA240G-162783441004.ata.state
Compare this to the Crucial MX300, which has almost the same performance on Linux as on Windows:
~$ grep MX300 /var/log/syslog
systemd: Found device Crucial_CT750MX300SSD1
kernel: [ 1.775520] ata1.00: ATA-10: Crucial_CT750MX300SSD1, M0CR050, max UDMA/133
smartd: Device: /dev/sda [SAT], Crucial_CT750MX300SSD1, S/N:16251486AC40, WWN:5-00a075-11486ac40, FW:M0CR050, 750 GB
smartd: Device: /dev/sda [SAT], state read from /var/lib/smartmontools/smartd.Crucial_CT750MX300SSD1-16251486AC40.ata.state
smartd: Device: /dev/sda [SAT], state written to /var/lib/smartmontools/smartd.Crucial_CT750MX300SSD1-16251486AC40.ata.state
Any help is very welcome!
The difference that hdparm is showing on Linux is very real. I created two identical directories, one on each of the two drives, each containing about 25 GB of files (36395 files), and ran the exact same hashdeep checksum creation script on both directories (the script just creates an MD5 checksum for every file in the test directories and stores all the checksums in one single file). These are the results:
test-sandisk# time create-file-integrity-md5sums.sh .
test-mx300# time create-file-integrity-md5sums.sh .
Same test with a single 7 GB file:
test-sandisk# time create-file-integrity-md5sums.sh .
test-mx300# time create-file-integrity-md5sums.sh .
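For anyone who wants to reproduce the read test without hashdeep, the workload is roughly equivalent to the Python sketch below. This is not the script I actually ran (that one is a shell script around hashdeep). The function names and the output format are made up for illustration. It reads every file under a directory once, sequentially, and writes one MD5 line per file into a single output file.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksums(root, output_path):
    """Hash every file under `root` and store all checksums in one file.

    Returns the number of files hashed. The output file itself is skipped
    in case it lives inside `root`.
    """
    output_abs = os.path.abspath(output_path)
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                if os.path.abspath(path) == output_abs:
                    continue
                out.write(f"{md5_of_file(path)}  {os.path.relpath(path, root)}\n")
                count += 1
    return count
```

Timing a call such as `write_checksums(".", "/tmp/checksums.md5")` on each drive (with a cold page cache) should exercise the same sequential-read path as the tests above.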